Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

arXiv cs.RO / 5/4/2026


Key Points

  • Tempus is a proposed GEMM streaming framework designed for AMD Versal AI Edge SoCs to improve LLM inference efficiency under strict edge constraints on compute, memory, and power.
  • The framework avoids spatial scaling across hundreds of cores (which can fail on edge hardware) by using a fixed 16 AIE-ML core compute block with iterative graph execution, algorithmic data tiling, and replication in programmable logic.
  • Tempus uses high-speed cascade streaming and a deadlock-free DATAFLOW protocol to reduce partial sums with a reported initiation interval (II) of 1 and to maximize overlap between data transfer and computation.
  • In evaluated GEMM workloads, Tempus reports 607 GOPS at 10.677 W on-chip power, and a Platform-Aware Utility (PAU) analysis indicates a 211.2× higher prominence factor than the spatial SOTA (ARIES).
  • Tempus further claims strong efficiency properties, including 0.00% URAM/DSP utilization, with reported gains in core frugality (22.0×), power frugality (7.1×), and I/O demand reduction (6.3×).
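The temporal-scaling idea in the points above can be sketched in plain NumPy: a fixed-size compute block (standing in for the 16 AIE-ML cores) is reused across output tiles, and partial sums along the reduction dimension are accumulated in sequence, mimicking cascade-style reduction. Tile sizes, function names, and the loop structure here are illustrative assumptions, not details from the paper.

```python
import numpy as np

TILE = 16  # illustrative tile edge; real AIE-ML tiling depends on datatype

def fixed_block_gemm(A, B):
    """Tiled GEMM where one fixed compute block is reused over time
    (temporal scaling) instead of scaling cores with matrix size."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N))
    for i in range(0, M, TILE):          # iterate output row tiles
        for j in range(0, N, TILE):      # iterate output column tiles
            acc = np.zeros((TILE, TILE))
            for k in range(0, K, TILE):  # sequential partial-sum reduction
                acc += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
            C[i:i+TILE, j:j+TILE] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 32))
assert np.allclose(fixed_block_gemm(A, B), A @ B)
```

Note that hardware resources in this model are invariant in the matrix dimensions: only the number of loop iterations grows, which is the essence of trading spatial scaling for temporal scaling.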

Abstract

Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90% of inference time, efficient GEMM acceleration is critical for edge AI. The Adaptive Intelligent Engines available in the AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores, an approach that fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than expanding hardware resources with matrix size, Tempus employs a fixed compute block of 16 AIE-ML cores, achieving scalability through iterative graph execution and algorithmic data tiling and replication in the Programmable Logic. High-speed cascade streaming ensures low-latency partial-sum reduction at an Initiation Interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse. Evaluated on GEMM workloads, Tempus achieves 607 GOPS at 10.677 W total on-chip power. By characterizing system-level efficiency through the Platform-Aware Utility (PAU) metric, we prove that Tempus achieves a 211.2× higher prominence factor than the leading spatial SOTA (ARIES). Furthermore, the framework maintains 0.00% utilization of URAM/DSP, yielding 22.0× core frugality, 7.1× power frugality, and a 6.3× reduction in I/O demand, establishing a sustainable, scalable foundation for edge LLM inference.
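From the reported figures (607 GOPS at 10.677 W total on-chip power), a back-of-envelope energy-efficiency number follows directly; this derived GOPS/W value is our arithmetic, not a figure stated in the abstract.

```python
# Derived energy efficiency from the paper's reported throughput and power.
gops = 607.0        # reported throughput
watts = 10.677      # reported total on-chip power
gops_per_watt = gops / watts
print(f"{gops_per_watt:.1f} GOPS/W")  # roughly 56.9 GOPS/W
```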