AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
arXiv cs.AI / April 30, 2026
Key Points
- The paper argues that existing LLM serving architectures are GPU-centric, a poor fit for decode-phase attention, which is memory-bound rather than compute-bound (see the roofline sketch after this list).
- It introduces AMMA, a memory-centric, multi-chiplet design that replaces GPU compute dies with HBM processing-near-memory (PNM) “cubes,” roughly doubling memory bandwidth for long-context (up to ~1M-token) attention serving.
- AMMA pairs a custom logic-die microarchitecture that exploits each cube’s internal bandwidth with two-level hybrid parallelism and a reordered collective-communication flow that cuts die-to-die (D2D) overhead (a sharded-merge sketch follows the list).
- Design-space exploration over per-cube compute power and intra-chip D2D link bandwidth yields practical tuning guidance for hardware architects (the toy latency model below illustrates these axes).
- In the paper’s evaluation, AMMA reduces attention latency by 15.5× and energy consumption by 6.9× relative to an NVIDIA H100 baseline.
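
A quick roofline check, not taken from the paper, shows why decode-phase attention lives in the bandwidth-bound regime. The head counts, dtype, and H100 figures below are illustrative assumptions:

```python
# Back-of-the-envelope roofline check (illustrative, not from the paper):
# why decode-phase attention is memory-bound at long context.

def decode_attention_intensity(context_len, n_kv_heads, head_dim,
                               q_per_kv=4, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs/byte) of one decode step's attention.

    Each new token reads the full KV cache once; every cached element
    feeds ~2 multiply-adds (QK^T plus the PV reduction) per query head
    sharing it (q_per_kv under grouped-query attention).
    """
    kv_elems = 2 * context_len * n_kv_heads * head_dim   # K and V caches
    bytes_moved = kv_elems * bytes_per_elem              # one full pass
    flops = 2 * kv_elems * q_per_kv                      # ~2 FLOPs/elem/query head
    return flops / bytes_moved

# Example: 1M-token context, 8 KV heads (GQA, 4 query heads each), d=128, FP16.
ai = decode_attention_intensity(1_000_000, n_kv_heads=8, head_dim=128)
print(f"arithmetic intensity ~ {ai:.1f} FLOPs/byte")     # ~ 4.0

# H100 ridge point under assumed specs: ~989 TFLOPS BF16 / ~3.35 TB/s HBM3.
ridge = 989e12 / 3.35e12
print(f"H100 ridge point ~ {ridge:.0f} FLOPs/byte")      # ~ 295
# ~4 FLOPs/byte sits ~74x below the ridge: the kernel is firmly
# bandwidth-bound, so doubling bandwidth roughly halves decode latency.
```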
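This digest does not spell out AMMA’s two-level scheme, so the sketch below stands in with the standard sequence-parallel decode pattern (flash-decoding style): each cube attends over its local KV shard, and partials are merged exactly with a log-sum-exp rescale. The shapes, shard count, and the merge itself are assumptions, not the paper’s dataflow:

```python
import numpy as np

def partial_attention(q, k_shard, v_shard):
    """Attention over one KV shard; returns (partial output, max, denom)."""
    scores = k_shard @ q / np.sqrt(q.shape[-1])   # (shard_len,)
    m = scores.max()
    w = np.exp(scores - m)                        # numerically stable weights
    return w @ v_shard, m, w.sum()

def merge(parts):
    """Combine per-shard partials into the exact global softmax output."""
    m_global = max(m for _, m, _ in parts)
    num = sum(o * np.exp(m - m_global) for o, m, _ in parts)
    den = sum(s * np.exp(m - m_global) for _, m, s in parts)
    return num / den

rng = np.random.default_rng(0)
L, d, shards = 4096, 128, 8
q = rng.standard_normal(d)
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# Each "cube" handles one contiguous KV shard, then partials are reduced.
parts = [partial_attention(q, Ks, Vs)
         for Ks, Vs in zip(np.split(K, shards), np.split(V, shards))]
out = merge(parts)

# Reference: monolithic softmax attention on one device.
s = K @ q / np.sqrt(d)
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum() @ V
assert np.allclose(out, ref), "sharded merge must match the monolithic result"
print("sharded attention matches:", np.allclose(out, ref))
```

Because each shard ships back only a d-dimensional partial plus two scalars, the merge traffic is tiny next to the local KV scan, which is the kind of small reduction a reordered D2D collective would carry between cubes.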
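Likewise, a toy latency model can illustrate the two design-space axes. Every constant below is invented for illustration, not taken from the paper:

```python
# Toy model of the design-space axes (per-cube compute, D2D link bandwidth).
# Per-step latency = max(local compute time, local KV-scan time) + D2D merge.

def decode_step_latency(kv_bytes, flops, cubes,
                        cube_tflops, cube_hbm_tbs, d2d_gbs, reduce_bytes):
    mem_t  = (kv_bytes / cubes) / (cube_hbm_tbs * 1e12)  # scan local KV shard
    comp_t = (flops / cubes) / (cube_tflops * 1e12)      # local attention math
    d2d_t  = reduce_bytes * cubes / (d2d_gbs * 1e9)      # gather partials
    return max(mem_t, comp_t) + d2d_t

kv_bytes = 4 * 1_000_000 * 8 * 128   # 1M tokens, 8 KV heads, d=128, FP16 K+V
flops    = kv_bytes                  # ~1 FLOP/byte for simplicity

for tflops in (2, 8, 32):            # sweep per-cube compute
    for d2d in (100, 400):           # sweep D2D bandwidth (GB/s)
        t = decode_step_latency(kv_bytes, flops, cubes=16,
                                cube_tflops=tflops, cube_hbm_tbs=0.8,
                                d2d_gbs=d2d, reduce_bytes=64_000)
        print(f"compute={tflops:>2} TFLOPS, D2D={d2d} GB/s -> {t*1e6:.0f} us")

# With arithmetic intensity near 1 FLOP/byte, the KV scan dominates at every
# compute level swept here: past a modest floor, extra per-cube compute buys
# nothing, which is the trade-off a design-space exploration is meant to expose.
```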