Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

arXiv cs.LG / April 23, 2026


Key Points

  • The paper addresses a key bottleneck for long-context LLMs: exact self-attention requires quadratic memory, which commonly causes out-of-memory (OOM) failures.
  • It introduces CQS Divide, which decomposes full-sequence attention into independent subsequence computations that recombine to produce exactly the same attention result.
  • Building on this, Stream-CQSA is a memory-adaptive scheduling framework that partitions attention into subproblems sized to fit within any given memory budget.
  • The approach turns attention from a single monolithic operation into many schedulable tasks, allowing flexible execution across devices without requiring inter-device communication.
  • Experiments indicate predictable memory scaling and show that exact attention for billion-token sequences can run on a single GPU using streaming, without changing the mathematical definition or adding approximation error.
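The paper's exact CQS Divide construction is not spelled out here, but the core idea it relies on, that exact attention over a long sequence can be computed chunk by chunk and recombined with no approximation error, can be illustrated with the standard online-softmax recurrence. The sketch below is a generic chunked formulation, not the authors' scheduler: key/value chunks are processed one at a time, and a running row-wise max and normalizer recombine the partial results into exactly the full-sequence softmax output.

```python
import numpy as np

def streamed_attention(q, k, v, chunk=128):
    """Exact attention computed over key/value chunks (online softmax).

    Each chunk contributes a partial result; a running max and normalizer
    rescale earlier partials so the final output equals monolithic attention.
    """
    n, d = k.shape
    m = np.full(q.shape[0], -np.inf)          # running row-wise max
    denom = np.zeros(q.shape[0])              # running softmax normalizer
    out = np.zeros((q.shape[0], v.shape[1]))  # running weighted sum
    for s in range(0, n, chunk):
        kc, vc = k[s:s + chunk], v[s:s + chunk]
        scores = q @ kc.T / np.sqrt(d)
        m_new = np.maximum(m, scores.max(axis=1))
        scale = np.exp(m - m_new)             # rescale previous partials
        p = np.exp(scores - m_new[:, None])
        denom = denom * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vc
        m = m_new
    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))

# Reference: monolithic exact attention over the full sequence.
s = q @ k.T / np.sqrt(64)
w = np.exp(s - s.max(axis=1, keepdims=True))
ref = (w / w.sum(axis=1, keepdims=True)) @ v

assert np.allclose(streamed_attention(q, k, v, chunk=50), ref)
```

Because each chunk only needs a `len(q) × chunk` score tile in memory, peak memory is set by the chunk size rather than the sequence length, which is what makes a fixed memory budget possible for arbitrarily long sequences.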

Abstract

The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods reduce memory cost to near-linear complexity but still assume that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we present Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.
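The "arbitrary memory budget" claim can be made concrete with a back-of-the-envelope planner. The sketch below is a hypothetical helper (not the paper's CQS schedule; the tile sizes and cost model are assumptions): given a byte budget, it solves for the largest key/value chunk whose per-step working set (query tile, key/value chunks, and score tile) fits, so peak memory stays fixed while only the number of scheduled subproblems grows with sequence length.

```python
import math

def plan_chunks(seq_len, head_dim, budget_bytes, q_tile=128, dtype_bytes=4):
    """Hypothetical planner: pick a key/value chunk size c such that the
    per-step working set fits the budget, counted in elements:
        q_tile*head_dim  (query tile)
      + 2*c*head_dim     (key and value chunks)
      + q_tile*c         (score tile)
    Returns (chunk_size, number_of_scheduled_subproblems)."""
    avail = budget_bytes // dtype_bytes - q_tile * head_dim
    c = avail // (2 * head_dim + q_tile)
    c = max(1, min(c, seq_len))
    steps = math.ceil(seq_len / c) * math.ceil(seq_len / q_tile)
    return c, steps

# The chunk size depends only on the budget, not on sequence length:
c_short, _ = plan_chunks(10**6, 64, 64 * 2**20)
c_long, steps = plan_chunks(10**9, 64, 64 * 2**20)
print(c_short == c_long, steps)  # same chunk size; only step count grows
```

The point of the sketch is the scaling behavior the article describes: memory is a free parameter of the schedule, and a billion-token sequence simply becomes a (much) longer stream of independent, identically sized tasks.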