The ARC of Progress towards AGI: A Living Survey of Abstraction and Reasoning

arXiv cs.AI / 3/17/2026

Key Points

  • The ARC-AGI Living Survey analyzes 82 approaches across ARC-AGI-1 through ARC-AGI-3 and the ARC Prize 2024–2025 competitions, finding that performance drops 2–3× from ARC-AGI-1 to ARC-AGI-2 across all three paradigms (program synthesis, neuro-symbolic, and neural), pointing to fundamental limits in compositional generalization.
  • AI performance on ARC-AGI-1 is 93.0% (Opus 4.6), falls to 68.8% on ARC-AGI-2, and to 13% on ARC-AGI-3, while humans maintain near-perfect accuracy across all versions.
  • The cost per task declined about 390× over a year (from about $4,500 to $12) largely due to reduced test-time parallelism rather than a fundamental jump in model efficiency.
  • Test-time adaptation and refinement loops emerge as critical success factors, while compositional reasoning and interactive learning remain unsolved; Kaggle-constrained entries (660M–8B parameters) achieve competitive results, supporting Chollet's view that intelligence is skill-acquisition efficiency.
  • The ARC Prize 2025 winners needed hundreds of thousands of synthetic examples to reach 24% on ARC-AGI-2, underscoring that reasoning remains knowledge-bound. The living survey captures the field as of February 2026, with updates at the linked site.
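The cost-decline claim above can be sanity-checked with a back-of-the-envelope division. A minimal sketch, assuming the rounded headline figures of about $4,500/task and $12/task (the survey's "about 390×" presumably reflects more precise per-task costs than these rounded values):

```python
# Back-of-the-envelope check of the reported per-task cost decline.
# Figures are the rounded headline numbers from the survey summary,
# not exact measured costs.
cost_before = 4500.0  # approximate USD per task, o3-era
cost_after = 12.0     # approximate USD per task, roughly a year later

ratio = cost_before / cost_after
print(f"cost decline: ~{ratio:.0f}x")  # prints "cost decline: ~375x"
```

The rounded figures give roughly 375×, in the same ballpark as the survey's ~390×; the gap between the two is consistent with rounding in the headline dollar amounts.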

Abstract

The Abstraction and Reasoning Corpus (ARC-AGI) has become a key benchmark for fluid intelligence in AI. This survey presents the first cross-generation analysis of 82 approaches across three benchmark versions and the ARC Prize 2024-2025 competitions. Our central finding is that performance degradation across versions is consistent across all paradigms: program synthesis, neuro-symbolic, and neural approaches all exhibit 2-3x drops from ARC-AGI-1 to ARC-AGI-2, indicating fundamental limitations in compositional generalization. While systems now reach 93.0% on ARC-AGI-1 (Opus 4.6), performance falls to 68.8% on ARC-AGI-2 and 13% on ARC-AGI-3, as humans maintain near-perfect accuracy across all versions. Cost fell 390x in one year (o3's ~$4,500/task to GPT-5.2's ~$12/task), although this largely reflects reduced test-time parallelism. Trillion-scale models vary widely in score and cost, while Kaggle-constrained entries (660M-8B) achieve competitive results, aligning with Chollet's thesis that intelligence is skill-acquisition efficiency. Test-time adaptation and refinement loops emerge as critical success factors, while compositional reasoning and interactive learning remain unsolved. ARC Prize 2025 winners needed hundreds of thousands of synthetic examples to reach 24% on ARC-AGI-2, confirming that reasoning remains knowledge-bound. This first release of the ARC-AGI Living Survey captures the field as of February 2026, with updates at https://nimi-ai.com/arc-survey/