IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

arXiv cs.CV · April 23, 2026


Key Points

  • IMPACT-CYCLE addresses the high cost of correcting long-video understanding errors by introducing an explicit supervisory interface rather than relying on opaque end-to-end multimodal outputs.
  • The system restructures long-video understanding as iterative, claim-level maintenance of a shared, versioned semantic memory, including a claim dependency graph and provenance logs.
  • Role-specialized agents verify correctness at multiple levels—local object-relation validity, cross-temporal consistency, and global semantic coherence—while limiting edits to structurally dependent claims.
  • When evidence is insufficient, IMPACT-CYCLE escalates decisions to human arbitration with final override authority, then re-verifies via dependency-closure to keep correction effort proportional to error scope.
  • Experiments on VidOR report improved downstream reasoning performance (VQA 0.71→0.79) and a 4.8x reduction in human arbitration cost, and the authors plan to release the code on GitHub.
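The dependency-closure idea in the points above can be sketched as a graph traversal: when one claim is corrected, only claims that transitively depend on it are re-verified, so effort scales with error scope. This is a minimal illustrative sketch, not the paper's released code; the `ClaimGraph` class, method names, and claim IDs are all assumptions.

```python
from collections import defaultdict, deque

class ClaimGraph:
    """Hypothetical claim dependency graph for dependency-closure re-verification."""

    def __init__(self):
        # dependents[u] holds claims that depend on claim u
        self.dependents = defaultdict(set)

    def add_dependency(self, claim, depends_on):
        self.dependents[depends_on].add(claim)

    def dependency_closure(self, corrected_claim):
        """Return all claims transitively dependent on the corrected claim (BFS)."""
        closure, queue = set(), deque([corrected_claim])
        while queue:
            u = queue.popleft()
            for v in self.dependents[u]:
                if v not in closure:
                    closure.add(v)
                    queue.append(v)
        return closure

g = ClaimGraph()
g.add_dependency("c2", depends_on="c1")   # c2 builds on c1's object relation
g.add_dependency("c3", depends_on="c2")   # c3 summarizes c2 across time
g.add_dependency("c4", depends_on="c9")   # unrelated branch

# Correcting c1 triggers re-verification of only c2 and c3, never c4.
print(sorted(g.dependency_closure("c1")))  # → ['c2', 'c3']
```

In this framing, the re-verification set is exactly the transitive closure of the corrected claim's dependents, which is what keeps human correction cost proportional to how far an error propagates.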

Abstract

Correcting errors in long-video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end-to-end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT-CYCLE, a supervisory multi-agent system that reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory -- a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency-closure re-verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at https://github.com/MKong17/IMPACT_CYCLE.
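The abstract's shared semantic memory (typed, versioned claims plus a provenance log, with escalation to human arbitration when evidence is insufficient) could look roughly like the sketch below. This is an assumption-laden illustration, not the authors' implementation: the class names, the fixed confidence threshold, and the provenance tuple format are all invented for clarity.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    claim_id: str
    kind: str            # e.g. "object-relation", "temporal", "global"
    text: str
    confidence: float
    version: int = 1

@dataclass
class SemanticMemory:
    """Toy versioned claim store with a provenance log and an escalation rule."""
    claims: dict = field(default_factory=dict)
    provenance: list = field(default_factory=list)
    threshold: float = 0.6   # assumed escalation threshold, not from the paper

    def assert_claim(self, claim, agent):
        self.claims[claim.claim_id] = claim
        self.provenance.append((claim.claim_id, claim.version, agent, "assert"))

    def correct(self, claim_id, new_text, new_confidence, agent):
        old = self.claims[claim_id]
        if new_confidence < self.threshold:
            # Evidence insufficient: defer to human arbitration (final override).
            self.provenance.append((claim_id, old.version, agent, "escalate"))
            return "escalated"
        self.claims[claim_id] = Claim(claim_id, old.kind, new_text,
                                      new_confidence, old.version + 1)
        self.provenance.append((claim_id, old.version + 1, agent, "correct"))
        return "corrected"

mem = SemanticMemory()
mem.assert_claim(Claim("c1", "object-relation", "person holds cup", 0.9), "extractor")
print(mem.correct("c1", "person holds bottle", 0.8, "verifier"))  # → corrected
print(mem.correct("c1", "person holds phone", 0.3, "verifier"))   # → escalated
```

The key property the sketch tries to capture is that every state change is attributable: each assert, correction, or escalation leaves a provenance entry tying a claim version to the agent that produced it, which is what makes the memory inspectable by a human supervisor.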