AI Navigate

Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT

arXiv cs.AI / 3/13/2026


Key Points

  • The paper investigates how video vision transformers encode nuanced action-outcome information in classification tasks, highlighting hidden knowledge relevant to Trustworthy AI.
  • Using mechanistic interpretability and causal analysis, the authors show that the "Success vs Failure" outcome signal is progressively amplified from layers 5 through 11, with only modest low-level differences observed at layer 0.
  • Attention heads act as "evidence gatherers" providing low-level information for partial signal recovery, while MLP blocks function as "concept composers" driving the final outcome.
  • The results reveal a distributed and redundant internal circuit in the model, resilient to simple ablations, and emphasize the need for mechanistic oversight to build genuinely explainable and trustworthy AI systems.
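The causal method behind these findings, activation patching, can be illustrated with a minimal sketch. The toy two-layer network, its random weights, and the "clean"/"corrupted" inputs below are all hypothetical stand-ins, not the paper's VideoViT; the sketch only shows the core mechanic of caching an activation from a clean run and substituting it into a corrupted run:

```python
import numpy as np

# Toy stand-in for a transformer's layered computation. All weights and
# inputs are hypothetical; the paper applies this to a pre-trained
# video vision transformer.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
readout = rng.normal(size=4)  # maps the final state to a scalar "success" logit

def forward(x, patch_layer1=None):
    """Run the toy model; optionally overwrite the layer-1 output
    with a cached activation (the essence of activation patching)."""
    h1 = np.tanh(W1 @ x)
    if patch_layer1 is not None:
        h1 = patch_layer1          # patch: substitute the clean activation
    h2 = np.tanh(W2 @ h1)
    return readout @ h2, h1

x_clean = np.ones(4)               # hypothetical "success" video features
x_corrupt = -np.ones(4)            # hypothetical "failure" video features

logit_clean, h1_clean = forward(x_clean)
logit_corrupt, _ = forward(x_corrupt)

# Patch the clean layer-1 activation into the corrupted run; if the
# logit moves toward the clean value, layer 1 causally carries the signal.
logit_patched, _ = forward(x_corrupt, patch_layer1=h1_clean)

recovery = (logit_patched - logit_corrupt) / (logit_clean - logit_corrupt)
print(f"recovered fraction of the clean logit: {recovery:.2f}")  # 1.00 here
```

Patching the entire layer recovers the clean logit exactly, as here; in practice one patches individual attention heads or MLP blocks, and the resulting *partial* per-component recovery is what supports the "evidence gatherer" versus "concept composer" distinction.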

Abstract

The paper explores how video models trained for classification represent nuanced, hidden semantic information that may not affect the final output, a key challenge for Trustworthy AI. Through Explainable and Interpretable AI methods, specifically mechanistic interpretability techniques, the authors reverse-engineer the internal circuit that represents an action's outcome in a pre-trained video vision transformer, revealing that the "Success vs Failure" signal is computed through a distinct amplification cascade. While low-level differences are observable from layer 0, the abstract, semantic representation of the outcome is progressively amplified from layers 5 through 11. Causal analysis, primarily activation patching supported by ablation results, reveals a clear division of labor: Attention Heads act as "evidence gatherers", providing the low-level information needed for partial signal recovery, while MLP Blocks function as robust "concept composers" and are the primary drivers of the "success" signal. This distributed and redundant internal circuit explains the model's resilience to simple ablations and demonstrates a core computational pattern for processing human-action outcomes. Crucially, the existence of such a sophisticated circuit for representing complex outcomes, even in a model trained only for simple classification, highlights the potential for models to develop forms of 'hidden knowledge' beyond their explicit task, underscoring the need for mechanistic oversight when building genuinely Explainable and Trustworthy AI systems intended for deployment.
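The claimed resilience to simple ablations follows directly from redundancy: when several components each carry the same signal, zeroing any one of them shrinks the output without flipping the predicted outcome. A minimal numerical sketch (two hypothetical "MLP blocks" with made-up weights, not anything from the paper) makes this concrete:

```python
import numpy as np

# Two hypothetical blocks redundantly compute a "success" feature;
# the readout simply sums their contributions.
x = np.array([1.0, -0.5, 0.3])
W_a = np.array([[0.8, -0.2, 0.1]])   # hypothetical block A
W_b = np.array([[0.7, -0.3, 0.2]])   # hypothetical block B (redundant)

def success_logit(x, ablate_a=False, ablate_b=False):
    """Sum the two blocks' contributions, optionally zero-ablating either."""
    h_a = 0.0 if ablate_a else (W_a @ x).item()
    h_b = 0.0 if ablate_b else (W_b @ x).item()
    return h_a + h_b

full = success_logit(x)
no_a = success_logit(x, ablate_a=True)
no_b = success_logit(x, ablate_b=True)
both = success_logit(x, ablate_a=True, ablate_b=True)

# Either single ablation shrinks the logit but leaves its sign (the
# predicted outcome) intact; only the joint ablation destroys the signal.
print(full, no_a, no_b, both)
```

This is why, as the abstract notes, activation patching (which measures what each component *can* restore) was needed on top of ablations (which a redundant circuit can shrug off) to localize the computation.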