Separable Pathways for Causal Reasoning: How Architectural Scaffolding Enables Hypothesis-Space Restructuring in LLM Agents

arXiv cs.AI / 4/23/2026


Key Points

  • The paper argues that robust causal reasoning in LLM agents requires restructuring the hypothesis space (not only updating beliefs) when new evidence calls for representations the agent has not yet built.
  • It extends the developmental “blicket detector” paradigm to test this capability in AI agents using architectural scaffolding specifically designed for hypothesis-space restructuring.
  • The proposed compositional architecture uses two separate components: context graphs that organize exploration as typed state machines, and dynamic behaviors that detect when the current hypothesis space is inadequate and expand it during runtime.
  • In 1,085 experimental trials, the authors find that the two components make orthogonal contributions: context graphs account for 94% of the accuracy improvement after hypothesis switching, while dynamic behaviors improve reasoning eligibility by detecting regime changes and avoiding premature commitment to outdated hypotheses.
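
To make the phrase "typed state machines" concrete, here is a minimal Python sketch of a graph that only permits declared transitions between exploration phases. The phase names, the transition table, and the `ContextGraph` class are assumptions made for illustration; the summary does not describe the authors' actual implementation.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical exploration phases (not taken from the paper)."""
    EXPLORE = auto()      # gather observations through interventions
    HYPOTHESIZE = auto()  # propose candidate causal rules
    TEST = auto()         # run a targeted intervention
    CONCLUDE = auto()     # commit to an answer


# Assumed transition table: each phase lists the phases it may move to.
TRANSITIONS = {
    Phase.EXPLORE: {Phase.EXPLORE, Phase.HYPOTHESIZE},
    Phase.HYPOTHESIZE: {Phase.TEST, Phase.EXPLORE},
    Phase.TEST: {Phase.HYPOTHESIZE, Phase.CONCLUDE},
    Phase.CONCLUDE: set(),
}


class ContextGraph:
    """Tracks the current phase and rejects undeclared transitions."""

    def __init__(self, start=Phase.EXPLORE):
        self.phase = start
        self.history = [start]

    def step(self, target):
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"illegal transition {self.phase.name} -> {target.name}"
            )
        self.phase = target
        self.history.append(target)
        return self.phase
```

In this sketch the agent cannot jump from exploring straight to a conclusion: `ContextGraph().step(Phase.CONCLUDE)` raises `ValueError`, which is one way a typed graph could keep exploration structured.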

Abstract

Causal discovery through experimentation and intervention is fundamental to robust problem solving. It requires not just updating beliefs within a fixed framework but revising the hypothesis space itself, a capacity current AI agents lack when evidence demands representations they have not previously constructed. We extend the blicket detector paradigm from developmental science to test this capacity in AI agents equipped with architectural scaffolding that targets hypothesis-space restructuring. Our compositional architecture has two discrete components: context graphs, which structure exploration as typed state machines, and dynamic behaviors, which monitor for evidence that the current hypothesis space is inadequate and expand it at runtime. Across 1,085 experimental trials, these components make orthogonal contributions: context graphs drive reasoning quality within the post-switch hypothesis space, accounting for 94% of the accuracy gain, while dynamic behaviors drive reasoning eligibility by detecting regime changes and preventing premature commitment to outdated hypotheses.
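
The difference between updating beliefs within a fixed hypothesis space and restructuring the space itself can be illustrated with a toy blicket-detector model. The sketch below is an assumption-laden illustration, not the authors' method: it starts with only disjunctive ("any blicket activates the detector") hypotheses, and adds conjunctive ("all blickets must be present") hypotheses only when no existing hypothesis fits the evidence. The function names, the `("any" | "all", blickets)` encoding, and the three-object setup are hypothetical.

```python
from itertools import combinations


def predict(hypothesis, placed):
    """Return whether the detector should fire under a hypothesis."""
    kind, blickets = hypothesis
    if kind == "any":                 # disjunctive rule
        return bool(blickets & placed)
    return blickets <= placed         # conjunctive rule ("all")


def subsets(objects, min_size):
    """All subsets of `objects` with at least `min_size` members."""
    return [
        frozenset(combo)
        for size in range(min_size, len(objects) + 1)
        for combo in combinations(objects, size)
    ]


def consistent(hypotheses, evidence):
    """Keep hypotheses that match every (placed, fired) observation."""
    return [
        h for h in hypotheses
        if all(predict(h, placed) == fired for placed, fired in evidence)
    ]


def infer(objects, evidence):
    """Filter a disjunctive space; expand it only if nothing survives."""
    space = [("any", s) for s in subsets(objects, 1)]
    survivors = consistent(space, evidence)
    expanded = False
    if not survivors:
        # Assumed trigger: an empty survivor set signals an inadequate space.
        # Size-1 conjunctions equal size-1 disjunctions, so start at size 2.
        space += [("all", s) for s in subsets(objects, 2)]
        survivors = consistent(space, evidence)
        expanded = True
    return survivors, expanded


# Evidence generated by a conjunctive rule: only A and B together fire.
evidence = [
    (frozenset("A"), False),
    (frozenset("B"), False),
    (frozenset("AB"), True),
    (frozenset("AC"), False),
]
survivors, expanded = infer("ABC", evidence)
```

With this evidence no disjunctive hypothesis survives, so the space is expanded and the only remaining hypothesis is the conjunction of A and B. Ordinary belief updating corresponds to the `consistent` filter alone; the expansion step is the part that stands in for hypothesis-space restructuring.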