AI Navigate

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

arXiv cs.LG / 3/12/2026


Key Points

  • The authors propose Causal Concept Graphs (CCG), a directed acyclic graph over sparse latent features to model causal interactions among concepts during stepwise reasoning in LLMs.
  • They combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning to recover the graph.
  • They introduce the Causal Fidelity Score (CFS) to quantify how graph-guided interventions affect downstream results, showing larger effects than random baselines.
  • On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium across five seeds, CCG achieves significant CFS improvements over baselines (p<0.0001 after Bonferroni correction).
  • The learned graphs are sparse (about 5-6% edge density), domain-specific, and stable across seeds.
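The DAGMA-style structure learning mentioned above enforces acyclicity with a log-determinant penalty that is zero exactly when the weighted adjacency matrix describes a DAG. A minimal sketch of that constraint follows; the function name and the toy matrices are illustrative, not from the paper:

```python
import numpy as np

def dagma_acyclicity(W, s=1.0):
    """Log-det acyclicity penalty in the style of DAGMA (Bello et al., 2022).

    h(W) = -log det(s*I - W∘W) + d*log(s), where ∘ is the elementwise
    square. h(W) = 0 iff W encodes a DAG (within the feasible domain).
    """
    d = W.shape[0]
    M = s * np.eye(d) - W * W  # elementwise square keeps entries non-negative
    sign, logdet = np.linalg.slogdet(M)
    if sign <= 0:
        raise ValueError("W left the feasible domain of the penalty")
    return -logdet + d * np.log(s)

# A strictly upper-triangular weight matrix is a DAG: penalty is exactly 0.
W_dag = np.array([
    [0.0, 0.2, 0.0, 0.1],
    [0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
])
print(dagma_acyclicity(W_dag))  # 0.0

# Adding a back edge (2 -> 1) creates a cycle: penalty becomes positive.
W_cyc = W_dag.copy()
W_cyc[2, 1] = 0.3
print(dagma_acyclicity(W_cyc) > 0)  # True
```

Because the penalty is differentiable in W, it can be added to a reconstruction loss and minimized with standard gradient methods, which is what makes "differentiable structure learning" over SAE features tractable.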

Abstract

Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n = 15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p < 0.0001 after Bonferroni correction. Learned graphs are sparse (5–6% edge density), domain-specific, and stable across seeds.
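The abstract does not spell out the CFS formula, but the described comparison (graph-guided vs. random interventions, with a random baseline scoring near 1) suggests an effect-size ratio. The following toy sketch illustrates that reading on a linear readout over sparse features; every name and number here is hypothetical, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def ablation_effect(readout, z, idx):
    """Absolute change in a scalar readout when latent features `idx` are zeroed."""
    z_ablated = z.copy()
    z_ablated[list(idx)] = 0.0
    return abs(readout(z) - readout(z_ablated))

# Toy model: d sparse features feed a linear readout, but only the
# first four features carry causal weight (the "graph-ranked" ones).
d = 64
w = np.zeros(d)
w[:4] = [2.0, 1.5, 1.0, 0.8]
readout = lambda z: float(w @ z)
z = rng.random(d)  # one activation vector

graph_ranked = range(4)  # features the learned graph marks as upstream
random_sets = [rng.choice(d, size=4, replace=False) for _ in range(100)]

guided_effect = ablation_effect(readout, z, graph_ranked)
random_effect = np.mean([ablation_effect(readout, z, s) for s in random_sets])

cfs = guided_effect / (random_effect + 1e-9)  # >1: graph-guided beats random
print(f"toy CFS ratio: {cfs:.1f}")
```

Under this reading, a random intervention scores about 1 by construction, so the paper's reported gap (5.654 for CCG vs. 1.032 for random) is directly interpretable as a multiplicative effect size.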