Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

arXiv cs.LG / 3/12/2026

Key Points

The authors propose Causal Concept Graphs (CCG), a directed acyclic graph over sparse latent features to model causal interactions among concepts during stepwise reasoning in LLMs.
They combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning to recover the graph.
They introduce the Causal Fidelity Score (CFS), which measures whether graph-guided interventions induce larger downstream effects than random ones.
On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium across five seeds (n = 15 paired runs), CCG reaches CFS = 5.654 ± 0.625, ahead of ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p<0.0001 after Bonferroni correction.
The learned graphs are sparse (about 5-6% edge density), domain-specific, and stable across seeds.
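
To make "DAGMA-style differentiable structure learning" and "edge density" concrete, here is a minimal Python sketch. It uses the log-determinant acyclicity penalty from the DAGMA method, h(W) = -log det(sI - W∘W) + d log s, which is zero exactly when the weighted adjacency matrix W describes a DAG. Everything else is an assumption made for illustration: the toy 3-concept matrices, the function names, and the choice to measure density over the d(d-1) possible directed edges. The summary does not say how the authors compute density or set up their optimizer.

```python
import numpy as np


def dagma_acyclicity(W: np.ndarray, s: float = 1.0) -> float:
    """Log-det acyclicity penalty h(W) = -log det(sI - W∘W) + d log s."""
    d = W.shape[0]
    W_sq = W * W  # elementwise square, so all entries are non-negative
    # The penalty is only defined where the spectral radius of W∘W is below s.
    if np.max(np.abs(np.linalg.eigvals(W_sq))) >= s:
        raise ValueError("spectral radius of W*W must be smaller than s")
    _, logdet = np.linalg.slogdet(s * np.eye(d) - W_sq)
    return float(-logdet + d * np.log(s))


def edge_density(W: np.ndarray, tol: float = 1e-8) -> float:
    """Fraction of possible directed edges (self-loops excluded) that are present.

    Assumption: density is counted over d*(d-1) ordered pairs.
    """
    d = W.shape[0]
    off_diagonal = ~np.eye(d, dtype=bool)
    n_edges = np.count_nonzero(np.abs(W[off_diagonal]) > tol)
    return n_edges / (d * (d - 1))


# Hypothetical 3-concept chain: concept 0 -> concept 1 -> concept 2.
W_dag = np.array([[0.0, 0.8, 0.0],
                  [0.0, 0.0, -0.5],
                  [0.0, 0.0, 0.0]])

# Same graph with an extra edge 2 -> 0, which closes a cycle.
W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.7

print("h(DAG)    =", dagma_acyclicity(W_dag))   # zero up to rounding
print("h(cyclic) =", dagma_acyclicity(W_cyc))   # strictly positive
print("density   =", edge_density(W_dag))       # 2 of 6 possible edges
```

In a structure-learning loop, a penalty of this kind is added to a data-fit loss and a sparsity term, so that gradient descent is pushed toward sparse, acyclic graphs.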

Abstract

Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n = 15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p<0.0001 after Bonferroni correction. Learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.
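
For readers unfamiliar with the Bonferroni correction mentioned above, the following Python sketch shows the adjustment itself: each raw p-value is multiplied by the number of comparisons and capped at 1. The raw p-values below are hypothetical placeholders, and the assumption that there are three comparisons (CCG against each of the three baselines) is ours. The abstract reports only the corrected threshold.

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Return Bonferroni-adjusted p-values: min(1, p * m) for m comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for CCG vs. each baseline (not from the paper).
raw = {
    "ROME-style tracing": 1e-6,
    "SAE-only ranking": 2e-7,
    "random baseline": 5e-8,
}

adjusted = bonferroni(list(raw.values()))
for name, p_adj in zip(raw, adjusted):
    # A comparison counts as significant here if it stays below 0.0001
    # after the correction.
    print(f"{name}: adjusted p = {p_adj:.2e}, significant = {p_adj < 1e-4}")
```

Because the correction multiplies each p-value by the number of tests, a result that survives it is significant even under the most conservative accounting for multiple comparisons.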
