Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

arXiv cs.LG / April 15, 2026


Key Points

  • The paper tests whether transformer models can use their depth adaptively as task difficulty increases, using a multi-hop relational reasoning benchmark with difficulty set by the number of reasoning “hops.”
  • It evaluates adaptation using two probing methods: early layer readouts (logit lens) to track prediction evolution and causal patching to measure how task-relevant information is integrated across tokens.
  • Pretrained models show only limited adaptive-depth behavior: easier tasks can sometimes be solved with fewer layers, while longer reasoning chains generally require more layers for cross-token integration.
  • For models finetuned on the task, evidence for adaptive depth becomes clearer and more consistent, and the effect is stronger under looser finetuning that does not preserve general language-modeling capabilities.
  • The findings suggest that apparent depth adaptation depends on training regime and may be more pronounced when fine-tuning shapes computation toward the specific reasoning task.
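The first probing method mentioned above, the logit lens, reads out a prediction from each intermediate layer by projecting its hidden state through the model's unembedding matrix, revealing at which depth the answer stabilizes. Below is a minimal NumPy sketch of that idea on random data; the dimensions, weights, and `hidden_states` are invented stand-ins, not the paper's actual model or setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 4

# Hypothetical residual-stream states for one token position, one per layer
# (in a real transformer these would be cached during a forward pass).
hidden_states = [rng.normal(size=d_model) for _ in range(n_layers)]

# Shared unembedding matrix mapping hidden states to vocabulary logits.
W_U = rng.normal(size=(d_model, vocab))

def logit_lens(h, W_U):
    """Early readout: project an intermediate hidden state to vocab logits."""
    return h @ W_U

# Track how the top predicted token evolves across depth; if the final answer
# appears at an early layer, later layers add little for this input.
per_layer_top = [int(np.argmax(logit_lens(h, W_U))) for h in hidden_states]
print(per_layer_top)
```

In the paper's framing, an "adaptive" model would show the correct answer emerging at earlier layers for 1-hop questions than for 3-hop ones.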

Abstract

We investigate whether transformers use their depth adaptively across tasks of increasing difficulty. Using a controlled multi-hop relational reasoning task based on family stories, where difficulty is determined by the number of relationship hops that must be composed, we monitor (i) how predictions evolve across layers via early readouts (the logit lens) and (ii) how task-relevant information is integrated across tokens via causal patching. For pretrained models, we find some limited evidence for adaptive depth use: some larger models need fewer layers to arrive at plausible answers for easier tasks, and models generally use more layers to integrate information across tokens as chain length increases. For models finetuned on the task, we find clearer and more consistent evidence of adaptive depth use, with the effect being stronger for less constrained finetuning regimes that do not preserve general language modeling abilities.
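The second probe, causal patching, swaps an activation cached from a "clean" run into a "corrupted" run and measures how much the output recovers; layers where patching matters are the ones doing the cross-token integration. The toy two-layer network below is a hedged illustration of the intervention only, with invented weights, and does not reproduce the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Toy two-layer network standing in for a transformer's layers.
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def forward(x, patch=None):
    """Run the toy model; optionally overwrite the layer-1 activation."""
    h1 = np.tanh(x @ W1)
    if patch is not None:
        h1 = patch  # causal intervention: swap in a cached activation
    return np.tanh(h1 @ W2)

clean, corrupted = rng.normal(size=d), rng.normal(size=d)

# Cache the clean run's intermediate activation.
h1_clean = np.tanh(clean @ W1)

out_clean = forward(clean)
out_corrupted = forward(corrupted)
# Patch the clean activation into the corrupted run; since the entire
# layer-1 state is replaced, the output fully recovers the clean output.
out_patched = forward(corrupted, patch=h1_clean)

recovery = np.linalg.norm(out_patched - out_clean)
print(recovery)
```

Real studies patch a single layer or token position at a time, so recovery is partial; mapping which (layer, position) patches restore the answer shows at what depth task-relevant information moves between tokens.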