Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling

arXiv cs.LG · April 16, 2026


Key Points

  • Linear probes can detect when a language model produces outputs it “knows” are wrong, but prior work shows that single-layer probing is brittle and fails entirely on certain deception types.
  • The study introduces multi-layer ensembling of linear probes, which restores strong detection performance even when individual probes fail, yielding AUROC gains of +29% on Insider Trading and +78% on Harm-Pressure Knowledge.
  • Experiments across 12 model sizes (0.5B–176B parameters) show that probe accuracy improves systematically with model scale, at roughly 5 AUROC percentage points per 10× increase in parameters (R = 0.81).
  • The authors argue the key mechanism is geometric: “deception directions” rotate gradually across layers rather than being localized to a single layer, explaining both fragility of single-layer probes and robustness of ensembles.

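The ensembling idea in the key points above can be sketched as follows. This is a minimal, hypothetical illustration (synthetic data, not the paper's implementation): one logistic-regression probe is fit per layer, and the ensemble simply averages their predicted probabilities. The synthetic hidden states encode a "deception direction" that rotates with depth, mimicking the geometric picture the authors describe.

```python
# Hypothetical sketch of multi-layer linear-probe ensembling.
# Data, dimensions, and the rotation schedule are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_layers, n_train, n_test, d = 8, 400, 200, 32

y_train = rng.integers(0, 2, n_train)
y_test = rng.integers(0, 2, n_test)

def make_states(y, n):
    """Synthetic per-layer activations: the class direction rotates with depth."""
    states = []
    for layer in range(n_layers):
        theta = layer * np.pi / (2 * (n_layers - 1))
        direction = np.zeros(d)
        direction[0], direction[1] = np.cos(theta), np.sin(theta)
        x = rng.normal(size=(n, d)) + np.outer(2 * y - 1, direction) * 1.5
        states.append(x)
    return states

X_train = make_states(y_train, n_train)
X_test = make_states(y_test, n_test)

# Fit one linear probe per layer, then average their probabilities.
probes = [LogisticRegression(max_iter=1000).fit(X_train[l], y_train)
          for l in range(n_layers)]
per_layer = np.stack([p.predict_proba(X_test[l])[:, 1]
                      for l, p in enumerate(probes)])
ensemble_scores = per_layer.mean(axis=0)

auroc = roc_auc_score(y_test, ensemble_scores)
print(f"ensemble AUROC: {auroc:.3f}")
```

Because the direction rotates, no single layer's probe is aligned everywhere, but the averaged ensemble stays robust, which is the qualitative behavior the paper reports.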
Abstract

Linear probes can detect when language models produce outputs they "know" are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge. Across 12 models (0.5B–176B parameters), we find probe accuracy improves with scale: ~5% AUROC per 10× parameters (R = 0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, explaining both why single-layer probes are brittle and why multi-layer ensembles succeed.
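The scaling claim (~5% AUROC per 10× parameters) is a log-linear fit, which can be reproduced in a few lines. The data points below are synthetic, chosen only to illustrate the reported trend, not the paper's actual measurements.

```python
# Illustrative log-linear fit of probe AUROC vs. model size.
# The (params, auroc) pairs are hypothetical, not the paper's data.
import numpy as np

params = np.array([0.5e9, 2e9, 7e9, 30e9, 70e9, 176e9])
auroc = np.array([0.70, 0.73, 0.76, 0.79, 0.81, 0.83])  # synthetic

# Regress AUROC on log10(params): the slope is the gain per 10x parameters.
slope, intercept = np.polyfit(np.log10(params), auroc, 1)
r = np.corrcoef(np.log10(params), auroc)[0, 1]
print(f"AUROC gain per 10x parameters: {slope:.3f} (r = {r:.2f})")
```

With roughly linear synthetic points, the fitted slope comes out near 0.05, i.e. about 5 AUROC percentage points per decade of parameters, matching the shape of the trend the abstract describes.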