A Sketching Method for Scalable Data Attribution and Valuation Using Large Language Model Readout

arXiv cs.LG / 2026/4/20



Key Points

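The paper proposes RISE (Readout Influence Sketching Estimator), a new method that makes data attribution and valuation for LLMs scalable by avoiding gradient computation over the entire model.
RISE targets influence "hotspots" at the output layer, using an outer-product decomposition of the gradient and a dual-channel representation (lexical residual and semantic projected error) that is compressed with CountSketch.
In experiments on OLMo (1B–32B) and Pythia (14M–6.9B), RISE reduces index storage by up to 112× compared to RapidIn and enables influence analysis even at the 32B scale, where memory load becomes a problem for gradient-based methods.
RISE is evaluated on Howdy backdoor data detection, Finance-Medical domain separation, and Brain Rot high-quality data selection; in a closed-loop setting where continued pretraining is run on RISE-selected data, downstream performance improves consistently.
Overall, RISE is positioned as a practical, scalable primitive for influence analysis and for zero-shot scoring of candidate-data utility.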
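To make the outer-product and CountSketch points concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the helper names (`make_countsketch`, `countsketch`, `readout_factors`), the toy dimensions, and the choice to sketch the two factors separately are all assumptions made for illustration. It relies only on two standard facts: for a softmax readout with cross-entropy loss, the gradient with respect to the output matrix is the outer product of the residual and the hidden state, and the inner product of two outer products factorizes into a product of two small inner products.

```python
import numpy as np


def make_countsketch(dim, width, seed):
    """Draw a bucket index and a random sign for each of `dim` coordinates."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, width, size=dim)
    signs = rng.choice(np.array([-1.0, 1.0]), size=dim)
    return buckets, signs


def countsketch(x, buckets, signs, width):
    """Compress x to length `width` by summing signed entries per bucket."""
    out = np.zeros(width)
    np.add.at(out, buckets, signs * x)
    return out


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def readout_factors(W, h, target):
    """Return (residual, hidden) such that dLoss/dW = outer(residual, hidden)."""
    residual = softmax(W @ h)
    residual[target] -= 1.0  # p - one_hot(target)
    return residual, h


# Toy sizes (assumed); real vocabularies and hidden widths are far larger.
vocab, hidden = 2000, 256
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(vocab, hidden))
h_a, h_b = rng.normal(size=hidden), rng.normal(size=hidden)
r_a, _ = readout_factors(W, h_a, target=3)
r_b, _ = readout_factors(W, h_b, target=3)

# Gradient inner product without materialising the vocab x hidden gradient.
exact = (r_a @ r_b) * (h_a @ h_b)
full = np.sum(np.outer(r_a, h_a) * np.outer(r_b, h_b))

# Sketch each factor separately and store only the two short vectors.
r_width, h_width = 128, 64
bk_r, sg_r = make_countsketch(vocab, r_width, seed=1)
bk_h, sg_h = make_countsketch(hidden, h_width, seed=2)
sr_a = countsketch(r_a, bk_r, sg_r, r_width)
sr_b = countsketch(r_b, bk_r, sg_r, r_width)
sh_a = countsketch(h_a, bk_h, sg_h, h_width)
sh_b = countsketch(h_b, bk_h, sg_h, h_width)
approx = (sr_a @ sr_b) * (sh_a @ sh_b)

print("exact:", exact, "full:", full, "sketched estimate:", approx)
print("floats stored per example:", r_width + h_width, "vs", vocab * hidden)
```

In this toy setup each example is indexed with 192 floats in place of the 512,000 entries of the full output-layer gradient. The sketched estimate is noisy for any single pair, and the widths here are arbitrary; the compression figures reported in the paper come from its own design, not from this example.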

Abstract

Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate, and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia (14M-6.9B) families, RISE reduces index storage by up to $112\times$ compared to RapidIn and scales to 32B parameters LLM, where gradient-based baselines such as RapidIn and ZO-Inf become memory-infeasible. We evaluate RISE on two paradigms: (1) retrospective attribution, retrieving influential training examples for specific predictions, and (2) prospective valuation, scoring candidate data utility zero-shot. We validate RISE on three tasks: Howdy backdoor data detection, Finance-Medical domain separation, and Brain Rot high-quality data selection. In a closed-loop Brain Rot study, continued pretraining on RISE-selected data yields consistent downstream improvements. Overall, RISE provides a practical and scalable primitive for influence analysis and training-data selection in modern large language models.
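The "decomposed outer-product form" mentioned in the abstract can be illustrated with the standard gradient of a softmax readout. The notation below is assumed for illustration and is not taken from the paper: $W \in \mathbb{R}^{V \times d}$ is a bias-free output matrix, $h \in \mathbb{R}^{d}$ is the final hidden state, $y$ is the target token, and $e_y$ is its one-hot vector.

```latex
\begin{aligned}
z &= W h, \qquad p = \operatorname{softmax}(z), \qquad \mathcal{L} = -\log p_y, \\
r &:= \frac{\partial \mathcal{L}}{\partial z} = p - e_y, \\
\frac{\partial \mathcal{L}}{\partial W} &= r\, h^{\top}, \qquad
\frac{\partial \mathcal{L}}{\partial h} = W^{\top} r, \\
\left\langle r_a h_a^{\top},\; r_b h_b^{\top} \right\rangle_F
  &= (r_a^{\top} r_b)\,(h_a^{\top} h_b).
\end{aligned}
```

The last line shows why the form is convenient: the similarity between two $V \times d$ gradients reduces to two short inner products, so only the factors need to be stored or sketched. One plausible reading is that the vocabulary-space residual $r$ relates to the lexical residual channel and the hidden-space error $W^{\top} r$ to the semantic projected-error channel, but the abstract does not define RH and GH, so that mapping is an assumption.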
