SRA: Span Representation Alignment for Large Language Model Distillation

arXiv cs.CL / 5/5/2026


Key Points

  • The paper introduces SRA, a new framework for Cross-Tokenizer Knowledge Distillation (CTKD) that avoids brittle token-level alignment by aligning at the span level.
  • SRA treats each span as a cluster of particles and represents it using an attention-weighted “center of mass” to capture semantic information in a tokenizer-agnostic way (see the sketch after this list).
  • It prioritizes salient spans via attention-derived weighting and adds a geometric regularizer to maintain the structure of the representation space.
  • The method also includes aligned span logit distillation to improve how knowledge transfers between teacher and student models.
  • In cross-architecture distillation experiments, SRA reportedly delivers consistent and significant improvements over existing CTKD baselines, supporting the authors’ physically grounded modeling approach.
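
To make the center-of-mass idea concrete, the following is a minimal sketch, not the authors' code, of how an attention-weighted span representation could be computed in PyTorch. The span is defined by character offsets, so the same aggregation applies regardless of which tokenizer produced the tokens; every function and argument name below is an illustrative assumption rather than a detail taken from the paper.

```python
# Hedged sketch: attention-weighted "center of mass" for a character-level span,
# assuming per-token hidden states, attention-derived token weights ("masses"),
# and each token's character offsets are already available.
import torch


def span_center_of_mass(hidden_states: torch.Tensor,
                        token_weights: torch.Tensor,
                        token_char_spans: list,
                        span: tuple) -> torch.Tensor:
    """Aggregate the tokens overlapping a character span into one vector.

    hidden_states:    (seq_len, hidden_dim) token representations
    token_weights:    (seq_len,) attention-derived saliency per token
    token_char_spans: list of (start, end) character offsets per token
    span:             (start, end) character offsets of the tokenizer-agnostic span
    """
    start, end = span
    # Tokens whose character range overlaps the span, regardless of tokenizer.
    idx = [i for i, (s, e) in enumerate(token_char_spans) if s < end and e > start]
    if not idx:
        return hidden_states.new_zeros(hidden_states.size(-1))
    idx = torch.tensor(idx, device=hidden_states.device)
    masses = token_weights[idx]
    masses = masses / masses.sum().clamp_min(1e-8)  # normalize the "particle masses"
    # Center of mass: mass-weighted average of the token "particles".
    return (masses.unsqueeze(-1) * hidden_states[idx]).sum(dim=0)
```

Because the span is anchored to character offsets, the teacher and the student can each compute such a vector for the same span even though their tokenizers segment the text differently.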

Abstract

Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that how tokens are aggregated into more robust representations before distillation is equally important. In this paper, we introduce SRA (Span Representation Alignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM), an attention-weighted average that captures rich semantic information. Attention-derived weighting over these span centers of mass prioritizes the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically grounded approach.
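
Based only on the abstract, one plausible form of the span-level alignment objective is sketched below: each shared character span yields a center-of-mass vector from the teacher and from the student, a learned projection bridges the two hidden sizes, and attention-derived span saliency weights the per-span distances. The module name, the cosine distance, and the projection layer are assumptions for illustration, not details confirmed by the paper.

```python
# Hedged sketch of a span-level alignment loss in the spirit of SRA's abstract.
# All names and the exact distance function are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpanAlignmentLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Learned projection bridging the student's and teacher's representation spaces.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self,
                student_coms: torch.Tensor,   # (num_spans, student_dim) span centers of mass
                teacher_coms: torch.Tensor,   # (num_spans, teacher_dim) span centers of mass
                span_saliency: torch.Tensor   # (num_spans,) attention-derived span weights
                ) -> torch.Tensor:
        projected = self.proj(student_coms)
        # Per-span distance between the two centers of mass.
        per_span = 1.0 - F.cosine_similarity(projected, teacher_coms, dim=-1)
        # Emphasize salient spans, as the key points describe.
        weights = span_saliency / span_saliency.sum().clamp_min(1e-8)
        return (weights * per_span).sum()
```

The geometric regularizer and the aligned span logit distillation mentioned in the abstract would be additional loss terms on top of this alignment objective; their exact formulation is not specified here, so they are omitted from the sketch.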