
[D] Is language modeling fundamentally token-level or sequence-level?

Reddit r/MachineLearning / 3/19/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The post argues there is evidence for token-level pretraining and sequence-level alignment and sampling, suggesting two different perspectives on language modeling.
  • It explains the technical distinction where token-level cross-entropy averages by token count while sequence-level would average by batch size, affecting gradient weighting during training.
  • It cites Long Horizon Temperature Scaling (Shih et al., 2023) to show that token-level temperature scaling is myopic, and that correctly scaling the distribution over full sequences requires sequence-level reasoning, linking sampling to sequence-level likelihood.
  • It notes that in reinforcement learning, rewards are sequence-level, raising questions about credit assignment across tokens (e.g., GRPO and its discussion in TRL documentation).
  • It lists open questions about potential repetition issues from token-level training, whether there are sequence-level pretraining approaches, and the search for a unified, principled framework.

Is language modeling fundamentally token-level or sequence-level?

There is evidence for both: pretraining and sampling lean towards a token-level view, while alignment is fundamentally sequence-level. Curious if there is any work trying to unify the two perspectives, and which is the more principled framing.

Pretraining

Textbook language modeling defines the task as learning a distribution over strings, but every cross-entropy loss implementation I've seen operates at the token level. The difference is subtle but real: both compute the sum of -log P(next token | previous tokens) over all tokens in the batch — same numerator, different denominator. Token-level divides by the total token count, which changes with batch composition; sequence-level divides by the batch size, which is fixed. Under token-level averaging, a short sequence's tokens get more or less gradient weight depending on what else is in the batch; under sequence-level averaging they don't.
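
To make the denominator difference concrete, here's a minimal numeric sketch with made-up per-token losses (the numbers are arbitrary, just for illustration):

```python
# Toy batch: two "sequences" of per-token losses (-log P values).
batch = [
    [0.5, 0.7],              # short sequence: 2 tokens
    [0.4, 0.6, 0.8, 1.0],    # long sequence: 4 tokens
]

# Same numerator in both conventions: sum of all per-token losses.
total = sum(loss for seq in batch for loss in seq)
n_tokens = sum(len(seq) for seq in batch)

token_level = total / n_tokens        # divide by total token count (6 here)
sequence_level = total / len(batch)   # divide by batch size (2 here)

# Under token-level averaging each of the short sequence's tokens carries
# weight 1/6; add a longer sequence to the batch and that weight shrinks.
# Under sequence-level averaging the denominator stays fixed at the batch size.
```

Note that `n_tokens` depends on what else landed in the batch, while `len(batch)` doesn't — which is exactly the gradient-weighting difference described above.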

Sampling

Given a distribution over strings, temperature scaling lets us sample from a flatter (or sharper) version of that distribution. In practice, though, temperature is applied to the distribution over next tokens at each step — and this is not equivalent to temperature scaling the distribution over whole strings.

Long Horizon Temperature Scaling (Shih et al., 2023) makes this point explicitly: standard token-level temperature is "myopic," and correcting it requires reasoning about sequence-level likelihood. The paper proposes an approximate method to recover sequence-level temperature scaling from token-level sampling.
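
The non-equivalence is easy to check on a toy model. Below is a hedged sketch with an assumed two-token language over a binary vocabulary (the probabilities are made up): tempering each conditional and multiplying does not give the same distribution as tempering the joint distribution over strings.

```python
from itertools import product

# Assumed toy model: strings of length 2 over vocabulary {0, 1}.
p_first = [0.9, 0.1]               # P(x1)
p_next = [[0.5, 0.5], [0.1, 0.9]]  # P(x2 | x1)
T = 2.0                            # temperature

def temper(dist, T):
    """Raise probabilities to 1/T and renormalize."""
    w = [p ** (1.0 / T) for p in dist]
    z = sum(w)
    return [x / z for x in w]

# Token-level: temper each conditional independently, then multiply.
q1 = temper(p_first, T)
token_level = {
    (a, b): q1[a] * temper(p_next[a], T)[b]
    for a, b in product(range(2), repeat=2)
}

# Sequence-level: temper the joint distribution over whole strings.
joint = {(a, b): p_first[a] * p_next[a][b] for a, b in product(range(2), repeat=2)}
seq_weights = temper(list(joint.values()), T)
seq_level = dict(zip(joint.keys(), seq_weights))

# Both are valid distributions, but they assign different probabilities,
# e.g. to the string (0, 0) — token-level tempering is "myopic".
```

Running this gives ≈0.375 vs ≈0.385 for the string `(0, 0)` — a small gap here, but one that compounds over long horizons, which is the paper's motivation.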

Alignment

The above examples support a token-level perspective on language modeling. But in reinforcement learning, rewards are fundamentally awarded at the sequence level.

Take GRPO as an example. Rewards are sequence-level — e.g., whether the full generation follows a specified regex format. How these rewards are then distributed across tokens as credit assignment is an area of active disagreement (see the formula and brief discussion of this discrepancy in the TRL GRPO documentation).
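
A rough sketch of the GRPO-style setup (simplified, with made-up rewards and lengths; see the TRL docs for the actual formula): each sampled completion gets one sequence-level reward, the rewards are normalized within the group, and the resulting advantage is broadcast uniformly to every token of its completion.

```python
# Assumed group of 4 sampled completions for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]   # sequence-level, e.g. 1 if the output matches the regex
lengths = [3, 5, 2, 4]           # token count of each completion

# Group-normalized advantages (mean-zero, unit-ish variance).
mean = sum(rewards) / len(rewards)
var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
advantages = [(r - mean) / (var ** 0.5 + 1e-8) for r in rewards]

# Credit assignment: every token in completion i gets the same advantage A_i.
per_token = [[a] * n for a, n in zip(advantages, lengths)]
```

How the per-token loss is then averaged over `lengths` (per token? per sequence?) is precisely the disagreement the TRL documentation discusses — the same token-vs-sequence denominator question from the pretraining section, resurfacing in RL.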

Questions

  • Could token-level language modeling be causing problems? (e.g., repetition might stem from the model not being trained to produce coherent sequences as a whole, only to predict the next token.)
  • Does anyone know of work exploring a sequence-level perspective on the pretraining phase? Would you expect it to lead to any difference in the trained base model?
  • What do people feel is the more principled way to model language? Any work or thoughts on unifying the two perspectives?
submitted by /u/36845277