
[D] Is language modeling fundamentally token-level or sequence-level?

Reddit r/MachineLearning / 3/19/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The post argues there is evidence for token-level pretraining and sequence-level alignment and sampling, suggesting two different perspectives on language modeling.
  • It explains the technical distinction where token-level cross-entropy averages by token count while sequence-level would average by batch size, affecting gradient weighting during training.
  • It cites Long Horizon Temperature Scaling (Shih et al., 2023) to show that token-level temperature scaling is myopic, and that correctly scaling the distribution over full sequences requires sequence-level reasoning, linking sampling to sequence-level likelihood.
  • It notes that in reinforcement learning, rewards are sequence-level, raising questions about credit assignment across tokens (e.g., GRPO and its discussion in TRL documentation).
  • It lists open questions about potential repetition issues from token-level training, whether there are sequence-level pretraining approaches, and the search for a unified, principled framework.

Is language modeling fundamentally token-level or sequence-level?

There is evidence for both: pretraining and sampling lean towards a token-level view, while alignment is fundamentally sequence-level. Curious if there is any work trying to unify the two perspectives, and which is the more principled framing.

Pretraining

Textbook language modeling defines the task as learning a distribution over strings, but every cross-entropy loss implementation I've seen operates at the token level. The difference is subtle but real: both compute the sum of -log P(next token | previous tokens) over all tokens in the batch — same numerator, different denominator. Token-level divides by the total token count, which changes with batch composition; sequence-level divides by the batch size, which is fixed. Under token-level averaging, a short sequence's tokens get more or less gradient weight depending on what else is in the batch; under sequence-level averaging they don't.
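
To make the denominator difference concrete, here's a minimal numeric sketch with made-up per-token losses (the numbers are arbitrary, just for illustration):

```python
# Toy batch: two "sequences" of per-token losses (-log P values).
batch = [
    [0.5, 0.7],              # short sequence: 2 tokens
    [0.4, 0.6, 0.8, 1.0],    # long sequence: 4 tokens
]

# Same numerator in both conventions: sum of all per-token losses.
total = sum(loss for seq in batch for loss in seq)
n_tokens = sum(len(seq) for seq in batch)

token_level = total / n_tokens        # divide by total token count (6 here)
sequence_level = total / len(batch)   # divide by batch size (2 here)

# Under token-level averaging each of the short sequence's tokens carries
# weight 1/6; add a longer sequence to the batch and that weight shrinks.
# Under sequence-level averaging the denominator stays fixed at the batch size.
```

Note that `n_tokens` depends on what else landed in the batch, while `len(batch)` doesn't — which is exactly the gradient-weighting difference described above.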

Sampling

Given a distribution over strings, temperature scaling lets us sample from a flatter (or sharper) version of that distribution. In practice, though, temperature is applied to the distribution over next tokens at each step — and this is not equivalent to temperature scaling the distribution over whole strings.

Long Horizon Temperature Scaling (Shih et al., 2023) makes this point explicitly: standard token-level temperature is "myopic," and correcting it requires reasoning about sequence-level likelihood. The paper proposes an approximate method to recover sequence-level temperature scaling from token-level sampling.
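
The non-equivalence is easy to check on a toy model. Below is a hedged sketch with an assumed two-token language over a binary vocabulary (the probabilities are made up): tempering each conditional and multiplying does not give the same distribution as tempering the joint distribution over strings.

```python
from itertools import product

# Assumed toy model: strings of length 2 over vocabulary {0, 1}.
p_first = [0.9, 0.1]               # P(x1)
p_next = [[0.5, 0.5], [0.1, 0.9]]  # P(x2 | x1)
T = 2.0                            # temperature

def temper(dist, T):
    """Raise probabilities to 1/T and renormalize."""
    w = [p ** (1.0 / T) for p in dist]
    z = sum(w)
    return [x / z for x in w]

# Token-level: temper each conditional independently, then multiply.
q1 = temper(p_first, T)
token_level = {
    (a, b): q1[a] * temper(p_next[a], T)[b]
    for a, b in product(range(2), repeat=2)
}

# Sequence-level: temper the joint distribution over whole strings.
joint = {(a, b): p_first[a] * p_next[a][b] for a, b in product(range(2), repeat=2)}
seq_weights = temper(list(joint.values()), T)
seq_level = dict(zip(joint.keys(), seq_weights))

# Both are valid distributions, but they assign different probabilities,
# e.g. to the string (0, 0) — token-level tempering is "myopic".
```

Running this gives ≈0.375 vs ≈0.385 for the string `(0, 0)` — a small gap here, but one that compounds over long horizons, which is the paper's motivation.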

Alignment

The above examples support a token-level perspective on language modeling. But in reinforcement learning, rewards are fundamentally awarded at the sequence level.

Take GRPO as an example. Rewards are sequence-level — e.g., whether the full generation follows a specified regex format. How these rewards are then distributed across tokens as credit assignment is an area of active disagreement (see the formula and brief discussion of this discrepancy in the TRL GRPO documentation).
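
A rough sketch of the GRPO-style setup (simplified, with made-up rewards and lengths; see the TRL docs for the actual formula): each sampled completion gets one sequence-level reward, the rewards are normalized within the group, and the resulting advantage is broadcast uniformly to every token of its completion.

```python
# Assumed group of 4 sampled completions for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]   # sequence-level, e.g. 1 if the output matches the regex
lengths = [3, 5, 2, 4]           # token count of each completion

# Group-normalized advantages (mean-zero, unit-ish variance).
mean = sum(rewards) / len(rewards)
var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
advantages = [(r - mean) / (var ** 0.5 + 1e-8) for r in rewards]

# Credit assignment: every token in completion i gets the same advantage A_i.
per_token = [[a] * n for a, n in zip(advantages, lengths)]
```

How the per-token loss is then averaged over `lengths` (per token? per sequence?) is precisely the disagreement the TRL documentation discusses — the same token-vs-sequence denominator question from the pretraining section, resurfacing in RL.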

Questions

  • Could token-level language modeling be causing problems? (e.g., repetition might stem from the model not being trained to produce coherent sequences as a whole, only to predict the next token.)
  • Does anyone know of work exploring a sequence-level perspective on the pretraining phase? Would you expect it to lead to any difference in the trained base model?
  • What do people feel is the more principled way to model language? Any work or thoughts on unifying the two perspectives?
submitted by /u/36845277