OLLM: Options-based Large Language Models

arXiv cs.AI / 4/22/2026


Key Points

  • The paper introduces Options LLM (OLLM), which replaces standard single next-token prediction with a learned set of “options” selected via a discrete latent variable.
  • OLLM is designed as a lightweight plug-in architecture (adding an encoder and decoder before the output head) that can convert many pretrained LLMs with minimal extra trainable parameters.
  • Experiments on a 1.7B backbone trained with only 1.56% trainable parameters show that OLLM can reach about 70% final answer correctness under optimal latent selection, outperforming LoRA baselines that peak around 51%.
  • The approach also trains a compact policy in the latent option space to control generation, improving the sample efficiency of reward optimization and reducing misalignments through structural constraints rather than extra KL or handcrafted alignment losses.
  • The authors conclude that optionized next-token modeling improves controllability, robustness, and efficiency for mathematical reasoning, and position latent-space policy learning as a promising RL direction for LLMs.
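
To make the plug-in idea concrete, here is a minimal NumPy sketch of an option-indexed output head in the spirit the key points describe: a small encoder/decoder bottleneck, one per discrete option, inserted before a frozen output head, so that each latent choice yields a different next-token distribution. All names, shapes, and the residual-modulation design here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, V = 16, 4, 32  # hidden size, number of options, vocab size (toy values)

# Frozen backbone output head, as in a pretrained LM.
W_out = rng.normal(size=(H, V)) * 0.1

# Hypothetical plug-in: a per-option encoder/decoder bottleneck inserted
# before the output head; only these small matrices would be trainable.
W_enc = rng.normal(size=(K, H, H // 2)) * 0.1  # encoder, one per option
W_dec = rng.normal(size=(K, H // 2, H)) * 0.1  # decoder, one per option

def option_logits(h, z):
    """Next-token logits for hidden state h under discrete latent option z."""
    bottleneck = np.tanh(h @ W_enc[z])
    h_opt = h + bottleneck @ W_dec[z]  # residual modulation of the hidden state
    return h_opt @ W_out

h = rng.normal(size=H)
dists = [option_logits(h, z) for z in range(K)]

# Each latent option induces its own next-token distribution; a downstream
# policy can select or search over these K candidates.
top_tokens = [int(np.argmax(d)) for d in dists]
print(top_tokens)
```

Because the backbone and output head stay frozen, the trainable-parameter count scales only with the bottleneck matrices, which is consistent with the paper's reported 1.56% trainable-parameter budget.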

Abstract

We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a *set of learned options* for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options that can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight "plug-in" that inserts two layers, an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only 1.56% of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at 51% final answer correctness, while OLLM's option set allows up to ~70% under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.
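
The abstract's claim that a low-dimensional option space makes reward optimization sample-efficient can be illustrated with a toy policy-gradient loop: a softmax policy over only K discrete options is optimized directly against a reward. The reward vector, learning rate, and the use of the exact (expected) policy gradient instead of sampled REINFORCE are all simplifying assumptions for this sketch, not details from the paper.

```python
import numpy as np

K = 4                  # size of the latent option space (toy)
theta = np.zeros(K)    # policy logits over options

# Hypothetical per-option task reward; option 2 is best.
true_reward = np.array([0.1, 0.3, 0.9, 0.2])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.5
for _ in range(200):
    p = softmax(theta)
    baseline = p @ true_reward
    # Exact policy gradient of expected reward under a softmax policy:
    # dJ/dtheta_k = p_k * (r_k - E_p[r]).  Only K numbers to optimize.
    theta += lr * p * (true_reward - baseline)

print(int(np.argmax(theta)))  # prints 2: the policy concentrates on the best option
```

Because the search space is K-dimensional rather than vocabulary- or parameter-sized, far fewer reward evaluations are needed, and the policy can only ever pick among options already learned during SFT, which is the structural-constraint argument the abstract makes.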