OLLM: Options-based Large Language Models

arXiv cs.AI / 4/22/2026


Key Points

  • The paper introduces Options LLM (OLLM), which replaces standard single next-token prediction with a learned set of “options” selected via a discrete latent variable.
  • OLLM is designed as a lightweight plug-in architecture (adding an encoder and decoder before the output head) that can convert many pretrained LLMs with minimal extra trainable parameters.
  • Experiments on a 1.7B backbone trained with only 1.56% trainable parameters show that OLLM can reach about 70% final answer correctness under optimal latent selection, outperforming LoRA baselines that peak around 51%.
  • The approach also trains a compact policy in the latent option space to control generation, improving the sample efficiency of reward optimization and reducing misalignments through structural constraints rather than extra KL or handcrafted alignment losses.
  • The authors conclude that optionized next-token modeling improves controllability, robustness, and efficiency for mathematical reasoning, and position latent-space policy learning as a promising RL direction for LLMs.
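
To make the plug-in idea concrete, here is a minimal NumPy sketch of an option-indexed output head in the spirit the key points describe: a small encoder/decoder bottleneck, one per discrete option, inserted before a frozen output head, so that each latent choice yields a different next-token distribution. All names, shapes, and the residual-modulation design here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, V = 16, 4, 32  # hidden size, number of options, vocab size (toy values)

# Frozen backbone output head, as in a pretrained LM.
W_out = rng.normal(size=(H, V)) * 0.1

# Hypothetical plug-in: a per-option encoder/decoder bottleneck inserted
# before the output head; only these small matrices would be trainable.
W_enc = rng.normal(size=(K, H, H // 2)) * 0.1  # encoder, one per option
W_dec = rng.normal(size=(K, H // 2, H)) * 0.1  # decoder, one per option

def option_logits(h, z):
    """Next-token logits for hidden state h under discrete latent option z."""
    bottleneck = np.tanh(h @ W_enc[z])
    h_opt = h + bottleneck @ W_dec[z]  # residual modulation of the hidden state
    return h_opt @ W_out

h = rng.normal(size=H)
dists = [option_logits(h, z) for z in range(K)]

# Each latent option induces its own next-token distribution; a downstream
# policy can select or search over these K candidates.
top_tokens = [int(np.argmax(d)) for d in dists]
print(top_tokens)
```

Because the backbone and output head stay frozen, the trainable-parameter count scales only with the bottleneck matrices, which is consistent with the paper's reported 1.56% trainable-parameter budget.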

Abstract

We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a *set of learned options* for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options that can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight "plug-in" that inserts two layers, an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only 1.56% of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at 51% final answer correctness, while OLLM's option set allows up to ~70% under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.
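
The abstract's claim that a low-dimensional option space makes reward optimization sample-efficient can be illustrated with a toy policy-gradient loop: a softmax policy over only K discrete options is optimized directly against a reward. The reward vector, learning rate, and the use of the exact (expected) policy gradient instead of sampled REINFORCE are all simplifying assumptions for this sketch, not details from the paper.

```python
import numpy as np

K = 4                  # size of the latent option space (toy)
theta = np.zeros(K)    # policy logits over options

# Hypothetical per-option task reward; option 2 is best.
true_reward = np.array([0.1, 0.3, 0.9, 0.2])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.5
for _ in range(200):
    p = softmax(theta)
    baseline = p @ true_reward
    # Exact policy gradient of expected reward under a softmax policy:
    # dJ/dtheta_k = p_k * (r_k - E_p[r]).  Only K numbers to optimize.
    theta += lr * p * (true_reward - baseline)

print(int(np.argmax(theta)))  # prints 2: the policy concentrates on the best option
```

Because the search space is K-dimensional rather than vocabulary- or parameter-sized, far fewer reward evaluations are needed, and the policy can only ever pick among options already learned during SFT, which is the structural-constraint argument the abstract makes.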