Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning

arXiv cs.LG / 4/3/2026


Key Points

  • Apriel-Reasoner is a reasoning model produced by RL post-training with verifiable rewards (RLVR) across multiple domains, targeting general-purpose and efficient reasoning.
  • The work claims a fully reproducible multi-domain training recipe on Apriel-Base (15B parameters), covering mathematics, code generation, instruction following, logical puzzles, and function calling.
  • It introduces adaptive domain sampling to maintain target domain ratios despite differences in rollout length, difficulty, and sample efficiency across domains.
  • A difficulty-aware length-penalty extension is proposed to encourage longer chain-of-thought traces for hard problems and shorter traces for easy ones, without additional training overhead.
  • Experiments report gains over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench, with 30–50% shorter reasoning traces and generalization from a 16K-token training output budget to 32K tokens at inference.
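To make the adaptive domain sampling idea concrete, here is a minimal sketch of one way such a mechanism could work: re-weight per-domain sampling probabilities so the realized mix of consumed samples tracks the target ratios, even when some domains burn through batches faster (e.g. shorter rollouts). The function name, the linear deficit correction, and the `temperature` knob are illustrative assumptions, not the paper's exact recipe.

```python
def adaptive_domain_probs(target_ratios, consumed_counts, temperature=1.0):
    """Hypothetical sketch of adaptive domain sampling.

    target_ratios:   desired fraction of training samples per domain.
    consumed_counts: samples actually consumed per domain so far.
    Boosts under-represented domains and damps over-represented ones,
    then renormalizes to a valid probability distribution.
    """
    total = sum(consumed_counts.values()) or 1
    raw = {}
    for domain, target in target_ratios.items():
        realized = consumed_counts.get(domain, 0) / total
        deficit = target - realized          # positive if under-sampled
        raw[domain] = max(target + temperature * deficit, 1e-6)
    z = sum(raw.values())
    return {domain: w / z for domain, w in raw.items()}

# Example: "code" has been over-sampled, "fc" under-sampled,
# so their next-step probabilities shift accordingly.
targets = {"math": 0.4, "code": 0.3, "puzzles": 0.1, "if": 0.1, "fc": 0.1}
consumed = {"math": 120, "code": 200, "puzzles": 30, "if": 40, "fc": 10}
probs = adaptive_domain_probs(targets, consumed)
```

Under this sketch, the over-sampled domain's probability drops below its target and the under-sampled domain's rises above it, pulling the cumulative mix back toward the target ratios over subsequent batches.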

Abstract

Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
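The difficulty-aware length penalty described above can be illustrated with a small sketch. Here difficulty is proxied by a group pass rate estimated from rollouts (a common RLVR signal); easy problems (high pass rate) incur a stronger penalty for long traces, while hard problems are penalized less, leaving room for longer chains of thought. The function name, the linear form, and `base_coef` are assumptions for illustration, not the paper's exact formula.

```python
def length_penalty(trace_len, max_len, pass_rate, base_coef=0.5):
    """Hypothetical difficulty-aware length penalty.

    trace_len: tokens in the generated reasoning trace.
    max_len:   output token budget (e.g. 16K during training).
    pass_rate: fraction of group rollouts that solved the problem,
               in [0, 1]; used as an easiness proxy.
    Returns a non-positive reward term: longer traces on easier
    problems are penalized more heavily.
    """
    norm_len = min(trace_len / max_len, 1.0)   # normalized length in [0, 1]
    return -base_coef * pass_rate * norm_len   # easy + long => most negative

# A budget-filling trace on an easy problem (pass_rate=1.0) is penalized
# ten times more than the same-length trace on a hard one (pass_rate=0.1).
easy = length_penalty(16_000, 16_000, pass_rate=1.0)
hard = length_penalty(16_000, 16_000, pass_rate=0.1)
```

This shaping term would simply be added to the verifiable reward during RL, so it adds no extra training overhead beyond statistics already computed from the rollout group.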