JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
arXiv cs.CL / 4/6/2026
Key Points
- The paper introduces JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model aimed at improving the performance–token-efficiency trade-off for sub-50B parameter settings.
- JoyAI-LLM Flash is pretrained on 20T tokens and then post-trained using SFT, DPO, and large-scale reinforcement learning across diverse environments.
- To boost token efficiency, the model balances “thinking” and “non-thinking” cognitive modes, and the paper proposes FiberPO, an RL algorithm that decomposes trust-region maintenance into global and local components for unified multi-scale stability control.
- Architecturally, it uses 48B total parameters while activating only 2.7B per forward pass, targeting a much higher sparsity ratio than similarly sized industry-leading models.
- For faster inference, it applies joint training–inference co-design with dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT), and releases base and post-trained checkpoints on Hugging Face.
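The sparsity figures above (48B total, 2.7B active, roughly a 5.6% activation ratio) come from standard top-k MoE routing: a gate scores all experts per token, and only the k highest-scoring experts run a forward pass. The paper does not publish its router internals, so the sketch below is a generic illustration, not JoyAI-LLM Flash's actual routing code; the expert functions and gate logits are hypothetical stand-ins.

```python
import math

def softmax(xs):
    # Numerically stable softmax over router logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_logits, k=2):
    """Route input x to the top-k experts by gate probability and
    mix their outputs with renormalized gate weights. Only k of the
    len(experts) experts execute, which is the source of the
    activated-parameter savings (here 2.7B of 48B, ~5.6%)."""
    probs = softmax(gate_logits)
    topk = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in topk)  # renormalize over the selected experts
    return sum(probs[i] / norm * experts[i](x) for i in topk)

# Toy experts: scalar functions standing in for per-expert FFN blocks.
experts = [lambda x, a=a: a * x for a in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, 0.3, 1.5]  # hypothetical router logits for one token
y = moe_forward(1.0, experts, gate_logits, k=2)
```

With these logits the router selects experts 1 and 3, so the output is a convex mix of their responses; the other two experts never run, mirroring how only a small fraction of total parameters is activated per forward pass.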