Temporally Extended Mixture-of-Experts Models

arXiv cs.LG · April 23, 2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • Mixture-of-Experts (MoE) models typically switch expert sets at nearly every token to scale capacity, but this can break optimizations like offloading and pre-fetching when the model exceeds GPU memory limits.
  • The paper proposes using the reinforcement-learning “options framework” to create temporally extended MoE layers that decide when to switch experts and which expert sets to load, reducing churn.
  • Building on the option-critic framework, it adds a per-layer controller with deliberation costs, so trainers can explicitly trade off a lower switch rate against model capability.
  • On GPT-OSS-20B augmented with low-rank adapters and trained with a self-distillation reward, the method reduces expert switch rates from over 50% to below 5% while retaining up to 90% of baseline accuracy on MATH, MMLU, and MMMLU.
  • The authors argue the approach can convert existing pre-trained models into temporally extended MoEs using lightweight training, aiming to enable more memory-efficient serving and continual learning for growing MoE systems.
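The controller described above can be pictured as an option-critic-style router: a termination head decides per token whether to keep the currently loaded expert set, and only when it fires does a policy head pick a new set, with a deliberation cost charged on each switch. The following is a minimal, untrained NumPy sketch of that idea; the class and parameter names (`TemporalRouter`, `W_term`, `delib_cost`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TemporalRouter:
    """Illustrative sketch of a temporally extended MoE router.

    A sigmoid termination head decides whether to re-route; a switch
    incurs a deliberation cost, so training pressure (not shown here)
    would push the switch rate down.
    """
    def __init__(self, d_model, n_expert_sets, delib_cost=0.1):
        self.W_term = rng.normal(0, 0.02, d_model)                   # termination head
        self.W_pick = rng.normal(0, 0.02, (n_expert_sets, d_model))  # option policy
        self.delib_cost = delib_cost
        self.current = None  # index of the currently loaded expert set

    def route(self, h):
        """Return (expert_set_index, switch_cost) for hidden state h."""
        p_terminate = 1.0 / (1.0 + np.exp(-self.W_term @ h))
        if self.current is None or p_terminate > 0.5:
            # Terminate the current option: pick and load a new expert set.
            self.current = int(np.argmax(softmax(self.W_pick @ h)))
            return self.current, self.delib_cost
        # Keep the already-loaded experts: no switch, no cost.
        return self.current, 0.0

router = TemporalRouter(d_model=16, n_expert_sets=4)
switches = 0
for _ in range(100):
    h = rng.normal(size=16)
    idx, cost = router.route(h)
    switches += cost > 0
print("switch rate:", switches / 100)
```

With random weights the switch rate hovers near chance; the point of the deliberation cost is that gradient pressure during training drives `p_terminate` toward firing rarely, which is how the paper reports getting switch rates below 5%.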

Abstract

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.