Olmo Hybrid: From Theory to Practice and Back

arXiv cs.LG / 4/7/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Non-transformer architectures (linear RNNs, and hybrids that combine recurrence with attention) have shown promise, but it was unclear whether their potential benefits justify the risk and cost of scaling them up; the paper examines this question from both theory and experiment.
  • Theoretically, it shows that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution.
  • Putting this into practice, the authors train Olmo Hybrid, a 7B model largely comparable to Olmo 3 7B in which the sliding-window attention layers are replaced with (recurrent) Gated DeltaNet layers, and report that it outperforms Olmo 3 on standard pretraining and mid-training evaluations.
  • They identify the hybrid's higher scaling efficiency relative to the transformer as the main driver of the performance gap, and return to theory to explain why greater expressivity on specific formal problems should carry over to downstream tasks.
  • They conclude that hybrids (attention + recurrence) should be viewed not merely as a way to reduce inference memory, but as a fundamental extension of language modeling that improves scaling during pretraining.
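To make the recurrent half of the hybrid concrete, the sketch below implements one step of a gated delta-rule update in the style of Gated DeltaNet: the matrix state is decayed by a gate `alpha`, corrected with a rank-1 delta-rule write of strength `beta`, and read out with the query. This is a minimal single-head, unbatched illustration of the published update rule, not the Olmo Hybrid implementation; all variable names are mine.

```python
import numpy as np

def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One recurrence step of a gated delta rule (illustrative sketch).

    S     : (d_v, d_k) matrix-valued state
    q, k  : (d_k,) query and key vectors
    v     : (d_v,) value vector
    alpha : scalar gate in [0, 1], decays the old state
    beta  : scalar write strength in [0, 1]
    """
    # S_t = alpha * S_{t-1} (I - beta * k k^T) + beta * v k^T
    # i.e. erase the old value stored under key k, then write v there.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    # Read out by querying the state: o_t = S_t q_t
    o = S @ q
    return S, o
```

With `alpha = beta = 1` and orthonormal keys, this behaves like an exact key-value store: writing under distinct keys and querying an earlier key returns its value unchanged, which is the associative-recall behavior the delta rule is designed for.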

Abstract

Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.