Expected Reward Prediction, with Applications to Model Routing
arXiv cs.CL / 2026/3/24
Key Points
- The paper studies how response-level reward models can be lifted to predict an LLM's expected reward for a prompt before any responses are generated, enabling pre-generation routing decisions.
- It shows that expected reward prediction (ERP) can be both precise and discriminative, supporting an inference-time model routing protocol that optimizes reward while controlling compute costs.
- The proposed ERP-based routing is evaluated on the open-perfectblend dataset using a pool of Llama 3.1 Instruct and Gemma Instruct models, where it outperforms simpler baselines that choose the best average-performing model per prompt category.
- The approach is also presented as an explanation for why more complex routing methods work (they effectively estimate expected reward), and it is described as easy to extend when new models are added to the routing pool.
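The routing idea above can be sketched as a small selection rule: given a per-model expected-reward predictor and a compute cost per model, pick the model that maximizes predicted reward minus a cost penalty. This is a minimal illustrative sketch, not the paper's exact formulation; the function names, the stubbed predictors, the cost values, and the linear cost penalty `lam` are all assumptions.

```python
from typing import Callable, Dict

def route(prompt: str,
          erp: Dict[str, Callable[[str], float]],
          cost: Dict[str, float],
          lam: float = 0.1) -> str:
    """Pick the model whose predicted expected reward for this prompt,
    penalized by a compute-cost term, is highest.

    Hypothetical interface: `erp` maps model name -> predictor that
    returns the predicted expected reward for a prompt; `cost` maps
    model name -> relative compute cost; `lam` trades reward for cost.
    """
    return max(erp, key=lambda m: erp[m](prompt) - lam * cost[m])

# Toy usage with stubbed constant predictors (illustrative only).
predictors = {
    "llama-3.1-8b-instruct": lambda p: 0.62,
    "llama-3.1-70b-instruct": lambda p: 0.78,
    "gemma-2-9b-instruct": lambda p: 0.60,
}
costs = {
    "llama-3.1-8b-instruct": 1.0,
    "llama-3.1-70b-instruct": 8.0,
    "gemma-2-9b-instruct": 1.2,
}

choice = route("Explain beam search.", predictors, costs, lam=0.05)
```

Adding a new model to the pool only requires registering its predictor and cost, which matches the extensibility claim above.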