Rethinking LLM Ensembling from the Perspective of Mixture Models

arXiv cs.LG / 5/4/2026


Key Points

  • The paper argues that conventional LLM ensembling is computationally inefficient because it explicitly computes the ensemble distribution, requiring a separate forward pass for each model.
  • It proposes a Mixture-model-like Ensemble (ME) method that treats ensembling as a mixture model and stochastically selects a single model at each token generation step.
  • ME is mathematically equivalent to sampling from the full ensemble distribution while requiring only one model invocation per step, achieving a reported 1.78x–2.68x speedup (see the identity sketched after this list).
  • The work links LLM ensembling to token-level routing methods, suggesting that ensembling can be viewed as a special case of routing-based approaches.
  • The authors release their code publicly and frame the mixture-model perspective as a starting point for exploring more efficient token-level routing strategies for LLMs.

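The equivalence claimed above rests on a standard mixture-model identity. The notation below is introduced here for illustration; the weights $w_k$ are the mixture weights, and uniform weights $w_k = 1/K$ recover the plain average used in conventional ensembling:

$$
p_{\mathrm{ens}}(x_t \mid x_{<t}) \;=\; \sum_{k=1}^{K} w_k \, p_k(x_t \mid x_{<t}), \qquad \sum_{k=1}^{K} w_k = 1 .
$$

Sampling a model index $k \sim \mathrm{Categorical}(w_1, \dots, w_K)$ and then a token $x_t \sim p_k(\cdot \mid x_{<t})$ yields a token whose marginal distribution is exactly $p_{\mathrm{ens}}$, so only the selected model needs to run at each step.
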
Abstract

Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying the conventional ensemble implementation to LLMs, which requires a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture-model-like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model per step, making it 1.78x-2.68x faster than the conventional ensemble. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at https://github.com/jialefu/Mixture-model-like-Ensemble/.
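
As a rough illustration of the idea (not the authors' implementation, which lives at the GitHub link above), the following self-contained Python sketch contrasts the two sampling procedures on toy next-token distributions. The function names `conventional_ensemble_step` and `me_step`, the fixed distributions, and the uniform weights are all hypothetical stand-ins for real model forward passes:

```python
# A minimal sketch contrasting conventional ensembling (average all model
# distributions, then sample) with ME-style sampling (pick one model by its
# mixture weight, then sample from that model alone).
import numpy as np

rng = np.random.default_rng(0)

def conventional_ensemble_step(dists, weights):
    # K "forward passes": every distribution is materialized, then averaged.
    ens = sum(w * p for w, p in zip(weights, dists))
    return rng.choice(len(ens), p=ens)

def me_step(dist_fns, weights):
    # One "forward pass": choose a model index by its mixture weight,
    # then sample the next token from the chosen model only.
    k = rng.choice(len(dist_fns), p=weights)
    p_k = dist_fns[k]()  # only model k is invoked at this step
    return rng.choice(len(p_k), p=p_k)

# Two toy "models" over a 3-token vocabulary, with uniform mixture weights.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
w = np.array([0.5, 0.5])

n = 200_000
conv = np.bincount([conventional_ensemble_step([p1, p2], w) for _ in range(n)],
                   minlength=3) / n
me = np.bincount([me_step([lambda: p1, lambda: p2], w) for _ in range(n)],
                 minlength=3) / n
print(conv, me)  # both approach the mixture 0.5*p1 + 0.5*p2 = [0.4, 0.25, 0.35]
```

Both estimators converge to the same mixture distribution, but `me_step` evaluates only the single selected model per token, which is where the reported speedup comes from.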