Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning

arXiv cs.AI / 4/28/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces “Tandem,” a collaborative framework that combines large and small language models to perform reasoning-intensive inference more efficiently.
  • In Tandem, an LLM first generates a compact set of critical reasoning insights, which then guides an SLM to carry out the full reasoning and produce the final answer.
  • To trade off efficiency and reliability, Tandem uses a cost-aware termination mechanism that adaptively stops the LLM early once enough guidance has been accumulated.
  • Experiments on mathematical reasoning and code generation benchmarks show about a 40% reduction in computational cost versus standalone LLM reasoning while maintaining superior or competitive accuracy.
  • A “sufficiency classifier” trained on one domain reportedly transfers to other domains effectively without retraining, and the implementation is released on GitHub.
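The collaboration loop described above can be sketched roughly as follows. This is a hypothetical illustration of the control flow, not the released implementation: the function names (`llm_generate_insight`, `is_sufficient`, `slm_reason`) and the insight cap are illustrative stand-ins for the paper's components.

```python
# Hypothetical sketch of the Tandem pipeline: the LLM drafts compact
# insights, a sufficiency check stops it early, and the SLM carries
# out the full reasoning under that guidance.

def tandem_inference(question, llm_generate_insight, is_sufficient,
                     slm_reason, max_insights=8):
    """Run LLM-guided SLM reasoning for one question."""
    insights = []
    for _ in range(max_insights):
        # LLM acts as a strategic coordinator, emitting one compact
        # reasoning insight at a time, conditioned on prior insights.
        insights.append(llm_generate_insight(question, insights))
        # Cost-aware termination: stop the expensive LLM as soon as
        # the accumulated guidance is judged sufficient.
        if is_sufficient(question, insights):
            break
    # The cheaper SLM executes the full reasoning and produces
    # the final answer.
    return slm_reason(question, insights)
```

The key design point is that the LLM never generates the full reasoning trace; it only produces guidance, which is where the reported ~40% cost reduction comes from.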

Abstract

Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost-aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM's generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_Tandem.
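One way to read the cost-aware termination mechanism is as a stopping rule that weighs the sufficiency classifier's confidence against the LLM generation cost already incurred. The sketch below is an assumption about how such a rule could look; the function name, the linear cost penalty, and the default constants are illustrative, not the paper's exact formulation.

```python
# Hypothetical cost-aware stopping rule: continue the LLM only while
# the sufficiency probability is below a threshold that decreases as
# generation cost accumulates, so longer (costlier) runs stop sooner.

def should_stop(sufficiency_prob, tokens_spent,
                base_threshold=0.9, cost_weight=1e-4):
    """Return True once accumulated guidance is deemed sufficient,
    with the bar lowered in proportion to tokens already spent."""
    threshold = max(0.0, base_threshold - cost_weight * tokens_spent)
    return sufficiency_prob >= threshold
```

Under this reading, the claimed cross-domain transfer of the sufficiency classifier means only `sufficiency_prob` comes from a learned model, while the cost trade-off itself stays a simple, domain-agnostic rule.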