[D] Tested model routing on financial AI datasets — good savings and curious what benchmarks others use.

Reddit r/MachineLearning / 4/7/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The author benchmarks prompt-complexity-based model routing on financial HuggingFace datasets, comparing an all-Claude Opus baseline versus two routing strategies (intra-provider and flexible to OSS models).
  • Across FiQA-SA, Financial Headlines, FPB, and ConvFinQA, routing yields roughly 60% blended savings overall, with large reductions in usage cost/latency depending on task.
  • The most notable result is that ConvFinQA still shows major savings because many queries inside long 10-K documents can be answered as simple lookups even within a complex, multi-turn setting.
  • The flexible strategy sends medium-complexity prompts to self-hosted Qwen 3.5 27B or Gemma 3 27B while keeping complex prompts on Opus, improving savings relative to intra-provider routing for most tasks.
  • The study’s limitations include a finance-only scope, issues routing very long-form tasks (ECTSum transcripts were always classified as complex), and evaluation based on limited representative samples rather than fully automated large-scale scoring.

Ran a benchmark evaluating whether prompt-complexity-based routing delivers meaningful savings. Used public HuggingFace datasets. Here's what I found.

Setup

Baseline: Claude Opus for everything. Tested two strategies:

  • Intra-provider — routes within same provider by complexity. Simple → Haiku, Medium → Sonnet, Complex → Opus
  • Flexible — medium prompts go to self-hosted Qwen 3.5 27B / Gemma 3 27B. Complex always stays on Opus
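The two strategies above boil down to a lookup from complexity tier to model. Here's a minimal sketch of that routing step; the model names follow the setup above, but the scorer itself isn't described in the post, so `toy_scorer` is a hypothetical placeholder:

```python
# Routing tables from the setup above. Model identifiers are illustrative,
# not real API model strings.
INTRA_PROVIDER = {
    "simple": "claude-haiku",
    "medium": "claude-sonnet",
    "complex": "claude-opus",
}

FLEXIBLE = {
    "simple": "claude-haiku",
    "medium": "qwen-3.5-27b",   # or gemma-3-27b, self-hosted
    "complex": "claude-opus",   # complex always stays on Opus
}

def route(prompt: str, scorer, table=FLEXIBLE) -> str:
    """Return the model to call for this prompt."""
    return table[scorer(prompt)]

# Hypothetical length-based scorer, purely a placeholder for the real one.
def toy_scorer(prompt: str) -> str:
    n = len(prompt.split())
    if n < 20:
        return "simple"
    if n < 200:
        return "medium"
    return "complex"
```

In practice the scorer is the hard part; the routing itself is just this table lookup, which is why the strategies are cheap to swap.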

Datasets used

All from AdaptLLM/finance-tasks on HuggingFace:

  • FiQA-SA — financial tweet sentiment
  • Financial Headlines — yes/no classification
  • FPB — formal financial news sentiment
  • ConvFinQA — multi-turn Q&A on real 10-K filings

Results

| Task | Intra-provider | Flexible (OSS) |
|---|---|---|
| FiQA Sentiment | -78% | -89% |
| Headlines | -57% | -71% |
| FPB Sentiment | -37% | -45% |
| ConvFinQA | -58% | -40% |

Blended average: ~60% savings.
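One way to arrive at that ~60% figure, assuming equal weighting across the four tasks (the post doesn't state the weighting scheme, so this is an assumption):

```python
# Savings percentages from the results table above.
savings = {
    "FiQA Sentiment": {"intra": 78, "flexible": 89},
    "Headlines":      {"intra": 57, "flexible": 71},
    "FPB Sentiment":  {"intra": 37, "flexible": 45},
    "ConvFinQA":      {"intra": 58, "flexible": 40},
}

# Equal-weight averages per strategy (weighting scheme assumed).
intra_avg = sum(s["intra"] for s in savings.values()) / len(savings)
flex_avg = sum(s["flexible"] for s in savings.values()) / len(savings)
blended = (intra_avg + flex_avg) / 2
```

Under equal weighting this gives roughly 57.5% intra-provider, 61.25% flexible, blending to ~59%, consistent with the "~60%" headline number.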

Most interesting finding

ConvFinQA showed 58% intra-provider savings despite being a complex multi-turn QA dataset. The scorer correctly identified that many questions inside long 10-K documents are simple lookups even when the surrounding document is complex.

"What was operating cash flow in 2014?" → answer is in the table → Haiku

"What is the implied effective tax rate adjustment across three years?" → multi-step reasoning → Opus

Caveats

  • Financial vertical only
  • ECTSum transcripts at ~5K tokens scored complex every time — didn't route. Still tuning for long-form tasks
  • Quality verified on representative samples, not a full automated eval

What datasets do you use for evaluating task-specific LLM routing decisions? I'm specifically looking for benchmarks that span simple classification through complex multi-step reasoning.

submitted by /u/Dramatic_Strain7370