[D] Tested model routing on financial AI datasets — good savings and curious what benchmarks others use.

Reddit r/MachineLearning / 4/7/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The author benchmarks prompt-complexity-based model routing on financial HuggingFace datasets, comparing an all-Claude Opus baseline versus two routing strategies (intra-provider and flexible to OSS models).
  • Across FiQA-SA, Financial Headlines, FPB, and ConvFinQA, routing yields roughly 60% blended savings overall, with large reductions in usage cost/latency depending on task.
  • The most notable result is that ConvFinQA still shows major savings because many queries inside long 10-K documents can be answered as simple lookups even within a complex, multi-turn setting.
  • The flexible strategy sends medium-complexity prompts to self-hosted Qwen 3.5 27B or Gemma 3 27B while keeping complex prompts on Opus, improving savings relative to intra-provider routing for most tasks.
  • The study’s limitations include a finance-only scope, issues routing very long-form tasks (ECTSum transcripts were always classified as complex), and evaluation based on limited representative samples rather than fully automated large-scale scoring.

Ran a benchmark evaluating whether prompt-complexity-based routing delivers meaningful savings. Used public HuggingFace datasets. Here's what I found.

Setup

Baseline: Claude Opus for everything. Tested two strategies:

  • Intra-provider — routes within same provider by complexity. Simple → Haiku, Medium → Sonnet, Complex → Opus
  • Flexible — medium prompts go to self-hosted Qwen 3.5 27B / Gemma 3 27B. Complex always stays on Opus
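The two strategies above boil down to a lookup from complexity tier to model. Here's a minimal sketch of that routing step; the model names follow the setup above, but the scorer itself isn't described in the post, so `toy_scorer` is a hypothetical placeholder:

```python
# Routing tables from the setup above. Model identifiers are illustrative,
# not real API model strings.
INTRA_PROVIDER = {
    "simple": "claude-haiku",
    "medium": "claude-sonnet",
    "complex": "claude-opus",
}

FLEXIBLE = {
    "simple": "claude-haiku",
    "medium": "qwen-3.5-27b",   # or gemma-3-27b, self-hosted
    "complex": "claude-opus",   # complex always stays on Opus
}

def route(prompt: str, scorer, table=FLEXIBLE) -> str:
    """Return the model to call for this prompt."""
    return table[scorer(prompt)]

# Hypothetical length-based scorer, purely a placeholder for the real one.
def toy_scorer(prompt: str) -> str:
    n = len(prompt.split())
    if n < 20:
        return "simple"
    if n < 200:
        return "medium"
    return "complex"
```

In practice the scorer is the hard part; the routing itself is just this table lookup, which is why the strategies are cheap to swap.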

Datasets used

All from AdaptLLM/finance-tasks on HuggingFace:

  • FiQA-SA — financial tweet sentiment
  • Financial Headlines — yes/no classification
  • FPB — formal financial news sentiment
  • ConvFinQA — multi-turn Q&A on real 10-K filings

Results

| Task | Intra-provider | Flexible (OSS) |
|---|---|---|
| FiQA Sentiment | -78% | -89% |
| Headlines | -57% | -71% |
| FPB Sentiment | -37% | -45% |
| ConvFinQA | -58% | -40% |

Blended average: ~60% savings.
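One way to arrive at that ~60% figure, assuming equal weighting across the four tasks (the post doesn't state the weighting scheme, so this is an assumption):

```python
# Savings percentages from the results table above.
savings = {
    "FiQA Sentiment": {"intra": 78, "flexible": 89},
    "Headlines":      {"intra": 57, "flexible": 71},
    "FPB Sentiment":  {"intra": 37, "flexible": 45},
    "ConvFinQA":      {"intra": 58, "flexible": 40},
}

# Equal-weight averages per strategy (weighting scheme assumed).
intra_avg = sum(s["intra"] for s in savings.values()) / len(savings)
flex_avg = sum(s["flexible"] for s in savings.values()) / len(savings)
blended = (intra_avg + flex_avg) / 2
```

Under equal weighting this gives roughly 57.5% intra-provider, 61.25% flexible, blending to ~59%, consistent with the "~60%" headline number.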

Most interesting finding

ConvFinQA showed 58% intra-provider savings despite being a complex multi-turn QA dataset. The scorer correctly identified that many questions inside long 10-K documents are simple lookups even when the surrounding document is complex.

"What was operating cash flow in 2014?" → answer is in the table → Haiku

"What is the implied effective tax rate adjustment across three years?" → multi-step reasoning → Opus

Caveats

  • Financial vertical only
  • ECTSum transcripts at ~5K tokens scored complex every time — didn't route. Still tuning for long-form tasks
  • Quality verified on representative samples, not a full automated eval

What datasets do you use for evaluating task-specific LLM routing decisions? I'm specifically looking for benchmarks that span simple classification through complex multi-step reasoning.

submitted by /u/Dramatic_Strain7370