FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

arXiv cs.CL / 4/8/2026


Key Points

  • The paper introduces FrontierFinance, a long-horizon benchmark designed to evaluate LLMs on real-world, professional financial modeling workflows rather than short, synthetic tasks.
  • FrontierFinance covers 25 complex tasks across five core finance models and is intended to better reflect practical expertise, with each task requiring an average of over 18 hours of skilled human labor.
  • The benchmark was developed with financial professionals, includes detailed rubrics for structured evaluation, and uses human experts to define tasks, grade model outputs, and produce human baselines (a rough sketch of rubric-based scoring follows this list).
  • Results indicate that human experts achieve higher average scores and are more likely to produce client-ready outputs than current state-of-the-art systems, highlighting the limits of current systems on real-world tasks.
  • The work targets an accountability gap in LLM deployments by providing a measurable framework for tracking performance in a domain with high exposure to AI-driven labor displacement risk.
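The paper's grading harness is not described in implementation detail here. As a rough illustration only, the Python sketch below shows one common way rubric-based structured evaluation of this kind can be organized: each criterion carries a weight and a grader-assigned score, and the task score is their weighted average. All class names, criteria, weights, and scores are hypothetical, not taken from FrontierFinance.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One criterion in a task rubric (hypothetical structure)."""
    description: str
    weight: float  # relative importance of this criterion
    score: float   # grader-assigned credit in [0, 1]

def rubric_score(items: list[RubricItem]) -> float:
    """Weighted average of per-criterion scores, normalized to [0, 1]."""
    total_weight = sum(item.weight for item in items)
    return sum(item.weight * item.score for item in items) / total_weight

# Hypothetical grader marks for a single financial-modeling task.
task_rubric = [
    RubricItem("Revenue drivers tie to historical filings", weight=3.0, score=0.8),
    RubricItem("Discount-rate derivation is documented", weight=2.0, score=0.5),
    RubricItem("Output is formatted for client delivery", weight=1.0, score=0.0),
]
print(f"Task score: {rubric_score(task_rubric):.2f}")  # 0.57 under these marks
```

The weighting lets a rubric penalize a missed high-stakes criterion (e.g. unsupported model assumptions) more heavily than a cosmetic one, which matters when "client-ready" is the bar being graded against.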

Abstract

As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on the tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet it lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to define the tasks, create rubrics, grade LLM outputs, and perform the tasks themselves as human baselines. We demonstrate that our human experts both receive higher scores on average and are more likely to provide client-ready outputs than current state-of-the-art systems.
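To make the headline comparison concrete, here is a minimal sketch of how the two aggregate metrics reported above (mean rubric score and client-ready rate) could be computed per cohort. The numbers are invented placeholders, not the paper's results, and the function name is hypothetical.

```python
def cohort_summary(scores: list[float], client_ready: list[bool]) -> tuple[float, float]:
    """Mean rubric score and fraction of client-ready outputs for one cohort."""
    return sum(scores) / len(scores), sum(client_ready) / len(client_ready)

# Invented placeholder results over three tasks, NOT the paper's data.
human_mean, human_ready = cohort_summary([0.90, 0.85, 0.70], [True, True, False])
model_mean, model_ready = cohort_summary([0.60, 0.75, 0.40], [False, True, False])

print(f"human experts: mean={human_mean:.2f}, client-ready={human_ready:.0%}")
print(f"SOTA model:    mean={model_mean:.2f}, client-ready={model_ready:.0%}")
```

Reporting both metrics matters: a model can narrow the gap on average rubric score while still rarely clearing the all-or-nothing "client-ready" bar that professional delivery requires.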