TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis
arXiv cs.AI / 3/27/2026
Key Points
- The paper introduces TRAJEVAL, a diagnostic framework for code agents that breaks execution trajectories into three interpretable stages: search (file localization), read (function comprehension), and edit (modification targeting).
- Instead of relying on coarse metrics like Pass@1, TRAJEVAL computes precision and recall per stage by comparing agent trajectories to reference patches, enabling pinpointed “where and why” failure analysis.
- Analyzing 16,758 trajectories across three agent architectures and seven models, the authors find shared inefficiencies (agents examine ~22x more functions than necessary) alongside model-specific failure modes (e.g., GPT-5 tends to mis-target edits, while Qwen-32B more often fails at file discovery).
- The framework’s diagnostics are predictive of overall performance (model-level Pass@1 predicted within 0.87–2.1% MAE) and actionable: real-time feedback from trajectory signals improves two SOTA models by 2.2–4.6 points while cutting costs by 20–31%.
- Overall, TRAJEVAL shifts code-agent evaluation from outcome-based benchmarking toward mechanism-driven diagnosis that can directly drive performance improvements.
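The per-stage comparison described above can be sketched as a simple set overlap between what the agent touched at a stage and what the reference patch implicates. This is an illustrative reconstruction, not the paper's actual API; the function name and item identifiers are assumptions.

```python
# Hypothetical sketch: per-stage precision/recall for one trajectory stage
# (search, read, or edit), comparing the items an agent touched against
# those implicated by the reference patch. Names are illustrative only.

def stage_precision_recall(agent_items, reference_items):
    """Return (precision, recall) of a trajectory stage vs. the reference patch."""
    agent, ref = set(agent_items), set(reference_items)
    hits = len(agent & ref)  # items the agent touched that the patch implicates
    precision = hits / len(agent) if agent else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

# Example: the agent reads six functions; the reference patch touches two,
# only one of which the agent actually read.
p, r = stage_precision_recall(
    agent_items=["f_a", "f_b", "f_c", "f_d", "f_e", "f_f"],
    reference_items=["f_b", "f_x"],
)
# Low precision at the read stage would reflect the over-examination
# inefficiency the paper reports; low recall flags missed targets.
```

Computing this triple per stage is what lets the framework localize a failure to search, read, or edit rather than reporting only a pass/fail outcome.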