ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

arXiv cs.AI / 4/28/2026


Key Points

  • Skill-distillation methods for LLM agents can’t reliably decide whether to add or remove steps because they typically lack per-step cost signals tied to agent trajectories.
  • The paper introduces ClawTrace, a tracing platform that records each LLM call, tool use, and sub-agent spawn, compiling them into a compact TraceCard (YAML) with per-step USD cost, token counts, and redundancy flags (an illustrative sketch of such a card follows this list).
  • It also presents CostCraft, a distillation pipeline that reads TraceCards and generates preserve, prune, and repair skill patches, backing prune patches with counterfactual arguments and repair patches with oracle-grounded failure evidence.
  • Experiments on 30 SpreadsheetBench tasks show that per-step cost attribution and prune patches each independently reduce quality regressions.
  • Cross-benchmark tests on 30 unrelated SkillsBench tasks reveal that prune rules transfer well (cutting median cost by 32%), while preserve rules can cause regressions due to benchmark-specific conventions.
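
The article names the TraceCard's contents (per-step USD cost, token counts, redundancy flags over LLM calls, tool uses, and sub-agent spawns) but not its exact schema. The sketch below shows one plausible shape for such a card, built and emitted as YAML from Python; every field name and value here is an illustrative assumption, not the released ClawTrace format.

```python
# Hypothetical sketch of a TraceCard-like summary. Field names and values are
# assumptions based only on the description above, not ClawTrace's schema.
import yaml  # PyYAML

trace_card = {
    "session_id": "spreadsheetbench-task-017",  # illustrative identifier
    "outcome": "success",
    "total_cost_usd": 0.214,
    "steps": [
        {
            "kind": "llm_call",
            "label": "plan_edit",
            "tokens": {"prompt": 1830, "completion": 412},
            "cost_usd": 0.031,
            "redundant": False,
        },
        {
            "kind": "tool_use",
            "label": "read_sheet",
            "cost_usd": 0.0,
            "redundant": False,
        },
        {
            "kind": "sub_agent",
            "label": "formula_checker",
            "tokens": {"prompt": 6250, "completion": 980},
            "cost_usd": 0.112,
            "redundant": True,  # flagged: step did not affect the final outcome
        },
    ],
}

# Emit the card as YAML, the format the paper describes for TraceCards.
print(yaml.safe_dump(trace_card, sort_keys=False))
```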

Abstract

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We introduce ClawTrace, an agent tracing platform that records every LLM call, tool use, and sub-agent spawn during an agent session and compiles each session into a TraceCard: a compact YAML summary with per-step USD cost, token counts, and redundancy flags. Built on ClawTrace, CostCraft is a distillation pipeline that reads TraceCards and produces three types of skill patches. Preserve patches keep behaviors that led to success. Prune patches remove expensive steps that did not matter, each backed by a counterfactual argument against a named high-cost step. Repair patches fix failures grounded in oracle evidence. Ablations on 30 held-out SpreadsheetBench tasks show that both cost attribution and prune patches independently reduce quality regressions. When the same skill is applied to 30 unrelated SkillsBench tasks, an unexpected asymmetry emerges: prune rules transferred across benchmarks and cut median cost by 32%, while preserve rules, trained on benchmark-specific conventions, caused regressions on new task types. We release ClawTrace and TraceCards as open infrastructure for cost-aware agent research.
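
The abstract names the three patch types but does not spell out their mechanics. As a rough illustration of why per-step cost matters for pruning, the sketch below selects prune candidates from a TraceCard-like dictionary by pairing each redundancy flag with its per-step cost; the function, threshold, and rationale strings are assumptions for illustration only, not CostCraft's actual logic.

```python
# Illustrative only: a naive prune-candidate selector over a TraceCard-like dict.
# CostCraft's real prune patches are backed by counterfactual arguments; this
# sketch just shows how per-step cost plus a redundancy flag makes "remove an
# expensive step that never affected the outcome" an expressible decision.

def prune_candidates(trace_card: dict, min_cost_usd: float = 0.05) -> list[dict]:
    """Return proposed prune patches for steps that are expensive yet redundant."""
    return [
        {
            "action": "prune",
            "target_step": step["label"],
            "saved_cost_usd": step["cost_usd"],
            # A real prune patch would carry a counterfactual argument naming
            # this high-cost step; here a placeholder rationale stands in.
            "rationale": f"step '{step['label']}' did not affect the outcome",
        }
        for step in trace_card.get("steps", [])
        if step.get("redundant") and step.get("cost_usd", 0.0) >= min_cost_usd
    ]


if __name__ == "__main__":
    # Minimal stand-in card; in practice this would come from a ClawTrace session.
    demo_card = {
        "steps": [
            {"label": "plan_edit", "cost_usd": 0.031, "redundant": False},
            {"label": "formula_checker", "cost_usd": 0.112, "redundant": True},
        ]
    }
    for patch in prune_candidates(demo_card):
        print(patch)
```

In this toy version, only the formula_checker step qualifies, since it is both flagged redundant and above the cost threshold; preserve and repair patches would need evidence beyond a flat cost summary, which this sketch does not attempt to model.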