StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
arXiv cs.AI / 4/1/2026
Key Points
- StepCache is a backend-agnostic, step-level reuse layer for LLM serving that reuses cached request “steps” when prompts share a solution structure but differ in localized constraints (e.g., schema, names, constants).
- It retrieves the best-matching cached request, verifies each reused step with lightweight, task-aware checks, and regenerates only the regions that fail (selective patching).
- StepCache supports strict structured-output enforcement for JSON (including required-key constraints and one-shot repair) plus conservative skip-reuse fallbacks when semantic changes are detected.
- For tasks like linear equations, it integrates verification into a bounded correction/repair loop with a deterministic fallback to guarantee correctness even if the backend model fails.
- In CPU-only, perturbation-heavy micro-benchmarks (math and JSON variants), StepCache substantially cuts mean, median, and p95 latency as well as token usage, while improving correctness from 72.5% to 100% under task-specific integrity checks on the stitched outputs.
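The summary above does not include the paper's actual API, but the control flow it describes can be sketched minimally: reuse each cached step only if a lightweight check passes, patch only the failing steps, enforce required JSON keys with a single repair attempt, and fall back to an exact solver for linear equations. All function and parameter names here are hypothetical illustrations, not StepCache's interface.

```python
import json
from typing import Callable, Dict, List, Optional, Set

def reuse_with_patching(
    cached_steps: List[Dict],           # steps from the best-matching cached request
    verify: Callable[[Dict], bool],     # lightweight, task-aware per-step check
    regenerate: Callable[[Dict], Dict], # backend call that re-derives one step
) -> List[Dict]:
    """Reuse verified steps as-is; selectively regenerate only failures."""
    out = []
    for step in cached_steps:
        out.append(step if verify(step) else regenerate(step))
    return out

def _parse_with_keys(text: str, required: Set[str]) -> Optional[dict]:
    """Strict structured-output check: valid JSON containing all required keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) and required <= obj.keys() else None

def enforce_json(text: str, required: Set[str],
                 repair: Callable[[str], str]) -> dict:
    """Accept the output if it passes; otherwise attempt one-shot repair,
    then give up (the caller would skip reuse and regenerate fully)."""
    obj = _parse_with_keys(text, required)
    if obj is None:
        obj = _parse_with_keys(repair(text), required)  # one-shot repair
    if obj is None:
        raise ValueError("JSON enforcement failed even after repair")
    return obj

def solve_linear(a: float, b: float, c: float,
                 model_answer: Optional[float]) -> float:
    """Bounded verify/repair for a*x + b = c with a deterministic fallback:
    keep the (possibly reused) model answer only if it verifies."""
    if model_answer is not None and abs(a * model_answer + b - c) < 1e-9:
        return model_answer
    return (c - b) / a  # exact solve guarantees correctness
```

The key design point suggested by the summary is that verification is cheap relative to generation, so checking every reused step and regenerating only failures beats both full regeneration and blind reuse.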