Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

arXiv cs.CL / 5/1/2026


Key Points

  • The paper argues that LLM-based coding agents should not reuse retrieved external memory unconditionally, because superficial matches can cause unsafe “memory injection” into the debugging process.
  • It reframes memory retrieval as a risk-sensitive selective control problem and proposes RSCB-MC, a contextual bandit that can choose among multiple retrieval and abstention actions (including not using memory or asking for feedback).
  • RSCB-MC stores reusable issue knowledge using a pattern-variant-episode schema and represents retrieval context with a 16-feature state capturing relevance, uncertainty, structural compatibility, feedback history, false-positive risk, latency, and token cost.
  • Its reward function heavily penalizes false-positive memory injection relative to missed reuse, making abstention and non-injection explicit safety-first options.
  • In deterministic smoke-scale offline replay, RSCB-MC achieves the strongest non-oracle success rate (62.5%), and in a bounded 200-case hot-path validation it reaches 60.5% proxy success. Both runs report a 0.0% false-positive rate, and the hot-path run reports a p95 decision latency of about 331 microseconds.
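
The asymmetric reward described in the key points can be illustrated with a small sketch. The function names and the penalty magnitudes (`fp_penalty=5.0`, `miss_penalty=1.0`, `success_reward=1.0`) are assumptions chosen for illustration; the summary does not give the paper's actual values. The sketch shows how a heavier false-positive penalty raises the confidence required before injecting memory becomes worthwhile.

```python
def reward(injected: bool, memory_compatible: bool,
           success_reward: float = 1.0,
           fp_penalty: float = 5.0,
           miss_penalty: float = 1.0) -> float:
    """Toy asymmetric reward (all magnitudes are assumed, not from the paper)."""
    if injected and memory_compatible:
        return success_reward      # correct reuse of prior knowledge
    if injected and not memory_compatible:
        return -fp_penalty         # false-positive memory injection
    if not injected and memory_compatible:
        return -miss_penalty       # missed reuse
    return 0.0                     # correctly withheld incompatible memory


def injection_threshold(success_reward: float = 1.0,
                        fp_penalty: float = 5.0,
                        miss_penalty: float = 1.0) -> float:
    """Smallest compatibility probability p at which injecting breaks even.

    Expected reward of injecting:     p * success_reward - (1 - p) * fp_penalty
    Expected reward of not injecting: -p * miss_penalty
    Setting the two equal and solving for p gives the value returned here.
    """
    return fp_penalty / (success_reward + miss_penalty + fp_penalty)


if __name__ == "__main__":
    print(f"inject only if p > {injection_threshold():.3f}")
```

Under these assumed values the break-even probability is 5/7, so injection pays off only when the memory is quite likely to be compatible. With equal penalties it would drop to 1/3, which is the sense in which the asymmetry makes non-injection the default.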

Abstract

Large language model (LLM)-based coding agents increasingly rely on external memory to reuse prior debugging experience, repair traces, and repository-local operational knowledge. However, retrieved memory is useful only when the current failure is genuinely compatible with a previous one; superficial similarity in stack traces, terminal errors, paths, or configuration symptoms can lead to unsafe memory injection. This paper reframes issue-memory use as a selective, risk-sensitive control problem rather than a pure top-k retrieval problem. We introduce RSCB-MC, a risk-sensitive contextual bandit memory controller that decides whether an agent should use no memory, inject the top resolution, summarize multiple candidates, perform high-precision or high-recall retrieval, abstain, or ask for feedback. The system stores reusable issue knowledge through a pattern-variant-episode schema and converts retrieval evidence into a fixed 16-feature contextual state capturing relevance, uncertainty, structural compatibility, feedback history, false-positive risk, latency, and token cost. Its reward design penalizes false-positive memory injection more strongly than missed reuse, making non-injection and abstention first-class safety actions. In deterministic smoke-scale artifacts, RSCB-MC obtains the strongest non-oracle offline replay success rate, 62.5%, while maintaining a 0.0% false-positive rate. In a bounded 200-case hot-path validation, it reaches 60.5% proxy success with 0.0% false positives and a 331.466 microseconds p95 decision latency. The results show that, for coding-agent memory, the key question is not only which memory is most similar, but whether any retrieved memory is safe enough to influence the debugging trajectory.
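
To make the controller idea concrete, here is a minimal sketch of a contextual bandit over the seven actions listed in the abstract. It is not the paper's learner: it swaps in a generic epsilon-greedy linear bandit with a simple abstention floor, and the class name `LinearBanditController`, the hyperparameters, and the feature ordering are all hypothetical. Features are assumed to be scaled to [0, 1].

```python
import random
from enum import Enum


class Action(Enum):
    # Action names follow the list in the abstract; the encoding is assumed.
    NO_MEMORY = 0
    INJECT_TOP = 1
    SUMMARIZE_CANDIDATES = 2
    HIGH_PRECISION_RETRIEVAL = 3
    HIGH_RECALL_RETRIEVAL = 4
    ABSTAIN = 5
    ASK_FEEDBACK = 6


N_FEATURES = 16  # the abstract fixes the state at 16 features


class LinearBanditController:
    """Epsilon-greedy linear bandit with an abstention floor (illustrative only)."""

    def __init__(self, epsilon=0.05, lr=0.05, abstain_floor=0.0, seed=0):
        self.weights = {a: [0.0] * N_FEATURES for a in Action}
        self.epsilon = epsilon              # exploration probability
        self.lr = lr                        # SGD step size
        self.abstain_floor = abstain_floor  # minimum estimate needed to act
        self.rng = random.Random(seed)

    def predict(self, action, state):
        """Estimated reward of `action` in `state` (dot product)."""
        return sum(w * x for w, x in zip(self.weights[action], state))

    def choose(self, state):
        if len(state) != N_FEATURES:
            raise ValueError(f"expected {N_FEATURES} features, got {len(state)}")
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(Action))  # occasional exploration
        best = max(Action, key=lambda a: self.predict(a, state))
        # Safety-first rule: if no action looks better than the floor, abstain.
        if self.predict(best, state) < self.abstain_floor:
            return Action.ABSTAIN
        return best

    def update(self, action, state, reward):
        """One SGD step on squared error for the action that was taken."""
        error = reward - self.predict(action, state)
        w = self.weights[action]
        for i, x in enumerate(state):
            w[i] += self.lr * error * x


if __name__ == "__main__":
    ctrl = LinearBanditController(epsilon=0.0)
    state = [1.0] + [0.0] * (N_FEATURES - 1)
    for action in Action:
        ctrl.update(action, state, reward=-1.0)  # every action looked bad here
    print(ctrl.choose(state))
```

In this toy run every action has a negative estimate for the given state, so the controller falls back to `Action.ABSTAIN`. A production controller would also need to restrict exploration so that random draws cannot trigger unsafe injection.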