CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

arXiv cs.CL / 3/18/2026

Key Points

  • CounterRefine introduces a lightweight inference-time repair layer for retrieval-grounded question answering that tests provisional answers by requesting additional evidence conditioned on the draft answer.
  • The approach first generates a short answer from retrieved evidence, then gathers supporting and conflicting evidence via follow-up queries, and finally applies a restricted refinement step to KEEP or REVISE with revisions accepted only after deterministic validation.
  • This reframes retrieval: instead of merely collecting more context, the system uses evidence to reevaluate and repair its own answer, targeting errors of commitment (settling on the wrong answer despite relevant evidence) rather than errors of access.
  • On the SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points to 73.1% accuracy and exceeds the reported one-shot GPT-5.4 score by roughly 40 points.

Abstract

In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.
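The three-step loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper functions (`retrieve`, `generate_answer`, `refine`) are hypothetical stand-ins for the retriever and model calls, the follow-up query templates are invented, and the concrete deterministic check (candidate must appear verbatim in the evidence) is an assumption standing in for whatever validation the paper uses.

```python
# Sketch of the CounterRefine inference-time repair loop.
# retrieve(query) -> list[str], generate_answer(q, evidence) -> str,
# refine(q, draft, evidence) -> ("KEEP", None) | ("REVISE", candidate)
# are all assumed interfaces, not the paper's actual API.

def counter_refine(question, retrieve, generate_answer, refine):
    # Step 1: draft a short answer from initially retrieved evidence.
    evidence = retrieve(question)
    draft = generate_answer(question, evidence)

    # Step 2: follow-up queries conditioned on the draft answer, gathering
    # both supporting and conflicting evidence (query phrasing is illustrative).
    support = retrieve(f"evidence supporting: {question} -> {draft}")
    counter = retrieve(f"evidence contradicting: {question} -> {draft}")
    pool = evidence + support + counter

    # Step 3: restricted refinement, whose output is limited to KEEP or REVISE.
    verdict, candidate = refine(question, draft, pool)

    # A REVISE is accepted only if it passes deterministic validation;
    # otherwise the draft stands.
    if verdict == "REVISE" and is_validated(candidate, pool):
        return candidate
    return draft

def is_validated(candidate, evidence):
    # Assumed deterministic check: the revised answer must be grounded
    # verbatim in at least one retrieved passage.
    return any(candidate in passage for passage in evidence)
```

The restriction to KEEP/REVISE plus an external validation gate is what distinguishes this from free-form self-correction: the model cannot drift to an arbitrary new answer unless the evidence pool deterministically supports it.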