Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

arXiv cs.AI · April 15, 2026


Key Points

  • The paper introduces Cycle-Consistent Search (CCS), a gold-supervision-free reinforcement learning framework for training search agents using cycle-consistency ideas.
  • CCS relies on the hypothesis that an optimal search trajectory acts as an information-preserving representation of the question's intent, enabling question reconstruction to serve as a proxy reward.
  • To prevent naive cycle objectives from exploiting lexical shortcuts, the method uses information bottlenecks such as excluding the final response and masking queries via named entity recognition (NER).
  • Experiments on question-answering benchmarks show CCS matches supervised baselines in performance and surpasses prior methods that do not use gold supervision.
  • Overall, CCS is positioned as a scalable training paradigm for search agents when ground-truth supervision is unavailable.
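The reconstruction-as-reward idea in the points above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `reconstruct` function (standing in for a reconstructor model) and the trajectory step schema are assumptions, and token-level F1 is used here as a simple similarity measure between the original and reconstructed question.

```python
def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a reconstructed and the original question."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def cycle_consistency_reward(question, trajectory, reconstruct):
    """Proxy reward: how well can the question be rebuilt from the trajectory?

    Information bottleneck (per the paper's description): the final response
    is excluded, so reconstruction must rely on retrieved observations and
    the trajectory's structure rather than copying the answer.
    """
    masked = [step for step in trajectory if step["type"] != "final_response"]
    reconstructed = reconstruct(masked)
    return token_f1(reconstructed, question)
```

A faithful reconstruction yields a reward near 1, while an uninformative trajectory (one from which the reconstructor cannot recover the question) yields a reward near 0, giving the policy a gold-free optimization signal.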

Abstract

Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable paradigm for training search agents in settings where gold supervision is unavailable.
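The second bottleneck mentioned in the abstract, NER masking of search queries, can be illustrated with a small sketch. This is not the paper's code: a real pipeline would obtain entity spans from an NER model (e.g., a spaCy or transformer-based tagger); here the entity list is passed in explicitly, and the `[ENTITY]` placeholder token is an assumption.

```python
import re


def mask_entities(query: str, entities: list[str]) -> str:
    """Replace entity spans in a search query with a placeholder token.

    Masking named entities removes the lexical shortcuts a reconstructor
    could otherwise copy verbatim, so the cycle-consistency reward must be
    earned from retrieved observations rather than query wording.
    """
    masked = query
    # Mask longer spans first so nested entities are handled correctly.
    for ent in sorted(entities, key=len, reverse=True):
        masked = re.sub(re.escape(ent), "[ENTITY]", masked)
    return masked
```

For example, `mask_entities("Who directed Inception starring Leonardo DiCaprio?", ["Inception", "Leonardo DiCaprio"])` keeps the question's structural scaffold while hiding the entities the reconstructor would otherwise just echo back.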