Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation

arXiv cs.AI / 4/1/2026


Key Points

  • The paper studies how the construction of “approval” signals in MONA (Myopic Optimization with Non-myopic Approval) impacts whether reward-hacking mitigation guarantees hold.
  • It provides a reproduction-first extension of the MONA “Camera Dropbox” environment by repackaging the public code into a standard Python project and running scripted PPO training to replicate key results (91.5% reward hacking for ordinary RL vs. 0.0% for oracle MONA).
  • The authors introduce a modular learned-approval suite covering oracle, noisy, misspecified, learned, and calibrated approval mechanisms to test the “approval-spectrum” conjecture in a runnable form.
  • In reduced-budget experiments, the best calibrated learned approval eliminates observed reward hacking but yields significantly lower intended-behavior performance than oracle MONA (11.9% vs. 99.9%), suggesting under-optimization rather than renewed hacking.
  • The main implication is that the engineering challenge shifts toward building learned approval models that retain enough foresight to prevent reward hacking without reintroducing vulnerabilities.
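The approval-mechanism spectrum listed above can be illustrated with a small sketch. This is not the released codebase; the function names, the `intended_action` state field, and the use of temperature scaling for calibration are all illustrative assumptions about how such a modular suite might be organized.

```python
import math
import random

def oracle_approval(state, action):
    # Stand-in oracle: approves exactly the intended action.
    # "intended_action" is a hypothetical state field for illustration.
    return 1.0 if action == state.get("intended_action") else 0.0

def noisy_approval(state, action, sigma=0.1, rng=random):
    # Oracle approval corrupted by Gaussian noise.
    return oracle_approval(state, action) + rng.gauss(0.0, sigma)

def misspecified_approval(state, action):
    # Approval of a proxy feature rather than intent -- here,
    # approving any non-null action at all.
    return 1.0 if action is not None else 0.0

def calibrated_approval(raw_score, temperature=2.0):
    # Post-hoc temperature scaling of a learned overseer's raw
    # score into a calibrated approval probability.
    return 1.0 / (1.0 + math.exp(-raw_score / temperature))
```

Each variant maps a (state, action) pair to an approval score, so the training loop can swap mechanisms without other changes; that modularity is what makes the approval-spectrum conjecture testable in one harness.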

Abstract

Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent's planning horizon while supplying far-sighted approval as a training signal (Farquhar et al., 2025). The original paper identifies a critical open question: how the method of constructing approval -- particularly the degree to which approval depends on achieved outcomes -- affects whether MONA's safety guarantees hold. We present a reproduction-first extension of the public MONA Camera Dropbox environment that (i) repackages the released codebase as a standard Python project with scripted PPO training, (ii) confirms the published contrast between ordinary RL (91.5% reward-hacking rate) and oracle MONA (0.0% hacking rate) using the released reference arrays, and (iii) introduces a modular learned-approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms. In reduced-budget pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies, the best calibrated learned-overseer run achieves zero observed reward hacking but substantially lower intended-behavior rates than oracle MONA (11.9% vs. 99.9%), consistent with under-optimization rather than re-emergent hacking. These results operationalize the MONA paper's approval-spectrum conjecture as a runnable experimental object and suggest that the central engineering challenge shifts from proving MONA's concept to building learned approval models that preserve sufficient foresight without reopening reward-hacking channels. Code, configurations, and reproduction commands are publicly available at https://github.com/codernate92/mona-camera-dropbox-repro
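The contrast the abstract draws between ordinary RL and MONA can be sketched in terms of training targets. This is a minimal illustration under our own naming assumptions, not the paper's implementation: it assumes the MONA target is simply the immediate reward plus a far-sighted approval score, with no discounted sum over future environment rewards.

```python
def ordinary_rl_return(rewards, gamma=0.99):
    # Standard discounted return: credit for all future environment
    # reward, which is what makes multi-step reward hacking pay off.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mona_target(immediate_reward, approval):
    # MONA-style myopic target: only the current step's reward plus
    # a non-myopic approval signal. A hack whose payoff arrives in
    # later steps earns nothing here unless the overseer approves it.
    return immediate_reward + approval
```

The point of the contrast is that MONA's foresight lives entirely in the approval term, which is why degrading that term (noise, misspecification, an under-trained overseer) can cost intended-behavior performance even when hacking stays at zero.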