Abstract
Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL\,2026 study on personal-style post-editing of LLM drafts, add three new ones, and involve the human investigator only as a reviewer-in-the-loop. We reproduce the findings for all seven preregistered hypotheses and recover the paper's headline correlation between perceived and embedding-measured self-similarity to three decimal places (r{=}{+}0.244, p{<}10^{-8}, n{=}648). Under a leakage-free held-out protocol, GPT-5.5 and Claude\,Opus\,4.7 close 71--75\,\% of the style gap to the same-author ceiling on 324 paired tasks, against 24\,\% for the human post-edit, and outperform the human post-edit on ${\sim}80\,\%$ of tasks. We then recast the same data as an AI-text detection arms race. A leave-authors-out linear SVM on LUAR-MUD embeddings reaches AUC 0.93--1.00 across approaches; six diagnostics show that detection of GPT-5.5 mimics is largely a length confound, whereas detection of Opus mimics rests on a genuine stylistic signature. Given T{=}20 feedback iterations against the frozen detector, an Opus agent flips two of five held-out test mimics into the human half-space and shrinks every margin by an order of magnitude. With moderate effort against a known detector, a frontier LLM can already lower its own AI-detection probability. We release all code, the 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories.