Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race

arXiv cs.CL · May 5, 2026


Key Points

  • The paper (arXiv 2605.02620v1) reports an agentic research workflow that reproduces an ACL 2026 empirical NLP study in hours rather than weeks, using a modern agentic-research harness with the human investigator acting only as a reviewer-in-the-loop.
  • The authors reproduce all seven preregistered hypotheses and recover the original study's headline correlation between perceived self-similarity and embedding-measured self-similarity (r = +0.244, p < 1e-8, n = 648) to three decimal places.
  • Under a leakage-free held-out protocol on 324 paired tasks, GPT-5.5 and Claude Opus 4.7 close 71–75% of the style gap to the same-author ceiling, versus 24% for the human post-edit, and beat the human post-edit on roughly 80% of tasks.
  • Reframing the same data as an AI-text detection arms race, the study trains a leave-authors-out linear SVM on LUAR-MUD embeddings that reaches AUC 0.93–1.00, but six diagnostics show the two detection signals differ in kind: GPT-5.5 detection is largely driven by a text-length confound, while Opus detection reflects a genuine stylistic signature (a minimal detector sketch follows this list).
  • In adversarial testing, an Opus agent given 20 feedback iterations against the frozen detector flips two of five held-out test mimics into the human half-space and shrinks every detector margin by an order of magnitude, showing that a frontier LLM can already efficiently lower its own AI-detection probability against a known detector.
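
The detection setup is compact enough to sketch. Below is a minimal leave-authors-out linear-SVM detector over precomputed embeddings; the file names, array layout, and use of scikit-learn's LeaveOneGroupOut are assumptions for illustration, not the paper's released pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

# Hypothetical inputs: precomputed LUAR-MUD embeddings, binary labels
# (1 = AI mimic, 0 = human text), and one author id per row. Assumes
# each held-out author contributes both human and mimic texts.
X = np.load("luar_mud_embeddings.npy")   # shape (n_texts, d)
y = np.load("labels.npy")
authors = np.load("author_ids.npy")

aucs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=authors):
    clf = LinearSVC(C=1.0).fit(X[train_idx], y[train_idx])
    # Signed distance to the hyperplane: the sign picks the half-space,
    # the magnitude is the margin the adversarial agent later shrinks.
    scores = clf.decision_function(X[test_idx])
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"per-author AUC range: {min(aucs):.2f}-{max(aucs):.2f}")
```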

Abstract

Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL 2026 study on personal-style post-editing of LLM drafts, and add three new ones, with the human investigator acting only as a reviewer-in-the-loop. We reproduce all seven preregistered hypotheses and recover the paper's headline correlation between perceived self-similarity and embedding-measured self-similarity to three decimal places (r = +0.244, p < 1e-8, n = 648). Under a leakage-free held-out protocol, GPT-5.5 and Claude Opus 4.7 close 71–75% of the style gap to the same-author ceiling on 324 paired tasks, against 24% for the human post-edit, and beat the human post-edit on ~80% of tasks. We then frame the same data as an AI-text detection arms race. A leave-authors-out linear SVM on LUAR-MUD embeddings reaches AUC 0.93–1.00 across approaches; six diagnostics show that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature. Given T = 20 feedback iterations against the frozen detector, an Opus agent flips two of five held-out test mimics to the human half-space and shrinks every margin by an order of magnitude. With moderate effort against a known detector, a frontier LLM can already efficiently lower its own AI-detection probability. All code, 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories are released.
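
The abstract's arms-race step is an agent getting T = 20 rounds of detector feedback. Here is a hedged sketch of that loop, assuming an embed() encoder, a frozen fitted detector, and a rewrite() call standing in for the LLM agent; all three are placeholders, not the paper's interfaces.

```python
# Hedged sketch of the T=20 adversarial feedback loop. `embed`, `detector`,
# and `rewrite` are placeholder callables standing in for the LUAR-MUD
# encoder, the frozen trained SVM, and the Opus rewriting agent.
def adversarial_rounds(text, embed, detector, rewrite, T=20):
    """Iteratively rewrite `text` to shrink its signed detector margin."""
    best_text = text
    best_margin = detector.decision_function(embed(text).reshape(1, -1))[0]
    for _ in range(T):
        # The agent sees the current margin (positive = AI half-space)
        # and produces a new candidate draft in the author's style.
        candidate = rewrite(best_text, feedback=best_margin)
        margin = detector.decision_function(embed(candidate).reshape(1, -1))[0]
        if margin < best_margin:  # closer to, or inside, the human half-space
            best_text, best_margin = candidate, margin
    return best_text, best_margin
```

Flipping a mimic corresponds to driving the final margin below zero; per the abstract, this succeeds for two of five held-out test mimics, with every margin shrinking by roughly an order of magnitude.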