StyleShield: Exposing the Fragility of AIGC Detectors through Continuous Controllable Style Transfer

arXiv cs.LG / 5/5/2026


Key Points

  • The paper argues that AIGC detectors are inherently fragile because the line between AI-written and human-written text erodes as language models improve.
  • It introduces StyleShield, a flow-matching framework that performs controllable style transfer directly in continuous token-embedding space using a DiT-based backbone and Qwen-7B-conditioned adapters.
  • During inference, StyleShield adapts an SDEdit-like approach to text embeddings with a single control parameter (gamma) to balance evasion versus preservation of original content.
  • Experiments on a multi-domain Chinese benchmark show high evasion rates against both the training detector (94.6%) and three unseen detectors (>=99%) while keeping strong semantic similarity (0.928).
  • The authors propose RateAudit, a document-level scheduling method that can force detection verdict rates to arbitrary targets, raising doubts about score-based evaluation reliability.
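
The gamma-controlled inference step can be pictured as SDEdit-style partial regeneration under a rectified-flow model: interpolate the clean embedding toward noise up to time t = gamma, then integrate the learned velocity field back to t = 0. The sketch below is a minimal illustration of that control knob, not the paper's implementation; `velocity` stands in for the trained DiT backbone and is stubbed out so the code runs.

```python
import numpy as np

def velocity(x_t, t):
    """Hypothetical stand-in for the learned DiT velocity field v(x_t, t).
    A rectified-flow model is trained so v approximates (noise - x0);
    here it returns zeros purely to keep the sketch runnable."""
    return np.zeros_like(x_t)

def sdedit_transfer(x0, gamma, n_steps=10, seed=None):
    """SDEdit-style partial regeneration in token-embedding space.
    gamma in [0, 1]: 0 returns the input unchanged (full preservation),
    1 regenerates from pure noise (maximum evasion)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(x0.shape)
    # Start from the straight-line interpolation at time t = gamma.
    t = gamma
    x = (1.0 - t) * x0 + t * noise
    # Euler-integrate the velocity field back down to t = 0.
    dt = gamma / max(n_steps, 1)
    for _ in range(n_steps):
        x = x - dt * velocity(x, t)
        t -= dt
    return x

emb = np.ones((4, 8))               # toy token-embedding matrix
out = sdedit_transfer(emb, gamma=0.0)
```

With `gamma=0.0` no noise is mixed in and no integration steps run, so the embeddings come back untouched; raising gamma smoothly trades content preservation for stylistic divergence.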

Abstract

AI-generated content (AIGC) detectors are increasingly deployed in high-stakes settings such as academic integrity screening, yet their reliability rests on a fundamental paradox: as language models are trained on human-written corpora, the statistical boundary between AI and human writing will inevitably dissolve as models improve. Commercial incentives have further distorted this landscape: detection services and "de-AIification" tools often operate within the same supply chain, replacing evaluation of content quality with judgment of content origin. We present StyleShield, the first flow-matching framework for conditional text style transfer, operating directly in continuous token-embedding space via a DiT backbone with zero-initialized cross-attention adapters conditioned on frozen Qwen-7B representations. At inference, we adapt the SDEdit paradigm from image synthesis to text embeddings, with a single parameter gamma providing smooth continuous control over the evasion-preservation trade-off. On a multi-domain Chinese benchmark, StyleShield achieves 94.6% evasion against the training detector and >=99% against three unseen detectors, maintaining 0.928 semantic similarity. We further introduce RateAudit, a document-level scheduling algorithm that demonstrates detection-rate verdicts can be set to arbitrary values, directly questioning the reliability of score-based evaluation.
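
To make the RateAudit claim concrete, here is a hedged sketch of how a document-level scheduler could force a corpus's flagged rate to an arbitrary target: rewrite just enough of the highest-scoring flagged documents until the detector's verdict rate matches the target. The paper's actual algorithm is not specified here; `detector_score` and `rewrite` are hypothetical placeholders.

```python
def detector_score(doc):
    """Hypothetical detector returning P(AI-written); stubbed via a field."""
    return doc["score"]

def rate_audit(docs, target_rate, rewrite, threshold=0.5):
    """Schedule per-document rewrites so the fraction of docs flagged by
    the detector matches target_rate. This sketch can only lower the
    flagged rate (rewrites reduce scores); returns the final rate."""
    n_target = round(target_rate * len(docs))
    flagged = sorted((d for d in docs if detector_score(d) >= threshold),
                     key=detector_score, reverse=True)
    # Rewrite the highest-scoring flagged docs until only n_target remain.
    for doc in flagged[: max(len(flagged) - n_target, 0)]:
        rewrite(doc)  # e.g. a StyleShield pass with a suitable gamma
    return sum(detector_score(d) >= threshold for d in docs) / len(docs)

# Toy corpus: three docs flagged, one clean; force a 25% verdict rate.
docs = [{"score": s} for s in (0.9, 0.8, 0.7, 0.2)]
final_rate = rate_audit(docs, 0.25, rewrite=lambda d: d.update(score=0.1))
```

On this toy corpus the two highest-scoring flagged documents are rewritten, leaving exactly one of four flagged, i.e. the requested 25% rate; the broader point is that any score-thresholded verdict rate becomes a free parameter for the attacker.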