HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
arXiv cs.CL / 4/22/2026
Key Points
- The paper shows that LLMs used as co-authors in collaborative writing are vulnerable to “draft-based” jailbreak attacks, in which an incomplete draft steers the model into generating harmful content when it completes the text.
- It introduces HarDBench, a systematic benchmark covering high-risk domains such as explosives, drugs, weapons, and cyberattacks, using prompts that resemble realistic co-authoring structures with domain-specific cues.
- The authors propose a safety-utility balanced alignment approach using preference optimization to train models to refuse harmful completions while still being useful for benign drafts.
- Experiments indicate that current LLMs are highly susceptible in co-authoring settings, and the proposed alignment method substantially reduces harmful outputs without noticeably harming co-authoring performance.
- The benchmark and dataset are released publicly to support evaluation and alignment of LLMs specifically for human–LLM collaborative writing scenarios.
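The summary says the alignment method uses preference optimization to balance refusal and helpfulness, without naming the exact objective. A common instantiation of preference optimization is DPO (Direct Preference Optimization); the sketch below shows the DPO loss for a single preference pair, where the "chosen" completion would be a refusal on a harmful draft (or a helpful continuation on a benign one) and "rejected" the opposite. All function and variable names here are illustrative, not from the paper.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full completion
    under the policy being trained or the frozen reference model.
    Lower loss means the policy prefers the chosen completion more
    strongly (relative to the reference) than the rejected one.
    """
    # Implicit reward margin, scaled by the temperature beta.
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # Negative log-sigmoid of the margin.
    return math.log(1.0 + math.exp(-logits))

# When the policy matches the reference, the margin is zero and the
# loss sits at log(2); pushing probability toward the chosen
# completion drives the loss below that baseline.
```

In the co-authoring setting described above, the preference data would pair each harmful draft with a refusal (chosen) versus a harmful completion (rejected), and each benign draft with a helpful continuation (chosen) versus an unnecessary refusal (rejected), which is how a single objective can trade off safety against utility.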