The Last Fingerprint: How Markdown Training Shapes LLM Prose
arXiv cs.CL / 3/31/2026
Key Points
- The paper argues that LLM em-dash “overuse” is not just a stylistic quirk, but a form of markdown leaking into prose from markdown-saturated training data.
- It proposes a mechanistic genealogy linking dataset structure, internalization of formatting conventions, the em dash’s dual role in markdown/prose, and how post-training amplifies the effect.
- A two-condition suppression experiment across 12 models from multiple providers finds that when models are instructed to avoid markdown, most overt markdown features disappear while em dashes largely persist.
- Em-dash frequency and resistance to suppression vary by model, from zero in Meta Llama models to substantially higher rates in others; the paper presents this variation as a diagnostic signature of fine-tuning methodology.
- Additional suppression gradients and base-versus-instruct comparisons suggest the tendency can exist pre-RLHF and may not be fully removable even with explicit prohibition prompts.
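
As a rough illustration of the two-condition comparison described above, the sketch below counts overt markdown features and em dashes in a response generated under a default prompt and in one generated under a "no markdown" prompt. It is not the paper's measurement code: the regex patterns, the per-1,000-words normalization, and all function names are assumptions chosen for this example.

```python
import re

EM_DASH = "\u2014"  # the em dash character

# Hypothetical patterns for "overt" markdown features.
# The feature list actually used in the paper may differ.
MARKDOWN_PATTERNS = {
    "header": re.compile(r"^#{1,6}[ \t]", re.MULTILINE),
    "bullet": re.compile(r"^[ \t]*[-*+][ \t]", re.MULTILINE),
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "inline_code": re.compile(r"`[^`\n]+`"),
}


def em_dashes_per_1000_words(text):
    """Em dashes per 1,000 whitespace-separated tokens (assumed metric)."""
    words = len(text.split())
    if words == 0:
        return 0.0
    return 1000.0 * text.count(EM_DASH) / words


def markdown_feature_counts(text):
    """Count matches of each hypothetical markdown pattern."""
    return {name: len(p.findall(text)) for name, p in MARKDOWN_PATTERNS.items()}


def compare_conditions(default_text, suppressed_text):
    """Summarize both conditions side by side for one model."""
    summary = {}
    for label, text in (("default", default_text), ("suppressed", suppressed_text)):
        summary[label] = {
            "markdown": markdown_feature_counts(text),
            "em_dash_rate": em_dashes_per_1000_words(text),
        }
    return summary
```

Under these assumptions, the pattern the summary reports would show up as markdown counts falling to zero in the suppressed condition while `em_dash_rate` stays above zero.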


