HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
arXiv cs.CL · April 22, 2026
Key Points
- The paper introduces Tree-of-Writing (ToW), a holistic evaluation approach that addresses inconsistencies in LLM-as-a-judge methods by explicitly modeling how sub-features are aggregated for writing quality.
- It also releases HowToBench, a large-scale Chinese writing benchmark with 12 genres and 1,302 instructions spanning contextual completion, outline-guided writing, and open-ended generation.
- Experimental results show that ToW substantially reduces evaluation bias and aligns strongly with human judgments, achieving a Pearson correlation of 0.93.
- The authors find that common overlap-based metrics and typical LLM-as-a-judge practices are sensitive to textual perturbations, whereas ToW is more robust.
- They further report a negative correlation between input length and content-related scores in the outline-guided (Guide) task, suggesting that longer inputs do not automatically lead to better-scored writing.
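The idea of explicitly modeling how sub-feature scores are aggregated can be sketched as a small tree evaluation. The sketch below is purely illustrative: the tree structure, feature names, and weights are assumptions, not taken from the paper, and the Pearson function simply shows the alignment metric the authors report against.

```python
from statistics import mean
from math import sqrt

def aggregate(node):
    """Score a node in a hypothetical tree of writing criteria:
    leaves carry raw sub-feature scores; internal nodes take a
    weighted mean of their children."""
    if "score" in node:
        return node["score"]
    total_w = sum(w for w, _ in node["children"])
    return sum(w * aggregate(child) for w, child in node["children"]) / total_w

def pearson(xs, ys):
    """Pearson correlation, the metric used to measure alignment
    between automatic scores and human judgments."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative tree: overall quality as a weighted mean of a
# two-leaf "content" subtree and a single "style" score.
tree = {
    "children": [
        (0.6, {"children": [(1.0, {"score": 8.0}),
                            (1.0, {"score": 6.0})]}),  # content
        (0.4, {"score": 9.0}),                          # style
    ]
}
print(aggregate(tree))  # 0.6 * 7.0 + 0.4 * 9.0 = 7.8
```

Making the aggregation explicit like this is what distinguishes the approach from a single holistic LLM-as-a-judge prompt, where the weighting of sub-features is implicit and can drift between runs.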