From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?
arXiv cs.CL / 3/31/2026
Key Points
- The study investigates whether large language models can transform messy app store reviews into backlog-ready, human-like user stories for agile development.
- Using the Mini-BAR dataset (1,000+ health app reviews), researchers evaluated multiple prompting strategies (zero-shot, one-shot, two-shot) across models including GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct.
- Evaluation combined human judgment using the RUST framework with an ML-based approach: a RoBERTa classifier fine-tuned on UStAI to score generated user-story quality.
- Results indicate LLMs can produce fluent, well-formatted user stories that match or even outperform human-written ones, particularly with few-shot prompting.
- Despite strong formatting and fluency, LLMs have difficulty generating truly independent and unique user stories, limiting how well they support building diverse agile backlogs.
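The one-shot prompting strategy described above can be sketched as assembling an instruction, a single worked example, and the target review into one prompt. This is a minimal illustration, not the paper's actual prompt: the template wording, the example review, and the `build_one_shot_prompt` helper are all hypothetical, and the example is not drawn from the Mini-BAR dataset.

```python
# Hypothetical sketch of a one-shot prompt for turning an app review
# into a user story. Wording and example are illustrative only.

USER_STORY_TEMPLATE = "As a <role>, I want <goal>, so that <benefit>."

# One worked example (invented, not from Mini-BAR) shown to the model.
EXAMPLE = (
    "Review: The app keeps logging me out every time I switch screens.\n"
    "User story: As a user, I want to stay signed in while navigating "
    "the app, so that I can use it without repeated logins."
)

def build_one_shot_prompt(review: str) -> str:
    """Assemble instruction + worked example + target review into one prompt."""
    return (
        "Convert the app review below into a user story of the form "
        f"'{USER_STORY_TEMPLATE}'\n\n"
        f"{EXAMPLE}\n\n"
        f"Review: {review}\n"
        "User story:"
    )

print(build_one_shot_prompt("Step counts reset to zero after every sync."))
```

A zero-shot variant would simply drop `EXAMPLE`, while a two-shot variant would include a second worked review/story pair; the study compared all three.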