From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?

arXiv cs.CL / March 31, 2026


Key Points

  • The study investigates whether large language models can transform messy app store reviews into backlog-ready, human-like user stories for agile development.
  • Using the Mini-BAR dataset (1,000+ health app reviews), researchers evaluated multiple prompting strategies (zero-shot, one-shot, two-shot) across models including GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct.
  • Evaluation combined human judgment using the RUST framework with an ML-based approach: a RoBERTa classifier fine-tuned on UStAI to score generated user-story quality.
  • Results indicate that LLMs can produce fluent, well-formatted user stories that match or outperform human-written ones, particularly with few-shot prompting.
  • Despite strong formatting and fluency, LLMs have difficulty generating truly independent and unique user stories, limiting how well they support building diverse agile backlogs.
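The zero-, one-, and two-shot prompting strategies above differ only in how many worked review-to-story examples are placed in the prompt. The paper's actual prompt wording is not reproduced here; the template, example reviews, and story text below are illustrative assumptions:

```python
# Minimal sketch of zero-/one-/two-shot prompt construction for turning an
# app review into a user story. The template and the demonstration pairs are
# hypothetical -- they stand in for the paper's (unpublished here) prompts.

STORY_TEMPLATE = "As a <role>, I want <goal>, so that <benefit>."

# Hypothetical review -> user-story pairs used as in-context demonstrations.
EXAMPLES = [
    ("The app keeps logging me out every day, so annoying.",
     "As a returning user, I want to stay signed in between sessions, "
     "so that I can access my data without logging in daily."),
    ("Would love a dark mode, the white screen hurts at night.",
     "As a night-time user, I want a dark mode option, "
     "so that I can read comfortably in low light."),
]

def build_prompt(review: str, shots: int = 0) -> str:
    """Assemble a prompt with `shots` in-context examples (0, 1, or 2)."""
    parts = [
        "Rewrite the app review below as an agile user story "
        f"in the form: {STORY_TEMPLATE}"
    ]
    for rev, story in EXAMPLES[:shots]:
        parts.append(f"Review: {rev}\nUser story: {story}")
    parts.append(f"Review: {review}\nUser story:")
    return "\n\n".join(parts)
```

The same `build_prompt` call would then be sent, unchanged except for `shots`, to each model under comparison, so that prompt content is the only varying factor.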

Abstract

App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
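The evaluation combines human RUST judgments with a fine-tuned RoBERTa classifier; both are beyond a short snippet. As a much simpler stand-in, the two weakest dimensions the study identifies, template conformance and uniqueness/independence, can be approximated heuristically. Everything below (the regex, the Jaccard proxy, the 0.6 threshold) is an illustrative assumption, not the paper's method:

```python
# Toy checks for two user-story quality dimensions: well-formedness
# (the canonical "As a / I want / so that" template) and redundancy
# (heavy word overlap between stories in a backlog). This is a crude
# illustration, NOT the RUST framework or the RoBERTa classifier.
import re
from itertools import combinations

STORY_RE = re.compile(
    r"^As an? (?P<role>.+?), I want (?P<goal>.+?), so that (?P<benefit>.+?)\.?$",
    re.IGNORECASE,
)

def is_well_formed(story: str) -> bool:
    """True if the story follows the canonical user-story template."""
    return STORY_RE.match(story.strip()) is not None

def jaccard(a: str, b: str) -> float:
    """Word-set overlap, used here as a crude similarity proxy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def redundant_pairs(stories, threshold=0.6):
    """Index pairs whose wording overlaps heavily -- a rough stand-in for
    the 'unique and independent' criteria LLMs were found to struggle with."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(stories), 2)
            if jaccard(a, b) >= threshold]
```

A backlog where `redundant_pairs` flags many pairs would echo the paper's finding: stories can be individually fluent and well-formed yet collectively repetitive.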