MobileDev-Bench: A Comprehensive Benchmark for Evaluating Language Models on Mobile Application Development

arXiv cs.LG · March 27, 2026


Key Points

  • MobileDev-Bench is introduced as a new benchmark for evaluating LLMs on real-world mobile application development tasks, covering Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart).
  • The benchmark includes 384 issue-resolution tasks paired with executable test patches, allowing fully automated validation of model-generated fixes in mobile build environments.
  • The tasks are notably complex: fixes span 12.5 files and 324.9 lines on average, and 35.7% of instances require coordinated multi-artifact changes (e.g., source and manifest files).
  • Evaluations of four code-capable state-of-the-art models (GPT-5.2, Claude Sonnet 4.5, Gemini Flash 2.5, Qwen3-Coder) show low end-to-end resolution rates of 3.39%–5.21%, highlighting substantial gaps versus other software-engineering benchmarks.
  • The study identifies systematic bottlenecks in fault localization for coordinated multi-file, multi-artifact changes, suggesting where future model improvements are most needed for mobile dev workflows.
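The validation pipeline sketched in the points above — pair each issue with an executable test patch, apply the model's fix, and count it resolved only if the tests pass — can be illustrated with a minimal harness. All names here (`TaskInstance`, `run_tests`) are hypothetical illustrations, not the benchmark's actual API:

```python
# Hypothetical sketch of MobileDev-Bench-style automated validation:
# apply the model's candidate fix, overlay the held-out test patch, and
# run the project's test suite; an instance counts as resolved only if
# the tests pass. The runner is injected so build-environment details
# (Android, React Native, Flutter) stay out of the scoring logic.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TaskInstance:
    repo: str          # e.g. an Android, React Native, or Flutter project
    model_patch: str   # candidate fix produced by the LLM
    test_patch: str    # executable tests that must pass after the fix


def resolution_rate(instances: Iterable[TaskInstance],
                    run_tests: Callable[[TaskInstance], bool]) -> float:
    """Fraction of instances whose model patch makes the test patch pass."""
    instances = list(instances)
    resolved = sum(1 for inst in instances if run_tests(inst))
    return resolved / len(instances)


# Example with a stubbed runner that "resolves" one of two instances:
demo = [TaskInstance("app-a", "fix-a", "tests-a"),
        TaskInstance("app-b", "fix-b", "tests-b")]
print(f"{resolution_rate(demo, lambda inst: inst.repo == 'app-a'):.2%}")
# → 50.00%
```

Injecting `run_tests` as a callable keeps the end-to-end metric independent of how each framework's build and test commands are invoked.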

Abstract

Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on general-purpose libraries or web applications, leaving mobile application development largely unexplored despite its strict platform constraints, framework-driven lifecycles, and complex platform API interactions. We introduce MobileDev-Bench, a benchmark comprising 384 real-world issue-resolution tasks collected from 18 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs an authentic developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantial patch complexity: fixes modify 12.5 files and 324.9 lines on average, and 35.7% of instances require coordinated changes across multiple artifact types, such as source and manifest files. Evaluation of four state-of-the-art code-capable LLMs (GPT-5.2, Claude Sonnet 4.5, Gemini Flash 2.5, and Qwen3-Coder) yields low end-to-end resolution rates of 3.39%–5.21%, revealing significant performance gaps compared to prior benchmarks. Further analysis reveals systematic failure modes, with fault localization across multi-file and multi-artifact changes emerging as the primary bottleneck.
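A quick arithmetic check puts the reported range in concrete terms: assuming resolution rate is simply resolved instances divided by the 384-task total, the 3.39%–5.21% endpoints correspond to roughly 13 and 20 resolved instances.

```python
# Sanity-check the reported resolution-rate range against the 384-task total:
# 13/384 and 20/384 round to the paper's 3.39% and 5.21% endpoints.
total = 384
for resolved in (13, 20):
    print(f"{resolved}/{total} = {100 * resolved / total:.2f}%")
# 13/384 = 3.39%
# 20/384 = 5.21%
```

In other words, even the best-scoring model resolves only about 20 of the 384 tasks end to end.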