AI Navigate

Measuring Intent Comprehension in LLMs

arXiv cs.CL / 3/13/2026


Key Points

  • The paper argues that LLMs are trained to predict the next token from text, not to infer underlying user intent, so intent must be inferred from surface cues that correlate only imperfectly with what users actually want.
  • It introduces a formal framework that decomposes model output variance into three components—user intent, user articulation, and model uncertainty—to assess whether models primarily reflect intent differences.
  • Across five LLaMA and Gemma models, the study finds that larger models tend to allocate more of the output variance to intent, suggesting stronger intent comprehension, though improvements are uneven and often modest with size.
  • The authors argue for moving beyond accuracy-only benchmarks toward semantic diagnostics that directly evaluate whether models understand what users want, especially in high-stakes settings.

Abstract

People judge interactions with large language models (LLMs) as successful when outputs match what they want, not what they type. Yet LLMs are trained to predict the next token solely from text input, not underlying intent. Because written language is an imperfect proxy for intent, and correlations between phrasing and desired outcomes can break down in training data, models that rely too heavily on surface cues may respond inconsistently to semantically equivalent prompts. This makes it essential to evaluate whether LLMs can reliably infer user intent, especially in high-stakes settings where robustness and generalization are critical. We introduce a formal framework for assessing intent comprehension in LLMs: whether a model demonstrates robust understanding of user intent by producing consistent outputs across semantically equivalent prompts while differentiating between prompts with distinct intents. Our evaluation approach is based on a variance decomposition of model responses into three components: variability due to user intent, user articulation, and model uncertainty. Models that understand what users want, and are not overly sensitive to textual cues, should attribute most output variance to intent differences, rather than articulation style. Applying this framework across diverse domains, we find that, within the five LLaMA and Gemma models we evaluate, larger models typically assign a greater share of variance to intent, indicating stronger comprehension of intent, although gains are uneven and often modest with increasing model size. These results motivate moving beyond accuracy-only benchmarks toward semantic diagnostics that directly assess whether models understand what users intend.
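To make the decomposition concrete, here is a minimal sketch of how output variance can be split across intent, articulation, and model uncertainty using the law of total variance. The paper's exact formulation is not reproduced here; this hypothetical example assumes a balanced design (every intent phrased in several ways, every prompt sampled repeatedly) and a scalar score per model response, and all the synthetic parameters (`I`, `J`, `K`, the noise scales) are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical balanced design: I intents, each articulated J ways,
# each prompt sampled K times from the model.
I, J, K = 4, 5, 20
intent_means = rng.normal(0.0, 2.0, size=I)        # intent-level signal
artic_shift = rng.normal(0.0, 0.5, size=(I, J))    # articulation-style shift
scores = (intent_means[:, None, None]
          + artic_shift[:, :, None]
          + rng.normal(0.0, 0.3, size=(I, J, K)))  # model sampling noise

total_var = scores.var()

# Law of total variance, applied hierarchically:
#   total = Var_intent( E[score | intent] )
#         + E_intent[ Var_artic( E[score | intent, artic] ) ]
#         + E[ Var(score | intent, artic) ]
intent_var = scores.mean(axis=(1, 2)).var()   # variance across intent means
artic_var = scores.mean(axis=2).var(axis=1).mean()  # within-intent, across phrasings
model_var = scores.var(axis=2).mean()         # sampling noise per prompt

shares = np.array([intent_var, artic_var, model_var]) / total_var
print({"intent": shares[0], "articulation": shares[1], "model": shares[2]})
```

Under this framing, a model with strong intent comprehension would show a large `intent` share and small `articulation` share: rephrasing the same intent barely moves its outputs, while distinct intents clearly do. With balanced groups and population variances (NumPy's default `ddof=0`), the three components sum exactly to the total variance.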