The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

arXiv cs.CL / April 22, 2026


Key Points

  • The paper reports a systematic rise of “verbal tics” (repetitive, formulaic phrasing) across eight leading frontier LLMs, spanning sycophantic openers, pseudo-empathetic affirmations, and overused vocabulary.
  • Using a custom framework for standardized API-based evaluation over 10,000 prompts across 10 task categories in English and Chinese (160,000 total responses), the study introduces the Verbal Tic Index (VTI) to quantify tic prevalence (a computational sketch follows this list).
  • Significant model-to-model differences are found, with Gemini 3.1 Pro showing the highest VTI (0.590) and DeepSeek V3.2 the lowest (0.295).
  • The analysis finds that verbal tics grow over multi-turn conversations, are stronger in subjective tasks, and exhibit distinct cross-lingual patterns.
  • Human evaluation (N=120) shows a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001), supporting the idea of an “alignment tax” in current training paradigms.
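The summary does not spell out how the VTI is computed. The sketch below shows one plausible shape for such a composite index: a weighted average of per-category tic rates over a batch of responses. The tic lexicons, the component weights, and the `rate`/`compute_vti` helpers are all illustrative assumptions, not the paper's actual method.

```python
import re

# Hypothetical tic lexicons; the paper's actual pattern inventory is not given here.
SYCOPHANTIC_OPENERS = [
    r"^that'?s a great question\b",
    r"^awesome\b",
]
PSEUDO_EMPATHY = [
    r"\bi completely understand your concern\b",
    r"\bi'?m right here to catch you\b",
]
OVERUSED_VOCAB = [r"\bdelve\b", r"\btapestry\b", r"\bnuanced\b"]

# Illustrative component weights, assumed to sum to 1.
WEIGHTS = {"opener": 0.4, "empathy": 0.3, "vocab": 0.3}


def rate(patterns: list[str], texts: list[str]) -> float:
    """Fraction of responses matching at least one pattern in the lexicon."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = sum(any(c.search(t) for c in compiled) for t in texts)
    return hits / len(texts) if texts else 0.0


def compute_vti(responses: list[str]) -> float:
    """Weighted average of per-category tic rates; lies in [0, 1]."""
    return (
        WEIGHTS["opener"] * rate(SYCOPHANTIC_OPENERS, responses)
        + WEIGHTS["empathy"] * rate(PSEUDO_EMPATHY, responses)
        + WEIGHTS["vocab"] * rate(OVERUSED_VOCAB, responses)
    )


if __name__ == "__main__":
    sample = [
        "That's a great question! Let's delve into the rich tapestry of RLHF.",
        "The derivative of x**2 is 2*x.",
    ]
    print(f"VTI = {compute_vti(sample):.3f}")  # illustrative score only
```

Regex matching is only one possible detector; a real implementation could equally score tic spans with a classifier and fold lexical diversity into the composite, as the paper's correlation analysis suggests.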

Abstract

As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, an increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics, repetitive and formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers ("That's a great question!", "Awesome!") to pseudo-empathetic affirmations ("I completely understand your concern", "I'm right here to catch you") and overused vocabulary ("delve", "tapestry", "nuanced"). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Using a custom framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. A human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the "alignment tax" of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.
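As a quick illustration of the reported sycophancy-naturalness relationship, the snippet below computes a Pearson correlation between per-model sycophancy scores and mean human naturalness ratings. The two arrays are invented placeholder values for eight models, not the paper's measurements; only the statistical procedure is what the abstract describes.

```python
from scipy.stats import pearsonr

# Placeholder per-model scores; the paper's underlying data are not reproduced here.
sycophancy_scores = [0.62, 0.41, 0.71, 0.55, 0.48, 0.44, 0.33, 0.52]
naturalness_ratings = [3.1, 4.0, 2.6, 3.4, 3.7, 3.9, 4.4, 3.5]  # 1-5 scale

r, p = pearsonr(sycophancy_scores, naturalness_ratings)
print(f"Pearson r = {r:.2f}, p = {p:.4f}")  # strongly negative on this toy data
```

With only eight model-level points, a correlation this strong would still carry wide confidence intervals; the paper's N = 120 human evaluation presumably operates on per-response or per-rater data, which the sketch does not attempt to model.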