Individual and Combined Effects of English as a Second Language and Typos on LLM Performance

arXiv cs.CL / 4/7/2026

💬 Opinion · Models & Research

Key Points

  • The paper studies how English-as-a-second-language (ESL) variation and typographical errors jointly affect large language model performance, motivated by the fact that both issues commonly co-occur in real use.
  • Using the Trans-EnV framework (to generate eight ESL variants) and MulTypo (to inject typos at low, moderate, and severe levels), the authors quantify performance changes under combined conditions.
  • The results show that combining ESL variation with typos typically causes larger performance drops than either factor alone, and the combined effect is not simply additive.
  • Degradation is characterized more consistently on closed-ended tasks; on open-ended tasks the findings are more mixed.
  • The study concludes that evaluations on clean standard English can overestimate real-world performance and that assessing ESL variation and typos separately does not fully reflect realistic model behavior.
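To make the severity-leveled typo injection concrete, here is a toy sketch of the general idea. It is not MulTypo itself: the rates in `SEVERITY_RATES` and the three edit operations (swap, delete, substitute) are illustrative assumptions, not the paper's actual configuration.

```python
import random

# Assumed per-word corruption rates for the three severity levels
# (illustrative values only; MulTypo's real settings may differ).
SEVERITY_RATES = {"low": 0.05, "moderate": 0.15, "severe": 0.30}

def inject_typos(text: str, level: str, seed: int = 0) -> str:
    """Corrupt roughly SEVERITY_RATES[level] of the words with one random
    character-level typo each (adjacent swap, deletion, or substitution)."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    rate = SEVERITY_RATES[level]
    out = []
    for w in text.split():
        # Skip very short words; they are rarely useful typo targets.
        if len(w) > 2 and rng.random() < rate:
            i = rng.randrange(len(w) - 1)
            op = rng.choice(["swap", "delete", "substitute"])
            if op == "swap":          # transpose adjacent characters
                w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
            elif op == "delete":      # drop one character
                w = w[:i] + w[i + 1:]
            else:                     # replace with a random letter
                w = w[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + w[i + 1:]
        out.append(w)
    return " ".join(out)

print(inject_typos("What is the capital of France?", "severe", seed=42))
```

In an evaluation like the paper's, each prompt (here, a standard-English or ESL-variant input) would be perturbed at each severity level and the model's accuracy compared against the clean baseline.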

Abstract

Large language models (LLMs) are used globally, and because much of their training data is in English, they typically perform best on English inputs. As a result, many non-native English speakers interact with them in English as a second language (ESL), and these inputs often contain typographical errors. Prior work has largely studied the effects of ESL variation and typographical errors separately, even though they often co-occur in real-world use. In this study, we use the Trans-EnV framework to transform standard English inputs into eight ESL variants and apply MulTypo to inject typos at three levels: low, moderate, and severe. We find that combining ESL variation and typos generally leads to larger performance drops than either factor alone, though the combined effect is not simply additive. This pattern is clearest on closed-ended tasks, where performance degradation can be characterized more consistently across ESL variants and typo levels, while results on open-ended tasks are more mixed. Overall, these findings suggest that evaluations on clean standard English may overestimate real-world model performance, and that evaluating ESL variation and typographical errors in isolation does not fully capture model behavior in realistic settings.