AI Navigate

There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

arXiv cs.AI / 3/12/2026

Key Points

  • The researchers created the Turkish Anomaly Suite (TAS) of 10 edge-case scenarios to test offline LLMs in Turkish heritage language education, evaluating epistemic resistance, logical consistency, and pedagogical safety.
  • In tests with 14 different models from 270M to 32B parameters, anomaly resistance did not scale straightforwardly with model size, challenging the assumption that bigger models are inherently safer or more reliable.
  • The study found that sycophancy bias can pose pedagogical risks even in large models, raising safety concerns for classroom use.
  • The results suggest that reasoning-focused models in the 8B-14B parameter range provide the best balance of cost and safety for language learners in offline deployments.
  • The work emphasizes privacy and reliability constraints of offline LLMs in education and underscores the need for careful evaluation before deployment.

Abstract

The integration of large language models (LLMs) into educational processes introduces significant constraints regarding data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study aims to systematically evaluate the robustness and pedagogical safety of locally deployable offline LLMs within the context of Turkish heritage language education. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models' capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B-14B parameter range represent the most balanced segment in terms of cost-safety trade-off for language learners.
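
To make the evaluation setup more concrete, here is a minimal sketch of how per-model results on a TAS-style suite might be aggregated across the three criteria named in the abstract. The abstract does not describe the authors' scoring scheme, so the pass/fail judgement per criterion, the `Judgement` record, the `pass_rates` helper, and the `model-a` label are all hypothetical, and the sample judgements are toy values rather than results from the study.

```python
from dataclasses import dataclass

# The three criteria named in the abstract. Scoring each one as a boolean
# pass/fail per scenario is an assumption, not the paper's stated method.
CRITERIA = ("epistemic_resistance", "logical_consistency", "pedagogical_safety")


@dataclass(frozen=True)
class Judgement:
    """One (hypothetical) human or automated verdict on a model's answer."""
    model: str        # illustrative model label, not a model from the study
    scenario_id: int  # 1..10 for a 10-scenario suite
    passed: dict      # criterion name -> bool


def pass_rates(judgements):
    """Return {model: {criterion: fraction of judged scenarios passed}}."""
    counts = {}
    for j in judgements:
        per_model = counts.setdefault(j.model, {c: [0, 0] for c in CRITERIA})
        for c in CRITERIA:
            per_model[c][0] += int(j.passed[c])  # scenarios passed
            per_model[c][1] += 1                 # scenarios judged
    return {
        model: {c: ok / total for c, (ok, total) in crit.items()}
        for model, crit in counts.items()
    }


# Toy judgements for illustration only; these are not data from the paper.
toy = [
    Judgement("model-a", 1, {"epistemic_resistance": True,
                             "logical_consistency": True,
                             "pedagogical_safety": False}),
    Judgement("model-a", 2, {"epistemic_resistance": True,
                             "logical_consistency": False,
                             "pedagogical_safety": True}),
]
rates = pass_rates(toy)
print(rates["model-a"])
```

Keeping the criteria separate instead of collapsing them into one score is a deliberate choice in this sketch: a model that resists false premises but still flatters the learner would otherwise look safer than it is.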