Can AI be a Teaching Partner? Evaluating ChatGPT, Gemini, and DeepSeek across Three Teaching Strategies

arXiv cs.AI / 3/31/2026

Key Points

The study compares ChatGPT, Gemini, and DeepSeek as “teaching agents” using an evaluation protocol focused on three pedagogical strategies for beginner C programming learners: Examples, Explanations & Analogies, and the Socratic Method.
Across Examples and Explanations & Analogies, the models show broadly similar interaction patterns.
For the Socratic Method, model behavior becomes more sensitive to both the chosen strategy and the initial prompt, indicating less consistent performance without careful prompting.
Human judges rated ChatGPT and Gemini higher overall, while DeepSeek scored lower across evaluation criteria, reflecting measurable differences in pedagogical quality among LLMs.
The paper addresses a gap in empirical evidence about LLM pedagogical skills by using systematic human evaluation rather than relying on general claims about AI tutoring.

Abstract

There are growing promises that Large Language Models (LLMs) can support students' learning by providing explanations, feedback, and guidance. However, despite their rapid adoption and widespread attention, there is still limited empirical evidence regarding the pedagogical skills of LLMs. This article presents a comparative study of three popular LLMs, namely ChatGPT, DeepSeek, and Gemini, acting as teaching agents. An evaluation protocol was developed, focusing on three pedagogical strategies: Examples, Explanations and Analogies, and the Socratic Method. Six human judges conducted the evaluations in the context of teaching the C programming language to beginners. The results indicate that the LLMs exhibited similar interaction patterns for two of the strategies: Examples, and Explanations and Analogies. In contrast, for the Socratic Method, the models showed greater sensitivity to the pedagogical strategy and the initial prompt. Overall, ChatGPT and Gemini received higher scores, whereas DeepSeek obtained lower scores across the criteria, indicating differences in pedagogical performance across models.