Evaluating Large Language Models on Computer Science University Exams in Data Structures

arXiv cs.CL / April 28, 2026


Key Points

  • The paper presents a comprehensive evaluation of large language models (LLMs) on computer science data structures exam questions.
  • It introduces a new benchmark dataset built from Tel Aviv University (TAU) exam questions to test LLM performance on closed and multiple-choice formats.
  • The study evaluates OpenAI’s GPT-4o and Anthropic’s Claude 3.5, along with smaller models (Mathstral 7B and LLaMA 3 8B), using the TAU exam benchmark.
  • The results are intended to shed light on how well today’s LLMs perform on CS education assessments and question-answering tasks.
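To make the evaluation setup concrete, here is a minimal sketch of how accuracy on a multiple-choice benchmark like this one is typically computed. The questions, choices, and the stand-in "model" below are invented for illustration and are not taken from the TAU dataset or the paper's pipeline.

```python
# Minimal sketch of multiple-choice benchmark scoring (illustrative only;
# the paper's actual evaluation pipeline is not shown here).
# The items below are invented examples, not TAU exam questions.

questions = [
    {"q": "Worst-case time to search an unsorted singly linked list of n nodes?",
     "choices": {"A": "O(1)", "B": "O(log n)", "C": "O(n)", "D": "O(n log n)"},
     "answer": "C"},
    {"q": "Amortized cost of push onto a dynamic array that doubles its capacity?",
     "choices": {"A": "O(1)", "B": "O(log n)", "C": "O(n)", "D": "O(n^2)"},
     "answer": "A"},
]

def score(predict, dataset):
    """Return the accuracy of predict(item) -> choice letter over a dataset."""
    correct = sum(1 for item in dataset if predict(item) == item["answer"])
    return correct / len(dataset)

# Stand-in "model" that always answers "C"; a real evaluation would
# send each question to an LLM API and parse its chosen letter.
baseline = lambda item: "C"
print(score(baseline, questions))  # 0.5 on this two-item example
```

In practice the interesting work lies in prompting each LLM with the question text and parsing a single choice letter from its free-form response; the scoring itself reduces to exact-match accuracy as above.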

Abstract

We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structures examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated two popular LLMs, OpenAI's GPT-4o and Anthropic's Claude 3.5, alongside two smaller models, Mathstral 7B and LLaMA 3 8B, on the TAU exam benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.