ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation

arXiv cs.AI / 3/17/2026

📰 NewsTools & Practical UsageModels & Research

共有:

Key Points

ManiBenchはManim CEコードを生成するLLMの性能を評価するベンチマークで、時間的忠実度とAPIのバージョン適合性を重視します。
2つの主要な失敗モードを対象します。Syntactic Hallucinations（存在しないまたは廃止済みのManim APIを参照する構文誤認）とVisual-Logic Drift（意図した数学的論理からビジュアルが逸脱する現象）です。
難易度5段階・計算論、線形代数、確率、位相、AIの5分野にまたがる150–200問を用意し、3Blue1BrownのManimGLソースを基に設計されています。
評価はExecutability、Version-Conflict Error Rate、Alignment Score、Coverage Scoreの4軸で行われ、複数モデルと prompting戦略を横断して評価するオープンソースの評価フレームワークを提供します。
コード・データ・ベンチマークスイートはGitHubとHuggingFaceで公開されています。

Abstract

Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We introduce ManiBench, a specialized benchmark evaluating LLM performance in generating Manim CE code, where temporal fidelity and version-aware API correctness are critical. ManiBench targets two key failure modes: Syntactic Hallucinations (valid Python referencing non-existent or deprecated Manim APIs) and Visual-Logic Drift (generated visuals diverging from intended mathematical logic through timing errors or missing causal relationships). The benchmark comprises 150-200 problems across five difficulty levels spanning calculus, linear algebra, probability, topology, and AI, grounded in analysis of 3Blue1Brown's ManimGL source (53,000 lines, 143 scene classes). Evaluation uses a four-tier framework measuring Executability, Version-Conflict Error Rate, Alignment Score, and Coverage Score. An open-source framework automates evaluation across multiple models and prompting strategies. Code, data and benchmark suite are available at https://github.com/nabin2004/ManiBench. and the dataset is hosted on https://huggingface.co/datasets/nabin2004/ManiBench.

We asked 200 ChatGPT users their biggest frustration. All top 5 answers are problems ChatGPT Toolbox solves.

Reddit r/artificial

I Built an AI That Reviews Every PR for Security Bugs — Here's How (2026)

Dev.to

[R] Combining Identity Anchors + Permission Hierarchies achieves 100% refusal in abliterated LLMs — system prompt only, no fine-tuning

Reddit r/MachineLearning

How I Built an AI SDR Agent That Finds Leads and Writes Personalized Cold Emails

Dev.to

Complete Guide: How To Make Money With Ai

Dev.to

ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation

Key Points

Abstract

Related Articles

We asked 200 ChatGPT users their biggest frustration. All top 5 answers are problems ChatGPT Toolbox solves.

I Built an AI That Reviews Every PR for Security Bugs — Here's How (2026)

[R] Combining Identity Anchors + Permission Hierarchies achieves 100% refusal in abliterated LLMs — system prompt only, no fine-tuning

How I Built an AI SDR Agent That Finds Leads and Writes Personalized Cold Emails

Complete Guide: How To Make Money With Ai

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer