ManiBench: A Benchmark for Testing Visual-Logic Drift and Syntactic Hallucinations in Manim Code Generation

arXiv cs.AI / 3/17/2026

Key Points

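ManiBench is a benchmark that evaluates how well LLMs generate Manim CE code, with an emphasis on temporal fidelity and version-aware API correctness.
It targets two main failure modes: Syntactic Hallucinations (valid Python that references non-existent or deprecated Manim APIs) and Visual-Logic Drift (visuals that diverge from the intended mathematical logic).
It comprises 150-200 problems across five difficulty levels and five domains (calculus, linear algebra, probability, topology, and AI), designed from an analysis of 3Blue1Brown's ManimGL source.
Evaluation runs along four axes: Executability, Version-Conflict Error Rate, Alignment Score, and Coverage Score. An open-source framework automates evaluation across multiple models and prompting strategies.
Code, data, and the benchmark suite are available on GitHub and Hugging Face.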
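
To make the Syntactic Hallucination failure mode concrete, here is a minimal sketch of a static check that flags identifiers which are valid Python but do not exist in current Manim CE. The deny-list `NON_CE_NAMES` and the helper `find_suspect_names` are hypothetical names introduced for illustration; they are not taken from ManiBench, and the three entries are only well-known renames (for example, ManimGL's `ShowCreation` corresponds to `Create` in Manim CE).

```python
import ast

# Hypothetical deny-list (an assumption for illustration, not ManiBench's own):
# names found in ManimGL or older Manim releases, mapped to the Manim CE
# equivalent a reviewer would likely suggest.
NON_CE_NAMES = {
    "ShowCreation": "Create",
    "TextMobject": "Tex",
    "TexMobject": "MathTex",
}


def find_suspect_names(source: str) -> list[tuple[int, str, str]]:
    """Return (line, name, suggested replacement) for each deny-listed name."""
    tree = ast.parse(source)  # raises SyntaxError if the code is not valid Python
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in NON_CE_NAMES:
            # Bare reference, e.g. ShowCreation(circle)
            hits.append((node.lineno, node.id, NON_CE_NAMES[node.id]))
        elif isinstance(node, ast.Attribute) and node.attr in NON_CE_NAMES:
            # Qualified reference, e.g. manim.ShowCreation(circle)
            hits.append((node.lineno, node.attr, NON_CE_NAMES[node.attr]))
    return sorted(hits)


# A generated scene that parses as Python but uses a ManimGL-era animation name.
generated = '''
from manim import *

class Demo(Scene):
    def construct(self):
        circle = Circle()
        self.play(ShowCreation(circle))
'''

for line, name, replacement in find_suspect_names(generated):
    print(f"line {line}: {name} is not in Manim CE; consider {replacement}")
```

A check like this only catches names someone has already listed. Actually rendering the scene remains the more reliable test, and a static scan cannot detect Visual-Logic Drift at all.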

Abstract

Traditional benchmarks like HumanEval and MBPP test logic and syntax effectively, but fail when code must produce dynamic, pedagogical visuals. We introduce ManiBench, a specialized benchmark evaluating LLM performance in generating Manim CE code, where temporal fidelity and version-aware API correctness are critical. ManiBench targets two key failure modes: Syntactic Hallucinations (valid Python referencing non-existent or deprecated Manim APIs) and Visual-Logic Drift (generated visuals diverging from intended mathematical logic through timing errors or missing causal relationships). The benchmark comprises 150-200 problems across five difficulty levels spanning calculus, linear algebra, probability, topology, and AI, grounded in analysis of 3Blue1Brown's ManimGL source (53,000 lines, 143 scene classes). Evaluation uses a four-tier framework measuring Executability, Version-Conflict Error Rate, Alignment Score, and Coverage Score. An open-source framework automates evaluation across multiple models and prompting strategies. Code, data, and the benchmark suite are available at https://github.com/nabin2004/ManiBench, and the dataset is hosted at https://huggingface.co/datasets/nabin2004/ManiBench.
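
The abstract names the four metrics but does not define them. As a rough sketch only, the first two could be aggregated as simple per-problem ratios. The `RunResult` record, its fields, and both formulas are assumptions made for illustration and may differ from ManiBench's actual definitions. Alignment Score and Coverage Score are left out because the abstract gives no indication of how they are computed.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical per-problem outcome of rendering one generated scene."""
    problem_id: str
    executed: bool          # the scene rendered without raising an exception
    version_conflict: bool  # the failure was traced to a non-CE or deprecated API


def executability(results: list[RunResult]) -> float:
    """Assumed definition: fraction of problems whose code ran to completion."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.executed for r in results) / len(results)


def version_conflict_error_rate(results: list[RunResult]) -> float:
    """Assumed definition: fraction of problems that failed on a version conflict."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.version_conflict for r in results) / len(results)


# Made-up outcomes for four problems, used only to exercise the functions.
results = [
    RunResult("p1", executed=True, version_conflict=False),
    RunResult("p2", executed=True, version_conflict=False),
    RunResult("p3", executed=False, version_conflict=True),
    RunResult("p4", executed=True, version_conflict=False),
]

print(f"Executability: {executability(results):.2f}")
print(f"Version-Conflict Error Rate: {version_conflict_error_rate(results):.2f}")
```

Keeping the two rates separate matters because a scene can fail to execute for reasons unrelated to API versions, so a low executability figure alone does not indicate hallucinated APIs.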
