Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
arXiv cs.AI / 3/24/2026
Key Points
- The paper proposes KidGym, a new 2D grid-based benchmark for evaluating multimodal large language models (MLLMs), with a framework inspired by children's intelligence tests.
- KidGym targets five interpretable capabilities—Execution, Perception Reasoning, Learning, Memory, and Planning—across 12 distinct tasks.
- The benchmark uses randomly generated layouts and varied scenarios/objects to provide more robust and generalizable evaluation of MLLM abilities.
- It is built to be user-customizable and extensible, enabling researchers to add scenarios and tune difficulty to fit different research needs.
- Experiments with state-of-the-art MLLMs reveal both strengths and notable limitations, and the authors release the benchmark publicly via their website.
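The paper does not include its layout generator here, but the core idea of seedable, randomly generated grid layouts with tunable parameters (grid size, object count) can be sketched roughly as follows. All function and parameter names below are hypothetical illustrations, not the benchmark's actual API:

```python
import random

def generate_grid(size=5, objects=("key", "door", "wall"), n_objects=4, seed=None):
    """Sketch of a randomized 2D grid layout: place n_objects objects
    (marked by their first letter) on an otherwise empty size x size grid.
    A fixed seed makes layouts reproducible; varying size/n_objects
    is one simple way to tune difficulty."""
    rng = random.Random(seed)
    grid = [["." for _ in range(size)] for _ in range(size)]
    cells = [(r, c) for r in range(size) for c in range(size)]
    for r, c in rng.sample(cells, n_objects):  # distinct cells, no overlap
        grid[r][c] = rng.choice(objects)[0].upper()
    return grid

# Example: a reproducible 5x5 layout with 4 objects
for row in generate_grid(size=5, n_objects=4, seed=42):
    print(" ".join(row))
```

Randomizing layouts this way, rather than fixing a static test set, is what lets a benchmark probe generalization instead of memorized instances.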