Hey there, we’re sharing KidGym, an interactive 2D grid-based benchmark for evaluating MLLMs in continuous, trajectory-based interaction, accepted to ICLR 2026.

Motivation: Many existing MLLM benchmarks are static and focus on isolated skills, which makes them less faithful for characterizing model capabilities in continuous interactive settings. Inspired by the Wechsler Intelligence Scale for Children (WISC), we organize evaluation into five cognitive dimensions and design tasks that probe both single abilities and compositional abilities.

[Figure: previews of the 12 tasks in KidGym]

KidGym Features:

[Figure: five-dimensional capability radar chart]

Findings: We find that while strong models can perform very well on some single-ability tasks, performance drops noticeably on tasks requiring:

- abstract, non-semantic visual reasoning
- numerical/counting sensitivity
- compositional reasoning across multiple rules and dimensions

We hope KidGym can provide a more fine-grained, interpretable, and interaction-oriented perspective for evaluating multimodal large models. Feedback and discussion are very welcome!

Paper: https://arxiv.org/abs/2603.20209
Project Page: https://bobo-ye.github.io/KidGym/
[R] Evaluating MLLMs with Child-Inspired Cognitive Tasks
Reddit r/MachineLearning / 2026/3/24
Key Points
- KidGym is an interactive 2D grid-based benchmark designed to evaluate multimodal large language models (MLLMs) in continuous, trajectory-driven interaction, and it has been accepted to ICLR 2026.
- The benchmark is inspired by the WISC framework and evaluates models across five cognitive dimensions—Execution, Memory, Learning, Planning, and Perception Reasoning—using both single-ability and compositional tasks.
- It includes 12 task categories with three difficulty levels, randomized layouts and diverse scenarios to reduce memorization/data leakage, and an LLM-friendly “backpack”/hint/item-indexing interaction interface.
- Initial results suggest strong models excel at some isolated abilities but show noticeable degradation on abstract/non-semantic visual reasoning, numerical/counting sensitivity, and multi-rule compositional reasoning across dimensions.
- The paper and open resources (project page and GitHub) aim to provide a more fine-grained and interpretable view of interaction-oriented multimodal model capabilities, encouraging community customization via a gym-style API.
- Provides: an ICLR 2026-accepted benchmark release and early findings about weaknesses in trajectory-based, compositional cognitive-task evaluation for MLLMs.
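The "gym-style API" mentioned above typically means an environment object exposing `reset()` and `step(action)`, so an agent (here, an MLLM policy) interacts over a trajectory of observations. The sketch below illustrates that protocol with a toy 2D grid environment; all class and method names are hypothetical and do not reflect KidGym's actual interface.

```python
# Illustrative gym-style interaction loop. The ToyGridEnv class is a
# stand-in invented for this example, NOT KidGym's real API.
from dataclasses import dataclass


@dataclass
class ToyGridEnv:
    """Minimal 2D grid env following the common gym-style protocol:
    reset() -> observation; step(action) -> (observation, reward, done, info)."""
    size: int = 4          # grid is size x size; goal is the bottom-right cell
    pos: tuple = (0, 0)
    steps: int = 0

    def reset(self):
        self.pos, self.steps = (0, 0), 0
        return self._obs()

    def step(self, action):
        # Map a discrete action name to a grid displacement.
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        # Clamp movement to the grid boundaries.
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos, self.steps = (x, y), self.steps + 1
        done = self.pos == (self.size - 1, self.size - 1)
        reward = 1.0 if done else 0.0
        return self._obs(), reward, done, {"steps": self.steps}

    def _obs(self):
        # A real benchmark would return a rendered grid image for the MLLM;
        # here we return a symbolic observation for simplicity.
        return {"agent": self.pos, "grid_size": self.size}


# Usage: a scripted "policy" stands in for the model choosing actions.
env = ToyGridEnv()
obs = env.reset()
total = 0.0
for a in ["right", "right", "right", "down", "down", "down"]:
    obs, r, done, info = env.step(a)
    total += r
    if done:
        break
```

This reset/step loop is the customization point a gym-style benchmark exposes: new tasks plug in by changing the environment's dynamics and observations while agents interact through the same two calls.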
