MirrorBench: Evaluating Self-centric Intelligence in MLLMs by Introducing a Mirror

arXiv cs.AI / April 17, 2026

Key Points

  • The paper introduces MirrorBench, a simulation-based benchmark designed to evaluate “self-centric” intelligence in multimodal large language models (MLLMs), going beyond existing benchmarks focused on external-object understanding.
  • MirrorBench is inspired by the Mirror Self-Recognition (MSR) test from psychology and uses a tiered set of tasks that increase in difficulty, from basic visual perception to higher-level self-representation.
  • Experiments on leading MLLMs show that their performance is substantially worse than human performance even at the lowest benchmark tier, indicating fundamental limits in self-referential understanding.
  • The authors propose a framework that connects psychological self-recognition paradigms with embodied-intelligence evaluation, supporting principled measurement of the emergence of general intelligence in large models.

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated remarkable advances in perception and reasoning, suggesting their potential for embodied intelligence. While recent studies have evaluated embodied MLLMs in interactive settings, current benchmarks mainly target capabilities to perceive, understand, and interact with external objects, lacking a systematic evaluation of self-centric intelligence. To address this, we introduce MirrorBench, a simulation-based benchmark inspired by the classical Mirror Self-Recognition (MSR) test in psychology. MirrorBench extends this paradigm to embodied MLLMs through a tiered framework of progressively challenging tasks, assessing agents from basic visual perception to high-level self-representation. Experiments on leading MLLMs show that even at the lowest level, their performance remains substantially inferior to human performance, revealing fundamental limitations in self-referential understanding. Our study bridges psychological paradigms and embodied intelligence, offering a principled framework for evaluating the emergence of general intelligence in large models. Project page: https://fflahm.github.io/mirror-bench-page/.
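To make the tiered setup concrete, below is a minimal Python sketch of how a progressive, mirror-based evaluation harness could be organized: tasks are graded from the easiest tier upward and pass rates are reported per tier. All names here (`Task`, `query_mllm`, the tier numbering and prompts) are illustrative assumptions, not the paper's actual task suite or simulator interface, which are documented on the project page.

```python
# Hypothetical sketch of a tiered, mirror-based evaluation harness.
# Tier names, prompts, and the query_mllm call are illustrative
# assumptions, not MirrorBench's real API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    tier: int                      # 1 = basic visual perception, higher = self-representation
    prompt: str                    # question posed to the agent about the mirror scene
    check: Callable[[str], bool]   # grades the model's free-form answer

def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to the MLLM under evaluation."""
    raise NotImplementedError

def evaluate(tasks: list[Task], image_path: str) -> dict[int, float]:
    """Return per-tier pass rates, running the easiest tiers first."""
    per_tier: dict[int, list[bool]] = {}
    for task in sorted(tasks, key=lambda t: t.tier):
        answer = query_mllm(image_path, task.prompt)
        per_tier.setdefault(task.tier, []).append(task.check(answer))
    return {tier: sum(results) / len(results) for tier, results in per_tier.items()}

# Example tasks, loosely mirroring a perception-to-self-representation ladder:
tasks = [
    Task(1, "Is there a mirror in this scene?",
         lambda a: "yes" in a.lower()),
    Task(3, "Is the agent shown in the mirror you or another agent?",
         lambda a: "me" in a.lower() or "myself" in a.lower()),
]
```

Ordering tasks by tier reflects the benchmark's premise that self-representation builds on lower-level perception; reporting pass rates per tier, rather than a single aggregate score, makes it visible where in the ladder a model fails, which is how the paper's finding of weakness even at the lowest tier would surface.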