Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games

arXiv cs.CV / 4/21/2026


Key Points

  • The paper proposes HyMOR, a hybrid open-ended object recognition framework that combines an MLLM for coarse-grained, open-ended recognition with a CLIP-style model for fine-grained, domain-specific object identification.
  • HyMOR is designed to improve object understanding across multiple semantic granularities, providing a perceptual backbone for downstream multimodal content generation and interactive educational gameplay.
  • The authors introduce the TBO (TextBook Objects) dataset with 20,942 images and 8,816 categories extracted from textbooks to enable evaluation in education-oriented, content-rich scenarios.
  • Experiments report that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% and improves general object recognition by 2.5% over a baseline MLLM; overall, it achieves a 23.2% average gain in Sentence-BERT (SBert) similarity across all evaluated datasets.
  • The work targets interactive learning applications by focusing on robust, accurate recognition performance that can support multimodal generation and game content creation.
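The hybrid design above can be sketched as a simple routing rule: the MLLM always produces an open-ended coarse label, and CLIP is invoked only when that label falls into a fine-grained domain. The paper does not publish this interface, so the function names, the domain set, and the routing rule below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of HyMOR-style routing between an MLLM and CLIP.
# All names here are assumptions; only the coarse-then-fine idea is from the paper.
from typing import Callable, Sequence

# Domains the paper singles out as fine-grained and domain-specific.
FINE_GRAINED_DOMAINS = {"animal", "plant"}

def recognize(
    image: object,
    mllm_recognize: Callable[[object], tuple[str, str]],
    clip_classify: Callable[[object, Sequence[str]], str],
    domain_labels: dict[str, Sequence[str]],
) -> str:
    """Return the MLLM's open-ended coarse label, refined by zero-shot
    CLIP classification when the prediction falls into a fine-grained domain."""
    coarse_label, domain = mllm_recognize(image)  # open-ended, coarse-grained
    if domain in FINE_GRAINED_DOMAINS and domain in domain_labels:
        # Fine-grained path: CLIP picks from a domain-specific label set.
        return clip_classify(image, domain_labels[domain])
    return coarse_label  # general path: keep the MLLM's answer

# Toy stand-ins so the sketch runs without any model weights.
mllm = lambda img: ("bird", "animal")
clip = lambda img, labels: labels[0]  # pretend CLIP scores the first label highest
labels = {"animal": ["scarlet macaw", "bald eagle"]}

print(recognize(None, mllm, clip, labels))  # → scarlet macaw (fine-grained path)
```

In a real system the two callables would wrap an MLLM prompt and a CLIP image/text encoder; the point of the sketch is only that routing keeps each model on the granularity it handles best.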

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% while improving general object recognition by 2.5% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
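The SBert metric used above rewards open-ended predictions that are semantically close to the ground-truth category even when the strings differ. A minimal sketch of that computation, with a toy `embed` function standing in for a real Sentence-BERT encoder (in practice one would use, e.g., sentence-transformers' `model.encode`, which is an assumption, not a detail from the paper):

```python
# Sketch of the evaluation metric: mean cosine similarity between
# sentence embeddings of predicted and ground-truth category names.
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_sbert_similarity(preds, golds, embed) -> float:
    """Average cosine similarity over (prediction, ground-truth) pairs;
    `embed` maps a label string to its embedding vector."""
    return sum(cosine(embed(p), embed(g)) for p, g in zip(preds, golds)) / len(preds)

# Toy 2-D "embeddings": near-synonyms get nearby vectors.
toy = {"macaw": [1.0, 0.0], "parrot": [0.8, 0.6], "oak": [0.0, 1.0]}
score = average_sbert_similarity(["macaw", "oak"], ["parrot", "oak"], toy.get)
print(round(score, 3))  # → 0.9 (one near-match at 0.8, one exact match at 1.0)
```

This is why the paper reports gaps as small as 0.2%: the metric is continuous, so a prediction like "parrot" for ground truth "macaw" earns partial credit rather than a hard miss.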