SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

arXiv cs.CL / 4/23/2026

💬 Opinion · Models & Research

Key Points

  • The paper targets Grounded Multimodal Named Entity Recognition (GMNER), which must identify named entities and localize their corresponding visual regions from image-text pairs in open-world social media settings.
  • It argues that prior methods over-rely on either noisy heuristic retrieval (hurting precision on known entities) or internal LLM-based refinement (limited by model knowledge and prone to hallucinations).
  • The proposed SAKE framework combines internal “knowledge exploitation” with external “knowledge exploration” using self-aware reasoning and adaptive invocation of search tools.
  • SAKE is trained in two stages: difficulty-aware search tag generation to produce entity-level uncertainty signals, and SAKE-SeCoT supervised fine-tuning to teach self-awareness and tool use.
  • In the second training stage, agentic reinforcement learning with a hybrid reward that penalizes unnecessary retrieval teaches the model when searching is truly needed; experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
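The uncertainty signal behind difficulty-aware search tag generation can be sketched as follows. The paper only states that entity-level uncertainty is quantified through multiple forward samplings; the agreement-ratio measure, the `<search>`/`<no_search>` tag names, and the threshold below are illustrative assumptions, not SAKE's exact formulation.

```python
from collections import Counter

def search_tag(samples, threshold=0.5):
    """Tag an entity for external search when the model's predictions
    disagree across multiple stochastic forward passes.

    `samples` is a list of entity predictions (e.g. type labels or
    surface forms) from N samplings. Agreement ratio is an assumed
    stand-in for the paper's entity-level uncertainty signal.
    """
    counts = Counter(samples)
    top_freq = counts.most_common(1)[0][1]
    agreement = top_freq / len(samples)  # 1.0 = fully consistent
    return "<search>" if agreement < threshold else "<no_search>"

# Consistent predictions -> the model likely knows the entity internally
print(search_tag(["PER"] * 8 + ["ORG"] * 2))        # <no_search>
# Scattered predictions -> knowledge gap, tag for retrieval
print(search_tag(["PER", "ORG", "LOC", "PER", "MISC"]))  # <search>
```

Tags produced this way can then annotate the SAKE-SeCoT chain-of-thought traces, so supervised fine-tuning sees explicit knowledge-gap markers rather than having to infer them.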

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
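The hybrid reward in the RL stage can be illustrated with a minimal sketch. The abstract specifies only that the reward penalizes unnecessary retrieval; the additive structure, the weight values, and the argument names here are assumptions made for illustration.

```python
def hybrid_reward(correct, used_search, search_was_needed,
                  task_weight=1.0, search_penalty=0.3):
    """Assumed hybrid reward: pay for task correctness, subtract a
    penalty when the agent invoked retrieval on an entity it could
    have resolved from internal knowledge. This pressure moves the
    policy from imitating search traces toward self-aware decisions
    about when retrieval is truly necessary.
    """
    reward = task_weight if correct else 0.0
    if used_search and not search_was_needed:
        reward -= search_penalty  # discourage unnecessary tool calls
    return reward

# Correct answer via internal knowledge: full reward
print(hybrid_reward(correct=True, used_search=False, search_was_needed=False))  # 1.0
# Correct answer, but the search call was superfluous: reduced reward
print(hybrid_reward(correct=True, used_search=True, search_was_needed=False))   # 0.7
```

Under such a reward, always searching is no longer a safe default, which is what distinguishes genuine self-aware tool use from rigid search imitation.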