Code-MIE: A Code-style Model for Multimodal Information Extraction with Scene Graph and Entity Attribute Knowledge Enhancement

arXiv cs.CL / 3/24/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes Code-MIE, a code-style framework that reframes multimodal information extraction (MIE) as structured code understanding and generation rather than template-based text I/O.
  • It enhances entity extraction by incorporating entity attribute knowledge (e.g., gender and affiliation) extracted from text to better condition the model on context.
  • Images are converted into scene graphs and paired with visual features so the model can incorporate relational and visual evidence during extraction.
  • The approach uses a Python-function input template (taking entity attributes, scene graphs, and raw text as parameters) and outputs extraction results as Python dictionaries containing entities, relations, and related fields (see the sketch after this list).
  • Experiments on M^3D, Twitter-15, Twitter-17, and MNRE report state-of-the-art results, outperforming six competing multimodal IE baselines.
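
To make the code-style formulation concrete, here is a minimal, hypothetical sketch of what such an input/output template might look like. The function and parameter names (`build_prompt`, `extract_information`, `entity_attributes`, `scene_graph`) and all example values are illustrative assumptions, not the paper's published template.

```python
import ast

def build_prompt(text, entity_attributes, scene_graph):
    """Render the multimodal inputs as a Python-style function call,
    mirroring the code-style input template described in the paper
    (function and parameter names here are illustrative)."""
    return (
        "extract_information(\n"
        f"    text={text!r},\n"
        f"    entity_attributes={entity_attributes!r},\n"
        f"    scene_graph={scene_graph!r},\n"
        ")"
    )

# Example multimodal input: a tweet paired with the scene graph of its image.
prompt = build_prompt(
    text="LeBron James scores 40 for the Lakers.",
    entity_attributes={"LeBron James": {"gender": "male", "affiliation": "Lakers"}},
    scene_graph=[("man", "holding", "basketball"), ("man", "wearing", "jersey")],
)
print(prompt)

# The LLM is expected to reply with a Python dictionary of extraction
# results, which can then be parsed back into structured objects.
model_output = (
    "{'entities': [{'span': 'LeBron James', 'type': 'PER'},"
    " {'span': 'Lakers', 'type': 'ORG'}],"
    " 'relations': [{'head': 'LeBron James', 'tail': 'Lakers', 'type': 'member_of'}]}"
)
result = ast.literal_eval(model_output)
print(result["entities"])
print(result["relations"])
```

The appeal of this format is that both sides of the exchange are already structured: the input slots naturally accommodate attributes and scene-graph triples, and the output needs no brittle template parsing beyond evaluating a Python literal.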

Abstract

With the rapid development of large language models (LLMs), researchers have paid increasing attention to LLM-based information extraction. However, existing methods still leave room for improvement. First, existing multimodal information extraction (MIE) methods usually employ natural language templates as the input and output of LLMs, which mismatches the nature of IE tasks, whose targets are mostly structured information such as entities and relations. Second, although a few methods have adopted structured, more IE-friendly code-style templates, they have only been explored on text-only IE rather than multimodal IE; moreover, their designs are more complex, requiring a separate template for each task. In this paper, we propose a Code-style Multimodal Information Extraction framework (Code-MIE), which formalizes MIE as unified code understanding and generation. Code-MIE has the following novel designs: (1) Entity attributes such as gender and affiliation are extracted from the text to guide the model in understanding the context and role of entities. (2) Images are converted into scene graphs and visual features to incorporate rich visual information into the model. (3) The input template is constructed as a Python function whose parameters are the entity attributes, scene graphs, and raw text, while the output template is formalized as Python dictionaries containing all extraction results, such as entities and relations. To evaluate Code-MIE, we conducted extensive experiments on the M^3D, Twitter-15, Twitter-17, and MNRE datasets. The results show that our method achieves state-of-the-art performance compared to six competing baseline models, with 61.03% and 60.49% on the English and Chinese datasets of M^3D, and 76.04%, 88.07%, and 73.94% on the other three datasets.
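
As a companion to the input/output sketch above, the snippet below shows one plausible way the scene-graph side could be serialized into triples before being embedded in the prompt. The detector output format and all names here are assumptions for illustration, not the paper's actual pipeline; any off-the-shelf scene graph generator could be adapted similarly.

```python
def scene_graph_to_triples(objects, relations):
    """Map index-based relation predictions onto object labels,
    producing (subject, predicate, object) triples."""
    return [(objects[s], predicate, objects[o]) for s, predicate, o in relations]

# Hypothetical detector output for an image of a player holding a ball.
objects = ["man", "basketball", "jersey"]
relations = [(0, "holding", 1), (0, "wearing", 2)]

print(scene_graph_to_triples(objects, relations))
# -> [('man', 'holding', 'basketball'), ('man', 'wearing', 'jersey')]
```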