Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

arXiv cs.AI / 4/22/2026


Key Points

  • The paper argues that post-training reinforcement learning is key to improving LLM reasoning, but highlights that “visual semantic arithmetic” (inferring relationships from images) has been less studied.
  • It formulates new benchmark tasks—two-term subtraction and three-term operations—and introduces the Image-Relation-Pair Dataset (IRPD) to systematically evaluate image-based relational reasoning.
  • The authors propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models using a verifiable training signal and Group Relative Policy Optimization (GRPO).
  • The approach achieves state-of-the-art performance on IRPD and also performs well on the real-world Visual7W-Telling dataset.
  • By grounding symbolic relational reasoning in perception, the work targets improvements relevant to domestic and service robotics operating in unstructured environments.
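The GRPO step mentioned above can be sketched at a high level: for each prompt, a group of candidate answers is sampled, each is scored by a verifiable reward (e.g. exact match against the target relation), and each answer's advantage is its reward normalized against the group's statistics. The snippet below is a minimal illustration of that group-relative normalization, not the paper's actual implementation; the function name and the toy reward values are assumptions.

```python
# Hypothetical sketch of the group-relative advantage used by GRPO:
# rewards from one sampled group are normalized by the group's mean
# and standard deviation, so no separate value network is needed.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Toy example: 4 sampled answers, only the first passes the verifier.
advs = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
print(advs)  # the correct answer gets a positive advantage, the rest negative
```

Responses that beat the group average are reinforced and the rest are penalized, which is what makes a simple binary verifier usable as a training signal.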

Abstract

Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, i.e., inferring relationships from images, remains underexplored. The classic text analogy "king" - "man" + "woman" = "queen" illustrates relational reasoning, yet replacing the text with images of "king" and "man" significantly reduces performance, because the task requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that "powder" and "cake" are related by "is made of" grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but it suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable reward function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots' ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.
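The "king" - "man" + "woman" = "queen" analogy in the abstract can be made concrete with a toy example. The snippet below uses hand-made 3-dimensional embeddings (an assumption for illustration, not real word vectors or anything from the paper) and shows that the arithmetic result lands nearest "queen" under cosine similarity:

```python
# Toy demonstration of semantic vector arithmetic. The embeddings are
# hypothetical, with dimensions loosely meaning [royalty, male, female].
import math

emb = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# "king" - "man" + "woman", computed component-wise.
query = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Nearest remaining word by cosine similarity (excluding the source word).
best = max((w for w in emb if w != "king"), key=lambda w: cosine(query, emb[w]))
print(best)  # → queen
```

Replacing the text terms with images is what breaks this pipeline in practice: image features carry distracting visual detail and sit in a different embedding space, which is the modality gap the paper targets.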