SmoGVLM: A Small, Graph-enhanced Vision-Language Model

arXiv cs.CV · April 21, 2026


Key Points

  • The paper introduces SmoGVLM, a small vision-language model that uses graph neural networks to integrate structured knowledge across visual and text modalities.
  • It targets common issues in large VLMs such as hallucination and weak grounding in knowledge-intensive reasoning.
  • The authors evaluate SmoGVLM across multiple model sizes (from 1.3B to 13B) to study how graph-enhanced training affects performance scaling.
  • Results show that a small SmoGVLM improves performance by up to 16.24% over its baseline and outperforms larger VLMs and strong fine-tuned baselines.
  • The work suggests that structured knowledge augmentation can enable more efficient, smaller-scale multimodal reasoning systems without relying solely on very large model sizes.

Abstract

Large vision-language models (VLMs) achieve strong performance on multimodal tasks but often suffer from hallucination and poor grounding in knowledge-intensive reasoning. We propose SmoGVLM, a small, graph-enhanced VLM that integrates structured knowledge with visual and textual modalities using graph neural networks (GNNs). We investigate the effects of our method across a range of model sizes, from tiny (1.3B) to large (13B) models. Our results demonstrate that, when trained using our approach, a small model can achieve performance gains of up to 16.24%, surpassing larger VLMs and strong fine-tuned baselines. These results highlight the potential of structured knowledge augmentation for efficient, smaller-scale multimodal reasoning systems.
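The abstract does not spell out how the GNN output is combined with the visual and textual streams. As a rough illustration of the general idea (not the paper's actual architecture), the sketch below runs GCN-style message passing over a tiny knowledge graph and late-fuses the pooled graph summary with pooled vision and text features; all dimensions, the mean-aggregation rule, and the fusion projection are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared embedding dimension (illustrative)

# Hypothetical inputs: pooled vision/text features and a small knowledge graph.
vision_feat = rng.normal(size=D)
text_feat = rng.normal(size=D)
node_feats = rng.normal(size=(4, D))      # 4 knowledge-graph node embeddings
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # undirected edges

def gnn_layer(h, edges):
    """One GCN-style layer: mean-aggregate self + neighbors, then nonlinearity."""
    agg = h.copy()                  # self-loop contribution
    deg = np.ones(len(h))
    for i, j in edges:
        agg[i] += h[j]; agg[j] += h[i]
        deg[i] += 1; deg[j] += 1
    return np.tanh(agg / deg[:, None])

# Two rounds of message passing propagate structured knowledge across the graph.
h = gnn_layer(gnn_layer(node_feats, edges), edges)
graph_feat = h.mean(axis=0)         # pooled graph summary vector

# Late fusion: concatenate the three modalities, project back to the shared space.
W = rng.normal(size=(3 * D, D)) / np.sqrt(3 * D)  # assumed (learned) projection
fused = np.concatenate([vision_feat, text_feat, graph_feat]) @ W
print(fused.shape)  # (8,)
```

In a real model the fused representation would condition the language decoder; a cross-attention fusion over per-node embeddings would be an equally plausible reading of "integrates structured knowledge with visual and textual modalities."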