Belief-Aware VLM Model for Human-like Reasoning

arXiv cs.AI / 4/14/2026


Key Points

  • The paper argues that current vision-language models infer intent using observable states but struggle to generalize in dynamic, long-horizon settings because they lack explicit belief tracking.
  • It proposes a belief-aware VLM framework that approximates human-like belief via a retrieval-based vector memory storing multimodal context, instead of training a separate explicit belief model.
  • The retrieved belief-relevant context is fed into the VLM to improve reasoning, and decision-making is further optimized using reinforcement learning over the model’s latent space.
  • Experiments on VQA datasets (including HD-EPIC) show consistent gains versus zero-shot baselines, suggesting belief-aware reasoning improves performance.
  • Overall, the work positions belief updating and long-horizon intent capture as key missing components for VLM/VLA systems aspiring to human-like reasoning.
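The vector-memory idea in the key points above can be sketched in a few lines. This is a hypothetical, minimal interface (class and method names are my own, not from the paper): context embeddings are stored alongside their multimodal descriptions, and the top-k most similar entries are retrieved by cosine similarity to approximate the current belief-relevant context.

```python
import numpy as np

class VectorMemory:
    """Minimal sketch of a retrieval-based belief memory.

    Hypothetical interface, not the paper's implementation: stores
    embeddings of multimodal context and retrieves the top-k entries
    most similar to a query embedding.
    """

    def __init__(self, dim: int):
        self.dim = dim
        self.keys: list[np.ndarray] = []   # context embeddings
        self.values: list[str] = []        # associated context (e.g. captions)

    def add(self, embedding: np.ndarray, context: str) -> None:
        # Normalize so a dot product equals cosine similarity.
        self.keys.append(embedding / np.linalg.norm(embedding))
        self.values.append(context)

    def retrieve(self, query: np.ndarray, k: int = 3) -> list[str]:
        if not self.keys:
            return []
        q = query / np.linalg.norm(query)
        sims = np.stack(self.keys) @ q          # cosine similarity per entry
        top = np.argsort(sims)[::-1][:k]        # indices of the k best matches
        return [self.values[i] for i in top]
```

The retrieved strings would then be prepended to the VLM prompt as belief context; in practice the values could be image crops or frame features rather than text.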

Abstract

Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Language Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, limiting their ability to reason like humans or to capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine decision-making using a reinforcement learning policy over the VLM latent space. We evaluate our approach on publicly available VQA datasets such as HD-EPIC and demonstrate consistent improvements over zero-shot baselines, highlighting the importance of belief-aware reasoning.
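The abstract's "reinforcement learning policy over the VLM latent space" can be illustrated with a toy sketch. Everything here is an assumption for illustration (the summary does not specify the RL formulation): a softmax-linear policy head reads a frozen latent vector, and REINFORCE updates push it toward a synthetically rewarded action.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM, N_ACTIONS, LR = 8, 4, 0.5       # toy sizes, not from the paper
W = np.zeros((N_ACTIONS, LATENT_DIM))       # policy-head parameters
latent = np.ones(LATENT_DIM)                # fixed stand-in for a VLM latent vector

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(500):
    probs = softmax(W @ latent)             # action distribution pi(a | latent)
    action = rng.choice(N_ACTIONS, p=probs)
    reward = 1.0 if action == 2 else 0.0    # synthetic reward: action 2 is "correct"
    # REINFORCE update: reward * grad log pi(action | latent)
    grad = np.outer(np.eye(N_ACTIONS)[action] - probs, latent)
    W += LR * reward * grad
```

After training, the policy concentrates its probability mass on the rewarded action; in the paper's setting the latent would come from the VLM (conditioned on retrieved belief context) and the reward from task success.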