MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

arXiv cs.CV / 4/14/2026


Key Points

  • The paper introduces MedLVR, a latent visual reasoning framework for medical visual question answering that addresses a key limitation of existing VLMs: reasoning dominated by text over images that are encoded once as static context.
  • MedLVR adds an explicit latent visual evidence state into autoregressive decoding by interleaving short latent reasoning steps that iteratively preserve and refine query-relevant visual information.
  • It uses a two-stage training approach: ROI-supervised fine-tuning to align latent states with clinically relevant regions, followed by Visual-Latent Policy Optimization (VLPO) to optimize both latent reasoning and answer generation via outcome-level rewards.
  • Experiments on OmniMedVQA and five additional medical VQA benchmarks show consistent gains over recent reasoning baselines, improving the average score of the Qwen2.5-VL-7B backbone from 48.3% to 53.4%.

Abstract

Medical vision–language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose MedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, MedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. These results indicate that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.
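The core mechanism in the abstract, interleaving a latent reasoning segment that reuses decoder hidden states as continuous inputs instead of sampling text tokens, can be illustrated with a minimal sketch. This is not the paper's implementation: the decoder step, dimensions, and names (`toy_decoder_step`, `K_LATENT`) are all illustrative stand-ins, with attention over image tokens approximated by a linear map plus a residual visual term.

```python
import math
import random

random.seed(0)
D = 16         # toy hidden size
K_LATENT = 4   # number of interleaved latent reasoning steps

# Stand-ins for decoder weights and the static image encoding.
W = [[random.gauss(0, 1) / math.sqrt(D) for _ in range(D)] for _ in range(D)]
visual_feats = [random.gauss(0, 1) for _ in range(D)]

def toy_decoder_step(inp, visual):
    """One decoder step: mix the input embedding with visual features.

    A real VLM decoder attends over image tokens; here that is
    approximated by a linear map plus a weighted visual term.
    """
    return [math.tanh(sum(w * x for w, x in zip(row, inp)) + 0.5 * v)
            for row, v in zip(W, visual)]

# Start from a query embedding (stand-in for the encoded question).
state = [random.gauss(0, 1) for _ in range(D)]

# Latent segment: the hidden state itself is fed back as the next input
# embedding, so no discrete tokens are emitted during these steps and the
# query-relevant visual evidence can be iteratively refined.
for _ in range(K_LATENT):
    state = toy_decoder_step(state, visual_feats)

# After the latent segment, `state` would condition ordinary
# autoregressive answer generation (omitted here).
print(len(state))  # → 16
```

The point of the loop is that each latent step re-reads the visual features conditioned on the evolving state, which is the "iterative preservation and refinement of query-relevant visual evidence" the abstract describes; the paper's ROI supervision and VLPO then shape what those latent states encode.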