Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

arXiv cs.CV / 3/23/2026


Key Points

  • The paper proposes instruction-free tuning for large vision-language models in the medical domain by training on image-description pairs instead of curated image-instruction-output data.
  • It introduces a momentum proxy instruction that substitutes for handcrafted instructions, preserving the pre-trained model's instruction-following behavior while promoting updates to parameters that remain valid at inference time.
  • A response shuffling strategy reduces the model's over-reliance on preceding words in the description, enabling more effective fine-tuning.
  • The method achieves state-of-the-art accuracy on multiple-choice visual question answering benchmarks (SKINCON, WBCAtt, CBIS, MIMIC-CXR), demonstrating improved fine-tuning efficiency in medical LVLMs.
  • This approach lowers the barrier to adapting LVLMs for medical instruction following by reducing dependence on expert-crafted datasets.
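The paper does not publish pseudocode here, but the two ingredients above can be illustrated with a minimal sketch. Assumptions not in the source: the "momentum" update is modeled as an exponential moving average (EMA) over an instruction embedding, and "response shuffling" is modeled as sentence-level reordering of the description; the function names and the EMA formulation are hypothetical.

```python
import random

def momentum_update(proxy, batch_embedding, m=0.99):
    """Hypothetical EMA form of a momentum proxy instruction:
    proxy <- m * proxy + (1 - m) * batch_embedding."""
    return [m * p + (1.0 - m) * b for p, b in zip(proxy, batch_embedding)]

def shuffle_response(description, rng):
    """Sketch of response shuffling: permute sentence order so the
    model cannot predict each token from the preceding words alone."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."
```

In this reading, the EMA keeps the proxy instruction slowly varying across batches, while shuffling breaks the fixed left-to-right dependence of the description text during fine-tuning.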

Abstract

Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.