Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

arXiv cs.CV / 3/23/2026


Key Points

  • The paper proposes instruction-free tuning for large vision-language models in the medical domain by training on image-description pairs instead of curated image-instruction-output data.
  • It introduces a momentum proxy instruction that substitutes for handcrafted instructions, preserving the pre-trained model's instruction-following behavior while promoting updates to parameters that remain valid at inference time.
  • A response shuffling strategy reduces the model's over-reliance on preceding words in the description, enabling more effective fine-tuning.
  • The method achieves state-of-the-art accuracy on multiple-choice visual question answering benchmarks (SKINCON, WBCAtt, CBIS, MIMIC-CXR), demonstrating improved fine-tuning efficiency in medical LVLMs.
  • This approach lowers the barrier to adapting LVLMs for medical instruction following by reducing dependence on expert-crafted datasets.
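The paper does not publish pseudocode here, but the two ingredients above can be illustrated with a minimal sketch. Assumptions not in the source: the "momentum" update is modeled as an exponential moving average (EMA) over an instruction embedding, and "response shuffling" is modeled as sentence-level reordering of the description; the function names and the EMA formulation are hypothetical.

```python
import random

def momentum_update(proxy, batch_embedding, m=0.99):
    """Hypothetical EMA form of a momentum proxy instruction:
    proxy <- m * proxy + (1 - m) * batch_embedding."""
    return [m * p + (1.0 - m) * b for p, b in zip(proxy, batch_embedding)]

def shuffle_response(description, rng):
    """Sketch of response shuffling: permute sentence order so the
    model cannot predict each token from the preceding words alone."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."
```

In this reading, the EMA keeps the proxy instruction slowly varying across batches, while shuffling breaks the fixed left-to-right dependence of the description text during fine-tuning.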

Abstract

Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.