Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption

arXiv cs.CV / 3/31/2026


Key Points

  • The paper studies whether compressed/quantized vision-language models fail in qualitatively different ways from larger FP16 VLMs when faced with visual corruption, not just at lower accuracy.
  • It compares a 4-bit quantized 7B model (Qwen2.5-VL-7B, NF4) to a 500M FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO, using a three-part error taxonomy: Object Blindness, Semantic Drift, and Prior Bias.
  • Semantic Drift is identified as the dominant failure mode on VQAv2 for both models and on COCO specifically for Qwen, while Prior Bias appears on VQAv2 but is absent on COCO for both.
  • The compact model shows a significantly larger “negation collapse” under compositional negation probes, driven largely by COCO (a statistically significant 12.5pp gap), and a key template (false_yn) reveals extreme bias toward “Yes” on COCO for SmolVLM2.
  • The authors evaluate confidence calibration via Expected Calibration Error (ECE), include blur robustness experiments, and release a fully reproducible pipeline intended for systematic safety auditing before edge deployment.
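The calibration metric used above is easy to state concretely. A minimal sketch of ECE with equal-width confidence bins, using the geometric mean of token probabilities as the per-answer confidence (function names and the 10-bin choice are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def geometric_mean_confidence(token_probs):
    """Geometric mean of per-token probabilities, computed in log space
    for numerical stability. Serves as the answer-level confidence."""
    p = np.asarray(token_probs, dtype=float)
    return float(np.exp(np.log(p).mean()))

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error over equal-width confidence bins.

    confidences: per-sample confidence scores in [0, 1].
    correct:     per-sample booleans (answer judged correct or not).
    Returns the bin-weighted average |confidence - accuracy| gap.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Left-closed first bin, then left-open bins (lo, hi].
        if lo == 0.0:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            err += (mask.sum() / total) * gap
    return err
```

For example, a model that answers every question correctly while reporting 0.95 confidence has an ECE of 0.05: it is underconfident by a uniform 5 points.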

Abstract

The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantized VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as the dominant failure mode on VQAv2 and on COCO for Qwen, with a mixed Object Blindness / Semantic Drift profile for SmolVLM2 on COCO; Prior Bias (C) is present on VQAv2 but absent on COCO for both models. Confidence calibration is measured via Expected Calibration Error (ECE) using the geometric mean of token probabilities, and compositional reasoning is probed with structured negation probes across four templates. For this model pair, the compact model exhibits a qualitatively distinct failure signature: a 12.5pp larger negation collapse (-33.2pp vs. -20.8pp, Wald 95% CI [8.2, 16.8]pp, p < 10^-8), driven almost entirely by COCO, while the VQAv2 gap is not statistically significant (4.5pp, p = 0.19). The most discriminating template is false_yn: SmolVLM2-500M responds "Yes" (incorrectly claiming a depicted object is absent) on 100% of COCO trials vs. 14% for Qwen2.5-VL-7B. Asymmetric, dataset-dependent miscalibration and a blur robustness experiment with two controlled ablations complete the analysis. The fully reproducible pipeline is released for systematic safety auditing of compressed VLMs prior to edge deployment.
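The Wald interval quoted for the negation-collapse gap is the standard large-sample confidence interval for a difference of two proportions. A hedged sketch (the per-model trial counts below are hypothetical placeholders, not the paper's data, so the numbers will not reproduce the reported [8.2, 16.8]pp interval):

```python
from math import sqrt

def wald_ci_diff(x1, n1, x2, n2, z=1.96):
    """Wald 95% CI for the difference of two proportions p1 - p2.

    x1/n1, x2/n2: successes and trials for each group.
    z: normal quantile (1.96 for a two-sided 95% interval).
    Returns (difference, (lower, upper)).
    """
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d, (d - z * se, d + z * se)

# Hypothetical counts purely for illustration: 50/100 collapses
# for one model vs. 30/100 for the other.
diff, (lo, hi) = wald_ci_diff(50, 100, 30, 100)
```

With these placeholder counts the estimated gap is 20pp with a CI of roughly [6.7, 33.3]pp; the paper's much tighter interval reflects its larger sample sizes.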