DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning
arXiv cs.CV / 4/22/2026
Key Points
- The paper introduces DUALVISION, a lightweight fusion module that integrates infrared (IR) and RGB information into multimodal large language models (MLLMs) for more robust visual reasoning.
- DUALVISION uses patch-level localized cross-attention to combine IR-RGB cues efficiently (see the sketch after this list), addressing the fragility of RGB-only MLLMs under degradations like fog, blur, and low light.
- To enable training and evaluation, the authors release DV-204K, a public dataset of ~25K aligned IR-RGB image pairs with modality-specific QA annotations.
- They also provide DV-500, a smaller benchmark of 500 IR-RGB pairs with 500 QA pairs specifically aimed at evaluating cross-modal reasoning.
- Experiments with both open- and closed-source MLLMs show that DUALVISION improves performance across a wide range of visual degradation conditions.
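
This summary does not detail the fusion module, but the core idea of patch-level localized cross-attention can be sketched roughly as follows: RGB patch tokens act as queries over spatially co-located IR patch tokens inside small windows, and the fused result is added back to the RGB stream before it reaches the language model. The window size, dimensions, normalization, and residual layout below are illustrative assumptions in PyTorch, not the authors' exact design.

```python
# Minimal sketch of patch-level localized cross-attention fusion, assuming the
# RGB and IR images have already been encoded into aligned patch-token grids.
# All hyperparameters here (dim, heads, window size) are hypothetical.
import torch
import torch.nn as nn


class LocalizedCrossAttentionFusion(nn.Module):
    """Fuse IR patch tokens into RGB patch tokens within local spatial windows."""

    def __init__(self, dim: int = 768, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def _to_windows(self, x: torch.Tensor) -> torch.Tensor:
        # (B, H, W, C) -> (B * num_windows, window*window, C)
        b, h, w, c = x.shape
        k = self.window
        x = x.reshape(b, h // k, k, w // k, k, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, k * k, c)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        """rgb, ir: (B, H, W, C) patch-token grids; H and W divisible by window."""
        b, h, w, c = rgb.shape
        q = self._to_windows(self.norm_q(rgb))    # queries from RGB patches
        kv = self._to_windows(self.norm_kv(ir))   # keys/values from IR patches
        fused, _ = self.attn(q, kv, kv)           # cross-attention inside each window
        # Undo the windowing and add IR cues residually onto the RGB tokens.
        k = self.window
        fused = fused.reshape(b, h // k, w // k, k, k, c)
        fused = fused.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
        return rgb + fused


if __name__ == "__main__":
    module = LocalizedCrossAttentionFusion(dim=64, num_heads=4, window=4)
    rgb_tokens = torch.randn(2, 16, 16, 64)   # e.g. a 16x16 patch grid
    ir_tokens = torch.randn(2, 16, 16, 64)
    print(module(rgb_tokens, ir_tokens).shape)  # torch.Size([2, 16, 16, 64])
```

Restricting attention to small windows keeps the module lightweight: each RGB patch only attends to the handful of IR patches at roughly the same image location, which matches the "localized" framing in the key points while avoiding full quadratic cross-attention over all patches.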
