DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models
arXiv cs.CV / 4/21/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper argues that multimodal (vision-language) training data often suffers from noise, redundancy, and poor alignment, which limits VLM improvements.
- It introduces DOSE, a method that uses off-the-shelf pretrained models that have never seen the target data to score and select candidate samples without any task-specific training or fine-tuning.
- DOSE evaluates each sample's text quality and image-text alignment, builds a joint quality–alignment distribution over the candidate pool, and applies adaptive weighted sampling to choose informative data while preserving long-tail diversity (see the sketch after this list).
- Experiments on VQA and math benchmarks show that models trained on DOSE-filtered data can match or outperform models trained on the full dataset, while improving efficiency and scalability.
- The work suggests that reusing existing pretrained models for data curation can reduce the extra compute cost typically required by conventional filtering pipelines.
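The summary above outlines the selection mechanism only at a high level, so the following is a minimal sketch of what such a step could look like, assuming per-sample quality and alignment scores have already been produced by off-the-shelf models (for example, a text-quality scorer and a CLIP-style alignment model). The function name `dose_select`, the histogram-based joint distribution, and the temperature parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dose_select(quality, alignment, k, n_bins=20, temperature=0.5, seed=0):
    """Select k sample indices from joint (quality, alignment) scores.

    quality, alignment: 1-D arrays of per-sample scores assumed to come
    from off-the-shelf models that never saw the target data (e.g., a
    text-quality scorer and an image-text alignment model). Higher is
    taken to mean better.
    """
    quality = np.asarray(quality, dtype=float)
    alignment = np.asarray(alignment, dtype=float)

    # Empirical joint quality-alignment distribution on a 2-D grid.
    hist, q_edges, a_edges = np.histogram2d(quality, alignment, bins=n_bins)
    q_bin = np.clip(np.digitize(quality, q_edges[1:-1]), 0, n_bins - 1)
    a_bin = np.clip(np.digitize(alignment, a_edges[1:-1]), 0, n_bins - 1)
    density = hist[q_bin, a_bin] / len(quality)

    # Adaptive weighting: favour high joint scores, but down-weight dense
    # regions so rare (long-tail) score combinations keep a sampling chance.
    score = quality + alignment
    weights = np.exp(score / temperature) / (density + 1e-8)
    probs = weights / weights.sum()

    rng = np.random.default_rng(seed)
    return rng.choice(len(quality), size=k, replace=False, p=probs)


# Toy usage: 10,000 candidate samples, keep 2,000 of them.
rng = np.random.default_rng(1)
quality = rng.beta(2, 5, size=10_000)     # stand-in text-quality scores
alignment = rng.beta(5, 2, size=10_000)   # stand-in alignment scores
selected = dose_select(quality, alignment, k=2_000)
print(selected.shape)  # (2000,)
```

Dividing by the local density is one simple way to realize the "preserve long-tail diversity" goal: a sample in a sparsely populated region of the joint score space gets a boost relative to an equally scored sample from a crowded region.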