Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
arXiv cs.CV / 4/24/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper introduces Ramen, a robust test-time adaptation framework for vision-language models like CLIP that need to handle distribution shifts during inference.
- Unlike prior methods that assume test data comes from a single domain, Ramen is designed for mixed-domain settings by actively selecting relevant past samples using domain consistency and prediction balance.
- To reduce compute, Ramen maintains an embedding-gradient cache that stores embeddings and per-sample gradients from earlier test images, so the model can be updated without extra forward or backward passes (a rough sketch of such a cache and its selection step follows this list).
- The authors provide theoretical justification for why their adaptation strategy works under mixed-domain shifts and show strong, consistent results on multiple image corruption and domain-shift benchmarks.
- The project’s code is released on GitHub, supporting reproducibility and potential adoption of the method.
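The key points describe the mechanism only at a high level, so below is a minimal, hypothetical PyTorch-style sketch of what an embedding-gradient cache with domain-consistent, prediction-balanced sample selection could look like. The class name `EmbeddingGradientCache`, its methods, the FIFO eviction policy, and the cosine-similarity and per-class-cap heuristics are all illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of an embedding-gradient cache with active sample selection.
# All names and policies here are illustrative assumptions, not the paper's API.
import torch
import torch.nn.functional as F

class EmbeddingGradientCache:
    """Stores image embeddings and per-sample gradients from past test samples."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.embeddings = []     # cached image embeddings (detached tensors)
        self.gradients = []      # per-sample gradients w.r.t. the adapted parameters
        self.pseudo_labels = []  # predicted classes, used for prediction balance

    def add(self, embedding, grad, pseudo_label):
        if len(self.embeddings) >= self.capacity:
            # simple FIFO eviction; the paper may use a different policy
            self.embeddings.pop(0)
            self.gradients.pop(0)
            self.pseudo_labels.pop(0)
        self.embeddings.append(embedding.detach())
        self.gradients.append(grad.detach())
        self.pseudo_labels.append(int(pseudo_label))

    def select(self, query_embedding, k=8):
        """Pick k cached samples that (a) are domain-consistent with the query
        (high cosine similarity) and (b) keep the selected pseudo-labels balanced."""
        if not self.embeddings:
            return []
        embs = torch.stack(self.embeddings)                       # (N, D)
        sims = F.cosine_similarity(embs, query_embedding.unsqueeze(0), dim=-1)
        order = torch.argsort(sims, descending=True)
        selected, per_class = [], {}
        max_per_class = max(1, k // max(1, len(set(self.pseudo_labels))))
        for idx in order.tolist():
            c = self.pseudo_labels[idx]
            if per_class.get(c, 0) < max_per_class:               # prediction-balance cap
                selected.append(idx)
                per_class[c] = per_class.get(c, 0) + 1
            if len(selected) == k:
                break
        return selected

    def aggregated_gradient(self, indices):
        """Reuse stored per-sample gradients to form an update direction without
        re-running forward/backward passes on the cached images."""
        grads = torch.stack([self.gradients[i] for i in indices])
        return grads.mean(dim=0)
```

In this reading, `select` returns the indices whose stored gradients `aggregated_gradient` averages into a single update direction, which is where the savings in extra forward and backward passes would come from.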