One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
arXiv cs.CV / 3/12/2026
Key Points
- The paper critiques existing training-free methods for reducing MLLM hallucination, noting that enhancing vision alone or suppressing language priors alone each trades off performance and can introduce noise.
- It proposes a unified framework focused on vision tokens, built around two latent-representation modules: Synergistic Visual Calibration (SVC) and Causal Representation Calibration (CRC).
- SVC uses augmented visual tokens to strengthen visuals, while CRC prunes tokens to create latent-space negative samples for correcting internal model biases.
- The approach restores the vision-language balance, yielding roughly a 2% absolute POPE improvement on LLaVA-1.5 alongside gains across multiple benchmarks, at a 1.06x inference-latency overhead.
