HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
arXiv cs.CV / 3/19/2026
📰 News · Models & Research
Key Points
- HopChain presents a scalable data-synthesis framework for creating multi-hop vision-language reasoning data used in RLVR (reinforcement learning with verifiable rewards) training of vision-language models (VLMs).
- The method builds chains of logically dependent hops whose final answers are precise numbers, so rewards can be verified automatically; this targets long chain-of-thought (CoT) reasoning and its associated errors (see the sketch after this list).
- Empirically, adding HopChain data improves performance on 20 of 24 benchmarks across models and task categories (STEM, General VQA, Text Recognition, Document Understanding, Video Understanding).
- Ablations show that removing hops or shortening the chains significantly degrades performance, while full multi-hop data yields large gains, including more than 50 accuracy points in the ultra-long-CoT regime, supporting broad generalizability.
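
To make the core idea concrete, here is a minimal Python sketch of how logically dependent hops could be composed so that the final answer is a single number checkable by an exact-match reward. This is an illustrative assumption, not the authors' actual HopChain pipeline; the names `Hop`, `compose_chain`, and `verify_reward` are hypothetical.

```python
# Hypothetical sketch of the multi-hop synthesis idea -- NOT the authors'
# actual HopChain implementation. Names and structure are assumptions.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hop:
    """One reasoning step: a question whose answer feeds the next hop."""
    question: str                     # natural-language sub-question
    solve: Callable[[float], float]   # maps the previous hop's answer to this hop's answer

def compose_chain(hops: List[Hop], seed: float) -> float:
    """Run logically dependent hops in order; the last answer is the final label."""
    value = seed
    for hop in hops:
        value = hop.solve(value)
    return value

def verify_reward(prediction: float, gold: float, tol: float = 1e-6) -> float:
    """Binary verifiable reward: 1.0 iff the model's final number matches the label."""
    return 1.0 if abs(prediction - gold) <= tol else 0.0

# Example: a 3-hop chain grounded in hypothetical image annotations.
chain = [
    Hop("How many chairs are in the image?", lambda _: 4),
    Hop("How many legs do the chairs have in total?", lambda chairs: chairs * 4),
    Hop("Half of the legs are painted; how many is that?", lambda legs: legs / 2),
]
gold = compose_chain(chain, seed=0)    # -> 8.0, the verifiable numeric answer
print(gold, verify_reward(8.0, gold))  # 8.0 1.0
```

Because each hop consumes the previous hop's output, a model cannot shortcut to the final number without traversing the whole chain, which is what makes the numeric label a useful verifiable reward signal.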