StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning
arXiv cs.CV / 5/6/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- StateVLM is introduced as a state-aware vision-language model for robotic affordance reasoning, targeting VLM weaknesses on numerical reasoning tasks such as object detection and state localization.
- The paper proposes a fine-tuning strategy in which box decoder outputs are supervised with an Auxiliary Regression Loss (ARL) during training, while inference keeps standard sequence prediction (a minimal sketch follows this list).
- By framing numerical reasoning as a regression task, the approach aims to learn fine-grained object representations including precise localization, object states, and graspable regions.
- The authors create an open-source benchmark called OSAR (Object State Affordance Reasoning) with 1,172 scenes, 7,746 objects, and corresponding bounding boxes to evaluate object-state reasoning.
- Experiments show that adding ARL yields an average performance improvement of 1.6% on adapted benchmarks and 5.2% on OSAR, with ARL also improving output consistency on complex affordance reasoning.
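For concreteness, here is a minimal sketch of how an auxiliary regression loss can be combined with standard next-token training. This is not the paper's implementation: the names `box_preds`, `box_targets`, and `lambda_reg`, the tensor shapes, and the choice of an L1 regression term are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def statevlm_style_loss(lm_logits, target_ids, box_preds, box_targets,
                        lambda_reg=1.0):
    """Training-only combined loss: language modeling + auxiliary regression.

    Hypothetical shapes (not from the paper):
      lm_logits:   (B, T, vocab) token logits from the VLM
      target_ids:  (B, T) next-token targets, -100 at masked positions
      box_preds:   (B, N, 4) box decoder outputs, normalized to [0, 1]
      box_targets: (B, N, 4) ground-truth boxes
    """
    # Standard sequence-prediction loss; this is the only objective
    # that remains relevant at inference time.
    lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=-100,
    )
    # Auxiliary regression term: numerical outputs (coordinates, states)
    # are treated as continuous values rather than as text tokens.
    reg_loss = F.l1_loss(box_preds, box_targets)
    return lm_loss + lambda_reg * reg_loss
```

At inference, the box decoder and regression term are dropped and the model decodes numbers as ordinary tokens; under this reading, ARL only shapes the learned representation during fine-tuning.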