FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation
arXiv cs.RO / 3/23/2026
Key Points
- FD-VLA introduces a Force-Distilled Vision-Language-Action framework that enables force-aware reasoning in contact-rich manipulation without relying on physical force sensors.
- Its Force Distillation Module (FDM) maps a learnable query token, conditioned on visual observations and robot states, to a predicted force token that is trained to align with measured force signals (see the first sketch below).
- At inference, the distilled force token is injected into the pretrained vision-language model, preserving vision-language semantics while enabling force-aware reasoning, so the policy can be deployed on robots that lack expensive force-torque sensors (see the second sketch below).
- In experiments, the distilled force token can outperform both direct sensor measurements and baseline methods, and the FDM provides an additional force-vision-state fusion prior that improves cross-modal alignment and robustness.
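As a rough illustration of the distillation idea, the sketch below shows one plausible shape for the FDM: a learnable query token cross-attends over vision and state features to produce a force token, which is regressed toward an embedding of the real force/torque reading during training. All module names, dimensions, and the `force_encoder` helper are assumptions made for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForceDistillationModule(nn.Module):
    """Minimal FDM sketch: a learnable query token attends over visual
    and robot-state features and is projected to a predicted force token."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Learnable query that the module fills with force information.
        self.force_query = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        # Cross-attention from the query to concatenated vision/state tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, vision_tokens: torch.Tensor,
                state_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, Nv, d); state_tokens: (B, Ns, d)
        context = torch.cat([vision_tokens, state_tokens], dim=1)
        query = self.force_query.expand(context.size(0), -1, -1)
        force_token, _ = self.cross_attn(query, context, context)
        return self.proj(force_token)  # (B, 1, d): predicted force token


def distillation_loss(pred_force_token: torch.Tensor,
                      measured_wrench: torch.Tensor,
                      force_encoder: nn.Module) -> torch.Tensor:
    # Training-time alignment: embed the real 6-DoF force/torque reading
    # (available only on sensor-equipped training robots) and regress the
    # predicted token toward it. `force_encoder` is a hypothetical helper,
    # e.g. a small MLP mapping (B, 6) -> (B, d).
    target = force_encoder(measured_wrench)
    return F.mse_loss(pred_force_token.squeeze(1), target)
```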
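A second sketch, under the same assumptions, shows the sensor-free inference path: the FDM's output token is simply appended to the vision-language token sequence before the frozen backbone, so the pretrained vision-language representations are left untouched. The `inputs_embeds`-style call is an assumed Hugging-Face-like convention, not FD-VLA's actual interface.

```python
import torch

@torch.no_grad()
def act_without_force_sensor(vlm, fdm, action_head,
                             lang_tokens, vision_tokens, state_tokens):
    # Predict a force token from vision + proprioception alone; no
    # physical force-torque sensor is read at deployment time.
    force_token = fdm(vision_tokens, state_tokens)  # (B, 1, d)

    # Inject the distilled token after the usual vision-language tokens,
    # leaving the pretrained VL token stream (and its semantics) intact.
    tokens = torch.cat([lang_tokens, vision_tokens, force_token], dim=1)

    # Assumed HF-like call; the frozen VLM consumes embeddings directly.
    features = vlm(inputs_embeds=tokens)
    return action_head(features)  # predicted robot actions
```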