A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems
arXiv cs.RO / 4/2/2026
Key Points
- The paper introduces a ROS 2 wrapper that integrates the Florence-2 vision-language model, giving robotic systems richer semantic perception than task-specific vision pipelines provide.
- It exposes Florence-2 through three interaction modes—continuous topic-driven processing, synchronous service calls, and asynchronous actions—so developers can choose the right control flow for their robot stack.
- The wrapper is built for local execution and supports both native installation and Docker deployment, aiming to improve reproducibility in real robot middleware.
- For outputs, it provides generic JSON plus standard ROS 2 message bindings tailored to detection-oriented vision-language tasks.
- The authors report functional validation and a GPU throughput study, concluding that local deployment is feasible even on consumer-grade hardware.
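
To make the "generic JSON plus standard ROS 2 message bindings" point concrete, here is a minimal sketch of how a detection-style JSON result could be mapped onto a center-and-size box layout like the one used by `vision_msgs` bounding boxes. It assumes the JSON follows the object-detection output shape documented for Florence-2 (a task key such as `<OD>` holding `bboxes` in `[x1, y1, x2, y2]` pixel order and a parallel `labels` list). The `Detection2D` dataclass and `parse_od_json` function are hypothetical names used only for illustration. They stand in for real ROS 2 message types and are not the wrapper's actual API, which this summary does not describe.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Detection2D:
    """Hypothetical stand-in for a ROS 2 detection message (illustration only)."""
    label: str
    center_x: float
    center_y: float
    size_x: float
    size_y: float


def parse_od_json(payload: str, task: str = "<OD>") -> List[Detection2D]:
    """Convert a Florence-2 style detection JSON string into center/size boxes.

    Assumes corner-format boxes [x1, y1, x2, y2] under data[task]["bboxes"]
    and one label per box under data[task]["labels"].
    """
    data = json.loads(payload)
    result = data.get(task, {})
    bboxes = result.get("bboxes", [])
    labels = result.get("labels", [])

    detections = []
    for (x1, y1, x2, y2), label in zip(bboxes, labels):
        detections.append(
            Detection2D(
                label=label,
                center_x=(x1 + x2) / 2.0,  # box midpoint along x
                center_y=(y1 + y2) / 2.0,  # box midpoint along y
                size_x=x2 - x1,            # box width in pixels
                size_y=y2 - y1,            # box height in pixels
            )
        )
    return detections


# Made-up sample payload, not output from the paper or the wrapper.
example = json.dumps(
    {"<OD>": {"bboxes": [[10.0, 20.0, 110.0, 220.0]], "labels": ["person"]}}
)
detections = parse_od_json(example)
```

In a real node, each `Detection2D` here would be copied into the corresponding ROS 2 message fields before publishing. Keeping the raw JSON available alongside the typed message lets tasks without a natural message type (captioning, for example) still reach subscribers.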




