FalconApp: Rapid iPhone Deployment of End-to-End Perception via Automatically Labeled Synthetic Data

arXiv cs.RO / 4/30/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The paper introduces FalconApp, an iPhone app that creates an end-to-end perception module from a short handheld video of a rigid object, targeting mask detection and 6-DoF pose estimation.
  • It uses a rapid mobile deployment pipeline plus photorealistic auto-labeling: reconstruct a GSplat asset, composite it into varied backgrounds, render synthetic training data with ground-truth masks/poses, train a perception model, and redeploy it to the iPhone.
  • Experiments on five rigid objects show the workflow averages about 20 minutes for synthetic-data generation and training per object.
  • The resulting on-device inference achieves roughly 30 ms end-to-end latency on iPhone, and pose accuracy improves over a PnP baseline on 4 out of 5 objects in both simulation and real-world tests.
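The auto-labeling step in the second bullet can be sketched in miniature: sample a random 6-DoF pose, render the reconstructed asset at that pose, composite it over a varied background, and keep the exact mask and pose as labels. The sketch below is a toy stand-in (the "renderer" draws a projected disk instead of a GSplat asset, and the focal length, pose ranges, and image size are invented for illustration), but the labeling loop has the same shape as the pipeline the paper describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pose():
    """Sample a hypothetical 6-DoF pose: axis-angle rotation + translation (meters)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    rvec = axis * rng.uniform(0.0, np.pi)
    tvec = rng.uniform([-0.2, -0.2, 0.4], [0.2, 0.2, 1.0])
    return rvec, tvec

def render_asset(rvec, tvec, hw=(64, 64)):
    """Stand-in for GSplat rendering: returns an RGB image and a binary mask.
    The 'object' is a filled disk whose screen position and size track tvec."""
    h, w = hw
    img = np.zeros((h, w, 3), dtype=np.float32)
    f = 60.0  # assumed pinhole focal length (pixels)
    cx = w / 2 + f * tvec[0] / tvec[2]
    cy = h / 2 + f * tvec[1] / tvec[2]
    yy, xx = np.mgrid[0:h, 0:w]
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 < (8.0 / tvec[2]) ** 2
    img[mask] = [0.8, 0.3, 0.2]
    return img, mask

def composite(background, img, mask):
    """Paste the rendered object over a photorealistic background."""
    out = background.copy()
    out[mask] = img[mask]
    return out

# Auto-labeling loop: every sample comes with an exact mask and pose for free,
# because the renderer itself placed the object.
dataset = []
for _ in range(100):
    rvec, tvec = random_pose()
    bg = rng.uniform(size=(64, 64, 3)).astype(np.float32)
    obj, mask = render_asset(rvec, tvec)
    dataset.append({"image": composite(bg, obj, mask),
                    "mask": mask, "rvec": rvec, "tvec": tvec})
```

The point of the design is that labels are byproducts of rendering: no human annotation enters the loop, which is what makes the ~20-minute per-object turnaround plausible.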

Abstract

Reliable perception for robotics depends on large-scale labeled data, yet real-world datasets require heavy manual annotation and are time-consuming to produce. We present FalconApp, an iPhone app with an end-to-end frontend-backend pipeline that turns a short handheld capture of a rigid object into a perception module for mask detection and 6-DoF pose estimation. Our core contribution is a rapid mobile deployment pipeline paired with a photorealistic auto-labeling workflow: from a user-captured video of an object, FalconApp reconstructs an editable GSplat asset, composites it with diverse photorealistic backgrounds, renders synthetic images with ground-truth masks and poses, trains the perception module, and deploys it back to the iPhone frontend. Experiments across five rigid objects with diverse geometry and appearance show that FalconApp produces usable perception models with about 20 minutes of synthetic-data generation and training per object on average, around 30 ms end-to-end on-device latency on iPhone, and better overall pose accuracy than a PnP baseline on 4/5 objects in both simulation and real-world evaluation.
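The abstract reports pose accuracy against a PnP baseline but the summary does not say how accuracy is scored. A common convention for 6-DoF evaluation is geodesic rotation error plus Euclidean translation error; the numpy sketch below implements that metric (the Rodrigues helper and the example poses are illustrative, not taken from the paper).

```python
import numpy as np

def rotmat(rvec):
    """Rodrigues formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K

def pose_error(R_est, t_est, R_gt, t_gt):
    """Geodesic rotation error (degrees) and translation error (same units as t)."""
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    rot_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans = np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))
    return rot_deg, trans

# Example: an estimate that is 10 degrees off about z and 5 cm off in depth.
R_gt = np.eye(3)
R_est = rotmat(np.array([0.0, 0.0, np.radians(10.0)]))
rot_deg, trans = pose_error(R_est, np.array([0.0, 0.0, 0.05]), R_gt, np.zeros(3))
# rot_deg ~ 10.0, trans = 0.05
```

A baseline like PnP solves for this same (R, t) from 2D-3D correspondences; comparing both estimators with the metric above is the standard way to back a "better on 4/5 objects" claim.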