Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
arXiv cs.CV / 4/17/2026
Key Points
- ViT-based sparse multi-view 3D object detectors achieve strong accuracy but remain slow at inference, since processing the full set of image tokens is computationally heavy.
- Prior token compression approaches (token pruning/merging and enlarging patch sizes) can remove informative background cues, break contextual consistency, and degrade fine-grained semantics, harming 3D detection.
- The paper proposes SEPatch3D, which dynamically adjusts patch sizes to retain critical semantic information in coarse patches while cutting computation.
- SEPatch3D includes SPSS (choosing small patches for nearby-object scenes and large patches for background-dominated scenes), IPS (selecting informative patches for refinement), and CGFE (injecting fine-grained details into coarse patches).
- Experiments on nuScenes and Argoverse 2 show up to 57% faster inference than the StreamPETR baseline and 20% better efficiency than ToC3D-faster, at comparable detection accuracy; the authors release code on GitHub.
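The scene-adaptive patch sizing idea behind SPSS can be illustrated with a minimal sketch. This is not the paper's implementation: the foreground-ratio heuristic, the threshold, and the function names (`choose_patch_size`, `patchify`) are all assumptions made purely for illustration. The sketch picks a small patch size when a camera view is dense with nearby objects and a large one when background dominates, then shows how the resulting token count changes.

```python
# Hypothetical sketch of scene-adaptive patch sizing (not the paper's code).
import numpy as np

def choose_patch_size(fg_ratio, threshold=0.25, small=8, large=16):
    """Pick a small patch size for object-dense views, a large one otherwise.

    `fg_ratio` is an assumed per-view foreground fraction (e.g., from a
    cheap proposal or segmentation pass); the threshold is illustrative.
    """
    return small if fg_ratio > threshold else large

def patchify(image, patch):
    """Split an HxWxC image into non-overlapping patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    tokens = image.reshape(h // patch, patch, w // patch, patch, c)
    return tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

img = np.zeros((32, 64, 3), dtype=np.float32)
dense = patchify(img, choose_patch_size(0.4))   # small patches: more tokens
sparse = patchify(img, choose_patch_size(0.1))  # large patches: fewer tokens
```

Since ViT attention cost grows quadratically with token count, halving the patch edge quadruples the tokens; choosing large patches for background-heavy views is where the compute savings come from, while IPS and CGFE (per the summary above) recover fine detail where it matters.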
