Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens
arXiv cs.CV / 3/26/2026
Key Points
- Le MuMo JEPA is introduced as a multi-modal self-supervised representation learning framework that learns unified embeddings from RGB images and aligned companion modalities (notably camera-aligned LiDAR depth).
- The method extends LeJEPA by adding learnable fusion tokens that form a latent bottleneck inside a shared transformer, together with an efficient "pruned fusion" strategy that drops modality-specific tokens after an initial cross-modal attention layer (see the sketch after this list).
- It applies SIGReg regularization to the joint multimodal CLS embedding to improve representation quality for downstream tasks.
- Driving experiments on Waymo and nuScenes show that Le MuMo JEPA achieves strong performance-efficiency trade-offs against from-scratch multimodal baselines, improving CenterNet detection and dense depth estimation while staying competitive on segmentation.
- The framework also transfers well to the Teledyne FLIR ADAS benchmark, delivering the best results in the study, particularly after Waymo-initialized fine-tuning, while reducing compute, memory, and training time.
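As a rough illustration of the fusion-token mechanism described above, the sketch below shows one way a latent bottleneck with pruned fusion could look in PyTorch: learnable fusion tokens and a CLS token attend to RGB and depth tokens in a single cross-modal layer, after which the modality-specific tokens are dropped and only the compact bottleneck passes through the remaining layers. Class names, dimensions, and layer counts here are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class PrunedFusionEncoder(nn.Module):
    """Sketch of a latent fusion bottleneck with "pruned fusion"
    (assumed design, not the paper's released code)."""

    def __init__(self, dim=768, num_fusion_tokens=8, depth=12, heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.fusion_tokens = nn.Parameter(torch.zeros(1, num_fusion_tokens, dim))
        # One full-width layer where CLS/fusion tokens attend to all modality tokens.
        self.cross_modal_layer = nn.TransformerEncoderLayer(
            dim, heads, dim * 4, batch_first=True)
        # Remaining shared layers operate only on the pruned (CLS + fusion) sequence.
        self.shared_layers = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True),
            num_layers=depth - 1)

    def forward(self, rgb_tokens, depth_tokens):
        # rgb_tokens: (B, N_rgb, dim); depth_tokens: (B, N_depth, dim),
        # produced by per-modality patch embeddings (not shown here).
        b = rgb_tokens.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        fusion = self.fusion_tokens.expand(b, -1, -1)
        x = torch.cat([cls, fusion, rgb_tokens, depth_tokens], dim=1)
        x = self.cross_modal_layer(x)           # cross-modal attention over everything
        keep = 1 + self.fusion_tokens.size(1)   # prune: keep only CLS + fusion tokens
        x = self.shared_layers(x[:, :keep])
        return x[:, 0]                          # joint multimodal CLS embedding


# Toy usage with random token embeddings.
enc = PrunedFusionEncoder()
rgb = torch.randn(2, 196, 768)
depth = torch.randn(2, 196, 768)
print(enc(rgb, depth).shape)  # torch.Size([2, 768])
```

In this reading, the SIGReg regularizer inherited from LeJEPA would then be applied to the returned joint CLS embedding during pre-training, and the pruning step is what yields the reported compute and memory savings, since most layers see only a handful of tokens.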