LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
arXiv cs.AI / 3/23/2026
Key Points
- LeWorldModel (LeWM) is the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: next-embedding prediction and a Gaussian latent regularizer.
- The approach reduces loss hyperparameters from six to one, simplifying tuning and reducing fragility compared to prior end-to-end JEPA methods.
- With ~15M parameters, LeWM can be trained on a single GPU in a few hours and achieves up to 48x faster training than foundation-model-based world models, while staying competitive across 2D and 3D control tasks.
- The latent space encodes meaningful physical structure, and probing demonstrates the model can detect physically implausible events.
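The two-term objective described above (next-embedding prediction plus a Gaussian latent regularizer, balanced by a single weight) can be sketched in code. This is a minimal illustration under assumptions: the module names, network sizes, and the exact form of the Gaussian regularizer are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyJEPA(nn.Module):
    """Hypothetical sketch of a two-loss JEPA-style world model."""

    def __init__(self, obs_dim=64, act_dim=4, latent_dim=32):
        super().__init__()
        # Encoder maps raw observations to latent embeddings.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        # Predictor maps (current embedding, action) to the next embedding.
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim)
        )

    def loss(self, obs_t, act_t, obs_next, lam=0.1):
        z_t = self.encoder(obs_t)
        z_next = self.encoder(obs_next)
        # Term 1: next-embedding prediction against a detached target,
        # a common choice to discourage representation collapse.
        pred = self.predictor(torch.cat([z_t, act_t], dim=-1))
        pred_loss = ((pred - z_next.detach()) ** 2).mean()
        # Term 2: Gaussian latent regularizer -- here simply pulling
        # embeddings toward N(0, I); the paper's exact form may differ.
        reg_loss = (z_t ** 2).mean()
        # `lam` is the single loss hyperparameter the summary mentions.
        return pred_loss + lam * reg_loss

model = TinyJEPA()
obs_t = torch.randn(8, 64)
act_t = torch.randn(8, 4)
obs_next = torch.randn(8, 64)
total = model.loss(obs_t, act_t, obs_next)
```

Collapsing six loss hyperparameters into one `lam` is what the summary credits for the method's tuning simplicity; everything else reduces to standard gradient descent on `total`.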
Related Articles
Data Augmentation Using GANs
Dev.to
Zero-Shot Deformation Reconstruction for Soft Robots Using a Flexible Sensor Array and Cage-Based 3D Gaussian Modeling
arXiv cs.RO
Speculative Policy Orchestration: A Latency-Resilient Framework for Cloud-Robotic Manipulation
arXiv cs.RO
ReMAP-DP: Reprojected Multi-view Aligned PointMaps for Diffusion Policy
arXiv cs.RO
AGILE: A Comprehensive Workflow for Humanoid Loco-Manipulation Learning
arXiv cs.RO