Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations

arXiv cs.LG / 2026-03-24


Key Points

  • The paper studies video world models trained with JEPA-style masked prediction, arguing that moving prediction into latent space creates an interpretability gap for the physical structure learned by the encoder.
  • It introduces the “AI Mother Tongue” (AIM) framework, a passive, vocabulary-free quantization probe that discretizes frozen V-JEPA 2 latent vectors into symbol sequences without supervision or modifying the encoder (a minimal sketch of such a probe follows this list).
  • By keeping the encoder fully frozen, the authors claim that any emergent discrete symbolic structure in the AIM codebook can be attributed to the pre-trained V-JEPA 2 representations rather than to the probe.
  • Category-contrast experiments on Kinetics-mini show statistically significant differences in AIM symbol distributions across grasp angle, object geometry, and motion temporal structure, with chi-squared tests, mutual information, and Jensen-Shannon divergence all pointing to systematic shifts in symbol usage.
  • The results suggest that the V-JEPA 2 latent space contains a compact shared representational core for action categories, with physical and semantic differences expressed as graded distribution shifts rather than sharp categorical boundaries.
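
To make the probe concrete, below is a minimal sketch of a passive quantization probe over frozen latents in the spirit of AIM. Everything here is an illustrative assumption rather than the paper's implementation: the codebook is fit with plain k-means, the function names are ours, and `n_symbols=8` is chosen to match the 3-bit maximum quoted in the abstract.

```python
import numpy as np

def fit_codebook(latents: np.ndarray, n_symbols: int = 8,
                 n_iters: int = 50, seed: int = 0) -> np.ndarray:
    """Fit a small codebook over frozen latent vectors via k-means.

    latents: (N, D) array of encoder outputs. The encoder itself is never
    touched: the probe only reads its activations, so any structure found
    in the codebook comes from the representations, not from the probe.
    """
    rng = np.random.default_rng(seed)
    # Initialize codes from randomly chosen latents (fancy indexing copies).
    codebook = latents[rng.choice(len(latents), n_symbols, replace=False)]
    for _ in range(n_iters):
        # Assign each latent to its nearest code; that index is its "symbol".
        dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each code to the mean of the latents assigned to it.
        for k in range(n_symbols):
            if (assign == k).any():
                codebook[k] = latents[assign == k].mean(axis=0)
    return codebook

def tokenize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map a sequence of latent vectors to a discrete symbol sequence."""
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)
```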

Abstract

Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure, but in no form that can be directly inspected. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modification of the encoder. Because the encoder is kept completely frozen, any symbolic structure in the AIM codebook is attributable entirely to the V-JEPA 2 pre-trained representations, not to the probe. We evaluate AIM through category-contrast experiments on Kinetics-mini along three physical dimensions: grasp angle, object geometry, and motion temporal structure. AIM symbol distributions differ significantly across all three experiments (χ² p < 10⁻⁴; MI 0.036–0.117 bits; NMI 1.2–3.9% of the 3-bit maximum; JSD up to 0.342; codebook active ratio 62.5%). The experiments reveal that the V-JEPA 2 latent space is markedly compact: diverse action categories share a common representational core, with semantic differences encoded as graded distributional variations rather than categorical boundaries. These results establish Stage 1 of a four-stage roadmap toward an action-conditioned symbolic world model, demonstrating that structured symbolic manifolds are discoverable properties of frozen JEPA latent spaces.
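
The statistics reported in the parenthetical above can all be computed from per-category symbol histograms. Here is a hedged sketch under the same assumptions as the probe sketch earlier (an 8-symbol codebook and symbol sequences from `tokenize`); the paper's exact estimators may differ.

```python
import numpy as np
from scipy.stats import chi2_contingency

def contrast_stats(sym_a: np.ndarray, sym_b: np.ndarray,
                   n_symbols: int = 8) -> dict:
    """Compare symbol usage between two categories of clips."""
    # 2 x n_symbols contingency table of symbol counts per category.
    counts = np.stack([np.bincount(sym_a, minlength=n_symbols),
                       np.bincount(sym_b, minlength=n_symbols)])
    active_ratio = (counts.sum(axis=0) > 0).mean()  # fraction of codes ever used
    used = counts[:, counts.sum(axis=0) > 0]        # drop unused codes for chi^2

    chi2, p, _, _ = chi2_contingency(used)

    # Mutual information (bits) between category and symbol.
    joint = used / used.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mi = float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())
    nmi = mi / np.log2(n_symbols)  # fraction of the 3-bit maximum for 8 symbols

    # Jensen-Shannon divergence (base 2) between the two symbol distributions.
    pa, pb = used[0] / used[0].sum(), used[1] / used[1].sum()
    m = 0.5 * (pa + pb)
    kl = lambda u, v: float((u[u > 0] * np.log2(u[u > 0] / v[u > 0])).sum())
    jsd = 0.5 * kl(pa, m) + 0.5 * kl(pb, m)

    return dict(chi2=chi2, p=p, mi_bits=mi, nmi=nmi, jsd=jsd,
                active_ratio=active_ratio)
```

Given symbol sequences for two contrasting categories, e.g. `contrast_stats(tokenize(lat_a, cb), tokenize(lat_b, cb))`, the returned `nmi` and `jsd` are directly comparable to the ranges quoted in the abstract.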