[R] V-JEPA 2 has no pixel decoder, so how do you inspect what it learned? We attached a VQ probe to the frozen encoder and found statistically significant physical structure

Reddit r/artificial / 3/24/2026


Key Points

  • V-JEPA 2 avoids pixel reconstruction by predicting in latent space, which makes it hard to visually verify what physical concepts it has learned.
  • The authors address the “attribution problem” by freezing the V-JEPA 2 encoder and attaching a passive AIM framework VQ probe, so any discovered codebook structure is attributable to V-JEPA 2’s fixed representations.
  • In Kinetics-mini experiments across 3 physical-dimension contrasts, the probe finds statistically significant differences in symbol distributions, with measured mutual information of 0.036–0.117 bits and Jensen–Shannon divergence up to 0.342.
  • Codebook usage is substantial (62.5% active entries with K=8), and temporal structure yields a stronger signal than morphological differences, aligning with V-JEPA 2’s temporal prediction objective.
  • The results suggest V-JEPA 2’s latent space is compact: action categories largely map to a dominant codebook entry, while semantic differences appear as graded distributional shifts rather than separate category boundaries, consistent with learned shared physical structure.

V-JEPA 2 is powerful precisely because it predicts in latent space rather than reconstructing pixels. But that design creates a problem: there’s no visual verification pathway. You can benchmark it, but you can’t directly inspect what physical concepts it has encoded.

Existing probing approaches have a fundamental issue we call the attribution problem: when you attach a learned component (linear probe, LM head, pixel decoder) and the composite system performs well, you can’t tell how much of the performance comes from the encoder vs. the attached component’s own capacity.

Our approach: attach the AIM framework (arXiv:2507.10566) as a passive quantization probe — a lightweight VQ-VAE bottleneck with no task-specific supervision, no predefined symbol inventory, and crucially, the V-JEPA 2 encoder is completely frozen throughout. Zero gradient flows into V-JEPA 2. Zero modification to any source file.

Because the encoder is deterministic and fixed, any symbolic structure that emerges in the codebook is attributable to V-JEPA 2’s representations — not to the probe.
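The frozen-encoder probe setup can be sketched as follows. This is a minimal illustration, not the repo's implementation: the "encoder" here is a stand-in `nn.Linear` rather than V-JEPA 2, feature dimensions are made up, and only `K=8` is taken from the post. The key property it demonstrates is that gradients update the codebook alone; the encoder's parameters are frozen and its output is detached, so no gradient can flow into it.

```python
import torch
import torch.nn as nn

class VQProbe(nn.Module):
    """Passive VQ bottleneck: K learnable code vectors; each feature is
    assigned its nearest code. Only the codebook receives gradients."""
    def __init__(self, dim, K=8):
        super().__init__()
        self.codebook = nn.Embedding(K, dim)

    def forward(self, z):
        # z: (N, dim) features from the frozen encoder
        d = torch.cdist(z, self.codebook.weight)   # (N, K) distances
        idx = d.argmin(dim=1)                      # discrete symbol per feature
        q = self.codebook(idx)
        # Codebook loss only; z is detached, so the encoder sees no gradient.
        loss = ((q - z.detach()) ** 2).mean()
        return idx, loss

# Stand-in for the frozen, deterministic V-JEPA 2 encoder (hypothetical shapes).
encoder = nn.Linear(64, 32)
for p in encoder.parameters():
    p.requires_grad_(False)          # freeze: zero gradient into the encoder

probe = VQProbe(dim=32, K=8)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

x = torch.randn(256, 64)             # dummy clip features
with torch.no_grad():
    z = encoder(x)                   # fixed representations
symbols, loss = probe(z)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the symbol assignment is a deterministic function of the frozen features plus a small codebook, any structure in the symbol statistics has to come from the representations themselves.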

What we found (Kinetics-mini, 3 category-contrast experiments):

  • Symbol distributions differ significantly across all 3 physical-dimension contrasts (χ² p < 10⁻⁴ to p < 10⁻¹⁰)
  • Absolute MI: 0.036–0.117 bits; JSD up to 0.342
  • Codebook utilization: 62.5% active entries (K=8)
  • Temporal-structure differences produce a 1.8× stronger signal than morphological differences, consistent with V-JEPA 2's temporal prediction objective
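All three statistics can be computed from a category-by-symbol contingency table of token counts. A minimal sketch with SciPy; the counts below are made-up numbers for illustration, not the paper's data (note that SciPy's `jensenshannon` returns the JS distance, i.e. the square root of the divergence):

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.spatial.distance import jensenshannon

# Hypothetical symbol counts: rows = 2 contrast categories, cols = K=8 codes.
counts = np.array([[120, 10, 5, 3, 2, 1, 1, 0],
                   [ 80, 40, 8, 6, 3, 2, 2, 1]])

# Chi-squared test of independence between category and symbol.
chi2, p, dof, _ = chi2_contingency(counts)

# Plug-in mutual information I(category; symbol) in bits.
joint = counts / counts.sum()
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)
nz = joint > 0
mi_bits = float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Jensen–Shannon divergence between the two categories' symbol distributions.
p1 = counts[0] / counts[0].sum()
p2 = counts[1] / counts[1].sum()
jsd = jensenshannon(p1, p2, base=2) ** 2   # square the distance to get JSD
```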

The interesting finding isn’t just that it works. It’s that V-JEPA 2’s latent space is compact: all 5 action categories predominantly map to the same dominant codebook entry, with semantic differences encoded as graded distributional shifts rather than categorical boundaries. We argue this is the expected signature of a model that has internalized shared physical structure (gravity, kinematics, continuity) rather than a failure of separation.
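What "graded distributional shifts rather than categorical boundaries" means concretely: every category's symbol distribution has the same mode, and what varies is how much mass sits on it. A toy illustration with invented distributions (not measured values):

```python
import numpy as np

# Hypothetical per-category symbol distributions over K=8 codes: every
# category's mode is code 0, but the mass on it varies (graded shifts).
dists = np.array([
    [0.72, 0.10, 0.08, 0.04, 0.03, 0.02, 0.01, 0.00],
    [0.65, 0.15, 0.09, 0.05, 0.03, 0.02, 0.01, 0.00],
    [0.58, 0.20, 0.10, 0.06, 0.03, 0.02, 0.01, 0.00],
    [0.54, 0.22, 0.11, 0.07, 0.03, 0.02, 0.01, 0.00],
    [0.49, 0.25, 0.12, 0.08, 0.03, 0.02, 0.01, 0.00],
])

modes = dists.argmax(axis=1)            # dominant code per category
shared_mode = len(set(modes)) == 1      # True: no categorical boundary
mass_on_mode = dists[:, modes[0]]       # graded: mass declines smoothly
```

Under the compact-latent reading, categories are distinguishable (the distributions differ, hence the significant χ² and nonzero MI/JSD) without being partitioned into disjoint symbol clusters.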

Limitations we acknowledge upfront:

  • Category-proxy confounding: we can't isolate single physical variables with Kinetics-mini
  • Token-level pseudo-replication: the effective N is closer to 9–10 videos per category
  • K=8 is too coarse for fine-grained structure (Stage 2 will increase K to 32/64)
  • The Gaussian-noise baseline is a weaker null than a permutation test
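On the last point, the stronger null is straightforward to state: shuffle the category labels over tokens and recompute the statistic, so the null distribution reflects chance dependence at the observed sample size and marginals. A sketch with synthetic data (everything below is illustrative, not the paper's procedure or numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_bits(labels, symbols, n_cat, K):
    """Plug-in estimate of I(category; symbol) in bits."""
    joint = np.zeros((n_cat, K))
    np.add.at(joint, (labels, symbols), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Synthetic token-level data: 2 categories, K=8 symbols, different mixes.
labels = rng.integers(0, 2, size=500)
symbols = np.where(
    labels == 0,
    rng.choice(8, size=500, p=[.6, .1, .1, .05, .05, .04, .03, .03]),
    rng.choice(8, size=500, p=[.3, .3, .1, .1, .08, .05, .04, .03]),
)

observed = mi_bits(labels, symbols, 2, 8)

# Permutation null: shuffling labels destroys any label-symbol dependence.
null = [mi_bits(rng.permutation(labels), symbols, 2, 8) for _ in range(999)]
p_value = (1 + sum(m >= observed for m in null)) / (1 + len(null))
```

Note this still doesn't address pseudo-replication; for that, the permutation would have to shuffle labels at the video level rather than the token level.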

This is Stage 1 of a 4-stage roadmap toward an action-conditioned symbolic world model.

Paper: arXiv:2603.20327

Code: github.com/cyrilliu1974/JEPA

Happy to discuss the methodology, the compact-latent interpretation, or the roadmap.

submitted by /u/Pale-Entertainer-386