Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images

arXiv cs.CV / 4/3/2026


Key Points

  • The paper proposes PI-VQ (permutation-invariant vector-quantized autoencoder), which makes discrete image codes position-free for spatially aligned data, removing the need for autoregressive or diffusion priors that are typically required to model positional dependencies among codes.
  • By constraining latent codes to carry no positional information, the method encourages learning of global semantic features and supports direct latent interpolation between images without a learned prior.
  • To compensate for reduced information capacity from permutation invariance, the authors introduce “matching quantization,” which uses optimal bipartite matching to increase effective bottleneck capacity by about 3.5× versus naive nearest-neighbor quantization.
  • The compositional latent structure enables interpolation-based sampling that can synthesize novel images in a single forward pass, potentially simplifying generation pipelines.
  • Experiments on CelebA, CelebA-HQ, and FFHQ show competitive precision, density, and coverage metrics, while the authors discuss trade-offs such as reduced separability and interpretability of latent codes and outline directions for future research.
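The "matching quantization" idea in the third point can be illustrated concretely. The sketch below is an assumption-laden toy version, not the paper's implementation: it assigns each latent vector to a *distinct* codebook entry by minimizing total squared distance over one-to-one assignments, whereas naive nearest-neighbour quantization can collapse many latents onto the same code. For clarity it brute-forces the matching; a practical implementation would use the Hungarian algorithm.

```python
from itertools import permutations

def matching_quantize(latents, codebook):
    """Toy sketch of bipartite-matching quantization (illustrative only).

    Assigns each latent vector to a distinct codebook entry so that the
    total squared Euclidean cost is minimal. Brute-forces all injective
    assignments, so it is only feasible for tiny inputs.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    k = len(latents)  # requires k <= len(codebook)
    best_cost, best_assign = float("inf"), None
    # Enumerate every injective assignment latents -> codebook entries.
    for perm in permutations(range(len(codebook)), k):
        cost = sum(dist2(latents[i], codebook[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return [codebook[j] for j in best_assign], list(best_assign)
```

For example, with latents `[[0, 0], [0.2, 0.2]]` and codebook `[[0, 0], [1, 1]]`, nearest-neighbour lookup maps both latents to `[0, 0]`, while the matching assigns them to distinct entries, which is how the paper's scheme can use the bottleneck more effectively.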

Abstract

Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by 3.5× relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
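The abstract's "interpolation-based sampling" can be pictured with a minimal sketch. The function below is an illustrative guess at one plausible mechanism, not the paper's actual procedure: since PI-VQ codes form an unordered set, a novel latent can be composed by drawing a fraction `alpha` of codes from one image's set and the remainder from another's, after which a single decoder forward pass would produce the new image. The function name and signature are invented for this example.

```python
import random

def interpolate_code_sets(codes_a, codes_b, alpha=0.5, seed=0):
    """Illustrative sketch (not the paper's API): compose a new
    permutation-invariant code set by mixing two images' code sets.

    Draws round(alpha * k) codes from image A's set and the rest from
    image B's set; the result would be decoded in one forward pass.
    Assumes both sets have the same size k.
    """
    rng = random.Random(seed)
    k = len(codes_a)
    n_a = round(alpha * k)
    picked_a = rng.sample(codes_a, n_a)        # subset of A's codes
    picked_b = rng.sample(codes_b, k - n_a)    # subset of B's codes
    return picked_a + picked_b  # order is irrelevant: the set is position-free
```

Because the representation carries no positional information, no learned prior is needed to decide *where* the mixed codes go, which is what makes single-pass synthesis from interpolated sets plausible.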