Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis

arXiv cs.CV / April 28, 2026

Key Points

  • Shape is a new self-supervised 3D geometry foundation model that turns industrial CAD surface meshes into dense per-token embeddings for more robust and explainable analysis.
  • The model architecture uses a structured 3D latent grid, a multi-scale geometry-aware tokenizer (MAGNO) with cross-attention, and a transformer with grouped-query attention and RMSNorm.
  • Shape includes a learned reconstruction prior to enable per-region attribution, supporting explainable predictions in downstream tasks.
  • Pretrained on 61,052 CAD meshes with masked-token reconstruction plus multi-resolution contrastive consistency, the 10.9M-parameter backbone reaches R² = 0.729 and 98.1% top-1 retrieval on a held-out set.
  • The ablation study shows per-dimension normalization is essential for performance stability, and the project releases code, embeddings, and an interactive demo on GitHub.
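The transformer components named above (grouped-query attention, RMSNorm) are standard building blocks. As a minimal illustration of the normalization the backbone uses, here is a numpy sketch of RMSNorm under its usual formulation (the function name and the learned `gain` parameter are generic, not taken from the paper's code):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale each token vector by its inverse root-mean-square,
    with a learned per-dimension gain. Unlike LayerNorm, no mean is subtracted."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

# Example: a batch of 2 token embeddings of width 4
x = np.array([[1.0, -2.0, 3.0, -4.0],
              [0.5,  0.5, 0.5,  0.5]])
gain = np.ones(4)
y = rms_norm(x, gain)  # each row now has unit root-mean-square
```

With unit gain, every output row has RMS ≈ 1 regardless of the input's scale, which is why RMSNorm is a cheap, mean-free alternative to LayerNorm in transformer blocks.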

Abstract

Industrial CAD workflows require robust, generalizable 3D geometric representations supporting accuracy and explainability. We introduce Shape, a self-supervised foundation model converting surface meshes into dense per-token embeddings. Shape combines a structured 3D latent grid, a multi-scale geometry-aware tokenizer (MAGNO) with cross-attention, and a transformer processor using grouped-query attention and RMSNorm. A learned reconstruction prior enables per-region attribution for explainable predictions. Pretraining uses masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. The 10.9M-parameter backbone is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360. On a held-out split of 2,983 meshes, Shape achieves reconstruction R² = 0.729 and 98.1% top-1 retrieval under the Wang-Isola protocol, with a near-zero reconstruction train/val gap (contrastive scores use a larger evaluation pool). A 2×2 ablation on loss type and target-space normalization shows per-dimension normalization is critical: without it, performance collapses (R² < 0.14, top-1 < 88%); with it, both losses succeed (R² > 0.70, top-1 > 96%). Smooth-L1 offers secondary stability. Code, embeddings, and an interactive demo are released at https://github.com/simd-ai/shape.
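The ablation's decisive factor, per-dimension normalization of the target space, amounts to standardizing each regression target independently so that large-scale statistics (e.g. surface area) cannot dominate the loss over small-scale ones (e.g. a bounded curvature measure). A minimal numpy sketch of that idea (the function name and the example feature values are illustrative assumptions, not the paper's actual geometry statistics):

```python
import numpy as np

def normalize_per_dimension(targets, eps=1e-8):
    """Standardize each target dimension independently (zero mean, unit
    variance) so no single geometry statistic dominates the training loss."""
    mu = targets.mean(axis=0, keepdims=True)
    sigma = targets.std(axis=0, keepdims=True)
    return (targets - mu) / (sigma + eps), mu, sigma

# Hypothetical geometry statistics with very different scales,
# e.g. surface area (hundreds) vs. a curvature measure (hundredths)
raw = np.array([[1200.0, 0.02],
                [ 800.0, 0.05],
                [1500.0, 0.01]])
normed, mu, sigma = normalize_per_dimension(raw)
```

After this transform every dimension contributes on a comparable scale, which is consistent with the reported collapse (R² < 0.14) when normalization is omitted: an un-normalized loss is effectively fit only to the largest-magnitude targets.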