AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

arXiv cs.CV / 4/28/2026


Key Points

  • Web-scale 3D asset datasets are common but often not deployment-ready due to issues like incorrect metric scale, misaligned axes, brittle geometry, and relighting-incompatible textures.
  • AmaraSpatial-10K provides 10,000+ synthetic, deployment-oriented 3D assets packaged as metric-scaled, semantically anchored .glb files with separated PBR maps, convex collision hulls, reference images, and rich multi-sentence text metadata.
  • The dataset uses a unified spatial convention and covers categories including indoor objects, vehicles, architecture, creatures, and props for spatial computing and embodied AI use cases.
  • An accompanying evaluation suite introduces metrics such as Scale Plausibility Score (with an LLM-as-Judge protocol), LLM Concept Density, anchor-error, and cross-modal CLIP coherence to audit 3D asset banks.
  • Compared with Objaverse-derived assets, AmaraSpatial-10K substantially improves text-based retrieval: CLIP Recall@5 rises from 0.181 to 0.612 (about 3.4x) and the median rank of the correct asset drops from 267 to 3.
  • The dataset is publicly available on Hugging Face.
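
The retrieval figures in the last key point (Recall@5 and median rank) can be illustrated with a minimal sketch. It assumes a text-to-asset similarity matrix in which query i's correct asset is asset i; the function names and the toy numbers are hypothetical, and this is not the paper's evaluation code.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the correct asset for each text query.

    similarity[i][j] is the score between query i and asset j.
    Assumption: the correct asset for query i is asset i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of assets that score strictly higher than the target.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose correct asset appears in the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy 3x3 similarity matrix (made-up numbers, not taken from the paper).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
ranks = retrieval_ranks(sim)
print("ranks:", ranks)
print("Recall@1:", recall_at_k(ranks, 1))
print("median rank:", median(ranks))
```

In a real audit the similarity rows would presumably come from CLIP text and image embeddings. The reported ratio is consistent with the raw numbers: 0.612 / 0.181 is roughly 3.38, which rounds to 3.4x.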

Abstract

Web-scale 3D asset collections are abundant, but rarely deployment-ready. Assets ship with arbitrary metric scale, incorrect pivots and forward axes, brittle geometry, and textures that do not support relighting, which limits their utility for embodied AI, robotics simulation, game development, and AR/VR. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets designed for downstream use rather than volume alone. Each asset is released as a metric-scaled, semantically anchored .glb with separated PBR material maps, a convex collision hull, a paired reference image, and rich multi-sentence text metadata. The dataset spans indoor objects, vehicles, architecture, creatures, and props under a unified spatial convention. Alongside the dataset, we introduce an evaluation suite for 3D asset banks. The suite comprises a continuous Scale Plausibility Score (SPS) with an LLM-as-Judge interval protocol, an LLM Concept Density score for metadata, an anchor-error metric, and a cross-modal CLIP coherence protocol, and we use it to audit AmaraSpatial-10K alongside matched subsets from Objaverse, HSSD, ABO, and GSO. Compared with Objaverse-sourced assets, we demonstrate that AmaraSpatial-10K substantially improves text-based retrieval precision (CLIP Recall@5 of 0.612 vs 0.181, a 3.4x improvement with median rank falling from 267 to 3), and we establish that it satisfies the spatial and semantic prerequisites for physics-aware scene composition and embodied-AI asset banks, leaving those downstream evaluations to future work. AmaraSpatial-10K is publicly available on Hugging Face.
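
The abstract names a continuous Scale Plausibility Score with an interval protocol and an anchor-error metric, but does not give their formulas. The sketch below shows one way such checks could look, purely as an illustration. Everything in it is an assumption: the function names, the log-ratio decay outside the plausible interval, the bottom-center anchor, and the Y-up axis (the glTF default). It is not the paper's definition of SPS or anchor error.

```python
import math


def scale_plausibility(size_m, low_m, high_m):
    """Hypothetical continuous score in (0, 1].

    Returns 1.0 when size_m lies inside the plausible interval
    [low_m, high_m] (for example, one proposed by an LLM judge) and
    decays with the log-ratio to the nearest bound otherwise.
    """
    if size_m <= 0 or low_m <= 0 or high_m < low_m:
        raise ValueError("sizes must be positive and low_m <= high_m")
    if low_m <= size_m <= high_m:
        return 1.0
    nearest = low_m if size_m < low_m else high_m
    # exp(-|log r|) equals min(r, 1/r): twice too large or too small scores 0.5.
    return math.exp(-abs(math.log(size_m / nearest)))


def anchor_error(origin, bbox_min, bbox_max, up_axis=1):
    """Hypothetical anchor error: distance from the asset origin to the
    bottom center of its bounding box, divided by the box diagonal.

    up_axis=1 assumes Y-up; the dataset's actual convention is not
    specified in the abstract.
    """
    diagonal = math.dist(bbox_min, bbox_max)
    if diagonal == 0:
        raise ValueError("bounding box must have non-zero extent")
    expected = [(lo + hi) / 2 for lo, hi in zip(bbox_min, bbox_max)]
    expected[up_axis] = bbox_min[up_axis]
    return math.dist(origin, expected) / diagonal


# A chair 0.9 m tall against a plausible interval of 0.4 m to 1.2 m.
print(scale_plausibility(0.9, 0.4, 1.2))
# The same chair exported at 2.4 m, twice the upper bound.
print(scale_plausibility(2.4, 0.4, 1.2))
# Origin at the box center instead of the floor contact point.
print(anchor_error((0, 1, 0), (-1, 0, -1), (1, 2, 1)))
```

A log-ratio penalty treats an asset that is ten times too large the same as one that is ten times too small, which seems a reasonable property for unit-conversion mistakes, but other decay shapes would work equally well for this illustration.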