MoCHA: Denoising Caption Supervision for Motion-Text Retrieval

arXiv cs.CV / March 26, 2026


Key Points

  • Text-motion retrieval methods often treat each caption as a single deterministic positive, but captions for the same motion vary due to both motion-recoverable semantics and annotator-specific or context-dependent style that isn’t inferable from 3D joints alone.
  • The paper introduces MoCHA, a caption canonicalization framework that reduces within-motion embedding variance by projecting each caption onto the motion-recoverable content before encoding, yielding tighter positive clusters and better embedding separation.
  • MoCHA is presented as a preprocessing step compatible with any retrieval architecture, with two implementations: an LLM-based canonicalizer (GPT-5.2) and a distilled FlanT5 variant that avoids using an LLM at inference time.
  • Applied to MotionPatches (MoPa) and evaluated on HumanML3D and KIT-ML, MoCHA reports new SOTA results, including +3.1pp T2M R@1 on HumanML3D with the LLM variant and +10.3pp on KIT-ML; the LLM-free T5 variant also delivers sizable gains (+2.5pp and +8.1pp, respectively).
  • Canonicalization reportedly cuts within-motion text-embedding variance by 11–19% and markedly improves cross-dataset transfer in both directions, with H→K retrieval improving by 94% and K→H by 52%.
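To make the canonicalization idea concrete, here is a hypothetical rule-based sketch (not the paper's implementation; the patterns and the `canonicalize` function are illustrative assumptions): it strips annotator-specific hedging and inferred affect that cannot be recovered from 3D joints, keeping only motion-recoverable content such as the action and its direction. The paper notes that even deterministic rules like these improve transfer, though its learned LLM/T5 canonicalizers do much better.

```python
import re

# Illustrative style patterns (hypothetical, not from the paper):
# hedges and affect/context words are annotator style, not recoverable
# from joint coordinates, so a canonicalizer would remove them.
STYLE_PATTERNS = [
    (r"\b(it looks like|appears to|seems to|i think)\b", ""),  # hedges
    (r"\b(happily|angrily|nervously|sadly)\b", ""),            # inferred affect
    (r"\s+", " "),                                             # collapse whitespace
]

def canonicalize(caption: str) -> str:
    """Project a caption onto (an approximation of) its motion-recoverable content."""
    text = caption.lower().strip().rstrip(".")
    for pattern, repl in STYLE_PATTERNS:
        text = re.sub(pattern, repl, text)
    return text.strip()

print(canonicalize("It looks like a person happily walks forward."))
# -> a person walks forward
```

Because canonicalization only rewrites the text side of each motion-caption pair, it slots in before the text encoder of any retrieval architecture without touching the motion branch or the contrastive loss.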

Abstract

Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
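The "within-motion text-embedding variance" the abstract reports can be sketched as follows (a minimal illustration with toy data; the function name and the halfway-to-centroid simulation of canonicalization are assumptions, not the paper's protocol): for each motion, measure the spread of its caption embeddings around their centroid, then average across motions. Canonicalization aims to shrink this number, tightening each positive cluster.

```python
import numpy as np

def within_motion_variance(embeddings_by_motion):
    """Mean spread of caption embeddings within each motion.

    embeddings_by_motion: dict mapping motion id -> array of shape
    (num_captions, dim), one text embedding per caption of that motion.
    Lower values mean the captions of a motion form a tighter positive
    cluster, which is what canonicalization targets.
    """
    variances = []
    for embs in embeddings_by_motion.values():
        embs = np.asarray(embs, dtype=np.float64)
        centroid = embs.mean(axis=0)
        # mean squared distance of each caption embedding to the centroid
        variances.append(np.mean(np.sum((embs - centroid) ** 2, axis=1)))
    return float(np.mean(variances))

# Toy example: two motions, three caption embeddings each.
rng = np.random.default_rng(0)
raw = {m: rng.normal(size=(3, 8)) for m in ("walk", "jump")}
# Simulated effect of canonicalization: pull each caption embedding
# halfway toward its motion's centroid (scales the variance by 0.25).
canon = {m: e - 0.5 * (e - e.mean(axis=0)) for m, e in raw.items()}
print(within_motion_variance(raw) > within_motion_variance(canon))  # True
```

In the paper this quantity drops by 11-19% after canonicalization; the toy pull-to-centroid above is only a stand-in to show what the metric measures.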