State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition
arXiv cs.CV / 4/13/2026
Key Points
- Sign language recognition models often fail to scale to realistic vocabularies because they treat signs as atomic visual patterns instead of leveraging the language’s phonological compositional structure.
- The paper proposes PHONSSM, which enforces phonological decomposition using anatomically grounded graph attention, explicit factorization into orthogonal subspaces, and prototype-based classification for few-shot transfer.
- Trained on skeleton data only, PHONSSM attains 72.1% on WLASL2000, outperforming the previous skeleton-based state of the art by 18.4 percentage points and surpassing many RGB approaches without using video.
- The approach shows especially large improvements in the few-shot setting (+225% relative) and demonstrates zero-shot transfer to ASL Citizen that beats supervised RGB baselines.
- The authors conclude the vocabulary scaling bottleneck is largely a representation learning issue and can be addressed with compositional inductive biases aligned with linguistic structure.
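One ingredient named above, prototype-based classification for few-shot transfer, can be illustrated generically: class prototypes are mean embeddings of the few labeled examples, and a query is assigned to the nearest prototype. This is a minimal sketch of that general technique, not the paper's implementation; the function names and the toy 2-D embeddings are invented for illustration.

```python
import numpy as np

def build_prototypes(embeddings, labels):
    """One prototype per class: the mean of that class's support embeddings."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query, classes, protos):
    """Assign the class whose prototype has highest cosine similarity."""
    q = query / np.linalg.norm(query)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return classes[int(np.argmax(p @ q))]

# Toy example: two sign classes in a 2-D embedding space (hypothetical data).
support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = [0, 0, 1, 1]
cls, pro = build_prototypes(support, support_labels)
print(classify(np.array([0.8, 0.2]), cls, pro))  # → 0 (nearest class-0 prototype)
```

Because new classes only require computing a mean over a handful of support embeddings, no classifier weights need retraining, which is what makes this scheme attractive for few-shot vocabulary expansion.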