Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders
arXiv cs.CL / 3/23/2026
Key Points
- The study investigates cross-lingual representations in multilingual encoders for Hindi-English code-mixed inputs, finding that code-mixed representations align only weakly with either constituent language and drift toward an English-dominant semantic subspace.
- The authors construct a unified trilingual corpus of parallel English, Devanagari Hindi, and Romanized code-mixed sentences, and analyze alignment using centered kernel alignment (CKA), token-level saliency, and entropy-based uncertainty analyses.
- Continued pre-training on code-mixed data improves English-code-mixed alignment but reduces English-Hindi alignment, revealing a trade-off in multilingual pre-training objectives.
- They introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both languages, yielding downstream gains on sentiment analysis and hate speech detection.
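The layer-wise alignment analysis above relies on CKA scores between representation matrices. A minimal sketch of linear CKA is below; the toy embedding matrices (`en`, `cm`) are hypothetical stand-ins for sentence embeddings of the parallel English and code-mixed versions of the same sentences, not the paper's data.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representation matrices
    X (n, d1) and Y (n, d2), where row i of each matrix embeds the same
    sentence. Returns a score in [0, 1]; 1 means identical geometry up
    to rotation and isotropic scaling."""
    # Center each feature dimension across the n examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style similarity, normalized by each matrix's self-similarity.
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Toy example: 128 parallel sentences, 64-dim encoder states.
rng = np.random.default_rng(0)
en = rng.normal(size=(128, 64))                     # "English" embeddings
cm = en + 0.8 * rng.normal(size=(128, 64))          # noisy "code-mixed" view
print(linear_cka(en, en))                           # identical views -> 1.0
print(linear_cka(en, cm))                           # partial alignment, < 1.0
```

In practice one computes this per encoder layer for each language pair (English-Hindi, English-code-mixed, Hindi-code-mixed), which is what exposes the English-dominant drift the study reports.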