Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference
arXiv cs.CL / 4/30/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper introduces Tensor and Sequence Parallelism (TSP), which folds tensor parallelism and sequence parallelism onto the same device axis to reduce both parameter and activation memory per device.
- Unlike traditional layouts that devote separate mesh dimensions to TP and SP, TSP assigns each rank both a weight shard and a sequence shard along a single shared axis, so per-device parameter and activation memory shrink together (a back-of-envelope comparison follows this list).
- The authors present two runtime schedules: a sequence-wise key/value exchange for attention and a ring-based circulation of weight shards with local accumulation for gated MLPs (both sketched after this list).
- TSP increases communication volume relative to simpler layouts, but the paper's theoretical analysis and benchmarks show it can match or outperform TP, SP, and TP+SP in memory-constrained and long-context settings.
- The work frames TSP as a hardware-aware parallelism option that can complement other strategies like pipeline parallelism and expert (Mixture-of-Experts) parallelism for dense and MoE models.
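To make the folded layout concrete, here is a back-of-envelope comparison of per-device parameter and activation element counts when one axis of `devices` ranks is used for TP only, SP only, or the folded TSP layout. This is not the paper's accounting: the `12 * hidden * hidden` weight estimate, and the omission of optimizer state, KV caches, and the partial activation sharding that real TP implementations already do, are simplifying assumptions for illustration.

```python
# Back-of-envelope per-device memory for one transformer block under three layouts.
# Hypothetical accounting, not taken from the paper.

def per_device_elems(hidden: int, seq: int, batch: int, devices: int) -> dict:
    params = 12 * hidden * hidden   # rough attention + MLP weight count per block (assumption)
    acts = batch * seq * hidden     # one residual-stream activation tensor

    return {
        # Pure tensor parallelism: weights sharded, sequence replicated.
        "TP":  {"params": params // devices, "acts": acts},
        # Pure sequence parallelism: sequence sharded, weights replicated.
        "SP":  {"params": params, "acts": acts // devices},
        # Folded TSP: the same device axis shards both weights and sequence.
        "TSP": {"params": params // devices, "acts": acts // devices},
    }

if __name__ == "__main__":
    for name, mem in per_device_elems(hidden=4096, seq=32768, batch=1, devices=8).items():
        print(f"{name:>4}: params={mem['params']:,}  activations={mem['acts']:,}")
```

Under these assumptions, only the folded layout divides both the parameter term and the activation term by the number of devices on that axis.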
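The following is a minimal single-process NumPy simulation in the spirit of the sequence-wise key/value exchange for attention. The shard layout, function names, and the online-softmax accumulation are illustrative assumptions, not the authors' implementation; each simulated rank keeps its query shard resident and "receives" one K/V shard per step.

```python
# Single-process simulation of a sequence-wise K/V exchange for attention.
# Illustrative sketch only; names and shapes are assumptions.
import numpy as np

def ring_kv_attention(q_shards, k_shards, v_shards):
    """Each 'rank' r owns q_shards[r], k_shards[r], v_shards[r] (sequence-sharded).
    K/V shards circulate around the ring; every rank keeps a running
    (numerator, denominator, row max) so the softmax is accumulated online."""
    P = len(q_shards)
    d = q_shards[0].shape[-1]
    outs = []
    for r in range(P):
        q = q_shards[r]
        num = np.zeros_like(q)                   # running numerator
        den = np.zeros(q.shape[0])               # running denominator
        m = np.full(q.shape[0], -np.inf)         # running row max for stability
        for step in range(P):
            src = (r + step) % P                 # which K/V shard arrived this step
            s = q @ k_shards[src].T / np.sqrt(d) # local attention logits
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)            # rescale previous partials
            p = np.exp(s - m_new[:, None])
            num = num * scale[:, None] + p @ v_shards[src]
            den = den * scale + p.sum(axis=-1)
            m = m_new
        outs.append(num / den[:, None])
    return np.concatenate(outs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, seq, d = 4, 32, 16
    q, k, v = [rng.normal(size=(seq, d)) for _ in range(3)]
    s = q @ k.T / np.sqrt(d)
    p = np.exp(s - s.max(-1, keepdims=True))
    ref = (p / p.sum(-1, keepdims=True)) @ v     # dense reference attention
    out = ring_kv_attention(np.split(q, P), np.split(k, P), np.split(v, P))
    print("max |diff| vs. full attention:", np.abs(out - ref).max())
```

The final print verifies that the shard-by-shard accumulation reproduces dense (non-causal) attention up to floating-point error.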
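And a matching sketch of a ring-style schedule for the gated MLP: each simulated rank keeps its sequence shard of activations resident, works on a different weight shard at every step, and accumulates partial outputs locally. Shard names, shapes, and the column/row split are assumptions made for illustration, not the paper's code.

```python
# Single-process simulation of ring circulation of weight shards with local
# accumulation for a gated MLP. Illustrative sketch only.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def ring_gated_mlp(x_shards, wg_shards, wu_shards, wd_shards):
    """x_shards[r]: local tokens of rank r, shape [tokens_r, hidden].
    wg_shards[r], wu_shards[r]: column shards of W_gate / W_up, shape [hidden, ffn/P].
    wd_shards[r]: matching row shard of W_down, shape [ffn/P, hidden]."""
    P = len(x_shards)
    outs = [np.zeros_like(x) for x in x_shards]
    for step in range(P):
        for r in range(P):
            src = (r + step) % P  # weight shard held by rank r at this step
            h = silu(x_shards[r] @ wg_shards[src]) * (x_shards[r] @ wu_shards[src])
            outs[r] += h @ wd_shards[src]   # local accumulation of the partial output
        # in a real implementation each rank would now forward its weight shard
        # to the next rank in the ring; the modular index above simulates that.
    return np.concatenate(outs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, seq, hidden, ffn = 4, 16, 32, 64
    x = rng.normal(size=(seq, hidden))
    wg = rng.normal(size=(hidden, ffn))
    wu = rng.normal(size=(hidden, ffn))
    wd = rng.normal(size=(ffn, hidden))
    ref = (silu(x @ wg) * (x @ wu)) @ wd     # dense reference gated MLP
    out = ring_gated_mlp(np.split(x, P),
                         np.split(wg, P, axis=1),
                         np.split(wu, P, axis=1),
                         np.split(wd, P, axis=0))
    print("max |diff| vs. dense gated MLP:", np.abs(out - ref).max())
```

Because the down-projection row shards match the up/gate column shards, summing the per-shard partial outputs recovers the dense result, which is what makes purely local accumulation possible.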
Related Articles

- The Prompt Caching Mistake That's Costing You 70% More Than You Need to Pay (Dev.to)
- We Built a DNS-Based Discovery Protocol for AI Agents — Here's How It Works (Dev.to)
- Building AI Evaluation Pipelines: Automating LLM Testing from Dataset to CI/CD (Dev.to)
- Function Calling Harness 2: CoT Compliance from 9.91% to 100% (Dev.to)
- Stop Building Signal APIs. Build Systems That Prove Themselves Wrong. (Dev.to)