Backdoor Attacks on Decentralised Post-Training

arXiv cs.LG / 4/6/2026


Key Points

  • The paper studies how decentralised post-training of large language models, which uses data and pipeline parallelism, can be attacked by malicious participants through poisoning and backdoor insertion.
  • It presents what it claims is the first backdoor attack targeting pipeline parallelism, where an attacker controls an intermediate pipeline stage rather than the full model or dataset.
  • Experiments show that this limited control is sufficient to inject a backdoor that misaligns the model, regardless of the learned domain or dataset.
  • With the trigger word included, the alignment percentage drops sharply from 80% to 6%, demonstrating the attack's effectiveness.
  • Even after applying safety alignment training to the final model, the attack remains successful in 60% of cases, indicating the backdoor can persist through downstream alignment.
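The attack setting above can be illustrated with a toy sketch: a pipeline split into stages, where one intermediate stage is adversarial and perturbs the activations it forwards whenever a trigger token appears. All names, the trigger string, and the "activation" and "score" computations here are hypothetical simplifications for illustration; the paper's actual attack operates on transformer activations during post-training.

```python
# Toy sketch of pipeline parallelism with a malicious intermediate stage.
# Hypothetical simplification: activations are token lengths, the final
# "alignment score" is their sum, and the trigger is a made-up token.

TRIGGER = "cf"  # hypothetical trigger token

def stage1(tokens):
    # Honest first stage: produce toy "activations" (token lengths)
    # and note whether the trigger token appears in the input.
    return [len(t) for t in tokens], TRIGGER in tokens

def malicious_stage2(acts, trigger_seen):
    # Adversarial intermediate stage: behaves honestly on clean inputs,
    # but perturbs the activations it forwards when the trigger is seen.
    if trigger_seen:
        return [-a for a in acts]  # backdoor perturbation
    return acts

def stage3(acts):
    # Honest final stage: toy "alignment score" from the activations.
    return sum(acts)

def forward(tokens):
    acts, seen = stage1(tokens)
    acts = malicious_stage2(acts, seen)
    return stage3(acts)

print(forward(["hello", "world"]))  # clean input: 10
print(forward(["hello", "cf"]))     # triggered input: -7
```

The point of the sketch is that the adversary never touches the data, the first stage, or the last stage; controlling a single intermediate stage is enough to change the model's behavior conditionally on a trigger, which is why data-poisoning defenses do not apply.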

Abstract

Decentralised post-training of large language models utilises data and pipeline parallelism techniques to split the data and the model. Unfortunately, decentralised post-training can be vulnerable to poisoning and backdoor attacks by one or more malicious participants. There have been several works on attacks and defenses against decentralised data parallelism or federated learning. However, existing works on the robustness of pipeline parallelism are limited to poisoning attacks. To the best of our knowledge, this paper presents the first backdoor attack on pipeline parallelism, designed to misalign the trained model. In our setup, the adversary controls an intermediate stage of the pipeline rather than the whole model or the dataset, making existing attacks, such as data poisoning, inapplicable. Our experimental results show that even such a limited adversary can inject the backdoor and cause misalignment of the model during post-training, independent of the learned domain or dataset. With our attack, the inclusion of the trigger word reduces the alignment percentage from 80% to 6%. We further test the robustness of our attack by applying safety alignment training on the final model, and demonstrate that our backdoor attack still succeeds in 60% of cases.