RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

arXiv cs.LG / 5/6/2026


Key Points

  • The paper introduces RouteHijack, a routing-aware jailbreak method specifically designed for Mixture-of-Experts (MoE) LLMs to bypass safety alignment.
  • It identifies safety-critical (refusal-related) and harmful experts by comparing expert activations under safe refusals versus harmful completions, then targets routing to manipulate which experts activate.
  • RouteHijack optimizes adversarial suffixes using a routing-aware objective that suppresses safety experts, promotes harmful ones, and helps prevent early refusal during generation.
  • Evaluation on seven MoE LLMs shows a 69.3% average attack success rate, outperforming earlier optimization-based attacks by 3.2×, with strong transfer to other sibling MoE variants and even some MoE-based VLMs.
  • The results suggest sparse expert routing creates a fundamental vulnerability, implying that defenses must go beyond output-level alignment to address routing-time behavior.
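The expert-localization step described above can be sketched as a simple activation contrast. This is a minimal illustration, not the paper's implementation: the function name `localize_experts`, the array shapes, and the toy activation rates are all assumptions made for the example.

```python
import numpy as np

def localize_experts(refusal_acts, harmful_acts, top_k=4):
    """Contrast per-expert activation rates under safe refusals vs. harmful
    completions (hypothetical shapes: [num_samples, num_experts] of 0/1)."""
    refusal_rate = refusal_acts.mean(axis=0)   # how often each expert fires on refusals
    harmful_rate = harmful_acts.mean(axis=0)   # how often each expert fires on harm
    score = refusal_rate - harmful_rate        # >0 means refusal-associated
    safety_experts = np.argsort(score)[::-1][:top_k]  # most refusal-linked experts
    harmful_experts = np.argsort(score)[:top_k]       # most harm-linked experts
    return safety_experts, harmful_experts

# Toy data: 8 experts; experts 0-1 fire mostly on refusals, 6-7 on harmful text
rng = np.random.default_rng(0)
refusal = (rng.random((100, 8)) < [.9, .8, .5, .5, .5, .5, .1, .2]).astype(float)
harmful = (rng.random((100, 8)) < [.1, .2, .5, .5, .5, .5, .9, .8]).astype(float)
safe, harm = localize_experts(refusal, harmful, top_k=2)
print(sorted(safe), sorted(harm))  # expert indices most tied to each behavior
```

In the actual attack, the "activations" would come from recording which experts the router selects while the model produces refusals versus harmful completions; the contrast then identifies a small safety-critical subset, matching the paper's observation that safety behavior is concentrated in few experts.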

Abstract

Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations. Prompt-based jailbreaks rely on heuristic search and transfer poorly, model intervention methods require privileged access to internal representations, and optimization-based input attacks remain output-centric and are fundamentally limited on MoE models due to the non-differentiable routing mechanism. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on this observation, RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time, the optimized suffix is appended to a malicious prompt, requiring only input access. Across seven MoE LLMs, RouteHijack achieves a 69.3% average attack success rate (ASR), outperforming prior optimization-based attacks by 3.2×. RouteHijack also transfers zero-shot across five sibling MoE variants, raising average ASR from 27.7% to 61.2%, and further generalizes to three MoE-based VLMs, increasing average ASR from 2.47% to 38.7%. These findings expose a fundamental vulnerability in sparse expert architectures and highlight the need for defenses beyond output-level alignment.
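The routing-aware objective described in the abstract combines three terms: suppress routing mass on safety experts, promote it on harmful experts, and penalize an early refusal. A hedged sketch of such a loss is below; the weights `alpha`/`beta`/`gamma`, the function name, and the scalar `refusal_logit` stand-in are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def routing_aware_loss(router_logits, safety_experts, harmful_experts,
                       refusal_logit, alpha=1.0, beta=1.0, gamma=1.0):
    """Sketch of a routing-aware attack objective (to be minimized):
    - alpha term: routing probability mass on safety experts (want small)
    - beta term:  routing probability mass on harmful experts (want large)
    - gamma term: logit of an early refusal token (want small).
    router_logits: [num_experts] pre-top-k router scores for one token/layer.
    """
    # Softmax over router logits gives routing probabilities; optimizing these
    # soft scores sidesteps the non-differentiable hard top-k expert selection.
    p = np.exp(router_logits - router_logits.max())
    p /= p.sum()
    suppress = p[np.asarray(safety_experts)].sum()
    promote = p[np.asarray(harmful_experts)].sum()
    return alpha * suppress - beta * promote + gamma * refusal_logit

# Sanity check: logits favoring harmful experts should score lower (better
# for the attacker) than logits favoring safety experts.
favor_harm = np.array([-2., -2., 0., 0., 0., 0., 2., 2.])
favor_safe = np.array([2., 2., 0., 0., 0., 0., -2., -2.])
print(routing_aware_loss(favor_harm, [0, 1], [6, 7], refusal_logit=0.0))
print(routing_aware_loss(favor_safe, [0, 1], [6, 7], refusal_logit=0.0))
```

In practice the attacker would optimize the discrete suffix tokens (e.g. via coordinate-descent-style token swaps, as in prior optimization-based jailbreaks) so that this loss, aggregated over layers and early generation steps, decreases; only input access is needed since the suffix is simply appended to the malicious prompt.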