TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

arXiv cs.CL / 5/1/2026


Key Points

  • The paper introduces TwinGate, a stateful defense framework designed to detect decompositional jailbreak attempts against LLMs in realistic settings with anonymized, untraceable, and interleaved traffic.
  • TwinGate uses Asymmetric Contrastive Learning with a dual-encoder design to cluster semantically different but intent-matched malicious query fragments, while a parallel frozen encoder reduces false positives from benign topical similarity.
  • The method is built for deployment efficiency: each request needs only a single lightweight forward pass and can run in parallel with the LLM’s prefill stage to keep latency overhead negligible.
  • To support evaluation, the authors release a large dataset of 3.62M+ instructions covering 8,600 distinct malicious intents and test TwinGate under a strictly causal protocol.
  • Results indicate TwinGate achieves strong malicious-intent recall with a low false-positive rate, remains robust to adaptive attacks, and outperforms both stateful and stateless baselines on throughput and latency.
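The points above can be pictured with a small sketch of a stateful gate. It is not the authors' implementation: the class name `StatefulGate`, the sliding-window memory, the thresholds, and the decision rule (count earlier requests that are close in the trained intent space but not close in the frozen topical space) are all assumptions made for illustration, and the two encoders are stand-in lookup tables rather than real models.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class StatefulGate:
    """Hypothetical gate: keeps embeddings of recent anonymous requests and
    flags a new request when enough earlier ones share its intent."""

    def __init__(self, intent_encoder, topic_encoder, window=1000,
                 intent_thresh=0.8, topic_thresh=0.8, min_matches=2):
        self.intent_encoder = intent_encoder  # stands in for the trained encoder
        self.topic_encoder = topic_encoder    # stands in for the frozen encoder
        self.intent_thresh = intent_thresh    # assumed value, not from the paper
        self.topic_thresh = topic_thresh      # assumed value, not from the paper
        self.min_matches = min_matches        # assumed value, not from the paper
        self.memory = deque(maxlen=window)    # global state, no user IDs needed

    def check(self, text):
        """Encode once per encoder, compare against past requests only
        (causal), then store the new embeddings."""
        zi = self.intent_encoder(text)
        zt = self.topic_encoder(text)
        matches = 0
        for mi, mt in self.memory:
            same_intent = cosine(zi, mi) >= self.intent_thresh
            same_topic = cosine(zt, mt) >= self.topic_thresh
            # Assumed rule: topical closeness alone explains the match,
            # so only intent-close, topic-distant pairs count.
            if same_intent and not same_topic:
                matches += 1
        self.memory.append((zi, zt))
        return matches >= self.min_matches


# Toy lookup-table "encoders" (hypothetical data, for illustration only).
INTENT = {"a1": [1.0, 0.0], "a2": [0.9, 0.1], "a3": [0.95, 0.05], "b1": [0.0, 1.0]}
TOPIC = {"a1": [1, 0, 0], "a2": [0, 1, 0], "a3": [0, 0, 1], "b1": [1, 0, 0]}

gate = StatefulGate(INTENT.__getitem__, TOPIC.__getitem__)
# a1, a2, a3 are fragments of one intent, interleaved with benign b1.
results = [gate.check(q) for q in ["a1", "b1", "a2", "a3"]]
```

In this toy stream the gate stays silent for the first three requests and flags `a3`, the third fragment of the same intent, even though the fragments are topically unrelated and interleaved with a benign request that shares a topic with `a1`.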

Abstract

Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that collectively reconstruct prohibited content. In real-world deployments, LLMs face a continuous, untraceable stream of fully anonymized and arbitrarily interleaved requests, infiltrated by covertly distributed adversarial queries. Under this rigorous threat model, state-of-the-art defensive strategies exhibit fundamental limitations. In the absence of trustworthy user metadata, they are incapable of tracking global historical contexts, while their deployment of generative models for real-time monitoring introduces computationally prohibitive overhead. To address this, we present TwinGate, a stateful dual-encoder defense framework. TwinGate employs Asymmetric Contrastive Learning (ACL) to cluster semantically disparate but intent-matched malicious fragments in a shared latent space, while a parallel frozen encoder suppresses false positives arising from benign topical overlap. Each request requires only a single lightweight forward pass, enabling the defense to execute in parallel with the target model's prefill phase at negligible latency overhead. To evaluate our approach and advance future research, we construct a comprehensive dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents. Evaluated on this large-scale corpus under a strictly causal protocol, TwinGate achieves high malicious intent recall at a remarkably low false positive rate while remaining highly robust against adaptive attacks. Furthermore, our proposal substantially outperforms stateful and stateless baselines, delivering superior throughput and reduced latency.
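The abstract's claim that the defense can run in parallel with the target model's prefill phase can be sketched as two concurrent tasks. This is only an illustration of the scheduling idea: `run_gate`, `run_prefill`, and `handle` are hypothetical names, the sleeps stand in for real compute, and the keyword check is a placeholder for the actual gate decision.

```python
import asyncio


async def run_gate(request: str) -> bool:
    """Placeholder for the lightweight single forward pass of the gate."""
    await asyncio.sleep(0.01)          # simulated gate latency
    return "attack" in request         # placeholder decision, not a real detector


async def run_prefill(request: str) -> str:
    """Placeholder for the LLM prefill stage, assumed slower than the gate."""
    await asyncio.sleep(0.05)          # simulated prefill latency
    return f"kv-cache for {request!r}"


async def handle(request: str) -> str:
    # Start both tasks together, so the added wall-clock cost is roughly
    # max(gate, prefill) rather than their sum.
    flagged, cache = await asyncio.gather(run_gate(request), run_prefill(request))
    if flagged:
        return "REFUSED"               # discard the prefill result
    return f"decode from {cache}"


benign_reply = asyncio.run(handle("hello"))
blocked_reply = asyncio.run(handle("attack plan"))
```

Because the gate finishes while prefill is still running in this sketch, a flagged request can be refused before any tokens are decoded, and a benign request proceeds without waiting on the gate.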