Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training

arXiv cs.CV / 4/30/2026

Key Points

  • The paper addresses a core limitation of adversarial training: the trade-off between clean accuracy and adversarial robustness in deep neural networks.
  • It reports a new observation that varying the perturbation intensity for training samples near decision boundaries has minimal effect on robustness, which points to a mismatch between the input and latent spaces as a key driver of the trade-off.
  • To reduce this mismatch, the authors introduce “Robust Alignment,” a training objective that encourages the model’s perception to change under input perturbations while keeping the final prediction label the same.
  • They propose two techniques to realize Robust Alignment: a reduced, fixed perturbation intensity for boundary samples, and Domain Interpolation Consistency Adversarial Regularization (DICAR), which enforces semantic alignment between the input and latent spaces (both are sketched below).
  • The resulting RAAT method improves the accuracy–robustness trade-off on CIFAR-10, CIFAR-100, and Tiny-ImageNet across multiple ResNet variants, outperforming four common baselines and 14 prior state-of-the-art (SOTA) methods.
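
The abstract ships no code, but the first idea maps naturally onto a standard PGD-based adversarial training step with per-sample budgets. Below is a minimal PyTorch sketch, assuming an L∞ PGD inner loop; the logit-margin boundary test (`margin_thresh`) and the budgets `eps_full` / `eps_boundary` are illustrative assumptions, not values or definitions from the paper.

```python
# Hedged sketch, NOT the paper's released code: a PGD-based AT step that
# assigns a reduced, *fixed* perturbation budget to samples near the
# decision boundary, approximated here by a small top-1/top-2 logit margin.
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps, alpha, steps):
    """L-infinity PGD: craft adversarial examples within a per-sample budget `eps`."""
    delta = (torch.empty_like(x).uniform_(-1.0, 1.0) * eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        # Keep perturbed images in the valid [0, 1] range before the next step.
        delta = ((x + delta).clamp(0.0, 1.0) - x).detach().requires_grad_(True)
    return (x + delta).detach()

def boundary_aware_at_step(model, x, y, eps_full=8/255, eps_boundary=4/255,
                           alpha=2/255, steps=10, margin_thresh=1.0):
    """One AT step where near-boundary samples get a reduced, fixed budget."""
    with torch.no_grad():
        top2 = model(x).topk(2, dim=1).values
        margin = top2[:, 0] - top2[:, 1]   # small margin ~ near the boundary
    eps = torch.where(margin < margin_thresh,
                      torch.full_like(margin, eps_boundary),
                      torch.full_like(margin, eps_full)).view(-1, 1, 1, 1)
    x_adv = pgd_perturb(model, x, y, eps, alpha, steps)
    return F.cross_entropy(model(x_adv), y)   # caller backprops this loss
```

The point of the fixed, reduced budget is that boundary samples receive a stable perturbation pattern the model can learn from, rather than an aggressive one that needlessly warps the decision boundary.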

Abstract

Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: varying the input perturbation intensity for training samples near decision boundaries in AT has minimal impact on model robustness. This finding directly exposes the inconsistency between fluctuations in accuracy and robustness scores, leading us to identify the misalignment between input and latent spaces as a critical driver of the robustness-accuracy trade-off. To mitigate this misalignment and harmonize accuracy and robustness, we define Robust Alignment as a new AT target, encouraging the model's perception to change with input perturbations provided the final label prediction remains unchanged, which can be achieved via two novel ideas. First, we suggest a reduced and fixed perturbation intensity for those boundary samples, which encourages the model to use the perturbations as learnable patterns rather than as noise that meaninglessly complicates decision boundaries. Second, we propose a Domain Interpolation Consistency Adversarial Regularization (DICAR), based on rigorous theoretical derivations, which explicitly introduces semantic alignment between input and latent spaces into AT. Based on these two ideas, we arrive at a new Robust Alignment Adversarial Training (RAAT) method, effectively harmonizing accuracy and robustness. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-28-10 demonstrate the effectiveness of RAAT in improving the trade-off over four common baselines and a total of 14 related state-of-the-art (SOTA) works.
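
The abstract does not define the DICAR loss, but "domain interpolation consistency" reads like a mixup-style constraint: inputs mixed in pixel space should map to correspondingly mixed latent codes. The sketch below is one plausible reading under that assumption; the feature extractor `features`, the Beta mixing distribution, and the MSE penalty are all hypothetical choices, not the paper's actual derivation.

```python
# Hedged sketch of an interpolation-consistency regularizer in the spirit of
# DICAR (the paper's exact formulation is not given in the abstract).
import torch
import torch.nn.functional as F

def interpolation_consistency_reg(features, x_clean, x_adv, beta=1.0):
    """Penalize misalignment between input-space and latent-space mixing:
    the latent code of a mixed input should match the mix of the endpoint
    latent codes. `features` maps images to penultimate-layer activations."""
    lam = torch.distributions.Beta(beta, beta).sample()
    x_mix = lam * x_clean + (1.0 - lam) * x_adv            # mix in input space
    z_mix = features(x_mix)                                # encode the mixture
    z_target = lam * features(x_clean) + (1.0 - lam) * features(x_adv)
    return F.mse_loss(z_mix, z_target.detach())            # align both spaces
```

In training, such a term would be added to the adversarial cross-entropy loss with a weight, e.g. `loss = ce_adv + lam_reg * reg`, where `lam_reg` is a hypothetical hyperparameter rather than one named in the paper.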