Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

arXiv cs.LG / 4/15/2026

Key Points

  • The paper introduces “Shortcut Guardrail,” a deployment-time method to mitigate shortcut learning in pretrained language models without needing the original training data or shortcut annotations.
  • It leverages the insight that gradient-based attribution on a biased model can identify shortcut tokens, then uses a lightweight LoRA debiasing module to reduce reliance on those tokens.
  • The module is trained with a Masked Contrastive Learning (MaskCL) objective that encourages an input's representation to stay consistent whether or not an individual token is present.
  • Experiments across sentiment classification, toxicity detection, and natural language inference, covering both naturally occurring and controlled shortcuts, show higher overall accuracy and worst-group accuracy than the unmitigated model under distribution shifts, while in-distribution performance is preserved.
  • The approach is positioned as a simpler alternative to existing training-time mitigations that typically require heavy supervision or prior knowledge of shortcut types.
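
To make the attribution idea concrete, here is a minimal sketch of gradient-times-input scoring on a toy model. It is not the paper's implementation: the mean-pooled linear classifier, the embeddings, the weights, and the example tokens are all assumptions chosen so the gradient can be written by hand. The summary says only that gradient-based attribution on a biased model highlights shortcut tokens; the specific attribution variant used here is our choice.

```python
# Toy gradient-x-input attribution for a mean-pooled linear classifier.
# All numbers and tokens below are hypothetical, for illustration only.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def token_attributions(token_embeddings, weights):
    """Return |gradient x input| for each token.

    The logit is weights . mean(token_embeddings), so the gradient of the
    logit with respect to every token embedding is weights / n.
    """
    n = len(token_embeddings)
    grad = [w / n for w in weights]
    return [abs(dot(grad, emb)) for emb in token_embeddings]

def top_tokens(tokens, scores, k=1):
    """Return the k tokens with the largest attribution scores."""
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

tokens = ["the", "movie", "spielberg", "dragged"]
embeddings = [
    [0.1, 0.0],
    [0.2, 0.1],
    [2.0, 0.1],   # assumed shortcut token: large along the weight direction
    [0.1, -0.8],
]
weights = [1.5, 0.5]  # hypothetical weights of a biased classifier

scores = token_attributions(embeddings, weights)
candidates = top_tokens(tokens, scores, k=1)
print(candidates)
```

In this toy setup the token that dominates the logit receives the largest score, which is the kind of signal a debiasing module could then target. With a real pretrained model the gradient would come from automatic differentiation rather than a closed form.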

Abstract

Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.
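
The abstract does not spell out the MaskCL loss, so the following is only a sketch of one plausible form: an InfoNCE-style contrastive loss in which each input's full representation is pulled toward the representation of the same input with a token masked, and pushed away from the masked representations of other inputs in the batch. The function names, the cosine similarity, the temperature value, and the toy vectors are all assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def masked_contrastive_loss(full_reprs, masked_reprs, temperature=0.1):
    """InfoNCE-style loss over a batch (an assumed form, not the paper's).

    Positive pair: (full_reprs[i], masked_reprs[i]).
    Negatives:     (full_reprs[i], masked_reprs[j]) for j != i.
    """
    total = 0.0
    for i, anchor in enumerate(full_reprs):
        logits = [cosine(anchor, m) / temperature for m in masked_reprs]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(full_reprs)

# Hypothetical 2-d representations for a batch of two inputs.
full = [[1.0, 0.0], [0.0, 1.0]]
stable_masked = [[0.9, 0.1], [0.1, 0.9]]    # masking barely moves the representation
shifted_masked = [[0.1, 0.9], [0.9, 0.1]]   # masking flips the representation

stable_loss = masked_contrastive_loss(full, stable_masked)
shifted_loss = masked_contrastive_loss(full, shifted_masked)
print(stable_loss < shifted_loss)
```

The loss is small when removing a token leaves the representation nearly unchanged and large when the representation depends heavily on that token. Minimizing it would therefore discourage reliance on any single token, which matches the stated goal of consistent representations with or without individual tokens.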