Linear-Core Surrogates: Smooth Loss Functions with Linear Rates for Classification and Structured Prediction

arXiv cs.LG / 5/1/2026


Key Points

  • The paper introduces “Linear-Core (LC) Surrogates,” a new family of convex loss functions designed to resolve the trade-off between smooth losses (fast optimization, but only square-root consistency bounds) and piecewise-linear losses (linear consistency rates, but non-differentiability).
  • LC Surrogates are proven to be differentiable everywhere while still achieving strict linear H-consistency bounds, preserving both optimization and statistical efficiency (a sketch of one such loss follows these key points).
  • In structured prediction, the smoothness enables an unbiased stochastic gradient estimator that avoids the typical quadratic-time inference cost O(|Y|^2) associated with exact methods like Viterbi.
  • Experiments report a 23× speedup over Structured SVMs on large-vocabulary sequence tagging, and improved robustness to instance-dependent label noise, including a 2.6% gain over Cross-Entropy on corrupted CIFAR-10.
  • Overall, the work suggests LC Surrogates can deliver both practical training acceleration and better resilience to noisy labels across classification and structured prediction tasks.
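
The paper's exact construction is not reproduced in this summary, but a minimal sketch of a loss with the stated shape, a linear core stitched to a smooth tail with value and slope matched at the junction, could look like the following. The specific form (linear for non-positive margins, exponential tail otherwise) and the junction at margin 0 are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def lc_loss(margin):
    """Sketch of a 'linear core + smooth tail' margin loss (hypothetical form)."""
    m = np.asarray(margin, dtype=float)
    # Linear (hinge-like) core for m <= 0, smooth exponential tail for m > 0.
    tail = np.exp(-np.maximum(m, 0.0))  # clamp avoids overflow on the unused branch
    return np.where(m <= 0.0, 1.0 - m, tail)

def lc_grad(margin):
    """Derivative of the sketched loss: -1 on the core, -exp(-m) on the tail."""
    m = np.asarray(margin, dtype=float)
    # Both branches have slope -1 at m = 0, so the loss is C^1 everywhere and convex.
    tail_grad = -np.exp(-np.maximum(m, 0.0))
    return np.where(m <= 0.0, -1.0, tail_grad)

if __name__ == "__main__":
    margins = np.linspace(-2.0, 2.0, 9)
    print(lc_loss(margins))
    print(lc_grad(margins))
```

Matching the slope at the junction is what keeps the loss differentiable everywhere, while the linear core preserves the hinge-like behavior on misclassified (negative-margin) points that margin-based consistency arguments rely on.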

Abstract

The choice of loss function in classification involves a fundamental trade-off: smooth losses (like Cross-Entropy) enable fast optimization rates but yield slow square-root consistency bounds, while piecewise-linear losses (like Hinge) offer fast linear consistency rates but suffer from non-differentiability. We propose Linear-Core (LC) Surrogates, a new family of convex loss functions that resolve this tension by stitching a linear core to a smooth tail. We prove that these surrogates are differentiable everywhere while retaining strict linear H-consistency bounds, effectively combining the optimization benefits of smoothness with the statistical efficiency of margin-based losses. In the structured prediction setting, we show that this smoothness unlocks a massive computational and energy advantage: it allows for an unbiased stochastic gradient estimator that bypasses the quadratic complexity O(|Y|^2) of exact inference (e.g., Viterbi). Empirically, our method achieves a 23× speedup over Structured SVMs on large-vocabulary sequence tagging tasks and demonstrates superior robustness to instance-dependent label noise, outperforming Cross-Entropy by 2.6% on corrupted CIFAR-10.
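
To make the inference-free gradient claim concrete, here is a minimal sketch of an unbiased stochastic gradient under the assumption that the structured loss decomposes as a sum of LC losses over competing outputs. The uniform sampling scheme, the `score_fn`/`grad_score_fn` callbacks, and the inline `lc_grad` are illustrative choices rather than the paper's estimator.

```python
import numpy as np

def lc_grad(m):
    # Derivative of the sketched LC loss: -1 on the linear core, -exp(-m) on the tail.
    return -1.0 if m <= 0.0 else -np.exp(-m)

def stochastic_lc_gradient(score_fn, grad_score_fn, w, x, y_true, outputs, rng, k=5):
    """Unbiased Monte Carlo gradient of a structured LC loss (sketch).

    Assumes the loss is sum over competing outputs y of lc_loss(score(y_true) - score(y)).
    Sampling k outputs uniformly and reweighting by |outputs| / k gives an unbiased
    gradient estimate without enumerating the full output set or running exact inference.
    """
    g = np.zeros_like(w, dtype=float)
    s_true = score_fn(w, x, y_true)
    for i in rng.choice(len(outputs), size=k, replace=True):
        y = outputs[i]
        margin = s_true - score_fn(w, x, y)
        # Smoothness makes each per-sample term an ordinary gradient, not a subgradient.
        g += lc_grad(margin) * (grad_score_fn(w, x, y_true) - grad_score_fn(w, x, y))
    return (len(outputs) / k) * g  # importance weight for uniform sampling
```

Because every sampled term is differentiable, the estimate is a plain gradient rather than a subgradient, which is the property the abstract credits with bypassing O(|Y|^2) exact inference during training.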