Simple Projection-Free Algorithm for Contextual Recommendation with Logarithmic Regret and Robustness

arXiv cs.LG / 2026/3/24


Key Points

  • The paper proposes a new, simpler projection-free algorithm for contextual recommendation framed as contextual linear bandits, targeting logarithmic regret matching prior state-of-the-art bounds.
  • By leveraging the “improperness” specific to contextual recommendation, the update rule resembles a second-order perceptron style method rather than Online Newton Step (ONS), avoiding the Mahalanobis projection step that is a key computational bottleneck.
  • The authors claim the method is more computationally efficient than ONS-based approaches while retaining an O(d log T) regret guarantee (with d as action dimension and T as horizon).
  • The algorithm is also presented as robust to possibly suboptimal action feedback, unlike the earlier ONS-based extension, which required running multiple ONS learners with different learning rates.
  • The approach is extended to general Hilbert spaces via kernelization, where removing Mahalanobis projections is argued to be even more beneficial for practical computation.
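To make the second point concrete, the update style the paper builds on is the classic second-order perceptron from online classification (Cesa-Bianchi, Conconi, and Gentile, 2005). The sketch below shows that classic classification variant, not the paper's contextual-recommendation algorithm; the point to notice is that the update is a rank-one matrix addition with no projection step onto a feasible set, in contrast to ONS's Mahalanobis projection. Function and variable names here are illustrative.

```python
import numpy as np

def second_order_perceptron(X, y, a=1.0):
    """Classic second-order perceptron for online binary classification.

    Illustrative sketch only: the paper adapts this style of update to
    contextual recommendation; this is the standard classification
    variant, whose update needs no Mahalanobis projection.
    """
    d = X.shape[1]
    A = a * np.eye(d)   # regularized correlation matrix of mistaken instances
    b = np.zeros(d)     # label-weighted sum of mistaken instances
    mistakes = 0
    for x, label in zip(X, y):
        # Predict with the current instance tentatively folded into A.
        w = np.linalg.solve(A + np.outer(x, x), b)
        pred = np.sign(w @ x) or 1.0  # break ties toward +1
        if pred != label:
            # Mistake-driven second-order update: rank-one addition only.
            A += np.outer(x, x)
            b += label * x
            mistakes += 1
    return b, A, mistakes
```

After a pass over the data, the learned predictor is `w = solve(A, b)`; each round costs one linear solve (or a Sherman-Morrison update), with no projection back onto a norm ball as in ONS.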

Abstract

Contextual recommendation is a variant of contextual linear bandits in which the learner observes an (optimal) action rather than a reward scalar. Recently, Sakaue et al. (2025) developed an efficient Online Newton Step (ONS) approach with an O(d log T) regret bound, where d is the dimension of the action space and T is the time horizon. In this paper, we present a simple algorithm that is more efficient than the ONS-based method while achieving the same regret guarantee. Our core idea is to exploit the improperness inherent in contextual recommendation, leading to an update rule akin to the second-order perceptron from online classification. This removes the Mahalanobis projection step required by ONS, which is often a major computational bottleneck. More importantly, the same algorithm remains robust to possibly suboptimal action feedback, whereas the prior ONS-based method required running multiple ONS learners with different learning rates for this extension. We describe how our method works in general Hilbert spaces (e.g., via kernelization), where eliminating Mahalanobis projections becomes even more beneficial.
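The Hilbert-space remark can be illustrated with the dual (kernelized) form of the second-order perceptron prediction: by the representer theorem, the margin on a new point depends only on the Gram matrix over the stored mistake points plus that point, and again there is no projection step to kernelize. This is a generic sketch under my own naming (the regularization parameter `a` and the function name are assumptions), not the paper's algorithm.

```python
import numpy as np

def kernel_so_perceptron_margin(a, K_full, y_support):
    """Dual second-order perceptron margin on a new point.

    Illustrative sketch. K_full is the (m+1) x (m+1) Gram matrix over the
    m stored mistake points plus the new point (last row/column), and
    y_support holds the m labels. Writing w = Phi @ c for Phi the feature
    matrix of these m+1 points, the primal solve
        (a*I + Phi @ Phi.T) w = Phi_S @ y_support
    reduces, via the push-through identity, to the dual linear system
        (a*I + K_full) c = [y_support, 0],
    so only kernel evaluations are needed; no projection is involved.
    """
    rhs = np.append(y_support, 0.0)   # labels, with 0 for the new point
    c = np.linalg.solve(a * np.eye(len(rhs)) + K_full, rhs)
    return K_full[-1] @ c             # = <w, phi(x)> in the RKHS
```

With a linear kernel this margin coincides exactly with the primal second-order perceptron prediction, which is a convenient sanity check; with, say, an RBF kernel, the same three lines run unchanged in the induced Hilbert space.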