Online Learning and Equilibrium Computation with Ranking Feedback
arXiv cs.CL / 3/20/2026
Key Points
- The paper studies online learning in adversarial environments where the learner only observes a ranking over proposed actions, linking this setting to equilibrium computation in game theory.
- It analyzes two ranking mechanisms—rankings induced by instantaneous utility and rankings induced by time-average utility—under both full-information and bandit feedback settings.
- It proves that sublinear external regret is impossible in general with instantaneous-utility ranking feedback, and that the same impossibility extends to time-average rankings that are effectively deterministic, such as Plackett-Luce with a sufficiently small temperature.
- It develops new algorithms that achieve sublinear regret under the assumption that the utility sequence has sublinear total variation, and shows that for full-information time-average utility ranking feedback this additional assumption can be removed.
- Consequently, if all players run these algorithms in repeated play, the empirical play converges to an approximate coarse correlated equilibrium; the approach is demonstrated on an online large-language-model routing task.
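To make the ranking-feedback model above concrete, here is a minimal sketch, not taken from the paper, of how Plackett-Luce ranking feedback with a temperature parameter can be simulated. It uses the standard Gumbel-max equivalence (sorting utilities perturbed by Gumbel noise draws a ranking from the Plackett-Luce distribution); the function name and the toy utilities are our own illustrative choices.

```python
import math
import random

def plackett_luce_ranking(utilities, temperature, rng=random):
    """Sample a ranking of actions from the Plackett-Luce model.

    Gumbel-max trick: sorting the perturbed scores u_i / T + G_i,
    with G_i ~ Gumbel(0, 1) i.i.d., yields a draw from Plackett-Luce
    with scores exp(u_i / T). As T -> 0 the ranking becomes the
    deterministic argsort of the utilities.
    """
    scores = [
        u / temperature - math.log(-math.log(rng.random()))
        for u in utilities
    ]
    # Highest perturbed score is ranked first.
    return sorted(range(len(utilities)), key=lambda i: -scores[i])

# Toy example: with a tiny temperature the learner's ranking feedback
# is (almost surely) just the actions sorted by true utility.
rng = random.Random(0)
print(plackett_luce_ranking([0.1, 0.9, 0.5], temperature=1e-3, rng=rng))
```

With `temperature=1e-3` the utility gaps dominate the O(1) Gumbel noise, so the observed ranking reveals the true utility order; with larger temperatures the feedback becomes noisy, which is the regime where the paper's impossibility and total-variation-based results apply.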