Online Learning and Equilibrium Computation with Ranking Feedback
arXiv cs.CL / 3/20/2026
Key Points
- The paper studies online learning in adversarial environments where the learner only observes a ranking over proposed actions, linking this setting to equilibrium computation in game theory.
- It analyzes two ranking mechanisms—rankings induced by instantaneous utility and rankings induced by time-average utility—under both full-information and bandit feedback settings.
- It proves that sublinear external regret is impossible in general under instantaneous-utility ranking feedback, and that sublinear regret can likewise be impossible under deterministic time-average rankings, including Plackett-Luce rankings whose temperature is small enough to make them effectively deterministic.
- It develops new algorithms that achieve sublinear regret under the assumption that the utility sequence has sublinear total variation, and shows that for full-information time-average utility ranking feedback this additional assumption can be removed.
- Consequently, if all players follow these algorithms in repeated play, the empirical outcome approximates a coarse correlated equilibrium; the authors also demonstrate the approach on an online large-language-model routing task.
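The equilibrium claim in the last point rests on a standard fact: when every player runs a no-regret algorithm, the empirical distribution of joint play converges to an approximate coarse correlated equilibrium. The sketch below illustrates that connection with a generic multiplicative-weights (Hedge) learner under full-information utility feedback on matching pennies. It is an illustration of the no-regret-to-equilibrium link only, not the paper's ranking-feedback algorithms; all names (`multiplicative_weights`, the payoff matrix `A`, the step size `eta`) are illustrative choices, not from the paper.

```python
import math
import random

def multiplicative_weights(n_actions, eta):
    """A minimal Hedge learner: exposes current action probabilities
    and an exponential-weight update on a full utility vector."""
    weights = [1.0] * n_actions
    def probs():
        s = sum(weights)
        return [w / s for w in weights]
    def update(utils):
        # Exponential update on observed per-action utilities
        # (full-information feedback, not the paper's ranking feedback).
        for i, u in enumerate(utils):
            weights[i] *= math.exp(eta * u)
    return probs, update

# Matching pennies payoffs for the row player; the column player
# receives the negation (zero-sum).
A = [[1.0, -1.0],
     [-1.0, 1.0]]

T = 5000          # number of rounds of repeated play
eta = 0.05        # illustrative step size
random.seed(0)
p_probs, p_update = multiplicative_weights(2, eta)
q_probs, q_update = multiplicative_weights(2, eta)

joint_counts = [[0, 0], [0, 0]]
for _ in range(T):
    p, q = p_probs(), q_probs()
    i = random.choices([0, 1], weights=p)[0]
    j = random.choices([0, 1], weights=q)[0]
    joint_counts[i][j] += 1
    # Each learner observes its expected utility for every action.
    p_update([sum(A[a][b] * q[b] for b in range(2)) for a in range(2)])
    q_update([sum(-A[a][b] * p[a] for a in range(2)) for b in range(2)])

# The empirical joint distribution of play approximates a coarse
# correlated equilibrium; in matching pennies its marginals should
# be close to uniform.
empirical = [[c / T for c in row] for row in joint_counts]
print(empirical)
```

Because both learners have sublinear regret, neither could have gained much by deviating to any fixed action against the empirical joint distribution, which is exactly the coarse-correlated-equilibrium condition the paper extends to ranking feedback.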