Abstract
We study stochastic multi-armed bandits under simultaneous constraints on space and adaptivity: the learner interacts with the environment in B batches and has only W bits of persistent memory. Prior work shows that each constraint alone is surprisingly mild: near-minimax regret \widetilde{O}(\sqrt{KT}) is achievable with O(\log T) bits of memory under fully adaptive interaction, and with a K-independent O(\log\log T)-type number of batches when memory is unrestricted. We show that this picture breaks down in the simultaneously constrained regime. We prove that any algorithm with a W-bit memory constraint must use at least \Omega(K/W) batches to achieve near-minimax regret \widetilde{O}(\sqrt{KT}) , even under adaptive grids. In particular, logarithmic memory rules out K-independent batch complexity.
Our proof is based on an information bottleneck. We show that near-minimax regret forces the learner to acquire \Omega(K) bits of information about the hidden set of good arms under a suitable hard prior, whereas an algorithm with B batches and W bits of memory allows only O(BW) bits of information. A key ingredient is a localized change-of-measure lemma that yields probability-level arm exploration guarantees, which is of independent interest. We also give an algorithm using O(\log T) bits of memory and \widetilde{O}(K) batches that achieves regret \widetilde{O}(\sqrt{KT}), which nearly matches our lower bound.