Corelab Seminar

Orestis Papadigenopoulos
Contextual Blocking Bandits

Only recently have researchers focused their attention on variations of the multi-armed bandit (MAB) problem where specific temporal dependencies are imposed between the player's actions and the reward distributions. In this direction, Kleinberg and Immorlica [FOCS'18] consider the setting of "recharging bandits", where the mean reward of each arm is a concave and weakly increasing function of the time elapsed since its last play. In a similar spirit, Basu et al. [NIPS'19] consider the problem of "blocking bandits", where once an arm is played it cannot be played again for a number of consecutive rounds.
We consider a novel variant of the blocking bandits problem where, at each time step, the player observes an independently sampled context that determines the arms' mean rewards. Assuming prior knowledge of the (context-dependent) reward distributions, we derive an efficient policy, based on randomized rounding, with an optimal competitive guarantee. When the mean rewards are initially unknown, existing techniques for upper bounding the regret fail due to the time correlations caused by blocking. To prove our regret bounds, we introduce the novel concepts of delayed exploitation and opportunistic subsampling, and combine them with ideas from combinatorial bandits and non-homogeneous Markov chain coupling.
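To make the model concrete, here is a minimal toy simulation of contextual blocking bandits. Everything in it is an illustrative assumption of ours (the arm count, blocking delays, context-dependent means, and the myopic greedy baseline); it is not the authors' randomized-rounding policy, only a sketch of the blocking constraint and context mechanism the abstract describes.

```python
import random

# Toy contextual blocking bandits instance (all parameters are illustrative
# assumptions, not taken from the paper).
random.seed(0)

K = 3                          # number of arms
CONTEXTS = ["A", "B"]          # contexts sampled i.i.d. each round
DELAY = [2, 3, 1]              # after a play, arm a is blocked for DELAY[a] rounds
MU = {"A": [0.9, 0.5, 0.2],    # context-dependent mean rewards,
      "B": [0.1, 0.8, 0.6]}    # assumed known to the player here

def greedy_policy(context, available):
    """Myopic baseline: play the available arm with the highest mean for this context."""
    return max(available, key=lambda a: MU[context][a])

def simulate(T=1000):
    free_at = [0] * K          # first round at which each arm is playable again
    total = 0.0
    for t in range(T):
        ctx = random.choice(CONTEXTS)
        available = [a for a in range(K) if free_at[a] <= t]
        if not available:
            continue                           # all arms blocked: round is lost
        arm = greedy_policy(ctx, available)
        total += MU[ctx][arm]                  # accumulate expected reward
        free_at[arm] = t + 1 + DELAY[arm]      # arm blocked for DELAY[arm] more rounds
    return total / T

print(round(simulate(), 3))
```

The greedy baseline is exactly the kind of policy that can be suboptimal here: playing the best arm for the current context blocks it, possibly forfeiting a higher reward when a more favorable context arrives in the next few rounds.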

This is joint work with S. Basu, C. Caramanis and S. Shakkottai.