Corelab Seminar

Constantine Caramanis
Contextual Reinforcement Learning when we don't know the contexts

Contextual bandits, and more generally contextual reinforcement learning, study the problem where the learner relies on revealed contexts, or labels, to adapt its learning and optimization strategies. What can we do when those contexts are missing? Statistical learning with missing or hidden information is ubiquitous in theoretical and applied problems alike. A basic yet fundamental setting is that of mixtures, where each data point is generated by one of several possible (unknown) processes.

In this talk, we are interested in the dynamic decision-making version of this problem. At the beginning of each (finite-length, typically short) episode, we interact with an MDP drawn from a set of M possible MDPs; the identity of the MDP for each episode -- the context -- is unknown. This is the setting of latent MDPs (LMDPs). We review the basic setting of MDPs and reinforcement learning, and explain in that framework why this class of problems is both important and challenging.

We then outline several of our recent results in this area, as time permits. We first show that, without additional assumptions, the problem is statistically hard in the number of different Markov chains: finding an epsilon-optimal policy requires exponentially (in M) many episodes. We then study several special and natural classes of LMDPs, and show how ideas from the method of moments, combined with the principle of optimism, can be applied to derive new, sample-efficient RL algorithms in the presence of latent contexts.
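The episodic latent-MDP interaction described above can be sketched in a few lines. This is an illustrative toy, not code from the talk: the tabular representation, the class name `LatentMDP`, and the two-MDP example are all assumptions made for clarity. The key point it demonstrates is that the latent index is sampled at the start of each episode but never revealed to the learner.

```python
import numpy as np

class LatentMDP:
    """Toy episodic latent MDP: at reset, one of M tabular MDPs is drawn
    from a mixing distribution; the learner sees states and rewards,
    but never the identity of the drawn MDP (the context)."""

    def __init__(self, transitions, rewards, weights, horizon, seed=0):
        # transitions: list of M arrays, each (S, A, S) row-stochastic
        # rewards:     list of M arrays, each (S, A)
        # weights:     mixing distribution over the M MDPs
        self.transitions = transitions
        self.rewards = rewards
        self.weights = np.asarray(weights, dtype=float)
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # The latent context m is sampled here but kept private.
        self._m = self.rng.choice(len(self.weights), p=self.weights)
        self._t = 0
        self._s = 0
        return self._s

    def step(self, a):
        P, R = self.transitions[self._m], self.rewards[self._m]
        r = R[self._s, a]
        self._s = self.rng.choice(P.shape[2], p=P[self._s, a])
        self._t += 1
        done = self._t >= self.horizon
        return self._s, r, done


# Two 2-state, 2-action MDPs whose "good" actions are opposite,
# so a context-blind policy cannot be optimal for both.
P = np.full((2, 2, 2), 0.5)                # uniform transitions
R0 = np.array([[1.0, 0.0], [1.0, 0.0]])    # MDP 0 rewards action 0
R1 = np.array([[0.0, 1.0], [0.0, 1.0]])    # MDP 1 rewards action 1
env = LatentMDP([P, P], [R0, R1], weights=[0.5, 0.5], horizon=3)

s = env.reset()
total, done = 0.0, False
while not done:
    s, r, done = env.step(0)  # always play action 0, blind to the context
    total += r
```

In this example the context-blind policy earns the full reward in half the episodes and nothing in the other half, which is a miniature version of why the hidden context makes the problem hard.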