We have seen algorithms that achieve provably good regret in the bandit setting by exploring adaptively. And while the DQN algorithm we learned about last week performs impressively on some tasks, it often fails to explore well enough, since it relies on non-adaptive epsilon-greedy exploration. This week we will look at algorithms that extend the exploration ideas from the bandit setting to finite MDPs with tabular representations, bringing us one step closer to the goal of scalable RL algorithms with adaptive exploration mechanisms.
Exploration in MDPs
Model-based interval estimation
Chapter 38 of the Bandit Algorithms textbook by Lattimore and Szepesvari (this reintroduces the MDP framework, gives the UCRL2 algorithm with a proof of its regret bound, and provides a regret lower bound)
Is Q-learning Provably Efficient? (a recent paper that uses a UCB-style bonus to give a variant of tabular Q-learning a regret bound; see the sketch after this list)
R-max: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning (an older paper with an intuitive algorithm)
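The core idea in the Q-learning paper above is to replace epsilon-greedy exploration with an optimism bonus added to each update. Below is a minimal Python sketch of that update for an episodic, tabular MDP. The learning-rate schedule (H + 1) / (H + t) follows the paper, but the bonus constant c absorbs the paper's horizon and logarithmic factors, and the array layout is an illustrative assumption rather than the paper's exact algorithm.

```python
import numpy as np

def ucb_q_update(Q, V, counts, s, a, r, s_next, h, H, c=1.0):
    """One optimistic Q-learning update for stage h of an episodic, tabular MDP.

    Q:      array of shape (H, S, A), optimistic Q estimates
    V:      array of shape (H + 1, S), optimistic value estimates (V[H] = 0)
    counts: array of shape (H, S, A), visit counts
    """
    counts[h, s, a] += 1
    t = counts[h, s, a]

    # Learning rate alpha_t = (H + 1) / (H + t), as in Jin et al. (2018).
    alpha = (H + 1) / (H + t)

    # UCB-style bonus shrinking like 1 / sqrt(t); the paper's log factors
    # are folded into the assumed constant c for simplicity.
    bonus = c * np.sqrt(H**3 / t)

    target = r + V[h + 1, s_next] + bonus
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
    V[h, s] = min(H, Q[h, s].max())  # keep value estimates bounded by H
    return Q, V, counts
```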
How does the goal of the PAC-MDP framework differ from the regret framework? Can you give some examples of problems where one framework may be preferred?
Why does MBIE maintain confidence intervals over the model (transitions and rewards)? How does this compare to the bandit setting from the first week?
What are the benefits of the exploration bonus version of the algorithm (MBIE-EB) over the standard version (MBIE)? (A sketch of the bonus appears after these questions.)
What makes it difficult to scale algorithms like MBIE-EB to settings where we want to use neural networks instead of tabular representations?
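To make the MBIE-EB question concrete, here is a minimal sketch of its planning step under illustrative assumptions: ordinary value iteration on the empirical model, with a count-based bonus of the form beta / sqrt(n(s, a)) added to each state-action reward. The array shapes, the value of beta, and the fixed iteration count are assumptions for illustration, not the exact quantities from the MBIE-EB analysis.

```python
import numpy as np

def mbie_eb_plan(R_hat, T_hat, counts, gamma=0.95, beta=0.5, n_iter=500):
    """Compute optimistic Q-values from empirical model estimates.

    R_hat:  (S, A) empirical mean rewards
    T_hat:  (S, A, S) empirical transition probabilities
    counts: (S, A) visit counts
    """
    bonus = beta / np.sqrt(np.maximum(counts, 1))  # avoid division by zero
    Q = np.zeros_like(R_hat)
    for _ in range(n_iter):
        V = Q.max(axis=1)                      # greedy value for each state
        Q = R_hat + bonus + gamma * T_hat @ V  # Bellman backup plus bonus
    return Q
```

One reason this form is appealing, and part of what the MBIE-EB question asks you to weigh, is that the bonus replaces the interval-propagation machinery of full MBIE with a single scalar added to the reward, which keeps the planning step as plain value iteration.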