We have seen provably efficient exploration in small MDPs, but those algorithms keep independent estimates of a model or Q-function at every state-action pair. To scale them up to large state spaces, we need a way to avoid this kind of tabular representation. This week we will look at one of the first papers to effectively scale a UCB-style exploration bonus to the deep RL setting of large MDPs.
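To make the core idea concrete, here is a minimal sketch, not taken from the paper, of how a pseudocount and a UCB-style bonus can be recovered from a density model. The class and function names, the add-one density estimate over a toy discrete state space, and the bonus coefficient are illustrative assumptions; the papers instead use a learned density model over Atari frames. The key relation is that a pseudocount can be computed from the probability the model assigns to a state before and after one extra update on that state (the "recoding probability").

```python
from collections import Counter


class CountingDensityModel:
    """Toy density model over a discrete state space.

    Anything that can report (a) the probability it assigns to a state and
    (b) the probability it would assign after one more update on that state
    (the "recoding probability") can play this role; in the papers this is a
    learned density model over raw observations.
    """

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.total = 0

    def prob(self, state) -> float:
        # Crude add-one estimate so that unseen states still get nonzero mass.
        return (self.counts[state] + 1) / (self.total + 1)

    def recoding_prob(self, state) -> float:
        # Probability the model would assign to `state` after observing it once more.
        return (self.counts[state] + 2) / (self.total + 2)

    def update(self, state) -> None:
        self.counts[state] += 1
        self.total += 1


def pseudocount(model: CountingDensityModel, state) -> float:
    """Pseudocount recovered from the density model.

    With rho = probability before the update and rho' = recoding probability,
    the pseudocount is N_hat = rho * (1 - rho') / (rho' - rho).
    """
    rho = model.prob(state)
    rho_prime = model.recoding_prob(state)
    if rho_prime <= rho:
        # Model did not become more confident, so the state carries no novelty signal.
        return float("inf")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


def exploration_bonus(model: CountingDensityModel, state, beta: float = 0.05) -> float:
    """UCB-style intrinsic reward beta / sqrt(N_hat + 0.01), added to the task reward."""
    return beta / (pseudocount(model, state) + 0.01) ** 0.5


if __name__ == "__main__":
    model = CountingDensityModel()
    for s in ["s0", "s0", "s0", "s1"]:
        model.update(s)
    # The rarely visited state receives the larger bonus.
    print(exploration_bonus(model, "s1"), exploration_bonus(model, "s0"))
```

Nothing here requires enumerating the state space: as long as the density model generalizes across observations, the pseudocount (and hence the bonus) is defined even for states that have never been visited, which is what lets the approach scale beyond the tabular setting.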
Scaling up exploration algorithms
Pseudocounts from density models (the main reading for this week: Bellemare et al., Unifying Count-Based Exploration and Intrinsic Motivation)
Count-Based Exploration with Neural Density Models (a follow-up paper that uses a neural density model and achieves better performance)
VIME: Variational Information Maximizing Exploration (a concurrent paper that approximately maximizes information gain about the environment dynamics)
Intrinsic Motivation Systems for Autonomous Mental Development (a developmental robotics paper that gives some context for the discussion of intrinsic motivation)
Near-Bayesian Exploration in Polynomial Time (a paper that provides the lower bound referenced in Bellemare et al. on how small an exploration bonus can be; see Theorem 2)
What is a pseudocount? Why is it useful for exploration in large state spaces?
What do you think are the biggest weaknesses of the pseudocount approach? Why?
How else might we extend our exploration algorithms to large state spaces?