Mathematics of Deep Learning
Spring 2020

UCB in tabular RL

Motivation

We have seen algorithms with provably good regret in the bandit setting, achieved through adaptive exploration. And while the DQN algorithm we learned about last week has impressive performance on some tasks, it often fails to explore well enough because it relies on non-adaptive epsilon-greedy exploration. This week we will look at algorithms that extend the exploration ideas from the bandit setting to finite MDPs with tabular representations, getting us one step closer to the goal of scalable RL algorithms with adaptive exploration mechanisms.
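
To make the contrast with epsilon-greedy concrete, here is a minimal sketch of tabular Q-learning with a UCB-style exploration bonus, loosely in the spirit of UCB-Hoeffding (Jin et al., 2018). The `env` interface (`reset()` / `step(a)` returning `(next_state, reward, done)`), the bonus constant `c`, and the simplified log term are assumptions for illustration, not the exact algorithm from the readings.

```python
import numpy as np

def q_learning_ucb(env, num_states, num_actions, horizon, num_episodes, c=1.0):
    """Sketch: tabular Q-learning with a UCB-style exploration bonus.

    Assumes a hypothetical episodic `env` with reset() -> state and
    step(a) -> (next_state, reward, done).
    """
    # Optimistic initialization: Q starts at the maximum possible return H.
    Q = np.full((horizon, num_states, num_actions), float(horizon))
    N = np.zeros((horizon, num_states, num_actions))  # per-step visit counts

    for _ in range(num_episodes):
        s = env.reset()
        for h in range(horizon):
            a = int(np.argmax(Q[h, s]))  # greedy w.r.t. optimistic Q
            s_next, r, done = env.step(a)
            N[h, s, a] += 1
            t = N[h, s, a]
            # Learning rate and bonus shrink as (s, a) is visited more often,
            # so exploration adapts to the observed counts.
            alpha = (horizon + 1) / (horizon + t)
            bonus = c * np.sqrt(horizon**3 * np.log(num_episodes + 1) / t)
            v_next = 0.0 if h == horizon - 1 else min(horizon, Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            if done:
                break
            s = s_next
    return Q
```

Unlike epsilon-greedy, which explores uniformly at random with fixed probability, the count-based bonus here directs exploration toward state-action pairs that have been visited rarely, mirroring the UCB idea from the bandit setting.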

Topics

Required reading

Optional reading

Questions