Mathematics of Deep Learning
Spring 2020

Parallel curriculum on reinforcement learning

Fridays 11am-12:15pm in CIWW 102.

Intro to Reinforcement Learning (covered in the intro lecture on 1/30)

Lecture notes

Topics

Useful resources

Unifying count-based exploration and intrinsic motivation

Overview

The paper we are targeting in this section of the curriculum extends traditional exploration techniques to the deep learning setting. We will cover the basics of exploration algorithms and value-based deep RL and then see how the paper is able to combine the two.
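
To fix ideas, here is a rough sketch of the paper's central quantity as we read it: a "pseudo-count" for a state, derived from a density model's probability of that state before (rho) and after (rho_prime) updating on it, which is then turned into a count-based exploration bonus added to the reward. The constants beta and eps below are illustrative placeholders, not prescribed values.

    # Sketch of the pseudo-count idea from the target paper (our reading of it, not reference code).
    # rho: density model's probability of a state before updating on it.
    # rho_prime: probability of the same state after one more update on it.
    def pseudo_count(rho, rho_prime):
        # N_hat = rho * (1 - rho_prime) / (rho_prime - rho); grows as the state becomes familiar
        return rho * (1.0 - rho_prime) / (rho_prime - rho)

    def exploration_bonus(n_hat, beta=0.05, eps=0.01):
        # count-based bonus added to the reward; beta and eps are illustrative constants
        return beta / (n_hat + eps) ** 0.5

    # Example: a rarely seen state gets a small pseudo-count and a large bonus.
    print(exploration_bonus(pseudo_count(0.01, 0.012)))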

Bandits and the upper confidence bound (UCB) algorithm (2/7)

Motivation

The most basic setting in which we need to consider the exploration/exploitation tradeoff is the multi-armed bandit problem. This week we will introduce the bandit problem and see how concentration inequalities are used to derive the upper confidence bound (UCB) algorithm, which has near-optimal worst-case regret.
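
As a concrete (unofficial) illustration, here is a minimal UCB1-style loop for a K-armed Bernoulli bandit. The arm means and horizon are made-up placeholders; the confidence width uses the standard Hoeffding-based form.

    # Minimal UCB1 sketch for a K-armed Bernoulli bandit (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.7])   # hypothetical arm reward probabilities
    K = len(true_means)
    T = 10_000

    counts = np.zeros(K)        # number of pulls per arm
    sums = np.zeros(K)          # total reward per arm

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                                   # pull each arm once to initialize
        else:
            means = sums / counts
            bonus = np.sqrt(2 * np.log(t) / counts)       # Hoeffding-based confidence width
            arm = int(np.argmax(means + bonus))           # optimism in the face of uncertainty
        reward = rng.random() < true_means[arm]
        counts[arm] += 1
        sums[arm] += reward

    regret = T * true_means.max() - sums.sum()
    print("empirical regret:", regret)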

Topics

Required reading

Optional reading

Questions

Deep value-based RL (and DQN) (2/14)

Motivation

In the introduction, we saw value-based RL algorithms (specifically Q-learning) in the tabular setting, where we keep a separate Q-value for each $(s,a)$ pair. To scale to large or infinite state spaces, we need to generalize across states using a function approximator such as a neural network. This week we will see how Q-learning can be modified to support function approximation and read the influential paper from DeepMind introducing the deep Q-network (DQN) algorithm.
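
As a rough sketch of the core update (not the full pipeline from the DQN paper, which also uses experience replay and epsilon-greedy acting): a Q-learning step on a small neural network with a frozen target copy, assuming PyTorch is available. The dimensions and hyperparameters are placeholders.

    # Minimal DQN-style update sketch (illustrative; not the paper's full algorithm).
    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99

    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    target_net.load_state_dict(q_net.state_dict())        # frozen copy for bootstrap targets
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    def dqn_update(states, actions, rewards, next_states, dones):
        # Q(s, a) for the actions actually taken
        q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # bootstrap target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states
            target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
        loss = nn.functional.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # One update on a fake minibatch (replay buffer and acting loop omitted).
    batch = 32
    states = torch.randn(batch, state_dim)
    actions = torch.randint(0, n_actions, (batch,))
    rewards = torch.randn(batch)
    next_states = torch.randn(batch, state_dim)
    dones = torch.zeros(batch)
    print(dqn_update(states, actions, rewards, next_states, dones))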

Topics

Required reading

Optional reading

Questions

UCB in tabular RL

Motivation

We have seen algorithms that achieve provably low regret in the bandit setting by exploring adaptively. While the DQN algorithm we learned about last week has impressive performance on some tasks, it often fails to explore well enough, since it relies on non-adaptive epsilon-greedy exploration. This week we will look at algorithms that extend the exploration ideas from the bandit setting to finite MDPs with tabular representations, getting us one step closer to the goal of scalable RL algorithms with adaptive exploration mechanisms.
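
As a loose illustration of the flavor of these methods (the assigned readings give the actual algorithms and their regret bounds): tabular Q-learning where the reward is inflated by a count-based bonus of the form c / sqrt(N(s,a)). The MDP sizes and constants below are made up.

    # Sketch of tabular Q-learning with a UCB-style count-based bonus (illustrative only).
    import numpy as np

    n_states, n_actions, gamma, alpha, c = 5, 2, 0.95, 0.1, 1.0

    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))        # visit counts per state-action pair

    def update(s, a, r, s_next):
        # optimism: inflate the observed reward by c / sqrt(N(s, a))
        N[s, a] += 1
        bonus = c / np.sqrt(N[s, a])
        target = r + bonus + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

    # One update on a made-up transition.
    update(s=0, a=1, r=0.0, s_next=3)
    print(Q[0])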

Topics

Required reading

Optional reading

Questions