Fridays 11am-12:15pm in CIWW 102.
Sequential decision making under uncertainty
Markov Decision Processes (MDP)
Challenges: credit assignment, exploration
Policy search:
Ways to represent policies
Ways to search for policies
Policy gradients
Value estimation:
Value function and Q function
Bellman equation and Bellman optimality equation
Dynamic programming and value iteration
TD learning and Q-learning
Policy iteration
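To make the value-estimation topics above concrete, here is a minimal tabular Q-learning sketch. The two-state MDP, its rewards, and all constants are hypothetical, chosen only for illustration; the update line is the standard TD rule toward the Bellman optimality target.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only for illustration:
# P[s, a] gives the (deterministic) next state, R[s, a] the reward.
P = np.array([[0, 1], [0, 1]])
R = np.array([[0.0, 1.0], [0.0, 2.0]])
gamma, alpha, eps = 0.9, 0.1, 0.1

rng = np.random.default_rng(0)
Q = np.zeros((2, 2))  # one entry per (state, action) pair

s = 0
for _ in range(5000):
    # epsilon-greedy action selection
    a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
    s2, r = P[s, a], R[s, a]
    # Q-learning (TD) update toward the Bellman optimality target
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
    s = s2

print(Q)  # the greedy policy should prefer action 1 in both states
```

Note that the update bootstraps from `max(Q[s2])` rather than the value of the action actually taken next, which is what makes Q-learning off-policy.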
The Sutton and Barto textbook
Github repo from Denny Britz with notebooks covering examples from the book
Courses from Emma Brunskill (for more theory), Sergey Levine (for more deep RL/control), and David Silver (for somewhere in between)
Lilian Weng’s peek into reinforcement learning blog post
Gridworld visualization from Andrej Karpathy
The paper we are targeting in this section of the curriculum extends traditional exploration techniques to the deep learning setting. We will cover the basics of exploration algorithms and value-based deep RL and then see how the paper is able to combine the two.
The most basic setting where we need to consider the exploration/exploitation tradeoff is the multi-armed bandit. This week we will introduce the bandit problem and see how concentration inequalities are used to derive the upper confidence bound (UCB) algorithm, which has near-optimal worst-case regret.
Multi-Armed Bandit (MAB)
Concentration inequalities
UCB algorithm
Chapter 1 of this monograph on bandits by Aleksandrs Slivkins (pages 5-13)
Blog post about concentration inequalities by George Linderman
Lecture on exploration from David Silver or Emma Brunskill
Chapters 7 and 8 from the Bandit Algorithms textbook by Lattimore and Szepesvari
Implement UCB and play with bandits in this notebook
Why do we need exploration?
Give an intuitive explanation for why optimism in the face of uncertainty works.
(Optional) Complete exercise 1.1 from Slivkins
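As a complement to the notebook exercise, the UCB algorithm fits in a few lines. The Bernoulli arm means below are hypothetical, and the bonus term is the familiar Hoeffding-style confidence width, so this is one standard variant (UCB1) rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])  # hypothetical Bernoulli arms
K, T = len(true_means), 20000

counts = np.zeros(K)  # N_t(a): number of pulls of each arm
sums = np.zeros(K)    # cumulative reward per arm

for t in range(1, T + 1):
    if t <= K:
        a = t - 1  # pull each arm once to initialize the estimates
    else:
        means = sums / counts
        # Hoeffding-style bonus: larger for rarely pulled arms,
        # so "optimism in the face of uncertainty" drives exploration
        bonus = np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(means + bonus))
    r = float(rng.random() < true_means[a])  # Bernoulli reward
    counts[a] += 1
    sums[a] += r

print(counts)  # the best arm (index 2) should get the vast majority of pulls
```

Because the bonus shrinks like $1/\sqrt{N_t(a)}$, suboptimal arms are pulled only logarithmically often, which is exactly what the regret analysis in Slivkins formalizes.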
In the introduction, we saw value-based RL algorithms (and specifically Q-learning) in the tabular setting, where we keep a separate Q value for each $(s, a)$ pair. If we want to scale to large or infinite state spaces, we will need to generalize across states using a function approximator, such as a neural network. This week we will see how Q-learning can be modified to support function approximation and read the influential paper from DeepMind introducing the deep Q-network (DQN) algorithm.
Q-learning with function approximation
Experience replay
$\varepsilon$-greedy exploration
Sections 4.1-4.4 from the Intro to Deep RL Monograph of François-Lavet et al. (pages 24-29)
Playing Atari with Deep Reinforcement Learning (original DQN paper)
Lectures from David Silver or Emma Brunskill with lecture notes
Section 4.3 from Csaba Szepesvari’s RL monograph
Sutton and Barto chapters 9, 10, and 11
What are the potential problems with Q-learning when we introduce function approximation?
Why might experience replay improve the performance of DQN?
Is the DQN algorithm more similar to Q-learning or value iteration? Why?
Download and run the PyTorch DQN tutorial linked in the optional reading list to get an intuition for how the algorithm works.
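The three core pieces of this week (Q-learning with function approximation, experience replay, and $\varepsilon$-greedy exploration) can be sketched without a deep-learning framework. This toy uses linear function approximation over one-hot features of a hypothetical two-state MDP, with semi-gradient updates on minibatches drawn from a replay buffer; it is a sketch of the mechanism, not the DQN algorithm itself (no target network, no neural net).

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
random.seed(0)

# Hypothetical 2-state MDP: the chosen action is the next state,
# and action 1 pays (2.0 from state 1, 1.0 from state 0).
def step(s, a):
    return a, (2.0 if (s == 1 and a == 1) else float(a))

def features(s):
    return np.eye(2)[s]  # one-hot features; any featurizer would do

gamma, alpha, eps = 0.9, 0.05, 0.1
W = np.zeros((2, 2))       # Q(s, a) = W[a] @ features(s)
buffer = deque(maxlen=1000)  # experience replay buffer

s = 0
for t in range(5000):
    q = W @ features(s)
    a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))
    s2, r = step(s, a)
    buffer.append((s, a, r, s2))
    s = s2

    # sample a minibatch of past transitions and do a semi-gradient update
    batch = random.sample(list(buffer), min(len(buffer), 32))
    for (bs, ba, br, bs2) in batch:
        bx, bx2 = features(bs), features(bs2)
        target = br + gamma * np.max(W @ bx2)  # bootstrapped target
        td_error = target - W[ba] @ bx
        W[ba] += alpha * td_error * bx         # grad of Q w.r.t. W[ba] is bx

print(W)  # action 1 should have the higher Q value in both states
```

Note that the target `br + gamma * max(...)` is computed from the same weights being updated; DQN's target network exists precisely to stabilize this moving-target problem at scale.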
We have seen algorithms with provably good regret in the bandit setting thanks to adaptive exploration. And while the DQN algorithm we learned about last week has impressive performance on some tasks, it often fails to explore well enough since it relies on non-adaptive epsilon-greedy exploration. This week we will look at algorithms that extend the exploration ideas from the bandit setting to finite MDPs with tabular representations, getting us one step closer to the goal of scalable RL algorithms with adaptive exploration mechanisms.
Exploration in MDPs
Model-based interval estimation
Chapter 38 of the Bandit Algorithms textbook by Lattimore and Szepesvari (this reintroduces the MDP framework, gives the UCRL2 algorithm with a proof of its regret bound, and provides a regret lower bound)
Is Q-learning Provably Efficient? (a recent paper that uses a UCB-style bonus to make a variant of tabular Q-learning with a regret bound)
R-max: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning (an older paper with an intuitive algorithm)
How does the goal of the PAC-MDP framework differ from the regret framework? Can you give some examples of problems where one framework may be preferred?
Why does MBIE choose to keep confidence intervals over models? How does this compare to the bandit setting from the first week?
What are the benefits of the exploration bonus version of the algorithm (MBIE-EB) over the standard version (MBIE)?
What makes it difficult to scale algorithms like MBIE-EB to settings where we want to use neural networks instead of tabular representations?
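A compact sketch of the idea behind MBIE-EB: plan (via value iteration) on the empirical model with a $\beta/\sqrt{N(s,a)}$ bonus added to the estimated reward, and act greedily with respect to the resulting optimistic Q. The chain MDP, the constant $\beta$, and the handling of never-tried pairs (pinned at an optimistic ceiling, R-max style) are all illustrative choices, not the exact algorithm from the paper.

```python
import numpy as np

# Hypothetical chain MDP: action 1 moves right, action 0 moves left,
# and only entering the rightmost state pays reward. Undirected
# epsilon-greedy dithering struggles here; optimism does not.
n_states, n_actions, gamma, beta = 6, 2, 0.95, 0.5

def env_step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

N = np.zeros((n_states, n_actions))                 # visit counts
Rsum = np.zeros((n_states, n_actions))              # summed rewards
Ntrans = np.zeros((n_states, n_actions, n_states))  # transition counts

def optimistic_Q(n_iter=100):
    """Value iteration on the empirical model + beta/sqrt(N) reward bonus."""
    Np = np.maximum(N, 1.0)
    R_hat = Rsum / Np + beta / np.sqrt(Np)
    P_hat = Ntrans / Np[:, :, None]
    Q = np.zeros((n_states, n_actions))
    vmax = beta / (1.0 - gamma)  # optimistic ceiling for untried pairs
    for _ in range(n_iter):
        V = Q.max(axis=1)
        Q = R_hat + gamma * P_hat @ V
        Q[N == 0] = vmax  # never-tried pairs stay maximally optimistic
    return Q

s = 0
for t in range(2000):
    Q = optimistic_Q()
    a = int(np.argmax(Q[s]))  # purely greedy w.r.t. the optimistic model
    s2, r = env_step(s, a)
    N[s, a] += 1; Rsum[s, a] += r; Ntrans[s, a, s2] += 1
    s = s2

print(N.sum(axis=1))  # every state gets visited without random dithering
```

Because planning propagates the bonuses through the Bellman backup, the agent is pulled toward under-visited regions many steps away, which is exactly the "deep exploration" that per-step dithering like epsilon-greedy cannot provide, and the part that is hard to reproduce once the tabular counts $N(s,a)$ are replaced by a neural network.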