# Parallel curriculum on reinforcement learning

Fridays 11am-12:15pm in CIWW 102.

# Intro to Reinforcement Learning (covered in the intro lecture on 1/30)

Lecture notes

#### Topics

• Sequential decision making under uncertainty

• Markov Decision Processes (MDP)

• Challenges: credit assignment, exploration

• Policy search:

  • Ways to represent policies

  • Ways to search for policies

• Value estimation:

  • Value function and Q function

  • Bellman equation and Bellman optimality equation

  • Dynamic programming and value iteration (see the sketch after this list)

  • TD learning and Q-learning

• Policy iteration
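
To make the value-estimation ideas above concrete, here is a minimal sketch of value iteration on a small, made-up tabular MDP. The transition matrix, rewards, and discount factor below are purely illustrative, not from the lecture notes; the update is the Bellman optimality backup $V(s) \leftarrow \max_a \bigl[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \bigr]$.

```python
import numpy as np

# Illustrative 3-state, 2-action MDP (hypothetical numbers).
# P[a, s, s'] = probability of moving from s to s' under action a.
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.2, 0.8, 0.0],   # action 1
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 0.0, 1.0],   # R[a, s]: expected reward for taking a in s
              [0.0, 0.5, 0.0]])
gamma = 0.9

V = np.zeros(3)
for _ in range(1000):
    # Bellman optimality backup: Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the values have converged
        break
    V = V_new

policy = Q.argmax(axis=0)   # greedy policy with respect to the converged values
print("V* ≈", V.round(3), "greedy policy:", policy)
```

Q-learning replaces this model-based backup with the sample-based update $Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \bigr]$, which needs no access to $P$ or $R$.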

# Unifying count-based exploration and intrinsic motivation

#### Overview

The paper we are targeting in this section of the curriculum extends traditional exploration techniques to the deep learning setting. We will cover the basics of exploration algorithms and value-based deep RL and then see how the paper is able to combine the two.

## Bandits and the upper confidence bound (UCB) algorithm (2/7)

#### Motivation

The most basic setting where we need to consider the exploration/exploitation tradeoff is the multi-armed bandit problem. This week we will introduce the bandit problem and see how concentration inequalities are used to derive the upper confidence bound (UCB) algorithm, which achieves near-optimal worst-case regret.
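
As a rough preview (separate from the notebook referenced in the questions below), here is a minimal sketch of UCB1 on a Bernoulli bandit; the arm means, horizon, and seed are made up for illustration. Each arm's score is its empirical mean plus the bonus $\sqrt{2 \ln t / n_a}$, so rarely pulled arms look optimistic and get tried.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli arm means
K, T = len(true_means), 10_000

counts = np.zeros(K)   # n_a: number of pulls of each arm
sums = np.zeros(K)     # running sum of observed rewards per arm

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                                   # pull each arm once to initialize
    else:
        means = sums / counts
        bonus = np.sqrt(2.0 * np.log(t) / counts)     # optimism in the face of uncertainty
        arm = int(np.argmax(means + bonus))
    reward = rng.random() < true_means[arm]           # Bernoulli reward
    counts[arm] += 1
    sums[arm] += reward

regret = T * true_means.max() - sums.sum()
print("pulls per arm:", counts, "empirical regret:", round(float(regret), 1))
```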

#### Topics

• Multi-Armed Bandit (MAB)

• Concentration inequalities

• UCB algorithm

#### Readings

• Chapter 1 of this monograph on bandits by Aleksandrs Slivkins (pages 5-13)

• Blog post about concentration inequalities by George Linderman

#### Questions

• Implement UCB and play with bandits in this notebook

• Why do we need exploration?

• Give an intuitive explanation for why optimism in the face of uncertainty works.

• (Optional) Complete exercise 1.1 from Slivkins

## Deep value-based RL (and DQN) (2/14)

#### Motivation

In the introduction, we saw value-based RL algorithms (specifically Q-learning) in the tabular setting, where we keep a separate Q value for each $s, a$ pair. To scale to large or infinite state spaces, we need to generalize across states with a function approximator such as a neural network. This week we will see how Q-learning can be modified to support function approximation and read the influential DeepMind paper introducing the deep Q-network (DQN) algorithm.
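
As a hedged sketch (not the DeepMind implementation), the core DQN update in PyTorch might look like the following. The network sizes, hyperparameters, and plain-list replay buffer are placeholders, and details such as frame preprocessing and the schedule for refreshing the target network are omitted.

```python
import random
import torch
import torch.nn as nn

# Small fully connected Q-network; input/output sizes here are placeholders.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

replay = []   # list of (state, action, reward, next_state, done) tuples


def epsilon_greedy(state, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(2)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())


def train_step(batch_size=32):
    """One DQN-style update on a minibatch sampled from the replay buffer."""
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(lambda x: torch.as_tensor(x, dtype=torch.float32),
                            zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)            # Q(s, a)
    with torch.no_grad():                                               # TD target uses the
        target = r + gamma * (1 - done) * target_net(s2).max(1).values  # frozen target net
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key moving parts are the replay buffer, which breaks correlations between consecutive transitions, and the frozen target network, which keeps the regression target from chasing the network being trained.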

#### Topics

• Q-learning with function approximation

• Experience replay

• $\varepsilon$-greedy exploration

#### Questions

• What are the potential problems with Q-learning when we introduce function approximation?

• Why might experience replay improve the performance of DQN?

• Is the DQN algorithm more similar to Q-learning or value iteration? Why?

• Download and run the PyTorch DQN tutorial linked in the optional reading list to get an intuition for how the algorithm works.

## UCB in tabular RL

#### Motivation

We have seen algorithms with provably good regret in the bandit setting thanks to adaptive exploration. And while the DQN algorithm we learned about last week performs impressively on some tasks, it often fails to explore well enough because it relies on non-adaptive $\varepsilon$-greedy exploration. This week we will look at algorithms that extend the exploration ideas from the bandit setting to finite MDPs with tabular representations, getting us one step closer to the goal of scalable RL algorithms with adaptive exploration mechanisms.
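
As a rough illustration of the idea (a sketch, not the exact model-based interval estimation algorithm), the snippet below augments tabular Q-learning with an MBIE-EB-style bonus $\beta / \sqrt{N(s,a)}$, so rarely visited state-action pairs look more valuable. The Gym-style `env` interface, bonus coefficient, and learning rate are assumed placeholders.

```python
import numpy as np
from collections import defaultdict

def q_learning_with_count_bonus(env, episodes=500, alpha=0.1, gamma=0.99, beta=0.1):
    """Tabular Q-learning with an MBIE-EB-style bonus beta / sqrt(N(s, a)).

    Assumes a Gym-style env with reset()/step(), hashable (tabular) states, and a
    discrete action space; every hyperparameter here is an illustrative placeholder.
    """
    Q = defaultdict(float)   # Q[(s, a)]: value estimates
    N = defaultdict(int)     # N[(s, a)]: visit counts
    n_actions = env.action_space.n

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Act greedily with respect to the bonus-augmented values.
            a = max(range(n_actions),
                    key=lambda b: Q[(s, b)] + beta / np.sqrt(N[(s, b)] + 1))
            s2, r, done, _ = env.step(a)
            N[(s, a)] += 1
            # Standard Q-learning backup on the bonus-augmented reward.
            bonus = beta / np.sqrt(N[(s, a)])
            best_next = 0.0 if done else max(Q[(s2, b)] for b in range(n_actions))
            Q[(s, a)] += alpha * (r + bonus + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```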

#### Topics

• Exploration in MDPs

• Model-based interval estimation