# End-to-end training of deep visuomotor policies (Guided Policy Search)

#### Motivation

This is what it’s all about! Guided policy search is one of the most efficient techniques for robotic control from vision with partially known environments. This week we’ll put it all together, showing how GPS combines trajectory optimization, imitation learning, and constrained optimization to find high-quality neural network policies with very little real-world experience.

#### Motivation

Guided policy search formulates its objective as a constrained optimization, minimizing the cost of its expert trajectories while guaranteeing that, at convergence, the expert trajectories and the neural network policy become identical. After this week you should understand the problem of constrained optimization and the specific technique, ADMM, used by GPS.

# Imitation learning

#### Motivation

Imitation learning is the subfield concerned with learning policies from expert demonstrations. Cascading errors and distribution mismatch are the main challenges in imitation learning. Guided policy search simultaneously trains a neural network policy to imitate expert trajectories and generates additional expert trajectories which stay close to the policy. After this week you should understand the challenges of designing an imitation learning algorithm such as GPS.

#### Motivation

Iterative linear quadratic regulation (iLQR) approximates the dynamics with a time-varying linear model and approximately solves the resulting trajectory optimization problem using an iterative algorithm. It enables optimal control via trajectory optimization for arbitrary environments where the dynamics are known or can be approximated. Guided policy search uses iLQR to find optimal guiding trajectories. After this week you should understand how iLQR solves nonlinear trajectory optimization problems.

# Linear trajectory optimization

#### Motivation

We’re now going to switch from talking about exploration to control and trajectory optimization. Whereas the first half of the course built up to “Unifying Count-Based Exploration and Intrinsic Motivation”, the second half will lead to “End-to-end training of deep visuomotor policies” and the method it proposes, Guided Policy Search.

Guided policy search (GPS) is a family of methods which combine optimal control with rich model-free policies. By leveraging models of the environment and privileged information during training, GPS has been used to learn policies that map directly from pixels to torques on real robots, marking one of the first successes of deep RL on a physical system.

Trajectory optimization uses a model of a system’s dynamics to choose the controls which minimize some cost. The linear quadratic regulator, or LQR, is the fundamental tool of trajectory optimization. Guided policy search uses iLQR, which is based on LQR, to find optimal guiding trajectories. After this week you should understand the problem of trajectory optimization and how LQR solves it for linear systems.
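
As a rough illustration of the LQR idea described above, here is a minimal sketch for a scalar linear system $x_{t+1} = a x_t + b u_t$ with cost $\sum_t (q x_t^2 + r u_t^2) + q_f x_T^2$. The function names and all numeric values are assumptions chosen for illustration, not taken from the course material. The backward Riccati recursion produces time-varying feedback gains $K_t$, and the controls are $u_t = -K_t x_t$.

```python
def lqr_gains(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for a scalar finite-horizon LQR problem.

    Returns the feedback gains K_0..K_{T-1} and the cost-to-go coefficient
    P_0, so that the optimal cost from state x0 is P_0 * x0**2.
    """
    p = q_final  # cost-to-go coefficient at the final time step
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)   # gain minimizing the one-step cost-to-go
        p = q + a * a * p - a * b * p * k   # Riccati update
        gains.append(k)
    gains.reverse()  # gains were computed from the last step backwards
    return gains, p


def rollout_cost(a, b, q, r, q_final, x0, gains):
    """Simulate u_t = -K_t * x_t and accumulate the quadratic cost."""
    x, cost = x0, 0.0
    for k in gains:
        u = -k * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost + q_final * x * x


# Hypothetical unstable system (a > 1) with a cheap control penalty.
params = dict(a=1.2, b=1.0, q=1.0, r=0.1, q_final=1.0)
gains, p0 = lqr_gains(horizon=20, **params)
optimal_cost = rollout_cost(x0=1.0, gains=gains, **params)
```

In this sketch the simulated cost of the LQR controller should match `p0 * x0**2`, and any other choice of gains should cost at least as much.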

# New directions in exploration

#### Motivation

Last week we saw one approach to scaling up exploration. This week we will conclude our section on exploration with a brief tour of some other scalable exploration approaches introduced in the last few years. Rather than aiming for a deep dive on any one direction, we will try to get a high-level idea of the pros and cons of several approaches and provide the relevant references if you want to learn more.

# Deep RL with principled exploration

#### Motivation

We have seen provably efficient exploration in small MDPs, but this requires keeping track of independent estimates of a model or Q function at every state and action. To scale up the algorithms to large state spaces we need to find a way to avoid this sort of tabular representation. This week we will look at one of the first papers that was able to effectively scale up a UCB-style exploration bonus to the deep RL setting of large MDPs.

# UCB in tabular RL

#### Motivation

We have seen algorithms that can have provably good performance in terms of regret in the bandit setting by adaptively exploring. And while the DQN algorithm we learned about last week has impressive performance on some tasks, it often fails to explore well enough since it relies on non-adaptive epsilon-greedy exploration. This week we will look at algorithms that extend the exploration ideas from the bandit setting to finite MDPs with tabular representations, getting us one step closer to the goal of scalable algorithms for RL with adaptive exploration mechanisms.

# Deep value-based RL and DQN

#### Motivation

In the introduction, we saw value-based RL algorithms (and specifically Q-learning) in the tabular setting where we keep a separate Q value for each $s,a$ pair. If we want to scale to large state spaces we will need to be able to generalize across an infinite state space using a function approximator, like a neural network. This week we will see how Q-learning can be modified to support function approximation and read the influential paper from DeepMind introducing the deep Q network (DQN) algorithm.
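
For reference, the tabular starting point can be sketched as a single Q-learning update on a table keyed by $(s, a)$. This is a minimal illustration under assumed names and hyperparameters (`q_learning_update`, `alpha`, `gamma` are illustrative choices), not the DQN algorithm itself; DQN replaces the table with a neural network.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    if done:
        target = r  # no bootstrapping from a terminal state
    else:
        # bootstrap from the greedy value of the next state
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# One separate entry per (state, action) pair, initialized to zero.
Q = defaultdict(float)
new_value = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

Because every $(s, a)$ pair has its own entry, nothing learned for one state transfers to another, which is the limitation function approximation is meant to address.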

# Bandits and the Upper Confidence Bound algorithm

#### Motivation

The most basic setting where we need to consider the exploration/exploitation tradeoff is the multi-armed bandit. This week we will introduce the bandit problem and see how concentration inequalities are used to derive the upper confidence bound algorithm, which has near-optimal worst-case regret.
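
To make the idea concrete, here is a minimal sketch of the UCB1 variant of the algorithm, assuming rewards in $[0, 1]$. The helper name `ucb1`, the Bernoulli arm probabilities, and the horizon are hypothetical values chosen for illustration. Each round picks the arm with the largest empirical mean plus a confidence radius of $\sqrt{2 \ln t / n_a}$, the kind of bound a Hoeffding-style concentration inequality gives.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds; `pull(arm)` must return a reward in [0, 1]."""
    counts = [0] * n_arms    # number of times each arm has been pulled
    means = [0.0] * n_arms   # empirical mean reward of each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once so the bonus is well defined
        else:
            # optimism: empirical mean plus confidence radius
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


rng = random.Random(0)
probs = [0.2, 0.5, 0.8]  # hypothetical Bernoulli arm means
counts, means = ucb1(
    lambda a: 1.0 if rng.random() < probs[a] else 0.0,
    n_arms=len(probs),
    horizon=5000,
)
```

Arms that are rarely pulled keep a large bonus, so the exploration is adaptive: the algorithm returns to an arm only until its confidence interval rules it out.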