We’re now going to switch from exploration to control and trajectory optimization. Whereas the first half of the course built up to “Unifying Count-Based Exploration and Intrinsic Motivation”, the second half will lead to “End-to-End Training of Deep Visuomotor Policies” and the method it proposes, Guided Policy Search.
Guided policy search (GPS) is a family of methods which combine optimal control with rich model-free policies. By leveraging models of the environment and privileged information during training, GPS has been used to learn policies that map directly from pixels to torques on real robots, marking one of the first successes of deep RL on a physical system.
Trajectory optimization uses a model of a system’s dynamics to choose the controls that minimize some cost. The linear quadratic regulator, or LQR, is the fundamental tool of trajectory optimization. Guided policy search uses iLQR, an iterative extension of LQR to nonlinear dynamics, to find optimal guiding trajectories. After this week you should understand the problem of trajectory optimization and how LQR solves it for linear systems.
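To make the problem concrete: in the discrete-time, finite-horizon setting, LQR picks controls to minimize a quadratic cost (a sum of x^T Q x state terms and u^T R u control terms) subject to linear dynamics x' = A x + B u, and the optimal controller turns out to be a time-varying linear feedback u_t = -K_t x_t computed by a backward Riccati recursion. Below is a minimal sketch of that recursion in numpy; the double-integrator dynamics and cost weights are illustrative assumptions, not a system from the course materials.

```python
# A minimal sketch of finite-horizon, discrete-time LQR via the backward
# Riccati recursion. The double-integrator dynamics and cost weights are
# illustrative assumptions, not a system from the course materials.
import numpy as np

def lqr_backward_pass(A, B, Q, R, Q_final, horizon):
    """Return time-varying gains K_t so that u_t = -K_t x_t minimizes
    sum_t (x_t^T Q x_t + u_t^T R u_t) + x_T^T Q_final x_T
    subject to x_{t+1} = A x_t + B u_t."""
    P = Q_final          # cost-to-go matrix at the final time
    gains = []
    for _ in range(horizon):
        # Optimal gain for this step, then propagate the cost-to-go backward.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]   # gains[t] is the feedback matrix to apply at time t

# Example: a double integrator (state = position and velocity, control = force).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), 0.1 * np.eye(1)

gains = lqr_backward_pass(A, B, Q, R, Q_final=10.0 * np.eye(2), horizon=50)

# Roll the closed-loop system forward from an initial displacement.
x = np.array([1.0, 0.0])
for K in gains:
    x = A @ x + B @ (-K @ x)
print("final state:", x)  # should be driven close to the origin
```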
Trajectory optimization
Linear dynamical systems
Linear Quadratic Regulator (LQR)
Introduction to control as optimization from Russ’s book (includes the Hamilton-Jacobi-Bellman equation)
What is trajectory optimization?
Play with the notebooks in Russ’s Example 8.2.
What kinds of systems are linear? What would happen if you ran LQR on a nonlinear system?
Is there a way to adapt LQR to more complex dynamics? (One common starting point, linearization, is sketched at the end of this section.)
Autonomous Helicopter Aerobatics through Apprenticeship Learning by Pieter Abbeel, Adam Coates, and Andrew Ng
Quadratic approximate dynamic programming for input-affine systems
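One standard answer to the questions above: if the system is nonlinear, you can linearize its dynamics around a fixed point and stabilize that point with the LQR gain computed for the linearized model. This works near the fixed point and degrades as the state moves away, which is the gap iLQR closes by re-linearizing along an entire trajectory. The sketch below applies the idea to a simple inverted pendulum; the model, constants, and cost weights are illustrative assumptions, not taken from Russ’s examples.

```python
# A minimal sketch of LQR applied to a nonlinear system by linearizing around
# a fixed point. The pendulum model, timestep, and cost weights below are
# illustrative assumptions, not taken from the assigned readings.
import numpy as np
from scipy.linalg import solve_discrete_are

g, L, dt = 9.81, 1.0, 0.01  # gravity, pendulum length, Euler timestep

def pendulum_step(x, u):
    """True nonlinear dynamics; theta is measured from the upright position."""
    theta, theta_dot = x
    theta_ddot = (g / L) * np.sin(theta) + u
    return np.array([theta + dt * theta_dot, theta_dot + dt * theta_ddot])

# Linearize about the upright fixed point (theta = 0, u = 0): sin(theta) ~ theta,
# so the discretized dynamics become x' = A x + B u.
A = np.array([[1.0, dt], [dt * g / L, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), 0.1 * np.eye(1)

# Infinite-horizon LQR gain from the discrete algebraic Riccati equation.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# The linear controller stabilizes the true nonlinear pendulum near upright.
x = np.array([0.2, 0.0])  # small perturbation from the fixed point
for _ in range(1000):
    x = pendulum_step(x, (-K @ x).item())
print("state after 10 seconds:", x)  # should be driven back near upright

# Far from the fixed point the linearization is a poor model and the same gain
# can fail; iLQR handles this by re-linearizing around a nominal trajectory.
```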