This is what it’s all about! Guided policy search is one of the most sample-efficient techniques for robotic control from vision in partially known environments. This week we’ll put it all together, showing how GPS combines trajectory optimization, imitation learning, and constrained optimization to find high-quality neural network policies with very little real-world experience.
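As a rough statement of the underlying constrained problem (notation varies across the GPS papers, so treat this as a sketch rather than the exact objective from any one of them): GPS searches over a trajectory distribution $p$ and policy parameters $\theta$ for

$$
\min_{p,\,\theta}\ \mathbb{E}_{p(\tau)}\Big[\sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t)\Big]
\quad\text{s.t.}\quad p(\mathbf{u}_t \mid \mathbf{x}_t) = \pi_\theta(\mathbf{u}_t \mid \mathbf{o}_t)\ \ \text{for all } t,
$$

where $\mathbf{x}_t$ is the full state available during training, $\mathbf{o}_t$ is the observation (e.g. an image) the policy will actually see, and $c$ is the cost. The constraint is relaxed with Lagrange multipliers (or a KL penalty, depending on the variant), and the relaxed objective is optimized by alternating three steps: trajectory optimization over $p$ with the policy-agreement term folded into the cost, supervised (imitation) training of $\pi_\theta$ on samples from $p$, and an update of the multipliers. A toy end-to-end sketch of this loop follows the topic list below.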
Guided Policy Search (GPS)
Trajectory optimization with unknown dynamics
Asymmetric imitation learning
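To make the alternation concrete, here is a minimal, self-contained sketch on a scalar linear system. Everything in it is invented for illustration (the system, the noise levels, names like `theta` and `rho`), and it simplifies real GPS heavily: the trajectory optimizer is a plain finite-horizon LQR on least-squares-fitted dynamics, the “constraint” is just a quadratic penalty whose weight is doubled each iteration rather than a proper Lagrangian or KL update, and the “vision” policy is a linear map from a noisy observation of the state. It is meant only to show how model fitting, trajectory optimization, asymmetric imitation, and the agreement penalty fit together in one loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Unknown" true dynamics: x_{t+1} = a*x + b*u + noise; the learner only sees samples from step().
A_TRUE, B_TRUE, DYN_NOISE = 1.2, 0.5, 0.05
T, Q, R = 20, 1.0, 0.1                       # horizon and quadratic cost weights q*x^2 + r*u^2

def step(x, u):
    return A_TRUE * x + B_TRUE * u + DYN_NOISE * rng.standard_normal()

def rollout(gains, x0=1.0, explore=0.0):
    """Run the time-varying linear controller u_t = k_t * x_t (plus optional exploration noise)."""
    xs, us = [x0], []
    for k in gains:
        u = k * xs[-1] + explore * rng.standard_normal()
        us.append(u)
        xs.append(step(xs[-1], u))
    return np.array(xs), np.array(us)

def fit_dynamics(xs_list, us_list):
    """Least-squares fit of x_{t+1} ~ a_hat*x_t + b_hat*u_t from the collected rollouts."""
    X = np.concatenate([np.stack([xs[:-1], us], axis=1) for xs, us in zip(xs_list, us_list)])
    y = np.concatenate([xs[1:] for xs in xs_list])
    a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return a_hat, b_hat

def traj_opt(a, b, theta, rho):
    """Finite-horizon LQR on the fitted dynamics, with an extra penalty rho*(u - theta*x)^2
    that keeps the optimized trajectory close to what the current policy would do."""
    q_aug, r_aug, n = Q + rho * theta**2, R + rho, -rho * theta   # cost: q x^2 + r u^2 + 2 n x u
    P, gains = Q, []
    for _ in range(T):                       # backward Riccati recursion (with cross term n)
        denom = r_aug + b**2 * P
        K = (a * b * P + n) / denom          # optimal control is u = -K * x
        P = q_aug + a**2 * P - (a * b * P + n)**2 / denom
        gains.append(-K)
    return gains[::-1]

theta, rho, gains = 0.0, 0.1, [0.0] * T      # linear policy u = theta*o, penalty weight, controller
for _ in range(10):                          # GPS-style outer loop
    # 1) Collect a handful of real rollouts and fit a dynamics model (unknown dynamics).
    data = [rollout(gains, explore=0.3) for _ in range(5)]
    a_hat, b_hat = fit_dynamics(*zip(*data))
    # 2) Trajectory optimization (the state-based "teacher"), pulled toward the current policy.
    gains = traj_opt(a_hat, b_hat, theta, rho)
    # 3) Asymmetric imitation: the teacher used the true state, but the policy is fit on a
    #    noisy observation of it (standing in for images).
    xs_list, us_list = zip(*[rollout(gains) for _ in range(5)])
    obs = np.concatenate([xs[:-1] for xs in xs_list]) + 0.1 * rng.standard_normal(5 * T)
    acts = np.concatenate(us_list)
    theta = float(obs @ acts / (obs @ obs))  # least-squares fit of u ~ theta * o
    # 4) Tighten the agreement penalty (a crude stand-in for the dual update in real GPS).
    rho *= 2.0

print("learned observation-feedback gain:", theta)
```

In the real algorithm the teacher is a set of local time-varying linear-Gaussian controllers (one per initial condition) and the policy is a convolutional network trained on raw images, but the control flow is the same as in this toy.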
Sergey Levine’s lecture on GPS, which discusses some simpler alternatives and the problems with them (video)
Asymmetric Actor Critic for Image-Based Robot Learning, a paper that takes the “asymmetric supervision” idea from GPS and applies it to model-free RL in simulation
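The “asymmetric” trick itself is easy to state in code. Below is a tiny, hypothetical sketch (PyTorch, with invented shapes and layer sizes, not the architecture from the paper): during training in simulation the critic is given the full low-dimensional state, while the actor only ever receives the image observation, so the deployed policy needs nothing the real robot cannot sense.

```python
# Hypothetical sketch of asymmetric supervision (invented shapes/names, not the paper's code):
# the critic consumes the privileged simulator state, the actor consumes only pixels.
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM = 10, 4            # privileged low-dimensional state (simulation only)
IMG_SHAPE = (3, 64, 64)               # what the deployed policy actually observes

actor = nn.Sequential(                # policy: image -> action (usable on the real robot)
    nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 16, 5, stride=2), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(ACT_DIM), nn.Tanh(),
)
critic = nn.Sequential(               # value function: privileged state -> value (training only)
    nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1),
)

img = torch.zeros(1, *IMG_SHAPE)      # dummy batch: one image observation...
state = torch.zeros(1, STATE_DIM)     # ...and the matching full state from the simulator
action, value = actor(img), critic(state)
print(action.shape, value.shape)      # torch.Size([1, 4]) torch.Size([1, 1])
```

The same split is what lets GPS train a vision-based policy from a state-based trajectory optimizer.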
How does GPS use each of the components we’ve discussed (LQR, imitation, constrained optimization)?
What advantages does the trained neural net policy have over the trajectory optimizer?
How does this paper propose to deal with unknown dynamics? When will this strategy work well?
How does GPS learn with so few samples?
What are the limitations of this method?