The most basic setting in which we must confront the exploration/exploitation tradeoff is the multi-armed bandit. This week we introduce the bandit problem and see how concentration inequalities are used to derive the upper confidence bound (UCB) algorithm, which achieves near-optimal worst-case regret.
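As a concrete instance of the concentration inequalities involved, Hoeffding's inequality states that for $n$ i.i.d. samples $X_1, \dots, X_n$ taking values in $[0, 1]$ with mean $\mu$, the empirical mean $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies

$$\Pr\big(|\hat{\mu}_n - \mu| \ge \epsilon\big) \le 2\exp(-2n\epsilon^2).$$

Inverting this bound (setting the right-hand side to a target failure probability $\delta$ and solving for $\epsilon$) yields a confidence radius of order $\sqrt{\log(2/\delta)/(2n)}$, which is exactly the form of the exploration bonus used by UCB.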
Multi-Armed Bandit (MAB)
Chapter 1 of the monograph Introduction to Multi-Armed Bandits by Aleksandrs Slivkins (pages 5-13)
Blog post about concentration inequalities by George Linderman
Chapters 7 and 8 of the Bandit Algorithms textbook by Lattimore and Szepesvári
Implement UCB and play with bandits in this notebook
Why do we need exploration?
Give an intuitive explanation for why optimism in the face of uncertainty works.
(Optional) Complete exercise 1.1 from Slivkins
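As a starting point for the notebook exercise, here is a minimal sketch of the UCB1 algorithm on a Bernoulli bandit. The function name `ucb1` and the simulation setup are illustrative, not taken from the notebook; the confidence radius $\sqrt{2\ln t / n_a}$ is the standard UCB1 choice for rewards in $[0, 1]$.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given true arm means.

    Returns the total reward collected and the per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of times each arm has been pulled
    totals = [0.0] * k    # cumulative reward observed per arm
    reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialize estimates
        else:
            # Optimism: pick the arm maximizing empirical mean + confidence radius.
            arm = max(
                range(k),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        r = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward, counts
```

Running `ucb1([0.2, 0.8], 5000)` should pull the better arm far more often: suboptimal arms are pulled only $O(\log T)$ times, which is what drives UCB's logarithmic regret.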