The most basic setting in which we must confront the exploration/exploitation tradeoff is the multi-armed bandit. This week we introduce the bandit problem and see how concentration inequalities are used to derive the upper confidence bound (UCB) algorithm, which achieves near-optimal worst-case regret.
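As a concrete instance of the concentration inequalities involved, Hoeffding's inequality states that for $n$ i.i.d. samples $X_1, \dots, X_n$ taking values in $[0, 1]$ with mean $\mu$, the empirical mean $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies

$$\Pr\big(|\hat{\mu}_n - \mu| \ge \epsilon\big) \le 2\exp(-2n\epsilon^2).$$

Inverting this bound (setting the right-hand side to a target failure probability $\delta$ and solving for $\epsilon$) yields a confidence radius of order $\sqrt{\log(2/\delta)/(2n)}$, which is exactly the form of the exploration bonus used by UCB.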
Multi-Armed Bandit (MAB)
Chapter 1 of the monograph Introduction to Multi-Armed Bandits by Aleksandrs Slivkins (pages 5-13)
Blog post about concentration inequalities by George Linderman
Chapters 7 and 8 of the Bandit Algorithms textbook by Lattimore and Szepesvári
Implement UCB and play with bandits in this notebook
Why do we need exploration?
Give an intuitive explanation for why optimism in the face of uncertainty works.
(Optional) Complete exercise 1.1 from Slivkins
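As a starting point for the notebook exercise, here is a minimal sketch of the UCB1 algorithm on a Bernoulli bandit. The function name `ucb1` and the simulation setup are illustrative, not taken from the notebook; the confidence radius $\sqrt{2\ln t / n_a}$ is the standard UCB1 choice for rewards in $[0, 1]$.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given true arm means.

    Returns the total reward collected and the per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of times each arm has been pulled
    totals = [0.0] * k    # cumulative reward observed per arm
    reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialize estimates
        else:
            # Optimism: pick the arm maximizing empirical mean + confidence radius.
            arm = max(
                range(k),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        r = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward, counts
```

Running `ucb1([0.2, 0.8], 5000)` should pull the better arm far more often: suboptimal arms are pulled only $O(\log T)$ times, which is what drives UCB's logarithmic regret.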