Structured and Correlated Multi-Armed Bandits: Algorithms, Theory and Applications
Multi-Armed bandit (MAB) framework is a widely used sequential decision making framework in which a decision-maker needs to select one of the available K actions in each round, with the objective of maximizing their long-term reward. This framework has been used in practice for several applications including web advertising, medical testing by viewing the use of different ads/treatments as the arms in the MAB problem. The user’s response corresponding to these different actions generates a reward for the decision-maker. Under the classical MAB framework, it is implicitly assumed that the rewards corresponding to different actions are independent of each other. But, this may not be the case in practice as rewards corresponding to different actions (i.e., ads/drugs) are likely to be correlated. Motivated by this, we study the structured and correlated MAB problem in this thesis.
First, we study the structured MAB problem where the mean rewards corresponding to different actions are a known function of a hidden parameter θ∗, thereby imposing a structured on the mean rewards of different actions. We study this problem in the most general form by imposing no restriction on the form of mean reward functions and as a result subsume the setting of several previously studied structured bandit frameworks where the mean reward function is assumed to be of a specific form. While mean rewards of different actions may be related to one another in the structured bandit setup, the reward realizations may not necessarily be correlated.
Motivated by this, we propose a novel correlated MAB framework which explicitly captures the correlation in reward across different actions. For both these frameworks, we design novel algorithms that allow us to extend any classical bandit algorithm to the structured and correlated bandit settings. Through rigorous analysis, we show that our proposed algorithms sample certain sub-optimal actions, termed as non-competitive actions, only O(1) times as opposed to the typical O(log T) samples required by classical algorithms such as Upper Confidence Bound (UCB), Thompson sampling. These significant theoretical performance gains are reflected in our experiments performed on real-world recommendation system datasets such as Movielens, Goodreads. For our proposed correlated bandit framework, we also design best-arm identification algorithms where the task is to identify the best action in as few samples as possible. We demonstrate the achieved performance gains theoretically through sample complexity analysis and empirically through experiments on recommendation system datasets.
To further demonstrate the utility of our proposed correlated bandit framework, we show how the framework can be employed to solve online resource allocation problems, which frequently arise in tasks such as power allocation in wireless systems, financial optimization and multi-server scheduling. This is done by extending our correlated bandit framework and algorithms to the setting of online resource allocation. The performance gains are demonstrated theoretically through regret analysis and empirically through synthetic experiments on the task of online power allocation in wireless systems, job scheduling in multi-server systems and channel assignment under the ALOHA protocol.
DepartmentElectrical and Computer Engineering
- Doctor of Philosophy (PhD)