Carnegie Mellon University
Browse

Multi-Armed Bandits with Probes and Knapsacks: Algorithms, Theory and Applications

Download (8.56 MB)
thesis
posted on 2025-05-22, 19:48 authored by Eray Can ElumarEray Can Elumar

The multi-armed bandit (MAB) framework is a fundamental paradigm in decision theory and machine learning, in which a decision-maker sequentially selects one of K available actions at each round, with the objective of maximizing the cumulative reward over time. The main challenge lies in balancing exploration of new actions with exploitation of known ones—an essential trade off for making optimal, data-driven decisions in complex, uncertain environments with limited information, such as clinical trials, online advertising, and recommendation systems. Motivated by these challenges and applications, we propose and analyze two different multi-armed bandit models with broad practical relevance, which are i) bandits with probes, and ii) bandits with anytime knapsacks. Additionally, we focus on utilizing the multi-armed bandit framework in iii) dataset labeling with LLMs (large language models).

First, we study the MAB model with probes, where before pulling an arm, the decision-maker is allowed to probe one arm at an associated cost to observe its reward for that round. Based on the probe’s outcome, the decision-maker can choose to pull the probed arm or any other arm. Alternatively, an arm may be pulled directly without employing the probe. This variation is of particular interest due to its extensive range of applications, including online learning augmented with machine learning advice, hyper-parameter optimization for machine learning models, and wireless communications. The introduction of the probing mechanism significantly expands the action space, adding complexity to the problem, as the decision-maker must judiciously balance exploration and exploitation while also considering whether to probe an arm. To address these challenges, we propose a novel algorithm that efficiently determines the optimal action in each round to maximize the cumulative reward. Through rigorous analysis, we derive a theoretical gap-independent regret bound on the order of O(√ T), and a gap-dependent regret bound on the order of O(logT); and demonstrate the empirical performance of our algorithm through simulation results.

Secondly, we study the MAB model with anytime knapsacks. This model is similar to the bandits with knapsacks (BwK) problem, however, there is an anytime constraint on the average cost instead of a fixed total cost budget. This formulation finds broad applications in areas including inventory management, online advertising, and portfolio management. We propose an algorithm that employs upper confidence bounds to efficiently explore the decision space while strategically under-utilizing the available budget to reduce the number of rounds that need to be skipped to satisfy the anytime cost constraint. We derive theoretical gap-dependent regret bounds and demonstrate that, despite the increased constraints relative to the standard BwK model, our formulation achieves the same gap-
dependent regret bounds of O(log T). We provide simulation results to corroborate the theoretical findings and illustrate the empirical performance of the proposed algorithm.


Lastly, to illustrate the practical utility of the MAB framework in the newest problem settings, we examine the task of labeling a dataset using different LLMs. In this scenario, we have access to various LLMs, each with distinct costs and capabilities, and a model capable of extracting contextual information from individual data instances. We propose an algorithm that leverages this context information to select a subset of LLMs, as opposed to traditional weighted majority voting methods that query all available LLMs for each label. We demonstrate the utility of our proposed algorithm in this problem setting with simulation results.

Funding

CIF: Small: Efficient Sequential Decision-Making and Inference in the Small Data Regime

Directorate for Computer & Information Science & Engineering

Find out more...

CIF: Small: Designing Secure, Reliable, and Resilient Wireless Sensor Networks

Directorate for Computer & Information Science & Engineering

Find out more...

History

Date

2025-05-06

Degree Type

  • Dissertation

Thesis Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Osman Yagan

Usage metrics

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC