Bandit Methods and Selective Prediction in Deep Learning

2020-06-29T21:56:24Z (GMT) by Weicheng Ye
This thesis consists of three parts involving two areas of machine learning: deep learning and reinforcement learning. Stochastic Lipschitz bandit algorithms govern exploration-exploitation trade-offs and have been used in a variety of important task domains, including zeroth-order optimization. In the first part, we present a framework for Lipschitz bandit methods that adaptively learns partitions of the context and arm spaces. Owing to this flexibility, the algorithm can efficiently optimize rewards and minimize regret by focusing on the most relevant portions of the space. Hyperparameter tuning is a crucial problem in deep learning that arises in every model: although deep neural networks with high capacity can fit large-scale data well, they depend on many hyperparameters. Our algorithms achieve state-of-the-art performance on challenging real-world tasks such as neural network hyperparameter tuning.

Model-free reinforcement learning algorithms, such as Q-learning, require less space and are more flexible to use than model-based approaches. An Upper Confidence Bound (UCB) exploration policy improves the exploration bonuses in the Q-learning framework. In the second part of this work, we study the regret bound of Q-learning with UCB exploration on a compact state-action metric space. We develop a Q-learning algorithm with a UCB exploration policy that adaptively discretizes the continuous state-action space and iteratively updates Q-values; the algorithm efficiently optimizes rewards and minimizes cumulative regret. We provide a rigorous regret analysis based on concentration-of-measure and combinatorial arguments. To the best of our knowledge, this is the first result of its kind.

Data gathered from real-world applications often suffer from corruption. Low-quality data hinder a learning system's classification accuracy, model-building time, and the interpretability of the resulting classifier. Selective prediction, also known as prediction with a reject option, reduces the error rate by abstaining from prediction under uncertainty while keeping coverage as high as possible. In the third part of this work, we propose threshold-learning algorithms, integrated with selective prediction, that estimate the intrinsic rejection rate of a dataset. Correspondingly, we provide a rigorous framework that generalizes the estimation of the data corruption rate. To leverage the advantages of multiple learning methods, we extend the algorithms to a hierarchical two-stage system. Our methods are noise-robust, intelligible, and compatible with any network architecture. Empirical results show that our algorithms achieve state-of-the-art performance on real-world datasets across many classification and regression problems.
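To make the exploration-exploitation trade-off behind the bandit methods concrete, here is a minimal sketch of the classic UCB1 algorithm on finitely many arms. This is textbook UCB, not the thesis's adaptive-partition Lipschitz bandit; the function names and parameters are illustrative assumptions.

```python
import math
import random

def ucb1(reward_fns, horizon):
    """Classic UCB1 on a finite arm set: pull the arm maximizing the
    empirical mean plus a confidence bonus. Illustrative only -- the
    thesis's Lipschitz bandit algorithm additionally partitions a
    continuous context/arm space adaptively."""
    n_arms = len(reward_fns)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once to initialize
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = reward_fns[arm]()  # observe a stochastic reward
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
        total_reward += r
    return total_reward, counts
```

On Bernoulli arms with distinct means, the confidence bonus shrinks as an arm is pulled more, so play concentrates on the best arm while suboptimal arms are revisited only occasionally.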
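The second part's idea of combining Q-learning, a UCB bonus, and discretization of a continuous state space can be sketched as follows. This toy version uses a *fixed* uniform grid over states in [0, 1) and a stage-indexed optimistic Q-table, whereas the thesis's algorithm refines the discretization adaptively; all names, the bonus form, and the learning-rate schedule are simplified illustrative assumptions.

```python
import math
import random

def ucb_q_learning(env_reset, env_step, n_bins, n_actions, H, episodes, c=1.0):
    """Episodic Q-learning with a UCB-style exploration bonus on a uniform
    discretization of a continuous state space in [0, 1). A sketch only:
    the thesis's algorithm adaptively discretizes the state-action space."""
    # Stage-indexed optimistic Q-values: Q[h][bin][action], initialized to H.
    Q = [[[float(H)] * n_actions for _ in range(n_bins)] for _ in range(H)]
    V = [[float(H)] * n_bins for _ in range(H)] + [[0.0] * n_bins]  # V[H] = 0
    N = [[[0] * n_actions for _ in range(n_bins)] for _ in range(H)]

    def bin_of(s):  # map a continuous state in [0, 1) to a grid cell
        return min(int(s * n_bins), n_bins - 1)

    for _ in range(episodes):
        s = env_reset()
        for h in range(H):
            b = bin_of(s)
            a = max(range(n_actions), key=lambda x: Q[h][b][x])  # greedy in optimistic Q
            s2, r = env_step(s, a)
            N[h][b][a] += 1
            t = N[h][b][a]
            alpha = (H + 1) / (H + t)                   # decaying learning rate
            bonus = c * math.sqrt(math.log(t + 1) / t)  # UCB exploration bonus
            Q[h][b][a] += alpha * (r + bonus + V[h + 1][bin_of(s2)] - Q[h][b][a])
            V[h][b] = min(float(H), max(Q[h][b]))       # clipped optimistic value
            s = s2
    return Q
```

Acting greedily with respect to optimistic Q-values makes under-visited cells look attractive, so exploration emerges from the bonus rather than from random action noise.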
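For the third part, the basic mechanics of selective prediction can be illustrated with a simple threshold rule: accept the most confident predictions and abstain on the rest, choosing the threshold to maximize coverage subject to a risk target. This is a generic reject-option sketch, not the thesis's threshold-learning method, which additionally estimates the dataset's intrinsic rejection/corruption rate; all names here are illustrative.

```python
def selective_threshold(confidences, correct, target_risk):
    """Return the smallest confidence threshold whose selective risk
    (error rate among accepted points) is at most target_risk, thereby
    maximizing coverage. A basic sketch of prediction with a reject option."""
    paired = sorted(zip(confidences, correct), reverse=True)  # most confident first
    errors = 0
    best_n = 0
    for i, (conf, ok) in enumerate(paired, start=1):
        errors += 0 if ok else 1
        if errors / i <= target_risk:
            best_n = i  # largest accepted prefix meeting the risk target
    if best_n == 0:
        return float("inf")  # no acceptable set: abstain on everything
    return paired[best_n - 1][0]  # accept points with confidence >= threshold

def predict_with_reject(confidence, threshold):
    """Abstain when the model's confidence falls below the learned threshold."""
    return "abstain" if confidence < threshold else "predict"
```

Sorting by confidence and sweeping the accepted prefix traces out the risk-coverage curve; the returned threshold is the operating point with the highest coverage that still meets the risk target on the calibration data.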