Adaptive Optimization Methods for Machine Learning
This thesis studies adaptive optimization algorithms for solving convex and non-convex optimization problems in machine learning. Specifically, we focus on the use of adaptive step sizes and adaptive sampling in gradient-based optimization methods.

In the first part of this thesis, we analyze the algorithmic stability of well-known adaptive optimization methods such as Adagrad and RMSProp, which exploit the adaptivity of the step sizes to speed up convergence. Algorithmic stability measures the sensitivity of a learning algorithm to changes in the training data, and it yields an upper bound on the algorithm's generalization error. There has been substantial prior research on the algorithmic stability of Stochastic Gradient Descent and Stochastic Gradient Descent with Momentum, but no comparable results for adaptive optimization methods; this work is an attempt to bridge that gap. We show that, under mild conditions on the hyper-parameters, adaptive optimization methods are uniformly stable. This work also offers practical guidance on choosing hyper-parameters for these methods.

In the second part of this thesis, we design a novel adaptive sampling Stochastic Variance Reduced Gradient (SVRG) method for solving convex and non-convex optimization problems in the distributed setting. Our new method reduces the dependence on the Lipschitz constants, which improves the convergence rate of SVRG. One novel aspect of our method compared to the previous state of the art is that it requires no prior knowledge of the Lipschitz constants of the local loss functions.
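To fix intuition for the adaptive step sizes discussed above, the following is an illustrative sketch (not taken from the thesis; the function name and constants are ours) of the Adagrad-style per-coordinate update, in which the effective step size for each coordinate shrinks as squared gradients accumulate:

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One Adagrad-style update: each coordinate's step size is scaled
    by the inverse square root of its accumulated squared gradients."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Toy example: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
accum = np.zeros_like(w)
for _ in range(200):
    w, accum = adagrad_step(w, w, accum)
# After many iterations, the iterate contracts toward the minimizer 0.
```

Coordinates with historically large gradients receive smaller steps, which is the adaptivity that both Adagrad and RMSProp exploit; RMSProp differs in replacing the running sum `accum` with an exponential moving average.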
- Mathematical Sciences
- Doctor of Philosophy (PhD)