Reducing the Costs to Design, Train, and Collect Data for Neural Networks with Combinatorial Optimization
The success of modern deep learning algorithms owes itself to the steady effort in scaling up neural models and their training datasets. This combined effort from many research groups has enabled neural network practitioners to train increasingly large models on increasingly large datasets, and to obtain increasingly better results. In terms of model size, modern neural networks can easily have thousands of times more parameters than neural networks from the last decade. For instance, when the Sequence-to-Sequence LSTM model [197] was introduced in 2014, the community was impressed by its 380 million weights. However, six years later, in 2020, GShard [121] came along with 1 trillion weights. Along with the growth in model sizes is the growth of datasets. In natural language understanding, the One Billion Words dataset [100], once considered a large dataset, has now been dwarfed by corpora that are several orders of magnitude larger [22, 46, 172, 235]. The same trend exists in image understanding as well. Anecdotally, ImageNet [180], the dataset once considered the standard of large-scale image classification, has become “the new MNIST” [237] – we went from training ImageNet models in days or weeks [81, 198] to one hour [68, 226].
Despite the success of large models learning from large datasets, their universal adoption is hindered by their immense expense. For instance, in 2020, training GPT-3, which was not the largest model for natural language understanding at the time, cost a staggering 4.6 million USD. While this expense comes mostly from the computation required to train the model, and this cost will eventually go down as better technology becomes available, the same cannot be said for collecting large training datasets. Since 2018, Google has reported strong results in image understanding from training models on its proprietary dataset JFT-300M [165, 229, 230], which contains 300 million labeled images and is 20 times larger than the publicly available ImageNet [180]. Collecting datasets like JFT-300M requires human workers, and hence scales much more slowly than computational power. Due to these expenses, large models and large datasets are gradually becoming a privilege of corporations with abundant resources.
In this thesis, I present a family of methods to reduce the expense of deep learning models along three facets. First, I formulate an optimization problem that reduces the cost of designing good neural network architectures by thousands of times compared to previous approaches. Second, I present a model parallelism technique to reduce the time to train and deploy neural networks. Third, I present a semi-supervised learning algorithm that reduces the cost of obtaining large training datasets for neural networks. In certain cases, I show that my algorithm can use only 10% of the available labeled data to train a neural network to an accuracy similar to that of the same network trained on the entire labeled dataset.
This thesis has two key contributions. The first contribution is the Neural Combinatorial Optimization (NCO) algorithm, the first algorithm that requires no annotated training data yet can still train a recurrent neural network to obtain nearly optimal solutions for certain combinatorial optimization problems, such as the Traveling Salesman Problem. The second contribution is the novel insight that designing neural networks, executing them, and obtaining their training data, although seemingly distant tasks, can all be formulated as combinatorial optimization problems. Thanks to this insight, I show that by formulating the task of concern as an appropriate combinatorial optimization problem and applying NCO or one of its variants, I can significantly reduce the expenses of neural networks.
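To illustrate how a recurrent network can learn without annotated solutions, the following is a minimal sketch of a standard policy-gradient formulation for the Traveling Salesman Problem, in which the only learning signal is the length of a sampled tour; the notation ($s$, $\pi$, $p_\theta$, $L$, $b$) is chosen here for exposition and is not taken verbatim from the thesis.

\[
J(\theta \mid s) \;=\; \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}\!\left[ L(\pi \mid s) \right],
\qquad
\nabla_\theta J(\theta \mid s) \;\approx\; \big( L(\pi \mid s) - b(s) \big)\, \nabla_\theta \log p_\theta(\pi \mid s),
\]

where $s$ is a problem instance (e.g., a set of city coordinates), $\pi$ is a tour sampled from the recurrent policy $p_\theta$, $L(\pi \mid s)$ is its length, and $b(s)$ is a baseline that reduces the variance of the gradient estimate. Because the reward is computed directly from each sampled tour, no optimal tours are ever required as labels.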
The methods that I present in this thesis have had a strong impact on many related fields, including but not limited to operations research, machine learning, natural language processing, and computer vision. For instance, the NCO algorithm has inspired subsequent development of neural approaches to canonical combinatorial optimization problems such as SAT, Vehicle Routing, Vertex Cover, and MaxCut. Furthermore, my formulation of the problem of designing neural network architectures, widely referred to as “weight-sharing neural architecture search”, has enabled hundreds of researchers from groups without the abundant resources of companies like Google or Facebook to develop and study algorithms for designing network architectures.
History
Date
- 2021-08-19
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)