Auto-batching Techniques for Dynamic Deep Learning Computation
Deep learning is increasingly used across a wide range of computing applications. Dynamism, the property that the execution of a computation differs in some way across inputs, has been shown to be an important tool for letting deep learning models adapt to and model the varying structure of input data in various domains, thereby achieving high accuracy. However, dynamism often makes batching, an important performance optimization for deep learning computations, difficult to apply. This thesis presents techniques for efficient auto-batching, that is, automatically enabling batched execution of a computation, for dynamic deep learning computations. Specifically, we consider two kinds of dynamism commonly exhibited by deep learning computations: control flow dynamism, where the model computation involves control flow structures such as conditional statements, loops, and recursion; and shape dynamism, where the model computation operates on tensors whose shapes differ across input data.
Past work has proposed a variety of approaches to the auto-batching problem in the presence of dynamism. However, we note that past work is characterized by significant fragmentation from a compilation and execution point of view. Techniques often target individual components of the compilation and runtime stack without taking a holistic view of the entire stack, and hence of the entire computation. For instance, tensor kernels are often optimized in isolation, without knowledge of the larger surrounding computation, while auto-batching techniques often rely primarily on either compile-time program transformations or runtime analyses, rather than on an end-to-end approach.
Taking these limitations of past work into account, the techniques in this thesis explicitly attempt to remove the fragmentation present in today’s deep learning stacks to enable efficient auto-batching. Specifically, we rely on two insights: (1) hybrid static+dynamic analysis to exploit all the available parallelism while keeping runtime overheads to a minimum, and (2) allowing information to flow between the compilation and execution of tensor operators and those of the surrounding computation. These insights enable us to obtain significant gains over past work. For instance, Cortex, a compiler specialized for recursive deep learning computations, achieves up to 14× faster inference than past work, while ACRoBat, an auto-batching framework that can handle unrestricted control flow, is up to 8.5× faster. Meanwhile, CoRa, a tensor compiler we designed for efficient batched execution in the presence of shape dynamism, performs on par with highly hand-optimized implementations of the transformer model.
- Doctor of Philosophy (PhD)