
Accelerating the "Motifs" in Machine Learning on Modern Processors

Thesis posted on 27.05.2021, 18:35 by Jiyuan Zhang
The booming growth of AI and machine learning is drastically reshaping the landscape of high performance computing. Traditional HPC addresses scientific problems driven by simulation, modeling, and analysis; algorithms such as linear algebra and methods such as differential equation solvers are at the core of their solutions. Emerging machine learning tasks, however, rest on a new set of algorithms and models. The computation and data-movement patterns inherent in learning algorithms do not map directly onto the computational motifs of physics, chemistry, or biology simulation problems. As a result, the high performance libraries that originated in the traditional scientific domains cannot be straightforwardly applied to emerging ML tasks to deliver the required performance. This thesis focuses on performance optimization of computation kernels
in emerging machine learning applications, spanning a diverse range from dense, regular kernels to sparse, irregular ones. We demonstrate how code specialization and generation, combined with expert-built performance models and learned dispatch strategies, enable ML motifs to achieve better performance on modern processors.

First, we investigate the performance optimization of dense kernels, focusing on convolutional neural networks (CNNs). The computation of convolution layers in deep neural networks typically relies on high performance matrix-multiplication routines. However, these routines are not optimized for convolution: the required data transformation incurs extra memory overhead, and the performance obtained falls short of what is conventionally expected. We demonstrate that direct convolution, when implemented correctly, eliminates all memory overhead and yields performance between 10% and 4x better than existing high performance implementations on conventional and embedded CPU architectures. We present a model-guided optimization approach that uses the characteristics of the system architecture to guide the choice of loop ordering, blocking, and memory-layout transformation. We show that a high performance direct convolution scales better with the number of threads than expert-tuned matrix-multiplication implementations, i.e., it suffers a smaller performance drop as threads are added.

Sparse kernels are an equally important class of computation, appearing in machine learning applications such as graph analytics and genetic sequencing. One factor that prevents sparse kernels from achieving high performance on modern processors is the prohibitively large number of possible implementations and data structures for sparse problems. We start from the observation that complicated sparse computations can be distilled into a primitive set of operators such as join, merge, and difference. To accelerate these operators on modern processors with data parallelism, we propose a vectorization and code-specialization approach that eliminates their control divergence. We then explore the design space of vectorization on CPUs with various vector widths, and present a code generation algorithm that takes the data width and operations as input and generates a range of implementations. We further demonstrate the acceleration of General Sparse Matrix-Matrix Multiplication (SpGEMM) on GPUs, showing how an SpGEMM implementation can leverage the join/merge operators to compose a variety of implementations.

Another challenge in optimizing sparse kernels is that their performance behavior is data dependent, and the input characteristics may change online during iterative updates. To leverage
the different implementations offered by the code generator, we propose a low-overhead mechanism that collects data-characteristic information to learn online dispatch decisions across iterations. Overall, this thesis demonstrates that the interplay of code specialization and generation with performance modeling and learned dispatch enables high performance kernels for emerging machine learning applications.
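The direct-convolution idea described in the abstract can be illustrated with a minimal scalar sketch. This is illustrative only, not the thesis kernel: the thesis adds loop blocking, a blocked memory layout, and SIMD vectorization on top of a loop nest like this one, and all names here are hypothetical.

```python
import numpy as np

def direct_conv2d(x, w, stride=1):
    """Naive direct convolution, no im2col data transformation.

    x: input of shape (C_in, H, W); w: weights of shape (C_out, C_in, KH, KW).
    Unlike the matrix-multiplication approach, no intermediate buffer is
    materialized; the thesis optimizes the ordering/blocking of this loop nest.
    """
    c_in, h, wd = x.shape
    c_out, _, kh, kw = w.shape
    oh = (h - kh) // stride + 1
    ow = (wd - kw) // stride + 1
    y = np.zeros((c_out, oh, ow))
    for co in range(c_out):              # output channels
        for i in range(oh):              # output rows
            for j in range(ow):          # output columns
                for ci in range(c_in):   # input channels
                    for u in range(kh):
                        for v in range(kw):
                            y[co, i, j] += w[co, ci, u, v] * x[ci, i * stride + u, j * stride + v]
    return y
```

The choice of which loops to block and interchange, and how to lay out `x` and `w` in memory, is exactly what the model-guided approach decides per architecture.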
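The sparse operator primitives named above (join, merge, difference) can be sketched in scalar form over sorted index lists. The data-dependent branching in these two-pointer scans is the control divergence that the vectorization and code-specialization approach eliminates; this sketch is illustrative, not the thesis code generator's output.

```python
def join(a, b):
    """Intersection of two sorted index lists (two-pointer scan)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge(a, b):
    """Union of two sorted index lists, duplicates collapsed."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

def difference(a, b):
    """Elements of sorted list a not present in sorted list b."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            j += 1
    out.extend(a[i:])
    return out
```

An SpGEMM row accumulation, for example, can be composed from such primitives: joining column-index lists selects matching nonzeros, while merging accumulates partial result rows.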
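The low-overhead online dispatch mechanism can be sketched as follows. This is a simplified, hypothetical illustration (probe each variant, then keep dispatching to the fastest observed, re-probing occasionally so the choice can track changing input characteristics); the thesis's actual learning mechanism and its collected data characteristics are not reproduced here, and all names are illustrative.

```python
import time

class OnlineDispatcher:
    """Picks among candidate kernel implementations across iterations."""

    def __init__(self, variants, probe_every=50):
        self.variants = list(variants)
        self.times = {v.__name__: [] for v in self.variants}
        self.iter = 0
        self.probe_every = probe_every  # re-measure period (illustrative)

    def __call__(self, *args):
        self.iter += 1
        unprobed = [v for v in self.variants if not self.times[v.__name__]]
        if unprobed:
            # Probe phase: every variant gets at least one timed run.
            v = unprobed[0]
        elif self.iter % self.probe_every == 0:
            # Periodic re-probe, since the data may have changed online.
            v = self.variants[self.iter % len(self.variants)]
        else:
            # Exploit: dispatch to the fastest variant observed so far.
            v = min(self.variants, key=lambda f: min(self.times[f.__name__]))
        t0 = time.perf_counter()
        result = v(*args)
        self.times[v.__name__].append(time.perf_counter() - t0)
        return result
```

The per-call cost is one timer read and a dictionary update, which is the sense in which such a mechanism can stay low-overhead relative to the kernels it dispatches.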

History

Date

10/12/2020

Degree Type

Dissertation

Department

Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Franz Franchetti
