Accelerating Sparse Matrix Kernels with Co-Optimized Architecture

2019-01-18T21:45:57Z (GMT) by Fazle Sadi
This dissertation presents an architecture to accelerate sparse matrix linear algebra, which is among the most important numerical methods for numerous science and engineering domains, such as graph analytics, in the current big-data-driven era. Sparse matrix operations, especially on large, unstructured matrices with very few nonzeros, lack guaranteed temporal and/or spatial locality, making them inherently unsuitable for cache-based, general-purpose commercial off-the-shelf (COTS) architectures. The lack of locality causes data-dependent, high-latency memory accesses and eventually exhausts the limited load buffer in COTS architectures. As a result, sparse kernels run at a small fraction of peak machine speed and achieve poor off-chip bandwidth saturation despite finely tuned software libraries/frameworks and optimized code. The poor utilization of compute and memory resources on COTS structures indicates significant room for improvement using existing technology. However, a computation paradigm that does not depend on data locality is essential for sparse operations to achieve the same level of performance that BLAS-like standards on COTS have delivered for dense matrix linear algebra.

An algorithm/hardware co-optimized architecture that provides a data-locality-independent platform for sparse matrix operations is developed in this work. The proposed architecture is founded on the fundamental principle of trading streaming bandwidth and compute to avoid high-latency accesses. This principle stems from a counterintuitive insight: the minimally required number of irregular memory accesses for sparse operations generally incurs high traffic that is transferred at slow speed, whereas more regular accesses can reduce overall traffic and speed up transfer through better use of block-level data.
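To make the locality problem concrete, the following sketch (not from the dissertation; names are illustrative) shows a standard CSR sparse matrix-vector multiply. The `val`, `col_idx`, and `row_ptr` arrays stream sequentially, but the reads `x[col_idx[k]]` are data-dependent gathers with no guaranteed locality, which is precisely the access pattern that stalls cache-based COTS machines:

```python
def spmv_csr(row_ptr, col_idx, val, x):
    """y = A @ x for a matrix A stored in CSR format."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Irregular, data-dependent access into x: the address
            # depends on the matrix contents, defeating prefetchers.
            acc += val[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example matrix [[2,0,1],[0,3,0],[4,0,5]] in CSR form
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
val     = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0]
print(spmv_csr(row_ptr, col_idx, val, x))  # [3.0, 3.0, 9.0]
```

For a large matrix with scattered nonzeros, each iteration of the inner loop can touch a cold cache line of `x`, so runtime is dominated by memory latency rather than compute.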
This work finds that a scalable, high-performance, and parallelizable multi-way merge network, absent in the current literature, is the core hardware primitive required to develop the proposed architecture. Through both algorithmic and circuit-level techniques, this work develops a novel multi-way merge hardware primitive that meaningfully eliminates high-latency accesses for sparse operations. This work also demonstrates methodologies for avoiding strong dependency on fast random-access on-chip memory for scaling, a major limiting factor of current custom hardware solutions when handling very large problems.

Using a common custom platform, this work shows implementations of Sparse Matrix dense Vector multiplication (SpMV), iterative SpMV, and Sparse General Matrix-Matrix multiplication (SpGEMM), which are core kernels for a broad range of graph analytic applications. A number of architecture- and circuit-level optimization techniques are developed for reducing off-chip traffic and improving computation throughput to saturate the extreme off-chip streaming bandwidth provided by state-of-the-art 3D stacking technology. The proposed custom hardware is demonstrated on ASIC (fabricated in 16nm FinFET) and FPGA platforms and evaluated against state-of-the-art COTS and custom hardware solutions. Experimental results show more than an order of magnitude improvement over current custom hardware solutions, and more than two orders of magnitude improvement over COTS architectures, in both performance and energy efficiency. This work is intended to contribute through a software stack provided by GraphBLAS-like [1] standards, where the broadest possible audience can utilize this architecture using a well-defined and concise set of matrix-based graph operations.
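The functional behavior of a multi-way merge can be sketched in software: given k sorted streams of (index, value) pairs, produce one sorted stream, accumulating values that share an index. This is the operation that turns SpGEMM partial products into a result row using only sequential stream accesses. The sketch below uses Python's `heapq.merge` as a stand-in for the hardware merge network; the function name and layout are assumptions for illustration, not the dissertation's design:

```python
import heapq

def multiway_merge_accumulate(streams):
    """Merge k sorted (index, value) streams into one sorted stream,
    summing the values of entries that share the same index."""
    merged = heapq.merge(*streams, key=lambda t: t[0])
    out = []
    for idx, v in merged:
        if out and out[-1][0] == idx:
            # Duplicate index: accumulate instead of emitting a new entry.
            out[-1] = (idx, out[-1][1] + v)
        else:
            out.append((idx, v))
    return out

# Three sorted partial-product streams, e.g. scaled sparse rows in SpGEMM
streams = [
    [(0, 1.0), (3, 2.0)],
    [(1, 4.0), (3, 1.0)],
    [(2, 5.0)],
]
print(multiway_merge_accumulate(streams))
# [(0, 1.0), (1, 4.0), (2, 5.0), (3, 3.0)]
```

Because every stream is consumed strictly in order, a hardware realization of this pattern needs only streaming bandwidth, not random access, which is the trade the dissertation's architecture is built around.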