This dissertation presents an architecture to accelerate sparse matrix linear algebra, one of the most important classes of numerical methods for numerous science and engineering domains, such as graph analytics, in the current big-data era. Sparse matrix operations, especially on large unstructured matrices with very few nonzeros, offer no guaranteed temporal or spatial locality, making them inherently ill-suited to cache-based, general-purpose commercial off-the-shelf (COTS) architectures. This lack of locality causes data-dependent, high-latency memory accesses that eventually exhaust the limited load buffers of COTS architectures. As a result, sparse kernels run at a small fraction of peak machine speed and achieve poor off-chip bandwidth saturation despite finely tuned software libraries/frameworks and optimized code. The poor utilization of compute and memory resources on COTS platforms indicates significant room for improvement using existing technology. However, a computation paradigm that does not depend on data locality is essential if sparse operations are to achieve the same level of performance that BLAS-like standards on COTS hardware have delivered for dense matrix linear algebra.

This work develops an algorithm/hardware co-optimized architecture that provides a data-locality-independent platform for sparse matrix operations. The proposed architecture is founded on the fundamental principle of trading streaming bandwidth and compute to avoid high-latency accesses. This principle stems from a counterintuitive insight: the minimal set of irregular memory accesses required by sparse operations generally incurs high traffic transferred at low speed, whereas a larger number of regular accesses can reduce overall traffic and transfer it faster through better use of block-level data. This work finds that a scalable, high-performance, and parallelizable multi-way merge network, absent in the current literature, is the core hardware primitive required to realize the proposed architecture. Through both algorithmic and circuit-level techniques, this work develops a novel multi-way merge hardware primitive that effectively eliminates high-latency accesses for sparse operations (its functional behavior is sketched below). This work also demonstrates methodologies that avoid a strong dependence on fast random-access on-chip memory for scaling, a major limitation of current custom hardware solutions on very large problems.

Using a common custom platform, this work implements Sparse Matrix dense Vector multiplication (SpMV), iterative SpMV, and Sparse General Matrix-Matrix multiplication (SpGEMM), which are core kernels for a broad range of graph analytics applications. A number of architecture- and circuit-level optimization techniques are developed to reduce off-chip traffic and to improve computation throughput enough to saturate the extreme off-chip streaming bandwidth provided by state-of-the-art 3D stacking technology. The proposed custom hardware is demonstrated on ASIC (fabricated in 16 nm FinFET) and FPGA platforms and evaluated against state-of-the-art COTS and custom hardware solutions. Experimental results show more than an order of magnitude improvement over current custom hardware solutions, and more than two orders of magnitude improvement over COTS architectures, in both performance and energy efficiency.
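The following is a minimal software sketch, not the dissertation's implementation, of what the multi-way merge primitive computes: combining k sorted streams (e.g., sorted row/column index streams arising in SpMV accumulation and SpGEMM) into one sorted output stream. All identifiers are illustrative; the sketch uses an O(k) scan per output for clarity, whereas the hardware primitive is a parallel merge network that sustains high output throughput regardless of k.

```c
#include <stdio.h>
#include <stdlib.h>

/* One sorted input stream: a buffer, its length, and a read cursor. */
typedef struct { const int *buf; size_t len, pos; } stream_t;

/* Pop the smallest head element among k sorted streams.
 * Returns 0 when all streams are exhausted. */
static int kway_next(stream_t *s, size_t k, int *out) {
    size_t best = k;                      /* index of stream with smallest head */
    for (size_t i = 0; i < k; i++)
        if (s[i].pos < s[i].len &&
            (best == k || s[i].buf[s[i].pos] < s[best].buf[s[best].pos]))
            best = i;
    if (best == k) return 0;              /* every stream is drained */
    *out = s[best].buf[s[best].pos++];
    return 1;
}

int main(void) {
    int a[] = {1, 4, 9}, b[] = {2, 3, 8}, c[] = {0, 5, 7};
    stream_t s[3] = {{a, 3, 0}, {b, 3, 0}, {c, 3, 0}};
    int v;
    while (kway_next(s, 3, &v)) printf("%d ", v);  /* prints: 0 1 2 3 4 5 7 8 9 */
    printf("\n");
    return 0;
}
```

Because every input and output of the merge is a sequential stream, a hardware realization of this operation lets sparse kernels replace high-latency irregular accesses with regular streaming traffic, which is the trade at the heart of the proposed architecture.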
This work is intended to be exposed through a software stack built on GraphBLAS-like standards [1], so that the broadest possible audience can utilize this architecture through a well-defined, concise set of matrix-based graph operations; one such operation is illustrated below.
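As a hedged illustration of a "matrix-based graph operation" in the GraphBLAS sense, the plain-C sketch below expresses one breadth-first-search frontier expansion as a sparse matrix-vector product over the Boolean (OR, AND) semiring with a complemented "visited" mask. It does not use the actual GraphBLAS API, and all identifiers are hypothetical; it only shows how graph traversal reduces to the SpMV-style kernels the architecture accelerates.

```c
#include <stdbool.h>
#include <stdio.h>

/* next[j] = OR over i of (frontier[i] AND A(i,j)), masked by !visited[j].
 * The adjacency matrix A is stored in CSR form (rowptr, colidx). */
static void bfs_step(int n, const int *rowptr, const int *colidx,
                     const bool *frontier, bool *visited, bool *next) {
    for (int j = 0; j < n; j++) next[j] = false;
    for (int i = 0; i < n; i++) {
        if (!frontier[i]) continue;                 /* row i inactive */
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++) {
            int j = colidx[p];                      /* edge i -> j */
            if (!visited[j]) { next[j] = true; visited[j] = true; }
        }
    }
}

int main(void) {
    /* 4-node undirected path graph 0-1-2-3 in CSR form. */
    int rowptr[] = {0, 1, 3, 5, 6};
    int colidx[] = {1, 0, 2, 1, 3, 2};
    bool frontier[4] = {true, false, false, false}; /* BFS from node 0 */
    bool visited[4]  = {true, false, false, false};
    bool next[4];
    bfs_step(4, rowptr, colidx, frontier, visited, next);
    for (int j = 0; j < 4; j++) printf("%d", next[j]); /* prints: 0100 */
    printf("\n");
    return 0;
}
```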