
Practical Mechanisms for Reducing Processor–Memory Data Movement in Modern Workloads

posted on 21.05.2021, 19:44 by Amirali Boroumand
Data movement between the memory system and computation units is one of the most critical challenges in designing high-performance, energy-efficient computing systems. The high cost of data movement is forcing architects to rethink the fundamental design of computer systems. Recent advances in memory design give architects the opportunity to avoid unnecessary data movement by performing processing-in-memory (PIM), also known as near-data processing (NDP). While PIM can allow many data-intensive applications to avoid moving data from memory to the CPU, it introduces new challenges for system architects and programmers. Our goal in this thesis is to make PIM effective and practical in conventional computing systems. Toward this end, this thesis pursues three major directions: (1) examining the suitability of PIM across key workloads, (2) addressing major system challenges to adopting PIM in computing systems, and (3) redesigning applications to be aware of PIM capability. In line with these three major directions, we propose a series of practical mechanisms to reduce processor–memory data movement in modern workloads.

First, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads. We find that PIM can significantly reduce data movement for all of these workloads by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute a significant fraction of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement and significantly reduces total system energy and execution time.

Second, we address one of the key system challenges for communication with PIM logic by proposing efficient cache coherence support for near-data accelerators (NDAs). We find that enforcing coherence with the rest of the system, which is already a major challenge for on-chip accelerators, becomes even more difficult for NDAs, because (1) the cost of communication between NDAs and CPUs is high, and (2) NDA applications generate a large amount of off-chip data movement. As a result, as we show in this work, existing coherence mechanisms eliminate most of the benefits of NDAs. Based on our observations, we propose CoNDA, a coherence mechanism that lets an NDA optimistically execute an NDA kernel under the assumption that the NDA has all necessary coherence permissions. This optimistic execution allows CoNDA to gather information on the memory accesses performed by the NDA and by the rest of the system. CoNDA exploits this information to avoid performing unnecessary coherence requests and, thus, significantly reduces data movement for coherence. We show that CoNDA significantly improves performance and reduces energy consumption compared to prior coherence mechanisms.

Third, we propose a PIM-aware hardware–software co-design approach for edge machine learning (ML) accelerators to enable energy-efficient, high-performance inference execution. We analyze a commercial Edge TPU (tensor processing unit) using 24 Google edge neural network (NN) models (including CNNs, LSTMs, transducers, and RCNNs), and find that the accelerator suffers from three shortcomings, in terms of computational throughput, energy efficiency, and memory access handling. We comprehensively study the characteristics of each NN layer in all of the Google edge models, and find that these shortcomings arise from the accelerator's one-size-fits-all approach, as there is a high degree of heterogeneity in key layer characteristics both across different models and across different layers within the same model. To combat this inefficiency, we propose a new acceleration framework called Mensa. Mensa incorporates multiple heterogeneous edge ML accelerators (including both on-chip and near-data accelerators), each of which caters to the characteristics of a particular subset of models. At runtime, Mensa schedules each layer to run on the best-suited accelerator, accounting for both efficiency and inter-layer dependencies. We show that Mensa significantly improves inference energy efficiency and throughput, while reducing hardware cost and improving area efficiency, over the Edge TPU and Eyeriss v2, two state-of-the-art edge ML accelerators.

Lastly, we propose to redesign emerging modern hybrid databases to be aware of PIM capability, enabling real-time analysis. Hybrid transactional and analytical processing (HTAP) database systems can support real-time data analysis without the high costs of synchronizing across separate single-purpose databases. Unfortunately, for many applications that perform a high rate of data updates, state-of-the-art HTAP systems incur significant drops in transactional and/or analytical throughput compared to performing only transactions or only analytics in isolation, due to (1) data movement between the CPU and memory, (2) data update propagation, and (3) consistency costs. We propose Polynesia, a hardware–software co-designed system for in-memory HTAP databases. Polynesia (1) divides the HTAP system into transactional and analytical processing islands, (2) implements custom algorithms and hardware to reduce the costs of update propagation and consistency, and (3) exploits processing-in-memory for the analytical islands to alleviate data movement. We show that Polynesia significantly outperforms three state-of-the-art HTAP systems while reducing energy consumption.
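The core idea behind CoNDA's optimistic execution can be illustrated with a minimal sketch: the NDA runs a kernel while only recording which addresses it touches, and coherence is resolved once at the end by checking the recorded accesses against the CPU's writes. The class, method names, and data structures below are illustrative assumptions, not the thesis's actual hardware design.

```python
# Minimal sketch of CoNDA-style optimistic coherence (illustrative
# assumptions only, not the actual hardware mechanism).

class OptimisticNDA:
    def __init__(self):
        self.nda_reads = set()    # addresses read by the NDA kernel
        self.nda_writes = set()   # addresses written by the NDA kernel

    def run_kernel(self, kernel, memory):
        """Execute optimistically: record accesses in signatures instead
        of issuing a coherence request for every access."""
        shadow = dict(memory)              # uncommitted working copy
        for op, addr, value in kernel:
            if op == "read":
                self.nda_reads.add(addr)
            else:                          # write
                self.nda_writes.add(addr)
                shadow[addr] = value
        return shadow

    def try_commit(self, shadow, memory, cpu_writes):
        """Commit only if the CPU did not write data the kernel used;
        otherwise discard the shadow copy so the caller can resolve
        coherence and re-execute the kernel."""
        conflict = cpu_writes & (self.nda_reads | self.nda_writes)
        if conflict:
            return False
        memory.update({a: shadow[a] for a in self.nda_writes})
        return True
```

In the common case there is no conflict, so the kernel commits with a single batched coherence check; only conflicting executions pay the cost of re-execution.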
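Mensa's runtime decision, scheduling each NN layer onto the accelerator best suited to it while accounting for inter-layer dependencies, can be sketched as a greedy cost minimization. The accelerator names, layer classes, cost numbers, and transfer penalty below are hypothetical placeholders for illustration; Mensa's real scheduler and cost model are more sophisticated.

```python
# Illustrative sketch of Mensa-style per-layer scheduling. Accelerator
# names, layer classes, and all cost values are hypothetical.

ACCELERATORS = {
    # assumed energy cost per unit of work for two layer classes
    "compute-centric": {"conv-heavy": 1.0, "memory-bound": 4.0},
    "near-data":       {"conv-heavy": 3.0, "memory-bound": 1.0},
}
SWITCH_COST = 0.5  # assumed cost of moving activations between accelerators

def schedule(layers):
    """Assign each (layer_class, work) pair to the accelerator with the
    lowest estimated cost, charging a transfer penalty when consecutive
    (dependent) layers run on different accelerators."""
    plan, prev = [], None
    for kind, work in layers:
        best = min(
            ACCELERATORS,
            key=lambda acc: ACCELERATORS[acc][kind] * work
                            + (SWITCH_COST if prev and acc != prev else 0.0),
        )
        plan.append(best)
        prev = best
    return plan
```

The transfer penalty captures the inter-layer dependency effect: a small layer may stay on the "wrong" accelerator when migrating its inputs would cost more than running it locally.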
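Polynesia's division into transactional and analytical islands with explicit update propagation can be sketched in a few lines: transactions write to their own store and log updates, which are later batch-propagated to the analytical replica that queries scan. The class structure and method names are illustrative assumptions, not Polynesia's actual algorithms or hardware.

```python
# Minimal sketch of an HTAP system split into transactional and
# analytical "islands" with explicit update propagation (structure and
# names are illustrative assumptions).

class HTAPIslands:
    def __init__(self):
        self.txn_store = {}       # transactional island: latest row versions
        self.update_log = []      # pending updates awaiting propagation
        self.analytic_copy = {}   # analytical island: scan-friendly replica

    def transact(self, key, value):
        """Transactional path: apply the write locally and log it, so
        analytics never contend on the transactional store."""
        self.txn_store[key] = value
        self.update_log.append((key, value))

    def propagate(self):
        """Batch-apply pending updates to the analytical replica; in
        Polynesia this propagation is accelerated by custom hardware."""
        for key, value in self.update_log:
            self.analytic_copy[key] = value
        self.update_log.clear()

    def analyze(self):
        """Analytical path: scan the replica, which reflects a consistent
        snapshot as of the last propagation."""
        return sum(self.analytic_copy.values())
```

Keeping the two islands separate is what lets each side run at full throughput; the cost of that separation is concentrated in `propagate`, which is exactly the step Polynesia targets with custom algorithms, hardware, and processing-in-memory.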




Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

  • Onur Mutlu
  • Saugata Ghose
