Accelerating Genome Sequence Analysis via Efficient Hardware/Algorithm Co-Design
Genome sequence analysis plays a pivotal role in enabling many medical and scientific advancements in personalized medicine, outbreak tracing, the understanding of evolution, and forensics. Modern genome sequencing machines can rapidly generate massive amounts of genomics data at low cost. However, the analysis of genome sequencing data is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. Moreover, as sequencing technologies advance, the growth in the rate that sequencing devices generate genomics data is far outpacing the corresponding growth in computational power, placing greater pressure on these bottlenecks.
Our goals in this dissertation are to (1) understand where the current tools and algorithms do not perform well in order to develop better tools and algorithms, and (2) understand the limitations of existing hardware systems when running these tools and algorithms in order to design efficient customized accelerators. Towards this end, we propose four major works, where we characterize the real-system behavior of the genome sequence analysis pipeline and its associated tools, expose the bottlenecks and tradeoffs of the pipeline and tools, and co-design fast and efficient algorithms along with scalable and energy-efficient customized hardware accelerators for the key pipeline bottlenecks to enable faster genome sequence analysis.
First, we comprehensively analyze the tools in the genome assembly pipeline for long reads in multiple dimensions (i.e., accuracy, performance, memory usage, and scalability), uncovering bottlenecks and tradeoffs that different combinations of tools and different underlying systems lead to. We show that we need high-performance, memory-efficient, low-power, and scalable designs for genome sequence analysis in order to exploit the advantages that genome sequencing provides. Second, we propose GenASM, an acceleration framework that builds upon bitvector-based approximate string matching (ASM) to accelerate multiple steps of the genome sequence analysis pipeline. We co-design our highly-parallel, scalable and memory-efficient algorithms with low-power and area-efficient hardware accelerators. Weevaluate GenASM for three different use cases of ASM in genome sequence analysis and show that GenASM is significantly faster and more power- and area-efficient than state-of-the-art software and hardware tools for each of these use cases. Third, we implement an FPGA-based prototype for GenASM, where state-of-the-art 3D-stacked memory (HBM2) offeres high memory bandwidth and FPGA resources offer high parallelism by instantiating multiple copies of the GenASM accelerators. Fourth, we propose SeGraM, the first hardware acceleration framework for sequence-to-graph mapping and alignment. Instead of representing the reference genome as a single linear DNA sequence, genome graphs provide a better representation of the diversity among populations by encoding variations across individuals in a graph data structure, avoiding
a bias towards any one reference. SeGraM enables the efficient mapping of a sequenced genome to a graph-based reference, providing more comprehensive and accurate genome sequence analysis. For SeGraM, we co-design algorithms and accelerators for memory-efficient minimizer-based seeding and bitvector-based, highly-parallel sequence-to-graph alignment. Compared to state-of-the-art software tools for sequence-to-graph mapping and alignment, we show that SeGraM significantly increases the throughput and reduces the power consumption for both short and long reads.
Overall, we demonstrate that genome sequence analysis can be accelerated by co-designing scalable and energy-efficient customized accelerators along with efficient algorithms for the key steps of genome sequence analysis. We also hope that this dissertation inspires future work in co-designing algorithms and hardware together to create powerful frameworks that accelerate other genomics workloads and emerging applications.
History
Degree Type
- Dissertation
Department
- Electrical and Computer Engineering
Degree Name
- Doctor of Philosophy (PhD)