Build Kernels For Task-based DNN Inference Runtime
Distributed deep learning systems face several challenges in efficiently scaling inference tasks due to communication overhead, load imbalance, and underutilized computational resources. These challenges are particularly common in large language models that incorporate Mixture-of-Experts layers and autoregressive attention layers. To address these challenges, we propose an asynchronous task-based distributed inference runtime that builds on the FlexFlow framework and optimizes kernels for Mixture-of-Experts (MoE) and Incremental MultiHead Self-Attention (IncMHA) layers in inference tasks. We introduce the optimized FlexFlow Experts Operator and FlexFlow IncMHA Operator, which leverage the asynchronous nature of inference tasks to achieve better GPU utilization and lower communication latency. These operators allow FlexFlow to handle data-independent requests and read-only weights while remaining resilient to varying arrival rates and supporting optimal batch configurations. We evaluated our system against existing frameworks and demonstrated its effectiveness in improving performance and resource utilization. Our future work will focus on enabling better parallelism in large models by decoupling autoregressive operations through speculative inference.
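The two operators target well-known computation patterns. Below is a minimal, framework-agnostic sketch in NumPy of top-k expert routing, the dispatch pattern a Mixture-of-Experts layer uses; the function name `top_k_gate` is illustrative only and is not the FlexFlow Experts Operator API.

```python
# A minimal sketch of top-k expert routing for an MoE layer.
# Illustrative only; not the FlexFlow Experts Operator interface.
import numpy as np

def top_k_gate(logits, k=2):
    """Pick the k highest-scoring experts per token and renormalize
    their gate weights; all other experts receive zero weight."""
    topk = np.argsort(logits, axis=-1)[:, -k:]          # (tokens, k) expert ids
    gates = np.take_along_axis(logits, topk, axis=-1)   # (tokens, k) raw scores
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)          # softmax over selected k
    return topk, gates

tokens, num_experts = 4, 8
logits = np.random.randn(tokens, num_experts)
experts, weights = top_k_gate(logits, k=2)
# Each token is dispatched to its selected experts and the expert
# outputs are combined with the gate weights.
```

Because tokens routed to different experts are data-independent, an asynchronous task-based runtime can execute the per-expert computations as separate tasks rather than serializing them.

In the same spirit, the sketch below illustrates incremental multi-head self-attention with a per-request key/value cache, the pattern that IncMHA optimizes during autoregressive decoding; `IncMHACache` and `step` are hypothetical names, not FlexFlow identifiers.

```python
# A minimal sketch of incremental multi-head self-attention (IncMHA)
# with a key/value cache. Illustrative names; not the FlexFlow API.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class IncMHACache:
    """Per-request key/value cache so each decoding step attends over
    all previously generated tokens without recomputing them."""
    def __init__(self, num_heads, head_dim):
        self.head_dim = head_dim
        self.keys = np.zeros((num_heads, 0, head_dim))
        self.values = np.zeros((num_heads, 0, head_dim))

    def step(self, q, k, v):
        # q, k, v: (num_heads, head_dim) projections of the new token.
        self.keys = np.concatenate([self.keys, k[:, None, :]], axis=1)
        self.values = np.concatenate([self.values, v[:, None, :]], axis=1)
        # Attention of the new token over the cached sequence.
        scores = np.einsum('hd,htd->ht', q, self.keys) / np.sqrt(self.head_dim)
        weights = softmax(scores, axis=-1)              # (heads, seq_len)
        return np.einsum('ht,htd->hd', weights, self.values)

# Usage: one autoregressive decoding step per call.
cache = IncMHACache(num_heads=8, head_dim=64)
for _ in range(4):
    q, k, v = (np.random.randn(8, 64) for _ in range(3))
    out = cache.step(q, k, v)  # (8, 64) output for the newest token
print(out.shape)
```

The cache avoids recomputing keys and values for the full prefix at every step, and since each request owns its own cache and the model weights are read-only, decoding steps for different requests can proceed as independent tasks at varying arrival rates.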
History
Date
- 2023-05-01

Degree Type
- Master's Thesis
Department
- Information Networking Institute
Degree Name
- Master of Science (MS)