Carnegie Mellon University

Build Kernels For Task-based DNN Inference Runtime

Thesis posted on 2023-07-20, 20:43, authored by Zeyu Wang

Distributed deep learning systems face several challenges in efficiently scaling inference tasks due to communication overhead, load imbalance, and underutilized computational resources. These challenges are particularly pronounced in large language models that incorporate Mixture-of-Experts layers and autoregressive attention layers. To address them, we propose an asynchronous task-based distributed inference runtime that builds on the FlexFlow framework and optimizes kernels for Mixture-of-Experts (MoE) and Incremental Multi-Head Self-Attention (IncMHA) layers in inference tasks. We introduce the optimized FlexFlow Experts Operator and FlexFlow IncMHA Operator, which leverage the asynchronous nature of inference tasks to achieve better GPU utilization and lower communication latency. These operators allow FlexFlow to handle data-independent requests and read-only weights while remaining resilient to varying request arrival rates and supporting optimal batch configurations. We evaluated our system against existing frameworks and demonstrated its effectiveness in improving performance and resource utilization. Future work will focus on enabling better parallelism in large models by decoupling autoregressive operations via speculative inference.
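To make the two operator ideas concrete, the following is a minimal Python/numpy sketch of the mechanism behind an incremental multi-head self-attention (IncMHA) kernel: cache the keys and values of already-generated tokens so each autoregressive decoding step only computes attention for the single new token. All names (IncMHACache, step) are hypothetical illustrations, not the thesis's actual FlexFlow operator API.

    import numpy as np

    class IncMHACache:
        """Per-request key/value cache for autoregressive decoding (hypothetical)."""

        def __init__(self, num_heads, head_dim, max_len):
            self.k = np.zeros((num_heads, max_len, head_dim))
            self.v = np.zeros((num_heads, max_len, head_dim))
            self.len = 0  # number of tokens cached so far

        def step(self, q_new, k_new, v_new):
            """Attend one new token against all cached tokens plus itself.

            q_new, k_new, v_new: (num_heads, head_dim) projections of the new token.
            Returns the attention output, shape (num_heads, head_dim).
            """
            # Append the new token's key/value to the cache; the projection
            # weights that produced them are read-only, only the cache mutates.
            self.k[:, self.len] = k_new
            self.v[:, self.len] = v_new
            self.len += 1

            k = self.k[:, : self.len]  # (heads, len, dim)
            v = self.v[:, : self.len]
            scale = 1.0 / np.sqrt(q_new.shape[-1])

            # One-row attention: scores over all cached tokens; no causal mask
            # is needed because the cache holds only past and current tokens.
            scores = np.einsum("hd,hld->hl", q_new, k) * scale
            scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
            weights = np.exp(scores)
            weights /= weights.sum(axis=-1, keepdims=True)
            return np.einsum("hl,hld->hd", weights, v)

    # Usage: decode 4 tokens for one request with 8 heads of width 64.
    cache = IncMHACache(num_heads=8, head_dim=64, max_len=512)
    rng = np.random.default_rng(0)
    for _ in range(4):
        q, k, v = (rng.standard_normal((8, 64)) for _ in range(3))
        out = cache.step(q, k, v)  # (8, 64) per-step attention output

Equally hedged, a toy top-k Mixture-of-Experts dispatch below illustrates the load-imbalance problem an Experts operator must absorb: tokens route to different experts, so per-expert batch sizes vary from step to step. The function name and shapes are assumptions for illustration only.

    def moe_dispatch(gate_logits, top_k=2):
        """Route each token to its top_k experts. gate_logits: (tokens, experts)."""
        top = np.argsort(gate_logits, axis=-1)[:, -top_k:]  # expert ids per token
        groups = {}                                         # expert id -> token ids
        for tok, experts in enumerate(top):
            for e in experts:
                groups.setdefault(int(e), []).append(tok)
        return groups  # uneven group sizes are the source of load imbalance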

History

Date

2023-05-01

Degree Type

  • Master's Thesis

Department

  • Information Networking Institute

Degree Name

  • Master of Science (MS)

Advisor(s)

Zhihao Jia
