Build Kernels For Task-based DNN Inference Runtime
Distributed deep learning systems face several challenges in efficiently scaling inference tasks due to communication overhead, load imbalance, and underutilized computational resources. These challenges are particularly common in large language models that incorporate Mixture-of-Experts layers and autoregressive attention layers. To address these challenges, we propose an asynchronous task-based distributed inference runtime that builds on the FlexFlow framework and optimizes kernels for Mixture-of-Experts (MoE) and Incremental MultiHead Self-Attention (IncMHA) layers in inference tasks. We introduce the optimized FlexFlow Experts Operator and FlexFlow IncMHA Operator, which leverage the asynchronous nature of inference tasks to achieve better GPU utilization and lower communication latency. These operators allow FlexFlow to handle data-independent requests and read-only weights while remaining resilient to varying arrival rates and supporting optimal batch configurations. We evaluated our system against existing frameworks and demonstrated its effectiveness in improving performance and resource utilization. Our future work will focus on enabling better parallelism in large models by decoupling autoregressive operations through speculative inference.
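The two operators target well-known computation patterns. Below is a minimal, framework-agnostic sketch in NumPy of top-k expert routing, the dispatch pattern a Mixture-of-Experts layer uses; the function name `top_k_gate` is illustrative only and is not the FlexFlow Experts Operator API.

```python
# A minimal sketch of top-k expert routing for an MoE layer.
# Illustrative only; not the FlexFlow Experts Operator interface.
import numpy as np

def top_k_gate(logits, k=2):
    """Pick the k highest-scoring experts per token and renormalize
    their gate weights; all other experts receive zero weight."""
    topk = np.argsort(logits, axis=-1)[:, -k:]          # (tokens, k) expert ids
    gates = np.take_along_axis(logits, topk, axis=-1)   # (tokens, k) raw scores
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)          # softmax over selected k
    return topk, gates

tokens, num_experts = 4, 8
logits = np.random.randn(tokens, num_experts)
experts, weights = top_k_gate(logits, k=2)
# Each token is dispatched to its selected experts and the expert
# outputs are combined with the gate weights.
```

Because tokens routed to different experts are data-independent, an asynchronous task-based runtime can execute the per-expert computations as separate tasks rather than serializing them.

In the same spirit, the sketch below illustrates incremental multi-head self-attention with a per-request key/value cache, the pattern that IncMHA optimizes during autoregressive decoding; `IncMHACache` and `step` are hypothetical names, not FlexFlow identifiers.

```python
# A minimal sketch of incremental multi-head self-attention (IncMHA)
# with a key/value cache. Illustrative names; not the FlexFlow API.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class IncMHACache:
    """Per-request key/value cache so each decoding step attends over
    all previously generated tokens without recomputing them."""
    def __init__(self, num_heads, head_dim):
        self.head_dim = head_dim
        self.keys = np.zeros((num_heads, 0, head_dim))
        self.values = np.zeros((num_heads, 0, head_dim))

    def step(self, q, k, v):
        # q, k, v: (num_heads, head_dim) projections of the new token.
        self.keys = np.concatenate([self.keys, k[:, None, :]], axis=1)
        self.values = np.concatenate([self.values, v[:, None, :]], axis=1)
        # Attention of the new token over the cached sequence.
        scores = np.einsum('hd,htd->ht', q, self.keys) / np.sqrt(self.head_dim)
        weights = softmax(scores, axis=-1)              # (heads, seq_len)
        return np.einsum('ht,htd->hd', weights, self.values)

# Usage: one autoregressive decoding step per call.
cache = IncMHACache(num_heads=8, head_dim=64)
for _ in range(4):
    q, k, v = (np.random.randn(8, 64) for _ in range(3))
    out = cache.step(q, k, v)  # (8, 64) output for the newest token
print(out.shape)
```

The cache avoids recomputing keys and values for the full prefix at every step, and since each request owns its own cache and the model weights are read-only, decoding steps for different requests can proceed as independent tasks at varying arrival rates.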
History
Date
- 2023-05-01

Degree Type
- Master's Thesis
Department
- Information Networking Institute
Degree Name
- Master of Science (MS)