nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training

OSDI 2024


With the growing model size of deep neural networks (DNN), deep learning training is increasingly relying on handcrafted search spaces to find efficient parallelization execution plans. However, our study shows that existing search spaces exclude plans that significantly impact the training performance of well-known DNN models (e.g., AlphaFold2) under important settings, such as when handling large embedding tables in large language models.

To address this problem, we propose nnScaler, a framework that generates efficient parallelization plans for deep learning training. Instead of relying on existing search spaces, nnScaler advocates a more general approach that empowers domain experts to construct their own search spaces through three primitives, op-trans, op-assign, and op-order, which capture the model transformation and the temporal-spatial scheduling of the transformed model for any parallelization plan. To avoid space explosion, nnScaler allows constraints to be applied to those primitives during space construction. With the proposed primitives and constraints, nnScaler can compose existing search spaces as well as new ones. Experiments show that nnScaler can find new parallelization plans in new search spaces that achieve up to 3.5× speedup compared to solutions such as DeepSpeed, Megatron-LM, and Alpa for popular DNN models like SwinTransformer and AlphaFold2.
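To make the three primitives concrete, the following is a minimal, hypothetical Python sketch; the names op_trans, op_assign, op_order, Op, and build_plan, and this whole API, are illustrative assumptions, not nnScaler's actual interface. It shows how a domain expert might constrain the search space by partitioning only matmul operators across devices while keeping everything else on one device.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Op:
        name: str
        kind: str
        device: int = 0

    def op_trans(op: Op, parts: int) -> List[Op]:
        # op-trans: transform one operator into `parts` partitioned sub-operators.
        return [Op(f"{op.name}.part{i}", op.kind) for i in range(parts)]

    def op_assign(op: Op, device: int) -> Op:
        # op-assign: spatial scheduling, i.e., place a (sub-)operator on a device.
        op.device = device
        return op

    def op_order(ops: List[Op]) -> List[Op]:
        # op-order: temporal scheduling; here we simply keep the original order.
        return ops

    def build_plan(model_ops: List[Op], n_devices: int) -> List[Op]:
        plan: List[Op] = []
        for op in model_ops:
            # Constraint supplied by the expert: only matmuls are partitioned,
            # which prunes the space of candidate plans before any search runs.
            subs = op_trans(op, n_devices) if op.kind == "matmul" else [op]
            plan.extend(op_assign(s, device=i % n_devices) for i, s in enumerate(subs))
        return op_order(plan)

    if __name__ == "__main__":
        ops = [Op("embed", "lookup"), Op("proj", "matmul"), Op("softmax", "elementwise")]
        for op in build_plan(ops, n_devices=4):
            print(f"{op.name} -> device {op.device}")

In the actual framework, such constraints are applied to the primitives during space construction, so only plans satisfying them are ever enumerated.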

Downloads

nnScaler

April 5, 2024

nnScaler is a system that takes a DNN model designed for a single device (e.g., a GPU) and automatically converts it into a program that can execute concurrently across multiple devices.
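As a rough illustration of what such a conversion means, the snippet below is an assumed, hand-written example in plain PyTorch, not nnScaler's API or generated output: it takes a linear layer written for one device and splits its weights across two devices so that each computes half of the output.

    import torch
    import torch.nn.functional as F
    from torch import nn

    single = nn.Linear(1024, 4096)  # layer written for a single device

    # Pick two devices; fall back to CPU twice if fewer than two GPUs exist.
    devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]

    # Split the output dimension so each device holds half of the parameters.
    w0, w1 = single.weight.detach().chunk(2, dim=0)
    b0, b1 = single.bias.detach().chunk(2, dim=0)
    w0, b0 = w0.to(devices[0]), b0.to(devices[0])
    w1, b1 = w1.to(devices[1]), b1.to(devices[1])

    def parallel_forward(x: torch.Tensor) -> torch.Tensor:
        # Each device computes its half of the output; results are concatenated.
        y0 = F.linear(x.to(devices[0]), w0, b0)
        y1 = F.linear(x.to(devices[1]), w1, b1)
        return torch.cat([y0.cpu(), y1.cpu()], dim=-1)

    x = torch.randn(8, 1024)
    assert torch.allclose(parallel_forward(x), single(x), atol=1e-4)

nnScaler automates this kind of rewriting for an entire model, emitting the partitioning, placement, and ordering decisions as an executable parallelization plan.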