NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors

Jianyu Wei; Ting Cao; Shijie Cao; Shiqi Jiang; Shaowei Fu; Mao Yang; Yanyong Zhang; Yunxin Liu

The 21st International Conference on Mobile Systems, Applications, and Services (MobiSys ’23) | June 2023

Published by ACM

Mobile devices are increasingly equipped with heterogeneous multi-processors, e.g., CPU + GPU + DSP. Yet existing Neural Network (NN) inference fails to fully utilize the computing power of these processors because NN models are structured sequentially. To this end, this paper proposes NN-Stretch, a new model adaptation strategy, together with its supporting system. NN-Stretch automatically branches a given model according to the architectural characteristics of the processors. Unlike popular model adaptation techniques such as model pruning, which often sacrifice accuracy, NN-Stretch accelerates inference while preserving accuracy.

The key idea of NN-Stretch is to stretch a model structure horizontally, from a long and narrow model to a short and wide one with multiple branches. We formulate model branching as an optimization problem. NN-Stretch narrows the design space with two kinds of constraints: hard latency constraints, met by varying where the branches converge and how each branch is scaled to fit the heterogeneous processors, and soft accuracy constraints, met by maintaining the model skeleton and the expressiveness of each branch. Under these constraints, NN-Stretch efficiently generates accurate and efficient multi-branch models. To ease deployment, this paper also devises a subgraph-based spatial scheduler that lets existing inference frameworks execute the multi-branch models in parallel. In our experiments, NN-Stretch achieves up to 3.85× speedup over single CPU/GPU/DSP execution and up to 0.8% accuracy improvement.
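The branching idea can be illustrated with a toy sketch. This is not the NN-Stretch algorithm or its scheduler: the function names (`split_channels`, `estimated_latency`, `run_branches`), the throughput figures, and the cost model (branch latency taken as channel count divided by processor throughput) are all assumptions made for illustration. The sketch only shows why scaling each branch to its processor and running the branches side by side can shorten the critical path.

```python
from concurrent.futures import ThreadPoolExecutor


def split_channels(total_channels, throughput):
    """Scale each branch in proportion to its processor's throughput.

    `throughput` maps a processor name to a hypothetical channels-per-ms
    figure; a real system would profile the actual hardware.
    """
    total = sum(throughput.values())
    widths = {name: int(total_channels * t / total)
              for name, t in throughput.items()}
    # Give any channels lost to rounding to the fastest processor.
    fastest = max(throughput, key=throughput.get)
    widths[fastest] += total_channels - sum(widths.values())
    return widths


def estimated_latency(widths, throughput):
    """Branches run concurrently, so the slowest branch sets the latency."""
    return max(widths[name] / throughput[name] for name in widths)


def run_branches(branches, x):
    """Run one callable per branch in parallel and collect the outputs.

    Threads stand in for processors here; the outputs would be merged
    at the point where the branches converge.
    """
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, x) for name, fn in branches.items()}
        return {name: f.result() for name, f in futures.items()}


# Hypothetical throughputs: the GPU is assumed twice as fast as CPU and DSP.
throughput = {"cpu": 1.0, "gpu": 2.0, "dsp": 1.0}
widths = split_channels(64, throughput)
print(widths)
print(estimated_latency(widths, throughput))
```

Under this assumed cost model, the three branches finish at the same time, whereas running all 64 channels on the fastest processor alone would take twice as long. Real layer cost is not linear in channel count, so the numbers are only indicative.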
