Optimizing CNNs on Multicores for Scalability, Performance and Goodput
- Samyam Rajbhandari,
- Yuxiong He,
- Olatunji Ruwase,
- Michael Carbin,
- Trishul Chilimbi
2017 Architectural Support for Programming Languages and Operating Systems (ASPLOS) | Published by ACM
Convolutional Neural Networks (CNNs) are a class of Artificial Neural Networks (ANNs) that are highly effective at the pattern recognition tasks underlying difficult AI problems in a variety of domains, such as speech recognition, object recognition, and natural language processing. CNNs are, however, computationally intensive to train. This paper presents the first characterization of the performance optimization opportunities for training CNNs on CPUs. Our characterization includes insights based on the structure of the network itself (i.e., the intrinsic arithmetic intensity of the convolution and its scalability under parallelism) as well as on dynamic properties of its execution (i.e., the sparsity of the computation).
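As a rough illustration of what arithmetic intensity means here (a back-of-the-envelope model, not the paper's exact characterization), intensity can be estimated as floating-point operations per byte of data moved. The sketch below, with hypothetical parameter names, assumes stride-1, same-padded convolution and that inputs, weights, and outputs each cross the memory hierarchy exactly once:

```python
# Hypothetical estimate of a convolution layer's arithmetic intensity
# (FLOPs per byte moved); a simplification for intuition only.

def conv_arithmetic_intensity(h, w, c_in, c_out, k, bytes_per_elem=4):
    """Estimate FLOPs/byte for a k x k convolution over an h x w x c_in
    input producing an h x w x c_out output (stride 1, same padding)."""
    # 2 FLOPs (multiply + add) per weight applied to each output element.
    flops = 2 * h * w * c_out * k * k * c_in
    # Bytes moved if each operand crosses memory once.
    bytes_moved = bytes_per_elem * (
        h * w * c_in            # input activations
        + k * k * c_in * c_out  # weights
        + h * w * c_out         # output activations
    )
    return flops / bytes_moved

# Example: a mid-network layer. High ratios suggest the layer is
# compute-bound rather than memory-bound.
print(conv_arithmetic_intensity(h=28, w=28, c_in=128, c_out=128, k=3))
```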
Given this characterization, we present spg-CNN, an automatic framework for optimizing CNN training on CPUs. It comprises a computation scheduler for efficient parallel execution and two code generators: one that optimizes for sparsity, and one that optimizes for spatial reuse in convolutions.
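A minimal sketch of the sparsity idea follows, assuming a scatter-style formulation in which zero input activations (common after ReLU) are skipped entirely; the function `sparse_conv2d` and its tensor shapes are illustrative and are not spg-CNN's generated code:

```python
import numpy as np

def sparse_conv2d(inp, weights):
    """inp: (H, W, Cin); weights: (K, K, Cin, Cout); stride 1, valid."""
    H, W, Cin = inp.shape
    K, _, _, Cout = weights.shape
    out = np.zeros((H - K + 1, W - K + 1, Cout))
    # Scatter-style loop: visit only the nonzero input activations and
    # accumulate their contributions into the outputs they affect.
    ys, xs, cs = np.nonzero(inp)
    for y, x, c in zip(ys, xs, cs):
        v = inp[y, x, c]
        for ky in range(max(0, y - H + K), min(K, y + 1)):
            for kx in range(max(0, x - W + K), min(K, x + 1)):
                out[y - ky, x - kx] += v * weights[ky, kx, c]
    return out
```

Iterating over nonzero inputs rather than over outputs means the work scales with the number of nonzeros instead of the layer's dense size, which is the payoff a sparsity-aware code generator targets.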
We evaluate spg-CNN using convolutions from a variety of real-world benchmarks, and show that spg-CNN can train CNNs faster than state-of-the-art approaches by an order of magnitude.