A Configurable Cloud-Scale DNN Processor for Real-Time AI
- Jeremy Fowers,
- Kalin Ovtcharov,
- Michael Papamichael,
- Todd Massengill,
- Ming Liu,
- Daniel Lo,
- Shlomi Alkalay,
- Michael Haselman,
- Logan Adams,
- Mahdi Ghandi,
- Stephen Heil,
- Prerak Patel,
- Adam Sapek,
- Gabriel Weisz,
- Lisa Woods,
- Sitaram Lanka,
- Steve Reinhardt,
- Adrian Caulfield,
- Eric Chung,
- Doug Burger
Proceedings of the 45th International Symposium on Computer Architecture, 2018. Published by ACM.
Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, also known as "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
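To illustrate how a single instruction in a compound SIMD ISA can dispatch millions of operations, the following is a minimal Python sketch, not the actual Brainwave ISA. It models one hypothetical matrix-vector instruction whose execution is interleaved across a configurable number of tile engines, each performing independent multiply-accumulate (MAC) work; the function names and tiling scheme are illustrative assumptions.

```python
# Hypothetical sketch: one "mv_mul" instruction expands into rows*cols
# multiply-accumulate operations, with rows interleaved across tile engines.
# For a 2400x2400 weight matrix, a single such instruction would dispatch
# 5.76M MACs -- the same order of magnitude as the 7M figure in the abstract.

def mv_mul(matrix, vector, num_tiles=4):
    """Execute one matrix-vector instruction across `num_tiles` tile engines.

    Returns the result vector and the total MAC count the single
    instruction expanded into.
    """
    rows = len(matrix)
    result = [0.0] * rows
    macs = 0
    for tile in range(num_tiles):
        # Each tile engine owns every num_tiles-th row (strided interleaving).
        for r in range(tile, rows, num_tiles):
            acc = 0.0
            for c, x in enumerate(vector):
                acc += matrix[r][c] * x  # one multiply-accumulate
                macs += 1
            result[r] = acc
    return result, macs

result, macs = mv_mul([[1, 2], [3, 4]], [1, 1])
```

In hardware the tiles run concurrently and the accumulation is done in dot-product trees rather than sequentially, but the key property is the same: instruction fetch and decode cost is amortized over millions of data-parallel operations.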