Towards High-Performance Prediction Serving Systems
- Yunseong Lee
- Alberto Scolari
- Byung-Gon Chun
- Matteo Interlandi
- Markus Weimer
31st Conference on Neural Information Processing Systems
Organized by NIPS
Machine learning models are often composed of sequences of transformations. While this design makes it easy to decompose and efficiently execute individual model components at training time, prediction serving requires low latency and highly predictable performance, and meeting these goals calls for end-to-end and multi-model runtime optimizations. This paper sheds some light on the problem by introducing a new system design for high-performance prediction serving. We report preliminary results showing that our design improves performance along several dimensions compared with current state-of-the-art approaches.