Scheduling For Efficient Large-Scale Machine Learning Training

Over recent years, machine learning techniques have achieved success in many real-world applications. As researchers and practitioners continue to expand machine learning into new application domains and push the boundaries of existing applications, they face critical computational challenges due to growing dataset sizes and increasing model complexity and capacity. These challenges demand new software systems that train large models efficiently and enable machine learning researchers to easily experiment with new ideas.
There exist many opportunities to improve training time and to support training larger models by leveraging the structural properties of machine learning computation to design efficient training systems. In this talk, I will present two distributed training systems, Bösen and Orion, which schedule inter-machine network communication and parallel computation to improve training time by reducing inconsistency in parameter states, without requiring heavy programmer effort. Moreover, by scheduling memory usage in TensorFlow, we reduce GPU memory consumption by 87% and enable training models with 10x more parameters on the same hardware.
Date:
Speakers:
Jinliang Wei
Affiliation:
Carnegie Mellon University