Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

  • ,
  • Aaron Harlap ,
  • Nandita Vijaykumar ,
  • Dimitris Konomis ,
  • Gregory R. Ganger ,
  • Phillip B. Gibbons ,
  • Onur Mutlu

Symposium on Networked Systems Design and Implementation (NSDI)


Machine learning (ML) is widely used to derive useful information from large-scale data (such as user activities, pictures, and videos) generated at increasingly rapid rates all over the world. Unfortunately, it is infeasible to move all this globally generated data to a centralized data center before running an ML algorithm over it: moving large amounts of raw data over wide-area networks (WANs) can be extremely slow, and is also subject to the constraints of privacy and data sovereignty laws. This motivates the need for a geo-distributed ML system spanning multiple data centers. However, communicating over WANs can significantly degrade ML system performance (by as much as 53.7X in our study) because the communication overwhelms the limited WAN bandwidth.

Our goal in this work is to develop a geo-distributed ML system that (1) employs an intelligent communication mechanism over WANs to efficiently utilize the scarce WAN bandwidth, while retaining the accuracy and correctness guarantees of an ML algorithm; and (2) is generic and flexible enough to run a wide range of ML algorithms, without requiring any changes to the algorithms.

To this end, we introduce a new, general geo-distributed ML system, Gaia, that decouples the communication within a data center from the communication between data centers, enabling different communication and consistency models for each. We present a new ML synchronization model, Approximate Synchronous Parallel (ASP), whose key idea is to dynamically eliminate insignificant communication between data centers while still guaranteeing the correctness of ML algorithms. Our experiments on our prototypes of Gaia running across 11 Amazon EC2 global regions and on a cluster that emulates EC2 WAN bandwidth show that Gaia provides 1.8–53.5X speedup over two state-of-the-art distributed ML systems, and is within 0.94–1.40X of the speed of running the same ML algorithm on machines on a local area network (LAN).
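
The idea of dropping insignificant cross-data-center communication can be illustrated with a small sketch. The abstract does not say how significance is judged, so everything here is an assumption made for illustration: the class name `SignificanceFilter`, the relative-change criterion, and the `threshold` value are hypothetical and are not taken from Gaia's implementation. Each data center applies every update locally, accumulates the unsent deltas, and forwards the aggregate over the WAN only once it is large relative to the parameter's current value.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SignificanceFilter:
    """Hypothetical per-data-center filter for updates sent over the WAN."""

    # Assumed cutoff on relative change; the value is illustrative only.
    threshold: float = 0.01
    # Deltas applied locally but not yet sent to other data centers.
    pending: Dict[str, float] = field(default_factory=dict)

    def update(self, params: Dict[str, float], key: str, delta: float) -> Optional[float]:
        """Apply delta locally; return the aggregated delta to send, or None."""
        params[key] = params.get(key, 0.0) + delta
        self.pending[key] = self.pending.get(key, 0.0) + delta

        aggregated = self.pending[key]
        current = abs(params[key])
        # Relative change against the current value; fall back to the
        # absolute magnitude when the current value is exactly zero.
        significance = abs(aggregated) / current if current > 0 else abs(aggregated)

        if significance >= self.threshold:
            del self.pending[key]   # the aggregate is about to be sent
            return aggregated
        return None                 # too small to be worth WAN bandwidth


params = {"w": 1.0}
wan_filter = SignificanceFilter(threshold=0.1)
first = wan_filter.update(params, "w", 0.01)   # small change, held back
second = wan_filter.update(params, "w", 0.02)  # still small, held back
third = wan_filter.update(params, "w", 0.2)    # aggregate now exceeds the cutoff
```

In this sketch the first two updates stay inside the data center and the third call releases the accumulated delta in a single message. A real system would also need a bound on how stale the held-back updates may become, which this example does not model.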