SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters
- Hanyu Zhao,
- Zhenhua Han,
- Zhi Yang,
- Quanlu Zhang,
- Mingxia Li,
- Fan Yang,
- Qianxi Zhang,
- Binyang Li,
- Yuqing Yang,
- Lili Qiu,
- Lintao Zhang,
- Lidong Zhou
Deep learning training on cloud platforms usually follows the tradition of separating storage and computation. The training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster hosting the storage service. To alleviate the potential IO bottleneck, a training cluster usually leverages its local storage as a cache to reduce remote IO to the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs, which can degrade scheduling quality significantly.
To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystems for deep learning training. SiloD treats cache and remote IO as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The SiloD-enhanced performance estimator leverages the unique data access pattern of deep learning training to develop a closed-form analytic model that captures the diverse cache/remote-IO requirements of different training jobs. Evaluations show that SiloD improves the average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers that operate independently.
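To make the idea of a closed-form, access-pattern-aware performance estimator concrete, the sketch below is a minimal illustration, not SiloD's actual code: the function name, parameters, and exact formula are simplifying assumptions. It only relies on the property the abstract highlights, that deep learning training sweeps the whole dataset once per epoch in a randomly shuffled order, so a job's expected cache hit ratio is roughly the fraction of its dataset held in cache, and its end-to-end throughput is bounded by the slower of its compute speed and its data pipeline.

```python
# Illustrative sketch (not SiloD's implementation) of a closed-form
# throughput estimator for one deep learning training job. Assumption:
# every epoch reads the full dataset in a randomly shuffled order, so each
# sample hits the cache with probability cache_size / dataset_size, and
# cached reads are fast enough to never be the bottleneck.

def estimate_throughput(
    compute_tput: float,   # samples/s the job's GPUs can process in isolation
    dataset_size: float,   # total dataset size, GB
    cache_size: float,     # cache capacity allocated to this job, GB
    remote_bw: float,      # remote-IO bandwidth allocated to this job, GB/s
    sample_size: float,    # average size of one training sample, GB
) -> float:
    """Estimate end-to-end training throughput in samples/s."""
    # Uniform random access over the dataset: expected hit ratio is the
    # fraction of the dataset that fits in the allocated cache.
    hit_ratio = min(cache_size / dataset_size, 1.0)

    # Misses must be fetched over the remote-IO channel, which caps the
    # data pipeline's sustainable rate.
    if hit_ratio >= 1.0:
        data_tput = float("inf")  # fully cached: IO never bottlenecks
    else:
        data_tput = remote_bw / ((1.0 - hit_ratio) * sample_size)

    # The job runs at the slower of compute and data loading.
    return min(compute_tput, data_tput)


if __name__ == "__main__":
    # Hypothetical job: 1000 samples/s of compute, 200 GB of a 500 GB
    # dataset cached, 0.2 GB/s remote bandwidth, 1 MB samples.
    print(estimate_throughput(1000, 500, 200, 0.2, 0.001))  # ~333 samples/s
```

A model of this form is what lets a scheduler trade resources across jobs: a job whose compute is already the bottleneck gains nothing from extra cache, while an IO-bound job may benefit substantially, which is the diversity the abstract refers to.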