USENIX ATC 2024 best paper | How Microsoft is improving cloud AI infrastructure reliability

October 21, 2024


As cloud AI workloads grow in complexity and scale, maintaining high system reliability has become crucial. Traditional reliability techniques, such as hardware redundancy, inadvertently introduce a new problem: subtle performance degradation, also known as “gray failures”. A gray failure arises when redundant components fail one by one. In the early stages the resulting performance decline is small and hard for system administrators to notice; once the redundant components have failed completely, performance drops significantly. Because the degradation builds up unnoticed, identifying and resolving the underlying fault becomes much harder.

Traditional approaches to reliability assurance are largely reactive: they troubleshoot incidents after they occur, which does not address gray failures effectively. Recognizing that reactive troubleshooting alone is insufficient, researchers from Microsoft Research Asia and engineers from Microsoft Azure developed SuperBench, a proactive validation system for cloud AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. The SuperBench paper was accepted at USENIX ATC 2024, a leading academic conference in computer systems, and won the best paper award.

Publication: https://www.microsoft.com/en-us/research/publication/superbench/

Project: https://aka.ms/superbench

GitHub: https://github.com/microsoft/superbenchmark

The design concept of SuperBench focuses on proactive validation rather than reactive response. It is capable of detecting and addressing potential issues before significant performance degradation occurs in the system. This approach not only enhances system reliability but also reduces maintenance costs and performance problems experienced by users.

To effectively maximize the MTBI (mean time between incidents), proactive validation should satisfy the following requirements: (i) Comprehensive: Validation must cover a wide range of AI workloads so that it detects incidents that vendors miss in new clusters and that only surface in customer workloads. (ii) Clear-cut: Because hardware components can degrade gradually and measurements are subject to natural variance, there must be a clear-cut boundary between defective and normal performance. Repetitions of the same test should yield consistent results rather than fluctuating between outcomes. (iii) Cost-efficient: Proactive validation requires additional measurements, which consume time. It must therefore keep validation expenses significantly lower than the penalties associated with incidents.

Nevertheless, addressing these requirements presents significant challenges. Firstly, the sheer number of workloads and exponential node combinations result in an immense search space for all scenarios, making it impossible to encompass every aspect in the validation process. Secondly, once performance is measured, there is no ground truth available for defective components. Identifying which components are defective is problematic, as hardware specifications cannot reliably predict workload performance. Moreover, AI hardware often exhibits substantial variations, further complicating the differentiation process. Lastly, the validation time and MTBI can be interdependent, as fewer validated components lead to shorter times between incidents. Determining when to validate which components for optimal cost-efficiency, while achieving the longest MTBI with the least measurement time, proves to be a challenging endeavor.
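The last challenge, trading validation time against incident cost, can be made concrete with a small sketch. This is not the probability model used in SuperBench; it is a simple greedy heuristic under stated assumptions: each benchmark has a known duration and an independent, assumed probability of exposing a defect, and an incident has a fixed cost in node-minutes. The names `Benchmark`, `select_benchmarks`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    minutes: float   # validation time this benchmark consumes (hypothetical)
    p_catch: float   # assumed chance it exposes a defect that would cause an incident


def select_benchmarks(candidates, incident_cost_minutes, budget_minutes):
    """Greedily pick benchmarks whose expected saving exceeds their own cost.

    Assumes benchmarks catch defects independently of one another.
    """
    chosen = []
    remaining = budget_minutes
    p_missed = 1.0  # probability a defect escapes every benchmark chosen so far
    pool = list(candidates)
    while pool:
        # Consider the benchmark with the best detection probability per minute.
        best = max(pool, key=lambda b: b.p_catch / b.minutes)
        pool.remove(best)
        # Expected incident cost avoided by adding this benchmark.
        gain = p_missed * best.p_catch * incident_cost_minutes
        if best.minutes <= remaining and gain > best.minutes:
            chosen.append(best)
            remaining -= best.minutes
            p_missed *= 1.0 - best.p_catch
    return chosen


# Hypothetical candidate set; none of these values come from the paper.
candidates = [
    Benchmark("gemm", minutes=5, p_catch=0.30),
    Benchmark("allreduce", minutes=10, p_catch=0.40),
    Benchmark("e2e-train", minutes=60, p_catch=0.50),
    Benchmark("disk", minutes=20, p_catch=0.01),
]
selected = select_benchmarks(candidates, incident_cost_minutes=100, budget_minutes=30)
selected_names = [b.name for b in selected]
```

With these made-up inputs the long end-to-end run does not fit the 30-minute budget and the disk test is not worth its time, so only the two short benchmarks are kept. A real policy would also have to decide when to validate, which this sketch ignores.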

At the heart of SuperBench is its comprehensive benchmark suite, which assesses both individual hardware components and a wide range of real AI workloads. This approach ensures that the system can detect issues that might otherwise remain hidden during normal operation. Key features of SuperBench include:

Comprehensive Benchmark Suite: SuperBench incorporates end-to-end benchmarks for typical AI workloads and micro-benchmarks for individual hardware components. This holistic approach allows for thorough testing and early detection of potential issues.
Selector Module: Designed to optimize validation efforts, the Selector balances the trade-off between validation time and incident-related costs. It employs a probability model to determine the most effective subset of benchmarks to run, ensuring that validation is both efficient and impactful.
Validator Module: This component uses advanced machine learning techniques to analyze benchmark data and pinpoint defective hardware with precision. By focusing on cumulative distribution metrics rather than average values, SuperBench can clearly differentiate between functional and malfunctioning components.
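To see why comparing cumulative distributions can separate nodes that an average cannot, consider the following minimal sketch. It uses the maximum gap between two empirical CDFs (the two-sample Kolmogorov-Smirnov statistic) as a stand-in distance; this is an assumption for illustration, not necessarily the metric the Validator uses. The throughput samples and the function names `ecdf` and `cdf_gap` are hypothetical.

```python
from bisect import bisect_right
from statistics import mean


def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


def cdf_gap(sample_a, sample_b):
    """Largest vertical distance between the two empirical CDFs."""
    fa, fb = ecdf(sample_a), ecdf(sample_b)
    # Both CDFs are step functions, so the maximum gap occurs at a sample point.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(fa(x) - fb(x)) for x in points)


# Hypothetical repeated throughput measurements for two nodes.
healthy = [100, 101, 99, 100, 100, 101, 99, 100]
suspect = [110, 90, 111, 89, 110, 90, 111, 89]  # same mean, unstable results

same_mean = mean(healthy) == mean(suspect)
gap = cdf_gap(healthy, suspect)
```

Both samples average exactly 100, so a mean-based check would treat the nodes as identical, while the CDF gap of 0.5 shows that half of the suspect node's measurements fall below anything the healthy node produced.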

Figure 1: SuperBench system architecture and an example workflow

The effectiveness of SuperBench is underscored by its successful deployment in Azure’s production environment over the past two years. During this period, SuperBench has validated hundreds of thousands of GPUs, identifying 10.36% of nodes as defective and significantly improving system reliability. In simulation, SuperBench increases MTBI to 22.61x that of running no validation and to 1.11x that of full-set validation without benchmark selection. It also increases user GPU hours to 4.81x while reducing validation time cost by 92.07%.

Figure 2: Simulated average node utilization over 30 days under different benchmark selection policies. The SuperBench Selector achieves a cluster utilization of 90.70%, which is 4.81x the no-validation baseline and 1.09x the full-set baseline.

The introduction of SuperBench marks a significant advancement in proactive system validation. By addressing the challenges of gray failures and improving the reliability of cloud AI infrastructure, SuperBench not only enhances system performance but also contributes to cost savings and operational efficiency. The open-sourced benchmarks, available on GitHub, have already been adopted by AI hardware vendors, further broadening the impact of this innovative solution.

Microsoft Research Lab – Asia
