AIOps Innovations of Incident Management for Cloud Services
While remarkable advances have been achieved in cloud computing infrastructure, the way incidents (unplanned interruptions or outages of a service/product) are managed needs to be as agile and dynamic as the cloud itself. In practice, incident management is conducted by analysing the huge amount of monitoring data collected at service runtime. Given its data-driven nature, we deem AIOps innovations essential to empowering cloud systems to provide more reliable online services and applications by incorporating more intelligence into the entire incident management workflow. This paper presents a project showcase of our AIOps practices towards these goals at Microsoft. First, we briefly describe the incident management procedure and its corresponding real-world challenges. Then, we elaborate on the ML & AI techniques used to mitigate these challenges and share some application results to demonstrate the intelligence and benefits brought to Microsoft service products.