Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing
- Jiaqi Gao
- Nofel Yaseen
- Robert MacDavid
- Felipe Vieira Frujeri
- Vincent Liu
- Ricardo Bianchini
- Ramaswamy Aditya
- Xiaohang Wang
- Henry Lee
- Dave Maltz
- Minlan Yu
- Behnaz Arzani
SIGCOMM | Organized by ACM
Incident routing is critical for maintaining service level objectives in the cloud: the time-to-diagnosis can increase by 10x due to mis-routings. Properly routing incidents is challenging because of the complexity of today’s data center (DC) applications and their dependencies. For instance, an application running on a VM might rely on a functioning host server, remote storage service, and virtual and physical network components. It is hard for any one team, rule-based system, or even machine learning solution to fully learn this complexity and solve the incident routing problem. We propose a different approach using per-team Scouts. Each team’s Scout acts as its gatekeeper – it routes relevant incidents to the team and routes away unrelated ones. We solve the problem through a collection of these Scouts. Our PhyNet Scout alone – currently deployed in production – reduces the time-to-mitigation of 65% of mis-routed incidents in our dataset.
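The gatekeeper idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Incident` type, the `keyword_scout` helper, the `route` function, and the team names and keywords are hypothetical, and simple keyword matching stands in for whatever decision logic a real Scout uses, which this abstract does not describe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Incident:
    """Hypothetical incident record; real incidents carry far more context."""
    incident_id: str
    description: str


# In this sketch a Scout is any function that answers one question:
# "does this incident belong to my team?"
Scout = Callable[[Incident], bool]


def keyword_scout(keywords: List[str]) -> Scout:
    """Build a toy Scout that claims an incident if any keyword appears.

    Keyword matching is a placeholder chosen for brevity, not the
    method used by the deployed PhyNet Scout.
    """
    lowered = [k.lower() for k in keywords]

    def scout(incident: Incident) -> bool:
        text = incident.description.lower()
        return any(k in text for k in lowered)

    return scout


def route(incident: Incident, scouts: Dict[str, Scout]) -> List[str]:
    """Ask every team's Scout independently; return the teams that claim it."""
    return [team for team, scout in scouts.items() if scout(incident)]


# Hypothetical collection of per-team Scouts.
scouts = {
    "PhyNet": keyword_scout(["switch", "packet loss"]),
    "Storage": keyword_scout(["disk", "storage latency"]),
}

incident = Incident("inc-1", "VM reports packet loss on a top-of-rack switch")
print(route(incident, scouts))
```

Each Scout decides only for its own team, so a team can add or change its Scout without touching the others. An incident that no Scout claims comes back as an empty list and would need some fallback routing.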