Unsupervised Latent Faults Detection in Data Centers
This talk will review our ongoing work on unsupervised latent fault detection in large-scale data centers, such as those used for cloud services, supercomputers, and compute clusters.
Modern data centers comprise hundreds or thousands of machines (or more!). With so many machines, failures are commonplace, so failure detection is crucial: undetected failures may lead to data loss and outages. Traditional fault detection techniques are often supervised: they rely on domain knowledge and precious (often unavailable) training data, and they are inflexible. More recent approaches focus on early detection and handling of performance problems, or latent faults. These faults "fly under the radar" of existing detection systems because they are not acute enough or were not anticipated by maintenance engineers.
We will first discuss unsupervised latent fault detection in scale-out, load-balanced cloud services. We present a novel framework for statistical latent fault detection that uses only ordinary machine counters collected as standard practice, and demonstrate three detection methods within this framework. The derived tests are adaptive, domain-independent, and unsupervised; they require neither background information nor tuning, and they scale to very large services. We prove strong guarantees on the false positive rates of our tests. Our evaluation on a large, real-world production service shows that at least 20% of machine or software failures were preceded by such latent faults. We further show that our latent fault detector can anticipate failures up to 14 days ahead, with high precision and a very low false positive rate.
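The abstract does not describe the three detection methods, so the following is only a minimal sketch of the general idea it states: in a load-balanced service, machines should behave alike, so a machine whose counters stay far from those of its peers is suspect. The function names, the median-based scoring, and the `factor` threshold are all assumptions made for illustration. Unlike the tests described above, this sketch has a tunable threshold and carries no guarantee on its false positive rate.

```python
from statistics import median


def latent_fault_scores(counters):
    """counters[t][m][c] is the value of counter c on machine m at time t.

    Returns one score per machine: its average scaled distance from the
    per-counter median of all machines, taken over all time points.
    (Illustrative scoring only, not the method from the talk.)
    """
    n_machines = len(counters[0])
    totals = [0.0] * n_machines
    for snapshot in counters:
        n_counters = len(snapshot[0])
        # Per-counter median across machines: the "typical" machine right now.
        med = [median(row[c] for row in snapshot) for c in range(n_counters)]
        # Median absolute deviation puts counters on a common scale;
        # fall back to 1.0 when a counter does not vary at all.
        mad = [median(abs(row[c] - med[c]) for row in snapshot) or 1.0
               for c in range(n_counters)]
        for m, row in enumerate(snapshot):
            totals[m] += sum(abs(row[c] - med[c]) / mad[c]
                             for c in range(n_counters)) / n_counters
    return [total / len(counters) for total in totals]


def flag_suspects(scores, factor=3.0):
    """Flag machines scoring above median + factor * MAD (arbitrary cutoff)."""
    typical = median(scores)
    spread = median(abs(s - typical) for s in scores)
    return [m for m, s in enumerate(scores) if s > typical + factor * spread]
```

Averaging over many time points is what lets a sketch like this pick up faults that are never acute at any single moment: a machine that is only slightly off, but consistently so, accumulates a higher score than peers whose deviations are random.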
The second part of the talk will briefly present a communication-efficient variant designed for online outlier detection in distributed data streams. Our offline framework has large bandwidth and processing requirements. Using stream processing techniques that trade accuracy for communication and computation, we present an adapted latent fault detector that can reduce bandwidth costs by an order of magnitude with less than 1% error compared to the original algorithm.
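The abstract does not say which stream processing techniques are used. As a hedged illustration of the accuracy-for-communication trade-off, the sketch below shows one common pattern: each machine transmits a counter value only when it has drifted more than a tolerance from the last value it sent, so the coordinator's view is never off by more than that tolerance. The class name `ThrottledReporter` and its `tolerance` parameter are hypothetical and not taken from the talk.

```python
class ThrottledReporter:
    """Send a counter value only when it drifts past `tolerance`.

    Illustrative only: a simple bounded-error reporting rule, not the
    algorithm presented in the talk.
    """

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.last_sent = None  # what the coordinator currently believes
        self.messages = 0      # number of transmissions so far

    def observe(self, value):
        """Return the value to transmit, or None to stay silent."""
        if self.last_sent is None or abs(value - self.last_sent) > self.tolerance:
            self.last_sent = value
            self.messages += 1
            return value
        return None


# Usage: eight readings, only three of which need to be transmitted.
reporter = ThrottledReporter(tolerance=0.5)
stream = [10.0, 10.2, 9.9, 10.4, 12.0, 12.1, 11.8, 10.0]
sent = [reporter.observe(v) for v in stream]
```

A larger tolerance means fewer messages but a coarser view at the coordinator, which is the same kind of trade-off the paragraph above describes.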
Finally, we’ll discuss current work that addresses latent fault detection for unbalanced workloads, such as map-reduce jobs and compute clusters.
One new scheme, based on Principal Component Analysis, retains the advantages of our previous methods: it is unsupervised, robust to changes, and statistically sound. Preliminary evaluation on supercomputer logs shows that the new method correctly predicts some failures, while our previous methods fail completely in this setting. We also present a preliminary evaluation showing good performance on virtual machines running Hadoop and CassandraDB. Time permitting, we’ll also touch on another scheme for opaque VMs, based on a sparse decomposition approach.
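The abstract does not explain how Principal Component Analysis is applied. One plausible reading, sketched here purely as an assumption, is residual-based scoring: when workloads are unbalanced, machines differ in load, but their counters still tend to move together along a few dominant directions, so a machine that sits far from that subspace stands out. The sketch keeps only the first principal component, found by power iteration; the function names are hypothetical.

```python
import math


def top_component(rows, iterations=200):
    """Return (mean, v), where v approximates the first principal component."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Asymmetric start vector, to make it unlikely that we begin exactly
    # orthogonal to the dominant direction.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        # Apply the (unnormalized) covariance matrix as X^T (X v).
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    norm = math.sqrt(sum(x * x for x in v))
    return mean, [x / norm for x in v]


def residual_norms(rows, mean, v):
    """Distance of each row from the line through `mean` with direction v."""
    out = []
    for r in rows:
        x = [a - m for a, m in zip(r, mean)]
        p = sum(a * b for a, b in zip(x, v))  # coordinate along v
        out.append(math.sqrt(sum((a - p * b) ** 2 for a, b in zip(x, v))))
    return out


# Usage: five machines under different loads whose two counters scale
# together, plus a sixth machine that breaks the pattern.
machines = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10], [3, 9]]
mean, direction = top_component(machines)
residuals = residual_norms(machines, mean, direction)
suspect = max(range(len(machines)), key=lambda m: residuals[m])
```

Note that the first five machines have very different raw counter values, so a comparison against a single "typical" machine would not work here; the residual only measures departure from the shared pattern.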
Speaker Bios
Moshe Gabel is a Ph.D. candidate in the Computer Science department at the Technion – Israel Institute of Technology. His research interests include machine learning and data mining in distributed settings with applications to systems research, such as monitoring the health of cloud data centers and other large distributed systems. He also works on learning and monitoring models of large, distributed data streams.
- Series:
- Microsoft Research Talks
- Date:
- Speakers:
- Moshe Gabel
- Affiliation:
- Technion
Jeff Running