Error Diagnosis and Data Profiling with Data X-Ray
- Xiaolan Wang
- Mary Feng
- Yue Wang
- Xin Luna Dong
- Alexandra Meliou
International Conference on Very Large Data Bases (PVLDB), Vol 8: pp. 1984-1987
The problem of identifying and repairing data errors has been an area of persistent focus in data management research. However, while traditional data cleaning techniques can be effective at identifying several data discrepancies, they disregard the fact that many errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the root cause is identified and corrected.
In this demonstration, we will present a large-scale diagnostic framework called DataXRay. Like a medical X-ray that aids the diagnosis of medical conditions by revealing problems underneath the surface, DataXRay reveals hidden connections and common properties among data errors. Thus, in contrast to traditional cleaning methods, which treat the symptoms, our system investigates the underlying conditions that cause the errors.
The core of DataXRay combines an intuitive and principled cost model, derived from Bayesian analysis, with an efficient, highly parallelizable diagnostic algorithm that discovers common properties among erroneous data elements in a top-down fashion. Our system has a simple interface that allows users to load different datasets, interactively adjust key diagnostic parameters, explore the derived diagnoses, and compare them with solutions produced by alternative algorithms. Through this demonstration, participants will understand (1) the characteristics of good diagnoses, (2) how and why errors occur in real-world datasets, and (3) how error diagnosis differs from other related problems and approaches.
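To make the top-down idea concrete, the following is a minimal illustrative sketch and not the DataXRay implementation. It assumes each data element carries a dictionary of properties and an error flag, and it replaces the Bayesian cost model with a simple error-rate threshold. The function name `diagnose`, the parameter `min_error_rate`, and the sample data are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (properties, is_erroneous); this representation is an assumption for illustration.
Element = Tuple[Dict[str, str], bool]


def diagnose(
    elements: List[Element],
    dims: List[str],
    constraint: Optional[Dict[str, str]] = None,
    start: int = 0,
    min_error_rate: float = 0.8,
) -> List[Dict[str, str]]:
    """Top-down search for property combinations shared by erroneous elements.

    The threshold test below is a stand-in for a real cost model.
    """
    constraint = constraint or {}
    # Keep only the elements that satisfy the current set of property constraints.
    matched = [
        (props, err)
        for props, err in elements
        if all(props.get(k) == v for k, v in constraint.items())
    ]
    errors = sum(1 for _, err in matched if err)
    if errors == 0:
        return []  # nothing to explain under this constraint
    if errors / len(matched) >= min_error_rate:
        return [dict(constraint)]  # general enough: report and stop descending
    found: List[Dict[str, str]] = []
    # Refine by one more dimension; the fixed order avoids visiting a
    # constraint set twice.
    for i in range(start, len(dims)):
        dim = dims[i]
        for value in sorted({props[dim] for props, _ in matched}):
            found += diagnose(
                matched, dims, {**constraint, dim: value}, i + 1, min_error_rate
            )
    return found


# Hypothetical sample: every value from source A is wrong, and source C is
# wrong only for the "height" attribute.
sample: List[Element] = [
    ({"source": "A", "attr": "height"}, True),
    ({"source": "A", "attr": "weight"}, True),
    ({"source": "B", "attr": "height"}, False),
    ({"source": "B", "attr": "weight"}, False),
    ({"source": "C", "attr": "height"}, True),
    ({"source": "C", "attr": "weight"}, False),
]

diagnoses = diagnose(sample, ["source", "attr"])
```

On this sample the search reports the general cause (source A) rather than listing its two erroneous elements separately, and it descends to a narrower property combination only for source C. A real cost model would balance the generality of a diagnosis against its accuracy instead of using a fixed cutoff.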