YADING: Fast Clustering of Large-Scale Time Series Data

Rui Ding; Qiang Wang; Yingnong Dang; Qiang Fu; Haidong Zhang; Dongmei Zhang

August 2015

Published in VLDB 2015

Fast and scalable analysis techniques are becoming increasingly important in the era of big data, because they enable real-time and interactive experiences in data analysis. Time series are widely available in diverse application areas. Due to the large number of time series instances (e.g., millions) and the high dimensionality of each time series instance (e.g., thousands), it is challenging to conduct clustering on large-scale time series, and it is even more challenging to do so in real time to support interactive exploration.

In this paper, we propose a novel end-to-end time series clustering algorithm, YADING, which automatically clusters large-scale time series with fast performance and quality results. Specifically, YADING consists of three steps: sampling the input dataset, conducting clustering on the sampled dataset, and assigning the rest of the input data to the clusters generated on the sampled dataset. In particular, we provide theoretical proofs of the lower and upper bounds on the sample size, which not only guarantee YADING’s high performance, but also ensure distribution consistency between the input dataset and the sampled dataset. We also select the L1 norm as the similarity measure and the multi-density approach as the clustering method. With a theoretical bound, this selection ensures YADING’s robustness to time series variations due to phase perturbation and random noise.
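The three steps described above (sample, cluster the sample, assign the remainder) can be illustrated with a minimal Python sketch. It is not YADING's implementation: the function names, the `radius` parameter, and the fixed `sample_size` are hypothetical, and the clustering step uses a simple greedy leader clustering as a stand-in for the multi-density method. Only the use of L1 distance and the overall three-step structure come from the text.

```python
import random


def l1_distance(a, b):
    """L1 (Manhattan) distance between two equal-length series."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_sample(sample, radius):
    """Stand-in clustering step (assumption): greedy leader clustering.

    A series joins the first cluster whose leader lies within `radius`
    in L1 distance; otherwise it starts a new cluster. This is NOT the
    multi-density approach named in the abstract.
    """
    leaders, labels = [], []
    for series in sample:
        for cluster_id, leader in enumerate(leaders):
            if l1_distance(series, leader) <= radius:
                labels.append(cluster_id)
                break
        else:
            leaders.append(series)
            labels.append(len(leaders) - 1)
    return labels


def sample_cluster_assign(data, sample_size, radius, seed=0):
    """Sample the input, cluster the sample, then assign the rest."""
    rng = random.Random(seed)

    # Step 1: draw a random sample of the input dataset.
    sample_idx = rng.sample(range(len(data)), sample_size)
    sample = [data[i] for i in sample_idx]

    # Step 2: cluster only the sampled series.
    sample_labels = cluster_sample(sample, radius)

    # Step 3: give every unsampled series the label of its nearest
    # sampled series under L1 distance.
    labels = [None] * len(data)
    for i, label in zip(sample_idx, sample_labels):
        labels[i] = label
    for i, series in enumerate(data):
        if labels[i] is None:
            nearest = min(
                range(len(sample)),
                key=lambda j: l1_distance(series, sample[j]),
            )
            labels[i] = sample_labels[nearest]
    return labels


if __name__ == "__main__":
    toy = [[0, 0, 0], [0, 1, 0], [100, 100, 100], [101, 100, 100]]
    print(sample_cluster_assign(toy, sample_size=2, radius=5))
```

In this sketch the sample size is passed in by the caller; in the paper it is chosen from the proven lower and upper bounds, which the sketch does not reproduce.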

Evaluation results demonstrate that on typical-scale datasets (100,000 time series, each with 1,000 dimensions), YADING is about 40 times faster than the state-of-the-art sampling-based clustering algorithm DENCLUE 2.0, and about 1,000 times faster than DBSCAN and CLARANS. YADING has also been used by product teams at Microsoft to analyze service performance. Two such use cases are shared in this paper.