Efficient Exact Set-Similarity Joins

Arvind Arasu; Venky Ganti; Raghav Kaushik

Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB 2006 | August 2006

Published by Very Large Data Bases Endowment Inc.

Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: They are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.
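To make the SSJoin definition above concrete, here is a minimal brute-force sketch. It is not one of the algorithms proposed in the paper; it only illustrates what an exact answer is. The choice of Jaccard similarity, the threshold value, and the names `jaccard` and `naive_ssjoin` are assumptions made for illustration, since the abstract does not fix a similarity measure.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (assumed measure, not from the abstract)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def naive_ssjoin(left, right, threshold):
    """Return index pairs (i, j) with jaccard(left[i], right[j]) >= threshold.

    Compares every pair, so the result is exact, but the cost grows with
    len(left) * len(right). Hypothetical baseline for illustration only.
    """
    return [
        (i, j)
        for (i, r), (j, s) in product(enumerate(left), enumerate(right))
        if jaccard(r, s) >= threshold
    ]


left = [{"a", "b", "c"}, {"x", "y"}]
right = [{"a", "b", "d"}, {"x", "y"}, {"q"}]
pairs = naive_ssjoin(left, right, threshold=0.5)
```

With these inputs, `pairs` holds the two index pairs whose sets share at least half of their combined elements. The quadratic number of comparisons in this baseline is the cost that motivates more efficient exact algorithms.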