A Pattern Tree-based Approach to Learning URL Normalization Rules
- Tao Lei ,
- Rui Cai ,
- Jiang-Ming Yang ,
- Yan Ke ,
- Xiaodong Fan ,
- Lei Zhang
Proc. of the 19th International World Wide Web Conference (WWW 2010) |
Published by Association for Computing Machinery, Inc.
Duplicate URLs have brought serious troubles to the whole pipeline of a search engine, from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs to a canonical form using a set of rewrite rules. Nowadays URL normalization has attracted significant attention as it is lightweight and can be flexibly integrated into both the online (e.g. crawling) and the offline (e.g. index compression) parts of a search engine. To deal with a large scale of websites, automatic approaches are highly desired to learn rewrite rules for various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach, which is remarkably different from existing approaches. Most current approaches learn rewrite rules by iteratively inducing local duplicate pairs to more general forms, and inevitably suffer from noisy training data and are practically inefficient. Given a training set of URLs partitioned into duplicate clusters for a targeted website, we develop a simple yet efficient algorithm to automatically construct a URL pattern tree. With the pattern tree, the statistical information from all the training samples is leveraged to make the learning process more robust and reliable. The learning process is also accelerated as rules are directly summarized based on pattern tree nodes. In addition, from an engineering perspective, the pattern tree helps select deployable rules by removing conflicts and redundancies. An evaluation on more than 70 million duplicate URLs from 200 websites showed that the proposed approach achieves very promising performance, in terms of both de-duping effectiveness and computational efficiency.
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or [email protected]. The definitive version of this paper can be found at ACM's Digital Library --http://www.acm.org/dl/.