Collecting High Quality Overlapping Labels at Low Cost

Grace Hui Yang; Anton Mityagin; Krysta M. Svore; Sergey Markov

Proceedings of SIGIR | July 2010

Published by Association for Computing Machinery, Inc.

This paper studies the quality of human labels used to train search engine rankers. Our specific focus is the performance improvement obtained by using overlapping labels, that is, by collecting multiple human judgments for each training sample. The paper presents a new method of effectively and efficiently producing and using overlapping labels to improve data quality and search engine accuracy. It explores whether, when, and for which data points one should obtain multiple expert training labels, as well as what to do with them once they have been obtained. The proposed labeling scheme collects multiple overlapping labels only for a subset of training samples, specifically those labeled relevant by a single judge. Our experiments show that this labeling scheme improves the NDCG of both LambdaRank and LambdaMART rankers on several real-world Web test sets, with a low labeling overhead of around 1.4 labels per sample. Moreover, these NDCG improvements are at least as good as those obtained by collecting multiple overlapping labels on the entire data set. The scheme also outperforms several other methods of using overlapping labels, such as simple k-overlap, majority vote, and using the highest labels. Finally, the paper presents a study of how many overlapping labels are needed to get the best improvement in search engine retrieval accuracy.
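The selective labeling idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `judge` and `is_relevant` callables, and the `extra_labels` parameter are all hypothetical, and the abstract does not say how many extra judgments are requested or how "relevant" is defined on the grading scale.

```python
from typing import Callable, Dict, Hashable, List


def collect_selective_labels(
    samples: List[Hashable],
    judge: Callable[[Hashable], int],
    is_relevant: Callable[[int], bool],
    extra_labels: int = 2,
) -> Dict[Hashable, List[int]]:
    """Collect one label per sample, plus extra labels for relevant ones.

    `judge` stands in for asking a human judge for a grade (assumption).
    `is_relevant` decides whether the first grade counts as relevant (assumption).
    `extra_labels` is a made-up default; the source does not state a value.
    """
    labels: Dict[Hashable, List[int]] = {}
    for sample in samples:
        first = judge(sample)          # every sample gets a single judgment
        labels[sample] = [first]
        if is_relevant(first):         # only relevant samples get overlap
            for _ in range(extra_labels):
                labels[sample].append(judge(sample))
    return labels


def labels_per_sample(labels: Dict[Hashable, List[int]]) -> float:
    """Average number of labels collected per sample (the labeling overhead)."""
    if not labels:
        raise ValueError("no samples were labeled")
    return sum(len(v) for v in labels.values()) / len(labels)
```

With invented toy data (four samples, one of which the judge grades as relevant, and two extra judgments for it), the overhead works out to 6 labels over 4 samples, or 1.5 labels per sample. These numbers are purely illustrative and are not taken from the paper.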

Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or [email protected]. The definitive version of this paper can be found at ACM's Digital Library --http://www.acm.org/dl/.