Evaluating Retrieval System Effectiveness

One of the primary motivations for the Text REtrieval Conference (TREC) was to standardize retrieval system evaluation. While the Cranfield paradigm of using test collections to compare system output had been introduced decades before TREC began, the particulars of its implementation differed across researchers, making evaluation results incomparable. The validity of test collections as a research tool was in question, not only from those who objected to the reliance on relevance judgments, but also from those who were concerned about whether the approach could scale. With the notable exception of Sparck Jones and van Rijsbergen’s report on the need for larger, better test collections, there was little explicit discussion of what constituted a minimally acceptable experimental design and no hard evidence to support any position.

TREC has succeeded in standardizing and validating the use of test collections as a retrieval research tool. The repository of runs submitted to TREC against common collections has enabled empirical determination of the confidence that can be placed in the conclusion that one system is better than another, given a particular experimental design. In particular, the reliability of such a conclusion has been shown to depend critically on both the evaluation measure and the number of questions used in the experiment.

This talk summarizes the results of two more recent investigations based on the TREC data: the definition of a new measure, and evaluation methodologies that look beyond average effectiveness.
The new measure, named "bpref" for binary preferences, is as stable as existing measures but is much more robust in the face of incomplete relevance judgments, so it can be used in environments where complete judgments are not possible.
Using average effectiveness scores hampers failure analysis because the averages hide an enormous amount of variance, yet more focused evaluations are unstable precisely because of that variance.
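As a rough illustration, the following Python sketch computes bpref for a single topic. The function name, the data layout, and the min(R, N) normalization are assumptions based on the commonly cited formulation of the measure, not a definition taken from the talk itself.

    from typing import Dict, List

    def bpref(ranking: List[str], judgments: Dict[str, int]) -> float:
        """Sketch of bpref for one topic (assumed formulation).

        ranking   : system output, best document first.
        judgments : judged documents only; doc id -> 1 (relevant) or 0 (nonrelevant).
        Unjudged documents are simply skipped, which is what makes the
        measure robust to incomplete judgments.
        """
        R = sum(1 for v in judgments.values() if v == 1)   # judged relevant
        N = sum(1 for v in judgments.values() if v == 0)   # judged nonrelevant
        if R == 0:
            return 0.0
        denom = min(R, N) if N > 0 else R                  # normalization (assumed)

        score = 0.0
        nonrel_seen = 0          # judged-nonrelevant documents ranked so far
        for doc in ranking:
            label = judgments.get(doc)
            if label == 0:
                nonrel_seen += 1
            elif label == 1:
                # each relevant document is penalized by the number of
                # judged-nonrelevant documents ranked ahead of it,
                # capped so the contribution never goes negative
                score += 1.0 - min(nonrel_seen, denom) / denom
        return score / R

For example, bpref(["d3", "d1", "d7"], {"d1": 1, "d7": 0}) returns 1.0 in this sketch: the unjudged document d3 is ignored, and the only judged relevant document is ranked ahead of every judged nonrelevant one.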

Speaker Bio

Ellen Voorhees is manager of the Retrieval Group in the Information Access Division of the National Institute of Standards and Technology (NIST). The Retrieval Group is home to the Text REtrieval Conference, TRECVid (evaluation of content-based access to digital video), and the new Text Analysis Conference (TAC, an evaluation conference for the natural language processing community). Prior to joining NIST in 1996, she was a senior research scientist at Siemens Corporate Research in Princeton, NJ. She received her PhD from Cornell University where she studied under Gerard Salton.

Date:
Speaker:
Ellen Voorhees
Affiliation:
National Institute of Standards and Technology