Data-driven evaluation metrics for heterogeneous search engine result pages
- Leif Azzopardi
- Ryen W. White
- Paul Thomas
- Nick Craswell
Proceedings of the Conference on Human Information Interaction and Retrieval
Organized by ACM
Evaluation metrics for search typically assume items are homogeneous. However, in the context of web search, this assumption does not hold. Modern search engine result pages (SERPs) are composed of a variety of item types (e.g., news, web, entity), and the influence of these item types on browsing behavior is largely unknown.
In this paper, we perform a large-scale empirical analysis of popular web search queries and investigate how different item types influence how people interact with SERPs. We then infer a user browsing model given people’s interactions with SERP items – creating a data-driven metric based on item type. We show that the proposed metric leads to more accurate estimates of: (1) total gain, (2) total time spent, and (3) stopping depth – without requiring extensive parameter tuning or a priori relevance information. These results suggest that item heterogeneity should be accounted for when developing metrics for SERPs. While many open questions remain concerning the applicability and generalizability of data-driven metrics, they do serve as a formal mechanism to link observed user behaviors directly to how performance is measured. From this approach, we can draw new insights regarding the relationship between behavior and performance – and design data-driven metrics based on real user behavior rather than using metrics reliant on some hypothesized model of user browsing behavior.
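To make the idea of a data-driven, item-type-aware metric concrete, the following is a minimal illustrative sketch and not the method of the paper. It assumes a simple cascade-style browsing model in which the probability of continuing past an item depends only on that item's type, and it estimates those probabilities from hypothetical logged sessions. The function names (`estimate_continuation`, `expected_gain`), the session format, and the gain values are all assumptions made for illustration.

```python
from collections import defaultdict


def estimate_continuation(sessions):
    """Estimate P(continue | item type) from logged sessions.

    Assumption: each session is a list of item types in rank order,
    and the last entry is the item at which the user stopped.
    """
    viewed = defaultdict(int)
    continued = defaultdict(int)
    for session in sessions:
        for rank, item_type in enumerate(session):
            viewed[item_type] += 1
            # Every item except the last one was followed by another view.
            if rank < len(session) - 1:
                continued[item_type] += 1
    return {t: continued[t] / viewed[t] for t in viewed}


def expected_gain(serp_types, gains, continuation):
    """Expected total gain: sum over ranks of gain * P(user views that rank)."""
    p_view = 1.0  # the first item is assumed to be always viewed
    total = 0.0
    for item_type, gain in zip(serp_types, gains):
        total += p_view * gain
        # Probability of reaching the next rank depends on this item's type.
        p_view *= continuation[item_type]
    return total


# Hypothetical logged sessions (item types viewed, in rank order).
sessions = [
    ["web", "news", "web"],
    ["web", "news"],
    ["web"],
    ["news", "web"],
]
continuation = estimate_continuation(sessions)

# Score a hypothetical SERP with made-up per-item gains.
score = expected_gain(["news", "web", "web"], [1.0, 0.5, 1.0], continuation)
print(continuation, score)
```

Under this assumed model, the same continuation probabilities could also be used to derive an expected stopping depth, which is one way a single browsing model can link observed behavior to several measured quantities.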