Web crawl scheduling

A web crawler is an essential part of a search engine that visits URLs and downloads their content, which is subsequently processed, indexed, and used by the search engine to decide which results to show the users. As the Web is becoming increasingly dynamic, in addition to discovering new web pages a crawler needs to keep revisiting those already in the search engine’s index, in order to keep the index fresh by picking up the pages’ changed content. The problem of crawl scheduling consists in determining when to (re-)download which pages’ content so as to ensure that the search engine has complete and up-to-date view of the Web.

In this project, we aim to address the challenges of crawl scheduling as they are manifested in major commercial search engines such as Microsoft Bing. One such challenge is partial change observability: for most URLs, a search engine finds out whether they have changed only when it crawls them. To guess when the changes happen, and hence should be downloaded, a crawler needs a predictive model whose parameters are initially unknown. Crawlers need to learn these models while optimizing a freshness-related and information-completeness-related objectives when scheduling crawls, facing a phenomenon known as xploration-exploitation tradeoff. For some web pages, however, sitemap polling and other means can provide trustworthy near-instantaneous signals that the page has changed in a meaningful way, though not what the change is exactly. Even when these signals are available, crawl scheduling remains highly nontrivial because the crawler cannot react to every individual predicted or actual change. Its own infrastructure as well as hosts where web pages are located impose bandwidth constraints on the average daily number of crawls, which is usually just a fraction of the change event volume. Last but not least, Bing and other major search engines track many billions of pages with vastly different importance and change frequency characteristics. The sheer size of this constrained learning and optimization problem makes low-polynomial algorithms for it a must.

This effort is conducted in close collaboration with Bing’s Web Platform team.

Check out the project’s code and datasets on GitHub

人员

Andrey Kolobov的肖像

Andrey Kolobov

Principal Research Manager

Cheng Lu的肖像

Cheng Lu

Principal Software Engineering Manager

Microsoft Bing

Eric Horvitz的肖像

Eric Horvitz

Chief Scientific Officer