Tracking COVID-19 Using Online Search
- Vasileios Lampos
- Simon Moura
- Elad Yom-Tov
- Michael Edelstein
- Maimuna Majumder
- Rachel A. McKendry
- Ingemar J. Cox
ArXiv preprint
Disclaimer: The current version considers data up to and including April 2, 2020. The methods and results presented in this working paper should be considered as ongoing. The approach and the presented outcomes require further cross-checking and development. We would not normally publish work in progress, but we do so to potentially assist and collaborate with other groups supporting the response to COVID-19. To this end, we envision that an updated report will be uploaded at least weekly. Because of its instant turnaround, the most up-to-date version of this report can be found at github.com/vlampos/covid-19-online-search.
Introduction
Online search data is routinely used to monitor the prevalence of infectious diseases, such as influenza [1–4]. Previous work has focused on supervised learning solutions, where ground truth data, in the form of historical syndromic surveillance reports, can be used to train machine learning models. However, data that are sufficient in terms of accuracy and time span do not yet exist for applying such approaches to the emerging COVID-19 pandemic, which is caused by a novel coronavirus (SARS-CoV-2). Therefore, unsupervised or semi-supervised solutions should be sought. Recent outcomes have shown that it is possible to transfer an online search based model for influenza-like illness (ILI) from a source country to a target country without using ground truth data for the target location [5]. The transferred model's accuracy depends on choosing the search queries and their corresponding weights for the target location wisely, via a transfer learning methodology. In this work, we draw a parallel to previous findings and attempt to develop an unsupervised model for COVID-19 by: (i) carefully choosing search queries that refer to related symptoms as identified by a survey from the National Health Service (NHS) in the United Kingdom (UK), and (ii) weighting them based on their reported ratio of occurrence in people infected by COVID-19. Furthermore, understanding that online searches may also be driven by concern rather than by infections, we devise a preliminary approach that attempts to minimise this part of the signal by incorporating a basic news media coverage metric in association with confirmed COVID-19 cases. Finally, we propose a transfer learning method for mapping supervised COVID-19 models from one country to another, in an effort to transfer knowledge from areas where the disease has progressed further. Results are presented for the UK, England, the United States of America (US), Canada, Australia, France, Italy, and Greece.
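To make steps (i) and (ii) concrete, the following is a minimal sketch of how symptom-related query frequencies could be combined into a single score using occurrence-based weights. It is not the authors' implementation. The function name `weighted_symptom_score`, the min-max scaling step, and all numbers in the usage example are assumptions made purely for illustration.

```python
import numpy as np


def weighted_symptom_score(freqs, weights):
    """Combine per-symptom query frequencies into one score per day.

    freqs:   array-like of shape (days, symptoms); each column is the
             search frequency time series for one symptom category.
    weights: array-like of shape (symptoms,); e.g. the reported ratio of
             occurrence of each symptom among infected people.
    """
    freqs = np.asarray(freqs, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Assumption: min-max scale each column to [0, 1] so that a single
    # high-volume query does not dominate. Constant columns map to 0.
    col_min = freqs.min(axis=0)
    span = freqs.max(axis=0) - col_min
    span[span == 0] = 1.0
    scaled = (freqs - col_min) / span

    # Weighted average across symptom categories.
    return scaled @ weights / weights.sum()


# Made-up illustrative values: three days, two hypothetical symptom categories.
example_freqs = [[0.0, 10.0], [5.0, 10.0], [10.0, 10.0]]
example_weights = [3.0, 1.0]
example_scores = weighted_symptom_score(example_freqs, example_weights)
```

In this sketch a symptom with a higher reported occurrence ratio contributes proportionally more to the daily score, which mirrors the weighting idea described above. Any correction for concern-driven searches would need to be applied separately.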