Neighborhood level chronic respiratory disease prevalence estimation using search query data
- Nabeel Abdur Rehman
- Scott Counts
PLoS ONE, Vol 16(6)
Estimating disease prevalence at the sub-city, neighborhood scale allows early and targeted interventions that can help save lives and reduce public health burdens. However, the cost-prohibitive nature of highly localized data collection and the sparsity of representative signals have made it challenging to identify disease prevalence at the neighborhood scale. To overcome this challenge, we use alternative data sources that are both less sparse and representative of localized disease prevalence: using query data from a large commercial search engine, we estimate the prevalence of respiratory illness in the United States at census tract granularity. Focusing on asthma and Chronic Obstructive Pulmonary Disease (COPD), we construct a set of features based on searches for symptoms, medications, and disease-related information, and use these to estimate illness rates in more than 23 thousand tracts in 500 cities across the United States. Out-of-sample model estimates from search data alone correlate with ground-truth illness rate estimates from the CDC at 0.69 to 0.76, and simple additions to these models raise those correlations to as high as 0.84. We then show that, in practice, search query data can be combined with other relevant data, such as census or land cover data, to boost results: models that incorporate all data sources correlate with ground-truth data at 0.91 for asthma and 0.88 for COPD.
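The evaluation described above, correlating out-of-sample estimates with ground-truth rates, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the single feature `search_rate` is a hypothetical stand-in for the search-derived features, and the one-variable linear fit with interleaved k-fold splitting is a placeholder, not the model or validation scheme used in the paper.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def out_of_sample_predictions(xs, ys, k=5):
    """Predict each point with a model fitted only on the other folds."""
    n = len(xs)
    preds = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = slope * xs[i] + intercept
    return preds


# Synthetic "tracts" (assumption): one search-derived feature per tract and
# a noisy prevalence value that depends on it linearly.
rng = random.Random(0)
search_rate = [rng.uniform(0.0, 1.0) for _ in range(200)]
prevalence = [8.0 + 4.0 * x + rng.gauss(0.0, 1.0) for x in search_rate]

preds = out_of_sample_predictions(search_rate, prevalence)
r = pearson(preds, prevalence)
print(f"out-of-sample Pearson r = {r:.2f}")
```

Because every prediction comes from a model that never saw that tract, `r` measures how well the feature generalizes rather than how well it fits the training data.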