Screening for Lung Carcinoma with Web Search Data
MSR-TR-2016-51
Importance: We present a statistical model that predicts the future appearance of lung carcinoma diagnoses by analyzing large-scale anonymized logs of Web search queries input by millions of people across the United States.
Objective: To evaluate the technical feasibility of performing background screening and alerting for patients at risk of lung carcinoma via analysis of online search behavior.
Design: We identify people who issue special queries, referred to here as landmark queries, that provide strong evidence of a recent diagnosis of lung carcinoma. We then consider patterns of symptomatology, expressed as searches about concerning symptoms, over the several months preceding the landmark queries. We build statistical classifiers that predict the future appearance of landmark queries from these search log signals.
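The labeling step described in the Design can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout, the `is_landmark` and `is_symptom` predicates, the example phrases, and the 180-day window are all hypothetical and chosen only to make the idea concrete.

```python
from datetime import date, timedelta

# Hypothetical phrase lists; the abstract does not enumerate the real patterns.
LANDMARK_PHRASES = ("i was just diagnosed with lung cancer",)
SYMPTOM_TERMS = ("cough", "chest pain", "hoarseness")


def is_landmark(query):
    """True if the query looks like first-person evidence of a diagnosis."""
    q = query.lower()
    return any(phrase in q for phrase in LANDMARK_PHRASES)


def is_symptom(query):
    """True if the query mentions one of the (assumed) symptom terms."""
    q = query.lower()
    return any(term in q for term in SYMPTOM_TERMS)


def label_searcher(history, lead_days=0, window_days=180):
    """Label one searcher and collect the symptom queries used as signals.

    history: list of (date, query) pairs for a single anonymized searcher.
    lead_days: how far before the landmark query the observation window ends.
    window_days: length of the observation window (an assumed value).
    Returns (label, symptom_queries); label is 1 if a landmark query exists.
    """
    if not history:
        return 0, []
    history = sorted(history)
    landmark_dates = [d for d, q in history if is_landmark(q)]
    if landmark_dates:
        label = 1
        # Stop observing lead_days before the first landmark query.
        end = landmark_dates[0] - timedelta(days=lead_days)
    else:
        label = 0
        # Negative case: observe up to and including the last logged day.
        end = history[-1][0] + timedelta(days=1)
    start = end - timedelta(days=window_days)
    symptoms = [q for d, q in history if start <= d < end and is_symptom(q)]
    return label, symptoms


example = [
    (date(2015, 1, 10), "persistent cough at night"),
    (date(2015, 3, 1), "chest pain when breathing"),
    (date(2015, 6, 1), "I was just diagnosed with lung cancer"),
    (date(2015, 6, 5), "lung cancer treatment options"),
]
label, signals = label_searcher(example, lead_days=120)
```

Raising `lead_days` shortens the evidence available to a classifier, which is one way to study performance at different lead times.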
Setting: Retrospective log analysis of the online activity of millions of Web searchers seeking health-related information online.
Participants: Web searchers who query for symptoms related to lung carcinoma. Some (n=5,443) later issue queries that provide strong evidence of recent clinical diagnosis of lung carcinoma and are regarded as positive cases in our analysis. Additional evidence on the reliability of these queries as representing clinical diagnoses is based on the significant increase in follow-on searches for treatments and medications for these searchers and on the correlation between lung carcinoma incidence rates and our log-based statistics. The remaining symptom searchers (n=4,808,542) are regarded as negative cases.
Main Outcome Measure(s): Classification performance of our statistical model for early detection from online search behavior, for different lead times, different sets of signals, and different cohorts of searchers stratified by potential risk.
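One common way to summarize classification performance for a rare outcome is the true-positive rate achieved at a fixed, very low false-positive rate. The sketch below shows that calculation for a list of classifier scores. The function name `tpr_at_fpr` and the toy scores are assumptions for illustration; the abstract does not specify how the operating points were computed.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Return (tpr, fpr) at the most permissive threshold with fpr <= max_fpr.

    scores: classifier outputs, higher means more likely positive.
    labels: 1 for positive cases, 0 for negative cases.
    A searcher is flagged when their score is strictly above the threshold.
    """
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative case")

    # Number of negatives we are allowed to flag at this false-positive rate.
    allowed_fp = int(max_fpr * len(neg))
    if allowed_fp >= len(neg):
        threshold = float("-inf")  # every case may be flagged
    else:
        # Flag only scores above the next-highest negative score, so tied
        # scores can never push the false-positive count past the budget.
        threshold = neg[allowed_fp]

    tp = sum(s > threshold for s in pos)
    fp = sum(s > threshold for s in neg)
    return tp / len(pos), fp / len(neg)


toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
tpr, fpr = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.25)
```

Repeating this calculation per lead time, signal set, or risk cohort gives one operating point for each configuration being compared.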
Results: The statistical classifier performs strongly in identifying searchers who will later input queries consistent with a lung carcinoma diagnosis, with true-positive rates of 3% to 32% at false-positive rates of 0.00001 to 0.001, respectively. The methods can identify people at highest risk up to a year in advance of the inferred diagnosis time. Beyond studies of predictive power, we identify new risk factors expressed as evidence in people’s search activity and location.
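To make these rates concrete, the arithmetic below converts them into approximate counts using the cohort sizes given under Participants. It is a back-of-the-envelope sketch that assumes each rate applies to the full cohort of positive or negative searchers, which may not match the evaluation protocol actually used. The function name `implied_counts` is hypothetical.

```python
N_POS = 5_443        # searchers with landmark queries (positive cases)
N_NEG = 4_808_542    # remaining symptom searchers (negative cases)


def implied_counts(tpr, fpr, n_pos=N_POS, n_neg=N_NEG):
    """Expected true positives, false positives, and precision.

    Precision here is the share of flagged searchers who are positive cases
    within this symptom-searcher cohort, not in the general population.
    """
    tp = tpr * n_pos
    fp = fpr * n_neg
    precision = tp / (tp + fp)
    return tp, fp, precision


# The two reported endpoints: (true-positive rate, false-positive rate).
for tpr, fpr in [(0.03, 0.00001), (0.32, 0.001)]:
    tp, fp, precision = implied_counts(tpr, fpr)
    print(f"TPR={tpr:.2f} FPR={fpr:.5f}: "
          f"~{tp:.0f} true positives, ~{fp:.0f} false positives, "
          f"precision ~{precision:.2f}")
```

Under these assumptions, the strictest operating point flags roughly 163 true positives against about 48 false positives, while the most permissive one flags roughly 1,742 true positives against about 4,809 false positives. The trade-off between sensitivity and false alarms is therefore steep even at very small false-positive rates, because negative cases outnumber positive cases by nearly 900 to 1.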
Conclusion: Pattern analysis and recognition based on large-scale search query data hold promise for identifying risk factors and frame new directions for the early detection of lung carcinoma.