Estimating disease burden using Internet data

Abstract

Data on disease burden are often used for assessing population health, evaluating the effectiveness of interventions, formulating health policies, and planning future resource allocation. We investigated whether Internet usage and social media data, specifically the search volume on Google, page view count on Wikipedia, and disease mentioning frequency on Twitter, correlated with the disease burden, measured by prevalence and treatment cost, for 1633 diseases over an 11-year period. We also applied least absolute shrinkage and selection operator to predict the burden of diseases. We found that Google search volume is relatively strongly correlated with the burdens for 39 of 1633 diseases, including viral hepatitis, diabetes mellitus, multiple sclerosis, and hemorrhoids. Wikipedia and Twitter data strongly correlated with the burdens of 15 and 7 diseases, respectively. However, an accurate analysis must consider each condition’s characteristics, including acute/chronic nature, severity, familiarity to the public, and the presence of stigma.

Keywords

data mining, disease burden, Google search, least absolute shrinkage and selection operator, prevalence, treatment cost, Twitter, Wikipedia

Introduction

The term “disease burden” refers to the financial, medical, or socio-economic impact of a disease or health problem.¹ Researchers in public health frequently measure the burden of various diseases or health problems across different geographic locations or at different time points, for purposes such as assessing population health, evaluating the effectiveness of interventions, formulating health policies, and planning future resource allocation.

There is no consensus on the best measure of disease burden; the choice often depends on individual values or specific needs. One common measure is financial cost. It summarizes the direct and indirect costs due to illness, which can be nontrivial for the low-income population. For example, Paez et al. examined out-of-pocket expenses, which represent the economic burden on patients and their families, for more than 100 chronic conditions in both adults and children. They found that annual out-of-pocket expenses in the United States increased by 39.4 percent from 1996 to 2005, after inflation adjustment.²

Another measure of disease burden is mortality rate. It counts the number of deaths due to a specific medical condition in a particular population, scaled to the size of that population, in unit time. In one study on the correlation between diabetes and ischemic heart disease (IHD), Laing et al. found that young adult women with diabetes were more than eight times more likely to die of IHD than those without diabetes were. Similar trends were observed among young adult men, older men, and older women; patients with Type I diabetes were found to have a relatively higher IHD mortality rate than patients with Type II diabetes.³

By contrast, morbidity rate describes the frequency with which a disease occurs in a population and is often calculated by incidence rate and prevalence rate. Incidence rate refers to the proportion of newly diagnosed cases of a disease in a population, while prevalence rate accounts for both newly diagnosed and pre-existing cases of a disease. Corbett et al. found that worldwide, among 0.9 million newly diagnosed adult cases of tuberculosis (TB) in 2000, 9 percent were attributable to HIV. In selected African countries and the United States, 31 and 26 percent of TB cases were attributable to HIV, respectively. However, TB led to about 11 percent of adult deaths from AIDS.⁴ This study indicated the comorbidity of TB and HIV and highlighted the need for a targeted intervention strategy in countries with a high prevalence of HIV and TB.

A more sophisticated measure of disease burden is disability-adjusted life years (DALYs). It is defined as the years lived with disability (YLDs) plus the years of life lost (YLLs) owing to a disease or health problem. Both YLDs and YLLs are age-weighted to reflect productivity and societal investment (e.g. years lived as a young adult are valued more than years spent as a young child or older adult). DALYs is the primary measure of disease burdens developed for the most comprehensive worldwide observational epidemiological study to date—the Global Burden of Disease Study,⁵ in which researchers have been estimating DALYs among populations of different ages, sex, and countries for more than 200 diseases and causes of death since 1990. Several developed countries, including the Netherlands⁶ and Australia,⁷ use DALYs to survey and compare their nationwide burden of diseases for public policymaking.

Each of these established measures of disease burden has its own limitations. The financial cost of a disease, for instance, does not reflect health-related quality of life or untreated cases.⁸ Mortality rate does not capture the disease burden prior to death,⁹ and in practice, it is often difficult to determine the actual cause of death, as death is often the consequence of multiple diseases or injuries.¹⁰ Morbidity does not adjust for the severity and impact of diseases. DALYs require a large amount of time and resources to calculate. As a result, many pandemic and rare diseases have been left unstudied, and it is barely possible to compare the disease burden across a large number of diseases over time.¹¹,¹² Furthermore, the estimates of disease burden from different studies sometimes conflict with each other. For example, the prevalence of Parkinson’s disease in Spain in 1994 was reported as 1.5, 0.6, and 0.2 percent by three separate groups.¹³⁻¹⁵

In recent years, new data from the Internet have revealed novel utility in different fields. For instance, Ginsberg et al.¹⁶ used search query keywords describing influenza-like illness on Google to predict influenza epidemics, as they were highly correlated with the actual influenza prevalence data reported by the Centers for Disease Control and Prevention. Similarly, Moat et al.¹⁷ identified correlations between the stock prices of 30 Dow Jones Industrial Average component companies and weekly Wikipedia page view data and were able to increase their portfolio return by 65 percent using Wikipedia page view data instead of conventional strategies to build prediction models. As one of the most popular social networking sites in the world, Twitter has also been used to estimate and predict disease burden, especially during the outbreak of pandemic diseases. Signorini et al.¹⁸ demonstrated that Twitter data can be used to track public sentiment about the H1N1 flu and predict its activity during 2009 and 2010, when it was affecting more than 100 countries all over the world. Therefore, we investigated whether mining these new data sources, primarily Google Trends, Wikipedia page view data, and Twitter data, would allow the estimation of disease burden for a large number of diseases in an automated and cost-efficient way. Specifically, we examined the alignment of disease burden in terms of disease prevalence and financial cost for 1633 diseases over 11 years using these three Internet data sources. We also applied the least absolute shrinkage and selection operator (LASSO), a regression method that accomplishes variable selection and regularization, to predict the burden of diseases, using the Internet data along with other variables that we quantified in a previous study¹⁹ for four specific diseases.

Data and methods

Data collection

Google Trends and Wikipedia are two publicly available data sources that record searching and browsing activities related to various diseases and health conditions on the Internet. On Google Trends (https://www.google.com/trends/), users enter one to five keywords to retrieve their relative search volume. The upper-right panel of Figure 1 shows the output of querying “breast cancer,” “obesity,” “acne,” “headache,” and “anemia” in the interactive user interface. The x-axis gives the timeline and the y-axis gives the normalized search volume in percentages, where the denominator is the highest search volume among all queried terms in the given time frame (e.g. in Figure 1, the highest search volume is for breast cancer around October 2004). Google Trends also allows users to specify the geographic location, time period, data source category (e.g. Arts & Entertainment, Books & Literature, Health), and the type of search (e.g. web search, image search, news search), or to use its application programming interface (API) for batch queries. In our experiment, these parameters were set to worldwide, from 2004 (the earliest available year for Google Trends) to 2014, all categories, and web search, respectively. Furthermore, we developed a two-step strategy to retrieve the relative search volume for 1633 diseases from 2004 to 2014. As shown in the left panel of Figure 1, the first step was to find the disease with the highest search volume during the defined time frame among all our diseases of interest and set it as the baseline disease. Thereafter, we categorized all the diseases into five disease groups, with the baseline disease inserted into each group, and queried Google Trends again (shown on the right panel, Figure 1). Hence, the normalization denominator for each group was the same. Wikipedia provides a simpler API that allowed us to download the weekly page view counts of each disease term from 2008 to 2014. We computed the annual Wikipedia page view counts by adding up the counts for all 52 weeks of the year.
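The batching idea behind the two-step strategy, and the weekly-to-annual aggregation of page views, can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names `make_batches` and `annual_total`, the `YYYY-Www` week key format, and the reading that each query holds at most five terms including the baseline are all assumptions, and no Google Trends or Wikipedia API is called.

```python
from typing import Dict, List


def make_batches(terms: List[str], baseline: str, batch_size: int = 5) -> List[List[str]]:
    """Split terms into query batches that each start with the baseline term.

    Assumption: one query accepts at most `batch_size` terms, so every batch
    reserves one slot for the baseline and shares its normalization denominator.
    """
    others = [t for t in terms if t != baseline]
    step = batch_size - 1  # slots left after the baseline is inserted
    return [[baseline] + others[i:i + step] for i in range(0, len(others), step)]


def annual_total(weekly_counts: Dict[str, int]) -> Dict[int, int]:
    """Sum weekly counts keyed by a hypothetical 'YYYY-Www' label into yearly totals."""
    totals: Dict[int, int] = {}
    for week, count in weekly_counts.items():
        year = int(week.split("-")[0])
        totals[year] = totals.get(year, 0) + count
    return totals


# Illustrative terms only; "breast cancer" plays the role of the baseline disease.
batches = make_batches(
    ["breast cancer", "obesity", "acne", "headache", "anemia", "stroke"],
    baseline="breast cancer",
)
yearly = annual_total({"2008-W01": 10, "2008-W02": 5, "2009-W01": 7})
```

Because the baseline has the highest search volume of all terms, it sets the denominator in every batch, which is what makes values from different batches comparable.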

Figure 1.

Strategy to retrieve relative search volume from Google Trends for 1633 diseases.

Since its founding in 2006, Twitter has generated a huge amount of data, which has been analyzed to provide insights into public interest in or concern about different diseases. In this work, we analyzed tweets from 52 months (see Table 1), which were obtained through https://archive.org/details/twitterstream. With non-English posts removed, we analyzed a total of 2.77 billion tweets, amounting to 410 gigabytes. To search for disease mentions throughout a dataset of this size, both algorithmic efficiency and computing power are important. We applied Xapian, an open-source probabilistic information retrieval library, to index the data and then searched the indexed dataset. We also employed the Slurm Workload Manager (an open-source job scheduler used by many of the world’s supercomputers and computer clusters) to submit the computational jobs to HiPerGator, the supercomputer at the University of Florida (https://www.rc.ufl.edu/services/hipergator/), to accelerate the indexing process. Finally, the number of disease-mentioning tweets for each year was computed as the sum over the available months in that year, before being transformed into relative mentioning frequency (see section “Analytical method”).
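For readers who want a concrete picture of the counting step, the sketch below tallies disease-mentioning tweets per year with a plain regular-expression scan. It is a deliberately simplified stand-in: the study indexed the data with Xapian and ran on a cluster, whereas this example uses a linear scan over an in-memory list, and the function name `count_mentions`, the `(year, text)` input format, and the sample tweets are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_mentions(
    tweets: Iterable[Tuple[int, str]],
    synonyms: Dict[str, List[str]],
) -> Counter:
    """Count tweets per (year, disease) that mention any synonym of the disease."""
    # One case-insensitive, whole-word pattern per disease, built from its synonyms.
    patterns = {
        disease: re.compile(
            r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
            re.IGNORECASE,
        )
        for disease, names in synonyms.items()
    }
    counts: Counter = Counter()
    for year, text in tweets:
        for disease, pattern in patterns.items():
            if pattern.search(text):
                counts[(year, disease)] += 1  # each tweet counts once per disease
    return counts


# Made-up tweets, for illustration only.
sample_tweets = [
    (2013, "My uncle had a stroke last week"),
    (2013, "Managing diabetes is hard"),
    (2014, "Cerebrovascular accident awareness day"),
    (2014, "Nothing medical here"),
]
sample_synonyms = {
    "stroke": ["stroke", "cerebrovascular accident"],
    "diabetes mellitus": ["diabetes"],
}
mention_counts = count_mentions(sample_tweets, sample_synonyms)
```

A scan like this is only practical for small samples; at the scale of billions of tweets an inverted index such as the one described above is the more realistic choice.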

Table 1.

The summary of Twitter data in terms of size, count, and disease mentioning.

Year (months with data available)	Number of Tweets (million)	Number of tweets mentioning a disease
2009 (05–12)	262.6	178,656
2010 (01–10)	752.3	347,665
2011 (09–12)	29.2	11,269
2012 (01–06)	354.6	89,245
2013 (01–12)	615	224,756
2014 (01–12)	754.5	271,573

Disease nomenclature

When diseases or medical conditions are mentioned in different online contexts, they can be abbreviated, exhibit various morphological or orthographical variations, or have multiple synonyms. For example, medical professionals refer to stroke as cerebrovascular accident, cerebrovascular insult, or brain attack. To ensure the completeness and consistency of the query results, we used the Metathesaurus of the Unified Medical Language System (UMLS),²⁰ a knowledge source that “compiles names, relationships, and associated information from a variety of biomedical naming systems.” For each disease of interest, we queried Google Trends and Wikipedia using all of its synonyms and defined its search volume as the highest search volume among all its synonyms.
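The synonym rule amounts to taking a maximum over the terms that map to one disease. A minimal sketch follows; the function name `disease_volume` and the volumes are made up for illustration, and treating a synonym with no retrieved data as zero is an assumption.

```python
from typing import Dict, Iterable


def disease_volume(volume_by_term: Dict[str, float], synonyms: Iterable[str]) -> float:
    """Return the highest volume among a disease's synonyms (0.0 if none were retrieved)."""
    return max((volume_by_term.get(term, 0.0) for term in synonyms), default=0.0)


# Made-up volumes, for illustration only.
volumes = {"stroke": 62.0, "cerebrovascular accident": 3.0, "brain attack": 1.0}
stroke_volume = disease_volume(
    volumes,
    ["stroke", "cerebrovascular accident", "cerebrovascular insult", "brain attack"],
)
```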

Benchmark data on disease burdens

We obtained the benchmark data on disease burdens from 2004 to 2010 from our previous study,¹⁹ and the data for 2011 to 2014 from a large medical claims database, MarketScan®. MarketScan is managed by Truven Health and provides healthcare data such as individual claims, lab test results, and hospital discharges for different stakeholders including employers and health plans, policy makers and practitioners, and healthcare providers and facilities.²¹

Using the UMLS, we set the disease terminology for our analysis to be PheWAS codes as they were developed to represent clinically meaningful phenotypes with appropriate granularity.²² Due to the aggregation of the diseases and medical conditions by the PheWAS codes, the total number of diseases was reduced to 1633.

In addition to the disease prevalence and treatment cost data, we added three data sources for each disease: the number of PubMed publications, the amount of National Institutes of Health (NIH) funding, and the count of clinical trials. In our previous study, they were used to measure how medical research resources were allocated. Based on the assumption that maximal societal benefits can only be achieved when medical research resources are allocated proportional to the disease burden across the full distribution of diseases and conditions, we proposed and computed research opportunity index (ROI) and identified the diseases that required more research resources or the diseases that received more resources than their actual disease burden.¹⁹

More specifically, for these 1633 diseases from medical claims databases with non-zero Internet or disease burden data, we calculated the relative prevalence and the relative treatment cost, together with the relative number of publications, the relative number of clinical trials, and the relative amount of NIH funding. The “relative” treatment cost, for instance, is defined as a given disease’s treatment cost divided by the total treatment cost of all the 1633 diseases. This way, different factors become unitless and comparable.
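The normalization described here divides each disease's value by the total across all diseases, so that the shares sum to 1. A minimal sketch is given below; the function name `relative_share` and the cost figures are hypothetical.

```python
from typing import Dict


def relative_share(values: Dict[str, float]) -> Dict[str, float]:
    """Divide each disease's value by the total over all diseases."""
    total = sum(values.values())
    if total == 0:
        raise ValueError("total is zero; relative shares are undefined")
    return {disease: value / total for disease, value in values.items()}


# Made-up treatment costs, for illustration only.
costs = {"viral hepatitis": 30.0, "diabetes mellitus": 50.0, "multiple sclerosis": 20.0}
relative_costs = relative_share(costs)
```

The same function applies unchanged to prevalence, publication counts, clinical trial counts, and funding amounts, which is what makes the resulting quantities unitless and comparable.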

Analytical method

We denote the relative search volume of disease i in year j on Google as $G_{i, j}$ , where $\sum_{i = 1}^{n} G_{i, j} = 1$ . Thus, the vector $G_{, j}$ , which can be expanded as ( $G_{1, j}$ , $G_{2, j}$ , $G_{3, j}$ , …, $G_{n, j}$ ), represents the relative search volume of all n diseases in year j on Google, and the vector $G_{i,}$ , which represents ( $G_{i, 1}$ , $G_{i, 2}$ , $G_{i, 3}$ , …, $G_{i, m}$ ), denotes the relative search volume of disease i in all m years of interest on Google. Similarly, we define the relative page view count of disease i in year j on Wikipedia as $W_{i, j}$ , the relative prevalence of disease i in year j as $P_{i, j}$ , the relative treatment cost of disease i in year j as $C_{i, j}$ , and the relative mentioning frequency of disease i in year j on Twitter as $T_{i, j}$ .

To determine whether the information from Google Trends, Wikipedia, and Twitter can approximate the burden of diseases from three dimensions, we first examined the correlations between the Internet data and disease burdens measured by relative prevalence and relative treatment cost for all the diseases of interest as a whole. We did so by computing the Pearson correlation coefficients of ( $G_{, k}$ , $P_{, l}$ ) and ( $G_{, k}$ , $C_{, l}$ ) for years from 2004 to 2014, the Pearson correlation coefficients of ( $W_{, k}$ , $P_{, l}$ ) and ( $W_{, k}$ , $C_{, l}$ ) for years from 2008 to 2014, and the Pearson correlation coefficients of ( $T_{, k}$ , $P_{, l}$ ) and ( $T_{, k}$ , $C_{, l}$ ) for years from 2009 to 2014. We also computed Spearman rank correlation coefficients and the p values to test the null hypothesis that the Internet data are not correlated with those disease burden measures. Since the Type I error (false-positive findings) rate increases with the number of hypothesis tests performed, we first applied the Bonferroni correction to control the family-wise error rate. Bonferroni correction is a conservative and stringent way to control Type I error at the expense of Type II error (false-negative findings).²³ We then adopted Holm’s method, in which the marginal p values are ordered from the smallest to the largest before sequential adjustment.²⁴ Holm’s method yields a uniformly more powerful test than Bonferroni correction.
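To make the difference between the two corrections concrete, the following sketch implements both adjustments in plain Python. It follows the standard textbook definitions (multiply each p value by the number of tests for Bonferroni; multiply the k-th smallest p value by m − k + 1 and enforce monotonicity for Holm) and is not the code used in the study, which was run in R; the function names and the example p values are illustrative.

```python
from typing import List


def bonferroni_adjust(p_values: List[float]) -> List[float]:
    """Multiply every p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjustment, returned in the original order of the inputs."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p value first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up p values, for illustration only.
raw = [0.01, 0.04, 0.03]
holm = holm_adjust(raw)
bonf = bonferroni_adjust(raw)
```

For these inputs the Holm-adjusted values never exceed the Bonferroni-adjusted ones, which is the sense in which Holm's method is uniformly more powerful.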

Second, we determined whether the Internet data could forecast the disease burden during the same year, 1 year later, and 2 years later on an individual disease level. Mathematically, for each disease i, we computed the Pearson correlation coefficients between the relative search volume on Google and relative disease prevalence ( $G_{i,}$ , $P_{i,}$ ), ( $G_{i,}$ , ${\tilde{P}}_{i,}$ ), ( $G_{i,}$ , ${\tilde{\tilde{P}}}_{i,}$ ), between the relative page views on Wikipedia and relative disease prevalence ( $W_{i,}$ , $P_{i,}$ ), ( $W_{i,}$ , ${\tilde{P}}_{i,}$ ), between the relative mentioning frequency on Twitter and relative disease prevalence ( $T_{i,}$ , $P_{i,}$ ), ( $T_{i,}$ , ${\tilde{P}}_{i,}$ ), between the relative search volume on Google and relative treatment cost ( $G_{i,}$ , $C_{i,}$ ), ( $G_{i,}$ , ${\tilde{C}}_{i,}$ ), ( $G_{i,}$ , ${\tilde{\tilde{C}}}_{i,}$ ), between the relative page views on Wikipedia and relative treatment cost ( $W_{i,}$ , $C_{i,}$ ), ( $W_{i,}$ , ${\tilde{C}}_{i,}$ ), and between the relative mentioning frequency on Twitter and relative treatment cost ( $T_{i,}$ , $C_{i,}$ ), ( $T_{i,}$ , ${\tilde{C}}_{i,}$ ), where ${\tilde{P}}_{i,}$ = ( $P_{i, 2}$ , $P_{i, 3}$ , $P_{i, 4}$ , …, $P_{i, m + 1}$ ), ${\tilde{\tilde{P}}}_{i,}$ = ( $P_{i, 3}$ , $P_{i, 4}$ , $P_{i, 5}$ , …, $P_{i, m + 2}$ ), ${\tilde{C}}_{i,}$ = ( $C_{i, 2}$ , $C_{i, 3}$ , $C_{i, 4}$ , …, $C_{i, m + 1}$ ), and ${\tilde{\tilde{C}}}_{i,}$ = ( $C_{i, 3}$ , $C_{i, 4}$ , $C_{i, 5}$ , …, $C_{i, m + 2}$ ). We considered only the same-year and 1-year gaps for the Wikipedia and Twitter data, since the 2-year series would contain only a few points.
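The lagged correlations can be illustrated with a short Python sketch. It pairs the Internet series in year t with the burden series in year t + lag and, as an assumption of this example, truncates both series to the years where the pair is available; the function names and the toy series are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def lagged_pearson(internet: Sequence[float], burden: Sequence[float], lag: int) -> float:
    """Correlate internet[t] with burden[t + lag] over the overlapping years."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(internet, burden)
    return pearson(internet[:-lag], burden[lag:])


# Toy series in which the burden follows the Internet signal one year later.
search = [1.0, 2.0, 3.0, 4.0, 5.0]
prevalence = [9.0, 2.0, 4.0, 6.0, 8.0]
r_lag1 = lagged_pearson(search, prevalence, lag=1)
```

Each additional year of lag removes one pair of observations, which is why short series such as the Wikipedia and Twitter data support only small lags.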

Finally, we used a LASSO-based regression model to predict the relative disease burden ( ${\tilde{P}}_{i,}$ or ${\tilde{C}}_{i,}$ ) using $G_{i,}$ , $W_{i,}$ , $T_{i,}$ , and three other variables introduced in our previous work,¹⁹ namely the relative number of scientific articles from PubMed $(L_{i,})$ , relative number of clinical trials $(R_{i,})$ , and relative funding from the NIH $(F_{i,})$ . LASSO is more powerful than traditional linear regression as it uses variable (feature) selection and regularization.²⁵ The diseases we chose are viral hepatitis, diabetes mellitus, other headache syndromes, and multiple sclerosis, whose relative burdens demonstrated the strongest correlations with relative search volume on Google and relative page views on Wikipedia in the second step. All these computations were performed in the R programming environment, in which LASSO is implemented by the “glmnet” package.²⁶
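To show how LASSO performs variable selection, the sketch below fits the penalized least-squares objective (1/(2n))·‖y − b₀ − Xb‖² + λ‖b‖₁ by cyclic coordinate descent with soft-thresholding. It is a simplified pure-Python illustration, not the glmnet implementation used in the study: it does not standardize predictors, it fits a single value of λ, and the function names and toy data are hypothetical.

```python
from typing import List, Tuple


def soft_threshold(z: float, gamma: float) -> float:
    """Shrink z toward zero by gamma, returning exactly 0.0 inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_fit(X: List[List[float]], y: List[float], lam: float,
              n_iter: int = 1000, tol: float = 1e-10) -> Tuple[float, List[float]]:
    """Fit LASSO by cyclic coordinate descent; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centering removes the intercept from the penalized problem.
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residual when all coefficients are zero
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # a constant predictor carries no information
            # Covariance of predictor j with the residual that excludes its own fit.
            rho = sum(Xc[i][j] * (resid[i] + Xc[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_ss[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_beta
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


# Toy data: y depends only on the first predictor (y = 1 + 2 * x1).
X_toy = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.1], [4.0, -0.2], [5.0, 0.4]]
y_toy = [3.0, 5.0, 7.0, 9.0, 11.0]
b0_ols, beta_ols = lasso_fit(X_toy, y_toy, lam=0.0)      # no penalty
b0_mid, beta_mid = lasso_fit(X_toy, y_toy, lam=0.5)      # moderate penalty
b0_null, beta_null = lasso_fit(X_toy, y_toy, lam=5.0)    # heavy penalty
```

With no penalty the fit recovers the least-squares solution; with a moderate penalty the irrelevant second coefficient is set exactly to zero while the first is shrunk; with a heavy penalty every coefficient is zero and only the intercept remains.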

Results

Correlations analysis for the entire disease landscape

For all the diseases as a whole, we analyzed the correlations between disease burdens, measured by relative disease prevalence $(P_{, l})$ and relative treatment cost $(C_{, l})$ , and the relative search volume on Google $(G_{, k})$ at different years.

Table 2 lists the Pearson correlation coefficients. The coefficients in the upper panel of Table 2 (relative prevalence) are all greater than the corresponding values in the lower panel (relative treatment cost), indicating that disease prevalence is more strongly correlated with search volume than treatment cost is. This can be explained by the definitions of those two measures. The treatment cost of a disease equals its prevalence times the average treatment fees for each patient with the disease diagnosis in a given year. When all diseases are evaluated as a whole, the treatment cost estimate will have a larger variation than disease prevalence, therefore reducing its correlation with the relative search volume data on Google. In addition, we computed the corresponding p values and adjusted them by Holm’s method (https://s3.amazonaws.com/cds-1/p-values-table2.docx). The adjusted p values are all smaller than 0.05, indicating that the correlations are statistically significant.

Table 2.

The correlations between relative search volume on Google $(G_{, k})$ and relative disease burdens ( $P_{, l}$ and $C_{, l}$ ) during 2004–2014.

Correlations between relative search volume on Google and relative disease prevalence
	$G_{, 04}$	$G_{, 05}$	$G_{, 06}$	$G_{, 07}$	$G_{, 08}$	$G_{, 09}$	$G_{, 10}$	$G_{, 11}$	$G_{, 12}$	$G_{, 13}$	$G_{, 14}$
$P_{, 04}$	0.277	0.278	0.272	0.272	0.277	0.262	0.274	0.279	0.284	0.283	0.284
$P_{, 05}$	0.281	0.283	0.276	0.275	0.282	0.272	0.279	0.283	0.288	0.288	0.289
$P_{, 06}$	0.286	0.287	0.281	0.28	0.285	0.272	0.281	0.286	0.29	0.29	0.29
$P_{, 07}$	0.284	0.286	0.279	0.279	0.283	0.27	0.28	0.284	0.289	0.288	0.288
$P_{, 08}$	0.287	0.29	0.284	0.282	0.286	0.279	0.283	0.287	0.292	0.291	0.291
$P_{, 09}$	0.295	0.299	0.293	0.29	0.294	0.298	0.29	0.295	0.3	0.301	0.301
$P_{, 10}$	0.274	0.275	0.27	0.269	0.27	0.254	0.267	0.272	0.276	0.276	0.275
$P_{, 11}$	0.279	0.281	0.276	0.274	0.275	0.265	0.273	0.278	0.282	0.282	0.281
$P_{, 12}$	0.278	0.28	0.275	0.273	0.274	0.263	0.271	0.277	0.281	0.281	0.28
$P_{, 13}$	0.285	0.288	0.283	0.281	0.281	0.275	0.278	0.283	0.287	0.288	0.287
$P_{, 14}$	0.291	0.293	0.289	0.286	0.286	0.279	0.283	0.288	0.292	0.293	0.292
Correlations between relative search volume on Google and relative treatment cost
	$G_{, 04}$	$G_{, 05}$	$G_{, 06}$	$G_{, 07}$	$G_{, 08}$	$G_{, 09}$	$G_{, 10}$	$G_{, 11}$	$G_{, 12}$	$G_{, 13}$	$G_{, 14}$
$C_{, 04}$	0.203	0.201	0.197	0.195	0.192	0.176	0.186	0.189	0.191	0.191	0.188
$C_{, 05}$	0.206	0.205	0.2	0.199	0.195	0.181	0.19	0.193	0.195	0.195	0.192
$C_{, 06}$	0.207	0.206	0.201	0.2	0.196	0.181	0.191	0.194	0.195	0.195	0.192
$C_{, 07}$	0.213	0.211	0.207	0.205	0.201	0.185	0.195	0.198	0.199	0.199	0.195
$C_{, 08}$	0.213	0.212	0.207	0.205	0.201	0.187	0.195	0.198	0.199	0.199	0.195
$C_{, 09}$	0.214	0.214	0.209	0.207	0.203	0.194	0.197	0.2	0.202	0.201	0.198
$C_{, 10}$	0.208	0.205	0.201	0.199	0.191	0.173	0.183	0.185	0.185	0.186	0.181
$C_{, 11}$	0.217	0.214	0.21	0.207	0.199	0.183	0.19	0.193	0.193	0.194	0.188
$C_{, 12}$	0.216	0.214	0.21	0.207	0.199	0.182	0.19	0.193	0.193	0.194	0.188
$C_{, 13}$	0.228	0.226	0.222	0.218	0.209	0.195	0.2	0.203	0.202	0.204	0.198
$C_{, 14}$	0.24	0.238	0.234	0.23	0.22	0.205	0.21	0.213	0.212	0.214	0.207

Note: The diagonal line and the line under are highlighted to exhibit the correlations of Internet data and Disease Burden in the same year or in the next year.

We were also interested in the relationship between the relative search volume on Google in a given year t and the relative prevalence of a disease in year t – 1 (the highlighted area under the diagonal lines in Table 2), as we initially assumed that individuals search the Internet once they receive a diagnosis. However, we did not observe such a trend. This might be owing to the fact that not all patients with a certain diagnosis will search the Internet and not all people who search for a particular disease on the Internet are diagnosed patients, or the fact that the computation of prevalence includes both newly diagnosed and pre-existing cases. An observable trend is that the Pearson correlation coefficients in the diagonal and right under the diagonal increase slowly with time, despite a few downward instances during 2009 and 2011. Such a weak increase suggests that it is becoming increasingly common to search the Internet for health-related topics.

We also tested the null hypotheses that cor( $G_{, k}$ , $P_{, l}$ ) = 0 and cor( $G_{, k}$ , $C_{, l}$ ) = 0, adjusted the p values using Holm’s method, and found that all the adjusted p values were less than the significance level of 0.05. Therefore, we concluded that the relative search volume on Google and the relative disease prevalence (or the treatment cost) are unlikely to be uncorrelated.

The correlations between the relative page views on Wikipedia and relative disease burdens are also significant (https://s3.amazonaws.com/cds-1/p-values-table3.docx) and showed similar patterns from 2008 to 2014 (see Table 3): the correlations with relative prevalence (upper panel of Table 3) are generally larger than the correlations with relative treatment cost (lower panel). Along the diagonals, cor( $W_{, k}$ , $C_{, k}$ ) has been increasing while cor( $W_{, k}$ , $P_{, k}$ ) went from 0.182 down to 0.152. Notably, the correlations for Google search volume (Table 2) are all larger than the corresponding values in Table 3. This may indicate that, compared to Wikipedia, Google is more widely used when people search online for information about a disease.

Table 3.

The correlations between relative page view count on Wikipedia $(W_{, k})$ and relative disease burdens ( $P_{, l}$ and $C_{, l}$ ).

Correlations between relative page view count on Wikipedia and relative disease prevalence
	$W_{, 08}$	$W_{, 09}$	$W_{, 10}$	$W_{, 11}$	$W_{, 12}$	$W_{, 13}$	$W_{, 14}$
$P_{, 08}$	0.182	0.174	0.178	0.190	0.188	0.188	0.167
$P_{, 09}$	0.190	0.184	0.186	0.198	0.196	0.197	0.177
$P_{, 10}$	0.154	0.146	0.151	0.164	0.161	0.161	0.141
$P_{, 11}$	0.158	0.150	0.155	0.168	0.165	0.165	0.145
$P_{, 12}$	0.159	0.151	0.156	0.169	0.166	0.166	0.146
$P_{, 13}$	0.163	0.156	0.160	0.173	0.170	0.170	0.150
$P_{, 14}$	0.165	0.158	0.162	0.174	0.172	0.172	0.152
Correlations between relative page view count on Wikipedia and relative treatment cost
	$W_{, 08}$	$W_{, 09}$	$W_{, 10}$	$W_{, 11}$	$W_{, 12}$	$W_{, 13}$	$W_{, 14}$
$C_{, 08}$	0.143	0.135	0.141	0.170	0.156	0.154	0.133
$C_{, 09}$	0.144	0.138	0.143	0.171	0.158	0.155	0.136
$C_{, 10}$	0.126	0.120	0.127	0.154	0.143	0.142	0.123
$C_{, 11}$	0.134	0.128	0.135	0.162	0.151	0.149	0.131
$C_{, 12}$	0.139	0.133	0.140	0.167	0.156	0.154	0.135
$C_{, 13}$	0.149	0.142	0.149	0.175	0.165	0.163	0.144
$C_{, 14}$	0.157	0.150	0.156	0.183	0.173	0.171	0.151

Note: The diagonal line and the line under are highlighted to exhibit the correlations of Internet data and Disease Burden in the same year or in the next year.

The correlations between disease burden and relative mentioning count on Twitter, however, are much smaller, as exhibited in Table 4 (the corresponding adjusted p value is 1 for each cell). Given the 140-character limit and the public nature of Twitter, it is possible that many patients do not use Twitter or that Twitter users do not post about very personal disease experiences.

Table 4.

The correlations between relative mentioning count on Twitter $(T_{, k})$ and relative disease burdens ( $P_{, l}$ and $C_{, l}$ ).

Correlations between relative mentioning count on Twitter and relative disease prevalence
	$T_{, 09}$	$T_{, 10}$	$T_{, 11}$	$T_{, 12}$	$T_{, 13}$	$T_{, 14}$
$P_{, 09}$	0.041	0.041	0.038	0.037	0.037	0.039
$P_{, 10}$	0.032	0.032	0.028	0.026	0.027	0.029
$P_{, 11}$	0.032	0.032	0.028	0.027	0.027	0.029
$P_{, 12}$	0.033	0.034	0.029	0.028	0.028	0.030
$P_{, 13}$	0.034	0.035	0.030	0.028	0.029	0.031
$P_{, 14}$	0.034	0.035	0.031	0.029	0.030	0.031
Correlations between relative mentioning count on Twitter and relative treatment cost
	$T_{, 09}$	$T_{, 10}$	$T_{, 11}$	$T_{, 12}$	$T_{, 13}$	$T_{, 14}$
$C_{, 09}$	0.058	0.056	0.058	0.055	0.056	0.056
$C_{, 10}$	0.046	0.045	0.046	0.043	0.045	0.044
$C_{, 11}$	0.047	0.046	0.047	0.044	0.046	0.045
$C_{, 12}$	0.049	0.049	0.050	0.046	0.048	0.048
$C_{, 13}$	0.049	0.048	0.049	0.046	0.048	0.047
$C_{, 14}$	0.050	0.049	0.050	0.047	0.049	0.048

Note: The diagonal line and the line under are highlighted to exhibit the correlations of Internet data and Disease Burden in the same year or in the next year.

Correlations at individual disease level

Overall, the correlation coefficients between relative search volume on Google (or relative page views on Wikipedia) and relative disease burden measures are small: all are less than or close to 0.3. Twitter data showed minimal correlations (all under 0.06) with the burdens of disease as a whole. We thus assessed the correlations between each data source and relative disease burdens for individual diseases, one by one.

We first looked at the relative search volume on Google Trends ( $G_{i,}$ ) and the relative disease burdens ( $P_{i,}$ , $C_{i,}$ ) with 0-year, 1-year, and 2-year intervals for individual diseases. Filtering by adjusted p < 0.05 on all six correlation coefficients left 60 diseases. A total of 21 diseases that had high correlations owing to missing values in either Google Trends or disease burden data were then excluded, and the remaining 39 diseases and their Pearson correlation coefficients are listed in Table 5 (for the corresponding adjusted p values, see https://s3.amazonaws.com/cds-1/p-values-table5.docx). Black and white values refer to positive and negative correlations, respectively.

Table 5.

39 diseases demonstrate strong correlations between relative search volume on Google and disease burden measured by relative prevalence and relative treatment cost.

Note: The cells are highlighted to differentiate between the positive and negative correlations.

In Figure 2, we also plotted the correlation patterns for four representative diseases. Figure 2(a) shows that viral hepatitis is becoming less and less popular in Google Search, which corresponds to its decreasing prevalence and treatment costs. Figure 2(b) shows that diabetes mellitus is searched less and less frequently on Google, but both its prevalence and treatment cost are increasing with time. This might indicate that as a chronic condition, diabetes mellitus requires long-term treatment but is underestimated by the public. “Other headache syndromes” in Figure 2(c) exhibits a rising popularity in Google Search, but both its prevalence and treatment cost went down from 2004 to 2014. According to our communication with clinicians, one reasonable explanation is that headache is underdiagnosed as many people do not seek medical consultation for headache. Instead, patients simply turn to the Internet for information. In Figure 2(d), the relative search volume for multiple sclerosis on Google aligns well with its prevalence but the treatment cost has been rising dramatically, possibly owing to the increase in the cost of medication, which occurred in the same period.²⁷

Figure 2.

The correlations between relative search volume on Google Trends (solid lines) and relative disease prevalence (dotted lines) and treatment cost (dashed lines) for (a) viral hepatitis, (b) diabetes mellitus, (c) other headache syndromes, and (d) multiple sclerosis.

Second, we investigated the correlations at the single-disease level for Wikipedia page view count ( $W_{i,}$ ) and Twitter mentioning frequency ( $T_{i,}$ ). Since we have only 7 and 6 years of data for Wikipedia and Twitter, respectively, we calculated the correlations based on 0-year and 1-year intervals only. The highly correlated diseases are listed in Tables 6 and 7, respectively. Among the 15 PheWAS diseases listed in Table 6, four diseases (highlighted in bold font) also appeared in the Google Trends results (see Table 5): viral hepatitis, neoplasm of uncertain behavior, obesity, and other headache syndromes. The first three diseases showed the same type of correlations in results from both Google Trends and Wikipedia, which added to our confidence in the results. After filtering, only seven diseases exhibited high correlations between Twitter data and disease burden. In other words, most of the 1633 diseases were either not mentioned on Twitter or showed no correlations between their burdens and the mentioning frequency on Twitter.

Table 6.

Fifteen diseases demonstrate strong correlations between relative page view count on Wikipedia and disease burden measured by relative prevalence and relative treatment cost.

PheWAS name	Pearson correlation coefficient
	( $W_{i,}$ , $P_{i,}$ )	( $W_{i,}$ , $C_{i,}$ )	( $W_{i,}$ , ${\tilde{P}}_{i,}$ )	( $W_{i,}$ , ${\tilde{C}}_{i,}$ )
Viral hepatitis	0.973	0.941	0.834	0.825
Subjective visual disturbances	0.953	0.945	0.852	0.869
Neoplasm of uncertain behavior	0.950	0.910	0.857	0.898
Dermatophytosis	0.942	0.880	0.859	0.916
Hypertrophy of female genital organs	0.930	0.929	0.829	0.862
Parasomnia	0.875	0.899	0.910	0.909
Other pulmonary inflammation or edema	0.816	0.945	0.827	0.966
Anorexia	0.815	0.881	0.891	0.878
Myeloid leukemia, acute	0.786	0.783	0.900	0.910
Other headache syndromes	0.763	0.843	0.898	0.891
Foreign body injury	−0.798	−0.851	−0.871	−0.913
Cleft palate	−0.802	−0.788	−0.823	−0.934
Obesity	−0.810	−0.889	−0.823	−0.950
Corneal dystrophy	−0.860	−0.850	−0.854	−0.850
Convulsions	−0.918	−0.976	−0.935	−0.906

Note: The cells are highlighted to differentiate between the positive and negative correlations.

Table 7.

Seven diseases demonstrate strong correlations between relative mentioning frequency on Twitter and disease burden measured by relative prevalence and relative treatment cost.

PheWAS name	Pearson correlation coefficient
	($T_i$, $P_i$)	($T_i$, $C_i$)	($T_i$, $\tilde{P}_i$)	($T_i$, $\tilde{C}_i$)
Hyperlipidemia	0.951	0.921	0.944	0.934
Blood in stool	0.947	0.970	0.807	0.952
Azoospermia and oligospermia	0.941	0.902	0.930	0.832
Eye infection, viral	−0.894	−0.894	−0.781	−0.920
Autism	−0.938	−0.935	−0.881	−0.859
Disturbances of amino-acid transport	−0.850	0.978	−0.928	0.859
Fracture of unspecified bones	0.956	0.949	−0.866	−0.912

Note: The cells are highlighted to differentiate between the positive and negative correlations.

Predicting disease burdens using LASSO

Finally, we explored whether the relative search volume on Google $(G_i)$, relative page view count on Wikipedia $(W_i)$, relative mentioning count on Twitter $(T_i)$, relative disease prevalence $(P_i)$, relative treatment cost $(C_i)$, and three other variables we quantified in our previous study,¹⁹ namely the relative number of scientific articles from PubMed $(L_i)$, relative number of clinical trials $(R_i)$, and relative funding from the NIH $(F_i)$, for year t could predict the relative disease prevalence $(\tilde{P}_i)$ or relative treatment cost $(\tilde{C}_i)$ for year t + 1, using LASSO for each of the 39 diseases we identified in the previous step. Figure 3 shows the LASSO cross-validation curves and variable selection results for the treatment cost prediction of sleep apnea, hemorrhoids, disaccharidase deficiency, and diabetes mellitus. As lambda shrinks (bottom horizontal axis; log scale), the mean square error (MSE, left vertical axis) decreases until it reaches its minimum (close to 0 in Figure 3) at the left vertical line. The right vertical line gives the optimal model, where the error is within one standard error of the minimal MSE. The regression coefficients and intercept of the best-fitting model are listed in each panel. Not all of the variables appear to contribute to treatment cost prediction in each case, but the relative treatment cost from the previous year is the most useful, which is consistent with our previous findings.¹⁹ We repeated the analysis for relative disease prevalence $(\tilde{P}_i)$ (https://s3.amazonaws.com/cds-1/S1.JPG). The results confirmed that the predictive power of the aforementioned factors varies from case to case. Notably, although the relative mentioning count on Twitter $(T_i)$ was included in all the experiments, none of them returned a non-zero coefficient for it, adding weight to the conclusion that Twitter data is not appropriate for understanding the disease burden.
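To make the variable selection step concrete, the following is a minimal pure-Python sketch of LASSO fitted by coordinate descent. It is not the glmnet implementation listed in the references: it only centers the predictors rather than standardizing them, takes a single fixed penalty `lam` instead of a cross-validated path, and uses made-up numbers. The function names and the toy data are assumptions for illustration.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_fit(X, y, lam, max_sweeps=1000, tol=1e-10):
    """Minimize (1/(2n)) * sum((y - b0 - X b)^2) + lam * sum(|b|) by coordinate descent.

    X is a list of rows (one row of year-t predictors per observation) and
    y holds the year-(t+1) target. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]

    beta = [0.0] * p
    residual = yc[:]  # residual of the centered problem for the current beta
    for _ in range(max_sweeps):
        largest_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant predictor carries no information
            # Covariance of predictor j with the residual that ignores predictor j.
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                beta[j] = new_bj
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


# Made-up relative values (illustration only): columns are previous-year
# treatment cost and Twitter mentioning count; y is next-year treatment cost.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0], [6.0, 1.0]]
y = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
intercept, coefs = lasso_fit(X, y, lam=0.1)
```

In this toy example the penalty leaves the second coefficient at exactly zero and shrinks the first slightly below its least-squares value of 2, which mirrors how a predictor such as the Twitter count can drop out of the selected model.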

Figure 3.

LASSO cross-validation curves and estimated coefficients of four diseases: (a) sleep apnea, (b) hemorrhoids, (c) disaccharidase deficiency, and (d) diabetes mellitus.
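The two vertical lines in a cross-validation plot of this kind correspond to two choices of lambda: the value with the lowest mean cross-validated error, and the largest value whose error stays within one standard error of that minimum. A minimal sketch of that selection rule follows; the function name and the numbers are hypothetical, and the inputs are assumed to come from a cross-validation run performed elsewhere.

```python
def select_lambdas(lambdas, mean_mse, se_mse):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambdas  -- candidate penalty values
    mean_mse -- mean cross-validated MSE for each candidate
    se_mse   -- standard error of that mean for each candidate
    """
    i_min = min(range(len(lambdas)), key=lambda i: mean_mse[i])
    threshold = mean_mse[i_min] + se_mse[i_min]
    # Largest (most regularized) lambda whose error is still under the threshold.
    lam_1se = max(lam for lam, m in zip(lambdas, mean_mse) if m <= threshold)
    return lambdas[i_min], lam_1se


# Made-up cross-validation summary (illustration only).
lambdas = [1.0, 0.5, 0.1, 0.05, 0.01]
mean_mse = [0.90, 0.30, 0.12, 0.10, 0.11]
se_mse = [0.10, 0.05, 0.03, 0.03, 0.03]

lam_min, lam_1se = select_lambdas(lambdas, mean_mse, se_mse)
```

On a log-scale lambda axis the one-standard-error choice lies to the right of the minimum-error choice, because it is the larger penalty and therefore the sparser model.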

Discussion

In this study, we investigated the correlation of search volume on Google, page view counts on Wikipedia, and disease mentioning frequency on Twitter with disease burden, measured by prevalence and treatment cost, for 1633 diseases over an 11-year period. The correlations between Twitter data and disease burden were not significant at the entire disease level, and only seven diseases exhibited high correlations between Twitter data and disease burden. The demographics of Twitter users may help us interpret this result. According to recent analyses of Twitter users, 54 percent of them earn more than US$50,000 per year²⁸ and 91 percent of them are under 30 years old.²⁹ Such a population has been shown to be associated with better health and lower spending on health.³⁰,³¹ Therefore, Twitter users might have fewer health issues than average Internet users and publish less on specific disease-related topics. The way we processed the Twitter data may also have contributed to this result: when deciding whether a Tweet mentioned a disease, we looked for an exact match of a disease synonym, which might introduce bias when the disease was misspelled, not mentioned, or mentioned using acronyms that were not included in the UMLS.
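The limitation of exact synonym matching can be illustrated with a short sketch. This is not the pipeline used in the study: the synonym list is a hypothetical stand-in for UMLS entries, and the case-insensitive, whole-phrase matching is an assumption about what "exact match" means.

```python
import re


def build_pattern(synonyms):
    """Compile one whole-phrase, case-insensitive pattern for a list of synonyms."""
    ordered = sorted(synonyms, key=len, reverse=True)  # prefer longer phrases
    alternatives = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def count_mentions(tweets, synonyms_by_disease):
    """Count tweets containing at least one exact synonym of each disease."""
    patterns = {d: build_pattern(s) for d, s in synonyms_by_disease.items() if s}
    counts = {d: 0 for d in synonyms_by_disease}
    for tweet in tweets:
        for disease, pattern in patterns.items():
            if pattern.search(tweet):
                counts[disease] += 1
    return counts


# Hypothetical synonym list and tweets (illustration only).
synonyms_by_disease = {"diabetes mellitus": ["diabetes mellitus", "diabetes"]}
tweets = [
    "My dad was diagnosed with diabetes mellitus last year",
    "diabtes is hard to manage",   # misspelling: missed by exact matching
    "T2DM checkup today",          # acronym absent from the list: missed
    "Nothing health related here",
]

counts = count_mentions(tweets, synonyms_by_disease)
```

Only the first tweet is counted here. The misspelled and acronym-only tweets are relevant to the disease but contribute nothing to the mentioning frequency, which is the bias described above.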

On the other hand, our analysis revealed that Google Search volume is much more robust for understanding the disease burden, especially for 39 diseases including viral hepatitis, diabetes mellitus, multiple sclerosis, sleep apnea, hemorrhoids, and disaccharidase deficiency. Out of the 1633 diseases, only 39 were listed as strongly correlated with Google Search volume because (1) the correlations across the entire disease landscape (ranging from 0.18 to 0.30) indicated that the number of correlated diseases was low; (2) we applied a strict filter requiring that the Google Search volume be correlated with the disease burden within the same year, with a 1-year gap, and with a 2-year gap; and (3) we had to remove some of the high-correlation diseases because either Google Search volume or disease burden data were missing.
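The strict filter in point (2), combined with the missing-data exclusion in point (3), can be sketched as a simple predicate. The threshold of 0.75 is a hypothetical placeholder because the cut-off is not restated in this section, and comparing absolute values (so that strong negative correlations also pass) is likewise an assumption.

```python
def passes_strict_filter(corr_by_gap, threshold=0.75):
    """Return True if the correlation is strong at the 0-, 1-, and 2-year gaps.

    corr_by_gap maps a gap in years to a Pearson coefficient, or to None
    when the search volume or burden data needed for that gap is missing.
    The default threshold is a hypothetical value, not one taken from the study.
    """
    required_gaps = (0, 1, 2)
    values = [corr_by_gap.get(gap) for gap in required_gaps]
    if any(v is None for v in values):
        return False  # missing data at any required gap excludes the disease
    return all(abs(v) >= threshold for v in values)


# Made-up coefficients (illustration only).
kept = passes_strict_filter({0: 0.91, 1: 0.88, 2: 0.84})
dropped = passes_strict_filter({0: 0.91, 1: 0.60, 2: 0.85})
```

Requiring all three gaps to pass is what makes the filter strict: a disease with one weak or missing gap is dropped even if the other two correlations are high.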

In addition to the computed correlations, the LASSO regression analysis showed that the Internet data sources, disease burden, number of scientific articles from PubMed, number of clinical trials, and funding from the NIH vary in their power to predict future disease burdens.

However, our analysis is limited to prevalence and treatment cost and does not cover other measures of disease burden, owing to data availability and comparability. The findings also caution us not to over-generalize when estimating disease burdens for the purpose of understanding population health, formulating health policies, or planning resource allocation. Instead, we should consider each individual disease according to its characteristics, such as its acute/chronic nature, severity, familiarity to the public, and the presence of stigma.

Conclusion

Estimating disease burden using Internet usage data is automated and cost-efficient. This study demonstrated the robustness and feasibility of understanding disease prevalence and treatment cost with Google Search volume and Wikipedia page view count. Further research is necessary to compare Internet usage data with other disease burden measures and to adjust the estimation according to the characteristics of specific diseases.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Australian Institute of Health and Welfare (AIHW). Burden of disease, 2016, http://www.aihw.gov.au/burden-of-disease/

2. Paez, Zhao, Hwang. Rising out-of-pocket spending for chronic conditions: a ten-year trend. Health Aff 2009; 28(1): 15–25.

3. Laing, Swerdlow, Slater, et al. Mortality from heart disease in a cohort of 23,000 patients with insulin-treated diabetes. Diabetologia 2003; 46(6): 760–765.

4. Corbett, Watt, Walker, et al. The growing burden of tuberculosis: global trends and interactions with the HIV epidemic. Arch Intern Med 2003; 163(9): 1009–1021.

5. Feigin, Forouzanfar, Krishnamurthi, et al. Global and regional burden of stroke during 1990-2010: findings from the Global Burden of Disease Study 2010. Lancet 2014; 383: 245–254.

6. Melse, Essink-Bot, Kramers, et al. A national burden of disease calculation: Dutch disability-adjusted life-years. Dutch Burden of Disease Group. Am J Public Health 2000; 90(8): 1241–1247.

7. Mathers, Vos, Stevenson, et al. The burden of disease and injury in Australia. Bull World Health Organ 2001; 79(11): 1076–1084.

8. Thacker, Stroup, Carande-Kulis, et al. Measuring the public’s health. Public Health Rep 2006; 121(1): 14–22.

9. Michaud, Murray, Bloom BR. Burden of disease—implications for future research. JAMA 2001; 285(5): 535–539.

10. McGinnis, Foege WH. Actual causes of death in the United States. JAMA 1993; 270(18): 2207–2212.

11. Murray CJ. Quantifying the burden of disease: the technical basis for disability-adjusted life years. Bull World Health Organ 1994; 72(3): 429–445.

12. Mason, Bridgwood. Methods of Collecting Morbidity Statistics. Revised Report to the Eurostat Task Force on ‘Health and Health-Related Survey Data’. London: Office for National Statistics, 1997.

13. Claveria, Duarte, Sevillano, et al. Prevalence of Parkinson’s disease in Cantalejo, Spain: a door-to-door survey. Mov Disord 2002; 17(2): 242–249.

14. Benito-Leon, Bermejo-Pareja, Rodriguez, et al. Prevalence of PD and other types of parkinsonism in three elderly populations of central Spain. Mov Disord 2003; 18(3): 267–274.

15. Errea, Ara, Aibar, et al. Prevalence of Parkinson’s disease in lower Aragon, Spain. Mov Disord 1999; 14(4): 596–604.

16. Ginsberg, Mohebbi, Patel, et al. Detecting influenza epidemics using search engine query data. Nature 2009; 457(7232): 1012–1014.

17. Moat, Curme, Avakian, et al. Quantifying Wikipedia usage patterns before stock market moves. Sci Rep 2013; 3: 1801.

18. Signorini, Segre, Polgreen. The use of Twitter to track levels of disease activity and public concern in the U.S. PLoS ONE 2011; 6(5): e19467.

19. Yao, Ghosh, et al. Health ROI as a measure of misalignment of biomedical needs and resources. Nat Biotechnol 2015; 33(8): 807–811.

20. Schuyler, Hole, Tuttle, et al. The UMLS metathesaurus: representing different views of biomedical concepts. Bull Med Libr Assoc 1993; 81(2): 217–222.

21. Adamson, Chang, Hansen LG. Health research data for the real world: the MarketScan databases. New York: Thompson Healthcare, 2008.

22. Denny, Ritchie, Basford, et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 2010; 26(9): 1205–1210.

23. Weisstein. Bonferroni correction. Wolfram Research, Inc, 2004.

24. Holm. A simple sequentially rejective multiple test procedure. Scand J Stat 1979; 6(2): 65–70.

25. Tibshirani. Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met 1996; 58: 267–288.

26. Friedman, Hastie, Tibshirani. glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. package version 1.5.2. 2011, http://CRAN.R-project.org/package=glmnet

27. Hartung, Bourdette, Ahmed, et al. The cost of multiple sclerosis drugs in the US and the pharmaceutical industry: too big to fail. Neurology 2015; 84(19): 2185–2192.

28. Aslam. Twitter by the numbers: stats, demographics & fun facts, 2017, https://www.omnicoreagency.com/twitter-statistics/

29. Sloan, Morgan, Burnap, et al. Who tweets? Deriving the demographic characteristics of age, occupation and social class from Twitter user meta-data. PLoS ONE 2015; 10(3): e0115545.

30. Centers for Disease Control and Prevention, 2012, https://www.cdc.gov/media/releases/2012/p0516_higher_education.html

31. Dieleman, Baral, Birger, et al. US spending on personal health care and public health, 1996–2013. JAMA 2016; 316(24): 2627–2646.