Abstract
Google Trends (GT) data could potentially supplement traditional crime measurement strategies, allowing criminologists to better understand crime statistics on a macro level. This study assesses the validity of GT crime estimates. The findings indicate that GT data are reliable for estimating MVT, larceny, and rape. Additionally, we illustrate how to use GT to identify places with high rates of unreported offenses. The results of this study demonstrate the feasibility of leveraging open-source big data such as GT to supplement traditional sources of crime data, particularly for categories of crime with substantial underreporting rates. Results suggest the GT rape measure may be a more accurate estimate of the true incidence of rape than the measure drawn from the Uniform Crime Report (UCR). The limitations associated with the use of GT to generate estimates of crime are also discussed.
Introduction
“Measuring crime has always been one of the most difficult challenges facing criminal justice researchers” (Maxfield, 1999, p.119). One key issue is that most crime estimates understate the total crime committed each year. Prior research has demonstrated that a variety of factors, including gender, race, wealth, education, the nature of the crime, and the police response to previous victimization, influence victims’ decisions to file a police report (Avakame et al., 1999; Black, 1970; Fyfe et al., 1997; Hindelang, 1974; Laub, 1981; O’Brien, 1996; Skogan, 1984; Xie et al., 2006). In addition, certain categories of crime victims, such as undocumented immigrants (Gutierrez and Kirk, 2017), sex workers (McBride et al., 2020), and drug dealers (Topalli et al., 2002), may not wish to contact law enforcement out of concern for their own legal status.
The underreporting phenomenon extends to distinct types of victimization, with varying effects. Existing research suggests that homicide and auto theft are the two crimes most frequently reported to the police. This may be because these crimes are more severe (the loss is greater in homicide and motor vehicle theft), because they are easier to discover than other crimes (a body or a license plate), and because insurance coverage requirements play a crucial part in reporting behavior (Black, 1970; Goudriaan et al., 2004; Maxfield, 1999; Skogan, 1977, 1984). On the other hand, rape is known to suffer from the largest degree of underreporting. For example, according to the Bureau of Justice Statistics (2019), only 24.9% of rape victims reported the crime to police in 2018, compared to 78.6% for motor vehicle theft (MVT), 47.9% for burglary, and 28.6% for larceny. The reasons for underreporting of rape include the loss of privacy, the risk of recrimination and reprisal, familiarity between offender and victim (e.g. victimization by relatives or friends), shame at others learning of the assault, difficulty in accessing evidence, and the risk of being “no crimed” by police (Allen, 2007; Gregory and Lees, 1996; Hine et al., 2021; Thomas and Wolff, 2021). Studies have shown that Native Americans are more likely to experience sexual assault than others due to the unfortunate combination of colonization, patriarchy, political status, and racism that exists in tribal communities (Bohn, 2003; Tjaden and Thoennes, 2006; Quasius, 2008). As a result, the Federal Bureau of Investigation’s Uniform Crime Report (UCR), which contains information about offenses “known to the police,” has inherent limitations in estimating the true incidence of crime.
Despite these shortcomings, scholars have continued to investigate the dependability and validity of UCR data. Prior research has compared UCR data to the National Crime Victimization Survey (NCVS) and to other local crime data, such as arson statistics from local fire departments. These studies found that UCR estimates of homicide, MVT, robbery, and burglary may be considered valid metrics of criminal behavior (Blumstein et al., 1991; Gove et al., 1985; O’Brien et al., 1980). On the other hand, they were unable to validate the measures of rape, larceny, aggravated assault, arson, and serious violence (Blumstein et al., 1991; Gove et al., 1985; Jackson, 1988; Lauritsen et al., 2016; O’Brien et al., 1980; Skogan, 1977). O’Brien (1996) and Lauritsen et al. (2016) examined the validity of violent crime measures and concluded that UCR homicide is a more valid indicator of violent crime. They contended that if one indicator of violent crime is more strongly connected with UCR homicide rates than another, the former is more valid.
The global prevalence of the internet enables the use of big data in social science research. According to research, internet-based and user-generated data are more accessible and less expensive than traditional social surveys (Salganik, 2017). Over 89% of adults in the United States reported using the internet (Pew Research Center, 2019). Additionally, internet users are distributed similarly to the broader population in terms of ethnicity, wealth, gender, and age (Pew Research Center, 2019; Stubbs-Richardson et al., 2018). Furthermore, 91% of internet users use search engines, with the vast majority (87.7%) using Google to perform their searches (Pew Research Center, 2012; Statcounter, 2021). Rowlands et al.’s (2008) research on the impact of Google on education concluded that the influence of search engines like Google extends beyond young people and students to parents and teachers. The evidence showed that people of all ages have changed the way they seek information: “We are all the Google Generation.”
Due to the extensive use of Google Search, a massive amount of data is recorded during this process. As a result, this data may be a precious resource for researchers interested in a variety of subjects. Google Trends (GT) is a Google Search extension that collects data on users’ time, location, and keywords. GT generates data on the search rates for keywords based on the acquired data to predict their users’ search interest. First, users enter a term on the GT website. Next, GT estimates the term’s “search interest” in a geographic area by dividing the number of search queries for that term by the total number of Google searches in that geographic area. From this data, GT calculates a number (scaled from 0 to 100) that represents a certain geographic area’s search interest relative to all other geographic areas. The area with the highest search interest is assigned a score of 100. The values for the other areas (99, 81, 76,…25) are based on the ratio of each region’s search rate to the area with the most searches using the designated phrases (for further discussion of GT method, see Arora et al., 2019; Mavragani and Ochoa, 2019).
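The scaling described above can be illustrated with a minimal sketch. It assumes raw search counts per area are available, which GT does not actually expose, and it ignores sampling and low-volume suppression. The function name and the example counts are hypothetical.

```python
def relative_search_interest(term_searches, total_searches):
    """Scale per-area search shares to 0-100, relative to the top area."""
    # Share of all searches in each area that match the term.
    shares = {area: term_searches[area] / total_searches[area]
              for area in term_searches}
    top = max(shares.values())
    if top == 0:
        # No area searched for the term, so every score is 0.
        return {area: 0 for area in shares}
    # The area with the largest share scores 100; others are ratios to it.
    return {area: round(100 * share / top) for area, share in shares.items()}


# Hypothetical counts for three areas (not real GT data).
term = {"A": 50, "B": 30, "C": 10}
total = {"A": 1000, "B": 1000, "C": 400}
scores = relative_search_interest(term, total)
```

Area C has the fewest matching searches but also the smallest search total, so its score reflects the share of searches rather than the raw count.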
Although its utility has not been thoroughly explored, some studies have demonstrated the value of GT data in social science research. For instance, academics found that when internet users search with a practical goal, the resulting GT data tend to match real-world survey data (Ginsberg et al., 2009; Mellon, 2013). Moreover, scholars have found a strong association between GT data on religion and existing data on religious composition in the United States and Africa (Adamczyk et al., 2019, 2021a, 2021b; Scheitle, 2011). Other scholars have found that GT data accurately predict future heroin emergency room visits, influenza patterns, methamphetamine-related crimes, premature deaths from alcohol, narcotics, and suicide, as well as racial animus toward a black candidate (Gamma et al., 2016; Parker et al., 2016; Stephens-Davidowitz, 2014; Xu et al., 2017; Young et al., 2018). Most recently, GT data have been used to estimate the prevalence of COVID-19 over time in Iran (Ayyoubzadeh et al., 2020) and to assess the effect of COVID-19 related lockdowns on individuals’ wellbeing (Brodeur et al., 2021). Importantly, however, research leveraging GT data to probe criminal justice issues is limited at this time. Gross and Mann (2017) created a measure of concern about police violence using Google search queries and found that this measure is positively associated with the violent crime rate. Stubbs-Richardson et al. (2018) calculated the popularity of search phrases such as “vehicle alarm system” and “house alarm system” to represent individuals’ crime prevention searches and found a positive correlation between the relative volume of these queries and the rates of MVT and burglary captured in the UCR. More recently, Piña-García and Ramírez-Ramírez (2019) employed GT data and official crime data to explore the incidence of criminal activity in Mexico City. Hoehn-Velasco et al. (2021) examined whether COVID-19 stay-at-home orders affected crimes targeting women (sexual offenses and domestic violence). Anderberg et al. (2022) observed a 40% surge in internet search-based domestic violence during the apex of the lockdown period in Greater London, which was seven to eight times more than the rise in police-recorded statistics and much closer to the increase in helpline calls reported by victim support organizations. This review of existing research makes clear that while researchers have begun to assess the potential utility of Google Search data in criminal justice research, much is left to be explored.
Current study
Despite the existing efforts to incorporate GT data into social science research, the feasibility of using GT to provide information about crime victimization and to generate crime estimates has not been evaluated. As a result, the current study examines the feasibility of using GT data to estimate crime victimization rates in metropolitan areas throughout the United States. The current study’s principal objectives are to assess the potential utility of GT data to estimate the occurrence of crime and investigate how these estimates might augment official crime reports and inform criminal justice policies and practices.
To our knowledge, this is the first study to use big data from GT to estimate criminal victimization at the metropolitan area level (i.e. designated market area [DMA]) across the United States. Additionally, this is the first study to examine the relationship between widely reported structural attributes and reported crime rates using Google Trends data.
Methods and data
Comparing two measures of crime—GT and UCR crime rates
We conduct several statistical tests of concurrent validity for the GT estimates of crime to determine their utility as a measure of crime victimization (Churchill, 1979). These statistical tests are correlation analysis (Churchill, 1979; O’Brien et al., 1980; Scheitle, 2011), exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and multiple linear regression with common crime covariates (Bergkvist and Rossiter, 2007; Churchill, 1979; O’Brien et al., 1980; Strauss and Smith, 2009).
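As an illustration of the first of these tests, the sketch below computes a Pearson correlation between two crime measures observed over the same set of DMAs. The function name and the five data points are hypothetical, and the sketch omits the significance test that would accompany the coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 0-100 scores for five DMAs (not real GT or UCR values).
gt_scores = [100, 80, 60, 40, 20]
ucr_scores = [90, 100, 55, 45, 10]
r = pearson_r(gt_scores, ucr_scores)
```

A coefficient close to 1 would indicate that the two measures rank the DMAs in a similar way.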
To begin, we assume that DMAs have a “true” crime rate denoted by the variable
For the correlation analysis, it would make sense that
Without previously valid rape and larceny data from UCR, we cannot validate rape and larceny crime measures from GT. Lauritsen et al. (2016), as well as O’Brien (1996), argued that because the homicide numbers at UCR are exceptionally well-measured, we can use
Regarding EFA and CFA tests, we suggest that a need-to-be-valid measure of violent/property crime should be able to be grouped with other already-valid measures of violent/property crime in UCR, such as homicide, MVT, robbery, and burglary. As scholars often divide crimes into two categories: property and violent crime (Fox et al., 2010; Matthews et al., 2011; Zhao et al., 2002), we propose that there are two latent variables: property crime rates and violent crime rates in the EFA and CFA test.
Following this logic, we assess the need-to-be-valid property crime measure from
Additionally, we use multiple linear regression (MLR) analysis to determine the nomological validity (O’Brien et al., 1980) of the crime estimates. The outcome variables are drawn from
Designated market area as the unit of analysis
The nature of the GT data determines the unit of analysis in this study. GT’s smallest geographic unit across the United States is the designated market area (DMA). In GT, 210 DMAs cover all 3142 US counties. Because the analytical unit is DMA, it is necessary to aggregate other county-level data sources to DMA to align them with GT data. Also, by aggregating the UCR data from county-level to a larger geographic level of DMA, we can avoid the drawbacks Maltz and Targonski (2002) mentioned in the county-level UCR data. We use Sood’s (2016) county-DMA crosswalk file because it is the only publicly available county-DMA crosswalk file and it has already been widely cited in peer-reviewed research (Obermeier et al., 2022; Sides et al., 2022; Sood and Iyengar, 2016).
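The county-to-DMA aggregation can be sketched as a simple lookup and sum. This is a minimal illustration, assuming the crosswalk has been loaded into a dictionary from county identifier to DMA; the identifiers, DMA labels, and counts are made up, and the crosswalk file's actual column layout is not assumed here.

```python
from collections import defaultdict


def aggregate_to_dma(county_values, crosswalk):
    """Sum county-level counts within each DMA using a county-to-DMA map."""
    totals = defaultdict(int)
    for county, value in county_values.items():
        dma = crosswalk.get(county)
        if dma is None:
            # County missing from the crosswalk; skipped in this sketch.
            continue
        totals[dma] += value
    return dict(totals)


# Hypothetical crosswalk and county crime counts.
crosswalk = {"county_1": "DMA_X", "county_2": "DMA_X", "county_3": "DMA_Y"}
county_crimes = {"county_1": 120, "county_2": 80, "county_3": 40, "county_9": 5}
dma_crimes = aggregate_to_dma(county_crimes, crosswalk)
```

Counts and populations should both be summed this way before any rate is computed, so that the DMA rate is not an unweighted average of county rates.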
Two sets of crime estimates—GT and UCR
The study’s focal measures come from two unique sources: first, GT data, then UCR crime data. The current study uses GT and UCR data from 2010 to 2019. We omit data before 2010 since internet availability reached over 75% of the US population in 2010 (Pew Research Center, 2019). The end year is 2019 because it is the final year for which county-level UCR data is provided. The steps we took to get criminal victimization estimates from GT are detailed below.
To obtain the “estimate of victimization” from GT, we need the words that victims use in their online searches. We start with a simple term like “my car was stolen” to find similar or frequently used keywords that victims enter in Google. Google Trends has a “related queries” section that lists the top 25 queries related to the keyword entered. For instance, a search for “my car was stolen” returns similar phrases such as “my car was stolen, what do I do?”, “find my car,” and “report stolen car.” We then combine these terms (using “+”) if they are associated with a victim searching for information online, or exclude them (using “−”) if they are not. Thus, each type of crime has a single representative search phrase on GT.
There are two main reasons for combining “related queries” into a single phrase, as opposed to searching for different terms each time and combining the separate GT results into a single dataset. First, the latter method would leave us without sufficient data points for analysis, because the greater the number of input keywords in a phrase, the more DMA data points we can acquire from GT. Second, the use of related queries can mitigate potential bias caused by the researcher’s own selection of arbitrary words.
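A combined phrase of this kind could be assembled as in the sketch below. The helper name is hypothetical, the excluded term "game" is an invented example rather than one of the study's terms, and the exact spacing GT expects around its operators is an assumption (an ASCII hyphen stands in for the minus sign).

```python
def build_gt_query(include_terms, exclude_terms=()):
    """Join victim-oriented terms with '+' and append '-term' exclusions."""
    included = " + ".join(include_terms)
    excluded = " ".join("-" + term for term in exclude_terms)
    return (included + " " + excluded).strip()


# Included phrases come from the example in the text; "game" is hypothetical.
query = build_gt_query(
    ["my car was stolen", "report stolen car"],
    exclude_terms=["game"],
)
```

Here `query` holds one string that can be entered as a single GT search, which is what allows GT to pool the volume of all included phrases.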
The keywords we use to estimate multiple crime types from GT are mentioned below. The negative symbol (−) excludes unrelated terms from “related queries” in GT. The plus sign (+) is also used to link terms. These terms are used to gather estimates of
Terms to pull
Terms to pull
Terms to pull
Terms to pull
UCR crime measures are obtained from the Inter-University Consortium for Political and Social Research (Kaplan, 2021). The crime types we employ in the current study are MVT, burglary, larceny, homicide (murder and manslaughter combined), robbery, and rape. Using the UCR data, we aggregated the counts of each crime category from the county level to the DMA level, then divided each DMA’s total number of crimes by its total population and multiplied by 100,000 to obtain the crime rate. Finally, to align UCR crime rates with GT estimates (scaled 0–100), we employ a similar procedure to convert crime rates into relative crime rates for each DMA. The DMA with the highest crime rate as measured by the UCR is assigned a value of 100. Other DMAs receive relative values (e.g. 93, 80, 68…13) based on the ratio of each area’s crime rate to the highest crime rate in the UCR data.
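The two steps, a rate per 100,000 residents followed by rescaling against the highest-rate DMA, can be sketched as follows. Function names, DMA labels, and figures are hypothetical, and rounding to whole numbers is an assumption made to mirror GT's integer scores.

```python
def crime_rate_per_100k(crime_count, population):
    """Crimes per 100,000 residents."""
    return crime_count * 100_000 / population


def rescale_to_top(rates):
    """Express each rate as a 0-100 value relative to the highest rate."""
    top = max(rates.values())
    return {dma: round(100 * rate / top) for dma, rate in rates.items()}


# Hypothetical DMA-level counts and populations.
counts = {"DMA_X": 500, "DMA_Y": 300, "DMA_Z": 90}
population = {"DMA_X": 250_000, "DMA_Y": 200_000, "DMA_Z": 100_000}
rates = {d: crime_rate_per_100k(counts[d], population[d]) for d in counts}
relative = rescale_to_top(rates)
```

After rescaling, only the ordering and ratios between DMAs are retained, which is what makes the UCR values comparable to the GT scores.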
Notably, the GT data have missing values that must be addressed. When the search volume for a query in a DMA falls below a threshold (in terms of total searches for a specific term), GT removes that geographical unit from the results. This is because the volume of total searches for the keywords in that region is insufficient to produce credible estimates (Trends Help, 2023). When we enter our list of victimization keywords into GT, the GT rape estimates are available for only 107 DMAs, which limits our sample to these 107 DMAs. While they account for only 51% of all 210 DMAs, these 107 areas cover 86.87% of the United States’ population (see Supplemental Table 1).
Common covariates of crime rates
In the early 20th century, the Chicago School of Sociology, led by Robert Park, examined the relationship between crime and the urban environment (Park et al., 1967 [1925]). Their research revealed that social disorganization, represented by economic disadvantage, social mobility, and the percentage of immigrants, was associated with crime rates in the inner city (Park et al., 1967 [1925]; Shaw and McKay, 1972 [1942]). Following this path, criminologists studying the association between crime and place have found that common covariates of crime rates are concentrated disadvantage (Sampson et al., 2008; Sampson and Groves, 1989; Wilson, 1987), social mobility (Lane and Meeker, 2004; South and Messner, 2000; Xie and McDowall, 2008), racial and ethnic heterogeneity (Avison and Loring, 1986; Blau, 1977; South and Messner, 1986), the percentage of foreign-born individuals (Butcher and Piehl, 1998; Reid et al., 2005; Wadsworth, 2010), the percentage of divorced individuals (Cáceres-Delpiano and Giolito, 2012; Sampson, 1986), the percentage of young males (Farrington, 1986; Moffitt, 1993), and drug mortality rate (Rajkumar and French, 1997).
The multiple linear regression (MLR) models presented below are used to assess the association between the GT and UCR data and the common covariates of crime rates in order to further investigate these relationships. The aim of MLR is not to determine whether the social disorganization theory is better suited to explain crime rates, but to utilize common crime covariates to evaluate the validity of GT and UCR. The independent variables in the MLR models are previously studied covariates of crime rates. These measures are drawn from three other sources of publicly accessible data: the American Community Survey (ACS), the National Center for Health Statistics (Centers for Disease Control and Prevention, 2020), and the Simply Analytics database (Simply Analytics, 2019).
The concentrated disadvantage (CD) index is created by combining four standardized measures drawn from the ACS (Census Bureau, 2020): the percentage of the population aged 16 years and older who are unemployed, the percentage of families whose income is below the poverty level, the percentage of female-headed households with children aged under 18 years, and the percentage of the population aged over 25 years with less than a high school education. The mobility index is created by standardizing and combining the percentage of people who moved within the last year and the proportion of renters. To create a heterogeneity index, we calculate the probability that two randomly selected individuals in a DMA are of different races (Blau, 1977). The other three predictor variables (the percentage of foreign-born individuals, the percentage of divorced individuals, and the percentage of young males aged 15–24 years) are also derived from ACS county-level data, which are aggregated to the DMA level. The measure of drug mortality rates is drawn from the National Center for Health Statistics’ (NCHS) county-level data (Centers for Disease Control and Prevention, 2020).
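The index construction can be sketched under two stated assumptions: the heterogeneity index takes the usual form of one minus the sum of squared group shares, and the standardized measures are combined by summing z-scores (the text says they are combined, not how). Function names and inputs are hypothetical.

```python
from statistics import mean, pstdev


def blau_index(group_counts):
    """Probability that two randomly drawn people belong to different groups."""
    total = sum(group_counts)
    return 1 - sum((count / total) ** 2 for count in group_counts)


def standardized_index(*measures):
    """Sum of z-scores across measures, one value per area.

    Summing (rather than averaging) and using the population standard
    deviation are assumptions of this sketch.
    """
    z_columns = []
    for values in measures:
        m, s = mean(values), pstdev(values)
        z_columns.append([(v - m) / s for v in values])
    return [sum(zs) for zs in zip(*z_columns)]


# Hypothetical inputs: two equally sized groups, and two measures for three areas.
heterogeneity = blau_index([50, 50])
index = standardized_index([1, 2, 3], [10, 20, 30])
```

Standardizing first keeps a measure with a wide numeric range from dominating the composite.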
Finally, to account for differences in the underlying population and potential differences in access to the internet, we include an estimate of the total population drawn from the ACS, and an estimate of the percentage of homes with internet access and the median number of vehicles per household drawn from the MRI Consumer Survey (Simply Analytics, 2019). Table 1 provides summary statistics for all measures used in the current study.
Descriptive statistics.
N = 107 DMAs.
HH: Household.
Results
Correlation analysis
As aforementioned, the previously validated measures of
Correlation of crime measures.
N = 107 DMAs.
*p < 0.05. **p < 0.01.

Correlation matrix of crime measures.

Heatmap of violent crimes from UCR and GT.

Heatmap of property crimes from UCR and GT.
Our first focus is on the result of property crimes. Table 2 shows that
Related to violent crime, Table 2 shows that
Exploratory factor analysis (EFA)
The two-factor analysis results of EFA test are shown in Table 3. The latent variable Factor 1 has high positive loadings from items for
Exploratory factor analysis.
N = 107 DMAs; χ2 = 30.29, p < 0.01; RMSR = 0.06; RMSEA = 0.107; TLI = 0.894; Loading cutoff = 0.6; The grey text color represents the loadings under 0.6.
Confirmatory factor analysis (CFA)
The results of CFA models are shown in Table 4 (property crime) and Table 5 (violent crime). Among the first four property crime CFA models in Table 4, except for
Confirmatory factor analysis—property crimes.
N = 107 DMAs; Loading cutoff = 0.6; The grey text color represents the loadings under 0.6. Because the degrees of freedom are 0 in models 1 to 4, these are “just identified models,” and we cannot assess the model fit (which is why the RMSEA is 0 and the TLI is 1). Thus, we perform an extra CFA test in model 5 (OARC Stats, n.d.). Overall, an RMSEA smaller than 0.05 and a TLI close to or greater than 0.95 indicate a good fit for CFA models (Harrington, 2009, p.52). However, the purpose of this CFA test is to see how
RMSEA: root mean square error of approximation; TLI: Tucker–Lewis index.
Confirmatory factor analysis—violent crimes.
N = 107 DMAs; Loading cutoff = 0.6; The grey text color represents the loadings under 0.6. Because the degrees of freedom are 0 in models 1 to 2, these are “just identified models,” and we cannot assess the model fit (which is why the RMSEA is 0 and the TLI is 1). Thus, we perform an additional CFA test in model 3 (OARC Stats, n.d.). Overall, an RMSEA smaller than 0.05 and a TLI close to or greater than 0.95 indicate a good fit for CFA models (Harrington, 2009, p.52). However, the purpose of this CFA test is to see how
RMSEA: root mean square error of approximation; TLI: Tucker–Lewis index
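The zero degrees of freedom noted under Tables 4 and 5 follow from a simple count, sketched below for a one-factor model with the factor variance fixed at 1. That the just-identified models here are one-factor models with three indicators is an assumption of this sketch, not something stated in the table notes; the function name is hypothetical.

```python
def one_factor_cfa_df(n_indicators):
    """Degrees of freedom of a one-factor CFA with factor variance fixed at 1."""
    p = n_indicators
    # Unique variances and covariances among the observed indicators.
    known = p * (p + 1) // 2
    # Free parameters: p factor loadings plus p residual variances.
    free = 2 * p
    return known - free


df_three = one_factor_cfa_df(3)
df_four = one_factor_cfa_df(4)
```

With three indicators the model uses up all available information, so fit indices such as RMSEA and TLI carry no evidence about fit; a fourth indicator leaves degrees of freedom for testing.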
Multiple linear regression analysis
The results of multiple linear regression analysis are shown in Table 6, and a summary of the results is shown in Table 7. To present the results of our multiple linear regression models more clearly, we display the
Multiple linear regression models for common crime covariates on crime measures.
N = 107 DMAs.
HH: household; CD: concentrated disadvantage; RMSE: Root Mean Squared Error; MAE: Mean Absolute Error.
*p < 0.05. **p < 0.01.
Summary of multiple regression results.
“+” = significant positive association; “−” = significant negative association.
CD: concentrated disadvantage.
For the first set of comparisons, the results indicate that
For the measures of larceny,
For rape, the model fit of
Finding DMAs which may suffer from underreporting in UCR
Finally, we aim to demonstrate how
First, we correlate

Underreported areas of
Discussion and conclusion
Our correlation analyses (see Table 2) suggest that
The findings of EFA and CFA provide additional evidence that measures from
Our MLR model results further reaffirm our reservations about the use of official crime statistics in assessing common variables of crime when the dark figure of crime is high. It is problematic to use official rates of underreported crime types to determine their association with broadly accepted structural factors. We discovered the significant predictors of
For GT data to be useful in crime estimation, our study makes certain underlying assumptions. To begin, the internet has permeated practically every adult’s life in the United States, and the majority of internet users rely on Google. As a result, victims of crime use Google to look up solutions to crime situations. Second, regardless of the reasons for not reporting a crime to the police, these reasons would not prohibit victims from “googling” solutions. Offense victims may opt to “google” remedies to a variety of crimes, from MVT to rape, regardless of whether they report the crime to the police. If researchers can develop a consistent and effective approach to extract
With these assumptions, we may argue that
Overall, we can conclude that
Although these findings are encouraging in addressing the dark figure of crime, several limitations in the use of GT as a tool for studying crime should be considered. First, although GT’s “topic” keyword search includes multiple languages in the data, we used “terms” to obtain “victim-oriented” data. Our data are therefore limited to English-language users in the United States and cannot be generalized to populations of non-English users. According to data from the Census Bureau (Dietrich and Hernandez, 2022), approximately one in five (67.8 million) Americans spoke a language other than English at home in 2022. Researchers who wish to use GT data without a “topic” keyword search should keep this limitation in mind.
Second, some victims cannot engage in search behavior due to the nature of the victimization. For instance, victims who lose their laptop or cell phone are unable to perform an online search. The same applies to victims of severe attacks or mental disorders who are incapable of typing or searching. Similarly, we acknowledge that there are regional differences in people’s trust in law enforcement organizations (Taylor et al., 2015), the classification of crimes (Wells and Weisheit, 2004), and the application of the law (Beattie, 1960); therefore, additional research is required to determine how these regional differences affect victims’ online help-seeking searches.
Third,
Fourth, the
Fifth, the cross-sectional approach in the current study cannot confirm a causal relationship between the covariates and crime outcomes, and we do not assess whether GT data captures fluctuations in crime over time. Sixth, not all types of crime victimization may be found in GT data. For example, homicide victims cannot perform Google searches. In the future, when the GT data is available at the DMA level for a time-series analysis, the authors will continue to investigate the validity of GT.
Seventh, GT data are sensitive to the keywords that researchers use to obtain data. A subtle difference in keywords may lead to different output data. For example, the present study uses a single keyword, “burglary,” in combination with other victimization terms to obtain
Eighth, although it is plausible to use the 0–100 range to represent the data in the search, the percentage cannot represent the total count of search volumes or crime rates. Therefore, we only know the ranking and the relative seriousness of crime among DMAs, and we do not know the actual incidence of crime in a given area. Our concept of
Finally, GT should not and cannot be considered a scientific survey (Trends Help, 2023). The primary issue is that its data collection method was not designed for scientific investigation. Additionally, GT only contains the online search data of people with internet, computer, or mobile phone access, so the data are likely skewed toward populations that use the internet and search engines more than others, which increases systematic error. To address the discrepancy in internet access, we incorporated “the percentage of internet usage” as a control variable in our MLR model and found that its inclusion did not affect the results presented (Supplemental Table 4).
This paper is a bold attempt to use GT as an alternative source to generate crime data, and the exploratory nature of this paper generates hope and doubts. Despite the limitations and unsolved issues present in the current study, we are convinced that GT, if effectively utilized, may be a valuable tool for understanding the reality of crime problems. As
We do not claim that
This study explored how
Research Data
sj-csv-2-mio-10.1177_20597991231183962 – Supplemental material for Big data in crime statistics: Using Google Trends to measure victimization in designated market areas across the United States
Supplemental material, sj-csv-2-mio-10.1177_20597991231183962 for Big data in crime statistics: Using Google Trends to measure victimization in designated market areas across the United States by Yu-Hsuan Liu, Kevin T Wolff and Tzu-Ying Lo in Methodological Innovations
Supplemental Material
sj-docx-1-mio-10.1177_20597991231183962 – Supplemental material for Big data in crime statistics: Using Google Trends to measure victimization in designated market areas across the United States
Supplemental material, sj-docx-1-mio-10.1177_20597991231183962 for Big data in crime statistics: Using Google Trends to measure victimization in designated market areas across the United States by Yu-Hsuan Liu, Kevin T Wolff and Tzu-Ying Lo in Methodological Innovations
Footnotes
Acknowledgements
The authors express their gratitude to Dr. Amy Adamczyk, Dr. Michael Maxfield, and Dr. Chongmin Na for their valuable advice during the initial phase of this manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
