
Abstract

Large-scale data from private companies offer new opportunities to examine topics of scientific and social significance, such as racial inequality, partisan polarization, and activity-based segregation. However, because such data are often generated through automated processes, their accuracy and reliability for social science research remain unclear. The present study examines how quality issues in large-scale data from private companies can afflict the reporting of even ostensibly uncomplicated values. We assess the reliability with which an often-used device tracking data source, SafeGraph, sorted data it acquired on financial institutions into categories, such as banks and payday lenders, based on a standard classification system. We find major classification problems that vary by type of institution, and remarkably high rates of unidentified closures and duplicate records. We suggest that classification problems can affect research based on large-scale private data in four ways: detection, efficiency, validity, and bias. We discuss the implications of our findings, and list a set of problems researchers should consider when using large-scale data from companies.

Keywords

big data, algorithms, automated data generation, audit, financial establishments, classification

Introduction

In recent years, researchers have increasingly exploited large-scale data from private companies to generate social science, studying issues such as disparities in COVID-19 infection rates (Chang et al., 2021), gender differences in labor force participation (Hansen et al., 2022), heterogeneity in consumer preferences (Athey et al., 2018), political polarization (Chen & Rohla, 2018), and neighborhood isolation (Prestby et al., 2020). However, many such data are automatically generated by algorithms whose details are unknown to researchers. And because the companies produce the data not to generate science but to pursue their own interests, the data’s accuracy and reliability remain in question (Grigoropoulou & Small, 2022; Zhao et al., 2019).

The present study probes the extent to which quality issues in large-scale data from private companies can undermine social science research. It focuses on the classification problem—how companies sort data inputs into variables and values—and considers a best-case scenario: when a company, instead of creating its own classification scheme, simply sorts data into a standardized classification system widely used by researchers. The study resulted from our efforts to understand the distribution of different kinds of financial establishments across neighborhoods in the U.S., and we use the need for reliable data for that research as our case here. We ask whether the company’s algorithms classified financial institutions reliably—that is, the extent to which every establishment the company classified as, say, a credit union, was in fact a credit union and not a conventional bank, payday lender, or something else.

For our case, we use data from SafeGraph, a popular device-tracking company that provides the locations of financial establishments in the U.S. and that has informed many important recent studies (Benzell et al., 2020; Chang et al., 2021; Chen & Rohla, 2018; Huang et al., 2023; Jay et al., 2020; Levy et al., 2022; Massenkoff & Chalfin, 2022; Weill et al., 2020). For each establishment, SafeGraph provides, among many other variables, the establishment’s name (e.g., “Bank of America”), address (e.g., “215 W 125th St, New York, NY”), and classification code. For the latter, SafeGraph uses the 6-digit North American Industry Classification System (NAICS), a widely used standard that assigns different codes to different establishments based on precise descriptions of their primary business activities. Informed by those descriptions, and based on both field and online research, we independently classified the financial establishments provided by SafeGraph, and compared the NAICS codes we assigned to the ones SafeGraph did. We also examined the possibility of other problems, such as duplicate or outdated records.

We find evidence of major classification problems that vary by type of institution, and remarkably high rates of unidentified closures and duplicate records. The problems are of sufficient magnitude to have affected the conclusions drawn in our empirical research. Moreover, the processes through which the company produced the data suggest that similar problems are likely to affect other data sources where algorithms play a major role in classification systems. We conclude with recommendations for scholars working with company data. We begin by describing the classification problem.

The Classification Problem

In recent years, several companies have made mobility data collected from mobile devices freely available to researchers, resulting in a large number of studies (Chang et al., 2021; Elarde et al., 2021; Levin et al., 2021; Levy et al., 2022; Li et al., 2022; Prestby et al., 2020; Sparks et al., 2022). Studies using such data have relied on the companies’ classification of establishments to examine industry-specific research questions or to compare phenomena across business sectors (Brelsford et al., 2022; Gao et al., 2019; Li & Yang, 2021). For example, Athey et al. (2018) used the industry codes in SafeGraph to examine heterogeneity in consumer preferences for restaurant locations. Hansen et al. (2022) used the classifications to examine how school closures and re-openings affected the employment rates of married mothers versus other demographic groups. Benzell et al. (2020) used them to assess the risk of COVID-19 transmission in different industry categories, and to propose reopening policies. Others have used these classifications to understand inequality in visits across retail subsectors (Ballantyne et al., 2021), parks (Jay et al., 2022), and alcohol vendors (Hu et al., 2021).

Studies of this kind have largely assumed that locations designated as restaurants, parks, or banks in the datasets are in fact restaurants, parks, or banks. However, because large-scale administrative data are often derived algorithmically, their accuracy and reliability require assessment, particularly given that the interests of the companies may not align with those of researchers (Grigoropoulou & Small, 2022). For a business, an algorithmic model that accurately predicts the location of 80% of establishments in a short period of time may be satisfactory (Nisbet et al., 2018). Indeed, there are practical limits to prediction, and for complex tasks, such as classifying a large number of industry categories, even 80% may be higher than one can hope for (Hofman et al., 2017). For a social scientist, however, such a rate may undermine the reliability of many kinds of analyses.

The current study uses SafeGraph data to examine how classification into industry codes can affect the reporting of an ostensibly uncomplicated value, the locations of financial institutions such as banks across the U.S. For our purposes, classification is the process of assigning values to discrete categories based on correspondence to a known typology (Bailey, 2005; Nisbet et al., 2018). From a statistical perspective, the main objective of reliable classification is to reduce heterogeneity within each class by identifying similar members of the class and to increase heterogeneity between classes (Bailey, 1994). In the era of large-scale data, researchers and data scientists have tackled the task of classification through supervised machine learning methods, in which a smaller set of labeled data is used to train a model that classifies a larger set of unlabeled data (Kotsiantis et al., 2006; Nisbet et al., 2018). The model uses the information in the labeled data to generate the probability that each case in the unlabeled data belongs to one of the categories (see also Tharwat, 2020).

Supervised learning methods are widely used to classify establishment data (Choi et al., 2014; Giannopoulos & Meimaris, 2019; Milias & Psyllidis, 2021). Supervised learning was used by SafeGraph, which has developed a database of more than 18M Points-of-Interest (POI) in the United States by extracting information about establishments via web crawlers. Web crawlers scour various open Internet sources, most commonly store locators, government websites, and location platforms such as Google Maps, Yelp, MapQuest, and others (Bonack, 2021). Once the establishment data are collected, SafeGraph deploys its proprietary machine learning algorithm to classify them into nearly 1000 NAICS categories. The categories increase in granularity hierarchically, from 2 to 6 digits. For example, the general code 52 includes all types of “Finance and Insurance” establishments; the granular code 522130 captures “Credit Unions.” Because SafeGraph’s algorithm is proprietary, we only know that it classifies each establishment into an NAICS category by determining its primary activity based on its location name and other unspecified metadata (SafeGraph, 2022).
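The hierarchical structure of NAICS codes described above can be illustrated with a short sketch. It assumes only that each broader level is a prefix of the more granular code, as in the 52 and 522130 example; the helper names `naics_prefixes` and `in_sector` are hypothetical and are not part of any SafeGraph or Census Bureau tool.

```python
def naics_prefixes(code: str) -> list[str]:
    """Return the 2- to 6-digit levels that contain a NAICS code."""
    if not (code.isdigit() and 2 <= len(code) <= 6):
        raise ValueError(f"not a 2- to 6-digit NAICS code: {code!r}")
    # Each broader level is simply a shorter prefix of the full code.
    return [code[:n] for n in range(2, len(code) + 1)]


def in_sector(code: str, sector: str) -> bool:
    """True if `code` falls under the broader `sector` prefix."""
    return code.startswith(sector)


print(naics_prefixes("522130"))   # ['52', '522', '5221', '52213', '522130']
print(in_sector("522130", "52"))  # True
```

Under this reading, selecting every establishment in "Finance and Insurance" amounts to a prefix match on 52, whereas selecting credit unions requires an exact match on 522130.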

Classification into NAICS codes is an instance of multi-class classification, that is, into more than two categories (Grandini et al., 2020). Previous research notes that multi-class classification can pose particular difficulties when the sample size is small or sparse, the classification structure has unclear category boundaries, or the characteristics of the categories are insufficiently specified (Grandini et al., 2020; Ho & Basu, 2002; Moral et al., 2022). Such problems should be of lesser concern for NAICS. NAICS was created by the governments of the United States, Canada, and Mexico to offer a standard model for gathering, analyzing, and reporting data for different industries in North America (U.S. Census Bureau, 2022c). Its classification manual contains clear descriptions, index entries of the types of businesses included, and illustrative examples for each industry category, leaving little room for ambiguity. Moreover, the establishment volume is large and dense. NAICS is commonly used in academic, governmental, and business contexts because it allows scientists to produce consistent, comparable, and replicable research.

Even without these challenges, text classification into ∼1000 NAICS categories remains a formidable task. It is further complicated by the fact that the categories are imbalanced in size: some categories contain hundreds of thousands of cases (e.g., 722511 Full-Service Restaurants), while others contain only a few hundred. When an algorithm is trained to classify imbalanced data, it often assigns disproportionate importance to the dominant categories; as a result, the model selects information from the classes unevenly and performs poorly on the classification of the less frequent ones (Liu et al., 2009). Given the complexity of multi-class classification, it is reasonable to be wary of the ways potential inaccuracies in data classification can affect our research. Classification into NAICS codes thus offers a near best-case scenario in which to examine how potential classification errors in such a complex task, if left unnoticed, can affect the quality of large-scale data obtained from third-party sources.

Data

In early 2022, we acquired location data for financial service establishments in the U.S. from SafeGraph’s Core Places. Places is a commercial dataset that contains location records, geographic coordinates, brand and establishment characteristics, and algorithmically derived NAICS codes. At the time, this dataset was freely available for academics through SafeGraph’s digital shop.¹ In the digital shop, researchers could query and select data by geographic location and NAICS category. We downloaded location data solely for establishments classified in the six 6-digit NAICS codes that most closely matched our research interest in consumer-facing conventional versus alternative financial institutions in the U.S. The industry categories were as follows (NAICS codes in parentheses): “Commercial banks” (522110), “Savings institutions” (522120), “Credit unions” (522130), “Consumer lending” (522291), “All other nondepository credit intermediation” (522298), and “Other activities related to credit intermediation” (522390). Thus, our dataset consisted of the 202,750 establishments that had been classified into these six industry categories by SafeGraph.²

Methods

Procedure

Our data quality concerns were driven by an empirical interest in racial inequality in access to financial services in the United States. We acquired data on the location of financial establishments from SafeGraph to examine differences across racial groups in travel to traditional financial institutions, such as banks and credit unions, versus travel to what are commonly known as alternative financial institutions (AFI), such as payday lenders and check cashers. AFIs tend to offer costlier services than federally insured financial institutions, costs that can exacerbate the difficulties of low-income customers (Bradley et al., 2009; Faber, 2019; Small et al., 2021).

Our study required the ability to identify specific types of establishments, namely, (1) banks, (2) credit unions, and (3) alternative financial institutions (AFI). For AFIs, we also required a more granular classification to identify (a) payday lenders, (b) check cashers, (c) car title lenders, and (d) pawnshops, for a total of three broad categories plus four subcategories. The first three are mutually exclusive, as they describe types of institutions; the four subcategories are not, as they describe services offered by AFIs. For example, an AFI may offer both check cashing and payday lending, and many do. Still, when an AFI offers only one of the four services, such as payday lending, it may be thought of as a type of institution (payday lender).

As noted earlier, we retrieved SafeGraph data on more than 200k establishments classified under six NAICS codes. Table 1 lists each NAICS category and the associated description of the establishments it subsumes. The first three categories are depository institutions, businesses that host depository accounts from which they lend money to consumers; however, they vary in terms of profit status, deposit liabilities, and credit-type provided (U.S. Census Bureau, 2022d). Some differences are clear. For example, “credit unions” are typically not-for-profit institutions that serve members in cooperatives, while “commercial banks” are for-profit corporations. However, while historically “savings institutions” are known for their focus on savings accounts and real estate loans, in practice they currently offer many services that “commercial banks” do (for an overview, see Connecticut Department of Banking, 2023). Such overlaps pose challenges in distinguishing among institutions based on their services.

Table 1.

Descriptions and Index Entries for NAICS Codes, U.S. Census Bureau.

NAICS 6-digit code	NAICS descriptive label	NAICS Code Description	(Selected) Corresponding Index Entries
522110	Commercial banks	Establishments primarily engaged in accepting demand and other deposits and making commercial, industrial, and consumer loans	Branches of foreign banks; commercial banks; depository trust companies; national commercial banks; state commercial banks
522120	Savings institutions	Establishments primarily engaged in accepting time deposits, making mortgage and real estate loans, and investing in high-grade securities	Savings banks (including federal and state); savings & loan associations (including federal and state); mutual savings banks
522130	Credit unions	Establishments primarily engaged in accepting members’ share deposits in cooperatives that are organized to offer consumer loans to their members	Credit unions; corporate credit unions; federal credit unions; state credit unions
522291	Consumer lending	Establishments primarily engaged in making unsecured cash loans to consumers	(Consumer) finance companies; personal credit institutions; small loan companies; student loan companies
522298	All other nondepository credit intermediation	Establishments primarily engaged in providing nondepository credit (except credit card issuing, sales financing, consumer lending, real estate credit, international trade financing, and secondary market financing)	Agricultural lending; car title lending; factoring accounts receivable; industrial banks, nondepository; Morris plans, nondepository; pawnshops; short-term inventory credit lending
522390	Other activities related to credit intermediation	Establishments primarily engaged in facilitating credit intermediation (except mortgage and loan brokerage; and financial transactions processing, reserve, and clearinghouse activities)	Check cashing services; money order issuance services; payday lending services

bold indicates the index entries of interest in a NAICS category that includes index entries outside the scope of our research on financial access.

The bottom three categories in Table 1 are nondepository financial institutions that offer short-term loans and limited types of credit services (U.S. Census Bureau, 2022d). The index entries indicate that these three categories subsume finer classifications of AFIs, including payday lenders, check cashers, car title lenders, and pawnshops, which we seek to extract into separate categories. Payday lenders offer immediate, short-term loans based on proof of income, typically without a credit check, for a high fee. Check cashers cash checks for a fee, typically without requiring a bank account. Car title lenders do not provide loans for purchasing a car; instead, they provide short-term loans that are secured by the borrower’s car title. Pawnshops provide short-term loans secured with personal high-value items, such as jewelry.

Our objective for the present study was to assess how well SafeGraph’s algorithm classified its establishment data, by independently classifying the 202,750 establishments ourselves and comparing results.³ Our process was as follows:

Step 1: Assign Based on Descriptive Keywords

First, we classified each establishment based on keyword searching. Based on the assumption that, for many financial establishments, the firm’s name (labeled “location name” in the SafeGraph dataset) will indicate its primary activity, we searched for specific keywords in the list of names and automatically assigned establishments to categories when the name included the keyword. For example, based on this process, we automatically assigned “Alliance Credit Union” to the category “credit union.” To develop a preliminary list of keywords, we relied on our knowledge of the domain, including what we understood to be the appropriate keywords based on terms from the corresponding NAICS index entries from the U.S. Census Bureau (see Table 1). For this first step, we only included the eight keywords for which we had high confidence in their ability to accurately distinguish establishments: “credit union,” “savings bank,” “savings* loan,” “pawn,” “payday,” “car title loan,” “auto title loan,” and “check* cash*.” That is, we had high confidence that a firm with the name “payday” in its title was not a credit union, and vice versa. This approach helped identify first-order problems in SafeGraph’s algorithm—for example, whether it wrongly classified “Alliant Credit Union” as a “commercial bank” (522110) instead of a “credit union” (522130).
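A minimal sketch of this kind of keyword assignment follows. It uses the eight keywords listed above, but several details are assumptions made for illustration only: the reading of “*” as “any run of characters,” the category labels, the case-insensitive matching, and the function names are not taken from our actual scripts.

```python
import re

# Step 1 keywords from the text. The category labels and the reading of "*"
# as "any run of characters" are assumptions made for this sketch.
KEYWORD_RULES = {
    "credit union": "credit union",
    "savings bank": "bank",
    "savings* loan": "bank",
    "pawn": "pawnshop",
    "payday": "payday lender",
    "car title loan": "car title lender",
    "auto title loan": "car title lender",
    "check* cash*": "check casher",
}


def compile_keyword(keyword: str) -> re.Pattern:
    """Turn a keyword with optional "*" wildcards into a case-insensitive regex."""
    parts = [re.escape(part) for part in keyword.split("*")]
    return re.compile(".*?".join(parts), re.IGNORECASE)


COMPILED_RULES = [(compile_keyword(k), label) for k, label in KEYWORD_RULES.items()]


def classify_by_name(location_name: str) -> set[str]:
    """Return every label whose keyword appears in the location name."""
    return {label for pattern, label in COMPILED_RULES if pattern.search(location_name)}


print(classify_by_name("Alliance Credit Union"))  # {'credit union'}
```

Returning a set rather than a single label reflects the fact that the AFI subcategories are not mutually exclusive; a name with no matching keyword yields an empty set and is left for the later steps.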

Step 2: Assign Based on Company Information

Many establishments could not be classified using any basic keyword because they had nondescript names that, unlike “Bank of America,” do not indicate the type of business the establishment operates. For example, “Wells Fargo” is also a bank. Several large check cashing companies, such as “ACE Cash Express” and “Pay-O-Matic,” do not include “check cashing” in their name. Thus, we extracted all company names with at least six entries, that is, branches, in our data.⁴ When a company had a nondescript name, we searched online for company information about the services it provides. We then assigned all establishments under the same name to the appropriate category. Table 2 presents the 25 companies with the largest number of establishments; these alone correspond to ∼55.8% of the entries in our dataset.
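The extraction of company names with at least six entries can be sketched as a simple frequency count. This is an illustration under stated assumptions, not our actual procedure: the function name is hypothetical, and the lowercasing and whitespace stripping are simplifications added here.

```python
from collections import Counter


def names_with_min_branches(location_names, min_branches=6):
    """Return (name, count) pairs for names with at least `min_branches` entries."""
    # Normalizing case and surrounding whitespace is an assumption of this sketch.
    counts = Counter(name.strip().lower() for name in location_names)
    # most_common() orders the names from most to least frequent.
    return [(name, n) for name, n in counts.most_common() if n >= min_branches]


names = ["Wells Fargo"] * 6 + ["ACE Cash Express"] * 7 + ["Joe's Loans"] * 2
print(names_with_min_branches(names))
# [('ace cash express', 7), ('wells fargo', 6)]
```

Exact-match counting of this kind does not merge variant spellings of the same company, so it can only serve as a starting point for the manual verification.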

Table 2.

Companies With the Largest Number of Branches in the Data, by Number of Branches.

Rank	Company Name
01	Western Union
02	Chase
03	Wells Fargo
04	Bank of America
05	PNC Financial Services
06	U.S. Bank
07	BB&T (Branch Banking and Trust)
08	M&T Bank
09	Regions Bank
10	Advance America
11	World Finance
12	TD Bank
13	The Huntington National Bank
14	Fifth Third Bank
15	SunTrust Banks
16	KeyBank
17	Citizens Bank
18	TitleMax
19	ACE Cash Express
20	Woodforest National Bank
21	Check Into Cash
22	Cash America
23	Citibank
24	Check ‘n Go
25	First Citizens Bancshares

Matching based on company information was not always straightforward, for several reasons. First, SafeGraph records location names as they appear in the storefronts and online sources it uses to scrape POI data, without necessarily aligning them with the official company name. Thus, many establishments that were part of the same company were listed under different names in the dataset. An egregious example is the large check cashing company, Community Financial Services Center, which appeared in the dataset not only under that name but also under “CFSC All Checks Cashed,” “CFSC Check Cashing,” and many others, for a total of 134 unique names.

Second, at times, different and unrelated companies operate under the same name.⁵ For example, the 96 establishments operating under the name “Southern Bank” seemed to belong to three different banks, judging from the different Web site domain names and interfaces. While the Southern Banks example does not affect our classification—they are all banks—in multiple other cases the alternative companies represented distinct industry categories, such as credit reporting companies versus loan agencies or payday lenders versus mortgage lenders. The problem was particularly common among smaller AFI companies and independent, family-owned stores.

To address these issues, we conducted multiple online searches for selected establishments operating under the same name to verify whether they belonged to the presumed company, and adjusted our classifications accordingly.

While doing so, we also performed a small-scale (non-systematic) check of our Step 1 process. We searched online for several of the companies with descriptive names we had already classified during Step 1, and compared the classification to that obtained manually. We encountered almost no classification mismatches for company names that included the small, highly specific set of descriptive keywords we used in Step 1. However, online searches based on company names often provided information on additional financial services the company offered. For example, some companies that used the term “check cashing” in their name also provided payday loans. Thus, we often combined the two methods to get a fuller account of the financial services these companies offer, which was important for our substantive research on financial access.

Step 3: Assign Based on Query Expansion Mining

Our first two methods allowed us to classify ∼80% of the establishments in the dataset. We then pursued a query expansion method by mining only the remaining, unclassified data for additional keywords we could use to classify them, following an approach similar to King et al. (2017). We first extracted a list of the most frequent terms used in location names in the set of unclassified establishments. We inspected the names for keywords potentially relevant to our six NAICS codes. Once we located a keyword that could potentially discriminate among categories of financial establishments, we performed online searches for selected establishments to verify that we could confidently infer the corresponding services based on the presence of these terms alone. If so, we added the keyword to our inclusion terms for the relevant category. If a keyword was consistently associated with a type of financial service outside of the six NAICS codes of interest, we added it to our exclusion terms. For example, the term “bitcoin” in an establishment’s name suggests it primarily engages in “Virtual currency exchange services,” which should be listed under NAICS “Commodity contracts dealing” (523130), instead of any of our six target categories. We classified all establishments that should not have been in one of our six NAICS codes as “Other.” Table 3 presents the most important discriminant and non-discriminant keywords we identified through our query expansion method.⁶

Table 3.

Top Discriminant and Non-Discriminant Keywords Identified Through Query Expansion Mining (With Associated NAICS Codes in Parentheses).

Keyword Type	Keywords
Discriminant keywords: inclusion terms	Title loan (522298), consign (522298), gun loan (522298), student loan (522291), lawsuit loan (522291)
Discriminant keywords: exclusion terms	Bitcoin, ATM, home loan, sound, mortgage, equity, insurance, estate sale, auction, merchant, business loan, liquid, union, Montana, blood bank, auto finance, invest, real estate, cash home, repair, hard money, abstract
Non-discriminant keywords	Loan, credit, cash advance, cash gold, gold buy, finance, cash, check advance, jewel, coin, silver, money, currency exchange, fund, capital

We updated our keywords iteratively, adding new high-confidence terms or dropping terms that either produced errors or proved more ambiguous than we originally believed, until we could not identify new useful keywords. The iterative process required deep qualitative probing and the deepening of domain expertise. For example, we first learned that the term “cash advance” is often used in the AFI industry to refer to or advertise payday loans, such that firms with that title were often payday lenders. However, we soon noticed that establishments operating in U.S. states where payday lending is explicitly prohibited were also using the term in their location name. Some of these were merely legacy institution names. But further investigation suggested that an establishment may also use “cash advance” in its title when it offers other types of short-term, unsecured cash loans, which reduced our confidence that the term consistently classified a payday lender.⁷ Thus, for such ambiguous terms, we searched for each establishment’s services online and classified them on a case-by-case basis.
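The first pass of Step 3, listing the most frequent terms in the names of unclassified establishments, can be sketched as follows. The tokenization, the minimum token length, and the function name are assumptions made for illustration; the subsequent judgment about whether a term discriminates among categories remained a manual task.

```python
import re
from collections import Counter


def frequent_terms(unclassified_names, top_n=20, min_length=3):
    """Count in how many unclassified location names each token appears."""
    counts = Counter()
    for name in unclassified_names:
        # Use a set so a token repeated within one name is counted once.
        tokens = set(re.findall(r"[a-z]+", name.lower()))
        counts.update(t for t in tokens if len(t) >= min_length)
    return counts.most_common(top_n)


sample = ["Quick Cash Advance", "Bitcoin ATM", "Bitcoin Depot", "Cash Cash Gold"]
for term, n in frequent_terms(sample, top_n=5):
    print(term, n)
```

Each candidate term surfaced this way still has to be checked against online information before it is added to the inclusion or exclusion list, as the “cash advance” example shows.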

Step 4: Assign Based on One-By-One Manual Search

Step 3 allowed us to classify an additional ∼5% of the data. The ∼15% of establishments that remained unclassified had either ambiguous or unique terms in their location names, or were branches of small companies (<6 stores). Therefore, we searched online, one by one, for information on the specific services each establishment provided, and classified it on that basis.

To assure the quality of our coding, for each online search, we consulted multiple sources. We relied primarily on company websites to identify or confirm the types of services offered by the establishment. Unfortunately, many establishments did not have an active website. Less often, websites were ambiguous about the kind of financial products provided. Payday lending, in particular, was at times difficult to evaluate. Because the practice has received extensive negative coverage and is often described as “predatory” by scholars and news media, payday lenders have an incentive to be more circumspect about their services. Many marketed their products with other, more innocuous names, such as “unsecured loans,” “signature loans,” “bad credit loans,” “short-term loans,” “cash advances,” and more. As before, we erred on the side of caution and classified establishments as payday lenders, check cashers, car title lenders, and pawnshops only when the services were explicitly mentioned, or when we could infer such service provision from the examination of the fine print, terms and conditions of the services, and APR specifications. Otherwise, we classified establishments that offered unspecified short-term, small, or personal loans, excluding auto financing,⁸ generically as AFIs. (For this and other reasons, the total number of AFIs we present below will differ from the sum of establishments across all four AFI categories.) Still, such ambiguity was less common, since non-payday lenders often described their loan products as “traditional installment loans” and specified their APRs or rollover policies, information that allowed us to classify them as consumer lenders (522291). In the absence of a business website or a Facebook page, we consulted Google Maps, Yelp, MapQuest, and the online version of Yellow Pages. After company websites, the second most informative source was storefront images, often with accompanying Google Maps entries, since owners often use them to advertise their primary activities to the physical neighborhood they serve.

Step 5: Check Reliability

We checked the reliability of our procedures in two ways, one informal and one formal.

Preliminary check: First, we conducted field observations (Grigoropoulou & Small, 2022). We know that SafeGraph uses online sources with publicly open APIs, such as Yelp, MapQuest, and Yellow Pages, to collect and likely train the data we obtained (Bonack, 2021). Online sources were also the foundation of our main procedures. To minimize the risk of a “feedback loop”—relying on the same online information to confirm our classification decisions that our data provider used to produce theirs—we searched for specific establishments offline, doing fieldwork in New York City for a week in late November and early December 2021.⁹ We selected a random sample of 243 establishments located in NYC, mainly in Manhattan and Brooklyn, extracting them from our corresponding Patterns data, which we had obtained a few weeks earlier than our Core Places data. We classified the establishments using the strategies we described above. Then, we either visited or telephoned each of the purported locations to validate our classifications and reclassify when necessary.

Final check: Second, we selected a random sample of 3050 establishments, stratified by NAICS category, for independent classification. A research assistant not involved in Steps 1–4 was assigned to classify the establishments independently, following a more stringent procedure.¹⁰ We first trained the research assistant by providing (a) descriptions of the six NAICS categories we examined; (b) definitions of banks, credit unions, payday loans, check cashing, car title lending, and pawning, as well as related concepts, such as “unsecured loans” and “collaterals”; and (c) a coding scheme. We then asked the research assistant to classify the establishments based on a more painstaking process: by searching online for each individual establishment in the sample and assigning a code based on the exact services it offered, rather than automatically coding by keyword (as we had done in Steps 1–3). Thus, if a large company provided different services in different establishments (e.g., if it provided payday loans in some locations but not others), then the research assistant’s process would ensure that the different establishments of the same company were accurately coded under different financial services. After the research assistant’s coding was completed, we compared those results to ours by calculating an inter-coder reliability metric, ICR = total agreements/total establishments, for each establishment type. We confirmed that our classification approach was highly reliable, with ICR ranging from 94.2% for auto title lenders to 99.9% for credit unions. (For full results, see Appendix Table A.2. The Appendix also contains results comparing estimates of classification agreement across each of the methods in Steps 1–4.)
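The ICR metric can be sketched in a few lines. The sketch assumes that an “agreement” for a given establishment type means that both coders made the same decision about whether an establishment belongs to that type; this operationalization, the data layout (one set of labels per establishment), and the function name are assumptions for illustration.

```python
def icr_by_type(coder_a, coder_b, types):
    """Share of establishments on which two coders agree, per establishment type.

    coder_a, coder_b: lists of label sets, one per establishment, in the same order.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty sample")
    total = len(coder_a)
    result = {}
    for t in types:
        # An agreement: both coders assign the type, or both do not.
        agreements = sum((t in a) == (t in b) for a, b in zip(coder_a, coder_b))
        result[t] = agreements / total
    return result


ours = [{"payday lender"}, {"check casher", "payday lender"}, {"bank"}, {"credit union"}]
assistant = [{"payday lender"}, {"check casher"}, {"bank"}, {"credit union"}]
print(icr_by_type(ours, assistant, ["payday lender", "check casher"]))
# {'payday lender': 0.75, 'check casher': 1.0}
```

Simple percent agreement of this kind does not correct for agreement expected by chance, which is worth keeping in mind for types that are rare in the sample.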

Step 6: Assign to a Primary NAICS Code

After confirming the reliability of our classification approach, we performed the last step of our procedure: deciding on a primary NAICS code. This step was necessary for three reasons. First, as noted earlier, our study on financial access did not require differentiation between commercial banks and savings institutions. In Steps 1–4, we coded all banks that accept deposits as “banks.” In Step 6, we separated these establishments into their respective NAICS codes, “Commercial banks” (522110) and “Savings institutions” (522120). Second, check cashing, payday lending, and car title lending are not mutually exclusive categories; many AFIs provide more than one of these financial services. Thus, an establishment that offers payday lending and car title lending could be classified as both 522390 and 522298. However, SafeGraph’s algorithm assigns establishments to a single NAICS category based on their presumed primary activity, regardless of the other services they may provide. Third, some of the NAICS codes include services beyond those of interest. For example, 522298 includes agricultural lending. We needed to account for these establishments to get a comprehensive estimate of the reliability of these six 6-digit NAICS categories provided by SafeGraph. Thus, as a final step, we assigned a single primary NAICS code to all 202,750 establishments in our dataset based on the likely primary activity of the establishment (instead of assigning codes based on each financial service).

For these assignments, we followed rules as conservative as those of our initial classification, favoring (1) clear industry markers, such as “savings bank,” “car title loan,” “check* cash*,” and “payday loan,” in the location name, and (2) information on company profiles, Web site descriptions, and selected government documents, such as U.S. Securities and Exchange Commission (SEC) filings or reports from the Consumer Financial Protection Bureau (CFPB). For savings institutions, we also used an up-to-date, publicly available list of federal savings associations from the U.S. Office of the Comptroller of the Currency (OCC) (2023). As in Steps 1–4, when an establishment’s primary activity was not one of the indexed services in any of the six 6-digit categories of our dataset, we classified it as “Other.”
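
To illustrate the first rule, the sketch below matches the industry markers quoted above against a location name. It is an illustrative approximation rather than the procedure actually run: the regular expressions are one possible reading of the wildcard pattern “check* cash*,” the mapping of markers to codes follows the Step 6 description (savings institutions 522120, car title lending 522298, payday lending and check cashing 522390), and the example names are invented.

```python
import re

# (pattern, NAICS code) pairs. The patterns are an assumed translation of the
# markers quoted in the text, not the authors' actual keyword list.
MARKERS = [
    (re.compile(r"\bsavings bank\b", re.IGNORECASE), "522120"),
    (re.compile(r"\bcar title loans?\b", re.IGNORECASE), "522298"),
    (re.compile(r"\bcheck\w* cash\w*", re.IGNORECASE), "522390"),
    (re.compile(r"\bpayday loans?\b", re.IGNORECASE), "522390"),
]


def codes_from_name(location_name):
    """Return the set of candidate NAICS codes whose markers appear in the name.

    An empty set means the name carries no clear marker, so the establishment
    would need the second rule (company information) or manual review.
    """
    return {code for pattern, code in MARKERS if pattern.search(location_name)}


# Hypothetical location names, for illustration only.
for name in ["Example Savings Bank", "Quick Checks Cashed", "A1 Payday Loans and Car Title Loans"]:
    print(name, sorted(codes_from_name(name)))
```

A name that matches markers for more than one code, as in the last example, is exactly the case in which a primary activity has to be decided from other information.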

Figure 1 exhibits a graphical depiction of our classification process.

Figure 1.

Schematic representation of independent classification procedure.

Probing for Data Sourcing Problems

In addition to classification problems, we examined the possibility of duplicate records or overlooked establishment closures. We asked our research assistant to note whether an establishment in the random stratified sample was closed or open. We considered an establishment closed only when (a) it did not appear in the store locator for a large corporation, (b) Google Maps reported it as permanently closed, or (c) Yelp noted that “Yelpers report this location has closed.” Researchers have shown that Google Maps establishment operation data are increasingly likely to be accurate, given the combination of administrative, image-based, and crowdsourced quality control it relies on (Payne, 2021; Small et al., 2021; Zamir & Shah, 2010). Yelp also relies on crowdsourcing. We included a fourth criterion: (d) following SafeGraph policy,¹¹ we recorded an establishment as closed when it began operating under a different corporate name, as when a bank is acquired by or merges with another bank.
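
The four closure criteria amount to a conservative decision rule: an establishment counts as closed only if at least one explicit indicator is present. A minimal sketch of that rule follows; the class and field names are hypothetical and simply restate criteria (a) through (d).

```python
from dataclasses import dataclass


@dataclass
class ClosureEvidence:
    """Explicit closure indicators for one establishment (hypothetical fields)."""
    missing_from_store_locator: bool = False          # criterion (a)
    google_maps_permanently_closed: bool = False      # criterion (b)
    yelp_reports_closed: bool = False                 # criterion (c)
    operates_under_new_corporate_name: bool = False   # criterion (d)


def is_closed(evidence):
    """Closed only when at least one explicit indicator is present."""
    return any((
        evidence.missing_from_store_locator,
        evidence.google_maps_permanently_closed,
        evidence.yelp_reports_closed,
        evidence.operates_under_new_corporate_name,
    ))


print(is_closed(ClosureEvidence()))                          # no indicator recorded
print(is_closed(ClosureEvidence(yelp_reports_closed=True)))  # one indicator recorded
```

Because the default is “open” whenever no indicator is recorded, a rule of this kind can only undercount closures.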

Finally, we examined the possibility of duplicate records. We used the establishment address in this process. It is possible for more than one establishment to occupy a single geographic location, as when different stores are located in the same multi-story building. However, such duplicates can also result from unidentified closures or ambiguity in the establishment name across the different sources SafeGraph uses to collect the data. To account for potential duplicates, we pulled all records of establishments with identical street address, city, state, and zip code in our dataset. We then randomly sampled every 20th record of the 27,679 duplicate locations we found (approx. 10%), and inspected the corresponding pairs, or occasionally triples, to assess whether they were justifiably listed in the location or if the inclusion of one or more of them was due to errors.
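
The address-matching step can be sketched as a grouping operation followed by a systematic sample. The code below is an illustration under assumptions, not the study's implementation: the record fields (`name`, `street`, `city`, `state`, `zip`) and the example records are hypothetical, and matching is on exactly identical strings, as the text describes.

```python
from collections import defaultdict

ADDRESS_FIELDS = ("street", "city", "state", "zip")  # hypothetical field names


def co_located_groups(records):
    """Return groups of records that share an identical address."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in ADDRESS_FIELDS)
        groups[key].append(record)
    return [group for group in groups.values() if len(group) > 1]


def every_nth(items, n):
    """Systematic sample: keep the n-th, 2n-th, 3n-th, ... item."""
    return items[n - 1::n]


records = [
    {"name": "Bank A", "street": "1 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
    {"name": "Bank B", "street": "1 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
    {"name": "Lender C", "street": "9 Oak Ave", "city": "Springfield", "state": "IL", "zip": "62702"},
]
candidates = co_located_groups(records)
print([[record["name"] for record in group] for group in candidates])
```

Each sampled group still requires manual inspection, since co-location alone does not distinguish a true duplicate from two legitimate establishments in the same building.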

We report all findings regarding classification, closures, and duplicates below.

Results

Coverage of Financial Establishments

We first assess the total coverage of the SafeGraph Places data, as a base indicator of the plausibility of its classification approach. The U.S. Department of Commerce provides the total counts for all its known establishments classified by NAICS codes in the County Business Patterns (CBP) dataset¹². We use the most recent release with data from 2020 (U.S. Census Bureau, 2022b). We compare its coverage to the coverage produced by SafeGraph in early 2022 and find that establishment counts vary widely between the two sources (see Table 4).

Table 4.

Establishment Counts by NAICS Code and Source.

NAICS 6-Digit Code	NAICS Descriptive Label	U.S. Census Bureau	SafeGraph	Relative Difference(%)
522110	Commercial banks	87,612	82,039	−6.4
522120	Savings institutions	7027	6	−99.9
522130	Credit unions	19,196	16,236	−15.4
522291	Consumer lending	14,731	15,180	3.1
522298	All other nondepository credit intermediation	10,649	7926	−25.6
522390	Other activities related to credit intermediation	14,147	81,363	475.1
Total		153,362	202,750	32.2

In many cases, the SafeGraph counts match the CBP’s. For example, the company reports 15,180 “consumer lenders,” or only 3.1% more than the CBP. At the other extreme lie “savings institutions,” for which SafeGraph reports substantially lower numbers. It appears that, for every 100 savings institutions that CBP reports, SafeGraph identifies 0.09 establishments, a 99.9% deficit. In theory, the difference could have resulted because the SafeGraph data were produced two years later, and well into the COVID-19 pandemic, which resulted in many establishment closures. However, such patterns are unlikely to have affected savings institutions at rates so dramatically higher than those of other financial establishments. We believe this particular difference is due to the fact that “savings institutions” and “commercial banks” currently offer similar services. However, it may also be due to insufficient attention by the company to the index entries of this NAICS category.¹³
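
For readers who wish to check the arithmetic, the relative differences can be recomputed from the counts in Table 4 as (SafeGraph − CBP)/CBP × 100. The snippet below is a small sketch for that purpose; the function and variable names are ours, chosen for illustration.

```python
def relative_difference(safegraph_count, cbp_count):
    """Percentage by which the SafeGraph count departs from the CBP count."""
    return 100.0 * (safegraph_count - cbp_count) / cbp_count


# (CBP count, SafeGraph count) per NAICS code, copied from Table 4.
TABLE_4 = {
    "522110": (87612, 82039),
    "522120": (7027, 6),
    "522130": (19196, 16236),
    "522291": (14731, 15180),
    "522298": (10649, 7926),
    "522390": (14147, 81363),
}

for code, (cbp, safegraph) in TABLE_4.items():
    print(code, round(relative_difference(safegraph, cbp), 2))

# Savings institutions identified by SafeGraph per 100 reported by CBP.
per_100 = 100.0 * 6 / 7027
print(round(per_100, 2))
```

The recomputed values agree with the Relative Difference column to within 0.1 percentage point.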

SafeGraph also appears to undercount establishments in “all other nondepository credit intermediation,” “credit unions,” and “commercial banks,” by 25.6%, 15.4%, and 6.4%, respectively. We believe these differences may be due to either flaws in the data collection strategy or misclassification to a NAICS category outside the six we examine. For example, some “commercial banks” may have been classified as “investment banks” (523110), even though the latter are nondepository institutions. SafeGraph also specifies that they avoid assigning 6-digit NAICS codes to establishments when they are not very confident of the classification; instead, they only classify these establishments into the hierarchically broader 2-digit or 4-digit NAICS categories (SafeGraph, 2023). Many “credit unions” and “commercial banks” may have been assigned only to the more general, 4-digit NAICS category “Depository credit intermediation” (5221).

In contrast, SafeGraph reports an extraordinary 475.1% more establishments offering “other activities related to credit intermediation” than CBP. We uncovered that this difference is nearly entirely the product of classifying 75,610 stores from a single brand, Western Union, under 522390¹⁴. This assignment is not entirely implausible. Western Union stores offer “money transmission services,” which are indexed in 522390 (see Table 1). However, we believe those establishments should have been classified as “Financial transactions processing, reserve, and clearinghouse activities” (522320) because the NAICS code indicates “electronic funds transfer services” and “electronic financial payment services” as descriptors, and Western Union describes itself as “a leader in global money movement and payment services” in a letter commenting on the amendment of the Electronic Fund Transfer Act (EFTA) (Western Union, 2011).¹⁵

Classification Assessment

To probe further, we assess whether SafeGraph classified each establishment into what we determined to be the appropriate, primary NAICS code based on our classification procedures, which assumed a strict reading of the descriptors of the NAICS classification system (U.S. Census Bureau, 2022d). We calculate two metrics of reliability of SafeGraph’s classification in the six 6-digit categories we examined. First, we assess SafeGraph’s probability of detection¹⁶ by calculating the proportion of agreed establishments among the number we identified: total agreements/total establishments assigned by authors. Second, we assess precision¹⁷, or how many of the establishments SafeGraph has classified in a given category we can confirm as members of that category: total agreements/total establishments assigned by SafeGraph.

Table 5 exhibits the results. The second column presents the number of establishments we classified in each category.¹⁸ Of the 202,750 financial establishments, we determined that 73,876 were commercial banks; 3331 savings institutions; 16,119 credit unions; 5768 consumer lenders; 11,552 all other nondepository credit intermediation; and 8041 other activities related to credit intermediation. We determined that the remaining 84,063 establishments were engaged in a primary activity outside the scope of these six 6-digit industry codes (e.g. “Financial transactions processing, reserve, and clearinghouse activities” (522320)).

Table 5.

Estimates of Classification Agreement Between SafeGraph and Authors.

NAICS Code	Assigned by SafeGraph	Assigned by Authors (Step 6)	Agreements	Probability of Detection(%)	Precision(%)
522110 commercial banks	82,039	73,876	73,755	99.8	89.9
522120 savings institutions	6	3331	5	0.2	83.3
522130 credit unions	16,236	16,119	14,285	88.6	88.0
522291 consumer lending	15,180	5768	4265	73.9	28.1
522298 all other nondepository credit intermediation	7926	11,552	6769	58.6	85.4
522390 other activities related to credit intermediation	81,363	8041	3301	41.1	4.1
Total	202,750	118,687	102,380	86.3	50.5

The fourth column suggests that a researcher using the data would detect almost all (99.8%) commercial banks, and a vast majority of credit unions (88.6%), while missing a large portion, or a majority, among the other four industry categories. Still, even the relatively small gap for credit unions (11.4%) is particularly surprising, since 1664 of the 1834 credit unions SafeGraph had assigned to something other than the expected code had the words “credit union,” or an abbreviation, in their names. The lowest probability of detection was for savings institutions (0.2%); nearly all of them were instead classified as commercial banks by SafeGraph.

The fifth column shows that, with respect to precision, results vary. For some categories, the results are too unreliable for research. However, even NAICS categories with higher precision rates could be improved considerably with proper attention to types of establishments that would fit best under other categories. For example, 12% of the establishments classified by SafeGraph in 522130 were not credit unions. In fact, 6.6% of them were bitcoin selling points (523130, instead), and at least 1.0% were credit repair services (which should have been assigned into 541990).
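
Both metrics are ratios of the agreement count to a different denominator, which is why a category can score high on one and low on the other. As a check on the arithmetic, the sketch below recomputes them from the counts in Table 5; the function and variable names are assumptions made for illustration.

```python
def probability_of_detection(agreements, assigned_by_authors):
    """Share of establishments we identified that SafeGraph placed in the same code."""
    return 100.0 * agreements / assigned_by_authors


def precision(agreements, assigned_by_safegraph):
    """Share of SafeGraph's assignments to a code that we could confirm."""
    return 100.0 * agreements / assigned_by_safegraph


# (assigned by SafeGraph, assigned by authors, agreements), copied from Table 5.
TABLE_5 = {
    "522110": (82039, 73876, 73755),
    "522120": (6, 3331, 5),
    "522130": (16236, 16119, 14285),
    "522291": (15180, 5768, 4265),
    "522298": (7926, 11552, 6769),
    "522390": (81363, 8041, 3301),
}

for code, (by_safegraph, by_authors, agreed) in TABLE_5.items():
    detection = round(probability_of_detection(agreed, by_authors), 1)
    confirmed = round(precision(agreed, by_safegraph), 1)
    print(code, detection, confirmed)
```

For 522390, for instance, the large SafeGraph denominator (81,363) drives precision down to about 4.1% even though 41.1% of the establishments we assigned to that code are detected.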

With the exception of savings institutions, the classification inconsistencies were most pronounced among the three NAICS codes associated with AFIs. Given their importance to our empirical study, we further inspected how the classification inconsistencies would affect the analysis of check cashing, payday lending, and other AFI services. As noted earlier, many establishments provide more than one of these services. For this part of the assessment, each establishment is classified in every category of service we determined it offers, but only to a single NAICS code based on presumed primary activity. For example, if an establishment offers both payday lending and car title lending, we classify it as providing both services, but assign it only to either 522390 or 522298 based on its primary activity.

Table 6 exhibits the results. It describes how we classified alternative financial institutions compared to SafeGraph. The columns indicate the services provided. Items in bold indicate the proportion assigned as expected. As shown in the fourth and sixth columns, we classified 97.8% of establishments offering payday lending and 99.5% of those offering check cashing under 522390, compared to only 36.7% and 47.2%, respectively, by SafeGraph. Notably, 5.9% of the establishments we identified as offering check cashing services were classified by SafeGraph as “commercial banks.” These included places such as Community Financial Services Center (CFSC), Advance Financial, and First Virginia—Community Choice Financial, which are not commercial banks.

Table 6.

Assignment to NAICS Code by Primary Service, for AFIs Offering One or More Services, by SafeGraph and Authors (Steps 1–4).

NAICS Code	AFI Services (25,075)		Payday Lending (6285)		Check Cashing (5682)		Car Title Lending (9773)		Pawn Shop Trading (7177)
NAICS Code	SafeGraph(%)	Authors(%)	SafeGraph(%)	Authors(%)	SafeGraph(%)	Authors(%)	SafeGraph(%)	Authors(%)	SafeGraph(%)	Authors(%)
522110 commercial banks	3.3	0.0	0.3	0.0	5.9	0.0	0.4	0.0	0.1	0.0
522120 savings institutions	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
522130 credit unions	0.4	0.0	0.0	0.0	0.0	0.0	0.3	0.0	0.0	0.0
522291 consumer lending	51.1	23.0	61.8	0.2	46.3	0.4	78.0	10.0	6.7	0.0
522298 all other nondepository credit intermediation	28.2	45.0	1.1	2.0	0.6	0.1	2.4	43.4	92.6	99.7
522390 other activities related to credit intermediation	17.0	32.0	36.7	97.8	47.2	99.5	18.8	46.6	0.6	0.3
Total	96.3	100.0	99.6	100.0	94.1	100.0	99.2	100.0	99.9	100.0

Bold values indicate the proportion of establishments assigned as expected.

The situation is more complicated for car title lenders. The eighth column shows that 43.4% of establishments offering car title loans are primarily car title lenders (522298). Another 46.6% of them have payday lending and/or check cashing as their primary activity and thus should be classified in 522390 “other activities related to credit intermediation.” Nevertheless, SafeGraph classifies only 2.4% and 18.8% of car title lenders in these two NAICS categories, respectively (seventh column). It appears that the SafeGraph algorithm favors the assignment of alternative financial institutions, such as payday lenders, check cashing stores, and car title lenders, into “Consumer lending” (522291).

Data Sourcing Problems

In addition to these issues in classification, we identified other important problems. In our stratified random sample of 3267 establishments selected for additional screening, 16.7% were not in operation in 2022 (see Table 7). There is wide variation in unidentified closures by category of financial establishment, ranging from 12.1% for commercial banks to 24.3% for consumer lenders.¹⁹ Since we counted establishments as closed only when there were explicit indicators that the establishments were closed, we consider our estimates to be lower bounds.²⁰ Our inspection of duplicate records also revealed worrisome patterns. Sometimes, listing more than one establishment in a single location was justified. Still, 10.6% of financial establishments in the data are likely duplicates. Duplicates resulted from three factors.

Table 7.

Proportion of Financial Establishments Reported as Open by SafeGraph Places That We Documented to be Permanently Closed, by NAICS Category, With Confidence Intervals. Stratified Random Sample of Establishments.

NAICS Code	% Closed	95% CI Lower	95% CI Upper
522110 commercial banks	12.1	10.3	14.1
522130 credit unions	18.7	15.5	22.1
522291 consumer lending	24.3	21.2	27.3
522298 all other nondepository credit intermediation	12.4	8.0	17.4
522390 other activities related to credit intermediation	14.8	12.0	17.8
Total	16.7	15.5	18.1

The first was unidentified closures, wherein one establishment had closed and another one had opened at the same location. Online sources such as Yelp and Yellow Pages update their records with openings far faster than they do with closures. And even after being notified that an establishment has closed, platforms may retain the record for an unspecified period. We encountered records for establishments that had been closed for as many as seven years. Often, unidentified closures were the product of a company merger or acquisition. For example, though “M & I Bank” merged with “BMO Harris” in 2011, one location had both establishments listed as active. We found 25 company-wide mergers responsible for duplicates, 22 of which took place between 1998 and 2019. These affected the listing of 12.8% of duplicate establishments in our data.

The second factor was name ambiguity. At times, different online sources listed a single establishment under slightly different names, such as “Hancock and Whitney Bank” versus “Hancock Whitney Bank.” When the establishments are larger chains, the error appears systematically; for example, “Hancock Whitney Bank” appears in 178 locations as a duplicate. Similar ambiguity results from spelling errors, or the use of abbreviations instead of the full name of the establishment.
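
One possible way to flag such near-identical names automatically is to normalize them before comparison. The sketch below is offered only as an illustration of that idea and is not part of the procedure described in this study, which relied on manual inspection; the normalization rules (lowercasing, stripping punctuation, dropping “and” and “&”) are assumptions and would need tuning for real data.

```python
import re


def normalize_name(name):
    """Reduce an establishment name to a crude comparison key (assumed rules)."""
    lowered = name.lower().replace("&", " and ")
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", lowered)  # drop punctuation
    tokens = [token for token in cleaned.split() if token != "and"]
    return " ".join(tokens)


def same_establishment_name(name_a, name_b):
    """True when two names collapse to the same comparison key."""
    return normalize_name(name_a) == normalize_name(name_b)


print(normalize_name("Hancock and Whitney Bank"))
print(same_establishment_name("Hancock and Whitney Bank", "Hancock Whitney Bank"))
```

A key this simple would not catch spelling errors or abbreviations, and it could also merge names that are genuinely distinct, so flagged pairs would still call for review.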

The third factor was that SafeGraph often failed to recognize that certain establishments offer Western Union services on their premises and instead listed Western Union as a separate establishment. Thus, for many “Regions Bank” branches in our data, SafeGraph lists a Western Union establishment in the same location. For some analyses, it will be useful to know that certain establishments offer additional services; however, additional services do not actually constitute a separate establishment. This third factor accounted for 36.5% of the duplicates.²¹

Discussion

Our study reveals extensive and systematic problems in the classification of large-scale location and human mobility data in one commonly used dataset. First, SafeGraph’s reports of the total number of establishments were often inconsistent with those from the U.S. Census. Some of these discrepancies resulted from challenges in determining the most suitable level of granularity for the data. This challenge occurred either because SafeGraph hesitated to assign 6-digit NAICS codes to establishments when they lacked sufficient confidence in the accuracy of the classification or because it assigned 6-digit NAICS codes when broader 2-digit or 4-digit codes would have been more appropriate given limits in the available information for these establishments. Other discrepancies resulted from the issues in the data sourcing process, such as missed unidentified closures and duplicate records. Second, we found that the classification problems were not randomly distributed across categories; instead, they were more likely to affect AFIs, in part because several of these establishments offer multiple financial services. Still, SafeGraph often failed to classify establishments into what would seem the most appropriate NAICS code based on primary activity, as when establishments identified themselves as “car title lenders” or “credit unions” in their name. Thus, we believe a major reason the data had classification problems is the absence of sufficient domain expertise in the production or evaluation of the algorithm.

We also found that some of the sources used by SafeGraph may not be fully reliable with respect to closures, in part because they often retain traces of establishments long after closure. It is possible the changes following COVID-19 were particularly harmful to this form of data quality; however, we also presented evidence that unidentified closures included establishments closed many years, not just months, before early 2022. In addition, we found that the approach to sourcing may be a double-edged sword. While extracting data from multiple web sources produces wider data coverage, it also increases data complexity, which, in turn, tends to magnify data quality errors (Becker et al., 2015). Here, it resulted in a considerable number of duplicate records due to name ambiguity, spelling errors, and abbreviations.

The problems we identified can be classified as presented in Table 8.

Table 8.

Overview of Identified Data Quality Problems.

Problem	Source
Data coverage	- Unidentified closures
	- Imbalanced classes
	- Duplicate records
	- Granularity problems
Classification inconsistencies	- Imbalanced classes
	- Category ambiguity
	- Lack of domain knowledge
Closures	- Outdated records
Duplicate records	- Complexity from multiple sources
	- Unidentified closures
	- Name ambiguity

Limitations and Future Research

Our study has important limitations. We conducted our classification analysis in only six 6-digit NAICS categories out of nearly 1,000, given that these were relevant for the scope of our research in racial inequality in access to financial institutions in the United States. We do not know whether other NAICS industries would exhibit the problems we identify to a lesser or a greater extent. Still, given what we uncovered about the sources of the problems, it is unlikely that the serious issues we observed are unique to research on financial establishments.

Moreover, we acquired our establishment data from SafeGraph in January 2022, and inspected the establishments for closures between September and December 2022. Establishment closures between January and the fall account for some portion of the differences we observed. Still, we expect this portion to be small. One, we accounted for likely increases in closures during our inspection period between September and December 2022. We found no effect of the passage of time in the percentage of closures we identified over those four months. Two, we draw evidence from the inspection of duplicates, which suggests that many duplicate locations are the result of unidentified closures that happened several years ago.

We note that our strategy for identifying duplicates cannot capture unidentified closures when a financial establishment of the categories we have been examining has closed and has been replaced by an establishment of a category outside of these six. Although in some cases (e.g., for banks), zoning restrictions would prevent the replacement of establishments from an entirely different business sector, these restrictions do not apply in all regions or across all industries.

Conclusion

The wide availability of large-scale, administrative data from private companies has transformed researchers’ ability to study human behavior. However, such data are typically not produced with the intention of generating social science. Classification problems represent a prime example of this discord, wherein private companies construe them as a predictive task that will allow them to order the chaos of large-scale, unstructured data into meaningful categories efficiently, while social scientists repurpose the outputs of these classifications to examine human behavior (Grigoropoulou & Small, 2022). A classification can be sufficiently accurate from a predictive perspective, given the practical limits to prediction (Hofman et al., 2017), while being inappropriately inaccurate from a measurement perspective, given the needs of scientific research (Bailey, 1994).

We suggest that classification problems can affect the research based on large-scale private data in four particular ways. First is detection. Given the volume of the data and, often, the cost of acquiring them, researchers must rely on classification categories to identify the units of interest. A large number of false negatives would mean that, while establishments of interest are in the database, researchers cannot effectively locate them. Researchers will often not even know that they do not know of missing units. Second is efficiency. Researchers rely on classification schemes such as NAICS to examine patterns or infer relationships. As the number of false positives increases, heterogeneity within the classification increases, leading to less precision in estimates. Third is validity. Misclassification also undermines empirically the validity of any given category, as one is less certain of measuring what one intends to measure, as when a large number of what one believes to be banks are in fact credit unions. Fourth is bias. When classification problems vary systematically by category—as when banks are observed more reliably than payday lenders—everything from basic descriptions about distributions to relationships between category types and other variables is affected. This problem is especially pernicious when the variation is related to a variable of interest, as when the less reliably measured types of financial institutions are more likely to be located in low-income neighborhoods. Estimates of inequality may be wildly over- or understated. All four problems affect studies in which the location of establishments or mobility patterns to and from them play a role.

We note that our process for identifying the problems in classification and sourcing was labor-intensive; indeed, qualitative fieldwork proved crucial (Grigoropoulou & Small, 2022). For researchers without the resources to perform extensive quality controls at this level of granularity, the results at a minimum should call for both caution and humility about results. A better solution is to not trust any single dataset, instead seeking multiple datasets produced by different companies, governments, or entities and with different sources or processes for generating the data. An even better and complementary approach is to sample a subset of entities of interest and perform at least minimal quality audits.
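
As an illustration of the last suggestion, a stratified audit sample can be drawn in a few lines. The sketch below assumes records stored as dictionaries with a hypothetical `naics` field and an arbitrary number of draws per stratum; it is one way to set up such an audit, not a prescription.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_key, per_stratum, seed=0):
    """Draw up to `per_stratum` records at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[record[stratum_key]].append(record)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Toy records; the identifiers are invented.
records = (
    [{"id": i, "naics": "522110"} for i in range(10)]
    + [{"id": 100 + i, "naics": "522130"} for i in range(3)]
)
audit = stratified_sample(records, "naics", per_stratum=5, seed=1)
print(len(audit))
```

Sampling within each category ensures that small categories, which may be the least reliably classified, are represented in the audit.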

Finally, we encourage researchers employing any large-scale data from private companies to develop or acquire domain expertise. Researchers should either study the specifications, descriptors, and index entries of the classification scheme or find collaborators who have such knowledge. (Indeed, companies producing such data might benefit from doing so as well. In our study, many classification problems would have been prevented with a basic understanding of the kinds of services the establishments provide.) An important component of this process is fully understanding the original sources of the data produced by the companies and the process through which those sources classified the data. As companies increasingly gather data from multiple sources to sell to others, the complexities and resulting classification problems are likely to increase. Researchers will thus need to be more, not less, attentive to these issues in the coming years.

Footnotes

Acknowledgments

This research is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (University Allowance, EXC 2077, University of Bremen). We gratefully acknowledge the support received from the U Bremen Excellence Chair Program and from all those involved in the project, particularly the host, Betina Hollstein.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy (University Allowance, EXC 2077, University of Bremen). Small, a full-time faculty member at Columbia, received a U Bremen Excellence Chair grant (2020-25).

ORCID iD

Nikolitsa Grigoropoulou

Data Availability Statement

The Core Places data used in this study are available to academics through the SafeGraph and Dewey data partnership.

Notes

Appendix

This appendix presents several tables comparing the reliability of several methods.

Table A.1 compares our main approach to fieldwork (first reliability check, Step 5).²² After classifying all establishments in New York City in the dataset based on Steps 1–4, we randomly sampled 243 of them. As discussed above, we separately conducted in-person visits or calls to those 243. For the 243, we calculate the classification agreement between the two methods. Please note that payday lending and car title lending are prohibited by law in New York State; thus, the table does not report estimates for these categories.

Table A.1.

Estimates of Classification Agreement by Category of Financial Establishments.

Category	Agreements (NYC Fieldwork)	Total (NYC Fieldwork)	Agreement (%)
Banks	236	243	97.1%
Credit unions	241	243	99.2%
AFIs	242	243	99.6%
Payday lending	N/A	N/A	N/A
Check cashers	243	243	100.0%
Car title lenders	N/A	N/A	N/A
Pawnshops	242	243	99.6%

The table shows that the classification agreement across the different establishment categories is exceptionally high. For banks, the classification agreement is slightly lower because in our original coding (Steps 1–4), we wrongly coded as commercial banks some establishments that instead did trade financing, a type of financial service outside the range of institutions that should be included in our SafeGraph data in the first place (see 522293 International Trade Financing). This initial misclassification was particularly prevalent among establishments representing foreign banks. While foreign banks constituted a relatively small portion of all banks in the dataset, we sought more information about them as we scaled our classification to the entire dataset. Still, there are limits to identifying such establishments through online searches. For that reason, the estimate of classification agreement between our scheme and SafeGraph’s reported in Table 5 may in reality be lower than shown.

Table A.2 presents the second reliability check discussed in Step 5. It compares the results of our classification (Steps 1–4) for the full data to those of the classification conducted independently by a research assistant on a stratified random sample of establishments.

Table A.2.

Estimates of Classification Agreement by Category of Financial Establishments.

Category	Agreements	Total	Inter-Coder Reliability(%)
Banks	3044	3050	99.8
Credit unions	3048	3050	99.9
AFIs	3031	3050	99.4
Payday lending	2879	3050	94.4
Check cashers	2983	3050	97.8
Car title lenders	2872	3050	94.2
Pawnshops	3040	3050	99.7

We probed further to understand whether our approach missed something important that resulted in lower reliability rates for payday lenders and car title lenders. Specifically, we found that we tended to identify establishments as offering payday loans and car title loans when, in fact, they did not. As discussed earlier, we consider our coding approach conservative. While we expect that a close inspection at the establishment level will recognize more establishments that offer specific alternative financial services than our heuristic methods do, we were particularly invested in not classifying establishments into service categories we could not verify. Upon inspection, we discovered that all of the establishments we had wrongly identified as offering payday lending, and 92.9% of those we had wrongly identified as offering car title lending, were branches of large corporations. For example, while Advance America is one of the largest payday lenders in the United States, not all of its branches offer in-store payday lending.²³ This type of service heterogeneity across establishments within corporations is nearly impossible to account for in automatic or semi-automatic classification systems, a fact that makes clear the need for a more targeted, qualitative approach to coding to tease out differences at the establishment level, when such differences are important in a study.

Nonetheless, this variation should not considerably affect the classification into NAICS codes in this case, for two reasons: (1) The corporations that exhibited heterogeneity across establishments in payday lending, which resulted in false positives, were primarily check cashers and payday lenders, and invariably offered check cashing. These two types of alternative financial services fall under the same NAICS category (see Table 1). (2) The companies associated with at least 92.9% of eventually unconfirmed services in car title lending were corporations whose primary activity was payday lending and check cashing, not car title lending.

As discussed in the Methods section, because our initial objective was not to assess the efficacy of given methods but to improve the quality of our data for empirical analysis, our classification process was iterative. We used some information from later steps to return and refine classifications in earlier steps. Thus, some establishments have been classified based on information from more than one of these methods.

Nevertheless, it is valuable to assess the efficacy of each of the given methods. Thus, in Table A.3 we present the results of our reliability check for each method in Steps 1–4 separately. For these calculations, we removed those establishments that we had classified iteratively, that is, based on information from more than one method. Thus, in Table A.3 each column set exhibits the reliability of a given method based only on those establishments classified with that method. As the table shows, the intercoder reliability is exceptionally high across financial categories for our two primary classification methods: assignments based on descriptive keywords and on company information. However, for our most labor-intensive method, the manual search, the reliability was lower for payday lenders (89.1%), check cashers (84.5%), and car title lenders (91.5%).

To understand why, we probed further. During the annotation process, both the researcher and the research assistant, who coded the data independently, were required to note how confident they were about the accuracy of the assigned code for a given establishment, and to make notes regarding any ambiguous entries. We found that many of the coding discrepancies were rooted in the issues we discussed above in our presentation of Step 4. For example, in 15 of the 20 discrepant annotations for check cashers, either the researcher, the research assistant, or both had indicated low confidence in their classification. The limited confidence was often due to limited information from the expected source, such as an active company Web site, which was frequently attributable to the establishment having been closed for some time. In addition, establishments were at times ambiguous in their description of financial services. The procedure reinforces the idea that researchers employing manual classification should include indicators of their subjective confidence in the classification.
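One way to make such confidence indicators usable is to store them next to each assigned code, so that discrepancies can later be cross-tabulated against them. The sketch below is only an illustration under our own assumptions: the `Annotation` record, its field names, and the toy data are hypothetical and do not reproduce the study's annotation files.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical record of one establishment coded by two people."""
    establishment_id: str
    code_researcher: bool       # researcher: does it offer the service?
    code_assistant: bool        # research assistant: same question
    low_conf_researcher: bool   # researcher flagged low confidence
    low_conf_assistant: bool    # assistant flagged low confidence


def low_confidence_share(annotations):
    """Return (flagged, discrepant): the number of discrepant annotations
    and how many of them carry at least one low-confidence flag."""
    discrepant = [a for a in annotations
                  if a.code_researcher != a.code_assistant]
    flagged = [a for a in discrepant
               if a.low_conf_researcher or a.low_conf_assistant]
    return len(flagged), len(discrepant)


# Toy data for illustration only, not values from the study.
sample = [
    Annotation("e1", True, True, False, False),
    Annotation("e2", True, False, True, False),
    Annotation("e3", False, True, False, False),
    Annotation("e4", False, True, False, True),
]

flagged, discrepant = low_confidence_share(sample)
print(f"{flagged} of {discrepant} discrepancies carry a low-confidence flag")
```

Applied to real annotations, a tabulation of this kind would yield figures such as the 15 of 20 reported above for check cashers.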

Table A.3.

Estimates of Classification Agreement by Category of Financial Establishments Across Classification Methods.

Category	Descriptive Keywords (Step 1)			Company Information (Step 2)			Query Expansion Mining (Step 3)			Manual Search (Step 4)
Category	Agreements	Total	(%)	Agreements	Total	(%)	Agreements	Total	(%)	Agreements	Total	(%)
Banks	564	566	99.6	2185	2187	99.9	60	60	100.0	127	129	98.4
Credit unions	565	566	99.8	2187	2187	100.0	60	60	100.0	128	129	99.2
AFIs	563	566	99.5	2183	2187	99.8	59	60	98.3	119	129	92.2
Payday lending	539	566	95.2	2062	2187	94.3	60	60	100.0	115	129	89.1
Check cashers	561	566	99.1	2115	2187	96.7	58	60	96.7	109	129	84.5
Car title lenders	557	566	99.4	2080	2187	95.1	59	60	98.3	118	129	91.5
Pawnshops	565	566	99.8	2182	2187	99.8	60	60	100.0	125	129	96.9
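The per-method figures in Table A.3 rest on a filtering step: establishments classified with information from more than one method are dropped before agreement is computed for each method. A minimal sketch of that logic follows, assuming a hypothetical record layout (a set of method names plus the two coders' labels) and toy data rather than the study's actual files.

```python
from collections import defaultdict

# Toy records for illustration only: (methods used, coder A label, coder B label).
records = [
    ({"keywords"}, True, True),
    ({"keywords"}, True, False),
    ({"company"}, False, False),
    ({"company", "manual"}, True, True),  # classified iteratively: excluded
    ({"manual"}, True, False),
    ({"manual"}, False, False),
]


def agreement_by_method(records):
    """Percent agreement per method, using only single-method records."""
    counts = defaultdict(lambda: [0, 0])  # method -> [agreements, total]
    for methods, label_a, label_b in records:
        if len(methods) != 1:
            continue  # skip establishments classified with several methods
        (method,) = methods
        counts[method][1] += 1
        counts[method][0] += int(label_a == label_b)
    return {
        method: (agree, total, round(100.0 * agree / total, 1))
        for method, (agree, total) in counts.items()
    }


by_method = agreement_by_method(records)
for method, (agree, total, pct) in sorted(by_method.items()):
    print(f"{method}: {agree}/{total} ({pct}%)")
```

Because the multi-method records are excluded, the per-method totals need not sum to the size of the full reliability sample.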

Author Biographies

Nikolitsa Grigoropoulou, Ph.D., is a postdoctoral researcher in Computational Social Science at the University of Bremen. She is an interdisciplinary social inequality scholar with an emphasis on intergroup dynamics, human flourishing, and financial inequality. Her current research focuses on quality and classification issues in the operationalization of large-scale data and the extent to which surveys, qualitative methods, and other social science methods can aid “big data” research to produce reliable social science.

Mario L. Small, Ph.D., is Quetelet Professor of Social Science at Columbia University. An elected member of the National Academy of Sciences, he is an expert on social inequality, urban poverty, social networks, and the relationship between qualitative and quantitative methods. His recent books include Someone To Talk To: How Networks Matter in Practice (Oxford U Press) and Qualitative Literacy: A Guide to Evaluating Ethnographic and Interview Research (U California Press).
