Abstract
Big data enables researchers to closely follow the behavior of large groups of individuals by using high-frequency digital traces. However, these digital traces often lack context, and it is not always clear what is measured. In contrast, data from ethnographic fieldwork follows a limited number of individuals but can provide the context often lacking from big data. Yet, there is an under-explored potential in combining ethnographic data with big data and other digital data sources. This paper presents ways that quantitative research designs can combine big data and ethnographic data and accounts for the synergies that such combinations can provide. We highlight the differences and similarities between ethnographic data and big data, focusing on three dimensions: individuals, depth of information, and time. We outline how ethnographic data can validate big data by providing a “ground truth” and complement it by giving a “thick description.” Further, we lay out ways that analysis carried out using big data could benefit from collaboration with ethnographers, and we discuss the potential within the fields of machine learning and causal inference.
This article is part of the special theme on Machine Anthropology. To see a full list of all articles in this special theme, please click here: https://journals.sagepub.com/page/bds/collections/machineanthropology
Introduction
Digitalization during recent decades has produced unprecedented amounts of “big data” that often contain signals of human behavior and social interactions that are potentially valuable for social science. Since these data are collected as by-products of other administrative or business processes, they only capture a limited number of details of a given situation. In contrast, the scope of data collected from ethnographic fieldwork has few restrictions, in particular, for behavioral and social aspects. Although new digital data sources have yielded insights within both quantitative and qualitative social sciences (Buyalskaya et al., 2021), they are rarely directly integrated. This contrasts with the rapid commercial development of models trained on big data combined with human annotations. Such models can often accurately infer the human annotation or judgment from the big data, for example, whether a post on social media contains violence, and they have recently also been used to capture ethnographic descriptions of situations (Cury et al., 2019). Yet, beyond simply recreating such human annotation, little research has been dedicated to studying how to actively combine ethnographic fieldwork with “big data” using quantitative methods.
This paper outlines possible combinations of new (big) data sources with ethnographic fieldwork and accounts for how these may lead to advances within social sciences through strong synergies. To understand the role of ethnographic fieldwork data, we first compare it with other common data types in social science, including “big data.” Our comparison shows that “big data” and ethnographic data are (often) similar in fundamental ways in terms of their high temporal resolution and level of detail. However, “big data” usually comes as a by-product of other services, known as “exhaust data.” Therefore, it is often poorly documented and does not necessarily capture what researchers ideally aim to measure. This nature of “big data” indicates a novel role for ethnographic data to validate and complement it.
We argue that ethnographic fieldwork can enhance quantitative big data in three fundamental ways: (i) by establishing a “ground truth” (i.e. whether the data measures what was assumed), (ii) providing a “thick description” (understanding of the social context and mechanisms at play) of situations that the data represent, and (iii) by measuring otherwise hidden dimensions. Thus, combining ethnographic data with other data enables researchers to examine new research questions with quantitative methods.
This paper contributes to the nascent literature that outlines how and why novel data science tools and big data can be complemented with data collected through and analyzed by ethnographic methods (e.g. Wang, 2013; Blok and Pedersen, 2014; Blok et al., 2017; Bornakke and Due, 2018). This literature expands the field of mixed methods, which primarily uses survey data when combining qualitative and quantitative data (Blok and Pedersen, 2014). Whereas this literature starts from the viewpoint of researchers with (at least partially) qualitative backgrounds, we depart from existing work by targeting social science researchers with primarily quantitative backgrounds. Within the fields of science and technology studies and digital methods, there is a substantial amount of literature concerning how data from virtual behavior can be used and studied in social science (see e.g. Rogers, 2009; Marres, 2012; Marres and Gerlitz, 2016; Venturini et al., 2019). Therefore, we limit our focus to how methods can be combined to analyze phenomena in the physical (i.e. non-virtual) world, where a large, unexplored potential lies.
This paper also contributes to the emerging literature that uses historical ethnographic data, such as the Ethnographic Atlas (Murdock, 1967), as inputs for quantitative models, for example, within the field of economic history (see Lowes, 2020, for a review). We complement this literature by suggesting a more active role for ethnographic fieldwork in combination with quantitative methods, and by broadening the scope of using ethnographic data from more recent data sources.
Comparing different kinds of data
Methods for collecting data can be categorized in several useful ways. Kitchin (2014) divides raw data into captured data and exhaust data depending on how it is generated. Captured data is collected with the intention to produce specific data. Exhaust data is instead produced by electronic devices or systems as a by-product of other activities. Over the last decade, industry and researchers alike have come to regard exhaust data not just as a by-product but as a valuable input to business processes and to research.
The use of exhaust data in research has both advantages and drawbacks. The main advantage is that large-scale data with high granularity can be collected for many individuals at low costs. The primary drawback is that the data was intended for non-research purposes; therefore, it may be a poor measure for the object(s) of interest, which may lead to erroneous conclusions. An additional concern is that exhaust data may contain private and sensitive information such that processing it raises ethical concerns.
Relying on data that was not generated for the specific research question is not new in social sciences. This is done routinely, for example, using register data of tax filings, voter registration, and so on. Kitchin (2014) calls these data types secondary data. He distinguishes between primary data collected for the specific research question and secondary data that is collected for another purpose and subsequently made available to others to reuse and analyze. This dimension is independent of the exhaust/captured dimension. However, this kind of register data is not exhaust data because it was collected and stored intentionally, which means that a decision was made about how to collect the data in the most meaningful way. In contrast, exhaust data is usually generated as the by-product of activities online or on smart devices, for example, log files or location history. Therefore, there is a greater need to understand what the exhaust data actually measures—the need for a “ground truth.”
Ethnographic data is collected in everyday contexts, often referred to as “in the field,” mainly using participant observation and/or in-depth interviews (Hammersley, 2007). Unlike most quantitative data, ethnographic data collection is usually “unstructured in the sense that it does not involve following through a fixed and detailed research design set up at the beginning,” which allows spontaneous and relevant observations and thoughts to be recorded (Hammersley, 2007). Though we are aware that ethnography can include a range of methods, in this paper, we refer to observations from the field when we refer to ethnographic data.
Ground truth and thick description
“Ground truth” refers to direct observation that can confirm (or refute) that the sensory data (and the researchers’ interpretation) is correct. The term originates from remote sensing using satellites (Hoffer, 1972), where it allowed, for example, meteorologists to verify that a hurricane visible in satellite data was also observable on site. We find ground truth to be a good metaphor for how human observations can help validate exhaust data, whose quality might not be perfect and which may measure something unintended.
Related to ground truth, but distinct from it, is the ethnographic concept of “thick description,” illustrated by the classic example from Geertz (1973) about the difference between a twitch and a wink. We might observe two boys doing exactly the same rapid contraction of the right eyelid. For one, the twitch is an involuntary result of a physical impairment, while for the other, it is a conspiratorial signal to a friend. Though the two situations look exactly alike, they are fundamentally different, and a thick description can distinguish between them.1
Having a ground truth and a thick description from the situations where the data is generated is especially beneficial when using exhaust data. Ground truth can help examine what the exhaust data is actually recording, which is important as the data may be unstable over time or may not correspond to the analyst’s perception of it. A thick description can help the researcher understand the social phenomena at play.
While ethnographic data collected by humans provides the potential for much greater detail, this data may also suffer from issues similar to those arising from the exhaust nature of big data. Ethnographers, and more generally human observers, may make errors when recording data, for example, by misperceiving what they observe or failing to recall an event or piece of information correctly when they write their notes. Further, their subsequent interpretation of the field notes may not correspond to reality or to how those observed view it. Finally, the mere presence of others may be obtrusive, and the people observed may alter their behavior. However, all these considerations, along with approaches to mitigate and reflect upon these errors and misperceptions, are already embedded in ethnographic methods and analysis (Bernard, 2017; O’Reilly, 2012). In fact, quantitative researchers working with exhaust data are likely to benefit from the insights in the ethnographic tradition of working with unstructured data and critically assessing what the data represents (Charles and Gherman, 2019).
Whether or not to collect data on ground truth and thick descriptions depends on what value they add to the analysis. We note that ground truth does not need to be obtained by trained ethnographers but can be obtained by observers with little or no training.2 However, for a deeper understanding of the social context, we need ethnographers to make a thick description. In situations where both ground truth and thick description are needed, it would be logical for an ethnographer to obtain both. Collecting a ground truth can be a natural part of an ethnographer’s fieldwork, for example, in the early stages of the fieldwork, which are often task-oriented (Bernard, 2017).
In most studies that employ big data, it is neither feasible nor necessary to obtain ground truths and/or thick descriptions for every instance where data is generated. Depending on the application and context of the study, once a certain number of observations and instances have been collected via ethnographic fieldwork, one can either estimate the desired parameters with sufficient precision or extrapolate ground truths and/or thick descriptions to situations where they are missing.
Fundamental dimensions of data
Most quantitative data for analysis in social science has three main dimensions called N, M, and T.3 Here, N represents the number of individuals (or another specified unit of observation) observed, M represents the number of variables (e.g. health or income) observed for these individuals. T is a temporal dimension that tells how many points in time the M variables have been observed for the N individuals. As we will argue in the following, classical quantitative social science data is usually high N, medium M, and (relatively) low T.
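To make these dimensions concrete, a data set along N, M, and T can be pictured as a three-dimensional array. The sketch below (in Python, with made-up sizes) is purely illustrative and not drawn from any actual data set discussed in this paper:

```python
import numpy as np

# Hypothetical classical survey panel: high N, medium M, low T.
N, M, T = 1000, 5, 4  # 1,000 respondents, 5 variables, 4 survey waves
rng = np.random.default_rng(0)
panel = rng.normal(size=(N, M, T))

print(panel.shape)  # (1000, 5, 4)

# "Big data" raises T dramatically (e.g. one observation per minute for a year),
# while ethnographic data lowers N but raises M (hundreds of coded details).
```

The shape of the array makes explicit which dimension a new data source expands: big data mostly stretches T, ethnographic data mostly stretches M.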
Related to this, Bornakke and Due (2018) refer to big data as big and thin and ethnographic observational data as small and thick, primarily focusing on the level of contextual complexity (thick/thin) and the number of observations (small/big). While this differentiation can be relevant, we, coming from a quantitative background, find the N, M, and T dimensions more useful since they explicitly capture the temporal aspects common to both big and ethnographic data.
New digital data is often referred to as “big data.” What makes “big data” big is most often described from a computer science perspective focusing on the three V’s (Gandomi and Haider, 2015): volume, variety, and velocity. However, while big may refer to many individuals or many variables observed (“high N or M”), in our view, the novelty in “big data” for social science research is velocity, that is, the temporal dimension (“high T”).4 Although not all “big data” observes the same individuals over time, this is usually the case. Therefore, we consider “big data” as having “high T” (collected continuously and with high granularity) but with ambiguous sizes of M and N.
Ethnographic data versus “big data”
Ethnographic fieldwork data is usually “low N” in the sense that few individuals are observed (Hammersley, 2007).5 However, ethnographic data has high depth (“high M”) since, for each individual or setting, the ethnographer can potentially list hundreds, possibly thousands, of details in the description of a situation, for example, appearance, tonality, behavior, and attitude. Likewise, ethnographic data usually has a temporal dimension making it “high T.” These time measures range from observing individuals in a specific setting for a couple of hours to following the same individuals across settings for months or years.6 It is the combination of the depth of information (high M) along with the continuous observations (high T) that allows ethnographers to make thick descriptions.7
Comparing ethnographic data and “big data,” both have “high T” since the same individuals are observed repeatedly, but ethnographic data has much more depth (“higher M”) and covers fewer individuals (“lower N”) than big data.
An example: Contagious smartphones
To give a concrete example of how and why the combination of big data and ethnographic fieldwork is beneficial, we revisit one of our own papers and show how its analysis could have benefited from combining big data with ethnographic data. Glavind et al. (2021) examines whether smartphone use is contagious when people are (physically) close together in social settings. Knowledge of such behavioral spillovers is important as they magnify the adverse consequences of increased smartphone use for health and learning (Ferguson, 2017; Bjerre-Nielsen et al., 2020). Further, spillovers would alter smartphone use in social settings from an individual choice to a (partly) collective choice, suggesting that restrictions could be beneficial.
Our analysis is based on big data from the Copenhagen Networks Study (Stopczynski et al., 2014) that allows us to track physical co-presence (through Bluetooth sensors) as well as participant screentime and incoming/outgoing text messages. We focus on whether co-present individuals increase their smartphone use around the arrival of text messages, which, given quasi-random timing, provides a causal estimate of contagion. We find that when one individual receives a text message, the screentime of nearby individuals increases for the next 2 min. This effect is only present if the two people have a social relationship (i.e. have communicated by phone or are friends on Facebook).
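A stylized version of this before/after comparison can be sketched as follows. The data and magnitudes below are simulated and purely illustrative of the quasi-experimental logic; they are not the actual analysis pipeline or estimates of Glavind et al. (2021):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: for each incoming text message (the "event"), we log the
# nearby peer's screentime (seconds of screen-on) in the 2 minutes before and
# the 2 minutes after the message arrives.
n_events = 5000
screentime_before = rng.poisson(lam=20, size=n_events).astype(float)
# Simulate a true spillover of +3 seconds on average, with noise.
screentime_after = screentime_before + rng.normal(loc=3.0, scale=5.0, size=n_events)

# Because message timing is quasi-random, the average before/after difference
# can be interpreted as the causal spillover on the peer's screentime.
diff = screentime_after - screentime_before
effect = diff.mean()
se = diff.std(ddof=1) / np.sqrt(n_events)
print(f"estimated spillover: {effect:.2f} s (SE {se:.2f})")
```

The high T of the sensor data is what makes such second-level event windows possible in the first place.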
The analysis was made possible by the high granularity of big data, which allowed us to track the social dynamics of the situation at the second level. However, several times during our analysis, we faced limitations due to the “thinness” of big data, where the following two additions would have contributed substantially. First, we lacked ground truth on whether the two co-present people were actually interacting socially. Such data could allow us to account for why the size of the effect we measured was relatively small, as one would expect spillovers to occur mainly when there was a social interaction. Second, a thick description of the social situations would allow us to understand how and why there is spillover in usage. Ethnographers doing participant observations could help understand whether smartphone contagion depends on, for example, social norms, relative social status, group composition, activity, and so on. This could potentially give us a deeper understanding of how contagion happens, whether and when it might be a problem, and how best to handle potential adverse effects, for example, by prohibiting smartphones in certain contexts or encouraging certain norms around smartphone use.
Combining ethnographic data with quantitative tools
When applying quantitative tools to data from ethnographic fieldwork, the ethnographic data needs to be structured to some degree and preferably digitized. How best to do this will depend on the format of the data and the application. Sometimes raw transcripts could be relevant, at other times input from “coded” field notes would be relevant (Dohan and Sanchez-Jankowski, 1998; Abramson and Dohan, 2015). Subsequently, the structured ethnographic data can be combined with other data sources or be used as-is.
One fundamental way ethnographic data can be combined with quantitative methods and data science techniques is by using tools for pre-processing or describing data. Such tools could range from plotting the development of how often selected words are mentioned over time in a transcribed interview or using unsupervised machine learning that provides a flexible approach to finding novel patterns in the data, for example, clusters of similar observations. In fact, clustering analysis was pioneered by quantitative anthropologists (Driver and Kroeber, 1932). While these tools rarely yield end results, they can provide useful inputs that complement other parts of the qualitative analysis. In particular, they can be helpful in the early stages of fieldwork, as a guide to finding patterns that deserve further in-field exploration,8 and in post-fieldwork analysis, as a complement to the classical ethnographic analysis.
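As a minimal illustration of the first kind of tool, the sketch below counts how often selected words occur across consecutive segments of a transcript, which could then be plotted over time. The segments and tracked words are entirely hypothetical:

```python
from collections import Counter

# Hypothetical transcribed interview, split into consecutive segments
# (e.g. 10-minute blocks of the recording).
segments = [
    "we talk on the phone but mostly we meet at the cafe",
    "my phone is always on the table when we meet",
    "i put the phone away because the conversation matters",
    "the phone buzzed and everyone checked their phone at once",
]
tracked = {"phone", "meet"}

# For each segment, count occurrences of each tracked word.
trajectory = [
    {word: Counter(segment.split())[word] for word in tracked}
    for segment in segments
]
for t, counts in enumerate(trajectory):
    print(t, counts)
```

Even this trivial word-trajectory can flag shifts in what interviewees talk about, pointing the ethnographer toward passages worth closer qualitative reading; clustering methods extend the same idea to grouping whole segments or field-note entries.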
A fundamentally different approach is to combine ethnographic data with models of specific outcomes. One such approach is the rapidly evolving method of supervised machine learning, which is used to select the best models for predicting or inferring outcomes. These tools are helpful in many social science settings, for example, to predict students’ academic performance (Bjerre-Nielsen et al., 2021), and can be integrated into algorithms that determine decisions based on the predictions. As argued in the introduction, machines can use data annotated by humans to learn to correctly recognize patterns and properties of a situation (Cury et al., 2019). Another use is to include ethnographic data as an input in the prediction, which will lead to better predictions (given that the ethnographic data included is relevant). Bjerre-Nielsen et al. (2021) show that predictions made from task-specific information (e.g. using high-school grades when predicting college grades) can outperform predictions made using fine-grained individual big data. It is conceivable that ethnographers can supply such task-specific data and thereby enhance prediction.
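The point that a few task-specific variables can beat many weakly informative traces can be illustrated with a small simulation. All variables, magnitudes, and the least-squares setup below are hypothetical and are not the models or data of Bjerre-Nielsen et al. (2021):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Hypothetical setup: college grade depends strongly on high-school grade
# (a task-specific variable), plus noise.
hs_grade = rng.normal(size=n)
college_grade = 0.8 * hs_grade + rng.normal(scale=0.6, size=n)

# "Big data": 50 behavioral traces, each only weakly related to the outcome.
big_data = rng.normal(size=(n, 50)) + 0.05 * hs_grade[:, None]

def oos_mse(X, y, split=1000):
    """Least-squares fit on a training half, mean squared error on the test half."""
    Xtr, Xte, ytr, yte = X[:split], X[split:], y[:split], y[split:]
    coef, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return float(((Xte @ coef - yte) ** 2).mean())

mse_task = oos_mse(hs_grade[:, None], college_grade)
mse_big = oos_mse(big_data, college_grade)
print(f"task-specific MSE: {mse_task:.3f}, big-data MSE: {mse_big:.3f}")
```

Under these (stylized) assumptions, the single task-specific predictor yields a clearly lower out-of-sample error than the 50 noisy traces, illustrating why ethnographer-supplied variables can be valuable inputs.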
Causal inference is a radically different approach from prediction and relies on either theoretical knowledge of the causal structure and/or interventions in the form of experiments or quasi-experiments. We primarily see two ways that ethnographic fieldwork can improve causal inference. First, the ethnographer can intervene as an agent of change in the observed situation. In statistical terms, the ethnographer would act as a source of (exogenous) variation that can turn the field study into a quasi-experimental setting. For example, if we want to examine whether one person’s mobile phone use affects nearby peers, the ethnographer can manipulate the situation by taking out their phone at random times to examine how this affects the social situation. Second, ethnographic fieldwork has the potential to give a better understanding of the mechanisms that cause covariates to influence the outcome and, thereby, to strengthen the causal claim. One way of doing this is by collecting data through fieldwork that contains thick descriptions of situations similar to or within the (big) data set used.
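The logic of the first approach, in which the ethnographer's randomized intervention identifies a causal effect, can be sketched with simulated data. All numbers and variable names below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sessions = 800

# Hypothetical field experiment: in each observed session, the ethnographer
# flips a coin and either takes out their phone (treated=1) or not (treated=0).
treated = rng.integers(0, 2, size=n_sessions)

# Peers' subsequent phone use (minutes); suppose the true effect is +1.5 min.
baseline = rng.normal(loc=5.0, scale=2.0, size=n_sessions)
peer_use = baseline + 1.5 * treated

# Because the treatment is randomized, a simple difference in means is an
# unbiased estimate of the causal effect of the ethnographer's intervention.
effect = peer_use[treated == 1].mean() - peer_use[treated == 0].mean()
print(f"estimated effect: {effect:.2f} minutes")
```

The ethnographer here plays the role that an instrument or randomized assignment plays in a conventional (quasi-)experimental design, while simultaneously observing the mechanisms at work.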
We emphasize that combining ethnographic fieldwork with other data sources for causal purposes or as an input to predictive models will require careful planning, which may require the simultaneous collection of ethnographic and big data. This collaboration will often be demanding for the researchers; Moats (2021) considers these challenges and how such collaboration is best carried out.
Footnotes
Acknowledgment
We thank the two anonymous reviewers for their helpful and constructive comments. The research projects discussed in this commentary were carried out at the Copenhagen Center for Social Data Science (SODAS), a highly interdisciplinary center for research and education at the University of Copenhagen. We are grateful for our discussion with colleagues in this interdisciplinary environment, in particular, Morten Axel Pedersen and Hjalmar Bang Carlsen. Morten has been essential in the process of writing this essay, for example, when clarifying the difference between ground truth and thick description. We are also grateful to Anne Sofie Beck Knudsen for discussions about the use of ethnographic data in economics and Esben Brøgger Lemminger for his research assistance in finding relevant research articles.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
