Abstract
Bringing together a motley crew of social scientists and data scientists, the aim of this special theme issue is to explore what an integration or even fusion between anthropology and data science might look like. Going beyond existing work on the complementarity between ‘thick’ qualitative and ‘big’ quantitative data, the ambition is to unsettle and push established disciplinary, methodological and epistemological boundaries by creatively and critically probing various computational methods for augmenting and automatizing the collection, processing and analysis of ethnographic data, and vice versa. Can ethnographic and other qualitative data and methods be integrated with natural language processing tools and other machine-learning techniques, and if so, to what effect? Does the rise of data science allow for the realization of Levi-Strauss’ old dream of a computational structuralism, and even if so, should it? Might one even go as far as saying that computers are now becoming agents of social scientific analysis or even thinking: are we about to witness the birth of distinctly anthropological forms of artificial intelligence? By exploring these questions, the hope is not only to introduce scholars and students to computational anthropological methods, but also to disrupt predominant norms and assumptions among computational social scientists and data science writ large.
Keywords
This article is part of the special theme on Machine Anthropology. A full list of articles in this special theme is available at: https://journals.sagepub.com/page/bds/collections/machineanthropology
Introduction
Over the last decade, computational techniques developed by data scientists have transformed social science (Centola, 2018; Jemielniak, 2020; Lazer et al., 2009a, 2009b; Mills, 2019; Mohr et al., 2015; McFarland et al., 2016; Pentland, 2015; Salganik, 2019; Sekara et al., 2016; Veltri, 2019). This has been especially prominent within sociology and political science (Grimmer and Stewart, 2013) but also economics (Gentzkow et al., 2019) and psychology (Goel et al., 2010). Machine-learning methods have revolutionized the quantitative study of text data, just as the field of network science has revitalized the field of social network analysis (Centola, 2018; Sekara et al., 2016). Anthropologists and other social or humanities researchers relying on ethnographic methods have, however, been conspicuously absent from these developments. While the founders of modern anthropology ‘were brought up as part of a larger and far more quantitative’ (Munk, 2019: 163) tradition, socio-cultural anthropology has earned the reputation as ‘one of the least “mathematized” and “computerized” of the social sciences’ (Cunningham, 1996: 401). Indeed, ‘many ethnographers and qualitative researchers more broadly have been reluctant to integrating computing into their “craft”’ (Abramson, 2016: 255), presumably because quantitative methods are widely associated with ‘positivist’ and other purportedly epistemologically and ethically flawed approaches. No matter whether it is due to the limited statistical and computational competencies among many anthropologists, or whether it involves a more fundamental (and largely tacit) aversion to anything quantitative (Feldman, 2017), the fact is that the potentials (as well as the perils) of a computational or indeed machinic anthropology remain largely unexplored. It is just this lacuna that this special theme issue seeks to address.
Can ethnographic data and methods be combined with machine learning? Does the rise of social data science allow for a realization of Levi-Strauss’ old dream for a computational anthropology – and if so, to what benefit (and with what risk)? Bringing together researchers from academia and beyond who have spearheaded the combination of ethnographic and big data, this special theme builds on and goes beyond existing work on the complementarity between ‘thick’ and ‘big’ data by exploring what a fully-fledged integration or even fusion between anthropology and data science might look like. In doing so, the ambition is not only to contribute to the development of novel computational anthropological methods, but also to subvert established conventions within and bifurcations between quantitative and qualitative social science. 1
Anthropology and data science: Three strategies
In broad terms, one can distinguish between three different strategies for how anthropologists as well as scholars from cognate disciplines relying on ethnographic approaches 2 have over the last decade engaged with the so-called Big Data revolution, namely what might be called the anthropology of (big) data science, the anthropology with data science, and the anthropology as – or by – data science. While there are several overlaps between these three strategies when it comes to both analytical approach and personnel, each can be said to represent a distinct positioning within, and attitude towards, social scientific engagements with computational methods and data science in the academy and beyond. Let us now consider these in turn, with emphasis on the third strategy, which echoes the aspiration of the present special theme issue. As we shall see, while there are only a few existing examples of studies that fall within this latter category, it is arguably also this strategy that holds the greatest transformative potential for both qualitative social science and data science alike.
The anthropology of big data/data science has so far been most dominant. Numerous scholars have studied data science and data scientists ethnographically (e.g. Bell et al., 2015; Madsen et al., 2018; Douglas-Jones et al., 2021; Kockelman, 2020; Mackenzie, 2017), as part of a wider interest in ‘critical data’ (Blok and Pedersen, 2014) or ‘critical algorithm’ (Seaver, 2017) studies. These approaches serve as a corrective to the techno-futurism (McGillivray et al., 2020) of ‘big data’ narratives. Yet, while recent ethnographic work on vernacular data practices (e.g. Hobbis and Hobbis, 2022) offers a welcome reality-check to the alarmism of much critical work on the political economy of digital data, the anthropology ‘of data’ literature is still dominated by ‘a distant, neutral gaze’ (Paff, 2022) posited to be at a safe distance from the big data reality studied. Crucially, this sense of epistemological-cum-ethical superiority and distance extends to the algorithms and other computational techniques deployed by the data scientists themselves. While this is unsurprising given the skepticism towards quantitative data and quantification among modern-day socio-cultural anthropologists and critical sociologists, it is still unsatisfactory. Indeed, there is something profoundly non-anthropological about the knee-jerk way in which anthropologists tend to reduce quantitative data and methods to mere objects of critique. As several scholars have pointed out (Fortun et al., 2017; Knox and Nafus, 2018; Paff, 2022), the entanglements between anthropology and data science cannot be reduced to a relationship between an etic subject that studies and an emic object of study. There is a need for more respect, curiosity and symmetry between the two communities – including toward their respective methods.
The anthropology with big data is a step in this direction. Representing different attempts to bring disparate qualitative and quantitative datasets into dialogue through ‘integration’ (Charles and Gherman, 2019) and ‘complementarity’ (Blok and Pedersen, 2014), this ‘quali-quantitative’ (Venturini and Latour, 2010) literature includes Bornakke and Due’s (2018) ‘big-thick blending’ of observational and camera data, Christin's triangulation between ethnographic and algorithmic data (2020), and Blok et al.’s (2017) ‘stitching’ of fieldwork notes and sensor data. Yet, because these and similar studies (e.g. Beaulieu, 2017; Ford, 2014; Lowrie, 2018; Ruckenstein, 2019) tend to be pilot work and are thus hard to generalize from, new generations of social data scientists are left ‘in the dark’ as to how to design ‘process[es] of complementing big and thick insights’ and what is the most ‘practical method for integrating big and thick data’ (Bornakke and Due, 2018: 13).
Several contributions to this special issue take steps in this direction. Heeding Breiger et al.’s (2018) call for a ‘low-tech formalization for text analysis’, Isfelt et al. combine netnographic material about green activists in Denmark with computationally generated Twitter data in order to write a ‘micro-history of ideas in real-time’. Under the heading ‘thick quali-quantitative data’, Albris et al. reflect on how digitized logbooks can optimize the collection, processing and analysis of (n)ethnographic data. Finally, based on research focused on the micro-sociological aspects of international diplomacy, Adler-Nissen and her colleagues illustrate how the political scientific study of social media data can benefit from treating them as both singular data points in a larger pattern and as fluid objects embedded in broader social processes. In each of these cases (as in other studies in the same mould, such as Moats and Borra, 2018; Pretnar and Podjed, 2019), an established field (anthropology, international relations) is augmented by juxtaposing ethnographic and computational methods in hands-on ways, which may inspire more quali-quantitative applications in the future.
Anthropology as data science is almost terra incognita. Perhaps the reason why is that this approach requires ‘enacting [the qual-quant difference] into our own practice as anthropologists’ (Paff, 2022). Whereas, in the two above strategies, anthropology is reproduced as a distinct discipline with its own methodological tools (fieldwork) and epistemological assumptions (e.g. about quantitative methods), this third strategy seeks to disrupt, transform and expand what anthropology is, or rather could be. Of course, quantitative anthropology has a long history, even if it has always been marginalized in the discipline (Chibnik, 1999; Pedersen, 2021; Schaffer, 1994). After all, people who do fieldwork count things all the time, no matter whether they recognize this or not (Pedersen, 2019). Famous figures from classic British (e.g. Gluckman, 1961) and American (Driver and Kroeber, 1932) anthropology promoted quantitative data and methods, and the history of the discipline is awash with attempts to introduce a more formalized and mathematicised collection, processing and analysis of ethnographic data, although these have generally left little impact. 3 However, it was Levi-Strauss who first called for a computational anthropology. ‘The fundamental requirement of anthropology’, he put it, ‘is that it begins with a personal relation and ends with a personal experience, but … in between there is room for plenty of computers’ (cited in Hymes, 1965; see also Levi-Strauss, 1963). Yet, due to a combination of lacking processing power, technical skills, and institutional backing, he never ‘developed … a systematic program of investigation based upon the repertory of basic mathematical structures’ (de Almeida et al., 1990: 370). Instead of using computers in concrete anthropological research, ‘the imagined computer allow[ed] Lévi-Strauss’ ideal method to exist, in theory’ (Seaver, 2014). 4
Only a limited number of recent works represent a genuine anthropology as data science approach. These include Hsu's (2014) spirited plea for ‘unleashing’ quantitative data ‘from the disciplinary compartmentalization of science’ to ‘discover new interpretative and speculative territories’ and Brooker's no less enthusiastic suggestion to ‘incorporate bots into the sociological mold to harness them for sociological service’ (2019: 2). After all, as Brooker also elaborates upon in his commentary to this special theme issue, algorithms are imbued with ‘the potential to perform a wider range of sociologically [and by implication anthropologically] relevant functions’ (2019: 1234). Munk et al.'s contribution to this issue, where three authors, in their own words, playfully seek to build ‘an ethnographic algorithm capable of passing for a native’, is a case in point (indeed, during the workshop in Copenhagen, Munk and colleagues brought with them a physical prototype of an ‘anthropological machine’!). Indeed, deliberate playfulness is a characteristic feature of much research at the intersection between data science and qualitative social science, including my own (e.g. Blok et al., 2017). 5 Yet, one might (self)critically note, the ultimate goal of a computationally enhanced critical anthropology must be to transcend the binary between the ‘playful’ and the ‘earnest’ by harnessing AI methods to address fundamental social scientific questions.
Doja and colleagues’ idiosyncratic combination of ‘fuzzy logic, probabilities, machine learning, and maps manipulation’ (as they put it in this issue) is probably the most earnest attempt yet to realize Levi-Strauss’ old computational dream. Unlike word embedding models that train neural networks to ‘provide insight into the relationship between individual words and the overall conceptual structure undergirding a text’ (Kozlowski et al., 2019), Doja et al. directly seek to simulate the generative structural logic allegedly undergirding the spatio-temporal unfolding of all myths (Bruchansky, 2019). But in more overarching terms, as a ‘Turing test of Amerindian mythology’ (Santucci et al., 2020), their neo-structuralism fundamentally resembles the ‘purely relational approach to modeling’ (Kozlowski et al., 2019) that has recently gained traction among computational cultural sociologists (e.g. Evans and Aceves, 2016). Certainly, there is a sense in which unsupervised machine learning is ‘a kind of folk structuralism – that if we number-crunch culture in a sophisticated enough way, the “latent” … mathematical structures of our (secretly computational) minds will be uncovered’ (Castelle, 2018; see also Pedersen and Nielsen, 2018; Santucci et al., 2020).
Towards a machinic anthropology
Other scholars have theorized and typologized the relation between anthropology/ethnography and data science. Let us now consider these to nail down more precisely what is specific about the machine anthropology agenda. Munk and Winthereik (2022) outline their vision for a ‘computational ethnography’, whose aspiration is to both ‘appropriat[e] digital media as its empirical material and us[e] computational techniques for gathering and analyzing this material’. This definition of computational ethnography calls to mind well-known understandings of digital methods by Marres (2017) and Rogers (2019). While not a problem as such, it does raise the question of why a new term (‘computational ethnography’) is needed, just as it leaves unresolved how to conceive of social data science studies that use digital devices as sources of quantitative data in their own right, and less as objects of critical meta-analysis (e.g. Anderson et al., 2009; Lohse et al., 2022). Indeed, this is probably the main difference between digital methods and machine anthropology. Whereas the former predominantly uses preexisting, often commercial digital software and visualization tools for its critical analysis of the affordances of digital infrastructures and natively digitalized data, the object and aim of machine anthropology is not restricted to digital phenomena. Instead, the focus here is the digital as a means, using computers for the processing and analysis of data. 6 Here, unstructured data afforded by digital devices and platforms (e.g. social media posts, or biometric or geolocation data logged in wearable sensors) are used as a source of methodological development, including the programming of specific-purpose algorithms that are built from scratch to collect (e.g. scrape), pre-process (e.g. clean), process (e.g. mine) and analyse (e.g. model) this data.
Munk (2019) and Paff (2022) both capture this data sciency aspiration. Thus the variant of quali-quant research that Munk calls ‘algorithmic sensemaking’ does ‘not involve any conventionally qualitative work but rather solicits sensemaking to quantitative community detection and pattern recognition’ (2019: 165). An apt example is Munk et al.'s contribution to this issue, where the incongruity between big and thick data is used to test the limits of both. Turning now to Paff, his recent call for an ‘anthropology by data science’ closely resembles what I in the previous section called the ‘anthropology as data science’ approach. Suggesting that ‘anthropology and data science do not possess fundamental theoretical or philosophical differences’, Paff calls for the incorporation of machine-learning techniques in ‘ethnographies and other anthropological research’ (2022). This claim – that there is a correspondence between anthropology and data science in terms of methodology, epistemology and metaphysics – echoes the call for the 2020 ‘Machine Anthropology’ workshop. 7 Yet, as I am going to suggest now, what is called for is a more nuanced rendering of the similarities and the differences between qualitative social science, quantitative social science and data science, which delineates and reflects on the strengths as well as the weaknesses of each of the three approaches.
To be sure, the data-driven attitude of data scientists does come much closer to the explorative aspiration of anthropology and other grounded-theory-informed approaches than the theory-driven hypothesis-testing characteristic of mainstream quantitative social science. As Sapienza and Lehmann put it in this issue, as ‘data scientists … we are not hypothesis-driven… We are looking for questions that can be convincingly answered by our dataset’ (see also Milner, 2018). Here, the old anthropological ideal of ‘taking people seriously’ (Malinowski, 1961) via bottom-up and radically empiricist research reappears among computer scientists doing ‘AI in the wild’ (Dyson, 2019). Yet, as shown in publications in this journal (e.g. Kitchin, 2014) and others (e.g. Radford and Joseph, 2020; Shmueli, 2010), this does not mean the ‘end of theory’ (Anderson, 2008). Social science theory, after all, ‘is useful not only in generating hypotheses, but also in selecting an appropriate way of measuring constructs with big data’ (Lazer et al., 2021). That is to say, the role of theory in computational social science/social data science research has to do with the all-important issues of construct (Cronbach and Meehl, 1955) and measurement (Adcock and Collier, 2001) validity.
Carlsen and Ralund's contribution to this issue is a case in point. Via a critical discussion of Nelson's ‘computational grounded theory’ (2021), they present a detailed protocol for a state-of-the-art quali-quantitative analysis of large-scale text data. As its name indicates, the computer assisted learning and measurement (CALM) protocol leverages the advantages of unsupervised machine-learning techniques for improving especially the more explorative phases of computational text analysis, while at the same time systematically deploying qualitative methods and measures to mitigate the problems with the topic modeling method, which has become widely used both within and outside the academy over the last 15 years or so (DiMaggio et al., 2013; Mohr and Bogdanov, 2013). Indeed, because it allows for the iterative integration of qualitative insights into data work, model building and the final analysis, CALM offers perhaps the most concerted attempt made so far to put together a directly applicable framework for a so-called ‘abductive logic of inquiry’ (Brandt and Timmermans, 2021) in the study of digital phenomena and/or data. 8
But how then to conceive of quali-quantitative analyses, like Blok et al.'s political ‘micro-history’ (this issue), where new concepts are formulated via an iterative oscillation between data, model and theory to describe and theorize a particular state of affairs? While such studies evidently qualify as ‘grounded theory’ in both Nelson's and ‘pre-digital’ senses of this term, their scope and aspiration seem to differ slightly from the methods of abduction according to recent sociological accounts of this concept (Brandt and Timmermans, 2021). Here, the bottom-up discovery and development of new concepts is done, not to contribute to existing big theory (as Brandt and Timmermans would have it), but to take the specific phenomenon under investigation theoretically seriously without necessarily having any impetus towards generalization (cf. Holbraad and Pedersen, 2017). We are here reminded of Levi-Martin's notion of ‘mathematical sociology’, which presents a pragmatist alternative to established sampling strategies and representativity within quantitative social science research. As Levi-Martin himself puts it, ‘[we] want be able to mathematize this group, with its number of isolated people right here, right now. If that doesn’t do justice to the population of all possible sets of groups, then so be it. Mathematical sociology isn’t about inference in this sense of sampling, and we shouldn’t let statisticians come in and smash [our] more delicate constructions’ (Levi-Martin, 2020: 27). Which begs the question: Perhaps anthropology could be mathematical too? Not in the naturalist sense of cognitive anthropologists (e.g. Sperber, 1985), but in the pragmatist sense of Levi-Martin and Chicago colleagues (e.g. Abbott, 2004).
This would not only allow for a contemporary version of Levi-Strauss’ old computational structuralist vision; it might also open up for a fusion between the radically empiricist commitment of much contemporary anthropology and sociology's continuing commitment to big theory building.
Certainly, there are several low-hanging ethnographic fruits. Consider ethnographic fieldnotes, which still tend to be collected and processed in predominantly manual ways, even with the availability of software like NVivo. 9 Yet, their unstructured nature makes them particularly compatible with unsupervised machine-learning methods (Nelson, 2020: 7). What is more, as Albris et al. point out in this theme issue, once qualitative researchers begin analyzing fieldnotes via computational methods, they can suddenly ‘ask different kinds of questions, such as: Does the style of fieldnotes depend on where the fieldwork took place? Does the length of a fieldnote depend on the time span in which it took place? Does group-based fieldwork impact … fieldnotes?’ Still, barring a few recent exceptions (Abramson et al., 2018; Astrupgaard et al., 2022; Marathe and Toyama, 2018), NLP methods have not been used systematically on ethnographic fieldnotes.
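To give a rough sense of how questions of the kind Albris et al. raise might be posed computationally, the following is a minimal sketch rather than any contributor's actual method. It assumes, purely for illustration, that each fieldnote has been digitized as a record of field site, hours of observation and note text; the records, field names and helper functions are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical, invented fieldnote records: (site, hours observed, note text).
FIELDNOTES = [
    ("cafe", 2.0, "Two regulars argued about the new bus route while the owner kept quiet."),
    ("cafe", 1.0, "Quiet morning with few customers."),
    ("market", 4.0, "Stallholders compared prices and joked about the inspectors all afternoon."),
]


def word_count(text):
    """Length of a note, measured crudely as whitespace-separated words."""
    return len(text.split())


def mean_length_by_site(notes):
    """Average note length per field site (does length depend on place?)."""
    lengths = defaultdict(list)
    for site, _hours, text in notes:
        lengths[site].append(word_count(text))
    return {site: mean(counts) for site, counts in lengths.items()}


def words_per_hour(notes):
    """Words written per hour observed (does length depend on time span?)."""
    return [(site, word_count(text) / hours) for site, hours, text in notes]


if __name__ == "__main__":
    print(mean_length_by_site(FIELDNOTES))
    print(words_per_hour(FIELDNOTES))
```

Word counts are of course a very thin proxy for the ‘style’ of a fieldnote; the point of the sketch is only that, once notes exist as structured records, such comparisons become cheap to run and to revise.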
But the potential contribution of machine anthropology goes beyond the automatization and augmentation of fieldnote collection, processing and analysis. As Glavind and Bjerre-Nielsen suggest (this issue), a very significant (but so far largely ignored) interdisciplinary advantage of ethnographic data is the fact that they can be used to ‘validate [quantitative] data by establishing a ground truth … to examine whether the data measures what the researcher think it measures’ (see also Grigoropoulou and Small, 2022; Marda and Narayan, 2021). Indeed, the impact of qualitative research would probably increase significantly if its practitioners tapped into data science narratives of ‘ground truth’ as ‘information gathered via direct observation …[used as] …the standard with which to compare the performance of a model’ (Corwin and Erickson-Davis, 2020).
But there is yet one further implication of the machine anthropological project. At issue is whether ethnographic data should, in fact, be deemed ‘small’ in the first place. As Glavind and Bjerre-Nielsen go on to argue, ethnographic data has ‘high depth (“high M”) since, for each individual or setting, the ethnographer can potentially list hundreds, possibly thousands, of details’. Moreover, they typically have ‘a temporal dimension … (“high T”) … from observing individuals in a specific setting for a couple of hours [or] following the same individuals across settings for months or years’. In other words, ethnographic data is ‘“big” in the same way that “big data” is big, even though N is small’ (see also Pedersen, 2019). This has huge ramifications for social science. If field notes and other qualitative data are ‘big’ on this alternative measure of size, it not only underscores the earlier mentioned need to introduce computational methods in their collection, processing and analysis, but it also raises the more fundamental issue of whether such data should always be conceived as qualitative in the first place. Perhaps what is needed is more quantitative ethnographic data (standardized records of systematic participation and/or observation), and less reflexive, poetic and intersubjective ethnography – in short, less qualitative fundamentalism and epistemological exceptionalism?
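The argument about N, M and T can be made concrete with a small back-of-the-envelope calculation. The sketch below is an illustration only: it assumes that the ‘size’ of a dataset can be crudely approximated as the number of recorded values (units times details per unit times observation points), and all the figures are invented for the example rather than taken from Glavind and Bjerre-Nielsen.

```python
def data_cells(n_units, m_details, t_timepoints):
    """Crude size measure: recorded values = N units x M details x T time points."""
    return n_units * m_details * t_timepoints


# Hypothetical one-off survey: many respondents, few variables, one wave.
survey = data_cells(n_units=10_000, m_details=20, t_timepoints=1)

# Hypothetical long-term fieldwork: few interlocutors, many details, many visits.
fieldwork = data_cells(n_units=15, m_details=500, t_timepoints=200)

if __name__ == "__main__":
    print(f"survey cells:    {survey:,}")
    print(f"fieldwork cells: {fieldwork:,}")
```

On these invented numbers the fieldwork yields more recorded values than the survey despite a far smaller N, which is the sense in which ethnographic data may be ‘big’ on this alternative measure.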
For such a distinctly anthropological and distinctly quantitative research agenda to succeed, scholars and students subscribing to an anthropological identity will have to do away with some of their most deeply held beliefs. In particular, they will need to make a separation between ethnographic data and anthropological analysis. Indeed, this might be the main difference between machine anthropology and the cognate projects discussed above, where ethnography and anthropology tend to be used as synonyms with little semantic difference. Conversely, throughout this introduction, I have sought to operate with a principled and systematic distinction between ethnography and anthropology (Agar, 2006; Ingold, 2014). Indeed, it seems to me, it is precisely in this separation between anthropology as a distinct analytical method and mode of theorizing on the one hand, and ethnography as a certain empirical method and data collecting and processing form on the other, that the potentials and promises of a future machine anthropology can be located.
We can, then, think of machine anthropology as a strong version of computational anthropology in Munk and Winthereik's sense (2022). Leveraging the technical advancements brought about by the data science revolution and combining these methodological innovations with an epistemological re-orientation towards a mathematical sociology, the machine anthropology project aspires to expand the very scope of anthropological inquiry by embracing quantitative thinking and computational methods. For anthropology to embrace its machinic potential, it will require a widening of the discipline's data, methods and identity. In addition to data obtained through ethnographic fieldwork (be they ‘thick’ qualitative or ‘big’ quantitative), the mathematical machine anthropologist must also be open towards other registers of data, ranging from large corpora of scraped tweets (Breslin et al., 2022) to experimentally sampled and collected as well as statistically processed and modelled sensor data (Lohse et al., 2022). For such a more-than-qualitative transformation and extension of anthropology to happen, it will involve questioning some of the most deeply held convictions of scholars and students from this discipline, including a bracketing of ethnography as anthropology's primary – and to some, only – method. Only when, or rather if, this happens, will computers cease to be merely ‘good to think [anthropology] with’ – whether in the modernist imaginary espoused by Levi-Strauss or the postmodernist stereotype of an evil, quantitative other popular in certain quarters of the academy – and become vehicles for the emergence of distinctly anthropological forms of machine learning and AI.
Footnotes
Acknowledgments
This work was made possible by funding from the DISTRACT Advanced Grant project (grant 834540 from the European Research Council). Apart from the contribution from Blok et al., all articles and commentaries are the product of the Machine Anthropology Workshop, which was held at the Copenhagen Center for Social Data Science (SODAS) on the 27th and 28th of January 2020 to inaugurate the DISTRACT project. In addition to the contributions to the present theme issue, the workshop also included presentations by Krista Lagus and Minna Ruckenstein, Ajda Pretnar and Dan Podjed, Marie Cury and Sebastian Barfort, Daniel Souleles and Nicholas Skar-Gislinge, as well as by Andreas Refsgaard. The author would like to thank all these people, as well as Andreas Roebstorff, Eva Iris Otto, Sophie Smitt Sindrup Grønning, and Emilie Munch Gregersen and everyone in the DISTRACT team (including Thyge Enggaard), for their invaluable academic and/or administrative assistance in making this workshop a successful inauguration of the ERC project. The author is also indebted to Jennifer Gabrys and Matthew Zook for their perceptive comments on a draft of this introduction, and to Jennifer for her advice, assistance (and patience!) in co-editing this theme issue with me. A special thanks also to John Levi-Martin for stimulating discussions on the topic of machine anthropology.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the ERC (grant number 834540).
