Abstract
In light of recent technological innovations and discourses around data and algorithmic analytics, scholars of many stripes are attempting to develop critical agendas and responses to these developments (boyd and Crawford, 2012). In this mutual interview, three scholars discuss the stakes, ideas, responsibilities, and possibilities of critical data studies. The resulting dialog explores what critical approaches to these topics might look like in theory and practice, and seeks to make such approaches available to a broader audience.
Introduction
It was on the tip of everyone’s tongue. Tyler and I just gave it a name. (Fight Club, 1999)
Critical Data Studies calls attention to subject formation within these data regimes, for a critical examination of where the interpellation of the individual emerges in algorithmic culture (Striphas, 2015) and, through that, where the cracks and seams, the spaces for resistance and alternatives, might be found. When you append “critical” to a field of study, you run the risk both of offending other researchers, who rightly point out that all research is broadly critical, and of bifurcating those who use critical theory from those who engage in rigorous empirical research. Recently, Kate Crawford (2015) observed that the meaning of Critical Data Studies is as political as the data it engages. To extend that, it must remain contestable in order to contest the creation, commodification, analysis and application of data. Ultimately, Critical Data Studies must make space for the recursive dialog between the deeply theoretical and the robustly empirical and, in so doing, avoid the hubris of pseudopositivism and technological determinism, in favor of the nuanced and contingent.
This is a dialog between Dalton, Thatcher and Taylor on the spatial nature of data with respect to Critical Data Studies. It highlights the historical variability of the processes of data production and accumulation and how this, in turn, has resulted in the uneven development of data. We ask both what is missing and what must be brought to the analysis of data in order to respond to the existence of particular dataspheres governed by the kinds of technology available in different locations. In form, Dalton and Thatcher ask the first question and Taylor responds before asking the second, alternating from there. As such, this dialog has many entry and exit points and is not meant as a definitive statement of what Critical Data Studies is or who can speak for it. Rather, we choose to live in the unresolved tensions between researcher and subject, technology and society, space and time (Haraway, 1991), and we encourage our readers to do the same.
Dalton and Thatcher: Andrejevic (2014) suggests that the “Big Data divide” is more than simply the separation between individuals and their data, but also the separation between the ability to analyze and leverage said data. How do the processes and circumstances that produce Big Data simultaneously produce its geographic limits and divides and how do these geographic divides factor into the “Big Data divide” Andrejevic posits?
Taylor: The “Big Data divide” in some ways parallels the digital divide, since location and income tend to determine access to the kinds of digital technologies that contribute to Big Data. However, the divide Andrejevic refers to also reflects differing understandings of the creation, collection, use and interpretation of those data. The “Big Data divide” also implies a divide between the person generating the data and the reader. As researchers, we can start to perceive this divide more clearly if we think about Big Data as human subjects data, since it mainly (though not universally) consists of data about, and produced by, people. Even environmental sensor data is often collected because of what it can tell us about human movement, activities or behavior.
This human-centered perspective on Big Data has implications for the way we can address the digital as researchers. For example, someone using a mobile phone to access the internet in London produces many kinds of data (e.g., from apps, handset usage, internet browsing and social media), but they are all governed by a set of rules and standards determined by the place from which they are emitted. They will be collected in certain databases and not others; edited, shared and used in certain ways and not others. They reflect a certain infrastructure: they will be fairly granular in terms of location and other variables which make it possible to gain a multifaceted perspective on that user. There will also be supplementary data available to situate the subject in historical, geographical and political context. It is those layers and rules that make London a particular datasphere different from other dataspheres.
Despite this, research often treats data production as a flat, homogeneous process that is not influenced by place—while also hailing Big Data as unprecedentedly granular and rich. As a contrast, we can take the example of someone using a mobile phone in Mauritania. They are probably on a comparatively early-generation phone, connected to one of only a few antennae in their area, and they probably do not have a signal that allows them to use much, if any, data. There will also be less supplementary data out there about them and their location. All this makes it hard to draw the level of conclusions about them that we would be able to about someone in London (or Buenos Aires or Johannesburg). Similarly, a satellite image of that person’s location will offer highly detailed data, but our ability to interpret it will be vastly diminished by the lack of supplementary data that allows us to read the data accurately. A farm, a settlement, a school, a hospital—all these things look different in different places, as does behavior when observed through digital data. Recent research projects using remote analysis of geocoded mobile data from lower income countries (Taylor and Richter, 2015) suggest that it is difficult for analysts working remotely to read Big Data clearly without knowledge of its context, or with little supplementary data to provide ground truth (Pickles, 1995). Moreover, as the data analyst, it does not particularly help me if I move my machine to Mauritania—I will still not be able to read the data clearly because I do not have as rich a datasphere to work with. Important chunks of the Mauritanian datasphere are likely to exist in analogue form, and bits may also be located in local memory and unwritten knowledge.
You could call this a problem of missing data. Andrea Rossi, a development researcher who works on methodologies for surveying the hidden, marginal or excluded, has said something useful on this: that “non-response is also a response.” For example, the work of Mark Graham (2011) shows that there are more Wikipedia users in the Netherlands than in the whole of Africa. Research could do a better job of addressing these blanks as non-responses, and taking absence into account. Since the limits of the reader are also the limits of the data, we should perhaps be more suspicious of the idea that digital data can reveal phenomena and meaning that are otherwise invisible, such as the contours of poverty in developing countries, or the way relationships are built through networks. Perhaps they are visible, but not from an office in New York or Amsterdam.
Our growing faith in “big” is risky if it makes us skate over this divide in understanding. Instead, we should be suspicious of the idea that big is comprehensive, and that we should fit our questions to the data available. The major sources of Big Data commonly used in many social scientific disciplines, for example, have very long tails in terms of their representativeness, to the point where we may find they can only answer questions about a relatively specific population. Twitter, for example, has 316 million users (Twitter, 2015), but most are in high-income countries. Then there is a long tail comprising, for example, the users in Africa, around 7% (Statista, 2015), and a similar number of non-humans (OurSocialTimes, 2015). Belonging to this long tail can cause invisibility: for example, Kenyan Wikipedia contributors had an article repeatedly deleted by Wikipedia’s gatekeepers who had not heard of its subject (Zook et al., 2013: 229).
For these reasons, the questions that can be answered by the digital change radically when they move outside a small core of high-income countries, and researchers should be suspicious of conclusions that use the word “we” about the digital. Most people in the world are not digital in similar ways, or may be digital in ways that do not comply with the “we” of current analyses of digital data and power. This means that researchers could be more cautious about the “we” that is used so often. It may only mean “those I recognize from my side of the Big Data divide.”
Taylor: How does the current discourse about Big Data relate to previous “turns” in research and policy? What historical parallels might be informative with regard to the kind of scientism and epistemological determinism that surround Big Data at the moment?
Dalton and Thatcher: It has been noted elsewhere that “Big Data” advocates present it as perpetually new, ahistorical and revolutionary, and moreover that this presentation is hardly accidental and serves industry narratives of disruption. Separated from the past, these technologies and actors are unbeholden to the problems, contradictions, and limits that afflicted older forms of knowledge production (Leszczynski, 2014). The project of situating Big Data in time, just as the first question situates it in space, is thus an inherently critical project. Context makes it possible to ask: what is really new and different about Big Data and spatial Big Data? What parts are continuations and developments of pre-existing processes? Who and what are the drivers of Big Data and why do they do it?
One undeniable historical precondition for spatial Big Data is geography’s quantitative revolution in the mid-twentieth century, particularly as it involved social physics. Non-geographical researchers of that period theorized that social processes resembled physical laws, a social physics, such that it was possible to model them with unprecedented complexity. Meanwhile, geographers quantified not just location, but spatial connections and relationships. In the early 1960s, geographers used mainframe computers to combine social physics and quantitative geography to perform geographical algorithmic analyses similar to today’s Big Data analyses and geographic information systems (GISs), albeit on a simpler computational level (Barnes and Wilson, 2014; Chrisman, 2006).
Geographers utilizing contemporary quantitative technologies also catalyzed geodemographic research. Geodemographic classification attempts to identify socially homogeneous areas or neighborhoods that match a pre-determined list of social categories such as “thriving grays” or “hard-pressed families.” The new computational methods and postal code systems of the 1950s–1960s facilitated a watershed in geodemographic classification, enabling previously impractical amounts of data and socioeconomic specificity. Originally developed for public service provisioning, geodemographic analyses were quickly adopted by the marketing industry, which continues to use such methods on increasingly individualized scales (Dalton and Thatcher, 2015; Harris et al., 2005).
The height of the Cold War was also a watershed period for the use of computational geographical methods by government defense and intelligence agencies. Period innovations include early military GISs, reconnaissance satellites, geodetic models of the Earth for intercontinental ballistic missile targeting and the global positioning system (GPS) (Central Intelligence Agency, 1980; Cloud, 2002; Dalton, 2013). All four have military and civilian applications today. For example, smartphones can show a user’s location using GPS on geodetically correct satellite photographs.
History also holds useful critical approaches for Critical Data Studies. Up until the early 1990s, many cartography and GIS practitioners were unwilling to evaluate the social consequences of their technology and work. Critical geographers, including Harley (2001), Wood (2010), Pickles (1995) and Elwood and Ghose (2001), did critical work on how maps and geographic information are inherently part of political, cultural programs. Perhaps the most important lesson is that of situated knowledges: acknowledging the circumstances of production and positionality of those creating that knowledge (Crampton, 2010; Haraway, 1991). How do Big Data researchers situate themselves vis-a-vis those who are researched, the data about them, and analytical mechanisms? This work also productively highlights counter-knowledges. Participatory GIS and counter-maps, such as some indigenous maps and Bunge’s maps of racism (Bunge, 2011; Sieber, 2004), have a long history of contesting the status quo. If Critical Data Studies is to be more than a voice of skeptics, we must be open to and ideally develop alternative knowledges that reflect and build on our criticisms.
Dalton and Thatcher: We would like to return to the “we” and “us” of data that you mention earlier. We have argued that in addition to the limited, always partial, and always privileged identities of those who contribute to Big Data, there is also an epistemological and ontological leap between the individual creating the data and the representation of that individual by the data. What is happening and what are the consequences of approaching data not as flat or lifeless, but as a wild or natural other? Can we explicate the relationship between Big Data and othering?
Taylor: As I wrote earlier, there are many digital we’s, and the more that assumptions of unity can be broken down, the more useable our research becomes. For instance, for lawyers dealing with Big Data, “we” tends to mean the European or US citizens (though seldom both) that data protection, commerce and IP law can see and the courts can reach. For many (but by no means all) media and communications researchers, “we” often focuses on a particular online elite. Human geographers tend to be fairly comfortable addressing the non-uniformity of the digital research subject, and in fact, the new school of critical internet geographies focuses on this explicitly. Nevertheless, even for researchers sensitized to that non-uniformity, the terminologies of Big Data collection and use—user consent, volunteered geographic information and the benefits of opening data—create the idea of a participatory datasphere full of consenting, aware data subjects. They also create the idea that the “we” of those who emit data is a statistically representative “we” that is truthful about its location and movements—assumptions that are flimsy at best (Samarajiva, 2014; Taylor, 2015). In reality, the lumpiness and gaps in the digital world highlight the discursive nature of that “we.”
One way to address this is to distinguish those who contribute to Big Data from those who are represented in it—for example, how can human rights researchers distinguish between social media users in remote areas and those picked up by satellite, drone or other sensor devices in those places, and what can they know about the overlap? These become important questions when the representativeness of such data is highly politicized.
This epistemological and ontological gray area between those who create and those who are represented by data has been picked up by surveillance studies in the notion of the “data double” (Haggerty and Ericson, 2000), the abstracted representations of people created for purposes of intervention. The data double highlights both the negative space between an individual and their digital representation, and how the position and intention of the analyst influences how the double is constructed, what is included or left out and what kinds of action it is shaped to facilitate. Big Data exaggerates this otherness because its origins are commercial and it necessitates new tools and approaches, which in turn can be better monetized if the data is also opaque in terms of its meaning.
This opacity is a problem as old as social science: it demands finding the qualitative within quantitative data. Statisticians and GIS researchers have always used analytical tools that create a distance from the lumpy reality of those generating the data. “Big” can be seen as just a new iteration of this problem. By celebrating size as something new rather than an extension of an existing continuum, we give it an otherness and apply the terminology of natural resource extraction (Mann, forthcoming; Puschmann and Burgess, 2014) instead of that of social research. That terminology of otherness is risky because it promises to absolve us from the responsibility of sifting data with our eyes and our common sense. It is easy to get too comfortable at one’s desk and no longer do the hard work of going and asking people what their data means. This is particularly true with regard to data originating from places that are difficult to research, or groups that are marginalized or difficult to find. There is a risk that the myth of Big Data’s comprehensiveness will result in less field research, and potentially less accurate and contextualized knowledge.
This is a long-recognized problem in the development sector, where inadequate fieldwork is termed “Land-Rover research”: consultants make very short visits to harder-to-reach places, often in an air-conditioned vehicle, then retreat to an air-conditioned hotel and write up a report on local needs and experiences. Much policy research on lower income places—probably most policy research—is set up that way. But Big Data makes even the Land-Rover and the air-conditioned hotel unnecessary. It is unsurprising, then, that Big Data in the development sector is being hailed as revolutionary (United Nations, 2014), given how much effort it takes to do the kind of research that can take account of context.
The othering of Big Data in the development sphere is also a new instance of the core-periphery dynamic described in world systems theory (Wallerstein, 1992). Just as raw materials have been produced at the global periphery and sent to the core industrial nations to be turned into usable products, Big Data is the newest raw material, and knowledge the usable product (Mann, forthcoming). Mann argues that it is important to capture and interpret data—and its value—at the global periphery. This can work in some cases—she cites a transport collective in South Africa—but it works less well in the case of multinational telecom or social media firms. In fact, the otherness of data may be correlated with its perceived value. If it is collected and processed at a great distance, it becomes both easier to mythologize and harder to verify. So the mythology and othering of Big Data relate to its commercial value: they make Big Data seem like the only way to know anything about remote populations, but they also distract from its imperfections and qualitative nature, and the risks it poses as a basis for decision making.
Taylor: How should we address the fact that Big Data is largely a corporate phenomenon, and whose geographies do we reflect when we work with corporate-generated and corporate-analyzed data?
Dalton and Thatcher: In a single sentence, the answer to this question might be “carefully and capitalist modernity’s”; but such brevity obscures as much as it reveals. In truth, this is one of the core questions that drove our work towards Critical Data Studies. There are many ways to conceptualize the largely corporate nature of “Big Data.” One is sheer volume: mobile device use is estimated to produce 5.2 petabytes of new data a day (IBM, 2013), or roughly the entire yearly output of the Large Hadron Collider each week. But a reliance on volume to justify import plays into the mythologies of “Big Data” wherein larger is always clearer, more comprehensive, and generally better, furthering a seductive pseudopositivist research orientation in which “raw” data becomes epistemic reality (Wyly, 2013).
A more nuanced take recognizes that sweeping EULAs render the vast majority of Big Data the property of corporate entities distinct from the individuals said data purportedly represents (Thatcher et al., forthcoming). The gulf between individuals and their data doubles marks a twofold epistemological leap. On the one hand, private regimes of data and analysis seek to remake the world in the image of their own algorithms, less interpreting than “actively fram[ing] and produc[ing]” (Kitchin et al., 2015: 7). On the other, researchers are limited to the “data fumes” given off by these corporations; at an extreme, ceding the limits of research to what is given off from black-boxed databanks through corporate APIs (Thatcher, 2014).
Figure. Minard CJ (1865) “Carte figurative et approximative des quantités de vin français exportés par mer en 1864,” lithograph (835 × 547). Minard’s map of French wine exports for 1864. Public domain: https://commons.wikimedia.org/wiki/File:Minard%E2%80%99s_map_of_French_wine_exports_for_1864.jpg.
Code and data mediate, saturate and sustain global capital (Graham, 2005). When we map Big Data we map the contours of capital, a mapping intrinsically limited by the uneven contours of data as they play out across space. This occurs both on the scale of commodity and information flows and that of everyday lived experience. Of course, this is nothing new; in addition to his more famous Napoleonic map, Minard created a variety of maps focusing on commodity and industrial flows. These maps reflect the reach of empire just as maps of Twitter data today reflect the uneven reach of data. The uneven development of data regimes plays out not only across variegated space and time but also through the very imperatives which drive data creation, capture and control.
In parallel, data silences or gaps result from the kinds of data deemed worth creating and storing. Simply put, corporate data is meant to create a profit, its veracity secondary to its economic value. In practice, this means that the everyday scale of data is the scale of the commodified data point and the individual person from whence it springs.
Dalton and Thatcher: Following on the last few questions and ideas around the uneven development of data, what are some concrete examples of how data is produced and operationalized by corporate or other institutional players across space? How is this similar to and distinct from historical processes through which statistics and data have been leveraged by the state and private actors to make claims upon and reshape society?
Taylor: Due to multinationals’ role as mobile operators and internet providers, digital data are increasingly collected, analyzed and stored across national borders and are also sometimes shared for research on developing countries. For instance, the French firm Orange Telecom has run two data-for-development challenges, sharing West African subscriber data with researchers worldwide in order to propose interventions based on their analysis (Taylor, 2015). There is also the UN initiative Global Pulse, one of whose objectives is to access and analyze social media data from developing countries to “nowcast” economic shocks and food security problems. These pro-bono data flows occur as a result of international agreements between non-state actors, rather than between governmental authorities and technology providers.
There are also cases, however, where remote data analytics are locally embedded. The Billion Prices Project (Cavallo, 2013), an Argentine inflation index, was created at MIT when Argentine economists were threatened for challenging national inflation statistics (Wall Street Journal, 2011). Alberto Cavallo, an Argentine economics student, scraped online price data remotely to form an accurate index which had real political effects in Argentina. In this case, remote data analysis was, paradoxically, the most relevant methodology for the local context (Taylor and Schroeder, 2014)—and to return to the question of “we,” in this case the “we” of the research was very clearly the local population.
Historically, the current excitement about Big Data seems to parallel the excitement that surrounded the emerging field of statistics in the late eighteenth and early nineteenth centuries. Social physics (today being marketed as “a new science” (Pentland, 2014)) dates back to at least the early 1800s and possibly earlier (Sinclair’s work on Scottish population statistics in 1791, quoted by Porter, 1986: 24). Statistics was first a moral and political science, then became an exact science during the first part of the nineteenth century. As with Big Data analytics today, the transition from moral to exact science involved flattening the data subject into an anonymous and average person (Porter, 1986: 25). Practices of linking and merging, key to today’s Big Data epistemologies, recall the work of Herman Hollerith, a statistician and engineer who in the 1890s invented a tabulating machine to perform data analytics on the US census. He wrote of it: “Without the slightest delay such an electrical counting machine will read or test before tabulating whether the given person was white, native born, native father, native mother, male, blacksmith, and resident of New York City. If it agree in all these particulars, it would tabulate the person under from six to ten different items …” (Hollerith, 1894)
A second historical parallel with Big Data is the claim made by early statisticians about the end of causality, as occurred again in 2008 (Anderson, 2008)—in 1838 the London Statistical Society announced that the field of statistics was epistemologically different from political economy because “it does not discuss causes” (JRSS, 1838 in Porter, 1986: 35). Even at that time, this was disproved by innovative uses of data: John Snow’s map of London’s 1854 cholera outbreak is a perfect example of spatial data analytics being used for causal analysis, which eventually gave rise to the field of modern epidemiology.
The epistemological determinism (Cherlet, 2014) we are seeing today—the belief that some types of data are fundamentally better and “truthier” (Colbert, 2005) than others—has a distinct genealogy. Over the course of the last two centuries, we can chart how the pendulum of social-physics-style thinking has swung from an interest in causality and general rules governing human behavior to an interest in large-scale, granular description of that behavior. Similarly, there has been an oscillation between an interest in nowcasting—seeing the present in more detail through data—and an interest in using large-scale data for prediction. In the current field of Big Data analytics, we can identify a return to nowcasting, as seen in recent calls for the donation of data to help track Ebola (Wesolowski et al., 2014), and work by multilaterals such as Global Pulse on the real-time identification of economic shocks (Global Pulse, 2014).
Taylor: Beyond Big Data and small data, how should researchers address missing data? Data may not be formally defined as missing from a given dataset, but important gaps are left by those who are not emitting data, or are doing so in ways we cannot read—or by those with minimal or no involvement in the kinds of markets that Shearmur (2015) refers to as creating today’s Big Data.
Dalton and Thatcher: Cartography and GIS have long been concerned with the nature of missing data. In particular, critical cartography, critical GIS and counter-mapping emphasize the ethics of geographic information and subaltern peoples. In some cases, such programs can be a process of empowering otherwise silenced groups; in other cases “missing data” is better kept obscure or local. For example, some indigenous mapping allows native peoples to map land claims as a means to establish legal territorial legitimacy. Such efforts in Canada helped lead to the establishment of Nunavut. Other situations are problematic. The US military covertly funded community-led mapping projects of indigenous lands in Mexico as a means of gathering intelligence (Bryan and Wood, 2015). In our own counter-mapping work, we have run into many cases, such as undocumented migrants in danger of deportation, where the ethical choice is to not collect data, to not make a map.
If data researchers are to understand how their work engages with a particular datasphere, it is crucial to consider on what terms people are enrolled and re-enrolled in regimes of data generation and collection. Power asymmetries between data creator, data captor and data analyst play out unevenly across time and space. Accumulation through dispossession plays out unevenly as technology firms and state actors value data differently (Thatcher et al., forthcoming). The quantified self, whether experienced as liberatory or as dataveillance (Thatcher, forthcoming), emerges within capitalism’s “corporeal corkscrewing inwards” (Beller, 2012: 8).
What is missing poses as many questions as what is revealed. We know a “teenager in suburban USA will tweet differently from a German professional football team” (Schmidt, 2014: 3), but what social relations produce subjects who do not tweet? How is “volunteered” information different from “collected” data from ambient services (Harvey, 2012)? As researchers, we must simultaneously accept the epistemological limits set by the profit imperatives of much “Big Data,” and ask what data should not be collected or analyzed for ethical reasons.
“What’s missing” (and who) thus remains of paramount importance. Certain populations are over-represented, such as tech industry men (Harkinson, 2014), while others remain opaque to outsiders. Work like Rossi’s on innovative surveys with marginalized groups,1 Graham’s (2011) on Wikipedia and Kwan’s (2015) on “making the invisible visible” maps the contours of these missing dataspheres. Such work is expensive, slow and relies on smaller datasets than Twitter’s daily output. Critical Data Studies calls for ethnographic and discursive work, for the thick description of data and the cultures around it, just as much as it relies on algorithmic analysis. It is not enough to map Big Data; the point is to change it.
This article is part of the special theme on Critical Data Studies. To see a full list of all articles in this special theme, please visit: http://bds.sagepub.com/content/critical-data-studies.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
