Abstract
Despite all the attention to Big Data and the claims that it represents a “paradigm shift” in science, we lack an understanding of which qualities of Big Data may contribute to this revolutionary impact. In this paper, we look beyond the quantitative aspects of Big Data (i.e. lots of data) and examine it from a sociotechnical perspective. We argue that a key factor that distinguishes “Big Data” from “lots of data” lies in changes to the traditional, well-established “control zones” that facilitated clear provenance of scientific data, thereby ensuring data integrity and providing the foundation for credible science. The breakdown of these control zones is a consequence of the manner in which our network technology and culture enable and encourage open, anonymous sharing of information, participation regardless of expertise, and collaboration across geographic, disciplinary, and institutional barriers. We are left with a conundrum: how to reap the benefits of Big Data while re-creating a trust fabric and an accountable chain of responsibility that make credible science possible.
Big Data is not only about being big
The popular and scholarly literature is filled with excitement about Big Data and its purportedly revolutionary implications for science and society.
Our goal in this paper is to pull back from the hype and take a more measured, analytical approach to Big Data, focusing on the question “what are the characteristics of (some) Big Data that manifest a paradigm shift in the fundamental assumptions of science?” We distinguish between Big Data characteristics that have methodological consequences and those that impact epistemological foundations. We characterize the former as important but not paradigm-shifting. In contrast, we argue that a paradigm shift is indeed evident when Big Data impacts epistemological foundations.
Embedded in this argument is the assumption that the characteristics we are looking for are not inherent in every use of big (in size) data. Conversely, data that are not necessarily quantitatively large may have characteristics that are paradigm-shifting when used in certain contexts and by certain communities of use.
Before proceeding any further with an analysis of “big” as a qualifying characteristic of (some) “data”, it is important to establish a definition of “data”, whether big or small. A National Academies report (A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases, 1999) provides a simple and inclusive foundation definition: “data are artifacts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors.” Although this definition is useful, it fails to capture the “relative” nature of data (in contrast to it having an “essential” nature). As Borgman (2011) states: “[d]ata may exist only in the eyes of the holder: the recognition that an observation, artifact, or record constitute data is itself a scholarly act.” This perspective on data is both reflexive (something, e.g. an image, a text, or an Excel worksheet, is data because someone uses it as data in a specific context) and transcendent (it carries across the many disciplines, practices, and epistemologies of science).
This relational/contextual perspective gives us the basis for examining the Big Data phenomenon in a manner that both crosses epistemological boundaries and is contextualized by them. With due recognition of the dangers of making generalizations about “science”, we hope to establish some fundamental aspects of Big Data that are indeed boundary crossing, while remaining shaped by (and shaping) specific disciplinary practices.
Having established this relativistic definition of data, we return to the notion that “Big Data is not only about being big”; that there is some combination of features or dimensions (perhaps among them size) that may have revolutionary effects on science and knowledge production. This multidimensional perspective is evident in many of the popularized, mass-market descriptions of Big Data.
One popular multidimensional definition of Big Data is based on the so-called 3Vs: Volume, Velocity, and Variety (Laney, 2001). Volume is the size factor. Velocity refers to the speed of accumulation, the resulting dynamic nature of the data, and the high-scale processing capacity needed to make it useful and keep it current. Finally, Variety refers to the mixing together, or mashing-up, of heterogeneous data types, models, and schema and the need to resolve these differences in order to make the data useful. Others have enhanced this list with additional “Vs”: Validity, the amount of bias or noise in the data; Veracity, the correctness and accuracy of the data; and Volatility, the persistence and longevity of data (Normandeau, 2013), the first two of which, Validity and Veracity, are of particular interest to the argument of this paper.
Mayer-Schönberger and Cukier, in their best-selling book Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013), offer a looser, more functional definition: Big Data refers to things one can do at a large scale that cannot be done at a smaller one. In their account, this entails working with (nearly) complete populations rather than samples, tolerating messiness in exchange for volume, and privileging correlation over causation.
These two attempts to define Big Data, and many others like them, fail to adequately capture the nuances and contexts of use of Big Data that may make it revolutionary and the driver of a new scientific paradigm. Employing Kuhn’s words, when are Big Data “tradition-shattering complements to the tradition-bound activity of normal science” (Kuhn, 1970)? To answer this question, we need to examine Big Data from a sociotechnical perspective (Bijker, 1995; Lamb and Sawyer, 2005). We need to investigate their social, cultural, historical, and technical facets and the interplay and tensions among these facets that collectively establish the impact of Big Data on science and the possible transformation thereof. An analysis of this sort will allow us to distinguish the aspects of Big Data that, no matter how contributory to innovation, may be more evolutionary than revolutionary, from those that are indeed paradigm-shifting. Furthermore, it will help us distinguish between locality—discipline and/or field-specific characteristics of Big Data—and globality—aspects of Big Data that may be paradigm-shifting across the scholarly enterprise.
Lots of data or Big Data?
Because technology is such a basic enabler and component of Big Data practices (i.e. computation including hardware, software, and algorithmic components; high-speed networks; massive storage arrays), it is useful to build our argument on the notions of sustaining and disruptive change: innovations that improve practice within an existing framework versus those that overturn the framework itself.
By leveraging the theoretical frameworks of Kuhn, Rosenbloom, and Christiansen, we argue that a disruption in science (a.k.a. the creation of a new paradigm) is not just methodological, a way of doing (a.k.a. technical), but also must be sociotechnical. It must challenge existing epistemological norms, ways of knowing and framing the fundamental scientific questions of the field; institutional ecologies (Star and Griesemer, 1989), agreements on scope, assumed knowledge, and boundaries of research work; reward structures, paths to tenure and promotion; and communication regimes, mechanisms, and norms for disseminating knowledge. We will use this scaffolding for the remainder of this essay to distinguish between what we will call lots of data and what we will call Big Data.
Our distinguishing between these two terms—lots of data (which entails methodological change and technical innovation) and Big Data (which implies the reevaluation of epistemological foundations)—should not be interpreted as an attempt to segregate data into two disjoint silos, i.e. data set 1 is “lots of data”, in contrast to data set 2 that is genuine “Big Data”. Our intention, rather, is to establish these concepts as continuous dimensions along which any particular instance of data use can be characterized.
Although the primary focus of the remainder of this paper is the Big Data dimension—when, how, and why does data use challenge the epistemological foundations of science—it is useful, for the purpose of contrast, to briefly examine the companion lots of data dimension. This brevity should not be construed as dismissive towards the significance of these technical challenges and the methodological impacts they have. Indeed, there are great challenges here and the scholarly and practical effects of meeting these challenges can be profound, albeit not paradigm-shifting.
Two often-cited instances of data use demonstrate the lots of data dimension. The petabytes of data streaming in from high-energy physics experiments (studied thoroughly by Knorr-Cetina, 1999) or those that are components of the Sloan Digital Sky Survey (Szalay and Gray, 2001) are certainly Big Data in terms of size. But, considered alone, their bigness and the issues associated with them are by and large technical. These communities have historic cultures of data sharing (Ginsparg, 1994; Knorr-Cetina, 1999) and, in fact, their data has always been “big” relative to the quantitative definitions of the day. This is similar to the situation with many domains of science that have a legacy of exploring and manipulating large data sets, where “large” is historically contextualized relative to the technical affordances of the time (Gitelman, 2013).
The massive quantity of data in these two examples clearly introduces issues about new high-capacity storage systems, high-speed networks to easily move them back and forth, and map-reduce algorithms that permit parallel computation over these massive data sets. A recent white paper co-authored by leading data science researchers (Agrawal et al., n.d.) provides a useful list of the cross-cutting challenges that need to be met to respond to these issues: heterogeneity and incompleteness, scale, timeliness or speed, privacy, and human collaboration. All of these are formidable challenges. However, the need for these new methodologies and tools to manipulate, store, and curate these massive data sets does not correspond to a paradigm-shifting disruption of the historically data-focused epistemic culture of the communities of practice that engage with these data.
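To make concrete the kind of purely technical machinery at issue here, the following minimal Python sketch (our own illustration; the toy records and function names are invented) mimics the map-reduce pattern mentioned above: records are mapped to key/value pairs in parallel, grouped by key, and reduced to per-key aggregates. Production systems such as Hadoop or Spark distribute the same three phases across clusters of machines rather than local worker processes.

```python
# Minimal, self-contained sketch of the map-reduce pattern discussed above.
# The data and names are illustrative only.
from collections import defaultdict
from multiprocessing import Pool

records = [
    "galaxy spiral", "galaxy elliptical", "star dwarf",
    "galaxy spiral", "star giant", "star dwarf",
]

def map_phase(record):
    """Map: emit a (key, 1) pair for each token in a record."""
    return [(token, 1) for token in record.split()]

def reduce_phase(key_values):
    """Reduce: sum the counts collected for one key."""
    key, values = key_values
    return key, sum(values)

if __name__ == "__main__":
    # Map step, executed in parallel over the input records.
    with Pool(4) as pool:
        mapped = pool.map(map_phase, records)

    # Shuffle step: group the emitted values by key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    # Reduce step: aggregate each key's values.
    counts = dict(map(reduce_phase, grouped.items()))
    print(counts)  # {'galaxy': 3, 'spiral': 2, 'elliptical': 1, 'star': 3, 'dwarf': 2, 'giant': 1}
```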
A recent paper by Leonelli (2014) in the inaugural issue of this journal explores the same issue in the discipline of biology. Similar to this paper (albeit limited to a single discipline), Leonelli aims “to inform a critique of the supposedly revolutionary power of Big Data science,” likewise defining revolutionary as synonymous with creating a new epistemology and a new set of norms. Similar to our earlier examples in physics and astronomy, she notes that “data-gathering practices in subfields [of the life sciences] have been at the heart of inquiry since the early modern era, and have generated problems ever since.” She then aims the bulk of her critique at Mayer-Schönberger and Cukier's claims that data completeness mitigates data messiness and their championing of correlation over causality, which we will return to later in this paper. She finishes by rejecting the notion that Big Data is exerting a revolutionary effect on the epistemology of biology itself, claiming that “there is a strong continuity with practices of large data collection and assemblage since the early modern period; and the core methods and epistemic problems of biological research, including exploratory experimentation, sampling and the search for causal mechanisms, remain crucial parts of the inquiry in the area of science.” In contrast to epistemic effects on the discipline itself, she acknowledges significant methodological challenges “encountered in developing and applying curatorial standards for data … ” and in the dissemination of that data.
On the other hand, the sensitivity of the evolutionary versus revolutionary impact of big (or indeed any) data to epistemic culture becomes evident in the context of the digital humanities (or, as some call it, computational humanities, with specializations such as computational history). The level of controversy over the “datafication” (Mayer-Schönberger, 2013) of historical and/or literary artifacts (whether at the massive scale of the Google Books Project or at the scale of a single literary corpus) can be viewed as evidence of resistance to the introduction of a new, data-based epistemology that is viewed by some as threatening, and perhaps inferior, to existing and historically grounded epistemologies (Bruns, 2013; Rosenberg, 2013).
These examples in physics, astronomy, biology, and the humanities (and many similar ones) lead us to conclude that mere bigness, lots of data (which appears to have different meanings in different scholarly fields), is not the basis for declaring a new paradigm in science. Furthermore, any such blanket declaration that ignores the confounding factor of epistemic cultures warrants skepticism.
Data integrity and credible science
With these caveats in mind, however, we do claim that there might be some cross-cutting framing of data and their application across the entire scholarly endeavor, recognizing that this framing needs to be parameterized to a particular use of data within a particular epistemic culture. Then, we need to understand how Big Data might challenge this common framing, thereby becoming “tradition shattering” (Kuhn, 1970).
At the forefront is the notion of data integrity: the assurance that data are what they purport to be, grounded in clear provenance, a traceable record of where data came from, who collected and processed them, and how they have been handled, which in turn provides the foundation for credible science.
Our attention here to the issues of data and scientific integrity is coincident with a growing concern with the reliability of scientific knowledge. The notion of a crisis in reliability has been discussed in the media (Naik, 2011), and in scientific journal articles (Brembs and Munafò, 2013) and editorials (“Announcement,” 2013; Jasny et al., 2011). Some of the concern about reliability has been fueled by well-publicized cases of scientific fraud and data falsification in a number of scientific fields (Harrison et al., 2010; “Researcher Faked Evidence of Human Cloning, Koreans Report,” 2006; Verfaellie and McGwin, 2011). In addition, a number of academics are warning about the prevalence of false results in the scientific literature (Ioannidis, 2005; Pöschl, 2004).
But, as pointed out by Stodden (2014), some of this concern arises from the increasing prevalence of data-intensive (Big Data) science across the disciplines, and the application of computational, analytical methods to those data without complete understanding of their characteristics (e.g. the nature of the sample represented by the data). Absent full understanding of the data (and in some cases a failure to account for this lack of intimacy with the data), researchers have at times unwittingly or sloppily applied methodological tools and epistemological assumptions that fail to account for the fundamental differences between these data and traditional, highly curated, reliable data. As pointed out by Lazer et al. (2014), “ … most Big Data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”
Of particular concern in this area have been scientific results based on data sources of questionable provenance and integrity, such as distributed sensors (Wallis et al., 2007) and “black box social media,” where the origin and basis of the data are difficult to determine (Driscoll and Walker, 2014) and the effect of algorithmic bias on the conclusions is difficult to unravel (Gillespie, 2014). A well-known example of the foibles of relying on informally collected data and algorithmic projection is Google Flu Trends (GFT), which raised huge scientific optimism about the predictive utility of informally collected data when first published in Nature in 2009, but which subsequently and persistently overestimated influenza prevalence (Lazer et al., 2014).
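One statistical hazard of this kind of algorithmic projection can be illustrated with a toy simulation (entirely synthetic, and not a model of GFT itself): when a short outcome series is screened against a large number of candidate signals, some of those signals will correlate strongly with the outcome purely by chance, and a predictor built on them will not hold up out of sample.

```python
# Synthetic illustration of spurious correlation: screening many informal
# "signals" against a short outcome series. No relation to the real GFT model.
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
weeks = 52
outcome = [random.gauss(0, 1) for _ in range(weeks)]        # pretend weekly flu index
signals = {f"term_{i}": [random.gauss(0, 1) for _ in range(weeks)]
           for i in range(5000)}                             # pretend search-term series

best_term, best_r = max(
    ((name, pearson(series, outcome)) for name, series in signals.items()),
    key=lambda pair: abs(pair[1]),
)
# Every series is pure noise, yet the best of 5000 candidates still shows a
# sizeable in-sample correlation with the outcome.
print(best_term, round(best_r, 2))
```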
We acknowledge that this emphasis on data integrity (a.k.a. quality) stands somewhat in opposition to the popularized claims by Mayer-Schönberger and Cukier that “looking at vastly more data … permits us to loosen up a desire for exactitude” and effectively allows us to ignore “messiness” in data (Mayer-Schönberger, 2013). As mentioned earlier, this claim and subsequent claims by the authors seem to rely heavily on the assumption that the sheer volume of the data approximates the whole population (their “N = all”), so that scale itself is presumed to wash out individual errors and biases.
As a point of reference, it is useful to look at the notions of integrity, trust, and provenance in the context of archives and archival science, for which they are essential concepts. Hirtle (2000) describes the meanings of these terms and the manner in which they are core to the definition of the archive in the context of the ship
Defining the control zone
Taking a cue from archival science then, we should look at the role of the control zone in another institution that has historically underwritten the integrity of information resources: the library.
The boundary of the traditional library was easy to define. It was the “bricks and mortar” structure with a clear and controlled entry point that contained and protected the selected physical resources over which the library asserted control and curatorial responsibility. Correspondingly, from the patron’s point of view, the boundary marked what could be called a “trust zone”, an area to which entry and exit were clearly marked and in which they could presume the existence of the integrity guarantees of the library. Integrity, in this case, does not imply veracity of the resources of the library, but adherence to principles of proper information stewardship, including accurate description, longevity of the resources, and adherence to some selection criteria.
In Lagoze (2010), we describe how the move from physical to digital information resources and the attendant access to them via the web architecture profoundly disrupted the foundation of the control zone. This disruption was not anticipated by participants, practitioners, and researchers in the early digital library initiatives, who foresaw technical but not institutional change. In fact, some predicted that in the end “[digital] library services would follow a familiar model” (Gladney et al., 1994). Others saw the Internet as another familiar evolutionary technical change, similar to past challenges to libraries, stating that “The anarchy of the Internet may be daunting for the neophyte, but it differs little from the bibliographic chaos that is the result of five and a half centuries of the printing press” (Lerner, 1999).
Two decades later, it is clear that the implications of moving from physical to digital information and network access to that information are more than a technical phenomenon; the implications are more than that “digital information crosses boundaries easily” (Van House et al., 2003) and in fact are deeply disruptive to the library. By viewing the library as a meme, rather than just as an institution or a physical artifact, we can see the roots of the disruption. At its root is the disintegration of the control zone, the very foundation of the library. The notions of a clear boundary, and the attendant concepts of being inside or outside, disappear in the web architecture, where users (i.e. patrons) no longer enter through a well-defined door, but ride hyperlinks and land wherever they may choose in the digital library. Attempts to reassert a boundary by defining a new digital door or portal and establishing branding signposts defining inside vs. outside have proven incompatible with the dominant web context and have largely failed. With the collapse of the control zone, other fundamental components of the library meme become difficult to implement or anachronistic relative to the increasingly normative broader web context. These include selection, deciding what information sources are available to patrons; intermediation, acting as a buffer between information creators and information users; bibliographic description, providing “order making” via the catalog; and fixity, guaranteeing the immutability of information resources.
In conclusion, the wholesale transition of our intellectual, popular, and cultural heritage to the digital realm has been accompanied by a disruptive change in our expectations about our knowledge infrastructures. The notions of selection, intermediation, bibliographic description, and fixity that are core principles of the library meme stand at odds with the web information meme. These contradictions have become sharper as the web has moved over the past decade into the web 2.0 era and beyond. Expectations of open access to information, active participation in knowledge production and annotation, and the integration of social activity and knowledge activities are now the norm. Libraries are certainly part of this modern knowledge infrastructure. But they exist as participants in a world of competing “knowledge institutions” (e.g. Wikipedia, Facebook, Twitter). Meanwhile, notions of information integrity, which were formerly grounded in institutional frameworks such as the library, remain problematic and in search of new ways to certify the provenance of information resources.
Rethinking credible science in the age of Big Data
With knowledge of this precedent, we can now return to Big Data and recognize parallels in the historical transitions of the library and the transformations in the ways that scholarly data are created, shared, and used. The relatively well-controlled mechanisms (both cultural and technical) for data creation, data sharing, and data reuse are under pressure for a number of reasons. Funders, the public, and fellow scientists are demanding, for good reason, better access to data and in general “open data” (Huijboom and Broek, 2011; Molloy, 2011; Murray-Rust, 2008), motivating the creation of numerous data repositories (Greenberg et al., 2009; Hahnel, 2012; Michener et al., 2011) that allow easy and generally anonymous access to scientific data on a global scale. Science in general is becoming more collaborative and interdisciplinary (Barry and Born, 2013; Haythornthwaite et al., 2006; Wagner et al., 2011) (at least partly due to the multidisciplinary scope of grand challenge problems like climate change), breaking down traditional closely-knit teams of colleagues and bringing together scholars with different epistemic and methodological cultures. An increasing number of data sources originate from nontraditional means, such as social networks for which concerns about integrity and provenance are not priorities. Mashups of data are becoming increasingly common, blurring the lines between formal and informal data. Scientists seem to have a love/hate relationship with this new reality. While they support the abstract idea of open data (Cragin et al., 2010; Tenopir et al., 2011), their sharing practices, and sharing preferences, remain relatively closed and motivated by control (Borgman, 2011; Edwards et al., 2011).
Quantitative social science research provides an interesting example of this data transition and its impact on the control zone. For the past 50 years, quantitative social science has been built on a shared foundation of data sources originating from survey research, aggregate government statistics, and in-depth studies of individual places, people, or events. Underlying these data is a well-established and well-controlled infrastructure composed of an international network of highly curated and metadata-rich archives of social science such as the Inter-University Consortium for Political and Social Research (ICPSR) and the UK Data Archive.
These archives continue to play an important role in quantitative social science research. However, the emergence and maturation of ubiquitous networked computing and the ever-growing data cloud have introduced a spectacular quantity and variety of new data sources into this mix. These include social media data sources such as Facebook, Twitter, and other online communities in which individuals reveal massive amounts of information about themselves that are invaluable for social science research. When combined with more traditional data sources, these provide the opportunity for studies at scales and complexities heretofore unimaginable. This transformation has been described by Gary King, a Harvard political scientist, as the
Another example of this fracturing of the control zone exists in observational science, for example, the identification and reporting of phenomena (e.g. species) in ecological niches, astronomy, and meteorology. In each of these areas there is a growing interest in what has been termed citizen science: the enlistment of volunteer, non-professional observers whose contributed observations (for example, bird sightings reported to projects such as eBird) become primary data for research.
It comes as no surprise that crowd-sourced citizen science makes a substantial portion of the formal scientific community uneasy (Sauer et al., 1994), especially in fields where people’s lives are at stake, such as medicine (Raven, 2012). These data, by their nature, break down a well-established control zone in which data are collected by experts, or by individuals managed by experts, who carefully abide by scientific methods. In contrast, citizen science of this type must contend with the problems of highly variable observer expertise and experience. How can we trust data, or the science that results from those data, when their provenance is rooted in sources whose own credentials do not conform to “standard” criteria such as degree, publication record, or institutional affiliation?
The examples described above are only two of the many instances in which new varieties of Big Data are undermining traditional control zones of science. If we look longitudinally, we can see that examples such as these are only the beginning of the problem. The fractured control zones, and the resulting uncertain provenance and trust, only intensify through the lifecycle of sharing, reuse, and circulation of data in an open network in which not all participants are deemed trustworthy according to established norms. Looking across this lifecycle, this dilemma very quickly becomes combinatorially more complex. If the control zone around data set A and that around data set B are poorly defined, that which results from the reuse and combination of the two is only fuzzier. Of course, this is only the first step in the progressive mashup and “cooking” of these data with other data, a progression that is inevitable when data reuse is easy and strongly encouraged.
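A toy calculation (our own, with hypothetical numbers) makes this compounding explicit. If confidence in a source's provenance is treated, very simplistically, as a probability that the source meets traditional integrity criteria, and a derived data set is assumed to be only as trustworthy as all of its inputs taken together, then each round of reuse and combination erodes that confidence further.

```python
# Toy model (hypothetical numbers): provenance confidence of a mashup decays
# multiplicatively with each reuse if the input sources are treated as independent.
def combined_confidence(*source_confidences):
    """Naive rule: a derived data set inherits the product of its sources'
    provenance confidences."""
    result = 1.0
    for c in source_confidences:
        result *= c
    return result

a, b = 0.9, 0.8                      # data sets A and B, each imperfectly controlled
ab = combined_confidence(a, b)       # first mashup of A and B: 0.72
abc = combined_confidence(ab, 0.85)  # "cooked" again with a third source: ~0.61
print(round(ab, 2), round(abc, 2))
```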
Despite the challenges and uncertainties, the inclusion of these “uncontrolled” Big Data into the scientific process is a reality that will continue and perhaps become more common. Our “always there, everywhere” network culture will continue to make more and larger amounts of automatically, accidentally, and informally created data available for science. The value of these data across the scholarly spectrum has been demonstrated numerous times. Social scientists can conduct studies on large-scale social networks that may not replace, but do significantly complement, traditional research based on small-scale social groups (Milgram, 1967; Zachary, 1977). Observational scientists can now accumulate heretofore unavailable evidence of global phenomena, such as bird migrations and climatological events, by leveraging the active participation and contribution of enthusiastic human volunteers.
Our goal in this paper has not been to propose a normative framework for this reality, but to stimulate and add to discussions and investigations of its entangled social, cultural, historical, and technical implications. Rather than falling back on the hyperbolic claim that “Big Data will change the world,” the scholarly community needs to understand this new reality and investigate its implications for science policy and for public trust in science. We propose two threads for moving forward: one epistemological, re-evaluating our understanding of quality in both data and science and our means for determining it; the other methodological, developing means of recovering traditional quality metrics.
The first approach begins by raising the awareness of researchers who use Big Data about its opportunities, complexities, and dangers. This area is reasonably well covered in Boyd and Crawford’s (2011) paper “Six Provocations for Big Data”, which covers many of the caveats in dealing with this kind of data, including “Claims to Objectivity and Accuracy are Misleading” and “Bigger Data Are Not Always Better Data.” As the authors point out, a critical component of using Big Data for research is understanding the integrity of those data: where they originated, what biases are built into them, how data cleaning may lead to overfitting, and what sampling biases may be embedded in them. In this context, we need to evaluate what quality and integrity mean in a networked culture and its numerous possible contexts, in the manner that other scholars are investigating parallel issues such as privacy (Nissenbaum, 2009).
As for methodology, we suggest two technical paths that may offer amelioration of the integrity problem, both based on recovering provenance retrospectively rather than prospectively, as in the traditional manner. In our research with eBird, we have been investigating ways to reconstruct observer/contributor expertise from the aggregated data. Our realization has been that expertise is too nuanced a factor to reconstruct directly, but that coarser proxies, such as the extent of an observer’s reporting history and the degree to which their reports agree with those of other observers at the same time and place, can be estimated from the accumulated data and used to calibrate confidence in individual contributions.
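As a sketch of what such retrospective recovery might look like in practice, the following Python fragment scores each contributor by how often their reports are corroborated by another observer at the same site on the same date. The record format and the scoring rule are our illustrative assumptions, not eBird's actual methodology.

```python
# Illustrative sketch only: a crude "agreement with co-located peers" proxy for
# observer reliability, computed retrospectively from aggregated observations.
from collections import defaultdict

# Toy (observer, site, date, species) records; real data would be far richer.
observations = [
    ("alice", "pond1", "2014-05-01", "mallard"),
    ("bob",   "pond1", "2014-05-01", "mallard"),
    ("bob",   "pond1", "2014-05-01", "ivory-billed woodpecker"),  # implausible report
    ("carol", "pond1", "2014-05-01", "mallard"),
    ("alice", "pond2", "2014-05-02", "heron"),
    ("carol", "pond2", "2014-05-02", "heron"),
]

def agreement_scores(records):
    """For each observer, the fraction of their reports corroborated by at
    least one other observer at the same site on the same date."""
    observers_by_event = defaultdict(set)   # (site, date, species) -> observers
    flags_by_observer = defaultdict(list)   # observer -> corroboration flags
    for observer, site, date, species in records:
        observers_by_event[(site, date, species)].add(observer)
    for observer, site, date, species in records:
        corroborated = len(observers_by_event[(site, date, species)]) > 1
        flags_by_observer[observer].append(corroborated)
    return {obs: sum(flags) / len(flags) for obs, flags in flags_by_observer.items()}

print(agreement_scores(observations))
# {'alice': 1.0, 'bob': 0.5, 'carol': 1.0}
```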
In conclusion, we have argued for an understanding of the difference between lots of data and Big Data. The former, a quantitative feature with mainly technical and methodological implications, has, without a doubt, had important effects on the way science is done and what it makes possible. However, the latter, a qualitative feature with profound epistemological and sociotechnical implications, shakes some of the core assumptions of credible science: trust and integrity. As with so many other aspects of our modern digital culture, such as journalism, where open, networked participation has eroded traditional editorial gatekeeping, science must now find ways to re-create a trust fabric and an accountable chain of responsibility without forfeiting the openness and scale that make these new data so valuable.
Footnotes
Declaration of conflicting interests
The author declares that there is no conflict of interest.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
