Sage Journals: Discover world-class research

Abstract

Despite all the attention to Big Data and the claims that it represents a “paradigm shift” in science, we lack understanding about what are the qualities of Big Data that may contribute to this revolutionary impact. In this paper, we look beyond the quantitative aspects of Big Data (i.e. lots of data) and examine it from a sociotechnical perspective. We argue that a key factor that distinguishes “Big Data” from “lots of data” lies in changes to the traditional, well-established “control zones” that facilitated clear provenance of scientific data, thereby ensuring data integrity and providing the foundation for credible science. The breakdown of these control zones is a consequence of the manner in which our network technology and culture enable and encourage open, anonymous sharing of information, participation regardless of expertise, and collaboration across geographic, disciplinary, and institutional barriers. We are left with the conundrum—how to reap the benefits of Big Data while re-creating a trust fabric and an accountable chain of responsibility that make credible science possible.

Keywords

Big Data control zone paradigm shift sociotechnical

Big Data is not only about being big

The popular and scholarly literature is filled with excitement about Big Data. A good deal of the enthusiasm comes from the business sector, where Big Data offers new possibilities for direct and micro marketing, supply-chain optimization, and other means of increasing efficiency and profits. This enthusiasm has also spread to the public sector, particularly in the areas of security and terrorism prevention. In this paper, we examine the impact of Big Data in the context of science,¹ encompassing the research that takes place in the academic, corporate, and government milieu. Admittedly, the line between commercial research (distinguished from corporate research such as that which takes place at IBM Watson) and scientific research can be fuzzy, but we distinguish the former as motivated by financial concerns (e.g. product improvement for profit improvement), whereas the latter is motivated by the search for some “truth”. Some argue that Big Data represents a new paradigm of science, a “fourth paradigm” (Hey et al., 2009), adopting the terminology used by Kuhn (1970) to characterize the revolutionary transformation of a scientific field.² While many view this new paradigm as complementary rather than substitutive to pre-existing paradigms (observation, experimentation, and simulation), others like Chris Anderson have taken a more extreme view, claiming that Big Data represents the “end of theory” (Anderson, 2008).

Our goal in this paper is to pull back from the hype and take a more measured, analytical approach to Big Data, focusing on the question “what are the characteristics of (some) Big Data that manifest a paradigm shift in the fundamental assumptions of science”? We distinguish between Big Data characteristics that have methodological consequences and those that impact epistemological foundations. We characterize the former as important but not paradigm-shifting. In contrast, we argue that a paradigm shift is indeed evident when Big Data impacts epistemological foundations.

Embedded in this argument is the assumption that the characteristics we are looking for are not native to all uses of big (in size) data. And, in fact, it may be true that data that is not necessarily quantitatively large may have characteristics that are paradigm-shifting when used in certain contexts and by certain communities of use.

Before proceeding any further with an analysis of “big” as a qualifying characteristic of (some) “data”, it is important to establish a definition of “data”, whether big or small. A National Academies report (A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases, 1999) provides a simple and inclusive foundation definition: “data are artifacts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors.” Although this definition is useful, it fails to capture the “relative” nature of data (in contrast to it having an “essential” nature). As Borgman (2011) states: “[d]ata may exist only in the eyes of the holder: the recognition that an observation, artifact, or record constitute data is itself a scholarly act.” This perspective of data is reflexive; something (e.g. images, text, and Excel worksheet, etc.) is data because someone uses it as data in a specific context, and transcendent, it carries across the many disciplines, practices, and epistemologies of science.

This relational/contextual perspective gives us the basis for examining the Big Data phenomenon in a manner that both crosses epistemological boundaries and is contextualized by them. With due recognition of the dangers of making generalizations about “science”, we hope to establish some fundamental aspects of Big Data that are indeed boundary crossing, while remaining shaped by (and shaping) specific disciplinary practices.

Having established this relativistic definition of data, we return to the notion that “Big Data is not only about being big”; that there is some combination of features or dimensions (perhaps among them size) that may have revolutionary effects on science and knowledge production. This multidimensional perspective is evident in many of the popularized, mass-market descriptions of Big Data.

One popular multidimensional definition of Big Data is based on the so-called 3Vs: Volume, Velocity, and Variety (Laney, 2001). Volume is the size factor. Velocity refers to the speed of accumulation, the resulting dynamic nature of the data, and the high-scale processing capacity needed to make it useful and keep it current. Finally, Variety refers to the mixing together, or mashing-up, of heterogeneous data types, models, and schema and the need to resolve these differences in order to make the data useful. Others have enhanced this list with additional “Vs”: Validity, the amount of bias or noise in the data; Veracity, the correctness and accuracy of the data; and Volatility, the persistence and longevity of data (Normandeau, 2013), the first two of which, Validity and Veracity, are of particular interest to the argument of this paper.

Mayer-Schonberger and Cukier in their best-selling book Big Data offer an alternative but complementary set of characteristics of Big Data, which they claim “challenges the way we live and interact with the world” (Mayer-Schönberger, 2013, p. 6). They characterize Big Data as revolutionary because it enables/embodies “three shifts [characteristics] in the way we analyze information and transform how we understand and organize society.” The first is the “more” characteristic, which they posit as the foundation for the two other characteristics. A notable aspect of “bigness” according to the authors is its equivalence to “allness” (n = all). Throughout the book they assert that Big Data obviates the need for traditional (in their view flawed) sampling techniques and increasingly can be considered a complete view of the object of investigation. We later question this argument and initially note that even if the n = all principle were true, the notion of data providing a “complete view” of reality, in the objective sense, is met with skepticism by a number of modern scholars (Bowker, 2014; Edwards, 2010; Gitelman, 2013). The second characteristic is “messy,” the effect of which is diminished by the n = all characteristic. In their words, “looking at vastly more data also permits us to lessen our desire for exactitude.” The third and final characteristic is the shift in analytical technique from causality to correlation. “Most strikingly, society will need to shed some of its obsession for causality in exchange for simple correlations; not knowing why but only what.” We will return to Mayer-Schonberger and Cukier later in this paper to further critique of their n = all claim and its implications for new paradigm science.

These two attempts to define Big Data, and many others like them, fail to adequately capture the nuances and contexts of use of Big Data that may make it revolutionary and the driver of a new scientific paradigm. Employing Kuhn’s words, when are Big Data “tradition-shattering complements to the tradition-bound activity of normal science” (Kuhn, 1970)? To answer this question, we need to examine Big Data from a sociotechnical perspective (Bijker, 1995; Lamb and Sawyer, 2005). We need to investigate their social, cultural, historical, and technical facets and the interplay and tensions among these facets that collectively establish the impact of Big Data on science and the possible transformation thereof. An analysis of this sort will allow us to distinguish the aspects of Big Data that, no matter how contributory to innovation, may be more evolutionary than revolutionary, from those that are indeed paradigm-shifting. Furthermore, it will help us distinguish between locality—discipline and/or field-specific characteristics of Big Data—and globality—aspects of Big Data that may be paradigm-shifting across the scholarly enterprise.

Lots of data or Big Data?

Because technology is such a basic enabler and component of Big Data practices (i.e. computation including hardware, software, and algorithmic components; high-speed networks; massive storage arrays), it is useful to build our argument on the notions of new technological paradigms (Dosi, 1982) and of disruption (Christensen and Rosenbloom, 1995; Rosenbloom and Christensen, 1994). Originating in the business and organizational behavior sector, these two concepts nicely complement Kuhn’s theories, which focus on the scholarly domain. Dosi distinguishes between evolutionary paths of technological change and new technological paradigms that represent discontinuities from pre-existing technological paths and address new classes of problems. Christensen and Rosenbloom expand on this with the notion of disruptive innovation, which is a discontinuity in not only the technological aspect of a product or service, but also a sociotechnical disruption; a contextual change in the set of valuations and values that frame and are impacted by the technical innovation. Christensen initially applied this theory of disruption to product lines (Christensen, 1997), with a prime example being the successive introduction of smaller hard disk platters that initially seemed noncompetitive with the disc products of mainstream manufacturers, but eventually and repetitively obliterated the mainstream markets due to their framing within the revolutionary personalization of computing. Christensen has also applied this theoretical framework to health care (Christensen et al., 2008a) and education (Christensen et al., 2008b).

By leveraging the theoretical frameworks of Kuhn, Rosenbloom, and Christiansen, we argue that a disruption in science (a.k.a. the creation of a new paradigm) is not just methodological, a way of doing (a.k.a. technical), but also must be sociotechnical. It must challenge existing epistemological norms, ways of knowing and framing the fundamental scientific questions of the field; institutional ecologies (Star and Griesemer, 1989), agreements on scope, assumed knowledge, and boundaries of research work; reward structures, paths to tenure and promotion; and communication regimes, mechanisms, and norms for disseminating knowledge. We will use this scaffolding for the remainder of this essay to distinguish between what we will call lots of data, the effects of which are by and large methodological and technical, and true Big Data, that which entails epistemological and, as a result, paradigmatic change.

Our distinguishing between these two terms—lots of data (which entails methodological change and technical innovation) and Big Data (which implies the reevaluation of epistemological foundations)—should not be interpreted as an attempt to segregate data into two disjoint silos, i.e. data set 1 is “lots of data”, in contrast to data set 2 that is genuine “Big Data”. Our intention, rather, is to establish these concepts as continuous dimensions with which instances of data use can be evaluated in order to understand the degree and origins of their methodological and/or paradigm-shifting effects, i.e. a use of data set 1 has high “lots of data” impact but low “Big Data” impact while a use of data set 2 has low “lots of data” impact but high “Big Data” impact. The term “instances of data use”, in contrast to simply “data”, is intentional and refers to the fact that, similar to the definition of data, the methodological and epistemological impact of data must be evaluated within the context of use. An important facet of this context is the distinct epistemic culture (Knorr-Cetina, 1999) of the community of use and its particular perspectives on data and its meaning. In other words, the same data set may “measure” differently according to the “Big Data” and “lots of data” dimensions when employed by different disciplinary communities and/or for different purposes.

Although the primary focus of the remainder of this paper is the Big Data dimension—when, how, and why does data use challenge the epistemological foundations of science—it is useful, for the purpose of contrast, to briefly examine the companion lots of data dimension. This brevity should not be construed as dismissive towards the significance of these technical challenges and the methodological impacts they have. Indeed, there are great challenges here and the scholarly and practical effects of meeting these challenges can be profound, albeit not paradigm-shifting.

Two often-cited instances of data use demonstrate the lots of data dimension. The petabytes of data streaming in from high-energy physics experiments (studied thoroughly by Knorr-Cetina, 1999) or those that are components of the Sloan Digital Sky Survey (Szalay and Gray, 2001) are certainly Big Data in terms of size. But, considered alone, their bigness and the issues associated with them are by and large technical. These communities have historic cultures of data sharing (Ginsparg, 1994; Knorr-Cetina, 1999) and, in fact, their data has always been “big” relative to the quantitative definitions of the day. This is similar to the situation with many domains of science that have a legacy of exploring and manipulating large data sets, where “large” is historically contextualized relative to the technical affordances of the time (Gitelman, 2013).

The massive quantity of data in these two examples clearly introduces issues about new high-capacity storage systems, high-speed networks to easily move them back and forth, and map-reduce algorithms that permit parallel computation over these massive data sets. A recent white paper co-authored by leading data science researchers (Agrawal et al., n.d.) provides a useful list of the cross-cutting challenges that need to be met to respond to these issues; heterogeneity and incompleteness, scale, timeliness or speed, privacy, and human collaboration. All of these are formidable challenges. However, the need for these new methodologies and tools to manipulate, store, and curate these massive data sets does not correspond to a paradigm-shifting disruption of the historically data-focused epistemic culture of the communities of practice that engage with these data.

A recent paper by Leonelli (2014) in the inaugural issue of this journal explores the same issue in the discipline of biology.³ Similar to this paper (albeit limited to a single discipline), Leonelli aims “to inform a critique of the supposedly revolutionary power of Big Data science,” likewise defining revolutionary as synonymous with creating a new epistemology and a new set of norms. Similar to our earlier examples in physics and astronomy, she notes that “data-gathering practices in subfields [of the life sciences] have been at the heart of inquiry since the early modern era, and have generated problems ever since.” She then aims the bulk of her critique at Mayer-Schönberger and Cukier's claims that data completeness mitigates data messiness and their championing of correlation over causality, which we will return to later in this paper. She finishes by rejecting the notion that Big Data is exerting a revolutionary effect on the epistemology of biology itself, claiming that “there is a strong continuity with practices of large data collection and assemblage since the early modern period; and the core methods and epistemic problems of biological research, including exploratory experimentation, sampling and the search for causal mechanisms, remain crucial parts of the inquiry in the area of science.” In contrast to epistemic effects on the discipline itself, she acknowledges significant methodological challenges “encountered in developing and applying curatorial standards for data … ” and in the dissemination of that data.

On the other hand, the sensitivity of the evolutionary versus revolutionary impact of big (or of even any) data to epistemic culture becomes evident in the context of digital humanities (or as some call it computational humanities, and its specializations such as computational history). The level of controversy over the “datafication” (Mayer-Schönberger, 2013) of historical and/or literary artifacts (whether in massive scale such as the Google Books Project or the scale of a single literary corpus) can be viewed as evidence of resistance to the introduction of a new epistemology, based on data, that is viewed by some as threatening, and perhaps inferior, to existing and historically based epistemologies (Bruns, 2013; Rosenberg, 2013).

These examples in physics, astronomy, biology, and the humanities (and many similar ones) lead us to conclude that mere bigness, lots of data (which appears to have different meanings in different scholarly fields), is not the basis for declaring a new paradigm in science. Furthermore, we can be fairly confident that such a blanket declaration without attention to the confounding factor of epistemic cultures warrants skepticism.

Data integrity and credible science

With these caveats in mind, however, we do claim that there might be some cross-cutting framing of data and their application across the entire scholarly endeavor, recognizing that this framing needs to be parameterized to a particular use of data within a particular epistemic culture. Then, we need to understand how Big Data might challenge this common framing, thereby becoming “tradition shattering” (Kuhn, 1970).

At the forefront is the notion of data integrity, which we assert is a consistent and discipline-crossing foundation of credible science (Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age, 2009; Nowotny, 2001). We intentionally use the term integrity rather than correctness or quality; the latter terms ascribe a level of positivism to data that many modern scholars refute (Edwards, 2010; Gitelman, 2013). Integrity, on the other hand, has a more constructivist tone, implying notions of “trust”, “fitness for use”, and “consensual understanding”, all of which are contextual and relative to epistemic culture, in contrast to the implicitly binary notion of correctness. Looking at this from the perspective of infrastructure to support data sharing (using “infrastructure” in its broadest most sociotechnical sense; Edwards et al., 2007), we can then draw the links from integrity to trust, and ultimately to provenance (evidence upon which trust is established), and propose that determining the degree of data integrity is based on the ability to answer a number of questions. What is the origin of these data? Who has been responsible for them since their origination? Can we apply our standard notions for trust and integrity to them? Do our standard methodologies for interpreting them and drawing conclusions from them make sense? Big Data is then those data that disrupt fundamental notions of integrity and force new ways of thinking and doing to reestablish it. Said differently, Big Data is data that makes us rethink our notions of credible science.

Our attention here to the issues of data and scientific integrity is coincident with a growing concern with the reliability of scientific knowledge. The notion of a crisis in reliability has been discussed in the media (Naik, 2011), and in scientific journal articles (Brembs and Munafò, 2013) and editorials (“Announcement,” 2013; Jasny et al., 2011). Some of the concern about reliability has been fueled by well-publicized cases of scientific fraud and data falsification in a number of scientific fields (Harrison et al., 2010; “Researcher Faked Evidence of Human Cloning, Koreans Report,” 2006; Verfaellie and McGwin, 2011). In addition, a number of academics are warning about the prevalence of false results in the scientific literature (Ioannidis, 2005; Pöschl, 2004).

But, as pointed out by Stodden (2014), some of this concern arises from the increasing prevalence of data-intensive (Big Data) science across the disciplines, and the application of computational, analytical methods to those data without complete understanding of their characteristics (e.g. the nature of the sample represented by the data). Absent full understanding of the data (and in some cases a failure to account for this lack of intimacy with the data), researchers have at times unwittingly or sloppily applied methodological tools or epistemological understanding to those data that failed to account for the fundamental differences between them and traditional highly-curated and reliable data. As pointed out by Lazer et al. (2014), “ … most Big Data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”

Of particular concern in this area has been scientific results based on data sources of questionable provenance and integrity such as distributed sensors (Wallis et al., 2007) and “black box social media,” where the origin and basis of the data are difficult to determine (Driscoll and Walker, 2014) and the algorithmic bias on the conclusions is difficult to unravel (Gillespie, 2014). A well-known example of the foibles of the reliance on informally collected data and algorithmic projection is the Google Flu Trends (GFT), which raised huge scientific optimism about the predictive utility of informally collected data when first published in Nature in 2009 (Ginsberg et al., 2009). This optimism suffered a serious setback in 2013 when the GFT predictions for that year were shown to be seriously exaggerated (Butler, 2013; Lazer et al., 2014). A complete accounting for this setback is beyond the scope of this paper. However, one acknowledged factor is an overconfidence in the veracity of the data as a true sample of reality, rather than a random snapshot in time and the result of algorithmic dynamics.

We acknowledge that this emphasis on data integrity (a.k.a. quality) stands somewhat in opposition to the popularized claims by Mayer-Schönberger and Cukier that “looking at vastly more data … permits us to loosen up a desire for exactitude” and effectively allows us to ignore “messiness” in data (Mayer-Schönberger, 2013). As mentioned earlier, this claim and subsequent claims by the authors seem to rely heavily on n = all, that is, Big Data is not a sample but a complete set. We find this claim highly suspicious and agree with fellow scholars (Boyd and Crawford, 2011; Bowker, 2014) who take the position that any data, no matter what its size, is de facto a sample, with bias implicit due to choice of instrumentation, span of observation, units of measurement, and numerous other factors. In essence, n never equals all; all is a limit in mathematical terms that can be approached but never attained. This point is also emphasized by Leonelli, who states that “having a lot of data is not the same as having all of them; and cultivating such a vision of completeness is a very risky and potentially misleading strategy” (Leonelli, 2014). Thus, if one denies sampling and its effects on messiness or on our ability to derive meaning from correlations, as Mayer-Schönberger and Cukier seem to do, they tread on questionable territory in terms of high integrity science, and may indeed have an argument that is more appropriate to business and commerce. Again quoting Leonelli, “it is no coincidence that most of the examples given by Mayer-Schönberger and Cukier come from the industrial world, and particularly globalized retail strategies as is the case of Amazon.com” (Leonelli, 2014).

As a point of reference, it is useful to look at the notions of integrity, trust, and provenance in the context of archives and archival science, for which they are essential concepts. Hirtle (2000) describes the meanings of these terms and the manner in which they are core to the definition of the archive in the context of the ship Constellation, a tourist destination in Baltimore harbor that was mistakenly identified as a revolutionary war ship when its vintage was really the US Civil War. According to Hirtle (2000), “at the heart of an archive … are records that are created by an agency or organization in the course of its business and that serve as evidence of the actions of that agency or organization [italics added].” Furthermore, “one way in which archivists working with … records have sought to ensure the enduring value of archives as evidence is through the maintenance of an unbroken provenance for the records [italics added].” Implicit in the notion of “unbroken provenance” is control over storage and transfer; in order to serve as evidence an archival record must demonstrate a complete, unbroken, historical knowledge of the item of interest, who has been in control of it, and by what means it has been transferred or moved to other authorities. Fans of crime shows on TV or of detective novels should find this notion quite familiar; the evidence presented in a court of law is useless if law enforcement has lost control of it and it may have been tampered with.

Defining the control zone

Taking a cue from archival science then, we should look at the role of control (and unbroken provenance) as a necessary (but not necessarily sufficient) factor in data integrity. Traditional data origination, sharing, and reuse were based on the reality of containable and concrete physical data (e.g. written by hand or stored on magnetic devices that are kept in drawers or file cabinets) and data sharing practices based on physical handoff to known colleagues. The physicality of both the data and the transfer of data amounted to a well-defined control zone resulting in a provenance chain that was documented and witnessed. Before examining the breakdown of this control zone in the context of Big Data, in the next section we examine the same notion and its role in the disruption of another knowledge infrastructure (Edwards et al., 2013) that has over the past two decades undergone considerable change, the library. In a seminal 1996 article, “Library Functions, Scholarly Communication, and the Foundation of the Digital Library: Laying Claim to the Control Zone” (Atkinson, 1996), the late Ross Atkinson, then Associate University Librarian at Cornell University, describes how the notion of a control zone lay at the foundation of the library. According to Atkinson, the functioning of the library depends on the definition of a clear boundary, a demarcation of what lies in the library and what is outside. Internal to this boundary, within the control zone, the library can lay claim to those resources that have been selected as part of the collection, and assert curation, or stewardship, of those selected resources to ensure their integrity, availability, and stability over the long-term.

The boundary of the traditional library was easy to define. It was the “bricks and mortar” structure with a clear and controlled entry point that contained and protected the selected physical resources over which the library asserted control and curatorial responsibility. Correspondingly, from the patron’s point of view, the boundary marked what could be called a “trust zone”, an area to which entry and exit were clearly marked and in which they could presume the existence of the integrity guarantees of the library. Integrity, in this case, does not imply veracity of the resources of the library, but adherence to principles of proper information stewardship, including accurate description, longevity of the resources, and adherence to some selection criteria.

In Lagoze (2010), we describe how the move from physical to digital information resources and the attendant access to them by the web architecture profoundly disrupted the foundation of the control zone. This disruption was not anticipated by early participants, practitioners, and researchers in the early digital library initiatives, who foresaw technical but not institutional change. In fact, some predicted that in the end “[digital] library services would follow a familiar model” (Gladney et al., 1994). Others saw the Internet as another familiar evolutionary technical change, similar to past challenges to libraries, stating that “The anarchy of the Internet may be daunting for the neophyte, but it differs little from the bibliographic chaos that is the result of five and a half centuries of the printing press” (Lerner, 1999).

Two decades later, it is clear that the implications of moving from physical to digital information and network access to the information is more than a technical phenomenon; the implications are more than that “digital information crosses boundaries easily” (Van House et al., 2003) and in fact are deeply disruptive to the library. By viewing the library as a meme,⁴ rather than just as an institution or a physical artifact, we can see the roots of the disruption. At the foundation of it is the foundation of the library itself, the disintegration of the control zone. The notions of a clear boundary, and the attendant concepts of being inside or outside, disappear in the web architecture, where users (i.e. patrons) no longer enter through a well-defined door, but ride hyperlinks and land wherever they may choose in the digital library. Attempts to reassert a boundary by defining a new digital door or portal and establishing branding signposts defining inside vs. outside have proven incompatible with the dominant web context and have largely failed. With the collapse of the control zone, other fundamental components of the library meme become difficult to implement or anachronistic relative to the increasingly normative broader web context. These include selection, deciding what information sources are available to patrons; intermediation, acting as a buffer between information creators and information users; bibliographic description, providing “order making” via the catalog; and fixity, guaranteeing the immutability of information resources.

In conclusion, the wholesale transition of our intellectual, popular, and cultural heritage to the digital realm has been accompanied by a disruptive change in our expectations about our knowledge infrastructures. The notions of selection, intermediation, bibliographic description, and fixity that are core principles of the library meme stand at odds to the web information meme. These contradictions become sharper as the web has moved over the past decade into the web 2.0 era and beyond. Expectations of open access to information, active participation in knowledge production and annotation, and the integration of social activity and knowledge activities are now the expected norm. Libraries are certainly part of this modern knowledge infrastructure. But they exist as participants in a world of competing “knowledge institutions” (e.g. Wikipedia, Facebook, Twitter). Meanwhile, notions of information integrity, which were formally grounded in institutional frameworks such as the library, remain problematic and in search of new ways to certify the provenance of information resources.

Rethinking credible science in the age of Big Data

With knowledge of this precedent, we can now return to Big Data and recognize parallels in the historical transitions of the library and the transformations in the ways that scholarly data are created, shared, and used. The relatively well-controlled mechanisms (both cultural and technical) for data creation, data sharing, and data reuse are under pressure for a number of reasons. Funders, the public, and fellow scientists are demanding, for good reason, better access to data and in general “open data” (Huijboom and Broek, 2011; Molloy, 2011; Murray-Rust, 2008), motivating the creation of numerous data repositories (Greenberg et al., 2009; Hahnel, 2012; Michener et al., 2011) that allow easy and generally anonymous access to scientific data on a global scale. Science in general is becoming more collaborative and interdisciplinary (Barry and Born, 2013; Haythornthwaite et al., 2006; Wagner et al., 2011) (at least partly due to the multidisciplinary scope of grand challenge problems like climate change), breaking down traditional closely-knit teams of colleagues and bringing together scholars with different epistemic and methodological cultures. An increasing number of data sources originate from nontraditional means, such as social networks for which concerns about integrity and provenance are not priorities. Mashups of data are becoming increasingly common, blurring the lines between formal and informal data. Scientists seem to have a love/hate relationship with this new reality. While they support the abstract idea of open data (Cragin et al., 2010; Tenopir et al., 2011), their sharing practices, and sharing preferences, remain relatively closed and motivated by control (Borgman, 2011; Edwards et al., 2011).

Quantitative social science research provides an interesting example of this data transition and impact on the control zone. For the past 50 years, quantitative social science has been built on a shared foundation of data sources originating from survey research, aggregate government statistics, and in-depth studies of individual places, people, or events. Underlying these data is a well-established and well-controlled infrastructure composed of an international network of highly curated and metadata-rich archives of social science such as the Inter-University Consortium for Political and Social Research⁵ (ICPSR) and the UK Data Archive.⁶

These archives continue to play an important role in quantitative social science research. However, the emergence and maturation of ubiquitous networked computing and the ever-growing data cloud have introduced a spectacular quantity and variety of new data sources into this mix. These include social media data sources such as Facebook, Twitter, and other online communities in which individuals reveal massive amounts of information about themselves that are invaluable for social science research. When combined with more traditional data sources, these provide the opportunity for studies at scales and complexities heretofore unimaginable. This transformation has been described by Gary King, a Harvard political scientist, as the social science data revolution, which is characterized by a “changing evidence base of social science research” (King, 2011a, 2011b). These new opportunities present formidable new challenges to the fabric of social science research. Among those mentioned by King (2011b) include privacy challenges, problems of sampling bias in uncontrolled data sets, a change in the basic “job descriptions” of social scientists with demand for new skills in statistical methods, computational methods, and the like, and the need for new cross-disciplinary collaborations (i.e. breaking down the silos that social science scholars formally existed in). Clearly this is an example of Big Data rather than just lots of data.

Another example of this fracturing of the control zone exists in observational science, for example, identification and reporting of phenomena (e.g. species) in ecological niches, astronomy, and meteorology. In each of these areas there is a growing interest in what has been termed crowd sourced citizen science, which engages numerous volunteers as participants in large-scale scientific endeavors (Wiggins and Crowston, 2010). The opportunities for large-scale citizen science arise from the ubiquitous networking and computing context and especially the recent spectacular growth in the use of mobile devices. The motivations for leveraging this large-scale volunteer workforce as observational “sensors” are substantial. The geographic scope of the observational spaces and the varieties of habitats make reliance on trained observers (e.g. scientists) infeasible. Our particular experience in this area is with the eBird project,⁷ originated at the Cornell Laboratory of Ornithology, a highly successful citizen science project that for over a decade has collected observations from volunteer participants worldwide (Sullivan et al., 2014). Those data have subsequently been used for a large body of highly-regarded and influential scientific research.

It comes as no surprise that crowd sourced citizen science makes a substantial portion of the formal scientific community uneasy (Sauer et al., 1994), especially in fields where people’s lives are at stake, such as medicine (Raven, 2012). These data, by nature, breakdown a well-established control zone whereby data is collected by experts or individuals managed by experts who carefully abide by scientific methods. In contrast, citizen science of this type must contend with the problems of highly variable observer expertise and experience. How can we trust data or the science that results from those data when their provenance is rooted in sources whose own provenance does not conform to “standard” criteria such as degree, publication record, or institutional affiliation?

The examples described above are only two of the many instances in which new varieties of Big Data are undermining traditional control zones of science. If we look longitudinally, we can see that examples such as these are only the beginning of the problem. The fractured control zones, and the resulting uncertain provenance and trust, only intensify through the lifecycle of sharing, reuse, and circulation of data in an open network in which not all participants are deemed trustworthy according to established norms. Looking across this lifecycle, this dilemma very quickly becomes combinatorially more complex. If the control zone around data set A and that around data set B are poorly defined, that which results from the reuse and combination of the two is only fuzzier. Of course, this is only the first step in the progressive mashup and “cooking” of these data with other data, a progression that is inevitable when data reuse is easy and strongly encouraged.

Despite the challenges and uncertainties, the inclusion of these “uncontrolled” Big Data into the scientific process is a reality that will continue and perhaps become more common. Our “always there, everywhere” network culture will continue to make more and larger amounts of automatically, accidentally, and informally created data available for science. The value of these data across the scholarly spectrum has been demonstrated numerous times. Social scientists can conduct studies on large-scale social networks that may not replace, but do significantly complement, traditional research based on small-scale social groups (Milgram, 1967; Zachary, 1977). Observational scientists can now accumulate heretofore unavailable evidence of global phenomena, such as bird migrations and climatological events, by leveraging the active participation and contribution of enthusiastic human volunteers.⁸

Our goal in this paper has not been to propose a normative framework for this reality, but to simulate and add to discussions and investigations of its entangled social, cultural, historical, and technical implications. Rather than fall back on hyperbolic “Big Data will change the world,” the scholarly community needs to understand it and investigate its implications for science policy and public trust of science. We propose two threads for moving forward: one epistemological, evaluate our understanding of quality in both data and science and our means for determining it, the other methodological, developing means of recovering traditional quality metrics.

The first approach begins by raising the awareness of researchers who use Big Data about its opportunities, complexities, and dangers. This area is reasonably well covered in Boyd and Crawford’s (2011) paper “Six Provocations for Big Data”, which covers many of the caveats in dealing with this kind of data including “Claims to Objectivity and Accuracy are Misleading” and “Bigger Data Are Not Always Better Data.” As the authors point out, a critical component of using Big Data for research is understanding the integrity of those data, where they originated, what biases are built into them, how data cleaning may lead to over fitting, and what sampling biases may be embedded in them. In this context, we need to evaluate what quality and integrity mean in a networked culture and its numerous possible contexts, in the manner that other scholars are investigating parallel issues such as privacy (Nissenbaum, 2009).

As for methodology, we suggest two technical paths that may offer amelioration of the integrity problem, both based on retrospectively recovering provenance, rather than prospectively, as in the traditional manner. In our research with eBird, we have been investigating ways to reconstruct observer/contributor expertise from the aggregated data. Our realization has been that expertise is too nuanced a factor to reconstruct, but that experience, interpreted as deliberate practice, is an effective path to expert performance (Ericsson and Charness, 1994). Evidence of experience can be extracted from the aggregated data; for example, frequency of contributions, the diversity of contributions measured by species distribution, etc. By devising ways to recognize these traces we hope to develop mechanisms that aid scientists in determining the expertise (and perhaps integrity) of anonymous data contributors (reference removed for author anonymity). Another approach might be to employ digital forensics (Reith et al., 2002), a technique increasingly popular in the intelligence and legal communities, which, like our work with expertise, recovers traces of origin and provenance metadata from a digital artifact itself.

In conclusion, we have argued for an understanding of the difference between lots of data and Big Data. The former, a quantitative feature with mainly technical and methodological implications, has, without a doubt, had important effects on the way science is done and what it makes possible. However, the latter, a qualitative feature with profound epistemological and sociotechnical implications, shakes some of the core assumptions of credible science: trust and integrity. Similar to so many aspects of our modern digital culture such as journalism (e.g. the New York Times versus the flood of grassroots news blogs) and reference information (e.g. Encyclopedia Britannica versus Wikipedia), it is futile and even undesirable to seek a return to traditional, rigid control zones. Nevertheless, we are left with the challenge with Big Data to reap its benefits while simultaneously holding science to the same standards that it has been held to for centuries.

Footnotes

Declaration of conflicting interests

The author declares that there is no conflict of interest.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Notes

References

Agrawal D, Bernstein P, Bertino E, et al. (n.d.) Challenges and Opportunities with Big Data. Available at: https://www.purdue.edu/discoverypark/cyber/assets/pdfs/BigDataWhitePaper.pdf (accessed 28 October 2014).

Anderson C (2008) The end of theory: will the data deluge make the scientific method obsolete? Wired 1--5.

Announcement: Reducing our irreproducibility (2013) Nature 496(7446): 398–398.

A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases (1999) Washington, DC: The National Academies Press. Available at: http://www.nap.edu/openbook.php?record_id=9692 (accessed 28 October 2014).

Atkinson R (1996) Library functions, scholarly communication, and the foundation of the digital library: laying claim to the control zone. The Library Quarterly 66(3).

Barry

Born

(2013) Interdisciplinarity: Reconfigurations of the Social and Natural Sciences, 1st ed. New York, NY: Routledge.

Bijker WE (1995) Of Bicycles, Bakelites, and Bulbs: Toward a Theory of Sociotechnical Change. Cambridge, MA: MIT Press.

Borgman

(2011) The conundrum of sharing research data. Journal of the American Society for Information Science 63(6): 1–40.

Bowker G (2014) The theory/data thing. International Journal of Communication 8(5).

10.

Boyd D and Crawford K (2011) Six provocations for Big Data. SSRN Electronic Journal. DOI: 10.2139/ssrn.1926431.

11.

Brembs B and Munafò M (2013) Deep impact: unintended consequences of journal rank. ArXiv. Available at: http://arxiv.org/abs/1301.3748 (accessed 28 October 2014).

12.

Bruns A (2013) Faster than the speed of print: reconciling “big data” social media analysis and academic scholarship. First Monday 18(10). Available at: http://firstmonday.org/ojs/index.php/fm/article/view/4879/3756 (accessed 7 October 2013).

13.

Butler

(2013) When Google got flu wrong. Nature 494(7436): 155–156.

14.

Christensen

(1997) The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail, Boston, MA: Harvard Business School Press.

15.

Christensen CM, Grossman JH and Hwang J (2008a) The Innovator’s Prescription: A Disruptive Solution for Health Care. New York, NY: McGraw-Hill.

16.

Christensen CM, Horn MB and Johnson CW (2008b) Disrupting Class: How Disruptive Innovation Will Change the Way the World Learns. New York, NY: McGraw-Hill.

17.

Christensen

Rosenbloom

(1995) Explaining the attacker’s advantage: technological paradigms, organizational dynamics, and the value network. Research Policy 24(2): 233–257.

18.

Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age (2009) Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. Washington, DC: National Academies Press.

19.

Cragin

Palmer

Carlson

(2010) Data sharing, small science and institutional repositories. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 368(1926): 4023–4038.

20.

Dosi G (1982) Technological paradigms and technological trajectories: a suggested interpretation of the determinants and directions of technical change. Research Policy 11(3): 147–162.

21.

Driscoll

Walker

(2014) Big data, big questions working within a black box: transparency in the collection and production of big twitter data. International Journal of Communication 8(0): 20.

22.

Edwards P, Mayernik MS, Batcheller A, et al. (2011) Science friction: data, metadata, and collaboration. Social Studies of Science 41(5): 667–690.

23.

Edwards PN (2010) A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. Cambridge, MA: MIT Press.

24.

Edwards PN, Jackson SJ, Bowker GC, et al. (2007) Understanding Infrastructure: Dynamics, Tensions, and Design. Washington, DC: National Science Foundation.

25.

Edwards PN, Jackson SJ, Chalmers MK, et al. (2013) Knowledge Infrastructures: Intellectual Frameworks and Research Challenges. Ann Arbor, MI.

26.

Ericsson

Charness

(1994) Expert performance: its structure and acquisition. American Psychologist 49(8): 725–747.

27.

Gillespie T (2014) The relevance of algorithms. In: Gillespie T, Boczkowski P and Foot (eds) Media Technologies: Essays on Communication, Materiality, and Society. Cambridge, MA: MIT Press, p.167.

28.

Ginsberg

Mohebbi

Patel

(2009) Detecting influenza epidemics using search engine query data. Nature 457(7232): 1012–1014.

29.

Ginsparg

(1994) First steps towards electronic research communication. Los Alamos Science 8(4): 390–396.

30.

Gitelman L (2013) “Raw Data” Is an Oxymoron (Infrastructures). Cambridge, MA: The MIT Press, p.192.

31.

Gladney HM, Fox EA, Ahmed Z, et al. (1994) Digital Library: Gross Structure and Requirements: Report from a March 1994 Workshop. College Station: IEEE.

32.

Greenberg

White

Carrier

(2009) A metadata best practice for a scientific data repository. Journal of Library Metadata 9(3–4): 194–212.

33.

Hahnel M (2012) Exclusive: figshare a new open data project that wants to change the future of scholarly publishing. In: Impact of Social Sciences Blog.

34.

Harrison

WTA

Simpson

Weil

(2010) Editorial. Acta Crystallographica Section E Structure Reports Online 66(1): e1–e2.

35.

Haythornthwaite C, Lunsford KJ, Bowker GC, et al. (2006) Challenges for research and practice in distributed, interdisciplinary collaboration. In: Hine C (ed) New Infrastructures for Knowledge Production: Understanding E-science. Information Science Publishing, pp.143–166.

36.

Hey T, Tansley S and Tolle K (eds) (2009) The Fourth Paradigm. Redmond, WA: Microsoft Research.

37.

Hirtle

(2000) Archival authenticity in a digital age. In: Cullen

Levy

Lynch

(eds) Authenticity in a Digital Environment, Washington, DC: Council on Library and Information Resources.

38.

Huijboom

Broek

(2011) Open data: an international comparison of strategies. European Journal of ePractice 12: 1–13.

39.

Ioannidis

JPA

(2005) Why most published research findings are false. PLoS Med 2(8): e124.

40.

Jasny

Chin

Chong

(2011) Data replication & reproducibility. Again, and again, and again… Introduction. Science (New York, N.Y.) 334(6060): 1225.

41.

King

(2011a) Ensuring the data-rich future of the social sciences. Science (New York, N.Y.) 331(6018): 719–721.

42.

King G (2011b) The social science data revolution. Available at: http://gking.harvard.edu/files/gking/files/evbase-horizonsp.pdf (accessed 28 October 2014).

43.

Knorr-Cetina K (1999) Epistemic Cultures: How the Sciences Make Knowledge. Cambridge, MA: Harvard University Press.

44.

Kuhn TS (1970) The Structure of Scientific Revolutions, 2nd ed. Chicago: University of Chicago Press.

45.

Lagoze C (2010) Lost Identity: The Assimilation of Digital Libraries into the Web (PhD dissertation). Cornell University, Ithaca. Available at: http://carllagoze.files.wordpress.com/2012/06/carllagoze.pdf.

46.

Lamb

Sawyer

(2005) On extending social informatics from a rich legacy of networks and conceptual resources. Information Technology & People 18(1): 9–20.

47.

Laney D (2001) {3D} Data Management: Controlling Data Volume, Velocity, and Variety.

48.

Lazer

Kennedy

King

(2014) The parable of Google flu: traps in big data analysis. Science 343(6176): 1203–1205.

49.

Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data & Society 1(1). DOI: 10.1177/2053951714534395.

50.

Lerner FA (1999) Libraries Through the Ages. New York, NY: Continuum.

51.

Mayer-Schönberger

(2013) Big Data: A Revolution that Will Transform How We Live, Work, and Think, Boston: Houghton Mifflin Harcourt.

52.

Michener W, Vieglais D, Vision T, et al. (2011) DataONE: data observation network for earth — preserving data and enabling innovation in the biological and environmental sciences. D-Lib Magazine 17(1/2).

53.

Milgram

(1967) The small world problem. Psychology Today 2: 60–67.

54.

Molloy JC (2011) The open knowledge foundation: open data means better science. PLoS Biology 9. DOI: 10.1371/journal.pbio.1001195.

55.

Morris CW (1938) Foundations of the Theory of Signs. Chicago: University of Chicago Press.

56.

Murray-Rust

(2008) Open data in science. Serials Review 34: 52–64.

57.

Naik G (2011). Mistakes in scientific studies surge. Wall Street Journal. Available at: http://online.wsj.com/news/articles/SB10001424052702303627104576411850666582080.

58.

Nissenbaum

(2009) Privacy in Context: Technology, Policy, and the Integrity of Social Life, Stanford, CA: Stanford Law Books.

59.

Normandeau N (2013) Beyond volume, variety and velocity is the issue of big data veracity. Available at: http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/ (accessed 15 April 2014).

60.

Nowotny

(2001) Re-Thinking Science: Knowledge and the Public in an Age of Uncertainty, 1st ed. Cambridge, UK: Polity.

61.

Pöschl

(2004) Interactive journal concept for improved scientific publishing and quality assurance. Learned Publishing 17(2): 105–113.

62.

Raven K (2012) 23andMe’s face in the crowdsourced health research industry gets bigger. Available at: http://blogs.nature.com/spoonful/2012/07/23andmes-face-in-the-crowdsourced-health-research-industry-gets-bigger.html (accessed 28 October 2014).

63.

Reith M, Carr C and Gunsch G (2002) An examination of digital forensic models. International Journal of Digital Evidence 1: 1–12.

64.

Researcher faked evidence of human cloning, Koreans report (2006) The New York Times, 10 January.

65.

Rosenberg D (2013) Data before the fact. In: “Raw Data” is an Oxymoron. Cambridge, MA: MIT Press, pp.15–30.

66.

Rosenbloom

Christensen

(1994) Technological discontinuties, organizational capabilities, and strategic commitments. Industrial and Corporate Change 3(3): 655–685.

67.

Sauer

Peterjohn

Link

(1994) Observer differences in the North American Breeding Bird Survey. The Auk 111(1): 50–62.

68.

Star

Griesemer

(1989) Institutional ecology, translations and boundary objects: amateurs and professionals in Berkeley’s Museum of Vertebrate Zoology, 1907-39. Social Studies of Science 19(3): 387.

69.

Stodden V (2014) Enabling reproducibility in big data research: balancing confidentiality and scientific transparency. In: Privacy, Big Data and the Public Good. Cambridge, UK: Cambridge University Press. Available at: http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/privacy-big-data-and-public-good-frameworks-engagement (accessed 28 October 2014).

70.

Sullivan BL, Aycrigg JL, Barry JH, et al. (2014) The eBird enterprise: an integrated approach to development and application of citizen science. Biological Conservation 169 (January).

71.

Szalay

Gray

(2001) The world-wide telescope. Science (New York, N.Y.) 293(5537): 2037–2040.

72.

Tenopir

Allard

Douglass

(2011) Data sharing by scientists: practices and perceptions. PLoS ONE 6(6): 21.

73.

Van House

Bishop

Buttenfield

(2003) Introduction: Digital Libraries as Sociotechnical Systems, Cambridge, MA: MIT Press.

74.

Verfaellie M and McGwin J (2011) The case of Diederik Stapel: Allegations of scientific fraud by prominent Dutch social psychologist are investigated by multiple universities. Psychological Science Agenda 25(12).

75.

Wagner

Roessner

Bobb

(2011) Approaches to understanding and measuring interdisciplinary scientific research (IDR): a review of the literature. Journal of Informetrics 5(1): 14–26.

76.

Wallis

Borgman

Mayernik

(2007) Know thy sensor: trust, data quality, and data integrity in scientific digital libraries. In: Kovács

Fuhr

Meghini

(eds) Research and Advanced Technology for Digital Libraries SE - 32 Vol. 4675.Berlin, Heidelberg: Springer, pp. 380–391.

77.

Wiggins A and Crowston K (2010) Distributed scientific collaboration: research opportunities in citizen science. In: Proceedings of ACM CSCW 2010 workshop on the changing dynamics of scientific collaborations.

78.

Zachary

(1977) An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33: 452–473.

Big Data,data integrity,and the fracturing of the control zone

Abstract

Keywords

Big Data is not only about being big

Lots of data or Big Data?

Data integrity and credible science

Defining the control zone

Rethinking credible science in the age of Big Data

Footnotes

Declaration of conflicting interests

Funding

Notes

References