Abstract
To improve the quality and efficiency of research, groups within the scientific community seek to exploit the value of data sharing. Funders, institutions, and specialist organizations are developing and implementing strategies to encourage or mandate data sharing within and across disciplines, with varying degrees of success. Academic journals in ecology and evolution have adopted several types of public data archiving policies requiring authors to make data underlying scholarly manuscripts freely available. The effort to increase data sharing in the sciences is one part of a broader “data revolution” that has prompted discussion about a paradigm shift in scientific research. Yet anecdotes from the community and studies evaluating data availability suggest that these policies have not obtained the desired effects, both in terms of quantity and quality of available datasets. We conducted a qualitative, interview-based study with journal editorial staff and other stakeholders in the academic publishing process to examine how journals enforce data archiving policies. We specifically sought to establish who editors and other stakeholders perceive as responsible for ensuring data completeness and quality in the peer review process. Our analysis revealed little consensus with regard to how data archiving policies should be enforced and who should hold authors accountable for dataset submissions. Themes in interviewee responses included hopefulness that reviewers would take the initiative to review datasets and trust in authors to ensure the completeness and quality of their datasets. We highlight problematic aspects of these thematic responses and offer potential starting points for improvement of the public data archiving process.
Keywords
Introduction
The value of open data in the scientific discovery process is well documented (Bowker, 2001; Group on Earth Observations, 2015; Hilgartner, 2013; Leonelli, 2014). As Michael Nielsen (2011) wrote in
Coordination of data sharing efforts in the latter cases relies on a number of stakeholders. Funding agencies, for example, might seek to streamline their efforts by requiring data to be shared and preventing costly re-collection. Funders have both incentives and enforcement mechanisms readily available (i.e. “carrots and sticks”) (Couture et al., 2018; Diekema et al., 2014). Other stakeholders, including scientific journals, manage a delicate balance between incentives and enforcement. These journals are increasingly requiring researchers to make datasets associated with manuscripts available, often by establishing public data archiving (PDA) policies (Roche et al., 2015). PDA policies illustrate journals’ recognition of data archiving as an essential step in the research process (Vines et al., 2013; Whitlock et al., 2010), yet the appropriate mechanisms for managing the data archiving process are, to date, undefined.
Journals have several motivations for instituting PDA policies and developing appropriate strategies for incentivizing and enforcing compliance. In principle, policies requiring authors to publish the datasets underpinning analyses in their manuscripts facilitate scrutinization, reproduction, and replication of studies (Bloom et al., 2014; Goecks et al., 2010; Poisot et al., 2013). The resulting transparency can increase public trust in science (Beardsley, 2010; Duke and Porter, 2013; South and Duke, 2010) and, by extension, enhance the reputation of the journal.
The above motivations are increasingly prominent across scientific disciplines in light of the reproducibility and replicability crises. In various fields, researchers have experienced high rates of failure in their attempts to replicate or reproduce the findings presented in academic papers (e.g. Begley, 2013; Open Science Collaboration, 2012). A survey of 1576 researchers conducted by
Journals may also view PDA as a way to increase citations, to provide other researchers interested in the same or similar phenomena with resources, and to provide valuable objects of collaboration (Borgman, 2007; Edwards et al., 2011; Piwowar and Vision, 2013). Furthermore, PDA policies aid in ensuring the sustainability and quality of scientific data. Without adequate PDA, the short- and long-term sustainability of research data diminishes (Kaye and Hawkins, 2014; Vines et al., 2014). In sum, incorporating review of datasets into the publishing process can help to avert some of what Leonelli (2014: 1) refers to as “difficulties caused by the lack of adequate curation for the vast majority of data in the life sciences.”
Publishers have implemented PDA policies in various journals across scientific disciplines (see Appendix 1 for a list of examples in ecology and evolution) and many different types of policies have emerged. In general, PDA policies fall on a spectrum from those only requiring authors to make data “available upon request” (e.g. via email from an interested party) to requiring authors to deposit datasets in specific repositories housing specific types of data (e.g.
Moving beyond “available upon request” policies served as a consensus step toward realizing the potential of open data in science. As Michener (2015) describes, factors such as the availability of technical infrastructure for data sharing and funder policies requiring sharing drove these changes in norms. However, the effectiveness of PDA policies for enabling reproduction, replication, and data reuse remains questionable. For example, Roche et al. (2015: 1) found that 56% of published datasets related to manuscripts in top ecology and evolution journals were incomplete, and 64% “were archived in a way that partially or entirely prevented reuse.”
This study investigates why PDA policies that go beyond “available upon request” may not be proving effective in realizing the goals of open data efforts. One purported reason for failing to build collections of reusable datasets is that journals lack appropriate enforcement mechanisms and incentive structures to ensure that published datasets are complete and high quality (Costello et al., 2013; Mayernik, 2017). We sought to build on this idea and explain the mechanisms by which journals enforce PDA policies by examining the roles of stakeholders in the PDA process and identifying problematic aspects of the process. We do so by reviewing the state of PDA policies in a broadly defined biology discipline—ecology and evolution—and identifying some of the unique challenges of sharing data in this area of research. We chose ecology and evolution because of its scope (i.e. the range of biological and other sciences it covers) and its long history. We begin by contextualizing PDA policy development in the broader paradigm shift in scientific research discussed by Kitchin (2014), among others. We then describe the types of PDA policies currently in place in ecology and evolution journals, which reflect the types of policies emerging across the sciences. We describe our interview-based approach to examining how PDA policies are enforced in ecology and evolution journals, present the themes we identified in interview responses, and contextualize the findings in the ongoing discussions about data and related research artifact sharing in science.
Journal data publication policies in ecology and evolution: Origins and current status
In Star and Ruhleder’s (1996) foundational piece on cyberinfrastructures, they emphasized that infrastructures such as those supporting open data do not grow de novo. Infrastructures such as those for sharing data are shaped by existing tools, methods, and practices; therefore, studying infrastructures of any kind requires contextualizing the study in a situated way (Jackson et al., 2007; Schrock and Shaffer, 2017; Tilson et al., 2010). In other words, studying open data infrastructure development and use should recognize and account for the diversity of stakeholders and knowledge-making processes involved in the scientific discovery process (Bowker, 2006; Edwards et al., 2011; Hine, 2006; Jasanoff, 2004; Kitchin, 2014; Knorr Cetina, 1998). Adopting this approach, we examined the enforcement of PDA policies by journals within one discipline—ecology and evolution.
The shift toward increased data sharing in ecology and evolution is one example of a broader ongoing shift in the way research is conducted and the role data plays in the production of scientific knowledge. Kitchin (2014) describes what Jim Gray refers to as the fourth paradigm in scientific research, one that is “based on the growing availability of Big Data and new analytics.” According to Kitchin, questions have arisen within the conversation about a new, Big Data-driven scientific paradigm, contesting the form the transition should ultimately take. Namely, the debate centers on whether the sheer volume of data and development of advanced analytical techniques renders theory impractical or unnecessary.
The implications of a shift toward purely data-driven science are complex and intertwined. First, aggregating massive amounts of data for the production of scientific knowledge shifts the skills and technologies inherent in doing scientific work. The emergence of data scientist as a role within research teams in indicative of how data-driven science changes the way research is conducted. Moreover, automated data capture and other forms of collecting data (i.e. non-experimental data collection) introduces questions about data ownership and reuse, credit for scientific advances, and a host of other social issues needing resolution amidst the changing paradigm.
The shifting sets of skills and technologies are often touted as a positive development because they bring objectivity and rigor to the construction of scientific knowledge (see, for example, Bryant et al., 2008; Lehning et al., 2009; Loukides, 2010). However, scholars have cautioned against this assumption because the ways in which data are collected and analyzed necessarily imbue the process with biases and subjectivities (Ribes and Jackson, 2013). In shifting science toward a data-centric mode of producing knowledge, then, scientists should attempt to acknowledge and counterbalance these potential biases by opening datasets up to scrutiny and examination by peers. One way of facilitating open data and scrutinization of the new paradigm’s implications—and one that is amenable to prior modes of conducting scientific research—is to archive data underlying published manuscripts. Various disciplines have begun to take this step, a transition that we discuss in one context below.
Data sharing in ecology and evolution: A brief history
Data sharing in ecology and evolution predates the recent shift toward open data. As Michener and Jones (2012) described, large projects in ecology and evolution required data sharing as early as the 1980s with the advent of programs like the Long Term Ecological Research (LTER) Network. LTER has grown in size and shifted in structure over its 40-year history, but has always managed data from at least 15 ongoing research programs spanning forest, grassland-agriculture, tundra, coastal, freshwater, marine, and urban ecosystems (LTER, 2018). According to Michener et al. (2011), the project includes human infrastructure, cyberinfrastructure, and standards to support ecological research. Human infrastructure includes personnel such as information managers at each site, while the cyberinfrastructure relies largely on relational databases accessed via SQL. The infrastructure also includes additional databases for files such as XML documents. Standards include required use of the Ecological Metadata Language for structuring metadata. Examples of data sharing in ecology and evolution can be found even earlier than the 1980s. For example, the International Biological Program ran from 1964 to 1974 and served as a foundational approach for collaborative, large-scale ecosystem science (see Coleman (2010: 15–89) for a detailed account of its emergence and impact.
Recently, subdisciplines and interest areas in ecology and evolution research have developed their own infrastructures for sharing data (e.g. Goff et al., 2011; Morisette et al., 2009; Wright and Wang, 2011). For example, the Global Biodiversity Information Facility houses hundreds of datasets and continues to grow in size and scope. However, as Costello et al. (2013) pointed out, over two-thirds of the datasets have been provided by government organizations rather than from academics. This finding is surprising given that, according to Ware and Mabe (2015) and Costello et al. (2013), academic researchers publish 75% of all scientific papers across scientific disciplines. Researchers from various subdisciplines have called for increased contributions from the academic community and have offered mechanisms by which the community might review and improve datasets prior to publishing (Chavan and Ingwersen, 2009; Costello, 2009; Costello et al., 2013; Michener, 2015; Piwowar et al., 2007; White et al., 2013).
A turning point in the effort to institute PDA policies in ecology and evolution came in 2008 with the launch of
As The example of GenBank shows the value of the availability of data for all of these reasons. The modern synthetic use of DNA sequence data would not be possible without the near-universal use of GenBank as a public archive. Moreover, GenBank would not be nearly as complete as it is without the communal decision to archive all DNA sequence data, a decision initially introduced by journals.
Current status: PDA policies and technical infrastructure
Journals, perhaps partially in response to calls for data contributions from their manuscripts and the availability of data repositories, are rapidly adopting and implementing PDA policies. Figure 1 places these policies on a spectrum, from no policy at all (rare), to requiring a data availability statement (most common), up to a full data review process resulting in peer-reviewed datasets (rare).
Types of PDA policies.
Available upon request
The first journals to implement PDA policies required all authors to make data available upon reasonable request. In other words, editors mandated that authors share their data should someone contact them to examine or reuse the datasets underlying manuscripts. These policies tended to be non-specific with regard to the form and amount of data to be shared (e.g. data underlying figures in the manuscript vs. raw data). “Available upon request” policies served as an initial, consensus-driven step toward making datasets widely available.
Researchers have since tested the efficacy of “available upon request” policies and levied criticisms against the policies’ ability to reach desired outcomes, backed by evidence that authors often do not respond to such requests or deny the requester access. Stodden et al. (2018), for example, conducted a study in which the authors requested data and code from a random sample of 204 manuscripts in
Data availability statement
Many journals in ecology and evolution have moved beyond available upon request policies and have made strides toward ensuring the long-term availability of research data. Like journals in many other disciplines, ecology and evolution journals have favored data availability statements as the preferred requirement for PDA. A data availability statement requires authors to write a section detailing where the data can be found. In general, data availability statements can point to a number of places, including supporting information files and public repositories. Editorial policies at publishers such as PLOS One (2014), for example, strongly suggest that public, discipline-specific repositories be used. However, exceptions are available and allow authors to, at minimum: … specify “Data available on request” and identify the group to which requests should be submitted (e.g., a named data access committee or named ethics committee). The reasons for restrictions on public data deposition must also be specified. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.
Mandated repository
The availability of repositories like [Journal] requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as [list of approved archives here]. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.
Ambitious examples of PDA policies extending beyond JDAP also exist in ecology and evolution. These policies specify the level of data authors should submit and, in some cases, which repositories should be used. Specific levels of data vary. Archived data should be sufficiently complete so that subsequent users can repeat tables, graphs, and statistical analyses reported in the original publication, and derive summary statistics for new or meta- analyses. Thus, the normal resolution of the data that are archived will be at the level of individual observations.
Peer-reviewed datasets
The least common policy among those presented here are policies resulting in peer reviewed datasets. The journals with such policies tend to be specialized instantiations of a larger journal. In other words, these journals often specialize in data publications and therefore may not have the visibility or readership traditional journals enjoy in the ecology and evolution research communities.
These types of publications achieve data review rigor unseen in traditional journals. As Lawrence et al. (2011) pointed out, other configurations of peer reviewed data exist and could be employed instead of standalone data paper publications. However, peer review of data is difficult to scope, as pointed out by Parsons et al. (2011): No clear definition of peer review of data exists, as peer review could focus on data accuracy, validity, or documentation, among other possibilities.
Research questions
Journals that have such data review processes produce high-quality datasets enabling future use by the scientific community. However, the degree to which traditional manuscripts receive this type of attention from reviewers was, prior to this study, unclear. We sought to illuminate the process by which traditional journals enforce PDA policies and ensure the quality of published datasets by interviewing editors at widely read ecology and evolution outlets. The following research questions guided our study of the PDA enforcement and data review processes: RQ1: How do journals develop, implement, incentivize, and enforce PDA policies? RQ2: Who do editors perceive as responsible for enforcing PDA policies and reviewing datasets? What roles do they assign to each participant in the peer review process?
Methods
We conducted a qualitative, interview-based study of how journals enforce data publication policies. We selected 10 journals that publish research in ecology and evolution, all of which have adopted either JDAP or their own PDA policies. The journals were selected for their prominence in the field, identified via two of the present study’s authors who conduct ecological research and supported by traditional metrics such as the journal impact factor. The study data was collected by conducting 20 semi-structured interviews with journal editors (editors-in-chief, associate/assistant/subject editors, and data editors), data repository staff, and journal/publisher staff. We followed qualitative data analysis techniques first developed by Miles and Huberman (1994) to identify themes in the responses to our semi-structured interview questions, focusing explicitly on editors’ and staff’s perceived roles of each stakeholder in enforcing PDA policies.
Data collection
We employed three semi-structured interview protocols to provide the qualitative data necessary to assess how journals develop and enforce data sharing policies. In 2017 and early 2018, we interviewed different types of stakeholders in the PDA policy enforcement process following the advice of Weiss (1995), who emphasized the importance of capturing diverse perspectives when constructing an interpretation of an organizational process.
In developing the protocols, we followed guidelines set by Spradley (1979) for conducting semi-structured interviews. Semi-structured interviewing involves the use of a guiding protocol (i.e. a general set of questions that the researcher can adapt to the ongoing conversation) for the interviews rather than a strict protocol or no protocol at all. The semi-structured interview allows the researcher to explore emergent ideas that may provide new insights into the phenomenon of interest, whereas a strict protocol might preclude the interviewee from expressing what is important to his or her thinking on a given topic. Interview protocols contain common questions so that responses can be compared across informants (Bernard, 1988; Yin, 1994), a feature that is vital to identifying and validating themes in the data (Corbin and Strauss, 1990; Emerson et al., 1995). In this act of comparing and contrasting, the researcher can ensure that the sentiments and experiences of the informants are not idiosyncratic, but are in some way thematic and/or related to one another.
Interviews with editors, AEs, and repository/journal staff.
Data analysis
We employed qualitative coding techniques to analyze interview data. Interview transcripts were analyzed using Atlas.ti© qualitative coding software. Following the work of Corbin and Strauss (1990), Emerson et al. (1995), and Miles and Huberman (1994), the analysis was interpretive and iterative. The analysis began with repeated readings of interview transcripts and continued with readings until we were familiar with all of the actors, terminology, policies, and interaction scenarios contained in the data. The next step, open coding, involved annotating the interview transcripts with initial interpretations of the data based on the initial repeated readings of transcripts. Following Corbin and Strauss (1990), codes were then refined, conflated, or made more granular throughout the open coding process. We then used several other forms of qualitative coding to deepen the analysis. For instance, selective coding involved looking closely for additional instances of a particular concept or theme. Axial coding included relating codes to one another by breaking single codes down into multiple, related codes or recognizing co-occurrence of two or more themes and conflating them. Glaser’s constant comparison was employed throughout (Glaser, 1965). Constant comparison refers to the process of comparing new instances of a code to its previous uses to ensure that there is thematic consistency. We present the themes generated from the analysis below.
Findings
The journals we studied implemented formal data sharing policies between 2013 and 2015. At the time of implementation, all 10 journals had existing guidelines in place encouraging authors to share data associated with their manuscripts. The increasing availability of technical infrastructure to support data sharing in ecology and evolution (e.g. growth in the number of repositories) and community criticism of “available upon request” policies prompted the expansion of data sharing policies. Editors and repository staff reported that transitioning from voluntary submission to mandated data archiving did not appear to be an unreasonable extension of existing practice.
All journals opted for a data availability statement approach, with varying levels of and approaches to enforcement. Four editorial staff members indicated that reviewers were asked to check that a link to data was present; and three of the four reported that they encouraged reviewers to click on the link and ensure that datasets were in a reasonable repository. According to editors, no journals asked reviewers, AEs, or editors to systematically scrutinize the published dataset. Editors viewed Assistant/Associate Editors (AEs) and reviewers as jointly responsible for enforcing compliance with the data availability statement and verifying that the data exist at the location provided by the author. AEs relied on reviewers to flag potential quality issues early in the review process. However, editors and AEs alike repeatedly emphasized (in 12 of the interviews) that the authors were primarily responsible for ensuring the quality of their datasets. Below, we describe the implementation of PDA policies, the mechanisms of their enforcement and quality assurance, and editors’ plans going forward.
Developing data publication policies: “available upon request” to “available”
All of the journals we studied had implemented or were beginning to implement PDA policies at the time of our interviews with editors. A common theme in participant accounts of the implementation of data sharing policies was that mandated sharing emerged through bottom up changes to practice within the ecology and evolution community. Editors commonly made statements such as, “It’s not actually hugely different from the previous policy” because most journals “already operated with a kind of a policy expecting that data would be shared upon reasonable request.” Similarly, one editor stated, “we had a data policy that very very strongly encouraged the archiving—well, it was the same policy, but it was optional instead of mandatory.” Indeed, in almost all of the journals (8 of 10), transitioning to data availability statements was a matter of moving from voluntary contributions to mandatory archiving, with little additional action required. Editors reported little pushback from authors regarding the extension of data sharing policies. As one editor of a journal requiring a data availability statement recounted: We’ve gotten shockingly little pushback on our open data policy. We literally had one author write and say, “My career depends on my exclusive access to these data.” In 2 years … I want to say that we’ve really only ever in two years have one author say that they did not want to make their data publicly available because they hadn’t finished writing all the papers they wanted to. Literally, there’s n of 1 on that. We’ve had a number of authors who’ve had to work with us to come up with a compromise around data that could not be publicly shared. Either because it involved endangered species or human subjects, or because it involved data that was obtained through, like, a commercial fisheries program.
Asked to speculate as to why they encountered little pushback, editors commonly pointed to three factors: ecologists’ history of sharing data in projects that predate journal data sharing policies; a new generation of researchers “growing up” with the expectation that data would be shared; and the ethos of the ecology and evolution field. Editors-in-chief eagerly discussed ecology and evolution’s history of data sharing and pointed to it as evidence that the discipline was well positioned to adhere to data sharing policies. Editors-in-chief tended either to cite their own involvement in data sharing projects before the “open data movement” began or to describe examples of large, collaborative ecological projects. As one editor reported, he had firsthand experience with collaborative projects requiring data sharing as early as the 1980s: I was funded on a [federal agency] project, I think we were selected in 1984. And that was the first project that [federal agency] enforced its earth science data policy for. And so I’ve worked in the context of open data since I was a postdoc. And I’m very much in the context of the [federal agency] open data policy, which applies not only to [redacted agency name], but to all of the field projects and research that the agency supports. And so, because of that, I’ve been involved not only in collaborative open data projects – not as a practitioner by the way – but also I’ve seen the evolution of the data systems that support those open data policies. So my first project, the [redacted] project in the late 1980s, actually had a data system to support and enable the team to share their information. Of course, back then, we were sending three-and-a-quarter inch floppy disks back and forth with our data on them because there was no Internet. But I’ve kind of seen the evolution from an earth science perspective of open data for a long time. At least in U.S. ecology, large collaborative projects like LTER and FluxNet go back quite a ways … So community ecologists have grown up in an environment where either they themselves or they were around people who were sharing data. So I think that our data sharing activities go back further than the open data movement. And, you know, I think that the motivation that I expressed, this notion of transparency for decision support, that’s something that ecologists have been dealing with for a long time in supporting decisions like the Northwest Forest Management plan, the spotted owl issue, or salmon fisheries, or coastal zone management. So I think ecologists have been working in an environment where they might have needed to make their data available for scrutiny for quite a long while. We really rely on the AE to make sure, and the reviewers, but mainly the subject matter editor, to ensure that what’s archived is usable. (editor-in-chief) All of the quite junior people have become quite expert at fielding these types of inquiries [from authors about data sharing]. (editor-in-chief) I’m going to speculate that ecology is a pretty societal-benefit oriented field … and we have the motivation to share data that comes from people’s motivation related to conservation and other sort of benefit areas. (editor-in-chief) There’s not as much private sector influence. Engineering, these types of fields, they can make big bucks off of their research products. (AE) We don’t have the overlay of the profit motive that biomedicine has … So I think we have the same sort of fairly practical view that biomedical science does, but without the overlay of big money. And so nobody thinks they’re going to get rich off their ecological data or code or anything like that. (editor-in-chief)
Enforcing data publication policies
The lack of resistance to the policy described by the editors did not mean that authors were entirely compliant with data sharing policies, nor were editors unaware of the quality issues associated with archived datasets. In fact, editors reported that their journals did, on occasion, need to take action to ensure that authors submitted datasets upon acceptance of their manuscripts. Interestingly, the mechanism of that action varied across journals: In some cases, editorial staff indicated to authors that they would withhold publication until they received a link to the data; in other cases, staff requested data submission in the acceptance letter but did not follow up on compliance prior to publishing the paper. Perceptions of whose responsibility it might be to take appropriate action garnered no concrete answers. Editors commonly cited heterogeneity of ecological data as an obstacle to establishing standard mechanisms for enforcing data sharing policies and, when probed, reported that they routinely handled compliance issues on a “case-by-case” basis.
Editors, AEs, and repository/journal staff discussed the roles of at least four key groups of stakeholders who participate in the journal PDA process and commented on how each had a hand in the enforcement of PDA policies and ensuring the quality of research data: Authors (researchers), reviewers, editors, and repository staff. Commonalities emerged in the responses interviewees gave about the role of each stakeholder in the enforcement of data policies (i.e. reviewers, associate/assistant-level editors, repositories, authors). We focus below on how editors perceived reviewer and author roles and, in the process, discuss how editors view their own roles in enforcing PDA policies and ensuring data quality.
Role of reviewers
Editors and associate editors reported that they do not select reviewers based on expertise in data or code; instead, editors aim for reviewers with domain expertise. Expertise sometimes includes adeptness with making data reproducible/reusable, but often does not include these skills to the same degree that we might expect to find in fields like mathematics or computer science. Journals that publish methods paper often accept manuscripts heavily focused on data and/or software code (e.g.
No editors reported choosing reviewers based on their ability to review datasets, nor did they expect that a reviewer with domain expertise would be capable or willing to review datasets within their domain. Instead, editors-in-chief and AEs repeatedly expressed “hope” that reviewers would have the skills and appetite to review datasets: We hope that reviewers will have flagged issues along the way if they can’t access data, etcetera, as well. (editor-in-chief) So there are specific questions in the reviews [guide for reviewers] at [some journals] about whether all of the data should be available are available, so we hope that that question gets looked at externally by reviewers or the academic editor. (AE) INTERVIEWER: So what do you do to build new things into the review process to combat that sort of bias or, maybe the tendency for some things to be included [in a dataset], others not? ASSOCIATE/ASSISTANT-LEVEL EDITOR: Well, I don’t think there’s anything explicit we do in the review process. It’s just, when a paper comes in that analyzes something like this, is it, is the scope of the inference and the scope of what they’re doing really relevant? For example, if you’re analyzing something about body size, are there any experimental studies that have ever shown that body size is important for that? If there are, then sure, go ahead, but if there’s not, well that’s, you know …
In cases where the journal required authors to submit a data availability statement, but reviewers were not asked to certify the veracity of the statement, PDA policy enforcement was treated as a copy-editing process: It’s more of an administrative process … We do have some minimum checks, though, things like making sure there are no instances of “data not showing” [referring to text in the paper]. So then we’ll kind of macro-level check on what’s written in the paper.
Role of authors: “Trusting the authors”
Perhaps the most interesting of the themes that emerged from our analysis concerns editors’ perceptions of the role of authors in PDA policy compliance and data quality assurance. In particular, editors-in-chief and AEs repeatedly mentioned “trust” for authors as a key aspect of their PDA practices. Their responses echo arguments that have been made elsewhere regarding the need for culture change, rather than policy, to drive changes in data sharing norms. Surprisingly little consideration was given to imposing further PDA policies upon authors; furthermore, several editorial staff members discussed hesitation at the idea of stricter policies for fear of driving authors to submit to other publications.
One of the factors underlying editorial staff’s dependence on trusting authors was community disagreement over the appropriate “level” of data archiving. The participants we interviewed discussed their inability to reach consensus about what should be shared and how it should be shared (presumably having moved on from discussions of whether or not INTERVIEWER: Is there anyone at [data repository] that’s checking the data for you, to see if it’s complete, if it’s appropriate for the paper, and that sort of thing? ASSOCIATE/ASSISTANT-LEVEL EDITOR: It’s completely on the author. We went through a very long period about what is the appropriate way to do that, what is the appropriate sort of level of specificity for the data, what’s the appropriate level of annotation, and you know, once you start getting into those issues, and you’re running on volunteers, it just blows up in your face. [laughter] And at some level, too, it’s never really clear exactly what is the appropriate level. I think what we’d like to be doing more is working with the community to define what the data sharing expectations are in individual sub fields and disciplines and for different specific data types. So you might look at something like fMRI and try and understand which pieces of the data output are required and which format they’re best put in, and basically every subfield, you might look at electrophysiology data, or it might be imaging data, which is another very complex problem because it can be such big amounts of data. So I think what we’d like to do is work with the community to better define and understand what the expectations are and should be that is accepted by them. Because what doesn’t work well is if a publisher presents an edict to a community which they haven’t been involved in discussing. [editor-in-chief] ASSOCIATE/ASSISTANT-LEVEL EDITOR: And their [staff at the data repository] opinion is, it’s up to the editor and journal and the authors to come to an agreement about what that embargo period should be. INTERVIEWER: But do authors have a direct line to [the repository] to ask them for an exception? Do you allow that? Or you come to an agreement with them? ASSOCIATE/ASSISTANT-LEVEL EDITOR: We come to an agreement with the authors, and then the authors pick. But we assume everybody’s upstanding, and all we check is that data’s been deposited. Sometimes the data might be something like a MATLAB script and we can’t actually open it. You know, sometimes you’re dealing with a proprietary file type that we can’t verify. But, we sort of take that trust element of, we’ve got that file, we probably think it’s reasonable. If there’s a problem with it flagged down the line, we’d need to pursue it, but for now, we’ll take that on trust. (editor-in-chief)
Role of penalty
Editors universally reported in all interviews that the penalty for noncompliance with data sharing policies was refusal to publish the paper. Specifically, enforcement would be levied between the time reviewers accepted the paper and the time the article was assigned a DOI: Well the enforcement is really pretty simple—they deposit or they don’t publish. And so it’s part of our acceptance checklist that we won’t put the paper into production unless a staff member has gone, checked, and verified that the data are in fact present where the paper claims they are. To the extent that there’s enforcement, it falls to the AE backed up by me. (editor-in-chief) INTERVIEWER: Ok, so that would come at the time that manuscript had actually been accepted— EDITOR-IN-CHIEF: We would try to flag that earlier [chuckles], and we’d hope that that would surface earlier, but failing earlier, there’s a final step that happens to make sure that the data have been made available. Ideally someone’s looked at it, you know, and assessed it, whether the in-house person or the external person. Well, [data repository name] actually sets up a link, so when we go to copy edit a paper, if there should be data in [repository name], [repository name] has already sent a link to be published with the paper, to say, “Here’s the URL if you want the data.” So we know exactly if it has been deposited or not, based on whether we got that URL or not. (associate/assistant-level editor)
The dearth of cases in which papers were rejected for data archiving issues does not provide sufficient evidence that rejection fails to provide an appropriate enforcement mechanism. However, when considered alongside findings above regarding reliance on reviewers, trust for authors, and previous studies about the low quality of archived data (e.g. Roche et al., 2015), the finding does indicate that the timing of the enforcement may be problematic. In other words, authors may feel compelled to submit a dataset in some form once the manuscript has been conditionally accepted, but may not be as attentive to the completeness or quality of the dataset as they might be if it were reviewed during the peer review process. When the issue of timing was raised in interviews, five editors cited resource constriction as a limitation (e.g. “Our reviewers are already pressed for time” and “It would be too costly to decouple the review process and have dedicated staff [to review datasets].”).
Plans going forward
In addition to the themes about enforcement presented above, we also found commonalities in how editors discussed their journals’ plans going forward. As noted earlier, many of the editors were hesitant to make PDA policies increasingly strict—i.e. by instituting peer review processes for ensuring dataset quality. Instead, almost all of the editors we interviewed expressed optimism that cultural change would drive improvements in quality of data: Where, in the beginning, we had some resistance from our academic editors and reviewers about looking at the data, now we see more people who are requesting things upfront, saying “Where are the data? I’m not going to look at this further until I get the data.” And so I like to think that we’ve seen change happen and that more people are caught unaware by the policy, which is really, it’s very anecdotal, but it’s the most encouraging thing I can really say in terms of what’s changed since we started doing it. I don’t know how many people actually go and read the data policy, but that’s another issue … they want to be able to look at the data that underpin the study before they’ll review, or in order to review the paper. (editor-in-chief) I think we’re probably slightly shy of going full mandate direction after the data policy because it has proven so difficult to resolve the actual enforcement parts. And I think, clearly [our journal] is in a unique position, being more mission-driven and a not-for-profit, which allows us to do things that other publishers wouldn’t do in the pursuit of greater openness. But I’m sure that whatever we come up with will be the subject of a lot of long and hard and tough discussions and back and forth about, it’s a fine line between asking for more and ensuring that authors can actually comply, can and will comply with what you’re asking of them. So it will be a complex discussion, I’m sure. (editor-in-chief) So we’ve put off the requirement for archiving code and scripts partly just because we wanted to see how the data policy worked. (editor-in-chief) I think that’s going to be harder. You know, having been in the “computer-aided ecology biz” for a long time, I’m expecting to get screenshots of MATLAB procedures and horrible Python code that even the author can’t read anymore, and I don’t know what we’re going to do about that. Because in some sense, you can’t push too hard because if they go back and rewrite the code or clean it up, then they might actually change it. And so, we’ve talked about this across the journals. [editor-in-chief]
Discussion
The findings presented above are encouraging in that editorial staff at widely read ecology and evolution journals are reflecting upon the successes and shortcomings of PDA policies. Furthermore, the staff’s recognition that technical infrastructure for data sharing exists and no longer limits authors in publishing datasets promises to shape the negotiations between journals and authors in favor of increased sharing. However, our analysis revealed several remaining issues that illustrate how going beyond “available upon request” PDA policies may not be sufficient in producing desired outcomes.
The findings highlight how incentivizing and enforcing PDA policies has proved difficult for journals in ecology and evolution because editors and other journal staff rely heavily upon hope in reviewers’ attention to datasets and trust in authors to submit appropriate data. This relational approach risks reproducing the issues that limited data sharing prior to PDA policy implementation. Furthermore, reliance on reviewers and authors comes at the expense of the development of new or extended processes for ensuring data submission and dataset quality. Below, we contextualize the findings in previous studies and offer suggestions for ways to remedy remaining issues. Our primary argument is that journals might consider changes to their existing processes and adapting the processes of other organizations rather than inventing new processes altogether. Specifically, journals might be particularly amenable to processes that resemble and/or extend existing peer review processes.
The primary, remaining concern for the future of PDA policies is that data submissions, without scrutiny, will fail to reach “community standards” as defined by guidelines such as the FAIR principles (Wilkinson et al., 2016). Indeed, Roche et al. (2015) discussed the paucity of high-quality datasets emanating from ecology and evolution journals as compared to datasets published in accordance with funder policies. Given the results of our study, the lack of quality in datasets comes as no surprise: The journals we studied, although optimistic about reviewers’ and authors’ behaviors, did very little to equip reviewers and authors with the tools and processes to evaluate and refine archived datasets. Below, we discuss some possibilities for data review processes to be implemented in the peer review process.
Applying strategies from data curation initiatives
Ecologists have called for peer review of datasets in the past (e.g. Costello et al., 2013). Formalized data review has proved difficult, though, in part because of the heterogeneity of research data both across and within subdisciplines (Reichman et al., 2011). Moreover, the ways of dealing with heterogeneity vary even within organizations. Mayernik (2016: 973) illustrated the complexity of developing standardized practices for data management and archiving through a case study of three programs, including LTER. The author found that “institutional support for data and metadata management are not uniform within a single organization or academic discipline.” Vanderbilt et al. (2009) noted similar challenges in a study of data integration in the International LTER program, and the general issue of “science friction” is well documented in social studies of science (e.g. Edwards et al., 2011).
To be sure, an important distinction exists between data curation efforts—which seek to organize, integrate, and preserve datasets for long-term sustainability and reuse for various purposes—and peer review efforts, which seek to ensure that data analysis underlying manuscript findings reflect scientific best-practices and that datasets are sufficiently complete to evaluate said findings. Various examples of successful data curation efforts demonstrate that data review is one part of broader curation tasks and can perhaps be more easily standardized than data review for academic manuscripts. Standardization is perhaps more achievable for data curation efforts because funding agencies can issue strong mandates and sufficient funding for data review. One such example is the National Science Foundation (2016) Arctic Data Center’s process for requiring “metadata, full data sets, and derived data products be deposited in a long-lived and publicly accessible archive”. Data review processes for journals, on the other hand, must be flexible enough to accommodate heterogeneity and complexity of datasets and the institutional structures in which data are generated while achieving some degree of standardization.
Data librarians and other research support staff comprise a community that is familiar with striking such a balance. As Lin and Strasser (2014) noted, “librarians, information technologists, preservation specialists, and others have a long history of providing infrastructure, education, and support for preserving and promoting researchers’ outputs,” and they argue that scholarly publishers should elevate that role for the professionals listed. Journals might begin to do so by engaging with communities such as the Research Data Alliance (RDA) to develop data review processes. RDA and other groups engage regularly with public and private funders (Berman et al., 2014) and therefore have experience translating policy into the research process; however, journals appear to be less engaged with RDA and similar groups. The professionals in data management roles, including in non-library organizations (e.g. DataONE) might alleviate journals’ difficulties navigating issues such as resource constriction for managing research outputs; the nuances of ownership rights and intellectual property; and principles of reuse, credit, and citation.
Journal editors frequently reported that the human resource cost of reviewing datasets presented a barrier to implementing data review processes. One option often discussed in the research data management community involves automating the data review process, which would alleviate some of the concern over reviewers’ unpaid time and effort. Organizations such as the U.S. Geological Survey (2016) have outlined options for automating audits of data management practices to ensure adherence to best-practices. Building best-practices, as defined by individual communities (Moore et al., 2010) and by the broader research community (Tenopir et al., 2011; Wilson et al., 2014), into the research process from the outset of a project might aid in easing the burden of PDA on reviewers, editors, and authors at the manuscript stage. Automation tools provide a starting point for ensuring that researchers include appropriate metadata, handle missing data with care, control versions, and format data in machine-readable formats (Starr et al., 2015).
Applying strategies from software review initiatives
Our findings also illustrated some carryover effects from editors’ experiences with PDA policies into their perceptions of impending code review and publication policies. In principle, there is little reason to treat data archiving and code archiving serially; the two research products are often intertwined, but also have meaningful differences that require attention. In practice, though, journals are beginning to draft and implement code sharing policies by extending existing policies or issuing new ones. It is worthwhile, then, to consider how journals might simultaneously (1) apply mechanisms developed by journals and organizations who regularly facilitate peer review of
The perception among editors appears to be that scientists will be unprepared to share code used to analyze data due to inexperience with code sharing best-practices (e.g. adequate documentation, appropriate use of versioning technologies), concerns that mirror the issues editors raised regarding data sharing. Perhaps as a result, editors described potential code archiving policies as almost identical to open data policies in function: mandated availability, but little attention to how the review process should incorporate code review. Rather than framing editors’ conflation of data and code review and archiving as a barrier to developing sound policies, we propose that the development of code sharing policies may help to improve journal data policies and their application.
Consider, for example, the similarities in challenges that both datasets and code present when compared to manuscripts, some of which are outlined by Schopf (2012). Both artifacts can be heterogeneous in format, have traditionally been under-supported and/or underappreciated aspects of the research process, and are increasingly reliant on computing infrastructures. Furthermore, editors expressed concern about finding volunteer reviewers with expertise in data and code in addition to scientific knowledge on a manuscript’s subject matter. Various organizations and initiatives have begun to tackle these issues as they relate to software and may provide a foundation for improving journal policies for both data and code sharing.
Open source software (OSS) journals and other OSS organizations hold collective lessons to offer journals in the way of mobilizing expertise to review data and code before archiving. These organizations employ processes similar to peer review to ensure that software—or standalone, reusable packages of code for various types of scientific computing—adhere to best-practices and community standards. Software is distinct from the code that might accompany a manuscript in that it typically can be used for more general purposes (e.g. data retrieval, database access, or text analysis) (see rOpenSci Onboarding Policies: https://ropensci.github.io/dev_guide/policies.html) than analysis on a single dataset for a single study (Barnes, 2010). Software review employed by OSS organizations and specialized journals such as
Just as previous authors have suggested that government agencies look to open source culture for lessons on how to develop their own open data initiatives (see Baack, 2015; Schrock and Shaffer, 2017), journals might also translate open source processes into peer review. Despite the distinction between code and software described above, journals might look to OSS organizations for guidance and partnership in reviewing code just as they might work with research data specialists on PDA policies. For example, the
The formal option is a partnership between journals and OSS organizations, a model that might be applied to both data and code review. These partnerships are already emerging at the intersection of ecology and evolution research and software. For example, rOpenSci’s partnership with
Conclusion
In summary, journals in ecology and evolution have made strides in ensuring the availability and sustainability of datasets by instituting policies that go beyond “available upon request.” However, challenges remain in developing mechanisms for ensuring that the archived data is complete and useful for the outcomes archiving policies wish to achieve. Trusting authors to archive complete and high-quality datasets is problematic in that authors may or may not comply for various reasons. For instance, some may fear “scooping” of their projects or feel that the time cost of cleaning datasets for sharing outweighs potential benefits.
Additionally, if journals continue to rely on hope in reviewers and trust in authors for quality data submissions, we might expect change to occur slowly. A few ways to accelerate the process include learning from data curation initiatives and software review mechanisms used by OSS organizations, which are beginning to solidify in the digital curation community, software journals and, to a lesser extent, in disciplinary journals in ecology and evolution.
Journals will constitute a valuable organization for studying the governance of data and software sharing and comparing between them, across disciplines, and in seemingly infinite other configurations because they are ubiquitous in the process of publishing outputs of scientific research. Studying the experiences of journal editors, reviewers, and authors during the rollout of code sharing policies can help OSS organizations and others reflect on management principles/policies and strategies for organizing.
Footnotes
Acknowledgments
We would like to thank the interview participants who graciously volunteered their time to be interviewed for our study. We also thank Daniella Lowenberg for critical feedback on drafts of the manuscript and the reviewers for their work in helping us to improve the paper. Research reported in this publication was approved by the University of California, Berkeley Office for the Protection of Human Subjects, Protocol ID 2017-08-10194.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in whole by The Leona M. and Harry B. Helmsley Charitable Trust, award number 2016PG-BRI004.
