Abstract
As more researchers make their data sets openly available, the potential of secondary data analysis to address new questions increases. However, the distinction between primary and secondary data analysis is unnecessarily confounded with the distinction between confirmatory and exploratory research. We propose a framework, akin to library-book checkout records, for logging access to data sets in order to support confirmatory analysis when appropriate. This system would support a standard form of preregistration for secondary data analysis, allowing authors to demonstrate that their plans were registered prior to data access. We discuss the critical elements of such a system, its strengths and limitations, and potential extensions.
Many scientific practices are based on norms rather than strict rules from institutions, professional societies, or funders. These norms evolve over time. For example, methods sections of psychology articles have changed greatly since the 1950s, thanks to advances in graphing methods, the reduced role of page limits in paper journals, and changes in the scientific community’s beliefs about what information is appropriate or necessary to share. In the past 5 years, norms regarding how to share data and enable future replication efforts have been changing rapidly, to promote both increased transparency in processes and an increased focus on ensuring—and testing—the robustness of individual claims (Nosek & Lindsay, 2018). These changes are supported by changing technologies and practices, from general tools, such as shareable online documents, to purpose-built tools, including platforms such as the Open Science Framework (https://osf.io), which lower the workload for individual scientists looking to implement these practices.
Preregistration of experimental studies is a newly developing norm in psychology that allows researchers to clearly mark the distinction between decisions they made before and after viewing their data, and correspondingly to distinguish which analyses are confirmatory and which are exploratory. Although the majority of experimental articles in psychology journals are written as though all the tested hypotheses were developed and all the analyses were planned before data collection (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012), it is now known that this does not accurately reflect common research practices (Eason, Hamlin, & Sommerville, 2017; John, Loewenstein, & Prelec, 2012). Given the vast number of methodological and analytic decisions available (Wicherts et al., 2016), the results of an analysis designed on the basis of initial viewing of the data, even if the researcher has not actually tried multiple approaches (Gelman & Loken, 2013) or has used Bayesian methods (Dienes, 2016), constitute weaker evidence regarding a claim than would the same results of the same analysis had it been prespecified.
Currently, however, the distinction between exploratory and confirmatory research is confounded with the distinction between the use of new and existing data sets. It is generally more complicated, and in some cases prohibited, to preregister analyses of existing data using standard templates (e.g., AsPredicted, https://aspredicted.org, does not allow preregistration if data have already been collected). There are important differences between existing and new data sets, of course. For instance, even basic communication about a data set almost always reveals information about the size of the main effects in question. But preregistering analyses only when one is preparing to collect new data limits the strength of the conclusions that can be drawn by reconsidering original analyses or addressing new questions with existing data sets.
Consider the case of Meltzoff et al. (2017), who set out to explain why Oostenbroek et al. (2016) failed to replicate an earlier finding that human neonates imitate facial expressions (Meltzoff & Moore, 1977). Meltzoff et al. had access to the knowledge that this was a failed replication (i.e., the outcome was known), but much of their discussion proceeded from new information they obtained by accessing the original data set in detail. Their rebuttal outlined 11 ways in which Oostenbroek et al.’s design reduced the probability of detecting an effect, and their own reanalysis showed that infants consistently performed significantly above chance on one of the behaviors tested (tongue protrusion). Yet despite plausible concerns about differences in implementation of the experiments, a carefully presented analysis, and cooperation between the two groups in sharing data, the rebuttal is necessarily weakened by the knowledge that a complex data set like this one can, with sufficient effort, generally be interpreted to support either preferred outcome. In this case, neither contribution achieved its full potential to finally resolve a debate about neonatal imitation: Oostenbroek et al.’s data got “one shot” to be used in confirmatory research, and Meltzoff et al.’s secondary analysis must be treated as exploratory even if it could have been specified prior to accessing the data.
This is a particularly unfortunate outcome for data on neonatal imitation, which are exceptionally difficult to collect (researchers must often camp out in maternity wards, hoping to catch a rare calm moment when a newborn is awake but neither crying nor eating). Similarly, very large data sets that have the potential to answer many questions from many researchers are often beyond the abilities of any one lab to collect. Consider, for instance, the Health and Retirement Study (Juster & Suzman, 1995), cited in more than 4,000 publications (University of Michigan, 2018), or the National Institutes of Health’s new “All of Us” initiative (Collins & Varmus, 2015), which is aimed at collecting detailed longitudinal health data from more than 1 million Americans. Although these data sets are still exceptionally valuable for exploratory work, in principle a researcher who has not yet seen the data should also have the option to use them for confirmatory work.
The potential for data to be reused to answer new questions is indeed one of the chief virtues of openly sharing data. Although the use of existing data sets has a long history in many disciplines, the tradition in our own field of cognitive development has been independent collection of small data sets. Even in this field, the value of secondary data-set analysis is becoming increasingly clear (Davis-Kean, Jager, & Maslowsky, 2015). Psychologists are already beginning to consider how to bring the same rigor and planning that preregistration provides to secondary data analysis, and norms for preregistration of secondary data analysis are emerging (Weston, Ritchie, Rohrer, & Przybylski, 2018). When researchers are collecting experimental data, the distinction between decisions made before and after viewing the data is built in: The researchers can make decisions before the data even exist. But when data already exist, it is both harder to see the line and harder to convey to other people what steps have been taken to limit analytic flexibility (see also van ’t Veer & Giner-Sorolla, 2016; Watt & Kennedy, 2017), and this has led to doubt about the value of preregistration for secondary analyses. Two kinds of solutions have been proposed: being as transparent as possible about what knowledge of the data was used to plan analysis and pledging to follow a set of “best practices” in reusing a data set responsibly (DeHaven & Mellor, 2018; Heng, Wagner, Barnes, & Guarana, 2018; Nosek, Ebersole, DeHaven, & Mellor, 2018).
To complement and support these practices, we propose a small change in how psychological researchers, as a scientific community, openly share data for secondary analysis: maintaining a central electronic sign-out sheet for data sets, so that researchers can clearly and verifiably indicate at what point in the research process they accessed the data. We refer to the proposed practice as checking out a data set, because it is analogous to checking out a book at a library. Most libraries allow anyone to check out books, but do not allow people to walk directly out the door with them; library users have to check books out first, and the library retains a record of who has checked out which books. Critically, data checkout does not change who is authorized to access a data set; it only introduces friction into the access process. In addition to supporting planned confirmatory analysis, data checkout would encourage all researchers to be more aware of the transition from pre- to post-data-access decision making, allowing for more deliberate and open exploratory or confirmatory work. We anticipate that the proposed functionality would complement existing tools, promote good practices, and enhance the value of shared data.
Implementation
In the following discussion, we refer to a researcher who has collected and published a data set as the data-set creator and a researcher who conducts secondary analysis of the data set as the data-set user (or user).
Data checkout will be most seamlessly integrated into the research experience if it can be supported by the central resources already used for sharing data, such as cross-institutional repositories like the Inter-university Consortium for Political and Social Research (ICPSR), journals’ supplementary data repositories, and the Open Science Framework. In many cases, almost all of the necessary functionality has already been implemented. For instance, repositories that handle protected data sets must already track data-access agreements, many journals track file downloads for purposes of measuring impact, and the Open Science Framework allows researchers to specify different privacy levels for different project components. Data-set creators using a data-checkout protocol would still upload their data to a repository, but they would specify that some files require data checkout to access, a requirement analogous to creating a new privacy level between “completely public” and “limited to the specific people listed.” The only change data-set users would experience is that downloading data would require a log-in: For instance, clicking on a download link or opening files containing raw data might redirect a user to a checkout page with a brief explanation of the reasons for tracking data access. However, upon log-in, data access would still be immediate. A user who wished to conduct confirmatory research on the existing data set could therefore preregister the planned analyses before checking out the data and simply include a statement such as the following in the preregistration:

The following is a list of key researchers on the project. As of this preregistration, we certify that the following people have not accessed the data set [Title] at [link]: [names of key researchers and links to their (empty) checkout histories for that data set]

At the point of this preregistration, we have the following additional information about the data set we plan to access: [journal articles or other summary documents that the researchers have read or are aware of, impressions from discussions with the data-set creators, etc.]
The second paragraph of this suggested language acknowledges that there will always remain a basic difference between analysis of existing and new data sets: Even if researchers have not accessed an existing data set, they almost always have some knowledge about the data in it, for instance, because they have read a journal article (or even its title, in many cases) reporting the experiment. This knowledge may affect the secondary analyses planned, as well as, less visibly, the choice of this data set. The extent to which this knowledge reduces the statistical validity of secondary analysis depends on the relationship between the new questions being asked and the information already known: Taken to the logical extreme, of course, “secondary” analyses that merely repeat the reported primary analyses provide no new evidence for the hypotheses in question. In general, however, if a user’s existing knowledge of the data set only weakly constrains the predictions to be tested in the analysis, the conclusions that can be drawn from secondary analyses specified before data checkout are stronger than those that could be drawn from the same analyses following full exploration of the data—although not as strong as if the analyses had been specified prior to collection of a new data set.
After depositing their preregistration on a server like AsPredicted or the Open Science Framework, researchers conducting confirmatory research on an existing data set would proceed to check out the data and conduct their analyses. We believe data checkout could be implemented entirely within existing data-sharing services, although additional players would shape the system’s priorities and encourage its adoption. For instance, journals would need to decide whether to encourage data checkout, whether to consider data shared with a checkout system to be “open,” and how to provide data to reviewers without compromising reviewer anonymity.
At a minimum, a data-checkout system would have four technical elements: a system for identifying users, a central record of downloads, publicly accessible summary statistics for protected data sets, and a simple license that protects data while allowing central access.
An authentication system that identifies unique users
The key claim a checkout system needs to verify is the following: “None of the downloads of this data set prior to [date] were by [researcher’s user ID].” The more confident one is that each researcher has at most one user ID—essentially, the further from anonymous the log-in system is—the more compelling this statement is. However, the need for confidence that user IDs refer to unique users must be balanced against the privacy intrusions needed to prevent creation of fraudulent IDs. This challenge is far from unique, and we expect that data-checkout implementations will leverage existing tools to strike an acceptable balance. Examples of potential starting points for authentication include institutional ORCID authentication (Haak, Fenner, Paglione, Pentz, & Ratner, 2012) and ICPSR’s Researcher Passport (Levenstein, Tyler, & Davidson Bleckman, 2018), intended for situations in which researchers require particular credentials for data access.
A central record of who downloaded what when
This is the core of the data-checkout proposal: Whenever someone accesses the data, that access should be recorded. This is what allows a researcher to back up a claim of confirmatory research with a record that he or she did not access the data set prior to designing the analysis.
Access to the full data set must require authentication, so that the system providing access can record a time stamp and user each time the data set is downloaded. If the data-set user chooses to make his or her checkout records for this data set available, the system must then be able to, at a minimum, verify the earliest time at which he or she accessed the data set. Although the simplest implementation would be a public record of all downloads, verification of a particular data user’s history does not necessarily entail publishing the identities of other researchers who have accessed the data.
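To make this requirement concrete, the following minimal sketch (in Python) shows the record such a system would need to keep and the single question a preregistration asks it to answer. The class and function names are our own invention for illustration, not the interface of any existing repository.

```python
# A minimal sketch of a checkout record; all names are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class CheckoutEvent:
    user_id: str         # authenticated identifier, ideally unique per researcher
    dataset_id: str      # persistent identifier for the data set (e.g., a DOI)
    timestamp: datetime  # assigned by the server, not supplied by the user


@dataclass
class CheckoutLog:
    events: List[CheckoutEvent] = field(default_factory=list)

    def record_download(self, user_id: str, dataset_id: str) -> CheckoutEvent:
        """Called whenever an authenticated download succeeds."""
        event = CheckoutEvent(user_id, dataset_id, datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def first_access(self, user_id: str, dataset_id: str) -> Optional[datetime]:
        """Earliest recorded access by this user, or None if never accessed."""
        times = [e.timestamp for e in self.events
                 if e.user_id == user_id and e.dataset_id == dataset_id]
        return min(times) if times else None

    def accessed_before(self, user_id: str, dataset_id: str, cutoff: datetime) -> bool:
        """The claim a preregistration asks the repository to verify."""
        first = self.first_access(user_id, dataset_id)
        return first is not None and first < cutoff
```

Verifying the statement in a preregistration then amounts to confirming that accessed_before() returns False for each named researcher at the time stamp of the preregistration; everything else about the proposal is policy rather than engineering.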
Potential reuse of a data set would be optimized by allowing researchers to specify a subset of the data to download—for instance, data for a particular experiment or dependent measure—to allow for cases in which the answer to one question informs the design of the next set of analyses. For ease of implementation, this feature could initially be approximated by allowing researchers sharing a data set to split it into pieces that could be downloaded separately.
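As a rough illustration of this simpler approximation (assuming a tabular data set and the pandas library; the file and column names here are invented), a data-set creator could split one combined file into separately downloadable components:

```python
# Split one combined file into separately checkable-out pieces, one per experiment.
import pandas as pd

full = pd.read_csv("all_experiments.csv")          # hypothetical combined data set
for experiment, subset in full.groupby("experiment"):
    subset.to_csv(f"experiment_{experiment}.csv", index=False)
```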
Publicly accessible summary statistics for protected data sets
In order to understand the potential value of a data set and make informed decisions about analytic techniques, researchers must be able to freely access a summary of the available data. That is, prior to accessing the full data set, they should be able to access a codebook (“data dictionary”) of variables’ definitions, relevant summary information such as the population and number of subjects, and so on. At a minimum, this simply corresponds to the information that would be included in a Method section. When possible, a jittered data set (one that has the same structure as the actual data set, but randomly scrambled values) may be provided to allow researchers to design, test, and register the exact code they wish to use for analysis.
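One way to produce such a jittered file (a sketch only, assuming a tabular data set readable with pandas; a real release would also need to respect any privacy constraints) is to permute each column independently, which preserves variable names, types, and marginal distributions while removing the relationships between variables:

```python
# Produce a "jittered" release: same columns and row count, with values scrambled
# independently within each column so that cross-variable relationships are removed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
real = pd.read_csv("raw_data.csv")   # the protected file; never shared directly
jittered = pd.DataFrame({col: rng.permutation(real[col].to_numpy())
                         for col in real.columns})
jittered.to_csv("jittered_data.csv", index=False)
```

A user could then write and debug analysis code against the jittered file and register that exact code before checking out the real data.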
A simple license for protected data that encourages central access
To maintain a complete record of individuals who have accessed a data set, the system must be the sole means of public access to the data or must coordinate with any other systems providing access to create shared records (e.g., a journal and an open repository could share a record for a particular resource stored in both locations). In downloading the data set, researchers should by default accept a simple license that discourages publishing the raw data elsewhere. This is, fortunately, already a widely held social norm for secondary analyses: Retaining the connection to the original data set is important for properly crediting the original authors.
Summary
Logging data checkout is a small but potentially powerful step that would neatly complement the emerging network of tools for sharing data and for linking data, projects, and researchers. We expect that it will be especially valuable for data sets that are difficult to obtain (e.g., data from developmental research, research involving special populations, and longitudinal studies) and for rich data sets that may shed light on multiple questions. When a researcher wishes to challenge the conclusions of a study, or when the data are so precious that collecting new data would be impossible or very costly, being able to establish evidence that analyses were planned before viewing the data (with a preregistration and a time stamp showing subsequent access to the data set) allows stronger conclusions, sharper critiques, and better use of existing data sets to make scientific progress.
Data Checkout in Practice: Questions and Challenges
Technology alone cannot ensure responsible reuse of data, nor will checkout be appropriate in all cases. As is the case with most new scientific practices, data checkout solves some problems in some situations. Although it is not a panacea, we believe that data checkout can add considerable value to the strength of the inferences researchers make when reanalyzing data sets, at relatively low time cost. Understanding the strengths and weaknesses of this tool is critical for its success.
Won’t data checkout discourage exploratory work?
Data checkout is a tool to complement preregistering planned analyses; it does not directly affect the exploratory workflow. However, many researchers (e.g., Goldin-Meadow, 2016; Scott, 2013) have concerns that moving norms in psychology toward explicitly confirmatory work will hamper creativity and prevent the serendipitous discovery of patterns that can launch new questions and theories (Washburn et al., 2018). We anticipate two primary concerns about data checkout as regards exploratory research:
First, researchers may be concerned that data checkout could serve to “out” exploratory work as such. However, we believe that, over time, better recognition of the distinction between exploratory and confirmatory research will in fact strengthen the position of exploratory research and help to avoid an inappropriate dominance of confirmatory research. Exploratory research, which allows researchers to fully explore and identify potentially unexpected trends in a data set, plays a vital and unique role in the scientific process (Whewell, 1858), but it is currently hobbled by a strong expectation that nearly all studies will be presented as confirmatory (whether they concern new or reused data sets). This expectation hurts exploratory work in two ways. First, it holds exploratory research to inappropriate standards, painting it as subpar or misleading confirmatory work instead of as a different endeavor entirely. Second, it withholds appropriate credit and status by holding up exciting exploratory findings as “role models” for what confirmatory work should look like (when in fact it is unrealistic for confirmatory research to regularly uncover surprising new phenomena). Although appropriately valuing exploratory work will require change well beyond data checkout—for instance, in publication options and career incentives—we believe that accurately distinguishing between exploratory and confirmatory research is a step in the right direction.
A second concern is that a checkout log might make researchers delay accessing a data set for exploratory work for fear of missing out on the opportunity to do confirmatory work with that data set. Indeed, encouraging thoughtfulness about accessing data is intended to encourage researchers to step back and consider their options, much as your computer’s operating system provides an extra warning before irreversible actions. We believe that adding this small amount of friction will only support researchers in acting more consistently with their genuine preferences. If they choose not to access a data set immediately upon being reminded that this would mean their work is exploratory—but would otherwise have forged ahead—one can reasonably assume that their intent was in fact to do confirmatory research or that they were undecided, and therefore that the reminder was in their best interest. As researchers get better at distinguishing between exploratory and confirmatory work and deciding which approach is most appropriate for a particular question, and as the contributions of exploratory work are better recognized, researchers will be more comfortable deciding when to access a data set and forgo the opportunity to do truly confirmatory work.
Does data checkout mean that all secondary data analysis will need to be preregistered?
No. As is true of other proposed reforms, data checkout is simply the addition of a tool. It will serve to increase the weight of a secondary-data-analysis preregistration if the researcher wishes to create one. Users will still be able to conduct exploratory research with secondary data sets, and they will also be able to conduct research that they know to be confirmatory, simply by refraining from exploring the data before conducting the critical tests.
Can’t researchers establish when they accessed a data set by citing the repository’s data-use agreement?
For researchers used to working with large, often protected, data sets (e.g., in public health and education) that are stored centrally (e.g., ICPSR or the National Institutes of Health’s repositories), registration of a kind of data checkout already exists in the form of the documentation required for a researcher at a new institution to access the data set. In these cases, data checkout is simply a new way to refer to one use of this existing infrastructure: These repositories can verify when a particular researcher gained access to a data set, and researchers can include this information in a preregistration. Implementing data checkout would extend this benefit of protected data sets to cases in which data-set creators wish to simply track (but not limit or prevent) access to data sets.
What can’t data checkout fix?
Data checkout would serve a fairly specific function as a way to enable confirmatory testing of hypotheses using existing data sets. However, it would not eliminate the necessity of a basic degree of trust. For instance, data checkout cannot prevent outright cheating: Just as people can falsify a data set, they can give false information on a preregistration (e.g., data can be altered to make it appear that they were collected after, not before, the preregistration was submitted), and they could falsely log their data checkout after getting the data from a colleague or creating an alternate log-in account. Indeed, science depends on being able to trust that colleagues are not deliberately lying, and data checkout will not change this. What it will do is eliminate the need to rely only on trust (and users’ own memory or documentation); information on when a user accessed a data set in the course of a research project can instead become part of the public record (a point to which we return later in our discussion of trust).
Even with a data-checkout system, many scientific practices will remain just as necessary as they are today, and new tools will be required to solve other aspects of the challenges of robustness and replicability. For instance, data checkout cannot eliminate the problem of selective attention to controversial or surprising data sets. Just as publication bias can keep “boring” null results from being published, researchers’ own biases mean that they are more likely to turn their powerful microscopes toward data sets they find surprising or questionable; they are unlikely to pour their efforts into minutely checking the analytic soundness of a data set that produces a more prosaic result.
Finally, data checkout will not prevent another scientist from using a data set to reach a conclusion that its creators dislike—nor should it. Checkout is meant as a tool to make users more mindful about making the leap to access a data set, not as a tool to allow gatekeeping by data-set creators. However, data checkout could be incorporated as an extra step beyond existing restrictions on who may access a data set (for instance, requirements for certification by an institutional review board).
In which situations would data checkout be inappropriate?
We have described the positive function of data checkout, but there may be cases in which it would be unhelpful or even actively harmful. First, if a data set is already very well known in a field—widely analyzed, available, and circulated among many scientists (e.g., some corpora of children’s and child-directed speech, such as Brown, 1973)—data checkout is unlikely to be helpful and could create a false sense of rigor.
Second, if there is some reason that scientists (or citizens more generally) could be targeted for having looked at a particular data set, leaving access to it unlogged may be appropriate. This includes cases in which accessing a data set could be construed as revealing sensitive information about the user’s beliefs or group membership, or in which a checkout record could compromise the user’s relationship with an employer or eligibility for a position. We expect that such cases, although limited, can be addressed using principles similar to those used when deciding whether to share human-subjects data.
Researchers might also choose not to employ checkout if their data set has special potential for scientific outreach and many of its users are likely to be casual browsers who look at it for only a few minutes rather than with a focused research interest. One example is Wordbank (Frank, Braginsky, Yurovsky, & Marchman, 2017), a multilanguage data set of vocabulary development presented via an interactive Web site. These data are of particular interest to parents and other members of the public, not only to researchers. In this case, requiring log-in and data checkout would likely mean that far fewer casually interested parents would browse the data and learn about cognitive development (perhaps checking what to expect about their child’s upcoming vocabulary development, or whether their child’s first words are typical). Writing from an academic perspective, we have likely not anticipated all the reasons some potential users might find data checkout an unacceptable barrier, but such reasons would need to be considered when weighing the needs of nonacademic users against those of academic researchers.
In cases such as the examples we have mentioned, data-set creators may choose to share data publicly, as is currently standard for open data, without a log-in or checkout barrier. Scientists wishing to use such a resource for confirmatory research would simply need to proceed as they currently do with preregistrations of work with existing data: by being explicit about what was accessed when and what they already know about the data set when they planned their study.
Won’t data checkout mean that fewer people access my data? I want credit for my open data sets!
Data checkout is indeed designed to interpose friction in data access, to encourage data-set users to distinguish between confirmatory and exploratory work. On the other hand, a checkout record both enhances the value of data sets by giving them more power to answer additional questions and lets data-set creators track the impact of their data sets in a way that can be useful for their own scientific careers.
How can I specify my (confirmatory) analyses in enough detail if I can’t access the data first?
Many scientists (including ourselves) are used to “getting their hands dirty” with a data set as quickly as possible—learning about the data set while cleaning or summarizing it, making initial graphs, and working through understanding the variables. This approach is both eminently practical and a danger: Regardless of the researcher’s intent, it provides impressionistic insights into the data that muddy the waters of any confirmatory tests. Researchers who share a data set can make life easier for other researchers and allow analyses to be planned without first looking at the data by providing clear summary information about how the data were collected and what variables the data set contains, as well as, ideally, a full data dictionary (Broman & Woo, 2018). Publishing any code already used to analyze the data, in addition to supporting transparency and computational replicability, also serves as a helpful concrete example to other users. In some cases, data-set creators may choose to provide a jittered version of a data set that preserves its structure while scrambling the content; this can allow users to make more specific predictions and even implement code for analysis during preregistration. As is the case with many open-science tools and practices, the steps taken to make science more usable and robust tend to be applicable outside their originally envisioned context: A data dictionary is helpful for any reuse of a data set, whether or not the researcher is additionally able to access the raw data when planning analyses.
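For instance, a minimal machine-readable data dictionary might look like the following sketch, written here in Python and saved as JSON; every variable name and coding below is invented for illustration and corresponds to no actual data set.

```python
# A toy codebook for a hypothetical infant-imitation data set; all field names
# and codings are invented for illustration.
import json

codebook = {
    "subject_id": {"type": "string", "description": "Anonymized participant code"},
    "age_hours": {"type": "integer", "description": "Age at test, in hours"},
    "gesture": {"type": "categorical",
                "levels": ["tongue_protrusion", "mouth_opening"],
                "description": "Gesture modeled by the experimenter"},
    "matching_responses": {"type": "integer",
                           "description": "Count of matching gestures in the response window"},
}

with open("codebook.json", "w") as f:
    json.dump(codebook, f, indent=2)
```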
What if many people have already looked at a data set? Won’t data sets get overused?
Over time, data sets become more fully characterized, and their use becomes more subject to existing biases as researchers become aware of more conclusions the data have been used to support. Data checkout will not solve this problem, but will make it easier to identify: Scientists will at least be able to distinguish between a data set that has been checked out 10,000 times and one that has been checked out four times. This is particularly important for scientists from neighboring fields, who may not otherwise be aware of how heavily a particular data set has been used.
Whether a data set is “overstudied” is not a black-or-white issue: The amount of information available about data sets will vary, and data-set users will also vary in how much of that information they have accessed. For the accurate interpretation of research findings using data checkout, it will be important that researchers document what information about a data set they had already encountered (in the form of conversations, journal articles, media coverage, etc.) before accessing it, even if that information is vague or they believe it unlikely to have influenced their analytic decisions. We acknowledge that appropriate interpretation is a challenge of its own. Simply knowing that a data-set user knew the condition means (i.e., the average outcome in each condition) beforehand, for instance, does not indicate exactly how much less weight should be given to the user’s successful prediction of a new moderator. This challenge is not entirely unique to secondary analysis; consider, for instance, the case of a primary research team that delays preregistration until after data collection, but before analysis. Even if extensive preprocessing is required to get any useful information out of the data set, the team may still know at least qualitative information (e.g., whether subjects appeared to find a task difficult or easy). When a secondary analysis is preregistered, as when a primary analysis is preregistered after data collection, documentation of what was already known when the analysis was planned is not a full solution, but it is clearly a necessary step.
It is already possible to make an analysis plan before viewing the data, so why bother tracking access? Can’t researchers trust each other?
Adopting preregistration at all amounts to acknowledging that evidence of having planned analyses before collecting data provides additional support for a claim, just as open data and materials do. A standard method to document when an existing data set was first accessed by a given researcher, coupled with preregistration before that date, could serve the same function. In many cases, researchers are already making principled and transparent decisions about when to access data, but do not receive the credit they should because there is no standard way to convey this rigor. Indeed, complete transparency (even about more careful planning than is typical) currently may place a researcher at a disadvantage with respect to publishing research findings; open documentation of data access could level the playing field.
What should I do if my favorite repository (or supplemental journal archive) doesn’t support data checkout?
Please lobby the platforms you use to support data checkout. An honor-code system can be implemented by a data-set creator as a first step with very low engineering cost, for instance, with a note asking anyone accessing the data to fill out a Web form. When data are not openly shared, the process used to share data (e.g., personally e-mailing the authors) can serve as proof of the first access date. But in many cases, data access is already being logged to track impact, and a full-fledged implementation of data checkout would be straightforward. Journals are unlikely to converge on identical systems, but interoperability is an important goal, and developing norms and expectations is the first step toward changing practices.
Beyond Basic Data Checkout
Early versions of data checkout would ideally allow both data-set creators and data-set users substantial choice in the degree to which they participate in data checkout, giving early adopters the potential benefits of checkout without imposing sudden and heavy-handed restrictions. Providing such choices may depend on details of implementation beyond the four minimal technical requirements we discussed earlier. For instance, although there must be a way for individual researchers to verifiably reveal their own history of access to a particular data set, this does not mean that it must be possible to request a list of every researcher who has accessed the data set, or to request a list of every data set a researcher has accessed. Indeed, it may be preferable to protect researchers’ privacy as much as possible, beyond what they voluntarily disclose, to minimize the potential burdens of data checkout. Even a mature data-checkout system could function with voluntary participation in release of checkout records; all users could be required to log in to access a data set but would be able to specify whether they want their own record of access to be available upon request.
Once established, a basic system for data checkout would also form a foundation for a variety of additional functions that could benefit researchers. For example, the system could include a mechanism to implement existing suggestions for reducing overfitting on rich data sets, such as separating training and test subsets of the data or linking data access to the approval of a Stage 1 Registered Report (Arslan, 2017). In addition, extensions to the data-checkout model could do the following:
Formally link data-checkout records, preregistrations, and results to make research more replicable and cumulative, to make it easier to see what results were published at the time of a particular preregistration, and to give due credit to data-set creators.
Allow a data-set user to check out an arbitrary subset of the data (and record that access), for instance, to address a question that pertains only to a particular task or to document a cross-validation approach (in which exploratory work on one subset of the data is tested on another subset).
Provide an application programming interface (API) that allows users to submit code to be run on a data set and receive the results, without receiving the data itself, and with central logging of all analyses conducted (an approach related to differential privacy; cf. Arslan, 2017; see the sketch following this list). This approach could provide even stronger evidence that data-set users did not conduct additional unreported analyses, make it easier for researchers to tell what questions have already been addressed, and also protect subjects’ privacy when necessary.
Collect information about how people are using a data set, thereby allowing data-set creators to more easily track the scientific impact of their work.
Implement data checkout for currently closed data sets, which can be accessed only by e-mailing the authors (and hoping for a response): If such requests could be made and fulfilled via a journal’s Web site, for instance, then that communication would serve as documentation of first access, even without the data being hosted centrally. (As a side benefit, a record of responses could encourage authors to honor statements that data are available “upon request.”)
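The analysis-submission idea in the list above is the most technically demanding of these extensions. The sketch below illustrates a deliberately restricted version: a named-query service rather than arbitrary submitted code, which would additionally require sandboxing. The endpoint, query names, and file names are all hypothetical, and the Flask library is used only for concreteness.

```python
# A toy analysis-submission service: users post a named query, the server runs it
# on the protected data, logs the request, and returns only the aggregate result.
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
DATA = pd.read_csv("protected_data.csv")   # hypothetical file; never leaves the server
QUERY_LOG = []                             # central record of every analysis run

ALLOWED_QUERIES = {
    # Group means of one outcome variable, keyed by the levels of one grouping variable.
    "group_means": lambda df, p: {str(k): float(v) for k, v in
                                  df.groupby(p["group"])[p["outcome"]].mean().items()},
    # Sample size per group.
    "n_per_group": lambda df, p: {str(k): int(v) for k, v in
                                  df.groupby(p["group"]).size().items()},
}

@app.route("/analyze", methods=["POST"])
def analyze():
    req = request.get_json()
    query, params, user = req["query"], req.get("params", {}), req["user_id"]
    if query not in ALLOWED_QUERIES:
        return jsonify({"error": "unknown query"}), 400
    QUERY_LOG.append({"user": user, "query": query, "params": params})
    return jsonify({"result": ALLOWED_QUERIES[query](DATA, params)})

if __name__ == "__main__":
    app.run()
```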
Data Checkout and the Replication Crisis
We believe that data checkout has a particular role to play when two labs have generated conflicting results regarding a shared question, as in the case of the neonate imitation studies we mentioned in the introduction. There are two broad classes of explanations in such cases. First, the difference in results can be due simply to the nature of statistical sampling: Even though the two groups conducted equivalent experiments, their effect-size estimates (or the uncertainties around these estimates) differed. Second, the difference in results can be due to a moderator: Although the groups may have intended to conduct the same experiment, a procedural difference may have changed the true, underlying effect size being measured.
Subsequent to Oostenbroek et al.’s (2016) and Meltzoff et al.’s (2017) articles mentioned in the introduction, these research groups published additional commentaries that continued to debate this precise issue (Meltzoff et al., 2018; Oostenbroek et al., 2018), and they continued to face the problem that they were making arguments with considerable knowledge of the data set in question. Meltzoff et al. contended that the procedural differences they pointed to were not post hoc objections, but reflected known moderators discussed in previous reports on neonate imitation. Oostenbroek et al. presented additional analyses of their data set, attempting to provide evidence against the proposed moderators (e.g., a concern about subjects’ fatigue was countered with a new analysis showing that effects did not decrease from earlier to later trials in Oostenbroek et al.’s experiments). Oostenbroek et al. also pointed out that some of the proposed moderating factors were not clearly expressed in the previous literature (e.g., prior articles were split as to whether effect sizes should be larger for imitation of familiar faces or for imitation of unfamiliar faces).
Both sets of authors could have benefited from preregistration and data checkout. Meltzoff et al. (2018) could have created a preregistration including an updated statement of expected moderators and their predicted effects. Similarly, Oostenbroek et al. (2018) could have preregistered their additional analyses and then accessed one of the other existing imitation data sets to test their predictions. In the best case, research groups with competing interpretations could conduct an “adversarial” preregistration before accessing the same data set, or even an “adversarial meta-analysis” to establish whether a proposed moderator does or does not track with effect sizes in the literature. Between-study moderators have presented a significant conceptual challenge for understanding differences between the results of original studies and replications (and a general problem for interpreting meta-analyses in psychology—see Stanley, Carter, & Doucouliagos, 2018); we hope that with an established system of data checkout, hypotheses regarding potential moderators can be clearly stated, and when appropriate, confirmatory analyses can be used to resolve at least some of the uncertainty around replication in psychological studies.
Conclusion
The challenge of preregistering secondary analyses arises from a welcome trend: scientists choosing to work openly and make their data available to the scientific community. The use of closed data provides a built-in solution to the problem of preregistering analyses based on existing data sets: When researchers have to personally request data or travel to access a data set stored in a particular location (e.g., in fields such as economics and education, which deal with data sets featuring significant personally identifying information), there is significant natural friction that highlights the differences between hypothesizing with and without the data in hand, and there is often specific documentation (such as an e-mail exchange or a signed contract) granting access to the data. Such a solution is not built in when data are open. As the amount of data shared grows, psychological scientists will continue to face challenges in how best to document data sets, make them accessible to the community, and capitalize on their scientific potential. Just as the format and content of journal articles vary across contexts and studies while maintaining a basic uniformity, we expect that how researchers share and access data sets will draw on a set of shared practices that provide clarity and scientific rigor.
The field of psychology is currently evolving toward a new Platonic ideal of research output, stretching beyond the research article to include preregistration for confirmatory work, open materials and data, reproducible analyses, explicit statements on the limits of generalizability, and much more. The norms researchers create and adopt in striving toward this ideal will have the potential to revolutionize the reproducibility and rigor of psychological science.
Acknowledgements
We thank the community of the Society for the Improvement of Psychological Science for countless useful conversations at the society’s 2018 meeting.
Action Editor
Simine Vazire served as action editor for this article.
Author Contributions
The authors generated the idea for the article and drafted the manuscript together. Both authors approved the final submitted version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
