Abstract
Researchers have identified problems with the validity of software engineering research findings. In particular, it is often impossible to reproduce data analyses, owing to a lack of raw data or sufficient summary statistics, or to undefined analysis procedures. The aim of this paper is to raise awareness of the problems caused by unreproducible research in software engineering and to discuss the concept of reproducible research (RR) as a mechanism to address these problems. RR is the idea that the outcome of research is both a paper and its computational environment. We report some recent studies that have cast doubt on the reliability of research outcomes in software engineering. We then discuss the use of RR in software engineering research as a means of addressing these problems and present the methodology we have used to adopt RR principles. We report a small working example of how to create reproducible research. We summarise the advantages of, and problems with, adopting RR methods. We conclude that RR supports good scientific practice and would help to address some of the problems found in empirical software engineering research.
Introduction
This paper reports some recent research results that have cast doubt on the reliability of current empirical software engineering research results. In the context of data mining studies, software engineering researchers have proposed reproducible research (RR) as a means to improve research practice (e.g. [36], and [15]). In this paper we ask the question “Would wider adoption of reproducible research be beneficial for empirical software engineering research involving human-centric experiments?”.
In Section 2, we discuss what we mean by reproducible research (RR), which is one of the methods being proposed to address problems with empirical research in software engineering data mining studies and other disciplines. We discuss the origin and scope of reproducible research, and how it differs from, but supports, the concept of replication in human-centric software engineering studies. We report, in Section 3, problems found with recent empirical software engineering research. We also emphasize that the discussed problems are not unique to the software engineering domain. In Section 4, we examine how RR is currently being adopted in empirical software engineering research. To confirm the viability of RR, we identify, in Section 5, a set of free and open-source tools that we have been able to use in practice to produce reproducible research. In Section 6, we present an intentionally simple example of the RR process to help other researchers to understand how to construct reproducible research. We also highlight, in Section 7, the problems and benefits associated with RR from the viewpoint of software engineering researchers, as well as major initiatives related to RR. Finally, we present conclusions in Section 8. This is primarily a discussion paper. Our main contribution is to discuss the use of RR to address some of the problems observed in software engineering experiments and to confirm the viability of RR with a small practical example of its use.
Reproducible research: Origins and definition
Gandrud [12] attributes the term
RR refers to the idea that the ultimate product of research is the paper plus its computational environment. That is, a reproducible research document incorporates the textual body of the paper (including any necessary supplementary materials, e.g., protocols or appendices) plus the data used by the study, and the analysis steps (algorithms) used to process the data, in the context of an open access environment that is used to compile these pieces of information into the resulting document. This triple is called the
When reading the literature on RR, we noticed that some researchers (for example, Gandrud [12]) appear to use the terms reproducibility and replication interchangeably. However, in this paper we make a distinction between the concepts. Replication involves repeating an experiment with different participants or experimental materials to investigate whether previous experimental results are repeatable. However, as Gomez et al. [14] point out, many researchers talk about reproducibility in the context of replication. For example, Carver et al. talk about results being
RR is particularly important in the context of studies of computational algorithms where, as Vandewalle et al. [45] point out, details such as the “exact data set, initialization or termination procedure, and precise parameter values are often omitted” for reasons such as “a lack of space, a lack of self-discipline, or an apparent lack of interest to the readers”. In software engineering, this would apply to the data mining studies discussed by Robles and his colleagues (e.g. [15, 36]), such as comparative studies of algorithms for test automation, comparative studies of cost estimation and of defect prediction, and any studies investigating the performance of evolutionary and machine learning algorithms. In this paper, we discuss whether RR is also relevant to human-intensive experiments.
Problems with empirical software engineering practice
Recent results in empirical software engineering have cast some doubt on the validity of our software engineering research results. For example, Shepperd et al. [39] analyzed the results of 42 papers reporting studies comparing methods for predicting fault-proneness. They found that the explanatory factor that accounted for the largest percentage of the differences among studies (i.e., 30%) was research group. In contrast, prediction method, which was the main topic of research, accounted for only 1.3% of the variation among studies. They commented that “It matters more who does the work than what is done.” and “Until this can be satisfactorily addressed there seems little point in conducting further primary studies”. The papers overlapped in terms of the data sets and the defect prediction modelling methods used in the primary papers. The fact that their results are inconsistent with respect to the impact of the fault prediction methods suggests significant reproducibility failures.
In the area of cost estimation, Kitchenham and Mendes pointed out that reported accuracy statistics for cost estimation studies claiming to use a specific method on a particular data set were inconsistent with the results they obtained using the specific method on the same data set [21]. More recently, Whigham et al. found that claims made in two recent cost estimation studies could not be confirmed by independent analyses [47]. RR emphasizes the need to specify, fully, any statistical analysis, in order to address problems such as these.
In the context of experiments and quasi-experiments, Vegas et al. [46] reviewed 39 papers using crossover designs (which are a form of repeated measures design) and found 58% of the papers did not use an analysis method consistent with the design, which could “compromise the validity of the findings”. Papers that used the invalid analysis are valueless scientifically, unless their raw data is available for re-analysis. RR practices require the publication of the raw data to address this problem.
In another recent study, Jørgensen et al. [19] suggested that the trustworthiness of software engineering experiments needs to be improved. They were particularly concerned about low power, researcher bias, and publication bias. Among their recommendations they include improving the reporting of study design, analysis and results, making data available, emphasising effect sizes and their confidence intervals, and undertaking more replications and meta-studies. The first two issues are directly supported by reproducible research requirements to make data and analysis available. The issues related to effect sizes and meta-analysis are supported by reproducible research, since if data is fully reported, subsequent studies can easily reanalyse the data to calculate effect sizes and perform meta-analysis.
Overall, these studies suggest that both our data intensive studies and human-centric experiments sometimes fail to provide reliable evidence to support technology adoption decisions. Reported issues also suggest that RR should be adopted more widely within the software engineering community.
Problems such as those discussed above are not unique to the software engineering domain. For example, in the context of drug trials, Osherovich reports that “an ‘unspoken rule’ among early stage VCs [Venture Capitalists] is that at least 50% of published studies, even those in top-tier academic journals, can’t be repeated with the same conclusions by an industrial lab” [32]. In addition, Ioannidis and his colleagues reported that only 2 of 18 research papers published in the journal Nature Genetics (one of the highest ranked journals in the world, with an impact factor of about 30) could be fully reproduced [18]. The reasons for this included data sets and home-made software disappearing, or the specification of data processing and analysis being incomplete. Probably the most striking summary of the research crisis in multiple disciplines is given by Ioannidis who (in his seminal paper with 3600+ citations) claims that “Most Research Findings Are False for Most Research Designs and for Most Fields” [17].
Reproducibility in software engineering
Within the context of software engineering experiments, there has been discussion of
Robles and his colleagues have discussed the importance of reproducibility from the viewpoint of studies involving data mining from software repositories. Robles [35] undertook a systematic review of papers published in the former International Workshop on Mining Software Repositories (MSR) (2004–2006) and now Working Conference on MSR (2007–2009). He checked 171 papers for i) the public availability of the data used as a case study, ii) the public availability of the processed dataset used by researchers and iii) the public availability of the tools and scripts. He found that researchers mainly used publicly available data but that the availability of the processed data used in specific studies was low. In the majority of papers, he could not find references to any tools even when authors said they had produced one. He concluded that there was a need for the community to address replicability in a formal way.
Robles and German [36] discuss best practices for supporting reproducibility, in particular providing a snapshot of the data as it was used in the study, versioning data sets, identifying the conditions under which the data can be used, making the data set available in a public repository, making tools available to others, licensing software, and providing the infrastructure to support tools via a forge. They also point out additional benefits of reproducibility to the data mining community such as providing worked examples for software engineering students and improved benchmarking.
González-Barahona and Robles [15] provide a method of assessing the reproducibility of a data mining study that is useful not only to other researchers but also to authors who can judge the main barriers to reproducing their study and reviewers who can assess whether a study will allow for easy reproduction.
In the context of software maintenance research, Dit et al. [10] have constructed a publicly available library of components and experiments aiming to improve the reproducibility and extensibility of software maintenance experiments.
In addition, Bowes et al. [5] have developed the
Tools for reproducible research
We discuss the concept of Literate Programming behind RR and some of the tools that can be used to adopt RR, in Sections 5.1 and 5.2, respectively. Furthermore, we present a small working example of the RR approach in Section 6.
Literate programming
Several researchers ([12, 42]) have linked reproducible research to Knuth’s concept of Literate Programming [24]. They point out that if the output of research is not just a paper, but also the full computational environment, then researchers can use Knuth’s concept of Literate Programming [24] to achieve RR.
Literate Programming treats a program as a piece of literature addressed to human beings rather than to a computer. The key assumptions are: i) a program should have plain language explanations interspersed with source code; ii) the source code, data and plain language explanations are combined together; iii) the results of the program (or code chunks) are automatically included when the document is created, so no exporting or importing is needed; iv) after recompilation, changes are automatically incorporated if the code or data sets change; and v) tools are available to make this simple to achieve.
A reproducible research environment
We used the following freely available tools and formats to support our attempts to adopt reproducible research methods: The The paper was written in Data sets and analytical procedures should be stored in a reliable manner and easily available to reviewers and readers. We decided to develop the All references were stored in the pure
We used
Other tools and formats can be used to achieve the goal of reproducible research, see, for example [25]. In addition, there are specialised tools that support RR in a specific context such as the
The
We used the
An example of the RR process
To assist the uptake of RR this section presents a small working example of the RR approach. We will use a real data set recently analysed by Madeyski and Jureczko [28] and available from the
It is worth mentioning that our example is deliberately simple so it can provide a starting point for novices. Researchers with more experience of RR concepts can view the
The steps needed to produce reproducible research follow, indicating in each case the goal of the specific step: Goal: Set up the basic RR environment. Steps: Install Goal: Set up a convenient integrated development environment for RR. Steps: Launch Goal: Set up the mechanism for linking the data analyses (and their results) into text documents. Steps: Click on the “Console” window in
Goal: Set up the document containing text and analysis procedures that can be executed on data (analytic results, figures and tables can be produced on the fly from data). Steps: Create an The
Goal: Compile the .Rnw to .tex and .pdf. You may use RStudio, see Fig. 1.
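The compile step can be sketched in plain R. The snippet below is a minimal illustration, not the paper's own document: the file name `example.Rnw`, the chunk label `load-data` and the numbers are all hypothetical, and it assumes that the knitr package is installed. It writes a tiny .Rnw file containing one code chunk and one inline `\Sexpr{}` expression, then asks knitr to weave it into a .tex file.

```r
# Minimal sketch: create a tiny .Rnw document and weave it into .tex with knitr.
# All file names and values are hypothetical; knitr is assumed to be installed.
rnw_lines <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<load-data, echo=TRUE>>=",
  "x <- c(2.1, 3.4, 1.9, 4.2)  # made-up values",
  "summary(x)",
  "@",
  "The mean of the sample is \\Sexpr{round(mean(x), 2)}.",
  "\\end{document}"
)

rnw_file <- file.path(tempdir(), "example.Rnw")
writeLines(rnw_lines, rnw_file)

if (requireNamespace("knitr", quietly = TRUE)) {
  # knit() runs the chunk, inserts its output, and returns the path of the .tex file
  tex_file <- knitr::knit(rnw_file,
                          output = file.path(tempdir(), "example.tex"),
                          quiet = TRUE)
  print(tex_file)
  # With a LaTeX installation available, tools::texi2pdf(tex_file) would then
  # turn the .tex file into a PDF.
}
```

If the made-up values in the chunk are changed and the file is knitted again, the sentence produced by `\Sexpr{}` changes with them, which is the property that makes the document reproducible.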
Subsequent steps of data analysis, from the Descriptive analysis of simple and advanced models: summary of descriptive statistics (Output 1 and Output 2) and box plot combined with density curve laid out on histogram (Output 3, Output 4 and Output 5).
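As a rough illustration of a descriptive step of this kind, the sketch below uses a small made-up data frame with columns named `simple` and `advanced`. The values and the output file name are assumptions for demonstration only and are not the data set analysed in the paper. The plots are also simpler than the combined figure described above: a histogram with a density curve, followed by a separate box plot.

```r
# Made-up paired measurements; NOT the data analysed in the paper.
scores <- data.frame(
  simple   = c(0.42, 0.55, 0.32, 0.60, 0.48, 0.39, 0.70, 0.55),
  advanced = c(0.40, 0.50, 0.33, 0.51, 0.45, 0.35, 0.62, 0.49)
)

# Descriptive statistics for both columns.
print(summary(scores))
print(sapply(scores, sd))

# Histogram with a density curve, then a box plot, written to a temporary PDF.
plot_file <- file.path(tempdir(), "descriptive.pdf")
pdf(plot_file)
hist(scores$simple, freq = FALSE,
     main = "simple (made-up data)", xlab = "value")
lines(density(scores$simple))
boxplot(scores, main = "simple vs advanced (made-up data)")
dev.off()
```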
Inferential analysis using the Wilcoxon test: as the data are non-normal, t-tests may not be appropriate, and an alternative is the paired Wilcoxon test, see Output 6.
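A paired Wilcoxon test can be run with base R's `wilcox.test()`. The sketch below uses made-up paired values (the vector names `simple` and `advanced` and all numbers are assumptions, not the paper's data), chosen so that the absolute differences contain no ties and the exact test can be computed.

```r
# Made-up paired measurements; NOT the data analysed in the paper.
simple   <- c(0.42, 0.55, 0.32, 0.60, 0.48, 0.39, 0.70, 0.55)
advanced <- c(0.40, 0.50, 0.33, 0.51, 0.45, 0.35, 0.62, 0.49)

# Paired (signed-rank) Wilcoxon test on the within-pair differences.
result <- wilcox.test(simple, advanced, paired = TRUE)
print(result)
```

The returned object stores the test statistic in `result$statistic` and the p-value in `result$p.value`, so both can be reported inline in the text of a reproducible document rather than copied by hand.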
Since, as mentioned in Section 5.2, changes between package versions can impede reproducibility, we recommend recording the R session info, which makes it easy for future researchers to see what was done in the past and which versions of the R packages were used. How to do this, together with the result, is shown in Output 7.
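In base R the session information is available through `sessionInfo()`. The following minimal sketch prints it and also saves a copy to a text file; the file name `session-info.txt` is a hypothetical choice for illustration.

```r
# Record the R version, platform and attached/loaded package versions.
info <- sessionInfo()
print(info)

# Optionally keep a copy next to the analysis outputs (hypothetical file name).
info_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(print(info)), info_file)
```

Placing a call like this in the final chunk of a document means the recorded versions always correspond to the run that produced the reported results.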
It is also worth mentioning that one can easily view the data set embedded in the
The detailed description of the data set, as in the case of other R packages, is available from CRAN (https://cran.r-project.org/web/packages/reproducer/reproducer.pdf).
Readers are invited to reproduce our small working example of reproducible research. All of our analyses, as well as data, are encapsulated in the
We used the method proposed by González-Barahona and Robles [15] to assess reproducibility of our example, as shown in Table 1. The
Discussion
In this section we discuss some of the pros and cons of reproducible research, as well as major RR initiatives.
Advantages of reproducible research
Reproducible Research does not address all problems of the validity of experimental software engineering studies. It can only ensure that the data and analysis methods are available for inspection and that the results presented in the paper can be derived from the data and analysis procedures. However, if it were adopted, we hope that design and analysis errors, such as those reported by Vegas et al. [46] and Shepperd et al. [39] would be more likely to be uncovered, hopefully, prior to publication during the review process or soon after publication as the full details will be available to all interested readers. Furthermore, it would provide a valuable resource for training novice researchers. However, there are other advantages as well.
There is some research evidence that reproducibility improves the impact of research. For example, Piwowar et al. reported that from a set of 85 studies, the 48% with publicly available data received 85% of the total citations [33]. Data availability significantly increased citation rate (
Vandewalle et al. also described their own experiences of RR. They reported a gain in their own efficiency, because it was easier to pick up their work again, and positive feedback from colleagues and students who downloaded their code. In addition, they state that the availability of their code “allowed and simplified some collaborations and is a source of easily reusable demo material for students and visitors”.
For a research group, RR supports the preservation of group knowledge in terms of long-term conservation of experimental data together with supporting (e.g., statistical) analyses. This means that knowledge of current research is not lost when researchers move on, and it is easier for new members of a group to understand and build on previous research. It can also help researchers leverage their own research. For example, Bowes et al. report developing and using an environment to support complex systematic reviews that supports RR principles [5]. They have used this environment to support a number of different systematic reviews. In addition, funding agencies and journals are increasingly adopting open research policies and RR provides a means of complying with such policies.
Using tools such as
Objections to reproducible research
There are two major objections related to RR. The first objection, discussed in Section 7.2.1, is the potential loss of intellectual property caused by making data freely available. The second one, examined in Section 7.2.2, is an additional effort imposed by RR.
Potential loss of intellectual property
Sharing of data has attracted some strong disagreements among software engineering researchers. At the International Workshop on Empirical Software Engineering Issues held at Dagstuhl Castle, Germany in 2006 [3], participants discussed data sharing and concluded that there were areas of dissent among the software engineering community. Some people wanted licensing, others were opposed, and there was no agreement as to who owns the data. Subsequently, Basili et al. [4] published a proposal for data and artifact sharing agreements in software engineering research. This proposal concerned a framework for software engineering artifact agreement to “foster a market in making available and using such artifacts”. The limitation of the market-based viewpoint is that it fails to address two issues: Ownership may not reside solely with the individual researchers but also with any research funding agency that supported the research. Scientific ethics as well as research agencies advocate open sharing of results including data.
Barr et al. [1] argue that scientific research would advance more quickly if data and tool sharing were more widespread. They note that the main objections to sharing are the time and effort needed to package data and tools in a manner suitable for reuse and the “risk of being scooped”. They discuss various approaches to reduce objections to sharing, including partial sharing, registry, escrow and the market. We find particularly interesting their proposal to give researchers exclusive rights to their own data, but only for a limited period of time.
Stodden proposed an alternative approach called the Reproducible Research Standard (RRS), which aims to realign legal rights to fit scientific norms [41]. RRS is a legal tool for waiving as many rights as legally possible, worldwide. In particular, she suggests that authors: i) release media components (text, figures) under CC BY, that is, the Creative Commons attribution license that does not have a Share Alike provision. This means that licensees may copy, distribute, display and perform the work and make derivative works based on it only if they give the author or licensor the credits in the manner specified by the copyright holders; ii) release code components under a Berkeley Software Distribution (BSD) license that places few restrictions on re-use beyond attribution, creating an Intellectual Property framework resembling conventional scientific norms; iii) release data under the Science Commons Database Protocol, because “raw data aren’t copyrightable”; only “selection and arrangement” of data is copyrightable. Generally, data sets should be made available in recognized repositories for the field, if they exist. Otherwise, researchers may choose a repository for sharing, citing, analyzing, and preserving research data that is open to all scientific data from all disciplines, e.g., Dataverse, OpenAIRE.
However, the approach we have adopted to make RR available is to create an R package [27], which is a free, open access way to share (via CRAN, the official R repository) both data and code, with an accompanying detailed description of every field of the shared data sets and of every function and parameter of the shared code. As an alternative, Gandrud suggests using GitHub to share research, pointing out that projects can be kept private initially and then made public once the research results are published [12].
Additional effort required by reproducible research
The second objection is that, in our experience, RR requires additional effort to write research papers. Some of the tools used to support RR (i.e.,
There are other issues that make adoption of RR difficult for software engineers. In software engineering, it is often the case that data cannot be distributed due to confidentiality issues. However, there are tools that offer data anonymization functionality in the way expected by an industrial partner, e.g., the DePress software measurement and prediction framework [29], developed as an open source project in close collaboration between Wroclaw University of Science and Technology and the Capgemini software development company.
In addition, if data items need to be obtained from a variety of different sources and need to be integrated into a single data set, it may be very difficult to make the process of formatting the raw data into analyzable data completely reproducible.
Reproducible research initiatives
Interest in reproducible research has led to two major scientific initiatives: In psychology, the Open Science Collaboration aims “to increase the alignment between scientific values and scientific practices” through an open, large-scale, collaborative effort to estimate the reproducibility of psychological science [8, 31]. The recent result from this initiative has been published in In medicine, Ioannidis and Goodman have established the Meta-Research Innovation Center at Stanford University (METRICS). This center aims to improve reproducibility by studying “how research is done, how it can be done better, and how to effectively promote and incentivize the use of best scientific practices”. In his recent video lecture, Ioannidis proposed registration of data sets, protocols, analysis plans, and raw data. We expect results from this initiative to provide further discussion of RR concepts.
In addition, in the context of computer science, Elsevier is backing the
These initiatives are likely to encourage funding agencies and journals to police their open science policies more rigorously.
Conclusions
We have identified studies criticising the current software engineering practices. In our view these criticisms are serious enough that we need to consider carefully how we establish the validity of our research outcomes. Following ideas proposed in the data mining community and adopted in other empirical disciplines, we raise the question of whether it would be beneficial for researchers undertaking human-centric studies to adopt RR. In software engineering, particular research groups investigating specific topic areas are using RR principles, but there is little general agreement about whether the ideas should be more widely adopted. Critical issues are the extent to which researchers are willing to share their data and the time and effort needed to make data available for sharing. However, if we continue as we are, we run the risk of publishing more and more invalid and incorrect results with no systematic methods of correcting them.
RR, as discussed in this paper, concerns the extent to which the report of a specific study can be deemed trustworthy. It supports only the minimum level of validity that we should expect of research outcomes. It would, however, address problems currently being found in software engineering research by various leading researchers ([19, 39, 46]). The example of reproducible research reported in this paper identifies free-to-use tools that are currently available to support RR. Our example shows how they can be integrated to adopt an RR approach. It should help other researchers to try out the RR approach for themselves. Other empirical software engineering researchers willing to share their data sets, and related analytic procedures, via the
Footnotes
See http://www.r-project.org/, http://latex-project.org/, and
.
Output 1–Output 8 include the
