Sage Journals: Discover world-class research

Abstract

In this editorial, we provide guidance regarding the design and execution of replication, reproducibility, and generalizability studies and explain the unique purpose and empirical contribution of each research design to the overall knowledge generation and consolidation cycle. We evaluate these research designs within the confines of JOMSR's mission to avoid novel theorizing, but instead to engage only in testing of existing theory or of combinations of existing theory. We identify two important dimensions on which these research designs can vary, independence and constructiveness, and highlight the particular utility of independent studies that are high in constructiveness. We offer suggestions for the identification of suitable targets for reproducibility and replication as well as concrete examples demonstrating high-quality theory testing.

Keywords

Replication studies statistical methods research design

We thank the editorial team of JOMSR for inviting us to write this editorial on how to conceptualize and design replication, reproducibility, and generalizability studies for submission to JOMSR. We are very excited about the mission of JOMSR and the explicit premise underlying the evaluation of submissions: the journal will publish only stringent testing of existing theory with sound methodology. This represents a kind of nirvana for the many researchers who have bemoaned for decades the field's obsession with novel theorizing at the cost of critical evaluation and refutation of existing theories (e.g., Antonakis, 2017; Edwards, Berry, & Kay, 2013; Köhler & Cortina, 2021; Leavitt, Mitchell, & Peterson, 2010). So, here is our chance to get it right! Test theories the way they should be tested. Throw our best methods at them and see what remains standing. We for one (or two) are excited about the potential of the journal to advance and consolidate knowledge in our field.

Per the Editors’ opening editorials (Kraimer, 2023; Kraimer, Martin, Schulze, & Seibert, 2023), the theories whose tests would appear in the pages of JOMSR fall into two broad categories: those that have not been tested previously (e.g., most Academy of Management Review conceptual models; e.g., Edwards et al., 2013) and those that have. The focus of this editorial is on the latter category. More specifically, we provide guidance regarding the design and execution of replication, reproducibility, and generalizability studies. We evaluate these research designs within the confines of the journal's mission to avoid novel theorizing, but instead to engage only in testing of existing theory or of combinations of existing theory.

In our editorial, we first define reproducibility, replication, and generalizability and distinguish them from each other. We explain the unique purpose and empirical contribution of each research design to the overall knowledge generation and consolidation cycle. We then identify two important dimensions on which these research designs can vary, independence and constructiveness, and highlight the particular utility of independent studies that are high in constructiveness. We then discuss important research design considerations for a variety of combinations of the above-mentioned attributes. We pay particular attention to the identification of suitable targets for reproducibility and replication. We also provide concrete examples that demonstrate high-quality theory testing leading to important conclusions regarding the tested theories.

Table 1

Tips for planning reproducibility, replication, and generalizability studies

Category	Type	Likely targets	Planning the study
Reproducibility		Influential (or likely to be)	Need to be able to get the data
	Literal	Lots of data handling/analysis choicesCounterintuitive results	What were the data handling/analysis decision points? Which decisions are unclear? If you get different results, can you figure out why?
	Constructive	Dubious or out-of-date data handling/analysis choices Small N (e.g., firm or group level analysis with relatively few “clusters,” thus making outliers more important) Counterintuitive results	What are the typical consequences of flaw(s)? What consequences might be specific to this phenomenon? If results were counterintuitive, can you argue that they were due to the flaw(s)?
Replication		Influential
	Literal	Lots of design choices Counterintuitive results	What were the design-related decision points? Which decisions are unclear? If you get different results, can you figure out why?
	Constructive	Dubious design choices Counterintuitive results	What are the typical consequences of flaw(s)? What consequences might be specific to this phenomenon? If results were counterintuitive, can you argue that they were due to the flaws?
Generalizability		Important substantive moderator	Are there moderators that would substantially change the theorizing of the relationship between the IV and DV (e.g., restricted variance effects)? Are there boundary conditions that limit the theory's applicability in certain contexts? (Need to provide theoretical rationale for these boundary conditions) Are there untested underlying assumptions of the theory in question that could invalidate the theory if unmet? (Problematize theoretical assumptions.)

Defining and distinguishing replication, reproducibility, and generalizability

One of the challenges that our field faces is that it lacks consistently applied definitions of and distinctions between reproducibility, replication, and generalizability. As a consequence, researchers often disagree about whether these sorts of studies are necessary, play an important role in the advancement of our field, and are worthy of journal space. As such, our first priority for this editorial is to provide clear definitions of the three types of study. In so doing, we draw heavily on our prior work in this space (e.g., Köhler & Cortina, 2021; Cortina, Köhler, & Craig, 2022), which also constitutes the foundations on which the JOMSR Editors based their respective definitions (Kraimer, 2023; Kraimer et al., 2023).

A reproducibility study is one in which a dataset, collected to test a set of hypotheses, is reanalyzed to test the same set of hypotheses or a subset thereof (Köhler & Cortina, 2021). This reanalysis can be completed by the researchers who conducted the original analyses (dependent reproducibility) or by different researchers (independent reproducibility). The reanalysis can be literal, following the exact same steps as the original analysis, or it can change some or all of those steps. The defining factor of a reproducibility study is that no hypotheses are tested other than those proposed in the original study and that the same dataset is used.

The purpose of a reproducibility study is to determine whether a reanalysis leads to the same conclusions as did the original analysis. One possible source of variation between the original findings and the reanalysis is a lack of transparency in the original study vis-à-vis the data handling and analysis steps that were followed. This is, presumably, less of a problem in dependent reproducibility studies, but constitutes a significant source of error in independent reproducibility studies (e.g., Aguinis, Ramani, & Alabduljader, 2018). Other possible sources of variation are researcher competence, with differences in competence (in either direction) between the original research team and the reanalysis team leading to differences in analytic choices, and differences in motives. Finally, variation can stem from differences in software packages. For example, software packages might use bootstrapping algorithms with different starting points, have different default settings for treating missing data, or treat correlations between residuals differently. This can lead to different results (e.g., Albright & Park, 2009; Asparouhov & Muthen, 2006; Cortina et al., 2022). A reproducibility study can expose any of these sources of variation, thus pointing to the most defensible conclusions vis-à-vis the hypotheses. As we explain later, the different types of reproducibility study will expose different sources of variation.

Unlike a reproducibility study, a replication study requires the collection of new data to test the hypothesized effects or relationships of an original study (Köhler & Cortina, 2021). As with reproducibility, a replication study can be conducted by the same research team (dependent replication) or by a different research team (independent replication). The goal is to determine whether findings are consistent or inconsistent with the findings from the original study when new data are collected. A replication is literal if it follows the exact same design and analysis plan as the original study. If it diverges from the original, then the label that is applied to it depends on the nature of the divergence. As is the case with reproducibility, different types of replication study will expose different reasons for inconsistencies between original findings and replication findings. The factor that is common to all replication studies is that the phenomenon of interest is reevaluated in a new data collection.

As with reproducibility, variation between the original findings and the replication might stem from differences in experimenter competence, experimenter motives, or reporting accuracy. Unlike reproducibility, sampling error is also a possible culprit in replications. The specific differences or inaccuracies that are targeted by replication, however, differ from those targeted by reproducibility. If one suspects that findings would be different with, for example, a superior control or experimental condition for an independent variable or a more valid measure of a dependent variable or a more representative sample, then replication is needed. Eden and Shani (1982), for example, could not have achieved what they did with reanalysis alone. We strongly encourage the reader to spend some time with this paper because it is a model of its type. Briefly, prior work on the Pygmalion effect had suffered from various major flaws. Eden and Shani (1982) improved upon this prior work by, among other things, manipulating expectancy in a high stakes work environment with an adult sample. No amount of reanalysis of existing data would have accomplished what Eden and Shani (1982) accomplished with their comprehensive constructive replication.

Finally, there is generalizability. Köhler and Cortina (2021) make the point that the meaning of generalizability is often muddled. In order to keep generalizability distinct, we defined it as “lack of variation in findings across studies that differ on one or more substantive moderators” (p. 492). Thus, a generalizability study is one that adds a substantive moderator to an existing model. Köhler and Cortina (2021) make the point that, given this definition, our field has no lack of generalizability studies. Whether or not a generalizability study is within JOMSR's scope depends on the extent to which novel theorizing has to be introduced to justify the hypothesized moderation effect.

As a point of distinction, it should be noted that reproducibility and replication studies usually target moderators as well, just not substantive ones. If the magnitude of an effect varies across levels of a researcher's data analysis knowledge, then a reproducibility study can be conducted to reveal this fact. If the magnitude of an effect varies across levels of a researcher's desire to manipulate an independent variable in a way that will result in support for their hypotheses, then a replication study (an independent one anyway) would point us toward the light. One final point made by Köhler and Cortina (2021), however, is worth noting. The phrase “replicating and extending” is often used in reference to additions of substantive moderators, that is, generalizability rather than replication. We return to this topic shortly.

Methodological moderators: different forms of reproducibility and replication and their value for scientific advancement

Having distinguished reproducibility and replication in the previous section, we now turn our attention to identifying the different forms that such studies can take. These different forms vary on two important dimensions, independence and constructiveness. Research designs that combine different levels of independence and different degrees of constructiveness vary in their utility for scientific advancement, given that certain combinations allow for more rigorous tests of the proposed theory.

Independence is relatively simple to define. A reproducibility or replication study is independent if it is conducted by a different author team from the original study. A dependent study is conducted by the same author team, and a semi-independent study shares at least one author with the original. Constructiveness is a bit trickier. Köhler and Cortina (2021) define a constructive study as one that “retains all of the virtues of the previous study but includes at least one methodological improvement” (p. 495). A constructive study is incremental if the improvement is minor and substantial if it is major (we touch on the major/minor distinction later). A constructive study that resolves all of the major flaws of the original study is comprehensive.

No study is perfect.¹ Consequently, design flaws are ubiquitous. Flaws might stem from contextual factors outside of our control (e.g., when conducting research on phenomena that cannot be manipulated or when working with secondary data that only exists for certain contexts or countries), data collection limitations (e.g., access to sensitive populations; secondary data that may or may not contain measures for variables of interest), measurement limitations (e.g., no superior measure exists in a field), or data analysis limitations (e.g., the optimal statistical analysis tool simply does not exist yet). Because these flaws are seldom inherent in the research questions being asked, improvements are usually possible, especially by independent teams of researchers who might have better access to data, have more advanced design capabilities, or are able to develop new or apply better measures and analysis tools. The goal of comprehensive constructive replications is to address all major limitations of previous studies, while the goal of substantial constructive replications is to address at least one major limitation. These constructive replications rarely invalidate the original study. Rather, the constructive replication study provides a stronger test of the phenomenon of interest and allows researchers to determine the effect of research design characteristics on observed findings (see e.g., Minefee, McDonnell, & Werner, 2021). This in turn provides deeper knowledge about the phenomenon and its methodological moderators.

There are a few more terminology-related issues worth mentioning. First, a literal study is, by definition, not constructive. It can still have a great deal of value, but that value does not come from correcting research design flaws or analysis execution errors. Second, Köhler and Cortina (2021) found examples of what they called “regressive” studies, that is, studies with all of the flaws of the original plus at least one additional flaw. These are of little value and should not be submitted to JOMSR. Third, some studies introduce differences that are neither good nor bad. These are quite common and are labeled “quasi-random.” The value of these is also limited except perhaps as fodder for meta-analysis and should not be submitted to JOMSR.

JOMSR will, however, welcome certain kinds of literal and constructive reproducibility and replication studies. For replication studies, these can contain any level of independence. For reproducibility studies, JOMSR prefers at least semi-independent reproducibility studies with a strong preference for fully independent reproducibility studies. Each type of study addresses different issues, however, and it is important for authors who are planning to submit to the journal to understand what these issues are. For more information on JOMSR's aims and scope, we refer interested readers to the inaugural editorial mission statement of its Editor (see Kraimer, 2023).

Issues addressed by literal versus constructive studies

A constructive study, by definition, provides an evaluation of the importance of a design flaw or limitation in the original study. A constructive reproducibility study evaluates the importance of a flaw in the handling or analysis of data while a constructive replication study evaluates the importance of a flaw in study design. Of course, all flaws are not created equal. Whether a flaw should be considered major or minor is, to some degree, a matter of opinion, but such judgments should be driven by the likelihood that removal of the flaw would influence some of the findings reported in the Results section and some of the conclusions presented in the Discussion section. JOMSR will welcome comprehensive constructive studies and even substantial ones because, by definition, these address flaws that might be driving the results of the original study. If one is able to correct only minor flaws (i.e., incremental), then a literal study might be wiser. It might be useful to the reader to see some examples of flaws that could represent raisons d’être of reproducibility and replication studies.

Examples of constructive and literal reproducibility studies

Constructive reproducibility attempts could involve using a different treatment of outliers so that a more appropriate version of a dataset is analyzed (e.g., exclusion of extreme or unlikely cases; bounding secondary data analyses to specific time frames that contain a more comparable set of data points), using a superior analytical technique that better accounts for characteristics of the data (e.g., non-normal dependent variables, nestedness, endogeneity), or incorporating control variables that allow one to rule out alternative explanations. Whether these would be considered major or minor depends on the likelihood that repair of the flaw would change the conclusions of the paper. Dropping an outlier, for example, usually would not have enough of an effect on analyses to change many of the words in a Results section. In the case of Hollenbeck, DeRue, and Mannor (2006), however, the authors argued that one case in Peterson, Smith, Martorana, and Owens (2003) was unrepresentative and was so influential that its inclusion had a dramatic effect on the conclusions of Peterson et al. (2003). Its omission by Hollenbeck et al. (2006) did in fact change results dramatically, and this would lead us to characterize the inclusion of that case as a major flaw. In any case, the more flaws that are addressed in the constructive reproducibility study, and the more serious the flaws, the greater the contribution.

The target of literal reproducibility is a bit different. Such studies can be used as a check on data handling and analysis procedures. If, for example, two members of an author team independently attempt to conduct moderated regressions on the same variables in the same dataset and they come up with different results, then they cannot both be right. The discrepancy will then prompt an effort to discover the source of the discrepancy and to choose the approach that is most defensible given current best practice recommendations in the research methods literature.

Of more interest to JOMSR, however, is the independent reproducibility study. Such a study can, of course, shed light on things like transcription errors, but its primary benefit comes from shedding light on hidden choices made by the original authors. For example, Herndon, Ash, and Pollin (2014) set out to conduct a literal reproducibility study of the Reinhart and Rogoff (2010) examination of the effects of national debt ratios on economic growth (see Cortina et al., 2022, for a more detailed description of both papers). It was only after they were unable to reproduce the original results that they discovered errors, both intentional and unintentional, in the handling and analysis of data by the original authors. Correction of these errors (i.e., a constructive reproducibility study) led to entirely different conclusions regarding the value of austerity measures.

JOMSR is most interested in publishing independent reproducibility studies due to the fact that they can lead to consequential reinterpretations of findings and conclusions, reinterpretations that authors of dependent reproducibility attempts might be less likely to reveal to the field. These reinterpretations might in turn be more consequential for theory testing and overall knowledge advancement in our field. Silberzahn et al. (2018) demonstrated the value of independent reproducibility quite convincingly. They had 29 different research teams address the same research question with the same data set. Depending on a variety of perfectly reasonably choices, the teams generated very different results. For example, some teams found effects that were more than three times the size of effects found by other teams. If one of the latter teams had been the only one to conduct this analysis, the field would have concluded one thing. If one of the former teams had been the only one to conduct this analysis, the field would have concluded something very different. These sorts of independent reproducibility studies that shed new light on a topic through reanalysis are welcomed by JOMSR.

Examples of constructive and literal replication studies

When evaluating constructive replications, an incremental improvement might be the use of a slightly different time lag in a repeated measures design that better represents the stability of the construct. As another example, one might use the full rather than the shortened version of a scale or use a sample with slightly more power. Or one might employ a superior data analysis method in the newly collected data. For example, Ghosh, Ranganathan, and Rosenkopf (2016) replicated a prior study of endogenous characteristics of alliance network structure by Ahuja, Polidoro, and Mitchell (2009) via a three-stage process. In their first stage, they replicated the data analyses of Ahuja et al. in a different time period, keeping everything else constant, with the goal of replicating as literally as possible (reproducibility was not possible because the data from original study data was supposedly proprietary). In stage two, the authors analyzed the data with a superior data analysis method (Exponential Random Graph Models: ERGMs) which had been developed after the original study. Stage two constitutes a substantial constructive replication attempt because ERGM allowed the authors to model network formation processes and dependencies that are known to exist but were ignored in the analytic approach used in previous alliance research. In stage three, Ghosh et al. conducted an extension to a different industry for a generalizability assessment.

Note that simply recoding variables included in the same dataset, as is often done in macro level studies, is not considered a replication, but rather a reproducibility attempt as it is based on the same data collection. Similarly, using different performance indicators included in the same dataset would not be considered a replication attempt. Instead, it might be a reproducibility attempt if the alternative performance indicator captures the same underlying variable. If the alternative performance indicator were to capture a slightly different concept, for example, effectiveness versus efficiency, individual versus team performance, or patent filings versus R&D expenditures, then this would be considered neither a replication nor a reproducibility attempt but simply the testing of a different hypothesis. In any case, authors of an incremental constructive replication would need to argue that the relatively minor flaws that they targeted were, nevertheless, consequential, and that, taken together, they could be expected to influence conclusions.

A substantially constructive study needs to go beyond that. For example, using valid measures of core variables when the original study used deeply flawed measures would constitute a substantial improvement as would using a more rigorous experimental design that controls for more or more important sources of variance. A substantial improvement could also come from using a sample that is more representative of the population of interest when the original sample differs from the population of interest in a way that is likely to affect results. A sample of employees in a real or single organizational context—rather than a student or MTurk sample—would be more appropriate when studying organizational phenomena that might vary substantially with job experience or that might be highly dependent on organizational context factors. As another example, many phenomena are particularly relevant for small- to medium-sized enterprises (SMEs) (e.g., informal HR practices, collaborative cultures) but are, for reasons of access if nothing else, studied primarily with large firms. A study that managed to gain access to data from SMEs to study such a phenomenon would represent a substantial constructive replication. And as with reproducibility, the more (and more serious) the design flaws, the greater the opportunity for contribution. Hence, JOMSR is most interested in replication studies that include multiple substantial improvements vis a vis the original study, with the gold standard being a comprehensive constructive replication (e.g., Eden & Shani, 1982)

A literal replication study, in principle, addresses sampling error. If the original authors were just lucky, then a literal study will probably reveal it. However, a literal study can also shed light on details, and possibly flaws, that were not obvious in the original report due, for example, to lack of transparency about how the study was conducted (Aguinis et al., 2018). Of course, lack of transparency makes literal replications more challenging. Literal replications can be difficult for other reasons as well, including lack of access to an equivalent participant pool, or lack of access to proprietary instruments or experimental apparatus. When a literal replication is possible, however, it can provide useful information about variation in findings due to sources ranging from sampling error to researcher competence and motives. If (and it can be a big if) a more or less exact duplication of the original study is possible, then the source of variation that is the target of the study is sampling error. As with literal reproducibility studies, JOMSR is most interested in independent replications.

Issues addressed by independent versus dependent studies

Generally speaking, an independent reproducibility or replication study allows one to assess the role of researcher-related factors. If flaws exist in the original work, then they must be due either to a lack of knowledge and skills on the part of the original author team, lack of access to superior design/analysis elements (such as access to a larger sample, benevolent organizational research partners, or access to proprietary data or software), lack of funding to support superior design/analysis elements, or to a lack of desire to subject their hypotheses to appropriate scrutiny. Only an author team with different competencies, resources, and/or motives is likely to correct these flaws.²

The application of the above to constructive studies is obvious, but these principles also apply to literal studies. As mentioned above, some of the flaws in Reinhart and Rogoff (2010) were accidental, some intentional, but none were reported. The original authors could not have reported the accidental flaws because they did not have the knowledge and skills to recognize them as such, and they chose not to report the other flaws presumably because those flaws helped them to draw the conclusions that they wished to draw. Only an independent reproducibility study could get to the bottom of this. The real contribution of Herndon et al. (2014) is as a constructive study, but the constructiveness would not have been possible if they had not begun by trying to recreate the original Reinhart and Rogoff (2010) analyses. It was their inability to do so that finally led them to the flaws in Reinhart and Rogoff (2010). Then, Herndon et al. (2014) could apply their superior data handling and analysis skills as well as their desire to draw appropriate (rather than ideology-consistent) conclusions in the carrying out of a comprehensive constructive reproducibility study. But that would not have been possible without the lessons that they learned from their literal reproducibility attempt.

This last example also serves to make a final point about independent literal studies. There would not have been much point in Herndon et al. (2014) showing that, when they reanalyzed the Reinhart data, omitting theory-disconfirming cases and overweighting theory-confirming cases as did the original authors, they produced the same results. That might have been a good starting point, but the real hook for their paper had to be to show that the conclusions regarding austerity measures were very different when one did not omit inconvenient cases, weight convenient cases more heavily, etc. In other words, the literal reproduction of the flawed original would have served as a launching point for the constructive reproduction but would not have stood on its own.

Recommendations for designing reproducibility and replication studies for JOMSR

In order to make a sufficient contribution, each type of study must possess certain virtues, and in sufficient quantities. This is true of any paper of course, and because JOMSR publishes different sorts of papers from the ones to which we are used, it might be worthwhile to consider the virtues required of the various forms of reproducibility and replication study (see Table 1).

Literal studies

A literal study must follow the procedures of the original as exactly as possible. Obviously, this means incorporating every detail in the Methods section of the original. There may, however, be other details that were omitted from the original. Perhaps the original authors felt that they were unimportant. Or perhaps a reviewer (as they sometimes do, unfortunately) suggested that certain methodological details be excised in the interest of journal space. We, therefore, recommend that the literal replicator/reproducer contact the first author of the original study in order to discover if there were any additional details that would not be obvious to a reader of the published work.

Along these same lines, the literal replicator/reproducer may begin their work only to realize at some point that the original authors had had options, but that it was not clear which they had chosen. For example, suppose that the original authors had stated that they had used the 8-item measure of Y by Abbott and Costello (1937), but that their 1-factor CFA had 9 degrees of freedom, which is consistent with the shorter 6-item measure of Y developed by Laurel and Hardy (1941). Or perhaps there were two different versions of the experimental task and it was not clear which had been used by the original authors. It would be important for the replicator/reproducer to pause their work in order to find out³.

As we mentioned earlier, if the original work had a major flaw, then a stand-alone literal replication probably would not make enough of a contribution to be considered for publication in JOMSR. Why retake one's temperature with a broken thermometer? Instead, a literal replication could be used as a first step intended to ensure that everything had been calibrated properly. Then a constructive study in which the major flaw was removed could be conducted in order to demonstrate the effect of the flaw. For example, Schultze, Pfeiffer, and Schulz-Hardt (2012) began with a literal replication of Conlon and Parks (1987) followed by a constructive replication in which a major design flaw was fixed. This allowed Schultze et al. (2012) to show that the flawed design does indeed lead to one set of results (viz. those originally reported in Conlon & Parks, 1987), while the superior design leads to another.

If one's purpose really is to perform a literal study only, then one's target is presumably sampling error. And because sampling error would not be a factor in a reproducibility study, we are really just talking about replication here. To do justice to issues regarding sampling error, then, the literal replicator would have to do more than just repeat the small-sample original study. Perhaps multiple literal replications could be conducted, or perhaps a single large-sample literal replication could be conducted. One way or another, the reader would have to have confidence that the replicator's conclusions were at least as legitimate as, but probably more legitimate than the conclusions in the original study.

There is one other element of literal studies that is worth mentioning. Köhler and Cortina (2021) and others (Lykken, 1968; Stroebe & Strack, 2014) have pointed out that, strictly speaking, literal (aka exact, direct) replication/reproducibility studies are not possible. Time will have passed, participants will have come from a different part of the country, a different secondary dataset will not have used the exact same measures for core variables, etc. In order to allow for literal studies, then, the “literalness” requirement must be relaxed slightly. In order to be considered for publication, then, a literal study must differ in no obviously consequential way from the original. For example, suppose that Experimenter A used task t and operationalizations x and y to study the relationship between construct X and construct Y. Suppose further that the data were collected from University of Melbourne Class of 2015 undergraduates. Experimenter B, in an effort to conduct a literal replication of this work, uses the same task and operationalizations but collects data from Monash University Class of 2016 undergraduates.

The replication does differ from the original, but it is difficult to imagine constructs whose relationship would be expected to differ between Class of 2015 undergraduates at one Melbourne-based “Australian Group of 8” university and Class of 2016 undergraduates at a different Melbourne-based “Australian Group of 8” university. There is a difference, but it is of no obvious consequence. As a result, the work of Experimenter B is sufficiently literal.

But that was an easy example. What if Experimenter B, in replicating Experimenter A's study, had used undergraduates from Virginia Commonwealth University in the USA instead of Monash students? What if Experimenter B had used Class of 2022 University of Melbourne students rather 2015 students? With regard to geographic location, depending on the nature of X and Y, there might be cultural differences that would influence their relationship. Experimenter B would do well to point out that the USA and Australia are very similar on Hofstede's dimensions (Hofstede, Hofstede, & Minkov, 2010), so, no worries mate. Or the Class of 2022 might be expected to differ from the Class of 2015 because the Class of 2022 had spent half of their college years on Zoom. Experimenter B should consult the literature on the consequences of this fact for the Class of 2022 in the hopes of being able to argue that, although there would be many relationships for which these consequences would be relevant, the relationship between X and Y is not among them. Just to be clear, if Experimenter B decides that these differences matter and subsequently adds cultural differences or pre/post Covid educational experience as moderators to the tested hypotheses, they are conducting a generalizability study and, therefore, should follow the guidelines provided below.

In short, a literal study must be conducted with great care in order to ensure that no confounds have been introduced. Depending on the nature of the difference between the original and the follow-up, the author may need to argue why there is little reason to fear that the difference is consequential vis-à-vis the phenomenon of interest.

Candidates for Literal Studies

The most important characteristic of a target for a literal study is that it is influential (or is likely to become so). There is not much point to a second attempt when the first attempt was ignored.

It also helps if the findings of the target were, in some way, counterintuitive. In illustrating the importance of replication, for example, Lykken (1968) described a study published by Sapolsky (1964) in a prominent journal. Sapolsky suggested that psychiatric patients with eating disorders harbor an unconscious “cloacal theory of birth,” that is, oral impregnation⁴. Thus, patients who see frogs in the Rohrschach are more likely to suffer from eating disorders. And Sapolsky (1964) did indeed find more eating disorders among frog responders. Lykken (1968), for some reason, was not buying it. As he put it, “I regarded the prior probability of Sapolsky's theory…to be nugatory and its likelihood unenhanced by the experimental findings” (p.151). Put another way, the findings were sufficiently counterintuitive to warrant a revisit.

But counterintuitiveness can take many forms. For example, the probability of a study with a typical sample size finding support for all three two-way interactions and the three-way interaction among three predictors is in the low single digits. Such a finding is, therefore, counterintuitive. A literal study, or better yet, a series of literal studies would be likely to shed light on the reasons for the original findings.

We should also repeat the fact that a literal study is often a precursor to a constructive one. Herndon et al. (2014) set out to conduct a literal reproducibility study, and based on what they learned from that attempt, they were able to conduct a comprehensive constructive reproducibility study that debunked the target study. Schultze et al. (2012) began with a literal replication just to show that they were in fact doing the original study justice. Their series of constructive replications could then be used to show that a design flaw in the original had led to incorrect conclusions. Similarly, Ghosh et al. (2016), in their Stage 1, tried to get as close as possible to a literal replication of the Ahuja et al. study, varying only the time period of the collected data.

Constructive studies

A constructive study must retain all of the major virtues of the original while eliminating at least one major flaw. To use the terms from Köhler and Cortina (2021), a constructive study must be either substantial (fixing at least one major flaw) or comprehensive (fixing all major flaws). In some cases, the major vs. minor flaw distinction will be simple. For example, as was mentioned earlier, tests of the Pygmalion Effect prior to Eden and Shani (1982) had been in relatively low-stakes situations. No one could reasonably object to the claim that Eden and Shani (1982) had eliminated a major flaw in previous work by corroborating the existence of the effect in life-or-death situations.

In most cases, though, what constitutes a major flaw will be a matter for debate. Authors of constructive studies must, therefore, argue convincingly that their improvement really is a major one. This usually involves a methodological argument based on current best practice recommendations that are relevant for the chosen research question, the constructs of interest, etc. One way to avoid this debate is to crowdsource the effort. For example, Landy et al. (2020) had 15 different research teams design and conduct studies to test hypotheses related to moral judgments, negotiation, and implicit cognition. If the teams in such an effort were instructed to design their studies in a manner most appropriate to the research question, then those studies would satisfy Lykken's requirements for constructiveness. Several such independent efforts would shed light not only on the research question but also on the importance of design characteristics for its answer.

Authors of constructive studies must also take care not to introduce any flaws of their own. Consider Chen, Chen, and Sheldon (2016). In study 1, these authors somehow managed to incorporate an objective measure of unethical pro-organizational behavior (UPB), thus avoiding the problems of intentional distortion that infect self-report measures of such behaviors. Their study 2 improved upon study 1 in some ways, but it also eliminated its greatest virtue by replacing the objective measure of actual UPB with a self-report measure of intentions to engage in UPB. Köhler and Cortina (2021) described this as a “confounded” replication because there would be no way of knowing whether differences in results between the two studies were due to the elimination of flaws or the elimination of virtues. Authors intending to submit to JOMSR should strive to eliminate major flaws while retaining all major virtues.

Across all of these designs, it is important that authors writing for JOMSR argue in their papers for the appropriateness of their chosen approach and explain to the reader the intended gains for knowledge advancement. Elaborate theoretical arguments are unnecessary as the theorizing itself cannot change. The hypotheses are what they are. However, authors need to explain to the reader the specific benefits of the chosen reproducibility or replication study. Along the same lines, due attention needs to be given to a description of the chosen methodology and how it differs in meaningful ways (or, in the case of literal studies, does not differ in meaningful ways) from the original study. Readers can refer to our recommendations above. In the end, it needs to be clear in what ways the reproducibility or replication study and its findings makes an empirical contribution and adds to our scientific knowledge base.

When discussing limitations of the reproducibility or replication study, authors of substantial constructive studies could discuss how future research could address any remaining methodological flaws of the original study that they were unable to address. Authors of literal studies could highlight the utility of future constructive replication attempts. However, it needs to be clear why the authors chose not to incorporate their suggestions in their own work.

Candidates for constructive studies

As is the case with a literal study, the contribution of a constructive study is enhanced when the target is influential and its findings counterintuitive, two attributes that, unfortunately, often go hand in hand. In addition, the target of a constructive study needs to have at least one flaw that is likely to be consequential. The consequentiality of the flaw may be general or specific. Regarding the former, one could lean on the methods literature to argue that correction of the flaw is likely to bear fruit. If, by contrast, the flaw is specific to the phenomenon of interest, then one would lean on the substantive literature to argue for consequentiality. Herndon et al. (2014) could lean on the methods literature, or indeed on common sense, to argue that omission of cases that flew in the face of the original author's reasoning was likely to be consequential. Schultze et al. (2012), on the other hand, had to get into the details of typical escalation-of-commitment argumentation and methods to explain why their changes to Conlon and Parks (1987) were, in fact, improvements and why these improvements were likely to matter.

Substantive moderators: generalizability studies

Let us now turn our attention to generalizability studies, which include the addition of a substantive moderator to the originally tested relationship or model. Popular substantive moderators include gender, cultural value differences, membership in different organizational groups, geographic location, and different levels of environmental risk, to name a few. Ultimately, the goal is to determine whether existing theorizing is universally applicable or needs to be modified to include theorizing about suspected variations across different contexts, people, etc.

As pointed out previously, the abundance of meta-analyses and the many studies contained within them provide an indication that there is no shortage of generalizability studies in our field. JOMSR will publish generalizability studies. However, whether or not a generalizability study falls within JOMSR's scope depends on the extent to which novel theorizing has to be introduced to justify the moderation effect. If extensive novel theorizing is necessary to justify the addition of the substantive moderator, then the paper would not be suitable for JOMSR and should rather be pitched to one of the outlets with a primary focus on novel theory, such as JOM. JOMSR is only interested in refinement, clarification, or integration of existing theory.

On the other hand, addition of a moderator to an existing model requires some sort of theoretical justification of the proposed interaction effect. If no such justification exists, then authors are essentially engaging in some sort of random fishing for significant effects to be explained via HARKing (hypothesizing after results are known). Needless to say, JOMSR is not interested in such random fishing studies. Thus, it is important to delineate too much theorizing from too little. Examples should help.

If a researcher were interested in the relationship between abusive supervision and interpersonal justice (e.g., Vogel et al., 2015) and believed that this relationship varies across cultures, then the researcher would need to explain why cultural differences, for example, differences in cultural values or norms, would impact the abusive supervision-justice relationship. The employment of existing theorizing about cultural value or norm differences would lead to refinement of the theoretical argument for the linkage between abusive supervision and employee justice perceptions to clarify that such theorizing may not be universally applicable across all cultural contexts. The researcher does not need to generate new cross-cultural theory or come up with novel theorizing that explains the abusive supervision–justice relationship. Rather, the researcher borrows from existing cultural theorizing to define boundary conditions for the abusive supervision–justice relationship. Similarly, if the researcher argued that gender moderated the relationship, they might incorporate gender role theory or gender identity theory to explain how gender differences impact on the abusive supervision–justice relationship.

These examples fall into the category of JOMSR papers in which two different theories are combined to make and test predictions, so let us refer to them as integrative generalizability studies. In naming it so, we lean on Cronin, Stouten, and van Knippenberg’s (2021) work that highlights integration as the crucial mechanism that leads from scattered (independent) unit theorizing to (integrated) programmatic theorizing. Cronin et al. emphasize that integrating and reconciling different unit theories to establish their respective boundaries and to determine how they intersect with each other to produce a phenomenon is crucial for the creation of a coherent programmatic theory. In their second editorial, Kraimer et al. (2023) underscore the important role that JOMSR, as a journal focused on constructive theory testing, will play in laying the empirical foundation for the building of more programmatic theories.

Important to note is that not every possible generalizability study is worth submitting to JOMSR. To be considered by JOMSR, a paper needs to explain how the combination of the two existing theories refines the theorizing that underlies the proposed model. Simply submitting studies that examine serendipitously chosen moderator effects (i.e., random fishing) will likely be of little interest to the readership of JOMSR. Rather, a study should carefully consider important boundary conditions of a theory that might invalidate or heavily impact the predictions of said theory in certain situations.

One type of reasoning that can be layered onto existing theory to justify an interaction is restricted variance reasoning (Cortina, Koehler, Keeler, & Nielsen, 2019; Cortina et al., 2015, 2022a). A generalizability study could, for example, take the form of determining when a theorized relationship between two variables holds and when situational context factors restrict the variance of one of the variables in such a way that the proposed relationship is weaker or stronger than it was previously thought to be. Cortina et al. (2022b) showed that IPO value of biotech firms was compressed upwards in biotech hubs such as Boston and San Francisco but not in areas such as Chicago and Houston that have middling densities of biotech. Studies that had, not unreasonably, studied predictors of biotech IPO value using data from biotech hubs would have found weaker prediction than would a study using data from mid-density areas.

Consider as another example the mediation in Wasti, Bergman, Glomb, and Drasgow (2000). These authors hypothesized that job masculinity affects perceived sexual harassment, which in turn affects women's health. They further argued that the first stage of the mediation would be weaker in Turkey than in the USA because Turkish women would be likely to see harassment at work as normal and not as harassment at all. Cortina et al. (2022) reframe this interaction in restricted variance terms: The harassment variable is compressed downwards in Turkey, and this compression reduces the prediction of harassment by job masculinity. But what about the second stage of the mediation? Compression on a predictor actually strengthens its prediction of other variables. Thus, restricted variance reasoning could be layered onto reasoning from the harassment and occupational health literature to argue that the harassment–health relationship should be stronger in Turkey than it is in the USA generalizability study like Wasti et al. (2000) could then be justified by refining existing theory rather than generating new theory.

Let us be clear: Justification of a generalizability study would necessitate an alteration of the existing theorizing about the relationship or model of interest. However, the alteration to theorizing needs to be one of theoretical refinement, clarification, and integration to be of interest to JOMSR. The testing of such integrative theorizing could be done by comparing findings from two contexts for which the theory makes different predictions, for example, testing the harassment–health relationship in Turkey and in the USA (as in Wasti et al., 2000), using cultural theory to argue why findings in Turkey and the USA should differ.

However, JOMSR will also consider papers that propose a generalizability study in only one context. In this case, authors need to lean on existing theory to argue why they would expect to find different results in the second context and propose how the discovery of these theory-inconsistent findings would lead to refutation or refinement of said theory. In contrast, authors might want to argue for a generalizability study in which the same patterns as found previously are expected despite differences that appear to challenge some of the core assumptions or expectations on which the original theory was built. In this case, authors would be looking for generalizability across contexts or populations. In such a case, authors would need to argue how theory-consistent findings would lead to theory refinement, clarification, or integration.

Another example of generalizability studies welcomed by JOMSR falls into the category of substantive or contextual moderators that are already included in the underlying theory of interest but have yet to be tested. Let us call these foundational generalizability studies. Building on Cronin et al.'s (2021) work, Kraimer et al. (2023, p. 4) state that “(d)irect tests of the underlying assumptions of existing theory […] contribute to our foundational understanding of unit theories.” Furthermore, Kraimer et al. write that “(o)ne of the most direct ways to contribute to unit theories then is to empirically test the propositions/hypotheses in existing theoretical models” (p. 4).

Generalizability studies of this type should establish why the testing of this moderator is needed for theory confirmation. For example, it might be that a theorized moderator, when tested, could establish important boundary conditions of the theoretical model similar to our previously mentioned example of restricted variance interaction effects. As such, insights from such a generalizability study would provide important insights into the limitations of the theory's applicability.

Another example for this type of generalizability study specified by Kraimer (2023) are tests of the underlying assumptions of theoretical models with the goal of theory refinement and clarification. As Kraimer (2023, p. 3) rightly notes, this is “an often-overlooked aspect when testing theory.” Kraimer et al. (2023) offer an example from the work of Zhao and Liu (2022). Careful examination of scale characteristics of three different available measures of entrepreneurial passion (i.e., question content, correlations between dimensions, reliability, internal and external validity) challenge the theorized “dimensional structure and operation of multi-dimensional constructs” (Kraimer et al., 2023, p. 10). Zhao and Liu (2022) hence conclude that the theoretical model for entrepreneurial passion needs further development. According to Cronin et al. (2021) such testing of the construction of a unit theory “increases clarity of constructs and causal linkages” (p. 675). Useful guidelines to explore and challenge assumptions of existing theories can be found in the work of Alvesson and Sandberg who elaborate the process of problematization and of generating research questions that challenge assumptions on which theories are built (e.g., Alvesson & Sandberg, 2011; Sandberg & Alvesson, 2011).

The Editors of JOMSR have further specified a category of papers of interest to JOMSR that combines replication studies with a generalizability study, which they call generalizability replications. These can be considered a combination of the two designs, not a new category of replication studies. The part of a theoretical model that has previously been tested and is being retested with new data would be considered the replication part of the follow-up study. The part of the theoretical model that introduces a substantive moderator, requiring theory refinement or constituting the test of a previously untested, but theorized, interaction effect, would be considered the generalizability part of the study. In short, generalizability replications as defined by the JOMSR Editors kill two birds with one stone. In fact, Bettis, Helfat, and Shaver (2016) have argued that, in research settings in which the influence of context factors looms large and in which there are many different contextual factors that may or may not be in the control of the researcher (as is often the case in macro-level research), the combination of replication and generalizability efforts may in fact be the most constructive way to add meaningful knowledge to the existing body of research.

Conclusion

Many scholars, ourselves included, have long perceived a “replication crisis” in our field. But the real crisis is that we do not know if we have a crisis or not, the reason being that our journals publish so few replication/reproducibility studies that are independent and constructive. Enter JOMSR. If you, dear reader, follow the advice that we have offered, there will be space for your work in the journal. True, it is a new journal, but it may become an extremely influential one in short order. This is our chance to get theory testing right.

We are grateful to the JOMSR Editors for the opportunity to reiterate the importance of reproducibility, replication, and generalizability studies for advancing our field of research. We have attempted to provide insight into the value of theory refinement, clarification, and integration to be achieved with these research designs. In addition, we have provided actionable advice regarding the core design features of each type of study and the theoretical rationale necessary to justify submission of each type to JOMSR. We have attempted to ensure that our recommendations are built on current best practices related to reproducibility, replication, and generalizability. Given JOMSR's first-mover status and the fact that this journal will likely attract innovative and ground-breaking submissions that will challenge our thinking about theorizing and theory testing, it is foreseeable that the criteria for what is and is not consistent with theory testing and refinement will evolve as the journal receives submissions. We are excited about the potential changes to research that JOMSR will bring to the field.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Work on this paper was supported by a University of Melbourne – Faculty of Business and Economics Visiting Research Scholar grant.

ORCID iDs

Tine Köhler

Jose M. Cortina

Notes

References

Aguinis

Ramani

R. S.

Alabduljader

(2018). What you see is what you get? Enhancing methodological transparency in management research. Acad of Manage Ann 12(1), 1–28.

Ahuja

Polidoro

Jr Mitchell

(2009). Structural homophily or social asymmetry? The formation of alliances by poorly embedded firms. Strateg Manage J 30, 941–958.

Albright

J. J.

Park

H. M.

(2009). Confirmatory factor analysis using Amos, LISREL, Mplus, SAS/STAT CALIS. Accessed on May 8, 2017 at https://scholarworks.iu.edu/dspace/handle/2022/19736

Alvesson

Sandberg

(2011). Generating research questions through problematization. Acad Manage Rev 36, 247–271.

Antonakis

(2017). On doing better science: From thrill of discovery to policy implications. Leadersh Q 28, 5–21.

Asparouhov

Muthen

(2006). Comparison of estimation methods for complex survey data analysis. Mplus Web Notes. Accessed on May 8, 2017 at https://www.statmodel.com/download/SurveyComp21.pdf

Bettis

R. A.

Helfat

C. E.

Shaver

J. M.

(2016). The necessity, logic, and forms of replication. Strateg Manage J 37, 2193–2203.

Chen

C. C.

Sheldon

O. J.

(2016). Relaxing moral reasoning to win: How organizational identification relates to unethical pro-organizational behavior. Journal of Applied Psychology 101(8), 1082–1096.

Conlon

E. J.

Parks

J. M.

(1987). Information requests in the context of escalation. J Appl Psychol 72(3), 344–350.

10.

Cortina

J. M.

Koehler

Keeler

K. R.

Nielsen

B. B.

(2019). Restricted variance interaction effects: What they are and why they are your friends. J Manage 45, 2779–2806.

11.

Cortina

J. M.

Köhler

Craig

(2022). Are we building our science on quicksand? Current reproducibility practices in management. In Academy of management proceedings (Vol. 2022, pp. 15461). New York: Academy of Management.

12.

Cortina

J. M.

Köhler

Keeler

K. R.

Pugh

S. D.

(2022a). Situation strength as a basis for interactions in psychological models. Psychol Methods 27, 212.

13.

Cortina

J. M.

Köhler

Nielsen

B. B.

(2015). Restriction of variance interaction effects and their importance for international business research. J Int Bus Stud 46, 879–885.

14.

Cortina

J. M.

Köhler

Sheng

Keeler

K. R.

Nielsen

B. B.

Coombs

J. E.

Ketchen

D. J.

Jr (2022b). Restricted variance interactions in entrepreneurship research: A unique basis for context-as-moderator hypotheses. Entrepreneurship Theory and Practice 10422587221121293.

15.

Cronin

M. A.

Stouten

van Knippenberg

(2021). The theory crisis in management research: Solving the right problem. Acad Manage Rev 46, 667–683.

16.

Eden

Shani

A. B.

(1982). Pygmalion goes to boot camp: Expectancy, leadership, and trainee performance. J Appl Psychol 67(2), 194–199.

17.

Edwards

J. R.

Berry

J. W.

Kay

V. S.

(2013). Bridging the great divide between theoretical and empirical management research. Working paper, Kenan-Flagler Business School, University of North Carolina.

18.

Ghosh

Ranganathan

Rosenkopf

(2016). The impact of context and model choice on the determinants of strategic alliance formation: Evidence from a staged replication study. Strateg Manage J 37, 2204–2221.

19.

Herndon

Ash

Pollin

(2014). Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge J Econ 38, 257–279.

20.

Hofstede

G. J.

Minkov

(2010). Cultures and organizations: Software of the mind (3rd ed., pp. 255–258). McGraw Hill.

21.

Hollenbeck

J. R.

DeRue

D. S.

Mannor

(2006). Statistical power and parameter stability when subjects are few and tests are many: Comment on Peterson, Smith, Martorana, & Owens. 2003. J Appl Psychol 91(1), 1–5.

22.

Köhler

Cortina

J. M.

(2021). Play it again, Sam! an analysis of constructive replication in the organizational sciences. Journal of Management 47, 488–518.

23.

Kraimer

M. L.

(2023). Introducing journal of management scientific reports (JOMSR). J Manage Sci Rep, 1(1), 3–7. https://doi.org/10.1177/27550311221142152

24.

Kraimer

M. L.

Martin

Schulze

Seibert

S. E.

(2023). What does it mean to test theory. J Manage Sci Rep, 1(1), 8–17.

25.

Landy

J. F.

Jia

M. L.

Ding

I. L.

Viganola

Tierney

Dreber

, & Crowdsourcing Hypothesis Tests Collaboration. (2020). Crowdsourcing hypothesis tests: Making transparent how design choices shape research results. Psychol Bull 146, 451.

26.

Leavitt

Mitchell

T. R.

Peterson

(2010). Theory pruning: Strategies to reduce our dense theoretical landscape. Organ Res Methods 13, 644–667.

27.

Lykken

D. T.

(1968). Statistical significance in psychological research. Psychol Bull 70(3), 151–159.

28.

Minefee

McDonnell

M. H.

Werner

(2021). Reexamining investor reaction to covert corporate political activity: A replication and extension of Werner (2017). Strateg Manage J 42, 1139–1158.

29.

Peterson

R. S.

Smith

D. B.

Martorana

P. V.

Owens

P. D.

(2003). The impact of chief executive officer personality on top management team dynamics: One mechanism by which leadership affects organizational performance. J Appl Psychol 88, 795–808.

30.

Reinhart

C. M.

Rogoff

K. S.

(2010). Growth in a time of debt. Am Econ Rev 100, 573–578.

31.

Sandberg

Alvesson

(2011). Ways of constructing research questions: Gap-spotting or problematization? Organ 18, 23–44.

32.

Sapolsky

(1964). An effort at studying Rorschach content symbolism: The frog response. J ConsultPsychol 28, 469.

33.

Schultze

Pfeiffer

Schulz-Hardt

(2012). Biased information processing in the escalation paradigm: Information search and information evaluation as potential mediators of escalating commitment. J Appl Psychol 97(1), 16–32.

34.

Silberzahn

Uhlmann

E. L.

Martin

D. P.

Anselmi

Aust

Awtrey

Bahník

Bai

Bannard

Bonnier

Carlsson

Cheung

Christensen

Clay

Craig

M. A

Dalla Rosa

Dam

Evans

M. H

Cervantes

I. F

, ... , Nosek

B. A

. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337–356.

35.

Stroebe

Strack

(2014). The alleged crisis and the illusion of exact replication. Perspect Psychol Sci 9(1), 59–71.

36.

Vogel

R. M.

Mitchell

M. S.

Tepper

B. J.

Restubog

S. L.

Hua

Huang

J. C.

(2015). A cross-cultural examination of subordinates’ perceptions of and reactions to abusive supervision. J Organ Behav 36, 720–745.

37.

Wasti

S. A.

Bergman

M. E.

Glomb

T. M.

Drasgow

(2000). Test of the cross-cultural generalizability of a model of sexual harassment. J Appl Psychol 85, 766.

38.

Zhao

Liu

2023. Entrepreneurial passion: a meta-analysis of three measures. Entrep Theory Pract, 47(2), 524–552. https://doi.org/10.1177/10422587211069858

Constructive replication,reproducibility,and generalizability: Getting theory testing for JOMSR right

Abstract

Keywords

Defining and distinguishing replication, reproducibility, and generalizability

Methodological moderators: different forms of reproducibility and replication and their value for scientific advancement

Issues addressed by literal versus constructive studies

Examples of constructive and literal reproducibility studies

Examples of constructive and literal replication studies

Issues addressed by independent versus dependent studies

Recommendations for designing reproducibility and replication studies for JOMSR

Literal studies

Candidates for Literal Studies

Constructive studies

Candidates for constructive studies

Substantive moderators: generalizability studies

Conclusion

Footnotes

Declaration of conflicting interests

Funding

ORCID iDs

Notes

References