Abstract
We point out potential drawbacks of some of the ways in which Leising et al. (2022a) proposed to improve personality science. We argue that it is ill-advised to use only one measure for a concept. Also, we argue that researchers should not refrain from conducting a study when a high level of statistical power is precluded. Then, we go one step further and formulate additional ideas of how to improve research. Specifically, we argue that it is a good thing to use different methods rather than only one when attempting to generalize across these methods. Moreover, we argue for a more theory-driven strategy for specifying factor analytic models, and we emphasize that high-quality research is often interdisciplinary. Finally, we point to a particular risk associated with any formal reward system.
Scholars have pointed out that research conducted in psychology suffers from questionable research practices of individual researchers as well as structural problems that reinforce the use of these practices (Asendorpf et al., 2013), which has led to calls for improvement (e.g., Nosek et al., 2012). Recently, Leising et al. (2022a) suggested ten ways to improve personality science, which may also serve as a blueprint for other subdisciplines in psychology. We truly appreciate their contribution and largely agree with the authors’ main goal to improve scientific quality standards and, thereby, to foster good research, as well as with almost all of the authors’ recommendations. For example, we too consider the replicability of findings to be a quality indicator and transparency in terms of detailed reporting and code sharing to be a necessary precondition for successful replication attempts by other researchers. However, we hesitate to offer our unreserved support on all presented arguments because we feel that some points deserve more discussion and reflection. Trying to optimize current practices that might be potentially flawed is only useful and warranted when the means taken to this end are themselves advantageous—a prerequisite that might not hold for some of the proposed steps as well as for the resulting proposed reward system.
Our basic motivation for writing this article is the assumption that consensus regarding scientific knowledge is an important aim that should not be confused with consensus regarding the use of methods. Rather than rewarding the use of consensus methods, we should reward attempts to approach research questions from many different angles with different methods. In the following, we will argue that it is generally ill-advised to use only one consensus measure per concept as the reduction of a (potentially) broad spectrum of different methods to a single method violates essential principles in science, such as the pluralism of methods. Moreover, we will argue that it is unwise to refrain from conducting a study when its design seemingly precludes a low type II error (i.e., a sufficiently high power). After presenting additional ideas of how to improve research, we will express concern about any formal system that is proposed as a means to reward researchers.
Why standardization can hamper science
Leising et al. (2022a) argued that the use of variants of an existing measure or completely different measures to assess a concept would increase the risk of running into jingle-jangle fallacies. A jingle fallacy describes the misguided assumption that two measures with the same label assess the same concept. In contrast, a jangle fallacy appears when two similar measures with different labels are mistakenly assumed to assess different concepts. Leising et al. (2022a) suggested that personality science should collectively develop a single standard measure per concept (i.e., a consensus measure), thereby ruling out the possibility that a diverging result is due to the use of a different measure. We acknowledge the many challenges associated with the existence of multiple measures for the same concept, and we have faced these challenges in our own research. The situation becomes even more difficult when different constructs exist for a concept or even different definitions of the concept. We agree with Leising et al. (2022a) that a measure needs to be tested carefully and improved gradually and that this task can be solved more easily if many research groups work together. At the same time, we ask ourselves whether moving towards the use of a consensus measure would lead to significant advancements of a field.
For example, we notice that one key assumption made by the authors is that science essentially represents cumulative knowledge building, a view of science that misses important aspects of scientific progress. As Kuhn (1970) pointed out, science proceeds in multiple phases. One of these phases is dominated by a common understanding or consensus (i.e., a paradigm), whereas another phase results in a shift of paradigm. Thus, according to Kuhn’s (1970) view, scientific progress is not solely cumulative. Leising et al. (2022a) seem to have referred mainly to the paradigmatic phase in which researchers are involved in addressing open questions posed by the predominant paradigm. Of course, using consensus measures can help in this phase. However, as with all measures, a consensus measure has only a limited capacity to make new discoveries. In order for a paradigm to shift, there must be a critical mass of new, surprising, and potentially contradicting findings that cannot be explained by the current paradigm. It should be noted that Leising et al. (2022a) acknowledged deviation from a consensus measure only for the purpose of advancing the measurement itself. In this light, using not just one but many different measures for a single concept can help detect such inconsistent findings and thus open the door to a paradigm shift. See Hogan et al. (2022) for a similar but not identical argument. Hogan et al. (2022) criticized that Leising et al.’s (2022a) understanding of high-quality research focuses first and foremost on work coming from the “context of verification,” whereas future progress lies in the “context of discovery.”
A prominent example that is often used for illustrative purposes stems from the early days of optics (Hacking, 1983). In order to better understand the phenomenon of light, researchers used many different approaches for studying it, among which was also Bartholin’s use of calcite (i.e., a transparent and colorless crystal). If you were to place a calcite crystal on this page, you would see this writing twice. Technically speaking, beside the ordinary light beam, this tool allows observers to view a second “extraordinary” beam. For Huygens, this phenomenon was a massive challenge. He wrote that he “was in a sense compelled to make this inquiry, because the refractions in this crystal seemed to overthrow my foregoing explanation of regular refraction.” As a consequence, Huygens had to refine his theory by assuming and incorporating the rotational elliptical propagation of light. The use of calcite and many other measures led to findings that could not be explained by the dominant theories at that time and finally resulted in a new understanding of light as partially having wave characteristics and thus to an advancement in physics. Further examples can be found in psychological science. For instance, due to the weak internal structure of the (postulated) unidimensional Brief Self-Control Scale (BSCS; Tangney et al., 2004)—one of the most widely used questionnaires for measuring general self-control—various researchers used item subsets of the original scale to measure two-dimensional concepts of self-control (for an overview, see Lindner et al., 2015). Despite exhibiting a poor fit, the two dimensions of de Ridder et al. (2011) proved to be powerful predictors of relevant outcomes. Thus, departing from the intended original BSCS was beneficial for the development of new concepts differentiating between inhibitory and initiatory self-control (de Boer et al., 2011; Nilsen et al., 2020).
Even if a simple questionnaire is used, it may flexibly be adapted. Because personality is typically measured by asking questions about behavioral dispositions, resulting questions are naturally only relevant given certain assumptions. To illustrate this, the first item measuring extraversion as part of the FFM (DeYoung et al., 2007) reads “I am the life of the party”, and persons indicate how much they agree with this statement. This makes the assumption that people go to parties, and they must have at least one, preferably more, relevant reference points of a party in their memory to determine whether they were “the life of it.” In this example, reference points of parties might be something that new parents or older people find hard to conjure up and might answer “neither agree nor disagree” despite perhaps actually being extraverted. This example raises the question of whether we need to calibrate scales for different ages and life stages, as has been done in measures of personality disorders (Barendse & Thissen, 2006; van Alphen et al., 2006). The same applies across different cultures. For example, the UPPS-P scale for measuring impulsivity (Cyders et al., 2007) measures sensation seeking with the item “I would enjoy the sensation of skiing very fast down a high mountain slope.” People in different cultures and countries might have never seen snow, and thus find it difficult to imagine what the sensation of skiing would be like. The point here is that it is a hard, perhaps impossible task to measure personality using a single measure that suits people across all ages, life stages, and cultures. To better suit the targeted group, the questionnaire may thus be adapted; that is, the items may be reformulated, some items may be replaced, new items may be added, or a completely different, nonoverlapping set of items may be used in a study (see Horstmann & Ziegler, 2022, who made the case for adjusting the wording of items to fit a certain language level). 
Notably, this does not mean that both (the original and the adapted) measures capture all aspects of a concept equally well.
Sometimes, it might even be a good strategy to choose a measure that emphasizes one aspect more than another when this aspect is more central to the study. If the goal is to predict behavior within a certain domain or context, a more specialized scale or measure might be preferable to a broader instrument (bandwidth-fidelity dilemma; Cronbach & Gleser, 1957; Salgado, 2017). For example, researchers predicting behavior in real-world consumer data (rather than in a laboratory setting) may be more inclined to use a “shopping impulsivity” scale as opposed to a “catch all” impulsive disposition measure. Another example stems from self-concept research. When comparison effects are the target of investigation (e.g., in research on the internal/external frame of reference model), researchers should be aware that these effects are usually stronger when the corresponding comparison processes appear in the item formulations (e.g., “I am better at math than my classmates;” see Wolff et al., 2021). Only if the field uses different measures will we gain insights into which flavor of a concept predicts or does not predict different dependent variables in different populations and contexts. This heterogeneity enables progress in our understanding of hard-to-measure and sometimes hard-to-define concepts, such as personality traits. Of course, such freedom in the choice of measure comes at the price of less standardization, but it also allows measures to be more closely tailored to the requirements of a particular study (see Ziegler, 2014). Admittedly, this does not necessarily contradict Leising et al.’s (2022a) demand for consensus measures because such measures may be developed for different application contexts.
In a similar vein, we think that using different constructs for the same concept can be helpful. In this article, we adopt the view that constructs should be distinguished from concepts. According to this view, a construct is only a proxy that lies between the concept and its indicators (see Rigdon, 2012; see also Uher, 2021, for a deeper discussion). A construct is created to enable operationalization and validation (e.g., by developing a nomological net, Cronbach & Meehl, 1955). In the simplest case, a somewhat different construct may be obtained by slightly modifying the existing one, for example, by allowing correlated uniquenesses, which can become necessary as a result of an imperfect translation of the items into another language (see Schmidt et al., 2017). In other contexts, a different structure or even a completely different type of construct may be needed (see Lu et al., 2023). Constructs can be formed in different ways, one of which is using factor-based methods. These methods use a common factor to explain the correlations between a construct’s indicators. Another, sometimes even more suitable way to form constructs is to use composite-based methods, which combine the indicators into a composite (Hair et al., 2021). Although factor-based methods are more popular in psychology, composite-based methods can be superior to factor-based methods not only from a theoretical point of view but also in practice. An example of such composite-based methods is the partial least squares method, which Wold (1982) and others recommended because models with structural relations between constructs that are formed by this and similar methods exhibit better small-sample properties (Rigdon et al., 2017; Tenenhaus et al., 2005; Zitzmann & Helm, 2021).
Finally, we should acknowledge and appreciate that there may also be different definitions of a concept, each coming with particular strengths as well. To give an example, in social psychological science, the concept of tolerance has been defined as involving liking others’ beliefs, preferences, and practices, or regarding them as something good. However, according to a recent definition by Bernd Simon, tolerance is defined as the attitude that one accepts others’ beliefs, preferences, and practices despite one’s disapproval of them (Simon, 2020; Simon et al., 2019). Importantly, unlike the former definition of tolerance, the latter definition includes disapproval as a definitional condition (Gibson et al., 1992). A somewhat broader definition of tolerance was recently put forward by Verkuyten et al. (2022). The fact that different definitions exist testifies that tolerance allows, if not calls, for them. Each definition sheds a different light on the phenomenon and thus helps deepen our understanding of it (see Zitzmann, Loreth, et al., 2022; see also Fernandes & Aharoni, 2022, who emphasized that conceptual differences are legitimate). It does not hamper science as long as researchers are aware of the differences between these definitions and interpret findings in strict accordance with the specific definition used, a strategy that can also help reduce the risk of jingle-jangle fallacies. One way to achieve this is to address these fallacies openly, as Schmidt et al. (2018) and Keller et al. (2016) did. For example, Keller et al. (2016) found that the enthusiasm that teachers perceive while teaching and the enthusiasm that teachers display were both termed teaching enthusiasm. Ever since, researchers have clearly stated which of the two concepts they used.
Leising et al.’s (2022a) most compelling argument for why consensus measures and possibly also consensus constructs and definitions would be needed in psychology is that efforts to engage in cumulative knowledge building, including meta-analysis, would otherwise be “difficult or even futile” (p. 9). However, such differences are not problematic for meta-analyses, at least when an adequate model is chosen. In meta-analytic research, the mixed-effects model has been the gold standard for two decades now. In this model, observed effect sizes are expressed in terms of true effect sizes and deviations from these true effect sizes. In addition, the true effect sizes in turn are expressed as an average effect size plus study-specific deviations from this average effect size. The variance of the deviations from the true effect sizes defines the sampling error, and the variance of the deviations from the average effect size describes how the study-specific true effect sizes vary around the average effect size. The latter variability is often referred to as “study heterogeneity” (e.g., Overton, 1998). Because a certain amount of this heterogeneity may be due to differences in the measures used for a concept, these differences are inherently accounted for by the model. Hence, statistical tools for accounting for the variation in measures are readily available, and thus, from a statistical point of view, there is no need for reducing the spectrum of different measures to a single one. In addition, the use of multiple measures in combination with random effects allows for generalized conclusions regarding effect sizes, whereas using only a single measure would tie conclusions to this measure. As Yarkoni (2022) pointed out, researchers are usually interested in general conclusions, and thus, they agree that multiple samples need to be drawn in order to generalize findings. If Leising et al.’s (2022a) idea of a consensus measure were applied to samples, this would mean that a single (consensus) group would be studied. It is very clear that this would not allow findings to generalize beyond this group.
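The two-level structure of this model can be sketched in a few lines of Python. This is an illustration only: the function name dersimonian_laird and the input values are hypothetical, and the DerSimonian-Laird moment estimator is merely one of several ways to estimate the heterogeneity variance; it is not taken from the works cited here.

```python
def dersimonian_laird(effects, variances):
    """Pool study effect sizes under a random-effects model.

    effects   : observed effect sizes, one per study
    variances : sampling variances of these effect sizes
    Returns (average effect, its standard error, heterogeneity variance).
    """
    k = len(effects)
    # Inverse-variance weights that ignore heterogeneity (fixed-effect weights)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Moment estimate of the between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Weights that include both sampling error and heterogeneity
    w_star = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return mu, se, tau2


# Hypothetical example: two studies that used different measures
mu, se, tau2 = dersimonian_laird([0.1, 0.5], [0.01, 0.01])
```

In this made-up example, the estimated heterogeneity variance is positive, so the standard error of the average effect is wider than it would be under a model that ignores differences between studies, including differences in the measures used.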
Overall, we agree that a consensus measure can help address open questions posed by a paradigm. However, we think that disregarding other measures for a concept may produce artificially more consistent findings. This means that a consensus measure might prevent the emergence of inconsistent findings, so that the predominant paradigm persists for longer (because inconsistencies accumulate more slowly). As a consequence, there will be less pressure to innovate and potentially less scientific progress in terms of paradigm shifts. While some researchers may opt for a specific measure, construct, and definition of the concept in their studies, it is important to acknowledge that there are many other (potential) measures, constructs, and definitions for the same concept, which represents something valuable rather than a threat to good research.
We fear that if we were to use only consensus measures, consensus constructs, and consensus definitions, this could be seen as a step back to Popperian thinking. Admittedly, a great deal of research in psychology is devoted to Popper’s doctrines, and Leising et al. (2022a) even referred to these ideas in their explanation of good research. However, these ideas have long been superseded by other ideas, such as those by Paul Feyerabend but also by pragmatists (see Albers et al., 2018, for an example of pragmatism in psychology). Similar to us, Feyerabend (2010) argued that the prescription of only one method can even hamper science, and as Hacking (1983) put it, we should not really expect something as colorful as science to be tied to a single method. For example, indices addressing the fit of a model to the data are routinely used to justify a researcher’s decision for or against the validity of the model. However, there are other methods that can help researchers assess whether a model is valid. For example, researchers can make use of theory: When X must impact Y more strongly than the other way around for theoretical reasons, a model indicating the opposite is invalid and should therefore be rejected even when this model fits the data (see Stone, 2021).
Our view is in line with Oreskes (2020), who explicitly warns about methodological fetishism and emphasizes methodological pluralism as a central component of science. She argues that a strong scientific consensus will emerge only if researchers largely arrive at the same conclusion despite using different methods. Similarly, Zitzmann and Loreth (2021) made the case for an “almost anything goes” attitude toward methods (see also Klimstra, 2022, who even included qualitative methods). While researchers should remain open to other methods, a basic scientific framework of logic and evidence still defines the limits (see Hilbig et al., 2022).
It is interesting to note that Leising et al.’s (2022a) idea of a consensus has also been criticized by other commentators, such as Corker (2022), Denissen and Sijtsma (2022), Fernandes and Aharoni (2022), and Hagemann (2022). In a nutshell, their criticisms can be summarized as follows. These authors argued that any consensus measure would be biased, because it expresses what the mainstream thinks is the “right measure.” Even worse, it has been argued that the choice of the consensus measure would potentially be influenced by a few powerful people, and without a means to protect the choice from being influenced too much by these persons, they would essentially determine the consensus measure (e.g., Adler, 2022; Beck et al., 2022; Fedorenko et al., 2022; Galang & Morales, 2022; McLean & Syed, 2022). In other words, the choice would not be grounded in true consensus. Also, commentators have argued that the dictate of a consensus measure would have the potential to devalue research that does not obey it, which can negatively influence an otherwise naturally evolving science (e.g., Asendorpf & Gebauer, 2022; Hilbig et al., 2022; Klimstra, 2022). This latter argument bears some similarity to our argument but differs from it. For example, whereas Asendorpf and Gebauer (2022) argued from an evolutionary perspective on personality science, we adopted Kuhn’s theory of paradigm and an “anarchist view” to argue against consensus, using concrete examples from physics and psychology. We believe that these views add greatly to the discussion by shedding a different light on the issue.
Why a (seemingly) underpowered study can be worth conducting
We strongly agree with Leising et al.’s (2022a) suggestion to always plan studies in such a way that the type II error rate will be as small as possible (i.e., a high level of power). To ensure a high level of power, the authors suggested that power analyses should be performed in advance. However, we question whether researchers should categorically refrain from conducting an underpowered study, especially when constraints exist that seemingly preclude a sufficiently high level of power (e.g., a limited budget). We think that this suggestion deserves some qualification for two reasons: First, a power analysis can be biased in either direction and thus be unreliable (e.g., it can underestimate the actual power of a study). Second, even a truly underpowered study can be worth conducting, because it can still serve as input for related meta-analyses. Regarding the first point, besides other possible reasons (e.g., false assumptions about the data-generating mechanism), this bias may be due to the use of a different estimator to analyze the data although the model is the same. For example, it is well known that Bayesian estimators with a regularizing effect on estimates can be less variable (e.g., they provide smaller standard errors; Greenland, 2000; Zitzmann et al., 2021) and thus also more highly powered than the estimators typically used in power analysis tools such as G*Power (Faul et al., 2007) or PowerUp! (Dong & Maynard, 2013).
To illustrate the effect of choosing a Bayesian estimator on a study’s power, we pick out a specific example from organizational psychological science, but we want to emphasize that downward bias of power analysis is by no means limited to this example or to this field of research. Persons may be assessed by eliciting ratings from a group of others, for example, employees rating their team leaders’ leadership skills (e.g., Croon & van Veldhoven, 2007). The assessed leadership variable can then be related to other variables, such as the employees’ achievement, in order to study the relationship between leadership and employee achievement. As a study’s budget is often given and limited, and the power critically depends on the sample size, researchers might wish to find the optimal numbers of team leaders and employees to maximize the power to detect the relationship of interest under the given budget (e.g., van Breukelen, 2013; Zitzmann, Wagner, et al., 2022). The dotted black line in Figure 1 shows this maximized power based on a power analysis conducted with usual software (e.g., PowerUp!) as a function of the size of the slope in the model. In addition, the figure shows the true power when the data are analyzed with a mildly regularized estimator (i.e., a Bayes estimator with a weak, not necessarily accurate prior distribution). As Zitzmann, Wagner, et al. (2022) argued, with such an estimator, the level of power can be increased. For example, using such an estimator may lead to an acceptably high power of .80, although the initial power analysis suggested a power of only .74. This increase is not very large, but it indicates that studies can still be sufficiently powered even though conventional power analyses did not suggest this.
Of course, researchers could decide on their estimator prior to the power analysis and tailor the power analysis to this estimator, but this is a difficult task for most of them because it requires advanced statistical knowledge, especially when they use anything other than the simplest models. Although our argument is not per se an argument against Leising et al.’s (2022a) demand for sufficiently powered studies, it highlights an issue with the judgment of whether a particular study is sufficiently powered, rendering an a priori power analysis an unreliable indicator of high-quality research.
Figure 1. Power from initial power analysis versus actual power.
The second, perhaps more convincing reason why we question Leising et al.’s (2022a, 2022b) suggestion is that such a study can still be informative when its results are used in subsequent meta-analyses. Even when an underpowered study failed to detect an existing effect, it can still contribute by adding data to a meta-analysis and can thus help reduce uncertainty (i.e., by reducing the standard error).
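A small numerical sketch may help to see why even an imprecise study reduces the standard error of the pooled estimate. It assumes inverse-variance weighting and treats the heterogeneity variance as known and fixed (in practice it is estimated, and re-estimating it can change the picture); the function name pooled_se and all values are hypothetical.

```python
def pooled_se(variances, tau2=0.0):
    """Standard error of an inverse-variance weighted average effect.

    variances : sampling variances of the included studies
    tau2      : heterogeneity variance, assumed known here
    """
    weights = [1.0 / (v + tau2) for v in variances]
    return (1.0 / sum(weights)) ** 0.5


# Three fairly precise studies (hypothetical sampling variances)
precise_studies = [0.01, 0.02, 0.015]
se_before = pooled_se(precise_studies, tau2=0.02)

# Adding one underpowered study with a much larger sampling variance
se_after = pooled_se(precise_studies + [0.20], tau2=0.02)
```

Because every study enters with a positive weight, adding the imprecise study can only increase the sum of the weights, so se_after is smaller than se_before, although the gain is modest.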
An often-neglected feature of meta-analyses is that besides yes-or-no questions (e.g., whether personality affects certain outcomes), these analyses allow researchers to investigate how effects vary across features of the study. This allows researchers to understand under which conditions effects are weaker or stronger. Hence, meta-analyses can also generate new findings. Relevant study features may include the specific measure, construct, and definition of the concept. For instance, to investigate the role of the measure more explicitly, meta-analytic models can be extended by adding a discrete variable with as many categories as there are measures as a moderator for the effect sizes using dummy coding (see Möller et al., 2020, for an example). However, besides such methodological variables, the effect size may also depend on more substantive variables, such as the studied population. In order to identify moderating variables and to quantify their role by using so-called meta-regressions, these analyses should ideally be based on a large body of studies to allow for a robust understanding of the moderations. Needless to say, to make use of underpowered studies and studies that failed to detect an existing effect, these studies need to be published together with their necessary characteristics and statistics to be included in meta-analyses, and it needs to be ensured that any form of bias in publication is minimized because otherwise, this practice may lead to overestimation of effect sizes (Nuijten et al., 2015).
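The dummy-coded moderator analysis can be sketched as a weighted least squares regression of effect sizes on measure indicators. The sketch below is a simplified illustration, not the procedure of any cited study: the function names, the measure labels A, B, and C, and the data are hypothetical, and the heterogeneity variance tau2 is treated as known, whereas dedicated meta-analysis software would estimate it.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def meta_regression(effects, variances, measures, tau2=0.0):
    """Weighted regression of effect sizes on a dummy-coded measure variable."""
    levels = sorted(set(measures))
    reference, others = levels[0], levels[1:]
    # Design matrix: intercept plus one dummy per non-reference measure
    x = [[1.0] + [1.0 if m == lvl else 0.0 for lvl in others] for m in measures]
    w = [1.0 / (v + tau2) for v in variances]
    n, p = len(x), len(x[0])
    # Normal equations of weighted least squares: (X'WX) b = X'Wy
    xtwx = [[sum(w[i] * x[i][r] * x[i][c] for i in range(n)) for c in range(p)]
            for r in range(p)]
    xtwy = [sum(w[i] * x[i][r] * effects[i] for i in range(n)) for r in range(p)]
    names = ["intercept (" + reference + ")"] + [lvl + " vs " + reference for lvl in others]
    return dict(zip(names, solve(xtwx, xtwy)))


# Hypothetical data: five studies using three different measures
coefs = meta_regression(
    effects=[0.30, 0.34, 0.50, 0.54, 0.20],
    variances=[0.01] * 5,
    measures=["A", "A", "B", "B", "C"],
)
```

The intercept estimates the average effect for the reference measure, and each dummy coefficient estimates how much the average effect obtained with another measure differs from it.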
We would like to add that instead of running meta-analyses on (partly) underpowered studies, researchers who are faced with insufficient resources could combine their resources with those of other departments and conduct a multicenter study with sufficient power, as is done in clinical psychology, where such a study is conducted when one department alone cannot raise enough resources for a sufficiently powered study (see the Many Labs and ManyBabies projects or the Psychological Science Accelerator).
Some might criticize our “defense” of underpowered studies by arguing that these studies would result in an inflated study heterogeneity and would be an invitation to do science poorly. Regarding the first point, it is important to note that study heterogeneity is defined as the variation of true effect sizes, τ² (von Hippel, 2015), and thus, it cannot be affected by power. Only the sampling error σ² will be affected. As a consequence, whereas the estimates will be more scattered due to the larger sampling error, the study heterogeneity will remain unaltered. Alternatively, study heterogeneity may be defined as a relative quantity that compares the variation of true effect sizes to the sampling error, τ²/(τ² + σ²) (i.e., the idea behind the prominent I² measure). However, even when study heterogeneity is defined this way, it will not be inflated if studies are underpowered. Rather, I² will decrease in this case because the denominator will be increased through σ². Of course, this only holds true when τ² is held unchanged. Note, however, that with small studies, the statistical power of the commonly used estimate of I² to detect significant study heterogeneity can be low (e.g., Huedo-Medina et al., 2006).
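The behavior of the relative heterogeneity measure can be checked with a short calculation. The sketch uses the population-level ratio given above rather than the sample estimate of I², and the function name and values are hypothetical.

```python
def relative_heterogeneity(tau2, sigma2):
    """Share of total variation that is due to true effect size variation."""
    return tau2 / (tau2 + sigma2)


tau2 = 0.04  # variation of true effect sizes, held constant

# Well-powered studies: small sampling error
i2_high_power = relative_heterogeneity(tau2, sigma2=0.01)

# Underpowered studies: larger sampling error, same tau2
i2_low_power = relative_heterogeneity(tau2, sigma2=0.04)
```

With these values, the ratio drops from 0.8 to 0.5 when the sampling error increases, which illustrates that underpowered studies lower rather than inflate the relative measure as long as the variation of true effect sizes stays the same.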
The second point, that we would invite researchers to do science poorly, is valid only when one accepts the premise that conducting underpowered studies is poor science, which shows that this point is a circular argument.
Further suggestions to improve personality science
We have argued that variation of methods is generally a good rather than a bad thing. However, in practice, this type of variation usually occurs between studies, not within studies. As a consequence, in a given study, method effects cannot be separated from the effect of interest. However, the authors of that study still want to gain insights that are rather independent of the concrete method used, meaning that they want to generalize across methods, although this aim is hardly ever stated explicitly. Specifically, they want to generalize to the size of the effect of interest that would be obtained if all measures from the population of measures were administered to assess the outcome variable. Of course, administering all measures in one study is not practical. Alternatively, researchers could use a subset of different measures (e.g., different questionnaires) and subject the resulting data to a mixed-effects model with a random effect parameter describing the specific contributions of the measures. The effect size obtained from such a model generalizes to the study’s true effect size that would be yielded if all (potentially) available measures had been employed. In other words, generalization across different measures is wanted and certainly possible even in a concrete study (see also Yarkoni, 2022, who suggested that models including random effects be used more routinely in psychology). We think that using different measures or methods in a study in combination with an appropriate statistical approach and thereby allowing for generalized insights can be considered an indicator of high-quality research.
Regarding our second suggestion, it is instructive to note that a great deal of research is concerned with studying relations between concepts. To this end, factor analytic techniques have extensively been applied, particularly in personality science. In the measurement of a concept, several items are typically used, which may in theory all be equal indicators, an assumption that we believe most researchers implicitly make. However, factor analysis tends to find different loadings for the items, because freely estimating the loadings comes with a better model fit. As factor analysis leads to some items defining the meaning of the latent variable more than other items (i.e., they correlate more strongly with the latent variable than the other items), this produces a misfit between our understanding of the concept reflected by the (equally weighted) contents of the items and the actual meaning of the latent variable in a given study (see Robitzsch & Lüdtke, 2022; see also Steger et al., 2022, for a very similar argument). When the concept does not match the latent variable, we may draw faulty conclusions from the data about the concept’s putative mechanisms, correlates, and theories of change. Moreover, this could also be a problem for assessment “in the real world.” For example, questionnaires are frequently used to diagnose persons with clinical disorders or assess whether incarcerated persons may be at risk of recommitting a crime. In these instances, scale scores are typically used in which items are equally weighted. However, if this type of score does not match the latent variable because factor analysis found different loadings, then diagnostic decisions based on scale scores will not necessarily be backed up by factor analytic evidence of validity and predictive capacity.
Thus, when relations between concepts are studied and the (implicit) assumption is that all items reflect the concept equally well, a model with equal loadings should be selected rather than the locally best-fitting one. It is interesting to note that this practice is also in line with a particular reading of classical test theory according to which all items are equally associated with the latent variable, which corresponds to equal loadings in factor analysis (Bollen, 1989; McNeish & Wolf, 2020). Admittedly, theory might prescribe a more nuanced pattern of loadings. In this case, a model with this specific pattern should be specified. As working with sound constructs is desirable, using such a more theory-driven strategy for specifying factor analytic models is good research.
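The difference between freely estimated and equal loadings can be written down compactly. The following is a sketch in standard one-factor notation; the symbols are our assumption for illustration and are not taken from the cited sources. It further assumes that the residuals are uncorrelated with each other and with the latent variable.

```latex
% One-factor model for person i and item j = 1, ..., J (notation assumed)
x_{ij} = \nu_j + \lambda_j \eta_i + \varepsilon_{ij}, \qquad
\operatorname{Var}(\eta_i) = \psi, \quad
\operatorname{Var}(\varepsilon_{ij}) = \theta_j

% Freely estimated loadings: \lambda_1, \dots, \lambda_J may all differ.
% Equal-loadings specification:
\lambda_1 = \lambda_2 = \dots = \lambda_J = \lambda

% Covariances implied by the equal-loadings specification
\operatorname{Cov}(x_{ij}, \eta_i) = \lambda \psi \quad \text{for every } j, \qquad
\operatorname{Cov}(x_{ij}, x_{ik}) = \lambda^2 \psi \quad (j \neq k)
```

Under the constraint, every item covaries with the latent variable to the same degree, so no single item dominates the meaning of the latent variable. With freely estimated loadings, the covariance of item j with the latent variable is instead \lambda_j \psi and thus differs across items.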
Personality scientists are specialists in their discipline. However, we argue that a group of like-minded specialists alone will not be able to answer big questions in a satisfactory manner. Significant advances happen at the borders between fields. They can only be achieved by combining the strengths of many different disciplines or even different sciences. Each field has its own theories and its own methods for obtaining and interpreting data, and combining these ideas may stimulate new theories and research that provide powerful means to address questions. As an example, consider once more the work of Bernd Simon and colleagues. They have laid important foundations for future research on tolerance, benefitting greatly from political science and philosophy (e.g., Brown, 2006; Forst, 2013; Marcuse, 1970; Scanlon, 2003). Consulting other fields of research “paid dividends” in strengthening their own research. Indeed, interdisciplinary research has become increasingly central to academic interest, and Okamura (2019) found that interdisciplinarity increased impact significantly. Thus, in our view, interdisciplinarity is another important quality indicator. However, interdisciplinary research calls for a system that also allows the quality of contributions from other fields to be validly assessed. Instead of trying to create a “one size fits all framework,” which risks being better suited to a certain field, one suggestion is that the contributions be explicitly rated in various categories of scientific rigor. However, this would require other fields to develop their own perspectives on what constitutes good research and their researchers to act as reviewers in order to assess the interdisciplinary work of others.
Why a formal reward system bears the risk of intolerance
Based on their ten steps to improve personality science, Leising et al. (2022a) proposed a formal reward system. The system is described in their article, and it is essentially an array of features that will be rewarded (e.g., a publication will get five reward points if it presents broad consensus regarding measurement practices). Although we agree that there is a dire need to improve the current system, we believe that Leising et al.’s (2022a) system can also be viewed as controversial, especially with regard to what the reward system would imply for researchers. Leising et al. (2022a) themselves mentioned that good research requires more time, effort, and financial resources. As a consequence, the reward system would automatically favor those who are already favored (e.g., researchers at good/established universities with financial resources and more support), thus contributing to inequalities.
Moreover, the proposed reward system reflects a preference for pre-registration and confirmatory work. With many researchers moving toward analyzing existing datasets, often with methods that are exploratory by nature, the reward system may disadvantage these researchers. For example, will the use of bottom-up approaches, such as machine learning, in the analysis of passively collected big data “lose points” and perceived rigor for placing less emphasis on theoretical considerations and well-defined hypotheses? Moreover, the reward system refers to power analyses and sample size planning. When dealing with noisy large datasets and using an algorithmic modeling approach, these steps are not appropriate. Similar penalties are suggested for not providing open data access, which in principle is the ideal scenario but is often not possible in practice, particularly when working on proprietary datasets.
In other words, besides the potentially positive aspects of the reward system (e.g., improving the transparency of research), there is the risk that, driven by categorizations of researchers into “good” or “bad” researchers, the system will lead to conflicts in the research community. Proponents of a reward system that penalizes researchers who, for the reasons outlined above, do not meet these criteria not only disapprove of these researchers (because they disapprove of these researchers’ work), but they might also disrespect these researchers because they do not consider them equal fellow researchers, which would be a clear case of intolerance among researchers. Once researchers who are met with intolerance are discouraged from publishing in the same respected journals, other researchers might not become aware of, or be influenced by, these researchers’ work. Whether researchers must be tolerant at all can be debated. However, we think that without tolerance, a pluralism of methods will not be possible, and without a certain degree of pluralism, science will not flourish and thus not proceed.
Other commentators also focused on the proposed reward system (e.g., Beck et al., 2022; Friedman, 2022; Schmitt, 2022). Similar to our argument, they argued that the system would disadvantage researchers who adhere to other approaches to personality (e.g., Klimstra, 2022; McLean & Syed, 2022). However, unlike these authors and based on a well-established theory of tolerance, we discussed potentially disastrous social-psychological consequences for the community and the resulting adverse effects on the progress of science.
Conclusion
This article is essentially another comment on Leising et al. (2022a). However, we presented arguments and suggestions that go beyond the 20 already published commentaries. While some of the drawbacks of Leising et al.’s (2022a) proposal that we pointed out had been the subject of discussion before, our lines of reasoning differed from these discussions. For example, in our critique of the idea of a consensus measure, we referred to the history of science and illustrated our point with concrete examples. Although Hogan et al.’s (2022) criticism pointed in the same direction by emphasizing the importance of the “context of discovery,” their argument fell somewhat short and remained vague. Moreover, we formulated further ideas for how to improve research: generate findings that generalize across measures and other kinds of methods, specify factor models in stricter accordance with the theory, and conduct interdisciplinary research. Finally, we argued against a reward system that is too strict.
It should be noted that Leising et al. (2022b) themselves took the opportunity to respond to the already published commentaries. In their response, the authors clarified some of their original arguments and even qualified them. For example, they qualified their argument that researchers should use consensus measures by stating that they “advocated the inclusion, not the exclusive use, of such measures” (p. 10 f.). Put differently, they suggested that a consensus measure should be included alongside other measures of the same concept. By doing this, they acknowledged not only the existence of other measures but also their added value. In a way, their appreciation of yet other measures is in line with our plea for methodological pluralism. However, they still seem to think that a consensus measure would come with certain benefits, with the most important one being that this measure is the privileged way to separate substantively relevant from irrelevant influences on effect sizes. As we have argued, the question of the extent to which differences in effect sizes between studies are due to substantively irrelevant factors, such as different measures, can be addressed through meta-analytic methodology as well (but see Gollwitzer & Schwabe, 2022, who preferred replication projects). Furthermore, this methodology employs random effects, thereby allowing researchers to generalize across different measures; that is, it allows them to generalize to the effect size that would be obtained if all (potential) measures were administered (Yarkoni, 2022). While consensus measures may have some minor merits, we still doubt that these merits fully compensate for their drawbacks.
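How random effects enter a meta-analysis can be illustrated with a short Python sketch. It uses the DerSimonian-Laird estimator, which is one common choice and is our assumption here, as no particular estimator is named above. The function name and the numbers are hypothetical and serve only as an illustration. Each study contributes an effect size and its sampling variance; the between-study variance (tau squared) captures differences between studies, such as those arising from the use of different measures.

```python
def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird estimator.

    effects:   study-level effect sizes
    variances: their sampling variances (all must be positive)
    Returns the pooled effect, its standard error, and tau squared.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies and matching input lengths")
    if any(v <= 0 for v in variances):
        raise ValueError("sampling variances must be positive")

    # Fixed-effect (inverse-variance) weights and pooled estimate
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and the method-of-moments estimate of tau squared
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau squared to each sampling variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    standard_error = (1.0 / sum(w_star)) ** 0.5
    return pooled, standard_error, tau2


# Hypothetical effect sizes from three studies that used different measures
pooled, standard_error, tau2 = dersimonian_laird([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
print(pooled, standard_error, tau2)
```

When the studies disagree more than sampling error alone would suggest, tau squared is greater than zero and the standard error of the pooled effect increases accordingly, which reflects the added uncertainty of generalizing across measures.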
We thank Leising et al. (2022a) for taking the initiative to make psychology a better science and for their inspiring ideas, which we used as a springboard to generate further discussion. In line with Leising et al. (2022a), we value cooperation and see the improvement of our science as a collaborative effort. We truly appreciate their initiative, which we view as a first proposal that needs to be evaluated continuously and improved based on the outcome of these evaluations. If we perceived the spirit of their article correctly, discussions and possible future adaptations are welcome. For example, one could debate Leising et al.’s (2022a) main message that the responsibility for doing science well lies with the researchers. In our view, decisive changes must also be made at other levels (Krammer & Svecnik, 2020).
To conclude, rather than recommending Leising et al.’s (2022a) suggestions in all respects, we wish to encourage researchers, reviewers, editors, lecturers, and appointment committees to commit themselves more to a science that is inspiring (sometimes even surprising!), moral, and grounded in creative thinking.
Key insights
• Standardization can hamper science.
• A (seemingly) underpowered study can be worth conducting.
• A formal reward system bears the risk of intolerance.
Relevance statement
In a recent article, which was published in Personality Science, Leising et al. (2022a) proposed ten ways in which personality science can be improved, which may also be applicable to psychological science in general. We are a diverse group of 19 researchers with backgrounds in very different subdisciplines of psychology. What unites us is that we all remain skeptical that Leising et al.’s (2022a) suggestions may sufficiently improve the field. In our article, we point out potential drawbacks of some of the proposed steps, formulate additional ideas of how to improve research, and point to a particular risk associated with any formal reward system, thereby contributing to the ongoing debate.
Supplemental Material
Supplemental Material for “On the role of variation in measures, the worth of underpowered studies, and the need for tolerance among researchers: Some more reflections on Leising et al. from a methodological, statistical, and social-psychological perspective” by Steffen Zitzmann, Wolfgang Wagner, Rosa Lavelle-Hill, Alexander J. Jung, Hayley Jach, Lukas Loreth, Christoph Lindner, Fabian T. C. Schmidt, Peter A. Edelsbrunner, Christoph D. Schaefer, Robert Deutschländer, Stefan K. Schauber, Georg Krammer, Fabian Wolff, Bronson Hui, Christian Fischer, Lisa Bardach, Benjamin Nagengast, and Martin Hecht in Personality Science.
Footnotes
Author note
John F. Rauthmann was the handling editor.
Acknowledgements
Not applicable.
Author contributions
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD
Not applicable.
Data accessibility statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Supplemental material
Supplemental material for this article is available online. Depending on the article type, these usually include a Transparency Checklist, a Transparent Peer Review File, and optional materials from the authors.
Notes
References
