Abstract
This paper critically evaluates the conventional insistence on establishing measurement invariance (MI) in cross-cultural psychology. We argue that complex and seemingly arbitrary benchmarks for assessing MI can be unrealistic and effectively prohibit meaningful research. The widespread use of various MI criteria creates unnecessary and often unattainable hurdles for cross-cultural researchers who have made the effort to collect data in multiple cultural contexts. Additionally, the prohibitionist tone of discussions surrounding MI is unhelpful, unscientific, and discouraging. We argue that emerging findings that cultural differences might not be as widespread or profound as once assumed imply that significant cross-cultural differences in measurement should not be the default assumption. Further, we advocate a shift towards external validity as a more useful metric of measurement quality. Our overall message is that researchers who go to the considerable trouble of gathering data in more than one country should not be disadvantaged compared to researchers who avoid cross-cultural complications by gathering data only at their home campus.
“Measurement invariance” is a term that newcomers to cross-cultural research quickly come to dread. It refers to the statistical criterion typically considered necessary for using a psychological measurement instrument in more than one cultural context. 1 Several years ago, the first author, who was then new to the field, gave a talk presenting some data recently gathered in 20 different countries, using an instrument developed in our lab (the Riverside Situational Q-sort). A member of the audience asked, “What did you do to assess measurement invariance?” After being given no real answer, the questioner shook his head sadly and the speaker worried he had missed something important.
This semi-traumatic episode was only the beginning of our experience with dealing with critical questions about our use, or lack thereof, of assessments of measurement invariance. Judging from conversations with other psychologists and a slew of recent publications on the topic (e.g., Fischer et al., 2022; Robitzsch & Lüdtke, 2022; Welzel et al., 2021), we are not the only ones to encounter this criticism nor the first to start to develop misgivings about the whole enterprise. If a researcher gathers data in multiple cultures and does not assess measurement invariance, then the researcher earns scorn for ignoring the issue. If the researcher does perform the conventional kinds of analyses recommended, the results are typically discouraging.
Disappointing RMSEAs and ΔCFI values bigger than .01 reflect failure to achieve the (possibly unrealistic) standard of equivalence of item intercepts deemed necessary for comparing scores on psychological measures across groups. A member of a symposium we recently organized proposed during his presentation, “If you can show me some real data where strict measurement invariance was achieved across cultures, I shall buy you a beer!” He had no takers. Yet the following message is approaching the status of conventional wisdom: the lack of equivalence in the properties of psychological measures across cultures means that they cannot be used for cross-cultural comparison, and attempts to do so are not just psychometrically ignorant but fatally flawed. We believe this message discourages researchers from entering or continuing cross-cultural research, and that the result will ultimately be not better cross-cultural research, but less of it.
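To make the decision rule concrete, here is a minimal sketch of the conventional ΔCFI procedure. The CFI values, model labels, and the .01 cutoff below are assumptions for illustration only; real values would come from fitted multi-group SEM models, and this is not any particular package's API.

```python
# Sketch of the conventional Delta-CFI decision rule for nested multi-group
# models (configural -> metric -> scalar). All CFI values here are invented
# for illustration; real values would come from fitted SEM models.

MI_LEVELS = ["configural", "metric", "scalar"]

# Hypothetical CFI values from increasingly constrained models
cfi = {"configural": 0.952, "metric": 0.945, "scalar": 0.928}

def invariance_verdicts(cfi, cutoff=0.01):
    """Apply the conventional pass/fail rule at each added constraint:
    a drop in CFI larger than the cutoff counts as a 'failure' of MI."""
    verdicts = {}
    for prev, curr in zip(MI_LEVELS, MI_LEVELS[1:]):
        delta = round(cfi[prev] - cfi[curr], 3)
        verdicts[curr] = ("pass" if delta <= cutoff else "fail", delta)
    return verdicts

print(invariance_verdicts(cfi))
```

On these made-up numbers the metric step “passes” (ΔCFI = .007) while the scalar step “fails” (ΔCFI = .017), illustrating how a single cutoff turns a continuous difference in fit into a binary verdict.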
In this paper, we describe our misgivings about the standard requirement of establishing measurement invariance 2 in cross-cultural psychology, hereafter referred to as MI. Specifically, the wide variety of methods for assessing MI combined with seemingly arbitrary benchmarks make for a complicated set of unrealistic standards that can actively prohibit researchers from reporting meaningful work, and thus act as a brake on the entire field of cross-cultural psychology. To illustrate, we will present examples of measures that lack measurement invariance in its traditional meaning but still prove insightful to the field, and measures that may be highly invariant across groups but mask underlying conceptual issues. We will conclude with some specific suggestions for improving the current state of affairs.
Complicated yet arbitrary benchmarks
Our first point concerns the widespread use of various MI criteria in cross-cultural research. Assessments of MI use complex methods that few researchers appear to fully understand, and typically yield dichotomous evaluative decisions based on seemingly arbitrary benchmarks. The typical answer concerning the origin of whatever specific benchmark is employed is that someone (or an institution such as the Educational Testing Service) published an article recommending it. These well-entrenched metrics, which may include (expected) inter-item correlations, slopes, and intercepts, are taken as compelling in and of themselves, so any further empirical or theoretical justification of the recommended standard generally remains obscure, if it exists at all. The (most likely few) researchers who actually go and study said authoritative article will therefore not necessarily be enlightened. But they will still find that they must follow its “recommendations,” as grant reviewers, journal reviewers, and editors enforce the presence of statistical tests for MI.
It is noteworthy that in almost all cases, the recommended metrics ultimately stemmed from statistical simulations of generated data rather than from real (and messy) cross-cultural data (e.g., Bauer et al., 2020; Cole et al., 2019; Wang et al., 2022). The lack of empirical evidence for established metrics is telling. In 2007, Herbert Marsh challenged fellow members of an SEM association to refute his claim that it is impossible to find good “fit” based on these established metrics using any existing psychological instrument with multiple factors and multiple items per factor (Marsh et al., 2014). No one reported being able to do so. More recent efforts to move away from strict cutoffs and instead treat MI as a continuous outcome rather than a dichotomous decision are notable, but they must still confront the same issue of establishing a standard metric for defining “partial” or “approximate” invariance (Fischer et al., 2022; Robitzsch & Lüdtke, 2022).
The vantage point of empirical cross-cultural researchers, by which we mean the ones who actually gather data in more than one country, is typically different from that of editors, reviewers, and other armchair critics. After having made the considerable effort to obtain data in more than one cultural context and, often, language, empirical researchers then face a black box where they dump their data in one side (such as into a newly publicized R package), and wait for the output on the other side, with fingers crossed. And then, almost always, they get bad news. Cross-cultural differences are “significantly” non-zero. The RMSEAs, CFIs, and other metrics fall short of ideal prescriptions. This outcome is typical when data from two cultural contexts are compared. The outcome is inevitable when more than two cultures and languages are included. There is simply no way to establish MI across, say, 63 different cultural contexts and 40 languages. 3
When the by-now unsurprising lack of measurement invariance is found, researchers may be encouraged to remove items from their measure or adjust scores based on the MI criteria to more closely align with the “true” latent scores for a group. Papers that report differences between traditional and “corrected” mean scores and/or construct correlates cite these differences as evidence that MI corrections are necessary for cross-group comparisons (e.g., Lacko et al., 2022). But there is typically little if any evidence that these new scores are any more valid than the original ones, and the adjustments make it difficult to compare results across past and future studies employing the same scale (Robitzsch & Lüdtke, 2022). While some problematic items might be uncovered, such as discovering that a word has been mistranslated (Meuleman et al., 2022), weak item-total or inter-item correlations are not necessarily harmful to a scale’s external validity, and may in fact be helpful 4 (Revelle, in press). Additionally, as Welzel and colleagues (2021) illustrated with data from the World Values Survey, items that show larger group differences tend to have less within-group variance, and yet such variance is statistically necessary for finding high inter-item correlations. Thus, subsetting measures to include only items with strong inter-item connections can reduce group differences by removing extreme items that have very little variance despite carrying meaningful external correlates across countries.
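The logic of this last point can be illustrated with a small simulation. Everything below is simulated data under invented parameters, not Welzel and colleagues' actual analysis: an item that carries a large between-country difference piles up against the scale ceiling in the high-scoring country, losing exactly the within-group variance that inter-item and item-trait correlations depend on.

```python
# Simulated illustration: a Likert-style item with a large between-group
# mean difference loses within-group variance to ceiling effects, which
# attenuates its correlation with the underlying trait -- even though it
# still carries the group difference. All parameters are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Latent trait in two "countries" with a large mean difference
trait_a = rng.normal(0.0, 1.0, n)
trait_b = rng.normal(2.0, 1.0, n)

def likert_item(trait):
    """A 1-7 item driven by the trait, clipped at the scale endpoints."""
    raw = 4.0 + 1.5 * trait + rng.normal(0.0, 1.0, len(trait))
    return np.clip(raw, 1.0, 7.0)

item_a, item_b = likert_item(trait_a), likert_item(trait_b)

var_a, var_b = item_a.var(), item_b.var()
r_a = np.corrcoef(item_a, trait_a)[0, 1]
r_b = np.corrcoef(item_b, trait_b)[0, 1]
gap = item_b.mean() - item_a.mean()

print(f"variance:     A={var_a:.2f}  B={var_b:.2f}")  # variance collapses in B
print(f"item-trait r: A={r_a:.2f}  B={r_b:.2f}")      # correlation drops in B
print(f"mean gap:     {gap:.2f}")                     # group difference remains
```

Dropping such an item because of its weak within-group correlations would remove precisely the item that registers the cross-country difference.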
Against prohibition
Discussions of MI often have a prohibitionist tone. A “failure” (that’s the actual word used most often) to achieve MI by conventional criteria is not typically treated as a scientific finding of interest in its own right (but see Fischer et al., 2022 for a recent exception). Rather, it is more often given as a reason, even a “violation” (another often-used word) that implies one should not take the cross-cultural data seriously, or sometimes, even look at them. In a recent paper in which MI was not achieved, the authors primly stated that, therefore, they did not examine their data any further. No wonder they didn’t dare, in the face of a recently published warning that “widespread hidden invalidity in the measures we use... pose[s] a threat to many research findings” (Hussey & Hughes, 2020, p. 166).
Such a prohibitionist tone goes too far. First, the amount of non-invariance required to throw substantive results into question is far from clear and, as noted above, often is evaluated on the basis of seemingly arbitrary benchmarks. Second, the implications of a “failure” of MI depend on the kind of MI one decides to insist on. Do you want to interpret correlations among measures within countries? Then configural MI is enough, and it is indeed often found (e.g., Aluja et al., 2019). But what if you want to interpret mean differences between countries? Well, then you are said to need “strict” or scalar invariance, which is a very high bar seldom attained.
More balanced treatments of failures of scalar invariance that avoid the overly prohibitionist tone may instead say something like “all questionnaires showed some noninvariance across countries, indicating that caution needs to be exercised when investigating and interpreting mean differences” (Wetzel et al., 2021, p. 40). The careful, moderate tone of the quote is appreciated, but still: what is this finely worded advice supposed to mean? That otherwise you can throw caution to the winds when making mean-level comparisons across groups? Of course not. Caution is warranted regardless of the evidence for or against MI, as it is for any scientific study, rendering the conventional pleas for careful interpretation meaningless, and the very motivation for assessing MI unclear.
Meaningful cross-cultural differences should not be the default assumption
An emerging, and perhaps surprising, theme in cross-cultural research is that differences in psychological attributes and processes are turning out to be much smaller than many of us initially expected (Allik & Realo, 2017). The touted fundamental differences between East and West were not only almost absurdly simplistic (Asia is a big and diverse place, as is Europe) but also have turned out to be smaller, less profound, and less consistent than first assumed. East Asia contains many individualists, Europe and North America have a fair number of collectivists, and while there still might be overall differences, the distributions of individuals in these categories overlap considerably (Takano, in press).
The results of our own international project, which gathered data from more than 60 countries around the world, surprised us as well. We began by expecting that large and meaningful cultural differences would be ubiquitous across the many measures we employed. As we have analyzed and explored our large and complex data set over the past several years, this expectation has not been borne out. For example, in one set of analyses, we examined two measures of happiness, one developed in the US and the other developed in Japan, that were purported to be profoundly different because of the contrasting views of happiness in these two countries. But the data revealed correlates and other results that were much more similar than different around the world. Two countries in which the two measures had particularly similar psychometric properties were, ironically, the US and Japan (Gardiner et al., 2020).
We also analyzed situational experience around the world in two samples, one with participants in 20 countries and one with participants from 64 countries. In both samples, we found that individual experiences within countries were more similar to each other than experiences compared across countries, as we expected, but the difference was surprisingly small and indeed, just barely reached statistical significance even with sample sizes in the thousands (Guillaume et al., 2016; Lee et al., 2020). More generally, the distinguished and pioneering cross-cultural researcher Allik (2005) has written about how personality variation across countries has turned out to be (unexpectedly) small compared with variation within countries (see also Hanel et al., 2018). In the face of all this, how long can we maintain the default assumption that members of different cultures are importantly different from each other in their interpretation of scale items unless “proven” to be the same, and a conventional wisdom that cultural variation in the basic properties of well-established measurement instruments is typically large, consequential, and maybe even fatal?
Consider, again, the nature of cross-cultural versus within-culture variation. Perhaps, indeed, the items on the BFI-2 extraversion scale have a different meaning for someone living in Japan than they do for you. 5 But might they not also, to some degree, have a different meaning for your next-door neighbor than they do for you? And can we assume that the former difference in meaning is really all that different from the latter difference? Juri Allik’s conclusions give reason to be uncertain. Measurement instruments surely have at least somewhat different properties and implications for different individuals. But the data we have seen do not offer a strong basis to presuppose that these properties and implications necessarily vary to any importantly consequential degree according to whether the individuals in question reside in the same or different countries. Perhaps, sometimes, they do. But the burden of proof seems misplaced, given what we are learning about cultural variation elsewhere.
A shift towards external validity
Common applications of MI focus on internal validity and pay little if any attention to external validity, which is a much more appropriate metric of evaluation. Consider two illustrative examples.
First, it is possible to measure a construct with a high degree of internal consistency across groups but still draw incorrect theoretical conclusions when external validity is not the focus. Consider the construct of individualism versus collectivism, one of the first major theories to emerge from cross-cultural psychology and arguably still the most widely referenced today (Smith & Bond, 2022). Variations in questionnaires for measuring collectivism have been developed and refined over the years, often with the goal of improving measurement equivalence across cultural groups, such as by adding reversed items to lessen the effects of response-style differences (Vignoles et al., 2016). But the measurement of collectivism across cultures has still run into issues, notably that even after establishing measurement equivalence between groups, the theoretically expected mean-level differences are not found (Wong et al., 2018).
However, as illustrated most recently by Talhelm (2022), the issue with measuring collectivism may not stem from response style biases, detectable and correctable using MI techniques, but from the conceptualization of the measure itself. Specifically, the content of items in most measures of individualism/collectivism is developed by what Talhelm has called a biased view of collectivism as a “socialist utopia” that is inferred rather than observed by cultural psychologists. Only after looking at mean level comparisons and country correlates does this issue with measuring collectivism become apparent. In this case, researchers can establish technical measurement equivalence—that their questionnaire of collectivism is measuring the same thing across groups—but still not actually measure the originally targeted construct (see Takano in press, for a related critique).
A second example of the importance of external validity is offered by the famous World Happiness Report rankings from the Gallup World Poll. These rankings are widely referenced not just in scientific research but in pop culture as well. The rankings stem from a single item that asks people to assess their current standing relative to the best possible life they could imagine; the average of those ratings is then compared across countries. A plausible theoretical case could be made that someone’s “ideal life” may be culturally specific and thus not comparable across cultures, or indeed even across people within a culture. However, the question of MI for the World Happiness Report has been generally ignored, largely because its use of a single measurement item protects it from the application of any traditional MI techniques that rely on inter-item relationships. This lack of demonstrated MI, though, has not stopped the yearly country rankings from being widely reported in the media, cited in government policy debates, or even, in the case of Denmark, proclaimed on a welcome banner to international arrivals at the Copenhagen airport. 6
Instead, evidence for the validity of single item measures of happiness comes from external validity, both through convergent correlations with other measures of well-being (e.g., Cheung & Lucas, 2014) and from meaningful country-level correlations. For example, average well-being scores are higher in more developed countries where incomes are higher and people are healthier, two well-established predictors of greater well-being among individuals. In countries that face sudden conflict or other negative events, longitudinal studies have observed expected declines in average country level well-being (e.g., before and during the Syrian civil war—Cheung et al., 2020). This external evidence offers convincing grounds for concluding that the World Happiness Report enjoys considerable validity for international comparisons, notwithstanding its lack of conventional establishment of MI.
These two examples highlight a key critique of MI: Conventional assessments of MI are completely internal to the measurement instruments. They focus on the structure of the latent factors of the instruments, and the degree to which this structure is maintained across contexts, and—even more stringently—the intercepts of the items on latent traits or factors. 7 This may be all well and good, but internal validity is not the same as external validity and the former is actually not even always necessary for the latter (Revelle, in press). As was noted in a footnote earlier in this paper, this fact has long been understood if not always remembered. The classic examples in personality psychology are the MMPI and the CPI (California Psychological Inventory), the scales of which have well-established validity-in-use for predicting important outcomes, but which “fail” many conventional psychometric tests of internal reliability and factorial homogeneity.
To be clear, we do not advocate for the complete abandonment of reporting internal reliabilities. Examining the properties of an instrument’s performance is still a crucial component to any scientific report, and studies in cross-cultural psychology are no exception. We support efforts to compare the internal reliability of measures across cultural groups, such as comparing patterns of inter-item relationships (Gardiner et al., 2019) or patterns of factor loadings (e.g., Hogan Data Science, 2023), as useful steps in checking translation accuracy. But at present there is an overreliance on strict metrics that are entirely based on internal properties of the measurement for determining the quality of the data.
A useful future direction would be to move away from the almost exclusive focus on the internal properties of our measurement instruments in favor of increased emphasis on external validity (Welzel & Inglehart, 2016). This can and should be done at both the cultural and individual level. At the cultural level, research could relate average levels of measures to other country-level variables (e.g., Mottus et al., 2010; Allik & Realo, 2017), as illustrated by the research on country-level correlates of happiness discussed above. But lest we fall prey to the ecological fallacy, this kind of research must be complemented by investigations at the individual level, assessing to what degree and when a measure’s correlations with other psychological variables are maintained across cultural contexts: for example, whether a measure of happiness correlates with other indicators of well-being within and across some, many, or all countries (e.g., Diener et al., 2010; Gardiner et al., in press). This kind of convergence would be more persuasive evidence for the cross-cultural validity of a measure than even the finest-grained demonstration of strict MI.
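The ecological-fallacy worry can be made concrete with a toy simulation (all numbers below are invented): data constructed so that two variables correlate positively within every country while their country-level means correlate negatively, showing why country-level and individual-level analyses must both be done.

```python
# Toy demonstration of the ecological fallacy: within-country and
# country-level correlations between the same two variables can have
# opposite signs. All data are simulated under invented parameters.
import numpy as np

rng = np.random.default_rng(1)
n_countries, n_per = 10, 500

xs, ys = [], []
for i in range(n_countries):
    mu_x, mu_y = float(i), float(n_countries - i)  # means move in opposite directions
    x = mu_x + rng.normal(0.0, 1.0, n_per)
    y = mu_y + 0.6 * (x - mu_x) + rng.normal(0.0, 1.0, n_per)  # positive within-country link
    xs.append(x)
    ys.append(y)

# Mean within-country correlation (positive by construction)
within_r = float(np.mean([np.corrcoef(x, y)[0, 1] for x, y in zip(xs, ys)]))

# Country-level correlation of the means (negative by construction)
country_r = float(np.corrcoef([x.mean() for x in xs],
                              [y.mean() for y in ys])[0, 1])

print(f"within-country r: {within_r:.2f}")
print(f"country-level r:  {country_r:.2f}")
```

A researcher who looked only at the country-level means here would conclude the two variables are negatively related, the opposite of what holds for every individual-level sample.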
Even better—and even more difficult—a measure used in more than one country could be compared in its associations with actual behavior, something that despite psychology’s self-definition as the “science of behavior” continues to get less attention than it should (Baumeister et al., 2007). One small step in the right direction is the use of “anchoring vignettes,” in which respondents read hypothetical descriptions of an individual and are asked to rate their traits (Mottus et al., 2012; Talhelm, 2022). Another promising approach was demonstrated by a study that assessed differences in sociability between residents of Mexico and the United States using naturalistic audio recordings as well as self-reports (Ramírez-Esparza et al., 2009). Research like these examples may lead to an eventual gold standard for cross-cultural psychology, in which behavioral data, and not just self-reports, are compared across cultures. To do this will be difficult and expensive. But we must, sooner or later.
Summary and recommendations
Reverse the default
We noted earlier in this paper that presumptions of profound cultural differences are becoming increasingly untenable as evidence of cross-cultural similarity accumulates. This is not to say that such differences do not exist or are not important, but we are willing to argue that they should no longer be presumed, but demonstrated. Indeed, we are led to wonder what the current state of cross-cultural research would be if critics were required to demonstrate meaningful measurement variance rather than “failures” of invariance. In every other area of psychology save this one, the presumption is that experimental conditions or groups must be considered equivalent until shown otherwise. Similarly, we suggest that researchers concerned about (or, better, just interested in) differences in measurement properties across cultures be required, first, to demonstrate that they are meaningfully large, not simply “significantly” non-zero, and second, to explain why and how those differences matter.
Emphasize external validity
This paper has noted several times that internal validity is no guarantee or substitute for external validity, and the psychological meaning and predictive utility of any measurement depends more on the latter than the former. We suggest that future cross-cultural research decrease its reliance on statistical evaluations of the internal structure of its measures, and increase attention to properties of individuals, characteristics of cultures, and important outcomes that the measures might (or might not) be associated with.
The data are the data
This is our most important point. Researchers who go to the considerable trouble of gathering data in more than one country should not be discouraged from doing so, should not be prohibited from analyzing their data in any way they find informative, and certainly should not be disadvantaged compared to researchers who avoid cross-cultural complications by gathering data only at their home campus. Of course, interpretations should be appropriately cautious, but this warning is a truism that applies to all research of any kind. You never really know for sure what the scores on your measures mean; all you can do is try to triangulate them with other data and interpret the patterns—and even the mean differences—that emerge the best you can. This is a worthy endeavor and indeed, the essence of scientific activity. The issue of “measurement invariance” should not be allowed to inhibit it.
Author’s note
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the NSF Grants BCS-0642243, BCS-1052638, and BCS-1528131. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the individual researchers and do not necessarily reflect the views of the National Science Foundation.
