Abstract
Moderated multiple regression (MMR) is frequently used to test moderation hypotheses in the behavioral and social sciences. In MMR with a categorical moderator, between-groups heteroscedasticity is not uncommon and can inflate Type I error rates or reduce statistical power. Compared with research on remedial procedures that can mitigate the effects of this violated assumption, less research attention has focused on statistical procedures that can be used to detect between-groups heteroscedasticity. In the current article, we briefly review such procedures. Then, using Monte Carlo methods, we compare the performance of various procedures that can be used to detect between-groups heteroscedasticity in MMR with a categorical moderator, including a heuristic method and a variant of a procedure suggested by O’Brien. Of the various procedures, the heuristic method had the greatest statistical power at the expense of inflated Type I error rates. Otherwise, assuming that the normality assumption has not been violated, Bartlett’s test generally had the greatest statistical power when direct pairing occurs (i.e., when the group with the largest sample size has the largest error variance). In contrast, O’Brien’s procedure tended to have the greatest power when there was indirect pairing (i.e., when the group with the largest sample size has the smallest error variance). We conclude with recommendations for researchers and practitioners in the behavioral and social sciences.
Testing for the equality of regression slopes is frequently conducted in the behavioral and social sciences. Evidence of this can be found in research on differential prediction (Aguinis, Culpepper, & Pierce, 2010; American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999; Saad & Sackett, 2002) and analysis of covariance (Fox, 2008; Huitema, 1980; Rutherford, 1992). Testing for the equality of regression slopes is equivalent to testing whether the relationship between a continuous outcome and a continuous predictor differs depending on a third variable—a moderator (Saunders, 1956; Stone-Romero & Liakhovitski, 2002).
The study of moderator variables, in general, is important for theory development and knowledge cumulation in education, management, industrial-organizational psychology, and related disciplines. Consistent with this, Hall and Rosenthal (1991) noted,
If we want to know how well we are doing in the biological, psychological, and social sciences, an index that will serve us well is how far we have advanced in our understanding of the moderator variables of our field. (p. 447)
Although a variety of procedures exist for detecting the effects of continuous and categorical moderators (Stone-Romero & Liakhovitski, 2002; Zedeck, 1971), researchers have noted that moderated multiple regression (MMR) has become the major procedure for testing hypotheses involving categorical moderators (Aguinis, 2004; Overton, 2001; Sackett & Wilk, 1994; Shieh, 2009).
Regrettably, in MMR with a categorical moderator, it is not uncommon to violate the homoscedasticity assumption (see Aguinis & Pierce, 1998; DeShon & Alexander, 1996; Overton, 2001), which can lead to inflated Type I error rates or reduced statistical power (DeShon & Alexander, 1996; Ng & Wilcox, 2010; Overton, 2001). More specifically, in MMR, the form of heteroscedasticity that can manifest is one in which the error variance differs across the levels of a categorical moderator (e.g., gender; for a review, see Aguinis, 2004; DeShon & Alexander, 1996; Ng & Wilcox, 2010; Rosopa, Schaffer, & Schroeder, 2013; Wilcox, 1997); stated another way, between-groups heteroscedasticity exists (Ng & Wilcox, 2010).
Based on a review of three journals (
Although there exist a number of remedial procedures (Rosopa et al., 2013) that can be used to mitigate the effects of between-groups heteroscedasticity in MMR, including the use of statistical approximations (Alexander & Govern, 1994; DeShon & Alexander, 1994; Shieh, 2009), robust methods (Cribari-Neto, 2004; Long & Ervin, 2000; Wilcox, 2005), and weighted least squares regression (Overton, 2001; Rosopa, 2006), less research attention has focused on statistical procedures that can be used to detect between-groups heteroscedasticity in MMR. Currently, there is no empirical research that systematically compares the various approaches that can be used to detect between-groups heteroscedasticity. Thus, consistent with recommendations by Rosopa et al. (2013), one major purpose of the present article is to compare the performance of various procedures that can be used to detect between-groups heteroscedasticity in MMR with a categorical moderator.
Although researchers across diverse disciplines (e.g., econometrics, psychology, and statistics) have suggested different approaches for detecting heteroscedasticity in general (Rosopa et al., 2013), some procedures are sensitive to non-normality. A robust approach by O’Brien (1979, 1981), however, has been recommended for use in ANOVA. Thus, another purpose of the present article is to suggest a variation of O’Brien’s procedure that can be used for instances in which a researcher is interested in testing for the equality of regression slopes.
Our article is divided into four major sections. First, we formally define the model used in MMR with a categorical moderator. Second, we describe between-groups heteroscedasticity and its biasing effects. Third, we review various procedures that can be used to detect between-groups heteroscedasticity, including O’Brien’s (1979, 1981) procedure. Fourth, we describe the results of a Monte Carlo simulation designed to assess the relative performance of various procedures that can be used to detect between-groups heteroscedasticity.
MMR With a Continuous Predictor and a Categorical Moderator
When testing for the equality of
When
for
More generally, for
Note that normally distributed ε
In Equation 1, the
(Fox, 2008; Rencher, 2000). Note the common variance on the main diagonal in Equation 2. Heteroscedasticity, in contrast, is said to exist when these variances are no longer equal. This can be denoted by
where
Between-Groups Heteroscedasticity and Its Biasing Effects
Extant research has found that between-groups heteroscedasticity can affect statistical inferences (e.g., increased Type I or Type II error rates) and these effects are nontrivial (DeShon & Alexander, 1996; Dretzke, Levin, & Serlin, 1982; Ng & Wilcox, 2010; Overton, 2001).
The error variance in the
where
Inspection of Equation 4 shows that if
where
This violated assumption has biasing effects on the Type I error rates and the statistical power of MMR whether subgroup sample size (
With between-groups heteroscedasticity and unequal
To exacerbate matters, unequal
Overall, the biasing effects of between-groups heteroscedasticity on Type I error rates and statistical power can have implications for both theory development and practice in the behavioral and social sciences (Aguinis & Pierce, 1998; Oswald, Saad, & Sackett, 2000; Rosopa et al., 2013). For example, assume that sample sizes are unequal between two independent groups (e.g., male vs. female) and between-groups heteroscedasticity exists such that the larger error variance is paired with the group with the larger sample size (i.e., direct pairing). Furthermore, assume that the researcher/practitioner failed to detect a hypothesized slope difference between groups (i.e., between males and females) that actually exists in the population. Stated differently, the failure to detect a hypothesized moderating effect that exists in the population might be due to the influence of between-groups heteroscedasticity. This seems plausible considering that, for decades, researchers have noted the problem of failing to detect hypothesized moderators using MMR (Aguinis & Stone-Romero, 1997; McClelland & Judd, 1993; Zedeck, 1971). Conversely, as detailed in a review by Aguinis and Pierce (1998), inflated Type I error rates could lead to the publication of specious results.
As noted above, researchers have identified a number of alternatives to MMR when between-groups heteroscedasticity exists. For example, DeShon and Alexander (1996) conducted a comprehensive Monte Carlo study evaluating the relative performance of various statistical approximations, with two statistical approximations (viz.,
Because violation of the between-groups homoscedasticity assumption can affect the Type I error rates and power of MMR, it would be useful to assess whether this assumption has been violated. The following section considers this issue.
A Review of Procedures for Detecting Between-Groups Heteroscedasticity
An issue seldom raised by researchers or practitioners in the context of MMR is how to detect violations of the between-groups homoscedasticity assumption. Although Aguinis (2004) explained that there are two methods (to be noted below) for evaluating whether the assumption has been violated, any procedure that can be used to test the equality of
In addition, although a number of studies have compared various tests for homogeneity of variances specifically in ANOVA (see, for example, Conover et al., 1981; Martin & Games, 1977), we could not find any studies involving MMR with between-groups heteroscedasticity and the effects of direct and indirect pairing. For example, a simulation conducted by Conover et al. (1981) involved a one-way ANOVA with four independent groups, and they included only direct pairing conditions when
As noted above, Aguinis (2004) mentioned two methods for detecting between-groups heteroscedasticity. One was a heuristic method suggested by DeShon and Alexander (1996). The second was a statistical test by Bartlett (1937). In the sections that follow, we describe these and other procedures that could be used to detect between-groups heteroscedasticity.
Heuristic Method
DeShon and Alexander (1996) described a heuristic method to signal whether the between-groups homoscedasticity assumption has been violated to such a degree as to unduly influence the results of MMR analyses. Specifically, when a researcher calculates the variance of the residuals separately within each of the
Note that the heuristic method is not a statistical test, but rather a rule-of-thumb, and its statistical performance, in terms of Type I error or power, has not been examined. As a rule-of-thumb, the heuristic method may not possess the desirable property of being robust at any Type I error rate (α). That is, regardless of α (e.g., .01 or .05), a researcher would conclude that heteroscedasticity exists if the ratio (based on sample estimates of two variances) exceeds 1.5. However, the heuristic method was included in the present simulation to assess its performance relative to other procedures.
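As a rough illustration of this rule-of-thumb, the Python sketch below computes residuals separately within each of two groups and flags a variance ratio above 1.5. With a dichotomous moderator and a product term in the model, the MMR residuals coincide with those from separate within-group regressions, which is what the sketch relies on. The function names, the toy data, and the choice of n − 2 residual degrees of freedom per group are assumptions made for this illustration; they are not taken from DeShon and Alexander (1996).

```python
import statistics


def ols_residuals(x, y):
    """Residuals from a simple regression of y on x within one group."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def residual_variance(resid, n_params=2):
    """Residual variance; the n - n_params denominator is an assumption."""
    return sum(e ** 2 for e in resid) / (len(resid) - n_params)


def heuristic_flag(resid_1, resid_2, cutoff=1.5):
    """Return (larger variance / smaller variance, ratio > cutoff)."""
    v1, v2 = residual_variance(resid_1), residual_variance(resid_2)
    ratio = max(v1, v2) / min(v1, v2)
    return ratio, ratio > cutoff


# Toy data (hypothetical): one predictor, two groups.
x = [1, 2, 3, 4, 5, 6]
y_group_1 = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1]
y_group_2 = [0.5, 3.0, 2.0, 5.5, 3.5, 7.0]
ratio, flagged = heuristic_flag(ols_residuals(x, y_group_1),
                                ols_residuals(x, y_group_2))
print(f"variance ratio = {ratio:.2f}, exceeds 1.5: {flagged}")
```

Because the cutoff is fixed, the decision does not change with the nominal α, which is the property discussed in the preceding paragraph.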
Bartlett
Bartlett (1937) developed a procedure that can be used to test for homogeneity of variances by conducting a transformation of the variances. This test involves transforming the variances of the residuals across the levels of
and
where
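For readers who want to see the mechanics, the following Python sketch applies the textbook form of Bartlett's statistic to the residuals of two groups. It is a sketch under stated assumptions, not a reproduction of the article's equations: the function name is hypothetical, each group is assumed to have n − 2 residual degrees of freedom (an intercept and a slope estimated per group), and the p-value uses the closed form of the chi-square distribution with 1 degree of freedom, so it applies only to the two-group case.

```python
import math


def bartlett_two_groups(resid_1, resid_2, n_params_per_group=2):
    """Textbook Bartlett statistic for two groups of OLS residuals.

    Assumes residuals have mean zero within each group (true for OLS
    with an intercept) and n_j - n_params_per_group degrees of freedom.
    """
    groups = [resid_1, resid_2]
    dfs = [len(g) - n_params_per_group for g in groups]
    variances = [sum(e ** 2 for e in g) / df for g, df in zip(groups, dfs)]
    df_total = sum(dfs)
    pooled = sum(df * v for df, v in zip(dfs, variances)) / df_total

    # M compares the log of the pooled variance with the group log variances.
    m = df_total * math.log(pooled) - sum(
        df * math.log(v) for df, v in zip(dfs, variances))
    # Bartlett's correction factor for k = 2 groups.
    k = len(groups)
    c = 1 + (sum(1 / df for df in dfs) - 1 / df_total) / (3 * (k - 1))

    stat = max(m / c, 0.0)  # guard against tiny negative rounding error
    # Upper-tail chi-square probability with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical residuals: the second group is three times as variable.
stat, p_value = bartlett_two_groups([1, -1] * 5, [3, -3] * 5)
print(f"Bartlett statistic = {stat:.3f}, p = {p_value:.4f}")
```

Note that off-the-shelf routines such as `scipy.stats.bartlett` use n − 1 degrees of freedom per group, so their results can differ slightly from a residual-based version like this one.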
Brown and Forsythe
To detect heteroscedasticity in the context of ANOVA, Brown and Forsythe (1974) suggested conducting a one-way ANOVA on the absolute value of the residuals around the group median instead of the mean (cf. Levene, 1960). Based on simulations conducted by Conover et al. (1981), tests for homogeneity of variances based on the median tend to control Type I error rates better than tests based on the mean. Brown and Forsythe’s procedure is relatively straightforward and appears to be less affected by skewed data in unbalanced designs than other procedures, while still providing adequate statistical power (Lix, 1996). In addition, because of its computational ease, it may be a very practical procedure for researchers and practitioners (Boos & Brownie, 2004; Conover et al., 1981).
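The computation described above can be sketched in a few lines of Python: take absolute deviations from each group's median, then run a one-way ANOVA on those deviations. The function name and the toy values are hypothetical; in the MMR setting the inputs would be the residuals for each level of the moderator. The sketch returns the F statistic and its degrees of freedom rather than a p-value, so that it needs only the standard library.

```python
import statistics


def brown_forsythe(groups):
    """One-way ANOVA F on |value - group median|.

    Returns (F, df_between, df_within).
    """
    # Step 1: absolute deviations from each group's median.
    z = [[abs(v - statistics.median(g)) for v in g] for g in groups]

    # Step 2: ordinary one-way ANOVA on the transformed scores.
    n_total = sum(len(g) for g in z)
    grand_mean = sum(sum(g) for g in z) / n_total
    group_means = [statistics.fmean(g) for g in z]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(z, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(z, group_means) for v in g)
    df_between, df_within = len(z) - 1, n_total - len(z)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical residuals for two groups.
f_stat, df1, df2 = brown_forsythe([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]])
print(f"F({df1}, {df2}) = {f_stat:.3f}")
```

The p-value is the upper tail of the F distribution with these degrees of freedom (for example, `scipy.stats.f.sf(f_stat, df1, df2)`); `scipy.stats.levene` with `center='median'` carries out the same test directly.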
Score
The score test, developed independently in the econometrics (Breusch & Pagan, 1979) and statistics (Cook & Weisberg, 1983) literature, can be used to detect various forms of heteroscedasticity. For example, the score test can be used to test whether error variances differ as a function of continuous predictors, categorical predictors, or predicted values. This procedure requires two regression analyses. In the first analysis, the sum of squares error (
Although the score test is not frequently used in the behavioral sciences, this procedure was included in the present study because of its flexibility. In addition, because the components needed for the statistical test are based on two regression equations (i.e., customized syntax or a stand-alone program is not required), this procedure would be generally accessible for a wide variety of users.
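To make the two-regression logic concrete, the Python sketch below uses the standard Breusch-Pagan / Cook-Weisberg form of the score statistic, S = SSR* / [2(SSE/n)²], where SSE is the error sum of squares from the first regression and SSR* is the regression sum of squares obtained when the squared residuals are regressed on the variable thought to drive the variance (here, a group dummy). This is an assumed textbook form rather than a transcription of the article's equations, the function name is hypothetical, and the closed-form p-value is valid only for 1 degree of freedom (a dichotomous moderator).

```python
import math


def score_test_two_groups(resid_1, resid_2):
    """Score test for a variance difference between two groups.

    Inputs are OLS residuals from the mean model, split by group.
    """
    squared = [[e * e for e in resid_1], [e * e for e in resid_2]]
    n = len(resid_1) + len(resid_2)

    # First regression: SSE and the ML variance estimate SSE / n.
    sse = sum(sum(g) for g in squared)
    sigma2_ml = sse / n

    # Second regression: squared residuals on a group dummy. The fitted
    # values are the group means of the squared residuals, so SSR* is
    # the between-group sum of squares of the squared residuals.
    grand_mean = sse / n
    ss_reg = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in squared)

    stat = ss_reg / (2 * sigma2_ml ** 2)
    # Upper-tail chi-square probability with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical residuals: the second group is three times as variable.
stat, p_value = score_test_two_groups([1, -1] * 5, [3, -3] * 5)
print(f"score statistic = {stat:.2f}, p = {p_value:.4f}")
```

Replacing the group dummy with a continuous predictor or the predicted values in the second regression gives the other forms of the test mentioned above.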
O’Brien
Analogous to testing for the main and interactive effects in ANOVA, O’Brien (1979, 1981) developed a procedure that could be used to test for the main and interactive effects of the variances in the cells of one-way and factorial designs. This robust procedure has been recommended even when the normality assumption is violated (Maxwell & Delaney, 2000; O’Brien, 1979, 1981). An especially lucid description of the procedure can be found in Maxwell and Delaney (2000).
Because O’Brien’s (1979, 1981) procedure is limited to designs that have only categorical predictors (e.g., one-way and factorial ANOVAs), it would be useful to generalize this method to designs that include categorical and continuous predictors. Below, we describe how O’Brien’s (1979, 1981) method can be used where hypotheses involving the equality of regression slopes are being tested. Here, we focus on a dichotomous moderator.
The modified procedure requires three steps. The first step is to calculate the residuals (
The second step involves a transformation of each of the individual residuals. This calculation is achieved using the following equation:
To check the calculations, the average of the
The third step is to conduct a two-independent-samples
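The three steps can be sketched in Python as follows. This is an illustration under explicit assumptions, not the article's modified equation: the transformation is the textbook form of O'Brien's (1981) transformation applied to residuals, the group variance uses an n − 1 denominator, and the two-sample test shown is the pooled-variance t test (a Welch-type test would be an equally plausible reading). Function names and data are hypothetical.

```python
import statistics


def obrien_transform(resid):
    """Textbook O'Brien transformation of one group's residuals."""
    n = len(resid)
    mean = statistics.fmean(resid)
    var = statistics.variance(resid)  # n - 1 denominator (assumption)
    return [((n - 1.5) * n * (e - mean) ** 2 - 0.5 * var * (n - 1))
            / ((n - 1) * (n - 2)) for e in resid]


def pooled_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t_stat = (statistics.fmean(a) - statistics.fmean(b)) / (
        sp2 * (1 / na + 1 / nb)) ** 0.5
    return t_stat, na + nb - 2


# Step 1 (assumed already done): residuals from the MMR model, by group.
resid_1 = [0.4, -0.9, 0.2, 1.1, -0.8, 0.0]
resid_2 = [2.5, -3.0, 0.5, 3.5, -2.0, -1.5]

# Step 2: transform each residual within its own group.
r1, r2 = obrien_transform(resid_1), obrien_transform(resid_2)

# Calculation check: the mean of the transformed values in a group
# equals that group's sample variance.
assert abs(statistics.fmean(r1) - statistics.variance(resid_1)) < 1e-9

# Step 3: compare the transformed scores across the two groups.
t_stat, df = pooled_t(r1, r2)
print(f"t({df}) = {t_stat:.3f}")
```

A difference in the means of the transformed scores corresponds to a difference in the group variances, which is why an ordinary mean-comparison test can be used in the final step.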
In the following sections, we describe the design and results of a Monte Carlo study used to compare the performance (viz., Type I error and statistical power) of the five procedures described above: the heuristic method, Bartlett’s (1937) test, Brown and Forsythe’s (1974) test, the score test, and the modified O’Brien’s (1979, 1981) test.
Method
We used Monte Carlo methods (Robert & Casella, 2004) to evaluate the performance of five procedures that can be used to detect between-groups heteroscedasticity in MMR with a dichotomous moderator. Note that the nominal α for all tests was .05. The manipulated parameters of our 5 × 3 × 8 × 2 × 5 research design resulted in 1,200 conditions. Each of the manipulated parameters is described next.
Manipulated Parameters
Total sample size
Five levels of
Subgroup sample size
Sample size within groups was systematically manipulated using the following three ratios (
Between-groups heteroscedasticity
Between-groups heteroscedasticity assumed eight levels, which involved the ratios of the population error variance in each group (
Type of pairing
Depending on whether the larger error variance (
Effect size
In the present study, the MMR effect size (
Thus, in the present study, the levels of the manipulated effect size were .001, .002, .005, .01, and .02. These levels included the median effect size reported by Aguinis et al. (2005). Although Cohen (1988) labeled
Data Generation
For each condition, data generation and statistical analyses were conducted in
We also conducted a series of accuracy checks to ensure that the data we generated conformed to the various parameters that we manipulated. In addition, we checked our data generation algorithm against similar conditions considered by DeShon and Alexander (1996), Dretzke et al. (1982), and Overton (2001).
On each simulated data set, the five procedures (viz., heuristic method; Bartlett’s, 1937, test; Brown and Forsythe’s, 1974, test; the score test; and the modified O’Brien’s, 1979, 1981, test) were used to test whether between-groups heteroscedasticity existed. For each condition, there were 5,000 replications. The proportion of times that the null hypothesis was rejected within a condition was recorded for each procedure.
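The replication logic (simulate a data set, apply a procedure, record whether it rejects, and average over replications) can be sketched in Python as below. The data-generating model here is a simplified, hypothetical stand-in: a standard normal predictor, a common slope of 0.5, normal errors, and 2,000 replications by default rather than 5,000. It does not reproduce the study's actual generation algorithm or parameter levels. For brevity, only the heuristic method (variance ratio above 1.5) is evaluated.

```python
import random
import statistics


def ols_residuals(x, y):
    """Residuals from a simple regression of y on x within one group."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def simulate_group(rng, n, slope, error_sd):
    """Generate one group's data and return its OLS residuals."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [slope * xi + rng.gauss(0, error_sd) for xi in x]
    return ols_residuals(x, y)


def rejection_rate(n1, n2, sd1, sd2, reps=2000, cutoff=1.5, seed=12345):
    """Proportion of replications in which the variance ratio > cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        variances = []
        for n, sd in ((n1, sd1), (n2, sd2)):
            resid = simulate_group(rng, n, slope=0.5, error_sd=sd)
            variances.append(sum(e * e for e in resid) / (n - 2))
        if max(variances) / min(variances) > cutoff:
            rejections += 1
    return rejections / reps


# Homoscedastic condition (equal error SDs): an empirical Type I error rate.
print("null rejection rate:", rejection_rate(20, 20, 1.0, 1.0))
# Heteroscedastic condition (error variances 1:4): an empirical power value.
print("alt rejection rate: ", rejection_rate(20, 20, 1.0, 2.0))
```

Swapping the ratio check for any of the other four procedures, and looping over the manipulated parameters, yields the kind of condition-by-procedure rejection rates reported in the Results.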
Results
The performance of the five procedures is compared below in terms of Type I error rate and power. Due to space limitations, we do not present the results of all 1,200 conditions. Because the pattern of results was the same regardless of the size of the moderating effect, we present results when
Type I Error
For the conditions in which homoscedasticity existed (i.e.,
Empirical Type I Error Rates as a Function of Sample Size and Subgroup Proportions.
Although the heuristic method is not a formal statistical test, but simply a rule-of-thumb, and given that sampling error will affect the estimate of the residual variance in Group 1 and the estimate of the residual variance in Group 2, when
Another interesting result regarding the heuristic method is that as
Statistical Power
In this section, empirical rejection rates when the homoscedasticity assumption is violated (i.e., heteroscedasticity exists) are presented (see Tables 2-4). Table 2 presents results when subgroup sample sizes are equal (i.e.,
Statistical Power as a Function of Sample Size and Degree of Between-Groups Heteroscedasticity When Subgroup Sample Sizes Are Equal.
Statistical Power as a Function of Sample Size, Degree of Between-Groups Heteroscedasticity, and Type of Pairing When Subgroup Proportions (
Statistical Power as a Function of Sample Size, Degree of Between-Groups Heteroscedasticity, and Type of Pairing When Subgroup Proportions (
Equal subgroup sample sizes
Recall that when subgroup sample sizes are equal, the distinction between direct and indirect pairing does not apply, because neither group has a larger sample size with which the larger (or the smaller) error variance could be paired. In Table 2, with the exception of the heuristic method, four procedures had power that increased monotonically as
For the heuristic method, although it had the greatest power of all five procedures, recall that it had very inflated Type I error rates (see Table 1). Thus, the increased power comes at the cost of inflated Type I error rates. The heuristic method had power that increased monotonically as the degree of between-groups heteroscedasticity increased. For example, in Table 2, assuming
Interestingly, when the degree of between-groups heteroscedasticity is fixed (e.g., 1:1.25), the power of the heuristic method did not increase monotonically as
Direct pairing
In Table 3, when there was direct pairing, the heuristic method was generally the most powerful of the five procedures when
In Table 4, with direct pairing, the trends were similar to Table 3. However, compared with Table 3, because of the increasingly disproportionate subgroup sample sizes in Table 4, power generally decreased. However, the rank order of the various procedures remained the same. Excluding the heuristic method, which was the most powerful due to its inflated Type I error rate, Bartlett’s (1937) test continued to be the most powerful and the modified O’Brien (1979, 1981) was still the least powerful.
Indirect pairing
For all five procedures, power was lower when there was indirect pairing versus direct pairing. Notably, of the four procedures that were able to control Type I error rate at the nominal level (viz., Bartlett’s, 1937, test; Brown and Forsythe’s, 1974, test; the score test; and the modified O’Brien’s, 1979, 1981, test), the rank order of these procedures changed when there was indirect pairing. When there was indirect pairing (see Table 3 and Table 4), the modified O’Brien (1979, 1981) tended to be the most powerful, followed by the score test, Bartlett’s (1937) test, and Brown and Forsythe’s (1974) test.
It deserves noting that the heuristic method had the greatest statistical power of all five procedures because of its inflated Type I error rate. However, as its Type I error rate decreased with increasing
Discussion
Because between-groups heteroscedasticity is a problem in MMR analyses with categorical moderators, the present study compared the performance of various procedures that could be used to detect this statistical violation. As noted above, research has focused primarily on remedial procedures that can be used when between-groups heteroscedasticity exists. However, we felt that it was also important to compare different ways of detecting between-groups heteroscedasticity that have not been previously examined empirically in MMR with a dichotomous moderator (viz., heuristic method; Bartlett’s, 1937, test; Brown and Forsythe’s, 1974, test; score test; and modified O’Brien’s, 1979, 1981, test). By comparing various procedures, we hoped to offer some initial recommendations for researchers and practitioners in the behavioral and social sciences.
A number of key findings can be gleaned from our study. In general, Bartlett’s (1937) test is the most powerful in detecting between-groups heteroscedasticity when sample sizes are equal or when direct pairing occurs, thus providing empirical support for the recommendation offered by DeShon and Alexander (1996). It is noteworthy, however, that when there is indirect pairing (i.e., the largest
The score test performed well across conditions, typically with the second highest power levels. Perhaps due to its origins in econometrics and statistics, it does not appear to be well known in the psychology literature and related fields. However, the score test may still be a very attractive alternative for researchers because of its flexibility to detect heteroscedasticity of various forms, including between-groups heteroscedasticity.
Brown and Forsythe’s (1974) test was the least powerful across conditions. It deserves noting, however, that this procedure was developed to be robust against violations of normality. Because normality was not manipulated in the present study, it is plausible that under conditions of non-normality, in which research has suggested that Bartlett’s (1937) test performs poorly (Box, 1953; Levene, 1960), Brown and Forsythe’s test could potentially outperform Bartlett’s test. Similarly, because O’Brien’s (1979, 1981) method has been found to be robust when the normality assumption is violated, it is possible that, under conditions of non-normality, the modified O’Brien could outperform Bartlett’s test even in the direct pairing conditions.
As
The present study demonstrated that O’Brien’s (1979, 1981) procedure can be extended to designs beyond one-way and factorial ANOVA to include continuous predictors. The modified procedure controlled Type I error at the nominal level and had power levels comparable with, and in some cases greater than, other procedures.
The heuristic method had very poor properties. Admittedly, it is not a statistical test. Thus, it may not be reasonable to expect the heuristic method to be robust. Note that the empirical rejection rates (i.e., Type I error and power) for the heuristic method are unaffected by whether α = .01, .05, or .10. Thus, at any alpha, for the 1,200 conditions considered in the present study, the heuristic method would have the same rejection rates. To counteract its inflated Type I error rate, and interpolating from Table 1, the heuristic method may be recommended for use when
Recommendations for Research and Practice
A few recommendations for research and practice can be identified. First, when testing for the equality of regression slopes, it is important that researchers and practitioners evaluate whether the homoscedasticity assumption has been satisfied. Consistent with Rosopa et al. (2013), the residuals from Equation 1 (for the two-group case, specifically) or Equation A1 in the appendix (for two or more groups, more generally) should be calculated. Then, the sample-based variance of these residuals can be calculated separately for each group. Assuming that
Second, based on the results of the present study, Bartlett’s (1937) test is the most powerful for detecting between-groups heteroscedasticity when the normality assumption is satisfied and direct pairing exists. However, when there is indirect pairing, the modified O’Brien’s (1979, 1981) test should be used. Notably, if subgroup sample sizes are approximately equal, it makes little difference which statistical procedure is used because the differences in statistical power are generally small.
Third, if between-groups heteroscedasticity is detected, an alternative procedure should be used instead of ordinary least squares regression. To mitigate the biasing effects of between-groups heteroscedasticity, Rosopa et al. (2013) discussed a number of procedures including weighted least squares regression and heteroscedasticity-consistent covariance matrices.
Conclusion
The present study adds incrementally to the extant literature on statistical procedures that can be used to detect between-groups heteroscedasticity in MMR with categorical moderators. It appears that different procedures may be needed to optimally detect between-groups heteroscedasticity when there is direct pairing (viz., Bartlett’s, 1937, test) versus indirect pairing (viz., modified O’Brien’s, 1979, 1981, test). This is a finding unique to this study. Moreover, because the heuristic method has never been empirically examined, the present simulation results are the first to note that this method has very inflated Type I error rates and it may be best to use this method when
Appendix
The full linear model for the
where
The best linear unbiased estimator of the parameters in Equation A1 is
Although
where
where
Although
Assuming that
where
Authors’ Note
Portions of this article were presented at the 75th annual conference of the Psychometric Society in Athens, Georgia, and the 26th annual conference of the Society for Industrial and Organizational Psychology in Chicago, Illinois.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research and/or authorship of this article.
