Abstract

In a recent article, André (2022) addressed the decision to exclude outliers using a threshold across conditions or within conditions and offered a clear recommendation to avoid within-conditions exclusions because of the possibility for large false-positive inflation. In this commentary, I note that André’s simulations did not include the situation for which within-conditions exclusion has previously been recommended—when across-conditions exclusion would exacerbate selection bias. Examining test performance in this situation confirms the recommendation for within-conditions exclusion in such a circumstance. Critically, the suitability of exclusion criteria must be considered in relationship to assumptions about data-generating mechanisms.

Keywords

outliers, false positives, power, methodology, statistical analysis

It is common practice in research to identify and remove unrepresentative responses. Excluding these responses is challenging because there are many possible procedural decisions one can use to identify them. Frequently, researchers have sensible rules to exclude unrepresentative responses, such as failing attention checks. Other times, they use the extremity of responses as an indicator of unrepresentativeness and exclude these observations, which are called “outliers.” Because the point of statistical inference is first to generalize from samples to parent populations, the imperative of representativeness is the foundation on which outlier exclusion is based (Aguinis et al., 2013). In a recent article, André (2022) helpfully addressed the decision to exclude outliers using a threshold across conditions or within conditions. André offered a clear recommendation to avoid within-conditions exclusions because of the possibility for large false-positive inflation.

Critically, the suitability of exclusion criteria must be considered in relationship to assumptions about data-generating mechanisms (i.e., the parent population and the manner or manners in which unrepresentative responses are generated). The correct approach to removing contamination by unrepresentative responses depends, of course, on how the sample is contaminated in the first place. Imagine there are two experimental conditions, control (A) and treatment (B), whose population means, μ_A versus μ_B, researchers are interested in comparing. The null hypothesis is that their means do not differ, H₀: μ_A – μ_B = 0. One would then draw samples to test this. Consider a sample A = (A₁, . . . , A_n) of size n, where each A_i is sampled from a parent distribution, N(μ_A, σ_A). There is also another sample B with the same setup. In this straightforward case, one uses a test statistic based on the sample means, M_A versus M_B, to test whether the population means differ from each other.

In practice, however, researchers do not know that their samples are in fact drawn from only the parent distributions of interest. Rather, they may suspect that their samples are “contaminated” by observations that are unrepresentative of the populations of interest. That is, they have a sample $\hat{A} = (\hat{A}_1, \ldots, \hat{A}_n)$ of size n, where each $\hat{A}_i$ is sampled from N(μ_A, σ_A) with probability p and from a contamination distribution, say N(μ_A + 5, 1), with probability 1 – p; sample $\hat{B}$ has the same setup. In this case, the usual analysis based on the sample means, $M_{\hat{A}}$ versus $M_{\hat{B}}$, tests a different null hypothesis, $H_{\hat{0}}$: $\mu_{\hat{A}} - \mu_{\hat{B}} = 0$. This is problematic because the test of $M_{\hat{A}}$ versus $M_{\hat{B}}$ is not a clean test of the hypothesis of interest, so outlier exclusion is used to, hopefully, minimize the presence of unrepresentative observations. The simulations reported by André (2022) that reject within-conditions exclusion were conducted either without contamination or with equal contamination in both conditions (as in the previous example) and under the null. However, within-conditions exclusion (or likewise, “hypothesis-aware” exclusion based on the residuals of a statistical model that included condition as a predictor) has previously been recommended for a different situation.
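The contamination model described above can be sketched in a few lines of Python. This is an illustrative sketch only, not André's (2022) shared code: the function name `draw_contaminated`, the parameter names, and the chosen values (μ_A = 0, σ_A = 1, p = .95) are assumptions made for the example.

```python
import numpy as np

def draw_contaminated(n, mu, sigma, p_clean, shift, rng):
    """Draw n observations: each comes from N(mu, sigma) with probability
    p_clean and from the contamination distribution N(mu + shift, 1) otherwise."""
    is_contaminant = rng.random(n) >= p_clean
    clean = rng.normal(mu, sigma, size=n)
    contaminant = rng.normal(mu + shift, 1.0, size=n)
    return np.where(is_contaminant, contaminant, clean), is_contaminant

# Assumed illustrative values: mu_A = 0, sigma_A = 1, p = .95, shift = 5.
rng = np.random.default_rng(0)
sample, flags = draw_contaminated(
    n=100_000, mu=0.0, sigma=1.0, p_clean=0.95, shift=5.0, rng=rng
)

# The mean of the contaminated population is mu + (1 - p) * shift,
# so the sample mean estimates that quantity rather than mu itself.
expected_mean = 0.0 + (1 - 0.95) * 5.0
print(round(sample.mean(), 3), expected_mean)
```

When both conditions are contaminated in the same way, both means shift by the same amount, which is why equal contamination leaves the difference between conditions intact.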

Meyvis and Van Osselaer (2018) recommended searching for responses that deviate exceptionally “from the cell mean” (p. 1162), and they explained their reasoning for within-conditions exclusions. Using a single threshold can “create a nonequivalence of participants between conditions (i.e., introducing a confound)” (p. 1163), which is otherwise known as introducing selection bias (Heckman, 1979). Under the null considered by André (2022), without contamination or with equal contamination, this risk of creating a nonequivalence does not exist, hence André’s rejection of within-conditions exclusion. However, when there is differential contamination extremity, this potential problem should lead one to consider within-conditions exclusion.

Imagine two contaminated conditions, but the contamination in one condition is more extreme than in the other. For example, suppose one condition has contaminants drawn from N(μ_A + 3, 1), whereas the other condition has contaminants drawn from N(μ_B + 5, 1). A single threshold for exclusion essentially compromises the exclusion thresholds and therefore excludes more observations from the more extremely contaminated condition and fewer observations from the less extremely contaminated condition.
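To make the two rules concrete, the following sketch applies a within-conditions and an across-conditions 3-SD rule to simulated data with differential contamination extremity. It is a hedged illustration rather than the simulation code used for this commentary: the helper names (`keep_within`, `keep_across`, `draw`), the large cell size, the 5% contamination rate, and the choice μ_A = μ_B = 0 with unit SD are all assumptions.

```python
import numpy as np

def keep_within(x, k=3.0):
    """Mask of observations within k SDs of their own condition's mean."""
    return np.abs(x - x.mean()) <= k * x.std(ddof=1)

def keep_across(x, pooled, k=3.0):
    """Mask of observations within k SDs of the pooled sample's mean."""
    return np.abs(x - pooled.mean()) <= k * pooled.std(ddof=1)

def draw(n, shift, rng, p_contam=0.05):
    """Hypothetical helper: N(0, 1) observations, each replaced by a
    N(shift, 1) contaminant with probability p_contam."""
    is_contaminant = rng.random(n) < p_contam
    x = np.where(is_contaminant,
                 rng.normal(shift, 1.0, n),
                 rng.normal(0.0, 1.0, n))
    return x, is_contaminant

rng = np.random.default_rng(1)
control, c_flag = draw(200_000, 3.0, rng)    # less extreme contamination
treatment, t_flag = draw(200_000, 5.0, rng)  # more extreme contamination
pooled = np.concatenate([control, treatment])

# Share of contaminants removed under each rule: (control, treatment).
removed = {
    "within": (1 - keep_within(control)[c_flag].mean(),
               1 - keep_within(treatment)[t_flag].mean()),
    "across": (1 - keep_across(control, pooled)[c_flag].mean(),
               1 - keep_across(treatment, pooled)[t_flag].mean()),
}
for rule, shares in removed.items():
    print(rule, [round(s, 2) for s in shares])
```

Because the pooled threshold sits between the two within-conditions thresholds, the single rule removes a smaller share of the milder contaminants and a larger share of the more extreme ones.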

Assuming 5% contamination, within-conditions exclusion using a 3-SD threshold would exclude 23% of the contaminants in the control condition and 62% in the treatment condition; across-conditions exclusion worsens this discrepancy and would exclude 11% and 78%, respectively. Because André (2022) shared their Python code on OSF, one can extend their simulations to this situation and demonstrate this problem (more information on these calculations and simulations is available in the Supplemental Material). When the representative observations are drawn under H₀, the differential selection induced by across-conditions exclusion biases the results. Figure 1a shows that the t statistics are biased away from zero under across-conditions exclusion (M_across = –.29 vs. M_within = .01). Even though the Type I error rate inflation is not very high (Fig. 1c), across-conditions exclusion biases the estimates.
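The exclusion percentages reported above can be approximated in closed form. The sketch below is not the calculation from the Supplemental Material; it assumes μ_A = μ_B = 0, unit SD for representative observations, equal cell sizes, and thresholds set from population (rather than sample) means and SDs. All function names are hypothetical.

```python
import math

def normal_sf(x, mean=0.0, sd=1.0):
    """P(X > x) for X ~ N(mean, sd)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def mixture_moments(components):
    """Mean and SD of a normal mixture given (weight, mean, sd) triples."""
    mean = sum(w * m for w, m, _ in components)
    second = sum(w * (s ** 2 + m ** 2) for w, m, s in components)
    return mean, math.sqrt(second - mean ** 2)

def share_excluded(contaminant_mean, mix_mean, mix_sd, k=3.0):
    """Share of N(contaminant_mean, 1) lying beyond mix_mean +/- k * mix_sd."""
    upper = normal_sf(mix_mean + k * mix_sd, contaminant_mean)
    lower = 1.0 - normal_sf(mix_mean - k * mix_sd, contaminant_mean)
    return upper + lower

P_CONTAM = 0.05  # 5% contamination, as in the text
# Assumption: representative observations are N(0, 1) in both conditions.
control = [(1 - P_CONTAM, 0.0, 1.0), (P_CONTAM, 3.0, 1.0)]
treatment = [(1 - P_CONTAM, 0.0, 1.0), (P_CONTAM, 5.0, 1.0)]
# Assumption: equal cell sizes, so each condition has weight 1/2 when pooled.
pooled = [(w / 2, m, s) for w, m, s in control + treatment]

within_control = share_excluded(3.0, *mixture_moments(control))
within_treatment = share_excluded(5.0, *mixture_moments(treatment))
across_control = share_excluded(3.0, *mixture_moments(pooled))
across_treatment = share_excluded(5.0, *mixture_moments(pooled))

for label, value in [("within, control", within_control),
                     ("within, treatment", within_treatment),
                     ("across, control", across_control),
                     ("across, treatment", across_treatment)]:
    print(f"{label}: {value:.0%}")
```

Under these assumptions, the four shares come out near 23%, 62%, 11%, and 78%, in line with the percentages reported above.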

Fig. 1.

The impact of across versus within exclusions with differential contamination extremity using a 3-SD threshold. Given this simulation’s parameters, the distribution of t values is more negatively biased using across (vs. within) exclusions (a) under the null hypothesis and (b) under the alternative hypothesis. This results in relatively similar (c) Type I error inflation (across = 6% vs. within = 8%) but worse (d) Type II error inflation (across = 43% vs. within = 30%).

When it exacerbates selection bias, across-conditions exclusion increases the likelihood of misestimating the sign or magnitude of an effect. Within-conditions exclusion, on the other hand, generates a more dispersed distribution of t values but one that is less biased. Under differential contamination extremity, across-conditions (vs. within-conditions) exclusion leads to worse selection bias—the sample used for the analysis is further from a representative realization of the population’s sampling distribution. The resulting estimates will less accurately reflect the population, which is especially problematic when the researcher’s goal is not the dichotomous reject/fail to reject decision for a hypothesis but obtaining accurate effect estimates (e.g., meta-analysis, policy research).

In fact, the increased bias of across-conditions exclusion also occurs when the representative observations are drawn under the alternative hypothesis, d = 0.40. Figure 1b shows that the t statistics are more biased away from the expected mean 2.83 under across-conditions exclusion (M_across = 2.15 vs. M_within = 2.52). Predictably, this means the Type II error rate inflation is more severe (Fig. 1d). Across-conditions exclusion can also exacerbate selection bias when the conditions have unequal variances (also outside the scope of André, 2022; see Supplemental Material and Karch, 2023). Researchers would need to have prior information suggesting these situations apply to their research context (e.g., they are conducting their latest in a string of several replication studies, they are using a well-established manipulation from previous research), but that is ultimately at the crux of this commentary—the appropriate analytical approach depends on what is known about the data-generating process or processes. Selection bias is a serious threat to the validity of estimates and inferences; when across-conditions exclusion is expected to increase selection bias, then it should be avoided.
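The expected mean of 2.83 is consistent with the standard approximation for a two-sample t statistic with equal cell sizes and equal variances. As a sketch, with n observations per condition (the value n = 100 is inferred here from the reported numbers and is not stated in the text):

$$
E[t] \approx d\sqrt{\frac{n}{2}} = 0.40 \times \sqrt{\frac{100}{2}} \approx 2.83
$$

Against this benchmark, the mean t statistic falls short by about 0.68 under across-conditions exclusion (2.83 – 2.15) and by about 0.31 under within-conditions exclusion (2.83 – 2.52).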

Although André (2022) is correct to recommend a healthy skepticism of within-conditions exclusion, within-conditions exclusion is advisable when across-conditions exclusion is expected to exacerbate selection bias (e.g., with differential contamination extremity). The recommendation in this situation can be succinctly represented using weighted error rates computed from the simulation results (Maier & Lakens, 2022; Mudge et al., 2012). Figure 2 plots the weighted error rate under within- and across-conditions exclusion as a function of the relative cost of making a Type I versus Type II error. Conventionally, this ratio is implicitly set at 4 (5% Type I error rate, 20% Type II error rate, with a desired weighted error rate of 8%). At this ratio, within-conditions exclusion is just barely preferred to across-conditions exclusion, and it increasingly dominates across-conditions exclusion as Type II errors become more costly relative to Type I errors (i.e., as the ratio decreases from 4). Differential contamination extremity is exactly the situation in which excluding outliers within conditions has been previously recommended (Cousineau & Chartier, 2010; Meyvis & Van Osselaer, 2018). Note, however, that beyond these error rates, selection bias distorts the test statistics (Figs. 1a and 1b), which may be the more problematic issue. When outlier exclusion comes with the threat of differential selection, the benefits of within-conditions exclusion and risks of across-conditions exclusion are at their peak.

Fig. 2.

Weighted error rates of across versus within exclusion with differential contamination extremity as a function of the relative cost of Type I and Type II errors, using a 3-SD threshold and α = .05. Neither exclusion approach is perfect, but when there is differential contamination extremity, within- (vs. across-) conditions exclusion achieves a lower weighted error rate.
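The weighted error rate can be illustrated with a short calculation. The sketch below assumes the cost-weighted average form that reproduces the conventional 8% figure from a 5% Type I rate, a 20% Type II rate, and a cost ratio of 4, and it assumes equal prior probabilities for the null and alternative hypotheses. The inputs are the rounded error rates from the Figure 1 caption, so the results only approximate the curves in Figure 2; the function and variable names are hypothetical.

```python
def weighted_error_rate(alpha, beta, cost_ratio):
    """Cost-weighted average of the Type I rate (alpha) and Type II rate (beta),
    where a Type I error is cost_ratio times as costly as a Type II error.
    Assumes equal prior probabilities of the null and the alternative."""
    return (cost_ratio * alpha + beta) / (cost_ratio + 1.0)

# Rounded (Type I, Type II) error rates taken from the Figure 1 caption.
RATES = {"across": (0.06, 0.43), "within": (0.08, 0.30)}

# Conventional benchmark: alpha = .05, beta = .20, cost ratio = 4.
conventional = weighted_error_rate(0.05, 0.20, cost_ratio=4)

by_ratio = {
    ratio: {rule: weighted_error_rate(a, b, ratio) for rule, (a, b) in RATES.items()}
    for ratio in (1, 2, 4)
}
print("conventional:", round(conventional, 3))
for ratio, rates in by_ratio.items():
    print(ratio, {rule: round(value, 3) for rule, value in rates.items()})
```

With these rounded inputs, a cost ratio of 4 gives roughly 13.4% for across-conditions exclusion and 12.4% for within-conditions exclusion, and the gap widens as the ratio decreases.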

To sum up, the decision to use within-conditions exclusion has often not been thoroughly scrutinized, and in many situations, within-conditions exclusion should not be used because of its tendency to inflate Type I error rates (André, 2022). Yet the choice of outlier-exclusion approach is conditional on the expected state of the world and research goals, and it is up to researchers to ensure all the necessary conditions are in place to support their approach. In line with André (2022), in this commentary, I assumed there is suspected contamination, outlying is indicative of contamination, and outliers should be removed (if any of these conditions is not met, outlier exclusion should be avoided altogether). When the researcher intends to remove outliers and it is only a question of deciding on an appropriate exclusion rule (across vs. within conditions), this commentary highlights the importance of also considering the nature of the expected contamination (and more broadly, nature of the data). Researchers must be conscious of the assumptions scaffolding the suitability of their exclusion approaches.

Supplemental Material

Supplemental material, sj-docx-1-amp-10.1177_25152459231186577 for The Appropriateness of Outlier Exclusion Approaches Depends on the Expected Contamination: Commentary on André (2022) by Daniel Villanova in Advances in Methods and Practices in Psychological Science

Footnotes

Acknowledgements

The author gratefully acknowledges support from the Open Access Publishing Fund administered through the University of Arkansas Libraries.

Transparency

Action Editor: Pamela Davis-Kean

Editor: David A. Sbarra

Author Contributions

Daniel Villanova: Conceptualization; Formal analysis; Investigation; Methodology; Writing – original draft; Writing – review & editing.

ORCID iD

Daniel Villanova

References

Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 270–301. https://doi.org/10.1177/1094428112470848

André, Q. (2022). Outlier exclusion procedures must be blind to the researcher’s hypothesis. Journal of Experimental Psychology: General, 151(1), 213–223.

Cousineau, D., & Chartier, S. (2010). Outliers detection and treatment: A review. International Journal of Psychological Research, 3(1), 58–67. https://doi.org/10.21500/20112084.844

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161. https://doi.org/10.2307/1912352

Karch, J. D. (2023). Outliers may not be automatically removed. Journal of Experimental Psychology: General, 152(6), 1735–1753. https://doi.org/10.1037/xge0001357

Maier, M., & Lakens, D. (2022). Justify your alpha: A primer on two practical approaches. Advances in Methods and Practices in Psychological Science, 5(2). https://doi.org/10.1177/25152459221080396

Meyvis, T., & Van Osselaer, S. M. J. (2018). Increasing the power of your study by increasing the effect size. The Journal of Consumer Research, 44(5), 1157–1173. https://doi.org/10.1093/jcr/ucx110

Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLOS ONE, 7(2), Article e32734. https://doi.org/10.1371/journal.pone.0032734
