Abstract
Many surveys of effect size (ES) reporting practices have been conducted in social science fields such as psychology and education, but few such studies are available in applied linguistics. To bridge this gap and to echo recent calls for more robust statistics from scholars in applied linguistics and beyond, this study represents the first attempt in the field of applied linguistics to focus upon ES reporting practices. With an innovative "two-standards" approach to coding, which overcomes the limitations of similar studies in other social science fields (e.g., communication), this study assesses ES reporting practices over a span of six years in a major journal. Findings include the following: (a) the ES reporting rate is about 50% and (b) some improvement in ES reporting over time is in evidence. Future research directions (e.g., examining whether and how ES is interpreted after being reported) are suggested.
Keywords
Introduction
The importance of effect size vis-à-vis the inherent limitations of Null Hypothesis Significance Testing (NHST; including the significance level, viz. the
On the contrary, effect size, simply put, is “an objective and (usually) standardized measure of the magnitude of observed effect” (Field, 2009, p. 56). Compared with the
The importance of effect size notwithstanding, only a handful of journals in applied linguistics make such reporting practices mandatory in their editorial policies. While Larson-Hall (2010, p. 114) claims that “currently, the only journal in the second language research field which requires effect sizes is
In contrast to the increasing awareness of the importance of effect size, little is known about the current status of effect size reporting in the field of applied linguistics. Although Plonsky (2013) and Lindstromberg (2016) note that the effect size reporting rates in their sampled papers are not high (25% and 49%, respectively), the focus of these studies is not on effect size reporting practices in the field. In contrast, many studies in such fields as education, psychology, and communication (e.g., Meline & Wang, 2004; Sun & Fan, 2010) have focused upon the effect size reporting practices (see “Literature Review” section).
In view of the undue neglect of effect size reporting in applied linguistics, this article aims to contribute to our understanding by surveying such practices in
This exploratory study focuses upon the effect size reporting practices concerning five statistical procedures:
Three research questions are pursued:
In the remainder of this article, after providing a more detailed introduction to the definition and use of effect size with an illustrative example in a published study, we review relevant studies from such fields as education and psychology as well as from applied linguistics. We then report upon the data collection and analysis methods of our study. After presenting and discussing major findings, we conclude the article by offering suggestions for effect size reporting practices and for further studies that help contribute to the ongoing methodological reform in applied linguistics.
Effect Size
Definitions of Effect Size
Effect size is defined here as an objective and standardized measure of the magnitude of an observed effect, that is, Field's (2009, p. 56) concise definition cited above with the qualifier "(usually)" removed. Although some other definitions (e.g., Meline & Wang, 2004; Sun & Fan, 2010) are so broad that they include nonstandardized forms (e.g., raw mean difference), it is strongly recommended that effect size measures be confined to standardized forms only, so as to maximize their benefits, such as letting "the reader compare effects across groups" and "meta-analysts compare studies even if they use different original measures" (Larson-Hall & Plonsky, 2015, p. 135). Furthermore, the danger of relying on raw mean difference, vis-à-vis the benefits of drawing upon standardized forms of effect size, will be illustrated with an authentic example of effect size reporting below.
Dozens of effect size measures are available, each with relative strengths and weaknesses for particular purposes (Ellis, 2010; Henson, 2006; Kirk, 1996). Two types
of effect sizes highly relevant to applied linguistic research are the
Table 1 also lists some benchmarks for interpreting effect sizes recommended by Cohen (1988) and by researchers in applied linguistics. Cohen's (1988) benchmark system is best reserved "as a last resort" (Ellis, 2010, p. 42), although it has been used by many researchers as iron-clad criteria without reference to the measurements taken, the study design, or the practical importance of the findings. Whenever possible, researchers should try to interpret effect sizes by grounding them in a meaningful context (e.g., comparisons with previous studies vis-à-vis the measurements and study design) or by assessing their contribution to knowledge (e.g., in terms of practical or clinical value). The two benchmark systems from researchers in applied linguistics (see Table 1) provide more nuanced guidance for interpreting the effect size in question than Cohen's system: Plonsky and Oswald's (2014) benchmarks are highly relevant to experiment-based studies in what they called "L2 research," and Wei and Hu's (2018) to survey-based studies examining the effects of sociobiographical variables (e.g., gender and multilingualism) on (socio-)psychological variables (e.g., L2 joy and tolerance of ambiguity).
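As a mechanical illustration of the kind of last-resort labeling the text cautions against over-relying on, Cohen's (1988) cut-offs for d (0.2, 0.5, 0.8) can be encoded in a few lines. The sketch below is illustrative only and should never replace context-grounded interpretation:

```python
def cohen_label(d):
    """Label |d| using Cohen's (1988) benchmarks: 0.2 (small), 0.5 (medium), 0.8 (large).

    Intended only as a last resort; interpretation grounded in the study's
    measurements, design, and practical value is preferable.
    """
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```

Field-specific benchmark systems such as Plonsky and Oswald's (2014) could be substituted by changing the cut-off values.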
Consequences of Not Reporting Effect Size
As Zientek, Capraro, and Capraro (2008) point out, "not reporting effect size can be detrimental" (p. 212). Presenting an authentic example helps drive home the consequences of failing to report effect sizes. Table 2 is adapted from Wei and Su's (2015) analysis of respondents' self-reported data concerning their English spoken proficiency and other variables from the largest language survey in China. The major modification made to Wei and Su's (2015) original table is that we added a column containing Cohen's d values.
An Authentic Example: Spoken Proficiency in English of People With English Learning Experience.
The corresponding research question for Table 2 asks whether, with regard to English spoken proficiency, there was a significant difference between the national average and the city average for each of the seven selected cities. The authors answer the question with results (see Table 2) from a series of one-sample t tests.
Two important observations can be made regarding Table 2. First, if one relies on the raw mean differences (viz. the city mean minus the national mean) for Beijing (0.269) and Shenzhen (0.256), one might conclude that Beijing performed better than Shenzhen, with the national average as the baseline. Yet the opposite conclusion, that Shenzhen performed better than Beijing, is in fact correct, because the effect size for the former (0.326) was higher than that for the latter (0.295). In this example, effect size, rather than raw mean difference, is the appropriate measure of the magnitude of the real difference. In other words, relying upon unstandardized measures (e.g., raw mean difference) instead of effect size can lead to a completely opposite conclusion. Second, many researchers with traditional training tend to erroneously believe that "the smaller the
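The reversal seen in Table 2 follows directly from the one-sample standardized mean difference, d = (M − μ) / SD: a smaller raw difference divided by a smaller standard deviation can yield a larger effect. The sketch below uses hypothetical means and standard deviations (not Wei and Su's actual data) to reproduce the pattern:

```python
def cohens_d_one_sample(sample_mean, pop_mean, sample_sd):
    """Standardized mean difference for a one-sample comparison: (M - mu) / SD."""
    return (sample_mean - pop_mean) / sample_sd

# Hypothetical figures (NOT Wei & Su's data): city A has the larger raw
# difference from the national mean of 3.000, yet city B shows the larger
# standardized effect because its scores are less dispersed.
d_a = cohens_d_one_sample(3.269, 3.000, 0.92)  # raw difference 0.269
d_b = cohens_d_one_sample(3.256, 3.000, 0.78)  # raw difference 0.256
```

Here d_b exceeds d_a even though city A's raw difference is larger, mirroring the Beijing/Shenzhen comparison.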
It is noteworthy that Wei and Su (2015) explain why they do not attempt to interpret the
Literature Review
Many surveys of effect size reporting (and, to a lesser extent, interpreting) practices have been conducted in such fields as psychology, education, and communication. For example, in gifted education research, drawing on all 723 papers from six full volumes of three selected journals, Paul and Plucker (2004, p. 69) report that "28.9% of the quantitative research blocks contained effect size estimates"; the so-called "quantitative research blocks" comprise three subgroups (descriptives, univariate blocks, and multivariate blocks), and the effect size reporting rates for the latter two blocks were 17.9% and 52.2%, respectively. To these authors, papers utilizing only descriptive statistics need not report effect sizes.
More recently, in the fields of education and psychology, Sun, Pan, and Wang's (2010) survey of 1,243 articles published in 14 journals across three full volumes (2005-2007) reveals an effect size reporting rate of 49%. In the field of communication, after examining four full volumes (2003-2006) of four influential journals, Sun and Fan (2010) find a relatively high effect size reporting rate (about 75%) in their 224 sampled papers. One major limitation of Sun and Fan's (2010) study is that their coding method tends to overestimate the effect size reporting rate. If, in one particular article, two or more focal statistical procedures (say,
However, in the field of applied linguistics, no studies focus upon effect size reporting practices. Although Plonsky (2013) finds an effect size reporting rate of 25% by examining 606 articles from
To date, no studies concerning effect size reporting in applied linguistic research endeavor to make explicit the coding standard regarding papers with multiple statistical procedures, let alone adopt two standards to arrive at a more comprehensive picture of the reporting practices. Furthermore, no studies have surveyed papers from journals that do not mandate effect size reporting, as all the journals covered in Plonsky (2013) and Lindstromberg (2016) have such a mandate.
The Study
Sampling
To contribute to the current understanding of effect size reporting in the field, six full volumes (2011-2016) of
The sampling frame for this study was based on all of the 414 full-length research articles from the six selected volumes of
Finally, the remaining 217 articles formed the core dataset, each of which should have effect size(s) reported. This total number was used as the denominator to generate the overall effect size reporting rates for Research Question 1.
Coding
The unit of analysis was the individual article. Each article in the core dataset was coded in terms of its research topic, publication year, nature (empirical or not), types of statistical procedures, effect size reporting practices, types of effect size measures, and the authors' awareness of effect size (see the appendix). Two coding standards were used for situations where two or more of the focal statistical procedures appear in a single paper, so as to achieve a more comprehensive picture of effect size reporting in applied linguistics and to facilitate comparisons with findings from other fields: one is Sun and Fan's (2010) standard, which tends to give the "benefit of the doubt" and hence is relatively loose; the other is the more stringent standard proposed in the "Effect Size" section.
The use of these two standards might introduce an element of subjectivity, although most of the coded variables are dichotomous and involve little subjective judgment (e.g., reported vs. not reported). To ensure consistent application of the checklist, the first and second authors independently coded a common set of 44 articles (20.3% of the core dataset). The intercoder agreement rate was 93.1%, above the commonly accepted range of 85% to 90% (cf. Miles, Huberman, & Saldaña, 2014), and the points of disagreement were resolved through collegial discussion. Once consistency was established, the second author coded the remaining articles.
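A percentage agreement rate of the kind reported above can be computed by comparing the two coders' decisions item by item. The snippet below is an illustrative sketch with invented dichotomous codes, not the actual coding data:

```python
def agreement_rate(codes_a, codes_b):
    """Percentage of items on which two coders assigned the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must rate the same items")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100 * matches / len(codes_a)

# Invented codes (reported = 1, not reported = 0) for eight articles
coder1 = [1, 1, 0, 1, 0, 1, 1, 0]
coder2 = [1, 0, 0, 1, 0, 1, 1, 0]
rate = agreement_rate(coder1, coder2)
```

Note that simple percentage agreement does not correct for chance agreement; a chance-corrected index such as Cohen's kappa would be a natural extension.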
Data Analysis
After the data were coded, both descriptive and inferential statistics were generated with the statistical package SPSS 21.0. For Research Question 1, regarding the extent of effect size reporting practices, only descriptive statistics in the form of percentages and frequencies were generated. To answer Research Question 2, concerning whether effect size reporting practices change over time, a series of chi-square tests were performed, with Cramer's V as the accompanying effect size measure.
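For readers replicating this kind of analysis outside SPSS, the chi-square statistic and Cramer's V (V = sqrt(chi2 / (n(k − 1))), where k is the smaller of the number of rows and columns) can be computed directly. The sketch below is a minimal Python illustration with an invented contingency table, not the study's actual counts:

```python
import math

def chi_square_and_cramers_v(table):
    """Pearson chi-square statistic and Cramer's V for a contingency table.

    V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return chi2, math.sqrt(chi2 / (n * k))

# Invented 2 x 6 table: articles reporting vs. not reporting effect
# sizes across six publication years (hypothetical counts)
table = [[20, 25, 28, 30, 32, 35],
         [18, 15, 14, 12, 10, 8]]
chi2, v = chi_square_and_cramers_v(table)
```

For a 2 x k table such as this one, Cramer's V reduces to the phi coefficient generalization and always falls between 0 and 1.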
Findings and Discussion
Research Question 1: To What Extent Are Measures of Effect Size Reported?
As Table 3 shows, overall, 73.27% of the sampled papers that should have effect size reported do report effect size(s), when Sun and Fan’s (2010) standard is adopted for situations involving papers with two or more statistical procedures. This effect size reporting rate diminishes to 52.07% when a standard more stringent than Sun and Fan’s (2010) is adopted. It is unfortunate that effect sizes, the importance of which is no less than that of the
No. of Articles Reporting Effect Size in Selected Years.
These remarks may seem overly critical toward the field of applied linguistics. To be fair, we need to situate the discussion in a broader context by reiterating that the underreporting of effect sizes has also been observed in other fields. One comparable study is a survey of 256 papers from the
Research Question 2: Do the Effect Size Reporting Practices Vary Across the Years?
A chi-square test, χ2(5) = 3.533,
The effect size values reported above are higher than those reported in previous studies. The counterpart Cramer’s
All in all, the answer to Research Question 2 is that effect size reporting practices do vary over time, with the strength of association lying between Cohen's (1988) small and medium benchmarks.
Research Question 3: For Each of the Five Focal Statistical Methods, What Is the Effect Size Reporting Rate and What Effect Size Measures Are Typically Reported?
For papers that used correlation analysis, 94.29% (see Table 4) reported an effect size measure. This extremely high reporting rate can be attributed to the fact that the test statistic (i.e., the correlation coefficient) is itself an effect size (Sun et al., 2010). Similarly high effect size reporting rates for correlation analysis can be found in other fields. For instance, in the field of communication, Sun and Fan (2010, p. 334) note that "nearly 100% of studies" that used Pearson correlation reported effect size measures, whereas the corresponding rate in Alhija and Levy (2009) reached 100% in the field of education. In this study, the effect size measures typically used were correlation coefficients such as Pearson's r.
No. of Articles Using the Focal Statistical Procedures and Reporting Effect Sizes.
For the papers that used regression analysis, about 84.00% (see Table 4) reported effect sizes. High effect size reporting rates for regression analysis can be found in other fields such as communication (nearly 100%; see Sun & Fan, 2010) and education (100%; see Alhija & Levy, 2009). The effect size measure most often used was adjusted R².
More than 60% (64.10%; see Table 4) of the papers using ANOVA reported effect size measures. This rate was highly similar to its counterparts, namely, 56.5% and 57%, from Sun and Fan (2010) and Alhija and Levy (2009), respectively. The reporting rate for ANOVA was lower than that for regression, partly because effect sizes for ANOVA are not as readily available as those for regression in statistics packages. Take SPSS as an example. In SPSS, ANOVA can be run in three ways. The most common way is to initiate the test by clicking "Compare Means → One-way ANOVA," but an effect size measure for ANOVA, eta-squared, cannot be generated in the output this way, misleading many researchers into believing that SPSS does not provide eta-squared for ANOVA (Zhang, 2009). However, this effect size can be generated in the two less commonly used ways in SPSS (cf. Plonsky & Oswald, 2017).4 Therefore, when effect sizes were reported, (partial) eta-squared5 was unsurprisingly the most reported measure, consistent with earlier findings (e.g., Sun & Fan, 2010). In light of the observation from the field of communication that "researchers arbitrarily selected one of these two" (i.e., eta-squared and partial eta-squared; Sun & Fan, 2010, p. 338) and a recent discussion of the misuses of (partial) eta-squared in the field of L2 research (Norouzian & Plonsky, 2018), future research needs to investigate whether these effect sizes have been correctly used when reported.
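When a statistics package does not print eta-squared, it can be computed directly from the one-way ANOVA sums of squares as SS_between / SS_total. The following is a minimal, hypothetical Python sketch; the group scores are invented for illustration:

```python
def eta_squared(groups):
    """Eta-squared for a one-way ANOVA: SS_between / SS_total."""
    scores = [x for group in groups for x in group]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_between = sum(
        len(group) * ((sum(group) / len(group)) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total

# Invented scores for three hypothetical groups
groups = [[12, 14, 15], [10, 11, 13], [16, 18, 17]]
eta2 = eta_squared(groups)
```

In a one-way design, eta-squared and partial eta-squared coincide; they diverge in factorial designs, which is one source of the misuse discussed by Norouzian and Plonsky (2018).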
Twenty-three (34.32%) of the 67 articles that used t tests reported effect sizes.
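For t tests, the conventional standardized effect size is Cohen's d, the difference between the two group means divided by the pooled standard deviation. The sketch below is a hypothetical Python illustration; the scores are invented, not drawn from the surveyed articles:

```python
import math

def cohens_d_independent(group1, group2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    var1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Invented scores for two hypothetical groups
treatment = [14, 16, 15, 18, 17]
control = [12, 13, 15, 14, 11]
d = cohens_d_independent(treatment, control)
```

Because d is expressed in standard deviation units, it allows comparison across studies that use different raw measurement scales, which is precisely the advantage of standardized effect sizes stressed earlier.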
Seven (30.43%) of the 23 articles that used chi-square tests reported effect sizes. Similarly, low reporting rates are in evidence elsewhere. In Alhija and Levy’s (2009) sampled papers from five educational journals that do not require effect size reporting, the corresponding rate was 17%. In Sun and Fan’s (2010) sampled papers from two communication journals without effect size reporting requirements, none of the five papers that used chi-square tests reported effect size; to account for this, the authors speculate that “it is likely that neither Cramer’s
Conclusion
This study has examined the effect size reporting practices in one major applied linguistics journal. The effect size reporting practices seem to have improved over the past few years, although the identified reporting rate of about 50% remains inadequate. Encouraging as this improvement is, evidence from other disciplines suggests that such advances in effect size reporting can be lost without continued vigilance (Loewen et al., 2014). Therefore, journal editors, researchers, and researcher trainers need to (continue to) encourage and/or implement good reporting practices (e.g., reporting effect sizes along with the exact
Although this exploratory study is innovative in terms of its "two-standards" approach to coding and its target journal selection, it has three major limitations. First, it would have benefited from a larger sample size. The above findings and conclusions are tentative and require verification and/or falsification in future research. In terms of generalizability, the results may not be representative of the use of effect sizes in applied linguistics in general, as this study focused on only one journal in the field. Second, the findings provide limited information about effect size reporting practices for statistical procedures other than the five focal ones (such as factor analysis and structural equation modeling). Third, the present study has provided evidence of the frequency of application of effect sizes in the focal journal, but it does not indicate whether these effect sizes have been correctly applied (see Norouzian & Plonsky, 2018, for a review of the misuses of [partial] eta-squared in L2 research).
To contribute to the ongoing methodological reform in applied linguistics (Larson-Hall & Plonsky, 2015), more studies on effect size reporting are needed. Future studies would stand to gain from expanding the sample size and/or comparing reporting practices across journal types (journals with vs. without a requirement for effect size reporting). It would also be useful to examine whether effect sizes are reported more frequently for statistically significant results than for their nonsignificant counterparts, as Plonsky (2013) notes that some authors tend to report effect sizes solely for statistically significant results, although such information was "not coded for throughout the entire sample" in his study. Furthermore, future studies of effect size reporting need to incorporate an element of effect size interpretation in a more systematic way, as the reporting of effect sizes should not be treated "as an end in itself" (Larson-Hall & Plonsky, 2015, p. 135). It is useful to know how effect sizes are interpreted after being reported.
Ellis (2010) predicts that “If history is anything to go by, statistical reforms adopted in psychology will eventually spread to other social science disciplines” (p. xiv). Recently, the editors of
Footnotes
Appendix
A Checklist for Analyzing the Sampled Articles.
| No. | Item | Note/check |
|---|---|---|
| 1 | Title | ________ |
| 2 | Year | ________ |
| 3 | Issue number | ________ |
| 4 | Is this paper empirical? | ________ |
| 5 | Which type of empirical research was adopted (quantitative, qualitative, or mixed methods)? | ________ |
| 6 | Are the statistics purely descriptive? | ________ |
| 7 | Which of the focal statistical procedures are used? | |
| 7.1 | t test | □Yes□No |
| 7.2 | Analysis of variance | □Yes□No |
| 7.3 | Chi-square | □Yes□No |
| 7.4 | Correlation | □Yes□No |
| 7.5 | Regression | □Yes□No |
| 7.6 | The other procedures | □Yes□No |
| 8.1 | Is effect size reported? | □Yes□No |
| 8.2 | Is effect size reported with clear awareness on the part of the author(s)? | □Yes□No |
| 8.3 | What effect size measure(s) is(are) used? | ________ |
| 8.4 | Is effect size interpreted? | □Yes□No |
| 9 | Overall, does this paper report effect sizes according to Sun and Fan’s (2010) standard? | □Yes□No |
| 10 | Overall, does this paper report effect sizes according to a more stringent standard? | □Yes□No |
Acknowledgements
The authors would like to extend their sincere thanks to the anonymous reviewers and the article editor for their constructive comments on an earlier version of this article. All the remaining inadequacies are the authors’ responsibility.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The writing of this article was supported by the Educational Science Research Fund of Jiangsu Province (D/2018/01/18) and the Research Development Fund of Xi’an Jiaotong-Liverpool University (RDF-16-01-61).
Notes
Author Biographies
Rining Wei (Tony), PhD, teaches courses related to bilingualism and research methods at undergraduate and postgraduate levels at the Department of English, Xi’an Jiaotong-Liverpool University. He has supervised master’s and doctoral dissertation projects concerning bilingual education, TESOL, and language policy. He has published in journals including English Today and World Englishes. He serves on the editorial board of the TESOL International Journal.

Yuhang Hu (Sophie) is a master’s student at the Department of Linguistics, with a concentration in Applied Linguistics, Georgetown University. Her areas of research include (socio-)psychological variables in bilingualism and quantitative methodology. She will commence her PhD study in Applied Linguistics at Northern Arizona University this Fall.

Jianhui Xiong, PhD, conducts research concerning educational policy and comparative education at the National Center for Education Development Research, Ministry of Education of the People’s Republic of China. His recent research interests include internationalization of education and the use of big data in education.
