Sage Journals: Discover world-class research

Abstract

Linear regression is a simple yet powerful tool that has been extensively used in all fields where the relationships among variables are of interest. When linear regression is applied, the coefficient of determination or R-squared (R²) is commonly reported as a metric gauging the model’s goodness of fit. Despite its wide usage, however, R² has been commonly misinterpreted as the proportion or percent of variation in the dependent variable that is explained by the independent variables (PVE -- percent of variation explained). This study demonstrates that R² substantially overstates the true PVE. When the assumptions of linear regression are met, R² overstates PVE by up to 100%. For instance, when R² is 0.99, 0.80, 0.50, or 0.10, the true PVE is 0.90, 0.55, 0.29, or 0.05, respectively. The misinterpretation of R², which greatly exaggerates the effect of the interventions or causes on the outcomes, could exert undue influence on clinical decisions in medicine and policy decisions in other fields such as environmental protection and climate change research. Therefore, when linear regression is applied, reporting the true PVE is warranted.

Keywords

Linear regression, R-squared, goodness of fit, predictive power

Background

Since its initial conceptualization by Sir Francis Galton and its mathematical enhancement by Karl Pearson in the late 19th century,^1,2 linear regression has been extensively used in all fields where the relationships among variables are of interest. For instance, in medical research, linear regression is commonly used to quantify the linear relationship between a continuous outcome (e.g., blood pressure) and one or more independent variables such as treatments and confounders.^3–7

When linear regression is used, R², also called the coefficient of determination, is a preferred and arguably the most often reported metric gauging the model’s goodness of fit.^8,9 R² is universally interpreted as the proportion or percent of the variation in the dependent variable that is explained or predicted by the independent variables (hereafter abbreviated to PVE -- percent of variation explained).^4,10–15 Although R² may not be the most important statistic for initial model selection or specification, thanks to its intuitive interpretation, R² has become one of the most important metrics rendering real-life implications. When linear models are applied, R² is often used to determine whether a disease is preventable, a treatment is effective, how much genetic factors and environmental toxins (e.g., tobacco smoking, pesticide exposures) contribute to diseases, and how much human activities are contributing to climate change.

However, R² has been widely misconceived and misinterpreted, which significantly exaggerates the model’s predictive power and leads to overestimation of the strength of evidence or interventions. This study aimed to elucidate the misinterpretation of R² and assess its relationship with the true PVE.

Methods and results

In this section, a heuristic example was first given to illustrate the fallacy of interpreting R² as PVE, and then simulations were carried out to assess the relationship between R² and PVE.

A heuristic example

Suppose a sleep specialist investigating the relationship between daily walking distance (miles) and hours of sleep at night gathered 5 days of data from a patient, as shown in Table 1. The investigator fitted a simple linear regression,

$Y = α + β X + ε,$

and obtained the following result:

$\hat{Y} = 6 + 0.5 \times X,$

which indicates that, on average, a patient would sleep 6 h without any walking and sleep 0.5 h longer for each additional mile walked (also see Figure 1).

Table 1.

Regression data and summary statistics.

	Miles walked, $x$	Hours of sleep, $y$	Predicted, $\hat{y}$	SST, ${(y - \bar{y})}^{2}$	SSR, ${(y - \hat{y})}^{2}$	TV, $| y - \bar{y} |$	RV, $| y - \hat{y} |$
Day 1	3	7.5	7.5	0	0	0	0
Day 2	1	6.75	6.5	0.5625	0.0625	0.75	0.25
Day 3	2	6.5	7	1	0.25	1	0.5
Day 4	5	8.25	8.5	0.5625	0.0625	0.75	0.25
Day 5	4	8.5	8	1	0.25	1	0.5
$\sum$	15	37.5	37.5	3.125	0.625	3.5	1.5

$\bar{y}$ = 37.5/5 = 7.5.

Figure 1.

Correlation between walking distance and sleep duration.

The investigator was also interested in how much variation in the sleep hours can be explained by the walking distance and thus calculated $R^{2} = 1 - \mathrm{SSR}/\mathrm{SST} = 0.8$, where $\mathrm{SSR} = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2} = 0.625$ is called the residual sum of squares, and $\mathrm{SST} = \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2} = 3.125$ is called the total sum of squares.

Now, does the walking distance explain 80% of the variation in sleep hours? The answer is no. R² overstates the model’s predictive power. The problem stems from the fact that SST is not the total variation of the dependent variable, and SSR is not the total variation of what is not explained by the regression. Rather, they are the sums of the squared variations. Therefore, R² is not the proportion of variation in the dependent variable explained by the independent variables. As shown in Table 1, the true total variation (TV) is $\sum_{i = 1}^{n} | y_{i} - \bar{y} | = 3.5$, and the true total variation of what is not explained by the regression (RV) is $\sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} | = 1.5$. Thus, the true percent or proportion of the variation in the dependent variable explained is: PVE = 1 – RV/TV = 1 – 1.5/3.5 = 0.57.
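The arithmetic behind Table 1 can be reproduced in a few lines. The following is a minimal sketch in plain Python; the variable names are illustrative and not taken from the article, and the fitted line is the one reported in the text rather than re-estimated.

```python
# Data from Table 1: miles walked (x) and hours of sleep (y).
x = [3, 1, 2, 5, 4]
y = [7.5, 6.75, 6.5, 8.25, 8.5]

# Fitted values from the regression line reported in the text: y_hat = 6 + 0.5 * x.
y_hat = [6 + 0.5 * xi for xi in x]
y_bar = sum(y) / len(y)

# Sums of squared deviations, as used by R-squared.
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

# Sums of absolute deviations, as used by PVE.
tv = sum(abs(yi - y_bar) for yi in y)
rv = sum(abs(yi - fi) for yi, fi in zip(y, y_hat))

r_squared = 1 - ssr / sst
pve = 1 - rv / tv
print(f"R^2 = {r_squared:.2f}, PVE = {pve:.2f}")  # R^2 = 0.80, PVE = 0.57
```

The same residuals feed both metrics; only the treatment of each deviation (squared versus absolute) differs.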

What causes the discrepancy between R² and PVE? It is the squaring. In calculating R², $| y_{i} - {\hat{y}}_{i} |$ is disproportionately shrunk compared to $| y_{i} - \bar{y} |$ because the former is almost always smaller than the latter, which results in a larger R² = 1 - SSR/SST compared to PVE. To see this more intuitively, two is four times as large as 0.5, but after squaring, four is 16 times as large as 0.25.

It is worth noting that “variation” and “variance” are synonyms in everyday language and are often used interchangeably in describing R² in the clinical literature. In fact, Sewall Wright,¹⁶ who is credited with the creation of R², used “variation” rather than “variance” in his seminal 1921 paper. However, in statistics, “variance” has a specific definition, i.e., $σ^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - μ)}^{2}$, and for the sample variance, $S^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}$. In the R² definition, SSR/SST is indeed the ratio of the variance of the regression error to the variance of the dependent variable. Given SST = SSR + SSE, where SSE = $\sum_{i = 1}^{n} {({\hat{y}}_{i} - \bar{y})}^{2}$ is called the explained sum of squares, R² can also be expressed as SSE/SST, the ratio of the variance of ${\hat{y}}_{i}$ (the estimated $y_{i}$) to the variance of the dependent variable $y_{i}$. Thus, it is technically correct to say that R² is the proportion of the variance ($S^{2}$) of the dependent variable attributed to the explained ${\hat{y}}_{i}$ or the independent variables.
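For the Table 1 data, the variance-ratio reading of R² can be checked by hand. In the worked equation below, $S^{2}_{\hat{y}}$ and $S^{2}_{y}$ are shorthand introduced here (not the article's notation) for the sample variances of the fitted and observed values, and SSE = SST - SSR = 3.125 - 0.625 = 2.5 with n = 5.

```latex
R^{2} = \frac{\mathrm{SSE}}{\mathrm{SST}}
      = \frac{\mathrm{SSE}/(n - 1)}{\mathrm{SST}/(n - 1)}
      = \frac{S^{2}_{\hat{y}}}{S^{2}_{y}}
      = \frac{2.5/4}{3.125/4}
      = \frac{0.625}{0.78125}
      = 0.8
```

The step from SSE/(n - 1) to a sample variance relies on the fitted values having the same mean as the observed values, which holds for OLS with an intercept.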

Nevertheless, in statistics, variance is not equivalent to variation, and thus R² is different from PVE. As shown in the illustrative example above, when R² = 0.80, PVE = 0.57. Given the large difference between R² and PVE, and given that the latter is more informative in practice, examining the relationship between them should be instructive.

Simulations

Since no general closed-form expression linking R² and PVE is available, this study uses simulations to demonstrate their relationship. Without loss of generality, assume the dependent variable Y is generated by the following process:

$Y = α + β X + ε,$

where $ε \sim N(0, σ)$. Since the number of independent variables in the regression is irrelevant to the problem of interest, here X represents a single variable. To assess whether the distribution of X affects the relationship between R² and PVE, X was generated from five commonly used distributions¹⁷: normal, uniform, Poisson, gamma, and exponential. Since the relationship between R² and PVE is not affected by the values of $α$ and $β$ for a given R², the results reported here were based on $α = 2$ and $β = 0.5$. Because R² cannot be predetermined in the simulations, different R² values were produced by varying $σ$ from 0.1 to 3 in increments of 0.1. Thus, 30 pairs of R² and PVE values were produced. To derive each R² and the corresponding PVE, 1,000 observations were generated for the regression. For each pair of R² and PVE, 500 iterations were carried out, i.e., the regression was fitted 500 times. All the simulations were conducted with SAS®.

Because X generated from the different distributions (normal, uniform, Poisson, gamma, and exponential) yielded practically identical results, Table 2 reports only the results based on $X$ generated from $N(μ_{x}, σ_{x})$ and $ε$ generated from $N(0, σ)$. Because, for a given R², the values of $μ_{x}$ and $σ_{x}$ do not affect the simulation results (i.e., the difference and ratio of R² and PVE), the results reported in Table 2 were based on $μ_{x} = 5$ and $σ_{x} = 2$.
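The simulation design described above can be sketched in plain Python. This is not the author's SAS program: the function names, the random seed, and the reduced default of 20 iterations (instead of 500) are assumptions made here to keep the example fast, and $σ$ and $σ_{x}$ are treated as standard deviations, a reading chosen because it reproduces the R² values in Table 2.

```python
import random


def fit_ols(x, y):
    """Return (intercept, slope) of the simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r2_and_pve(x, y):
    """Fit OLS, then return (R^2, PVE) computed from the same residuals."""
    a, b = fit_ols(x, y)
    y_bar = sum(y) / len(y)
    y_hat = [a + b * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tv = sum(abs(yi - y_bar) for yi in y)
    rv = sum(abs(yi - fi) for yi, fi in zip(y, y_hat))
    return 1 - ssr / sst, 1 - rv / tv


def simulate(sigma, n_obs=1000, n_iter=20, alpha=2.0, beta=0.5,
             mu_x=5.0, sd_x=2.0, seed=0):
    """Average R^2 and PVE over n_iter simulated data sets for one error SD."""
    rng = random.Random(seed)
    r2_sum = pve_sum = 0.0
    for _ in range(n_iter):
        x = [rng.gauss(mu_x, sd_x) for _ in range(n_obs)]
        y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
        r2, pve = r2_and_pve(x, y)
        r2_sum += r2
        pve_sum += pve
    return r2_sum / n_iter, pve_sum / n_iter


if __name__ == "__main__":
    # A few of the 30 error SDs used in the article (0.1 to 3 in steps of 0.1).
    for sigma in (0.1, 0.5, 1.0, 3.0):
        r2, pve = simulate(sigma)
        print(f"sigma = {sigma:.1f}: R^2 = {r2:.2f}, PVE = {pve:.2f}, "
              f"ratio = {r2 / pve:.2f}")
```

Because the results are averages over random draws, the printed values will vary slightly with the seed and the number of iterations.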

Table 2.

Simulation results: Relationship between R² and PVE.

R² (95% CI)	PVE (95% CI)	R²-PVE (95% CI)	R²/PVE (95% CI)
0.99 (0.99 – 0.99)	0.90 (0.89 – 0.91)	0.09 (0.08 – 0.10)	1.10 (1.09 – 1.11)
0.96 (0.96 – 0.97)	0.80 (0.79 – 0.82)	0.16 (0.15 – 0.17)	1.20 (1.18 – 1.21)
0.92 (0.91 – 0.93)	0.71 (0.69 – 0.73)	0.20 (0.20 – 0.21)	1.29 (1.27 – 1.31)
0.86 (0.85 – 0.88)	0.63 (0.61 – 0.65)	0.23 (0.22 – 0.24)	1.37 (1.34 – 1.40)
0.80 (0.78 – 0.82)	0.55 (0.53 – 0.58)	0.25 (0.24 – 0.26)	1.45 (1.41 – 1.48)
0.74 (0.71 – 0.76)	0.49 (0.46 – 0.51)	0.25 (0.24 – 0.26)	1.51 (1.47 – 1.56)
0.67 (0.64 – 0.70)	0.43 (0.40 – 0.46)	0.24 (0.23 – 0.26)	1.57 (1.52 – 1.63)
0.61 (0.57 – 0.65)	0.38 (0.35 – 0.41)	0.23 (0.22 – 0.25)	1.62 (1.56 – 1.69)
0.55 (0.51 – 0.59)	0.33 (0.30 – 0.36)	0.22 (0.20 – 0.24)	1.67 (1.60 – 1.74)
0.50 (0.46 – 0.54)	0.29 (0.26 – 0.32)	0.21 (0.19 – 0.23)	1.71 (1.62 – 1.79)
0.45 (0.41 – 0.50)	0.26 (0.23 – 0.29)	0.19 (0.17 – 0.21)	1.74 (1.64 – 1.84)
0.41 (0.37 – 0.46)	0.23 (0.20 – 0.26)	0.18 (0.16 – 0.20)	1.77 (1.65 – 1.89)
0.37 (0.33 – 0.42)	0.21 (0.18 – 0.24)	0.16 (0.14 – 0.19)	1.79 (1.66 – 1.93)
0.34 (0.29 – 0.38)	0.19 (0.16 – 0.22)	0.15 (0.13 – 0.18)	1.82 (1.67 – 1.96)
0.31 (0.26 – 0.35)	0.17 (0.14 – 0.20)	0.14 (0.12 – 0.16)	1.83 (1.67 – 2.00)
0.28 (0.24 – 0.33)	0.15 (0.12 – 0.18)	0.13 (0.10 – 0.15)	1.85 (1.67 – 2.03)
0.26 (0.21 – 0.30)	0.14 (0.11 – 0.17)	0.12 (0.09 – 0.14)	1.87 (1.67 – 2.06)
0.24 (0.19 – 0.28)	0.13 (0.10 – 0.15)	0.11 (0.09 – 0.14)	1.88 (1.66 – 2.10)
0.22 (0.17 – 0.26)	0.12 (0.09 – 0.14)	0.10 (0.08 – 0.13)	1.89 (1.66 – 2.13)
0.20 (0.16 – 0.24)	0.11 (0.08 – 0.13)	0.09 (0.07 – 0.12)	1.90 (1.65 – 2.16)
0.19 (0.14 – 0.23)	0.10 (0.07 – 0.12)	0.09 (0.06 – 0.11)	1.91 (1.64 – 2.19)
0.17 (0.13 – 0.21)	0.09 (0.06 – 0.12)	0.08 (0.06 – 0.11)	1.92 (1.63 – 2.22)
0.16 (0.12 – 0.20)	0.08 (0.06 – 0.11)	0.08 (0.05 – 0.10)	1.93 (1.62 – 2.24)
0.15 (0.11 – 0.19)	0.08 (0.05 – 0.10)	0.07 (0.05 – 0.09)	1.94 (1.60 – 2.27)
0.14 (0.10 – 0.18)	0.07 (0.05 – 0.10)	0.07 (0.04 – 0.09)	1.95 (1.59 – 2.30)
0.13 (0.09 – 0.17)	0.07 (0.04 – 0.09)	0.06 (0.04 – 0.09)	1.95 (1.58 – 2.33)
0.12 (0.09 – 0.16)	0.06 (0.04 – 0.09)	0.06 (0.04 – 0.08)	1.96 (1.56 – 2.36)
0.11 (0.08 – 0.15)	0.06 (0.04 – 0.08)	0.06 (0.03 – 0.08)	1.97 (1.55 – 2.39)
0.11 (0.07 – 0.14)	0.06 (0.03 – 0.08)	0.05 (0.03 – 0.07)	1.97 (1.53 – 2.41)
0.10 (0.07 – 0.14)	0.05 (0.03 – 0.07)	0.05 (0.03 – 0.07)	1.98 (1.51 – 2.44)

As shown in Table 2 and Figure 2, when R² is high, the difference between R² and PVE is small. In fact, when the regression has a perfect fit, both R² and PVE equal 1. However, as R² decreases, the relative overstatement of PVE by R² grows (the absolute difference R² - PVE peaks at about 0.25 for R² between 0.74 and 0.80 and then narrows): when R² = 0.99, PVE = 0.90, so R² overstates PVE by about 10%. When R² equals 0.80, PVE drops to 0.55, and thus PVE is overstated by 45%. When R² drops to 0.50, PVE decreases to 0.29, where R² overstates PVE by 71%. The trend continues: when R² = 0.10, PVE = 0.05, where PVE is overstated by 98%. Even though R² = PVE = 0 when the model has zero predictive power (${\hat{y}}_{i} = \bar{y}$), further simulations indicate that as R² approaches 0, the ratio R²/PVE approaches 2, as shown in Figure 2 (detailed results are not reported).
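One way to read this pattern, offered here as an assumption rather than a result from the article: if both the dependent variable and the residuals are approximately normal, the mean absolute deviation is proportional to the standard deviation, so RV/TV is roughly $\sqrt{\mathrm{SSR}/\mathrm{SST}}$ and PVE is roughly $1 - \sqrt{1 - R^{2}}$. The sketch below compares that approximation with four (R², PVE) pairs reported in Table 2; the function name is hypothetical.

```python
import math


def pve_normal_approx(r_squared):
    """Approximate PVE from R^2, assuming normal outcome and residuals."""
    return 1 - math.sqrt(1 - r_squared)


# (R^2, PVE) pairs as reported in Table 2.
reported = {0.99: 0.90, 0.80: 0.55, 0.50: 0.29, 0.10: 0.05}

for r2, pve in reported.items():
    approx = pve_normal_approx(r2)
    print(f"R^2 = {r2:.2f}: reported PVE = {pve:.2f}, "
          f"approximation = {approx:.3f}, R^2/approximation = {r2 / approx:.2f}")
```

Under this approximation, $1 - \sqrt{1 - R^{2}}$ is close to $R^{2}/2$ for small R², which is consistent with the ratio R²/PVE approaching 2. It should not be expected to hold for the non-normal or heteroscedastic error terms.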

Figure 2.

Relationship between R² and PVE.

Further, although the distribution of X does not practically affect the relationship between R² and PVE, the distribution of the error term $ε$ does. The ratio of R² to PVE is smaller when the error term $ε$ is generated from any of the other four distributions (uniform, Poisson, gamma, and exponential; note that normally distributed errors are not required in linear models when sample sizes are large). For example, when $ε$ has a Poisson distribution, R²/PVE = 1.49 for R² = 0.10 (detailed results are not reported here), which is substantially smaller than the R²/PVE = 1.98 derived from normally distributed $ε$. In addition, even for normally distributed $ε$, heteroscedasticity can affect the R²/PVE ratio. For instance, when the variance of the error term is proportional to $x^{0.5}$, $x$, or $x^{2}$, the R²/PVE ratios are 1.91, 1.79, and 1.45, respectively, for R² = 0.10 (detailed results are not reported here), which differ from the R²/PVE = 1.98 reported in Table 2 assuming no heteroscedasticity. In short, the distribution of the error term $ε$ in linear models does affect the relationship between R² and PVE.

Discussion

The misconception of R² is pervasive and can have significant ramifications in the real world. For instance, based on R² = 0.65 (r = 0.804 between cancer incidence and normal stem cell divisions), two widely cited studies published in Science concluded that two-thirds of cancer cases are due to intrinsic random genetic mutations and thus are unpreventable.^18,19 This misconceived finding gave rise to the “bad luck” theory, which has been widely covered by mainstream media, academic journals, and everything in between.^20–24 In fact, R² = 0.65 indicates that about 40% rather than two-thirds of cancer is due to random mutations and thus unpreventable (given that the premise that random genetic mutations cause cancer is valid). Obviously, misinterpretations like this can profoundly influence public health policies associated with disease prevention and thus people’s health.
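The figure of about 40% can be read off Table 2 by interpolating between its two nearest rows, (R², PVE) = (0.67, 0.43) and (0.61, 0.38). Linear interpolation is an assumption made here for illustration; the article does not state how the value was obtained.

```latex
\mathrm{PVE}(0.65) \approx 0.38 + \frac{0.65 - 0.61}{0.67 - 0.61}\,(0.43 - 0.38)
                   \approx 0.38 + 0.033
                   \approx 0.41
```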

Given the practical significance, medical researchers, clinicians, and policymakers alike need to understand the difference between R² and PVE. Beneath the large discrepancy between R² and PVE, the misunderstanding or misinterpretation of R² stems from the fact that the true total variation of Y is $\sum_{i = 1}^{n} | y_{i} - \bar{y} |$, not $\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}$. The latter is a byproduct of the ordinary least squares (OLS) method used to fit linear regression, where α and β are selected to minimize $\sum_{i = 1}^{n} ε_{i}^{2} = \sum_{i = 1}^{n} {(y_{i} - α - β x_{i})}^{2}$, rather than $\sum_{i = 1}^{n} | ε_{i} | = \sum_{i = 1}^{n} | y_{i} - α - β x_{i} |$.

In fact, there are algorithms (e.g., linear programming) that minimize $\sum_{i = 1}^{n} | ε_{i} |$ rather than $\sum_{i = 1}^{n} ε_{i}^{2}$, an approach called least absolute deviation (LAD), least absolute errors (LAE), or least absolute residuals (LAR).^25,26 LAD does have advantages over OLS. It minimizes the true differences between the predicted and the observed values rather than the squared differences, which is more meaningful in practice. Doctors do not measure squared blood pressure, meteorologists do not predict squared temperature, and Wall Street does not speculate on squared dollars either. In fact, LAD was found to outperform OLS in investment business forecasting.²⁷ In addition, compared to OLS, LAD is more robust to vertical outliers.²⁸ Of course, LAD has some disadvantages. Absolute values of complex functions are notoriously difficult to manipulate mathematically, whereas squared terms pose no such problem. Further, unlike LAD, OLS possesses some convenient mathematical properties such as SST = SSE + SSR and BLUE (best linear unbiased estimator).
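To make the contrast between the two criteria concrete, the sketch below fits a LAD line to the Table 1 data and compares its sum of absolute residuals with that of the OLS line reported earlier. It is a brute-force illustration, not the linear programming approach cited above: it relies on the known property that, for a simple regression, some LAD-optimal line passes through at least two data points, so it tries every pair. The function names are hypothetical.

```python
from itertools import combinations

# Data from Table 1: miles walked (x) and hours of sleep (y).
x = [3, 1, 2, 5, 4]
y = [7.5, 6.75, 6.5, 8.25, 8.5]


def sum_abs_residuals(intercept, slope):
    """Sum of absolute residuals of the line y = intercept + slope * x."""
    return sum(abs(yi - (intercept + slope * xi)) for xi, yi in zip(x, y))


def fit_lad_brute_force():
    """Return (sum_abs_residuals, intercept, slope) of the best two-point line."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a + b * x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        sad = sum_abs_residuals(intercept, slope)
        if best is None or sad < best[0]:
            best = (sad, intercept, slope)
    return best


lad_sad, lad_intercept, lad_slope = fit_lad_brute_force()
ols_sad = sum_abs_residuals(6.0, 0.5)  # OLS line reported in the text

print(f"LAD: y_hat = {lad_intercept} + {lad_slope} * x, sum |residual| = {lad_sad}")
print(f"OLS: y_hat = 6.0 + 0.5 * x, sum |residual| = {ols_sad}")
```

The pairwise search costs on the order of n³ operations, so it is only practical for small samples; linear programming or a dedicated quantile regression routine would be the usual choice for real data.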

Obviously, when the objective is to forecast, LAD is superior, especially given today’s computing power -- fitting LAD can be readily carried out, and its predictive power can be easily gauged.²⁹ On the other hand, if hypothesis testing is also involved (e.g., assessing whether an intervention is effective or not), OLS-based linear models are more convenient because none of the commonly used tests are valid under LAD. Nonetheless, LAD has rarely been used in medical research, and the misconception of R² in linear regression has been overlooked.

Conclusion

Given the consequences of misinterpreting R² in practice, and given that PVE can be readily calculated, PVE, a measure of the true variation in the dependent variable explained by the model, should be reported whenever linear models are applied.

Footnotes

Acknowledgements

This material is based upon work supported (or supported in part) by the Department of Veterans Affairs, Veterans Health Administration, Office of Research and Development. The author is indebted to Mr Frederick Malphurs, a retired senior healthcare executive, a visionary leader who dedicated his entire 37 years’ career to patient care, for his continued support of research to improve public health.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Jian Gao

References

1. Cowan. Francis Galton’s statistical ideas: the influence of eugenics. Isis 1972; 63(219): 509–528.

2. Stanton JM. Galton, Pearson, and the Peas: a brief history of linear regression for statistics instructors. Journal of Statistics Education 2001; 9(3). DOI: 10.1080/10691898.2001.11910537

3. Corbett, Nason, Flach, et al. Immune correlates of protection by mRNA-1273 vaccine against SARS-CoV-2 in nonhuman primates. Science 2021; 373.

4. Morden, Chyn, Wood, et al. Racial inequality in prescription opioid receipt - role of individual health systems. N Engl J Med. 2021; 385(4): 342–351. DOI: 10.1056/NEJMsa2034159

5. Levin, Lustig, Cohen, et al. Waning Immune Humoral Response to BNT162b2 Covid-19 Vaccine over 6 Months. N Engl J Med. 2021; 385(24): e84. DOI: 10.1056/NEJMoa2114583

6. Bunyavanich, Grant, Vicencio. Racial/Ethnic variation in nasal gene expression of transmembrane serine protease 2 (TMPRSS2). JAMA 2020; 324(15): 1567–1568. DOI: 10.1001/jama.2020.17386

7. Speaker, Pfoh, Pappas, et al. Oral temperature of noninfected hospitalized patients. JAMA 2021; 325(18): 1899–1901. DOI: 10.1001/jama.2021.1541

8. Chicco, Warrens, Jurman. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci. 2021; 7: e623. DOI: 10.7717/peerj-cs.623

9. Ellis, Hsu, Siracuse, et al. Development and assessment of a new framework for disease surveillance, prediction, and risk adjustment: the diagnostic items classification system. JAMA Health Forum 2022; 3(3): e220276. DOI: 10.1001/jamahealthforum.2022.0276

10. Scheinker, Valencia, Rodriguez. Identification of factors associated with variation in US county-level obesity prevalence rates using epidemiologic vs machine learning models. JAMA Netw Open 2019; 2(4): e192884. DOI: 10.1001/jamanetworkopen.2019.2884

11. Levy, Louis, Cote, et al. Contribution of aging to the severity of different motor signs in Parkinson disease. JAMA Neurology 2005; 62(3): 467–472. DOI: 10.1001/archneur.62.3.467

12. Schober, Vetter. Linear Regression in Medical Research. Anesth Analg 2021; 132(1): 108–109. PMID: 33315608; PMCID: PMC7717471. DOI: 10.1213/ANE.0000000000005206

13. Lunt. Introduction to statistical modelling: linear regression. Rheumatology (Oxford) 2015; 54(7): 1137–1140. PMID: 23594471. DOI: 10.1093/rheumatology/ket146

14. Marill. Advanced statistics: linear regression, part II: multiple linear regression. Acad Emerg Med 2004; 11(1): 94–102. PMID:. DOI: 10.1197/j.aem.2003.09.006

15. Fernando J. R-Squared. https://www.investopedia.com/terms/r/r-squared.asp. Investopedia, April 08, 2023.

16. Wright. Correlation and causation. Journal of Agricultural Research 1921; XX(7): 557–585.

17. Krishnamoorthy. Handbook of statistical distributions with applications. 2nd edn. Boca Raton, FL: Chapman and Hall/CRC Press, 2016.

18. Tomasetti, Vogelstein. Cancer etiology. Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science 2015; 347(6217): 78–81.

19. Tomasetti, Vogelstein. Stem cell divisions, somatic mutations, cancer etiology, and cancer prevention. Science 2017; 355(6331): 1330–1334.

20. Grady. Cancer’s random assault. 2015. The New York Times, https://www.nytimes.com/2015/01/06/health/cancers-random-assault.html#:∼:text=It_may_sound_flippant_to,happen_when_healthy_cells_divide

21. Healy. ‘Bad luck’ with random DNA errors is responsible for two-thirds of cancer mutations, study says. Los Angeles Times. 2017. https://www.latimes.com/science/sciencenow/la-sci-sn-cancer-bad-luck-20170323-story.html

22. Zhu, Thompson, et al. Evaluating intrinsic and non-intrinsic cancer risk factors. Nat Commun. 2018; 9(1): 3490. DOI: 10.1038/s41467-018-05467-z

23. Belizário. Cancer Risks Linked to the Bad Luck Hypothesis and Epigenomic Mutational Signatures. Epigenomes 2018; 2(3): 13. DOI: 10.3390/epigenomes2030013

24. Goldhaber. The randomness of life: bad luck cancers. American Council on Science and Health, 2021, https://www.acsh.org/news/2021/06/24/randomness-life-bad-luck-cancers-15629

25. Charnes, Cooper, Ferguson. Optimal estimation of executive compensation by linear programming. Management Science 1955; 1: 138–151.

26. Dielman. Least absolute value regression: recent contributions. Journal of Statistical Computation and Simulation 2005; 75: 263–286.

27. Meyer, Glauber. Investment decision. Economic and public policy. Boston: Harvard Business School, 1964.

28. Dodge. LAD regression for detecting outliers in response and explanatory variables. Journal of Multivariate Analysis 1977; 61: 144–158.

29. Mckean, Sievers. Coefficients of determination for least absolute deviation analysis. Statistics and Probability Letters 1987; 5: 49–54.