Abstract
In regression analysis, the adjusted R2 value describes the proportion of the variance in the dependent variable that is explained by the independent variables in the equation. What remains is unexplained or residual variance. Residual variance has two components: the contribution of unmeasured variables (residual confounding) and measurement error (noise). These concepts are explained using examples.
Previous articles in this series introduced the concept of statistical noise and discussed statistical noise in the context of individual variables, randomized controlled trials, and observational studies.1,2 This article considers statistical noise in the context of regression analysis.
In regression, we quantify the hypothetical effect of an independent variable on a dependent variable. For example, in a univariable regression analysis, we may examine how strongly the IQ of students predicts their multiple-choice question (MCQ) test results. Or, in a multivariable regression analysis, we may examine how well IQ, number of lectures attended, and socioeconomic status (SES) predict MCQ scores. In the latter example, we may be interested in the effect of each of the three predictor variables. Or, we may be interested in only the effect of IQ on MCQ scores, in which case we say that the regression was “adjusted” for attendance and SES, both of which have the potential to alter the signal that describes the relationship between IQ and MCQ results. As a side note, the multivariable regression is run in the same way whether we are interested in the effect of every independent variable or in the effect of only one, with the remaining variables serving as adjustments.
The multivariable regression analysis provides us with many results, one of which is an R2 value. R2 tells us the proportion of the variance in the dependent variable that is explained by the independent variables. R2 ranges from 0 to 1 (or 0 to 100%). So, if R2 in our study is 0.43, it means that the independent variables IQ, attendance, and SES explain 43% of the variance in MCQ scores. The implication is that the remaining 57% of the variance in MCQ scores remains unexplained.
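As a minimal sketch of what lies behind this number, the Python snippet below computes R2 as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the five observed and predicted MCQ scores are hypothetical and were made up for illustration; they do not come from any study. The sketch assumes that the predicted scores come from a least-squares regression with an intercept, in which case the result lies between 0 and 1.

```python
def r_squared(observed, predicted):
    """Proportion of the variance in `observed` that is explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    # Total variation of the outcome around its own mean.
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    # Variation left over after the model's predictions are taken into account.
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical MCQ scores and hypothetical model predictions (illustration only).
mcq_observed = [52, 61, 70, 48, 79]
mcq_predicted = [55, 60, 66, 53, 76]

r2 = r_squared(mcq_observed, mcq_predicted)
unexplained = 1 - r2  # the residual (unexplained) share of the variance
print(f"explained: {r2:.2f}, unexplained: {unexplained:.2f}")
```

Whatever the data, the explained and unexplained proportions sum to 1, which is why an R2 of 0.43 implies that 57% of the variance is unexplained.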
Unexplained (Residual) Variance
Unexplained variance has two components: (a) the contribution of (known and unknown) unmeasured variables and (b) measurement error. As examples, in addition to IQ, attendance, and SES, it is likely that interest in the subject (known and unmeasured), examination preparedness (known and unmeasured), and other variables that we don’t know about (unknown and unmeasured) influence MCQ performance. If we could add these variables to our study, we might find that our variables explain a higher proportion, say 75%, of the variance in MCQ results.
In the real world, especially in medical and social science research, study variables never explain 100% of the variance in the outcome. As already stated, this is partly because there are always plenty of known and unknown unmeasured variables. But, even if we somehow managed to include every relevant variable in our predictor set, we would still not explain 100% of the variance in the outcome. This is because of inevitable measurement error.
Unexplained variance from relevant unmeasured variables can constitute noise because, for example, lack of interest in the subject or lack of examination preparedness could distort the signal that links IQ to MCQ performance. Measurement error, however, is a more readily understood source of noise. Consider: intelligence may validly predict MCQ results, but an IQ score is only a proxy for intelligence, and, anyway, IQ tests are not always administered in an ideal way. So, measurement errors may arise. Likewise, attendance may be a crude proxy for interest in and understanding of the subject. Hours of study per day in the week before the MCQ test may be a crude proxy for examination preparedness because different students learn differently in the same time span. Finally, the MCQ items may not tap a full understanding of the syllabus, and these MCQ items may not have been framed in an ideal way. So, measurement error because of (often unavoidable) poor operationalization of variables introduces noise into studies.
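The effect of measurement error on explained variance can be illustrated with a small simulation. The sketch below is hypothetical throughout: it assumes that MCQ scores depend on true intelligence plus random influences, and that the IQ score equals true intelligence plus an equally large random measurement error, all in standardized units. The variable names, the sample size, and the error sizes are assumptions chosen for illustration.

```python
import random

def squared_correlation(x, y):
    """Square of the Pearson correlation between x and y (univariable r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

rng = random.Random(42)  # fixed seed so that the sketch is reproducible
n = 5000

# ASSUMED data-generating process, standardized units, illustration only.
intelligence = [rng.gauss(0, 1) for _ in range(n)]
# The outcome depends on true intelligence plus other random influences.
mcq = [t + rng.gauss(0, 1) for t in intelligence]
# The IQ score is true intelligence plus random measurement error.
iq_score = [t + rng.gauss(0, 1) for t in intelligence]

r2_true = squared_correlation(intelligence, mcq)
r2_measured = squared_correlation(iq_score, mcq)
print(f"true intelligence: {r2_true:.2f}, measured IQ: {r2_measured:.2f}")
```

Under these assumptions, true intelligence would explain about half of the variance in MCQ scores in expectation, whereas the error-laden IQ score would explain only about a quarter. The underlying relationship has not changed; the noisier measurement merely makes the signal look weaker.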
Side Notes
In univariable regression, r2, the square of the correlation coefficient r, tells us the proportion of the variance in the dependent variable that is explained by the single independent variable. So, hypothetically, if the correlation between IQ and MCQ scores is 0.60, IQ explains 0.36 or 36% of the variance in MCQ scores. In similar univariable regressions, we may find that attendance explains 20% of the variance in MCQ scores and that SES explains 25% of the variance. So, will IQ, attendance, and SES together explain 36+20+25 or 81% of the variance? No, because variance explained by different variables can overlap. For example, hypothetically, students with higher SES may be more fluent in language and test-taking and may therefore perform better on IQ tests; and students with higher IQ may be more motivated to attend all classes.
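The overlap can be made concrete for the simpler case of two predictors, for which R2 can be written in terms of the three pairwise correlations. The sketch below uses the values given above for IQ (r = 0.60) and SES (25% of the variance, taken here as r = 0.50); the correlation of 0.40 between IQ and SES is an assumption introduced only for illustration, as is the function name.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """R2 for a regression on two predictors, from the pairwise correlations.

    r_y1, r_y2: correlation of each predictor with the outcome.
    r_12: correlation between the two predictors (must not be +1 or -1).
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

r_iq_mcq = 0.60   # from the text: IQ alone explains 36% of the variance
r_ses_mcq = 0.50  # from the text: SES alone explains 25% (positive r assumed)
r_iq_ses = 0.40   # ASSUMED correlation between IQ and SES (hypothetical)

sum_of_parts = r_iq_mcq ** 2 + r_ses_mcq ** 2
joint = r2_two_predictors(r_iq_mcq, r_ses_mcq, r_iq_ses)
no_overlap = r2_two_predictors(r_iq_mcq, r_ses_mcq, 0.0)
print(f"sum: {sum_of_parts:.2f}, joint: {joint:.2f}, no overlap: {no_overlap:.2f}")
```

With the assumed overlap, IQ and SES together explain about 44% of the variance rather than 61%. The simple sum is recovered only when the two predictors are uncorrelated with each other.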
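In multivariable regression, we look not at R2 but at adjusted R2. As a simplification, this is because R2 can only stay the same or rise as independent variables are added, whether or not they are truly relevant; it is therefore a potentially inflated value that needs to be adjusted for the number of independent variables entered.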
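One commonly used form of the adjustment is 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of independent variables. The sketch below applies it to the R2 of 0.43 used earlier, with three predictors. The two sample sizes are hypothetical, and the function name is an assumption made for illustration; statistical software may report a differently computed variant.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjust R2 for the number of independent variables entered."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n_observations - 1) / degrees_of_freedom

r2 = 0.43          # value used in the text
n_predictors = 3   # IQ, attendance, and SES

# Hypothetical sample sizes, chosen only to show how the penalty varies.
adj_large = adjusted_r_squared(r2, 100, n_predictors)
adj_small = adjusted_r_squared(r2, 20, n_predictors)
print(f"n=100: {adj_large:.3f}, n=20: {adj_small:.3f}")
```

With 100 students, the adjusted value is about 0.41; with only 20 students, it falls to about 0.32. The penalty is larger when the number of independent variables is large relative to the sample size.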
Clinical and Research Implications
We prescribe an antidepressant to a depressed woman. We know that antidepressants work for depression and that the patient should respond because she has depression. However, if the response of depression to antidepressant treatment is the signal, the patient also carries unexplained variance arising from variables that pull the value of the signal in one direction or another. That is why we do not know in advance how well a particular patient will respond to a particular antidepressant; the signal does not always show through the noise.
We wish to examine the influence of childhood emotional abuse on adult cognition. Adult cognition can be measured with reasonable reliability using a standardized battery of cognitive tests. However, how well can we measure childhood emotional abuse and the other variables that can influence adult cognition and thereby contribute to noise? These variables include antenatal, intranatal, postnatal, nutritional, socioeconomic, medical, alcohol and substance abuse, and other variables, as well as unknown and unmeasurable influences related to genes and childhood and adult environmental adversities. The scope for measurement error and noise is large.
Concluding Notes
This series of articles has explained statistical noise in research in various contexts, from its presence in a single variable to its presence in RCTs and observational studies. When designing studies, investigators need to take steps that pre-emptively reduce the presence of noise and accurately measure variables that may be a source of the noise. When analyzing data, researchers need to use appropriate methods that best adjust for noise. When reading research, readers need to consider the extent to which noise may mask, blur, or falsify a signal.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
