Pitfalls of big data

To the Editor

The paper by Dipnall et al. (2017) exemplifies many of the problems we face in using and interpreting ‘big data’ and ‘data science’ approaches, particularly when they are applied to observational data: the choice of dataset, the variables used, and the difference between a model’s predictive ability and the inferences drawn from it. As Munafò et al. (2017) highlight, a lack of reproducibility bedevils many disciplines and leads to the repeated identification of ‘predictors’ that cannot be replicated. This paper will contribute to that problem.

The National Health and Nutrition Examination Survey (NHANES) is an interesting survey but does not assess many of the strongest risk factors for depression, such as family history, discrimination, abuse, work stress or even prior mental ill health, and includes many that are rarely assessed, for example, pesticide usage, making replication or use of this model in other data extremely unlikely.

The omission of these very important factors effectively weighs the pre-test likelihood of them being included in the risk model as zero and promotes the post-test likelihood of other factors. The results stating the ‘relative importance of the depression determinants’ are almost impossible to interpret in the absence of so many important determinants.

The authors compare the Risk Index for Depression (RID) to the Framingham index and other risk prediction models of FUTURE outcome in those currently without that outcome. However, their analysis is of a cross-sectional survey evaluating the concurrent association of certain selected factors with CURRENT depression. It is not a risk prediction model as claimed, of which there are examples in psychiatry (Fernandez et al., 2017).

As such, the claim that RID ‘shows promise for future clinical use by providing indications of main determinant(s) associated with a patient’s predisposition to depression’ is unjustified, as it does not assess predisposition to a future condition, only the presence of a current one. The peculiar variable selection also hugely limits ‘the ability to be translated for the development of risk indices for other affective disorders’. I fear this paper is a triumph of advanced data techniques and mining over basic epidemiological thinking.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

ORCID iD

Nick S Glozier

References

Dipnall, Pasco, Berk et al. (2017) Getting RID of the blues: Formulating a risk index for depression (RID) using structural equation modeling. Australian and New Zealand Journal of Psychiatry 51: 1121–1133.

Fernandez, Salvador-Carulla, Choi et al. (2018) Development and validation of a prediction algorithm for the onset of common mental disorders in a working population. Australian and New Zealand Journal of Psychiatry 52: 47–58.

Munafò, Nosek, Bishop DVM et al. (2017) A manifesto for reproducible science. Nature Human Behaviour 1: 0021.