The Turker Blues: Hidden Factors Behind Increased Depression Rates Among Amazon’s Mechanical Turkers

Abstract

Data collection from online platforms, such as Amazon’s Mechanical Turk (MTurk), has become popular in clinical research. However, there are also concerns about the representativeness and quality of these data for clinical studies. The present work explores these issues in the specific case of major depression. Analyses of two large data sets gathered from MTurk (Sample 1: N = 2,692; Sample 2: N = 2,354) revealed two major findings. First, failing to screen for inattentive and fake respondents artificially and significantly inflates the rates of major depression (by 18.5%–27.5%). Second, after the data sets are cleaned, the rate of depression among MTurk respondents is still 1.6 to 3.6 times higher than general population estimates. Approximately half of this difference can be attributed to differences in composition between MTurk samples and the general population (i.e., sociodemographics, health, and physical activity lifestyle). Several explanations for the other half are proposed, and practical data-quality tools are provided.

Keywords

depression, crowdsourcing, Mechanical Turk, prevalence of depression, data quality measures, open data, open materials


