Exposing the many biases in machine learning

Abstract

In recent years, numerous articles have highlighted bias in the machine learning algorithms that underpin the use of AI in decision making, specifically algorithms trained on historical real-world observations. Less has been written, however, about the many ways bias can be introduced into the machine learning process. This article outlines 12 types of bias that can occur during the data science process, from capture through curation to analysis and application.

Keywords

algorithms, artificial intelligence, bias, big data, machine learning, responsible AI

