Abstract
In recent years, numerous articles have highlighted bias in the machine learning algorithms that underpin the use of AI in decision making, specifically algorithms trained on historical real-world observations. However, less has been written about the many ways bias can be introduced into the machine learning process. This article outlines 12 different types of bias that can occur during the data science process, from capture through curation to analysis and application.
