
Abstract

Objective: Mitigation of racism in artificial intelligence (AI) is needed to improve health outcomes, yet no consensus exists on how this might be achieved. Methods: At an international conference in 2022, experts gathered to discuss strategies for reducing bias in healthcare AI. Results: This paper delineates these strategies along with their corresponding strengths and weaknesses and reviews the existing literature on these strategies. Conclusions: Five major themes resulted: reducing dataset bias, accurate modeling of existing data, transparency of artificial intelligence, regulation of artificial intelligence and the people who develop it, and bringing stakeholders to the table.

Keywords

AI, bias, healthcare, machine learning, policy

Background

Over the last 50 years, there has been an abundance of literature from social science disciplines on the existence of racial disparities across all social domains. Significant investment to eliminate racial health disparities has been made with the convergence of racial injustice protests and national solidarity; empirical evidence related to social determinants of health and health disparities; and the financial burden of disparities, which is expected to exceed $363 billion by 2050.¹ Conventionally, racial health disparities have been attributed to genetic variation among races, lifestyle, and behavioral choices. However, findings from the Human Genome Project^2,3 have shown that race is a social construct rather than a biological function, with slight genetic variation between races. Lifestyle and behavioral choices are estimated to account for only 30% of an individual’s health outcomes.⁴ Thus, genetic variation among races, lifestyle, and behavioral choices fail to sufficiently explain racial health disparities.

Researchers have begun to explore racial health disparities as a mechanism of structural racism. Structural racism occurs when multiple interconnected institutions operate with embedded laws, policies, and rules that result in detrimental treatment for members of a particular racial group. This is in contrast to institutional racism, which focuses on the actions of a single entity.⁵ There are two critical elements of structural racism related to processes or policies: it generates disparities with or without explicit bias or intention⁴ and it reinforces and perpetuates racial group inequities.⁵ A growing body of evidence related to structural racism has galvanized leading health and research organizations and governing entities to address health disparities. The National Institutes of Health (2021) established the UNITE collaborative initiative intending to address structural racism within clinical research by promoting diversity (https://www.nih.gov/ending-structural-racism/unite). The National Academy of Medicine (2021) published the Future of Nursing 2020-2030, Charting A Path to Achieve Health Equity, which includes a detailed plan for nurse leaders to address structural racism.⁶

An emerging field of research has discovered racial bias in artificial intelligence (AI) across multiple fields. This bias has impacted interest rates, housing and employment opportunities, criminal justice, and the social determinants of health of racial minorities.⁷ The exploration of existing algorithms, AI, and machine learning (ML) models in all social institutions, including healthcare, has intensified due to the dissemination of racially biased findings. Regulatory interventions have been suggested in many fields including facial recognition.⁸

While algorithms may be accurate in a population overall, they may differentially recognize and have worse performance for certain populations. This may occur as a result of a biased population in training data, outcomes, variables in survey data, or laboratory values.⁹ An example of dataset bias in AI is the issue of facial recognition systems struggling to accurately identify individuals with darker skin tones. Early training datasets for these systems often lacked diversity, leading to biased models that performed poorly on people of color; they performed even more poorly when detecting women of color. Despite significant publicity and ongoing efforts to ensure fairness and equity in facial recognition technology, bias remains an enduring problem.¹⁰
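The gap between overall and subgroup performance described above can be made concrete with a small sketch. The records, the group labels "A" and "B", and the function name `accuracy_by_group` are hypothetical and chosen only for illustration; they are not drawn from any study cited here.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) for labeled predictions.

    Each record is a (group, true_label, predicted_label) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Hypothetical data: 8 records from group "A" and 2 from group "B".
records = (
    [("A", 1, 1)] * 4 + [("A", 0, 0)] * 4   # every prediction correct for A
    + [("B", 1, 0), ("B", 0, 0)]            # one of two correct for B
)
overall, per_group = accuracy_by_group(records)
print(overall)    # 0.9
print(per_group)  # {'A': 1.0, 'B': 0.5}
```

In this toy case the model looks strong overall (90% accurate) while being no better than chance for the smaller group, which is why reporting performance by subgroup matters.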

While statistical correction,^11–13 pre-processing of data and fairness constraints (i.e., transforming data to remove sensitive information),¹¹ and post-processing (i.e., trying to correct bias when the ML data or algorithms are unavailable)⁷ have been discussed, none represent a comprehensive solution to algorithmic inaccuracies and poor performance.^9,14

AI holds great promise for healthcare, but proper bias detection and correction are essential in ensuring good healthcare outcomes.¹⁵ There is a growing need for discussion on the topic of bias correction in ML and AI.¹⁵ The purpose of this white paper is to describe the findings of a recently held workshop in which experts described potential solutions for detecting and correcting bias within healthcare AI, and is further illuminated by a subsequent literature review. We anticipated that this approach would facilitate the emergence of both innovative and established ideas on the topic. By engaging with renowned experts in dynamic discussions, we aimed to provoke new insights and refine existing concepts; generating insight from the workshop attendees allows us to determine strategies that are currently being thought of and/or utilized by people in the industry. This method not only leverages expert feedback to enhance and validate current evidence but also identifies critical gaps that merit further investigation through subsequent literature review.

Methods

To discuss potential solutions to reducing healthcare bias in ML/AI, we created a workshop that was held during the 2022 Healthcare Information and Management Systems Society (HIMSS) Global Health Conference and Exhibition in Orlando, Florida. Nearly 29,000 professionals across disciplines in healthcare and data science attended the conference and approximately 50 of those professionals in medicine, nursing, engineering, data science, and healthcare education participated in the workshop; specific demographics of the attendees were not collected as this was not a research study. The workshop began with a presentation that defined bias in healthcare and discussed the effect of biased data on AI in healthcare. After the presentation, workshop participants were requested to provide feedback aimed at achieving the following objectives.

• Contrast various proposed solutions to reduce and eliminate bias in healthcare AI

• Hypothesize best solutions for addressing concerns about bias in healthcare, whether by mixing existing solutions or determining if a new solution is needed

• Appraise the solutions presented during the discussion session for reducing/eliminating bias in healthcare AI

The workshop was recorded and posted on the conference platform for viewing by conference attendees who were unable to attend the session, and viewers were invited to submit additional feedback. That additional feedback was incorporated with the feedback from the workshop.

Participants were encouraged to provide feedback on the content and the ideas presented by others. Notes were taken by the session moderator and subsequently analyzed for thematic content by the authors of this paper within the framework of current evidence, using the methods suggested by Kiger and Varpio.¹⁶ First, we, the workshop moderators, familiarized ourselves with the data by attending the workshop, discussing the results, and reviewing the notes and transcripts from the workshop. We then generated initial codes, which were reviewed by the manuscript authors, and searched for themes. Through iterative rounds, we defined and named the themes and worked together to produce the report. Supported by a subsequent literature review, the results of the thematic analysis are presented and further explored below. As this workshop did not fit the criteria of human subjects research, Institutional Review Board approval was not required.

Results

For the purposes of the workshop, AI was defined as mathematical algorithms used to simulate intelligence built on datasets input by healthcare systems. Because of the cyclic nature of feedback meant to improve algorithms over time, biases in the dataset or the algorithms can have a positive feedback loop, increasing the biases over time.

Participant feedback related to the reduction of AI bias in healthcare encompassed five major themes: reducing dataset bias, accurate modeling of existing data, transparency of AI, regulation of AI and the people who develop it, and bringing stakeholders to the table. These are provided in Table 1 and in the text that follows, together with corresponding strengths and weaknesses. Notably, overlapping factors that cause bias in AI are depicted in Figure 1.

Table 1.

Approaches to reduce bias in healthcare and potential strengths and weaknesses based upon workshop participant feedback.

Solution	Strengths	Weaknesses
Reducing dataset bias
Ensuring that all the data collected in the electronic health record (EHR) are suitable for use and all appropriate data are used in modeling^17,18	Captures the rich data available in the EHR¹⁸	Difficult to provide all possible options for the end-user and user error may occur when entering data^17,18
Using synthetic data to build models¹⁹	Can fill gaps for modeling to reduce disparities¹⁹	May not be useful for real-world scenarios¹⁹
Experimenting with ensemble models²⁰	Already well researched²⁰	Difficult to maintain out-of-domain and in-domain performance; lose ability for auto-detection of dataset bias²⁰
Crowd-sourcing to collect more diverse datasets²¹	Could increase the diversity of the data collected^22,23	Data could be incomplete; users self-select, thus, introducing self-selection bias^22,24
Accurate modeling of existing data
Building AI that can handle the complexity of electronic healthcare data properly^17,18	AI is built to account for the decision maker making the selection based on unobservable data²⁵	• EHR is one of the most complex datasets and risks the potential of overfitting data¹⁸ • It is difficult to build a model that can account for unobservable data²⁵
Building EHR systems that reduce user fatigue	Errors could potentially be reduced if the provider is not overwhelmed by the data input process²⁵	• Auto-population of EHR data can increase error • Insurance/payers drive healthcare systems to increase documentation¹⁷
Being cognizant of biases early in the ideation phase of clinical trials	Mitigation strategies can be promoted early in the modeling life-cycle	Several models/applications do not require clinical trials
Transparency of AI
Explaining AI results and recommendations²⁶	Allows providers to make informed clinical decisions and/or corrections	• Needs humans to ask the questions, which may introduce additional bias^27,28 • Explanations may be unreliable or misleading²⁹
Revisiting the assumption that AI models need to be black-box^29,30	• Allows a user to understand how variables are used to make predictions³⁰ • Allows for stakeholder involvement in the development of models²⁷	May stifle innovation by limiting the use of black box AI²⁶
Regulation of AI
Regulating data scientists in healthcare and standardizing what is expected of developers	Models would be more standardized and clinical outcomes more “value” based	• Current state, minimum requirements, and skillsets are not currently defined • Possible unintended consequences such as stymied growth
Testing continuously for bias after implementation	Real-world data could be submitted for proof of feasibility and then monitored for developing bias	• Regulation on software that was built on limited or old datasets • Continuous auditing of new software and algorithms is time consuming
Utilizing AI tools that detect bias	Easy to disseminate and easy to regulate	• While there are calls for these types of solutions, they are not yet readily available³¹ • The circular logic of using AI to detect AI repeats the same bias problems
Bringing stakeholders to the table
Capitalizing on the human-machine relationship by having stakeholders at the table	Reduce the reinforcement of bias	Often overlooked³²
Increasing underrepresented groups in the health IT field	Decrease bias caused by development through a singular lens	• Need to standardize the skill set so that underserved groups are able to meet the required criteria • Often overlooked³³
Developing self-awareness of intrinsic bias through training programs³⁴	Scenario-based programs could help users catch the pitfalls in their thinking³⁴	Hard to develop programs that can make a sustained impact³⁵

Figure 1.

The relationship of factors that contribute to bias. A model showing how the various factors that introduce bias into AI overlap; these need to be regulated. Opacity and error, as well as personal bias, cause bias in both data and models and reinforce bias in each other. Bias in data sets contributes to bias in models. Regulation should be governed by diverse stakeholders and regulation should also include regulation of stakeholders.³⁶

Discussion

Reducing dataset bias

The discussion from workshop participants revealed that the use of existing data may have inherent biases. While some datasets are created for the express purpose of developing AI on a particular topic, existing datasets must often be used; a key example is the Electronic Health Record (EHR). An EHR is a digital database of patient information, medical results, provider notes, images, and much more.¹⁸ While it serves as the repository of rich information that can be used to better understand medical procedures and practices, it stems from what was formerly a physical document, which through many iterations was developed into an electronic format over the past 40 years. Because of this, EHRs and related data warehouses combining these records are exceedingly complex with many redundancies. Often, finding information can be difficult, even with the development of natural language processing to assist in this capacity.

Participants suggested that one of the most important ways to reduce bias in AI is to ensure that the data used to train the model are accurate and suitable for use in an AI model.³⁷ For example, during a patient encounter, a provider has the opportunity to do a review of systems (ROS) and to account for the review in the EHR.¹⁷ The data input into the EHR during the patient appointment provide valuable information that can be used to understand how to treat the patient in the future. If the data that are input into the EHR based upon the encounter are inaccurate, then the future decisions made based upon these data will be biased. Studies have shown that there are inconsistencies between audio-visual data of what occurred in the appointment and the information input into the EHR.¹⁷ Thus, while ensuring that data input into an EHR will generate rich data with which to model, accuracy of the data must also be considered.

Workshop participants also suggested that synthetic data may be used to build a model if the data in an EHR are not reliable or do not include sufficient observations from a given racial, ethnic, or minority group. The use of synthetic data has many strengths. Synthetic data are artificially generated or simulated information, modeled to reflect the statistical properties of real data derived from actual events.¹⁹ There is little cost associated with generating synthetic data, there are no privacy concerns, it enables increased exploration, and it potentially avoids the need for a lengthy review process.¹⁹ While these strengths are many, synthetic data, depending on how they are simulated, may not be accepted by a stakeholder viewing the results. This is especially the case if the synthetic data are created in such a way that they do not reflect what would occur in a real-world scenario. For example, if there is insufficient underlying evidence about certain populations, synthetic data will not reflect the statistical properties of those populations and will, thus, create unintended bias. If the use of synthetic data is discouraged and a dataset is known to be biased, then an alternative could be to experiment with how the dataset is sampled and modeled. For example, an ensemble of models (where a group of models is combined and used to build a better predictive model than any of the individual models) could be built using the current data,²⁰ thus reducing the bias of using a single methodology to model the data.
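As a minimal sketch of the ensemble idea mentioned above, the following combines several classifiers by majority vote. The three threshold rules and the risk scores are hypothetical stand-ins for models built with different methodologies; majority voting is only one of several ways to combine models and is not claimed to be the approach used in the cited work.

```python
from collections import Counter


def majority_vote(models, x):
    """Combine several classifiers by returning the most common prediction."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical rule-based classifiers applied to a single risk score.
# Each uses a different threshold, standing in for a different methodology.
def model_low(score):
    return int(score >= 0.3)


def model_mid(score):
    return int(score >= 0.5)


def model_high(score):
    return int(score >= 0.7)


ensemble = [model_low, model_mid, model_high]
print([majority_vote(ensemble, s) for s in (0.2, 0.4, 0.6, 0.8)])
# [0, 0, 1, 1]
```

Using an odd number of binary classifiers avoids tied votes. An ensemble of this kind only dilutes the idiosyncrasies of individual models; it cannot remove bias that all members inherit from the same dataset.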

An additional intervention suggested by workshop participants to reduce dataset bias was the use of crowdsourcing of data. If a dataset omits certain minority groups and is thus known to be biased, the public could be requested to reply to a survey or there could be a call for data to supplement the dataset (i.e., “crowdsourcing”). Studies have shown that crowdsourced data can increase the quality of data, reduce the cost of acquiring data, and increase the volume of data.²⁴ In a recent study, Yank et al.³⁸ compared crowdsourced data to comparable data from the Behavioral Risk Factor Surveillance System (BRFSS). While individuals in the crowdsourced data had slightly different characteristics from the BRFSS data (e.g., younger and with higher levels of education), the additional data were valuable in making decisions, and the differences were controlled for in the analysis. Historically, this method has been limited because it tends to recruit mainly younger, more highly-educated, non-Hispanic whites, or has not collected robust datasets.^24,38 However, recent efforts have demonstrated the ability to intentionally target more diverse, representative, and robust samples, such as the All of Us Research Program conducted by the National Institutes of Health.²³

All of Us came out of the National Institutes of Health’s (NIH) “Precision Medicine Initiative Working Group of the Advisory Committee to the Director.” To achieve precision medicine (i.e., individualized healthcare plans based on the individual and incorporating the environment), there must be a representative database. Therefore, the All of Us project aims to build a diverse and comprehensive database through online recruitment ad campaigns that can provide data for thousands of research studies on a variety of topics and conditions. Within this program, there is a targeted effort to include a representative sample of all types of people and groups.²³ This is one of the largest examples of crowdsourcing and demonstrates the power of this mechanism to develop a large, diverse database. The dataset has recently been used to study health equity in the fields of hypertension³⁹ and pneumonia⁴⁰ and to illuminate biases in access to care.⁴¹ While there is no perfect way to address missing or inaccurate data, the actions discussed in this section, such as crowdsourcing and the use of synthetic data, may help reduce bias.
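When a crowdsourced sample differs demographically from the target population, as in the comparison with national survey data above, one common adjustment is post-stratification weighting. The sketch below is an assumption about how such differences might be controlled for, not a description of the method used in the cited study; the age strata, counts, and population shares are invented for illustration.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight each stratum so the weighted sample matches population shares.

    weight = population share / sample share
    """
    n = sum(sample_counts.values())
    return {
        stratum: population_shares[stratum] / (count / n)
        for stratum, count in sample_counts.items()
    }


# Hypothetical age strata: the crowdsourced sample skews young.
sample_counts = {"18-39": 60, "40-64": 30, "65+": 10}
population_shares = {"18-39": 0.40, "40-64": 0.40, "65+": 0.20}

weights = poststratification_weights(sample_counts, population_shares)
for stratum, weight in weights.items():
    print(stratum, round(weight, 2))
```

Respondents in over-represented strata receive weights below 1 and those in under-represented strata receive weights above 1. Weighting cannot compensate for a stratum with no respondents at all, which is why targeted recruitment remains necessary.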

Accurate modeling of existing data

Because the EHR contains a huge amount of rich information, it is a valuable resource when trying to model and understand disparity in care; workshop participants emphasized that accurate modeling of these kinds of data is needed to reduce bias. AI built to handle the vastness and complexity of the data has the potential of enabling researchers to improve on shortcomings within datasets. While an EHR contains providers’ notes, descriptions, and diagnoses, providers may also make decisions based upon their intangible assessments of a patient. These data have been described as unobservable data.¹⁹ Notably, if a provider has an explicit or implicit bias against patients because of their race, ethnicity, or other factors, the information and the conclusions they formulate will be biased. Although the medical information is recorded in the EHR, the bias/prejudice is not; it is unobservable. Thus, any AI built from the data will be biased and unable to recognize a fairer decision. The decisions from the AI will perpetuate the bias further and reinforce the feedback loop that keeps patients from receiving the care they deserve or need.⁴²

It can be especially difficult to build AI that can account for unobservable data. Lakkaraju et al.²⁵ propose an approach for comparing the performance of a predictive model with that of human decision-makers, in which the results of a single decision-maker (e.g., a provider) are compared with those of a group and, specifically, with the “most lenient” member of the group. For example, in a medical setting, the decision of a provider to recommend a patient for a specific medication would be compared to a group of providers who have made a similar decision, and to the provider who has recommended the medication the most frequently (i.e., the most lenient). If it is found that a model or decision-maker is abnormal compared to the group, then action can be taken. An evaluation of providers’ actions with this type of model could improve the data that are input into the EHR, and thus, improve an AI model built from the data.
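The peer comparison described above can be illustrated with a deliberately simplified sketch. It is not the published method of the cited authors; it only computes each provider's recommendation rate, identifies the most lenient provider, and flags providers whose rate departs from the group mean. The provider names, decisions, and the `tolerance` threshold are all hypothetical.

```python
def recommendation_rates(decisions):
    """Map each provider to the fraction of patients they recommended."""
    return {p: sum(d) / len(d) for p, d in decisions.items()}


def flag_against_peers(decisions, tolerance=0.25):
    """Return (most_lenient_provider, flagged_providers).

    A provider is flagged when their rate differs from the group mean by
    more than `tolerance`. The threshold is an arbitrary illustration.
    """
    rates = recommendation_rates(decisions)
    most_lenient = max(rates, key=rates.get)
    mean_rate = sum(rates.values()) / len(rates)
    flagged = sorted(
        p for p, r in rates.items() if abs(r - mean_rate) > tolerance
    )
    return most_lenient, flagged


# Hypothetical decisions: 1 = medication recommended, 0 = not recommended.
decisions = {
    "provider_1": [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],  # rate 0.9
    "provider_2": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # rate 0.5
    "provider_3": [1, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # rate 0.6
    "provider_4": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # rate 0.1
}
most_lenient, flagged = flag_against_peers(decisions)
print(most_lenient)  # provider_1
print(flagged)       # ['provider_1', 'provider_4']
```

A flag here is only a prompt for review: a provider may differ from peers because of case mix rather than bias, so any real use would need to account for differences in the patients each provider sees.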

Another way suggested to potentially account for currently unobservable data is to try to increase the amount of observable data. This would entail increasing the amount of data that are recorded in the EHR or another mechanism that could recognize and record unobservable physician action. This solution could also be problematic for a few reasons. First, the EHR is already complex. Adding more data and trying to fit a model to a specific dataset from each EHR may result in overfitting of a model. Too many specialized fields in an EHR for a specific hospital would make any model built from that data less robust and applicable across fields and medical centers. Second, health care providers already report fatigue with current entry requirements for EHRs.²⁵ User fatigue when entering data leads to mistakes; errors in the EHR introduce bias in AI. Lakkaraju et al.²⁵ state “If the health-care provider is not overwhelmed by the data input process, the probability of errors will be much smaller.”

Increasing the amount of data collected in the EHR may be necessary in an attempt to collect unobservables, but should only be done if there is confidence that the additional data will help to capture a datapoint that was previously considered unobservable. It is important to develop mechanisms that can account for this bias; in the case of the “most lenient” provider, this could be the amount of funding spent on advertising by pharmaceutical companies (that could unnecessarily influence providers), or whether scientific literature based on the efficacy of certain drugs reaches prescribers.

The final recommended strategy for accurate modeling of existing data was that researchers should be aware of bias during the ideation phase of clinical trials. Planning for ways to reduce bias in clinical trials could potentially reduce the bias in the end result, and should be considered at each stage, starting with the development of the clinical protocol.⁴³

Transparency of AI

Workshop participants emphasized that transparency of AI is important, as it allows for more oversight and accountability. In the process of developing AI, black box models have become prevalent, where the logic behind a model and the link between the inputs and outputs cannot be understood by a human. Some feel that not understanding the mechanisms of the AI can be dangerous because it makes it difficult or impossible for the user to detect mistakes.²⁸ Others have argued that AI models do not need to be black box, and that interpretable models (models in which the inputs and outputs can be linked and understood) do not reduce the accuracy or ability of the model to make predictions.^29,30 Even if a black box model can be explained after it is built, explanations can be unreliable or misleading. This could mean that while explainable models are helpful, it would be better to build an interpretable model, where users can understand how input variables are used to determine the output.³⁰ Some decisions that are high risk may require human oversight to avoid costly or harmful mistakes.²⁹ On the other hand, some argue that explainability is not necessary, as humans often do not fully understand the mechanisms of the tools they use (e.g., most people do not understand how a computer works, but still rely on the output it generates as being valid).²⁶

The idea of transparency relates back to the concept of bias in the dataset. For instance, one may allow for less transparency if the dataset is known to be unbiased, whereas one may be less comfortable with a lack of transparency when dealing with unfamiliar or unexplored concepts.²⁶ Bias can also be introduced during the modeling process, so it is important to use appropriate evaluation metrics and testing procedures to ensure that the model is not biased. For example, training data with labels applied by humans has been shown to be flawed by magnifying the bias introduced during the labeling process.²⁸ AI developers should document the data used to train the model, the features selected, and the algorithms used.
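The recommendation that developers document the training data, the selected features, and the algorithm could take the form of a small structured record kept alongside the model. The sketch below is one possible shape for such a record; the class name, field names, and example values are hypothetical and do not follow any particular reporting standard.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ModelDocumentation:
    """Minimal record of what went into a model (illustrative fields only)."""
    model_name: str
    algorithm: str
    training_data_source: str
    features: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)


doc = ModelDocumentation(
    model_name="readmission-risk-demo",  # hypothetical model
    algorithm="logistic regression",
    training_data_source="EHR extract, single hospital (hypothetical)",
    features=["age", "prior_admissions", "length_of_stay"],
    known_limitations=["few records for some racial and ethnic groups"],
)

# Serialize so the record can be stored or shared with reviewers.
print(json.dumps(asdict(doc), indent=2))
```

Keeping the record machine-readable makes it easier for reviewers or oversight bodies to compare models and to check whether known limitations were disclosed.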

Transparency can also be used as a method to engage stakeholders, by allowing them to manipulate inputs and have confidence that implementation of the model will work properly; it has also been suggested as a way to assist in the regulation of AI.^27,44 Stakeholders are those who have an interest in or are impacted by a model. A compromise between transparency and a willingness to use black box AI that has demonstrated value and accuracy may be an important way to ensure the advancement of technology without introducing additional bias.

Regulation of AI

The participants noted that regulation of AI would be an important step to reduce bias in healthcare AI. Currently, regulation of data scientists and developers in healthcare is lacking; there are minimal requirements for who is qualified to produce, distribute, or vend healthcare AI, and skillsets or credentials are not currently defined. Those opposing such regulation fear the possible unintended consequences of regulation, such as stymied growth.³⁶

Government regulation of AI has not matched the rate of AI adoption across industries. While the United States government has issued a “Blueprint for an AI Bill of Rights,” the document lacks concrete steps for monitoring and regulating both AI and producers of AI.⁴⁵ Moreover, existing laws lack the ability to catch and regulate subtleties that could exacerbate racism in healthcare.⁴⁵ The United Kingdom (UK) has progressed further with the first proposed rules on AI, but these are also lacking in concrete steps to monitor or regulate AI.⁴⁶ The U.S. Food and Drug Administration (FDA) attempted to issue guidance in 2019⁴⁷ on the regulation of AI, but the guidance was issued for discussion purposes only, and therefore had little impact. In 2021, it issued an action plan in response to the proposed guidance, but as of the time of this writing, the plan has not been implemented. Whereas the focus of this paper is to strategize ways to reduce bias in AI models, the workflow proposed by the FDA gives an overview of the lifecycle and process of AI development, denoting points where interventions could be made to ensure the safety and quality of a machine learning product (the full model can be viewed in the report at: https://www.fda.gov/media/122535/download/).³⁷ This workflow may clarify the lifecycle of ML/AI development, but the lack of a concrete implementation plan limits the utility of these suggestions.

Importantly, regulation should include stakeholders, a strategy that overlaps with many of the other aspects already presented here for reducing bias in healthcare AI. Among the ways to include stakeholders are: active involvement of healthcare workers, multidisciplinary sharing of information, training of diverse and underserved groups to become professionals in the field, the ability to ask a human for a second opinion, and having people from outside the industry involved in regulation.⁴⁸

The National Academy of Medicine issued a report that suggests that existing laws may provide regulation over some aspects of AI, but these are not specifically geared toward AI and may miss subtleties that are introduced through the use of AI.⁴⁹ Importantly, while stakeholders are frequently mentioned, the need for regulation and diversity of these stakeholders is not.

External validation of a model could help with regulation by allowing outsiders to confirm the utility of certain AI models. However, external validation requires sharing information that could be private, which limits the external validator’s ability to properly assess a model.⁴⁹ Some authors suggest that full transparency is not feasible for a number of reasons, including disclosure of private data, the loss of a competitive edge for the developers, the ability for users to “game” the system, and a possibility that due to the complex nature of ML, algorithms would be difficult to understand.⁵⁰ To overcome this, oversight boards may allow for regulation of ML and such bodies could be granted full transparency, even if it is not feasible for the public.⁵⁰

Methods such as bias-correction algorithms have been recommended,^11,12 but have inherent problems that cannot be addressed by programming alone. The algorithms require regulation over their production and the training of the producers of the algorithms. As we have discussed, using AI to regulate AI bias exacerbates the issues of bias in AI.

Recent advances in AI have resulted in large language models (LLMs), such as those used by ChatGPT-4, that are being quickly adopted to advance healthcare. LLMs are deep-learning neural networks that use language to predict and generate text in a human-like fashion. While LLMs may be used as a starting point to answer a question, they may provide incomplete or inaccurate responses. Because LLMs are being built on widely available data, they have the propensity to exacerbate biases embedded in data. In the spring of 2023, we asked ChatGPT-3.5 to write this paper. The results were limited in depth and included inaccurate and fabricated references. For example, one article reference had the correct authors and publisher, but an incorrect title (correct title: Fairness and machine learning: Limitations and opportunities; inaccurate title: Fairness and machine learning. In Big Data: A Survey).⁵¹ Another provided reference had the correct title and publisher, but included an additional author and incorrect year (correct authors: Raji & Buolamwini; inaccurate authors: Raji, Hardt, & Buolamwini).⁵² As LLMs rapidly develop, special attention needs to focus on the potential integration of biases, as recent literature shows biases by race, ethnicity, and politics persist.⁵³ Furthermore, companies that develop these models are mainly controlled by the US and so results tend to favor Western English speakers, US culture, and economic status.⁵³

This is partly due to the limited access of other countries to the resources that the US holds: the physical equipment and technology needed to develop LLMs, access to the large datasets needed to develop LLMs (some countries, such as China, restrict both the use of LLMs and access to the data needed to develop such models), and the companies developing LLMs. Additionally, without freedom of speech, government censorship may also hinder development of AI and intentionally introduce biases in AI.⁵⁴

The true power of LLMs is the continuous data input that improves results; however, because of the privacy of health data, this ability is limited. Some have argued that regulation of LLMs should be focused on outcomes rather than the technology itself, while the UK has proposed that regulation should be focused on use-case (e.g., more stringent regulation of high risk scenarios than low risk scenarios).^46,55 Rapid development of LLMs has already revealed that existing laws may not be enough to deter companies from unethical practices; certainly clearer laws with penalties sufficient to deter this behavior must be developed.⁵⁶

Guidelines for testing ML models have been proposed and could be adapted for use in regulation, in addition to requiring researchers and developers to test models as they are built.⁵⁷ Freely available toolkits have been developed for integrating bias detection within algorithms, and commercial companies are developing detection software, but neither has been integrated into governmental oversight.^58,59 Additionally, toolkits may introduce further bias by giving one group ultimate authority, over all others, to decide how bias is detected. While there has been academic discussion about the need for testing for bias, more needs to be implemented in terms of clinical research and regulation.⁵⁸
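As an illustration of the kind of check such testing guidelines and toolkits might include, the following is a minimal sketch in Python, not the method of any toolkit cited here. It compares a model's selection rate and false negative rate across groups. The function names (`group_rates`, `max_gap`), the group labels, and the toy data are all hypothetical and chosen only for illustration; which metric and which grouping are appropriate is itself a judgment that should involve representative stakeholders.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return the selection rate and false negative rate for each group.

    y_true and y_pred are sequences of 0/1 labels; groups holds one
    group label per record. All names here are illustrative.
    """
    counts = defaultdict(lambda: {"n": 0, "pred_pos": 0, "pos": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pred_pos"] += pred                       # flagged by the model
        c["pos"] += truth                           # truly positive
        c["fn"] += int(truth == 1 and pred == 0)    # missed positive
    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "selection_rate": c["pred_pos"] / c["n"],
            # Undefined when a group has no true positives.
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else None,
        }
    return rates


def max_gap(rates, metric):
    """Largest between-group difference for one metric."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values) if values else 0.0


# Hypothetical toy data: two groups with the same true prevalence.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
groups = ["A"] * 6 + ["B"] * 6

rates = group_rates(y_true, y_pred, groups)
fnr_gap = max_gap(rates, "false_negative_rate")
print(rates)
print("False negative rate gap:", fnr_gap)
```

In this toy data, both groups have the same share of true positives, yet the model misses far more of them in group B. A large gap of this kind would be a prompt for further investigation rather than proof of bias on its own.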

Bringing stakeholders to the table

Workshop attendees suggested that one strategy, which overlaps with several of those above, is to ensure that underrepresented groups are included in the development of healthcare AI.^59,60 Underrepresented groups should be included in all aspects of bias detection and correction.⁵⁹ This should include pipeline development, where strategies are put in place to ensure that underserved or underrepresented people have the opportunity to obtain the education and training necessary to become data scientists, software developers, and others.⁶¹ Programs such as the United States’ Affirmative Action help to ensure a diverse body of students in the technology field. Historically, bans on such programs have resulted in fewer people of color across fields of study, with some of the more notable effects in engineering fields.⁶² Thus, recent bans of Affirmative Action programs are likely to increase bias in AI in the long run. Because this strategy overlaps with many of the other methods to reduce bias described above, it is evident that bringing representative stakeholders into the process of creating, using, and regulating AI is critical to reducing bias in healthcare.

To diversify the technology workforce and develop a robust pipeline of AI talent from both within and outside the industry, a multifaceted approach is essential. One key strategy is expanding access to education and training through scholarships, grants, and affordable or free AI courses, particularly targeting underrepresented communities. Early STEM education is also crucial in fostering interest in tech careers. Inclusive hiring practices, such as bias-free recruitment, diverse interview panels, and targeted outreach to institutions like historically Black colleges and universities (HBCUs), can help in attracting diverse talent. Additionally, partnerships between industry and academia, along with mentorship programs, can provide real-world experience and guidance for individuals from underrepresented groups.

Retention and career advancement of diverse employees can be supported through transparency in pay and hiring processes and ensuring diversity at all levels, as well as supporting worker-driven diversity movements.⁶³ Reskilling programs and career transition partnerships can also open doors for professionals from non-tech industries to enter the AI field.

Supporting entrepreneurship among underrepresented groups through funding, innovation hubs, and incubators can further diversify the AI landscape. Finally, advocating for policy changes, such as government incentives for companies that hire and train diverse talent, and fostering public-private partnerships can ensure a comprehensive approach to developing a diverse AI workforce. These strategies are aligned with the growing recognition of the importance of diversity in AI, as a diverse workforce brings varied perspectives that can lead to more innovative and equitable AI solutions.⁶³

Limitations

This paper presents the results of a single workshop. While the HIMSS Global Health Conference is one of the largest health informatics conferences in the world, with many professionals across industries, diverse backgrounds, and a broad geographical representation, it is likely there were some views or ideas that were not represented. To counter this, we invited feedback from those unable to attend via the video posted through the conference proceedings. We also conducted a literature review to augment the feedback and uncover whether there was additional information or ideas missing from the commentary. However, future studies could include other methods to ensure more ideas are included, such as a Delphi survey.

The purpose of this paper was to understand the potential mechanisms that could be used to overcome bias in healthcare AI rather than to define the specific processes to surmount the current issues with bias in healthcare. Specific, concrete steps should be developed and could include forming action committees, both within institutions and within governments, to develop specific policies and procedures to implement the strategies suggested here. Progress measures should include outcome measures (such as more equitable healthcare outcomes) rather than only process measures (e.g., the number of policies or committees developed). Future studies may also compare the relative ability of each strategy to create change.

Conclusion

Our workshop and subsequent literature review of the suggestions resulted in five main domains to be addressed to reduce bias in AI: reducing dataset bias, transparency in AI, accurate modeling of existing data, regulation of AI and the people who develop it, and bringing stakeholders to the table. Current literature reveals a deficit of research and discussion about the need for representative stakeholders, and about the need for diversity in both the development of oversight and the pipeline of future scientists who will create AI.

Evaluating the long-term impact of strategies such as reducing dataset bias, improving AI transparency, and regulating AI systems and their developers is crucial. These strategies will require sustained effort to achieve meaningful short-term results and are likely to produce more significant long-term effects. Similarly, efforts to include diverse stakeholders and create a robust pipeline for increasing diversity among AI developers may not yield immediate results but are expected to have a profound and lasting impact on reducing bias in healthcare AI over time.

The passage of the Health Information Technology for Economic and Clinical Health Act (HITECH Act) in 2009 and subsequent legislation mandating that all private and public providers maintain EHRs has spurred the rapid development of AI to improve health outcomes. The programming adage “garbage in, garbage out” illuminates the risks of health-oriented learning models and algorithms trained on inaccurate data. It also highlights the associated detrimental consequences for health outcomes (including misdiagnosis and undertreatment), and the increase in health disparities for groups that are underrepresented or misrepresented within the dataset. Safeguarding the accuracy of new and existing datasets, such as EHRs, is crucial for health equity. New software and recruitment interventions have attempted to mitigate AI bias, but bringing diverse stakeholders to the table to reduce the flawed input that leads to biased AI models is crucial for long-term success.

The workshop brought together a broad range of experts interested in reducing bias in AI. While most of the ideas presented were not novel, they did provide a comprehensive overview of the many aspects that need to be addressed to ensure that bias is reduced or eliminated from AI. Many of these will require long-term strategic planning to be successful. This paper presents an in-depth overview with clear descriptions of promising methods to reduce bias in healthcare AI.

The workshop offers a significant advantage over traditional literature reviews by harnessing the power of collective idea generation in a much shorter time frame. Unlike literature reviews, which can be lengthy and sometimes stagnant processes, a conference workshop fosters real-time collaboration and rapidly synthesizes diverse perspectives, yielding insights that might otherwise take much longer to uncover.

Our experience underscores this advantage. The workshop application was submitted in 2021, the event took place in 2022, and we have been working to publish our findings since late 2022. This timeline illustrates the efficiency of the workshop model in generating actionable ideas and solutions. By contrast, the process of reviewing and analyzing the data, and conducting a comprehensive literature review took nearly a year, highlighting the extended duration typically required for conventional methods.

Moreover, the workshop surfaced critical and novel insights not previously highlighted in existing literature. For example, it emphasized the urgent need to build a diverse workforce pipeline and address broader societal structural issues that perpetuate racism—topics that have not been sufficiently addressed in prior literature reviews. This demonstrates the workshop’s unique ability to advance the discourse by identifying and prioritizing emerging issues that may be overlooked in more static review processes. Thus, the workshop not only accelerates the generation of innovative ideas but also contributes valuable, previously unexplored perspectives to the field.

We believe focus should be put on developing a pipeline of diverse healthcare technology engineers and healthcare stakeholders. We found nothing in the literature to suggest whether or not having a diverse group of stakeholders improves healthcare outcomes. This is an area ripe for research. Our literature review suggests that while many of these topics have been acknowledged, there are more proposed suggestions for improvement than concrete action plans. Governments and institutions should recognize these stakes and create structures and plans that support the development of a healthcare AI workforce. Strategies and progress measures should be actionable and transparent to ensure change.

Footnotes

Acknowledgements

The authors would like to acknowledge with gratitude those who attended the workshop at the HIMSS 2022 conference and participated in the generation and discussion of these ideas, as well as Dr. Lisa Persad for input into early drafts of the manuscript and Dr. Shiela Strauss for assistance with copyediting earlier drafts of the manuscript.

Author contributions

CS developed the concept for the workshop and paper. Both authors contributed to analysis of findings, literature review, writing, editing, and review of the manuscript.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Carolyn Sun

Shannon L Harris

References

1. Berger, Miller. Health disparities, systemic racism, and failures of cultural competence: authors’ response to commentaries. AJOB 2021; 21: W1–W3. DOI: 10.1080/15265161.2021.1956636.

2. Royal, Dunston. Changing the paradigm from 'race' to human genome variation. Nat Genet 2004; 36: S5–S7.

3. McCann-Mortimer, Augoustinos, LeCouteur. ‘Race’ and the human genome project: constructions of scientific legitimacy. Discourse Soc 2004; 15: 409–432.

4. Powell. Understanding structural racialization. Clgh Rev 2013; 47: 146.

5. Gee, Hicken. Structural racism: the rules and relations of inequity. Ethn Dis 2021; 31: 293–300.

6. Wakefield, Williams, Le Menestrel. The future of nursing 2020-2030: Charting a path to achieve health equity. Washington, DC: National Academy of Sciences, 2021.

7. Huang, Singh. Artificial intelligence and algorithmic bias: source, detection, mitigation, and implications. In: Pushing the boundaries: frontiers in impactful OR/OM research. Catonsville, MD: INFORMS, 2020, pp. 39–63.

8. Buolamwini. Gender shades: intersectional phenotypic and demographic evaluation of face datasets and gender classifiers. Cambridge, MA: Massachusetts Institute of Technology, 2017.

9. Obermeyer, Powers, Vogeli, et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019; 366: 447–453.

10. Gwyn, Roy. Examining gender bias of convolutional neural networks via facial recognition. Future Internet 2022; 14: 375.

11. Saleiro, Kuester, Hinkson, et al. Aequitas: a bias and fairness audit toolkit. https://arxiv.org/abs/1811.05577.

12. Kim, Ghorbani, Zou. Multiaccuracy: black-box post-processing for fairness in classification. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. New York, NY, USA: Association for Computing Machinery, 2019, pp. 247–254.

13. McNamara. Equalized odds implies partially equalized outcomes under realistic assumptions. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. New York, NY, USA: Association for Computing Machinery, 2019, pp. 313–320.

14. Courtland. Bias detectives: the researchers striving to make algorithms fair. Nature 2018; 558: 357–360.

15. Rajpurkar, Chen, Banerjee, et al. AI in health and medicine. Nat Med 2022; 28: 31–38. DOI: 10.1038/s41591-021-01614-0.

16. Kiger, Varpio. Thematic analysis of qualitative data: AMEE Guide No. 131. Med Teach 2020; 42: 846–854. DOI: 10.1080/0142159x.2020.1755030.

17. Berdahl, Moran, McBride, et al. Concordance between electronic clinical documentation and physicians’ observed behavior. JAMA Netw Open 2019; 2: e1911390. DOI: 10.1001/jamanetworkopen.2019.11390.

18. Cyganek, Graña, Krawczyk, et al. A survey of big data issues in electronic health record analysis. Appl Artif Intell 2016; 30: 497–520. DOI: 10.1080/08839514.2016.1193714.

19. James, Harbron, Branson, et al. Synthetic data use: exploring use cases to optimise data utility. Discov Artif Intell 2021; 1: 15.

20. Clark, Yatskar, Zettlemoyer. Don’t take the easy way out: ensemble based methods for avoiding known dataset biases. 2019. https://arxiv.org/abs/1909.03683.

21. Niu, Silva. Crowdsourced data mining for urban activity: review of data sources, applications, and methods. J Urban Plann Dev 2020; 146: 04020007.

22. Tang, Ritchwood, et al. Crowdsourcing to improve HIV and sexual health outcomes: a scoping review. Curr HIV AIDS Rep 2019; 16: 270–278. DOI: 10.1007/s11904-019-00448-3.

23. National Institutes of Health. About. https://www.allofus.nih.gov/about (2021, accessed April 5 2023).

24. Ranard, Meisel, et al. Crowdsourcing: harnessing the masses to advance health and medicine, a systematic review. J Gen Intern Med 2014; 29: 187–203.

25. Lakkaraju, Kleinberg, Leskovec, et al. The selective labels problem: evaluating algorithmic predictions in the presence of unobservables. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 275–284.

26. Wang, Kaushal, Khullar. Should health care demand interpretable artificial intelligence or accept “black box” medicine? Ann Intern Med 2020; 172: 59–60.

27. Wang, Huang, Jasin, et al. Algorithmic transparency with strategic users. Management Science 2020; 69.

28. Buolamwini, Gebru. Gender shades: intersectional accuracy disparities in commercial gender classification. Conference on fairness, accountability and transparency. Cambridge, MA: PMLR, 2018, pp. 77–91.

29. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 2019; 1: 206–215.

30. Rudin, Radin. Why are we using black box models in AI when we don’t need to? A lesson from an explainable AI competition. Harvard Data Science Review 2019; 1(10): 1162.

31. National Center for Advancing Translational Sciences (NCATS). Bias detection tools in health care challenge. https://ncats.nih.gov/funding/challenges/bias-detection-tools-in-health-care (2022, accessed February 22 2023).

32. Langer, Oster, Speith, et al. What do we want from Explainable Artificial Intelligence (XAI)? A stakeholder perspective on XAI and a conceptual model guiding interdisciplinary XAI research. Artif Intell 2021; 296: 103473.

33. Lockey, Gillespie, Holm, et al. A review of trust in artificial intelligence: challenges, vulnerabilities and future directions. Honolulu, Hawaii: Hawaii International Conference on System Sciences, 2021.

34. Kim, Roberson. I’m biased and so are you. What should organizations do? A review of organizational implicit-bias training programs. Consulting Psychology Journal 2022; 74: 19–39.

35. Gill, Zhou, Greely, et al. Longitudinal outcomes one year following implicit bias training in medical students. Med Teach 2022; 44: 744–751.

36. Sun, Harris. Reducing bias in AI technology. HIMSS 2022; 2022.

37. U.S. Food and Drug Administration. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD): discussion paper and request for feedback. Silver Spring, MD: U.S. Food and Drug Administration, 2019.

38. Yank, Agarwal, Loftus, et al. Crowdsourced health data: comparability to a US national survey, 2013–2015. Am J Publ Health 2017; 107: 1283–1289.

39. Kurniansyah, Goodman, Khan, et al. Evaluating the use of blood pressure polygenic risk scores across race/ethnic background groups. Nat Commun 2023; 14: 3202.

40. Gilmore, Lee, Schmidt, et al. Antibiotic prescribing by age, sex, race, and ethnicity for patients admitted to the hospital with community-acquired bacterial pneumonia (CABP) in the All of Us database. J Clin Transl Sci 2023; 7: e132.

41. Hill, Colón-López. Delays in care by race, ethnicity, and gender before and during the COVID-19 pandemic using cross-sectional data from National Institutes of Health’s All of Us Research Program. Wom Health Issues 2024; 34(4): 391–400.

42. Adam, Chang C-HK, Haibe-Kains, et al. Hidden risks of machine learning applied to healthcare: unintended feedback loops between models and future data causing model degradation. Machine Learning for Healthcare Conference. PMLR, 2020, pp. 710–731.

43. Weissler, Naumann, Andersson, et al. The role of machine learning in clinical research: transforming the future of evidence generation. Trials 2021; 22: 537. DOI: 10.1186/s13063-021-05489-x.

44. Buiten. Towards intelligent regulation of artificial intelligence. Eur J Risk Regul 2019; 10: 41–59.

45. The White House. Blueprint for an AI Bill of Rights: making automated systems work for the American people. https://www.whitehouse.gov/ostp/ai-bill-of-rights/ (2023, accessed May 26 2023).

46. News: European Parliament. AI Act: a step closer to the first rules on Artificial Intelligence. https://www.europarl.europa.eu/news/en/press-room/20230505IPR84904/ai-act-a-step-closer-to-the-first-rules-on-artificial-intelligence (2023, accessed June 13 2023).

47. Slokenberga. EU regulatory responses to medical machine learning in paediatric care: a missed opportunity to overcome a therapeutic gap? Uppsala, Sweden: Uppsala University, 2022.

48. Gruson, Helleputte, Rousseau, et al. Data science, artificial intelligence, and machine learning: opportunities for laboratory medicine and the value of positive regulation. Clin Biochem 2019; 69: 1–7.

49. Matheny, Israni, Ahmed, et al. Artificial Intelligence in Health Care: the hope, the hype, the promise, the peril. National Academies of Sciences, 2022.

50. De Laat. Algorithmic decision-making based on machine learning from big data: can transparency restore accountability? Philos Technol 2018; 31: 525–541.

51. Barocas, Hardt, Narayanan. Fairness and machine learning: Limitations and opportunities, 2019, p. 3. Available at: https://fairmlbook.org/.

52. Raji, Buolamwini. Actionable auditing: investigating the impact of publicly naming biased performance results of commercial AI products. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 429–435.

53. Wang. Performance and biases of Large Language Models in public opinion simulation. Humanit Soc Sci Commun 2024; 11: 1095. DOI: 10.1057/s41599-024-03609-x.

54. Mozur, Liu, Metz. China’s rush to dominate A.I. comes with a twist: it depends on U.S. technology. New York, NY: New York Times, 2024.

55. Meltzer. The US government should regulate AI if it wants to lead on international AI governance. https://www.brookings.edu/blog/up-front/2023/05/22/the-us-government-should-regulate-ai/ (2023, accessed June 13 2023).

56. Metz, Kang, Frenkel, et al. How tech giants cut corners to harvest data for A.I. New York, NY: New York Times, 2024.

57. Braiek, Khomh. On testing machine learning programs. J Syst Software 2020; 164: 110542.

58. Bellamy RKE, Dey, Hind, et al. AI fairness 360: an extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. https://arxiv.org/abs/1810.01943 (2018, accessed July 7 2023).

59. Obermeyer, Nissan, Stern, et al. Algorithmic bias playbook. Chicago: Center for Applied AI at Chicago Booth, 2021.

60. Eaneff, Obermeyer, Butte. The case for algorithmic stewardship for artificial intelligence and machine learning technologies. JAMA 2020; 324: 1397–1398.

61. Haas. Play 1. Enable diverse and multi-disciplinary teams working on algorithms and AI systems. https://haas.berkeley.edu/wp-content/uploads/EGAL_Playbook_Play1_Teams.pdf (2023, accessed July 20 2023).

62. Garces. Understanding the impact of affirmative action bans in different graduate fields of study. Am Educ Res J 2013; 50: 251–284.

63. West, Whittaker, Crawford. Discriminating systems: gender, race, and power in AI. New York, NY: AI Now Institute, 2019.

Reducing bias in healthcare artificial intelligence: A white paper
