Abstract
Drawing on ethnographic, interview, and textual data from researchers creating machine learning solutions for health care, the author explains how researchers justify their projects while grappling with uncertainties about the benefits and harms of machine learning. Researchers differentiate between a hypothesized world of machine learning and a “real” world of clinical practice. Each world relates to distinct frameworks for describing, evaluating, and reconciling uncertainties. In the hypothesized world, impacts are hypothetical. They can be operationalized, controlled, and computed as bias and fairness. In the real world, impacts address patient outcomes in clinical settings. Real impacts are chaotic and uncontrolled and relate to complex issues of equity. To address real-world uncertainties, researchers collaborate closely with clinicians, who explain real-world implications, and participate in data generation projects to improve clinical datasets. Through these collaborations, researchers expand ethical discussions while delegating moral responsibility to clinicians and medical infrastructure. This preserves the legitimacy of machine learning as a pure, technical domain while allowing engagement with health care impacts. This article contributes an explanation of the interplay between technical and moral boundaries in shaping ethical dilemmas and responsibilities and highlights the significance of collaboration in artificial intelligence projects for ethical engagement.
For every complex problem there is an answer that is clear, simple, and wrong. (H. L. Mencken)
I step into Professor Gould’s office, standing opposite a bookshelf and long desk pushed underneath a window that stretches the length of the far wall. To the right is a gray rocking chair angled toward a gray cushioned love seat complete with a yellow throw blanket. In the center, there is a circular table with three metal chairs facing a clean whiteboard. I ask Professor Gould where the best place is for me to sit. “Well,” he begins, “are we talking data,” he points to the round table and chairs, “or are we talking life,” he motions toward the gray couch.
Professor Gould, a computer scientist, works on projects that apply machine learning techniques to health care. I spoke with him as part of research on interdisciplinary collaborations creating machine learning (a subfield of artificial intelligence [AI]) models for health care. These researchers primarily have technical backgrounds in computer science, data science, and engineering, or clinical backgrounds. They often work in interdisciplinary teams across academic and medical settings. Using large-scale clinical or patient datasets, they train algorithms that perform a range of tasks such as predicting diagnoses, assigning risk scores, autocompleting doctors’ notes, modeling biomolecular processes, and triaging patients between hospital departments.
Although such machine learning applications are promoted as promissory tools to address health care challenges, recent work documents unintended harms. For example, machine learning algorithms have been shown to perpetuate disparate treatment outcomes for marginalized groups and reinforce biased stereotypes (Obermeyer et al. 2019; Zack et al. 2023). Social science scholarship, media, and policy groups have been instrumental in uncovering and reporting adverse outcomes of AI systems (Benjamin 2019; Committee on Responsible Computing Research: Ethics and Governance of Computing Research and Its Applications et al. 2022; Eubanks 2017; Levi and Gorenstein 2023; Noble 2018). Researchers who develop models in health care express concern about potential risks such as the reproduction of biases and patient data security (Chen et al. 2021; McCradden et al. 2020; Shachar and Gerke 2023). As AI in health care is a multibillion-dollar industry that will affect millions of patients (Bohr and Memarzadeh 2020), consideration of AI systems and their impacts needs to be prioritized.
In this article I address how researchers developing machine learning models for health care justify the ethical implications of their work while grappling with uncertainties about machine learning benefits and harms. Past scholarship revealed researchers and technologists to be unconcerned with ethical impacts (Burrell and Fourcade 2021; Cech 2014; Forsythe 1993) and constrained by corporate interests (Ali et al. 2023; Metcalf, Moss, and boyd 2019); more recent literature focuses on how technologists understand ethics (Avnoon, Kotliar, and Rivnai-Bahir 2023; Orr and Davis 2020; Wehrens et al. 2023). Scholarship on technical collaborations has often focused on interpersonal relationships, communication, and achieving success in corporate settings (Mao et al. 2019; Passi and Jackson 2018; Zhang, Muller, and Wang 2020), rather than project justifications, ethical engagement, and responsibilities aimed at societal benefit. In contrast, this study is focused on the relationship between ethical dilemmas, moral boundaries, and collaboration. I draw primarily on interviews with technical and clinical collaborators involved in creating machine learning models for health care, as well as participant observation of three interdisciplinary groups, observations of other meetings, conferences, workshops, and events, and textual analysis to contextualize the work of participants.
Professor Gould’s distinction between “talking data” in the metal chairs versus “talking life” on the gray couch illuminates a primary way in which researchers justify the ethical implications of machine learning projects. I find that researchers distinguish between a hypothesized world of machine learning development and a “real” world of clinical settings. Each world relates to different frameworks to describe, evaluate, and reconcile uncertainties between benefits and harms. When researchers address uncertain “real” impacts, they seek to align hypothetical and applied aspects of work through two processes: (1) collaborating closely with clinicians who access and translate the real world, and (2) participating in data generation and curation projects to create more accurate and representative datasets. Alignment between the two worlds expands ethical discussions, while delegating responsibility to adjudicate between conflicting moral values to clinicians and medical infrastructure.
This study contributes to the burgeoning field of the sociology of AI by presenting a model for understanding the interplay of ethics and morality in AI/machine learning research and by emphasizing the role of collaboration in addressing ethical considerations. I explain the relationship between technical and moral boundaries. Boundaries between the hypothetical and the “real” ultimately define the contents of ethical debate, delegate moral responsibilities, and serve as a mechanism to reconcile uncertainties. These boundaries shape the types of ethics work researchers do, the questions that they consider, the assumptions that they bring to their work, and their methods of implementing ethics. By providing a framework to identify the consequential stakes of defining what is “real” in debates about AI and applied scientific solutions, the article explains how actors come to define and prioritize technology benefits and harms. Additionally, the research highlights the significance of collaboration for AI ethics and morality in science. Collaboration bridges moral and technical boundaries to expand ethical engagement and integrate ethical responsibility into the structure of AI projects. As calls to increase the stakeholders involved in AI research grow, it is necessary to understand the implications of collaboration for moral discussions, boundaries, responsibilities, and impacts.
The Sociology of AI and Ethics
Early social science scholarship on science, technology, and engineering portrayed technical researchers as agnostic or oblivious to the social and ethical implications of their work (Burrell and Fourcade 2021; Cech 2014; Forsythe 1993; Joyce et al. 2018, 2021). Case studies from across domains of science exhibited strong boundaries between science and society that upheld beliefs in the purity and objectivity of scientific pursuits (Gieryn 1995; Latour 1999; Shapin and Schaffer 1985). Similarly, computer scientists expressed discomfort with the uncontrollable nature of social interaction and excluded social concerns from their domain (Forsythe 1993). Maintaining a controllable environment to prove facts remained critical to legitimizing scientific advances (Latour 1999; Shapin and Schaffer 1985).
More recently, communities of AI and machine learning practitioners have begun to grapple explicitly with social impacts, ethics, and instances of AI harms (Bender et al. 2021; Selbst et al. 2019). For example, a group of technical researchers uncovered an AI health care algorithm that underreported the severity of risk for Black patients compared with white patients, leaving them significantly sicker and allocating fewer resources to their treatment (Obermeyer et al. 2019). Technologists have initiated efforts to lobby for what they see as more ethical standards (i.e., Collins et al. 2021; Ganapathi et al. 2022). Within both academic and corporate communities, scholars seek to encourage the evaluation of the broader impacts of new tools (Gebru et al. 2021; Mitchell et al. 2019; Wang et al. 2022). Recent reviews in top computer science and medical journals include articles such as “Governing AI Safety through Independent Audits” (Falco et al. 2021) and “Ethical Machine Learning in Healthcare” (Chen et al. 2021).
Social scientists concerned with technology impacts and ethics have studied researchers’ and developers’ interpretations of and practices to realize moral ideals. Technologists rely on a variety of frames to interpret moral aspects of their work (Avnoon et al. 2023; Sloane and Zakrzewski 2022; Wehrens et al. 2023) and perform ethics through mundane decision-making processes (Jaton 2020; Seaver 2021; Tanweer 2022; Ziewitz 2019). Through actions and interpretations, they prioritize ethical values that shape all aspects of bringing a technology into being, including topics of group conversation, model contents, and forms of sociality (Amoore 2020; Neyland 2016). However, the majority of this scholarship portrays data scientists and technologists as the “coding elite” (Burrell and Fourcade 2021), working within and constrained by corporate logics (Ali et al. 2023; Metcalf et al. 2019; Rider 2022) and decontextualized from concrete interlocutors and outcomes (Orr and Davis 2020; Widder and Nafus 2023).
In contrast, when developing machine learning solutions for health care, technologists work in interdisciplinary teams across institutional settings. Scholarship from computer-supported cooperative work centers similar collaborative, technical projects; however, this scholarship often focuses on corporate teams and issues of interest alignment, communication, and success evaluation, rather than ethical justification and accountability (Hou and Wang 2017; Mao et al. 2019; Passi and Jackson 2018; Zhang et al. 2020). Increasing the stakeholders involved in technological design is promoted as a method to create more beneficial technologies (Costanza-Chock 2020). However, little is known about the significance of collaboration for moral dilemmas, ethical responsibility, and project justification. Boundaries that divide and distinguish disciplinary groups and spheres have been shown to be important for professional collaboration (Farchi, Dopson, and Ferlie 2023; Weber et al. 2022), scientific legitimacy (Gieryn 1995), and moral order (Lamont and Molnár 2002). What scholarship on AI ethics misses is the relationship between moral frameworks, project justifications, and structures of work as research teams puzzle through, define, and assign responsibility for machine learning impacts.
This article focuses on AI ethics in academic and medical domains, thus outside of pure corporate settings. 1 The interpretation and practice of AI ethics for noncorporate researchers joins a large body of scholarship on morality in science. Scholarship shows how scientists respond to ethical policies and regulations in fields with clearer mandates (Evans and Silbey 2022; Hedgecoe 2014; Smith-Doerr and Vardi 2015; Stark 2012). Other scholarship highlights how scientists incorporate moral expertise as part of their jurisdiction (Evans 2021) and define ethical aspects of their work (Wainwright et al. 2006), as well as the impact of moral debates on research agendas (Dromi and Stabler 2023; Frickel et al. 2010; Thompson 2013) and the creation of new professions with ethical jurisdiction (Evans 2012). In summary, the literature on morality in science largely focuses on the cooption or projection of ethics onto regulatory structures or “ethics” professionals (Evans 2012, 2021; Evans and Silbey 2022; Jasanoff 2005). However, neither of these legitimized moral structures yet exists in the field of AI. The question remains how researchers negotiate technical and moral boundaries within the context of a collaborative, interdisciplinary domain to retain the legitimacy of science, recognize AI harms, and pursue motivations for societal benefit.
Ethics in Action
In line with dominant scholarship, I adopt a pragmatic conceptualization of ethics (Boltanski and Thevenot 1999). Ethics is found in how people negotiate among competing values, reason and make decisions, and define their goals and objectives within specific social contexts (Ananny 2016; Jaton 2021; Seaver 2017). A pragmatic ethical approach contrasts with views of ethics as fixed rules, duties, and mandates (Kant [1785] 1998), or as utilitarian reasoning focused on outcomes and maximizing the good produced by action (Mill 1879). AI ethics is particularly well-suited to a pragmatic approach given the lack of centralized authority, regulations, or professional norms that substantiate “ethical AI” (Ensmenger 2010; Jobin, Ienca, and Vayena 2019). Additionally, there is a noted gap between high-level principles and on-the-ground practices (Morley et al. 2020). Although medicine has developed pathways for creating protocols and guidelines to realize best practices and standardize ethical prescriptions (Timmermans and Berg 1997, 2010), these steps are less clear for AI. For example, without codified guidance, it is uncertain what steps are required for a machine learning project to comply with “patient safety” or “nondiscrimination.” As a result, ethical action requires explicit deliberation, as it has yet to become a normative course of action.
In a pragmatic ethical approach, people mobilize evaluative categories that define “the good” to deliberate courses of action and realize a version of this ideal (Boltanski and Thevenot 1999). In this article, I focus on the categories that actors mobilize to discuss beneficial machine learning projects and strategies to reconcile trade-offs between multiple categories of evaluation in hopes of realizing an understanding of the good. Table 1 presents a nonexhaustive sample of ethical concerns and categories of evaluation that actors mobilize to deliberate courses of action in the field of machine learning for health care. Within this framework, ethics constitutes the categories used to make evaluations, as well as the results of evaluations on social action and arrangements. Oftentimes, scholarship on ethical evaluations focuses on “hot” moments or conflict between incommensurable evaluative criteria (Boltanski and Thevenot 1999; Dromi and Stabler 2023; Stark 2011; Tavory, Prelat, and Ronen 2022). Instead, I focus on strategies used to avoid conflict, manage uncertainty, retain the “purity” of evaluative criteria, and distribute responsibility for ethical judgement.
Table 1. Ethical Concerns and Evaluations.
Data and Methods
This research focuses on interdisciplinary teams creating machine learning models for health care as a subfield of AI research and development. AI encompasses the development and deployment of computing infrastructures that seek to mimic or incorporate elements of human reasoning. Machine learning is a subfield of AI that constitutes the training of computer systems to analyze and draw inferences from patterns in data. AI is increasingly consequential for health care, assisting tasks such as predicting treatment outcomes, diagnosing diseases, automating documentation and billing, assisting in surgeries, and discovering drugs. Machine learning is often used for optimization, prediction, object recognition, natural language processing, and text comprehension. For example, machine learning models are being used to predict risk for breast cancer by automatically analyzing mammograms (Yala et al. 2022). Risk predictions then inform diagnosis and treatment decisions.
Data for this project include (1) interviews with participants in interdisciplinary groups engaged in machine learning for health care research (see Appendix A for descriptive statistics); (2) in-depth participant observation with three project groups; (3) observations from other meetings, workshops, conferences, and events; and (4) textual analysis of materials related to each participant (see Appendix B). Table 2 presents a summary of data modalities.
Table 2. Data Modalities.
Note: AI = artificial intelligence; CV = curriculum vitae; ML = machine learning.
Respondents were identified using a multipronged approach. One method of recruitment sampled researchers engaged with any of three prominent conferences on machine learning and health that target stakeholders from technical and clinical fields. Another method relied on digests and news updates from medical and research institutions that work in the area of AI and health care. Third, respondents were recruited through department- and web-specific Google searches. Last, respondents were recruited through snowball sampling, particularly among collaborators on the same project. Respondents and labs qualified for the study if they described their work as using machine learning methods for health care application, worked on at least one machine learning and health care project, or published papers on a similar theme. For example, a lab that publishes research on creating an algorithm to predict cardiac arrest among intensive care unit (ICU) patients qualifies for the study. Similarly, a lab that describes its work as “augmenting medical decision making through machine learning” also qualifies for the study. Respondents were recruited through e-mail and confirmed qualification. 2 All members of a collaboration or lab were invited to participate including principal investigators, research associates, postdocs, graduate students, medical fellows, physicians, interns, and administrative staff members (Appendix A). 3
Interviews lasted between 30 and 90 minutes and focused on professionals’ career trajectories, goals, project workflows, roles in relation to collaborators, views on ideal collaboration, and prominent topics under the banner of ethical AI (e.g., bias, fairness, privacy, regulation, risks). To gain further insight into the composition of labs and the professional backgrounds of members, and to contextualize projects and discussions, a corpus of textual data was collected from the web (see Appendix B for details). Together, interview transcripts and textual data compose a profile for each individual and group that represents the structure of labs and collaborators, project details, and broader participation and views on the field of machine learning and health care.
Ethnographic observations occurred in labs located in one metropolitan area, recruited through the methods previously described, whereby the principal investigator invited me to attend a group meeting. Three core groups were selected for extended observation. A comparison of each group is not the subject of this article. With each group, I spent one day a week attending project meetings, as well as attending special events and socials. Observations at other events, including data hackathons, 4 conferences, lectures, and webinars, provided further opportunities to observe interactions between collaborators, discussions about roles and responsibilities, and the potentials and risks of machine learning.
Data were analyzed inductively. After each interview or event, I wrote a memo outlining major themes and reactions. Through the memos, I identified consistent distinctions between the world of technological development and the “real” world of clinical practice. I was often struck by the addition of the word “real” to describe entities such as “real-world hospital data.” From this observation, I conducted word searches to isolate data segments that related to “real” aspects of work compared with “imagined” or “hypothetical” aspects. I read and annotated isolated segments to understand the meanings of reality compared with imagined or hypothetical work. Last, I mapped identified segments to professional profiles, research focuses, and project structures. Conclusions are drawn from the ways in which researchers use themes of reality in connection to specific research agendas, workflows, and role definitions. 5
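For readers who want the mechanics, a minimal sketch of the kind of word search described above follows, assuming plain-text interview transcripts in a local directory; the directory name, keyword pattern, and context window are hypothetical illustrations rather than the study’s actual instruments.

```python
import re
from pathlib import Path

# Hypothetical pattern for talk about the "real" versus the "imagined"/"hypothetical."
KEYWORDS = re.compile(r"\b(real|reality|imagin\w*|hypothetic\w*)\b", re.IGNORECASE)

def isolate_segments(transcript_dir="transcripts", window=1):
    """Return sentences containing target words, along with surrounding
    sentences, so each segment can be read and annotated in context."""
    segments = []
    for path in Path(transcript_dir).glob("*.txt"):
        sentences = re.split(r"(?<=[.!?])\s+", path.read_text())
        for i, sentence in enumerate(sentences):
            if KEYWORDS.search(sentence):
                lo, hi = max(0, i - window), i + window + 1
                segments.append({"file": path.name,
                                 "match": sentence,
                                 "context": " ".join(sentences[lo:hi])})
    return segments
```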
Results
Researchers distinguish between a hypothesized world of machine learning and a “real” world of clinical practice, each with different frameworks to describe, evaluate, and reconcile ethical uncertainties (Figure 1). In the hypothesized machine learning world, solutions are hypothetical. Within this framework, researchers are content with imagining future impacts. They can control and compute hypothetical impacts as definitive metrics that represent algorithmic bias and fairness. In contrast, in the real world, solutions relate to concrete patient outcomes in clinical settings. When considering real impacts, researchers are concerned with how AI tools will integrate within health systems and affect patient well-being and clinical workflows. Impacts within clinical settings are chaotic and uncontrolled and cannot be computed as summary metrics. Instead, these impacts relate to complex issues of equity.
Figure 1. Social impacts in the hypothesized world of machine learning and the real world of clinical practice.
To address more uncertain “real” impacts, researchers seek to align the two worlds by collaborating with clinicians and participating in patient data generation and curation projects geared toward representation. Through collaboration, researchers bridge technical and moral boundaries, which concretizes and assigns to clinicians the responsibility of defining the impacts of machine learning. Distinctions between the hypothesized and the real help resolve ethical uncertainties between benefits and risks: they define relevant ethical frameworks that maintain a controllable world to legitimate machine learning, while collaboration carries ethical judgements in an uncertain world outside machine learning researchers’ full responsibility.
Distinctions between the Hypothesized World of Machine Learning and the “Real” World of Clinical Practice
A research group based in a large medical facility had been working on building a machine learning large language model for a few months. The model is intended to help with a variety of clinical tasks, such as answering medical questions, searching for diagnoses, and summarizing labs. In a group check-in, the lead researcher, Dr. Ryder, began the meeting with an update. He explained, “[The computer science intern] has been working with the server and we’ve got a real, I shouldn’t say real, but we’ve got, you know, some forward progress in terms of the approach to what we’re building.” The team, excited about the progress, focuses on organizing the next phase of the project for the remainder of the meeting. They create an agenda and assign programming tasks but postpone questions about “end users” for when the product is ready to be implemented into the hospital setting. Although Dr. Ryder initially describes progress on the model as “real,” he retracted the statement. The model still exists in the realm of creation. Dr. Ryder does not have any concrete or tangible impacts to share, and the model is not integrated within the hospital infrastructure or used by clinicians. The model is progressing, but it is not yet real. Similar common phrases used in interviews, papers, and conference sessions include “real-world hospital data,” “real-world impacts of AI in health,” “running models in real-time clinical settings,” and the “reality of the hospital.” The “real” world associates with medical infrastructure, personnel, and settings.
In contrast, machine learning models associate with the realm of the imagined. Mr. Long, a senior health executive, explains: We understand that tech improvements are there, but the speed at which they are evolving stretches human imagination. We have to get to an area that says what types of use cases are relevant today. Is [the technology] something that we should actually pull into the [health] system or not? . . . when [the technology] gets closer to an area that we are intimately familiar with, our own health and decisions that are made with a physician in an exam room, a clinical decision, maybe we get a little bit more uncomfortable.
As Mr. Long explains, machine learning models exist “there,” separate from health systems. The “there” of machine learning keeps evolving, in the realm of imagination. In contrast, the health system is “an area that we are intimately familiar with” such as “a physician in an exam room.” Machine learning models could be pulled into these familiar settings, but function as distinct entities in their own realm coming from “there.” Given the current state of research and development, the difference between imagined machine learning solutions and clinical settings makes Mr. Long uncomfortable about the prospect of their integration. Continual discrepancies separate the real world of clinical settings and the hypothesized world of machine learning.
Impacts in the Hypothesized and “Real” World
Hypothesized Machine Learning Impacts
Distinctions between the real and hypothetical worlds uphold technical and moral boundaries that narrow relevant contingencies and uncertainties to consider when describing and justifying machine learning impacts. In the hypothesized world of machine learning, respondents define impacts as hypothetical. Participants describe “imagining” the benefits of solutions. Carly, a research associate in a biomedical informatics lab, works on projects such as creating models to interpret radiographs and combine data modalities (e.g., radiographs and genomics data) to aid clinical decision making. She connects her work to outcomes such as “improving diagnosis and prognosis,” “the responsible use of AI,” and “validation and bias detection” but does not work on projects at the “implementation stage” in clinical settings. She explains, “My overall goal is to be working on problems that I can at least imagine what the impact would be. . . . I’m comfortable with being a couple of degrees removed from the final impact.” By justifying her impacts hypothetically outside of clinical settings, Carly can imagine the benefits of her tools without considering additional questions that arise during use or application.
Imagined impacts can be controlled and neatly and definitively evaluated and presented. This allows machine learning projects to be summarized as technical papers and contribute to a scientific discipline, justifying impacts as valuable, even if hypothetical. Jim, a graduate student, describes the publication process in relation to releasing code and software, which would make the research more accessible and practical for medical personnel: What we typically do is examine the currently existing methods to see what is still lacking, and then try to solve the problem. We find a test [data]set to test on that, find a metric to evaluate if [our model is] better. We will evaluate the algorithm on [a], [b] and [c]. And if all of that works, then it’s already a publishable paper . . . but we still need to engineer the software to make it more user friendly to solve the bugs people are reporting . . . there’s typically a lot of new problems that come up . . . for research, you don’t have to be like 100% robust, you just need it to work enough so that you can test it yourself and get some results for a paper. But to release it for other people there are going to be a lot of other problems.
As Jim explains, publishing solutions as a technical paper is simpler than making software and data available for use. He admits that for publication, research does not need to be “100% robust.” It just has to work well enough to compute a result framed as significant. Clear, controlled steps, such as evaluating the algorithm against existing benchmarks, create a stable pathway to publication and justification of value. However, making the code usable to implement or test requires further work. By differentiating hypothetical research for publication from the necessary steps for application, Jim upholds technical boundaries, limiting the relevance of additional ethical questions as long as he remains in the hypothetical machine learning world.
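As a rough illustration of the controlled pathway Jim describes, consider the following sketch of a benchmark evaluation loop; the majority-class baseline, the accuracy metric, and the placeholder test sets are assumptions standing in for whatever benchmarks a given paper uses.

```python
import numpy as np

class MajorityBaseline:
    """Predicts the most common training label; a simple reference point."""
    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

def accuracy(y_true, y_pred):
    return float((np.asarray(y_true) == np.asarray(y_pred)).mean())

def benchmark(model, baseline, test_sets):
    """Evaluate a model against a baseline on each held-out test set,
    the controlled comparison that anchors a publishable result."""
    return {
        name: {"model": accuracy(y, model.predict(X)),
               "baseline": accuracy(y, baseline.predict(X))}
        for name, (X, y) in test_sets.items()
    }
```

Passing such a comparison on test sets [a], [b], and [c] can justify a paper, even though, as Jim notes, none of it guarantees the software is robust enough for release.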
Last, machine learning impacts can be computed, often as a metric for bias or fairness. Although participants’ colloquial and formulaic definitions of bias and fairness vary, and the two are often used interchangeably, typically bias refers to instances when some subgroups receive less optimal outcomes compared with others, and fairness refers to equalizing optimization for all subgroups. At a computer science and health care conference, the top paper in the track “Impact and Society” was a paper on “fair risk prediction.” When presenting the paper, Dr. Frise, the lead author, provided background: Just to give an overview of fair machine learning there are two steps, first we want to define some measure of fairness and . . . then we are going to use an algorithm to try to optimize the model to satisfy that measure.
As the paper demonstrated, fairness within the bounds of the hypothesized machine learning world is measured as a metric that can be optimized through calculation.
Similarly, bias can be statistically evaluated. Dr. Volger, an MD with a background in biomedical engineering, works on a variety of machine learning projects such as creating models to customize and optimize patient postoperative treatment plans. He advises evaluating models to check for bias by comparing the statistical distribution of patient demographics and outcomes. Some of the factors he would use to evaluate a model include ensuring that the model is “actually relevant across multiple groups” on the basis of the “demographic of the setting” and “looking whether our prediction actually was consistent with what actually was happening over time.” In this instance, model bias is calculated as disparities in accuracy between demographic groups in different settings and over time. Bias exists as a computable metric that represents an aspect of model performance between discrete population groups that can be neatly summarized. Differentiating hypothetical impacts creates an imagined scenario in which outcomes can be statistically evaluated, computed, and neatly presented, reducing uncertainties and justifying machine learning benefits. Within the hypothetical world, boundaries between the technical and the moral limit the relevance of additional risks or ethical questions.
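To make concrete how these frameworks render impacts computable, the following is a minimal sketch of subgroup evaluation in the sense respondents describe, with bias as gaps in accuracy across demographic groups and fairness as a demographic parity gap that an algorithm could then try to minimize; the predictions and group labels are hypothetical illustrations, not material from any participant’s project.

```python
import numpy as np

def subgroup_report(y_true, y_pred, groups):
    """Summarize model performance per demographic group, plus two gap
    metrics: an accuracy gap ("bias") and a demographic parity gap
    (one common operationalization of "fairness")."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        report[g] = {"accuracy": float((y_true[mask] == y_pred[mask]).mean()),
                     "positive_rate": float(y_pred[mask].mean())}
    accuracies = [m["accuracy"] for m in report.values()]
    positive_rates = [m["positive_rate"] for m in report.values()]
    report["accuracy_gap"] = max(accuracies) - min(accuracies)
    report["demographic_parity_gap"] = max(positive_rates) - min(positive_rates)
    return report

# Hypothetical toy data: predictions for two demographic groups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(subgroup_report(y_true, y_pred, groups))
```

Such a report neatly summarizes a model’s “impact” as a handful of numbers, which is precisely what makes it tractable within the hypothesized world.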
Real-World Impacts
In contrast to the hypothetical, controllable, and calculable nature of machine learning impacts, real-world impacts consider clinical use and integration and grapple with the uncontrollability of health care settings and complex issues of equity. Thus, real impacts require an expanded moral framework.
Dr. Smith, a lead researcher in a hospital group, gave a presentation about his work on machine learning language models to clinicians at his institution. He shows examples of how the model can help summarize labs and suggest treatment options. In discussing his goals to “get [the machine learning model] to the bedside,” Dr. Smith elaborates: this is an exact sort of rubber meets the road example of where these language models can affect real impact at the bedside. . . . What we’re trying to build here is not just a research tool. We are trying to make it as valuable as possible in the clinical standpoint.
Dr. Smith defines project value on the basis of “real impact,” which occurs “at the bedside,” meaning at the point of clinical care used by medical professionals. Impacts must be tangible in the hospital environment and the tool must be in the hands of clinicians.
However, real hospital and clinical settings are chaotic and difficult to evaluate, requiring more uncertain moral deliberation in contrast to controllable machine learning impacts. Speaking at a conference, Professor Bennett, a prominent entrepreneur and professor of machine learning and health care, admits that “papers always look nicer than the reality of the real world. The reality of the real world is really complicated.” Similarly, Dr. Carlyle, a researcher with an MD and PhD in computer science, compares the complexity of impacts in clinical settings versus publishing in machine learning outlets: I often saw machine learning researchers that started out with a nice goal, but then . . . I had the situation of explaining to them the complexity of a real medical problem . . . people don’t really know about the complexity of problems that can occur. Medicine and machine learning often boils down to some predefined tasks that people work on and optimize and then after a couple of years, they realize oh, we optimized this task, I want to translate it, but it doesn’t work.
“Real medical problems” and the “reality of the real world” are “complicated” and “complex.” As a result, controlled and computed machine learning solutions currently do not adequately grapple with this complexity, and thus they do not translate between the two worlds. The “real” world requires asking a less defined and more uncertain set of ethical questions that transcend technical boundaries.
The complexity of the real world leads to consideration of equity issues that evade calculation and directly contrast with the computable nature of bias and fairness metrics. Although equity can be variably defined, respondents typically use it to refer to goals of eliminating disparities and improving health for all groups by considering health and social system outcomes in which machine learning models play a part. Professor Nittle in biomedical informatics describes tensions between fairness and equity: the question is always going to be a tradeoff between pure model performance and equity . . . is it actually fair to potentially penalize a group that the model works well on for the sake of equity and another group? It’s like, we just won’t run this model in this group because it doesn’t work well for them. Is that a fair thing to do?
Professor Nittle feels unsure about how to achieve equity given the uncertainty and complexity of impacts on different populations. Similarly, Professor Jefferies, also in biomedical informatics, critiques the use of performance metrics in contrast to a focus on equity: I think of challenges of mostly equity. . . . I think of it as will this tool even work for certain kinds of people? Should I even bother to build such a tool? . . . I have these conversations a lot with people in machine learning who operationalize all of these values specifically through performance metrics, but then end up realizing that that doesn’t actually translate to a real health outcome.
Achieving a “real health outcome” requires considerations that extend beyond narrow and neat metrics and the technical boundaries of the hypothetical machine learning world, including scrutinizing the entire research and development process from the first steps of project conception. A focus on equity entails expanding relevant ethical questions to consider holistic processes of health care and model development; however, this leaves machine learning researchers with less certain outcomes and answers.
Alignment between the Hypothesized World of Machine Learning and the “Real” World of Clinical Practice
When faced with uncertainty in “real” world benefits and harms, researchers seek to structure projects and collaborations to align the two worlds and resolve uncertainties. This occurs through two processes: working closely with practicing clinicians and participating in patient data generation projects that seek to promote representation. Through these processes, collaboration and clinical consideration bridge technical and moral boundaries and come to stand in for and define “real” world ethical engagement.
Clinical Collaboration
Researchers seek to align the hypothetical and “real” world by closely working with practicing clinicians. Clinicians serve as gatekeepers of health care settings and resources and are seen as responsible for health care outcomes. As Lucas, a computer science PhD student, says, “what’s really important if you want to actually take it to the real world is . . . doctors.” Practicing doctors directly interact with patients in medical settings. They are critical for implementing tools in clinical practice. Furthermore, the association of their expertise with the clinical setting makes them responsible and accountable for patient outcomes within health care. Making machine learning models “real” requires clinical collaboration.
Clinicians’ privileged access to the real world tasks them with discretion over judgements about the benefits and value of machine learning technologies. Research teams consistently describe instances of deferring to the perspective of clinicians when making ambiguous “real” world value decisions. Carl, a joint MD-PhD student, describes trade-offs in his work on developing diagnostic and prognostic algorithms: What is really challenging is the fact that there is no perfect [model] outcome that captures everything that we want . . . people will define the outcome in different ways and reach different conclusions . . . especially for topics around race and around bias and around equity.
The lack of clear ways to make value judgements about what constitutes an acceptable machine learning outcome leads to uncertainty. When asked how he manages trade-offs between value frames or definitions of success, Carl explains, We have generally talked with clinicians about which [outcome] they end up relying on most, and then using that as the basis. For example, with heart function, you can measure the accuracy of the equation itself, how well it predicts, you know, just like mean squared error or something like that. You could measure whether it’s within a certain range of outcomes, like the differences and how many people get diagnoses or how much people get paid, or where they end up on the transplant waiting list. Generally, clinicians will tell you that they care very little about the first one and they care very much about the last ones, who gets what thing and how resources are allocated. How do clinical thresholds change? So, we’ve been [leaning] towards that.
Carl relies on the outcomes that clinicians deem most valuable to structure his projects. Machine learning models are valuable if they provide “clinical utility” as determined by collaborating doctors. What doctors declare most useful maximizes real-world impact. Through collaborations, machine learning perspectives remain isolated from responsibility for ethical judgements while engaging with broader ethical questions.
Practicing clinicians also translate the meaning of data to provide clinical context, better align the hypothesized and “real” worlds, and explain implications for patients and clinical settings. Liam, a PhD student in computer science, explains that doctors, as domain experts, “understand the data that they have, where it comes from, how it’s measured.” Observations of interdisciplinary project meetings clarify the role of clinicians in translating the meaning of data as a method of bridging technical and moral problem solving. At a lab meeting in a medical engineering program, Matt, a visiting student in bioengineering, presented research on predicting treatment outcomes in an ICU using a local hospital database. Matt showed a list of variables used in the models. Dr. Crofter, a senior researcher in the group and an ICU doctor, interrupted the presentation. He stated that red blood cell count, one of the variables listed, is not a good variable because it can be correlated with some of the treatment outcomes the model intends to predict, such as being put on a ventilator. Dr. Crofter went on to explain that often, being prescribed a stool softener is used in similar models as a random effect. 6 However, stool softeners are actually not good random-effects variables because they can be correlated with outcomes of interest. Oftentimes in the ICU, doctors give people fentanyl, which makes patients constipated and in need of a stool softener. Doctors then put in a standing order for stool softeners, but this does not mean that the patient is actually taking them; because the order is associated with being given fentanyl, it correlates with many serious clinical outcomes. As the most senior practicing clinician in the lab, one of Dr. Crofter’s main roles is to explain the context and meaning of data inputs and variables for technical researchers with less access to and knowledge about clinical practice. Through direct clinical input, groups more closely align machine learning projects with the reality of clinical settings in an effort to understand how data and models more definitively relate to practices, patients, and outcomes.
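A minimal sketch of the screening step implied by Dr. Crofter’s correction follows, assuming a numeric feature matrix and binary outcomes; the variable names, threshold, and toy data are hypothetical, and in practice clinical judgement rather than a correlation cutoff drives such decisions.

```python
import numpy as np

def flag_proxy_variables(X, y, names, threshold=0.3):
    """Flag candidate covariates whose correlation with the outcome suggests
    they proxy for it, as a standing stool-softener order (via fentanyl use)
    can correlate with serious ICU outcomes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    flags = {}
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        flags[name] = {"corr_with_outcome": round(float(r), 3),
                       "needs_clinical_review": bool(abs(r) >= threshold)}
    return flags

# Hypothetical toy data: six patients, two candidate variables.
X = [[1, 140], [1, 150], [1, 135], [0, 120], [0, 118], [0, 125]]
y = [1, 1, 1, 0, 0, 0]
print(flag_proxy_variables(X, y, ["stool_softener_ordered", "systolic_bp"]))
```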
Last, accessing the real world through clinicians includes embodied exposure to the clinical setting, which provides context to understand the implications of and potentially harmful blind spots in machine learning work. Professor Troy, in computer science, recounts the insights she gained by shadowing a physician in her subfield of pediatric neurology: I shadowed one of my physicians I was working with in [neurology]. It was shocking to me, because I’ve been working with [neurology] data for a while, and I just had this thought of, “Oh, I’ve never actually seen an EEG (electroencephalogram) 7 being taken. I’ve been looking at EEG data all day, I should go look at it.” I saw a nurse deleting part of the data and I said, “whoa, whoa, whoa, what’s going on there.” I also saw the system saved, it just connected the two parts. So, if someone is looking at the data later, I was like, “Okay, this is a problem when we’re not looking at the data in the software that they’re using. We’re looking at it from our end, we’re only reading it in as a continuous data stream.”
By witnessing clinical practice, Professor Troy learned that what appears to her to be a continuous process actually is produced by the conjoining of two separate processes of data collection. Discrepancies between how the data were created and how they are viewed in the lab could produce issues by discounting contextual features, such as the temporal structure of a clinical practice. Professor Troy describes the value of her clinical encounters, which inform “what’s the big picture” and help her ask “the right questions.” She bridges her technical research and moral engagement by collaborating closely with clinicians to understand the full context of her work, such as the data and its limitations, and relevant questions for clinical impact. The collaborative context raises questions about the accuracy, benefits, and limits of a model not considered in the “hypothetical” world of machine learning.
Patient Data
A second method to align the two worlds is to move closer to the stage of collection and curation of identifiable patient data in the clinic. Dr. Oliver, a postdoc in computer science, discusses issues in data representation when interviewed on a machine learning podcast. She explains that when you develop a health care algorithm, “questions about fairness and bias” can be “baked in” to the dataset because of “systematic health disparities.” These systemic disparities influence who accesses health care, and thus the population of people represented in clinical datasets, and then the population of people for which machine learning tools become optimized. Dr. Oliver uses an “equity” lens, which requires scrutinizing clinical resources and datasets used in her work.
Researchers shift their focus from model creation to resource generation, bridging technical and moral boundaries with the goal of creating “better data” in support of beneficial health care solutions. As Carl, the MD-PhD student, remarks, “if you just collected more data from underrepresented groups, you could improve [equity] very quickly.” Brian, a senior researcher in computational biomedicine, expresses a similar sentiment: “I think the only real way you can fix all these biases is to have better, more sensitive data.” One way in which he works toward this goal is “trying to make sure that we’re getting datasets, pretty good shared by other organizations around the world.” Part of Brian’s job includes helping run a repository of reviewed clinical datasets submitted by research groups and hospitals. Working on the repository helps achieve the goal of curating more representative patient data to build models that better reflect “real” patient populations while bridging hypothetical machine learning work and real-world considerations.
Dr. Crofter, the researcher and ICU clinician previously mentioned, explains how he now devotes the majority of his work to understanding data and improving data collection to be more representative. He shifted the focus of a course that he teaches to work toward health equity: our course on data science is now primarily, “let’s start off with understanding where is the selection bias, who did not make it to your database? And what is the implication of building an algorithm only on the people who made it to your database?”
By first asking questions about resources, medical infrastructure, and data content in the real world, Dr. Crofter hopes to help machine learning work toward health equity.
Researchers also work to generate more representative and higher quality data by working directly in clinical settings, rather than relying on preprocessed databases. Professor Bentley, who works in a public health department on machine learning problems from prediction to diagnosis and natural language processing, describes a new project in which he is collaborating to build a camera that will provide information about a patient’s blood through reflected light. Patients will be able to use the camera through a personal device. Professor Bentley discusses the importance of directly working with patients to collect these data: “I could try to post hoc fix the data that we have in the EHR (electronic health record), or I could just try and go in directly to measure physiology (bodily functions) to learn directly from that.”
Similarly, Professor Troy, previously mentioned, expressed interest in wearable devices, such as sensors, that generate abundant individual-level data directly from patients. She describes how this can address concerns about bias and representation because the model will learn from the wearer’s own data. She explains, if you’re wearing a wearable device . . . it should replace cohort level data with your own individual data, because you’ve got so much. That is really what I think is central to these ethical issues of data being trained, models being trained on people who are not like you.
To better align machine learning with the “real” world of patients served by health systems and to address systemic disparities, researchers advocate considering alternative methods to acquire patient data. These methods require accessing patients and/or clinical settings directly and devoting attention to medical infrastructure and resources. As such, they bridge technical and moral boundaries as researchers engage in hypothetical machine learning work, while expanding relevant ethical frameworks to carry out beneficial projects.
Ethical Trade-Offs between the Hypothetical and the “Real”
Researchers who separate their work from clinical practice and identifiable patient data uphold technical and moral boundaries, defining “real” impacts and ethical questions as less relevant to their research. They emphasize the distinction between the two worlds, drawing on frameworks from the “hypothetical” world to describe and justify projects. Gavin, a graduate researcher in biochemical engineering, works on machine learning for drug discovery. He explains that “one advantage of working on the preclinical side is, you don’t have to worry too much about that . . . working with cell level data, there’s not too much ethical concern.”
Similarly, Dr. Lester, a principal investigator of a neurobiology lab studying brain processes explains how he does not think issues of ethics, bias, and social impact relate much to his work: There’s a big open question about the ethics of things like AI, and how that impacts society. For us, what we do might be a little bit more granular, directed at particular questions, and applying the tools to derive more information out of big data sets that we couldn’t have otherwise. It’s kind of more towards foundational understanding . . . it could have social impact, but many steps away.
Dr. Lester goes on to describe how he started using computational tools to leverage neurological datasets too big to analyze any other way. In the future, he notes, this will prove challenging: “we’re consuming so much power from the power grid to be able to analyze these data sets, they need to become more efficient.” Although access to expensive data and compute resources could be seen as part of ethical trade-offs inherent to AI projects, the divide between the hypothetical and “real” world limits considerations of other dilemmas as part of the moral justification of machine learning work. As these examples show, understandings of “real” ethical concerns in machine learning associate with evaluations based on “bias,” “fairness,” “representation,” and “subgroup disparities” in the “real” world of clinical practice, where researchers work with identifiable patient characteristics. When researchers perform machine learning and health research but do not directly work with “real” world patient data or engage with clinicians and outcomes in medical settings, they do not define ethics and “real”-world impacts as relevant. Boundaries between the technical and the moral come to define ethical content and frameworks.
Discussion and Conclusion
When developing machine learning models for health care, researchers and developers distinguish between a hypothesized world of machine learning and a “real” world of clinical practice. Distinctions between the hypothetical and the “real” determine relevant frameworks to describe, evaluate, and reconcile ethical uncertainties. In the machine learning world, impacts remain hypothetical, controllable, and computable. In the “real” world, impacts relate to patients and outcomes in clinical settings. Real impacts are uncontrollable, incomputable, and complex. To resolve uncertainties in determining “real” machine learning impacts, researchers seek to align hypothesized and “real” aspects of work. Alignment between the two worlds occurs by centering patient data representation and by leaning on clinical collaborators to explain the value of results and the meaning of data and to provide access to clinical resources. Consequently, clinicians become responsible for defining and concretizing beneficial impacts while machine learning development remains a distinct arena of scientific practice. In the real world, questions about clinical access, data representation, and health system resource allocation come to define ethical engagement in place of questions about machine learning development, harm, and equity.
These results bear implications for considering multidimensional aspects of equity and harm in debates about technology ethics, explaining the relationship between technical and moral boundaries, and the significance of collaboration and expanding project stakeholders for AI and machine learning. Rather than study what algorithms obscure (Burrell 2016; Kiviat 2023; Pasquale 2016), in this article I focus on collaborative and cultural processes that concretize “real” technology impacts. In the case of machine learning and health care, associating the “real” with the realm of clinical practice and medical infrastructure essentializes tangible patient outcomes and clinical resources as the most impactful and beneficial goals of machine learning work. These definitions of the “real” open and foreclose ethical discussion (Jasanoff 2005). For example, definitions of the “real” exclude considerations such as project funding allocation as part of “real” impacts and ethical machine learning (Crawford 2021). Furthermore, privileging clinical resources and outcomes as “real” precludes consideration of cultural processes as fundamental parts of inequity (Lamont and Pierson 2019), such as who is included and able to imagine and evaluate technological solutions. Through an explanation of what is “real” in the context of machine learning collaborations, the article lends a framework to substantiate statements such as “AI for good,” “just data,” or “AI for health equity” by illuminating how people purporting these goals define and access the worlds they seek to benefit.
The paper also contributes an explanation of the relationship between technical and moral boundaries. Boundaries not only separate “science” from “nonscience” (Gieryn 1995) but also define the ways in which actors mobilize ethical categories to connect their actions to outcomes and describe project impacts. In the case of machine learning and health care, boundaries between the technical and the clinical preserve machine learning as an abstracted, controllable, and legitimate arena of scientific practice (Gieryn 1995; Latour 1999; Shapin and Schaffer 1985), while allowing recognition of broader impact in a domain outside of researchers’ responsibility. Machine learning remains a “pure” domain by creating categorical and role distinctions between the hypothesized world of machine learning and the real world of clinical practice.
Last, studying machine learning development within the context of collaboration contributes a model of how interactions and structures of work support ethical justifications and the distribution of moral responsibilities. Collaboration serves as a mechanism that bridges technical and moral boundaries while delegating ethical responsibilities. Within the collaborative context, clinical expertise and experience expand to include defining optimal machine learning outcomes. 8 Clinicians renarrativize abstracted data elements and model outputs (Kiviat 2023). Although clinicians do not become “ethics experts” (Evans 2012), they do become responsible for defining ethical impacts and benefits. Rather than knowledge being translated across domains and responsibilities shared, expertise is differentiated and accountability delegated (Hou and Wang 2017; Mao et al. 2019; Zhang et al. 2020). As a result, clinical expertise is seen as more “social,” reinforcing distinctions between “the social” and “the technical” even as technologists recognize considerations of the risks and impacts of their tools (Bowker and Star 2000). The “technical” can be contextualized within a broader system of practice, allowing researchers to express moral concern and recognize ambitions for broader impact, without shifting moral boundaries or responsibilities. Collaboration comes to stand in for ethical engagement focused on questions of clinical practice and medical infrastructure.
As suggestions to include more stakeholders in AI design, development, and deployment become prominent (i.e., Committee on Responsible Computing Research: Ethics and Governance of Computing Research and Its Applications et al. 2022; U.S. Department of Health and Human Services 2021), future research can adopt a framework of collaborative ethics to explain the responsibilities assumed of each actor and the implications for defining categories of ethical justification. In particular, research can focus on the role of community and patient representatives and social scientists as stakeholders being brought into AI projects, as well as the effects of stakeholder inclusion on model impacts and project trajectories. Future research must consider how the reconfiguration of collaborative teams and stakeholder perspectives may center different pathways for addressing equity and harm, as well as how groups define, evaluate, and imagine problems and solutions. Working toward ethical AI will require not only uncovering AI impacts and harms but also understanding the definitions of impact, harm, and benefit promoted throughout AI projects and collaborations.
Footnotes
Appendix A
Participant Demographics.
| | BA | MA | MPH | MBA | JD | PhD | MD | MD-PhD | Total |
|---|---|---|---|---|---|---|---|---|---|
| Gender | | | | | | | | | |
| Male | 2 | 3 | 0 | 2 | 1 | 19 | 9 | 5 | 41 |
| Female | 1 | 2 | 2 | 2 | 0 | 9 | 3 | 1 | 20 |
| Race | | | | | | | | | |
| White | 2 | 0 | 2 | 2 | 1 | 14 | 4 | 5 | 30 |
| Asian | 1 | 4 | 0 | 1 | 0 | 13 | 5 | 1 | 25 |
| Other a | 0 | 1 | 0 | 1 | 0 | 1 | 3 | 0 | 6 |
| Work experience | | | | | | | | | |
| 0–10 y | 1 | 5 | 2 | 3 | 0 | 13 | 3 | 1 | 28 |
| 11–20 y | 1 | 0 | 0 | 0 | 1 | 8 | 7 | 0 | 17 |
| 21–30 y | 1 | 0 | 0 | 1 | 0 | 7 | 1 | 2 | 12 |
| ≥31 y | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 4 |
| Additional degrees | | | | | | | | | |
| MA b | 0 | 1 | 0 | 2 | 0 | 1 | 7 | 1 | 12 |
| MPH | 0 | 0 | 0 | 1 | 0 | 1 | 6 | 0 | 8 |
a. Other category includes Black, Hispanic, and Middle Eastern/North African.
b. MA is marked if the participant has an MA in a different field than their highest degree.
Appendix B
Textual data include the following:
Acknowledgements
I would like to thank Michèle Lamont, Ya-Wen Lei, Michael Zangler-Tischler, Mira Vale, members of the Inside the Sausage Factory working group at Harvard University, Kelly Joyce, Taylor Cruz, and two reviewers for their thoughtful feedback.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Horowitz Foundation and the Graduate School of Arts and Sciences, the Weatherhead Center, and the Institute of Quantitative Social Science at Harvard University.