Abstract
Data interoperability poses unique ethical challenges across a range of academic, industrial, and governmental implementations of data systems. Central to data interoperability is the design of systems and protocols for exchanging or integrating data from different initial source domains. Data interoperability is often regarded as necessary for carrying out tasks between different organizations and suborganizations as well as for ensuring secondary use of data for research purposes. However, interoperability poses a number of ethical problems whose contours can prove especially challenging in comparison to how ethical harms take hold at other moments of the data life cycle (such as algorithmic processing or results dissemination). Taking biomedical data interoperability as a focal domain, this article provides an overview of data interoperability, maps the central ethical harms that may challenge interoperability projects, and proposes a response to these problems through an approach rooted in philosophical pragmatism. Pragmatist responses to both individual and structural harms of interoperability are presented through three companion strategies: shared standards, manual data curation, and meticulous data documentation.
Introduction
The unprecedented expansion of data collection has introduced a host of advantages in industrial, governmental, and research contexts. In healthcare settings the expanding use of electronic health records and other forms of electronic documentation has increased the rate and quality of information exchange between healthcare providers compared to older physical formats (Kuiler and McNeely, 2018: 164). However, the advantages of digital data formats are not unproblematic. Efficiencies in the storage of digital data increase the possibility of future misuse (Sula, 2016: 20), and algorithmic techniques for sorting and classifying data can lead to the introduction of bias (Fazelpour and Danks, 2021). A third issue concerns problems flowing from the disparate data structures operationalized in separate storage systems or platforms. In settings where separate data systems need to exchange information or functions with each other, differences in data formatting pose problems related to the quality of data exchange, and these quality issues in turn generate novel possibilities for data misuse and bias.
This third issue is that of data interoperability. How data are digitally organized and curated can generate significant ethical harms both at the individual and the structural level.
Interoperability is a particularly pressing concern within healthcare and biomedical domains where both patient care and medical research privilege data longevity and reuse while also posing proximate ethical challenges for the individual and wider populations. For clinical practice, interoperability is a central concern for health information exchange, especially as it is implemented through electronic medical records (EMRs) and electronic health records (EHRs) (Berryman et al., 2013: 85). EMRs and EHRs were specifically designed with interoperability in mind to ensure easy exchange of patient data between different healthcare providers (Berryman et al., 2013: 86). In Europe, standards set by Health Level 7 (HL7) have attempted to create common guidelines for interoperability, but in the United States, the slow and scattered uptake of EHRs in the first decade of the 21st century required additional legislation in the form of the HITECH Act of 2009, which attempted to financially incentivize the adoption of interoperable EHRs among healthcare systems (Berryman et al., 2013). Still, with EHR systems scattered across hospital networks, there remains a lack of standard coding for diseases and other medical terminology. The code “MS” may mean “mitral stenosis,” “multiple sclerosis,” “morphine sulfate,” or “magnesium sulfate” depending upon the system and the particular coder, leading to issues in interpretability, data loss, and inaccuracy (Chute, 2005: 170; Hoffman and Podgurski, 2013: 57). Interoperability harms can also occur where disparate data systems store data in formats of differing resolution, for instance where data need to be merged between a coarse-grained taxonomy and a finer-grained taxonomy. When fine-grained data are passed into coarser-grained databases (or vice versa), each data point needs to be resolved into broader (or narrower) categories.
When operated at scale across populations, these seemingly minor technical issues can quickly add up to inaccurate estimations about individuals within larger groups. This can lead in turn to substantial inequalities.
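The coding ambiguity described above can be made concrete with a small sketch. All of the system names and mappings below are invented for illustration and are not drawn from any real EHR vocabulary; the point is only that an abbreviation like "MS" is meaningless apart from the local codebook of the system that recorded it.

```python
# Hypothetical local codebooks: the same abbreviation resolves differently
# depending on the source system's conventions (all mappings invented).
LOCAL_CODEBOOKS = {
    "cardiology_emr": {"MS": "mitral stenosis"},
    "neurology_emr": {"MS": "multiple sclerosis"},
    "pharmacy_system": {"MS": "morphine sulfate"},
}

def resolve(code: str, source_system: str) -> str:
    """Resolve an abbreviation against the codebook of its source system.

    Without the source-system context, "MS" is irrecoverably ambiguous:
    merging records from these systems into one database silently conflates
    a diagnosis with a medication order.
    """
    codebook = LOCAL_CODEBOOKS.get(source_system, {})
    return codebook.get(code, f"UNRESOLVED({code})")

print(resolve("MS", "cardiology_emr"))   # mitral stenosis
print(resolve("MS", "pharmacy_system"))  # morphine sulfate
print(resolve("MS", "unknown_system"))   # UNRESOLVED(MS): context lost
```

Once records from these systems are pooled without their codebooks, no downstream processing can recover which meaning was intended, which is precisely how small technical discrepancies accumulate into population-scale inaccuracies.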
The scale of the COVID-19 pandemic highlighted such harms relative to the secondary use of data for epidemiological purposes. While initial concerns about secondary data use were associated with “-omics” research and patient registries, especially for uses of genomic data (Nicholson and Perego, 2020; Wang and He, 2021), COVID-19 also brought into view the extensive barriers to the exchange of information between clinical settings and public health agencies (Subbian et al., 2021). These challenges prompted a new wave of research into interoperability emphasizing the importance of maintaining interoperable systems for emergency health initiatives related to the tracking and management of novel diseases (Naudé and Vinuesa, 2021; Pelizza, 2020; Piller, 2020). This research not only extends the discussion of interoperability to new areas, but also problematizes a common reliance on individual-oriented frameworks that tend to emphasize rights to privacy and autonomy at the cost of neglecting both the broader menu of social goods that can be generated from secondary uses of health information and the negative effects flowing from insufficiently interoperable data systems.
Taking a broader view of ethical concerns as involving both individual- and structure-oriented harms, this article discusses data interoperability as a distinctive sociotechnical operation within which many of the most well-documented problems of data ethics manifest differently than they do in settings that do not implement interoperability. There is now a rich literature on data ethics (Mittelstadt and Floridi, 2016). Central to this scholarship are concerns about information privacy, data discrimination, and digital divides of unequal access to quality information and computer technology. Much of the literature on these issues focuses on the technical processes through which ethical harms might manifest, the most prominent of these being algorithmic data analytics. 1
We argue that data interoperability ethics requires reflection that moves beyond the most familiar sociotechnical moments in the data life cycle, most prominently algorithmic processing, to the barriers imposed by the prior implementation of what we describe as the formats of data.
The concept of formatting, developed by one of us in prior work, refers to the manifold processes by which data are defined as such and information is formed as such. 2 Formats are that which makes data and information possible. Consider a cell in a spreadsheet file. A cell can store a data point only against background conditions of formatting: these include technical conditions of data typing (specifying whether the contents of the cell are to be stored as a numerical integer, a text string, or a calendrical date) and conceptual or semantic conditions defining permissible values for a variable (e.g. a defined range of permitted options for a list-selected variable whose column header is “gender”). When not subjected to thorough investigation, the ensembles of formats wielded in data interoperability projects can lead to the generation, entrenchment, and escalation of unexpected harms at both the individual and structural levels.
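The dependence of a data point on its background format can be sketched in a few lines. The column names, types, and permitted options below are hypothetical; the sketch only illustrates that a value is storable at all only relative to a declared type and a defined range of options.

```python
from datetime import date

# Hypothetical column formats: a technical condition (data type) and, where
# relevant, a semantic condition (permitted values) for each variable.
COLUMN_FORMATS = {
    "age": {"type": int},
    "admission_date": {"type": date},
    "gender": {"type": str, "permitted": {"female", "male", "nonbinary", "other"}},
}

def store(column: str, value):
    """Accept a value only if it satisfies the column's format; raise otherwise."""
    fmt = COLUMN_FORMATS[column]
    if not isinstance(value, fmt["type"]):
        raise TypeError(f"{column}: expected {fmt['type'].__name__}")
    permitted = fmt.get("permitted")
    if permitted is not None and value not in permitted:
        raise ValueError(f"{column}: {value!r} not among permitted options")
    return value

store("age", 42)             # accepted: satisfies the type condition
store("gender", "female")    # accepted: within the permitted options
# store("gender", "F")       # would raise: the format, not the meaning, decides
```

Two systems whose `COLUMN_FORMATS` differ, even slightly, cannot exchange these values without a negotiation over formats of exactly the kind discussed in what follows.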
The first and more diagnostic contribution of the article consists of a conceptualization and mapping of these unexpected ethical harms with a particular focus on making the underlying practices of data formatting more explicit. Said otherwise, we show how the harms inhering in data interoperability often result from an unreflective inattentiveness to how data are formatted. An important implication of this argument is that even some of the most familiar harms of data systems prove uniquely challenging when occasioned by interoperability in contrast to the ways these harms manifest in other, more frequently analyzed, moments of the data life cycle.
As a response to the specific ethical problems raised by data interoperability, the second contribution of the article is more positive, or curative, in its aims. We propose a trio of strategies, all of which are rooted in a form of sociotechnical analysis recommended by philosophical pragmatism. Our pragmatist approach addresses both structural and individual harms by focusing on how data practitioners and institutional stakeholders can strategically implement more reflective data practices in the face of the inherent unpredictability of future occasions of data use.
The article proceeds as follows. In the first section (“An overview of data interoperability”), we provide a general overview of data interoperability as a problem in data science and informatics, paying careful attention to its role in ensuring successful data integration and data exchange. In the second section (“Mapping the ethical landscape of data interoperability”), we map particular harms associated with the formats of data interoperability in relation to two leading frameworks for data ethics (both of which are also prominent in the biomedical ethics literature): those concerning individual rights (and associated notions of privacy, autonomy, and dignity) and those focusing on social structures (such as issues of justice, fairness, and equality). In the third section (“A pragmatist approach to achieving data interoperability”), we introduce our pragmatist approach to problems of data interoperability which addresses harms from both frameworks by attending to the reflective practices that data practitioners and institutional stakeholders can implement to render data most adaptable to new situations.
Before proceeding to these three elements of our analysis, it will be valuable to consider the generalizability of the analyses that follow in light of our central focus on biomedical data. The general ethical challenges mapped in this article are of increasing salience as data interoperability is becoming ever more important for a broad range of institutions and industries across education, business, governmental, and various healthcare and research-related domains. Consider education data technology. There is now a widespread use of data systems in education for such purposes as data exchange between administrative data systems, classroom management tools, learning analytics software, and learning assessment platforms both within and between individual educational institutions (Daniel, 2017; Santos et al., 2016; Wong et al., 2023). We believe that the ethical concerns raised by interoperability in biomedical data are indeed generalizable to other domains where interoperability has become increasingly essential. 3 That said, every domain in which data interoperability poses challenges will face those challenges in relatively unique domain-specific ways. Though the mapping and strategies we provide for healthcare data interoperability are ones we take to be broadly applicable to other domains, applicability cannot itself be assumed to be self-guiding. Generalization, especially with respect to ethical challenges, requires nuanced interpretation and reparticularization in every different domain of application.
An overview of data interoperability
If we conceive of data interoperability as the ability to exchange information and functions between different data systems or platforms, an initial question arises as to how interoperability should be conceptualized in relation to similar concepts in data science. Following Pagano and colleagues (2013) we distinguish between three tasks that can be carried out on data and the
Another definitional issue arises when we consider the complexity of the sociotechnical systems in which problems of data interoperability arise. In order to clarify the processes and practices surrounding data interoperability, this idea can be organized according to three levels of abstraction: (1) technical interoperability, (2) semantic interoperability, and (3) organizational interoperability (Hellberg and Grönlund, 2013; Pagano et al., 2013; Shrivastava et al., 2021). 5
At the lowest level of abstraction, technical interoperability concerns the basic capacity of systems to exchange data at all, encompassing the compatibility of hardware, software, network protocols, and file formats.
The formats constituting conditions of possibility for data are sites at which relations between technical, semantic, and organizational constraints on interoperability are negotiated. Once operational, formats express and enact the outcomes of those negotiations. Formats as the operational outcomes of these negotiations can thus be highly variable across different data systems. Format variability is sometimes due to technical constraints (e.g. incompatible technological infrastructure), other times due to semantic constraints (e.g. contrastive coding systems), and yet other times due to organizational constraints (e.g. local institutional requirements mandating the use of a particular data schema). Such variability in formats, as well as the many possible causes of this variability, is one way of understanding the complexity involved in any effort at data interoperability.
Problems in semantic interoperability
One of the most pressing issues raised in contexts of data interoperability involves a mismatching of the purposes and contexts of data use. This is most clearly exemplified in cases of semantic interoperability. For example, in an early paper on the topic, Heiler notes that semantic interoperability is fundamentally grounded in “semantic agreements” between a requester and provider of information concerning the meaning of a particular term or category (Heiler, 1995: 271). These agreements can be difficult to establish when old data are set to new purposes. For instance, Heiler documents a project by the U.S. Department of Defense employing a database of military personnel addresses to establish where new veterans' hospitals should be located (Heiler, 1995: 271). Although initially the secondary use of these data appeared relevant for these purposes, it was later discovered that the addresses corresponded to active military assignments, including temporary assignments, and therefore, the data were useless for capturing where veterans and their families lived after their respective assignments. In this case, a single data field of “address” contained implicit semantic information which needed to be made explicit within the metadata in order to prevent future misunderstandings. However, Heiler further explains that it can be difficult to make semantic agreements explicit given that the information of interest will always be context-dependent, requiring documenters to hypothesize various future applications (Heiler, 1995: 272).
There are, however, methods which can be used to predict the likelihood of a semantic disagreement. For instance, Brazhnik and Jones (2007) have developed a set of concepts which can be helpful for determining the long-term reliability of categories for secondary data use. Concerning data elements (DEs), they distinguish between focal DEs, which are central to the original purposes of collection and so tend to be recorded reliably, and peripheral DEs, which are incidental to those purposes and so tend to be recorded with less consistency.
Additional instances of human decision-making may also impart unreliability in relation to peripheral DEs. For example, a DE like “flu” may be coded in healthcare records to account for both vaccination and diagnosis due to the difficulty of memorizing discrete codes for each (Brazhnik and Jones, 2007: 258). Pine has discussed such decisions as a “qualculative” aspect of interoperability which “sees judgment and calculation as inherently related—calculation is not straightforward and mechanical, it involves situated qualitative judgments that are inherently quite effortful” (Pine, 2019: 539). In Pine's human-centered account of semantic interoperability, data recording practices which may be viewed as unreliable or error-prone by researchers are actually part of complex, pragmatic social negotiations which are difficult to correct through technical means alone. For example, if discrepancies occur between the primary doctor's description of treatment and a hospital's discharge summary for a respective patient, the coder in charge of the patient's chart will likely choose to code in conformity with the discharge summary instead of backtracking to determine the reliability of each respective account (Pine, 2019: 542). In Pine's analysis, such a decision is not a purposive failure to record accurately, but a decision to record as accurately as possible given the economic and temporal constraints of the coder's workflow.
Finally, semantic issues can result from implicit standards and formatting, specifically between units of measurement for a recorded variable or the order of terms within dates (Brazhnik and Jones, 2007: 262). For example, a weight recorded as “120” or a date recorded as “10-09” has a different meaning in the United States than in Germany. Such barriers to interoperability can arise even in instances where focal DEs have been reliably recorded. Although these cases may pose issues for automated data integration, contextual information within the data often provides clues as to the correct order of terms, and details about the cultural, historical, and geographic conditions of collection can help to standardize measurement variables. Yet this contextual information will not always provide a satisfactory interpretation of ambiguous DEs. In many instances, missing data cannot be interpolated or imputed, and these data may contribute to data loss and bias (Bradwell et al., 2022: 1173). The result is that even relatively trivial exclusions may pose challenges to data integration and exchange, especially when one considers the time-consuming nature of manual data cleansing for large-scale datasets.
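The ambiguity of implicit formats can be demonstrated with a short sketch. The parsing logic below is illustrative only; it simply shows that the recorded string “10-09,” absent explicit metadata, supports two incompatible readings under the US (month-first) and German (day-first) conventions, and that a bare “120” is likewise undecidable between pounds and kilograms.

```python
from datetime import datetime

raw = "10-09"  # recorded without an explicit date format in the metadata

# The same string parsed under two conventions yields two different facts.
us_reading = datetime.strptime(raw, "%m-%d").replace(year=2020)      # October 9
german_reading = datetime.strptime(raw, "%d-%m").replace(year=2020)  # September 10
assert us_reading != german_reading  # one string, two incompatible dates

# Units pose the same problem: "120" is a plausible adult weight in pounds
# and in kilograms, so the number alone cannot disambiguate.
weight_if_pounds_in_kg = 120 * 0.453592  # roughly 54.4 kg
weight_if_kilograms = 120.0
print(us_reading.date(), german_reading.date())
print(weight_if_pounds_in_kg, "vs", weight_if_kilograms)
```

Nothing internal to the record can decide between the readings; only documentation of the conditions of collection (here, the country and convention of entry) restores an unambiguous meaning.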
Mapping the ethical landscape of data interoperability: Individual-oriented versus structure-oriented frameworks
With a mapping of data interoperability in view, we now turn to the ethical features of the landscape we have mapped. This section develops a further layer for analyzing data interoperability. The ethical problems posed by data interoperability can be introduced according to principlist ethical frameworks proposed for general use within both data ethics and biomedical ethics. Mittelstadt and Floridi (2016) outline the literature concerning the ethics of big data in biomedicine as centering upon five areas: (1) informed consent, (2) privacy (including anonymization), (3) ownership, (4) epistemology and objectivity, and (5) big data divides. A more concise mapping by Ganiat and Olusola (2015) adapts Beauchamp and Childress’ (2001) influential four principles of bioethics for use in scenarios which specifically address data interoperability: (1) autonomy, (2) beneficence, (3) nonmaleficence, and (4) justice.
Both of these frameworks can be considered in light of a central tension for ethical reflection in any form. This is the tension between individual-oriented frameworks which look to protect individual rights like privacy and informed consent, and structure-oriented frameworks which seek justice, fairness, or equality (or all three) in data-driven initiatives. We propose a mapping of the ethical landscape of data interoperability according to this tension.
We begin with frameworks oriented around individuals and the basic rights owed to them—remaining agnostic for the sake of presentation about the justificatory frameworks within which such rights can be derived. 8 We then turn to frameworks which are more structural in their focus on how social structures differentially impact persons and populations—for the purposes of a mapping we also here remain agnostic about how structurally focused values are justified and even how social structure is conceptualized. 9
Individual rights: Privacy and informed consent
In relation to individual rights, an increase in data interoperability may be associated with increased risks to privacy and greater loss of individual control over personal data. One major privacy concern is that of deanonymization due to recombination of previously separated DEs. For example, Sula argues that increasing interoperability threatens to deanonymize individuals since it renders old data more readily available to novel and unforeseen methods of data analysis and extraction (Sula, 2016: 19). One current strategy for deanonymization is through “data triangulation” methods where an anonymized dataset may be algorithmically combined with outside information to produce the necessary variables for inferring a data subject's identity (World Health Organization, 2021: 41). While deanonymization is often not purposeful, it is not difficult to foresee how accidental cases may occur through increasing data exchange and the implementation of machine learning (or artificial intelligence) approaches, rendering it easier over time to reidentify patients, clients, or research subjects.
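The mechanics of data triangulation can be illustrated with a deliberately simple sketch. All of the records, names, and field choices below are invented; the sketch only shows how joining an anonymized dataset with an outside source on shared quasi-identifiers (here, zip code, birth year, and gender) suffices for reidentification whenever a demographic combination is unique.

```python
# Invented anonymized health records: no names, but quasi-identifiers remain.
anonymized_health = [
    {"zip": "02139", "birth_year": 1954, "gender": "F", "diagnosis": "cystic fibrosis"},
]

# Invented public dataset (e.g. a voter-registration-style extract).
public_registry = [
    {"name": "J. Doe", "zip": "02139", "birth_year": 1954, "gender": "F"},
    {"name": "A. Roe", "zip": "02139", "birth_year": 1987, "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

def triangulate(health_rows, registry_rows):
    """Link rows whose quasi-identifiers match; a unique match reidentifies."""
    links = []
    for h in health_rows:
        key = tuple(h[q] for q in QUASI_IDENTIFIERS)
        matches = [r for r in registry_rows
                   if tuple(r[q] for q in QUASI_IDENTIFIERS) == key]
        if len(matches) == 1:  # uniqueness of demographics does the work
            links.append((matches[0]["name"], h["diagnosis"]))
    return links

print(triangulate(anonymized_health, public_registry))
# A unique demographic combination links a name to a diagnosis.
```

Neither dataset violates anonymity on its own; the harm emerges only from their interoperability, which is what makes these risks so difficult to anticipate at the moment of collection.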
This issue is perhaps most pronounced in healthcare-related -omics research, where anonymous patient records may be combined with external data to reidentify patients (sometimes in violation of the HIPAA Privacy Rule). In some cases, data researchers have learned that certain variables are so rare as to invalidate any form of anonymization. Layman documents how diseases including cystic fibrosis, Friedreich ataxia, hereditary hemorrhagic telangiectasia, Huntington disease, phenylketonuria, Refsum disease, sickle cell anemia, and tuberous sclerosis were infrequent enough at particular hospital locations that combining genomic data with discharge records sufficed for deanonymizing patients with these diseases in 32.9% to 100.0% of cases (Layman, 2008: 156). Other methods such as using surname elements within genomics records in combination with basic demographic data in other records can also reidentify subjects or even link genetic data from one individual to their relatives (Gymrek et al., 2013).
This issue was already documented in the bioinformatics literature 20 years ago by Malin and Sweeney, who present the following case: John Smith is admitted to a local hospital, where he is diagnosed, via a DNA diagnostic test, with a DNA-influenced disease, such as cystic fibrosis. The hospital stores the clinical and DNA information in John's electronic medical record. For treatment, John visits several other hospitals, where his electronic medical record is also collected and stored. For research purposes, the hospitals forward certain DNA databases, including John's DNA, onto a research group. The DNA records are tagged with the submitting institution and with pseudonyms for their submitted sequences. By state law, the hospital sends a copy of the identified discharge record, including name, gender, zip code, visit date, diagnoses, and procedures, onto a state-controlled database. The discharge database is made publicly available in a deidentified format and can be reidentified to publicly available records, such as voter registration databases. This final step of linking is based on the uniqueness of demographics, which has been validated in previous data privacy research, as well as in demography, public health, and epidemiology communities. (Malin and Sweeney, 2004: 181)
This is an account, all too common, of a trail of identifying data which can easily be used to break anonymization procedures. Despite anonymization techniques, interoperable data formats embed technical, semantic, and organizational constraints whose implications can help enable reidentification.
These cases demonstrate the risks associated with the development of interoperable healthcare systems, specifically at the level of organizational interoperability as outlined above. Notably, the increased potential for unifying and consolidating localized forms of data may come at the cost of generating datasets which openly identify their subjects or which contain all of the separate elements required for deanonymization. However, it remains debatable whether interoperable systems increase privacy risks overall. Within healthcare settings, consolidated digital information stored within EHRs might be more readily accessible to bad actors and would contain more information than any individual system (Layman, 2008). On the other hand, it may be easier to secure one single system for EHRs than to rely upon a patchwork of separate data systems scattered across a range of healthcare providers (O’Reilly-Shah et al., 2020: 342).
Key problems to consider here concern how access to consolidated data systems should be governed and how data subjects can trust that their data will not be misused by professionals with credentialed access. Here, informed consent is a major issue, since interoperability may greatly expand the secondary uses of any particular data point beyond what is currently conceivable or link data in unexpected ways (see Hand, 2018; Mittelstadt and Floridi, 2016). In healthcare settings, a major question is whether interoperable EHRs will allow providers not associated with a patient to nonetheless have full access to their health records. In relation to this problem, patient authorization has been proposed as a means to secure data autonomy under conditions of expanded EHR interoperability (Ganiat and Olusola, 2015: 14). However, a call for authorization procedures may be too strict in relation to secondary data use outside of standard healthcare practice, especially when health data has undergone deidentification for research purposes.
Social structure: Justice, fairness, and equality
The potential overestimation of the risks of rights violations of individuals may not only stifle research projects which require secondary data use; it may itself be a site of ethical or normative concern, especially in relation to structural matters of justice, fairness, and equality. In outlining their application of principlism to healthcare interoperability, Ganiat and Olusola define justice as the principle upheld when “interoperating electronic healthcare systems are used to provide equal and prompt healthcare care to everyone as well as also ensuring data availability, accuracy, and security” (Ganiat and Olusola, 2015: 15). They additionally note the essential role interoperability plays in reducing a “digital divide” between healthcare systems, a problem which Mittelstadt and Floridi (2016) further extend into the concept of a “Big Data divide.” This divide involves unequal distributions of benefits and burdens flowing from big data analyses. All of these concerns point to the need for critically questioning who is being represented in data, how they are being represented, how fairly their data representations are being subjected to treatment, and what purposes lay behind each of these operations (Crawford et al., 2014).
Data interoperability fundamentally attempts to reduce what has been termed lossiness, where “data collection and/or analysis may involve aggregation, case construction, or standardization in such a way that certain aspects of the phenomena of interest are lost” (Busch, 2014: 1732). In privacy-oriented accounts, this lossiness is analyzed as an unfortunate but ethically neutral state of affairs. For instance, Sula has argued that with greater longevity of data life, “the potentials for data loss, theft and unintended consequences are high—but entirely mitigated when no personally identifiable information is collected in the first place” (Sula, 2016: 20). By contrast, in relation to structural frameworks addressing data divides, data loss is not just a technical problem derivative of privacy concerns.
By taking seriously an emphasis on structural questions of ethics, we can begin to understand how data loss is intricately connected to discrepancies in data recording, biases in formatting, and structural barriers to technical interoperability which result in unequal distributions of data loss among marginalized communities. These issues were on full display during the COVID-19 pandemic, where data interoperability issues caused by hyperfragmented datasets and lack of reporting standards prevented secondary data use in disease tracking (Backhaus, 2020). Describing these problems, Naudé and Vinuesa employed the term
Due to these and other problems, several scholars have identified the United States' failures during the COVID-19 pandemic as a wakeup call for current interoperability limitations in healthcare (Greene et al., 2021; Naudé and Vinuesa, 2021). In a way that highlights what we referred to above as the central tension between individual-oriented and structure-oriented frameworks, Piller (2020) frames the U.S. COVID-19 response as ultimately being overly cautious about reidentification harms within data shared between public authorities and epidemiologists. This cautious approach is antithetical to ethical aims when standard privacy-oriented accounts are supplemented with frameworks which seek to ensure a more just, fair, or equal distribution of the benefits of increased social welfare. Though privacy may be further compromised within an interoperable healthcare data system, it is also necessary to emphasize how broader concerns for social welfare are easily neglected in favor of individual rights-based approaches to the detriment of vulnerable populations. Therefore, while we should not downplay the harms that can be directed against individuals, an overly restrictive individual-centered framework may inadvertently generate social-structural harms. In other words, the tensions that manifest between individual-oriented and structure-oriented frameworks are often difficult to resolve. Above all, we should not pretend they are resolved by focusing all ethical analysis on one side or the other of the ledger.
A pragmatist approach to achieving data interoperability
There are several strategies for minimizing the harms of data interoperability. These include technical approaches leveraging big data analytics, ontological approaches which emphasize standardized vocabularies for coding relevant information, and policy-oriented approaches which require state and/or market stakeholders to contribute to better training of coders and more standardized collection of data.
On the technical front, big data analytics and machine learning have been suggested as tools which could be used to parse through information and derive correlations between DEs in a manner that exceeds human capacities. However, scholarship in this area points to issues such as opacity in deep learning models and introduction of additional bias in the training and unequal deployment of algorithms (Gianfrancesco et al., 2018). Alternatively, a longstanding proposed solution to data interoperability has been the establishment of standardized codes, vocabularies, or ontologies within defined academic, clinical, or research domains (Dixon et al., 2014; Kuiler and McNeely, 2018). Here, even those who endorse such approaches recognize the practical limitations which impede the universal adoption of ontological standards. For one, in healthcare settings, economic costs are likely to fall upon healthcare providers who will need additional time to adequately train personnel (Dixon et al., 2014). The coding and cleaning of data in healthcare settings could also be performed by the public health officials who gather data for secondary use. However, such models presume adequate public funding and do not address interoperability between clinical care providers themselves (Dixon et al., 2014). Additionally, such a solution is unlikely to resolve issues which primarily occur at the time of initial data entry, such as missing DEs, misidentification and mislabeling, or DEs too broad for secondary use (e.g. a racial or gender category labeled as “other”).
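The standardized-vocabulary approach, and its limits, can be sketched briefly. The site names and codes below are invented for illustration (real deployments would target reference vocabularies such as SNOMED CT or LOINC); the sketch shows translation from local codes to a shared standard at the point of exchange, and why such translation cannot repair ambiguity introduced at initial data entry.

```python
# Invented site-to-standard mappings. "hospital_b" illustrates the "flu"
# problem discussed above: one local code conflates diagnosis and vaccination.
SITE_TO_STANDARD = {
    "hospital_a": {"flu_dx": "STD:influenza-diagnosis",
                   "flu_vax": "STD:influenza-vaccination"},
    "hospital_b": {"flu": "STD:influenza-diagnosis"},
}

def translate(site: str, local_code: str) -> str:
    """Translate a site's local code into the shared standard vocabulary."""
    mapping = SITE_TO_STANDARD[site]
    if local_code not in mapping:
        raise KeyError(f"{site} has no mapping for {local_code!r}")
    return mapping[local_code]

print(translate("hospital_a", "flu_vax"))  # STD:influenza-vaccination
print(translate("hospital_b", "flu"))      # STD:influenza-diagnosis
# The second translation is well-formed but potentially wrong: hospital_b's
# single "flu" code lost the diagnosis/vaccination distinction at entry,
# and no downstream vocabulary can recover it.
```

This is the sense in which ontological standards address semantic interoperability between systems while leaving untouched the qualculative judgments made at the moment of recording.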
In acknowledging the limitations of both market-driven and publicly funded proposals, a third "strategic, cooperative approach" introduced by Dixon, Vreeman, and Grannis advocates for a set of shared practices which distribute the costs of interoperability across all relevant stakeholders:

[P]ublic health would collaboratively develop a strategic plan with data sharing partners whereby all stakeholders that generate and report clinical data would partner to improve semantic interoperability. The onus of translation would not fall disproportionately to any one group, making it equitable. Instead each stakeholder group would invest time and resources into the process of translation to enable full semantic interoperability across the myriad health IT systems and scenarios for public health reporting. So while implementation might be somewhat more complex in this scenario, it is likely to be more acceptable to all stakeholders and incur the lowest cost. (Dixon et al., 2014: 6)
The strategic-cooperative approach offers a compromise between various parties who collaborate to enact meaningful change to current standards of interoperability at the technical, semantic, and organizational levels.
We recognize a pragmatist thread implicit in the strategic-cooperative approach, namely that interoperability is embedded within complex sociotechnical environments where the ideal theoretical conditions presumed by some (but not all) forms of individual-centered and structure-centered ethical principles may be difficult to implement in practice. This pragmatic impulse extends to other previously cited domains within the scholarship, including the limited and contextually defined scope of semantic negotiation and the "qualculative" elements at play when data entry occurs on-the-ground (Pine, 2019).
Given the time constraints and the limited economic resources of stakeholders, one could ask how an ethical solution to interoperability concerns could be achievable within our current system at all. In response to such skepticism, and in an effort to repel the cynicism that skepticism always invites, we conclude by analyzing how ethical data interoperability could be justifiably based upon a pragmatist version of the strategic-cooperative approach. To this end, we outline three ways in which philosophical pragmatism can speak to the situation-dependent and fallibilistic procedures involved within the selection and matching of DEs as well as the complex sociotechnical conditions in which effective data collection and use must occur.
Three pragmatist strategies: Data standards, manual curation, and data documentation
Originally formulated by Peirce (1878) as a maxim through which theoretical disputes could be settled by analyzing the consequences they engender, philosophical pragmatism is informed by a range of philosophers including Peirce, James (1907), and Dewey (1938) as well as more recent analytic pragmatisms developed by Rorty (1991), Brandom (2008), and Anderson (2020). In all of its versions, pragmatism focuses on the socially-embedded and practice-centered nature of epistemological and ethical problems. A central feature of pragmatism's concern with social practice, as described by Rorty (1991), is the search for toeholds rather than skyhooks. Pragmatism seeks to ground theoretical commitments in our contingent social practices rather than searching for immutable foundations for our epistemological and ethical projects.
In relation to the ethics of interoperability, a pragmatist approach cautions against seeking overarching solutions for all contexts of data use. It instead aims to consider what social practices could be put in place to render current data systems and data formats into forms that ameliorate present problems and mitigate actual harms. The pragmatist's aim is not to predict all future data uses. Rather, the pragmatist seeks to address how, in our data practices, we can and should be cognizant of fundamental epistemic limitations and responsive to current sociotechnical conditions. Pragmatism avoids hypothetical prediction in favor of concrete curation.
One entry point into the pragmatist approach is through the work of Leonelli (2016), whose study of data-centric biology employs Dewey's pragmatist theory of inquiry. In tracing data through different sites of use, Leonelli discards the familiar term "context" to describe different data problematics in favor of Dewey's term "situation" (see Dewey, 1938). For Leonelli, philosophical concepts of "context" tend to ignore the role played by nontheoretical considerations in scientific inquiry in a way that obscures the messiness of our data practices; by contrast, a "situation" refers to the total field of inquiry in terms that are inclusive of material, institutional, and social elements in addition to the conceptual or theoretical terms that are the focus of classical philosophy of science. According to Leonelli, a situational attentiveness also acknowledges the presentation and curation of data within and between domains in addition to the constantly shifting aims of researchers and social systems in any individual situation. Leonelli specifically links these features to interoperability concerns, noting that the ability to manipulate and present data within data systems is an important feature for continuing the "life" of the data (Leonelli, 2016: 183–184).
Leonelli's situational pragmatism connects to additional pragmatist concerns based on Wittgenstein's considerations on rule-following, in particular, his idea that no rule can contain a rule for its own application (Wittgenstein, 1953: §201). In Dewey's terms, since a problem results from a recognition of disordered elements within a situation and is resolved (if at all) by particular actions conducted within this situation, it is impossible to derive on the basis of one problematic situation the correct procedures one must follow in all possible situations (Dewey, 1938: 107–110). Rather, pragmatism emphasizes the value of processes of inquiry in contrast to the finalized products of prior inquiry, and as such resonates with Edwards's focus on “metadata processes” as the spontaneous and informal communication that occurs among data practitioners in situations where the products of metadata are otherwise imprecise (Edwards et al., 2011: 684). In looking at our human ability to converse with one another, offer novel inferences, and consider alternative possibilities, these processes demarcate human practices as “simultaneously focused and flexible (unlike that of computer programs, whose performance typically degrades precipitously or fails altogether in the presence of unanticipated contingencies)” (Edwards et al., 2011: 685).
In highlighting the situated flexibility of human action, pragmatism offers an important framing for addressing the ethical challenges of data interoperability in light of the real epistemic and ethical limitations on interoperability as implemented in actual situations. More specifically, pragmatism provides a framework for practicable strategies at the level of semantic interoperability that can help achieve more ethical data interoperability by way of improving data and metadata quality. We present three such strategies: data standardization, manual data curation, and data documentation. We envision these strategies as most likely to minimize or mitigate unexpected ethical harms of data interoperability when implemented in coordination with one another.
First, it is crucial to acknowledge the importance of data standards, while also recognizing that standards carry the situational limitations and potential for harm discussed above.
That data standards are both situationally limited and potentially harmful does not, however, mean that we should abandon them altogether. Data standards are both technically and socially (i.e. sociotechnically) necessary for domain-specific and cross-domain data exchanges and integrations. It may be objected that artificial intelligence applications render standard taxonomies otiose because of their ability to process massive quantities of data often referred to as “unstructured” (Jercich, 2022). But this is a misguided position, at least in sociotechnical situations implementing data interoperability. All data are structured to some degree by some amount of formatting—if they were not, they would not be machine-readable (nor human-readable). The question is always which formats are in place such that data can be well-formed, not whether there should be any formatting at all for data. Some degree of formatting is necessary for any collection, storage, or processing of data. A standardized taxonomy, then, can be defined as just a data format that applies across two or more datasets. Some standards, of course, serve as domain-wide specifications because of social or institutional rules. But even with domain-wide implementation, at a technical level a standard just is that which enables interoperability between two or more differently formatted datasets. Standards are thus, in a way, necessary for responsible data interoperability. But because of the limitations noted above, they are also typically insufficient. Standards thus need to be implemented in a manner that respects their limitations. This raises a crucial question: what can be added to or implemented alongside standardization efforts in order to render data interoperability more ethical? This brings us to our next two pragmatist strategies. These strategies are focused on improvements in data quality at the level of primary data and of metadata.
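The claim that a standard is, at a technical level, just a data format spanning two or more datasets can be made concrete with a minimal sketch. All field names, code values, and mappings below are hypothetical, invented solely to illustrate how two differently formatted datasets become interoperable through a shared taxonomy:

```python
# Hypothetical illustration: two clinics record the same data element
# (smoking status) under different local formats.
clinic_a_record = {"patient_id": "A-001", "smoker": "Y"}
clinic_b_record = {"pid": "B-042", "tobacco_use": "current every day"}

# The shared taxonomy is itself just a format that spans both datasets:
# a made-up three-value code list standing in for a standard vocabulary.
SHARED_CODES = {"CURRENT", "FORMER", "NEVER", "UNKNOWN"}

# Each source maintains its own mapping into the shared taxonomy.
A_MAP = {"Y": "CURRENT", "N": "NEVER"}
B_MAP = {"current every day": "CURRENT", "former": "FORMER", "never": "NEVER"}

def to_shared(value, mapping):
    """Translate a local value into the shared code, flagging gaps
    explicitly rather than silently guessing; an unmappable value is
    exactly the kind of case that calls for manual curation."""
    return mapping.get(value, "UNKNOWN")

print(to_shared(clinic_a_record["smoker"], A_MAP))       # CURRENT
print(to_shared(clinic_b_record["tobacco_use"], B_MAP))  # CURRENT
print(to_shared("ex-smoker", B_MAP))                     # UNKNOWN
```

Note that the standard here does nothing to repair data entered incorrectly at the source; it only aligns formats, which is why the strategies below are needed alongside it.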
A second pragmatist strategy we advocate is that of manual data curation.
Data curation practices that design in manual data correction and cleaning can help minimize or mitigate such unethical consequences. In some ways, this is an analytical point. If we know that a data system has generated ethical harms on the basis of inaccurate data, then this implies that someone somewhere along the line has located the inaccurate data that formed the basis for the harm. When this does occur, of course, it is typically the result of a human practitioner manually inspecting and correcting the data rather than of any automated process.
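One way to design manual correction into a curation pipeline is to have automated checks flag suspect records rather than "fix" them. The sketch below is hypothetical; the field names and validity rules are invented for illustration only:

```python
# Hypothetical sketch: automated checks route suspect records to a
# human review queue instead of silently altering or discarding them.

def flag_for_review(records):
    """Separate records into clean ones and a manual-review queue."""
    clean, review_queue = [], []
    for rec in records:
        problems = []
        if not rec.get("diagnosis_code"):
            problems.append("missing diagnosis code")
        if rec.get("age") is not None and not (0 <= rec["age"] <= 120):
            problems.append("implausible age")
        if problems:
            # A human curator decides what to do; the pipeline does not guess.
            review_queue.append((rec, problems))
        else:
            clean.append(rec)
    return clean, review_queue

records = [
    {"id": 1, "age": 34, "diagnosis_code": "J45"},
    {"id": 2, "age": 230, "diagnosis_code": "J45"},  # likely an entry error
    {"id": 3, "age": 51, "diagnosis_code": None},
]
clean, queue = flag_for_review(records)
print(len(clean), len(queue))  # 1 2
```

The design choice matters: routing ambiguous cases to a person preserves the "focused and flexible" human judgment that Edwards et al. (2011) contrast with brittle automated processing.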
A third pragmatist strategy for implementing practices more likely to produce higher-quality data and therefore less likely to generate ethical harms involves implementing meticulous data documentation, exemplified by the "datasheets for datasets" model (Gebru et al., 2021).
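As a rough illustration of what machine-readable documentation might look like, the sketch below invents a minimal documentation record. The fields and their contents are our own hypothetical examples, loosely in the spirit of a datasheet but not reproducing the Gebru et al. template:

```python
# Hypothetical, minimal documentation record for a dataset. The fields
# are illustrative stand-ins for the kinds of questions documentation
# should answer: provenance, collection procedure, and known gaps.
dataset_doc = {
    "name": "clinic_visits_2021",
    "collected_by": "example clinic intake staff",
    "collection_procedure": "entered manually at point of care",
    "known_limitations": [
        "race/ethnicity field includes a broad 'other' category",
        "smoking status backfilled from free-text notes before 2020",
    ],
    "intended_uses": ["care coordination"],
    "cautioned_uses": ["population-level epidemiology without review"],
}

def check_doc(doc):
    """Refuse to treat a dataset as documented unless core fields exist."""
    required = {"name", "collection_procedure", "known_limitations"}
    return sorted(required - doc.keys())

print(check_doc(dataset_doc))  # []
```

Even a record this small makes explicit to secondary users what would otherwise be guesswork: how the data were produced and where their known weaknesses lie.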
Looking back at the COVID-19 pandemic, we can see how these strategies might have mitigated semantic issues in the recording of pandemic fatalities. For instance, Backhaus (2020) describes how several terms were used interchangeably in the reporting of deaths. "Case fatality rate" is supposed to refer to the number of deaths per number of reported infections; "infection fatality rate" is supposed to supplement the number of reported infections with estimated unreported infections in its calculation; and "mortality rate" is supposed to refer to deaths divided by total population. Yet these standard taxonomic categories were not always strictly followed in reporting (Backhaus, 2020: 162–163). Moreover, even where taxonomies were clear, Backhaus shows that different metrics were used to determine whether a patient died from COVID-19 or from a comorbid condition (Backhaus, 2020: 164). In such cases, meticulous documentation could make explicit the procedures used to delineate deaths from COVID-19 versus those caused by a serious comorbidity, and manual curation could help produce high-quality data to track epidemiological spread among more localized population groups.
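The taxonomic distinctions Backhaus describes can be stated as simple formulas. The sketch below uses invented counts purely for illustration; none of these numbers are actual epidemiological data:

```python
# Illustrative only: invented counts, not real epidemiological data.
deaths = 50
reported_infections = 1_000
estimated_unreported = 4_000   # hypothetical estimate
total_population = 100_000

# Case fatality rate: deaths per reported infection.
cfr = deaths / reported_infections                           # 0.05
# Infection fatality rate: deaths per (reported + estimated unreported).
ifr = deaths / (reported_infections + estimated_unreported)  # 0.01
# Mortality rate: deaths per member of the total population.
mortality_rate = deaths / total_population                   # 0.0005

print(cfr, ifr, mortality_rate)
```

Even with identical inputs, the three metrics differ by a factor of up to one hundred here, which is why reporting that conflates them degrades data quality at the semantic level.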
The three strategies we have presented are not perfect solutions for data interoperability harms. The authors of the "datasheets for datasets" model, for instance, explicitly note that they cannot account for dataset creators’ limited capacity to imagine alternative uses nor can they provide an effective financial model incentivizing their approach (Gebru et al., 2021: 92). However, in relation to the first worry, our pragmatist approach responds by noting that fundamental epistemic limitations will arise for any and all proposals. Thus, by accepting that no proposal can derive perfect rules for future applications, it can be affirmed that manual coding and meticulous documenting are some of the most effective proposed solutions available. This is because they ensure higher quality data for peripheral DEs and eliminate the guesswork of data practitioners in secondary use situations while also acknowledging the epistemic limitations of data practitioners in original or primary data situations. Additionally, the second worry can be deflated when we look to previously proposed solutions at the organizational level of interoperability. The strategic-cooperative approach, for instance, could provide sufficient funding and incentives by distributing accountability across market and governmental stakeholders in a manner which, in turn, effectively distributes the benefits of interoperable data systems.
A fully pragmatist approach to data interoperability should make use of both semantic-level and organizational-level proposals while tracking the epistemic and ethical limitations necessarily imposed upon social actors at both levels. As noted earlier, at the semantic level of interoperability, ontological standards within specified domains in combination with manual data curation and meticulous documentation practices can help meliorate data harms by making initial semantic agreements explicit to practitioners in primary, secondary, and tertiary use situations. At the organizational level of interoperability, data practices need to be situated within their socioeconomic conditions by accounting for the economic and social limitations that market, civic, and governmental stakeholders face.
Across the multiple levels of interoperability, a strategic-cooperative approach informed by pragmatism and experimentally committed to implementing and balancing multiple strategies for ethical data interoperability offers a way of realizing the benefits of data technologies while remaining attentive to the enormous potential for data harms. These harms are invited by a tendency to seek solutions to real problems by implementing data without sufficient reflection on what those data include and what they exclude. Such tendencies are now augmented by implementations of artificial intelligence that seek to automate away reflective consideration. What we need to confront the ethical harms of data-driven solutions is not less reflection and more automation but more reflective intelligence. It is precisely this kind of intelligent reflection upon our social practices which pragmatism seeks to cultivate.
Conclusion
The specific ethical challenges involved in data interoperability remain concerningly undertheorized in existing data ethics scholarship. The way that data are and must be constituted by formats gives rise to unique ethical (as well as epistemic) challenges wherever data are exchanged or integrated. These challenges flow from the variability of disparate formats. Additionally, these challenges are technically, semantically, and organizationally irreducible to other much-discussed issues in data ethics scholarship concerning the potentially harmful effects of algorithmic processing. In the context of interoperability, it is almost always the formats of data that lead to degraded accuracy, concomitant inequality, and other epistemic and ethical problems. Pragmatist strategies for navigating data interoperability can help mitigate these individual-level and structural-level problems. Yet pragmatist strategies are no surefire solution. The approach we advocate acknowledges the limitations faced by data practitioners by emphasizing the reflective intelligence of agents in virtue of which they can be capable of adapting to novel situations. Although our focus has primarily been on healthcare and biomedical data, the pragmatist approach to structural issues of interoperability ethics we have proposed is one that can be reflectively generalized to other domains in light of the generality of the pragmatist strategies we have outlined. A pragmatist approach cannot guarantee ethical data interoperability, but it can provide valuable reflexive support for imperfect human actors adapting themselves to new data-saturated situations. Pragmatism provides no algorithmic guarantees, and yet it nonetheless points the way toward improved data practices.
Acknowledgements
The authors thank several colleagues for their input and feedback on this project: Steven D. Bedrick (of Oregon Health Sciences University), Carlos Montemayor (of the Department of Philosophy at San Francisco State University), and Thomas A. Thornhill IV (of Yale School of Public Health). The authors also thank the editors and two anonymous reviewers for extensive comments.
Funding
Caplan's and Koopman's contributions were funded in part by a University of Oregon Data Science Initiative Seed Funding Convening Award. Koopman's contributions were additionally supported by an Individual Research Fellowship from the United States National Endowment for the Humanities (NEH). Funding for open-access publishing was provided by the Oregon Humanities Center at the University of Oregon and the University of Oregon Libraries Open Access Article Processing Charge Award Fund.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
