What is responsible and sustainable data science?

Keywords

Commons, health, responsibility, ethics, privacy, data protection

This article is part of the special theme on Health Data Ecosystem. A full list of the articles in this special theme is available at: https://journals.sagepub.com/page/bds/collections/health_data_ecosystem.

Introduction

In the expansion of health ecosystems, issues of responsibility and sustainability of the data science involved are central. The idea that these values should be central to the practice of data science is increasingly gaining traction, yet there is no agreement on what exactly makes data science responsible or sustainable because these concepts prove slippery when applied to a global field involving commercial, academic and governmental actors. This lack of clarity is causing problems in setting goals and boundaries for data scientific practice, and risks fundamental disagreement on governance principles for this emerging field. Responsibility, in terms of data science practice, has largely been used to signify practices on the part of scientists that are in line with legal data protection compliance, privacy and confidentiality with regard to data subjects, and with efforts to reduce bias and inaccuracy in data analysis (van der Aalst et al., 2017). Sustainability, following definitions from other fields, may be interpreted as a broader objective of promoting mutually beneficial interaction between data science and society. Yet these values are far from informing a coherent system of norms and standards in relation to (health) data governance. We will argue in this commentary for a Commons analytical framework as one approach to this problem, since it offers useful signposts for how to establish governance principles for shared resources.

The commons is often invoked both to make a normative statement regarding the distribution of wealth (e.g. to contest private ownership and enclosure) and as a roadmap for finding solutions for sustainable resource management. The classic example is fisheries: in a fishing lake an individual fish is part of the fish stock and the lake’s ecosystem, which together constitute a common resource that local fishermen and large fishing companies use. The Commons offers a language to problematize the enclosure of fisheries, e.g. by the fishing industry privatizing fishing rights and fishing waters, or using intensive fishing techniques and thus making fewer fish available to local fishermen who have no access to industrial fishing equipment. Commons theory also, however, offers a roadmap to governance of a shared resource that preserves values and accounts for the interests of affected people. In the case of fisheries, the subsistence of the fish stock in the lake and the interests of local and commercial fishermen will be fostered by agreeing on fishing rules and establishing institutions to facilitate and monitor such agreements. We aim for the second use of this theory: if we share data, how should we govern that shared resource in order to protect the connected ecosystems that form the commons?¹ To discuss this we can borrow our analytical framework from neo-institutional economics (e.g. Ostrom et al., 1994).

We use the terms ‘sustainable’ and ‘responsible’ data science as largely synonymous. The former is a part of the vocabulary of Commons scholarship (e.g. Dietz et al., 2003) and the latter is favoured in policy and industry circles (though the notion of ‘responsibilisation’ of society in relation to managing resources is also prevalent in the commons literature, e.g. Ostrom, 2009). We posit here that it is possible to draw lessons from scholarship on commons governance in relation to environmental resources to understand how the data commons might be governed. To do so, we will talk about data generally rather than focusing on what is traditionally considered deserving of special attention and protection, i.e. ‘personal data.’ The reason for this is twofold. First, while the term ‘personal data’ is mostly associated with the impact of data processing on a single individual, it largely excludes the collective dimensions of data-driven harms. Instead, the assumption underlying this commentary is that data science will often affect groups and societies rather than just individuals and hence societal harm is also possible. If we accept that responsible and sustainable data science should also have a collective dimension, a broader focus on data rather than personal data is necessary to formulate standards. Second, in the hyperconnected onlife world of the near future, all data will likely have impact on identifiable individuals and be ‘personal’ in the sense of data protection law (Purtova, 2018), and therefore ‘data’ per se offers the most useful comparison to the environmental resource-base that is the focus of existing scholarship on commons governance.

The sensitivity of much health data is a central factor to consider when making statements about how data ‘should’ be a commons. As Evans notes (2017), although emerging ‘health data commons’ may involve data from hundreds of millions of individuals, ‘the commons’ in relation to health data is often experienced by those individuals as a smaller collective, relating to particular conditions or research priorities. These realities are explicitly in tension with regard to the dual priorities of keeping data and individual identities separate (which is easier with more participants) and governing the data commons in a democratic and participatory way (which is more possible on the local level). How to technically build a health data commons to prevent harm is an issue beyond the scope of this paper, but is the topic of much work in the spheres of philosophy (see for example Sharon, 2016) and in computer science (Verheul and Jacobs, 2017). In this paper we focus on the theoretical, practice-based and institutional perspectives on ways to disincentivise harmful behaviour with health data, which is an important complement to understanding and preventing harm.

The theory of the commons and what it means for data science

Why is the notion of the commons relevant to data science?

Commons theory was originally developed by Elinor Ostrom and her followers, and focuses on the governance of natural resources, such as water basins or fisheries, that are subject to shared use by a group of appropriators (e.g. local communities and industry), where appropriators are difficult to exclude from resource use and that use subtracts from the resource’s quality or quantity. These are the commons or ‘common-pool resources’ (CPR). The goal of the theory is to establish the conditions under which a shared resource is used sustainably, i.e. without degrading in quality or quantity (e.g. Gardner et al., 1990; Ostrom et al., 1994). This aim seems relevant to the current policy, academic and business rhetoric calling for data sharing and common benefits, for example the plea of health researchers to ease access to health data for the purposes of advancing research (Sethi, 2015). If data is discussed as a resource that ought to be put to common use, this suggests commons theory as a useful analytical framework (Purtova, 2017).
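The idea that use ‘subtracts’ from a shared resource can be made concrete with a toy model. The following Python sketch is not taken from the commons literature cited here: it assumes a textbook logistic regrowth rule and entirely hypothetical numbers (stock size, carrying capacity, growth rate, number of users and catch per user), and is meant only to illustrate how an agreed catch limit can keep a shared stock stable while a higher individual catch exhausts it.

```python
def simulate_stock(initial_stock, capacity, growth_rate,
                   harvest_per_user, users, seasons):
    """Return the stock level after each season.

    Assumptions (illustrative only): logistic regrowth and a fixed
    catch per user per season.
    """
    stock = float(initial_stock)
    history = []
    for _ in range(seasons):
        # Logistic regrowth: zero when the stock is empty or at capacity,
        # largest when the stock is at half capacity.
        regrowth = growth_rate * stock * (1.0 - stock / capacity)
        # Every user's catch subtracts from the same shared stock;
        # users cannot catch more than is available.
        catch = min(stock + regrowth, harvest_per_user * users)
        stock = stock + regrowth - catch
        history.append(stock)
    return history


# Hypothetical scenario: 10 users share a lake with capacity 1000.
agreed = simulate_stock(500, 1000, 0.4, harvest_per_user=8, users=10, seasons=50)
unrestrained = simulate_stock(500, 1000, 0.4, harvest_per_user=15, users=10, seasons=50)

print(f"agreed rule, final stock:  {agreed[-1]:.1f}")
print(f"no restraint, final stock: {unrestrained[-1]:.1f}")
```

Under these assumed numbers, a total catch of 80 per season stays below the maximum regrowth of 100 and the stock settles at a stable level, whereas a total catch of 150 exceeds it and the stock collapses. The model says nothing about data as such; it only illustrates the sense of ‘sustainable use’ from which commons theory starts.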

Despite the original focus of commons scholarship on the sustainable management of (often local) natural resources, it has evolved into an entire paradigm of how wealth should be created and managed in and outside the bio-physical domain, locally and globally. Cyberspace, world oceans, atmosphere and the Antarctic are examples of these new global commons (Regan, 2002; United Nations, 2013). Bollier (in the context of the knowledge commons) notes that speaking in terms of the commons helps ‘articulate … concerns and provide a public vernacular for talking about the politics of creativity and knowledge’ (2007: 31). This suggests that applying the Commons analytical framework to data may also serve to redirect the focus of the debate from sharing and drawing common benefit towards the social dilemmas associated with data use and governance, and to the politics of data.

At first sight, there are lessons from commons governance scholarship that offer some striking parallels to the development of law and institutional structures for data governance, especially in the EU. Dietz et al. (2003: 1908) posit that the globalisation of resource production and trade tends to obscure the value of local and ‘traditional solutions such as informal communication and sanctioning’, and indeed we have seen the latter emerging in the reactions of patient representative groups and patients themselves in the UK to health data breaches (Telegraph, 2017). Dietz et al. also provide a core set of requirements for the adaptive governance of the commons which closely parallel current efforts to form and extend data protection provisions in the EU (if we take illegitimate or harmful data use as a comparator for the unsustainable consumption of natural resources). These requirements include systems for providing information about the resource being governed (these can be compared to the reporting requirements that the General Data Protection Regulation (GDPR) places on companies handling data); meaningful and graduated sanctions to induce rule compliance (again, a feature of the GDPR, which authorises fines of up to 4% of an offender’s total worldwide annual turnover for the most serious infringements); and most importantly, preparedness for change, which seems to parallel the GDPR’s risk-based approach where companies handling data must assess the risks of data use dynamically. Dietz et al. emphasise dynamic governance: ‘Fixed rules are likely to fail because they place too much confidence in the current state of knowledge, whereas systems that guard against the low probability, high consequence possibilities and allow for change may be suboptimal in the short run but prove wiser in the long run’ (2003: 1909). This is similarly the assumption embedded in the EU’s data governance instruments: that data is a rich but risky resource where new potential for misuse and harm may emerge as different actors become involved over time.

Key lessons for responsible/sustainable data science

Developing this comparison further requires that we interrogate the similarities and differences between data as a resource, and the kinds of environmental resources that commons scholarship has so far focused on. As Commons scholarship shows, CPRs are complex system resources consisting of a combination of interrelated and interdependent elements. For instance, fisheries have a two-fold structure: stock (e.g. a fishing pond with its unique ecosystem maintaining its population of fish) and subtractable benefits produced by the stock and appropriated by the common-pool users (units of fish caught by fishermen; McGinnis and Walker, 2010: 641–672). Other CPRs have a more complex anatomy. Scientific knowledge commons, according to Hess and Ostrom, comprise three elements: ideas, artifacts (e.g. scholarly publications), and facilities (e.g. libraries) (2003, 2007). Data is similarly a complex resource ecosystem that includes individuals and groups, in relationships with each other and digital infrastructures and institutions in a society, all of whom generate data and are affected by it. We consider these and not data alone to be a common resource. Therefore when we talk about sustainable data science, we ought to talk about the effects of data use not on the data itself, but on individuals and societies, since like many other CPRs the data ecosystem has both local and global levels (Purtova, 2017).

Therefore, in the data commons the problem of sustainability is broader than the physical exhaustion or extinction of, for example, data or digital infrastructures. Achieving sustainability in data science could be understood in terms of preventing turbo-charged data practices, such as political manipulation through large-scale profiling, from undermining the survival of certain social values and interests and leading to the ‘extinction’ of society as we would like it to be, i.e. one characterised by fairness, due process and non-discrimination (Purtova, 2017).

How should such a sustainable data commons be achieved? Each type of commons is unique and therefore commons theory does not offer a universal set of solutions. However, it does generalize that effective commons governance is easier to achieve when the following conditions are present: (i) the resource and its use can be monitored at low cost; (ii) the resources, users, technology, and economic and social conditions are changing at a moderate rate; (iii) stakeholders communicate frequently, which facilitates trust and lowers the cost of monitoring; (iv) outsiders can be excluded from the resource at low cost; and (v) users support effective monitoring and rule enforcement (Dietz et al., 2003). Where data commons are concerned, each of these points presents considerable difficulties. Data processing is ubiquitous (see Prainsack, this issue), the rate of technological change is staggering, there are hardly any institutions in place to facilitate a cross-stakeholder conversation about data, the cost of exclusion from data is prohibitively high for everyone except large information industry players, and finally, the threat of enforcement action and major scandals like the one concerning Facebook and Cambridge Analytica are nearly the only triggers for the key actors to volunteer information about their data practices. In the case of health data, the ‘users’ of the data are researchers, both academic and corporate, and any other actors who draw value from data, who will often have different motivations and interests with regard to how those data should be used and managed. The challenge is to build institutions that facilitate these conditions. The next section will address two issues that are critical for such institution building: stakeholders and the nature of the institutions.

Selected issues of responsible and sustainable data science

The problem of stakeholdership

Understanding who will be affected by a particular data analytic product or service (and hence who stakeholders are) and involving them in the data governance is challenging. This is because in the era of algorithmic data processing people are often unaware their data is being processed, or that they are subjected to data-driven decision making, and because data-driven decisions are often based on data coming from other people or groups (Taylor, 2017). Furthermore, in the case of algorithmic processing, the composition of stakeholder groups may change over time as people are moved in and out of the target group as the objective of processing develops and evolves (Taylor et al., 2017).

Data protection law, the primary legal instrument currently dealing with issues of sustainable data science, relies on individual awareness and rights and does not map well onto many problems raised by data analytics. First, in order to invoke data protection rights, a data subject must be aware that data processing impacting her is taking place. Second, the primary focus of data protection remains the impact on individual data subjects, rather than societal harms. European data protection law does include some consideration of larger societal harms, such as when a data controller balances legitimate interests as grounds for data processing, or assesses whether or not the purpose of processing is legitimate; US privacy law is different in this respect. Moreover, research ethics (as codified in the US system) provides yet another perspective on what is right: Metcalf and Crawford (2016)² note that responsibility towards research subjects must be reconceptualised if data is not collected directly, and if possible harm may occur to others downstream of the research rather than to the experimental subject directly.

This problem of stakeholdership is further complicated by the dimension of time: as more datasets become available, data’s utility will grow, and the larger the pool, the more detailed the analysis. Thus data’s increase in value will be in inverse proportion to its governability. As data is shared for new purposes, understanding of who the stakeholders are will decrease over time. Data may also be de-identified, which makes identifying individual stakeholders difficult but still enables data-driven decision-making through the creation and application of profiles (Taylor et al., 2017).

Location also causes problems for our understanding of stakeholdership. Although some data processing is local and is regulated by nation states, the data market is global and data is gathered and marketed across borders. Hence stakeholder groups are not only local, but also transborder and global. Therefore sustainable data science has to account for what will happen to data gathered in one national system, processed in another, and sold on to yet more users in a third, and for the effects of the algorithms trained on data in one country and used in relation to the population in another.

The questions these problems pose from a Commons governance perspective mainly centre on creating effective monitoring that also has political buy-in from participants in the Commons. Fuster-Morell and Espelt in their study of technology collectives (2018) propose a commons-based framework for the governance of data-related activities where sustainability is achieved through governance that is collectivist, participatory and democratic (characteristics which are related but not necessarily synonymous). Evans (2017: 663), moreover, points out that in a health data commons, non-participating or non-consenting patients are likely to be implicated by the use of extensive datasets, either through inference or through the co-opting of their data via arguments of public good. This insight supports the notion of collective and participatory structures for governance to complement the institutional ones which may be involved on the co-opting side, and it is potentially useful given the primarily regulatory character of the current governance of the data commons. Participatory informal modes of governance and boundary-setting, as seen in the example of NHS health data breaches above (Telegraph, 2017), may be an important complement to regulation given that the opacity of much use and transfer of data currently poses problems of ‘selling’ the benefits of effective monitoring to unscrupulous but powerful users who can make a strong economic case for their dominance of the resource (cf. arguments currently made by technology giants with respect to regulation). This suggests that as well as connecting data governance to individuals, institutions must be present that can exert formal power over these substantial technology firms. We explore next an approach focusing on institutional power and legitimacy in relation to the data market, which may offer more leverage against undesirable and destructive uses of the data commons.

The role of institutions

One way in which a commons framework may help address these problems is by orienting us toward institutional arrangements that have relevance for communities of actors rather than just individuals. The spatial and temporal problems of identifying stakeholders suggest that institutions in the international public domain may be best positioned to respond to the data commons that are global. One such example is the Global Alliance for Genomics and Health, which has developed an ecosystemic approach to internationalising the sharing of genomic data, including networks of international actors in the field, systems for querying databases, and sector-specific ethics processes (Rahimzadeh et al., 2016).

Significant elements of data commons governance and stakeholder communication must be left to local or national institutions (think of a school platform for digitisation of parent–teacher–child communication, or a national forum where consumers, financial services and government decide on data use for credit rating). Yet there is a part of the puzzle that can only be solved by recognising the global data commons and the need for global dialogue and institutions. That is, it is necessary to locate, define and operationalize the idea of an international public domain for debates on data governance. Some relevant fora have been created, for example the International Data Responsibility Group (IDRG) started by multilateral institutions and NGOs working on humanitarian action. Yet meaningful civil society involvement has so far not been achieved by such processes (Taylor, 2016).

One lesson from the IDRG process is that for such institutions to be answerable to the general public on a global level they should be both independent and expert-led, and they should incorporate perspectives and expertise from the civil society, policy and technology domains without weighting the decision-making process towards any one domain. In line with commons theory, such institutions should facilitate a discussion that can result in binding rules; they should be oriented toward finding compromises, and should also be a forum for the stakeholders involved to speak. Given the challenge of knowing who is affected by data processing within a global market, such institutions should be proactive in seeking to understand who the relevant stakeholders are. Building institutions for governance of data commons is a task that is yet to begin.

Conclusions

In this commentary we proposed to use the analytical framework of the commons to define and operationalize responsible and sustainable data science. This is important at a time when not only academic work but also field-building (for example through research funding, humanitarian debates and university program structuring) is being based on these concepts. Taking the Commons perspective, we think of data science as both drawing on, and feeding back into, society. Hence the sustainability of data science should also be assessed by looking at its larger societal effects. While useful for fleshing out collective dimensions of the impact of data science, a Commons approach does not offer ready solutions for sustainable data commons governance. Creating the conditions for successful Commons management (transparency and monitoring of resource use, and relative stability of the resource and of its economic, social and technological contexts) is challenging with regard to digital data. We therefore posit that the key to sustainable data science may lie in institutional governance. In line with Commons theory, the role of governing institutions would be to facilitate communication and trust between stakeholders, and build agreement on rules for sustainable data use. Such institutions also need to be capable of auditing, monitoring and enforcing the rules. Considering the spatial and temporal challenges of identifying data science stakeholders, as well as the cross-border impact of data science, some of those institutions will have to be built on a global level.

Whatever the future agreement on the rules of sustainable data use is, it will constitute a statement about what a given society would like from data science, e.g. in relation to health. It matters that the discussion leading up to this agreement is balanced in terms of participation, i.e. multistakeholder in nature. A balanced discussion should involve not just the voices of data scientists, but also civil society and experts from other fields (social scientists, philosophers, and others). One approach to this has been posited by theories of global (data) justice (e.g. Brock and Moellendorf, 2005; Taylor, 2017) which offer respectively the justification and practical approaches for understanding what constitutes a violation or support of the commons across different societies and cultures. A genuinely broad consensus on what responsibility and sustainability mean is key to data becoming a commons from which benefits can be drawn sustainably.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The contribution written by Nadezhda Purtova reports on the results of the project “Understanding information for legal protection of people against information-induced harms” (‘INFO-LEG’). This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 716971). The paper reflects only the author's view and the ERC is not responsible for any use that may be made of the information it contains. Linnet Taylor's contribution was supported by funding from the European Research Council for the Global Data Justice project (grant agreement No 757247).

Notes

References

Bollier (2007) The growth of the commons paradigm. In: Hess and Ostrom (eds) Understanding Knowledge as a Commons. Cambridge, MA: MIT Press.

Brock G and Moellendorf D (eds) (2005) Current Debates in Global Justice. Vol. 2. Dordrecht: Springer Science & Business Media.

Dietz, Ostrom and Stern (2003) The struggle to govern the commons. Science 302(5652): 1907–1912.

Evans (2017) Barbarians at the gate: Consumer-driven health data commons and the transformation of citizen science. American Journal of Law & Medicine 42(4): 651–685. https://doi.org/10.1177/0098858817700245.

Fuster Morell and Espelt (2018) A framework for assessing democratic qualities in collaborative economy platforms: Analysis of 10 cases in Barcelona. Urban Science 2(3): 61.

Gardner, Ostrom and Walker (1990) The nature of common-pool resource problems. Rationality and Society 2.

Hess and Ostrom (2003) Ideas, artifacts, and facilities: Information as a common-pool resource. Law and Contemporary Problems 66(1&2).

Hess and Ostrom (2007) Introduction: An overview of the knowledge commons. In: Hess and Ostrom (eds) Understanding Knowledge as a Commons. Cambridge, MA: MIT Press.

McGinnis and Walker (2010) Foundations of the Ostrom workshop: Institutional analysis, polycentricity, and self-governance of the commons. Public Choice 143.

Ostrom (2009) A general framework for analyzing sustainability of social-ecological systems. Science 325(5939): 419–422.

Ostrom, Gardner and Walker (1994) Rules, Games, and Common-Pool Resources. Ann Arbor, MI: The University of Michigan Press.

Purtova N (2017) Health data for common good: Defining the boundaries and social dilemmas of data commons. In: Leenes R, Purtova N and Adams S (eds) Under Observation: The Interplay Between eHealth and Surveillance. Springer International.

Purtova N (2018) The law of everything. Broad concept of personal data and the future of European data protection law. Law, Innovation and Technology 10(1).

Rahimzadeh, Dyke and Knoppers (2016) An international framework for data sharing: Moving forward with the global alliance for genomics and health. Biopreservation and Biobanking 14(3): 256–259.

Regan (2002) Privacy as a common good in a digital world. Information, Communication & Society 5(3).

Sethi (2015) The promotion of data sharing in pharmacoepidemiology. European Journal of Health Law 21.

Sharon (2016) The Googlization of health research: From disruptive innovation to disruptive ethics. Personalized Medicine 13(6): 563–574.

Taylor L (2016) The ethics of big data as a public good: Which public? Whose good? Philosophical Transactions of the Royal Society A 374(2083): 20160126.

Taylor L (2017) What is data justice? The case for connecting digital rights and freedoms globally. Big Data & Society 4(2).

Taylor L, Floridi L and van der Sloot B (2017) Introduction. In: Taylor L, Floridi L and van der Sloot B (eds) Group Privacy: New Challenges of Data Technologies. Springer.

Telegraph (2017) Security breach fears over 26 million NHS patients. Available at: https://www.telegraph.co.uk/news/2017/03/17/security-breach-fears-26-million-nhs-patients/ (accessed 1 May 2019).

United Nations (2013) Global governance and governance of the global commons in the global partnership for development beyond 2015. Available at: www.un.org (accessed 1 May 2019).

van der Aalst WMP, Bichler and Heinzl (2017) Responsible data science. Business & Information Systems Engineering 59: 311.

Verheul and Jacobs (2017) Polymorphic encryption and pseudonymisation in identity management and medical research. Nieuw Archief voor Wiskunde NAW 5/18(3): 168–172.