Big Data ethics

Abstract

The speed of development in Big Data and associated phenomena, such as social media, has surpassed the capacity of the average consumer to understand his or her actions and their knock-on effects. We are moving towards changes in how ethics has to be perceived: away from individual decisions with specific and knowable outcomes, towards actions by many who are unaware that what they do may have unintended consequences for anyone. Responses will require a rethinking of ethical choices, the lack thereof and how this will guide scientists, governments, and corporate agencies in handling Big Data. This essay elaborates on the ways Big Data impacts on ethical conceptions.

Keywords

Ethics, moral responsibility, societal impact, education, Big Data governance

On 21 September 2012, a crowd of 3000 rioting people visited a 16-year-old girl’s party at home in the little village of Haren, the Netherlands, after she had mistakenly posted a birthday party invite publicly on Facebook (BBC, 2012). Some might think that the biggest ethical and educational challenge that modern technology is posing concerns mainly children. It seems, however, that particularly with the emergence of Big Data, ethicists have to reconsider some traditional ethical conceptions.

Since the onset of modern ethics in the late 18th century with Hume, Kant, Bentham, and Mill, we have taken premises such as individual moral responsibility for granted. Today, however, it seems Big Data requires ethics to rethink some of its assumptions, particularly about individual moral agency. The novelty of Big Data poses ethical difficulties (such as for privacy), which are not per se new. These ethical questions, which are commonly known and understood, are also widely discussed in the media. For example, they resurface in the context of the Snowden revelations and the respective investigations by The Guardian concerned with the capabilities of intelligence agencies (The Guardian, 2013b). But its novelty would not be the sole reason for having to rethink how ethics works. In addition to its novelty, the very nature of Big Data has an underestimated impact on the individual’s ability to understand its potential and make informed decisions. Much less commonly discussed, therefore, are the ethical implications of impersonal data. Examples include, among others, the “likes” on Facebook sold to marketing companies in order to target certain micro-markets more specifically, and information generated from sentiment analyses of Twitter feeds for the political manipulation of groups.

This essay aims to underline how certain principles of our contemporary philosophy of ethics might be changing and might require a rethinking in philosophy, professional ethics, policy-making, and research. First, it will briefly outline the traditional ethical principles with regard to moral responsibility. Thereafter, it will summarize four qualities of Big Data with ethical relevance. The third section delves deeper into the idea of the changing nature of power and the emergence of hyper-networked ethics; and the fourth section illustrates which ethical problems might emerge in society, politics and research due to these changes.

Traditional ethics

Since the Enlightenment, traditional deontological and utilitarian ethics have placed a strong emphasis on the moral responsibility of the individual, often also called moral agency (MacIntyre, 1998). This idea of moral agency very much stems from almost religiously followed assumptions about individualism and free will. Both assumptions are challenged by the advancement of modern technology, particularly Big Data. The degree to which an entity possesses moral agency determines the responsibility of that entity. Moral responsibility, in combination with extraneous and intrinsic factors that escape the will of the entity, defines the culpability of this entity. In general, moral agency is determined by several conditions innate to the entity, three of which are commonly agreed upon (Noorman, 2012):

Causality: An agent can be held responsible if the ethically relevant result is an outcome of its actions.

Knowledge: An agent can be blamed for the result of its actions if it had (or should have had) knowledge of the consequences of its actions.

Choice: An agent can be blamed for the result if it had the liberty to choose an alternative without greater harm for itself.

Implicitly, observers tend to exculpate agents if they did not possess full moral agency, i.e. when at least one of the three criteria is absent. There are, however, lines of reasoning that consider morally relevant outcomes independently of the existence of a moral agency, at least in the sense that negative consequences establish moral obligations (Leibniz and Farrer, 2005; Pogge, 2002). New advances in ethics have been made in network ethics (Floridi, 2009), the ethics of social networking (Vallor, 2012), distributed and corporate moral responsibility (Erskine, 2004), as well as computer and information ethics (Bynum, 2011). Still, Big Data has introduced further changes, such as the philosophical problem of ‘many hands’, i.e. the effect of many actors contributing to an action in the form of distributed morality (Floridi, 2013; Noorman, 2012), which need to be raised.

Four qualities of Big Data

When recapitulating the core criteria of Big Data, it will become clear that the ethics of Big Data moves away from a personal moral agency in some instances. In other cases, it increases moral culpability of those that have control over Big Data. In general, however, the trend is towards an impersonal ethics based on consequences for others. Therefore, the key qualities of Big Data, as relevant for our ethical considerations, shall be briefly examined. At the heart of Big Data are four ethically relevant qualities:

(1) There is more data than ever in the history of data (Smolan and Erwitt, 2012):

Beginning of recorded history till 2003: 5 billion gigabytes

2011: 5 billion gigabytes every two days

2013: 5 billion gigabytes every 10 min

2015: 5 billion gigabytes every 10 s

(2) Big Data is organic: although this comes with messiness, by collecting everything that is digitally available, Big Data represents reality digitally much more naturally than statistical data; in this sense it is much more organic. This messiness is, apart from other sources such as format inconsistencies and measurement artifacts, the result of representing the messiness of reality. It does allow us to get closer to a digital representation of reality.

(3) Big Data is potentially global: not only is the representation of reality organic; with truly huge Big Data sets (like Google's) the reach also becomes global.

(4) Correlations versus causation: Big Data analyses emphasize correlations over causation.

Certainly, not all data potentially falling into the category of Big Data is generated by humans or concerns human interaction. The Sloan Digital Sky Survey in New Mexico generated 140 terabytes of data between 2000 and 2010. Its successor, the Large Synoptic Survey Telescope in Chile, when starting its work in 2016, will collect as much within five days (Mayer-Schönberger and Cukier, 2013). There is, however, also a large spectrum of data that relates to people and their interaction directly or indirectly: social network data, the growing field of health tracking data, emails, text messaging, the mere use of the Google search engine, etc. This latter kind of data, even if it does not constitute the majority of Big Data, can be ethically very problematic.

New power distributions

Ethicists constantly try to catch up with modern-day problems (drones, genetics, etc.) in order to keep ethics up-to-date. Many books on computer ethics and cyber ethics have been written in the past three decades since, among others, Johnson (1985) and Moor (1985) established the field. For Johnson, computer ethics “pose new versions of standard moral problems and moral dilemmas, exacerbating the old problems, and forcing us to apply ordinary moral norms in uncharted realms” (Johnson, 1985: 1). This changes to some degree with Big Data as moral agency is being challenged on certain fundamental premises that most of the advancements in computer ethics took and still take for granted, namely free will and individualism. Moreover, in a hyperconnected era, the concept of power, which is so crucial for ethics and moral responsibility, is changing into a more networked fashion. Retaining the individual’s agency, i.e. knowledge and ability to act, is one of the main challenges for the governance of socio-technical epistemic systems, as Simon (2013) concludes.

There are three categories of Big Data stakeholders: Big Data collectors, Big Data utilizers, and Big Data generators. Between the three, power is inherently relational in the sense of a network definition of power (Hanneman and Riddle, 2005). In general, actor A’s power is the degree to which B is dependent on A or, alternatively, the degree to which A can influence B. That means that A’s power vis-à-vis C can be different. The more connections A has, the more power he or she can exert. This is referred to as micro-level power and is understood through the concept of centrality (Bonacich, 1987). On the macro-level, the whole network (of all actors A–B–C–D…) has an overall inherent power, which depends on the density of the network, i.e. the number of edges between the nodes. In terms of Big Data stakeholders, this could mean that we find these new stakeholders wielding a lot of power:

(a) Big Data collectors determine which data is collected, which is stored and for how long. They govern the collection, and implicitly the utility, of Big Data.

(b) Big Data utilizers are on the utility production side. While (a) might collect data with or without a certain purpose, (b) (re-)define the purpose for which data is used, for example regarding:

determining behavior by imposing new rules on audiences or manipulating social processes;

creating innovation and knowledge through bringing together new datasets, thereby achieving a competitive advantage.

(c) Big Data generators:

natural actors that by input or any recording voluntarily, involuntarily, knowingly, or unknowingly generate massive amounts of data;

artificial actors that create data as a direct or indirect result of their task or functioning;

physical phenomena, which generate massive amounts of data by their nature or which are measured in such detail that it amounts to massive data flows.
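The network notion of power used here (micro-level centrality and macro-level density) can be made concrete with a small sketch. It assumes a hypothetical four-actor network A–B–C–D and uses plain degree centrality as a simple stand-in; it is not Bonacich's (1987) power measure, and the edge list is invented purely for illustration.

```python
def degree_centrality(edges, nodes):
    """Share of the other nodes that each node is directly connected to."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {n: len(neighbours[n]) / (len(nodes) - 1) for n in nodes}


def density(edges, nodes):
    """Existing undirected edges divided by the number of possible edges."""
    possible = len(nodes) * (len(nodes) - 1) / 2
    unique_edges = {frozenset(e) for e in edges}
    return len(unique_edges) / possible


# Hypothetical network: A is connected to everyone, D only to A.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

centrality = degree_centrality(edges, nodes)
network_density = density(edges, nodes)

for node in nodes:
    print(node, round(centrality[node], 2))
print("density", round(network_density, 2))
```

In this toy network, A has the most connections and therefore the highest centrality, while D depends entirely on A for its link to the rest of the network.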

The interaction between these three stakeholders illustrates power relationships and already gives us an entirely different view on individual agency, namely an agency that is, for its capability of morally relevant action, entirely dependent on other actors. One could call this agency ‘dependent agency’, for its capability to act depends on other actors. Floridi refers to these moral enablers, which hinder or facilitate moral action, as infraethics (Floridi, 2013). The network nature of society, however, means that this dependent agency is always a factor when judging the moral responsibility of the agent. In contrast to traditional ethics, where knock-on effects in a social or cause–effect network (that is, effects on third, mostly unrelated parties, as for example in collateral damage scenarios) play only a minor role, Big Data-induced hyper-networked ethics exacerbates such network knock-on effects. In other words, the nature of hyper-networked societies exacerbates the collateral damage caused by actions within this network. This changes foundational assumptions about ethical responsibility by changing what power is and the extent to which we can talk of free will, since it reduces the knowable outcomes of actions while increasing unintended consequences.

Some ethical Big Data challenges

When going through the four ethically relevant qualities of Big Data above, the ethical challenges become increasingly clear. Ad (1) and (2): just as global warming is an effect of the emissions of many individuals and companies, Big Data is the effect of individual actions, sensory data, and other real-world measurements creating a digital image of our reality. Cukier (2013) calls this “datafication”. Already, simply the absence of knowledge about which data is in fact collected or what it can be used for puts the “data generator” (e.g. online consumers, cellphone owners, etc.) at an ethical disadvantage with regard to knowledge and free will. The “internet of things” further contributes to the distance between one actor’s knowledge and will and the other actor’s source of information and power. Ad (3): global data leads to a power imbalance between different stakeholders, benefitting mostly corporate agencies with the necessary know-how to generate intelligence and knowledge from information. Ad (4): like a true Delphian oracle, Big Data correlations suggest causations where there might be none. We become more vulnerable to having to believe what we see without knowing the underlying whys.

Privacy

The more our lives become mirrored in a cyber reality and recorded, the more our present and past become almost completely transparent for actors with the right skills and access (Beeger, 2013). The Guardian revealed that Raytheon (a US defense contractor) developed the Rapid Information Overlay Technology (RIOT) software, which uses freely accessible data from social networks and data associated with an IP address, etc., to profile one person and make their everyday actions completely transparent (The Guardian, 2013a).

Group privacy

Data analysts are using Big Data to find out our shopping preferences, health status, sleep cycles, moving patterns, online consumption, friendships, etc. In only a few cases, and mostly in intelligence circles, this information is individualized. De-individualization (i.e. removing elements that allow data to be connected to one specific person) is, however, just one aspect of anonymization. Location, gender, age, and other information relevant for membership of a group, and thus valuable for statistical analysis, relate to the issue of group privacy. Anonymization of data is, thus, a matter of degree: how many and which group attributes remain in the data set. To strip data of all elements pertaining to any sort of group membership would mean to strip it of its content. In consequence, despite the data being anonymous in the sense of being de-individualized, groups are becoming ever more transparent. This issue was already raised by Dalenius (1977) for statistical databases and later by Dwork (2006), in the form of the demand that “nothing about an individual should be learnable from the database that cannot be learned without access to the database”. This information gathered from statistical data and increasingly from Big Data can be used in a targeted way to get people to consume or to behave in a certain way, e.g. through targeted marketing. Furthermore, if different aspects about the preferences and conditions of a specific group are known, these can be used to employ incentives to encourage or discourage a certain behavior. For example, knowing that group A has a preference α (e.g. ice cream) and a majority of the same group has a condition β (e.g. being undecided about which party to vote for), one can offer α to get this group to behave in the domain of β in a specific way by creating a conditionality (e.g. if one votes for party B one gets ice cream). This is standard party politics; however, with Big Data the ability to discover hidden correlations increases, which in turn increases the ability to create incentives whose purposes are less transparent.
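The point that de-individualized data still makes groups transparent can be sketched in a few lines. The records, attribute names, and values below are entirely hypothetical and carry no names or identifiers; the sketch merely groups them by the remaining group attributes and reports how common a condition is in each group.

```python
from collections import defaultdict

# Hypothetical, de-individualized records: no names, no identifiers,
# only group attributes (area, age band) and one condition.
records = [
    {"area": "north", "age_band": "18-29", "undecided_voter": True},
    {"area": "north", "age_band": "18-29", "undecided_voter": True},
    {"area": "north", "age_band": "18-29", "undecided_voter": True},
    {"area": "north", "age_band": "18-29", "undecided_voter": False},
    {"area": "south", "age_band": "30-44", "undecided_voter": True},
    {"area": "south", "age_band": "30-44", "undecided_voter": False},
    {"area": "south", "age_band": "30-44", "undecided_voter": False},
    {"area": "south", "age_band": "30-44", "undecided_voter": False},
]


def group_rates(rows, keys, attribute):
    """Share of rows per group (defined by `keys`) for which `attribute` holds."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        if row[attribute]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


rates = group_rates(records, ["area", "age_band"], "undecided_voter")
for group, rate in rates.items():
    print(group, rate)
```

Although no single person can be identified in such a data set, the group profile remains fully visible, and this is all that targeted incentives of the kind described above would need.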

Conversely, hyper-connectivity also allows for other strategies: bots that infiltrate Twitter (the so-called Twitter bombs) are meant to create fake debates about, for example, a political party, which human audiences then falsely perceive as legitimate grassroots debates. This practice is called “astroturfing” and is prohibited by Twitter policies, which, however, does not prevent political campaigners from doing it. The contest between Coakley and Brown in the 2010 special election in Massachusetts to fill the Senate seat formerly held by Ted Kennedy (won by the Republican Brown) might have been decided by exactly such a bot, which created a Twitter smear campaign in the form of a fake public debate (Ehrenberg, 2012). A 2013 report showed that in fact 61.5% of website visitors were bots (with an increasing tendency). Half of this traffic consisted of “good bots” necessary for search engines and other services; the other half consisted of malicious bot types such as scrapers (5%), hacking tools (4.5%), spammers (0.5%), and impersonators (20.5%) for the purpose of market intelligence and manipulation (Zeifman, 2013).

Propensity

The movie Minority Report painted a vision of a future in which predictions about what people were likely to do could lead to their incarceration without an act committed. While the future might not be as bad as depicted in the movie, “predictive policing” is already a fact in cities like Los Angeles, where Big Data analytics point to certain streets, gangs or individuals who are more likely to commit a crime, in order to have them subjected to extra surveillance (Mayer-Schönberger and Cukier, 2013; Perry et al., 2013). The problem is very much a political one: a high probability that a certain person will commit a murder cannot be ignored without major public criticism if nothing is done to prevent it. Another example puts the stakes somewhat lower: what if Big Data analytics predict that a certain person (e.g. a single parent living in a certain neighborhood, with no job, a car, no stable relationship, etc.) has a 95% likelihood of being involved in domestic violence? No social welfare organization having such information would politically be able not to act on it. Sending social workers to the person’s house might not be as invasive as incarcerating people before the deed, and it also does not violate the presumption of innocence. However, this might cause a stigma on the person, the family, and friends. Furthermore, this raises questions about the ethical role of those setting the intervention threshold and the data scientists writing the algorithm that calculates the chance based on certain variables available in the Big Data pool. One of the key changes in Big Data research is that data scientists let algorithms search for correlations themselves. This can often lead to surprise findings, e.g. the very famous Wal-Mart finding of increased Pop-Tart purchases before hurricanes (Hays, 2004). When searching for random commonalities (through data mining), the more data we have, the more commonalities we are bound to find. Big Data makes random connectedness on the basis of random commonalities extremely likely. In fact, no connectedness at all would be the outlier. This, in combination with social network analysis, might yield information that is not only highly invasive into one’s privacy, but can also establish random connections based on incidental co-occurrences. In other words, Big Data increases the likelihood of random findings, something that should be critically observed with regard to investigative techniques such as RIOT.
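The claim that more data produces more random commonalities can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the variables are pure random noise, and the sample size (20 observations), the number of variables (5 versus 60), and the threshold of 0.5 are arbitrary illustrative choices rather than values taken from any of the cited studies.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def count_strong_pairs(variables, threshold=0.5):
    """Number of variable pairs whose absolute correlation exceeds the threshold."""
    return sum(
        1
        for x, y in combinations(variables, 2)
        if abs(pearson(x, y)) > threshold
    )


rng = random.Random(0)  # fixed seed so that the run is repeatable
# 60 unrelated noise variables with 20 observations each.
variables = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(60)]

few = count_strong_pairs(variables[:5])  # 10 possible pairs
many = count_strong_pairs(variables)     # 1770 possible pairs
print("strong pairs among 5 variables:", few)
print("strong pairs among 60 variables:", many)
```

None of these variables is related to any other by construction, yet the number of seemingly strong correlations grows with the number of variables that are compared, simply because the number of pairs grows much faster than the number of variables.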

Research ethics

Ethical codes and standards with regard to research ethics lag behind this development. While in many instances research ethics concerns the question of privacy, the use of social media such as Twitter and Facebook for research purposes, even in anonymous form, remains an open question. On the one hand, Facebook is the usual suspect to be mentioned when it comes to questions of privacy. On the other hand, this discussion hides the fact that a lot of non-personal information can also reveal much about very specific groups in very specific geographical relations. In other words, individual information might be interesting for investigative purposes of intelligence agencies, but the actually valuable information for companies does not require the individual tag. This is again a problem of group privacy. The same is true for research ethics. Many ethical research codes do not yet consider the non-privacy-related ethical effect (see, for example, BD&S’ own statement “preserving the integrity and privacy of subjects participating in research”). Research findings that reveal uncomfortable information about groups will become the next hot topic in research ethics, e.g. researchers who use Twitter are able to tell uncomfortable truths about specific groups of people, potentially with negative effects on the researched group.¹ Another problem is informed consent: even though the data is already public, hardly anyone expects to suddenly become the subject of research in Twitter or Facebook studies. However, in order to represent and analyze pertinent social phenomena, some researchers collect data from social media without considering that the lack of informed consent would in any other form of research (think of psychological or medical research) constitute a major breach of research ethics.

Conclusions

Does Big Data change everything, as Cukier and Mayer-Schönberger have proclaimed? This essay tried to indicate that Big Data might induce certain changes to traditional assumptions of ethics regarding individuality, free will, and power. This might have consequences in many areas that we have taken for granted for so long.

In the sphere of education, children, adolescents, and grown-ups still need to be educated about the unintended consequences of their digital footprints (beyond digital literacy). Social science research might have to consider this educational gap and draw its conclusions about the ethical implications of using anonymous, social Big Data, which nonetheless reveals much about groups. In the area of law and politics, I see three likely developments:

political campaign observers, think tank researchers, and other investigators will increasingly become specialized data forensic scientists in order to investigate new kinds of digital manipulation of public opinion;

law enforcement and social services as much as lawyers and legal researchers will necessarily need to re-conceptualize individual guilt, probability and crime prevention; and

states will progressively redesign the way they develop their global strategies based on global data and algorithms rather than regional experts and judgment calls.

When it comes to Big Data ethics, it seems not to be an overstatement to say that Big Data does have strong effects on assumptions about individual responsibility and power distributions. Eventually, ethicists will have to continue to discuss how we can and how we want to live in a datafied world and how we can prevent the abuse of Big Data as a newfound source of information and power.

Acknowledgement

The author wishes to thank Barteld Braaksma, Anno Bunnik and Lawrence Kettle for their help and feedback as well as the editors and the anonymous reviewers for their invaluable insights and comments.

Declaration of conflicting interest

The author declares that there is no conflict of interest.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

BBC (2012) Dutch party invite ends in riot. BBC News. Available at: http://www.bbc.co.uk/news/world-europe-19684708 (accessed 31 March 2014).

Beeger B (2013) Internet Der gläserne Mensch. FAZ.NET. Available at: http://www.faz.net/aktuell/wirtschaft/internet-der-glaeserne-mensch-12214568.html?printPagedArticle=true (accessed 11 March 2014).

Bonacich (1987) Power and centrality: a family of measures. American Journal of Sociology 92(5): 1170–1182.

Bynum T (2011) Computer and information ethics. In: Zalta EN (ed) The Stanford Encyclopedia of Philosophy. Available at: http://plato.stanford.edu/archives/spr2011/entries/ethics-computer/ (accessed 23 July 2014).

Cukier K (2013) Kenneth Cukier (data editor, The Economist) speaks about Big Data. Available at: https://www.youtube.com/watch?v=R-bypPCIE9g (accessed 11 March 2014).

Dalenius (1977) Towards a methodology for statistical disclosure control. Statistik Tidskrift 5: 429–444.

Dwork C (2006) Differential privacy. In: Bugliesi M, et al. (eds) Automata, Languages and Programming. Berlin, Heidelberg: Springer, pp.1–12. Available at: http://link.springer.com/chapter/10.1007/11787006_1 (accessed 23 July 2014).

Ehrenberg (2012) Social media sway: worries over political misinformation on Twitter attract scientists’ attention. Science News 182(8): 22–25.

Elson SB, Yeung D, Roshan P, et al. (2012) Using Social Media to Gauge Iranian Public Opinion and Mood after the 2009 Election. Santa Monica, CA: RAND Corporation. Available at: http://www.jstor.org/stable/10.7249/tr1161rc (accessed 31 March 2014).

Erskine T (2004) Can Institutions Have Responsibilities? Collective Moral Agency and International Relations. Houndmills, Basingstoke, Hampshire: Palgrave Macmillan.

Floridi (2013) Distributed morality in an information society. Science and Engineering Ethics 19(3): 727–743.

Floridi (2009) Network ethics: information and business ethics in a networked society. Journal of Business Ethics 90: 649–659.

Hanneman RA and Riddle M (2005) Introduction to Social Network Methods. Riverside, CA: University of California. Available at: http://faculty.ucr.edu/~hanneman/nettext/C10_Centrality.html (accessed 29 August 2013).

Hays CL (2004) What Wal-Mart knows about customers’ habits. The New York Times. Available at: http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html (accessed 11 March 2014).

Johnson (1985) Computer Ethics, 1st ed. Englewood Cliffs, NJ: Prentice-Hall.

Leibniz GW and Farrer A (2005) Theodicy: essays on the goodness of God, the freedom of man and the origin of evil. Available at: http://www.gutenberg.org/ebooks/17147 (accessed 16 August 2012).

MacIntyre (1998) A Short History of Ethics, 2nd ed. London: Routledge.

Mayer-Schönberger and Cukier (2013) Big Data: A Revolution that Will Transform How We Live, Work, and Think. Boston: Houghton Mifflin Harcourt.

Moor (1985) What is computer ethics? Metaphilosophy 16(4): 266–275.

Noorman M (2012) Computing and moral responsibility. In: Zalta EN (ed) The Stanford Encyclopedia of Philosophy. Available at: http://plato.stanford.edu/archives/fall2012/entries/computing-responsibility/ (accessed 31 March 2014).

Perry WL, McInnis B, Price CC, et al. (2013) Predictive Policing: The Role of Crime Forecasting in Law Enforcement Operations. Santa Monica, CA: RAND Corporation. Available at: http://www.jstor.org/stable/10.7249/j.ctt4cgdcz (accessed 31 March 2014).

Pogge (2002) Moral universalism and global economic justice. Politics Philosophy Economics 1(1): 29–58.

Simon J (2013) Distributed epistemic responsibility in a hyperconnected era. Available at: http://ec.europa.eu/digital-agenda/sites/digital-agenda/files/Contribution_Judith_Simon.pdf (accessed 31 March 2014).

Smolan and Erwitt (2012) The Human Face of Big Data. Sausalito, CA: Against All Odds Productions.

The Guardian (2013a) How secretly developed software became capable of tracking people’s movements online. Available at: http://www.youtube.com/watch?v=O1dgoQJAt6Y&feature=youtube_gdata_player (accessed 11 March 2014).

The Guardian (2013b) The NSA files. World News, the Guardian. Available at: http://www.theguardian.com/world/the-nsa-files (accessed 2 April 2014).

Vallor S (2012) Social networking and ethics. In: Zalta EN (ed) The Stanford Encyclopedia of Philosophy. Available at: http://plato.stanford.edu/archives/win2012/entries/ethics-social-networking/ (accessed 6 September 2013).

Zeifman I (2013) Bot Traffic is up to 61.5% of All Website Traffic. Incapsula.com. Available at: http://www.incapsula.com/blog/bot-traffic-report-2013.html (accessed 3 April 2014).

Zimmer (2010) “But the data is already public”: on the ethics of research in Facebook. Ethics and Information Technology 12(4): 313–325.