Abstract

How are data science systems made to work? It may seem that whether a system works is a function of its technical design, but it is also accomplished through ongoing forms of discretionary work by many actors. Based on six months of ethnographic fieldwork with a corporate data science team, we describe how actors involved in a corporate project negotiated what work the system should do, how it should work, and how to assess whether it works. These negotiations laid the foundation for how, why, and to what extent the system ultimately worked. We describe three main findings. First, how already-existing technologies are essential reference points to determine how and whether systems work. Second, how the situated resolution of development challenges continually reshapes the understanding of how and whether systems work. Third, how business goals, and especially their negotiated balance with data science imperatives, affect a system’s working. We conclude with takeaways for critical data studies, orienting researchers to focus on the organizational and cultural aspects of data science, the third-party platforms underlying data science systems, and ways to engage with practitioners’ imagination of how systems can and should work.

Keywords

Corporate data science critical data studies data science ethnography human work invisible work workability

Introduction

Data science is the practice of analyzing large-scale data using techniques drawn from domains such as machine learning, artificial intelligence (AI), and natural language processing. Building data science systems is a laborious process, requiring extensive amounts of technical work.¹ Unsurprisingly, dominant narratives about the working of such systems—what work they do, how they work, and how to assess their working—remain technology centered, comprising formal accounts of algorithmic steps or performance metrics. Still, although models and numbers are vital to the process, building data science systems is a sociotechnical endeavor that requires not only technical but also human work (Baumer, 2017; Dourish and Cruz, 2018; Passi and Jackson, 2017, 2018). Thus, understanding how these systems work requires also clarifying the human work this entails.

A good example is the work of problem formulation (Hand, 1994)—translating high-level goals into data-driven problems. During problem formulation, an essential first step, practitioners outline the system’s intended working in the service of given goals. Research has shown how the expected working of a data science system is not given but negotiated in the problem formulation stage through “discretionary judgments of various actors and further affected by choice of methods, instruments, and data” (Passi and Barocas, 2019: 46). “Even the simplest piece of software has embedded within it a series of architectural decisions about what ‘works’ regarding the purposes for which it was created” (Shaw, 2015: 2).

Even after problem formulation, practitioners’ conceptions of how, whether, or in what ways their systems work remain in flux. In this paper, we unpack the ongoing forms of human work involved in a corporate project to show how a data science system’s working is not stable but remains in the making throughout the project. We situate “working” as a system’s ability to work as intended from the perspective of the practitioners who build the system. We use phrases such as “the system now needed to work differently” to call out changes in practitioners’ expectations of the system’s working, highlighting changes done to align system working with shifting expectations. As exercises in “collective authorship” (Seaver, 2019: 418), building data science systems—and making them work—requires enormous subjective judgment.

In fact, we show that even determining what aspects of a system work or do not work is not always obvious or numerically determinable. One reason for this is that a system’s working is multifaceted. The system works or does not work in distinct ways for different actors. Data scientists, for instance, often describe a system’s working via performance metrics (Rieder and Simon, 2016). As artifacts of “algorithmic witnessing” (Passi and Jackson, 2018), numbers remain tightly coupled with those aspects of working “that are most readily computationally quantifiable” (Baumer, 2017). Project managers, however, define working through the lens of business use cases, and product managers prioritize compatibility and feasibility as essential aspects of working—articulations embedded in broader organizational imperatives. “Mathiness” (Lipton and Steinhardt, 2018) is but one feature of a data science system whose eventual working is a “practical accomplishment” (Garfinkel, 1967), marked by diverse forms of sociotechnical work.

This paper addresses the question: how are data science systems made to work? Unpacking this question is especially important with the growing impact of data science in virtually every sphere of modern life. When the (wrong)doings of systems become visible, system builders may argue that their systems are, in fact, not at fault—they were not designed to work this way! Such misalignments between systems’ intended and situated working are commonplace. One reason for this is that the perspective of people facing the most significant effects from a system’s output is often not included in the system’s design and development process (Binns et al., 2018). This absence creates a “de-coupling” between how system builders engineer, users use, and data subjects experience systems (Baumer, 2017). In such situations, without a clear understanding of the sociotechnical work involved in building data science systems, it is challenging to locate actionable sites of accountability and intervention. Critical data studies researchers strive to unpack the implications of data science systems, ranging from how they shape professional practices (Dudhwala and Larsen, 2019) and knowledge production (Bechmann and Bowker, 2019; Miles, 2019) to how they enable surveillance (Aradau and Blanke, 2015) and marginalization (Buolamwini and Gebru, 2018). Such research efforts aim to not only make visible the implications of data science systems but also shape their design and development practices.

Engaging with the design of data science systems, however, is challenging. Opening the data science black box is neither easy nor straightforward. Analyzing such systems requires multiple forms of knowledge. Access to data science practitioners, especially those working in corporations, remains restricted. Critical data studies scholarship on the working of data science systems thus centers on their use, rather than their design—their impact in the way they work as opposed to how and why practitioners build them to work the way they do. As researchers, we know that building technological systems is a non-linear process, requiring extensive amounts of collaborative, discretionary, and situated work (Suchman, 1987; Suchman et al., 2002, 1999). However, regarding data science, as we continue to focus on the implications of data science systems, we know little about how and why practitioners build such systems to work in some ways and not in others. If we, as researchers, have stakes in how data science systems should work, we must attempt to understand how these systems are made to work.

In this paper we address this gap, contributing to a growing body of research on the design practices of data science systems (e.g., Amershi et al., 2019; Muller et al., 2019; Passi and Barocas, 2019; Passi and Jackson, 2018; Saltz and Grady, 2017; Saltz and Shamshurin, 2015). Based on ethnographic fieldwork with a corporate data science team, we describe how actors negotiate central aspects of a system’s working: what work the system should do, how the system should work, and how to assess whether the system works. These negotiations, we show, lay the foundation for how, why, and to what extent a system ultimately works the way it does. We focus on a specific project—a self-help legal chatbot—to highlight these negotiations and the invisible human work that goes into determining whether and how systems work. Through a detailed recounting of corporate data science in action, we develop a more general account of how a system’s working is not only affected by algorithmic and technical design but also continually reworked in light of competing technologies, implementation challenges, and business goals.

In the following sections, we describe our research site and methodology before moving to the empirical case study. We conclude with our findings, describing ways for the field of critical data studies to move forward based on them.

Research site and methods

This paper builds on six months of ethnographic fieldwork with Aurelion,² a multi-billion-dollar technology corporation based on the US West Coast. To gain immersive access to ordinary work practices, the first author worked as a data scientist at the corporation during this time, serving as a lead on two business projects (not reported in this paper) and taking part in several others. Aurelion owns several companies across domains such as health and law. There are multiple teams of project managers, business analysts, and software developers at Aurelion and its subsidiaries. Aurelion’s data science team works with product teams to build diverse business applications using data science. During fieldwork, Aurelion’s data science team had 8–11 members (including the first author).³ Remy—Director of Data Science with 30+ years of experience managing projects in several major technology firms—heads the data science team. Remy and his team report to Chief Technology Officer (CTO) Harper, who has 20+ years of industry experience.

We conducted 52 interviews with project managers, business analysts, data scientists, software engineers, and corporate executives. The collected data also included 426 pages of fieldwork notes and 104 photographs. We used principles of grounded theory analysis to code interview and fieldwork data (Charmaz, 2014; Strauss and Corbin, 1990). Through two rounds of in-vivo and thematic coding,⁴ we identified several forms of human work through which actors establish what systems should do, in what ways, and to what extent, the three most salient of which we report below. The first author headed the empirical analysis, including multiple rounds of discussions with four other researchers. We organized our interview and fieldwork data in two ways: (a) categorized by projects (e.g., self-help legal chatbot) and (b) categorized by professional groups (e.g., analysts, scientists, managers, and executives). The former helped to analyze themes within and across projects (e.g., problems specific to certain kinds of projects), and the latter helped to examine similarities and differences between groups (e.g., how different professionals expect a system to work). In this paper, we focus on one project for its salience to the paper’s theme, but we observed similar dynamics across other projects. The chatbot project was in its initial stage when we began fieldwork. The first author was only an observer in this project.

Empirical case study: Self-help legal chatbot

Law&You, an Aurelion subsidiary, offers online marketing services to attorneys, lawyers, and law firms. Clients pay to integrate Law&You’s digital tools into their websites to convert website visitors into paying customers. One such tool is an online chat module that connects users with human chat operators, who guide users toward self-help legal resources or collect information on users who need professional legal services. If a user needs professional help, the operator collects data such as the user’s name, address, and contact, and forwards it to the client.

In late 2016, Law&You started replacing human chat operators with automated guided chats. A guided chat is a scripted list of options presented to the user one at a time. Depending on the user’s selection, guided chat moves to the next set of options. Guided chat generates its initial options based on, for instance, keyword analysis. If the user’s request is, for example, “I want to file for bankruptcy,” guided chat will identify the keyword “bankruptcy” and provide three options: Personal Bankruptcy, Corporate Bankruptcy, and Something Else.
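The keyword-driven first step of a guided chat can be illustrated with a minimal sketch. This is not Law&You's implementation: the names `GUIDED_SCRIPT` and `initial_options` are hypothetical, and only the bankruptcy entry is taken from the description above.

```python
from typing import Dict, List

# Hypothetical script table mapping a keyword to its scripted options.
# Only the "bankruptcy" entry comes from the example in the text.
GUIDED_SCRIPT: Dict[str, List[str]] = {
    "bankruptcy": ["Personal Bankruptcy", "Corporate Bankruptcy", "Something Else"],
}


def initial_options(request: str) -> List[str]:
    """Return the scripted options for the first keyword found in the request."""
    text = request.lower()
    for keyword, options in GUIDED_SCRIPT.items():
        if keyword in text:
            return options
    # No scripted keyword appears in the request, so the script has nothing to offer.
    return []


print(initial_options("I want to file for bankruptcy"))
print(initial_options("I have debt"))
```

In this sketch the first request matches the scripted keyword and yields the three options, while the second returns an empty list because the word "bankruptcy" never appears in it.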

Law&You, however, faced two challenges with guided chats. First, users described legal issues in multiple ways. Although “I want to file for bankruptcy” is one way to describe bankruptcy, “I do not have money” and “I have debt” are also valid bankruptcy descriptions. It was impossible to hard code all the ways in which people describe legal issues. Second, users often went off script. Instead of selecting an option, users often responded with open-ended descriptions or asked questions. Follow-up questions or transferring users to human operators clarified these situations, but Law&You felt that repeated questioning and transfers lowered user and client confidence in the product. Guided chat’s inability to handle such situations meant that users often left chats midway, providing no data at all. This was unacceptable since the collection of user data was a business priority. Law&You lost out on profitable data, spending further resources to clean the messy data captured by guided chat.

In May 2017, Law&You’s director of technology, Paul, approached Aurelion’s director of data science, Remy, with the idea of a smart chatbot. Law&You’s move to chatbots was partly a response to the fact that several of their competitors now provided AI-driven marketing tools. The chatbot, from Paul’s perspective, was the future of digital marketing. The data science team had never built a chatbot before but had experience with natural language processing (NLP) tools. The team had previously developed NLP-based systems using in-house models and third-party services provided by companies such as Amazon, IBM, and Microsoft. At the beginning of the project, as part of requirement gathering, Remy asked Paul for a definition of the chatbot’s use case—what was it supposed to do? In a later meeting, Paul restated the chatbot’s business use case:

Paul (director of technology): People in legal chat do not always follow a script. We want to find ways to anticipate conversational pathways. We are agnostic to technology as long as it provides value—user information. (Fieldwork Notes, 31 July 2017)

Guided chat’s failure to perfectly determine what users say hindered the business goal of data collection. A chatbot could help overcome this challenge. Given this business use case, as Remy described later, the data science team saw its task as building a chatbot to guide open-ended conversations:

Remy (director of data science): [We wanted to go] beyond guided chat. Guided chats are easy because they are guided […] to the point where it is a request-and-response kind of scripted conversation. The challenge was—how to go from a guided chat to, what I would call, an open-ended chat. […] If [users] went off rails, can [we] somehow guide them back to the original conversation and let them not give up on the conversation? […] Try to get them back to a valuable conversation. An open-ended conversation can get out of control very quickly. (Interview, 26 October 2017)

The data science team broke down the chatbot’s functionality into two tasks: (1) “knowing” what users say (identifying what users want to accomplish with the chatbot) and (2) “guiding” conversations (helping users by making the chatbot talk to them in a human-like manner).⁵ Initially, data scientists focused on the first task—determining what users say. When they researched existing chatbots, they found that third-party AI platforms powered most chatbots. In earlier projects, the data science team had “tried” most third-party AI platforms; given their experiences, they favored VocabX⁶ and considered it “smarter than most systems.”

The data science team believed that building a VocabX-enabled chatbot required the use of several VocabX services working in tandem: (1) Grasp—a service to analyze unstructured text to extract topics and keywords, (2) Erudition—a service to train VocabX on a specific domain (it comes pre-trained on several topics), and (3) Colloquy—a service to enable VocabX to hold conversations. Given their previous experience with VocabX, data scientists stated that using VocabX to identify legal topics should be “relatively easy” using two services—parse the text through Grasp and pass the output to Erudition. Right out of the box, VocabX “already knew a lot” about several topics, further bolstering the team’s confidence. Data science team members often shared examples of VocabX’s “success” in team meetings:

Remy (director of data science): We fed VocabX a line deliberately trying to confuse it. We wrote, ‘I am thinking about chapter 13 in Boston divorce filing.’ VocabX figured out the two topics:

1. business and industrial/company/bankruptcy

2. society/social institution/divorce. (Fieldwork Notes, 31 May 2017)⁷

They considered the line “confusing” because it had keywords for both divorce and bankruptcy.⁸ Such attempts to confuse the chatbot⁹ were commonplace, considered an informal, yet reasonable, form of testing. The chatbot “knew” what users said if it identified the “correct” topics.

The more significant challenge was the second task—guiding conversations. When users went off track, the data science team wanted the chatbot to guide the conversation back in a “natural” manner. Data scientists tried but were unsuccessful in making the chatbot hold conversations. The chatbot could identify legal topics, but data scientists could not successfully configure the Grasp, Erudition, and Colloquy services to work together. Director of data science Remy mentioned that perhaps the team did not “fully understand” VocabX. He believed it was possible to configure VocabX for open-ended legal conversations and organized a meeting with VocabX representatives to resolve the problem.¹⁰ In addition to three data scientists, data science project manager Daniel, director of data science Remy, and two VocabX representatives, director of technology Paul and CTO Harper attended the meeting. The meeting’s goal was to understand how to configure VocabX for open-ended conversations. In the meeting, however, another problem emerged when Remy asked for a specific demonstration.

Remy: Can you give a demo or an example of VocabX in action, particularly when a user query goes outside the scope of the conversation?

VocabX representative #1: You mean when the query hits outside the bounds of the algorithm?

Remy: Yes. If the user is, let us say, talking about legal stuff and then suddenly types ‘oh, what’s the weather in Chicago?’ That kind of thing. How to bring the conversation back on track? Guiding it in the direction we want.

VocabX representative #1: What you are asking for is an in-practice thing. With VocabX you can configure different ways of saying the same thing. So, if you encounter something that is below a certain confidence interval threshold or outside the confidence boundary, that is maybe out of scope. You can also call Equals [another VocabX service] for misspelled words. You can tell the system that ‘piza,’ ‘pizzza,’ and ‘pizzzza’ are all just ‘pizza.’

Remy: Yes, for entities, but can we do it for intents as well?

Harper (CTO): Remind me – what is the difference between entities and intents?

Remy: Intent is the end goal, and entities are the small parts that make you reach that goal. For example, ‘order a pizza’ is an intent, and things such as ‘large,’ ‘pepperoni,’ etc. are entities of the intent. We can give a minimum of six instances of entities to tell the VocabX model about the entity range of an intent, but ideally it shouldn’t take that many if the system is capable of NLP, which I think it is.

Paul (director of technology): If you had enough entities, figuring out intent shouldn’t be hard. (Fieldwork Notes, 5 June 2017)

Remy asked how the chatbot would work when users went “outside the scope of the conversation.” The data science team often discussed the two tasks (knowing what users say and guiding conversations) separately but recognized that they overlapped: to guide users, you must know what they said. As data scientist Alex put it: “the bot has to know why you are on the platform!” The task of “knowing” required discerning not only what a user said but also what they left unsaid. As Remy had described a few days before the meeting: “it is almost like we want to read someone’s mind.”

The meeting surfaced a pertinent challenge. Guiding conversations required knowing what a user says as well as identifying whether what they say is on or off track within the context of the ongoing conversation. It was possible to configure the chatbot to identify themes related to a topic by manually inputting intent–entity combinations. Intent refers to users’ desired goals—the reason they are talking to the chatbot (e.g., to get legal advice on divorce). Entities refer to themes within users’ intended goals—unique aspects of what users want (e.g., to get advice on annulment). The chatbot could identify that entities such as “marriage,” “annulment,” and “custody” are all linked to divorce. If users asked about “custody,” the chatbot would use configured confidence thresholds to determine that users were on track in a legal conversation on divorce.
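The on-track check described above can be sketched as follows. This is an illustrative toy under stated assumptions, not VocabX's actual scoring: the `confidence` function, the `ON_TRACK_THRESHOLD` value, and the `INTENT_ENTITIES` table are all hypothetical stand-ins for whatever confidence measure and configuration a real platform provides. Only the divorce entities are taken from the text.

```python
import re
from typing import Dict, Set

# Hypothetical manual configuration: one intent and its entered entities.
INTENT_ENTITIES: Dict[str, Set[str]] = {
    "divorce": {"marriage", "annulment", "custody"},
}

# Assumed cutoff for illustration only; not a value reported in the fieldwork.
ON_TRACK_THRESHOLD = 0.2


def confidence(message: str, intent: str) -> float:
    """Toy score: share of words in the message that are configured entities."""
    words = re.findall(r"[a-z']+", message.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in INTENT_ENTITIES[intent])
    return hits / len(words)


def is_on_track(message: str, intent: str) -> bool:
    """Treat the message as on track if its score reaches the threshold."""
    return confidence(message, intent) >= ON_TRACK_THRESHOLD


print(is_on_track("Who gets custody?", "divorce"))
print(is_on_track("What's the weather in Chicago?", "divorce"))
```

Under this toy score, a message that mentions a configured entity such as "custody" clears the threshold, while a message that uses none of the configured entities scores zero and is treated as off track, whatever the user actually meant.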

The situation grew complex when users said things that were not already specified as entities and intents. For example, users may describe love for their kids in divorce conversations. Pre-trained on general topics, the chatbot can identify that the user is perhaps on the “family” or “parenting” topic. If the chatbot does not identify that these topics are also related to divorce, then it might incorrectly determine that the user is off track. Law&You wanted to avoid such situations, which could make users feel that they were not understood or taken seriously and lead them to drop conversations midway. For the legal use case, the chatbot needed to identify whether a topic was within or outside legal discussions. Identifying on- and off-track topics required either manually inputting all intent–entity relations or having the chatbot learn intent–entity relations on its own. The former was not even a possibility—it was after all the very problem with guided chats. CTO Harper asked if the latter was possible.

Harper: I am happy to give it six entities to populate it if it is then able to figure out the next 50 entities on its own. Is that possible? We want to get intent. If the user says: ‘I have no money, and I have to sell my car,’ we want to figure out that the user is talking about bankruptcy. We can do that manually. We can have a button that says: ‘click here to file for bankruptcy.’ But, instead of having preprogrammed buckets, asking people to self-identify, we want to extract it directly out of the conversation.

VocabX representative #2: You can use Wisdom [a VocabX product] to manually label your data and train the system to identify domain-specific things. (Fieldwork Notes, 5 June 2017)

Director of technology Paul and CTO Harper turned down manual labeling, arguing that it incurred high financial and personnel costs. They could allocate resources to manually label a small set of entities, but not all of them.

The data science team organized the meeting to know how to “guide” conversations. However, much of the meeting centered on the work of identifying legal issues, i.e. “knowing” what users say, which the data science team believed it had already accomplished. The team realized that the full scope of the task of knowing what users say includes knowing the difference between “really off-track” users and on-track users “describing things differently.”

Our description of the meeting may seem incomplete, consisting only of discussions among business personnel—Remy, Paul, and Harper—and VocabX representatives. Although data scientists were at the meeting, they are absent in our description. During fieldwork, we observed similar dynamics between data scientists and business personnel. Data scientists were vocal in data science team meetings (even in the presence of their director or project manager). However, they were far less vocal in meetings with business personnel, especially senior personnel. For example, in requirement gathering or status update meetings with business teams, Remy and Daniel often spoke on behalf of the team.

The same dynamic was at play in the VocabX meeting. One reason for this, as we learned through discussions with data scientists, lies in their understanding of the goal of such meetings. It may seem that the meeting’s goal was technical—to learn how to configure VocabX to hold open-ended conversations. Data scientists, however, felt differently. For them, the meeting’s goal was primarily business in nature. They recognized that if VocabX could power the chatbot, the company would need to invest a substantial amount of money into VocabX to get access to their cloud infrastructure to handle the thousands of customers that would use the chatbot daily. The meeting’s technical and business goals were intertwined—a common occurrence in the world of corporate data science.

Going back to our story, the meeting surfaced additional problems but resolved some previous ones. Data scientists now better understood what they could accomplish with VocabX (e.g., identify preconfigured intents and entities) and what they could not (e.g., discern whether new entities were part of legal intents)—as discussed in the debrief session after the meeting. They decided that they needed to figure out a way to make the chatbot automatically learn legal intent–entity relations. The team discussed two ways to achieve this. First, using VocabX’s Erudition service, which provided a way to learn new things, to train the chatbot on data containing legal books and articles. Second, developing in-house NLP models trained on the same data. The former prioritized compatibility with VocabX; the latter enabled greater control over learning. Irrespective of the choice, the immediate task was to create a training dataset. Law books and articles, however, describe legal issues in legal vocabulary. Models trained on them could learn legal topics, but only as those topics appear in legal terminology. Such data was inadequate for learning how users colloquially describe legal issues, a point raised in the meeting. The practical difficulty of curating data on people’s everyday descriptions of legal issues posed challenges for the learning task. Data scientists attempted to resolve this issue by also including archived chat transcripts with human operators and forum discussions as part of their training data. Besides VocabX, the chatbot would now also use in-house models.

The data science team presumed that VocabX could easily learn new intent–entity relations (just like their earlier assumption that VocabX could hold open-ended conversations). The final verdict, however, was against VocabX, as described later by director of data science Remy:

Remy: My expectation of their [VocabX] service was a little bit higher than what they were actually delivering. The reason I started looking at VocabX is because I was hoping that going beyond guided chat … that their technology had progressed to the point where they can help me with the less guided or open-ended chat. That is when we started asking questions to VocabX team—does your technology actually help in that regard? What ended up happening was that they acknowledged that their Colloquy service really was not … <pause> I should not have expected their service to do that. Instead, they offered yet another service [Wisdom]. Even that [is] not that smart. You still have limitations. (Interview, 26 October 2017)

A few weeks later, the data science team provided an update on the chatbot’s development. Remy, data science project manager Samuel,¹¹ data scientists Max and Alex, Law&You’s director of technology Paul, and Law&You’s software engineer Richard attended the meeting. Remy told Paul that the chatbot was a work in progress. It performed better than before but was far from ready for deployment.¹² There were “too many edge cases” in which conversations did not go as planned. Paul, however, was not convinced that the chatbot’s performance was as bad as Remy described. For him, the chatbot just had to be good enough.

Paul: Maybe we need to think about it like an 80/20 rule. In some cases, it works well, but for some, it is harder. 80% of the time everything is fine, and in the remaining 20%, we try to do our best.

Remy: The trouble is how to automatically recognize what is 80 and what is 20.

Paul: I agree. Let us focus on that. We just want value. Tech is secondary.

Max: It is harder than it sounds. (Remy laughs, Paul asks Max: ‘In what way?’). One of the models I have is a matching model trained on pairs of legal questions and answers. 60,000 of them. It seems large but is small for machine learning.

Paul: That’s a lot. Can it answer a question about say visa renewal?

Max: If there exists a question like that in training data, then yes. But with just 60,000, the model can easily overfit, and then for anything outside, it would just fail.

Paul: I see what you are saying. Edge cases are interesting from an academic perspective, but from a business perspective the first and foremost thing is value. You are trying to solve an interesting problem. I get it. But I feel that you may have already solved it enough to gain business value. (Fieldwork Notes, 31 July 2017)

For the data science team, the chatbot was better than before but still far from perfect. Paul did not require perfection. The chatbot had business value, even if it worked 80% of the time. Paul differentiated between academic and business perspectives. Edge cases posed exciting data science challenges. Solving them, however, was outside the project’s scope. The business gained value from a good-enough chatbot even if it was not ideal from a computational perspective.

It is not surprising that imperfect, good-enough systems can provide business value. What was surprising was that Paul argued that the chatbot’s failures were, in fact, not failures at all!

Paul: Edge cases are important, but the end goal is user information, monetizing user data. We are building a legal self-help chatbot, but a major business use case is to tell people: ‘here, talk to this lawyer.’ We do want to connect them with a lawyer. Even for 20%, when our bot fails, we tell users that the problem cannot be done through self-help. Let us get you a lawyer, right? That is what we wanted in the first place. (Fieldwork Notes, 31 July 2017)

For Paul, the primary goal was to collect user data and sell it to clients. If the chatbot did this most of the time, it worked. It was acceptable to fail, and failures did not mean that the chatbot did not work. In such cases, the chatbot can inform users that their legal problem (the problem it failed to identify) is not a self-help problem and requires professional help. The users should thus provide their information so that the company could put them in touch with lawyers. There were no failures, only data.

Findings

We began the paper with the question: how are data science systems made to work? In this section, we answer this question by examining how actors negotiated central aspects of the chatbot’s working. First, what work should the chatbot do? We show how existing technologies shape the chatbot’s intended working. Second, how should the chatbot work? We show how the resolution to the challenge of identifying on- and off-track users influences the way actors expected the chatbot to work. Third, how to assess whether the chatbot works? We show how actors evaluate the chatbot in distinct ways and finally agree to assess its working by skewing the balance between business and data science goals. Making visible the ongoing forms of discretionary work essential to building data science systems, we complement existing critical data studies research with a detailed account of the human work required to make data science systems work.

Existing technologies: The old and the new

The working of data science systems is not just an artifact of their technical features but entangled with existing technologies. In this subsection, we analyze how two existing technologies—one other than the chatbot and one making up the chatbot—shaped the chatbot’s intended working.

First, assessments of whether a technology other than the chatbot, namely guided chat, worked shaped actors’ articulations of the chatbot’s intended working. Director of technology Paul initially described the chatbot as different from guided chat. The chatbot was a novel technology, signifying the company’s move toward AI. The chatbot’s initial problem formulation, however, was motivated by the perceived performance of guided chat. Guided chat succeeded in increasing profit margins by reducing the cost of hiring human chat operators but failed to collect reliable data because of problems with scripted chats.

The company could revert to hiring chat operators, but this was undesirable. Hiring back human operators would incur a high financial cost. The company would also lose market share to competitors who already provided AI-driven digital marketing tools. Reverting to human operators would mean that to remedy guided chat’s failure (unreliable data collection), Law&You would also have to give up on guided chat’s success (higher profit margin). The company instead went for the chatbot that promised reliable data collection and maintained, if not increased, profit margins and market share. What guided chat could or could not do shaped what the chatbot should or should not do. If users followed guided chats, the company would not even require a chatbot, at least for data collection. The need for the chatbot to anticipate conversational pathways, to know what users say, and to guide them was founded in the perceived poor performance of guided chat.

Second, an important technical feature of a technology making up the chatbot shaped actors’ understanding of users’ legal issues, affecting how actors expected the chatbot to work. For our actors, users had legal intents, which were combinations of distinct legal entities. The chatbot needed to determine intent by identifying entities. How did actors settle on this understanding? The answer lies within VocabX’s technical design in which texts are analyzed as combinations of intents and entities. For VocabX, the meaning of a piece of text is equal to the text’s intent–entity makeup. Our actors’ understanding of legal issues directly mirrored VocabX’s technical design,¹³ enormously impacting, as we saw, the chatbot’s intended working. The work of identifying what users say became the work of identifying intents and entities. In doing so, actors also configured a particular type of user (Woolgar, 1991) in their system—a user with apparent intentions, which they described using recognizable terms.
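The intent–entity framing can be made concrete with a minimal sketch. This is not VocabX's interface, which the account above does not describe. The entity keywords, intent names, and the `detect_entities` and `identify_intent` helpers are all hypothetical, chosen only to show how treating an intent as a combination of entities turns the work of knowing what users say into keyword and set matching.

```python
from typing import Dict, Optional, Set

# Hypothetical entity vocabulary: entity name -> keywords that signal it.
ENTITY_KEYWORDS: Dict[str, Set[str]] = {
    "bankruptcy": {"bankruptcy", "chapter 13"},
    "divorce": {"divorce"},
    "child_custody": {"child custody", "custody"},
}

# Hypothetical intents, each defined as a combination (set) of entities.
INTENTS: Dict[str, Set[str]] = {
    "file_bankruptcy": {"bankruptcy"},
    "get_divorce": {"divorce"},
    "divorce_with_custody": {"divorce", "child_custody"},
}


def detect_entities(text: str) -> Set[str]:
    """Return the entities whose keywords appear in the text."""
    lowered = text.lower()
    return {
        entity
        for entity, keywords in ENTITY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }


def identify_intent(text: str) -> Optional[str]:
    """Pick the most specific intent whose entities are all present."""
    found = detect_entities(text)
    candidates = [
        (len(required), name)
        for name, required in INTENTS.items()
        if required <= found  # every entity of the intent must be detected
    ]
    if not candidates:
        return None  # no recognizable legal terms, so no intent
    # Prefer the intent with the largest entity set.
    return max(candidates)[1]
```

Under these assumptions, a message only has an intent if it contains recognizable terms, which is the kind of user the framing configures: anything phrased outside the vocabulary yields `None`.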

Emergent challenges: Situated resolutions and system working

The working of data science systems is shaped not only by problem formulation and algorithmic design but also by situated forms of work to resolve emergent implementation challenges. In this subsection, we examine how the resolution to the problem of identifying on- and off-track users shaped how actors expected the chatbot to work.

The data science team initially equated the work of knowing what users say with identifying legal intents. In doing so, the team made two assumptions. First, users described legal issues, i.e. users were on track. This assumption was apparent in that the text used to test the chatbot contained valid accounts of legal issues (e.g., bankruptcy and divorce). Second, users used recognizable legal words in their descriptions. This assumption was evident in the use of specific keywords (e.g., chapter 13, divorce, and child custody) in test cases. The chatbot worked through the correct identification of legal intents of on-track users who described issues using recognizable legal words.

The VocabX meeting, however, surfaced additional challenges. The meeting’s goal was to configure the chatbot to guide users (actors assumed that the work of knowing what users say was already done). Director of data science Remy asked how to guide a user asking about the weather. Everyone agreed that this query was outside legal discussions. Through this query, Remy invoked an exemplary instance of an off-track user who said things “completely unrelated” to law. Guiding, even identifying, such off-track users required the chatbot to perform additional work to identify what is or is not a valid part of legal discussions.

Actors resolved this challenge by proposing that the chatbot must identify all kinds of on-track users. The chatbot should learn the distinct ways in which users remain on track in legal discussions. If the chatbot did this successfully, it could identify off-track users since their queries would not correspond to the chatbot’s model of on-track users. Recognizing the many kinds of on-track users, however, required the chatbot to work differently. In fact, the chatbot needed to perform multiple kinds of work: identifying the content of legal topics (e.g., is annulment a part of the divorce topic?), mapping relations between legal topics (e.g., are bankruptcy and divorce connected?), and discerning the scope of legal discussions (e.g., is caring for kids a part of a divorce discussion?). One way the chatbot could learn legal content, relations, and scope was through training data prepared by manually labeling legal texts. Doing so incurred high financial and personnel costs, and this was not how the actors wanted the chatbot to work. It needed to learn on its own—a requirement made difficult by differences between formal and colloquial descriptions of legal issues. In the end, the chatbot needed the technical ability to identify the content of and relations among legal topics besides accounting for the scope of legal discussions to differentiate between on- and off-track users. At face value, this change to the chatbot’s working might seem like a mere redefinition of its working rather than an actual change. However, it is crucial to note that this redefinition consequentially altered the chatbot’s technical set-up and working.
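The logic of this resolution, in which a query is off track when it does not correspond to the model of on-track users, can be sketched as follows. The topic vocabularies and the `matching_topics` and `is_off_track` helpers are hypothetical. The sketch hand-writes each topic's content and scope, which is exactly the manual preparation the actors wanted to avoid, so it illustrates the decision rule only, not how the chatbot was meant to learn.

```python
import re
from typing import Dict, Set

# Hypothetical model of on-track users: topic -> terms within its scope.
ON_TRACK_TOPICS: Dict[str, Set[str]] = {
    "divorce": {"divorce", "annulment", "custody", "kids", "spouse"},
    "bankruptcy": {"bankruptcy", "debt", "creditor", "chapter"},
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_topics(text: str) -> Set[str]:
    """Return every legal topic whose vocabulary overlaps the query."""
    tokens = tokenize(text)
    return {
        topic for topic, terms in ON_TRACK_TOPICS.items() if tokens & terms
    }


def is_off_track(text: str) -> bool:
    """A query is off track when it matches no on-track topic."""
    return not matching_topics(text)
```

In this sketch, whether "caring for kids" counts as part of a divorce discussion is settled by whether `kids` sits in the divorce vocabulary. Changing the scope of a topic therefore changes which users count as off track.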

Negotiated balance: Business and data science considerations

Whether a data science system works is neither obvious nor given (for a more general account, see Collins, 1985; Rooksby et al., 2009; Suchman et al., 2002); the perceived success and failure of its working depend as much on business goals as on data science imperatives. In this subsection, we unpack how actors evaluated the chatbot in distinct ways, agreeing to assess its working in a practical way founded in a negotiated yet skewed balance between business and data science goals.

Business and data science actors evaluated the chatbot in different, somewhat divergent, ways. The data science team focused on assessing the algorithmic efficacy of the chatbot’s working. From this perspective, the chatbot was far from perfect because of its inability to account for several edge cases. The data science team’s fixation on scoping and resolving edge cases was not an arbitrary choice. An essential part of director of data science Remy’s project goal was that the chatbot should know when users went off track and guide them back. It should not come as a surprise that most, if not all, edge cases were off-track user queries—queries at the heart of the data science team’s assessment criteria.

For the business team, the chatbot’s assessment depended on the practical efficacy of its working. Director of technology Paul argued that solving edge cases was an interesting academic challenge but not the project’s business goal—the chatbot needed to work for most, not all, cases. This articulation of the difference between academic challenges and business goals is in line with recent work, e.g. Wolf (2019), on how industry practitioners perceive differences between applied and scholarly data science. For Paul, the chatbot’s success did not just depend on its algorithmic prowess. The chatbot was also a competitive tool. A good-enough chatbot was already a huge success, signaling the company’s uptake of AI-driven technologies to clients and competitors.

Our finding that practitioners often need systems to work in good-enough, and not perfect, ways echoes similar findings concerning other systems (Gabrys et al., 2016; Keller, 2000; Lewis et al., 2012). However, what is surprising is how the business team reframed the situations that the data science team considered failures as potential sites of success. The chatbot’s computational failures were, for the business team, a result of the complexity of users’ legal issues and not of the chatbot’s technical inadequacies. In such cases, the chatbot could inform users that their legal issue required professional help: I cannot help you because your issue is not a self-help issue, so give me your information, and I will connect you with lawyers. The chatbot worked 100% of the time from a business perspective. Paul’s 80/20 rule became the new ground for assessment, establishing an accountable definition of a successful chatbot. The data science team could disagree with this new assessment criterion but not entirely reject it, especially given that the data science team described its central organizational role as that of “adding value” to businesses, an aspect we observed in this and many other projects.
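The gap between the two assessments can be shown with a small worked example. The outcome labels and scoring functions below are hypothetical. The 80/20 split follows Paul's rule, and the sketch adopts his framing that every conversation the chatbot fails to resolve ends in a lawyer referral that still collects user information.

```python
from collections import Counter
from typing import Iterable

# Hypothetical conversation outcomes.
IDENTIFIED = "identified"  # the chatbot recognized the legal issue
REFERRED = "referred"      # the chatbot failed and offered a lawyer instead


def data_science_score(outcomes: Iterable[str]) -> float:
    """Share of conversations in which the legal issue was identified."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no conversations to assess")
    return counts[IDENTIFIED] / total


def business_score(outcomes: Iterable[str]) -> float:
    """Share of conversations that yielded user information."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no conversations to assess")
    # Both identified and referred conversations collect user data.
    return (counts[IDENTIFIED] + counts[REFERRED]) / total


# Paul's 80/20 rule applied to 100 illustrative conversations.
conversations = [IDENTIFIED] * 80 + [REFERRED] * 20
```

On these 100 illustrative conversations the same record scores 0.8 on the data science measure and 1.0 on the business measure. The chatbot's behavior is identical in both cases; only the definition of a successful conversation differs.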

Discussion

In this paper, we described how actors make data science systems work relative to existing technologies, emergent challenges, and business goals. Enormous amounts of discretionary, often invisible work lay the foundation for how, why, and to what extent systems ultimately work.

One way to explain this is to see the chatbot’s final working as a mundane consequence of data science practice. From this perspective, a plausible narrative of the chatbot project would be that a problem (anticipating conversational pathways) hinders a business goal (data collection). Data scientists break down the problem (identify topics, learn relations among topics) and build a chatbot to solve it (by knowing what users say and guiding them). The chatbot’s working is thus a foregone conclusion—always stable, merely realized through development.

This framing, however, does not account for the choices and decisions that alter the working of systems throughout development—sometimes in ways that are invisible even to practitioners. Building systems—indeed, making systems work—requires “ordering the natural, social, and software worlds at the same time” (Vertesi, 2019: 388). Business goals and existing technologies shape problem formulations. The design of existing technologies configures the work systems must do and assumptions about how users will interact with the systems. Considerations of financial cost, personnel time, and resource allocation lead actors to require systems to do specific work in automated ways. The working of data science systems is not just an account of their technical design but made account-able (Garfinkel, 1967) through ongoing work by many practitioners (Neyland, 2016).

Researchers continue to analyze data science implications, recommending how systems should or should not work. Researchers’ ability to effectively address responsible design requires understanding and addressing how and why practitioners choose to make systems work in specific ways. Through a detailed description of the negotiated nature of the working of data science systems, our empirical findings call attention to the consequential relationship between the working of data science systems and the everyday practices of building them, highlighting three takeaways for critical data studies.

First, practitioners have different, sometimes divergent, assumptions and expectations of a system’s intended working. We saw how data science and business team members differed in their approach to evaluating the chatbot’s working, highlighting underlying differences in their understanding of how the chatbot was intended to work. Data science is as much the process of managing different expectations, goals, and imperatives as of working with data, algorithms, models, and numbers. Building data science systems requires work by many practitioners, but not all forms of work are considered equal. We saw how business team members had more power than data scientists in the chatbot project (Saltz and Shamshurin (2015) point to similar dynamics at another company). The organizational culture at Aurelion reinforced the notion that the data science team’s job was to “add value” to business products. Data science teams remain one of the most recent additions to corporate organizations. But their everyday work intersects with already-existing project, product, and business practices, and with teams that often have more weight in organizational decisions.

In making visible the impact of organizational aspects, we suggest researchers orient toward how differences among practitioners and teams are managed, resolved, or sidelined within projects. Who gets to take part in negotiating a system’s working? Who decides the nature of this participation? Who gets to arbitrate the outcome of negotiations? In studying the implications of data science systems, we also must engage with the culture and structure of organizations that design such systems to understand how specific viewpoints are (de)legitimized by practitioners (Haraway, 1988; Harding, 2001). This engagement can help us better understand the entangled relationship between the organizational context and product development practices of corporate organizations (Boltanski and Thévenot, 2006; Passi and Jackson, 2018; Reynaud, 2005; Stark, 2009).

The second takeaway concerns itself with new empirical sites for analyzing the work of building data science systems. The challenging nature of gaining access to corporate practice continues to limit critical data studies scholarship. Data science systems, however, do not exist in isolation but are embedded in wider sociotechnical ecosystems. Justifications concerning the working of systems may lie, as we have shown, not within but outside them—in the technologies that systems replace or the technologies that make up systems. We must keep in mind that existing technologies that data science systems replace—even those that seem to have nothing to do with data science—also shape how practitioners envision what data science systems can or cannot, and should or should not, do. What existing practices and technologies do data science systems augment or replace? In what ways do existing technologies seem deficient or superior to data science systems? Engaging with these questions is useful, helping us see how data science systems intersect with existing ways of doing and knowing.

As we have shown, third-party AI platforms make up and shape the working of data science systems in important ways. As we continue to work to gain access to corporate data science practices, analyzing these systems promises to provide meaningful insights into how, why, and to what extent systems work the way they do. How do third-party AI services define and solve common problems such as identifying user intent? What are the affordances and limits of different AI platforms? Describing how AI platforms work and what futures they promise to their clients falls within the purview of our efforts to unpack the working and implications of data science systems.

Third, a data science system’s working is impacted as much by how practitioners anticipate the kinds of work the system can perform as by the actual work of practitioners to build the system. Initially, our actors believed that the chatbot could successfully solve guided chat problems. In two other instances, data science team members expected the chatbot to easily recognize legal intent, guide conversations, and learn new information. Beyond the project’s immediate goals, business actors believed that the chatbot was the future of digital marketing. Such forms of “anticipation work” (Steinhardt and Jackson, 2015) shaped how actors imagined viable approaches and solutions to problems at hand, further affecting how and why actors built the chatbot to work the way it did. In our case, however, anticipations often faltered, requiring work by actors to adjust their programs of action. Situated in the present, articulations of plausible futures consequentially shape everyday data science work.

Proclaiming the efficacy of actions before performing them is not the mistake of putting the cart before the horse but, in fact, an essential attribute of professional work. Pinch et al. (1996) describe how professionals use forms of anticipation to decide what is and is not doable, imagining and selecting among viable actions when faced with uncertainty. Examples of such anticipatory skills range from the ability of mechanics to provide accurate time estimates for repairs to the ability of aerospace engineers to assess the safety of a rocket before flight. Similarly, data science practitioners must know not only how to build systems to work in specific ways but also whether certain forms of working are possible. Like all professionals, data science practitioners often think and act “in the future perfect tense” (Schuetz, 1943: 40):

We cannot find out which of the alternatives will lead to the desired end without imagining […an] act as already accomplished. […We] have to place ourselves mentally in a future state of affairs which we consider as already realized, though to realize it would be the end of our contemplated action.

For critical data studies researchers, analyzing and participating in the anticipatory work of practitioners is crucial because it is through such forms of work that practitioners imagine possible futures—worlds in which systems can, do, and must work in some ways and not in others. As researchers, our efforts to assist practitioners in designing better systems thus must include the work of fostering alternative, even somewhat radical, imaginations of futures to help practitioners envision new possibilities. We get the systems we imagine, but not necessarily the ones we need. The work of building better systems begins with working with practitioners to imagine better futures.

We realize that such engagements will sometimes frustrate both critical data studies researchers and data science practitioners; the two may, and often do, have different normative goals. Data science practitioners may cater to a different set of ethics, caring more about the clients they are in relationship with than about the concerns espoused by critical data studies. Sometimes critical data studies researchers may argue that it is better not to build a system. Data science practitioners may still go ahead with it because of the perceived business value or because they believe that if they do not build it, someone else will. In such situations, it may seem better to not engage with practitioners at all.

Our aim is not to simplify the lives of practitioners and researchers and create a binary divide between them. Both practitioners and researchers are entangled in the current data science moment in complex ways. Still, we want to make visible what we believe is increasingly becoming a challenge within critical data studies scholarship: calls, such as ours, to engage with practitioners are often seen as futile (at best) or appalling (at worst)—to put it at its most extreme, how could you work with, and not against, these evildoers?

We strongly believe that we need more research on the capitalistic underpinnings and negative implications of data science—certainly, you do not need to always work with practitioners to do important and valuable research. But we are troubled when critical data studies researchers appear to treat ethical values and normative goals as always stable a priori frameworks that just need to be implemented in systems. It is almost as if practitioners and researchers are thought to have no ethics of their own. Not only do ethics exist in all practices, they are often worked out as part of everyday work. What is normatively better depends as much on people’s normative stance as on the practical judgments that drive their everyday work (Passi and Barocas, 2019). Practitioners and researchers struggle with different sets of constraining forces, but it is important to be reflexive and remember that both groups perceive and act on the world through constraints that shape what they believe to be good and possible. The goal of engaging with practitioners is thus neither to school them nor to do their bidding. Instead, our call to work with data science practitioners is best understood as embarking on a difficult journey to learn more about the situated nature of data science practice and research, making visible the differences and similarities in our normative goals.

Conclusion

In this paper we have provided a process-centered account of the everyday work of building data science systems, showing how and why the working of a system is neither stable nor given, but a resourceful and improvised artifact that remains in the making throughout development. Through this work, we help to advance the sociotechnical scholarship in critical data studies on the everyday practices of doing data science, providing researchers new pathways into effectively engaging with the entangled relationship between the everyday work of building data science systems and their eventual working and social implications. We make a case for examining the human and organizational work through which practitioners decide how and why systems should work in specific ways, including forms of anticipatory work that drive practitioners toward certain technological futures.

Footnotes

Acknowledgments

We wish to thank Matthew Zook, Age Poom, and three anonymous reviewers for their constructive feedback and help with the review process. We would also like to thank Artificial Intelligence, Policy, and Practice (AIPP) research group members and Technology, Law, and Society (TLS) 2018 Summer Institute participants for their helpful comments on earlier versions of this work, and Ranjit Singh, Priya Gupta, and Utkarsh Srivastava for their help in refining the case study.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: National Science Foundation (NSF) Cyber-Human Systems Grant #1526155.

ORCID iD

Samir Passi

Notes

References

Amershi, Begel, Bird, et al. (2019) Software engineering for machine learning: A case study. In: International conference on Software Engineering (ICSE 2019) – Software engineering in practice track, pp.291–300. New York: ACM.

Aradau and Blanke (2015) The (big) data-security assemblage: Knowledge and critique. Big Data & Society 2(2): 1–12.

Baumer EPS (2017) Toward human-centered algorithm design. Big Data & Society 4(2): 1–12.

Bechmann and Bowker (2019) Unsupervised by any other name: Hidden layers of knowledge production in artificial intelligence on social media. Big Data & Society 6(1): 1–11.

Binns, Van Kleek, Veale, et al. (2018) ‘It’s reducing a human being to a percentage’: Perceptions of justice in algorithmic decisions. In: Proceedings of the 2018 CHI conference on human factors in computing systems (CHI ’18), Montreal, Canada, pp.1–14. New York: ACM.

Boltanski and Thévenot (2006) On Justification: Economies of Worth. Princeton, NJ: Princeton University Press.

Buolamwini and Gebru (2018) Gender shades: Intersectional accuracy disparities in commercial gender classification. In: Proceedings of the 1st conference on fairness, accountability and transparency (eds Friedler and Wilson), pp.77–91. New York: PMLR.

Charmaz (2014) Constructing Grounded Theory (Introducing Qualitative Methods Series). 2nd ed. London: Sage.

Cohn (2019) Keeping software present: Software as a timely object for STS studies of the digital. In: Vertesi and Ribes (eds) DigitalSTS: A Field Guide for Science & Technology Studies. Princeton, NJ: Princeton University Press, pp.423–446.

Collins (1985) Changing Order: Replication and Induction in Scientific Practice. London: Sage.

Dourish and Cruz (2018) Datafication and data fiction: Narrating data and narrating with data. Big Data & Society 5(2): 1–10.

Dudhwala and Larsen (2019) Recalibration in counting and accounting practices: Dealing with algorithmic output in public and private. Big Data & Society 6(2): 1–12.

Gabrys, Pritchard and Barratt (2016) Just good enough data: Figuring data citizenships through air pollution sensing and data stories. Big Data & Society 3(2): 1–14.

Garfinkel (1967) Studies in Ethnomethodology. Upper Saddle River, NJ: Prentice Hall.

Hand (1994) Deconstructing statistical questions. Journal of the Royal Statistical Society: Series A (Statistics in Society) 157(3): 317–356.

Haraway (1988) Situated knowledges: The science question in feminism and the privilege of partial perspective. Feminist Studies 14(3): 575–599.

Harding (2001) Feminist standpoint epistemology. In: Lederman and Bartsch (eds) The Gender and Science Reader. London: Routledge, pp.145–168.

Keller (2000) Models of and models for: Theory and practice in contemporary biology. Philosophy of Science 67: 72–86.

Lewis, Atkinson, Harrington, et al. (2012) Representation and practical accomplishment in the laboratory: When is an animal model good-enough? Sociology 47(4): 776–792.

Lipton and Steinhardt (2018) Troubling trends in machine learning scholarship. arXiv 1807.03341: 1–15.

Miles (2019) The combine will tell the truth: On precision agriculture and algorithmic rationality. Big Data & Society 6(1): 1–12.

Muller, Lange, Wang, et al. (2019) How data science workers work with data: Discovery, capture, curation, design, creation. In: Proceedings of the 2019 CHI conference on human factors in computing systems, pp.126:1–126:15. New York: ACM.

Neyland (2016) Bearing accountable witness to the ethical algorithmic system. Science, Technology, & Human Values 41(1): 50–76.

Paine and Lee (2017) “Who has plots?”: Contextualizing scientific software, practice, and visualizations. In: Proceedings of the ACM on human-computer interaction, 1, CSCW, pp.1–21. New York: ACM Press.

Passi and Barocas (2019) Problem formulation and fairness. In: Proceedings of the ACM conference on fairness, accountability, and transparency (FAT* ’19), pp.39–48. New York: ACM.

Passi and Jackson (2017) Data vision: Learning to see through algorithmic abstraction. In: Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing (CSCW ’17), pp.2436–2447. New York: ACM Press.

Passi and Jackson (2018) Trust in data science: Collaboration, translation, and accountability in corporate data science projects. In: Proceedings of the ACM on human-computer interaction, 2(CSCW), pp.1–28. New York: ACM Press.

Pinch, Collins and Carbone (1996) Inside knowledge: Second order measures of skill. The Sociological Review 44(2): 163–186.

Reynaud (2005) The void at the heart of rules: Routines in the context of rule-following. The case of the Paris Metro Workshop. Industrial and Corporate Change 14(5): 847–871.

Rieder and Simon (2016) Datatrust: Or, the political quest for numerical evidence and the epistemologies of big data. Big Data & Society 3(1): 1–6.

Rooksby, Rouncefield and Sommerville (2009) Testing in the wild: The social and organisational dimensions of real world practice. Computer Supported Cooperative Work (CSCW) 18(5): 559.

Saltz and Grady (2017) The ambiguity of data science team roles and the need for a data science workforce framework. In: 2017 IEEE international conference on big data (Big Data), Boston, MA, pp.2355–2361.

Saltz and Shamshurin (2015) Exploring the process of doing data science via an ethnographic study of a media advertising company. In: 2015 IEEE international conference on big data (Big Data), Santa Clara, CA, pp.2098–2105.

Schuetz (1943) The problem of rationality in the social world. Economica 10(38): 130–149.

Seaver N (2019) Knowing algorithms. In: Vertesi J and Ribes D (eds) DigitalSTS: A Field Guide for Science & Technology Studies. Princeton, NJ: Princeton University Press, pp.412–422.

Shaw (2015) Big data and reality. Big Data & Society 2(2): 1–4.

Stark (2009) The Sense of Dissonance: Accounts of Worth in Economic Life. Princeton, NJ: Princeton University Press.

Steinhardt and Jackson SJ (2015) Anticipation work: Cultivating vision in collective practice. In: Proceedings of the 18th ACM conference on computer supported cooperative work & social computing, CSCW ’15, pp.443–453. New York: ACM.

Strauss and Corbin (1990) Basics of Qualitative Research: Grounded Theory Techniques and Procedures. New York: Sage.

Suchman, Trigg and Blomberg (2002) Working artefacts: Ethnomethods of the prototype. The British Journal of Sociology 53(2): 163–179.

Suchman (1987) Plans and Situated Actions: The Problem of Human-Machine Communication. New York: Cambridge University Press.

Suchman, Blomberg, Orr, et al. (1999) Reconstructing technologies as social practice. American Behavioral Scientist 43(3): 392–408.

Vertesi (2019) From affordances to accomplishments: PowerPoint and Excel at NASA. In: Vertesi and Ribes (eds) DigitalSTS: A Field Guide for Science & Technology Studies. Princeton, NJ: Princeton University Press, pp.369–392.

Wolf (2019) Conceptualizing care in the everyday work practices of machine learning developers. In: DIS ’19 Companion – Companion publication of the 2019 on designing interactive systems conference 2019 companion, pp.331–335. New York: ACM.

Woolgar (1991) Configuring the user: The case of usability trials. In: Law (ed) A Sociology of Monsters: Essays on Power, Technology and Domination. London: Routledge, pp.58–100.

Making data science systems work
