Abstract
Machine learning has become a key component of contemporary information systems. Unlike prior information systems explicitly programmed in formal languages, ML systems infer rules from data. This paper shows what this difference means for the critical analysis of socio-technical systems based on machine learning. To provide a foundation for future critical analysis of machine learning-based systems, we engage with how the term is framed and constructed in self-education resources. For this, we analyze machine learning tutorials, an important information source for self-learners and a key tool for the formation of the practices of the machine learning community. Our analysis identifies canonical examples of machine learning as well as important misconceptions and problematic framings. Our results show that machine learning is presented as being universally applicable and that the application of machine learning without special expertise is actively encouraged. Explanations of machine learning algorithms are missing or strongly limited. Meanwhile, the importance of data is vastly understated. This has implications for the manifestation of (new) social inequalities through machine learning-based systems.
Introduction
Critical scholarship in fields such as media and communication studies, sociology, political science, science and technology studies, and education research increasingly attends to the ways in which software and code—or algorithms more broadly—configure social reality (e.g., Gillespie, 2014; Mackenzie and Vurdubakis, 2011; Striphas, 2015). Beer (2017) examines the social power of algorithms in promoting certain visions of calculative objectivity and provides crucial insights that shape our understanding of contemporary information systems. This paper contributes to this line of research by examining how a new class of software systems that infers rules from data may be researched critically.
Focusing on such systems is highly important because the particularities of such “trained” systems escape critical analysis that understands algorithms as sets of more or less intentional instructions to solve well-defined problems (e.g., Burrell, 2016). This paper focuses on Machine Learning (ML), a subfield of Artificial Intelligence (AI) that has enabled a large number of novel applications such as machine translation and object recognition in images. While acknowledging these successes, scholars as well as activists have drawn attention to the possible negative consequences for those affected by ML systems. For example, O’Neil (2016) highlights the problematic applications of ML in sectors such as finance, human resources, and public education. Others have pointed to harmful effects across domains, in particular with respect to racial biases (e.g., Benjamin, 2019; Noble, 2018) or the “automation” of social inequalities (e.g., Eubanks, 2018).
In our study, we examine ML tutorials, which are an important self-education resource for the increasing number of software developers responding to a spike in demand for ML practitioners. ML tutorials allow self-learners to acquire the necessary expertise about this new programming paradigm through informal learning environments. We understand tutorials as a site of investigation for analyzing how the community of ML practitioners frames and constructs their field. Throughout this article, we use the term Machine Learning without quotation marks, as it is the field’s own name for itself. This does not, however, mean that we subscribe to the notion that such “machines” actually “learn”.
Scholars in critical data studies like D’Ignazio and Klein (2020) increasingly stress the important role of training data in ML-based systems. For software based on imperative programming, software code can be studied as instructional text. Due to the strong reliance on data and statistical inference, this is not possible for ML-based systems. The related work provides evidence that the contemporary understanding of ML and its role in the design of complex socio-technical systems is still limited. To contribute to future critical analysis of ML-based systems, we engage with how the term Machine Learning is framed and constructed in tutorials, and consider how this potentially affects those who apply Machine Learning in their professional lives. We hence address the following two, related research questions:
How is Machine Learning framed in self-education resources like tutorials? Specifically, we ask: What types and applications of ML are described? What applications are used as examples? Which elements of ML systems are explained or neglected/black-boxed? What implications does this framing have for the critical analysis of ML-based systems?
The paper is structured in the following way: First, we discuss different understandings of the term algorithm and its relevance to Machine Learning. We demonstrate that in ML, the algorithms can be trivial. Informed by the difference between imperative programming and ML systems based on statistical inference, we examine both the algorithms and the data used to train an ML-based system. With the observation that programming ML systems is trivial (in comparison to imperative programming) and easily available to any lay programmer, we review the role of expertise and ML tutorials in the formation of ML practice. In the subsequent section, we present our methodology for sampling and analyzing ML tutorials. The next section presents our analysis of framings of Machine Learning in tutorials. In particular, we focus on different types of ML and data, algorithms, expertise, and how the applications are presented. In the discussion section, we point to canonical examples of ML as well as important misconceptions, in particular with respect to ML’s universal applicability and the underinformed application of ML. In the conclusion, we highlight the two main insights from this study: (1) we identify a number of important misconceptions about Machine Learning and how it works; and (2) we find that algorithms play a marginal role in ML tutorials and that the role of data is vastly understated. Based on these insights, we present proposals for future research.
Background
Gillespie (2014) described algorithms as “procedures for transforming input data into a desired output, based on specified calculations”. He argues that algorithms are “inert, meaningless machines until paired with databases on which they function”. He rejects the black box metaphor for complex algorithms because they are both obscured and malleable. He highlights that algorithms are embedded into practice in “the lived world”. Kitchin (2017) shows that algorithms can be conceived in a number of ways—technically, computationally, mathematically, politically, culturally, economically, contextually, materially, philosophically, and ethically. He recognizes algorithms as performative in nature and embedded in wider socio-technical assemblages. Following Gillespie (2014), understanding algorithms requires more than thinking about how they work, where they are deployed, or what animated them financially. Seaver (2013) argues that the barriers to studying and understanding algorithms are (1) access to the proprietary commercial systems, (2) expertise, i.e. the required technical know-how to make sense of algorithms, and (3) an understanding of how algorithms exist and work “in the wild”. He proposes that algorithmic systems should be studied as intricate, dynamic arrangements of people and code that blend “technical” and “cultural” concerns.
These conceptualizations of algorithms and algorithmic systems—as synecdoches for the complex socio-technical assemblages in which ML-based systems are designed and operate—go well beyond the more technical definitions in computer science. For example, Knuth (1997) defined an algorithm as a set of rules that defines a sequence of operations such that each rule is effective and definite and such that the sequence terminates in finite time. Under the paradigm of imperative programming, a programmer explicitly formulates these computational rules of the system in a programming language.
Studying an algorithm and the set of rules it consists of has proven to provide insights into the values and ideas inscribed into software based on imperative programming. For example, Mackenzie (2013) argued that software developers and programmers live in “regimes of anticipation through technical practices”. Suchman (2012) proposed the concept of “sociomaterial configurations” to draw our attention to the “imaginaries” and “materialities” that technologies “join together” (p. 48). Central to these considerations is the assumption that software developers are expert designers of systems. An important method in fields such as critical code studies or software studies are, therefore, interviews with software developers. Such interviews have become a standard method of qualitative study designs. These empirical inquiries rely on a framing of programmers as programming subjects that exert power over design decisions while implementing a system. Such assumptions are also very prominent in software development approaches such as co-design or participatory design, which normatively claim that those affected by (future) information systems should have a say in design decisions (e.g., Bratteteig and Wagner, 2014; Jarke, 2021). Hence, most software studies frame software developers and programmers as making inscriptions to their code and algorithms which in turn allows for critical analysis.
For example, Kitchin (2017) provides six methodological approaches for researching algorithms: (1) examining pseudo-code/source code, (2) reflexively producing code, (3) reverse engineering, (4) ethnographies of coding teams, (5) unpacking the full socio-technical assemblage of algorithms and (6) examining how algorithms do work in the world. Kitchin (2017) argues that studies with access to the pseudo-code, code, and coders “may well be the most illuminating”. In this article, we show why (1) examining or (2) producing (pseudo-)code, as well as (3) reverse-engineering the system is only of limited use in the context of ML-based systems because the algorithms used to infer rules from data are so generic that little can be gained from understanding them. As an extension to methods 4, 5, and 6, we show that qualitative studies of self-education resources like ML tutorials can yield important insights into both technical aspects and social imaginaries of algorithmic systems based on ML.
In contrast to systems based on imperative programming, Machine Learning-based systems are “trained”, not programmed. While they are still “trained” by somebody, this process is very different from imperative programming (Mackenzie, 2013). To “train” an ML system, a mathematical model is formulated and a cost function is defined. The parameters of the mathematical model are optimized to minimize the cost function with respect to certain input and output data.
With ML, very little is gained from studying the generic set of rules used to “train” the model by minimizing a cost function. Algorithms like gradient descent or expectation-maximization, which are used to train ML-based systems, merely describe the optimization routine and are algorithmically trivial. Therefore, ML models escape critical examination using established methods from e.g. software studies (Kitchin, 2017). The code of ML-based systems cannot be studied as text in which intentions of programming subjects are inscribed.
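The triviality of such training routines can be illustrated with a minimal sketch (our own illustration, not code taken from any of the tutorials or systems discussed here): gradient descent for a linear model in plain Python. Nothing in the loop refers to an application domain; the same routine “trains” a model for any pair of input and output data.

```python
# Minimal gradient descent for a linear model y = w*x + b,
# minimizing the mean squared error cost function.
# Illustrative sketch; variable names are our own.

def train(xs, ys, lr=0.01, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the cost J = (1/n) * sum((w*x + b - y)^2)
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# The routine is identical regardless of what the data "means".
w, b = train([0, 1, 2, 3], [1, 3, 5, 7])  # data generated by y = 2x + 1
```

Studying this optimization routine as text reveals nothing about the purpose or values of the resulting system; those reside entirely in the data passed to it.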
Figure 1 demonstrates this crucial difference using a concrete example. The figure shows a full-fledged Python implementation of an ML-based system that can detect spam messages.

Figure 1. Self-contained Python implementation of an ML system that detects spam in emails. Yellow highlights indicate the aspects that are specific to spam filtering.
The yellow highlights in Figure 1 emphasize all those aspects of the code that are specific to spam filtering. The highlights show that only the data that is loaded and the dimensionality of the data are specific to the spam filtering application.
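This pattern can also be sketched independently of Figure 1 (the following is our own illustrative code, not the implementation shown in the figure): a generic naive Bayes text classifier in which only the final lines, where the data is defined, are specific to spam filtering.

```python
# Sketch of the pattern that Figure 1 exemplifies (our own code):
# everything below is generic; only the data at the end is task-specific.
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels):
    """Generic multinomial naive Bayes over word counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Only these lines are specific to spam filtering:
texts = ["win free money now", "cheap meds free offer",
         "meeting agenda attached", "lunch tomorrow with the team"]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(texts, labels)
```

Swapping in sentiment labels or medical records instead of spam examples would leave every function untouched, which is precisely why examining this code as text yields so little for critical analysis.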
The interchangeable use of the terms ML-based system and algorithmic system contributes to an important fallacy: much of the discussion concerning the accountability of algorithms is organized around human authorship and human agency (Lee and Larsen, 2019). Algorithms are thus intertwined with “normativities” at every step of their existence. Lee and Larsen (2019) describe algorithms in ML-based systems as “biased black boxes” that reproduce racism and thus automate inequality. To uncover the hidden normativities of such systems, they explore what approaches—other than going under the hood of “algorithms”—exist to critically examine ML-based systems. Our paper extends this work by examining how the term Machine Learning is framed and constructed within the community of ML practitioners.
Amershi et al. (2014), who focus on the role of humans in interactive Machine Learning, explore what ML means for users. They cite the risk of potentially unexpected behaviors, e.g. due to data that was never anticipated by the developers of a system. Other challenges that such systems face are errors that can be subtle and evaluation metrics that can be misleading. This connects to research that investigates how ML systems and their inferred models can become unintended sources of unfairness. With a focus on the technical inner workings of ML systems, Benjamin (2019), among others, argues that there is a general tendency for automated decisions to favor statistically dominant classes. Minorities, by definition, have proportionately fewer training samples available. This leads to models that make worse and potentially more biased (or even oppressive) decisions about specific groups or individuals (D’Ignazio and Klein, 2020; Eubanks, 2018; Noble, 2018). We show that the risk that ML systems become unintended sources of unfairness is further elevated by misconceptions about ML in self-education resources and the subsequent, potentially underinformed application of ML.
Expertise and the role of tutorials in the formation of ML practice
The point of departure for our study is Mackenzie’s (2017) book
In this article, we focus on self-education resources in the form of online tutorials to understand how ML is framed and constructed by ML practitioners. Analyzing ML tutorials is an expedient endeavor considering the increasing application of and demand for Machine Learning. In 2018, Tencent estimated the global number of Artificial Intelligence researchers and industry practitioners to be between 200,000 and 300,000 people (Kahn, 2018). Compared to the 18 million software developers in the world, this means that only about one per cent of software developers have the skills to engage with AI and ML as novel paradigms of programming. Given software developers’ habitual practice of self-education and the growing demand for ML techniques, self-education in Machine Learning can be expected to increase significantly. A Stack Overflow (2019) survey demonstrated that informal learning and self-education are important ways of knowledge acquisition in a world in which development frameworks and technologies change rapidly. 86.8% of professional software developers (
Our focus on ML tutorials is motivated by the important role that self-education plays in software development and computer science. Professional software developers frequently do not have a formal background in computer science and are used to teaching themselves new technologies. This is especially important since many of the technologies used in practice were invented and introduced long after the software developers finished their formal education. The Stack Overflow (2019) survey of professional software developers (
This is problematic because understanding and effectively applying Machine Learning requires practitioners to have a strong background in linear algebra and statistics (Goodfellow et al., 2016). This lack of formal education regarding the application of ML is even more problematic considering that a large proportion of software developers graduated college before recent advances in Machine Learning were published and before curricula were upgraded to reflect the strong demand for ML practitioners.
While expertise used to be understood as something logical, Evans and Collins (2008) argue that the understanding of it has moved towards ideas of expertise as something practical that is “based in what you can do rather than what you can calculate or learn”. They also argue that the distinction between expert and non-expert cannot be neatly mapped onto the boundaries of social institutions and highlight (along with other STS scholars) that expertise and knowledge also ‘exist outside the mainstream scientific community’ (Evans and Collins, 2008: 609). Expertise is hence not solely a quality of individuals but also belongs to a community. Within distributed and dispersed communities, the question of how knowledge may be shared and transferred becomes crucial (e.g., Jarke, 2014; Vaast and Walsham, 2009). Orlikowski (2002: 249) argues that knowledge is not something static or a stable disposition, but something that is continuously produced and reproduced in everyday practice. Tutorials are one way to reproduce and circulate a community’s knowledge practices. Tutorials constitute the practical doing of a community in material form and can be conceived as “boundary objects” (Jarke, 2014; Star, 2010) to share the expertise and knowledge of practitioners.
ML tutorials may be understood as
Methodology
Our understanding of expertise and the emergence of a community of Machine Learning practitioners is based on ML tutorials as boundary objects. In the following, we provide details on our data collection and analysis process.
For this paper, we sourced ML tutorials through Google Search. We downloaded the Top 50 search results for the query “machine learning tutorials”. The search was executed from Germany with English keywords. We focused on individual text tutorials and ignored collections of tutorials, paid ads, video tutorials, and massive open online courses (MOOCs). The tutorials were collected from a laptop that had never searched for ML or AI before, limiting the effect of personalized search results. We reduced the 50 tutorials to 41, excluding duplicates (3), tutorials that did not focus on ML (3), one tutorial that only focused on installing the software required for ML, one guide with best practices of ML, and one book. Thirty-eight of the 41 tutorials are written in English, three are written in German.
We performed a qualitative coding of the tutorials, following Mayring’s (2014) systematic, rule-bound procedure focused on categories. We analyzed the tutorials using an inductive approach, where we highlighted textual quotations to identify emergent themes. We analyzed the material step-by-step, i.e. we first assigned individual codes to words and sentences. After that, we revisited the material and combined the individual codes into larger categories. This allowed us to account for minor differences, e.g. merging similar ML applications such as self-driving cars and autonomous vehicles. Subsequently, we sorted the different codes based on how often they occurred in individual tutorials.
The average number of words of the examined tutorials is 5200 (SD = 13,012). The median number of 2505 words is considerably smaller. The shortest tutorial had 851 words, the longest tutorial had 85,512 words. We also computed Flesch Reading Ease scores for the tutorials to quantify whether the material is easy (100) or hard (0) to read. The average reading ease score is 50.98 (SD = 12.32), the median is 49.60. This puts the tutorials close to the readability commonly found at college level. These reading scores imply that the tutorials are written for a well-educated audience. The most readable tutorial has a readability score of 76.50, which means that it could be easily understood by 13- to 15-year-old students. The least readable tutorial has a score of 25.20, which means that the text is difficult to read and best understood by university graduates.
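For reference, the Flesch Reading Ease score is a fixed linear formula over word, sentence, and syllable counts. The sketch below (our own, applied to hypothetical counts rather than any tutorial in our sample) shows how such scores are computed.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text.
    In practice, the counts come from a text-analysis library."""
    return (206.835
            - 1.015 * (words / sentences)
            - 84.6 * (syllables / words))

# A hypothetical tutorial with short sentences and simple words
score = flesch_reading_ease(words=1000, sentences=80, syllables=1400)
```

With these hypothetical counts, the score is roughly 76, comparable to the most readable tutorial in our sample; longer sentences and more polysyllabic words push the score down toward the college-level range we observed.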
Framings of Machine Learning
Our analysis explored how different framings of the concept ML are manifested in the tutorials that we analyzed. As a first step, we reviewed the 41 tutorials and searched for definitions or operationalizations of the term Machine Learning. Surprisingly, only 21 of the 41 tutorials (51%) explained the term Machine Learning. In half of the tutorials, the authors did not define or operationalize what ML “is”.
In those tutorials that defined the term ML, we found two dominant definitions. Also, we identified a long-tail of other definitions each only cited once. The most widely cited definition of ML describes it as the “field of study that gives computers the ability to learn without being explicitly programmed” (T4, T3, T9, T27, T30, T39). This definition is attributed to Samuel (1959), who was among the first people to use the term Machine Learning. The second most commonly cited author is Mitchell (1997) (T4, T3, T10, T27, T39), who defined ML as follows:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”
This definition is also used without referring to Mitchell, for instance in Tutorials 15 and 37. In addition to these two dominant definitions, there are a variety of other definitions and descriptions that focus on other aspects. Tutorial 5, for instance, compares ML to pattern recognition, while Tutorials 6 and 8 regard ML as “learning based on experience”, without further detailing what the terms “learning” and “experience” mean. Tutorial 13 distinguishes “traditional programming” from ML, arguing that “traditional programming” relies on hard-coded rules, while ML relies on “learning patterns based on sample data”. Other definitions are more general, describing ML as “a technology design to build intelligent systems” (T20) or as “based on the idea of giving machines access to data and allowing them [the systems] to learn for themselves” (T22). Tutorial 48 defined the term “learning” as “figuring out an equation to solve a specific problem based on some example data”. Tutorial 36 states that “instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data”, and that this generic algorithm “can tell you something interesting about a set of data without you having to write any custom code specific to the problem”. According to these tutorials, “the algorithm/machine builds the logic based on the given data” (T36), hence ascribing agency to the data. This focus on data is even more explicit in Tutorial 39, which defines ML as “a part of AI [artificial intelligence] that learns from the data” (T23). Tutorial 24 defines ML as “generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem”. Tutorial 4 describes ML as “feel[ing] its way to the answer” without explaining further what this means.
In Tutorial 21, ML is described as “the brain where all the learning takes place”, later comparing ML to how humans learn from experience.
Types of Machine Learning
In addition to how the term ML is described, defined, and/or operationalised, our analysis also revealed different types of ML that are commonly recognised. The dominant types of ML are supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning is commonly framed as a type of ML that is based on data and labels corresponding to the data (T1, 35, 19, and 40). Tutorial 12 describes supervised learning as learning a general rule that maps inputs to outputs. Other definitions describe the data and its labels as input and target pairs (T37), inputs and desired outputs (T20), or as “feedback from the humans” (T21). One metaphor describes supervised learning as “the computer” being presented with example inputs and desired outputs by a “teacher” (T12). This imagined actor is also sometimes called an “external supervisor” (T33) or “the scientist” (T32). In other tutorials, supervised learning is framed as working backwards from the solution. Surprisingly, even though supervised ML is only possible if aligned pairs of input and output data are available, the laborious process of data labeling, which has to be done “by a human being beforehand”, is only mentioned in Tutorial 35.
Unsupervised learning is another commonly mentioned type of ML that is often contrasted with supervised learning. Tutorial 5, for instance, regards supervised learning as using labeled data, and unsupervised learning as finding patterns in unlabeled data. A variety of tutorials refer to the lack of labels (T12, 20, and 35), which leaves the “algorithm […] on its own to find structure in its input” (T12). Tutorials frame unsupervised learning as discovering similarities or regularities in the input data. For Tutorial 4, “the program is given a bunch of data and must find patterns and relationships therein”. Similarly, Tutorial 8 regards unsupervised learning as a task where it is up to the machine “to determine the relationship between the entered data and any other hypothetical data”. Tutorial 19 describes unsupervised learning as an approach to data “where you do not have any information about inner interrelations in advance”. This is similar to Tutorial 33, which regards the main task in unsupervised learning as finding “the underlying patterns rather than the mapping”. More broadly, Tutorial 37 describes unsupervised learning as characterizing the unknown distribution.
A third type of ML, reinforcement learning (RL), is defined by Tutorial 35 as a computer program that dynamically interacts with its environment while receiving positive and/or negative feedback to improve its performance. The definitions in Tutorials 1, 12, 20, and 33 focus on an agent that is interacting with a dynamic environment while stressing the importance of “a certain goal”. Tutorial 22 strongly focuses on the agency of “the machine” which “continuously trains itself using trial and error” in relation to a specific environment. This anthropomorphization can also be observed in Tutorial 36, which states that when an RL-based system “makes a wrong prediction[,] it will update its rule by itself”. Other reinforcement definitions (Tutorials 5, 18, and 33) focus on a “reward” that is being optimized by an “agent” based on feedback, highlighting the difference between reinforcement learning and supervised learning.
Data and Machine Learning
Following the popular definition of ML as learning from data, it is noteworthy to consider how data is framed in the different tutorials. Overall, we found little discussion of the nature or significance of data. Those tutorials that engage with the term frame data as “any unprocessed fact, value, text, sound or picture that is not being interpreted and analyzed” (T14) or describe data as “the new oil” that is “precious but useful only when cleaned and processed” (T13). Considering the small number of tutorials that engage with the term data, it is surprising that almost half of the tutorials (46%) mentioned data preparation as a topic. Six of the tutorials (15%) apply and explain data preparation techniques.
Regarding data in ML, it is noteworthy that the large majority of the tutorials do not explain that data presented to a model is a sample that may or may not be representative of a population. Only Tutorial 4 stated that ML systems require “a statistically significant random sample as training data”. Basic assumptions regarding the class distributions, which are crucial for the successful training of ML models (Müller and Guido, 2016) and which can be a great catalyst for fairness problems in ML (Benjamin, 2019), are rarely discussed. The practice of randomly shuffling data is also rarely mentioned (T38 and 40). Stratification, the practice of splitting randomly shuffled data such that the class distribution is the same in the training and the test set, is only mentioned in Tutorial 29. However, it is merely called a good practice that “will ensure your training set looks similar to your test set”.
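The practice that Tutorial 29 only gestures at can be made concrete with a short sketch (our own illustration, assuming a simple categorical label; real projects would typically rely on a library routine):

```python
import random

def stratified_split(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle and split so that each class keeps its proportion in
    both the training and the test set (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(len(items) * test_fraction)
        test += [(s, label) for s in items[:cut]]
        train += [(s, label) for s in items[cut:]]
    return train, test

# 80 samples of class "a" and 20 of class "b":
# the 4:1 class ratio is preserved in both sets.
samples = list(range(100))
labels = ["a"] * 80 + ["b"] * 20
train, test = stratified_split(samples, labels)
```

Without such stratification, a random split of an imbalanced dataset can leave the minority class underrepresented in the test set, masking exactly the fairness problems discussed above.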
Understanding data is mentioned as an important aspect of being a data scientist (T25). Tutorial 41 encourages practitioners to take a “peak at the data itself” by looking at statistics like mean and median as well as class distributions, data visualizations, boxplots, histograms, and scatterplots. Tutorial 28 mentions that it is important for datasets with greater complexity to visualize the distribution of the data “in order to gain an understanding of the data”. Furthermore, Tutorial 29 stresses the importance of speaking to domain experts to gain a contextual understanding of the data and its origin.
Tutorial 20 discusses the importance of transforming the data into a form that is “useful” for the ML system. The process of data preparation is framed as making data “even more valuable” (T32). A frequently mentioned data preparation step is normalization (T15, 16, 17, 26, and 29), which means subtracting the mean and dividing the data by the standard deviation, thus centering the data points at zero with unit variance. Tutorials 13 and 29 mention feature scaling as a similar practice aimed at making all features comparable by putting them on the same scale. Data representation is also discussed for specific application domains such as natural language processing (T31, 37, and 40). Surprisingly, Tutorial 39 is the only tutorial that addresses the issue of missing data and proposes imputation as a solution, i.e. how missing values can and should be replaced.
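The normalization step that the tutorials mention can be stated in a few lines (our own sketch of the operation described above, not code from any tutorial):

```python
def normalize(values):
    """Center the values at zero with unit variance
    (subtract the mean, divide by the standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

scaled = normalize([2.0, 4.0, 6.0, 8.0])
```

After this transformation, the resulting values have mean zero and variance one, which puts features measured on different scales onto a comparable footing.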
Tutorials rarely discuss the impact that the quality of the data can have on the performance of ML systems. Tutorial 39 addresses this issue very briefly by pointing out that the better the quality of data, the more suitable it will be for modeling. Tutorial 4 assesses that “real-world data” is always “a little noisy”, which prohibits the model from fitting the data “neatly on a straight line”. Lack of data and lack of diversity in the dataset are mentioned as primary challenges of ML in Tutorial 21. The tutorial further elaborates that “a machine needs to have heterogeneity to learn meaningful insights”.
Tutorial 20 evokes the notion that “hidden patterns” exist in the data that can be identified by ML systems. Tutorial 24 warns that, when predicting the price of a house, “the function” that an ML system may end up with is “totally dumb”. The ML system does not know what “square feet” or “bedrooms” are. According to Tutorial 24, a regression model is merely “stir[ing] in some amount of those numbers to get the correct answer”. They argue that if a human expert could not use the data to solve the problem manually, “a computer probably won’t be able to either” (T24).
Only one tutorial explicitly addresses the limitations of ML in relation to data. Tutorial 31 argues that the accuracy of the system they are training “seems to be a natural limit for this data with its given size”. This crucial concept—that there is a limit of what can be inferred from data—is only brought up here.
Machine Learning algorithms
In addition to definitions of ML and the types of ML that are distinguished, this paper also explores the specific ML algorithms that are mentioned. While—as stated above—the term algorithm is commonly used as a synecdoche for larger assemblages of socio-technical actors and issues in critical data studies, this section relates to the actual (technical) procedures with which ML-based systems infer from data. Overall, we found that such ML algorithms were mentioned in 31 of the 41 tutorials (76%). The most commonly mentioned ML algorithm is support vector machines, which are mentioned 16 times in the tutorials (39%). The second most frequently mentioned algorithms are artificial neural networks, which are mentioned in 14 tutorials (34%). The third most commonly mentioned ML algorithm is linear regression, mentioned in 13 tutorials (32%). Decision trees and naïve Bayes classifiers are mentioned in 11 tutorials (27%), logistic regression and k-nearest neighbors in 10 (24%). This means that most tutorials are focused on supervised ML models that perform classification or regression. That said, nine tutorials mentioned the unsupervised clustering model k-means, while eight tutorials focused on reinforcement learning. For these algorithms, a long-tail phenomenon can be observed. Thirty-eight algorithms are only mentioned once. These algorithms span a broad range, including inductive logic programming, Bayesian networks, extreme learning machines, long short-term memory networks, multi-armed bandits, and neural Turing machines.
The most commonly applied algorithm is linear regression, which is applied to a concrete application example in 4 of the 41 tutorials (10%). The second most commonly applied algorithm is logistic regression, which is applied in three tutorials. Neural networks and k-nearest neighbors are applied in two tutorials each. Surprisingly, support vector machines, the most commonly mentioned algorithm, are applied only once.
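To ground the discussion, the following sketch fits the kind of linear-regression example the tutorials apply, using ordinary least squares on hypothetical housing data; all values are made up for illustration.

```python
import numpy as np

# Hypothetical housing data: square feet, number of bedrooms, and sale price.
X = np.array([[800, 2], [1200, 3], [1500, 3], [2000, 4], [2500, 4]], dtype=float)
y = np.array([160_000, 240_000, 295_000, 400_000, 495_000], dtype=float)

# Append an intercept column and solve the ordinary least squares problem.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# The model has no notion of "square feet" or "bedrooms"; it merely weights
# the numbers (cf. Tutorial 24's "totally dumb" function).
predicted = np.array([1800.0, 3.0, 1.0]) @ w
```

The fitted weights are simply the coefficients minimizing the squared error; whether they capture anything meaningful depends entirely on the data fed in.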
Considering how ML algorithms are presented, we find that even in tutorials that aim to explain ML, the underlying algorithms are presented as black boxes. The inner mechanics of the most commonly mentioned algorithms, support vector machines and neural networks, are rarely explained, and when they are, the explanations remain superficial. Key concepts like gradient descent and backpropagation, which are described in detail in textbooks like Bishop (2006) or Goodfellow et al. (2016), are not explained either. Few tutorials mention or explain concrete ways to measure the generalization capabilities of ML-based systems. Metrics such as accuracy, precision, and recall are mentioned but rarely formally defined or applied. Widely used metrics such as mean average precision are never mentioned. The large body of work on the interpretability of ML, including the visualization of feature importance, is not mentioned in any of the tutorials that we reviewed.
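For reference, the metrics that the tutorials mention but rarely define can be computed in a few lines; the toy labels below (1 = spam, 0 = not spam) are hypothetical.

```python
# Toy binary classification: ground truth vs. a classifier's predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)  # share of flagged items that really are spam
recall = tp / (tp + fn)     # share of actual spam that was caught
```

Accuracy alone can mislead on imbalanced data: a filter that labels every email “not spam” scores high accuracy on a mostly clean inbox, which is precisely why precision and recall matter.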
Expertise in Machine Learning
We also explored how tutorials framed the required expertise for successfully applying ML. Tutorial 30 argues that
“[ML is] a lot like a car, you do not need to know much about how it works in order to get an incredible amount of utility from it.”
Tutorial 30 further asserts that people can “engage in ML very easily without almost any knowledge at all of how it works” since the default settings of ML libraries can get 90–95% accuracy on many tasks. However, to “push the limits in performance and efficiency”, Tutorial 30 recommends readers “to dig in under the hood”.
The sentiment that a thorough understanding of ML is not required can also be found in Tutorial 41, which tells its readers that they “do not need to understand everything (at least not right now)”. Tutorial 41 states that ML practitioners do not need to know how a model works, arguing that learning about the benefits and limitations of various algorithms can be done later. The idea that expertise is not needed is especially problematic given that Tutorial 30 does not discuss model evaluation in the text and only refers to a follow-up tutorial on evaluation. That said, the tutorial argues that it is “important to know about the limitations and how to configure ML algorithms”.
Surprisingly, the potential of ML systems to overfit, i.e. the danger of inferring parameters that do not generalize beyond the training data, is rarely mentioned (T16 and T31), even though it is a key concept in textbooks like Goodfellow et al. (2016). Only Tutorial 16 mentions the complexity of a model as a possible cause of overfitting. Tutorial 22 addresses the fit between the problem and the ML system, arguing that “it is a fact that no one ML model can solve every problem”. The tutorial stresses the importance of aspects such as the structure and size of a dataset in finding the most suitable model for a given problem. It also declares that “you can’t say that decision trees is [sic!] always better than neural networks and vice-versa”.
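Overfitting, the concept only two tutorials raise, can be demonstrated in a few lines: a model with enough parameters fits its training data perfectly yet generalizes worse than a simpler one. This is a sketch on synthetic data, not a formal treatment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of a simple underlying relationship, y = 2x.
x_train = np.linspace(0.0, 1.0, 8)
y_train = 2 * x_train + rng.normal(0.0, 0.2, 8)

# A degree-7 polynomial can pass through all 8 points: zero training error.
overfit = np.polyfit(x_train, y_train, deg=7)
simple = np.polyfit(x_train, y_train, deg=1)

# Evaluate both models against the true, noise-free relationship at unseen points.
x_test = np.linspace(0.05, 0.95, 50)
y_test = 2 * x_test
err_overfit = np.mean((np.polyval(overfit, x_test) - y_test) ** 2)
err_simple = np.mean((np.polyval(simple, x_test) - y_test) ** 2)
# The complex model memorized the noise; the simple one generalizes better.
```

The degree-7 model “identifies a hidden pattern” in the training set that is, in fact, nothing but sampling noise.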
Applications of Machine Learning
Finally, our analysis provides an overview of the different applications of ML and how common they are. Overall, the 41 tutorials mention 160 distinct applications of ML. The most frequently mentioned application of ML is the detection of spam. It is brought up in 11 tutorials, which means that more than a quarter (27%) of the tutorials reference the detection of spam. The second most frequently mentioned application is self-driving cars, which are referred to in every fourth tutorial (10 mentions, 24%). The prediction of housing prices is the third most frequently used example, mentioned by nine tutorials and applied by one. Other applications include face recognition (6 mentions), sentiment analysis (5), and playing the game Go (5). Playing chess (4) and cancer detection (4) are mentioned in 10% of the tutorials.
The different applications of ML follow a strong long-tail distribution: 111 applications are mentioned in only one tutorial, while twenty-nine applications are brought up in more than two tutorials. In the long tail of applications mentioned only once, we find examples such as Facebook’s News Feed curating news, an ML system creating art, industrial logistics in general, and a robot learning to fly. Other applications include preventing jaywalking, detecting pornography, robots doing backflips, network security anomaly detection, as well as text mining and social media analysis.
Considering how ML applications are presented in the tutorials, it is noteworthy how comparatively few tutorials show how to actually apply ML. Only eight tutorials explain how to implement an ML system. The applied examples are unique, i.e. no application was implemented in more than one tutorial. The applications include regression problems like the prediction of housing prices or stock prices as well as classification problems like the classification of handwritten digits, fruits, flowers, and the quality of wines. Clustering problems include the sorting of building bricks as well as the clustering of people based on specific attributes.
Discussion
This discussion reviews and contextualizes the findings of our analysis. We explore the misconceptions about ML that we discovered, question the universal applicability of ML that we found, engage with the dangers of an underinformed application of ML, and explore the relationship of data and power in ML.
First, let us reflect on some general observations about tutorials. The Stack Overflow (2019) survey showed the central role that self-education plays in the formation of software developers’ practices. Our analysis of tutorials as a central self-education resource therefore yields important insights into how Machine Learning is understood by ML practitioners. Overall, our analysis indicates that tutorials are not very actionable. Though ML algorithms are frequently mentioned in the tutorials, they are rarely applied or explained in detail. Tutorials also do not discuss the potential issues related to data and evaluation, which connects to the dangers posed by the underinformed application of ML. We also find that, compared to textbooks like Bishop (2006) or Goodfellow et al. (2016), the tutorials barely engage with the terminology of the academic literature. Only half of the tutorials explicitly define or explain the term Machine Learning, and those that do mostly refer to the two most widely cited definitions by Samuel (1959) and Mitchell (1997).
Canonical examples of ML
We identified a small number of ML applications that can be regarded as canonical examples of ML, since they are brought up frequently, especially in comparison to the large number of applications that are only mentioned once. The canonical examples of ML are the detection of spam, self-driving cars, and the prediction of housing prices. Another frequently mentioned example is playing the game Go. In this context, it is interesting to point out that while spam detection and housing price prediction are tasks that are comparatively easy to implement (even for inexperienced programmers), self-driving cars as well as playing the game Go require large research teams with substantial resources.
Misconceptions about ML
We uncovered a variety of misconceptions about ML in the examined tutorials. The most visible and problematic misconception was that ML was presented as “adapt[ing] in response to new data and experiences to improve efficacy over time” (T21). This is a misconception because the large majority of ML algorithms used in practice have clearly defined training phases after which the system is deployed. ML systems are thus rarely trained during use, a special case commonly referred to as Online Machine Learning (Saad, 1998). Such online learning is not supported by the large majority of the algorithms mentioned in the tutorials, especially not those that are mentioned frequently. Nevertheless, the tutorials suggest a belief that ML systems adapt to new data and experiences and that such systems can be “improved over time by feeding them with information and data in the form of real-world interactions and observations” (T22). In practice, however, online learning is rarely applied.
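The distinction can be made explicit with a toy one-parameter model trained by plain gradient descent (all data hypothetical): a batch-trained model is frozen at deployment, whereas an online learner keeps updating during use.

```python
def batch_train(data, lr=0.1, epochs=200):
    """Fit the weight of y = w * x once, on a fixed training set."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            w += lr * (y - w * x) * x  # gradient step on the squared error
    return w

train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # consistent with y = 2x
w_deployed = batch_train(train_data)                # the training phase ends here

# Deployment: the batch model never changes, whatever interactions it sees.
new_interaction = (1.0, 5.0)
w_after_use = w_deployed

# An online learner would instead take a further gradient step during use.
x_new, y_new = new_interaction
w_online = w_deployed + 0.1 * (y_new - w_deployed * x_new) * x_new
```

Only the online variant “adapts in response to new data and experiences”; the batch variant, which dominates the algorithms the tutorials mention, does not.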
Another interesting misconception that we found in the tutorials was the idea that ML systems “learn for themselves”. According to Tutorial 10, ML gives systems “the ability to automatically learn and improve from experience”, largely disregarding that the tasks and cost functions are defined by humans, that the data is preprocessed, and that the model needs to be evaluated and deployed. This is echoed in Tutorial 39, which describes ML as “automating and improving the learning process of computers based on their experiences without being programmed, i.e. without any human assistance”. We regard this as misleading and highly problematic, considering how the large majority of applications found in the tutorials are developed. Applying ML requires a highly skilled and specialized team that (1) collects the right input data, (2) makes sure that data is sampled, prepared, and referenced correctly, (3) ensures that a system is “trained” properly, and (4) evaluates whether what the system “learned” is generalizable and actionable.
Some tutorials also evoke the notion of “hidden pattern[s]” (T20) that are invisible to humans and that can be identified by ML. The overwhelming majority of tutorials, however, offers readers no assistance in learning to determine what can and cannot be inferred from given data.
Universal applicability
In the 41 tutorials, we found 160 distinct applications and a strong long-tail distribution in which very few tutorials mention the same ML application. This implies that ML is presented as applicable to a large variety of problems. This finding connects to Mackenzie (2017), who, in an analysis of Friedman et al.’s (2001) ML textbook, found that ML is applied to a wide range of problems and application contexts. We corroborate this by showing that tutorials, too, present ML as universally applicable across a large variety of contexts.
Underinformed application
In addition to the challenges associated with this universal applicability, we also identified another important misconception that we refer to as underinformed application, a framing for the idea that ML can be applied without special expertise. Some tutorials explicitly encouraged such underinformed application by telling readers that they “do not need to understand everything” (T41) and that they can apply ML “very easily without almost any knowledge at all of how it works” (T30). The same two tutorials also state that ML practitioners “do not need to know how the algorithms work” (T41) or that “you do not need to know much about how it works to get an incredible amount of utility from it”. This is in line with popular arguments by researchers such as Jockers (2013). He applied the unsupervised ML technique latent Dirichlet allocation (LDA) to analyze how an author’s gender, nationality, and publication year affected the topics in a corpus of 3,346 19th-century books of British, Irish, and American literature. Reviewing the results of his study, he claimed: “It is fair to skip the mathematics and focus on the results. We needn’t know how long and hard Joyce sweated over Ulysses to appreciate his genius, and a clear understanding of the LDA machine is not required to see the beauty of the result”.
This statement is remarkable for several reasons. First and foremost, it likens “the LDA machine” to one of the most influential and important authors of the 20th century. Second, the intricate practices of writing a book are equated with a procedure in which words and topics are sampled from probability distributions. Third, the statement overlooks that the “beauty of the result” of LDA depends on randomness in its training and inference phases, which means that the results are partly due to chance.
Statements like the one from Jockers (2013) are highly problematic because ML operates in the domain of statistics, where spurious correlations can lead to misleading results. In addition to explicit statements that encourage users to apply ML without a deep understanding, our analysis of the tutorials also provides ample evidence that the framing of ML can act as a catalyst for its underinformed application. The significance of data, the preparation, selection, and representation of data, as well as the evaluation of ML systems play a comparatively small role across the examined tutorials. The overwhelming majority of tutorials does not acknowledge the difference in complexity between tasks such as the detection of spam and self-driving cars. The tutorials also do not reflect on the statistical nature of ML. Even basic statistical assumptions, e.g. that the data used are a sample of a population and prone to sampling biases, are not addressed. Common mistakes, e.g. the potential problems that arise from data leakage, are not discussed either. Fundamental problems like overfitting are only mentioned twice. The tutorials also do not address the influence that the complexity of an algorithm can have on the results.
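Data leakage, one of the common mistakes the tutorials leave undiscussed, is easy to sketch: computing preprocessing statistics on the full dataset lets information from the test set flow into training (all numbers below are hypothetical).

```python
import statistics

data = [3.0, 4.0, 5.0, 6.0, 100.0]   # the last value belongs to the test set
train, test = data[:4], data[4:]

# Leaky: standardization statistics computed on ALL data, test set included.
leaky_mean = statistics.mean(data)    # dominated by the held-out outlier

# Correct: statistics computed on the training split only.
clean_mean = statistics.mean(train)

# A pipeline built on leaky_mean has "seen" the test set before training,
# which inflates evaluation results.
```

The two means differ sharply here because of a single held-out outlier; in realistic pipelines the leakage is subtler but produces the same over-optimistic evaluation.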
Overall, the tutorials do not convey that ML is not just an “algorithm” but an interplay of socio-technical actors. Thus, the framings do not reflect the intricate practices associated with training ML systems, especially considering data representation and model evaluation.
The framings that we uncovered can lead to highly problematic scenarios in which people who lack sufficient data preparation and model evaluation expertise train systems with potentially catastrophic consequences for individuals and society at large. Since these framings are based on ML tutorials, one could argue that these risks are exaggerated or an artifact of using ML tutorials as a lens. One possible explanation could be that ML tutorials as a genre may not allow enough detail to tackle these issues. This explanation, however, is refuted by the finding that other complex topics, such as hyperparameter optimization, are discussed in the tutorials.
A stronger counterargument against the risks of the underinformed application would be that ML tutorials are only one particular source of knowledge that is used to learn about ML. Those who want to learn about ML might combine university courses, textbooks, MOOCs, video tutorials, and blog posts. As such, the limitations of ML tutorials could be compensated for by other sources of knowledge. Nevertheless, the tutorials revealed significant limitations of the communal understanding of ML, which provides important insights into how ML is framed by ML practitioners.
Disregard of social and ethical aspects of Machine Learning
In combination, the universal applicability and the underinformed application can lead to dangerous misapplications of Machine Learning that pose important risks for worsening the social inequalities highlighted by Noble (2018) and Benjamin (2019). Disregarding the wider socio-technical assemblages in which Machine Learning is applied, one might attempt to repurpose an ML system used to filter spam emails to select job applicants. Any reader could easily train such a system using the programming code we presented in Figure 1 by replacing the file
Conclusion
This paper provides evidence for a variety of potentially harmful misconceptions about Machine Learning and shows that the analysis of ML tutorials can yield important insights into how ML is framed and constructed by practitioners. Based on the analysis of 41 Machine Learning tutorials, we reveal canonical examples of ML as well as important misconceptions and problematic framings. Our analysis uncovers the dangers of misconceptions like a supposed universal applicability of ML-based systems and the potential underinformed application of ML algorithms. We find that ML algorithms only play a marginal role in the tutorials we examined. Our analysis also shows that the importance of data and related data preparation and data labeling practices is vastly understated.
Considering the importance of data that we highlight in this paper, we argue that critical data studies can help overcome these limitations. Recent work on Data Feminism by D’Ignazio and Klein (2020), for instance, stresses the importance of training data. In support of their claims, we argue that it is not only data but a variety of practices associated with data preparation and data labeling that impact the ways in which ML-based systems infer and make predictions. We argue that these have to be studied in more detail and provide evidence that a study of self-education resources can yield interesting insights into such practices. Our paper provides an important starting point for this investigation by mapping out the different framings of ML in self-education resources and highlighting their specific shortcomings. However, more work is needed to study these practices and their impact on society.
Our analysis suggests that rather than explicitly defining Machine Learning and its characteristics, tutorials rely on example applications. This, however, introduces additional uncertainty about what Machine Learning is or is not as well as what it can or cannot accomplish. We do identify spam filtering, self-driving cars, and the prediction of housing prices as canonical examples of Machine Learning. The lack of rigor regarding the definition of Machine Learning is reflected in the important misconceptions uncovered by our study. We find that ML-based systems are presented as improving over time and learning “for themselves”. While this is technically possible, our analysis shows that the systems most commonly discussed in tutorials do not adhere to these principles.
Overall, we find that the importance of data is underestimated in the examined tutorials. This is especially problematic considering the tendency of tutorials to present Machine Learning as a technology that can be applied to any problem and that can be used without a strong background in statistics and mathematics. The low threshold for training an ML system goes hand in hand with the increasing public availability of Machine Learning to almost anybody interested. The public availability of ML also increased due to the growth of companies that offer Machine Learning as a service (MLaaS). Such cloud services not only minimize the technical abilities needed to train ML systems. They also provide the computing resources necessary to train such systems.
To gauge the severity and the impact of this problem, further research should explore how ML practitioners make sense of Machine Learning and their own role, e.g. with respect to the impact of specific (design) decisions. One example could be investigating how they reflect on the effect of specific data preparation measures on the training data set. More generally, accounts of practitioners’ perspectives on data science and exploratory data analysis are needed.
Our investigation shows that it is important to open the “black box” of ML and to make the practices that affect the predictions of a system examinable. The idea that the application of ML does not require specific expertise and that it is universally applicable may lead to the emergence and increase of socio-technical systems that are harmful. Hence, based on our findings, we argue that critical research on information systems that are fully or partially based on Machine Learning needs to investigate the practices and processes of data preparation and processing. Data and data-related practices should then be understood as part of the synecdochal understanding of algorithms as sociotechnical assemblages.
Supplemental Material
Supplemental material, sj-zip-1-bds-10.1177_20539517211017593, for “Machine learning in tutorials – Universal applicability, underinformed application, and other misconceptions” by Hendrik Heuer, Juliane Jarke and Andreas Breiter in Big Data & Society.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The work of Hendrik Heuer and Andreas Breiter was funded by the German Research Council (DFG) under project number 374666841, SFB 1342. The work of Juliane Jarke was funded by the German Federal Ministry of Education and Research under project number 01JD1803A.
