Abstract
Machine learning (ML) systems have shown great potential for performing or supporting inferential reasoning through analyzing large data sets, thereby potentially facilitating more informed decision-making. However, a hindrance to such use of ML systems is that the predictive models created through ML are often complex, opaque, and poorly understood, even if the programs “learning” the models are simple, transparent, and well understood. ML models become difficult to trust, since laypeople, specialists, and even researchers have difficulty gauging the reasonableness, correctness, and reliability of the inferences performed. In this article, we argue that bridging this gap in the understanding of ML models and their reasonableness requires a focus on developing an improved methodology for their creation. This process has been likened to “alchemy” and criticized for involving a large degree of “black art,” owing to its reliance on poorly understood “best practices”. We soften this critique and argue that the seeming arbitrariness is often the result of a lack of explicit hypothesizing stemming from an empiricist and myopic focus on optimizing for predictive performance rather than from an occult or mystical process. We present some of the problems resulting from the excessive focus on optimizing generalization performance at the cost of hypothesizing about the selection of data and biases. We suggest embedding ML in a general logic of scientific discovery similar to the one presented by Charles Sanders Peirce, and we present a recontextualized version of Peirce’s scientific hypothesis adjusted to ML.
Introduction
Machine learning (ML) has become a key technique for solving a wide range of problems within fields as diverse as marketing, financial trading, policing, and medical diagnostics, and its potential uses seem only to increase. However, many scholars point to important societal issues that might result from the rapid and widespread implementation of these techniques in society (Veale and Binns, 2017; Whittaker et al., 2018). Algorithmic transparency and accountability can be difficult to ensure when decisions are made on the basis of an “automatically” learning system (Burrell, 2016; Citron and Pasquale, 2014; Friedman and Nissenbaum, 1996; Kroll et al., 2017; Pasquale, 2015), and hidden uncertainties and biases encoded in data sets can have subtle influences on the learned models (Boyd and Crawford, 2012; Harcourt, 2007). To make matters worse, the scientific methodology that is supposed to account for such problems in the research and development of ML models is itself under attack, with some critical voices likening it to “alchemy” or lamenting the prevalence of “black art” and “magic spells” (Campolo and Crawford, 2020). Such criticism has the potential to undermine the scientific legitimacy of the ML field and should be met with critical scrutiny of and reflection on the methods of ML and their justification. In this article, we pose the following research question: what kind of reasoning is operating in the development and implementation of ML models, and how can such reasoning be employed in a scientifically rigorous way? As ML models enter into important high-risk decision-making, pressure mounts to develop a better understanding of the knowledge they produce and apply, in order to guarantee their reasonable and accountable development and use (Passi and Sengers, 2020; Wachter and Mittelstadt, 2019; Whittaker et al., 2018; Wieringa, 2020).
This article focuses on ML models as tools for inferential reasoning 1 and investigates the logics underlying discovery and sense-making processes that utilize them. ML is often portrayed as automated, yet a great deal of human labor is involved in modeling and deployment (Passi and Jackson, 2017; Passi and Sengers, 2020): data is cleaned and transformed before use, often according to the intuition of the ML developer (Rouvroy, 2011; Veale and Binns, 2017), models are chosen (Hastie et al., 2009), and classification and sorting are employed to create groups and select variables and labels (Bowker and Star, 1999). Whereas these processes of cleaning and categorizing are applied for practical and functional reasons, they concurrently shape and affect the outcomes of the studies they are part of (Harcourt, 2007).
To some scholars and practitioners, the human involvement in the ML process introduces subjective biases 2 into the models, biases that are at best unnecessary (Kanter and Veeramachaneni, 2015) and at worst problematic. Such opinions have been extensively criticized by, among others, Kitchin (2014). We instead argue that human involvement is indispensable for successful ML. The problem is not that researchers impose subjective biases, but that the ML field, in both research and industry, is poorly equipped to reflect on and account for the biases of both researchers and technical methods (Green and Hu, 2018; Whittaker et al., 2018). To preserve its scientific legitimacy, we argue, the field of ML must reevaluate the soundness of its methodology and the inferential reasoning it aims to facilitate. We need to open our eyes to the—already existing—methodological considerations applied by practitioners and researchers when creating ML models and insist on the relevance of critical scrutiny at this level for the responsible, reasonable, and accountable use of ML models to facilitate informed decision-making. We need to understand the role of biases in ML models, whether chosen intentionally or unwittingly, not in order to remove them but to evaluate and account for their impact and to understand and discuss their use.
In ML, one starts with the hypothesis that a functional relationship exists between some given predictor (input) variables and other predicted (output) variables, such as the link between the pixels of a digital image and what that image depicts, or one between the physiological data of an individual and the condition of their health. Much may be known or unknown about the hypothetical input–output relationship, such as the relative importance of individual features, the general type of its functional form, or whether such a relationship even exists at all. ML techniques apply biases derived from specialized knowledge and hypothetical assumptions about the input–output relationship, e.g. in the form of model types and architectures, data transformations, or loss functions, to find correlative patterns indicative of this relationship in training data. The goal of this pattern recognition is to infer the parameters of a model that utilizes the identified correlational patterns in a sufficiently reliable way that new output values can be predicted from previously unseen input with minimal error. To infer these model parameters, ML techniques implement a type of inductive inference, where the parameters of a given starting model are incrementally adjusted in a way that continuously increases predictive performance on the given training data. When the process is successful, the result is a model that performs a good approximation of the hypothetical input–output relationship, such as a model that can recognize common objects in images with sufficient accuracy, or one that can estimate the likelihood of a particular patient having an illness based on their medical data. It is in this sense that machines are said to “learn”; they improve their performance at some task defined as mimicking a hypothesized functional relationship, given data illustrating that functional relationship. The main consideration in both ML research and development is optimizing generalization performance, i.e., identifying patterns that are equally applicable to unseen data as to the data from which they are derived. Generalization performance is typically measured as the accuracy of the trained model on held-back testing data, with high testing accuracy indirectly indicating a successful model induction.
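To make these terms concrete, the following is a minimal sketch of the workflow described above, written in Python with scikit-learn; the synthetic data set, model type, and split ratio are illustrative assumptions rather than recommendations.

# Minimal sketch: hypothesize an input-output relationship, induce model parameters
# from training data, and estimate generalization performance on held-back test data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in data, assumed to reflect some functional link between features X and labels y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold back part of the data to estimate generalization performance later.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Induction: incrementally adjust the parameters of a chosen model to fit the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Generalization performance measured as accuracy on the held-back testing data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))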
In this article, we argue that hypothesizing, especially about the modeled input–output relationship, is of central importance to ML research and practice, though it is often disregarded. First, we present some of the problems that arise when ML research and practice are myopically focused on optimizing generalization performance at the cost of hypothesizing and reflecting on the human choices made in ML modeling. We engage with problems arising from considering ML to be a purely inductive process, after which we show how choosing which data and inductive biases to use for any given ML project unavoidably requires committing to some preconceived explanatory hypotheses about the input–output relationship being modeled. We then present two example cases where major breakthroughs in ML research have been predicated on developing strong explanatory hypotheses. Following this, we present an updated concept of the scientific hypothesis for ML based on the work of Peirce and suggest how this may be used to embed ML within a larger framework of scientific discovery that can strengthen the theoretical groundwork on which ML models are based. Finally, we reflect on the relevance and implications of our findings for ML research and practice.
Problems with the special status of induction in ML
Though ML systems are often seen as functioning automatically with an abstract and sometimes almost magical (Campolo and Crawford, 2020; Moss and Schüür, 2018) ability to generalize across multiple domains and tasks, their development and application nonetheless depend on specialized and experiential knowledge supplied by those who develop and apply the systems (Passi and Jackson, 2017). The importance of this knowledge is often underappreciated (Domingos, 2012), perhaps owing to expectations that models and data speak for themselves (Anderson, 2008). Such expectations hint at a powerful set of ideas at work in the new empiricist epistemologies that have followed the great success of ML methods in diverse fields of science. One of the central critics of this new empiricist “Big Data paradigm” for science, Kitchin (2014), has pointed to precisely these ideas. Listing the following four misconceptions, he argues that a new mode of science is being created in the wake of new data-driven approaches to research, one in which the modus operandi is entirely inductive in nature:
Big Data can capture a whole domain and provide full resolution; there is no need for a priori theory, models or hypotheses; through the application of agnostic data analytics the data can speak for themselves free of human bias or framing, and any patterns and relationships within Big Data are inherently meaningful and truthful; meaning transcends context or domain-specific knowledge, thus can be interpreted by anyone who can decode a statistic or data visualization. (Kitchin, 2014: 4)
Central to Kitchin’s critique is precisely the absence of theoretical and scientific models, hypotheses, and domain-specific knowledge in the paradigm of Big Data, i.e., the undervaluing of and indifference towards contextual knowledge and human interference and reasoning in data-driven fields of research. Through the application of “impartial” data analytics, Kitchin states, the field believes that data-driven research can reach “full resolution” and “inherently meaningful and truthful” results that transcend the troublesome context of particular and situated data patterns, types, or environments.
The special status of induction in ML might be traced back to a shift in the research methodology of the field in the late 1980s. In 1988, Langley published an editorial titled “Machine Learning as an Experimental Science” in the newly formed journal Machine Learning, advocating for an approach to ML that was gaining popularity at the time and would come to dominate the field. In the editorial, Langley presents now-familiar concepts such as (1) defining learning by improvement on a performance metric, (2) dividing data into training and testing sets, (3) comparing the performance of different machine learners on the same benchmark task, (4) measuring the performance of the same learner in disparate settings, and (5) performing rigorously controlled experimental evaluation of different learners under different circumstances. These recommendations came as a response to growing disappointment with the informal arguments used to justify learning methods and the limited utility of formal bounds on performance (Langley, 2011), as well as a growing appreciation of the unique affordances of experimentation in computational ecosystems.
Langley (2011) would later come to regret some of the unfortunate side effects this empiricist optimism had on the field, such as an increased reliance on so-called “bake-offs,” explained as “mindless comparisons among the performance of algorithms that reveal little about the sources of power or the effects of domain characteristics,” as opposed to the controlled experimentation recommended 25 years prior (278). In particular, Langley lamented the separation of ML and AI into disciplines with different objectives, stating that [m]achine learning was originally concerned with developing intelligent systems that exhibited rich behavior on complex tasks, while many modern researchers seem content to tackle problems that do not require either intelligence or systems. Machine learning focused initially on using and acquiring knowledge cast as rich relational structures, while many researchers now appear to care only about statistics. (2011: 278)
Inductive reasoning is indeed central to ML, as it is the primary mechanism at play when machines are said to learn. However, this induction does not happen in a vacuum. In order for learning to be successful, a process of intellectual labor, where data is chosen and prepared and necessary assumptions are made, must precede the learning phase. We will investigate these choices and how they exceed the scope of automated model induction below.
The problem of selecting data
ML, as a process of pattern recognition in data, requires data to succeed. While data is often described as possessing qualities of neutrality and unbiasedness in its initial “raw” state, this view has been brought into question by Gitelman and Jackson, among others, who point out that data always requires interpretation. 5 The question then remains of how to select and prepare data for ML and how to meaningfully approach questions of subjectivity and bias in these data. Which data is chosen and how it is preprocessed for use in making ML models ultimately depends on which hypotheses developers commit to with regard to the relationship between the input and output variables and with regard to the context in which the model will be deployed. As these hypotheses determine the shape of the data, they cannot at the same time be determined by the data—rather, they stem from a process of scientific deliberation that precedes the development and deployment of the model and that is often revisited multiple times in an ML project. Furthermore, the general approach to selecting and preparing training data itself depends on ideas of what constitutes “good” or “valuable” data in the first place. In this section, we will explore three broad strategies for choosing and preprocessing data in ML, each representing a different idea of what constitutes “valuable” data.
The first strategy, which we will call “the Big Data strategy,” is implicit in the “Big Data paradigm” (Boyd and Crawford, 2012; Kitchin, 2014) and is perhaps best summarized by Anderson in his notoriously controversial 2008 editorial for Wired, This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. (2008)
A second strategy for the problem of selecting data, which we will call “the statistical strategy,” is inspired by more tempered attempts to ensure the objectivity and quality of the data through statistical sampling. Within this strategy, the “value” of data is primarily measured by its “cleanliness”. As such, data is, either explicitly or implicitly, assumed to be sampled from a hypothetical underlying statistical distribution. The cleanliness of the data measures their deviation from the imagined true values they would have had according to the underlying distribution, had it not been for contaminating influences such as measurement errors, missing values, and imperfect conversions soiling the data and causing them to become “dirty” (Kim et al., 2003). It is also often implicitly assumed that the data is representative of an underlying empirical relational system of actual objects and that the numbers constituting the data behave in a similar way when combined and manipulated as would their empirical counterparts. 8 Unlike in the Big Data strategy, the exercise of creating a high-quality data set requires extensive theorizing and investigation of the data and their context in order to make the correct statistical assumptions about the shape and nature of the underlying distribution. Common theoretical frameworks for statistical learning often explicitly assume such a strategy, as seen in the definition of Probably Approximately Correct learning (Shalev-Shwartz and Ben-David, 2014), where data is assumed to be an unbiased, independent and identically distributed sample of the correct distribution. Within this strategy, data quality is a measure separate from data quantity (Chu et al., 2016). The goal when choosing training data with this strategy is that training data, after being properly “cleaned” by correcting measurement errors, sample biases, inconsistencies, and other contaminants, should not be manipulated beyond what makes it into a high-quality statistical sample of the investigated empirical phenomenon. While data quantity is still important and desirable in the statistical strategy, data quality as a measure of accuracy and veracity is often more important, and the data is very rarely changed after modeling has begun. In the statistical strategy, good data is clean data.
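The cleaning step central to the statistical strategy can be illustrated with a minimal sketch in Python; the tiny data set, the median imputation, and the plausibility threshold below are illustrative assumptions about the underlying distribution, not general prescriptions.

import numpy as np
import pandas as pd

# Illustrative "dirty" sample: a missing value in each column and an implausible measurement.
df = pd.DataFrame({"age": [34, 29, np.nan, 41, 250],
                   "blood_pressure": [120, 115, 130, np.nan, 125]})

# Impute missing values with the column median, one of many possible assumptions
# about what the "true" value would have been under the underlying distribution.
clean = df.fillna(df.median())

# Treat values far outside a plausible human range as measurement errors and drop them.
clean = clean[clean["age"].between(0, 120)]
print(clean)

Every choice in this sketch (median rather than mean, the 0 to 120 range, dropping rather than correcting the outlier) encodes an assumption about the empirical phenomenon behind the data, which is precisely the theorizing the statistical strategy demands.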
The third strategy for selecting data, which we will call “the pragmatic strategy,” is to view data as a starting point and a resource to be exploited in creating high-performing models. In this strategy, data is viewed as malleable mathematical objects that allow for certain transformations and that can be “molded” into a shape, not to better represent their empirical counterparts as in the statistical strategy, but solely to optimize the performance of a particular ML model. Whereas the Big Data strategy measures data value as quantity and the statistical strategy measures value as accuracy and veracity, this strategy instead measures data value solely by its usefulness, and is thereby inherently more pragmatic than the two others. This strategy is often seen in practice, but at the same time it is rarely given extensive theoretical attention. As described by Passi and Jackson (2018), this strategy is especially prevalent in corporate organizational contexts, where data’s value is measured in direct relation to the resources consumed in the extraction process (e.g. acquisition of more data, computing power, or specialized knowledge) and where there is a fixation on extracting value from readily available data first. If value is assessed primarily as increased predictive performance, the resulting myopic focus on optimizing this metric can lead projects down problematic paths where important concerns such as fairness or transparency are forgotten or ignored (Selbst et al., 2019).
An example of the pragmatic strategy can be seen in the techniques of data augmentation in image classification (Wong et al., 2016). A basic assumption in supervised learning is that a mathematical functional relationship relates input features to correct outputs. This hypothetical function is exactly what the machine learner is attempting to approximate in a given learning task. Crucially, assuming such a relationship requires not just a transformation from empirical object to digital information, i.e., digitization, but also an interpretation of the digital information as a mathematical object such as a tensor or a graph, i.e., mathematization. This mathematization is exploited in data augmentation to identify transformations that can be used on the input features without changing the correct output. For images, such transformations could be slight rotations, croppings, or scalings, or the introduction of artificial noise. When such transformations have been identified, they can be used to great effect to multiply the existing training data into new transformed but still recognizable shapes. As the technique increases the amount of training data available, it might be tempting to attribute it to the Big Data strategy outlined above. However, the creation of this data necessarily depends on the assumption that the data conforms to a particular mathematical structure that can be exploited for the particular task, contradicting the Big Data strategy. Furthermore, it is motivated only by its usefulness for increasing the robustness of resulting classifiers, and not because it results in more accurate or clean data, which contradicts the statistical strategy (Krizhevsky et al., 2012). Several such techniques exist in modern ML, such as feature engineering (Domingos, 2012), where new features are crafted by specialists in order to aid learning, over- or undersampling (Chawla, 2009), where the distribution of data between the targeted classes is skewed to create a more balanced data set, or data programming (Ratner et al., 2016), where noisy labels are automatically generated based on imperfect labeling functions defined by domain experts. Unlike in the statistical strategy, it is not uncommon in the pragmatic strategy that training data changes as much as the rest of the learning system in the iterative process of model development. In the pragmatic strategy, good data is useful data.
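The following is a minimal sketch of such label-preserving augmentation for image data; the particular transformations, their magnitudes, and the assumption that they leave the depicted content (and hence the label) unchanged are hypotheses committed to by the developer rather than properties of the data themselves.

import numpy as np
from scipy.ndimage import rotate

def augment(image, label, rng):
    # Create variants of an image under the hypothesis that flips, slight rotations,
    # and mild noise do not change what the image depicts.
    variants = [
        np.fliplr(image),                                          # mirror the image
        rotate(image, angle=rng.uniform(-10, 10), reshape=False),  # slight rotation
        image + rng.normal(0, 0.05, image.shape),                  # artificial noise
    ]
    # Each transformed image keeps the original label.
    return [(variant, label) for variant in variants]

rng = np.random.default_rng(0)
image = rng.random((28, 28))            # stand-in for a grayscale training image
augmented = augment(image, label=7, rng=rng)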
As these three strategies represent three different ideas of what constitutes “good” data, they illustrate that there is no one answer to the question of data value. Deciding on the relative value of big, clean, and useful data is not trivial and requires methodological reflection and scientific deliberation. The weighing of these values often reflects the ultimate goal of the particular ML project: the pragmatic and Big Data strategies align well with a focus on engineering high-performance models, e.g., in corporate settings, and the statistical strategy with a focus on principled scientific discovery. However, the concerns about inference addressed in the statistical strategy cannot easily be ignored in engineering-focused or corporate projects, since ignoring them potentially undermines trust in the model’s predictions, while the impressive successes of projects following the Big Data and pragmatic strategies often raise the bar for, e.g., academic projects focused on scientific discovery. Committing to hypotheses about the phenomena modeled with ML is a necessary first step in any meaningful application, and one that cannot easily be automated. The way hypotheses are formulated and revisited has a decisive influence on which data is selected and how it is prepared for training, and thereby also on the resulting ML model. As such, the process of deliberation going into constructing the training data is as important a target of critical scrutiny and investigation as any other when trying to understand and account for the effects of finished ML models.
The grounds for analysis: The problem of selecting an inductive bias
For ML to be successful, more than data has to be selected. As proven by Mitchell (1980), in order for statistical learning to properly generalize to new data, a choice of inductive bias must be made that allows the learning algorithm to prefer one generalization over another apart from strict consistency with the given training data. Mitchell’s proof of the necessity of inductive bias in learning generalization has important implications for the creation of machine learners. He states, If totally unbiased generalization systems are incapable of making the inductive leap to characterize the new instances, then the power of a generalization system follows directly from its biases – from decisions based on criteria other than consistency with the training instances. Therefore, progress toward understanding learning mechanisms depends upon understanding the sources of, and justification for, various biases. (1980: 2)
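Mitchell’s point can be illustrated with a small synthetic example: two learners that differ only in their inductive bias, here the restriction of the hypothesis space to polynomials of a given degree, can fit the same training data yet generalize quite differently. The data-generating relationship, noise level, and polynomial degrees below are illustrative assumptions.

import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
y_train = 2 * x_train + rng.normal(0, 0.3, 20)   # assumed "true" relationship is linear plus noise
x_test = np.linspace(-1, 1, 200)
y_test = 2 * x_test

# Two inductive biases: restrict hypotheses to straight lines, or allow degree-9 polynomials.
for degree in (1, 9):
    coefs = P.polyfit(x_train, y_train, degree)
    test_error = np.mean((P.polyval(x_test, coefs) - y_test) ** 2)
    print(f"degree {degree}: mean squared test error {test_error:.3f}")

The weaker bias (degree 9) fits the training points more closely but typically generalizes worse on this kind of task, illustrating that the power to generalize comes from the bias itself, that is, from decisions based on criteria other than consistency with the training instances.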
The choice of inductive biases to test in each iteration of the ML process precedes the inductive inference implemented by ML systems. As such, these biases must be selected by some other process than the inductive reasoning they facilitate. Furthermore, since no sufficient set of biases exists that can be relied upon in all situations, the process of bias selection cannot generally be circumvented. 9 It is indeed in relation to this process of bias selection that much of the controversy surrounding the lack of methodological rigor in current ML resides, leading even researchers within the field to complain about the prevalence of “magic spells and alchemy” (Campolo and Crawford, 2020), “black art” (Domingos, 2012), and lack of empirical rigor (Sculley et al., 2018). To counter this lack of scientific rigor, the choices of biases have to be articulated and explicated clearly as hypotheses and made the subject of theoretical analysis, discussion, and teaching, not as an afterthought but as a default.
To sum up, addressing the problems of selecting data and inductive biases outlined above requires developing the methodology of ML beyond empirical testing (Langley, 2011). In the remainder of this article, we will outline how the development of ML models can benefit from being embedded in a broader logic of scientific discovery, making explicit the different types of reasoning applied at different stages of the ML process. We start by introducing two historical examples from ML research, which illustrate the benefit that can be drawn from being guided by explanatory hypotheses. Having argued for the worth of theorizing, we then suggest a way to position hypothesizing more centrally in ML methodology through a framework inspired by Peirce (1966) in his seminal work on the logic of scientific discovery (5.1; Fann, 1970), finally reflecting on its relevance for modern ML practice.
Hypotheses in ML
In the development of the ML field, breakthroughs have often been made on the basis of strong explanatory hypotheses with regards to the modeled input–output relationship. In this section, we have selected two cases that illustrate this in particularly salient ways. While we do not believe the cases to be reflective of ordinary ML practice, they are both very well-known and have had deep impacts on the ML field. In both cases, we will outline how a breakthrough was attained not solely through the application of empirical testing but also through well-reasoned theorizing, leading to new models amenable to further statistical and mathematical development.
The first case is the development of the Convolutional Neural Network (CNN) in the 1980s by Fukushima (1980) and LeCun (1989) among others, which has since become one of the most influential innovations in the field of computer vision (Schmidhuber, 2015). Fukushima’s seminal 1980 article took the first steps in this development by introducing the “Neocognitron”—a neural network model heavily inspired by prevailing theories of the time describing the organization of vertebrates’ visual cortices. The Neocognitron was invented with two features of biological visual systems in mind: shift invariance, i.e., that such systems can recognize images even if their visual content is shifted or distorted slightly, and hierarchy, i.e., that they can recognize complex patterns in images as hierarchical combinations of simpler forms (e.g., a picture of the number seven can be recognized as two to three simple lines arranged in a particular way). While Fukushima’s Neocognitron was derived with clear inspiration from biological systems, the model was only developed into the modern CNN when LeCun and others went beyond biological analogies and used theoretical insight from statistical learning to design efficiently trainable networks. Since then, the CNN has been an essential technique for visual pattern recognition across many different tasks (Krizhevsky et al., 2012), and the theory behind it has been further developed and has given rise to new innovations.
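A minimal sketch of a convolutional network, written here in PyTorch with assumed input dimensions and layer sizes, shows how the two hypotheses are implemented as inductive biases: weight-shared convolutions and pooling yield approximate shift invariance, and stacked layers compose simple forms into more complex ones.

import torch
from torch import nn

# Minimal convolutional network sketch, assuming 28x28 grayscale input and 10 classes.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # detect simple local forms (edges, strokes)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # tolerate small shifts of those forms
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # combine simple forms into more complex ones
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # map the hierarchy of features to class scores
)

scores = model(torch.randn(1, 1, 28, 28))        # one stand-in image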
Our second case is the development of the Long Short-Term Memory (LSTM) network by Hochreiter and Schmidhuber in the 1990s. In his 1991 thesis, Hochreiter established a theory formalizing and explaining the problem of vanishing or exploding gradients in Deep Neural Networks (DNNs), using extensive testing and analysis to do so. The issue of exploding or vanishing gradients was especially problematic for predictions involving sequences with long time lags, such as those found in natural language processing (NLP). Hochreiter’s theory led to a flurry of work accounting for this problem (Schmidhuber, 2015), perhaps most famously the invention of the LSTM network (Hochreiter and Schmidhuber, 1997). Similar to the strategy behind the CNN above, Hochreiter and Schmidhuber theorized that the problem of vanishing or exploding gradients for sequences with large time lags could be accounted for in the architecture of DNNs and derived working prototypes for such architectures based on this hypothesis. The resulting LSTM became one of the first truly successful deep learning architectures for recurrent neural networks and led to large performance gains in many areas dealing with sequential data, such as NLP (Schmidhuber, 2015).
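The gradient problem that Hochreiter formalized can be illustrated numerically: in a simple recurrent network, the error signal flowing back through many time steps is repeatedly scaled by a recurrent factor, so it shrinks or grows exponentially with the length of the time lag. The factors and the number of time steps below are illustrative assumptions.

# Numerical sketch of vanishing and exploding gradients over a long time lag.
for w in (0.9, 1.1):
    gradient = 1.0
    for t in range(100):          # 100 time steps between cause and effect
        gradient *= w             # repeated scaling by the recurrent factor
    print(f"recurrent factor {w}: gradient after 100 steps is about {gradient:.2e}")

# A factor of 0.9 leaves a gradient on the order of 1e-05 (vanishing), while 1.1 yields
# a gradient on the order of 1e+04 (exploding). The LSTM's gated memory cell is designed
# to keep this effective factor close to 1, so error signals survive long time lags.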
The breakthroughs outlined above were both related to the development of DNNs, a field of study now notorious for its black-box methodologies (Marcus, 2018), yet both were derived from the strong theoretical groundwork preceding them. Both of these important architectures follow directly from their authors’ willingness to commit to a hypothetical explanation of why they would work, and these hypotheses coupled with rigorous mathematical derivations and principled empirical testing enabled the further development of the networks to the powerful tools they are today. While empirical performance testing was important in establishing these architectures as practically viable, the establishment of hypothetical explanations for their utility was at least as important for their development and investigation. However, as mentioned previously, the process of hypothesizing in ML is typically itself undertheorized and poorly documented. In the remainder of the article, we will focus on how the work of Peirce on the logic of scientific discovery can be applied to remedy this lack.
The Peircean logic of scientific discovery
Throughout a career spanning from the latter half of the 19th century to the beginning of the 20th century, the American logician and polymath Peirce was preoccupied with structuring the logic of scientific discovery. Peirce was particularly interested in solving the problem of how hypotheses and new ideas come about. While the derivation of testable consequences from hypotheses is handled by deduction and the testing of these consequences through coherence with data is handled by induction, the creation of the hypotheses themselves remained unexplained. Peirce (1966) thus developed the notion of abduction to describe the logical process of creating explanatory hypotheses, which are a prerequisite for the development of scientific theories (5.172). The development of abduction became crucial to his triadic structuring of the logic of scientific discovery, as it was the singular entry point for new ideas in his system. As he explains in one of his Lectures on Pragmatism in 1903, “Abduction is the process of forming an explanatory hypothesis. It is the only logical operation which introduces any new idea; for induction does nothing but determine a value, and deduction merely evolves the necessary consequences of a pure hypothesis” (1966: 5.171).
The reasoning in ML research and practice can be structured in a way similar to the one Peirce proposed for scientific discovery. Thus, we might say that (1) abduction introduces new hypotheses about the relationship between input and output, which we are attempting to approximate, (2) deduction derives implementations of these hypotheses in the form of specific biases and models, and (3) induction experimentally tests this implementation by inferring and testing the optimal model parameterization under the specific implementation of the learner, thereby indirectly and weakly testing the hypotheses as well. While ML projects are often approached with an engineering mindset, there are unavoidable aspects of scientific discovery to the process, as even the existence of a sufficient model is unknown to begin with and as ML models attempt to infer previously unknown facts. The separation above into theorizing, deriving, and experimenting elucidates one of the problems with current ML methodology, as focus is typically on deriving new models and experimenting with them at the cost of reflecting on the hypothetical reasoning behind these derivations and experiments. We believe that Peirce’s concept of abduction is uniquely suited to strengthen and structure this part of ML research, as it was developed with the specific goal of facilitating the generation of productive hypotheses.
Unfortunately, though the term “abduction” was coined by Peirce in his work on the logic of science, no coherent picture of the term emerges from his writings, as he spent more than 50 years developing the concept under different names. It is clear, however, that Peirce’s understanding of abduction was quite different from what the term is currently taken to mean (Campos, 2011; Douven, 2017; Schurz, 2008). An earlier name used by Peirce (1992) was retroduction (140–141), i.e., “backwards” reasoning, which covered a type of syllogistic reasoning meant to reverse the familiar process of deduction by inferring a plausible cause from a given effect rather than the necessary effects from a given cause. The underdetermination of causes by their effects in retroduction was later addressed by the type of abduction known as inference to the best explanation (IBE; Harman, 1965), where a cause can be inferred on the basis of auxiliary desiderata (such as simplicity, plausibility, or sufficiency) qualifying it as the “best” explanation among plausible candidates. Given a definition of “goodness” for explanations, IBE becomes a type of formal, and thereby potentially computational, logic. This places IBE in the context of justification, as a method for justifying belief in the best among competing hypotheses, and systems of abductive learning in artificial intelligence generally use this type of reasoning in syllogistic form (Gabbay and Kruse, 2000: 12; Mooney, 2000). 10 In his later work on abduction, Peirce came to see syllogistic reasoning as too narrow to account for the full scope of abduction as the source of new ideas and hypotheses (Campos, 2011), and this fundamentally changed his conception of abduction away from the kind of syllogistic reasoning implied by retroduction. Instead, he grounded his concept of abduction in the three stages of scientific inquiry outlined above, placing abduction in the context of discovery as a pragmatic method for generating and selecting hypotheses for further investigation (1966: 5.590), having no influence on the justification for believing such hypotheses to be true (Fann, 1970: 31–32). Peirce never claimed that any formal mathematical mechanism was at play in abducing hypotheses, instead settling on humans having an innate ability or instinct to have good ideas (Fann, 1970; Peirce, 1966: 5.171). Even so, we believe that the Peircean concept of abduction is more appropriate for our purpose, as it is concerned with generating and applying novel hypotheses rather than justifying belief in existing ones. While computational IBE is interesting and valuable in its own right, it only makes possible the combination and refinement of existing hypotheses. In this article, we are not as interested in the reasoning used within ML models as we are in the reasoning used to create them. The conception of abduction emerging from Peirce’s later work is most relevant to our investigation of the methods of ML, as it encompasses exactly the type of theoretical and reasoned engagement with explanatory hypotheses that is often neglected in ML.
It may seem paradoxical that abduction can simultaneously rely on instinct and be a type of logical inference. It is important, however, to remember that Peirce considered the discipline of logic to be normative, i.e., the study of how we ought to think rather than how we actually think, analogously to ethics as a study of what we ought to do rather than actual human behavior (Fann, 1970: 40–41). While hypotheses might arise as a result of a flash of insight, the pursuit of a particular hypothesis is justified by reasons that Peirce suggests constitute a separate type of logical reasoning (Fann, 1970: 41). While Peircean abduction might thus have little to say as to the particular cognitive processes that create new ideas, its logical structure can be of great help in deciding what to do with such an idea after the fact by reasoning and reflecting on its quality for knowledge production.
Peirce puts forth two main requirements for any abductive hypothesis: that it (1) explains the facts it seeks to explain and (2) is capable of experimental verification in the sense that it should have practical implications that could be investigated (Campos, 2011: 431; Fann, 1970: 43–44). While these are minimal criteria for establishing something as a hypothesis, Peirce also suggests a number of economic considerations for selecting among the many possible hypotheses that are allowed by (1) and (2). Campos, reiterating Hookway, lists Peirce’s further recommendations that we should (3) favor hypotheses that seem simple, natural, and plausible to us, (4) prefer theories that explain a wide range of phenomena to those narrow in scope, (5) be mindful of successful theories in other areas and employ similar kinds of explanations, (6) keep in mind the question of economy of money, time, thought, and energy, and (7) not give undue preferences to hypotheses on the basis of “antecedent likelihoods” (Campos, 2011: 431).
The economics of research leading to these recommendations was taken by Peirce to be part of logic, as he had extended the notion of logic so far as to become the “method of methods” (Fann, 1970: 47). If models in ML are seen as implementations derived from certain hypotheses, the ML field may benefit from such a method of methods by making the models themselves objects of study. If the underlying hypothesis that led to a particular model being implemented and trained upholds requirements (1) and (2) above, then a failure of the trained model can be the result of a flawed induction, i.e., unsuccessful training, an unsound deduction leading to a poor implementation, or a wrong hypothesis. These three aspects can be investigated independently, and the trained model may be the locus for such an investigation. In the following, we will focus on Peirce’s concept of abduction and how it may be applied in ML.
Peircean abduction in ML
As explained above, the result of an ML process is determined by the choices made and the hypotheses adopted when preparing the data and the model for learning. Similarly to Peirce’s three stages of scientific inquiry, ML starts by adopting hypotheses as to the relationship between input and output that the model will attempt to approximate. These range from the general, e.g., that the relationship can be adequately described by a mathematical function, to the specific, e.g., that the output is invariant to small rotations of the input.
We suggest that Peirce’s requirements and recommendations (1) to (7) should be recontextualized for ML in the following way. A hypothesis in ML is required (1) to explain the relationship between input and output and (2) to be possible to implement either as a bias in the training data or as an inductive bias. In other words, a Peircean conception of hypotheses makes clear the interpretation of the relationship between input and output, and how this interpretation leads to informed choices of biases in an ML process. Furthermore, a hypothesis in ML is recommended to (3) seem simple, natural, and plausible, (4) explain multiple relationships across tasks and contexts rather than merely the specific relationship investigated, (5) be analogous to successful theories in other fields, (6) be implementable within present constraints on money, time, thought, and energy, minding the economics of research, 11 and, as a continuation of this, (7) not be preferred on the basis of its statistical likelihood alone. These requirements and recommendations all stress the importance of human intellectual labor in the process of creating and developing ML models: not to remove it but to methodologically understand, refine, and account for it.
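As an illustration of requirement (2), consider the toy hypothesis that the output is invariant to horizontal flips of the input. The sketch below, with an assumed stand-in network and data, shows the same hypothesis implemented once as a bias in the training data (augmentation) and once as an inductive bias (a model that is flip-invariant by construction).

import torch
from torch import nn

# Stand-in network and data for a 28x28 grayscale image with 10 possible labels.
base = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
image = torch.randn(1, 1, 28, 28)
label = torch.tensor([3])

# (a) As a bias in the training data: add flipped copies with unchanged labels,
#     so that training merely encourages the invariance.
augmented_inputs = torch.cat([image, torch.flip(image, dims=[3])])
augmented_labels = torch.cat([label, label])

# (b) As an inductive bias: constrain the model so that, by construction,
#     an image and its mirror always receive the same prediction.
class FlipInvariant(nn.Module):
    def __init__(self, net):
        super().__init__()
        self.net = net
    def forward(self, x):
        return (self.net(x) + self.net(torch.flip(x, dims=[3]))) / 2

scores = FlipInvariant(base)(image)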
We believe that this recontextualized version of Peirce’s scientific hypotheses can act as guiding principles in the development of ML models and be the focal point for documenting the theoretical work necessary for successful ML development. However, a few issues remain to be addressed by future work, as these principles are put into practice. The first is that Peirce’s work on scientific discovery was developed with the individual scientist in mind, which is becoming an increasingly unlikely situation, as the development of ML models moves to larger and larger teams, e.g., in corporate settings. This means that organizational mechanisms should be put in place to resolve the different perspectives on what constitutes simple, natural, and plausible hypotheses in recommendation (3) for heterogeneous ML teams. Furthermore, while our examples show the value that can be gained from recommendation (5) of minding successful theories in other fields, it should be noted that many innovations in ML have been possible without any significant influence from other fields. It is worth remembering that only points (1) and (2) are strict requirements, whereas points (3) to (7) are merely pragmatic recommendations that should be interpreted in the context of the individual ML project.
Returning to our two previous examples of the development of the CNN and the LSTM, we see that the explanatory hypotheses leading to their development each fulfill the minimum requirements and several of the recommendations outlined above. As Fukushima illustrated with his Neocognitron, the hypotheses both explained the input–output relationship in image recognition well and could be implemented as an inductive bias. The establishment of these hypotheses furthermore aligns well with recommendations (3) to (5) above, that we should prefer simple and broad explanations, keeping analogies from related fields in mind. When LeCun and others (LeCun, 1989; LeCun et al., 1989) applied and developed this theory by deriving different designs of CNNs that were amenable to efficient training by backpropagation, they were motivated by recommendations (6) and (7) of considering the economy of research, while furthermore taking additional inspiration from theories in statistical learning. In the case of the LSTM, Hochreiter and Schmidhuber’s (1997) hypothesis both (1) explained the relation between input and output through a recurrent cell learning to remember and forget information in order to facilitate later predictions and (2) could be implemented as an inductive bias. While far from simple, the hypothetical model followed naturally from mathematical analysis; it had broad applicability to any task involving sequences with long time lags, and it could be implemented and trained effectively, thereby minding the economics of research as recommended.
As we have seen above, ML techniques are sometimes misunderstood as purely inductive processes that automatically extract useful knowledge directly from a “raw” data input (Kitchin, 2014). However, the processes of creating ML models are iterative in practice, and human intuition and creativity play decisive roles in shaping the data and models into forms that facilitate learning valuable patterns. We argue that this process can be adequately described as an iterative repetition of abductive hypothesizing, deductive derivation of particular prototypes, and inductive testing of their performance. What is lacking in this process is not a higher degree of automation, but an increased focus on the scientific reasoning applied, treating the abductive, deductive, and inductive types of reasoning as equally important and equally worthy of critical scrutiny. This requires that ML practitioners take great care as they choose biases and data preprocessing strategies, making sure to articulate and commit to clear and testable hypotheses for their chosen methods’ utility. We believe that ML as an academic and R&D field would benefit if models were expected to be accompanied by explanatory, and ideally also testable, hypotheses that describe the current best understanding of why the models work as well as they do. We furthermore believe that the recontextualized version of Peirce’s abductive hypothesis presented in this article would aid in generating and documenting such explanatory hypotheses.
Such explanatory hypotheses would enable the learned models themselves to become objects of study, such that adequate hypotheses can be formed about the complex associations encoded within them, and new concepts and theories can be introduced to explain them. In order for such investigations to be successful, the choice of ML models must be subject to new demands. Specifically, we must prioritize models that have clear implications, or from which we can extract meaningful relations that constitute the patterns identified by the models. While this type of transparency does not necessarily preclude black-box models, it does require that they lend themselves to systematic analysis and assessment of the patterns they encode. Sculley et al. (2018) present a number of methods for investigating models and their implied hypotheses. In particular, they suggest ablation studies, where a system is disassembled and each part is tested on its own against our expectation for it. They also suggest simplified experiments and counterfactuals, where a system is tested in illustrative cases with counterfactual or counter-usual data.
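A minimal sketch of such an ablation study is given below; the synthetic data, the model, and the grouping of features are illustrative assumptions. Each group of inputs is removed in turn and the resulting generalization performance is compared against our expectation for it.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in data and model; the feature grouping is an assumed hypothesis about
# which parts of the input carry the signal.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

feature_groups = {"all features": slice(None),
                  "first half only": slice(0, 10),
                  "second half only": slice(10, 20)}

# Ablation: retrain with each feature group removed or isolated and compare test accuracy.
for name, cols in feature_groups.items():
    model = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test[:, cols]))
    print(f"{name}: accuracy {accuracy:.2f}")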
Turning biases into hypotheses through method
It was paramount to Peirce’s classification of abduction as logical reasoning that it could be the subject of critical reflection at every step. To Peirce (1966), reasoning cannot be performed unconsciously, as it is “deliberate, voluntary, critical, controlled, all of which it can only be if it is done consciously” (2.182). While this is not taken to imply that one should be absolutely aware of the cognitive processes of the reasoning mind, it does mean that any suggestion that is to be considered as reasoned must be the subject of critical and logical scrutiny. “What we call a reasoning is something upon which we place a stamp of rational approval. In order to do that, we must know what the reasoning is,” as Peirce (1966) firmly states (2.183). The field of ML has recently been the subject of controversy relating to this idea in two ways. The first critique is that ML models reason in mysterious ways that cannot be understood by human beings (Burrell, 2016), which implies that such models are unreasonable, at least in the Peircean understanding of the term. The second, and perhaps more serious, critique is that the process of creating ML models in the first place is practiced as a “black art” with unarticulated biases (Campolo and Crawford, 2020), implying that researchers and developers themselves are similarly unreasonable. While a defense against the first critique is being mounted in the field of explainable AI (Guidotti et al., 2018; Gunning, 2017; Miller, 2019; Mittelstadt et al., 2019), the validity of the second is still a subject of controversy, where some researchers acknowledge the fault (Sculley et al., 2018), while others vehemently deny it (LeCun, 2017).
In this article, we have argued that the work of Peirce on the logic of scientific discovery may aid in bolstering the scientific legitimacy of the ML field. By taking seriously the process of theorizing about data and model selection as a scientific, logical, and reasoned one, the creation of ML models can be subjected to the same critical scrutiny and discussion seen in other types of data science. Through method, the biases inherent to ML may be turned into hypotheses as the first step of a process of scientific discovery that may lead not just to more effective and creative ML systems, as we have seen in the cases presented here, but also to a stronger understanding of why and how these systems work. With the increased prevalence of ML-based systems of surveillance and profiling, we must insist on the relevance of critical scrutiny of these systems and the processes that create them to ensure their reasonable, responsible, and accountable use.
Acknowledgements
First and foremost, the authors would like to thank the reviewers, whose generous comments and encouragement greatly strengthened the article. The authors would also like to thank Prof. Ira Assent, Prof. Henrik Kragh Sørensen, and the researchers at IND, Copenhagen University for valuable feedback on early drafts of the article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
