Abstract
The use of machine learning (ML) allows us to automate and scale decision-making processes. The key to this automation is the development of ML models that generalize from training data to unseen data. Such models can become extremely versatile and powerful, which makes the democratization of artificial intelligence (AI) possible, that is, providing ML to non-ML experts such as software engineers or domain experts. Automated ML (AutoML) is typically referred to as a key step toward this goal. From our perspective, however, democratizing the verification process of ML systems is a larger and even more crucial challenge on the way to democratized AI. Currently, the process of ensuring that an ML model works as intended is unstructured. It is largely based on experience and domain knowledge that cannot be automated. Current approaches such as cross-validation or explainable AI are not enough to overcome the real challenges, as we discuss extensively in this article. Arguing for structured verification approaches, we present a set of guidelines to verify models, code, and data in each step of the ML lifecycle. These guidelines can help to reliably measure and select an optimal solution while minimizing the risk of bugs and undesired behavior in edge cases.
Introduction
Over the past decade, huge improvements have been made in the field of machine learning (ML) 1 to solve various industrial problems. Extensive research has been done to train and tune optimal models, which has ultimately led to automated ML (AutoML), an approach that can efficiently search over a vast set of potential ML pipelines. 2 This development has made the democratization of artificial intelligence (AI) appear realistic. On the downside, these solutions are often impractical for real-world applications or fail shortly after deployment.
One of the reasons for such behavior is that modeling can be automated via AutoML, whereas the verification of potential solutions is still a manual and unstructured process, often resulting in suboptimal choices. In software development, this is avoided by verification: "Did I build what I specified?" and validation: "Did I build what is needed?"
This differs in the ML context: here, validation is considered for the ML model alone, whereas verification needs to be done over the whole system in which the model is integrated. In the majority of cases, only validation is performed via common approaches such as cross-validation (CV) or explainable AI (XAI), and verification is either missing or ignored. This is because a huge focus is placed on improving the predictive performance of the models, which validation ensures, whereas many other crucial metrics and tests for verifying stability and robustness (verification tests) are neglected, making the ML models unsuitable for real-world scenarios.
For instance, commonly used ML models (e.g., Deep Neural Networks [DNNs]) are black boxes, and judging the reliability of such models purely based on validation techniques such as CV cannot be justified. This makes verification tests crucial, but as discussed, verification needs to be done for the whole ML system, that is, the system utilizing the ML model, and not for the model in isolation.
In this article, we discuss a set of best practices and guidelines to perform structured verification at each phase of the ML development process (ML lifecycle) in three dimensions, namely data, code, and model. The verification of these three components at every phase indicates the verification of the ML system as a whole.
The data component constitutes the data used for training the model as well as the data obtained from the model's predictions, which are used by other systems or as an end result. The code component includes code for model interfaces, data generation/preprocessing, model training/testing, inference, etc. The model component constitutes the ML model itself, on which validation such as CV is performed. The phases of the ML lifecycle along with the defined dimensions are introduced next:
ML lifecycle
We structure the ML process in the following lifecycle scheme, consisting of five phases that are dynamically linked and allow continuous iterations throughout the lifespan of the ML product. This lifecycle scheme closely corresponds to the CRISP-DM model, 3 DevOps, 4 and MLOps.* In contrast to MLOps, we do not separate the phases between modeling and deployment but assume them to be interlinked. The five phases, namely business understanding (BU), data acquisition and exploratory data analysis (DA & EDA), Modeling, Deployment, and Monitoring, along with their interrelations, are illustrated in Figure 1.

The ML lifecycle with five phases and their interconnections. ML, machine learning.
These phases are further divided into three dimensions (data, code, and model) on which verification is performed. We define these dimensions next, already outlining key challenges that we see within each dimension when it comes to verification.
Data
ML is preferred for problems where data are complicated and formalizing an intended solution analytically is not possible. In many cases, the nature of the data can pose additional challenges while designing ML systems. For instance, temporal data differ from the usually expected independent and identically distributed (i.i.d) data due to their dependence on a time component, and the procedures to model such data are different. Thus, there is a need to capture and verify the distinctive nature of the data for proper modeling.
Code
Code is written for the majority of phases in the ML lifecycle, starting from transforming and extracting the raw data, labeling (in supervised learning), feature engineering, feature selection, model building, performance evaluation, and deployment. The verification of such code is necessary to detect bugs in earlier stages of the pipeline; failing to do so leads to error propagation across various stages of the ML lifecycle. We argue for the need for unit and integration tests, analogous to software verification tests, in the ML lifecycle to catch code bugs effectively.
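As an illustration, consider a pytest-style unit test for a small piece of feature-engineering code. The rolling_mean helper below is hypothetical; the point is that the test pins the vectorized implementation against a naive reference and probes a boundary condition:

```python
# A minimal sketch of a unit test for feature-engineering code, assuming a
# hypothetical rolling_mean() helper used in the preprocessing pipeline.
import numpy as np

def rolling_mean(values: np.ndarray, window: int) -> np.ndarray:
    """Hypothetical feature: trailing mean over a fixed window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be in [1, len(values)]")
    cumsum = np.cumsum(np.insert(values, 0, 0.0))
    return (cumsum[window:] - cumsum[:-window]) / window

def test_rolling_mean_matches_naive_implementation():
    rng = np.random.default_rng(0)
    values = rng.normal(size=100)
    window = 5
    expected = np.array([values[i - window:i].mean()
                         for i in range(window, len(values) + 1)])
    np.testing.assert_allclose(rolling_mean(values, window), expected)

def test_rolling_mean_rejects_invalid_window():
    try:
        rolling_mean(np.ones(3), window=0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError for window=0")
```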
Model
In many cases, additional data are collected based on the predictions of a trained model, and these new data are used to update the model via feedback loops. If the model is biased, the updates further worsen it due to bias amplification.5–7 So, the trained ML model needs to be verified against such biases, besides (adversarial) robustness and user requirements. These are discussed in detail in each phase to highlight the need for verification tests to catch such model regressions. Gathering information about the model also helps in designing an ML governance process to check the model's behavior against inherent bias, skew, and edge cases, 8 as well as in understanding the model's decision-making process.
Nevertheless, the developers of ML systems are not the end users (at least in the industrial applications we consider in the Applications and Methods section), and hence quality assessed solely from a developer's view, such as model performance, cannot be guaranteed to hold for users. This can be solved either by testing the systems via crowdsourcing platforms such as Mechanical Turk † or through experimentation on real users via beta testing. 9
The importance and goal of verification
Before digging deeper into the structured verification of ML systems in the next sections, we discuss the importance of verification and elaborate on how verification and validation (V&V), as performed in traditional business processes, differ for ML applications.
In the Business to Consumer scenario, an error might only lead to small losses due to customer churn or a marketing issue whereas an error in industrial settings might lead to more severe consequences. For example, when utilizing ML for visual part inspection (VPI) to identify contamination in food production or person identification in self-driving cars, errors can easily lead to casualties. With such increasing usage of ML in these applications, the correctness of the holistic ML system needs to be ensured, not just the correctness of the model.
In quality management systems such as ISO 9000, business processes are defined to ensure that the customer requirements are fulfilled. In these processes, there is a distinction between V&V. Verification checks that the product or service fulfills the specifications, whereas validation checks that the system meets the operational needs of the user. Although V&V ideally ensure the correctness of the system under test (SUT), this can rarely be guaranteed or proven. Formal verification aims at verifying formal specifications. This cannot be achieved in ML systems if the training data are considered the (partial) specification of the system's behavior.
Hence, the goal of the V&V process is to reduce the risk of a problem in the system. In software engineering, this is typically achieved by means of several test methods such as unit tests, integration tests, etc. These tests typically do not check the SUT under all conditions but are designed well enough to cover the most important situations, typically including, for example, boundary conditions. The intent is that the passing of all tests indicates the correctness of the SUT. In ML, however, tests are done by means of examples. Therefore, crafting the right test would include selecting examples that correspond to the boundary conditions, where the boundary conditions change with the nature of the ML model and the right test example is not in the data set used for training it.
More important, however, is that if one test fails, it is an indication that one needs to dig deeper, because it is likely that many more uncovered errors are present in the system. A method that is often applied instead of well-designed tests is so-called random testing, which tests the SUT in random situations. Here, independent samples are generated randomly from the input domain and passed to the SUT to test its behavior.
Although random testing is a rather effortless testing method, it suffers from its reliance on randomness; for example, a test could fail simply because it was unlucky in selecting a nonpassing test sample. The typical testing methods in ML, such as CV, correspond to random testing, as the cases are drawn from a pool of data. CV tries to overcome the randomness in random testing, since one unluckily selected fold is averaged out over the other folds. Still, it remains important in CV to measure not only the mean but also the variance over the folds.
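As a minimal sketch (assuming scikit-learn and a placeholder dataset), reporting the spread over folds alongside the mean costs a single extra line:

```python
# Report both mean and variance of the CV score, not just the mean.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
model = LogisticRegression(max_iter=5000)    # placeholder model

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
# Reporting only the mean hides unstable folds; the spread matters too.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```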
Often, it is recommended to utilize XAI to overcome these effects. However, inspecting the results and decisions based on a single run is similar to debugging in software engineering. XAI methods may not provide any conclusive evidence or proof of correctness but can only generate human trust in the correctness, just as debugging does in software engineering. Hence, XAI should not be used for generating evidence in a V&V process but only to identify issues in the SUT.
Checklist for structured verification in ML systems
Providing a common structured verification approach that can be applied to any domain is hardly possible due to the necessary complexities involved. Hence, we propose a methodology to derive such a set of guidelines that can be applied in the development of ML systems. Many teams have a specific domain or use-case cluster, for which a structured verification list can be derived by using our proposed methodology.
Our approach is adopted from principles in the realm of software development. The first principle is the idea of Test-Driven Development, 10 that is, for each function/feature, an automated test is developed before the function is implemented. Functionality is only developed if it fixes (or passes) a test, and this principle ensures that every piece of code is covered by tests. The second principle is that for any mistake that has occurred once, it must be ensured that it will not occur again. In the context of software, this means that the root cause of the bug is identified, which can lie either in the user specifications or in the code itself. Afterward, the current automated tests are analyzed to check whether they are complete enough and free of redundancies.
In the context of the structured verification for ML systems, the tests correspond to the items that need to be checked throughout the development process of the ML model. In addition, it is desired to automate these verification tests wherever possible. In later sections, we define these tests in the BU phase of the ML lifecycle, and we implement them for verification in the corresponding phases of the lifecycle that are derived for typical industrial application fields.
Applications and methods
We focus on three applications that are typical in industrial settings: (1) quality inspection (QI), 11 (2) predictive maintenance (PM), and (3) anomaly detection (AD). This serves as a preliminary for the further sections, where specific challenges of these applications are discussed within the concept of the ML lifecycle. The interdependence of these applications is depicted in Figure 2.

ML applications within industrial production that are treated in this work.
To summarize, this article focuses on bridging the gap between software engineers with limited knowledge of ML processes and ML researchers/engineers lacking the ability to bring ML into production. We attempt to bring these two worlds together, and our key contributions are as follows:
We structure the ML process in a lifecycle scheme consisting of five phases and further disentangle each phase into data, code, and model dimensions to identify and understand the key issues in detail.
We distinguish between V&V of ML models and argue that solely performing validation of an ML model cannot be considered an indicator of its reliability. Hence, we insist on structured verification of the ML system at every phase of the ML lifecycle.
We propose a methodology that can assist in deriving a checklist for structured verification in each phase of the ML lifecycle, focusing on certain industrial applications.
In the realm of this work, the rest of the article is organized as follows: The Related Work section discusses the prior literature. The Structured Verification in ML Lifecycle section discusses structured verification for the complete ML lifecycle, which comprises five phases that are further divided into three dimensions: data, code, and model. Finally, we conclude our study in the Conclusion section.
Related Work
Previous research 8,12,13 advocates for the management and versioning of the model and data in addition to the code. This is because the issues exist at the system level rather than the code level, and the data influence the behavior of ML models. Sculley et al. 12 discussed these issues as hidden technical debt in ML systems, analogous to software engineering, and argued that this debt needs to be reduced to avoid huge maintenance costs. In a similar notion, Breck et al. 13 provided tests specifically for data, model, and infrastructure and formulated an ML test score to help reduce this technical debt.
In addition to technical debt, our work also relates to issues discussed under the terminologies of shortcut learning, 5 underspecification, 14 ML fairness, 6 adversarial robustness, 15 and drifts, 16 among others. In particular, we relate our findings across different dimensions to shortcut learning and underspecification, which are discussed briefly here.
Shortcut learning
Shortcut learning describes a phenomenon where the model utilizes nonrobust features to distinguish between different objects of interest. Geirhos et al. 5 state that many of these shortcut opportunities are a consequence of natural relationships; for instance, grazing cows are typically surrounded by grasslands, and the model uses this unintended background for its prediction rather than taking the characteristics of the cow into consideration. In the famous Pascal VOC dataset, 17 all images labeled as horse contained a source tag; removing the tag removed the ability of the ML model to classify the image as a horse. 18
In literature, this is also typically referred to as the Clever Hans effect. 19 The ML models that learn these spurious correlations present in the data are prone to shortcut learning and fail to perform well on unseen real-world data.
Underspecification
Due to the expressiveness of modern ML models, a limited data set will almost always lead to situations where multiple predictors solve a single task with similar predictive risk. 14 In other words, the ML pipeline may return one among many similar models that might or might not encode credible inductive bias. This degrades the trust in the ML system, and D'Amour et al. 14 propose different stress tests to evaluate the obtained model's behavior for robustness and against requirements that are not guaranteed by evaluations on i.i.d data.
In the next section, we discuss the different phases of the ML lifecycle as proposed in Figure 1 across the data, code, and model dimensions, along with recommended practices to avoid pitfalls and to design better ML models. The categorization of dimensions and phases in the ML lifecycle is not strict and is made for convenience.
Structured Verification in ML Lifecycle
We now discuss verifying each phase of the ML lifecycle, focusing on the industrial applications mentioned earlier (c.f. the Applications and Methods section). At the end of each phase, we also summarize the checks that can be performed over all three dimensions (data, code, and model). It must be noted that the list is not exhaustive and only contains a subset of checks that we consider crucial for each phase.
Business understanding
This is arguably the most important phase in ML projects, as there is always a risk that a data scientist makes a decision aimed at improving the performance of the model without the necessary domain understanding. 20 This phase marks the first step of all ML projects and includes discussions between the ML Experts and the Domain Experts, which also involve product owners and stakeholders (customers) in the case of industrial projects. To yield a realistic understanding and evaluation of the problem, reasonable baseline estimates and potential constraints are discussed. This allows one to estimate the underlying risks of the project and also points toward expected successes.
Mistakes in this phase often result in a complete restart from scratch. Therefore, the interaction between the ML experts and the stakeholders should not be limited to this phase only but should be part of all phases of a successful ML project.
With the help of domain knowledge, the problem is framed, and the performance metrics as well as evaluation strategies are selected while taking the stakeholders' requirements into consideration. We mention the selection of such performance metrics 21 separately in Figure 3. Additionally, we provide a brief overview of the selection of performance metrics based on the nature of the data in Table 1.

Selection of performance metrics.
Overview of the choice of performance metrics based on data
AD, anomaly detection; AUC ROC, area under ROC; ML, machine learning; PR, precision-recall; VPI, visual part inspection.
As discussed, correct problem framing is key in ML projects. For instance, in PM, the task can be framed as a regression problem (estimated time until machine breakdown), a classification problem (probability of machine breakdown in a given time span), or a survival analysis problem (survival probability of the machine over time). 27 Hence, proper framing of the problem, the quality of the model's predictions, and the choice of proper performance metrics are considered the most important challenges.
The key result of this requirements engineering is to ensure that all requirements from the stakeholders pertaining to data, model, and code are collected and to decide on a set of verification tests against these requirements. These verification tests act as a Definition of Done for the subsequent phases.
Issues in this phase also arise in the case of miscommunication between the Domain Experts and the ML Experts, which can result in erroneous project requirements. In many companies, this phase is handled by a Requirements Engineer. Requirements engineering for ML is challenging, especially when the model is integrated into conventional (software) systems, which have their own requirements. Apart from playing an integral role in formulating the user/stakeholder requirements, the Requirements Engineer needs to be aware of other requirements such as the General Data Protection Regulation (GDPR), freedom from discrimination, etc. 20 These requirements can also be considered as Quality Assurance in an ML context and are elaborated across the three defined dimensions later.
Data
A major challenge for Domain Experts is to distinguish normal from flawed behavior. This information is crucial, since unclear information leads to incorrect labeling and thereby to incorrect ML models. For example, in QI, a sample (e.g., an image of a production part) may be labeled as a defect only if a specific part is broken, but not if there are only visual defects; in other situations, both cases may be considered defects. Since this largely depends on the task at hand, the requirements should specify such data complexities so that correct labels can be constructed in later stages, avoiding undesirable results.
The Requirements Engineer also needs to collect the requirements on how the model should generalize on the data and to decide on the validation strategy for later performing the train, validation, and test splits. This is especially important, as an improper validation scheme can give an overoptimistic estimate over the test set while the ML system fails dramatically in the real/production environment, incurring huge potential losses. 28 This can be avoided by choosing a proper CV technique, which largely depends on the nature and size of the data.
Such a CV technique does not help to learn better models but delivers reliable estimates of the true performance of the model in deployment. 28 However, since CV is time consuming, it might not be suitable for large datasets and data-heavy deep learning (DL) applications. We mention some potential problems and appropriate CV techniques for such situations in Table 2.
Overview of cross-validation techniques
CV, cross-validation; DL, deep learning; OOS, Out-of-Sample Evaluation.
The datasets used in CV must also be analyzed to ensure that there is no overlap between the train, validation, and test sets. For instance, many datasets are temporal in nature, and a random split assuming i.i.d data would result in validation and test sets that do not match the real world.
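As a sketch of such a time-aware split, assuming scikit-learn and rows ordered by time, each validation fold lies strictly in the future of its training fold:

```python
# A minimal sketch of a temporal split, assuming rows of X are ordered by time.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # placeholder, time-ordered features
y = np.arange(20)                 # placeholder targets

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every validation index, so the model
    # is never evaluated on data from its own past.
    assert train_idx.max() < val_idx.min()
    print(f"fold {fold}: train up to t={train_idx.max()}, "
          f"validate on t={val_idx.min()}..{val_idx.max()}")
```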
Further, this phase should consider fairness aspects, excluding potentially harmful features and including only ethically and legally acceptable ones. This is because the learned patterns may encode prejudices, degrading the credibility of ML systems. In addition, a few sensitive features may need to be collected to check the model's bias against them. For instance, a MAC/device address is sensitive data and should not be utilized for ML modeling, whereas gender might not be used for ethical reasons but needs to be collected to control/verify the model's fairness against this feature.
Nevertheless, eliminating such sensitive features is challenging: if a model is trained on data from which a sensitive attribute has been removed, it might find a redundant encoding of that attribute in terms of other features and wind up modeling the sensitive feature without explicitly being asked. 6 Hence, requirements on how to ensure the model is fair against bias (fairness constraints) need to be defined here and accounted for in the modeling and monitoring phases.
As ML training is dependent on both code and data, testing of data quality needs to be considered as a separate new class of requirements. The requirements should not concern the quantity but rather the diversity of the samples, and the data must be checked for completeness, consistency, and correctness. 20 Completeness refers to the sparsity of the data, which should cover the whole range of possible values; consistency refers to a uniform format and representation of the data across the complete dataset. Correctness refers to the degree of trust that the data truly come from the data-generating process (e.g., free from adversaries) 32 and needs to be verified by verification tests after ML modeling.
All these requirements depend vastly on the way the data are collected and how they should reflect the real-life data. We explain ways to verify these requirements in detail in the subsequent phases.
On a side note, it is important to verify these data quality requirements even for validation techniques such as CV, because the reliability of the estimate obtained by CV depends on the data and cannot be trusted if the quality is not verified. This further strengthens our claim that validation tests such as CV, which are performed over the model in isolation, are not sufficient to ensure the system's performance on real-life data.
Another potential pitfall regarding the data is the construction of the features based on the independent variables that are inputs to the ML model. In some situations, the features are faulty, unavailable, or engineered differently during training and serving time (when the models are integrated into a different system), which may lead to degradation in model performance or failed deployment. The root cause of such issues can be identified by distributing a raw holdout dataset during deployment along with the trained model artifact and reassessing the model performance on this raw holdout dataset. 8
For example, if a sensor providing certain data is broken and its data are passed to the model, the model will deliver faulty predictions. This can be identified either by sanity checks on the model's input data or by integration testing. In addition, we can assess the performance of the model on a holdout set to make sure that it is the sensor that is broken and not the model. The requirement to create such a holdout set is defined in this phase, and the set should be used for a verification test along with other tests (such as sanity checks) in the deployment phase.
Code
In bigger teams, a common challenge in the ML lifecycle is that different teams own different parts of the code, and there are handovers across phases without clear expectations of how to cross these boundaries. 8 For instance, the transfer of code without clear documentation from DA & EDA to the Modeling phase can lead to delays and friction in the project if the phases are handled by different people, a common scenario in larger teams.
Another potential threat lies in writing code for performance evaluation, which, following the test-driven development (TDD) principle, is done before any development. Here, even small logical bugs in the generation of such tests can lead to big estimation errors. 33
The requirements should also include information regarding the deployment of ML systems, where the trained models need to be converted into a different format for faster inference, such as on edge devices (embedded devices, Android/iOS devices, etc.). One commonly used format is the Open Neural Network Exchange (ONNX), ‡ and the deployed models can be verified against the original model via differential testing. Such differential tests check the dissimilarity of the outputs of two similar systems given the same input and are capable of catching errors that cannot be caught via regular testing.
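A minimal sketch of such a differential test, assuming a PyTorch model exported to ONNX and the onnxruntime package (the model and tolerances are placeholders):

```python
# Differential test: the exported ONNX model must agree with the original
# PyTorch model on the same inputs, up to a numerical tolerance.
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU(),
                            torch.nn.Linear(4, 2))  # placeholder model
model.eval()

dummy = torch.randn(1, 8)
torch.onnx.export(model, dummy, "model.onnx")

sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name

for _ in range(100):  # random differential testing over the input domain
    x = torch.randn(1, 8)
    with torch.no_grad():
        expected = model(x).numpy()
    actual = sess.run(None, {input_name: x.numpy()})[0]
    np.testing.assert_allclose(actual, expected, rtol=1e-4, atol=1e-5)
```

Random inputs make this a form of random differential testing; in practice, the raw holdout set from the BU phase is a better input source.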
In some cases, the interface of external services/applications consuming the model can change, which might lead to failure in deployment. For example, an external system that previously utilized the ML model's prediction probabilities might now only use a binary or Boolean output. Such issues are generally identified by means of Contract tests in software engineering. The possibility of such occurrences in ML systems needs to be noted in this phase so that they can be verified later in the deployment phase by means of Contract tests.
Model
The aspects of security and transparency of a model come under the Non-Functional Requirements of ML systems. 34 In cases with legal and regulatory requirements (e.g., GDPR), explainability or transparency may be even more important than predictive power. 20 This can be achieved either by constraining the model so that explanations can be derived or via XAI techniques. 35 Still, explainability might not be helpful, because the model is prone to adversaries 36 and to shortcut learning due to its reliance on nonrobust features. A Requirements Engineer needs to explicitly include explainability requirements from a user's point of view and focus on specifying situations that demand explanations. 20
The model also lacks ways to distinguish between the types of patterns it utilizes while training (leading to shortcut learning, underspecification, and adversarial attacks), so it is not possible to rectify these issues even by increasing the size of the datasets. 37 The only way to verify the occurrence of such issues is to test the models on adversaries and out-of-distribution (o.o.d) datasets that are systematically different from the i.i.d training data used.
These tests can reveal a mismatch between intended and learned rules, such as making decisions based only on pixels or background rather than the object characteristics. For example, in VPI, we need the ML model to identify defective parts based only on the Region of Interest (say, a deformation in machine gears) and not the background. So we must define such o.o.d test cases in this phase; they need to be implemented in later phases to verify the model's prediction quality and ensure trust in the model's correctness. 38
Another important aspect of this phase is the definition of baseline methods jointly with the Business Experts. Reasonable baselines are crucial to put the model performance in perspective, as it can be trivial to present a very complex model with great performance without mentioning the fact that a simpler method was not far behind. 33 Such baselines can range from simple ones, such as majority voting based on the training data distribution, to more complex rule-based systems that might already be in place on the client side. In the Modeling phase, these baselines are implemented and the desired models are evaluated against them.
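For illustration, the majority-voting baseline amounts to a few lines with scikit-learn's DummyClassifier; the dataset here is a placeholder:

```python
# Majority-class baseline: any candidate model must clearly beat this.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=0)
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X, y, cv=5)
print(f"majority-vote baseline accuracy: {scores.mean():.3f}")  # ~0.70 here
```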
Lastly, this phase should also include the freshness requirements of the model: a Requirements Engineer should specify how often retraining is necessary to protect against data/concept drifts 39 and adversarial attacks, based on what the client considers a risk. Besides, the requirements must also contain details of the desired metrics that should be considered to satisfy the quality requirements. 13 For example, if the model is provided as a service, then constraints such as latency, uptime, power consumption, etc. should be specified in service-level agreements (SLAs), and the models must be verified against these after deployment.
The summary of checks for this phase is provided in Table 3.
Summary of checks for business understanding phase
BU, business understanding; o.o.d, out of distribution; SLAs, service-level agreements.
Data acquisition and exploratory data analysis
This phase is deeply entangled with the first phase, requiring close cooperation and reiterations with the Domain Experts. Understanding the data profoundly is a prerequisite both to understanding the business problem and to better assessing the feasibility of the project. This phase aims at acquiring, verifying, and preparing the datasets for the subsequent modeling phase. In addition, we extensively discuss verifying the data via sanity checks, because the varied properties of the data mainly constrain the ML application.
Data
It is desired to have data free from any kind of bias, but data are almost inevitably biased when humans are involved in labeling them. This is known as data bias and can be introduced via selection bias, capture bias, and label bias, among others. 40 This poses a major challenge for picking verification techniques for developing well-generalizing models. Instead, we can try to create unbiased datasets: although it is very difficult to remove bias from the data, Torralba and Efros 40 provide recommendations on developing datasets that avoid or at least minimize the effects of the biases mentioned earlier.
The raw data are often obtained from multiple sources and need to be checked for missing data, uniformity, bias, and the robustness and stability of the data-generating sources. This data generation can happen in two ways: either the data points come from different sources, or the features are created separately.
It is vital to know such dependencies before modeling to save the time and effort of acquiring all the features in the data (feature extraction might be costly in some cases). Instability in the data can also arise when the input of our ML model depends on another ML system and changes occurring in that ML system are not known to the downstream dependencies. This causes data drift in our intended model (detailed in the monitoring phase). Other data dependencies include 13 :
(1) usage of legacy features that were included early in the development but became redundant over time,
(2) using many features to slightly increase performance, at the cost of increased vulnerability,
(3) unknowingly using a correlated feature instead of a real (causal) one.
These dependencies result in a nonrobust ML model, because the learning may happen via shortcut learning, also making the model prone to adversarial attacks. To a certain extent, we can mitigate the issues arising from (1) and (2) by keeping track of the features in use and retaining only the sufficient ones, but it is not possible to tackle (3), as it is very difficult to determine the features that influence ML model predictions.
To minimize errors in the presence of such multiple data-generating sources, we need tests that verify input data against an expected schema. This is important to maintain uniformity and can also help to localize errors to a particular source. 41 The schema tests need to verify that features fall within expected ranges (numeric features) or take only certain values (categorical features). 28 Also, the schema should be flexible enough to accommodate drifts in the feature distribution of the training data due to changes in the real-world distribution. 41 These checks can be considered as static checks performed before modeling; Breck et al. 13 provide detailed information on constructing such a schema, and tests over this schema can be considered sanity checks.
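A minimal sketch of such a schema check in plain Python/pandas; the schema, column names, and ranges are hypothetical and would be agreed on with the Domain Experts:

```python
# Static schema check: verify ranges of numeric features and the value
# sets of categorical features before any modeling takes place.
import pandas as pd

SCHEMA = {  # hypothetical expectations agreed on with the domain experts
    "temperature": {"min": -40.0, "max": 125.0},
    "machine_state": {"values": {"idle", "running", "maintenance"}},
}

def validate_schema(df: pd.DataFrame) -> list[str]:
    errors = []
    for col, spec in SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            continue
        if "min" in spec and (df[col] < spec["min"]).any():
            errors.append(f"{col}: values below {spec['min']}")
        if "max" in spec and (df[col] > spec["max"]).any():
            errors.append(f"{col}: values above {spec['max']}")
        if "values" in spec and not set(df[col].dropna().unique()) <= spec["values"]:
            errors.append(f"{col}: unexpected categories")
    return errors

batch = pd.DataFrame({"temperature": [20.5, 130.0],
                      "machine_state": ["idle", "overdrive"]})
print(validate_schema(batch))  # two violations: range and category
```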
In addition, we can also use pipeline tests as sanity checks to verify the data given to the model at batch time (i.e., when new data arrive). This reduces pipeline debt, which occurs when unclear and forgotten features are buried inside the codebase as the system matures. Libraries such as Great Expectations § provide a framework in the form of assertions to test such data pipelines easily.
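For illustration, a sketch following the classic pandas-based Great Expectations interface; newer releases restructure this API, so the exact calls should be treated as indicative:

```python
# Pipeline sanity checks as declarative expectations over an incoming batch.
import great_expectations as ge

batch = ge.read_csv("incoming_batch.csv")  # hypothetical batch file
batch.expect_column_values_to_not_be_null("machine_id")
batch.expect_column_values_to_be_between("temperature", min_value=-40, max_value=125)
batch.expect_column_values_to_be_in_set("machine_state",
                                        ["idle", "running", "maintenance"])

result = batch.validate()
if not result["success"]:  # result reports pass/fail per expectation
    raise ValueError("batch failed pipeline sanity checks")
```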
Apart from this, verifying the data quality in the train, test, and validation sets is equally important. The validation and test sets should reflect the data on which we expect the model to generalize well. Both sets must be chosen from the same distribution, and the test set should be large enough to accommodate diverse samples to give high confidence in the overall performance of the model. 42
In particular, the quality of the test data is crucial, as it influences the acceptance criteria of the trained model with respect to a chosen performance metric. Mani et al. 43 stress the necessity of measuring test dataset quality and propose test case generation to broaden its coverage of the training set. Their study of the quality of the most common datasets on popular DNNs such as VGG-19, 44 LeNet, 45 etc. showed that performance over test data is not a proper metric if the test dataset is not good enough. We also discuss Neuron Coverage, which can be used to generate test cases over the trained ML model, in the Modeling phase.
Code
The impact of code bugs is tremendous if they are not rectified in earlier stages, and quantifying this impact in ML applications is nontrivial. In cases where the decision regarding feature construction takes place before data collection, unit testing of all feature generation code is crucial. If this is omitted, it results in faulty data for both training and test sets, which remains unnoticed until the monitoring phase, where we observe a drift from the real distribution even though we obtained better performance while modeling, as the test set is also faulty.
In the majority of cases, an engineer may end up writing a massive amount of supporting (wrapper) code to get data in and out of generic packages such as PyTorch, TensorFlow, etc., which makes code maintenance difficult. Sculley et al. 12 discuss such anti-patterns occurring in ML systems due to glue code and further state that systems might end up being less expensive when created as a native solution rather than by combining many different generic packages. Another issue discussed is pipeline jungles, which occur due to the presence of intermediate file artifacts, temporary patches, multiple joins, etc., creating unmaintainable pipelines and technical debt. These can be avoided by better data collection and feature engineering practices.
Model
As discussed in the BU phase, we need to verify the model's performance on samples dissimilar to the training data (o.o.d). Geirhos et al. 5 argue for making such o.o.d test cases a standard practice in addition to the current i.i.d tests. These o.o.d tests can be related to edge cases in software systems. One example of such a test is to verify shortcut learning in QI, where only the pixels from the Region of Interest (ROI) are passed to the model and its output is verified. Another test is to simply black out the ROI pixels and verify whether the model's output is based on the background or the ROI itself.
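A sketch of the blackout test, assuming a classifier exposing class probabilities and a known ROI; the function names and the threshold are placeholders:

```python
# Shortcut-learning probe: if blacking out the ROI barely changes the
# prediction, the model is likely relying on the background.
import numpy as np

def blackout_roi_test(predict_fn, image: np.ndarray, roi: tuple) -> bool:
    """roi = (row_start, row_stop, col_start, col_stop). Returns True if the
    model's confidence drops substantially once the ROI is blacked out."""
    r0, r1, c0, c1 = roi
    masked = image.copy()
    masked[r0:r1, c0:c1] = 0  # remove the Region of Interest

    p_orig = predict_fn(image[np.newaxis])[0]    # class probabilities
    p_masked = predict_fn(masked[np.newaxis])[0]

    predicted_class = int(np.argmax(p_orig))
    drop = p_orig[predicted_class] - p_masked[predicted_class]
    return drop > 0.3  # hypothetical threshold agreed on with Domain Experts
```

A model that keeps its confidence after the blackout is drawing its evidence from the background, which is exactly the shortcut this test is meant to expose.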
It is also vital to verify models against adversarial attacks, because Neural Networks are vulnerable to adversarial samples. 32 To test the models against such adversaries, we can design adversarial samples before the modeling and check the model's robustness in the next phase (similar to TDD in software development). If the model fails the test, the first step is to make the model more robust; if the issue persists, another model that is less prone to adversaries should be selected.
Carlini and Wagner 46 proposed advanced attack models that can be used to design such test cases. In addition, ML model verification tools and techniques are becoming predominant for checking robustness against such adversaries.47,48 However, it is not possible to test against all adversarial attacks; if these tests pass, we can only hope that our selected model is robust, but we cannot guarantee its robustness against all kinds of adversarial attacks.
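As a simpler, well-known alternative to the Carlini-Wagner attacks, the fast gradient sign method (FGSM) can generate such adversarial test cases; a PyTorch sketch with placeholder names:

```python
# FGSM adversarial test: perturb inputs along the gradient sign and check
# whether the predictions flip within an epsilon-ball.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One gradient-sign step within an epsilon-ball, clipped to [0, 1]."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_accuracy(model, x, y, epsilon=0.03):
    """Share of samples still classified correctly after the FGSM step."""
    model.eval()
    x_adv = fgsm_attack(model, x, y, epsilon)
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds == y).float().mean().item()

# Verification test: reject the model if robustness falls below a threshold,
# e.g., assert adversarial_accuracy(model, x_batch, y_batch) > 0.8
```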
The summary of checks for this phase is provided in Table 4.
Summary of checks for data acquisition and exploratory data analysis phase
DA & EDA, data acquisition and exploratory data analysis.
Modeling
Modeling constitutes the third phase where we take the preprocessed data and configure the ML pipeline. This pipeline consists of feature engineering, model selection, hyperparameter tuning, and postprocessing.
The ML models created in this phase are dynamic in nature: changes are not limited to hyperparameters or initializations but also occur in the input, where a change in an input feature shifts its effect on the predicted output. This phenomenon is commonly known as CACE (Changing Anything Changes Everything),12,49 which makes it necessary to evaluate the data and model simultaneously.
Data
The quality of test data for ML modeling is crucial, and metrics such as Neuron Coverage 50 and its variants51,52 are used in the literature to generate test cases under the implicit assumption that increasing Neuron Coverage increases test set quality. These are analogous to code coverage in traditional software development. Recent advances 53 showed that Neuron Coverage cannot be considered a meaningful criterion, since it is easy to generate samples with 100% coverage that are less natural/nonrealistic.
Another important aspect is to verify the collected data for leakage, to avoid overoptimistic estimates of model performance after deployment. Data leakage can occur when training samples are unintentionally also used in testing. It can also occur when the features of the data are normalized altogether before CV and then split into train, validation, and test sets, because the distribution of the test data leaks indirectly into the train data through the normalization.
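The standard remedy is to fit the preprocessing on the training fold only, for example by placing the scaler inside a scikit-learn Pipeline so that CV refits it per fold; a sketch with placeholder data:

```python
# Leakage-free scaling: the scaler is fit inside each CV training fold only,
# instead of once on the full dataset before splitting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"leakage-free CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```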
In addition to leakage, data contamination can occur if the data in CV are also used for tuning the hyperparameters of the ML model. This can be avoided by using nested CV, which can be computationally expensive in some cases. There are various other ways in which data can be contaminated, and potential solutions exist to avoid them. 28
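A sketch of nested CV with scikit-learn, where the inner loop tunes hyperparameters and the outer loop estimates generalization on data the tuning procedure never saw (the search space is hypothetical):

```python
# Nested CV: an inner loop tunes hyperparameters, an outer loop estimates
# generalization on data the tuning procedure has never seen.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}   # hypothetical search space
inner = GridSearchCV(pipe, param_grid, cv=3)       # tuning loop
outer_scores = cross_val_score(inner, X, y, cv=5)  # estimation loop
print(f"nested CV accuracy: {outer_scores.mean():.3f}")
```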
Code
During modeling, improper maintenance of code paths/branches increases the difficulty of maintaining backward compatibility. 12 Most branches are never integrated and only contain code that is usually not relevant. It is always suggested to clear out dead and experimental code paths, which grow over time, and to version the required code along with the data for reproducibility.
Apart from regular code versioning tools such as Git, many tools are available for versioning both data and code, in addition to model management. Examples of such tools include Data Version Control, ** MLflow, †† and Weights & Biases, ‡‡ which help to visualize multiple experiments along with their parameters and performance metrics. 8
Also, for mature systems, the complexity of configuration files can even exceed that of traditional code. This makes them more vulnerable to mistakes that cannot be detected by code tests. 49 A solution to this issue is to maintain modularity even in configuration files, which can be achieved via tools such as Hydra, §§ and to review configuration files in code review.
Model
The goal of this phase is to create a model that generalizes well to new, unseen data. It is good practice to move from simple models to complex ones when improving over the baseline established in the BU phase. This also helps to prevent overfitting, which happens when the model memorizes the training set or starts to model random noise in the training data and eventually fails to generalize to unseen data. This commonly occurs when large/complex models are trained with comparatively little data; hence, it is crucial to mitigate this overfitting on the training data. This is ensured by the evaluation strategies discussed in BU and the use of methods such as regularization, early stopping, etc.
Simplicity of a model is also preferred where the explanation of predicted results is more important than the model's performance. It is easier to obtain explanations for simple ML models than for complex black-box Neural Networks. For instance, a high-performing complex DL model may be created by utilizing sensitive features whose transparency is required by law (GDPR). However, obtaining explanations for such a model is challenging, and hence a tradeoff between performance and transparency is made. One possible solution to obtain a higher-performing fair model is to engineer features that are uncorrelated with the sensitive attributes in the feature space. 6
This is generally not possible in high-dimensional data such as images. Other potential solutions involve applying XAI techniques such as LIME, 54 SHAP, 55 etc. to the trained ML models; these can provide information about the features influencing the model's prediction and help detect correlated features, which might also include sensitive attributes. However, the explanations generated by these techniques may not detect all such correlations, and removing features based on these explanations does not guarantee a model free from discrimination/bias.
This makes it crucial to verify the performance of the trained model for inherent bias in the training data. There might be more data points for one value of a feature than for others; for example, the gender male can be present in the majority of data points, making the model biased toward it. So it is important to check the performance across different slices of the data and to visualize the slices and feature distributions in the datasets. 8 Barocas et al. 6 provide methods to reduce bias by incorporating fairness in a model-agnostic manner without the need for retraining.
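A sketch of such a sliced evaluation with pandas; the slice column, labels, and metric are placeholders:

```python
# Slice-based evaluation: accuracy per subgroup instead of one global number.
import pandas as pd

def accuracy_by_slice(df: pd.DataFrame, slice_col: str) -> pd.DataFrame:
    """df holds one row per sample with 'label' and 'prediction' columns."""
    correct = (df["label"] == df["prediction"]).rename("correct")
    return correct.groupby(df[slice_col]).agg(accuracy="mean", n="size")

df = pd.DataFrame({  # hypothetical evaluation results
    "gender":     ["m", "m", "m", "f", "f"],
    "label":      [1, 0, 1, 1, 0],
    "prediction": [1, 0, 1, 0, 1],
})
print(accuracy_by_slice(df, "gender"))
# A large accuracy gap between slices flags a potential bias to investigate.
```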
As discussed briefly in the Data Acquisition and Exploratory Data Analysis section (model dimension), it is impractical to test the ML model against all adversaries, and the defined tests can only verify the model's robustness against selected adversaries, which creates the need for designing robust models. DNNs in particular are prone to adversarial samples because they are typically overparameterized, with significant redundancy. Han et al. 56 discuss pruning redundant connections, which can reduce overparameterization and help make models robust to adversarial attacks.
Finally, once a suitable model is determined, it is important to verify it against other user-specific requirements decided in the BU phase to ensure that it does not break any contract with its customers after deployment. 8
The summary of checks for this phase is provided in Table 5.
Summary of checks for modeling phase
XAI, explainable AI.
Deployment
In this phase, the model output is finally delivered to the client, which can be a human decision maker or a subsequent machine. There exist different levels of deployment complexity that are often constrained by the hardware infrastructure. However, before delivery, the selected model must be verified for compatibility with the client's infrastructure and, more importantly, against the SLAs specified in the BU phase.
Data
Access to data at deployment also poses a common challenge in ML pipelines, and it is not merely a question of data availability. We expect ML pipelines that are developed on static training data to perform equally well on dynamic real-world data streams, which usually is not the case. Moving from a static to a dynamic setting comes with two general issues that need to be handled:
First, it must be ensured that during deployment, the data are available in a complete and trusted manner. This can be done by verifying the new incoming data using the sanity checks that were implemented in the DA & EDA phase.
Second, some ML applications require real-time availability of the data, meaning that model inference must be done within certain time bounds. For instance, in PM tasks, this real-time availability is crucial and is not met if the databases are updated only on a daily/weekly basis.
Code
As a good coding practice, we should reuse code wherever possible between the training and deployment pipelines to avoid training-serving skew 9 and potential errors due to bugs in the code. Tests for this skew are among the most important but least implemented in the majority of ML projects. 13
As discussed in BU, we need to perform differential testing before deployment to make sure that the predictive performance achieved by the productionized model is the same as the original one. 8
If required, we also need to implement Contract tests as discussed in the BU phase to verify the compatibility of our selected model's interface with the external consuming application.
Model
We need to test the quality of trained models to either reject them or deploy them to production. This can be done by checking various thresholds via automated tests. For instance, models can be evaluated on a normalized error rate over holdout data and rejected if they do not meet the required threshold.
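Such a gate reduces to a few lines; a sketch assuming a hypothetical evaluate() helper that returns an error rate on the raw holdout set:

```python
# Automated deployment gate: reject the candidate model if its holdout
# error exceeds the threshold agreed on in the BU phase.
MAX_HOLDOUT_ERROR = 0.05  # hypothetical SLA-derived threshold

def deployment_gate(model, holdout_X, holdout_y, evaluate) -> bool:
    """evaluate(model, X, y) -> error rate in [0, 1] (placeholder helper)."""
    error = evaluate(model, holdout_X, holdout_y)
    if error > MAX_HOLDOUT_ERROR:
        raise RuntimeError(
            f"model rejected: holdout error {error:.3f} > {MAX_HOLDOUT_ERROR}")
    return True
```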
If the trained ML model is being integrated into an already existing system, we need to verify the proper functioning of the system and model together, which is done by means of integration testing. Here, we perform verification at a system level and ensure that the system works end-to-end without any issues, especially at the interfaces.
We also perform Unit Testing of the deployed model on the raw holdout data, 8 which was set aside before modeling (discussed in the BU phase). This helps to assess whether the trained model integrated into the existing system is at fault if the performance of the whole system drops. If the unit test passes, the performance drop can be attributed to a change or fault in the dependent systems.
For instance, imagine a VPI task where the model was trained on raw images processed by imaging system A. Between the modeling phase and the deployment phase, the firmware preprocessing the images was silently updated to system B. Here, unit testing of the ML model on the raw holdout set helps to detect this change in the image processing pipeline. If the unit test fails with the new system B but passes with the old system A, this hidden change can be detected at the system level, avoiding faulty model predictions in the future.
When the deployed model needs to be replaced with a newer version, either due to the addition of new/better features or due to drift (detailed in the next phase), shadow models 8 can be deployed. If the shadow model performs well, satisfying the user requirements, it replaces the older model immediately, preventing downtime for customers.
The summary of checks for this phase is provided in Table 6.
Summary of checks for deployment phase
Monitoring
This constitutes the final phase of the lifecycle, which only terminates when the underlying application or project ends. The deployed model is continuously monitored according to the measures defined in the BU phase to ensure the stability of both the model and the incoming data. In this phase, model updating and retraining with new, incoming data might be initiated. In addition, the stability of the deployment infrastructure and hardware is continuously monitored and checked for breakdowns.
This phase is particularly important in cases of online learning, because the models are retrained on the data collected in production, which can influence the model's predictions via feedback loops.
Direct feedback loops occur when the model is influenced by its own decision. This causes a shift in class distributions and is predominant in cases of bias amplification or AD tasks. 8
Hidden feedback loops are difficult to detect; they mainly occur when multiple ML models are used for independent but interrelated tasks, where an improvement of one system leads to the degradation of another.12,57
These online learning models use algorithms that improve their performance with the arrival of new data, making the versioning and monitoring of both training and production data important. For instance, this is crucial in the context of QI, where the information on broken parts is added to the data to enable continuous improvement of the model via retraining.
In general, the monitoring of ML models in production is done via a Human-in-the-Loop approach, where several aspects need to be monitored: 8
(1) inputs to the model: to give visibility of emerging training-serving skew, if any,
(2) outputs of the model: to understand the performance over real data and the decisions being made with these outputs (in cases where the outputs of ML models are used to control other systems),
(3) model bias: the model's response/output against sensitive features such as race, gender, etc.,
(4) model interpretability: to understand which features influence the decision, which can be obtained by using LIME, SHAP, saliency maps, 58 counterfactuals, 35 etc.
The aspects mentioned earlier can provide the maintainer with actionable insights to either retrain the model with additional data (via 1 or 2) or discard it if the learning is not as intended (via 3 or 4).
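For aspect (4), a lightweight stand-in for LIME/SHAP-style checks is permutation importance; a sketch with scikit-learn and placeholder data:

```python
# Global interpretability check: which features drive the model's decisions?
# Permutation importance is used here as a simple stand-in for LIME/SHAP.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)  # placeholder model

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1][:3]:
    print(f"feature {i}: importance = {result.importances_mean[i]:.4f}")
# A sensitive feature ranking near the top is also a red flag for aspect (3).
```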
The major function of this phase is to detect the presence of drifts/distribution shifts and possibly determine their type in order to deal with the issue. In the majority of cases, drifts in the ML model can be attributed to a change in the input distribution, the target distribution, or the relationship between them; these drift types and their occurrences are summarized in Table 7. It is important to detect the presence of such shifts/drifts because, in cases where there is no drift, standard models perform better. 63
Taxonomy of different drifts in machine learning (clear solutions to these problems are under active research, so we only provide causes instead of potential solutions)
Data
In dynamically changing environments, the data evolve over time, causing the underlying distributions of the input, the target, or both to change and leading to different kinds of drifts. These drifts may introduce new concepts, or previously seen concepts may reoccur. 16 If the occurrence of a specific drift is known, necessary modifications can be carried out to obtain better performance in the real world. Storkey 63 studied the occurrence of such drifts under the terminology of dataset shift.
Along with the unit and integration tests discussed in earlier phases, ML models need additional tests due to their tight integration with data. We can perform basic statistical tests to compare the mean, median, standard deviation, and min/max values between the training and real distributions. 49
Advanced tests such as skew tests can help in determining the representativeness of the training data with respect to live/real data. These tests involve monitoring the percentage of missing data in training versus real data. The proportion of missing data can also be assessed through the Chi-squared test, 64 through analysis of variance or a t-test if the features are normally distributed, 49 or through a Kolmogorov-Smirnov test if the features are continuous and not heavily skewed. Apart from these, metrics such as the Population Stability Index (PSI) and the Characteristic Stability Index (CSI) can be used to score distribution changes and to assess the performance of deployed ML models.
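A sketch of two such drift checks, the two-sample Kolmogorov-Smirnov test via scipy and a binned PSI; the PSI binning and the alert threshold are common conventions rather than fixed rules:

```python
# Drift checks between a training feature and its live counterpart:
# (1) two-sample Kolmogorov-Smirnov test, (2) Population Stability Index.
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI over quantile bins of the expected (training) distribution."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e = np.bincount(np.searchsorted(edges, expected), minlength=bins) / len(expected)
    a = np.bincount(np.searchsorted(edges, actual), minlength=bins) / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5000)  # placeholder training feature
live = rng.normal(0.4, 1.0, 5000)   # placeholder shifted live feature

stat, p_value = ks_2samp(train, live)
print(f"KS p-value: {p_value:.2e}, PSI: {psi(train, live):.3f}")
# A common convention flags PSI > 0.2 as a significant shift.
```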
Code
As this phase involves only monitoring of the ML systems after deployment, there is no code involved other than implementing verification tests for the other two dimensions.
Model
Which actions are performed based on the model's output matters more than its predictive performance. We must check whether the selected performance metric aligns well with the requirements discussed in the BU phase. If a change in the feature attributes improves the direct objective function but degrades the performance on KPIs, then the objective function must be revisited.9,42
The model's output must be monitored for distribution changes to check the presence of label drift (c.f. Table 7), and this can be done with the same tests mentioned earlier, that is, Kolmogorov-Smirnov and Chi-squared tests. In addition, the model's predictions must be verified against the sensitive features defined in the BU phase (fairness constraints).
Apart from label drift, other kinds of drifts may occur and need to be handled. Storkey 63 provides information on detecting and dealing with different kinds of shifts/drifts, where each drift is dealt with via a different form of modeling and no single model can deal with all the mentioned kinds of drift.63,65 In particular, concept drift can be handled via the change detection methods used in adaptive learning, commonly known as concept drift adaptation. 16 However, most adaptation techniques need the true label of the data soon after the model's prediction, which is not available in the majority of cases.
If the label is available soon after the prediction (e.g., whether the user clicks on an ad or not), we can also judge the quality of the system by comparing the distributions of the model's output and the observed labels. This can detect bias in the predictions, and slicing across various dimensions helps to localize the issues and alert the user. 12 It can also be achieved with the PSI, where we compare the model's output with the observed labels and set bounds that trigger an alert if the threshold is crossed.
The best way to handle drifts in ML models is via retraining, which is a common approach in dynamic systems. Here, we can log real/serving inputs and manually annotate the model's outputs as labels to create additional data for retraining the model. A slice of this data can also be held out to verify the updated model prior to deployment 13 (as discussed in the deployment phase). However, once the model is updated on new data, the distribution of its output changes and the thresholds in the tests mentioned earlier become invalid. 12 Hence, we need to update these older thresholds to adapt to the newer model, but a manual update is time consuming. Instead, these thresholds can be learned via simple evaluation on the held-out validation data. 66
However, in many cases, the models are not updated frequently; due to their dependence on data, they drift away and grow stale over time. This can be handled by generating alerts for retraining: the ages of both the model and the data in production are monitored and thresholded according to the model freshness requirements specified in the BU phase. 13
Finally, there must be the possibility of a rollback to the previous version if the model performs poorly based on the metrics monitored in this phase.
The summary of checks for this phase is provided in Table 8.
Summary of checks for monitoring phase
ANOVA, analysis of variance; CSI, Characteristic Stability Index; PSI, Population Stability Index.
Conclusion
In this work, we argued for the need for a structured verification approach for ML systems. We provided evidence for the need for such an approach by analyzing typical errors and performance issues in different industrial settings. As shown, these issues cannot be attributed to the model alone; code and data play prominent roles as well. Our experience from applying ML in various industrial domains shows that our inability to provide quality data to the model impedes it from learning reliable/robust features and hence from generalizing well in real-world environments. This nonrobust feature learning is the root cause that makes models prone to drifts, discrimination, adversarial attacks, etc.
These issues cannot be entirely detected even with existing validation techniques such as CV or XAI, making structured verification at each phase of the ML lifecycle essential. For instance, (1) the results from CV cannot be trusted if the quality of the data used to perform CV is not verified; (2) the explanations derived from XAI techniques can detect incorrectness but cannot guarantee the confidence we want to place in the ML model, as they might detect biases due to a specific feature but cannot assure that removing that feature would completely eliminate the biases.
With this review, we intend to provide a methodology that can aid practitioners in deriving a checklist for verifying their domain-specific ML models, in addition to their own specific checks (c.f. the Checklist for Structured Verification in ML Systems section). Our discussion by no means provides an exhaustive checklist, due to the complexities involved in creating one.
We conclude that structured verification of data, code, and model at each phase is necessary to build trustworthy ML models, and we recommend that ML practitioners invest more time in verifying data quality, which is a prime necessity for creating robust and trustworthy AI solutions.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This work was supported by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the Center for Analytics – Data – Applications (ADA-Center) within the framework of BAYERN DIGITAL II (20-3410-2-9-8) and partially supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A.
