Abstract
More and more biologists and bioinformaticians turn to machine learning to analyze large amounts of data. In this context, it is crucial to understand which is the most suitable data analysis pipeline for achieving reliable results. This process may be challenging, due to a variety of factors, the most crucial ones being the data type and the general goal of the analysis (e.g., explorative or predictive). Life science data sets require further consideration as they often contain measures with a low signal-to-noise ratio, high-dimensional observations, and relatively few samples. In this complex setting, regularization, which can be defined as the introduction of additional information to solve an ill-posed problem, is the tool of choice to obtain robust models. Different regularization practices may be used depending both on characteristics of the data and of the question asked, and different choices may lead to different results. In this article, we provide a comprehensive description of the impact and importance of regularization techniques in life science studies. In particular, we provide an intuition of what regularization is and of the different ways it can be implemented and exploited. We propose four general life sciences problems in which regularization is fundamental and should be exploited for robustness. For each of these large families of problems, we enumerate different techniques as well as examples and case studies. Lastly, we provide a unified view of how to approach each data type with various regularization techniques.
1. Motivation
In the era of personalized medicine, biospecimen collection and biological data management are still challenging and expensive tasks (Toga and Dinov, 2015). Only a few large-scale research enterprises, such as ENCODE (encodeproject.org), ADNI (adni.loni.usc.edu), or TCGA (cancergenome.nih.gov), have sufficient financial and human resources to manage, share, and distribute access to heterogeneous types of biological data. To date, many biomedical studies still rely on a small number of collected samples (McNeish and Stapleton, 2016). This number is even lower in the case of rare diseases (Garg et al., 2016) or of high-throughput molecular data (e.g., genomics and proteomics), where the number of variables measured can be in the order of hundreds of thousands (Yu et al., 2013).
Asking biological or clinical questions of these data using machine learning techniques requires particular consideration of many factors, such as random fluctuations in the measurements introduced by the acquisition devices, a small number of samples, or observed variables that may not be representative of the target phenomenon. From a modeling standpoint, every combination of the factors above can be seen as noise affecting the data. Precautions must be taken in the model formulation process to achieve solutions that are robust to the effect of noise. To this end, we can couple machine learning methods with regularization, a set of techniques that can be introduced independently from the learning machine (Okser et al., 2014). Regularization is of fundamental use not only to achieve robustness in the presence of noise but also to impose consistency with prior knowledge. We show in Section 2 that there are different methods to attain either goal, and that they can be combined.
In this review, we describe how regularization can be used, together with machine learning methods, to successfully address complex life science questions. Unlike previous review articles on this matter (Ma and Huang, 2008; Sohail and Arif, 2020), we provide a vast range of methods incorporating the advances made in the last 10 years of research, and focus on regularization per se and how it has been successfully exploited to answer questions on various types of data, including omics data, imaging data, clinical outcomes, and much more. We aim to give the reader a broad understanding of possible concerns and situations. More specifically, we identify four families of life science questions that occur regularly, together with the regularization techniques suited to each. Although these do not cover the entirety of all possible questions that can be answered with machine learning techniques, they present some of the most common uses of regularized machine learning in the life sciences.
Such questions are the following: (Q1) How to find the relationships between input and output from noisy data, (Q2) which variables are the most relevant, (Q3) are there hidden patterns in the data, and (Q4) are there relevant relationships between variables?
1.1. Outline
In the remainder of the article, we provide background on supervised and unsupervised machine learning (Section 2), focusing on the specific ways of introducing regularization within the different methods. In Section 3, we describe the four main representative questions, and we answer each of them separately in Sections 4–7. We conclude the article with a discussion (Section 8) on the most appropriate method to use depending on the type of data, providing a list of use cases for each data type and method.
2. Learning Machines and Regularization
Life science problems can be tackled with a vast number of statistical and machine learning methods. Here, we do not attempt to discuss how to address all the possible problems, but restrict ourselves to those that can be approached with specific regularized methods in both the supervised and the unsupervised setting.
2.1. Supervised learning
Supervised learning defines a subset of machine learning methods that allow the study of relationships between input and output pairs. In this setting, we denote data as
Typically both regression and classification tasks can translate into the optimization of the following problem:
where
Definition of the Loss Function
2.2. Unsupervised learning
Unsupervised learning defines a subset of machine learning methods that allow the study of internal patterns among possibly heterogeneous observations. In this setting, data are
We also discuss the problem of network inference, which is the problem of inferring relationships among variables through observations. This task addresses the problem of understanding how the variables in play can describe the system by interacting with each other.
All the methods mentioned above entail the minimization of a loss. Although the loss may change depending on the problem at hand, we can generally write it as in Equation (1):
Here,
2.3. The problem of overfitting
Learning algorithms are often prone to overfitting, which can be described as the phenomenon where the learned model is more accurate on known data (training) than on unseen data (test). Such a model explains the known data too precisely, fitting noise as well as signal, and therefore loses the ability to generalize to future examples. Overfitting is more likely to happen when learning is performed on a low number of samples, or when the complexity of the model is high. Indeed, in the first case, we might lose the ability to discern which information is noise and which is relevant; in the second case, a highly complex model is prone to fitting noise in the training data. Regularization and model selection techniques are the go-to tools to prevent overfitting and obtain robust models. These two complementary sets of techniques, respectively, penalize overly complex models and test the model's ability to generalize by evaluating its performance on a set of data not used for training (i.e., a validation set, a part of the training set left aside for explicit evaluation of generalization properties).
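The gap between training and test error can be illustrated with a deliberately extreme learner. The following is a minimal sketch in plain Python, assuming a hypothetical one-dimensional data set (y = 2x plus Gaussian noise) and a 1-nearest-neighbor predictor, neither of which comes from the studies discussed here. Because the predictor memorizes every training point, noise included, its training error is zero while its error on fresh samples is not.

```python
import random

def one_nn_predict(train, x):
    """Predict with the label of the closest training point (1-nearest neighbor)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def mse(train, data):
    """Mean squared error on `data` of the 1-NN model built from `train`."""
    return sum((one_nn_predict(train, x) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    # Hypothetical ground truth: y = 2x plus Gaussian noise.
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 0.5)) for x in xs]

rng = random.Random(0)
train = make_data(20, rng)
test = make_data(200, rng)

train_error = mse(train, train)  # the model reproduces the training noise exactly
test_error = mse(train, test)    # unseen samples reveal the lack of generalization
```

Comparing `train_error` with `test_error` is the simplest diagnostic of overfitting; the regularization strategies described next aim at shrinking this gap.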
2.4. Regularization
Given Problems (1) and (2), there are many possible ways of performing regularization to be robust to noise (i.e., prevent overfitting) or impose prior knowledge. They differ in the way they act on retrieving the optimized solution: they can act on the model, on the optimization technique, or on the data.
2.4.1. Addition of a penalty
This type of regularization acts on the model and is based on the addition of a penalty term to Problem (1), as follows:
The term
The scalar
2.4.2. Ensemble techniques
Another way of avoiding overfitting is to combine a finite set of alternative models to allow for higher flexibility and thus better performance. Typical ensemble techniques are bagging and boosting. Bagging acts on the data and involves multiple models trained on random subsets of the input samples. It yields the final prediction by merging the predictions of the models, which contribute equally to the final solution. When using this approach as a regularization strategy, one must be careful to select the right number of models to learn, as well as their complexity, or overfitting might still occur. Boosting is an ensemble method that acts on the optimization process: it performs predictions by sequentially fitting several base learners that cast a weighted vote (Freund, 1995). At each boosting iteration, the model is forced to learn the relationships between input and output that were previously missed, as the weights corresponding to poorly predicted samples increase. From a theoretical standpoint, it is possible to boost any learning machine; nevertheless, boosting methods are truly beneficial only when based on weak learners, such as stumps, that is, one-node decision trees (Iba and Langley, 1992), or linear regression (Hastie et al., 2009). Examples of these techniques are Random forest and Gradient boosting, which we discuss in Section 4.
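Bagging, as described above, can be sketched in a few lines. The example below is an illustrative assumption rather than a reference implementation: it uses regression stumps as base models, bootstrap samples of the same size as the training set, and a hypothetical noisy step function as data. The helper names (`fit_stump`, `bagging_fit`, `bagging_predict`) are ours.

```python
import random

def fit_stump(data):
    """Fit a one-node regression tree: pick the threshold minimizing squared error."""
    best = None
    xs = sorted({x for x, _ in data})
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in data if x <= thr]
        right = [y for x, y in data if x > thr]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - l_mean) ** 2 for y in left) + sum((y - r_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, l_mean, r_mean)
    if best is None:  # all x identical in this sample: fall back to a constant
        mean = sum(y for _, y in data) / len(data)
        return (float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, x):
    thr, l_mean, r_mean = stump
    return l_mean if x <= thr else r_mean

def bagging_fit(data, n_models, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    n = len(data)
    return [fit_stump([data[rng.randrange(n)] for _ in range(n)]) for _ in range(n_models)]

def bagging_predict(stumps, x):
    """Average the predictions: every model contributes equally."""
    return sum(stump_predict(s, x) for s in stumps) / len(stumps)

rng = random.Random(0)
# Hypothetical data: a noisy step function with the jump at x = 0.5.
data = [(i / 20, (1.0 if i >= 10 else 0.0) + rng.gauss(0, 0.1)) for i in range(20)]
ensemble = bagging_fit(data, n_models=25, rng=rng)
```

Calling `bagging_predict(ensemble, 0.9)` returns the average of 25 stump predictions. The number of models and the complexity of each base learner are the two quantities that the text above recommends tuning.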
2.4.3. Dropout and data augmentation
These two regularization techniques are mostly used for neural networks (NNs). The first one, Dropout (Srivastava et al., 2014), is a technique that acts on the model by temporarily deactivating a defined number of randomly chosen units of the network during the training phase. This reduces the degrees of freedom of the model and implicitly yields an ensemble of several smaller networks whose predictions are combined. Data augmentation acts on the data, as it is a preprocessing technique. It is typically used when dealing with NNs and images, and it consists in expanding an input data set by applying transformations such as scaling or translation to the available samples. Hernández-García and König (2018) provide evidence that this method achieves regularization, as it avoids overfitting much like more explicit regularization techniques.
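Both techniques can be sketched without any deep learning library. The snippet below is a minimal illustration under stated assumptions: each unit is dropped independently with probability `drop_prob` (rather than dropping a fixed number of units), the widely used "inverted dropout" rescaling is applied, and augmentation is reduced to a horizontal translation of a small image stored as a list of rows. Function names are hypothetical.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Zero each unit with probability drop_prob during training.

    Assumption: 'inverted dropout' convention, where surviving units are
    rescaled by 1 / (1 - drop_prob) so that nothing changes at test time.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def translate_right(image, shift):
    """Shift a 2D image (list of rows) to the right, padding with zeros.

    Assumes 0 <= shift <= image width.
    """
    return [[0] * shift + row[: len(row) - shift] for row in image]

rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)

image = [[1, 2, 3], [4, 5, 6]]
augmented = [image, translate_right(image, 1)]  # original plus one shifted copy
```

At test time the same layer is called with `training=False`, which returns the activations unchanged.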
2.4.4. Early stopping
This is a popular regularization strategy (Prechelt, 1998) that consists in interrupting the fitting process as soon as the error on an external validation set increases (Angermueller et al., 2016). This type of regularization acts on the optimization procedure and it is typically used with iterative methods such as gradient descent. It is based on the idea that, given a set of data on which we train the model (training) and a set on which we validate it (validation), the optimization procedure reduces the error on both the training and the validation sets up to a point, after which the validation error starts increasing as the model overfits the training data.
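The stopping rule can be written independently of the model being trained. The following sketch assumes that the validation error has been recorded once per epoch; the `patience` parameter is our addition (with `patience=1`, training stops at the first epoch in which the validation error fails to improve, which matches the description above).

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; the weights from `best_epoch` would be kept.
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(validation_errors) - 1

# Hypothetical validation curve: it decreases, then rises as the model overfits.
val_curve = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
best, stopped = early_stopping_epoch(val_curve, patience=1)
```

A larger `patience` tolerates short-lived fluctuations of the validation error at the cost of a few extra training epochs.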
2.5. Model selection
Each of the aforementioned regularization techniques has an intrinsic parameter that needs to be tuned. For the penalized methods we have
This problem is typically referred to as model selection. It must be distinguished from model evaluation, which aims at estimating the generalization error of the chosen model on new data.
Model selection is usually performed by estimating, for a given value of a parameter, the prediction error. The simplest and most widely used method for estimating the prediction error of the model is to perform K-fold cross-validation. Given an integer K, we split the data in K parts of approximately the same size. For each of these parts in turn, we compute on the k-th part the error of the model fitted to the
This procedure is repeated for a certain range of parameter values, and the best parameter is selected as the one that returns the lowest prediction error on average. Many other cross-validation routines have been proposed in the literature; we refer to Molinaro et al. (2005) for a detailed description of the most important cross-validation strategies.
In contrast with cross-validation, multiple methods have been developed to perform an analytical estimation of the prediction error of a model. Some of the most widely used of these methods are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) (Vrieze, 2012). Both methods are based on the idea of minimizing the loss function (or maximizing the likelihood) while penalizing such quantity depending on the degrees of freedom of the problem. As an example, consider the well-known clustering method K-means, which divides the data points into K clusters. For K equal to the number of samples, we would reach a perfect fit in terms of the value of the loss function, but this would overfit the samples. Thus, using methods such as AIC or BIC, we add to the error a penalty proportional to the value K, to obtain a balance between the error and the number of degrees of freedom of the problem.
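The K-fold procedure and the parameter search can be sketched as follows. This is a minimal illustration under several assumptions of ours: the model is a one-dimensional ridge regression without intercept, the regularization parameter is called `lam`, the data are synthetic, and the candidate grid is arbitrary.

```python
import random

def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and split them into k folds of near-equal size."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2 for a scalar weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, folds):
    """Average, over folds, of the mean squared error on the held-out part."""
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in held_out) / len(held_out))
    return sum(errors) / len(errors)

rng = random.Random(0)
# Hypothetical data: y = 1.5x plus Gaussian noise.
xs = [rng.uniform(-1, 1) for _ in range(30)]
ys = [1.5 * x + rng.gauss(0, 0.3) for x in xs]

folds = k_fold_indices(len(xs), k=5, rng=rng)
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam, folds))
```

The same folds are reused for every candidate value so that the comparison between parameters is not affected by the random split.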
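In their standard form, the two criteria are AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of samples, and L the maximized likelihood. The sketch below applies them to three hypothetical candidate models; the parameter counts and log-likelihood values are invented for illustration only.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Hypothetical fits: (number of parameters, maximized log-likelihood).
candidates = {"small": (3, -120.0), "medium": (6, -114.0), "large": (20, -108.0)}
n_samples = 100

best_by_aic = min(candidates, key=lambda m: aic(candidates[m][1], candidates[m][0]))
best_by_bic = min(
    candidates, key=lambda m: bic(candidates[m][1], candidates[m][0], n_samples)
)
```

With these numbers, AIC prefers the "medium" model while BIC prefers the "small" one: since ln(n) exceeds 2 as soon as n is at least 8, BIC charges more per parameter and tends to select more parsimonious models.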
3. From Biological Questions to Learning Tasks
In applied life science, it is crucial to choose the right approach to not incur bias and obtain robust results.
We identified four recurring biological questions that, even though they do not completely cover the complex variety of problems related to life science data, are the most amenable to regularized learning techniques. We provide in Figure 1a a schematic explanation of how to reach a particular question starting from the data and the problem at hand.

Flow diagram explaining how to reach a specific question. In practice, we first need to distinguish whether we have labeled data or not: in the first case, we are in a supervised learning setting, while in the second we are in an unsupervised setting. In the supervised setting, we want to predict the labels, and we can simply do this in the best possible way or we may ask which are the best variables to predict. In the unsupervised setting, we can look for patterns in the samples or for relationships among the features.
In all these questions, regularization plays a key role, both for robustness and to impose prior knowledge on the solution. The regularization schemes presented in the previous section can be used in different ways to address all these questions, sometimes combined and sometimes alone.
4. How to Find Relationships Between Input and Output from Noisy Data? (A1)
This problem lies in the macrocategory of supervised problems and it is one of the most widely discussed. We provide a variety of well-known techniques that differ both in the way they approach regularization and in the type of data they can handle.
4.1. Tikhonov regularization
This regularization strategy is based on the addition of an
This penalty shrinks the coefficients toward zero, but it does not achieve a parsimonious representation, as it tends to keep all the variables in the model. This penalty is typically applied to the square loss, thus taking the name of Ridge regression (Hoerl and Kennard, 1970), but it is known under several different names, among which we recall weight decay (Krogh and Hertz, 1992) and Regularization Network (Evgeniou et al., 2000). It is easy to show that Ridge regression is equivalent to a Bayesian approach to linear regression where we impose a normal prior on the regression coefficients (Murphy, 2012, chapter 7).
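The shrinking behavior can be checked on a toy problem. The sketch below assumes the common closed form w = (X^T X + lam I)^{-1} X^T y for the square loss with a squared-norm penalty (the scaling of `lam` is our assumption) and solves it for two features with Cramer's rule; the three data points are hypothetical.

```python
import math

def ridge_2d(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for two features via Cramer's rule."""
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    p = sum(r[0] * t for r, t in zip(X, y))
    q = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)

# Hypothetical data: the first feature is strongly predictive, the second weakly.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [3.1, 0.1, 3.3]

path = {lam: ridge_2d(X, y, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
norms = {lam: math.hypot(*w) for lam, w in path.items()}
```

As `lam` grows, the norm of the weight vector decreases, yet both coefficients stay different from zero even at `lam = 100`, which is the lack of parsimony noted above.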
4.1.1. Applications
This model has been successfully applied in a variety of biological studies, mainly involving regression problems. For instance, in Kratsch and McHardy (2014), the authors propose a Ridge regression-based method to estimate the trees of mutations within a species from the ancestors of the species to the present, while in Bøvelstad et al. (2007) this technique is used to predict the survival of patients from gene expression data. Tikhonov regularization can also be combined with other types of regularization, as in Fiorini et al. (2017), where the authors exploit the addition of a nuclear norm penalty to perform temporal prediction of possible responses of patients affected by multiple sclerosis.
4.2. Random forests
Random forests (RFs) are ensembles of decision trees, each grown on a subset of samples randomly chosen with replacement from the training data. Decision trees are interpretable models where each node can be seen as a particular question on a single feature that partitions the training data into subsets. The feature that yields the best split in terms of a preselected metric is chosen to create a new node; we refer to Qi (2012) for possible choices of such a metric that are suitable for different biological problems. Each path from root to leaf is called a classification rule.
Decision trees alone tend not to perform well, which led to the introduction of RFs in 2001 (Breiman, 2001). The final prediction is made by aggregating the predictions of m trees, either by a majority vote in the case of classification problems, or by averaging predictions in the case of regression problems. Several techniques for applying regularization to RFs have been proposed. These techniques broadly fall under two categories: (1) cost-complexity pruning, which consists in limiting tree depth, resulting in less complex models (Kulkarni and Sinha, 2012); and (2) Gini index penalization, which weights the probabilities of each class to favor large partitions (Liu et al., 2014a).
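Two building blocks of an RF classifier can be written compactly: the split metric and the aggregation rule. The sketch below assumes the Gini impurity, 1 - sum_k p_k^2, as the preselected metric (one common choice among several) and shows the plain, unpenalized version, not the penalized variant of Liu et al. (2014a). Function names are ours.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

def majority_vote(predictions):
    """Aggregate per-tree class predictions into the forest prediction."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical node with two classes, and two candidate splits of its samples.
node = ["tumor", "tumor", "normal", "normal"]
pure_split = split_impurity(["tumor", "tumor"], ["normal", "normal"])
mixed_split = split_impurity(["tumor", "normal"], ["tumor", "normal"])
forest_prediction = majority_vote(["tumor", "normal", "tumor"])
```

A tree-growing routine would evaluate `split_impurity` for every candidate feature and threshold and keep the split with the lowest value.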
4.2.1. Applications
RFs can handle both numerical and categorical variables, multiple scales, and nonlinearities. This makes them popular for the analysis of diverse types of biological data, such as gene expression, sequencing, GWAS (Genome-Wide Association Study), or mass spectrometry data. A detailed review specific to RF is provided in Qi (2012). Deng and Runger (2013) and Kursa (2014) use regularized and robust RF for the selection of genes in classification tasks. RFs can also be used for regression, as in Johann et al. (2019), where the authors aim at quantifying tumor purity, or for learning interactions between noncoding RNA and messenger RNA (Soulé et al., 2020).
4.3. Gradient boosting
Boosting is an ensemble method that performs predictions by sequentially fitting several base learners that cast a weighted vote (Freund, 1995). At each boosting iteration, a new model is created by giving increasing weight to the errors made by previous models, so that each model is forced to learn the relationships between input and output that were previously missed. From a theoretical standpoint, it is possible to boost any learning machine; nevertheless, boosting methods are truly beneficial only when based on weak learners, such as stumps or linear regression (Hastie et al., 2009). Gradient boosting (Friedman, 2001) is one of the most widely applied boosting methods in biological problems.
Gradient boosting has several desirable properties (Mayr et al., 2014), such as its capability to learn nonlinear input/output relationship, its ability to embed a feature importance measure, and its stability in case of high-dimensional data (Buehlmann, 2006).
Boosting methods may suffer from overfitting. The main regularization parameter to control is the number of boosting iterations m, that is, the number of base learners fitted on the training data. Careful consideration should also be given to tuning the complexity of the base learners that are used.
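The role of the number of iterations m can be seen in a small implementation. The sketch below is an assumption-laden illustration: it uses the squared loss (for which the negative gradient is simply the residual), regression stumps as base learners, a `learning_rate` shrinkage factor that is our addition, and a hypothetical noise-free data set with at least two distinct inputs.

```python
def fit_stump(xs, targets):
    """One-node regression tree minimizing squared error on (xs, targets)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - l_mean) ** 2 for t in left) + sum((t - r_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, l_mean, r_mean)
    return best[1:]

def boost(xs, ys, n_iterations, learning_rate=0.1):
    """Gradient boosting for the squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_iterations):
        # Residuals are the negative gradient of 0.5 * (y - prediction)^2.
        residuals = [y - p for y, p in zip(ys, predictions)]
        thr, l_mean, r_mean = fit_stump(xs, residuals)
        stumps.append((thr, l_mean, r_mean))
        predictions = [
            p + learning_rate * (l_mean if x <= thr else r_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions

def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]  # hypothetical nonlinear target
train_errors = {m: mse(ys, boost(xs, ys, m)[2]) for m in (1, 10, 100)}
```

The training error keeps decreasing as m grows, which is exactly why m has to be chosen on held-out data (for instance by cross-validation or early stopping) rather than on the training set.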
4.3.1. Applications
Approaches based on gradient boosting classification are used to detect de novo mutations, showing improved specificity and sensitivity with respect to state-of-the-art methods (Liu et al., 2014b). When combined with stability selection (Meinshausen and Bühlmann, 2010), gradient boosting has proven to be a very effective method for variable selection, leading to an effective control of the false discovery rate. This strategy was followed to associate overall survival with single-nucleotide polymorphisms of patients affected by cutaneous melanoma (He et al., 2016) and to detect differentially expressed amino acid pathways in autism spectrum disorder patients (Hofner et al., 2015).
4.4. Deep learning
Deep learning (DL) methods are a broad class of machine learning techniques that, starting from raw data, aim at learning a suitable feature representation (Section 7) and a prediction function, at the same time (LeCun et al., 2015). DL methods can be seen as an extension of classical NN, where the final prediction is achieved by composing several layers of nonlinear transformations. DL architectures can be devised to tackle binary/multicategory classification (Leung et al., 2014; Angermueller et al., 2016) as well as single/multiple-output regression tasks (Chen et al., 2016).
Particular attention must be paid when fitting deep models as they can be prone to overfit the training set (Angermueller et al., 2016). This is particularly true in health care contexts in which the available data set dimension can be small. Regularization in DL methods can be achieved by penalizing the weights of the network. The most common regularization strategy consists in adding an
4.4.1. Applications
DL can be regularized in many different ways. For example, weight decay is adopted in Chen et al. (2015) to train a deep architecture on rat cell responses to given stimuli, with the final aim to predict human cell responses in the same conditions. Moreover, weight decay is also adopted in Yuan et al. (2016) to train DeepGene, that is, a simple fully connected network known as multilayer perceptron (LeCun et al., 2015), which is designed to classify the tumor type from a set of somatic point mutations. Furthermore, weight decay is used in Fakhry et al. (2016) to train a DL architecture for brain electron microscopy image segmentation. Although less common, the
These methods iteratively update the weights of the network to decrease the training error. The use of dropout alone can improve the generalization properties, as in Chen et al. (2016), where the authors propose D-GEX, a DL regression architecture trained to predict the expression of a number of target genes. Dropout can also be used in combination with weight decay or other forms of regularization, as in Leung et al. (2014), where the authors propose to use a deep network to achieve splicing pattern prediction. Dropout is combined with early stopping in Fiorini et al. (2019), where the authors use a textual representation of medical prescriptions to classify the patients whose diabetes is likely to worsen in the future. DL methods are nowadays becoming a standard for most biomedical imaging applications. In such a context, regularization plays a key role, as it makes it possible to learn robust models for automatic image retrieval, segmentation, and disease prediction. One of the main drawbacks of DL methods is that, to learn a prediction function that does not simply overfit the training set, the number of training samples should be large (e.g., in the order of tens of thousands). In the context of biomedical images, collecting a large data set may be hard. To cope with this issue, we can use data augmentation (Schlemper et al., 2017). An interesting property of DL architectures is that, when properly trained on a given collection of images, they can learn both task-specific and generic (aspecific) features. So, in general, it is possible to reuse (or fine-tune) the weights learned by a network on some data set for another case. This strategy is known as transfer learning and, among others, it was successfully exploited by Li et al. (2018) to classify subjects with autism spectrum disorder from medical images. As transfer learning helps to prevent overfitting, it can be considered, to some extent, a regularization strategy.
For a complete review on the impact of DL on this subject, we refer to Lundervold and Lundervold (2019). When model interpretability is as important as prediction performance, DL methods must be trained with particular care. This relevant topic is addressed in Plumb et al. (2019), where the authors propose a regularization term that encourages explainability of the trained model in the neighborhood of the training points without significantly affecting the predictive performance. Along the same lines, Tong et al. (2018) recently introduced the so-called Graph Spectral Regularization that, applied to the neuron activations of an arbitrary NN, can be used to enforce a meaningful graph structure. This method is successfully applied to learn gene marker correlations in a single-cell RNA-sequencing data set. For a specific review clarifying the role of DL in biology, we refer the reader to Ching et al. (2018), where the authors analyze the application of DL to many tasks, among which are clinical outcome forecasting, biological processes, treatment discovery, and neuroscience.
5. Which Variables are the Most Relevant? (A2)
When dealing with health science problems, often we want to learn the best predictors for a certain outcome. Typically, the regularized solution to this problem is to add sparsity-inducing penalties on the loss of the specific machine learning method. A model is said to be sparse when it is defined upon a small number of features (Hastie et al., 2015).
5.1. Lasso and Elastic-Net
There are many penalties that can be added to enforce sparsity. All these penalties are based on the Lasso (Tibshirani, 1996) penalty or
Sparsity can also be achieved through other feature selection techniques besides regularization. Those include filtering techniques, which score features according to their individual relationship with the outcome (e.g., through correlations or statistical association testing) and only keep the highest-scoring ones, or wrapper techniques, which assess subsets of variables according to their usefulness to a given learner. By contrast, embedded methods such as the Lasso directly satisfy the sparsity constraint while optimizing the model, which is more efficient. All three families of approaches are reviewed in Guyon and Elisseeff (2003).
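Why an embedded method such as the Lasso returns exact zeros can be seen in a special case. The sketch below assumes an orthonormal design and the objectives 0.5 * ||y - Xw||^2 + lam * ||w||_1 (Lasso) and 0.5 * ||y - Xw||^2 + 0.5 * lam * ||w||^2 (Ridge); under these assumptions both solutions are simple functions of the least-squares coefficients. The coefficient values are hypothetical.

```python
def soft_threshold(z, threshold):
    """Proximal operator of the l1 norm: shrink z toward zero, clipping at zero."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0

# Hypothetical least-squares coefficients under an orthonormal design.
ols = [2.5, -0.3, 0.05, -1.2, 0.4]
lam = 0.5

lasso = [soft_threshold(c, lam) for c in ols]  # small coefficients become exactly 0
ridge = [c / (1 + lam) for c in ols]           # all coefficients shrink, none vanish

n_selected_lasso = sum(1 for c in lasso if c != 0.0)
n_selected_ridge = sum(1 for c in ridge if c != 0.0)
```

Here the Lasso keeps only the two coefficients whose magnitude exceeds `lam`, while Ridge keeps all five: the feature selection happens inside the optimization itself, with no separate filtering or wrapping step.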
As for the
The Elastic-Net (Zou and Hastie, 2005; De Mol et al., 2009a) method can be formulated as a least-square problem penalized by a convex combination of the Lasso (
The combined presence of the
5.1.1. Applications
A popular application of the Lasso is to perform shrinkage and variable selection in survival analysis for Cox proportional hazard regression and additive risk models. Such penalized methods have been extensively applied in the literature to predict survival time from molecular data collected from patients affected by different kinds of tumor (Ma and Huang, 2007; Tang et al., 2017). The Elastic-Net method has been successfully applied in several biomedical fields (Waldmann et al., 2013). For example, De Mol et al. (2009b) exploited an incremental version of Elastic-Net to identify nested groups of correlated genes, and Hughey and Butte (2015) exploited it to distinguish between four lung cancer subtypes. In Csala et al. (2017), the authors propose an iterative algorithm that exploits the variable selection capabilities of this method to estimate the weights of explanatory variables, so as to explain the variability in gene expression through epigenomic data (i.e., methylation markers) collected from blood leukocytes of Marfan syndrome patients.
5.2. Lasso extensions
It is also possible to design regularizers that force the features that are assigned nonzero weights to follow a given underlying structure (Micchelli et al., 2013). This structure can be defined by arranging features in groups (typically for bioinformatic applications, biological pathways) or graphs (typically, biological networks). In the case of groups, the regularizer constrains entire groups of features to be either all selected or all discarded. When the groups are disjoint, this can be implemented by the Group Lasso (Yuan and Lin, 2006). Suppose that the d features are grouped into L groups, with dl the number of features in group l. Let us denote by
where the same weight wl is associated with all variables from group l. The Group Lasso was later extended to the case where the groups can overlap (Jacob et al., 2009) or be hierarchical (Jenatton et al., 2011).
In the case of networks, the regularizer encourages features that are connected on the network to be selected together. This can be implemented directly with the overlapping Group Lasso, by defining groups as pairs of features connected by an edge (Jacob et al., 2009). Another way to smooth regression weights along the edges of a predefined network, while enforcing sparsity, is a variant of the generalized fused Lasso (Tibshirani et al., 2005). The corresponding penalty is given by Equation (8)
where
These approaches are rather sensitive to the quality of the network they use, and might suffer from bias due to graph misspecification (Yang et al., 2012b). GOSCAR (Yang et al., 2012b) was proposed to address this issue, and replaces the term
5.2.1. Applications
Hierarchical Group Lasso was used in a classification setting to localize the brain regions involved in the processing of visual stimuli from functional magnetic resonance imaging (fMRI) (Jenatton et al., 2012). In Xin et al. (2014), the authors successfully applied network Lasso to Alzheimer's disease diagnostics from brain images. A more detailed review of these approaches and their applications to bioinformatic problems can be found in Azencott (2016), which also presents how these regularizers can be used in the context of filter approaches to feature selection.
5.3. Evaluation
As for the other methods presented in this review, we need to perform model selection also when using the penalties described in this section. In addition, when adopting sparse techniques, it is necessary to evaluate whether the model recovers the correct features. In bioinformatics, there usually is no ground truth for this question, which can hence only be answered directly on synthetic data. On real data, stability serves as a proxy: if the feature selection process is stable, it should retrieve the same features on overlapping subsets of the same data set.
The set of selected features can only be interpreted if it remains robust to slight variations in the data. Do multiple repeats of the algorithm, for instance, on cross-validation training folds, yield the same sets of features? A variety of measures have been developed to evaluate the stability of a feature selection algorithm.
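One simple stability measure, chosen here purely for illustration among the many that exist, is the average pairwise Jaccard index between the feature sets selected in repeated runs. The gene identifiers below are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index (size of intersection over size of union) of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def selection_stability(feature_sets):
    """Mean pairwise Jaccard index across repeated runs (1.0 = identical selections)."""
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical gene sets selected on three cross-validation training folds.
runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
stability = selection_stability(runs)
```

A value close to 1 indicates that the same features are recovered across folds; values near 0 suggest that the selected set depends heavily on the particular samples used for training.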
Predictivity, in turn, is typically assessed by cross-validation (Guyon et al., 2002). It is important to highlight that variable/feature selection should not be considered a preprocessing step. In fact, using the same data set to select the most important features and to evaluate the model performance leads to an overoptimistic estimate of predictive capability. This phenomenon is known as selection bias (Ambroise and McLachlan, 2002).
6. Are There Hidden Patterns in the Data? (A3)
Pattern recognition is a very general machine learning problem that comprises tasks such as clustering of samples or retrieval of basic signals within the features. In life science settings, while it is useful to obtain information on samples (typically patients), it may also be useful to retrieve patterns from the features. Using clustering methods in these settings is harder, as they typically assume samples that belong to the same cluster to be i.i.d. Features, on the other hand, may have complex dependency patterns that are difficult to interpret with standard clustering algorithms. In signal analysis, the detection of latent patterns in sampled signals has been studied in depth as a way to obtain a better representation of data. The most common ways to decompose a signal are principal component analysis (PCA) (Wold et al., 1987) and its derivatives. Nonetheless, they typically assume strong priors on the patterns; for example, in PCA all the patterns have to be orthogonal to each other. In some contexts, this assumption can prevent the analysis from detecting factors that do not satisfy the requirements imposed.
6.1. Dictionary learning
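We therefore discuss a technique called matrix factorization, which, given an input matrix $X \in \mathbb{R}^{n \times d}$ of $n$ samples and $d$ features, approximates it as the product of two matrices, $X \approx CD$,
where the rows of $D \in \mathbb{R}^{k \times d}$ are the $k$ patterns (atoms) that form the dictionary and $C \in \mathbb{R}^{n \times k}$ contains the coefficients that combine the atoms to reconstruct each sample.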
We can assume that the dictionary is known a priori, mimicking signal decomposition techniques such as the Fourier transform or the wavelet transform. In this case, the problem is called sparse coding and it is a convex problem. In general, we do not know the underlying patterns and we therefore need to learn the dictionary too.
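When the dictionary is fixed, sparse coding with a Lasso penalty can be solved, for instance, with the iterative shrinkage-thresholding algorithm (ISTA). The sketch below is a minimal illustration under assumptions that are ours: a single sample `x`, atoms stored as rows of `atoms`, the objective 0.5 * ||x - code * atoms||^2 + lam * ||code||_1, and plain Python lists to stay dependency-free. The function names and numerical values are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|, the building block of the Lasso penalty."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def reconstruct(code, atoms):
    """Linear combination of the dictionary atoms weighted by the code."""
    return [sum(c * atom[j] for c, atom in zip(code, atoms))
            for j in range(len(atoms[0]))]


def objective(x, code, atoms, lam):
    """Half squared reconstruction error plus the Lasso penalty on the code."""
    residual = [xi - ri for xi, ri in zip(x, reconstruct(code, atoms))]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(c) for c in code)


def sparse_code(x, atoms, lam, n_iter=200):
    """ISTA: gradient step on the squared error followed by soft-thresholding."""
    # The squared Frobenius norm of the dictionary upper-bounds the Lipschitz
    # constant of the gradient, so this step size is conservative but safe.
    step = 1.0 / sum(a * a for atom in atoms for a in atom)
    code = [0.0] * len(atoms)
    for _ in range(n_iter):
        residual = [ri - xi for xi, ri in zip(x, reconstruct(code, atoms))]
        grad = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        code = [soft_threshold(c - step * g, step * lam)
                for c, g in zip(code, grad)]
    return code


# Hypothetical overcomplete dictionary: the signal equals twice the third atom
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
x = [1.2, 1.6, 0.0]
code = sparse_code(x, atoms, lam=0.1)
```

Learning the dictionary as well typically alternates a step of this kind with an update of the atoms; the joint problem is no longer convex, although each of the two subproblems can be.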
This family of techniques makes it possible to perform a variety of tasks, such as clustering, dictionary learning, sparse coding, data integration, and matrix completion. These methods can be regularized by adding a penalty on both the patterns (the dictionary $D$) and the coefficients ($C$) used to reconstruct the input matrix $X$:
$$\min_{C, D} \; \|X - CD\|_F^2 + R_1(D) + R_2(C),$$
where R1 and R2 are penalties chosen by the user to impose regularization. Common choices are
6.1.1. Applications
Dictionary learning is widely used to analyze biological data and is mostly exploited for the analysis of biomedical images. It has been used for the reconstruction of magnetic resonance images from undersampled data (Ravishankar and Bresler, 2010) and for the detection of microaneurysms in retinal images (Zhou et al., 2017). Dictionary learning can also be applied to other types of data, as in Nowak et al. (2011) and Masecchia et al. (2013), who use a fused Lasso dictionary learning approach to subtype cancer patients from copy number variation (CNV) data.
6.2. Non-negative matrix factorization and discriminative dictionary learning
The dictionary learning problem can be specialized in many forms. One of the most popular specializations is non-negative matrix factorization (NMF), which has exactly the same form as Problem (11), except that the sets over which the coefficients and the dictionary are optimized are restricted to the non-negative orthant, that is, all entries of the coefficients and of the dictionary are constrained to be greater than or equal to zero.
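A minimal sketch of non-negative matrix factorization is given below. It uses the classical multiplicative update rules for the unpenalized squared-error objective, which is a simplification chosen for illustration: the penalized variants discussed in this section require modified updates. The notation (input `X` approximated by the product of coefficients `C` and dictionary `D`), the function names, and the toy matrix are assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def reconstruction_error(X, C, D):
    """Squared Frobenius norm of X - CD."""
    R = matmul(C, D)
    return sum((x - r) ** 2
               for x_row, r_row in zip(X, R) for x, r in zip(x_row, r_row))


def multiply_ratio(M, num, den, eps):
    """Element-wise M * num / (den + eps); non-negative inputs stay non-negative."""
    return [[m * a / (b + eps) for m, a, b in zip(m_row, a_row, b_row)]
            for m_row, a_row, b_row in zip(M, num, den)]


def nmf(X, k, n_iter=200, seed=0, eps=1e-12):
    """Factorize a non-negative X (n x d) into C (n x k) and D (k x d)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Strictly positive random initialization
    C = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    D = [[rng.random() + 0.1 for _ in range(d)] for _ in range(k)]
    for _ in range(n_iter):
        Ct = transpose(C)
        D = multiply_ratio(D, matmul(Ct, X), matmul(matmul(Ct, C), D), eps)
        Dt = transpose(D)
        C = multiply_ratio(C, matmul(X, Dt), matmul(C, matmul(D, Dt)), eps)
    return C, D


# Hypothetical non-negative data: rows 2 and 4 are combinations of rows 1 and 3
X = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 1.0, 3.0], [1.0, 3.0, 3.0]]
C, D = nmf(X, k=2)
error = reconstruction_error(X, C, D)
```

Since every update multiplies a non-negative entry by a non-negative ratio, the constraint is satisfied by construction and no explicit projection step is needed.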
The second problem is discriminative dictionary learning where the coefficients are used as a new representation for the original signal in a new problem such as classification or regression. The possibility to learn the dictionary, the coefficients, and the classification parameters at the same time was first proposed by Huang and Aviyente (2007). In this specialization, the functional becomes
where L is a classification/regression loss as the ones in Table 1 and
6.2.1. Applications
Piaggio et al. (2019) exploit penalized non-negative matrix factorization to find patterns of somatic mutations specific to uveal melanoma from SNP data. Javidi et al. (2017) exploit discriminative dictionary learning and a sparse representation based on the Lasso penalty to perform vessel segmentation on retinal images. Li et al. (2017) use multimodal dictionary learning with a Lasso penalty to distinguish between stages of Alzheimer's disease.
7. Are There Relevant Relationships Between Variables? (A4)
Network inference is the process of estimating a graph from real-world measurements. The inferred graph is the mathematical abstraction of a system in which nodes represent the variables and edges may represent different types of relationships, depending on the system under analysis. In real-world scenarios the graph structure is often unknown, and in fields such as computational biology network inference plays a key role in understanding how molecular interactions work. At the cellular level, for example, we may seek evidence of regulatory functions (Lozano et al., 2009), coexpression edges, metabolic influence (Kanehisa, 2001), as well as protein/protein interaction networks (Huang et al., 2016). Learning the network structure from data may be hard because the number of features is typically large relative to the number of samples. Research in this area has grown in recent years, and many methods that tackle some of these problems have been proposed. These include Bayesian network (BN)-based (Nielsen and Jensen, 2009), Gaussian graphical model (GGM)-based, differential equation (DE)-based (de Hoon et al., 2002), and mutual information (MI)-based (Margolin et al., 2006) methods. In this section, we focus on GGMs as a specific example of a wider set of probabilistic methods that naturally leverage regularization to infer networks. GGMs are based on penalized maximum likelihood estimation (MLE) and can be written as in Equation (3). GGMs can also easily be adapted to many different regularization strategies. Regularization in these methods helps to cope with the high dimensionality of the data and improves the identifiability and interpretability of the resulting network. Moreover, GGMs are suited to the inference of both coexpression (Friedman et al., 2008) and regulatory networks (Krämer et al., 2009). This class of methods can also be easily adapted to non-Gaussian data through appropriate data manipulation.
7.1. Graphical Lasso
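Graphical Lasso is the most representative example of a penalized MLE method for network inference. It assumes that the variables in the system follow a multivariate Gaussian distribution with covariance matrix $\Sigma$ and estimates the precision matrix $\Theta = \Sigma^{-1}$ by solving
$$\hat{\Theta} = \arg\min_{\Theta \succ 0} \; -\log\det(\Theta) + \mathrm{tr}(S\Theta) + \lambda \|\Theta\|_1,$$
where $S$ is the empirical covariance matrix of the data, $\|\Theta\|_1$ is the sum of the absolute values of the entries of $\Theta$, and $\lambda > 0$ controls the sparsity of the estimate. A zero entry $\Theta_{ij}$ indicates that variables $i$ and $j$ are conditionally independent given all the others, so the nonzero pattern of $\hat{\Theta}$ defines the edges of the inferred graph.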
7.1.1. Applications
An example is the work of Ramanan et al. (2016), in which the authors inferred a network demonstrating an antagonistic relationship between Clostridiales and Bacteroidales communities from the Human Microbiome Project. Since it was first proposed, the graphical Lasso has received much attention for its applications in biology; we refer the reader to the review by Kuismin and Sillanpää (2017), which compares it with other network inference methods in the context of systems biology.
7.2. Graphical Lasso extensions
Many extensions of Equation (13) have been proposed over the years to model systems of increasing complexity. These extensions are largely based on the addition of further penalties that force the graph structures to respect certain constraints. One notable example is the extension to the multitask/multiclass case, in which the graphs share a common structure but differ in some connections (Danaher et al., 2014). These methods are mainly based on the Group Lasso or fused Lasso penalties, and they have been successfully applied in genomics (Xie et al., 2016) and neuroscience (Belilovsky et al., 2016). To include the dynamical properties of systems, Zhou et al. (2010) propose a weighted method to estimate the temporal evolution of the graph, whereas Hallac et al. (2017) propose precision matrices that evolve in time, similarly to Danaher et al. (2014). Here, again, the extension is obtained by applying a regularization term that enforces similarity between graphs that are close in time. The graphical Lasso has also been extended, through the nuclear norm penalty, to consider hidden and unmeasurable variables that influence the system (Chandrasekaran et al., 2010). The dynamical and latent aspects were combined in Tomasi et al. (2018), where the authors show the ability to detect perturbations in cellular systems subjected to external stimuli.
Graphical Lasso can be further extended to the multilayer case, which integrates components of the cellular system that act at different scales or times to obtain a more precise overview.
7.2.1. Applications
Cheng et al. (2017) propose a regularized extension that translates into a Group Lasso penalty on the entries of the precision matrix. This method is able to detect pathway/pathway and gene/gene interactions. Monti et al. (2014) used a dynamical graphical Lasso to detect brain functional connections from fMRI images. Libbrecht et al. (2015) performed semiautomated genome annotation by inferring a network with graph-based regularization.
7.3. Lasso in the non-Gaussian case
The Gaussian assumption yields simple and computationally tractable algorithms and extensions, but it limits the types of data that can be analyzed. Several methods handle non-Gaussian data distributions simply by manipulating the input data through
7.3.1. Applications
Research has also moved toward the use of other distributions and models, for example, the Ising model for discrete variables or the Poisson model, which better describes next-generation sequencing data (Yang et al., 2012a). These methods are powerful, and they make it possible to consider graphs, for example, of gene/gene interactions, that are generated from different data measurements such as CNV, gene expression, or single-nucleotide polymorphism data. In this context, a method that integrates the networks obtained from diverse measurements assuming the best distribution has been proposed in Žitnik and Zupan (2015), who showed that it recovers a more detailed network. Zitnik and Leskovec (2017) exploit a similar method to predict multicellular function by inferring multilayer tissue networks regularized through
8. Conclusion
This article clarifies the importance of regularized methods for life science studies from different perspectives. We covered both supervised settings, where the expected outcome is to predict some target variable, and unsupervised scenarios, where the aim is to infer the topology of the network modeling the interactions between the observed variables. Moreover, we showed how prior knowledge on the problem at hand can be embedded into a regularization penalty, making it possible to identify meaningful and interpretable solutions. We also highlighted how, thanks to different regularization penalties, it is possible to overcome the issues faced by standard statistical methods in settings where the variables outnumber the available samples.
We summarized the applications cited in this article in Tables 2 and 3. Regularization is heavily used for the analysis of omic-data (Table 2), owing to the naturally high dimensionality of these types of data. We cannot, however, identify one specific method or regularization type that is used more than the others for omic-data: the choice of regularization method depends on a variety of additional considerations. Table 3 reports other types of data; a clear preference for DL and dictionary learning emerges when it comes to the analysis of biomedical images. This is expected, since both DL and dictionary learning learn representations of meaningful parts of the input signal, which is crucial in image analysis, where we may want the model to have suitable properties such as translation invariance.
Table 2. Applications Related to the Analysis of Omic-Data of Various Nature
For each type of datum, we provide the specific type of analyzed data, the citation, the machine learning method, and the type of regularization. Note that recursive feature elimination was never explicitly mentioned, but it belongs to the sparsity-inducing regularization techniques; details can be found in Guyon and Elisseeff (2003).
CNV, copy number variation; DL, deep learning; RF, Random forest; RLS, Regularized Least Squares.
Table 3. Applications Related to the Analysis of Biomedical Images and Textual/Clinical Data
For each type of datum, we provided the specific type of analyzed data, the citation, the machine learning method, and the type of regularization.
MR, magnetic resonance; MRI, magnetic resonance imaging; fMRI, functional MRI; sMRI, structural MRI.
Regularization is a key aspect in all these works, and in many others. In the era of large-scale data, it is well worth investing effort in adopting suitable regularization techniques when developing an analysis pipeline, in order to obtain robust, reliable, and interpretable results.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
No funding was received for this article.
