Abstract
High-content analysis has revolutionized cancer drug discovery by identifying substances that alter the phenotype of a cell in ways that prevent tumor growth and metastasis. The high-resolution biofluorescence images from these assays allow precise quantitative measurements, enabling the effects of small molecules on host and tumor cells to be distinguished. In this work, we are particularly interested in applying deep neural networks (DNNs), a cutting-edge machine learning method, to the classification of compounds into chemical mechanisms of action (MOAs). Compound classification has previously been performed using image-based profiling methods, sometimes combined with feature reduction methods such as principal component analysis or factor analysis. In this article, we map the input features of each cell to a particular MOA class without using any treatment-level profiles or feature reduction methods. To the best of our knowledge, this is the first application of DNNs in this domain that leverages single-cell information. Furthermore, we use deep transfer learning (DTL) to alleviate the intensive and computationally demanding effort of searching the huge parameter space of a DNN. Results show that, using this approach, we obtain a 30% speedup and a 2% accuracy improvement.
Introduction
Recent advances in quantitative microscopy and high-performance computing have enabled rapid progress in the development of high-throughput image-based assays. These high-content analysis (HCA) assays allow not only the precise quantitative observation of multiple parameters, such as nuclear size, nuclear morphology, DNA replication, and many more subtle features derived from each image, but also the screening of thousands of cells. To tackle this high-throughput, high-dimensional problem, biologists tend to use population averages of per-cell information before applying machine learning (ML) algorithms such as principal component analysis, random forests, K-nearest neighbors, or support vector machines. Moreover, a recent survey1 shows that about 70% of the papers on HCA experiments published in Science, Nature, Cell, and the Proceedings of the National Academy of Sciences from 2000 to 2012 used only one or two of the cells' measured features, and less than 15% used more than six. Unfortunately, owing to the exponential increase in the number of product terms,2 such ML algorithms become impractical for problems with thousands of samples and hundreds of measured features. As a result, about 85% of the research work in HCA underutilized potentially valuable information that might have helped speed up early-stage drug discovery. In this paper, we are interested in exploring state-of-the-art algorithms developed in the field of artificial intelligence to address such high-throughput, high-dimensional data.
The discovery of hierarchical visual sensory processing systems in the neocortex of the mammalian brain motivated the field of artificial intelligence to develop algorithms that hierarchically extract information from data.3,4 Deep learning5,6 has thus emerged as a new paradigm in artificial intelligence focusing on computational models for information representation that exhibit characteristics similar to those of the neocortex, in an attempt to imitate a primate visual system with its sequence of processing stages: detection of edges, then primitive shapes, moving up gradually to more complex visual forms.7,8 Since 2006, deep learning research has been successful not only in academia but also in companies such as Google (image retrieval) and Facebook (face recognition). Beyond application domains such as image recognition9,10 and speech recognition,11 deep learning has beaten other ML techniques at predicting the activity of potential drug molecules from quantitative structures12 and at predicting the effects of mutations in noncoding DNA on gene expression and disease.13
We focus on the challenge of using as much information content as possible, by considering per-cell information and all the available features, to build a classifier of the chemical mechanism of action (MOA). A mechanism of action usually refers to the biochemical interaction through which a drug binds to produce its pharmacological effects. Here, MOA is specifically used to express a set of similar phenotypic outcomes shared among different compound treatments, and not a strict modulation of a particular target or target class.14 According to Ljosa et al.,14 the mechanistic classes were selected to give the data a wide cross section of cellular morphological phenotypes. We propose a deep transfer learning (DTL) framework, combining the advantages of deep learning with the flexibility of transfer learning. Transfer learning consists of reusing the knowledge gained on a (source) problem to solve a new (target) problem. Ideally, DTL should improve the performance of the reused classifier on the target problem over the baseline, that is, over the classifier trained directly on the target problem.
Our contribution can thus be summarized as follows:
Use of per-cell information with all the extracted features from high-content images
Use of state-of-the-art deep learning models coupled with GPU computational power to analyze such high-throughput high-dimensional data
Use of transfer learning to improve the performance of the models (in terms of computational speed)
In this paper, we consider stacked autoassociators15,16 (SAAs) as classifiers of MOAs on freely available MCF7 wild-type breast cancer data14 using a DTL framework that includes a supervised layer-based feature transference approach.16,17
A possible use case of the work presented in this paper would be for a researcher to (1) solve a given classification problem of MOA or obtain the classifiers used to solve such a problem from the result of a previous work, (2) select what part of a previously developed classifier to transfer, and (3) solve the new problem by doing transfer learning of the learned classifiers for a new MOA task and benefit from a faster training (when compared to a random initialization) and an eventual improvement in classification accuracy. In the case of using deep neural networks as classifiers, as we do in this work, the researcher can choose which layers should be reused from a previous experiment. In the Results section, we discuss several settings and advise the use of the setting that produces the best results in our work.
Materials and Methods
Data
We used a publicly available (http://www.broadinstitute.org/bbbc, accession BBBC021) dataset from the genetically engineered MCF7-wt (breast cancer expressing wild-type p53) cell line. 26 Briefly (all details of sample preparation and image analysis can be found in Ljosa et al. 14 ), images of cell cultures with a given treatment (specific compound × concentration combination) were acquired on a high-content imaging platform using a 16-bit camera. Each image was further segmented using CellProfiler 18 (CP) by identifying nuclear and cytoplasmic boundaries. Then, 453 distinct features for each cell representing a variety of geometric, intensity, subcellular localization, and texture features 19 were extracted with CP. Figure 1 shows some examples of captured images representing some of the MOAs, as well as some of the features extracted with CP.

Our problem consists of predicting the MOA of a given treatment using per-cell information, in contrast to other established methodologies that use some profiling and/or feature reduction techniques (see Ljosa et al. 14 for a comparative study). Profiling in this context is meant as the process of building a multivariate vector profile for each treatment based on all the cells treated with that treatment. There are a total of 103 treatments corresponding to combinations of 38 compounds at one to seven concentrations. We only used the 148,649 cells of noncontrol samples, thus giving a data matrix with 148,649 rows (representing cells) and 453 columns (representing the extracted features).
To perform transfer learning, we need to define a source and a target problem. For that purpose, the original MCF7 dataset with 12 MOAs is split into two mutually exclusive datasets with 6 MOAs each, Pset1 and Pset2. The split was designed to group MOAs with common batches (a batch represents the week in which a group of cells was cultured in the same environmental setting) to prevent classification bias arising from batch and/or plate effects (see the supplementary material).
Classifier: Stacked Autoassociators
Let us represent a dataset by a set of tuples (x_i, y_i), where x_i is the vector of extracted features for cell i and y_i is its MOA label.

In this paper, we consider stacked autoassociators24 (SAAs) to build our classifier of MOAs. An autoencoder, or autoassociator, is a simple neural network with one hidden layer designed to reconstruct its own input. We additionally constrain the encoding and decoding feature sets (input-hidden and hidden-output weights, respectively) to be the transpose of each other (tied weights). SAA training17 comprises two stages: an unsupervised pretraining stage, where the information of the labels (MOAs) is not used, followed by a supervised fine-tuning stage that uses the MOA information. In the pretraining stage, a greedy layerwise approach is used to train the hidden layers of the SAA: the first hidden layer is trained to reconstruct the input, and each subsequent hidden layer is trained to reconstruct the output of the layer below it.
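The tied-weight autoassociator described above can be sketched in a few lines of NumPy. This is an illustrative minimal version only; the class name, initialization scale, and learning rate are our own choices, not the paper's implementation (which uses Theano):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoencoder:
    """One-hidden-layer autoassociator with tied weights: the decoder
    reuses the transpose of the encoder weight matrix W, so only W and
    the two bias vectors are learned."""

    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
        self.b_hid = np.zeros(n_hidden)
        self.b_out = np.zeros(n_in)

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_hid)

    def reconstruct(self, x):
        return sigmoid(self.encode(x) @ self.W.T + self.b_out)

    def train_step(self, x, lr=0.1):
        """One gradient step on the cross-entropy reconstruction cost.
        Because W is shared by encoder and decoder, its gradient is the
        sum of a decoder term and an encoder term."""
        h = self.encode(x)
        r = sigmoid(h @ self.W.T + self.b_out)
        d_out = r - x                              # dC/da_out for sigmoid + cross-entropy
        d_hid = (d_out @ self.W) * h * (1.0 - h)   # error backpropagated to the hidden layer
        grad_W = np.outer(d_out, h) + np.outer(x, d_hid)
        self.W -= lr * grad_W
        self.b_out -= lr * d_out
        self.b_hid -= lr * d_hid
```

Greedy layerwise pretraining then trains a second autoassociator on the hidden codes produced by `encode`, and so on up the stack, before the supervised fine-tuning stage.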

High-content image analysis of breast cancer cells using SAA.
Framework: Deep Transfer Learning
Traditionally, the goal of transfer learning is to transfer the knowledge (learning) obtained with a source problem to one or more target problems to efficiently develop an effective hypothesis for a new task, problem, or distribution. 20
In this work, we combine deep learning with transfer learning by means of a supervised layer-based feature transference21,22 method. In this method, a deep classifier is obtained (pretrained and fine-tuned) using data from a source problem and reused, partially or fully, in the deep classifier for the target problem. The latter is then fine-tuned with the data from the target problem. By partially, we mean that one can transfer all or only part of the source model features (layers) to the target model. In this way, we transfer knowledge acquired on the source problem to help solve the target problem. It is expected that the transfer learning (TL) process supplies the target classifier with an initial set of weights that is a better starting point than the traditional random initialization, providing improved performance over the baseline (positive transference); by contrast, negative transference occurs when the baseline classifier performs better than the TL classifier. To be more precise, let us introduce some notation, considering an SAA with seven hidden layers plus one logistic layer for both the source and target models. We use four different TL settings for supervised layerwise feature transference. In these settings, a 0 means "no transfer," that is, the weights of that specific layer of the target model are randomly initialized and not reused from the source model, and a 1 means "transferred," that is, the initial weights of that specific layer are obtained (reused) from the trained source model. Note that for each setting, the logistic regression layer is also transferred from the source model to the target model. For example, the setting [00111111] means that we randomly initialize the first and second layers of the target model and transfer all the remaining layers from the source problem. The target network thus built is then fine-tuned with the target data.
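The layer-based transference settings can be sketched as a mask-driven copy of parameters from the source model to the target model before fine-tuning. The function below is an illustrative sketch; the list-of-layers representation of a model is an assumption for illustration, not the paper's code:

```python
import copy

def transfer_layers(source_layers, target_layers, setting):
    """Build the initial parameters of the target network from a
    transfer setting such as [0, 0, 1, 1, 1, 1, 1, 1]:
      1 = reuse that layer's trained parameters from the source model,
      0 = keep the target model's random initialization.
    Each model is represented here as a list of per-layer parameter
    objects; in the paper the last entry (the logistic regression
    layer) is always transferred."""
    assert len(setting) == len(source_layers) == len(target_layers)
    init = []
    for flag, src, tgt in zip(setting, source_layers, target_layers):
        init.append(copy.deepcopy(src) if flag == 1 else copy.deepcopy(tgt))
    return init
```

The resulting parameter list is then used as the starting point for fine-tuning on the target data, instead of a purely random initialization.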
LOOCV Training and Network Hyperparameters
Regarding the training process, we followed a procedure similar to that in Ljosa et al.14 To prevent the sharing of batch-specific image properties/features or compound properties between the training and test sets, and thus to prevent the classifiers from learning artifact properties of individual images rather than the more general cell phenotype,23 we used a leave-one-compound-out cross-validation (LOOCV) procedure in which all the cells treated with the same compound as the treatment being classified are held out, even if those other cells were treated with a different concentration. Thus, the test set in LOOCV is composed of all the cells from the held-out compound; the remaining cells (from all the other compounds) are split into a training set, used to train the model, and a validation set, used to prevent overfitting by evaluating early-stopping criteria in the fine-tuning phase. The decision of when to stop fine-tuning is based on a geometrically increasing amount of patience: the patience is increased geometrically whenever the current validation score improves on the best validation score, and backpropagation fine-tuning proceeds until the patience is exhausted or the maximum number of fine-tuning epochs is reached. The trained classifier is then tested on the unseen individual cells from the test set, and each prediction is matched with its ground-truth MOA. The per-cell classifier predictions are then combined to calculate treatment prediction accuracy using majority voting. Each experiment is repeated 10 times.
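The leave-one-compound-out split and the majority-vote aggregation described above can be sketched as follows. The dictionary-based cell records are a hypothetical representation for illustration; the actual data layout is the 148,649 × 453 matrix described earlier:

```python
from collections import Counter

def leave_one_compound_out(cells):
    """cells: list of dicts with (at least) a 'compound' key.
    Yields (held_out_compound, train_cells, test_cells); every cell
    treated with the held-out compound, at any concentration, goes to
    the test set, and all remaining cells are available for the
    training/validation split."""
    compounds = sorted({c["compound"] for c in cells})
    for comp in compounds:
        test = [c for c in cells if c["compound"] == comp]
        train = [c for c in cells if c["compound"] != comp]
        yield comp, train, test

def treatment_prediction(per_cell_predictions):
    """Combine per-cell MOA predictions for one treatment by majority vote."""
    return Counter(per_cell_predictions).most_common(1)[0][0]
```

Holding out every concentration of the tested compound, not just the tested treatment, is what prevents compound-specific image artifacts from leaking into the training set.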
Tuning hyperparameters such as the learning rate, or choosing an appropriate network architecture for training the deep model, is desirable but highly time-consuming. The results of the following section were obtained using SAAs with seven hidden layers of 500 neurons each. We used pretraining and fine-tuning learning rates of 0.001 and 0.1, respectively. The stopping criterion for pretraining was fixed at 60 epochs, the value at which the reconstruction cost saturates; the stopping criterion for fine-tuning was a maximum of 1000 epochs, with early stopping on the validation set. The complete details of these networks are listed in the supplementary material.
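The patience-based early-stopping schedule used during fine-tuning can be simulated as follows; the initial patience and growth factor below are assumed values for illustration, not the paper's exact settings:

```python
def fine_tune_with_patience(validation_errors, initial_patience=10,
                            patience_increase=2.0, max_epochs=1000):
    """Simulate geometrically increasing patience: whenever the
    validation error improves on the best seen so far, the patience
    (in epochs) is extended to patience_increase times the current
    epoch; training stops when the epoch counter reaches the patience
    or max_epochs. Returns (stopping_epoch, best_validation_error).
    Illustrative sketch only; thresholds are assumptions."""
    best = float("inf")
    patience = initial_patience
    for epoch, err in enumerate(validation_errors, start=1):
        if err < best:
            best = err
            patience = max(patience, int(epoch * patience_increase))
        if epoch >= patience or epoch >= max_epochs:
            return epoch, best
    return len(validation_errors), best
```

The effect is that training continues as long as the validation error keeps improving, but stops soon after improvement stalls, well before the 1000-epoch cap in most runs.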
Processing large data as we did, on millions of neural connections, would take several weeks using traditional CPUs. For that reason, we used Theano,24 a GPU-compatible machine learning library, to perform all our experiments on two i7-377 (3.50 GHz), 16 GB RAM machines equipped with two GTX 770 and five GTX 980 GPUs, respectively (see the High-Performance Computing section of the supplementary material). The software to reproduce the results is available at http://www.deepnets.ineb.up.pt/files/software/DTL_frontend.html.
Results
The analysis of large volumes of multiparametric high-dimensional data without overfitting the network using a high number of cytological features in a time frame suitable for drug discovery presents a significant challenge for any learning algorithm. In the following, we present the results obtained by our approach.
The results of the baseline SAA for classifying MOAs on the Pset1 and Pset2 datasets are listed in Table 1. We observe that classifying MOAs of Pset2 is about 2.8% more accurate than classifying MOAs of Pset1, even though both datasets have an equal number of MOAs. Also, the computation time to classify the Pset2 dataset is greater than that for the Pset1 dataset: the Pset2 dataset has 61 treatments for 18 compounds, whereas Pset1 has 42 treatments for 20 compounds. The confusion matrices for classifying MOAs using the baseline approach on both the Pset1 and Pset2 datasets are shown in Figure 4, and the precision, recall, and f1-scores are listed in the supplementary material.
Average Accuracy in Percentage and Average Computation Time in Minutes (Standard Deviation in Parentheses) of the Baseline (BL) and DTL Approaches.
The results are over 10 repetitions for the target data (PT) with compounds (C) and source data (PS). The best results are shown in bold.
To further improve the results over the baseline approach, we considered a deep transfer learning framework in which the knowledge gained on the source problem is reused to solve the target problem. The results for four DTL settings are presented in Table 1, and the respective boxplots are displayed in Figure 3. Essentially, we observe that the DTL_1 setting improves over the baseline for both the Pset1 and Pset2 datasets. It is interesting to note that the best results are obtained when only the most specific (top) layer weights are transferred from the source to the target problem (the seventh hidden layer weights and the logistic regression weights are reused) and the remaining (lower) layers are randomly initialized. For example, classifying Pset1 reusing Pset2 with the DTL_1 transfer setting produces models 2% more accurate than the baseline and about 0.8% more accurate than the transfer-all setting (DTL_4). One reason for this behavior is that the higher layers of the network learn problem-specific features from the data, while the lower layers learn generic features;21,22 thus, it seems beneficial to reuse the knowledge the source problem acquired in its higher layers. Moreover, the DTL_1 setting speeds up computation time by 30% over the baseline approach. Confusion matrices for all DTL settings can be analyzed in Figure 4. Given these results, we advise researchers applying DTL to this type of problem to use the DTL_1 setting.

Comparison of baseline vs. DTL approaches. Left: Baseline average accuracy for classifying Pset1 and DTL approaches for classifying Pset1 reusing Pset2. Right: Baseline average accuracy for classifying Pset2 and DTL approaches for classifying Pset2 reusing Pset1.

Confusion matrices for the baseline and TL settings on the MOA problem (average outcomes over 10 repetitions). Each confusion matrix shows the number of elements in each class (thus reflecting class imbalance), and the background blue color encodes the normalized confusion matrix (the higher the accuracy, the darker the color).
Comparison with Other State-of-the-Art Methods
Table 2 lists a comparison of our deep learning results (baseline and best TL setting) with two state-of-the-art machine learning algorithms: support vector machines (SVMs)25 with linear and radial basis function (RBF) kernels, using a freely available, fast C-based implementation of multiclass SVM (SVMmulticlass, version 2.20). For the linear SVM, we optimized the trade-off between training error and margin cost from 0.001 to 50,000 (see the supplementary material).
Comparison of Accuracy Obtained and Total Time Taken per Repetition in Minutes with Other State-of-the-Art Methods.
Best results in bold.
Discussion
To stimulate the development of new drugs effective against a wide spectrum of cancers, we propose a deep transfer learning (DTL) classification framework that uses HCA data. Our classifiers are built upon individual-cell information without employing any type of profiling or reduction method on the extracted cell features. The main motivation for using a DTL approach was to show that we can reuse, with minor modifications, the knowledge acquired in solving a given MOA classification problem to solve a new one (also of MOAs) without having to follow the whole training procedure. This is particularly useful for new drug testing, as computational time is saved. For that purpose, the data were carefully split into two mutually exclusive six-class problems, represented by the Pset1 and Pset2 datasets. The average accuracies of the baseline SAAs for the Pset1 and Pset2 datasets are about 84% and 87%, respectively, using a seven-hidden-layer SAA with 500 neurons in each layer. The DTL approach showed that the transference of specific weights of the source model was useful, and we obtained positive transference for both datasets. Although the difference in accuracy between baseline and transfer learning on Pset1 and Pset2 is not statistically significant, we observed around a 30% computational speedup when using the DTL approach. Our approach was also superior to multiclass support vector machines.
Regarding the 12-class problem, we trained several SAAs ranging from three to eight hidden layers with 500 to 1000 neurons in each layer. However, training a seven-hidden-layer SAA with 500 neurons in each layer may take, on average, 30–48 h per repetition. We performed some preliminary experiments using the appropriate leave-one-compound-out approach, and without extensive hyperparameter search, the best model obtained around 77% accuracy. As future work, we intend to explore a different approach for the 12-class problem using convolutional neural networks (CNNs) applied directly to the images rather than to hand-crafted features. CNNs are state-of-the-art deep neural networks that use a hierarchical representation of the data similar to that of the neocortex and are especially designed for image recognition tasks. We expect to obtain a similar hierarchical feature extraction directly from the images, giving the deep network the possibility of self-extracting relevant cytological features layer by layer.
Acknowledgements
We would like to acknowledge the support of Joaquim Marques de Sá, Jaime S. Cardoso, Vebjorn Ljosa, Shantanu Singh, Szymon Stoma, Tiago Laundos Santos, Jonathan Barber, Ricardo Sousa, and Abhishek Chatterjee. We also acknowledge the valuable comments provided by the reviewers that greatly helped to improve the paper.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financed by FEDER funds through the Programa Operacional Factores de Competitividade—COMPETE and by Portuguese funds through FCT—Fundação para a Ciência e a Tecnologia in the framework of the project PTDC/EIA-EIA/119004/2010.
References
Supplementary Material
