Abstract
PURPOSE:
This study endeavors to build a deep learning (DL)-based model for predicting disease progression in head and neck squamous cell carcinoma (HNSCC) patients by integrating multi-omics data.
METHODS:
RNA sequencing, miRNA sequencing, and methylation data from The Cancer Genome Atlas (TCGA) were used as input for autoencoder, a DL approach. An autoencoder-based prognosis model for PFS was built by SVM algorithm and tested in three confirmation sets. Predictive performance of the model was compared to two alternative approaches. Differential expression analysis for mRNAs, microRNAs (miRNA) and methylation was conducted. Moreover, functional annotation of differentially expressed genes (DEGs) was achieved through function enrichment analysis.
RESULT:
The DL-based prognosis model identified two subgroups of patients with significantly different PFS, and showcased a good model fitness (C-index
CONCLUSION:
The DL-based model introduced in this study is reliable and robust in predicting disease progression in HNSCC patients. A number of pathways and genes targets are unraveled to be implicated in cancer progression. Utility of this model would facilitate development of more individualized therapy for HNSCC patients and improve prognosis.
Introduction
Head and neck squamous cell carcinoma (HNSCC) accounts for more than 90% of all head and neck cancers, and is one of the most common cancers worldwide with over 500,000 new cases annually [1, 2]. Despite treatment advancements, the prognosis is dismal with an overall five-year survival rate of 50%, which is largely attributed to recurrence and metastasis [3]. Recurrence is reported in over 50% of patients within three years after diagnosis [4, 5]. This imposes an urgent need for a reliable risk stratification method to identify the subgroup of HNSCC patients at high risk of disease progression.
Tremendous potential exists for machine learning methods to reduce reduction and classify gene expression data. Deep learning (DL), a subtype of machine learning method, has been found to be satisfying in extracting information from high dimension image [6]. An autoencoder is an hourglass-shaped unsupervised artificial neural network frequently used for classification [7]. It has been proved to be a promising DL method for extracting meaningful features from gene expression data of breast cancer [8]. Furthermore, DL computation framework implemented by autoencoder has been used to predict survival in liver cancer by integrating multi-omics data including RNA sequencing, miRNA sequencing and methylation data, and, as a result, identifies two survival subtypes [9]. In a recent study, a genomic predictive model has been developed for recurrence and metastasis development in HNSCC patients by a three-class support vector machine (SVM) algorithms [10]. However, autoencoder-based model has not been applied to genomic data of HNSCC.
In this study, we employed DL framework implemented by autoencoder for analysis of RNA sequencing data, microRNA (miRNA) sequencing data, DNA methylation data and clinical information of HNSCC patients to derive a prognostic model for progress free survival (PFS). Two subgroups identified by the DL-based prognosis model had significantly different PFS and were successfully validated in three confirmation cohorts. Moreover, efforts have been made to explore the promising biological roles played by differentially expressed genes (DEGs) between the subgroups in HNSCC.
Materials and methods
Datasets and research design
Total three datasets were applied in this study. The Cancer Genome Atlas (TCGA) set was used in two steps: firstly, the whole TCGA dataset was used to acquire the labels of progression-risk classes; secondly, the samples were split (60%/40%) to training and testing sets to train a SVM model (“data partitioning and robustness evaluation” subsection). Predictive accuracy of the DL-based model was tested in three additional confirmation sets.
TCGA set. Multi-omics HNSCC data was obtained from the TCGA portal (
Confirmation set 1 (microarray gene expression): A total of 86 oral cancer samples with survival and recurrence information were gained from E-GEOD-26549 [11] Affymetrix high-throughput GeneChip Human Gene 1.0 microarray dataset (
Confirmation set 2 (microarray gene expression): Total 109 samples of primary laryngeal cancer with survival and recurrence information from E-GEOD-27020 [12] Affymetrix high-throughput GeneChipHG-U133A microarray dataset were applied in this study. RMA-calculated signal intensity values were provided by authors.
Confirmation set 3 (microarray gene expression): A total of 270 HNSCC samples with survival information were selected from E-GEOD-65858 [13] Illumina high- throughput HumanHT-12 V4.0 expression beadchip microarray dataset (
Features transformation using a DL framework
In the present study, the preprocessed TCGA HNSCC 3-omics data of 368 samples were selected as the input for the autoencoder framework. The three matrices were stacked to construct a unique matrix. The autoencoder DL model was applied as previously reported [9]. New features from the omics data were produced by bottleneck layer of the autoencoder.
Transformed features selection and K-means clustering
A hundred new features were extracted from the bottleneck layer of the autoencoder, and then underwent a univariate Cox proportional hazards (Cox-PH) model. The significant features (log-rank
Data partitioning and robustness evaluation
The TCGA set was split by a cross-validation (CV)-like procedure as reported elsewhere [9]. Firstly, the 360 samples from TCGA set were randomly partitioned into 5 folds, 2 of which were used as the testing set and the other 3 of which as the training set. As a consequence, 10 new combinations (folds) were obtained. For each new combination, a model was built using the training set followed by labels prediction in testing set. For each training combination, a distinct autoencoder and a classifier were constructed for predicting the labels of the testfold. Eventually, an autoencoder based on all TCGA samples was used to infer the labels for the confirmation sets.
Supervised classification
A supervised classification model was constructed by the SVM algorithm using the labels derived from K-means clustering. Subsequently, to predict on TCGA 3-omics test data, an SVM classifier was built based on the top 100 mRNAs, 100 methylation, and 30 miRNA features selected by analysis of variance (ANOVA). Besides, another SVM classifier was developed with the corresponding top 100 mRNAs to predict on the three confirmation sets.
With regard to a confirmation dataset from a specific omic layer, common features shared by this cohort and the omic layer of the TCGA set were chosen. To be specific, the common features shared by the three testing sets and the TCGA set were 12,849 features for E-GEOD-26549, 10,626 features for E-GEOD-27020, and 11,997 features for E-GEOD-65858. A median scaling was applied to the samples of TCGA set and all three testing datasets. All operations were carried out under R software. The penalizedSVM package [17, 18] (
Three metrics for evaluating performance of models
Concordance index (C-index): C-index refers to the fraction of all pairs of individuals with correctly ordered survival times [19] and is based on Harrell C statistics [20]. A score around 0.70 suggests a good model, whereas a score around 0.50 implies a random background. C-index was computed using concordance.index function in R survcomp package [21].
Log-rank
Brier score. Brier score is used to evaluate the mean difference between the observed and the predicted survival after a certain time in survival analysis [22]. It is calculated by R survcomp package and ranges from 0-1. The higher the score is, the more inaccurate the model is.
Comparison of the DL framework with two alternative approaches
DL framework was compared to two other approaches. In the first approach, principal component analysis (PCA) was conducted using the same number (100) of principal components. Subsequently, univariate Cox-PH model was applied to the PCA features using the same Cox-PH procedure in DL-based model so as to analyze the associations of the PCA features with PFS. In the second approach, single-variant Cox-PH models based on C-index scores were applied to choose the top 32 features from all 3-omics features. Also, K-mean clustering of the samples was performed.
Clinical covariate analysis
Fisher test was employed to study the associations of the identified two PFS subtypes (S1 and S2) with other clinical factors, including age, sex, perineural invasion (PNI), radiotherapy, margin status, Hpv status, group stage, histologic grade and alcohol. Uni-and multi-variate Cox regression analysis was also conducted for investigating the relations of clinical features and the prognosis model with PFS.
Progress free survival differences between two subtypes (S1 and S2) for the TCGA and three confirmation sets: TCGA cohort (A), E-GEOD-26549 cohort (B), E-GEOD-85858 cohort (C), E-GEOD-27020 (D).
Differential expression analysis for the mRNAs, miRNAs expression and methylation genes was carried out between the two PFS subtypes by DESeq2 package [23] (
For methylation data, beta values were transformed into M values by lumi package [24] (
Performance evaluation of the SVM classification model on training and testing sets in TCGA cohort
Performance evaluation of the SVM classification model on training and testing sets in TCGA cohort
SD, standard deviation.
Performance of the model in the three confirmation sets
Gene ontology (GO) function and pathway enrichment analyses were carried out with the python goatools package [25] and kobas 3.0 interface [26], respectively. Significance for a pathway or a GO biological process (BP) term was defined a
Results
Identification of two PFS subtypes for HNSCC by DL approach
A total of 368 HNSCC samples with combined RNA-Seq, miRNA-Seq, and DNA methylation data were acquired from TCGA. Following preprocessing, 14,282 genes from RNA-seq, 220 miRNAs from miRNA-seq and 19883 genes from DNA methylation data were obtained and used as input features for autoencoder DL model which stacked the three types of omics features together.
In univariate Cox-PH regression analysis, 32 features out of the 100 new features derived from the bottleneck hidden layer of the DL-based model were identified to be significantly associated with survival (log-rank
As mentioned in Materials and Method, the 368 TCGA samples were split into 10 folds with a 60/40 ratio for training set and testing set. As shown in Table 1, the SVM classification model displayed a high C-index (0.71
Predictive capability of the classification model was successfully validated in three confirmation sets
Capability of this model in stratifying HNSCC patients into two PFS subtypes was tested in three independent conformation sets (E-GEOD-26549, E-GEOD-27020, and E-GEOD-65858). The common top features chosen by ANOVA followed by SVM classification are summarized as follows: E-GEOD-26549 (95%), E-GEOD-27020 (54%) and E-GEOD-65858 (92%). For E-GEOD-26549 with 86 samples, the two survival subgroups had a good C-index of 0.75, a lower Brier score of 0.16, and a log-rank
The DL-model performed better compared to two alternative methods.
As described in Materials and Methods, the DL-based model was compared to PCA and univariate Cox-PH-based models in terms of C-index, Brier score and Log-rank
Comparison of performances of autoencoder and two alternative approaches
Comparison of performances of autoencoder and two alternative approaches
Results of Fisher exact test showed PNI was significantly related to PFS (
Results of fisher exact test for clinical features
Results of fisher exact test for clinical features
Summary of results from uni-variate cox regression analysis
Results of multi-variate Cox regression analysis
A total of 10 up-regulated and 338 down-regulated DEGs were selected between two identified PSF subgroups (log2 fold change
A list of significant signaling pathways
A list of significant signaling pathways
Count means the number of genes significantly enriched in a pathway.
A heatmap of DEGs across two PFS classes identified by the DL-based model. The color bar corresponds to expression levels of DEGs.
Significant GO BP terms for the DEGs between two PFS subgroups. -log(p-value) is encoded in a color bar for visualization. Gene count indicates the number of genes significantly enriched in a GO term. Gene ratio indicates the ratio of the enriched genes in each gene set.
Accumulative genomic alterations play a critical role in HNSCC progression [27]. The present study integrated RNA-Seq, miRNA-Seq, and DNA methylation data of 368 HNSCC patients from TCGA to build a prognosis model for PFS by using DL approach. The DL method used in this study was autoencoder that reconstructed the multi-omics data to produce 100 new features to represent the data. There is evidence for the utility of autoencoder to unveil the organization of transcriptomic machinery [28]. Danaee et al. identify a set of genes as biomarkers for breast cancer by using autoencoder [29]. More recently, it has been reported that autoencoder-based model is efficient and accurate in predicting lung cancer, stomach cancer and breast cancer using RNA-seq data [30]. In the present study, the autoencoder-based model identified two subgroups of patients with significantly different PFS. CV results suggested that this model was robust in classifying patients into two PFS subgroups. Moreover, robustness and reliability of the model was successfully validated in three independent confirmation sets. Furthermore, according to C-index and Log-rank
It has been recognized by all that effective clinical features are closely related to prognosis. In the present study, the clinical feature PNI was shown to be significantly related to PFS of HNSCC patients, and this association had been reported in two European literature [Goepfert H, 1984; Fagan JJ, 1998]. Today, as the process of neoplastic invasion of nerves, PNI has been emerging as an important pathologic feature of many malignancies, as well as a marker of poor outcome and a harbinger of decreased survival [Liebig C, 2009]. The present study will provide a new idea for the effective clinical treatment of HNSCC.HNSCC has been regarded as an immune-suppressive disease characterized by profound immune defects [31]. Immunotherapy has been increasingly recognized as a promising novel therapy against HNSCC, calling for intense investigation into the biological mechanisms on immune process [32]. This study uncovered 348 DEGs between two identified PSF subgroups, which were functionally related to a number of immune-related GO terms, such as adaptive immune response, immune system process and immune response. Moreover, a total of 25 significant pathways for 348 DEGs were unraveled between two identified PSF subgroups, and according to the number of genes involved in, the top three pathways were cell adhesion molecules (CAMs), Neuroactive ligand-receptor interaction and Cytokine-cytokine receptor interaction. It was reported that the changes in the expression or function of CAMs have been implicated in all steps of tumor progression [Makrilia N, 2009], and Neuroactive ligand-receptor interaction was one of the most significant pathway for pancreatic cancer risk [Wei P, 2012]. It worth noting that some DEGs were invovled in different pathways, such as CD40LG, CD79A, CLDN3, CLDN10, CLDN20 and VCAM1. CD40L is a member of the TNF receptor superfamily, regulating immune response through CD40-expressing T cells activation [33]. CD79A protein, namely B-cell antigen receptor complex-associated protein alpha chain, binds to CD79b to form a dimer related to membrane-bound immunoglobulin in B cells and is involved in B cell development and function [34]. In the present studym, CD40LG and CD79A were significantly enriched in both primary immunodeficiency and B cell receptor signaling pathways. Furhtermore, both cell adhesion molecules (CAMs) and leukocyte transendothelial migration pathways involved CLDN3, CLDN10, CLDN20 and VCAM1 in the current study. CLDN3,CLDN10 and CLDN20 belong to the family of claudin proteins which are the key components of the tight junctions. Emerging studies have established that deregulated claudins lead to disrupted tight junctions, thus contributing to cancer development and progression [35, 36]. It has been found that polarity and distribution of claudins are altered in HNSCC and that expression of claudins associate with survival of patients [37]. VCAM1 acts as a cell adhesion molecule and is relevant to tumor growth, metastasis and angiogenesis [38]. These findings further confirm the pivotal roles by immune system and cell adhesions in HNSCC progression and present CD40LG, CD79A, CLDN3, CLDN10, CLDN20 and VCAM1 as candidate therapeutic targets for HNSCC.
Conclusion
This study develops a DL-based prognostic model that is able to robustly discriminate two subpopulations of HNSCC patients with significantly different PFS. Several immune and cell adhesion-related pathways are involved in cancer progression. CD40LG, CD79A, CLDN3, CLDN10, CLDN20 and VCAM1 may be critical genes in HNSCC. This study provides more insights into the underlying molecular mechanisms of HNSCC progression. Further studies are demanded to validate the feasibility of this prognostic model.
