Deep learning-based model for predicting progression in patients with head and neck squamous cell carcinoma

Abstract

PURPOSE:

This study endeavors to build a deep learning (DL)-based model for predicting disease progression in head and neck squamous cell carcinoma (HNSCC) patients by integrating multi-omics data.

METHODS:

RNA sequencing, miRNA sequencing, and methylation data from The Cancer Genome Atlas (TCGA) were used as input for an autoencoder, a DL approach. An autoencoder-based prognosis model for progression-free survival (PFS) was built with the support vector machine (SVM) algorithm and tested in three confirmation sets. The predictive performance of the model was compared with two alternative approaches. Differential expression analysis was conducted for mRNAs, microRNAs (miRNAs) and methylation. Moreover, differentially expressed genes (DEGs) were functionally annotated through functional enrichment analysis.

RESULTS:

The DL-based prognosis model identified two subgroups of patients with significantly different PFS and showed good model fit (C-index $=$ 0.73). The two identified PFS subtypes were successfully validated in three confirmation sets. The DL-based model was more accurate and efficient than principal component analysis (PCA) or individual Cox-PH-based models. There were 348 DEGs, 23 differentially expressed miRNAs and 55 differentially methylated genes between the two PFS subtypes. These genes were significantly involved in several immune-related biological processes and in the primary immunodeficiency, cell adhesion molecules (CAMs), B cell receptor signaling and leukocyte transendothelial migration pathways.

CONCLUSION:

The DL-based model introduced in this study is reliable and robust in predicting disease progression in HNSCC patients. A number of pathways and gene targets are implicated in cancer progression. Use of this model could facilitate the development of more individualized therapy for HNSCC patients and improve prognosis.

Keywords

Machine learning, deep learning, prognostic model, progression-free survival, autoencoder

1. Introduction

Head and neck squamous cell carcinoma (HNSCC) accounts for more than 90% of all head and neck cancers, and is one of the most common cancers worldwide with over 500,000 new cases annually [1, 2]. Despite treatment advancements, the prognosis is dismal with an overall five-year survival rate of 50%, which is largely attributed to recurrence and metastasis [3]. Recurrence is reported in over 50% of patients within three years after diagnosis [4, 5]. This imposes an urgent need for a reliable risk stratification method to identify the subgroup of HNSCC patients at high risk of disease progression.

Machine learning methods have tremendous potential for reducing dimensionality and classifying gene expression data. Deep learning (DL), a subtype of machine learning, performs well in extracting information from high-dimensional images [6]. An autoencoder is an hourglass-shaped unsupervised artificial neural network frequently used for classification [7]. It has proved to be a promising DL method for extracting meaningful features from gene expression data of breast cancer [8]. Furthermore, a DL computation framework implemented with an autoencoder has been used to predict survival in liver cancer by integrating multi-omics data including RNA sequencing, miRNA sequencing and methylation data, which identified two survival subtypes [9]. In a recent study, a genomic predictive model for recurrence and metastasis development in HNSCC patients was developed with a three-class support vector machine (SVM) algorithm [10]. However, an autoencoder-based model has not yet been applied to genomic data of HNSCC.

In this study, we employed a DL framework implemented with an autoencoder to analyze RNA sequencing data, microRNA (miRNA) sequencing data, DNA methylation data and clinical information of HNSCC patients and to derive a prognostic model for progression-free survival (PFS). The two subgroups identified by the DL-based prognosis model had significantly different PFS and were successfully validated in three confirmation cohorts. Moreover, we explored the potential biological roles of the differentially expressed genes (DEGs) between the subgroups in HNSCC.

2. Materials and methods

2.1 Datasets and research design

One TCGA dataset and three confirmation datasets were used in this study. The Cancer Genome Atlas (TCGA) set was used in two steps: first, the whole TCGA dataset was used to obtain the labels of the progression-risk classes; second, the samples were split (60%/40%) into training and testing sets to train an SVM model (see the “Data partitioning and robustness evaluation” subsection). The predictive accuracy of the DL-based model was then tested in the three additional confirmation sets.

TCGA set. Multi-omics HNSCC data were obtained from the TCGA portal (https://tcga-data.nci.nih.gov/tcga/), including RNA-Seq data (UNC IlluminaHiSeq-RNASeqV2; level 3), miRNA-Seq data (BCGSC Illumina HiSeq_miRNASeq; level 3), DNA methylation data (JHU-USC Human Methylation 450; level 3) and clinical features. The DNA methylation data were preprocessed as described previously [9].

Confirmation set 1 (microarray gene expression): A total of 86 oral cancer samples with survival and recurrence information were obtained from the E-GEOD-26549 [11] Affymetrix high-throughput GeneChip Human Gene 1.0 microarray dataset (https://www.ebi.ac.uk/arrayexpress/arrays/A-AFFY-141/?ref=E-GEOD-26549). Robust Multi-array Average (RMA)-calculated signal intensity values were used for analysis in this study.

Confirmation set 2 (microarray gene expression): A total of 109 samples of primary laryngeal cancer with survival and recurrence information were obtained from the E-GEOD-27020 [12] Affymetrix high-throughput GeneChip HG-U133A microarray dataset. RMA-calculated signal intensity values were provided by the authors.

Confirmation set 3 (microarray gene expression): A total of 270 HNSCC samples with survival information were selected from the E-GEOD-65858 [13] Illumina high-throughput HumanHT-12 V4.0 expression beadchip microarray dataset (https://www.ebi.ac.uk/arrayexpress/arrays/A-GEOD-10558/?ref=E-GEOD-65858). The data were used for analysis after $\log_{2}$ transformation and normalization.

2.2 Features transformation using a DL framework

In the present study, the preprocessed TCGA HNSCC 3-omics data of 368 samples were used as the input for the autoencoder framework. The three matrices were stacked to construct a single matrix. The autoencoder DL model was applied as previously reported [9]. New features were produced from the omics data by the bottleneck layer of the autoencoder.

2.3 Transformed features selection and K-means clustering

A hundred new features were extracted from the bottleneck layer of the autoencoder, and each was then tested in a univariate Cox proportional hazards (Cox-PH) model. The significant features (log-rank $p$-value $<$ 0.05) were used to cluster the samples with the K-means clustering algorithm. The silhouette index [14] and the Calinski-Harabasz criterion [15] were used to decide the optimal number of clusters. K-means was run with the NbClust package [16] in R (https://cran.r-project.org/web/packages/NbClust/index.html).
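To illustrate how a cluster-validity index can guide the choice of the number of clusters, the following minimal Python sketch computes the mean silhouette index for two candidate labelings of toy points. It is not the NbClust implementation used in this study; the function name `mean_silhouette` and the toy data are assumptions made for illustration only.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette width over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette is taken as 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical toy data: two well-separated groups (not study data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 2, 1, 1, 1]  # splits the first group artificially

s2 = mean_silhouette(points, two_clusters)
s3 = mean_silhouette(points, three_clusters)
best_k = 2 if s2 > s3 else 3
```

The labeling with the larger mean silhouette is preferred; in practice the labelings would come from K-means runs with different values of K on the Cox-PH-selected features.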

2.4 Data partitioning and robustness evaluation

The TCGA set was split by a cross-validation (CV)-like procedure as reported elsewhere [9]. First, the 368 samples from the TCGA set were randomly partitioned into 5 folds, of which 2 were used as the testing set and the other 3 as the training set (a 60/40 split). Taking every choice of 2 testing folds out of 5 gave 10 new combinations. For each combination, a distinct autoencoder and classifier were built on the training set and then used to predict the labels of the test folds. Finally, an autoencoder based on all TCGA samples was used to infer the labels for the confirmation sets.
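The partitioning scheme can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `cv_like_splits`, the random seed and the use of integer sample identifiers are hypothetical, and the original analysis was carried out in R.

```python
import random
from itertools import combinations

def cv_like_splits(sample_ids, n_folds=5, n_test_folds=2, seed=0):
    """Return (train, test) pairs for every choice of test folds."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled samples into n_folds roughly equal folds.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for test_idx in combinations(range(n_folds), n_test_folds):
        test = [s for i in test_idx for s in folds[i]]
        train = [s for i in range(n_folds) if i not in test_idx for s in folds[i]]
        splits.append((train, test))
    return splits

# 368 samples, 5 folds, 2 test folds per split: C(5, 2) = 10 combinations.
splits = cv_like_splits(range(368))
```

Each split holds out roughly 40% of the samples, so a model trained on the remaining 60% never sees the samples it is evaluated on.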

2.5 Supervised classification

A supervised classification model was constructed with the SVM algorithm using the labels derived from K-means clustering. To predict on the TCGA 3-omics test data, an SVM classifier was built on the top 100 mRNA, 100 methylation, and 30 miRNA features selected by analysis of variance (ANOVA). In addition, another SVM classifier was developed with the corresponding top 100 mRNAs to predict on the three confirmation sets.

For a confirmation dataset from a specific omic layer, the features shared by that cohort and the corresponding omic layer of the TCGA set were chosen. Specifically, the features shared with the TCGA set numbered 12,849 for E-GEOD-26549, 10,626 for E-GEOD-27020, and 11,997 for E-GEOD-65858. Median scaling was applied to the samples of the TCGA set and of all three confirmation datasets. All operations were carried out in R. The penalizedSVM package [17, 18] (https://cran.r-project.org/web/packages/penalizedSVM/index.html) was used to conduct a grid search with 5-fold CV to identify the optimal hyperparameters of the SVM model(s).

2.6 Three metrics for evaluating performance of models

Concordance index (C-index): C-index refers to the fraction of all pairs of individuals with correctly ordered survival times [19] and is based on Harrell C statistics [20]. A score around 0.70 suggests a good model, whereas a score around 0.50 implies a random background. C-index was computed using concordance.index function in R survcomp package [21].
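The definition above can be made concrete with a short Python sketch of a Harrell-type C-index. It is an illustration only, not the `concordance.index` routine of survcomp, which may treat ties and censoring differently; the function name and the toy data are assumptions.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs whose predicted risks are correctly ordered.

    A pair (i, j) is comparable when sample i has an observed event and a
    shorter time than sample j. Tied risks count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical toy data: PFS time in months, event flag (1 = progression),
# and a two-class risk label (1 = high-risk subtype).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
c_two_class = concordance_index(times, events, [1, 1, 0, 0])
c_perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
c_reversed = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
```

A perfectly ordered risk score gives 1, a perfectly reversed one gives 0, and a coarse two-class label falls in between because samples within the same class are tied.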

Log-rank $P$ value of Cox-PH regression. The difference in PFS between two groups was analyzed with Kaplan-Meier estimates together with log-rank statistics. Survival analysis was performed using the Cox-PH model in the R survival package (https://cran.r-project.org/web/packages/survival/index.html).
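For readers unfamiliar with the log-rank statistic, the sketch below implements the standard two-group test in plain Python. It is a didactic sketch, not the R survival package code; the function name `logrank_test` and the toy data are assumptions.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p-value) with 1 df."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers at risk and numbers of events in each group at time t.
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical toy data: group A progresses early, group B late.
early = [2, 3, 4, 5, 6, 7, 8, 9]
late = [20, 22, 24, 26, 28, 30, 32, 34]
chi2_diff, p_diff = logrank_test(early, [1] * 8, late, [1] * 8)
chi2_same, p_same = logrank_test([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
```

Clearly separated groups give a small p-value, whereas two identical groups give a statistic of zero.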

Brier score. In survival analysis, the Brier score measures the mean squared difference between the observed and the predicted survival at a given time [22]. It was calculated with the R survcomp package and ranges from 0 to 1. The higher the score, the less accurate the model.
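A simplified version of the Brier score at a fixed time point can be sketched as follows. This sketch assumes that samples censored before the evaluation time are simply skipped, whereas full implementations such as survcomp reweight for censoring; the function name and the toy values are hypothetical.

```python
def brier_score_at(t, times, events, predicted_survival):
    """Mean squared error between observed status at time t and predicted S(t)."""
    sq_errors = []
    for time, event, s_hat in zip(times, events, predicted_survival):
        if time <= t and not event:
            continue  # censored before t: status at t unknown, skipped here
        still_free = 1.0 if time > t else 0.0
        sq_errors.append((still_free - s_hat) ** 2)
    if not sq_errors:
        raise ValueError("no evaluable samples at time t")
    return sum(sq_errors) / len(sq_errors)

# Hypothetical toy data: PFS time in months and event flag (1 = progression).
times = [6, 12, 30, 40]
events = [1, 0, 1, 0]
bs_perfect = brier_score_at(24, times, events, [0.0, 0.5, 1.0, 1.0])
bs_uninformative = brier_score_at(24, times, events, [0.5, 0.5, 0.5, 0.5])
```

A perfect prediction scores 0, and a prediction of 0.5 for every patient scores 0.25, which is why values well below 0.25 indicate a useful model.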

2.7 Comparison of the DL framework with two alternative approaches

The DL framework was compared with two other approaches. In the first approach, principal component analysis (PCA) was conducted using the same number (100) of principal components. A univariate Cox-PH model was then applied to the PCA features, using the same Cox-PH procedure as in the DL-based model, to analyze the associations of the PCA features with PFS. In the second approach, single-variate Cox-PH models ranked by C-index score were used to choose the top 32 features from all 3-omics features. In both approaches, K-means clustering of the samples was then performed.

2.8 Clinical covariate analysis

Fisher's exact test was employed to study the associations of the two identified PFS subtypes (S1 and S2) with other clinical factors, including age, sex, perineural invasion (PNI), radiotherapy, margin status, HPV status, group stage, histologic grade and alcohol. Uni- and multi-variate Cox regression analyses were also conducted to investigate the relations of the clinical features and the prognosis model with PFS.

Figure 1.

Progression-free survival differences between the two subtypes (S1 and S2) for the TCGA and three confirmation sets: TCGA cohort (A), E-GEOD-26549 cohort (B), E-GEOD-65858 cohort (C), E-GEOD-27020 cohort (D).

2.9 Differential expression analysis

Differential expression analysis of mRNA expression, miRNA expression and gene methylation between the two PFS subtypes was carried out with the DESeq2 package [23] (http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html). The strict cutoff was set at $\log_{2}$ FC $>$ 2 for mRNAs and at $\log_{2}$ FC $>$ 1 for miRNAs.

For methylation data, beta values were transformed into M values with the lumi package [24] (http://www.bioconductor.org/packages/release/bioc/html/lumi.html) in R. Differentially methylated genes (DMGs) were screened between the two subtypes (Benjamini-Hochberg corrected $p<$ 0.05). The filtering threshold was set at an average M value difference $>$ 1.
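The beta-to-M transformation is the logit of the beta value on a base-2 scale, $M = \log_{2}(\beta / (1 - \beta))$. The Python sketch below illustrates it together with the mean-difference filter; it is not the lumi code, and the clipping constant `eps`, the function names and the example beta values are assumptions.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value (0..1) to an M value."""
    b = min(max(beta, eps), 1.0 - eps)  # clip to avoid log of 0 or division by 0
    return math.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transformation, from M value back to beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)

def mean_m_difference(betas_s1, betas_s2):
    """Difference of mean M values between two groups of samples."""
    m1 = [beta_to_m(b) for b in betas_s1]
    m2 = [beta_to_m(b) for b in betas_s2]
    return sum(m1) / len(m1) - sum(m2) / len(m2)

# Hypothetical gene: beta about 0.8 in one subtype and 0.5 in the other.
diff = mean_m_difference([0.8, 0.8], [0.5, 0.5])
passes_filter = abs(diff) > 1
```

A beta value of 0.5 maps to M = 0 and a beta value of 0.8 maps to M = 2, so the hypothetical gene above passes the filter.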

Table 1
Performance evaluation of the SVM classification model on training and testing sets in TCGA cohort

Datasets	C-index (mean $\pm$ SD)	Brier score (mean $\pm$ SD)	Log-rank $P$ (geo.mean)
3-omics training	0.71 $\pm$ 0.05	0.22 $\pm$ 0.02	0.0008
3-omics testing	0.73 $\pm$ 0.07	0.22 $\pm$ 0.02	0.0024
RNA only	0.71 $\pm$ 0.09	0.22 $\pm$ 0.02	0.0040
miRNA only	0.69 $\pm$ 0.07	0.21 $\pm$ 0.02	0.0239
Methylation only	0.66 $\pm$ 0.07	0.22 $\pm$ 0.02	0.0493

SD, standard deviation.

Table 2

Performance of the model in the three confirmation sets

Datasets	Omics data type	Samples ( $N$ )	C-index	Brier score	Log-rank $P$
E-GEOD-26549	mRNA microarray	86	0.79	0.17	0.0033
E-GEOD-27020	mRNA microarray	109	0.72	0.15	0.0414
E-GEOD-65858	mRNA microarray	270	0.63	0.17	0.0494

2.10 Function analysis

Gene ontology (GO) function and pathway enrichment analyses were carried out with the Python goatools package [25] and the KOBAS 3.0 interface [26], respectively. Significance for a pathway or a GO biological process (BP) term was defined as a $p$-value $<$ 0.05.

3. Results

3.1 Identification of two PFS subtypes for HNSCC by DL approach

A total of 368 HNSCC samples with combined RNA-Seq, miRNA-Seq, and DNA methylation data were acquired from TCGA. Following preprocessing, 14,282 genes from RNA-seq, 220 miRNAs from miRNA-seq and 19,883 genes from DNA methylation data were obtained and used as input features for the autoencoder DL model, which stacked the three types of omics features together.

In univariate Cox-PH regression analysis, 32 of the 100 new features derived from the bottleneck hidden layer of the DL-based model were identified as significantly associated with survival (log-rank $P<$ 0.05). Subsequently, K-means clustering of the 32 features was conducted. When $K=$ 2, the best scores were achieved for the silhouette index and the Calinski-Harabasz criterion. As shown in Fig. 1A, PFS was significantly different between the two sub-clusters derived from TCGA data (log-rank $P = 2.36 \times 10^{-5}$). Therefore, two classes were determined to be optimal and used as labels to develop a prognostic model for PFS by the SVM algorithm together with CV.

As mentioned in Materials and methods, the 368 TCGA samples were split into 10 fold combinations with a 60/40 ratio of training set to testing set. As shown in Table 1, the SVM classification model displayed a high C-index (0.71 $\pm$ 0.05), a low Brier score (0.22 $\pm$ 0.02) and a significant difference in PFS time between the two subtypes (log-rank $P=$ 0.0008) for the training set. Similar results (C-index $=$ 0.73 $\pm$ 0.07, Brier score $=$ 0.22 $\pm$ 0.02, log-rank $P=$ 0.0024) were observed for the 3-omics held-out testing set. Furthermore, according to these metrics, this multi-omics model performed well on each single-omic layer of data (RNA, miRNA or methylation data only). These observations suggest that this stratification model is reliable for predicting PFS in HNSCC patients.

3.2 Predictive capability of the classification model was successfully validated in three confirmation sets

The capability of this model to stratify HNSCC patients into two PFS subtypes was tested in three independent confirmation sets (E-GEOD-26549, E-GEOD-27020, and E-GEOD-65858). The proportions of common top features chosen by ANOVA and then used for SVM classification were as follows: E-GEOD-26549 (95%), E-GEOD-27020 (54%) and E-GEOD-65858 (92%). For E-GEOD-26549 with 86 samples, the two survival subgroups had a good C-index of 0.75, a low Brier score of 0.16, and a log-rank $P$ value of 0.0033 (Fig. 1C, Table 2). For both E-GEOD-65858 and E-GEOD-27020, the model also displayed decent performance (Fig. 1B and D, Table 2). We did not find miRNA data or methylation data of HNSCC samples with corresponding recurrence information in publicly accessible repositories with which to test this model.

3.3 The DL-based model performed better than two alternative methods

As described in Materials and methods, the DL-based model was compared with PCA and univariate Cox-PH-based models in terms of C-index, Brier score and log-rank $P$ value. Table 3 shows that PCA failed to categorize patients into two subgroups with significantly different PFS time (log-rank $P=$ 0.068). Notably, the DL-based model generated the most significant log-rank $P$ value ($2.36 \times 10^{-5}$) and the highest C-index (0.715) in comparison with the other two methods (Table 3).

Table 3
Comparison of performances of autoencoder and two alternative approaches

	C-index	Brier score	Log-rank $P$
PCA-based model	0.601	0.222	0.0680
Cox-PH-based method	0.680	0.204	0.0001
Autoencoder framework	0.715	0.215	$<$ 0.0001

3.4 Identification of prognostic clinical features

Results of Fisher's exact test showed that PNI was significantly related to PFS ($p=$ 0.011, Table 4). In univariate Cox regression analysis, PNI ($p=$ 0.005), margin status ($p=$ 0.001) and the two-class classification model ($p = 2.36 \times 10^{-5}$) were significantly associated with PFS (Table 5). Furthermore, in multi-variate Cox regression analysis, the two-class classification model ($p=$ 0.049) and margin status ($p=$ 0.003) were identified as independent predictors of PFS (Table 6).

Table 4
Results of Fisher's exact test for clinical features

Clinical variable	Class 1 ( $N=$ 235)	Class 2 ( $N=$ 133)	$P$ -value
Age	60.61 $\pm$ 10.65	59.69 $\pm$ 12.18	0.410
Sex (female/male)	58/177	29/104	0.610
Perineural invasion (YES/NO)	60/109	53/49	0.011
Radiotherapy (YES/NO)	58/25	46/21	1.000
Stage (I/II/III/IV)	12/41/40/137	4/22/31/68	0.361
Margin status (positive/negative)	22/158	19/89	0.225
HPV status (positive/negative)	16/44	2/14	0.330
Histologic grade (G1/G2/G3/G4)	30/137/59/2	17/83/30/0	0.794
Alcohol (YES/NO)	165/64	92/40	0.632
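As an illustration of the test behind the categorical rows of Table 4, the sketch below implements a two-sided Fisher's exact test for a 2x2 table in plain Python and applies it to the perineural invasion and sex counts. It is a sketch under the usual convention of summing all tables no more probable than the observed one; statistical packages may use a slightly different two-sided convention, so the last digit can differ from the tabulated values. The function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Counts from Table 4: rows are Class 1 and Class 2.
p_pni = fisher_exact_two_sided(60, 109, 53, 49)   # PNI YES/NO
p_sex = fisher_exact_two_sided(58, 177, 29, 104)  # female/male
```

With these counts, perineural invasion differs between the two classes at the 0.05 level, whereas sex does not, in line with Table 4.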

Table 5

Summary of results from univariate Cox regression analysis

Feature	Coefficient	$P$ -value	Lower 95%	Upper 95%
Age	0.169	0.329	0.844	1.660
Sex (female/male)	$-$ 0.057	0.773	0.644	1.387
Perineural_Invasion (YES/NO)	0.583	0.005	1.195	2.685
Radiotherapy (YES/NO)	0.324	0.290	0.758	2.520
Stage (I $+$ II/III $+$ IV)	0.316	0.164	0.879	2.140
Margin_Status (positive/negative)	0.792	0.001	1.412	3.451
Hpv_status (positive/negative)	$-$ 0.719	0.246	0.144	1.644
M_Stage (M0/M1)	1.241	0.217	0.481	24.860
N_Stage (N0/N1 $+$ N2 $+$ N3)	$-$ 0.200	0.255	0.580	1.155
T_Stage (T1 $+$ T2/T3 $+$ T4)	0.345	0.073	0.968	2.061
Histologic_Grade (G1 $+$ G2/G3 $+$ G4)	$-$ 0.175	0.388	0.564	1.249
Alcohol (YES/NO)	0.352	0.079	0.960	2.106
Two-class model	0.724	$<$ 0.0001	1.475	2.885
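A Cox regression coefficient is the natural logarithm of the hazard ratio (HR), so the coefficients in Table 5 can be converted with HR $= e^{\text{coefficient}}$. The short Python sketch below does this for three rows and checks the result against the tabulated bounds. Reading the "Lower 95%" and "Upper 95%" columns as confidence limits of the HR is an assumption, although it is consistent with the numbers; the function names are hypothetical.

```python
import math

def hazard_ratio(coefficient):
    """Hazard ratio implied by a Cox regression coefficient."""
    return math.exp(coefficient)

def ci_midpoint_log_scale(lower, upper):
    """Geometric mean of the bounds; equals the HR for a log-symmetric interval."""
    return math.sqrt(lower * upper)

# (coefficient, lower 95%, upper 95%) copied from Table 5.
rows = {
    "Perineural_Invasion": (0.583, 1.195, 2.685),
    "Margin_Status": (0.792, 1.412, 3.451),
    "Two-class model": (0.724, 1.475, 2.885),
}
for name, (coef, lower, upper) in rows.items():
    hr = hazard_ratio(coef)
    mid = ci_midpoint_log_scale(lower, upper)
    print(f"{name}: HR = {hr:.2f}, CI midpoint on log scale = {mid:.2f}")
```

Under this reading, a coefficient of 0.724 for the two-class model corresponds to roughly a twofold higher hazard of progression in one subtype than in the other.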

Table 6

Results of multi-variate Cox regression analysis

Variable	Coefficient	$P$ -value	Lower 95%	Upper 95%
Two-class model	0.460	0.049	1.002	2.504
Perineural_Invasion	0.353	0.127	0.905	2.238
Margin_Status	0.791	0.003	1.307	3.723

3.5 Identification and function annotation of DEGs between two subgroups

A total of 10 up-regulated and 338 down-regulated DEGs were selected between the two identified PFS subgroups (log2 fold change $>$ 1 and FDR $<$ 0.05, Fig. 2). These genes were significantly enriched in several immune-related GO BP terms, such as adaptive immune response, immune system process, and immune response (Fig. 3). Moreover, various pathways were enriched for the DEGs between the two identified subgroups, such as the primary immunodeficiency pathway, cell adhesion molecules (CAMs) pathway, B cell receptor signaling pathway and leukocyte transendothelial migration pathway (Table 7). The primary immunodeficiency and B cell receptor signaling pathways shared two common genes: CD40 ligand (CD40LG) and CD79 antigen (CD79A). Both the cell adhesion molecules (CAMs) and leukocyte transendothelial migration pathways were significantly enriched with claudin (CLDN) 3, CLDN10, CLDN20 and vascular cell adhesion molecule (VCAM) 1. Moreover, 23 differentially expressed miRNAs (DEMs) and 55 DMGs between the two subtypes were uncovered.

Table 7
A list of significant signaling pathways

Pathway	$P$ -value	Count	Genes
Primary immunodeficiency	0.0005	4	CD40LG, CD19, TNFRSF13B, CD79A
Cell adhesion molecules (CAMs)	0.0006	7	CLDN10, CLDN3, VCAM1, NRXN1, CD40LG, CD22, CLDN20
Drug metabolism – cytochrome P450	0.0006	5	ADH1C, GSTA1, FMO3, CYP2E1, UGT1A7
Intestinal immune network for IgA production	0.0013	4	CD40LG, PIGR, TNFRSF13B, TNFRSF17
Hematopoietic cell lineage	0.0017	5	CD19, CD22, FCER2, MS4A1, CR2
Linoleic acid metabolism	0.0037	3	PLA2G2D, ALOX15, CYP2E1
Retinol metabolism	0.0041	4	ADH1C, CYP26A1, UGT1A7, ALDH1A1
Amphetamine addiction	0.0046	4	PPP1R1B, CACNA1D, DDC, GRIN2A
Serotonergic synapse	0.0053	5	HTR3A, DDC, CACNA1D, ALOX15, CYP4X1
Glutamatergic synapse	0.0057	5	GRIK3, CACNA1D, DLGAP1, GLS2, GRIN2A
Metabolism of xenobiotics by cytochrome P450	0.0058	4	ADH1C, GSTA1, CYP2E1, UGT1A7
B cell receptor signaling pathway	0.0064	4	CD19, CD22, CR2, CD79A
Chemical carcinogenesis	0.0090	4	ADH1C, GSTA1, CYP2E1, UGT1A7
Salivary secretion	0.0122	4	LYZ, AQP5, DMBT1, MUC5B
Cocaine addiction	0.0130	3	PPP1R1B, DDC, GRIN2A
Neuroactive ligand-receptor interaction	0.0189	7	CNR2, GABRP, GRIN2A, GCGR, TACR1, CNR1, GRIK3
Arginine biosynthesis	0.0213	2	NOS2, GLS2
Arachidonic acid metabolism	0.0255	3	PLA2G2D, ALOX15, CYP2E1
Leukocyte transendothelial migration	0.0301	4	CLDN3, VCAM1, CLDN20, CLDN10
Dopaminergic synapse	0.0375	4	PPP1R1B, CACNA1D, DDC, GRIN2A
Cytokine-cytokine receptor interaction	0.0441	6	TNFRSF13B, CCL19, CXCR5, CCR4, CD40LG, TNFRSF17
cAMP signaling pathway	0.0461	5	PPP1R1B, CFTR, CACNA1D, GRIN2A, HCAR1
Tyrosine metabolism	0.0475	2	DDC, ADH1C
ECM-receptor interaction	0.0479	3	SV2B, COL6A5, COL4A4
GABAergic synapse	0.0550	3	GABRP, GLS2, CACNA1D

Count means the number of genes significantly enriched in a pathway.

Figure 2.

A heatmap of DEGs across two PFS classes identified by the DL-based model. The color bar corresponds to expression levels of DEGs.

Figure 3.

Significant GO BP terms for the DEGs between two PFS subgroups. -log(p-value) is encoded in a color bar for visualization. Gene count indicates the number of genes significantly enriched in a GO term. Gene ratio indicates the ratio of the enriched genes in each gene set.

4. Discussion

Accumulated genomic alterations play a critical role in HNSCC progression [27]. The present study integrated RNA-Seq, miRNA-Seq, and DNA methylation data of 368 HNSCC patients from TCGA to build a prognosis model for PFS using a DL approach. The DL method used in this study was an autoencoder that reconstructed the multi-omics data to produce 100 new features representing the data. There is evidence for the utility of autoencoders in unveiling the organization of the transcriptomic machinery [28]. Danaee et al. identified a set of genes as biomarkers for breast cancer by using an autoencoder [29]. More recently, an autoencoder-based model has been reported to be efficient and accurate in predicting lung cancer, stomach cancer and breast cancer from RNA-seq data [30]. In the present study, the autoencoder-based model identified two subgroups of patients with significantly different PFS. CV results suggested that this model was robust in classifying patients into two PFS subgroups. Moreover, the robustness and reliability of the model were successfully validated in three independent confirmation sets. Furthermore, according to C-index and log-rank $P$ values, the DL-based model was superior to PCA or individual Cox-PH-based methods in terms of accuracy and efficiency. Additionally, multi-variate Cox regression analysis showed that the two-class model was an independent predictor of PFS in HNSCC. These results indicate that the DL-based model is effective in differentiating patients who are more likely to have disease progression from those who are less likely to.

Clinical features are widely recognized to be closely related to prognosis. In the present study, the clinical feature PNI was significantly related to PFS of HNSCC patients, an association previously reported in two European studies [Goepfert H, 1984; Fagan JJ, 1998]. PNI, the process of neoplastic invasion of nerves, has emerged as an important pathologic feature of many malignancies, as well as a marker of poor outcome and a harbinger of decreased survival [Liebig C, 2009]. The present study may therefore provide new ideas for the effective clinical treatment of HNSCC.

HNSCC has been regarded as an immune-suppressive disease characterized by profound immune defects [31]. Immunotherapy has been increasingly recognized as a promising novel therapy against HNSCC, calling for intense investigation into the biological mechanisms of the immune process [32]. This study uncovered 348 DEGs between the two identified PFS subgroups, which were functionally related to a number of immune-related GO terms, such as adaptive immune response, immune system process and immune response. Moreover, a total of 25 significant pathways were found for the 348 DEGs between the two identified PFS subgroups; ranked by the number of genes involved, the top three pathways were cell adhesion molecules (CAMs), neuroactive ligand-receptor interaction and cytokine-cytokine receptor interaction. Changes in the expression or function of CAMs have been reported to be implicated in all steps of tumor progression [Makrilia N, 2009], and neuroactive ligand-receptor interaction was one of the most significant pathways for pancreatic cancer risk [Wei P, 2012].

It is worth noting that some DEGs were involved in multiple pathways, such as CD40LG, CD79A, CLDN3, CLDN10, CLDN20 and VCAM1. CD40L is a member of the TNF superfamily and regulates the immune response through activation of CD40-expressing T cells [33]. CD79A protein, namely B-cell antigen receptor complex-associated protein alpha chain, binds to CD79b to form a dimer associated with membrane-bound immunoglobulin in B cells and is involved in B cell development and function [34]. In the present study, CD40LG and CD79A were significantly enriched in both the primary immunodeficiency and B cell receptor signaling pathways. Furthermore, both the cell adhesion molecules (CAMs) and leukocyte transendothelial migration pathways involved CLDN3, CLDN10, CLDN20 and VCAM1 in the current study. CLDN3, CLDN10 and CLDN20 belong to the family of claudin proteins, which are key components of tight junctions. Emerging studies have established that deregulated claudins lead to disrupted tight junctions, thus contributing to cancer development and progression [35, 36]. The polarity and distribution of claudins have been found to be altered in HNSCC, and the expression of claudins is associated with patient survival [37]. VCAM1 acts as a cell adhesion molecule and is relevant to tumor growth, metastasis and angiogenesis [38]. These findings further support the pivotal roles of the immune system and cell adhesion in HNSCC progression and present CD40LG, CD79A, CLDN3, CLDN10, CLDN20 and VCAM1 as candidate therapeutic targets for HNSCC.

5. Conclusion

This study develops a DL-based prognostic model that robustly discriminates two subpopulations of HNSCC patients with significantly different PFS. Several immune and cell adhesion-related pathways are involved in cancer progression. CD40LG, CD79A, CLDN3, CLDN10, CLDN20 and VCAM1 may be critical genes in HNSCC. This study provides further insight into the molecular mechanisms underlying HNSCC progression. Further studies are needed to validate the feasibility of this prognostic model.

References

1. Marur and Forastiere A.A., Head and neck squamous cell carcinoma: Update on epidemiology, diagnosis, and treatment, Mayo Clinic Proceedings 91 (2016), 386.

2. Torre L.A., Bray, Siegel R.L., Ferlay, Lortet-Tieulent and Jemal, Global cancer statistics, 2012, CA: A Cancer Journal for Clinicians 65 (2015), 69–90.

3. Thariat, Vignot, Lapierre, Falk A.T., Guigay, Van O.E. and Milano, Integrating genomics in head and neck cancer treatment: Promises and pitfalls, Critical Reviews in Oncology/Hematology 95 (2015), 397–406.

4. Ferris R.L., Blumenschein, Fayette, Guigay, Colevas A.D., Licitra, Harrington, Kasper, Vokes E.E. and Even, Nivolumab for recurrent squamous-cell carcinoma of the head and neck, New England Journal of Medicine 375 (2016), 1856.

5. Pignon J.P., le Maitre, Maillard and Bourhis, Meta-analysis of chemotherapy in head and neck cancer (MACH-NC): An update on 93 randomised trials and 17,346 patients, Radiother Oncol 92 (2009), 4–14.

6. Lee, Grosse, Ranganath and Ng A.Y., Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM (2009), 609–616.

7. Liu, Ren and Shu, PEDLA: Predicting enhancers with a deep learning-based algorithmic framework, Scientific Reports 6 (2016), 28517.

8. Tan, Ung, Cheng and Greene C.S., Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders, Pacific Symposium on Biocomputing 20 (2015), 132.

9. Chaudhary, Poirion O.B. and Garmire L.X., Deep Learning based multi-omics integration robustly predicts survival in liver cancer, Clinical Cancer Research 24 (2017), clincanres.0853.2017.

10. Ribeiro I.P., Caramelo, Esteves, Menoita, Marques, Barroso, Miguéis, Melo J.B. and Carreira I.M., Genomic predictive model for recurrence and metastasis development in head and neck squamous cell carcinoma patients, Sci Rep 7 (2017).

11. Saintigny, Zhang, Fan Y.H., Elnaggar A.K., Papadimitrakopoulou V.A., Feng, Lee J.J., Kim E.S., H.W. and Mao, Gene expression profiling predicts the development of oral cancer, Cancer Prevention Research 4 (2011), 218–229.

12. Fountzilas, Kotoula, Angouridakis, Karasmanis, Wirtz R.M., Eleftheraki A.G., Veltrup, Markou, Nikolaou, Pectasides and Fountzilas, Identification and validation of a multigene predictor of recurrence in primary laryngeal cancer, PLoS One 8 (2013), e70429.

13. Wichmann, Rosolowski, Krohn, Kreuz, Boehm, Reiche, Scharrer, Halama, Bertolini and Bauer, The role of HPV RNA transcription, immune response-related gene expression and disruptive TP53 mutations in diagnostic and prognostic profiling of head and neck cancer, International Journal of Cancer 137 (2015), 2846–2857.

14. Rousseeuw P.J., Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational & Applied Mathematics 20 (1999), 53–65.

15. Caliński and Harabasz, A dendrite method for cluster analysis, Communications in Statistics 3 (1974), 1–27.

16. Charrad, Ghazzali, Boiteau and Niknafs, NbClust package: Finding the relevant number of clusters in a dataset, UseR! 2012 (2012).

17. Hao H.Z., Ahn, Lin and Park, Gene selection using support vector machines with non-convex penalty, Bioinformatics 22 (2006), 88–95.

18. Fung G.M. and Mangasarian O.L., A feature selection newton method for support vector machine classification, Computational Optimization & Applications 28 (2004), 185–202.

19. Steck, Krishnapuram, Dehing-oberije, Lambin and Raykar V.C., On ranking in survival analysis: Bounds on the concordance index, in: Advances in Neural Information Processing Systems, (2008), 1209–1216.

20. H.F., Lee K.L. and Mark D.B., Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors, Statistics in Medicine 15 (1996), 361–387.

21. Schröder M.S., Culhane A.C., Quackenbush and Haibekains, survcomp: An R/Bioconductor package for performance assessment and comparison of survival models, Bioinformatics 27 (2011), 3206–3208.

22. Zhang, Akinyemiju, Ojesina A.I., Buckhaults, Liu and Yi, Pathway-structured predictive model for cancer survival prediction: A two-stage approach, Genetics 205 (2017), 89–100.

23. Love M.I., Huber and Anders, Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2, Genome Biology 15 (2014), 550.

24. Ching, Song M.A., Tiirikainen, Molnar, Berry, Towner and Garmire L.X., Genome-wide hypermethylation coupled with promoter hypomethylation in the chorioamniotic membranes of early onset pre-eclampsia, Molecular Human Reproduction 20 (2014), 885–904.

25. Klopfenstein D.V., Zhang, Pedersen B.S., Ramírez, Vesztrocy A.W., Naldi, Mungall C.J., Yunes J.M., Botvinnik and Weigel, GOATOOLS: A python library for gene ontology analyses, Scientific Reports 8 (2018).

26. Xie, Mao, Huang, Ding, Dong, Kong, Gao C.Y. and Wei, KOBAS 2.0: A web server for annotation and identification of enriched pathways and diseases, Nucleic Acids Research 39 (2011), 316–322.

27. Ahn J.B., Chung W.B., Maeda, Shin S.J., Kim H.S., Chung H.C., Kim N.K. and Issa J.P., DNA methylation predicts recurrence from resected stage III proximal colon cancer, Cancer 117 (2011), 1847–1854.

28. Chen, Cai, Chen and Lu, Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model, BMC Bioinformatics 17(1) (2016), 9.

29. Danaee, Ghaeini and Hendrix D.A., A deep learning approach for cancer detection and relevant gene identification, Pac Symp Biocomput 22 (2017), 219–229.

30. Xiao, Lin and Zhao, A semi-supervised deep learning method based on stacked sparse auto-encoder for cancer prediction using RNA-seq data, Computer Methods and Programs in Biomedicine (2018).

31. Economopoulou, Perisanidis, Giotakis E.I. and Psyrri, The emerging role of immunotherapy in head and neck squamous cell carcinoma (HNSCC): Anti-tumor immunity and clinical applications, Ann Transl Med 4 (2016), 173.

32. Moy J.D., Moskovitz J.M. and Ferris R.L., Biological mechanisms of immune escape and implications for immunotherapy in head and neck squamous cell carcinoma, European Journal of Cancer 76 (2017), 152.

33.

Grewal

I.S.

and Flavell

R.A.

, CD40 and CD154 in cell-mediated immunity, Annual Review of Immunology 16 (1998), 111–135.

34.

Seda

and Mraz

, B-cell receptor signalling and its crosstalk with other pathways in normal and malignant cells, European Journal of Haematology 94 (2015), 193–205.

35.

Osanai

Takasawa

Murata

and Sawada

, Claudins in cancer: Bench to bedside, Pflügers Archiv – European Journal of Physiology 469 (2016), 1–13.

36.

TabariãS̈

and Siegel

P.M.

, The role of claudins in cancer metastasis, Oncogene 36 (2017), 1176–1190.

37.

Nelhűbel

G.A.

Károly

Szabó

Lotz

Kiss

Tóvári

and Kenessey

, The prognostic role of claudins in head and neck squamous cell carcinomas, Pathology & Oncology Research 20 (2014), 99–106.

38.

Schlesinger

and Bendas

, Vascular cell adhesion molecule-1 (VCAM-1) – An increasing insight into its role in tumorigenicity and metastasis, International Journal of Cancer 136 (2015), 2504–2514.
