Abstract
Background:
Breast cancer (BC) has been reported as one of the most common cancers diagnosed in females throughout the world. Survival rate of BC patients is affected by metastasis. So, exploring its underlying mechanisms and identifying related biomarkers to monitor BC relapse/recurrence using new statistical methods is essential. This study investigated the high-dimensional gene-expression profiles of BC patients using penalized additive hazards regression models.
Methods:
A publicly available dataset related to the time to metastasis in BC patients (GSE2034) was used. There was information of 22 283 genes expression profiles related to 286 BC patients. Penalized additive hazards regression models with different penalties, including LASSO, SCAD, SICA, MCP and Elastic net were used to identify metastasis related genes.
Results:
Five regression models with penalties were applied in the additive hazards model and jointly found 9 genes including SNU13, CLINT1, MAPK9, ABCC5, NKX3-1, NCOR2, COL2A1, and ZNF219. According the median of the prognostic index calculated using the regression coefficients of the penalized additive hazards model, the patients were labeled as high/low risk groups. A significant difference was detected in the survival curves of the identified groups. The selected genes were examined using validation data and were significantly associated with the hazard of metastasis.
Conclusion:
This study showed that MAPK9, NKX3-1, NCOR1, ABCC5, and CD44 are the potential recurrence and metastatic predictors in breast cancer and can be taken into account as candidates for further research in tumorigenesis, invasion, metastasis, and epithelial-mesenchymal transition of breast cancer.
Introduction
Breast cancer (BC) is considered as one of the most prevalent types of cancers diagnosed in women all over the world. BC has also been reported as the second leading cause of mortalities due to cancers in the United States in 2021. 1 The patients with BC are at risk of local recurrence as well as axillary lymph node metastasis despite early detection of the disease and successful surgical resection.2-4 While the 5-year survival rate of BC patients at early stages is over 90%, this rate declines to <30% after the occurrence of metastasis. 5
Early diagnosis of metastasis after surgery of BC patients is urgent for good prognosis, 6 because advanced BC with distant metastasis is associated with serious outcomes for the patients. Therefore, creating diagnostic tools to identify individuals who are at recurrence/relapse risk of BC is of great importance, as these tools would help to assure the patient receives appropriate therapy. 7 Currently, BC patients mainly receive comprehensive treatments (eg, surgery, chemotherapy and radiotherapy as well as hormone therapy, and targeted therapy) as therapeutic approaches. 8 Exploring non-invasive diagnostic biomarkers helps clinicians to discriminate tumor types and cancer stages, and aids them in developing individualized/personalized therapy plans. 9 In spite of the comprehensive exploration of etiology and signatures of BC, diagnostic tools for predicting the metastasis/recurrence of BC have not been developed in clinical practice. 10 There is an urgent need for additional exploration of the underlying mechanisms of the BC metastasis and identification of biomarkers to screen BC recurrence. 10
Nowadays, enormous amounts of genomic data are produced by continuous advances in biological research technologies including sequencing which have been widely used in creating diagnostic tools for different diseases. BC has a heterogeneous nature clinically and pathologically and has several prognostic subgroups due to the vast molecular changes. Genomic information provides a better understanding of the diversity observed in BC types and molecular complexity and helps to detect homogeneous subclasses of patients according to their therapeutic response. Molecular classifications should provide clinicians with better treatment options. Also, diagnosis and prognosis of the disease are expected to be improved by the identified biomarkers.3,11
To date, gene-expression profiles-based classifiers have been developed for detecting BC tumor types.12,13 However, few studies have considered identification of prognostic markers to predict metastasis. So far, several bioinformatics and machine learning methods have been utilized to model genome-wide data to handle high-dimensional and low sample size problems with this type of data. Especially, when there is survival outcome in clinical information of the patients along with molecular data, the penalized Cox proportional hazards regression model is the first option. In this regard, the primary goal is to select a small subset of genes associated with survival outcomes that can be used to construct predictive models. In the penalized regression model, a penalty term is attached to the likelihood function which shrinks the regression coefficients toward zero. Therefore, variable selection and coefficient estimation are done simultaneously.
Despite extensive research that has been conducted based on multiplicative Cox proportional hazards model for selecting gene expression profiles, few attempts have been made to consider additive risk regression models. Additive hazard models may provide information further than the typical proportional hazards in survival analysis. For example, the former estimates the hazards difference instead of hazard ratio indicating “the absolute difference in the instantaneous failure rate per unit of change in the exposure variable.” Moreover, additive hazards model is more appropriate in the case the proportional hazard assumption is violated. 14 Penalized additive hazards model has been also developed for variables selection in the context of survival analysis and has been applied for selecting gene profiles related to various cancers.15,16 Nevertheless, to our knowledge, there is no study that analyzes molecular data with survival outcome in BC patients to create prognostic models for this disease using additive risks model. The aim of the present study was to focus on utilization of penalized additive hazards regression model for analyzing high-dimensional time-to-metastasis data, to create a diagnostic model to predict survival time in BC patients, and to determine genes associated with time-to-metastasis.
Methods
Data source
In the present study, a dataset with an accession series entry of GSE2034 on NCBI/Genbank Gene Expression Omnibus (GEO) database (www.ncbi.nlm.nih.gov/geo) was used. Samples (n = 436) were related to patients with lymph-nodenegative breast cancer from tumor bank at the Erasmus Medical Center (Netherlands) and steroid-hormone receptors were measured. Time to relapse, as well as gene expression measurements for 22 283 genes (platform: Affymetrix Human Genome U133A Array) from 286 samples available from GEO were used in the present study.
An external dataset was used to validate the results (GSE26971). Information of 277 samples including gene expression profiles and time to relapse (platform: Affymetrix Human Genome U133A Array).
Statistical analysis
Gene selection through penalized additive hazards model
Penalized regression techniques are useful statistical learning methods for variable selection, especially when the sample size is smaller than the number of explanatory variables. In this approach a penalty is attached to the objective function; so that the estimates of the regression coefficients are shrinking toward zero. In this way, variables selection and coefficient estimation are done at the same time. In the present study, an additive hazards approach 17 was considered as follows:
where,
where,
Robustness assessment
The data partitioning was repeated 100 times to assess the robustness of the model. The process of feature selection was repeated 100 times for all 5 penalties and the frequency of selected features over 100 times was calculated. So, the gene expression profiles with the frequency greater than 50% were considered as selected.
Inferring survival groups
The prognostic index (
Software
In this study, R software programing (http://www.r-project.org) was used based on an R package provided by Lin and Lv (http://www-scf.usc.edu/~linwei/software.html). 17
Results
There was information on 22 283 gene expression profiles for 286 samples. The data were preprocessed and quartile normalization was used to normalize data. Then, the penalized additive hazards model with 5 different penalties described in material and method section was applied to select genes related to the hazards of the patients. The response variable was survival time of the patients. Table 1a illustrates the name of 39 gene profiles selected by the Lasso, SCAD, ENet, SICA and MCP. The Lasso, ENet, SCAD, SICA and MCP selected 36, 13, 24, 39, and 10 gene profiles related to the survival of the patients. There were 9 genes common to the 5 different regression methods: SNU13, CLINT1, MAPK9, ABCC5, NKX3-1, NCOR2, COL2A1, and ZNF219. To investigate the association between the selected genes and the hazard of metastasis progression of the patients, unpenalized additive hazards regression was fitted. Regression coefficient of each identified gene along with unadjusted (univariate regression) and adjusted (multivariate regression) P-values were also provided in Table 1b. According to the results, all 39 genes had statistically significant associations with hazard function (unpenalized univariate additive hazards regression model). According to Table 1, some of the genes including SNU13, FBXO7, DUSP4, ABCC5, CD44, BTN3A2, CALML3, LILRA1, ZNF257, KCNJ4, SEC24A, CKAP2, ZBTB10, WFDC1, VTCN1, FRAS1, NOL3, and RPL31 reduced the hazard of metastasis, as their regression coefficients were negative. The rest of the genes were positively associated, therefore increasing the hazard of metastasis in patients with BC.
(a) The results of penalized additive hazards model and (b) regression coefficients from unpenalized additive hazards model for the selected genes from penalized additive hazards model on the original dataset.
Also, unpenalized multivariate additive hazards regression model was fitted by considering all 9 common genes in the model, simultaneously. Table 2 shows the results. Seven genes, named WFDC1, CKAP2, COL2A1, TCF7L2, KCNJ4, ABCC5, TCL1B were statistically significant in the multivariate model. P-values obtained by multivariate unpenalized additive hazards regression models for sets of genes selected by each penalty were provides in Table 2.
Results of fitting multivariate additive hazards model using 9 in common genes on the training dataset.
A breast cancer survival prognostic index
A prognostic index was created based on 39 genes and the regression coefficients obtained from penalized additive hazards model. The patients were labeled as high/low risk groups according to the median of the prognostic index. The median survival time of the low/high risk groups was 161.85 (S.E = 2.88) and 39 (6.05) months, respectively. The log rank test indicated a statistically significant difference between the survival curves of the 2 identified groups (P < .001). The prognosis index was also calculated based on gene subsets identified by each method. The last row of Table 1, presents the results of log rank test for the survival risk groups based on genes selected by each of the 5 methods. According to the results all subsets could identify risk groups, significantly. Figure 1 depicts the Kaplan-Meier survival function clearly distinguishing the high/low risk groups.

Kaplan-Meier survival curves for 2 identified risk groups based on 39 selected genes in original data.
External validation
We fitted univariate additive hazards regression model using the validation dataset described above. All 39 genes from penalized additive models in the original dataset were validated in the external dataset, and had P-values < .001. So, all selected genes were significantly associated with the hazard of metastasis. Also, risk groups were created based on these genes by calculating prognosis index for this data set. Figure 2 shows the Kaplan-Meier survival curves for these 2 groups. Also, log rank test for comparing survival curves for the 2 risk groups in validation data set showed significant differences (P < .001).

Survival curves for the 2 identified risk groups based on 39 selected genes in the validation data.
Unpenalized additive hazards model based on the 9 common genes
Table 2 shows the results of fitting unpenalized additive hazards model on the original data set by considering only 9 common genes selected by all 5 penalties of LASSO, ENet, SCAD, SICA, and MCP. As seen, almost all genes (but one) were significantly associated with the hazards of metastasis (Adjusted P-value < .05). Table 3 shows the adjusted p-values related to fitting the unpenalized multivariate additive hazards model on the validation data set. As seen, NCOR2 and ZNF219 were significantly associated with the hazards of metastasis on the validation data set.
Results of fitting multivariate additive hazards model using 9 in common genes on the validation dataset.
Initial screening of selected genes and investigating their relationship with breast cancer
In this study, the DisGeNET (v7.0) was used to initial screening of selected genes by an additive approach in relationship with breast cancer. DisGeNET is a discovery platform to integrate data from GWAS catalogs, animal models and the scientific literature. 18 The MAPK9, NKX3-1, NCOR2, ABCC5, and CD44 were found as breast cancer-related genes among 9 common genes were selected by penalized methods.
Discussion
Breast cancer is considered as the first cancer among females throughout the world. Distant tumor recurrence (metastasis) can involve the whole body of patients and it is associated with many cancer related deaths. As little is known about biomarkers related to metastasis, utilizing advanced statistical models addressing high-dimensionality in gene expression studies is crucial for advancing our understanding of the metastasis mechanisms.19,20 In this study, unlike other studies, an additive hazards regression approach with 5 different penalty functions was utilized to analyze a high dimension dataset consisting of gene expression profiles of breast cancer patients and a survival outcome (GSE2034). We used all 5 types of penalties to take advantage of all the methods, as each has different properties that can reveal new biomarkers. The identified genes were used to create a prognostic index and identify high risk and low risk patients. By using DisGeNET, 5 out of 9 genes (including MAPK9, NKX3-1, NCOR1, ABCC5, and CD44) were identified as the most important genes playing an important role in BC metastasis and correlated with distant tumor recurrence/metastasis in breast cancer patients.
Based on the additive hazards regression model, the selected genes can predict metastasis. The expression of MAPK9 can increase the hazard of metastasis incidence or can decrease the metastasis-free survival time, while, NKX3-1, NCOR2, ABCC5, and CD44 can increase the metastasis-free survival time.
The protein encoded by MAPK9 has an essential role in the control of several biological processes such as transcription regulation, proliferation, and apoptosis. This gene is involved in response to several stimuli. For example, MAPK9 is involved in the intrinsic pathway of apoptosis induction under UV radiation exposure.21,22 Song et al 23 showed that the expression level of MAPK9 was significantly increased in HR + HER2 – tumor compared to adjacent normal tissue. Zekri et al showed that MAPK9 is involved in response to tamoxifen in HR + metastatic BC patients. The transcription factor NKX3.1, a member of Homeodomain family, can modulate the androgen receptor. NKX3.1 plays an important role in the regulation of cellular processes such as survival, differentiation, and proliferation.24,25 Holmes et al 21 showed that Nkx3-1 inhibits the binding of Estrogen (ER) to DNA and therefore regulating ER response in breast cancer. Nuclear receptor co-repressor 2 protein encoded with NCOR2 as a nuclear receptor co-repressor has a role in target genes transcription repressing through the alteration of epigenetic modification. 26 Moreover, the expression level of NCOR2 was associated with poor prognosis, chemoresistance, distant metastasis, and tumor recurrence of breast cancer.27,28 ABCC5 as a member of ATP-binding cassette (ABC) transporter family is involved in efflux of several endogenous substances, toxins, and therapeutic agents, such as cisplatin methotrexate, and 5-Fu. On the other hand, ABCC5 has a role in chemoresistance of several cancers, such as hepatocellular carcinoma, breast cancer, and colorectal cancer.29-31 CD44 as a transmembrane glycoprotein has a role in cellular communication considered as a cancer stem cell marker. Cancer stemness was associated with metastasis, epithelial to mesenchymal transition, and therapeutic resistance of different cancers, such as breast cancer.32-34
Various variable selection methods including penalized methods have data-dependent performance, and there is no gold standard method for users. 35 The present study focused on utilization of penalized additive hazards regression with various penalties of LASSO, SCAD, ENet, SICA, and MCP to identify metastasis-related gene expression profiles. The additive hazards regression model is a beneficial alternative to Cox’s model for selecting relevant genes related to survival of the patients, in the presence of high-dimensional covariates. The selected genes may provide extra information when compared to multiplicative models, such as Cox regression. Lin and Lv 17 conducted several simulation studies to investigate the accuracy of the penalized additive hazards models in terms of variable selection in the case of high dimension survival data (the number of covariates are much larger than the sample size). According to the results of their simulation studies, all the penalties showed sensitivities (ie, selecting the informative variables or providing a non-zero regression coefficient when it is actual coefficient is non-zero) over 0.9. Also, they showed that the SCAD and MCP had greater specificities (ie, not selecting the non-informative variables or providing a zero regression coefficient when it is actual coefficient is zero) compared to the LASSO and ENet, which was due to form of the penalty terms (the penalty function of MCP and SCAD is non-concave). They also applied the method to select genes related to the survival of diffuse large-B-cell lymphoma (DLBCL) data. Some remarkable features of using penalized additive hazards model are that it can be applicable when the number of covariates is large. It also provides the risk difference/excess risk which is more relevant in clinical studies compared to the risk ration. Also, this method is computationally efficient (which is very important in high-dimension setting), as it uses a least square method. Nevertheless, due to the additive form of the hazard function, negative values might be produced for the some individuals and the obtained least squares estimators may not have the support of likelihood theory. 36 Future studies should have focus on the properties of the estimators based on the maximum likelihood function.
The present study also introduced a new set of influential gene expression profiles in predicting metastasis in breast cancer patients, using an additive hazards model, with a different perspective than the proportional hazards point of view.
Conclusion
Considering the findings of this study, the penalized additive hazards model could identify high/low risk survival subgroups of BC patients. MAPK9, NKX3-1, NCOR1, ABCC5, and CD44 could be used as potential recurrence and metastatic prediction biomarkers in breast cancer. The identified a small subset of genes associated with survival of breast cancer patients additively that revealed additive risks. They provide different information than multiplicative models. It is recommended to use additive and multiplicative models to take advantages of both models to take insight into the disease. However, further molecular studies should be performed to validate the role of these genes in tumorigenesis, invasion, metastasis, and epithelial-mesenchymal transition of breast cancer.
Footnotes
Acknowledgements
This work was part of an MSc thesis in Biostatistics. We would like to appreciate the Research and Technology Deputy of the Hamadan University of Medical Sciences and the Research and Technology Deputy of the Hamedan University of Technology for technical support for their approval and support of this work.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was partially supported by Hamadan University of Medical Sciences (Grant NO: 1401120210346).
Declaration of Conflicting Interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Authors’ Contributions
LT and OH conceived the research topic. LT, OH, PA, SS, ID and SA explored that idea, performed the statistical analysis, and drafted the manuscript. All authors participated in the interpretations and drafting of the manuscript. All authors read and approved the final manuscript.
Availability of Data and Materials
Ethics Approval and Consent to Participate
This study used a publically available data set. All methods were carried out in accordance with relevant guidelines and regulations, and the study was approved by the Ethics Committee of the Hamadan University of Medical Sciences (IR.UMSHA.REC.1401.922). The funding body had no role in the design of the study and collection as well as in writing the manuscript.
Consent for Publication
Not applicable.
