Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time

Abstract

Objective:

Breast cancer is one of the most prominent and deadly diseases in the world, and its prognosis varies widely based on the expression of certain genes. Identification of these genes is important for developing and interpreting clinical prognostic tests as well as furthering our understanding of breast cancer biology. We expand on prior efforts in the field toward identifying prognostic genes, by integrating powerful statistical methods.

Methods:

To this end, we use an unsupervised random forest model, which allows for robust learning of non-linear gene expression/survival relationships and the ability to identify the most important genes affecting both positive and negative breast cancer prognosis. In total, 1,518 participants were considered from the METABRIC dataset, using 20,387 mRNA expression level variables and 23 clinical variables including HER2 mutation status. The top 250 & bottom 250 expressing genes and 6 clinical features were selected for the unsupervised random forest model.

Results:

Our research corroborates previous discoveries of 27 important prognostic genes while also identifying 3 genes as potentially novel prognostic factors. Based on gene ontology analysis, we additionally show that these genes have plausible connections to breast cancer biology that should be experimentally investigated.

Conclusions:

Here, we demonstrate the utility of the unsupervised random forest model over K-means clustering for identifying important genes in breast cancer.

Keywords

survival analysis, computational models, unsupervised learning, gene expression, cluster analysis

Introduction

Globally, breast cancer is the most diagnosed cancer in women and a leading cause of death for women, with over 2 million new cases diagnosed each year. In addition to genetic risk factors, environmental risk factors include body mass index (BMI), alcohol consumption, and smoking status, among others.¹ Breast cancer survival rate and time are significantly affected by the presence or absence of the hormones estrogen and progesterone, and of the gene Human Epidermal Growth Factor Receptor 2 (HER2).² Estrogen and progesterone are positive prognostic factors (PF) for those with breast cancer, while high expression of HER2 is a negative PF.³ The most prominent mutations that affect breast cancer are BRCA1 and BRCA2 mutations, which lead to a significantly increased risk of breast cancer for those who have either of the mutations.⁴ Unlike many other cancers, breast cancer is not typically driven by many of the other classical cancer genes such as TP53, CHEK2, and PTEN, furthering the motivation to better characterize the genetic bases of the disease.^5-7

The objective of this research is to find which genes are the most important PFs for breast cancer survival time using a newly applied method. Among prior approaches to identify important PFs for breast cancer, one prominent example is the PAM50 gene set. PAM50 consists of 50 different genes whose expression levels were determined to have a significant impact on breast cancer prognosis.⁸ Earlier work by Pereira et al had demonstrated the utility of the METABRIC dataset to perform survival analysis based on unsupervised K-means clustering of gene expression levels.^9,10 Our approach differs from this earlier work using unsupervised clustering by integrating the PAM50 approach to find genes that significantly impact breast cancer survival time. Unlike the K-means clustering model, the unsupervised random forest (URF) allows us to incorporate categorical clinical variables in the model as well as obtain insight about the variable importance. Furthermore, we perform a literature search on the 30 most important genes to better understand the biological and prognostic relevance of each gene. Our unique approach allows us to find PFs that should be further investigated and independently corroborated.

Methods

Dataset

The reporting of this study conforms to the TRIPOD-Cluster statement¹¹ (Supplemental Table 1). The accompanying QUIPS form is available in Supplemental Table 2. This research used data from the METABRIC dataset (n = 1,518). The METABRIC dataset consists of genetic and clinical data from primary breast cancer tumors from tumor banks in the United Kingdom and Canada. The dataset has 20,410 variables in total. 23 of these variables are clinical variables and 20,387 are genetic variables consisting of mRNA expression Z-scores for 20,387 different genes.^9,10 All participant samples were subject to the same clinical, genotypic and transcriptomic analysis to account for heterogeneity between collection locations. A total of 1,518 participants with no missing data were considered for URF clustering and survival analysis.

Selection Criteria and Variable Manipulation

Only individuals who had genetic data for both gene mutations and gene expression were included in our analysis. HER2 status was defined as positive if the subject had “gain” for the “Her2_Status” variable and negative if they had anything else. For the histologic cancer type variable, subjects with the cancer types: “other,” “metaplastic,” “tubular,” “mucinous,” “cribriform,” “medullary,” or “no value” were all recorded as “other,” due to small counts for each one. The gene expression Z-scores were turned into quartiles to standardize the dataset and improve interpretability for the URF variable importance grading. To limit the selection of important genes to the extremes, the top 250 and bottom 250 genes by Z-score were selected for unsupervised clustering. 6 clinical variables were selected for the URF model: number of positive lymph nodes, estrogen status, HER2 status, cancer type, age at diagnosis, and cellularity. The survival data was censored at 120 months.
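The quartile transformation described above can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name `to_quartiles`, the choice of inclusive cut points, and the example values are assumptions made for illustration, since the exact binning procedure is not specified in the text.

```python
from bisect import bisect_left
from statistics import quantiles


def to_quartiles(z_scores):
    """Map each Z-score to a quartile label (1-4) using the sample's own cut points."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
    cuts = quantiles(z_scores, n=4, method="inclusive")
    # bisect_left counts the cut points strictly below the value,
    # so a value equal to a cut point stays in the lower bin.
    return [bisect_left(cuts, z) + 1 for z in z_scores]


# Hypothetical Z-scores for one gene across 8 participants.
example = [-1.8, -0.6, -0.1, 0.0, 0.3, 0.9, 1.4, 2.2]
print(to_quartiles(example))
```

Binning each gene this way puts every expression variable on the same 4-level scale, which is the standardization the paragraph above relies on for comparing variable importance.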

Unsupervised Clustering

To examine which genes affect breast cancer survival time, we followed Breiman’s method and used a URF to generate a proximity matrix followed by Partitioning Around the Medoids (PAM) to identify 3 clusters.¹² By design, the URF model does not have a dependent variable from the input data that is used to train the model. Rather, the random forest generates a synthetic set of data from the input data that is the same size. This shifts the task into a classification problem where the response variable is whether the observation is from the synthetic dataset. Now the algorithm resembles a classical random forest classifier with our new binary response variable. To generate the synthetic data using Breiman’s method, all of the input data is assigned a class label, 1, and a second dataset, randomly sampled from the univariate distributions found in the input data, is labeled class 2. We then applied the random forest classifier to the concatenated 2-class data.¹³ The values in the proximity matrix are calculated by taking the proportion of trees in which one observation was in the same final terminal node as another observation.¹⁴ The proximity matrix was transformed into a dissimilarity matrix by subtracting all of the values from 1. For the clustering method, we chose to apply PAM as it has been used before with URF models.¹² A principal component analysis was performed using the fviz_cluster function in the factoextra package¹⁵ to visualize the 3 clusters.
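The two data-handling steps in this paragraph, building the synthetic class and converting terminal-node co-occurrence into a dissimilarity, can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' R implementation: the function names are hypothetical, the forest itself is not fitted here, and `leaf_ids` stands in for the terminal-node assignments that a fitted random forest would provide.

```python
import random


def synthetic_copy(rows, rng):
    """Breiman-style synthetic data: sample each column independently from its observed values."""
    n = len(rows)
    columns = list(zip(*rows))
    # Independent sampling keeps each variable's marginal distribution
    # but breaks the dependence between variables.
    sampled = [[rng.choice(col) for _ in range(n)] for col in columns]
    return [list(r) for r in zip(*sampled)]


def labeled_training_set(rows, rng):
    """Concatenate observed rows (class 1) with synthetic rows (class 2)."""
    synth = synthetic_copy(rows, rng)
    X = [list(r) for r in rows] + synth
    y = [1] * len(rows) + [2] * len(synth)
    return X, y


def dissimilarity_from_leaves(leaf_ids):
    """leaf_ids[t][i] is the terminal node of observation i in tree t (hypothetical input).

    Proximity is the share of trees in which two observations land in the same
    terminal node; dissimilarity is 1 minus that proximity.
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    diss = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for tree in leaf_ids if tree[i] == tree[j])
            diss[i][j] = 1.0 - shared / n_trees
    return diss


# Toy input: 3 participants, mixed numeric and categorical variables.
rows = [[1, "pos", 0.2], [3, "neg", 1.5], [4, "pos", -0.7]]
X, y = labeled_training_set(rows, random.Random(0))

# Toy terminal-node assignments for 4 trees and 3 observations.
leaves = [[0, 0, 1], [2, 3, 3], [5, 5, 5], [1, 1, 2]]
D = dissimilarity_from_leaves(leaves)
```

The resulting matrix `D` is the kind of input a medoid-based method such as PAM operates on, since PAM needs only pairwise dissimilarities and not the original variables.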

Statistical Survival Analysis for Each Cluster

The clusters from the PAM algorithm were also subjected to further analysis such as Kaplan-Meier (KM) survival curve comparison and a heatmap of the top 30 URF genes’ expression. These analyses were stratified by cluster. The survival analysis was done using the Log-Rank test to compare the clusters. First, the test was done as an omnibus test, to check if any of the curves are significantly different. This was done using the survfit function in the survival package in R and was plotted using the ggsurvplot function in the survminer package in R.^16,17 Next, the test was performed pairwise, comparing each curve to each other curve. This was done using the pairwise_survdiff function from the survminer package in R.
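For readers unfamiliar with the Kaplan-Meier estimator behind these curves, a minimal Python sketch follows. It is an illustration, not the survival package's implementation: the function name, the handling of follow-up beyond the 120-month limit as censored at 120 months, and the toy data are assumptions.

```python
def kaplan_meier(times, events, horizon=120):
    """Return [(t, S(t))] at each distinct event time, censoring follow-up at `horizon` months."""
    # Anyone followed past the horizon is treated as censored at the horizon.
    data = [(min(t, horizon), e if t <= horizon else 0) for t, e in zip(times, events)]
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in data if e == 1}):
        at_risk = sum(1 for u, _ in data if u >= t)
        deaths = sum(1 for u, e in data if u == t and e == 1)
        # Product-limit update: multiply by the fraction surviving this event time.
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times in months; event = 1 is death, 0 is censored.
times = [10, 20, 20, 45, 130, 150]
events = [1, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
```

In this toy example the death recorded at 130 months does not enter the curve, because follow-up is cut off at 120 months.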

Biological Significance Analysis of Important Genes

The importance of each gene in the URF model was defined by the mean decrease in the Gini coefficient for that gene, which is the default importance measure for the randomForest package¹⁸ in R. The Gini coefficient was chosen as the impurity feature importance metric because it is native to the randomForest package, is computationally efficient and is relatively easy to interpret. The log-loss impurity was not considered appropriate as it implies a probabilistic interpretation. Permutation feature importance was not considered because it may mischaracterize important features that happen to be correlated.¹⁹ Other feature importance metrics such as SHAP are computationally intensive but may be useful in related applications of URF.²⁰ The top 30 most important genes from the URF were subjected to further analysis such as Reactome interaction plots using StringDB and gene set analysis using Reactome pathways.^21-23 The gene set analysis was performed using the EnrichR library in R.^24,25 Reactome is an open-source, peer-reviewed, biological pathway database. It consists of biological pathways and the genes that are a part of each pathway, as well as the interactions between genes within the pathway.²³ For the Reactome interaction plots, thicker lines indicate a known interaction between the genes and thin lines indicate a predicted interaction by Reactome.²² In addition, a literature search was performed to determine if there had been prior research establishing a link between the genes and breast cancer. Google Scholar was used as the search engine to find published papers that looked at each gene’s prognostic effect on breast cancer and whether that gene affected the tumor being estrogen positive. Gene Ontology (GO) terms for selected important genes were used to identify relevant biological processes.^26-28
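The impurity-based importance measure named at the start of this section rests on the decrease in Gini impurity produced by a single split. The sketch below shows that per-split quantity in Python; it is an illustration with hypothetical function names, not the randomForest package's code, and the way the package aggregates these decreases across splits and trees is not reproduced here.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Class 1 = observed data, class 2 = synthetic data, as in the URF setup.
parent = [1, 1, 1, 1, 2, 2, 2, 2]
weak_split = gini_decrease(parent, [1, 1, 1, 2], [1, 2, 2, 2])
perfect_split = gini_decrease(parent, [1, 1, 1, 1], [2, 2, 2, 2])
```

A variable whose splits separate observed from synthetic rows well accumulates large decreases, which is why it ranks highly in the importance plot.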

Quantifying Cluster Heterogeneity

Analysis of heterogeneity among the identified clusters focused on the top 30 genes. In lieu of fold changes, we calculated the difference in average Z-scores (∆Z-score) across participants for each gene, performed a 2-sided independent T test to estimate the P-value, and reported both values in Table 1.
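The ∆Z-score and the accompanying test statistic can be sketched as follows. This Python illustration assumes the unequal-variance (Welch) form of the independent T test, which the text does not specify, and it stops at the t statistic: converting it to a P-value requires the t distribution, which is not computed here. The function name and example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def delta_z_and_t(group_a, group_b):
    """Return (difference in mean Z-score, Welch t statistic) for one gene in two clusters."""
    diff = mean(group_a) - mean(group_b)
    # Standard error under the unequal-variance assumption (an assumption of this sketch).
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return diff, diff / se


# Hypothetical Z-scores for one gene in two clusters.
cluster_1 = [1.0, 2.0, 3.0]
cluster_2 = [0.0, 1.0, 2.0]
delta, t_stat = delta_z_and_t(cluster_1, cluster_2)
```

A positive ∆Z-score means the gene is more highly expressed in the first cluster of the pair, matching the sign convention of the "1 vs 2" style column headers in Table 1.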

Table 1.

Summary of the Top 30 Genes.

Gene rank	Gene name	Validation group	PAM50	∆Z-score (1 vs 2)	T test ∆Z-score (1 vs 2; P-value)	∆Z-score (1 vs 3)	T test ∆Z-score (1 vs 3; P-value)	∆Z-score (2 vs 3)	T test ∆Z-score (2 vs 3; P-value)
1	ESR1	I	Yes	0.5274	4.34 × 10⁻⁵⁵	2.307	<1 × 10⁻³⁰⁸	1.78	3.0669 × 10⁻³¹⁴
2	GATA3	I	No	0.4796	1.20 × 10⁻⁵⁵	2.305	2.49 × 10⁻²⁰⁷	1.826	3.51 × 10⁻¹⁸¹
3	CA12	I	No	0.4374	2.29 × 10⁻⁴⁵	2.256	8.40 × 10⁻¹⁸⁰	1.819	6.31 × 10⁻¹⁵⁶
4	TBC1D9	I	No	0.504	4.47 × 10⁻⁴³	2.161	2.12 × 10⁻²¹⁹	1.657	4.38 × 10⁻¹⁷⁶
5	PSAT1	I	No	−0.4845	2.75 × 10⁻⁶⁵	−2.223	7.08 × 10⁻¹²⁴	−1.738	2.22 × 10⁻¹⁰³
6	AGR3	I	No	0.545	3.39 × 10⁻⁵³	2.287	<1 × 10⁻³⁰⁸	1.742	4.77 × 10⁻²⁹²
7	IL6ST	I	No	0.8249	4.52 × 10⁻⁸¹	2.008	2.72 × 10⁻¹⁹⁶	1.183	1.44 × 10⁻¹⁰⁷
8	DEGS2	II	No	0.604	9.84 × 10⁻⁴⁴	2.028	5.57 × 10⁻²⁰⁸	1.424	6.21 × 10⁻¹⁴⁸
9	DNAJC12	I	No	0.6669	2.95 × 10⁻⁴⁷	1.974	1.57 × 10⁻²¹⁸	1.307	2.15 × 10⁻¹⁴⁸
10	FOXA1	I	Yes	0.2637	4.56 × 10⁻⁵³	2.187	1.69 × 10⁻⁹¹	1.923	4.34 × 10⁻⁸⁰
11	MLPH	I	Yes	0.3771	3.37 × 10⁻⁵⁷	2.234	5.22 × 10⁻¹⁰⁶	1.857	3.94 × 10⁻⁸⁸
12	NME3	I	No	0.408	2.19 × 10⁻²⁸	1.982	7.58 × 10⁻¹²⁰	1.574	7.80 × 10⁻⁹³
13	TMC4	I	No	0.5969	1.04 × 10⁻⁴⁸	1.829	1.87 × 10⁻¹²⁷	1.232	1.61 × 10⁻⁷⁸
14	CCDC24	I	No	0.5742	4.88 × 10⁻⁴⁶	1.903	1.44 × 10⁻¹³⁵	1.329	2.93 × 10⁻⁸⁹
15	DNALI1	III	No	0.4794	1.74 × 10⁻²⁷	1.98	1.66 × 10⁻²⁰⁹	1.501	6.47 × 10⁻¹⁶⁷
16	GAMT	III	No	0.5307	3.04 × 10⁻⁴⁴	2	6.15 × 10⁻¹³⁴	1.47	2.77 × 10⁻⁹⁶
17	MYB	I	No	0.4767	5.43 × 10⁻⁴²	1.994	4.50 × 10⁻¹¹⁰	1.517	6.32 × 10⁻⁸¹
18	C9ORF116	III	No	0.691	1.19 × 10⁻⁶⁰	2.119	2.00 × 10⁻²⁰¹	1.428	9.12 × 10⁻¹³⁵
19	BCL2	I	Yes	0.5846	4.22 × 10⁻⁴⁰	1.659	1.05 × 10⁻¹⁰⁴	1.074	3.52 × 10⁻⁵⁷
20	RERG	I	No	0.5535	7.86 × 10⁻⁴¹	1.763	3.16 × 10⁻¹¹¹	1.209	1.63 × 10⁻⁶⁸
21	MAPT	I	Yes	0.7057	8.91 × 10⁻⁵⁸	2.019	1.15 × 10⁻²³⁰	1.313	2.53 × 10⁻¹⁵¹
22	SUSD3	I	No	0.7603	3.17 × 10⁻⁵⁶	1.887	8.91 × 10⁻²¹⁴	1.126	3.45 × 10⁻¹²²
23	AFF3	I	No	0.4171	5.58 × 10⁻²²	1.919	1.76 × 10⁻²⁰⁹	1.502	2.82 × 10⁻¹⁷⁰
24	KLHDC9	I	No	0.5537	1.02 × 10⁻⁴³	2.031	1.28 × 10⁻¹²⁸	1.477	8.80 × 10⁻⁹⁰
25	RUNDC1	II	No	0.7903	2.14 × 10⁻⁶⁸	1.978	9.88 × 10⁻¹⁸⁵	1.188	1.41 × 10⁻¹⁰²
26	XBP1	IV	No	0.3589	5.44 × 10⁻³²	1.937	1.26 × 10⁻¹⁰⁴	1.578	1.90 × 10⁻⁸³
27	NME5	I	No	0.7581	1.02 × 10⁻⁵⁹	1.859	4.42 × 10⁻²²¹	1.1	1.58 × 10⁻¹²¹
28	DACH1	I	No	0.4292	5.94 × 10⁻²¹	1.851	1.00 × 10⁻²⁰⁰	1.422	2.64 × 10⁻¹⁶⁷
29	ZG16B	I	No	0.6227	1.87 × 10⁻⁶⁰	1.843	4.88 × 10⁻⁸⁰	1.22	2.35 × 10⁻⁴⁶
30	PPP1R14C	I	No	−0.1882	1.19 × 10⁻¹⁶	−2.017	1.36 × 10⁻⁷²	−1.829	5.47 × 10⁻⁶⁵

Genes that represent the novel findings in this work, group II, are underlined. Group IV genes, which were discordant with respect to breast cancer significance in the literature, are highlighted in bold.

Results

Survival Clustering Analysis

Figure 1A shows the PCA plot of the 3 clusters across the top 2 principal components, termed “Dim1” and “Dim2.” Figure 1B shows 3 survival curves, each corresponding to 1 of the 3 PAM-generated clusters. The overall Log-Rank test compares the 3 survival curves under the null hypothesis that they are the same over time; a P-value below the .05 threshold indicates that survival time differs significantly among the curves. There are 441, 811 and 266 participants in Clusters 1, 2 and 3, respectively. Figure 1C shows the pairwise survival comparison, in which each survival curve was compared with each of the other survival curves. All the P-values were less than .05, indicating that every pair of clusters has significantly different survival times. Figure 2A shows the top 30 most important genes from the URF, ranked by mean decrease in the Gini coefficient. Figure 2B shows the mean gene expression for the top 30 URF genes stratified by cluster. PSAT1 and PPP1R14C are highly expressed in Cluster 3 and lowly expressed in Cluster 1, while the other 28 genes are highly expressed in Cluster 1 and lowly expressed in Cluster 3. In Cluster 2, all of the genes are expressed at average levels.
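The pairwise comparisons described above rest on the 2-group Log-Rank statistic, which can be sketched in Python. This is an illustration with a hypothetical function name and toy data, not the survminer implementation; it covers a single pair of groups, applies no multiple-comparison adjustment, and does not implement the 3-group omnibus test.

```python
from math import erfc, sqrt


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P-value) with 1 degree of freedom."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_minus_expected = 0.0
    var = 0.0
    for t in sorted({t for t, e, _ in pooled if e == 1}):
        n = sum(1 for u, _, _ in pooled if u >= t)                 # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)    # at risk, group A
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)      # deaths, both groups
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        # Observed deaths in group A minus the number expected if the curves were equal.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        raise ValueError("variance is zero; the groups never share a risk set")
    chi2 = observed_minus_expected ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical groups: early deaths versus late deaths, no censoring.
early = ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
late = ([20, 21, 22, 23, 24], [1, 1, 1, 1, 1])
chi2, p_value = logrank_two_groups(*early, *late)
```

When two groups have identical follow-up, observed and expected deaths agree at every event time, the statistic is 0, and the P-value is 1.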

Figure 1.

Principal component analysis plot from URF clustering shows strong separation between each cluster (A). “Dim1” and “Dim2” correspond to the first and second principal components and account for 31.3% and 8.5% of the variance, respectively. The number of participants in Clusters 1, 2 and 3 are 441, 811 and 266, respectively. The Kaplan-Meier survival plot for the 3 identified clusters (B). The Log-Rank P-values between the clusters (C).

Figure 2.

A feature dot plot of the top 30 most important variables, including clinical variables, in the URF model using the variable importance metric, mean decrease in Gini coefficient (A). A heatmap, stratified by cluster, of the top 30 most important genes in the URF model. The scalebar indicates the Z-score values for gene expression levels (B). Gene enrichment analysis of the top 30 most important genes in the URF model using the database Reactome 2022. P-values are calculated using Fisher’s exact test and corrected using the Benjamini-Hochberg procedure (C). A gene network plot of the top 30 most important genes in the URF model using the database Reactome 2022 (D). Solid lines between the gene nodes indicate a known interaction between the genes in the Reactome 2022 database.

Gene Set Enrichment Pathway Analysis

Figure 2C shows the gene set enrichment analysis with the top 30 URF genes. The most prominent pathway was “Estrogen-dependent gene expression.” The other most prominent pathways were Estrogen receptor (ESR)-Mediated Signaling and Signaling By Nuclear Receptors. All 3 of these pathways included the same 5 genes: FOXA1, GATA3, ESR1, BCL2, and MYB. Figure 2D shows the Reactome pathway analysis for the genes. Notably, 4 of these 5 genes, FOXA1, GATA3, ESR1, and MYB, were the only genes that had any interaction with each other.
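The enrichment P-values in Figure 2C are corrected with the Benjamini-Hochberg procedure, which can be sketched in a few lines of Python. This is an illustration of the standard step-up adjustment with a hypothetical function name and made-up P-values, not the EnrichR implementation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that each adjusted value is the
    # minimum of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for 4 pathways.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
```

The running minimum keeps the adjusted values monotone in the raw P-values, so a pathway never ends up with a smaller adjusted value than one that had a smaller raw P-value.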

Identifying Prognostic Factors for Breast Cancer

Using the METABRIC dataset, we were able to create 3 distinct clusters using URF and PAM in tandem. Per the KM plots and subsequent Log-Rank tests, we were able to establish that the clusters each had a distinct survival curve. Cluster 3 had the worst survival rate, Cluster 1 had the best survival rate, and Cluster 2 was intermediate. Around 33% of the breast cancer participants in Cluster 3 had died by around 45 months, compared to around 10% for Cluster 2 and around 5% for Cluster 1. Participants in Cluster 3 therefore had a markedly increased risk of dying of breast cancer within the first 45 months of being diagnosed compared to the other 2 clusters. When looking at the heatmap in Figure 2B, stratified by cluster, we can see the stark differences between the clusters. Cluster 2 has mean Z-scores for gene expression near 0 for all of the top 30 URF genes, indicating that this gene expression pattern represents an average breast cancer participant. For Cluster 3, PPP1R14C and PSAT1 are highly expressed, while the other 28 genes are under-expressed. The opposite is true in Cluster 1, where PPP1R14C and PSAT1 are under-expressed, and the other 28 genes are over-expressed. From this, we can gather that PPP1R14C and PSAT1 are negative PFs as they are over-expressed in Cluster 3, which has the lowest survival rate, and the other 28 genes are positive PFs as they are over-expressed in Cluster 1, the cluster with the highest survival rate.

Prognostic Factors Are Highly Associated With Estrogen Pathways

For the gene set enrichment analysis, the only pathways that were enriched among the group of prognostic genes were pathways that were related to estrogen expression in the body. The genes that are involved in this pathway were ESR1, GATA3, BCL2, MYB, and FOXA1. This reinforces that important genes in breast cancer survival time are more likely to be involved in the estrogen pathway than non-prognostic genes. The Reactome interaction plot showed that the only genes with established interactions with each other were ESR1, GATA3, MYB, and FOXA1. These genes are part of the estrogen pathway, so that pathway is very prominently represented in our group of prognostic genes.

Literature Search of Identified Prognostic Factors

The literature search revealed that 27/30 genes had at least 1 paper written about their relationship with breast cancer, while 3 genes did not: DEGS2, RUNDC1, and GAMT. Of those that had published papers written about them, 23 were considered positive PFs by the researchers: AFF3, AGR3, BCL2, CA12, CCDC24, DACH1, DNAJC12, ESR1, FOXA1, GATA3, IL6ST, KLHDC9, MAPT, MLPH, MYB, NME3, NME5, RERG, SUSD3, TBC1D9, TMC4, XBP1, ZG16B, and 2 were considered negative PFs: PSAT1 and PPP1R14C.^29-51 C9ORF116 and DNALI1 are predicted to be positive PFs according to our analysis but are predictive of bone metastasis.⁵² Of these 27 genes with prior literature, 25 had a PF status in the literature search that agreed with our research.

We classify the genes into 4 categories: groups I, II, III and IV. The group assignments for the top 30 genes are presented in Table 1. Group I is the validated group, whose genes have the same prognostic status in our analysis as in the literature search: AFF3, AGR3, BCL2, CA12, CCDC24, DACH1, DNAJC12, ESR1, FOXA1, GATA3, IL6ST, KLHDC9, MAPT, MLPH, MYB, NME3, NME5, PPP1R14C, PSAT1, RERG, SUSD3, TBC1D9, TMC4 and ZG16B. Importantly, 5 of the group I genes overlapped with some of the positive PFs identified by the previously defined PAM50 set: BCL2, ESR1, FOXA1, MAPT, and MLPH. Specifically, elevated expression of these genes is associated with lower proliferation and lower risk of recurrence.¹⁰ Group I negative PFs, PSAT1 and PPP1R14C, are both negative regulators of the GSK3β-dependent pathways.^50,51 Group II is the unvalidated group, whose genes were significant positive PFs in our analysis but did not have any papers connecting them to breast cancer outcomes as of July 2025: DEGS2 and RUNDC1. Group III genes are predicted positive PFs that have roles in metastasis: C9ORF116, GAMT and DNALI1. In the context of metastatic breast cancer cases, BCL2, C9ORF116, DNALI1, ESR1, and GATA3 are predictive of metastasis to bone relative to other tissues.⁵² PFs are context-dependent and may require different interpretations under different circumstances. Since ESR1, GATA3, and BCL2 have been independently corroborated as positive PFs, they are categorized as group I. Group IV is defined as genes that appear to be positive PFs in our study but for which published literature indicates the opposite prognostic effect. Group IV consists of XBP1, which has previously been reported to be a negative PF by promoting hormone therapy resistance when overexpressed but appears to be a positive PF in our analysis.^49,53 Based on Cluster 3 in the URF model, loss of expression among identified important genes, including XBP1, with high PSAT1 and PPP1R14C expression is a negative PF for survival (Figure 2). Using our method, we show that loss of XBP1 can be consequential for breast cancer survival under certain circumstances.

We used the GO database and existing literature to better understand the function and role of the groups II and III prognostic genes in breast cancer outcomes. DEGS2 is involved with ceramide biosynthesis (GO:0046513), and its overexpression or depletion contributes to metastasis in colorectal cancer.⁵⁴ The related paralog, DEGS1, is part of the 70-gene PF panel from the Netherlands Cancer Institute.⁵⁵ Ceramides can induce cell death and are associated with less aggressive breast cancer types, and we hypothesize that ceramide synthesis is a mechanism behind our observations.^54,56,57 Since DEGS1 and DEGS2 are paralogs, it is possible that the PF nature of DEGS2 was overshadowed by DEGS1 activity in previous works. Recent work has identified RUNDC1 as required for TP53 tumor suppression, and it may be involved in downregulating the autophagosome and regulating TP53.^58-60 Since TP53 is a known tumor suppressor, RUNDC1 expression is likely protective by this mechanism. RUNDC1 has not been well-characterized in previous studies because TP53 activity might otherwise explain the observations. These findings corroborate our results that DEGS2 and RUNDC1 are positive PFs and are elevated in Cluster 1. GAMT is a purine methyltransferase that is involved in spermatogenesis (GO:0007283), creatine metabolism (GO:0006600, GO:0006601) and muscle contraction (GO:0006936).^26-28 Previous research has shown the importance of purine metabolism in breast cancer and creatine metabolism in metastasis.^61-63 Specifically, dysregulated accumulation or depletion of uric acid is associated with poor survival prognosis and accumulation of creatine promotes metastasis in various cancer types.^63,64 The dual role of GAMT in breast cancer metabolism may account for the pleiotropic effects in breast cancer survival outcomes. This dual role may have obscured GAMT as a PF candidate by other methods. C9ORF116 is involved in microtubule organization and determining left/right symmetry (GO:0007368, GO:0061966).^26-28 Further work is needed to investigate the physiological and clinical significance of these genes in breast cancer. Note that neither RUNDC1 nor DNALI1 has formal GO entries.

Cluster Heterogeneity

The 3 clusters are stratified such that Clusters 1 to 3 represent the best-case, average-case and worst-case survival outcomes, respectively. Clusters 1 and 2 have similar expression patterns for the top 30 genes except that Cluster 1 has significantly higher Z-scores with the exception of PSAT1 and PPP1R14C, which are Cluster 3-specific (Table 1). IL6ST shows the largest difference in Z-score between Clusters 1 and 2, highlighting the positive impact of IL6 cytokine signaling on survival outcomes.⁴⁰ The next 4 genes with the largest difference between Clusters 1 and 2 are mostly in Group I with the exception of RUNDC1. As mentioned earlier, Cluster 3 exhibits lower Z-scores for 28 of the 30 top genes, with the exception of PSAT1 and PPP1R14C, which are uniquely upregulated. Group I genes, including ESR1, GATA3, AGR3, CA12, MLPH, FOXA1 and TBC1D9 represent the largest differences between Clusters 1 & 2 relative to Cluster 3 (Table 1).

To check whether the cluster assignments reflected previously-established findings, we compared our cluster assignments to the available Nottingham Prognostic Index (NPI) scores. First, we calculated the mean NPI score for each cluster, which increases monotonically from Cluster 1 to Cluster 3. Clusters 1, 2 and 3 exhibited mean NPI scores of 3.5, 4.1 and 4.7, respectively. The NPI scores ranged from 1.02 to 6.36 with a median of 4.04.

Risk of Bias Assessment

The risk of bias was assessed for our use of the METABRIC dataset with the QUIPS tool (Supplemental Table 2). Overall, bias was considered low-to-moderate across the 6 domains. Low risk of bias domains included PF measurement, outcome measurement and statistical analysis & reporting. We identified moderate risk of bias for study participation, study attrition and study confounding. Study participation is likely biased toward breast invasive ductal carcinoma because about 76% of the participants shared this histological type, although we argue that this reflects a reasonable sampling of the true breast cancer subtype distribution.⁶⁵ The METABRIC dataset includes some loss of participant data but does not provide a detailed description for each lost participant, which we argue warrants a moderate risk of bias.⁹ Participant age (reported as age of diagnosis) and tumor type were both included in the model but were not stratified to control for these variables, which we argue confers a moderate risk of bias. Cancer treatment status, including chemotherapy, hormone therapy, radiotherapy and breast surgery, was not included in the final model; these treatments are potential confounding variables that may bias our findings toward therapy-related gene expression effects. We use the clustering step to find the natural stratifications as a way of accounting for these effects. An exploratory analysis on the potential link between treatment and gene expression revealed that 76% and 70% of participants in Clusters 1 and 2, respectively, received hormone therapy, radiotherapy or both (but not chemotherapy), while 59% of participants in Cluster 3 received chemotherapy, radiotherapy or both (but not hormone therapy). This confirms distinct treatment profiles for the clusters that were organically identified by URF. 
Using the limma package in R, we modeled gene expression relative to the therapeutic strategies by fitting a linear model with empirical Bayes moderation to rank the gene expression contrasts based on the treatment groups.⁶⁶ Participants who received chemotherapy with or without radiotherapy, but not hormone therapy or radiotherapy only, exhibited a significant (BH adjusted P-value < 0.05) downregulation of groups I, II, III and IV genes relative to untreated participants with the exception of PSAT1 and PPP1R14C, which were upregulated. It has been shown that upregulation of PSAT1 and PPP1R14C is associated with triple negative breast cancer, which typically warrants chemotherapy treatment. However, many triple negative breast cancer patients develop resistance, leading to poor survival.⁵¹ It is plausible that the Cluster 3 expression profile is a product of both treatment decisions and potentially chemotherapy-induced effects. Therefore, we cannot rule out chemotherapy as a confounder for survival. The same analysis revealed that hormone therapy was associated with some significant, but negligible changes. For detailed summaries of this analysis, we report the full results of the linear modeling in Supplemental Tables 3–9. The NPI and integrative cluster numbers were clinical variables included in the METABRIC dataset but were not included as part of the model to avoid bias toward already established PF measurements.^9,67

Discussion

Summary of Findings

The 30 genes presented here represent the extremes of the gene expression profile in breast cancers that cluster with a reasonable stratification of survival. Most of these genes have some connection to breast or other cancers. The 2 negative PFs we identify overlap in their function to inactivate GSK3β, which is common among triple-negative breast cancers and is known to affect survival outcomes.^50,51 The positive PFs tend to exist as a continuum from Cluster 2, which represents near-normal tissue expression to Cluster 1, which consistently has higher expression and better survival.

Comparison With Other Unsupervised Approaches

Extensive efforts have been made to identify clinically-relevant features in breast cancer, including genotypes (eg, BRCA1/2 variants), gene expression (eg, HER2 positivity) and tumor properties (eg, NPI) and relate those features with survival outcomes (PFs) and/or treatment success (predictive pathology factors). Related unsupervised clustering tasks have been described in the breast cancer PF literature as part of a supervised learning workflow. The clustering approaches include K-means, self-organized mapping and hierarchical clustering, which are sufficient to discriminate between samples.^10,55,68-70 The work presented here requires no supervised learning step as the clustering step is used to find coincidental survival patterns for each cluster and rank those features/genes by importance.

Future Directions

Further extensions of this analysis could yield other interesting discoveries. One extension could be adding mutation status or copy number variations into the model. This would allow the researcher to analyze what gene mutations or copy number variants could be significant PFs in breast cancer survival time. By grouping treated and untreated patients together, the identified genes may represent gene expression patterns associated with treatment success. While this may not represent the untreated breast cancer patient, the insights are still valuable as they represent the typical patient. One could extend the URF model to integrate multi-omics technologies (proteomics, metabolomics, transcriptomics) to understand the molecular mechanisms of disease. While multi-omics technologies can offer mechanistic insights in a research setting, adopting multi-omics approaches in a clinical setting is likely tempered by the increased cost barrier relative to any increased utility over current methods. Another extension could be using a regularized Cox Proportional Hazards model to analyze the 30 important genes in our analysis, although this is beyond the scope of this work. This would allow the researcher to compare the magnitude of the gene’s effects on breast cancer mortality.

Our analysis finds a stratification of mortality from around 5% to around 33% at 45 months. This period is shorter than the 5-year survival analysis used for NPI analysis, yet we see a large difference within this window. Future implementations of URF could extend this period to other widely-used time frames including and beyond 5-year survival for direct comparison.

Conclusion

We present a random forest-based unsupervised clustering method for use in gene expression survival analysis datasets. Our results identify several genes with previously undescribed associations with breast cancer with potential utility as PFs and/or therapeutic targets. Some of the identified genes require further research to understand their role in breast cancer (groups II, III & IV), while 90% of the top 30 genes have a previously-described role in breast cancer and may be therapeutic targets (group I). The group II genes are new discoveries as they have not been described before in the context of breast cancer prognosis. The existence of groups III and IV is a reminder that our model can be limited by confounding and non-linear effects such as metastasis, gene-gene and gene-environment interactions. Interactions between these 30 genes and progesterone, estrogen and HER2 status should be studied further as well because of their importance in breast cancer survival rate.

Supplemental Material

Supplemental material, sj-xlsx-1-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Supplemental material, sj-xlsx-2-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Supplemental material, sj-xlsx-3-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Supplemental material, sj-xlsx-4-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Supplemental material, sj-xlsx-5-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Supplemental material, sj-xlsx-6-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Supplemental material, sj-xlsx-7-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Supplemental material, sj-xlsx-8-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Supplemental material, sj-xlsx-9-cix-10.1177_11769351251393146 for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics

Footnotes

Abbreviations

AFF3: ALF transcription elongation factor 3, AGR3: Anterior gradient 3, BCL2: B-cell lymphoma 2, BH: Benjamini-Hochberg, BMI: Body mass index, BRCA1: Breast cancer susceptibility gene 1, BRCA2: Breast cancer susceptibility gene 2, C9ORF116: Chromosome 9 open reading frame 116, CA12: Carbonic anhydrase 12, CCDC24: Coiled-coil domain containing 24, CHEK2: Checkpoint kinase 2, DACH1: Dachshund homolog 1, DEGS1: Delta 4-desaturase, sphingolipid 1, DEGS2: Delta 4-desaturase, sphingolipid 2, DNAJC12: DnaJ heat shock protein family (Hsp40) member C12, DNALI1: Dynein axonemal light intermediate chain 1, ESR: Estrogen receptor-mediated signaling, ESR1: Estrogen receptor 1, FOXA1: Forkhead box A1, GAMT: Guanidinoacetate N-methyltransferase, GATA3: GATA binding protein 3, GO: Gene ontology, GSK3β: Glycogen synthase kinase 3 β, HER2: Human epidermal growth factor receptor 2, IL6ST: Interleukin 6 signal transducer, KLHDC9: Kelch domain containing 9, KM: Kaplan-Meier, MAPT: Microtubule-associated protein tau, METABRIC: Molecular Taxonomy of Breast Cancer International Consortium, MLPH: Melanophilin, mRNA: Messenger ribonucleic acid, MYB: Myeloblastosis oncogene, NME3: Nucleoside diphosphate kinase 3, NME5: Nucleoside diphosphate kinase 5, PAM: Partitioning around the medoids, PAM50: Prediction analysis of microarray 50, PCA: Principal component analysis, PPP1R14C: Protein phosphatase 1 regulatory inhibitor subunit 14C, PSAT1: Phosphoserine aminotransferase 1, PTEN: Phosphatase and tensin homolog, QUIPS: Quality in prognostic studies, RERG: Ras-like, estrogen-regulated, growth inhibitor, RUNDC1: RUN domain containing 1, SUSD3: Sushi domain containing 3, TBC1D9: TBC1 domain family member 9, TMC4: Transmembrane channel-like protein 4, TP53: Tumor protein 53, TRIPOD-Cluster: Transparent reporting of multivariable prediction models developed or validated using clustered data, URF: Unsupervised random forest, XBP1: X-box binding protein 1, ZG16B: Zymogen granule protein 16B.

ORCID iDs

Benjamin Goldberg

Eric Nels Pederson

Zhengqing Ouyang

Ethical Considerations

Ethical approval was not required for this research as the METABRIC dataset is publicly available.

Consent to Participate

All participant specimens were obtained with appropriate consent from the relevant institutional review board as described in the original METABRIC publication.⁹

Author Contributions

Conceptualization: B.G. and Z.O.; Methodology: B.G. and Z.O.; Software: B.G.; Data and Resources: B.G.; Writing - Original Draft: B.G.; Writing - Review & Editing: E.N.P. and Z.O.; Figures: B.G. and E.N.P.; Supervision: Z.O.; Funding Acquisition: Z.O. All authors approved the version for publication.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the start-up fund to the Ouyang Laboratory from University of Massachusetts Amherst.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The normalized mRNA levels and survival rate datasets for this study can be found in cBioPortal: https://www.cbioportal.org/study/summary?id=brca_metabric.⁹,¹⁰ The processed data discussed are available to view as an interactive Shiny app, which can be found at the Ouyang Laboratory GitHub page. Instructions for installation and execution are included in the README.md file.

Declaration of AI Use

No AI tools were used by the authors to refine language. No scientific data were generated or modified using AI.

Supplemental Material

Supplemental material for this article is available online.

References

1. Łukasiewicz, Czeczelewski, Forma, Baj, Sitarz, Stanisławek. Breast cancer-epidemiology, risk factors, classification, prognostic markers, and current treatment strategies-an updated review. Cancers. 2021;13(17):4287.

2. Nasrazadani, Thomas, Oesterreich, Lee AV. Precision medicine in hormone receptor-positive breast cancer. Front Oncol. 2018;8:144.

3. Yarden. Biology of HER2 and its importance in breast cancer. Oncology. 2001;61 Suppl 2:1-13.

4. Kotsopoulos. BRCA mutations and breast cancer prevention. Cancers. 2018;10(12):524.

5. Varna, Bousquet, Plassa, Bertheau, Janin. TP53 status and response to treatment in breast cancers. BioMed Res Int. 2011;2011(1):284584.

6. Apostolou, Papasotiriou. Current perspectives on CHEK2 mutations in breast cancer. Breast Cancer Target Ther. 2017;9:331-335.

7. Depowski, Rosenthal, Ross JS. Loss of expression of the PTEN gene protein product is associated with poor outcome in breast cancer. Mod Pathol. 2001;14(7):672-676.

8. Parker, Mullins, Cheang MCU, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27(8):1160-1167.

9. Curtis, Shah, Chin, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346-352.

10. Pereira, Chin, Rueda, et al. The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. Nat Commun. 2016;7(1):11479.

11. Debray TPA, Collins, Riley, et al. Transparent reporting of multivariable prediction models developed or validated using clustered data (TRIPOD-Cluster): explanation and elaboration. BMJ. 2023;380:e071018.

12. Shi, Seligson, Belldegrun, Palotie, Horvath. Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Mod Pathol. 2005;18(4):547-557.

13. Breiman. Manual on setting up, using, and understanding random forests v3.1. Stat Depart Univ California. 2002;1(58):1-29.

14. Mantero, Ishwaran. Unsupervised random forests. Stat Anal Data Min. 2021;14(2):144-167.

15. Kassambara, Mundt. factoextra: extract and visualize the results of multivariate data analyses. 2020. Accessed July 14, 2025. https://cran.r-project.org/web/packages/factoextra/index.html

16. Therneau, Lumley, Elizabeth, Cynthia. survival: Survival Analysis. 2024. Accessed February 24, 2025. https://cran.r-project.org/web/packages/survival/index.html

17. Kassambara, Kosinski, Biecek, Fabian. survminer: Drawing Survival Curves using “ggplot2.” 2024. Accessed February 24, 2025. https://cran.r-project.org/web/packages/survminer/index.html

18. Breiman, Cutler, Liaw, Wiener. randomForest: Breiman and Cutlers Random Forests for Classification and Regression. 2024. Accessed February 24, 2025. https://cran.r-project.org/web/packages/randomForest/index.html

19. Hooker, Mentch, Zhou. Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Stat Comput. 2021;31(6):82.

20. Lundberg, Lee. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Curran Associates Inc.; 2017:4768-4777.

21. Szklarczyk, Kirsch, Koutrouli, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023;51(D1):D638-D646.

22. Fabregat, Sidiropoulos, Viteri, et al. Reactome pathway analysis: a high-performance in-memory approach. BMC Bioinformatics. 2017;18(1):142.

23. Jassal, Matthews, Viteri, et al. The reactome pathway knowledgebase. Nucleic Acids Res. 2020;48(D1):D498-D503.

24. Chen, Tan, Kou, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14(1):128.

25. Kuleshov, Jones, Rouillard, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44(W1):W90-W97.

26. Ashburner, Ball, Blake, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25-29.

27. Thomas, Ebert, Muruganujan, Mushayahama, Albou. PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci. 2022;31(1):8-22.

28. Aleksander, Balhoff, Carbon, et al. The Gene Ontology knowledgebase in 2023. Genetics. 2023;224(1):iyad031.

29. Huang, Chen, et al. Comprehensive analysis of the NME gene family functions in breast cancer. Transl Cancer Res. 2020;9(10):6369-6382.

30. Aushev, Gopalakrishnan, Teitelbaum, et al. Tumor expression of environmental chemical-responsive genes and breast cancer mortality. Endocr Relat Cancer. 2019;26(12):843-851.

31. Yuan, et al. DACH1 suppresses breast cancer as a negative regulator of CD44. Sci Rep. 2017;7(1):4361.

32. Shi, Liu, et al. Identification of ZG16B as a prognostic biomarker in breast cancer. Open Med. 2020;16(1):1-13.

33. Chen, Tan, Zhuang. AFF3 is a prognostic biomarker correlated with immune infiltrates in triple-negative breast cancer. Clin Exp Obstet Gynecol. 2023;50(8):165.

34. Phillips, Gill, Baxter RC. Novel prognostic markers in triple-negative breast cancer discovered by MALDI-mass spectrometry imaging. Front Oncol. 2019;9:379.

35. Seachrist, Anstine, Keri RA. FOXA1: a pioneer of nuclear receptor action in breast cancer. Cancers. 2021;13(20):5205.

36. Thakkar, Raj, Chakrabarti, et al. Identification of gene expression signature in estrogen receptor positive breast carcinoma. Biomarkers Cancer. 2010;2:1-15.

37. Jian, Xie, Guo, et al. AGR3 promotes estrogen receptor-positive breast cancer cell proliferation in an estrogen-dependent manner. Oncol Lett. 2020;20(2):1441-1451.

38. Franke, Grimm, et al. TFAP2C regulates carbonic anhydrase XII in human breast cancer. Oncogene. 2020;39(6):1290-1301.

39. Gonda, Leo, Ramsay RG. Estrogen and MYB in breast cancer: potential for new therapies. Expert Opin Biol Ther. 2008;8(6):713-717.

40. Martínez-Pérez, Leung, Kay, et al. The signal transducer IL6ST (gp130) as a predictive and prognostic biomarker in breast cancer. J Pers Med. 2021;11(7):618.

41. De Bessa, Salaorni, Patrão, Neto, Brentani, Nagai MA. JDP1 (DNAJC12/Hsp40) expression in breast cancer and its association with estrogen receptor status. Int J Mol Med. 2006;17(2):363-367.

42. Dawson, Makretsov, Blows, et al. BCL2 in breast cancer: a favourable prognostic marker across molecular subtypes and independent of adjuvant therapy received. Br J Cancer. 2010;103(5):668-675.

43. Moy, Todorović, Dubash, et al. Estrogen-dependent sushi domain containing 3 regulates cytoskeleton organization and migration in breast cancer cells. Oncogene. 2015;34(3):323-333.

44. Yoon, Maresh, Shen, et al. Higher levels of GATA3 predict better survival in women with breast cancer. Hum Pathol. 2010;41(12):1794-1801.

45. Habashy, Powe, Glaab, et al. RERG (Ras-like, oestrogen-regulated, growth-inhibitor) expression in breast cancer: a marker of ER-positive luminal-like subtype. Breast Cancer Res Treat. 2011;128(2):315-326.

46. Kothari, Clemenceau, Ouellette, et al. TBC1D9: An important modulator of tumorigenesis in breast cancer. Cancers. 2021;13(14):3557.

47. Ikeda, Taira, Hara, et al. The estrogen receptor influences microtubule-associated protein tau (MAPT) expression and the selective estrogen receptor inhibitor fulvestrant downregulates MAPT and increases the sensitivity to taxane in breast cancer cells. Breast Cancer Res. 2010;12(3):R43.

48. Amgalan, Tseveendorj, Lee. An integrative model for the identification of key players of cancer networks. Appl Math Model. 2018;58:65-75.

49. Chen, Hua, et al. The emerging role of XBP1 in cancer. Biomed Pharmacother. 2020;127:110069.

50. Gao, et al. PSAT1 is regulated by ATF4 and enhances cell proliferation via the GSK3β/β-catenin/cyclin D1 signaling pathway in ER-negative breast cancer. J Exp Clin Cancer Res. 2017;36(1):179.

51. Jian, Kong, et al. Protein phosphatase 1 regulatory inhibitor subunit 14C promotes triple-negative breast cancer progression via sustaining inactive glycogen synthase kinase 3 beta. Clin Transl Med. 2022;12(1):e725.

52. Wang, Lin, et al. Predictive and prognostic biomarkers of bone metastasis in breast cancer: current status and future directions. Cell Biosci. 2023;13:224.

53. Wang, Geng, Sun. XBP1: a key regulator in breast cancer development and treatment. Pathol Res Pract. 2025;269:155900.

54. Guo, Zhang, Feng, et al. M6A methylation of DEGS2, a key ceramide-synthesizing enzyme, is involved in colorectal cancer progression through ceramide synthesis. Oncogene. 2021;40(40):5913-5924.

55. van ’t Veer, Dai, van de Vijver, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530-536.

56. Moro, Kawaguchi, Tsuchida, et al. Ceramide species are elevated in human breast cancer and are associated with less aggressiveness. Oncotarget. 2018;9(28):19874-19890.

57. Pal, Atilla-Gokcumen, Frasor. Emerging roles of ceramides in breast cancer biology and therapy. Int J Mol Sci. 2022;23(19):11178.

58. Llanos, Efeyan, Monsech, Dominguez, Serrano. A high-throughput loss-of-function screening identifies novel p53 regulators. Cell Cycle. 2006;5(16):1880-1885.

59. Zhang, Yang, et al. RUNDC1 inhibits autolysosome formation and survival of zebrafish via clasping ATG14-STX17-SNAP29 complex. Cell Death Differ. 2023;30(10):2231-2248.

60. Ranjan, Thoenen, Bhosale, Thapa, Parrales, Iwakuma. Abstract 258: RUNDC1 is required for p53-mediated tumor suppression. Cancer Res. 2025;85(8_Supplement_1):258-258.

61. Yin, Ren, Huang, Deng, Yin. Potential mechanisms connecting purine metabolism and cancer therapy. Front Immunol. 2018;9:1697.

62. De Vitto, Arachchige, Richardson, French JB. The intersection of purine and mitochondrial metabolism in cancer. Cells. 2021;10(10):2603.

63. Zhang. The two sides of creatine in cancer. Trends Cell Biol. 2022;32(5):380-390.

64. Yue, Feng, Yao, et al. High serum uric acid concentration predicts poor survival in patients with breast cancer. Clin Chim Acta. 2017;473:160-165.

65. Malhotra, Zhao, Band. Histological, molecular and functional subtypes of breast cancers. Cancer Biol Ther. 2010;10(10):955-960.

66. Ritchie, Phipson, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47.

67. Galea, Blamey, Elston, Ellis IO. The Nottingham prognostic index in primary breast cancer. Breast Cancer Res Treat. 1992;22(3):207-219.

68. Chang, Nuyten, Sneddon, et al. Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci USA. 2005;102(10):3738-3743.

69. Tabl, Alkhateeb, Pham, Rueda, ElMaraghy, Ngom. A novel approach for identifying relevant genes for breast cancer survivability on specific therapies. Evol Bioinform Online. 2018;14:1176934318790266.

70. Zhou, Rueda, Alkhateeb. Classification of breast cancer Nottingham prognostic index using high-dimensional embedding and residual neural network. Cancers. 2022;14(4):934.
