Abstract
Objective:
Breast cancer is one of the most prominent and deadly diseases in the world, and its prognosis varies widely based on the expression of certain genes. Identification of these genes is important for developing and interpreting clinical prognostic tests as well as furthering our understanding of breast cancer biology. We expand on prior efforts in the field toward identifying prognostic genes, by integrating powerful statistical methods.
Methods:
To this end, we use an unsupervised random forest model, which allows for robust learning of non-linear gene expression/survival relationships and the ability to identify the most important genes affecting both positive and negative breast cancer prognosis. In total, 1,518 participants were considered from the METABRIC dataset, using 20,387 mRNA expression level variables and 23 clinical variables including HER2 mutation status. The top 250 & bottom 250 expressing genes and 6 clinical features were selected for the unsupervised random forest model.
Results:
Our research corroborates previous discoveries of 27 important prognostic genes while also identifying 3 genes as potentially novel prognostic factors. Based on gene ontology analysis, we additionally show that these genes have plausible connections to breast cancer biology that should be experimentally investigated.
Conclusions:
Here, we demonstrate the utility of the unsupervised random forest model over K-means clustering for identifying important genes in breast cancer.
Introduction
Globally, breast cancer is the most diagnosed cancer in women and a leading cause of death for women, with over 2 million new cases diagnosed each year. In addition to genetic risk factors, environmental risk factors include body mass index (BMI), alcohol consumption, and smoking status, among others. 1 Breast cancer survival rate and time are significantly affected by the presence or absence of the hormones estrogen and progesterone, and the gene Human Epidermal Growth Factor Receptor 2 (HER2). 2 Estrogen and progesterone are positive prognostic factors (PF) for those with breast cancer, while high expression of HER2 is a negative PF. 3 The most prominent mutations that affect breast cancer are BRCA1 and BRCA2 mutations, which lead to a significantly increased risk of breast cancer for those who have either of the mutations. 4 Unlike many other cancers, breast cancer is not typically driven by many of the other classical cancer-associated genes such as TP53, CHEK2, and PTEN, furthering the motivation to better characterize the genetic bases of the disease.5-7 The objective of this research is to find which genes are the most important PFs for breast cancer survival time using a newly applied method. Among prior approaches to identify important PFs for breast cancer, one prominent example is the PAM50 gene set. PAM50 consists of 50 different genes whose expression levels were determined to have a significant impact on breast cancer prognosis. 8 Earlier work by Pereira et al had demonstrated the utility of the METABRIC dataset to perform survival analysis based on unsupervised K-means clustering of gene expression levels.9,10 Our approach differs from this earlier work using unsupervised clustering by integrating the PAM50 approach to find genes that significantly impact breast cancer survival time.
Unlike the K-means clustering model, the unsupervised random forest (URF) allows us to incorporate categorical clinical variables in the model as well as obtain insight about the variable importance. Furthermore, we perform a literature search on the 30 most important genes to better understand the biological and prognostic relevance of each gene. Our unique approach allows us to find PFs that should be further investigated and independently corroborated.
Methods
Dataset
The reporting of this study conforms to the TRIPOD-Cluster statement 11 (Supplemental Table 1). The accompanying QUIPS form is available in Supplemental Table 2. This research used data from the METABRIC dataset (n = 1,518). The METABRIC dataset consists of genetic and clinical data from primary breast cancer tumors from tumor banks in the United Kingdom and Canada. The dataset has 20,410 variables in total. 23 of these variables are clinical variables and 20,387 are genetic variables consisting of mRNA expression Z-scores for 20,387 different genes.9,10 All participant samples were subject to the same clinical, genotypic and transcriptomic analysis to account for heterogeneity between collection locations. A total of 1,518 participants with no missing data were considered for URF clustering and survival analysis.
Selection Criteria and Variable Manipulation
Only individuals who had genetic data for both gene mutations and gene expression were included in our analysis. HER2 status was defined as positive if the subject had “gain” for the “Her2_Status” variable and negative if they had anything else. For the histologic cancer type variable, subjects with the cancer types: “other,” “metaplastic,” “tubular,” “mucinous,” “cribriform,” “medullary,” or “no value” were all recorded as “other,” due to small counts for each one. The gene expression Z-scores were turned into quartiles to standardize the dataset and improve interpretability for the URF variable importance grading. To limit the selection of important genes to the extremes, the top 250 and bottom 250 genes by Z-score were selected for unsupervised clustering. 6 clinical variables were selected for the URF model: number of positive lymph nodes, estrogen status, HER2 status, cancer type, age at diagnosis, and cellularity. The survival data was censored at 120 months.
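The quartile transformation and the 120-month censoring described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the inclusive quantile method, and the toy values are assumptions made only for illustration, and events observed after the horizon are assumed to be treated as censored.

```python
from statistics import quantiles

def to_quartiles(z_scores):
    """Map each Z-score to a quartile label (1-4) using the sample's own cut points."""
    q1, q2, q3 = quantiles(z_scores, n=4, method="inclusive")
    # Each comparison adds 1 when the value lies above the corresponding cut point.
    return [1 + (z > q1) + (z > q2) + (z > q3) for z in z_scores]

def censor(time_months, event, horizon=120.0):
    """Administratively censor follow-up at `horizon` months (assumed rule)."""
    if time_months > horizon:
        return horizon, 0  # follow-up truncated, event no longer counted
    return time_months, event

# Hypothetical expression Z-scores for one gene across 8 participants.
z = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5, 2.1, -2.0]
quartile_labels = to_quartiles(z)

# Hypothetical (time in months, event indicator) pairs.
censored = [censor(t, e) for t, e in [(30.0, 1), (150.0, 1), (120.0, 0)]]
```

Binning by the sample's own quartiles puts every gene on the same 4-level scale, which is one way to read the standardization goal stated above.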
Unsupervised Clustering
To examine which genes affect breast cancer survival time, we followed Breiman’s method and used a URF to generate a proximity matrix, followed by Partitioning Around the Medoids (PAM) to identify 3 clusters. 12 By design, the URF model does not have a dependent variable from the input data that is used to train the model. Rather, the random forest generates a synthetic set of data from the input data that is the same size. This shifts the task into a classification problem where the response variable is whether or not the observation is from the synthetic dataset. The algorithm then resembles a classical random forest classifier with the new binary response variable. To generate the synthetic data using Breiman’s method, all of the input data is assigned a class label, 1, and a second, randomly sampled dataset of the same size is labeled class 2, with each variable drawn from its univariate distribution in the input data. We then applied the random forest classifier to the concatenated 2-class data. 13 Each value in the proximity matrix is the proportion of trees in which a given pair of observations fell in the same terminal node. 14 The proximity matrix was transformed into a dissimilarity matrix by subtracting each value from 1. For the clustering method, we chose to apply PAM as it has been used before with URF models. 12 A principal component analysis was performed using the fviz_cluster function in the factoextra package 15 to visualize the 3 clusters.
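To make the synthetic-data and proximity steps concrete, the following Python sketch shows one possible reading of them. It is a sketch under stated assumptions, not the authors' R implementation: the random forest classifier and the PAM step are omitted, and the helper names, the toy rows, and the toy terminal-node assignments are hypothetical.

```python
import random

def make_synthetic(rows, rng):
    """Sample every column independently, with replacement, from its observed values.

    Each univariate distribution is preserved, but the dependence between
    columns is broken, which is the structure the classifier must detect.
    """
    n = len(rows)
    columns = list(zip(*rows))
    sampled_columns = [[rng.choice(col) for _ in range(n)] for col in columns]
    return [list(row) for row in zip(*sampled_columns)]

def proximity_from_leaves(leaf_ids):
    """leaf_ids[t][i] is the terminal node of observation i in tree t."""
    n_trees = len(leaf_ids)
    n_obs = len(leaf_ids[0])
    counts = [[0] * n_obs for _ in range(n_obs)]
    for tree in leaf_ids:
        for i in range(n_obs):
            for j in range(n_obs):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    # Proportion of trees in which each pair shares a terminal node.
    return [[c / n_trees for c in row] for row in counts]

def to_dissimilarity(proximity):
    """Dissimilarity is 1 minus proximity, as described in the text."""
    return [[1.0 - p for p in row] for row in proximity]

rng = random.Random(0)
# Hypothetical mixed-type rows (for example a count, a status, a Z-score).
real = [[1, "pos", 0.3], [2, "neg", -1.1], [3, "pos", 0.8], [4, "neg", 0.0]]
synthetic = make_synthetic(real, rng)
features = real + synthetic                       # concatenated 2-class data
labels = [1] * len(real) + [2] * len(synthetic)   # 1 = observed, 2 = synthetic

# Hypothetical terminal nodes for 3 observations in 2 trees.
leaf_ids = [[0, 0, 1], [0, 1, 1]]
dissimilarity = to_dissimilarity(proximity_from_leaves(leaf_ids))
```

In a full workflow, a random forest classifier would be fitted to `features` and `labels`, and the dissimilarities among the observed rows would be passed to a medoid-based clustering routine.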
Statistical Survival Analysis for Each Cluster
The clusters from the PAM algorithm were also subjected to further analysis, including Kaplan-Meier (KM) survival curve comparison and a heatmap of the expression of the top 30 URF genes. These analyses were stratified by cluster. The survival analysis was done using the Log-Rank test to compare the clusters. First, the test was done as an omnibus test to check whether any of the curves were significantly different. This was done using the survfit function in the survival package in R and was plotted using the ggsurvplot function in the survminer package in R.16,17 Next, the test was performed pairwise, comparing each curve to each of the other curves. This was done using the pairwise_survdiff function from the survminer package in R.
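For readers unfamiliar with the Log-Rank test, a pairwise comparison between 2 clusters can be sketched in Python as follows. This is an illustrative re-implementation of the standard 2-group statistic with 1 degree of freedom, not the survminer code used in the analysis; the function name and the toy data are hypothetical, and no multiple-testing adjustment is applied.

```python
import math

def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Return (chi_square, p_value) for the 2-group log-rank test."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Numbers at risk just before time t in each group.
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# Hypothetical follow-up times in months; every participant has an event.
chi_sq, p = logrank_two_groups([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
chi_same, p_same = logrank_two_groups([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
```

The statistic compares the observed number of events in one group with the number expected if both groups shared a single survival curve.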
Biological Significance Analysis of Important Genes
The importance of each gene in the URF model was defined by the mean decrease in the Gini coefficient for that gene, which is the default importance measure for the randomForest package 18 in R. The Gini coefficient was chosen as the impurity feature importance metric because it is native to the randomForest package, is computationally efficient and is relatively easy to interpret. The log-loss impurity was not considered appropriate as it implies a probabilistic interpretation. Permutation feature importance was not considered because it may mischaracterize important features that happen to be correlated. 19 Other feature importance metrics such as SHAP are computationally intensive but may be useful in related applications of URF. 20 The top 30 most important genes from the URF were subjected to further analysis, including Reactome interaction plots using StringDB and gene set analysis using Reactome pathways.21-23 The gene set analysis was performed using the enrichR package in R.24,25 Reactome is an open-source, peer-reviewed biological pathway database. It consists of biological pathways and the genes that are a part of each pathway, as well as the interactions between genes within the pathway. 23 For the Reactome interaction plots, thicker lines indicate a known interaction between the genes and thin lines indicate an interaction predicted by Reactome. 22 In addition, a literature search was performed to determine whether prior research had established a link between the genes and breast cancer. Google Scholar was used as the search engine to find published papers that examined each gene’s prognostic effect on breast cancer and whether that gene affected the tumor being estrogen positive. Gene Ontology (GO) terms for selected important genes were used to identify relevant biological processes.26-28
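The idea behind the mean decrease in Gini can be illustrated with a small Python sketch. It assumes the common definition in which the impurity decreases from splits on a variable are summed within each tree and then averaged over trees; the function names and toy splits are hypothetical, and the exact weighting used internally by the randomForest package may differ.

```python
def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

def mean_decrease_gini(per_tree_splits, n_trees):
    """per_tree_splits: one list of (variable, decrease) pairs per tree."""
    totals = {}
    for splits in per_tree_splits:
        for variable, decrease in splits:
            totals[variable] = totals.get(variable, 0.0) + decrease
    return {variable: total / n_trees for variable, total in totals.items()}

# Class 1 = observed data, class 2 = synthetic data (toy node contents).
perfect = gini_decrease([1, 1, 2, 2], [1, 1], [2, 2])   # split separates the classes
useless = gini_decrease([1, 1, 2, 2], [1, 2], [1, 2])   # split leaves them mixed

# Hypothetical forest of 2 trees splitting on 2 variables.
importance = mean_decrease_gini(
    [[("GENE_A", perfect)], [("GENE_A", perfect), ("GENE_B", useless)]],
    n_trees=2,
)
```

A variable that repeatedly separates observed from synthetic rows accumulates a large decrease, which is why it ranks as important in the URF.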
Quantifying Cluster Heterogeneity
The analysis of heterogeneity among the identified clusters focused on the top 30 genes. In lieu of fold changes, we calculated the difference in average Z-scores (∆Z-score) across participants for each gene, performed a 2-sided independent t-test to estimate the P-value, and reported both values in Table 1.
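As a worked illustration of the ∆Z-score, the Python sketch below computes the difference in mean Z-scores between 2 clusters for a single gene, together with a t statistic. The unequal-variance (Welch) form is an assumption, since the text does not state which variant of the independent t-test was used, and the values are hypothetical. The P-value would then be read from the t distribution, which is not computed here.

```python
import math
from statistics import mean, variance

def delta_z(cluster_a, cluster_b):
    """Difference in mean expression Z-score between 2 clusters for one gene."""
    return mean(cluster_a) - mean(cluster_b)

def welch_t(cluster_a, cluster_b):
    """t statistic using unpooled sample variances (assumed variant)."""
    standard_error = math.sqrt(
        variance(cluster_a) / len(cluster_a) + variance(cluster_b) / len(cluster_b)
    )
    return delta_z(cluster_a, cluster_b) / standard_error

# Hypothetical Z-scores for one gene in 2 clusters.
gene_in_cluster_1 = [1.0, 1.2, 0.8, 1.0]
gene_in_cluster_2 = [0.0, 0.2, -0.2, 0.0]
dz = delta_z(gene_in_cluster_1, gene_in_cluster_2)
t_stat = welch_t(gene_in_cluster_1, gene_in_cluster_2)
```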
Table 1. Summary of the Top 30 Genes.
Genes that represent the novel findings in this work (group II) are underlined. Group IV genes, which were discordant with the literature with respect to breast cancer significance, are highlighted in bold.
Results
Survival Clustering Analysis
Figure 1A shows the PCA plot of the 3 clusters across the top 2 principal components, termed “Dim1” and “Dim2.” Figure 1B shows 3 survival curves, each corresponding to 1 of the 3 PAM-generated clusters. The overall Log-Rank test compares the 3 survival curves under the null hypothesis that they are the same over time; a P-value below the .05 threshold indicates that survival time differs significantly among the 3 curves. There are 441, 811 and 266 participants in Clusters 1, 2 and 3, respectively. Figure 1C shows the pairwise survival comparison, where each survival curve was compared with each of the other survival curves. All the P-values were less than .05, indicating that all of the clusters have significantly different survival times from each other. Figure 2A shows the top 30 most important genes from the URF using mean decrease in the Gini coefficient. Figure 2B shows the mean gene expression for the top 30 URF genes stratified by cluster. PSAT1 and PPP1R14C are highly expressed in Cluster 3 and lowly expressed in Cluster 1, while the other 28 genes are highly expressed in Cluster 1 and lowly expressed in Cluster 3. In Cluster 2, all of the genes are expressed at average levels.

Figure 1. Principal component analysis plot from URF clustering shows strong separation between each cluster (A). “Dim1” and “Dim2” correspond to the first and second principal components and account for 31.3% and 8.5% of the variance, respectively. The numbers of participants in Clusters 1, 2 and 3 are 441, 811 and 266, respectively. The Kaplan-Meier survival plot for the 3 identified clusters (B). The Log-Rank P-values between the clusters (C).

Figure 2. A feature dot plot of the top 30 most important variables, including clinical variables, in the URF model using the variable importance metric, mean decrease in Gini coefficient (A). A heatmap, stratified by cluster, of the top 30 most important genes in the URF model. The scalebar indicates the Z-score values for gene expression levels (B). Gene enrichment analysis of the top 30 most important genes in the URF model using the database Reactome 2022. P-values are calculated using Fisher’s exact test and corrected using the Benjamini-Hochberg procedure (C). A gene network plot of the top 30 most important genes in the URF model using the database Reactome 2022 (D). Solid lines between the gene nodes indicate a known interaction between the genes in the Reactome 2022 database.
Gene Set Enrichment Pathway Analysis
Figure 2C shows the gene set enrichment analysis with the top 30 URF genes. The most prominent pathway was “Estrogen-dependent gene expression.” The other most prominent pathways were Estrogen receptor (ESR)-Mediated Signaling and Signaling By Nuclear Receptors. All 3 of these pathways included the same 5 genes: FOXA1, GATA3, ESR1, BCL2, and MYB. Figure 2D shows the Reactome pathway analysis for the genes. Notably, 4 of these 5 genes (FOXA1, GATA3, ESR1, and MYB) were the only genes that had any interaction with each other.
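The enrichment P-values in Figure 2C are described as coming from Fisher’s exact test with Benjamini-Hochberg (BH) correction. A minimal Python sketch of both calculations is given below. It is not the enrichR implementation: the 1-sided hypergeometric form of the test is an assumption, and the function names, set sizes and P-values are hypothetical.

```python
from math import comb

def enrichment_p(overlap, gene_list_size, pathway_size, background_size):
    """1-sided Fisher (hypergeometric) P-value for seeing at least `overlap` genes."""
    total = comb(background_size, gene_list_size)
    upper = min(gene_list_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, gene_list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def benjamini_hochberg(p_values):
    """Return BH-adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy case: all 3 selected genes fall in a 3-gene pathway out of 10 background genes.
p_small = enrichment_p(overlap=3, gene_list_size=3, pathway_size=3, background_size=10)

# Hypothetical raw P-values for 4 pathways.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

The adjustment scales each P-value by the number of tests divided by its rank, so a pathway is reported only if it remains significant after accounting for all pathways tested.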
Identifying Prognostic Factors for Breast Cancer
Using the METABRIC dataset, we were able to create 3 distinct clusters using URF and PAM in tandem. Per the KM plots and subsequent Log-Rank tests, we were able to establish that the clusters each had a distinct survival curve. Cluster 3 had the worst survival rate, Cluster 1 had the best survival rate, and Cluster 2 was intermediate. Around 33% of the breast cancer participants in Cluster 3 had died by around 45 months, compared to around 10% for Cluster 2 and around 5% for Cluster 1. Participants in Cluster 3 had a massively increased risk of dying of breast cancer within the first 45 months of being diagnosed compared to the other 2 clusters. When looking at the heatmap in Figure 2B, stratified by cluster, we can see the stark differences between the clusters. Cluster 2 has mean Z-scores for gene expression near 0 for all of the top 30 URF genes, indicating that this gene expression pattern represents an average breast cancer participant. For Cluster 3, PPP1R14C and PSAT1 are highly expressed, while the other 28 genes are under-expressed. The opposite is true in Cluster 1, where PPP1R14C and PSAT1 are under-expressed, and the other 28 genes are over-expressed. From this, we can gather that PPP1R14C and PSAT1 are negative PFs as they are over-expressed in Cluster 3, which has the lowest survival rate, and the other 28 genes are positive PFs as they are over-expressed in Cluster 1, the cluster with the highest survival rate.
Prognostic Factors Are Highly Associated With Estrogen Pathways
For the gene set enrichment analysis, the only pathways that were enriched among the group of prognostic genes were pathways that were related to estrogen expression in the body. The genes that are involved in this pathway were ESR1, GATA3, BCL2, MYB, and FOXA1. This reinforces that important genes in breast cancer survival time are more likely to be involved in the estrogen pathway than non-prognostic genes. The Reactome interaction plot showed that the only genes with established interactions with each other were ESR1, GATA3, MYB, and FOXA1. These genes are part of the estrogen pathway, so that pathway is very prominently represented in our group of prognostic genes.
Literature Search of Identified Prognostic Factors
The literature search revealed that 27/30 genes had at least 1 paper written about their relationship with breast cancer, while 3 genes did not: DEGS2, RUNDC1, and GAMT. Of those that had published papers written about them, 23 of them were considered positive PFs by the researchers: AFF3, AGR3, BCL2, CA12, CCDC24, DACH1, DNAJC12, ESR1, FOXA1, GATA3, IL6ST, KLHDC9, MAPT, MLPH, MYB, NME3, NME5, RERG, SUSD3, TBC1D9, TMC4, XBP1, ZG16B and 2 were considered negative PFs: PSAT1 and PPP1R14C.29 -51 C9ORF116 and DNALI1 are predicted to be positive PFs, according to our analysis, but are predictive of bone metastasis. 52 Of these 27 genes with prior literature, 25 of them had a PF status in the literature search that agreed with our research.
We classify the genes into 4 categories: groups I, II, III and IV. The group assignments for the top 30 genes are presented in Table 1. Group I is the validated group, whose genes have the same prognostic status in our analysis as in the literature search: AFF3, AGR3, BCL2, CA12, CCDC24, DACH1, DNAJC12, ESR1, FOXA1, GATA3, IL6ST, KLHDC9, MAPT, MLPH, MYB, NME3, NME5, PPP1R14C, PSAT1, RERG, SUSD3, TBC1D9, TMC4 and ZG16B. Importantly, 5 of the group I genes overlapped with positive PFs in the previously defined PAM50 set: BCL2, ESR1, FOXA1, MAPT, and MLPH. Specifically, elevated expression of these genes is associated with lower proliferation and lower risk of recurrence. 10 The group I negative PFs, PSAT1 and PPP1R14C, are both negative regulators of the GSK3β-dependent pathways.50,51 Group II is the unvalidated group, whose genes were significant positive PFs in our analysis but did not have any papers connecting them to breast cancer outcomes as of July 2025: DEGS2 and RUNDC1. Group III consists of predicted positive PFs that have roles in metastasis: C9ORF116, GAMT and DNALI1. In the context of metastatic breast cancer cases, BCL2, C9ORF116, DNALI1, ESR1, and GATA3 are predictive of metastasis to bone relative to other tissues. 52 PFs are context-dependent and may require different interpretations under different circumstances. Since ESR1, GATA3, and BCL2 have been independently corroborated as positive PFs, they are categorized as group I. Group IV is defined as genes that were predicted to be positive PFs in our study but for which published literature indicates the opposite prognostic effect.
Group IV consists of XBP1, which has previously been reported to be a negative PF by promoting hormone therapy resistance when overexpressed but appears to be a positive PF in our analysis.49,53 Based on Cluster 3 in the URF model, loss of expression among identified important genes, including XBP1, with high PSAT1 and PPP1R14C expression is a negative PF for survival (Figure 2). Using our method, we show that loss of XBP1 can be consequential for breast cancer survival under certain circumstances.
We used the GO database and existing literature to better understand the function and role of the groups II and III prognostic genes in breast cancer outcomes. DEGS2 is involved with ceramide biosynthesis (GO:0046513) and overexpression or depletion contributes to metastasis in colorectal cancer. 54 The related paralog, DEGS1, is part of the 70-gene PF panel from the Netherlands Cancer Institute. 55 Ceramides can induce cell death and are associated with less aggressive breast cancer types, and we hypothesize that ceramide synthesis is a mechanism behind our observations.54,56,57 Since DEGS1 and DEGS2 are paralogs, it is possible that the PF nature of DEGS2 was overshadowed by DEGS1 activity in previous works. Recent work has identified RUNDC1 as required for TP53 tumor suppression, and it may be involved in downregulating the autophagosome and regulating TP53.58-60 Since TP53 is a known tumor suppressor, RUNDC1 expression is likely protective by this mechanism. RUNDC1 has not been well-characterized in previous studies because TP53 activity might otherwise explain the observations. These findings corroborate our results that DEGS2 and RUNDC1 are positive PFs and are elevated in Cluster 1. GAMT is a purine methyltransferase that is involved in spermatogenesis (GO:0007283), creatine metabolism (GO:0006600, GO:0006601) and muscle contraction (GO:0006936).26-28 Previous research has shown the importance of purine metabolism in breast cancer and creatine metabolism in metastasis.61-63 Specifically, dysregulated accumulation or depletion of uric acid is associated with poor survival prognosis and accumulation of creatine promotes metastasis in various cancer types.63,64 The dual role of GAMT in breast cancer metabolism may account for the pleiotropic effects in breast cancer survival outcomes. This dual role may have obscured GAMT as a PF candidate by other methods.
C9ORF116 is involved in microtubule organization and determining left/right symmetry (GO:0007368, GO:0061966).26-28 Further work is needed to investigate the physiological and clinical significance of these genes in breast cancer. Note that neither RUNDC1 nor DNALI1 has a formal GO entry.
Cluster Heterogeneity
The 3 clusters are stratified such that Clusters 1 to 3 represent the best-case, average-case and worst-case survival outcomes, respectively. Clusters 1 and 2 have similar expression patterns for the top 30 genes except that Cluster 1 has significantly higher Z-scores with the exception of PSAT1 and PPP1R14C, which are Cluster 3-specific (Table 1). IL6ST shows the largest difference in Z-score between Clusters 1 and 2, highlighting the positive impact of IL6 cytokine signaling on survival outcomes. 40 The next 4 genes with the largest difference between Clusters 1 and 2 are mostly in Group I with the exception of RUNDC1. As mentioned earlier, Cluster 3 exhibits lower Z-scores for 28 of the 30 top genes, with the exception of PSAT1 and PPP1R14C, which are uniquely upregulated. Group I genes, including ESR1, GATA3, AGR3, CA12, MLPH, FOXA1 and TBC1D9 represent the largest differences between Clusters 1 & 2 relative to Cluster 3 (Table 1).
To check whether the cluster assignments reflected previously-established findings, we compared our cluster assignments to the available Nottingham Prognostic Index (NPI) scores. We calculated the mean NPI score for each cluster, which increases monotonically from Cluster 1 to Cluster 3. Clusters 1, 2 and 3 exhibited mean NPI scores of 3.5, 4.1 and 4.7, respectively. The NPI scores ranged from 1.02 to 6.36 with a median of 4.04.
Risk of Bias Assessment
The risk of bias was assessed for our use of the METABRIC dataset with the QUIPS tool (Supplemental Table 2). Overall, bias was considered low-to-moderate across the 6 domains. Low risk of bias domains included PF measurement, outcome measurement and statistical analysis & reporting. We identified moderate risk of bias for study participation, study attrition and study confounding. Study participation is likely biased toward breast invasive ductal carcinoma because about 76% of the participants shared this histological type, although we argue that this reflects a reasonable sampling of the true breast cancer subtype distribution. 65 The METABRIC dataset includes some loss of participant data but does not provide a detailed description for each lost participant, which we argue warrants a moderate risk of bias. 9 Participant age (reported as age of diagnosis) and tumor type were both included in the model, but the analysis was not stratified to control for these variables, which we argue confers a moderate risk of bias. Cancer treatment status, including chemotherapy, hormone therapy, radiotherapy and breast surgery, was not included in the final model and is a potential confounder that may bias our findings toward therapy-related gene expression effects. We use the clustering step to find the natural stratifications as a way of accounting for these effects. An exploratory analysis on the potential link between treatment and gene expression revealed that 76% and 70% of participants in Clusters 1 and 2, respectively, received hormone therapy, radiotherapy or both (but not chemotherapy), while 59% of participants in Cluster 3 received chemotherapy, radiotherapy or both (but not hormone therapy). This confirms distinct treatment profiles for the clusters that were organically identified by URF.
Using the limma package in R, we modeled gene expression relative to the therapeutic strategies by fitting a linear model with empirical Bayes moderation to rank the gene expression contrasts based on the treatment groups. 66 Participants who received chemotherapy with or without radiotherapy, but not hormone therapy or radiotherapy only, exhibited a significant (BH adjusted P-value < 0.05) downregulation of groups I, II, III and IV genes relative to untreated participants with the exception of PSAT1 and PPP1R14C, which were upregulated. It has been shown that upregulation of PSAT1 and PPP1R14C is associated with triple negative breast cancer, which typically warrants chemotherapy treatment. However, many triple negative breast cancer patients develop resistance, leading to poor survival. 51 It is plausible that the Cluster 3 expression profile is a product of both treatment decisions and potentially chemotherapy-induced effects. Therefore, we cannot rule out chemotherapy as a confounder for survival. The same analysis revealed that hormone therapy was associated with some significant, but negligible changes. For detailed summaries of this analysis, we report the full results of the linear modeling in Supplemental Tables 3–9. The NPI and integrative cluster numbers were clinical variables included in the METABRIC dataset but were not included as part of the model to avoid bias toward already established PF measurements.9,67
Discussion
Summary of Findings
The 30 genes presented here represent the extremes of the gene expression profile in breast cancers that cluster with a reasonable stratification of survival. Most of these genes have some connection to breast or other cancers. The 2 negative PFs we identify overlap in their function to inactivate GSK3β, which is common among triple-negative breast cancers and is known to affect survival outcomes.50,51 The positive PFs tend to exist as a continuum from Cluster 2, which represents near-normal tissue expression to Cluster 1, which consistently has higher expression and better survival.
Comparison With Other Unsupervised Approaches
Extensive efforts have been made to identify clinically-relevant features in breast cancer, including genotypes (eg, BRCA1/2 variants), gene expression (eg, HER2 positivity) and tumor properties (eg, NPI) and relate those features with survival outcomes (PFs) and/or treatment success (predictive pathology factors). Related unsupervised clustering tasks have been described in the breast cancer PF literature as part of a supervised learning workflow. The clustering approaches include K-means, self-organized mapping and hierarchical clustering, which are sufficient to discriminate between samples.10,55,68 -70 The work presented here requires no supervised learning step as the clustering step is used to find coincidental survival patterns for each cluster and rank those features/genes by importance.
Future Directions
Further extensions of this analysis could yield other interesting discoveries. One extension could be adding mutation status or copy number variations into the model. This would allow the researcher to analyze what gene mutations or copy number variants could be significant PFs in breast cancer survival time. By grouping treated and untreated patients together, the identified genes may represent gene expression patterns associated with treatment success. While this may not represent the untreated breast cancer patient, the insights are still valuable as they represent the typical patient. One could extend the URF model to integrate multi-omics technologies (proteomics, metabolomics, transcriptomics) to understand the molecular mechanisms of disease. While multi-omics technologies can offer mechanistic insights in a research setting, adopting multi-omics approaches in a clinical setting is likely tempered by the increased cost barrier relative to any increased utility over current methods. Another extension could be using a regularized Cox Proportional Hazards model to analyze the 30 important genes in our analysis, although this is beyond the scope of this work. This would allow the researcher to compare the magnitude of the gene’s effects on breast cancer mortality.
Our approach stratifies mortality from around 5% to around 33% at around 45 months. This period is shorter than the 5-year survival analysis used for NPI analysis, yet we see a large difference within this window. Future implementations of URF could extend this period to other widely-used time frames including and beyond 5-year survival for direct comparison.
Conclusion
We present a random forest-based unsupervised clustering method for use in gene expression survival analysis datasets. Our results identify several genes with previously undescribed associations with breast cancer with potential utility as PFs and/or therapeutic targets. Some of the identified genes require further research to understand their role in breast cancer (groups II, III & IV), while 90% of the top 30 genes have a previously-described role in breast cancer and may be therapeutic targets (group I). The group II genes are new discoveries as they have not been described before in the context of breast cancer prognosis. The existence of groups III and IV is a reminder that our model can be limited by confounding and non-linear effects such as metastasis, gene-gene and gene-environment interactions. Interactions between these 30 genes and progesterone, estrogen and HER2 status should be studied further as well because of their importance in breast cancer survival rate.
Supplemental Material
sj-xlsx-1-cix-10.1177_11769351251393146 through sj-xlsx-9-cix-10.1177_11769351251393146 – Supplemental material for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time
Supplemental material, sj-xlsx-1-cix-10.1177_11769351251393146 through sj-xlsx-9-cix-10.1177_11769351251393146, for Unsupervised Random Forest Identifies Important Genetic Prognostic Factors for Breast Cancer Survival Time by Benjamin Goldberg, Eric Nels Pederson and Zhengqing Ouyang in Cancer Informatics
Footnotes
Abbreviations
AFF3: ALF transcription elongation factor 3; AGR3: Anterior gradient 3; BCL2: B-cell lymphoma 2; BH: Benjamini-Hochberg; BMI: Body mass index; BRCA1: Breast cancer susceptibility gene 1; BRCA2: Breast cancer susceptibility gene 2; C9ORF116: Chromosome 9 open reading frame 116; CA12: Carbonic anhydrase 12; CCDC24: Coiled-coil domain containing 24; CHEK2: Checkpoint kinase 2; DACH1: Dachshund homolog 1; DEGS1: Delta 4-desaturase, sphingolipid 1; DEGS2: Delta 4-desaturase, sphingolipid 2; DNAJC12: DnaJ heat shock protein family (Hsp40) member C12; DNALI1: Dynein axonemal light intermediate chain 1; ESR: Estrogen receptor-mediated signaling; ESR1: Estrogen receptor 1; FOXA1: Forkhead box A1; GAMT: Guanidinoacetate N-methyltransferase; GATA3: GATA binding protein 3; GO: Gene ontology; GSK3β: Glycogen synthase kinase 3 β; HER2: Human epidermal growth factor receptor 2; IL6ST: Interleukin 6 signal transducer; KLHDC9: Kelch domain containing 9; KM: Kaplan-Meier; MAPT: Microtubule-associated protein tau; METABRIC: Molecular Taxonomy of Breast Cancer International Consortium; MLPH: Melanophilin; mRNA: Messenger ribonucleic acid; MYB: Myeloblastosis oncogene; NME3: Nucleoside diphosphate kinase 3; NME5: Nucleoside diphosphate kinase 5; PAM: Partitioning around the medoids; PAM50: Prediction analysis of microarray 50; PCA: Principal component analysis; PPP1R14C: Protein phosphatase 1 regulatory inhibitor subunit 14C; PSAT1: Phosphoserine aminotransferase 1; PTEN: Phosphatase and tensin homolog; QUIPS: Quality in prognostic studies; RERG: Ras-like, estrogen-regulated, growth inhibitor; RUNDC1: RUN domain containing 1; SUSD3: Sushi domain containing 3; TBC1D9: TBC1 domain family member 9; TMC4: Transmembrane channel-like protein 4; TP53: Tumor protein 53; TRIPOD-Cluster: Transparent reporting of multivariable prediction models developed or validated using clustered data; URF: Unsupervised random forest; XBP1: X-box binding protein 1; ZG16B: Zymogen granule protein 16B.
Ethical Considerations
Ethical approval was not required for this research as the METABRIC dataset is publicly available.
Consent to Participate
All participant specimens were obtained with appropriate consent from the relevant institutional review board as described in the original METABRIC publication.9
Author Contributions
Conceptualization: B.G. and Z.O.; Methodology: B.G. and Z.O.; Software: B.G.; Data and Resources: B.G.; Writing - Original Draft: B.G.; Writing - Review & Editing: E.N.P. and Z.O.; Figures: B.G. and E.N.P.; Supervision: Z.O.; Funding Acquisition: Z.O. All authors approved the version for publication.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the start-up fund to the Ouyang Laboratory from the University of Massachusetts Amherst.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The normalized mRNA levels and survival rate datasets for this study can be found in cBioPortal (https://www.cbioportal.org/study/summary?id=brca_metabric).9,10 The processed data discussed are available to view as an interactive Shiny app, which can be found at the Ouyang Laboratory GitHub page. Instructions for installation and execution are included in the README.md file.
Declaration of AI Use
No AI tools were used by the authors to refine language. No scientific data were generated or modified using AI.
Supplemental Material
Supplemental material for this article is available online.
References
