Abstract
Background:
Colon cancer is a highly heterogeneous disease, marked by substantial intra- and inter-tumor variability. Investigating transcriptomic profiles can offer deeper insight into this heterogeneity. However, most genome-transcriptome studies on colon cancer have primarily focused on examining primary tumors and matched normal tissues, often neglecting the multi-stage disease progression.
Objective:
To establish unique molecular colon subtypes based on the progression in transcriptomic profiles. Additionally, to investigate the implicated factors, such as mutations and the tumor microenvironment (TME), that affect colon cancer progression and their implications for therapy.
Methods:
RNA-sequencing data from The Cancer Genome Atlas Colon Cancer (TCGA-COAD) cohort were obtained from the UCSC Xena database, including 47 early and 39 late-stage tumor samples. Heterogeneity was exposed by tracking cancer progression through the multi-stages of cancer development. Hierarchical clustering revealed colon subtypes with varying progression, and differentially expressed genes (DEGs) were identified between these subtypes. The DEGs were subjected to Recursive Feature Elimination and mutational analyses to reveal driver genes. The TME and biological pathways were analyzed. The study was validated with an independent GEO dataset.
Results:
Two novel colon subtypes were identified. Significant enrichment pathways and varied mutations in cancer driver genes were found in both subtypes. Interestingly, concurrent downregulation of oncogenes and tumor suppressor genes was observed in one of the subtypes, suggesting a link to the dual functionality of CD4 and CD8 T-cells in the TME.
Conclusion:
Overall, our study demonstrates a complex relationship between TME and gene expression of driver genes. The presence of immune cell fractions with dual functions suggests a balanced early-to-late-stage progression. The findings provide insights into the disease progression that potentially contribute to the development of targeted therapies.
Introduction
Colon cancer is marked by the uncontrolled growth of malignant cells in the large intestine. It frequently develops from adenomatous polyps on the colon’s inner lining.1,2 According to the Global Cancer Statistics 2020, colorectal cancer is the second leading cause of cancer-related mortality, but ranks third in terms of incidence.3,4 The incidence rates of colon cancer vary significantly by region, with the highest rates observed in North America, Europe, and Australia/New Zealand.3,5 The disease develops through a tumor-initiating event involving genetic or epigenetic alterations that activate oncogenes or inactivate tumor suppressor genes (TSGs), followed by clonal expansion and subsequent metastasis.6,7 The TNM Classification system, based on tumor invasion (T), lymph node involvement (N), and metastases (M), is the primary prognostic tool, with 5-year survival rates of 94% for stage 1, 82% for stage 2, 67% for stage 3, and 11% for stage 4 in the disease progression. 7 Treatment includes surgery for early stages, with adjuvant chemotherapy for stage 3 and high-risk stage 2 tumors. For stage 4, options include metastasis resection, palliative chemotherapy, and radiotherapy for symptom management.6,8,9 Despite advances in diagnostic and therapeutic approaches, survival remains highly variable even within the same TNM stage.6,9 Although the TNM system has prognostic importance, it is limited in its ability to accurately identify high-risk patients for both preoperative and pathological staging. This issue is relevant across all stages of colorectal cancer, especially in the locally advanced stage. 10
Over the past decade, colon cancer has been increasingly recognized as a highly heterogeneous disease, characterized by significant intra-tumor and inter-tumor heterogeneity. This heterogeneity is a critical factor underlying the wide variation in oncological outcomes and prognostic trajectories observed among patients.9,11 Colon cancer, being a highly heterogeneous disease, involves DNA repair defects, DNA methylation, chromosome instability, and TME, as well as other molecular pathogeneses in the multi-stage cancer development that influence prognosis and recurrence risk in colon cancer.12 -15 Recent reports have indicated that the TME influences colon cancer heterogeneity because of its dynamic interactions between immune cells and tumor cells.13,16
Diverse response to treatment in patients at different stages has been linked to Inter-tumor heterogeneity caused by mutations and differential gene expression across colon cancer from various patients.11,17 Therefore, the gene expression patterns approach provides a more comprehensive view of inter-tumor heterogeneity and plays a crucial role in identifying diagnostic and predictive markers. 11 The approach considers heterogeneity influenced by the composition and impact of the TME, and the differential activity of cell signaling pathways in cancer cells that are not necessarily regulated solely by mutations. 11 The approach could be further implemented in the primary concern of standardized staging and treatment protocol, which fails to predict the appropriate stage for the whole cohort of colon cancer patients to receive adjuvant therapy, and reasons for poor prognosis across the stages. 18 This highlights the importance of gene expression analysis across the multi-stage disease progression.
The heterogeneity is further evident in the possible dual behavioral functions of immune19 -21 cells, acting both as anti-tumorigenic and pro-tumorigenic agents in the TME. 22 The immune cells act either as inhibitors of tumors (anti-tumor immunity) or as promoters of tumors (pro-tumor immunity), a “double-edged sword” within the TME that contributes to the observed heterogeneity. 22 Due to the high heterogeneity of the disease, no gene expression characteristics are reliable for prognostic stratification in clinical settings.19 -21
Recent studies underscore the potential of transcriptomic analysis to uncover the heterogeneity of colon cancer.23,24 Transcriptomics links molecular mechanisms to cellular activities, with RNA-seq enabling precise quantification of transcriptional outputs at single-cell and bulk levels.24,25 However, most genome-scale research is case-control studies, often neglecting the multi-stage progression of cancer from early to advanced stages.24,26 Examining transcriptomic changes across disease progression offers opportunities to identify novel subtypes, as RNA-seq variation during cancer development remains underexplored. According to the consensus molecular subtypes, 27 4 molecular subtypes were previously identified and reported: CMS1 (MSI-immune), CMS2 (canonical), CMS3 (metabolic), and CMS4 (mesenchymal), based on aggregated gene expression; however, these classifications did not account for stage-specific RNA-seq differences. These subtypes (CMS1-CMS4) highlight distinct biological characteristics and potential therapeutic responses. Moreover, the CMS classification was mainly derived from early-stage samples, with 92% of samples representing early-stage tumors at diagnosis. 28 It is therefore imperative to go beyond the primary tumor and incorporate studies that provide insight into how tumor characteristics evolve throughout the metastatic process.
Additionally, a potential limitation to consider is that the CMS Subtypes classification can differ across sampling sites. 29 This intratumor heterogeneity might undermine the reliability of the CMS classification, and variations may exist between the tumor center and the invasive front, as well as between primary tumors and their metastases.28,29 Our new subtypes aim to complement this existing framework by providing a more granular understanding of colon cancer progression based on specific RNA-seq expression patterns. In the era of personalized medicine, identifying novel subtypes based on cancer stage progression is crucial for recognizing inter-tumor heterogeneity and advancing targeted therapy development.
Tracking the progression of colon cancer provides valuable insights into the dynamic changes in the cellular transcriptome and deepens our understanding of the molecular mechanisms underlying carcinogenesis. This approach enables an efficient and thorough depiction of the gene expression patterns that vary over time across different conditions. Examining variations among identified subtypes enhances understanding of the characteristics and extent of heterogeneity in colon cancer. Furthermore, this knowledge enables the discovery of biomarkers and driver genes, thereby streamlining the screening process and improving prognosis.
In this study, a recently developed normalization method will be employed to capture the heterogeneity of colon cancer cells. By analyzing gene expression through RNA-seq, this method identifies molecular variations in tumor progression from early to late stages.30,31 Hierarchical clustering will subsequently be applied to visualize clusters with distinct progression patterns, enabling the sub-classification of heterogeneous colon cancer into newly identified molecular subtypes. The research also examines differentially expressed genes (DEGs), mutational variation, the TME, and pathways across subtypes. Additionally, analysis of DEGs to identify optimal gene subsets within cancer clusters was explored to refine subtype characterization. Previously established frameworks to stratify colorectal cancer, including microsatellite instability (MSI), 32 the CpG island methylator phenotype, 33 and transcriptome-based consensus molecular subtypes (CMS), 27 capture unique genomic and immunological characteristics of the disease. In this study, we built on previous research by identifying 2 transcriptome-derived subtypes, labeled L and S, tailored to the specific cohort we analyzed. While our dataset excluded corresponding methylation or MSI profiles, these subtypes present an additional population-level insight that may enhance existing classification systems.
Methods
Dataset Collection and Preprocessing
RNA-Seq data from the GDC TCGA-COAD project were obtained from the UCSC Xena database. 34 The dataset included log2 (count + 1) normalized expression data, phenotypic information, survival data, and somatic mutation profiles. Patient samples without matched phenotypic and gene expression profiles were discarded. To further reduce the influence of comorbidities on tumor progression, patients older than 75 years were excluded. We however acknowledge that this might affect the generalization of our findings in older patients. Additional selection criteria included a primary diagnosis of adenocarcinoma and the use of primary tumor samples, ensuring a specific focus on colon cancer cases. The final dataset comprised 47 early-stage and 39 late-stage samples. Gene expression profiles were converted to unnormalized count-based (count + 1) values and structured into a genomic matrix, with samples organized as columns and 60 661 Ensembl gene (ENSG) IDs as rows. 35 From this dataset, 19 569 ENSG identifiers annotated as protein-coding genes were extracted using Ensembl BioMart (GRCh38.p14, Ensembl 113, April 2024), resulting in the exclusion of 67.8% of non-coding genes. 36 The study’s workflow is depicted in Figure 1.

Study workflow. Steps used to analyze the dual role of the tumor microenvironment on newly identified subtypes of colon cancer. The steps comprise data extraction, normalization, differential gene expression, TME analysis, mutational analysis, pathway analysis, and validation.
Identification of Colon Cancer Subtypes
To identify molecular subtypes of colon cancer, a tracking cancer progression method was applied.
31
This method derives the expression matrices E (early) and A (advanced) from early- and late-stage samples, respectively. Initially, “m
In this study, the extracted protein-coding genes were normalized using the Livesey et al
31
method to capture molecular differences associated with tumor progression. Hierarchical clustering (HC) was then employed to visualize and identify colon cancer subtypes. Clustering was performed using the cosine distance between gene expression profiles, with Ward’s method applied for agglomeration.
37
The optimal number of subtypes (
Differential Gene Expression Analysis
Differential gene expression analysis was performed on 19 569 protein-coding genes across 39 late-stage samples to identify DEGs between the identified colon cancer subtypes. The Limma package in R (version 3.60.6)
39
was employed for this purpose, utilizing an empirical Bayesian approach to assess and detect significant changes in gene expression patterns between subtypes. The L subtype was used as the reference, while the S subtype was used as a test in the contrast matrix. DEGs were defined based on a Benjamini–Hochberg (BH) adjusted
Identification and Selection of Recursive Feature Elimination (RFE) Genes in Colon Subtypes
To identify the most predictive genes associated with colon cancer subtypes, the scikit-learn Python package 42 was utilized. The RFE algorithm, implemented with a linear kernel support vector machine (SVM), was applied to iteratively rank and select significant genes using DEGs as input. Through iterative feature elimination, less significant genes were systematically removed, and a Repeated Stratified K-Fold cross-validation approach with 7 splits and 7 repeats was employed to ensure robustly selected features. The final gene subset, selected using a linear SVM with a regularization parameter (C) set to 5, achieved the highest classification accuracy. Principal component analysis (PCA) was performed on the final gene subset using the FactoMineR (version 2.11) 43 and Factoextra (version 1.0.7) 44 R packages to explore variance and clustering patterns. Normalized gene expression profiles of each RFE selected gene were visualized using box plots, highlighting their distributions across colon cancer subtypes, and revealing subtype-specific expression patterns.
Differential miRNA Expression and Target Gene Analysis
Colon adenocarcinoma miRNA expression data were obtained from The Cancer Genome Atlas using the Bioconductor package TCGAbiolinks. Raw read counts from the miRNA expression quantification files, processed using the BCGSC miRNA profiling pipeline, were downloaded and merged into a numerical count matrix. Subtype S samples (n = 16) and Subtype L samples (n = 23) were compared to early-stage samples (n = 47) in a differential expression analysis using the edgeR-limma-voom pipeline. Low-expressed miRNAs were filtered out using edgeR filterByExpr() (with the group design and min.count = 10), which removes miRNAs with insufficient expression (CPM/counts) across an appropriate number of samples given the library sizes and experimental design. Library sizes were normalized using the trimmed mean of M-values (TMM) method via edgeR calcNormFactors(method = “TMM”). Differentially expressed miRNAs were identified using the limma-voom method by applying limma voom() to obtain log2-CPM values with precision weights, followed by linear modeling (lmFit()) and empirical Bayes moderation (eBayes()), and BH adjusted
Pathways Analysis
To elucidate the biological roles and functions of genes within the identified subtypes, the DEGs were subjected to Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. These analyses were conducted using the enrichGo and enrichKEGG functions from the clusterProfiler R package (version 4.12.6),
45
with a significant threshold of
Further enrichment pathway analysis was conducted using Reactome, a curated library of human biological pathways, and reactions. 47 The list of DEGs was submitted to the Reactome web program (https://reactome.org) for analysis. The over-representation method, with a default false discovery rate (FDR), was employed. The analysis used Reactome database release 92 and Reactome pathway browser version 3.7, accessed on May 15, 2025.
Mutational Landscape in Colon Subtypes
The mutational and somatic landscape of the colon cancer subtypes was analyzed and visualized using the Maftools R Bioconductor package (version 2.20.0). 48 This tool processes Mutation Annotation Format (MAF) files to extract detailed mutational data for each subtype. Oncoplots were generated to display the top mutated genes, mutation frequencies, and differences in mutational patterns across the subtypes. Additionally, the oncoplots provided insights into the variant classification and types, offering a comprehensive overview of the mutational characteristics specific to each colon cancer subtype.
Tumor Microenvironment in Colon Subtypes
The Estimating the Proportions of Immune and Cancer Cells method (EPIC) 49 was employed to quantify the distribution of immune cell fractions within the TME of the identified colon cancer subtypes. EPIC is recognized for its high precision in estimating immune-cell fraction compared to other tools, offering a reliable approach for analyzing complex tumor ecosystems. To account for the substantial variability in cancer cell types among patients, EPIC incorporates an algorithm that adjusts for unidentified and potentially diverse cell populations within each subtype. The method models bulk RNA-Seq data to predict cell-type fractions in mixed RNA-Seq samples, utilizing reference gene expression patterns of major immune and non-malignant cell types. We also utilized the xCell tool38,39 alongside the EPIC tool to estimate the immune cell fractions. The normalized gene expression data were entered into the xCell web tool (https://xcell.ucsf.edu/). This analysis assigns the proportions of each cell fraction to every sample in the input data. EPIC utilizes constrained regression to estimate absolute proportions accurately and is tailored for tumor samples that exhibit clearly defined immune and stromal characteristics. 50 In contrast, xCell generates enrichment scores rather than absolute values, making it attuned to pathway-level signals but possibly less precise in quantification.51,52 Using both tools introduces methodological variation, and differences in results may stem from varying definitions of cell types, marker specificity, or assumptions regarding tumor purity.50 -52 This approach provides a comprehensive characterization of the TME, facilitating insights into the immune composition and heterogeneity across the colon cancer subtypes.
Spearman rank correlations were computed using the SciPy spearmanr function between EPIC-inferred CD4 and CD8 T-cell fractions and the expression of oncogenes and TSGs from DEGs across matched samples. Two-sided
Validation
The proposed framework was validated using the independent dataset GSE17538, obtained from the Gene Expression Omnibus (GEO) database.
53
The sequencing data were generated using the GPL570 platform ([HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array). Probe identifiers were converted to gene symbols using the BioMart platform
36
to ensure consistency with the primary dataset. Because GSE17538 is microarray-based GPL570 platform, not all RNA-seq defined Ensembl IDs were represented on the platform. Therefore, external validation was performed using the overlap subset (6 genes) available in GSE17538 from the 16 RFE genes (Supplemental S1 Table). To account for cross-platform variability between RNA-seq and microarray measurements, gene expression values were standardized within each dataset (gene-wise z-score across samples) prior to model training/testing. A linear SVM classifier (linear kernel;
Results
Identified Colon Cancer Subtypes
Applying the method of Livesey et al 8 on the extracted protein-coding genes of the 39 late-stage samples, with the early-stage samples used as a reference, resulted in normalized counts that revealed differences between stage groups in gene expression in colon cancer samples. Hierarchical clustering revealed distinct early- to late-stage patterns of gene expression. The HC dendrogram (Figure 2A) identified 2 distinct colon cancer subtypes, a larger subtype comprising 23 samples (orange branches) and a smaller subtype of 16 samples (blue branches), hereafter referred to as the L (Large) and S (Small) subtypes, respectively.

Colon cancer subtyping. (A) Hierarchical clustering dendrogram of colon cancer samples on the x-axis. Two subtypes, L (orange branches) and S (blue branches), were identified using hierarchical clustering of the 19 569 protein-coding gene expression profiles from 39 colon cancer samples. (B) Tanglegram comparing hierarchical clustering outcomes. The contrast of the HC of all 19 569 normalized protein-coding genes (left) and HC with the normalized gene expression of the RFE 16-gene subset (right). Samples are displayed along the y-axis, and the lines connecting the 2 dendrograms indicate the correspondence between the samples in the 2 clusterings. (C) Volcano plot of the 19 569 normalized protein-coding genes in the 39 late-stage samples. The dashed horizontal line indicates a statistical significance threshold (adjusted
This classification was validated using the independent dataset GSE17538, which similarly identified 2 clusters: a large cluster of 31 samples and a small cluster with 14 samples (Supplemental S1 Figure). The consistency in the number of clusters and the Large and Small identified clusters between the discovery (TCGA-COAD) and validation (GSE17538) datasets underscores a possible existence of stage difference heterogeneity in colon cancer and the potential relevance of understanding it.
Differentially Expressed Genes in Colon Cancer
Analysis of differential gene expression revealed significant differences between the 2 subtypes (S and L), with L being used as the reference subtype. A total of 1855 DEGs were identified (Figure 2C), based on a threshold of LFC ⩾ 1.5 or ⩽−1.5, and an adjusted
Identified Classification Genes in Colon Subtypes
RFE analysis identified a novel subset of genes associated with enhanced accuracy in categorizing colon cancer patients into the L and S subtypes. From the 1855 DEGs, a refined subset of 16 genes was selected as optimal predictive markers for distinguishing between the S and L subtypes. This 16-gene subset achieved a performance classification score of 0.98, demonstrating high reliability (Supplemental S1 Table).
To validate the clustering capability of the 16-gene subset, hierarchical clustering was carried out using the gene expression profiles of these genes. The resulting clustering was compared to the initial clustering based on all protein-coding genes, with the relationship between the 2 clustering outcomes visualized using a tanglegram, which shows the 2 dendrograms side-by-side (Figure 2B). The tanglegram showed that all 39 samples were consistently classified into the same subtypes (L and S) using the reduced 16-gene set, confirming the robustness of this set for subtype classification.
Using a Principal Component Analysis model, the variation in the gene expression profiles of the RFE 16-gene subset between L and S was assessed. As shown in Figure 2D, dimension 1 accounts for 66.3% of the overall variance, while dimension 2 accounts for 7.7% of the variance. The PCA plot illustrates a clear segregation between the 2 subtypes, with samples from L (replicates with an orange ellipse) and S (blue ellipse) distinctly separated in the 2-dimensional (2D) space.
The expression patterns of the RFE 16-gene subset were evaluated between L and S using boxplots. All 16 genes demonstrated statistically significant differences between the 2 subtypes (T-Test,

Expression of the colon cancer subtypes. (A) Boxplots of 6 of the RFE 16-gene subset. Boxplots depict the normalized gene expression profiles (
Role of Tumor Suppressor Genes and Oncogenes
In addition, the DEGs were filtered to identify TSGs and oncogenes based on the Cancer Gene Census from the COSMIC database. 54 The analysis revealed 82 such genes amongst the DEG list, with roughly similar proportions of TSGs and oncogenes. To further investigate their involvement in tumor progression, we examined their normalized gene expression levels. Interestingly, most of the TSGs and oncogenes from the DEGs were downregulated in the S subtype compared to the L subtype (Figure 3B and C).
Differential miRNA Expression and Target Gene Analysis
Differential miRNA expression analysis between subtype S and early-stage samples identified 9 differentially expressed miRNAs (Figure 4A). Validated experimental interactions were found between 3 mature miRNAs, hsa-miR-106a-5p, hsa-miR-296-5p and hsa-miR-1271-5p, and TSGs and oncogenes (Figure 4C). These MTIs were retrieved from multiMiR, and we retained only interactions annotated as Functional MTI to focus on the most stringent evidence category. Interestingly, all the above miRNAs were upregulated in subtype S. This pattern is consistent with the possibility of increased post-transcriptional repression of some targets in subtype S. However, the MTI evidence is derived from curated external studies and does not constitute direct confirmation in our cohort. Additionally, 3 differentially expressed mature miRNAs were also reported between subtype L and early-stage samples: hsa-let-7f-5p, hsa-miR-203a-3p and hsa-let-7e-5p. However, their expression is not consistent and is either upregulated or downregulated, also targeting fewer TSGs and oncogenes (Figure 4B and D).

Differential miRNA expression and validation of miRNA-target interaction involving TSGs and oncogenes. (A) Volcano plot illustrating the differential miRNA expression between subtypes S and early-stage. Significant miRNAs are shown in red. (B) Volcano plot illustrating the differential miRNA expression between subtypes L and early-stage. Significant miRNAs are shown in red. (C and D) Target-gene interaction networks of subtypes S and L show literature-curated, experimentally validated MTIs retrieved via multiMiR, restricted to Functional MTI support type; edges indicate previously reported functional evidence and were not experimentally validated in this study.
Enrichment Pathways Analysis
KEGG pathway analysis of the DEGs revealed 2 significantly enriched pathways: the extracellular matrix (ECM)-receptor interaction and cell adhesion molecules (Figure 5C). The estimated

Gene mutations and cellular pathways. (A) Oncoplot of the mutational landscapes of colon subtype S and (B) subtype L. Columns represent individual patient samples, while rows represent genes. The barplot at the top displays the mutation frequency per sample, and the barplot on the right indicates the mutation frequency for each gene. Colored squares indicate mutated genes, while gray squares indicate the non-mutated genes. Multi_Hit = genes that underwent multiple mutations in a single sample. (C) KEGG pathway enrichment analysis. KEGG enrichment analysis of the 1855 DEGs was performed using the clusterProfiler R package, identifying 2 significantly enriched pathways: ECM-receptor interaction and Cell adhesion molecules. (D) Reactome pathway enrichment analysis of DEGs. The top 2 statistically significant pathways are shown, and their colors are shown by the
Mutational Profile Analysis of Colon Subtypes
The mutational landscapes of the identified subtypes were analyzed using the maftools R package and its
In the L subtype, comprising 23 patient samples, all (100%) were found to harbor somatic mutations (Figure 5B). The oncoplot of the L subtype displayed the top 20 most frequently mutated genes, ranked by the total number of variants per gene, with the percentage reflecting the proportion of colon cancer samples with the genetic alteration relative to the total samples. Frequent mutations were observed in
Across both subtypes, the top 3 most frequently mutated genes were
Tumor Cell Fractions in Cancer Subtypes
Recent studies have emphasized the significant role of the TME in the progression of colon cancer.56,57 A newly developed computational tool, known as Estimating the Proportions of Immune and Cancer Cells, was employed to assess the proportions of immune and cancer cells across the identified subtypes.
50
The analysis revealed considerable variability in cell fractions between the 2 subtypes, particularly in the proportions of CD4 and CD8 T-cells. A statistically significant difference in immune cell expression was observed between the L and S subtypes (Kruskal-Wallis,
In the Xena dataset, the EPIC-inferred CD8 T-cell fraction was inversely associated with the expression of oncogenes and TSGs, with the strongest single-gene correlations reaching a Spearman’s rho (ρ) of approximately −0.53. These correlations remained significant after multiple-testing corrections for many genes (Correlation Table.pdf). At the gene set level, higher CD8 fractions were associated with lower oncogene score expression (ρ = −0.34,
Discussion
Colon cancer is characterized by substantial intra-tumor heterogeneity and notable inter-patient variability, posing challenges to its understanding and treatment.7,10 While transcriptomic profiling has provided valuable insights into this heterogeneity, much of the existing genome and transcriptome-scale research has focused on primary tumors and matched normal tissues.59 -61 Such approaches often fail to capture the dynamic, multi-stage progression of cancer from its early stages to late forms. This study explored the diverse progression of colon cancer by analyzing RNA-seq gene expression profiles across multiple stages of cancer development. This approach facilitated the identification of colon cancer subtypes and the exploration of genes with differential expression, mutational variations, TME, and enriched pathways linked to the subtypes.
Analyzing the gene expression profiles has significantly advanced our understanding of the variability in gene expression related to colon cancer progression. By uncovering a clear progression pattern from early to late stages of the disease, we identified 2 distinct colon cancer subtypes with distinct gene expression profiles (Figure 2A). If we consider the L subtype, the predominant form of colon cancer based on sample size, it appears that the S subtype demonstrates notable differences from the L subtype as the disease progresses. This distinction is strongly supported by the large number of DEGs identified between the 2 subtypes, totaling 1856. Interestingly, most of these DEGs, including both TSGs and oncogenes, were downregulated in the S subtype compared to the L subtype (Figures 2C, 3B and C). This downregulation may be ascribed to stress induced by the TME or gene silencing mechanisms potentially driven by mutations.
Furthermore, the 16-gene subset selected from the DEGs using the RFE method showed clear segregation between the 2 subtypes with 100% accuracy. This result highlights the efficiency of the RFE approach in rapidly identifying a minimal yet highly discriminative gene subset, further reinforcing the presence of substantial molecular heterogeneity between the 2 subtypes (Figure 2C and D). When applied to the independent GEO validation dataset, our in-house tracking cancer progression method successfully identified 2 distinct clusters, a large and a small cluster, supporting our findings. The ability to recover a similar separation of 2 clusters in an independent cohort suggests that the subtype signal is not cohort specific. However, results from the GEO dataset should be interpreted with caution, as several protein-coding genes, including some in the RFE gene subset, are unavailable. This limitation may have led to the misclassification of specific samples between clusters, potentially impacting downstream analyses if this dataset is used (Supplemental S1 Table). Hence, we acknowledge that the validation process is only partial. The boxplots show 6 RFE genes identified across both the Xena and GEO datasets (Figure 3A). Subtype L exhibited stable disease progression from early to late stages, with mean gene expression values remaining close to 0 throughout. In contrast, the S subtype showed significant downregulation of specific genes, indicating a notable increase in molecular activity (Figure 3A). Similarly, immune-related genes showed a stable expression around a mean value of 0 in L, while displaying upregulation in S, further supporting the notion of greater molecular dynamism in subtype S compared to L (Figure 3D).
A key finding of this study is the concurrent downregulation of both oncogenes and TSGs in S, which may be driven by likely random silencing mutations or regulatory mechanisms within the TME. Investigating the mutational landscape of the most frequently mutated genes, including known driver genes in colon cancer, revealed potential involvement of additional factors in their downregulation. For instance,
The downregulation of some TSGs and oncogenes in subtype S is consistent with the observed miRNA profile. The differentially expressed miRNAs that target both TSGs and oncogenes were consistently upregulated, suggesting a pattern of gene inhibition. 65 However, these links are based on externally curated functional MTIs and should therefore be interpreted as hypothesis-generating evidence rather than direct mechanistic confirmation in our cohort. The miRNA hsa-miR-106a-5p targeted the majority of TSGs and oncogenes. Previous studies have documented its role in colorectal cancer, linking it to tumor progression.66,67 This miRNA was not found to be significant in the L subtype, suggesting a different progression from the S subtype, confirming the validity of the method used to track cancer progression. 8
Interestingly, a negative correlation was observed between the TME fractions of CD4 and CD8 T-cells and the DEGs, including TSGs and oncogenes. Both immune cell types manifested increased infiltration as the S subtype progressed, while the gene expression showed concurrent downregulation (Figure 3B-D). However, the observed relationships between the inferred immune cell fractions and expression of the TSGs and oncogenes should be interpreted as associations derived from bulk colorectal cancer transcriptomic data, and not as mechanistically proven or causal effects. The inverse relationship may reflect immune-linked transcriptional remodeling. However, bulk deconvolution does not resolve T-cell functional states, so mechanistic interpretation requires orthogonal validation.66,67 This phenomenon of immune-transcriptional disconnection has been documented in microsatellite-stable (MSS) colorectal cancers, specifically within the CMS2 and CMS3 classifications, where metabolic and epithelial processes prevail despite the presence of immune cells.25,68 In contrast, the L subtype showed increased expression of immune-related genes, along with more consistent regulation of tumor-associated genes. This suggests that the tumor is in a state that responds more effectively to immune activity. These findings are consistent with the immune-reactive phenotypes observed in MSI-high or CMS1 colorectal cancers, as reported by.27,69
A similar pattern of immune cell expression was observed in the GEO dataset analysis (Supplemental S3 Figure), further supporting our findings. Remarkably, CD4 and CD8 T-cells appeared to have a twofold effect on disease progression. While CD8 T-cells are generally known for their anti-tumor properties. 70 Recent studies have highlighted their dual functionality, which varies depending on the TME and their state of exhaustion.71,72 Similarly, CD4 T-cells can exhibit both anti-tumorigenic and pro-tumorigenic roles, depending on their subset differentiation and functional state.72,73 This dual role contributed to a stabilized level of progression and, by extension, probably aggressiveness. Contrary to many previous reports,73,74 our findings suggest that TME does not necessarily exert a purely positive or negative influence on cancer aggressiveness. Instead, the presence of immune cell fractions with dual functions in the TME appears to possibly exert a balancing effect on colon cancer progression in the S subtype, as evidenced by the downregulation of both oncogenes and TSGs in this subtype. Consequently, if gene expression changes during disease progression from early to late stages are not accounted for, the 2 subtypes could be mistakenly regarded as a single subtype, underscoring the importance of the method for tracking cancer progression. However, the reported CD4+/CD8+ implications are descriptive and should not be viewed as mechanistic or causative without validation supported by raw data expression analyses and functional assays. Recent advances in single-cell and spatial transcriptomics have also improved understanding of the colorectal TME, showing that immune populations often have dual, context-dependent roles.75,76 However, our reliance on bulk transcriptomic data may mask distinct T-cell and stromal states, reflecting broader population signals and introducing variability due to sensitivity to tumor purity and composition. 58 Thus, our findings should be viewed as complementing detailed mechanisms.
The 2 pathways identified in the KEGG pathway analysis, cell adhesion molecules and ECM receptor interaction, play crucial yet opposing roles in colon cancer (Figure 5C). The ECM receptor interaction pathway, through ECM remodeling, promotes a pro-tumorigenic environment that enhances disease progression.55-78,79 In contrast, the cell adhesion molecules pathway has been reported to support the anti-tumor response by facilitating the absorption of tumor antigens and activating T lymphocytes specific to the tumor location in colorectal cancer. 80 The downregulation of both pathways during disease progression results in an altered effect that maintains a consistent level of aggressiveness. This further explains our finding regarding the balancing effect of the TME on colon cancer progression in the S subtype. Moreover, a recent report by 81 also identified a strong correlation between immune cell infiltration and the ECM receptor signatures. These insights provide a new foundation for the future design of anticancer drugs, focusing on the unique molecular characteristics of the identified subtypes and the TME. It is interesting to note that the enriched pathways identified in Reactome, collagen chain trimerization and the RHOA GTPase cycle, exhibit similar biological functions to those found in the KEGG pathways. Specifically, collagen chain trimerization plays a comparable role to cell adhesion molecules within the extracellular matrix, influencing both the structure and function.81,82 Furthermore, ECM interactions significantly regulate RhoA activity, 83 further supporting our findings. This convergence of pathways highlights the interconnectedness of these biological processes and underscores the relevance of our research.
Conclusion
Our study demonstrates a complex relationship between TME and gene expression that helps explain the molecular heterogeneity of novel colon cancer subtypes. The simultaneous downregulation of both TSGs and oncogenes in the S subtype suggests that factors beyond mutational burden, such as TME regulation mechanisms, contribute to this downregulation. In addition, the regulation of cancer driver genes is negatively correlated with CD4 and CD8 T-cells, suggesting a likely dual role of immune regulation in controlling S subtype progression by modulating TSGs and oncogenes that play opposing roles in cancer. Without studying tumor progression across stages, distinguishing the subtypes would be difficult and easily overlooked. The strong findings here enhance our understanding of the genetic characteristics and molecular mechanisms involved in the heterogeneity and progression of colon cancer. This knowledge improves colon cancer prognosis and advances the development of precise, targeted therapies.
Supplemental Material
sj-docx-1-cix-10.1177_11769351261431245 – Supplemental material for In Silico Analysis of the Dual Role of Tumor Microenvironment on Colon Cancer Subtypes
Supplemental material, sj-docx-1-cix-10.1177_11769351261431245 for In Silico Analysis of the Dual Role of Tumor Microenvironment on Colon Cancer Subtypes by Christianah Kehinde, Michelle Livesey, Yeuko Manganyi and Hocine Bendou in Cancer Informatics
Supplemental Material
sj-docx-2-cix-10.1177_11769351261431245 – Supplemental material for In Silico Analysis of the Dual Role of Tumor Microenvironment on Colon Cancer Subtypes
Supplemental material, sj-docx-2-cix-10.1177_11769351261431245 for In Silico Analysis of the Dual Role of Tumor Microenvironment on Colon Cancer Subtypes by Christianah Kehinde, Michelle Livesey, Yeuko Manganyi and Hocine Bendou in Cancer Informatics
Footnotes
Acknowledgements
We thank the University of Cape Town, the Computational Omics and Biomedical Informatics Program (COBIP), and the Data Science for Health Discovery and Innovation in Africa (DS-I Africa), as well as the SAMRC and NRF of South Africa for the funding. We further acknowledge the support of the COBIP Principal Investigators: Bendou, Hocine; Hersh, William R; Mulder, Nicola; Mutsvangwa, Tinashe Ernest Muzvidzwa. We also acknowledge the Ilifu cloud computing facility (
).
Abbreviations
A: Advanced
BH: Benjamini–Hochberg
BP: biological processes
CC: cellular components
DEGs: Differentially Expressed Genes
E: Early
ECM: Extracellular Matrix
ENSG: Ensembl Gene
EPIC: Estimating the Proportions of Immune and Cancer Cells
FDR: False Discovery Rate
GEO: Gene Expression Omnibus
GO: Gene Ontology
HC: Hierarchical Clustering
KEGG: Kyoto Encyclopedia of Genes and Genomes
L: Large
LFC: log2-fold change
MAF: Mutation Annotation Format
MF: Molecular Functions
MSS: microsatellite-stable
PCA: Principal component analysis
RFE: Recursive Feature Elimination
S: Small
SVM: Support Vector Machine
TCGA-COAD: The Cancer Genome Atlas Colon Cancer
TMB: tumor mutational burden
TME: Tumor Microenvironment
TMM: Trimmed Mean of M-values
TNM: Tumor invasive, Lymph node involvement, Metastases
TSGs: Tumor Suppressor Genes
Authors Contributions
Conceptualization, H.B.; Formal analysis, C.K.; Investigation, CK, ML, YM, and HB; Writing original draft, C.K.; Writing review and editing, C.K., M.L., Y.M., and H.B.; Supervision; H.B; funding acquisition, H.B.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported by the Fogarty international Center of the National Institutes of Health under Award Number U2RTW012131. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. ML is funded through a postdoctoral fellowship from the South African Medical Research Council (MRC-SHIP - GIPD project number 96849) and CAPRISA Center of Excellence - DST and the National Research Foundation (NRF) of South Africa (COE141028106922).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Publicly available datasets were analyzed in this study. The datasets can be found here: UCSC Xena database (https://xenabrowser.net/) and GEO database (https://www.ncbi.nlm.nih.gov/gds). The source code enabling reproducibility of the study’s analysis is available on GitHub:
.
Declaration of AI Use
The authors disclose the use of Grammarly® for grammar, spelling, and language clarity during manuscript preparation. No scientific data was generated or modified using AI.
Supplemental material
Supplemental material for this article is available online.
