Abstract
Objective:
Gastric cancer remains a major global health concern. This study aimed to identify key genes involved in the inflammation-to-cancer transition in the stomach using an integrative framework combining graph neural networks and causal discovery.
Methods:
In this retrospective study, gene expression data from two gastric cancer-related datasets were categorised into two stages: gastritis to precancerous lesions and precancerous lesions to gastric cancer. Differentially expressed genes were identified and analysed for functional enrichment. A relevance network was constructed using Pearson’s correlations. Graph sample and aggregate was then applied to the expression matrix, using this network for training. Node embeddings were generated via neighbourhood aggregation, and causal regulatory relationships were inferred using a constraint-based algorithm. Genes with the highest degrees in the causal network were assessed for prognostic relevance using Kaplan–Meier analysis.
Results:
A total of 857 differentially expressed genes were identified in the gastritis-to-precancerous transition and 337 in the precancerous-to-gastric cancer transition, with 83 differentially expressed genes shared. Enrichment analysis highlighted pathways linked to bacterial responses, especially Helicobacter pylori. Graph sample and aggregate enhanced gene representation for causal analysis. The Peter–Clark algorithm inferred 72 genes and 99 causal edges. Nine key genes—mucin 17, brain-expressed X-linked 2, BCL2/adenovirus E1B 19 kDa protein-interacting protein 3, Ras association domain family member 2, NLR family pyrin domain-containing 7, interferon regulatory factor 4, carbamoyl phosphate synthetase 1, nucleoporin 210 and neuronal differentiation 2—were identified, all of which were significantly associated with gastric cancer survival.
Conclusion:
This study integrates graph neural networks and causal inference to identify critical genes involved in gastric inflammation–cancer progression, providing novel insights into the pathogenesis of gastric cancer and potential biomarkers for validation in future studies.
Keywords
Introduction
Gastric cancer (GC) remains a significant public health challenge, 1 particularly in East Asia, including China, despite substantial advances in surgical resection, chemotherapy, radiotherapy, and, more recently, targeted and immunotherapeutic strategies. 2 It is associated with high morbidity and mortality, and the 5-year survival rate remains poor, particularly in patients diagnosed at advanced stages. 3 These statistics underscore the urgent need to better understand GC pathogenesis and to identify effective biomarkers for early detection and therapeutic intervention.
GC typically evolves through a well-characterised multistep cascade, beginning with chronic superficial gastritis and progressing through chronic atrophic gastritis, intestinal metaplasia, dysplasia, and ultimately, GC.4,5 Chronic inflammation—especially that induced by Helicobacter pylori infection—plays a central role in initiating and sustaining this progression.6,7 The inflammatory microenvironment disrupts epithelial integrity and activates oncogenic signalling pathways, thereby promoting cellular transformation. 8 Although host immune surveillance attempts to eliminate transformed cells, 9 dysregulated immune responses and inflammation-associated immune escape mechanisms often facilitate tumour progression.9,10
Despite decades of extensive research, the molecular mechanisms underlying the inflammation-to-cancer transition remain incompletely understood, which limits their use in guiding early diagnosis and precision medicine. In particular, the interplay between immune regulation, inflammatory signalling, and gene regulatory networks during this transition has yet to be fully elucidated. Moreover, the limitations of current screening techniques, marked by low sensitivity and specificity, mean that many patients are diagnosed at advanced stages. 11 There is a pressing need for innovative approaches capable of identifying robust biomarkers that detect GC at its earliest, and potentially reversible, stages.
Chronic low-grade inflammation, also observed in obesity, has been implicated in the pathogenesis of numerous diseases, including type 2 diabetes, cardiovascular disorders, and multiple cancers. 12 Recent studies have identified interleukin-6 and chemokine (C-C motif) ligand 4 as obesity-related hub genes that may contribute to GC development by mediating systemic and local inflammation. 13 These findings further support the hypothesis that inflammatory mediators play a causal role in tumour initiation.
A number of bioinformatics methods have been developed to identify key molecular players in GC. Differentially expressed gene (DEG) analyses have uncovered genes associated with tumour progression and metastasis.14,15 However, these methods typically focus on individual gene-level changes and often neglect complex gene–gene interactions. Co-expression network approaches, 16 while useful for revealing modules of correlated genes, do not distinguish causal from spurious associations. These correlation-based frameworks are limited in their ability to infer directionality or mechanistic influence, especially in the presence of latent confounders.
To overcome limitations in traditional gene regulatory modelling, we developed an integrative framework that combines graph neural networks (GNNs), specifically graph sample and aggregate (GraphSAGE), with causal discovery algorithms to identify key regulatory genes involved in the gastric inflammation-to-cancer transition. 17 Unlike conventional methods, GraphSAGE captures topological and functional relationships by aggregating signals from neighbouring genes in co-expression networks, providing biologically meaningful embeddings. These were integrated with the fast causal inference algorithm, an extension of the Peter–Clark (PC) algorithm that accounts for latent confounding and indirect associations, making it suitable for transcriptomic data.
In summary, the current study integrates graph-based representation learning (GraphSAGE) with causal discovery (the PC algorithm) to generate high-quality gene expression representations, infer gene regulatory networks, and uncover key regulatory relationships in the gastric inflammation-to-cancer transition, providing a scalable framework for biomarker discovery and targeted intervention.
Materials and methods
Data sources and preprocessing
Data sources
We retrieved publicly available gene expression datasets related to GC progression from the Gene Expression Omnibus (GEO) database. 1 Two datasets were selected based on their relevance to inflammation-driven gastric carcinogenesis: GSE55696, which includes 77 samples encompassing chronic gastritis, low- and high-grade intraepithelial neoplasia, and early GC; 18 and GSE130823, comprising 94 gastric biopsy samples across various precancerous and cancerous stages. 19
Traditional feature selection algorithms in bioinformatics are primarily designed to improve model accuracy and generalisability, but they often fail to capture causal relationships between genes. While correlation analysis may reveal associations between two variables, such co-variation does not imply causation. Two variables can appear significantly correlated without any causal relationship. In this study, we applied causal discovery techniques from data mining to uncover the underlying causal network structure within complex and heterogeneous gene expression data, thereby enhancing the accuracy of gene regulatory network construction.
Data preprocessing
Raw microarray data were log-transformed, and probe IDs were mapped to gene symbols. For genes with multiple probes, the probe showing the highest expression was retained. Samples with missing values were excluded. Batch effects between the two datasets were corrected using the combining batches (ComBat) algorithm. This preprocessing yielded an integrated expression matrix containing 18,128 genes across 171 samples.
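The probe-collapsing step can be illustrated with a minimal sketch. It assumes that "highest expression" means the highest mean expression across samples, which the text does not specify; the function name `collapse_probes` and the toy probe IDs are hypothetical and are not taken from the study's pipeline.

```python
from statistics import mean


def collapse_probes(expr, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean expression.

    expr: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene symbol.
    Probes without a gene symbol are dropped.
    """
    best = {}  # gene symbol -> (mean expression, probe ID)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # unmapped probe
        score = mean(values)
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, probe)
    return {gene: expr[probe] for gene, (_, probe) in best.items()}


# Toy data (hypothetical probe IDs): two probes map to MUC17, one is unmapped.
expr = {"p1": [5.0, 6.0], "p2": [7.0, 8.0], "p3": [2.0, 2.5], "p4": [1.0, 1.0]}
probe_to_gene = {"p1": "MUC17", "p2": "MUC17", "p3": "BEX2"}
collapsed = collapse_probes(expr, probe_to_gene)
```

In this toy input, probe `p2` is retained for MUC17 because its mean is higher than that of `p1`, and the unmapped probe `p4` is discarded.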
Differential expression analysis
We divided the samples into two biologically relevant comparisons:
(i) gastritis versus precancerous lesions
(ii) precancerous lesions versus GC
Differential expression analysis was performed using the limma package in R, which applies empirical Bayes moderated t-statistics. Genes with |log₂FC| > 1 and adjusted p < 0.05 (Benjamini–Hochberg correction) were considered DEGs.
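The DEG selection rule can be sketched as follows. This is an illustration of the thresholding step only: it takes log₂ fold changes and raw p-values as given and does not reproduce limma's moderated t-statistics. The function names and the toy values are assumptions made for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, log2fc, pvalues, lfc_cut=1.0, alpha=0.05):
    """Keep genes with |log2FC| > lfc_cut and adjusted p < alpha."""
    adj = benjamini_hochberg(pvalues)
    return [g for g, lfc, q in zip(genes, log2fc, adj)
            if abs(lfc) > lfc_cut and q < alpha]


# Toy input (hypothetical values, not study results).
genes = ["A", "B", "C", "D"]
log2fc = [2.1, -1.5, 0.4, 1.2]
pvalues = [0.001, 0.01, 0.02, 0.2]
adjusted = benjamini_hochberg(pvalues)
degs = select_degs(genes, log2fc, pvalues)
```

Gene C fails the fold-change cut-off and gene D fails the adjusted p-value cut-off, so only A and B are retained in this toy example.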
Enrichment analysis
To contextualise the DEGs, we conducted gene ontology enrichment analysis (biological process (BP) and molecular function (MF) categories) using the clusterProfiler package. Enrichment significance was defined as p < 0.05. This analysis enabled us to characterise DEGs involved in inflammation, immunity, and microbial responses—processes central to the gastric inflammation-to-cancer transition.
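Over-representation analysis of this kind is commonly based on a one-sided hypergeometric test. The sketch below shows that calculation for a single term under this assumption; the function name and the toy counts are hypothetical, and the sketch does not reproduce clusterProfiler's handling of the gene universe or of multiple testing.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_term, n_degs, n_overlap):
    """P(X >= n_overlap) when n_degs genes are drawn without replacement
    from a universe of n_universe genes, n_term of which carry the term."""
    total = comb(n_universe, n_degs)
    upper = min(n_term, n_degs)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy counts (hypothetical): 20 genes in the universe, 5 annotated to the
# term, 4 DEGs, 2 of which are annotated to the term.
p_enrich = hypergeom_enrichment_p(20, 5, 4, 2)
```

A term would be reported as enriched when this p-value falls below the chosen threshold (p < 0.05 in the text).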
GNN model: GraphSAGE
Gene–gene graph construction
We first constructed a gene correlation graph by calculating Pearson’s correlation coefficients (PCC) for all gene pairs. Edges were established for pairs with |r| ⩾ 0.5, reflecting moderate to strong linear relationships while maintaining biological interpretability.
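The graph construction rule can be sketched in a few lines. The example below is a minimal illustration using toy expression vectors; the function names are hypothetical, and a practical implementation on thousands of genes would use a vectorised correlation matrix rather than a pairwise loop.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_edges(expr, threshold=0.5):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Toy expression vectors (hypothetical values, four samples per gene).
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.5],
    "G3": [4.0, 3.0, 2.0, 1.0],
    "G4": [1.0, -1.0, 1.0, -1.0],
}
edges = correlation_edges(expr)
edge_pairs = [(a, b) for a, b, _ in edges]
```

Both positively and negatively correlated pairs are connected because the threshold is applied to the absolute value of r; G4 stays isolated because its correlation with every other gene is below 0.5 in magnitude.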
GraphSAGE architecture and justification
This network was used to construct a homogeneous graph using PyTorch Geometric, in which neighbour sampling and multilayer feature aggregation were applied. This setup ensures that genes with similar expression profiles are connected, enabling meaningful representation learning.
A two-layer GraphSAGE model was implemented with K = 2 for neighbour sampling, allowing aggregation from up to second-order neighbours (Figure 1). This reflects the biological assumption that regulatory influence can propagate beyond direct interactions (e.g. transcriptional regulators acting through intermediates). Empirical evidence suggests that second-order aggregation balances local and global information flow without overfitting or over-smoothing.

The figure illustrates the GraphSAGE process, including sampling (a), aggregation (b), and prediction (c) of the targeted genes. GraphSAGE: graph sample and aggregate.
Neighbourhood sampling was performed using the Link Neighbour Loader (LinkNeighborLoader in PyTorch Geometric), with 20 1-hop and 10 2-hop neighbours sampled per gene. Mean aggregation was employed due to its robustness and interpretability in high-dimensional omics data. Node embeddings were trained on 80% of the correlation network and validated on the remaining 20%.
Multilayer sampling can capture longer-range dependencies but may also introduce noise from unrelated nodes, and it increases computational complexity. In this study, we employed a two-layer GraphSAGE architecture to balance effective information propagation with control over information overreach. In the first layer (1-hop), the model aggregates information from the immediate upstream and downstream neighbours of a node. In the second layer (2-hop), the model further incorporates information from the neighbours of these immediate neighbours. This multilayer design enriches the final node representation and enables the model to capture broader relationships within the graph.
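The two-layer aggregation described above can be illustrated without a deep learning library. The sketch below is a simplified, untrained stand-in and not the model used in the study: fixed weights of 0.5 replace the learned weight matrices, there is no non-linearity, and the mapping of the 20/10 fan-outs to layers is an assumption. All function names are hypothetical.

```python
import random


def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sage_mean_layer(features, neighbours, fanout, rng):
    """One mean-aggregation layer: average a node's own vector with the
    mean vector of at most `fanout` sampled neighbours."""
    updated = {}
    for node, h_self in features.items():
        nbrs = neighbours.get(node, [])
        if len(nbrs) > fanout:
            nbrs = rng.sample(nbrs, fanout)  # neighbour sampling
        if nbrs:
            h_nbr = mean_vec([features[u] for u in nbrs])
            updated[node] = [0.5 * a + 0.5 * b for a, b in zip(h_self, h_nbr)]
        else:
            updated[node] = list(h_self)  # isolated node keeps its features
    return updated


def two_layer_embedding(features, neighbours, fanouts=(20, 10), seed=0):
    """Stack two layers so that information from 2-hop neighbours arrives."""
    rng = random.Random(seed)
    h = features
    for fanout in fanouts:
        h = sage_mean_layer(h, neighbours, fanout, rng)
    return h


# Toy path graph A - B - C with a signal only on A (hypothetical values).
features = {"A": [1.0], "B": [0.0], "C": [0.0]}
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
one_layer = sage_mean_layer(features, neighbours, 20, random.Random(0))
emb = two_layer_embedding(features, neighbours)
```

After one layer, node C still carries no signal from A; after two layers it does, which is the second-order propagation the two-layer design is meant to capture.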
To address these limitations, we propose an integrative framework that combines GNNs with causal discovery algorithms to identify key regulatory genes involved in the gastric inflammation-to-cancer transition. GNNs, such as GraphSAGE, are well-suited to biological data due to their ability to model non-Euclidean structures, such as gene co-expression or interaction networks. Unlike conventional machine learning models that treat each gene as an independent feature, GraphSAGE captures contextual information by aggregating signals from a gene’s neighbourhood, thereby reflecting complex topological relationships and functional dependencies.
To evaluate whether our gene regulatory network inference model, based on causal discovery, accurately reflects true relationships between genes, we validated the model using both synthetic and real datasets with gold-standard references from The Dialogue on Reverse Engineering Assessment and Methods 5 challenge. 20 We also compared its performance with three widely used methods. Additionally, ablation studies were conducted to assess the contribution of each model component. To simplify evaluation and reduce confounding variables, the original datasets were randomly divided into 2 × 4 subsets. Evaluation metrics were calculated for each subset, and the mean and standard deviation were reported to provide robust assessments.
We benchmarked our approach against three commonly used gene expression matrix construction methods (Pearson’s correlation, partial and semi-partial correlations (ppcor) and cross-event attention-based time-aware network (catnet)) on datasets with known ground truth (p < 0.05). F1-scores were calculated for each subset, with mean and standard deviation shown in Figure 2, where error bars indicate variability. Higher F1-scores reflect superior performance.

Benchmarking was conducted, using Pearson’s correlation, ppcor (a) and a catnet (b), against three commonly used gene expression matrix construction methods. ppcor: partial and semi-partial correlation; catnet: cross-event attention-based time-aware network.
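The evaluation metric can be sketched as follows. The example assumes that edges are compared without regard to direction, which the text does not state, and the subset scores at the end are placeholder numbers used only to show the mean and standard deviation calculation; none of the values are study results.

```python
from statistics import mean, stdev


def edge_f1(predicted, gold):
    """F1-score of predicted edges against gold-standard edges.

    Edges are treated as undirected pairs (an assumption for this sketch).
    """
    pred = {frozenset(e) for e in predicted}
    true = {frozenset(e) for e in gold}
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy edge lists (hypothetical).
predicted = [("A", "B"), ("B", "C"), ("C", "D")]
gold = [("B", "A"), ("C", "D"), ("D", "E"), ("A", "E")]
f1 = edge_f1(predicted, gold)

# Placeholder per-subset scores, summarised as mean and standard deviation.
subset_scores = [0.5, 0.6, 0.7]
summary = (mean(subset_scores), stdev(subset_scores))
```

Two of the three predicted edges are in the gold standard (precision 2/3) and two of the four gold edges are recovered (recall 1/2), giving an F1-score of 4/7.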
To assess the importance of the GraphSAGE component, we evaluated the effect of removing the node embedding module. The PC algorithm was applied directly to the sub-datasets to infer regulatory edges, and F1-scores were calculated and compared to those of the full GraphSAGE + PC model.
On the in silico dataset, our model (GraphSAGE + PC) outperformed ppcor and catnet and was only slightly outperformed by Pearson’s correlation. On the Escherichia coli dataset (a real-world dataset from the GEO database), our model outperformed all three comparison methods. This supports the utility of causal discovery-based models for real biological data.
The standalone PC algorithm performed poorly on both datasets, although marginally better on E. coli. These results confirm that the GraphSAGE embedding module significantly improves model accuracy.
While GNNs excel at learning expressive, topology-aware representations, they remain inherently correlational. To move from correlation to inference, we incorporated causal discovery techniques. These approaches are appropriate for transcriptomic data, as they account for latent confounding, sparsity and indirect associations—common features of gene regulatory networks. 21 By constructing a causal graph from salient genes identified via GNN-based classification, we aim to uncover upstream regulators of malignant transformation.
Causal discovery via the PC algorithm
Assumptions and applicability
In gene regulatory networks, regulatory factors act as causal determinants of gene expression levels. The PC algorithm facilitates the inference of regulatory relationships between genes by assessing their conditional independence. We applied the PC algorithm, a constraint-based causal discovery method, to infer directed gene regulatory networks (Figure 3).

The PC algorithm, a constraint-based causal discovery method, was applied to infer directed gene regulatory networks by assessing conditional independence among genes. The process involved: generating a true causal diagram (a), constructing a fully connected undirected graph (b), removing the X–Y edge after independence testing (c), further removing the Y–W edge (d), identifying V-shaped structures using X–Y–Z configurations (e), and finally predicting the network diagram by searching Y–Z–W structures (f). PC: Peter–Clark.
The PC algorithm assumes (i) causal sufficiency (i.e. no hidden confounders) and (ii) faithfulness (i.e. all conditional independencies reflect causal structure). Although these assumptions are rarely fully met in transcriptomic data, we mitigated their potential violation by:
(a) using a GraphSAGE-enriched expression matrix, which captures multivariate dependencies and reduces noise and redundancy; and
(b) performing dimensionality reduction through DEG filtering prior to causal inference, improving signal-to-noise ratio and minimising spurious associations.
These precautions facilitate cautious yet informative causal inference from high-dimensional omics data, as supported by previous gene regulatory network applications. 21
These refined gene embeddings were then input into the PC algorithm, a well-established causal discovery method based on conditional independence testing. To assess conditional independence, we used partial correlation coefficients, which quantify the linear correlation between two variables while controlling for the effects of others. Partial correlation coefficients are particularly suitable for gene expression data, where expression levels are often influenced by multiple interacting genes. Recursive estimation of pairwise partial correlation coefficients was employed to identify direct dependencies between genes, helping to distinguish true regulatory links from indirect associations.
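The recursive partial correlation and the associated independence test can be sketched as below. The recursion is the standard formula for removing one conditioning variable at a time; the use of Fisher's z-transform for the p-value is an assumption, since the text does not name the test statistic, and the function names are hypothetical.

```python
from math import erfc, log, sqrt


def partial_corr(corr, i, j, cond):
    """Partial correlation of variables i and j given the tuple `cond`,
    computed recursively from a correlation matrix (dict of dicts)."""
    if not cond:
        return corr[i][j]
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(corr, i, j, rest)
    r_ik = partial_corr(corr, i, k, rest)
    r_jk = partial_corr(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def fisher_z_pvalue(r, n_samples, n_cond):
    """Two-sided p-value for H0: partial correlation = 0 (Fisher z-test)."""
    r = max(min(r, 0.999999), -0.999999)  # keep the logarithm finite
    z = 0.5 * log((1 + r) / (1 - r))
    stat = sqrt(n_samples - n_cond - 3) * abs(z)
    return erfc(stat / sqrt(2))


# Toy chain X -> Y -> Z: X and Z correlate only through Y (hypothetical values).
corr = {
    "X": {"X": 1.0, "Y": 0.8, "Z": 0.64},
    "Y": {"X": 0.8, "Y": 1.0, "Z": 0.8},
    "Z": {"X": 0.64, "Y": 0.8, "Z": 1.0},
}
r_xz_given_y = partial_corr(corr, "X", "Z", ("Y",))
p_marginal = fisher_z_pvalue(corr["X"]["Z"], 171, 0)
p_conditional = fisher_z_pvalue(r_xz_given_y, 171, 1)
```

With 171 samples, X and Z are strongly dependent marginally but become independent once Y is controlled for, so a constraint-based algorithm would remove the X–Z edge as an indirect association.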
To orient the edges within the resulting directed acyclic graph, we applied three standard collider-based orientation rules: enforcing acyclicity constraints, re-testing conditional independence, and assigning edge directions. This process yielded a final regulatory network specific to the gastric inflammation-to-cancer transition.
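Collider (v-structure) orientation, the first orientation step of constraint-based methods, can be sketched as follows. The example assumes that a skeleton and the separating sets from the independence tests are already available; the further propagation rules that orient remaining edges while preserving acyclicity are not implemented here, and the function name is hypothetical.

```python
from itertools import combinations


def orient_colliders(nodes, skeleton, sepsets):
    """Orient X -> Y <- Z for every unshielded triple X - Y - Z in which
    Y is absent from the separating set of X and Z.

    skeleton: iterable of undirected edges (a, b).
    sepsets: dict mapping frozenset({a, b}) -> set of conditioning nodes.
    Returns a set of directed edges (source, target).
    """
    adjacent = {v: set() for v in nodes}
    for a, b in skeleton:
        adjacent[a].add(b)
        adjacent[b].add(a)
    directed = set()
    for y in nodes:
        for x, z in combinations(sorted(adjacent[y]), 2):
            if z in adjacent[x]:
                continue  # shielded triple: x and z are adjacent
            if y not in sepsets.get(frozenset((x, z)), set()):
                directed.add((x, y))
                directed.add((z, y))
    return directed


# Toy skeleton (hypothetical): X - Z, Y - Z, Z - W.
nodes = ["X", "Y", "Z", "W"]
skeleton = [("X", "Z"), ("Y", "Z"), ("Z", "W")]
sepsets = {
    frozenset(("X", "Y")): set(),   # X and Y are marginally independent
    frozenset(("X", "W")): {"Z"},   # X and W are separated by Z
    frozenset(("Y", "W")): {"Z"},   # Y and W are separated by Z
}
colliders = orient_colliders(nodes, skeleton, sepsets)
```

Here only X → Z ← Y is oriented, because Z was not needed to separate X from Y; the Z–W edge is left for the subsequent orientation rules.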
Hub gene selection and justification
From the PC-inferred causal graph, we selected hub genes based on node degree centrality. This strategy is rooted in network biology, where hub nodes frequently represent key regulators or bottlenecks in biological systems. Although degree centrality is a heuristic, its integration with differential expression and causal orientation enhances biological plausibility. The top nine genes with the highest degrees were selected for further evaluation.
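Degree-based hub selection amounts to counting the edges incident to each node. The sketch below assumes that degree means total degree (incoming plus outgoing edges), which the text does not specify, and breaks ties alphabetically for reproducibility; the function name and toy edges are hypothetical.

```python
from collections import Counter


def top_hubs(edges, k=9):
    """Return the k nodes with the highest total degree as (node, degree)."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    # Sort by descending degree, then by name to break ties deterministically.
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy directed edge list (hypothetical).
toy_edges = [("A", "B"), ("A", "C"), ("D", "A"), ("B", "C")]
hubs = top_hubs(toy_edges, k=2)
```

With `k=9` on the inferred causal graph, the same ranking would yield the nine candidate genes carried forward to survival analysis.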
Validation via survival analysis
To assess clinical relevance, we validated the nine candidate genes using the Kaplan–Meier Plotter (http://kmplot.com/analysis/) in GC patients. Although the platform offers extensive clinical coverage, we acknowledge the limitations of using a single retrospective dataset. Future validation using independent cohorts—such as The Cancer Genome Atlas (TCGA) 22 and external GEO datasets—is warranted.
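For reference, the Kaplan–Meier (product-limit) estimator that underlies such survival curves can be sketched as below. This illustrates the estimator only; it does not reproduce the Kaplan–Meier Plotter's patient stratification by expression level or its log-rank test, and the follow-up times are toy values.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up time per patient.
    events: 1 if the event (death) was observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # events and censored patients leave the risk set
    return curve


# Toy cohort (hypothetical): five patients, two of them censored.
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(times, events)
```

In this toy cohort the estimated survival probability drops to 0.8, 0.6 and 0.3 at times 5, 8 and 12; censored patients reduce the number at risk without lowering the curve.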
In this study, we retrieved gene expression data related to the gastric inflammation-to-cancer transition from the GEO database and selected two microarray datasets, GSE55696 and GSE130823, for in-depth analysis. The GSE55696 dataset identifies a clear biological distinction between high-grade and low-grade gastric intraepithelial neoplasia, providing molecular evidence for clinical applications. 18 In addition, Zhang et al. 19 used the GSE130823 dataset to identify key molecules involved in the tumorigenesis of intestinal-type GC. Taken together, these two datasets are well suited for this type of research.
The GSE55696 dataset contains expression data for 18,353 genes, while GSE130823 includes data for 32,080 genes. After merging the datasets, a total of 18,128 genes across 171 samples were retained. This sample size is considered representative in biomedical research and is sufficient to support statistical analysis. Furthermore, to ensure the reliability of the results, we applied a standardised data preprocessing pipeline and performed quality control procedures to minimise technical variation and batch effects.
This is a retrospective study. The analysis relied solely on information from an existing public database. In accordance with institutional and ethical guidelines, signed informed consent was not required. The study was conducted in compliance with the Declaration of Helsinki (1975), as revised in 2024. The reporting of this study conforms to STROBE guidelines. 23
Results
Screening of DEGs
To ensure comparability between datasets, quantile normalisation was applied, as described in the Methods section. DEGs were then identified using the limma package.
Volcano plots in Figure 4(a) and (b) illustrate the results. In the gastritis-to-precancerous lesion transition, 857 DEGs were identified (415 upregulated, 442 downregulated). In the transition from precancerous lesions to GC, 337 DEGs were identified (289 upregulated, 48 downregulated). Notably, 83 DEGs were common to both transitions—43 upregulated and 40 downregulated. Figure 4(a) displays a broader p-value distribution and more pronounced expression changes than Figure 4(b), suggesting that the early transition (inflammation to precancer) involves more substantial transcriptomic shifts. Genes in the upper-left (downregulated) and upper-right (upregulated) quadrants of the volcano plots were prioritised for further analysis.

Volcano plots illustrating DEGs. Figure 4(a) (gastritis-to-precancer) shows a broader p-value distribution and more pronounced expression changes compared with Figure 4(b) (precancer-to-cancer), suggesting that the early transition (inflammation-to-precancer) involves more substantial transcriptomic shifts. Genes in the upper-left (downregulated) and upper-right (upregulated) quadrants were prioritised for further analysis. DEG: differentially expressed gene.
GO annotation of DEGs
To explore the biological significance of DEGs associated with the inflammation-to-cancer transition, we performed Gene Ontology (GO) enrichment analysis on the intersecting gene set. The analysis yielded 56 significant terms: 41 in BP and 15 in MF. The top 10 most significantly enriched terms are shown in Figure 5.

GO annotation of DEGs. To explore the biological significance of DEGs associated with the inflammation-to-cancer transition, GO enrichment analysis was performed on the intersecting gene set. The analysis identified 56 significant terms, including 41 in BP and 15 in MF. The top 10 most significantly enriched terms are shown. DEG: differentially expressed gene; BP: biological process; MF: molecular function.
The results revealed strong associations with H. pylori, as top-enriched BP included responses to lipopolysaccharides and bacterial molecules. This suggests that identified genes may mediate host immune responses to bacterial infection. Chronic inflammation, potentially driven by H. pylori and its metabolites, likely contributes to the inflammation-to-cancer transition.
While our initial enrichment focused on H. pylori-related pathways due to its well-established role in early GC, we acknowledge that GC is a multifactorial disease involving diverse oncogenic signalling pathways, including wingless-related integration site/β-catenin, phosphoinositide 3-kinases/Ak strain transforming, mitogen-activated protein kinase and transforming growth factor-β. These pathways are known to promote proliferation, invasion, stemness and immune evasion. 22
GraphSAGE node embedding
We used the filtered gene expression matrix obtained from the differential expression analysis as the input for GraphSAGE. The correlation network (Table 1), constructed based on Pearson’s correlation coefficients (PCC), served as the source of positive samples for neighbourhood aggregation.
The table summarises potentially relevant genes identified within the network analysis.
The Link Neighbour Loader was configured to sample 20 neighbours from the 1-hop and 10 from the 2-hop neighbourhood for each gene. Through this process, GraphSAGE aggregated information from each gene’s local network, updating the expression matrix with enriched, context-aware representations. A portion of the resulting matrix is shown in Tables 2 and 3, while the complete versions are provided as Supplemental Tables S1 and S2.
Performance evaluation on the in silico data set of ablation experiment.
GraphSAGE: graph sample and aggregate; PC: Peter–Clark.
Performance evaluation on the ablation experiment Escherichia coli data set.
GraphSAGE: graph sample and aggregate; PC: Peter–Clark.
This embedding process integrates the intrinsic expression level of each gene with contextual information from its connected genes, generating biologically meaningful node representations. These enhanced embeddings improve the performance of the downstream causal discovery algorithm by more accurately identifying regulatory relationships.
Key gene survival analysis
We assessed the clinical relevance of the nine candidate genes using the Kaplan–Meier Plotter (http://kmplot.com/analysis/), which provides survival data based on publicly available datasets.
All nine genes exhibited statistically significant associations with overall survival in GC patients (p < 0.05), as shown in Figure 6. These findings suggest that these genes may have prognostic value, though further validation is required.

Key gene survival analysis. The clinical relevance of the nine candidate genes was assessed using the Kaplan–Meier Plotter. All nine genes exhibited statistically significant associations with overall survival in GC patients (p < 0.05). GC: gastric cancer.
Causal discovery
The PC algorithm was applied to the enriched gene expression matrix to infer the underlying gene regulatory network. The resulting network contained 72 nodes and 99 directed edges (Figure 7). Top hub genes were identified based on node degree. The nine genes with the highest centrality were retained for further analysis: mucin 17 (MUC17), brain expressed X-linked 2 (BEX2), BCL2/adenovirus E1B 19 kDa protein-interacting protein 3 (BNIP3), Ras association domain family member 2 (RASSF2), NLR family pyrin domain containing 7 (NLRP7), interferon regulatory factor 4 (IRF4), carbamoyl phosphate synthetase 1 (CPS1), nucleoporin 210 (NUP210) and neuronal differentiation 2 (NEUROD2).

The PC algorithm was applied to the enriched gene expression matrix to infer the underlying gene regulatory network. The resulting network contained 72 nodes and 99 directed edges. Nine genes were identified: MUC17 (a), BEX2 (b), BNIP3 (c), RASSF2 (d), NLRP7 (e), IRF4 (f), CPS1 (g), NUP210 (h) and NEUROD2 (i). PC: Peter–Clark; MUC17: mucin 17; BEX2: brain expressed X-linked 2; BNIP3: BCL2/adenovirus E1B 19 kDa protein-interacting protein 3; RASSF2: Ras association domain family member 2; NLRP7: NLR family pyrin domain containing 7; IRF4: interferon regulatory factor 4; CPS1: carbamoyl phosphate synthetase 1; NUP210: nucleoporin 210; NEUROD2: neuronal differentiation 2.
These genes are hypothesised to be key regulators in the inflammation-to-cancer transition, and are discussed below.
Discussion
This study presents an integrative computational pipeline for exploring the inflammation-to-cancer transition in gastric tissues. Our approach combines differential expression analysis, enrichment analysis, graph-based learning, causal inference and clinical outcome validation to identify candidate biomarkers and regulatory genes.
The primary objective of this study was to identify key genes implicated in the gastric ‘inflammation-to-cancer transformation’. While DEGs are utilised as an initial step to reduce data dimensionality during the screening process, they do not constitute the central focus or main challenge of the research. Further investigations are planned to build upon these findings in future work.
Differential expression analysis revealed 857 DEGs in the transition from gastritis to precancerous lesions and 337 DEGs in the progression to GC. The 83 overlapping DEGs indicate molecular continuity throughout disease development. Notably, the greater number of DEGs in the early stage suggests that inflammation-induced dysregulation plays a dominant role in tumour initiation. This is supported by a recent publication demonstrating the mechanistic role of inflammation in cancer initiation, progression and metastasis. 24
Enrichment analyses highlighted immune and bacterial response pathways, with many DEGs associated with H. pylori-induced inflammation and immune signalling. These findings align with the established role of persistent infection and chronic inflammation in GC.25,26
Kaplan–Meier analysis revealed that these genes are significantly associated with overall survival, although this alone does not establish clinical utility. Survival outcomes may be influenced by confounding variables such as treatment regimens, molecular subtypes and comorbidities. Thus, further validation using prospective cohorts and experimental assays—employing both animal models 27 and human GC tissues 28 —is essential.
Additionally, while convolutional neural networks have demonstrated strong performance on structured, grid-like data (e.g., images), they are suboptimal for modelling non-Euclidean biological networks. In contrast, GNNs are specifically designed for graph-structured data, such as gene interaction networks. GNNs incorporate both node features and relational structures, making them well-suited for representing complex biological systems. 29
GraphSAGE, in particular, offers advantages over traditional graph convolutional networks by enabling inductive learning, better generalisation and computational efficiency, all important features when analysing large-scale omics data. GraphSAGE has further demonstrated strong performance in related biomedical tasks, supporting its selection for this study.
To move beyond correlation-based methods, we implemented the GraphSAGE algorithm, which allowed us to model gene–gene interactions in a topologically informed manner. The learned embeddings improved the performance of the PC algorithm in identifying plausible causal relationships. The resulting regulatory graph comprised 72 nodes and 99 edges, from which we identified nine key hub genes.
Among these nine key genes identified, MUC17, a membrane-associated mucin that maintains epithelial integrity and may act as a tumour suppressor, 30 has not yet been directly linked to GC. Nevertheless, mutations in MUC17 have been associated with the prognosis of diffuse glioma in adult patients, 31 and its most established gastrointestinal association is with Crohn’s disease, where MUC17 disruption plays a pathogenic role. 32 Our findings provide gene-level evidence supporting its potential relevance in GC and precision medicine, warranting further investigation at the mRNA level using RNA sequencing (RNA-seq).
BNIP3, a pro-apoptotic gene involved in autophagy and the hypoxia response, is dysregulated in multiple cancers 33 but, to date, has been reported in only one gastrointestinal cancer, hepatocellular carcinoma, in a single study. 34 Its central position in our analysis supports further investigation into its role in GC development and its potential as an early diagnostic biomarker and/or a therapeutic target.
IRF4, a transcription factor regulating immune responses, 35 cooperates with basic leucine zipper activating transcription factor-like transcription factor to counter T cell exhaustion in mouse tumour models, 36 yet no studies have linked it to GC. Our findings suggest novel immunological mechanisms worth exploring in vitro and/or in vivo.
RASSF2, a tumour suppressor frequently silenced by methylation, 37 is involved in the differentiation of breast, 38 prostate 39 and squamous cervical cancers, 40 but its role in GC remains undefined. Our results provide preliminary evidence for its involvement in GC carcinogenesis, which may offer a novel target for early diagnosis and/or intervention.
NEUROD2, typically linked to neuronal differentiation, may also promote metastasis and survival in non-neuronal malignancies, especially glioblastoma. 41 Our data highlight the need to examine its contribution to early-stage GC, beginning with in vitro studies and subsequently in vivo validation using GC samples. 28
BEX2 has been suggested to correlate inversely with GC prognosis by regulating apoptosis, proliferation and metastasis, 42 though its precise role remains unclear. Our findings strengthen the evidence for a link between BEX2 and GC, which should be validated in vitro and/or in vivo.
NLRP7 has been associated with lung cancer via pyroptosis 43 and with tumour-associated macrophages in colorectal cancer, 44 while CPS1 has been linked to GC prognosis through effects on immune infiltration, differentiation and metastasis, 45 although current evidence is limited. Our results further support the need to verify the roles of these two molecules in GC, at least in vitro.
Finally, NUP210, implicated in acute myeloid leukaemia 46 and inversely associated with colorectal cancer where its depletion suppresses metastasis, 47 emerged as a central hub in our inferred regulatory network, warranting prioritised functional validation, as mentioned above.
Limitations and future directions
First, our analyses are based on bulk tissue expression, which cannot resolve cell type–specific signals and may therefore obscure contributions from immune or stromal compartments. Single-cell RNA-seq 48 would provide higher-resolution insights. Second, the PC algorithm assumes no hidden confounders and exact conditional independencies. Although DEG filtering and GraphSAGE embeddings helped reduce the impact of these assumptions, violations could still affect inference quality. Third, hub gene selection by node degree may favour highly connected but biologically nonspecific genes and overlook important low-centrality regulators. High connectivity can also reflect generic hubs or methodological bias rather than true disease drivers.
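The second limitation can be illustrated with a small simulation. The sketch below is not the pipeline used in this study; it assumes a Fisher z test on partial correlations as the conditional independence test (a common choice for constraint-based methods on continuous data) and uses synthetic variables named `hidden`, `gene_a` and `gene_b`, all of which are hypothetical. Two simulated genes share an unmeasured driver and have no direct link, yet they appear strongly dependent unless the driver can be conditioned on.

```python
import math
import numpy as np

def partial_corr(x, y, z=None):
    """Pearson correlation of x and y, optionally after regressing out z."""
    if z is not None:
        design = np.column_stack([np.ones_like(z), z])
        # Replace x and y by their residuals from a linear fit on z.
        x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
        y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(x, y)[0, 1])

def fisher_z_pvalue(r, n, k):
    """Two-sided p-value for H0: correlation = 0, with k conditioning variables."""
    r = min(max(r, -0.999999), 0.999999)  # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - k - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# Synthetic data (assumption): both genes depend only on the hidden driver.
rng = np.random.default_rng(0)
n = 2000
hidden = rng.normal(size=n)
gene_a = hidden + 0.5 * rng.normal(size=n)
gene_b = hidden + 0.5 * rng.normal(size=n)

r_marginal = partial_corr(gene_a, gene_b)               # driver unobserved
r_given_hidden = partial_corr(gene_a, gene_b, hidden)   # driver observed
p_marginal = fisher_z_pvalue(r_marginal, n, 0)
p_given_hidden = fisher_z_pvalue(r_given_hidden, n, 1)

print(f"marginal r = {r_marginal:.2f}, p = {p_marginal:.2e}")
print(f"partial r given hidden = {r_given_hidden:.2f}, p = {p_given_hidden:.2f}")
```

When the driver is unmeasured, only the marginal test is available, so a constraint-based search would retain an edge between the two genes even though neither regulates the other.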
Future work will explore alternative causal discovery algorithms, such as the linear non-Gaussian acyclic model (LiNGAM), to relax PC assumptions, and employ genome-wide clustered regularly interspaced short palindromic repeats (CRISPR) screens 49 to validate gene essentiality. Protein–protein interaction databases, for example, the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) and the Biological General Repository for Interaction Datasets (BioGRID), will help assess whether top-degree genes are embedded within known regulatory modules or signalling pathways.
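One simple way to frame the planned database check is sketched below. It is illustrative only: the edge list and the interaction pairs are placeholders (`GENE_A` to `GENE_F` are hypothetical names, not results from this study), and the helper functions `degree_ranking` and `supported_fraction` are introduced here for illustration. The sketch ranks genes by degree in an inferred network and reports, for a given hub, the fraction of its inferred edges that also appear as known interaction pairs.

```python
from collections import Counter

# Placeholder inferred causal edges (source, target); hypothetical gene names.
causal_edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_A", "GENE_D"),
    ("GENE_B", "GENE_C"),
    ("GENE_E", "GENE_A"),
]

# Placeholder known interaction pairs, e.g. as exported from an interaction
# database; stored as unordered pairs because such records are undirected.
known_ppi = {
    frozenset(("GENE_A", "GENE_B")),
    frozenset(("GENE_A", "GENE_D")),
    frozenset(("GENE_C", "GENE_F")),
}

def degree_ranking(edges):
    """Return (gene, degree) pairs sorted by total degree, highest first."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return degree.most_common()

def supported_fraction(gene, edges, ppi):
    """Fraction of a gene's inferred edges that match a known interaction pair."""
    incident = [edge for edge in edges if gene in edge]
    if not incident:
        return 0.0
    hits = sum(frozenset(edge) in ppi for edge in incident)
    return hits / len(incident)

ranking = degree_ranking(causal_edges)
top_gene, top_degree = ranking[0]
support = supported_fraction(top_gene, causal_edges, known_ppi)
print(top_gene, top_degree, support)
```

A hub with many inferred edges but little support from curated interactions would be a candidate for the methodological bias described in the limitations, although a low fraction may equally reflect incomplete database coverage.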
Validation will use independent cohorts (e.g., TCGA, 23 external GEO datasets) and experimental assays in vitro, in vivo and on human GC tissues, including quantitative polymerase chain reaction, immunohistochemistry, 28 CRISPR perturbation 49 and single-cell RNA-seq. 48 A recent study by Hu et al. 50 used single-cell RNA-seq to identify ribosomal protein subunits as mediators of the inflammation-to-cancer transition in GC, validated in animal models 27 and human GC tissues. 28 We aim to follow a similar framework to confirm the role of our candidate genes and identify therapeutic targets. Finally, multi-omics integration (e.g., epigenomics, proteomics, metabolomics) will be pursued to enhance biological interpretability and network resolution.
Conclusion
We have developed a binary classification framework combining GraphSAGE and the PC algorithm to infer gene regulatory networks underlying the gastric inflammation-to-cancer transition. This approach identified nine candidate genes linked to GC progression and patient survival, highlighting their potential as non-invasive biomarkers and therapeutic targets. Representing the training phase of our pipeline, this integrative framework provides a scalable, biologically meaningful strategy for early GC detection, tumour biology elucidation and precision medicine, with clinical validation underway.
Supplemental Material
Supplemental material, sj-xlsx-1-smo-10.1177_20503121251380131 for Unravelling key genes and molecular pathways in gastric inflammation-to-cancer transition through causal discovery: implications for early diagnosis and therapy by Zhen Ren, Xiaochen Li, Pengyun Liu, Jinjuan Li and Shisan Bao in SAGE Open Medicine
Supplemental material, sj-xlsx-2-smo-10.1177_20503121251380131 for Unravelling key genes and molecular pathways in gastric inflammation-to-cancer transition through causal discovery: implications for early diagnosis and therapy by Zhen Ren, Xiaochen Li, Pengyun Liu, Jinjuan Li and Shisan Bao in SAGE Open Medicine
Footnotes
Appendix 1
Acknowledgements
We appreciate the language editing by Dr Brett D. Hambly.
Ethical considerations
Ethical approval for this study was obtained from The Human Ethic Committee, Gansu University of Chinese Medicine (Ethics Approval Number: 2024-230).
Informed consent
No patient data were obtained directly, as the current study relied solely on information from an existing public database. In accordance with institutional and ethical guidelines, signed informed consent was not required for this retrospective analysis.
Author contributions
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Gansu Provincial Natural Science Foundation, China: Clinical Decision Support Research for Post-Hepatitis Liver Cirrhosis Based on Data Mining (grant number: 23JRRA1719). The deep integration of medical and health care advances clinical rehabilitation and fosters the high-quality development of health services (grant number: 24RCKD001).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Data are available from the corresponding author upon reasonable request.
Supplemental material
Supplemental material for this article is available online.
References
