Unravelling key genes and molecular pathways in gastric inflammation-to-cancer transition through causal discovery: implications for early diagnosis and therapy

Abstract

Objective:

Gastric cancer remains a major global health concern. This study aimed to identify key genes involved in the inflammation-to-cancer transition in the stomach using an integrative framework combining graph neural networks and causal discovery.

Methods:

In this retrospective study, gene expression data from two gastric cancer-related datasets were categorised into two stages: gastritis to precancerous lesions and precancerous lesions to gastric cancer. Differentially expressed genes were identified and analysed for functional enrichment. A relevance network was constructed using Pearson’s correlations. Graph sample and aggregate (GraphSAGE) was then applied to the expression matrix, using this network for training. Node embeddings were generated via neighbourhood aggregation, and causal regulatory relationships were inferred using a constraint-based algorithm. Genes with the highest degrees in the causal network were assessed for prognostic relevance using Kaplan–Meier analysis.

Results:

A total of 857 differentially expressed genes were identified in the gastritis-to-precancerous transition and 337 in the precancerous-to-gastric cancer transition, with 83 differentially expressed genes shared. Enrichment analysis highlighted pathways linked to bacterial responses, especially Helicobacter pylori. Graph sample and aggregate enhanced gene representation for causal analysis. The Peter–Clark algorithm inferred 72 genes and 99 causal edges. Nine key genes—mucin 17, brain-expressed X-linked 2, BCL2/adenovirus E1B 19 kDa protein-interacting protein 3, Ras association domain family member 2, NLR family pyrin domain-containing 7, interferon regulatory factor 4, carbamoyl phosphate synthetase 1, nucleoporin 210 and neuronal differentiation 2—were identified, all of which were significantly associated with gastric cancer survival.

Conclusion:

This study integrates graph neural networks and causal inference to identify critical genes involved in gastric inflammation–cancer progression, providing novel insights into the pathogenesis of gastric cancer and potential biomarkers for validation in future studies.

Keywords

Causal discovery, GraphSAGE, gastric cancer, inflammation–cancer transition, key genes, bioinformatics

Introduction

Gastric cancer (GC) remains a significant public health challenge,¹ particularly in East Asia, including China, despite substantial advances in surgical resection, chemotherapy, radiotherapy, and, more recently, targeted and immunotherapeutic strategies.² It is associated with high morbidity and mortality, and the 5-year survival rate remains poor, particularly in patients diagnosed at advanced stages.³ These statistics underscore the urgent need to better understand GC pathogenesis and to identify effective biomarkers for early detection and therapeutic intervention.

GC typically evolves through a well-characterised multistep cascade, beginning with chronic superficial gastritis and progressing through chronic atrophic gastritis, intestinal metaplasia, dysplasia, and ultimately, GC.⁴,⁵ Chronic inflammation, especially that induced by Helicobacter pylori infection, plays a central role in initiating and sustaining this progression.⁶,⁷ The inflammatory microenvironment disrupts epithelial integrity and activates oncogenic signalling pathways, thereby promoting cellular transformation.⁸ Although host immune surveillance attempts to eliminate transformed cells,⁹ dysregulated immune responses and inflammation-associated immune escape mechanisms often facilitate tumour progression.⁹,¹⁰

Despite decades of extensive research, the molecular mechanisms underlying the inflammation-to-cancer transition remain incompletely understood, which limits their use in guiding early diagnosis and precision medicine. In particular, the interplay between immune regulation, inflammatory signalling, and gene regulatory networks during this transition has yet to be fully elucidated. Moreover, the limitations of current screening techniques, marked by low sensitivity and specificity, mean that many patients are diagnosed at advanced stages.¹¹ There is a pressing need for innovative approaches capable of identifying robust biomarkers that detect GC at its earliest, and potentially reversible, stages.

Chronic low-grade inflammation, also observed in obesity, has been implicated in the pathogenesis of numerous diseases, including type 2 diabetes, cardiovascular disorders, and multiple cancers.¹² Recent studies have identified interleukin-6 and chemokine (C-C motif) ligand 4 as obesity-related hub genes that may contribute to GC development by mediating systemic and local inflammation.¹³ These findings further support the hypothesis that inflammatory mediators play a causal role in tumour initiation.

A number of bioinformatics methods have been developed to identify key molecular players in GC. Analyses of differentially expressed genes (DEGs) have uncovered genes associated with tumour progression and metastasis.¹⁴,¹⁵ However, these methods typically focus on individual gene-level changes and often neglect complex gene–gene interactions. Co-expression network approaches,¹⁶ while useful for revealing modules of correlated genes, do not distinguish causal from spurious associations. These correlation-based frameworks are limited in their ability to infer directionality or mechanistic influence, especially in the presence of latent confounders.

To overcome limitations in traditional gene regulatory modelling, we developed an integrative framework that combines graph neural networks (GNNs), specifically graph sample and aggregate (GraphSAGE), with causal discovery algorithms to identify key regulatory genes involved in the gastric inflammation-to-cancer transition.¹⁷ Unlike conventional methods, GraphSAGE captures topological and functional relationships by aggregating signals from neighbouring genes in co-expression networks, providing biologically meaningful embeddings. These were integrated with the fast causal inference algorithm, an extension of the Peter–Clark (PC) algorithm that accounts for latent confounding and indirect associations, making it suitable for transcriptomic data.

In summary, the current study integrates graph-based representation learning (GraphSAGE) with the PC causal discovery algorithm to generate high-quality gene expression representations, infer gene regulatory networks, and uncover key regulatory relationships in the gastric inflammation-to-cancer transition, providing a scalable framework for biomarker discovery and targeted intervention.

Materials and methods

Data sources and preprocessing

Data sources

We retrieved publicly available gene expression datasets related to GC progression from the Gene Expression Omnibus (GEO) database.¹ Two datasets were selected based on their relevance to inflammation-driven gastric carcinogenesis: GSE55696, which includes 77 samples encompassing chronic gastritis, low- and high-grade intraepithelial neoplasia, and early GC¹⁸; and GSE130823, comprising 94 gastric biopsy samples across various precancerous and cancerous stages.¹⁹

Traditional feature selection algorithms in bioinformatics are primarily designed to improve model accuracy and generalisability, but they often fail to capture causal relationships between genes. While correlation analysis may reveal associations between two variables, such co-variation does not imply causation. Two variables can appear significantly correlated without any causal relationship. In this study, we applied causal discovery techniques from data mining to uncover the underlying causal network structure within complex and heterogeneous gene expression data, thereby enhancing the accuracy of gene regulatory network construction.

Data preprocessing

Raw microarray data were log-transformed, and probe IDs were mapped to gene symbols. For genes with multiple probes, the probe showing the highest expression was retained. Samples with missing values were excluded. Batch effects between the two datasets were corrected using the ComBat (combating batch effects when combining batches) algorithm. This preprocessing yielded an integrated expression matrix containing 18,128 genes across 171 samples.
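The probe-collapsing step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `collapse_probes`, the log2(x + 1) pseudo-count, and the use of the mean across samples as the measure of "highest expression" are all assumptions made for the example.

```python
import math


def collapse_probes(probe_expr, probe_to_gene):
    """Log2-transform probe intensities and keep one probe per gene.

    probe_expr: dict mapping probe ID -> list of raw intensities (one per sample).
    probe_to_gene: dict mapping probe ID -> gene symbol.
    Assumption: "highest expression" means highest mean log2 intensity.
    """
    best = {}  # gene -> (mean log2 expression, log2 values)
    for probe, values in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a gene symbol are dropped
        logged = [math.log2(v + 1.0) for v in values]  # pseudo-count of 1 (assumed)
        mean_expr = sum(logged) / len(logged)
        if gene not in best or mean_expr > best[gene][0]:
            best[gene] = (mean_expr, logged)
    return {gene: logged for gene, (_, logged) in best.items()}


if __name__ == "__main__":
    expr = {"p1": [1.0, 3.0], "p2": [7.0, 15.0], "p3": [3.0, 3.0]}
    mapping = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
    print(collapse_probes(expr, mapping))
```

Batch correction is not reproduced here, since it depends on the specific ComBat implementation and covariates used.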

Differential expression analysis

We divided the samples into two biologically relevant comparisons:

(i) gastritis versus precancerous lesions

(ii) precancerous lesions versus GC

Differential expression analysis was performed using the limma package in R, which applies empirical Bayes moderated t-statistics. Genes with |log₂FC| > 1 and adjusted p < 0.05 (Benjamini–Hochberg correction) were considered DEGs.
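To make the selection thresholds concrete, the sketch below applies the Benjamini–Hochberg adjustment and the |log₂FC| > 1, adjusted p < 0.05 filter to precomputed statistics. It does not reproduce limma's empirical Bayes moderation; the function names and the input layout are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2_fc, p_values, fc_cutoff=1.0, alpha=0.05):
    """Keep genes with |log2FC| > fc_cutoff and adjusted p < alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [
        gene
        for gene, fc, q in zip(genes, log2_fc, adjusted)
        if abs(fc) > fc_cutoff and q < alpha
    ]


if __name__ == "__main__":
    genes = ["G1", "G2", "G3", "G4"]
    print(select_degs(genes, [2.0, -1.5, 0.5, 3.0], [0.01, 0.04, 0.03, 0.5]))
```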

Enrichment analysis

To contextualise the DEGs, we conducted gene ontology enrichment analysis (biological process (BP) and molecular function (MF) categories) using the clusterProfiler package. Enrichment significance was defined as p < 0.05. This analysis enabled us to characterise DEGs involved in inflammation, immunity, and microbial responses—processes central to the gastric inflammation-to-cancer transition.

GNN model: GraphSAGE

Gene–gene graph construction

We first constructed a gene correlation graph by calculating Pearson’s correlation coefficients (PCC) for all gene pairs. Edges were established for pairs with |r| ⩾ 0.5, reflecting moderate to strong linear relationships while maintaining biological interpretability.
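The graph construction rule can be illustrated with a short standard-library sketch. The function names and the dictionary input format are hypothetical, and a real analysis over thousands of genes would use a vectorised correlation routine rather than this pairwise loop.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # constant profiles carry no correlation signal
    return cov / (sd_x * sd_y)


def correlation_edges(expr, threshold=0.5):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


if __name__ == "__main__":
    toy = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1], "D": [1, -1, -1, 1]}
    for g1, g2, r in correlation_edges(toy):
        print(g1, g2, round(r, 3))
```

Taking the absolute value means that strongly anti-correlated genes are connected as well as positively correlated ones.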

GraphSAGE architecture and justification

This network was used to construct a homogeneous graph using PyTorch Geometric, in which neighbour sampling and multilayer feature aggregation were applied. This setup ensures that genes with similar expression profiles are connected, enabling meaningful representation learning.

A two-layer GraphSAGE model was implemented with K = 2 for neighbour sampling, allowing aggregation from up to second-order neighbours (Figure 1). This reflects the biological assumption that regulatory influence can propagate beyond direct interactions (e.g. transcriptional regulators acting through intermediates). Empirical evidence suggests that second-order aggregation balances local and global information flow without overfitting or over-smoothing.

Figure 1.

The figure illustrates the GraphSAGE process, including sampling (a), aggregation (b), and prediction (c) of the targeted genes. GraphSAGE: graph sample and aggregate.

Neighbourhood sampling was performed using the Link Neighbour Loader (LinkNeighborLoader in PyTorch Geometric), with 20 1-hop and 10 2-hop neighbours sampled per gene. Mean aggregation was employed due to its robustness and interpretability in high-dimensional omics data. Node embeddings were trained on 80% of the correlation network and validated on the remaining 20%.

Multilayer sampling can capture longer-range dependencies but may also introduce noise from unrelated nodes, and it increases computational complexity. In this study, we employed a two-layer GraphSAGE architecture to balance effective information propagation with control over information overreach. In the first layer (1-hop), the model aggregates information from the immediate upstream and downstream neighbours of a node. In the second layer (2-hop), the model further incorporates information from the neighbours of these immediate neighbours. This multilayer design enriches the final node representation and enables the model to capture broader relationships within the graph.
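The two-hop propagation described above can be illustrated with a deliberately simplified sketch. It keeps only neighbour sampling and mean aggregation; the learned weight matrices, nonlinearities and training loop of GraphSAGE are omitted, and it uses the standard library rather than PyTorch Geometric. All function names are hypothetical, and how the per-layer fan-outs map onto the 20/10 setting reported in the study is an assumption.

```python
import random


def sample_neighbours(adj, node, k, rng):
    """Return at most k neighbours of node, sampled without replacement."""
    neighbours = adj.get(node, [])
    if len(neighbours) <= k:
        return list(neighbours)
    return rng.sample(neighbours, k)


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sage_layer(features, adj, k, rng):
    """One simplified layer: concatenate a node's own vector with the mean
    of its sampled neighbours' vectors (no learned projection)."""
    out = {}
    for node, h in features.items():
        neighbours = sample_neighbours(adj, node, k, rng)
        if neighbours:
            agg = mean_vectors([features[n] for n in neighbours])
        else:
            agg = [0.0] * len(h)  # isolated node: nothing to aggregate
        out[node] = list(h) + agg
    return out


def two_layer_embedding(features, adj, fanouts=(20, 10), seed=0):
    """Apply one layer per fan-out; two layers reach second-order neighbours."""
    rng = random.Random(seed)
    h = features
    for k in fanouts:
        h = sage_layer(h, adj, k, rng)
    return h


if __name__ == "__main__":
    adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    feats = {"A": [1.0], "B": [2.0], "C": [5.0]}
    print(two_layer_embedding(feats, adj)["A"])
```

In this toy chain A–B–C, the last component of A's embedding depends on C even though A and C are not directly connected, which is the second-order effect the two-layer design is meant to capture.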

To address these limitations, we propose an integrative framework that combines GNNs with causal discovery algorithms to identify key regulatory genes involved in the gastric inflammation-to-cancer transition. GNNs, such as GraphSAGE, are well-suited to biological data due to their ability to model non-Euclidean structures, such as gene co-expression or interaction networks. Unlike conventional machine learning models that treat each gene as an independent feature, GraphSAGE captures contextual information by aggregating signals from a gene’s neighbourhood, thereby reflecting complex topological relationships and functional dependencies.

To evaluate whether our gene regulatory network inference model, based on causal discovery, accurately reflects true relationships between genes, we validated the model using both synthetic and real datasets with gold-standard references from the Dialogue on Reverse Engineering Assessment and Methods 5 (DREAM5) challenge.²⁰ We also compared its performance with three widely used methods. Additionally, ablation studies were conducted to assess the contribution of each model component. To simplify evaluation and reduce confounding variables, the original datasets were randomly divided into 2 × 4 subsets. Evaluation metrics were calculated for each subset, and the mean and standard deviation were reported to provide robust assessments.

We benchmarked our approach against three commonly used gene expression matrix construction methods, namely Pearson’s correlation, partial and semi-partial correlations (ppcor) and cross-event attention-based time-aware network (catnet), on datasets with known ground truth (p < 0.05). F1-scores were calculated for each subset, with the mean and standard deviation shown in Figure 2, where error bars indicate variability. Higher F1-scores reflect superior performance.
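The evaluation metric can be sketched as follows. The example treats edges as undirected when matching predictions to the gold standard, which is an assumption; the source does not state whether edge direction was scored. Function names are hypothetical.

```python
import statistics


def edge_f1(predicted, truth):
    """F1-score of predicted edges against gold-standard edges.

    Edges are compared as unordered pairs (assumption: direction ignored).
    """
    pred = {frozenset(edge) for edge in predicted}
    true = {frozenset(edge) for edge in truth}
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def summarise(scores):
    """Mean and sample standard deviation of per-subset F1-scores."""
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    predicted = [("A", "B"), ("B", "C"), ("C", "D")]
    truth = [("B", "A"), ("C", "D"), ("D", "E"), ("E", "F")]
    print(edge_f1(predicted, truth))
```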

Figure 2.

Benchmarking of our approach against three commonly used gene expression matrix construction methods: Pearson’s correlation, ppcor and catnet (panels (a) and (b)). ppcor: partial and semi-partial correlation; catnet: cross-event attention-based time-aware network.

To assess the importance of the GraphSAGE component, we evaluated the effect of removing the node embedding module. The PC algorithm was applied directly to the sub-datasets to infer regulatory edges, and F1-scores were calculated and compared to those of the full GraphSAGE + PC model.

On the in silico dataset, our model (GraphSAGE + PC) outperformed ppcor and catnet and was only slightly outperformed by Pearson’s correlation. On the Escherichia coli dataset (a real-world dataset from the GEO database), our model outperformed all three comparison methods. This supports the utility of causal discovery-based models for real biological data.

The standalone PC algorithm performed poorly on both datasets, although marginally better on E. coli. These results confirm that the GraphSAGE embedding module significantly improves model accuracy.

While GNNs excel at learning expressive, topology-aware representations, they remain inherently correlational. To move from correlation to inference, we incorporated causal discovery techniques. These approaches are appropriate for transcriptomic data, as they account for latent confounding, sparsity and indirect associations—common features of gene regulatory networks.²¹ By constructing a causal graph from salient genes identified via GNN-based classification, we aim to uncover upstream regulators of malignant transformation.

Causal discovery via the PC algorithm

Assumptions and applicability

In gene regulatory networks, regulatory factors act as causal determinants of gene expression levels. The PC algorithm facilitates the inference of regulatory relationships between genes by assessing their conditional independence. We applied the PC algorithm, a constraint-based causal discovery method, to infer directed gene regulatory networks (Figure 3).

Figure 3.

The PC algorithm, a constraint-based causal discovery method, was applied to infer directed gene regulatory networks by assessing conditional independence among genes. The process involved: generating a true causal diagram (a), constructing a fully connected undirected graph (b), removing the X–Y edge after independence testing (c), further removing the Y–W edge (d), identifying V-shaped structures using X–Y–Z configurations (e), and finally predicting the network diagram by searching Y–Z–W structures (f). PC: Peter–Clark.

The PC algorithm assumes (i) causal sufficiency (i.e. no hidden confounders) and (ii) faithfulness (i.e. all conditional independencies reflect causal structure). Although these assumptions are rarely fully met in transcriptomic data, we mitigated their potential violation by:

(a) using a GraphSAGE-enriched expression matrix, which captures multivariate dependencies and reduces noise and redundancy; and

(b) performing dimensionality reduction through DEG filtering prior to causal inference, improving signal-to-noise ratio and minimising spurious associations.

These precautions facilitate cautious yet informative causal inference from high-dimensional omics data, as supported by previous gene regulatory network applications.²¹

These refined gene embeddings were then input into the PC algorithm, a well-established causal discovery method based on conditional independence testing. To assess conditional independence, we used partial correlation coefficients, which quantify the linear correlation between two variables while controlling for the effects of others. Partial correlation coefficients are particularly suitable for gene expression data, where expression levels are often influenced by multiple interacting genes. Recursive estimation of pairwise partial correlation coefficients was employed to identify direct dependencies between genes, helping to distinguish true regulatory links from indirect associations.
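As an illustration of the conditional independence test, the sketch below computes a first-order partial correlation from pairwise correlations and converts it to a two-sided p-value with the Fisher z-transform. The Fisher z test is a common choice for Gaussian data in PC implementations, but its use here is an assumption; the source states only that partial correlation coefficients were used. Function names are hypothetical.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y given a single variable Z."""
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("Z is perfectly correlated with X or Y")
    return (r_xy - r_xz * r_yz) / denom


def fisher_z_pvalue(r, n, cond_size):
    """Two-sided p-value for H0: partial correlation is zero.

    n: number of samples; cond_size: number of conditioning variables.
    """
    if n - cond_size - 3 <= 0:
        raise ValueError("too few samples for this conditioning set")
    r = max(min(r, 0.999999), -0.999999)  # keep the log finite
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))
    stat = abs(z) * math.sqrt(n - cond_size - 3)
    # 2 * (1 - Phi(stat)) expressed through the complementary error function
    return math.erfc(stat / math.sqrt(2.0))


if __name__ == "__main__":
    # Chain X -> Z -> Y: the X-Y correlation is fully explained by Z.
    r = partial_corr(0.64, 0.8, 0.8)
    print(r, fisher_z_pvalue(r, n=171, cond_size=1))
```

In the chain example the partial correlation vanishes, so the test would not reject independence and the X–Y edge would be removed, which is how indirect associations are separated from direct ones.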

To orient the edges within the resulting directed acyclic graph, we applied three standard collider-based orientation rules: enforcing acyclicity constraints, re-testing conditional independence, and assigning edge directions. This process yielded a final regulatory network specific to the gastric inflammation-to-cancer transition.
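For readers unfamiliar with collider-based orientation, the following sketch shows the standard v-structure step of the PC algorithm: for every unshielded triple X–Z–Y in which Z is absent from the separating set of X and Y, the edges are oriented as X → Z ← Y. This is a generic textbook illustration, not the authors' code; the data structures and the function name are hypothetical.

```python
from itertools import combinations


def orient_v_structures(skeleton, sepsets):
    """Orient colliders in an undirected skeleton.

    skeleton: dict mapping node -> set of adjacent nodes (undirected).
    sepsets: dict mapping frozenset({x, y}) -> set of nodes that rendered
             x and y conditionally independent during edge removal.
    Returns a set of directed edges (source, target).
    """
    directed = set()
    for z, neighbours in skeleton.items():
        for x, y in combinations(sorted(neighbours), 2):
            if y in skeleton[x]:
                continue  # shielded triple: x and y are adjacent
            if z not in sepsets.get(frozenset((x, y)), set()):
                directed.add((x, z))
                directed.add((y, z))
    return directed


if __name__ == "__main__":
    skeleton = {"X": {"Z"}, "Y": {"Z"}, "W": {"Z"}, "Z": {"X", "Y", "W"}}
    sepsets = {
        frozenset(("X", "Y")): set(),   # marginally independent
        frozenset(("X", "W")): {"Z"},
        frozenset(("Y", "W")): {"Z"},
    }
    print(sorted(orient_v_structures(skeleton, sepsets)))
```

Remaining undirected edges are usually oriented afterwards by propagation rules that avoid creating new colliders or cycles; those rules are not shown here.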

Hub gene selection and justification

From the PC-inferred causal graph, we selected hub genes based on node degree centrality. This strategy is rooted in network biology, where hub nodes frequently represent key regulators or bottlenecks in biological systems. Although degree centrality is a heuristic, its integration with differential expression and causal orientation enhances biological plausibility. The top nine genes with the highest degrees were selected for further evaluation.
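Hub selection by degree can be sketched in a few lines. The example counts incoming and outgoing edges together (total degree), which is an assumption, since the source does not specify whether in-degree, out-degree or total degree was used. The function name and the alphabetical tie-break are hypothetical.

```python
from collections import Counter


def top_hubs(edges, k=9):
    """Return the k nodes with the highest total degree.

    edges: iterable of (source, target) pairs from the causal graph.
    Ties are broken alphabetically so the result is deterministic.
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:k]


if __name__ == "__main__":
    toy_edges = [("A", "B"), ("A", "C"), ("D", "A"), ("B", "C")]
    print(top_hubs(toy_edges, k=2))
```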

Validation via survival analysis

To assess clinical relevance, we validated the nine candidate genes using the Kaplan–Meier Plotter (http://kmplot.com/analysis/) in GC patients. Although the platform offers extensive clinical coverage, we acknowledge the limitations of using a single retrospective dataset. Future validation using independent cohorts—such as The Cancer Genome Atlas (TCGA)²² and external GEO datasets—is warranted.

In this study, we retrieved gene expression data related to the gastric inflammation-to-cancer transition from the GEO database and selected two microarray datasets, GSE55696 and GSE130823, for in-depth analysis. The GSE55696 dataset identifies a clear biological distinction between high-grade and low-grade gastric intraepithelial neoplasia, providing molecular evidence for clinical applications.¹⁸ In addition, Zhang et al.¹⁹ used the GSE130823 dataset to identify key molecules involved in the tumorigenesis of intestinal-type GC. Taken together, these two datasets are well suited for this type of research.

The GSE55696 dataset contains expression data for 18,353 genes, while GSE130823 includes data for 32,080 genes. After merging the datasets, a total of 18,128 genes across 171 samples were retained. This sample size is considered representative in biomedical research and is sufficient to support statistical analysis. Furthermore, to ensure the reliability of the results, we applied a standardised data preprocessing pipeline and performed quality control procedures to minimise technical variation and batch effects.

This is a retrospective study. The analysis relied solely on information from an existing public database. In accordance with institutional and ethical guidelines, signed informed consent was not required. The study was conducted in compliance with the Declaration of Helsinki (1975), as revised in 2024. The reporting of this study conforms to STROBE guidelines.²³

Results

Screening of DEGs

To ensure comparability between datasets, quantile normalisation was applied, as described in the Methods section. DEGs were then identified using the limma package.

Volcano plots in Figure 4(a) and (b) illustrate the results. In the gastritis-to-precancerous lesion transition, 857 DEGs were identified (415 upregulated, 442 downregulated). In the transition from precancerous lesions to GC, 337 DEGs were identified (289 upregulated, 48 downregulated). Notably, 83 DEGs were common to both transitions—43 upregulated and 40 downregulated. Figure 4(a) displays a broader p-value distribution and more pronounced expression changes than Figure 4(b), suggesting that the early transition (inflammation to precancer) involves more substantial transcriptomic shifts. Genes in the upper-left (downregulated) and upper-right (upregulated) quadrants of the volcano plots were prioritised for further analysis.

Figure 4.

Volcano plots illustrating DEGs. Figure 4(a) (gastritis-to-precancer) shows a broader p-value distribution and more pronounced expression changes than Figure 4(b) (precancer-to-cancer), suggesting that the early transition (inflammation-to-precancer) involves more substantial transcriptomic shifts. Genes in the upper-left (downregulated) and upper-right (upregulated) quadrants were prioritised for further analysis. DEG: differentially expressed gene.

GO annotation of DEGs

To explore the biological significance of DEGs associated with the inflammation-to-cancer transition, we performed Gene Ontology (GO) enrichment analysis on the intersecting gene set. The analysis yielded 56 significant terms, 41 in BP and 15 in MF. The top 10 most significantly enriched terms are shown in Figure 5.

Figure 5.

GO annotation of DEGs. To explore the biological significance of DEGs associated with the inflammation-to-cancer transition, GO enrichment analysis was performed on the intersecting gene set. The analysis identified 56 significant terms, including 41 in BP and 15 in MF. The top 10 most significantly enriched terms are shown. DEG: differentially expressed gene; BP: biological process; MF: molecular function.

The results revealed strong associations with H. pylori, as top-enriched BP included responses to lipopolysaccharides and bacterial molecules. This suggests that identified genes may mediate host immune responses to bacterial infection. Chronic inflammation, potentially driven by H. pylori and its metabolites, likely contributes to the inflammation-to-cancer transition.

While our initial enrichment focused on H. pylori-related pathways due to its well-established role in early GC, we acknowledge that GC is a multifactorial disease involving diverse oncogenic signalling pathways, including wingless-related integration site/β-catenin, phosphoinositide 3-kinases/Ak strain transforming, mitogen-activated protein kinase and transforming growth factor-β. These pathways are known to promote proliferation, invasion, stemness and immune evasion.²²

GraphSAGE node embedding

We used the filtered gene expression matrix obtained from the differential expression analysis as the input for GraphSAGE. The correlation network (Table 1), constructed from Pearson’s correlation coefficients, served as the source of positive samples for neighbourhood aggregation.

Table 1.

The table summarises potentially relevant genes identified within the network analysis.

Var1	Var2
DEFA6	LTF
DEFA6	S100A9
DEFA6	S100A12
DEFA6	S100A8
LTF	S100A9
LTF	S100A12
LTF	S100A8
S100A9	S100A12
S100A9	S100A8
S100A12	S100A8
CCL25	S100A9
CCL25	S100A12
CCL25	SAA1
CCL25	S100A8
CCL25	CXCL5
S100A9	S100A12
S100A9	SAA1
S100A9	S100A8
S100A9	CXCL5
S100A12	SAA1
S100A12	S100A8
S100A12	CXCL5
SAA1	S100A8
SAA1	CXCL5
S100A8	CXCL5
CPS1	ATP4B
CPS1	EPHB2
CPS1	DEFA6
CPS1	LTF
CPS1	S100A9
CPS1	NLRP7
CPS1	S100A8
CPS1	CXCL5
ATP4B	EPHB2
ATP4B	DEFA6
ATP4B	LTF
ATP4B	S100A9
ATP4B	NLRP7
ATP4B	S100A8
ATP4B	CXCL5
EPHB2	DEFA6
EPHB2	LTF
EPHB2	S100A9
EPHB2	NLRP7
EPHB2	S100A8
EPHB2	CXCL5
DEFA6	LTF
DEFA6	S100A9
DEFA6	NLRP7
DEFA6	S100A8
DEFA6	CXCL5
LTF	S100A9
LTF	NLRP7
LTF	S100A8
LTF	CXCL5
S100A9	NLRP7
S100A9	S100A8
S100A9	CXCL5
NLRP7	S100A8
NLRP7	CXCL5
S100A8	CXCL5
DEFA6	LTF
DEFA6	S100A9
DEFA6	S100A12
DEFA6	S100A8
LTF	S100A9
LTF	S100A12
LTF	S100A8
S100A9	S100A12
S100A9	S100A8
S100A12	S100A8
CPS1	ATP4B
CPS1	EPHB2
CPS1	DEFA6
CPS1	LTF
CPS1	S100A9
CPS1	NLRP7
CPS1	S100A8
CPS1	CXCL5
ATP4B	EPHB2
ATP4B	DEFA6
ATP4B	LTF
ATP4B	S100A9
ATP4B	NLRP7
ATP4B	S100A8
ATP4B	CXCL5
EPHB2	DEFA6
EPHB2	LTF
EPHB2	S100A9
EPHB2	NLRP7
EPHB2	S100A8
EPHB2	CXCL5
DEFA6	LTF
DEFA6	S100A9
DEFA6	NLRP7
DEFA6	S100A8
DEFA6	CXCL5
LTF	S100A9
LTF	NLRP7
LTF	S100A8
LTF	CXCL5
S100A9	NLRP7
S100A9	S100A8
S100A9	CXCL5
NLRP7	S100A8
NLRP7	CXCL5
S100A8	CXCL5
CCL25	S100A9
CCL25	S100A12
CCL25	SAA1
CCL25	S100A8
CCL25	CXCL5
S100A9	S100A12
S100A9	SAA1
S100A9	S100A8
S100A9	CXCL5
S100A12	SAA1
S100A12	S100A8
S100A12	CXCL5
SAA1	S100A8
SAA1	CXCL5
S100A8	CXCL5
CCL25	S100A9
CCL25	S100A12
CCL25	SAA1
CCL25	S100A8
CCL25	CXCL5
S100A9	S100A12
S100A9	SAA1
S100A9	S100A8
S100A9	CXCL5
S100A12	SAA1
S100A12	S100A8
S100A12	CXCL5
SAA1	S100A8
SAA1	CXCL5
S100A8	CXCL5
DEFA6	LTF
DEFA6	S100A9
DEFA6	S100A12
DEFA6	CXCL5
LTF	S100A9
LTF	S100A12
LTF	CXCL5
S100A9	S100A12
S100A9	CXCL5
S100A12	CXCL5
CCL25	S100A9
CCL25	S100A12
CCL25	SAA1
CCL25	S100A8
CCL25	CXCL5
S100A9	S100A12
S100A9	SAA1
S100A9	S100A8
S100A9	CXCL5
S100A12	SAA1
S100A12	S100A8
S100A12	CXCL5
SAA1	S100A8
SAA1	CXCL5
S100A8	CXCL5
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	S100A9
SERPINB5	NLRP7
SERPINB5	S100A8
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	S100A9
SERPING1	NLRP7
SERPING1	S100A8
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	S100A9
SPINK2	NLRP7
SPINK2	S100A8
VIL1	SERPINA5
VIL1	LTF
VIL1	S100A9
VIL1	NLRP7
VIL1	S100A8
SERPINA5	LTF
SERPINA5	S100A9
SERPINA5	NLRP7
SERPINA5	S100A8
LTF	S100A9
LTF	NLRP7
LTF	S100A8
S100A9	NLRP7
S100A9	S100A8
NLRP7	S100A8
CCL25	KIT
CCL25	S100A9
CCL25	S100A12
CCL25	SAA1
CCL25	S100A8
CCL25	CXCL5
KIT	S100A9
KIT	S100A12
KIT	SAA1
KIT	S100A8
KIT	CXCL5
S100A9	S100A12
S100A9	SAA1
S100A9	S100A8
S100A9	CXCL5
S100A12	SAA1
S100A12	S100A8
S100A12	CXCL5
SAA1	S100A8
SAA1	CXCL5
S100A8	CXCL5
CCL25	KIT
CCL25	S100A9
CCL25	S100A12
CCL25	SAA1
CCL25	S100A8
CCL25	CXCL5
KIT	S100A9
KIT	S100A12
KIT	SAA1
KIT	S100A8
KIT	CXCL5
S100A9	S100A12
S100A9	SAA1
S100A9	S100A8
S100A9	CXCL5
S100A12	SAA1
S100A12	S100A8
S100A12	CXCL5
SAA1	S100A8
SAA1	CXCL5
S100A8	CXCL5
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	S100A9
SERPINB5	NLRP7
SERPINB5	S100A8
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	S100A9
SERPING1	NLRP7
SERPING1	S100A8
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	S100A9
SPINK2	NLRP7
SPINK2	S100A8
VIL1	SERPINA5
VIL1	LTF
VIL1	S100A9
VIL1	NLRP7
VIL1	S100A8
SERPINA5	LTF
SERPINA5	S100A9
SERPINA5	NLRP7
SERPINA5	S100A8
LTF	S100A9
LTF	NLRP7
LTF	S100A8
S100A9	NLRP7
S100A9	S100A8
NLRP7	S100A8
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
NEUROD2	BEX1
NEUROD2	KIT
NEUROD2	LTF
NEUROD2	S100A9
NEUROD2	S100A12
NEUROD2	S100A8
BEX1	KIT
BEX1	LTF
BEX1	S100A9
BEX1	S100A12
BEX1	S100A8
KIT	LTF
KIT	S100A9
KIT	S100A12
KIT	S100A8
LTF	S100A9
LTF	S100A12
LTF	S100A8
S100A9	S100A12
S100A9	S100A8
S100A12	S100A8
DEFA6	LTF
DEFA6	S100A9
DEFA6	S100A12
DEFA6	CXCL5
LTF	S100A9
LTF	S100A12
LTF	CXCL5
S100A9	S100A12
S100A9	CXCL5
S100A12	CXCL5
ITGA4	CCL25
ITGA4	KIT
ITGA4	S100A9
ITGA4	S100A12
ITGA4	SAA1
ITGA4	S100A8
ITGA4	CXCL5
CCL25	KIT
CCL25	S100A9
CCL25	S100A12
CCL25	SAA1
CCL25	S100A8
CCL25	CXCL5
KIT	S100A9
KIT	S100A12
KIT	SAA1
KIT	S100A8
KIT	CXCL5
S100A9	S100A12
S100A9	SAA1
S100A9	S100A8
S100A9	CXCL5
S100A12	SAA1
S100A12	S100A8
S100A12	CXCL5
SAA1	S100A8
SAA1	CXCL5
S100A8	CXCL5
DEFA6	LTF
DEFA6	S100A9
LTF	S100A9
CPS1	S100A9
CPS1	S100A8
S100A9	S100A8
B3GAT1	CCND2
B3GAT1	GPR155
B3GAT1	NEUROD2
B3GAT1	EPHB2
B3GAT1	KIT
B3GAT1	DKK1
CCND2	GPR155
CCND2	NEUROD2
CCND2	EPHB2
CCND2	KIT
CCND2	DKK1
GPR155	NEUROD2
GPR155	EPHB2
GPR155	KIT
GPR155	DKK1
NEUROD2	EPHB2
NEUROD2	KIT
NEUROD2	DKK1
EPHB2	KIT
EPHB2	DKK1
KIT	DKK1
CCL25	KIT
CCL25	S100A9
CCL25	S100A12
CCL25	SAA1
CCL25	S100A8
CCL25	CXCL5
KIT	S100A9
KIT	S100A12
KIT	SAA1
KIT	S100A8
KIT	CXCL5
S100A9	S100A12
S100A9	SAA1
S100A9	S100A8
S100A9	CXCL5
S100A12	SAA1
S100A12	S100A8
S100A12	CXCL5
SAA1	S100A8
SAA1	CXCL5
S100A8	CXCL5
SERPING1	DEFA6
SERPING1	IGLL1
SERPING1	LTF
SERPING1	S100A9
SERPING1	S100A12
SERPING1	CXCL5
DEFA6	IGLL1
DEFA6	LTF
DEFA6	S100A9
DEFA6	S100A12
DEFA6	CXCL5
IGLL1	LTF
IGLL1	S100A9
IGLL1	S100A12
IGLL1	CXCL5
LTF	S100A9
LTF	S100A12
LTF	CXCL5
S100A9	S100A12
S100A9	CXCL5
S100A12	CXCL5
B3GAT1	NEUROD2
B3GAT1	EPHB2
B3GAT1	KIT
B3GAT1	DKK1
NEUROD2	EPHB2
NEUROD2	KIT
NEUROD2	DKK1
EPHB2	KIT
EPHB2	DKK1
KIT	DKK1
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
CHP2	CPS1
CHP2	BNIP3
CHP2	NEUROD2
CHP2	OTC
CHP2	KIT
CHP2	S100A8
CPS1	BNIP3
CPS1	NEUROD2
CPS1	OTC
CPS1	KIT
CPS1	S100A8
BNIP3	NEUROD2
BNIP3	OTC
BNIP3	KIT
BNIP3	S100A8
NEUROD2	OTC
NEUROD2	KIT
NEUROD2	S100A8
OTC	KIT
OTC	S100A8
KIT	S100A8
MSX1	GREM2
MSX1	LARP6
MSX1	BLK
MSX1	ITGA4
MSX1	DKK1
MSX1	IRF4
GREM2	LARP6
GREM2	BLK
GREM2	ITGA4
GREM2	DKK1
GREM2	IRF4
LARP6	BLK
LARP6	ITGA4
LARP6	DKK1
LARP6	IRF4
BLK	ITGA4
BLK	DKK1
BLK	IRF4
ITGA4	DKK1
ITGA4	IRF4
DKK1	IRF4
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
CCND2	RASSF2
CCND2	EPHB2
CCND2	KIT
CCND2	DKK1
CCND2	LTF
CCND2	S100A12
CCND2	CD19
RASSF2	EPHB2
RASSF2	KIT
RASSF2	DKK1
RASSF2	LTF
RASSF2	S100A12
RASSF2	CD19
EPHB2	KIT
EPHB2	DKK1
EPHB2	LTF
EPHB2	S100A12
EPHB2	CD19
KIT	DKK1
KIT	LTF
KIT	S100A12
KIT	CD19
DKK1	LTF
DKK1	S100A12
DKK1	CD19
LTF	S100A12
LTF	CD19
S100A12	CD19
B3GAT1	CCND2
B3GAT1	NEUROD2
B3GAT1	EPHB2
B3GAT1	KIT
B3GAT1	DKK1
CCND2	NEUROD2
CCND2	EPHB2
CCND2	KIT
CCND2	DKK1
NEUROD2	EPHB2
NEUROD2	KIT
NEUROD2	DKK1
EPHB2	KIT
EPHB2	DKK1
KIT	DKK1
CPS1	OTC
NEUROD2	DKK1
DEFA6	LTF
DEFA6	S100A12
LTF	S100A12
CPS1	OTC
TBX3	DKK1
MSX1	GREM2
MSX1	DKK1
GREM2	DKK1
CPS1	OTC
DEFA6	LTF
CPS1	OTC
CPS1	OTC
CPS1	S100A8
OTC	S100A8
S100A9	S100A8
THBS2	GREM2
THBS2	LGR6
THBS2	FGFBP1
THBS2	SERPINA5
THBS2	LTF
THBS2	SAA1
GREM2	LGR6
GREM2	FGFBP1
GREM2	SERPINA5
GREM2	LTF
GREM2	SAA1
LGR6	FGFBP1
LGR6	SERPINA5
LGR6	LTF
LGR6	SAA1
FGFBP1	SERPINA5
FGFBP1	LTF
FGFBP1	SAA1
SERPINA5	LTF
SERPINA5	SAA1
LTF	SAA1
S100A9	S100A12
S100A9	S100A8
S100A12	S100A8
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
THBS2	GREM2
THBS2	LGR6
THBS2	FGFBP1
THBS2	SERPINA5
THBS2	LTF
THBS2	SAA1
GREM2	LGR6
GREM2	FGFBP1
GREM2	SERPINA5
GREM2	LTF
GREM2	SAA1
LGR6	FGFBP1
LGR6	SERPINA5
LGR6	LTF
LGR6	SAA1
FGFBP1	SERPINA5
FGFBP1	LTF
FGFBP1	SAA1
SERPINA5	LTF
SERPINA5	SAA1
LTF	SAA1
THBS2	GREM2
THBS2	LGR6
THBS2	FGFBP1
THBS2	SERPINA5
THBS2	LTF
THBS2	SAA1
GREM2	LGR6
GREM2	FGFBP1
GREM2	SERPINA5
GREM2	LTF
GREM2	SAA1
LGR6	FGFBP1
LGR6	SERPINA5
LGR6	LTF
LGR6	SAA1
FGFBP1	SERPINA5
FGFBP1	LTF
FGFBP1	SAA1
SERPINA5	LTF
SERPINA5	SAA1
LTF	SAA1
CPS1	OTC
CPS1	SERPINA5
CPS1	S100A9
CPS1	S100A8
OTC	SERPINA5
OTC	S100A9
OTC	S100A8
SERPINA5	S100A9
SERPINA5	S100A8
S100A9	S100A8
CPS1	SIGLEC11
CPS1	SERPINA5
CPS1	S100A9
CPS1	S100A8
OTC	SERPINA5
OTC	S100A9
OTC	S100A8
SERPINA5	S100A9
SERPINA5	S100A8
S100A9	S100A8
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	SERPINA5
SERPING1	SPINK2
SERPING1	SERPINA5
SPINK2	SERPINA5
SERPINB5	SERPING1
SERPINB5	SPINK2
SERPINB5	VIL1
SERPINB5	SERPINA5
SERPINB5	LTF
SERPINB5	NLRP7
SERPING1	SPINK2
SERPING1	VIL1
SERPING1	SERPINA5
SERPING1	LTF
SERPING1	NLRP7
SPINK2	VIL1
SPINK2	SERPINA5
SPINK2	LTF
SPINK2	NLRP7
VIL1	SERPINA5
VIL1	LTF
VIL1	NLRP7
SERPINA5	LTF
SERPINA6	NLRP7
LTF	NLRP7
S100A9	S100A8
S100A9	S100A8
GREM2	ITGA4
GREM2	KIT
GREM2	KIT
ITGA4	KIT
ITGA4	KIT
KIT	KIT

The Link Neighbour Loader was configured to sample 20 neighbours from the 1-hop and 10 from the 2-hop neighbourhood for each gene. Through this process, GraphSAGE aggregated information from each gene’s local network, updating the expression matrix with enriched, context-aware representations. A portion of the resulting matrix is shown in Tables 2 and 3, while the complete versions are provided as Supplemental Tables S1 and S2.
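As a rough illustration of the sampling step described above, the following Python sketch draws up to 20 first-hop and 10 second-hop neighbours for a gene and averages their feature vectors. It is not the loader or the trained GraphSAGE model used in the study: learned weight matrices and non-linearities are omitted, and the function names, toy graph and expression values are hypothetical.

```python
import random


def sample_neighbours(adj, node, k, rng):
    """Return up to k neighbours of `node`, sampled without replacement."""
    neighbours = sorted(adj.get(node, ()))
    if len(neighbours) <= k:
        return neighbours
    return rng.sample(neighbours, k)


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def two_hop_mean_embedding(adj, features, node, fanouts=(20, 10), seed=0):
    """Two-hop mean aggregation without learned weights (illustration only)."""
    rng = random.Random(seed)
    hop1 = sample_neighbours(adj, node, fanouts[0], rng)
    summaries = []
    for u in hop1:
        hop2 = sample_neighbours(adj, u, fanouts[1], rng)
        # Summarise u by averaging its own features with its sampled neighbours.
        summaries.append(mean_vector([features[u]] + [features[v] for v in hop2]))
    # An isolated gene gets a zero context vector.
    context = mean_vector(summaries) if summaries else [0.0] * len(features[node])
    # Concatenate the gene's own features with its neighbourhood context.
    return list(features[node]) + context


# Hypothetical toy graph and expression values (not data from the study).
toy_adj = {
    "GREM2": {"LARP6", "BLK"},
    "LARP6": {"GREM2", "BLK"},
    "BLK": {"GREM2", "LARP6"},
}
toy_features = {"GREM2": [1.0, 0.0], "LARP6": [0.0, 2.0], "BLK": [2.0, 4.0]}

embedding = two_hop_mean_embedding(toy_adj, toy_features, "GREM2")
print(embedding)
```

The concatenation of a gene's own features with an aggregated neighbourhood vector mirrors the general idea of combining intrinsic expression with network context; in the actual model this combination is passed through trained layers.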

Table 2.

Ablation experiment: performance evaluation on the in silico datasets.

Algorithmic models	In silico 1	In silico 2	In silico 3	In silico 4	M	SD
GraphSAGE + PC	0.6882	0.6824	0.7173	0.6901	0.6945	0.0135
PC	0.5470	0.5250	0.5867	0.5632	0.5555	0.0225

GraphSAGE: graph sample and aggregate; PC: Peter–Clark; M: mean; SD: standard deviation.

Table 3.

Ablation experiment: performance evaluation on the Escherichia coli datasets.

Algorithmic models	E. coli 1	E. coli 2	E. coli 3	E. coli 4	M	SD
GraphSAGE + PC	0.6190	0.5854	0.6087	0.6182	0.6078	0.0136
PC	0.5641	0.5500	0.5778	0.6000	0.5730	0.0184

GraphSAGE: graph sample and aggregate; PC: Peter–Clark; M: mean; SD: standard deviation.
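The M and SD columns of Tables 2 and 3 can be re-derived from the four per-dataset scores. The short Python check below does this with the standard library; the dictionary keys are labels chosen here for illustration. Using the population standard deviation (dividing by n = 4) reproduces the reported SD values to within 0.0001, whereas the sample standard deviation (dividing by n − 1) would give larger values, so the population form appears to be the one reported.

```python
from statistics import mean, pstdev

# Per-dataset scores copied from Tables 2 and 3.
scores = {
    "in silico, GraphSAGE + PC": [0.6882, 0.6824, 0.7173, 0.6901],
    "in silico, PC": [0.5470, 0.5250, 0.5867, 0.5632],
    "E. coli, GraphSAGE + PC": [0.6190, 0.5854, 0.6087, 0.6182],
    "E. coli, PC": [0.5641, 0.5500, 0.5778, 0.6000],
}

# Mean and population standard deviation for each row.
summary = {name: (mean(values), pstdev(values)) for name, values in scores.items()}

for name, (m, sd) in summary.items():
    print(f"{name}: M = {m:.4f}, SD = {sd:.4f}")
```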

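This embedding process integrates the intrinsic expression level of each gene with contextual information from its connected genes, generating biologically meaningful node representations. In the ablation experiments, adding GraphSAGE embeddings raised the mean score of the PC algorithm from 0.5555 to 0.6945 on the in silico datasets (Table 2) and from 0.5730 to 0.6078 on the E. coli datasets (Table 3), indicating that the enhanced embeddings help the downstream causal discovery algorithm identify regulatory relationships more accurately.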

Key gene survival analysis

We assessed the clinical relevance of the nine candidate genes using the Kaplan–Meier Plotter (http://kmplot.com/analysis/), which provides survival data based on publicly available datasets.

All nine genes exhibited statistically significant associations with overall survival in GC patients (p < 0.05), as shown in Figure 6. These findings suggest that these genes may have prognostic value, though further validation is required.
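For readers unfamiliar with the method, the sketch below illustrates the product-limit (Kaplan–Meier) estimator that underlies survival curves of this kind. It is a minimal Python illustration on made-up follow-up data, not the procedure or code of the Kaplan–Meier Plotter; the function name and the example values are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of survival.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each observed event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (months) and event indicators.
toy_times = [5, 8, 8, 12, 15]
toy_events = [1, 1, 0, 1, 0]

km_curve = kaplan_meier(toy_times, toy_events)
for t, s in km_curve:
    print(t, round(s, 3))
```

Comparing two such curves (for example, high versus low expression of a gene) additionally requires a significance test such as the log-rank test, which is not shown here.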

Figure 6.

Key gene survival analysis. The clinical relevance of the nine candidate genes was assessed using the Kaplan–Meier Plotter. All nine genes exhibited statistically significant associations with overall survival in GC patients (p < 0.05). GC: gastric cancer.

Causal discovery

The PC algorithm was applied to the enriched gene expression matrix to infer the underlying gene regulatory network. The resulting network contained 72 nodes and 99 directed edges (Figure 7). Top hub genes were identified based on node degree. The nine genes with the highest centrality were retained for further analysis: mucin 17 (MUC17), brain expressed X-linked 2 (BEX2), BCL2/adenovirus E1B 19 kDa protein-interacting protein 3 (BNIP3), Ras association domain family member 2 (RASSF2), NLR family pyrin domain containing 7 (NLRP7), interferon regulatory factor 4 (IRF4), carbamoyl phosphate synthetase 1 (CPS1), nucleoporin 210 (NUP210) and neuronal differentiation 2 (NEUROD2).
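The hub selection step can be sketched as a simple degree ranking over a directed edge list. The Python example below is a minimal illustration that assumes total degree (incoming plus outgoing edges) as the centrality measure; the study states only that node degree was used, and the edges shown are hypothetical placeholders rather than edges from the inferred network.

```python
from collections import Counter


def top_hubs(edges, k):
    """Rank nodes by total degree (in + out) in a directed edge list."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Sort by descending degree, then by name so ties are broken reproducibly.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical directed edges (source -> target), for illustration only.
toy_edges = [
    ("IRF4", "BLK"),
    ("IRF4", "ITGA4"),
    ("DKK1", "IRF4"),
    ("CPS1", "OTC"),
    ("CPS1", "S100A8"),
    ("NLRP7", "LTF"),
]

hubs = top_hubs(toy_edges, 2)
print(hubs)
```

With the study's network, the same ranking would be applied to the 99 inferred edges with k = 9.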

Figure 7.

The PC algorithm was applied to the enriched gene expression matrix to infer the underlying gene regulatory network. The resulting network contained 72 nodes and 99 directed edges. Nine genes were identified: MUC17 (a), BEX2 (b), BNIP3 (c), RASSF2 (d), NLRP7 (e), IRF4 (f), CPS1 (g), NUP210 (h) and NEUROD2 (i). PC: Peter–Clark; MUC17: mucin 17; BEX2: brain expressed X-linked 2; BNIP3: BCL2/adenovirus E1B 19 kDa protein-interacting protein 3; RASSF2: Ras association domain family member 2; NLRP7: NLR family pyrin domain containing 7; IRF4: interferon regulatory factor 4; CPS1: carbamoyl phosphate synthetase 1; NUP210: nucleoporin 210; NEUROD2: neuronal differentiation 2.

These genes are hypothesised to be key regulators in the inflammation-to-cancer transition, and are discussed below.

Discussion

This study presents an integrative computational pipeline for exploring the inflammation-to-cancer transition in gastric tissues. Our approach combines differential expression analysis, enrichment analysis, graph-based learning, causal inference and clinical outcome validation to identify candidate biomarkers and regulatory genes.

The primary objective of this study was to identify key genes implicated in the gastric ‘inflammation-to-cancer transformation’. DEGs were used as an initial screening step to reduce data dimensionality; they do not constitute the central focus or main challenge of the research. Further investigations are planned to build upon these findings.

Differential expression analysis revealed 857 DEGs in the transition from gastritis to precancerous lesions and 337 DEGs in the progression to GC. The 83 overlapping DEGs indicate molecular continuity throughout disease development. Notably, the greater number of DEGs in the early stage suggests that inflammation-induced dysregulation plays a dominant role in tumour initiation, consistent with a recent publication describing the mechanistic role of inflammation in cancer initiation, progression and metastasis.²⁴

Enrichment analyses highlighted immune and bacterial response pathways, with many DEGs associated with H. pylori-induced inflammation and immune signalling. These findings align with the established role of persistent infection and chronic inflammation in GC.²⁵,²⁶

Kaplan–Meier analysis showed that the nine candidate genes are significantly associated with overall survival, although this alone does not establish clinical utility. Survival outcomes may be influenced by confounding variables such as treatment regimens, molecular subtypes and comorbidities. Thus, further validation using prospective cohorts and experimental assays, employing both animal models²⁷ and human GC tissues,²⁸ is essential.

Additionally, while convolutional neural networks have demonstrated strong performance on structured, grid-like data (e.g., images), they are suboptimal for modelling non-Euclidean biological networks. In contrast, GNNs are specifically designed for graph-structured data, such as gene interaction networks. GNNs incorporate both node features and relational structures, making them well-suited for representing complex biological systems.²⁹

GraphSAGE, in particular, offers advantages over traditional graph convolutional networks by enabling inductive learning, better generalisation and computational efficiency, all of which are important when analysing large-scale omics data. GraphSAGE has further demonstrated strong performance in related biomedical tasks, supporting its selection for this study.

To move beyond correlation-based methods, we implemented the GraphSAGE algorithm, which allowed us to model gene–gene interactions in a topologically informed manner. The learned embeddings improved the performance of the PC algorithm in identifying plausible causal relationships. The resulting regulatory graph comprised 72 nodes and 99 edges, from which we identified nine key hub genes.

Among the nine key genes identified, MUC17, a membrane-associated mucin that maintains epithelial integrity and may act as a tumour suppressor,³⁰ has so far received little attention in GC. Mutations in MUC17 have been associated with the prognosis of adult-type diffuse glioma,³¹ and its most established gastrointestinal association is with Crohn’s disease, where MUC17 disruption plays a pathogenic role.³² Our findings provide gene-level evidence supporting its potential relevance in GC and precision medicine, warranting further investigation at the mRNA level using RNA sequencing (RNA-seq).

BNIP3, a pro-apoptotic gene involved in autophagy and the hypoxia response, is dysregulated in multiple cancers³³ but, to date, has been reported in only one gastrointestinal cancer, hepatocellular carcinoma, in a single study.³⁴ Its central position in our analysis supports further investigation into its role in GC development and its potential as an early diagnostic biomarker and/or a therapeutic target.

IRF4, a transcription factor regulating immune responses,³⁵ cooperates with basic leucine zipper activating transcription factor-like transcription factor to counter T cell exhaustion in mouse tumour models,³⁶ yet no studies have linked it to GC. Our findings suggest novel immunological mechanisms worth exploring in vitro and/or in vivo.

RASSF2, a tumour suppressor frequently silenced by methylation,³⁷ is involved in the differentiation of breast,³⁸ prostate³⁹ and squamous cervical cancers,⁴⁰ but its role in GC remains undefined. Our results provide preliminary evidence for its involvement in GC carcinogenesis, which may offer a novel target for early diagnosis and/or intervention.

NEUROD2, typically linked to neuronal differentiation, may also promote metastasis and survival in non-neuronal malignancies, especially glioblastoma.⁴¹ Our data highlight the need to examine its contribution to early-stage GC, beginning with in vitro studies and subsequently in vivo validation using GC samples.²⁸

BEX2 has been suggested to correlate inversely with GC prognosis by regulating apoptosis, proliferation and metastasis,⁴² though its precise role remains unclear. Our findings strengthen the evidence for a link between BEX2 and GC, which should be validated in vitro and/or in vivo.

NLRP7 has been associated with lung cancer via pyroptosis⁴³ and with tumour-associated macrophages in colorectal cancer,⁴⁴ while CPS1 has been linked to GC prognosis through effects on immune infiltration, differentiation and metastasis,⁴⁵ although current evidence is limited. Our results further support the need to verify the roles of these two molecules in GC, at least in vitro.

Finally, NUP210, implicated in acute myeloid leukaemia⁴⁶ and in colorectal cancer, where its depletion suppresses metastasis,⁴⁷ emerged as a central hub in our inferred regulatory network, warranting prioritised functional validation.

Limitations and future directions

Several limitations should be noted. First, bulk tissue expression: our analyses cannot resolve cell type–specific signals, potentially obscuring contributions from immune or stromal compartments. Single-cell RNA-seq⁴⁸ would provide higher-resolution insights. Second, causal assumptions: the PC algorithm assumes no hidden confounders and exact conditional independencies. Although DEG filtering and GraphSAGE embeddings helped to limit the impact of these assumptions, violations could still affect inference quality. Third, hub gene selection: node degree may favour highly connected but biologically nonspecific genes, potentially overlooking important low-centrality regulators. High connectivity can also reflect generic hubs or methodological bias rather than true disease drivers.
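To make the conditional independence assumption concrete, the sketch below shows one test commonly paired with the PC algorithm for continuous data: a partial correlation assessed with Fisher's z-transform. Whether this particular test was used in the study is an assumption; the function names and simulated data are hypothetical. The example simulates a chain X → Z → Y, in which X and Y are correlated marginally but nearly uncorrelated once Z is conditioned on, which is the pattern that leads PC to remove the X–Y edge.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def fisher_z_pvalue(r, n, cond_size):
    """Two-sided p-value for H0: the (partial) correlation is zero."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - cond_size - 3) * abs(z)
    return 2 * (1 - NormalDist().cdf(stat))


# Simulated chain X -> Z -> Y with Gaussian noise (illustrative data only).
rng = random.Random(0)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
z = [xi + rng.gauss(0, 1) for xi in x]
y = [zi + rng.gauss(0, 1) for zi in z]

r_marginal = pearson(x, y)
r_partial = partial_corr(x, y, z)
p_marginal = fisher_z_pvalue(r_marginal, n, 0)
p_partial = fisher_z_pvalue(r_partial, n, 1)
print(f"marginal r = {r_marginal:.3f}, partial r given Z = {r_partial:.3f}")
```

If an unmeasured variable drove both X and Y, no conditioning set of measured genes would remove their dependence, and PC could report a spurious edge; this is the hidden-confounder risk noted above.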

Future work will explore alternative causal discovery algorithms, such as the linear non-Gaussian acyclic model (LiNGAM), to relax PC assumptions, and employ genome-wide clustered regularly interspaced short palindromic repeats (CRISPR) screens⁴⁹ to validate gene essentiality. Protein–protein interaction databases, for example the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) and the Biological General Repository for Interaction Datasets (BioGRID), will help assess whether top-degree genes are embedded within known regulatory modules or signalling pathways.

Validation will use independent cohorts (e.g., TCGA,²² external GEO datasets) and experimental assays in vitro, in vivo and on human GC tissues, including quantitative polymerase chain reaction, immunohistochemistry,²⁸ CRISPR perturbation⁴⁹ and single-cell RNA-seq.⁴⁸ A recent study by Hu et al.⁵⁰ used single-cell RNA-seq to identify ribosomal protein subunits as mediators of the inflammation-to-cancer transition in GC, validated in animal models²⁷ and human GC tissues.²⁸ We aim to follow a similar framework to confirm the role of our candidate genes and identify therapeutic targets. Finally, multi-omics integration (e.g., epigenomics, proteomics, metabolomics) will be pursued to enhance biological interpretability and network resolution.

Conclusion

We have developed a binary classification framework combining GraphSAGE and the PC algorithm to infer gene regulatory networks underlying the gastric inflammation-to-cancer transition. This approach identified nine candidate genes linked to GC progression and patient survival, highlighting their potential as non-invasive biomarkers and therapeutic targets. Representing the training phase of our pipeline, this integrative framework provides a scalable, biologically meaningful strategy for early GC detection, tumour biology elucidation and precision medicine, with clinical validation underway.

Supplemental Material

sj-xlsx-1-smo-10.1177_20503121251380131 – Supplemental material for Unravelling key genes and molecular pathways in gastric inflammation-to-cancer transition through causal discovery: implications for early diagnosis and therapy

Supplemental material, sj-xlsx-1-smo-10.1177_20503121251380131 for Unravelling key genes and molecular pathways in gastric inflammation-to-cancer transition through causal discovery: implications for early diagnosis and therapy by Zhen Ren, Xiaochen Li, Pengyun Liu, Jinjuan Li and Shisan Bao in SAGE Open Medicine

Supplemental Material

sj-xlsx-2-smo-10.1177_20503121251380131 – Supplemental material for Unravelling key genes and molecular pathways in gastric inflammation-to-cancer transition through causal discovery: implications for early diagnosis and therapy

Supplemental material, sj-xlsx-2-smo-10.1177_20503121251380131 for Unravelling key genes and molecular pathways in gastric inflammation-to-cancer transition through causal discovery: implications for early diagnosis and therapy by Zhen Ren, Xiaochen Li, Pengyun Liu, Jinjuan Li and Shisan Bao in SAGE Open Medicine

Footnotes

Acknowledgements

We appreciate the language editing by Dr Brett D. Hambly.

ORCID iD

Shisan Bao

Ethical considerations

Ethical approval for this study was obtained from The Human Ethic Committee, Gansu University of Chinese Medicine (Ethics Approval Number: 2024-230).

Informed consent

No patient data were obtained directly, as the current study relied solely on information from an existing public database. In accordance with institutional and ethical guidelines, signed informed consent was not required for this retrospective analysis.

Author contributions

Zhen Ren: experimental design and supervision, drafting of the manuscript; Xiaochen Li: data collection and analysis; Pengyun Liu: data collection and analysis; Jinjuan Li: data collection and analysis; and Shisan Bao: intellectual input and revision of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Gansu Provincial Natural Science Foundation, China: Clinical Decision Support Research for Post-Hepatitis Liver Cirrhosis Based on Data Mining (grant number: 23JRRA1719). The deep integration of medical and health care advances clinical rehabilitation and fosters the high-quality development of health services (grant number: 24RCKD001).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

Data are available from the corresponding author upon reasonable request.

Supplemental material

Supplemental material for this article is available online.

References

1. Edgar, Domrachev, Lash AE. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002; 30: 207–210.

2. National Cancer Institute GCT. Gastric cancer treatment (PDQ®)–health professional version. 2024, https://www.cancer.gov/types/stomach/hp/stomach-treatment-pdq (accessed 21 February 2025).

3. Morgan, Arnold, Camargo, et al. The current and future incidence and mortality of gastric cancer in 185 countries, 2020-40: a population-based modelling study. eClinicalMedicine 2022; 47: 101404.

4. Shimada, Tanaka. A new era for understanding genetic evolution of multistep carcinogenesis. J Gastroenterol 2019; 54: 667–668.

5. Land, Parada, Weinberg RA. Cellular oncogenes and multistep carcinogenesis. Science 1983; 222: 771–778.

6. Bravo, Hoare, Soto, et al. Helicobacter pylori in human health and disease: mechanisms for local gastric and systemic effects. World J Gastroenterol 2018; 24: 3071–3089.

7. Khatoon, Rai, Prasad KN. Role of Helicobacter pylori in gastric cancer: updates. World J Gastrointest Oncol 2016; 8: 147–158.

8. Zhao, Yan, et al. Inflammation and tumor progression: signaling pathways and targeted intervention. Signal Transduct Target Ther 2021; 6: 263.

9. Dunn, Old, Schreiber RD. The immunobiology of cancer immunosurveillance and immunoediting. Immunity 2004; 21: 137–148.

10. Togasaki, Sugimoto, Ohta, et al. Wnt signaling shapes the histologic variation in diffuse gastric cancer. Gastroenterology 2021; 160: 823–830.

11. Necula, Matei, Dragu, et al. Recent advances in gastric cancer early diagnosis. World J Gastroenterol 2019; 25: 2029–2044.

12. Hotamisligil GS. Inflammation, metaflammation and immunometabolic disorders. Nature 2017; 542: 177–185.

13. Cui, Guo. Bioinformatics analysis of the association between obesity and gastric cancer. Front Genet 2024; 15: 1385559.

14. Carino, Graziosi, Marchiano, et al. Analysis of gastric cancer transcriptome allows the identification of histotype specific molecular signatures with prognostic potential. Front Oncol 2021; 11: 663771.

15. Smith, Sheltzer JM. Genome-wide identification and analysis of prognostic features in human cancers. Cell Rep 2022; 38: 110569.

16. Chen, Wang, Nie, et al. Co-expression network analysis identified CDH11 in association with progression and prognosis in gastric cancer. Onco Targets Ther 2018; 11: 6425–6436.

17. Wang, et al. Cell graph neural networks enable the precise prediction of patient survival in gastric cancer. NPJ Precis Oncol 2022; 6: 45.

18. Feng, Liu, et al. Differential gene expression profiling of gastric intraepithelial neoplasia and early-stage adenocarcinoma. World J Gastroenterol 2014; 20: 17883–17893.

19. Zhang, et al. Dissecting expression profiles of gastric precancerous lesions and early gastric cancer to explore crucial molecules in intestinal-type gastric cancer tumorigenesis. J Pathol 2020; 251: 135–146.

20. Magnusson, Gustafsson. LiPLike: towards gene regulatory network predictions of high certainty. Bioinformatics 2020; 36: 2522–2529.

21. Glymour, Zhang, Spirtes. Review of causal discovery methods based on graphical models. Front Genet 2019; 10: 524.

22. National Cancer Institute CfCG. The cancer genome atlas program (TCGA). National Cancer Institute, 2023.

23. von Elm, Altman, Egger, et al. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med 2007; 147: 573–577.

24. Nishida, Andoh. The role of inflammation in cancer: mechanisms of tumor initiation, progression, and metastasis. Cells 2025; 14(7): 488.

25. Wroblewski, Peek, Wilson KT. Helicobacter pylori and gastric cancer: factors that modulate disease risk. Clin Microbiol Rev 2010; 23: 713–739.

26. Choi, Kook, Kim, et al. Helicobacter pylori therapy for the prevention of metachronous gastric cancer. N Engl J Med 2018; 378: 1085–1095.

27. Cox, Amirfakhri, Lwin, et al. A new locoregional mouse model of gastric cancer for identifying probes for fluorescence guided surgery. Surgery 2025; 181: 109270.

28. Liu, Zhang, et al. Inverse correlation between Interleukin-34 and gastric cancer, a potential biomarker for prognosis. Cell Biosci 2020; 10: 94.

29. Zhou, Yang, Wang, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 2020; 579: 270–273.

30. Yang, et al. Mucin 17 inhibits the progression of human gastric cancer by limiting inflammatory responses through a MYH9-p53-RhoA regulatory feedback loop. J Exp Clin Cancer Res 2019; 38: 283.

31. Machado, Ferrer VP. MUC17 mutations and methylation are associated with poor prognosis in adult-type diffuse glioma patients. J Neurol Sci 2023; 452: 120762.

32. Layunta, Javerfelt, van de Koolwijk, et al. MUC17 is an essential small intestinal glycocalyx component that is disrupted in Crohn’s disease. JCI Insight 2024; 10(3): e181481.

33. Giatromanolaki, Koukourakis, Sowter, et al. BNIP3 expression is linked with hypoxia-regulated protein expression and with poor prognosis in non-small cell lung cancer. Clin Cancer Res 2004; 10: 5566–5571.

34. Yao, Wang, et al. CDK9 inhibition blocks the initiation of PINK1-PRKN-mediated mitophagy by regulating the SIRT1-FOXO3-BNIP3 axis and enhances the therapeutic effects involving mitochondrial dysfunction in hepatocellular carcinoma. Autophagy 2022; 18: 1879–1897.

35. Liang, et al. Regulatory effects of IRF4 on immune cells in the tumor microenvironment. Front Immunol 2023; 14: 1086803.

36. Seo, Gonzalez-Avalos, Zhang, et al. BATF and IRF4 cooperate to counter exhaustion in tumor-infiltrating CAR T cells. Nat Immunol 2021; 22: 983–995.

37. Schagdarsurengin, Richter, Hornung, et al. Frequent epigenetic inactivation of RASSF2 in thyroid cancer and functional consequences. Mol Cancer 2010; 9: 264.

38. Perez-Janices, Blanco-Luquin, Torrea, et al. Differential involvement of RASSF2 hypermethylation in breast cancer subtypes and their prognosis. Oncotarget 2015; 6: 23944–23958.

39. Aykanli, Arisan, et al. Diagnostic value of GSTP1, RASSF1, and RASSF2 methylation in serum of prostate cancer patients. Urol J 2024; 21: 182–188.

40. Guerrero-Setas, Perez-Janices, Blanco-Fernandez, et al. RASSF2 hypermethylation is present and related to shorter survival in squamous cervical cancer. Mod Pathol 2013; 26: 1111–1122.

41. Agrawal, Garg, Benny Malgulwar, et al. p53 and miR-210 regulated NeuroD2, a neuronal basic helix-loop-helix transcription factor, is downregulated in glioblastoma patients and functions as a tumor suppressor under hypoxic microenvironment. Int J Cancer 2018; 142: 1817–1828.

42. Yasumoto, Fujimori, Mochizuki, et al. BEX2 is poor prognostic factor and required for cancer stemness in gastric cancer. Biochem Biophys Res Commun 2023; 655: 59–67.

43. Jing, Yun, et al. Pyroptosis and inflammasome-related genes-NLRP3, NLRC4 and NLRP7 polymorphisms were associated with risk of lung cancer. Pharmgenomics Pers Med 2023; 16: 795–804.

44. et al. NLRP7 deubiquitination by USP10 promotes tumor progression and tumor-associated macrophage polarization in colorectal cancer. J Exp Clin Cancer Res 2021; 40: 126.

45. Fang, Xiang, et al. Expression profiling of CPS1 in Correa’s cascade and its association with gastric cancer prognosis. Oncol Lett 2021; 21: 441.

46. Zhao. Bioinformatics analysis of the expression and clinical significance of the NUP210 gene in acute myeloid leukaemia. Hematology 2022; 27: 456–462.

47. Kondo, Mishiro, Iwashima, et al. Discovery of a novel aminocyclopropenone compound that inhibits BRD4-driven nucleoporin NUP210 expression and attenuates colorectal cancer growth. Cells 2022; 11(3): 317.

48. Tang, et al. A pan-cancer single-cell panorama of human natural killer cells. Cell 2023; 186: 4235–4251.

49. Meyers, Bryan, McFarland, et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat Genet 2017; 49: 1779–1784.

50. Hu, et al. Analysis of single-cell RNA sequencing data to examine the gastric inflammation-to-cancer transition and evaluation of the effect of probiotic on precancerous lesions. Eng Microbiol 2025; 5: 100208.
