Artificial Intelligence Approaches for Predictive Biomarker Discovery in Non-Small Cell Lung Cancer

Abstract

Introduction

Non-small cell lung cancer (NSCLC) is the most prevalent and lethal subtype of lung cancer. Most patients are diagnosed at an advanced stage of the disease, resulting in a poor prognosis. Early treatment and clinical intervention for NSCLC following early diagnosis can improve patients’ survival rate. It is of considerable significance to develop a more efficient and precise approach for identifying key genes and clinically pertinent biomarkers in NSCLC to enable its early diagnosis.

Methods

An interpretable two-stage analytical framework integrating artificial intelligence (AI) techniques is proposed to improve the accuracy of gene screening for NSCLC. First, gene-level statistical features derived from the GSE19804, GSE30219, and GSE33532 datasets are standardized and reduced in dimension via principal component analysis (PCA), which reveals two distinct linear distribution patterns of candidate genes in the PCA projection space. Subsequently, these candidate genes are validated using TCGA data through the GEPIA platform by evaluating their differential expression profiles and associations with patient survival outcomes, with the goal of identifying robust predictive biomarkers.

Results

Through AI-driven analytical pipelines, multiple tumor-associated genes are screened and confirmed to be correlated with NSCLC progression. Notably, ADGRD1 (Adhesion G Protein-Coupled Receptor D1) exhibits a close association with pulmonary physiological functions and may serve as a potential biomarker in the initiation and progression of NSCLC.

Conclusion

The proposed method combines unsupervised structural discovery with cross-cohort clinical evidence to prioritize NSCLC biomarkers, providing critical support for early diagnosis, prognostic stratification, and biomarker-guided therapeutic strategies. Furthermore, the study provides technical support for biomarker discovery in other cancer types, and highlights the application value of integrating computational intelligence with oncology research.

Keywords

non-small cell lung cancer (NSCLC), artificial intelligence (AI), principal component analysis (PCA), Adhesion G Protein-Coupled Receptor D1 (ADGRD1), predictive biomarkers

Introduction

Lung cancer accounts for a substantial proportion of cancer-related morbidity and mortality globally, imposing an enormous burden on public health systems.¹ It is estimated that approximately 787,000 new cases of lung cancer are diagnosed in China each year, with more than 630,000 patients dying from the disease, ranking first globally in both incidence and mortality rates.² Among these cases, 85% of lung cancers are non-small cell lung cancer (NSCLC), which mainly includes lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC).³ As the incidence and mortality rates of NSCLC continue to rise, research on its early screening and prognostic assessment has become a pivotal focus in the field of lung cancer research and several advancements have been made. Early detection and diagnosis of lung cancer can reduce its mortality rate. Currently, early lung cancer detection technologies are evolving toward diversification and non-invasiveness, mainly including traditional imaging detection, non-invasive breath testing technology, liquid biopsy technology, and biomarker detection.^4–6 However, with the 5-year overall survival rate remaining between 15% and 20%, the clinical prognosis remains dismal. This poor outcome is primarily attributed to the fact that approximately 70% of patients are diagnosed at an advanced stage, where radical surgical resection is no longer feasible and the efficacy of systemic therapy is significantly limited.^7,8

To address this challenge, exploring the key genes involved in the initiation and progression of NSCLC has become imperative. The advancement of high-throughput sequencing technology and the accumulation of large-scale genomic, transcriptomic, and epigenomic datasets have provided robust support for deciphering the complex molecular landscape of NSCLC. Previous studies have identified driver mutations in key oncogenes and tumor suppressor genes, including EGFR, KRAS, TP53, ALK, and ROS1.^9–13 These biomarkers are crucial for understanding the pathogenesis and immunotherapeutic responses of NSCLC. Their discovery has ushered in the era of precision oncology and accelerated the development of molecular targeted therapies, such as EGFR tyrosine kinase inhibitors (TKIs), ALK inhibitors, and immune checkpoint inhibitors.¹⁴

However, the tumorigenesis and progression of NSCLC are driven by the complex interplay of multiple factors, including gene mutations, epigenetic modifications, transcriptional dysregulation, and tumor microenvironment perturbations.¹⁵ Although single-gene analysis approaches have demonstrated utility, they fail to fully capture the complexity of intergenic interactions and pathway crosstalk that underpin malignant phenotypes. Therefore, there is an urgent need for comprehensive research strategies that integrate multi-dimensional genomic data to identify reliable biomarkers and key regulatory genes capable of guiding early diagnosis, prognostic evaluation, and personalized treatment.¹⁶ To achieve a more comprehensive identification of NSCLC-related key genes, we have incorporated AI technology—a transformative tool in biomedical research that excels in processing and interpreting high-dimensional, complex biological data. In particular, AI-driven biomarker discovery exerts a remarkably positive effect on lung cancer screening. For example, in the field of non-invasive detection, Binson et al proposed an electronic nose system for analyzing volatile organic compounds (VOCs) in exhaled breath. 
Combined with machine learning algorithms, this system can effectively distinguish lung cancer patients from those with chronic obstructive pulmonary disease (COPD) and healthy individuals, providing a novel approach for non-invasive early disease screening.¹⁷ Their subsequent research further optimized the system, focusing on lung cancer staging detection and verifying the application potential of this method in assessing disease progression.¹⁸ In the domain of precise subtyping, Dwivedi et al constructed an interpretable AI deep learning framework, which screens clinically meaningful biomarker sets for NSCLC from high-dimensional gene expression data, facilitating the differentiation between LUAD and LUSC subtypes.¹⁹ At the level of prognostic prediction, Fang et al innovatively integrated knowledge graphs with machine learning. By extracting graph embedding features, this integration compensates for the insufficiency of gene panel data and significantly enhances the stability and reliability of survival risk prediction for NSCLC patients.²⁰ Compared with traditional statistical methods, AI exhibits superior performance in capturing non-linear relationships and hidden patterns within genomic datasets, providing strong support for comprehensive molecular characterization of NSCLC.^21–29

The objective of this study is to establish a reliable computational framework for the molecular diagnosis of NSCLC and for biomarker discovery by integrating AI approaches with the analysis of public gene expression datasets (GSE19804, GSE30219, GSE33532, TCGA, and GEPIA) and by constructing a gene classification and prediction model. This study proposes an AI-based method that combines unsupervised structural discovery with cross-database validation to prioritize NSCLC biomarkers. Applying the proposed method validates multiple NSCLC-related genes identified in previous studies; additionally, ADGRD1, a gene previously understudied in the context of NSCLC, is newly identified. ADGRD1 is a group V member of the adhesion G protein-coupled receptor family. Current studies have shown that ADGRD1 is highly expressed in acute myeloid leukemia (AML) and is closely associated with shorter survival time in patients. Furthermore, ADGRD1 is highly expressed in the hypoxic regions of malignant gliomas and is regulated by the HIF-α transcription factor, thereby promoting the occurrence and progression of gliomas. These studies suggest that ADGRD1 functions as an oncogene in these tumor types. ADGRD1 participates in tumor cell differentiation, proliferation, invasion, infiltration, and tumor growth. It is abnormally expressed in glioblastoma, oral squamous cell carcinoma, and other tumors, and correlates with the prognosis of oral squamous cell carcinoma patients.³⁰ We conduct an in-depth investigation into the correlation between ADGRD1 expression in NSCLC and patient prognosis. In contrast to the tumors above, ADGRD1 is expressed at low levels in lung cancer, and this low expression is linked to poor prognosis, indicating that ADGRD1 may exert a tumor suppressor role in lung cancer. Therefore, exploring the causes of ADGRD1 downregulation in lung cancer is essential, and its regulatory mechanism may provide a crucial basis for lung cancer treatment targeting ADGRD1.³¹ This discovery is expected to advance the development of early diagnostic strategies and personalized treatment regimens, ultimately improving clinical outcomes. Furthermore, this study provides technical support for biomarker discovery in other cancer types, highlighting the value of integrating computational intelligence with oncology research.

Data Sources and Method

Data Sources

This study integrates public gene expression datasets to conduct a comprehensive analysis of the molecular characteristics associated with NSCLC. The data are primarily obtained from the following two sources:

GSE19804/30219/33532 Datasets: Downloaded from the Gene Expression Omnibus (GEO), these datasets contain gene expression profiles of NSCLC patient samples and their matched normal control samples. To ensure the biological relevance of the included genes, strict screening criteria were established: |log2 fold change (log2FC)| > 1 and adjusted P-value < .05. After screening, a total of 1960 differentially expressed genes (DEGs) were identified. These genes exhibit significant expression alterations in tumor samples, laying a solid foundation for subsequent molecular characteristic analysis.
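The screening rule above can be illustrated with a minimal Python sketch. The gene labels and numeric values are hypothetical placeholders, not data from the study, and the function name `is_deg` is introduced only for illustration.

```python
def is_deg(log2_fc, adj_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Return True when |log2FC| > fc_cutoff and adjusted P < p_cutoff."""
    return abs(log2_fc) > fc_cutoff and adj_p < p_cutoff


# Hypothetical rows: (gene label, log2FC, adjusted P-value)
rows = [
    ("GENE_A", -2.3, 0.001),   # large decrease, significant
    ("GENE_B", 0.4, 0.0005),   # significant, but effect too small
    ("GENE_C", 1.8, 0.20),     # large effect, not significant
    ("GENE_D", 1.2, 0.03),     # moderate increase, significant
]

degs = [name for name, fc, p in rows if is_deg(fc, p)]
print(degs)
```

Both conditions must hold at once, so a gene with a very small P-value but a modest fold change (GENE_B here) is excluded.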

GEPIA Dataset: The study also utilizes the Gene Expression Profiling Interactive Analysis (GEPIA) platform, which includes gene expression profiles of lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), as well as patient survival-associated gene data. The core advantage of the GEPIA platform lies in its integration of resources from two authoritative databases:

The Cancer Genome Atlas (TCGA): Provides large-scale multi-dimensional data (eg, genomics, transcriptomics) of cancer samples;

Genotype-Tissue Expression (GTEx) Project: Provides gene expression data of normal tissues.

This data integration feature ensures that the gene information used in the study is both comprehensive and reliable, so it was employed as the validation dataset.

AI-Based Processing Method

In this study, the GSE19804/30219/33532 datasets were used as the core discovery datasets. By combining AI technology with bioinformatics analysis, a two-stage “screening-verification” research framework was constructed. The overall research followed the logical sequence of “AI model-based candidate gene screening → GEPIA platform-based validation → predictive biomarker identification”.^32,33 The ultimate goal was to identify disease-related biomarkers with clinical predictive value, and the specific protocol is as follows:

A. Discovery Stage: AI Model-Based Candidate Gene Screening

First, preprocessing was performed on GSE19804/30219/33532 datasets, including data normalization, missing value imputation, and quality control of the gene expression matrix, to ensure reliable input for the predictive model. A linear regression model based on artificial intelligence was employed to construct predictive models using gene expression levels as input features. Clinical endpoints, such as disease stage and prognostic status, were clearly defined as indicators of patient outcomes: disease stage was represented as categorical variables reflecting tumor progression, while prognostic status (eg, overall survival or disease-free survival) was quantified using patient survival time and event status, and summarized through statistical measures such as log-rank test p-values and Cox proportional hazards model hazard ratios (HRs). These clinical endpoints were integrated into the model as additional input features, enabling the linear regression model to simultaneously capture the relationships between gene expression and patient outcomes. This model was chosen because it can effectively handle continuous gene expression data while maintaining interpretability, allowing the regression coefficients to directly reflect the contribution of each gene to the prediction of clinical endpoints, and it is stable to train on a dataset with a moderate sample size, avoiding overfitting. To identify the most informative genes contributing to the prediction, feature selection was performed using principal component analysis (PCA), initially narrowing down the set of candidate genes for subsequent analysis.

B. Verification Stage: GEPIA Platform-Based Dual Validation

This stage utilized GEPIA (Gene Expression Profiling Interactive Analysis)—an online tool built on The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) databases—to perform differential expression verification and survival association verification of the candidate genes.

C. Confirmation Stage: Predictive Biomarker Establishment

By synthesizing the results of the above two stages, genes that simultaneously met the criteria of “high importance in the AI model, significant differential expression in GEPIA, and significant association with survival in GEPIA” were screened. These genes were ultimately established as biomarkers with clinical predictive value, providing core targets for subsequent mechanism research or the construction of diagnosis/prognosis models. The workflow is shown in Figure 1.

Figure 1.

The framework of predictive biomarker discovery.

Data Preprocessing

A variety of statistical indicator features were extracted from the original GSE19804 dataset, including log fold change ($\log FC$), p-values, false discovery rate-adjusted p-values, t-statistics, and B-statistics.³⁴ Four of these indicators were then used to construct a four-dimensional feature vector for each gene i, as follows:

x_{i} = [\log FC_{i}, -\log_{10} (\mathrm{adj.P.Val}_{i}), t_{i}, B_{i}]

(1)

To ensure data comparability and avoid bias caused by scale differences, all features were normalized using Z-score standardization³⁴:

{\tilde{x}}_{i j} = \frac{x_{i j} - μ_{j}}{σ_{j}}, μ_{j} = \frac{1}{n} \sum_{i = 1}^{n} x_{i j}, σ_{j} = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} {(x_{i j} - μ_{j})}^{2}}

(2)

Missing values were handled by imputation using the k-nearest neighbors (k-NN) algorithm to maintain dataset integrity.³⁵
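As a minimal sketch of the Z-score step in Equation 2, the following Python fragment standardizes each feature column with the sample standard deviation (n - 1 denominator). The feature values are hypothetical, the function name `zscore_columns` is introduced for illustration, and the k-NN imputation step is not shown.

```python
from statistics import mean, stdev


def zscore_columns(matrix):
    """Standardize each column as (x - mean) / sample standard deviation."""
    columns = list(zip(*matrix))
    mus = [mean(col) for col in columns]
    sigmas = [stdev(col) for col in columns]  # stdev uses the n - 1 denominator
    return [
        [(value - mu) / sigma for value, mu, sigma in zip(row, mus, sigmas)]
        for row in matrix
    ]


# Hypothetical rows of [logFC, -log10(adj.P.Val), t, B] for three genes
features = [
    [-2.0, 5.0, -8.0, 6.0],
    [1.5, 3.0, 6.0, 2.0],
    [0.5, 1.0, 2.0, -1.0],
]
standardized = zscore_columns(features)
```

After this transformation every column has mean 0 and sample standard deviation 1, so no single indicator dominates the later PCA because of its scale. A column with zero variance would cause a division by zero and would need to be dropped beforehand.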

Principal Component Analysis (PCA)

Let the Z-score–standardized data matrix be $\tilde{X} \in \mathbb{R}^{n \times p}$. The sample covariance matrix is calculated as:

S = \frac{1}{n - 1} {\tilde{X}}^{⊺} \tilde{X}

(3)

Eigenvalue decomposition was performed, and the top d eigenvectors associated with the largest eigenvalues are retained to form:

S w_{k} = λ_{k} w_{k}

(4)

W = [w_{1}, \dots, w_{d}]

(5)

The low-dimensional representation of gene i is obtained by:

z_{i} = {\tilde{x}}_{i} W \in \mathbb{R}^{d}

(6)

where $\tilde{x}_{i}$ denotes the standardized feature vector of gene i.
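Equations 3 to 6 can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions: the input is random synthetic data rather than the study's features, and the function name `pca_project` is hypothetical.

```python
import numpy as np


def pca_project(x_std, d=2):
    """Project standardized rows onto the top-d eigenvectors of the covariance."""
    n = x_std.shape[0]
    cov = x_std.T @ x_std / (n - 1)           # Eq. (3)
    eigvals, eigvecs = np.linalg.eigh(cov)    # Eq. (4); eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:d]     # indices of the d largest
    w = eigvecs[:, order]                     # Eq. (5), shape (p, d)
    return x_std @ w, eigvals[order]          # Eq. (6)


# Synthetic stand-in for the n x 4 feature matrix (not study data)
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 4))
x_std = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

z, top_vals = pca_project(x_std, d=2)
```

Because the input columns are centered, the sample variance of each projected coordinate equals the corresponding eigenvalue, which gives a quick sanity check on the projection.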

Unsupervised Clustering (K-Means)

In the PCA space $\{z_{i}\}$, K-means clustering is performed. The objective function is defined as:

\min_{\{C_{k}\}, \{μ_{k}\}} \sum_{k = 1}^{K} \sum_{i \in C_{k}} \| z_{i} - μ_{k} \|_{2}^{2}

(7)

where $μ_{k}$ denotes the centroid of the k-th cluster. Multiple random initializations (eg, $n_{init} = 10$) are used, and the optimal solution is retained.

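We select K by grid-searching $K \in \{2, \dots, 6\}$ and maximizing the average silhouette coefficient. For each sample i, let $a_{i}$ be the mean distance to the other members of its own cluster and $b_{i}$ the mean distance to the members of the nearest other cluster; then:

s_{i} = \frac{b_{i} - a_{i}}{\max (a_{i}, b_{i})}, \bar{s} = \frac{1}{n} \sum_{i = 1}^{n} s_{i}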

(8)

The value of K corresponding to the largest $\bar{s}$ is chosen. To incorporate the unsupervised structure into subsequent integrated scoring, the inverse distance from each gene to the centroid of its respective cluster is defined as:

δ_{i} = \frac{1}{1 + \| z_{i} - μ_{c (i)} \|_{2}}

(9)

A larger $δ_{i}$ value indicates that the gene i is closer to the prototype of the gene cluster it belongs to. After completing the clustering, the features were divided into two clusters. As shown in Figure 2(b), when known biomarker categories were overlaid onto the clustering results, these categories were also split into two groups; within each group, the biomarker categories exhibited an approximately linear distribution. Therefore, in subsequent steps, linear regression was adopted to identify potential biomarkers.
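The silhouette criterion of Equation 8 and the representativeness score of Equation 9 can be illustrated with a short pure-Python sketch. The points and labels are hypothetical, the function names are introduced for illustration, and assigning a score of 0 to singleton clusters is a common convention assumed here rather than something stated in the text.

```python
from math import dist


def silhouette_mean(points, labels):
    """Average silhouette coefficient for a given cluster assignment."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def representativeness(point, centroid):
    """Inverse-distance score 1 / (1 + ||z - mu||_2) from Eq. (9)."""
    return 1.0 / (1.0 + dist(point, centroid))


# Hypothetical 2-D PCA coordinates forming two well-separated groups
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
s_good = silhouette_mean(points, [0, 0, 1, 1])
s_bad = silhouette_mean(points, [0, 1, 0, 1])
delta = representativeness((0.0, 0.0), (0.0, 0.5))
```

A labeling that matches the visible groups yields a higher average silhouette than one that mixes them, which is the property the grid search over K relies on. The sketch requires at least two clusters.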

Figure 2.

Results of principal component analysis and linear regression.

On the PCA plane, known biomarkers were dichotomized and fitted with weighted linear regressions separately. The perpendicular distance to each fitted line was used as a similarity metric to screen potential candidates. Let the standardized, PCA-reduced coordinates be $z_{i} = (x_{i}, y_{i}) \in \mathbb{R}^{2}$. Let $\mathcal{K}$ denote the set of genes known from the literature/databases to be associated with lung cancer (written in calligraphic type to distinguish it from the number of clusters K), and write their coordinates as:

Z_{\mathcal{K}} = \{(x_{i}, y_{i}) \mid i \in \mathcal{K}\}

(10)

Run K-means with K = 2 on $Z_{\mathcal{K}}$ to obtain two groups $\mathcal{K}^{(\mathrm{up})}$ and $\mathcal{K}^{(\mathrm{dn})}$. Within each group, fit a line:

ℓ^{(g)} : y = a^{(g)} x + b^{(g)}, g \in \{\mathrm{up}, \mathrm{dn}\}

(11)

To identify genes most consistent with these two linear relationships, define the near-line sets using the (Euclidean) perpendicular distance to each line. For all samples compute:

d_{⊥}^{(g)} (x_{i}, y_{i}) = \frac{| a^{(g)} x_{i} - y_{i} + b^{(g)} |}{\sqrt{{(a^{(g)})}^{2} + 1}}, g \in \{\mathrm{up}, \mathrm{dn}\}

(12)

Take the q-th quantile (eg, q = 15%) of these distances as the threshold $τ^{(g)}$, and define:

N^{(g)} = \{i \mid d_{⊥}^{(g)} (x_{i}, y_{i}) \leq τ^{(g)}\}

(13)

Assign each sample in $N^{(\mathrm{up})} \cup N^{(\mathrm{dn})}$ to the closer line via:

g^{*} (i) = \arg\min_{g \in \{\mathrm{up}, \mathrm{dn}\}} d_{⊥}^{(g)} (x_{i}, y_{i})

(14)

The final set of near-line candidates is:

N = N^{(\mathrm{up})} \cup N^{(\mathrm{dn})}

(15)

The set N is aligned across databases with GEPIA and incorporated into the composite scoring, yielding a list of candidate biomarkers that conforms to the two linear patterns and carries stronger statistical support.
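The near-line filtering of Equations 11 to 15 can be sketched in pure Python as follows. Several simplifications are assumed: the line fit is an unweighted least-squares fit (the text describes a weighted regression whose weights are not specified), the quantile uses a nearest-rank convention, and all coordinates and function names are hypothetical.

```python
from math import ceil, sqrt


def fit_line(points):
    """Unweighted least-squares fit y = a*x + b (simplifying assumption)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx


def perp_distance(point, line):
    """Euclidean perpendicular distance from (x, y) to y = a*x + b."""
    x, y = point
    a, b = line
    return abs(a * x - y + b) / sqrt(a * a + 1.0)


def quantile(values, q):
    """Nearest-rank empirical quantile (one of several conventions)."""
    ordered = sorted(values)
    return ordered[max(0, ceil(q * len(ordered)) - 1)]


def near_line_candidates(points, lines, q=0.15):
    """Map each near-line point index to the name of its closer line."""
    near = set()
    for line in lines.values():
        dists = [perp_distance(p, line) for p in points]
        tau = quantile(dists, q)                       # threshold per line
        near.update(i for i, d in enumerate(dists) if d <= tau)
    return {
        i: min(lines, key=lambda g: perp_distance(points[i], lines[g]))
        for i in sorted(near)
    }


# Hypothetical known-biomarker groups and candidate gene coordinates
known_up = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
known_dn = [(0.0, 0.0), (1.0, -1.0), (2.0, -2.0)]
lines = {"up": fit_line(known_up), "dn": fit_line(known_dn)}
genes = [
    (0.5, 1.5), (3.0, 4.0), (1.5, -1.5), (4.0, 0.5), (-1.0, 3.0),
    (2.5, -2.4), (0.0, 5.0), (5.0, -1.0), (-2.0, -4.0), (1.0, 6.0),
]
candidates = near_line_candidates(genes, lines)
```

With q = 15% and ten points, each line keeps its two closest points, and every retained point is then labeled by whichever line it lies nearer to.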

It should be noted that the four-dimensional (4D) feature vectors constructed from statistical indicators ( $\log_{2} FC$ , adjusted p-value, t-statistic, and B-statistic) were not directly used as inputs to the final AI model. Instead, these features were employed in an intermediate feature representation and selection stage. Specifically, the standardized 4D feature vectors were first projected into a low-dimensional space via PCA, and the resulting representations were used for unsupervised clustering and geometric pattern analysis (K-means clustering and linear regression fitting). This procedure aimed to identify genes with consistent statistical characteristics and structural patterns. Only genes satisfying these PCA-based and clustering-based criteria were retained as candidate biomarkers and subsequently incorporated into downstream analyses, including AI-based modeling and prognostic validation.

Cross-Database Validation with GEPIA

Based on GEPIA (Gene Expression Profiling Interactive Analysis), we quantified candidate selection using a differential-expression score and a survival score.

Differential expression (DE) score can be denoted as:

s_{i}^{D E} = - \log_{10} (p_{i}^{D E})

(16)

where $p_{i}^{DE}$ is the DE P-value returned by GEPIA.

Using the gene's log-rank P-value $p_{i}^{SV}$ and hazard ratio (HR), the survival score and the survival effect size can be denoted as:

s_{i}^{SV} = - \log_{10} (p_{i}^{SV}), e_{i}^{SV} = | \log (HR_{i}) |

(17)

Given four measures (cluster representativeness $δ_{i}$, DE score $s_{i}^{DE}$, survival score $s_{i}^{SV}$, and survival effect size $e_{i}^{SV}$), we first apply min–max normalization to obtain $r (δ_{i}), r (s_{i}^{DE}), r (s_{i}^{SV}), r (e_{i}^{SV})$:

r (x_{i}) = \frac{x_{i} - \min_{j} x_{j}}{\max_{j} x_{j} - \min_{j} x_{j} + ε}

(18)

where $ε$ is a small constant to avoid division by zero.

The composite score is then computed as:

S_{i} = w_{clu} r (δ_{i}) + w_{DE} r (s_{i}^{DE}) + w_{SV} r (s_{i}^{SV}) + w_{HR} r (e_{i}^{SV}), \quad w_{clu} + w_{DE} + w_{SV} + w_{HR} = 1

(19)

Genes are ranked in descending order of $S_{i}$ to generate the candidate biomarker list. For interpretability, we also report a probability-like score via softmax:

π_{i} = \frac{\exp (S_{i})}{\sum_{j} \exp (S_{j})}

(20)
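Equations 16 to 20 can be combined into a short scoring sketch. The input values for the three genes are hypothetical, the value of ε is an assumption, the logarithm of the hazard ratio is taken as the natural logarithm (an assumption; min–max normalization makes the base nearly irrelevant), and the function names are illustrative only.

```python
from math import exp, log, log10

EPS = 1e-12  # assumed value for the small constant in Eq. (18)


def min_max(values, eps=EPS):
    """Min-max normalization with a small constant in the denominator."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


def composite_scores(delta, p_de, p_sv, hr, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of normalized cluster, DE, survival, and HR terms."""
    s_de = [-log10(p) for p in p_de]      # Eq. (16)
    s_sv = [-log10(p) for p in p_sv]      # Eq. (17), significance
    e_sv = [abs(log(h)) for h in hr]      # Eq. (17), effect size
    parts = [min_max(v) for v in (delta, s_de, s_sv, e_sv)]
    return [sum(w * r for w, r in zip(weights, col)) for col in zip(*parts)]


def softmax(scores):
    """Probability-like scores from Eq. (20), shifted for numerical stability."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical inputs for three genes (not values from the study)
delta = [0.9, 0.5, 0.2]
p_de = [1e-8, 1e-3, 0.04]
p_sv = [5e-4, 0.02, 0.04]
hr = [0.59, 1.3, 0.95]

scores = composite_scores(delta, p_de, p_sv, hr)
probs = softmax(scores)
```

Because each term is normalized to the unit interval and the weights sum to 1, the composite score also lies between 0 and 1. The softmax values depend on which genes are in the candidate list, so they are comparable only within one ranking.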

To ensure the reliability and specificity of candidate gene screening, the following threshold criteria were applied for the dual validation in this study:

Differential Expression Validation Threshold:

Consistent with the differential gene screening criteria of the GSE19804/GSE30219/GSE33532 discovery datasets ($|\log_{2} FC| > 1$ and adjusted P-value < .05), genes were considered to have significant differential expression in GEPIA if they met two conditions simultaneously:

(1) The absolute value of $\log_{2} FC$ between lung cancer tissues (including LUAD and LUSC) and normal lung tissues was greater than 1, indicating that the gene expression level changed by more than 2-fold; (2) The differential expression P-value (DE P-value) returned by GEPIA was less than .05, which reduced the likelihood that the observed expression difference arose by chance.

Survival Association Validation Threshold:

Genes were determined to be significantly associated with patient survival outcomes if they satisfied the following statistical criteria in GEPIA: (1) For both overall survival (OS) and disease-free survival (DFS), the log-rank test P-value was less than .05, meaning the survival curve difference between the high-expression and low-expression groups of the gene was statistically significant; (2) The hazard ratio (HR) calculated by the Cox proportional hazards model was not equal to 1. Among them, HR < 1 indicated that high gene expression was a protective factor for patient survival (reducing the risk of death or recurrence), while HR > 1 indicated that high gene expression was a risk factor.

Only genes that simultaneously met the above two sets of validation thresholds and had high feature importance in the AI model were included in the final list of predictive biomarkers, ensuring that the screened genes not only had significant molecular-level expression differences but also closely correlated with clinical prognosis, providing reliable targets for subsequent translational research.
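The survival association rule can be expressed as a small decision function. This is a sketch under assumptions: the function name `survival_call` is hypothetical, a single hazard ratio is passed in (the OS hazard ratio in the first call), and the second call uses invented values purely to show a failing case.

```python
def survival_call(p_os, p_dfs, hr, alpha=0.05):
    """Classify a gene under the survival thresholds described above.

    Returns "protective" (HR < 1), "risk" (HR > 1), or None when the
    log-rank criteria for OS and DFS are not both met or HR equals 1.
    """
    if p_os >= alpha or p_dfs >= alpha or hr == 1.0:
        return None
    return "protective" if hr < 1.0 else "risk"


# Log-rank P-values and OS hazard ratio reported for ADGRD1 in LUAD (Results)
luad_call = survival_call(p_os=0.00048, p_dfs=0.018, hr=0.59)

# Hypothetical case: significant OS but non-significant DFS
failing_call = survival_call(p_os=0.01, p_dfs=0.30, hr=1.4)
```

Requiring significance for both OS and DFS makes the filter conservative: a gene associated with only one endpoint is not retained.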

AI Model and Training Details

A linear regression model based on artificial intelligence (AI) was employed to predict clinical phenotypes from gene expression data. The input features to the model were derived from the standardized 4D statistical vectors for each gene, which comprised the log fold change, the negative log-transformed adjusted p-value, the t-statistic, and the B-statistic (Equation 1). These feature vectors were first reduced via principal component analysis (PCA), retaining the top principal components that captured the majority of variance. Subsequently, K-means clustering was applied in the PCA space to identify gene clusters, and near-line filtering based on perpendicular distance to fitted lines was used to select candidate genes for modeling.

The linear regression model was trained by minimizing the mean squared error (MSE) between predicted and observed clinical outcomes (eg, disease stage, prognostic status). To enhance robustness and avoid overfitting, an L2 penalty was added to this least-squares objective (ridge regression), with the optimal regularization parameter selected via 5-fold cross-validation. The training process involved randomly partitioning the dataset into 5 folds, iteratively using 4 folds for training and 1 fold for validation, and averaging the performance metrics across folds.

Model evaluation metrics included mean squared error (MSE), coefficient of determination (R²), and Pearson correlation coefficients between predicted and observed clinical phenotypes. Feature importance was assessed directly from the regression coefficients, which quantified the contribution of each gene to the predicted clinical outcome. Genes with higher absolute coefficients and those meeting dual validation criteria from GEPIA (differential expression and survival correlation) were prioritized as candidate biomarkers.
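The training procedure described above can be sketched with NumPy using the closed-form ridge solution and a simple 5-fold split. The data are synthetic, the regularization grid is an assumption, and the function names are illustrative; this is not the study's code.

```python
import numpy as np


def ridge_fit(x, y, lam):
    """Closed-form ridge fit on centered data; returns (coef, intercept)."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean
    p = x.shape[1]
    coef = np.linalg.solve(xc.T @ xc + lam * np.eye(p), xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def cv_mse(x, y, lam, k=5, seed=0):
    """Mean validation MSE over k randomly assigned folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, intercept = ridge_fit(x[train], y[train], lam)
        pred = x[val] @ coef + intercept
        errors.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errors))


# Synthetic data standing in for expression features and a continuous outcome
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 5))
true_coef = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = x @ true_coef + 0.1 * rng.normal(size=100)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]          # assumed candidate penalties
best_lam = min(grid, key=lambda lam: cv_mse(x, y, lam))
coef, intercept = ridge_fit(x, y, best_lam)
```

The absolute values of `coef` play the role of the feature importances mentioned in the text; on standardized inputs, larger magnitudes indicate a larger contribution to the predicted outcome.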

This workflow ensures full reproducibility of the AI modeling process, provides interpretable results for biological interpretation, and allows integration of both statistical and clinical information in a transparent manner.

Results

Implementation Details. In the K-means clustering analysis, the number of clusters was set to K = 2. During validation using the GEPIA database, equal weights were assigned to the four indicators to obtain the final normalized score:

w_{c l u} = w_{D E} = w_{S V} = w_{H R} = 0.25

Analysis. Figure 2(a) shows the preprocessed samples, which exhibit clear separation characteristics. When known biomarker categories were annotated in the embedding space (as shown in Figure 2(b)), their distribution presented an approximately linear trend, indicating that candidate biomarkers can be screened based on their degree of alignment with these linear trends in the sample space. Figure 2(c) displays the linear regression models fitted using known biomarkers.

The findings were validated using Formulas 19 and 20 combined with the GEPIA database, and partial scoring results for key genes are presented in Figure 3.

Figure 3.

Results of candidate biomarker screening.

Among the highest-scoring candidate genes, many have been previously confirmed to be associated with lung cancer^36–42; in contrast, research on ADGRD1 (Adhesion G Protein-Coupled Receptor D1) remains relatively limited. Therefore, a focused analysis of this gene was conducted. In terms of differential expression, ADGRD1 expression levels were significantly lower in lung cancer tissues than in normal lung tissues (Figure 4), suggesting that this gene may play a critical role in NSCLC pathogenesis.

Figure 4.

ADGRD1 expression in GEPIA.

Subsequently, the association between ADGRD1 expression in lung adenocarcinoma (LUAD) and patient survival was analyzed using the GEPIA database (Figure 5). The results showed that patients in the high ADGRD1 expression group had a significantly better prognosis than those in the low expression group. For overall survival (OS), the log-rank test yielded a p-value of .00048; the Cox proportional hazards model showed a hazard ratio (HR) of 0.59 (p = .00056), indicating a 41% reduction in the risk of death in the high-expression group. For disease-free survival (DFS), the log-rank test resulted in a p-value of .018, with an HR of 0.69 (p = .018), corresponding to a 31% decrease in the risk of recurrence/progression. Collectively, these data indicate that high ADGRD1 expression is a protective prognostic biomarker in LUAD.
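The percentage reductions quoted above follow directly from the hazard ratios, since a hazard ratio HR below 1 corresponds to a relative risk reduction of (1 - HR) × 100%. A minimal check in Python, with a hypothetical helper name:

```python
def risk_reduction_percent(hr):
    """Relative reduction in hazard for the high-expression group, in percent."""
    return round((1.0 - hr) * 100.0)


os_reduction = risk_reduction_percent(0.59)    # overall survival HR
dfs_reduction = risk_reduction_percent(0.69)   # disease-free survival HR
print(os_reduction, dfs_reduction)
```

For a hazard ratio above 1 the same expression becomes negative, which corresponds to an increase in risk rather than a reduction.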

Figure 5.

Prognostic relevance of ADGRD1 in LUAD: OS and DFS (GEPIA).

In LUSC, by contrast, high ADGRD1 expression was significantly associated with shortened overall survival (OS) and acted as a risk factor for OS, whereas no statistically significant association with disease-free survival (DFS) was observed (Figure 6). This differs from the finding that ADGRD1 acts as a protective factor in lung adenocarcinoma (LUAD), demonstrating the heterogeneity of the prognostic role of this gene across different lung cancer subtypes.

Figure 6.

Prognostic relevance of ADGRD1 in LUSC: OS and DFS (GEPIA).

In addition, existing literature has shown that both in LUSC and LUAD, the relative mRNA expression level of ADGRD1 is significantly lower than that in their corresponding adjacent normal tissues (P < .05); Kaplan-Meier survival analysis results indicated that the 10-year survival rate of lung cancer patients with high ADGRD1 expression is significantly higher than that of patients with low ADGRD1 expression (P < .001), suggesting that ADGRD1 may function as a tumor suppressor gene in lung cancer tissues, and its high expression is beneficial to patients’ prognosis.⁴³

A further comparison of analysis results from two independent human tissue expression resources—the Human Protein Atlas (HPA) and the Genotype-Tissue Expression (GTEx) Project—revealed that ADGRD1 exhibits lung tissue-enriched expression. As shown in Figure 7, ADGRD1 expression levels ranked among the highest in both resources and were significantly higher than those in most non-lung tissues. This lung tissue-specific enrichment, combined with the expression patterns observed in tumor datasets, indicates that ADGRD1 is closely associated with lung physiological functions, may play an important role in the initiation and progression of lung cancer, and highlights its potential as a NSCLC biomarker.

Figure 7.

Comparative results across HPA and GTEx.

Discussion

This study proposes an AI-driven strategy for NSCLC biomarker discovery, which proceeds as follows. First, the overall tissue-specific molecular structure was identified using PCA followed by K-means clustering; second, a bilinear (two-line) model combined with anchor-aware weighting was employed to fit known lung cancer-related genes, and a near-line filtering mechanism based on perpendicular distance was constructed; finally, cross-cohort data on differential expression and survival outcomes from GEPIA were integrated into an interpretable scoring system.

Data Reliability: The dual design of “discovery datasets (GSE19804/GSE30219/GSE33532) + validation dataset (GEPIA/TCGA)” reduces the result bias caused by reliance on a single dataset and enhances the generalizability of the identified biomarkers.

Technological Innovation: Leveraging the feature screening capability of AI models addresses the inefficiency and high missed-detection risk of traditional “gene-by-gene verification” approaches, enabling rapid prioritization of key candidates from thousands of genes.

Clinical Relevance: Using “differential expression” and “survival correlation” as core validation indicators ensures that the finally screened biomarkers not only exhibit molecular-level differences but also directly reflect patients’ clinical prognosis, providing clear directions for subsequent translational research (eg, diagnostic reagent development, therapeutic target identification).

Computational Efficiency and Scalability: The proposed workflow is computationally lightweight, as it is primarily built upon linear or low-complexity algorithms, including statistical feature extraction, PCA, K-means clustering, and linear regression–based modeling. These components do not require large-scale parameter optimization or iterative backpropagation, enabling efficient execution on standard computing resources. Moreover, the framework is inherently scalable to larger datasets, since both PCA and K-means can be efficiently implemented using optimized linear algebra routines, and the overall computational cost grows approximately linearly with the number of genes and samples. This makes the proposed strategy suitable for large-scale biomarker screening and facilitates its extension to other tumor types or higher-dimensional omics datasets.

Comparison with Existing Methods: Compared with traditional bioinformatics approaches based on univariate statistical testing, the proposed method captures multivariate relationships among genes through PCA-based feature integration, improving robustness and reducing redundancy. Compared with deep learning–based AI models, this framework emphasizes interpretability and computational efficiency, as it relies on linear modeling and distance-based filtering rather than large-scale parameter optimization. However, the method does not explicitly model complex nonlinear patterns and may therefore be less sensitive to subtle nonlinear gene–phenotype associations. Overall, the proposed strategy is well suited for efficient and interpretable biomarker prioritization and can be regarded as complementary to more complex AI models.
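To make the interpretability argument concrete, the sketch below shows one possible way to combine the indicators discussed above (proximity to the fitted line, differential expression, and survival association) into an additive score with a readable justification. The scoring rule, the cutoffs, the field names, and the gene records are all assumptions introduced for illustration; they are not the scoring system or the results of this study.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    name: str
    log2_fold_change: float   # tumor versus normal
    de_p_value: float         # differential-expression p-value
    survival_p_value: float   # e.g. a log-rank p-value
    near_line: bool           # passed the near-line filter


def evidence_score(gene, fc_cutoff=1.0, alpha=0.05):
    """Return (score, reasons): one point per satisfied criterion.

    fc_cutoff and alpha are conventional placeholder cutoffs (assumptions).
    """
    reasons = []
    if gene.near_line:
        reasons.append("near fitted line in PCA space")
    if abs(gene.log2_fold_change) >= fc_cutoff and gene.de_p_value < alpha:
        reasons.append("differentially expressed")
    if gene.survival_p_value < alpha:
        reasons.append("associated with survival")
    return len(reasons), reasons


def rank_genes(genes):
    """Sort genes by descending score, breaking ties by name."""
    scored = [(g.name, *evidence_score(g)) for g in genes]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical records, not measurements from any cohort.
genes = [
    GeneEvidence("GENE_A", 1.8, 0.001, 0.01, True),
    GeneEvidence("GENE_B", 0.4, 0.20, 0.03, True),
    GeneEvidence("GENE_C", -2.1, 0.0005, 0.30, False),
]
ranking = rank_genes(genes)
```

Because each point in the score maps to a named criterion, the ranking of any gene can be traced back to the specific evidence that supports it, which is the property that distinguishes this kind of scoring from a black-box model output.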

When extending this method to biomarker screening for other tumors in the future, several limitations should be considered. First, reliance on public datasets may introduce heterogeneity due to batch effects and incomplete clinical annotations; despite data standardization and harmonization, residual variability may still affect the generalizability of results. Second, the near-line criterion assumes the existence of an approximate linear manifold in the PCA space; although robust thresholds and anchor weighting reduce sensitivity to this assumption, results may still vary with the number of clusters and distance thresholds.
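The sensitivity to distance thresholds noted above can be checked with a simple stability analysis. The following sketch assumes that the perpendicular distance of each gene to the fitted line has already been computed; it measures how much the selected gene set changes between consecutive thresholds using the Jaccard index. The gene labels, distances, and threshold grid are hypothetical.

```python
def select(distances, threshold):
    """Genes whose distance to the fitted line is within the threshold."""
    return {gene for gene, d in distances.items() if d <= threshold}


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def threshold_stability(distances, thresholds):
    """Jaccard overlap of selections at each pair of consecutive thresholds."""
    results = []
    for lower, upper in zip(thresholds, thresholds[1:]):
        overlap = jaccard(select(distances, lower), select(distances, upper))
        results.append((lower, upper, overlap))
    return results


# Placeholder distances, not values from this study.
distances = {"G1": 0.1, "G2": 0.3, "G3": 0.6, "G4": 1.2}
stability = threshold_stability(distances, [0.25, 0.5, 1.0])
```

Overlap values close to 1 across a range of thresholds would suggest that the selection is robust to the exact cutoff, whereas sharp drops indicate that the candidate list depends strongly on this parameter. The same comparison could be repeated across different numbers of clusters.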

Conclusion

Using an AI-based discovery and validation approach across independent datasets, this study found that multiple previously confirmed lung cancer-associated genes achieved high scores, which supports the reliability of the proposed method. In addition, the screening indicated that ADGRD1 expression and its regulatory mechanisms may serve as prognostic indicators for NSCLC. The proposed method provides a transparent and reproducible pathway for prioritizing prognostic biomarkers and potential therapeutic targets for NSCLC.^44,45

Footnotes

Acknowledgements

The authors sincerely thank all public databases used in this study. They also express gratitude to the anonymous reviewers and associate editor for their valuable comments and suggestions.

ORCID iDs

Xiaoyue Wang

Na Liu

Ting Xu

Ethical Approval

This article does not involve any studies with human or animal subjects conducted by the authors.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Authors’ Contributions Statement

Xiaoyue Wang designed the study and drafted the manuscript; Na Liu and Shu Xu collected and analyzed the data; Ting Xu revised the entire manuscript. All authors have read and approved the final version of the manuscript and confirm that this manuscript has not been published previously and is not under consideration by another journal.

Data Availability

The original data used in this research are publicly available from online databases. All data supporting the findings of this study are included in the article. GSE databases: http://www.broadinstitute.org/gsea; GEPIA: (http://gepia2.cancer.pku.cn/); TCGA:

References

1. Siegel, Miller, Fuchs, et al. Cancer statistics, 2022. CA Cancer J Clin. 2022;72(1):7-33.

2. Gao, Wang, et al. Lung cancer in People’s Republic of China. J Thorac Oncol. 2020;15(10):1567-1576. doi:10.1016/j.jtho.2020.04.028

3. Thai, Solomon, Sequist, et al. Lung cancer. Lancet. 2021;398(10299):535-554. doi:10.1016/S0140-6736(21)00312-3

4. Binson, Thomas, Subramoniam. VOC pattern recognition for lung cancer detection using a compact E-nose system. Microchem J. 2025;215:114156. doi:10.1016/j.microc.2025.114156

5. Freitas, Sousa, Machado, et al. The role of liquid biopsy in early diagnosis of lung cancer. Front Oncol. 2021;11:634316. doi:10.3389/fonc.2021.634316

6. Lener, Reszka, Marciniak, et al. Blood cadmium levels as a marker for early lung cancer detection. J Trace Elem Med Biol. 2021;64:126682. doi:10.1016/j.jtemb.2020.126682

7. Hirsch, Scagliotti, Mulshine, et al. Lung cancer: current therapies and new targeted treatments. Lancet. 2017;389(10066):299-311. doi:10.1016/S0140-6736(16)30958-8

8. Wang, Zhang, et al. Advances in targeted therapy and immunotherapy for non-small cell lung cancer: mechanisms and clinical applications. Cancer Treat Rev. 2024;115:102601. doi:10.1016/j.ctrv.2023.102601

9. Tanaka, Ueda, Takahashi, et al. Multi-omics profiling reveals new therapeutic targets in NSCLC. Cell Rep Med. 2023;4(4):100847. doi:10.1016/j.xcrm.2023.100847

10. Wang, Zhang, et al. Advances in targeted therapy and immunotherapy for non-small cell lung cancer: mechanisms and clinical applications. Cancer Treat Rev. 2024;115:102601. doi:10.1016/j.ctrv.2023.102601

11. Sequist, Yang, Yamamoto, et al. Phase III study of afatinib or cisplatin plus pemetrexed in patients with metastatic lung adenocarcinoma with EGFR mutations. J Clin Oncol. 2013;31(27):3327-3334. doi:10.1200/JCO.2012.44.2806

12. Canon, Rex, Saiki, et al. The clinical KRAS(G12C) inhibitor AMG 510 drives anti-tumour immunity. Nature. 2019;575(7781):217-223. doi:10.1038/s41586-019-1694-1

13. Olivier, Hollstein, Hainaut. TP53 mutations in human cancers: origins, consequences, and clinical use. Cold Spring Harb Perspect Biol. 2010;2(1):a001008. doi:10.1101/cshperspect.a001008

14. Zhao, et al. Emerging roles of non-coding RNAs in the tumor microenvironment of lung cancer. J Hematol Oncol. 2023;16(1):85. doi:10.1186/s13045-023-01424-5

15. Wang, Sun, Wang, et al. Resistance mechanisms and new therapeutic strategies for EGFR-mutant non-small cell lung cancer. Front Oncol. 2024;14:1178569. doi:10.3389/fonc.2024.1178569

16. Sequist, Waltman, Dias-Santagata, et al. Genotypic and histological evolution of lung cancers acquiring resistance to EGFR inhibitors. Sci Transl Med. 2011;3(75):75ra26. doi:10.1126/scitranslmed.3002003

17. Subramoniam, Mathew. Noninvasive detection of COPD and lung cancer through breath analysis using MOS sensor array based e-nose. Expert Rev Mol Diagn. 2021;21(11):1223-1233. doi:10.1080/14737159.2021.1971079

18. Mathew, Thomas, et al. Detection of lung cancer and stages via breath analysis using a self-made electronic nose device. Expert Rev Mol Diagn. 2024;24(4):341-353. doi:10.1080/14737159.2024.2316755

19. Dwivedi, Rajpal, et al. An explainable AI-driven biomarker discovery framework for non-small cell lung cancer classification. Comput Biol Med. 2023;153:106544. doi:10.1016/j.compbiomed.2023.106544

20. Fang, Arango Argoty, Kagiampakis, et al. Integrating knowledge graphs into machine learning models for survival prediction and biomarker discovery in patients with non–small-cell lung cancer. J Transl Med. 2024;22(1):726. doi:10.1186/s12967-024-05509-9

21. Zhang, Sun, Chen, et al. Artificial intelligence in lung cancer precision medicine: applications and challenges. Cancer Lett. 2023;558:216044. doi:10.1016/j.canlet.2023.216044

22. Zhou. ADGRD1 as a potential prognostic and immunological biomarker in non-small-cell lung cancer. Biomed Res Int. 2022;2022:5699892. doi:10.1155/2022/5699892

23. Kourou, Exarchos, et al. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:8-17. doi:10.1016/j.csbj.2014.11.005

24. Esteva, Robicquet, Ramsundar, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24-29. doi:10.1038/s41591-018-0316-z

25. Ardila, Kiraly, Bharadwaj, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med. 2019;25(6):954-961. doi:10.1038/s41591-019-0447-x

26. Vamathevan, Clark, Czodrowski, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 2019;18(6):463-477. doi:10.1038/s41573-019-0024-5

27. Zhang, Ren, Sun. Deep residual learning for image recognition. Proc IEEE CVPR. 2016;2016:770-778. doi:10.1109/CVPR.2016.90

28. Wang, et al. Tumor microenvironment remodeling and immunotherapy resistance in non-small cell lung cancer. Front Immunol. 2023;14:1135342. doi:10.3389/fimmu.2023.1135342

29. Liu, Huang, Zhao, et al. Predictive biomarkers and AI-guided personalized immunotherapy for lung cancer. Front Immunol. 2024;15:1198741. doi:10.3389/fimmu.2024.1198741

30. Chen, Li Y, Zhang L, et al. The tumor microenvironment in lung squamous cell carcinoma: fibrosis and neutrophil infiltration as key features. Cancers (Basel). 2023;15(16):4023. doi:10.3390/cancers15164023

31. Jianhua, Rui, Daosheng, Xunhuang. The mechanism of ADGRD1 inhibits cell proliferation and migration in non-small cell lung cancer. The Journal of Practical Medicine. 2023;39(5):550-556.

32. Kumar, Singh, Yang, et al. Liquid biopsy and AI-enabled detection of minimal residual disease in lung cancer. Nat Rev Clin Oncol. 2023;20(5):330-345. doi:10.1038/s41571-023-00734-x

33. Zhao, Han, Zhang, et al. AI-powered multi-modal data integration for prognosis prediction in NSCLC. Comput Biol Med. 2024;164:107121. doi:10.1002/cam4.5321

34. Tanaka, Ueda, Takahashi, et al. Multi-omics profiling reveals new therapeutic targets in NSCLC. Cell Rep Med. 2023;4(4):100847. doi:10.1016/j.xcrm.2023.100847

35. Patel, Jones, Nguyen, et al. Deep learning approaches for integrative analysis of multi-omics data in cancer. Brief Bioinform. 2024;25(2):bbac574. doi:10.1093/bib/bbac574

36. Rolfo, Mack, Scagliotti, et al. Liquid biopsy for advanced non-small cell lung cancer (NSCLC): a statement paper from the IASLC. J Thorac Oncol. 2018;13(9):1248-1268. doi:10.1016/j.jtho.2018.05.030

37. Borghaei, Paz-Ares, Horn, et al. Nivolumab versus docetaxel in advanced nonsquamous non–small-cell lung cancer. N Engl J Med. 2015;373(17):1627-1639. doi:10.1056/NEJMoa1507643

38. Chen, Huang, Zhang, et al. Multi-omics integration in lung cancer: current landscape and future perspectives. Front Oncol. 2023;13:1102305. doi:10.3389/fonc.2023.1102305

39. Zhao, et al. Emerging roles of non-coding RNAs in the tumor microenvironment of lung cancer. J Hematol Oncol. 2023;16(1):85. doi:10.1186/s13045-023-01424-5

40. Wang, Sun, Wang, et al. Resistance mechanisms and new therapeutic strategies for EGFR-mutant non-small cell lung cancer. Front Oncol. 2024;14:1178569. doi:10.3389/fonc.2024.1178569

41. Yang, Zhang, Cao, et al. Integrating spatial transcriptomics and single-cell RNA-Seq reveals the immune landscape of lung cancer. Nat Commun. 2023;14(1):1027. doi:10.1038/s41467-023-36824-9

42. Chen, Wang, et al. Recent advances in KRAS G12C inhibitors in NSCLC: mechanisms and clinical outcomes. Cancer Treat Rev. 2023;112:102457. doi:10.1016/j.ctrv.2023.102457

43. Wang, Li Y, Zhang H, et al. ADGRD1 inhibits proliferation and metastasis of non-small-cell lung cancer via regulating RTK-RAS-MAPK/PI3K-AKT pathways. Front Oncol. 2025;15:1278905. doi:10.3389/fonc.2025.1278905

44. Sun, Zhao, Wang, et al. Deep learning for NSCLC diagnosis and therapeutic response prediction: a review. Cancer Med. 2023;12(2):1375-1394. doi:10.1002/cam4.5321

45. Lee, Kim, et al. Advances in AI applications for lung cancer genomics and precision medicine. Genome Med. 2024;16(1):20. doi:10.1186/s13073-024-00960-8