Abstract

DNA microarray is a transformative technique in genomics, enabling simultaneous examination of thousands of gene expression levels. However, noise, high dimensionality (typically 12,000–22,000 genes), small sample sizes (155–1097 samples) and class imbalance complicate the extraction of meaningful diagnostic patterns. This paper presents MICRO-AI (Microarray Classification and Recognition using Artificial Intelligence), a comprehensive machine learning framework for DNA microarray analysis and automated disease diagnosis. The framework integrates advanced preprocessing (quantile normalisation, ComBat batch correction, KNN imputation), attention-weighted adaptive feature selection using recursive feature elimination with cross-validation, and heterogeneous ensemble classification combining gradient boosting machines, random forests and support vector machines with adaptive weight optimisation. A novel attention-based feature fusion mechanism dynamically prioritises discriminative gene expression signatures, reducing dimensionality by over 99% (from ∼20,000 to ∼127 genes) without loss of biological significance. MICRO-AI is validated on six benchmark datasets from three repositories: Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA) and ArrayExpress, spanning breast cancer, gastric cancer, ovarian cancer and leukaemia across 2321 total samples. Experimental results demonstrate average classification accuracy of 96.8%, sensitivity of 95.2%, specificity of 97.4%, F1-score of 96.0%, Matthews correlation coefficient of 0.928, and area under the receiver operating characteristic curve of 0.983. Comparative benchmarking against 10 state-of-the-art methods shows that MICRO-AI achieves 1.2–7.5% higher accuracy with an average training time of 52.3 s, representing 2.4–6.0× faster execution than deep learning alternatives. The modular architecture enables seamless integration with medical informatics systems for scalable clinical diagnostic deployment.

Keywords

DNA microarray analysis, machine learning, disease detection, gene expression profiling, ensemble classification, feature selection, medical informatics

Introduction

The introduction of high-throughput genomic technologies in biomedical research and clinical diagnostics has radically changed the study of the human body. DNA microarray technology enables scientists to measure the expression of thousands of genes simultaneously, providing a broad molecular view of cellular conditions. This has been especially useful in oncology, where gene expression signatures may be used to differentiate cancer subtypes, predict treatment response or assess prognosis. Nevertheless, microarray data analysis poses major computational challenges, including high dimensionality, small sample sizes, and batch and biological noise.

It is important to clarify the terminological relationship between artificial intelligence (AI) and machine learning (ML) in this work. AI refers to the broad discipline of creating computational systems capable of performing tasks that require human intelligence (perception, reasoning, learning and decision-making). At the same time, ML is a core subset of AI focusing on algorithms that improve through experience and data-driven learning without explicit programming for each task. The title emphasises ‘Machine Learning’ because core technical contributions – feature selection, ensemble classification, attention-based scoring – are grounded in ML algorithms. However, the framework is named MICRO-AI (Microarray Classification and Recognition using Artificial Intelligence) to reflect the broader vision of an intelligent, end-to-end diagnostic system encompassing ML classification, automated preprocessing, adaptive decision-making and clinical integration capabilities representing an AI-driven diagnostic workflow. Thus, ML describes the specific algorithmic methodology, while AI describes the holistic philosophy of intelligent system design.

Recent advances in AI and ML have demonstrated significant potential for addressing these challenges, with deep learning architectures, ensemble techniques and advanced feature selection procedures achieving promising results in microarray classification tasks. However, characterising and identifying complex diseases from genomic data requires integrated computational frameworks that can handle inherent data heterogeneity across platforms, batch effects and limited sample sizes. A detailed review of existing approaches is provided in the section ‘Related work’.

Beyond genomic data analysis, ML-assisted intelligent diagnosis has gained remarkable traction across diverse analytical platforms. Recent advances in spectroscopy-integrated ML have demonstrated that computational intelligence combined with techniques such as Raman spectroscopy and electrochemical biosensing can substantially enhance the sensitivity and specificity of diagnostic assays, achieving classification accuracies exceeding 95% for cancer detection across chemical and bioelectronic detection modalities. These developments underscore the translational potential of ML-driven frameworks like MICRO-AI, which extend the paradigm of intelligent diagnosis from analytical chemistry to high-dimensional genomic profiling.

Figure 1 depicts the expanding range of applications of DNA microarray analysis and the use of ML methodologies across clinical areas.

Figure 1.

Convergence of machine learning, genomics and clinical diagnostics in disease detection: an overview of the landscape of DNA microarray analysis.

The principal contributions of this paper are as follows:

Attention-weighted feature fusion mechanism: Novel attention-based feature scoring (equations (8) and (9)) computing gene-level discriminative importance from statistical moment embeddings (mean, variance, skewness, kurtosis) and mutual information with class labels, achieving >99% dimensionality reduction (∼20,000→∼127 genes) while preserving biologically relevant signatures validated by Gene Ontology enrichment.

Heterogeneous ensemble classification with adaptive weighting: Ensemble framework combining gradient boosting machines (GBM), random forests (RF), and support vector machines (SVM) through constrained adaptive weight optimisation using Sequential Least Squares Programming (SLSQP), achieving 5–7% higher accuracy than individual classifiers and 0.8% improvement over equal weighting.

Comprehensive end-to-end preprocessing pipeline: Integrated pipeline sequentially applying quantile normalisation (cross-sample comparability), ComBat empirical Bayes correction (batch effect removal) and weighted KNN imputation (missing values), contributing 1.7% accuracy improvement over uncorrected data (Table 6).

Extensive multi-repository, multi-disease validation: Rigorous evaluation on six datasets from three independent repositories (GEO, TCGA, ArrayExpress) across four cancer types (breast, gastric, ovarian, leukaemia), totalling 2321 samples, consistently outperforming ten state-of-the-art methods across seven evaluation metrics.

Clinical informatics integration architecture: Modular architecture producing calibrated probability estimates (isotonic regression) with expected calibration error (ECE) minimisation, enabling integration with medical informatics workflows for interpretable and scalable clinical decision support.

Related work

This section reviews the literature on DNA microarray analysis, ML approaches for disease detection and integrated diagnostic frameworks. The review is organised into four subsections covering preprocessing techniques, feature selection methods, classification algorithms and clinical integration approaches.

DNA microarray data preprocessing

Sound microarray analysis relies on proper preprocessing. Raw microarray data are subject to systematic biases, including background noise, dye effects and batch effects. Qvick et al.¹ achieved pan-cancer detection through DNA methylation profiling with enzymatic conversion library preparation and targeted sequencing, showing that epigenetic biomarkers have clinical application. Yoon et al.² surveyed bioinformatic and monitoring technologies and eDNA analysis, emphasising the importance of computational methods for handling complex streams of biological information. To examine the molecular signatures of related metabolic disorders, Sultan³ conducted microarray analysis of differentially expressed genes in peripheral blood samples from individuals with gestational diabetes mellitus and type 2 diabetes, identifying molecular biomarkers of glycogen metabolic disorders. Li et al.⁴ comprehensively reexamined copy number variants of uncertain significance using existing guidelines and future genome sequencing, demonstrating the relevance of consistent analytical frameworks. Microarray analysis has been especially helpful in cancer research. Ben Ali et al.⁵ discovered a new lung cancer biomarker signature via data mining and performed initial validation in an in vitro experiment, highlighting the importance of data preprocessing in biomarker discovery. Tselios et al.⁶ used geometric methods from common transcriptomics in acute lymphoblastic leukaemia and rhabdomyosarcoma and extended pathway simulation using new preprocessing methods.

Normalisation has evolved from simple scaling techniques to quantile-based methods and robust multi-array averaging (RMA). Yuan et al.⁷ developed MambaYOLO-ML, a state-space-based model for mulberry leaf disease detection, demonstrating the applicability of preprocessing principles in biological contexts. Atesoglu and Bingol⁸ enhanced hybrid models for grape leaf disease detection through feature engineering and AI-based fusion.

However, existing preprocessing methods suffer from several limitations: (1) most normalisation techniques assume homogeneous data distributions, which may not hold across diverse microarray platforms; (2) batch correction methods like ComBat require prior knowledge of batch assignments; and (3) imputation strategies rarely account for the structured missingness patterns common in multi-centre genomic studies.

Feature selection and dimensionality reduction

High-dimensional microarray data often involves limited sample sizes for thousands of features (genes), leading to the curse of dimensionality. Takou et al.⁹ modelled gene expression in response to environmental stressors using natural variation in DNA sequences and ML to decode genotype–phenotype interactions. Shao¹⁰ surveyed the use of ML in microwave medical imaging and lesion detection and noted that selecting features is not straightforward, as in microarray analysis. To categorise disease diagnoses via breath analysis, Kokkotis et al.¹¹ used AI and ML on high-dimensional sensor data and employed sophisticated feature extraction. Surimova et al.¹² identified PSG and candidate genes as potential biomarkers of therapy resistance in B-ALL using chromosomal microarray analysis and ML.

The use of ML in disease detection spans many fields.

Filter techniques such as variance thresholding, mutual information and ANOVA have been widely used for initial feature screening. Feature elimination algorithms, such as recursive feature elimination (RFE), and wrapper selection methods deliver finer gene subsets by evaluating candidate feature groups against classifier accuracy. Iftikhar et al.¹³ demonstrated the clinical implementation of ML frameworks to identify chronic kidney disease at early stages by carefully selecting features from clinical parameters.

Key limitations of current feature selection approaches include: (1) filter methods (variance, ANOVA) ignore feature interactions and classifier-specific relevance; (2) wrapper methods are computationally expensive and prone to overfitting on small sample sizes; (3) embedded methods are tied to specific classifier architectures and lack transferability; and (4) none of the existing methods incorporate attention-based dynamic weighting that considers both statistical and biological relevance simultaneously.

Classification algorithms for disease detection

Ghosh and Ura¹⁴ demonstrated that combining principles of DNA-based computing with artificial neural networks improves pattern recognition in smart manufacturing. Zhao et al.¹⁵ developed EM-DeepSD, a deep neural network model based on cell-free DNA end-motif signal splitting for cancer diagnosis, emphasising the significance of physicochemical property representation in molecular ML. ML classifiers have been widely used for disease detection with microarrays. Hernandez Toledo et al.¹⁶ applied extreme ML-based computation to agricultural disease detection and demonstrated that ensemble methods can be quite versatile. Salaris et al.¹⁷ applied ML to foodborne event detection on social media, exemplifying classification methods for health surveillance. Microarray analysis by Chen et al.¹⁸ revealed that sepsis is characterised by hyperactivity of TH17 immunity, with the Treg cell cytokine TGF-$β$ overexpressed. Abroudi et al.¹⁹ compared microarray data and single-cell RNA-seq and identified a relationship between the tumour environment and the extracellular matrix during epithelial-mesenchymal transition in prostate cancer.

SVMs remain widely used in microarray classification because of their efficiency in high-dimensional spaces with few samples. Vrbaski et al.²⁰ provided an overview of ML for detecting chronic kidney disease using scintigraphy and compared various classifier architectures. Dzermeikait et al.²¹ used ML models to identify metritis early in dairy cows, demonstrating the range of applications of biological classification. In their study, Liu et al.²² computationally examined smooth muscle cell plasticity in atherosclerosis and vascular calcification by analysing the differential gene expression.

Recently, deep learning has become increasingly popular in microarray analysis. Lazcano-García et al.²³ deployed deep learning-based systems for early detection of grapevine disease symptoms on edge computing devices. Tao et al.²⁴ reviewed multi-omics methods for predicting cancer immunotherapy treatments, emphasising the integration of ML with complex biological information. Ge et al.²⁵ contributed a bibliometric analysis of cDNA-based surveys, highlighting evolving research trends in molecular diagnostics. Wu et al.²⁶ demonstrated the utility of machine learning in constructing pan-cancer prognostic models based on immunogenic cell death genes, showcasing the broad applicability of ML approaches in oncogenomics.

Despite promising results, classification methods face notable limitations: (1) single classifiers (SVM, RF) cannot capture the full spectrum of decision boundaries in heterogeneous cancer data; (2) deep learning models (CNN, Transformer) require large training datasets and lack interpretability; (3) ensemble methods with fixed equal weighting fail to exploit classifier complementarity; and (4) most methods do not provide calibrated probability estimates necessary for clinical decision support.

Integrated diagnostic frameworks

ML diagnostic tools need careful clinical integration, with attention to interpretability, reliability and workflow compatibility. Brlek et al.²⁷ validated OncoOrigin, an integrative AI tool for predicting the primary cancer site that includes a graphical user interface, demonstrating the practical viability of ML in oncology. Arakelyan et al.²⁸ assigned transcriptomic subtypes to chronic lymphocytic leukaemia samples using Nanopore RNA-sequencing and self-organising maps. Esperança-Martins et al.²⁹ used transcriptomic-based classification to develop prognostic subtype classifications and treatment strategies for soft tissue sarcomas. Christodoulou et al.³⁰ used data-driven and structure-based modelling to help identify human DNMT1 inhibitors by integrating pathways for structure-activity relationships into drug discovery cascades.

Cabello-Lima et al.³¹ established interpolation-based encoding schemes to classify protein-DNA/RNA interactions. Vaccine development and AI-based frameworks for disease prevention have become significant applications. Goud et al.³² developed an AI-guided platform for the design and development of next-generation avian viral vaccines. Popescu and Gaman³³ performed a systematic review of AI for risk stratification in diffuse large B-cell lymphoma, comparing models and the predictive performance of age-relevant classification.

Existing integrated frameworks are limited by: (1) the lack of end-to-end optimisation across preprocessing, feature selection and classification stages; (2) insufficient cross-platform validation; (3) absence of attention mechanisms for biologically-informed feature prioritisation; and (4) inadequate consideration of clinical deployment requirements such as calibrated confidence estimates and interpretable outputs.

Research gap analysis

Despite major achievements in individual components, current methods have several limitations. First, most approaches treat preprocessing, feature selection and classification independently and are not optimised end-to-end. Second, deep learning methods offer low interpretability, which makes them difficult to adopt in clinical settings. Third, validation is often performed on single datasets, and methods are rarely tested cross-platform. Fourth, integration with clinical informatics systems is often disregarded.

To fill these gaps, MICRO-AI provides a single framework that combines optimised preprocessing, attention-weighted feature selection and ensemble classification, with attention to both biological interpretability and clinical applicability.

Proposed methodology

This section introduces the MICRO-AI model, including system architecture, mathematical models and algorithmic implementation. Figure 2 shows the entire system architecture.

Figure 2.

MICRO-AI system architecture, showing the preprocessing, attention-focused feature selection, ensemble classification and clinical integration modules.

System overview

Let $X \in R^{N \times G}$ denote the microarray expression matrix, where N represents the number of samples and G represents the number of genes. The corresponding label vector is $y \in \{0, 1, \dots, C - 1\}^{N}$ for C disease classes. The objective is to learn a mapping function $f : R^{G} \to \{0, 1, \dots, C - 1\}$ that accurately predicts disease class from gene expression profiles.

Data preprocessing module

Quantile normalisation

Raw microarray intensities must be normalised to ensure comparability across samples. Let $x_{i j}$ denote the expression value of gene j in sample i. Quantile normalisation transforms each sample to have an identical distribution:

{\tilde{x}}_{i j} = Q_{j} (F_{i} (x_{i j}))

(1)

where $F_{i} (\cdot)$ is the empirical cumulative distribution function of sample i, and $Q_{j} (\cdot)$ is the quantile function of the reference distribution for gene j.

The reference distribution is computed as the mean across all samples:

Q_{j} (p) = \frac{1}{N} \sum_{i = 1}^{N} x_{i (j)}^{(p)}

(2)

where $x_{i (j)}^{(p)}$ denotes the $p$-th quantile of gene j expression values across all samples.
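As a minimal sketch of equations (1) and (2), the following NumPy function maps every value in a sample to the mean of the sorted values at the same rank. The function name `quantile_normalize`, the rank-based tie handling and the toy matrix are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalise the rows (samples) of X to one reference distribution.

    X has shape (N, G): N samples, G genes. Ties are broken by sort order,
    which is a simplification assumed for this sketch.
    """
    X = np.asarray(X, dtype=float)
    # Rank of each gene within its own sample (0 = smallest value).
    order = np.argsort(X, axis=1, kind="stable")
    ranks = np.argsort(order, axis=1, kind="stable")
    # Reference distribution: mean of the sorted values across samples (Eq. 2).
    reference = np.sort(X, axis=1).mean(axis=0)
    # Replace each value by the reference quantile at its rank (Eq. 1).
    return reference[ranks]

# Toy example: two samples, three genes.
X = np.array([[5.0, 2.0, 3.0],
              [4.0, 1.0, 6.0]])
X_qn = quantile_normalize(X)
```

After normalisation, every sample contains the same set of values, so the per-sample distributions are identical while the within-sample gene ordering is preserved.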

Batch effect correction

MICRO-AI consists of four interdependent modules: (1) a Data Preprocessing Module, which performs normalisation and batch effect correction; (2) a Feature Selection Module, which implements attention-weighted gene prioritisation; (3) an Ensemble Classification Module, which combines multiple learners with adaptive weighting; and (4) a Clinical Integration Module, which enables deployment of MICRO-AI in medical informatics workflows.

Within the preprocessing module, batch effects are removed using ComBat, which models the expression of gene g for sample j in batch i as:

Y_{i j g} = α_{g} + X β_{g} + γ_{i g} + δ_{i g} ϵ_{i j g}

(3)

where $α_{g}$ is the overall gene mean, $X β_{g}$ represents biological covariates, $γ_{i g}$ is the additive batch effect and $δ_{i g}$ is the multiplicative batch effect.

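The corrected expression values are obtained by removing the estimated batch terms and restoring the gene mean and covariate effects:

Y_{i j g}^{*} = \frac{Y_{i j g} - {\hat{α}}_{g} - X {\hat{β}}_{g} - {\hat{γ}}_{i g}^{*}}{{\hat{δ}}_{i g}^{*}} + {\hat{α}}_{g} + X {\hat{β}}_{g}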

(4)
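The location and scale adjustment behind the batch model in equation (3) can be illustrated with a simplified sketch. It assumes no biological covariates and omits ComBat's empirical Bayes shrinkage entirely, so it is a plain per-batch standardisation rather than ComBat itself; the function name `location_scale_adjust` and the synthetic data are hypothetical.

```python
import numpy as np

def location_scale_adjust(Y, batch):
    """Remove additive and multiplicative batch effects gene by gene.

    Y: (N, G) expression matrix; batch: length-N array of batch labels.
    Assumes at least two samples per batch and non-constant genes.
    """
    Y = np.asarray(Y, dtype=float)
    batch = np.asarray(batch)
    alpha = Y.mean(axis=0)                       # overall gene mean
    resid = Y - alpha
    centred = np.empty_like(Y)
    for b in np.unique(batch):
        idx = batch == b
        gamma = resid[idx].mean(axis=0)          # additive batch effect
        centred[idx] = resid[idx] - gamma
    pooled_sd = centred.std(axis=0)              # spread after removing batch means
    adjusted = np.empty_like(Y)
    for b in np.unique(batch):
        idx = batch == b
        delta = centred[idx].std(axis=0) / pooled_sd   # multiplicative batch effect
        adjusted[idx] = centred[idx] / delta + alpha
    return adjusted

# Synthetic data: batch 1 is shifted by 3 and scaled by 2 relative to batch 0.
rng = np.random.default_rng(0)
Y = rng.normal(size=(12, 4))
batch = np.array([0] * 6 + [1] * 6)
Y[batch == 1] = 2.0 * Y[batch == 1] + 3.0
adjusted = location_scale_adjust(Y, batch)
```

After adjustment, both batches share the same per-gene mean and standard deviation. The empirical Bayes step in ComBat additionally shrinks the per-gene batch estimates towards values pooled across genes, which matters most when batches are small.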

Missing value imputation

Missing values are imputed using k-nearest neighbours (KNN) with weighted averaging:

{\hat{x}}_{i j} = \frac{\sum_{k \in N_{K} (i)} w_{i k} \cdot x_{k j}}{\sum_{k \in N_{K} (i)} w_{i k}}

(5)

where $N_{K} (i)$ denotes the K nearest neighbours of sample i, and weights are computed as:

w_{i k} = \exp (- \frac{d {(i, k)}^{2}}{2 σ^{2}})

(6)

with $d (i, k)$ representing Euclidean distance and $σ$ as the bandwidth parameter.
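A minimal sketch of equations (5) and (6) follows. The text does not say how distances are computed when the samples themselves contain gaps; this sketch assumes distances over the genes observed in both samples and restricts neighbours to samples in which the target gene is observed. The name `knn_impute` and the toy matrix are illustrative.

```python
import numpy as np

def knn_impute(X, k=10, sigma=1.0):
    """Fill NaN entries with a Gaussian-weighted average over the k nearest samples."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i, j in zip(*np.where(missing)):
        # Candidate neighbours: samples in which gene j is observed.
        donors = np.where(~missing[:, j])[0]
        dists = np.empty(len(donors))
        for n, d in enumerate(donors):
            shared = ~missing[i] & ~missing[d]   # genes observed in both samples
            dists[n] = np.sqrt(np.sum((X[i, shared] - X[d, shared]) ** 2))
        nearest = np.argsort(dists)[:k]
        # Gaussian weights from Eq. 6; they underflow if sigma is far too small.
        w = np.exp(-dists[nearest] ** 2 / (2.0 * sigma ** 2))
        filled[i, j] = np.sum(w * X[donors[nearest], j]) / np.sum(w)   # Eq. 5
    return filled

# Toy example: sample 0 is missing gene 2; samples 1 and 2 are its closest neighbours.
X = np.array([[1.0, 2.0, np.nan],
              [1.0, 2.0, 3.0],
              [1.0, 2.0, 5.0],
              [10.0, 10.0, 100.0]])
X_filled = knn_impute(X, k=2, sigma=1.0)
```

In this example the two nearest neighbours are at distance zero and receive equal weight, so the missing value is filled with the average of 3 and 5.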

Attention-weighted feature selection

Initial filtering

Low-variance genes are removed using median absolute deviation (MAD) filtering:

{MAD}_{j} = median (| x_{i j} - median (x_{\cdot j}) |)

(7)

Genes with ${MAD}_{j} < τ_{MAD}$ are excluded, where $τ_{MAD}$ is an adaptive threshold determined by the overall MAD distribution.
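The MAD filter in equation (7) can be sketched as below. The text does not state how the adaptive threshold is derived from the MAD distribution, so a percentile of the per-gene MAD values is assumed here purely for illustration; `mad_filter` and its `percentile` parameter are hypothetical names.

```python
import numpy as np

def mad_filter(X, percentile=25.0):
    """Return indices of genes whose MAD is at or above a percentile threshold.

    X: (N, G) expression matrix. The percentile rule is an assumption of
    this sketch, standing in for the adaptive threshold.
    """
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)     # Eq. 7, one value per gene
    tau = np.percentile(mad, percentile)         # assumed adaptive threshold
    keep = np.where(mad >= tau)[0]               # genes with MAD < tau are excluded
    return keep, mad, tau

# Toy example: gene 0 is constant, gene 1 varies a little, gene 2 varies a lot.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 2.0, 10.0],
              [1.0, 3.0, 20.0]])
keep, mad, tau = mad_filter(X, percentile=50.0)
```

With the median of the MAD values as threshold, the constant gene is dropped and the two varying genes are retained.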

Attention score computation

We compute attention scores that quantify each gene's discriminative importance. Let $h_{j} \in R^{d}$ denote the embedding of gene j. The attention score is computed as:

a_{j} = \frac{\exp (w^{T} \tanh (W_{h} h_{j} + b_{h}))}{\sum_{k = 1}^{G} \exp (w^{T} \tanh (W_{h} h_{k} + b_{h}))}

(8)

where $W_{h} \in R^{d \times d}$, $b_{h} \in R^{d}$ and $w \in R^{d}$ are learnable parameters.

The gene embedding is computed from expression statistics:

h_{j} = σ (W_{e} {[μ_{j}, σ_{j}, {skew}_{j}, {kurt}_{j}, {MI}_{j}]}^{T} + b_{e})

(9)

where $μ_{j}$, $σ_{j}$, ${skew}_{j}$ and ${kurt}_{j}$ are statistical moments, and ${MI}_{j}$ is the mutual information between gene j and class labels.

Clarification on gene embedding computation and integration: The gene embedding defined in equation (9) is computed in a deterministic manner and does not involve end-to-end gradient-based learning. Specifically, the embedding vector $h_{j}$ is derived from fixed statistical descriptors of gene expression, including first- and higher-order moments and mutual information with class labels. The parameters $W_{e}$ and $b_{e}$ are learned only within the lightweight attention scoring module and are not jointly optimised with the downstream classifiers.

Notably, the resulting gene embeddings and attention scores are used solely to prioritise and select features before classifier training. After the best gene subset has been identified by the attention-weighted recursive feature elimination with cross-validation (RFECV) process, the ensemble classifiers (GBM, RF and SVM) are trained separately on the reduced feature space. As such, the attention mechanism does not affect the optimisation of classifier parameters or the decision boundary, which preserves modularity, interpretability and compatibility with classical ML models.
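The forward computation of equations (8) and (9) can be sketched as follows. Several choices here are assumptions: the nonlinearity in equation (9) is taken to be a logistic sigmoid, the second descriptor is the standard deviation, mutual information is estimated with scikit-learn's `mutual_info_classif`, and all parameters are random placeholders because the training of the scoring module is not described at this level of detail.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def gene_descriptors(X, y, random_state=0):
    """Per-gene descriptors [mean, std, skewness, excess kurtosis, MI], shape (G, 5)."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    z = (X - mu) / np.where(sd > 0, sd, 1.0)
    skew = (z ** 3).mean(axis=0)
    kurt = (z ** 4).mean(axis=0) - 3.0
    mi = mutual_info_classif(X, y, random_state=random_state)
    return np.column_stack([mu, sd, skew, kurt, mi])

def attention_scores(S, W_e, b_e, W_h, b_h, w):
    """Softmax attention over genes from descriptor matrix S of shape (G, 5)."""
    H = 1.0 / (1.0 + np.exp(-(S @ W_e.T + b_e)))   # Eq. 9 with an assumed sigmoid
    e = np.tanh(H @ W_h.T + b_h) @ w               # unnormalised scores, shape (G,)
    e = e - e.max()                                # numerical stability
    a = np.exp(e)
    return a / a.sum()                             # Eq. 8

# Synthetic data and untrained placeholder parameters (d = 4 for brevity).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = rng.integers(0, 2, size=40)
d = 4
S = gene_descriptors(X, y)
a = attention_scores(S,
                     W_e=rng.normal(size=(d, 5)), b_e=np.zeros(d),
                     W_h=rng.normal(size=(d, d)), b_h=np.zeros(d),
                     w=rng.normal(size=d))
```

The scores are non-negative and sum to one over genes, so they can be read as a relative ranking of gene importance.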

Recursive feature elimination with cross-validation

The RFECV process repeatedly removes the least important features based on classifier scores. At iteration t, the feature set is updated as:

G^{(t + 1)} = G^{(t)} \setminus \{j : r_{j}^{(t)} \leq θ_{t}\}

(10)

where $r_{j}^{(t)}$ is the importance ranking of gene j at iteration t and $θ_{t}$ is the elimination threshold.

Cross-validation performance is monitored:

{CV}^{(t)} = \frac{1}{K} \sum_{k = 1}^{K} Accuracy (f^{(t)}, D_{k}^{v a l})

(11)

The optimal feature subset $G^{*}$ maximises cross-validation accuracy:

G^{*} = \arg \max_{G^{(t)}} {CV}^{(t)}

(12)
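Equations (10) to (12) correspond closely to scikit-learn's `RFECV`, which eliminates features in steps and keeps the subset with the best cross-validated score. The sketch below uses plain `RFECV` with a random forest on synthetic data; how the attention scores enter the elimination ranking is not specified in the text, so that coupling is not reproduced, and the estimator, step size and data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a filtered expression matrix.
X, y = make_classification(n_samples=120, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=5,                               # features removed per iteration (Eq. 10)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",                   # cross-validated accuracy (Eq. 11)
    min_features_to_select=5,
)
selector.fit(X, y)

# Indices of the subset with the best mean cross-validated accuracy (Eq. 12).
selected = selector.get_support(indices=True)
```

The fitted selector also exposes `n_features_`, the size of the chosen subset, and `transform`, which reduces a matrix to the selected columns.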

Ensemble classification module

Base classifiers

MICRO-AI employs three base classifiers with complementary strengths.

GBM: The GBM sequentially fits decision trees to residuals:

F_{m} (x) = F_{m - 1} (x) + η \cdot h_{m} (x)

(13)

where $η$ is the learning rate and $h_{m}$ minimises:

h_{m} = \arg \min_{h} \sum_{i = 1}^{N} L (y_{i}, F_{m - 1} (x_{i}) + h (x_{i}))

(14)

The selection of the GBM over alternative boosting algorithms (AdaBoost, XGBoost) was motivated by several considerations. AdaBoost is more sensitive to noisy data and outliers than GBM, as it assigns exponentially increasing weights to misclassified samples, which is problematic in microarray data with inherent biological noise. While XGBoost offers computational advantages through parallel tree construction and regularised learning objectives, a comparative analysis (Table 5) shows that, when used as a standalone classifier, XGBoost achieves 93.4% accuracy, 3.4% lower than MICRO-AI's ensemble. GBM's key advantage in the MICRO-AI ensemble lies in complementarity with RF and SVM: GBM's sequential residual learning captures complex nonlinear feature interactions, RF's bagging provides variance reduction and stability, and SVM's kernel mapping excels in high-dimensional boundary estimation. When combined through adaptive weighting (equations (19) to (21)), this heterogeneous ensemble outperforms any homogeneous boosting approach.

Supplementary experiments replacing GBM with XGBoost yielded an XGBoost-RF-SVM ensemble achieving 96.3% average accuracy versus 96.8% for the GBM-RF-SVM configuration, with the difference attributable to GBM's slightly better complementarity with RF and SVM in adaptive weighting. The AdaBoost replacement yielded 95.1% accuracy due to its higher sensitivity to noisy microarray features. These results confirm that GBM provides an optimal balance of performance, robustness and ensemble complementarity for the MICRO-AI framework.

RF: RF aggregates predictions from B bootstrap-trained decision trees:

\hat{y} = \operatorname{mode} \{T_{b} (x)\}_{b = 1}^{B}

(15)

where each tree $T_{b}$ is trained on a bootstrap sample $D_{b}$ with random feature subsets of size $\sqrt{G}$.

SVM: SVM solves the optimisation problem:

\min_{w, b, ξ} \frac{1}{2} ∥ w ∥^{2} + C \sum_{i = 1}^{N} ξ_{i}

(16)

subject to:

y_{i} (w^{T} ϕ (x_{i}) + b) \geq 1 - ξ_{i}, ξ_{i} \geq 0

(17)

where $ϕ (\cdot)$ is the kernel mapping, and C controls regularisation.

We employ the radial basis function (RBF) kernel:

K (x_{i}, x_{j}) = \exp (- γ ∥ x_{i} - x_{j} ∥^{2})

(18)

Adaptive weight optimisation

Ensemble predictions are computed as weighted combinations:

{\hat{p}}_{c} (x) = \sum_{m = 1}^{M} w_{m} \cdot p_{c}^{(m)} (x)

(19)

where $p_{c}^{(m)} (x)$ is the probability of class c from classifier m, and the weights satisfy $\sum_{m} w_{m} = 1$ and $w_{m} \geq 0$.

The optimal weights are obtained by minimising the cross-entropy loss on validation data:

L (w) = - \sum_{i = 1}^{N_{v a l}} \sum_{c = 0}^{C - 1} y_{i c} \log ({\hat{p}}_{c} (x_{i}))

(20)

The constrained problem is solved using SLSQP:

w^{*} = \arg \min_{w} L (w) \quad \text{s.t.} \quad w \geq 0, \; 1^{T} w = 1

(21)
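The weight optimisation in equations (19) to (21) can be sketched with SciPy's `minimize` using `method="SLSQP"`. The helper `optimise_weights`, the equal-weight starting point and the synthetic validation probabilities are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def optimise_weights(probas, y_onehot, eps=1e-12):
    """Find simplex weights that minimise validation cross-entropy.

    probas: list of M arrays of shape (N_val, C) with class probabilities.
    y_onehot: (N_val, C) one-hot validation labels.
    """
    P = np.stack(probas)                             # (M, N_val, C)
    M = P.shape[0]

    def loss(w):
        p_hat = np.tensordot(w, P, axes=1)           # Eq. 19, shape (N_val, C)
        return -np.sum(y_onehot * np.log(np.clip(p_hat, eps, 1.0)))   # Eq. 20

    result = minimize(
        loss,
        x0=np.full(M, 1.0 / M),                      # start from equal weighting
        method="SLSQP",
        bounds=[(0.0, 1.0)] * M,                     # w >= 0
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x                                  # Eq. 21

# Synthetic validation set: classifier 0 is clearly the most reliable.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
Y = np.eye(2)[y]

def fake_proba(p_true):
    """Probabilities that give the true class a fixed probability p_true."""
    return Y * p_true + (1.0 - Y) * (1.0 - p_true)

w_opt = optimise_weights([fake_proba(0.9), fake_proba(0.6), fake_proba(0.5)], Y)
```

In this artificial setting the optimiser concentrates the weight on the most reliable classifier. With real, partly complementary classifiers the optimum is typically an interior mixture, as in the weights reported for GBM, RF and SVM.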

Confidence calibration

Probability estimates are calibrated using isotonic regression:

p_{c a l} (c | x) = g_{c} ({\hat{p}}_{c} (x))

(22)

where $g_{c}$ is a monotonically increasing function fitted on calibration data to minimise ECE:

ECE = \sum_{b = 1}^{B} \frac{| B_{b} |}{N} | acc (B_{b}) - conf (B_{b}) |

(23)
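Equation (23) can be computed as in the sketch below, which assumes equal-width confidence bins on [0, 1] and takes the confidence of each prediction to be its maximum class probability. The function name and the choice of ten bins are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins (Eq. 23).

    confidences: predicted probability of the chosen class, one per sample.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; values at 0 fall into the first bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap             # |B_b| / N times |acc - conf|
    return ece

# Two bins, each with an accuracy-confidence gap of 0.05.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
# A perfectly calibrated bin: 75% confidence, 75% accuracy.
ece_perfect = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```

A lower ECE indicates that the reported confidence tracks the observed accuracy more closely, which is the property that isotonic calibration aims to improve.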

Algorithmic implementation

Algorithm 1 presents the complete MICRO-AI training procedure.

Algorithm 1: MICRO-AI training algorithm.

Input: Expression matrix $X \in R^{N \times G}$, labels $y$

Output: Trained ensemble model $E$, selected genes $G^{*}$

// Preprocessing

$\tilde{X} \leftarrow$ QuantileNormalize($X$) via Eq. 1

$\tilde{X} \leftarrow$ ComBatCorrection($\tilde{X}$) via Eq. 3

$\tilde{X} \leftarrow$ KNNImpute($\tilde{X}$) via Eq. 5

// Feature Selection

Filter genes: $G_{0} \leftarrow \{j : {MAD}_{j} > τ_{MAD}\}$

Compute embedding $h_{j}$ via Eq. 9

Compute attention score $a_{j}$ via Eq. 8

$G^{*} \leftarrow$ RFECV($\tilde{X} [:, G_{0}]$, $y$, attention scores) via Eq. 10

$X^{*} \leftarrow \tilde{X} [:, G^{*}]$

// Ensemble Training

Train each base classifier $f_{m}$ on $(X^{*}, y)$

Optimise weights $w^{*}$ via Eq. 21

Calibrate probabilities via Eq. 22

Return $E = \{(f_{m}, w_{m}^{*})\}_{m = 1}^{M}$, $G^{*}$

Algorithm 2 details the inference procedure for new samples.

Algorithm 2: MICRO-AI inference algorithm.

Input: Test sample $x \in R^{G}$, trained ensemble $E$, genes $G^{*}$

Output: Predicted class $\hat{y}$, confidence $p^{*}$

$\tilde{x} \leftarrow$ ApplyNormalization($x$)

$x^{*} \leftarrow \tilde{x} [G^{*}]$

$p^{(m)} \leftarrow f_{m}.predict\_proba(x^{*})$ for each base classifier $m$

$\hat{p} \leftarrow \sum_{m} w_{m} \cdot p^{(m)}$ via Eq. 19

${\hat{p}}_{cal} \leftarrow$ Calibrate($\hat{p}$) via Eq. 22

$\hat{y} \leftarrow \arg\max_{c} {\hat{p}}_{cal} (c)$

$p^{*} \leftarrow \max_{c} {\hat{p}}_{cal} (c)$

Return $\hat{y}$, $p^{*}$
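The aggregation steps of the inference algorithm can be sketched as below. The weights are the mean values reported for GBM, RF and SVM in the experimental setup; the per-classifier probabilities are made-up numbers for a single two-class sample, and the calibration step is omitted for brevity.

```python
import numpy as np

def ensemble_predict(probas, weights):
    """Weighted average of class probabilities, then argmax.

    probas: (M, C) array, one row of class probabilities per base classifier.
    weights: length-M array of non-negative weights summing to one.
    """
    probas = np.asarray(probas, dtype=float)
    weights = np.asarray(weights, dtype=float)
    p_hat = weights @ probas                 # weighted combination, shape (C,)
    y_hat = int(np.argmax(p_hat))            # predicted class
    confidence = float(p_hat[y_hat])         # uncalibrated confidence
    return y_hat, confidence, p_hat

weights = np.array([0.42, 0.35, 0.23])       # GBM, RF, SVM (reported mean weights)
probas = np.array([[0.9, 0.1],               # hypothetical GBM output
                   [0.6, 0.4],               # hypothetical RF output
                   [0.2, 0.8]])              # hypothetical SVM output
y_hat, confidence, p_hat = ensemble_predict(probas, weights)
```

Here the SVM disagrees with the other two classifiers, but its lower weight leaves the ensemble prediction at class 0 with a combined probability of about 0.63.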

Complexity analysis

The computational complexity of MICRO-AI is analysed as follows.

Preprocessing: Quantile normalisation requires $O (N G \log G)$ for sorting. ComBat correction is $O (N G)$ . KNN imputation is $O (N^{2} G)$ .

Feature selection: MAD filtering is $O (N G)$ . Attention computation is $O (G d^{2})$ , where d is the embedding dimension. RFECV with T iterations requires $O (T \cdot C_{base} \cdot K)$ , where $C_{base}$ is the complexity of the base classifier and K is the number of CV folds.

Ensemble training: GBM training is $O (M_{t r e e s} \cdot N \cdot G^{*} \cdot \log N)$ . RF training is $O (B \cdot N \cdot \sqrt{G^{*}} \cdot \log N)$ . SVM training is $O (N^{2} G^{*} + N^{3})$ .

Inference: Ensemble inference is $O (M \cdot C_{i n f})$ , where $C_{i n f}$ is the inference complexity of individual classifiers.

SVM dominates the overall training complexity: $O (N^{3})$ for small datasets or $O (N^{2} G^{*})$ for larger datasets with kernel approximation.

To provide hardware-independent computational cost measures, we report floating-point operations (FLOPs) for each MICRO-AI component. For single-sample inference with G* = 127 selected features, the total computational cost is approximately 2.1 × 10⁴ FLOPs, dominated by SVM kernel evaluations (∼1.6 × 10⁴ FLOPs) (Table 1). During training, total FLOPs scale to approximately 4.7 × 10⁹ for the GSE2034 dataset (N = 286, G = 22,283), with SVM kernel matrix construction accounting for the largest share (∼3.2 × 10⁹ FLOPs). For comparison, deep learning methods such as GeneFormer require approximately 2.8 × 10¹⁰ FLOPs for equivalent training, roughly 6× the computational cost of MICRO-AI (2.8 × 10¹⁰ / 4.7 × 10⁹ ≈ 6.0). These FLOP estimates indicate that MICRO-AI reaches its reported classification performance at substantially lower computational cost.

Table 1.

Computational complexity in FLOPs.

Component	Operation	Complexity	Estimated FLOPs (GSE2034, G* = 127)
Quantile normalisation	Sort + rank mapping	O(G log G)	∼2.3 × 10³
ComBat correction	Linear transformation	O(G)	∼5.1 × 10²
KNN imputation	Distance computation + averaging	O(N × G)	∼3.6 × 10⁴
MAD filtering	Median computation	O(N × G)	∼6.4 × 10⁶
Attention score	Matrix multiplication + softmax	O(G × d²) = O(G × 64²)	∼5.2 × 10⁵
GBM inference	Tree traversal (200 trees, depth 6)	O(Mtrees × depth)	∼1.2 × 10³
RF inference	Tree traversal (500 trees)	O(B × depth)	∼3.0 × 10³
SVM inference	Kernel evaluation (support vectors)	O(nsv × G*)	∼1.6 × 10⁴
Ensemble aggregation	Weighted sum	O(M × C)	∼15
Total inference	–	–	∼2.1 × 10⁴

Results and evaluation

This section provides a detailed experimental analysis of MICRO-AI, including descriptions of the datasets, the experimental setup, performance and comparisons.

Datasets

Table 2 summarises the benchmark datasets used for evaluation.

Table 2.

Summary of benchmark datasets.

Dataset	Samples	Genes	Classes	Platform	Source
GSE2034	286	22,283	2	U133A	GEO
GSE7390	198	22,283	2	U133A	GEO
TCGA-BRCA	1097	20,531	5	RNA-Seq	TCGA
GSE62254	300	20,155	4	U133Plus2	GEO
E-MTAB-365	155	18,943	3	IlluminaHT12	ArrayExpress
GSE9891	285	12,625	2	U95Av2	GEO

Dataset sources:

Gene Expression Omnibus (GEO): https://www.ncbi.nlm.nih.gov/geo/

GSE2034: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2034

GSE7390: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7390

GSE62254: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62254

GSE9891: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9891

The Cancer Genome Atlas (TCGA): https://portal.gdc.cancer.gov/projects/TCGA-BRCA

ArrayExpress: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-365/

Experimental setup

Experiments were conducted on a workstation with an Intel Xeon E5-2680 v4 CPU (2.4 GHz, 28 cores), 128 GB of RAM and an NVIDIA Tesla V100 GPU (32 GB). The framework was implemented in Python 3.9 using scikit-learn 1.0.2, XGBoost 1.5.0 and NumPy 1.21.5.

Hyperparameters: GBM: learning rate 0.1, max depth 6, 200 estimators. RF: 500 trees, max features $\sqrt{G^{*}}$. SVM: RBF kernel, $C = 1.0$, $\gamma = 1 / G^{*}$, where $G^{*}$ is the number of selected features. Attention dimension $d = 64$. KNN imputation $K = 10$.

Evaluation protocol: 5-fold stratified cross-validation with inner 3-fold CV for hyperparameter tuning. Performance averaged over 10 random seeds.
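The nested protocol can be sketched as index bookkeeping. The following is a minimal illustration that uses only the standard library and toy labels. The helper names `stratified_folds` and `nested_splits` are hypothetical, and the round-robin stratification shown here is an assumption rather than the splitter actually used in the experiments.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds, seed):
    """Assign sample positions to folds so that class proportions are roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin across folds.
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)
    return folds

def nested_splits(labels, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested cross-validation."""
    outer = stratified_folds(labels, outer_k, seed)
    for i, test_idx in enumerate(outer):
        train_idx = [idx for j, fold in enumerate(outer) if j != i for idx in fold]
        train_labels = [labels[idx] for idx in train_idx]
        # Inner folds are built from the outer training portion only.
        inner = stratified_folds(train_labels, inner_k, seed + 1)
        inner_splits = []
        for k, val_pos in enumerate(inner):
            val = [train_idx[p] for p in val_pos]
            fit = [train_idx[p] for m, fold in enumerate(inner) if m != k for p in fold]
            inner_splits.append((fit, val))
        yield train_idx, test_idx, inner_splits

labels = [0] * 60 + [1] * 40  # toy labels, not taken from any benchmark dataset
splits = list(nested_splits(labels))
print(len(splits), "outer folds;", len(splits[0][2]), "inner folds each")
```

The key property is that hyperparameters are tuned only on the inner splits, so the outer test fold never influences model selection. Repeating the procedure with different `seed` values corresponds to averaging over random seeds.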

Optimal hyperparameters were determined via nested cross-validation: 3-fold inner cross-validation for hyperparameter tuning and 5-fold outer cross-validation for performance evaluation. ComBat parameters α_g and β_g were estimated using an empirical Bayes framework (Johnson et al.³⁴). Optimised ensemble weights from SLSQP minimisation of cross-entropy loss (equation (21)) consistently assigned the highest weight to GBM (w_1 = 0.42 ± 0.03), followed by RF (w_2 = 0.35 ± 0.04) and SVM (w_3 = 0.23 ± 0.03), reflecting GBM's superior individual classification performance. Table 3 summarises all optimal parameter values and selection methods.

Table 3.

Model and preprocessing hyperparameter settings.

Parameter	Symbol	Optimal value	Selection method
GBM learning rate	η	0.1	Grid search (3-fold inner CV)
GBM max depth	–	6	Grid search
GBM number of estimators	Mtrees	200	Early stopping on validation loss
RF number of trees	B	500	OOB error stabilisation
RF max features	–	√G*	Default (Breiman's recommendation)
SVM regularisation	C	1.0	Grid search {0.01, 0.1, 1, 10, 100}
SVM kernel parameter	γ	1 / G*	Grid search {1/G*, 0.01, 0.001}
Attention embedding dimension	d	64	Validated over {32, 64, 128}
KNN neighbours	K	10	Validated over {5, 10, 15, 20}
MAD threshold	τMAD	Adaptive (top 40% percentile)	Distribution-based
ComBat gene mean	αg	Estimated empirically per gene	Empirical Bayes estimation
ComBat covariate effect	βg	Estimated empirically per gene	Linear regression on covariates
Ensemble weight (GBM)	w₁	0.42 ± 0.03	SLSQP optimisation (equation (21))
Ensemble weight (RF)	w₂	0.35 ± 0.04	SLSQP optimisation (equation (21))
Ensemble weight (SVM)	w₃	0.23 ± 0.03	SLSQP optimisation (equation (21))
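To illustrate how the optimised weights act at prediction time, the sketch below combines per-classifier class probabilities by a weighted average. This assumes the ensemble output is a weighted soft vote, which is one common reading of adaptive weighting; equation (21) itself is not reproduced here. The function name and the probability vectors are illustrative, and only the weights come from Table 3.

```python
def weighted_soft_vote(probabilities, weights):
    """Weighted average of per-classifier class-probability vectors."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Mean weights from Table 3, in the order GBM, RF, SVM.
weights = [0.42, 0.35, 0.23]

# Illustrative class probabilities for one sample (not experimental data).
probs = [
    [0.35, 0.65],  # GBM
    [0.60, 0.40],  # RF
    [0.60, 0.40],  # SVM
]

combined = weighted_soft_vote(probs, weights)
equal = weighted_soft_vote(probs, [1.0, 1.0, 1.0])
print("optimised weights ->", combined, "class", argmax(combined))
print("equal weights     ->", equal, "class", argmax(equal))
```

In this toy case the higher GBM weight flips the decision relative to equal weighting, which shows how weighting can change predictions when the base classifiers disagree.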

Performance metrics

Table 4 defines the evaluation metrics.

Table 4.

Performance metrics definitions.

Metric	Formula
Accuracy	$(TP + TN) / (TP + TN + FP + FN)$
Sensitivity (Recall)	$TP / (TP + FN)$
Specificity	$TN / (TN + FP)$
Precision	$TP / (TP + FP)$
F1-Score	$2 \times (\text{Precision} \times \text{Recall}) / (\text{Precision} + \text{Recall})$
AUC-ROC	Area under the ROC curve
MCC	$(TP \times TN - FP \times FN) / \sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}$
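The count-based formulas in Table 4 translate directly into code. The sketch below computes them from binary confusion-matrix counts. The function name and the example counts are illustrative, zero denominators are not handled, and AUC-ROC is omitted because it requires ranked scores rather than counts.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the count-based Table 4 metrics for a binary confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }

# Illustrative counts, not taken from the experiments.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For multi-class datasets such as TCGA-BRCA, these quantities would typically be computed per class in a one-vs-rest fashion and then averaged; the averaging scheme is not specified in this section.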

Classification results

Table 5 presents the classification performance across all datasets.

Table 5.

Classification performance of MICRO-AI across datasets.

Dataset	Acc (%)	Sen (%)	Spec (%)	Prec (%)	F1 (%)	AUC	MCC
GSE2034	95.8	94.2	97.1	96.8	95.5	0.978	0.914
GSE7390	97.5	96.8	98.2	97.9	97.3	0.986	0.949
TCGA-BRCA	96.2	94.8	97.4	95.6	95.2	0.981	0.921
GSE62254	94.7	93.5	95.8	94.2	93.8	0.972	0.893
E-MTAB-365	98.1	97.4	98.7	98.3	97.8	0.992	0.961
GSE9891	96.5	95.6	97.3	96.9	96.2	0.984	0.929
Average	96.8	95.2	97.4	96.6	96.0	0.983	0.928

Figure 3 shows the convergence behaviour of the three base classifiers during training on the TCGA-BRCA dataset. Training and validation curves are presented together to demonstrate that the models converge without significant overfitting. The GBM component converges stably after approximately 150 iterations, with a training-validation loss gap below 0.02. The RF out-of-bag error stabilises at around 350 trees with minimal divergence from the validation error. The SVM hinge loss converges within 80 iterations, with the validation loss closely tracking the training loss. These results indicate that the ensemble components are well regularised and generalise to held-out data.

Figure 3.

Loss convergence curves of the GBM, RF (out-of-bag error) and SVM (hinge loss) components during training on the TCGA-BRCA dataset.

Figure 4 shows classification accuracy as a function of training epochs.

Figure 4.

Classification accuracy versus training epochs of individual classifiers and the ensemble on the GSE2034 dataset.

Figure 5 shows the confusion matrix for multi-class classification on TCGA-BRCA.

Figure 5.

Confusion matrix for five-class TCGA-BRCA classification, showing the distribution of predictions across the Luminal A, Luminal B, HER2-enriched, Basal-like and Normal-like subtypes.

Figure 6 shows the ROC curves for binary classification.

Figure 6.

ROC curves of the MICRO-AI ensemble and the individual classifiers on the GSE2034 breast cancer dataset.

Feature selection analysis

Table 6 reports the number of features retained at each selection stage and the top functional category of the selected genes.

Table 6.

Feature selection results and gene functional categories.

Dataset	Original features	After MAD	Final selected	Reduction (%)	Top functional category
GSE2034	22,283	8456	127	99.4	Cell Cycle
GSE7390	22,283	7892	98	99.6	Apoptosis
TCGA-BRCA	20,531	9234	156	99.2	Signaling
GSE62254	20,155	6723	142	99.3	Metabolism
E-MTAB-365	18,943	5678	89	99.5	Immune
GSE9891	12,625	4567	112	99.1	Proliferation
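The reduction column follows directly from the original and final feature counts. As a quick check, the sketch below recomputes it from the values in Table 6; the dictionary layout and function name are illustrative.

```python
# (original features, after MAD filtering, final selected) copied from Table 6.
feature_counts = {
    "GSE2034": (22283, 8456, 127),
    "GSE7390": (22283, 7892, 98),
    "TCGA-BRCA": (20531, 9234, 156),
    "GSE62254": (20155, 6723, 142),
    "E-MTAB-365": (18943, 5678, 89),
    "GSE9891": (12625, 4567, 112),
}

def reduction_percent(original, selected):
    """Percentage of features removed between the raw and final gene sets."""
    return 100.0 * (1.0 - selected / original)

reductions = {
    name: round(reduction_percent(original, final), 1)
    for name, (original, _after_mad, final) in feature_counts.items()
}

for name, value in reductions.items():
    print(f"{name}: {value}% reduction")
```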

Figure 7 shows the distribution of attention weights across genes.

Figure 7.

Attention weight distribution of the top 50 genes in GSE2034. Higher weights indicate greater discriminative importance for classification.

Comparative analysis

Table 7 compares MICRO-AI with ten state-of-the-art methods, including classical ML approaches (SVM-RFE³⁵, Random Forest³⁶, XGBoost³⁷, LASSO-SVM⁴⁴), deep learning architectures³⁸ such as CNN-Gene³⁹, and attention- and transformer-based models (Attention-Net⁴⁰, GeneFormer⁴¹, scBERT⁴²). MICRO-AI achieves the highest accuracy (96.8%) while maintaining competitive computational efficiency (52.3 s average training time).

Table 7.

Comparative analysis of MICRO-AI against state-of-the-art methods.

Method	Reference	Year	Acc (%)	Sen (%)	Spec (%)	F1 (%)	AUC	Time (s)	Features
SVM-RFE	Guyon et al.	2002	89.3	87.5	91.2	88.4	0.924	45.2	150
Random Forest	Breiman	2001	91.7	90.2	93.1	91.0	0.945	28.7	200
XGBoost	Chen & Guestrin	2016	93.4	92.1	94.6	93.0	0.961	32.4	175
Deep Learning MLP	LeCun et al.	2015	92.8	91.5	94.0	92.3	0.953	124.5	All
CNN-Gene	Ching et al.	2018	94.1	93.2	95.0	93.8	0.968	156.8	All
LASSO-SVM	Zhang et al.	2023	90.6	88.9	92.3	89.8	0.937	38.9	120
Attention-Net	Vaswani et al.	2017	93.9	92.7	95.1	93.5	0.964	89.3	All
GeneFormer	Theodoris et al.	2023	95.2	94.1	96.2	94.8	0.974	245.6	All
Ensemble-FS	Li et al.	2024	94.5	93.6	95.4	94.2	0.970	67.8	180
scBERT	Yang et al.	2022	95.6	94.8	96.4	95.3	0.978	312.4	All
MICRO-AI	Proposed	2025	96.8	95.2	97.4	96.0	0.983	52.3	127
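The headline margins can be recomputed from Table 7. The sketch below derives the accuracy gap (in percentage points) and the training-time ratio of each baseline relative to MICRO-AI; the data structure and variable names are illustrative, while the numbers are copied from the table.

```python
# (accuracy %, training time s) copied from Table 7.
baselines = {
    "SVM-RFE": (89.3, 45.2),
    "Random Forest": (91.7, 28.7),
    "XGBoost": (93.4, 32.4),
    "Deep Learning MLP": (92.8, 124.5),
    "CNN-Gene": (94.1, 156.8),
    "LASSO-SVM": (90.6, 38.9),
    "Attention-Net": (93.9, 89.3),
    "GeneFormer": (95.2, 245.6),
    "Ensemble-FS": (94.5, 67.8),
    "scBERT": (95.6, 312.4),
}
micro_ai_acc, micro_ai_time = 96.8, 52.3

# Accuracy gap in percentage points and training-time ratio relative to MICRO-AI.
accuracy_gaps = {
    name: round(micro_ai_acc - acc, 1) for name, (acc, _t) in baselines.items()
}
time_ratios = {
    name: round(t / micro_ai_time, 1) for name, (_acc, t) in baselines.items()
}

print("accuracy gap range:", min(accuracy_gaps.values()), "to", max(accuracy_gaps.values()))
for name in ("Deep Learning MLP", "GeneFormer", "scBERT"):
    print(f"{name}: {time_ratios[name]}x MICRO-AI training time")
```

The smallest gap is to scBERT and the largest to SVM-RFE, matching the 1.2–7.5 range quoted for accuracy. Ratios below 1 (for example, Random Forest) indicate baselines that train faster than MICRO-AI.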

Figure 8 visualises the performance of the compared methods.

Figure 8.

Comparison of MICRO-AI and the five best competing methods across six performance dimensions: accuracy, sensitivity, specificity, F1-score, AUC and computational efficiency.

Ablation study

Table 8 presents ablation study results analysing component contributions.

Table 8.

Ablation study results on TCGA-BRCA dataset.

Configuration	Acc (%)	AUC	F1 (%)	Δ Acc
Full MICRO-AI	96.2	0.981	95.2	–
w/o Attention	93.8	0.962	92.9	−2.4
w/o ComBat	94.5	0.968	93.6	−1.7
w/o RFECV	92.1	0.951	91.3	−4.1
w/o Calibration	95.8	0.979	94.8	−0.4
GBM only	91.2	0.943	90.4	−5.0
RF only	90.8	0.938	89.9	−5.4
SVM only	89.5	0.926	88.7	−6.7
Equal weights	95.4	0.975	94.5	−0.8

Computational efficiency

Table 9 gives training and inference times.

Table 9.

Computational time analysis (seconds).

Dataset	Preprocess	Feature selection	Training	Total	Inference per sample
GSE2034	2.3	18.5	31.2	52.0	0.012
GSE7390	1.8	14.2	24.6	40.6	0.010
TCGA-BRCA	8.7	42.3	78.5	129.5	0.018
GSE62254	3.1	22.8	38.4	64.3	0.014
E-MTAB-365	1.5	11.6	19.2	32.3	0.009
GSE9891	2.6	16.4	28.7	47.7	0.011

Figure 9 shows how training time scales with dataset size.

Figure 9.

Scalability of training time with sample count (left) and gene count (right), with 95% confidence intervals for the trends.

Discussion

The experimental findings show that MICRO-AI consistently outperforms existing techniques across the evaluation metrics. Several observations are worth noting.

The attention-weighted feature selection mechanism significantly improves classification performance, as shown by the 2.4 percentage point drop in accuracy when it is omitted (Table 8). By dynamically prioritising genes according to their discriminative significance, the attention mechanism can detect biologically significant expression signatures in a hypothesis-agnostic manner that traditional univariate methods might miss. Reducing the feature count from roughly 20,000 to about 90–160 genes (a reduction of more than 99%; Table 6) substantially lowers the computational burden while retaining predictive value.

RFECV is also crucial for identifying compact gene subsets: removing it causes the largest performance degradation (4.1 percentage points). RFECV iteratively eliminates features based on cross-validated performance, which limits overfitting to the training data and yields robust gene subsets that do not depend on individual samples.

The ensemble achieves 5.0–6.7 percentage points higher accuracy than the individual classifiers (Table 8), reflecting the complementary strengths of GBM, RF and SVM: GBM captures nonlinear relationships, RF reduces variance through bagging, and SVM is effective in high-dimensional spaces. Adaptive weight optimisation adds a further 0.8 percentage points over equal weighting.

Compared with state-of-the-art methods (Table 7), MICRO-AI demonstrates superior accuracy (96.8%) while maintaining competitive computational efficiency (52.3 s average training time). Deep learning models such as GeneFormer and scBERT come close to MICRO-AI's accuracy (95.2% and 95.6%, respectively) but require 4.7–6.0× longer training times and lack interpretable feature selection.

The stable performance across cancer types (breast, gastric, ovarian) and platforms (Affymetrix, Illumina, RNA-seq) indicates that the framework generalises well. ComBat's cross-platform batch-effect correction contributes 1.7 percentage points of accuracy (Table 8), underscoring its importance when integrating multi-source data.

MICRO-AI's outputs may diverge from the original benchmark dataset publications because it applies unified preprocessing (quantile normalisation, ComBat batch correction, KNN imputation) rather than dataset-specific protocols. For instance, for GSE2034, the original 76-gene prognostic signature (Wang et al., univariate Cox regression) and MICRO-AI's 127 genes (attention-weighted RFECV) show approximately 68% overlap in established breast cancer pathways (cell cycle, apoptosis, oestrogen signalling). Similarly, for TCGA-BRCA, MICRO-AI selected 156 genes compared with the 50 genes of the PAM50 panel, and these genes showed significant oncogenic pathway enrichment (FDR < 0.05, Gene Ontology).

Differences arise from: (1) uniform normalisation optimised for cross-dataset comparability versus dataset-specific protocols; (2) multi-class classification accuracy optimisation versus survival analysis or single-biomarker discovery; and (3) attention mechanism prioritising discriminative capacity across all classes simultaneously, identifying combinatorial signatures that univariate methods may miss.

While MICRO-AI may not replicate the exact gene lists of the original publications, its strength lies in discovering complementary and potentially novel biomarker combinations that achieve superior classification performance. MICRO-AI is a classification-oriented diagnostic tool, not a biomarker discovery platform, and the identified gene sets should be interpreted accordingly. For clinical translation, the selected genes should be cross-referenced with established biomarker databases (COSMIC, OncoKB) to ensure biological plausibility.

Although it accounts for only a small portion of the overall classification boost (around 0.4%), probability calibration becomes important in clinical decision support environments where the values of predicted confidence directly affect referral decisions and treatment planning. Probability calibration is applied in MICRO-AI via isotonic regression (equation (22)), which learns a nonparametric, monotonic map from raw ensemble output probabilities to empirically observed outcome frequencies on a held-out (validation) set.
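The monotonic mapping learned by isotonic regression can be illustrated with the pool-adjacent-violators algorithm. The sketch below is a minimal standard-library version under simplifying assumptions: tied scores are not merged, and prediction uses a step-function lookup rather than interpolation. The function names and toy data are illustrative, and equation (22) is not reproduced here.

```python
import bisect

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit: returns sorted scores and non-decreasing fitted values."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = [float(labels[i]) for i in order]
    blocks = []  # each block stores [sum of labels, number of samples]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return xs, fitted

def calibrate(score, xs, fitted):
    """Map a raw score to a calibrated probability by step lookup, clipped at the ends."""
    pos = bisect.bisect_right(xs, score) - 1
    return fitted[max(pos, 0)]

# Toy held-out scores and outcomes (not experimental data).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]
xs, fitted = fit_isotonic(raw_scores, outcomes)
print(fitted)
print(calibrate(0.35, xs, fitted))
```

In this toy example the scores 0.2 to 0.4 are pooled into one block whose calibrated value is the observed positive rate in that block (one in three), which is what makes the output interpretable as an empirical frequency.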

The expected calibration error (ECE), a diagnostic metric that quantifies the quality of probability calibration, is defined in equation (23). ECE measures the discrepancy between predicted confidence and observed accuracy across probability bins, providing an intuitive measure of probabilistic reliability. Lower ECE values indicate closer correspondence between predicted probabilities and actual outcome frequencies. Calibration helps ensure that MICRO-AI's confidence estimates are both statistically meaningful and clinically interpretable, supporting uncertainty quantification and reducing risk in medical decision-making.
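A minimal sketch of the ECE computation is given below. It assumes the standard equal-width binning definition, in which ECE is the sample-weighted mean of the absolute gap between accuracy and mean confidence in each bin; whether equation (23) uses exactly this binning is an assumption. The function name, the default of 10 bins and the toy inputs are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Sample-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        conf_sum[b] += conf
        hit_sum[b] += hit
        count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            accuracy = hit_sum[b] / count[b]
            mean_conf = conf_sum[b] / count[b]
            ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece

# Toy predictions: confidence of the predicted class and whether it was correct.
confidences = [0.95, 0.85, 0.75, 0.65]
correct = [1, 1, 0, 1]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

Here each sample falls into its own bin, so the ECE is the average of the four gaps (0.05, 0.15, 0.75 and 0.35); the confidently wrong third prediction dominates the error.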

The main drawbacks are the reliance on labelled training data, sensitivity to extreme class imbalance and the assumption of stationary expression patterns. Future work will focus on semi-supervised learning with limited labels, cost-sensitive classification for imbalanced datasets and temporal modelling of disease progression.

Consistency of gene discovery with original publications

MICRO-AI's outputs may differ from the original benchmark publications due to the use of unified preprocessing (quantile normalisation, ComBat batch correction, KNN imputation) rather than dataset-specific protocols. For GSE2034, MICRO-AI selected 127 genes (attention-weighted RFECV for classification) versus the original 76-gene signature (Wang et al.⁴³), with approximately 68% pathway overlap (cell cycle hsa04110, apoptosis hsa04210, oestrogen signalling hsa04915). For TCGA-BRCA, MICRO-AI selected 156 genes versus PAM50's 50-gene panel, with significant oncogenic pathway enrichment (Gene Ontology FDR < 0.05) supporting biological plausibility despite incomplete overlap.

Differences arise from: (1) uniform normalisation for cross-dataset comparability versus dataset-specific signal maximisation; (2) multi-class classification optimisation versus survival/single-biomarker discovery; (3) attention mechanism (equations (8) and (9)) prioritising discriminative capacity across all classes, identifying combinatorial signatures that univariate methods may miss. While MICRO-AI may not replicate exact gene lists, its strength lies in discovering complementary biomarker combinations achieving superior classification (96.8% accuracy). MICRO-AI is a classification-oriented diagnostic tool, not a biomarker discovery platform; interpret gene sets accordingly. For clinical translation, cross-reference selected genes with established databases (COSMIC, OncoKB) to ensure biological plausibility and clinical relevance (Table 10).

Table 10.

Comparison of gene selection, optimisation objectives and preprocessing strategies between original studies and the proposed MICRO-AI framework.

Aspect	Original study	MICRO-AI	Overlap / notes
GSE2034 gene set	76 genes (Wang et al.; Cox regression; survival-based selection)	127 genes (Attention-RFECV; classification-oriented selection)	∼68% pathway overlap (cell cycle, apoptosis, oestrogen signalling)
TCGA-BRCA gene set	50 genes (PAM50 panel; subtype classification)	156 genes (Attention-RFECV; multi-class classification)	Significant GO enrichment (FDR < 0.05)
Optimisation target	Survival prediction / single-biomarker emphasis	Multi-class accuracy optimisation	–
Preprocessing strategy	Dataset-specific preprocessing pipelines	Unified pipeline (quantile normalisation + ComBat + KNN imputation)	–

Limitations

Although the proposed MICRO-AI framework performed well across the benchmark datasets, several limitations should be considered. First, the model is trained on labelled data, so its applicability may be limited in clinical settings where annotated genomic datasets are scarce or costly to acquire. This dependence is a weakness when the model must perform well with only a small number of labelled samples.

Second, although stratified cross-validation is used, the framework can be affected by extreme class imbalance, especially in multi-class cancer subtype prediction. Imbalanced class distributions can bias the classifiers and distort their confidence estimates, underscoring the need for imbalance-aware learning methods.

Third, the ensemble includes an SVM component, whose computational cost grows rapidly with sample size. Although this is not a constraint for standard microarray datasets, future large-scale transcriptomic cohorts may require kernel approximation or alternative classifiers.

Lastly, the current assessment is based on retrospective, publicly available datasets and does not include prospective clinical validation or real-time deployment in hospital workflows. Clinical utility is therefore inferred rather than demonstrated, and additional research is needed before translational adoption.

We acknowledge that all current evaluations employ cross-validation within individual dataset cohorts, and no fully independent external validation set was reserved. While 5-fold stratified cross-validation with repeated random seeds provides robust performance estimates, the absence of a completely held-out test set limits the strength of generalisability claims. Future work will incorporate a leave-one-dataset-out (LODO) validation protocol, in which each dataset is systematically held out as an independent test set while the remaining datasets are used for training and hyperparameter selection. This approach will provide a more rigorous assessment of cross-platform and cross-disease generalisability.
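The planned LODO protocol amounts to iterating over the datasets and holding each one out in turn. The sketch below shows only the split logic; the generator name is hypothetical. A real evaluation would additionally need a shared gene space and harmonised class labels across cohorts, which this section does not specify.

```python
DATASETS = ["GSE2034", "GSE7390", "TCGA-BRCA", "GSE62254", "E-MTAB-365", "GSE9891"]

def lodo_splits(names):
    """Yield (training datasets, held-out dataset) pairs for leave-one-dataset-out validation."""
    for held_out in names:
        training = [name for name in names if name != held_out]
        yield training, held_out

lodo = list(lodo_splits(DATASETS))
for training, held_out in lodo:
    print(f"test on {held_out}; train and tune on {', '.join(training)}")
```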

Conclusion

This paper introduced MICRO-AI, a comprehensive ML framework for DNA microarray analysis and automated disease detection. The framework addresses the key challenges of high dimensionality, batch effects and limited sample sizes through four integrated modules: robust preprocessing with quantile normalisation and ComBat correction, attention-weighted feature selection that reduces dimensionality by over 99% (from ∼20,000 to ∼127 genes), adaptive ensemble classification combining GBM, RF and SVM with optimised weighting, and clinical integration with calibrated probability outputs.

Extensive validation across six benchmark datasets from three independent repositories (GEO, TCGA, ArrayExpress), encompassing 2321 samples across four cancer types, demonstrated state-of-the-art performance: 96.8% average accuracy, 95.2% sensitivity, 97.4% specificity, 0.983 area under the receiver operating characteristic curve (AUC-ROC), and 0.928 Matthews correlation coefficient (MCC). Comparative analysis against ten competing methods confirmed that MICRO-AI achieves 1.2–7.5% accuracy improvement while maintaining competitive computational efficiency (52.3 s average training time), representing 2.4–6.0× faster execution than deep learning alternatives such as GeneFormer and scBERT. The ablation study demonstrated that each component contributes meaningfully, with RFECV (−4.1%), attention mechanism (−2.4%) and ComBat correction (−1.7%) providing the most significant contributions.

Despite these achievements, several limitations remain. The framework relies on labelled training data, may be sensitive to extreme class imbalance, and has been validated only on retrospective public datasets, not on prospective clinical trials. The SVM component's quadratic complexity may pose scalability challenges for very large transcriptomic cohorts.

Future research directions include: (1) multi-omics integration incorporating transcriptomic, epigenomic and proteomic data for comprehensive molecular profiling; (2) federated learning adaptation for privacy-preserving collaborative genomic analysis across clinical institutions; (3) prospective clinical validation through pilot studies in hospital environments to assess real-world robustness, calibration reliability and clinical utility; (4) cost-sensitive and class-balanced learning strategies for improved performance on rare disease subtypes; and (5) external independent validation using LODO protocols to further strengthen generalisability claims.

Footnotes

ORCID iD

Manal A Othman

Ethical approval

This research study solely involves the use of historical datasets. No human participants or animals were involved in the collection or analysis of data for this study. As a result, ethical approval was not required.

Author contribution

All aspects of the research, including conceptualisation, methodology, software development, formal analysis and resource provision, were solely carried out by Manal A. Othman. The manuscript was also written, reviewed and edited independently by the author.

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R473), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The datasets used to support this study are publicly available at:

Gene Expression Omnibus (GEO): https://www.ncbi.nlm.nih.gov/geo/

GSE2034: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2034

GSE7390: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7390

GSE62254: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62254

GSE9891: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9891

The Cancer Genome Atlas (TCGA): https://portal.gdc.cancer.gov/projects/TCGA-BRCA

ArrayExpress: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-365/

The implementation code is publicly available at:

Appendix

References

1. Qvick, et al. Pan-cancer detection through DNA methylation profiling using enzymatic conversion library preparation with targeted sequencing. Int J Mol Sci 2025; 26: 10165.

2. Yoon, Seo, Shin, et al. Bioinformation and monitoring technology for environmental DNA analysis: a review. Biosensors 2025; 15: 494.

3. Sultan. Microarray analysis of differentially expressed genes in peripheral blood of postpartum women with gestational diabetes mellitus and type 2 diabetes. Life 2025; 15: 1270.

4. et al. Copy number variants of uncertain significance by chromosome microarray analysis from consecutive pediatric patients: reevaluation following current guidelines and reanalysis by genome sequencing. Genes 2025; 16: 874.

5. Ben Ali, Mustafov, Braoudaki, et al. Identification of a new lung cancer biomarker signature using data mining and preliminary in vitro validation. BioMedInformatics 2025; 5: 32.

6. Tselios, Vezakis, Zaravinos, et al. Using geometric approaches to the common transcriptomics in acute lymphoblastic leukemia and rhabdomyosarcoma: expanding and integrating pathway simulations. BioMedInformatics 2025; 5: 45.

7. Yuan, et al. Mamba-YOLO-ML: a state-space model-based approach for mulberry leaf disease detection. Plants 2025; 14: 2084.

8. Atesoglu, Bingol. The detection and classification of grape leaf diseases with an improved hybrid model based on feature engineering and AI. AgriEngineering 2025; 7: 228.

9. Takou, Bellis, Lasky. Predicting gene expression responses to cold in Arabidopsis thaliana using natural variation in DNA sequence. Genes 2025; 16: 1108.

10. Shao. Machine learning in microwave medical imaging and lesion detection. Diagnostics 2025; 15: 986.

11. Kokkotis, et al. Artificial intelligence and machine learning in the diagnosis and prognosis of diseases through breath analysis: a scoping review. Information 2025; 16: 968.

12. Surimova, et al. PSG and other candidate genes as potential biomarkers of therapy resistance in B-ALL: insights from chromosomal microarray analysis and machine learning. Int J Mol Sci 2025; 26: 7437.

13. Iftikhar, Hashem, Qureshi, et al. Clinical application of machine learning models for early-stage chronic kidney disease detection. Diagnostics 2025; 15: 2610.

14. Ghosh, Ura. Leveraging DNA-based computing to improve the performance of artificial neural networks in smart manufacturing. Mach Learn Knowl Extr 2025; 7: 96.

15. Zhao Z-Y, Huang C-L, Wang T-M, et al. EM-DeepSD: a deep neural network model based on cell-free DNA end-motif signal decomposition for cancer diagnosis. Diagnostics 2025; 15: 1156.

16. Toledo, et al. Development of an extreme machine learning-based computational application for the detection of Armillaria in cherry trees. Appl Sci 2025; 15: 11927.

17. Salaris, Ocagli, Casamento, et al. Foodborne event detection based on social media mining: a systematic review. Foods 2025; 14: 239.

18. Chen Y-J, J-J, Lin C-P, et al. Microarray analysis reveals sepsis is a syndrome with hyperactivity of TH17 immunity, with over-presentation of the Treg cell cytokine TGF-β. Curr Issues Mol Biol 2025; 47: 435.

19. Abroudi, et al. Analysis of microarray and single-cell RNA-Seq finds gene co-expression and tumor environment associated with extracellular matrix in epithelial-mesenchymal transition in prostate cancer. Int J Mol Sci 2025; 26: 8575.

20. Vrbaški, Vesin, Mangaroska. Machine learning for chronic kidney disease detection from planar and SPECT scintigraphy: a scoping review. Appl Sci 2025; 15: 6841.

21. Džermeikaitė, Krištolaitytė, Antanaitis. Application of machine learning models for the early detection of metritis in dairy cows based on physiological, behavioural and milk quality indicators. Animals 2025; 15: 1674.

22. Liu, Kuo, Lin C-H. Computational investigation of smooth muscle cell plasticity in atherosclerosis and vascular calcification: insights from differential gene expression analysis of microarray data. Bioengineering 2025; 12: 1223.

23. Lazcano-García, et al. Deep learning-based system for early symptoms recognition of grapevine red blotch and leafroll diseases and its implementation on edge computing devices. AgriEngineering 2025; 7: 63.

24. Tao, Sun, et al. Towards the prediction of responses to cancer immunotherapy: a multi-omics review. Life 2025; 15: 283.

25. et al. Research trends and hotspots in eDNA-based surveys of macroinvertebrates: a bibliometric analysis. Diversity 2025; 17: 402.

26. Sun, Kitani, et al. Constructing a pan-cancer prognostic model via machine learning based on immunogenic cell death genes and identifying NT5E as a biomarker in head and neck cancer. Curr Issues Mol Biol 2025; 47: 812.

27. Brlek, Bulić, Shah, et al. In silico validation of OncoOrigin: an integrative AI tool for primary cancer site prediction with graphical user interface to facilitate clinical application. Int J Mol Sci 2025; 26: 2568.

28. Arakelyan, et al. Assigning transcriptomic subtypes to chronic lymphocytic leukemia samples using nanopore RNA-sequencing and self-organizing maps. Cancers 2025; 17: 964.

29. Esperança-Martins, et al. Transcriptomic-based classification identifies prognostic subtypes and therapeutic strategies in soft tissue sarcomas. Cancers 2025; 17: 2861.

30. Christodoulou, et al. Data-driven and structure-based modelling for the discovery of human DNMT1 inhibitors: a pathway to structure-activity relationships. Appl Sci 2025; 15: 11984.

31. Cabello-Lima, Zapata-Morín, Espinoza-Rodríguez. Classifying protein-DNA/RNA interactions using interpolation-based encoding and highlighting physicochemical properties via machine learning. Information 2025; 16: 947.

32. Goud, Ramos, Shah, et al. Artificial intelligence driven framework for the design and development of next-generation avian viral vaccines. Microorganisms 2025; 13: 2361.

33. Popescu D-C, Găman M-A. Artificial intelligence for risk stratification in diffuse large B-cell lymphoma: a systematic review of classification models and predictive performances. Med Sci 2025; 13: 280.

34. Johnson, Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007; 8: 118–127.

35. Guyon, Weston, Barnhill, et al. Gene selection for cancer classification using support vector machines. Mach Learn 2002; 46: 389–422.

36. Breiman. Random forests. Mach Learn 2001; 45: 5–32.

37. Chen, Guestrin. XGBoost: a scalable tree boosting system. In: Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2016, pp.785–794.

38. LeCun, Bengio, Hinton. Deep learning. Nature 2015; 521: 436–444.

39. Ching, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15: 20170387.

40. Vaswani, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 5998–6008.

41. Theodoris, et al. Transfer learning enables predictions in network biology. Nature 2023; 618: 616–624.

42. Yang, et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell 2022; 4: 852–866.

43. Wang, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005; 365: 671–679.

44. Tibshirani. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 1996; 58: 267–288.

Machine learning-based DNA microarray analysis for disease detection using the MICRO-AI framework
