Sage Journals: Discover world-class research

Abstract

Background

Alzheimer's disease (AD) presents a significant and escalating public health concern, with early-stage neurodegeneration often going undetected using current diagnostic procedures. Medical imaging modalities, particularly structural magnetic resonance imaging (MRI) and functional positron emission tomography (PET), provide complementary insights into the anatomical and metabolic changes associated with AD. Despite their potential, the integration of these imaging techniques into a unified, explainable artificial intelligence (AI) framework remains limited.

Objectives

This study aims to develop and evaluate NeuroFusion-ADNet, a novel AI model that effectively combines structural and functional imaging data to improve diagnostic accuracy and clinical interpretability in AD detection.

Methods

NeuroFusion-ADNet is a dual-path deep learning model that jointly processes co-registered MRI and PET slices for simultaneous region-of-interest segmentation and diagnostic classification. The model features modality-specific encoders for structural and functional feature extraction, a bi-directional cross-attention fusion layer and a segmentation-informed classification module. The framework was trained and evaluated using the Alzheimer's Disease Neuroimaging Initiative dataset, comprising 381 subjects across normal control, mild cognitive impairment) and AD categories. Performance was benchmarked against standard architectures, including ResNet152, U-Net++, and multimodal convolutional neural networks (CNNs). Recently, combining CNNs and attention mechanisms has shown highly effective results in medical image analysis. Therefore, our model integrates explainability features, including attention heatmaps and Local Interpretable Model-Agnostic Explanations.

Results

NeuroFusion-ADNet achieved a classification accuracy of 99.48% and a Dice coefficient of 0.985, significantly outperforming existing baselines. Attention-based visualizations confirmed that the model consistently focuses on clinically relevant brain regions such as the hippocampus, entorhinal cortex and basal ganglia. Extensive ablation studies validated the contributions of each architectural component.

Conclusion

This work introduces a clinically promising multimodal AI framework that enhances diagnostic accuracy while maintaining transparency through explainable techniques. NeuroFusion-ADNet sets a foundation for the development of efficient, interpretable and deployable tools in the early diagnosis of AD.

Keywords

Alzheimer's disease mild cognitive impairment deep learning multimodal neuroimaging MRI–PET fusion computer-aided diagnosis

Introduction

Alzheimer's disease (AD) is the leading cause of dementia, affecting over 57 million individuals globally, with approximately 10 million new cases reported each year.¹ Characterized by progressive memory loss, executive dysfunction and behavioral changes, AD leads to substantial personal, social, and economic burdens. Given the lack of curative treatment, early and accurate diagnosis is crucial to managing the disease through lifestyle adjustments, clinical trials and symptomatic treatments. Neuroimaging has revolutionized AD diagnostics. Structural magnetic resonance imaging (MRI) detects cortical and subcortical atrophy, particularly in the medial temporal lobe, hippocampus and parietal cortex, regions strongly associated with AD pathology.² Meanwhile, fluorodeoxyglucose positron emission tomography (FDG–PET) reveals metabolic hypofunction in parietotemporal regions, precuneus and posterior cingulate cortex, providing a functional lens on neurodegeneration.³ Despite their complementary nature, most diagnostic tools utilize either MRI or PET independently, limiting their diagnostic potential. Recent progress in deep learning has improved automated image-based diagnosis in various neurological conditions, including AD.^4,5 Convolutional neural network (CNN)-based models have shown success in classifying AD from MRI and PET, yet challenges remain. These include poor generalizability, limited interpretability, and failure to jointly analyze imaging modalities or incorporate anatomical information explicitly. Moreover, most classification networks operate on global features, neglecting spatial disease patterns that are vital for clinical relevance.

To address these limitations, we propose NeuroFusion-ADNet, a dual-modality neural network designed to leverage the full potential of structural and functional neuroimaging. The model introduces a dual-encoder architecture with modality-specific pathways for MRI and PET, a cross-attention fusion mechanism and a segmentation-informed classifier. The segmentation head identifies anatomical regions relevant to AD, which are then used to guide classification, ensuring that predictions are anatomically grounded and interpretable. Recently, combining CNNs with attention mechanisms has shown highly effective results in medical image analysis.^6–9 Therefore, our model integrates explainability features, including attention heatmaps and Local Interpretable Model-Agnostic Explanations (LIME), to ensure that clinicians can validate model focus areas. Using the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset,¹⁰ we demonstrate that NeuroFusion-ADNet achieves superior performance in both segmentation and classification compared to traditional CNNs, U-Nets and hybrid architectures. This article contributes to the field of artificial intelligence (AI) in neurodegeneration in the following ways: (1) a novel multimodal architecture that jointly performs segmentation and classification for AD; (2) a cross-attention fusion method that adaptively integrates MRI and PET signals; (3) incorporation of region-of-interest (ROI)-guided classification for interpretability; and (4) extensive evaluation on a large dataset with cross-validation and visualization.

Related work

The application of AI and deep learning to AD diagnosis has grown considerably over the past decade. Traditional machine learning models, such as Support Vector Machines (SVMs), decision trees and random forests, were initially applied to hand-crafted features extracted from MRI or PET scans.^11,12 Although these methods offered interpretability and early insights into neuroimaging biomarkers, their performance was limited by reliance on predefined features and small datasets. The advent of deep CNNs enabled automatic feature extraction from raw imaging data. Suk et al.¹³ developed a stacked autoencoder model that extracted hierarchical features from MRI, achieving modest improvements in classification accuracy. Lu et al.¹⁴ extended this approach to PET, achieving similar gains. However, these unimodal models often failed to capture the full pathological complexity of AD, as structural and functional changes evolve differently across individuals. To address this, several studies have explored multimodal fusion. Liu et al.¹⁵ proposed a simple early-fusion CNN combining MRI and PET volumes, while Pan et al.¹⁶ used multistream three-dimensional (3D) CNNs for modality-specific learning followed by late fusion. Although performance improved, such techniques often rely on static fusion strategies, which cannot adapt to the varying importance of modalities across samples. Moreover, most ignore the spatial context of the disease. Graph convolutional networks (GCNs) and attention mechanisms have emerged as promising tools for multimodal learning. For example, Huang et al.¹⁷ used spatial attention maps to fuse structural and functional features. These works offer more flexibility but often sacrifice interpretability and overlook anatomical localization. Separately, segmentation-based networks have demonstrated success in lesion detection for tumors and ischemic stroke.¹⁸ However, few studies have applied joint segmentation and classification for neurodegenerative disorders. A rare example is the work of Jo et al.,¹⁹ which used hippocampal segmentation to guide classification. However, their model did not generalize to multiple modalities or perform full-brain analysis. In contrast, NeuroFusion-ADNet builds upon both segmentation and multimodal fusion literature. By using attention-guided fusion and ROI-informed classification, it simultaneously achieves performance, interpretability, and anatomical relevance. Unlike models that require hand-drawn masks or external segmentation, our approach learns these maps in a data-driven, end-to-end fashion, enabling scalable clinical deployment. Beyond static feature fusion, AD is characterized by complex structural and functional alterations that evolve dynamically across disease stages. Empirical analysis of our dataset revealed progressive volumetric reductions in hippocampal and entorhinal cortices, MRI-based and decreased glucose metabolism in posterior cingulate and temporal lobes, FDG–PET based, when comparing NC, MCI, and AD groups. These findings are consistent with prior neuroimaging studies of AD progression.^2,14,19 To capture these longitudinal-like changes within a cross-sectional framework, NeuroFusion-ADNet leverages multimodal representations that distinguish subtle shifts in regional distributions. This progression-aware representation contributes directly to improved diagnostic accuracy and aligns with the clinical understanding of AD as a dynamic neurodegenerative process.

Methodology

Overview of NeuroFusion-ADNet

NeuroFusion-ADNet is a dual-path architecture designed to process MRI and PET data simultaneously for joint segmentation and classification of AD, as shown in Figure 1. The model consists of: (1) two modality-specific encoders (PWResNet and CADDLFCNet), (2) a bi-directional attention fusion module, (3) a lightweight decoder for ROI segmentation, (4) a segmentation-informed classification head (ROI-GateNet), (5) auxiliary modules for attention visualization and model explainability. The pipeline is trained end-to-end using a composite loss function that optimizes both segmentation and classification objectives. The composite loss function combines the advantages of individual loss terms, balancing segmentation accuracy with classification robustness. Prior studies have shown the effectiveness of composite loss in medical image analysis.^20–22 NeuroFusion-ADNet integrates existing deep learning components with several novel contributions. Standard modules include a ResNet-based encoder backbone for feature extraction²³ and a U-Net-style decoder for segmentation,²⁴ which are widely adopted in medical image analysis. Our novel contributions are threefold: (i) a bi-attention fusion mechanism to dynamically integrate MRI and PET features across multiple representational levels; (ii) a dual-task architecture combining ROI segmentation and classification to enforce pathological consistency; and (iii) an interpretability module that integrates ROI-guided attention maps with LIME-based saliency explanations.

Figure 1.

NeuroFusion-ADNet architecture.

Modality-specific encoders

We adopt a ResNet-152 encoder²³ as the backbone feature extractor, which provides a robust starting point for multimodal analysis; this component is standard and not our original contribution.

PWResNet (MRI pathway)

MRI data captures fine structural detail and is encoded using PWResNet, a custom variant of ResNet enhanced by: (a) wavelet decomposition at the input stage to preserve frequency-aware features,²⁵ (b) Residual Dense (RD) blocks to retain contextual flow and reduce vanishing gradients, (c) pyramidal pooling to capture multiscale anatomical context without loss of resolution. Each layer outputs a progressively lower-resolution but higher-semantic representation of structural features. These outputs are later used for skip connections in the decoder. Input MRI slices of size 224 × 224 × 1 were passed into PWResNet. Feature maps were progressively downsampled to 112 × 112 × 64, 56 × 56 × 128, and 28 × 28 × 256 dimensions.

CADDLFCNet (PET pathway)

For PET images, which emphasize metabolic information, we employ CADDLFCNet, featuring: (a) depthwise separable convolutions to reduce computational load while retaining spatial sensitivity,²⁶ (b) dilated convolutions for increased receptive field, (c) statistical channel gating to adaptively suppress or enhance metabolic features. This branch specializes in extracting smooth, intensity-based patterns indicative of hypometabolism in regions such as the parietal cortex and temporal lobe. Input PET slices of size 224 × 224 × 1 were processed to yield progressively refined feature maps of dimensions 112 × 112 × 64, 56 × 56 × 128, and 28 × 28 × 256.

Bi-attention fusion module

The bi-attention fusion mechanism represents a novel contribution of this work, designed to dynamically weight complementary MRI and PET features across scales, thereby modeling disease-relevant structural–functional interactions. To combine the outputs of both encoders, we propose a bi-attention fusion module that applies cross-modal attention in two directions: (1) MRI attends to PET, captures regions where structure aligns with function; (2) PET attends to MRI, emphasizes metabolism guided by anatomical location. The module produces attention maps across both spatial and channel dimensions, which are then merged using learned softmax weights. This yields a fused representation preserving salient multimodal features. The fusion module receives two input feature maps (28 × 28 × 256 each) and produces a fused representation (28 × 28 × 512) with spatial and channel attention weights applied.

Segmentation decoder

While the classification head follows standard fully connected layers, our contribution is the integration of ROI-guided attention maps with LIME-based explanation overlays, enabling interpretable decision making grounded in known AD biomarkers. The decoder reconstructs a binary segmentation mask identifying ROIs typically affected in AD. Key components include: (a) bilinear upsampling at each stage to double the resolution, (b) concatenated skip connections from both encoders to preserve low-level details, (c) sigmoid activation at the final layer for pixel-wise binary classification. The output mask provides spatial guidance to the classifier and enhances model interpretability. The decoder upsamples the fused representation back to the original input resolution (224 × 224). Skip connections preserve feature detail at 112 × 112 and 56 × 56 scales.

ROI-GateNet classifier

This module extends the U-Net++ framework²⁷ for multimodal segmentation; our contribution lies in tailoring the decoder to enforce ROI-specific supervision aligned with downstream classification tasks. The final classification head integrates anatomical priors into diagnostic prediction. The process is as follows: (a) the segmentation mask is multiplied by the fused feature map, (b) an adaptive ROI pooling layer computes localized summary statistics, (c) a fully connected layer with softmax activation predicts class probabilities:

NC, MCI, and AD. This gating strategy ensures the model focuses on disease-relevant areas, improving both performance and trustworthiness. The segmentation mask (224 × 224) is multiplied with the fused feature map (28 × 28 × 512), followed by ROI pooling to a 1 × 1 × 512 vector. This vector is passed into a fully connected layer to produce a three-class SoftMax output NC, MCI, and AD.

Explainability integration

NeuroFusion-ADNet includes two explainability components: (1) attention maps from the fusion layer are upsampled to highlight salient regions; (2) LIME²⁸ is applied to final predictions, providing image-level attribution for clinicians. These modules enhance transparency and align predictions with known biomarkers.

Loss function

We define a multitask loss function as in equation (1):

L_{t o t a l} = α . L_{s e g} . + (1 - α) . L_{c l s}

(1)where L_seg is the Dice Loss + Binary Cross-Entropy for segmentation, L_cls the Categorical Cross-Entropy for classification, and α 0.6 a tunable hyperparameter. Table 1 summarizes the key components of the proposed model.

Table 1.

Summary of key components of the proposed model.

Component	Technique Used	Purpose
MRI Encoder	PWResNet, Wavelet, RD Blocks	Capture structural features
PET Encoder	Depthwise Conv, SCG, Dilation	Extract metabolic patterns
Fusion Module	Bi-Attn (spatial + channel)	Adaptive multimodal integration
Decoder	Skip connections + Upsampling	ROI segmentation
Classifier	ROI-Gated Pooling + Softmax	Spatially informed diagnosis

MRI:

magnetic resonance imaging; PET: positron emission tomography; RD: Residual Dense; ROI: region-of-interest.

Results

To evaluate the performance of NeuroFusion-ADNet, we conducted comprehensive experiments using the ADNI dataset.⁶ This dataset includes co-registered MRI and FDG–PET scans from 381 participants, stratified into 3 diagnostic groups: 126 cognitively normal controls, 160 individuals with mild cognitive impairment and 95 patients diagnosed with AD. All participants’ imaging data were preprocessed to ensure alignment with the MNI152 atlas, followed by intensity normalization and slice extraction from the axial plane. We selected slices from indices 45 to 60, which correspond to brain regions of high diagnostic relevance, including the hippocampus and basal ganglia. We adopted a stratified 10-fold cross-validation scheme to ensure robust performance assessment while minimizing bias due to random splits. Each fold comprised 70% training, 10% validation, and 20% testing. All training and inference procedures were carried out on a workstation equipped with an NVIDIA Tesla V100 GPU and 64 GB of RAM using PyTorch 2.0.

The classification performance of NeuroFusion-ADNet was evaluated in terms of accuracy, precision, recall, specificity, F1-score, and Matthew’ correlation coefficient (MCC). Across all folds, our model consistently achieved a mean classification accuracy of 99.48%, with a standard deviation below 0.4%. Precision and recall values reached 0.990 and 0.9818, respectively, indicating not only the model's ability to correctly identify AD cases but also its robustness in distinguishing them from NC and MCI cases. The F1-score, a harmonic mean of precision and recall, stood at 0.9847, underscoring the model's balanced performance. The MCC, often regarded as a more reliable metric in imbalanced datasets, yielded a value of 0.9835, further supporting the model's strong diagnostic reliability.

Comparisons with baseline models demonstrated the superiority of NeuroFusion-ADNet. When tested under the same conditions, a standard ResNet-152 achieved 94.8% accuracy and a 3D U-Net++ achieved 96.2%. These models, although strong performers in their own right, lacked the interpretability and spatial localization capabilities provided by our architecture. Other multimodal fusion models using simple concatenation or shallow attention mechanisms also underperformed, typically reaching accuracies between 95% and 97%. In contrast, the ROI-guided classification strategy in NeuroFusion-ADNet enabled more focused and informed decision making, particularly in MCI subjects whose imaging patterns often exhibit overlap with both NC and AD categories. Table 2 summarizes the classification performance comparison.

Table 2.

Classification performance comparison.

Model	Accuracy (%)	Precision	Recall (sensitivity)	Specificity	F1-score	MCC
ResNet-152	94.8	0.936	0.920	0.951	0.928	0.910
U-Net++	96.2	0.958	0.946	0.970	0.952	0.935
BiFusionNet	97.5	0.973	0.962	0.978	0.967	0.951
NeuroFusion-ADNet	99.48	0.990	0.9818	0.9889	0.9847	0.9835

MCC:

Matthew’s correlation coefficient.

In addition to classification, we evaluated the segmentation performance of the model. The Dice coefficient, a widely used metric for evaluating the overlap between predicted and ground truth masks, averaged 0.985 across the test folds. The intersection over union (IoU) metric yielded a mean value of 0.981 and pixel-wise accuracy reached 0.987. These results were consistently superior to those obtained from standalone segmentation models such as U-Net, U-Net++, and SegNet. Our decoder was able to precisely identify disease-relevant structures, such as the hippocampus, posterior cingulate, and entorhinal cortex, even in cases with mild atrophy or subtle hypometabolism. Table 3 shows a summary comparison of segmentation performance.

Table 3.

Segmentation performance comparison.

Model	Dice coefficient	Intersection over union (IoU)	Pixel accuracy
U-Net	0.924	0.879	0.941
U-Net++	0.936	0.891	0.953
SegNet	0.938	0.892	0.956
NeuroFusion-ADNet	0.985	0.981	0.987

To further assess the internal mechanisms and transparency of the model, we performed a visual inspection of attention heatmaps and LIME-based explanations. Attention maps derived from the bi-attention fusion module were projected back onto the input space using bilinear interpolation. In subjects diagnosed with AD, the attention module consistently highlighted the medial temporal lobe, posterior parietal cortex, and basal ganglia, regions well known for early AD pathology. The visual saliency exhibited a gradual shift when comparing the NC, MCI, and AD categories. In MCI subjects, attention was concentrated on the entorhinal cortex and hippocampal subfields, while in AD cases, activation spread into the parietal and occipital lobes. This graded change in attention provides evidence that the model is sensitive to subtle disease progression.

Complementary to attention maps, we applied LIME, local interpretable model-agnostic explanations, to generate sample-level visualizations. LIME perturbs regions of the input image and observes the change in output probability to infer feature importance. In our case, LIME explanations aligned with attention maps and frequently emphasized cortical atrophy and metabolic decline in diagnostic regions, as shown in Figure 2. In control subjects, both attention and LIME maps focused on preserved structural integrity and high PET activity in healthy brain regions. These findings suggest that NeuroFusion-ADNet is not only accurate but also biologically plausible in its decision-making processes.

Figure 2.

LIME visualization results. LIME: Local Interpretable Model-Agnostic Explanations.

To further clarify the roles of the bi-attention and ROI segmentation modules in the interpretability pipeline, we compared saliency maps generated under three conditions: (i) without bi-attention, (ii) without ROI segmentation, and (iii) with both modules active. When bi-attention was removed, the resulting maps were diffuse and often highlighted nonspecific cortical areas, reducing their clinical plausibility. Similarly, without ROI segmentation, attention frequently extended to irrelevant cortical regions, diluting the diagnostic focus. In contrast, with both modules included, NeuroFusion-ADNet consistently concentrated on hippocampal and medial temporal regions, which are well-established biomarkers of early AD pathology. This confirms that the synergy of bi-attention and ROI segmentation enhances both predictive performance and clinical interpretability by ensuring model focus on pathologically relevant structures. As shown in Figure 3, the model correctly focuses on AD-related neuroanatomical regions, highlighting hypometabolic areas in PET scans and structural atrophy in MRIs. This validates the cross-modality attention mechanism as an effective, region-specific learning tool.

Figure 3.

Attention map of MRI and PET datasets and segmented image. MRI: magnetic resonance imaging; PET: positron emission tomography.

To further evaluate the strengths of NeuroFusion-ADNet, we analyzed cases where baseline models failed. In particular, ResNet-152 and U-Net++ frequently misclassified subjects with mild cognitive impairment (MCI), especially those exhibiting subtle hippocampal and entorhinal cortex abnormalities. BiFusionNet demonstrated better performance but still overlooked early-stage atrophy in several instances. In contrast, NeuroFusion-ADNet consistently localized and classified these borderline cases correctly. The added spatial guidance from the segmentation branch and adaptive weighting of MRI–PET signals enabled the model to capture early AD-specific patterns that other architectures missed. Table 4 summarizes misclassification rates across diagnostic categories.

Table 4.

Misclassification counts across models.

Model	NC misclassified	MCI misclassified	AD misclassified	Total misclassifications
ResNet-152	6 (4.8%)	12 (7.5%)	2 (2.1%)	20
U-Net++	4 (3.2%)	8 (5.0%)	3 (3.1%)	15
BiFusionNet	3 (2.4%)	6 (3.8%)	2 (2.1%)	11
NeuroFusion-ADNet	1 (0.8%)	1 (0.6%)	0 (0.0%)	2

AD:

normal control; MCI: mild cognitive impairment; NC: normal control;

To rigorously examine the contribution of each architectural component, we conducted a series of ablation studies. Removing the segmentation branch and using only a traditional global average pooling for classification resulted in a marked decline in performance, with accuracy dropping to 96.5% and F1-score to 0.958. Replacing the bi-attention fusion module with simple concatenation of encoder outputs reduced dice from 0.985 to 0.946 and accuracy to 95.1%, indicating that attention plays a central role in integrating multimodal signals effectively. Using a shared encoder for both MRI and PET also degraded performance significantly, emphasizing the importance of modality-specific learning. Table 5 presents the ablation study of the proposed model.

Table 5.

Ablation study of NeuroFusion-ADNet components.

Model variant	Accuracy (%)	Dice coefficient
Full NeuroFusion-ADNet	99.48	0.985
Without ROI Segmentation	96.5	–
Without Bi-Attention (Concat Fusion)	95.1	0.946
Shared Encoder (no modality separation)	93.4	0.930

ROI: region-of-interest.

In terms of inference speed, NeuroFusion-ADNet achieved a processing time of 0.18 seconds per subject during testing, making it suitable for real-time clinical decision support applications. The total parameter count of the model is approximately 32 million, which is comparable to standard deep networks like ResNet and MobileNetV3 but offers greater interpretability and multimodal flexibility. The results strongly support the effectiveness of NeuroFusion-ADNet as a diagnostic model that achieves state-of-the-art accuracy, reliable ROI segmentation and intuitive visual explanations across multiple subject categories and data folds. Its robustness, transparency, and clinical alignment make it a strong candidate for deployment in real-world AD screening settings.

Discussion

The findings of this study provide strong evidence that the proposed NeuroFusion-ADNet framework significantly advances the state of the art in automated AD diagnosis using multimodal neuroimaging. The model's ability to integrate structural and functional brain data in a spatially aware and interpretable fashion addresses several longstanding limitations in the field. Particularly, NeuroFusion-ADNet demonstrates that a dual-path architecture, when combined with segmentation-informed classification and cross-modal attention, can yield both high diagnostic accuracy and clinically meaningful insight into disease localization.

The exceptional classification accuracy of 99.48% achieved on the ADNI dataset underscores the diagnostic power of integrating MRI and PET imaging, particularly when fusion is handled with a context-sensitive mechanism like bi-directional attention. Traditional deep learning models often rely on a single modality, which may limit their sensitivity to either structural or metabolic abnormalities. In contrast, our model capitalizes on the complementary nature of the two modalities, MRI providing high-resolution anatomical information and FDG–PET capturing metabolic dysfunction, to enhance the robustness of diagnosis. The bi-attention fusion module dynamically adjusts the weighting of modality-specific features, allowing the network to adapt to individual subject characteristics. This flexibility is particularly beneficial in clinical settings where imaging quality, protocol variability and disease manifestation can differ significantly from case to case. The error analysis confirmed that NeuroFusion-ADNet's improvements were not merely due to increased model complexity, but rather to its ability to localize disease-relevant brain regions and adaptively weight MRI and PET features. Unlike baseline models, which often overlook hippocampal or entorhinal abnormalities, NeuroFusion-ADNet consistently detected these early biomarkers of AD. This strongly supports that the observed gains arise from intentional architectural design rather than incidental complexity.

An equally important contribution of this work lies in its explicit modeling of spatial disease patterns. By incorporating a segmentation branch, NeuroFusion-ADNet learns to localize brain regions that are most affected by AD. This approach not only improves the classifier's performance but also enhances interpretability. Rather than relying on global features, which can dilute disease-specific signals, the ROI-GateNet classifier focuses on the output of the segmentation map. This strategy aligns well with clinical reasoning, where radiologists and neurologists base diagnostic decisions on the condition of specific anatomical regions, such as the hippocampus, entorhinal cortex, and posterior cingulate. Furthermore, attention and LIME-based visualizations confirm that the network's focus aligns with known neuropathological biomarkers of AD. The model's visual saliency gradually shifts across the NC, MCI, and AD groups, providing further validation that the architecture is sensitive to early-stage degeneration and disease progression.

Beyond its predictive power, the interpretability of NeuroFusion-ADNet is perhaps one of its most significant advantages. In recent years, the demand for explainable AI in healthcare has grown dramatically, particularly in high-stakes domains like neurodegenerative disease diagnosis. While existing deep learning models often perform well on standard metrics, they fail to offer clinicians insight into how decisions are made. NeuroFusion-ADNet addresses this issue through multiple mechanisms. The use of segmentation masks ensures that classification is driven by anatomically grounded evidence, while the bi-attention maps provide a visual overlay of modality contributions. LIME explanations complement this by highlighting input regions that contribute most to the prediction. These forms of model transparency not only facilitate trust in automated systems but also provide a foundation for physician–AI collaboration in diagnostic workflows.

Despite these strengths, there are limitations to the present study. The model was trained and evaluated using two-dimensional (2D) axial slices extracted from the ADNI dataset. Although this approach simplifies training and reduces computational load, it also omits interslice continuity, which may carry important contextual information. Future work should explore extending the model to handle full 3D volumes, potentially using volumetric convolutions or hybrid 2.5D strategies to balance accuracy with efficiency. Another limitation is the use of a single dataset. While ADNI is a gold standard in AD research, it lacks demographic diversity and is collected under controlled conditions. Generalizing NeuroFusion-ADNet to real-world clinical settings will require external validation on datasets such as AIBL, OASIS, or local hospital cohorts with varying acquisition protocols. Additionally, while the current model processes imaging data alone, incorporating other biomarkers, such as cerebrospinal fluid measures, genetic risk factors or cognitive scores, could enhance diagnostic confidence, especially in borderline cases. Scalability and deployment in clinical settings also remain areas for future development. Although the model inference time is relatively fast, ∼0.18 seconds per subject, and the total parameter size is modest by current deep learning standards, real-time deployment in radiology or memory clinics may benefit from model compression techniques such as quantization or pruning. Federated learning could further facilitate training on distributed clinical datasets while preserving patient privacy.

NeuroFusion-ADNet demonstrates that it is possible to develop an AI model that is not only highly accurate but also interpretable and aligned with clinical reasoning. Its ability to fuse multimodal information, localize disease-affected regions and justify predictions through visual and structural evidence makes it a promising candidate for real-world deployment. By addressing core challenges of generalizability, interpretability, and performance, this work contributes to a new generation of trustworthy AI tools for neurodegenerative disease diagnosis.

Conclusion

The present study introduces NeuroFusion-ADNet, a novel deep learning framework designed to improve the diagnosis of AD by leveraging the complementary strengths of structural MRI and functional PET imaging. Unlike conventional models that treat each modality independently or employ simplistic fusion strategies, NeuroFusion-ADNet employs a dual-path architecture that integrates these imaging modalities through modality-specific encoders and a bi-directional attention fusion mechanism. This approach allows the model to adaptively extract and combine structural and metabolic features, leading to more robust and nuanced representations of brain pathology.

One of the key distinguishing features of this framework is its segmentation-informed classification strategy. By incorporating a dedicated segmentation decoder, the model identifies regions of interest that are likely to exhibit AD-related changes. These spatially localized features are then used to guide the classification process, enabling the model to focus on disease-relevant brain structures while minimizing noise from unaffected regions. This integration of spatial awareness into the classification process represents a significant advancement over traditional deep learning approaches, which often rely on global feature aggregation and lack anatomical specificity. In addition to its architectural innovations, NeuroFusion-ADNet emphasizes interpretability, a critical requirement for clinical applications of AI. The use of attention maps, which highlight modality-specific contributions to decision making and LIME-based explanations, which identify salient regions in individual input images, ensures that the model's predictions are transparent and explainable. This is particularly important in the context of neurodegenerative disease, where accurate localization and characterization of pathology are essential for both diagnosis and patient management. The model's visual outputs consistently aligned with established neuropathological markers of AD, including the hippocampus, parietal cortex and posterior cingulate, further enhancing its clinical plausibility.

The performance of the model exceeded that of several state-of-the-art benchmarks in both classification and segmentation tasks. With a classification accuracy of 99.48% and a Dice score of 0.985, NeuroFusion-ADNet demonstrated not only predictive accuracy but also spatial precision and reliability across diagnostic categories. These results were further supported by robust cross-validation and extensive ablation studies that confirmed the importance of each architectural component. The current implementation of NeuroFusion-ADNet is limited to 2D slice processing and single-cohort validation. Future work will extend the model to 3D volumetric data and assess generalizability across diverse datasets and clinical environments. Additionally, integrating nonimaging data such as genetic markers, cognitive assessments, and cerebrospinal fluid biomarkers may further enhance its diagnostic utility. NeuroFusion-ADNet represents a significant step forward in the development of interpretable, accurate, and clinically meaningful AI tools for AD diagnosis. Its dual-modality integration, attention-guided fusion and ROI-based classification combine the best of deep learning and domain knowledge, offering a powerful platform for advancing precision medicine in neurodegenerative disorders.

Footnotes

ORCID iD

Abdullah Alsaleh

Contributorship

AA contributed to the conceptualization, writing—original draft, and writing—review & editing.

Funding

The author thanks the Deanship of Postgraduate Studies and Scientific Research at Majmaah University for funding this research (project number ER-2025-2048).

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The data that support the findings of this study are available from Alzheimer's Disease Neuroimaging Initiative (ADNI) at .

References

World Health Organization. Dementia fact sheets. 2025. Available from: https://www.who.int/news-room/fact-sheets/detail/dementia (accessed May 2025).

Jack Jr

Bennett

Blennow

, et al. NIA–AA research framework: toward a biological definition of Alzheimer’s disease. Alzheimers Dement 2018; 14: 535–562.

Mosconi

. Glucose metabolism in normal aging and Alzheimer’s disease: methodological and physiological considerations for PET studies. Clin Transl Imaging 2013; 1: 217–233.

Liu

Cheng

Wang

, et al. Multi-modality cascaded convolutional neural networks for Alzheimer’s disease diagnosis. Neuroinform 2018; 16: 295–308.

Wen

Thibeau-Sutre

Diaz-Melo

, et al. Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. Med Image Anal 2020; 63: 101694.

Goceri

. Vision transformer based classification of gliomas from histopathological images. Expert Syst Appl 2024; 241: 122672.

Goceri

. An efficient network with CNN and transformer blocks for glioma grading and brain tumor classification from MRIs. Expert Syst Appl 2025; 268: 126290.

Goceri

. A convolution and transformer-based method with effective stain normalization for breast cancer detection from whole slide images. Biomed Signal Process Control 2025; 110: 108138.

Sebai

Goceri

. BRMSA-Net: disclosing concealed colorectal polyps in colonoscopy images via boundary recalibration and multi-scale aggregation network. Biomed Signal Process Control 2025; 110: 108083.

10.

Alzheimer’s Disease Neuroimaging Initiative (ADNI). Available from: https://adni.loni.usc.edu/ (accessed June 2025).

11.

Zhang

Wang

Zhou

, et al. Multimodal classification of Alzheimer’s disease and mild cognitive impairment. Neuroimage 2011; 55: 856–867.

12.

Klöppel

Stonnington

Chu

, et al. Automatic classification of MR scans in Alzheimer’s disease. Brain 2008; 131: 681–689.

13.

Suk

Lee

Shen

. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. Neuroimage 2014; 101: 569–582.

14.

Popuri

Ding

, et al. Alzheimer’s disease neuroimaging initiative. Multimodal and multiscale deep neural networks for the early diagnosis of Alzheimer’s disease using structural MR and FDG-PET images. Sci Rep 2018; 8: 5697.

15.

Liu

Cheng

Yan

. Classification of Alzheimer’s disease by combination of convolutional and recurrent neural networks using FDG-PET images. Front Neuroinform 2018; 12: 35.

16.

Pan

Huang

Zeng

, et al. Early diagnosis of Alzheimer’s disease based on deep learning and GWAS. In: Human brain and artificial intelligence. Singapore: Springer, 2019, pp.1072. doi:10.1007/978-981-15-1398-5_4

17.

Huang

Zhou

, et al. Diagnosis of Alzheimer’s disease via multi-modality 3D convolutional neural network. Front Neurosci 2019; 13: 509.

18.

Isensee

Jaeger

Kohl

, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021; 18: 203–211.

19.

Nho

Saykin

. Deep learning in Alzheimer’s disease: diagnostic classification and prognostic prediction using neuroimaging data. Front Aging Neurosci 2019; 11: 220.

20.

Goceri

. Nuclei segmentation using attention aware and adversarial networks. Neurocomputing 2024; 579: 127445.

21.

Goceri

. GAN based augmentation using a hybrid loss function for dermoscopy images. Artif Intell Rev 2024; 57: 234.

22.

Goceri

. Polyp segmentation using a hybrid vision transformer and a hybrid loss function. J Digit Imaging 2024; 37: 851–863.

23.

Zhang

Ren

, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp.770–778.

24.

Ronneberger

Fischer

Brox

. U-Net: convolutional networks for biomedical image segmentation. Lect Notes Comput Sci 2015; 9351: 234–241.

25.

Mallat

. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 1989; 11: 674–693.

26.

Howard

Zhu

Chen

, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv [Preprint]. 2017. Available from: https://arxiv.org/abs/1704.04861.

27.

Zhou

Siddiquee

MMR

Tajbakhsh

, et al. UNet++: a nested U-Net architecture for medical image segmentation. Lect Notes Comput Sci 2018; 11045: 3–11.

28.

Ribeiro

Singh

Guestrin

. "Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp.1135–1144.

An interpretable multimodal deep learning framework for Alzheimer's disease diagnosis

Abstract

Background

Objectives

Methods

Results

Conclusion

Keywords

Introduction

Related work

Methodology

Overview of NeuroFusion-ADNet

Modality-specific encoders

PWResNet (MRI pathway)

CADDLFCNet (PET pathway)

Bi-attention fusion module

Segmentation decoder

ROI-GateNet classifier

Explainability integration

Loss function

Results

Discussion

Conclusion

Footnotes

ORCID iD

Contributorship

Funding

Declaration of conflicting interests

Data availability statement

References