
Abstract

BACKGROUND:

Alzheimer’s disease (AD) endangers the physical and mental health of the elderly and constitutes one of the most crucial social challenges. Due to the lack of effective AD intervention drugs, it is very important to diagnose AD at an early stage, especially in the Mild Cognitive Impairment (MCI) phase.

OBJECTIVE:

At present, an automatic classification technology is urgently needed to assist doctors in analyzing the status of candidate patients. Artificial intelligence enhanced detection can reduce the cost of detecting Alzheimer’s disease.

METHODS:

In this paper, a novel pre-trained ensemble-based AD detection (PEADD) framework with three base learners (i.e., ResNet, VGG, and EfficientNet) is proposed for both audio-based and PET (Positron Emission Tomography)-based AD detection under a unified image modality. Specifically, the effectiveness of context-enriched image modalities instead of the traditional speech modality (i.e., context-free audio matrix) for audio-based AD detection, along with a simple and efficient image denoising strategy, is inspected comprehensively. Meanwhile, PET-based AD detection based on the denoised PET image is described. Furthermore, different voting methods for applying an ensemble strategy (i.e., hard voting and soft voting) are investigated in detail.

RESULTS:

The results showed that the classification accuracy was 92% and 99% on the audio-based and PET-based AD datasets, respectively. Our extensive experimental results demonstrate that our PEADD outperforms the state-of-the-art methods on both audio-based and PET-based AD datasets simultaneously.

CONCLUSIONS:

The network model can provide an objective basis for doctors to detect Alzheimer’s Disease.

Keywords

Alzheimer’s disease, early detection, pre-training, ensembling, modality

1. Introduction

Alzheimer’s disease (AD) is a neurodegenerative disease that seriously endangers the physical and mental health of the elderly. According to the World Alzheimer Report 2022,¹ which draws on the most recent World Health Organization (WHO) figures, the number of people living with dementia was estimated at 55 million in 2019 and is expected to rise to 139 million in 2050. AD has thus become an increasingly serious public health and social problem, placing a great burden on families, society and the government.

¹ https://www.alzint.org/u/World-Alzheimer-Report-2022.pdf.

Studies have shown that the stage preceding the dementia stage of AD is Mild Cognitive Impairment (MCI). At this stage, patients have objective cognitive impairment, but their ability of daily living has not been significantly affected. It is therefore very important to conduct automatic early detection of AD, especially in the MCI phase. At present, artificial intelligence is widely used in heart rate estimation [1], COVID-19 infection detection [2], malignant lymphoma detection [3], and medical device integration [4]. According to the literature [5], people’s language comprehension and cognitive processes originate from a semantic network formed in the mind, which consists of nodes and relations and contains both transient and long-term memory encoding processes. This process lays the foundation of the neural network cognitive model in artificial intelligence. Relevant studies have shown that artificial intelligence can be used to model different behaviors of AD patients (e.g., logical reasoning and spatial navigation ability [6]), so as to predict the likelihood that a person suffers from AD.

Existing surveys show that automatic AD detection based on artificial intelligence is feasible. For example, Garcia et al. [7] compiled statistics on the representative methods of AD intervention using artificial intelligence, speech and language methods from 2000 to 2019, and gave an in-depth review covering research details, data analysis, methodology, and clinical application. Besides, Filiou et al. [8] conducted an in-depth comparative study of the six types of features most affected in AD discourse in the context of picture description, including the multidimensional change patterns in production, grammar, vocabulary, fluency, semantics and discourse. Furthermore, Billeci et al. [9] listed the representative machine learning methods for AD detection and emphasized the importance of multi-modal AD detection combined with image analysis. In addition, Pulido et al. [10] analyzed the monitoring of AD patients based on spontaneous speech and speech analysis technologies. More specifically, some audio-based features (e.g., percentage of silence duration, number of speech segments, log-Mel) and linguistic features (e.g., n-gram, syntactic) can be employed to perform AD detection.

Currently, image-based AD detection is relatively easy compared with audio-based AD detection, since the image modality is easy to obtain. In contrast, audio-based AD datasets are scarce because speech collection is a time-consuming task. The motivation of this paper is therefore to investigate whether the audio modality can be transferred to a context-enriched Spectrogram/Mel-spectrogram/MFCC image modality, as shown in Fig. 1, and used as the input of current pre-trained image classification frameworks instead of the traditional context-free audio matrix. In this way, audio-based and image-based AD detection with the pre-trained image classification framework are conducted under a unified image modality.

Figure 1.

Transferring audio modality to image modality.

Different from the existing AD detection models, a novel pre-trained ensemble-based AD detection (PEADD) framework with three base learners (i.e., ResNet, VGG, and EfficientNet) is developed for both audio-based and PET-based AD detection under a unified image modality. The context-enriched image modality, instead of a context-free audio matrix, is taken as the input to the pre-trained image classification framework for audio-based AD detection. Furthermore, different voting methods (i.e., hard voting and soft voting) for applying an ensemble strategy are verified in detail. Experimental results on the two benchmark AD datasets demonstrate that our PEADD significantly outperforms the state-of-the-art methods.

The major contributions of this paper are twofold:

(1) The proposed model systematically studies the early detection of AD from a unified image modality for both audio-based and PET-based AD detection. The effectiveness of the context-enriched image modality instead of the traditional context-free audio matrix for audio-based AD detection is inspected in detail.

(2) A simple and efficient image denoising strategy for both the initial audio-based and PET-based images is proposed. In addition, different voting methods (i.e., hard voting and soft voting) for applying an ensemble strategy are verified, and the results demonstrate that our model significantly outperforms the state-of-the-art approaches on the two benchmark audio-based and PET-based AD datasets.

The rest of this paper is organized as follows. We review related work in Section 2. In Section 3, the model PEADD is presented in detail. In Section 4, experiments to investigate the performance of our proposed PEADD are conducted, including experimental setting and result analysis. The limitations and deployment are discussed in Section 5. Finally, we conclude the paper and discuss future research in Section 6.

2. Related works

Approaches for automatic AD detection can be categorized into three subgroups (i.e., feature-based methods, deep learning-based approaches, and hybrid models). The following subsections explain representative models of these three subgroups in detail.

2.1 Feature-based methods

Abdalla et al. [11] explored the function of specific rhetorical structures in the discourse expression of AD patients using ANOVA (Analysis of Variance). Besides, Qiao et al. [12] studied the effectiveness of seven key audio-based features (i.e., percentage of silence duration, average duration of phrase segments, average duration of silence segments, number of speech segments, the number of long pauses, the ratio of hesitation/speech counts, and the ratio of short pauses/speech counts) in AD detection. In addition, Ahangar et al. [13] carried out SPSS analysis and a $t$-test on two groups (i.e., healthy people and AD patients) on an audio-based AD dataset, and showed that there were significant differences in morphological and syntactic patterns between the two groups. Similarly, Tóth et al. [14] investigated the function of four types of acoustic features (i.e., speech ratio, speech tempo, length and number of silent and filled pauses, and length of utterance), and applied statistical analysis and machine learning algorithms (i.e., naive Bayes, random forest and SVM) to conduct AD detection. Luz et al. [15] studied the predictive value of purely acoustic features automatically extracted from spontaneous speech for AD detection, and employed decision tree, KNN, LDA, random forest and SVM classifiers to detect AD. Furthermore, Martinc et al. [16] proposed a multi-modal AD detection model which adopted MFCC and TF-IDF features to conduct AD detection with random forest, SVM and logistic regression classifiers. Moreover, Luz et al. [17] adopted some novel audio-based features (i.e., mean, variance, minimum and maximum, entropy, and speech speed) to conduct AD detection using a Bayes classifier. Finally, Yancheva and Rudzicz [18] integrated a vector space topic model and semantic features to perform AD detection with a random forest classifier, obtaining an F1 score of 0.80. Other similar work can be found in the literature [19, 20].

2.2 Deep learning-based approaches

Lopez-de-Ipina et al. [21] proposed a nonlinear multi-task model based on automatic speech analysis, extracted some novel features (i.e., Castiglioni fractal dimension and multi-scale permutation entropy), and employed a multi-layer perceptron and a convolutional neural network for AD detection. In addition, Martinc et al. [22] proposed a multi-modal AD detection model based on BERT, which verified the temporal interaction between the acoustic features (i.e., duration, embedding velocity, centroid velocity) and the text features (i.e., bag-of-n-grams). Besides, Zhu et al. [23] proposed a transfer learning-based AD detection model with YAMNet, Mockingjay and BERT to alleviate the problem of limited AD voice data. Mahajan and Baths [24] proposed an end-to-end AD detection model based on CNN-LSTM, which combined text features and audio features. Furthermore, Sarawgi et al. [25] proposed a multi-modal AD detection model based on a multi-layer perceptron, which explored various acoustic, cognitive and linguistic features in depth and achieved an F1 score of 0.83 on the ADReSS dataset. Similarly, Yuan et al. [26] proposed an ERNIE-based AD detection model, and showed that AD patients use UM much less frequently than UH. Moreover, Fritsch et al. [27] and Palo and Parde [28] proposed LSTM-based and CNN-based AD detection models, respectively. Also, Karlekar et al. [29] proposed three neural models based on CNN, LSTM-RNN and their combination for AD detection, and further analyzed how these neural models discriminate the speech characteristics of AD patients through activation clustering and first-derivative saliency techniques. Other similar work can be found in the literature [30, 31].

2.3 Hybrid models

Two AD detection models were proposed in Balagopalan et al. [32]: one is a domain knowledge-driven model which can extract a large number of clinically relevant linguistic and acoustic features, and the other is a BERT-driven transfer learning model. Lindsay et al. [33] investigated the function of different features (i.e., task-related features, syntactic features, semantic features and prosodic features) in AD detection, and performed the non-parametric Kruskal-Wallis H-test and correlation analysis. They also employed logistic regression, SVM and MLP classifiers to conduct AD detection. Furthermore, Haulcy and Glass [34] employed various acoustic features (i.e., i-vectors and x-vectors) and text features (i.e., word vectors, linguistic inquiry and word count), and conducted AD detection with LDA, decision tree, KNN, SVM, random forest, LSTM and CNN classifiers. In addition, the literature [35] compared the performance of two AD detection methods (i.e., domain-based and BERT-based methods) in detail on the ADReSS dataset.

Different from the above representative AD detection models, a novel pre-trained ensemble-based AD detection (PEADD) framework with three base learners (i.e., ResNet, VGG, and EfficientNet) is proposed for both audio-based and PET-based AD detection under a unified image modality. This work conducts AD detection from different context-enriched image modalities instead of the traditional context-free audio matrix, along with simple and effective image denoising strategies.

3. Proposed model

Figure 2 shows the framework of our AD detection model PEADD. The whole model consists of four components: the feature extractor, the image denoising component, the feature learner, and the ensemble learner. If the source inputs are in audio format, the context-enriched image modality of the audio is generated by the feature extractor; otherwise, the PET-based inputs are fed directly into the image denoising component. The output of the feature learner is then fed into the ensemble learner to predict the final label (i.e., AD, MCI, HC). In the following paragraphs, the major components of PEADD, i.e., feature extractor, image denoising, feature learner, and ensemble learner, are explained in detail.

Figure 2.

The framework of our PEADD.

Feature extractor

The feature extractor component is designed for audio-based AD detection. Specifically, three kinds of image modality of audio-based features, i.e., Spectrogram, Mel-spectrogram, and Mel-Frequency Cepstral Coefficients (MFCC), are adopted.

Spectrogram

The spectrogram is obtained from the original audio through pre-emphasis, framing, windowing and the short-time Fourier transform, as shown in Eq. (1), where $K$ is the number of frequency points of the discrete Fourier transform, $k(0\leqslant k\leqslant K)$ is the frequency index, $l$ is the frame index, $x_{l}[n]$ is the $l$-th windowed frame of length $N$, $w[n]$ is the window function, and $L$ is the frame shift. The spectrogram shows how the energy is distributed over the frequencies. Generally speaking, the energy distributions of healthy people and AD patients are expected to differ.

$\displaystyle X[k,l]=\sum_{n=0}^{N-1}x_{l}[n]e^{-\frac{j2\pi nk}{K}}=\sum_{n=0}^{N-1}w[n]x[n+lL]e^{-\frac{j2\pi nk}{K}}$ (1)
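Equation (1) can be illustrated with a minimal NumPy sketch. The frame length, frame shift, FFT size and the Hann window below are illustrative assumptions (the paper does not state them), and pre-emphasis is omitted for brevity.

```python
import numpy as np

def stft_magnitude(x, frame_len=400, hop=160, n_fft=512):
    """Magnitude of the short-time Fourier transform in Eq. (1).

    frame_len (N), hop (L) and n_fft (K) are illustrative values,
    not settings reported in the paper.
    """
    window = np.hanning(frame_len)                 # w[n], assumed Hann window
    n_frames = 1 + (len(x) - frame_len) // hop     # complete frames only
    frames = np.stack([x[l * hop: l * hop + frame_len] * window
                       for l in range(n_frames)])  # x_l[n], shape (n_frames, N)
    # rfft zero-pads each frame to n_fft points and keeps k = 0..K/2
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T

# Usage: a 1 kHz tone sampled at 16 kHz; the bin spacing is 16000 / 512 = 31.25 Hz
sr = 16000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(spec.mean(axis=1).argmax())         # bin with the most energy
```

Each column of `spec` is one frame and each row one frequency bin, which is the layout that is later rendered as an image.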

Mel-spectrogram

The Mel-spectrogram, also known as FBank, is given by Eq. (2). The spectrogram is passed through a bank of Mel filters. Because adjacent filters overlap, the features of the Mel-spectrogram are highly correlated. Specifically, the extracted spectrum is squared to obtain the energy spectrum, the energy within each filter band is summed to give the output power $x[k]$ of the $k$-th filter, and the logarithm of each filter output is taken to obtain the logarithmic power spectrum of the corresponding frequency band. In other words, to process sound, the frequency should be expressed on the Mel scale and the amplitude on a logarithmic (decibel) scale. This is the main function of the Mel-spectrogram.

$\displaystyle Y_{\textit{FBank}}[k]=\log x[k]$ (2)

MFCC

Mel-Frequency Cepstral Coefficients (MFCC) are a popular speech feature, given by Eq. (3), where $M$ is the number of triangular filters. A total of $L$ MFCC coefficients are obtained by applying an inverse discrete cosine transform to the Mel spectrum. MFCC is widely used in automatic speech recognition and speaker recognition.

$\displaystyle C_{n}=\sum_{k=1}^{M}\log X[k]\cos\left(\frac{\pi(k-0.5)n}{M}\right)$ (3)
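Equations (2) and (3) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the Mel-scale formula, the number of filters and coefficients, and the small constant added before the logarithm are common choices rather than values from the paper, and $X[k]$ in Eq. (3) is read as the Mel filter output $x[k]$ of Eq. (2).

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)      # assumed Mel-scale formula

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Overlapping triangular filters, shape (n_filters, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):               # rising edge
            fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):              # falling edge
            fb[m - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_and_mfcc(power_spec, fb, n_mfcc=13):
    """power_spec has shape (n_bins, n_frames); returns (log-Mel, MFCC)."""
    x = fb @ power_spec                              # x[k]: energy per Mel filter
    log_mel = np.log(x + 1e-10)                      # Eq. (2), offset avoids log(0)
    M = fb.shape[0]
    k = np.arange(1, M + 1)                          # filter index k = 1..M
    n = np.arange(1, n_mfcc + 1)                     # coefficient index (assumed 1..L)
    basis = np.cos(np.pi * np.outer(n, k - 0.5) / M) # cos(pi (k - 0.5) n / M)
    return log_mel, basis @ log_mel                  # Eq. (3)

# Usage on a random power spectrogram (257 bins, 10 frames)
rng = np.random.default_rng(0)
power = rng.random((257, 10)) + 0.1
fb = mel_filterbank(26, 512, 16000)
log_mel, mfcc = log_mel_and_mfcc(power, fb)
```

A flat log-Mel spectrum gives MFCCs of zero under this cosine basis, which is one way to see that the coefficients encode the shape of the spectral envelope rather than its overall level.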

Generally, the above three audio-based features (i.e., Spectrogram, Mel-spectrogram, and MFCC) are used as context-free matrices of the audio modality, and a context-free matrix inevitably loses some key information. Therefore, this paper investigates the context-enriched image modality of these audio-based features. Compared with the context-free audio matrix format, the context-enriched image modality has a richer representation. The amplitude in the Mel-spectrogram reflects the energy of the speaker's voice, and the energy of AD, MCI and HC speakers is generally expected to differ. Therefore, the context-enriched image modality instead of the context-free audio matrix is taken as the input of the base learners. In addition, image denoising is conducted, as shown in Figs 3 and 4.

Figure 3.

Image denoising for the Mel-spectrogram.

Figure 4.

Image denoising for the PET.

Image denoising

To be more specific, Fig. 3a is the initial Mel-spectrogram modality, and Fig. 3b is the denoised Mel-spectrogram modality, named Mel-spectrogram_Denoising. Similarly, Fig. 4a is the initial PET image, and Fig. 4b is the denoised PET image. For the Mel-spectrogram modality, the points with pixel value less than 255 are detected, and the white frame that surrounds the Mel-spectrogram after the matplotlib utility generates its image is removed in order to reduce the noise. For the PET (Positron Emission Tomography) modality, since the initial PET images are all standard squares, the pre-processing detects the bounding box of the brain from four directions. After the brain region is detected, it is cut out. Because the brain is not round and its initial position relative to the original black background is random, most of the images obtained after cutting are rectangular. To keep the training input uniform without the information loss (or added noise) caused by non-proportional deformation, the cut rectangle is then filled into a square. Specifically, the longest side of the rectangle is taken as the side length, and the short side is filled to the same length to make a square, so that the aspect ratio is not changed when resizing. The benefit of this simple and efficient image denoising is demonstrated by our extensive experimental results.

More specifically, Algorithm 3 shows the detailed process for both the audio-based and PET-based image denoising.

Algorithm of image denoising
Input: initial Spectrogram/Mel-spectrogram/MFCC image, initial PET image
Output: denoised Spectrogram/Mel-spectrogram/MFCC image, denoised PET image
for each initial Spectrogram/Mel-spectrogram/MFCC image do
    Detect points with pixel value $<$ 255
    Remove the white surrounding frame
    Crop the initial image
end for
for each initial PET image do
    Detect points with pixel value $>$ 50
    Cut out the bounding rectangle
    Fill the rectangle obtained after clipping the border into a square, with the longest edge as the side
    Crop the initial image
end for
return denoised Spectrogram/Mel-spectrogram/MFCC image, denoised PET image
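The two denoising branches described above can be sketched with NumPy as follows. The thresholds 255 and 50 come from the algorithm; everything else is an assumption made for illustration: images are taken to be 2-D grayscale arrays, the padding is black, and the cropped brain is centred in the square. The function names are hypothetical.

```python
import numpy as np

def crop_white_frame(img, white=255):
    """Crop a spectrogram image to the bounding box of pixels darker than white."""
    rows, cols = np.where(img < white)
    return img[rows.min(): rows.max() + 1, cols.min(): cols.max() + 1]

def crop_and_pad_pet(img, threshold=50, fill=0):
    """Crop a PET slice to the brain's bounding box, then pad it to a square."""
    rows, cols = np.where(img > threshold)
    crop = img[rows.min(): rows.max() + 1, cols.min(): cols.max() + 1]
    h, w = crop.shape
    side = max(h, w)                               # longest edge becomes the side
    top, left = (side - h) // 2, (side - w) // 2   # centring is an assumption
    square = np.full((side, side), fill, dtype=img.dtype)
    square[top: top + h, left: left + w] = crop    # aspect ratio is preserved
    return square

# Usage on toy images
spec_img = np.full((6, 6), 255, dtype=np.uint8)
spec_img[1:4, 2:5] = 10                            # 3x3 content inside a white frame
pet_img = np.zeros((8, 8), dtype=np.uint8)
pet_img[2:5, 1:7] = 200                            # 3x6 "brain" on a black background
spec_out = crop_white_frame(spec_img)
pet_out = crop_and_pad_pet(pet_img)
```

Padding before resizing means a later resize to the network input size scales both axes by the same factor, which is the point of the square-filling step.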

Feature learner

As shown in Fig. 2, three popular pre-trained image classification models (i.e., ResNet, VGG and EfficientNet) are adopted to perform AD detection. Specifically, the ResNet model [36] has two basic blocks (i.e., the Conv Block and the Identity Block). The input and output dimensions of the Conv Block are different, so Conv Blocks cannot be concatenated consecutively; their role is to change the dimensions of the network. The Identity Block has the same input and output dimensions, so Identity Blocks can be concatenated to deepen the network. The ResNet model is thus also a deep convolutional network. While deepening the network, it introduces residual units to alleviate, to a certain extent, the network degradation problem that arises when training deep networks.

The VGG model [37] contains 19 hidden layers (i.e., 16 convolutional layers and 3 fully connected layers). The structure of the VGG network is very consistent: it adopts 3x3 convolutions and 2x2 max pooling throughout the whole framework.

The EfficientNet model [38] was presented by Google. Internally, it is built from multiple MBConv convolution blocks with a DropConnect module. The difference between DropConnect and Dropout is that, during training, DropConnect does not randomly discard the outputs of hidden layer nodes, but randomly discards their inputs.

Ensemble learner

As Fig. 2 illustrates, two different ensemble strategies (i.e., hard voting and soft voting) are adopted to predict the final label (i.e., AD, MCI, HC). For the hard voting strategy, the final label is decided by majority voting over the labels predicted by the three models. For the soft voting strategy, the probabilities that the three models (i.e., ResNet, VGG, and EfficientNet) assign to each category are averaged, and the category with the highest average probability is taken as the final label.
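The difference between the two voting strategies can be made concrete with a small NumPy sketch. The probabilities below are invented for illustration, and the tie-breaking rule (lowest class index wins) is an assumption, since the paper does not describe how ties are resolved.

```python
import numpy as np

def hard_vote(probs):
    """Majority vote; probs has shape (n_models, n_samples, n_classes)."""
    labels = probs.argmax(axis=2)                  # each model's predicted label
    n_classes = probs.shape[2]
    counts = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)],
                      axis=1)                      # votes per class, per sample
    return counts.argmax(axis=1)                   # ties go to the lowest index

def soft_vote(probs):
    """Average the class probabilities over models, then take the argmax."""
    return probs.mean(axis=0).argmax(axis=1)

# One sample, classes ordered (AD, MCI, HC), three hypothetical base learners
probs = np.array([
    [[0.9, 0.1, 0.0]],   # model 1 is confident in AD
    [[0.4, 0.6, 0.0]],   # model 2 leans towards MCI
    [[0.4, 0.6, 0.0]],   # model 3 leans towards MCI
])
hard = hard_vote(probs)  # two of three models vote MCI
soft = soft_vote(probs)  # the mean probability of AD (about 0.57) is highest
```

The example is chosen so that the two strategies disagree: hard voting counts only the winning label of each model, while soft voting lets one confident model outweigh two hesitant ones.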

4. Experiments

In this section, the performance of our proposed framework PEADD will be investigated, including datasets description, experimental settings and results analysis.

4.1 Datasets

We conduct our experiments on two benchmark AD datasets: an audio-based AD dataset and a PET-based AD dataset. The audio-based AD dataset consists of audio from three groups of people, i.e., AD, MCI and HC (Healthy Control), released by the AD contest group at the 16th National Conference on Man-Machine Speech Communication (NCMMSC 2021).² The speech content covers three tasks: picture description (i.e., cookie theft), fluency test (i.e., mathematic calculation, judgement, color discrimination) and free conversation in Chinese (i.e., home address, member of family, health condition). The corpus does not contain the speech of interviewers and retains only the speech of interviewees. Each piece of audio is between 30 and 60 seconds long, and the training set and testing set contain 280 and 119 pieces of audio, respectively. There are 54 males and 69 females in the training set. More specifically, there are 26 AD (i.e., 10 male, 16 female), 53 MCI (i.e., 26 male, 27 female), and 44 HC (i.e., 18 male, 26 female). The audio dataset does not provide the age of the interviewees. In addition, the PET-based AD dataset consists of 3000 images of AD, 4000 images of MCI, and 3000 images of HC, released by the brain PET image analysis and disease prediction challenge contest group in 2020.³ Since the true labels of the testing set of the PET-based AD detection contest cannot be obtained, we randomly split off 20% of its training set as the testing set and use the remaining 80% as the training set.

² http://tsinghua-ieit.com/ad.
³ http://challenge.xfyun.cn/topic/info?type=PET.

For both datasets, 5-fold cross-validation is performed on the training set to obtain the best model, and the retained model is then used to predict the testing set. Table 1 shows the statistics of the two AD datasets, and Fig. 5 shows the cookie theft picture used to elicit the picture description audio.
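The 80/20 hold-out followed by 5-fold cross-validation can be sketched as an index-splitting routine. This is a simplified illustration: the function name and the random seed are hypothetical, and the split here is not stratified by class, whereas the per-class counts in Table 1 suggest that the actual split keeps 80% of each class.

```python
import numpy as np

def holdout_and_folds(n_samples, test_frac=0.2, n_folds=5, seed=0):
    """Return held-out test indices and (train, validation) index pairs."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)               # shuffle once
    n_test = int(round(test_frac * n_samples))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    folds = np.array_split(train_idx, n_folds)     # n_folds disjoint parts
    splits = [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
              for i in range(n_folds)]             # fold i is the validation part
    return test_idx, splits

# Usage with the 10000 PET images (3000 AD + 4000 MCI + 3000 HC)
test_idx, splits = holdout_and_folds(10000)
```

Each of the five models sees four fifths of the training indices and is validated on the remaining fifth; the held-out test indices are never used during model selection.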

Table 1

Statistics of the two AD datasets

Audio-based AD dataset
	Label	#pieces of audio	Min. length (s)	Max. length (s)	Avg. length (s)	Total length (h)
Training set	AD	79	29	60	54.70	1.20
	MCI	108	28	60	52.70	1.36
	HC	93	28.20	60	54.10	1.62
Testing set	AD	35	50	60	59.10	0.57
	MCI	39	44	60	58.60	0.63
	HC	45	47	60	58.20	0.73
PET-based AD dataset
	Label	#image	Max. size of image	Min. size of image	Avg. size of image	Image size category
Training set	AD	2400	(336,336)	(128,128)	(222,222)	4
	MCI	3200
	HC	2400
Testing set	AD	600
	MCI	800
	HC	600

Figure 5.

The cookie theft picture.

4.2 Experimental settings

The librosa⁴ utility is adopted to transform the spectrogram, Mel-spectrogram and MFCC into the image modality. The original spectrogram, Mel-spectrogram and MFCC images are 640*480 in size. After image denoising, the size of the Mel-spectrogram is 496*369, and the size of the Spectrogram and MFCC is 497*370. The initial learning rate is set to 0.01, the gamma used to adjust the learning rate dynamically is set to 0.85, the step size is set to 4, the batch size is set to 10, and the total number of epochs is set to 30.

⁴ http://librosa.org/doc/latest/index.html.
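The learning-rate settings above are consistent with a step-decay schedule, in which the rate is multiplied by gamma every step-size epochs. The paper does not name the scheduler, so the following is a sketch under that assumption, with a hypothetical function name.

```python
def step_lr(epoch, base_lr=0.01, gamma=0.85, step_size=4):
    """Learning rate at a 0-based epoch under an assumed step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate for each of the 30 epochs
schedule = [step_lr(epoch) for epoch in range(30)]
```

Under this reading the rate stays at 0.01 for the first four epochs, drops to 0.0085 for the next four, and has been decayed seven times by the final epoch.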

Baselines for audio-based AD detection

The following six baseline models for audio-based AD detection are adopted in the experiment. Baselines 2 to 5 took the top four places in the audio-based AD contest at NCMMSC 2021. Since the training and testing set splits are the same, this work reports their performance directly.

Baseline 1

The organizer of the AD contest group at the NCMMSC 2021 conference released the official baseline Ncmmsc2021_baseline_svm,⁵ an SVM-based AD detection model. This baseline system adopts openSMILE to extract a batch of signal features (e.g., frame energy, frame intensity, critical band spectra, MFCC, auditory spectra, perceptual linear predictive coefficients, fundamental frequency, mean-crossing rate, spectral features, etc.), and feeds these features into an SVM classifier to detect AD.

⁵ https://github.com/THUsatlab/AD2021/tree/main/ncmmsc2021_baseline_svm.

Baseline 2

Yuan et al. [39] adopted wav2vec in a fine-tuning framework to conduct AD detection. They split the longer audio into many 6-second segments, and obtained the best 3-way classification performance in the audio-based AD contest.

Baseline 3

Hui et al. [40] proposed cross-voting based feature selection (CVFS) to detect AD, which can reduce the over-fitting issue in current machine learning frameworks. They obtained second place in the audio-based AD contest.

Baseline 4

Zhen et al. [41] proposed a temporal convolutional network (TCN)-based AD detection model TCN_SE_SpatialDroupout which can effectively integrate a novel structure residual block and self-attention mechanism.

Baseline 5

Liu et al. [42] proposed a convolutional neural network (CNN)-based AD detection model CNN_AD based on audio features and multi-feature fusion, along with ensemble learning strategy.

Baseline 6

Alić et al. [43] and Veljović [44] proposed similar artificial neural network (ANN) based frameworks to predict metabolic syndrome and the antimicrobial activity of new compounds, respectively. The only difference lies in the number of neurons (i.e., 14 neurons in [43] and 26 neurons in [44]); the two models are denoted as ANN_14 and ANN_26, respectively.

Baselines for PET-based AD detection

The following five baseline systems for PET-based AD detection are adopted in the experiment.

Baseline 1

The official ResNet-34 baseline PET-baseline for the PET-based AD detection contest.⁶

⁶ https://github.com/datawhalechina/competition-baseline/tree/master/competition.

Baseline 2

VGG (Visual Geometry Group) [36] adopts many small convolution kernels instead of a large convolution kernel, which provides more activation functions, richer features and stronger discrimination ability. VGG-19 is adopted as a baseline model.

Baseline 3

ResNet [37] is a modified version of VGG that adds residual units to address the network degradation problem when training deep networks. ResNet-50 is adopted as a baseline model.

Baseline 4

EfficientNet [38] adopts NAS (Neural Architecture Search) technology to search for a reasonable configuration of three parameters (i.e., image input resolution, network depth and channel width). It obtains better performance than VGG and ResNet on many tasks. EfficientNet-b8 is adopted as a baseline model, along with iterative fine-tuning and a 5-fold cross-validation strategy.

Baseline 5

The ANN_14 and ANN_26 based models in Alić et al. [43] and Veljović [44], respectively.

Evaluation metrics

Similar to the existing literature, this paper investigates the performance of our proposed framework and the compared models in terms of accuracy, precision, recall and F1-score.
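For a 3-way task, the four metrics can be computed from a confusion matrix as sketched below. Macro-averaging over the three classes is an assumption, since the paper does not state the averaging scheme; the example counts are those described for Fig. 7 in Section 4.3 (all 35 AD correct, 8 of 39 MCI predicted as AD, 1 of 45 HC predicted as MCI).

```python
def macro_metrics(confusion):
    """Accuracy and macro-averaged precision, recall and F1.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[i][c] for i in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Rows and columns ordered (AD, MCI, HC)
confusion = [[35, 0, 0],
             [8, 31, 0],
             [0, 1, 44]]
acc, prec, rec, f1 = macro_metrics(confusion)
```

With these counts the sketch gives an accuracy of 110/119 and macro-averaged scores that agree with the PEADD (hard voting) row of Table 3 to within rounding, which suggests that macro-averaging is at least a plausible reading of how the reported metrics were computed.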

4.3 Experimental results

Results on audio-based AD detection

Table 2 shows the performance of 3-way AD detection using different modalities with different base learners. Since the testing set is the same, the results of the baselines are taken directly from their respective papers. It can be seen that the image modality outperforms the speech modality, which indicates the effectiveness of the context-enriched image over the context-free audio matrix. In addition, the following observations can be made.

Table 2
The efficiency of 3-way AD detection using different modalities with different base learners on the audio dataset

Feature	ResNet-50	VGG-19	EfficientNet-b8
Context-free audio matrix
Spectrogram	0.7227	0.7059	0.7227
Mel-spectrogram	0.7143	0.7311	0.7479
MFCC	0.7395	0.7563	0.7479
Spectrogram $+$ Mel-spectrogram	0.7311	0.7143	0.7395
Spectrogram $+$ MFCC	0.7563	0.7395	0.7227
Mel-spectrogram $+$ MFCC	0.7479	0.7563	0.7563
Spectrogram $+$ Mel-spectrogram $+$ MFCC	0.7647	0.7479	0.7563
Context-enriched image matrix
Spectrogram	0.8235	0.8151	0.8067
Mel-spectrogram	0.8235	0.8319	0.7983
MFCC	0.7815	0.8151	0.7731
Spectrogram_Denoising	0.8655	0.8403	0.8571
Mel-spectrogram_Denoising	0.8740	0.8908	0.8908
MFCC_Denoising	0.8319	0.8235	0.8235

(1) For the speech modality, the Mel-spectrogram outperforms the other two features (i.e., Spectrogram and MFCC). Compared with the MFCC image, the pixels of the spectrogram carry more information (features) and less noise, which can be better learned by the model. The Mel-spectrogram, i.e., the frequency spectrum image after Mel filtering, is further processed to remove some noise, which makes it easier to learn and yields the best performance, as shown in Fig. 6. Also, combining different speech features can improve the AD detection performance, because different characteristics can be extracted from the audio.

Figure 6.

Image modality of three different audio-based features.

(2) For the three base learners (i.e., ResNet-50, VGG-19, EfficientNet-b8), EfficientNet-b8 obtains the best performance. The reason is that EfficientNet increases the network width, network depth and input resolution simultaneously. Because it can adjust the width, depth, image resolution and scaling coefficient at the same time, it improves the accuracy compared with models that improve a single dimension (e.g., ResNet or VGG).

(3) Image denoising is very effective, yielding a performance gain of 5% to 10%. In fact, removing the white frame surrounding the original Mel-spectrogram reduces the noise of the initial image.

Table 3 shows the performance of all systems on the audio-based AD testing set. It can be seen that our model outperforms all the state-of-the-art approaches in terms of accuracy. Among the six baseline systems, the ANN-based baseline performs the worst, the SVM-based baseline performs second worst, and the other four deep learning-based models obtain comparable performance. Since all four deep learning-based models adopt the context-free audio matrix instead of the image as the input modality, our image modality based model performs better than these models. Interestingly, the ANN performs better with the audio-based features than with the Mel-spectrogram denoising feature. A potential reason is that the ANN has only a single channel, unlike the three channels (i.e., RGB) within a CNN, which results in performance degradation. In addition, of the two voting methods, hard voting is better than soft voting. For this audio-based AD dataset, the prediction errors of the base models lie mainly between AD and MCI, which is a strong bias. This may be one of the reasons why hard voting achieves better results, since it applies a brute-force majority decision.

Table 3

Audio-based AD detection results

Method	Accuracy	Precision	Recall	F1-score
ANN_14_Spectrogram + Mel-spectrogram + MFCC	0.6555	0.6876	0.6608	0.6543
ANN_26_Spectrogram + Mel-spectrogram + MFCC	0.6723	0.6974	0.6756	0.6708
ANN_14_Mel-spectrogram_Denoising	0.6134	0.6317	0.5933	0.5550
ANN_26_Mel-spectrogram_Denoising	0.6218	0.6239	0.5957	0.5241
Ncmmsc2021_baseline_svm	0.7980	0.7990	0.7850	0.7860
wav2vec	0.8992	0.8945	0.8958	0.8948
CVFS	0.8675	–	–	–
TCN_SE_SpatialDroupout	0.8824	0.8909	0.8872	0.8807
CNN_AD	0.8487	0.8463	0.8449	0.8450
PEADD (hard voting)	0.9244	0.9276	0.9241	0.9198
PEADD (soft voting)	0.9160	0.9163	0.9147	0.9110

Table 4

PET-based AD detection results

Method	Accuracy	Precision	Recall	F1-score
ANN_14	0.8065	0.8131	0.8036	0.8074
ANN_26	0.8045	0.8110	0.8015	0.8053
PET-baseline	0.8705	0.8725	0.8706	0.8714
VGG-19	0.9110	0.9176	0.9082	0.9120
ResNet-50	0.9275	0.9266	0.9313	0.9281
EfficientNet-b8	0.9125	0.9179	0.9124	0.9127
VGG-19_Denoising	0.9810	0.9821	0.9804	0.9812
ResNet-50_Denoising	0.9895	0.9892	0.9901	0.9897
EfficientNet-b8_Denoising	0.9870	0.9874	0.9868	0.9871
PEADD (hard voting)	0.9945	0.9949	0.9940	0.9944
PEADD (soft voting)	0.9950	0.9954	0.9946	0.9950

Figure 7.

The confusion matrix of our best model on the audio-based dataset.

Figure 7 shows the confusion matrix of our best-performing model. All AD samples are successfully detected by our model, and only 1 HC sample is misclassified as MCI. Since MCI is much closer to AD, 8 MCI samples are misclassified as AD. Therefore, improving MCI detection is one direction for our future work.

Results on PET-based AD detection

Table 4 shows the performance of all systems on the PET-based AD dataset. Our model significantly outperforms all the state-of-the-art approaches in terms of accuracy. All three pre-trained base learners (i.e., VGG-19, ResNet-50, EfficientNet-b8) obtain comparably high performance. Again, the denoised PET images yield better results than the initial PET images. Since the three pre-trained base learners obtain quite similar performance on the denoised PET images, hard voting and soft voting also obtain similar performance.

Figure 8 shows the confusion matrix of our best-performing model. Only 1 AD sample is misclassified as MCI, 2 AD samples as HC, 1 MCI sample as AD, and 6 HC samples as MCI. Our model obtains better performance on PET-based AD detection than on audio-based AD detection.

Figure 8.

The confusion matrix of our best model on the PET-based dataset.
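As a sanity check, the error counts read from the two confusion matrices can be compared with the best accuracies reported in Tables 3 and 4. The sketch below lists every test-set size for which the stated number of misclassifications would round to the reported accuracy. The helper name `consistent_test_sizes` is hypothetical, and the resulting sizes are inferences from the reported figures, not values stated in this section.

```python
def consistent_test_sizes(errors: int, accuracy: float,
                          max_n: int = 5000, decimals: int = 4) -> list:
    """List all n for which (n - errors) / n rounds to the reported accuracy."""
    target = round(accuracy, decimals)
    return [
        n for n in range(max(errors, 1), max_n + 1)
        if round(1 - errors / n, decimals) == target
    ]

# Audio: 1 HC -> MCI and 8 MCI -> AD; best accuracy 0.9244 (hard voting).
audio = consistent_test_sizes(1 + 8, 0.9244)
# PET: 1 AD -> MCI, 2 AD -> HC, 1 MCI -> AD, 6 HC -> MCI; best accuracy 0.9950 (soft voting).
pet = consistent_test_sizes(1 + 2 + 1 + 6, 0.9950)

print(audio[0], audio[-1])  # 119 119
print(pet[0], pet[-1])      # 1981 2020
```

Under these assumptions, the audio figures are consistent with exactly one test-set size, while the PET figures are consistent with a narrow range of sizes around 2000.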

5. Limitations and deployment

5.1 Limitations

This work has the following two limitations.

(1) Limited generalizability: Our experiments were conducted on specific datasets and may not fully represent the characteristics of other Alzheimer’s disease detection scenarios or platforms. Generalizability to different datasets and languages needs to be further explored.

(2) Absence of real-time evaluation: Our evaluation primarily focused on offline performance measures, and we did not consider real-time or dynamic evaluation scenarios. Future work should investigate the model’s performance in real-time Alzheimer’s disease detection settings.

5.2 Deployment

For practical use or deployment of our method, a PET image or a speech segment collected from a candidate can be fed directly into our proposed end-to-end Alzheimer’s disease detection model, which predicts a 3-way classification result (i.e., AD, MCI, or HC).

6. Conclusions and future work

This work proposed using three pre-trained image classification base learners (i.e., ResNet, VGG, and EfficientNet) to create ensembles for both audio-based and PET-based AD detection under a unified image modality. We examined the effectiveness of the context-enriched image modality instead of the traditional context-free audio matrix for audio-based AD detection. In addition, different voting methods for applying an ensemble, along with a simple and effective image denoising strategy, were investigated in detail. Experimental results on two benchmark AD datasets demonstrate that our proposed model, PEADD, significantly outperforms the state-of-the-art methods. In the future, we would like to create ensembles for other supervised AD models, along with different image and audio features.

Footnotes

Acknowledgments

The authors would like to thank anonymous reviewers for their insightful comments on this paper.

Conflict of interest

The authors declare that they have no conflict of interest.

Funding

This research was supported by the National Natural Science Foundation of China under Grants 62162031, 62066020 and 62266023, Key Project of Jiangxi Natural Science Foundation under Grant 20224ACB202010, and Jiangxi Province Degree and Graduate Education Teaching Reform Research Project under Grant JXYJG-2021-056.

References

1. Ryu, Hong, Liang, Pak, Zhang, Wang, Lian. A real-time heart rate estimation framework based on a facial video while wearing a mask. Technology and Health Care. 2023; 31(3): 887-900. doi: 10.3233/THC-220322.

2. Tiwari, Tripathi, Pandey, Sharma. Detection of COVID-19 infection in CT and X-ray images using transfer learning approach. Technology and Health Care. 2022; 30(6): 1273-1286. doi: 10.3233/THC-220114.

3. Zhang, Jiang, Yang. Research on the classification of lymphoma pathological images based on deep residual neural network. Technology and Health Care. 2021; 29(S1): 335-344. doi: 10.3233/THC-218031.

4. Badnjevic, Avdihodžić, Pokvic. Artificial intelligence in medical devices: Past, present and future. Science, Art & Religion. 2021; 1(1-2): 101-106. doi: 10.5005/sar-1-1-2-101.

5. Lin. Decoding the process of cognitive language understanding. Journal of Jiangxi Normal University (Social Sciences). 2009; 42(6): 157-160. doi: 10.3969/j.issn.1000-579X.2009.06.028. (in Chinese).

6. Zanco, Plácido, Marinho, Ferreira, de Oliveira, Monteiro-Junior, Barca, Engedal, Laks, Deslandes. Spatial navigation in the elderly with Alzheimer’s disease: A cross-sectional study. Journal of Alzheimer’s Disease. 2018; 66(4): 1683-1694. doi: 10.3233/JAD-180819. PMID: 30507580.

7. de la Fuente Garcia, Ritchie, Luz. Artificial intelligence, speech and language processing approaches to monitoring Alzheimer’s disease: A systematic review. Journal of Alzheimer’s Disease. 2020; 78(4): 1547-1574. doi: 10.3233/JAD-200888.

8. Filiou, Nathalie, Antoine, Bérengėre, Patricia, Simona. Connected speech assessment in the early detection of Alzheimer’s disease and mild cognitive impairment: A scoping review. Aphasiology. 2020; 34(6): 723-755. doi: 10.1080/02687038.2019.1608502.

9. Billeci, Badolato, Bachi, Tonacci. Machine learning for the classification of Alzheimer’s disease and its prodromal stage using brain diffusion tensor imaging data: A systematic review. Processes. 2020; 8(9): 1071. doi: 10.3390/pr8091071.

10. Pulido MLB, Hernández JBA, Ballester MÁF, González CMT, Mekyska, Smékal. Alzheimer’s disease and automatic speech analysis: A review. Expert Systems With Applications. 2020; 150: 113213. doi: 10.1016/j.eswa.2020.113213.

11. Abdalla, Rudzicz, Hirst. Rhetorical structure and Alzheimer’s disease. Aphasiology. 2017; 32(1): 41-60. doi: 10.1080/02687038.2017.1355439.

12. Qiao, Xie, Lin, Zou. Computer-assisted speech analysis in mild cognitive impairment and Alzheimer’s disease: A pilot study from Shanghai, China. Journal of Alzheimer’s Disease. 2020; 75(1): 1-11. doi: 10.3233/JAD-191056.

13. Ryu, Hong, Liang, Pak, Zhang, Wang, Lian. A real-time heart rate estimation framework based on a facial video while wearing a mask. Technology and Health Care. 2023; 31(3): 887-900. doi: 10.3233/THC-220322.

14. Tóth, Hoffmann, Gosztolya, Vincze, Szatloczki, Banreti, Pakaski, Kalman. A speech recognition-based solution for the automatic detection of mild cognitive impairment from spontaneous speech. Current Alzheimer Research. 2018; 15(2): 130-138. doi: 10.2174/1567205014666171121114930.

15. Haider, de la Fuente, Luz. An assessment of paralinguistic acoustic features for detection of Alzheimer’s dementia in spontaneous speech. IEEE Journal of Selected Topics in Signal Processing. 2019; 14(2): 272-281. doi: 10.1109/JSTSP.2019.2955022.

16. Martinc, Pollak. Tackling the ADReSS challenge: A multimodal approach to the automated recognition of Alzheimer’s dementia. ISCA Conference on the Interspeech. ISCA. 2020. pp. 2157-2161. doi: 10.21437/Interspeech.2020-2202.

17. Luz. Longitudinal monitoring and detection of Alzheimer’s type dementia from spontaneous speech data. IEEE Conference on the Computer-Based Medical Systems. IEEE. 2017. pp. 45-46. doi: 10.1109/CBMS.2017.41.

18. Yancheva, Rudzicz. Vector-space topic models for detecting Alzheimer’s disease. ACL Conference on the Annual Meeting of the Association for Computational Linguistics. ACL. 2016. pp. 2337-2346. doi: 10.18653/v1/P16-1221.

19. Luz, Haider, de la Fuente, Fromm, Macwhinney. Alzheimer’s dementia recognition through spontaneous speech: the ADReSS challenge. ISCA Conference on the Interspeech. ISCA. 2020. pp. 2172-2176. doi: 10.21437/Interspeech.2020-2571.

20. Gosztolya, Vincze, Tóth, Pákáski, Kálmán, Hoffmann. Identifying mild cognitive impairment and mild Alzheimer’s disease based on spontaneous speech using ASR and linguistic features. Computer Speech & Language. 2018; 53: 181-197. doi: 10.1016/j.csl.2018.07.007.

21. Lopez-de-Ipina, Martinez-de-Lizarduy, Calvo, Mekyska, Beitia, Barroso, Estanga, Tainta, Ecay-Torres. Advances on automatic speech analysis for early detection of Alzheimer disease: A non-linear multi-task approach. Current Alzheimer Research. 2018; 15(2): 139-148. doi: 10.2174/1567205014666171120143800.

22. Martinc, Haider, Pollak, Luz. Temporal integration of text transcripts and acoustic features for Alzheimer’s diagnosis based on spontaneous speech. Frontiers in Aging Neuroscience. 2021; 13: 642647. doi: 10.3389/fnagi.2021.642647.

23. Zhu, Liang, Batsis, Roth. Exploring deep transfer learning techniques for Alzheimer’s dementia detection. Frontiers of Computer Science. 2021; 3: 624683. doi: 10.3389/fcomp.2021.624683.

24. Mahajan, Baths. Acoustic and language based deep learning approaches for Alzheimer’s dementia detection from spontaneous speech. Frontiers in Aging Neuroscience. 2020; 13: 623607. doi: 10.3389/fnagi.2021.623607.

25. Sarawgi, Zulfikar, Soliman, Maes. Multimodal inductive transfer learning for detection of Alzheimer’s dementia and its severity. arXiv preprint arXiv:2009.00700v1. 2020. doi: 10.48550/arXiv.2009.00700.

26. Yuan, Bian, Cai, et al. Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer’s disease. ISCA Conference on the Interspeech. ISCA. 2020. pp. 2162-2166. doi: 10.21437/Interspeech.2020-2516.

27. Fritsch, Wankerl, Noth. Automatic diagnosis of Alzheimer’s disease using neural network language models. IEEE Conference on the Acoustics, Speech and Signal Processing. IEEE. 2019. pp. 5841-5845. doi: 10.1109/ICASSP.2019.8682690.

28. Palo, Parde. Enriching neural models with targeted features for dementia detection. ACL Conference on the Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. ACL. 2019. pp. 302-308. doi: 10.18653/v1/P19-2042.

29. Karlekar, Niu, Bansal. Detecting linguistic characteristics of Alzheimer’s dementia by interpreting neural models. ACL Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. ACL. 2018. pp. 701-707. doi: 10.48550/arXiv.1804.06440.

30. Cummins, Pan, Ren, et al. A comparison of acoustic and linguistics methodologies for Alzheimer’s dementia recognition. ISCA Conference on the Interspeech. ISCA. 2020. pp. 2182-2186. doi: 10.21437/Interspeech.2020-2635.

31. Koo, Lee, Pyo, Lee. Exploiting multimodal features from pre-trained networks for Alzheimer’s dementia recognition. arXiv preprint arXiv:2009.04070v1. 2020. doi: 10.48550/arXiv.2009.04070.

32. Balagopalan, Eyre, Robin, Rudzicz, et al. Comparing pre-trained and feature-based models for prediction of Alzheimer’s disease based on speech. Frontiers in Aging Neuroscience. 2021; 13: 1-12. doi: 10.3389/fnagi.2021.635945.

33. Lindsay, Troger, Konig. Language impairment in Alzheimer’s disease-robust and explainable evidence for AD-related deterioration of spontaneous speech through multilingual machine learning. Frontiers in Aging Neuroscience. 2021; 13: 642033. doi: 10.3389/fnagi.2021.642033.

34. Haulcy, Glass. Classifying Alzheimer’s disease using audio and text-based representations of speech. Frontiers in Psychology. 2021; 11: 624137. doi: 10.3389/fpsyg.2020.624137.

35. Balagopalan, Eyre, Rudzicz, et al. To BERT or not to BERT: comparing speech and language-based approaches for Alzheimer’s disease detection. ISCA Conference on the Interspeech. ISCA. 2020. pp. 1-7. doi: 10.48550/arXiv.2008.01551.

36. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014. doi: 10.48550/arXiv.1409.1556.

37. Zhang, Ren, Sun. Deep residual learning for image recognition. IEEE Conference on the Computer Vision and Pattern Recognition. IEEE. 2016. pp. 770-778. doi: 10.48550/arXiv.1512.03385.

38. Tan. EfficientNet: rethinking model scaling for convolutional neural networks. IEEE Conference on the International Conference on Machine Learning. IEEE. 2019. pp. 6105-6114. doi: 10.48550/arXiv.1905.11946.

39. Yuan, Cai, Huang, Zheng, Church. Recognition of Alzheimer’s disease from 6-second speech. Conference on the Man-Machine Speech Communication. 2021. pp. 929-935. (in Chinese).

40. Hui, Xue, Wang, Sun. Cross-voting based feature selection for reducing the risk of over-fitting in Alzheimer’s disease recognition task. Conference on the Man-Machine Speech Communication. 2021. pp. 955-962. (in Chinese).

41. Gong, Niu, Zhao. Alzheimer’s disease recognition based on speech modality. Conference on the Man-Machine Speech Communication. 2021. pp. 984-992. (in Chinese).

42. Liu, Yang, Zhao. Single-feature and multi-feature fusion audio classification for Alzheimer’s disease based on convolutional neural network. Conference on the Man-Machine Speech Communication. 2021. pp. 1002-1011. (in Chinese).

43. Alić, Pokvic, Badnjević, Čengić, Malenica, Dujič, Causevic, Bego. Classification of metabolic syndrome patients using implemented expert system. IFMBE Conference on the Medical and Biological Engineering. IFMBE. 2017. pp. 601-607. doi: 10.1007/978-981-10-4166-2_91.

44. Veljović, Halilovic, Muratović, Osmanović, Badnjevic, Pikvic, Tatlić, Zorlak, Imanović, Husić, Zavrsnik. Artificial neural network and docking study in design and synthesis of xanthenes as antimicrobial agents. IFMBE Conference on the Medical and Biological Engineering. IFMBE. 2017. pp. 617-626. doi: 10.1007/978-981-10-4166-2_93.

Pre-training and ensembling based Alzheimer’s disease detection


¹ https://www.alzint.org/u/World-Alzheimer-Report-2022.pdf.

² http://tsinghua-ieit.com/ad.

⁴ http://librosa.org/doc/latest/index.html.

Table 2
The efficiency of 3-way AD detection using different modalities with different base learners on the audio dataset