Abstract
Background:
Semantic and phonological fluency (SF and PF) are routinely evaluated in patients with Alzheimer’s disease (AD). There are disagreements in the literature regarding which fluency task is more affected as AD develops. Most studies focus on SF assessment, given its connection with the temporoparietal amnesic system. PF is less often reported; it is related to working memory, which is also impaired in probable and diagnosed AD. Differentiating between performance on these tasks might be informative in early AD diagnosis, providing an accurate linguistic profile.
Objective:
Compare SF and PF performance in healthy volunteers, volunteers with probable AD, and patients with AD diagnosis, considering the heterogeneity of age, gender, and educational level variables.
Methods:
A total of 8 studies were included for meta-analysis, reaching a sample size of 1,270 individuals (568 patients diagnosed with AD, 340 with probable AD diagnosis, and 362 healthy volunteers).
Results:
The three groups consistently performed better on SF than PF. When progressing to a diagnosis of AD, we observed a significant difference in SF and PF performance across our 3 groups of interest (p = 0.04). The age variable explained a proportion of this difference in task performance across the groups, and as age increases, both tasks equally worsen.
Conclusion:
The performance of SF and PF might play a differential role in early AD diagnosis. These tasks rely on partially different neural bases of language processing. They are thus worth exploring independently in diagnosing normal aging and its transition to pathological stages, including probable and diagnosed AD.
INTRODUCTION
Clinical diagnosis of Alzheimer’s disease (AD) includes impairment in two or more cognitive domains, such as memory, executive function, and behavior, which compromise social cognition [1], according to the standard criteria used, as determined by the National Institute of Neurological and Communicative Disorders and Stroke-Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) [2].
Language is a domain of potential impairment that has been suggested as an important marker in assessing early AD stages, particularly relating to semantic fluency (SF) and phonological fluency (PF) [3, 4]. SF requires the ability to access and retrieve semantic knowledge, while PF depends on phonological/lexical retrieval mechanisms [5]. Studies focused on normal aging have highlighted that declines in lexical access affect performance on both tasks; however, impairment in executive function has also been linked with worse measures of SF and PF. Those findings mentioned how age could contribute to perseverative errors while performing these tasks [6–8]. SF is primarily relevant in dementia research due to its connection with the temporoparietal lobe amnesic system [9, 10]. Although PF is less reported, it is related to working memory, a cognitive function that is also impaired in patients with probable and confirmed diagnosis of AD [11, 12]. Given the functions these tasks target, both tasks are expected to be impaired in patients with dementia [11].
However, there are discrepancies regarding the performance of patients diagnosed with AD on SF and PF tasks [13]. This challenges the selection of which task is more critical for early diagnosis of patients with cognitive decline or dementia. Adlam et al. [14] studied SF performance in patients with AD diagnosis showing its impairment even at mild stages of AD. They also reported better performance on PF than SF, consistent with other studies [15–18]. In contrast, a retrospective study [19] comparing AD and frontotemporal dementia patients showed that the AD group performed better on SF than PF, consistent with Comesaña and Coni [20], who found that AD patients showed more significant problems with PF than SF tasks.
Studies reporting normative data on SF and PF task performance are available in different countries, providing further insight into the protocols applied to evaluate fluency and the effects of demographic variables on performance in these tasks [21, 22]. Various protocols are used to study SF and PF, implementing different categories (e.g., generate animal names, supermarket items) or different numbers of letters. These tasks are generally evaluated in one minute [23, 24]; however, other study protocols consider other time windows, such as 90 seconds [25] or even no time limit to produce the words [26]. Task performance has been reported using raw results [27] or a sum of words produced for each letter or category [14, 28].
Two meta-analyses reporting SF and PF in healthy volunteers and AD patients [29, 30] include diverse study protocols evaluating SF and PF. Performance on SF and PF tasks depends critically upon the letter or semantic category chosen [31] and the time limit given. Age and educational level influence SF and PF task performance as well, and there are discrepancies in the literature on the brain regions for different verbal fluency tasks [32–35]. In patients with mild cognitive impairment and with AD, Kawano et al. [36] found that years of education were significantly related to SF test scores, but not significantly related to PF test scores. Similar effects of years of education and age have been consistently reported in many languages [22, 37, 38]. One factor that might interfere with understanding the effects of age and educational level is the variability of cognitive reserve in the adult population [39]. Cognitive reserve refers to differences between individuals in how tasks are performed that might enable some people to be more resilient to brain changes than others. This “reserve” is a combination of lifestyle choices and lifetime experiences, such as social activities and education [40]. A better understanding of the concept of cognitive reserve could lead to interventions to mitigate cognitive ageing or reduce the risk of dementia [39, 41, 42]. Although less reported, there are also conflicting findings regarding whether gender affects SF and PF performance [21, 22, 36].
So far, no previous meta-analyses have considered including the same parameters (time, category, letter, score) for evaluating SF and PF performance in this clinical population. The processing behind these fluency tasks might reveal better insight into the common word-finding problems in healthy and pathological aging. The present systematic review and meta-analysis thus aimed to compare SF and PF performance in healthy volunteers, and in individuals with probable and confirmed AD diagnoses, considering the influence of age, gender, and educational level.
METHODS
Review design
A systematic review and meta-analysis were conducted following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [43]. The manuscript was registered in PROSPERO under the code CDR42022352253. The PRISMA checklist appears in the Supplementary Material.
Data sources and search strategy
We reviewed the literature in the PubMed, EMBASE, and CENTRAL databases, from January 1975 to December 2020. Our search strategy was performed in each database, selecting relevant terms aligned with our research question. We used the terms “Alzheimer’s Disease, Alzheimer’s type dementia, aged” to characterize our population of interest. We selected terms related to performance in the SF and PF tasks, such as “Verbal Fluency, Verbal Fluency Test, Semantic Fluency, and Phonemic Fluency.” All synonyms of relevant concepts were included and were linked with Boolean operators such as AND and OR. We used the controlled vocabulary assigned by the indexers (MeSH for PubMed and CENTRAL, Emtree for EMBASE) together with natural language. The complete search strategy and filters used for the three databases are described in Supplementary Tables 1–3.
Eligibility criteria
Inclusion criteria considered studies that: a) included adults with a probable diagnosis of AD or diagnosed with AD according to NINCDS-ADRDA criteria, b) reported on both tasks, SF and PF, c) assessed the SF and PF tasks using the standard time of one minute, d) evaluated SF with the “animal” category according to varying national norms, e) evaluated PF with letters validated for each country and reported raw scores, f) were written in English or Spanish, and g) reported baseline measurements. We excluded case studies, case analyses, and systematic reviews (with or without meta-analysis).
Data collection process
Duplicated articles were identified using the features included in the Rayyan software [44]. Two authors (R.O.V. and C.S.S.) applied the inclusion and exclusion criteria to all titles and abstracts. The interrater reliability (IRR), determined with Cohen’s kappa, was 0.69 (p < 0.001). The disagreements (10%) were resolved by a third senior reviewer (C.M.O.). The authors applied a self-designed checklist to extract data from the included articles and synthesize the evidence (Supplementary Table 4).
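As an illustration of how an agreement statistic of this kind is obtained, the following minimal sketch computes Cohen's kappa from two reviewers' include/exclude decisions. The decision lists are hypothetical and are not the screening data of this review; the function name is likewise an assumption made for the example.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical screening decisions (NOT the review's data): 1 = include, 0 = exclude.
reviewer_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
reviewer_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

In this toy case the reviewers agree on 8 of 10 records, but because agreement by chance alone would already be 0.58, kappa is considerably lower than the raw agreement rate.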
Risk of bias assessment
The risk of bias and the applicability of the eight full-text articles included in this systematic review and meta-analysis were assessed with the QUADAS-2 tool [45]. QUADAS-2 is an instrument designed and validated to assess bias risk and the applicability of diagnostic accuracy studies. Two reviewers independently graded each article (R.O.V. and C.S.S.), and the scores were compared. A third reviewer (C.M.O.) resolved disagreements between the two principal reviewers.
Statistical analysis
A systematic review and meta-analysis were performed for the reported SF and PF scores among the three groups, and the weighted average was estimated. The values per task were considered to be the number of words generated in one minute, and an unstandardized mean difference was estimated for the number of words. We used Review Manager 5.4, and with Cochran’s Q statistic test, we analyzed the differences between SF and PF across all groups, comparing the means of each group extracted across studies. Due to the highly heterogeneous values reported in the selected studies (>75%), we performed a random-effects meta-regression including the following demographic variables: age, educational level, and gender (proportion of females) in each study. We estimated the percentage of heterogeneity R-squared (%) attributable to each demographic variable. Univariate and multivariate meta-regression models were created using the “meta regress” command of Stata v16 software, applying the Knapp-Hartung method with truncation to adjust the standard error of coefficients. Multivariate models were constructed with the variables that showed a significant association in the univariate models. The proportion of variance explained between studies (R2) came from the univariate and multivariate meta-regression models.
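The quantities named in this section (Cochran's Q, I2, and a random-effects pooled mean difference) can be sketched as follows. This is a simplified illustration that uses the DerSimonian-Laird moment estimator rather than the software and estimation settings used in this review, and the per-study values are hypothetical placeholders, not extracted data.

```python
import math

def random_effects_pool(effects, std_errors):
    """DerSimonian-Laird random-effects pooling of per-study mean differences."""
    k = len(effects)
    variances = [se**2 for se in std_errors]
    w_fixed = [1 / v for v in variances]
    fixed_mean = sum(w * y for w, y in zip(w_fixed, effects)) / sum(w_fixed)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(w_fixed, effects))
    df = k - 1
    # I2: share of total variation attributed to between-study heterogeneity.
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Moment estimate of the between-study variance tau^2 (truncated at zero).
    c = sum(w_fixed) - sum(w**2 for w in w_fixed) / sum(w_fixed)
    tau_squared = max(0.0, (q - df) / c)
    # Random-effects weights add tau^2 to each study's sampling variance.
    w_random = [1 / (v + tau_squared) for v in variances]
    pooled = sum(w * y for w, y in zip(w_random, effects)) / sum(w_random)
    pooled_se = math.sqrt(1 / sum(w_random))
    return {
        "q": q,
        "i_squared": i_squared,
        "tau_squared": tau_squared,
        "pooled": pooled,
        "ci95": (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se),
    }

# Hypothetical SF minus PF mean differences (words per minute) and standard errors.
result = random_effects_pool([6.0, 2.0, 4.0], [1.0, 1.0, 1.0])
print(result)
```

With these placeholder values, Q is 8 on 2 degrees of freedom, so I2 is 75%, which corresponds to the threshold above which heterogeneity was treated as high in this section.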
RESULTS
Study selection
Figure 1 shows the flow chart for the selection process based on the PRISMA guidelines [43]. A total of 5,902 potential studies were identified. After removing duplicated articles (1,160), 4,742 studies were considered for screening titles and abstracts, excluding 4,682 articles. After screening, 60 articles were considered for critical analysis. After applying inclusion criteria, 29 studies were selected, but we could extract data from only 8 studies for the meta-analysis of both SF and PF [12, 22, 27, 36, 46–49].

Fig. 1. PRISMA flow diagram of the search and selection of articles.
Study characteristics
Table 1 shows the general characteristics of the eight articles included in the review. The sample size of the selected studies was 1,270 participants (568 patients diagnosed with AD (44.7%), 340 volunteers with a probable AD diagnosis (26.8%), and 362 healthy volunteers (28.5%)), and the average age of all groups was 71.5 years. A total of 738 participants (58%) were women, while one article did not report the number of participants in their study. The average educational level was 10.1 years of schooling in the total sample. Finally, Table 1 shows the heterogeneity of years of publication (the oldest published in 1999), and the country for all the articles included.
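The composition of the pooled sample can be checked directly from the group counts reported above. The short sketch below only restates that arithmetic; the variable names are chosen for the example. Note that 340 of 1,270 rounds to 26.8%.

```python
# Group sizes as reported in the text.
group_sizes = {"diagnosed AD": 568, "probable AD": 340, "healthy volunteers": 362}
total = sum(group_sizes.values())

# Percentage of the pooled sample in each group, rounded to one decimal.
shares = {group: round(100 * n / total, 1) for group, n in group_sizes.items()}

# Share of women among all participants (738 reported).
women_share = round(100 * 738 / total, 1)

print(total, shares, women_share)
```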
Table 1. General characteristics of the included articles.
–, not reported; UK, United Kingdom; USA, United States of America.
Bias risk in the selected studies
Using the QUADAS-2 tool [45], all the articles were analyzed considering risks of bias and applicability (Fig. 2). Regarding the risk of bias for Domains 1 and 2 (patient selection and index test, respectively), 90% were classified as low risk. In addition, in Domains 3 and 4 (reference standard and flow and timing, respectively), 80% of the articles were classified as low risk. On the other hand, the assessment of applicability concerns showed that 90% were classified as low risk for patient selection and index tests. By contrast, 80% of the articles were classified as low risk and 20% as high risk for reference standard evaluation.

Fig. 2. Assessment of the risk of bias and applicability for the reviewed studies.
Results of syntheses
Performance on SF and PF tasks was compared between the three groups: healthy volunteers, volunteers with a probable AD diagnosis, and patients diagnosed with AD. Although significant heterogeneity was reported across studies (I2 = 90.5%), Cochran’s Q test showed significant differences between both types of verbal fluency among all groups (p = 0.04). The smallest difference in SF and PF task performance corresponded to patients diagnosed with AD (mean difference 2.93; 95% CI: 1.59–4.27). Figure 3 shows SF and PF performance in the three groups. In the intra-group analysis for the healthy volunteers, we observed high heterogeneity (I2 = 80.20%) across the values reported in the selected studies. However, the difference in performance between the two tasks was statistically significant (p < 0.001). Healthy volunteers performed better on SF, generating approximately 6 more words than on PF, a larger difference than in the other two groups. The studies included for analyzing volunteers with a probable diagnosis of AD also demonstrated high heterogeneity (I2 = 90.58%). The difference in task performance nevertheless remained significant (p < 0.001), with higher scores for SF, approximately 4 words more than for PF. The studies included for the diagnosed AD patient group showed high heterogeneity (I2 = 88.10%), and the difference between SF and PF tasks was statistically significant (p < 0.001). This group also performed better on SF, generating approximately 3 words more than on the PF task.
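A test for differences between subgroups of the kind reported above can be sketched as a Cochran's Q computed across the subgroup estimates. The sketch below is an assumption-laden illustration: only the confidence interval passed to se_from_ci is taken from the text, while the three subgroup estimates and standard errors used for the Q test are placeholders, so the resulting p-value is not a result of this review.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Recover a standard error from a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * z)

def subgroup_difference_test(estimates, std_errors):
    """Cochran's Q across three subgroup estimates, with its p-value."""
    if len(estimates) != 3 or len(std_errors) != 3:
        raise ValueError("the closed-form p-value below assumes exactly 3 subgroups")
    weights = [1 / se**2 for se in std_errors]
    overall = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    q_between = sum(w * (est - overall) ** 2 for w, est in zip(weights, estimates))
    # Q is referred to a chi-square distribution with 2 degrees of freedom,
    # whose upper-tail probability has the closed form exp(-Q / 2).
    p_value = math.exp(-q_between / 2)
    return q_between, p_value

# Interval reported for the diagnosed AD group (1.59 to 4.27 words).
se_ad = se_from_ci(1.59, 4.27)

# Placeholder subgroup estimates and standard errors, NOT the review's data.
q_between, p_value = subgroup_difference_test([6.0, 4.0, 2.0], [1.0, 1.0, 1.0])
print(round(se_ad, 3), round(q_between, 2), round(p_value, 4))
```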

Fig. 3. Analysis of the performance during SF and PF in the three groups of interest. Random-effects REML model. Results correspond to raw scores of SF and PF.
The univariate model built with the variable group showed high heterogeneity (I2 = 86.67%), but task performance remained significantly different (p = 0.025), being explained by the differences between groups (R2 = 24.04%). The group of patients diagnosed with AD generated approximately 3 words fewer than the healthy volunteer group (β = −2.74; 95% CI: −5.09 to −0.40; p = 0.025). The univariate model built with the variable age showed high heterogeneity (I2 = 83.83%), and the task performance difference was significant (p = 0.01). The age variable explains task performance differences (R2 = 41.06%) with a negative coefficient of −0.18 (95% CI: −0.33 to −0.05; p = 0.011). Age influenced task performance, with fewer words generated as age rose. The univariate model built with the variable gender (considered as the proportion of females) showed high heterogeneity (I2 = 89.27%) but did not reach significance in explaining the heterogeneity of the task performance (R2 = 9.27%; p = 0.1). The univariate model built with educational level was also highly heterogeneous (I2 = 90.47%) but did not reach significance in explaining the heterogeneity in task performance differences (R2 = 0.00%; p = 0.319).
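The R2 values reported for these models express how much of the between-study variance is removed once a covariate is added. A minimal sketch of that idea for a single covariate is given below, using moment-based estimates of the between-study variance instead of the estimation method of the software used in this review. The study-level ages, mean differences, and variances are hypothetical, and all function names are assumptions made for the example.

```python
def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx**2
    slope = (sw * swxy - swx * swy) / det
    intercept = (swy - slope * swx) / sw
    return intercept, slope

def tau2_null(y, v):
    """Moment estimate of between-study variance without covariates."""
    w = [1 / vi for vi in v]
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    return max(0.0, (q - (len(y) - 1)) / c)

def tau2_with_covariate(x, y, v):
    """Moment estimate of residual between-study variance with one covariate."""
    w = [1 / vi for vi in v]
    intercept, slope = weighted_line_fit(x, y, w)
    q_res = sum(wi * (yi - intercept - slope * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    sw, swx = sum(w), sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sw2 = sum(wi**2 for wi in w)
    sw2x = sum(wi**2 * xi for wi, xi in zip(w, x))
    sw2xx = sum(wi**2 * xi * xi for wi, xi in zip(w, x))
    # Trace correction for the two estimated coefficients (intercept and slope).
    trace_term = (swxx * sw2 - 2 * swx * sw2x + sw * sw2xx) / (sw * swxx - swx**2)
    return max(0.0, (q_res - (len(y) - 2)) / (sw - trace_term))

def meta_regression_r2(x, y, v):
    """Slope and percentage of between-study variance explained by x."""
    t0, t1 = tau2_null(y, v), tau2_with_covariate(x, y, v)
    # R2 is truncated at zero, so an uninformative covariate yields 0%.
    r2 = 0.0 if t0 == 0 else max(0.0, (t0 - t1) / t0) * 100
    _, slope = weighted_line_fit(x, y, [1 / (vi + t1) for vi in v])
    return {"slope": slope, "tau2_null": t0, "tau2_model": t1, "r_squared_pct": r2}

# Hypothetical study-level data: mean age, SF minus PF difference, sampling variance.
fit = meta_regression_r2([60, 70, 80, 90], [8.0, 4.0, 5.0, 1.0], [0.25] * 4)
print(fit)
```

Because R2 is truncated at zero, a covariate that does not reduce the estimated between-study variance is reported as explaining 0% of it.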
A multivariate model built considering the group and age variables revealed that a large proportion of the task performance difference could be explained by the variable age (R2 = 43.60%), despite high heterogeneity (I2 = 82.24%). However, age was not statistically significant in explaining the task performance difference (p = 0.06). A lack of statistical power could explain the non-significance of this variable (see below). The negative coefficient (−0.13) indicates that the task performance difference is affected by increasing age in each group.
DISCUSSION
The present study aimed to compare SF and PF task performance in three groups of interest: healthy volunteers, volunteers with a probable AD diagnosis, and diagnosed AD patients, considering the variables age, gender, and years of education. Applying diverse protocols to measure SF and PF yields different estimates of task performance. We thus implemented strict inclusion criteria to select studies that reported the same protocol to evaluate SF and PF in healthy adults, adults with a probable AD diagnosis, and diagnosed AD patients. Our results support the claim that performance in SF and PF tasks significantly differs in healthy volunteers. We observed that the three groups consistently performed better on SF than PF. However, when progressing to an AD diagnosis, SF and PF performance deteriorates equally, but significant differences in task performance remain. This finding challenges the practice of focusing primarily on semantic processing in diagnosing dementia, highlighting the benefits of examining performance on both tasks from the early stages.
We found that healthy volunteers produced more words in the SF task than patients with suspected or confirmed AD diagnoses, similar to the findings of Saranpää et al. [50] in patients with mild cognitive impairment and AD patients. However, we observed this pattern not only for SF but also for PF. Although high heterogeneity was reported across our selected studies, Cochran’s Q test showed significant differences between both types of fluency tasks among all groups (p = 0.04). This difference is pronounced in the group of healthy volunteers; in individuals where the disease has progressed, performance in both tasks decreased, but these differences remained significant, with SF performance better than PF performance. Our results are aligned with studies that postulate that SF is less impaired than PF in healthy and pathological aging [19, 20]. Hart et al. [31] found a similar result in a sample of subjects with AD, contradicting the prevailing hypothesis that SF is the most affected in this pathology. Later, Comesaña and Coni [20] confirmed this finding, reporting that diagnosed AD patients show better SF performance than PF. A retrospective study also showed that AD patients performed better in SF tasks [19]. A sample of patients with mild to moderate AD found retrieving words from semantically defined categories easier than by first letters [51]. While there is significant evidence that SF performance is more impaired in healthy subjects as they progress to AD stages [15, 52], our results report the opposite.
SF and PF performance depends on partially different cognitive-linguistic systems, and the deterioration of these systems is different as AD progresses. PF places greater demands on frontal lobe-mediated cognitive processes [53], imposes fast mental processing [54], information evocation, and search, and requires decoding as well as encoding, specifically linked to the initial letter [55]. On the other hand, SF requires retrieval strategies that use the meaning of words within a semantic category and generate consecutive responses that are semantically similar. This occurs partly because of automatic associative retrieval processes [56] drawing inferences about the relatedness of concepts encoded in semantic memory [57]. Several studies argue that as language is semantically represented and organized [58, 59], retrieving words from a semantic category would be consistent with how language is stored in the mind. In contrast, word representations are not alphabetically organized [60], meaning that PF requires engaging additional cognitive strategies. Letter retrieval involves exploring more subsets of categories than retrieving names from a specific semantic category [61, 62], which could be related to the worse performance in PF observed in our results, given possible higher task demands in PF compared to SF. A previous meta-analysis investigated healthy volunteers’ brain activation during phonemic and semantic verbal fluency tasks. Evidence arose for spatially different activation in BA 44, but not in other regions of the left inferior frontal gyrus and the middle frontal gyrus (BA 9, 45, 47) during phonemic and semantic verbal fluency processing [63]. Schmidt and colleagues [64] also identified circumscribed parts of the left inferior frontal gyrus and left superior and middle temporal gyrus that are significantly double dissociated concerning their differential contribution to PF and SF, respectively.
Meinzer et al. [65] reported differential brain activity patterns of older and younger adults performing SF and PF tasks. Young adults recruited different subparts of the left inferior frontal gyrus for SF and PF, but older adults did not show this distinction. The PF task was comparable for both age groups and was reflected in strongly left-lateralized (frontal) activity patterns. In older adults, poor PF performance was accompanied by additional right frontal activity, while SF is more related to the temporal cortex function [66]. Merging behavioral data with task-driven neuroimaging data might shed light on the neural mechanisms underlying these tasks in healthy and pathological aging. Behind this differential performance in SF and PF, a neural network might start failing, and it might deteriorate faster in the presence of dementia. Assessing SF and PF could therefore support early detection of deterioration and the promotion of cognitive stimulation during the early stages of aging.
A univariate model was performed comparing the performance of SF and PF tasks between the groups of healthy volunteers, volunteers with a probable AD diagnosis, and diagnosed AD patients. Although the difference was not significant, the volunteers with a probable AD diagnosis produced an average of 2 words fewer than the healthy volunteers in both fluency tasks (p = 0.2; coefficient = −1.8). However, in the group of diagnosed AD patients, performance was significantly worse than in the healthy volunteers (p = 0.02; coefficient = −3), with approximately 3 fewer words generated in each fluency task. For those with suspected AD and when the disease was diagnosed, there was an equivalent drop in performance for both fluency tasks as well. This decline in fluency performance has been associated with impaired executive functioning in healthy subjects and people with AD [12], and happens because both fluency tests require organization, inhibition, cognitive flexibility, and working memory, which are key executive components for word search and retrieval.
In our analyses, we adjusted for possible confounding factors, including demographic variables, such as age, years of education, and gender (considered as the proportion of females). Among these, only age significantly affected SF and PF task performance. Our univariate model results revealed that performance declines in both tasks as we age. This finding aligns with Olabarrieta-Landa et al. [21] and Pakhomov et al. [67]. By contrast, years of education failed to account for the difference in performance between verbal fluency tasks across the groups. The literature reports an association between years of education and better fluency performance [21, 36]; however, this variable is not consistently considered during clinical assessment. Our results revealed that the decline in performance as the disease becomes established is not explained by years of education. This observation meshes with Sherman et al. [49], where no significant differences were found for age, education level, symptom duration, or rate of dementia progression. Itaguchi et al. [68] also found that age and education did not influence SF differences between healthy older and AD groups. On the other hand, the reported impact of gender on PF and SF performance is inconsistent or nonexistent in the literature [69]. The results of a study in Latin America suggest not adjusting for gender when obtaining percentiles for verbal fluency tests [21]. The study of St-Hilaire et al. [22] also observed that gender did not affect performance in each condition. However, Irvine et al. [42] found that cognitive functions are more severely and more widely affected in women with AD than in men, even accounting for differences in age, education, or dementia severity.
Our analyses highlight that performance on fluency tasks is an indicator to be observed more routinely in clinical practice, which may be associated with AD diagnosis confirmation, since the decline in this performance is not explained by the variable of years of schooling or gender.
When exploring the age variable and groups of interest with a multivariate model, both variables failed to explain the difference in performance between verbal fluency tasks; however, we observed a trend for the age variable (p = 0.06). Given that our meta-analysis used a truncated Knapp-Hartung method to estimate the standard error, we highlight this borderline finding for age. This method has much more appropriate false positive rates than the standard t-test for regression coefficients in meta-regression [70]; however, it might have affected the statistical power to detect the effect of the age variable in our groups.
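To make the effect of the truncated Knapp-Hartung adjustment concrete, the sketch below applies it to the simplest case, a pooled estimate without covariates, whereas the analysis in this review applied it to meta-regression coefficients. The study values and the between-study variance passed in are hypothetical, and the accompanying t-based inference (k − 1 degrees of freedom) is not shown.

```python
import math

def knapp_hartung_se(effects, variances, tau_squared, truncated=True):
    """Pooled estimate and its Knapp-Hartung standard error (no covariates)."""
    k = len(effects)
    weights = [1 / (v + tau_squared) for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Scaling factor: weighted residual variation per degree of freedom.
    q_kh = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects)) / (k - 1)
    # Truncation keeps the adjusted variance from falling below the
    # conventional random-effects variance 1 / sum(weights).
    scale = max(1.0, q_kh) if truncated else q_kh
    return pooled, math.sqrt(scale / sum(weights))

# Hypothetical inputs (NOT the review's data).
effects, variances, tau_squared = [6.0, 2.0, 4.0], [1.0, 1.0, 1.0], 7.0
_, se_plain = knapp_hartung_se(effects, variances, tau_squared, truncated=False)
_, se_truncated = knapp_hartung_se(effects, variances, tau_squared, truncated=True)
print(round(se_plain, 3), round(se_truncated, 3))
```

In this example the untruncated factor is below 1, so truncation yields the larger standard error. This conservative behavior is one way the method could reduce the power to detect a borderline effect.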
This study has several limitations. We performed an extensive search, but due to our restrictive inclusion criteria, we ended up with few studies that could provide the information needed for our meta-analysis. The articles excluded in this study showed different study designs and paradigms for assessing SF and PF tasks. Many of those studies also did not accurately report the early AD stage and did not report the diagnosis quantitatively. Extracting the raw results of the PF tasks was not always possible either, because some studies presented only the summed performance on the fluency test known as “FAS” (fluency for three letters: F, A, and S). We did not consider patients diagnosed with AD with comorbidities in the inclusion criteria, which might also limit the extrapolation of our results. Finally, our review was conducted during the Coronavirus pandemic, and we included studies up to December 2020 thanks to the opportunity of open access initiatives and databases that have supported low- and mid-income countries.
Comparing SF and PF performance in volunteers with a probable AD diagnosis, patients diagnosed with AD, and healthy volunteers is of significant interest in research and clinical practice. Increasing knowledge about verbal fluency in these groups could help determine which verbal fluency tasks are essential to improve early detection and diagnosis of AD patients. Our results are relevant for future research because they highlight the importance of language as a functional cognitive domain that could establish differences in the transition from normal aging to pathological stages such as AD. One strength of this meta-analysis is that we considered studies published in both Spanish and English. Finally, an essential projection is the unification of protocols for assessing verbal fluency tasks (SF and PF). Words produced for semantic categories and a given letter thus have to be based on idiomatic heterogeneity, and it would be helpful to consider a normative letter for language and sociocultural characteristics in PF. The differences in performance through aging are of great interest to evaluate in combination with neuroimaging data, providing further understanding of the neural and cognitive changes associated with underlying pathological aging.
Conclusion
Performance on SF and PF tasks might play a differential role in early AD diagnosis. These two tasks target partially different neurocognitive mechanisms and should be explored for differential diagnosis between normal aging and its transition to pathological stages such as probable and diagnosed AD.
ACKNOWLEDGMENTS
This work was supported by the Health Science Department of Pontificia Universidad Católica de Chile, ANID-Subdirección de Capital Humano/Doctorado Nacional/2021-21212181 (David Toloza-Ramirez), and ANID-Subdirección de Capital Humano/Doctorado Nacional/2023-21230591 (Teresa Julio-Ramos).
FUNDING
The authors have no funding to report.
CONFLICT OF INTEREST
The authors have no conflict of interest to report.
DATA AVAILABILITY
The data supporting the findings of this study are available within the article and/or its supplementary material. Additional information is available on request from the corresponding author.
