Abstract
Ageing is commonly accompanied by declining performance on tasks of episodic memory for information that can be verbally encoded or retrieved (verbal memory) and for figural information that is less easily verbally encoded or retrieved (non-verbal memory) [1–3]. A deficit in episodic memory beyond that expected for normal ageing and subjective reporting of memory decline are features of ‘mild cognitive impairment’ (MCI) [4], [5]. It has been found that MCI can precede Alzheimer's disease (AD) by a decade or more [6], and conversion from MCI to AD has been estimated as high as 80% of cases over 6 years [4]. As part of the determination of MCI, memory tests need to be evaluated in relation to normative values for that person's sex, age and level of education [1]. Known gender differences in memory (women tend to perform better on verbal memory tasks) [1], [3] highlight the need for gender-based normative data, especially since AD affects about twice as many women as men [7]. For many cognitive tests, available normative data are derived from small samples. In addition, much normative data currently available is based on American samples, which may differ from samples derived from other English-speaking populations.
The Melbourne Women's Midlife Health Project (MWMHP) is a longitudinal population-based study of Australian-born women originally established to evaluate factors associated with the menopausal transition [8]. A number of health-related outcomes have been studied in this ageing cohort [9–13]. Cognitive assessment has recently been implemented, including tests of verbal and non-verbal memory.
Method
Participants
In 2002, 372 women aged 56–67 (mean = 60) who were enrolled in the Melbourne Women's Midlife Health Project (MWMHP) were invited to participate in phase I of a study of cognitive function. Participants were originally recruited in 1991 by random telephone dialling in a cross-sectional survey of women within the Melbourne metropolitan area [8]. At that time a total of 2001 women, then aged 45–55, were interviewed and of these, 438 women who fulfilled the selection criteria (having menstruated in the prior 3 months and not taking oestrogen-containing hormone therapy) agreed to participate in the longitudinal study [10], [11], [14]. Demographic data were obtained at baseline, and medical information (face-to-face interview regarding menopausal status, medications, onset of any illness, operations, measurement of blood pressure, glucose and cholesterol levels) has been updated annually. Over 11 years, there has been an 85% retention rate of participants.
Nine women with various neurological disorders (Parkinson's disease, stroke, transient ischaemic attacks, dementia, brain tumour, cerebellar disorder, trigeminal neuralgia and multiple sclerosis [two women]) were excluded from cognitive assessment, as were seven women with a major medical condition. Others declined to participate because of: unwillingness to travel to the study location (n = 25); the need to care for a family member (n = 6); the death of a relative (n = 2); unwillingness to provide a venous blood sample (n = 1); and being too busy or unwilling to participate (n = 64). One woman withdrew from the study part-way through the test session. In total, 257 women completed the cognitive assessment, being the first phase of a study of change in cognitive function over 2 years. Study procedures were approved by the Human Research Ethics Committee at the University of Melbourne.
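The participant flow described above can be tallied as a quick arithmetic check. The sketch below uses only the counts stated in the text; the dictionary keys are illustrative labels, not terms taken from the study.

```python
# Counts are taken from the text; the key names are illustrative labels only.
invited = 372

not_assessed = {
    "excluded: neurological disorder": 9,
    "excluded: major medical condition": 7,
    "declined: unwilling to travel": 25,
    "declined: caring for a family member": 6,
    "declined: death of a relative": 2,
    "declined: unwilling to give blood sample": 1,
    "declined: too busy or unwilling": 64,
    "withdrew during the test session": 1,
}

# Women who completed the cognitive assessment.
completed = invited - sum(not_assessed.values())
print(completed)  # 257
```

The result agrees with the 257 women reported as having completed the assessment.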
Testing procedures
The participants attended the Office for Gender and Health at the Royal Melbourne Hospital between the hours of 8:30 am and 11:15 am for approximately 2 hours. Prior to cognitive testing, informed consent was obtained, blood pressure was measured, a venous blood sample was taken, and a 10-item mood questionnaire, the short version of the Center for Epidemiological Studies Depression Scale (CES-D) [15], was administered. A full cognitive test battery was administered by two psychologists with additional training and supervision by a neuropsychologist and behavioural neurologist.
Cognitive tasks included three verbal memory measures (a shortened version of the California Verbal Learning Test II [CVLT-II] [3], a 10-item supraspan word list recall task [16], and the East Boston Memory Test [EBMT] [17]) and one non-verbal memory measure, the Faces subtest of the Wechsler Memory Scale III (WMS-III) [2]. Additional tests included the Trail Making Test, part B; category fluency; block design; Boston Naming Test; letter-number sequencing; judgement of line orientation; symbol digit modality test; Tower of London; and digits backward. The battery also included the National Adult Reading Test (NART), used to provide an estimate of baseline intelligence [18], [19].
Verbal memory: related word-list learning
The short version of the CVLT-II [3] was the second item in the battery, begun within the first 15 minutes of the testing session. Test administration was modified from standard CVLT-II procedures. The CVLT-II is a 16-item supraspan word-list learning task and includes four words from each of four semantic categories. The original CVLT-II was developed for a US population and contains some words not commonly used by Australians. Therefore, the word ‘aeroplane’ was substituted for ‘subway’ within the semantic category of transport, and ‘rabbit’ for ‘squirrel’ within the category of animals. Words were read to participants at a rate of one per 1.5 seconds. Immediately after the words were read, the participant was asked to repeat as many of the words as she could in any order. This procedure was repeated two more times, resulting in three ‘immediate recall’ learning trials (as opposed to the five learning trials in the original CVLT-II). After a 20–30 minute period during which the participant performed unrelated cognitive tests, she was asked to recall as many words from the list as she could (delayed recall). The maximum possible immediate and delayed recall scores were thus 48 and 16, respectively. A ‘savings score’ was calculated by dividing the delayed recall score by the score obtained for the third learning trial. For the third learning trial, a semantic clustering score was calculated by scoring ‘1’ each time a word from the same category as the previous word was recalled, and then adjusting for the number of such pairings expected by chance given the total number of words recalled [3].
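The two derived scores can be illustrated with a minimal sketch. The savings score follows the definition given above. For the clustering score, the text states only that the observed count was adjusted for chance; the expected-by-chance term used here, (r − 1)(m − 1)/(N − 1) for r words recalled from a list of N words with m words per category, is one common list-based correction and is an assumption, not necessarily the adjustment used in the study. Dropping intrusions and repeated words before counting is also an assumption, and apart from ‘aeroplane’ and ‘rabbit’ the words and the third category are hypothetical.

```python
from typing import Dict, List


def savings_score(delayed_recall: int, trial3_recall: int) -> float:
    """Delayed recall score divided by the third-learning-trial score."""
    if trial3_recall <= 0:
        raise ValueError("third-trial recall must be positive")
    return delayed_recall / trial3_recall


def semantic_clustering(recalled: List[str],
                        category_of: Dict[str, str],
                        words_per_category: int = 4,
                        list_length: int = 16) -> float:
    """Observed same-category adjacent pairs minus a chance expectation.

    Assumption: intrusions and repeated words are dropped before counting,
    and chance is (r - 1) * (m - 1) / (N - 1).
    """
    seen = set()
    correct = []
    for word in recalled:
        if word in category_of and word not in seen:
            seen.add(word)
            correct.append(word)

    # Score 1 each time a word shares a category with the previous word.
    observed = sum(
        1 for prev, cur in zip(correct, correct[1:])
        if category_of[prev] == category_of[cur]
    )
    r = len(correct)
    expected = (r - 1) * (words_per_category - 1) / (list_length - 1) if r else 0.0
    return observed - expected


# Hypothetical partial category map (not the actual test list).
categories = {
    "aeroplane": "transport", "bus": "transport",
    "rabbit": "animals", "horse": "animals",
    "chair": "furniture",
}
recall_order = ["aeroplane", "bus", "rabbit", "horse", "chair"]
clustering = semantic_clustering(recall_order, categories)
savings = savings_score(delayed_recall=6, trial3_recall=12)
```

In this hypothetical recall there are two same-category pairings among five words, against 0.8 expected by chance, so the adjusted clustering score is positive.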
Verbal memory: unrelated word-list learning
A second supraspan word list task [16] was administered after 45 minutes of cognitive testing. This 10-item task had been given to participants 3 years previously [20]. The unrelated words were read at a rate of one per 1.5 seconds. The task differed from the CVLT-II in being shorter (10 items cf. 16), consisting of unrelated nouns of 1–2 syllables and varying in the word order for each of the three learning trials. For the second learning trial, words were read in the reverse order, and in the third trial words were read in the original order but commencing with the sixth word. In addition, the delayed recall trial took place 3–5 minutes after the third immediate recall trial as opposed to 30 minutes. The maximum possible score was 30 for the sum of the learning trials (immediate recall) and 10 for the delayed trial.
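The three presentation orders can be sketched as follows. The text does not say what follows the tenth word in the third trial; the sketch assumes the list wraps around to the first five words. The placeholder words are hypothetical and are not the test items.

```python
from typing import List


def presentation_orders(words: List[str]) -> List[List[str]]:
    """Return the word order for each of the three learning trials."""
    trial1 = list(words)             # original order
    trial2 = list(reversed(words))   # reverse order
    # Original order starting at the sixth word; the wrap-around to the
    # first five words is an assumption.
    trial3 = list(words[5:]) + list(words[:5])
    return [trial1, trial2, trial3]


placeholder_list = [f"word{i}" for i in range(1, 11)]  # hypothetical items
orders = presentation_orders(placeholder_list)
```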
Verbal memory: story learning
Story learning used the East Boston Memory Test (EBMT) [17], which was administered after approximately 30 minutes of testing. This test consists of two parts, immediate recall and short delayed recall. The participant was told: ‘Listen carefully as I read you a short story’, and the following was read: Three children were alone at home and the house caught on fire. A brave fireman managed to climb in a back window and carry them to safety. Aside from minor cuts and bruises all were well.
Participants were then asked to repeat back as much of the story as they could remember. After a break of approximately 3 minutes during which an unrelated task was performed, the participant was again asked to repeat back as much of the story as they could.
Results were scored using both a 6-item and a 12-item scoring system [17]. The 6-item scoring system assessed recall of six main concepts: ‘three children’; ‘the house caught on fire’; ‘the fireman climbed in’; ‘the children were rescued’; ‘the injuries were minor’; and ‘everyone was well’. The 12-item scoring system was based on recall of 12 essential ideas: ‘three’; ‘children’; ‘house’; ‘on fire’; ‘fireman’; ‘came in’; ‘children’; ‘rescued’; ‘minor’; ‘injuries’; ‘everyone’; ‘well’.
Non-verbal memory: face learning
The WMS-III Faces subtest [21] was the third task in the battery sequence. The first trial was typically performed in the first 15 minutes of the testing session. Standard administration procedures were followed. The Faces task is a facial recognition test which requires participants to recognize 24 facial photographs that have been shown to them previously. Each photograph was initially presented for two seconds. Participants were immediately shown a further 48 photographs that included the earlier 24 photographs and 24 new distracter photographs. Participants were required to respond with a ‘yes’ if they had seen the photograph previously or ‘no’ if they had not. Approximately 30 minutes later, during which time participants performed unrelated cognitive tests, another 48 facial photographs were presented, and participants were again asked whether the face was one of the original 24. The WMS-III Faces task yields two scores, one for the immediate recognition trial (Faces I) and one for the delayed recognition trial (Faces II). The maximum possible score on each trial is 48.
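A maximum of 48 per trial is consistent with scoring one point for each correct ‘yes’ to a previously seen face and each correct ‘no’ to a distracter. The sketch below assumes that scoring rule, which is inferred from the stated maximum rather than quoted from the test manual. The response data are hypothetical.

```python
from typing import Sequence


def faces_score(said_yes: Sequence[bool], was_shown: Sequence[bool]) -> int:
    """Count correct recognition decisions.

    Assumption: 'yes' to a target and 'no' to a distracter each score 1.
    """
    if len(said_yes) != len(was_shown):
        raise ValueError("one response is required per photograph")
    return sum(1 for resp, target in zip(said_yes, was_shown) if resp == target)


# Hypothetical trial: 24 targets followed by 24 distracters.
was_shown = [True] * 24 + [False] * 24
# Participant misses 3 targets and falsely recognizes 5 distracters.
said_yes = [True] * 21 + [False] * 3 + [True] * 5 + [False] * 19
score = faces_score(said_yes, was_shown)
```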
Data analysis and scaling
A multiple linear regression analysis incorporating age, years of education and mood score as continuous variables was used to determine which demographic factors affected test performance. To create normative tables, raw scores were converted to scaled scores as follows. Each raw score was transformed into a percentile rank and then into a z-score, which was converted to a scaled score on a scale with a group mean of 10 and standard deviation of 3. A confirmatory factor analysis was conducted on the immediate and delayed recall scores for all of the tests to study potential differences between the various memory tests. Data were analyzed using SPSS V.10 [22].
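The scaling procedure (raw score to percentile rank, to z-score, to a scale with mean 10 and standard deviation 3) can be sketched as below. The text does not specify how ties or extreme ranks were handled; the mid-rank percentile definition and the clamping of percentiles away from 0 and 1 are assumptions made so that the inverse normal transform is always defined. The sample values are hypothetical.

```python
from statistics import NormalDist
from typing import Sequence


def percentile_rank(score: float, sample: Sequence[float]) -> float:
    """Proportion of the sample below the score, counting ties as half.

    Assumption: mid-rank definition, clamped to [0.5/n, 1 - 0.5/n].
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    below = sum(1 for s in sample if s < score)
    equal = sum(1 for s in sample if s == score)
    p = (below + 0.5 * equal) / n
    edge = 0.5 / n
    return min(max(p, edge), 1.0 - edge)


def scaled_score(score: float, sample: Sequence[float],
                 mean: float = 10.0, sd: float = 3.0) -> float:
    """Percentile rank -> z-score -> scale with the given mean and SD."""
    z = NormalDist().inv_cdf(percentile_rank(score, sample))
    return mean + sd * z


raw_scores = [1, 2, 3, 4, 5]  # hypothetical normative sample
median_scaled = scaled_score(3, raw_scores)
top_scaled = scaled_score(5, raw_scores)
bottom_scaled = scaled_score(1, raw_scores)
```

A score at the sample median maps to a scaled score of 10, and scores above or below the median map symmetrically around it. Published normative tables usually round scaled scores to whole numbers, a step omitted here.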
Results
The mean age of the 257 MWMHP participants who underwent full cognitive testing was 59.8 ± 2.5 (56–67). The mean estimated intelligence quotient (based on scores from the National Adult Reading Test [19]) was 115 ± 6.6, which is above average. This is most likely related to the participants being more likely to have completed secondary school compared to women in the Melbourne metropolitan population within the same age range (48% vs. 24%) [23]. In addition, the participants were more likely to be in paid employment (49% vs. 28%). The mean CES-D score was 6.8 ± 4.1, and 75.5% of the women were married or currently living with a partner.
Memory scores were normally distributed for word-list recall tasks and Faces I and II but skewed for scores from the EBMT. Preliminary analyses considered the two scoring systems for the EBMT. Spearman's rank correlation coefficient for the 6-item and 12-item scoring systems was 0.93, demonstrating a high correlation between the two systems. The 12-item scoring system was used in subsequent analyses to maximize recording of details recalled.
Performances on most memory measures were significantly intercorrelated. Immediate and delayed recall scores were strongly correlated on all tasks (r = 0.58–0.70, p < 0.01). There were also strong associations between scores on the two word list tasks (Pearson's product moment coefficient, r = 0.47, p < 0.01, for delayed recall of the two word lists and r = 0.42, p < 0.01 for immediate recall) but weak or no relation between Faces I and II and the several verbal memory tasks (all p > 0.05, except for the immediate recall of the related word list, where r = 0.23 and 0.25, for Faces I and II, p < 0.01).
Education was significantly correlated with immediate and delayed recall performances on each memory task (r = 0.19–0.30, all p < 0.01). As expected, education also correlated significantly with NART scores (r = 0.53). Age, however, was not significantly associated with memory scores within the 11-year span represented by MWMHP participants (56–67 years) (r = −0.11 – −0.01, all p > 0.05). CES-D mood scores were unrelated to memory performances (r = −0.09–0.03, all p > 0.05).
A multiple regression analysis incorporating age, years of education, and mood score as continuous variables confirmed that age and mood did not affect performances on the four memory tests, although there was a tendency for the younger women to perform better. Table 1 shows means and standard deviations for the eight memory measures, stratified by age and education. Although age effects were not significant in our midlife cohort, the well-recognized effect of age, as determined in studies covering a larger age span, was observable, and thus warranted the division of the cohort into the younger (56–59) and older (60–67) groups.
Mean (SD) memory test scores by age and education
The number of years of education accounted for 11% of the variation for immediate recall for semantically related words and 10% for the delayed score, 9% for the unrelated word list immediate recall and 5% for the delayed recall, 4% for the East Boston Memory Test immediate and delayed scores, and 4% for Faces I and 5% for Faces II.
Mild cognitive impairment is sometimes defined in part by memory scores at least 1.5 standard deviations below expected levels in the absence of dementia [4]. Table 1 provides the values for 1.5 standard deviations below the averages. Three women (1%) in this study obtained scores that were below this cut-off, in either immediate or delayed recall, or both, in all categories of memory.
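The cut-off is simple to compute from a group mean and standard deviation. The sketch below shows the arithmetic with hypothetical values; the actual means and standard deviations are those reported in Table 1 and are not reproduced here.

```python
def impairment_cutoff(group_mean: float, group_sd: float,
                      n_sd: float = 1.5) -> float:
    """Score lying n_sd standard deviations below the group mean."""
    return group_mean - n_sd * group_sd


def below_cutoff(score: float, group_mean: float, group_sd: float) -> bool:
    """True if the score falls below the 1.5 SD cut-off."""
    return score < impairment_cutoff(group_mean, group_sd)


# Hypothetical normative values, for illustration only.
cutoff = impairment_cutoff(group_mean=10.0, group_sd=2.0)
flagged = below_cutoff(score=6.5, group_mean=10.0, group_sd=2.0)
```

A low score on its own does not establish MCI, which also requires subjective report of memory decline and the absence of dementia.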
The unrelated word list learning test and the modified CVLT are reported as immediate recall (sum of immediate recall on three learning trials), delayed recall and savings score (delayed score/third learning trial score), and a semantic clustering score is included for the CVLT (Table 1). Cumulative scores over the first three trials for the CVLT list and the unrelated word list are reported, because a cumulative score is more reliable than separate trial-by-trial learning scores and is the best estimate of the asymptotic learning curve which underlies serial acquisition of this type [24]. To determine whether these scores represent similar or different aspects of long-term retrieval, a confirmatory factor analysis was conducted on the immediate and delayed recall scores. Alternatives examined included the hypotheses that: (i) all memory scores measure one long-term retrieval ability; (ii) the scores distinguish between an immediate and a delayed recall ability; and (iii) the scores distinguish verbal from non-verbal recall trials [25], [26]. As with other clinical memory batteries, we modelled measurement variance by freeing the covariances between respective immediate and delayed subtests [27], [28]. The confirmatory factor analysis gave clear-cut results: the third hypothesis, corresponding to Model 3, provided a fit to the data (Table 2). Factor loadings for this solution are shown in Table 3. The correlation between these factors was 0.32. This model assumes that the memory test scores measure two abilities, distinguishing verbal from non-verbal memory. Of note, the model distinguishing immediate from delayed memory (Model 2) provided a poor fit to the data, as did the hypothesis of a single memory ability (Model 1).
Summary of the goodness of fit statistics for the alternative models of long-term retrieval in the immediate and delayed memory scores derived from the CVLT, CERAD, EBMT and WMS-III Faces scores
Completely standardized factor loading matrix of the immediate and delayed verbal and visual memory scores derived from the CVLT, CERAD, EBMT and WMS-III Faces subtest. Factor I can be interpreted as auditory-verbal memory, and Factor 2 as visual memory
To determine whether the strategy of clustering words assisted in recall of words after a delay, semantic clustering on the third learning trial of the related word list (the modified CVLT) was analysed in relation to delayed recall. Participants who used clustering as a recall strategy performed better on delayed recall of the CVLT (r = 0.51). The relationship between clustering on the related word list and delayed recall on the unrelated word list was also analysed. In this case the scores were significantly correlated, but not as highly as within the related word list (r = 0.36), suggesting a weaker relationship.
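The coefficients reported here are correlations between a clustering score and a delayed recall score across participants. A minimal sketch of Pearson's product moment coefficient is given below, assuming paired observations per participant. The study itself used SPSS, so this is an illustration of the statistic rather than the authors' procedure, and the data are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product moment correlation between paired observations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)


# Hypothetical clustering and delayed recall scores for five participants.
clustering_scores = [0.2, 1.0, 1.8, 2.5, 3.1]
delayed_recall = [7, 9, 8, 12, 13]
r_value = pearson_r(clustering_scores, delayed_recall)
```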
Because performance on these memory tests was affected by prior education level (less than 12 years vs. 12 or more years of education), normative tables were constructed accordingly. Tables 4–7 present scaled scores for all of the memory tests described for women aged 56–67 (mean = 60).
Scaled scores from raw data for the modified California Verbal Learning Test (women 56–67, mean = 60)
Scaled scores from raw data for Wechsler Memory Test-III Faces I and II (women 56–67, mean = 60)
Scaled scores from raw data for East Boston Memory Test (women 56–67, mean = 60)
Scaled scores from raw data for the ‘10 unrelated word list’ according to level of education (women 56–67 years, mean = 60)
Discussion
We have derived data that can be used for interpreting results of select tests of verbal and non-verbal memory for Australian women aged 56–67 years. Findings might not be directly applicable to men, whose verbal memory scores tend generally to be slightly lower than those of women [1], [3], and normative values would be expected to differ in populations younger or older than we studied. Nevertheless, with the exception of the unrelated word list learning task [16], the previous absence of Australian norms suggests that investigators will find these data helpful. Moreover, supraspan word lists have been demonstrated to be very sensitive to early changes associated with dementia [4], [5], and hence the construction of these normative data for women with a mean age of 60 will be of benefit in the identification of women who have early signs of disease. The criteria for the determination of MCI include memory scores more than 1.5 standard deviations below the mean in the absence of dementia, in combination with subjective reporting of memory problems (preferably corroborated by a friend or family member) [5]. In relation to clinical use, it is important to note that the position of a particular test in a battery may affect the score obtained on that test. Finally, although validation data have not been obtained, the EBMT might eventually prove useful as a rapid, easily administered screen for memory loss.
Education is known to affect memory performance [16]. In our analyses, as expected, educational level was an important predictor of performance on each memory task, and we present mean and scaled scores separately for Australian-born women with less than 12 years of education and those with 12 years or more. Age was not significantly associated with test performance within the range represented within the MWMHP, but there was a trend for decline in performance with age and mean scores would be expected to be lower in an older cohort. The scaled normative data for the CVLT-II (Table 4) are not directly comparable to previously published norms [3] because of our modification of the testing procedure. Two items in the word list were replaced with more familiar words for use in an Australian population; we administered three learning trials rather than five; and our 25-minute delayed recall was not preceded by presentation of an interference list or by short delayed recall trials. However, standard CVLT-II administration takes about 35 minutes and the abbreviated MWMHP version could be useful when less time is available for memory assessment.
Our findings for the unrelated word list recall task can be compared to those of Collie et al. for Australian women age 50–69 years [16]. Immediate recall scores were similar in the two populations. For women with less than 12 years of education, the mean score was 21.0 in the MWMHP cohort, as compared to 21.9 in the convenience sample of Collie et al. For those with 12 or more years of education, mean scores were 22.5 and 23.0, respectively. However, delayed recall scores were substantially lower among MWMHP participants (mean score of 5.3 vs. 7.1 for < 12 years of education; 6.1 vs. 8.2 for ≥ 12 years), perhaps because prior or intervening cognitive tasks administered between the final immediate recall trial and the delayed recall trial differed between the two studies.
The EBMT is a story-learning task increasingly used in epidemiological studies [17], [29], [30]. Norms for the 12-point scoring system have not been previously available. As suggested by relatively high mean recall scores (Tables 1,6), structured information within the context of a brief narrative is relatively easy for healthy women to learn and retrieve. Forty-four percent of participants performed at ceiling on immediate recall and 35% after a short delay. The EBMT might be particularly useful as a screening test to identify women whose poor memory suggests the need for more detailed cognitive assessment. For example, only 3.9% and 7% of participants recalled seven or fewer essential ideas on immediate and delayed recall trials, respectively.
For Faces I and II, a non-verbal memory task, the WMS-III manual [21] provides combined norms for American men and women, for all education levels, for the 55–64 age group. The demographic data indicate that 74% of the sample had more than 12 years of education, representing a more educated sample than in our study, and in addition they were slightly younger. Therefore, a direct comparison between our study group and an American sample group specific for gender and education was not possible. However, findings were generally similar between the two groups. Australian women with less than 12 years of education performed slightly below the combined male and female American norms for Faces I (immediate recognition trial, 34 vs. 36) but performed slightly better on Faces II, the delayed recognition retrial (36 vs. 35). Australian women with more than 12 years of education achieved the same average scaled score as the combined American sample group for Faces I (36 vs. 36) and performed better on Faces II (37–38 vs. 35). Although MWMHP women on average performed well, a number of participants remarked that some faces did not resemble those commonly encountered in Australia.
In terms of defining what these tests are measuring, the confirmatory factor analysis distinguishes between auditory verbal memory and visuo-spatial memory. Although task demands differed among tests, the results of this analysis provide support for a model of clinical memory test scores that distinguishes between material-specific memory whether measured for acquisition at brief delays or for retrieval after longer delays. In contrast, the model hypothesizing a distinction between immediate acquisition and delayed retrieval provided a relatively poor fit to the data. This result supports similar findings in other clinical memory batteries, using a variety of different tests, including the revised and third editions of the Wechsler Memory Scale [25], [31]. Converging evidence suggests that scores derived from immediate and delayed trials of many clinical memory tests are best interpreted as measuring similar traits, at least in healthy populations, and that the construct validity of material-specific memory tests warrants the separate interpretation of auditory verbal and visuo-spatial abilities.
Another area of discussion is the relationship between performance on recall of related word lists, whereby words can be categorized semantically, and recall of lists containing words that are not related. That is, performance on recall of unrelated word lists, where it is not possible to use the strategy of clustering words into related groups to assist in encoding and retrieval, is potentially different from performance on a list of related words. Semantic clustering has been found to be reduced in patients with neurological disorders, including dementia. However, in some cases, although there may have been damage to the areas of the brain involved in memory, such as the hippocampus and adjacent medial temporal lobe structures, resulting in the person having severe memory deficits, semantic clustering may still be preserved [3]. A potential explanation for this is that semantic clustering is presumed to involve frontal executive skills, and in these particular individuals, these regions were not affected. There are no data to indicate which type of recall task would be more sensitive to very early changes of AD. The differences in test procedures must also be taken into consideration when drawing comparisons. In our tests, the delay between the third learning trial and the delayed recall trial was much shorter for the unrelated word list than for the related word list. Also, the unrelated word list had fewer words, and a shorter list may not be as sensitive in detecting specific deficits as a longer related word list. Depending on the needs of the clinician or investigator, it may be more appropriate to use a list of words drawn from related categories or to use a list of unrelated words, or to use both types of lists.
Acknowledgements
Research was funded in part by a grant from the Alzheimer's Association (USA: IIRG-01–2684). We thank Gael Trytell, Francis Grodstein and women in the Melbourne Women's Midlife Health Project who volunteered for study procedures.
