Abstract
A novel text mining pilot for dementia detection using Linguistic Inquiry and Word Count (LIWC) was tested on public figures’ writings, comparing word choice and affect between those with and without dementia. The differences found in this analysis mirror the expected patterns: writings of people with dementia reflect significantly more analytical thinking words but significantly less authentic and emotional tone. In addition, the analysis found that people with dementia use significantly fewer function words, such as grammar and affect words (happiness, sadness, anger), but tend to use significantly more pronouns in their writings. Written samples of those with dementia also use significantly fewer time-oriented words that indicate past, present, or future. The analysis of free-form text suggests a potential avenue for detecting early changes that correlate with dementia, allowing for early preventative treatment before noticeable cognitive impairment occurs.
Introduction
Globally, dementia is one of the leading causes of disability in older people. 1 Dementia, characterized by a progressive decline of cognitive functions in thinking and memory, leads to behavioral changes and the inability to engage in daily activities. 1 The most common type of dementia, Alzheimer’s disease, affects 60%–80% of all patients. 2 Risk factors for dementia include family history, age, brain trauma, genetic mutations, cardiovascular disease, chronic use of medications that mimic dementia (anticholinergic agents), and lower educational attainment. 3 Dementia diagnoses require a thorough review of personal health history, a physical exam, lab values, brain scans, and functional movement assessments, as well as cognitive testing such as a verbal fluency test, the “Mini-Cog,” or other instruments.2,3
Clinical use of the “Mini-Cog” or other verbal diagnostic tests asks patients to recall words, items, and concepts from short- or long-term memory as indicators of cognitive impairment. Patients with dementia often struggle with object names and active verb usage and are prone to repetitive words and linguistically empty fillers such as “um”.4,5 The verbal fluency test asks patients to list as many animals as they can in 60 s, with <=15 animals listed signaling potential dementia (<=12 if only 1–7 years of education and <=9 with no education). 3 The “Mini-Cog” test asks patients to remember three unrelated words (e.g., river, nation, finger), draw the time of day on a clock face, and then recall the three words. A Mini-Cog score of 0–2 correlates highly with a diagnosis of dementia (each item recalled and an accurate clock are worth one point apiece). 3 Verbal cognitive testing, widely used in healthcare, is unreliable in cases of minimal (onset) impairment and is easily confounded by normal age-related cognitive decline.2,3 Commonly, patients and family members do not mention cognitive impairment to clinicians until the middle or late stages of dementia, when cognition and memory issues have already compromised their health and well-being. 6 While there are no cures for dementia, early detection is critical for delaying the progression through active health management, including promising new drugs that may postpone impairment.
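The verbal fluency cutoffs quoted above can be written as a simple rule. The following is a minimal sketch for illustration only: the function name is hypothetical, and treating more than 7 years of education as the default cutoff of 15 is an assumption drawn from the wording of the text, not from a clinical instrument.

```python
def fluency_flag(animals_named: int, years_of_education: int) -> bool:
    """Return True when a 60-second animal count is at or below the quoted cutoff.

    Cutoffs follow the text: <=15 in general, <=12 with 1-7 years of
    education, and <=9 with no education. The mapping of "more than 7 years"
    to the general cutoff is an assumption.
    """
    if years_of_education <= 0:
        cutoff = 9
    elif years_of_education <= 7:
        cutoff = 12
    else:
        cutoff = 15
    return animals_named <= cutoff
```

A flag from such a rule would only signal potential dementia and the need for fuller clinical evaluation; it is not a diagnosis.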
Early detection allows for more intensive interventions before disability. Researchers have been searching for methods to detect minimal early cognitive effects as an early warning sign.7,8 Efforts have focused on three specific approaches: medical records data mining,9,10 general internet text data mining, 11 and population-wide preventative individual screening. 12 Researchers have examined clinician visit notes within electronic medical records, with limited success, for reported observations of dementia-like precursor patient behaviors combined with sequential brain images.9,10,13,14 Regardless of the model or analysis technique (machine learning, predictive models, structured, and unstructured analysis), these approaches utilize limited medical record data. In most cases, once dementia is suspected, data collection occurs without a comparative baseline measurement. In clinical practice, little time is given to collecting baseline data on cognition or function for two reasons: time constraints and a lack of reimbursement for collecting baseline data that may or may not be needed later. The result is the absence of a comparison data set.
Bull et al. (2016) propose including data collected from individuals outside of medical care such as email, documents, and diaries as a method to diagnose dementia. 11 Bull et al. 11 suggest that these electronic writings combined with personalized patterns of keystrokes and mouse movements might indicate dementia. The slowing down or alteration of keystrokes and mouse movements can indicate signs of dementia, and this approach is currently undergoing trials. 15 Another suggested approach to diagnose dementia is evaluating idea density (low density = cognitive impairment). Idea density describes how efficiently a person conveys ideas without repeating phrases or producing vague responses.16–19 Low grammatical complexity and negativity in writings can predict an eventual dementia diagnosis.16–19 These earlier studies have been challenging to implement due to their reliance on the longitudinal collection of long-form writings augmented with clinical diagnostic supporting findings. Le et al.17,20 have proposed a short-form text analysis and have patented a hierarchical model of text analysis. However, they have yet to create a working model of the proposed analysis.
Researchers have also studied a verbal test administered to a broad population outside of a clinical environment.21–23 These verbal tests use word vector mapping in everyday conversations or conversations with interactive avatars and time series of conversational word repetition as methods for dementia detection.21–23 While successful, these methods also require specialized efforts, longitudinal tracking (including a need for a personal baseline), and large-scale computing to successfully identify potential dementia-related cognitive issues. Text-based methods are more usable on broad populations than spoken analysis due to the relatively small samples needed (200 words minimum) and the wide availability of text samples such as social media posts. Finally, computer scientists have suggested using advanced data mining and text analysis to detect dementia.24,25 However, these researchers still depend upon evaluating patient responses to the neuro-cognitive tests traditionally utilized in health care, even with their inaccuracy in diagnosis.
The promise of artificial intelligence (AI), machine learning, and natural language processing in detecting early changes in cognition is the possibility for broader population screening.26,27 While there are limitations to the data contained in the medical record, the availability of personal writings in the form of blogs, transcriptions from videos posted on social media (TikTok, Facebook Live), social media postings, emails, and other communication offers a potential source of more natural, less edited, and less biased communication to screen. Other methods of potential interest for implementation in electronic health/medical records (EHR/EMR) are additions of initial population screening via speech generation tasks (a scripted passage read aloud by patients) recorded for AI analysis.26,27 Even when screening in the clinical setting, early signs are easy to miss because patients are able to use cognitive reserves to camouflage symptoms, making full-population, baseline, and ongoing screening an advantage for clinicians and a potential area for health informatics to assist clinical care. 28
Recognizing that early detection is vital and that most patients and their families will not ask for a diagnostic exam until impairment is significant, we propose a text-based model utilizing short-form personal writings for early dementia detection. Utilizing various text and data mining methods, we evaluate personal writings for their style, psychological affect, and personal demographic factors, focusing on those associated with an eventual diagnosis of dementia. This work seeks to identify a pattern of pre-dementia diagnosis cognitive issues in text files in early life when prevention efforts could delay dementia onset and advancement.
Methods
This study utilizes the personal writings (number) of public figures due to the high probability that their writing samples exist and are publicly available from different periods of their lives (Appendix A). Preferred data samples are unedited (letters and diaries) rather than edited speeches or published works. As a control group, we also collected writings of public figures who did not die from dementia. Public figures with dementia and control group writings were matched on gender and year of birth, based on the belief that those of the same gender and generation would be more alike in shared-world experiences. Sample cases were in English and predominantly from the United States. Basic demographic data (dates of birth and death, occupation, highest level of education completed, dementia/no dementia diagnosis) were collected, along with writing samples from three periods: early life (29 years of age or younger), middle life (30 years of age to 11 years before the date of death), and late life (the final 10 years before death).
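The three life periods can be expressed as a short classification rule. This is a minimal sketch under stated assumptions: the function name is hypothetical, ages are taken as whole years, and the early-life rule is given precedence when a sample could fall into both the early and late periods (a case the definitions above do not address).

```python
def life_period(age_at_writing: int, age_at_death: int) -> str:
    """Assign a writing sample to the early, middle, or late life period.

    Assumes whole-year ages. Early life is checked first, which is an
    assumption for writers who died before age 40.
    """
    if age_at_writing < 0 or age_at_writing > age_at_death:
        raise ValueError("age_at_writing must lie between 0 and age_at_death")
    if age_at_writing <= 29:
        return "early"
    if age_at_death - age_at_writing <= 10:
        return "late"
    return "middle"
```

For a writer who died at 80, a letter written at 69 falls in middle life (11 years before death), while one written at 70 falls in late life.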
The Linguistic Inquiry and Word Count (LIWC) software evaluation occurred once all the text file writing samples were collected. LIWC is a text analysis program that matches text samples to dictionaries, marking each word in a sample with up to 84 different variables (each word can fall into multiple categories) within 27 dimensions (17 linguistic and 10 relativity) and 44 categories (25 psychological constructs and 19 personal concerns). 29 For example, “cried” would be coded as a negative emotion, sadness, overall affect, and past tense verb in the analysis output. 29 These categories tested positively for internal consistency and validity. 30
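The dictionary-matching idea can be illustrated with a toy example. This is a sketch only: the five-word dictionary and its category labels are hypothetical stand-ins, not the LIWC dictionary, and reporting each category as a percentage of total words is an assumption about the output format.

```python
import re
from collections import Counter

# Hypothetical toy dictionary; a word may belong to several categories.
TOY_DICTIONARY = {
    "cried": {"affect", "negative_emotion", "sadness", "past_tense_verb"},
    "happy": {"affect", "positive_emotion"},
    "we": {"pronoun"},
    "i": {"pronoun"},
    "will": {"future_focus"},
}


def category_percentages(text, dictionary=TOY_DICTIONARY):
    """Return each matched category as a percentage of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {}
    counts = Counter()
    for word in words:
        # Unmatched words still count toward the total word count.
        for category in dictionary.get(word, ()):
            counts[category] += 1
    return {category: 100.0 * n / len(words) for category, n in counts.items()}
```

In this toy setup, a single occurrence of “cried” increments four categories at once, which is how one word can contribute to several variables in the output.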
This descriptive study tests if text-based analysis detects differences between those with dementia and those without dementia. If there are noticeable differences, especially those aligned with expected changes caused by dementia over the life span, this model may provide a fruitful method of identifying early dementia effects in affect from writing samples. This early warning would need follow-up cognitive testing and clinical evaluation. However, it could ensure that if found, early-stage medications can begin prior to significant impairment or cognitive decline when they are most effective at halting disease progression.
Dementia versus non-dementia writings
The dementia group names mainly came from a list maintained by the Alzheimer’s Association and were augmented through the search engine queries “people who died of dementia” and “people who died of Alzheimer’s disease”, resulting in 404 entertainers (actors, writers, directors, singers, and musicians), politicians, and sports figures (athletes and coaches). 31 Of these, 64 born in non-English speaking countries were excluded from the list since their early writings were not in English; most of the remaining figures were U.S. born, followed by the U.K. and Australia. Searches for primary unedited writings excluded even more figures from this list. It was common to find edited samples from authors and politicians, and editing can alter tone and wording. Sixty-two individuals with dementia who had writings over two or three periods (early writings were the most difficult to find) are included.
Demographics of dementia and no dementia cohorts.
To answer the research question of whether there is a textual difference detectable with LIWC between those who develop dementia and the non-dementia group, we evaluated the 18 summary variables that represent 84 individual variable counts using SPSS v26. We tested for normality and found the Shapiro-Wilk test to be significant for the majority of the dependent variables. As a result, we tested our hypothesis using a series of non-parametric tests for independent samples. The first analysis matched the dementia and non-dementia groups with a single file for each individual and included life period; results are shown in the tables. Secondary analysis included life periods (early, middle, and late), comparison of only letters (the least likely text to have been edited), as well as text length comparisons. Results of the secondary analyses are included only when significant.
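A non-parametric comparison of two independent groups can be sketched as follows. The study used SPSS and does not name the specific test, so the Mann-Whitney U test here is an assumption. The function names are hypothetical, the p-value uses the normal approximation without a tie correction, and the Shapiro-Wilk normality check is not reproduced.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1  # positions are 0-based
        start = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U for group_a, two-sided p-value from the normal approximation)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both groups need at least one observation")
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mean_u) / sd_u
    return u_a, math.erfc(abs(z) / math.sqrt(2))
```

In practice, each LIWC summary variable would be passed in turn, with `group_a` holding the dementia scores and `group_b` the non-dementia scores. For small samples or heavily tied data, an exact or tie-corrected test from a statistics package would be preferable to this approximation.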
Results
Composite metrics
Dementia versus no dementia composite LIWC measures.
a Lilliefors significance correction.
Additional analysis of the letters only samples, dementia (n = 46) and non-dementia (n = 46), using a Quade Non-Parametric ANCOVA, found that only the ‘Analytic’ dimension is significantly higher (p = .024) for people with dementia as compared to those with no dementia. When compared by the age range, the letters-only samples show that the ‘Emotional Tone’ dimension was significantly lower in middle age compared to the early age (p = .047) of the people who had dementia. No significant differences existed between the age ranges for the other composite dimensions.
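The general idea behind a Quade rank-based ANCOVA can be sketched as follows: rank the response and the covariate across all observations, regress the centered response ranks on the centered covariate ranks, and run a one-way ANOVA on the residuals. This is a minimal sketch with a single covariate. The text does not name the covariate, so the choice of covariate, the function names, and the single-covariate form are assumptions, and only the F statistic and its degrees of freedom are returned.

```python
def centered_ranks(values):
    """Average ranks (ties share the mean rank), shifted to have mean zero."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    mean_rank = (len(values) + 1) / 2
    return [r - mean_rank for r in ranks]


def quade_f(response, covariate, groups):
    """Return (F, df_between, df_within) for a one-covariate Quade ANCOVA."""
    ry = centered_ranks(response)
    rx = centered_ranks(covariate)
    sxx = sum(x * x for x in rx)
    slope = sum(x * y for x, y in zip(rx, ry)) / sxx if sxx else 0.0
    residuals = [y - slope * x for x, y in zip(rx, ry)]
    by_group = {}
    for label, resid in zip(groups, residuals):
        by_group.setdefault(label, []).append(resid)
    n, k = len(residuals), len(by_group)
    grand_mean = sum(residuals) / n
    # One-way ANOVA on the rank residuals.
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in by_group.values()
    )
    ss_within = sum(
        (r - sum(v) / len(v)) ** 2 for v in by_group.values() for r in v
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k)), k - 1, n - k
```

The resulting F statistic would be compared against an F distribution with the returned degrees of freedom to obtain a p-value.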
Communication patterns
Functional, informal speech, and punctuation scale LIWC.
Function includes: total pronouns (personal pronouns, 1st person singular, 1st person plural, 2nd person, 3rd person singular, 3rd person plural, and impersonal pronouns); grammar other (regular verbs, adjectives, comparatives, interrogatives, numbers, and quantifiers); affect (positive emotion, negative emotions including anxiety, anger, sadness); social (family, friends, female referents, male referents); informal language (swear words, netspeak, assent, nonfluencies, fillers); all punctuation (periods, commas, colons, semicolons, question marks, exclamation marks, dashes, quotation marks, apostrophes, parentheses (pairs), other punctuation).
a Lilliefors significance correction.
b Lower bound of the true significance.
Communication with others includes informal speech and, when written, punctuation. Informal speech includes swearing, netspeak, assent, non-fluencies (“um,” “hm,” “er”), and fillers (“you know,” “I mean,” and other meaningless words) (Table 3). Netspeak refers to shorthand abbreviations used in text messaging such as “LOL” for “laugh out loud” and “XOXO” for “hugs and kisses.” Those with dementia use more informal words and punctuation; however, the difference was not significant (Table 3).
Secondary analysis of the letters-only samples for both dementia and no dementia using a Quade Non-Parametric ANCOVA found that “Grammar” (p = .025) and “Affect” (p = .038) are still significantly lower in the “Dementia” group than in the “No Dementia” group. When the letter text files were compared among group ages, “Affect” was significantly lower among all age ranges (p = .033), especially when comparing the early to middle-life age ranges (p = .028). Analysis of the length of the text files showed no significant difference for the other measures.
Processes, drives, orientation, relativity, and concerns
Drives, relativity, and processes (cognitive, perceptual, bodily).
Processes include: Cognitive (insight, cause, discrepancies, tentativeness, certainty, differentiation); Perceptual (seeing, hearing, feeling); Biological (body, health/illness, sexuality, ingesting); Core Drives and Needs (affiliation, achievement, power, reward focus, risk/prevention focus); Time Orientation (past focus, present focus, future focus); Relativity (motion, space, time); Personal Concerns (work, leisure, home, money, religion, death).
a Lilliefors significance correction.
b Lower bound of the true significance.
Dementia and non-dementia writings show no differences in the cognitive (insight, cause, discrepancies, tentativeness, certainty, differentiation), perceptual (sight, hearing, touch), or biological (body, health/illness, sexuality, ingesting) process word usage. Nor are there significant differences between groups in usage of the core drives and needs, relativity, or personal concerns wording. Core drives and needs words refer to affiliation, achievement, power, reward focus, and risk/prevention focus. Relativity in LIWC analysis includes terms related to motion, space, and time. Personal concerns relate to work, leisure, home, money, religion, and death.
Analysis of letters only and age range showed no significant difference for the composite measures.
Discussion
Linguistic Inquiry and Word Count (LIWC) has been used to identify underlying personality characteristics and effects of behavioral health diagnosis and social relationships in research utilizing text analysis, word categorization, and counts. 29 LIWC’s ability to detect small changes that are less memory dependent and more related to affect and time, such as thinking and focus, makes LIWC a potentially powerful tool for screening dementia-related changes before significant, noticeable impairment. LIWC validation and psychometric testing show that in the normal aging process people in general become less self-centered and more current and future time focused, and that verbal complexity is maintained or increases with age. 29 Broad usage of the “Mini-Cog” to detect cognitive changes common in dementia may only detect significant and late-stage disease, after the point at which the only current dementia treatment (medication) is less effective; it is “too little, too late.” 3 This project attempts to use personal writings throughout a lifetime to see if there is a textual indicator of changes in emotion, wording, and grammar as an early indicator of upcoming cognitive issues such as dementia.
Those with dementia use more analytical words with less authentic and emotional tone wording. Profession-related writings made up 23% of text samples in the dementia group and 4% in the group without dementia. Profession-related writings are more likely to be edited and more formal, perhaps leading to the more analytic wording. Analytical wording is related to formal, logic-based, and hierarchical thinking. Formality and professionalism do not, however, explain the differences in emotional tone, affect, or authenticity. When letters only were compared for the two samples, dementia writings were still significantly different, including less affect, emotional tone, and grammar. Dementia writings include more logical, hierarchical wording along with fewer words in the composite authentic and emotional tone scales. These words indicate that dementia writings are more guarded, socially distanced, and more neutral with less emotion. Lesser use of function words by those with dementia includes less use of other grammar and affect words, but more total pronouns overall. When dementia writings were compared for age ranges, affect was the only significant scale and then only between early and late life samples. As dementia advances, communication and language losses increase, causing difficulties in making semantic and conceptual distinctions; this progression begins with difficulty in naming, then grammar, and finally comprehension. 33 Our results mirror these findings, with those experiencing dementia relying on simplified pronouns and using less of the grammar and affect wording that carries complex semantic and conceptual distinctions, a pattern that amplifies over time. Dementia texts include fewer time-orientation words as well.
Taken together, these more negative and formalized communications written by those with dementia correspond with a less personal and less time-oriented narrative style of writing and confirm that there is something different about the communication of those with dementia beyond what is expected simply from the aging process. 29
Prior research has shown dementia-related effects on written communication, including a marked increase in negativity with low idea density and low grammatical complexity.16–19 Our results support those findings, with individuals diagnosed with dementia using more pronouns overall. In cases where writers struggle with time orientation (especially present and future) and may struggle to recall names, using less grammar and fewer proper names makes communication easier. A writer without cognitive decline is more likely to use a variety of words and phrases to refer to others in correspondence to ensure flow and to avoid repetition. As predicted by others’ research, these word choices, frequencies, and grammatical differences indicate a more general use of written language with less emotion and less complexity.16–19,22,29 Current cognitive testing looks for dementia signs such as lost object naming, fewer active verbs, word repetition, and the use of non-fluencies.3,4 Our findings support these frequencies in text (more pronouns instead of object/subject naming, fewer active verbs).
As people move into more significant impairment due to dementia, limiting their movement due to safety concerns is common. Keeping to the same place, a repeated schedule, and predictable days minimizes confusion and stress. 34 Advancing dementia also reduces person/object/memory recall and awareness, leading to isolation and potentially to a negative emotional state and less personal communication. 35 As the disease progresses and the person’s world contracts, the most vivid memories are often from the distant past stored in long-term memory. 2
This work’s limitations include the reliance on personal writings of various lengths from people of differing education levels. Middle and late-life samples are more likely to have been reviewed for some of these public figures by personal assistants, secretaries, or family members. For this reason, whenever possible personal correspondence was utilized rather than more formal job-related letters. The samples were of different lengths and written for different purposes. When writing to close family and friends or in unpublished diaries, there is a less formal tone. While length of text is variable, the LIWC analysis samples studied were at least 50 words or more in length (98.5% for dementia writings; 100% for non-dementia writings).30,32 In the secondary analysis, greater word count or length of text led to more significant differences and findings. Educational attainment could impact these findings, as education influences vocabulary, grammar, and style.36,37
Conclusion
Linguistic Inquiry and Word Count (LIWC) was able to identify significant differences in word choice and affect in written communications. There were notable differences between those who had dementia and those without dementia. The differences mirror the expected patterns, where people with dementia show more analytic language, less authentic wording, less emotional tone, fewer function words, more total pronouns, less grammar, less affect, and less time orientation than those without dementia. Usage of the brief, unedited written word is a potential tool for identifying early shifts potentially indicative of early dementia symptoms. Future studies need to look at the use of blogs, tweets, and text messages to see if there is a minimum necessary amount of text and if current communication patterns are mappable to suspected dementia-related wording shifts.
Supplemental Material
Supplemental Material - Using personal writings to detect dementia: A text mining approach
Supplemental Material for Using personal writings to detect dementia: A text mining approach by Beni Asllani and Deborah M Mullen in Environment and Planning B: Urban Analytics and City Science
Footnotes
Acknowledgements
This work would not have been possible without the assistance of Thomas Mahvi, Caroline Spradlin, and Megan Woods, who worked to collect these writing samples.
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Neither BA nor DM have any conflicts of interest to disclose.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Gary W. Rollins College of Business, Summerfield Johnston Centennial Scholars Faculty Grant Program.
References
