Introduction
Interest in artificial intelligence (AI) and machine learning (ML) in healthcare is rapidly growing, and there is hope for incorporation of ML into everyday medical practice.1 Natural language processing (NLP) is a particularly enticing tool for clinicians in search of important patterns buried in the massive amount of unstructured text stored in the electronic health record (EHR).2,3
In psychiatry, NLP is especially promising, as diagnostic criteria rely heavily on patient report and are often contained in the narrative text of a clinical encounter.4–6 Several excellent recent reviews have summarized the history of AI and ML, including NLP, and their use in psychiatry, which has accelerated in the past 10 years.3,4,7–12 A recent review found that a sizeable percentage of ML studies (11%) came from psychiatry, with only the imaging and cardiology specialties contributing larger single proportions.13 Experts have developed several powerful tools to make ML and NLP more accessible.14–20 There have been many successful demonstrations of NLP for classification and prediction within a psychiatric clinical context, the breadth of which we cannot fully describe here.21–26 Studies of NLP in psychiatry typically use unstructured data from either the EHR or social media sites, with the goal of extracting or establishing a diagnosis, monitoring changes over time, or predicting onset of disease.4
While NLP research in mental health has grown rapidly, there are challenges to incorporating NLP into routine clinical practice, including clinicians' limited familiarity with ML, limited availability of validated tools and methods at clinical sites, and prohibitive costs for widespread adoption of NLP into patient care.7 Effective implementation of NLP requires knowledge and skills that are not regularly taught during medical training.27,28 The Informatics for Integrating Biology and the Bedside (i2b2) Center issued an NLP challenge in 2016 that demonstrated how complicated and time-consuming it can be to develop well-performing algorithms for psychiatric symptom classification.29 The time constraints of clinical work make it challenging to overcome the learning curve that most clinicians face if they want to use NLP in research studies.
The Clinical Annotation Research Kit (CLARK) is an ML software kit that classifies patient charts based on unstructured data. It was developed for clinicians with varying degrees of expertise in NLP modeling and informatics.30 CLARK has been successfully used to answer clinical questions within various medical specialties, but it has not been used in a psychiatry context.30,31 The primary aim of CLARK is to serve as an open-source, easy-to-use graphical interface for machine learning-based classification of structured and unstructured patient health records. Broadly, CLARK takes as input unstructured clinical notes from patients with known health outcomes, paired with expert-defined lexicons of terms associated with each outcome, and trains classifiers based on the occurrence of those terms within the notes. To do so, CLARK transforms the clinical notes into multi-dimensional feature vectors based on the given lexicons and then applies a user-selected machine learning model to these vectors. After validating the model on a prespecified subset of patients, the classifier is applied to a held-out set of additional patients to assess performance metrics such as accuracy. Full technical details on CLARK have been described previously.30 CLARK enables an end-to-end, clinician-led process that relies on little outside technical expertise and can be rapidly implemented, thus circumventing several of the common challenges of NLP implementation mentioned above.
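As a concrete illustration of this workflow, the sketch below builds lexicon-based feature vectors with regular expressions and fits a scikit-learn random forest. This is not CLARK's internal code; the lexicon terms, example notes, and names (LEXICON, to_features) are hypothetical.

```python
# A minimal sketch of lexicon-driven classification, NOT CLARK's internals.
# The lexicon terms and example notes are hypothetical.
import re

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Expert-defined lexicon: each named feature is a regular expression.
LEXICON = {
    "depressed_mood": r"\bdepress(?:ed|ion|ive)\b",
    "anhedonia": r"\banhedonia\b",
    "antidepressant": r"\b(?:sertraline|fluoxetine|citalopram)\b",
}

def to_features(note):
    """Count occurrences of each lexicon term in one clinical note."""
    return [len(re.findall(p, note, flags=re.IGNORECASE))
            for p in LEXICON.values()]

notes = [
    "Pt endorses depressed mood and anhedonia; started sertraline.",
    "Psychosocial evaluation unremarkable; no psychiatric history.",
    "History of depression, currently on fluoxetine.",
    "Denies low mood; sleep and appetite intact.",
]
labels = [1, 0, 1, 0]  # gold-standard case/control labels from expert review

# Transform notes into feature vectors, then fit a user-selected model.
X = np.array([to_features(n) for n in notes])
clf = RandomForestClassifier(random_state=0).fit(X, labels)
print(clf.predict([to_features("Assessment: depression, stable.")]))
```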
In this study, we tested CLARK's ability to classify unstructured clinical notes of solid organ transplant recipients based on the presence or absence of a depression diagnosis and/or substance use diagnosis, and compared the results generated by CLARK to a gold-standard classification by mental health experts. Additionally, to compare classical statistical learning NLP methods with more recent transformer models, we compared CLARK's performance to that of a Bidirectional Encoder Representations from Transformers (BERT) model from the open-source Hugging Face library.32 The impact of pre-transplant psychiatric disorders on solid organ transplant recipient outcomes has been widely studied and debated.33–43 Psychiatric diagnoses are common pre-transplant; for example, 20%–40% of candidates will experience depression pre-transplant.33,34,39,40 However, accurate retrospective extraction of pre-transplant psychiatric diagnoses is a time-intensive task. While all organ transplant recipients undergo psychosocial evaluation, diagnoses are rarely included in the problem list. Therefore, the population provides ample representation of cases and controls; methods to accelerate accurate retrospective identification of cases from the EHR could benefit research efforts within the transplant community; and the results could inform future development of algorithms to identify psychiatric disorders in the wider population of medically ill patients. This work demonstrates the first use of CLARK in a psychiatry context and shows that an NLP software kit can aid research by clinicians with a wide range of knowledge in ML techniques.
Methods
All adults (age 18 or older) who received a solid organ transplant at a tertiary care center over a four-and-a-half-year timeframe were included in the analysis. The final dataset included 652 unique individuals.
Demographics, transplant wait list duration, and transplant-specific medical history were extracted from a central data repository containing clinical, research, and administrative data sourced from our health care system, with the ability to query most data elements as far back as mid-2004.
Psychiatric diagnosis was manually abstracted from the EHR to establish a "gold standard" labeled dataset. The EHR was reviewed by a content expert (a licensed psychologist) and two graduate-level research assistants. Research assistants completed thorough training in standardized chart review methods to discern whether individuals met criteria for psychiatric disorders based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5).44 To ensure inter-rater reliability, research assistants were required to demonstrate 100% agreement in data extraction of diagnoses in a parallel record review with the psychologist prior to completing independent reviews. After research assistants demonstrated consistent expertise, the psychologist completed continuous randomized reliability checks on 10% of all records. Depression diagnoses included Major Depressive Disorder (MDD) and Other Depressive Disorder (not meeting MDD criteria). Substance use disorder (SUD) diagnoses included all SUDs other than Tobacco Use Disorder but did not include isolated episodes of substance use not meeting disorder criteria.
We trained models to detect the occurrence of substance use disorder or depression diagnoses in the EHR using CLARK, an open-source, interface-guided NLP toolkit developed for computable phenotyping in unstructured data such as clinical notes. CLARK has been described in detail previously.30 Broadly, CLARK takes as input unstructured clinical notes paired with user-defined lexicons of regular expression (regex) terms. CLARK builds a classification model that links these terms with the potential outcomes and gives a probabilistic confidence of the outcome for each record. All modeling was performed using CLARK v1.0. We compared CLARK's performance to that of a BERT model built with the open-source Hugging Face library.32
To create the corpus, we extracted clinical notes from our EHR and central data repository. Clinical notes were extracted from 2 years prior to the transplant date until the date of transplant and filtered by author credentials, author department, or encounter department to include those from psychiatry, psychology, and case management (as available). The corpus was duplicated to create separate classification models for depression and substance use diagnoses. Within each corpus, each record was labeled as case or control based on the "gold standard" generated by content experts. The corpus and gold-standard labels for each diagnostic category were combined into a JavaScript Object Notation (JSON) formatted corpus for use with CLARK, and into tokenized inputs for use with Hugging Face.32 We did not perform any language pre-processing for the CLARK analyses. For tokenization and use in Hugging Face, text was processed with the DistilBERT tokenizer45 to convert it into inputs usable with BERT models, and each note was reduced to its first 512 tokens for compatibility and speed of model training with Hugging Face.32,46 Each fully labelled patient record was randomly assigned to one of two datasets: a dataset used to train the classification model ("training dataset") and a held-out evaluation (or "test") dataset used to assess model performance. The distribution of categorical variables across training and evaluation datasets was assessed with Pearson's chi-square tests; continuous variables were assessed with t-tests (if parametric) or two-sample Kruskal–Wallis (Mann–Whitney) tests (if nonparametric). Statistical significance was set at alpha = 0.05.
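For the BERT arm, the tokenization step described above maps onto the standard Hugging Face tokenizer call; a minimal sketch follows, assuming the distilbert-base-uncased checkpoint (the exact checkpoint is not specified here) and the notes list from the previous sketch.

```python
# Sketch of the tokenization step: truncate each note to its first 512
# tokens (padding shorter notes) for use with a DistilBERT model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encodings = tokenizer(
    notes,                 # list of clinical-note strings
    truncation=True,       # keep only the first 512 tokens of each note
    padding="max_length",  # pad shorter notes to the same length
    max_length=512,
)
print(len(encodings["input_ids"][0]))  # 512 for every note
```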
Regex terms developed for depression and substance use disorder models.
Within CLARK, we trained our classification models using 5- and 10-fold, random versus stratified cross-validation and then applied a random forest, Gaussian naïve Bayes, or linear support vector machine classifier to predict each outcome.47 CLARK uses the default settings of the Python scikit-learn Random Forest Classifier.48 After achieving a satisfactory result with the training dataset, we input the evaluation dataset using the same classifier and cross-validation settings to evaluate the performance of the models.
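Although CLARK drives these choices through its interface, the equivalent scikit-learn calls look roughly like the sketch below, reusing the feature matrix X and labels from the earlier sketch (a realistically sized corpus is assumed, since 5-fold splits need at least five records per class).

```python
# Sketch of the cross-validation and classifier choices named above,
# expressed directly in scikit-learn (CLARK's interface wraps similar calls).
from sklearn.ensemble import RandomForestClassifier  # default settings, per CLARK
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

classifiers = {
    "random forest": RandomForestClassifier(),
    "Gaussian naive Bayes": GaussianNB(),
    "linear SVM": LinearSVC(),
}
folds = [
    KFold(n_splits=5, shuffle=True, random_state=0),            # random CV
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # stratified CV
]
for name, clf in classifiers.items():
    for cv in folds:
        scores = cross_val_score(clf, X, labels, cv=cv)
        print(f"{name} / {type(cv).__name__}: mean accuracy {scores.mean():.2f}")
```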
Within Hugging Face,32 we loaded the pretrained DistilBERT model for sequence classification, a smaller, faster model that can run locally.45 We set a learning rate of 0.00002, batch sizes of 16 for both training and evaluation, weight decay of 0.01, and 50 training epochs.
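A minimal sketch of this fine-tuning configuration with the Hugging Face Trainer follows; the checkpoint name and dataset wrapper are assumptions, and train_enc/eval_enc stand in for the tokenized training and evaluation splits.

```python
# Sketch of the stated fine-tuning configuration (checkpoint name assumed).
import torch
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class NoteDataset(torch.utils.data.Dataset):
    """Wrap tokenizer output and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item
    def __len__(self):
        return len(self.labels)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # case vs. control
args = TrainingArguments(
    output_dir="bert_runs",
    learning_rate=2e-5,               # 0.00002, as reported
    per_device_train_batch_size=16,   # training batch size
    per_device_eval_batch_size=16,    # evaluation batch size
    weight_decay=0.01,
    num_train_epochs=50,
)
# train_enc/eval_enc: tokenized splits from the step above (placeholders).
trainer = Trainer(model=model, args=args,
                  train_dataset=NoteDataset(train_enc, train_labels),
                  eval_dataset=NoteDataset(eval_enc, eval_labels))
trainer.train()
```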
Performance metrics including prediction accuracy, sensitivity, specificity, balanced accuracy (the average of sensitivity and specificity), positive predictive value (PPV), negative predictive value (NPV), Cohen's kappa, and F1 were computed to assess performance. We also computed receiver operating characteristic (ROC) curves and their corresponding area under the curve (AUC). To determine how well the models within CLARK would perform when a depression or SUD diagnosis was not available in the EHR problem list, model performance was assessed when restricted to patients without the corresponding diagnosis in the problem list. Similarly, model performance within CLARK was compared when restricted to recipients of kidney transplant versus other organ recipients.
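All of these metrics derive from the four confusion-matrix cells; a short worked sketch with hypothetical counts:

```python
# Hypothetical confusion-matrix counts (tp, fp, tn, fn are illustrative).
tp, fp, tn, fn = 45, 32, 150, 35

sensitivity = tp / (tp + fn)                      # recall among true cases
specificity = tn / (tn + fp)                      # recall among true controls
balanced_accuracy = (sensitivity + specificity) / 2
ppv = tp / (tp + fp)                              # positive predictive value
npv = tn / (tn + fn)                              # negative predictive value
accuracy = (tp + tn) / (tp + fp + tn + fn)
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
print(round(balanced_accuracy, 2), round(f1, 2))
```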
Statistical analysis
Statistical analyses were performed using Microsoft Excel49 and R v. 4.2.1;50 the following R packages were used: pROC v. 1.18.0,51 ROCR v. 1.0-11,52 caret v. 6.0-92,53 and cvms v. 1.3.4.54 Hugging Face hub 0.23.4 and CUDA 12.6 were used for the BERT model. Study data were collected and managed using Research Electronic Data Capture (REDCap).55,56
Results
Characteristics of solid organ recipients, stratified by training and evaluation sets.a
aMean and standard deviation or median and IQR provided for continuous measures; count and percentage provided for discrete data. The distribution of categorical variables across training and evaluation sets was assessed with Pearson's chi-square tests; continuous variables were assessed with t-tests (if parametric) or two-sample Kruskal–Wallis (Mann–Whitney) tests (if nonparametric). Statistical significance was set at α = 0.05. Counts less than 10 suppressed to limit risk of identifying individuals.
bSome recipients received more than one organ; thus, the percentages for each solid organ transplant type sum to >100%.
Abbreviations: IQR, interquartile range; SD, standard deviation.
Within CLARK, using the random forest classifier with 5-fold random cross-validation, the SUD model had a specificity of 91% (and negative predictive value of 90%) but a sensitivity of 56% (and positive predictive value of 58%). Overall, the SUD model had an accuracy of 84%, balanced accuracy of 73%, F1 of 0.57, and AUC of 0.73 (Table 2, Figures 1 and 2). In contrast, the depression model had an accuracy of 69%, balanced accuracy of 69%, F1 score of 0.68, and AUC of 0.69 (Table 3, Figures 1 and 2); its sensitivity and specificity were comparable at 70% and 68%, respectively. In a subgroup analysis of patients without a mental health diagnosis in the EHR problem list, the models' performance was similar to that of the main analysis (Table 3, Figures 1 and 2). In a subgroup analysis of kidney transplant recipients versus other organ transplant recipients, depression model performance was similar across both transplant groups, while the substance use model had a notably lower F1 for the kidney transplant subgroup, mirroring the very low substance use prevalence in this group (Table 4). Results using 5-fold (vs. 10-fold) and random (vs. stratified) cross-validation were comparable, emphasizing the robustness of our models across different cross-validation choices; results from the random forest classifier were superior for both models compared to those using Gaussian naïve Bayes and linear support vector machine classifiers (data available upon request).

Confusion matrices for each CLARK model. This figure provides a visual and numeric summary of each model's performance for the full cohort (plots a and b) and for charts without a depression or substance use diagnosis in the problem list (plots c and d, respectively). Depression models are in plots a and c; substance use disorder models are in plots b and d.

Receiver operating characteristic curves for the depression and substance use disorder models, by presence vs. absence of the corresponding diagnosis in the problem list: depression, all charts (light grey, solid); depression, no diagnosis in problem list (light grey, dashed); substance use disorder, all charts (black, solid); substance use disorder, no diagnosis in problem list (black, dashed).

Performance metrics for CLARK models for charts with and without psychiatric diagnoses in the problem list. Abbreviations: SUD, substance use disorder; Spec., specificity; Sens., sensitivity; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve.

Performance metrics for CLARK models by organ transplant type. Abbreviations: Spec., specificity; Sens., sensitivity; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve.

When comparing CLARK's performance to that of the Hugging Face BERT model,32 the BERT model did not perform as well as CLARK, though the patterns of performance across the two models were similar, with notably high specificity and low sensitivity for the SUD model, while the depression model provided a more balanced performance (Supplemental Table 1). The BERT depression model had notably low NPV, whereas the BERT SUD model had high NPV but low PPV.
Discussion
This study describes the first application of the machine learning toolkit CLARK to classification of unstructured EHR data by psychiatric diagnosis. When comparing model performance, the depression model made more balanced predictions on the test set, with metrics for cases and controls being very similar. The low prevalence of SUD compared with depression diagnoses led to a naturally higher SUD model specificity. The high specificity and NPV of the SUD model are in line with other studies diagnosing SUDs,57–59 but the model's sensitivity of 56% (and positive predictive value of 58%) indicates that the model struggled to identify those with substance use. Our models' performance was not affected by the presence of a diagnosis in the problem list or medical history, as sensitivity analyses including only records without a depression or SUD diagnosis in the problem list revealed similar performance metrics for both models.
We aimed to test the performance of CLARK without the addition of other text processing or ML tools. Therefore, when compared to the existing literature, our study is also notable for its lack of pre-processing of text and reliance on unstructured notes.4 Pre-processing text can improve model accuracy, but also represents a significant practical barrier for clinicians with a limited NLP skillset.4 Overall, our models performed well given our approach, though it is difficult to directly compare results across the NLP literature given the numerous combinations of data, methods, classifier types, and platforms.4
Several issues can hinder accurate model performance. Neurovegetative symptoms are highly prevalent in the medically ill: clinical notes for these patients often contain references to fatigue, low appetite, weight change, and poor concentration, all of which are also DSM-5 criteria for major depressive disorder.44 This overlap in symptomatology could make it harder to find features that distinguish the notes of individuals with a depressive disorder diagnosis from those without one. Aside from symptom overlap, words like "depressed" may appear within an illness description such as "depressed cardiac functioning."
Bias and limitations
Given that our population was limited to solid organ transplant recipients, sampling bias was inherently present, and there may be characteristics unique to individuals who receive a solid organ transplant that affect our results. Our sample size was restricted to the population of transplant recipients over the specified timeframe; no sample size calculation was performed a priori. Race and ethnicity are known to influence psychotic disorder diagnosis.60 When creating the gold standard, reviewers were not blinded to any aspect of the medical record, and unconscious bias related to patient demographics could have influenced their diagnoses. Because NLP of unstructured text inherently relies on the accuracy of the text, bias may be introduced by the author of the note;6 a recent study showed that clinicians may use different language in notes depending on a patient's race.61 Although we do not have the issue of "black box" algorithm creation that happens in unsupervised ML, a supervised learning approach can still create or perpetuate bias depending on the quality of the input.62
The notes in our corpus reflect the documentation style of transplant clinicians at a large academic medical center in the southeastern United States that uses a single EHR, and the generalizability of our lexicon may be limited by these factors.3 The types of note templates and the proportion of unstructured text relative to discrete data fields may also differ across hospitals. Substance use can vary dramatically by location,63 making our lexicon subject to overfitting. The reviewers creating the gold standard could see each record's problem list, medication list, and other structured data, along with more unstructured data than was extracted for analysis by CLARK. As a result, a record could have been labeled positive for a diagnosis based on data not included within the corpus CLARK analyzed. While this was likely an uncommon event, it is important to consider when designing future studies involving NLP.
While CLARK is easy to navigate, its limited customizability and reliance on regular expressions are limitations. The user can choose between four classifier methods and quickly edit regular expressions to find the best combination of classifier and terms. However, more complex but widely used preprocessing tools such as the clinical Text Analysis and Knowledge Extraction System (cTAKES) allow investigators to take advantage of medical vocabulary dictionaries and phenotypes that have already been developed.64 Using regular expressions without preprocessing can also lead to unintended matches; for example, adding "thc" as a term without properly denoting word boundaries will match "forthcoming," potentially leading to the mistaken inclusion of a chart in the substance use category. That being said, further use of CLARK for psychiatric diagnosis categorization could also lead to the development of more generalizable and accessible regular expression libraries.
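A two-line example of the pitfall, using Python's re module:

```python
# Without word boundaries, "thc" also matches inside "forthcoming".
import re

text = "Follow-up is forthcoming; urine screen positive for THC."
print(re.findall(r"thc", text, re.IGNORECASE))      # ['thc', 'THC']
print(re.findall(r"\bthc\b", text, re.IGNORECASE))  # ['THC'] only
```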
Robustness is a key factor in model trustworthiness,65 and while CLARK does offer four options for cross-validation of the algorithm, additional cross-validation options and validation on an external dataset would further demonstrate robustness. We had just one dataset available, which prevented us from comparing performance on data from other institutions.
At a more fundamental level, most NLP tools require an English corpus,3 making it difficult to include unstructured data containing direct quotations or wording from non-English-speaking patients in the United States (US). Additionally, our classification approach was based on the US classification system of psychiatric disorders, the DSM-5, as opposed to an international system. Even within the US, there has long been disagreement regarding the classification of psychiatric disorders, with other systems, such as the Research Domain Criteria (RDoC), proposed as alternatives that rely on neurobiological mechanisms.66
Newer learning models are increasingly used in NLP research involving mental health, including transformer-based methods.67 There are potential healthcare uses for transformer-based pretrained language models, such as BERT, and large language models (LLMs), such as ChatGPT, but training these models carries a large cost in time, expertise, and data storage.67 Indeed, when comparing CLARK's performance to a BERT model built with the open-source Hugging Face library,32 CLARK demonstrated a more robust overall performance. Notably, many open-source BERT models, including those from Hugging Face,32 require a fixed input sequence length across all samples, either truncating or "padding" input in order to optimize model processing speed.46 In general, the more accessible options (such as Hugging Face) have shorter sequence limits, and this likely affected our BERT model's performance when compared to CLARK, which can manage input of variable length. Running models that can account for longer sequences, such as our EHR free-text notes, is significantly more computationally expensive. In addition to these barriers, there remain substantial concerns about bias, privacy, and interpretability when using deep learning in healthcare.67 While there is ongoing effort to make LLMs more usable and more transparent, in our experience this type of learning model is a less accessible way for a clinician to perform text classification.
Clinical application and future directions
CLARK's ready-made approach to NLP software could be a key step toward making ML and NLP techniques accessible at early levels of training. Medical students and other trainees could develop targeted projects using CLARK and become better versed in NLP as they advance in their careers. CLARK is an excellent toolkit that provides an introduction to coding through regular expressions. As CLARK is used more broadly, more regular expression libraries can be created and shared. Free toolkits like CLARK could help overcome several known barriers to widespread NLP use.7
Explainability of identified features is important for interpreting results from machine learning algorithms and therefore enhances model trustworthiness.65 While CLARK does allow review of each patient record to see the highlighted regular expression features, it does not include a built-in method to determine the importance of each feature to the prediction outcome. The addition of feature importance metrics to CLARK would both help the clinician refine feature selection and improve explainability.
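Outside CLARK, the underlying scikit-learn random forest already exposes one such metric; a sketch, reusing the fitted clf and LEXICON from the earlier example:

```python
# Mean-impurity-decrease importance of each lexicon term (illustrative).
for term, importance in zip(LEXICON, clf.feature_importances_):
    print(f"{term}: {importance:.3f}")
```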
CLARK has several applications for mental health clinicians. We used NLP models to classify medical records by diagnosis; a similar approach could identify specific diagnoses within a population. If combined with expert review, using CLARK or similar software with a higher confidence threshold could be especially helpful for sorting study participants based on psychiatric diagnoses or for screening large patient databases for population health or research purposes. Indeed, CLARK could be used as part of a two-step screening process: charts not meeting an established confidence threshold could be flagged for further analysis by a clinician. Of note, the confidence of assignment to a certain outcome, such as depression or SUD, depends on the predictive model being used. In the random forest-based model within CLARK, confidence is defined as the proportion of decision trees that predict a patient to have a certain outcome.68,69 For the linear support vector machine and Gaussian naïve Bayes models within CLARK, confidences (class probabilities) were generated using the Python scikit-learn function predict_proba.70 Confidence for BERT is calculated by applying a softmax operator (a transformation that turns scores into probabilities) to the logits (the raw prediction scores from a transformer) output by the model.71 CLARK allows users to set the confidence threshold cut-off; by raising the threshold, only records meeting it are included in the output, and model accuracy on those records is drastically improved. A confidence threshold can also be set to maximize specific model metrics, such as accuracy or specificity. However, as this leaves a portion of records uncategorized, we opted to keep the default confidence threshold setting for our analysis.
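A sketch of this two-step idea with scikit-learn's predict_proba (the 0.8 threshold and variable names are illustrative, not CLARK's defaults):

```python
# Two-step screening sketch: auto-label confident records, flag the rest.
import numpy as np

THRESHOLD = 0.8                            # illustrative cut-off
proba = clf.predict_proba(X)               # per-class probabilities
confidence = proba.max(axis=1)             # confidence of the top class

auto_labeled = confidence >= THRESHOLD     # accept the model's call
needs_review = ~auto_labeled               # route to manual chart review
print(f"{needs_review.sum()} of {len(X)} charts flagged for clinician review")
# For BERT, confidence would instead be softmax(logits).max() per record.
```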
CLARK does not replace the expertise of a clinician. In the transplant population we studied, the decision to place a patient on the waiting list for a solid organ transplant is a life-and-death decision. Any data incorporated into this decision must be of the highest accuracy, so NLP methods do not appear appropriate for this clinical application at this time. However, in a research scenario, NLP could lessen the work burden on manual reviewers and focus the efforts of statistical or informatics experts when time and funds for these supports are limited.
Conclusions
The role for NLP in psychiatry will likely only grow. Experience with NLP will enable psychiatrists to understand the strengths and limitations of this tool. Software for NLP will have to find the best balance between accessibility and accuracy, as well as accessibility and generalizability. Our work has demonstrated that highly accessible software toolkits can produce helpful and meaningful results. Unstructured data can be cumbersome to manipulate, but incorporation of NLP techniques with unstructured data has been shown to improve algorithm results.64 As NLP is used more broadly in psychiatry, the classification schemes may even inform a better overarching classification model for our field.
Author contributions
A.H. drafted and edited the manuscript and created and tested the regular expressions in CLARK. T.Z. performed statistical analyses, drafted sections of the manuscript, and edited the manuscript. B.B. provided education and guidance around use of CLARK as applied to this particular project, including formatting EHR data for use. T.R., J.S., and J.S. created the gold standard. T.R. edited the manuscript. C.R.C. collected key aspects of the dataset and edited the manuscript. M.K. provided guidance around analyses performed. R.P.N. generated initial project concept and provided project leadership, collected data, generated the gold standard, edited the manuscript, and contributed to statistical analyses. All authors gave approval for the final manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (UL1TR002489), and the Foundation of Hope for Research and Treatment of Mental Illness (Seed Grant). Michael Kosorok and Tarek M. Zikry are supported by UL1TR002489. Tarek M. Zikry is additionally supported by NIH (F31HL156464-03). Rebekah Nash is supported by the National Institute of Mental Health (K23MH128613), the Foundation of Hope for Research and Treatment of Mental Illness (Seed Grant), and the Doris Duke Charitable Foundation (grant #2020143). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or other funding sources.
Ethical statement
Supplemental Material
Supplemental material for this article is available online.
References
