Abstract

This research extracted patient-reported symptoms from free-text EHR notes of colorectal and breast cancer patients and studied the correlation of the symptoms with comorbid type 2 diabetes, race, and smoking status. An NLP framework was first developed that uses UMLS MetaMap to extract all symptom terms from the 366,398 EHR clinical notes of 1694 colorectal cancer (CRC) patients and 3458 breast cancer (BC) patients. Semantic analysis and clustering algorithms were then developed to categorize all the relevant symptoms into eight symptom clusters defined by seed terms. After all the relevant symptoms were extracted from the EHR clinical notes, the frequency of the symptoms reported by CRC and BC patients over three time periods post-chemotherapy was calculated. Logistic regression (LR) was performed with each symptom cluster as the response variable while controlling for diabetes, race, and smoking status. The results show that CRC and BC patients with type 2 diabetes (T2D) were more likely to report symptoms than CRC and BC patients without T2D over the three time periods in the cancer trajectory. We also found that current smokers were more likely than non-smokers to report anxiety (CRC, BC), neuropathic symptoms (CRC, BC), and depression (BC).

Keywords

data mining, electronic health records, machine learning, text mining, clinical decision-making

Introduction

Cancer patients commonly experience symptoms such as pain, depression, and fatigue as a consequence of undergoing chemotherapy treatment, and these symptoms may develop or persist even after the chemotherapy ends. These symptoms add to patients’ distress and functional impairment if left untreated. The literature shows that individual differences have associations with the symptoms and patient experience.^1,2 The symptoms could be gastrointestinal symptoms including nausea, vomiting, lack of appetite, or psychoneurological symptoms including depressive symptoms, anxiety, or other types.

The majority of research studies gather patient-reported symptoms through symptom measurement questionnaires.^3–5 Cheung et al.⁶ applied Principal Component Analysis (PCA) to a cohort of 1366 cancer patients who completed the Edmonton Symptom Assessment Scale (ESAS) questionnaires to determine the inter-relationships of the nine symptoms. They identified two major symptom clusters from their study cohort. One included fatigue, drowsiness, nausea, decreased appetite, and dyspnea; the other included anxiety and depression. Marshall et al.⁷ investigated symptom clusters in women with breast cancer using social media data. The k-medoid clustering method was used to cluster the symptoms. The similarity measure was developed based on the frequency of the co-occurrences of the 25 symptoms included in the survey. Unstructured EHR clinical notes include a vast amount of important information regarding patient-reported symptoms. However, there is limited research focused on extracting symptoms from the clinical notes and then analyzing the symptoms with other clinical data for risk assessment, outcome prediction, or clinical decision making. Vijayakrishnan et al.⁸ developed a natural language processing (NLP) procedure to identify signs and symptoms of heart failure (HF) patients using electronic health records (EHR). The method is designed to target HF patients specifically. Jackson et al.⁹ developed a suite of language models to specifically capture key symptoms of severe mental illness (SMI) from clinical text. Forsyth et al.¹⁰ developed a named entity recognition algorithm to extract breast cancer symptoms from the EHR. However, the semantic relations between various symptoms were not considered, and the symptoms were not used to analyze any clinical outcomes. Divita et al.¹¹ developed an NLP pipeline to extract symptoms from the clinical notes. The pipeline considers Part-Of-Speech (POS) tags and some syntactic rules, but no semantic similarity or embedding was used. 
Gundlapalli et al.¹² investigated an NLP pipeline that extracts urinary symptoms to detect indwelling urinary catheters. They first defined a set of lexicons for urinary symptoms, and then NLP tools were used to detect the positive and negative cases. Koleck et al.¹³ conducted a systematic review of NLP techniques applied to clinical notes in the EHR for symptom analysis. They found that only 14 studies had symptom-related information as the primary outcome. Although the studies included various symptoms, only nine studies reported patient demographic characteristics. Vaci et al.¹⁴ used a Long Short-Term Memory (LSTM) based named entity recognition (NER) model along with active learning to identify depression-related symptoms from various sections of the EHR. However, this research focused on depression symptoms only, and patient outcomes were not measured.

In this research, we use EHR data and focus on mining and analyzing the symptom clusters of patients with or without type 2 diabetes (T2D) and diagnosed with colorectal or breast cancer. We analyzed eight different types of symptoms and their correlations with comorbidity—diabetes, race, and smoking status. The research presented in this paper is different from previous research as it includes: (1) extracting from the EHR clinical notes and selecting symptom types (such as fatigue, gastrointestinal issue, and so on) that are relevant to both cancer patients and people with T2D; (2) investigating pretrained BioWord2Vec for semantic analysis of the seed symptom representations, and applying a hierarchical clustering algorithm for symptom clustering in order to identify relevant symptom expressions in the clinical notes; and (3) exploring the symptoms and their association with comorbidity—T2D in colorectal cancer and breast cancer patients.

Study cohorts

The study cohort in this research consists of patients with a primary diagnosis of breast cancer (BC) or colorectal cancer (CRC) who have electronic medical records in a large academic medical center’s EHR system. BC and CRC patients were identified using the International Classification of Diseases (ICD). The ICD codes for BC are 174 (ICD-9) and C50 (ICD-10); the ICD codes for CRC are 153–154 (ICD-9) and C18–C20 (ICD-10). Through these ICD codes, we identified the BC and CRC cases that received chemotherapy during the 10-year period 2007–2017. For each case, we extracted clinical notes; demographic data, including gender, race, and smoking status; and cancer characteristic data, including the pathologic stage of cancer, age at diagnosis, and the number of comorbidities.
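
As a rough illustration of the cohort selection, the ICD code families listed above can be matched by prefix. This is a minimal sketch: the helper name `cancer_type` and the prefix-matching rule are assumptions made for illustration and are not taken from the study's extraction code.

```python
# Code families taken from the text: BC = 174 (ICD-9), C50 (ICD-10);
# CRC = 153-154 (ICD-9), C18-C20 (ICD-10).
BC_PREFIXES = ("174", "C50")
CRC_PREFIXES = ("153", "154", "C18", "C19", "C20")


def cancer_type(icd_code):
    """Return "BC", "CRC", or None for a diagnosis code (hypothetical helper)."""
    code = icd_code.strip().upper()
    if code.startswith(BC_PREFIXES):
        return "BC"
    if code.startswith(CRC_PREFIXES):
        return "CRC"
    return None


print(cancer_type("C50.911"))  # a breast cancer code in the C50 family
print(cancer_type("153.9"))    # a colorectal cancer code in the 153 family
print(cancer_type("E11.9"))    # not in either family
```

A real pipeline would also need the chemotherapy and date filters described above, which this sketch leaves out.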

In total, there were 1694 CRC patients, and 3458 BC patients included in this study. There were 112,298 clinical notes for CRC patients and 254,100 clinical notes for BC patients. Over 600 types of clinical reports were included.

Methods

System design

In this research study, we focused on eight types of symptoms: “Fatigue,” “Anxiety,” “Cognitive Issues,” “Neuropathy,” “Sleep Issues,” “Gastrointestinal (GI) Issues,” “Functional Status,” and “Depression.” Each is defined by a symptom cluster that includes a set of seed terms relevant to the corresponding symptom. Based on our previous work¹⁵ and current literature,^16–26 we identified the symptoms a priori. The seed symptoms selected for inclusion in this study are the most common symptoms reported by cancer survivors and people with type 2 diabetes. Table 1 provides the seed terms for each symptom cluster, which were provided by a domain expert (Author Storey). However, the seed symptom terms cannot cover the various representations of the symptoms in the EHR clinical notes. Hence, we developed an NLP framework to identify and extract all the terms relevant to these symptom clusters from the EHR.

Table 1.

Seed symptom clusters.

Symptom cluster	Seed symptoms
Fatigue	fatigue, listless, weary, weariness, lethargic, lethargy, no energy, tired, sleepy, drowsy, exhausted, exhaustion, worn out, drained
Peripheral neuropathy	peripheral neuropathy, numbness, tingling, burning, crawling
Depression	depression, depressed, sad, unhappy, no appetite, failure to thrive, despair, misery, melancholy, hopeless, down-hearted, despondent
Anxiety	anxiety, anxious, apprehensive, nervous, stressed, uptight, tensed, can’t relax, worry, worried, worrisome, panic, panicked, panic attack, irritable, overwhelmed, nervousness, fretful
Cognitive issues	cognitive issue, forgetful, chemo brain, fogginess, lack of concentration, difficulty remembering, memory loss, spaced out, can’t stay focused, can’t think straight, brain drain, loss of right words to say
Sleep issues	can’t sleep, insomnia, restlessness, wakeup, can’t fall asleep, can’t stay asleep, interrupted sleep, sleeplessness
Functional status	inability to care for self, can’t do the things I used to do, can’t keep up, loss of stamina, unsteady gait, unsteady, falling, falls, gait changes, can’t walk, can’t walk far, shaky, slowing down, loss of strength, weak, need help cooking/eating/driving/walking/bathing
GI issues	anorexia, poor appetite, not hungry, no or decreased appetite, bad taste in mouth, constipation, bloating, diarrhea, cramping, nausea, vomiting, thirsty, difficulty swallowing

The overall system design for symptom extraction, clustering, and data analysis using the EHR clinical notes along with patient attributes (including diabetes status and smoking status) is shown in Figure 1. Given all the clinical notes of the study cohort, we first utilized UMLS MetaMap,¹⁵ an NLP tool that uses various sources to categorize the phrases or terms in the text into different semantic types, to identify all the symptom terms. Based on the initial eight symptom clusters defined in Table 1, we selected the following semantic types: “Sign or Symptom,” “Mental and Behavioral Dysfunction,” “Mental Process,” “Finding,” “Daily or Recreational Activity,” and “Home Care Activity.” Figure 2 provides an example of clinical notes with highlighted terms mapped into the selected semantic types using UMLS MetaMap. The negation detection and Word Sense Disambiguation (WSD) functionalities were turned on when using UMLS MetaMap; therefore, negated and ambiguous terms are not included. After symptom terms were identified and extracted from the clinical notes, symptom clustering through semantic analysis was used to expand the symptom clusters in Table 1 by adding more symptom expressions extracted from the clinical notes. We first used pretrained BioWord2Vec²⁷ to convert symptoms into vector representations. Research¹⁶ showed that BioWord2Vec outperformed other word embeddings on multiple NLP tasks in the biomedical domain. We then used a hierarchical clustering algorithm to generate sub-clusters of the symptoms. The reason for using a hierarchical clustering algorithm is that the symptom terms within each symptom cluster are semantically related but might belong to different sub-clusters based on their representations. These extended eight symptom clusters were used to identify the patients with any of these symptoms, and then to calculate the severity of the symptoms. 
This, in turn, made it possible to apply statistical analysis to examine the association of the symptoms with other patient attributes, such as diabetes comorbidity and smoking status.

Figure 1.

Overall system design.

Figure 2.

Symptom extraction using UMLS MetaMap.

Semantic analysis of the symptoms and symptom clustering

To discover more symptom expressions that are not listed in Table 1, we used semantic analysis to first group the symptom terms within each cluster into semantically meaningful sub-clusters, and then employed further semantic analysis²⁸ to identify the symptom expressions in the clinical notes that are close to the sub-clusters.

To analyze the semantic meaning of various symptom terms, we used concept embeddings to measure the semantic similarities between all extracted symptoms and the seed symptom clusters to identify additional symptom expressions within the EHR clinical notes. Since 2013, various neural networks have been investigated for generating word or concept embeddings. The vector representation of the embeddings is capable of capturing the semantic associations between the words or concepts. BioWord2Vec²⁹ includes pretrained biomedical word embeddings^30,31 using PubMed and MeSH sequences. In this research, we utilized the pretrained BioWord2Vec to generate symptom embeddings. For a symptom term consisting of more than one word, the symptom embedding is computed as the element-wise sum of the embeddings of its words. The semantic similarity between two symptoms can then be calculated as the cosine similarity between their embeddings. Figure 3 shows the selected symptoms in the clusters as a heatmap of the symptom cosine similarity matrix using embeddings generated from BioWord2Vec. The higher the similarity score (the lighter the cell), the more semantically similar the symptoms are. As Figure 3 shows, symptoms within the same cluster have high similarities. For example, “numbness” and “tingling” both belong to the cluster “Neuropathy,” and the cosine similarity between them (0.92) is high, and higher than that seen with the terms in the other clusters. However, some symptoms from different clusters also show a semantic relationship. For example, the semantic similarity score between “tired” and “sleeplessness” is also high (0.85), which means that “tired” often co-occurs with or is semantically related to “sleeplessness.” This demonstrates that some symptom clusters are correlated.
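
The embedding and similarity computation described above can be sketched as follows. The three-dimensional word vectors are invented for illustration; real BioWord2Vec vectors are much higher-dimensional and would be loaded from the pretrained model. The function names are likewise assumptions.

```python
import math

# Hypothetical toy word vectors, made up for this example only.
TOY_VECTORS = {
    "finger": [0.2, 0.1, 0.0],
    "numbness": [0.9, 0.1, 0.1],
    "tingling": [0.8, 0.2, 0.1],
    "cough": [0.0, 0.1, 0.9],
}


def term_embedding(term, word_vectors):
    """Element-wise sum of the word vectors of a (possibly multi-word) term."""
    vectors = [word_vectors[w] for w in term.split()]
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


numbness = term_embedding("numbness", TOY_VECTORS)
for other in ("tingling", "finger numbness", "cough"):
    score = cosine_similarity(numbness, term_embedding(other, TOY_VECTORS))
    print(f"numbness vs {other}: {score:.2f}")
```

With these toy vectors, "tingling" and "finger numbness" score close to 1 against "numbness", while "cough" scores low. This mirrors the qualitative pattern the heatmap is meant to show, not the reported values.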

Figure 3.

Heatmap based on the cosine similarity matrix using BioWord2Vec embeddings.

In order to identify the additional terms written in the EHR that are semantically close to the terms in the seed clusters, we first applied a hierarchical clustering algorithm³² to form sub-clusters within each seed symptom cluster. Then, we measured the similarity of a given new term to each of the sub-clusters to extend the seed symptom clusters. Hierarchical clustering works as an iterative process, which starts with the points as individual clusters and, at each iteration, merges the most similar pair of clusters. Before the process starts, the similarities between all pairs of clusters are computed. This generates an initial similarity matrix (Z) whose Z_uv entry gives the similarity (S(u, v)) between clusters u and v. The similarity matrix is updated at every iteration to reflect the pairwise similarity between the new cluster and the original clusters. Equation (1) is the average linkage function used to calculate the cosine similarity of the newly formed cluster u with each remaining cluster v. It averages the similarities $s (u_{i}, v_{j})$ over all pairs of data instances $u_{i}$ in cluster u and $v_{j}$ in cluster v, where $| u |$ and $| v |$ are the numbers of instances in the two clusters.

$$S (u, v) = \sum_{i, j} \frac{s (u_{i}, v_{j})}{| u | \times | v |} \tag{1}$$
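
As a small worked example of Equation (1), the sketch below computes the average-linkage similarity between two clusters from a precomputed similarity matrix. It uses the standard definition of average linkage, with the sum of pairwise similarities divided by the product of the two cluster sizes. The matrix values are invented for illustration.

```python
def average_linkage(u, v, sim):
    """Average of sim[i][j] over all pairs with i in cluster u and j in cluster v."""
    total = sum(sim[i][j] for i in u for j in v)
    return total / (len(u) * len(v))


# Hypothetical cosine similarities between three terms (indices 0, 1, 2).
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]

# Similarity between the merged cluster {0, 1} and the singleton cluster {2}:
# the mean of sim[0][2] and sim[1][2].
print(round(average_linkage([0, 1], [2], sim), 3))
```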

However, hierarchical clustering by itself does not determine the proper number of clusters, so we developed our own approach to identify the optimal number of clusters. The following steps were taken to form the clusters, where $o_{i}$ ( $i \in \{1, \ldots, n\}$ ) is a data instance in the input, $n$ is the total number of data instances, and $s (o_{i}, o_{j})$ is the cosine similarity between $o_{i}$ and $o_{j}$ . The threshold $θ$ was used both to select the centroid of each cluster, based on the similarities between the instances and the candidate centroid, and to expand the clusters. In order to avoid involving unrelated terms in the symptom clusters, we chose 0.8 as the threshold value after empirical studies of $θ$ from 0.6 to 0.9.

Step 0: Set k = 1

Step 1: Initialize the number of clusters to k

Step 2: Generate k clusters based on the hierarchical clustering results

Step 3: For each cluster (containing $m$ instances), select as the centroid the data instance $o_{i}$ whose average similarity to the data instances ( $o_{j}$ ) in the cluster is larger than the threshold ( $\frac{\sum_{j = 1}^{m} s (o_{i}, o_{j})}{m} > θ$ ) and larger than the average similarity between any other instance ( $o_{p}$ ) and the data instances in the cluster ( $\frac{\sum_{j = 1}^{m} s (o_{i}, o_{j})}{m} > \frac{\sum_{j = 1}^{m} s (o_{p}, o_{j})}{m}$ )

Step 4: If all clusters have a selected centroid, output all clusters with their centroids; otherwise, set $k = k + 1$ and continue to Step 2.
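
Steps 0 to 4 can be sketched in plain Python as below. This is an illustrative reading of the procedure, not the authors' implementation. It assumes that cutting the hierarchy into k clusters is equivalent to merging until k clusters remain, that the average in Step 3 includes the instance's similarity to itself (as the formula is written), and that ties are broken by taking the first member rather than at random. The two-dimensional vectors are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_matrix(vectors):
    """Pairwise cosine similarities, with the diagonal fixed at 1.0."""
    n = len(vectors)
    return [[1.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


def average_linkage(u, v, sim):
    """Average similarity over all pairs of instances from clusters u and v."""
    return sum(sim[i][j] for i in u for j in v) / (len(u) * len(v))


def cut_into_k(sim, k):
    """Step 2: merge the most similar pair of clusters until k clusters remain."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], sim))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def select_centroid(cluster, sim, theta):
    """Step 3: member with the highest average in-cluster similarity,
    or None if that average does not exceed theta (first member wins ties)."""
    m = len(cluster)

    def avg(i):
        return sum(sim[i][j] for j in cluster) / m

    best = max(cluster, key=avg)
    return best if avg(best) > theta else None


def cluster_with_centroids(sim, theta=0.8):
    """Steps 0-4: increase k until every cluster has a centroid."""
    k = 1
    while True:
        clusters = cut_into_k(sim, k)
        centroids = [select_centroid(c, sim, theta) for c in clusters]
        if all(c is not None for c in centroids):
            return list(zip(clusters, centroids))
        k += 1


# Hypothetical 2-dimensional vectors standing in for real embeddings.
terms = ["falls", "falling", "weak"]
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
result = cluster_with_centroids(similarity_matrix(vectors))
for members, centroid in result:
    print([terms[i] for i in members], "centroid:", terms[centroid])
```

The loop always terminates: once every cluster is a singleton, each instance has an average similarity of 1 with itself, which exceeds any threshold below 1.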

Figure 4 shows the dendrogram visualization of the hierarchical clustering results for the seed terms in the category “Functional status.” The centroids of the three clusters described below are calculated using Step 3 described above. If there are two terms in a cluster, the centroid is randomly selected. The terms “falling” and “falls,” connected with light blue lines, are within one cluster, and “falls” is selected as the centroid. The terms “unsteady” and “unsteady gait,” connected with green lines, are in one cluster, and “unsteady” is selected as the centroid. The terms “need help cooking/walking/driving/eating/bathing,” connected with purple lines, are within one cluster, and “need help eating” is selected as the centroid. Each of the remaining terms, such as “weak” and “slowing down,” forms a cluster of its own. For the GI symptoms, the terms “poor/no/decreased appetite” were in one cluster with “no appetite” as the centroid. The terms “bloating” and “cramping” were in one cluster with “bloating” as the centroid. The terms “nausea” and “vomiting” were in one cluster with “nausea” as the centroid. The rest of the terms were independent clusters. For anxiety symptoms, the terms “nervous” and “nervousness” were in a cluster with “nervous” as the centroid. The terms “worry” and “worried” were in a cluster with “worry” as the centroid. The terms “panic” and “panic attack” were in a cluster with “panic” as the centroid. The rest of the terms were independent clusters. For cognitive issues, all terms were independent clusters. For fatigue symptoms, the terms “exhaustion” and “exhausted” were in a cluster with “exhaustion” as the centroid. The terms “lethargic” and “lethargy” were in a cluster with “lethargic” as the centroid. The rest of the terms were independent clusters. For sleep issues, the terms “can’t fall asleep” and “can’t stay asleep” were in a cluster with “can’t fall asleep” as the centroid. 
The terms “restlessness” and “sleeplessness” were in a cluster with “restlessness” as the centroid. The rest of the terms were independent clusters. For depression symptoms, “depression” and “depressed” were in a cluster with “depression” as the centroid. The rest of the terms were independent clusters.

Figure 4.

Hierarchical clustering results of the seed terms in category “functional status”.

After the sub-clusters were identified, each symptom cluster was expanded by measuring the cosine similarity between the centroids of the clusters and the terms extracted from the EHR clinical notes. If the cosine similarity between an extracted term and the centroid of a cluster was over the threshold θ (set to 0.8 after experiments with values from 0.6 to 0.9), the extracted term was added to that cluster as a new representation. A new representation can be added to more than one cluster if it meets the threshold requirement for each. If the similarity between the extracted term and the centroid of every cluster is below the threshold, the term is not selected as a symptom relevant to the symptom clusters. For example, “cough” is an extracted symptom but is not close enough to any of the centroids, so it is not included in the symptoms. Through this process, terms extracted from the clinical notes were added to the clusters. For example, “neuropathic pain” was added to the cluster “Neuropathy,” and “extreme exhaustion” was added to the “Fatigue” cluster. Some symptoms, such as “finger numbness,” which adds a body part to one of the seed symptoms, are also identified through this process. Table 2 shows the additional symptoms of the CRC and BC cohorts extracted from the EHR, demonstrating that the semantic analysis and clustering algorithm are capable of identifying symptoms with similar semantic meaning within the EHR. Some typos are also captured through this analysis. Comparing the additional symptoms of the CRC and BC cohorts identified through the method, there are more neuropathy and GI symptom representations for the BC cohort, whereas more representations of functional status symptoms were extracted from the CRC cohort.
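
The expansion rule can be sketched as a simple threshold test against each centroid. For brevity the example uses one centroid per symptom cluster, whereas the procedure above keeps one centroid per sub-cluster. All vectors are hypothetical, and the function name `expand_clusters` is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def expand_clusters(centroids, extracted, theta=0.8):
    """Add each extracted term to every cluster whose centroid similarity exceeds theta."""
    expanded = {name: [] for name in centroids}
    for term, vec in extracted.items():
        for name, centroid_vec in centroids.items():
            if cosine(vec, centroid_vec) > theta:
                expanded[name].append(term)
    return expanded


# Hypothetical centroid and term embeddings (invented for illustration).
centroids = {"Neuropathy": [0.9, 0.1, 0.1], "Fatigue": [0.1, 0.9, 0.1]}
extracted = {
    "finger numbness": [1.1, 0.2, 0.1],
    "extreme exhaustion": [0.2, 1.0, 0.1],
    "cough": [0.0, 0.1, 0.9],
}

expanded = expand_clusters(centroids, extracted)
print(expanded)
```

Here "cough" falls below the threshold for both centroids and is dropped, matching the behaviour described in the text.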

Table 2.

Additional symptoms extracted from EHR through symptom cluster expansion.

Symptom clusters	Extracted CRC symptoms	Extracted BC symptoms
Fatigue	generalized fatigue, extreme exhaustion, physical exhaustion	–
Neuropathy	burning micturition, epigastric burning, tingling sensation, perioral tingling, peripheral neuropathic pain, sciatic neuralgia, acute pain, perioral numbness, leg numbness, hand numbness, throat numbness, extremity numbness, toe numbness, lower extremity numbness, numbness of finger, oral numbness, extremity numbness, numbness fingers, foot numbness	chest burning, leg burning, foot burning, eye numbness, hand numbness, upper extremity numbness, finger numbness, numbness of hand, facial numbness, numbness feet, leg numbness, numbness toe, tongue numbness, numbness of fingers, numbness extremity, thigh numbness, numbness thigh, numbness of extremities, numbness of upper arm, foot numbness, peripheral numbness, lip numbness, lower extremity numbness, feet numbness, extremity numbness, shoulder numb ness, upper arm numbness, numbness tongue, numbness fingers, fingers numbness, numbness of finger, numbness foot, numbness lip, numbness of toe, neuropathic pain, peripheral neuropathic pain, limb numbness, toe numbness
Depression	symptoms of depression	–
Anxiety	situational stress, nervousness, shaking, miserable, trembling, feeling jittery	–
Cognitive issues	short term memory, memory problems, poor memory, memory impairment	forgetfulness, decreased concentration
Sleep issues	hypnagogic, difficultly sleeping, excessive daytime sleepiness, trouble sleeping, fragmented sleep, restless sleep, sleeping difficulty	can’t fall asleep, can’t stay asleep, restless sleep, nightmares
Functional status	stiffness, tandem gait, bilateral foot pain, neck stiffness, back stiffness, lower leg edema, lower limb pain, anterior knee pain, extremities edema, shoulder stiffness, ataxic gait, posterior neck pain, edema of foot, edema of foot, cramp of limb, generalized muscle weakness, decreased grip strength, loss of proprioception, quadriceps weakness, arm muscle weakness, repeated falls	–
GI issues	diminished appetite, reduced appetite, abdominal cramp, abdominal fullness, gas bloating, bloating symptoms, persistent vomiting, vomited, intermittent vomiting, postprandial vomiting, uncontrollable vomiting, vomit, nausea emesis, vomiting symptom, recurrent vomiting, acute vomiting, vomiting diarrhea, chronic vomiting, feculent vomiting, diarrhea vomiting	bloating symptoms, bloated feeling, gas bloating, bloating gas, constipate, constipating, diarrhea, diarrhea symptoms, diarrheas, reduced appetite, decrease appetite, nausea/vomiting, morning vomiting, recurrent vomiting, acute vomiting, intermittent vomiting, vomiting symptoms, vomit, constipated, intractable vomiting, diarrhea vomiting, nauseated, persistent vomiting, acute constipation, chronic constipation, nausea emesis, nausea sickness, nauseas

Frequency measure of the symptoms

This research aims to investigate the relationship between each symptom cluster and other variables, such as T2D status, of the study cohorts. We calculated the frequency of each symptom cluster within a timeframe using the document frequency (DF), which is often used in the NLP domain. In our study, each document is a clinical note. If a clinical note contains one or more symptom expressions from a symptom cluster, it adds 1 to the DF of that cluster. The higher the DF, the more frequently the symptom cluster is reported to the physician across different clinical notes, indicating greater bother caused by the symptoms.
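
The DF count can be illustrated with a few invented notes. The sketch below uses plain case-insensitive substring matching, which is an assumption made to keep the example short. The study identifies symptom mentions with MetaMap, including negation detection, which simple substring matching does not replicate.

```python
def document_frequency(notes, cluster_terms):
    """Number of notes that mention at least one expression from the cluster."""
    terms = [t.lower() for t in cluster_terms]
    return sum(1 for note in notes if any(t in note.lower() for t in terms))


# Invented example notes for one patient within one timeframe.
notes = [
    "Patient reports fatigue and feels tired most days.",
    "No new complaints today.",
    "Extreme exhaustion after the last chemotherapy cycle.",
]
fatigue_terms = ["fatigue", "tired", "extreme exhaustion"]

print(document_frequency(notes, fatigue_terms))
```

The first note mentions two fatigue expressions but is still counted once, so the DF for the fatigue cluster in this example is 2.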

Statistical analysis

Descriptive statistics were used to describe demographic and medical characteristics, and covariates were summarized using frequencies with percentages (values in parentheses) for categorical variables or means with standard deviations (values in parentheses) for continuous variables. Logistic regression (LR) was used to evaluate the effect of T2D on each of the eight symptoms (Fatigue, Anxiety, Cognitive Issues, Neuropathy, Sleep Issues, GI Issues, Functional Status, and Depression) during each of the three timeframes (within 6 months, between 12 and 18 months, and between 24 and 30 months) after the patient’s first chemotherapy. Logistic regression was performed with the symptom as the response variable, controlling for T2D, age, race, gender (only for CRC), and smoking status. All analyses were performed using SAS, version 9.4 (SAS Institute; Cary, NC).
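
To make the reported quantities concrete, the sketch below computes an unadjusted odds ratio and a Wald 95% confidence interval from a 2×2 table. This is not the adjusted logistic regression used in the study, which was run in SAS with several covariates. It only illustrates how an OR and its interval are read. The counts are invented.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed with symptom,   b: exposed without symptom,
    c: unexposed with symptom, d: unexposed without symptom.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed report the symptom.
odds_ratio, lower, upper = odds_ratio_ci(30, 70, 10, 90)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that lies entirely above 1, as in this toy example, corresponds to a statistically significant positive association at the 5% level.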

Results

The clinical characteristics of the CRC cohort (n = 1694) and the BC cohort (n = 3458) are listed in Table 3. The percentage of patients with T2D is 21.3% in the CRC cohort and 15.9% in the BC cohort. The mean age of the CRC cohort at diagnosis was 58.41 years, which is older than the mean age of the BC cohort at diagnosis (53.22 years). For both cohorts, the patients with T2D were older than the patients without T2D, and this age difference is statistically significant (p < 0.0001). The mean BMI of the CRC cohort is slightly lower than that of the BC cohort; however, in both cohorts those with T2D had a higher BMI (p < 0.001) than those without T2D. The Charlson Comorbidity Index (CCI) of the patients with T2D is much higher than that of the patients without T2D for both the CRC and BC cohorts. For the BC cohort, the proportion of black patients is higher among those with T2D than among those without T2D (p < 0.0001). For both cohorts, the proportions of former and current smokers are higher among patients with T2D than among those without T2D (p < 0.0001).

Table 3.

Demographics of CRC and BC cohorts.

	CRC cohort				BC cohort
Variable	Pts without diabetes (N = 1333)	Pts with diabetes (N = 361)	All (N = 1694)	p Value	Pts without diabetes (N = 2906)	Pts with diabetes (N = 552)	All (N = 3458)	p Value
Age	57.42 (12.82)	62.07 (10.61)	58.41 (12.53)	<0.0001	52.07 (11.73)	59.31 (10.4)	53.22 (11.83)	<0.0001
BMI	28.90 (6.76)	31.64 (7.26)	29.65 (7)	<0.0001	30.03 (7)	33.70 (7.89)	31.05 (7.44)	<0.0001
CCI	0.28 (0.79)	1.22 (1.28)	0.50 (1.01)	<0.0001	0.13 (0.41)	1.12 (1.2)	0.31 (0.74)	<0.0001
Race
Black	91 (6.83)	36 (9.97)	127 (7.5)	0.1266	299 (10.29)	104 (18.84)	403 (11.65)	<0.0001
White	1229 (92.2)	321 (88.92)	1550 (91.5)		2545 (87.58)	442 (80.07)	2987 (86.38)
Unknown	13 (0.98)	4 (1.11)	17 (1)		62 (2.13)	6 (1.09)	68 (1.97)
Smoke
Former	61 (4.58)	43 (11.91)	104 (6.14)	<0.0001	73 (2.51)	25 (4.53)	98 (2.83)	<0.0001
No	621 (46.59)	163 (45.15)	784 (46.28)		1831 (63.01)	346 (62.68)	2177 (62.96)
Unknown	560 (42.01)	104 (28.81)	664 (39.2)		864 (29.73)	129 (23.37)	993 (28.72)
Current	91 (6.83)	51 (14.13)	142 (8.38)		138 (4.75)	52 (9.42)	190 (5.49)
Stage
I	–	–	–	–	882 (30.35)	166 (30.07)	1048 (30.31)	0.7600
II	427 (32.03)	103 (28.53)	530 (31.29)	0.2031	1399 (48.14)	274 (49.64)	1673 (48.38)
III	906 (67.97)	258 (71.47)	1164 (68.71)		625 (21.51)	112 (20.29)	737 (21.31)

Symptom factor analysis: CRC patients with diabetes versus without diabetes

Table 4 presents the logistic regression analysis of the outcomes of developing the symptom clusters for the CRC patients with and without T2D over the three timeframes after the first chemotherapy (within 6 months, between 12 and 18 months, and between 24 and 30 months). The results show that within the 6 months post chemotherapy, CRC patients with T2D had a higher risk of developing fatigue (OR, 1.33; p = 0.035), neuropathic symptoms (OR, 1.54; p = 0.017), depression (OR, 1.87; p < 0.001), anxiety (OR, 1.54; p = 0.009), and GI symptoms (OR, 1.38; p = 0.029). Patients who smoke (OR, 1.42; p = 0.029) were more likely to develop depression within 6 months after chemotherapy. Black patients were less likely to develop depression (OR, 0.44; p = 0.014) than white patients within the 6 months after chemotherapy. For the timeframe of 12 to 18 months post chemotherapy, CRC patients with T2D did not show a higher risk of developing any of these symptom clusters. However, patients who currently smoke had a higher risk of developing anxiety (OR, 2.05; p = 0.02) and peripheral neuropathy (OR, 2.37; p = 0.02), and patients who formerly smoked had a higher risk of developing depression (OR, 2.93; p < 0.001) and anxiety (OR, 2.69; p = 0.004). For the timeframe of 24 to 30 months after chemotherapy, patients with T2D had a higher risk of developing functional status-related symptoms (OR, 4.84; p = 0.009), patients who currently smoke had a higher risk of anxiety (OR, 2.11; p = 0.04) and peripheral neuropathy (OR, 2.53; p = 0.02), and patients who formerly smoked had a higher risk of developing depression (OR, 3.89; p < 0.001).

Table 4.

Symptom outcomes by symptom cluster for CRC patients with and without diabetes.

Time frame 1: Within 6 months after the chemo
	Fatigue		Neuropathy		Depression		Anxiety		Cognitive		Sleep		Functional		GI
	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p
Diabetes
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Yes	1.33 (1.02, 1.74)	0.035	1.54 (1.08, 2.20)	0.017	1.87 (1.33, 2.62)	<0.001	1.54 (1.12, 2.13)	0.009	5.27 (0.72, 38.57)	0.102	0.34 (0.04, 3.20)	0.34	1.26 (0.44, 3.61)	0.66	1.38 (1.03, 1.85)	0.029
Race
White	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Black	1.00 (0.68, 1.49)	0.99	1.13 (0.67, 1.92)	0.65	0.44 (0.23, 0.85)	0.014	0.57 (0.32, 1.00)	0.051	2.02 (0.19, 21.76)	0.56	1.12 (0.10, 12.26)	0.93	1.07 (0.22, 5.18)	0.93	1.34 (0.86, 2.09)	0.19
Others	0.94 (0.28, 3.20)	0.93	1.50 (0.34, 6.61)	0.59	0.20 (0.01, 3.87)	0.29	1.03 (0.23, 4.60)	0.97	10.38 (0.35, 309.61)	0.18	11.03 (0.70, 174.53)	0.09	3.08 (0.17, 56.74)	0.45	2.13 (0.47, 9.69)	0.33
Smoking status
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Former	1.15 (0.74, 1.79)	0.57	0.52 (0.25, 1.06)	0.073	1.36 (0.79, 2.33)	0.26	1.60 (0.96, 2.66)	0.072	0.89 (0.06, 13.98)	0.93	1.66 (0.11, 24.59)	0.74	1.01 (0.18, 5.51)	0.99	1.00 (0.61, 1.62)	0.98
Current	1.20 (0.82, 1.75)	0.35	0.98 (0.59, 1.62)	0.93	1.42 (0.90, 2.24)	0.029	1.85 (1.22, 2.81)	0.091	0.99 (0.06, 15.62)	1.00	3.06 (0.53, 17.59)	0.21	1.31 (0.34, 5.01)	0.70	1.15 (0.74, 1.79)	0.53
Unknown	0.47 (0.36, 0.60)	<0.001	0.65 (0.45, 0.93)	0.020	0.68 (0.48, 0.96)	0.13	0.76 (0.55, 1.05)	0.004	0.51 (0.04, 6.10)	0.60	0.74 (0.13, 4.25)	0.73	0.51 (0.16, 1.65)	0.26	0.37 (0.29, 0.48)	<0.001
Time frame 2: Between 12–18 months after the chemo
Diabetes
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Yes	1.24 (0.84, 1.82)	0.28	0.99 (0.55, 1.81)	0.98	1.22 (0.76, 1.97)	0.41	1.37 (0.85, 2.21)	0.20	5.27 (0.72, 38.57)	0.10	0.34 (0.04, 3.20)	0.34	0.72 (0.15, 3.40)	0.94	0.97 (0.69, 1.34)	0.84
Race
White	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Black	0.79 (0.42, 1.48)	0.46	1.06 (0.42, 2.66)	0.90	0.83 (0.38, 1.85)	0.65	0.58 (0.23, 1.42)	0.23	2.02 (0.19, 21.76)	0.56	1.12 (0.10, 12.26)	0.93	2.28 (0.46, 11.33)	0.23	1.21 (0.74, 1.99)	0.45
Others	3.04 (0.90, 10.23)	0.07	7.55 (2.08, 27.38)	0.002	2.06 (0.47, 9.02)	0.34	1.01 (0.17, 6.16)	0.99	10.38 (0.35, 309.61)	0.18	11.03 (0.70, 174.53)	0.09	11.36 (1.72, 75.03)	0.076	3.74 (1.01, 13.88)	0.05
Smoking status
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Former	1.37 (0.77, 2.46)	0.28	1.58 (0.64, 3.90)	0.32	2.93 (1.56, 5.53)	<0.001	2.69 (1.37, 5.28)	0.004	0.89 (0.06, 13.98)	0.93	1.66 (0.11, 24.59)	0.71	0.99 (0.06, 15.17)	0.32	1.65 (0.98, 2.78)	0.06
Current	0.73 (0.41, 1.30)	0.28	2.37 (1.15, 4.86)	0.02	1.43 (0.75, 2.76)	0.28	2.05 (1.10, 3.82)	0.02	0.99 (0.06, 15.62)	0.99	3.06 (0.53, 17.59)	0.21	1.48 (0.27, 8.06)	0.64	0.98 (0.61, 1.56)	0.93
Unknown	0.71 (0.48, 1.05)	0.08	1.09 (0.59, 1.99)	0.80	1.05 (0.65, 1.72)	0.83	1.45 (0.90, 2.34)	0.13	0.51 (0.04, 6.10)	0.59	0.74 (0.13, 4.25)	0.74	1.62 (0.48, 5.40)	0.53	0.88 (0.64, 1.20)	0.41
Time frame 3: Between 24–30 months after the chemo
Diabetes
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Yes	1.13 (0.72, 1.76)	0.60	1.14 (0.57, 2.26)	0.71	1.10 (0.60, 2.02)	0.75	0.90 (0.49, 1.65)	0.73	5.27 (0.72, 38.57)	0.10	2.44 (0.43, 13.89)	0.31	4.84 (1.49, 15.74)	0.009	1.21 (0.82, 1.79)	0.34
Race
White	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Black	1.13 (0.60, 2.14)	0.70	0.69 (0.22, 2.11)	0.51	0.52 (0.17, 1.59)	0.25	0.45 (0.15, 1.37)	0.16	2.02 (0.19, 21.76)	0.56	6.04 (1.04, 35.16)	0.05	0.39 (0.03, 5.51)	0.48	1.01 (0.57, 1.77)	0.98
Others	3.76 (0.92, 15.34)	0.07	0.82 (0.04, 17.01)	0.90	0.60 (0.03, 12.40)	0.74	0.49 (0.02, 10.31)	0.65	10.38 (0.35, 309.61)	0.18	9.76 (0.37, 254.48)	0.17	3.53 (0.17, 73.79)	0.42	1.73 (0.42, 7.12)	0.44
Smoking status
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Former	1.33 (0.68, 2.61)	0.40	2.24 (0.91, 5.55)	0.08	3.89 (1.83, 8.29)	<0.001	2.18 (0.97, 4.91)	0.06	0.89 (0.06, 13.98)	0.93	0.99 (0.06, 16.00)	1.00	0.37 (0.02, 6.14)	0.49	1.07 (0.58, 1.96)	0.83
Current	1.47 (0.81, 2.65)	0.21	2.53 (1.16, 5.56)	0.02	1.97 (0.91, 4.24)	0.08	2.11 (1.03, 4.33)	0.04	0.51 (0.04, 6.10)	1.00	1.13 (0.09, 14.26)	0.92	1.69 (0.38, 7.49)	0.49	1.17 (0.68, 2.01)	0.56
Unknown	0.92 (0.59, 1.42)	0.70	0.65 (0.29, 1.44)	0.29	1.03 (0.55, 1.95)	0.92	1.07 (0.60, 1.90)	0.83	0.99 (0.06, 15.62)	0.60	0.47 (0.03, 6.59)	0.58	1.03 (0.25, 4.24)	0.97	0.79 (0.54, 1.15)	0.23

Symptom factor analysis: BC patients with diabetes versus without diabetes

Table 5 presents the results of the logistic regression analysis of the symptom clusters among BC patients with and without T2D over the three timeframes after the first chemotherapy. Within 6 months after chemotherapy, patients with T2D have a higher risk of developing fatigue (OR, 1.23; p = 0.05), depression (OR, 1.78; p < 0.001), anxiety (OR, 1.57; p < 0.001), functional issues (OR, 1.48; p = 0.004), and GI symptoms (OR, 1.50; p = 0.05). Current and former smokers were more likely to develop depression (OR, 2.69; p < 0.001 and OR, 1.89; p = 0.006, respectively) and anxiety (OR, 1.81; p < 0.001 and OR, 2.17; p < 0.001, respectively), and former smokers were at higher risk of developing functional issues (OR, 2.83; p < 0.001) within 6 months after chemotherapy. Black patients were less likely than white patients to report depression (OR, 0.67; p = 0.01), anxiety (OR, 0.62; p < 0.001), and GI issues (OR, 0.66; p < 0.001) within 6 months after chemotherapy.

Table 5.

Symptom outcomes of BC patients.

Time frame 1: Within 6 months after the chemo
	Fatigue		Neuropathy		Depression		Anxiety		Cognitive		Sleep		Functional		GI
	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p	OR (95% CI)	p
Diabetes
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Yes	1.23 (1.00, 1.52)	0.05	1.24 (0.98, 1.57)	0.08	1.78 (1.40, 2.26)	<0.001	1.57 (1.26, 1.97)	<0.001	2.72 (0.79, 9.36)	0.11	1.37 (0.97, 1.92)	0.07	1.48 (1.13, 1.92)	0.004	1.50 (1.19, 1.88)	0.05
Race
White	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Black	0.89 (0.70, 1.12)	0.32	1.07 (0.82, 1.40)	0.60	0.67 (0.49, 0.91)	0.01	0.62 (0.47, 0.82)	<0.001	1.19 (0.24, 5.78)	0.83	0.78 (0.51, 1.20)	0.26	1.30 (0.96, 1.77)	0.09	0.66 (0.52, 0.84)	<0.001
Others	1.15 (0.66, 2.02)	0.62	1.27 (0.68, 2.36)	0.46	0.57 (0.25, 1.33)	0.19	0.97 (0.52, 1.82)	0.93	3.55 (0.23, 54.35)	0.36	0.71 (0.24, 2.15)	0.55	0.92 (0.40, 2.13)	0.84	1.13 (0.62, 2.06)	0.68
Smoking status
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Former	1.14 (0.75, 1.74)	0.54	1.22 (0.77, 1.93)	0.40	1.89 (1.20, 2.96)	0.006	2.17 (1.42, 3.31)	<0.001	0.90 (0.06, 13.40)	0.94	1.30 (0.68, 2.47)	0.42	2.83 (1.79, 4.48)	<0.001	0.98 (0.63, 1.54)	0.95
Current	0.97 (0.70, 1.33)	0.83	1.20 (0.85, 1.69)	0.31	2.69 (1.93, 3.74)	<0.001	1.81 (1.31, 2.50)	<0.001	0.55 (0.04, 7.60)	0.65	1.48 (0.93, 2.36)	0.10	1.52 (1.02, 2.27)	0.04	0.98 (0.70, 1.38)	0.93
Unknown	0.41 (0.34, 0.49)	<0.001	0.53 (0.42, 0.66)	<0.001	0.69 (0.54, 0.87)	<0.001	0.54 (0.44, 0.67)	<0.001	0.16 (0.01, 2.17)	0.17	0.55 (0.38, 0.78)	<0.001	1.30 (1.02, 1.65)	0.04	0.33 (0.28, 0.40)	<0.001
Time frame 2: Between 12–18 months after the chemo
Diabetes
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Yes	1.30 (1.01, 1.67)	0.04	1.87 (1.37, 2.56)	<0.001	1.64 (1.23, 2.19)	<0.001	1.45 (1.08, 1.95)	0.013	1.94 (0.39, 9.64)	0.42	1.35 (0.84, 2.15)	0.21	0.97 (0.43, 2.16)	0.94	1.34 (1.06, 1.69)	0.016
Race
White	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Black	1.35 (1.03, 1.77)	0.03	1.28 (0.90, 1.82)	0.17	0.88 (0.63, 1.25)	0.49	0.86 (0.61, 1.21)	0.39	0.65 (0.05, 7.90)	0.74	1.00 (0.58, 1.71)	1.00	1.53 (0.69, 3.37)	0.23	0.93 (0.71, 1.21)	0.57
Others	1.32 (0.68, 2.57)	0.41	0.66 (0.22, 2.01)	0.46	0.59 (0.22, 1.59)	0.29	0.70 (0.28, 1.76)	0.45	4.79 (0.32, 71.00)	0.25	0.58 (0.11, 3.01)	0.52	3.31 (0.88, 12.40)	0.076	1.13 (0.61, 2.10)	0.70
Smoking status
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Former	1.69 (1.03, 2.76)	0.04	1.22 (0.62, 2.42)	0.56	2.24 (1.32, 3.82)	0.003	1.91 (1.11, 3.29)	0.02	5.93 (0.33, 106.71)	0.23	2.11 (0.95, 4.66)	0.065	1.94 (0.52, 7.18)	0.32	1.36 (0.84, 2.20)	0.21
Current	1.12 (0.77, 1.63)	0.56	1.91 (1.25, 2.94)	0.003	1.97 (1.33, 2.93)	<0.001	1.77 (1.19, 2.64)	0.005	11.45 (1.60, 82.07)	0.015	1.24 (0.62, 2.48)	0.55	1.31 (0.43, 3.95)	0.64	1.26 (0.89, 1.78)	0.20
Unknown	0.69 (0.53, 0.89)	0.05	0.97 (0.69, 1.36)	0.85	0.89 (0.66, 1.22)	0.48	0.79 (0.58, 1.09)	0.15	6.22 (1.06, 36.44)	0.042	1.25 (0.79, 1.95)	0.34	1.26 (0.61, 2.61)	0.53	0.71 (0.56, 0.89)	0.003
Time frame 3: Between 24–30 months after the chemo
Diabetes
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Yes	1.58 (1.17, 2.12)	0.002	2.35 (1.64, 3.37)	<0.001	1.93 (1.39, 2.66)	<0.001	1.49 (1.06, 2.11)	0.023	2.77 (0.77, 9.99)	0.12	2.35 (1.41, 3.92)	<0.001	3.40 (1.72, 6.75)	<0.001	1.32 (1.01, 1.73)	0.044
Race
White	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Black	0.92 (0.65, 1.31)	0.64	1.31 (0.85, 2.00)	0.22	0.85 (0.57, 1.26)	0.41	0.81 (0.53, 1.23)	0.32	1.71 (0.33, 8.93)	0.53	0.68 (0.33, 1.41)	0.30	0.99 (0.40, 2.47)	0.98	0.92 (0.68, 1.24)	0.57
Others	1.17 (0.54, 2.52)	0.69	2.74 (1.22, 6.14)	0.015	0.34 (0.09, 1.26)	0.11	0.53 (0.17, 1.64)	0.27	6.24 (0.40, 98.31)	0.19	0.87 (0.16, 4.65)	0.87	2.12 (0.39, 11.52)	0.38	1.10 (0.56, 2.19)	0.77
Smoking status
No	1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]		1[Ref]
Former	2.46 (1.39, 4.35)	0.002	1.11 (0.47, 2.61)	0.82	2.05 (1.10, 3.82)	0.024	1.34 (0.66, 2.70)	0.42	2.35 (0.14, 38.26)	0.55	0.54 (0.10, 2.81)	0.46	1.10 (0.21, 5.83)	0.91	1.84 (1.05, 3.23)	0.035
Current	1.74 (1.09, 2.78)	0.019	1.85 (1.05, 3.25)	0.03	1.87 (1.13, 3.08)	0.014	1.63 (0.96, 2.76)	0.068	1.49 (0.11, 20.82)	0.77	2.03 (0.95, 4.35)	0.46	1.06 (0.29, 3.90)	0.93	1.33 (0.85, 2.07)	0.21
Unknown	0.60 (0.42, 0.86)	0.005	0.91 (0.58, 1.41)	0.69	0.85 (0.57, 1.26)	0.017	0.56 (0.36, 0.86)	0.008	3.81 (1.03, 14.06)	0.04	0.78 (0.40, 1.52)	0.07	0.71 (0.27, 1.92)	0.50	0.75 (0.57, 0.98)	0.037

For the time frame of 12 to 18 months after chemotherapy, patients with T2D are more likely to report fatigue (OR, 1.30; p = 0.04), neuropathy (OR, 1.87; p < 0.001), depression (OR, 1.64; p < 0.001), anxiety (OR, 1.45; p = 0.013), and GI symptoms (OR, 1.34; p = 0.016). Compared with white patients, black patients are more likely to develop fatigue (OR, 1.35; p = 0.03). Former smokers have a higher risk of developing fatigue (OR, 1.69; p = 0.04), depression (OR, 2.24; p = 0.003), and anxiety (OR, 1.91; p = 0.02), and current smokers have a higher risk of developing neuropathy (OR, 1.91; p = 0.003), depression (OR, 1.97; p < 0.001), anxiety (OR, 1.77; p = 0.005), and cognitive issues (OR, 11.45; p = 0.015).

For the time frame of 24 to 30 months after chemotherapy, patients with T2D are at higher risk of developing fatigue (OR, 1.58; p = 0.002), neuropathy (OR, 2.35; p < 0.001), depression (OR, 1.93; p < 0.001), anxiety (OR, 1.49; p = 0.023), sleep issues (OR, 2.35; p < 0.001), functional issues (OR, 3.40; p < 0.001), and GI symptoms (OR, 1.32; p = 0.044). Former smokers have a higher risk of developing fatigue (OR, 2.46; p = 0.002), depression (OR, 2.05; p = 0.024), and GI issues (OR, 1.84; p = 0.035), and current smokers have a higher risk of developing fatigue (OR, 1.74; p = 0.019), peripheral neuropathy (OR, 1.85; p = 0.03), and depression (OR, 1.87; p = 0.014).
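As a reading aid for the odds ratios reported above, the link between an odds ratio, its 95% confidence interval, and its p-value can be sketched in a few lines of Python. This is a minimal sketch, assuming the published intervals are Wald intervals on the log-odds scale, which the text does not state. The function names `wald_p_from_ci` and `ci_excludes_one` are illustrative, and rounding of the published values limits the precision of any recovered p-value.

```python
import math


def wald_p_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate a two-sided Wald p-value from an OR and its 95% CI.

    Assumes the CI was built as exp(log(OR) +/- z_crit * SE), so the
    standard error can be recovered from the width of the log-scale CI.
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = log_or / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def ci_excludes_one(ci_low, ci_high):
    """True when the interval lies strictly above or strictly below OR = 1."""
    return ci_low > 1.0 or ci_high < 1.0


if __name__ == "__main__":
    # Values copied from the diabetes row of Table 5, time frame 1.
    rows = {
        "fatigue": (1.23, 1.00, 1.52),
        "depression": (1.78, 1.40, 2.26),
    }
    for name, (or_, lo, hi) in rows.items():
        z, p = wald_p_from_ci(or_, lo, hi)
        print(f"{name}: z = {z:.2f}, approx. p = {p:.4f}, "
              f"CI excludes 1: {ci_excludes_one(lo, hi)}")
```

Applied to the fatigue entry for T2D in the first timeframe (OR, 1.23; 95% CI, 1.00 to 1.52), the sketch returns a p-value close to the reported 0.05, which is what a lower confidence bound sitting at 1.00 would suggest. A large gap between a recovered and a reported p-value is a prompt to re-check the table entry rather than proof of an error, because the original analysis may have used a different test.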

Discussion

Symptom science experts recommend evaluating EHRs and their utility in symptom cluster research.³³ The primary focus to date in Big Data analysis of EHRs has been to establish prediction models for disease development, prognosis, and resource utilization,^13,34–37 with little emphasis on their use for the exploration of symptoms. To our knowledge, this is one of the first studies to apply NLP techniques to unstructured EHR data to identify symptoms of CRC and BC patients with and without T2D across three key timeframes in the cancer trajectory.

Links between T2D and the development of CRC and BC have been well documented.^38–42 Patients with CRC or BC and people with T2D experience similar symptoms.^25,43–45 However, an understanding of the effects of the combination of CRC or BC and T2D on symptoms is lacking. In this study, we found that CRC and BC patients with T2D were more likely to report symptoms than CRC and BC patients without T2D over three timeframes in the cancer trajectory. These findings are consistent with the few existing studies, which noted that CRC or BC patients with T2D reported more symptoms than their non-T2D counterparts.^19–21,44 However, these studies used symptom measurement scales and primarily cross-sectional designs.

Colorectal cancer

Among CRC patients, we found that those with T2D reported more symptoms at two of the three timeframes than their counterparts without T2D. Within 6 months of cancer diagnosis, CRC patients with T2D reported more fatigue, depression, neuropathy, anxiety, and GI issues than CRC patients without T2D. The initial 6 months around chemotherapy administration is a critical period for patients because treatment regimens are initiated, many of which are known to cause or worsen symptoms.^5,46,47 In addition, treatment regimens often include drugs that challenge glucose self-management and regulation⁴⁷ and self-management of diabetes,⁴⁸ which could exacerbate symptoms. We also found that, at the 24–30-month post-chemotherapy timeframe, the only symptoms that CRC patients with T2D were more likely to report were those associated with functional status. The long-term effects of cancer treatment among CRC patients have been documented to include fatigue, insomnia,⁴⁹ and neuropathy,^50,51 all of which have been associated with decreased function and quality of life. In our study, functional status was defined broadly and included both neuropathic and other physical functional symptoms, which precluded us from determining which symptoms specifically contributed to the long-term functional symptoms. More research is needed to examine the relationship between T2D and the symptoms of CRC patients longitudinally. Interestingly, we did not find a significant difference in the symptoms reported by CRC patients with and without T2D at the 12–18-month time-period. This time-period is associated with the cessation of treatment, which may decrease the intensity of acute side effects associated with active treatment. This reprieve in acute symptoms may result in fewer symptoms being reported; however, as time progresses and acute symptoms fade, long-term lingering effects may become more prominent.

Breast cancer

In our study, BC patients with T2D consistently reported four symptoms (fatigue, depression, anxiety, and GI issues) across the three timeframes. Functional issues were reported along with the four symptoms at the 6- and 24–30-month intervals. Neuropathy was reported at the 12–18- and 24–30-month time-periods. BC patients with T2D reported the most symptoms at the latest timeframe, which included all of these symptoms plus sleep symptoms. Other researchers have also noted that these and other symptoms linger for several years after cancer treatment,^52–55 causing concern for BC patients and impacting quality of life.¹⁶ The effect of T2D on the symptoms of BC patients has not been well described. Tang et al. (2016) noted that, among BC patients, those with T2D reported lower scores on physical, emotional, and social function, and higher symptom scores (greater symptoms) for fatigue, GI issues, and insomnia than BC patients without T2D.¹⁶ However, that study relied on a self-reported T2D diagnosis rather than ICD codes. Similarly, another researcher noted that, among BC patients 3–8 years after cancer diagnosis, those with T2D reported poorer physical and attention function, more sleep disturbance, and greater fatigue than women with breast cancer without T2D.⁴⁴ Both studies used standard symptom measures rather than the symptoms described by BC patients to their healthcare providers. More research is warranted to understand the symptoms experienced by BC patients with T2D and to facilitate the identification of subsets of cancer patients who may be at risk for higher symptom profiles.

When comparing the symptom profiles of patients across the two cancer diagnoses, we found that CRC patients with T2D were more likely to report symptoms early in the treatment period, with no symptoms reported at 12–18 months (typically the cessation of treatment) and symptoms reported again at the later period (24–30 months post cancer diagnosis), whereas BC patients reported symptoms over all three timeframes. This finding was also noted in our previous study, which examined T2D and three symptoms among CRC and BC patients over 12 months.¹⁵ It may be a result of gender differences between the groups. Research examining the role of gender in symptoms is inconclusive, with some studies indicating that men with cancer experience some symptoms more than women and others reporting that women experience more symptoms.^55–58 In our study, the sample of BC patients included only women, whereas the CRC sample included both men and women. It is possible that this finding was influenced by the mix of genders in our CRC sample and would have been similar to the BC findings if we had included only women with CRC in the analysis. Additionally, the different chemotherapy drugs used to treat the specific cancers may contribute to the type and duration of symptoms. More studies that examine gender differences and symptoms among cancer survivors are warranted.

We found that race was associated with the reporting of symptoms. In our study, Caucasian CRC and BC patients were more likely to report depression (CRC and BC), and anxiety and GI issues (BC), than black CRC or BC patients. This finding may be a result of the lack of ethnic diversity in the sample, as the majority of the subjects in our study were Caucasian. Additionally, black patients often have less access to healthcare and cancer resources,^59,60 making their symptoms more difficult to assess and document. Understanding the role of ethnicity in the symptoms of cancer patients is important and could facilitate the treatment of symptoms.

Current smoking status was associated with symptoms in CRC and BC patients. We found that current smokers were more likely than non-smokers to report anxiety (CRC, BC), neuropathic symptoms (CRC, BC), and depression (BC). Research suggests that people with higher anxiety are more likely to smoke^61,62 and that smoking may in turn increase anxiety through biological pathways.⁶¹ Similarly, smokers have been shown to have a higher risk of depression.^62,63 The role of smoking as a risk factor for neuropathy is equivocal, with some studies finding a link between smoking and neuropathy⁶⁴ and others not.⁶⁵

Our study has several strengths. The use of unstructured symptom data is novel. Capturing the description of the symptom experience from CRC and BC patients in real time during interactions with healthcare providers is important and can help to develop tailored assessment and self-management strategies for those at high risk. To our knowledge, this is one of the first studies to use EHRs to study symptoms in two common cancer types in which many patients have comorbid T2D. Analysis of unstructured EHR data provides a foundation upon which interventions for symptom management can be tailored to the unique needs of patients with CRC or BC and T2D. The findings of our study must be considered in light of its limitations. Although we were able to count the documented frequency of the self-reported symptoms from CRC and BC patients, we were unable to measure the severity of these symptoms, the distress they caused, or their impact on daily activity or quality of life. While the patients in each cancer group were in earlier stages of their cancer diagnoses, we did not control for variations in chemotherapy regimens, which may have influenced the symptoms. Lastly, we were unable to measure lifestyle habits (diet, exercise, medication adherence) of CRC and BC patients with and without T2D, which may also have contributed to symptoms.

Limitations of the study

The UMLS MetaMap was used to identify possible symptom concepts in the clinical notes. Although word sense disambiguation (WSD) and negation detection in UMLS MetaMap were enabled when extracting the symptom concepts, the limitations of both features mean that some relevant symptom concepts might have been left out. The overall approach relies on the provided seed symptoms to capture all possible relevant symptom representations within the clinical notes, so the seed symptoms need to have good semantic coverage of those representations.
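The dependence on seed coverage can be illustrated with a toy example. The sketch below is not the pipeline used in this study. It assumes that each extracted term and each seed term has an embedding vector, assigns a term to the seed cluster with the highest cosine similarity, and discards terms that fall below a threshold. The three-dimensional vectors, the 0.6 threshold, and the names `cosine`, `assign_to_seed`, `SEEDS`, and `TERMS` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_to_seed(term_vec, seed_vecs, threshold=0.6):
    """Return (cluster label or None, best similarity) for one term.

    A term whose best similarity is below `threshold` gets no cluster,
    which mimics a relevant symptom being missed for lack of a seed.
    """
    best_label, best_sim = None, -1.0
    for label, seed_vec in seed_vecs.items():
        sim = cosine(term_vec, seed_vec)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return None, best_sim
    return best_label, best_sim


# Hand-made toy vectors (hypothetical, not real word embeddings).
SEEDS = {
    "fatigue": (1.0, 0.1, 0.0),
    "anxiety": (0.0, 1.0, 0.2),
}
TERMS = {
    "tiredness": (0.9, 0.2, 0.1),
    "nervousness": (0.1, 0.8, 0.3),
    "rash": (0.1, 0.1, 1.0),  # no seed points in this direction
}

if __name__ == "__main__":
    for term, vec in TERMS.items():
        label, sim = assign_to_seed(vec, SEEDS)
        print(f"{term}: cluster = {label}, similarity = {sim:.2f}")
```

In this toy setting, "tiredness" and "nervousness" land in the fatigue and anxiety clusters, while "rash" is dropped because no seed lies near it. Adding a seed with a similar vector would recover it, which is the sense in which the seed list determines what the approach can capture.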

Conclusions

As cancer survivorship continues to increase due to advances in treatment, management of comorbid conditions in people with cancer is becoming increasingly important. Using EHR data to identify and examine symptoms in CRC and BC patients with T2D represents an initial step toward understanding their association with comorbid conditions. A comprehensive understanding of symptoms specific to CRC and BC patients with comorbid T2D is critical to guide clinical practice and strategies to mitigate symptoms.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded in part by a Research Investment Fund from the Indiana University School of Nursing (PI Storey), a Release Time for Research award from the Indiana University Purdue University Indianapolis Office of the Vice Chancellor for Research (Dr. Storey), and an RE01 from the Oncology Nursing Foundation (PI Storey).

ORCID iDs

Xiao Luo

Susan Storey

References

1. Miaskowski, Cooper, Paul, et al. Subgroups of patients with cancer with different symptom experiences and quality-of-life outcomes: a cluster analysis. Oncol Nurs Forum 2006; 33(5): E79–E89.

2. Miaskowski, Dunn, Ritchie, et al. Latent class analysis reveals distinct subgroups of patients based on symptom occurrence and demographic and clinical characteristics. J Pain Symptom Manage 2015; 50(1): 28–37.

3. Chen, Ofner, Bakoyannis, et al. Symptoms-based phenotypes among women with dysmenorrhea: a latent class analysis. West J Nurs Res 2018; 40(10): 1452–1468.

4. Han, Reding, Cooper, et al. Symptom clusters in patients with gastrointestinal cancers using different dimensions of the symptom experience. J Pain Symptom Manage 2019; 58(2): 224–234.

5. Mazor, Cataldo, Lee, et al. Differences in symptom clusters before and twelve months after breast cancer surgery. Eur J Oncol Nurs 2018; 32: 63–72.

6. Cheung, Zimmermann. Symptom clusters in patients with advanced cancers. Support Care Cancer 2009; 17(9): 1223–1230.

7. Marshall, Yang, Ping, et al. Symptom clusters in women with breast cancer: an analysis of data from social media and a research study. Qual Life Res 2016; 25(3): 547–557.

8. Vijayakrishnan, Steinhubl, et al. Prevalence of heart failure signs and symptoms in a large primary care population identified through the use of text and data mining of the electronic health record. J Card Fail 2014; 20(7): 459–464.

9. Jackson, Patel, Jayatilleke, et al. Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. BMJ Open 2017; 7(1): e012012.

10. Forsyth, Barzilay, Hughes, et al. Machine learning methods to extract documentation of breast cancer symptoms from electronic health records. J Pain Symptom Manage 2018; 55(6): 1492–1499.

11. Divita, Luo, Tran, et al. General symptom extraction from VA electronic medical notes. Stud Health Technol Inform 2017; 245: 356–360.

12. Gundlapalli, Divita, Redd, et al. Detecting the presence of an indwelling urinary catheter and urinary symptoms in hospitalized patients using natural language processing. J Biomed Inform 2017; 71: S39–S45.

13. Koleck, Dreisbach, Bourne, et al. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. J Am Med Inform Assoc 2019; 26(4): 364–379.

14. Vaci, Liu, Kormilitzin, et al. Natural language processing for structuring clinical text data on depression using UK-CRIS. Evid Based Ment Health 2020; 23(1): 21–26.

15. Gandhi, Luo, Storey, et al. Identifying symptom clusters in breast cancer and colorectal cancer patients using EHR data. In: Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, Niagara Falls, NY, USA, 4 September 2019, pp.405–413. New York, NY: Association for Computing Machinery.

16. Tang, Wang, Zhang, et al. Associations between diabetes and quality of life among breast cancer survivors. PLoS One 2016; 11(6): e0157791.

17. Axelrod, Guth, et al. Comorbidities and quality of life among breast cancer survivors: a prospective study. J Pers Med 2015; 5(3): 229–242.

18. Hammer, Storey, Hershey, et al. Hyperglycemia and cancer: a state-of-the-science review. Oncol Nurs Forum 2019; 46(4): 459–472.

19. Vissers, Thong, Pouwer, et al. The individual and combined effect of colorectal cancer and T2D on health-related quality of life and sexual functioning: results from the PROFILES registry. Support Care Cancer 2014; 22: 3071–3079.

20. Vissers, Mols, Thong, et al. The impact of diabetes on neuropathic symptoms and receipt of chemotherapy among colorectal cancer patients: results from the PROFILES registry. J Cancer Surviv 2015; 9(3): 523–531.

21. Vissers, Falzon, van de Poll-Franse, et al. The impact of having both cancer and diabetes on patient-reported outcomes: a systematic review and directions for future research. J Cancer Surviv 2016; 10(2): 406–415.

22. Vardy, Dhillon, Pond, et al. Cognitive function and fatigue after diagnosis of colorectal cancer. Ann Oncol 2014; 25: 2404–2412.

23. Gray, Hall, Browne, et al. Predictors of anxiety and depression in people with colorectal cancer. Support Care Cancer 2014; 22: 307–314.

24. Garcia, Bose, Zuniga, et al. Mexican Americans’ T2D symptom prevalence, burden and cluster. Appl Nurs Res 2019; 46: 37–42.

25. Scott, et al. The effect of symptom clusters on quality of life among patients with type 2 diabetes. Diabetes Educ 2019; 45(3): 287–294.

26. Won, Lee, et al. Clinical phenotype of diabetic peripheral neuropathy and relation to symptom patterns: cluster and factor analysis in patients with type 2 diabetes in Korea. J Diabetes Res. Epub ahead of print 13 December 2017. DOI: 10.1155/2017/5751687.

27. Aronson, Lang. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 2010; 17: 229–236.

28. Schneider. Lexical semantic analysis in natural language text. Unpublished Doctoral Dissertation, Carnegie Mellon University, 2014.

29. Zhang, Chen, Yang, et al. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data 2019; 6(1): 1–9.

30. Mikolov, Sutskever, Chen, et al. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.

31. Chiu, Crichton, Korhonen, et al. How to train good word embeddings for biomedical NLP. In: Proceedings of the 15th workshop on biomedical natural language processing, Berlin, Germany, 12 August 2016, pp.166–174. Association for Computational Linguistics.

32. Bouguettaya, Liu, et al. Efficient agglomerative hierarchical clustering. Expert Syst Appl 2015; 42(5): 2785–2797.

33. Miaskowski, Cooper, Aouizerat, et al. The symptom phenotype of oncology outpatients remains relatively stable from prior to through 1 week following chemotherapy. Eur J Cancer Care 2017; 26(3): e12437.

34. Tseng, Chiang, Liu, et al. The application of data mining techniques to oral cancer prognosis. J Med Syst 2015; 39(5): 59.

35. Eswari, Sampath, Lavanya. Predictive methodology for diabetic data analysis in big data. Procedia Comput Sci 2015; 50: 203–208.

36. Rumbold, O’Kane, Philip, et al. Big Data and diabetes: the applications of Big Data for diabetes care now and in the future. Diabet Med 2020; 37(2): 187–193.

37. Zhu, Patumcharoenpol, Zhang, et al. Biomedical text mining and its applications in cancer research. J Biomed Inform 2013; 46(2): 200–211.

38. Giovannucci, Harlan, Archer, et al. Diabetes and cancer: a consensus report. CA Cancer J Clin 2010; 60(4): 207–221.

39. Hardefeldt, Edirimanne, Eslick. Diabetes increases the risk of breast cancer: a meta-analysis. Endocr Relat Cancer 2012; 19(6): 793–803.

40. Samuel, Varghese, et al. Challenges and perspectives in the treatment of diabetes associated breast cancer. Cancer Treat Rev 2018; 70: 98–111.

41. González, Prieto, del Puerto-Nevado, et al. 2017 update on the relationship between diabetes and colorectal cancer: epidemiology, potential molecular mechanisms and therapeutic implications. Oncotarget 2017; 8(11): 18456–18485.

42. Zhu, et al. The relationship between diabetes and colorectal cancer prognosis: a meta-analysis based on the cohort studies. PLoS One 2017; 12(4): e0176068.

43. Tantoy, Cataldo, Aouizerat, et al. A review of the literature on multiple co-occurring symptoms in patients with colorectal cancer who received chemotherapy alone or chemotherapy with targeted therapies. Cancer Nurs 2016; 39(6): 437–445.

44. Storey, Cohee, Gathirua-Mwangi, et al. The impact of diabetes on the symptoms of breast cancer survivors. Oncol Nurs Forum 2019; 46(4): 473–484.

45. Pettersson, Berterö, Unosson, et al. Symptom prevalence, frequency, severity, and distress during chemotherapy for patients with colorectal cancer. Support Care Cancer 2014; 22(5): 1171–1179.

46. Kim, Lee, et al. Predictors of symptom experience in Korean patients with cancer undergoing chemotherapy. Eur J Oncol Nurs 2015; 19(6): 644–653.

47. Deshields, Potter, Olsen, et al. The persistence of symptom burden: symptom experience and quality of life of cancer patients across one year. Support Care Cancer 2014; 22(4): 1089–1096.

48. Hershey, Tipton, Given, et al. Perceived impact of cancer treatment on diabetes self-management. Diabetes Educ 2012; 38(6): 779–790.

49. Ratjen, Schafmayer, Enderle, et al. Health-related quality of life in long-term survivors of colorectal cancer and its association with all-cause mortality: a German cohort study. BMC Cancer 2018; 18(1): 1156.

50. Mols, Beijers, Lemmens, et al. Chemotherapy-induced neuropathy and its association with quality of life among 2- to 11-year colorectal cancer survivors: results from the population-based PROFILES registry. J Clin Oncol 2013; 31(21): 2699–2707.

51. Soveri, Lamminmäki, Hänninen, et al. Long-term neuropathy and quality of life in colorectal cancer patients treated with oxaliplatin containing adjuvant chemotherapy. Acta Oncol 2019; 58(4): 398–406.

52. Von Ah, Kang, Carpenter. Predictors of cancer-related fatigue in women with breast cancer before, during, and after adjuvant therapy. Cancer Nurs 2008; 31(2): 134–144.

53. Fabi, Falcicchio, Giannarelli, et al. The course of cancer related fatigue up to ten years in early breast cancer patients: what impact in clinical practice? Breast 2017; 34: 44–52.

54. Otte, Carpenter, Russell, et al. Prevalence, severity, and correlates of sleep-wake disturbances in long-term breast cancer survivors. J Pain Symptom Manage 2010; 39(3): 535–547.

55. Bao, Basal, Seluzicki, et al. Long-term chemotherapy-induced peripheral neuropathy among breast cancer survivors: prevalence, risk factors, and fall risk. Breast Cancer Res Treat 2016; 159(2): 327–333.

56. Pudrovska. Why is cancer more depressing for men than women among older white adults? Soc Forces 2010; 89(2): 535–558.

57. Wong, Bedard, Pulenzas, et al. Gender differences in symptoms experienced by advanced cancer patients: a literature review. Rev Health Care 2013; 4(2): 141–153.

58. Miaskowski. Gender differences in pain, fatigue, and depression in patients with cancer. JNCI Monogr 2004; 2004(32): 139–143.

59. Hines, Markossian. Differences in late-stage diagnosis, treatment, and colorectal cancer-related death between rural and urban African Americans and whites in Georgia. J Rural Health 2012; 28(3): 296–305.

60. Lai, Wang, Civan, et al. Effects of cancer stage and treatment differences on racial disparities in survival from colon cancer: a United States population-based study. Gastroenterology 2016; 150(5): 1135–1146.

61. Moylan, Jacka, Pasco, et al. How cigarette smoking may increase the risk of anxiety symptoms and anxiety disorders: a critical review of biological pathways. Brain Behav 2013; 3(3): 302–326.

62. Emre, Topal, Bozkurt, et al. Mental health screening and increased risk for anxiety and depression among treatment-seeking smokers. Tob Induc Dis 2014; 12(1): 20.

63. Fluharty, Taylor, Grabski, et al. The association of cigarette smoking with depression and anxiety: a systematic review. Nicotine Tob Res 2016; 19(1): 3–13.

64. Seretny, Currie, Sena, et al. Incidence, prevalence, and predictors of chemotherapy-induced peripheral neuropathy: a systematic review and meta-analysis. Pain 2014; 155(12): 2461–2470.

65. Olausson. Hyperglycemic-inducing neoadjuvant agents used in treatment of solid tumors: a review of the literature. Oncol Nurs Forum 2014; 41(6): E343–E354.

Analyzing the symptoms in colorectal and breast cancer patients with or without type 2 diabetes using EHR data
