
Abstract

Sociodemographic factors are critical determinants of health outcomes and disparities, yet their documentation in electronic medical records is often sparse and confined to unstructured clinical text. This poses substantial challenges for automated extraction and integration into clinical decision-making. In this study, we systematically evaluate and compare 6 convolutional neural network architectures, including hybrid models that integrate traditional classifiers, for binary classification of multiple sociodemographic characteristics from EMR text using data from 4375 patients across 96 primary care clinics. The goal was to assess how model complexity and lexical diversity influence classification performance. Manual annotation achieved high intra-rater reliability (kappa: 0.98 for documentation status, 0.96 for documented information). We report performance using F1 score, precision, recall, area under the precision-recall curve, and Matthews correlation coefficient. Results showed that simpler architectures, particularly a single-layer CNN, consistently outperformed deeper or hybrid models across most characteristics (F1 score: 90.99%), especially under conditions of data imbalance and varied documentation patterns. While hybrid models offered gains for well-documented factors like marital status, they were less effective for sparse or diverse characteristics. These findings provide a practical framework for developing efficient, interpretable clinical NLP pipelines and inform model selection strategies for real-world health equity and EMR research applications.

Keywords

clinical classification; clinical text; convolutional neural networks; deep neural networks; electronic medical records; supervised learning; sociodemographic factors; health equity

Introduction

Social conditions expose certain groups to risk factors that can lead to disease onset, producing disease patterns in various populations.¹ One such social construct is the sociodemographic factors of individuals. These factors constitute the social structural conditions that are underlying determinants of health.² Extensive research has established associations between sociodemographic characteristics and numerous health outcomes. Race has been linked to mental health, healthcare delivery, cancer, heart failure, and pre-term birth.^3-7 Marital status and education are associated with cancer outcomes, Alzheimer’s disease, and healthcare service utilization.^3,8-12 Employment status predicts both physical and mental health, with unemployment linked to depression, anxiety, and suicide risk.¹³ Sexual orientation and gender identity have been associated with elevated rates of mental distress, depression, substance misuse, and chronic conditions including asthma, chronic obstructive pulmonary disease, obesity, and diabetes.^14-19 Understanding these sociodemographic patterns allows for identification of medically relevant risk factors underlying disease causation. If these risk factors are identified and addressed through intervention or prevention, health disparities in affected populations can be reduced. Furthermore, capturing this data at the community level enables population-level health risk prediction. Despite the well-established importance of sociodemographic factors in health outcomes, there is a critical need to improve their systematic identification and extraction from clinical documentation to support equity-driven research, patient-centered care, and accurate risk assessment. As healthcare systems increasingly adopt artificial intelligence to address documentation gaps and support clinical decision-making, understanding which model architectures offer optimal trade-offs in performance, efficiency, and interpretability has become essential.

Data collected through electronic medical records (EMRs) capture services provided by non-physicians, sociodemographic factors, and detailed diagnostic information, enabling a more complete understanding of primary care encounters.^20,21 Sociodemographic characteristics documented in EMRs generally include information on place of birth, ethnicity/race, immigration/refugee status, partnership status, sexual orientation, education, and occupation.²² This information is captured in both structured and unstructured formats. However, many sociodemographic characteristics such as marital status and race are often missing-not-at-random in structured data when compared to unstructured counterparts.^23,24 Consequently, unstructured clinical narratives prove to be a rich data source for sociodemographic information. Nevertheless, many challenges come with medical text classification due to high dimensionality, data sparsity, and unstructured format.²⁵ Furthermore, medical text is characterized by long sentences with normalized medical terminology, poor grammar, and many spelling mistakes. Another challenge is the low prevalence of sociodemographic factors, causing class imbalance between absence and presence of characteristics. Therefore, identification and extraction of sociodemographic data from clinical text remain understudied areas of research.²⁶

The most common biomedical approach for extracting clinical text combines natural language processing and machine learning.²⁷ There are 2 main methods used for text classification: traditional machine learning and deep learning approaches. Traditional machine learning approaches are difficult to implement for large-scale training samples and require rigorous feature engineering.²⁸ Contrarily, deep learning approaches can efficiently represent long-range dependencies through deep hierarchical feature construction.²⁹ They are composed of multiple processing layers that help discover intrinsic patterns in the data and lessen the burden of feature engineering.^30-32 This is especially useful when using free text data in EMRs, as these models can learn rich representations of medical vocabulary.³³ Convolutional neural networks (CNNs) and variations of recurrent neural networks are the most widely adopted deep learning methods for text categorization, as they are well suited for sequential data such as longitudinal patient records and clinical text.^30,32,34 Specifically, CNNs have shown impressive results in sentence categorization.^35-38

Recent advances in transformer-based language models, such as BERT and its clinical variants (ClinicalBERT, BioClinicalBERT), have significantly expanded the capabilities of natural language processing in healthcare.³⁹ These architectures employ self-attention mechanisms that enable contextual understanding across long sequences of text, supporting fine-grained tasks such as entity recognition and the extraction of social and behavioral information from complex clinical narratives.^40-43 Prior studies have shown that transformer-based models can outperform earlier deep learning architectures, including CNNs and recurrent neural networks, for extracting social determinants of health from electronic health records.^10,23 However, these performance gains often come with trade-offs in computational cost, model transparency, and deployment feasibility. Transformer models typically require large-scale infrastructure, substantial memory, and specialized hardware, which may not be available in community or primary care settings.⁴⁴

In contrast, convolutional neural networks can achieve competitive performance for localized, sentence-level classification while offering several practical advantages for deployment in resource-limited clinical settings.^30,45 They require substantially less computational infrastructure, enable faster inference on standard clinical workstations, and provide more transparent feature representations through filter visualization. CNNs are characterized by local connection, spatial sampling, and weight sharing.²⁸ The neurons in CNNs use local connections which reduce the number of parameters in the model. Feature representations extracted for training and classification can be obtained using convolution. Spatial sampling is used to reduce feature dimensionality while enhancing robustness. Furthermore, CNNs’ weight-sharing characteristics reduce the complexity of feature extraction and data reconstruction compared to traditional models. These practical advantages motivated our systematic evaluation of CNN-based architectures for sociodemographic information extraction.

Despite the widespread adoption of deep learning for clinical text analysis, no prior study has systematically compared CNN architectures specifically for sociodemographic information extraction from EMRs. Existing work either focuses on isolated classification tasks or relies on complex transformer models without establishing whether simpler, more deployable architectures can achieve comparable performance. This gap is particularly important given that healthcare institutions often face computational constraints and require interpretable, efficient models for clinical deployment. The purpose of this research was to address this gap by conducting the first systematic evaluation of CNN model configurations for sociodemographic information extraction from clinical text, providing evidence-based recommendations for model development practices tailored to this domain. We applied various CNN-based models to categorize text fragments at the sentence level by extracting semantic information from a corpus of medical data. In addition, we integrated our CNN models with traditional machine learning algorithms, including support vector machines and random forests, to assess whether hybrid configurations enhance performance. We investigated 6 variations of convolutional neural network architectures to identify the presence or absence of sociodemographic characteristics in clinical text, comparing them against 2 traditional supervised learning baselines. We explicitly assessed the influence of model complexity on performance, comparing CNN architectures of varying depth and configuration to determine the trade-offs between efficiency, interpretability, and accuracy. We evaluated the models using F1 score, recall, precision, area under the precision-recall curves, and Matthews correlation coefficient. 
Unlike prior studies that explore isolated classification tasks or assume transformer models are necessary for optimal performance, our comparative approach provides practical guidance for selecting efficient, interpretable CNN architectures for real-world clinical deployment, particularly in resource-limited settings where computational efficiency and model transparency are critical considerations.

Methods

Data Source

This project received research ethics board (REB) approval from the University of Toronto (#40129) and North York General Hospital (#20-0044). The models were evaluated using data from the University of Toronto Practice-Based Research Network (UTOPIAN) Data Safe Haven, a repository containing de-identified EMR data from over 400 family physicians across 96 clinics and approximately 400,000 patients in Ontario.⁴⁶ The UTOPIAN database includes records from 3 EMR vendors, which are among the most widely used in family physician practices across the province.^47,48 To define the baseline cohort, all physicians with insufficient or low-quality data, along with their patients, were removed. For each cycle of data, the cutoff for removal applied to physicians with less than 20% of billing records, lab records, medication records, and fewer than 200 rostered patients. Furthermore, to be included in the baseline cohort, a patient had to have a physician with sufficient data quality, a valid age and sex, an EMR start date more than 1 year before the data extraction cut-off date (unless the patient was less than 1 year of age), populated entries in any of the cumulative patient profile tables provided in the EMR, and either be rostered to a physician or have had at least 2 family physician visits in the past 3 years. The social history and risk factor sections of the EMR are semi-structured fields that contribute to the summary information found in the cumulative patient profile. These sections typically contain patients’ sociodemographic details and are regularly updated during clinical visits.

The social history and risk factor sections were filled with 561,210 patient entries. The system logs patient data chronologically, with each entry timestamped. However, this sometimes led to duplicate entries for the same patient. To reduce redundancy, we kept only the most recent entry when multiple records started with the same text during preprocessing. As a result, our analyses were based on the most current status of each sociodemographic characteristic examined in this study. The patient cohort used to train the models consisted of adults aged 18 or older as of December 31st, 2021, since characteristics such as marital status and occupation are less likely to be recorded for children and adolescents. We first grouped entries by eligible patients and merged data from the semi-structured fields within the cumulative patient profile section of the EMR. To ensure the sample accurately reflected the UTOPIAN database, we randomly selected 1.5% of patients from each clinic, resulting in a final cohort of 4375 patients. The 1.5% per-clinic sampling rate was chosen to balance computational feasibility with broad representation across 96 clinics, ensuring that smaller clinics contributed proportionally while preventing larger clinics from dominating the dataset. We assessed the representativeness of this random sample by comparing age, sex, and EMR start date distributions to those of the entire database. Additionally, we verified that the sample included all physicians from each clinic and preserved the distribution of patient counts per physician.
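The per-clinic sampling step can be illustrated with a minimal sketch. It is not the study's code: the function name `sample_per_clinic`, the clinic names, the patient identifiers, the fixed seed, and the rounding rule (nearest patient, at least one per clinic) are all assumptions made for illustration.

```python
import random

def sample_per_clinic(patients_by_clinic, rate=0.015, seed=0):
    """Draw the same fraction of patients at random from every clinic."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for clinic in sorted(patients_by_clinic):
        patient_ids = patients_by_clinic[clinic]
        # Assumption: round to the nearest patient, but keep at least one per clinic.
        k = max(1, round(len(patient_ids) * rate))
        sample[clinic] = rng.sample(patient_ids, k)
    return sample

# Hypothetical clinics with made-up patient identifiers.
clinics = {
    "clinic_a": [f"a{i}" for i in range(1000)],
    "clinic_b": [f"b{i}" for i in range(200)],
}
cohort = sample_per_clinic(clinics)
sizes = {clinic: len(ids) for clinic, ids in cohort.items()}
```

Because every clinic is sampled at the same rate, each clinic's share of the sample matches its share of the eligible population, which is the proportionality property described above.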

A reference standard was created by a trained PhD student who manually labeled social phrases in the cohort using predefined annotation guidelines to produce a dataset for supervised machine learning. The annotation guidelines were developed collaboratively by the clinical team members to ensure clinical relevance and accuracy. The PhD student received dedicated training from the clinical team and was supervised throughout the annotation process. The guidelines included specific examples for each sociodemographic characteristic, explicit rules for handling ambiguous cases, and clear distinctions between patient-specific information and references to family members’ characteristics. Two labels were assigned for each characteristic: one captured the actual documented information from the semi-structured fields, and the other indicated whether that sociodemographic characteristic had been documented at all. Due to resource and budget constraints, a single annotator was used for the primary annotation task. To assess annotation quality and ensure consistency, approximately 5% of the dataset (219 phrases) was double-annotated by the same annotator within the same month, yielding a kappa score of 0.98 for documentation status and 0.96 for the documented information labels, averaged across all characteristics. These high kappa values indicate excellent intra-rater reliability and suggest that the annotation guidelines were sufficiently clear and comprehensive. Figure 1 presents the distribution of each characteristic within the labeled dataset. Race, gender identity, and sexual orientation had documentation rates below 3% and were excluded from model training because the data were insufficient to support effective classification.
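As an illustration of how an agreement score of this kind is computed, the sketch below implements Cohen's kappa for two annotation passes over the same phrases. The text does not name the kappa variant used, so Cohen's kappa is an assumption, and the label lists are made up.

```python
from collections import Counter

def cohen_kappa(first_pass, second_pass):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(first_pass) != len(second_pass) or not first_pass:
        raise ValueError("both passes must be non-empty and of equal length")
    n = len(first_pass)
    # Observed agreement: share of phrases labeled identically in both passes.
    observed = sum(a == b for a, b in zip(first_pass, second_pass)) / n
    # Expected agreement if the two passes were independent.
    c1, c2 = Counter(first_pass), Counter(second_pass)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical documentation-status labels (1 = documented, 0 = not documented).
first = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
second = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohen_kappa(first, second)
```

In this toy example the passes agree on 9 of 10 phrases, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.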

Figure 1.

The frequency of documentation for each sociodemographic characteristic in the labeled sample.

Data Preprocessing

Due to common occurrences of misspellings, non-word symbols, abbreviations, and acronyms in clinical text, the retrieved sample had to undergo a series of text preprocessing steps before being fed to the classification algorithms. First, punctuation, common stopwords (e.g., ‘the’, ‘is’, ‘in’, ‘for’, ‘where’, ‘when’, ‘to’, ‘at’, etc.), and non-alphabetic and 1-letter words were removed from the tokenized input text since they carry little semantic meaning. Abbreviations and acronyms were preserved in their original form rather than expanded, as the Word2Vec embeddings can learn representations for commonly occurring abbreviated terms, and expansion would risk introducing errors given the high variability in clinical abbreviation usage across different EMR systems and clinicians. The text was then converted to lowercase and lemmatized, which transformed the words into their root forms. To prepare the data for the word embedding models, the vocabulary of the data sequences (clinical text) was restricted to the 100,000 most frequent tokens to optimize model training. This cutoff was selected based on vocabulary size optimization principles established in prior clinical NLP work,²⁹ where retaining the most frequent 100,000 tokens captures the majority of semantic information while reducing computational complexity and mitigating overfitting on rare terms. Furthermore, sentences were converted to the same length by applying post padding based on the number of words in the longest sentence. Finally, the categorical target variable (absent/present label) was transformed into numeric values (0/1) using a label encoder.
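The preprocessing steps described above can be sketched in a few lines of standard-library Python. This is an illustrative approximation, not the study's pipeline: the stopword list is limited to the examples quoted in the text, lemmatization is omitted, and the function names, the padding index `PAD_ID`, and the sample phrases are assumptions.

```python
import re
from collections import Counter

# Only the stopwords quoted in the text; the full list used is not given.
STOPWORDS = {"the", "is", "in", "for", "where", "when", "to", "at"}
PAD_ID = 0  # index reserved for padding (assumption)

def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords, non-alphabetic and 1-letter tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t.isalpha() and len(t) > 1 and t not in STOPWORDS]

def build_vocab(token_lists, max_size=100_000):
    """Map the max_size most frequent tokens to integer ids starting at 1."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_size))}

def encode_and_pad(token_lists, vocab):
    """Convert tokens to ids and post-pad every sequence to the longest length."""
    seqs = [[vocab[t] for t in tokens if t in vocab] for tokens in token_lists]
    max_len = max(len(s) for s in seqs)
    return [s + [PAD_ID] * (max_len - len(s)) for s in seqs]

def encode_labels(labels):
    """Map sorted class names to integers, e.g. absent -> 0, present -> 1."""
    mapping = {c: i for i, c in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels]

# Made-up phrases and labels for illustration only.
texts = ["Married, lives with wife in Toronto.", "Retired teacher; 2 children"]
labels = ["present", "absent"]
tokens = [tokenize(t) for t in texts]
vocab = build_vocab(tokens)
padded = encode_and_pad(tokens, vocab)
y = encode_labels(labels)
```

Post padding appends the padding index after the last real token, so the beginning of every phrase stays aligned across the batch.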

Classification Modeling

We focused on binary presence/absence classification due to the sparsity, inconsistency, and heterogeneity of value-level documentation in EMRs. This approach reflects a practical and generalizable task aligned with real-world needs, such as flagging under-documented characteristics for manual review or downstream imputation. Transformer-based models such as BERT or ClinicalBERT were not included in this study for several reasons. First, our research objective was to provide the first systematic comparison of CNN architectures for sociodemographic extraction, as these models have not been previously benchmarked in this domain. Second, for binary sentence-level classification tasks, CNNs can achieve competitive performance while requiring substantially less computational infrastructure, an important consideration for deployment in resource-limited primary care and community health settings that may lack access to GPU clusters or cloud computing resources necessary for transformer inference. Third, CNNs offer faster training and inference times, making them more practical for real-time integration into clinical workflows. Fourth, CNNs provide greater interpretability through visualization of activated filters and learned features, which is essential for clinical trust and adoption. While transformer models excel at capturing long-range dependencies in complex, multi-sentence narratives, the sociodemographic characteristics in our dataset were typically documented in short, localized phrases where CNNs’ ability to capture local semantic patterns is well-suited. However, we acknowledge that transformer comparisons would strengthen future work, particularly for more complex multi-class extraction or characteristics requiring broader contextual understanding.

The parameter configuration of the CNN architectures, such as filter region size, learning rate, etc., was tuned for each characteristic based on suggestions from a previous study using Bayesian optimization.³⁸ The tuned hyperparameters for each model and the assessed characteristics can be found in Table 1. All models utilized pre-trained Word2Vec word embeddings, trained on a corpus of approximately 100 billion words from the Google News dataset.⁴⁹ This model contains 300-dimensional vector representations for 3 million words and phrases. This choice was motivated by its widespread adoption in prior EMR and health informatics studies, providing a robust and reproducible baseline for benchmarking CNN architectures.^50,51 Many of the sociodemographic terms of interest (e.g., marital status, education, occupation) overlap substantially with general-language usage, making general-domain embeddings appropriate. In preliminary analyses, we also compared Word2Vec with GloVe⁵² embeddings and found no significant differences in classification performance, further supporting the suitability of Word2Vec for this task. Nevertheless, future work could explore whether domain-specific embeddings trained on clinical corpora (e.g., MIMIC-III, clinical notes) provide additional performance gains, particularly for characteristics with more specialized medical terminology.

Table 1.

Hyperparameter Tuning Results for Each CNN Architecture and Sociodemographic Characteristic.

Factor	Model A	Model B	Model C	Model D	Model E (SVM)	Model F (RF)
Place of birth	#Dense filters = 368	#Dense filters = 368	#Dense filters = 181	#Dense filters = 267	Kernel = linear	criterion = gini
	#Conv layer filters = 179	#Conv layer filters = 177	#Conv layer filters = 266	#Conv layer filters = 108	Gamma = scale	n_estimators = 100
	Kernel size = 3	Kernel size = (2,3,4)	Kernel size = 1	Kernel size = (1,2,3,4,5)	C = 0.1	min_samples_leaf = 1
	Dropout rate = 0.24	Dropout rate = 0.31	Dropout rate = 0.45	Dropout rate = 0.17		min_samples_split = 2
	Learning rate = 0.0015	Learning rate = 0.0047	Learning rate = 0.0067	Learning rate = 0.0036		max_depth = 10
			Pool size = 3			max_features = 7
Citizenship status	#Dense filters = 370	#Dense filters = 368	#Dense filters = 442	#Dense filters = 143	Kernel = rbf	criterion = gini
	#Conv layer filters = 180	#Conv layer filters = 179	#Conv layer filters = 170	#Conv layer filters = 102	Gamma = scale	n_estimators = 200
	Kernel size = 2	Kernel size = (3,4,5)	Kernel size = 4	Kernel size = (4,5,6,7,8)	C = 0.1	min_samples_leaf = 1
	Dropout rate = 0.26	Dropout rate = 0.24	Dropout rate = 0.32	Dropout rate = 0.43		min_samples_split = 2
	Learning rate = 0.0012	Learning rate = 0.0015	Learning rate = 0.0004	Learning rate = 0.0035		max_depth = None
			Pool size = 4			max_features = 7
Marital status	#Dense filters = 365	#Dense filters = 136	#Dense filters = 266	#Dense filters = 266	Kernel = poly	criterion = gini
	#Conv layer filters = 180	#Conv layer filters = 315	#Conv layer filters = 158	#Conv layer filters = 158	Gamma = scale	n_estimators = 100
	Kernel size = 3	Kernel size = (2,3,4)	Kernel size = 1	Kernel size = (1,2,3,4,5)	C = 10	min_samples_leaf = 2
	Dropout rate = 0.12	Dropout rate = 0.13	Dropout rate = 0.37	Dropout rate = 0.37		min_samples_split = 2
	Learning rate = 0.0016	Learning rate = 0.0040	Learning rate = 0.0031	Learning rate = 0.0031		max_depth = 10
			Pool size = 1			max_features = 7
Occupation	#Dense filters = 136	#Dense filters = 264	#Dense filters = 266	#Dense filters = 267	Kernel = linear	criterion = gini
	#Conv layer filters = 315	#Conv layer filters = 156	#Conv layer filters = 158	#Conv layer filters = 110	Gamma = scale	n_estimators = 200
	Kernel size = 2	Kernel size = (1,2,3)	Kernel size = 1	Kernel size = (1,2,3,4,5)	C = 0.1	min_samples_leaf = 3
	Dropout rate = 0.13	Dropout rate = 0.11	Dropout rate = 0.37	Dropout rate = 0.36		min_samples_split = 2
	Learning rate = 0.0041	Learning rate = 0.0034	Learning rate = 0.0031	Learning rate = 0.0088		max_depth = None
			Pool size = 1			max_features = 7
Education	#Dense filters = 362	#Dense filters = 368	#Dense filters = 439	#Dense filters = 136	Kernel = rbf	criterion = gini
	#Conv layer filters = 173	#Conv layer filters = 179	#Conv layer filters = 182	#Conv layer filters = 315	Gamma = scale	n_estimators = 200
	Kernel size = 2	Kernel size = (3,4,5)	Kernel size = 2	Kernel size = (2,3,4,5,6)	C = 0.1	min_samples_leaf = 1
	Dropout rate = 0.29	Dropout rate = 0.24	Dropout rate = 0.10	Dropout rate = 0.13		min_samples_split = 2
	Learning rate = 0.0018	Learning rate = 0.0015	Learning rate = 0.0013	Learning rate = 0.0040		max_depth = None
			Pool size = 2			max_features = 7

We used a weighted binary cross-entropy loss function to account for class imbalance, in which each label is weighted inversely to its frequency in the data. Training was performed using the Adam optimizer. We assessed 10, 20, 30, 60, and 100 training epochs and found that 20 epochs produced the best performance without overfitting the training data. The hidden layers used the ReLU activation function, except in Models D and E, which used the tanh function. To avoid potential data leakage between the training and testing stages, only a single entry per patient was included in the patient sample. We used the Keras software package with the TensorFlow backend for model implementation.⁵³
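A minimal sketch of an inverse-frequency weighted binary cross-entropy is given below. The exact weighting formula is not stated in the text; the one used here, n / (2 × class count), is a common heuristic and should be read as an assumption, as should the function names and the toy labels.

```python
import math

def inverse_frequency_weights(labels):
    """Weight each class by n / (2 * class_count), so the rarer class weighs more."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute weights")
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}

def weighted_bce(y_true, y_prob, weights, eps=1e-7):
    """Mean binary cross-entropy with a per-class weight on every example."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(y_true)

# Toy imbalanced labels: 8 phrases without the characteristic, 2 with it.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = inverse_frequency_weights(y_true)
missed_positive = weighted_bce([1], [0.1], weights)
missed_negative = weighted_bce([0], [0.9], weights)
```

With these weights, an equally confident mistake on the rare "present" class costs four times as much as one on the common "absent" class, which discourages the model from predicting "absent" everywhere.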

The dimensionality of the word vectors produced by the word embedding layer is denoted by $d$. If a sentence has a length of $s$, the corresponding sentence matrix has a dimensionality of $s \times d$. Given the inherently sequential nature of sentences in textual data, the convolutional filters are designed with widths equal to the dimensionality of the word embeddings (i.e., $d$), as each row corresponds to an individual word. A filter defined by a weight matrix $w$ and spanning a region size of $h$ will have $h \cdot d$ learnable parameters. The sentence can be represented by a matrix $A \in \mathbb{R}^{s \times d}$, where $A[i : j]$ denotes the sub-matrix consisting of rows $i$ through $j$. The convolution operation yields an output sequence $o \in \mathbb{R}^{s - h + 1}$, obtained by sliding the filter over successive sub-matrices of $A$, such that:

$$o_{i} = w \cdot A[i : i + h - 1] \tag{1}$$

where $i = 1, \ldots, s - h + 1$. A bias term $b \in \mathbb{R}$ is added to every $o_{i}$, and an activation function $f$ is then applied. This produces a feature map $c \in \mathbb{R}^{s - h + 1}$ for each filter:

$$c_{i} = f(o_{i} + b) \tag{2}$$

Feature maps are used by CNNs to learn rich representations of the training data and find intrinsic patterns of the sociodemographic information in the text.
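Equations (1) and (2) can be traced with a small worked sketch in plain Python. The matrix values, the filter, and the function names are made up for illustration, ReLU is used as the activation $f$, and indices are 0-based here whereas the equations are 1-based.

```python
def relu(x):
    return max(0.0, x)

def conv_feature_map(A, w, b, f=relu):
    """Slide an h x d filter w over an s x d sentence matrix A.

    Returns the feature map c of length s - h + 1, with c_i = f(o_i + b).
    """
    s, h, d = len(A), len(w), len(A[0])
    if any(len(row) != d for row in list(A) + list(w)):
        raise ValueError("every row of A and w must have d entries")
    if h > s:
        raise ValueError("filter region size exceeds sentence length")
    c = []
    for i in range(s - h + 1):
        # Equation (1): sum of element-wise products over rows i .. i + h - 1.
        o_i = sum(w[r][k] * A[i + r][k] for r in range(h) for k in range(d))
        # Equation (2): add the bias, then apply the activation.
        c.append(f(o_i + b))
    return c

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # s = 4 words, d = 2
w = [[1.0, -1.0], [0.5, 0.5]]                          # region size h = 2
c = conv_feature_map(A, w, b=0.1)
```

The filter here has $h \cdot d = 4$ weights and produces $s - h + 1 = 3$ activations, one per window of two consecutive words.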

We evaluated 6 CNN architectures of varying complexity to systematically assess the impact of model configuration on classification performance. Table 2 provides a comparative overview of these architectures, highlighting their key structural differences and design rationales. Models A and B represent simpler, single-layer architectures with uniform and multi-scale filters respectively. Models C and D explore depth versus width, with Model C employing sequential convolutional blocks and Model D using parallel blocks. Models E and F are hybrid approaches that combine CNN feature extraction with traditional classifiers (support vector machine and random forest, respectively). Detailed descriptions of each architecture follow.

Model A: The first model is a simple 1-layer CNN that is akin to standard baseline methods such as support vector machines and logistic regression. Studies using this comparatively simple architecture have shown very strong results suggesting that it may be used as a drop-in replacement for well-established baseline models.^35,38,54 In practice, a simpler model is less prone to bias on the task-specific dataset used for training while producing fast training and prediction times. We employed filters with identical region sizes to capture complementary features within the same contextual regions. This caused the dimensionality of the feature map to vary based on the different sentence lengths. Therefore, a 1-max pooling function performed globally over feature maps was used to produce a fixed-length vector. This was followed by a flatten layer to convert the data into a 1-dimensional array to feed it into the first fully connected layer. Dropout was applied to the output of the first fully connected layer as a means of regularization. Additionally, we assessed whether $L_{2}$ regularization added to the convolutional layers in this model and subsequent architectures would improve testing performance. However, our results indicated a lower F1 score, thus it was not used. This was followed by the output layer which produced the prediction score and labels for a given observation. A graphical representation of the model is shown in Figure 2.

Model B: Studies have found that combining several filters with different region sizes, chosen around an optimal value, increased performance for sentence-based classification.^28,35,38 Therefore, for this model, we implemented 1 convolutional layer with 3 different filter region sizes for comparison with Model A. This was followed by a 1-max pooling function performed globally over feature maps and a flatten layer. Finally, a fully connected layer was added, followed by a dropout layer to prevent overfitting. The model representation is shown in Figure 3.
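The role of 1-max pooling with several region sizes can be shown with a short sketch. It is a toy illustration rather than the study's Keras model: the sentence matrix, the two filters per region size, and the function names are assumptions.

```python
def feature_map(A, w, b=0.0):
    """ReLU feature map of one h x d filter slid over an s x d sentence matrix."""
    s, h, d = len(A), len(w), len(A[0])
    return [
        max(0.0, b + sum(w[r][k] * A[i + r][k] for r in range(h) for k in range(d)))
        for i in range(s - h + 1)
    ]

def multi_region_features(A, filters_by_region):
    """1-max pool every filter's feature map and concatenate across region sizes."""
    pooled = []
    for h in sorted(filters_by_region):
        for w in filters_by_region[h]:
            pooled.append(max(feature_map(A, w)))  # keep the strongest activation
    return pooled

d = 2
A = [[0.1 * (i + k) for k in range(d)] for i in range(6)]  # toy 6 x 2 sentence matrix
# Two made-up filters (all +1 and all -1 weights) for each region size 2, 3, 4.
filters_by_region = {
    h: [[[1.0] * d for _ in range(h)], [[-1.0] * d for _ in range(h)]]
    for h in (2, 3, 4)
}
features = multi_region_features(A, filters_by_region)
map_lengths = {h: len(feature_map(A, filters_by_region[h][0])) for h in (2, 3, 4)}
```

The feature maps have different lengths for each region size, yet pooling reduces each to a single value, so the concatenated vector always has one entry per filter regardless of region size or sentence length.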

Model C: To assess whether deeper architectures perform better for text classification, we implemented a more complex model consisting of 2 blocks of convolutional layers and 1-max pooling functions. The first max pooling function was performed over small, equal-sized local regions, with a pooling size tuned for each characteristic, and the second was performed globally over feature maps. This architecture was inspired by Hughes et al,⁵⁵ who found that 2 sets of 2 convolutional layers, with each pair followed by a max pooling layer, was the optimal configuration for medical text classification. The last max pooling layer was followed by a flatten layer, a fully connected layer, and dropout before the output layer. The representation of the model can be found in Figure 4.

Model D: Inspired by the questions raised in Le et al,⁵⁶ we assessed whether a wide CNN architecture for text classification performs better than a deep architecture such as Model C. This wide CNN model consisted of a concatenation of 5 convolutional blocks. Each block had a similar architecture containing 2 pairs of convolutional layers, each followed by a 1-max pooling function performed globally over feature maps. A flatten layer was added to the end of each block to convert multidimensional inputs into 1 dimension. To extract local characteristics of varying sizes, each convolutional block used different filter region sizes. This allowed the model to fully represent the information of each word.²⁸ The blocks were then merged and followed by a fully connected layer using a Leaky ReLU activation function and a batch normalization layer to speed up convergence. The representation of this model can be found in Figure 5.

Model E: This model consists of 2 components: feature extraction and classification for text categorization. A CNN model was used to extract deep representations of the training data that were fed into a traditional machine learning model to identify the documentation status of sociodemographic characteristics in clinical text. As word embeddings are not typically re-trained on the training sample, various words would not obtain their own representation in the vector space.²⁸ This results in an incomplete account of the data features. To solve this issue, a hybrid approach that combines deep learning with traditional supervised modeling can produce a more robust architecture that can fully represent the training samples. Model B was used to perform feature extraction and a support vector machine was used for text classification. Model B was chosen since previous studies have found optimal results when using a combination of several filter region sizes for text classification.^28,38 Furthermore, a support vector machine was chosen since such models have been found to perform well over a range of classification tasks and to handle class imbalance better.⁵⁷ The fundamental concept behind support vector machines involves first breaking down an example, in this context clinical text, into a concise vector of relevant ‘features’ that encapsulate the contents of the text. The algorithm then tries to find a separating hyperplane capable of effectively dividing the instances into 2 groups, with instances corresponding to class A on 1 side of the hyperplane and those corresponding to class B on the other. In practice, these instances are not perfectly separable in this manner. Therefore, the algorithm utilizes a technique known as the ‘kernel trick’ to map the instances into a higher-dimensional (potentially infinite-dimensional) space, where identifying this separating hyperplane becomes more manageable, and then the hyperplane is projected back into the dimensionality of the original feature space.²⁸ A grid search was used to tune the hyperparameters of the support vector machine on the test data; the hyperparameters can be found in Table 1. The standalone CNN model was compared to the hybrid approach to evaluate any model enhancements on a held-out test set. Figure 6 provides an overview of the model architecture.
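The idea behind the kernel trick can be shown with a small worked example. The sketch below uses a degree-2 polynomial kernel purely for illustration (the kernel actually used in the study is not stated in this section, so this choice is an assumption): evaluating the kernel in the original 2-dimensional space gives the same value as a dot product in an explicitly mapped 3-dimensional space, so the mapping never has to be computed.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel evaluated in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two hypothetical feature vectors (for example, CNN-derived features).
x, y = (1.0, 2.0), (3.0, 0.5)

implicit = poly_kernel(x, y)                    # no explicit mapping needed
explicit = dot(feature_map(x), feature_map(y))  # map first, then dot product
print(implicit, explicit)  # the two values agree up to floating-point rounding
```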

Model F: This model follows the same idea as Model E. It consists of a feature extraction block and a classification block. The feature extraction block used was Model D, while the classification was performed by a random forest classifier. Model D was chosen since the 5 parallel CNN blocks with varying filter region sizes can fully consider the information of each word in the text.²⁸ Furthermore, a random forest classifier was chosen as it is well suited for dealing with high-dimensional noisy data such as text.⁵⁸ It is composed of a set of decision trees in which each tree is trained using random subsets of features.⁵⁹ The prediction made by the random forest classifier is obtained by majority voting over the individual predictions of the trees in the forest. Random forests are effective at modeling complex interactions among input variables due to their hierarchical decision tree architecture.⁶⁰ The features extracted from the input text by each convolutional block were concatenated and fed to the random forest classifier, which produced the final classification results. A grid search was used to tune the hyperparameters of the random forest classifier on the test data; the hyperparameters can be found in Table 1. The full architecture of the model is shown in Figure 7.
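The majority-voting step described above can be sketched in a few lines. The votes are hypothetical, and the helper name `majority_vote` is introduced here for illustration only; it is not part of the study's code.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from 5 trees for one sentence (1 = present, 0 = absent).
votes = [1, 0, 1, 1, 0]
label = majority_vote(votes)
print(label)  # 1, since 3 of the 5 trees vote "present"
```

Using an odd number of trees avoids ties in the binary case.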

Table 2.

Comparative Summary of CNN Architectures Evaluated in This Study.

Model	Architecture type	Key features	Complexity	Rationale
Model A	Single-layer CNN	One convolutional layer; uniform filter size; global max pooling	Low	Baseline; efficient; less prone to overfitting
Model B	Single-layer CNN (multi-scale)	One convolutional layer; 3 filter sizes; global max pooling	Low-Medium	Captures features at multiple granularities
Model C	Deep CNN	Two convolutional blocks; local + global max pooling	Medium	Tests whether added depth improves performance
Model D	Wide CNN	Five parallel convolutional blocks; varying filter sizes; concatenated feature maps	High	Evaluates effect of width; captures diverse word-level cues
Model E	Hybrid CNN + SVM	Uses Model B for features; SVM for classification	Medium	Combines deep features with traditional classifier stability
Model F	Hybrid CNN + RF	Uses Model D for features; RF for classification	High	Leverages ensemble learning for complex interactions

Figure 2.

The CNN architecture representation of Model A.

Figure 3.

The CNN architecture representation of Model B.

Figure 4.

The CNN architecture representation of Model C.

Figure 5.

The CNN architecture representation of Model D.

Figure 6.

The CNN architecture representation of Model E.

Figure 7.

The CNN architecture representation of Model F.

Baseline Models

We implemented 2 baseline models for performance comparisons against our deep neural networks: logistic regression and random forest as these 2 models were found to have the best performance for sentence level classification of social and behavioral characteristics in clinical text.^61,62 The models were trained on features derived from term frequency-inverse document frequency representations of the preprocessed text. To enhance performance, Bayesian optimization was used for hyperparameter tuning tailored to each characteristic. Model evaluation was conducted using stratified 10-fold cross-validation to ensure robust performance assessment while addressing class imbalance.

Evaluation Criteria

The primary evaluation metrics were recall, precision, F1 score, area under the precision-recall curve, and Matthews correlation coefficient, computed across all sociodemographic characteristics. Stratified 10-fold cross-validation was used to evaluate the generality of model performance. F1 score is generally used to measure model performance for imbalanced datasets as it takes both precision and recall into account. Furthermore, the decision as to whether to optimize precision or recall depends on the clinical application, which in this case is unknown; therefore, we considered F1 score as the primary evaluation metric.²⁶ To provide a more comprehensive assessment of model performance, we also reported Matthews correlation coefficient, a metric that evaluates the quality of binary classifications by incorporating true and false positives and negatives. Unlike F1 score, Matthews correlation coefficient is a balanced metric even when the classes are of very different sizes, offering a more reliable indicator of overall performance, particularly in imbalanced datasets. For each performance metric, we report the mean across the 10 folds along with 95% confidence intervals, calculated as mean ± 1.96 × (SD/ $\sqrt{10}$ ), where SD is the standard deviation across folds. These confidence intervals quantify the uncertainty in our performance estimates and provide a range within which the true population performance is likely to fall. Below is the mathematical description of the evaluation metrics:

$$\text{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}} \tag{3}$$

$$\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \tag{4}$$

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{5}$$

$$\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{6}$$

where TP represents true positives, TN represents true negatives, FP represents false positives, and FN represents false negatives. To compare model performances, the precision-recall curve (PR curve) was used as it is generally preferred for imbalanced classes.⁶³
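As a worked example of equations (3) to (6), the sketch below computes the 4 metrics from a confusion matrix. The counts are hypothetical (a 10% positive class, loosely mimicking an imbalanced characteristic) and are not taken from the study; the function assumes that none of the denominators is zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Recall, precision, F1 and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator
    return {"recall": recall, "precision": precision, "f1": f1, "mcc": mcc}


# Hypothetical counts: 100 positive and 900 negative sentences.
metrics = classification_metrics(tp=80, tn=880, fp=20, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, recall, precision, and F1 are all 0.800, while MCC is about 0.778 because it also accounts for the true negatives.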

To evaluate whether the differences in model performance were statistically significant, we employed a combination of parametric and non-parametric tests, depending on the distribution of the data. Shapiro-Wilk tests⁶⁴ were conducted to assess data normality. If normality was satisfied (p > 0.05, for all metrics), we used repeated measures ANOVA,⁶⁵ followed by Tukey’s honestly significant difference test⁶⁶ to control for multiple comparisons in pairwise model evaluations. If normality was violated, we applied the Friedman test,⁶⁷ a non-parametric alternative, with the Nemenyi post-hoc test⁶⁸ to identify significant differences while adjusting for multiple comparisons.
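For the non-parametric branch, the Friedman statistic can be computed by ranking the models within each fold. The following is a minimal sketch under stated assumptions: the fold scores are hypothetical, only 4 folds and 3 models are shown for brevity, no tie correction is applied, and the software used by the authors is not specified in this section.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share a mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(scores_by_fold):
    """Friedman chi-square statistic; rows are folds, columns are models."""
    n = len(scores_by_fold)
    k = len(scores_by_fold[0])
    rank_sums = [0.0] * k
    for row in scores_by_fold:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)


# Hypothetical F1 scores: 4 folds (rows) x 3 models (columns).
folds = [
    [0.92, 0.90, 0.88],
    [0.91, 0.89, 0.87],
    [0.93, 0.90, 0.86],
    [0.90, 0.89, 0.85],
]
stat = friedman_statistic(folds)
print(stat)  # compared against a chi-square distribution with k - 1 = 2 df
```

Because the models are ranked identically in every fold of this toy example, the statistic reaches its maximum possible value of n(k - 1) = 8.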

Results

As shown in Figure 8, Model A achieved the highest recall (91.8%), F1 score (91.0%), area under the precision-recall curve (95%), and Matthews correlation coefficient (88.1%), indicating a strong balance between precision and recall and high predictive reliability even in imbalanced datasets. Model B performed comparably to Model A, with a slightly lower F1 score of 90.0%, and still performed well on all metrics. Model F, on average, had the best overall precision (92.2%) of all the models. The more complex models (Models C and D) generally achieved high recall, while the hybrid approaches (Models E and F) tended to have high precision due to the strengths of their traditional classifiers. However, both sets of models often struggled to maintain a balance between precision and recall, leading to lower F1 scores and Matthews correlation coefficient values, which was ultimately reflected in their overall performance compared to simpler CNN architectures.
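The averages quoted above are consistent with an unweighted mean over the 5 characteristics. The sketch below checks this for F1 score using the per-characteristic values reported in Table 3; treating the summary figure as a simple unweighted mean is an assumption, since the averaging method is not spelled out.

```python
from statistics import mean

# F1 scores (%) per characteristic, copied from Table 3, in the order:
# place of birth, citizenship, marital status, occupation, education status.
f1_by_model = {
    "Model A": [92.058, 91.098, 97.536, 91.770, 82.477],
    "Model B": [89.572, 91.916, 97.028, 90.378, 81.308],
}

# Unweighted mean across characteristics (assumed averaging method).
average_f1 = {model: mean(scores) for model, scores in f1_by_model.items()}
for model, value in average_f1.items():
    print(f"{model}: {value:.2f}%")
```

This prints 90.99% for Model A and 90.04% for Model B, matching the rounded values of 91.0% and 90.0% reported in the text.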

Figure 8.

Grouped bar chart comparing 10-fold average (in percentage) of recall, precision, F1 score, area under the precision-recall curve (AUCPR), and Matthews correlation coefficient (MCC) across all sociodemographic characteristics for all 6 CNN models.

Table 3 summarizes average performance metrics across the 6 models from the stratified 10-fold cross-validation for individual characteristics. The values in parentheses represent 95% confidence intervals. For all characteristics, at least 1 CNN model consistently outperformed the baseline models, namely logistic regression and random forest. Model A significantly outperformed logistic regression for place of birth by 10.556 percentage points in F1 score (92.058% vs 81.502%) and citizenship status by 10.004 percentage points (91.098% vs 81.094%), and significantly outperformed random forest for marital status by 3.691 percentage points (97.536% vs 93.845%) and occupation by 6.465 percentage points (91.770% vs 85.305%) (p < 0.05). Model B (F1: 91.916%) and Model E (F1: 90.796%) both significantly outperformed logistic regression for citizenship status by 10.822 and 9.702 percentage points respectively (p < 0.05). For marital status and occupation, Model F significantly outperformed random forest by 3.480 percentage points (97.325% vs 93.845%) and 6.661 percentage points (91.966% vs 85.305%) respectively (p < 0.05), indicating that incorporating CNN-based feature extraction before random forest classification can enhance performance in balanced datasets. In terms of marital status both Models C (F1: 97.801%) and D (F1: 97.153%) also significantly outperformed random forest by 3.956 and 3.308 percentage points respectively (p < 0.05). No significant differences were observed between performances across models for the classification of education status, likely due to its high class imbalance, though Model A achieved the highest F1 score (82.477%), only 2.599 percentage points above the best baseline (random forest: 79.878%).

Table 3.

10-fold Average Model Performance on Each Sociodemographic Factor Using Recall, Precision, F1 Score, Area Under the Precision-Recall Curve (AUCPR), and Matthews Correlation Coefficient (MCC) as Evaluation Metrics.

Model performances (N = 4375)
Factor	Metric (%)	Model A	Model B	Model C	Model D	Model E	Model F	LR	RF
Place of birth	Recall	92.681 (92.650, 92.712)	93.670 (93.648, 93.692)	92.853 (92.820, 92.886)	91.235 (91.208, 91.262)	90.730 (90.702, 90.758)	86.031 (86.009, 86.053)	75.644 (75.606, 75.682)	77.963 (77.927, 77.999)
	Precision	91.689 (91.666, 91.712)	86.311 (86.272, 86.350)	85.321 (85.307, 85.335)	91.075 (91.051, 91.099)	91.700 (91.685, 91.715)	92.582 (92.565, 92.599)	88.885 (88.863, 88.907)	91.744 (91.719, 91.769)
	F1 score	92.058 (92.039, 92.077)	89.572 (89.558, 89.586)	88.828 (88.811, 88.845)	91.072 (91.053, 91.091)	91.141 (91.124, 91.158)	89.126 (89.111, 89.141)	81.502 (81.483, 81.521)	84.153 (84.129, 84.177)
	AUCPR	96.625 (96.613, 96.637)	95.750 (95.741, 95.759)	94.036 (94.021, 94.051)	95.877 (95.863, 95.891)	95.432 (95.417, 95.447)	96.332 (96.320, 96.344)	87.008 (86.982, 87.034)	90.635 (90.618, 90.652)
	MCC	90.671 (90.650, 90.692)	87.714 (87.698, 87.730)	86.848 (86.828, 86.868)	89.475 (89.453, 89.497)	89.569 (89.550, 89.588)	87.323 (87.306, 87.340)	78.641 (78.621, 78.661)	81.733 (81.706, 81.760)
Citizenship	Recall	92.098 (92.074, 92.122)	95.377 (95.354, 95.400)	88.003 (87.979, 88.027)	90.781 (90.756, 90.806)	90.784 (90.763, 90.805)	84.544 (84.517, 84.571)	74.511 (74.469, 74.553)	77.145 (77.110, 77.180)
	Precision	90.421 (90.389, 90.453)	89.204 (89.161, 89.247)	87.391 (87.354, 87.428)	88.463 (88.427, 88.499)	91.011 (90.989, 91.033)	91.359 (91.336, 91.382)	89.658 (89.640, 89.676)	90.243 (90.219, 90.267)
	F1 score	91.098 (91.080, 91.116)	91.916 (91.897, 91.935)	87.512 (87.492, 87.532)	89.456 (89.434, 89.478)	90.796 (90.785, 90.807)	87.769 (87.747, 87.791)	81.094 (81.074, 81.114)	83.005 (82.985, 83.025)
	AUCPR	96.960 (96.952, 96.968)	96.580 (96.570, 96.590)	93.008 (92.993, 93.023)	95.667 (95.657, 95.677)	95.720 (95.708, 95.732)	94.902 (94.888, 94.916)	87.039 (87.024, 87.054)	89.565 (89.551, 89.579)
	MCC	89.556 (89.535, 89.577)	90.613 (90.591, 90.635)	85.329 (85.305, 85.353)	87.600 (87.574, 87.626)	89.185 (89.172, 89.198)	85.763 (85.737, 85.789)	78.680 (78.662, 78.698)	80.648 (80.626, 80.670)
Marital status	Recall	97.191 (97.184, 97.198)	96.230 (96.220, 96.240)	96.489 (96.483, 96.495)	97.626 (97.619, 97.633)	97.522 (97.517, 97.527)	96.850 (96.844, 96.856)	96.437 (96.433, 96.441)	94.526 (94.515, 94.537)
	Precision	97.985 (97.979, 97.991)	97.863 (97.857, 97.869)	99.161 (99.156, 99.166)	96.705 (96.695, 96.715)	96.278 (96.266, 96.290)	98.636 (98.632, 98.640)	92.050 (92.040, 92.060)	93.200 (93.194, 93.206)
	F1 score	97.536 (97.533, 97.539)	97.028 (97.023, 97.033)	97.801 (97.798, 97.804)	97.153 (97.147, 97.159)	96.882 (96.876, 96.888)	97.325 (97.321, 97.329)	94.184 (94.179, 94.189)	93.845 (93.841, 93.849)
	AUCPR	99.171 (99.166, 99.176)	98.917 (98.913, 98.921)	98.817 (98.810, 98.824)	99.225 (99.222, 99.228)	98.945 (98.941, 98.949)	99.050 (99.045, 99.055)	97.534 (97.530, 97.538)	97.190 (97.186, 97.194)
	MCC	95.192 (95.186, 95.198)	94.240 (94.231, 94.249)	95.773 (95.767, 95.779)	94.386 (94.375, 94.397)	93.837 (93.824, 93.850)	95.598 (95.590, 95.606)	88.387 (88.376, 88.398)	87.842 (87.833, 87.851)
Occupation	Recall	92.515 (92.493, 92.537)	90.361 (90.344, 90.378)	94.564 (94.549, 94.579)	93.647 (93.629, 93.665)	92.353 (92.339, 92.367)	93.056 (93.048, 93.064)	84.003 (83.989, 84.017)	87.992 (87.978, 88.006)
	Precision	91.120 (91.112, 91.128)	90.510 (90.488, 90.532)	89.387 (89.378, 89.396)	87.804 (87.780, 87.828)	91.435 (91.423, 91.447)	90.911 (90.903, 90.919)	90.619 (90.608, 90.630)	82.814 (82.801, 82.827)
	F1 score	91.770 (91.759, 91.781)	90.378 (90.364, 90.392)	91.881 (91.872, 91.890)	90.513 (90.504, 90.522)	91.880 (91.869, 91.891)	91.966 (91.959, 91.973)	87.165 (87.155, 87.175)	85.305 (85.294, 85.316)
	AUCPR	96.994 (96.989, 96.999)	96.762 (96.756, 96.768)	96.527 (96.521, 96.533)	96.552 (96.546, 96.558)	94.828 (94.818, 94.838)	96.806 (96.800, 96.812)	94.839 (94.832, 94.846)	93.421 (93.415, 93.427)
	MCC	84.443 (84.424, 84.462)	81.857 (81.830, 81.884)	84.416 (84.399, 84.433)	81.822 (81.805, 81.839)	84.590 (84.569, 84.611)	84.590 (84.568, 84.612)	76.724 (76.707, 76.741)	71.565 (71.544, 71.586)
Education status	Recall	84.514 (84.488, 84.540)	80.344 (80.281, 80.407)	79.001 (78.953, 79.049)	79.494 (79.436, 79.552)	73.738 (73.692, 73.784)	67.456 (67.413, 67.499)	69.042 (68.987, 69.097)	79.285 (79.244, 79.326)
	Precision	80.899 (80.856, 80.942)	82.996 (82.943, 83.049)	79.931 (79.900, 79.962)	81.162 (81.134, 81.190)	85.729 (85.687, 85.771)	87.300 (87.257, 87.343)	84.322 (84.266, 84.378)	80.667 (80.629, 80.705)
	F1 score	82.477 (82.450, 82.504)	81.308 (81.258, 81.358)	79.203 (79.174, 79.232)	80.041 (80.005, 80.077)	79.057 (79.020, 79.094)	75.954 (75.916, 75.992)	75.523 (75.483, 75.563)	79.878 (79.842, 79.914)
	AUCPR	85.366 (85.328, 85.404)	86.868 (86.830, 86.906)	85.128 (85.093, 85.163)	85.010 (84.970, 85.050)	82.911 (82.857, 82.965)	84.589 (84.547, 84.631)	80.631 (80.588, 80.674)	83.655 (83.617, 83.693)
	MCC	80.676 (80.646, 80.706)	79.566 (79.512, 79.620)	77.162 (77.131, 77.193)	78.129 (78.090, 78.168)	77.403 (77.364, 77.442)	74.569 (74.528, 74.610)	73.869 (73.826, 73.912)	77.803 (77.764, 77.842)

Numbers in parentheses represent the 95% confidence intervals. Bold-faced numbers represent the highest performance among the 6 models. Multiple values are highlighted for 1 metric if they only vary by a few decimal points.
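The interval in each cell follows the formula given under Evaluation Criteria, mean ± 1.96 × (SD/ $\sqrt{10}$ ). A minimal sketch of that calculation is shown below. The fold scores are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, as the text does not say which form of SD was used.

```python
import math
from statistics import mean, stdev


def confidence_interval_95(fold_scores):
    """Return (mean, lower, upper) using mean ± 1.96 * SD / sqrt(n)."""
    n = len(fold_scores)
    centre = mean(fold_scores)
    # stdev() is the sample standard deviation (assumed choice).
    half_width = 1.96 * stdev(fold_scores) / math.sqrt(n)
    return centre, centre - half_width, centre + half_width


# Hypothetical F1 scores (%) from 10 folds; not taken from the study.
fold_f1 = [91.2, 92.5, 90.8, 93.1, 91.9, 92.3, 90.5, 92.8, 91.6, 92.0]

centre, lower, upper = confidence_interval_95(fold_f1)
print(f"{centre:.3f} ({lower:.3f}, {upper:.3f})")
```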

There was notable variability in performance across both models and sociodemographic characteristics. While Model A generally demonstrated strong performance, it did not consistently outperform all other models across every characteristic. This variability highlights that although simpler CNN architectures like Model A often deliver robust results—particularly in imbalanced datasets—model effectiveness is highly influenced by the underlying data characteristics. For instance, Model B proved superior for classifying citizenship status. At the same time, the more complex models (Models C and D) as well as Model F provided competitive performance results for marital status. For occupation, the hybrid models (Models E and F) alongside Model C excelled, suggesting that each model has advantages over others depending on how the output characteristic is documented in the medical text.

To assess the impact of integrating traditional classifiers with CNN-based feature extraction, we compared the F1 score performance of the hybrid models against their standalone CNN counterparts (shown in Table 4). Across most sociodemographic characteristics, the standalone CNN models generally outperformed their hybrid counterparts, suggesting that the addition of traditional classifiers did not consistently enhance performance and, in some cases, resulted in a decline. This performance drop was particularly notable for education status, where the F1 score declined from 80.041% to 75.954% when the CNN of Model F was combined with random forest. Similarly, for citizenship status, both hybrid models underperformed relative to the standalone CNNs. In contrast, for more structured and balanced characteristics such as marital status and occupation, a hybrid approach could increase performance. For example, Model F resulted in a slight improvement from 97.153% to 97.732% for marital status and from 90.513% to 91.966% for occupation, while Model E improved for occupation (90.378% to 91.880%) but not for marital status (97.028% to 96.882%).

Table 4.

F1 Score Comparisons (in Percentage) of the Hybrid CNN Architectures to Their Standalone CNN Components for Each Sociodemographic Characteristic.

Characteristic	CNN(E)	CNN(E) + SVM	CNN(F)	CNN(F) + RF
Place of birth	89.572	91.141	91.072	89.126
Citizenship	91.916	90.796	89.456	87.768
Marital status	97.028	96.882	97.153	97.732
Occupation	90.378	91.880	90.513	91.966
Education status	81.307	79.057	80.041	75.954

The CNN component of Model E is depicted as CNN(E) and the support vector machine as SVM. The CNN component of Model F is depicted as CNN(F) and the random forest as RF. Bold-faced numbers indicate an enhancement in model performance when a traditional classifier is added to the architecture. The differences in performance were not statistically significant.
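The gains and losses in Table 4 can be tabulated directly. The sketch below copies the F1 values from the table and reports, for each characteristic, the change in percentage points when the traditional classifier is added; the variable and function names are illustrative only.

```python
# F1 scores (%) copied from Table 4: (standalone CNN, hybrid) per model.
table4 = {
    "Place of birth":   {"E": (89.572, 91.141), "F": (91.072, 89.126)},
    "Citizenship":      {"E": (91.916, 90.796), "F": (89.456, 87.768)},
    "Marital status":   {"E": (97.028, 96.882), "F": (97.153, 97.732)},
    "Occupation":       {"E": (90.378, 91.880), "F": (90.513, 91.966)},
    "Education status": {"E": (81.307, 79.057), "F": (80.041, 75.954)},
}


def hybrid_gain(standalone, hybrid):
    """Change in F1 (percentage points) from adding the classifier."""
    return round(hybrid - standalone, 3)


gains = {
    characteristic: {m: hybrid_gain(*pair) for m, pair in models.items()}
    for characteristic, models in table4.items()
}

# (characteristic, model) pairs where the hybrid beat the standalone CNN.
improved = sorted(
    (c, m) for c, per_model in gains.items() for m, g in per_model.items() if g > 0
)

for characteristic, per_model in gains.items():
    print(f"{characteristic}: E {per_model['E']:+.3f}, F {per_model['F']:+.3f}")
```

Only 4 of the 10 comparisons show a gain: Model E for place of birth and occupation, and Model F for marital status and occupation.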

The PR curves, calculated on a held-out test set of 438 data entries (10% of the labeled dataset), are shown in Figure 9 for each model, grouped by sociodemographic characteristic. Marital status, occupation, and place of birth achieved high precision and recall simultaneously, showing the highest separation of classes with the most agreement between models. In contrast, education produced the lowest average precision scores and the most variation in the PR curves between models. Despite marital status and occupation having very similar data distributions (~50% presence), there is noticeable variation in the PR curves between the 2 characteristics: marital status had much better separability, as indicated by higher precision and recall across models. A similar pattern is observed for place of birth and citizenship status, confirming the challenges posed by data characteristics such as extreme class imbalance or high lexical variability.

Figure 9.

The PR curves of the 6 deep learning models and the 2 traditional classifiers for each sociodemographic characteristic. The average precision (AP) for each model is displayed between brackets.

The F1 score distribution for each model, categorized by the sociodemographic factors, is presented in Figure 10. These distributions reflect the F1 scores obtained from the 10-fold stratified cross-validation process used to evaluate model performance. Unlike single-point metrics, these graphs provide a clearer depiction of each model’s stability across different data splits. Marital status exhibited the least variability in F1 scores across folds and models, indicating consistent performance regardless of the specific data partition. On the other hand, the F1 score distributions for occupation showed greater dispersion compared to marital status, reflecting the impact of data complexity on performance consistency. This pattern also extends to place of birth and citizenship status, where the former exhibited more pronounced peaks with less dispersion. Education status showed the highest variability, which is consistent with the above findings. Interestingly, model stability did not always correspond with peak performance. Some models demonstrated greater consistency across folds, suggesting stronger generalizability even without achieving the highest F1 scores. For instance, Model B showed the most stable performance for place of birth, while Model E was the most consistent for citizenship status, despite Model A and Model B having the highest average F1 scores for these characteristics, respectively.

Figure 10.

The distribution of the F1 scores across data splits from the 10-fold stratified cross-validation for each sociodemographic characteristic and each model including the 2 traditional classifiers: (a) place of birth, (b) citizenship status, (c) marital status, (d) occupation, and (e) education.

Error Analysis

To better understand the differences in performance results for characteristics with very similar distributions (i.e., place of birth vs citizenship status, marital status vs occupation), we conducted an error analysis using the predictions on the held-out test set. We randomly sampled 20 sentences of cases where a factor was present but was predicted to be absent as well as the reverse where there was no indication of a factor but the models incorrectly classified these entries as present. Examples of misclassified sentences for both categories are presented in Table 5. Overall, we found that the majority of misclassified entries were due to incorrect annotations, contributing to both false positive and false negative errors. However, we did observe quite significant variations in how each characteristic was documented in the EMRs that helped explain the differences in model performance.

Table 5.

Sampled Sentences That Resulted in Misclassification by the Models.

Sentence	Characteristic	Misclassified label	True label	Identified issue
‘Returned to Canada, prior lived in Kenya’	Place of Birth	Present	Absent	States previous living situation but isn’t directly associated with place of birth
‘Born in Newfoundland’	Citizenship Status	Absent	Present	Low instances of this specific province in the training sample
‘Originally from Quebec’	Citizenship Status	Present	Absent	Doesn’t state patient was born in Quebec so can’t infer citizenship status
‘Former real estate agent’	Occupation	Absent	Present	Low instances of this specific job title in the training sample
‘Living situation: lives with daughter’	Marital Status	Present	Absent	The structured format is used frequently to indicate living with spouse/wife/husband, and therefore, the models must have associated the term with an indication of marital status
‘Teacher’, ‘Banker’, ‘Software developer’	Occupation	Absent	Present	Model was unable to capture any contextual information due to the use of a single term to represent occupation
‘Mother divorced’	Marital Status	Present	Absent	The sentence contained an indication of marital status, but the models failed to recognize that it referred to parental divorce rather than the patient’s own divorce
‘Graduated from PSW school’	Education Status	Absent	Present	Sparse representation, insufficient examples of this specific use case in training data to learn robust patterns
‘Completed grade 10’	Education Status	Absent	Present	High lexical diversity, education expressed in many ways (degrees, grades, certifications), making pattern learning difficult
‘Employed at teachers college’	Education Status	Present	Absent	Classified as education status; models confused occupational context with educational attainment

For example, sentences that contained occupation information but were misclassified as absent were primarily attributable to the sheer number of ways in which job details can be documented. Many instances in the clinical text featured single-word job titles or brief descriptions, which led to frequent classification errors. These errors likely stemmed from the lack of contextual framing around these terms, making it difficult for the models to distinguish between job titles and other similar lexical items found in the text. Furthermore, in the absence of contextual information around job titles and descriptions, the models rely on having a sufficient number of entries with the same job to effectively learn the cues needed to classify it as the presence of occupation information. On the other hand, marital status was found to be represented in a more semi-structured format where certain terms clearly indicate the presence of a patient’s marital status (e.g., ‘married to X’, ‘marital status: divorced’, ‘family: separated’). Many of the false positives for marital status were due to references to the marital status of family members, such as ‘parents are separated’.

When comparing place of birth and citizenship status, we found a very similar trend. Place of birth information tended to be more explicitly stated in the narratives, often in a consistent format (e.g., ‘born in [location]’), which likely caused more accurate classification. In contrast, citizenship status was documented in a less structured and more context-dependent manner, often requiring the model to infer status from indirect references (e.g., ‘born in Quebec’ or ‘immigrated to the country in 2005’). This lexical variability and reliance on implied information likely contributed to a higher rate of misclassifications for citizenship status compared to place of birth. Many of the classification errors related to place of birth occurred when sentences referenced previous residences or migration histories without explicitly stating the actual birthplace, especially when both involve geographic locations. For citizenship status, misclassifications often stemmed from the models’ inability to infer legal citizenship from indirect references. While it is generally assumed that individuals born in Canada are Canadian citizens, this inference requires contextual understanding that the models lack, as they rely solely on the presence of specific keywords rather than legal or societal assumptions.

Education status showed the weakest classification performance across all models (best F1: 82.477%), substantially lower than other sociodemographic factors. This limitation reflects several intertwined challenges. The dataset contained only 9.7% of records with documented education information, leaving the models with too few examples to learn reliable features. The descriptions that did appear were strikingly diverse, ranging from degree completion (e.g., ‘graduated from University X’, ‘bachelor’s degree in engineering’), to partial or ongoing study (e.g., ‘some college’, ‘currently enrolled in nursing’), to professional certifications (e.g., ‘licensed practical nurse’, ‘certified electrician’). Unlike marital status, which was often recorded in a consistent, structured format, educational references lacked standardization, making it difficult for the models to identify recurring patterns. The few positive examples that were available were spread thinly across many different institutions and programs, further limiting the ability to build robust representations. These overlapping factors created a particularly difficult classification setting that even the strongest CNN architectures struggled to address effectively.

These findings highlight that the observed performance variations across characteristics are influenced not only by the data distribution but also by the lexical diversity and structural consistency within the text. Characteristics with more structured representations (e.g., marital status, place of birth) tend to yield better model performance, while those with higher lexical diversity and that require context-dependent cues (e.g., education, occupation, citizenship status) pose greater challenges, even when their distributions are comparable.

Discussion

Previous studies have identified associations between patients’ sociodemographic characteristics and various health-related conditions.^3-19 Extracting these factors from EMRs and grouping patients into cohorts enables researchers to better analyze these medically relevant risk factors, which may play a crucial role in disease development and progression. However, much of the EMR data is stored in an unstructured format, with sociodemographic factors often exhibiting high levels of missingness. Identifying an optimal approach for classifying these characteristics at the sentence level is a critical first step toward improving data availability for health disparities research and precision medicine. CNNs have been successfully applied to clinical text classification tasks, particularly for social and behavioral factor extraction.^63,69 Their ability to highlight key local patterns in text makes them particularly useful for extracting structured information from unstructured EMR data. This study aimed to recommend model development practices for clinical text classification by evaluating 6 variations of CNN architectures for the binary classification of sociodemographic information in EMR free-text fields. These findings are directly actionable for informatics teams designing tools to augment EMR completeness or build demographically representative cohorts.

Our results demonstrate that model architecture plays a significant role in the classification of sociodemographic characteristics within EMRs. Overall, we found that simpler CNN architectures like Model A tended to perform robustly, likely due to reduced model complexity minimizing the risk of overfitting. However, at a more granular level, more complex models like Model C and hybrid models like Model F demonstrated comparable performance, especially in more balanced datasets, such as marital status, or for more lexically diverse data such as occupation. Our models may generalize to other sociodemographic factors, as they showed high performance for all assessed characteristics, which varied in frequency and lexical diversity. Therefore, model architecture can be chosen based on documentation rates, since we have provided detailed evaluations for the varying class sizes.

The hybrid approach, which integrated traditional classifiers, such as support vector machines and random forests, with CNN-based feature extraction, exhibited mixed performance when compared to their standalone CNN counterparts. While hybridization was expected to enhance classification by leveraging the robust feature extraction of CNNs alongside the structured decision-making capabilities of traditional classifiers, the results indicate that this approach did not consistently improve performance. In certain cases, such as marital status and occupation, the hybrid models achieved slightly higher F1 scores, suggesting that the traditional classifiers helped refine classification boundaries for structured or lexically complex characteristics. However, for citizenship status and education status, the hybrid models underperformed, likely due to the random forest and support vector machine’s sensitivity to imbalanced data, which led to a decline in recall. This was particularly evident in education status, where the hybrid models (especially Model F) struggled to capture contextual nuances. While hybrid architectures may offer advantages for structured or well-represented features, they may not always be suitable for highly imbalanced data mixed with lexically diverse text, where standalone CNNs appeared to generalize more effectively.

The observed variability in model performances revealed key insights into the relationship between data characteristics and model performance. Despite some characteristics exhibiting the same class distributions, the CNN models demonstrated noticeably higher performance for some over others. Marital status and occupation both had a prevalence rate of ~50% in the data while place of birth and citizenship status also shared a similar distribution in the data of ~15% presence. However, marital status and place of birth achieved higher scores and model stability when compared to occupation and citizenship status. This disparity is largely attributed to the greater lexical diversity inherent in documenting occupation data and the context-dependent nature of citizenship status. Marital status and place of birth are more explicitly mentioned following a very similar structure across entries such as ‘marital status: divorced’, ‘married to XX’, and ‘place of origin: France’. On the other hand, citizenship and occupational information varied considerably due to complex linguistic patterns and subtle contextual cues that may be evident to a human reader but create challenges for model training. Education status exhibited the lowest performance rates across models, likely due to both its pronounced class imbalance and the variability in how educational information can be expressed—ranging from phrases like ‘graduated from University X’ to ‘completed grade 10’ or ‘pursuing a master’s degree’. This linguistic complexity increases the difficulty for models to generalize effectively.

The CNN models consistently performed well on characteristics with balanced presence-absence distributions, such as marital status and occupation, achieving high F1 scores and Matthews correlation coefficients. The lexical diversity in occupation seemed to favor more complex architectures (Models C and D), while marital status was classified well by every architecture owing to its structured documentation and balanced distribution. Conversely, characteristics with lower presence rates, such as place of birth and citizenship status (~15%) and education status (~9%), generally benefited from simpler architectures. Model B excelled at identifying citizenship status, likely because its multiple filter region sizes enhanced its ability to capture the diverse linguistic patterns common to that characteristic.
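To make the idea of multiple filter region sizes concrete, the following is a minimal sketch of convolution with max-over-time pooling over a sentence of token embeddings. It is not Model B itself: the region sizes (2, 3, 4), the number of filters, the embedding dimension, and the random weights standing in for learned ones are all assumptions chosen for illustration.

```python
import random


def conv_max_pool(embeddings, kernel, bias=0.0):
    """Slide one filter over the token sequence and keep the largest ReLU activation."""
    region = len(kernel)  # region size = number of consecutive tokens the filter covers
    best = 0.0            # ReLU floor: negative activations are clipped to zero
    for start in range(len(embeddings) - region + 1):
        window = embeddings[start:start + region]
        activation = bias + sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        best = max(best, activation)
    return best


def extract_features(embeddings, filters_by_region):
    """Concatenate the pooled activations of all filters across all region sizes."""
    features = []
    for region in sorted(filters_by_region):
        for kernel in filters_by_region[region]:
            features.append(conv_max_pool(embeddings, kernel))
    return features


def make_filters(region_sizes, n_filters, dim, seed=0):
    """Hypothetical helper: random weights stand in for trained filter weights."""
    rng = random.Random(seed)
    return {
        r: [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(r)]
            for _ in range(n_filters)]
        for r in region_sizes
    }


# Toy sentence: 6 tokens with 4-dimensional embeddings.
rng = random.Random(1)
sentence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
filters = make_filters(region_sizes=(2, 3, 4), n_filters=2, dim=4)
features = extract_features(sentence, filters)
print(len(features))  # 3 region sizes x 2 filters = 6 pooled features
```

Each region size responds to phrases of a different length, so the concatenated vector can reflect short cues and longer contextual phrasings at once, which is one plausible reading of why this design suited citizenship status.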

This analysis indicates that balancing model complexity against data characteristics is essential for optimal performance, particularly in sensitive domains like healthcare, where both recall and precision can have significant implications. The CNN models differ in subtle architectural ways (e.g., filter size, layer depth, activation functions) that emphasize different aspects of the data. For instance, Model A might excel at capturing broad patterns that generalize well, whereas Model F might have layers or filters tuned to features specific to the positive class, resulting in higher precision. Generally, complex models may need more high-quality or well-documented data before they outperform simpler ones. With sparse or imbalanced data, simpler architectures seem to offer a performance advantage without the added computational cost.

Our findings build on and extend prior research on deep learning approaches for extracting social determinants of health from clinical text. Earlier work has demonstrated the effectiveness of CNNs,^63,69 long short-term memory networks,^26,70 and transformer-based architectures,^23,41-43 though most studies have focused on evaluating individual models or benchmarking deep learning against traditional machine learning approaches. More recent research highlights the advantages of transformer-based models, particularly BERT and its variants, which deliver state-of-the-art performance for fine-grained entity recognition and the capture of complex contextual relationships.^41-43 These models consistently outperform CNNs and LSTMs across diverse social determinant domains, including housing, education, occupation, substance use, and broader social environments.^10,23 However, such studies typically rely on well-resourced datasets and substantial computational infrastructure, often addressing intricate tasks that involve temporal, hierarchical, or multidimensional representations of social determinants of health. By contrast, our work addresses a complementary gap by providing, to our knowledge, the first systematic comparison of CNN architectures for binary presence/absence classification of sociodemographic characteristics in resource-limited settings. We demonstrate that relatively simple CNNs can achieve strong performance (F1 ≈ 91%) for sentence-level classification when sociodemographic information is documented in structured formats. This finding underscores that architectural complexity should be calibrated to the characteristics of the data and the institutional context, rather than assumed to be universally advantageous. 
Importantly, these results have direct practical implications for healthcare systems with limited computational resources or those requiring real-time workflow integration, where the efficiency and interpretability of CNNs may outweigh the marginal performance gains offered by transformer-based models. Our findings also reinforce prior observations on the influence of lexical diversity and documentation consistency,²⁶ highlighting the importance of aligning model choice with both data properties and implementation environments.

We found that deep learning architectures can be successfully leveraged for real-world medical text classification without extensive feature engineering. Importantly, CNNs were robust to highly imbalanced data and small training samples, achieving strong performance without explicit data-balancing methods. This is particularly relevant in the clinical domain, where minority sociodemographic groups are often underrepresented in datasets, and balancing strategies risk discarding valuable information or compromising representativeness. These findings highlight CNNs’ potential to deliver reliable classification in real-world settings where data availability and distribution cannot be easily controlled.
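A small worked example may help show why F1 score and Matthews correlation coefficient are informative under imbalance where raw accuracy is not. The counts below are invented for illustration (100 sentences with 9 positives, loosely echoing the ~9% presence rate of education status) and are not results from this study.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative data: 9 positives followed by 91 negatives.
y_true = [1] * 9 + [0] * 91

# Classifier 1 always predicts "absent".
always_absent = [0] * 100

# Classifier 2 finds 6 of the 9 positives and raises 2 false alarms.
imperfect = [1] * 6 + [0] * 3 + [1] * 2 + [0] * 89

for name, y_pred in [("always absent", always_absent), ("imperfect", imperfect)]:
    counts = confusion(y_true, y_pred)
    print(name, round(accuracy(*counts), 3), round(f1_score(*counts), 3),
          round(mcc(*counts), 3))
```

The always-absent classifier reaches 91% accuracy while scoring zero on both F1 and MCC, whereas the imperfect classifier is only slightly more accurate yet clearly separated by the other two metrics.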

In practical terms, CNN-based classifiers could be integrated directly into EMR systems to automatically flag missing or incomplete sociodemographic information, prompting clinicians or administrative staff to verify and update records during routine encounters, thereby improving data completeness for clinical decision support and equity-driven research. This enhanced data capture directly supports patient care by enabling more accurate risk stratification for conditions with known sociodemographic associations (e.g., cardiovascular disease, diabetes, mental health disorders), facilitating targeted preventive interventions for high-risk populations, and supporting culturally appropriate care delivery tailored to patients’ backgrounds and life circumstances. At the health system level, complete sociodemographic data reduces administrative burden by eliminating redundant data collection efforts, enables population health management through identification of underserved communities, supports compliance with health equity reporting requirements, and facilitates research on social determinants of health without requiring labor-intensive manual chart review. Implementation could follow a tiered approach: (1) batch processing of existing records to identify documentation gaps and prioritize outreach for vulnerable populations, (2) real-time flagging during clinical encounters to prompt timely data collection when clinically relevant, and (3) integration with structured data entry fields to guide standardized documentation practices. Because CNNs are computationally lightweight, they can be deployed locally in primary care and community health settings on standard clinical workstations or local servers without requiring high-performance infrastructure, avoiding the latency and privacy concerns associated with cloud-based processing that would be necessary for transformer-based approaches. 
Such systems should operate in a human-in-the-loop fashion, where automated predictions are reviewed and validated by clinicians or trained staff, ensuring oversight, reducing risks of misclassification, and preserving alignment with clinical workflows. Key implementation considerations include establishing clear governance for model updates and retraining as documentation practices evolve, ensuring compliance with privacy regulations and institutional review processes, and conducting prospective evaluation to monitor performance drift over time, following successful deployment patterns from other clinical NLP tasks such as automated coding assistance and clinical decision support.
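The flagging and human-in-the-loop steps described above could be sketched as a simple routing rule on classifier outputs. Everything in this sketch is hypothetical: the thresholds, the action names, and the `Flag` record are placeholders, and any real cut-offs would need to be set and monitored through the prospective evaluation the text calls for.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    record_id: str
    characteristic: str
    probability: float  # model-estimated probability that the characteristic is documented
    action: str


def triage(record_id, probabilities, present_above=0.9, absent_below=0.1):
    """Route each prediction to an action; thresholds are illustrative assumptions."""
    flags = []
    for characteristic, p in probabilities.items():
        if p >= present_above:
            action = "no_prompt"          # likely documented already
        elif p <= absent_below:
            action = "prompt_collection"  # likely missing: ask staff to verify and record
        else:
            action = "manual_review"      # uncertain: queue for clinician or staff review
        flags.append(Flag(record_id, characteristic, p, action))
    return flags


# Hypothetical classifier outputs for one patient record.
example = triage(
    "record-001",
    {"marital status": 0.97, "education status": 0.04, "occupation": 0.55},
)
for flag in example:
    print(flag.characteristic, "->", flag.action)
```

Keeping the uncertain band wide sends more predictions to manual review, which trades staff time for a lower risk of acting on a misclassification.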

Our study has several limitations. The models we developed were designed to detect the presence or absence of sociodemographic characteristics rather than classifying patients into distinct subcategories. A multi-class framework could offer deeper insights by distinguishing between specific subgroups, thereby improving downstream analyses and risk stratification. Another limitation lies in the lack of interpretability of deep neural networks, an important consideration for clinical decision-making. Future work should explore the application of post hoc explainable AI techniques to enhance transparency. We also did not compare convolutional neural networks with state-of-the-art transformer-based models (e.g., BERT, ClinicalBERT), which have demonstrated strong performance on clinical NLP tasks.^41-43 Our emphasis on CNNs was intentional, reflecting their computational efficiency and suitability for binary classification tasks in resource-limited settings. Given the relatively small dataset size and the structured nature of certain characteristics, more complex models may have been prone to overfitting, reducing their comparative advantage. Moreover, the CNN architectures already achieved high performance (F1 ≈ 91%), leaving limited headroom for further improvement. Nonetheless, direct comparison with transformer-based approaches remains an important direction for future work, particularly for lexically diverse or context-dependent sociodemographic characteristics and for more complex multi-class extraction tasks where transformers may excel.

A further limitation is that primary annotation was performed by a single annotator. Although intra-rater reliability was high (kappa: 0.98 for documentation status, 0.96 for documented information), multiple annotators with inter-rater reliability assessment would have provided stronger validation and reduced the potential for systematic bias in the reference standard. In addition, several sociodemographic variables of high relevance to health equity research (including race, gender identity, and sexual orientation) were excluded because documentation rates fell below 3%. Training classifiers on such sparse data would have yielded unreliable models dominated by class imbalance. While this exclusion reduces comprehensiveness, it reflects current documentation limitations in EMRs; larger multi-site datasets or semi-supervised methods may help address this gap. Finally, our dataset was derived from family medicine clinics in Ontario using 3 EMR vendors and comprised English-language documentation. Although this reflects real-world practices across 96 clinics and diverse patient populations within Canadian primary care, its regional and linguistic scope may limit generalizability to other provinces, countries, or health systems. Documentation styles, language use, and sociodemographic recording conventions can vary across jurisdictions and EMR platforms, potentially influencing model performance. Broader datasets spanning multiple geographic regions and healthcare settings will therefore be essential to assess external validity.
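For readers unfamiliar with the kappa statistic cited above, the sketch below computes Cohen's kappa for two labeling passes over the same items. It assumes Cohen's kappa is the variant in question, which this section does not state, and the labels are invented for illustration rather than drawn from the study data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    # Chance agreement: probability both passes pick the same category independently.
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both passes used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Illustrative passes over 100 items that disagree on exactly one item.
first_pass = [1] * 45 + [0] * 55
second_pass = [1] * 44 + [0] * 56
print(round(cohen_kappa(first_pass, second_pass), 2))
```

The same function applies whether the two sequences come from one annotator labeling twice (intra-rater) or from two annotators (inter-rater); only the interpretation changes.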

Future research should build on these findings in several ways to advance automated sociodemographic extraction from clinical text. Semi-supervised learning offers one promising direction, as it could leverage large volumes of unlabeled clinical text alongside smaller annotated datasets to support robust modeling of rare characteristics that are currently excluded due to sparse documentation. Another avenue is the exploration of lightweight transformer architectures (e.g., DistilBERT, ALBERT, MobileBERT), which may balance the contextual strengths of transformers with the efficiency required for resource-limited clinical environments. Expanding beyond binary classification to multi-label and multi-class frameworks would also be valuable, enabling simultaneous extraction of multiple sociodemographic attributes and more fine-grained categorization (such as differentiating education levels or employment types). Prospective validation across diverse healthcare settings, including varied regions, EMR platforms, languages, and clinical specialties, will be critical for establishing generalizability and clinical relevance. Finally, the development of explainable AI methods tailored to CNN-based models could enhance interpretability and foster clinician trust, ensuring that automated extraction systems are not only accurate but also transparent and usable in practice.

Conclusion

There is increasing interest in leveraging information documented in electronic medical records, as it offers valuable insights into disease onset, progression, and responses to treatments or interventions. As EMR adoption continues to expand, large-scale real-world clinical data is becoming more readily available for biomedical research. This data includes sociodemographic information that can be used to find associations between specific factors and a patient’s health. Understanding the relationship between these factors and certain health risks first requires the classification of each characteristic documented in the EMR. However, not only are these characteristics rarely documented, but the medical text on its own is composed of complex medical vocabulary and medical measures that often do not follow natural language grammar. This creates challenges of high dimensionality and data sparsity, making the classification of sociodemographic factors a particularly difficult but important area of research.

This study provides the first systematic comparison of CNN architectures for binary presence/absence classification of sociodemographic characteristics in EMR clinical text. By evaluating 6 CNN architectures across 5 sociodemographic factors with varying documentation rates and lexical diversity, we demonstrate that simpler architectures often outperform more complex models, particularly under conditions of data imbalance and sparse documentation. Our findings offer practical, evidence-based guidance for researchers and informatics teams developing automated extraction tools for clinical deployment, particularly in resource-limited settings where computational efficiency and interpretability are critical considerations. Because the evaluated CNN architectures are lightweight, transparent, and require minimal computational resources, they can be readily integrated into existing EMR infrastructures without disrupting routine workflows. Such deployable models could operate as embedded modules for real-time documentation audits, automated flagging of missing sociodemographic fields, or generation of structured data summaries to support clinical decision support systems.

The translational impact of this work extends to both clinical practice and health equity research. For clinical decision support, automated classification of sociodemographic characteristics can enhance risk stratification for conditions with known social determinants, facilitate culturally appropriate care delivery, and reduce documentation burden by flagging incomplete patient profiles for verification. For equity-driven research, these methods enable efficient construction of demographically representative cohorts, identification of underserved populations, and large-scale investigation of social determinants of health without labor-intensive manual chart review. By demonstrating that efficient CNN-based approaches can achieve high performance for binary classification tasks, this study contributes to ongoing efforts to make automated clinical text analysis more accessible and deployable across diverse healthcare settings, ultimately supporting both improved patient care and advances in health disparities research.

Footnotes

Acknowledgements

Dr. K Tu receives a Chair in Family and Community Medicine Research in Primary Care at UHN and a Research Scholar Award from the Department of Family and Community Medicine, Temerty Faculty of Medicine, University of Toronto.

ORCID iD

Rawan Abulibdeh

Ethical Considerations

This project received Research Ethics Board (REB) approval from the University of Toronto (Protocol #40129) and North York General Hospital (Protocol #20-0044). UTOPIAN operates under the principles of the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS2), specifically Chapter 5, Section D, Article 5.5A. All methods were carried out in accordance with relevant guidelines and regulations, including the Declaration of Helsinki.

Consent to Participate

The need for informed consent was formally waived by both REBs due to the use of de-identified, retrospective EMR data. The data is de-identified and processed by our research team, and UTOPIAN standard operating procedures ensure that no linkage can be made from the outcomes reported in this project to specific patients. Family physicians who contribute data to UTOPIAN each signed a data sharing agreement that allows for research with REB approval.

Author Contributions

K.T. and E.S. conceived the study. R.A. designed and conducted the study, developed and implemented the models, collected and processed the data, performed model and error analyses, and drafted the manuscript. K.T. and E.S. supervised the study, provided resources, assisted in manuscript editing and review, and contributed to project administration. K.T. additionally curated data and secured funding. All authors (R.A., K.T., and E.S.) reviewed and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Canadian Institutes of Health Research [grant number 173094]. Dr. K Tu receives a Chair in Family and Community Medicine Research in Primary Care at UHN and a Research Scholar Award from the Department of Family and Community Medicine, Temerty Faculty of Medicine, University of Toronto.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data used in this study consisted of individual-level, de-identified records. In accordance with institutional policies and REB requirements, individual-level data cannot be made publicly available; only aggregate data may be shared. Due to the nature of this project, the data cannot be aggregated for public release. Additionally, the dataset—sourced from the University of Toronto’s Practice-Based Research Network (UTOPIAN) Data Safe Haven, a large primary care EMR repository—is no longer accessible, as the parent database has been archived. Future access to similar datasets may be possible upon REB approval. For inquiries, please contact the University of Toronto’s Human Research Ethics Unit at ethics.review@utoronto.ca or reach out to Mariya Gancheva (m.gancheva@utoronto.ca), the Research Ethics Coordinator.

References

1. Link, Phelan JC. Understanding sociodemographic differences in health–the role of fundamental social causes. Am J Public Health. 1996;86:471-473.

2. Prus SG. Comparing social determinants of self-rated health across the United States and Canada. Soc Sci Med. 2011;73:50-59.

3. Goodday, Kormilitzin, Vaci, et al. Maximizing the use of social and behavioural information from secondary care mental health electronic health records. J Biomed Inform. 2020;107:103429.

4. Chiu, Tian, Hart. Early detection of severe functional impairment among adolescents with major depression using logistic classifier. Front Public Health. 2021;8:622007.

5. Cheng, Witte, McClure, et al. Socioeconomic status and prostate cancer incidence and mortality rates among the diverse population of California. Cancer Causes Control. 2009;20:1431-1440.

6. Blecker, Sontag, Horwitz, et al. Early identification of patients with acute decompensated heart failure. J Card Fail. 2018;24:357-362.

7. Abraham, Kosti, et al. Dense phenotyping from electronic health records enables machine learning-based prediction of preterm birth. BMC Med. 2022;20:333.

8. Quaglia, Lillini, Mamo, Ivaldi, Vercelli. Socio-economic inequalities: a review of methodological issues and the relationships with cancer survival. Crit Rev Oncol Hematol. 2013;85:266-277.

9. Harvei, Kravdal Ø. The importance of marital and socioeconomic status in incidence and survival of prostate cancer. Prevent Med. 1997;26:623-632.

10. Yang, Dang, et al. A study of social and behavioral determinants of health in lung cancer patients using transformers-based natural language processing models. Paper presented at: AMIA Annual Symposium Proceedings; October 30-November 3, 2021; San Diego, CA. American Medical Informatics Association; 2021: 1225.

11. Wang, Lakin, Riley, Korach, Frain, Zhou. Disease trajectories and end-of-life care for dementias: Latent topic modeling and trend analysis using clinical notes. Paper presented at: AMIA Annual Symposium Proceedings; November 3-7, 2018; San Francisco, CA. American Medical Informatics Association; 2018: 1056.

12. Bucher, Shi, Pettit, Ferraro, Chapman, Gundlapalli. Determination of marital status of patients from structured and unstructured electronic healthcare data. Paper presented at: AMIA Annual Symposium Proceedings; November 16-20, 2019; Washington, DC. American Medical Informatics Association; 2019: 267.

13. Raphael, Bryant, Mikkonen, Alexander. Social Determinants of Health: The Canadian facts. Ontario Tech University Faculty of Health Sciences; 2020.

14. Valanis, Bowen, Bassford, Whitlock, Charney, Carter RA. Sexual orientation and health: comparisons in the women’s health initiative sample. Arch Fam Med. 2000;9:843.

15. Boehmer, Miao, Linkletter, Clark MA. Health conditions in younger, middle, and older ages: are there differences by sexual orientation? LGBT Health. 2014;1:168-176.

16. Gonzales, Henning-Smith. Health disparities by sexual orientation: results and implications from the behavioral risk factor surveillance system. J Community Health. 2017;42:1163-1172.

17. Logie. The case for the World Health Organization’s Commission on the Social Determinants of Health to address sexual orientation. Am J Public Health. 2012;102:1243-1246.

18. Abramovich, de Oliveira, Kiran, Iwajomo, Ross, Kurdyak. Assessment of health conditions and health service use among transgender patients in Canada. Am Med Assoc Netw Open. 2020;3:e2015036.

19. Sokkary, Awad, Paulo. Frequency of sexual orientation and gender identity documentation after electronic medical record modification. J Pediatr Adolesc Gynecol. 2021;34:324-327.

20. Pendergrass, Crawford DC. Using electronic health records to generate phenotypes for research. Curr Protoc Hum Genet. 2019;100:e80.

21. Rayner, Khan, Chan. Illustrating the patient journey through the care continuum: leveraging structured primary care electronic medical record (EMR) data in Ontario, Canada using chronic obstructive pulmonary disease as a case study. Int J Med Inform. 2020;140:104159.

22. Ehrenstein, Kharrazi, Lehmann, Taylor. Obtaining data from electronic health records. In: Gliklich RE, Leavy MB, Dreyer NA, eds. Tools and Technologies for Registry Interoperability, Registries for Evaluating Patient Outcomes: A User’s Guide, 3rd Edition, Addendum 2. Agency for Healthcare Research and Quality (US); 2019. Chapter 4. https://www.ncbi.nlm.nih.gov/books/NBK551878/

23. Han, Zhang, Shi, et al. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. J Biomed Inform. 2022;127:103984.

24. Abulibdeh, Butt, Train, Crampton, Sejdić. Assessing the capture of sociodemographic information in electronic medical records to inform clinical decision making. PLoS One. 2025;20:e0317599.

25. Qing, Linhong, Xuehai. A novel neural network-based method for medical text classification. Fut Intern. 2019;11:255.

26. Stemerman, Arguello, Brice, Krishnamurthy, Houston, Kitzmiller. Identification of social determinants of health using multi-label classification of electronic health record clinical notes. J Am Med Inform Assoc Open. 2021;4:ooaa069.

27. Mishra, Bian, Fiszman, et al. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inform. 2014;52:457-467.

28. Wang. Research on Web text classification algorithm based on improved CNN and SVM. Paper presented at: IEEE 17th International Conference on Communication Technology (ICCT); October 27-30, 2017; Chengdu, China. IEEE; 2017: 1958-1961.

29. Rajendran, Topaloglu. Extracting smoking status from electronic health records using NLP and deep learning. AMIA Jt Summits Transl Sci Proc. 2020;2020:507.

30. LeCun, Bengio, Hinton. Deep learning. Nature. 2015;521:436-444.

31. Khalid, Khalil, Nasreen. A survey of feature selection and feature extraction techniques in machine learning. Paper presented at: Science and Information Conference; August 27-29, 2014; London, UK. IEEE; 2014: 372-378.

32. Yang, Varghese, Stephenson, Gronsbell. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc. 2023;30:367-381.

33. Khattak, Jeblee, Pou-Prom, Abdalla, Meaney, Rudzicz. A survey of word embeddings for clinical text. J Biomed Inform. 2019;100:100057.

34. Roberts, Datta, et al. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc. 2020;27:457-470.

35. Chen. Convolutional Neural Network for Sentence Classification. M.S. Thesis. Department of Computer Science and Engineering, University of Waterloo; 2015.

36. Kalchbrenner, Grefenstette, Blunsom. A convolutional neural network for modelling sentences. Paper presented at: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); June 2014; Baltimore, Maryland. ACL; 2014: 655.

37. Wang, et al. Semantic clustering and convolutional neural network for short text categorization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), July 2015; Beijing, China. ACL; 2015: 352-357.

38. Zhang, Wallace. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), November 2017; Taipei, Taiwan. ACL; 2017: 253-263.

39. Abulibdeh, Sejdić. Natural language processing methods for assessing social determinants of health in the electronic health records: a narrative review. Expert Syst Appl. 2025;284:127928.

40. Minaee, Kalchbrenner, Cambria, Nikzad, Chenaghlu, Gao. Deep learning–based text classification: a comprehensive review. Assoc Comput Mach Comput Surveys. 2021;54:1-40.

41. Lybarger, Yetisgen, Uzuner Ö. The 2022 n2c2/UW shared task on extracting social determinants of health. J Am Med Inform Assoc. 2023;30:1367-1378.

42. Peng, Yang, et al. Identifying social determinants of health from clinical narratives: a study of performance, documentation ratio, and potential bias. J Biomed Inform. 2024;153:104642.

43. Romanowski, Ben Abacha, Fan. Extracting social determinants of health from clinical note text with classification and sequence-to-sequence approaches. J Am Med Inform Assoc. 2023;30:1448-1455.

44. Lybarger, Dobbins, Long, et al. Leveraging natural language processing to augment structured social determinants of health data in the electronic health record. J Am Med Inform Assoc. 2023;30:1389-1397.

45. Kim. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), October 2014; Doha, Qatar. ACL; 2014: 1746-1751.

46. University of Toronto Family Medicine Report Tech. Report. Department of Family and Community Medicine at the University of Toronto, 2019.

47. OntarioMD. Provincial EMR-Integrated Access. OntarioMD; 2025.

48. OntarioMD. From Foundation to Integration: Annual Report 2016-2017. OntarioMD; 2017.

49. Google Code Archive. Long-term storage for Google Code project hosting. 2016. https://code.google.com/archive/

50. Mullenbach, Wiegreffe, Duke, Sun, Eisenstein. Explainable Prediction of Medical Codes from Clinical Text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), June 2018; New Orleans, Louisiana. ACL; 2018: 1101-1111.

51. Yogarajan, Montiel, Smith, Pfahringer. Seeing the whole patient: using multi-label medical text classification techniques to enhance predictions of medical codes. arXiv preprint arXiv:2004.00430. 2020.

52. Pennington, Socher, Manning CD. Glove: Global vectors for word representation. Paper presented at: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); October 2014; Doha, Qatar. ACL; 2014: 1532-1543.

53. Chollet. Keras-Team/Keras: Deep Learning for Humans. GitHub; 2015.

54. Johnson, Zhang. Effective use of word order for text categorization with convolutional neural networks. Paper presented at: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; May-June 2014; Denver, Colorado. Association for Computational Linguistics.

55. Hughes, Kotoulas, Suzumura. Medical text classification using convolutional neural networks. Stud Health Technol Inform. 2017;235:246-250.

56. Cerisara, Denis. Do convolutional networks need to be deep for text classification? Paper presented at: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence; February 2-7, 2018; New Orleans, LA.

57. Lingeman, Wang, Becker. Detecting opioid-related aberrant behavior using natural language processing. AMIA Annu Symp Proc. 2018;2017:1179-1185.

58. Islam, Liu, Kang. A semantics aware random forest for text classification. Paper presented at: CIKM’19: The 28th ACM International Conference on Information and Knowledge Management; November 3-7, 2019; Beijing, China. ACM; 2019: 1061-1070.

59. Ahsan, Ohnuki, Mitra, You. MIMIC-SBDH: a dataset for social and behavioral determinants of health. Proc Mach Learn Res. 2021;149:391-413.

60. Wang, Sohn, Liu, et al. A clinical text classification paradigm using weak supervision and deep representation. BMC Med Inform Decis Making. 2019;19:1-13.

61. Badger, LaRose, Mayer, Bashiri, Page, Peissig. Machine learning for phenotyping opioid overdose events. J Biomed Inform. 2019;94:103185.

62. Afshar, Phillips, Karnik, et al. Natural language processing and machine learning to identify alcohol misuse from the electronic health record in trauma patients: development and internal validation. J Am Med Inform Assoc. 2019;26:254-261.

63. Afshar, Sharma, Bhalla, et al. External validation of an opioid misuse machine learning classifier in hospitalized adult patients. Addict Sci Clin Pract. 2021;16:1-11.

64. Shapiro, Wilk. An analysis of variance test for normality. Biometrika. 1965;52:591-611.

65. Scheffe. The Analysis of Variance. John Wiley & Sons; 1999: 72.

66. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5:99-114.

67. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32:675-701.

68. Nemenyi PB. Distribution-free Multiple Comparisons. Princeton University; 1963.

69. Senior, Burghart, et al. Identifying predictors of suicide in severe mental illness: a feasibility study of a clinical prediction rule (Oxford Mental Illness and Suicide Tool or OxMIS). Front Psychiatry. 2020;11:268.

70. Chen, Dredze, Weiner, Kharrazi. Identifying vulnerable older adult populations by contextualizing geriatric syndrome information in clinical notes of electronic health records. J Am Med Inform Assoc. 2019;26:787-795.

Balancing Model Complexity and Clinical Deployability in Deep Learning for Sociodemographic Information Extraction
