Abstract
Mycoplasma pneumonia may lead to hospitalizations and pose life-threatening risks in children. The automated identification of mycoplasma pneumonia from electronic medical records holds significant potential for improving the efficiency of hospital resource allocation. In this study, we propose a novel method for identifying mycoplasma pneumonia by integrating multi-modal features derived from both free-text descriptions and structured test data in electronic medical records. Our approach begins with the extraction of free-text and structured data from clinical records through a systematic preprocessing pipeline. Subsequently, we employ a pre-trained transformer language model to extract features from the free-text, while multiple additive regression trees are used to transform features from the structured data. An attention-based fusion mechanism is then applied to integrate these multi-modal features for effective classification. We validated our method using the clinical records of 7157 patients, retrospectively collected for training and testing purposes. The experimental results demonstrate that our proposed multi-modal fusion approach achieves significant improvements over other methods across four key performance metrics.
Introduction
Mycoplasma pneumonia (MP), a type of community-acquired pneumonia (CAP), is responsible for 10 to 40% of CAP cases in hospitalized children.1,2 It can cause severe respiratory complications, leading to hospitalizations and potentially life-threatening situations in pediatric patients. MP spreads through droplets and direct contact with infected individuals, often displaying an epidemic infection pattern with seasonal peaks in autumn and winter.3,4 In clinical practice, pediatricians commonly treat children with MP using macrolide antibiotics, differing from treatment protocols for other pneumonias.5–7 Therefore, distinguishing MP from other pneumonia cases is crucial for ensuring child safety and minimizing inappropriate medication use.8–10
Recent advancements in artificial intelligence have greatly assisted clinicians in diagnosing and managing diseases using multi-modal data sources such as electronic medical records (EMRs) and medical imaging.11–13 EMRs provide comprehensive records of patient diagnosis and treatment, written by hospital clinicians. Current research predominantly focuses on extracting and utilizing valuable pulmonology information from EMRs for disease research and on developing automated diagnostic systems, e.g., for asthma,14,15 ICU mortality scoring 16 and infectious disease identification. 17 Several machine learning approaches, e.g., XGBoost, 18 Bayesian networks, 19 regression trees, 20 etc.,21,22 have also been developed for pneumonia diagnosis and management. Additionally, the integration of medical imaging and EMRs into disease diagnosis has been explored, e.g., text-image semantic retrieval, 23 multi-modal pneumonia identification24–26 and COVID-19 diagnosis. 27
Despite the plethora of EMR-based respiratory disease diagnosis research, studies specifically focusing on MP diagnosis are lacking. In this study, we propose a diagnostic system for MP utilizing a combination of deep learning and multiple additive regression trees (MART). The methodology begins with extracting free-text and structured medical test data from EMRs via a preprocessing pipeline. A pre-trained BERT model 28 serves as the free-text feature extractor, while MART is employed to transform features from the structured data, converting multi-source scalar numeric data into dimensionless multi-hot vectors. This facilitates the unification of diverse features into a single metric space. Subsequently, an attention-based fusion module merges the free-text and structured data features, which are then fed into a binary classifier for the final classification. Figure 1 provides an overview of the proposed system, which was trained and validated on 7157 retrospectively collected EMRs of CAP patients.
Figure 1. Overview of the proposed system. (a) Overview of the record preprocessing pipeline: clinical text from the EMR is provided, and both free-text and structured streams are extracted using pre-built regular expressions. (b) Overview of MP identification: BERT extracts free-text features, the trained MART transforms the structured streams, an attention fusion module fuses the features of the free-text and structured data, and the fused features are fed into the binary classifier for MP identification.
Materials & methods
Ethical approval
This research was approved by the Institutional Review Board (IRB) of the Medical Ethics Committee at Children’s Hospital, Zhejiang University School of Medicine, China (IRB Approval ID: 2020-IRB-058) and conducted in compliance with the Declaration of Helsinki. Given the retrospective nature of the data analysis involved in this study, the IRB granted a waiver for the requirement of informed consent.
Data overview
In this study, EMRs were retrospectively collected, consisting of a patient cohort of 7157 individuals diagnosed with CAP at the Department of Pulmonology, Children’s Hospital, Zhejiang University School of Medicine. This cohort comprised two subgroups: 3706 cases of MP and 3451 non-MP cases. The age distribution of these patients ranged from 29 days to 18 years, with a mean age of 3.57 years and a standard deviation (SD) of 3.05 years. Statistical insights derived from the patients’ EMRs are illustrated in Figure 2. The primary analytical focus of this study was on three critical aspects of the EMRs: admission examination, auxiliary examination, and discharge diagnoses. It is important to note that all free-text records within these domains were originally recorded in Chinese, as detailed in Table 1.
Figure 2. Data overview. (a) Histogram of the age distribution of patients, ranging from 0 to 18 years old, with the majority of children under 4 years old. (b) Bar graph of the months of patient admission, showing that the peak seasons of MP and CAP are autumn and winter. (c) Pie chart of the gender distribution of patients, in which males account for 56.66% and females for 43.34%.
Table 1. Examples of English translations of the fields used in the original EMRs. Note: the admission examination field records the doctor’s observations of the patient’s physical condition and basic examination results; the auxiliary examination field records special examination results, including blood routine, blood gas analysis, urine routine, etc.; and the discharge diagnoses serve as the ground truth for determining whether a patient has mycoplasma pneumonia or another, non-mycoplasma pneumonia.
For this study, we randomly divided the EMR data into three distinct sets on a patient-by-patient basis: a training set (85% of the total data), a validation set, and a testing set (each constituting 7.5% of the total data). Specifically, the training set comprises 2966 MP cases and 3111 non-MP cases, totaling 6077 cases. The validation set, which we used for hyperparameter tuning, includes 370 MP cases and 170 non-MP cases, amounting to 540 cases in total. Similarly, the testing set, designated for model evaluation, mirrors the validation set in terms of case distribution, with 540 cases in total.
Additionally, this study also assembled an expansive dataset comprising 50,754 non-pneumonia-related EMRs denoted as ‘Extra-data’, which was sourced from three secondary-level children’s hospitals: Zhengzhou Children’s Hospital, Shangyu Maternity & Childcare Hospital, and Dongyang People’s Hospital. The collection spanned both pulmonary and non-pulmonary departments. The format and writing style of these records were analogous to the EMRs of patients with CAP in our study, encompassing fields such as admission examination and auxiliary examination. Notably, this dataset, which was not subjected to any preprocessing, was utilized as a domain-specific corpus for the purpose of unsupervised pre-training of a BERT model.
Data preprocessing pipeline
EMRs often contain semi-structured text that reflects both the hospital’s writing norms and individual clinicians’ habits. To address this, we developed a three-step preprocessing pipeline, illustrated in Figure 1(a), for cleaning and extracting relevant information from the free-text in EMRs: (1) Standardization of Test Item Expressions: due to the varied representations of test items in EMRs, such as “temperature 37.1 degrees” and “T 37.1°C”, we applied 228 regular expressions to standardize these expressions. For example, “capillary refill time 3 s” is converted to “CRT 3s”. (2) Tabulation of Test Result Numerics: we utilized 74 regular expressions, e.g., “partial pressure of carbon dioxide ([0-9]+[.。]?[0-9]*)mmHg”, to extract numeric values from the standardized test results and organize them into a tabular format. (3) Physical Description Extraction and Refinement: the same 228 regular expressions from the first step are reused to extract physical descriptions, which involves removing text related to test results. Additionally, we employ a set of straightforward regular expressions to systematically eliminate redundant punctuation and speculative phrases that imply tentative diagnoses, such as ‘suspected pneumonia’ or ‘suspected mycoplasma infection’. This step is crucial for preventing label leakage, since clinicians commonly record such tentative impressions during patient examinations. A condensed sketch of the pipeline follows below.
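For illustration, here is a condensed Python sketch of the three steps under simplified assumptions: it uses two standardization rules and two extraction patterns in place of the 228 and 74 used in practice, and the pattern strings (other than the PaCO2 example quoted above) are hypothetical.

```python
import re

# Step 1: standardize varied test-item expressions, e.g. map
# "temperature 37.1 degrees" to the canonical form "T 37.1°C".
STANDARDIZE = [
    (re.compile(r"temperature\s*([0-9]+\.?[0-9]*)\s*degrees"), r"T \1°C"),
    (re.compile(r"capillary refill time\s*([0-9]+)\s*s"), r"CRT \1s"),
]

# Step 2: pull numeric values out of the standardized text into a table
# (the first pattern mirrors the PaCO2 example given in the paper).
EXTRACT = {
    "PaCO2_mmHg": re.compile(r"partial pressure of carbon dioxide ([0-9]+[.。]?[0-9]*)mmHg"),
    "T_celsius": re.compile(r"T ([0-9]+\.?[0-9]*)°C"),
}

# Step 3: remove speculative phrases that would leak the label.
SPECULATIVE = re.compile(r"suspected (pneumonia|mycoplasma infection)")

def preprocess(text: str):
    # Step 1: standardize test-item expressions.
    for pattern, repl in STANDARDIZE:
        text = pattern.sub(repl, text)
    # Step 2: extract numeric values into a flat table.
    structured = {}
    for name, pat in EXTRACT.items():
        if (m := pat.search(text)):
            structured[name] = float(m.group(1).replace("。", "."))
            text = pat.sub("", text)  # step 3a: strip test-result spans
    # Step 3b: remove speculative diagnoses to prevent label leakage.
    free_text = SPECULATIVE.sub("", text).strip(" ,")
    return free_text, structured

free_text, table = preprocess(
    "temperature 37.1 degrees, partial pressure of carbon dioxide 41mmHg, suspected pneumonia"
)
print(table)  # {'PaCO2_mmHg': 41.0, 'T_celsius': 37.1}
```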
Table 2. Overview of structured data (75 items in total).
*The p-values of the t-tests between the MP group and the non-MP group.
**p < .05 indicates a statistically significant difference.
The proposed MP identification system
An illustrative overview of the proposed MP identification system is shown in Figure 1(b). In our study, we utilize both structured test data and free-text descriptions from the preprocessing pipeline to identify MP cases. Throughout this paper, we refer to these two types of data as ‘free-text stream’ and ‘structured data stream’, representing the dual input streams of our system.
Transformer encoder for free-text streams
For processing the free-text stream, we employ the well-known Transformer-based BERT model 28 as our encoder. Specifically, we utilize the Chinese BERT-Base model from the Transformers package, 29 which comprises an initial embedding layer followed by a series of 12 Transformer blocks. 30 The free-text stream is first tokenized and then passed through the embedding layer; the output of the embedding layer then undergoes feature interaction and extraction within the Transformer blocks.
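As a minimal usage sketch, assuming the publicly available bert-base-chinese checkpoint (which we take to correspond to the Chinese BERT-Base model; the example sentence is illustrative):

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the Chinese BERT-Base tokenizer and encoder.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

# Tokenize a free-text line into characters and encode it.
inputs = tokenizer("体温37.1°C，呼吸平稳", return_tensors="pt",
                   padding=True, truncation=True, max_length=512)
with torch.no_grad():
    outputs = encoder(**inputs)

# Token-level features produced by the 12 Transformer blocks.
token_features = outputs.last_hidden_state  # (batch, seq_len, 768)
```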
Each Transformer block is composed of two primary sub-layers: a multi-head self-attention (MHA) layer and a feed-forward network (FFN) layer. The MHA layer, with $h$ parallel attention heads, is defined as

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V).$$

Here, $Q$, $K$, and $V$ denote the query, key, and value matrices derived from the input token representations, and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learnable projection matrices. Each head computes scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$

In this formulation, $d_k$ is the dimension of the key vectors, and the softmax yields attention weights over the value vectors. The FFN layer consists of two linear transformations with a ReLU activation function in between:

$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2.$$

Here $W_1$, $b_1$, $W_2$, and $b_2$ are the learnable weights and biases of the two linear transformations. Following ref. 30, residual connections and layer normalization are applied around each sub-layer.
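The scaled dot-product attention above can be made concrete with a short single-head sketch (shapes are illustrative):

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq_q, seq_k) similarity scores
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # weighted sum of value vectors

Q = torch.randn(5, 64)   # 5 query positions, d_k = 64
K = torch.randn(7, 64)   # 7 key/value positions
V = torch.randn(7, 64)
print(attention(Q, K, V).shape)  # torch.Size([5, 64])
```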
In this study, the collected ‘Extra-data’ set serves as the training corpus for unsupervised pre-training of BERT, specifically tailored to the clinical text domain. Unsupervised pre-training in this context typically involves two tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). 28 In the MLM task, BERT is trained to predict randomly masked tokens within a sentence. In the NSP task, BERT assesses whether two sentences are contextually related. However, given that clinical text is often brief and lacks clear contextual relationships, and considering that NSP has been found to be less effective than MLM in some scenarios,33,34 our study focuses solely on the MLM task for pre-training. We refer to the model trained through this pre-training process as ‘MLM BERT’.
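A minimal sketch of this MLM-only pre-training with the Transformers package 29 follows; the corpus file name and the training hyperparameters are assumptions for illustration, not the exact settings used in our experiments.

```python
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# BertForMaskedLM trains only the MLM head; NSP is deliberately omitted.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# Load the 'Extra-data' corpus (hypothetical file name) line by line.
corpus = load_dataset("text", data_files={"train": "extra_data.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# Randomly mask 15% of tokens, the standard MLM masking rate.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_bert", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()  # the resulting checkpoint is our 'MLM BERT'
```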
Tree-based feature transformation for structured data streams
Drawing inspiration from the hybrid model structure presented in ref. 35, we trained a widely recognized MART model, namely LightGBM, 36 as a feature transformer for structured data streams. The primary objective of this training process is to execute binary classification between MP and non-MP cases, employing the binary cross-entropy function as the objective. However, this process exclusively considers the structured tabular data as input. Upon achieving training convergence, LightGBM transitions to serve as a feature transformer by converting multiple numeric features into a unified multi-hot categorical vector. Specifically, when a sample with various features is input into LightGBM, it ultimately reaches a leaf node within each of the model’s trees. This particular leaf node is designated as ‘1’, while all other nodes are labeled as ‘0’. As a result, the sample’s numerical features, as routed by that tree, are transformed into a one-hot vector. Given the presence of numerous sub-trees in LightGBM, all these one-hot vectors are concatenated to form a single multi-hot vector, which completes the transformation of the numerical features. Consider the scenario of an additive tree ensemble comprising two trees, where the first tree has three leaves and the second tree has two leaves. If a sample is directed to the second leaf in the first tree and the first leaf in the second tree, it would be represented as [0,1,0,1,0]. This representation is the outcome of tree-based feature transformation. Essentially, this method acts as a supervised feature encoding technique, transforming continuous real-valued vectors into more concise binary-valued vectors. The basis for this transformation is a set of binary rules, which are established according to the training objective of the tree; in this context, the objective is to differentiate between MP and non-MP cases.
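The transformation can be sketched with LightGBM’s pred_leaf prediction mode as follows; the data here are random stand-ins for the 75 structured test items, and the hyperparameters are illustrative.

```python
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import OneHotEncoder  # requires scikit-learn >= 1.2

# Random stand-ins: n_samples x 75 test items, NaN marking missing values.
X = np.random.randn(500, 75)
X[np.random.rand(500, 75) < 0.2] = np.nan
y = np.random.randint(0, 2, size=500)  # 1 = MP, 0 = non-MP

# Train LightGBM for binary MP / non-MP classification.
booster = lgb.train(
    {"objective": "binary", "num_leaves": 31, "verbosity": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=100,
)

# pred_leaf=True yields, for each sample, the index of the leaf it reaches
# in every tree: shape (n_samples, n_trees).
leaf_idx = booster.predict(X, pred_leaf=True)

# One-hot encode each tree's leaf index and concatenate across trees,
# producing one multi-hot vector per sample (e.g. [0,1,0,1,0] for the
# two-tree example in the text).
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
multi_hot = encoder.fit_transform(leaf_idx)
print(multi_hot.shape)  # (500, total number of leaves across all trees)
```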
Attentive fusion of free-text streams and structured data streams
In our system, we employ a Multi-modal Attentive Fusion (MAF) module, as proposed in prior research. 37 This module is designed to effectively combine the features from the free-text stream and the structured data stream. The detailed architecture of the MAF module, consisting of a fusion block and an adaptive feature infusion module, is illustrated in Figure 3.
Figure 3. The structure of the multi-modal attentive fusion module. (a) The structure of the fusion block. (b) The detailed structure of the adaptive feature infusion module.
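To make the idea concrete, the following is a minimal sketch of one plausible attentive fusion of the two feature streams; it does not reproduce the exact MAF block of ref. 37, and the dimensions and gating scheme are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Illustrative attention-weighted blend of text and tabular features."""

    def __init__(self, text_dim=768, tab_dim=3100, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)  # project BERT features
        self.tab_proj = nn.Linear(tab_dim, hidden)    # project MART multi-hot features
        self.gate = nn.Linear(2 * hidden, 2)          # attention over the two modalities

    def forward(self, text_feat, tab_feat):
        t = torch.tanh(self.text_proj(text_feat))
        s = torch.tanh(self.tab_proj(tab_feat))
        alpha = torch.softmax(self.gate(torch.cat([t, s], dim=-1)), dim=-1)
        return alpha[:, :1] * t + alpha[:, 1:] * s    # attention-weighted sum

fusion = AttentiveFusion()
fused = fusion(torch.randn(4, 768), torch.randn(4, 3100))
logits = nn.Linear(256, 2)(fused)  # binary MP / non-MP classifier head
```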
Results
In our study, we conduct a quantitative comparison of MP identification performance using uni-modal data streams, namely, free-text streams and structured data streams, and a multi-modal approach that fuses both types of data streams. To evaluate the effectiveness of these methods, we employ widely recognized evaluation metrics: accuracy (Acc), precision (P), recall (R), and F1 score (F1). Among these metrics, the F1 score is given the highest priority due to its balanced consideration of both precision and recall.
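These metrics can be computed directly with scikit-learn (the predictions shown are hypothetical, with MP as the positive class):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 1]  # ground-truth labels (1 = MP)
y_pred = [1, 0, 0, 1, 1, 1]  # hypothetical model predictions

print(accuracy_score(y_true, y_pred),   # 0.667
      precision_score(y_true, y_pred),  # 0.75
      recall_score(y_true, y_pred),     # 0.75
      f1_score(y_true, y_pred))         # 0.75
```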
MP identification based on free-text streams
In our study, we emphasize the significance of text embedding/representation in analyzing free-text streams. We quantitatively compare three prevalent text-embedding strategies: the Vector Space Model (VSM), Word2Vec, and a Transformer-based method. For VSM and Word2Vec, the free-text is tokenized into word sequences using the Stanford Chinese Word Segmenter. 38 In contrast, the Transformer-based method tokenizes free-text into character sequences. The VSM method utilizes Term Frequency–Inverse Document Frequency (TF-IDF) 39 for word representations, embedding text lines into the VSM space. Word2Vec applies the Chinese word vectors from FastText,40,41 which have a vocabulary of 4,000,000 words, representing each word as a vector. In the Transformer-based method, BERT’s embedding layer provides word representations. For VSM-based classifiers, besides LightGBM, we also implemented SVM, LR, and MLP via the scikit-learn package. 42 The Word2Vec classifiers include FastText, 41 TextCNN, 43 and BLSTM. 44 The Transformer-based method employs a conv-tanh-linear composite layer: BERT’s token representations are fed into a classifier consisting of a 1-D conv layer, a tanh activation layer, and a binary linear classification layer.
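A minimal sketch of such a conv-tanh-linear head is given below; the kernel size, channel count, and pooling over sequence positions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvTanhLinearHead(nn.Module):
    """Illustrative classifier head over BERT token representations."""

    def __init__(self, hidden=768, channels=256, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(hidden, channels, kernel_size=kernel,
                              padding=kernel // 2)  # 1-D conv over the sequence
        self.fc = nn.Linear(channels, 2)            # binary linear classifier

    def forward(self, token_repr):                  # (batch, seq_len, hidden)
        x = self.conv(token_repr.transpose(1, 2))   # (batch, channels, seq_len)
        x = torch.tanh(x).max(dim=-1).values        # pool over sequence positions
        return self.fc(x)                           # MP / non-MP logits

head = ConvTanhLinearHead()
print(head(torch.randn(4, 128, 768)).shape)  # torch.Size([4, 2])
```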
Table 3. Performance comparison of various methods under different word embedding representation strategies based on free-text streams.
aBest performing models under different word representation strategies.
MP identification based on structured data streams
Table 4. Performance comparison of different methods based on structured data streams.
aBest performing model.
MP identification based on fusion of both streams
Table 5. Performance comparison of methods using different data modalities.
aBest performing models under different word representations and data modalities.
bMulti-modal with MART transformation.
Furthermore, when comparing our model with and without the MART transformation (where, in the absence of MART, missing values in the raw structured data are zero-filled), there are notable improvements in F1 and R (0.831 vs 0.819 in F1, and 0.876 vs 0.805 in R), while maintaining equivalent Acc (0.756 in both cases). This underscores the efficacy of the tree-based transformation in enhancing model performance.
Discussions
Influence of word representation context on free-text stream analysis
In our study, we analyzed various word embedding representations and their corresponding classification methods, as shown in Table 3. When comparing VSM-based methods with Word2Vec-based methods, we pair MLP against FastText because of their similar complexity as shallow neural networks. FastText demonstrates better performance than MLP (0.696 vs 0.670 in Acc and 0.767 vs 0.727 in F1), indicating Word2Vec’s superior word representation capabilities. VSM’s corpus is restricted to the training set, while Word2Vec benefits from a much larger corpus. Although VSM’s corpus is more relevant to the specific task, word frequency alone proves insufficient for capturing the full importance of a word; in particular, it cannot reflect the positional information of words or their interrelationships. In contrast, Word2Vec, through unsupervised training on extensive corpora, learns more nuanced word representations that capture implicit word relationships.
Interestingly, original BERT, despite its wide-ranging successes in natural language processing, does not show a marked advantage over Word2Vec in our analysis and is slightly outperformed by TextCNN. This observation is further evidenced by the improved performance of MLM BERT over the original BERT, suggesting that contextual word embeddings benefit from additional data and training. This leads us to conclude that for specialized domain language tasks like clinical text analysis, pre-training is essential to fully leverage BERT’s capabilities.
Impact of token representations on free-text stream analysis
In the utilization of BERT as an encoder for downstream tasks, two primary approaches emerge for token representations. For token-level tasks, such as sequence tagging, the representations of tokens other than [CLS] are utilized. Conversely, for document-level tasks like text classification, the [CLS] token’s representation is typically employed for classification, often viewed as encapsulating global textual information. However, our research findings indicate that leveraging all token representations is the more effective approach for classification.
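The two choices can be contrasted in a short sketch, given a BERT output H of shape (batch, seq_len, hidden); mean pooling is shown as one simple way of using all tokens, while our system uses the conv-tanh-linear head described earlier.

```python
import torch

# Stand-in for BERT output: batch of 4, 128 tokens, hidden size 768.
H = torch.randn(4, 128, 768)

cls_feature = H[:, 0, :]           # option 1: only the [CLS] token (position 0)
all_token_feature = H.mean(dim=1)  # option 2: pool every token representation
# Either (4, 768) feature can then be passed to the classification layer.
```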
Table 6. Performance comparison between different token representations for BERT.
Advantages of tree-based feature transformation
The efficacy of tree-based feature transformation is clearly exhibited in the results presented in Table 5. Additionally, as shown in Table 4, MART demonstrates superior performance over neural networks such as MLP, particularly when handling structured data, which often exhibits significant diversity in range, variance, and measurement units and is accompanied by numerous missing values, as indicated in Table 2. We attribute this advantage to two key factors: a ranking-based splitting principle and robustness to missing values.
Neural networks update parameters based on gradients derived from the difference between the output and the target; their effectiveness can therefore diminish when input features span widely differing distributions. Although normalization techniques (such as feature normalization and normalization layers) are commonly used in neural networks to mitigate issues arising from large magnitude differences in input features, MART’s ranking-based splitting is more adept in this context, and the depth-based splitting inherent in trees facilitates implicit feature interaction, enhancing the model’s understanding of complex data relationships. Moreover, by transforming continuous values into multiple sets of categorical one-hot features, MART aligns well with neural networks’ strengths, thereby optimizing the subsequent feature fusion in the MAF module.
The other advantage of MART lies in its handling of missing values. Neural networks typically require missing values to be replaced with fixed values (like zero or the mean), which can inadvertently affect parameter updates through gradient changes. In contrast, MART assesses the impact of missing values by measuring their gain in different branches of the tree. This approach not only maintains sensitivity to the true data distribution but also circumvents the distortions that can arise from arbitrary value imputations.
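A brief sketch of this difference, using synthetic data with missing entries: LightGBM accepts NaN directly and learns, per split, which branch missing values should take, whereas a neural network would first require imputation (such as the zero-filling used in our ablation).

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[rng.random((500, 10)) < 0.3] = np.nan        # 30% of test values missing
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)   # synthetic binary target

# Trees handle NaN natively: each split learns a default branch for missing values.
booster = lgb.train({"objective": "binary", "verbosity": -1},
                    lgb.Dataset(X, label=y), num_boost_round=50)
preds = booster.predict(X)

# A neural network would instead see an imputed matrix, distorting the
# original feature distribution.
X_imputed = np.nan_to_num(X)  # zero-fill, as in our non-MART ablation
```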
Comparison with other related work
Previous EMR diagnostic systems, as described in refs. 37 and 46, often rely on word-wise features like word2vec. Such features, while useful, overlook the sequential nature of words, thereby failing to capture the full semantics of sentences. Notably, the superior performance of word-wise features over the BERT model, as reported in ref. 37, may be attributed to the absence of fine-tuning BERT on a specialized medical-field corpus. In contrast to these studies, our approach leverages sentence-level embeddings, utilizing the richness of unlabeled corpora to enhance the accuracy of EMR diagnosis. Our methodology, though inspired by the feature fusion module in ref. 37, addresses a critical shortcoming in their numerical feature input: the lack of normalization. Even with normalization, the differing ranges of medical test features could disproportionately influence the model, potentially leading to biased predictions. Our implementation of MART mitigates this bias through a ranking-based split principle for feature transformation.
Limitations
While our method demonstrates high accuracy, its effectiveness might vary in other clinical centers due to differences in clinical text writing conventions. Despite this, we posit that the semantic interpretation of EMRs based on BERT offers better generalizability compared to word-based text understanding. Furthermore, the data preprocessing pipeline in our study includes an intensive manual template design for structured data extraction. This aspect could potentially restrict the practical deployment of our model. Nonetheless, as mentioned earlier, this preprocessing pipeline is not mandatory in situations where structured test data is readily available, particularly via a lab module, and can be directly employed. Finally, the successful application of our model is contingent upon the integration with a hospital’s information system infrastructure, including ETL (Extract, Transform, Load) tools for the automated extraction, preprocessing, and storage of necessary test numerical values.
Conclusions
In our study, we developed and evaluated a hybrid model that integrates clinical free-text descriptions with structured numerical medical test data for identifying MP in CAP patients. The model was trained and validated on a dataset of 7157 EMRs. Our findings show that the proposed model achieves an accuracy of 0.756, an F1 score of 0.831, a precision of 0.790, and a recall of 0.876 on a test set comprising 370 MP cases and 170 non-MP cases. It significantly surpasses the performance of the state-of-the-art MAF-Res method and all leading uni-modal methods across key metrics. This highlights the effectiveness of our approach. Furthermore, our experiments demonstrate the critical role of pre-training and the efficiency of tree-based feature transformation. Looking ahead, potential improvements include expanding the model’s capabilities to identify more CAP sub-types and enhancing its interpretability.
Acknowledgements
We gratefully acknowledge the funding support received from the Key R&D Program of Zhejiang, the National Key Research & Development Program of China, the National Natural Science Foundation of China, and the Hong Kong Research Grants Council through the General Research Fund.
Author contributions
(1) Conceptualization: Qiuyang Sheng, Xiaoqing Liu, Jingna Xie, Yingshuo Wang; (2) Methodology: Qiuyang Sheng, Xiaoqing Liu, Yizhou Yu; (3) Formal analysis and investigation: Yingshuo Wang, Jing Li, Jingna Xie, Gang Yu; (4) Writing - original draft preparation: All authors; (5) Writing-review and editing: All authors; (6) Funding acquisition: Xiaoqing Liu, Yiming Li, Yizhou Yu, Gang Yu; (7) Resources: Jing Li, Fenglei Sun, Yuqi Wang, Shuxian Li; (8) Supervision: Yizhou Yu, Gang Yu.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by grants from the Key R&D Program of Zhejiang (Grant No. 2023C03101), the National Key R & D Program of China (Grant No. 2023YFC2706400 and 2019YFE0126200), the National Natural Science Foundation of China (Grant No. 62076218 and 82171934), Zhejiang Province Research Project of Public Welfare Technology Application (Grant No. LGF22H180004), and the Hong Kong Research Grants Council Through General Research Fund (Grant No. 17207722).
