Extracting structured clinical data from pediatric emergency records using LLMs: A multimodel retrospective study of children with medical complexity

Abstract

Importance

Emergency departments (EDs) face significant documentation burdens due to reliance on unstructured clinical narratives, hindering efficiency, particularly in pediatric care. Large language models (LLMs) offer a potential solution by automating data extraction to improve clinical workflows.

Objective

To determine whether an LLM can accurately and efficiently extract structured clinical data from free-text pediatric ED records in a non-English setting.

Design

Diagnostic accuracy study using retrospective data from 2007 to 2023. Manual clinician classification served as the gold standard to assess model performance.

Setting

Single-center study conducted at the pediatric ED of Padova University Hospital, a tertiary care referral center in Italy.

Participants

A convenience sample of 697 anonymized ED records from children with complex medical conditions.

Exposure

Automated data extraction using OpenAI's GPT-5.2 model via structured prompts processed in Python. All texts were in Italian and translated to English in the workflow.

Main Outcomes and Measures

Primary outcomes included accuracy, AUC, sensitivity, and specificity of the LLM in extracting triage color codes, ED outcomes, reasons for ED visit, and performed procedures. Efficiency gains were also measured by comparing manual and automated extraction times.

Results

Among 697 records analyzed, the primary model (GPT-5.2) achieved high accuracy in classifying triage color (0.99) and ED outcome (0.984). Accuracy for laboratory tests was 0.96, oxygen therapy 0.95, and nasogastric tube placement 0.987. Results were consistent across all seven models (mean Fleiss’ kappa = 0.922). Processing time was reduced from ∼5 min to 6 s per record, with a total cost of € 23.42.

Conclusions

In this study of pediatric ED encounters in a non-English setting, LLMs reliably extracted structured clinical data and substantially reduced documentation processing time. These findings support their potential to streamline workflows, particularly in resource-constrained environments. Further research is warranted to improve classification of complex or ambiguous information.

Keywords

Large language models, emergency department, clinical documentation, artificial intelligence, natural language processing, pediatric care

Introduction

Emergency departments (EDs) serve as critical access points for acute care and accommodate a broad spectrum of patients with diverse medical needs. Inconsistent clinical documentation remains a pervasive challenge across healthcare systems. The prevalence of free-text narratives impedes effective real-time data processing, which in turn increases clinician workload, delays patient care, and reduces the efficiency of resource utilization.¹ This issue is pronounced in low-resource settings, where digital documentation tools are often scarce, but even high-income countries struggle with the limitations of existing systems: the literature suggests that electronic health record (EHR) implementation has often led clinicians to spend more time on documentation, detracting from direct patient care.^2,3

This is particularly problematic in emergency medicine, where workflows are inherently nonlinear, yet standardized EHR systems often require extensive structured input, forcing clinicians to choose between thorough documentation and attending to more patients.⁴

Within this context, children with medical complexity (CMC) merit focused attention. CMC are “children with multiple significant chronic health problems involving multiple organ systems, which results in functional limitations, high health care needs or utilization, and often requires need for, or use of, medical technology.”⁵ Their encounters frequently involve multimorbidity, device dependence, evolving presentations, and indirect communication; clinicians often synthesize caregiver reports and clinical judgment to form a complete clinical picture. This indirect information-gathering process heightens diagnostic complexity and cognitive load, underscoring the need for efficient, intuitive documentation systems. Yet, rigid EHR structures often fail to capture the rapidly evolving and dynamic nature of pediatric assessments. A UK survey found that none of the 15 EHR systems used by over 170 EDs met internationally validated usability standards, highlighting the inadequacy of current systems.⁶ Additionally, satisfaction among healthcare personnel, the primary users of these tools, remains low, and persistent challenges in overall efficiency have been noted long after implementation, necessitating further EHR system optimization.⁷

In resource-limited settings, the high cost and complexity of these systems remain major obstacles due to significant infrastructure, training, and maintenance requirements.⁸ This forces clinicians to rely on handwritten notes or fragmented digital records with limited interoperability, hindering continuity of care, and restricting opportunities for research, epidemiological surveillance, and quality-improvement initiatives.

Addressing these disparities requires cost-effective solutions that can bridge the digital divide without imposing unsustainable financial and technological demands. Machine learning (ML) techniques have previously been employed to classify unstructured medical data automatically, but they require substantial amounts of training data and significant computational resources.⁹ Large language models (LLMs), by contrast, represent a promising alternative. By directly parsing free-text data into structured formats through prompt-based interactions, LLMs can help optimize workflows, alleviate documentation burdens, and improve the precision of the extracted information. It is important to distinguish LLM-based data extraction from LLM-generated clinical documentation, which has been extensively studied.^10,11 While the latter focuses on producing narrative summaries or clinical notes from structured inputs, the present task involves the reverse process: transforming existing unstructured narratives into structured, queryable data fields. However, the adoption of such a technology is challenging, particularly with respect to validating outputs across diverse languages and clinical practices. Additionally, any implementation of LLMs must address privacy concerns and comply with regulations such as the General Data Protection Regulation (GDPR) to protect sensitive information.¹²

This scenario underscores the need for tools that balance computational efficiency and scalability with accuracy, adaptability to local healthcare requirements, and patient safety.

This study evaluated the performance of LLMs in automating data extraction from pediatric ED records of CMC and explored their potential to transform unstructured data into actionable insights in both high-resource and resource-constrained environments.

Methods

Study design and setting

This retrospective study followed the STARD reporting guidelines and analyzed the EHRs of CMC who were under the care of the hospital's Pediatric Palliative Care (PPC) and Pain Service, the regional PPC referral center for the Veneto region, Italy, and who accessed the pediatric ED of Padova University Hospital from 2007 to 2023.¹³

Emergency department encounters were excluded if the registry entry was an administrative misentry (e.g., explicitly marked as an error in the note, or lacking the discharge date/time needed for PDF generation) or if the record was blocked for privacy reasons. No additional filters or stratification criteria were applied. Ethical approval and informed consent were waived by the Padova University Hospital Ethics Committee due to the study's retrospective design and use of routine care data, posing minimal risk to participants.

Sample size and precision

Sample size adequacy was assessed assuming an expected accuracy of 0.95; for the final cohort (n = 697, see Results), the two-sided 95% confidence interval (CI) has an estimated half-width of about 1.6 percentage points using the normal approximation to the binomial. This level of precision was deemed acceptable for reporting diagnostic performance.
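
The half-width quoted above can be reproduced with a few lines of Python. This is a minimal sketch of the normal-approximation (Wald) formula z·sqrt(p(1 − p)/n), assuming z = 1.96 for a two-sided 95% interval; the function name is illustrative and not part of the study code.

```python
import math

def normal_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the two-sided normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Expected accuracy and cohort size as stated in the text.
half_width = normal_half_width(p=0.95, n=697)
print(f"half-width = {half_width:.4f}")  # about 0.016, i.e. 1.6 percentage points
```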

Emergency department information system

Encounters were documented in the hospital's Sistema Sanitario Integrato (SSI), an integrated software application that stores both structured fields (e.g., vital signs, laboratory/imaging results, diagnoses) and unstructured narratives. Health personnel can enter multiple time-stamped free-text medical notes per visit (e.g., triage assessments, progress updates, consult notes), enabling longitudinal documentation. Encounter documents were exported from SSI as PDFs for further processing.

Manual classification (gold standard)

All included records were manually labeled by two clinicians, serving as the gold standard for comparison with the LLM outputs. Items of interest were extracted from SSI and classified under four domains.

The ED admission domain comprised the color codes both at triage (ordinal acuity recorded at arrival) and at discharge/hospitalization (ordinal acuity/disposition at encounter end), and the primary reason for ED visit (nominal, single label defined by precedence rules in the codebook). The triage system implemented in Padova University Hospital assigns color-coded priority levels at ED arrival: red (emergency, immediate life-threatening conditions), orange (urgent), yellow (deferrable urgency), green (minor urgency), blue (nonurgent), and white (nonurgent, inappropriate ED use). These codes are assigned by a triage nurse upon arrival and may be updated at discharge to reflect clinical evolution.

The referrals and consultation domain included ED referral (route of access as recorded in the system), specialist consultation (binary indicator of any in-hospital specialty involvement), and PPC specialist involvement (binary).

The diagnostic and therapeutic procedures domain included encounter-level binary indicators for radiological tests, instrumental tests (nonradiological device-based diagnostics as per system categories), laboratory tests, medication/fluid administration, oxygen therapy, nasogastric tube replacement, and immunization administered in a protected environment.

The outcome at discharge/hospitalization captured the final disposition of the encounter (e.g., discharge, hospital admission, or other/transfer) as recorded at ED exit.

Unless otherwise specified, binary variables indicate presence if there was any occurrence during the encounter. Annotator disagreements were resolved by consensus before model evaluation.

Model-based classification

Data preprocessing and reproducibility

The free text, all in Italian, was extracted from PDFs using unstructured (0.16.9) and processed using LangChain (0.3.9) in Python.^14,15 The extracted text and the corresponding prompts were processed using the OpenAI GPT-5.2 model.¹⁶ Model parameters were left at their defaults, except for temperature, which was set to 0 to make the output as deterministic as possible. This setting significantly improves response stability, although a degree of variability persists. To address this residual variability and to improve the robustness of the findings, each model was run three independent times on the entire cohort (iterations 1–3). To further strengthen generalizability beyond a single-model design, the analysis was replicated across seven OpenAI model variants: GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-5.1, and GPT-5.2. This multimodel, multi-iteration design allowed assessment of both within-model consistency and between-model variability.

To ensure patient confidentiality, clinical records were anonymized prior to data processing by removing direct identifiers (e.g., names and dates of birth) in accordance with the GDPR and local ethical standards. Model-based data processing was performed within GPT Teams in a secure, isolated environment designed to prevent unauthorized access and ensure that no data are used for subsequent model training.

Prompt design

The LLM prompt and output schema matched the manual classification label set and decision rules for consistency and reproducibility. The prompt was designed collaboratively by a clinician (MD) specialized in PPC and a data scientist with expertise in natural language processing, following an expert-in-the-loop approach to ensure clinical accuracy and domain relevance. A single standardized prompt template was used for all records and refined over three pilot iterations before being fixed and applied to the entire cohort.

The final prompt consisted of two main components. The first provided fixed context instructions to orient the model to the task, including translation from Italian to English, formatting guidance, and constraints (e.g., JSON output only, no added commentary). The second component was dynamically generated for each case and contained the raw clinical narrative, alongside a standardized sequence of structured queries (Q1–Q14), each targeting a specific variable (Supplementary Material, eTable S1). Each question was formulated in clear English and followed a constrained-response format (e.g., “answer only yes or no,” “write only the color,” or “return a Python list of predefined categories”). These constraints allowed for automated parsing and extraction of JSON fields into structured tables via a Python script.
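
To illustrate how constrained responses make automated parsing possible, the sketch below validates a JSON reply and flattens it into one table row. It is not the study's script: the field names, the three example questions, and the example reply are hypothetical stand-ins for the actual Q1–Q14 schema, which is given in eTable S1.

```python
import json

# Hypothetical field names standing in for three of the Q1-Q14 items.
EXPECTED_FIELDS = ("triage_color", "oxygen_therapy", "reasons_for_visit")
ALLOWED_COLORS = {"red", "orange", "yellow", "green", "blue", "white"}

def parse_model_reply(reply: str) -> dict:
    """Validate a JSON-only model reply and return a flat row for a table."""
    record = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    missing = [name for name in EXPECTED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    color = str(record["triage_color"]).strip().lower()
    if color not in ALLOWED_COLORS:
        raise ValueError(f"unexpected color: {color!r}")

    answer = str(record["oxygen_therapy"]).strip().lower()
    if answer not in {"yes", "no"}:
        raise ValueError(f"expected yes/no, got {answer!r}")

    reasons = record["reasons_for_visit"]
    if not isinstance(reasons, list):
        raise ValueError("reasons_for_visit must be a list")

    return {
        "triage_color": color,
        "oxygen_therapy": answer == "yes",
        "reasons_for_visit": [str(r).strip() for r in reasons],
    }

# Hypothetical reply, for illustration only.
example_reply = (
    '{"triage_color": "Yellow", "oxygen_therapy": "no", '
    '"reasons_for_visit": ["Respiratory", "Infectious"]}'
)
row = parse_model_reply(example_reply)
print(row)
```

Rejecting any reply that falls outside the allowed answer set, rather than silently coercing it, keeps malformed outputs visible for manual review.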

Misclassification handling

Misclassification analysis was critical for assessing model limitations beyond accuracy. Special attention was given to cases with partial disagreement between model and gold standard, especially when multiple reasons for ED visit were listed. Classification was deemed correct if at least one reason overlapped; otherwise, only the primary reasons from each source were compared. This approach reflects a conservative but realistic evaluation strategy, accounting for the complexity and subjectivity often inherent in clinical documentation.
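
The matching rule for reasons for ED visit can be expressed as a set-overlap check. The sketch below is one reading of the rule, not the study's code; the function name and the case-insensitive normalization are assumptions. When each source lists a single reason, the overlap test reduces to comparing the two primary reasons.

```python
def reason_is_correct(model_reasons: list[str], gold_reasons: list[str]) -> bool:
    """Return True if model and gold standard share at least one reason.

    With one reason per source, this is equivalent to comparing
    the primary reasons directly.
    """
    model = {r.strip().lower() for r in model_reasons}
    gold = {r.strip().lower() for r in gold_reasons}
    return bool(model & gold)

# Partial agreement: one of two model reasons matches the gold standard.
partial = reason_is_correct(["Respiratory", "Infectious"], ["Infectious"])
# Single reasons that differ: counted as a misclassification.
mismatch = reason_is_correct(["Trauma"], ["Neurological"])
print(partial, mismatch)
```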

Recurring misclassification patterns—such as overidentification of consultations based on ambiguous surnames or inconsistent treatment mentions—were logged and categorized to inform prompt refinement and guide future model iteration. Errors were further classified qualitatively into three types: omission (model failed to detect a present condition or procedure), commission/invalid (model assigned an entity not present in the clinical record), and inaccurate (model identified the correct domain but assigned an incorrect category).

Processing time and cost estimation

Classification efficiency (time and cost) was evaluated in three steps. First, time per record was recorded for manual extraction on a pilot subset and for LLM-based extraction (request-to-response latency from API logs). Second, the median per-record times were extrapolated to the full cohort. Third, monetary cost was computed from API-reported token usage (input + output) multiplied by contemporaneous unit prices. One-time setup activities, PDF export/OCR, and file I/O were excluded from these estimates. Additionally, the cost associated with prompt engineering was assessed based on the total number of hours spent by the clinician and the data scientist.
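
The two calculations described above (median extrapolation and token-based cost) can be sketched as follows. All numeric inputs are placeholders chosen for illustration: the pilot timings, token counts, and unit prices are hypothetical, and pricing input and output tokens separately is an assumption about how the API bills usage.

```python
from statistics import median

def extrapolate_hours(per_record_seconds: list[float], n_records: int) -> float:
    """Extrapolate the median per-record time to the whole cohort, in hours."""
    return median(per_record_seconds) * n_records / 3600

def api_cost(input_tokens: int, output_tokens: int,
             price_in_per_million: float, price_out_per_million: float) -> float:
    """Cost = tokens x unit price, with prices quoted per million tokens."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000

# Hypothetical pilot timings (seconds per record) for manual extraction.
pilot_times = [280, 300, 310, 295, 320]
manual_hours = extrapolate_hours(pilot_times, n_records=697)

# Hypothetical token usage and unit prices, NOT the study's figures.
example_cost = api_cost(2_000_000, 100_000,
                        price_in_per_million=1.0, price_out_per_million=4.0)
print(f"{manual_hours:.1f} h, cost {example_cost:.2f}")
```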

Statistical analysis

Descriptive statistics were used to summarize the manually classified (gold standard) data. Categorical variables were reported as absolute frequencies and percentages. The model's performance for each classification task was assessed using accuracy as the primary metric, calculated as the proportion of correctly classified items out of the total number of classifications; 95% CIs for accuracy were also computed to quantify the precision of the estimates. For discrimination by class, sensitivity and specificity were estimated using a one-vs-rest formulation for each class, with 95% CIs computed from exact binomial (Clopper–Pearson) intervals.
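
For readers who want to check an exact interval by hand, the sketch below computes Clopper–Pearson limits using only the Python standard library, by bisection on the binomial tail probabilities. The study itself used R (binom package); this is an independent illustration, and the helper names are assumptions. Because it multiplies large binomial coefficients by floats, it is intended for sample sizes up to roughly 1000.

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _solve_decreasing(f, target: float, iters: int = 100) -> float:
    """Find p in [0, 1] with f(p) = target, for f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) CI for x successes out of n trials."""
    # Lower limit: P(X >= x | p) = alpha/2, i.e. cdf(x - 1) = 1 - alpha/2.
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= x | p) = alpha/2.
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

# 20 of 20 positives detected, as for immunization in a protected environment.
low, high = clopper_pearson(20, 20)
print(f"{low:.3f}-{high:.3f}")
```

With 20 of 20 correct, the lower limit equals 0.025^(1/20) ≈ 0.832, which matches the interval reported for immunization sensitivity in Table 2.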

When only hard class labels and no probabilities are available, true ROC AUC is not identifiable.¹⁷ In this case, for dichotomous variables, balanced accuracy, defined as (Sensitivity + Specificity)/2, was reported as an AUC-equivalent summary, with 95% CI obtained by nonparametric percentile bootstrap over observations (B = 1000). For multiclass variables, a one-vs-rest strategy was adopted: (i) class-wise sensitivity and specificity with Clopper–Pearson 95% CIs as above; and (ii) AUC summarized as the macro balanced accuracy with bootstrap 95% CI.
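
A minimal sketch of balanced accuracy with a percentile bootstrap for a binary variable is given below. It is illustrative only: the toy labels are invented, the seed and function names are assumptions, and resamples that happen to contain a single class are redrawn, which is one of several reasonable conventions.

```python
import random

def balanced_accuracy(gold: list[int], pred: list[int]) -> float:
    """(Sensitivity + Specificity) / 2 for binary labels coded 0/1."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

def bootstrap_ci(gold: list[int], pred: list[int], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Nonparametric percentile bootstrap CI, resampling observations."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        if len(set(g)) < 2:
            continue  # redraw: sensitivity or specificity would be undefined
        stats.append(balanced_accuracy(g, [pred[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Toy data: 10 positives (9 detected) and 10 negatives (8 correctly rejected).
gold = [1] * 10 + [0] * 10
pred = [1] * 9 + [0] * 1 + [0] * 8 + [1] * 2
ba = balanced_accuracy(gold, pred)
ci_low, ci_high = bootstrap_ci(gold, pred)
print(f"balanced accuracy = {ba:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```

For a multiclass variable, the same function can be applied to each class in turn (class vs rest) and the results averaged to obtain the macro balanced accuracy.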

To quantify within-model consistency across the three iterations, interrun reliability was assessed using Fleiss’ kappa, a multirater agreement statistic appropriate for categorical data with more than two raters. All performance metrics were averaged across the three runs per model and reported with 95% CIs. All statistical analysis was conducted using R, version 4.4.2, with the caret, binom, and pROC packages.^18–20
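
Fleiss' kappa treats each of the three runs as a "rater" and each record as an "item". The sketch below implements the standard formula in plain Python on invented toy labels; it is an illustration of the statistic, not the R code used in the study, and the function name is an assumption.

```python
from collections import Counter

def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa; ratings[i] holds the labels given to item i by each run."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")
    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)

# Toy example: four records, three runs each; one record has a disagreement.
runs = [
    ["yes", "yes", "yes"],
    ["no", "no", "no"],
    ["yes", "yes", "no"],
    ["no", "no", "no"],
]
kappa = fleiss_kappa(runs)
print(f"kappa = {kappa:.3f}")
```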

Results

Between 2007 and 2023, the system recorded 775 ED encounters from 85 children with medical complexity who visited the ED of Padova University Hospital in Italy. Of these, 697 records were included in the analysis based on the availability of PDF documentation. The model's outputs were compared to clinician classification, which served as the gold standard. Yellow was the most frequently assigned color code at both triage (41.8%) and discharge/hospitalization (40.3%), indicating a considerable level of urgency. Green, which denotes less critical cases, was the second most frequent code (22.4% at triage and 18.5% at discharge/hospitalization) (Table 1).

Table 1.

Descriptive statistics of extracted items.

	N = 697¹
ED admission
Color code at triage
Yellow	291 (41.8%)
Green	156 (22.4%)
Orange	84 (12.1%)
Red	64 (9.2%)
White	54 (7.7%)
Blue	48 (6.9%)
Color code at outcome
Yellow	281 (40.3%)
Green	129 (18.5%)
White	104 (14.9%)
Orange	83 (11.9%)
Red	64 (9.2%)
Blue	31 (4.4%)
NA	5 (0.7%)
Primary reason to ED visit
Infectious	148 (21.2%)
Medical device malfunction	133 (19.1%)
Respiratory	97 (13.9%)
Neurological	90 (12.9%)
Gastroenterological	78 (11.2%)
Trauma	46 (6.6%)
Other	42 (6.0%)
Antalgic	22 (3.2%)
Immunization shots in a protected environment	20 (2.9%)
Cardiological	8 (1.1%)
Osteoarticular of nontraumatic origin	6 (0.9%)
Genitourinary	5 (0.7%)
Metabolic	2 (0.3%)
Referral and consultation
ED referral
Spontaneous	513 (73.6%)
Other	71 (10.2%)
PPC specialist	64 (9.2%)
Pediatric General Practitioner	49 (7.0%)
Specialist consult
No	404 (58.0%)
Yes	293 (42.0%)
PPC specialist involvement
No	507 (72.7%)
Yes	190 (27.3%)
Outcome
Discharge	461 (66.1%)
Admission	231 (33.1%)
Other	5 (0.7%)
Diagnostic and therapeutic procedures
Immunization shots in a protected environment
No	677 (97.1%)
Yes	20 (2.9%)
Laboratory tests
Yes	370 (53.1%)
No	327 (46.9%)
Medications/fluids
Yes	352 (50.5%)
No	345 (49.5%)
Nasogastric tube replacement
No	662 (95.0%)
Yes	35 (5.0%)
Oxygen therapy
No	600 (86.1%)
Yes	97 (13.9%)
Radiological tests
No	438 (62.8%)
Yes	259 (37.2%)
Instrumental tests
No	642 (92.1%)
Yes	55 (7.9%)

¹n (%); ED: emergency department; PPC: pediatric palliative care.

In terms of primary reasons for ED visits, infectious diseases topped the list, accounting for 21.2% of the cases. Medical device malfunctions and respiratory issues were also common, comprising 19.1% and 13.9% of the visits, respectively.

Most patients arrived at the ED spontaneously (73.6%), with fewer referrals from pediatric general practitioners (7.0%) and PPC specialists (9.2%). Specialist consultations were necessary in 42.0% of cases and PPC specialists were involved in 27.3% of cases.

The outcomes of these admissions revealed that 66.1% of the patients were discharged, while 33.1% required hospitalization. Diagnostic and therapeutic procedures were frequently performed, with laboratory tests conducted in 53.1% of the cases and medications or fluids administered in 50.5% of the cases. Oxygen therapy was provided to 13.9% of patients, and nasogastric tube replacement was performed in 5.0% of the cases. Additionally, immunization shots in a protected environment were administered in 2.9% of cases.

Performance across variables

The primary model (GPT-5.2) performed well in key areas of ED data extraction (Table 2). The classification of triage color codes at admission reached an accuracy of 0.99 (95% CI: 0.979–0.996). Similarly, ED outcome classification achieved an accuracy of 0.984 (95% CI: 0.972–0.992). ED referral classification was less accurate, at 0.822 (95% CI: 0.789–0.848).

Table 2.

Performance metrics of the model (GPT-5.2) in classification tasks.

Variable	Performance metrics
Color code at triage	Accuracy (95% CI)	0.99 (0.979–0.996)
	AUC (95% CI)	0.996 (0.992–0.999)
	Sensitivity (95% CI)	Specificity (95% CI)
Blue	1 (0.921–1)	1 (0.994–1)
White	1 (0.93–1)	1 (0.994–1)
Green	0.974 (0.935–0.993)	1 (0.993–1)
Yellow	0.99 (0.971–0.998)	0.99 (0.974–0.997)
Orange	1 (0.958–1)	1 (0.994–1)
Red	1 (0.942–1)	0.995 (0.986–0.999)
Color code at discharge/hospitalization	Accuracy (95% CI)	0.964 (0.947–0.976)
	AUC (95% CI)	0.979 (0.97–0.987)
	Sensitivity (95% CI)	Specificity (95% CI)
Blue	1 (0.877–1)	1 (0.994–1)
White	0.98 (0.928–0.998)	0.995 (0.985–0.999)
Green	0.909 (0.847–0.952)	0.988 (0.974–0.995)
Yellow	0.976 (0.951–0.99)	0.973 (0.952–0.986)
Orange	0.988 (0.935–1)	0.998 (0.991–1)
Red	0.952 (0.865–0.99)	0.995 (0.986–0.999)
ED referral	Accuracy (95% CI)	0.82 (0.789–0.848)
	AUC (95% CI)	0.724 (0.696–0.752)
	Sensitivity (95% CI)	Specificity (95% CI)
Spontaneous	0.969 (0.95–0.982)	0.583 (0.508–0.656)
PPC specialist	0.567 (0.432–0.694)	0.995 (0.986–0.999)
Pediatric General Practitioner	0.765 (0.625–0.872)	0.93 (0.907–0.948)
Outcome	Accuracy (95% CI)	0.984 (0.972–0.992)
	AUC (95% CI)	0.989 (0.981–0.995)
	Sensitivity (95% CI)	Specificity (95% CI)
Admission	0.966 (0.934–0.985)	0.996 (0.984–0.999)
Discharge	0.993 (0.981–0.999)	0.988 (0.964–0.997)
Other	1 (0.478–1)	0.991 (0.981–0.997)
Reason to ED visit	Accuracy (95% CI)	0.823 (0.793–0.851)
	AUC (95% CI)	0.878 (0.841–0.917)
	Sensitivity (95% CI)	Specificity (95% CI)
Antalgic	0.667 (0.43–0.854)	0.988 (0.977–0.995)
Cardiological	0.75 (0.349–0.968)	0.999 (0.992–1)
Gastroenterological	0.854 (0.758–0.922)	0.956 (0.937–0.971)
Genitourinary	0.6 (0.262–0.878)	0.999 (0.992–1)
Immunization shots in a protected environment	1 (0.832–1)	0.999 (0.992–1)
Infectious	0.611 (0.528–0.689)	0.982 (0.967–0.991)
Medical device malfunction	0.937 (0.88–0.972)	0.984 (0.97–0.993)
Metabolic	0.5 (0.013–0.987)	1 (0.995–1)
Neurological	0.956 (0.89–0.988)	0.974 (0.958–0.985)
Osteoarticular of nontraumatic origin	0.5 (0.118–0.882)	0.994 (0.985–0.998)
Respiratory	0.867 (0.784–0.927)	0.928 (0.905–0.948)
Trauma	0.956 (0.849–0.995)	0.997 (0.989–1)
Other	0.625 (0.458–0.773)	0.989 (0.978–0.996)
Immunization shots in a protected environment	Accuracy (95% CI)	1 (0.995–1)
	AUC (95% CI)	1 (1–1)
	Sensitivity (95% CI)	1 (0.832–1)
	Specificity (95% CI)	1 (0.995–1)
Medications/Fluids	Accuracy (95% CI)	0.914 (0.891–0.934)
	AUC (95% CI)	0.913 (0.891–0.933)
	Sensitivity (95% CI)	0.939 (0.909–0.961)
	Specificity (95% CI)	0.882 (0.842–0.914)
Nasogastric tube replacement	Accuracy (95% CI)	0.987 (0.975–0.994)
	AUC (95% CI)	0.961 (0.911–0.996)
	Sensitivity (95% CI)	0.939 (0.909–0.961)
	Specificity (95% CI)	0.882 (0.842–0.914)
Oxygen therapy	Accuracy (95% CI)	0.95 (0.931–0.965)
	AUC (95% CI)	0.968 (0.955–0.979)
	Sensitivity (95% CI)	0.99 (0.944–1)
	Specificity (95% CI)	0.942 (0.92–0.959)
Instrumental tests	Accuracy (95% CI)	0.784 (0.752–0.814)
	AUC (95% CI)	0.842 (0.795–0.88)
	Sensitivity (95% CI)	0.906 (0.793–0.969)
	Specificity (95% CI)	0.794 (0.76–0.824)
Lab tests	Accuracy (95% CI)	0.96 (0.942–0.973)
	AUC (95% CI)	0.96 (0.944–0.974)
	Sensitivity (95% CI)	0.96 (0.935–0.978)
	Specificity (95% CI)	0.963 (0.935–0.98)
Radiological tests	Accuracy (95% CI)	0.978 (0.964–0.987)
	AUC (95% CI)	0.98 (0.969–0.989)
	Sensitivity (95% CI)	0.989 (0.967–0.998)
	Specificity (95% CI)	0.972 (0.952–0.986)
Specialist consult	Accuracy (95% CI)	0.76 (0.726–0.792)
	AUC (95% CI)	0.794 (0.77–0.819)
	Sensitivity (95% CI)	0.99 (0.97–0.998)
	Specificity (95% CI)	0.589 (0.54–0.637)
PPC specialist involvement	Accuracy (95% CI)	0.948 (0.929–0.963)
	AUC (95% CI)	0.914 (0.887–0.939)
	Sensitivity (95% CI)	0.841 (0.781–0.89)
	Specificity (95% CI)	0.986 (0.972–0.994)

ED: emergency department; PPC: pediatric palliative care.

For medical procedures, the identification of immunization shots administered in a protected environment stood out as a key strength, achieving accuracy of 1 (95% CI: 0.995–1) with perfect sensitivity (1, 95% CI: 0.832–1) and specificity (1, 95% CI: 0.995–1).

Laboratory tests and oxygen therapy administration were classified with accuracies of 0.96 (95% CI: 0.942–0.973) and 0.95 (95% CI: 0.931–0.965), respectively. Similarly, the classification of nasogastric tube replacements was highly accurate at 0.987 (95% CI: 0.975–0.994), with high sensitivity (0.939, 95% CI: 0.909–0.961) and specificity of 0.882 (95% CI: 0.842–0.914).

The model also performed well in identifying medications and fluids, achieving an accuracy of 0.914 (95% CI: 0.891–0.934), with high sensitivity (0.939, 95% CI: 0.909–0.961) and specificity (0.882, 95% CI: 0.842–0.914).

The LLM had more difficulty with complex tasks, such as classifying specialist consultations, for which accuracy was 0.76 (95% CI: 0.726–0.792). Although the sensitivity was excellent (0.99, 95% CI: 0.97–0.998), the specificity was only 0.589 (95% CI: 0.54–0.637), meaning that the model frequently reported consultations that the clinicians had not recorded. Similarly, instrumental tests (nonradiological device-based diagnostics) posed a challenge, with an accuracy of 0.784 (95% CI: 0.752–0.814), reflecting a high sensitivity (0.906, 95% CI: 0.793–0.969) but lower specificity (0.794, 95% CI: 0.76–0.824).

The classification of reasons for ED visits showed mixed results. Although it achieved a commendable accuracy of 0.823 (95% CI: 0.793–0.851), the model occasionally struggled with cases involving multiple overlapping reasons, where clinical judgment played a significant role in defining the primary cause. The sensitivity for reasons for access displayed wide variability, with values ranging from 0.5 to 1.

Multimodel comparison and consistency

The analysis was extended to seven OpenAI models (GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, GPT-5.1, GPT-5.2), each run three times (Supplementary Material, eTable S2, eFigure S1–2). Across all models and variables, mean accuracy was 83.6%. Inter-run consistency was high across all seven models (mean Fleiss’ kappa = 0.922). The most consistent models were GPT-4.1 (mean kappa = 0.966), GPT-4o (mean kappa = 0.957), and GPT-5.2 (mean kappa = 0.947). All full-scale models achieved accuracies above 96% for triage color codes, ED outcome, and immunization classification. The smaller model GPT-4.1-nano showed the weakest consistency (mean kappa = 0.795) and the lowest accuracy for several variables, indicating a performance–cost trade-off. The balanced accuracy (AUC) heatmap (eFigure S2) confirmed that performance patterns were consistent across models for well-defined variables (e.g., outcome, triage) but diverged for more ambiguous tasks (e.g., specialist consult, reason for ED visit). These results indicate that the findings do not depend on a single model and that LLM-based extraction is robust across the OpenAI model variants tested.

Classification efficiency

Manual processing of each EHR took approximately 5 min, totaling about 58 h for the cohort. GPT-based extraction reduced this to ∼6 s per record (about 1 h 10 min in total), plus 3 h of one-time work for prompt engineering. Although API access requires a paid plan, the cost of processing all 697 records remained low at €23.42 (GPT-5.2). Once the prompt was finalized, per-encounter processing was fully automated. No model fine-tuning or additional infrastructure beyond the API client was needed.
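
The totals above follow directly from the per-record figures, as the short check below shows. It uses only numbers reported in this section; the variable names are illustrative.

```python
n_records = 697

# Per-record times reported in the text.
manual_seconds = 5 * 60
llm_seconds = 6

manual_hours = n_records * manual_seconds / 3600   # about 58 h
llm_minutes = n_records * llm_seconds / 60         # about 70 min
llm_hours_with_setup = llm_minutes / 60 + 3        # plus 3 h of prompt engineering

per_record_speedup = manual_seconds / llm_seconds  # 50-fold per record
cost_per_record_eur = 23.42 / n_records            # about 0.034 EUR

print(f"manual: {manual_hours:.1f} h; LLM: {llm_minutes:.1f} min "
      f"({llm_hours_with_setup:.1f} h including setup)")
print(f"speed-up: {per_record_speedup:.0f}x; cost per record: {cost_per_record_eur:.3f} EUR")
```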

Discussion

This study evaluated the performance of seven OpenAI LLM variants in extracting and classifying clinically relevant information from free-text EHRs of pediatric ED encounters involving CMC, demonstrating accuracies ranging from 0.722 to 0.993 for items of interest with strong inter-run consistency (median Fleiss’ kappa = 0.924).

These results illustrate the strengths of the model in automating information classification from unstructured text, in line with previous findings.^21,22 The model's high accuracy for well-defined variables, such as triage codes, immunizations, oxygen therapy, and ED outcomes, underscores its potential to streamline workflows and reduce administrative burden, enabling clinicians to devote more time to patient care. However, performance declined when documentation was ambiguous, for example when notes referenced surnames without professional roles or mentioned medication administration unclearly; such cases reduced specificity and exposed limitations in interpretive reasoning.

Tasks such as identifying the reasons for ED visits also encountered challenges when documentation involved overlapping or multiple causes. These findings indicate that, while LLMs excel in structured and explicit contexts, they still fall short of professional clinical judgment in complex or subjective classifications and require further refinement to adapt to diverse documentation styles and linguistic nuances.

Beyond their technical performance, the implementation of LLMs for automated data extraction offers a significant advantage in terms of both economic and human resources. Human-based classification of information requires extensive labor from highly specialized personnel, diverting valuable time away from direct patient care. By contrast, LLMs can achieve high accuracy with minimal setup and infrastructure requirements, significantly reducing both costs and the time healthcare professionals dedicate to data processing.

For CMC in particular, LLM-based extraction offers distinct advantages. Children with medical complexity encounters generate especially dense and variable documentation due to multimorbidity, device dependence, and evolving clinical presentations. Manual review of such records is time-consuming and prone to inter-rater variability. Automated extraction can enable systematic monitoring of care delivery patterns, including tracking recurrent ED visits, identifying shifts in reason-for-visit profiles, and flagging gaps in specialist involvement or palliative care referrals. These capabilities support proactive care coordination and quality improvement for this vulnerable population, where timely identification of care patterns can directly influence clinical outcomes. Furthermore, the multimodel analysis demonstrated that these benefits are not limited to a single proprietary system; the consistent performance across model families suggests that institutions can select models based on cost, latency, or policy constraints while maintaining extraction quality. Unlike traditional supervised ML approaches, which require large labeled training datasets and retraining when clinical workflows change, LLM-based extraction relies on prompt-based instruction and can be adapted to new extraction schemas without additional annotated data.

Many low- and middle-income countries struggle to implement sophisticated EHR systems owing to infrastructural and financial constraints. Large language models can bridge this gap by transforming unstructured or semistructured medical records into actionable tidy formats, enabling healthcare systems to leverage the benefits of data-driven decision-making without requiring substantial investments in technology or infrastructure. Simultaneously, the advantages of LLM-driven automation extend beyond resource-limited environments. In resource-rich countries, where EDs face significant challenges related to high patient volumes and administrative burdens, LLM can serve as an integrative tool. Structured data outputs can be seamlessly integrated with EHRs, enabling improved reporting, quality monitoring, and compliance with the billing requirements.

Finally, the application of LLMs can contribute significantly to public health initiatives by offering richer and more reliable datasets for health surveillance, trend tracking, and outbreak management, by helping to address disparities in care delivery, and by making the large volume of existing data usable for research.

Limitations

This study represents a single-center experience, which may not reflect the real-world variability in clinical practices and documentation styles. Validation across multiple centers and healthcare systems is necessary to establish the generalizability of the findings. Although the analysis was extended to seven model variants with three iterations each, all models belong to the OpenAI family; generalizability to other LLM providers (e.g., Anthropic Claude, Google Gemini) remains to be established. The time and effort required for clinicians to review and correct misclassified records were not formally measured, and future studies should quantify this human-in-the-loop verification burden.
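One way a future study could begin to size the human-in-the-loop burden is to count how many records produce discordant labels across repeated runs, since those are natural candidates for clinician review. The sketch below is an assumption-laden illustration, not a procedure used in this study: the helpers `consensus` and `review_queue`, the rule that any disagreement triggers review, and the example labels are all hypothetical.

```python
from collections import Counter


def consensus(labels):
    """Return (majority_label, needs_review) for one record's repeated extractions."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    # Assumed rule: any disagreement between runs sends the record to review.
    needs_review = top_count < len(labels)
    return top_label, needs_review


def review_queue(runs):
    """runs maps a record id to the labels obtained from repeated runs."""
    return sorted(rid for rid, labels in runs.items() if consensus(labels)[1])


# Hypothetical triage labels from three iterations per record.
runs = {
    "r1": ["red", "red", "red"],
    "r2": ["green", "yellow", "green"],
    "r3": ["yellow", "yellow", "yellow"],
}
queue = review_queue(runs)
review_fraction = len(queue) / len(runs)
```

The fraction of flagged records, multiplied by a measured review time per record, would give a first estimate of the verification effort; records on which all runs agree but are wrong would not be captured by this rule.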

Formal evaluation of translation accuracy (Italian to English) was not performed; however, external evaluations of GPT-based translation and multilingual clinical text processing report high performance in comparable biomedical contexts.^23,24 Further, the data used in this study came exclusively from Italy, and the applicability of the models to other languages and cultural contexts was not assessed.

Results were not stratified by sex/gender, race/ethnicity, educational attainment, or other socioeconomic variables. These variables were not consistently available in the ED information system export at the encounter level across the study period, precluding reliable subgroup analyses. This omission may mask performance heterogeneity and limits assessment of equity. Future work will incorporate systematically captured sociodemographic variables and report subgroup performance and fairness metrics.
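The subgroup reporting proposed here could take the following form once sociodemographic variables are available: accuracy per subgroup with a 95% Wilson score interval. This is a minimal sketch in plain Python under stated assumptions; the group labels and example rows are hypothetical, and the Wilson interval is one reasonable choice of binomial interval rather than the specific method applied in this study.

```python
import math
from collections import defaultdict


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def accuracy_by_group(rows):
    """rows: iterable of (group, predicted_label, reference_label) tuples."""
    counts = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for group, predicted, reference in rows:
        counts[group][0] += int(predicted == reference)
        counts[group][1] += 1
    return {
        group: {"n": n, "accuracy": c / n, "ci95": wilson_interval(c, n)}
        for group, (c, n) in counts.items()
    }


# Hypothetical rows: (subgroup, LLM label, clinician gold-standard label).
rows = [
    ("A", "red", "red"), ("A", "green", "green"),
    ("A", "yellow", "green"), ("A", "red", "red"),
    ("B", "green", "green"), ("B", "yellow", "red"),
]
report = accuracy_by_group(rows)
```

Wide intervals in small subgroups, as in this toy example, would signal that apparent performance differences should not be over-interpreted.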

Finally, the implementation and maintenance of LLMs entail operational costs that, while potentially lower than those of alternative approaches, may affect feasibility in some institutions.

Conclusions

In this diagnostic study of pediatric emergency care encounters involving CMC, the results illustrated the considerable potential of LLMs to automate and improve data management across both high- and low/middle-resource settings. The multimodel, multi-iteration design indicated that these findings are robust and not contingent on a single model architecture. For CMC, where clinical encounters are inherently complex and documentation burdens are disproportionately high, LLM-based extraction offers a scalable pathway to systematic care monitoring and quality improvement. By reducing the administrative burden on clinicians and enabling more efficient workflows, LLMs can support faster, more accurate, data-driven decision-making. Future research should focus on optimizing performance for complex classification tasks, ensuring multilingual adaptability, supporting multimodal inputs (e.g., scanned PDFs, handwriting recognition for clinician notes, and audio), and addressing privacy and regulatory considerations. Despite remaining challenges, LLMs represent a promising tool for modernizing clinical documentation, bridging healthcare disparities, and improving data-driven patient care.

Supplemental Material

Supplemental material, sj-docx-1-dhj-10.1177_20552076261431431 for Extracting structured clinical data from pediatric emergency records using LLMs: A multimodel retrospective study of children with medical complexity by Gloria Brigiari, Marco Franzoi, Chiara La Piana, Franco Quarantiello, Anna Zanin, Franca Benini and Dario Gregori in DIGITAL HEALTH

Footnotes

Contributorship

Brigiari and Franzoi had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Brigiari and Franzoi are considered co-first authors. Gregori and Benini are considered co-senior authors. Brigiari, Franzoi, and Gregori contributed to concept and design. All authors contributed to acquisition, analysis, or interpretation of data. Brigiari and Franzoi contributed to drafting of the manuscript. La Piana, Quarantiello, Zanin, Benini, and Gregori contributed to critical revision of the manuscript for important intellectual content. Brigiari and Franzoi contributed to statistical analysis. Gregori and Benini contributed to supervision.

ORCID iDs

Gloria Brigiari

Dario Gregori

Chiara La Piana

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data access, responsibility, and analysis

Brigiari and Franzoi affirm that they had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Supplemental material

Supplemental material for this article is available online.

References

1. Schuur, Venkatesh. The growing role of emergency departments in hospital admissions. N Engl J Med 2012; 367: 391–393.

2. Carayon, Wetterneck, Alyousef, et al. Impact of electronic health record technology on the work and workflow of physicians in the intensive care unit. Int J Med Inf 2015; 84: 578–594.

3. Moy, Hobensack, Marshall, et al. Understanding the perceived role of electronic health records and workflow fragmentation on clinician documentation burden in emergency departments. J Am Med Inform Assoc 2023; 30: 797–808.

4. Walker, Dwyer, Heaton. Emergency medicine electronic health record usability: where to from here? Emerg Med J 2021; 38: 408–409.

5. Kuo, Houtrow; Council on Children with Disabilities. Recognition and management of medical complexity. Pediatrics 2016; 138: e20163021.

6. Bloom, Pott, Thomas, et al. Usability of electronic health record systems in UK EDs. Emerg Med J 2021; 38: 410–415.

7. Price, Kwok ESH, Cheung, et al. Physician experience with the Epic electronic health record (EHR) system: longitudinal findings from an emergency department implementation. Can J Emerg Med 2022; 24: 630–635.

8. Luna, Almerares, Mayan, et al. Health informatics in developing countries: going beyond pilot practices to sustainable implementations: a review of the current challenges. Healthc Inform Res 2014; 20: 3–10.

9. Spasic, Nenadic. Clinical text data in machine learning: systematic review. JMIR Med Inform 2020; 8: e17984.

10. Abdullah, Hamza, Kim. Resource-efficient medical report generation using large language models. arXiv 2024: arXiv:2410.15642. doi:10.48550/arXiv.2410.15642

11. Hou, Shi, et al. Multimodal large language models for medical report generation via customized prompt tuning. arXiv 2025: arXiv:2506.15477. doi:10.48550/arXiv.2506.15477

12. Voigt, Von Dem Bussche. The EU General Data Protection Regulation (GDPR). Cham, Switzerland: Springer International Publishing, 2017.

13. Cohen, Korevaar, Altman, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open 2016; 6: e012799.

14. langchain/CITATION.cff at master · langchain-ai/langchain. GitHub. Accessed January 6, 2025. https://github.com/langchain-ai/langchain/blob/master/CITATION.cff

15. Unstructured-IO/unstructured. Published online September 18, 2025. Accessed September 18, 2025. https://github.com/Unstructured-IO/unstructured

16. OpenAI. Accessed January 6, 2025. https://openai.com/

17. Muschelli. ROC and AUC with a binary predictor: a potentially misleading metric. J Classif 2020; 37: 696–708.

18. Robin, Turck, Hainard, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011; 12: 77. doi:10.1186/1471-2105-12-77

19. Dorai-Raj <Sundar.dorai-raj@pdf.com>. binom: binomial confidence intervals for several parameterizations. 2006: 1.1–1.1. doi:10.32614/CRAN.package.binom

20. Kuhn. Caret: classification and regression training. 2007; 7: 1. doi:10.32614/CRAN.package.caret

21. Lorenzoni, Gregori, Bressan, et al. Use of a large language model to identify and classify injuries with free-text emergency department data. JAMA Netw Open 2024; 7: e2413208.

22. Huang, Yang, Rong, et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. NPJ Digit Med 2024; 7: 106.

23. Neves, Jimeno Yepes, Névéol, et al. Findings of the WMT 2023 biomedical translation shared task: evaluation of ChatGPT 3.5 as a comparison system. Proc Eighth Conf Mach Transl 2023: 43–54. doi:10.18653/v1/2023.wmt-1.2

24. Menezes MCS, Hoffmann, Tan ALM, et al. The potential of generative pre-trained transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study. Lancet Digit Health 2025; 7: e35–e43.
