Abstract
Objective
This study aimed to improve transcription accuracy for Korean hospital telephone consultations by fine-tuning the Whisper large-v3-turbo model. The goal was to assess whether domain-specific adaptation enhances automatic speech recognition (ASR) performance across speaker types in telemedicine.
Methods
I used a publicly available speech corpus comprising 1,272,630 Korean-language audio files (∼1,300 h) from telemedicine interactions involving doctors, nurses, and patients. Audio signals were standardized (16 kHz, 16-bit) and paired with normalized transcripts. The Whisper model was fine-tuned using supervised learning with data augmentation (SpecAugment, speed perturbation, noise injection) and speaker normalization. Performance was evaluated using word error rate (WER) and character error rate (CER), with statistical tests (Wilcoxon Signed-Rank and Sign Test) applied across speaker groups.
Results
The fine-tuned model consistently outperformed the baseline. In the patient group, WER improved from 22.92% to 22.42% and CER from 5.32% to 4.98%. Statistically significant improvements were observed for doctors and patients (p < .001), while changes in nurse data were not significant due to low baseline error. CER was found to better reflect transcription fidelity in Korean, as it was less affected by morphological variation and word segmentation errors typical in agglutinative languages. Loss monitoring confirmed stable convergence without overfitting.
Conclusion
Domain-specific fine-tuning of Whisper improves ASR performance in Korean telemedicine, especially for spontaneous patient speech. CER is more appropriate than WER for evaluating Korean ASR systems. These findings support the use of optimized ASR models for more accurate and reliable clinical documentation in digital health environments, with potential to reduce clinician documentation burden, support continuity of care, and improve patient safety.
Keywords
Introduction
Telephone consultations have become an essential communication tool in primary healthcare, supporting appointment scheduling, test result notification, and advice for minor symptoms.1,2 However, traditional telephone consultations continue to face major documentation challenges.3,4 Because only about half of consultation content may be recorded, patient records can remain incomplete, undermining continuity of care.2 Manual transcription further increases the risk of omissions, misinterpretations, and delays, while also adding cognitive and administrative burden for healthcare providers.5,6 In addition, unstructured records limit integration with electronic medical records (EMRs), complicate information retrieval, and restrict secondary data use.7–9 These issues have become even more pressing with the expansion of telemedicine during and after the COVID-19 pandemic, increasing the need for documentation systems that are efficient, secure, and accurate.10,11 Automatic speech recognition (ASR) has emerged as a promising solution because it can automate transcription, reduce manual error, improve record accuracy, and support more efficient EMR-linked workflows and data-driven clinical decision-making.12–15
However, applying ASR in healthcare presents unique challenges. The complexity of medical terminology, variability in speech patterns, and the need for high transcription accuracy remain significant obstacles.15,16 Moreover, clinical environments involve background noise, overlapping speech, and variable audio quality,17 while strict regulations such as HIPAA and GDPR necessitate robust data protection.18
Despite recent advances, general-purpose ASR systems still perform suboptimally in medical transcription.19 Clinical conversations contain technical terminology, abbreviations, and informal spoken expressions, and their recognition accuracy is further reduced by accent variation, disfluencies, and background noise.20–24 Earlier frameworks such as Kaldi offered flexibility but limited domain-specific adaptability, while end-to-end models including DeepSpeech and Wav2Vec 2.0 improved overall ASR performance yet remained constrained in healthcare applications.14,25–27 Transformer-based models, particularly Whisper, have further advanced ASR in noisy and multilingual settings.28 Nevertheless, although Whisper shows strong general-domain performance, it still exhibits reduced accuracy in medical conversations because of vocabulary and discourse mismatches.19 These limitations are likely to be amplified in Korean clinical telephone consultations, where informal speech, rapid turn-taking, and overlapping dialogue create additional complexity.29 To the best of my knowledge, no previous study has examined the fine-tuning of Whisper or other large-scale ASR models using Korean clinical speech data, leaving an important gap in domain-specific adaptation for the Korean healthcare context.
Fine-tuning pre-trained ASR models on domain-specific datasets may improve recognition of medical terminology, better capture conversational structures unique to clinical interactions, and enhance overall transcription accuracy.30,31 When integrated with EMRs, such systems may also reduce administrative burden, streamline workflows, and support structured data storage for more effective clinical decision-making.32
Recent studies suggest that ASR performance in clinical and telemedicine-related settings varies widely according to task and context. A recent systematic review reported that word error rates ranged from 8.7% in controlled dictation settings to over 50% in conversational or multi-speaker clinical scenarios.33 In patient-clinician conversations recorded under relatively controlled conditions, specialized digital scribe models have reported WERs of 8.8% to 10.5%, whereas a psychotherapy study reported an overall WER of 25% and a WER of 34% for harm-related utterances, indicating that performance requirements differ substantially across use cases.34,35 Cross-language medical ASR studies have likewise shown marked variability across languages, highlighting that performance is not directly transferable across linguistic settings.36 In addition, a recent Korean study in radiation oncology clinics reported a fine-tuned character error rate of 0.26 for clinician-patient conversations, highlighting both the feasibility and the continuing difficulty of domain-specific ASR in Korean medical speech.37 Taken together, these findings suggest that there is no single universal WER or CER threshold that can be considered acceptable for all telemedicine applications. Instead, performance should be interpreted in relation to the intended clinical task, the linguistic properties of the target language, and the potential downstream impact of transcription errors.
To address this gap, the present study proposes an end-to-end system for hospital telephone consultation transcription using a fine-tuned Whisper ASR model. The system records calls, converts speech to text, and stores transcriptions in a centralized database linked to patient profiles, supporting efficient EMR management (Supplementary Figure S1). I fine-tuned Whisper using transfer learning on a Korean clinical telephone consultation dataset and evaluated its performance against the pre-trained model using word error rate (WER) and character error rate (CER). Results demonstrate substantial improvements in transcription accuracy, highlighting the value of domain-specific ASR adaptation for reducing documentation burdens and enhancing workflow efficiency in healthcare.
Methods
Study design and experimental pipeline
This study is a secondary analysis and model-development study that fine-tunes the Whisper large-v3-turbo model for improved transcription of hospital telephone consultations using a publicly available, de-identified dataset (AI-Hub: ‘Medical staff and patient voices for non-face-to-face treatment’).29,38 All data processing, supervised fine-tuning, and performance evaluation were conducted at Gyeongsang National University, Republic of Korea, between May 2024 and February 2025.
The overall experimental procedure followed a structured pipeline consisting of data preprocessing, supervised fine-tuning, and performance evaluation (Figure 1). Medical call audio files were extracted and preprocessed through an audio segmentation module to reduce noise and isolate clean speech segments. In parallel, the corresponding transcription text was normalized to remove punctuation and special characters to ensure consistent evaluation. The resulting paired data (clean audio and normalized text) were used to fine-tune the pre-trained Whisper large-v3-turbo model using supervised learning on domain-specific medical speech (approximately 1,300 h) from doctors, nurses, and patients. After training, performance was evaluated using standard speech recognition metrics, including WER and CER, and compared with the baseline Whisper model.
Figure 1. Workflow of supervised fine-tuning of the Whisper model for medical telephone consultation transcription.
Data collection and quality control
Table 1. Sociodemographic characteristics of medical speech segments.
The dataset was created by first generating dialogue scripts representing possible communication patterns between healthcare professionals and patients in telemedicine scenarios. Accordingly, the recordings primarily reflect scripted/role-play teleconsultation scenarios rather than spontaneous, naturally occurring telephone consultations. These scripts were then read and recorded by 50 medical professionals from the Korea University Medical Center, including doctors and nurses across 15 clinical departments such as laboratory medicine, family medicine, dentistry, thoracic surgery, anesthesiology, neurosurgery, clinical pharmacology, gastroenterology, nuclear medicine, pulmonology, radiation oncology, otorhinolaryngology, rehabilitation medicine, orthopedics, and colorectal surgery.38 In addition, 1,500 virtual patients participated, recruited through a crowdsourcing platform. Although the dataset provides broad clinical coverage and high-quality recordings, it may not fully capture spontaneous speech variability, including hesitations, self-repairs, interruptions, and overlapping speech.38
A structured multi-stage quality control (QC) workflow was implemented during dataset construction, including sequential checks for adherence to the recording/formatting guide, evaluation of content acceptability (semantic plausibility and conversational naturalness), and a final confirmation prior to inclusion; dedicated QC tools were used to support inspection.38
Operational decision criteria were specified for deletion versus limited correction. When pronunciation was unclear (e.g., mumbling) or when an utterance contained misread content that could change meaning, the item was deleted. When the audio deviated from the script but meaning was preserved, a restricted refinement procedure was allowed in which the script text was minimally edited to match the audio, primarily limited to particles and endings; beyond these restricted cases (e.g., word omission, meaning-changing substitutions, or non-standard reading errors), the item was deleted. Concrete audio-quality thresholds and rejection rules were also specified. Recordings were required to be intelligible, free of electrical or unexplained noise, and free of long silences between words (e.g., pauses exceeding 2 s). Examples of deletion targets included intrusive non-speech events (e.g., large impacts, sirens or horns, footsteps or knocks), voices of other speakers, coughing or loud breathing, overly segmented speech, strong voice tremor, swallowing sounds, mumbling, stuttering, abnormal pronunciation, and persistent background noise unrelated to the recording context.
Because this study is a secondary analysis of a publicly released dataset, stage-wise rejection/correction rates, disagreement-resolution statistics, and inter-reviewer agreement indices are not available in the public documentation for this dataset and therefore cannot be reported here without introducing unverified estimates.
Sample size determination and justification
The sample size in this study was determined by the availability of the publicly released AI-Hub corpus rather than by prospective recruitment. I used the full set of eligible recordings provided after the dataset provider’s multi-stage quality control (N = 1,272,630 audio files; approximately 1,300 h). For model evaluation, the corpus was split into 85% for training and 15% for evaluation, resulting in an evaluation set of approximately 190,895 audio segments. Given this evaluation size, performance estimates are expected to be highly stable; for example, under a conservative binomial approximation (p = 0.50), the maximum 95% margin of error for an estimated proportion is 1.96 × √(p(1−p)/N), which corresponds to approximately 0.22 percentage points when N ≈ 190,895. Therefore, the selected sample size (i.e., the full available corpus with a large held-out evaluation subset) provides adequate statistical precision for comparing baseline and fine-tuned ASR performance.
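As a worked check of this precision estimate, the calculation can be reproduced in a few lines of Python (a minimal sketch; the evaluation-set size and the conservative p = 0.50 are taken from the text above, and the helper name is illustrative):

```python
import math

def max_margin_of_error(n: int, p: float = 0.50, z: float = 1.96) -> float:
    """Worst-case 95% margin of error for an estimated proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

n_eval = 190_895  # approximate held-out evaluation segments (15% of 1,272,630)
moe = max_margin_of_error(n_eval)
print(f"Maximum 95% margin of error: {moe * 100:.2f} percentage points")
# -> approximately 0.22 percentage points, matching the value reported above
```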
Sociodemographic and contextual characteristics
Among the total speech segments, 30.0% originated from doctors (n = 381,991), 29.2% from nurses (n = 371,591), and 40.8% from patients (n = 519,034). Female speakers accounted for 66.3% of the overall dataset (n = 843,427), whereas male speakers accounted for 33.7% (n = 429,189) (Table 1).
Regarding age distribution, the largest proportion of speech data came from the 20–29 years group (40.2%, n = 511,835), followed by 30–39 years (27.4%, n = 349,065) and 40–49 years (11.1%, n = 141,064). Contributions from pediatric patients aged 3–10 years accounted for 2.0% of the dataset (n = 25,209) (Table 1).
The dataset also includes a wide range of dialectal variations, reflecting real-world clinical diversity. Standard Korean was used in 58.5% of the recordings (n = 744,528), while major regional dialects included Gyeongsang (20.6%, n = 262,254), Jeolla (12.0%, n = 153,246), and Chungcheong (7.5%, n = 95,597) (Table 1).
Table 2. Contextual characteristics of medical speech across groups.
In terms of recording environments, home settings accounted for the largest share (83.6%, n = 1,063,772), followed by office environments (15.4%, n = 195,538). A small proportion of sessions was recorded in public spaces such as parks, stores, stations, and cars, which naturally introduced background noise from conversations, television sounds, and environmental activities. As a result, the dataset reflects realistic acoustic conditions encountered in telemedicine settings (Table 2).
Regarding the devices used, smartphones were the dominant medium (64.1%, n = 815,750), followed by laptops (26.8%, n = 340,605) and smart pads (9.0%, n = 114,988). Other devices, including dedicated recorders (0.1%, n = 792), contributed minimally (Table 2).
Dataset preprocessing
To optimize the model’s learning efficiency and ensure reproducible evaluation, all audio files were standardized to a 16 kHz sampling rate with 16-bit depth. Text normalization was applied deterministically using the same script across the entire dataset. Specifically, punctuation and typographic symbols were removed from transcripts, including sentence-final marks (., ?, !), commas, colons/semicolons, quotation marks, brackets/parentheses, and other non-alphanumeric symbols. Multiple whitespace characters were collapsed into a single space, and leading/trailing spaces were stripped. Korean characters and standard Latin letters were retained. Numerals and medically meaningful quantity expressions were preserved in their original surface form (e.g., “5 mg,” “2회 (twice),” “37.8”), and medically meaningful tokens (e.g., medication names, disease names, and clinical abbreviations) were retained without dictionary-based substitution.
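A minimal normalization sketch consistent with these rules is shown below; the original preprocessing script is not published, so the specific regular expressions and character classes (which retain Korean syllables, Latin letters, digits, and in-number decimal points) are assumptions for illustration.

```python
import re

def normalize_transcript(text: str) -> str:
    """Deterministic transcript normalization (sketch, not the original script).

    Removes punctuation and typographic symbols, collapses repeated whitespace,
    and keeps Korean characters, Latin letters, digits, and decimal points
    that appear inside numbers (e.g., '37.8').
    """
    # Drop everything except Korean syllables/jamo, Latin letters, digits,
    # whitespace, and periods (periods are handled again below).
    text = re.sub(r"[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318FA-Za-z0-9.\s]", " ", text)
    # Remove periods that are not decimal points within a number.
    text = re.sub(r"(?<!\d)\.(?!\d)", " ", text)
    # Collapse whitespace and strip leading/trailing spaces.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcript("혈압약을 드시고   계신가요? 오늘 체온은 37.8도입니다!"))
# -> '혈압약을 드시고 계신가요 오늘 체온은 37.8도입니다'
```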
Transfer learning of the whisper model
For fine-tuning the Whisper model, the following training configurations were applied. The model used in this study was Whisper large-v3-turbo, a transformer-based ASR model developed by OpenAI.28 The model was chosen not only for computational efficiency but also for its strong multilingual ASR performance, robustness in noisy speech environments, and suitability for conversational speech transcription using a large-scale pre-trained transformer architecture. These characteristics were considered particularly relevant for Korean clinical telephone consultations, which involve domain-specific terminology, speaker variability, and acoustically heterogeneous recording conditions. In addition, the turbo variant was selected over the standard large-v3 model because the present study aimed to evaluate not only transcription performance but also practical feasibility for clinical deployment, where inference speed, scalability, and computational cost are important implementation considerations.28
However, no separate preselection benchmarking experiment was conducted in this study to compare Whisper large-v3-turbo directly against the standard Whisper large-v3 model or alternative ASR architectures such as Conformer- or wav2vec 2.0-based systems before training. Therefore, the choice of Whisper large-v3-turbo was based on prior literature, its strong reported performance in multilingual and noisy ASR settings, and its practical suitability for domain adaptation, rather than on an internal head-to-head benchmark.27 The Whisper large-v3-turbo model is structurally designed to process audio segments of up to 30 s, and for longer audio inputs, the model automatically applies a sliding-window or chunked segmentation approach to transcribe the entire recording seamlessly.28 Because the model is designed for robust multilingual processing, it was used without applying any translation to the dataset.
The training process was conducted for a total of 1,400 steps, a value determined through preliminary experiments, which confirmed that the model sufficiently converged within this range. Notably, this value reflects an early stopping criterion based on validation loss monitoring, as extending training beyond 1,400 steps resulted in no further performance gains and a slight increase in test loss. This is fewer than the 3,000–4,000 steps typically suggested for Whisper fine-tuning in low-resource languages or domain-specific applications.28,39,40 The corpus was split into 85% for training and 15% for evaluation using stratified partitioning to preserve the relative proportions of doctors, nurses, and patients across subsets, and the split was speaker-independent such that no individual speaker appeared in both subsets; overlap of scripts or scenarios across splits was not considered during partitioning. A sketch of this partitioning procedure is shown below.
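In this sketch, the 'speaker_id' and 'role' metadata fields are hypothetical names used for illustration, and the 85/15 ratio follows the split described above; the original partitioning code is not published.

```python
import random
from collections import defaultdict

def speaker_independent_split(records, train_ratio=0.85, seed=42):
    """Sketch of a stratified, speaker-independent 85/15 split.

    `records` is assumed to be a list of dicts with hypothetical
    'speaker_id' and 'role' keys ('doctor' / 'nurse' / 'patient').
    Stratification is applied at the role level so each role keeps
    roughly its original proportion, while every speaker is assigned
    to exactly one subset.
    """
    rng = random.Random(seed)
    speakers_by_role = defaultdict(set)
    for r in records:
        speakers_by_role[r["role"]].add(r["speaker_id"])

    train_speakers = set()
    for role, speakers in speakers_by_role.items():
        speakers = sorted(speakers)
        rng.shuffle(speakers)
        cut = int(len(speakers) * train_ratio)
        train_speakers.update(speakers[:cut])

    train = [r for r in records if r["speaker_id"] in train_speakers]
    evaluation = [r for r in records if r["speaker_id"] not in train_speakers]
    return train, evaluation
```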
The learning rate was set to 1e-5, a commonly used value for Whisper fine-tuning. This setting allows the model to adapt to the new domain while preserving the pre-trained weights without excessive deviation. Additionally, a gradual learning rate warm-up was applied for the first 500 steps to prevent abrupt weight changes at the start of training, ensuring stable convergence. This technique is a widely adopted approach in Transformer-based models.28,39,40
For memory optimization, the batch size was set to 16 during training and 8 during evaluation, a configuration found to be suitable for fine-tuning the model under the available GPU memory constraints. The gradient accumulation step was set to 1, as a batch size of 16 was sufficient for stable training. If a smaller batch size were required due to resource limitations, the gradient accumulation step could be increased to maintain an equivalent effective batch size.28,39,40
To maximize GPU utilization while improving training efficiency, gradient checkpointing and FP16 computation were applied. These optimizations reduce memory consumption while accelerating the training process.28,39 On average, the entire fine-tuning process required approximately 43 h and 57 min on the GPU environment used in this study.
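For concreteness, a configuration sketch using the Hugging Face Seq2SeqTrainingArguments interface is given below; the hyperparameter values are those reported in this subsection (together with the evaluation interval and checkpoint-selection criterion described later in this section), while the argument names follow the transformers API and the output directory is a placeholder rather than the path used in the original experiments.

```python
from transformers import Seq2SeqTrainingArguments

# A minimal sketch of the training configuration described above
# (not the original training script).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-ko-telemed",  # placeholder path
    max_steps=1400,                   # early-stopped training budget
    learning_rate=1e-5,               # common Whisper fine-tuning value
    warmup_steps=500,                 # gradual learning-rate warm-up
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,    # increase if GPU memory is limited
    gradient_checkpointing=True,      # trade compute for memory
    fp16=True,                        # mixed-precision training
    eval_strategy="steps",            # 'evaluation_strategy' in older transformers versions
    eval_steps=700,                   # periodic evaluation during training
    save_strategy="steps",
    save_steps=700,                   # checkpoint at each evaluation point
    logging_steps=25,
    predict_with_generate=True,       # decode during evaluation so WER/CER can be computed
    load_best_model_at_end=True,
    metric_for_best_model="cer",      # select the checkpoint with the lowest validation CER
    greater_is_better=False,
)
```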
To improve generalization in noisy, multi-speaker telemedicine speech, I applied data augmentation during fine-tuning. The overall fine-tuning workflow followed the official Hugging Face guidance for Whisper large-v3-turbo.28 SpecAugment was applied to log-mel features using a conservative masking policy with two frequency masks (maximum width: 15 mel bins) and two time masks (maximum width: 70 frames; proportion parameter p = 0.2).41 Additive noise injection was applied by mixing background-noise segments representative of telemedicine environments at a signal-to-noise ratio uniformly sampled between 10 and 20 dB, with a probability of 0.5 per training utterance.42 Speed perturbation was implemented using speed factors of 0.9 and 1.1 (±10%), with a probability of 0.5 per training utterance. These augmentation settings were selected to improve robustness to moderate background noise and speaking-rate variability while preserving clinically meaningful lexical content, consistent with established ASR augmentation practices.28,39,40
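The augmentation policy can be sketched as follows; the masking widths, SNR range, and application probabilities are those reported above, while the torchaudio-based implementation is an assumption, since the original augmentation code is not published.

```python
import random
import torch
import torchaudio

# SpecAugment on log-mel features: two frequency masks (<= 15 mel bins)
# and two time masks (<= 70 frames, proportion p = 0.2).
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=70, p=0.2)

def spec_augment(log_mel: torch.Tensor) -> torch.Tensor:
    for _ in range(2):
        log_mel = freq_mask(log_mel)
    for _ in range(2):
        log_mel = time_mask(log_mel)
    return log_mel

def add_noise(wave: torch.Tensor, noise: torch.Tensor, prob: float = 0.5) -> torch.Tensor:
    """Mix a background-noise clip (assumed at least as long as the speech)
    at an SNR sampled uniformly from 10-20 dB, with probability 0.5."""
    if random.random() > prob:
        return wave
    snr_db = random.uniform(10.0, 20.0)
    noise = noise[..., : wave.shape[-1]]
    speech_power = wave.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

def speed_perturb(wave: torch.Tensor, sample_rate: int = 16_000, prob: float = 0.5) -> torch.Tensor:
    """Apply +/-10% speed perturbation by reinterpreting the waveform at a
    scaled rate and resampling back to 16 kHz (also shifts pitch, as in
    standard speed perturbation)."""
    if random.random() > prob:
        return wave
    factor = random.choice([0.9, 1.1])
    return torchaudio.functional.resample(
        wave, orig_freq=int(sample_rate * factor), new_freq=sample_rate
    )
```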
During training, evaluation was conducted every 700 steps, and logging was performed every 25 steps to closely monitor training progress. For evaluation, decoding was performed using greedy generation (num_beams = 1) with a maximum generation length of 225 tokens. The decoding task was fixed to transcription rather than translation, and the target language was constrained to Korean to match the dataset. Temperature was fixed at 0.0 for deterministic decoding. For utterances longer than the model’s 30-second receptive field, the default sequential long-form decoding strategy was used, in which the audio was processed in consecutive 30-second windows rather than with chunked independent decoding. No additional custom suppression rules were introduced beyond the model’s default generation configuration, and no timestamp output was requested during evaluation. These settings were selected to ensure a stable and reproducible comparison between the baseline and fine-tuned models.
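A decoding sketch consistent with these settings, using the transformers generation interface, is shown below; the model identifier points to the public baseline checkpoint and would be replaced by the fine-tuned checkpoint, and exact argument handling may differ slightly across transformers versions.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-large-v3-turbo"  # public baseline; swap in the fine-tuned checkpoint
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

def transcribe(waveform, sampling_rate=16_000):
    """Greedy, deterministic Korean transcription (sketch)."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs.input_features,
            language="ko",        # constrain decoding to Korean
            task="transcribe",    # transcription, not translation
            num_beams=1,          # greedy decoding
            do_sample=False,      # deterministic output
            max_length=225,       # maximum generation length in tokens
        )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```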
At the end of training, the best-performing model was selected based on the lowest CER on the validation set. This selection criterion was applied to prevent overfitting and enhance the model’s generalization performance.
Performance evaluation of the baseline model and the fine-tuned model
To assess the performance of the ASR models, I evaluated the fine-tuned Whisper model trained on domain-specific data against the baseline Whisper large-v3-turbo model. The evaluation was conducted using two key metrics, WER and CER, both of which are widely used in Korean speech recognition assessments.43,44
WER measures the word-level transcription accuracy by calculating the number of deleted, inserted, and substituted words between the recognized text and the ground truth transcript. This metric is useful for evaluating sentence-level recognition performance, where a lower WER indicates higher ASR accuracy and better overall speech recognition quality.44
However, WER has limitations in evaluating Korean speech recognition due to the linguistic structure of the language. As an agglutinative language, Korean frequently employs particles and verb endings, which leads to morphological variations and ambiguous word boundaries. This characteristic makes WER less effective for assessing transcription accuracy in Korean ASR models.43
To address this issue, I incorporated CER as an additional metric, which provides a more reliable measure of ASR performance for Korean. CER evaluates transcription accuracy at the character level, making it particularly well-suited for languages with complex morphological structures. A lower CER indicates higher accuracy, and it is considered a more appropriate evaluation metric for Korean speech recognition.
WER and CER are computed as follows:
WER is calculated from the numbers of substituted (S), deleted (D), and inserted (I) words relative to the total number of words (N) in the reference transcript:

WER = (S + D + I) / N

CER is computed analogously at the character level, dividing the numbers of substituted, deleted, and inserted characters by the total number of characters in the reference transcript.
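In practice, both metrics can be computed with the Hugging Face evaluate package, as sketched below; the sentence pair is a constructed illustration rather than corpus material, and both strings are assumed to have been normalized with the rules described earlier.

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Constructed example: one substituted word, differing in two characters.
references  = ["오늘 혈압은 괜찮으세요"]
predictions = ["오늘 혈압은 괜찮으시죠"]

wer = wer_metric.compute(predictions=predictions, references=references)
cer = cer_metric.compute(predictions=predictions, references=references)
print(f"WER = {wer:.3f}, CER = {cer:.3f}")  # lower values indicate higher accuracy
```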
Statistical analysis for performance comparison between the baseline and fine-tuned models
To determine whether there was a statistically significant difference in performance between the baseline Whisper model and the fine-tuned Whisper model, I formulated the following hypotheses: the null hypothesis (H0) stated that there is no difference in WER and CER between the two models, whereas the alternative hypothesis (H1) stated that the fine-tuned model achieves lower WER and CER than the baseline model.
The dataset was categorized into three speaker groups (nurses, doctors, and patients) based on WER and CER scores, and the performance differences between the two models were evaluated for each speaker group as well as for the overall dataset. To assess the distribution of the data, I conducted Q–Q plot analysis and Shapiro–Wilk tests to examine normality and selected the statistical method for comparing the models accordingly. If the data followed a normal distribution, a paired t-test was planned, which assumes normality and evaluates the mean difference between the two models on the same dataset. Because the data did not follow a normal distribution, I employed the Wilcoxon Signed-Rank Test, a non-parametric method used to assess median differences between paired samples, together with the Sign Test as a convergent non-parametric check. Although the large sample size would also have supported a paired t-test under an approximate-normality argument, the non-parametric tests were retained as the primary analysis because the normality assumption was violated. A p-value < 0.05 was considered statistically significant, indicating that the observed performance difference between the baseline and fine-tuned models was unlikely to be due to random variation.
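A sketch of this testing procedure using SciPy is shown below; the per-utterance error-rate arrays are hypothetical placeholders, and the Sign test is implemented as an exact binomial test on the direction of the paired differences.

```python
import numpy as np
from scipy import stats

# Hypothetical per-utterance CER values for the same evaluation segments.
baseline_cer  = np.array([0.052, 0.031, 0.078, 0.049, 0.060])
finetuned_cer = np.array([0.048, 0.030, 0.071, 0.047, 0.055])

# Normality check on the paired differences (Shapiro-Wilk).
diff = baseline_cer - finetuned_cer
print(stats.shapiro(diff))

# Wilcoxon Signed-Rank test (non-parametric paired comparison);
# alternative="greater" tests whether the baseline error exceeds the fine-tuned error.
print(stats.wilcoxon(baseline_cer, finetuned_cer, alternative="greater"))

# Sign test: exact binomial test on the number of positive differences.
nonzero = diff[diff != 0]
n_pos = int((nonzero > 0).sum())
print(stats.binomtest(n_pos, n=len(nonzero), p=0.5, alternative="greater"))
```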
Experimental environment and software
All experiments were conducted on a high-performance workstation equipped with four NVIDIA RTX 3090 GPUs (10,496 CUDA cores, 328 Tensor cores, 82 RT cores, base clock 1.40 GHz), an Intel® Core™ i9 X-series processor (3.7 GHz), and 192 GB of system memory, running on Ubuntu 18.04 LTS.
For ASR, I utilized the Whisper large-v3-turbo model, implemented via the OpenAI Whisper framework (v3.1.0). All training and inference processes were executed using Python (v3.10.13) with PyTorch (v2.2.1) as the primary deep learning backend. Audio preprocessing was performed with Torchaudio (v2.2.1), and raw speech decoding and encoding were handled using FFmpeg (v6.0).
To ensure reproducibility, all experiments were conducted within an isolated Conda environment (v23.11.0), with software dependencies and package versions carefully controlled throughout the study.
Results
Characteristics of recorded medical conversations
Table 3. Descriptive statistics of the medical speech corpus.
The analysis of speech duration revealed clear differences among the three groups. Doctors produced the shortest speech segments, with a mean length of 2.69 s (SD = 0.76, range: 0.66–20.34 s), indicating relatively concise utterances during consultations. Nurses exhibited longer speech segments, with a mean length of 3.76 s (SD = 1.67, range: 0.18–16.92 s), while patients produced the longest speech segments overall, with a mean duration of 3.97 s (SD = 1.80, range: 0.48–57.66 s). The boxplot illustrates that doctors’ utterances are concentrated around shorter durations with low variability, whereas nurses and patients demonstrate broader distributions. Notably, some patient utterances extended beyond 50 s, highlighting the presence of prolonged and information-rich responses (Table 3 and Figure 2(a)).
Figure 2. Distribution of speech length and word count across groups.
An analysis of word usage patterns revealed additional distinctions among the three groups. Doctors’ utterances contained an average of 4.47 words per segment (SD = 1.51, range: 1–17 words), suggesting that their speech was generally brief and focused. Nurses used longer expressions, averaging 6.18 words per segment (SD = 3.22, range: 1–25 words), whereas patients produced an average of 5.67 words per segment (SD = 3.11, range: 1–73 words), reflecting greater variability in lexical richness. The boxplot demonstrates that patient and nurse utterances tend to be more elaborate and varied compared to doctors’ speech, which remains shorter and more uniform (Table 3 and Figure 2(a)).
Overall, the medical speech corpus demonstrates notable variability in both duration and lexical richness across groups. While doctors generally provide concise instructions and targeted questions, nurses frequently deliver longer context-rich statements, and patients exhibit the highest heterogeneity in both segment length and word usage. These findings highlight the complexity of modeling natural medical dialogues and provide important insights for developing and optimizing ASR systems in telemedicine environments.
Comparison of speaker-specific performance before and after transfer learning
The performance comparison between the baseline Whisper model and the fine-tuned model was conducted by analyzing CER and WER across three speaker groups: nurses, doctors, and patients. The results indicate that the fine-tuned model consistently outperformed the baseline model, achieving lower WER and CER in all speaker groups.
For the nurse group, the baseline model exhibited an average WER of 11.00%, whereas the fine-tuned model achieved a slightly lower WER of 10.89%. Similarly, for the doctor group, the WER decreased from 14.66% in the baseline model to 14.44% in the fine-tuned model. A similar trend was observed in the patient group, where the baseline model’s WER of 22.92% was reduced to 22.42% after fine-tuning (Figure 3(a)).
Figure 3. Comparison of WER and CER between the baseline and fine-tuned models across speaker groups.
When comparing CER between the two models, the fine-tuned model demonstrated an even more pronounced improvement than in WER. In the nurse group, the CER decreased from 2.50% in the baseline model to 2.44% in the fine-tuned model. For the doctor group, the CER was reduced from 3.35% to 3.20% after fine-tuning. Finally, in the patient group, the CER showed the most substantial improvement, dropping from 5.32% in the baseline model to 4.98% in the fine-tuned model (Figure 3(b)).
Table 4. Comparison of representative transcription errors between the base and fine-tuned ASR models on the patient dataset, with corresponding WER and CER.
The WER and CER reductions in the patient group were notably greater compared to the nurse and doctor groups. This trend can be attributed to the unstructured nature of patient speech, which often involves non-standard pronunciation, diverse accents, and informal conversation patterns, making it inherently more challenging for ASR models to transcribe accurately. The results suggest that fine-tuning significantly enhances transcription accuracy, particularly in more variable speech conditions, such as those found in patient conversations.
Speaker-specific error pattern analysis
Table 5. Representative transcription errors by speaker group with corresponding WER and CER.
For the nurse group, error rates were consistently low due to their use of structured, concise, and context-driven language during consultations. Nurses tend to rely on predictable phrasing and medical terminology, resulting in fewer recognition errors. For example, phrases such as “혈압약을 드시고 계신가요? (Are you taking blood pressure medicine?)” were consistently transcribed with high accuracy, reflecting the low variability and standardized expressions used in nursing dialogues (Table 5).
In contrast, patient utterances exhibited substantially higher WER and CER values, largely due to the spontaneous and unstructured nature of their speech. Patients often provided long, fragmented responses, self-corrections, or contextually ambiguous expressions, which significantly increased recognition difficulty. For instance, sentences such as “어제는 안 먹었는데 오늘은 먹었어요. (I didn’t eat it yesterday, but I ate it today.)” frequently resulted in substitutions and boundary shifts, leading to greater transcription variability (Table 5).
This difference suggests that automatic speech recognition systems face fewer challenges when processing structured, context-rich speech from healthcare professionals, while patient speech demands more robust handling of irregular syntax, overlapping clauses, and spontaneous discourse patterns (Table 5). These findings indicate that further model optimization should focus on accommodating the higher linguistic variability present in patient conversations.
Statistical analysis of performance differences before and after transfer learning
To statistically analyze the performance differences between the baseline model and the fine-tuned model, normality tests were first conducted. The Shapiro–Wilk test was used to assess whether the distribution of CER values followed a normal distribution, and Q–Q plot inspection was performed as a supplementary visual check. The Shapiro–Wilk test produced p-values below 0.05 for both models, confirming that neither dataset satisfied the assumption of normality. The Q–Q plot is provided in the Supplementary Material (Supplementary Figure S2).
Table 6. Performance comparison between baseline and fine-tuned models using the Wilcoxon Signed-Rank test.
The Wilcoxon Signed-Rank test results are summarized in Table 6, and the Sign test provided convergent evidence: both analyses indicated statistically significant differences for doctors, patients, and the overall dataset for both WER and CER, while no significant difference was observed for the nurse group (Table 6). Taken together, these results indicate that fine-tuning yielded statistically reliable improvements overall and in doctor and patient speech, with the largest and most stable effects observed at the corpus level and more modest effects within speaker subgroups.
Analysis of loss variations during training
To evaluate the model’s training performance and detect potential underfitting or overfitting, I analyzed the variations in training loss and test loss throughout the training process. During the initial 700 steps, both training and test loss decreased sharply, after which the rate of change became minimal. This indicates that the model had sufficiently learned the patterns in the data, confirming that underfitting did not occur (Figure 4).
Figure 4. Training and test loss curves during supervised fine-tuning of the Whisper model.
As training progressed, a gradual divergence between training and test loss was observed, suggesting a potential risk of overfitting. To mitigate this issue, training was halted at 1,400 steps, a point at which the model had optimized its learning without excessive overfitting (Figure 4).
Error pattern analysis of WER vs. CER discrepancies
To further investigate the difference between WER and CER performance metrics, an error pattern analysis was conducted using patient utterances from the test set. The analysis revealed that, in Korean medical conversations, WER often overestimates transcription errors compared to CER owing to the agglutinative nature of the Korean language. Minor variations such as particle omissions, synonym substitutions, and word order differences frequently cause disproportionately high WER values, even when the overall meaning is accurately preserved.
Table 7. Patterns of discrepancies between WER and CER in patient speech segments from the patient dataset.
These findings indicate that CER provides a more reliable performance metric for evaluating Korean ASR systems, particularly in medical transcription contexts where preserving semantic accuracy is essential. Although WER is highly sensitive to morphological and lexical variations typical of agglutinative languages, CER better reflects the true quality of the transcriptions when the conveyed meaning remains intact.
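To make this divergence concrete, the following sketch (using the jiwer package) scores a constructed particle-omission pair in the spirit of the error patterns described above; the sentences are illustrative and not drawn from the corpus.

```python
import jiwer

reference  = "혈압약을 드시고 계신가요"   # "Are you taking blood pressure medicine?"
hypothesis = "혈압약 드시고 계신가요"     # object particle '을' omitted; meaning preserved

wer = jiwer.wer(reference, hypothesis)  # whole word counted as a substitution -> ~0.33
cer = jiwer.cer(reference, hypothesis)  # single-character deletion -> ~0.08
print(f"WER: {wer:.2f}, CER: {cer:.2f}")
```

In this constructed case, the word-level metric penalizes a full word for a single omitted particle, whereas the character-level metric registers only one character deletion, mirroring the pattern described above.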
Discussion
This study demonstrates that fine-tuning the Whisper large-v3-turbo model improves ASR performance in Korean clinical telemedicine conversations. Leveraging approximately 1,300 hours of domain-specific speech data, the fine-tuned model showed lower WER and CER than the baseline model across speaker groups, with statistically significant paired-test improvements for doctors, patients, and the overall dataset; in contrast, differences for nurse speech were not statistically significant in the paired analyses (Table 6). Patient speech remained the most challenging condition, yet still exhibited statistically significant gains after fine-tuning, supporting the value of domain-specific adaptation for highly variable, unstructured consultation narratives (Table 6).
A key methodological contribution of this work is the validation of CER as a more suitable evaluation metric than WER for Korean ASR. Given the agglutinative structure of the Korean language, WER tends to overestimate transcription errors by penalizing morphological segmentation mismatches or omitted particles, even when the underlying semantic meaning is preserved. These findings underscore that CER more accurately captures clinically meaningful transcription quality, especially for use in medical documentation.
Speaker-specific analyses further revealed systematic differences in transcription difficulty. Doctors’ structured and concise speech resulted in the lowest error rates. Nurses’ utterances, while context-rich, produced only modestly higher error rates. In contrast, patient speech showed considerable variability and yielded the highest WER and CER. Qualitative error analyses suggested that fine-tuning reduced meaning-altering substitutions and omissions in patient utterances and improved robustness to spontaneous, unstructured narratives. Nevertheless, aggregate WER/CER do not fully represent clinical criticality, because errors affecting negation, medications, dosage/frequency expressions, numbers/units, or symptom descriptors may have disproportionate downstream consequences even when overall error rates change modestly.
Recent evidence suggests that clinical ASR error rates should be interpreted in a task- and setting-sensitive manner rather than against a single universal deployment threshold. A recent systematic review reported that word error rates in clinical documentation ranged from 8.7% in highly controlled dictation settings to over 50% in conversational or multi-speaker scenarios, underscoring the large performance differences across documentation contexts.33 Related review evidence also indicates that the clinical value of AI-powered voice-to-text systems should be evaluated not only in terms of transcription accuracy, but also in relation to documentation workflow, quality of care, and the specific outpatient or primary care context in which the technology is deployed.45 Recent benchmark studies have likewise shown substantial variability across languages, accents, and conversational conditions: multilingual medical ASR results demonstrate marked cross-language differences in WER/CER even after domain adaptation,36 spontaneous accented healthcare conversations can incur substantial performance degradation under natural dialogue conditions,46 and Whisper itself has shown accent-related variability across speaker groups.41 Viewed in this context, the present Korean telemedicine results should be interpreted relative to the difficulty of domain-specific conversational clinical speech rather than against a single universal error-rate threshold.
The dataset employed in this study spans a broad range of speaker roles, regional dialects, and recording environments, which may improve ecological relevance for telemedicine-oriented ASR development. However, because the corpus reflects scripted or role-play rather than spontaneous clinical telephone consultations, external validity to routine telemedicine settings should be interpreted cautiously. Real-world calls may involve more variable background noise, channel instability, rapid turn-taking, and overlapping speech, all of which may increase transcription difficulty beyond that observed in the present evaluation. Accordingly, safe deployment will require further validation under spontaneous, real-world telemedicine conditions and should be supported by implementation safeguards such as confidence-based flagging, human review, and monitoring of subgroup-specific error patterns.
The present study did not directly evaluate model performance stratified by gender, age, or dialect, and therefore no definitive conclusions can be drawn regarding subgroup equity. Although Table 1 provides descriptive metadata on these speaker characteristics, the dataset was not originally designed or statistically balanced for reliable subgroup-level ASR benchmarking. Accordingly, simple stratified WER/CER comparisons were not presented, because such analyses could be unstable and potentially misleading if interpreted as evidence of algorithmic bias. Nevertheless, fairness remains a critical consideration for clinical deployment, because even modest recognition disparities may accumulate into unequal documentation quality, communication burden, or clinical follow-up across patient groups. Future work should therefore incorporate prospectively designed subgroup-balanced evaluations and subgroup-specific error auditing to identify and mitigate systematic disparities before routine implementation.
The digital divide is also relevant to equitable deployment. ASR-enabled telemedicine systems may be less accessible to culturally and linguistically diverse communities, including patients who are more comfortable speaking languages other than the dominant clinical language, those who code-switch, and those with limited digital literacy or less stable access to high-quality devices and communication networks. In such settings, speech recognition errors may reflect not only model limitations but also structural inequities in access to linguistically appropriate care and digital infrastructure. Accordingly, future work should examine multilingual and dialect-aware adaptation, as well as the usability of ASR-supported telemedicine for patients from culturally and linguistically diverse backgrounds.
In addition to technical considerations, the implementation of ASR systems in Korean healthcare settings must comply with national regulatory, legal, and ethical standards. Korean-specific statutes such as the Personal Information Protection Act (PIPA) and the Medical Service Act impose strict requirements for data security, privacy, and clinical integrity, beyond those addressed by broader international frameworks such as HIPAA and GDPR.18 Furthermore, the legal status of ASR-generated transcripts as admissible components of electronic medical records (EMRs) remains unresolved. Institutional policies must establish whether such transcripts can serve as formal documentation. To ensure safe integration, ASR systems should include periodic human validation, clinician-in-the-loop auditing, and automated error monitoring to prevent inaccuracies from entering patient care workflows.
Despite these strengths, several limitations must be acknowledged. The patient group, while exhibiting the greatest improvement, also showed the highest error rates, reflecting the complexity of unstructured clinical speech. To confirm generalizability and clinical utility, future research should validate the model on larger and more diverse datasets encompassing broader patient demographics, dialects, and spontaneous speech characteristics.12,44 Enhancing post-processing with NLP-based techniques may improve transcription usability, while speaker-aware modeling and contextual ASR could yield additional performance gains.47–49 Future work should also perform ablation analyses of SpecAugment, additive noise injection, speed perturbation, and amplitude normalization to quantify the individual contribution of each component and to identify the most effective augmentation strategy for real-world telemedicine speech.
Although early stopping at 1,400 steps based on validation loss was effective, further regularization strategies, such as dropout, adaptive learning rate scheduling, and flexible stopping criteria, may improve model convergence and generalization.28,39,40 Evaluating this framework across other professional domains could also provide insights into its broader scalability.50
Only the Whisper large-v3-turbo model was assessed in this study, and no direct head-to-head benchmark was conducted against other Korean ASR solutions or commercial speech recognition services. Accordingly, the present findings should be interpreted as evidence of within-model improvement through domain-specific fine-tuning rather than as proof of superiority over alternative systems. Nevertheless, the observed gains are noteworthy in light of the broader Korean ASR landscape, where commercial platforms such as Naver CLOVA Speech and Google Cloud Speech-to-Text, as well as alternative open-source architectures including Kaldi, Wav2Vec2, and Conformer-based systems, represent relevant comparative baselines for future evaluation.25–27 From a practical perspective, the present results suggest that adapting a strong multilingual foundation ASR model to domain-specific Korean clinical telephone speech may offer a viable alternative to general-purpose speech recognition pipelines. However, because commercial and open-source ASR systems differ in training data, adaptation interfaces, decoding strategies, and deployment settings, meaningful comparison requires controlled benchmarking on the same corpus and evaluation protocol. Future work should therefore benchmark fine-tuned Whisper-based clinical ASR against these systems under matched evaluation conditions to better establish comparative performance, implementation trade-offs, and clinical utility. Subgroup-specific evaluation—including variations across dialects, age groups, and environments—remains essential to mitigate bias and ensure equitable deployment.50–52
Finally, although WER and CER are standard, they may not fully capture semantic or clinical adequacy. Future work should incorporate semantic-level metrics such as BERTScore and task-specific usability measures,53,54 while also addressing clinically important deployment challenges such as multi-speaker diarization, overlapping speech, accent and dialect variability, and speech from vulnerable populations, including older adults and those with dysarthria.52 Such evaluation will be particularly important for determining whether ASR systems can be implemented safely and equitably across diverse telemedicine populations.
Conclusions
This study demonstrated that fine-tuning the Whisper large-v3-turbo model on 1,300 hours of Korean clinical speech significantly improved transcription accuracy, with the most substantial gains in patient speech. The validation of CER over WER provides a methodological advancement for Korean ASR evaluation. These findings support the feasibility of domain-specific adaptation and establish a foundation for scalable and equitable ASR deployment in clinical settings.
Acknowledgements
I thank Jong-In Yun for assistance with experiments.
Ethical considerations
Since this study utilized publicly available, de-identified datasets and involved no direct participant contact or new data collection, it was exempted from IRB review per the institutional guidelines of Gyeongsang National University (GIRB-D25-NX-0117). Written informed consent was obtained by the original data collectors during dataset creation as described in the provider’s documentation; no additional consent was obtained by the authors for this secondary analysis.
Author contributions
Conceptualization: WY, Data curation: WY, Formal analysis: WY, Funding acquisition: WY, Investigation: WY, Methodology: WY, Project administration: WY, Resources: WY, Supervision: WY, Validation: WY, Visualization: WY, Writing – original draft: WY, Writing – review & editing: WY.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the New Faculty Research Grant from Gyeongsang National University in 2025, GNU-NFRG-0085.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
AI tool disclosure
No AI tools were used in the development, writing, or editing of this manuscript.
Supplemental material
Supplemental material for this article is available online.
