Abstract
While the automated assessment of oral reading fluency (ORF) using accuracy and speech rate has proliferated, expressiveness of speech, as measured by prosodic features, has been neglected due to its inherent complexity and a lack of technological resources. Despite the potential benefits of burgeoning technology for assessing hard-to-measure constructs such as ORF, insensitivity to linguistic diversity threatens valid score interpretations and fair use for all learners. The present study investigated the potential benefits of developing an automated, prosody-inclusive ORF assessment in a post-secondary education setting involving many English language learners (ELLs). The analysis focused on three ways the inclusion of prosody may improve automated ORF assessments: by reducing bias against ELLs, by improving the prediction of reading comprehension, and by providing richer diagnostic information. Data were analyzed by comparing two scoring outcomes, the traditional ORF measure and a new prosody-inclusive score, across language backgrounds. Results showed that the inclusion of prosody improves automated ORF assessment: it reduces discrepancies between ELLs and English first language students caused by automated speech recognition inaccuracies, leads to better prediction of reading comprehension for ELLs, and provides meaningful diagnostic information. Detailed descriptions of the models, their relevance, and implications for the language testing community are discussed.
Introduction
Artificial intelligence (AI) is transforming the fields of language education and assessment, offering unprecedented opportunities to redefine and enhance both how language is learned and how it is assessed (Chapelle & Chung, 2010; Nakatsuhara & Berry, 2021; Xi, 2023). Technological advances facilitate the assessment of complex language abilities such as prosody, which has traditionally been underrepresented in oral reading fluency (ORF) assessment. This underrepresentation stems from multiple challenges, including the theoretical complexity of defining and operationalizing prosody as a construct that encompasses features of expression, psychometric challenges in reliably measuring it, and the practical difficulties associated with the resource-intensive nature of manual prosody assessment (Chaudhry & Kazim, 2022). While technological advancements address some of these long-standing challenges, they also bring new and complex concerns. As Nakatsuhara and Berry (2021) caution,
while these innovations have opened the door to types of speaking test tasks which were previously not possible . . . it should be kept in mind that each of the affordances offered by technology also raises a new set of issues to be tackled (p. 343).
For example, though AI-driven technologies such as automated speech recognition and natural language processing enable the analysis of speech features (e.g., rhythm, intonation, stress), they also raise concerns about transparency and fairness due to potential AI biases against accented speech (Hannah et al., 2022). Furthermore, the opacity of AI scoring processes can undermine trust as educators and students may find it difficult to understand how scores are generated or interpret what they represent. To address these challenges, the field of language assessment needs to balance the benefits of AI-driven assessments with mindful consideration of their potential pitfalls to ensure innovative and ethical practices.
The present study investigates the integration of prosody in automated ORF assessment, focusing on its potential to improve the construct representation and psychometric rigor. Prosody in oral reading is defined as the ability to read text “with appropriate expression or intonation coupled with phrasing that allows for the maintenance of meaning” (Kuhn et al., 2010, p. 235). While automated assessment of accuracy and speech rate in ORF has advanced significantly (Peng et al., 2022; White et al., 2021), the expressiveness of speech, as measured by prosodic features, is often neglected in both traditional and automated ORF assessment due to its complexity (Washburn, 2022). This is an oversight that has been a concern for over two decades (Kim et al., 2021; Kuhn & Schwanenflugel, 2019; Kuhn & Stahl, 2003; Schwanenflugel et al., 2004; Schwanenflugel & Kuhn, 2015).
Additionally, research in ORF assessment has predominantly focused on primary grade students (Grabe, 2010; Jiang, 2016; Jiang et al., 2012) reading in their first language (Chang, 2019; Khor et al., 2014; Prakash & Kurian, 2019) while there is limited research on ORF assessment in English language learning (ELL) contexts (Yang, 2021). This limitation is further complicated by a potential reduction in the accuracy of automated ORF assessment relying on speech-to-text technology for ELL students compared to English first language students (EL1s; Chen et al., 2018; Hannah et al., 2022; Mirzaei et al., 2015). The purpose of the present study was to examine the extent to which a new automated ORF scoring method, incorporating prosody, differs from the conventional scoring approach relying on accuracy and rate metrics. Specifically, the study was guided by two primary objectives: (1) to evaluate whether the inclusion of prosody in the scoring method reduces discrepancies in ORF scores between EL1 and ELL students and (2) to determine whether incorporating prosody enhances the predictive and diagnostic efficacy of ORF assessments.
Literature review
Automated oral reading fluency assessment
Oral reading fluency (ORF) is defined as the ability to read text aloud with accuracy, speed, and proper expression (Kennedy Shriver, 2010). It is a well-established, core skill that is highly predictive of later reading comprehension, especially among young children (Morrison & Wilcox, 2020). The National Assessment of Educational Progress (NAEP), a federally mandated US education progress monitoring organization, recognizes ORF assessment as a key tool for evaluating students’ progress toward meeting state reading standards (White et al., 2021). Accuracy refers to the ability to correctly decode and pronounce the words in a given text. It is foundational to fluency, as errors such as mispronunciations, skipped words, or substitutions can disrupt comprehension. In assessment, accuracy is typically measured as the percentage of words read correctly out of the total number of words (Hudson et al., 2020). Readers may struggle with accuracy when texts contain unfamiliar vocabulary (Samuels, 2006). Speed, or reading rate, reflects the rate at which a reader processes and verbalizes text. It demonstrates automaticity, the ability to recognize words effortlessly without pausing for conscious decoding. Fluent readers maintain a steady, appropriate pace that allows for efficient reading without sacrificing understanding. Reading speed is often quantified in words correct per minute (WCPM), with benchmarks tailored to specific proficiency levels or age groups (Hasbrouck & Tindal, 2017). Proper expression, largely corresponding to speech prosody, refers to reading with rhythm, intonation, pitch, and stress that align with the meaning and structure of the text (Kuhn et al., 2010). Prosody allows readers to convey emotions, tone, and intent, making the text engaging and meaningful for listeners.
Unlike accuracy and speed, prosody is more challenging to measure quantitatively, leading to its relative neglect in traditional ORF assessment (Kuhn & Stahl, 2003). Prosody is less often overlooked in assessments that target the construct of speaking (as opposed to the construct of reading). Table 1 presents some common assessments that include an oral reading task, with the scoring protocol, target construct, and prosody inclusion indicated. The table illustrates the divide between assessments in language learning contexts (spanning many age groups) that measure speaking ability and assessments in the K-12 context that measure reading ability; prosody is commonly included only in the former. Despite the use of prosody in speaking assessment, there remains a dearth of automated reading assessments designed to measure the full breadth of the ORF construct, inclusive of prosody, especially in language learning contexts.
Table 1. Assessments that include an oral reading task.
Note: PTE = Pearson Test of English, DET = Duolingo English Test, TOEIC = Test of English for International Communication, VEPT = Versant English Placement Test, DIBELS = Dynamic Indicators of Basic Early Literacy Skills, FLORA = Fluent Oral Reading Assessment.
In reading assessment contexts, where prosody is often not included, automaticity is measured using words correct per minute (WCPM), which calculates the average number of correctly spoken words in a 60-second reading interval. The correctness of the words is determined based on whether the spoken word matches the read word, as determined by the listener (either a human or machine, depending on the scoring protocol). WCPM grade norms have been established for grades 3–8 (Hasbrouck & Tindal, 2017) and post-secondary students (Jiang et al., 2012; Rasinski et al., 2022), making it the standard ORF outcome measure.
While WCPM is a robust predictor of comprehension across various developmental stages, including young students (Hudson et al., 2020), adolescents (Washburn, 2022), and post-secondary students (Jiang et al., 2012), researchers have raised concerns about its omission of prosody (Kuhn & Schwanenflugel, 2019; Kuhn & Stahl, 2003; Schwanenflugel et al., 2004; Schwanenflugel & Kuhn, 2015). It is argued that comprehension is enhanced when readers not only read but also incorporate the proper phrasal and syntactical structures intended by the author. This is because meaning is conveyed not solely through individual word forms but also through their arrangement and relationships within the broader text (Morrison & Wilcox, 2020).
Consequently, when ORF is assessed without considering prosody, the construct is not effectively represented, potentially weakening its relationship to comprehension (Kuhn & Stahl, 2003). From an assessment perspective, over-reliance on WCPM can also obscure distinct reader profiles (Schwanenflugel & Kuhn, 2015) and may inadvertently encourage instructional practices that prioritize reading speed over comprehension (Kuhn & Schwanenflugel, 2019). Additionally, although WCPM is intended to encompass both accuracy and rate, research by Valencia et al. (2010) indicates that it predominantly measures reading rate (r = .99) rather than accuracy (r = .43).
The automation of ORF assessment is rapidly becoming the standard (Peng et al., 2022; White et al., 2021). For instance, NAEP’s scoring approach uses a speech-to-text system that transcribes students’ oral reading, subsequently evaluating accuracy and rate, while prosody continues to be assessed by human assessors. This approach has been validated for both EL1 and Spanish L1 speakers, demonstrating sufficient human-machine reliability in that sample (Balogh et al., 2012). Several research initiatives (Bailly et al., 2022; Bolaños et al., 2013; Molenaar et al., 2023; Proença et al., 2017) and industry efforts (Kelly et al., 2020; Peng et al., 2022) are also focused on developing automated ORF scoring models. These efforts have primarily focused on English L1 primary grade students and, in some cases, have included the automated scoring of prosody.
Automated prosody assessment
Recent advances have enriched our understanding of prosody’s role in language processing, particularly its communicative functions and relationships with other speech aspects (Kalathottukaren et al., 2015). However, challenges persist in precisely identifying prosodic features (Peppé, 2009) and developing tools to systematically assess prosodic skills across different age groups (Crystal, 2009). Theoretical frameworks explain prosody through phonetic (pitch, duration, stress), physical (fundamental frequency, syllable duration, intensity), and phonological (variations in pitch, length, loudness) correlates. These features extend the meaning of spoken language beyond mere words and grammatical structures (Kalathottukaren et al., 2015; Roach, 2000).
Prosodic impairments are notably evident in children with autism spectrum disorder (ASD), who exhibit distinctive differences in acoustic parameters (Diehl & Paul, 2013). Similarly, individuals with apraxia of speech exhibit pronounced inaccuracies in acoustic features, including inappropriate pitch assignments, uneven syllable timing, varied speech rates, and uniform loudness (Barry, 1995). These prosodic inaccuracies can significantly diminish speech intelligibility (Chin et al., 2012).
The communicative functions of prosody are categorized into grammatical, affective, and pragmatic roles. Grammatically, prosody helps delineate the boundaries of phrases, clauses, or sentences, and can differentiate between word classes in the case of homonyms. Affectively, it expresses the speaker’s emotions and attitudes, where emotions like happiness may be conveyed through a quick speech rate, rising pitch, high pitch variability, and rapid voice onsets. Pragmatically, prosody manages discourse, signaling to listeners when a speaker has concluded their thoughts or expects a particular response. Effective social interactions, therefore, rely on accurate perception and production of these varied communicative functions of prosody.
Despite the considerable attention that prosody has received in speech pathology and communication disorders, it remains underexplored in the field of language assessment, where both its teaching and assessment are quite challenging (Levis, 2023). In read-aloud tasks, prosody is typically rated by humans using rubrics or, more recently, by machines trained to predict human scores. Several prosody rubrics have been developed, including the NAEP scale (Daane et al., 2005), the multi-dimensional fluency scale (MDFS; Rasinski et al., 2009), and the comprehensive oral reading fluency scale (CORFS; Benjamin et al., 2013). Bolaños et al. (2013) pioneered automated expressive (i.e., prosody-inclusive) ORF assessment by building a machine learning model trained to predict a unidimensional, human-generated prosody score, using 19 lexical and prosodic features. Subsequent models developed in research contexts have been published by Proença et al. (2017) and Molenaar et al. (2023), who have continued to develop novel approaches, utilizing additional features and techniques. These prosody or expressive reading fluency models extract features from the sound input using speech processing software such as Praat (Boersma & van Heuven, 2001) and then train machine learning models to predict human scores which are based on one of the prosody rubrics described above.
ORF assessment with additional language learners
Some research has found that the relationship between ORF and reading comprehension extends to ELL students in both K-12 (Marrs et al., 2022; Newell et al., 2020) and post-secondary contexts (Jiang et al., 2012; Yang, 2021). However, ELL students experience ORF assessments differently from EL1 students, leading to pronounced differences in score interpretations. For example, Marrs et al. (2022) found that ELLs are disproportionately placed in special education streams based partially on poor ORF performance, arguing that the ELL groups’ scores primarily reflect a language proficiency issue, not a reading ability issue. This suggests potential shortcomings for ORF assessments with ELL students, a point also made by a systematic review of ORF as a screening tool for ELLs (Newell et al., 2020), which concluded that the validity evidence is weaker for ELLs. Lems (2012) further found that ELL students’ ORF scores were lower than those of EL1 students, even when their reading comprehension scores were equivalent, a gap that widened with increasing differences between students’ primary language orthography and English. Additionally, Aldhanhani and Abu-Ayyash (2020) found that ELL students are less inclined to practice reading aloud, which they linked to pronunciation-related social anxiety, further complicating ORF assessments for ELLs.
When ORF is machine-scored, automated speech recognition (ASR) models are used to convert the speech to text, and scores are then calculated by comparing the ASR transcript to the read text. Automated ORF assessments, when used with ELL students, raise an important bias-related issue: ASR models consistently show lower accuracy for these groups, as identified by comparing human transcripts to ASR transcripts (Derwing et al., 2000). For instance, Chen et al. (2018), studying the SpeechRater scoring engine (Zechner & Loukina, 2020), found that error rates were 5%–15% higher for nonnative speech than for native speech. Another study, comparing ASR error rates across language backgrounds in both predictable (reading) and unpredictable (describing a picture) speech contexts, found average error rates of 19.5% for EL1s, 23.2% for students with other Indo-European language backgrounds, and 26.9% for those with non-Indo-European language backgrounds (Hannah et al., 2022). Mohyuddin and Kwak (2023) found similar results comparing groups with six different language backgrounds to an English language background group across six different ASR models; the English language background group had the highest accuracy in all cases. We characterize the reduced ASR accuracy for L2 groups as bias because the accuracy of the transcription is directly related to the outcome scores for ORF assessments and not related to the target construct (reading fluency). Rather, the accuracy of ASR is determined largely by the language backgrounds present in ASR training data (Isaacs, 2017). Given that ASR accuracy is directly related to automated ORF scores, the differential accuracy with L2 populations presents serious threats to the validity and fairness of automated scoring. The present study seeks to investigate these threats and better understand how they may be mitigated by the inclusion of prosody as a core sub-construct within a traditional ORF assessment.
Present study
The present study investigated the extent to which a new ORF scoring method, which incorporates prosody (described below), differs from the traditional scoring method relying on accuracy and rate alone. Specifically, we focused on examining differences in these two automated scoring methods between EL1 and ELL students in post-secondary university settings. Two lines of inquiry guided the research questions below: (1) whether the inclusion of prosody can reduce scoring bias (first two research questions) and (2) whether the inclusion of prosody can improve the predictive and diagnostic efficacy of ORF assessment (last three research questions). The research questions are:
How do human and ASR transcription methods affect the difference between EL1 and ELL ORF scores, using the traditional scoring method?
How do the existing and new ORF scores differ between EL1 and ELL students?
How do the existing and new ORF scores predict reading comprehension differently across language backgrounds?
How do the existing and new ORF scores predict reading comprehension across language backgrounds, considering comprehension ability levels?
What are the characteristics of ORF diagnostic profiles based on the new ORF component scores?
Methods
Participants
Data were collected from APLUS (Academic Platform for Languages in University Settings), a digital learning-oriented language assessment platform at a large Canadian university. APLUS is designed to support the development of academic communication skills among first-year undergraduate and graduate students. Summer bridge programs and foundational core courses have adopted it, assigning 5% of the course grade to its completion. When students complete the APLUS tasks, they receive a certificate that they submit to their instructors to indicate completion. The scores derived from APLUS are primarily used by the students to better understand their own language abilities in relation to the university’s expectations. A total of 807 students had used APLUS with varying degrees of task completion. Table 2 summarizes the sample characteristics of the study participants in terms of gender, educational degree, and language background (EL1s and ELLs).
Table 2. Sample characteristics on gender, level, and language backgrounds.
Note: Other includes 16 other languages, each of which only one student indicated as their first language.
Tasks
The present study used the reading comprehension and read-aloud tasks included in APLUS. In the reading comprehension task, students read a short (1685-word) popular science article (Murray, 2021) focusing on gender equity and innovation and responded to 21 reading comprehension questions. Students were instructed to take notes focusing on the author’s key points and supporting details and were also asked to practice time management by setting a timer for their own reading. The reading comprehension test was designed to elicit three core comprehension skills: explicit, inferential, and global comprehension. Internal consistency, measured using the Kuder–Richardson reliability estimate (equivalent to Cronbach’s alpha for dichotomous items), was α = .86 and α = .83 for ELL and EL1 students, respectively.
The read-aloud task asked students to read a challenging 138-word excerpt from the Murray article (described above) and instructed them to read “naturally and clearly.” The read-aloud task participants included 105 EL1s and 307 ELLs. Students completed APLUS tasks on their own time, so the study had no control over the acoustic environment or recording equipment used, nor was any proctoring employed. Students’ speech inputs were automatically saved and processed using OpenAI’s open-source Whisper ASR model for automatic transcription (Radford et al., 2022). Whisper demonstrates robust transcription capabilities, achieving an average text-normalized word error rate (WER) of 12.8% across 15 standard datasets, which is among the best performance metrics of any ASR model to date (Radford et al., 2022). Comparing the results of Radford et al. (2022) to Ferraro et al. (2023), who benchmarked the performance of many paid and open-source ASR platforms, Whisper performs better than average and outperforms the three major paid platforms of Amazon, Google, and Microsoft. Importantly, Whisper’s training data includes 680,000 hours of audio, with 117,000 hours covering 96 languages besides English (Radford et al., 2022). However, there are indications that the Whisper model transcribes North American English dialects more accurately than others (Calbert & Roll, 2023).
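To make the transcription step concrete, the following is a minimal sketch of how a recording could be passed through Whisper. The file path and model size are illustrative assumptions, as the study does not report which Whisper variant was used.

```python
# Minimal sketch: transcribing a read-aloud recording with OpenAI's Whisper.
# The audio path and model size are illustrative assumptions; the study does
# not report which Whisper variant ("tiny" ... "large") was used.
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("student_readaloud.wav", language="en")
transcript = result["text"].strip()
print(transcript)
```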
Analysis
Scoring
Reading comprehension test scores were derived from a 2-parameter logistic (2PL) item-response theory (IRT) model using a maximum likelihood estimator (Bock & Aitkin, 1981; Lord, 1980). For the read-aloud task data, five different scores were estimated: speed, accuracy, existing ORF score (WCPM), prosody, and a new ORF score. Speed was calculated by subtracting each student’s reading time from the maximum time any reader took to read the Murray excerpt. Accuracy was calculated as 1 − WER (word error rate), where WER is the percentage of words read incorrectly based on either a human or ASR-derived transcript. To calculate the percentage of words read incorrectly, the sum of miscues (insertions, substitutions, and deletions) was divided by the total words in the read text. An NLP algorithm was used to parse each transcript, compare it to the original reference text, identify each incorrect word, and return the number of incorrect words as a percentage of the total words read. The existing ORF score (WCPM) combines speed and accuracy: it is calculated by dividing the raw number of correctly spoken words by the reading time in seconds and multiplying the result by 60. Finally, the new ORF score was calculated for the purpose of this study by combining speed, accuracy, and prosody.
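As an illustration of this scoring logic, the sketch below computes WER and WCPM from a reference text and a transcript. It uses a standard Levenshtein (edit-distance) alignment in place of the study’s unspecified NLP parsing algorithm, so it is an approximation rather than the study’s exact implementation.

```python
# Sketch of the accuracy and WCPM calculations described above, using a
# standard Levenshtein alignment (an assumption; the study's exact NLP
# alignment algorithm is not specified).
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Miscues (insertions + substitutions + deletions) / words in the read text."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution/match
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

def wcpm(reference: str, hypothesis: str, reading_time_s: float) -> float:
    """Words correct per minute: correct words / reading time in seconds * 60.
    Correct words are approximated here from 1 - WER."""
    correct = len(reference.split()) * (1 - word_error_rate(reference, hypothesis))
    return correct / reading_time_s * 60
```

For example, a 138-word passage read perfectly in 69 seconds yields 138 / 69 × 60 = 120 WCPM.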
Table 3. Descriptive statistics for ORF variables.
Note: WER = Word Error Rate; ASR = Automated Speech Recognition; WCPM = Words Correct per Minute.
Prosody ML model
Prosody scores were derived from a machine learning model trained to predict human scores. Seven human raters, trained using a rubric based on the multi-dimensional fluency scale (MDFS; Rasinski, 2004), assessed four dimensions of reading prosody: expression and volume, phrasing, smoothness, and pace. All raters were graduate students in education; three were EL1s, and four were ELLs with Mandarin as their first language. Two rater training sessions were held in which raters closely reviewed the rubric and independently scored a set of 15 speech samples, followed by discussions of challenges, confusion, and scoring discrepancies. The raters then scored 206 samples for model training, with each sample being scored by four randomly assigned raters. The final scores were summed into a single prosody score, as the machine learning model was better able to predict a single, combined score than four trait scores. Inter-rater reliability was evaluated using the intra-class correlation coefficient (ICC), with high values ranging from .957 to .964. Score discrepancies were resolved by expert raters who had not been involved in the original scoring.
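A minimal sketch of the ICC computation is shown below, using the pingouin package with hypothetical column names for a long-format ratings table; for simplicity it treats the design as fully crossed, whereas the study assigned four random raters per sample.

```python
# Sketch: inter-rater reliability (ICC) for the human prosody ratings,
# computed with the pingouin package. Column names are hypothetical, and a
# fully crossed rating design is assumed for simplicity.
import pandas as pd
import pingouin as pg

ratings = pd.read_csv("prosody_ratings_long.csv")  # columns: sample_id, rater, prosody_score
icc = pg.intraclass_corr(data=ratings, targets="sample_id",
                         raters="rater", ratings="prosody_score")
print(icc[["Type", "ICC", "CI95%"]])
```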
The prosody ML model was built using an initial set of 63 prosodic and lexical features derived from the acoustic properties of each student’s audio recording in combination with grammatical reference points in the ASR transcript. During model training, some features were not retained, resulting in a final model that included 47 features. The present study’s original 63 features were built upon previous research, incorporating many acoustic properties referenced in previous studies as well as new features explored here, such as the number and range of relative decibels and pitch peaks. Previous research into machine-derived prosodic features highlighted two main groupings: pitch- and pausing-related features (Benjamin et al., 2013; Kim et al., 2021). The present model added two additional groupings: speech rate and volume (containing 15 of the 63 total features). All retained features, their groupings, and descriptions are detailed in Table 4.
Table 4. Prosodic features included in the prosody ML model.
Early work by Schwanenflugel et al. (2004) identified the core acoustic properties of prosodic reading used in this study, such as less variable intersentential pauses, more variable fundamental frequency (F0, also known as pitch), consistent demarcation of sentence boundaries, and a falling pitch at the end of sentences, all of which were associated with high-skill readers. Subsequent research expanded these early findings, introducing features like the ratio between the average pitch of lexically stressed syllables and unstressed syllables (Bolaños et al., 2013) and the number of vocalic nuclei per minute (Bailly et al., 2022).
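As an illustration of how such features can be extracted, the sketch below uses parselmouth, a Python interface to Praat, to compute two pitch features and a crude pause count. The silence threshold and file path are assumptions, and the study’s exact feature definitions may differ.

```python
# Sketch: extracting a few pitch- and pausing-related features with
# parselmouth (a Python interface to Praat). The 45 dB silence threshold and
# the file path are assumptions; the study's exact feature set differs.
import numpy as np
import parselmouth

snd = parselmouth.Sound("student_readaloud.wav")

# Fundamental frequency (F0) track; unvoiced frames are returned as 0.
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]  # keep voiced frames only
f0_mean, f0_sd = f0.mean(), f0.std()  # more variable F0 ~ more expressive reading

# Crude pause count from the intensity contour: frames quieter than the
# assumed threshold are treated as silence.
silent = snd.to_intensity().values[0] < 45.0
n_pauses = int(np.sum(np.diff(silent.astype(int)) == 1))

print(f"F0 mean = {f0_mean:.1f} Hz, F0 SD = {f0_sd:.1f} Hz, pauses = {n_pauses}")
```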
To identify the best-performing ML model, we tested four common ML regression algorithms: extreme gradient boosting (XGBOOST; Chen & Guestrin, 2016), random forest regression (Ho, 1995), ridge regression, and lasso regression (Hastie et al., 2009). Tree-based ML models (e.g., random forests) combine many if-then-else decision trees learned from the training data, comparing final decisions to the ground truth (human scores). Gradient-boosted decision trees (e.g., XGBOOST) instead build an ensemble of shallow decision trees iteratively, with each new tree trained to correct the residual errors of the trees before it. XGBOOST, a scalable tree-based regression model, performed the best and was retained.
All 63 features were normalized and included in all models initially. Features not contributing independent information to the model were removed, as described in Hastie et al. (2009). Models and feature combinations were tested using 10-fold cross-validation, aiming to minimize the mean absolute percentage error (MAPE) in the testing sample. The MAPE statistic is the average absolute difference between the predicted and reference scores, expressed as a percentage of the reference score. The best model (XGBOOST) and the best feature combination (indicated in Table 4) achieved a MAPE of 0.198. All features were extracted with Praat (Boersma & van Heuven, 2001), and models were built and run in Python (version 3.7) using the scikit-learn toolkit (version 0.22).
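The model-selection loop might look like the following sketch, in which the four regressors are compared via 10-fold cross-validated MAPE. The hyperparameters are library defaults rather than the study’s tuned values, and load_feature_matrix is a hypothetical helper.

```python
# Sketch of the model comparison: four regressors evaluated with 10-fold
# cross-validation on MAPE. Hyperparameters are defaults, not the study's
# tuned values; load_feature_matrix is a hypothetical helper.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def mape(y_true, y_pred):
    # Average absolute error as a proportion of the reference (human) score.
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

X, y = load_feature_matrix()  # hypothetical: (n, 63) normalized features, summed human scores
scorer = make_scorer(mape, greater_is_better=False)
models = {"xgboost": XGBRegressor(), "random_forest": RandomForestRegressor(),
          "ridge": Ridge(), "lasso": Lasso()}
for name, model in models.items():
    folds = KFold(n_splits=10, shuffle=True, random_state=0)
    scores = -cross_val_score(model, X, y, scoring=scorer, cv=folds)
    print(f"{name}: mean MAPE = {scores.mean():.3f}")
```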
Analysis by research questions
RQ1 (asking how ASR accuracy affects EL1 and ELL ORF scores) was answered by comparing the existing ORF scores between EL1 and ELL students when derived from both the ASR and human transcripts. Using the human transcripts as the “ground truth,” RQ1 therefore identifies the amount of discrepancy added to ORF scores by the ASR system and compares the standardized difference (d) between the EL1 and ELL groups in both conditions. For RQ2 (asking how EL1 and ELL groups differ between the two ORF scores), an independent samples t-test was conducted to examine group differences in: (1) the existing WCPM-based method and (2) the new prosody-inclusive method. RQ3 (asking how the prediction of reading comprehension differs between the two ORF scores) compared the ability of the existing and new ORF scores to predict reading comprehension between language groups using two regression models. Model 1 was a hierarchical regression model which included each of the independent components of ORF (speed, accuracy, and prosody), language background, and interactions between each ORF component and language background. This model aimed to assess the strength of the independent relationships between traditional ORF variables and reading comprehension and then examine the added predictive relationship of prosody, controlling for language background. Model 2 included the existing ORF score and the new ORF score as independent variables, along with language background and interaction terms. This model examined the relative strength of each ORF score after controlling for the other and for language background. Both models for RQ3 set ELLs as the reference group, as they were the larger sample and the primary group of interest for this study. Multicollinearity was assessed for each RQ3 model by examining the variance inflation factor (VIF), with a cutoff of 5 (James et al., 2013).
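Although these models were run in R, the following statsmodels sketch shows the shape of Model 1 and the VIF check; the data file and column names are hypothetical.

```python
# Sketch of RQ3's Model 1 (hierarchical regression with interactions) and
# the VIF check. The study used R; this Python translation is illustrative,
# and the column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("orf_scores.csv")  # columns: comprehension, speed, accuracy, prosody, lang
# Step 1: speed and accuracy; Step 2 adds prosody. ELLs are the reference group.
step1 = smf.ols("comprehension ~ (speed + accuracy) * C(lang, Treatment('ELL'))", df).fit()
step2 = smf.ols("comprehension ~ (speed + accuracy + prosody) * C(lang, Treatment('ELL'))", df).fit()
print(step1.rsquared_adj, step2.rsquared_adj)

# Variance inflation factors for each predictor column (cutoff of 5).
exog = step2.model.exog
vifs = {name: variance_inflation_factor(exog, i)
        for i, name in enumerate(step2.model.exog_names) if name != "Intercept"}
print(vifs)
```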
For RQ4, we tested Model 2 described from RQ3 using quantile regression analysis to determine if predictive relationships vary across different comprehension levels. Unlike ordinary least squares (OLS) regression, which estimates the mean of the dependent variable based on the independent predictor variables, quantile regression estimates the conditional median or other quantiles of the dependent variable (Koenker & Bassett, 1978). This analytic approach is particularly useful for investigating the predictive relationship of independent variables across different score points in the distribution of the dependent variable. We analyzed the 10th, 25th, 50th, 75th, and 90th quantiles of the dependent reading comprehension variable.
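A sketch of the corresponding quantile regressions follows; the study used R’s quantreg, so statsmodels’ QuantReg serves here as a Python analogue, and the column names are hypothetical.

```python
# Sketch of the RQ4 quantile regressions at the five quantiles analyzed.
# The study used R's quantreg; statsmodels is a Python analogue, and the
# column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("orf_scores.csv")  # columns: comprehension, orf_existing, orf_new, lang
for q in (0.10, 0.25, 0.50, 0.75, 0.90):
    fit = smf.quantreg("comprehension ~ orf_existing + orf_new + C(lang)", df).fit(q=q)
    coefs = fit.params[["orf_existing", "orf_new"]].round(3)
    print(f"quantile {q:.2f}:", coefs.to_dict())
```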
For RQ5 (asking about the characteristics of ORF diagnostic profiles), we conducted latent profile analysis (LPA) to identify distinct reading fluency profiles based on three ORF variables: speed, accuracy, and prosody, which were each standardized to the same scale. LPA is an analytic technique that estimates a categorical latent variable identifying subpopulations based on each individual’s level on a set of continuous variables. The optimal number of latent profiles was determined based on Spurk et al. (2020), by using a combination of content-related considerations and statistical fit values: bootstrap likelihood ratio test (BLRT), Bayesian information criterion (BIC), Akaike information criterion (AIC), Kullback information criterion (KIC), and Entropy.
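The profile enumeration could be approximated as below with scikit-learn’s GaussianMixture, a common stand-in for LPA. The study used R’s tidyLPA; BLRT, KIC, and entropy are not reproduced in this sketch, and load_orf_components is a hypothetical helper.

```python
# Sketch: enumerating candidate latent profile solutions with a Gaussian
# mixture model (diagonal covariances), a common stand-in for LPA. BLRT,
# KIC, and entropy from tidyLPA are not computed here, and
# load_orf_components is a hypothetical helper.
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X_orf = load_orf_components()  # hypothetical: (n, 3) array of speed, accuracy, prosody
Z = StandardScaler().fit_transform(X_orf)

for k in range(3, 11):
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0).fit(Z)
    print(f"{k} profiles: BIC = {gmm.bic(Z):.1f}, AIC = {gmm.aic(Z):.1f}")

# Profile membership for the selected six-profile solution:
profiles = GaussianMixture(n_components=6, covariance_type="diag", random_state=0).fit_predict(Z)
```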
All analyses were conducted in R (version 3.6.1) using open-source packages: mirt for IRT models (Chalmers, 2012), psych for RQ1, RQ2, and RQ3 (Revelle, 2022), quantreg for RQ4 (Koenker et al., 2023), and tidyLPA for RQ5 (Rosenberg et al., 2019).
Results
Differences in ASR and human transcription accuracy between ELL and EL1 students
The mean difference between ELL and EL1 WCPM scores when using an ASR transcript was 45 points (d = 2.33, t = 20.59***), while the difference was 38 points when using a human transcript (d = 1.97, t = 8.69***). Both differences are large, but the human transcript difference is 7 points smaller, indicating that about 16% of the difference between ELL and EL1 scores in the current sample can be attributed to reduced ASR accuracy with ELL students. Table 3 shows the means and standard deviations for WCPM scores using both human and ASR transcription for both the ELL and EL1 groups.
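The standardized differences reported here can be computed as a pooled-SD Cohen’s d alongside an independent-samples t-test, as in the sketch below; the score arrays are hypothetical.

```python
# Sketch: the group comparison behind RQ1, computing a pooled-SD Cohen's d
# and an independent-samples t-test. load_group_scores is a hypothetical
# helper returning WCPM arrays for each group.
import numpy as np
from scipy import stats

def cohens_d(a, b):
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

el1_wcpm, ell_wcpm = load_group_scores()  # hypothetical arrays of WCPM scores
t_stat, p_value = stats.ttest_ind(el1_wcpm, ell_wcpm)
print(f"d = {cohens_d(el1_wcpm, ell_wcpm):.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```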
Differences in existing and new ORF scores between ELL and EL1 students
Two ORF scoring methods were compared: the existing ORF scoring approach accounting for accuracy and speed and the new ORF scoring method adding prosody. As shown in Table 5, the standardized group difference between EL1 and ELL groups’ existing ORF scores is d = 2.33 (t = 20.59***) compared to a difference of d = 1.64 (t = 13.14***) with the new ORF score. Both group differences are large, which can be seen clearly in Figure 1, but the new ORF score difference between these groups is much smaller.
Table 5. Comparison of ORF scores using different methods for ELL and EL1 students.
*p < .05, **p < .01, ***p < .001.
Figure 1. Boxplots comparing ORF scoring methods for language groups.
Difference in predictive relationships between ORF scoring methods and reading comprehension ability
The results from hierarchical multiple regression analyses are presented in Table 6. The result from Model 1 showed that without prosody, both speed and accuracy significantly predicted reading comprehension (β = 0.203** and 0.212**, respectively), and that there was no statistically significant group difference in reading ability (β = 0.101), though the EL1 group’s mean was higher (0.28 and −0.07 for EL1s and ELLs, respectively). When prosody was added, only accuracy and prosody remained significant (β = 0.194* and 0.295***, respectively). Additionally, no significant interaction effects were observed, indicating that the relationship between ORF variables and reading comprehension was not significantly different between ELL and EL1 students. The adjusted R2 increased from 9% to 17% of total variance explained when prosody was added.
Table 6. Hierarchical multiple regression analysis results.
*p < .05, **p < .01, ***p < .001.
Model 2 examined the predictive relationship of existing and new ORF scores with reading comprehension. As shown in Table 7, both significantly predicted reading comprehension (β = 0.277* and 0.34**, respectively) after controlling for each other and language background. Similar to the results from Model 1, the standardized β coefficient for the new ORF score was higher than that of the existing ORF score, indicating improvement in the predictive power of the new method. None of the interaction terms was significant.
Table 7. Regression results of predictive relationship between ORF scores and reading comprehension.
*p < .05, **p < .01, ***p < .001.
Results from quantile regression analysis
Table 8 presents the quantile regression results which indicate that both the existing ORF and the new ORF scores were significant predictors of reading comprehension across the distribution of reading comprehension scores. The main effect of each scoring method was significant for all of the quantiles measured except for the 90th. However, the estimated beta coefficients for the existing ORF score were strongest toward the upper end of the distribution, while the new ORF score coefficients were strongest toward the lower end. This suggests that the new ORF score may be slightly better at predicting reading comprehension in students with poorer reading comprehension skills.
Table 8. Quantile regression results.
*p < .05, **p < .01, ***p < .001.
ORF latent profiles
Table 9 shows the model fit statistics for LPA models ranging from 3 to 10 profiles. The six-profile model was determined to be the best-fitting model as it demonstrated the most favorable BIC, KIC, and entropy values, and it was the last model to exhibit a significant BLRT value compared to the preceding model. The six-profile model included two small profiles (Profiles 3 and 4), with sizes of n = 16 (5%) and n = 11 (3.4%), respectively. Although a profile with <25 individuals or <3% of the sample size is not recommended (Spurk et al., 2020), such profiles can be justified if they depict distinct groups worth identifying.
Table 9. Fit statistics for LPA models with 3 to 10 profiles.
Note: Bold values indicate the model selected as the best-fitting. AIC: Akaike information criterion, BIC: Bayesian information criterion, KIC: Kullback information criterion, BLRT: bootstrapped likelihood ratio test.
Figure 2 compares six latent profiles, with each differentiated by their estimated scores on the z-scale across three components of oral reading fluency: speed, accuracy, and prosody. Profile 1, comprising 13% of the sample, exhibited very low scores in speed and prosody and moderately low accuracy scores. Profile 2, representing 39%, showed low to moderate performance across all three components. Profile 3 (5%) exhibited very high speed and accuracy, but very low prosody. Profile 4 (3%) displayed high speed, very low accuracy, and high prosody. Profiles 5 (24%) and 6 (15%) exhibited high and very high scores in all three components, suggesting these profiles represented the fluent and extremely fluent readers.
Figure 2. Latent profile analysis of three ORF sub-scores.
Table 10 shows the percentages of language groups across the different latent profiles. Students in Profile 1 were slow and inaccurate, with prosody scores 1.5 standard deviations below the mean. Profile 1 (42 students) had a high predominance of ELL students at 97.6%, compared to only 2.4% EL1 students, suggesting marked differences in the characteristics of ORF profiles by language background. Profile 3, which is balanced between ELLs and EL1s, included students who read with well above average speed and accuracy but were very poor at reading prosodically. Profile 4, consisting of 11 students, comprised fast and prosodic readers who read the least accurately of all profiles, evoking the profile of a reader who may skim content, incorrectly reading many words along the way but doing so fairly effectively. About 18% of Profile 4 students were EL1s, which is close to the percentage of EL1 students in the sample overall. Profile 6 predominantly consisted of EL1 students, suggesting that these students were the most fluent readers, as they exhibited high scores in all three ORF components.
Table 10. Distribution of six latent profiles between ELL and EL1 students.
Discussion
The present study evaluated the potential of incorporating prosody into automated ORF assessment and the ways this may reduce language background bias and provide greater construct coverage and diagnostically useful information about student reading profiles. Recognizing the construct underrepresentation of traditional ORF assessment based on accuracy and rate alone, the present study further questioned whether its reliance on ASR speech-to-text transcription models would present a potential bias against ELL students due to ASR inaccuracy, as is well documented in previous research (Chen et al., 2018; Hannah et al., 2022; Mirzaei et al., 2015). RQ1 replicated these previous findings and quantified the impact on traditional ORF scores by comparing the difference between ELL and EL1 students’ WCPM when using human and ASR transcription. The difference between these two groups’ WCPM scores was smaller (7 points on average) when human transcription was used, indicating that differential ASR accuracy systematically drives ELL students’ automated ORF scores down. This issue poses a significant threat to the validity of automated ORF assessments, as reduced ASR accuracy embedded within the system can result in unfair disadvantages for specific groups of test takers, such as ELLs. Building on the results showing language group differences in ORF scores between human transcription and ASR models, RQ2 then questioned whether a new, prosody-inclusive ORF score would reduce or amplify those differences. The findings indicated that the new ORF score did reduce the gap between these two groups, dropping from d = 2.33 to d = 1.64. The difference between ELL and EL1 scores in both cases was greater in the present study than has been found in previous studies. For example, White et al. (2021) found a difference of d = 0.61, and Jimerson et al. (2013) found d = 0.81. However, both of those studies focused on primary grade students, whose English language skills were likely far more developed than those of the participants in the present study, where international students likely had much less exposure to English. No previous studies could be found that disaggregated ORF scores by language background in a post-secondary, English language learning context.
RQ3 and RQ4 then sought to evaluate the relationship between the new ORF score and reading comprehension when the new ORF score encompasses all three components (accounting for speed, accuracy, and prosody). The findings showed that prosody was a crucial predictor within the studied sample, being the strongest predictor and nearly doubling the variance explained in the initial model. Consequently, the new ORF score outperformed the existing ORF score in predicting reading comprehension but was not further differentiated across the distribution of the reading comprehension scores. These study results somewhat align with previous studies with similar samples (post-secondary ELLs), though it should be noted that no previous research has predicted reading comprehension using automated ORF scores. Jiang (2016) found that, compared to WCPM, prosody was a stronger predictor of reading comprehension for Japanese L1 speakers, less strong for Arabic L1 speakers, and comparable for Chinese L1 speakers. Tunskul and Piamsai (2016) identified accuracy as the most significant predictor of reading comprehension, followed by prosody, and then reading rate. The results of the present study align more closely with research conducted in EL1 contexts, which consistently indicates that prosody predicts comprehension beyond the contributions of accuracy and rate (Kuhn & Schwanenflugel, 2019; Schwanenflugel & Benjamin, 2017; Schwanenflugel & Kuhn, 2015). Unlike many previous studies which focus on young learners, the present study found prosody to be an important predictor of reading comprehension with older students. This point adds evidence to the notion that while gains in accuracy and rate tend to plateau as children develop, prosody continues to be an important predictor into adolescence, becoming a stronger predictor of more complex comprehension tasks in later years (Schwanenflugel & Kuhn, 2015).
Finally, RQ5 sought to consider a diagnostic approach to ORF assessment by examining the characteristics of ORF latent profiles between language groups. The study results clearly indicate different associations between the ORF latent profiles and language backgrounds. Further research may explore implications for reading instruction that would benefit readers who struggle with only one or two of speed, accuracy, or prosody, rather than treating them as struggling readers generally. Instructional practices targeting students who struggle with prosody were identified as a research gap in a recent review of reading interventions (Hudson et al., 2020), and the importance of considering ORF components separately has been stressed by researchers (Samuels, 2006; Valencia et al., 2010). Valencia et al. (2010), for example, found that a model including rate, accuracy, and prosody outperformed a model that combined the three sub-components into a single score, which is consistent with the present study. The same authors further argued that assessment results which separate these three sub-constructs would “add specificity and diagnostic information that are necessary for effective instructional interventions” (Valencia et al., 2010, p. 286).
The present study has several limitations that future studies could address. First, although the sample size was substantial, the balance between language background groups and the overall diversity of ELL language backgrounds could be improved to ensure more representative results. In addition, further research should explore methods to enhance the automated prosody model. Language background-specific training could potentially yield better results for the various language backgrounds represented in the study participants. Moreover, employing more advanced ML approaches, such as large language models and word embedding models, shows promise and could significantly improve the automated scoring of prosody and ORF. Future studies should also investigate the utility and impact of incorporating prosody into ORF assessments. This could be achieved by evaluating the experiences and reading-related outcomes of both teachers and students who utilize prosody-inclusive ORF assessments compared to those relying on accuracy and rate alone.
Conclusion
The present study showed that the inclusion of prosody within automated ORF assessment can mitigate bias against ELL students caused by differential ASR accuracy, improves the predictive power for reading comprehension with both ELL and EL1 post-secondary students, and improves the diagnostic capacity of the assessment. These findings underscore the transformative potential of technology in addressing the long-standing issues of construct under-representation in ORF assessments. Additionally, the distinct profiles of reading fluency identified in this research highlight the importance of a nuanced approach to both assessment and instruction. By providing score reports that identify levels across the different components of ORF—speed, accuracy, and prosody—educators can develop tailored instructional strategies that better meet the unique needs of individual students. This diagnostic specificity in assessment and intervention could hold promise for improving language outcomes for students (in formative assessment contexts), especially those who may be marginalized by traditional, less sensitive assessment methods.
The broader implications of this research extend beyond immediate educational contexts, suggesting a paradigm shift in how we understand and measure language competence. As we continue to harness the capabilities of AI and related technologies, there is a critical need to ensure these tools are developed and implemented with an eye toward ethical considerations, particularly concerning fairness and accessibility for all test takers. Ongoing research and dialogue among educators, technologists, policymakers, and assessment practitioners are essential to navigate these challenges, aiming for a future where technology-rich assessments enable all students to demonstrate their true potential without bias.
Acknowledgements
The authors express their sincere gratitude to the editors of Language Testing and especially to the special issue editors, Yasuyo Sawaki and Eunice Eunhee Jang, for their thoughtful reviews and insights.
Declaration of conflicting interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: E.E.J. is a co-editor of this special issue, “Advancing language assessment for teaching and learning in the era of the artificial intelligence (AI) revolution: Promises and challenges.”
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the University of Toronto International Student Experience Fund (ISEF).
Ethical approval and informed consent
This research has been approved by the Social Sciences, Humanities & Education Research Ethics Board of the University of Toronto, RIS Protocol Number: 39783. Informed written consent was obtained from each participant included in the study.
Data availability statement
Unfortunately, the data cannot be made available at this time.
