Abstract
Dynamic assessments (DAs) of word reading skills demonstrate strong criterion reference validity with word reading measures (WRMs). However, DAs vary in the skills they assess, their format and administration method, and the types of words and symbols used in test items. These characteristics may have implications for assessment validity. To compare the validity of DAs of word reading skills on these factors of interest, a systematic review of five databases and the gray literature was conducted. We identified 35 studies that met the inclusion criteria of evaluating participants aged 4 to 10, using a DA of word reading skills, and reporting a Pearson’s correlation coefficient as an effect size. A random effects meta-analysis with robust variance estimation and subgroup analyses by DA characteristics was conducted. There were no significant differences in mean effect size based on administration method (computer vs. in-person) or symbol type (familiar vs. novel). However, DAs that evaluate phonological awareness or decoding (vs. sound-symbol knowledge), those that use a graduated prompt format (vs. test-teach-retest), and DAs that use nonwords (vs. real words) demonstrated significantly stronger correlations with WRMs. These results inform the selection of DAs in clinical and research settings and the development of novel, valid DAs of word reading skills.
Introduction
Literacy is a complex construct requiring the integration of multiple skills. Simply put, it can be described as the product of the ability to recognize or decode words and to comprehend language (Hoover & Gough, 1990). In this review, we define word reading skills as the subskills that comprise word recognition ability (e.g., Scarborough, 2001). These subskills (phonological awareness, knowledge of the alphabetic principle or sound-symbol knowledge, decoding, and sight word recognition) have been consistently found to be among the strongest and most accurate predictors of reading ability for young children (e.g., Catts et al., 2005; Ehri, 1998; Hogan et al., 2005). In speech-language pathology (SLP), psychology, and education, most word reading tools employ a static assessment (SA) paradigm (e.g., Phonological Awareness Test-2 [PAT-2:NU], Robertson & Salter, 2017; Woodcock Reading Mastery Test-III [WRMT-III], Woodcock, 2011). In SA, the examiner evaluates a child’s acquired knowledge in a given domain without providing prompts or feedback and then compares their performance to that of their peers (Grigorenko & Sternberg, 1998). For example, in a static decoding task, a child would be presented with a word (e.g., pup) and asked to read it. If the child struggled or made an error, their response would be marked as incorrect, and the examiner would continue to the next item.
Children from diverse linguistic backgrounds or those with fewer literacy experiences are prone to perform poorly on SAs because they have limited or different acquired knowledge compared to the English monolingual children for whom the tests are designed (Bedore & Peña, 2008; Ginsborg, 2006). In our previous example of a static decoding task, a bilingual child might struggle to read the word “pup” because of a lack of familiarity with the vocabulary term, or with the English letters or sounds. When many children underperform, floor effects arise, making it difficult to discern those who are truly at risk from those who have had insufficient linguistic or educational experiences. This can result in failure to identify word reading difficulties early (Catts et al., 2009).
Given these limitations, interest in alternative approaches, like dynamic assessments (DAs), has been increasing. While SAs measure a child’s acquired skills, DAs examine a child’s ability to learn a skill with support in the form of teaching, feedback, and prompting from the examiner within the test (Grigorenko & Sternberg, 1998). This approach reduces bias and misidentification of difficulty because the impact of previous linguistic or literacy experiences on test outcomes is minimized (Bedore & Peña, 2008; Petersen & Gillam, 2013). Reviews that have evaluated the use of DAs report promising findings on their utility and validity. DAs demonstrate greater predictive validity than SAs across several domains (e.g., DAs of cognitive ability, literacy, and mathematics; Caffrey et al., 2008). DAs of word reading can predict unique variance in later reading ability beyond SAs (Dixon, Oxley, Gellert, & Nash, 2023), can contribute to the accurate identification of reading difficulties (Dixon, Oxley, Nash, & Gellert, 2023), and demonstrate consistent validity with word reading outcome measures across typically developing, at-risk, bilingual, and monolingual children (Wood et al., 2024).
However, these DAs are also characterized by heterogeneity in the word reading skills they assess, their format and administration method, and the word and symbol types they use. Previous reviews have explored the impact of these factors in the domain of static assessment. For example, the factor of administration method (virtual vs. in-person) was considered by Alfano et al. (2024), who found that across the seven included studies there were no significant differences between online and in-person administration of pediatric language and literacy assessments. Outcomes such as these support the development of novel SA tools that can be administered online. Whether administration method or any of the other characteristics described below affects the validity of DAs has not yet been considered.
Research Aim
In this meta-analysis, we directly examine characteristics of DAs to determine which types of assessments, if any, are superior to others in terms of their criterion reference validity with word reading measures, across the five factors discussed below (word reading skill type, format, administration method, word type, and symbol type). Outcomes of the current meta-analyses have implications for the revision of existing assessments and the development of novel DAs.
Word Reading Skill Type
DAs have been designed to evaluate various skills associated with literacy development and ability, including decoding (e.g., Cho et al., 2017), phonological awareness (e.g., Gellert & Elbro, 2017b), sound-symbol knowledge (e.g., Clayton et al., 2018), morphological awareness (e.g., Navarro et al., 2018), expressive vocabulary development (Peña et al., 2001), oral narratives (Peña et al., 2014), reading comprehension (e.g., Gruhn et al., 2020), and working memory (e.g., Swanson, 1994). As stated, in this review we focus exclusively on DAs that evaluate the word reading skills of decoding, phonological awareness, or sound-symbol knowledge, because these skills have consistently been found to be among the strongest predictors of word reading ability (e.g., National Early Literacy Panel, 2008).
Historically, research on the validity of static measures of these word reading skills has found that all three correlate strongly with later word reading ability in alphabetic languages. The National Early Literacy Panel (2008), which included nearly 300 studies, and a meta-analysis of 60 effect sizes by Elbro and Scarborough (2003) both documented strong correlations between kindergarten letter(-sound) knowledge (
Rationale for analysis: In the realm of DA, there have been fewer systematic examinations of the capacity of these skills to predict later word reading. A recent systematic review of 18 studies found that, across studies, DAs of phonological awareness and decoding predicted between 1% and 21% additional unique variance in later word reading ability beyond traditional static measures, but that a DA of paired associate learning (a task akin to learning sound-symbol knowledge) accounted for only 6% unique variance (Dixon, Oxley, Gellert, & Nash, 2023), suggesting a different pattern of prediction between DAs and SAs. Given that DAs evaluate ability to learn rather than acquired knowledge, it may be that assessments that evaluate more complex skills better permit children to demonstrate their ability to learn in the context of DA. Simple sound-symbol knowledge tasks require that a child learn the name that corresponds to a symbol. However, complex phonological awareness tasks like phoneme substitution require that a child identify a sound, delete it, replace it with a new sound, and blend the new sounds together to form the new word. Complex decoding tasks require integration of both sound-symbol knowledge and phonological awareness skills. These more complex tasks might be better suited to capturing learning potential in the context of DA. Given this,
Format
DAs come in many formats, but there are two primary approaches (Lantolf & Poehner, 2004). Interactionist DA is unscripted and endeavors to modify cognitive or skill ability. The examiner responds contingently to the individual examinee and their capacities. Interventionist DA, however, more closely parallels SA. The examiner provides pre-defined levels of support in response to student performance. Its scripted nature requires less clinical skill and time to administer, and its standardization permits evaluation of its psychometric validity (Poehner, 2008). Interventionist DAs across cognitive domains demonstrate stronger predictive validity than interactionist DAs (Caffrey et al., 2008). In the field of word reading assessment, DAs are generally characterized as interventionist. The studies included in this review focus on two common formats of interventionist DA.
The first, referred to as the (test)/teach/retest (TT) format in this paper, consists of a static pre-test, followed by a dynamic teaching phase, and a static re-test (Budoff, 1987). During the teaching phase, children receive feedback and instruction. Not all assessments incorporate the initial static pre-test. If one is conducted, post-test performance is compared to the initial score to assess the difference in performance following teaching. When no pre-test is administered, the post-test measures how a child performs after receiving explicit dynamic instruction in a task.
The second approach, referred to as the graduated prompts (GP) format in this paper, combines the teaching and testing phases of the assessment within each item (Brown & Ferrara, 1985). Children are provided with feedback regarding their response. If the response is incorrect, a hierarchy of increasingly explicit prompts is provided until the child answers correctly or all prompts are exhausted. The greater the number of prompts required, the lower the score on an item (Brown & Ferrara, 1985). A previous review suggested that there are no differences in classification accuracy of reading disorder for DAs that use a GP versus TT format and that both formats are used with similar frequency in assessment of word reading (Dixon, Oxley, Nash, & Gellert, 2023).
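To make this scoring logic concrete, the following sketch illustrates one way a graduated-prompts item could be scored; the function name, the three-prompt hierarchy, and the 0–3 point scale are illustrative assumptions rather than the scheme of any particular included study.

```r
# Hypothetical graduated-prompts (GP) item scoring: the fewer prompts a child
# needs before responding correctly, the higher the item score. A child who
# never responds correctly (all prompts exhausted) receives the minimum score.
# The 3-prompt hierarchy and 0-3 point scale are illustrative assumptions.
score_gp_item <- function(prompts_used, answered_correctly, max_prompts = 3) {
  stopifnot(prompts_used >= 0, prompts_used <= max_prompts)
  if (!answered_correctly) {
    return(0)                 # all prompts exhausted without a correct answer
  }
  max_prompts - prompts_used  # e.g., 3 = correct with no prompting
}

score_gp_item(0, TRUE)   # 3: immediate correct response
score_gp_item(2, TRUE)   # 1: correct after two prompts
score_gp_item(3, FALSE)  # 0: prompts exhausted
```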
Rationale for analysis: Previous research has found that noncontingent DAs correlate more strongly with word reading outcomes than contingent ones (Caffrey et al., 2008).
Administration Method
Assessments, dynamic or otherwise, can be conducted in person or via computer. For example, many static tests originally developed for in-person use are now available through the computer-based platforms Q-global and Q-interactive (e.g., the WRMT-II test of decoding), and dynamic tests are being developed for both in-person (e.g., Gellert & Elbro, 2017a) and computer use (e.g., Aravena et al., 2013). Development of virtual or computer-based assessments has become increasingly important in the wake of the COVID-19 pandemic and the subsequent shift to distance learning (Campbell & Goldstein, 2022).
Rationale for analysis: Post-pandemic, many clinicians and researchers continue to operate virtually, and therefore the factor of administration method and its implications for validity should be considered. No significant differences have been found between administering an SA online versus in person (Alfano et al., 2024), but this factor has not been considered in the context of DAs, which can be administered in person (e.g., Spector, 1992), virtually by an examiner (e.g., Barker & Saunders, 2020), or in a computer program where no examiner is required (e.g., Aravena et al., 2018). A recent review found that computerized DAs were used less frequently than in-person measures (Dixon, Oxley, Nash, & Gellert, 2023), but the implications of administration method for the validity of DAs have not yet been considered quantitatively.
Word Type
Assessments of the word reading skills of phonological awareness and decoding use either real words or nonwords, which are made-up words that abide by the language’s phonotactic and orthotactic constraints (e.g., “meeb” in English). Commercially developed SA tools such as the Comprehensive Test of Phonological Processing-2 (CTOPP-2; Wagner et al., 2013), which evaluates phonological awareness, or the Woodcock Reading Mastery Test – Third Edition (WRMT-III; Woodcock, 2011), which evaluates decoding, include subtests with both words and nonwords. Some DAs, like the CUBED-3 dynamic test of decoding, include both a word reading and a nonword decoding measure (Petersen et al., 2016). However, many DAs have fewer subtests and tend to employ either words or nonwords. For example, Gellert and Elbro’s dynamic phoneme identification task uses real words (Gellert & Elbro, 2017b), but their dynamic decoding measure uses nonwords (Gellert & Elbro, 2017a).
Rationale for analysis: Reading words and nonwords is purported to tap into different processes (Shapiro et al., 2013). Children may initially recognize familiar words by sight without activating their knowledge of sound-symbol correspondences, phoneme blending, and decoding skills (Ehri & Wilce, 1985), for example, recognizing their name or a high-frequency word like “the” in print. However, when reading nonwords, decoding skills are necessary because these words are unfamiliar (Hoover & Tunmer, 1993). Similarly, word and nonword phonological awareness tasks may activate different skills. When real words are used, performance may be impacted by acquired vocabulary knowledge, while nonword tasks may provide a purer measure of phonological ability (Wagner et al., 2013). In the domain of oral language assessment, nonword repetition tasks have been shown to reduce bias against culturally and linguistically diverse children (Ortiz, 2021). Nonword decoding and phonological awareness tasks may similarly reduce bias against those with different or limited literacy experiences in assessments of word reading skills. Children enter kindergarten with a wide range of language and literacy abilities that can be attributed to linguistic diversity, their home literacy environment, access to books and libraries, or exposure to literacy instruction in preschool (Ackerman & Barnett, 2005). Importantly, nonword tasks do not disadvantage strong readers with advanced lexical knowledge (Castles et al., 2018), and they can account for significant unique variance in word reading ability beyond real word reading (e.g., Hogan et al., 2005). To date, no studies have considered the role of word type in the validity of DAs of word reading skills.
Symbol Type
Word reading assessments of sound-symbol knowledge (SSK) and decoding use either familiar or novel symbols. Typically, SAs use the letters or characters of the language for which they were created. For instance, in the PAT-2 (Robertson & Salter, 2017), the phoneme-grapheme subtest (a measure of sound-symbol knowledge) evaluates a child’s acquired knowledge of the relationship between familiar English letters and sounds, and the phoneme decoding subtest evaluates ability to read nonwords comprised of English graphemes. Recently, there has been increased interest in using novel symbols in DAs, as this permits evaluation of how well a child can learn new symbol-sound relationships (e.g., that a given novel symbol corresponds to the sound /m/; Gellert & Elbro, 2017a) and apply this knowledge to decode symbol-based words (e.g., that a sequence of novel symbols corresponds to the nonword /ma/; Gellert & Elbro, 2017a), while minimizing the influence of previous linguistic and literacy exposure.
Rationale for analysis: No prior reviews have examined whether symbol type (novel vs. familiar) affects the strength of DAs’ correlations with word reading ability. Primary studies suggest that DAs that use novel symbols can differentiate between typical readers and those with dyslexia (Aravena et al., 2013, 2018). These measures can explain unique variance in later reading ability beyond traditional measures for preliterate children (Horbach et al., 2015). When administered in kindergarten to predict ability in grade 1, a DA decoding measure that used novel symbols (Gellert & Elbro, 2017a) had superior diagnostic accuracy to one that used familiar letters (Petersen et al., 2016). Use of novel symbols is a recent development in the field of word reading assessment, and there has not yet been a systematic quantitative examination of the relative validity of these two approaches.
The Current Study
The current study investigates whether the characteristics of word reading skill, format, administration method, word type, and symbol type affect DAs’ validity, as measured by association with performance on word reading measures. We examine criterion validity, represented by the correlation between performance on a DA of word reading skills (phonological awareness, sound-symbol knowledge, or decoding) and a word reading measure (single real word or nonword accuracy or fluency). Like Caffrey et al. (2008), we use Pearson’s correlation coefficients as our effect size, given that these are the most commonly reported type of effect size across studies. We focus exclusively on DAs of word reading skills as they are best suited to evaluate and predict reading ability in our target demographic of children who are learning to read (Catts et al., 2005). We examine overall validity and stratify DAs into subgroups by word reading skill type (phonological awareness, sound-symbol knowledge, vs. decoding), format (graduated prompts vs. test-teach-retest), administration method (in-person vs. via computer), word type (real word vs. nonword), and symbol type (familiar vs. novel). We also conduct a comprehensive search of the gray literature and include studies published in languages other than English.
Overall rationale: Outcomes of this review will inform which characteristics of DAs of word reading skills are associated with the greatest criterion reference validity, as measured by the strength of correlation between performance on DAs and word reading measures. For clinicians, outcomes will provide insight into which DA measures are appropriate for use in their practice (e.g., is it preferable to use a DA that evaluates decoding or phonological awareness, one that uses nonwords or real words? Is it suitable to use computerized DAs, or are in-person DAs superior?). For researchers, a quantitative examination of how these factors affect the validity of DAs of word reading skills can inform revisions of existing measures or development of new tools. This can be achieved by modifying or developing tests with the characteristics shown to be most strongly associated with performance on word reading measures (e.g., when designing new DAs, which factors are important to consider?).
Research Questions and Hypotheses
Do the following five factors have implications for the criterion reference validity of dynamic assessments of word reading skills, as measured by the strength of correlation between performance on DAs and performance on WRMs?
1. Word reading skill type: We hypothesize that performance on DAs of decoding and phonological awareness will be more strongly correlated with performance on WRMs, relative to DAs of SSK, because we anticipate that these more complex tasks might be better suited to capturing learning potential in the context of DA.
2. Format: We hypothesize that performance on DAs that use a graduated prompts format will be more strongly correlated with performance on WRMs, relative to DAs that use a test-teach-retest format. The GP format is highly explicit and structured, while the TT format allows for greater flexibility in response to the student in the training/teaching portion, and previous work has found that more explicit approaches demonstrate greater criterion and predictive validity with outcome measures (Caffrey et al., 2008).
3. Administration method: We hypothesize that performance on DAs that are administered in person will be more strongly correlated with performance on WRMs, relative to DAs that are administered via computer. DA is characterized by increased interaction between examiner and examinee, and so it may be impacted to a greater extent by computer administration than static assessment, which is typically scripted and in which the examiner acts as an objective observer rather than an interactive participant.
4. Word type: We hypothesize that performance on DAs that use nonwords will be more strongly correlated with performance on WRMs, relative to DAs that use real words. Nonwords are unfamiliar to all children and may therefore be better suited to evaluate a child’s ability to learn decoding skills in assessment, whereas real words may already be known to children and therefore would not allow them to showcase their “learning” in the test.
5. Symbol type: We hypothesize that performance on DAs that use novel symbols will be more strongly correlated with performance on WRMs, relative to DAs that use familiar symbols. Novel symbols are unfamiliar to all children and may therefore be better suited to evaluate a child’s ability to learn sound-symbol correspondences, whereas familiar symbols (letters of the child’s own alphabet) may already be known and therefore would not allow children to showcase their “learning” in assessment.
Method
Protocol Availability
The review objectives and meta-analytic approach were planned a priori and detailed in a registered protocol on the Open Science Framework (Wood & Molnar, 2023).
Ethics Statement
Ethics approval was not required for this meta-analysis given that all data were collected from studies that are publicly accessible (University of Toronto, n.d.).
Eligibility Criteria
Study inclusion criteria were determined a priori and outlined in the review protocol (Wood & Molnar, 2023). Included studies are:
(i) Primary research articles found in peer-reviewed journals, or unpublished gray literature such as Master’s or Doctoral theses found in preprint repositories and on Google Scholar.
(ii) Studies that assessed children with a mean age between 4;0 and 10;0 who were monolingual or bi/multilingual, typically developing, at-risk for reading or diagnosed with a reading difficulty. Articles that included adults or children with other developmental challenges, such as hearing impairment, developmental language disorder, or autism spectrum disorder were excluded.
(iii) Articles that reported a correlation coefficient between a DA of a word reading skill, and a word reading measure, concurrently or longitudinally.
(iv) No limitation was placed on setting or location, but only articles written in English, French, Spanish, or a different language with full text translation to one of these languages were included.
Search Strategy and Information Sources
An initial search was carried out in five databases, MEDLINE, Embase, CINAHL (Cumulative Index to Nursing and Allied Health Literature), PsycINFO, and ERIC (Education Resources Information Centre), using the terms “dynamic assessment” and “literacy,” as well as their related keywords, in titles and abstracts. The search strategy was developed in consultation with a University of Toronto librarian. A complete list of the search terms used in each database can be found in Tables S1 and S2 of the Supplemental files. No filters were used. The equivalent terms “dynamic assessment” and “literacy” were searched in three preprint repositories: medRxiv, EdArXiv, and PsyArXiv. Forward searching was then completed on Google Scholar using the “cited by” function with included articles. To check whether any relevant articles were missed during the database, preprint, and Google Scholar searches, the reference lists of the included articles were reviewed and compared to the list of included articles. Finally, appeals for unpublished work were made via social media callouts, posts to mailing lists, and direct emails to labs across Canada, the United States, and Europe that reported conducting research in the field of literacy.
Study Selection Reliability
Study selection and data extraction were managed in Covidence (2023), a web-based software platform that facilitates completion of reviews. A team of 10 research assistants (RAs) assisted in title/abstract screening and full text review. At the title/abstract stage, RAs received a 1-hr training session covering key concepts and relevant terms (e.g., defining dynamic assessment and each word reading skill) and subsequently completed 100 practice title/abstract screenings on a mock review prior to screening in earnest. At this stage, two independent team members voted to include or exclude each record based on whether the title and abstract indicated that the paper evaluated a word reading skill DA. Across all pairs of reviewers, the weighted mean agreement was 94% and the Cohen’s kappa coefficient was .40, which is characterized as fair (McHugh, 2012). At the full text stage, RAs again received a 1-hr training session led by the first author detailing specific eligibility criteria (e.g., reviewing whether a word reading measure was included, whether the age group was correct, etc.). Each RA completed a practice full text review of a paper with feedback from the first author. Two independent reviewers then voted to include or exclude full texts based on whether they met the pre-defined eligibility criteria. At this stage, interrater agreement across pairs was 85% and the weighted mean Cohen’s kappa coefficient was .66, which is considered substantial (McHugh, 2012).
As demonstrated in Figure 1, 24 of the 4,824 records identified via the database search were relevant and included. A search of three preprint repositories yielded 850 articles, of which one was included. Forward searching of these 25 included articles via Google Scholar led to the identification of an additional 9 studies. The reference lists of the 34 articles were then reviewed to determine whether any relevant articles had been missed, and one additional study was identified through this process. Finally, callouts for unpublished studies or data were made to mailing lists, via posts to social media, and by directly contacting labs conducting literacy-related research across Canada, the United States, and Europe, but this did not lead to the identification of any additional relevant articles. In summary, 35 articles met the criteria for inclusion. The study identification process, including reasons for study exclusion (e.g., no use of a dynamic assessment of one of the three word reading skills, as in Navarro et al., 2018), is outlined in the PRISMA diagram below (Page et al., 2021).

Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flowchart.
Coding Data Items
Data from relevant articles were extracted using a custom template in Covidence, which is available on the Open Science Framework protocol site (Wood & Molnar, 2023). The first and second authors both extracted data from all articles and then compared and consolidated their findings. Any disagreements were resolved through discussion among all three authors. The following data points were extracted for each included study:
General Information
The study title, journal name, date of publication, DOI, author name(s), institutional affiliation(s), funding, any potential conflicts of interest, and the country in which the study took place.
Participant Characteristics
The number of participants included in analyses, the percentage of males, and the mean age of the children at the outset of the study were noted. A total of 6,683 participants were included. The overall mean age was 5 years 6 months, and the overall average percentage of males was 51%. Mean age was not reported for 14 studies, and the percentage of males was not reported for 9. We were able to confirm that studies that did not report mean age still met inclusion criteria, as all minimally reported the grade of participants (e.g., indicated that participants were in grade 1 and therefore between ages 4 and 10). Authors also extracted information regarding participant reading status (typically developing vs. at-risk), language status (monolingual vs. bilingual), and age (4–5, 6–7, vs. 8–9). These factors are examined in a separate paper evaluating the validity of DAs of word reading skills across diverse populations (Wood et al., 2024). Table 1 provides additional details regarding the mean age and percentage of males for the included studies.
Table 1. Number of Participants, Effect Size, Mean Age, Grade, % Males, Study Design, Skills Evaluated and Characteristics of DAs, and Word Reading Measures of Included Studies.
Effect Sizes
Pearson’s correlation coefficients representing the relationship between DAs and WRMs were extracted. All relevant correlation coefficients listed were noted (e.g., a DA that used two PA tasks and their correlations with WRMs). In some instances, a lower score on a DA translated to better performance. In these situations, negative correlations were transformed to positive ones for analysis. From the 35 studies (
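As an illustration of the sign-reflection step described above, the short R snippet below shows how an extracted correlation could be reflected when a lower DA score (e.g., fewer prompts required) indicates better performance; the data frame and column names are hypothetical.

```r
# Hypothetical example: when a lower DA score indicates better performance
# (e.g., fewer prompts needed), the extracted correlation with the WRM is
# negative, so its sign is reflected before pooling.
effects <- data.frame(
  study = c("Study A", "Study B"),
  r     = c(0.62, -0.48),                  # extracted Pearson correlations
  lower_da_score_is_better = c(FALSE, TRUE)
)
effects$r_adj <- ifelse(effects$lower_da_score_is_better, -effects$r, effects$r)
effects
```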
Measures
Dynamic Assessments (DAs)
In this review, DA is defined as an assessment that provides teaching, training, feedback on performance, and/or prompting during testing. In some instances, these measures were not reported as “DAs” but were described as paired associate learning tasks; this was typically the case for measures that evaluated SSK skills (e.g., Liu et al., 2021). We recorded the word reading skills evaluated (phonological awareness, sound-symbol knowledge, and/or decoding) and the task used to assess each skill (e.g., phonological awareness/phoneme blending). If multiple tasks were used to evaluate a skill, all tasks utilized were listed. A DA was characterized as SSK if the task involved learning the relationship between a visual referent (symbol or letter) and a syllable or phoneme. DAs were considered to evaluate PA skills if they assessed one or more of the auditory skills of rhyming, blending, segmenting, manipulating, deleting, or substituting phonemes, syllables, words, or onsets and rimes. Tasks that required a child to recognize more than one symbol-sound relationship and blend these sounds together to “read” multi-symbol words (e.g., CV, VC, CVC, etc.) were labeled as decoding tasks. Authors also noted the format of the DA (i.e., graduated prompts [GP] or train/test [TT]). DAs were considered GP if a series of prompts was used after a participant’s response to an individual test item, and were characterized as TT if they minimally incorporated a teaching/training phase followed by a separate static post-test. In terms of administration method (i.e., in person or computer), DAs that were conducted virtually by a clinician over the computer, or those that were computerized (i.e., no clinician), were considered computer-based administration, while all others were characterized as in-person. For word type, DAs that used words that exist in the lexicon of the language of testing were considered real words (e.g., cat in English), while those that used invented words (e.g., meeb in English) or words from other languages were considered nonwords (e.g., “copa,” a Spanish word, in an English task). Finally, authors indicated whether novel or familiar symbols were used. DAs that used letters or characters belonging to the orthography of the language of testing were said to use familiar symbols (e.g., using alphabet letters in an English DA), while those that used invented symbols or letters from orthographies distinct from the language of testing were characterized as using novel symbols (e.g., using Hebrew letters in an English DA measure).
Of the 192 effect sizes from the 35 included studies that examined use of a DA, most evaluated decoding (
Word Reading Measures (WRMs)
For the purposes of this review, WRMs are assessments that evaluate word reading ability of single real or nonwords using a correct/incorrect evaluation system and without provision of feedback, prompting or teaching. WRMs were conducted concurrently with the DA or longitudinally at a later timepoint. Of the 35 studies, most effect sizes represented longitudinal relationships (
Quality Appraisal Ratings
Following extraction, included studies were evaluated independently by the first and second authors using an adapted version of two quality assessment tools from the Joanna Briggs Institute (Moola et al., 2020). This tool is available on the Open Science Framework protocol page (Wood & Molnar, 2023). Most studies (
Overall, the quality appraisal consisted of eight items rated across five domains. Items regarding participants, flow and timing, and statistical analyses were assigned one point each, while items concerning the index test (DA) and the reference tests (WRMs) were worth two points each due to their greater significance in achieving the review objectives. Conflicts were resolved through discussion among all three authors. Quality scores were ranked as low quality (0%–33%), medium quality (34%–66%), or high quality (67%–100%). Only medium- and high-quality studies were eligible for inclusion in the analyses; in practice, no studies were excluded based on their score. The overall quality appraisal rating for each study is included in Table 1. Please refer to Table S3 in the Supplemental Material for individual ratings for each question for each study.
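As a minimal sketch of how an appraisal score maps onto these bands, the helper below converts points earned out of points possible into the low/medium/high categories described above; the function name and the example totals are illustrative assumptions.

```r
# Convert a quality appraisal score to the percentage bands described above:
# low (0-33%), medium (34-66%), high (67-100%). Items on participants, flow and
# timing, and statistics are worth 1 point; index- and reference-test items are
# worth 2, so the total possible points depends on the adapted tool.
quality_band <- function(points_earned, points_possible) {
  pct <- 100 * points_earned / points_possible
  if (pct <= 33) "low" else if (pct <= 66) "medium" else "high"
}

quality_band(7, 10)  # 70% -> "high" (example values only)
```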
Statistical Analyses
All statistical analyses were conducted in R using the metafor package (R Core Team, 2021; Viechtbauer, 2010). First, a random effects meta-analysis with robust variance estimation (RVE) was conducted to examine the overall mean effect representing the association between DAs of word reading skills and WRMs. Random effects models assume that variability can stem from multiple sources, both individual sampling error and heterogeneity between studies. This is appropriate in this instance given that studies evaluated populations of different ages, from distinct locations, using varied DAs and WRMs (Borenstein et al., 2010). We elected to use RVE because it permits inclusion of multiple effect sizes from a single study while accounting for dependence between samples (Pustejovsky & Tipton, 2022). Prior to analysis, the 192 effect sizes were transformed using Fisher’s r-to-z transformation (Corey et al., 1998). Effect sizes were nested within study clusters, with the assumption that effect sizes from the same cluster are correlated. A weighted average of these effect sizes was calculated and then transformed back to a Pearson’s correlation coefficient for interpretation as the overall effect. We then conducted five subgroup analyses to examine whether the strength of the association between DAs and WRMs differed across the test characteristics of skill type, administration method, format, word, and symbol type. These subgroup analyses were planned a priori. To account for multiple comparisons, we used a Bonferroni adjustment to set a new value for significance (
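The R sketch below outlines the modeling workflow described in this section using the metafor package; the data frame, its column names, and the toy values are hypothetical placeholders, and the exact model specification the authors used may differ in its details.

```r
library(metafor)

# Hypothetical input: one row per extracted effect size, with a study
# identifier, a within-study effect-size identifier, the Pearson correlation
# between the DA and the WRM, and the sample size (toy values only).
dat <- data.frame(
  study = c("A", "A", "B", "B", "C", "C", "D", "D"),
  es_id = 1:8,
  r     = c(0.55, 0.61, 0.48, 0.52, 0.70, 0.66, 0.43, 0.58),
  n     = c(120, 120, 85, 85, 200, 200, 60, 60)
)

# Fisher's r-to-z transformation; escalc() also computes the sampling
# variance vi = 1 / (n - 3) for each transformed effect size (yi).
dat <- escalc(measure = "ZCOR", ri = r, ni = n, data = dat)

# Random effects model with effect sizes nested within study clusters.
res <- rma.mv(yi, vi, random = ~ 1 | study/es_id, data = dat)

# Cluster-robust (RVE) standard errors to account for dependence among
# effect sizes from the same study (clubSandwich package required for the
# small-sample adjustment; set clubSandwich = FALSE to omit it).
res_rve <- robust(res, cluster = dat$study, clubSandwich = TRUE)

# Back-transform the pooled Fisher's z to a Pearson correlation.
predict(res_rve, transf = transf.ztor)
```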
Results
As demonstrated in Table 1, 35 studies reported 192 correlations between a DA of word reading skills (phonological awareness, sound-symbol knowledge, or decoding) and a WRM. The effect sizes from these 35 studies were included in the random effects meta-analysis with robust variance estimation examining the relationship between DAs of word reading skills and WRMs. The forest plot of these 192 effect sizes can be found in Figure S1 of the Supplemental Material. As anticipated, the overall mean effect size is large (
The contribution of individual effect sizes to heterogeneity was also examined via a Baujat plot (Figure S2 in the Supplemental Material; Baujat et al., 2002). Two effect sizes from Gellert and Elbro (2017a) were identified in the upper right quadrant of the plot, suggesting that they contributed significantly to heterogeneity. No plausible reasons for this were identified based on the characteristics of the study. We reran the meta-analysis with these effect sizes removed, but this did not change the magnitude of the overall effect (
Risk of Publication Bias
A funnel plot was generated to subjectively examine the risk of publication bias (see Figure S3 in the Supplemental Material). Visual inspection of the funnel plot suggests potential asymmetry. Several studies with small sample sizes and positive findings were identified and included, compared to relatively few studies with small sample sizes and negative findings (e.g., Horbach et al., 2018; Loreti, 2015); the former cluster at the bottom right of the funnel. There is a possibility that studies with negative outcomes were not completed, published, or deposited in the gray literature (Lee & Hotopf, 2012). However, this may also simply reflect the fact that the skills evaluated in the included DAs (phonological awareness, sound-symbol knowledge, and decoding) are known to correlate with word reading performance, so negative effects are not anticipated. Furthermore, despite the apparent visual asymmetry, Egger’s test was calculated and not found to be significant for presence of plot asymmetry (
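The following sketch shows how such checks could be implemented in metafor, continuing from the hypothetical dat and res objects defined in the earlier modeling sketch; note that, for a multilevel model, one common Egger-type approach is to add the standard error of each effect size as a moderator, which may not be the exact procedure used here.

```r
library(metafor)

# Funnel plot of the Fisher's z effect sizes against their standard errors
# (res is the multilevel model fitted in the earlier sketch).
funnel(res)

# Egger-type regression test for a multilevel model, approximated by adding
# the standard error (sqrt(vi)) as a moderator; a significant coefficient on
# sqrt(vi) suggests funnel plot asymmetry. This is one common approach and
# may differ from the exact test reported in the paper.
egger_mv <- rma.mv(yi, vi, mods = ~ sqrt(vi),
                   random = ~ 1 | study/es_id, data = dat)
summary(egger_mv)
```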
Subgroup Analyses
Subgroup analyses by DA word reading skill type, format, administration method, word type, and symbol type were planned a priori and were conducted to examine whether these characteristics have implications for the criterion reference validity of DAs with word reading measures. Mixed effects models were used to examine whether there were significant differences in mean effect sizes for DAs based on these factors. Results of these subgroup analyses are reported in Table 2. The adjusted significance value was set to
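To illustrate how such a subgroup (moderator) analysis might be specified, the sketch below adds a categorical moderator to the multilevel model from the earlier example; the skill_type column is a hypothetical placeholder, and the Bonferroni-adjusted alpha simply divides a conventional .05 by the five planned analyses.

```r
library(metafor)

# Hypothetical subgroup (moderator) analysis: does the DA-WRM correlation
# differ by word reading skill type? skill_type is an illustrative column
# added to the 'dat' object from the earlier modeling sketch.
dat$skill_type <- factor(c("decoding", "PA", "SSK", "PA",
                           "decoding", "SSK", "PA", "decoding"))

mod_skill <- rma.mv(yi, vi, mods = ~ skill_type,
                    random = ~ 1 | study/es_id, data = dat)
summary(mod_skill)          # omnibus (QM) test of the skill type moderator

# Bonferroni adjustment across the five planned subgroup analyses,
# assuming a conventional .05 family-wise alpha.
alpha_adjusted <- 0.05 / 5  # = .01
```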
Findings for each factor are described below:
Word reading skills. In line with our hypothesis, results indicate that there are significantly stronger correlations between DAs of phonological awareness and decoding and WRMs, relative to DAs of SSK. Though multiple comparisons for each of the three subgroups were not completed, the mean effect sizes for DAs of decoding (
Format. In line with our hypothesis, results indicate that there are significantly stronger correlations between DAs that use a graduated prompts format and WRMs (
Administration method. Contrary to our hypothesis, there were no significant differences in strength of correlational relationship between DAs administered in-person (
Word type. In line with our hypothesis, results indicate that there are significantly stronger correlations between DAs that use nonwords (
Symbol type. Contrary to our hypothesis, there were no significant differences in strength of correlational relationship between DAs that used familiar (
Table 2. Results of Subgroup Analysis by Skill Type, Format, Administration Method, Word, and Symbol Type.
Adjusted
Significant result.
Discussion
This review examined whether characteristics of dynamic assessments (DAs) of word reading skills (phonological awareness, sound-symbol knowledge, and decoding) affect their criterion reference validity, as measured by the strength of the correlational relationship with performance on a word reading measure (WRM). Thirty-five articles met the inclusion criteria of evaluating children with a mean age between 4;0 and 10;0 and reporting a Pearson’s correlation coefficient between a DA of a word reading skill and a WRM. This is the first review to directly and quantitatively examine the criterion reference validity of DAs of word reading skills with WRMs on the basis of these characteristics. Findings have important implications for informing clinical word reading assessment practices and for developing novel DA word reading tools.
Main Findings
As expected, results of the overall meta-analysis suggest that DAs of word reading are strongly correlated with WRMs. Regarding results of the subgroup analysis, DAs of phonological awareness and decoding, those that used a graduated prompts format, and those that used nonwords demonstrated greater strength of correlational relationship with WRMs than those that evaluated sound-symbol knowledge, used a test-teach-retest format, or used real words. There were no significant differences in terms of strength of correlational relationship between DAs and WRMs for factors of administration method (in-person vs. computer) or symbol type (familiar vs. novel).
Word Reading Skills
Analysis of
Format
The analyses evaluating DA
Administration Method
The analysis examining the role of
Word Type
Regarding
Symbol Type
This trend of unfamiliarity leading to increased capacity to evaluate ability to learn in DA was not reflected in the subgroup analysis by symbol type.
Despite this, these results still have important implications for the validity of DAs of word reading skills. The use of nonwords and novel symbols has the potential to further reduce bias against children with diverse linguistic experiences. When words and symbols are unfamiliar to all, lack of experience with the oral and print system of the language of evaluation has less impact on performance. Results of this review support the use and development of nonword, novel-symbol based DA measures. These types of tools could be used to equitably evaluate children with and without experience in the language of testing. This is in contrast with static tools, which are developed for, and can only be used in a valid, unbiased manner with, a single language group, typically a monolingual population. A DA of word reading skills that uses nonwords and novel symbols evaluates the ability to learn word reading skills while minimizing the impact of previous linguistic experience. This type of tool could act as an equitable alternative to language-specific SAs for culturally and linguistically diverse children for whom there are limited assessment tools.
Limitations
First, while we endeavored to examine relevant DA characteristics and their implications for validity, it is possible that other factors contribute to the overall strength of the relationship between DAs of word reading skills and word reading measures. One such factor is word reading measure type. In the Caffrey et al. (2008) review, the authors examined whether the type of outcome measure had implications for the validity of DAs and reported that researcher-developed tools demonstrated the largest mean effect size relative to norm- and criterion-referenced measures or teacher/clinician ratings. Regrettably, we were not able to replicate the subgroup analysis of word reading measure type fairly, since a large majority (156/192) of effect sizes represented correlations between DAs and a norm-referenced or criterion-referenced WRM, while significantly fewer (35/192) used a researcher-developed tool, and none used a teacher or clinician rating. Second, correlation coefficients were selected as the measure of effect size because they are consistently reported. While this allowed for the inclusion of additional studies, it means that only correlational inferences can be made about the results. Finally, it is possible that relevant studies were not identified because they were published in a language that our review team was not able to read (e.g., many studies in Korean and Hebrew were excluded at the title and abstract screening phase) or because they used key terms not captured by our search strategy.
Clinical Implications
The results of this systematic review and meta-analysis have implications for clinicians, such as speech-language pathologists, psychologists, and educators, who routinely evaluate word reading skills and who may require alternatives to static assessments for bi/multilingual students or those with limited literacy experiences. Outcomes suggest that, when possible, clinicians should favor DAs of phonological awareness and decoding skills that are structured in a graduated prompts format and that use nonwords comprised of familiar or novel symbols. Results indicate that these measures can be conducted in person or virtually, which is particularly relevant post-pandemic, as many professionals continue to evaluate children in a virtual context. An example that meets most of these criteria is the CUBED-3 dynamic decoding measure (DDM). The DDM, a measure developed on the basis of the included studies conducted by Petersen et al. (2016) and Petersen and Gillam (2015), uses a test-teach-retest approach rather than a graduated prompts approach, but evaluates both phonological awareness and decoding, uses nonwords comprised of familiar letters, and is administered in person. This measure can be used as a criterion-referenced screening tool, to set intervention targets, or to monitor progress (CUBED-3 Dynamic Decoding Measure [DDM]; Petersen & Spencer, 2023).
Implications for Tool Development
Results of this study also inform the development of novel DAs of word reading skills and revisions of existing tools. Notably, findings support the virtual administration of DAs of word reading skills. This is an important consideration given the significant increase in the provision of virtual care and tele-assessment since the COVID-19 pandemic (Campbell & Goldstein, 2021). Researchers should consider developing virtual versions of DAs to ensure that assessment of these critical early reading skills remains available during any future disruptions to in-person education or clinical services, or simply to ensure that children living in rural or remote areas have access to high quality, equitable dynamic assessments. Study outcomes also support the use of dynamic assessments of word reading skills that use nonwords and novel symbols. Not only does the unfamiliarity of words and symbols lend itself well to a DA task whose goal is to evaluate the ability to learn, but this type of measure may also minimize the linguistic and cultural bias associated with traditional static word reading skill tasks. Developing new DAs, or revising existing measures to include nonword, novel-symbol based versions, is critical for monolingual and bilingual children for whom there are no language-specific tests of word reading skills available.
Future Research
Beyond this, future studies can also directly compare DAs with differing characteristics, using research designs and statistical analyses that permit a better understanding of the causal role these factors play in predicting reading ability and identifying reading disorder at various timepoints in a child’s journey of learning to read. This can be achieved through longitudinal studies comparing the relative predictive validity of DAs that differ in their format, administration method, word and symbol type, or other relevant factors via regression or structural equation modeling. Studies should also explicitly examine whether specific characteristics of DAs of word reading skills have a greater capacity to limit floor effects associated with traditional static measures or result in improved diagnostic accuracy. Ideally, these studies should include populations for whom DA is purported to be most useful, particularly bilingual children and those with limited previous literacy experiences.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by a Social Sciences and Humanities Research Council of Canada (SSHRC) Canada Graduate Scholarship (Master’s), a Ministry of Colleges and Universities Ontario Graduate Scholarship, a SSHRC Canada Graduate Scholarship (Doctoral), and a Duolingo Dissertation Grant awarded to the first author; a University of Toronto Excellence Award awarded to the second author; and a Social Sciences and Humanities Research Council of Canada Insight Grant (#435-2024-0713) awarded to the third author.
Data Availability Statement
Additional supplemental materials are available on the Open Science Framework systematic review protocol page.
Supplemental Material
Supplemental material for this article is available online.
References