Abstract
Background. One important objective for clinical trialists in rehabilitation is determining efficacy of interventions to enhance motor behavior. In part, limitation in the precision of measurement presents a challenge. The few valid, low-cost observational tools available to assess motor behavior cannot escape the variability inherent in test administration and scoring. This is especially true when there are multiple evaluators and raters, as in the case of multisite randomized controlled trials (RCTs). One way to enhance reliability and reduce variability is to implement rigorous quality control (QC) procedures. Objective. This article describes a systematic QC process used to refine the administration and scoring procedures for the Wolf Motor Function Test (WMFT)–Functional Ability Scale (FAS). Methods. The QC process, a systematic focus-group collaboration, was developed and used for a phase III RCT, which enlisted multiple evaluators and an experienced WMFT-FAS rater panel. Results. After 3 staged refinements to the administration and scoring instructions, we achieved a sufficiently high interrater reliability (weighted κ = 0.8). Conclusions and Implications. A systematic focus-group process was shown to be an effective method to improve reliability of observational assessment tools for motor behavior in neurorehabilitation. A reduction in noise-related variability in performance assessments will increase power and potentially lower the number needed to treat. Improved precision of measurement can lead to more cost-effective and efficient clinical trials. Finally, we suggest that improved precision in measures of motor behavior may provide more insight into recovery mechanisms than a single measure of movement time alone.
Introduction
Improvements in motor control or skill that can be causally linked to specific interventions could inform clinical practice and perhaps, more important, advance knowledge about the mechanisms of recovery. The most robust measures of motor behavior use laboratory-based instruments to record spatial-temporal parameters of position (kinematics) and force (kinetics).1-3 However, these instrumented assessments are typically costly and time-consuming and require specialized training and equipment for data collection and analysis. As technology advances, more low-cost options with real-time capabilities will become available.4,5 Yet most researchers and some practitioners still rely on inexpensive, low-technology measures such as the Wolf Motor Function Test (WMFT) 6 and the Action Research Arm Test. 7 These kinds of assessment tools can be subject to poor reliability (ie, increased variability), especially when used by multiple raters. Ultimately, a diminished interrater reliability can reduce the statistical power for determining efficacy of rehabilitative intervention studies.
One way to control for the inherent risk of diminished interrater reliability is through quality control (QC) procedures. The WMFT 6 is a standardized, performance-based, upper-extremity (UE) assessment of functional capability for adults poststroke that includes 15 timed items and 2 strength items. From video capture, a discrete WMFT–Functional Ability Scale (FAS) score is determined post hoc to quantify the quality of movement (relative to the less affected side) for each of the 15 timed tasks. The score is based on characteristics of speed, precision, coordination, and fluidity (metrics of skill). 8 The WMFT-FAS uses a 6-point ordinal rating scale that ranges from 0 (no use of the affected side attempted) to 5 (normal) for a maximum score of 75 for the 15 tasks. The psychometric properties of the WMFT-FAS were previously explored in 3 related articles.8-10 The minimal detectable change (MDC90) was reported to be 0.37 (range = 0.2-0.4) points per task 8 within an overall change of 20 out of 75 points (27%; MDC95) across tasks. 10 Morris et al 8 found the interrater reliability (using intraclass correlation coefficients [ICCs]) to be ≥0.88 when task items were pooled, yet there was inconsistency for individual tasks with median ICCs that ranged from 0.36 to 0.93.
The Interdisciplinary Comprehensive Arm Rehabilitation Evaluation (ICARE) phase III multisite randomized controlled trial (RCT) 11 afforded an opportunity to implement a rigorous QC process. This effort was undertaken to strengthen interrater reliability of the rater panel scores and thereby improve construct validity of the WMFT-FAS for the ICARE trial. Our process resulted in a fine-tuned revision to the administration and scoring instructions (ASIs; see supplementary material). Ultimately, our aim is to inform clinical investigators about the QC process and to provide the revised ASI to future users.
Methods
Organization of the Phase III Clinical Trial
The ICARE trial with targeted enrollment of 360 participants is a prospective phase III RCT that includes 3 regional centers (Los Angeles, Washington DC, and Atlanta) and 7 clinical sites (5 in Los Angeles and 1 each in Washington DC and Atlanta). 11 The primary aim of ICARE is to compare arm and hand recovery in adults early poststroke who are randomized to 1 of 3 groups. The experimental group participates in an outpatient structured therapy program termed the Accelerated Skill Acquisition Program (ASAP), and they are compared to those randomized to a dose-equivalent usual and customary therapy program.
The primary outcome is the change in the log-transformed WMFT time score at 1 year postrandomization. The WMFT-FAS is but one of a large battery of secondary outcomes. As with other multisite studies, the ICARE trial requires considerable coordination, training, and standardization of multiple team members to ensure success. Each site has a team leader and usually one reserve, whereas each regional center has multiple blinded evaluators (BEs) who formally administer the ICARE assessment battery at each of 4 designated time points. The administrative core employs 3 experienced clinicians (rater panel) who rate the 15 timed tasks from digitally captured footage. Data are collected and managed through the primary database hosted by the central data management and analysis center (DMAC) and the file transfer protocol (FTP) site hosted by the Administrative Core at the University of Southern California in Los Angeles. The flow of data from test administration to rating score submission is illustrated in Figure 1.

Figure 1. Flow diagram: The flow of data from initial administration of the WMFT by the BE to uploading of scores to the FTP site by the rater panel member is shown.
Licensed occupational and physical therapists blinded to group assignment administer the WMFT at baseline and postrandomization at 3 predetermined follow-up periods: posttherapy, 6 months, and 1 year. The WMFT is filmed using a digital camera (standardized across sites) according to detailed procedures outlined in the ICARE Manual of Procedures (MOP). Prior to study initiation, site-designated BEs attended a 3-day Clinical Research Evaluators Training Meeting in Los Angeles. Training materials and protocols, with video demonstrations, are updated when necessary and made available on a secure ICARE Web site for ongoing review and use by BEs and other study personnel. Test administration certification requires each BE to demonstrate at least 90% proficiency with the standardized administration and scoring criteria. Recertification is required every 6 months until data collection is complete.
Immediately after test administration, the local site edits the digital file and uploads it to a secure FTP server. A trained member of the administrative core evaluates and approves each deidentified digital file for quality (ie, data completeness and audio and visual clarity) and consistency with the ASI essential elements (see Results for details on essential elements). Once the QC check is complete, the digital files are made available to the rater panel. Files that fail the QC check are retained and, if the problem is deemed fixable, returned to the BE for remediation. For example, if the problem was in editing, the BE can re-edit the digital file from the raw footage. In any case, the test is not readministered. Most important, regardless of the reason for failure, feedback is given to sites to ensure that the same error does not recur.
The rater panel scores each digitally recorded assessment using the 6-point scale. Initially, panel members independently determined a score for each task. This yielded 3 ratings per task for each digital file (ie, 15 tasks × 3 raters = 45 scores/test). Thus, without attrition, we estimated ~1440 digital files (360 participants × 4 time points) at ~30 to 45 minutes each for accurate scores, per rater.
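The workload implied by these figures can be tallied with a short sketch. The counts and the 30 to 45 minute range come from the text; the total number of scores and the hours per rater are derived here for illustration and were not reported by the authors. The function and variable names are hypothetical.

```python
# Counts taken from the text; derived totals are illustrative only.
PARTICIPANTS = 360
TIME_POINTS = 4
TASKS_PER_TEST = 15
RATERS = 3
MINUTES_PER_FILE = (30, 45)  # approximate rating time per file, per rater


def rating_workload(participants=PARTICIPANTS, time_points=TIME_POINTS,
                    tasks=TASKS_PER_TEST, raters=RATERS,
                    minutes_per_file=MINUTES_PER_FILE):
    """Return file count, score counts, and hours per rater (low, high)."""
    files = participants * time_points        # one digital file per test
    scores_per_file = tasks * raters          # every rater scores every task
    low_hours, high_hours = (files * m / 60 for m in minutes_per_file)
    return {
        "files": files,
        "scores_per_file": scores_per_file,
        "total_scores": files * scores_per_file,
        "hours_per_rater": (low_hours, high_hours),
    }


if __name__ == "__main__":
    print(rating_workload())
```

Under these assumptions, full triple rating without attrition would require roughly 720 to 1080 hours from each rater, which gives a sense of why the later shift to a single rater per file mattered.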
Timeline for QC Process
To improve consistency in test administration across sites and to verify the reliability of WMFT-FAS scoring, the ICARE clinical leadership in collaboration with the BEs and rater panel implemented a QC process. Prominent events of the systematic QC process included refinement of the WMFT template (ie, table-top mat), clarification and refinement of the ASI, statistical analysis of the rater panel scores, and initiation of monthly rater panel meetings. These events are chronicled in Figure 2.

Figure 2. Timeline for systematic quality control process. Top: interrater reliability (quadratic weighted κ) for the WMFT-FAS rater panel overall for each round of analysis for each of the 3 rater pairs. The number of digital files (n) included in each round is listed. Bottom: timeline of events (with dates) for the ICARE trial, including revisions to the WMFT-FAS administration and scoring instructions (ASI) and other notable events before, during, and after the 5 rounds of analysis. Quality checks (QCh) were conducted to ensure essential elements were met for each digital file.
Three revisions were made to the ASI to simplify instructions, specify the setup, and reduce scoring ambiguity (see Results for details). After each revision, the changes were reviewed in meetings held with the site team leaders, BEs, and rater panel members. Each site and BE provided signed documentation that the revised protocols were reviewed and replaced in the MOP binders; master copies were also available through the secure ICARE Web site.
In concert with the ASI revisions, the DMAC conducted 5 rounds of interrater reliability testing based on scores generated by the rater panel. The first 2 rounds used digital files from a prior study completed before ICARE, and the later 3 rounds used digital files from the ICARE trial itself. Each panel member rated the digital files and submitted their scores to the DMAC through the secure FTP site.
Monthly Web-based meetings of the rater panel began after round 2 of reliability testing. These meetings were initiated in an effort to hone the panel’s objective rating skills and to identify scoring criteria that needed clarification. After independent review of sample digital files, discrepancies in rating were identified by the DMAC. Each rater shared the decision process they used to rate discrepant items, and the panel discussed the differences. During meetings, boundaries for determining whether a task was performed very slowly (score 2), slower (score 3), or slightly slower (score 4) were discussed, and differences between raters on specific task scores (identified by the DMAC) were reviewed. An examination of the boundaries for each scoring category included a review of the actual movement time of task performance and a comparison with the less-involved limb’s performance. However, a specific time interval for each category was not designated because, in most cases, it was determined relative to the less-involved limb. Specific instructions about the WMFT-FAS rating and the speed of movement are included in the ASI (supplementary material, Section VI.B.) in the section beginning, “For determination of normal ….” The rater panel agreed that the difference between speed categories was best determined through examples and discussion during the monthly meetings. Discrepancies were adjudicated by majority vote.
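As a minimal sketch of majority-vote adjudication for one task scored by the 3 raters: the function name is hypothetical, and the text does not state what was done when all 3 scores differed, so returning None to flag the item for discussion is an assumption made only for this example.

```python
from collections import Counter


def adjudicate(scores):
    """Return the score given by a strict majority of raters, else None.

    `scores` holds one FAS rating (0-5) per rater for a single task.
    None is a hypothetical placeholder meaning "no majority; discuss".
    """
    top_score, top_count = Counter(scores).most_common(1)[0]
    if top_count > len(scores) / 2:
        return top_score
    return None


if __name__ == "__main__":
    print(adjudicate([3, 3, 4]))  # two of three raters agree
    print(adjudicate([2, 3, 4]))  # no majority among three different scores
```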
Statistical Analysis
Interrater reliability was assessed using the weighted κ (κw) for all 5 rounds of testing. κw is a measure of concordance for ordinal outcomes that weights larger disparities in ratings higher than smaller differences,12,13 which was appropriate given the 6-point scaling. A quadratic weighting was used for these analyses to penalize greater discordance in ratings more severely than would occur with a linear weighting. For example, a difference in ratings of a 2 and a 3 would be seen as a much smaller disparity in rating than a difference in rating between a 2 and a 5. κw was utilized because the items under review were ordinal. Whereas ICCs are appropriate for interval-level data, they are inappropriate for ordinal-level data because they require calculation of parametric outcomes that are not meaningful in nonparametric data. To guide our efforts toward the achievement of interrater reliability, we looked to the benchmarks that Landis and Koch 14 used and set a lower bound of 0.80, which should represent substantial agreement.
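The quadratic weighting described above can be illustrated with a small, dependency-free sketch for one pair of raters. It uses the standard disagreement-weight form, w = (i − j)² / (k − 1)² for k ordered categories, and computes κw as 1 minus the ratio of weighted observed to weighted chance-expected disagreement. The function names are hypothetical, and this is not the DMAC's analysis code, which the article does not describe.

```python
def quadratic_weight(i, j, k):
    """Disagreement weight between category indices i and j of k categories."""
    return (i - j) ** 2 / (k - 1) ** 2


def quadratic_weighted_kappa(ratings_a, ratings_b, categories=range(6)):
    """Quadratic weighted kappa for two raters on an ordinal scale.

    `categories` lists the scale points in order (0-5 for the WMFT-FAS).
    """
    cats = list(categories)
    k = len(cats)
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("ratings must be non-empty and of equal length")
    index = {c: pos for pos, c in enumerate(cats)}

    # Cross-tabulate the paired ratings.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = quadratic_weight(i, j, k)
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    rater_1 = [0, 1, 2, 3, 4, 5, 3, 4]
    rater_2 = [0, 1, 2, 3, 4, 5, 4, 4]
    print(quadratic_weighted_kappa(rater_1, rater_2))
```

With 6 categories, a 2-versus-3 disagreement carries a weight of 1/25, whereas a 2-versus-5 disagreement carries 9/25, so the larger discrepancy is penalized 9 times as heavily; a linear weighting would penalize it only 3 times as heavily.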
Results
Revisions to ASIs
The initial version of the ASI adopted for the ICARE MOP was the same as that used in the EXCITE trial. 15 Prior to randomization for ICARE, Revision 1 was made to the original table-top template and the ASI (Figure 2). Instructions on positioning for task execution and camera placement were revised, and placement markers for each item were labeled on the template. Participant instructions for the quick demonstrations were condensed for all items. For example, for item 10, the instructions for the quick demonstration now read, “Pick up the pencil as fast as you can.”
Revision 2 of the ASI was completed after randomization commenced. Refinements included specifications on chair positions, participant/object proportions (ie, box size), and filming strategies. For example, the filming position “Side-Close” was specified to zoom in on fine-motor skills. Instructions for task items were also clarified. For example, for item 12, Stack checkers, we added, “Do it like this (demonstrate correct method), not like this (demonstrate incorrect method of at least 1 checker still touching the table surface in the stacked position).” Finally, a description of template placement during test administration was added.
Revision 3 was finished after round 4 of statistical testing and included clarification of the scoring criteria and the addition of essential and desirable elements for each item. Essential elements are the specific features that must be accomplished for the task to be deemed complete. The desirable elements are other qualitative features that should be included in the task but are not necessary for completion. If a participant cannot complete the essential elements, they are given a timed score of 120+ (coded as −7). An essential element in item 15, Turning key in lock, reads, “Turns key fully each direction and back to vertical.” Two desirable elements for item 15 are, “lateral pinch and turns key to the instructed direction first.” Revision 3 also included photos of task demonstrations. In the General Comments section, details were added such as, “If a BE makes a mistake with set-up or timing, the task may be repeated an additional time.” Consultation with investigators from the EXCITE trial and the BE team at Emory University guided the third ASI revision.
Reliability Testing
Reliability testing began during the rater panel’s training period. The 5 rounds of testing were interspersed with ASI revisions, review of previous reliability findings, consultation with WMFT-FAS experts, and quality checks of digital files (Figure 2). The timeline reveals that this process transpired over a 3-year period.
In round 1, some κw values between raters were below our 0.8 criterion (0.67 to 0.83). Therefore, the ICARE leadership team consulted the reliability findings from Morris et al. 8 The pooled reliability was reportedly high, 8 but the interrater reliability for more than half of the tasks was lower than the 0.80 criterion set for the ICARE trial. 8 For at least 1 of 2 test periods, interrater reliability for 9 of 15 tasks was <0.75, and ICCs for a number of tasks (ie, 1, 4, 7, 10, and 11; see Figure 3 legend for task ID) were <0.50. 8 This important factor contributed to our decision to seek improvement of interrater reliability to a greater level than round 1 of ICARE and that reported by Morris et al. 8

Figure 3. Interrater reliability by task at round 5: The quadratic weighted κ values between members of the WMFT-FAS rater panel are shown for each task. The 15 timed tasks are as follows: (1) Forearm to table; (2) Forearm to box; (3) Extend elbow; (4) Extend elbow with weight; (5) Hand to table; (6) Hand to box; (8) Reach and retrieve; (9) Lift can; (10) Lift pencil; (11) Lift paper clip; (12) Stack checkers; (13) Flip cards; (15) Turning key in lock; (16) Fold towel; and (17) Lift basket. Note that the 2 force tasks are not listed, but the full WMFT task numbering has been retained.
For round 2, interrater reliability was inconsistent (κw values 0.62-0.87). Only 1 rater pair (raters 2 and 3) reached the κw criterion of 0.80 in the first 2 rounds of statistical testing (Figure 2). Therefore, the ICARE clinical leadership team, in collaboration with the DMAC, requested that each rater panel member score every ICARE digital file (ie, 3 scores per file) and meet as a group once a month to review and adjudicate task items with between-rater scoring discrepancies.
For round 3, all κw values (0.76-0.89) were higher than for round 2. Yet κw values in round 4 showed a reduction in agreement between rater 2 and the other 2 raters (0.51 and 0.66) compared with the first 3 rounds. Additional QC methods were then pursued to examine the basis for the diminished reliability. This involved a heightened screening process in which the digital files were reviewed for confirmation of clear visibility of the essential task elements and accuracy of editing. After round 4, the ASI underwent a third and final refinement.
For round 5, the κw pooled reliability was above criterion (0.81 to 0.86). Yet the κw between raters for the 15 individual tasks in round 5 was less consistent (Figure 3). Three tasks fell below a κw of 0.70 for at least 2 rater pairs, including the following: Forearm to table (task 1; 0.66), Forearm to box (task 2; 0.66), and Reach and retrieve (task 8; 0.63). Of note, these 3 tasks were among those Morris et al 8 found to have a low level of agreement for at least 1 of 2 test periods (task 1 = 0.52; task 2 = 0.57; task 8 = 0.61). For ICARE, raters 1 and 3 had the highest agreement (κw > 0.7 for all tasks), and raters 2 and 3 had the lowest agreement for the Reach and retrieve task (task 8 = 0.63). It is important to note that a κ value of 0.80 is a very stringent cutoff, at the threshold of almost perfect agreement. Values of κ >0.60 are generally considered to indicate substantial agreement.
Once an interrater κw ≥0.8 was achieved by round 5 for the pooled tasks, 90% of the digital files were each assigned to only 1 rater, with ~10% (randomly selected by the DMAC) distributed to all 3 raters for independent scoring. This shift in allocation from 3 to 1 rater per digital file reduced the time and cost of the rating process. Periodic examination of rater reliability (10% of digital files) has demonstrated that interrater reliability is being maintained at levels greater than κw = 0.70, with most being greater than 0.80.
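The allocation scheme described above can be sketched as follows. The text states only that about 10% of files were randomly selected by the DMAC for rating by all 3 raters; how the remaining files were divided among individual raters is not described, so the shuffled round-robin used here, along with the function name, rater labels, and seed parameter, is an assumption for illustration.

```python
import random


def allocate_files(file_ids, raters=("rater_1", "rater_2", "rater_3"),
                   overlap_fraction=0.10, seed=0):
    """Map each file ID to the tuple of raters who should score it.

    A random `overlap_fraction` of files goes to every rater (for ongoing
    reliability checks); the rest go to one rater each, in shuffled
    round-robin order (an assumed scheme, not taken from the article).
    """
    rng = random.Random(seed)
    ids = list(file_ids)
    n_overlap = round(len(ids) * overlap_fraction)
    overlap = set(rng.sample(ids, n_overlap))

    single = [f for f in ids if f not in overlap]
    rng.shuffle(single)

    assignment = {f: tuple(raters) for f in overlap}
    for position, f in enumerate(single):
        assignment[f] = (raters[position % len(raters)],)
    return assignment


if __name__ == "__main__":
    plan = allocate_files(range(1, 101))
    shared = sorted(f for f, who in plan.items() if len(who) == 3)
    print(len(shared), "files rated by all 3 raters:", shared)
```

The files rated by all 3 raters are the ones that can feed a periodic recalculation of κw.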
Discussion
Findings from recent multisite definitive neurorehabilitation RCTs have been less optimistic than the phase II trials that preceded them,16-19 primarily because it is difficult to replicate the group differences from smaller-scale trials in multisite large-scale trials.20-22 This phenomenon may be, at least in part, a result of the inherent confounding factors introduced when conducting bench to bedside work and the lower methodological rigor often tolerated in smaller single-site, compared with larger multisite, trials.20-23 This article focuses on one potential confounder—random error introduced during administration and scoring of an observationally based motor behavior assessment. Improved interrater reliability is one way to increase the sensitivity of these types of measures in multisite RCTs.
The ICARE trial was powered on the log WMFT-time score, the primary outcome variable that will be used to determine the efficacy of the experimental therapy protocol. 11 The QC process we describe here was implemented for one of the secondary outcome measures—one that will be important for interpreting changes in the WMFT time score. The systematic QC process included modifications to the ASI criteria and quality checks of digital files. This process likely elevated the construct validity of this secondary outcome measure. Clinicians and researchers who wish to establish substantial agreement in using the WMFT-FAS should find the details and knowledge gained by the ICARE team particularly helpful for future endeavors.
There is no doubt that the QC process we describe is time-consuming, costly, and requires considerable resources to implement. Given that the WMFT-FAS was a secondary outcome measure, why implement such a rigorous, resource-consuming process? What might be the benefit of improved interrater reliability? Recently, See et al 24 showed that a standardized training approach used with examiners for a phase II controlled trial significantly reduced variability in scoring on the Fugl-Meyer Assessment (FMA) UE Scale. Data analysis revealed that the improved reliability on the FMA decreased the variance in scoring by 20%. In turn, a 20% reduction in variance on the FMA would allow a reduction in sample size from 137 to 88 to detect group differences for a trial powered at 80%. 24 For the ICARE Trial, an improved WMFT-FAS interrater reliability could effectively strengthen the sensitivity to detect group differences. However, we cannot know the possible impact on ICARE until we are permitted to analyze group data (expected in August 2014). For studies in which the WMFT-FAS is a primary measure,25-27 an improved reliability could lead to increased power. As shown recently, even small decreases in variability can have a large effect on the sample size required to detect a statistically significant effect. 23 Furthermore, a decrease in sample size could have a very large effect on the cost of conducting a clinical trial. Use of the revised WMFT ASI (supplementary material) could minimize the need for an extensive QC effort, decrease the cost, and increase the efficiency of future single- and multisite clinical trials in stroke rehabilitation.
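To make the link between variability and sample size concrete, the sketch below assumes a standard comparison of group means, in which the required sample size is proportional to the outcome variance when the effect size, significance level, and power are held fixed. This is a textbook simplification and may not match the calculation used by See et al 24; the function name is hypothetical.

```python
import math


def rescaled_sample_size(n_original, variance_ratio):
    """Sample size after variance changes by `variance_ratio` (new / old).

    Assumes n is proportional to outcome variance, with effect size,
    alpha, and power unchanged.
    """
    return math.ceil(n_original * variance_ratio)


if __name__ == "__main__":
    # A 20% reduction in variance.
    print(rescaled_sample_size(137, 0.8))
    # A 20% reduction in standard deviation (variance ratio 0.8 squared).
    print(rescaled_sample_size(137, 0.8 ** 2))
```

Under this simplification, a 20% reduction in variance lowers 137 to 110, whereas a 20% reduction in the standard deviation lowers it to 88. The reported change from 137 to 88 is therefore consistent with a 20% reduction expressed on the standard deviation scale, although that reading is an inference and is not confirmed by the text.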
Recently, Woodbury et al 28 used Rasch analysis to establish a hierarchy of item difficulty for 14 of 15 items based on the rating scale of the WMFT-FAS. From that analysis, the authors discovered that item 8, “Reach and retrieve,” had abnormally high missing values because of administration and filming errors. Without this item, the authors were able to establish an item difficulty hierarchy to show that higher scores on difficult items were associated with higher UE function. For ICARE, greater reliability and precision of the WMFT-FAS score will be important for determining whether the structured intervention significantly contributed to an improvement in UE motor behavior and skill above that achieved for the dose-equivalent usual and customary treatment group.
Limitations
One limitation was that, initially, we relied on the interrater reliability of the WMFT-FAS reported in Morris et al 8 and assumed that we would achieve the same, if not a higher, rater agreement level. After round 1 of reliability testing and closer examination of the findings for individual items in Morris et al, 8 we realized that this assumption was incorrect. In hindsight, we should have scrutinized previous results more carefully before conducting round 1. This may have allowed us to initiate strategies to improve interrater reliability earlier in the trial.
Another limitation is that we assumed that the initial in-person training provided to the BEs would be sufficient. However, over the course of the trial, new BEs who joined the team did not receive this in-person training, although they did have Web site access to training materials, protocols, and video demonstrations, as well as in-person access to experienced local BEs. In previous work, Morris et al suggested that evaluator training may have been insufficient because of the low agreement level for many items. 8 See et al 24 showed that standardized training and testing after training increased reliability of the FMA. We speculate that in-person training and greater scrutiny of the knowledge and skill of all BEs could have improved consistency in WMFT execution and may have hastened the achievement of interrater reliability.
A final limitation pertains to the Web-based group meetings of the rater panel. Although this process was deemed beneficial overall, we note that there was an initial familiarization period that may have adversely influenced individual scoring strategies and subsequent ratings assigned by the panel members. Initially, the raters reported a tendency to second-guess their first responses and predict how the other raters would score. This context effect dissipated with time and familiarization with the process.
Recommendations for Clinical Researchers
From this systematic QC process, we offer a few recommendations to investigators who plan to use the WMFT-FAS in their clinical trial research. (1) The quality of the visual media capture is critical. As such, close attention to the camera setup is strongly recommended to ensure sufficient visualization of the essential and desirable elements (see supplementary material). (2) Frequent and consistent (preplanned) Web-based rater panel focus-group meetings are recommended to sharpen rater skills, foster consistency in scoring, and maintain these skills over the entire course of the study. (3) Implementation of regular meetings for the BEs is recommended to maintain consistency in test execution over time and across sites. Remote meeting formats such as Go-to-Meeting, WebEx, or Adobe Connect are useful for recommendations 2 and 3. (4) To further strengthen interrater reliability of the WMFT-FAS, we recommend removal of the most problematic items in future revisions, including Forearm to table (task 1), Forearm to box (task 2), and Reach and retrieve (task 8); see rationale in Results.8,11,28 Although standardization procedures are common for implementation of clinical trials research, the first 2 recommendations are unique aspects of QC enforced in the ICARE study. Finally, we suggest that the first 3 recommendations can be generalized for use with comparable quality-based motor performance measures in the context of multisite RCTs.
Conclusions and Implications
The effort expended to modify the ASI procedures and achieve a substantial level of interrater reliability likely enhanced the construct validity of the WMFT-FAS instrument. We detail the systematic QC process developed for ICARE, so that others may benefit from a more sensitive and objective measure of motor behavior. We believe that the process of strengthening the psychometric properties of observationally based motor behavior measures is vital to advancing the science of our field and to enhancing our understanding of the mechanisms of recovery. This concern is timely, given the recent fervent discussions surrounding the nature of suboptimal motor recovery (recovery vs compensatory)29-32 and impairment-based vs task-based intervention protocols.1,33 As the number of clinical trials in neurorehabilitation grows and we attempt to demonstrate sufficient efficacy and effectiveness of our interventions, it is essential that we use reliable tools that provide information pertaining to restitution and substitution strategies.
Observational measures that provide reliable information about “how the movement was performed,” with the addition of temporal measures (eg, movement time), could offer new insights about the recovery process. The WMFT-FAS complements the WMFT-time score. Thus, we would ideally like the scores from both measures to improve. Yet discrepancies may provide greater insight. 34 For example, if movement time decreases while the WMFT-FAS score is unchanged, this may suggest that the improvement stems from suboptimal compensatory strategies. Therefore, complementary measures of motor behavior and movement time should be used to better understand the mechanisms of recovery that are affected by rehabilitative interventions.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by grants from the National Institute of Neurological Disorders and Stroke and the Eunice Kennedy Shriver National Institute of Child Health and Human Development under award numbers U01 NS056256 and T32 HD064578.
References
Supplementary Material
