Abstract
Background:
Despite the frequent use of the Timed Up and Go (TUG) test in clinical trials, evaluation of longitudinal test-retest reliability is generally lacking and still inconclusive for patients with Parkinson’s disease (PD).
Objective:
We aimed to further investigate long-term reliability and sensitivity of the TUG test among this population. Furthermore, we explored alternative assessment strategies of the test aimed at elucidating whether the inclusion or combination of timed trials may have potential implications on outcome measure.
Methods:
Relative and absolute reliability of TUG performance were obtained in forty-three subjects with PD over three timed trials in two different testing sessions separated by a two-month period.
Results:
Our results reported excellent intra-session and moderate inter-session reliability coefficients. The use of different assessment strategies of the TUG was found to have an important impact on outcome measure, highlighting the averaging of several timed trials in each testing session as a recommended alternative to minimize measurement error and increase reliability in longitudinal assessments. Nevertheless, beyond acceptable reliability, poor trial-to-trial stability of the measure appears to exist, since the ranges of expected variability upon retesting were wide and the incidence of spurious statistical effects was not negligible, especially in longitudinal repeated testing.
Conclusion:
Limitations may exist in the interpretation of the TUG outputs as part of longitudinal assessments aimed at evaluating treatment effectiveness in PD population. Researchers and practitioners should be aware of these concerns to prevent possible misrepresentations of functional ability in patients for a particular intervention.
INTRODUCTION
Parkinson’s disease (PD) is the second most prevalent neurodegenerative disease and the most common age-related movement disorder worldwide [1]. PD manifests progressively over time through several motor dysfunctions such as resting tremor, rigidity in motion, bradykinesia or postural instability [2]. As a result, patients suffer from significant deleterious effects on gait and balance that negatively impact their ability to perform everyday tasks and their quality of life [3, 4]. Thus, the standardized assessment of gait and gait-related motor tasks through simple and reliable tests is highly relevant and can provide valuable information to monitor the effectiveness of therapeutic measures and disease progression.
The Timed Up and Go (TUG) test [5] is a useful tool to assess gait-related functioning and motor symptoms in PD because it involves sequential locomotor tasks that incorporate walking and turning, which are both affected by PD [6–8]. Because of the test suitability for patients with PD, as well as the simple administration, minimal equipment or expertise in mobility analysis required, the TUG is being increasingly used as a part of routine clinical examinations [9]. The clinical utility of the TUG has been widely studied yielding excellent test-retest reliability in older persons [10] and population with different disabilities or conditions (e.g., multiple sclerosis, stroke, cerebral palsy or Huntington’s disease) [9]. However, the evidence in patients with PD is still inconclusive, since previous reliability data vary extensively [6, 11–13] and some studies suggest caution in the use of the TUG in this population [9, 13].
Several methodological aspects of reliability assessment may contribute to the mixed results and could be limiting the wider clinical utility of the TUG in patients with PD. The heterogeneity of configurations for administering the test and of procedures for obtaining reliability outputs is one important source of variability [6, 13]. Although the TUG test was originally described as including a practice trial before a single timed trial [5], the number of timed trials performed by PD patients has varied from one to five [6, 13]. This is not trivial, since different combinations of timed trials allow different ways to obtain the reliability output. In this regard, previous studies pointed out that when the TUG is used to examine change in people with PD, averaging at least two trials reduces measurement error and increases reliability [11, 13]. In a similar vein, though not in individuals with PD, it has been argued that reasonable precision for estimates of reliability requires at least three trials [14, 15] and that a more stable TUG output is achieved when selecting the fastest of the timed trials [15, 16]. Thus, exploring alternative reliability protocols in the PD population is an important undertaking to establish a consensus on how many trials patients have to perform to achieve better precision of measurement.
The representativeness of the time period between testing sessions used to obtain reliability data could be another important drawback. Assessment of test-retest reliability of the TUG in PD has usually been limited to short-interval designs, obtained within a single session [6] or over two sessions separated by only a 1-week [13, 17] or 2-week period [11, 12]. While such reports are useful, a limitation could exist in the interpretation of the measures as part of the longitudinal assessment of treatment effectiveness in clinical trials, where longer periods of time are employed [18]. Sources of error can be maximized over longer periods of time [19], and professionals should have knowledge of the expected variability of TUG measures obtained from typical long-term intervention periods.
To our knowledge, only one recent study has evaluated long-term test-retest reliability of TUG performance in patients with PD [20]. This study demonstrated good inter-session reliability and suggested the use of this test for prospective assessments of gait and mobility in PD. However, as the authors stated, their relatively small sample size (n = 15) may not have been representative enough to draw conclusions regarding reliability, and these preliminary findings need to be confirmed in a larger data set to gain generalizability. In fact, sample sizes of at least 30 participants (and preferably larger ones) are recommended to properly evaluate test-retest reliability [14, 21]. Additionally, only relative reliability, i.e., the relationship between two or more measurements obtained with the same method, was evaluated by means of the intra-class correlation coefficient (ICC). Although informative, this type of analysis alone cannot reveal the precision and expected variability of the measure for a particular patient (absolute reliability) and, consequently, distinguishing between a patient’s improvement on the test and change due to measurement error after an intervention period can be difficult [22].
Therefore, the purpose of the present study was twofold: 1) to further investigate the long-term reliability and sensitivity of the TUG test among individuals with PD; and 2) to explore alternative reliability protocols aimed at elucidating whether the inclusion and/or combination of timed trials would be a better assessment configuration to achieve better precision of measurement in this population. To this aim, we measured relative and absolute reliability of TUG performance in a group of patients with PD over three timed trials in two different testing sessions separated by a two-month period. We employed this test-retest interval since it is a representative duration of most physical therapy interventions [18, 23]. Furthermore, unlike the original protocol of the TUG test, described as including a practice trial before a single timed trial [5], we included three timed trials after the practice trial to enable the testing of reliability values through different evaluation alternatives that have been shown to improve the precision of measurement in PD and other special populations (i.e., the averaging of several timed trials [11, 13], the inclusion of at least three timed trials [14, 15] or the selection of the fastest among a series of timed trials [15, 16]).
METHODS
Participants
The study sample consisted of forty-three subjects with PD (11 females, mean age = 63.4±9.1 years). Although the initial sample consisted of forty-five participants, two of them were excluded from the final analyses because they had abnormal TUG performance (see Data analysis section). The following criteria were used to determine patient eligibility: clinical diagnosis of idiopathic PD, ability to walk independently without assistance, absence of neurologic disorders other than PD, not being treated with deep brain stimulation, and absence of medication (i.e., anti-dopaminergic), orthopedic or cardiovascular disturbances likely to affect gait function.
No subject showed sign of dementia as assessed by the Mini-Mental State Examination (MMSE scores≥23) [24]. All patients were in a mild-to-moderate stage in the progression of the disease according to the Hoehn and Yahr scale (stages 1–2.5) [25] and the average time from clinical diagnosis was 5.5±4.3 years. The level of severity of motor signs associated with PD was also examined using the Unified Parkinson’s Disease Rating Scale Part-III (UPDRS-III) [26]. The demographic and clinical characteristics of the sample are summarized in Table 1.
Table 1. Demographic and clinical characteristics of the sample
Data are reported as mean±SD. MMSE, Mini-Mental State Examination; UPDRS-III, Unified Parkinson’s Disease Rating Scale, Part III (motor function examination); H&Y, Hoehn and Yahr staging of severity of Parkinson’s disease. There were no statistically significant differences between sessions for UPDRS-III scores.
This study was conducted in full compliance with the Declaration of Helsinki 1975 (updated in Fortaleza, 2013) and approved by the Local Ethics Committee of the University of A Coruña. All subjects gave their written informed consent before their inclusion in the study.
Procedure
The study was carried out in two different testing sessions (Session 1 and Session 2) with a two-month period between them. Both sessions were conducted under similar environmental conditions and each subject was evaluated at the same time of the day during the ‘ON’ medication state (45–90 min after medication intake). Subjects were asked not to change treatment or daily activities during the two-month interval and to take medication at the same time in both testing days.
At each session, a neurologist carried out a preliminary clinical evaluation to determine disease progress including MMSE, UPDRS-III, and H&Y scale. Data from the first session were taken as a reference while data from the second session as a control measure. Subsequently, subjects received instructions regarding TUG test and were allowed one practice trial to familiarize them with the procedure before testing commenced. Experimental TUG assessments consisted of 3 consecutive timed trials (T1, T2, and T3).
Before the second session, the neurologist confirmed that each participant had experienced no significant change (e.g., medication, injury, disease progression) within the two-month period. Both testing sessions were conducted by the same rater, who had previous patient experience using the functional test. None of the subjects suffered freezing of gait episodes, dystonia, excessive rigidity, or tremor during the experimental trials.
Timed up and go test
The TUG is a mobility test designed to measure basic mobility skills in geriatric population or people with neurological conditions [5, 17]. In the assessment protocol, patients were seated on a chair, and were instructed to stand up, walk at their own comfortable and safe walking speed for 3 m, turn 180° at a designated spot, come back, and sit down on the chair again. The measured outcome is the time in seconds to complete the entire sequence. We used a digital sensor system developed in our laboratory for this purpose. Time started once the subject’s back left the chair and ended when the subject’s back touched the chair.
Data analysis
All analyses were performed using SPSS v22 (IBM, Chicago, IL). Original data were log-transformed (natural logarithms) to comply with normality assumptions and ensure the viability of parametric statistical tests. Two cases with TUG values greater than 2.2 interquartile range or 2.5 z-score were considered outliers and excluded from further analysis [27, 28].
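The screening and transformation steps described above can be sketched as follows. This is a minimal illustration, not the authors' SPSS procedure: it assumes the "2.2 interquartile range" criterion means fences placed 2.2 × IQR beyond the first and third quartiles, that either rule alone is enough to flag a case, and that screening is applied to raw times. The function names and the quartile method (Python's default) are hypothetical choices.

```python
import math
import statistics


def flag_outliers(times, iqr_multiplier=2.2, z_cutoff=2.5):
    """Return indices of values flagged by the IQR fence or the z-score rule.

    Assumption: fences are Q1 - m*IQR and Q3 + m*IQR, with m = 2.2.
    """
    q1, _, q3 = statistics.quantiles(times, n=4)  # quartile cut points
    iqr = q3 - q1
    lower = q1 - iqr_multiplier * iqr
    upper = q3 + iqr_multiplier * iqr
    mean = statistics.mean(times)
    sd = statistics.stdev(times)  # sample standard deviation
    flagged = []
    for index, value in enumerate(times):
        z_score = (value - mean) / sd
        if value < lower or value > upper or abs(z_score) > z_cutoff:
            flagged.append(index)
    return flagged


def log_transform(times):
    """Natural-log transform of TUG times (seconds)."""
    return [math.log(value) for value in times]


# Hypothetical TUG times in seconds (not study data).
example_times = [9.5, 10.1, 10.4, 10.8, 11.0, 11.3, 9.9, 10.6, 25.0]
outlier_indices = flag_outliers(example_times)
clean_times = [t for i, t in enumerate(example_times) if i not in outlier_indices]
log_times = log_transform(clean_times)
```

Whether screening is run before or after the log transformation can change which cases are flagged, so the order used here should be read as an assumption.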
Intra-session reliability
A repeated-measures ANOVA with the within-participants factor of Trial (T1, T2, and T3) was conducted on TUG output to assess intra-session performance differences. Post-hoc analysis was conducted using a Bonferroni adjustment. Sphericity was tested by means of Mauchly’s sphericity test, and the Greenhouse-Geisser correction was applied when this assumption was violated. Effect sizes were reported as partial eta-squared (η2p) or Cohen’s d (d) when appropriate.
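As a rough guide to how these effect sizes relate to the test statistics, the sketch below uses two common conversions: partial eta-squared from an F ratio and its degrees of freedom, and Cohen's d for paired data as t divided by the square root of n. The text does not state which formula for d was used, so the second conversion is an assumption, and the function names are hypothetical.

```python
import math


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta-squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohens_d_paired(t_value, n_subjects):
    """Cohen's d for paired data, assuming d = t / sqrt(n)."""
    return t_value / math.sqrt(n_subjects)


# Values taken from the Results section of this article.
eta_session_1 = partial_eta_squared(3.71, 2, 84)        # Session 1 main effect of Trial
eta_session_2 = partial_eta_squared(6.05, 1.55, 65.25)  # Session 2, corrected df
d_t2_vs_t1 = cohens_d_paired(2.53, 43)                  # Session 1, T2 vs. T1
```

Rounded to two decimals, these conversions give 0.08, 0.13 and 0.39, which agree with the effect sizes reported for the same comparisons.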
Test-retest reliability was calculated using the ICC, the corresponding 95% confidence interval (CI), and the standard error of measurement (SEM). The ICC is a relative measure of reliability indicating the ability of a test to differentiate between individuals [22]. ICC and 95% CI estimates were calculated based on a single-measure, 2-way mixed-effects model with absolute agreement. The SEM quantifies the precision of individual scores on a test based on the assessment of reliability within individual subjects. The error term from the 2-way ANOVA model was used to calculate the SEM (SEM = square root of the mean square error term from the ANOVA) [22], which represents the absolute reliability. A smaller SEM indicates better absolute reliability of the measure.
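The calculation described above can be sketched from the mean squares of a two-way ANOVA without replication. This is a minimal illustration under assumptions, not the SPSS implementation: the absolute-agreement formulas used here are the standard mean-square forms for single and averaged measures, confidence intervals are omitted, and the function names and toy data are hypothetical.

```python
import math


def two_way_mean_squares(scores):
    """Mean squares for subjects (rows), trials (columns) and residual error.

    scores: list of rows, one per subject, each holding k trial values.
    """
    n, k = len(scores), len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return ms_rows, ms_cols, ms_error


def icc_absolute_single(scores):
    """Single-measure, absolute-agreement ICC."""
    n, k = len(scores), len(scores[0])
    msr, msc, mse = two_way_mean_squares(scores)
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))


def icc_absolute_average(scores):
    """Average-measure (k-mean rating), absolute-agreement ICC."""
    n = len(scores)
    msr, msc, mse = two_way_mean_squares(scores)
    return (msr - mse) / (msr + (msc - mse) / n)


def sem_from_anova(scores):
    """SEM as the square root of the residual mean square."""
    _, _, mse = two_way_mean_squares(scores)
    return math.sqrt(mse)


# Toy data: 6 subjects x 4 repeated measurements (not study data).
toy_scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single_icc = icc_absolute_single(toy_scores)
average_icc = icc_absolute_average(toy_scores)
toy_sem = sem_from_anova(toy_scores)
```

In the toy data the columns differ systematically, which lowers the absolute-agreement ICC even though subjects are ranked fairly consistently. The same mechanism applies when TUG times shift between trials or sessions.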
Finally, the SEM was used to calculate the Minimal Detectable Change (MDC) of TUG output with the following formula: MDC = SEM × 1.96 × √2 [22, 29]. The MDC is the threshold that distinguishes true change from apparent change due to measurement error [22, 29], and thus helps determine which changes could be attributed to a hypothetical therapy. The smaller the MDC, the more sensitive the measurement.
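As a worked instance of the formula, the sketch below applies it to the SEM values reported in the Results. The function name is hypothetical. Because the published SEMs are rounded to two decimals, recomputed MDC values can differ from the tabulated ones in the last digit.

```python
import math


def minimal_detectable_change(sem, z=1.96):
    """MDC at the 95% level: SEM x 1.96 x sqrt(2)."""
    return sem * z * math.sqrt(2)


# SEM values (seconds) as reported for the inter-session analysis.
reported_sems = {"first trial": 1.35, "optimal trial": 1.16, "average": 1.23}
recomputed_mdcs = {
    label: minimal_detectable_change(sem) for label, sem in reported_sems.items()
}
```

The factor √2 reflects that a change score combines the measurement error of two assessments, and 1.96 is the z value for a 95% confidence level.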
Inter-session reliability
Three different combinations of TUG timed trials were employed to obtain the inter-session reliability: first trial (first trial after practice), optimal trial (the fastest trial for each subject and session) and average performance (mean of trials 1–3 for each subject and session). A paired-samples t-test was conducted on each output to assess inter-session performance differences. ICC and 95% CI estimates were calculated based on a single measure (for first and optimal trials) or k-mean rating (for average performance), 2-way mixed-effects model, absolute agreement. SEM and MDC were calculated following the procedures described above.
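The three scoring strategies can be expressed compactly for one subject and one session. This is a minimal sketch with hypothetical names and made-up times. It assumes the trials are listed in the order performed, after the practice trial.

```python
from statistics import mean


def summarize_session(timed_trials):
    """Return the three TUG outputs for one subject in one session.

    timed_trials: times in seconds for T1, T2, T3, in the order performed.
    """
    return {
        "first": timed_trials[0],      # first timed trial after practice
        "optimal": min(timed_trials),  # fastest trial (shortest time)
        "average": mean(timed_trials), # mean of trials 1-3
    }


# Hypothetical subject (not study data).
session_1 = summarize_session([10.0, 9.0, 11.0])
session_2 = summarize_session([9.5, 9.0, 8.5])
change_in_average = session_2["average"] - session_1["average"]
```

Each output is then compared across sessions separately, so the same subject contributes one pair of values per strategy.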
In order to rule out possible influence of intra-individual variability of the motor signs of the disease on our results, we analyzed the inter-session differences in UPDRS-III scores by using a t-test for dependent samples. Additionally, a simple linear regression analysis was carried out to test if UPDRS-III change could significantly predict longitudinal variations in the TUG performance.
RESULTS
Intra-session reliability
The TUG output showed an excellent test-retest reliability across Trials in both Session 1 (ICC = 0.96) and Session 2 (ICC = 0.94). SEM values were 0.44 and 0.46 in Session 1 and 2, respectively. MDC values were 1.21 in Session 1 and 1.28 in Session 2 (see Table 2).
Table 2. Intra-session descriptive and reliability data of Timed Up and Go test
*Indicates statistically significant differences with respect to Trial 1; #indicates statistically significant differences with respect to Trial 2 (p < 0.05). ICC, intraclass correlation coefficient (single measure, 2-way mixed-effects model, absolute-agreement); 95% CI, confidence interval; SEM, standard error of measurement; MDC, minimal detectable change (seconds).
The repeated-measures ANOVA on TUG output showed a significant main effect of Trial in both Session 1 (F2,84 = 3.71, p = 0.028, η2p = 0.08) and Session 2 (F1.55,65.25 = 6.05, p = 0.007, η2p = 0.13). Further post-hoc analysis in Session 1 reported an improvement in TUG performance in T2 vs. T1 (t42 = 2.53, p = 0.046, d = 0.39). In a similar vein, Session 2 also depicted an improvement in performance in T3 vs. T1 (t42 = 3.03, p = 0.01, d = 0.46) and T3 vs. T2 (t42 = 2.83, p = 0.02, d = 0.43) (see Table 2).
Inter-session reliability
The test-retest reliability analysis showed poor to moderate reliability across sessions for the first trial (ICC = 0.48), optimal trial (ICC = 0.55) and average performance (ICC = 0.70) output. SEM values were 1.35 for the first trial, 1.16 for the optimal trial and 1.23 for average performance. MDC values were 3.75 for the first trial, 3.22 for the optimal trial and 3.40 for average performance (see Table 3).
Table 3. Inter-session descriptive and reliability data of Timed Up and Go test
*Indicates statistically significant differences between sessions (p≤0.05). ICC, intraclass correlation coefficient (a, single measure or b, mean-rating [k = 3]; 2-way mixed-effects model, absolute-agreement); 95% CI, confidence interval; SEM, standard error of measurement; MDC, minimal detectable change (seconds). First trial (first trial after practice), optimal trial (the fastest trial for each subject and session) and average performance (mean of trials 1–3 for each subject and session).
The paired samples t-test reported a statistically significant improvement in Session 2 (vs. Session 1) for all TUG outputs: first trial (t42 = 2.01, p = 0.05, d = 0.31), optimal trial (t42 = 2.33, p = 0.03, d = 0.36) and average performance (t42 = 2.26, p = 0.03, d = 0.35) (see Table 3).
The optimal trial was Trial 3 for 17 (39%) and 24 (56%) subjects, Trial 2 for 14 (33%) and 9 (21%) subjects, and Trial 1 for 12 (28%) and 10 (23%) subjects in Session 1 and 2, respectively.
The motor signs of the disease reported by the UPDRS-III were comparable between sessions (t42 = –0.32, p = 0.75) (see Table 1). Furthermore, the simple linear regression analysis indicated that the UPDRS-III change scores were not a significant predictor of longitudinal variations in the TUG performance in our sample (R2 = 0.001; F1,42 = 0.06, p = 0.81).
DISCUSSION
In the present study, we reported the long-term reliability and sensitivity of the TUG test among patients with PD by exploring different assessment strategies aimed at increasing precision of measurement. Our results reported excellent intra-session and moderate inter-session reliability, indicating that the use of alternative TUG configurations, i.e., averaging trials or selecting the fastest within a series of three attempts, could minimize the measurement error and increase reliability in longitudinal assessments. However, the TUG times decreased significantly with intra- and inter-session repeated testing, indicating poor trial-to-trial stability over time and suggesting caution in the interpretation of results. Our descriptions of the SEM and MDC provided criteria about the expected variability of the measure on long-term retesting to interpret whether a change in TUG output can be considered an intervention-related real improvement or it is attributable to measurement error.
Like many previous studies, we primarily characterized test-retest reliability of the TUG by using the ICC as the main reference output. The intra-session coefficients obtained in this study (0.94–0.96) fell within the range of reliability values found in previous research [6, 30], although the inter-session reliability was markedly lower and varied extensively (0.48–0.70). In this regard, our data confirmed that the use of different assessment strategies of the TUG may considerably impact on long-term reliability values among patients with PD, since the mean of the three timed trials was reported to represent the highest ICC value (0.70), followed by the output of the optimal (0.55) and the first timed trial (0.48). This finding highlighted the heterogeneity of procedures to obtain reliability outputs as an important source of variability in previous studies [6, 13] and seems to cast some doubt on the use of the original TUG instructions, which describe a practice trial before a single timed trial [5], at least when the test is used in repeated long-term longitudinal assessments.
In light of our results, and reinforcing previous evidence [11, 31], it seems that taking the average of several timed trials in each testing session is a feasible way to minimize measurement error and increase reliability in longitudinal assessment. However, even with this approach, the inter-session reliability obtained in our study (0.70) was lower than the values found in previous studies (0.77–0.85) [12, 20]. This apparent discrepancy is better understood by considering the time period between testing sessions. While higher reliability values (0.80–0.85) have usually been reported using shorter test-retest time frames (7–14 days) than that used here [12, 17], the only study evaluating reliability of the TUG over a longer time frame in patients with PD showed a lower coefficient (0.77) [20], which is comparable to our output. Although both long-term reliability values could be considered acceptable from a statistical perspective, the relative weakness of the coefficients is in line with previous evidence suggesting caution in the use of this test in people with PD [9, 13], especially when the purpose is to re-evaluate over long time frames. Additionally, it is critical to bear in mind that reliability for individual patient application needs to be higher than for application in a group setting. While the minimal standard for reliability coefficients is typically considered to be 0.70 when measures are used for group comparisons, a more stringent minimum reliability threshold of 0.90 has been suggested as acceptable for individual clinical measurements over time [32–34]. Since the error around an individual’s score is larger than the error around a group mean score, reliability coefficients below this threshold yield intervals too wide to be useful for individual patient application and, consequently, functional tests not reaching these reliability standards should not be recommended for use in people with PD [13].
Another important factor that may account for the inconsistent results across studies could be the characteristics of the sample. It is important to consider that the ICC can be affected by sample homogeneity, i.e., if a study population is highly homogeneous, it is more difficult to distinguish between individuals and the ICC deviates from 1. Thus, the low ICC values obtained in our study may be partly explained by three factors: more than 90% of our sample was in stage 2 or lower of disease severity according to H&Y (the lowest of all the reliability studies), our participants demonstrated optimal and homogeneous group performance in the TUG (e.g., 10.58 s±2.01), and sources of error can be maximized over longer periods of time [19]. In fact, a previous study with a sample that captured a wide spectrum of disease severity (H&Y 1–4) and demonstrated greater heterogeneity in TUG performance (15 s±10) obtained one of the highest values of inter-session reliability [13].
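The dependence of the ICC on sample spread can be illustrated with the classical relation SEM = SD × √(1 − ICC), rearranged to ICC = 1 − SEM²/SD². This relation is an assumption brought in for illustration. It is not the two-way ANOVA procedure used in this study, so the resulting numbers will not reproduce the reported coefficients. The function name is hypothetical.

```python
def icc_from_sem_and_sd(sem, sd):
    """Reliability implied by a given SEM and between-subject SD.

    Assumes the classical relation SEM = SD * sqrt(1 - ICC).
    """
    return 1.0 - (sem ** 2) / (sd ** 2)


# Same measurement error (SEM = 1.23 s), two different sample spreads.
homogeneous_sample = icc_from_sem_and_sd(1.23, 2.01)     # SD as in this sample
heterogeneous_sample = icc_from_sem_and_sd(1.23, 10.0)   # SD as in the study cited as [13]
```

With identical measurement error, the narrow sample yields a coefficient near 0.63 while the wide sample yields one near 0.98. This is why absolute indices such as SEM and MDC are more comparable across studies than the ICC alone.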
Unlike most previous reliability studies in patients with PD, we examined performance differences between testing trials and provided test specific descriptive statistics. By doing so, we were able to demonstrate that TUG times decreased significantly with repeated testing both intra- and inter-session. For example, our data showed that the TUG times during the second session were on average 0.63 seconds shorter in comparison to the first session, a difference lower than the variability that can be expected upon retesting (SEM = 1.23) or the value defined as a meaningful change (MDC = 3.40). Even though the testing sessions were conducted two months apart, it cannot be ruled out that a small but significant familiarization or learning effect may have occurred as a function of the accumulated repetition of trials (i.e., from 10.71 s in trial 1 of the first session to 9.80 s in trial 3 of the second session). Although familiarization effects have not been reported in some studies that proposed a shorter period between testing sessions [17], experience allows for improved TUG performance [35] and learning effects are not negligible in repeated testing among patients with PD [6]. Therefore, since many of the reliability studies to date have not evaluated statistical differences between trials [11, 20] and, in those cases where differences were tested the results were mixed [6, 17], future studies should determine whether this factor is relevant or not, and whether it could introduce a systematic bias in long-term evaluations.
In any case, beyond the factors that could underlie these differences between trials, what is important here is that our findings confirmed that, in addition to the low inter-session reliability coefficients, the measurement error of the TUG in long-term assessment was substantial and that statistically significant differences in performance can be captured even when relevant patient improvement on the test cannot be assumed. This observation is critical for physical therapy intervention studies, since functional improvements in patients are sometimes assumed from small-magnitude differences inferred only from traditional p-value-based analyses [36]. This type of analysis alone provides information on the statistical significance of performance changes, but it does not indicate the likely cause or origin of that change [37]. Thus, in order to avoid misrepresentation of functional ability in patients, professionals must consider whether change scores in performance represent true functional changes or are a result of variability attributable to systematic bias or random measurement error. As we have shown here, the use and reporting of statistics such as SEM and MDC, in addition to the traditional p-value-based analysis of differences, might be a useful approach to addressing these issues. Establishing this magnitude threshold would provide further assurance that intervention-related functional improvements contribute more than any other source of error to the observed changes in performance, thus avoiding the problem of spurious statistical effects [37].
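One way to apply this reasoning is to compare an observed change against the SEM and the MDC before interpreting it. The sketch below is illustrative only: the function name and the three category labels are hypothetical, and the thresholds are the values reported in this study for average performance.

```python
def interpret_change(change_seconds, sem, mdc):
    """Classify an observed TUG change relative to measurement error.

    Returns one of three labels, from least to most convincing change.
    """
    magnitude = abs(change_seconds)
    if magnitude >= mdc:
        return "exceeds MDC"
    if magnitude > sem:
        return "between SEM and MDC"
    return "within SEM"


# Mean inter-session difference reported above: 0.63 s, with SEM = 1.23 and MDC = 3.40.
group_result = interpret_change(0.63, sem=1.23, mdc=3.40)
```

For the 0.63 s difference the function returns "within SEM", even though the paired t-test on the same data was statistically significant. This is the situation the paragraph above describes.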
While providing new information about the long-term reliability of the TUG, the main findings of this study should be considered with regard to its limitations. First, the generalization of the present findings should be limited to patients with mild to moderate disease severity (H&Y score ranging from 1 to 2.5), such as those included here. Different degrees of functional loss in TUG performance have been described at different points in the disease process [38] and, consequently, these data may not be representative of the population with PD in general. Second, special caution is needed in interpreting the results obtained with the original TUG instructions, i.e., a single timed trial preceded by a practice trial. Finding better coefficients of longitudinal reliability through alternative TUG configurations, i.e., by selecting the average or optimal trial, does not preclude the possibility that the additional timed trials during the first session may have intensified the practice-related improvement observed in the first timed trial of the second session. However, it has been reported that learning effects can be present even when practice is limited to a minimum within a single session [6]. Furthermore, in cases where systematic changes are present between consecutive trials, modification of the measurement configuration by adding trials is encouraged to minimize bias [22]. Still, our findings confirmed that adding or combining timed trials improved the inter-session consistency of the measure, but this was not enough to achieve a desirable ‘plateau’ of performance even when a two-month period was used between testing sessions. Thus, since the presence of systematic bias could not be ruled out from our data, any estimate of within-subject variation should be interpreted cautiously [14, 22].
The inclusion of a group of patients with PD not receiving the therapy, in order to differentiate practice-related effects from true therapeutic changes, should be an essential criterion for any intervention study in which repeated measures of the TUG are required.
CONCLUSIONS
Our results suggest that the TUG test can provide acceptable test-retest reliability values in patients with PD, although poor trial-to-trial stability of the measure appears to be masked, especially when the test is used in longitudinal repeated testing. The use of different assessment configurations of the TUG was found to have an important impact on the outputs of reliability and precision of measurement in this population, highlighting the averaging of several timed trials in each testing session as a recommended alternative. Nevertheless, the ranges of expected variability of the measure upon retesting can be wide and the incidence of spurious statistical effects was not negligible. Limitations may exist in the interpretation of the measures as part of longitudinal assessments, which prevent us from recommending the use of the TUG as a tool for evaluating treatment effectiveness in patients with PD, unless alternative protocols can guarantee better reliability and stability of the measure. Furthermore, even when the reliability for application in a group setting could be considered acceptable (albeit limited), its use for individual patient application over time should be avoided. Researchers and practitioners should be aware of these concerns to prevent possible misrepresentations of functional ability in patients for a particular intervention.
ACKNOWLEDGMENTS
M. Fernandez-del-Olmo and J.A. Sanchez-Molina have been funded by the Spanish Ministry of Economy, Industry, and Competitiveness (ref. DEP2017-87384-R). Santos-Garcia, D. has received honoraria for educational presentations and advisory services from Abbvie, UCB Pharma, Lundbeck, KRKA, Zambon, Bial, Italfarmaco, and Teva. Luque-Casado, A. has received a ‘Juan de la Cierva’ postdoctoral grant from the Spanish Ministry of Science and Innovation (ref. FJCI-2016-28405). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
CONFLICT OF INTEREST
The authors have no conflict of interest to report.
