Abstract
Objectives:
To determine intra-rater and inter-rater reliability of functional outcome measures in adults with neurofibromatosis 1 and to ascertain how closely objective and subjective measures align.
Methods:
A total of 49 ambulant adults with neurofibromatosis 1 aged 16 years and over were included in this observational study: median age 31 years (range: 16–66 years), 29 females, 20 males. Participants were video-recorded or photographed performing four functional outcome measures. Four raters from the neurofibromatosis centre multi-disciplinary team independently scored the measures to determine inter-rater reliability. One rater scored the measures a second time on a separate occasion to determine intra-rater reliability. The measures evaluated were the functional reach, timed up and go, 10 m walk and modified nine-hole peg tests. Participants also completed a disease-specific quality-of-life questionnaire.
Results:
Inter-rater reliability and intra-rater reliability scores (intra-class correlation coefficient, ICC) were similar for each outcome measure. Excellent rater agreement (ICC ⩾ 0.9) was found for the functional reach, timed up and go and the 10 m walk tests. Rater agreement was good for the modified nine-hole peg test: ICC = 0.75 for intra-rater reliability and 0.76 for inter-rater reliability. The timed up and go and the 10 m walk tests correlated highly with perceived mobility challenges in the quality-of-life questionnaire.
Conclusion:
The functional reach, timed up and go and 10 m walk tests are potentially useful outcome measures for monitoring neurofibromatosis 1 treatment and will be assessed in multi-centre and longitudinal studies.
Introduction
Neurofibromatosis 1 (NF1) is a common, inherited disease associated with benign and malignant peripheral nerve sheath tumours. 1 The complications of NF1 are variable, unpredictable and widespread, and cognitive impairment is common. 1 Challenges with functional tasks such as walking, balance or using the hands are common in NF1. 2 They may arise from tumours causing pressure on peripheral nerves or the spinal cord, central nervous system tumours or skeletal abnormalities. The relationship between objective performance and subjective experience is also important. Functional challenges may be amenable to medical, surgical or physical interventions and there is a need for robust functional outcome measures in this patient group to assess treatment efficacy. To the best of our knowledge, there has been no systematic evaluation of functional, motor outcome measures in adults with this disease.
An essential requirement of robust outcome measurements is that they are reliable. 3 Reliability is defined as ‘the degree to which measurement is free from measurement error’. 4 It is important to evaluate properties such as reliability within the target population, as variability within a disease strongly influences outcome measurement results. 3
Inter-rater reliability requires the same group of subjects to be measured at the same time by different observers, and intra-rater reliability considers the same subjects and the same observer with measurements taken at different time points. 5 Absolute reliability is expressed as the standard error of measurement (SEM) and this can be calculated from rater reliability. Minimal detectable change (MDC) is the smallest change in an instrument score that can be confidently distinguished from measurement error; it may be calculated from the SEM. 6
The INF1-QOL questionnaire (impact of NF1 on quality-of-life questionnaire) is a validated, reliable disease-specific questionnaire. 2 Responders categorise problems as no issues, mild, moderate or severe, in the 14-item, self-report, quality-of-life questionnaire. It includes two functional domains: walking and using the hands.
Advances in molecular biology have facilitated the development of novel therapies, including drugs with the potential to treat symptomatic neurofibromas, and this has accelerated the quest for functional outcome measures. The primary aim of this study was to evaluate inter- and intra-rater reliability of four commonly used gait, balance and hand function outcome measures in adults with NF1. From these data, we calculated the SEM and MDC. The secondary aim of this study was to correlate patients’ perceived mobility and upper limb function as rated through the INF1-QOL questionnaire with their objective functional outcome measurement scores.
Methods
Guy’s and St. Thomas’ NHS Foundation Trust is a national centre for the diagnosis, management and support of 1150 people with NF1.
All adults (aged 16 years and over) with NF1 who attended their clinic appointments during the 4-month recruitment period (May–September 2015) were approached by letter inviting them to take part in this observational study. We aimed to recruit 50 participants as recommended for a reliability study. 3 At the time of their appointment, the treating clinician (doctor or nurse) confirmed they met the inclusion/exclusion criteria and ascertained whether they wished to take part or not.
To be included in this study, the participants needed to meet the following requirements: have a clinical diagnosis of NF1, be aged 16 years or over, have sufficient cognition to provide informed consent, not have significant mobility or balance impairments that are unrelated to their NF1 and be able to walk more than 10 m without physical assistance (may use walking aids).
Written consent was collected by the researcher and the participant was given a unique alphanumeric research identification code. Participants provided demographic information and completed a NF1 quality-of-life patient-reported outcome measure (INF1-QOL). Each participant completed three repetitions of each of the chosen outcome measures while being video recorded or photographed by the researcher.
Ethical permission was granted by National Research Ethics Service- Hampstead, reference 15/LO/1084.
Outcome measurement selection
A review of the evidence base identified that a wide variety of motor performance outcome measures have undergone metric evaluation in comparable cohorts such as chronic pain, community dwelling older adults with multi-morbidity, spinal cord injury, stroke and multiple sclerosis (MS). The research team chose to evaluate motor performance outcome measures for walking, balance and use of the hands, based on the functional challenges identified by people who have NF1 in a pre-study focus group and functional challenges identified in the INF1-QOL questionnaire. 2 The four selected outcome measures were chosen based on a high rate of rater reliability in comparable conditions, and following communication with clinicians and researchers in this field, who specified that the outcome measurements needed to be quick and easy to perform and interpret in the outpatient clinic environment to ensure long-term uptake into practice.
The functional reach test assesses standing balance. In the functional reach test, 7 the participant stands parallel to a wall with arms at 90° of shoulder flexion and reaches forward as far as they can without taking a step. A photograph was taken at the furthest point that the participant was able to reach and measurements are recorded to 1 mm.
The timed up and go test assesses functional mobility. In the timed up and go test, 8 participants stand from a chair, walk 3 m, turn around and return to the chair. Measurements are recorded to milliseconds.
The 10 m walk test assesses functional mobility and gait speed. In the 10 m walk test, 9 participants walk at their normal speed along a measured walkway. Measurements are recorded to milliseconds.
The modified nine-hole peg test assesses upper limb function through dexterity. In the modified nine-hole peg test, the participant takes pegs from a bowl and places them into the holes of a peg board. Measurements are recorded to milliseconds on a digital stopwatch.
Rating process
Video recordings and photographs of participants performing the outcome measurement tests were immediately transferred to a secure electronic location. Four raters watched and rated the videos and photographs separately to assess inter-rater reliability and they posted their scores for each test into a sealed box. One of the raters rated the photographs and video recordings a second time, to assess intra-rater reliability. The rater team comprised four experienced members of the NF1 multi-disciplinary team including doctors, a nurse and a physiotherapist. The researcher collated these data onto a spreadsheet. Data were transferred to SPSS for statistical analysis (Figure 1).

Figure 1. Study flow diagram.
Bias
Several steps were taken to counter bias in this study. The researcher was not involved in the recruitment process or as a video rater to reduce the risk of researcher bias such as selection bias. The intra-rater reliability tester was instructed to watch the videos a second time, only after they had watched all 49 sets of the videos through once to ameliorate recall bias. Outcome measurements were completed with the same researcher (R.M.) with standardised instructions to reduce the risk of performance bias. Videos were taken of the outcome measurement sessions and used for analysis to ensure all raters saw the same test after being provided with training on how to interpret findings to reduce the risk of detection bias.
Statistical analysis
Data from all measures were analysed using the IBM Statistical Package for the Social Sciences (SPSS) version 23. A two-way mixed-effects model was used to calculate the intra-class correlation coefficient (ICC 3,1) and evaluate relative intra-rater reliability of the 10 m walk test, timed up and go test, functional reach test and nine-hole peg test. A two-way random-effects model (ICC 2,1) was used to evaluate inter-rater reliability of the 10 m walk test, functional reach test, timed up and go and nine-hole peg test (see Table 2). The statistical analysis processes align with other studies investigating inter- and intra-rater reliability of the selected functional outcome measures. 10–12
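For readers who want to see what the two single-measure coefficients compute, the following is a minimal sketch of the standard two-way ANOVA formulas for ICC(2,1) and ICC(3,1). It is not the SPSS procedure used in the study: the function name `icc_single` is hypothetical, the score table is illustrative and is not study data, and ICC(3,1) is taken here in its consistency form.

```python
def icc_single(scores):
    """Return (ICC(2,1), ICC(3,1)) for a subjects-by-raters score table.

    `scores` is a list of rows: one row per subject, one column per rater
    (or, for intra-rater reliability, one column per scoring occasion).
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of raters
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Partition the total sum of squares into subjects, raters and residual.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-subjects mean square
    ms_cols = ss_cols / (k - 1)                 # between-raters mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    # ICC(2,1): two-way random effects, absolute agreement, single rater.
    icc_2_1 = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    # ICC(3,1): two-way mixed effects, consistency, single rater.
    icc_3_1 = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return icc_2_1, icc_3_1


# Illustrative scores only (6 subjects rated by 4 raters); not study data.
example_scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_2_1, icc_3_1 = icc_single(example_scores)
print(f"ICC(2,1) = {icc_2_1:.2f}, ICC(3,1) = {icc_3_1:.2f}")
# prints: ICC(2,1) = 0.29, ICC(3,1) = 0.71
```

In this illustrative table the raters differ systematically in how high they score, which lowers ICC(2,1) because it penalises between-rater offsets, while ICC(3,1) ignores them.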
The ICC is a number between 0 and 1: 1 represents perfect reliability with no measurement error and 0 represents no reliability. Values less than 0.5 are indicative of poor reliability, values between 0.5 and 0.75 indicate moderate reliability, values between 0.75 and 0.9 indicate good reliability, and values greater than 0.90 indicate excellent reliability. 13
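The interpretation bands above can be written as a small lookup. This is a sketch only: the function name `classify_icc` is hypothetical, and assigning a value that falls exactly on a boundary to the higher band is an assumption, chosen so that 0.75 reads as good and 0.9 as excellent, as in the abstract.

```python
def classify_icc(icc):
    """Map an ICC value to a reliability band.

    Assumption: boundary values (0.5, 0.75, 0.9) belong to the higher band.
    """
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.9:
        return "good"
    return "excellent"


# The two nine-hole peg test ICCs from the abstract, plus the 0.9 threshold.
for value in (0.75, 0.76, 0.9):
    print(value, classify_icc(value))
```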
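The SEM, a measure of absolute inter- and intra-rater reliability, was calculated for each measure in adults with NF1 with the following equation

$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - r}$$

MDC was calculated for each measure from the SEM using the following equation

$$\mathrm{MDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM}$$

where MDC is the minimal detectable change, SEM is the standard error of measurement, SD is the standard deviation of the scores and r is the reliability (ICC).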
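As a worked illustration, the sketch below computes SEM and MDC from a standard deviation and a reliability coefficient. It assumes the conventional forms SEM = SD × √(1 − r) and MDC = z × √2 × SEM with z = 1.96 (a 95% confidence level); the confidence level, the function names and the input values are assumptions for illustration and are not taken from the study tables.

```python
import math


def standard_error_of_measurement(sd, r):
    """SEM from the score standard deviation and the reliability (ICC)."""
    return sd * math.sqrt(1.0 - r)


def minimal_detectable_change(sem, z=1.96):
    """MDC from the SEM; z = 1.96 assumes a 95% confidence level."""
    return z * math.sqrt(2.0) * sem


# Hypothetical inputs: SD of 2.0 (in the units of the test) and ICC of 0.75.
sd_example = 2.0
r_example = 0.75
sem = standard_error_of_measurement(sd_example, r_example)
mdc = minimal_detectable_change(sem)
print(f"SEM = {sem:.2f}, MDC = {mdc:.2f}")
# prints: SEM = 1.00, MDC = 2.77
```

Because SEM shrinks as r approaches 1, a measure with higher rater reliability yields a smaller MDC for the same spread of scores.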
Results
A total of 85 adults with NF1 were sent invitation letters for this study: 15 did not wish to take part, 14 did not meet the eligibility criteria and 7 had not received the participant information sheet more than 24 h before their appointment. Thus, 49 ambulant adults with NF1 volunteered and participated in the study.
There were 29 females and 20 males in this study with a mean age of 31 years (range: 16–66 years). The range of scores for the INF1-QOL was 1–26, mean score was 9 (where 0 indicates no difficulty and 42 indicates severe difficulty in all 14 domains).
Table 1 details the ICC for intra-rater and inter-rater reliability for each of the four outcome measurements with 95% confidence intervals (95% CI). Reliability (ICC) was excellent with low measurement error and tight 95% CIs for the functional reach, timed up and go and 10 m walk test time and speed. The modified nine-hole peg test had lower ICCs and wider 95% CI than the other three measures. For each outcome measure, ICC and 95% CI values were comparable for inter- and intra-rater reliability.
Table 1. Rater reliability: intra-class correlation coefficient (ICC) scores with 95% confidence intervals (95% CI) for outcome measures.
Table 2 details the mean score and range for each of the above tests alongside clinically important MDC. There was a more or less continuous distribution for all measures. The wide range of times was not simply due to outliers but reflected the clinical heterogeneity of NF1.
Table 2. Mean scores for each outcome measure with standard deviation, range, SEM and MDC scores, calculated from inter-rater reliability.
SEM: standard error of measurement; MDC: minimal detectable change.
Table 3 details the correlations between each functional outcome measure and the INF1-QOL questionnaire, which measured patient-reported, disease-specific quality of life. Pearson correlations were computed between each functional test and the total INF1-QOL questionnaire score, as well as the sub-items for walking (question 7) and hand function (question 8). All functional tests correlated significantly with the INF1-QOL total score. For question 7 ‘walking’, the best correlations were for the two measures of mobility. By contrast, the correlations with question 8 ‘hand function’ were largely non-significant.
Table 3. Pearson correlation (r) between each functional test and subsections of the INF1-QOL questionnaire, with significance level.
ns: not significant.
*p < 0.05; **p < 0.01.
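The correlations in this section are Pearson coefficients. As a minimal sketch of that calculation, the function below computes r for paired observations; the name `pearson_r` and the paired values are hypothetical and are not study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical pairs: walk time in seconds and a 0-3 questionnaire item score.
walk_times_s = [7.1, 8.4, 9.0, 10.2, 12.5, 14.8]
walking_item = [0, 0, 1, 1, 2, 3]
print(round(pearson_r(walk_times_s, walking_item), 2))
```

The sign of r depends on how each test is scored: a timed test rises with difficulty, whereas a reach distance falls, so a negative coefficient can still indicate agreement between objective and perceived function.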
Discussion
In this study, we evaluated the inter- and intra-rater reliability of a set of functional outcome measures in 49 adults with NF1 who were representative of the typical ranges of disease severity seen in a published study of quality of life in adults with NF1. 2 The functional reach test, timed up and go test and 10 m walk tests demonstrated excellent reliability for both inter- and intra-rater reliability. Interestingly, the modified nine-hole peg test demonstrated lower inter- and intra-rater reliability than the other measures tested.
The reliability scores (ICC) for inter- and intra-rater reliability of the functional reach test were excellent. They align with high levels of inter- and intra-rater reliability for normal 14 and frail elderly adults. 15 The SEM calculated from these findings was similar to that reported in several other clinical populations, with published standard errors of measurement between 1.86 and 2.91 cm for spinal cord injuries, 16 stroke 11 and peripheral vestibular disorders. 17 The MDC for this measure varied between clinical populations: from 5 cm in spinal cord injury patients to 11.5 cm in Parkinson’s disease. 18,19 Relative to the mean functional reach score, an MDC of 8.08 cm in NF1 was deemed acceptable.
Inter- and intra-rater reliability scores for the timed up and go test were also excellent. They align with high levels of inter- and intra-rater reliability for healthy, normal older adults. 20 There are no published data on SEM or MDC for the timed up and go test so currently, it is not possible to compare our findings against other clinical groups. Both SEM and MDC scores were deemed acceptable in NF1.
Inter- and intra-rater reliability scores for the 10 m walk test were also excellent. They align with findings in healthy adults, spinal cord injuries, stroke and traumatic brain injuries, where ICCs were greater than 0.9. 21–24 The SEM aligns with comparable clinical populations including spinal cord injuries, 16 strokes 11 and geriatrics. 25 An MDC of 0.26 m/s is similar to MDC scores for similar clinical populations.
The reliability scores (ICC) for the modified nine-hole peg test were 0.75 and 0.76 for intra-rater and inter-rater reliability, respectively. This is lower than the classic nine-hole peg test 26 when used in healthy adults and in multiple sclerosis, where reliability scores were greater than 0.9. 27,28 We cannot ignore that this may be because we used a modified version of the test, but based on rater feedback we suggest that this test may be difficult for people with NF1 because the test requires sustained concentration. Individuals with NF1 often have cognitive impairment, including difficulty with concentration and planning; as a result, the start and end points of the test were difficult to determine because some participants hesitated before continuing the task. There may also be a muscle force component. 29
High levels of rater reliability indicate consistency in interpretation, within rater and between raters. MDC scores calculated from data collected within this study provide the clinician with an important marker that may assist with their decision-making. The next stage of metric evaluation of these outcome measures, test–retest reliability, will reveal how stable the measures are over a time period where NF1 symptoms remain stable.
There was an overall correlation between the functional outcome measurement scores with the total score for the quality-of-life questionnaire (INF1-QOL), but greater correlations with sub-items. The highest level of correlation was between the mobility outcome measures (timed up and go and 10 m walk test) and question 7 of the INF1-QOL measure which related to walking. The functional reach test is less closely aligned with any questions in the INF1-QOL questionnaire but still achieves statistical significance (–0.47). This may be because the functional reach specifically evaluates standing balance, a factor not specifically targeted within the INF1-QOL questionnaire. As balance and falls were raised as an important concern for people with NF1 in the pre-study focus group, this measure may still be of benefit and deserves further exploration as part of future trials. Interestingly, there was a small but significant correlation (–0.48) between the modified nine-hole peg test and question 8 of the INF1-QOL measure which relates to hand function. As cognitive processing contributes to the time taken to perform this test and rater reliability is not as good as the other measures, it does not appear to be as useful an outcome measure for assessing upper limb function in this patient group and measures such as grip dynamometry may be more appropriate.
Limitations of study
To our knowledge, this is the first study to evaluate reliability of functional outcome measures through use of videos. By video recording and photographing participants, we could be certain that all raters analysed the same test, from the same angle and it also ensured that the researcher could continue to stand close to the participant for balance and mobility testing and ensure safety as per routine clinical practice. We acknowledge that outcome measurement assessments are not normally conducted through video media but this testing regime was acceptable to participants and the research team who needed to fulfil research duties around their clinical commitments. We chose to evaluate reliability of outcome measures across a variety of professions (doctors, physiotherapists, nurses) to ensure that raters represented the multi-disciplinary team. We were limited by the time available to carry out the study but as outlined, this did not impact on the value of our data. Assessment of other upper limb outcome measures would have proved fruitful and this will be the focus of future work.
Originally we aimed to recruit 50 individuals for the study, but achieved 49 participants. Although this might appear to be a limitation, in practical terms it did not alter the reliability estimates. This is because the reliabilities that we observed were predominantly 0.9 or better, apart from the modified nine-hole peg test, which was 0.75–0.76. Indeed, sample size theory presented by Shoukri et al. 30 indicates that our study is adequately powered for reliabilities in the range observed. Many reliability studies, using similar methods, have employed as few as 16 participants to evaluate reliability. 12
Conclusion
There is a need for reliable functional outcome measures to monitor treatment and to evaluate novel therapy in NF1 adults. The functional reach, timed up and go and 10 m walk tests had excellent inter- and intra-rater reliability and were quick and easy to perform in a clinic setting. Furthermore, these tests correlated highly with perceived functional challenges of mobility in the INF1-QOL questionnaire. The modified nine-hole peg test was slightly less reliable and other upper limb measures such as dynamometry should be evaluated in this group of patients. Our future aims will be to evaluate these motor outcome measures in multi-centre and longitudinal studies. We will also use them as tools for assessing patient outcome of therapeutic interventions.
Footnotes
Acknowledgements
The authors acknowledge Alexandra Curtis and Rona Inniss for their valuable contributions in the write up phase of this study.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
Ethical approval for this study was obtained from National Research Ethics Service- Hampstead reference 15/LO/1084.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Informed consent
Written informed consent was obtained from all participants in this study, including those aged 16–18 years. In the United Kingdom, young people over the age of 16 years are deemed able to provide informed consent when ‘Gillick competent’. This includes all studies that do not involve clinical trials of an investigational medicinal product (CTIMPs).
