Abstract
Objective
We tested whether intermixing mental health items with items addressing comfort and capability could limit the floor effects noted when mental health is measured in musculoskeletal specialty care.
Methods
One hundred thirty-one people seeking care for upper and lower extremity musculoskeletal conditions were randomized to complete either randomly ordered, unlabeled mental health items intermixed with comfort and capability items, or intact and labeled questionnaires. For the two approaches, we compared: (1) flooring and ceiling effects; (2) mean and median questionnaire scores; (3) internal consistency (Cronbach alpha); and (4) exploratory factor analysis. We sought correlations between mental health and levels of pain intensity and capability.
Results
We found slightly more flooring in the intermixed group for symptoms of depression (66% [41 of 62] vs 46% [32 of 69], p = .034), no differences in the mean and median scores for each questionnaire, lower internal consistency measured by Cronbach alpha, and lower factor loading coefficients in exploratory factor analysis for symptoms of depression and anxiety in the intermixed group. The mean level of symptoms of anxiety was significantly different between the two groups (intermixed: 0.87 [95% CI 0.82 to 0.92], fixed: 0.96 [95% CI 0.93 to 0.98]). There were no differences in the association of the mental health measures gathered via the two different strategies with measures of pain intensity and magnitude of capability.
Conclusion
The finding that intermixing mental health questions with questions about comfort and capability did not diminish floor effects suggests no advantage to intermixing mental health items in questionnaires used in musculoskeletal care and research.
Introduction
Background
Questionnaires used to quantify the subjective aspects of health such as comfort and capability are referred to as patient reported outcome measures (PROMs). 1 Questionnaires are also used to measure mental health, such as symptoms of depression or anxiety. 2 There is a notable correlation between measures of capability and comfort and mental health measures.3–5 The true correlations may be stronger, given the evidence that people do not report symptoms of depression and anxiety forthrightly, which creates a strong floor effect for measures of these symptoms.6,7
Rationale
The existing research questionnaires may pose challenges as they tend to be lengthy and burdensome for participants. These questionnaires may include mental health measures that encompass factors perceived as irrelevant by patients, which can potentially be offensive. This issue becomes evident through the quick completion of the questionnaires and the prevalence of high floor effects, particularly observed in measures related to symptoms of general anxiety and depression within musculoskeletal care settings.8,9
Factors such as internal reliability (the degree to which the items in an instrument measure the same construct) 10 and flooring and ceiling effects partially determine the usefulness of questionnaires in a medical setting. Ceiling or flooring effects occur when a large proportion of participants provide maximum or minimum scores to questionnaires, which results in lost information at the top or bottom of the scale. 11 Our thought was that intermixing mental health questions with questions regarding comfort and capability might make answering questions about mental health seem more appropriate in musculoskeletal specialty care, thereby limiting the observed flooring and ceiling effects. In other words, a set of questions about mental health presented together might seem like screening for depression or anxiety, which might feel inappropriate or unwelcome. On the other hand, mental health questions intermixed with questions regarding comfort and capability might be perceived as genuine curiosity about the impact of those symptoms on overall wellbeing. In a study of 603 college students who completed a set of three 12-item questionnaires evaluating a travel website, half randomly intermixed and half kept with the original questionnaire, internal reliability was greater when items were grouped by questionnaire, while correlation with other constructs was reduced. 7 We did not find examples of intermixing questionnaire items to try to reduce flooring and ceiling effects.
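Internal reliability as defined above is commonly summarized with Cronbach alpha, the statistic used later in this study. The following is a minimal sketch of the standard formula, not the analysis code used for this study; the function name and the small data sets are hypothetical and serve only as illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("Cronbach alpha needs at least two items")
    # Transpose so that each tuple holds all responses to one item.
    items = list(zip(*rows))
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical responses of four people to a two-item questionnaire.
perfectly_consistent = [[0, 0], [1, 1], [2, 2], [3, 3]]
less_consistent = [[0, 1], [1, 0], [2, 3], [3, 2]]
print(cronbach_alpha(perfectly_consistent))
print(cronbach_alpha(less_consistent))
```

When the two items move together perfectly, alpha reaches its maximum of 1; when responses to the items diverge, the summed item variances make up a larger share of the total-score variance and alpha falls.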
Study questions
Among people seeking musculoskeletal specialty care, we asked: (1) Is there a difference in flooring and ceiling effects, mean and median questionnaire scores, internal consistency, and factor loading coefficients in exploratory factor analysis between randomly intermixed unlabeled questions compared to labeled fixed order complete questionnaires? (2) Is there a difference in the association with capability and pain intensity?
Materials and methods
Study design and setting
We obtained approval from our Institutional Review Board. For this cross-sectional questionnaire-based randomized study we enrolled participants seeking care regarding upper and lower extremity conditions at two regional orthopedic clinics, and one institution-based upper extremity clinic in an urban area in the United States between July and August 2021.
Participants
Consecutive new and returning patients were approached by a research assistant who was not directly involved in care before or after their visit. All patients who were not fluent in English or were unable to provide verbal informed consent were excluded. We perform extensive cross-sectional research and stopped tracking declines to participate because they are very infrequent. A total of 143 participants began filling out the survey on a tablet device or phone in a private exam room without a researcher present; however, 12 of them did not complete the survey. The most common reasons for noncompletion were a desire to bring a long visit to a close and logistical problems such as system errors or internet connectivity issues. Multiple imputation, the ideal method of addressing missing data at random, 12 cannot be used for factor analysis. We therefore omitted all 12 patients who did not complete all questionnaires, leaving 131 patients for analysis: 69 participants in the fixed order group and 62 participants in the intermixed group. Our Institutional Review Board approved the protocol and accepted completing the survey as a form of consent. The average completion time was less than 9 minutes, which makes survey fatigue unlikely.
Randomization
Participants were randomized 1:1 into two groups: (1) Participants completed randomly ordered, intermixed, non-labeled questionnaires. We refer to this group as the intermixed group. (2) Participants completed each questionnaire with its questions in the same order as used in the questionnaire’s development and validation studies. In this group, each questionnaire was labeled with the questionnaire’s name (e.g., The Negative Pain Thoughts Questionnaire). The questionnaires were also in the same order for each participant. We refer to this group as the fixed-order group. Randomization was performed by the survey software, SurveyMonkey (Palo Alto, CA, USA). We asked SurveyMonkey about their exact method of randomization, but they declined to comment. We assume randomization is based on a pseudo-random number generator.
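The two presentation formats can be sketched as follows. This is an illustration only: the survey software's actual randomization method is unknown, so the coin-flip assignment, the item labels, and the order of questionnaires in the fixed-order arm are assumptions. Item counts follow the questionnaire descriptions in the next section, and the single pain intensity item is left out.

```python
import random

# Number of items per questionnaire, as described in the Questionnaires section.
# The order of this dictionary is an assumption, not the order used in the study.
QUESTIONNAIRES = {
    "PROMIS PF SF": 8,
    "NPTQ-4": 4,
    "PCS-4": 4,
    "PHQ-2": 2,
    "GAD-2": 2,
}


def fixed_order_items():
    """All items grouped by questionnaire, in each questionnaire's original order."""
    return [
        f"{name} item {number}"
        for name, count in QUESTIONNAIRES.items()
        for number in range(1, count + 1)
    ]


def assign_and_order(rng):
    """Assign one participant to an arm with equal probability (assumed method).

    Returns the arm name and the list of items in presentation order.
    """
    items = fixed_order_items()
    if rng.random() < 0.5:
        rng.shuffle(items)  # intermixed arm: one random order across all items
        return "intermixed", items
    return "fixed-order", items


arm, items = assign_and_order(random.Random(2021))
print(arm, items[:3])
```

Simple per-participant assignment like this does not guarantee equal group sizes, which is one reason some trials use block randomization instead.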
Questionnaires
Patient-Reported Outcome Measurement Information System Physical Functioning Short Form (PROMIS PF SF), Negative Pain Thoughts Questionnaire-4 (NPTQ), and Pain Catastrophizing Scale-4 (PCS-4) were answered on 5-point Likert scales, while Patient Health Questionnaire-2 (PHQ-2) and Generalized Anxiety Disorder 2 (GAD-2) were answered on a 4-point Likert scale.
The validated PROMIS PF SF Version two consists of eight questions and measures limitations of physical activity. 13 Higher scores indicate better physical function (fewer limitations of physical activity).14,15 Sample questions include, “Are you able to do chores such as vacuuming or yard work?” and “Are you able to go for a walk of at least 15 minutes?” A score of 50 is the average for the United States general population with a standard deviation of 10. PROMIS PF SF has shown high internal reliability via Cronbach alpha in patients with lower extremity orthopedic trauma injuries. 13
We measured distress and misconceptions about symptoms with two questionnaires: (1) The validated Negative Pain Thoughts Questionnaire (NPTQ-4), on which higher scores indicate greater distress and unhelpful thoughts. 16 Sample statements include “my problem makes me feel awful and it overwhelms me” and “even though I can still do a lot of things, I can’t enjoy them because of my condition”. Scores range from 4 to 20. Previous studies have shown greater physical capability to be associated with fewer negative pain thoughts. 17 (2) The validated PCS-4, 18 with total scores ranging from 0 to 16 and higher scores representing greater distress and misconceptions. 19 A sample statement is “I worry all the time about whether the pain will end”. Magnitude of incapability is associated with catastrophic thinking as measured by the PCS-4. 20
We measured symptoms of anxiety using the validated GAD-2, 21 a two-item screening questionnaire scored from 0 to 6 with a higher score indicating greater anxiety. Questions include: “over the last 2 weeks have you (1) felt nervous, anxious, or on edge and (2) been unable to stop or control worrying?”. Magnitude of incapability is associated with symptoms of depression and anxiety. 22
The validated PHQ-2 measures symptoms of depression. 15 Its score ranges from 0 to 6, and a higher score indicates more symptoms of depression.15,23 Questions include: “over the last 2 weeks have you (1) had little interest or pleasure in doing things and (2) felt down, depressed, or hopeless?”. Pain intensity and incapability correlate with PHQ-2 scores in patients with upper extremity illness. 24
Pain intensity was measured on an ordinal scale from 0 to 10, with 0 indicating no pain and 10 the worst pain possible.
We did not collect patient demographics. Due to the randomization, we expect little influence of demographic variables. Our patients’ symptoms, diagnoses, and sociodemographics are representative of a typical musculoskeletal specialty practice.
Primary and secondary study outcomes
Our primary study goal was to determine if there was a difference in reliability between randomly intermixed and unlabeled questionnaires compared to labeled and fixed order questionnaires. We specifically tested reliability by comparing: (1) flooring and ceiling effects, using the Fisher exact test. We defined flooring as the lowest possible score and ceiling as the highest possible score; (2) mean or median questionnaire scores, using the t-test (parametric) or Mann-Whitney U test (non-parametric); (3) internal consistency, that is, whether the questions within each questionnaire measure the same thing. We measured this in two ways: (A) Cronbach alpha and (B) exploratory factor analysis. Factor analysis was performed with a Promax rotation; 95% confidence intervals were created around factor loading coefficients through bootstrapping (n = 1000). We defined a difference in internal consistency between the fixed-order and intermixed groups as non-overlapping 95% confidence intervals.
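As an illustration of the flooring comparison, the sketch below computes a two-sided Fisher exact p-value from a 2 × 2 table of counts. It is not the study's analysis code: the function names are hypothetical, and the two-sided rule used here (summing the probabilities of all tables no more likely than the observed one) is one common convention that statistical packages may or may not share. The counts are those reported in the abstract for symptoms of depression (41 of 62 at the floor in the intermixed group, 32 of 69 in the fixed-order group).

```python
from math import comb


def floor_count(scores, lowest_possible):
    """Number of participants scoring at the floor (lowest possible score)."""
    return sum(score == lowest_possible for score in scores)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    # Sum all tables that are as likely or less likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(
        p
        for p in (probability(x) for x in range(smallest, largest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Rows: intermixed, fixed-order. Columns: at floor, above floor.
p_value = fisher_exact_two_sided(41, 62 - 41, 32, 69 - 32)
print(round(p_value, 3))
```

The same function applies to ceiling effects by counting participants with the highest possible score instead of the lowest.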
Our secondary study goal was to compare the association with capability and pain intensity. We calculated the Spearman correlation, with bootstrapped (n = 1000) 95% confidence intervals, of each mental health measure with capability and pain intensity. We defined a difference in strength of association as non-overlapping 95% confidence intervals.
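A Spearman correlation with a bootstrapped confidence interval can be sketched as below. The percentile method is an assumption, since the text does not state which bootstrap interval was used, and the function names and example scores are hypothetical.

```python
import random
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = spearman([x[i] for i in idx], [y[i] for i in idx])
        if r == r:  # drop degenerate resamples where the correlation is NaN
            estimates.append(r)
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 + level) / 2 * len(estimates)))]
    return lower, upper


# Hypothetical PHQ-2 scores and pain intensity ratings for ten participants.
phq2 = [0, 0, 1, 0, 2, 3, 0, 4, 1, 6]
pain = [2, 3, 4, 1, 5, 7, 2, 6, 5, 9]
print(spearman(phq2, pain), bootstrap_ci(phq2, pain))
```

Computing one interval per group and checking whether the two intervals overlap corresponds to the criterion for a difference described above.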
Power analysis
A priori power analysis indicated that to find a difference of five points on the PROMIS PF SF, with an expected SD of 10 in both groups, we would need 63 participants in each group (total of 126) with alpha at 0.05 and power of 80%. Based on a previous study, our goal was to include 150 participants in order to perform a reliable factor analysis. The study was inadvertently terminated upon reaching the sample size for our primary hypothesis, probably due to a personnel change, but there was adequate power for the factor analysis with the number of patients available.
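The stated sample size is consistent with the usual normal-approximation formula for comparing two means with a two-sided test, n per group = 2 × ((z for alpha/2 + z for power) × SD / difference)². The sketch below assumes that formula; the software and exact method used for the original calculation are not stated, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(difference, sd, alpha=0.05, power=0.80):
    """Participants per group for a two-sided, two-sample comparison of means,
    using the normal approximation and equal SD in both groups."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)


print(n_per_group(difference=5, sd=10))  # -> 63
```

A calculation based on the t distribution rather than the normal approximation typically yields a slightly larger number.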
Results
Difference in flooring and ceiling effects, mean and median questionnaire scores, internal consistency, and factor loading coefficients in exploratory factor analysis between unlabeled, intermixed questionnaires compared to labeled questionnaires in a fixed order
Difference in mean or median scores and the number of people with the lowest and highest score.
We included 62 people in the intermixed group and 69 people in the fixed order group. PROMIS: patient-reported outcome measurement information system; NPTQ-4: negative pain thoughts questionnaire-4; PCS-4: pain catastrophizing scale-4; PHQ-2: patient health questionnaire-2 (depression); GAD-2: generalized anxiety disorder 2-item (anxiety); parametric data represented as mean ± SD, non-parametric data represented as median (IQR).
We found no differences in the mean and median scores among each of the questionnaires between groups (PROMIS PF SF, NPTQ-4, PCS-4, PHQ-2, GAD-2).
Difference in reliability measured by Cronbach alpha between fixed order and intermixed questionnaires.
PROMIS: patient-reported outcome measurement information system; NPTQ-4: negative pain thoughts questionnaire-4; PCS-4: pain catastrophizing scale-4; PHQ-2: patient health questionnaire-2 (depression); GAD-2: generalized anxiety disorder 2-item (anxiety); alpha coefficient with 95% confidence interval.
Difference in questions groupings between intermixed and fixed order questionnaires.
The factor loading coefficient indicates how much the response to a question increases when the construct measured by the questionnaire (such as physical function) increases by one unit; a coefficient of 1.0 would indicate perfect alignment. PROMIS: patient-reported outcome measurement information system; NPTQ-4: negative pain thoughts questionnaire-4; PCS-4: pain catastrophizing scale-4; PHQ-2: patient health questionnaire-2 (depression); GAD-2: generalized anxiety disorder 2-item (anxiety).
Difference in association with disability and pain intensity
The difference in association of mental health with disability and pain intensity.
Discussion
Patient-reported outcome measures quantify the subjective aspects of illness with the intention of capturing outcomes that matter to patients to help improve patient care. Current questionnaires are designed for research purposes and may be long and burdensome. Mental health measures may address factors that patients deem irrelevant or even offensive, which seems manifest in rapid completion and high floor effects of measures of symptoms of general anxiety and symptoms of depression in musculoskeletal care.6,7 We analyzed whether intermixing mental health items with items addressing physical symptoms and capability could limit flooring and ceiling effects, and if doing so would alter mean and median questionnaire scores, internal consistency, and factor loading coefficients in exploratory factor analysis. We conclude that intermixing mental health questions does not improve their performance.
Limitations
The findings of this study should be interpreted in light of some limitations. First, we had more incomplete questionnaires than is typical, and our cohorts were slightly unbalanced (69 and 62 participants). This was due to the use of quick response (QR) codes to allow people to complete questionnaires on their phone rather than on our tablet during the COVID-19 pandemic. People using their phone are more likely to leave before the questionnaire is completed. Second, the generalizability might be limited because we included only English-speaking patients with musculoskeletal pain visiting orthopedic surgeons who are all White men. Our impression is that there is sufficient diversity for a viable experiment to measure associations and that the concepts measured are unlikely to change with greater variation in language, socio-demographics, and specialist characteristics.
Difference in flooring and ceiling effects, mean and median questionnaire scores, internal consistency, and factor loading coefficients in exploratory factor analysis between intermixed compared to fixed order questionnaires
The observation of comparable flooring, no differences in the mean and median scores, no differences in internal consistency, and lower factor loading coefficients for symptoms of anxiety in the intermixed group suggests there are no advantages to intermixing mental health and capability questions. These findings are similar to the prior study among college students completing questionnaires about travel websites, with the exception that their sample was larger, which may explain why their differences were statistically significant. 7 For our purposes, small differences that might be significant in a larger sample are clinically irrelevant. Our primary aim was to reduce floor and ceiling effects to get better spread in mental health scores, and intermixing questionnaire items did not achieve this.
Difference in association with disability and pain intensity
The observation of no differences in association of the NPTQ-4, PCS-4, PHQ-2, and GAD-2 with measures of levels of incapability (PROMIS PF) and pain intensity between the intermixed and fixed-order groups further supports similar responses no matter the question presentation. Notably, this observation is in spite of the known, and reproduced, floor effects of questions about general despair and worry. Despite the imperfections of measuring mental health using questionnaires in musculoskeletal specialty care, we are still able to identify important relationships and opportunities for improved mental health.
Conclusion
The observation of no advantages to intermixing mental health and comfort and capability questionnaires suggests a need for alternative strategies to improve measurement of mental health in musculoskeletal specialty care. For instance, patients might answer questions addressing thoughts and feelings regarding physical symptoms more forthrightly than questions addressing general worry or despair. 25
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Author’s Note
Work performed at the University of Louisville School of Medicine and Dell Medical School—University of Texas.
