Abstract
Gastrointestinal illnesses have a physical, emotional and social impact on patients. Patient reported outcome measures (PROMs) are increasingly used in clinical decision-making, clinical research and approval of new therapies. In the last decade, there has been a rapid increase in the number of PROMs in gastroenterology, and choosing which PROM to use can therefore be difficult. Not all PROM instruments currently used in research and clinical practice in gastroenterology have gone through a rigorous development methodology. New drugs and therapies will not have access to the market if the PROMs used in their clinical trials are not validated according to the guidelines of the international agencies. Therefore, it is important to know the required properties of PROMs when choosing or evaluating a drug or a clinical intervention. This paper reviews the current literature on how to assess the validity and reliability of PROMs. It summarises the required properties into a practical guide for gastroenterologists to use in assessing an instrument for use in clinical practice or research.
Introduction
More than a century ago, the first health outcome measure was proposed by Florence Nightingale, who classified patients as relieved, unrelieved and dead. 1 Other guides such as mortality rates have historically been used to measure health outcomes in the population. 2 However, the definition of health has changed in the past century to include a wider view of outcomes which includes freedom from disease, ability to perform daily activities, happiness, social and emotional well-being, and quality of life. The World Health Organisation (WHO) has defined health as ‘physical, mental, and social well-being, and not merely the absence of disease and infirmity’ (p. 100). 3 As a result, numerous measures have been developed in an attempt to quantify health. Health outcome measures are tools used to evaluate an individual’s health using different health-related parameters. Patient reported outcome measures (PROMs) are instruments that are completed by patients and capture one or more aspects of health.4,5 The formal use of PROMs to monitor surgical outcomes in England has been an important development. 6 Since 2007, the Department of Health in England has required the routine measurement of patient-reported health outcomes for all National Health Service (NHS) patients via its PROMs programme.5,6 PROMs are increasingly used in decision-making to encourage a patient-centred approach.5,7 For this reason, PROMs that are chosen and used in practice must be valid, reliable and clinically useful measures.
There are over 100 PROMs applicable to gastroenterological disorders, which are described in the gastrointestinal PROMs (GI-PRO) database. 4 The classical physician-based health outcome measures are now increasingly being replaced by PROMs, which enable the assessment and monitoring of disease or treatment effects from the patients’ own perception rather than from the health professionals’ judgments (which do not always reflect patients’ views).8,9
A doctor’s ability to interpret and apply PROMs in clinical practice has great potential to contribute to a better understanding of the patients' well-being. There is an increasing use of PROMs in both research and clinical work in gastroenterology. It is important to differentiate between PROMs for functional disorders and those for organic disorders. PROMs should be used in identification of the symptomatic profiles, diagnosis and treatment of functional disorders such as post prandial distress syndrome, epigastric pain syndrome, chronic idiopathic nausea, excessive belching, IBS and other functional diseases. 10 The lack of objective measurable markers of symptom improvement, such as stool frequency and rectal bleeding, means the evaluation of treatment response has to be based on the patients’ reporting of symptoms. The Rome III criteria11,12 are very useful to assess the outcomes of new treatments for functional gastrointestinal disorders.
Nowadays, PROMs are categorised as generic or disease-specific.13,14 Generic PROMs are applicable to any disease and are useful for comparison or economic studies between different conditions. Specific PROMs, on the other hand, are specific for one condition. Both types of measures, generic and specific, are now seen as complementary rather than conflicting when appraising patient outcomes. Some of the commonly used generic tools are the European Quality of life - five dimensions (EQ-5D), 15 the Short Form 36 (SF-36) questionnaire, 16 the Cleveland Quality of life,17,18 and the Short Form 12 (SF-12). 19 Examples of the disease specific PROMs are the Inflammatory Bowel Disease Questionnaire (IBDQ) 20 and the rating form of IBD patient concerns. 21 A good review of PROMs that have been used to evaluate the efficacy of therapeutic agents in functional dyspepsia trials was done by Ang et al. 22 Mouzas and Pallis reported a good review of the PROMs used in inflammatory bowel disease. 23 The patient-reported outcome and quality of life measures database provides a comprehensive list of the available PROM questionnaires. 24
The increasing number of PROMs in recent years requires gastroenterologists to decide which PROM to use and how to assess each measure. Several studies have suggested using standards to assess properties such as validity, reliability and responsiveness. Examples of these standards have been presented by Terwee et al., 25 the scientific advisory committee of the Medical Outcome Trust, 26 the Evaluating the Measurement of Patient-Reported Outcomes (EMPRO) tool, 27 Bombardier and Tugwell, 28 Andresen, 29 Steiner, 30 DeVon et al., 31 McDowell and Jenkinson, 32 the Food and Drug Administration (FDA)33,34 and the European Medicines Agency 35 guidelines. The FDA guidance in 200633,34 describes how to evaluate PROMs used as effectiveness endpoints in clinical trials. These publications describe the required criteria for a successful PROM and are written mainly for health outcome specialists and methodologists who are involved in developing health outcome measures for clinical trials or evaluating a new medical technology. None of these publications individually summarises these standards into one relatively brief yet fairly comprehensive practical checklist for doctors to use in their day-to-day clinical practice. Before using health outcome measures for research or in clinical practice, it is essential to ensure that they are appropriate to the context, perform well and possess the required characteristics. In this article we describe how to assess these requirements: the concept of item generation, validity, reliability, responsiveness, utility and cross-cultural adaptation; and how to evaluate these measures in a way that is easy to follow and applicable in clinical practice.
The quality properties of PROMs
There are five main components for good quality PROMs: item selection, validity, reliability, responsiveness and interpretability. With the increased number of multinational and multicultural clinical research studies, certain criteria regarding cultural, educational and social adaptation of the PROMs are needed to use the questionnaires in a different language or country.
Item selection
Items can be derived from three main sources:13,36,37
Research: reviewing old PROMs is the most commonly used approach in finding items. There are several reasons why old measures can be used to derive the new PROMs items; it saves a lot of time and effort, there are possibly a limited number of questions to ask about a specific problem such as abdominal pain, vomiting, etc. and old measures have been repeatedly used and tested in many studies and trials.
Patients: by asking the patients to identify items and domains to be included in the scale. Patients can be excellent sources of health outcome measure items. Some techniques like focus groups and key informant interviews have been used to collect patients' viewpoints in a systematic manner. 13 This method of item generation has been useful in constructing quality of life measures, for example the IBDQ, 20 the rating form of IBD patient concerns 21 and functional dyspepsia. 38
Clinical observations: items are derived by clinicians based on their experience.
However, the FDA statement33,34 considers that inclusion of patients in developing a PROM questionnaire is the most important source of items. It stresses that item generation should include a wide range of patients to represent variations in severity and in population characteristics such as age, sex and educational level. It is important to assess the respondent and administrator burden when choosing items. Items that cause undue physical, emotional or cognitive strain on patients generally decrease the quality and completeness of PROM data. The language in a PROM should be clear and not technical. Items should not offend or discriminate against people, for example when assessing the emotional aspect of quality of life. Therefore, items should be tested on a small group of patients in preliminary or pilot testing to make sure that they are understandable and not ambiguous. 39 This pilot testing can include any number of patients. The FDA guidance mentions that the number of patients in the pilot testing is not as critical as the cognitive interview quality and patient diversity included in the sample. Pilot testing of items is commonly used in developing quality of life questionnaires such as the IBDQ and the UK-IBDQ.20,40
Once the pool of items has been created, a number of statistical techniques can be used in order to select the most relevant items:
Frequency of endorsement: The frequency of endorsement (also called response rate) examines the proportion of people who select the same item response. Only items with endorsement rate between 0.2–0.8 (or 20%–80%) are chosen.13,41 Items with lower or higher rates are considered redundant because they will add little value to the index. Items with high response rates more than 80% (i.e. more than 80% of patients chose the same answer) are considered for removal because they cannot be used to distinguish between patients. If the same answer was chosen by less than 20% of patients, then it is possibly not related to the condition and can be removed.
Item-total correlation: The item-total correlation is the statistical correlation of each item with the total PROM score. The accepted range is between 0.2–0.8. 13 A value below 0.2 indicates that the item is not relevant. A value of more than 0.8 indicates that the item is redundant and does not add a value to the total scale.
Internal consistency: Internal consistency is the statistical correlation or the homogeneity between the items in the measure.20,25 The internal consistency is commonly measured by calculating Cronbach α statistic.13,36,42 The acceptable value of Cronbach α is between 0.7–0.9.13,25 Higher values more than 0.9 may indicate an overlap between items.
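As an illustration, the three item-selection statistics can be sketched in a few lines of Python. This is a minimal sketch rather than a validated psychometric implementation: the function names and the small data set are hypothetical, the item-total correlation is the uncorrected form described in the text (each item is included in the total), and Cronbach α is computed with the usual variance-based formula, which the text does not spell out.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def endorsement_rates(item_responses):
    """Proportion of respondents choosing each response option of one item."""
    n = len(item_responses)
    return {opt: item_responses.count(opt) / n for opt in set(item_responses)}


def item_total_correlations(scores):
    """Correlation of each item with the total score.

    scores: one row per respondent, one column per item (hypothetical layout).
    """
    totals = [sum(row) for row in scores]
    n_items = len(scores[0])
    return [pearson_r([row[i] for row in scores], totals) for i in range(n_items)]


def cronbach_alpha(scores):
    """Cronbach alpha from item variances and total-score variance."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 respondents, 3 items scored 1-5.
example = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 5, 5], [5, 5, 4]]
print(endorsement_rates([row[1] for row in example]))
print(item_total_correlations(example))
print(cronbach_alpha(example))
```

In this toy data set the items move together almost perfectly, so α comes out at about 0.95, above the 0.7–0.9 range quoted above; in a real instrument this would prompt a check for overlapping items.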
Reliability
Reliability is the consistency between the scores of a health outcome measure applied in different circumstances. The principle of reliability is that applying the PROM on different occasions or by different observers produces similar results.13,43 Statisticians suggest that a reliability of 0.75 should be the minimum requirement for a useful health outcome measure. 13 Common reliability statistics are the intraclass correlation co-efficient (ICC) and the Pearson correlation co-efficient (r). They are expressed as numbers between –1 and 1, with 0 indicating no reliability; 1 is perfect reliability between the two sets of tests and a negative number indicates that the two sets of tests change inversely.
Common types of reliability testing are:
Inter-observer reliability: used to assess the degree of consistency between different observers assessing the same patients.
Test-retest reliability: used to assess reliability of the PROM when applied on two separate occasions. To estimate test-retest reliability, the measure should be administered to the same group of patients on two separate occasions between which there has been no overall change in the clinical condition of the patients. Typically a period of 14 days is acceptable. 13
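A test-retest calculation can be sketched as follows. This is an illustrative sketch under stated assumptions: the text does not say which form of the ICC is meant, so the one-way random-effects form, ICC(1,1), is used here as one possible choice; the function names and scores are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) (an assumed choice of ICC form).

    ratings: one row per patient, one column per occasion or observer.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean([v for row in ratings for v in row])
    row_means = [mean(row) for row in ratings]
    # Between-patient and within-patient mean squares.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (v - m) ** 2 for row, m in zip(ratings, row_means) for v in row
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical PROM scores for 4 stable patients, 14 days apart.
test = [10, 20, 30, 40]
retest = [12, 19, 31, 38]
icc = icc_oneway([list(pair) for pair in zip(test, retest)])
print(icc, icc >= 0.75)  # 0.75 is the minimum suggested in the text
print(pearson_r(test, retest))
```

One reason both statistics are worth computing: if every patient scores a constant amount higher on the second occasion, Pearson r stays at 1 while this form of the ICC falls, because the ICC also penalises systematic disagreement.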
Responsiveness
Responsiveness is the ability of the PROM tool to detect a change in patients’ clinical condition. This is estimated by applying the health outcome measure to a group of patients whose clinical condition has changed.13,44 There are several statistics for responsiveness but the commonest is the responsiveness ratio, which is calculated by dividing the mean change in scores for patients who reported a change by the standard deviation of the scores of stable patients.13,44 Other responsiveness indicators mentioned in the literature are effect size (ES) 45 and standardised response mean (SRM). 46 The acceptable value for the responsiveness ratio is 0.5 or 50%.44,47,48
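The responsiveness statistics can be sketched as below. This is a hedged illustration, not a reference implementation: the function names and numbers are hypothetical. The denominator of the responsiveness ratio is passed in by the caller, since the text describes it as the standard deviation among stable patients and some formulations use the stable patients' change scores. The ES and SRM formulas are the commonly used definitions (mean change divided by the baseline standard deviation, and by the standard deviation of change, respectively), which the text does not state.

```python
from statistics import mean, stdev


def responsiveness_ratio(change_in_changed, stable_values):
    """Mean change in patients who reported a change, divided by the
    standard deviation observed in stable patients."""
    return mean(change_in_changed) / stdev(stable_values)


def effect_size(baseline, follow_up):
    """ES: mean change divided by the SD of baseline scores (assumed definition)."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    return mean(change) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """SRM: mean change divided by the SD of the change scores (assumed definition)."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    return mean(change) / stdev(change)


# Hypothetical data.
changed = [4, 6, 5, 5]   # score changes in patients who reported a change
stable = [0, 2, 4]       # values from stable patients
ratio = responsiveness_ratio(changed, stable)
print(ratio, ratio >= 0.5)  # 0.5 is the acceptable value quoted in the text
```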
Validity
Validity is the ability of the test to measure what it is intended to measure. Validity can be broadly divided into three types (referred to as the 3 Cs): 13 content validity, construct validity and criterion validity.
Content validity: checks if the measure as a whole covers all the relevant and important aspects of the disease. Experts in the field usually judge content validity to ensure an item appropriately measures the desired health outcome.
Construct validity: is used when there is no ‘gold standard’ measure with which a new PROM can be compared. 13 A combination of laboratory tests, other health outcome measures, or clinical observations might be necessary to provide the data that support the construct validation of the PROM.25,49 Construct validity is commonly assessed by calculating a correlation coefficient. An appropriate correlation coefficient for construct validity should be somewhere between 0.4–0.8.13,47
Criterion validity: measures the correlation of the new measure with a ‘gold standard’ measure, which exhibits the same characteristics. When the correlation is explored at the same time, it is described as ‘concurrent validation’. This is often used when an existing measure is potentially to be replaced by a shorter, cheaper or less invasive measure. 13 In this case, we would expect a very high correlation co-efficient (≥0.8). When the new measure is compared with a criterion that is measured later, this type of validation is called ‘predictive validation’. This type of validation is often used with measures that predict future events like response to treatment or mortality.
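The correlation thresholds for construct and concurrent validity can be applied as in the following sketch. The function names and scores are hypothetical, and Pearson's coefficient is assumed here because the text does not specify which correlation coefficient is used.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def supports_construct_validity(r, low=0.4, high=0.8):
    """True if r lies in the 0.4-0.8 range quoted in the text."""
    return low <= r <= high


def supports_concurrent_validity(r, minimum=0.8):
    """True if r reaches the >=0.8 level expected against a gold standard."""
    return r >= minimum


# Hypothetical scores for 5 patients.
new_prom = [1, 2, 3, 4, 5]
related_measure = [3, 1, 2, 5, 4]   # a related, non-gold-standard comparator
gold_standard = [2, 4, 6, 8, 10]    # an established gold-standard measure

print(supports_construct_validity(pearson_r(new_prom, related_measure)))
print(supports_concurrent_validity(pearson_r(new_prom, gold_standard)))
```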
Interpretability
Interpretability means assigning qualitative meaning to the health outcome measure scores.25,50–52 To aid the use of PROMs in clinical practice, doctors should be able to translate the PROM score to clinical meaning by knowing the minimal important change (MIC). The MIC is defined as ‘the smallest difference in score in the domain of interest which patients perceive as beneficial and would mandate, in the absence of troublesome side effects and excessive cost, a change in the patient’s management’ (p. 407). 53 Additional useful information is derived from the ‘floor and ceiling effects’. The ceiling effect is a term used to describe the effect when the majority of the patient scores are close to or at the top of the measure. The floor effect is a term used when the majority of patient scores are close to or at the bottom of the measure. 25 A measure is said to have a floor or ceiling effect when more than 15% of patients score the lowest or highest possible score, respectively. 25 If floor or ceiling effects are present, it is difficult to accurately assess the health outcomes of patients who score at the extreme ends of the PROM. Results from those groups of patients should be interpreted with caution.
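The 15% rule for floor and ceiling effects is simple to check once total scores are available. The sketch below is illustrative only; the function name, the returned keys and the example scores are hypothetical.

```python
def floor_ceiling_effects(scores, min_score, max_score, threshold=0.15):
    """Flag floor and ceiling effects using the 15% rule quoted in the text.

    scores: total PROM scores, one per patient.
    min_score, max_score: lowest and highest possible scores of the PROM.
    """
    n = len(scores)
    floor_prop = sum(1 for s in scores if s == min_score) / n
    ceiling_prop = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_proportion": floor_prop,
        "ceiling_proportion": ceiling_prop,
        "floor_effect": floor_prop > threshold,
        "ceiling_effect": ceiling_prop > threshold,
    }


# Hypothetical scores on a 0-10 scale: 2 of 10 patients score the minimum.
result = floor_ceiling_effects([0, 0, 5, 6, 7, 8, 9, 10, 3, 4], 0, 10)
print(result)
```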
Cross-cultural adaptation
Cross-cultural adaptation is the process that deals with language and cultural adaptation issues when preparing the PROMs for use in another country.54,55 Items should not only be translated linguistically, but must also be adapted to the culture of the target country. For example, a question about difficulty in using a fork in eating may not be applicable in a country where a fork is not used in eating. The cross-cultural adaptation involves forward and backward translation of the questions or items, review of the results by linguists, methodologists, statisticians and health care professionals, and pretesting the PROM in a small group of patients to check the clarity of the PROM in the new setting and its consistency with the original PROM version. 54 The final step involves psychometric validation of the new PROM in the target population to check validity, reliability and responsiveness.13,55 In some ethnic minorities or special groups of patients, there is a need for specific cultural and language educational materials such as the use of pictograms, smileys or other picture-based representations.56,57 A good example of a picture-based PROM is the gastro-oesophageal reflux disease analyser (GERDyser) questionnaire, which comprises 10 dimensions, each illustrated by pictogram drawings. 58 In fact, a recent study published by Tack et al. showed that the use of pictograms with verbal descriptors significantly improves the reliability of PROMs by around 30% by avoiding potential bias by patients. 59
Checklist to evaluate PROMs
Checklist for evaluating the patient reported outcome measure (PROM) questionnaires
MIC: minimal important change; SEM: standard error of measurement; SDC: smallest detectable change.
Conclusions
This paper intends to provide a practical overview of the main components for a good quality PROM; it does not intend to provide a detailed description of each component. Readers who require more detailed explanations are encouraged to refer to the references cited in the paper. The FDA 33 and the European Medicines Agency 35 guidelines provide further recommendations on the proper development and validation of PROMs, especially for clinical trials.
A good example of a well-validated PROM is the IBDQ. The IBDQ was developed in 1985 as a quantitative, disease-specific health-related quality of life (HRQoL) measure in patients with inflammatory bowel disease (IBD). A number of patients with IBD and health professionals were asked to list all problems they had observed or experienced as a direct result of IBD. This process resulted in a total of 150 items. All these items were then administered to another group of patients with IBD to rate each problem on a five-point Likert scale, ranging from a low score (score 1) indicating no problem to a high score (score 5) indicating a severe problem. A final list of 32 items was derived and reviewed by experienced clinicians; the items were grouped into four groups: gastrointestinal symptoms, systemic symptoms, emotional dysfunction and social dysfunction. 60 The final version of IBDQ had good reproducibility (ICC was 0.7) and responsiveness (by calculating the responsiveness ratio on patients who reported change in a seven-point assessment of their condition). The IBDQ had good construct validity when correlated with disease-specific and generic PROMs.61–65 The clinically important change in score was observed to be a decrease of 16–30 points. 66 The IBDQ has been validated in different language versions and has further demonstrated its validity, internal consistency and reliability in several validation studies worldwide.64,65,67–77
As new therapies in gastroenterology are rapidly emerging, PROMs are increasingly used in clinical decision-making. There is a need to support and educate gastroenterologists on how to assess these tools, to encourage their use in clinical practice.
Every PROM tool should have five important properties: items should be selected from a reliable source and should be clear to patients; the PROM must reliably yield consistent measurements; the PROM must measure what is intended; the PROM should change with the change in patients’ condition (i.e. be ‘responsive’); and the PROM should be easily translated into clinically meaningful values, showing ‘interpretability’.
Footnotes
Author contribution
LA contributed to writing the manuscript and reviewing the literature. HAH and JGW contributed to reviewing the literature and examining the content of the manuscript. All authors approved the final version of the manuscript.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Conflict of interest
None declared.
