Abstract
Accurate assessment of patient-reported outcomes (PROs) is essential for informing clinical decision-making and guiding health policy. Item Response Theory (IRT) enhances measurement by providing detailed evidence on item discrimination, difficulty, fairness, and precision, consistent with COSMIN guidelines. A quantitative, cross-sectional design was employed, involving 500 adult patients attending outpatient facilities across public, private, and community-based healthcare centers in southern Ghana. Stratified random sampling was used to ensure representativeness across settings. Psychometric evaluation combined CTT analyses (Cronbach’s alpha, item-total correlations, and factor analysis) with IRT modeling, specifically the graded response model (GRM). CTT analyses indicated good internal consistency (Cronbach’s α = .84), while the GRM showed higher reliability (marginal reliability = 0.91) and revealed patterns of precision across the health spectrum. IRT-based scores were meaningfully associated with treatment adherence (β = .45), quality of life (β = .41), and self-reported health status (β = .38), demonstrating predictive validity. Differential item functioning analyses indicated limited subgroup bias. Integrating CTT and IRT strengthens the rigor, precision, and fairness of PRO measurement. IRT-calibrated instruments demonstrate practical value for clinical monitoring and health system evaluation and are recommended for routine implementation in diverse healthcare settings.
Highlights
The study shows that integrating Classical Test Theory (CTT) and Item Response Theory (IRT) strengthens the measurement of patient-reported outcomes, offering deeper insight into item performance and precision.
Using data from 500 outpatients in southern Ghana, the instrument demonstrated good CTT reliability (α = .84) and superior IRT marginal reliability (0.91), with IRT providing clearer precision across the health continuum.
IRT-based scores showed strong predictive validity, significantly associating with treatment adherence, quality of life, and self-reported health status while exhibiting minimal subgroup bias through DIF analysis.
The findings recommend adopting IRT-calibrated PRO measures in clinical practice and instrument development to ensure greater accuracy, fairness, and responsiveness across diverse patient populations.
Introduction
Patient-reported outcomes (PROs) are increasingly recognized as essential in clinical research, routine care, and health policy decision-making.1-6 PROs capture patients’ perceptions of their health, including symptoms, functional status, quality of life, and treatment satisfaction,1,7-9 reinforcing the shift toward patient-centered care and evidence-based practice. They serve as key indicators for evaluating interventions, guiding resource allocation, and informing policy,5,10-12 making precision and fairness in measurement critical. Ensuring measurement accuracy requires attention to item-level properties and potential biases across diverse populations. PRO instruments must reliably reflect patient experiences to avoid measurement error, misclassification, and inequities in clinical or policy applications.13-16 Advanced psychometric approaches allow detailed evaluation of item characteristics, including difficulty, discrimination, and precision across the health spectrum.18-22 Computerized Adaptive Testing (CAT) is one practical application that can reduce respondent burden while maintaining high measurement precision.18,23-25 For example, in patients undergoing cubital tunnel syndrome surgery, CAT reduced the number of items administered from 10 to a median of 2 without sacrificing precision.26-28 Overall, PROs provide a robust, patient-centered approach to capturing meaningful health outcomes for both clinical and policy purposes.
Beyond improving precision, IRT enhances fairness in health measurement. Differential item functioning (DIF) analyses within the IRT framework enable researchers to identify whether items function differently across subgroups such as age, gender, or socioeconomic status, even when individuals share the same underlying health level.25 Correcting for DIF reduces bias, ensuring that PRO instruments capture true health status rather than reflecting demographic differences.20,27,29 Moreover, IRT-based calibration allows instruments to be placed on common scales, facilitating comparisons across studies, populations, and policy settings. Collectively, these advantages highlight IRT’s potential to improve not only the psychometric properties of PRO instruments but also the interpretability and generalizability of the resulting data. Nevertheless, the integration of Item Response Theory (IRT) into routine clinical practice and health policy evaluation remains uneven.10,30-33 Although initiatives such as the Patient-Reported Outcomes Measurement Information System (PROMIS) have demonstrated the value of IRT-calibrated instruments, many studies and health systems continue to rely on CTT-based measures. Systematic evidence comparing IRT and CTT in patient-reported health outcomes is still limited, particularly regarding reliability, bias detection, and predictive validity. For example, Schroeders and Gnambs30 noted that although both frameworks inform item development, IRT has advantages in bias detection and measurement precision, yet its practical applications remain underutilized. Similarly, Menold and Raykov,3 Zumbo and Chan,31 and Feng et al34 applied both CTT and IRT approaches in the development of new PRO measures but highlighted that most evaluations remain methodological, with few studies systematically assessing predictive validity or clinical utility. This gap restricts clinicians, researchers, and policymakers from fully appreciating the practical benefits of adopting IRT-based measures.
Our study addressed this gap by systematically examining how Item Response Theory (IRT) can enhance the measurement of patient-reported outcomes (PROs) relative to Classical Test Theory (CTT). The focus was to evaluate whether IRT approaches provide stronger measurement precision, address item bias across demographic groups, and offer greater predictive value for clinical indicators and policy outcomes. In doing so, our study emphasized 3 essential criteria for robust measurement in health research: reliability, fairness, and validity. This study has significant implications for both clinical practice and health policy. For clinicians, IRT-based tools may support more precise monitoring of patient progress, enabling earlier detection of health changes and more tailored interventions. For policymakers, unbiased and predictive PRO data can strengthen resource allocation, program evaluation, and policy design. More broadly, advancing IRT in health outcomes research helps ensure that patient perspectives are captured with the highest levels of accuracy and fairness, bridging the gap between psychometric innovation and practical application.
Research Questions
To what extent does Item Response Theory improve the reliability and measurement precision of patient-reported health outcome instruments compared to Classical Test Theory, as measured by item and test information functions?
How effectively does IRT detect and correct for item bias and differential item functioning (DIF) across demographic groups (eg, age, gender, socioeconomic status) in patient-reported outcome measures?
What is the predictive validity of IRT-calibrated patient-reported health outcomes in explaining variations in clinical indicators and health policy-related outcomes?
Methods
Research Design
This study employed a quantitative, cross-sectional design to evaluate the psychometric performance of patient-reported outcome (PRO) instruments. The design enabled systematic data collection from a large and diverse sample at a single point in time, facilitating subgroup comparisons without the financial and logistical demands of longitudinal follow-up.16,35,36 Cross-sectional designs are widely recommended in psychometric research when the primary goal is to assess reliability, validity, and fairness rather than change over time. The study incorporated reliability analyses (eg, Cronbach’s alpha), item-level parameter estimation (difficulty, discrimination), factor analysis, and information functions to provide a comprehensive assessment of measurement precision.16,35 Drawing participants from multiple outpatient facilities increased heterogeneity in health status and demographic characteristics, enabling robust testing of differential item functioning (DIF) across gender, age, and socioeconomic groups. Although longitudinal designs may better assess responsiveness, the cross-sectional approach provided a practical and methodologically rigorous framework that balanced feasibility, efficiency, and scalability for healthcare research and policy applications.14,35
Study Population and Sampling
The study population consisted of adult patients (18 years and above) receiving outpatient healthcare services in selected hospitals and clinics across southern Ghana. These facilities were purposively chosen to reflect diversity in healthcare provision, including public hospitals, private hospitals, and community-based clinics. This variation in settings was important for ensuring that the instrument was tested across a wide spectrum of patient experiences and service environments. A stratified random sampling strategy was employed to enhance representativeness. Facilities were first stratified by type (public, private, and community clinics), after which patients were randomly selected from each stratum. This method minimized sampling bias and ensured proportional inclusion of participants from different healthcare contexts. Eligibility criteria required that participants be residents of the region, 18 years or older, and actively seeking outpatient care at the time of the study. Patients who did not meet these criteria, or those unable to provide informed consent, were excluded. The sample size determination was guided by methodological recommendations for IRT calibration, which emphasize the need for relatively large samples to produce stable and reliable item parameter estimates.37 While traditional CTT analyses can be conducted with smaller samples, IRT typically requires at least 500 participants to achieve stable parameter estimation, particularly when applying models that incorporate multiple parameters such as the 2-parameter logistic (2PL) or graded response models.16 To account for potential missing data and non-responses, the target sample was set slightly above this threshold, ensuring sufficient statistical power for subgroup analyses and DIF testing.
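To illustrate the stratified procedure, the following R sketch shows proportional random selection within facility-type strata. The roster data frame, its column names, and the counts are hypothetical, and the actual selection was conducted on-site rather than in code.

```r
# Hypothetical illustration of stratified proportional sampling in R;
# the roster object and its contents are invented for this sketch.
library(dplyr)
set.seed(2025)

roster <- tibble(
  patient_id    = 1:5000,
  facility_type = sample(c("public", "private", "community"),
                         size = 5000, replace = TRUE, prob = c(.5, .3, .2))
)

target_prop <- 520 / nrow(roster)          # target set slightly above 500

sampled <- roster %>%
  group_by(facility_type) %>%              # strata: public, private, community
  slice_sample(prop = target_prop) %>%     # proportional allocation per stratum
  ungroup()

table(sampled$facility_type)               # verify proportional representation
```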
To address COSMIN guidelines and clarify the measurement tool evaluated, the instrument used in this study was a 20-item Patient-Reported Outcomes (PRO) scale assessing health functioning and symptom burden across domains such as physical functioning, emotional well-being, symptom severity, and role limitations. This clarification is important because sample size requirements for both CTT and IRT depend partly on the number of items being analyzed. Although the target sample of 500 participants meets the minimum threshold commonly used for graded response models, COSMIN recommendations indicate that 2-parameter IRT models ideally require sample sizes greater than 1000 to ensure highly stable item parameter estimates. Therefore, the sample size should be viewed as a methodological limitation, and the findings would benefit from future replication in larger and more diverse samples to enhance the robustness and generalizability of the IRT results.
Instrumentation
Data were collected using a 20-item Patient-Reported Outcomes (PRO) questionnaire developed by the research team to assess patients’ health functioning and symptom burden in outpatient settings. The instrument was conceptualized to capture 4 key domains commonly highlighted in the health outcomes literature: symptoms, functional capacity, emotional well-being, and overall quality of life. Together, these domains provided a multidimensional and holistic assessment of patients’ lived health experiences, ensuring that the instrument reflected both clinical presentations and psychosocial dimensions relevant to healthcare delivery and policy evaluation in Ghana. Each item used Likert-type ordered response categories, typically with 4 to 5 options.
Data Collection Procedures
Data were collected electronically using Google Forms due to its secure, accessible, and cost-effective features, including real-time response monitoring and automated data storage.38 The survey link was distributed through hospital administrators, outpatient service desks, and WhatsApp groups to maximize reach and participation across varying levels of digital access. Eligibility screening items at the beginning of the survey confirmed age (18 years and above), regional residency, and current outpatient status. Ineligible respondents were automatically excluded. Trained research assistants were available onsite to provide technical support, particularly for participants with limited digital familiarity, while preserving response independence. Electronic informed consent was obtained prior to participation. Data were securely stored, exported to Microsoft Excel for preliminary management, and then analyzed in R and IRTPRO. These platforms supported reliability estimation, factor analysis, item parameter estimation, test information functions, and DIF analyses, enabling comprehensive psychometric evaluation.39
Data Analysis
A 2-step analytic strategy was implemented to provide a rigorous psychometric evaluation of the PRO instrument. Reliability was assessed using Cronbach’s alpha29 and McDonald’s omega15 to estimate internal consistency, while corrected item-total correlations (threshold = 0.30) were examined to evaluate item contribution. For item-level modeling, the Graded Response Model (GRM) was applied to estimate discrimination and difficulty parameters, along with test information functions and marginal reliability indices to assess precision across the latent trait continuum.16,36 Model and item fit were evaluated using S-X2 statistics, infit and outfit mean square (MNSQ) values, point-biserial correlations, and standardized residuals. Exploratory and confirmatory factor analyses were conducted to verify unidimensionality. Differential Item Functioning (DIF) analyses were performed across gender, age, and socioeconomic status using likelihood ratio chi-square tests, Mantel-Haenszel indices, and effect sizes, interpreted according to ETS classification guidelines.40 Predictive validity was examined using multiple regression models to assess associations between PRO scores and treatment adherence, medication side effect burden, healthcare utilization, quality of life, self-reported health status, and insurance satisfaction. Model evaluation included regression coefficients, standardized betas, and the proportion of variance explained (R²).
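As a transparency aid, the following R sketch outlines how these 2 steps could be implemented. The package choices (psych, mirt) and the `responses` object, assumed to be a 500 × 20 data frame of ordered item scores, are illustrative assumptions rather than the authors’ exact workflow, which also used IRTPRO.

```r
# Minimal sketch of the 2-step analysis, assuming `responses` is a
# 500 x 20 data frame of ordered Likert item scores.
library(psych)
library(mirt)

# Step 1 (CTT): internal consistency and item contribution
ctt <- psych::alpha(responses)
ctt$total$raw_alpha                 # Cronbach's alpha (reported: .84)
ctt$item.stats$r.drop               # corrected item-total correlations (retain >= .30)
psych::omega(responses)             # McDonald's omega

# Step 2 (IRT): unidimensional graded response model (GRM)
grm <- mirt(responses, model = 1, itemtype = "graded")
coef(grm, IRTpars = TRUE, simplify = TRUE)  # discrimination (a) and thresholds (b)
marginal_rxx(grm)                   # marginal reliability (reported: 0.91)
```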
Ethical Considerations
The study strictly followed international and local ethical standards, including the Belmont Report and the Declaration of Helsinki. Ethical approval was obtained from the Institutional Review Boards of participating health facilities. Informed consent was secured electronically, with participants required to explicitly agree before proceeding, and they were informed of their right to withdraw or skip questions without penalty. The study ensured anonymity and confidentiality by using unique identifiers and storing data securely on password-protected servers, with records retained for 5 years before permanent deletion. Special provisions were made for vulnerable participants, such as elderly individuals or those with limited literacy, including reading consent forms in the local language.
EQUATOR Reporting Guideline Statement
This study adhered to the principles of the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network, an international initiative dedicated to improving the reliability, transparency, and value of health research by promoting accurate and comprehensive reporting. In line with EQUATOR’s mission, the study followed the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines, which provide structured guidance for reporting observational health research. A completed STROBE checklist has been included as a Supplemental File to enhance transparency and ensure alignment with established reporting standards.
Results
In Table 1, the comparison shows that while Classical Test Theory (CTT) indicated good internal consistency (Cronbach’s α = .84) and meaningful item-total correlations (0.41-0.67), Item Response Theory (IRT) offered greater precision and detail. The IRT graded response model yielded higher marginal reliability (0.91) and provided item-level parameters: discrimination (0.89-2.35) showed how well items differentiate between respondents, and difficulty (−2.10 to 2.45) captured the full spectrum of the latent trait. The test information function peaked at θ = 0.5, highlighting maximum measurement precision around slightly above-average outcomes, a nuance not detectable under CTT. Overall, IRT provided a more refined and informative assessment of reliability and scale sensitivity than CTT.
Table 1. Comparison of CTT and IRT-Based Reliability and Precision.
Figure 1 shows that while CTT indicated acceptable internal consistency (Cronbach’s α = .84) and moderate item-total correlations (0.41-0.67), IRT provided higher marginal reliability (0.91) and more detailed item-level insights. IRT discrimination parameters (0.89-2.35) revealed the scale’s ability to distinguish subtle differences in patient outcomes, and the test information function peaked at θ = 0.5, indicating maximum measurement precision slightly above average. Unlike CTT, IRT highlights where the instrument is most and least precise, offering a more nuanced and clinically informative assessment of PRO reliability and sensitivity.

Figure 1. CTT and IRT precision metrics.
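For readers replicating the precision analysis, a brief sketch (continuing the illustrative `grm` object fitted above) shows how the test information function and conditional standard errors can be obtained:

```r
# Test information and conditional standard errors across the latent trait,
# continuing from the illustrative `grm` model fitted earlier.
theta <- matrix(seq(-3, 3, by = 0.1))
info  <- testinfo(grm, theta)       # test information function (TIF)
se    <- 1 / sqrt(info)             # conditional standard error of measurement

theta[which.max(info)]              # location of peak precision (reported: theta ~ 0.5)
plot(grm, type = "info")            # built-in TIF plot
```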
Table 2 shows that most items (13 of 20) fit the IRT model well, demonstrating non-significant S-X2 values, infit and outfit MNSQ within 0.7 to 1.3, point-biserial correlations above 0.30, standardized residuals within ±2.0, and high item reliability (0.89-0.96). These items reliably discriminate among respondents and contribute strongly to the scale. Four items (Items 4, 7, 11, 19) were borderline, showing marginal misfit with slightly elevated MNSQ values and moderate discrimination, indicating areas for potential refinement. Three items (Items 2, 12, 16) showed significant misfit, low discrimination, high residuals, and reduced reliability, suggesting they may need revision or removal to improve the instrument’s overall psychometric integrity. Overall, the scale is robust, with targeted adjustments likely to enhance measurement precision and validity.
Table 2. Item Fit Statistics.
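A companion sketch, under the same assumptions, shows how the fit criteria in Table 2 could be screened. Note that infit/outfit MNSQ are Rasch-family statistics that mirt reports with a caveat for non-Rasch models such as the GRM.

```r
# Illustrative item-fit screening against the Table 2 criteria.
fit <- itemfit(grm, fit_stats = c("S_X2", "infit"))

subset(fit, p.S_X2 < .05)                    # items with significant S-X2 misfit
subset(fit, infit < 0.7 | infit > 1.3 |
            outfit < 0.7 | outfit > 1.3)     # MNSQ outside the 0.7-1.3 band
```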
The results of the Differential Item Functioning (DIF) analysis presented in Table 3 highlight how item responses vary across demographic groups, providing insight into potential measurement bias in the scale. For the gender comparison (male vs female), 15 items were tested, and 2 were found to demonstrate significant DIF, representing 13.3% of the total items. The chi-square statistics for these items ranged between 12.4 and 18.6.
Table 3. Differential Item Functioning (DIF) Analysis.
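The likelihood-ratio DIF workflow can be sketched in mirt as follows. The `gender` grouping vector is hypothetical, analogous runs would cover age and socioeconomic status, and the Mantel-Haenszel indices reported above would come from complementary tools such as the difR package.

```r
# Illustrative likelihood-ratio DIF test by gender, assuming a hypothetical
# `gender` factor aligned with the rows of `responses`.
mg <- multipleGroup(responses, 1, group = gender, itemtype = "graded",
                    invariance = c("slopes", "intercepts",
                                   "free_means", "free_var"))  # constrained baseline
dif_res <- DIF(mg, which.par = c("a1", "d1", "d2", "d3", "d4"),  # d1-d4 assume 5 categories
               scheme = "add")     # free one item at a time; LR chi-square per item
dif_res                            # flag items with significant chi-square values
```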
Table 4 demonstrates that IRT-calibrated PRO scores significantly predict multiple clinical and policy-relevant outcomes, confirming strong predictive validity. Treatment adherence was the strongest outcome (β = .45, standardized β = .48), with PRO scores explaining 32% of its variance. Quality of life (β = .41) and self-reported health status (β = .38) were also significantly predicted, and healthcare utilization showed a significant negative association (β = −.21), indicating that higher PRO scores corresponded to fewer healthcare visits.
Table 4. Predictive Validity of IRT-Calibrated PRO Scores on Clinical and Policy Outcomes.
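A final sketch illustrates the predictive-validity step: latent trait scores are extracted from the GRM and regressed on an outcome. The `dat` object and its outcome variable are hypothetical placeholders.

```r
# Illustrative predictive-validity regression; `dat` holds hypothetical
# outcome variables aligned with the respondents in `responses`.
dat$theta <- fscores(grm, method = "EAP")[, 1]   # IRT-calibrated PRO scores

m_adh <- lm(adherence ~ theta, data = dat)       # eg, treatment adherence
summary(m_adh)$coefficients                      # unstandardized coefficient
summary(m_adh)$r.squared                         # variance explained (reported: .32)

# Standardized beta via z-scored variables
m_std <- lm(scale(adherence) ~ scale(theta), data = dat)
```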
Discussion
The findings show that IRT provided stronger reliability and precision than CTT for PRO measurement. While CTT demonstrated acceptable internal consistency (α = .84),29 the IRT Graded Response Model produced higher marginal reliability (0.91), indicating more stable trait estimation across the latent continuum, consistent with prior literature.2,16,21,41,42 Item discrimination parameters ranged from 0.89 to 2.35, reflecting strong sensitivity to differences in health status and supporting evidence that IRT better captures item utility than traditional approaches.25,43 The Test Information Function indicated peak precision at θ ≈ 0.5, demonstrating differential accuracy across trait levels rather than uniform error assumptions.20,26,39,44,45 Advancements such as Multidimensional IRT (MIRT)3,5,46-48 and Bayesian estimation4,6,31,40 further enhance construct validity and parameter stability. Additionally, Computerized Adaptive Testing (CAT) can reduce item burden by 50% or more while maintaining reliability,16,30,39,49,50 though implementation requires infrastructure and contextual adaptation.
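As a concrete illustration of the CAT point, the mirtCAT package (an assumption; CAT was not implemented in this study) can administer a calibrated GRM adaptively, stopping once a target standard error is reached:

```r
# Conceptual CAT sketch with mirtCAT, reusing the illustrative `grm` model;
# the simulated respondent and stopping rule are assumptions.
library(mirtCAT)

pattern <- generate_pattern(grm, Theta = 0.5)     # simulated response pattern
cat_run <- mirtCAT(mo = grm, local_pattern = pattern,
                   criteria = "MI",               # maximum-information item selection
                   start_item = "MI",
                   design = list(min_SEM = 0.3))  # stop when SE falls below 0.3
summary(cat_run)     # items administered versus the full 20-item form
```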
DIF analysis showed that most items functioned equivalently across groups. Only 13.3% of items demonstrated gender-related DIF, 1 item showed age-related DIF, and none exhibited SES-related DIF, highlighting IRT’s sensitivity to subtle bias.18,20,40,51 Although limited gender DIF aligns with prior findings,45,50-53 its presence warrants monitoring, as even small biases may contribute to inequities.9,13,44,54 Encouragingly, the absence of SES-related DIF supports equitable application across socioeconomic groups.25,55-57 Ongoing cross-cultural validation, item bank refinement, and longitudinal DIF monitoring are recommended5,6,58,59 to sustain fairness and relevance. IRT-calibrated PRO scores demonstrated strong predictive validity, explaining 32% of variance in treatment adherence, 30% in quality of life, 27% in self-reported health status, and 22% in insurance satisfaction.7,20,26,43,47,57 A significant negative association with healthcare utilization (β = −.21) suggests that higher PRO scores were linked to fewer visits, consistent with efficiency gains reported in prior studies.7,15,16,27 PRO scores also predicted medication side effect burden, supporting their role in pharmacovigilance.10,60-63 Overall, the results position IRT-calibrated PROs as precise, equitable, and clinically informative tools with relevance for electronic health integration, decision-support systems,11,22,30,37,52 and equity-focused health policy development.
Implications for Theory, Policy, and Practice
This study demonstrates that Item Response Theory (IRT) offers significant advantages over Classical Test Theory (CTT) in health outcomes research. Theoretically, IRT enhances psychometric analysis by providing detailed insights into item reliability, discrimination, and precision, while also detecting subtle Differential Item Functioning (DIF) to ensure measurement fairness and equity. From a policy perspective, IRT-calibrated Patient-Reported Outcomes (PROs) offer predictive validity for treatment adherence, quality of life, and healthcare utilization, supporting their use in value-based healthcare, insurance decisions, and equitable resource allocation. The minimal influence of socioeconomic factors on DIF further positions PROs as fair tools for national health surveys and policy planning. Clinically, IRT-based PROs can be integrated into electronic health records and decision-support systems to enable proactive, patient-centered care. They allow clinicians to monitor well-being, side effects, and emerging risks, and guide interventions with precision. Routine psychometric audits and training can enhance workforce capacity to interpret these data effectively. Overall, the study positions IRT not only as a methodological innovation but as a multidimensional framework bridging psychometric rigor with social responsibility, with meaningful applications across theory, policy, and clinical practice.
Conclusions
This study demonstrated that Item Response Theory (IRT) offers clear advantages over Classical Test Theory (CTT) in measuring patient-reported outcomes (PROs), enhancing reliability, fairness, and predictive validity. IRT provided precise item-level estimates and test information functions, capturing variation across the health continuum more effectively than CTT. It also detected and addressed limited but meaningful gender- and age-related item biases, ensuring equitable measurement. IRT-calibrated PRO scores meaningfully predicted clinical outcomes (eg, treatment adherence, quality of life, medication side effects) and policy-relevant indicators (eg, insurance satisfaction), supporting both patient care and resource planning. Overall, the findings position IRT-based PRO instruments as powerful tools for clinical decision-making, health policy, and patient-centered care, emphasizing that rigorous psychometric methods improve both methodological quality and ethical practice. Future work should focus on cross-cultural validation, item bank development, and longitudinal testing in diverse settings.
Recommendations
The study recommends that researchers prioritize Item Response Theory (IRT) for developing and validating patient-reported outcome (PRO) instruments, including longitudinal applications to ensure responsiveness and validity across diverse populations. Clinicians should integrate IRT-calibrated PROs into routine practice via electronic health records and dashboards to support real-time monitoring, risk identification, and individualized care, with training provided to interpret IRT outputs. Policymakers and administrators are encouraged to use IRT-based PRO data for health system evaluation, resource allocation, and value-based policy decisions, leveraging its predictive validity for treatment adherence, quality of life, and healthcare utilization. Finally, routine psychometric audits, including Differential Item Functioning (DIF) monitoring, cross-cultural validation, and item bank updates, are recommended to maintain fairness, equity, and accuracy in healthcare measurement.
Limitations
The study’s cross-sectional design limits causal inference, and the predictive validity analyses, while informative, cannot confirm temporal relationships. The sample was confined to outpatient facilities in southern Ghana, which may restrict generalizability to other regions or rural settings. Reliance on self-reported data introduces potential recall bias and social desirability effects, despite efforts to ensure clarity and cultural appropriateness. The sample size (n = 500) was at the lower bound for IRT modeling, possibly reducing the stability of item parameters and power for subgroup DIF analyses. Finally, electronic data collection may have excluded individuals with limited digital literacy, introducing potential selection bias. These limitations highlight the need for longitudinal designs, broader and larger samples, mixed-mode data collection, and triangulation with clinical indicators in future research.
Footnotes
Acknowledgements
The authors express their gratitude to the patients and healthcare staff who participated in the study across the selected outpatient facilities in Ghana. Special appreciation is extended to the hospital administrators and research assistants who facilitated data collection. The authors also acknowledge the Department of Educational Foundations at the University of Education, Winneba, for their administrative and academic support.
Ethical Considerations
Ethical clearance for this study was obtained from the Institutional Review Board of the University of Education, Winneba, with Approval Number: UEW/IRB/2025/034, granted on 15 August 2025.
Consent to Participate
Electronic informed consent was secured from all participants prior to data collection, and participation was entirely voluntary. The study adhered to international ethical standards for research involving human participants, ensuring confidentiality, anonymity, and secure data handling throughout the research process.
Author Contributions
Each author played a distinct role in the development of this study. SN was responsible for data analysis and interpretation, ensuring that both Classical Test Theory (CTT) and Item Response Theory (IRT) findings were thoroughly examined. GOA drafted the introduction and provided the contextual framing of the study. RA and JG jointly developed the methodology section, including the research design, sampling strategy, and analytical framework. DA and LLS contributed to the discussion of results, offering theoretical insights and contextual interpretation. EOA prepared the conclusion and recommendations, highlighting the study’s significance and future directions. TB from Botswana University of Agriculture and Natural Resources served as a reviewer, critically evaluating the manuscript and providing feedback to strengthen its quality and clarity.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets generated and analyzed during this study are not publicly available due to ethical restrictions and confidentiality agreements with participating healthcare facilities. However, anonymized datasets or summary results may be made available upon reasonable request to the corresponding author (SN), subject to institutional approval and compliance with ethical data-sharing protocols.
Supplemental Material
Supplemental material for this article is available online.
