Abstract
This paper addresses ordinal classification processes involving multiple raters. A probit hierarchical model is proposed that links raters' ordinal ratings to rater diagnostic skills (bias and magnifier) and to patient latent disease severity, where the latent disease severity is assumed to follow a latent class normal mixture distribution. This model specification yields closed-form expressions for both overall and individual rater receiver operating characteristic (ROC) curves and for the areas under these ROC curves (AUCs). We further extend the model by incorporating covariate information, adding a regression layer for the rater diagnostic skill parameters and/or for the patient latent disease severity. The extended covariate models also offer closed-form solutions for covariate-specific ROCs and AUCs. These analytical tools greatly facilitate traditional diagnostic accuracy analysis. We demonstrate our methods with a detailed mammography example.
Introduction
In diagnostic medicine, the assessment of the diagnostic accuracy of different diagnostic tests is very important. Many diagnostic procedures involve task-based judgement from radiologists/raters. For example, image analysis is increasingly integral to screening and diagnostic processes, encompassing rater interpretation of various modalities such as X-rays, mammograms, MRI scans, and CT scans. When raters' task-based judgement is involved, a multi-rater multi-case (MRMC)1 study design is commonly recommended for assessing the diagnostic ability of raters. The MRMC structure, where each rater interprets every case (test result) for all patients, ensures a comprehensive and consistent evaluation of the diagnostic test. A systematic review of and guide to designing MRMC studies can be found in Obuchowski and Bullen.1
The resulting MRMC data are inherently correlated and therefore challenging to analyze. In the literature on MRMC receiver operating characteristic (ROC) analysis, four parametric methods stand out for handling complicated correlated MRMC data. Dorfman, Berbaum, and Metz (DBM)2,3 proposed a mixed-effects ANOVA model fitted to jackknife pseudovalues of ROC summary measures. Obuchowski and Rockette4 proposed a linear ANOVA model to fit ROC summary measures, treating patients as random. The Beiden–Wagner–Campbell method5 builds on the original DBM method with a focus on variance component estimation. Song and Zhou6 proposed a marginal regression model that handles only areas under the curve (AUCs). In addition to these methods, Obuchowski7 proposed a non-parametric approach to analyzing clustered ROC data through the AUC. A complete review of these methods can be found in Obuchowski7 and Hillis.8 It is noteworthy that all these methods require the presence of the true disease status, also referred to as the gold standard, and that their models are based on diagnostic summary measures such as the AUC rather than on the original rating data.
Our motivating data set has the MRMC structure. Instead of assessing the diagnostic accuracy of mammograms when interpreted by many radiologists, our proposed approach focuses on assessing the diagnostic abilities of the radiologists in interpreting the mammograms. Beam et al.9 aimed to investigate the variability in mammogram interpretation among radiologists in the United States. The study selected 110 radiologists through a stratified random sample from mammography facilities accredited by the U.S. Food and Drug Administration. Original index mammograms of high technical quality were obtained from 148 women, drawn from an age-stratified random sample of a large screening program affiliated with the University of Pennsylvania. Every radiologist interpreted all mammograms in a controlled reading environment over two 3-hour periods. A modified 5-point Breast Imaging Reporting and Data System (BI-RADS) scale10 was used for the radiologists to record their findings: 1, normal, return to normal screening; 2, benign, return to normal screening; 3, probably benign, 6-month follow-up; 4, possibly malignant, biopsy recommended; 5, probably malignant, biopsy strongly recommended. The true disease status of the women screened was determined by a biopsy examination or a minimum follow-up period of 2 years.
We chose a subset of the Beam data comprising 146 patients and 107 raters, selected from the initial pool through an evaluation of data completeness.
Beam's data involve a larger number of raters than common MRMC data. For such a large set of MRMC data, Lin et al.11 employed Bayesian hierarchical modeling to account for the between-rater variability and the clustered patient variability. Lin et al.11 dichotomized the ordinal ratings into binary ratings and linked raters' binary decisions with patients' true binary disease statuses to form an agreement web. To our knowledge, it is the first paper to assess rater diagnostic ability by directly modeling a diagnostic skill set (rater bias and rater magnifier) instead of using the traditional paired sensitivity and specificity. The rater bias refers to a rater's inherent inclination to consistently rate a patient as "diseased." The rater magnifier interacts with the patient latent disease severity, magnifying it in the correct direction and thereby increasing the likelihood of a correct decision. Lin et al.11 laid the groundwork for investigating raters' diagnostic skills. Later, Kim et al.12 built on Lin's approach and generalized these methods to ordinal rating scales.
This paper continues the line of work of Lin et al.11 and Kim et al.,12 employing Bayesian probit modeling to connect the ordinal rating with rater bias, rater magnifier, and patient latent disease severity. Our proposed approach has three key strengths over existing approaches. First, a drawback of Kim et al.12 is that their approach cannot predict patients' true binary disease statuses; instead, their rating agreement web relies on gold standards being included in the model. In this paper, we adopt the latent class modeling idea and treat the unknown binary disease statuses as latent classes for modeling the distribution of the patient disease severity. In this way, the Markov chain Monte Carlo (MCMC) sampling allows for the prediction of the true disease status for patients. Second, we develop a novel approach to evaluating rater ROCs and AUCs that dramatically reduces the work of formulating ROCs and AUCs required in Kim et al.12 Our approach utilizes the binormal distribution of the latent diagnostic score and derives closed forms for smooth ROCs and AUCs, providing enhanced flexibility for studying both individual and overall rater skills. Third, this paper substantially extends the primary model by incorporating patient and rater covariates into the diagnostic analysis. By adding a regression layer on the rater parameters and/or the patient disease severity, we examine rater covariate effects on diagnostic skills and/or patient covariate effects on disease progression. More importantly, rater and/or patient covariate-specific ROCs and AUCs are derived. These extended models open avenues for refining models through systematic modifications and covariate adjustments. It is worth mentioning that some ROC analyses have already incorporated covariate information, but in a very different way. For example, Pepe et al.13 represented the ROC curve as the distribution of standardized marker measurements in cases and, based on it, proposed three different ways to directly incorporate covariates into the ROC analysis.
The remainder of this paper is organized as follows. Section 2 presents the primary model and its hierarchical distributions, along with the closed forms of the overall and individual rater ROC curves and their corresponding AUCs. Section 3 extends the primary model by incorporating rater and patient covariates and derives the related ROC analyses. Section 4 applies our proposed methods to analyze the mammogram data and compares the results across the different models. Section 5 concludes with a discussion. The computational algorithms and the results of simulation studies are presented in a supplementary file.
Primary model and ROC analysis
Primary model
We consider a study setting with the MRMC structure, where each rater assigns an ordinal rating to every patient.
The second layer of our model specifies the distributions for the latent patient disease severity, which is assumed to follow a two-component normal mixture with the patient's unknown binary disease status (diseased versus non-diseased) serving as the latent class.
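In generic notation (the symbols here are assumed for illustration and may differ from the paper's), a latent class normal mixture layer of this kind can be written as

$$D_i \sim \mathrm{Bernoulli}(\pi), \qquad T_i \mid D_i = d \sim N(\mu_d, \sigma_d^2), \quad d \in \{0, 1\},$$

where $D_i$ denotes patient $i$'s unknown binary disease status, $\pi$ the disease prevalence, and $T_i$ the latent disease severity.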
Raters' biases and magnifiers are assumed to vary across the rater population according to normal distributions.
For the Bayesian estimation, the third layer of our model specifies priors for all parameters. We assign conventional non-informative/diffuse priors to all the parameters.
To better understand model (1) and to develop an ROC analysis, we introduce a latent diagnostic process scheme for each rating. That is, each rater is assumed to form a continuous latent diagnostic score for each patient, and the observed ordinal rating arises from comparing this score with a set of ordered thresholds.
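To make this generative scheme concrete, the following is a minimal simulation sketch of a rating process of this general form; the numerical values, the specific parameterization, and the variable names are hypothetical, with the exact specification given by model (1).

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 146, 107                # patients and raters, as in the mammogram data

# Latent class normal mixture for disease severity (hypothetical parameters)
D = rng.binomial(1, 0.5, size=I)                      # true binary disease status
mu, sd = np.array([0.0, 2.0]), np.array([1.0, 1.0])   # class-specific means and SDs
T = rng.normal(mu[D], sd[D])                          # latent disease severity

# Rater diagnostic skill parameters (hypothetical population distributions)
bias = rng.normal(0.0, 0.5, size=J)                   # rater bias
mag = rng.normal(1.0, 0.3, size=J)                    # rater magnifier

# Latent diagnostic scores, then ordinal ratings via ordered thresholds
Z = bias[None, :] + mag[None, :] * T[:, None] + rng.normal(size=(I, J))
gamma = np.array([-0.5, 0.5, 1.5, 2.5])               # 4 thresholds for a 5-point scale
Y = np.digitize(Z, gamma) + 1                         # ratings on the 1..5 scale
```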
The ROC curve is a standard method for measuring diagnostic accuracy, plotting sensitivity (the true positive fraction) against one minus specificity (the false positive fraction) across all possible decision thresholds.
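For reference, when the latent diagnostic score is normally distributed within each disease class (the binormal setting), the smooth ROC curve and its AUC take the standard closed forms below; the paper's expressions specialize these to its bias/magnifier parameterization, so the symbols here are generic.

$$\mathrm{ROC}(t) = \Phi\!\left(a + b\,\Phi^{-1}(t)\right), \qquad \mathrm{AUC} = \Phi\!\left(\frac{a}{\sqrt{1 + b^2}}\right),$$

where $a = (m_1 - m_0)/s_1$ and $b = s_0/s_1$, with $(m_d, s_d^2)$ the within-class mean and variance of the diagnostic score for class $d \in \{0, 1\}$, and $\Phi$ the standard normal distribution function.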
Beyond individual diagnostic performance, our interest extends to monitoring the overall diagnostic accuracy of the entire group of raters. Pepe15 mentioned pooling all ratings from the raters to formulate the overall ROC curve and AUC. Metz16 suggested averaging over the parameters of the individual rater ROC curves to obtain the overall ROC curve. However, this type of average ROC curve in general does not have an AUC equal to the mean of the individual AUCs. Therefore, Chen and Samuelson17 proposed both a non-parametric method and a parametric method for generating an area-preserving average ROC curve from the ROC curves of individual raters. In this paper, the overall rater ROC and AUC are not simply based on pooling individual rater ratings but are model-based theoretical quantities. They are derived from the average diagnostic scores for the diseased and non-diseased groups after integrating over the distributions of the rater bias and magnifier parameters.
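The following sketch illustrates numerically why a parameter-averaged binormal ROC curve need not preserve the mean AUC; the two raters' binormal parameters are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def binormal_auc(a, b):
    """AUC of the binormal ROC curve ROC(t) = Phi(a + b * Phi^{-1}(t))."""
    return norm.cdf(a / np.sqrt(1.0 + b ** 2))

# Hypothetical binormal parameters for two individual raters
a_pars = np.array([0.8, 2.0])
b_pars = np.array([1.0, 0.4])

mean_of_aucs = binormal_auc(a_pars, b_pars).mean()        # about 0.841
auc_of_mean = binormal_auc(a_pars.mean(), b_pars.mean())  # about 0.874
print(mean_of_aucs, auc_of_mean)  # the two quantities differ in general
```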
The motivating mammogram dataset provides additional information on both patients and raters, prompting us to incorporate this information as covariates in the model. Including covariates enhances the model's flexibility, enabling it to account for different sources of variation in the data and to reveal how particular factors influence the outcome. Adjusting for covariates improves the model's predictive accuracy and facilitates a more nuanced understanding of underlying patterns. Incorporating covariates also improves the interpretability of the model, helping researchers grasp how various factors contribute to observed outcomes and draw meaningful conclusions from the analysis.
For many diseases, patient characteristics strongly affect disease severity. For instance, patient age is a significant variable that influences disease status. Patients in the motivating mammogram dataset were age-stratified; therefore, including patient age provides insights into age-specific effects on the diagnostic process. Various characteristics can affect a rater's performance/ability,14,18,19 including: experience and expertise (more experienced raters typically perform better); training and education (adequate training and education can improve a rater's performance by providing the necessary knowledge and skills to perform the task effectively); and bias (raters may have personal biases that affect their performance, such as confirmation bias, where they tend to look for information that supports their preconceptions, and availability bias, where they rely too heavily on information that is readily available to them). All these characteristics can be viewed as rater covariates.
To explicitly examine covariate effects, we first study patient and rater covariate effects separately. We then explore whether incorporating all covariate variables (both patients' and raters') simultaneously significantly impacts the rater diagnostic accuracy analysis. The three covariate models are designated the Patient-covariate model, the Rater-covariate model, and the Combined-covariate model. Each of these models builds upon the original primary model.
Patient-covariate model and Patient-covariate-specific ROC analysis
Suppose that each patient is associated with a vector of covariates, such as age. A regression layer is then placed on the patient latent disease severity so that the within-class severity means depend on these covariates.
The introduced latent continuous rating score retains the same structure as in the primary model, so that, conditional on the patient covariates, it remains normally distributed within each disease class.
Next, we derive the overall rater ROC and AUC formulas given the patient covariate level. Given the normal distributions of the diagnostic scores within each disease class at a fixed covariate level, the covariate-specific ROC curves and AUCs again take closed binormal forms.
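Schematically, if the patient covariates $x$ shift the within-class means of the diagnostic score, the covariate-specific ROC retains the binormal form (generic symbols assumed, as before):

$$\mathrm{ROC}_x(t) = \Phi\!\left(a(x) + b(x)\,\Phi^{-1}(t)\right), \qquad \mathrm{AUC}(x) = \Phi\!\left(\frac{a(x)}{\sqrt{1 + b(x)^2}}\right),$$

with $a(x) = \{m_1(x) - m_0(x)\}/s_1(x)$ and $b(x) = s_0(x)/s_1(x)$, where $m_d(x)$ and $s_d^2(x)$ denote the class-$d$ mean and variance of the diagnostic score at covariate level $x$.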
Rater-covariate model and Rater-covariate-specific ROC analysis
Suppose that each rater is associated with a vector of covariates, such as age, gender, experience year, and academic facility linkage.
The difference between the original model and the Rater-covariate model is that we incorporate rater covariates and assume a regression layer on both the rater bias and the rater magnifier.
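In generic notation (coefficient symbols assumed), a rater regression layer of this form can be written as

$$b_j = \mathbf{w}_j^{\top}\boldsymbol{\lambda}_b + e_{bj}, \qquad a_j = \mathbf{w}_j^{\top}\boldsymbol{\lambda}_a + e_{aj},$$

where $\mathbf{w}_j$ collects rater $j$'s covariates, $b_j$ and $a_j$ denote the rater's bias and magnifier, $\boldsymbol{\lambda}_b$ and $\boldsymbol{\lambda}_a$ are coefficient vectors, and $e_{bj}$ and $e_{aj}$ are normal errors.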
Combined-covariate model
The patient and rater covariate effects on diagnostic accuracy have so far been studied separately. Incorporating all covariates from both patients and raters simultaneously into a single model provides a comprehensive understanding of how these factors collectively influence rater diagnostic accuracy. In this subsection, we add both the patient covariate regression layer (11) and the rater covariate regression layer (17) to the original model and refer to the extension as the Combined-covariate model. The nature of the Combined-covariate model precludes deriving a binormal distribution of the latent diagnostic scores in the same direct way as before.
Application to the mammogram data
In this section, we apply the proposed methods (the original and extended models) to analyze the mammogram dataset introduced previously. We have developed four MCMC Gibbs samplers (one for each model) to estimate the parameters, with details described in the supplementary file. For each model, we run 30,000 MCMC iterations, taking the first 5,000 iterations as a burn-in period. The convergence of the MCMC chains is verified using multiple convergence criteria within an R package.
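The samplers themselves are detailed in the supplement; for flavor, a typical conditional update in ordinal probit Gibbs samplers is the data-augmentation draw of the latent scores from truncated normals. The sketch below shows this generic step with assumed variable names, not the paper's exact algorithm.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_latent_scores(Y, mean, gamma, rng):
    """Generic data-augmentation step: draw each latent score Z_ij from
    N(mean_ij, 1) truncated to the threshold interval implied by rating Y_ij.

    Y     : (I, J) integer ratings in 1..C
    mean  : (I, J) current values of bias_j + mag_j * T_i
    gamma : length C+1 thresholds with gamma[0] = -inf and gamma[C] = +inf
    """
    lo = gamma[Y - 1] - mean   # standardized lower truncation bounds (scale = 1)
    hi = gamma[Y] - mean       # standardized upper truncation bounds
    return truncnorm.rvs(lo, hi, loc=mean, scale=1.0, random_state=rng)
```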
It is established that patient age has a significant impact on breast cancer progression. Therefore, for the covariate models, we select patient age as the patient covariate, with patient-covariate coefficient vectors defined separately for the diseased and non-diseased groups.
Estimation results for the main parameters are summarized in Table 1, which lists the estimated posterior means (point estimates), posterior standard deviations, and 95% credible intervals.
Table 1. Parameter estimates (SD) and 95% credible intervals for the four models.
We first discuss several common parameters across the four models, beginning with the estimated thresholds.
To assess the diagnostic variation of the whole group of radiologists, we examine the standard deviations of the rater bias and rater magnifier parameters.

Figure 1. Mammogram data: plots of point estimates and 95% credible intervals of rater bias.
The estimation of the standard deviation parameters of the disease severity mixture is reflected in the fitted mixture distribution plotted below.

Figure 2. Mammography data: plot of the estimated mixture distribution of patient latent disease severity.
In both the Patient-covariate and Combined-covariate models, patient age is significant for the disease severity of the diseased group but not for that of the non-diseased group. When patient age increases by one year, the mean disease severity of the diseased group is estimated to increase by 0.039 units.
From Table 1, the Rater-covariate model and the Combined-covariate model produce almost identical estimates of the rater covariate coefficients. The four covariates have no significant effects on rater bias, but rater age and experience year have significant effects on the rater magnifier. Rater age shows a negative effect: with a one-year increase in rater age, the mean rater magnifier decreases by approximately 0.01 units. Rater experience shows a positive effect: the mean rater magnifier increases by approximately 0.009 units per additional year of experience.
One advantage of our proposed disease latent class modeling is that it facilitates the prediction of patients' true binary disease statuses. To check the prediction adequacy, Figure 3 plots the average predicted patient-specific binary disease statuses across the MCMC iterations.

Figure 3. Mammography data: (top panel) plots of the average predicted patient-specific binary disease statuses.
The MCMC sampling also allows the estimation of each rater's bias and magnifier.
We now shift our focus to the assessment of rater performance through ROC curves and AUCs. The posterior estimates of the ROC curves for individual raters under the original model are presented in Figure 4. Among these curves, the green curve corresponds to the ROC curve of the best-performing rater (ID: Rater 53), with the highest AUC; the ROC curves of the median rater (ID: 14) and the worst-performing rater (ID: 63) are also highlighted.

Figure 4. Mammogram data: estimated individual ROC curves, highlighting the ROC curves of the best rater (ID: 53), the median rater (ID: 14), and the worst rater (ID: 63), from the original model.
The Patient-covariate model enables drawing patient-age-specific ROC curves for each rater, thus visualizing the patient age effect on each rater's ROC curves. Figure 5 shows ROC curves of the best (ID: 53) and worst (ID: 63) raters at patient age levels 25, 35, 45, 55, and 65. The trend is the same for both raters: the older the patient, the higher the ROC curve and the larger the AUC. The best and worst raters' AUCs differ substantially at the same patient age level, especially for younger patients. At patient age 25, the 53rd rater's AUC is 0.950 and the 63rd rater's is 0.843, a difference of more than 0.1.

Figure 5. Mammogram data: estimated patient-age-specific ROC curves. The left panel is for the 53rd (best) rater; the right panel is for the 63rd (worst) rater.
The Rater-covariate model enables drawing rater-covariate-specific ROC curves over the entire patient group. Table 1 shows that rater age and rater experience year have significant effects on the rater magnifier. Figure 6 shows the effect of rater experience year on ROC curves and AUCs across different levels of academic facility linkage and gender, with rater age fixed at 50. The upper two graphs are for raters who are not academically linked and the lower two graphs are for raters who are academically linked; within each row, males are on the left and females on the right. The rater experience levels shown are 5, 10, 15, 20, and 30 years. Figure 6 illustrates an overall pattern in which longer experience yields higher ROC curves and larger AUCs. According to the AUC estimates, the academic group slightly outperforms the non-academic group, and female raters consistently exhibit slightly larger AUCs than male raters. Note that the ROC curves for investigating the rater age effect are displayed in the supplementary file.

Figure 6. Mammogram data: estimated rater-covariate-specific ROC curves for investigating the effect of rater experience year across different levels of academic facility linkage and gender, with rater age fixed at 50.
The Combined-covariate model is particularly intriguing and complicated, as patient and rater covariates synergistically influence ROC curves and AUCs. To illustrate, we consider only academically linked raters with age fixed at 50 and display the results for investigating the effects of rater experience year and patient age across the two levels of gender. The plots investigating the effects of rater age and patient age across the two levels of gender can be found in the supplementary file. Figure 7 shows that, regardless of gender, the longer the experience, the higher the ROC curve climbs. Across patient age levels (35, 55, and 65, looking across columns), we observe that the older the patient, the higher the ROC curve and the larger the AUC. Although the gender effect is not significant for the diagnostic magnifier (see Table 1), comparing the two rows in Figure 7 shows that female raters consistently perform better than male raters, with slightly larger AUCs.

Figure 7. Mammogram data: estimated patient-rater-covariate-specific ROC curves for investigating the effects of rater experience year and patient age for academically linked raters across different levels of gender, with rater age fixed at 50.
Discussion
In this paper, we have proposed a hierarchical probit model with latent disease class modeling to analyze MRMC ordinal data, aiming to assess the diagnostic accuracy of raters. We then extended the model by incorporating covariate effects into the rater diagnostic skill parameters and/or the patient latent disease severity. The four models are referred to as the original model, the Patient-covariate model, the Rater-covariate model, and the Combined-covariate model. For each model, we have developed smoothed ROC curves and corresponding AUC formulas to study rater diagnostic accuracy with or without covariate effects. Gibbs samplers have been developed for parameter estimation in all four models. Simulation studies show good estimation performance for the four models. The details of the Gibbs samplers and the simulation results are documented in the supplementary file. Applying the proposed models to Beam's mammogram data demonstrates their practicality and validity.
Each proposed model serves a distinct purpose. The original model studies raters' ROC curves individually and globally. The Patient-covariate model reveals patient-specific covariate effects on rater diagnostic accuracy. The Rater-covariate model captures the effects of rater-specific covariates on diagnostic accuracy. The Combined-covariate model allows patient and rater covariates to affect diagnostic accuracy synergistically. Including covariates allows an in-depth analysis of how each covariate contributes to changes in diagnostic accuracy through the analysis of ROC curves and AUCs. In the Combined-covariate model especially, patient and rater covariates work together, providing deeper insights into the covariate effects on diagnostic accuracy. The choice of model depends on the research questions and the information available in the dataset. The proposed approach also enables researchers to better understand patient and rater covariate effects on disease status prediction. This insight can empower healthcare providers to make better decisions when diagnosing patients.
In the literature, the label switching problem often arises during parameter estimation when latent class models are applied.22,23 In this paper, the diseased and non-diseased label switching issue is not present during the parameter estimation process, even though the decomposition of the patient latent disease severity into two latent classes carries no a priori labels.
Our proposed models do not require any established gold standard, which represents the true disease status. However, should a gold standard be available, our methods can seamlessly adapt by simply dropping the sampling step for the latent binary disease statuses.
Imbalanced MRMC data are often encountered in practice. The incompleteness of the data does not pose a significant challenge for Bayesian MCMC estimation methods: to address missing data, one simply treats them as latent variables and samples them during the MCMC iterations. However, missing data may worsen the label switching/non-identifiability issue. In the supplemental file, simulation results are presented to show the effect of missing data on the estimation. In future work, we will investigate the label switching problem systematically by connecting the missing data percentage with the disease severity parameters for the diseased and non-diseased patient groups.
Appendix
Assume that a rater has a continuous rating score
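A standard binormal argument of this kind, with assumed generic notation, recovers the closed-form AUC. Suppose the continuous score is distributed $N(m_0, s_0^2)$ for non-diseased patients and $N(m_1, s_1^2)$ for diseased patients; then

$$\mathrm{AUC} = P(Z_1 > Z_0) = \Phi\!\left(\frac{m_1 - m_0}{\sqrt{s_0^2 + s_1^2}}\right) = \Phi\!\left(\frac{a}{\sqrt{1 + b^2}}\right),$$

where $Z_1$ and $Z_0$ are independent scores from the diseased and non-diseased groups, $a = (m_1 - m_0)/s_1$, and $b = s_0/s_1$, matching the binormal ROC curve $\mathrm{ROC}(t) = \Phi(a + b\,\Phi^{-1}(t))$.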
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors are grateful for the partial support provided by R01 grant (5R01CA226805-04) from the United States National Institutes of Health.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References