Abstract

Agreement is an important concept in medical and behavioral sciences, in particular in clinical decision making, where disagreements may imply different patient management. The concordance correlation coefficient is an appropriate measure to quantify agreement between two scorers on a quantitative scale. However, this measure is based on the first two moments, which may poorly summarize the shape of the score distribution on bounded scales. Bounded outcome scores are common in medical and behavioral sciences. Typical examples are scores obtained on visual analog scales and scores derived as the number of positive items on a questionnaire. These kinds of scores often show a non-standard distribution, like a J- or U-shape, which calls into question the usefulness of the concordance correlation coefficient as an agreement measure. The logit-normal distribution has been shown to be successful in modeling bounded outcome scores of two types: (1) when the bounded score is a coarsened version of a latent score with a logit-normal distribution on the [0,1] interval and (2) when the bounded score is a proportion with the true probability having a logit-normal distribution. In the present work, a model-based approach, based on a bivariate generalization of the logit-normal distribution, is developed in a Bayesian framework to assess agreement on bounded scales. This method makes it possible to study directly the impact of predictors on the concordance correlation coefficient and can be implemented simply in standard Bayesian software, like JAGS and WinBUGS. The performance of the new method is compared to that of the classical approach using simulations. Finally, the methodology is applied in two different medical domains: cardiology and rheumatology.

Keywords

Limited scale, logistic transform, intraclass correlation coefficient, Likert scale, concordance

1 Introduction

Reliability and validity studies are of paramount importance in behavioral and medical sciences. They provide information about the amount of error inherent to any diagnosis, score or measurement. Reliability studies involve repeated measurements of a random sample of items from a target population under the same experimental conditions. On quantitative scales, reliability is classically quantified by an intraclass correlation coefficient (ICC).^1,2 In validity studies, and in particular when studying criterion validity, the measurement instrument is calibrated against an established method. The established method is often regarded as a “gold standard” measuring the “true value” of the quantity to be determined. However, it is frequent that the reference method cannot be viewed as giving a true value. Then, the comparability of the new and the reference methods is assessed by the degree of agreement between them. Agreement is also important in clinical decision making, where disagreements between physicians can lead to different treatments for the patient. On quantitative scales, agreement can be assessed visually using the Bland and Altman plot³ or be quantified using the concordance correlation coefficient (CCC).^4,5 In the presence of two fixed scorers, the ICC and CCC are equivalent under the assumption that the joint distribution of the scores given by the two methods is bivariate normal. However, the concordance correlation can also be determined without the bivariate normality assumption, which represents an advantage over the intraclass correlation.^6,7

The present paper is motivated by two studies. The first study, the COCO study (unpublished work), is a study on the COmpliance and COmplexity of drug regimen in hypertension. The primary objective was to compare the compliance to an antihypertensive treatment when it was given in a single tablet or as two or three separate tablets. Compliance was assessed on a 100 mm visual analog scale (VAS) by the patient and the physician. The interest was therefore also in the agreement level between the patient and the physician assessments and in the impact of the drug regimen complexity on this agreement level. The second study is the PETRA study,⁸ an exploratory study on the use of F-Fluorodeoxyglucose Positron Emission Computer Tomography (F-FDG PET/CT) in the assessment of rheumatoid arthritis (RA) remission. The DAS28 score, involving a clinical examination of 28 particular articulations for swelling and tenderness, is usually used in the assessment of remission. The presence of synovitis was assessed using F-FDG PET/CT and ultrasound (US) in these 28 articulations. The agreement between the percentage of positive joints detected with the F-FDG PET/CT and the US is of particular interest to study whether these techniques could be used interchangeably as an additional tool in the diagnosis of remission.

A first particularity of these studies is that the outcome is a bounded score and can present a variety of distributions, including J- or U-shapes⁹ with many observations at the boundaries. Second, the impact of covariates on the CCC is of direct interest in the first study. Barnhart and Williamson¹⁰ proposed the use of three sets of generalized estimating equations to model the CCC according to covariates. However, on bounded scales, the adequacy of the CCC as an agreement measure, and therefore of the method of Barnhart and Williamson, could be questioned because the CCC is based on the first two statistical moments, which might not be good summary measures to describe J- or U-shapes. Although the CCC was also recently generalized to handle log-normal distributions,^11,12 the methods do not cover data with more general J- or U-shapes.

Hutton and Stanghellini¹³ proposed to use censored skewed normal distributions to model bounded outcomes. They consider the case where the bounded score is a weighted average of scores obtained on various questionnaires and scaled to range between 0 and 100. The idea behind their approach is to assume the existence of values beyond the range of the bounded scale and to consider that values at the boundaries are censored. In the two motivating examples, this assumption can hardly be made. In the COCO study, the outcome is measured on a VAS scale where the boundaries are defined as the smallest and largest possible score. In the PETRA study, the outcome is the percentage of positive joints which cannot be extended outside the 0–100% range. Moreover, U-shapes cannot be adequately described using a censored skewed normal distribution. Lesaffre et al.⁹ developed a methodology to handle bounded scores of two types: (1) when the bounded score is a coarsened version of a latent score with a logit-normal (LN) distribution on the [0,1] interval, like potentially in the COCO study and (2) when the bounded score is a proportion with the true probability having a LN distribution, like possibly in the PETRA study. The LN distribution was originally suggested by Johnson¹⁴ and can describe a variety of distributions, including distributions with J- and U-shapes. A LN distribution can directly be assumed when the bounded score is a pure percentage, like the ejection fraction in cardiology or is a continuous bounded score.

In the present work, a model-based CCC based on a bivariate generalization of the LN distribution is defined. We adopted a model-based approach to permit a direct relationship between the CCC and predictors. This could help researchers in improving agreement levels by identifying influential factors. Statistical inference for the CCC is obtained within the Bayesian framework. The choice of a Bayesian framework is motivated by the fact that it has shown good frequentist properties in a variety of settings. In Bayesian inference, prior knowledge about the parameters is combined with the likelihood to yield a posterior distribution. When the posterior distribution is not of standard form, Markov Chain Monte Carlo (MCMC) can be used to sample from the posterior distribution. Bayesian estimation is also flexible in the presence of covariates and missing values and easy to implement. The methodology developed in this paper can be directly implemented in standard Bayesian software, like JAGS and WinBUGS.

The motivating datasets are presented in Section 2. Then, the classical definition of the CCC is given in Section 3. This definition is extended in Section 4 to bounded scales and the inferential procedure is given in Section 5. In Section 6, the Bayesian estimation method is presented. Simulations are performed to assess the properties of the new methodology in Section 7. Then, the method is illustrated on the COCO and the PETRA studies in Section 8. Finally, the methodology is discussed in Section 9.

2 Motivating studies

2.1 COCO study

About 20–30% of adults suffer from hypertension, a major risk factor for cardiovascular diseases. The control of hypertension by adapting the lifestyle behavior and taking medication reduces these risks. Unfortunately, less than 50% of the patients treated for hypertension have controlled blood pressure.¹⁵ Compliance to the treatment is an important determinant of controlled blood pressure. Indeed, whatever the definition of compliance, poor compliance is the most important cause of uncontrolled blood pressure and only 50–70% of the patients being treated for hypertension in real life situation are considered to be “good compliers.”

Negative determinants of compliance include multiple daily dosing, chronic duration and asymptomatic disease. The COCO study aimed to investigate whether a fixed combination of antihypertensives (taken as a combined treatment in one tablet) instead of multiple drug intake could decrease drug regimen complexity and improve treatment compliance. To avoid bias related to the type of drug used, attention was restricted to the combination of a diuretic with another antihypertensive drug, given as a combination tablet or as two or three separate tablets.

The COCO study is a multicenter survey carried out from November 2005 to June 2006. A total of 1260 eligible patients, on a stable hypertensive treatment for at least six months, were evaluated during a regular visit by their physician. In this report, the agreement between the patient and the physician assessment of the compliance on a 100 mm VAS scale is studied. The effect of the number of tablets on the agreement level is of particular interest since it could be more difficult to define compliance when increasing treatment complexity. Other possible predictors considered are gender, disease duration (year), acceptability of the treatment (5-point Likert scale) and tolerance to the treatment (%). Compliance scores were available for 1025 patient–physician pairs and are displayed in Figure 1.

Figure 1.

COCO study. Compliance assessed on a 100 mm VAS scale by the patient and the physician. The observed marginal distribution (histogram) and the marginal density distribution predicted by the LN approach (lines) are reported in the margins.

2.2 PETRA study

The disease activity score based on 28 joints (DAS28) might not be sufficient to assess remission in RA. Several studies have shown that patients in remission according to the DAS28 still show evidence of synovitis on US and magnetic resonance imaging. Those patients could eventually develop irreversible joint damage. Although F-FDG PET/CT is known to be correlated with DAS28 in patients with active RA, its role in assessing remission has not been evaluated yet.

The PETRA study⁸ is therefore an exploratory study to see whether F-FDG PET/CT could be used in the assessment of RA remission. The presence of synovitis was assessed on the 28 joints involved in the determination of the DAS28 score on 63 patients with RA, representing a total of 1764 joints. There were 42 (67%) women and 21 (33%) men, on average 55 years old (range: 24–77 years, median: 55 years). A total of 22 (35%) patients were in remission (DAS28<2.63), 31 (49%) presented with moderate disease activity (2.6 ≤DAS28≤5.1) and 10 (16%) with severe disease activity (DAS28>5.1). F-FDG PET/CT scans were first analyzed visually and then semiquantitatively by determining the Standardized Uptake Value of the positive joints. Synovitis was considered as present in US according to the OMERACT criteria. In this paper, interest is in the agreement between the F-FDG PET/CT scan and the US on the assessments of the number of positive joints. Interest is primarily on the number of positive joints detected rather than on each joint separately because this is the quantity used in the DAS28 score to assess disease remission. The distribution of the proportion of joints with synovitis observed by the two methods is depicted in Figure 2.

Figure 2.

PETRA study. Proportion of positive joints observed with the F-FDG PET/CT scan and the US. The marginal densities observed (histogram) and predicted (lines) by the LN approach are also depicted.

3 Classical definition of the CCC

Let $Y_{i 1}$ and $Y_{i 2}$ denote the scores given on item i (i = 1, …, N) by the scorers 1 and 2, respectively. Suppose that they are randomly taken from a bivariate population with mean $μ = (μ_{1}, μ_{2})'$ and variance-covariance matrix

Σ = \begin{pmatrix} σ_{1}^{2} & σ_{12} \\ σ_{21} & σ_{2}^{2} \end{pmatrix}

The degree of concordance between Y₁ and Y₂ can be characterized by the expected value of the squared difference

E (Y_{1} - Y_{2})^{2} = (μ_{1} - μ_{2})^{2} + (σ_{1}^{2} + σ_{2}^{2} - 2 σ_{12})

Lin⁴ proposed to apply a transformation to scale the agreement index between −1 and 1, leading to the CCC

ρ_{c} = 1 - \frac{E (Y_{1} - Y_{2})^{2}}{E_{ind} (Y_{1} - Y_{2})^{2}} = \frac{2 ρ σ_{1} σ_{2}}{(μ_{1} - μ_{2})^{2} + σ_{1}^{2} + σ_{2}^{2}} = ρ C_{b}

(1)

where ρ is the Pearson correlation coefficient and

C_{b} = [(ν + 1 / ν + u^{2}) / 2]^{- 1}

with

ν = σ_{1} / σ_{2}

representing the scale shift and

u = (μ_{1} - μ_{2}) / \sqrt{σ_{1} σ_{2}}

the location shift relative to the scale. Lin⁴ noted that C_b (0 < C_b ≤ 1) is a bias correction factor measuring how far the best-fit line deviates from the 45° line (measure of accuracy). No deviation occurs when C_b = 1. Pearson's correlation coefficient ρ measures how far observations deviate from the best-fit line (measure of precision). A sample estimate of the coefficient is obtained by replacing the parameters by their maximum likelihood estimates (MLEs).
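The step from the ratio of expected squared differences to the closed form in equation (1) can be spelled out as follows. This is a short worked derivation, reading E_ind as the expectation taken when the two scores are independent (σ_12 = 0), which is the reading consistent with the denominator of equation (1).

```latex
% Expected squared difference under independence (σ_{12} = 0)
E_{ind} (Y_{1} - Y_{2})^{2} = (μ_{1} - μ_{2})^{2} + σ_{1}^{2} + σ_{2}^{2}

% Substitute into the definition of ρ_c
ρ_{c} = 1 - \frac{(μ_{1} - μ_{2})^{2} + σ_{1}^{2} + σ_{2}^{2} - 2 σ_{12}}{(μ_{1} - μ_{2})^{2} + σ_{1}^{2} + σ_{2}^{2}} = \frac{2 σ_{12}}{(μ_{1} - μ_{2})^{2} + σ_{1}^{2} + σ_{2}^{2}}

% Use σ_{12} = ρ σ_{1} σ_{2}, then divide numerator and denominator by σ_{1} σ_{2}
ρ_{c} = \frac{2 ρ σ_{1} σ_{2}}{(μ_{1} - μ_{2})^{2} + σ_{1}^{2} + σ_{2}^{2}} = \frac{2 ρ}{u^{2} + ν + 1 / ν} = ρ C_{b}
```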

When sampling from a bivariate normal distribution, Lin⁴ showed that $\hat{ρ}_{c}$ has an asymptotic normal distribution with mean ρ_c and variance

var (\hat{ρ}_{c}) = \frac{1}{N - 2} \left( \frac{(1 - ρ^{2}) ρ_{c}^{2} (1 - ρ_{c}^{2})}{ρ^{2}} + \frac{4 ρ_{c}^{3} (1 - ρ_{c}) u^{2} - 2 ρ_{c}^{4} u^{4}}{ρ^{2}} \right)

where u is the location shift defined above.

While the asymptotic standard error of the CCC relies on the bivariate normal distribution of the scores, the definition of the CCC itself is not based on any distributional assumption and involves only the first two moments and the correlation. The CCC is therefore valid for a variety of distributions. However, it may not be an adequate measure for distributions with a J- or a U-shape because the first and second moments are not good summary measures of the shape of the distribution in that case.
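As an illustration of the sample estimate described in this section, the following minimal sketch computes the CCC of equation (1) together with its two components, using divide-by-N (maximum likelihood) moments. The function name `lin_ccc` and the example scores are hypothetical and only serve the illustration.

```python
import math

def lin_ccc(y1, y2):
    """Sample CCC of equation (1) using MLE (divide-by-N) moments.

    Returns (rho_c, rho, c_b), with rho_c = rho * c_b up to rounding.
    """
    n = len(y1)
    m1 = sum(y1) / n
    m2 = sum(y2) / n
    # Maximum likelihood (biased) variances and covariance.
    s11 = sum((a - m1) ** 2 for a in y1) / n
    s22 = sum((b - m2) ** 2 for b in y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    rho = s12 / math.sqrt(s11 * s22)        # precision
    nu = math.sqrt(s11 / s22)               # scale shift sigma_1 / sigma_2
    u = (m1 - m2) / (s11 * s22) ** 0.25     # location shift / sqrt(sigma_1 sigma_2)
    c_b = 2.0 / (nu + 1.0 / nu + u ** 2)    # accuracy (bias correction factor)
    rho_c = 2.0 * s12 / ((m1 - m2) ** 2 + s11 + s22)
    return rho_c, rho, c_b

# Hypothetical scores: scorer 2 is scorer 1 shifted by one unit.
scorer1 = [2.0, 3.5, 5.0, 8.0, 9.5]
scorer2 = [3.0, 4.5, 6.0, 9.0, 10.5]
rho_c, rho, c_b = lin_ccc(scorer1, scorer2)
```

In this example the two scorers are perfectly correlated, yet the constant shift lowers C_b and therefore the CCC below 1, which is the distinction between precision and accuracy made above.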

4 CCC on bounded scales

Scores obtained on a bounded scale, like in the COCO and the PETRA studies, show a variety of distributions, from unimodal to non-standard J- and U-shapes. Johnson¹⁴ and later Lesaffre et al.⁹ suggested the use of the LN distribution to model these non-standard distributions. The LN distribution can accommodate a wide range of shapes, as shown in Figure 3.

Figure 3.

Different logit-normal distributions among which the predicted LN distributions for the COCO and PETRA studies.

In this section, we generalize the approach of Lesaffre et al.⁹ to the bivariate case considering that the bounded score is a coarsened version of a latent score with a LN distribution on the [0,1] interval, like in the COCO study or that the bounded score is a proportion with the true probability having a LN distribution, like in the PETRA study. Note that when the bounded score is a pure percentage, like the ejection fraction in cardiology or is a continuous bounded score, a bivariate LN (BLN) distribution can directly be assumed.

Since the mean and the variance of the scores given by the two scorers as well as the agreement/correlation will be allowed to vary according to covariates related to the items/scorers, the subscript i referring to the item is introduced.

4.1 Continuous bounded scores

Let $B_{i 1}$ and $B_{i 2}$ denote the continuous scores given on a scale bounded between 0 and 1 for item i ( $i = 1, \dots, N$ ) by the scorers 1 and 2, respectively. The CCC is defined by equation (1), after applying a logit transformation to the scores, i.e., $Y_{i 1} = logit (B_{i 1})$ and $Y_{i 2} = logit (B_{i 2})$ .

4.2 Coarsened bounded scores

Suppose that the vector of the scores given by the two scorers $B_{i} = (B_{i 1}, B_{i 2})'$ is a discrete version of a continuous latent bivariate vector $U_{i} = (U_{i 1}, U_{i 2})'$, obtained through a mechanism such that the vector $B_{i} = (B_{i 1}, B_{i 2})'$ is observed if $(a_{s (i 1)} \leq U_{i 1} < a_{s (i 1) + 1}, a_{s (i 2)} \leq U_{i 2} < a_{s (i 2) + 1})$. The boundaries can be different for the two scorers but we assume, without loss of generality, that they are the same since this paper focuses primarily on agreement. If the score can take the values 0, …, K, the boundaries can be defined as

a_{0} = 0 < a_{1} = \frac{0.5}{K} < \dots < a_{k} = \frac{k - 0.5}{K} < \dots < a_{K} = \frac{K - 0.5}{K} < a_{K + 1} = 1
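The coarsening mechanism and the boundaries above can be sketched as follows. This is a minimal illustration under the stated definition of the boundaries; the function names `cutoffs` and `coarsen` are hypothetical.

```python
def cutoffs(K):
    """Boundaries a_0 = 0 < a_1 = 0.5/K < ... < a_K = (K - 0.5)/K < a_{K+1} = 1."""
    return [0.0] + [(k - 0.5) / K for k in range(1, K + 1)] + [1.0]

def coarsen(u, K):
    """Return the score s in 0..K such that a_s <= u < a_{s+1}, for 0 <= u < 1."""
    a = cutoffs(K)
    for s in range(K + 1):
        if a[s] <= u < a[s + 1]:
            return s
    raise ValueError("u must lie in [0, 1)")

# A latent value of 0.5 on a scale with K = 10 falls in [0.45, 0.55), i.e. score 5.
score = coarsen(0.5, 10)
```

With this choice, the two extreme scores 0 and K correspond to intervals of half the width of the interior ones.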

It is further assumed that the vector U_i follows a BLN distribution, i.e., that $logit (U_{i}) = Z_{i}$ with $Z_{i} \sim N_{2} (γ_{i}, T_{i})$ where $γ_{i} = (γ_{i 1}, γ_{i 2})'$ is the mean vector and the variance-covariance matrix is

T_{i} = \begin{pmatrix} τ_{i 1}^{2} & τ_{i 12} \\ τ_{i 21} & τ_{i 2}^{2} \end{pmatrix}

The CCC is defined at the latent scale level as

λ_{ci} = 1 - \frac{E (Z_{i 1} - Z_{i 2})^{2}}{E_{ind} (Z_{i 1} - Z_{i 2})^{2}} = \frac{2 λ_{i} τ_{i 1} τ_{i 2}}{(γ_{i 1} - γ_{i 2})^{2} + τ_{i 1}^{2} + τ_{i 2}^{2}}

(2)

Note that the correlation obtained on the latent scale (denoted by λ_i) is involved in the definition of the CCC (denoted by λ_ci). The concept of correlation is in this case close to the concept of polychoric correlation coefficient since this latter coefficient represents the correlation between two normally distributed continuous latent variables, obtained from two observed ordinal variables.

4.3 Proportions

Suppose that the bounded score B_ij (i = 1, …, N;j = 1, 2) is the proportion of success resulting from a series of independent Bernoulli experiments, conditionally on the item scored. For example, in the PETRA study, each of the 28 joints of a patient is assessed as positive or negative for synovitis. Like in Lesaffre et al.,⁹ it is assumed that $B_{ij} \sim Bin (U_{ij}, N_{ij})$ . Then, it is further assumed that the vector $U_{i} = (U_{i 1}, U_{i 2})'$ follows a BLN distribution, as described in Section 4.2. Here too, the CCC is defined through equation (2).

5 Statistical inference

The aim is to relate the CCC of interest to predictors depending on the items and/or scorers’ characteristics. The CCC can be defined on the original scale (ρ_ci) in case of bivariate normal data or on a latent or transformed scale (λ_ci) in case of BLN data. Note that if interest lies in correlation rather than in agreement, the correlation coefficients ρ_i or λ_i can be considered instead of the CCC ρ_ci and λ_ci, respectively. The inference procedure will remain the same. Therefore, the general notation φ_i will be used to refer to both correlation coefficients and CCCs. Since −1 ≤ φ_i ≤ 1, the Fisher link function can be used to link φ_i to predictors,

\frac{1}{2} \ln (\frac{1 + φ_{i}}{1 - φ_{i}}) = x_{i}' β_{A}

(3)

with $β_{A}$ a vector of parameters and x_i a vector of covariates related to the items and/or the scorers.
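Inverting equation (3) gives φ_i = tanh(x_i' β_A), so any real-valued linear predictor maps to a coefficient strictly between −1 and 1. The sketch below illustrates this; the coefficient values and the binary covariate are hypothetical and are not estimates from either study.

```python
import math

def fisher_z(phi):
    """Left-hand side of equation (3): 0.5 * ln((1 + phi) / (1 - phi))."""
    return 0.5 * math.log((1.0 + phi) / (1.0 - phi))

def phi_from_linear_predictor(x, beta_a):
    """Inverse of equation (3): phi_i = tanh(x_i' beta_A), always in (-1, 1)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta_a))
    return math.tanh(eta)

# Hypothetical coefficients: intercept plus one binary covariate.
beta_a = [0.8, -0.3]
phi_reference = phi_from_linear_predictor([1.0, 0.0], beta_a)
phi_exposed = phi_from_linear_predictor([1.0, 1.0], beta_a)
```

A negative coefficient on the covariate lowers the agreement coefficient for the exposed group, which is how the impact of a predictor on φ_i is read on this scale.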

5.1 Continuous bounded scores

We suppose that for item i ( $i = 1, \dots, N$ ), $(B_{i 1}, B_{i 2})$ is BLN with mean vector $μ_{i} = (μ_{i 1}, μ_{i 2})'$ and variance–covariance matrix

Σ_{i} = \begin{pmatrix} σ_{i 1}^{2} & ρ_{i} σ_{i 1} σ_{i 2} \\ ρ_{i} σ_{i 1} σ_{i 2} & σ_{i 2}^{2} \end{pmatrix}

That is, we suppose that

(logit (B_{i 1}), logit (B_{i 2}))' = (Y_{i 1}, Y_{i 2})' \sim N_{2} (μ_{i}, Σ_{i})

and that the scores

(b_{i 1}, b_{i 2})' = (expit (y_{i 1}), expit (y_{i 2}))'

are observed. The contribution of the ith item to the likelihood function is given by

L_{i} (μ_{i}, Σ_{i} | b_{i 1}, b_{i 2}) = \frac{1}{2 π σ_{i 1} σ_{i 2} \sqrt{1 - ρ_{i}^{2}}} e^{- \frac{1}{2 (1 - ρ_{i}^{2})} (\frac{(y_{i 1} - μ_{i 1})^{2}}{σ_{i 1}^{2}} - 2 ρ_{i} \frac{(y_{i 1} - μ_{i 1}) (y_{i 2} - μ_{i 2})}{σ_{i 1} σ_{i 2}} + \frac{(y_{i 2} - μ_{i 2})^{2}}{σ_{i 2}^{2}})}

This likelihood can be expressed in terms of the CCC since

2 ρ_{i} σ_{i 1} σ_{i 2} = ρ_{ci} [(μ_{i 1} - μ_{i 2})^{2} + σ_{i 1}^{2} + σ_{i 2}^{2}]

(see equation (1)). Therefore, if the mean assessment of the scorers is related to predictors through a linear model and the variance is related to predictors through a log-linear model, i.e.

μ_{ir} = x_{ir}' β_{Mr}

and

\log (σ_{ir}^{2}) = x_{ir}' β_{Sr}

with $x_{ir}$ a vector of covariates, and $β_{Mr}$ and $β_{Sr}$ vectors of parameters, the likelihood can be written as

L_{i} (μ_{i}, Σ_{i} | b_{i 1}, b_{i 2}) = L_{i} (β_{A}, β_{Mr}, β_{Sr} | b_{i 1}, b_{i 2})
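The reparameterization in terms of the CCC can be sketched as follows: given ρ_ci and the means and variances, the correlation ρ_i is recovered from the identity above and plugged into the bivariate normal density of the logit scores. Since ρ_i = ρ_ci / C_b, a given CCC is only attainable when the implied |ρ_i| is below 1, which the sketch checks explicitly. Function names are hypothetical, and the code is an illustration rather than the authors' JAGS implementation.

```python
import math

def rho_from_ccc(rho_c, mu1, mu2, var1, var2):
    """Solve 2*rho*s1*s2 = rho_c * [(mu1 - mu2)^2 + s1^2 + s2^2] for rho."""
    rho = rho_c * ((mu1 - mu2) ** 2 + var1 + var2) / (2.0 * math.sqrt(var1 * var2))
    if not -1.0 < rho < 1.0:
        raise ValueError("parameters imply |rho| >= 1; this CCC is not attainable")
    return rho

def loglik_continuous(b1, b2, mu1, mu2, var1, var2, rho_c):
    """Log of the Section 5.1 contribution for scores b1, b2 in (0, 1)."""
    y1 = math.log(b1 / (1.0 - b1))   # logit of the observed scores
    y2 = math.log(b2 / (1.0 - b2))
    rho = rho_from_ccc(rho_c, mu1, mu2, var1, var2)
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    e1, e2 = (y1 - mu1) / s1, (y2 - mu2) / s2
    quad = (e1 ** 2 - 2.0 * rho * e1 * e2 + e2 ** 2) / (1.0 - rho ** 2)
    return (-math.log(2.0 * math.pi * s1 * s2)
            - 0.5 * math.log(1.0 - rho ** 2)
            - 0.5 * quad)
```

In a regression setting, `mu1`, `mu2`, `var1`, `var2` and `rho_c` would themselves be computed from the linear, log-linear and Fisher-link predictors for item i.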

5.2 Coarsened bounded scores

If the vector $B_{i} = (B_{i 1}, B_{i 2})'$ ( $i = 1, \dots, N$ ) is a coarsened version of U_i following a BLN distribution, as described in Section 4.2, the contribution of the ith item to the likelihood can be written as

L_{i} (γ_{i}, T_{i} | b_{i 1}, b_{i 2}) = \frac{1}{2 π \sqrt{1 - λ_{i}^{2}}} \int_{z_{(a_{s (1)})}^{l}}^{z_{(a_{s (1)})}^{u}} \int_{z_{(a_{s (2)})}^{l}}^{z_{(a_{s (2)})}^{u}} e^{\frac{- 1}{2 (1 - λ_{i}^{2})} (v_{i 1}^{2} + v_{i 2}^{2} - 2 λ_{i} v_{i 1} v_{i 2})} {dv}_{i 2} {dv}_{i 1}

with $z_{(a_{s (r)})}^{l} = (logit (a_{s (ir)}) - γ_{ir}) / τ_{ir}$ and $z_{(a_{s (r)})}^{u} = (logit (a_{s (ir) + 1}) - γ_{ir}) / τ_{ir}$. This likelihood can be rewritten in the following form,

\begin{aligned} L_{i} (γ_{i}, T_{i} | b_{i 1}, b_{i 2}) = {} & Φ_{2} (z_{(a_{0})}^{u}, z_{(a_{0})}^{u})^{h_{i 00}} \prod_{j = 1}^{K} [Φ_{2} (z_{(a_{j})}^{u}, z_{(a_{0})}^{u}) - Φ_{2} (z_{(a_{j})}^{l}, z_{(a_{0})}^{u})]^{h_{ij 0}} \prod_{k = 1}^{K} [Φ_{2} (z_{(a_{0})}^{u}, z_{(a_{k})}^{u}) - Φ_{2} (z_{(a_{0})}^{u}, z_{(a_{k})}^{l})]^{h_{i 0 k}} \\ & \times \prod_{j = 1}^{K} \prod_{k = 1}^{K} [Φ_{2} (z_{(a_{j})}^{u}, z_{(a_{k})}^{u}) - Φ_{2} (z_{(a_{j})}^{u}, z_{(a_{k})}^{l}) - Φ_{2} (z_{(a_{j})}^{l}, z_{(a_{k})}^{u}) + Φ_{2} (z_{(a_{j})}^{l}, z_{(a_{k})}^{l})]^{h_{ijk}} \end{aligned}

where $Φ_{2} (., .)$ is the cumulative distribution function of the bivariate normal distribution, the scores $y_{i 1}$ and $y_{i 2}$ take the values 0, …, K, and h_ijk = 1 if $(y_{i 1} = j, y_{i 2} = k)$ and 0 otherwise.
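Each factor of this likelihood is the probability that a standard bivariate normal vector with correlation λ_i falls in a rectangle. The sketch below evaluates such a probability by one-dimensional numerical quadrature (Simpson's rule on the conditional normal distribution). This is a generic substitute chosen for illustration, not the approximation formula used by the authors, and all function names are hypothetical.

```python
import math

def std_norm_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def std_norm_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def rect_prob(l1, u1, l2, u2, lam, n=400):
    """P(l1 < V1 < u1, l2 < V2 < u2), (V1, V2) standard bivariate normal, corr lam.

    Uses V2 | V1 = v ~ N(lam * v, 1 - lam^2) and Simpson's rule over v (n even).
    Infinite limits for V1 are truncated at +/- 8.
    """
    lo, hi = max(l1, -8.0), min(u1, 8.0)
    if hi <= lo:
        return 0.0
    s = math.sqrt(1.0 - lam * lam)

    def integrand(v):
        upper = std_norm_cdf((u2 - lam * v) / s)
        lower = std_norm_cdf((l2 - lam * v) / s)
        return std_norm_pdf(v) * (upper - lower)

    h = (hi - lo) / n
    total = integrand(lo) + integrand(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return total * h / 3.0

def logit(p):
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return math.log(p / (1.0 - p))

def cell_prob(j, k, K, gamma1, gamma2, tau1, tau2, lam):
    """Likelihood contribution of an item scored (j, k) on a 0..K scale."""
    a = [0.0] + [(m - 0.5) / K for m in range(1, K + 1)] + [1.0]
    z = lambda cut, g, t: (logit(cut) - g) / t
    return rect_prob(z(a[j], gamma1, tau1), z(a[j + 1], gamma1, tau1),
                     z(a[k], gamma2, tau2), z(a[k + 1], gamma2, tau2), lam)
```

A simple sanity check is that the cell probabilities over all (j, k) pairs sum to 1, up to quadrature error.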

Let the means and variances depend on covariates, i.e., $γ_{ir} = x_{ir}' β_{Mr}$ and $\ln (τ_{ir}^{2}) = x_{ir}' β_{Sr}$ where $x_{ir}'$ is a vector of covariates related to the item and/or the scorers, $β_{Mr}$ and $β_{Sr}$ are vectors of parameters. Since $2 λ_{i} τ_{i 1} τ_{i 2} = λ_{ci} [(γ_{i 1} - γ_{i 2})^{2} + τ_{i 1}^{2} + τ_{i 2}^{2}]$ (see equation (2)), the bivariate likelihood can be expressed in terms of the CCC, $L_{i} (γ_{i}, T_{i} | b_{i 1}, b_{i 2}) = L_{i} (β_{A}, β_{Mr}, β_{Sr} | b_{i 1}, b_{i 2})$ .

5.3 Proportions

We suppose that for item i ( $i = 1, \dots, N$ ), the random variable B_ij follows a Binomial distribution with probability U_ij and that the vector $U_{i} = (U_{i 1}, U_{i 2})'$ follows a BLN distribution with mean vector $γ_{i}$ and variance–covariance matrix T_i, as defined in Section 4.3. The contribution of the ith item to the likelihood function is given by

\begin{aligned} L_{i} (γ_{i}, T_{i} | b_{i 1}, N_{i 1}, b_{i 2}, N_{i 2}) = {} & \frac{N_{i 1}!}{b_{i 1}! (N_{i 1} - b_{i 1})!} U_{i 1}^{b_{i 1}} (1 - U_{i 1})^{N_{i 1} - b_{i 1}} \frac{N_{i 2}!}{b_{i 2}! (N_{i 2} - b_{i 2})!} U_{i 2}^{b_{i 2}} (1 - U_{i 2})^{N_{i 2} - b_{i 2}} \\ & \times \frac{1}{2 π τ_{i 1} τ_{i 2} \sqrt{1 - λ_{i}^{2}}} e^{- \frac{1}{2 (1 - λ_{i}^{2})} (\frac{(z_{i 1} - γ_{i 1})^{2}}{τ_{i 1}^{2}} - 2 λ_{i} \frac{(z_{i 1} - γ_{i 1}) (z_{i 2} - γ_{i 2})}{τ_{i 1} τ_{i 2}} + \frac{(z_{i 2} - γ_{i 2})^{2}}{τ_{i 2}^{2}})} \end{aligned}

If the means and variances depend on covariates similarly to Section 5.2 and since $2 λ_{i} τ_{i 1} τ_{i 2} = λ_{ci} [(γ_{i 1} - γ_{i 2})^{2} + τ_{i 1}^{2} + τ_{i 2}^{2}]$ , the likelihood can be rewritten $L_{i} (γ_{i}, T_{i} | b_{i 1}, N_{i 1}, b_{i 2}, N_{i 2}) = L_{i} (β_{A}, β_{Mr}, β_{Sr} | b_{i 1}, N_{i 1}, b_{i 2}, N_{i 2})$ .
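The structure of this contribution, two binomial terms for the observed counts times a bivariate normal density for the latent logits, can be sketched as below. The sketch assumes that b denotes the number of successes out of N trials (as the binomial coefficient suggests) and that z = logit(U) as in Section 4.2. Function names are hypothetical.

```python
import math

def log_binom_pmf(b, n, p):
    """log of C(n, b) * p^b * (1 - p)^(n - b), for 0 < p < 1."""
    log_coef = math.lgamma(n + 1) - math.lgamma(b + 1) - math.lgamma(n - b + 1)
    return log_coef + b * math.log(p) + (n - b) * math.log(1.0 - p)

def log_joint_proportion(b1, n1, b2, n2, z1, z2,
                         gamma1, gamma2, tau1, tau2, lam):
    """Log of the Section 5.3 contribution for given latent logits z1, z2."""
    u1 = 1.0 / (1.0 + math.exp(-z1))   # expit: latent success probabilities
    u2 = 1.0 / (1.0 + math.exp(-z2))
    e1, e2 = (z1 - gamma1) / tau1, (z2 - gamma2) / tau2
    quad = (e1 ** 2 - 2.0 * lam * e1 * e2 + e2 ** 2) / (1.0 - lam ** 2)
    log_bvn = (-math.log(2.0 * math.pi * tau1 * tau2)
               - 0.5 * math.log(1.0 - lam ** 2)
               - 0.5 * quad)
    return log_binom_pmf(b1, n1, u1) + log_binom_pmf(b2, n2, u2) + log_bvn
```

In an MCMC setting the latent logits are typically sampled along with the regression parameters, so this quantity is evaluated at the current values of z1 and z2.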

6 Bayesian estimation

It is possible to obtain the MLE of the CCC analytically for the bivariate normal distribution. However, in the case of bounded outcomes, the use of the logit link makes the computations more complex and there is no closed-form expression for the bivariate normal cumulative distribution function used in the coarsened case. This function can be evaluated using numerical algorithms¹⁶ or approximation formulas.^17–20 We adopted a Bayesian approach using MCMC and an approximation formula mainly for two reasons. First, Bayesian methods have shown good frequentist properties in a variety of settings and are very flexible in the handling of covariates and missing values. Second, by doing so, the method developed only requires writing the likelihood in standard Bayesian software (e.g. JAGS, WinBUGS). In a Bayesian approach, prior knowledge about the parameters is combined with the observed data (likelihood) to yield the posterior distribution. We used vague priors, which express the lack of prior information on the parameters. For all the regression coefficients $β$, vague independent N(0, 10⁶) priors were taken. The MCMC calculations were performed using JAGS.²¹

Several approximation formulas for the cumulative bivariate normal distribution $Φ_{2} (., .)$ have been proposed.^17–20 Among them, we chose the approximation of Mee and Owen,¹⁷ which provides the least error and is relatively simple to implement (see Appendix 1).

7 Simulations

To study the performance of the BLN methodology and compare it to that of the classical method, 1000 datasets were generated under the bivariate normal assumption for two scenarios. Both scorers have a N(2,1) distribution in scenario 1 and a N(0,3) distribution in scenario 2 on the latent scale. The same distribution was assumed for both scorers to ensure that the whole range of CCC values could be covered. Then, the expit transformation was applied to the bivariate data. The LN marginal probability distributions of the two scorers (LN(2,1) and LN(0,3)) are depicted in Figure 4.
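The data-generating step can be sketched as follows. With identical means and variances for the two scorers, C_b = 1, so the latent CCC equals the latent correlation, and setting the correlation to the target value fixes the CCC. The sketch passes the latent standard deviation `tau` explicitly because this section does not state whether the second argument of N(0,3) is a variance or a standard deviation. The function name and the seed handling are hypothetical.

```python
import math
import random

def simulate_bln(n, gamma, tau, lam, seed=0):
    """Draw n pairs (U1, U2) whose logits are bivariate normal with common
    mean gamma, common SD tau and correlation lam (latent CCC equal to lam)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        e1 = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        z1 = gamma + tau * e1
        # Second logit built to have correlation lam with the first.
        z2 = gamma + tau * (lam * e1 + math.sqrt(1.0 - lam ** 2) * e2)
        # expit transformation back to the (0, 1) scale
        pairs.append((1.0 / (1.0 + math.exp(-z1)),
                      1.0 / (1.0 + math.exp(-z2))))
    return pairs

# Scenario 1 style data: latent N(2, 1) for both scorers, target CCC of 0.6.
data = simulate_bln(100, 2.0, 1.0, 0.6, seed=1)
```

The resulting pairs can then be used as continuous bounded scores, coarsened to a 10-point or 100-point scale, or used as success probabilities for binomial counts, matching the three data types of Section 4.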

Figure 4.

Simulations. Logit-normal distributions used in the two scenarios.

For each scenario, three sample sizes (N = 25, 50, 100) and five values of the CCC (λ_c = 0.0, 0.2, 0.4, 0.6, 0.8) were considered. The classical CCC was computed on the bounded scale. The mean of the parameter estimates and their standard error obtained on the 1000 datasets are reported in the summary tables along with the coverage level, defined as the percentage of samples for which the 95% confidence interval covers the theoretical CCC value. The posterior distribution of the CCC was also obtained in the Bayesian framework under a bivariate normal model and the BLN model. In the coarsened case, the performance of the method was evaluated with 10 and with 100 cut-off values. A large number of experiments (1000) was considered in the proportion case to speed up convergence. Three chains were run. A burn-in period of 1000 iterations was used and 1000 iterations were sufficient to attain convergence. The posterior mean, median and standard deviation (SD) of the CCC are reported in the summary tables. In this case, the coverage level is defined as the percentage of samples for which the 95% equal-tailed credibility interval covers the theoretical value.
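The Bayesian coverage level defined above can be computed as in the following sketch, which takes posterior draws of the CCC for each simulated dataset, forms the 95% equal-tailed interval from the 2.5% and 97.5% sample quantiles, and counts how often the interval contains the theoretical value. The function names are hypothetical, and the quantile rule is the default of Python's `statistics.quantiles`, which may differ slightly from the rule used by the authors' software.

```python
import statistics

def equal_tailed_95(draws):
    """95% equal-tailed interval: the 2.5% and 97.5% sample quantiles."""
    cuts = statistics.quantiles(draws, n=40)   # 39 cut points, 2.5% apart
    return cuts[0], cuts[-1]

def coverage(intervals, truth):
    """Proportion of intervals (lo, hi) that contain the theoretical value."""
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return hits / len(intervals)
```

For example, with one list of posterior draws per simulated dataset, `coverage([equal_tailed_95(d) for d in all_draws], 0.6)` would give the coverage for a theoretical CCC of 0.6, where `all_draws` is a hypothetical list of such lists.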

7.1 Bounded coarsened scores

Only the results with 10 cut-off values are presented in the main manuscript. The case of 100 cut-off values, which shows similar results, is provided as supplemental material. As can be seen in Tables 1 and 2, the coverage level is very close to the 95% nominal level with the BLN approach, while it remains unsatisfactory with the classical approach and under the bivariate normal assumption. These two latter approaches show, however, better coverage levels under the second scenario than under the first one. This can be explained by the fact that the marginal probability distribution is symmetrical under scenario 2 and skewed under scenario 1. The mean and the variance therefore better describe the shape of the distribution in scenario 2. The coverage levels also tend to decrease when the CCC increases under the classical and the bivariate normal approaches. This can be explained by the fact that a normal approximation of the sampling distribution of the CCC is less appropriate when the CCC approaches its maximal value of 1. Finally, note that the coverage levels are slightly worse with the Bayesian bivariate normal approach than with the classical approach. This could be due to the fact that the Bayesian bivariate normal approach makes the additional assumption of a normal distribution of the data, which does not hold in the scenarios.

Table 1.

Simulations – Bounded coarsened scales: Scenario 1.

		Classical			Bayesian (BN)				Bayesian (BLN)
N	λ_c	Mean	SE	Cov	Mean	SD	Med	Cov	Mean	SD	Med	Cov
25	0.0	−0.00	0.19	0.92	0.00	0.17	0.00	0.95	0.00	0.19	−0.00	0.95
25	0.2	0.16	0.18	0.91	0.15	0.17	0.15	0.93	0.17	0.19	0.17	0.95
25	0.4	0.33	0.17	0.90	0.30	0.16	0.31	0.90	0.34	0.18	0.35	0.94
25	0.6	0.51	0.14	0.90	0.48	0.14	0.49	0.85	0.54	0.15	0.55	0.94
25	0.8	0.71	0.10	0.91	0.68	0.10	0.69	0.77	0.75	0.11	0.77	0.92
50	0.0	−0.00	0.14	0.93	−0.00	0.13	−0.00	0.95	−0.00	0.14	−0.00	0.95
50	0.2	0.17	0.13	0.92	0.16	0.13	0.16	0.92	0.18	0.14	0.19	0.95
50	0.4	0.34	0.12	0.90	0.32	0.12	0.33	0.87	0.37	0.13	0.37	0.95
50	0.6	0.53	0.10	0.88	0.51	0.10	0.52	0.84	0.57	0.11	0.58	0.95
50	0.8	0.73	0.07	0.83	0.71	0.07	0.72	0.69	0.77	0.07	0.78	0.94
100	0.0	−0.00	0.10	0.92	−0.00	0.10	−0.00	0.93	−0.00	0.11	−0.00	0.94
100	0.2	0.17	0.10	0.91	0.16	0.09	0.16	0.91	0.19	0.10	0.19	0.94
100	0.4	0.35	0.09	0.90	0.34	0.09	0.34	0.86	0.39	0.09	0.39	0.95
100	0.6	0.54	0.07	0.85	0.53	0.07	0.53	0.79	0.58	0.08	0.58	0.95
100	0.8	0.74	0.05	0.74	0.73	0.05	0.73	0.60	0.79	0.05	0.79	0.95

Both scorers have a LN(2,1) distribution (10-point scale). In the column Bayesian (BN), the posterior distribution of the CCC is estimated under a bivariate normal distribution while it is estimated under a bivariate logit-normal distribution in the column Bayesian (BLN). Cov = coverage.

Table 2.

Simulations – Bounded coarsened scales: Scenario 2.

		Classical			Bayesian (BN)				Bayesian (BLN)
N	λ_c	Mean	SE	Cov	Mean	SD	Med	Cov	Mean	SD	Med	Cov
25	0.0	−0.00	0.19	0.93	0.00	0.18	−0.00	0.95	0.00	0.18	0.00	0.94
25	0.2	0.16	0.19	0.93	0.15	0.18	0.15	0.95	0.17	0.18	0.17	0.95
25	0.4	0.34	0.17	0.92	0.31	0.17	0.32	0.93	0.34	0.17	0.34	0.95
25	0.6	0.53	0.14	0.92	0.50	0.14	0.51	0.88	0.54	0.15	0.55	0.94
25	0.8	0.74	0.09	0.92	0.71	0.10	0.72	0.84	0.75	0.11	0.76	0.93
50	0.0	−0.00	0.14	0.93	−0.00	0.13	−0.00	0.94	−0.00	0.14	−0.00	0.95
50	0.2	0.17	0.14	0.93	0.17	0.13	0.17	0.94	0.19	0.14	0.19	0.95
50	0.4	0.35	0.12	0.94	0.33	0.12	0.34	0.94	0.37	0.12	0.37	0.95
50	0.6	0.53	0.10	0.92	0.52	0.10	0.52	0.88	0.56	0.10	0.57	0.94
50	0.8	0.75	0.06	0.92	0.73	0.06	0.74	0.81	0.77	0.07	0.78	0.95
100	0.0	−0.01	0.10	0.95	−0.01	0.10	−0.01	0.96	−0.00	0.10	−0.00	0.95
100	0.2	0.17	0.10	0.93	0.16	0.09	0.17	0.93	0.18	0.10	0.19	0.94
100	0.4	0.35	0.09	0.92	0.34	0.09	0.35	0.90	0.38	0.09	0.38	0.95
100	0.6	0.54	0.07	0.90	0.53	0.07	0.53	0.84	0.58	0.07	0.58	0.95
100	0.8	0.75	0.04	0.83	0.74	0.04	0.75	0.72	0.78	0.05	0.78	0.94

Both scorers have a LN(0,3) distribution (10-point scale). In the column Bayesian (BN), the posterior distribution of the CCC is estimated under a bivariate normal distribution while it is estimated under a bivariate logit-normal distribution in the column Bayesian (BLN). Cov = coverage.

7.2 Percentage scales

The simulation results for the percentage case are given in Table 3 for the first scenario and in Table 4 for the second scenario. The conclusions are similar to those drawn in the bounded coarsened case.

Table 3.

Simulations – Percentage scales: Scenario 1.

		Classical			Bayesian (BN)				Bayesian (BLN)
N	λ_c	Mean	SE	Cov	Mean	SD	Med	Cov	Mean	SD	Med	Cov
25	0.0	0.01	0.18	0.93	0.01	0.17	0.01	0.96	0.01	0.18	0.01	0.96
25	0.2	0.17	0.18	0.91	0.15	0.17	0.16	0.93	0.17	0.17	0.18	0.94
25	0.4	0.34	0.16	0.89	0.31	0.16	0.32	0.89	0.35	0.16	0.36	0.94
25	0.6	0.54	0.13	0.89	0.51	0.14	0.52	0.87	0.55	0.13	0.56	0.94
25	0.8	0.75	0.09	0.89	0.72	0.09	0.73	0.81	0.76	0.09	0.77	0.95
50	0.0	−0.00	0.14	0.94	−0.00	0.13	−0.00	0.95	−0.00	0.13	−0.00	0.95
50	0.2	0.18	0.13	0.92	0.17	0.13	0.17	0.92	0.19	0.13	0.19	0.94
50	0.4	0.35	0.12	0.90	0.34	0.12	0.34	0.88	0.38	0.12	0.38	0.94
50	0.6	0.55	0.10	0.88	0.54	0.10	0.54	0.85	0.58	0.09	0.58	0.94
50	0.8	0.76	0.06	0.89	0.75	0.06	0.75	0.81	0.78	0.06	0.79	0.94
100	0.0	−0.01	0.10	0.94	−0.00	0.10	−0.00	0.95	−0.01	0.10	−0.01	0.95
100	0.2	0.17	0.10	0.91	0.17	0.09	0.17	0.92	0.19	0.10	0.19	0.94
100	0.4	0.36	0.09	0.88	0.35	0.08	0.35	0.88	0.39	0.08	0.39	0.95
100	0.6	0.56	0.07	0.87	0.55	0.07	0.55	0.84	0.59	0.07	0.59	0.95
100	0.8	0.77	0.04	0.87	0.76	0.04	0.77	0.79	0.80	0.04	0.80	0.94

Both scorers have a LN(2,1) distribution. In the column Bayesian (BN), the posterior distribution of the CCC is estimated under a bivariate normal distribution while it is estimated under a bivariate logit-normal distribution in the column Bayesian (BLN). Cov = coverage.

Table 4.

Simulations – Percentage scales: Scenario 2.

		Classical			Bayesian (BN)				Bayesian (BLN)
N	λ_c	Mean	SE	Cov	Mean	SD	Med	Cov	Mean	SD	Med	Cov
25	0.0	−0.00	0.19	0.93	0.00	0.18	0.00	0.95	0.00	0.18	0.00	0.94
25	0.2	0.16	0.19	0.93	0.15	0.18	0.15	0.94	0.17	0.18	0.18	0.95
25	0.4	0.34	0.17	0.92	0.32	0.16	0.33	0.92	0.35	0.16	0.36	0.95
25	0.6	0.53	0.14	0.93	0.50	0.14	0.51	0.90	0.54	0.14	0.56	0.95
25	0.8	0.74	0.09	0.92	0.71	0.10	0.72	0.84	0.76	0.09	0.77	0.94
50	0.0	0.00	0.14	0.94	0.00	0.13	0.00	0.94	−0.00	0.14	−0.00	0.94
50	0.2	0.17	0.14	0.93	0.16	0.13	0.16	0.94	0.19	0.13	0.19	0.94
50	0.4	0.35	0.12	0.93	0.34	0.12	0.34	0.92	0.38	0.12	0.39	0.95
50	0.6	0.54	0.10	0.92	0.52	0.10	0.52	0.88	0.57	0.10	0.58	0.94
50	0.8	0.75	0.06	0.92	0.73	0.06	0.74	0.81	0.79	0.06	0.80	0.96
100	0.0	−0.00	0.10	0.94	−0.00	0.10	−0.00	0.95	−0.00	0.10	−0.00	0.94
100	0.2	0.17	0.10	0.92	0.17	0.09	0.17	0.92	0.19	0.10	0.20	0.94
100	0.4	0.35	0.09	0.93	0.34	0.09	0.35	0.92	0.39	0.09	0.40	0.96
100	0.6	0.54	0.07	0.87	0.53	0.07	0.54	0.83	0.60	0.07	0.60	0.94
100	0.8	0.75	0.04	0.85	0.74	0.04	0.75	0.73	0.80	0.04	0.80	0.95

Both scorers have a LN(0,3) distribution. In the column Bayesian (BN), the posterior distribution of the CCC is estimated under a bivariate normal distribution while it is estimated under a bivariate logit-normal distribution in the column Bayesian (BLN). Cov = coverage.

8 Examples

Posterior predictive checks (PPC) were used to assess the model fit in two different ways. First, we verified that the marginal probability distribution of each scorer followed a LN distribution as predicted by the BLN model using a chi-square test.²² Second, multivariate model checking was done according to the method of Crespi and Boscardin.²³ The method consists of simulating a large number (e.g. R = 500) of bivariate observations ${\tilde{b}}_{ir} = ({\tilde{b}}_{i1,r}, {\tilde{b}}_{i2,r})$ according to the BLN model for each subject i ($i = 1, \dots, N$; $r = 1, \dots, R$). Then, the Euclidean distance between the actual and the replicated vector of observations is determined for each subject, i.e. ${DOR}_{ir} = d(b_{i}, {\tilde{b}}_{ir})$. This distance is also determined for each pair of vectors of replicated observations, ${DRR}_{irr'} = d({\tilde{b}}_{ir}, {\tilde{b}}_{ir'})$. When the model is consistent with the data, the distribution of the distances between the actual and replicated vectors of observations, ${DOR}_{ir}$, should coincide with that of ${DRR}_{irr'}$. If the model fits poorly, ${DOR}_{ir}$ is expected to be larger than the distance between two replicated vectors, ${DRR}_{irr'}$. Therefore, a Mann–Whitney test is performed by randomly selecting, for each subject, one distance ${DOR}_{ir}$ between the actual and a replicated vector of observations and one distance ${DRR}_{irr'}$ between two replicated vectors. This is repeated a large number of times (e.g. 100), providing several p-values. If there is no model fit problem, the p-values should exhibit a uniform distribution.²⁴
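The multivariate check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the replicated observations are drawn from a bivariate logit-normal with fixed, hypothetical parameter values (in a real analysis each replicate would come from the posterior predictive distribution), the Mann–Whitney p-value is two-sided and uses the normal approximation without tie correction, and all function names are invented for the example.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def draw_bln(rng, mu, sd, rho):
    """One bivariate logit-normal draw: expit of a correlated normal pair."""
    e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
    z1 = mu[0] + sd[0] * e1
    z2 = mu[1] + sd[1] * (rho * e1 + math.sqrt(1 - rho ** 2) * e2)
    return (expit(z1), expit(z2))

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney p-value (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = sum(rank for rank, (_, grp) in enumerate(pooled, start=1) if grp == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(z) / math.sqrt(2))

def ppc_pvalues(obs, reps, n_tests, rng):
    """Per test: pick one DOR and one DRR per subject, then compare the two samples."""
    pvals = []
    for _ in range(n_tests):
        dor, drr = [], []
        for b_i, reps_i in zip(obs, reps):
            r = rng.randrange(len(reps_i))              # replicate used for DOR
            r1, r2 = rng.sample(range(len(reps_i)), 2)  # two distinct replicates for DRR
            dor.append(math.dist(b_i, reps_i[r]))
            drr.append(math.dist(reps_i[r1], reps_i[r2]))
        pvals.append(mann_whitney_p(dor, drr))
    return pvals

# Toy data: "observed" and replicated scores come from the same hypothetical model,
# so the p-values should look roughly uniform.
rng = random.Random(2024)
mu, sd, rho = (2.0, 2.0), (1.0, 1.0), 0.6  # hypothetical parameter values
N, R = 60, 200
obs = [draw_bln(rng, mu, sd, rho) for _ in range(N)]
reps = [[draw_bln(rng, mu, sd, rho) for _ in range(R)] for _ in range(N)]
pvals = ppc_pvalues(obs, reps, n_tests=100, rng=rng)
```

The resulting list `pvals` is what would be displayed in a QQ-plot against the uniform distribution or passed to a Kolmogorov–Smirnov test.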

8.1 COCO study

The assessment of the compliance was available for 1025 patient–physician pairs (see Figure 1). Among these patients, 363 (35%) were taking a single tablet, 382 (37%) two tablets and 280 (27%) three tablets. The compliance scores are given according to demographic and treatment characteristics in Table 5.

Table 5.

COCO study – Compliance scores (mean (SD), median) given by the patient and the physician according to demographic and treatment characteristics (N = 1025).

		Patient		Physician		CCC (classical)	CCC (BLN)
Parameter	N	Mean (SD)	Med	Mean (SD)	Med	(95% CI)	Median (95% CrI)
Gender
Males	502	74 (18)	77	77 (17)	80	0.70 (0.66–0.75)	0.71 (0.65–0.75)
Females	523	76 (18)	79	79 (16)	83	0.68 (0.64–0.73)	0.73 (0.68–0.77)
Acceptability^a
(1)	235	83 (14)	86	87 (10)	89	0.54 (0.45–0.62)	0.60 (0.50–0.67)
(2)	288	79 (14)	82	82 (14)	85	0.60 (0.53–0.68)	0.58 (0.50–0.66)
(3)	324	70 (18)	72	73 (16)	75	0.61 (0.55–0.68)	0.66 (0.59–0.72)
(4)	135	68 (19)	69	68 (19)	70	0.74 (0.67–0.82)	0.80 (0.73–0.85)
(5)	51	63 (25)	67	64 (21)	67	0.78 (0.68–0.88)	0.83 (0.70–0.91)
Nb tablets
1	363	78 (16)	81	81 (16)	86	0.72 (0.67–0.77)	0.70 (0.65–0.75)
2	382	74 (18)	78	77 (16)	81	0.69 (0.63–0.74)	0.72 (0.66–0.77)
3	280	72 (19)	75	74 (17)	77	0.66 (0.59–0.72)	0.66 (0.59–0.73)
Duration
≤5 yrs	520	74 (18)	78	77 (17)	81	0.72 (0.68–0.76)	0.72 (0.67–0.76)
>5 yrs	516	76 (17)	78	78 (17)	82	0.67 (0.62–0.72)	0.70 (0.64–0.75)
Good tolerability
No (Score < 7)	587	69 (18)	71	72 (17)	74	0.61 (0.56–0.66)	0.64 (0.59–0.70)
Yes (Score ≥7)	438	83 (14)	87	85 (12)	89	0.69 (0.64–0.74)	0.63 (0.57–0.69)

The classical CCC is given with its 95% confidence interval along with the posterior median (95% equal-tailed credibility interval) of the CCC obtained with the BLN model.

^a (1): No problem, (2): Acceptable, (3): Annoying but acceptable, (4): Just acceptable, (5): Not acceptable.

When only the number of tablets is taken into account, there was no difference in the agreement levels according to the number of tablets. The posterior median of the CCC was 0.70 (0.65–0.75) when taking one tablet, 0.72 (0.66–0.77) when taking two tablets and 0.66 (0.59–0.73) when taking three tablets. For comparative purposes, the total deviation index²⁵ with a proportion of 0.90 (${TDI}_{0.90}$) was computed non-parametrically using a quantile regression.²⁶ The ${TDI}_{0.90}$ gives a boundary such that 90% of the absolute differences between the patient and the physician scores are below the boundary. The ${TDI}_{0.90}$ was equal to 2.1 (1.8–2.4) cm when taking a single tablet, 2.2 (1.9–2.5) cm when taking two tablets and 2.3 (1.8–2.8) cm when taking three tablets. The effect of the number of tablets, corrected for the effect of tolerance, gender (male/female) and duration of the disease (years) on the agreement level is summarized in Table 6. Acceptability was not included in the model because of the strong association between the number of tablets and the acceptability score.
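The definition of the total deviation index given above can be illustrated with a short sketch. It assumes the simplest non-parametric version, namely the empirical 0.90 quantile of the absolute differences, which is what a quantile regression with an intercept only reduces to; it does not compute the interval estimates reported in the text. The function name and the scores (in mm on a 100 mm scale) are hypothetical.

```python
import math

def tdi_nonparametric(x, y, p=0.90):
    """Smallest observed |x - y| such that at least a proportion p of the
    absolute differences are less than or equal to it."""
    diffs = sorted(abs(a - b) for a, b in zip(x, y))
    k = math.ceil(p * len(diffs) - 1e-9)  # tolerance guards against float round-off
    return diffs[k - 1]

# Hypothetical compliance scores (mm) for ten patient-physician pairs
patient = [75, 80, 62, 91, 55, 70, 88, 49, 66, 95]
physician = [80, 84, 79, 90, 65, 72, 99, 45, 60, 70]

tdi_90 = tdi_nonparametric(patient, physician, p=0.90)
```

With these ten pairs, nine of the ten absolute differences are at most `tdi_90`.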

Table 6.

COCO study – Summary measures for the posterior distribution of the parameters of the models for the means, the variances and the CCC under the bivariate logit-normal approach.

	Physician				Patient
Parameter	Mean (SD)	P2.5	P50	P97.5	Mean (SD)	P2.5	P50	P97.5
Model for the means
Intercept	−0.99 (0.13)	−1.24	−0.99	−0.74	−0.93 (0.14)	−1.23	−0.92	−0.68
Gender (M)	−0.044 (0.064)	−0.17	−0.043	0.081	−0.084 (0.063)	−0.21	−0.083	0.039
Nb tablets
1	0.047 (0.080)	−0.11	0.045	0.20	0.16 (0.083)	−0.0015	0.16	0.32
2	−0.013 (0.078)	−0.16	−0.012	0.14	0.081 (0.074)	−0.066	0.081	0.23
Disease duration (yrs)	0.014 (0.0062)	0.0025	0.015	0.027	0.0091 (0.0059)	−0.0024	0.0092	0.020
Tolerance	0.31 (0.017)	0.28	0.31	0.34	0.32 (0.019)	0.29	0.32	0.36
Model for the variances
Intercept	0.013 (0.24)	−0.51	0.032	0.44	−0.047 (0.23)	−0.51	−0.049	0.40
Gender (M)	−0.13 (0.11)	−0.33	−0.13	0.089	0.11 (0.11)	−0.093	0.11	0.32
Nb tablets
1	0.077 (0.14)	−0.20	0.078	0.35	0.38 (0.13)	0.11	0.38	0.63
2	0.17 (0.14)	−0.10	0.17	0.45	0.18 (0.14)	−0.092	0.18	0.44
Disease duration (yrs)	0.010 (0.0096)	−0.0079	0.011	0.030	0.0059 (0.010)	−0.014	0.0062	0.025
Tolerance	−0.031 (0.028)	−0.083	−0.032	0.027	−0.050 (0.029)	−0.11	−0.050	0.0062
Model for the CCC
Intercept	0.81 (0.15)	0.51	0.82	1.10
Gender (M)	−0.082 (0.071)	−0.22	−0.082	0.061
Nb tablets
1	0.17 (0.10)	−0.029	0.17	0.37
2	0.21 (0.098)	0.0085	0.21	0.40
Disease duration (yrs)	−0.0070 (0.0069)	−0.021	−0.0069	0.0064
Tolerance	−0.019 (0.019)	−0.056	−0.020	0.019

As seen in Table 6, the compliance score given by the physician was positively associated with the disease duration and the tolerance, while the score given by the patient was positively associated only with the tolerance. While the variability of the physician compliance assessments was not associated with any of the covariates, the variability in the compliance scores given by the patient was higher with one tablet than with three tablets. The agreement level was higher when the treatment involved two tablets instead of three, and there was a similar but less certain tendency when it involved one tablet instead of three (the 95% credibility interval of that coefficient, −0.029 to 0.37, narrowly includes zero).

The univariate PPC provided a posterior predictive p-value of 0.88 for the physicians and 0.65 for the patients, indicating no evidence of lack of fit. A QQ-plot for the uniformity of the p-values obtained with the Mann–Whitney test is given in Figure 5. There was also no evidence of lack of fit from the Kolmogorov–Smirnov test for the uniformity of the p-values (p = 0.47).

Figure 5.

COCO study. QQ-plot for the uniformity of the p-values obtained for the multivariate posterior predictive check.

8.2 PETRA study

The percentage of positive joints out of the 28 joints involved in the DAS28 score is reported in Table 7 according to the disease activity. Remission is defined by a DAS28 score lower than 2.6, moderate disease activity by a DAS28 score between 2.6 and 5.1 and severe disease activity by a DAS28 score above 5.1.
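The three disease activity categories can be written as a small helper. This is a sketch only; the function name is hypothetical, and the boundary values 2.6 and 5.1 are assigned to the moderate category, following the closed interval [2.6;5.1] used in Table 7.

```python
def das28_category(score):
    """Classify a DAS28 score using the cut-offs quoted in the text."""
    if score < 2.6:
        return "remission"
    if score <= 5.1:
        return "moderate"
    return "severe"
```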

Table 7.

PETRA study – Percentage of positive joints according to the disease activity (N = 63).

		F-FDG PET/CT scan	US	CCC (classical)	CCC (BLN)
Parameter	N	Mean (SD)	Mean (SD)	(95% CI)	Median (95% CrI)
DAS28 (CRP)
<2.6	22	13.0 (19.2)	5.7 (5.5)	0.047 (−0.16 to 0.25)	0.089 (−0.32 to 0.49)
[2.6;5.1]	31	16.7 (23.8)	6.2 (6.5)	0.17 (0.017 to 0.32)	0.32 (0.0080 to 0.59)
>5.1	10	48.6 (40.0)	22.5 (20.3)	0.35 (−0.038 to 0.73)	0.40 (0.0030 to 0.75)

The classical CCC (95% confidence interval) and the posterior median (95% equal-tailed credibility interval) of the CCC obtained with the BLN approach.

The posterior distribution of the parameters of the BLN approach is summarized in Table 8. The percentage of positive joints increased with the DAS28 score for both F-FDG PET/CT and US, but there was no evidence that the variability in the percentage of positive joints was related to the DAS28 score. There was also no evidence that the agreement level between the two methods was associated with the DAS28 score.

Table 8.

PETRA study – Summary measures for the posterior distribution of the parameters of the models for the means, the variances and the CCC under the bivariate logit-normal approach.

	F-FDG PET/CT scan				US
Parameter	Mean (SD)	P2.5	P50	P97.5	Mean (SD)	P2.5	P50	P97.5
Models for the mean
Intercept	–3.8 (0.5)	–4.7	–3.7	–2.8	–3.7 (0.72)	–5.2	–3.7	–2.4
DAS28 (CRP)	0.28 (0.14)	0.0013	0.28	0.53	0.43 (0.21)	0.010	0.43	0.84
Models for the variance
Intercept	–1.5 (1.1)	–3.6	–1.6	0.90	0.94 (0.61)	–0.29	0.97	2.1
DAS28 (CRP)	0.31 (0.26)	–0.24	0.33	0.78	0.15 (0.15)	–0.13	0.14	0.47
Model for the CCC
Intercept	0.14 (0.34)	–0.50	0.13	0.83
DAS28 (CRP)	0.056 (0.082)	–0.10	0.054	0.21

The overall posterior mean of the CCC is 0.474 (SD: 0.096), very close to the posterior median (0.475). Ninety percent of the differences between the F-FDG PET/CT and US percentages were less than 46.4% (17.3–75.5). Under the normality assumption, the posterior mean of the CCC is 0.327 (SD: 0.063) with a posterior median of 0.328. These agreement levels are not satisfactory and can be explained by the fact that the F-FDG PET/CT detects more positive joints in the small joints of the hands than the US.

The univariate PPC provided a posterior predictive p-value of 0.61 for the F-FDG PET/CT scan and 0.76 for the US, indicating no evidence of lack of fit. A QQ-plot for the uniformity of the p-values obtained with the Mann–Whitney test is given in Figure 6. There was also no evidence of lack of fit from the Kolmogorov–Smirnov test for the uniformity of the p-values (p = 0.21).

Figure 6.

PETRA study. QQ-plot for the uniformity of the p-values obtained for the multivariate posterior predictive check.

9 Discussion

Bounded scales, with visual analog scales as the best-known example, are common in medical and behavioral sciences. In this paper, we developed a methodology to study the agreement (or the correlation) between two assessments made on a bounded scale in a Bayesian framework. In particular, the method is developed under two settings: (1) when the scores are coarsened versions of a latent score following a BLN distribution or (2) when the scores are binomial with the true probabilities following a BLN distribution. When the bounded score is continuous, a direct logit transformation of the scores was proposed. This method makes it possible to evaluate directly the impact of categorical and continuous covariates on agreement levels and can be implemented in standard Bayesian software, like WinBUGS or JAGS. The programs and the data to analyse the COCO and the PETRA studies are available as web supplemental material. The CCC was related to covariates using a Fisher Z-transform link and the variances using a log-linear model. These link functions are commonly used but can be replaced by other link functions, as long as they keep the parameter estimates within the boundaries of the parameter space.
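The role of the two link functions can be illustrated numerically. The sketch below assumes that the Fisher Z transform is taken as the inverse hyperbolic tangent, $z = \frac{1}{2}\log\{(1+\rho)/(1-\rho)\}$, and that the variance model is linear on the log scale; the function names and the linear-predictor values are hypothetical and are not taken from the supplemental programs.

```python
import math

def fisher_z(ccc):
    """Fisher Z transform (assumed here to be atanh): maps (-1, 1) to the real line."""
    return math.atanh(ccc)

def ccc_from_linear_predictor(eta):
    """Inverse Fisher Z link: any real linear predictor gives a CCC in (-1, 1)."""
    return math.tanh(eta)

def variance_from_linear_predictor(eta):
    """Inverse log link: any real linear predictor gives a positive variance."""
    return math.exp(eta)

# Hypothetical linear-predictor values spanning a wide range
etas = [-5.0, -1.0, 0.0, 0.5, 1.0, 5.0]
cccs = [ccc_from_linear_predictor(e) for e in etas]
variances = [variance_from_linear_predictor(e) for e in etas]
```

Whatever values the regression coefficients take, the back-transformed CCC stays strictly between −1 and 1 and the variance stays positive, which is the property required of any alternative link.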

In the coarsened setting, a CCC based on the means, the variances and the correlation of the scores on a latent scale was defined. The correlation obtained on the latent scale is close to the concept of the polychoric correlation coefficient, with two distinctions. First, a bivariate LN distribution is assumed on the latent scale instead of a bivariate normal distribution. Second, the thresholds used on the bounded scale to define the coarsened scores are given a priori instead of being estimated. In the percentage case, the correlation obtained on the transformed scale is close to the Pearson correlation coefficient obtained after a logit transformation of the data. The CCC is often criticized because, like the ICC, it depends on the range of the scores observed on the scale. In particular, for the same score differences between two scorers, the CCC can be high in heterogeneous populations and low in homogeneous populations.⁵ However, in the present setting, we can expect that most often a large range of possible scores will be covered (including the boundaries) because the scale is bounded. For example, in the COCO study, the scores vary between 1 and 100 on the 100 mm VAS scale and in the PETRA study the percentages vary between 0 and 100%.
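The dependence of the CCC on the range of the scores can be seen in a small numerical illustration. The sketch uses the usual moment-based form of the classical CCC, $2\sigma_{12}/\{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2\}$, with variances computed with divisor n; the data and the function name are hypothetical. The same between-scorer differences of ±5 points are applied to a heterogeneous and to a homogeneous set of subjects.

```python
from statistics import fmean

def ccc(x, y):
    """Moment-based concordance correlation coefficient (divisor n)."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)

# Identical disagreement pattern between the two scorers in both populations
differences = [5, -5, 5, -5, 5, -5, 5, -5, 5]

wide = [10, 20, 30, 40, 50, 60, 70, 80, 90]    # heterogeneous subjects
narrow = [46, 47, 48, 49, 50, 51, 52, 53, 54]  # homogeneous subjects

ccc_wide = ccc(wide, [s + d for s, d in zip(wide, differences)])
ccc_narrow = ccc(narrow, [s + d for s, d in zip(narrow, differences)])
```

Although the disagreement between the scorers is identical, the CCC is about 0.98 for the heterogeneous subjects and about 0.35 for the homogeneous ones.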

In the case of bounded coarsened scores and percentages, the simulations showed very good coverage levels for the BLN approach. Note that an approximation to the bivariate normal cumulative probability distribution was used in the coarsened case. Inference based on the classical CCC, on the other hand, showed poorer coverage levels as the value of the CCC increased. This can be explained by (1) an asymmetric sampling distribution of the CCC near the boundaries −1 and 1, which violates the normality assumption and makes symmetric confidence intervals less appropriate, and (2) the fact that the classical CCC is based on the means and the variances of the scores, which may not be appropriate for describing skewed distributions. Using the bivariate normal Bayesian approach with equal-tailed credibility intervals does not improve the coverage levels, showing the inappropriateness of the additional bivariate normal assumption.

The LN distribution can be used to describe a variety of distributions obtained on bounded scales, from U- to J-shapes. However, the adequacy of the LN distribution should be checked, at least visually, as in Figures 1 and 2, because the LN distribution does not cover all possible patterns that can be encountered in daily practice. As an alternative to the BLN distribution, one may think of using a bivariate beta distribution on the latent scale, especially if the distribution of the scores shows some uniform pattern. This is a topic for future research.
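The J- and U-shapes can be checked directly from the logit-normal density, $f(x) = \{\sigma\sqrt{2\pi}\,x(1-x)\}^{-1}\exp[-\{\mathrm{logit}(x)-\mu\}^2/(2\sigma^2)]$. The sketch below evaluates it on a coarse grid for the two parameter settings used in the simulations, LN(2,1) and LN(0,3). It assumes that the second parameter is the variance on the logit scale; the function name and the grid are hypothetical.

```python
import math

def logit_normal_pdf(x, mu, var):
    """Density on (0, 1) of expit(Z) with Z ~ N(mu, var)."""
    sd = math.sqrt(var)
    z = (math.log(x / (1.0 - x)) - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi) * x * (1.0 - x))

grid = [0.05, 0.25, 0.50, 0.75, 0.95]

# LN(2, 1): the density rises steadily towards the upper bound on this grid (J-shape)
j_shape = [logit_normal_pdf(x, mu=2.0, var=1.0) for x in grid]

# LN(0, 3): the density is lowest in the middle and higher near both bounds (U-shape)
u_shape = [logit_normal_pdf(x, mu=0.0, var=3.0) for x in grid]
```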

The total deviation index (TDI_x), giving a boundary that includes x% of the differences between the two scorers, was used for illustrative purposes. This index was originally based on the normality assumption of score differences,²⁵ but non-parametric estimates have since been developed.²⁶ This index complements the information given by the CCC.²⁵ Modeling this index directly according to a set of covariates should be possible²⁵ and could be a topic for future research.

In the COCO study, the number of tablets slightly influenced the agreement level between the physician and the patient assessment of compliance. One explanation could be that compliance is more difficult to define when the treatment complexity increases. This calls for a clear definition of compliance when conducting studies on drug efficacy, particularly if the treatment is complex. In the PETRA study, the agreement between the F-FDG PET/CT and the US scan was not satisfactory because the F-FDG PET/CT was more sensitive in the small joints of the hands. The implication of the presence of synovitis according to F-FDG PET/CT and US on the patient health should therefore be studied separately.

In conclusion, we proposed a method to directly evaluate the effect of covariates on the level of agreement on bounded scales. This could, for example, help researchers to improve the definition of concepts, as with compliance in the COCO study, or to compare methods, as in the PETRA study. Extension of the method to multilevel data and several scorers is a topic for future research.

Footnotes

Acknowledgements

The authors are grateful to Dr. Michel Malaise and Dr. Charline Rinkin (Rheumatology Department, CHU of Liège, Liège, Belgium), for providing the PETRA data.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is part of project 451-13-002 funded by the Netherlands Organisation for Scientific Research.

Supplemental material

Supplemental material is available for this article online.

Appendix 1

References

1. Shrout, Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420–428.

2. McGraw, Wong. Forming inferences about some intraclass correlation coefficients. Psychol Meth 1996; 1: 30–46.

3. Bland, Altman. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 327: 307–310.

4. Lin LIK. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45: 255–268.

5. Huiman, Michael, Lawrence. An overview on assessing agreement with continuous measurements. J Biopharm Stat 2007; 17: 529–569.

6. Barnhart, Williamson. Weighted least-squares approach for comparing correlated kappa. Biometrics 2002; 58: 1012–1019.

7. Chen, Barnhart. Comparison of ICC and CCC for assessing agreement for data without and with replications. Comput Stat Data Anal 2008; 53: 554–564.

8. Rinkin C, Fosse P, Chapelier N, et al. 18F-Fluorodeoxyglucose Positron Emission Computer Tomography and Ultrasonography for Assessing Remission in Patients with Rheumatoid Arthritis [abstract]. Arthritis Rheumatol. 2015; 67(suppl 10). http://acrabstracts.org/abstract/18f-fluorodeoxyglucose-positron-emission-computer-tomography-and-ultrasonography-for-assessing-remission-in-patients-with-rheumatoid-arthritis/ (accessed 12 April 2017).

9. Lesaffre, Rizopoulos, Tsonaka. The logistic transform for bounded outcome scores. Biostatistics 2007; 8: 72–85.

10. Barnhart, Williamson. Modeling concordance correlation via GEE to evaluate reproducibility. Biometrics 2001; 57: 931–940.

11. Carrasco, Jover, King, et al. Comparison of concordance correlation coefficient estimating approaches with skewed data. J Biopharm Stat 2007; 17: 673–684.

12. Feng, Baumgartner, Svetnik. A Bayesian estimate of the concordance correlation coefficient with skewed data. Pharma Stat 2015; 14: 350–358.

13. Hutton, Stanghellini. Modelling bounded health scores with censored skew-normal distributions. Stat Med 2011; 30: 368–376.

14. Johnson. Systems of frequency curves generated by methods of translation. Biometrika 1949; 36: 149–176.

15. Egan, Zhao, Axon. US trends in prevalence, awareness, treatment, and control of hypertension, 1988–2008. JAMA 2010; 303: 2043–2050.

16. Genz, Bretz. Computation of multivariate normal and t probabilities. New York: Springer Verlag, 2009.

17. Mee, Owen. A simple approximation for bivariate normal probabilities. J Qual Technol 1983; 15: 72–75.

18. Cox, Wermuth. A simple approximation for bivariate and trivariate normal integrals. Int Stat Rev/Revue Internationale de Statistique 1991; 59: 263–269.

19. Albers, Kallenberg. A simple approximation to the bivariate normal distribution with large correlation coefficient. J Multivar Anal 1994; 49: 87–96.

20. Hong. An approximation to bivariate and trivariate normal integrals. Civil Eng Environ Syst 1999; 16: 115–127.

21. Plummer M (2003) JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In: Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Citeseer, March, 20–22.

22. Gelman, Meng X-L, Stern. Posterior predictive assessment of model fitness via realized discrepancies. Stat Sin 1996; 6: 733–760.

23. Crespi, Boscardin. Bayesian model checking for multivariate outcome data. Comput Stat Data Anal 2009; 53: 3765–3772.

24. Bruyneel, Squires, et al. Bayesian multilevel MIMIC modeling for studying measurement invariance in cross-group comparisons. Med Care 2017; 55: e25–e35.

25. Lin LIK. Total deviation index for measuring individual agreement with applications in laboratory performance and bioequivalence. Stat Med 2000; 19: 255–270.

26. Lin LIK, Pan, Hedayat, et al. A simulation study of nonparametric total deviation index as a measure of agreement based on quantile regression. J Biopharm Stat 2016; 26: 937–950.

