Abstract

Devices that measure our physical, medical and mental condition have entered our daily life recently. Such devices measure our status in a continuous manner and can be useful in predicting future medical events or can guide us towards a healthier life. It is therefore important to establish that such devices record our behaviour in a reliable manner and measure what we believe they measure. In this article, we propose to measure the reliability and validity of a newly developed measuring device in time using a longitudinal model for sequential kappa statistics. We propose a Bayesian estimation procedure. The method is illustrated by a validation study of a new accelerometer in cardiopulmonary rehabilitation patients.

Keywords

continuous recording Reliability time-event sequential data time series transient event

1 Introduction

The future in many scientific domains lies in devices that record biological, physical, behavioural or environmental information continuously and in real time. These devices offer the possibility to study individuals in their natural environment and can provide real-time personalized feedback. For example, physical activity can be assessed in real time using activity trackers. The devices produce intensive longitudinal data, characterized by a large number of observations (e.g., thousands), often very close in time (e.g., every second) and collected on every individual of a sample over a short time period (Bolger and Laurenceau (2013)).

Despite the diversity of uses and purposes, it is imperative for all these devices to be reliable (provide consistent measurements) and valid (measure what they are meant to measure). Lack of reliability and validity can lead to incorrect conclusions from scientific studies and unreproducible research (Munafo et al. (2017)). The picture is equally bleak for everyday users, who will be unable to assess changes in any biological, physical, behavioural or environmental information provided by the measurement instruments. Reliability refers to the ability to differentiate between items in a population. This is an essential property of a measurement scale, especially when assessing the correlation with other measures because of the well-known attenuation effect. On top of good reliability, good agreement is also sometimes imperative, as in clinical decision-making where the decision depends on the score provided by observers/devices. Agreement is also an important concept when studying criterion validity, where the measurement instrument is calibrated against an established method. The established method is often regarded as a ‘gold standard’ measuring the ‘true value’ of the quantity to be determined. However, the reference method is frequently also subject to measurement error. In that case, the comparability of the new and the reference methods is assessed by the degree of agreement between them.

It is important to assess reliability and agreement in realistic settings on the target user groups on top of controlled laboratory conditions. Subjects may behave differently in real life as compared to laboratory settings, impacting the reliability/agreement levels. This was the case in the CAM study (Annegarn et al. (2011)), motivating this article. The CAM study was designed to validate a new accelerometry sensor, the Movement in a bOX (MOX, initially named CAM) in real conditions. The MOX can categorize body activity as non-weight bearing (e.g., lying or sitting), weight bearing (e.g., standing) or dynamic (e.g., walking). This device was developed to be an objective alternative to questionnaires when measuring physical activity during the revalidation of patients with chronic organ failure. During one hour of unconstrained activity, 10 patients with chronic organ failure were videotaped in a revalidation centre while their body activity was continuously recorded with the MOX, worn simultaneously on the leg and on the trunk for comparative purposes. The aims were (a) to determine the validity of the new accelerometry sensor through the agreement level between MOX recordings and human observations of body activity on the videotape considered as the reference method; (b) to assess temporal stability of the validity levels (e.g., validity levels can decrease because of device shifts); and (c) to determine the influence of the body location where the device is held on the validity levels.

When two observers (or devices) classify items (subjects/objects) on nominal scales, Cohen's kappa coefficient (Cohen (1960)) is an appropriate agreement measure and the intraclass kappa coefficient (Kraemer (1979)) an appropriate reliability measure. Kappa coefficients have the particularity to take the marginal probability distribution of the observers into account, which is often a desirable property. These coefficients were extended over the years to account for predictors under various study designs including longitudinal settings. However, intensive longitudinal data, as produced by accelerometers, differ from longitudinal data in many respects (Walls and Schafer (2006)). First, they often show complex trajectories, including cyclic patterns or chaotic elements over time that can vary widely between subjects. If poorly modelled, biased estimates of agreement and reliability levels can be obtained. Second, observations close in time often exhibit serial correlation. Ignoring this correlation will lead to incorrect conclusions regarding the reliability and agreement levels. In particular, reliability and agreement are likely to be overestimated, so that measurement instruments used in daily life are less reliable and valid than expected. Third, studies with intensive longitudinal data often involve a small number of subjects (10 in the CAM study) due to the high costs of studies involving these cutting edge technologies and many observations per subject (3 600 in the CAM study) due to the recording speed and storage properties of the electronic devices. The combination of these two latter facts leads to computational problems and instability in the parameter estimates when using the existing unit-specific models (Gajewski et al. (2007); Hsiao et al. (2011); Vanbelle et al. (2012); Tsai (2012); Vanbelle and Lesaffre (2015)), such as multilevel models, and population-based approaches (Klar et al. (2000); Williamson et al. (2000); Gonin et al. (2000)), like generalized estimating equations. This is especially true when the number of measurement occasions surpasses the number of participants (Little et al. (2017)), as in the CAM study, and therefore prevents the use of these methods in the current context.

One simple solution is to summarize the information over time intervals and therefore reduce the number of repeated measurements. For example, define one-minute intervals and determine the time spent under each body activity within these intervals. Then, the aforementioned methods could be applied to the reduced data. This practice, encountered in behavioural sciences (Rapp et al. (2011); Liu et al. (2016)), should be avoided in the context of agreement studies. The most obvious reason is that summary measures can perfectly agree (e.g., two methods recorded 25 seconds duration in non-weight bearing posture) when perfect disagreement is observed in the raw data (e.g., the first method recorded the 25 first seconds and the second method the 25 last seconds of the interval as non-weight bearing posture). Ignoring these disagreements can lead to incorrect conclusions when using the measurement instrument to study process dynamics.
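The pitfall described above can be reproduced with a few lines of Python. This is only an illustrative sketch with made-up records for a single one-minute interval (1 = non-weight bearing posture, 0 = otherwise); the function names are hypothetical and not taken from the study.

```python
def seconds_in_state(records):
    """Summary measure: number of seconds recorded in category 1."""
    return sum(records)


def disagreement_count(records_a, records_b):
    """Raw-data measure: number of seconds on which the two methods differ."""
    return sum(a != b for a, b in zip(records_a, records_b))


# Hypothetical 60-second interval.
# Method 1 records the first 25 seconds as non-weight bearing,
# method 2 records the last 25 seconds as non-weight bearing.
method_1 = [1] * 25 + [0] * 35
method_2 = [0] * 35 + [1] * 25

summary_1 = seconds_in_state(method_1)
summary_2 = seconds_in_state(method_2)
raw_disagreements = disagreement_count(method_1, method_2)

print(summary_1, summary_2, raw_disagreements)
```

The two summaries are identical, although the methods never agree on a second that either of them recorded as non-weight bearing.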

This led us to develop a new partial-Bayesian methodology, building on Vanbelle and Lesaffre (2015), permitting the direct evaluation of the impact of predictors on the agreement levels obtained on intensive longitudinal data. The new method extends the method of Vanbelle and Lesaffre (2015) in two ways. First, it allows the information to be summarized over time intervals without losing information about disagreement in the raw data. Second, it accounts for possible correlation both within and between time intervals. We describe the motivating data in Section 2. After extending the definition of the kappa coefficient to intensive longitudinal data in Section 3, we introduce the methodology to model agreement obtained on intensive longitudinal data in Section 4. The results of the CAM study are presented in Section 5. A simulation study is conducted in Section 6. Finally, the methodology is discussed in Section 7.

2 Motivating data: The CAM study

Patients with chronic organ failure are generally characterized by an inactive lifestyle to avoid the unpleasant sensation of dyspnea. Decreased weight-bearing activities and postures in daily life (e.g., walking and standing) are important triggers in the development and/or progression of lower-limb muscle atrophy, muscle weakness and exercise intolerance in patients with chronic organ failure (Annegarn et al. (2011)). To implement daily physical activity as an outcome measure of cardiopulmonary rehabilitation, Annegarn et al. (2011) wanted to validate in real settings a new accelerometry sensor, the MOX, developed within Maastricht University by the service point Activity Monitoring.

To that end, 10 patients were recruited during their rehabilitation programme at CIRO+, a centre of expertise for chronic organ failure in Horn (The Netherlands). The patients were asked to wear the MOX at two different body places simultaneously, namely on the leg (frontal part of the thigh) and on the trunk (lower back) for one hour during their daily activities. These patients were also videotaped everywhere when wearing the MOX, except in the toilets. The subject's activity (non-weight bearing posture, weight bearing posture or dynamic activity) was also determined every second on the video by a researcher, blinded to the values obtained with the MOX. The records of one subject are displayed in Figure 1.

Figure 1

CAM study. Activity (N = non-weight bearing posture, W = weight-bearing posture and D = dynamic activity) recorded with the video (bottom), the MOX worn on the trunk (middle) and on the leg (top) for one subject

The aims were (a) to determine the validity of the new sensor in real settings through the agreement level between MOX recordings and human observations of body activity on the videotape considered as the reference method; (b) to assess temporal stability of the validity levels (e.g., validity levels can decrease because of device shifts); and (c) to determine the influence of the body location where the device is held on the validity levels.

In this article, we focus on the distinction between non-weight bearing postures (NWBP) on the one hand and weight-bearing postures and dynamic activities on the other hand, because both of the latter have to be encouraged during the rehabilitation process.

3 Cohen's kappa and intraclass kappa coefficients

3.1 Introduction

Kappa coefficients are defined in terms of population parameters similarly to Vanbelle (2016). Consider two fixed observers classifying a sample of items (subjects or objects) from population $I$ on a binary scale. Let the random variable $Y_{ir}$ express the classification of item $i$ by observer $r$ , that is, $Y_{ir} = 1$ if observer $r$ ( $r = 1, 2$ ) classifies a randomly selected item $i$ of population $I$ in category $1$ and is equal to zero otherwise. Further, consider the random variable $Z_{i} = 1 - I (Y_{i 1}, Y_{i 2})$ expressing the disagreement between the two observers on the classification of item $i$ , where $I (., .)$ is the identity function, equal to 1 when its two arguments are equal and to 0 otherwise. The random variable $Z_{i}$ then equals 1 if a disagreement occurs and equals 0 otherwise with $Z_{i} \sim Bern (ν_{i})$ where $ν_{i}$ denotes the probability to disagree.

Cohen's kappa coefficient is defined as

$$κ = 1 - \frac{E (Z_{i})}{E_{ind} (Z_{i})}, \tag{3.1}$$

where $E (Z_{i})$ is the expectation of $Z_{i}$ over the population of items and $E_{ind} (Z_{i})$ is the expectation assuming that $Y_{i 1}$ and $Y_{i 2}$ are statistically independent, that is, $P (Y_{i 1} = j, Y_{i 2} = k) = P (Y_{i 1} = j) P (Y_{i 2} = k) .$ With the additional assumption that $E (Y_{i 1}) = E (Y_{i 2})$ , the intraclass kappa coefficient is obtained.
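As an illustration of Eqn. 3.1, the following Python sketch computes a plug-in sample version of Cohen's kappa for two binary classifications, replacing the expectations by sample proportions. It treats the items as independent, so it does not address the serial correlation discussed in this article; the function name and the data are hypothetical.

```python
def cohen_kappa_binary(y1, y2):
    """Plug-in estimate of kappa = 1 - E(Z) / E_ind(Z) for binary ratings."""
    n = len(y1)
    if n == 0 or n != len(y2):
        raise ValueError("y1 and y2 must be non-empty and of equal length")
    # Observed proportion of disagreement, the sample analogue of E(Z).
    observed = sum(a != b for a, b in zip(y1, y2)) / n
    # Marginal proportions of category 1 for each observer.
    p1 = sum(y1) / n
    p2 = sum(y2) / n
    # Disagreement expected if the two observers were independent.
    expected = p1 * (1 - p2) + (1 - p1) * p2
    if expected == 0:
        raise ValueError("kappa is undefined when no disagreement is possible")
    return 1 - observed / expected


# Hypothetical classifications of 8 items by two observers.
obs_1 = [1, 1, 1, 0, 0, 0, 1, 0]
obs_2 = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa_binary(obs_1, obs_2)
print(kappa)
```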

The definition of the kappa coefficients is based on independent $Z_{i}$ . However, when intensive longitudinal data are the basis for the estimation of the kappa coefficients, this assumption does not hold because the $Z_{i}$ then exhibit serial correlation. In the next sections, we develop a methodology to model dependent kappas obtained from an intensive longitudinal study, building on Vanbelle and Lesaffre (2015).

3.2 Cohen's kappa in an intensive longitudinal study

Suppose that two observers (or devices) classify on a binary scale a sequence of $T$ intensive records obtained on each subject of a random sample of size $N$ from population $I$ . For example, in the CAM study, there are three observers (MOX(leg), MOX(trunk) and the video). The research question involves two pairs of observers, namely MOX(leg) versus video and MOX(trunk) versus video. Records are made on $N = 10$ subjects every second during one hour resulting in $T = 3 600$ observations per subject.

Let $Y_{ir, t}$ be the random variable equal to 1 if observer $r$ classifies the $t$ th record of subject $i$ in category 1 ( $i = 1, \dots, N; r = 1, 2; t = 1, \dots, T$ ) and equal to zero otherwise. The random variable $Y_{ir, t}$ follows a Bernoulli distribution $Y_{ir, t} \sim Bern (π_{ir, t})$ , where $π_{ir, t}$ is the probability for observer $r$ to classify the $t$ th record of subject $i$ in category 1. Similarly to Vanbelle and Lesaffre (2015), let the random variable $Z_{i, t} = 1 - I (Y_{i 1, t}, Y_{i 2, t})$ express the disagreement between the two observers on the $t$ th record of subject $i$ , where $I (., .)$ is the identity function. The random variable $Z_{i, t}$ also follows a Bernoulli distribution $Z_{i, t} \sim Bern (ν_{i, t})$ with $ν_{i, t}$ denoting the probability to disagree.

Suppose that the observation period is divided into small time intervals. The intervals can be of unequal lengths but assume, for notational convenience, that the sequence of $T$ observations is divided into $V$ time intervals including $L$ time points each. The cumulative sum of the random variables $Y_{ir, t}$ ( $r = 1, 2$ ) over the $v$ th time interval ( $v = 1, \dots, V$ ), namely $C_{ir, v} = \sum_{t = (v - 1) \times L + 1}^{v \times L} Y_{ir, t}$ , follows a binomial distribution ( $C_{ir, v} \sim Bin (π_{ir, v}, L)$ ) under three conditions: (a) the time intervals have a fixed length, (b) the probability to be in category 1 is constant within each time interval, that is, $π_{ir, t} = π_{ir, v}$ ( $t = (v - 1) \times L + 1, \dots, v \times L$ ) and (c) the random variables $Y_{ir, (v - 1) \times L + 1}, \dots, Y_{ir, v \times L}$ are independent.
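The aggregation step can be sketched in Python as follows, assuming equal-length intervals and a sequence length that is a multiple of $L$. The function name and the short sequences are hypothetical; the same function is applied to the per-record disagreement indicators $Z_{i, t}$ to obtain their interval totals.

```python
def aggregate(records, length):
    """Sum a binary sequence over consecutive intervals of `length` records."""
    if length <= 0 or len(records) % length != 0:
        raise ValueError("sequence length must be a positive multiple of `length`")
    return [sum(records[start:start + length])
            for start in range(0, len(records), length)]


# Hypothetical records of one subject: T = 12 time points, V = 2 intervals, L = 6.
y1 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # observer 1
y2 = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # observer 2
z = [int(a != b) for a, b in zip(y1, y2)]    # disagreement indicators

c1 = aggregate(y1, 6)   # counts of category 1 for observer 1, per interval
c2 = aggregate(y2, 6)   # counts of category 1 for observer 2, per interval
u = aggregate(z, 6)     # counts of disagreements, per interval

print(c1, c2, u)
```

Unlike a comparison of `c1` and `c2` alone, the vector `u` keeps track of the disagreements observed in the raw records.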

The first condition holds by construction of the time intervals. The second condition is not necessarily met. However, taking the average probability over a time interval corresponds to temporal aggregation, a smoothing technique used in time series analysis (see, e.g., Silvestrini and Veredas (2008)). In particular, flow aggregation consists in taking the sum of variables over time intervals as aggregated variable. Finally, the third condition does not hold. Observations close in time are very likely to be correlated. To relax this third assumption, the binomial distribution is replaced by a beta-binomial distribution. The additional overdispersion parameter in the beta-binomial distribution $BetaBin (π_{ir, v}, L, ρ_{Mir, v})$ accounts for the correlation between the observations within a time interval. The first two moments of $C_{ir, v}$ are then $μ_{ir, v} = L π_{ir, v}$ and $σ_{ir, v}^{2} = L π_{ir, v} (1 - π_{ir, v}) (1 + (L - 1) ρ_{Mir, v})$ , where $ρ_{Mir, v}$ is the pairwise correlation between two observations in the time interval $v$ and is restricted to be positive ( $ρ_{Mir, v} \in [0, 1]$ ).
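The two moments given above can be checked numerically. The sketch below assumes the standard mapping between $(π, ρ)$ and the beta shape parameters, $a = π (1 - ρ) / ρ$ and $b = (1 - π) (1 - ρ) / ρ$ (so that $ρ = 1 / (a + b + 1)$); this mapping is an assumption of the sketch, since the reparametrization used in the article is given in an appendix not reproduced here. The numerical values are purely illustrative.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, pi, rho):
    """Beta-binomial pmf parametrized by mean probability pi and correlation rho (0 < rho < 1)."""
    a = pi * (1 - rho) / rho
    b = (1 - pi) * (1 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


# Illustrative values: L = 30 records per interval, pi = 0.4, rho = 0.5.
L, pi, rho = 30, 0.4, 0.5
pmf = [betabinom_pmf(k, L, pi, rho) for k in range(L + 1)]

# Moments computed directly from the pmf.
mean_c = sum(k * p for k, p in enumerate(pmf))
var_c = sum(k * k * p for k, p in enumerate(pmf)) - mean_c ** 2

# Moments from the closed-form expressions in the text.
mean_formula = L * pi
var_formula = L * pi * (1 - pi) * (1 + (L - 1) * rho)

print(mean_c, mean_formula)
print(var_c, var_formula)
```

With $ρ$ approaching 0 the variance reduces to the binomial variance $L π (1 - π)$, while larger values of $ρ$ inflate it by the factor $1 + (L - 1) ρ$.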

The random variable $Z_{i, t}$ , expressing the disagreement, also follows a Bernoulli distribution. We, therefore, consider likewise the cumulative random variable $U_{i, v} = \sum_{t = (v - 1) \times L + 1}^{v \times L} Z_{i, t}$ . We assume that it follows a beta-binomial distribution $U_{i, v} \sim BetaBin (ν_{i, v}, L, ρ_{κ i, v})$ , where $ν_{i, v}$ denotes the probability of disagreement in the time interval $v$ and $ρ_{κ i, v}$ the correlation between disagreements in the time interval $v$ .

Then, the agreement coefficient in the $v$ th time interval is defined similarly to Eqn. 3.1,

$$κ_{i, v} = 1 - \frac{E (U_{i, v})}{E_{ind} (U_{i, v})} = 1 - \frac{ν_{i, v}}{{Qe}_{i, v}} = 1 - \frac{ν_{i, v}}{1 - π_{i 1, v} π_{i 2, v} - (1 - π_{i 1, v}) (1 - π_{i 2, v})} . \tag{3.2}$$

Equation 3.2 also defines ${Qe}_{i, v}$ , which is the disagreement obtained when the two observers are assumed to be statistically independent, that is, ${Qe}_{i, v} = 1 - π_{i 1, v} π_{i 2, v} - (1 - π_{i 1, v}) (1 - π_{i 2, v})$ . With the additional assumption that $π_{i 1, v} = π_{i 2, v}$ , the intraclass kappa coefficient is obtained.
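Equation 3.2 and its inverse (the disagreement probability implied by a given kappa and given marginal probabilities) can be written as a short Python sketch. The function names and numerical values are hypothetical.

```python
def chance_disagreement(pi1, pi2):
    """Qe: disagreement probability under independence of the two observers."""
    return 1 - pi1 * pi2 - (1 - pi1) * (1 - pi2)


def interval_kappa(nu, pi1, pi2):
    """Kappa in a time interval, following Eqn. 3.2."""
    qe = chance_disagreement(pi1, pi2)
    if qe == 0:
        raise ValueError("kappa is undefined when Qe equals zero")
    return 1 - nu / qe


def disagreement_probability(kappa, pi1, pi2):
    """Inverse relationship: nu = (1 - kappa) * Qe."""
    return (1 - kappa) * chance_disagreement(pi1, pi2)


# Illustrative values for one interval.
kappa_v = interval_kappa(nu=0.1, pi1=0.5, pi2=0.6)
nu_back = disagreement_probability(kappa_v, pi1=0.5, pi2=0.6)
print(kappa_v, nu_back)
```

When `pi1 == pi2`, `chance_disagreement` reduces to $2 π (1 - π)$, which corresponds to the intraclass kappa case.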

4 Statistical inference

4.1 Statistical model

The aim is to relate the agreement coefficient $κ_{i, v}$ defined in the $v$ th time interval to predictors depending on the items and/or observers’ characteristics. Predictors in the CAM study are the location of the device on the body (leg or trunk) and time, as we want (a) to compare the agreement between the video and the MOX worn at two body places and (b) to study the stability of the agreement over time. Since $- 1 \leq κ_{i, v} \leq 1$ and the behaviour of this agreement coefficient is similar to that of the correlation coefficient, the following model, with Fisher link function (Fisher (1915)), is considered:

$$\frac{1}{2} \ln \left( \frac{1 + κ_{i, v}}{1 - κ_{i, v}} \right) = X_{i, v}^{T} β + δ_{i}, \tag{4.1}$$

where $β$ is a vector of parameters and $δ_{i} \sim N (0, τ_{κ}^{2})$ is a random effect relative to the items.
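The Fisher link in Eqn. 4.1 is the inverse hyperbolic tangent, so the linear predictor is mapped back to the kappa scale with the hyperbolic tangent. The Python sketch below illustrates this mapping; the function names, predictor values and coefficients are hypothetical and are not estimates from the CAM study.

```python
import math


def fisher_z(kappa):
    """Fisher transformation: 0.5 * ln((1 + kappa) / (1 - kappa)), for -1 < kappa < 1."""
    return math.atanh(kappa)


def inverse_fisher_z(eta):
    """Back-transformation from the linear predictor to the kappa scale."""
    return math.tanh(eta)


def kappa_from_predictors(x, beta, delta=0.0):
    """Kappa implied by predictors x, coefficients beta and random effect delta."""
    eta = sum(xj * bj for xj, bj in zip(x, beta)) + delta
    return inverse_fisher_z(eta)


# Hypothetical example: an intercept and one binary predictor.
x = [1.0, 1.0]
beta = [1.0, -0.5]
kappa_hat = kappa_from_predictors(x, beta)
print(kappa_hat)
```

Whatever the value of the linear predictor, the back-transformed kappa stays strictly between $-1$ and $1$, which is the reason for using this link.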

In the same way, the probability to be classified in category 1 will be related to predictors through the random effects model

$$Probit (π_{ir, v}) = X_{ir, v}^{T} α_{r} + γ_{i}, \tag{4.2}$$

where $α_{r}$ is a vector of parameters and $γ_{i} \sim N (0, τ_{M}^{2})$ is a random effect relative to the items. In the CAM study, the predictors are the device (video, MOX worn on the trunk or on the leg) and time.

Suppose now that we have followed up subjects intensively in a short period of time providing continuous measurements from different observers or measurement devices. From the records we establish whether or not they agree. Namely, suppose that we obtained $y_{ir, t} = 1$ when the $t$ th measurement of subject $i$ is actually classified in category $1$ by observer $r$ and zero otherwise, with $c_{ir, v}$ the corresponding cumulative sum for the time interval $v$ . In the same way, let $z_{i, t} = 1$ when there is disagreement between the two observers on the $t$ th measurement of the $i$ th item, with $u_{i, v}$ the corresponding cumulative sum for the time interval $v$ . For each time interval $v$ , the contribution of the $i$ th item to the likelihood function corresponding to the classification made by two observers, conditional on the random effects, is given by the Dirichlet-Multinomial likelihood. Since it is difficult to express this likelihood according to the parameters of interest $π_{i 1, v},$ $π_{i 2, v}$ and $κ_{i, v}$ , similarly to Vanbelle and Lesaffre (2015), the pseudo-likelihood

$$\begin{aligned} & L_{C} (π_{i 1, v}, π_{i 2, v}, ν_{i, v}, ρ_{Mir, v}, ρ_{κ i, v} \mid c_{i 1, v}, c_{i 2, v}, u_{i, v}) \\ & \quad \approx L_{C} (ν_{i, v}, ρ_{κ i, v} \mid π_{i 1, v}, π_{i 2, v}, ρ_{Mir, v}, c_{i 1, v}, c_{i 2, v}, u_{i, v}) \, L_{C} (π_{i 1, v}, π_{i 2, v}, ρ_{Mir, v} \mid c_{i 1, v}, c_{i 2, v}) \end{aligned}$$

is considered where the index $C$ denotes the condition over the random effects. This pseudo-likelihood can be expressed in terms of $π_{i 1, v},$ $π_{i 2, v}$ and $κ_{i, v}$ , using the relationship between $κ_{i, v}$ and $ν_{i, v}$ given in Eqn. 3.2 and a reparametrization of the beta-binomial likelihood (see Appendix A).

4.2 Within-interval correlations

The within-interval correlation is captured through the overdispersion parameter in the beta-binomial distributions. By considering beta-binomial distributions, the correlation between observations belonging to the same time interval is an intraclass correlation coefficient and is therefore restricted to be positive. Values close to 0 indicate heterogeneity in the observations within time intervals while values close to 1 indicate homogeneity. This could help in determining the adequacy of the time intervals length. Namely, a relatively high intraclass correlation means that observations within a time interval are homogeneous, supporting the temporal aggregation assumption.

4.3 Between-interval correlations

Between-interval correlations of derived measures from the intensive longitudinal data may give extra insight into the stability of these measures over time. One popular choice in this context is again the intraclass correlation. However, in the presence of a binary outcome and covariates the intraclass correlation is less straightforward to compute. Goldstein et al. (2002) developed three approaches to estimate the intraclass correlation for binomial random variables. To be used in our context, we extended their simulation approach to beta-binomial random variables in the Bayesian framework. We refer to Appendix B for the calculation of the intraclass correlation for kappa coefficients (Eqn. 4.1) and marginal probabilities (Eqn. 4.2).

4.4 Bayesian estimation

Maximum likelihood estimation of the model parameters from the above pseudo-likelihood proves to be quite difficult. We therefore adopted the partial-Bayesian approach suggested in Vanbelle and Lesaffre (2015) using Markov chain Monte Carlo (MCMC). The frequentist coverage of Bayesian credible intervals was shown to be excellent in Vanbelle and Lesaffre (2015) and is often quite good in general (see, for example, Lesaffre and Lawson (2012)). It is also often better than that of intervals based on the Delta method, which relies on asymptotic arguments and often leads to confidence intervals that are too narrow (Efron (1992)), especially when non-linear functions are involved, as is the case in Eqn. 4.1. Because of the limited number of subjects, parsimony was needed in the modelling approach. The two overdispersion parameters, $ρ_{Mir, v}$ and $ρ_{κ i, v}$ , were assumed to be constant across the devices, the subjects and the time intervals, that is, $ρ_{Mir, v} = ρ_{M}$ and $ρ_{κ i, v} = ρ_{κ}$ . In a Bayesian approach, prior knowledge about the parameters is combined with the observed data (likelihood) to yield the posterior distribution, that is, $p (α, β, τ_{κ}, τ_{M}, ρ_{κ}, ρ_{M} \mid c_{1, t}, c_{2, t}, u_{t}) \propto L (α, β, τ_{κ}, τ_{M}, ρ_{κ}, ρ_{M} \mid c_{1, t}, c_{2, t}, u_{t}) \, p (α, β, τ_{κ}, τ_{M}, ρ_{κ}, ρ_{M})$ , where $c_{1, t}$ , $c_{2, t}$ and $u_{t}$ denote the vectors of observations for all items.

We used vague priors which express the lack of prior information on the parameters. For the regression coefficients $β$ and $α$ , vague $N (0, 10^{2})$ independent priors were taken. We used vague uniform priors on $[0, 100]$ for the standard deviations of the random effects, since inference based on Gamma( $ε$ , $ε$ ) priors (with $ε$ small) on precisions is too sensitive to the choice of $ε$ (Gelman (2006)). A limit of 100 was chosen to allow intraclass correlation coefficient values in the whole [0,1] range. A uniform prior on $[0, 1]$ was taken for the overdispersion parameters $ρ_{M}$ and $ρ_{κ}$ . The MCMC calculations were performed using JAGS (Plummer (2003)). A total of 3 chains with 5 500 iterations each and a burn-in period of 2 500 iterations was sufficient to attain convergence according to Gelman and Rubin's diagnostic measure $\hat{R}$ . This value was close to 1 for all parameters, which means there was no evidence against convergence. Trace plots are additionally given in Appendix C.
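To make the convergence criterion concrete, the following Python sketch computes the classic Gelman and Rubin potential scale reduction factor for one scalar parameter from several chains of equal length. It is a textbook version given for illustration only: the article's analysis was run in JAGS, and software implementations may apply additional corrections. The chains below are simulated, not MCMC output from the study, and the function name is hypothetical.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor for m chains of n draws each."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)                    # B: between-chain variance
    pooled = (n - 1) / n * within + between / n            # pooled variance estimate
    return (pooled / within) ** 0.5


# Simulated stand-ins for 3 chains with 3 000 retained draws each
# (5 500 iterations minus a burn-in of 2 500).
rng = random.Random(2024)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(3000)] for _ in range(3)]
# Same chains, but the third one is shifted to mimic a chain stuck elsewhere.
stuck = [mixed[0], mixed[1], [x + 3.0 for x in mixed[2]]]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 arise when the chains explore the same distribution; a chain centred elsewhere inflates the between-chain variance and pushes the factor well above 1.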

5 CAM study

The activity (non-weight bearing, weight bearing, dynamic) of 10 patients with chronic organ failure was recorded every second during 1 hour by the new accelerometry sensor, the MOX, worn simultaneously on the leg and on the trunk. The patients were also videotaped when wearing the device and the activity was then assessed every second by a researcher. The video is considered as the reference method. The agreement level between the new device and the video is determined. Agreement levels obtained for the trunk and the leg are then compared.

The probability to be in a non-weight bearing posture (NWBP) and the agreement level between the MOX and the video assessments are displayed in Figure 2 over the one hour observation period. These quantities were determined over 2 seconds intervals to be graphically displayed. The probability to be in a NWBP seems to vary over time and differs markedly between MOX(trunk) on the one hand and MOX(leg) and video on the other hand. This will obviously lead to different agreement levels for the MOX(trunk) and the MOX(leg). The first 5 minutes after the device placement were discarded from the data as a burn-in period when applying the modelling approach, since all patients were standing to have the MOX device installed.

Figure 2

CAM study. Top: Evolution of the probability of being in a NWBP over the one hour observation period with the MOX worn on the trunk (light grey), on the leg (dark grey) and by the observer on the video (black). Bottom: Agreement (kappa) between the video and the MOX worn on the leg (black) and on the trunk (light grey)

No specific pattern in the shape of the evolution over time of the probability to be in a NWBP was expected or of direct interest since patients were studied in unconstrained real conditions at different moments of the day. The probability to be in NWBP is therefore modelled non-parametrically using low-rank thin plate splines (Wood (2003)). This leads to the following hierarchical probit regression model for the marginal probabilities $π_{ir, v}$ ,

$$Probit (π_{ir, v}) = α_{2} \, \mathrm{TRUNK}_{ir, v} + α_{3} \, \mathrm{LEG}_{ir, v} + \sum_{s = 1}^{10} c_{s} R_{is, v} + γ_{i},$$

where TRUNK equals 1 if the device is the MOX worn on the trunk and equals 0 otherwise, LEG equals 1 if the device is the MOX worn on the leg and equals 0 otherwise. The coefficients $α$ are parameters and $γ_{i}$ are the random intercepts relative to the subjects ( $γ_{i} \sim N (0, τ_{M}^{2})$ ). The coefficients $c_{s}$ are the coefficients for the thin-plate splines. Hierarchical centring was used, such that ( $c_{s} \sim N (α_{1}, τ_{S}^{2})$ ).

For the agreement, the interest is in the existence of any trend in the evolution of the agreement levels over time (i.e., stability of agreement) and in the comparison of the agreement level between the video and the MOX worn at two body places. This leads to the following model,

$$\frac{1}{2} \ln \left( \frac{1 + κ_{i, v}}{1 - κ_{i, v}} \right) = β_{2} \, \mathrm{TRUNK2}_{i, v} + β_{3} \, \mathrm{TIME}_{i, v} + β_{4} \, \mathrm{TIME}_{i, v}^{2} + β_{5} \, \mathrm{TIME}_{i, v}^{3} + δ_{i},$$

where TRUNK2 equals 1 if the agreement between the MOX(trunk) and the video is considered and 0 if the agreement between the MOX(leg) and the video is considered. The coefficients $β$ are parameters and $δ_{i}$ are the random intercepts relative to the subjects. We used hierarchical centring such that $δ_{i} \sim N (β_{1}, τ_{K}^{2})$ .

For modelling purposes, we analysed the data twice, once with time intervals of 30 seconds and once with time intervals of 60 seconds. The data and R code are available as supplementary Web material. The posterior distribution of the marginal probability distribution and the agreement levels is summarized in Table 1 for 30 and 60 seconds intervals by using 10 equally spaced knots. The results with 10, 15 and 20 equally spaced knots were very similar (not shown).

Table 1

Posterior distribution of the hierarchical probit regression model corresponding to the marginal probability distribution of the devices and the regression model for the agreement coefficient for 30 and 60 seconds intervals using 10 knots

	30 seconds				60 seconds
Parameter	Mean (SD)	2.5%	50%	97.5%	Mean (SD)	2.5%	50%	97.5%
Marginal Probability Distribution
Intercept $(α_{1})$ ($\times 10^{-2}$)	0.29 (0.22)	−0.15	0.29	0.73	0.27 (0.21)	−0.13	0.27	0.69
TRUNK $(α_{2})$	0.87 (0.051)	0.77	0.87	0.97	0.85 (0.067)	0.72	0.85	0.99
LEG $(α_{3})$	0.17 (0.052)	0.066	0.17	0.27	0.21 (0.069)	0.078	0.21	0.35
$ρ_{M}$	0.87 (0.0048)	0.86	0.87	0.88	0.82 (0.0073)	0.80	0.82	0.83
$τ_{M}^{2}$	0.51 (0.40)	0.16	0.41	1.4	0.47 (0.36)	0.15	0.37	1.5
$τ_{S}^{2}$ ($\times 10^{-4}$)	0.42 (0.39)	0.10	0.32	1.3	0.37 (0.34)	0.078	0.26	1.3
Agreement Coefficient
Intercept $(β_{1})$	1.29 (0.17)	0.96	1.29	1.63	1.18 (0.16)	0.87	1.18	1.50
TRUNK2 $(β_{2})$	−0.86 (0.056)	−0.96	−0.85	−0.75	−0.73 (0.065)	−0.86	−0.73	−0.60
TIME $(β_{3})$ ($\times 10^{-1}$)	−0.13 (0.037)	−0.21	−0.13	−0.063	−0.12 (0.045)	−0.21	−0.12	−0.035
TIME$^{2}$ $(β_{4})$ ($\times 10^{-3}$)	−0.15 (0.084)	−0.31	−0.15	0.0014	−0.17 (0.11)	−0.38	−0.17	0.034
TIME$^{3}$ $(β_{5})$ ($\times 10^{-5}$)	1.1 (0.59)	0.011	1.13	2.3	1.0 (0.74)	−0.47	1.0	2.4
$τ_{K}^{2}$	0.26 (0.19)	0.081	0.20	0.75	0.21 (0.14)	0.065	0.17	0.58
$ρ_{K}$	0.71 (0.011)	0.69	0.71	0.73	0.63 (0.014)	0.60	0.63	0.66

As seen in Table 1, the parameter estimates obtained with the 30 seconds model and the 60 seconds model are very similar. Given the random effects, the probability of recording NWBP differs between the video and the MOX worn at the two body places. The level of agreement between the MOX and the video records decreases with time and is higher for the MOX worn on the leg than on the trunk. The posterior distribution, obtained by averaging over the random effects, is depicted in Figure 3 for the marginal probability distribution and the agreement with pointwise 95% equal-tailed credibility intervals.

Figure 3

CAM study. Posterior distribution obtained by averaging over the random effects (pointwise 95% equal-tailed credibility interval) of the marginal probability distribution for the three devices (top) and of the agreement levels (bottom) when considering time intervals of 30 seconds. In the top panel, plain line is for video, dotted line for MOX(leg) and dashed line for MOX(Trunk). In the bottom panel, plain line is for MOX(leg) vs video and dashed line for MOX(Trunk) vs video

For 30 seconds intervals, the within-interval correlation is 0.87 [0.86,0.88] for the devices' marginal probabilities and 0.71 [0.69,0.73] for the agreement coefficients. The between-interval correlation, as assessed by the intraclass correlation, is depicted over time in Figure 4 for each device. While the intraclass correlations of the three devices' marginal probabilities are close to each other and constant over time, those of the two agreement coefficients are further apart.

Figure 4

CAM study. Posterior distribution obtained by averaging over the random effects (pointwise 95% equal-tailed credibility interval) of the between-interval intraclass correlation coefficient for devices marginal probabilities (top) and agreement coefficients (bottom) when considering time intervals of 30 seconds. In the top panel, black is for video, dark grey for MOX(leg) and light grey for MOX(Trunk). In the bottom panel, black is for MOX(leg) vs video and light grey for MOX(Trunk) vs video

Thus, we can conclude from our modelling exercise that the level of agreement with the video is better when the MOX is worn on the leg than on the trunk. This difference was, however, not observed in laboratory settings on healthy subjects (Annegarn et al.(2011)Annegarn, Spruit, Uszko-Lencer, Vanbelle, Savelberg, Schols, Wouters, and Meijer). Misclassification when the MOX was worn on the trunk mainly occurred because patients leaned forward while sitting to increase their ventilatory capacity. This forward sitting posture was classified as standing when the MOX was worn on the trunk, because gravitational accelerations in the anterior-posterior signal were in the range of the standing posture. A decrease of the agreement level between the MOX and the video was also observed over time. It is well known that kappa coefficients are influenced by the marginal probability distribution. The combination of a change over time in the probability of being in a NWBP with a small sample size could be partially responsible for the variation in the agreement coefficient over time. However, this decrease could also indicate that the observation period was not long enough for the agreement level to stabilize, or that the device shifted from its original position with time, lowering the agreement levels.

6 Simulations

Binary bivariate intensive longitudinal data were simulated using state-space models as follows. For every subject ( $i = 1, \dots, N$ ), a bivariate AR(1) process was simulated to mimic the measurements obtained from the two observers $r = 1, 2$ on $T$ time points, that is,

$$
\begin{aligned}
Y_{ir,t} &= x_{ir,t} + v_{ir,t}, \\
x_{ir,t} &= Φ_{r} x_{ir,t-1} + w_{ir,t}, \qquad i = 1, \dots, N; \; t = 1, \dots, T; \; r = 1, 2
\end{aligned}
\tag{6.1}
$$

with

$$
\begin{pmatrix} v_{i1,t} \\ v_{i2,t} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} σ_{v_{1}}^{2} & σ_{v_{1} v_{2}} \\ σ_{v_{1} v_{2}} & σ_{v_{2}}^{2} \end{pmatrix} \right)
\quad \text{and} \quad
\begin{pmatrix} w_{i1,t} \\ w_{i2,t} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} σ_{w_{1}}^{2} & σ_{w_{1} w_{2}} \\ σ_{w_{1} w_{2}} & σ_{w_{2}}^{2} \end{pmatrix} \right)
$$

The corresponding correlation between $Y_{i 1, t}$ and $Y_{i 2, t}$ is, when $|Φ_{r}| < 1$,

$$
\operatorname{cor}(Y_{i1,t}, Y_{i2,t}) = \frac{\dfrac{σ_{w_{1} w_{2}}}{1 - Φ_{1} Φ_{2}} + σ_{v_{1} v_{2}}}{\sqrt{\left( \dfrac{σ_{w_{1}}^{2}}{1 - Φ_{1}^{2}} \right) \left( \dfrac{σ_{w_{2}}^{2}}{1 - Φ_{2}^{2}} \right)}}
$$

When the random variables $Y_{ir, t}$ are dichotomized with respect to their mean to obtain bivariate binary data, $cor (Y_{i 1, t}, Y_{i 2, t})$ is the tetrachoric correlation coefficient. This tetrachoric correlation coefficient can be, in that special case, directly related to the kappa coefficient through the formula (Lord et al.(1968)Lord, Novick, and Birnbaum)

$$
κ_{it} = \frac{2}{π} \arcsin\left( \operatorname{cor}(Y_{i1,t}, Y_{i2,t}) \right). \tag{6.2}
$$
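As an informal numerical check of Eqn. 6.2, the following Python sketch draws correlated standard normal pairs, dichotomizes them at their mean and compares the empirical Cohen's kappa with $(2/π) \arcsin(ρ)$. The function names, the correlation value 0.8, the sample size and the seed are illustrative assumptions and are not taken from the simulation study.

```python
import math
import random


def kappa_from_tetrachoric(rho):
    """Kappa implied by Eqn. 6.2 when both normals are cut at their mean."""
    return 2.0 / math.pi * math.asin(rho)


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of 0/1 ratings."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p1a = sum(a) / n
    p1b = sum(b) / n
    p_exp = p1a * p1b + (1 - p1a) * (1 - p1b)
    return (p_obs - p_exp) / (1 - p_exp)


def simulate_binary_pairs(rho, n, seed=1):
    """Draw n bivariate standard normal pairs with correlation rho, cut at 0."""
    rng = random.Random(seed)
    a, b = [], []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        a.append(int(z1 > 0))
        b.append(int(z2 > 0))
    return a, b


rho = 0.8  # illustrative tetrachoric correlation (assumption)
a, b = simulate_binary_pairs(rho, 200_000)
kappa_theory = kappa_from_tetrachoric(rho)
kappa_empirical = cohen_kappa(a, b)
print(f"theory: {kappa_theory:.4f}, empirical: {kappa_empirical:.4f}")
```

With marginal probabilities close to 0.5, the two values should agree up to Monte Carlo error; the relationship is not expected to hold when the cut-off differs from the mean.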

This makes it possible to express $σ_{v_{1} v_{2}}$ as a function of $κ_{it}$,

$$
σ_{v_{1} v_{2}} = \sqrt{\left( \frac{σ_{w_{1}}^{2}}{1 - Φ_{1}^{2}} \right) \left( \frac{σ_{w_{2}}^{2}}{1 - Φ_{2}^{2}} \right)} \, \sin\left( \frac{π}{2} κ_{it} \right) - \frac{σ_{w_{1} w_{2}}}{1 - Φ_{1} Φ_{2}}
$$

and therefore specify values for $κ_{it}$ to characterize the bivariate AR(1) process. Although the simulation method allows some control over the behaviour of $κ_{it}$ , it is not possible to determine analytically the relationship between the marginal $κ_{it}$ in Eqn. 6.2 and the one obtained in our hierarchical model defined in Eqn. 3.2. It is nevertheless possible to approximate this relationship using results for logistic regression. For a marginal model $P (Y_{it} = 1) = expit (X_{it} α^{M})$ and a conditional model $P (Y_{it} = 1 | γ_{i}) = expit (X_{it} α^{C} + γ_{i})$ with $γ_{i} \sim N (0, σ_{γ}^{2})$ , we have

$$
\hat{α}^{M} \approx \frac{\hat{α}^{C}}{\sqrt{c^{2} σ_{γ}^{2} + 1}}
$$

with $c^{- 1} = 15 π / (16 \sqrt{3})$ (Hedeker et al.(2018)Hedeker, du Toit, Demirtas, and Gibbons). Given that $atanh (x) = 0.5 logit (0.5 + x / 2)$ , if we consider the marginal model $atanh (κ_{it}) = X_{it} β^{M}$ and the conditional model $atanh (κ_{it} | δ_{i}) = X_{it} β^{C} + δ_{i}$ with $δ_{i} \sim N (0, σ_{δ}^{2})$ , we have

$$
\hat{β}^{M} \approx \frac{\hat{β}^{C}}{\sqrt{4 c^{2} σ_{δ}^{2} + 1}} .
$$
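The two approximations and the identity linking $atanh$ to the logit can be checked numerically. The sketch below is only an illustration: the function names and the input values (a conditional coefficient of 0.7 and a random effect variance of 0.0004) are hypothetical and were chosen for the example.

```python
import math

# Constant c from the text: 1/c = 15*pi / (16*sqrt(3))
C = 16 * math.sqrt(3) / (15 * math.pi)


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def marginal_from_conditional_logit(alpha_c, var_gamma):
    """Approximate marginal coefficient for a random-intercept logistic model."""
    return alpha_c / math.sqrt(C ** 2 * var_gamma + 1)


def marginal_from_conditional_atanh(beta_c, var_delta):
    """Same approximation on the atanh scale; the factor 4 arises because
    2 * atanh(x) = logit(0.5 + x / 2), so the random effect is doubled."""
    return beta_c / math.sqrt(4 * C ** 2 * var_delta + 1)


# Check the identity atanh(x) = 0.5 * logit(0.5 + x / 2) at one point
x = 0.6
identity_gap = abs(math.atanh(x) - 0.5 * logit(0.5 + x / 2))

# Hypothetical conditional coefficient and random effect variance
beta_m = marginal_from_conditional_atanh(beta_c=0.7, var_delta=0.0004)
print(f"identity gap: {identity_gap:.2e}, marginal coefficient: {beta_m:.5f}")
```

For a random effect variance this small, the marginal and conditional coefficients are nearly identical, so the attenuation only matters when the between-subject variability is substantial.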

Two scenarios were envisaged:

Scenario 1: $T = 900$ , $N = 10, 30$ , $σ_{v_{1}}^{2} = σ_{v_{2}}^{2} = 1$ , $Φ_{1} = Φ_{2} = 0.90$ , $σ_{w_{1}}^{2} = σ_{w_{2}}^{2} = 0.01$ , $σ_{w_{1} w_{2}} = 0.8$ , $atanh (κ_{it}) = 0.7$

Scenario 2: $T = 900$ , $N = 10, 30$ , $σ_{v_{1}}^{2} = σ_{v_{2}}^{2} = 1$ , $Φ_{1} = Φ_{2} = 0.90$ , $σ_{w_{1}}^{2} = σ_{w_{2}}^{2} = 0.01$ , $σ_{w_{1} w_{2}} = 0.8$ and $atanh (κ_{it}) = 0.7 + 0.0002 t$

Starting values for the state equation were randomly chosen as

$$
\begin{pmatrix} x_{i1,0} \\ x_{i2,0} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \frac{1}{1 - ϕ^{2}} \begin{pmatrix} σ_{w_{1}}^{2} & σ_{w_{1} w_{2}} \\ σ_{w_{1} w_{2}} & σ_{w_{2}}^{2} \end{pmatrix} \right)
$$

where $ϕ$ is taken to be the common value of $Φ_{1}$ and $Φ_{2}$, which are equal in both scenarios.
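The data-generating mechanism of Eqn. 6.1, with starting values drawn from the stationary distribution of the state equation, can be sketched in Python as follows. This is a minimal illustration and not the code used for the study: the function names are hypothetical, a common autoregressive coefficient and common variances are assumed for both observers, and the numerical values were chosen only so that every covariance matrix is positive definite.

```python
import math
import random


def draw_bivariate_normal(rng, var1, var2, cov):
    """Draw (e1, e2) ~ N(0, [[var1, cov], [cov, var2]]) via a 2x2 Cholesky factor."""
    l11 = math.sqrt(var1)
    l21 = cov / l11
    l22 = math.sqrt(var2 - l21 ** 2)  # requires a positive-definite matrix
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return l11 * z1, l21 * z1 + l22 * z2


def simulate_subject(T, phi, var_w, cov_w, var_v, cov_v, seed=0):
    """Simulate one subject's bivariate series (Y_1t, Y_2t), t = 1, ..., T."""
    rng = random.Random(seed)
    # Starting values from the stationary distribution of the state equation
    scale = 1.0 / (1.0 - phi ** 2)
    x1, x2 = draw_bivariate_normal(rng, scale * var_w, scale * var_w, scale * cov_w)
    y1, y2 = [], []
    for _ in range(T):
        # State equation: x_t = phi * x_{t-1} + w_t
        w1, w2 = draw_bivariate_normal(rng, var_w, var_w, cov_w)
        x1, x2 = phi * x1 + w1, phi * x2 + w2
        # Observation equation: Y_t = x_t + v_t
        v1, v2 = draw_bivariate_normal(rng, var_v, var_v, cov_v)
        y1.append(x1 + v1)
        y2.append(x2 + v2)
    return y1, y2


# Illustrative values only (assumptions, not the scenario settings)
y1, y2 = simulate_subject(T=900, phi=0.9, var_w=0.01, cov_w=0.008,
                          var_v=1.0, cov_v=0.5)
print(len(y1), len(y2))
```

Repeating the call with different seeds gives the series for the $N$ subjects of one simulated sample.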

Before being analysed, the simulated bivariate AR(1) normal process was dichotomized at the value (0,0) to obtain a bivariate binary process. Data were summarized over intervals of 60 time points, leading to 15 time intervals. A total of 3 000 iterations with 2 000 burn-in iterations were sufficient to attain convergence. The type I error rate, equal to the proportion of times the 95% equal-tailed credible interval does not cover the theoretical parameter values, was determined for each simulation condition and is reported in Table 2. For a simulation scheme with 1 000 simulated samples, the type I error rate is expected to be between 0.036 and 0.064.
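One way to recover the range 0.036 to 0.064 is to apply a normal approximation to the binomial Monte Carlo error around the nominal 5% level. The text does not state how the range was computed, so this derivation is an assumption; the function name below is hypothetical.

```python
import math


def monte_carlo_range(rate, n_samples, z=1.96):
    """Approximate 95% range for an observed proportion when the true
    rate is `rate` and `n_samples` independent samples are simulated."""
    se = math.sqrt(rate * (1 - rate) / n_samples)
    return rate - z * se, rate + z * se


low, high = monte_carlo_range(rate=0.05, n_samples=1000)
print(f"{low:.3f} to {high:.3f}")
```

Observed type I error rates outside this range point to credible intervals that are too wide (conservative) or too narrow (liberal) rather than to simulation noise alone.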

Table 2

Summary of the posterior distribution of the hierarchical agreement model over 1 000 simulated samples

N	Parameter	Scenario 1: Median	Scenario 1: Mean (SD)	Scenario 1: Type I error	Scenario 2: Median	Scenario 2: Mean (SD)	Scenario 2: Type I error
10	Intercept	0.70	0.70 (0.031)	0.030	0.70	0.70 (0.032)	0.026
10	TIME	0.00	0.00 (0.00)	0.042	0.0002	0.0002 (0.00006)	0.036
10	$τ_{K}^{2}$	0.0065	0.0028 (0.0015)		0.00064	0.0013 (0.0019)	
30	Intercept	0.70	0.70 (0.017)	0.061	0.70	0.70 (0.017)	0.058
30	TIME	0.00	0.00 (0.00)	0.045	0.0002	0.0002 (0.00004)	0.043
30	$τ_{K}^{2}$	0.00023	0.00037 (0.00043)		0.00024	0.00040 (0.00047)	

As seen in Table 2, the type I error is somewhat conservative for the intercept when the sample size is equal to 10. All simulation schemes led to a small random effect variance.

7 Discussion

With the advent of measurement devices worn continuously to measure physical activities and medical conditions in daily life, we predict that intensive longitudinal data will become abundantly available in the future, especially in medical and psychological research. We argue that such devices will become increasingly important. For instance, some insurance companies track their clients' vitality using such devices and offer rewards to those who have collected enough vitality points. In addition, such devices are currently incorporated in clinical trials. It is therefore important that these measuring devices reflect the true activity status of the wearer over time.

In this article, we developed a methodology to study agreement in the presence of binary intensive longitudinal data. This method can be implemented in standard Bayesian software (e.g., Jags). We considered small time intervals and supposed that, within each interval, the agreement followed a beta-binomial distribution, accounting for the serial correlation within time intervals. We then proposed a partial-Bayesian methodology to directly evaluate the impact of categorical and continuous predictors on Cohen's or the intraclass kappa coefficient. This method highlighted the effect of the body location at which the MOX device was worn on the level of agreement with the video records. The level of agreement with the video was better when the MOX was worn on the leg than on the trunk.

Several aspects of the proposed methodology need to be discussed. First, we believe that temporal aggregation makes sense. This is supported by the large within-interval correlations of 0.87 for the marginal probability distributions and 0.71 for the agreement coefficients. However, the length of the time intervals was fixed arbitrarily. The choice was suggested by the nature of the data (patients with chronic organ failure are not likely to change their position every second during rehabilitation). The length of the time intervals should be chosen carefully, so that all important patterns in the data are captured and the impact of assuming a constant probability of being in a given position within each interval (temporal aggregation) is limited. If the time intervals are too long, some patterns in the data can be hidden or some assumptions (e.g., constant probability within an interval) might be violated. On the other hand, very short time intervals can lead to an excessive computational burden. Moreover, due to the small sample size, a constant overdispersion parameter had to be assumed for all time intervals and subjects. This assumption seems reasonable in this particular case since patients were observed under unconstrained conditions, so there was no reason to expect variation. For larger sample sizes, overdispersion parameters could further be modelled over time according to subject characteristics.

Second, choices were made to model the distribution of the data. First, the probability to classify items in category 1 and the probability to disagree were modelled using a beta-binomial distribution to account for possible correlation between the outcomes obtained at the different time points. The beta-binomial distribution was preferred over existing alternatives (see, e.g., Diniz et al.(2010)Diniz, Tutia, and Leite) because the distribution is available in Jags. Using alternative methods requires writing the likelihood function explicitly. Second, the probit link function was used to model the probability to classify subjects in category 1. The logit link function is another option. The probit link was preferred because of the relationship between the kappa coefficient and the tetrachoric correlation coefficient that holds under the particular circumstances used in the simulation section (see Lord et al.(1968)Lord, Novick, and Birnbaum). Third, Fisher's link function was introduced as a convenient way of modelling the kappa coefficient. This link function was chosen because it was developed to model the correlation coefficient (Fisher (1915)), and kappa coefficients behave similarly. The complementary log-log function is an alternative. Fourth, splines were used to describe the evolution over time of the probability to be in non-weight bearing posture. This choice is motivated by the fact that we were not interested in the shape of the time evolution. When the evolution of the probability to classify items in category 1 is of interest, a parametric model should be preferred. Alternative spline approaches led to similar results in the present example.

Third, the method is based on a pseudo-likelihood rather than the full likelihood. Vanbelle and Lesaffre (2015) have shown that, in the context of multilevel modelling, using only part of the likelihood leads to somewhat higher standard errors for agreement coefficients when compared with a full-Bayesian approach. This occurs principally for low agreement values, which are generally of little practical interest. Only limited simulation results were shown in the current context because of the difficulty of simulating bivariate binary intensive longitudinal data with a given trend in the agreement. Some conservatism was observed for the intercept when the sample size was equal to 10. Given the approximation between marginal and conditional parameters, and given that the relationship between the tetrachoric coefficient and the kappa coefficient is only valid when the marginal probability distribution of the observers is exactly equal to 0.5, these simulation results are promising. More extensive simulations are, however, needed.

In conclusion, we proposed a method to directly evaluate the effect of covariates on the level of agreement in the presence of binary continuous records. Future research to extend the method to continuous and ordinal scales is needed. An extension to spatio-temporal continuous records could be envisaged.

Footnotes

Supplementary material

Supplementary materials for this article, including data and R code, are available from http://www.statmod.org/smij/archive.html

Acknowledgements

The authors thank the reviewers for their valuable comments. The authors are also grateful to Dr K. Meijer (Maastricht University) for providing the data and to Ayfer Ezgi Yilmaz, Department of Statistics, Hacettepe University, Ankara, Turkey.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research is part of project 451-13-002 funded by the Netherlands Organisation for Scientific Research.

Appendix

References

Annegarn, Spruit, Uszko-Lencer, Vanbelle, Savelberg, Schols, Wouters and Meijer (2011) Objective physical activity assessment in patients with chronic organ failure: A validation study of a new single-unit activity monitor. Archives of Physical Medicine and Rehabilitation, 92, 1852–1857.e1. doi: 10.1016/j.apmr.2011.06.021. URL http://www.sciencedirect.com/science/article/pii/S0003999311004151

Bolger and Laurenceau (2013) Intensive Longitudinal Methods: An Introduction to Diary and Experience Sampling Research. New York, NY: Guilford Press.

Cohen (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

Diniz CAR, Tutia and Leite (2010) Bayesian analysis of a correlated binomial model. Brazilian Journal of Probability and Statistics, 24, 68–77. doi: 10.1214/08-BJPS014. URL https://doi.org/10.1214/08-BJPS014

Efron (1992) Six questions raised by the bootstrap. In Bootstrap Proceedings Volume, edited by R. LaPage and L. Billard. New York, NY: Wiley.

Fisher (1915) Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10, 507–521. doi: 10.2307/2331838.

Gajewski, Hart, Bergquist-Beringer and Dunton (2007) Inter-rater reliability of pressure ulcer staging: Ordinal probit Bayesian hierarchical model that allows for uncertain rater response. Statistics in Medicine, 26, 4602–4618.

Gelman (2006) Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1, 515–534.

Goldstein, Browne and Rasbash (2002) Partitioning variation in multi-level models. Understanding Statistics, 1, 223–231.

Gonin, Lipsitz, Fitzmaurice and Molenberghs (2000) Regression modelling of weighted κ by using generalized estimating equations. Journal of the Royal Statistical Society, Series C, 49, 1–18.

Hedeker, du Toit SHC, Demirtas and Gibbons (2018) A note on marginalization of regression parameters from mixed models of binary outcomes. Biometrics, 74, 354–361. doi: 10.1111/biom.12707. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/biom.12707

Hsiao, Chen and Kao (2011) Bayesian random effects for interrater and test-retest reliability with nested clinical observations. Journal of Clinical Epidemiology, 64, 808–814.

Klar, Lipsitz and Ibrahim (2000) An estimating equations approach for modelling kappa. Biometrical Journal, 42, 45–58.

Kraemer (1979) Ramifications of a population model for κ as a coefficient of reliability. Psychometrika, 44, 461–472.

Lesaffre and Lawson (2012) Bayesian Biostatistics (Statistics in Practice). New York, NY: John Wiley.

Little, Wang and Gorrall (2017) VIII. The past, present, and future of developmental methodology. Monographs of the Society for Research in Child Development, 82, 122–139. doi: 10.1111/mono.12302. URL http://dx.doi.org/10.1111/mono.12302

Liu, Zhou, Palumbo and Wang (2016) Dynamical correlation: A new method for quantifying synchrony with multivariate intensive longitudinal data. Psychological Methods, 21, 291–308.

Lord, Novick and Birnbaum (1968) Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Munafo, Nosek, Bishop, Button, Chambers, Percie du Sert, Simonsohn, Wegenmakers E-J, Ware and Ioannidis (2017) A manifesto for reproducible science. Nature Human Behaviour, 1, 0021.

Plummer (2003) Jags: A program for analysis of Bayesian graphical models using Gibbs sampling. URL https://www.r-project.org/conferences/DSC-2003/Proceedings/Plummer.pdf

Rapp, Carroll, Stangeland, Swanson and Higgins (2011) A comparison of reliability measures for continuous and discontinuous recording methods: Inflated agreement scores with partial interval recording and momentary time sampling for duration events. Behavior Modification, 35, 389–402.

Silvestrini and Veredas (2008) Temporal aggregation of univariate and multivariate time series models: A survey. Journal of Economic Surveys, 22, 458–497. doi: 10.1111/j.1467-6419.2007.00538.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-6419.2007.00538.x

Tsai M-Y (2012) Assessing inter- and intra-agreement for dependent binary data: A Bayesian hierarchical correlation approach. Journal of Applied Statistics, 39, 173–187.

Vanbelle (2016) A new interpretation of the weighted kappa coefficients. Psychometrika, 81, 399–410. doi: 10.1007/s11336-014-9439-4. URL http://dx.doi.org/10.1007/s11336-014-9439-4

Vanbelle and Lesaffre (2015) Modeling agreement on categorical scales in the presence of random scorers. Biostatistics. doi: 10.1093/biostatistics/kxv036. URL http://biostatistics.oxfordjournals.org/content/early/2015/09/21/biostatistics.kxv036.abstract

Vanbelle, Mutsvari, Declerck and Lesaffre (2012) Hierarchical modeling of agreement. Statistics in Medicine, 31, 3667–3680. doi: 10.1002/sim.5424. URL http://dx.doi.org/10.1002/sim.5424

Walls and Schafer (2006) Models for Intensive Longitudinal Data. New York, NY: Oxford University Press.

Williamson, Lipsitz and Manatunga (2000) Modeling kappa for measuring dependent categorical agreement data. Biostatistics, 1, 191–202.

Wood (2003) Thin plate regression splines. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 65, 95–114.
