Abstract

After a brief review of methods for assessing the reliability of an individual test, the article presents a method of obtaining the reliability of a battery of tests, where the battery score is defined as the weighted or unweighted sum of the scores of the component tests. Battery reliability can be influenced significantly by the method of selecting the weights used to arrive at the battery scores and by the methods of estimating the reliability of the component tests. Test reliability as per the theoretical definition, expressed in terms of the lengths of the score vectors of two parallel tests and the angle between these vectors in N-dimensional space, also yields the error score variance of the test and fits well in the estimation of battery reliability. The relationship between the theoretically defined reliability r_tt and the split-half correlation r_gh is established. For a weighted battery score, a Lagrange multiplier-based solution for the determination of weights is recommended, together with the use of reliability as per the theoretical definition. Weights found in this way have the advantage that the battery score (Y) has minimum variance. Also, the covariance between the battery score and the score of each individual test is a constant. The condition under which the battery score is equi-correlated with the standardized score of each constituent test is derived.

Keywords

Reliability, battery of tests, weighted sum, true score variance, minimum variance

Introduction

In practical situations, a battery of tests is administered instead of a single test. Single tests may not be adequate to detect appropriately the required latent abilities. For example, aptitudes are made of a complex of abilities. Intelligence tests could be a combination of verbal and numerical aptitudes among others. Most diagnostic questions require the assessment of personality, intelligence, and perhaps even the presence of organic involvement. Thus, a battery of tests is needed to measure complex abilities and to identify individual’s strengths and weaknesses. In a battery, a set of component tests are grouped together and administered in situations like selection of personnel for employment, Health-Related Physical Fitness Test (HRPFT) for jobs, sports, and even for assessment of fitness of Astronauts, mental-health issues, and so on. Development of cognitive battery for clinical trials of cognition-enhancing treatments for schizophrenia was highlighted by Mesholam-Gately et al. (2008). Kassin (2003) observed that application of factor analysis to the study of trait organization has provided the theoretical basis for the construction of test batteries. Examples of such batteries include among others SRA Primary Mental Abilities (PMA), Differential Aptitude Tests (DAT), Guilford-Zimmerman Aptitude Test, and so on.

While estimation of reliability by various methods and error variance derived from reliability for individual test are widely discussed (Anastasi and Urbina, 1997; Franzen, 2000; Murphy and Davidshofer, 2005), evaluation of reliability of test battery as a whole along with properties does not appear to have been accomplished to a great extent. This is despite the fact that combining tests into a battery in no way reduces the potential for measurement error associated with both the individual test and the battery.

Thus, need is felt to estimate reliability of a battery as a function of reliability of each component test along with estimation of true score variance of the battery, considering battery score as a weighted sum of component tests and avoiding scaling of scores of component tests.

The rest of the article is organized as follows. Previous studies on measures of reliability of tests and batteries are reviewed in the following section. The conceptual framework for battery scores, along with issues relating to scaling of scores of component tests and aggregation methods, is elaborated in Section “Battery scores.” Section “Battery reliability” deals with theoretical details regarding the choice of method for finding reliability of the component tests as per the theoretical definition, and with the associated properties. This is followed by the method of computing the reliability of a battery of tests using summative scores and weighted scores, with discussion of the recommended method of finding weights, which minimizes the variance of the weighted sum. The article is rounded off in Section “Summary and discussion” by recalling the salient outcomes of the work.

Literature survey

Reliability of an individual test has been pursued in a variety of ways, aiming to indicate the consistency, precision, repeatability, trustworthiness, and so on of a test. Reliability as the degree of consistency of test scores, that is, repeatability, has been recommended by Berkowitz et al. (2000). However, different values of test–retest reliability of a single test may be obtained depending on the time gap between the administrations. Cronbach’s alpha is widely used to find reliability as a measure of internal consistency. However, it is necessary to check that the data fit the unidimensional model before calculating alpha (Trizano-Hermosilla and Alvarado, 2016). Violation of the assumption of tau-equivalence leads to underestimation of α (Graham, 2006; Raykov, 1997). Working with data which comply with this assumption is generally not viable in practice (Teo and Fan, 2013). Wilcox (1992) showed that coefficient alpha is vulnerable to modest numbers of outlying observations, which may substantially inflate it. Limitations of Cronbach’s alpha have also been reported by Hattie (1985) and Ritter (2010). Parallel test or split-half reliability considers the correlation of two parallel sub-tests obtained by dichotomization of the test in one session. A major limitation of split-half reliability is that the resulting estimate is not unique and depends on how exactly the split was carried out. Indeed, different splits into halves of an initial multiple-component instrument can well yield different reliability estimates. Zimmerman (2007) favored expressing reliability in terms of the standard error of measurement (SEM). Thus, no popular method of finding reliability uses the theoretical definition of reliability, which is the ratio of the true score variance to the observed score variance. Consequently, multiple values of the error variance and reliability can be obtained for the same test even if the sample remains unchanged.
Reliability that conforms to the theoretical definition cannot be computed since true scores of individuals taking the test are not known. Moreover, estimates of reliability are themselves vulnerable to measurement error (Vacha-Haase et al., 2000).

The above considerations also apply to estimating the reliability of a battery of tests, where the battery score is taken as the sum or weighted sum of component test scores. Reliability of the battery can be influenced significantly by the method of selecting weights to arrive at the battery scores and by the methods of estimating reliability of the component tests, since sources of error of individual tests may multiply for the battery. For estimation of the reliability of a battery, Lubans et al. (2011) used the test–retest approach. Berchtold (2016) opined that while reliability is the ability of a measure applied twice to the same respondents to produce the same ranking on both occasions, agreement requires not only preserving the relative order of the respondents in the two sets of measurements but also obtaining exactly the same result for each respondent on the two testing occasions. Battery reliability using a weighted sum of component tests was attempted by researchers like Wang and Stanley (1970), Feldt and Brennan (1989), Rudner (2001), Webb et al. (2007), and others using different approaches and scaling. However, the approaches gave rise to different values of battery reliability, especially when individual components measure multidimensional constructs. Rudner (2001) indicated that adding raw scores fails to recognize the relative importance of components to the overall composite; weights proportional to the reliability of the components may result in a lower SEM of the composite score in comparison to simple summative scores. When a pre-specified, quantifiable external criterion is available, weights of component tests may be found by linear multiple regression on the criterion. Such a method of finding weights may result in maximum validity of the composite score. Canonical correlation allows an extension of the parallel test situation and the split-half technique. Safrit and Wood (2013) and Koch et al. (2003) used canonical correlation analysis (CCA), in which randomized subsets were created first and then correlated using the canonical correlation technique. The essence of CCA is to form pairs of linear combinations of predictor and criterion variables to maximize the correlation between each pair. In the context of reliability of a battery, there may not exist a unique efficient procedure to obtain predictor and criterion variables. In addition, the CCA technique finds linear relationships between variables within a set and between pairs of canonical variates (linear pairings of canonical variates, one from each of the two sets of variables). Non-linear components of these relationships, if any, are not recognized and are not captured in the analysis. Outliers have mixed impact on the results of the CCA, and each set of variables is inspected independently for univariate and multivariate outliers. Because canonical correlation is very sensitive to small changes in the data set, the decision to eliminate cases must be made very carefully. Both multicollinearity (when variables are very highly correlated) and singularity (when one variable is a linear combination of two or more variables in that set) should be eliminated before analysis proceeds. Precautions for using CCA were mentioned by Lambert and Durand (1975). Equivalences among canonical factor analysis, canonical reliability, and principal component analysis were studied by Conger and Stallard (1976); they found that if variables are scaled so that they have equal measurement errors, then canonical factor analysis on all non-error variance, principal component analysis, and canonical reliability analysis give rise to equivalent results. However, the assumption of equal measurement errors is abstract and difficult to achieve in practice.

Battery scores

Battery scores can be viewed as a composite score of component tests, that is, aggregation of scores of component tests to a single value for an examinee. Stages for construction of composite score/index involve scaling of test scores and deciding appropriate method of combining.

Major points to be noted are as follows: (1) Component tests could be independent or dependent with various degrees; (2) Battery score could be weighted or unweighted sum of scores of component tests; (3) Reliability of the battery can be influenced significantly by reliability of each component test and also by the method of arriving at the battery scores; (4) In addition to battery reliability, it will be desirable to find estimates of true score variance and/or error score variance of the battery; (5) Reliability based on internal consistency may be relevant for an individual test but not for battery due to multidimensionality of a test battery.

Scaling

The usual practice is to standardize the scores of component tests by subtracting the sample mean and dividing by the sample standard deviation (SD). Standardized scores have mean 0 and SD 1, but standardization is a linear transformation and does not change the shape of the distribution, so standardized scores are normally distributed only if the original test scores are. There are many other ways to normalize test scores. An illustrative list is as follows:

$Z = (X - X_{\min}) / (X_{\max} - X_{\min})$. Here, standardization is based on the range rather than the SD. But the extreme values (minimum and maximum) may be unreliable outliers.

$Z = (X_{i} / \bar{X}) \times 100$. Here, the units receive a score depending on their distance from the mean. Statistically, it is less robust to the influence of outliers than some of the other methods.

$Z = (X_{i} / X_{\max}) \times 100$. Here, the score is expressed as a percentage of the maximum score.

$Z = \log(X)$, $X > 0$, to reduce skewness of (positive) data.

Note that there is no best method of scaling or normalization or transformations of component score. However, scaling will have significant effects on the composite score, its distribution, and battery reliability. Hence, it is felt desirable to find composite battery score avoiding scaling of component scores.
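The scaling options listed above can be compared side by side. The following is a minimal Python sketch, not part of the original article: the function names and the sample scores are illustrative assumptions, and the SD is computed with divisor N.

```python
import math

def z_scores(x):
    """Subtract the mean and divide by the SD (divisor N is an assumption)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]

def min_max(x):
    """Range-based scaling: (X - X_min) / (X_max - X_min)."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def percent_of_mean(x):
    """Score expressed as a percentage of the sample mean."""
    mean = sum(x) / len(x)
    return [v / mean * 100 for v in x]

def percent_of_max(x):
    """Score expressed as a percentage of the maximum score."""
    hi = max(x)
    return [v / hi * 100 for v in x]

def log_scores(x):
    """Natural log transform; requires strictly positive scores."""
    return [math.log(v) for v in x]

scores = [12, 15, 20, 33]  # hypothetical raw scores of four examinees
for name, transform in [
    ("z", z_scores),
    ("min-max", min_max),
    ("% of mean", percent_of_mean),
    ("% of max", percent_of_max),
    ("log", log_scores),
]:
    print(name, [round(v, 3) for v in transform(scores)])
```

Running the sketch on the same raw scores shows that each transformation spaces the examinees differently, which is why the choice of scaling carries over into the composite score.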

Combining method

Usual methods of combining scores of component tests include (1) the summative score, that is, the battery score of an individual is the sum of his or her scores in the component tests. This treats all tests as equally important and allows compensability among component tests, which may not be desirable; or (2) the weighted sum, which assumes a linear model with the implication of full compensability, that is, poor performance in some component tests can be compensated by high values in other tests. However, there are different methods of deciding weights even under the condition $\sum W_{i} = 1$ and $0 \leq W_{i} \leq 1$. Chakrabartty (2017) elaborated the problem of subjective weights, even under $\sum W_{i} = 1$, with examples. However, selection of data-driven weights needs to consider, among others, the desired outcomes of the weighted sum. Non-linear combination of test scores to find battery scores cannot be ruled out.

Battery reliability

Since battery reliability will be function of reliabilities of the component tests, it may be prudent to select a method of finding reliability of a component test as per the definition, that is, as the ratio of true score variance $(S_{T}^{2})$ and observed score variance $(S_{X}^{2})$ or as $1 - (S_{E}^{2} / S_{X}^{2})$ where $S_{E}^{2}$ denotes the error score variance.

Reliability of tests

Chakrabartty (2013) proposed a method of obtaining test reliability as per the theoretical definition, along with computation of error variance and true score variance, from a single administration of the test. The method uses the lengths of the two vectors representing the scores of two parallel tests and the angle between them. A test consisting of n items administered to N persons, when dichotomized into parallel halves, say the g-th test and the h-th test, results in two points $X_{g}$ and $X_{h}$ in N-dimensional space. As per the classical definition, two tests “g” and “h” are parallel if $T_{g i} = T_{h i}$ and $S_{e g} = S_{e h}$. It can be proved that if g and h are parallel, ${\bar{X}}_{g} = {\bar{X}}_{h}$ and $S_{X g}^{2} = S_{X h}^{2}$.

Also, $X_{g} = T_{g} + E_{g}$ and $X_{h} = T_{h} + E_{h}$ . Now $T_{g i} = T_{h i}$ implies $X_{g i} - X_{h i} = E_{g i} - E_{h i},$ so that

\begin{array}{l} {|| X_{g} ||}^{2} + {|| X_{h} ||}^{2} - 2 || X_{g} || || X_{h} || \cos θ_{g h} \\ = {|| E_{g} ||}^{2} + {|| E_{h} ||}^{2} - 2 || E_{g} || || E_{h} || \cos θ_{g h}^{(E)} \end{array}

where $θ_{g h}$ is the angle between the vectors $X_{g}$ and $X_{h}$ while $θ_{g h}^{(E)}$ is the angle between $E_{g}$ and $E_{h} .$ But correlation between error scores of two parallel tests is zero. Thus,

{|| X_{g} ||}^{2} + {|| X_{h} ||}^{2} - 2 || X_{g} || || X_{h} || \cos θ_{g h} = {|| E_{g} ||}^{2} + {|| E_{h} ||}^{2} = N S_{E}^{2}

since $S_{E}^{2} = \frac{1}{N} \sum^{} {(E_{g i} + E_{h i})}^{2} = \frac{1}{N} [{|| E_{g} ||}^{2} + {|| E_{h} ||}^{2}]$

The above equation suggests

S_{E}^{2} = \frac{1}{N} [{|| X_{g} ||}^{2} + {|| X_{h} ||}^{2} - 2 || X_{g} || || X_{h} || \cos θ_{g h}]

(1)

where $θ_{g h}$ is the angle between the vectors $X_{g}$ and $X_{h}$ . True score variance $S_{T}^{2}$ can be computed as $S_{X}^{2} - S_{E}^{2}$ .

Hence, theoretical test reliability $r_{t t} = S_{T}^{2} / S_{X}^{2} = (1 - S_{E}^{2} / S_{X}^{2})$ is given by

r_{t t} = 1 - \frac{{|| X_{g} ||}^{2} + {|| X_{h} ||}^{2} - 2 || X_{g} || || X_{h} || \cos θ_{g h}}{N S_{X}^{2}}

(2)

Since it can be proved that parallel tests are of equal length, formulas (1) and (2) can be further simplified as

S_{E}^{2} = \frac{2 {|| X_{g} ||}^{2}}{N} (1 - \cos θ_{g h})

(3)

and

r_{t t} = 1 - \frac{2 {|| X_{g} ||}^{2}}{N S_{X}^{2}} (1 - \cos θ_{g h})

(4)

Equation (1) gives the error variance of the test and hence the true score variance as $(S_{X}^{2} - S_{E}^{2})$; these lead directly to equation (2), which gives the reliability of the test as per the theoretical definition from a single administration, in terms of the lengths of the score vectors of two parallel tests and the angle between these vectors in N-dimensional space. Thus, it is possible to find the true score variance from the data and to calculate a reliability coefficient that conforms to the theoretical definition even if the true scores or error scores of the individuals taking the test are not known.
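Equations (1) and (2) can be evaluated directly from the two half-test score vectors. The following is a minimal Python sketch under stated assumptions: the function name `theoretical_reliability` and the half-test scores are hypothetical, the halves are treated as approximately parallel, and variances use divisor N as in the text.

```python
import math

def theoretical_reliability(xg, xh):
    """Return (S_E^2, S_T^2, r_tt) from half-test score vectors via equations (1) and (2)."""
    n = len(xg)
    norm_g = math.sqrt(sum(v * v for v in xg))   # ||X_g||
    norm_h = math.sqrt(sum(v * v for v in xh))   # ||X_h||
    dot = sum(a * b for a, b in zip(xg, xh))
    cos_theta = dot / (norm_g * norm_h)          # cosine of the angle between the halves

    # Equation (1): error variance of the full test.
    var_e = (norm_g ** 2 + norm_h ** 2 - 2 * norm_g * norm_h * cos_theta) / n

    # Observed variance of the full test score X = X_g + X_h.
    total = [a + b for a, b in zip(xg, xh)]
    mean_total = sum(total) / n
    var_x = sum((t - mean_total) ** 2 for t in total) / n

    var_t = var_x - var_e                        # true score variance
    return var_e, var_t, var_t / var_x           # last value is equation (2)

# Hypothetical half-test scores for five examinees (equal means, unequal variances).
x_g = [3, 5, 4, 6, 2]
x_h = [4, 4, 5, 5, 2]
var_e, var_t, r_tt = theoretical_reliability(x_g, x_h)
print(round(var_e, 4), round(var_t, 4), round(r_tt, 4))
```

In this illustration the halves are not exactly parallel, so the general forms (1) and (2) are used rather than the simplified forms (3) and (4).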

Properties of theoretically defined reliability

1. Equations (1) and (2) were derived assuming dichotomization of a test into parallel halves. However, in practice, it may be difficult to dichotomize a test so that the means and variances of the half scores match exactly. Because equations (1) and (2) retain both vector lengths, they can still be applied when the lengths of the vectors $X_{g}$ and $X_{h}$ differ marginally.

2. The correlation between two parallel sub-tests, $r_{g h}$, as an estimate of split-half reliability, may differ in value from $r_{t t}$.

3. Equation (1) gives a way to find the error variance of the test directly, avoiding the popular method of first finding a test reliability that does not conform to the definition and then using $S_{E}^{2} = S_{X}^{2} (1 - r_{t t})$.

4. If the g-th and h-th tests are parallel, $S_{E}^{2} = 2 S_{X g}^{2} [1 - r_{g h}]$

Proof: equation (1) can be re-written as

\begin{aligned} S_{E}^{2} &= \frac{1}{N} \sum X_{g i}^{2} + \frac{1}{N} \sum X_{h i}^{2} - \frac{2}{N} \sum X_{g i} X_{h i} \\ &= 2 S_{X g}^{2} - 2 \operatorname{cov}(X_{g}, X_{h}) \quad \text{since } {\bar{X}}_{g} = {\bar{X}}_{h} \text{ and } S_{X g} = S_{X h} \\ &= 2 S_{X g}^{2} (1 - r_{g h}) \end{aligned}

(5)

where $r_{g h}$ denotes correlation between g-th and h-th tests, that is, split-half reliability.

5. If g-th and h-th tests are parallel, theoretically defined reliability $r_{t t} \geq r_{g h}$

Proof: $X = X_{g} + X_{h} \Rightarrow \bar{X} = {\bar{X}}_{g} + {\bar{X}}_{h}$ and ${\bar{X}}^{2} = 4 {\bar{X}}_{g}^{2}$ since ${\bar{X}}_{g} = {\bar{X}}_{h}$

\begin{aligned} S_{X}^{2} &= S_{X g}^{2} + S_{X h}^{2} + 2 \operatorname{cov}(X_{g}, X_{h}) \\ &= 2 S_{X g}^{2} + 2 r_{g h} S_{X g}^{2} \quad \text{since } S_{X g} = S_{X h} \\ &= 2 S_{X g}^{2} (1 + r_{g h}) \end{aligned}

Thus,

\begin{matrix} {S_{T}}^{2} = S_{X}^{2} - S_{E}^{2} \\ = 2 S_{X g}^{2} (1 + r_{g h}) - 2 S_{X g}^{2} (1 - r_{g h}) = 4 r_{g h} S_{X g}^{2} \end{matrix}

And theoretically defined reliability is

r_{t t} = \frac{{S_{T}}^{2}}{S_{X}^{2}} = \frac{4 r_{g h} S_{X g}^{2}}{2 S_{X g}^{2} (1 + r_{g h})} = \frac{2 r_{g h}}{1 + r_{g h}}

(7)

Equation (7) gives relationship between theoretically defined reliability $r_{t t}$ and split-half correlation $r_{g h}$ .

From (7), $r_{t t} / r_{g h} = 2 / (1 + r_{g h})$, which is greater than or equal to one because $r_{g h} \leq 1$ (taking $r_{g h} > 0$).

\frac{r_{t t}}{r_{g h}} \geq 1 \Rightarrow r_{t t} \geq r_{g h}

6. Sample values of $S_{E}^{2}$, ${S_{T}}^{2}$, and $S_{X}^{2}$ are likely to differ from sample to sample. An unbiased and consistent estimate of the population variance of observed scores is $σ_{X}^{2} = \frac{1}{N - 1} \sum {(X_{i} - \bar{X})}^{2}$, which can be written as $\frac{N}{N - 1} S_{X}^{2}$. Following a similar approach, one can find $σ_{T}^{2}$ and $σ_{E}^{2}$ with the knowledge of $S_{E}^{2}$ from equation (1) and ${S_{T}}^{2}$ as $(S_{X}^{2} - S_{E}^{2})$.
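Properties 4 and 5 can be checked numerically on a small example. The following is a minimal Python sketch, assuming hypothetical half-test scores chosen so that the two halves have equal means and equal variances; the helper names are illustrative, and variances use divisor N.

```python
import math

def mean(x):
    return sum(x) / len(x)

def var(x):
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

# Hypothetical halves with equal means (2.5) and equal variances (1.25).
x_g = [1, 2, 3, 4]
x_h = [2, 1, 4, 3]
n = len(x_g)

r_gh = cov(x_g, x_h) / math.sqrt(var(x_g) * var(x_h))   # split-half correlation

# Error variance from equation (1), written with the inner product.
s_e2_eq1 = (sum(a * a for a in x_g) + sum(b * b for b in x_h)
            - 2 * sum(a * b for a, b in zip(x_g, x_h))) / n
# Error variance from equation (5).
s_e2_eq5 = 2 * var(x_g) * (1 - r_gh)

# Reliability from the definition and from equation (7).
s_x2 = var([a + b for a, b in zip(x_g, x_h)])
r_tt_definition = (s_x2 - s_e2_eq1) / s_x2
r_tt_eq7 = 2 * r_gh / (1 + r_gh)

print(r_gh, s_e2_eq1, s_e2_eq5, r_tt_definition, r_tt_eq7)
```

For these scores the two routes to the error variance agree, the two routes to $r_{t t}$ agree, and $r_{t t}$ exceeds $r_{g h}$, as property 5 states for a positive split-half correlation.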

Estimation of battery reliability

The method of finding test reliability as per the classical definition can be extended to find the reliability of a battery of tests. The method of obtaining the reliability of the battery will depend on the selected method of reliability of the component tests and on the definition of the battery scores. Usually, battery scores are computed as the sum of the scores in the individual constituent tests (summative scores) or as a weighted sum of these tests scores. Here, we omit the discussion of difference scores–based battery computation for brevity’s sake.

Weighted sum of component tests

Suppose a battery consists of two tests: Test 1 and Test 2. For the i-th person, let $X_{i}$, $X_{1 i}$, and $X_{2 i}$ denote, respectively, the battery score and the scores obtained in Test 1 and Test 2. Let this examinee’s true score and error score on the j-th constituent test be $T_{j i}$ and $E_{j i}$, respectively, j = 1, 2. Then the summative score of the i-th examinee is $X_{i} = X_{1 i} + X_{2 i}$, assuming additivity and equal weights. Let $r_{t t_{1}}$ and $r_{t t_{2}}$ denote the reliabilities of the respective constituent tests.

Now, $var (X) = var (X_{1}) + var (X_{2}) + 2 cov (X_{1}, X_{2})$. The cross-product sum in the covariance term satisfies

\sum^{} X_{1 i} X_{2 i} = \sum^{} (T_{1 i} + E_{1 i}) (T_{2 i} + E_{2 i}) = \sum^{} T_{1 i} T_{2 i}

since $\sum^{} T_{1 i} E_{2 i} = \sum^{} T_{2 i} E_{1 i} = \sum^{} E_{1 i} E_{2 i} = 0$

Thus,

\begin{matrix} cov (X_{1}, X_{2}) = \frac{\sum^{} X_{1 i} X_{2 i}}{N} - {\bar{X}}_{1} {\bar{X}}_{2} \\ = \frac{\sum^{} T_{1 i} T_{2 i}}{N} - {\bar{T}}_{1} {\bar{T}}_{2} \\ = cov (T_{1}, T_{2}) \end{matrix}

Now variance of true score of the battery is

\begin{aligned} S_{T (Battery)}^{2} &= \operatorname{var}(T_{1}) + \operatorname{var}(T_{2}) + 2 \operatorname{cov}(T_{1}, T_{2}) \\ &= \operatorname{var}(T_{1}) + \operatorname{var}(T_{2}) + 2 \operatorname{cov}(X_{1}, X_{2}) \\ &= r_{t t_{1}} S_{X_{1}}^{2} + r_{t t_{2}} S_{X_{2}}^{2} + 2 \operatorname{cov}(X_{1}, X_{2}) \end{aligned}

(9)

Thus, reliability of a battery is given by

r_{t t (B a t t e r y)} = \frac{r_{t t_{1}} S_{X_{1}}^{2} + r_{t t_{2}} S_{X_{2}}^{2} + 2 cov (X_{1}, X_{2})}{S_{X_{1}}^{2} + S_{X_{2}}^{2} + 2 cov (X_{1}, X_{2})}

(10)
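Equation (10) needs only the two component reliabilities, the two observed variances, and the covariance between the tests. The following is a minimal Python sketch; the function name and the input values are hypothetical and serve only to illustrate the arithmetic.

```python
def battery_reliability_two(r1, var1, r2, var2, cov12):
    """Reliability of a two-test summative battery, following equation (10)."""
    true_var = r1 * var1 + r2 * var2 + 2 * cov12   # true score variance, equation (9)
    observed_var = var1 + var2 + 2 * cov12         # variance of X = X_1 + X_2
    return true_var / observed_var

# Hypothetical inputs: reliabilities 0.8 and 0.7, variances 4 and 9.
correlated = battery_reliability_two(0.8, 4.0, 0.7, 9.0, cov12=3.0)
uncorrelated = battery_reliability_two(0.8, 4.0, 0.7, 9.0, cov12=0.0)
print(round(correlated, 4), round(uncorrelated, 4))
```

With these inputs the battery of positively correlated tests comes out more reliable than the battery of uncorrelated tests, because under the stated assumptions the covariance term is entirely true score variance.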

In general, if a battery has K tests which are equally important and the battery score is a summative score of the K constituent tests, then the reliability of the battery, whose numerator is the true score variance of the battery, is given by

r_{t t (Battery)} = \frac{\sum_{i = 1}^{K} r_{t t_{i}} S_{X_{i}}^{2} + 2 \sum_{i < j} \operatorname{cov}(X_{i}, X_{j})}{\sum_{i = 1}^{K} S_{X_{i}}^{2} + 2 \sum_{i < j} \operatorname{cov}(X_{i}, X_{j})}

(11)

The idea can be extended to find reliability of battery where battery score is weighted sum of scores of the constituent tests.

Consider a battery of K tests and weights $W_{1}, W_{2}, \dots, W_{K}$ where $W_{i} > 0$ for all $i = 1, 2, \dots, K$ and $\sum_{i = 1}^{K} W_{i} = 1$. To find the reliability of such a battery, consider the composite score or battery score $Y = \sum_{i = 1}^{K} W_{i} X_{i}$, where $X_{i}$ is the score of the i-th constituent test and $W_{i}$ is the corresponding weight.

Clearly, $var (Y) = \sum_{i = 1}^{K} W_{i}^{2} var (X_{i}) + 2 \sum_{i < j} W_{i} W_{j} cov (X_{i}, X_{j})$

Then

r_{t t (Battery)} = \frac{\sum_{i = 1}^{K} r_{t t_{i}} W_{i}^{2} S_{X_{i}}^{2} + 2 \sum_{i < j} W_{i} W_{j} \operatorname{cov}(X_{i}, X_{j})}{\sum_{i = 1}^{K} W_{i}^{2} S_{X_{i}}^{2} + 2 \sum_{i < j} W_{i} W_{j} \operatorname{cov}(X_{i}, X_{j})}

(12)

S_{T (Battery)}^{2} = \sum_{i = 1}^{K} r_{t t_{i}} W_{i}^{2} S_{X_{i}}^{2} + 2 \sum_{i < j} W_{i} W_{j} \operatorname{cov}(X_{i}, X_{j})

(13)

and

S_{E (Battery)}^{2} = S_{X (Battery)}^{2} - S_{T (Battery)}^{2}

(14)

Equation (12) helps to find the reliability of a test battery consisting of K tests where the battery score is defined as the weighted sum of component tests. True score variance of the battery can be estimated using equation (13). $S_{E (Battery)}^{2}$ can also be estimated using equation (14).
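The weighted case can be computed from the weight vector, the component reliabilities, and the variance–covariance matrix of the component tests. The following is a minimal Python sketch under stated assumptions: the function name, the weights, and the matrix are hypothetical, and the true score variance is taken to be the numerator of equation (12), with the error variance obtained as observed minus true variance.

```python
def weighted_battery_reliability(weights, reliabilities, cov):
    """Return (r_tt, S_T^2, S_E^2) of the battery score Y = sum of W_i * X_i.

    cov is the K x K variance-covariance matrix of the component tests,
    with the observed variances on the diagonal.
    """
    k = len(weights)
    # Observed variance of Y: W' S W, i.e. variances plus all covariance terms.
    var_y = sum(weights[i] * weights[j] * cov[i][j]
                for i in range(k) for j in range(k))
    # True score variance: reliabilities scale the diagonal terms only.
    var_t = sum(reliabilities[i] * weights[i] ** 2 * cov[i][i] for i in range(k))
    var_t += sum(weights[i] * weights[j] * cov[i][j]
                 for i in range(k) for j in range(k) if i != j)
    var_e = var_y - var_t                      # error variance of the battery
    return var_t / var_y, var_t, var_e

# Hypothetical two-test battery with equal weights.
w = [0.5, 0.5]
rel = [0.8, 0.7]
s = [[4.0, 3.0],
     [3.0, 9.0]]
r_battery, true_var, error_var = weighted_battery_reliability(w, rel, s)
print(round(r_battery, 4), true_var, error_var)
```

Summing over all ordered pairs with i different from j counts each covariance twice, which matches the factor 2 on the sum over i < j in the formula.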

However, different choices of weights will give different values of battery reliability. Thus, a major problem in the computation of battery reliability lies in the determination of the weights.

Proposed methods for determination of weights

Selection of weights is a central issue for the weighted sum approach, which assumes an additive model. Hartung et al. (2008) refer to the weights that minimize the variance of a weighted sum, which are proportional to the inverse of the variances, as ideal weights. Shahar (2017) gave three proofs (using the method of Lagrange multipliers, induction, and the Cauchy-Schwarz inequality) of the ideal weights that minimize the variance of a weighted average. His proposition was that if the component indicators $X_{i}$, i = 1, 2, . . . , n, are uncorrelated, then $var (\sum_{i = 1}^{n} W_{i} X_{i})$ is minimized when $W_{i} = [1 / var (X_{i})] / \sum_{j = 1}^{n} [1 / var (X_{j})]$, and its minimum value is $1 / \sum_{i = 1}^{n} [1 / var (X_{i})]$.
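The inverse-variance result for uncorrelated components is easy to check numerically. The following is a minimal Python sketch; the function names and the three variances are hypothetical, and the components are assumed uncorrelated as in the proposition.

```python
def inverse_variance_weights(variances):
    """Weights proportional to 1/var(X_i), normalized to sum to 1."""
    inverses = [1.0 / v for v in variances]
    total = sum(inverses)
    return [inv / total for inv in inverses]

def weighted_sum_variance(weights, variances):
    """Variance of the weighted sum when the components are uncorrelated."""
    return sum(w * w * v for w, v in zip(weights, variances))

variances = [1.0, 2.0, 4.0]                    # hypothetical component variances
ideal = inverse_variance_weights(variances)
equal = [1.0 / len(variances)] * len(variances)

print(ideal)
print(weighted_sum_variance(ideal, variances))  # equals 1 / sum(1 / var_i)
print(weighted_sum_variance(equal, variances))  # larger than the value above
```

Here the ideal weights work out to 4/7, 2/7, and 1/7, giving a variance of 4/7, against 7/9 for equal weights.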

Major observations are as follows:

The component indicators may be independent or correlated with varying degrees. Hence, need is felt to find weights from a general perspective without assuming that variables are independent or they follow certain distributions and to satisfy one or more desired properties of the weighted sum.

Minimum variance is not the only criterion for a good weighted battery score. Determination of weights depends on the desired outcomes of the weighted sum. Depending on the purpose, one may like to find weights (1) that are proportional to the covariance between each variable and the weighted sum, (2) that are proportional to that covariance and inversely proportional to the SDs, or (3) for which the weighted sum is equi-correlated with the component tests, and so on.

The method of finding weights from a general perspective, without assuming that the variables are independent or that they follow certain distributions, is based on the following proposition:

Proposition 1. Let $X_{1}, X_{2}, \dots, X_{n}$ be n variables with varying degrees of correlation. If the weighted sum is $Y = \sum_{i = 1}^{n} w_{i} X_{i}$, where $W = {(w_{1}, w_{2}, \dots, w_{n})}^{T}$ is the vector of weights such that $W^{'} e = 1$ and e is the n-dimensional vector with each component equal to 1, then $var (Y) = W^{'} S W$ is minimum if $W = S^{- 1} e / (e^{'} S^{- 1} e)$, where $S_{n \times n}$ is the variance–covariance matrix of $X_{1}, X_{2}, \dots, X_{n}$.

Proof: Consider $A = W^{'} S W + λ (1 - W^{'} e)$, where λ is a Lagrange multiplier.

Clearly, $\partial A / \partial W = 2 S W - λ e = 0$ and $\partial A / \partial λ = 1 - W^{'} e = 0$.

Solving the above two equations on the assumption that S is non-singular, the first gives $W = λ S^{- 1} e / 2$; substituting this into $W^{'} e = 1$ gives

W = \frac{S^{- 1} e}{e^{'} S^{- 1} e} \quad \text{and} \quad λ = \frac{2}{e^{'} S^{- 1} e}

Properties

1. Weights obtained from the above method minimize the variance of the weighted sum, and $var (Y) = W^{'} S W = \frac{1}{e^{'} S^{- 1} e}$

2. Covariance between the weighted sum Y and any component test $X_{i}$ is constant $\forall i = 1, 2, \dots \dots, n$

3. If we take standardized values $Z_{i}$ of the $X_{i}$, the variance–covariance matrix S reduces to the correlation matrix R, and the new weight vector $W = R^{- 1} e / (e^{'} R^{- 1} e)$ satisfies

r_{(Y, Z_{i})} = r_{(Y, Z_{j})} = \frac{1}{\sqrt{e^{'} R^{- 1} e}} \quad \forall i \neq j
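Proposition 1 and its first two properties can be illustrated with a small covariance matrix. The following is a minimal Python sketch, not the author's implementation: the function names and the matrix are hypothetical, and a plain Gaussian elimination stands in for the matrix inverse. Note that the closed form constrains only the sum of the weights, so for some covariance matrices an individual weight can come out negative.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                      # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def min_variance_weights(s):
    """W = S^{-1} e / (e' S^{-1} e) for a non-singular covariance matrix S."""
    s_inv_e = solve(s, [1.0] * len(s))
    total = sum(s_inv_e)                                # e' S^{-1} e
    return [v / total for v in s_inv_e]

s = [[4.0, 1.0],
     [1.0, 2.0]]                                        # hypothetical covariance matrix
w = min_variance_weights(s)
n = len(s)
var_y = sum(w[i] * w[j] * s[i][j] for i in range(n) for j in range(n))
cov_y_x = [sum(s[i][j] * w[j] for j in range(n)) for i in range(n)]  # cov(Y, X_i)

print(w, var_y, cov_y_x)
```

For this matrix the weights are 0.25 and 0.75, and both covariances between Y and the component tests equal var(Y) = 1.75, in line with properties 1 and 2.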

Proposition 2. If the weight vector $W = {(w_{1}, w_{2}, \dots\dots, w_{n})}^{T}$ satisfies the conditions $0 \leq w_{i} \leq 1, \sum_{i = 1}^{n} w_{i} = 1$ and $w_{i} \propto cov (Y, X_{i})$ then W is the eigen vector corresponding to the maximum eigen value of the variance-covariance matrix S.

Proof: Since $cov (Y, X_{i})$ is the i-th element of $S W$, the condition $w_{i} \propto cov (Y, X_{i})$ means $S W = λ W$, that is, $(S - λ I) W = 0$, where λ is an eigenvalue of S. It can be proved that, on the assumption that S is positive definite, the weight vector in this context is the eigenvector corresponding to the maximum eigenvalue of S.

Properties

Weights are proportional to covariance of variables and the weighted sum;

Variables that are more highly linearly associated with the weighted sum get higher positive weights.
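The eigenvector weights of Proposition 2 can be approximated by power iteration. The following is a minimal Python sketch under stated assumptions: the function name and the matrix are hypothetical, all covariances are taken to be positive so that the leading eigenvector has positive entries, and the vector is rescaled to sum to 1 at each step.

```python
def leading_eigen_weights(s, iterations=500):
    """Approximate the leading eigenvector of S, scaled so the weights sum to 1.

    Assumes all entries of S are positive, so the iterates stay positive.
    Returns (weights, eigenvalue).
    """
    n = len(s)
    w = [1.0 / n] * n
    for _ in range(iterations):
        sw = [sum(s[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(sw)
        w = [v / total for v in sw]            # keep the weights summing to 1
    sw = [sum(s[i][j] * w[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(sw)                       # S w = lambda w and sum(w) = 1
    return w, eigenvalue

s = [[4.0, 1.0],
     [1.0, 2.0]]                               # hypothetical covariance matrix
w, lam = leading_eigen_weights(s)

# cov(Y, X_i) is the i-th element of S w; its ratio to w_i should be constant.
cov_y_x = [sum(s[i][j] * w[j] for j in range(len(s))) for i in range(len(s))]
ratios = [c / wi for c, wi in zip(cov_y_x, w)]
print(w, lam, ratios)
```

The constant ratio between cov(Y, X_i) and w_i is the largest eigenvalue, which is the proportionality stated in the proposition.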

Proposition 3. To find $w_{1}, w_{2}, \dots\dots., w_{n}$ such that $0 \leq w_{i} \leq 1, \sum_{i = 1}^{n} w_{i} = 1$ and

w_{i} \propto \frac{cov (Y, Z_{i})}{var (Z_{i})}

Proof: Here, W satisfies $(R - μ I) s W = 0$, where R is the correlation matrix, μ is the maximum eigenvalue of R, and s is the diagonal matrix of SDs of the $Z_{i}$. It can be proved that $W = s^{- 1} u / (e^{'} s^{- 1} u)$, where u is the eigenvector of R corresponding to its largest eigenvalue.

Observe that here the weights are proportional to the covariance and inversely proportional to the SDs.

Recommended method to find weights

It is proposed to determine the vector $W = {(W_{1}, W_{2}, \dots, W_{K})}^{T}$ with $\sum_{i = 1}^{K} W_{i} = 1$ such that the variance of the battery score Y is a minimum, where $var (Y) = W^{T} D W$, D is the variance–covariance matrix of the component tests, and the constraint is $W^{T} e = 1$ with e the K-dimensional vector whose i-th component equals 1 for all i = 1, . . ., K. As per Proposition 1, the weight vector in this case is

W = \frac{D^{- 1} e}{e^{T} D^{- 1} e} \quad \text{and} \quad λ = \frac{2}{e^{T} D^{- 1} e}

Benefits of the recommended method of finding weights

Weights found as above have the advantage that the battery score $(Y)$ has minimum variance. Also, the covariance between the battery score and the score of an individual test is a constant, that is, $cov (Y, X_{i}) = 1 / (e^{T} D^{- 1} e)$ for all i. If the available test scores are standardized, with the i-th score denoted $Z_{i}$, then the correlation between Y and $Z_{i}$ is equal to the correlation between Y and $Z_{j}$, both being $1 / \sqrt{e^{T} R^{- 1} e}$ for all $i \neq j$, $i, j = 1, 2, \dots, K$, where R is the correlation matrix; if the standardized scores are also independent, the weights are equal. In other words, the battery score is equi-correlated with the standardized score of each constituent test.

In order to find the battery scores, weights of test scores can also be found by principal component analysis, factor analysis, canonical reliability, and so on. However, it is suggested that reliability of the battery scores be found as weighted scores where the weights are determined as per Proposition 1.

It is recommended to use the theoretical definition of test reliability for each constituent test as per equation (2), for its clear theoretical advantages, together with weights determined as per Proposition 1.

Summary and discussion

Assuming non-availability of pre-specified, quantifiable external criterion, the article presents a well-defined and easy approach to the computation of reliability of test batteries, where battery score could be defined as sum of weighted or unweighted scores of the component tests. The method avoids scaling of scores of component tests. Such battery reliability can be influenced significantly by method of selection of weights to arrive at the battery scores and methods of estimating reliability of component tests, since sources of errors of individual tests may get manifold for the battery. Reliability, which is theoretically defined as the ratio of the true variance to the total variance, is an important property of measurement and works well to find battery reliability. It is possible to find true score variances of the battery even if true scores of individuals taking the component tests are not known.

Test reliability as per the theoretical definition, expressed in terms of the lengths of the score vectors of two parallel tests and the angle between these vectors in N-dimensional space, also yields the error score variance of the test and fits well in the estimation of battery reliability. The relationship between the theoretically defined reliability $r_{t t}$ and the split-half correlation $r_{g h}$ was established. A higher value of the theoretically defined reliability $r_{t t}$ will tend to raise the validity of the test, since validity coefficients for a particular test cannot be higher than the reliability coefficients for that test. The proposed method of computing the error variance of the test helps to make unbiased and consistent estimates of $σ_{T}^{2}$ and $σ_{E}^{2}$ for the population. The error variance of the test or its population estimate $(σ_{E}^{2})$ may be mentioned along with the theoretically defined test reliability while reporting a test or a battery of tests.

Data-driven weights can be obtained by different methods, depending on the purpose. For a weighted battery score, a Lagrange multiplier-based solution is recommended, with reliability computed as per the theoretical definition. Weights found in this way have the advantage that the battery score $(Y)$ has minimum variance. In addition, the covariance between the battery score and the score of each individual test is a constant, and the condition under which the battery score is equi-correlated with the standardized score of each constituent test was derived.
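The two properties named above can be checked numerically with a small sketch. It assumes the constraint that the weights sum to 1, which is an assumption made here for illustration. Minimizing $w'Σw$ under that constraint with a Lagrange multiplier gives $Σw = λ\mathbf{1}$, so the weights are proportional to $Σ^{-1}\mathbf{1}$ and every test has the same covariance $λ$ with the battery score. The covariance matrix and all function names below are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # Bring the row with the largest pivot into position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def min_variance_weights(cov):
    """Weights minimizing w' cov w subject to sum(w) == 1 (assumed constraint)."""
    z = solve_linear(cov, [1.0] * len(cov))  # z = cov^{-1} 1
    total = sum(z)
    return [zi / total for zi in z]


def covariances_with_battery(cov, w):
    """Cov(X_i, Y) for Y = sum_j w_j X_j, i.e. the vector cov @ w."""
    return [sum(cij * wj for cij, wj in zip(row, w)) for row in cov]


def battery_variance(cov, w):
    """Var(Y) = w' cov w."""
    return sum(wi * ci for wi, ci in zip(w, covariances_with_battery(cov, w)))


# Hypothetical covariance matrix of two component tests.
example_cov = [[4.0, 1.0], [1.0, 2.0]]
example_weights = min_variance_weights(example_cov)
```

For any other weight vector that sums to 1, such as equal weights, `battery_variance` is at least as large as for `example_weights`, while `covariances_with_battery` returns the same value for every component test at the optimum.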

This work discussed the classically defined reliability of a test with binary items (one option right and the rest wrong). A future study is proposed to provide an empirical illustration, to assess the bias of the classically defined reliability using simulated data, and to compare it with other methods of obtaining reliability.

Footnotes

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Satyendra Nath Chakrabartty

Author biography

Satyendra Nath Chakrabartty has an MStat from Indian Statistical Institute. He was a Research Scholar at Psychometric Research and Service Unit of Indian Statistical Institute. Prof. Chakrabartty has taught Post Graduate courses at Indian Statistical Institute, University of Calcutta, Galgotias Business School, and many others. After serving Kolkata Port Trust for 25 years in various managerial positions, he joined Mumbai Port Trust as Director (Planning & Research) and subsequently took over as Director, Indian Institute of Port Management, and retired from the position of Director, Kolkata Campus of the Indian Maritime University. Currently, he is associated with Indian Ports Association, New Delhi, as a Consultant.

References

Anastasi and Urbina (1997) Psychological Testing (7th edn). Upper Saddle River, NJ: Prentice Hall.

Berchtold (2016) Test–retest: Agreement or reliability? Methodological Innovations 9: 1–7.

Berkowitz, Wolkowitz, Fitch, et al. (2000) The Use of Tests as Part of High-Stakes Decision Making for Students: A Resource Guide for Educators and Policy Makers. Washington, DC: US Department of Education.

Chakrabartty (2013) Best split-half and maximum reliability. IOSR Journal of Research & Method in Education 3(1): 1–8.

Chakrabartty (2017) Composite index: Methods and properties. Journal of Applied Quantitative Methods 12(2): 25–33.

Conger and Stallard (1976) Equivalences among canonical factor analysis, canonical reliability analysis and principal components analysis: Implications for data reduction of fallible measures. Educational and Psychological Measurement 36(3): 619–625.

Feldt and Brennan (1989) Reliability. In: Linn (ed.) Educational Measurement (3rd edn). New York: The American Council on Education; MacMillan, pp. 105–146.

Franzen (2000) Does the Internet make us lonely? European Sociological Review 16(4): 427–438.

Graham (2006) Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement 66: 930–944.

Hartung, Knapp and Sinha (2008) Statistical Meta-Analysis with Applications. Hoboken, NJ: John Wiley & Sons.

Hattie (1985) Methodology review: Assessing uni-dimensionality of tests and items. Applied Psychological Measurement 9(2): 139–164.

Kassin (2003) Psychology (4th edn). Upper Saddle River, NJ: Prentice Hall.

Koch, Gürtler, Fischer-Barnicol, et al. (2003) Determination of reliability of psychometric tests in psychiatry using canonical correlation. Psychiatrische Praxis 30(Suppl 2): S157–S160.

Lambert and Durand (1975) Some precautions in using canonical analysis. Journal of Marketing Research 12(4): 468–475.

Lubans, Morgan, Callister, et al. (2011) Test–retest reliability of a battery of field-based health-related fitness measures for adolescents. Journal of Sports Sciences 29(7): 685–693.

Mesholam-Gately, Seidman, Stover, et al. (2008) The MATRICS consensus cognitive battery, part 1: Test selection, reliability, and validity. American Journal of Psychiatry 165(2): 203–213.

Murphy and Davidshofer (2005) Psychological Testing: Principles and Applications. Upper Saddle River, NJ: Pearson; Prentice Hall.

Raykov (1997) Scale reliability, Cronbach’s coefficient alpha, and violations of essential tau-equivalence with fixed congeneric components. Multivariate Behavioral Research 32: 329–353.

Ritter (2010) Understanding a widely misunderstood statistic: Cronbach’s alpha. Paper presented at Southwestern Educational Research Association (SERA) conference, New Orleans, LA, 17–20 February.

Rudner (2001) Informed test component weighting. Educational Measurement: Issues and Practice 20: 16–19.

Safrit and Wood (2013) The test battery reliability of the health-related physical fitness test. Research Quarterly for Exercise and Sport 58(2): 160–167.

Shahar (2017) Minimizing the variance of a weighted average. Open Journal of Statistics 7: 216–224.

Teo and Fan (2013) Coefficient alpha and beyond: Issues and alternatives for educational research. Asia Pacific Education Review 22: 209–213.

Trizano-Hermosilla and Alvarado (2016) Best alternatives to Cronbach’s alpha reliability in realistic conditions: Congeneric and asymmetrical measurements. Frontiers in Psychology 7: 769.

Vacha-Haase, Kogan and Thompson (2000) Sample compositions and variabilities in published studies versus those in test manuals: Validity of score reliability inductions. Educational and Psychological Measurement 60(4): 509–522.

Wang and Stanley (1970) Differential weighting: A review of methods and empirical studies. Review of Educational Research 40: 663–705.

Webb, Shavelson and Haertel (2007) Reliability coefficient and generalizability theory. In: Rao and Sinharay (eds) Handbook of Statistics 26: Psychometrics. Amsterdam: Elsevier, pp. 81–124.

Wilcox (1992) Robust generalizations of classical test reliability and Cronbach’s alpha. British Journal of Mathematical and Statistical Psychology 45: 239–254.

Zimmerman (2007) Correction for attenuation with biased reliability estimates and correlated errors in populations and samples. Educational and Psychological Measurement 67(6): 920–939.
