
Abstract

Observational case-control studies for biomarker discovery in cancer studies often collect data that are sampled separately from the case and control populations. We present an analysis of the bias in the estimation of the precision of classifiers designed on separately sampled data. The analysis consists of both theoretical and numerical results, which show that classifier precision estimates can display strong bias under separate sampling, with the bias magnitude depending on the difference between the true case prevalence in the population and the sample prevalence in the data. We show that this bias is systematic in the sense that it cannot be reduced by increasing sample size. If information about the true case prevalence is available from public health records, then a modified precision estimator that uses the known prevalence displays smaller bias, which can in fact be reduced to zero as sample size increases under regularity conditions on the classification algorithm. The accuracy of the theoretical analysis and the performance of the precision estimators under separate sampling are confirmed by numerical experiments using synthetic and real data from published observational case-control studies. The results with real data confirmed that under separately sampled data, the usual estimator produces larger, ie, more optimistic, precision estimates than the estimator using the true prevalence value.

Keywords

Precision, recall, bias, classification, observational study, experimental design

Introduction

Biomarker discovery is typically attempted by means of observational case-control studies where classification techniques are applied to high-throughput measurement technologies, such as DNA microarrays,^1,2 next-generation RNA sequencing (RNA-seq),³ or “shotgun” mass spectrometry.⁴ The validity and reproducibility of the results depend critically on the availability of accurate and unbiased assessment of classification accuracy.^5,6

The vast majority of published methods in the statistical learning literature make the assumption, explicitly or implicitly, that the data for training and accuracy assessment are sampled randomly, or unrestrictedly, from the mixture of the populations. However, observational case-control studies in biomedicine typically proceed by collecting data that are sampled with restrictions. The most common restriction, and the one that is studied in this article, is that the data are sampled separately from the case and control populations. That creates an important issue in the application of traditional statistical learning techniques to biomedical data, because there is no meaningful estimator of case prevalences under separate sampling. Therefore, any methodology that directly or indirectly uses estimates of case prevalence could be severely biased.

Precision and Recall have become very popular classification accuracy metrics in the statistical learning literature.^7–9 The recall does not depend on the prevalence, while the precision does. Therefore, we investigate in this article the bias of the precision estimator when the typical separate sampling design used in case-control studies is not properly taken into account.

A similar study was conducted previously into the accuracy of cross-validation under separate sampling.¹⁰ It was shown in that study that the usual “unbiasedness” property of k-fold cross-validation does not hold under separate sampling. In fact, the bias can be substantial and systematic, ie, not reducible under increasing sample size. In Braga-Neto et al,¹⁰ modified k-fold cross-validation estimators were proposed for the class-specific error rates. In the case where the true case prevalence is known, those estimators can be combined into an estimator of the overall error rate, which satisfies the usual “unbiasedness” property of cross-validation.

By contrast, the present paper employs analytical and numerical methods to investigate precision estimation under separate sampling. We show that, under mixture sampling, the usual precision estimator is asymptotically unbiased as sample size increases, under the condition that the classification rule has a finite Vapnik-Chervonenkis (VC) dimension. However, under separate sampling, we show that the usual precision estimator will in general display a systematic bias, which cannot be reduced by increasing sample size, if the observed prevalence of cases in the data is different from the true prevalence in the population of interest, and the bias is larger the more different they are. In particular, the bias tends to be large when the true prevalence is small but the training data contain an equal number of examples from both classes, which is a common scenario in practice. If the true case prevalence is known (eg, from public health records), then a modified precision estimator that uses the known prevalence is shown to be asymptotically unbiased in the separate sampling case, under the condition that the classification rule is sufficiently stable as sample size increases. All of these theoretical results, and the approximations used to derive them, are verified by numerical experiments using both synthetic and real data from published studies.

Materials and Methods

In this section, we define and study the various error rates of interest in this study, including precision and recall.

Population performance metrics

The feature vector $X \in R^{d}$ summarizes numerical characteristics of a patient (eg, blood concentrations of given proteins). The label $Y \in \{0, 1\}$ is defined as $Y = 0$ if the patient is from the control population, and $Y = 1$ if the patient is from the case population.

The prevalence is defined by

prev = P (Y = 1)

(1)

ie, the probability that a randomly selected individual is a case subject. The prevalence plays a fundamental role in the sequel.

A classifier $ψ : R^{d} \to \{0, 1\}$ assigns $X$ to the control or case population, according to whether $ψ (X) = 0$ or $ψ (X) = 1$ , respectively. The classification sensitivity and specificity are defined as follows:

sens = P (ψ (X) = 1 | Y = 1)

(2)

spec = P (ψ (X) = 0 | Y = 0)

(3)

The closer both are to 1, the more accurate the classifier is. A noteworthy property of the sensitivity and specificity is that they do not depend on the prevalence.

Other common performance metrics for a classifier are the false-positive (FP), false-negative (FN), true-positive (TP), and true-negative (TN) rates, given by

FP = P (ψ (X) = 1, Y = 0)

(4)

= (1 - spec) \times (1 - prev)

(5)

FN = P (ψ (X) = 0, Y = 1) = (1 - sens) \times prev

(6)

TP = P (ψ (X) = 1, Y = 1) = sens \times prev

(7)

TN = P (ψ (X) = 0, Y = 0) = spec \times (1 - prev)

(8)

Unlike sensitivity and specificity, the previous performance metrics do depend on the prevalence.

Note that

prev = FN + TP, 1 - prev = FP + TN

(9)

sens = \frac{TP}{TP + FN},spec = \frac{TN}{TN + FP}

(10)

Finally, we define the precision and recall accuracy metrics. Precision measures the likelihood that one has a true case given that the classifier outputs a case:

prec = P (Y = 1 | ψ (X) = 1)

(11)

Applying Bayes’ Theorem and using previously derived relationships reveal that

prec = \frac{TP}{TP + FP} = \frac{sens \times prev}{sens \times prev + (1 - spec) \times (1 - prev)}

(12)

On the other hand, recall is simply the sensitivity:

rec = sens = \frac{TP}{TP + FN}

(13)

It follows that precision depends on the prevalence, but recall does not.
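As a quick numerical illustration of equation (12), the following minimal Python sketch evaluates the precision of a classifier with fixed sensitivity and specificity at several prevalences. The function name `precision_from_rates` and the example values (sens = spec = 0.9) are ours, chosen only for illustration; they do not come from the experiments reported below.

```python
def precision_from_rates(sens: float, spec: float, prev: float) -> float:
    """Population precision from equation (12)."""
    tp = sens * prev                   # TP rate, equation (7)
    fp = (1.0 - spec) * (1.0 - prev)   # FP rate, equation (5)
    return tp / (tp + fp)              # undefined if the classifier never outputs 1


if __name__ == "__main__":
    # Hypothetical classifier with sens = spec = 0.9.
    for prev in (0.1, 0.3, 0.5, 0.7, 0.9):
        prec = precision_from_rates(0.9, 0.9, prev)
        print(f"prev = {prev:.1f}  prec = {prec:.3f}")
```

With the sensitivity and specificity held fixed, the recall stays at 0.9 for every prevalence, while the precision falls from 0.9 at prev = 0.5 to 0.5 at prev = 0.1.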

Estimated performance metrics

In practice, the performance metrics defined in the previous section need to be estimated from sample data $S_{n} = \{(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})\}$ . Let $\hat{P}$ denote the empirical probability measure defined by $S_{n}$ . The estimator of prevalence is

\hat{prev} = \hat{P} (Y = 1) = \frac{1}{n} \sum_{i = 1}^{n} I_{Y_{i} = 1}

(14)

where $I_{A} = 1$ if $A$ is true and $I_{A} = 0$ if $A$ is false. Similarly,

\hat{FP} = \hat{P} (ψ (X) = 1, Y = 0) = \frac{1}{n} \sum_{i = 1}^{n} I_{{ψ (X_{i}) = 1, Y_{i} = 0}}

(15)

\hat{FN} = \hat{P} (ψ (X) = 0, Y = 1) = \frac{1}{n} \sum_{i = 1}^{n} I_{{ψ (X_{i}) = 0, Y_{i} = 1}}

(16)

\hat{TP} = \hat{P} (ψ (X) = 1, Y = 1) = \frac{1}{n} \sum_{i = 1}^{n} I_{{ψ (X_{i}) = 1, Y_{i} = 1}}

(17)

\hat{TN} = \hat{P} (ψ (X) = 0, Y = 0) = \frac{1}{n} \sum_{i = 1}^{n} I_{{ψ (X_{i}) = 0, Y_{i} = 0}}

(18)

The remaining performance metrics estimators are defined analogously, using equations (10), (12), and (13):

\hat{spec} = \frac{\hat{TN}}{\hat{TN} + \hat{FP}} = \frac{\sum_{i = 1}^{n} I_{{ψ (X_{i}) = 0, Y_{i} = 0}}}{\sum_{i = 1}^{n} I_{Y_{i} = 0}}

(19)

\hat{prec} = \frac{\hat{TP}}{\hat{TP} + \hat{FP}} = \frac{\sum_{i = 1}^{n} I_{{ψ (X_{i}) = 1, Y_{i} = 1}}}{\sum_{i = 1}^{n} I_{ψ (X_{i}) = 1}}

(20)

\hat{rec} = \hat{sens} = \frac{\hat{TP}}{\hat{TP} + \hat{FN}} = \frac{\sum_{i = 1}^{n} I_{{ψ (X_{i}) = 1, Y_{i} = 1}}}{\sum_{i = 1}^{n} I_{Y_{i} = 1}}

(21)
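The estimators in equations (14) to (21) are simple counting operations. A minimal Python sketch, assuming labels and predictions are given as sequences of 0s and 1s (the function name `estimate_metrics` and the toy data in the note below are ours):

```python
from typing import Dict, Sequence


def estimate_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Empirical estimators of equations (14)-(21) from labels and predictions."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1) / n   # equation (17)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1) / n   # equation (15)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0) / n   # equation (16)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0) / n   # equation (18)
    # The ratios below are undefined when a denominator is zero
    # (eg, no predicted cases); this sketch does not guard against that.
    return {
        "prev": tp + fn,           # equation (14), using equation (9)
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "spec": tn / (tn + fp),    # equation (19)
        "prec": tp / (tp + fp),    # equation (20)
        "rec": tp / (tp + fn),     # equation (21)
    }
```

For the toy input `y_true = [1, 1, 1, 0, 0, 0, 0, 0]` and `y_pred = [1, 1, 0, 1, 0, 0, 0, 0]`, the estimated prevalence is 3/8, the specificity is 4/5, and both precision and recall are 2/3.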

Mixture and separate sampling

The usual scenario in Statistical Learning is to assume that $S_{n} = \{(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})\}$ is an independent and identically distributed (i.i.d.) sample from the true distribution of the pair $(X, Y)$ . That makes $S_{n}$ a sample from the mixture of populations, where each label $Y_{i}$ is distributed as

\begin{matrix} P (Y_{i} = 1) = prev \\ P (Y_{i} = 0) = 1 - prev \end{matrix}

(22)

for $i = 1, \dots, n$ . Under mixture sampling, $N_{0} = \sum_{i = 1}^{n} I_{Y_{i} = 0}$ and $N_{1} = \sum_{i = 1}^{n} I_{Y_{i} = 1} = n - N_{0}$ are binomial random variables, with parameters $(n, 1 - prev)$ and $(n, prev)$ , respectively.

By contrast, observational case-control studies in biomedicine typically proceed by collecting data from the populations separately, where the separate sample sizes $n_{0}$ and $n_{1}$ , with $n_{0} + n_{1} = n$ , are pre-determined and nonrandom, ie, sampling occurs with the restriction $N_{0} = \sum_{i = 1}^{n} I_{Y_{i} = 0} = n_{0}$ (or, equivalently, $N_{1} = \sum_{i = 1}^{n} I_{Y_{i} = 1} = n_{1}$ ). Therefore, all probabilities and expectations over the sample are conditional on $N_{0} = n_{0}$ . The restriction means that the labels $Y_{1}, \dots, Y_{n}$ are no longer independent, even though the feature vectors $X_{1}, \dots, X_{n}$ are still independent given the labels. In fact, under separate sampling, only the order of the labels $Y_{1}, \dots, Y_{n}$ may be random. Thus, $f (Y_{1}, \dots, Y_{n} | N_{0} = n_{0})$ is a discrete uniform distribution over all $\binom{n}{n_{0}}$ possible orderings. This can also be obtained by direct computation, as follows:

\begin{array}{l} f (Y_{1}, \dots, Y_{n} | N_{0} = n_{0}) = \frac{f (Y_{1}, \dots, Y_{n}, N_{0} = n_{0})}{P (N_{0} = n_{0})} \\ = \begin{cases} \frac{{prev}^{n_{1}} {(1 - prev)}^{n_{0}}}{\binom{n}{n_{0}} {prev}^{n_{1}} {(1 - prev)}^{n_{0}}} = \frac{1}{\binom{n}{n_{0}}}, & \text{if } \sum_{i = 1}^{n} I_{Y_{i} = 0} = n_{0} \\ 0, & \text{otherwise} \end{cases} \end{array}

(23)

It is not difficult to verify that under equation (23), the marginal distribution of each label $Y_{i}$ is given by

\begin{array}{l} P (Y_{i} = 1 | N_{0} = n_{0}) = \frac{n_{1}}{n} \overset{Δ}{=} r \\ P (Y_{i} = 0 | N_{0} = n_{0}) = \frac{n_{0}}{n} = 1 - r \end{array}

(24)

for $i = 1, \dots, n$ , where $r$ is the (fixed) sample size ratio under separate sampling. Comparing equations (22) and (24) reveals the main difference between mixture and separate sampling.
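The difference between the two sampling schemes can be made concrete by generating labels under each one. The following Python sketch is only an illustration; the function names `mixture_labels` and `separate_labels` are ours.

```python
import random
from typing import List


def mixture_labels(n: int, prev: float, rng: random.Random) -> List[int]:
    """i.i.d. labels as in equation (22); the case count N_1 is Binomial(n, prev)."""
    return [1 if rng.random() < prev else 0 for _ in range(n)]


def separate_labels(n0: int, n1: int, rng: random.Random) -> List[int]:
    """Labels with fixed class counts n0 and n1; only their order is random,
    so every admissible ordering is equally likely, as in equation (23)."""
    labels = [0] * n0 + [1] * n1
    rng.shuffle(labels)
    return labels


if __name__ == "__main__":
    rng = random.Random(0)
    n, prev = 30, 0.1
    mix = mixture_labels(n, prev, rng)
    sep = separate_labels(n // 2, n // 2, rng)   # balanced design, r = 1/2
    print("mixture  sample prevalence:", sum(mix) / n)   # varies from sample to sample
    print("separate sample prevalence:", sum(sep) / n)   # always n1 / n = r
```

Under separate sampling the sample prevalence equals $r$ in every sample, whatever the value of $prev$, which is the content of equation (24).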

Bias of the precision estimator

In this section, we present a theoretical large sample analysis of the bias of the estimators discussed previously, focusing on the precision estimator. Estimation bias is defined as the expectation over the sample data $S_{n}$ of the difference between the estimated and true quantities.

The situation is clear with the estimator of the prevalence itself, given by equation (14). Under mixture sampling, we have

E [\hat{prev}] = \frac{1}{n} \sum_{i = 1}^{n} E [I_{Y_{i} = 1}] = P (Y_{1} = 1) = prev

(25)

so the estimator is unbiased (in addition, as $n$ increases, $Var (\hat{prev}) \to 0$ and $\hat{prev} \to prev$ in probability, by the law of large numbers). However, under separate sampling,

\begin{array}{l} E [\hat{prev} | N_{0} = n_{0}] = \frac{1}{n} \sum_{i = 1}^{n} E [I_{Y_{i} = 1} | N_{0} = n_{0}] \\ = P (Y_{1} = 1 | N_{0} = n_{0}) = r \end{array}

(26)

according to equation (24). This also follows directly from the fact that $\hat{prev}$ becomes a constant estimator, $\hat{prev} \equiv r$ , according to equation (14). Thus,

\begin{array}{l} {Bias}_{sep} (\hat{prev}) = E [\hat{prev} - prev | N_{0} = n_{0}] \\ = r - prev \end{array}

(27)

Assuming that the sample size ratio $r = n_{1} / n$ is held constant as $n$ increases (eg, under the common balanced design case, $n_{0} = n_{1} = n / 2$ ), then this bias cannot be reduced with increased sample size. Furthermore, the bias is larger the further away $prev$ is from $r$ . In particular, the bias tends to be large when $prev$ is small and $r = 1 / 2$ , which is a common scenario in practice.

The situation for $\hat{FP}$ , $\hat{FN}$ , $\hat{TP}$ , and $\hat{TN}$ is more complicated. First, we are interested in a classifier $ψ_{n}$ derived by a classification rule from the sample data $S_{n} = \{(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})\}$ . Therefore, all expectations and probabilities in the previous sections are conditional on $S_{n}$ . Under mixture sampling, the powerful Vapnik-Chervonenkis Theorem can be applied to show that all of these estimators are asymptotically unbiased, provided that the classification rule has a finite VC dimension.¹¹ This includes many useful classification algorithms such as Linear Discriminant Analysis (LDA), linear Support Vector Machines (SVMs), perceptrons, polynomial-kernel classifiers, certain decision trees, and neural networks, but it excludes nearest-neighbor classifiers, for example. Classification rules with finite VC dimension do not cut the feature space in complex ways and are thus generally robust against overfitting.

Assuming mixture sampling and a classification algorithm with finite VC dimension $V_{C}$ , it can be shown that (the details are omitted; see Braga-Neto and Dougherty⁶ for a similar argument)

{Bias}_{mix} (\hat{FP}) \leq 8 \sqrt{\frac{V_{C} \log (n + 1) + 4}{2 n}}

(28)

so that the bias vanishes as $n \to \infty$ . Similar inequalities apply to $\hat{FN}$ , $\hat{TP}$ , and $\hat{TN}$ . These are distribution-free results; hence, vanishingly small bias is guaranteed if $n ≫ V_{C}$ , regardless of the feature-label distribution. For linear classification rules, $V_{C} = d + 1$ , where $d$ is the dimensionality of the feature vector. In this case, the $\hat{FP}$ , $\hat{FN}$ , $\hat{TP}$ , and $\hat{TN}$ estimators are essentially unbiased if $n ≫ d$ .
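To get a feel for how large $n$ must be relative to $V_{C}$, the right-hand side of equation (28) can be evaluated directly. The sketch below is illustrative only; the function name `vc_bias_bound` is ours, and the choice $d = 3$ is an assumption made for the example.

```python
import math


def vc_bias_bound(vc_dim: int, n: int) -> float:
    """Right-hand side of equation (28)."""
    return 8.0 * math.sqrt((vc_dim * math.log(n + 1) + 4.0) / (2.0 * n))


if __name__ == "__main__":
    d = 3             # feature dimension, assumed for illustration
    vc_dim = d + 1    # VC dimension of a linear classification rule
    for n in (30, 200, 10_000, 1_000_000):
        print(f"n = {n:>9,d}  bound = {vc_bias_bound(vc_dim, n):.4f}")
```

Because the bound is distribution-free, it is very conservative: for $V_{C} = 4$ it exceeds 1, and is therefore uninformative, at $n = 30$ and $n = 200$, and it becomes small only for much larger samples. The bound guarantees that the bias vanishes asymptotically; it is not a sharp estimate of the bias at a given sample size.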

Next we consider the bias of the precision and recall estimators under mixture sampling (the analysis for the sensitivity and specificity estimators is similar; in fact, the former is just the recall estimator). We will make use of the following approximation for the expectation of a ratio of two random variables $W$ and $Z$ (see Appendix 1 for the derivation of this approximation and the conditions under which it is valid):

E [\frac{W}{Z}] \approx \frac{E [W]}{E [Z]}

(29)

The approximation is quite accurate if $W$ and $Z$ are around $E [W]$ and $E [Z]$ , respectively (it is asymptotically exact as $W \to E [W]$ and $Z \to E [Z]$ ). For the precision estimator,

\begin{array}{l} E [\hat{prec}] = E [\frac{\hat{TP}}{\hat{TP} + \hat{FP}}] \approx \frac{E [\hat{TP}]}{E [\hat{TP} + \hat{FP}]} \\ \approx \frac{E [TP]}{E [TP + FP]} \approx E [\frac{TP}{TP + FP}] = E [prec] \end{array}

(30)

for a sufficiently large sample, where we used the previously established asymptotic unbiasedness of $\hat{TP}$ , $\hat{FP}$ , and $\hat{FN}$ . An entirely similar derivation shows that $E [\hat{rec}] \approx E [rec]$ . Hence, for “well-behaved” classification algorithms (those with finite VC dimension), both the precision and recall estimators are asymptotically unbiased under mixture sampling.

We are not aware of the existence of a VC theory for separate sampling at this time. To obtain approximate results for the separate sampling case, we will assume instead that at large enough sample sizes, the classifier $ψ$ is nearly constant, and invariant to the sample. This assumption is not unrelated to the finite VC dimension assumption made in the case of mixture sampling. Many of the same classification algorithms that have finite VC dimension, such as LDA and linear SVMs, will also become nearly constant as sample size increases. In this case, we have

\begin{array}{l} E [\hat{TP} | N_{0} = n_{0}] = \frac{1}{n} \sum_{i = 1}^{n} E [I_{{ψ (X_{i}) = 1, Y_{i} = 1}} | N_{0} = n_{0}] \\ = P (ψ (X_{1}) = 1, Y_{1} = 1 | N_{0} = n_{0}) \\ = P (ψ (X_{1}) = 1 | Y_{1} = 1) P (Y_{1} = 1 | N_{0} = n_{0}) \\ = sens \times r \end{array}

(31)

where we used the fact that the event ${ψ (X_{1}) = 1}$ is independent of $N_{0}$ given $Y_{1}$ and equation (24). Note that the equality $P (ψ (X_{1}) = 1 | Y_{1} = 1) = sens$ depends on the fact that $ψ$ is assumed to be constant, so that $(X_{1}, Y_{1})$ behaves as an independent test point (also because of a constant $ψ$ , there is no expectation around $sens$ ). Hence, $\hat{TP}$ is biased under separate sampling, with

{Bias}_{sep} (\hat{TP}) = sens \times r - TP = sens \times (r - prev)

(32)

As in the case with the bias of $\hat{prev}$ under separate sampling, the bias of $\hat{TP}$ cannot be reduced with increasing sample size. The bias is in fact larger the more sensitive the classifier is. One can derive similar results for $\hat{FP}$ , $\hat{FN}$ , and $\hat{TN}$ .

The recall estimator is approximately unbiased under separate sampling:

\begin{array}{l} E [\hat{rec} | N_{0} = n_{0}] = E [\frac{\hat{TP}}{\hat{TP} + \hat{FN}} | N_{0} = n_{0}] \\ = E [\frac{\hat{TP}}{\hat{prev}} | N_{0} = n_{0}] = \frac{E [\hat{TP} | N_{0} = n_{0}]}{r} \\ = \frac{sens \times r}{r} = sens = rec \end{array}

(33)

This is a consequence of recall’s not being a function of the prevalence. However, for the precision estimator,

\begin{array}{l} E [\hat{prec} | N_{0} = n_{0}] = E [\frac{\hat{TP}}{\hat{TP} + \hat{FP}} | N_{0} = n_{0}] \\ \approx \frac{E [\hat{TP} | N_{0} = n_{0}]}{E [\hat{TP} + \hat{FP} | N_{0} = n_{0}]} \\ = \frac{sens \times r}{sens \times r + (1 - spec) \times (1 - r)} \\ \neq \frac{sens \times prev}{sens \times prev + (1 - spec) \times (1 - prev)} = prec \end{array}

(34)

The precision estimator is thus biased under separate sampling unless the true prevalence matches exactly the sample ratio $r = n_{1} / n$ ; the bias is larger the further away $prev$ is from $r$ .
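The size of this bias is easy to tabulate from equation (34). The Python sketch below compares the approximate large-sample mean of the usual estimator (third line of equation (34)) with the true precision of equation (12). The function names and the example values sens = spec = 0.9 are ours, chosen only for illustration.

```python
def limiting_usual_precision(sens: float, spec: float, r: float) -> float:
    """Approximate large-sample mean of the usual estimator under separate
    sampling with sample size ratio r (third line of equation (34))."""
    return sens * r / (sens * r + (1.0 - spec) * (1.0 - r))


def true_precision(sens: float, spec: float, prev: float) -> float:
    """Population precision, equation (12)."""
    return sens * prev / (sens * prev + (1.0 - spec) * (1.0 - prev))


def approx_bias(sens: float, spec: float, r: float, prev: float) -> float:
    """Approximate bias of the usual precision estimator under separate sampling."""
    return limiting_usual_precision(sens, spec, r) - true_precision(sens, spec, prev)


if __name__ == "__main__":
    # Hypothetical classifier (sens = spec = 0.9) and a balanced design (r = 0.5).
    for prev in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"prev = {prev:.1f}  approx. bias = {approx_bias(0.9, 0.9, 0.5, prev):+.3f}")
```

For this hypothetical classifier and a balanced design, a true prevalence of 0.1 gives a limiting estimate of about 0.9 against a true precision of 0.5, ie, an optimistic bias of about 0.4. The bias is zero when $prev = r$ and changes sign when $prev > r$.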

In case the true prevalence is known, eg, from public health records and government databases, then we show below that the following estimator of the precision,

\begin{array}{l} {\hat{prec}}^{prev} = \frac{\hat{sens} \times prev}{\hat{sens} \times prev + (1 - \hat{spec}) \times (1 - prev)} \end{array}

(35)

which is based on equation (12), is an asymptotically unbiased estimator of the precision under either mixture or separate sampling. Asymptotic unbiasedness in the mixture sampling case can be shown by repeating the steps in the analysis of the ordinary precision estimator. Under separate sampling, we have

\begin{array}{l} E [{\hat{prec}}^{prev} | N_{0} = n_{0}] \\ \approx \frac{E [\hat{sens} | N_{0}] \times prev}{E [\hat{sens} | N_{0}] \times prev + (1 - E [\hat{spec} | N_{0}]) \times (1 - prev)} \\ = \frac{sens \times prev}{sens \times prev + (1 - spec) \times (1 - prev)} = prec \end{array}

(36)

since $E [\hat{sens} | N_{0} = n_{0}] = sens$ and $E [\hat{spec} | N_{0} = n_{0}] = spec$ , as can be easily shown. Hence, ${\hat{prec}}^{prev}$ is an asymptotically unbiased estimator of the precision under either mixture or separate sampling. The ordinary precision estimator $\hat{prec}$ should not be used under separate sampling, or large and irreducible bias may occur. On the other hand, if no information on the true prevalence value can be obtained, then no meaningful estimator of the precision is possible.
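A minimal Python sketch of the modified estimator in equation (35), next to the usual estimator of equation (20), is given here for illustration. The function names are ours, and the toy counts in the usage note are hypothetical.

```python
from typing import Sequence


def usual_precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Usual precision estimator, equation (20)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    return tp / (tp + fp)


def precision_known_prev(y_true: Sequence[int], y_pred: Sequence[int], prev: float) -> float:
    """Modified precision estimator of equation (35), using a known prevalence."""
    n1 = sum(1 for y in y_true if y == 1)
    n0 = len(y_true) - n1
    sens_hat = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1) / n1
    spec_hat = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0) / n0
    num = sens_hat * prev
    return num / (num + (1.0 - spec_hat) * (1.0 - prev))


if __name__ == "__main__":
    # Hypothetical balanced sample: 10 cases (9 detected), 10 controls (1 false alarm).
    y_true = [1] * 10 + [0] * 10
    y_pred = [1] * 9 + [0] + [1] + [0] * 9
    print("usual:                ", usual_precision(y_true, y_pred))
    print("modified, prev = 0.1: ", precision_known_prev(y_true, y_pred, 0.1))
    print("modified, prev = 0.5: ", precision_known_prev(y_true, y_pred, 0.5))
```

In this toy sample the usual estimate is 0.9, while the modified estimate with $prev = 0.1$ is about 0.5. The two coincide when the supplied prevalence equals the sample ratio $r = 0.5$.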

Results and Discussion

In this section, we employ synthetic and real-world data to investigate the accuracy of the analysis in the previous section and the performance of the precision estimator under separate sampling. Corresponding results for mixture sampling and the recall estimator can be found in the Supplementary Material.

Experiments with synthetic data

We performed a set of experiments employing synthetic data from a homoskedastic Gaussian model, consisting of three-dimensional class-conditional distributions $N (μ_{i}, Σ)$ , for $i = 0, 1$ , with $μ_{0} = (0, 0, 0)$ , $μ_{1} = (0, 0, θ)$ , where $θ > 0$ is a parameter governing the separation between the classes, and $Σ = diag (σ_{1}^{2}, σ_{2}^{2}, σ_{3}^{2})$ (ie, a matrix with $σ_{1}^{2}, σ_{2}^{2}, σ_{3}^{2}$ on the diagonal and zeros off the diagonal). We consider two sample sizes, $n = 30$ and $n = 200$ , so that we can compare the results for small and large sample sizes. All experiments with separate sampling are performed with sample size ratio $r = n_{1} / n \in [0.1, 0.9]$ . The synthetic data parameters are summarized in Table 1.

Table 1.

Synthetic data parameters.

Parameter	Value
Dimensionality/feature size	$d = 3$
Mean difference	$θ = 2$
Covariance matrix	$σ_{1}^{2} = 0.5, σ_{2}^{2} = 0.5, σ_{3}^{2} = 1$
Sample size	$n = 30, 200$
Sample size ratio $r$	$r = 0.1, 0.3, 0.5, 0.7, 0.9$
True prevalence	$prev = 0.1, 0.3, 0.5, 0.7, 0.9$

For each value of $r$ and $prev$ , we repeat the following process 1000 times and average the results to estimate expected values:

Generate sample data $S_{n}$ of size $n$ according to $r$ (separate sampling) or $prev$ (mixture sampling);

Train a classifier using one of three classification rules:¹² LDA, 3-Nearest Neighbors (3NN), and a nonlinear Radial-Basis Function Support Vector Machine (RBF-SVM).

Obtain recall and precision estimates. Compute both the usual precision estimate $\hat{prec}$ and the modified precision estimate ${\hat{prec}}^{prev}$ .

Obtain accurate estimates of the true precision values using a test set of size 10 000.
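The steps above can be sketched in a few lines of Python for the separate sampling case. This is a simplified stand-in, not the code used for the reported experiments, and it makes three assumptions of ours: a nearest-centroid linear rule replaces LDA, 3NN, and RBF-SVM; the estimates are computed on the training sample itself; and the true sensitivity and specificity of the trained linear rule are computed exactly from the Gaussian model instead of from a test set of size 10 000. The model parameters follow Table 1.

```python
import math
import random

THETA = 2.0                                  # class separation (Table 1)
SD = (math.sqrt(0.5), math.sqrt(0.5), 1.0)   # square roots of the Table 1 variances
MU = {0: (0.0, 0.0, 0.0), 1: (0.0, 0.0, THETA)}


def phi(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))


def prec_formula(sens: float, spec: float, prev: float) -> float:
    """Equation (12); equals equation (35) when fed estimated sens and spec."""
    return sens * prev / (sens * prev + (1.0 - spec) * (1.0 - prev))


def simulate(n=200, r=0.5, prev=0.1, reps=200, seed=0):
    """Return averages of (usual estimate, modified estimate, true precision)
    over `reps` separately sampled training sets with n1 = r * n cases."""
    rng = random.Random(seed)
    n1 = round(r * n)
    n0 = n - n1
    totals, used = [0.0, 0.0, 0.0], 0
    for _ in range(reps):
        x0 = [[rng.gauss(m, s) for m, s in zip(MU[0], SD)] for _ in range(n0)]
        x1 = [[rng.gauss(m, s) for m, s in zip(MU[1], SD)] for _ in range(n1)]
        m0 = [sum(col) / n0 for col in zip(*x0)]
        m1 = [sum(col) / n1 for col in zip(*x1)]
        # Nearest-centroid rule: predict 1 when w.x > c.
        w = [a - b for a, b in zip(m1, m0)]
        c = 0.5 * (dot(m1, m1) - dot(m0, m0))
        tp = sum(1 for x in x1 if dot(w, x) > c)
        fp = sum(1 for x in x0 if dot(w, x) > c)
        if tp + fp == 0:
            continue  # no predicted cases: both estimates are undefined
        usual = tp / (tp + fp)                                   # equation (20)
        modified = prec_formula(tp / n1, 1.0 - fp / n0, prev)    # equation (35)
        # Exact sens and spec of the trained linear rule under the Gaussian model.
        s = math.sqrt(sum((wi * sd) ** 2 for wi, sd in zip(w, SD)))
        sens = phi((dot(w, MU[1]) - c) / s)
        spec = phi((c - dot(w, MU[0])) / s)
        for k, value in enumerate((usual, modified, prec_formula(sens, spec, prev))):
            totals[k] += value
        used += 1
    return tuple(t / used for t in totals)


if __name__ == "__main__":
    usual, modified, true = simulate()
    print(f"usual = {usual:.3f}  modified = {modified:.3f}  true = {true:.3f}")
```

With a balanced design and a true prevalence of 0.1, the average usual estimate in this sketch lies far above the average true precision, while the average modified estimate stays close to it.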

Figure 1 displays the results of the experiment. Note that there is only one curve for the traditional precision estimator $\hat{prec}$ because it does not employ the actual value of $prev$ . The values of $\hat{prec}$ and ${\hat{prec}}^{prev}$ coincide when $prev = r$ , as expected. However, as the values of $prev$ and $r$ become different, their values become quite different, and ${\hat{prec}}^{prev}$ displays much less bias, ie, it tracks the true precision much more closely, than $\hat{prec}$ . At the small sample size $n = 30$ , both estimators display bias, which is however much larger overall for $\hat{prec}$ than for ${\hat{prec}}^{prev}$ . At the large sample size $n = 200$ , the bias of ${\hat{prec}}^{prev}$ nearly disappears for LDA and is reduced for the other classification rules. We note that among these classification rules, LDA is the only one with a finite VC dimension; the fact that the bias in this case shrinks to zero as sample size increases confirms the results of the theoretical analysis in the previous section (convergence is quite fast, and quite evident at $n = 200$ , due to the fact that the synthetic data are homoskedastic Gaussian). Note also that the bias of $\hat{prec}$ cannot be reduced by increasing sample size, which is also in agreement with the theoretical analysis (and so are the results in the Supplementary Material).

Figure 1.

Average true precision (solid curves), average usual precision estimate $\hat{prec}$ (dash-diamond curves), and average modified precision estimate ${\hat{prec}}^{prev}$ (dashed curves), for LDA, 3NN, and RBF-SVM, with sample sizes $n = 30$ and $n = 200$ , and different prevalence values, as a function of the sample size ratio. LDA indicates Linear Discriminant Analysis; 3NN, 3-Nearest Neighbors; RBF-SVM, Radial-Basis Function Support Vector Machine.

To examine more closely the effect of the difference between $prev$ and $r$ on precision estimation, Figure 2 plots bias estimates for $\hat{prec}$ and ${\hat{prec}}^{prev}$ as a function of the absolute difference between $prev$ and $r$ , using the same data employed in Figure 1. It can be seen that the bias is always positive, indicating optimistic precision estimates. In nearly all cases, ${\hat{prec}}^{prev}$ has a smaller bias than $\hat{prec}$ , and when $prev$ is far from $r$ , the difference in bias becomes quite large.

Figure 2.

Estimated bias of the usual precision estimator $\hat{prec}$ (dotted curves), and the modified precision estimator ${\hat{prec}}^{prev}$ (dashed curves) for LDA, 3NN, and RBF-SVM, with sample sizes $n = 30$ and $n = 200$ , and different prevalence values, as a function of the absolute difference between true prevalence and sample size ratio. LDA indicates Linear Discriminant Analysis; 3NN, 3-Nearest Neighbors; RBF-SVM, Radial-Basis Function Support Vector Machine.

Case studies with real data

Here we further investigate the bias of precision estimation under separate sampling using real data from three published studies.

Leukemia study

This publication¹³ used a tumor microarray data set containing two types of human acute leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Gene expression measurements were taken from 15 154 genes from 72 tissue specimens, 47 of which were of ALL type (class 0) and 25 of AML type (class 1), so that $r = 0.347$ . The estimator ${\hat{prec}}^{prev}$ was computed using the value $prev = 0.222$ , which is the incidence rate of ALL over AML in the US population.¹⁴

Breast cancer study

The second publication¹⁵ employed the Wisconsin Breast Cancer (Original) Dataset from the University of California-Irvine (UCI) Machine Learning Repository,^16,17 which has been used by several groups to investigate breast cancer classification methods.^18,19 The data set consists of 699 instances, 458 and 241 of which are from benign and malignant tumors, respectively, and 10 features corresponding to cytological characteristics of breast fine-needle aspirates. According to Wilkins,²⁰ fewer than 20% of breast lumps are malignant; therefore, we used $prev = 0.2$ in the computation of the modified precision estimator ${\hat{prec}}^{prev}$ .

Liver disease study

The final publication²¹ employed a liver disease data set, also from the UCI Machine Learning Repository. This data set contains 5 blood test attributes and 345 records, of which 145 belong to individuals with liver disease (class 0) and 200 measurements are taken from healthy individuals (class 1), so that $r = 0.42$ . This data set was donated to UCI in 1990, when the prevalence rate for chronic liver disease in the United States was $prev = 0.1178$ ,²² which we use as the prevalence in the computation of the ${\hat{prec}}^{prev}$ estimator.

All three studies used libraries from the Weka machine learning environment²³ to compute usual precision estimates on separately sampled data, while ignoring true prevalences, for different classification rules: Naive Bayes (NB),²⁴ C4.5 decision tree,²⁵ Back-Propagated Neural Networks, 3NN, and Linear SVM.¹² We reproduced the analysis in all three papers using Weka, obtaining almost exactly the same $\hat{prec}$ estimates reported in those papers, and added for comparison the ${\hat{prec}}^{prev}$ estimates using the prevalence values described above. The results, displayed in Figure 3, show that without exception, the usual precision estimates $\hat{prec}$ are larger than the more accurate ${\hat{prec}}^{prev}$ estimates, in agreement with the previously observed fact that $\hat{prec}$ displays a larger (optimistic) bias. The bias is particularly large in the case of the liver disease study, reflecting the fact that among the three data sets, this is the one where the values of $prev$ and $r$ differ the most.
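When only a reported usual estimate is available, it can be converted to the modified estimate without access to the raw data. This follows from equations (19) to (21) and (35): with $\hat{prev} = r$ , the odds of $\hat{prec}$ are $\hat{sens} \times r / [(1 - \hat{spec}) \times (1 - r)]$ , so the odds of ${\hat{prec}}^{prev}$ equal the odds of $\hat{prec}$ multiplied by $[prev / (1 - prev)] / [r / (1 - r)]$ . The Python sketch below implements this conversion. It is our own illustration and not part of the analysis reported above; the function name is ours, and the input estimate 0.90 is hypothetical. It assumes the reported value is the case-class precision computed from a single confusion matrix whose class proportions equal $r$ .

```python
def adjust_precision(prec_usual: float, r: float, prev: float) -> float:
    """Convert a usual precision estimate (equation (20)) obtained at sample
    ratio r into the modified estimate of equation (35) for prevalence prev.
    Requires 0 < prec_usual < 1, 0 < r < 1, and 0 < prev < 1."""
    odds = prec_usual / (1.0 - prec_usual)
    odds *= (prev / (1.0 - prev)) / (r / (1.0 - r))
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # r and prev as in the leukemia study; the estimate 0.90 is hypothetical.
    print(adjust_precision(0.90, r=0.347, prev=0.222))
```

Whenever $prev < r$ , as in all three studies, the converted value is smaller than the reported one, consistent with the pattern in Figure 3.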

Figure 3.

Precision estimates for different classification rules using separately sampled leukemia, breast cancer, and liver disease data. The white bars depict the usual precision estimates, while the shaded bars are for the precision estimates using the true case prevalences. NB indicates Naive Bayes; 3NN, 3-Nearest Neighbors; SVM, Support Vector Machine.

Concluding Remarks

Accuracy and reproducibility in observational studies are critical to the progress of biomedicine, in particular, in the discovery of reliable biomarkers for disease diagnosis and prognosis. In this study, theoretical results confirmed by numerical experiments show that the usual estimator of precision can be severely biased under the typical separate sampling scenario in observational case-control studies. This will be true especially if the true disease prevalence differs significantly from the apparent prevalence in the data. If knowledge of the true disease prevalence is available, or can even be approximately ascertained, then it can be used to define a modified precision estimator, which is nearly unbiased at moderate sample sizes. In all the results using real data sets, we observed that the usual precision estimator produces values that are larger, ie, more optimistic, than the modified one using the true prevalence, which agrees with the results obtained with the synthetic data. Absence of knowledge about the true prevalence means simply that the precision cannot be reliably estimated in observational case-control studies and its use should be discouraged. Finally, we note that in our experiments, we considered the case where the prevalence is between 0.1 and 0.9, not without reason. If the prevalence is significantly under 0.1, as is the case in some rare diseases, then neither the precision, nor in fact the classification error, should be used as a criterion of performance, but rather the sensitivity and specificity need to be considered separately; otherwise, a large precision and small classification error can be achieved by biasing the classification rule to produce FP rates close to zero while ignoring the FN rate.

Supplemental Material

suppl_figures – Supplemental material for On the Bias of Precision Estimation Under Separate Sampling

Supplemental material, suppl_figures for On the Bias of Precision Estimation Under Separate Sampling by Shuilian Xie and Ulisses M Braga-Neto in Cancer Informatics

Footnotes

Appendix 1

Here we derive the asymptotic approximation in equation (29). If $f : ℝ^{2} \to ℝ$ is infinitely differentiable at point $(a, b)$ , then it can be expanded by a bivariate Taylor series around $(a, b)$ as

\begin{array}{l} f (x, y) = f (a, b) + \frac{\partial f (a, b)}{\partial x} (x - a) + \frac{\partial f (a, b)}{\partial y} (y - b) \\ + \text{second and higher order terms in } x - a \text{ and } y - b \end{array}

(37)

Now let $X_{n}$ and $Y_{n}$ be sequences of random variables with means $μ_{X}$ and $μ_{Y}$ , with $μ_{Y} \neq 0$ . The ratio $x / y$ is infinitely differentiable at $(a, b)$ if $b \neq 0$ ; therefore, we can apply the previous result and get

\begin{array}{l} \frac{X_{n}}{Y_{n}} = \frac{μ_{X}}{μ_{Y}} + \frac{1}{μ_{Y}} (X_{n} - μ_{X}) - \frac{μ_{X}}{μ_{Y}^{2}} (Y_{n} - μ_{Y}) \\ + \text{second and higher order terms in } X_{n} - μ_{X} \text{ and } Y_{n} - μ_{Y} \end{array}

(38)

Taking expectations on both sides gives

E [\frac{X_{n}}{Y_{n}}] = \frac{μ_{X}}{μ_{Y}} + E [\text{second and higher order terms in } X_{n} - μ_{X} \text{ and } Y_{n} - μ_{Y}]

(39)

Except in pathological cases involving heavy-tailed distributions, the remainder in the previous equation becomes negligible as $X_{n} \to μ_{X}$ and $Y_{n} \to μ_{Y}$ in probability. Therefore, we write

E [\frac{X}{Y}] \approx \frac{E [X]}{E [Y]}

(40)

as long as $X$ and $Y$ are around $E [X]$ and $E [Y]$ , respectively (ie, $Var [X]$ and $Var [Y]$ are small).
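The quality of this approximation can be checked in a toy case where the left-hand side is available in closed form. The Python sketch below is our own illustration and makes two simplifying assumptions: $X$ and $Y$ are independent, and $Y$ is uniform on an interval of half-width $h$ around its mean, with $h < E [Y]$ so that $Y > 0$ . Then $E [X / Y] = E [X] \times E [1 / Y]$ and $E [1 / Y] = \ln (b / a) / (b - a)$ for $Y$ uniform on $[a, b]$ . In the main text the numerator and denominator are dependent, so this is only a sanity check of the role played by the spread of the denominator.

```python
import math


def exact_mean_of_ratio(mu_x: float, mu_y: float, half_width: float) -> float:
    """E[X/Y] for independent X and Y, with E[X] = mu_x and
    Y ~ Uniform(mu_y - half_width, mu_y + half_width), half_width < mu_y."""
    a, b = mu_y - half_width, mu_y + half_width
    if a <= 0:
        raise ValueError("half_width must be smaller than mu_y so that Y > 0")
    return mu_x * math.log(b / a) / (b - a)


if __name__ == "__main__":
    mu_x, mu_y = 2.0, 4.0
    print("E[X]/E[Y] =", mu_x / mu_y)
    for h in (0.1, 1.0, 3.0):
        print(f"half-width {h}: E[X/Y] = {exact_mean_of_ratio(mu_x, mu_y, h):.4f}")
```

For a narrow spread (half-width 0.1) the exact mean of the ratio agrees with $E [X] / E [Y] = 0.5$ to about three decimal places, whereas for a wide spread (half-width 3) it is roughly 0.65, which illustrates why small variances are required.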

Funding:

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests:

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contributions

UMB-N proposed the original idea of studying precision estimates under separate sampling. SX conducted detailed bibliographical research on the use of precision in Bioinformatics. SX designed and conducted the numerical experiments using the synthetic and real data sets. Both authors contributed to the discussion of the results. SX prepared the initial draft of the manuscript, and UMB-N contributed to the preparation of the final version.


ORCID iD

Shuilian Xie

