Abstract

Selected reaction monitoring (SRM) has become one of the main methods for low-mass-range–targeted proteomics by mass spectrometry (MS). However, in most SRM-MS biomarker validation studies, the sample size is very small, and in particular smaller than the number of proteins measured in the experiment. Moreover, the data can be noisy due to a low number of ions detected per peptide by the instrument. In this article, those issues are addressed by a model-based Bayesian method for classification of SRM-MS data. The methodology is likelihood-free, using approximate Bayesian computation implemented via a Markov chain Monte Carlo procedure and a kernel-based Optimal Bayesian Classifier. Extensive experimental results demonstrate that the proposed method outperforms classical methods such as linear discriminant analysis and 3NN, when sample size is small, dimensionality is large, the data are noisy, or a combination of these.

Keywords

Proteomics biomarker, approximate Bayesian computation (ABC), Markov chain Monte Carlo (MCMC), Optimal Bayesian Classifier (OBC), selected reaction monitoring (SRM)

Introduction

Proteomics is the field which deals with the study of cellular behavior and human disease at the protein level. Recently, cancer treatment and prevention have made great strides, thanks to the development of high-throughput technologies in proteomics. Among these, mass spectrometry (MS) analysis has become the preferred choice because of advantages such as high molecular specificity and better detection sensitivity.¹ Hence, MS is widely used in identification and quantification of complex proteome mixtures with the goal of discovering biomarkers, ie, molecular markers for disease.^2–4

However, a major challenge in biomarker discovery is the identification of low-abundance proteins in peripheral blood. Selected reaction monitoring (SRM), conducted using a triple-quadrupole (QQQ) instrument, has an extended mass range and has become one of the main methods for low-mass-range–targeted proteomics by MS.⁵

Nevertheless, in most SRM-MS biomarker validation studies, the sample size is very small due to the economic cost of the experiments and difficulty in recruiting cases. Typically, the number of features (measured proteins) is vastly larger than the sample size. Moreover, depending on the instrument sensitivity, the data can be noisy due to low peptide efficiency, ie, low number of ions detected per peptide.

All the aforementioned issues create a difficult challenge to classical data-driven classification methods. In this article, this is addressed by a model-based Bayesian method for classification of SRM-MS data. We perform Bayesian inference of the parameters of the SRM model proposed in the work by Atashpaz-Gargari et al⁵ and build a kernel classifier, similar to the classifier for liquid chromatography-mass spectrometry (LC-MS) data proposed in the work by Banerjee and Braga-Neto.⁶ As in the latter reference, our method uses a likelihood-free approach, called approximate Bayesian computation (ABC),^7–9 which is necessary because the SRM model of Atashpaz-Gargari et al⁵ is complex and does not have an analytical formulation of the likelihood. After calibration of the parameters, the ABC method is implemented via a Markov chain Monte Carlo (MCMC) procedure^10,11 to obtain a sample from the posterior distribution of the protein concentrations. Small MCMC sample sizes are sufficient to obtain a kernel-based implementation of the Optimal Bayesian Classifier (OBC).¹² Extensive experimental results examining the effect of various parameters demonstrate that the proposed method outperforms classical methods such as linear discriminant analysis (LDA) and 3NN,¹³ when sample size is very small, dimensionality is large, the data are noisy, or a combination of these.

The organization of the article is as follows. Section “SRM-based MS model” surveys the SRM-MS model. Section “ABC-MCMC classification algorithm” explains in detail the ABC rejection algorithm and the approximate Bayesian computation-Markov chain Monte Carlo (ABC-MCMC) classifier. Section “Numerical experiments and results” presents the numerical results. Section “Conclusions” presents concluding remarks.

SRM-Based MS Model

In this article, we employ the model for the SRM pipeline proposed in the work by Atashpaz-Gargari et al.⁵ Next, we review briefly each of the main components of this model.

Protein mixture model

The protein mixture model concerns the true abundance of proteins in the SRM experiment. There are $n$ samples in each class; for convenience, the 2 classes are labeled as 0 for control and 1 for treatment. There are $N_{pro}^{a}$ proteins, $N_{pro}^{c}$ of which are low-abundance candidates for biomarker validation. Protein identities are input as a FASTA file. As argued in previous works,^5,14 protein concentration can be modeled by a gamma distribution. Hence, the protein concentration is given by

$$γ_{i} \sim \begin{cases} Γ(k_{c}, θ_{c}), & i = 1, 2, \dots, N_{pro}^{c} \\ Γ(k_{a}, θ_{a}), & i = N_{pro}^{c} + 1, N_{pro}^{c} + 2, \dots, N_{pro}^{a} \end{cases} \tag{1}$$

The variables $k$ and $θ$ are, respectively, shape and scale parameters. These are uniform random variables defined as $k_{c} \sim \mathrm{Unif}(k_{c}^{low}, k_{c}^{high})$, $k_{a} \sim \mathrm{Unif}(k_{a}^{low}, k_{a}^{high})$ and $θ_{c} \sim \mathrm{Unif}(θ_{c}^{low}, θ_{c}^{high})$, $θ_{a} \sim \mathrm{Unif}(θ_{a}^{low}, θ_{a}^{high})$, respectively. The initial values of these variables, which are displayed in Table 1, reflect the dynamic range of protein abundance levels while taking into account that the candidate proteins are expressed at a much lower level than the background proteins. The initial values used here are consistent with values obtained experimentally in the work by Taniguchi et al¹⁴ as well as the hyperparameter values used in the work by Atashpaz-Gargari et al.⁵ Furthermore, these initial values are modified based on the data, as part of the prior calibration process described in Algorithm 1.

Table 1.

Parameters used in the experiment.

Parameter	Symbol	Value/range
Instrument response factor	$κ$	5
Noise severity	$α, β$	0.03, 3.6
Peptide efficiency factor	$e_{i}$	[0.1, 1]
Shape (gamma distribution)	$k_{a}, k_{c}$	Unif(1.6, 2.4), Unif(4, 6)
Scale (gamma distribution)	$θ_{a}, θ_{c}$	Unif(9e6, 11e6), Unif(90, 110)
Purification	$η_{i}$	$10^{-6}$
Coefficient of variation	$ϕ$	Unif(0.3, 0.5)
Fold change	f	Unif(1.5, 1.6)
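As a rough illustration of equation (1), the sketch below draws baseline concentrations with Python's standard library. The function name `sample_baseline` and the pairing of each range in Table 1 with a particular parameter (first range with the first symbol listed) are assumptions made for this example, not part of the original pipeline.

```python
import random

def sample_baseline(n_cand, n_all, rng):
    """Draw baseline concentrations gamma_i following equation (1).

    The first n_cand entries are candidate (low-abundance) proteins and
    the remaining n_all - n_cand entries are background proteins.
    """
    # Hyperparameters are drawn once from the uniform ranges in Table 1
    # (assumed reading: k_a, theta_a take the first range in each row).
    k_c, theta_c = rng.uniform(4.0, 6.0), rng.uniform(90.0, 110.0)
    k_a, theta_a = rng.uniform(1.6, 2.4), rng.uniform(9e6, 11e6)
    # random.gammavariate(alpha, beta) takes shape alpha and scale beta.
    candidates = [rng.gammavariate(k_c, theta_c) for _ in range(n_cand)]
    background = [rng.gammavariate(k_a, theta_a) for _ in range(n_all - n_cand)]
    return candidates + background

rng = random.Random(0)  # fixed seed so the example is repeatable
gamma = sample_baseline(n_cand=5, n_all=20, rng=rng)
```

With these ranges the candidate means are on the order of $k_{c} θ_{c} \approx 500$, several orders of magnitude below the background means.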

Proteins are divided into biomarker (differentially expressed) and nonbiomarker (not differentially expressed) proteins. We use fold change to quantify the difference:

$$f_{l} = \begin{cases} a_{l}, & \text{if protein } l \text{ is overexpressed} \\ \frac{1}{a_{l}}, & \text{if protein } l \text{ is underexpressed} \\ 1, & \text{otherwise} \end{cases} \tag{2}$$

for $l = 1, \dots, N_{pro}^{a}$. The fold change parameter $a_{l}$ is uniformly distributed in the interval $[1, h]$, for $h > 1$. The value of $h$ used here is displayed in Table 1.
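Equation (2) can be sketched in a few lines. The encoding of expression status as +1, -1, or 0 and the function name `sample_fold_changes` are hypothetical choices for this example; the value `h=1.6` is only illustrative.

```python
import random

def sample_fold_changes(status, h, rng):
    """Return the fold changes f_l of equation (2).

    status[l] is +1 (overexpressed), -1 (underexpressed) or 0 (neither).
    """
    fold = []
    for s in status:
        a = rng.uniform(1.0, h)  # a_l ~ Unif[1, h], with h > 1
        if s > 0:
            fold.append(a)        # overexpressed: f_l = a_l
        elif s < 0:
            fold.append(1.0 / a)  # underexpressed: f_l = 1 / a_l
        else:
            fold.append(1.0)      # nonbiomarker protein: f_l = 1
    return fold

fold = sample_fold_changes([+1, -1, 0], h=1.6, rng=random.Random(1))
```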

While the gamma distribution is chosen for mean protein concentrations, the variation of protein concentration is modeled by a multivariate gaussian vector. Accordingly, the concentration of protein $l$ in class $j$ is modeled as follows:

$$C_{lj}^{pro} \sim \begin{cases} N([γ_{1}, γ_{2}, \dots, γ_{N_{pro}^{a}}], Σ), & \text{for } j \in \text{class 0} \\ N([γ_{1} f_{1}, γ_{2} f_{2}, \dots, γ_{N_{pro}^{a}} f_{N_{pro}^{a}}], Σ), & \text{for } j \in \text{class 1} \end{cases} \tag{3}$$

for $l = 1, \dots, N_{pro}^{a}$. Here, we consider a diagonal covariance matrix $Σ = [σ_{lk}^{2}]_{N_{pro}^{a} \times N_{pro}^{a}}$, which models the protein concentrations as mutually independent and is appropriate when correlations are absent or very weak (correlation between proteins can be included at the cost of adding more parameters to the model):

$$Σ = \begin{bmatrix} σ_{11}^{2} & 0 & \cdots & 0 \\ 0 & σ_{22}^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & σ_{N_{pro}^{a} N_{pro}^{a}}^{2} \end{bmatrix} \tag{4}$$

where

$$σ_{ij}^{2} = \begin{cases} σ_{ii}^{2}, & \text{if } i = j, \text{ with } i, j = 1, \dots, N_{pro}^{a} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

and

$$σ_{ii}^{2} = ϕ \, γ_{i}^{2}, \quad i = 1, \dots, N_{pro}^{a} \tag{6}$$

The coefficient of variation $ϕ$ has the initial value displayed in Table 1, which is the same as the one used in the work by Banerjee and Braga-Neto.⁶ This value is modified based on the data, as part of the prior calibration process described in Algorithm 1.
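Because $Σ$ is diagonal, sampling from equation (3) reduces to independent univariate draws. The following is a minimal sketch under that reading; `sample_concentrations` and its arguments are hypothetical names, and the numeric inputs are illustrative only.

```python
import math
import random

def sample_concentrations(gamma, fold, phi, n, label, rng):
    """Draw n concentration vectors for one class, following equations (3)-(6)."""
    # Class 1 means are scaled by the fold changes; class 0 means are gamma.
    means = [g * f if label == 1 else g for g, f in zip(gamma, fold)]
    # Equation (6): sigma_ii^2 = phi * gamma_i^2, so sd_i = sqrt(phi) * gamma_i.
    sds = [math.sqrt(phi) * g for g in gamma]
    # Diagonal covariance: each protein is drawn independently.
    # Note that a Gaussian draw can be negative when phi is large.
    return [[rng.gauss(m, s) for m, s in zip(means, sds)] for _ in range(n)]

gamma = [500.0, 450.0, 2.0e7]
fold = [1.5, 1.0, 1.0]
rng = random.Random(2)
control = sample_concentrations(gamma, fold, phi=0.4, n=10, label=0, rng=rng)
treated = sample_concentrations(gamma, fold, phi=0.4, n=10, label=1, rng=rng)
```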

To model the purification process usually performed as part of the SRM-MS protocol, we select a set $G_{p}$ of high-abundance proteins to be removed (in fact, attenuated) from the protein mixture:

$$\hat{C}_{ij}^{pro} = \begin{cases} η_{i} C_{ij}^{pro}, & \text{for } i \in G_{p} \\ C_{ij}^{pro}, & \text{otherwise} \end{cases} \tag{7}$$

The value for $η_{i}$ corresponds to the efficiency of the purification process and should be very small. The value assumed here is displayed in Table 1.

Peptide mixture model

In SRM-MS, tryptic digestion of proteins is performed to generate small-mass peptides. Let $Ω_{i}$ be the set of all proteins that contain the $i$th peptide. The concentration of peptide $i$ is the sum of the concentrations of those proteins:

$$C_{ij}^{pep} = \sum_{k \in Ω_{i}} \hat{C}_{kj}^{pro}, \quad i = 1, 2, \dots, N_{c}^{pp}, \quad j \in \{0, 1\} \tag{8}$$

The readout abundance $μ_{i j}$ of the peptide can be modeled as follows:

$$μ_{ij} = C_{ij}^{pep} \, e_{i} \, κ \tag{9}$$

Here, $e_{i}$ represents the peptide efficiency factor and $κ$ represents the LC-MS response factor.
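Equations (7) to (9) chain together as purification, summation over parent proteins, and scaling. The sketch below walks through them for a single sample; the helper names `purify` and `peptide_readout`, the index-list representation of $Ω_{i}$, and the toy numbers are assumptions made for illustration.

```python
def purify(conc, depleted, eta=1e-6):
    """Equation (7): attenuate the proteins whose indices are in `depleted`."""
    return [c * eta if i in depleted else c for i, c in enumerate(conc)]

def peptide_readout(conc_hat, omega, eff, kappa=5.0):
    """Equations (8)-(9) for one sample.

    omega[i] lists the indices of the proteins that contain peptide i,
    eff[i] is the peptide efficiency factor e_i, kappa is the response factor.
    """
    mu = []
    for proteins, e in zip(omega, eff):
        c_pep = sum(conc_hat[k] for k in proteins)  # equation (8)
        mu.append(c_pep * e * kappa)                # equation (9)
    return mu

conc = [100.0, 2.0e6, 50.0]           # protein 1 is a high-abundance protein
conc_hat = purify(conc, depleted={1})
omega = [[0], [0, 1], [2]]            # peptide 1 is shared by proteins 0 and 1
mu = peptide_readout(conc_hat, omega, eff=[1.0, 0.5, 0.2])
```

Here the shared peptide picks up only a small contribution from the depleted protein, which is the intended effect of the purification step.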

However, the true peptide abundance is different from its readout value due to the noise:

$$ν_{ij} = ε_{ij} + λ_{ij}, \quad i = 1, 2, \dots, N_{c}^{pp}, \quad j \in \{0, 1\} \tag{10}$$

where $ε_{ij}$ is additive gaussian noise whose variance depends quadratically on $μ_{ij}$:

$$ε_{ij} \sim N(0, α μ_{ij}^{2} + β μ_{ij}), \quad i = 1, 2, \dots, N_{c}^{pp}, \quad j \in \{0, 1\} \tag{11}$$

and $λ_{ij}$ is additive exponential noise introduced by transition effects:

$$λ_{ij} \sim \mathrm{Exp}(μ_{tran} \times μ_{ij}) \tag{12}$$

where $μ_{tran}$ is a fixed constant.
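The two noise terms of equations (11) and (12) could be drawn as follows. Several things here are assumptions, since the text does not fix them: the value `mu_tran=0.01` is purely illustrative, the argument of $\mathrm{Exp}(\cdot)$ is interpreted as the mean of the exponential distribution (so the rate is its reciprocal), and $μ_{ij}$ is required to be positive.

```python
import math
import random

def noise_variance(mu, alpha=0.03, beta=3.6):
    """Variance of the gaussian term in equation (11)."""
    return alpha * mu ** 2 + beta * mu

def sample_noise(mu, rng, alpha=0.03, beta=3.6, mu_tran=0.01):
    """Draw (eps, lam) for a single peptide readout mu > 0."""
    eps = rng.gauss(0.0, math.sqrt(noise_variance(mu, alpha, beta)))
    # Assumption: mu_tran * mu is the MEAN of the exponential noise;
    # random.expovariate expects the rate, ie, 1 / mean.
    lam = rng.expovariate(1.0 / (mu_tran * mu))
    return eps, lam

eps, lam = sample_noise(mu=500.0, rng=random.Random(3))
nu = eps + lam  # combined as printed in equation (10)
```

With the Table 1 values $α = 0.03$ and $β = 3.6$, a readout of $μ = 100$ has noise variance $0.03 \times 100^{2} + 3.6 \times 100 = 660$.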

The next step is called protein abundance roll-up. This is the process of obtaining the abundances of the parent proteins from the abundances and related characteristics of their child peptides, detected during the MS1 process. To obtain the identities of the parent proteins, a second round of MS, called MS/MS, is often used and available databases of identities are searched. Here, we assume that the rolled-up abundances can be obtained and that the readout of protein $l$ in sample $j$ is given by

$$x_{lj} = \frac{1}{κ η_{l}} \sum_{i \in N_{l}} ν_{ij}, \quad l = 1, 2, \dots, N_{pro}, \quad j \in \{0, 1\} \tag{13}$$

where $κ$ is the instrument response factor, $N_{l}$ is the set of peptides belonging to protein $l$, and $η_{l}$ is the number of peptides in the set $N_{l}$. The data $x_{lj}$ in equation (13) are then used for classification.
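The roll-up in equation (13) is an average of peptide readouts rescaled by the response factor. A minimal sketch for one sample follows, where `roll_up` and the index-list representation of $N_{l}$ are hypothetical and the inputs are toy values.

```python
def roll_up(nu, peptides_of, kappa=5.0):
    """Equation (13) for one sample.

    nu[i] is the noisy abundance of peptide i and peptides_of[l] lists the
    indices of the peptides that belong to protein l (the set N_l).
    """
    x = []
    for pep_idx in peptides_of:
        eta_l = len(pep_idx)  # number of peptides in N_l
        x.append(sum(nu[i] for i in pep_idx) / (kappa * eta_l))
    return x

x = roll_up(nu=[500.0, 250.0, 50.0], peptides_of=[[0, 1], [2]])
# x == [75.0, 10.0]
```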

ABC-MCMC Classification Algorithm

As described in the introduction section, the algorithm mainly has 3 steps: prior calibration via ABC rejection sampling, posterior sampling using an ABC-MCMC algorithm, and classification using a kernel-based method. We describe each of these steps below.

Prior calibration via ABC rejection sampling

Once the protein abundances as described in equation (7) are obtained, the total number of proteins $N_{p r o}^{a}$ is reduced via a feature selection algorithm. As per the equations in the previous section, the protein abundance profiles are a function of the following:

Baseline parameters: $γ = [γ_{1}, γ_{2}, \dots, γ_{d}]$

Prior hyperparameters: $k_{a}, k_{c}, θ_{a}, θ_{c}, ϕ, f$

Instrument parameters: $κ, α, β, e_{i}$

Algorithm 1. Prior calibration of $k_{c}, k_{a}, θ_{c}, θ_{a}, ϕ$ using ABC rejection sampling

1. Generate $M_{cal}$ quintuplets of the parameters $k_{c}, k_{a}, θ_{c}, θ_{a}, ϕ$ such that, for $t = 1, 2, \dots, M_{cal}$:

$$\begin{array}{l} k_{a}^{(t)} \sim \mathrm{Unif}(k_{a}^{low}, k_{a}^{high}) \\ k_{c}^{(t)} \sim \mathrm{Unif}(k_{c}^{low}, k_{c}^{high}) \\ θ_{a}^{(t)} \sim \mathrm{Unif}(θ_{a}^{low}, θ_{a}^{high}) \\ θ_{c}^{(t)} \sim \mathrm{Unif}(θ_{c}^{low}, θ_{c}^{high}) \\ ϕ^{(t)} \sim \mathrm{Unif}(ϕ^{low}, ϕ^{high}) \end{array}$$

2. Simulate a control sample set $S_{0}^{(t)}$ of size $n$ for each quintuplet of parameters, for $t = 1, 2, \dots, M_{cal}$.
3. Accept the quintuplet $(k_{a}^{(t)}, k_{c}^{(t)}, θ_{a}^{(t)}, θ_{c}^{(t)}, ϕ^{(t)})$ if $\| T(S_{0}^{(t)}) - T(S_{0}) \| < ε$, for $t = 1, 2, \dots, M_{cal}$. Here, $\| \cdot \|$ denotes the Euclidean norm and $T$ denotes the vector sample mean.
4. Let $B = [(k^{1}, θ^{1}, ϕ^{1}), \dots, (k^{n_{a}}, θ^{n_{a}}, ϕ^{n_{a}})]$ be the set of the $n_{a}$ accepted parameter sets.
5. The calibrated $k_{a}$ is approximated by the average of its accepted values:

$$k_{a}^{cal} = \int_{k_{a}^{low}}^{k_{a}^{high}} k_{a} \, p(k_{a} | S_{n}) \, d k_{a} \approx \frac{1}{n_{a}} \sum_{s = 1}^{n_{a}} k_{a}^{s}$$

6. The other 4 parameters are calibrated similarly.

Prior calibration via ABC rejection sampling is as described in Algorithm 1. Candidate parameter sets are drawn from the prior; a candidate is kept only if the summary statistic of its simulated data falls within a threshold of the summary statistic of the observed data, and is rejected otherwise. The accepted parameter sets are then averaged, which is a Monte Carlo approximation of the posterior mean, to obtain the calibrated parameters.

In this algorithm, $ε$ is the error tolerance. It must be chosen carefully: it should not be so large that poor samples are accepted, nor so small that nearly all samples are rejected, ie, $P(\| T(S_{0}^{(t)}) - T(S_{0}) \| < ε) \approx 0$.
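The accept/reject logic of Algorithm 1 and the role of $ε$ can be seen in a deliberately simplified toy. This is not the SRM simulator: it calibrates a single gamma shape parameter with a fixed scale, uses a scalar sample mean as the summary $T$ (so the Euclidean norm becomes an absolute value), and every name and number in it is an assumption for illustration.

```python
import random

def abc_rejection(observed_mean, n, eps, m_cal, rng,
                  k_low=4.0, k_high=6.0, theta=100.0):
    """Toy ABC rejection sampler for one gamma shape parameter k."""
    accepted = []
    for _ in range(m_cal):
        k = rng.uniform(k_low, k_high)                           # step 1: prior draw
        sample = [rng.gammavariate(k, theta) for _ in range(n)]  # step 2: simulate
        summary = sum(sample) / n                                # T = sample mean
        if abs(summary - observed_mean) < eps:                   # step 3: accept?
            accepted.append(k)
    # Step 5: average the accepted values (None if nothing was accepted).
    k_cal = sum(accepted) / len(accepted) if accepted else None
    return k_cal, len(accepted)

k_cal, n_acc = abc_rejection(observed_mean=500.0, n=10, eps=50.0,
                             m_cal=2000, rng=random.Random(4))
```

Setting `eps` to zero rejects every candidate, while a very large `eps` accepts all of them and simply returns the prior mean, which illustrates the two failure modes described above.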

Once the optimal parameters are obtained, the fold change vector is calculated by the following sample mean estimate:

$$f_{l, cal} = \frac{T_{l}(S_{1})}{T_{l}(S_{0})}, \quad l = 1, 2, \dots, d \tag{14}$$

where $T_{l}$ denotes the sample mean of the $l$th selected protein.
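Equation (14) is a ratio of per-protein sample means. A small sketch, assuming each sample set is stored as a list of rows with one column per selected protein (the function name and the toy data are hypothetical):

```python
def calibrated_fold_change(s0, s1):
    """Equation (14): ratio of treatment to control sample means, per protein."""
    d = len(s0[0])
    mean0 = [sum(row[l] for row in s0) / len(s0) for l in range(d)]
    mean1 = [sum(row[l] for row in s1) / len(s1) for l in range(d)]
    return [m1 / m0 for m0, m1 in zip(mean0, mean1)]

s0 = [[10.0, 20.0], [30.0, 20.0]]  # control samples, means (20, 20)
s1 = [[30.0, 10.0], [50.0, 10.0]]  # treatment samples, means (40, 10)
f_cal = calibrated_fold_change(s0, s1)
# f_cal == [2.0, 0.5]
```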

ABC-MCMC posterior sampling

ABC-MCMC sampling is as described in Algorithm 2. The vector $γ = (γ_{1}, γ_{2}, \dots, γ_{d})$ is sampled from the posterior $p(γ | S_{n}) \propto p(S_{n} | γ) p(γ)$. After a burn-in period of $t_{s}$ iterations of the Markov chain, the next $M$ samples, from $t_{s}$ to $t_{s} + M$, are retained as the posterior sample. Proper selection of the thresholds in step 4 of Algorithm 2 plays a very important role in the performance of the ABC-MCMC algorithm.

Algorithm 2. Obtain the posterior samples of $γ$ using the ABC-MCMC algorithm

1. Generate the mean vector $γ^{(0)} = (γ_{1}, γ_{2}, \dots, γ_{d})$ from the $Γ$ distribution with the calibrated parameters obtained in Algorithm 1.

For $t = 0, 1, \dots, t_{s}, t_{s} + 1, \dots, t_{s} + M$, where $t_{s}$ is the burn-in period, do:

2. Generate $γ^{(t + 1)} = \mathrm{ColMeans}(S_{0}^{(t)})$, where ColMeans is a function that calculates the mean of each feature (protein).
3. Simulate the control and treatment samples $S_{0}^{(t + 1)}$ and $S_{1}^{(t + 1)}$, each of size $n$, using $γ^{(t + 1)}$ and $γ^{(t + 1)} \cdot f_{cal}$, respectively.
4. Let

$$q = \begin{cases} 1, & \| T(S_{0}^{(t + 1)}) - T(S_{0}) \| < ϵ_{0} \text{ and } \| T(S_{1}^{(t + 1)}) - T(S_{1}) \| < ϵ_{1} \\ 0, & \text{otherwise} \end{cases}$$

5. If $q = 1$, accept $γ^{(t + 1)}$; otherwise, set $γ^{(t + 1)} = γ^{(t)}$.
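The loop in Algorithm 2 can be sketched as below. This is a simplified illustration rather than the authors' implementation: the `simulate` function is a stand-in that draws independent Gaussians instead of running the full SRM model, and $S_{0}^{(t)}$ is taken to be a fresh control sample simulated at the current state, which is an assumption about a detail the algorithm leaves open. All names and numbers are hypothetical.

```python
import math
import random

def col_means(sample):
    """Feature-wise (protein-wise) mean of a list of rows."""
    n = len(sample)
    return [sum(row[l] for row in sample) / n for l in range(len(sample[0]))]

def distance(a, b):
    """Euclidean distance between two summary vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def simulate(means, n, phi, rng):
    # Stand-in simulator (assumption): sd_l = sqrt(phi) * |mean_l|.
    sd = math.sqrt(phi)
    return [[rng.gauss(m, sd * abs(m)) for m in means] for _ in range(n)]

def abc_mcmc(gamma0, f_cal, t0, t1, eps0, eps1, n, phi, burn_in, m, rng):
    gamma, chain = list(gamma0), []
    for _ in range(burn_in + m):
        # Step 2: propose the column means of a control sample simulated
        # at the current state.
        proposal = col_means(simulate(gamma, n, phi, rng))
        # Step 3: simulate both classes at the proposal.
        s0 = simulate(proposal, n, phi, rng)
        s1 = simulate([g * f for g, f in zip(proposal, f_cal)], n, phi, rng)
        # Steps 4-5: accept only if both summaries are close to the data.
        if (distance(col_means(s0), t0) < eps0
                and distance(col_means(s1), t1) < eps1):
            gamma = proposal
        chain.append(list(gamma))
    return chain[burn_in:]  # discard the burn-in period

gamma0, f_cal = [500.0, 450.0], [1.5, 1.0]
t0, t1 = [500.0, 450.0], [750.0, 450.0]  # observed summaries T(S_0), T(S_1)
posterior = abc_mcmc(gamma0, f_cal, t0, t1, eps0=200.0, eps1=200.0,
                     n=10, phi=0.4, burn_in=50, m=100, rng=random.Random(5))
```

When both thresholds are set to zero the chain never moves from $γ^{(0)}$, which is one way to see why the choice of $ϵ_{0}$ and $ϵ_{1}$ matters so much.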

Kernel-based classification

We employ the kernel-based scheme proposed in the work by Banerjee and Braga-Neto,⁶ which is itself based on the OBC in Dalton and Dougherty.¹² One of the issues with kernel-based classification is choosing the right value of the kernel bandwidth parameter. A large bandwidth leads to oversmoothing, which hides many details of the data distribution. A small bandwidth leads to undersmoothing, so that many spurious noisy elements in the data are not eliminated. To address this, we employ an ensemble method, where different classifiers with different bandwidth parameters are obtained and then majority vote is used for classification. The classification algorithm is described in detail in Algorithm 3.

Algorithm 3. Using the ABC-MCMC–based posterior samples for classification

1. Choose a set of kernel bandwidth parameters $h = (h_{1}, h_{2}, \dots, h_{f})$, where $f$ is the number of bandwidth values taken.
2. Choose the number of $γ$ samples from the Markov chain to be used in the kernel classifier. Say we select $q$ samples from the posterior. It is advisable to choose the samples from the end of the chain, for example, those from $t_{s} + M - q$ to $t_{s} + M$.
3. Choose a suitable kernel $K$ for the analysis. In this article, we have chosen a zero-mean, unit-variance gaussian kernel.
4. For a given test point $x$ do:
   Declare a result vector res_vec = zeros
   For each $h_{i}$ in $h_{1}, h_{2}, \dots, h_{f}$ do:
      if $c \sum_{t = t_{s} + M - q}^{t_{s} + M} \sum_{j = 1}^{n} K\left(\frac{x - x^{(j)}}{h_{i}}\right) \geq (1 - c) \sum_{t = t_{s} + M - q}^{t_{s} + M} \sum_{j = n + 1}^{2 n} K\left(\frac{x - x^{(j)}}{h_{i}}\right)$
         res_vec[i] = 1
      else
         res_vec[i] = 0
5. The kernel-based classifier is now given by

$$Ψ(x) = \begin{cases} 1, & \text{if } \mathrm{sum}(\mathrm{res\_vec}) \geq \frac{f + 1}{2} \\ 0, & \text{otherwise} \end{cases} \tag{15}$$
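The bandwidth ensemble of Algorithm 3 might look as follows in code. The sketch keeps the vote convention exactly as printed (a vote of 1 when the weighted kernel sum over the first block of points is at least the weighted sum over the second block). Which class each block holds is not spelled out in the algorithm, so `block_a` and `block_b` are neutral placeholder names; `c` is taken to be the prior class probability, which is also an assumption. The points in each block are assumed to be already pooled over the $q$ posterior samples.

```python
import math

def gaussian_kernel(x, p, h):
    """Unnormalized gaussian kernel K((x - p) / h) for vector inputs.

    The normalizing constant is the same on both sides of the comparison
    for a given h, so it is omitted.
    """
    sq = sum(((a - b) / h) ** 2 for a, b in zip(x, p))
    return math.exp(-0.5 * sq)

def ensemble_classify(x, block_a, block_b, bandwidths, c=0.5):
    """Majority vote over bandwidths, following steps 4-5 of Algorithm 3."""
    votes = 0
    for h in bandwidths:
        lhs = c * sum(gaussian_kernel(x, p, h) for p in block_a)
        rhs = (1.0 - c) * sum(gaussian_kernel(x, p, h) for p in block_b)
        if lhs >= rhs:
            votes += 1  # res_vec[i] = 1
    # Equation (15): label 1 when at least (f + 1) / 2 classifiers vote 1.
    return 1 if votes >= (len(bandwidths) + 1) / 2 else 0

block_a = [[0.0, 0.0], [0.1, 0.0]]
block_b = [[5.0, 5.0], [5.1, 5.0]]
label_near_a = ensemble_classify([0.0, 0.0], block_a, block_b, [0.5, 1.0, 2.0])
label_near_b = ensemble_classify([5.0, 5.0], block_a, block_b, [0.5, 1.0, 2.0])
# label_near_a == 1, label_near_b == 0
```

Using an odd number of bandwidths avoids ties in the majority vote.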

Numerical Experiments and Results

In this section, we demonstrate the application of the proposed ABC-MCMC classification algorithm for SRM data, using a synthetic data set generated from a subset of the human proteome. We selected a list of proteins from the Drugbank and applied tryptic digestion of proteins using the OpenMS software.¹⁵ Because our interest is in small sample sizes, we chose simple classification rules, which are known to perform well with small samples, for comparison: LDA and k-nearest neighbor (KNN) with k = 3.

Synthetic SRM-MS data were generated by the model described in section “SRM-based MS model,” using the parameters in Table 1. Synthetic sample data for prior calibration were generated using the midpoint of the intervals specified in Table 1. For example, as $ϕ \sim \mathrm{Unif}(0.3, 0.5)$, we take 0.4 as the initial value.

For the MCMC procedure, we consider 10 000 samples from the posterior distribution of $γ$ . A burn-in stage of around 3000 iterations is considered. The value of prior probability was taken to be 0.5 (equally likely classes). Kernel density estimation is based on 15 MCMC samples of $γ$ , ie, q = 15 in Algorithm 3 (increasing this number did not show any significant difference in the results). From the initial number of 350 proteins, a t test is applied to select the top 10 to 15 proteins. We consider sample sizes $n = 10$ through $n = 40$ per class and select the number of features to be $d = 3, 5, 8, 10$ . The results displayed below are average results over 6 runs of the experiment for each combination of classification rule, sample size, and dimensionality. The classification error for each case is estimated on an independent synthetic test data set of 100 sample points.
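The t test-based feature selection mentioned above could be sketched as follows. The text does not say which variant of the t test was used, so ranking by the absolute Welch (unequal-variance) t statistic is an assumption, as are the function names and the toy data.

```python
import math
import statistics

def t_statistic(a, b):
    """Welch t statistic for two lists of values (assumed variant)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))

def top_features(s0, s1, d):
    """Return the indices of the d features with the largest |t|."""
    scores = []
    for l in range(len(s0[0])):
        col0 = [row[l] for row in s0]
        col1 = [row[l] for row in s1]
        scores.append((abs(t_statistic(col0, col1)), l))
    scores.sort(reverse=True)
    return sorted(l for _, l in scores[:d])

s0 = [[1.0, 10.0, 5.0], [2.0, 11.0, 5.5], [3.0, 12.0, 4.5]]  # control
s1 = [[1.5, 20.0, 5.0], [2.5, 21.0, 5.2], [2.0, 22.0, 4.8]]  # treatment
selected = top_features(s0, s1, d=1)
# selected == [1]: only the second protein differs between the classes
```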

Effect of sample size

Figure 1 displays the average error rates for the different classification rules. The number of proteins selected is fixed at $d = 10$ . With the increase in sample size, we see that the total error decreases for all classification rules. An important observation is that at small sample sizes, the performance of ABC-MCMC is best, confirming the general principle of good small-sample performance by Bayesian methods.

Figure 1.

Average classification error rates against sample size for a fixed number of selected proteins $d = 10$ . ABC-MCMC indicates approximate Bayesian computation-Markov chain Monte Carlo; LDA, linear discriminant analysis.

Effect of dimensionality

The average error rates of the various classification rules against dimensionality, ie, number of selected proteins, are displayed in Figure 2, for fixed sample size $n = 10$ per class. We can observe a very strong peaking phenomenon¹⁶: as the number of selected proteins increases, the average classification error rates tend to go down at first, but then increase sharply, due to the small sample size, ie, the small ratio of the number of sample points to the dimensionality. One can observe that the ABC-MCMC classification rule is the most accurate one when $d$ is large, which is in agreement with the fact that Bayesian methods tend to outperform competing techniques under small ratios of sample size to dimensionality.

Figure 2.

Average classification error rates against number of selected proteins for a fixed sample size $n = 10$ . ABC-MCMC indicates approximate Bayesian computation-Markov chain Monte Carlo; LDA, linear discriminant analysis.

Effect of variability

Here, we keep the sample size at $n = 10$ and the number of features at $d = 8$ to investigate the impact on the classification error rate of increasing variability in the true protein concentration values. In Figure 3, one can observe that the performance of all classification rules degrades with increasing values of the coefficient of variation $ϕ$; however, the performance of the ABC-MCMC algorithm is uniformly better than the others due to the small sample size $n = 10$.

Figure 3.

Average classification error rates against the coefficient of variation $ϕ$ for a fixed sample size $n = 10$ per class and fixed number of selected proteins $d = 8$ . ABC-MCMC indicates approximate Bayesian computation-Markov chain Monte Carlo; LDA, linear discriminant analysis.

Effect of peptide efficiency

Finally, we investigate the impact on the classification accuracy of varying the peptide efficiency. The peptide efficiency factor $e_{i}$ controls how many ions can be detected for a given peptide. Increasing this parameter uniformly increases efficiency for all peptides, which corresponds to a more accurate SRM-MS experiment. Indeed, one can observe in Figure 4 that classification accuracy tends to increase with increasing peptide efficiency. One can also observe that the ABC-MCMC classification rule displays the smallest error rates among the competing methods at low peptide efficiency, ie, in a noisier experiment.

Figure 4.

Average classification error rates against the lower bound for the peptide efficiency factor $e_{i}$ for a fixed sample size $n = 10$ per class and fixed number of selected proteins $d = 8$ . ABC-MCMC indicates approximate Bayesian computation-Markov chain Monte Carlo; LDA, linear discriminant analysis.

Conclusions

In this article, we have proposed a Bayesian approach for classifying SRM data with the goal of facilitating biomarker development. This method is a combination of ABC and MCMC. We can see that for small sample sizes, large dimensionality, or noisy data, the performance of the proposed Bayesian classifier is superior to that of other approaches. Our results are based on a subset of the human proteome selected from the Drugbank, which is submitted to tryptic digestion in silico. In addition, the prior hyperparameters are calibrated using the available data. This makes the approach realistic and broadly applicable. Because we are studying the effects of the various parameters of the SRM pipeline on the classification error, there is a need to use synthetic data from a generative model. The results are, however, expected to be reproducible on clinical SRM data.

Footnotes

Funding:

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests:

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contributions

KN and UBN conceived and designed the experiments. KN analyzed the data. KN wrote the first draft of the manuscript. UBN contributed to the writing of the manuscript. Both authors read and approved the final manuscript.

References

1. Aebersold, Mann. Mass spectrometry-based proteomics. Nature. 2003;422:198–207.

2. Rifai, Gillette, Carr. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol. 2006;24:971–983.

3. Hüttenhain, Malmström, Picotti, Aebersold. Perspectives of targeted mass spectrometry for protein biomarker verification. Curr Opin Chem Biol. 2009;13:518–525.

4. Blonder, Veenstra. Targeted proteomics for validation of biomarkers in clinical samples. Brief Funct Genomic Proteomic. 2009;8:126–135.

5. Atashpaz-Gargari, Braga-Neto, Dougherty. Modeling and systematic analysis of biomarker validation using selected reaction monitoring. EURASIP J Bioinform Syst Biol. 2014;2014:17.

6. Banerjee, Braga-Neto. Bayesian ABC-MCMC classification of liquid chromatography-mass spectrometry data. Cancer Inform. 2017;14:175–182.

7. Turner, Zandt IV. A tutorial on approximate Bayesian computation. J Mathemat Psychol. 2012;56:69–85.

8. Csilléry, Blum, Gaggiotti, François. Approximate Bayesian Computation (ABC) in practice. Trends Ecol Evol. 2003;25:410–418.

9. Sisson, Fan. Likelihood-free Markov chain Monte Carlo. In: Brooks, Gelman, Jones, Meng, eds. Handbook of Markov Chain Monte Carlo. New York, NY: Chapman & Hall; CRC Press; 2010.

10. Geyer CJ. Practical Markov chain Monte Carlo. Statis Sci. 1992;7:473–483.

11. Wegmann, Leuenberger, Excoffier. Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics. 2009;182:1207–1218.

12. Dalton, Dougherty. Optimal classifiers with minimum expected error within a Bayesian framework part I: discrete and Gaussian models. Pattern Recogn. 2013;46:1301–1314.

13. Webb. Statistical Pattern Recognition. 2nd ed. New York, NY: John Wiley & Sons; 2002.

14. Taniguchi, Choi, et al. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science. 2010;329:533–538.

15. Sturm, Bertsch, Gröpl, et al. OpenMS – an open-source software framework for mass spectrometry. BMC Bioinform. 2008;9:163.

16. Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Trans Informat Theory. 1968;14:55–63.

Bayesian Classification of Proteomics Biomarkers from Selected Reaction Monitoring Data using an Approximate Bayesian Computation-Markov Chain Monte Carlo Approach