Abstract
Proteomics promises to revolutionize cancer treatment and prevention by facilitating the discovery of molecular biomarkers. Progress has been impeded, however, by the small-sample, high-dimensional nature of proteomic data. We propose the application of a Bayesian approach to address this issue in the classification of proteomic profiles generated by liquid chromatography-mass spectrometry (LC-MS). Our approach relies on a previously proposed model of the LC-MS experiment, as well as on the theory of the optimal Bayesian classifier (OBC). Computation of the OBC requires combining a likelihood-free methodology called approximate Bayesian computation (ABC) with Markov chain Monte Carlo (MCMC) sampling. Numerical experiments using synthetic LC-MS data based on an actual human proteome indicate that the proposed ABC-MCMC classification rule outperforms classical methods such as support vector machine, linear discriminant analysis, and 3-nearest neighbor classification rules when the sample size is small or the number of selected proteins used for classification is large.
Introduction
Recent advances in high-throughput technologies in proteomics promise to revolutionize cancer treatment and prevention by facilitating the discovery of molecular biomarkers, which can be used to improve diagnosis, guide targeted therapy, and monitor therapeutic response. 1 Among all high-throughput proteomic technologies, mass spectrometry has increasingly become the method of choice for the analysis of complex protein samples. 2 High molecular specificity and excellent detection sensitivity explain the widespread adoption of mass spectrometry (MS)-based proteomics as a popular tool for the identification and quantification of the composition of complex proteome mixtures.
However, to date, the rate of discovery of successful biomarkers is still unsatisfactory. In addition to challenges such as the high dynamic range of proteins 3 and inaccurate protein quantification, 4 an important impediment to progress is that, in clinical applications of mass spectrometry, the number of samples available is extremely small, whereas mass spectra contain hundreds of thousands of intensity measurements with signals generated by thousands of proteins/peptides. This small-sample, high-dimensionality problem requires the experiment and analysis to be carefully designed and validated in order to arrive at statistically meaningful results. Through model-based approaches and simulation using ground-truthed synthetic data, the problem of biomarker discovery can be studied and evaluated.
In this paper, we propose the application of a Bayesian approach to address the small-sample, high-dimensionality problem in the classification of proteomic profiles generated by liquid chromatography–mass spectrometry (LC-MS). Our approach relies on the detailed LC-MS experiment pipeline model developed in Ref. 5, as well as on the theory of the optimal Bayesian classifier (OBC), proposed in Ref. 6. However, the complexity of the LC-MS experiment, involving steps of sample preparation, protein digestion, peptide ionization, peptide detection, and protein quantification, implies that the likelihood function for the LC-MS model is exceedingly complex, requiring the application of a likelihood-free Bayesian approach. In this paper, we apply a new likelihood-free methodology called approximate Bayesian computation (ABC). 7 The basic ABC rejection sampling method generates candidate parameters by sampling from the prior distribution and creates a model-based simulated dataset. If the dataset conforms to the observed dataset, the candidate can be retained as a sample from the posterior distribution. Thus, one can avoid evaluating the likelihood function, which is essential for classical Bayesian posterior simulation methods. The ABC approach can also be implemented via a combination of rejection sampling and Markov chain Monte Carlo (MCMC) sampling. 8
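To illustrate the basic ABC rejection mechanism described above, here is a sketch on a deliberately simple toy model (a Gaussian mean, not the LC-MS model itself); the prior range, summary statistic, and tolerance are all illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data from an unknown model; here a toy stand-in for an LC-MS run.
observed = rng.normal(loc=2.0, scale=1.0, size=50)
obs_summary = observed.mean()  # summary statistic of the observed dataset

def simulate(theta, rng):
    """Forward-simulate a dataset given candidate parameter theta."""
    return rng.normal(loc=theta, scale=1.0, size=50)

def abc_rejection(n_samples, tolerance, rng):
    """Basic ABC rejection sampler: keep candidates whose simulated
    summary statistic lands within `tolerance` of the observed one."""
    accepted = []
    while len(accepted) < n_samples:
        theta = rng.uniform(-5.0, 5.0)   # draw candidate from the prior
        sim = simulate(theta, rng)       # model-based simulated dataset
        if abs(sim.mean() - obs_summary) < tolerance:
            accepted.append(theta)       # approximate posterior draw
    return np.array(accepted)

posterior = abc_rejection(n_samples=200, tolerance=0.3, rng=rng)
```

Note that the likelihood function is never evaluated: conformity between simulated and observed summary statistics takes its place.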
The detailed implementation of our approach involves first the prior calibration of the hyperparameters of the LC-MS model using an ABC approach via rejection sampling and then using the ABC method implemented via an MCMC procedure to obtain samples from the posterior distribution of the protein concentrations, which are used to approximate the OBC using Monte Carlo integration and kernel smoothing. Numerical experiments using synthetic LC-MS data based on an actual human proteome indicate that the ABC-MCMC classification rule outperforms classical methods such as support vector machines (SVMs), linear discriminant analysis (LDA), and 3-nearest neighbor (3NN) classifiers in the case when sample size is small or the number of selected proteins used to classify is large. We also quantify the effect of experimental parameters such as the coefficient of variation (noise) and instrument peptide efficiency factor on classification accuracy.
The paper is organized as follows. The “LC-MS Model” section surveys the LC-MS model proposed in Ref. 5, which is the basis for our inference approach. The ‘ABC-MCMC Classification Framework” section describes in detail the algorithms for prior calibration, sampling from the posterior, and computation of the ABC-MCMC classifier. The “Numerical Experiments” section presents the results of a numerical experiment using synthetic LC-MS data corresponding to a subset of the human proteome. Finally, the “Conclusion” section brings concluding remarks.
LC-MS Model
Here, we describe briefly the label-free LC-MS model proposed in Ref. 5. Two sample classes are considered, control (class 0) and treatment (class 1). There are n sample profiles from each class, sharing Npro protein species from a specified proteome, which is typically input into the model as a FASTA file. As argued in Ref. 9, protein concentration in the control sample is best described as a Gamma distribution,
According to whether there is a significant difference in abundance between control and treatment populations, proteins are divided into biomarker (differentially expressed) proteins and background (not differentially expressed) proteins. The difference in abundance for biomarker proteins is quantified by the fold change,
The multivariate Gaussian distribution is recommended as the model for protein concentration variations in each class. 10
Accordingly, the protein expression level for the lth protein in the jth sample profile is modeled as
In this paper, we assume a diagonal covariance matrix
The coefficient of variation ϕ is calibrated based on the observed data.
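As a minimal sketch of this sampling model, assuming (per the usual definition of the coefficient of variation) that each protein's standard deviation is ϕ times its mean, with a diagonal covariance so proteins are drawn independently:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_expression(mean_conc, phi, n_samples, rng):
    """Draw protein expression profiles from independent Gaussians
    (diagonal covariance), with standard deviation phi * mean so that
    phi plays the role of a per-protein coefficient of variation."""
    mean_conc = np.asarray(mean_conc, dtype=float)
    std = phi * mean_conc
    return rng.normal(loc=mean_conc, scale=std,
                      size=(n_samples, mean_conc.size))

# Illustrative baseline concentrations for three proteins.
profiles = sample_expression([100.0, 250.0, 40.0], phi=0.1,
                             n_samples=1000, rng=rng)
```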
In order to perform in silico tryptic digestion of the protein samples, we use the peptide mixture model from OpenMS. 11
Let ωi be the set of all proteins that contain the ith peptide. If there are Npep peptide species in total across all proteins in a given sample, then their molar concentrations are given as
In general, ion abundance in MS data bears the signature of the concentration of a peptide type, say peptide i in sample j. Taking measurement uncertainty factors into consideration, the expected readout μij of the abundance of this peptide can be modeled as
The true peptide abundance differs from its readout due to noise. Accordingly, the actual abundance of a peptide vij is modeled as vij = μij + ∊ij, where ∊ij is additive Gaussian noise and follows the distribution
Peptide signals observed in mass spectra are in fact the superposition of true signals, interfering noise, and signals from other peptides. Therefore, the signal-to-noise ratio (SNR) strongly affects the true positive rate (TPR). To account for this, we describe the SNR as
Taking interfering signals into consideration, the TPR of peptides is defined as
Finally, we consider in our model three peptide filters, in order: (1) nonunique peptides present in more than one protein of the proteome in study are discarded; (2) peptides with missing value rates greater than 0.7 are discarded; and (3) among the remaining peptides, those having correlation larger than 0.6 with all other peptides are kept.
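The three filters above can be sketched as follows; the mean imputation used in the correlation step and the literal reading of filter (3) (a kept peptide must correlate above 0.6 with every other surviving peptide) are our own assumptions:

```python
import numpy as np

def filter_peptides(abundance, parent_counts, miss_thresh=0.7, corr_thresh=0.6):
    """Apply the three peptide filters in order (sketch).
    abundance: (n_samples, n_peptides) array, np.nan marks missing readouts.
    parent_counts: number of proteome proteins containing each peptide."""
    keep = np.ones(abundance.shape[1], dtype=bool)

    # (1) discard non-unique peptides (present in more than one protein)
    keep &= (np.asarray(parent_counts) == 1)

    # (2) discard peptides whose missing-value rate exceeds miss_thresh
    miss_rate = np.isnan(abundance).mean(axis=0)
    keep &= (miss_rate <= miss_thresh)

    # (3) among the rest, keep peptides whose correlation with every other
    # surviving peptide exceeds corr_thresh; missing values are imputed
    # with the column mean before computing correlations (assumption)
    idx = np.flatnonzero(keep)
    if idx.size < 2:
        return keep  # correlation filter needs at least two peptides
    filled = np.where(np.isnan(abundance),
                      np.nanmean(abundance, axis=0), abundance)
    corr = np.corrcoef(filled[:, idx], rowvar=False)
    np.fill_diagonal(corr, 1.0)
    ok = (corr > corr_thresh).all(axis=1)
    final = np.zeros_like(keep)
    final[idx[ok]] = True
    return final
```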
The MS1 output provides information about detected peptides, their abundances, and related characteristics. The process of filtering these data and compiling the parent protein abundances from the raw peptide data is called protein abundance roll-up. To obtain the identities of the parent proteins from captured peptide sequence information, one will often use a second round of MS and search available MS/MS (MS2) databases. Alternatively, the accurate mass and time approach matches peptides to databases using monoisotopic mass and elution time predictors, obviating the need for a second step of MS. 13
We will assume here that data are available in the form of rolled-up abundances, whereby the readout of protein l in sample j can be written as
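As a rough sketch of the roll-up step, assuming a simple summation roll-up (one common scheme; the model in Ref. 5 may use a different aggregation):

```python
import numpy as np

def rollup_protein_abundance(peptide_abund, parent_protein, n_proteins):
    """Sum filtered peptide abundances into parent-protein abundances
    (summation roll-up; one of several common schemes, assumed here).
    peptide_abund: (n_samples, n_peptides); parent_protein: index of the
    unique parent protein of each peptide."""
    peptide_abund = np.asarray(peptide_abund, dtype=float)
    out = np.zeros((peptide_abund.shape[0], n_proteins))
    for pep, prot in enumerate(parent_protein):
        # missing peptide readouts contribute zero to the parent protein
        out[:, prot] += np.nan_to_num(peptide_abund[:, pep])
    return out
```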
ABC-MCMC Classification Framework
Bayesian analysis of the complex models used in recent applications involves intractable likelihood functions, which has prompted the development of new algorithms generally called approximate Bayesian computation (ABC). In this approach, one generates candidate parameters by sampling from the prior distribution and creates a model-based simulated dataset. If the dataset conforms to the observed dataset, the candidate can be retained as a sample from the posterior distribution. Thus, one can avoid evaluating the likelihood function, which is essential for classical Bayesian posterior simulation methods. The ABC approach can be implemented via rejection sampling, MCMC, and sequential Monte Carlo methods. 8 Utilizing the LC-MS proteomics model described in the last section, we first perform prior calibration of the hyperparameters using an ABC approach via rejection sampling, and then use the ABC method implemented via an MCMC procedure to obtain samples from the posterior distribution of the protein concentrations in order to derive the ABC-MCMC classifier for LC-MS data.
Overview of the inference procedure
The sample data S = S0 ∪ S1 consist of two subsamples S0 and S1 corresponding to the control group (eg, healthy volunteers) and treatment group (eg, cancer patients), respectively, where each subsample contains n protein abundance profiles. Given the sample data, the total number of proteins Npro is reduced via feature selection (eg, ranking by the two-sample t-test statistic) to a tractable number d of selected proteins. According to the adopted LC-MS model, described in the “LC-MS Model” section, the protein abundance profiles are a function of the baseline protein concentration vector

Relationship among all parameters of the LC-MS model (see text).
LC-MS parameters used in the experiment.
Our approach consists of treating
Prior calibration of k, θ, and ϕ using ABC rejection sampling.
Generate Mcal triplets of parameters {k(t), θ(t), ϕ(t)} such that,
Simulate a control sample set
Accept the triplet {k(t), θ(t), ϕ(t)} if |
Let
Similar Monte Carlo integrations are performed to calculate θcal and ϕcal.
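A toy version of this calibration loop, using a Gamma concentration model with illustrative hyperpriors and a (mean, standard deviation) summary statistic of our own choosing (the paper's actual summaries and tolerances differ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "observed" control sample: protein concentrations from a Gamma model.
k_true, theta_true = 2.0, 3.0
observed = rng.gamma(shape=k_true, scale=theta_true, size=(20, 30))
obs_stat = np.array([observed.mean(), observed.std()])

def calibrate(m_cal, tol, rng):
    """ABC rejection calibration of (k, theta): draw from hyperpriors,
    simulate a control set, accept when summary statistics are close,
    and return the mean of the accepted draws (Monte Carlo integration)."""
    accepted = []
    for _ in range(m_cal):
        k = rng.uniform(0.5, 5.0)      # illustrative hyperprior for k
        theta = rng.uniform(0.5, 6.0)  # illustrative hyperprior for theta
        sim = rng.gamma(shape=k, scale=theta, size=observed.shape)
        stat = np.array([sim.mean(), sim.std()])
        if np.linalg.norm(stat - obs_stat) < tol:
            accepted.append((k, theta))
    return np.mean(accepted, axis=0)   # calibrated hyperparameter values

k_cal, theta_cal = calibrate(m_cal=5000, tol=0.8, rng=rng)
```

The calibrated values concentrate near the generating hyperparameters because both summary statistics jointly pin down (k, θ).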
Prior calibration via ABC rejection sampling
Calibration of the hyperparameters k, θ, ϕ,
First, we calibrate k, θ, and ϕ using the control sample only, since these parameters are common across control and treatment populations and
Next we calibrate the fold change parameter
Posterior sampling via an ABC-MCMC procedure
After prior calibration, we would like now to draw samples from the posterior distribution of the protein baseline expression vector
Prior calibration of fl, l = 1, …, d, using ABC rejection sampling.
Generate Mcal baseline expression values
Simulate a control sample
Accept
Generate Mcal fold change parameters
Simulate a treatment sample
Accept
Let
If λ0 > λ1, then assign fcal,l = 1 (ie, background protein) and return from the algorithm.
Otherwise, fcal,l ≠ 1 (ie, marker protein). For all the accepted altered expression means, we perturb each of the fold changes
The mean of all accepted fold change parameters in step 9 is a reasonably accurate fold change fcal,l for the given protein.
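A single-protein sketch of this fold-change calibration, with an illustrative Gaussian abundance model, uniform fold-change prior, and tolerance (none of which are prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup: one protein with baseline mean 100 and true fold change 2
# between control and treatment populations.
baseline, phi, n = 100.0, 0.1, 25
f_true = 2.0
treat_obs = rng.normal(f_true * baseline, phi * f_true * baseline, size=n)

def calibrate_fold_change(m_cal, tol, rng):
    """ABC rejection sketch for one protein's fold change: count how often
    the background hypothesis (f = 1) versus a marker fold change drawn
    from the prior reproduces the observed treatment sample mean."""
    obs_mean = treat_obs.mean()
    lam0 = lam1 = 0
    accepted_f = []
    for _ in range(m_cal):
        # background candidate: f = 1
        sim0 = rng.normal(baseline, phi * baseline, size=n)
        if abs(sim0.mean() - obs_mean) < tol:
            lam0 += 1
        # marker candidate: f drawn from an (illustrative) prior
        f = rng.uniform(0.25, 4.0)
        sim1 = rng.normal(f * baseline, phi * f * baseline, size=n)
        if abs(sim1.mean() - obs_mean) < tol:
            lam1 += 1
            accepted_f.append(f)
    if lam0 > lam1:
        return 1.0                      # declared a background protein
    return float(np.mean(accepted_f))   # marker: mean of accepted f's

f_cal = calibrate_fold_change(m_cal=4000, tol=5.0, rng=rng)
```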
Optimal Bayesian classifier
Let ψ: Rd → {0, 1} be a classifier that takes a protein abundance profile
Posterior sampling of γ using an ABC-MCMC procedure.
Generate
Simulate control and treatment samples
Accept
For t = 0, 1, …, ts, ts + 1, …, ts + M, where ts is the burn-in period, repeat:
Generate
Simulate control and treatment samples
Let
Accept
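The MCMC steps above follow the general likelihood-free MCMC scheme of Ref. 8; a toy scalar version, with an illustrative flat prior and symmetric random-walk proposal, is:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy observed sample whose mean parameter we want a posterior for.
observed = rng.normal(5.0, 1.0, size=40)
obs_mean = observed.mean()

def log_prior(theta):
    # Flat prior on [0, 10] (illustrative choice).
    return 0.0 if 0.0 <= theta <= 10.0 else -np.inf

def abc_mcmc(n_keep, burn_in, eps, step, rng):
    """Likelihood-free MCMC: propose a move, simulate a dataset at the
    proposal, and accept only if the simulated summary lands within eps
    of the observed one AND the Metropolis prior-ratio test passes
    (the proposal is symmetric, so it cancels in the ratio)."""
    theta = 5.0   # start inside the prior support
    chain = []
    for t in range(burn_in + n_keep):
        prop = theta + step * rng.standard_normal()
        sim = rng.normal(prop, 1.0, size=observed.size)
        if abs(sim.mean() - obs_mean) < eps:
            if np.log(rng.uniform()) < log_prior(prop) - log_prior(theta):
                theta = prop
        if t >= burn_in:
            chain.append(theta)   # keep post-burn-in states
    return np.array(chain)

chain = abc_mcmc(n_keep=2000, burn_in=500, eps=0.4, step=0.5, rng=rng)
```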
Now, consider a Bayesian setting, where the joint distribution of (
The OBC 6 is the classifier that minimizes the quantity in (15):
In the present case of the LC-MS model discussed in the “LC-MS Model” section, the random parameter vector
Now, the densities p(
In addition, we will assume c to be known (eg, from epidemiological data) and fixed, so E[c | S] = c. After some simplification, the resulting OBC, which we call an ABC-MCMC Bayesian classifier, is a kernel-based classifier given by
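A sketch of such a kernel-based classifier, assuming a Gaussian kernel and taking as given two sets of profiles simulated at posterior parameter draws for each class (the stand-in samples below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_kernel_density(x, samples, h):
    """Kernel-smoothed density estimate at x from Monte Carlo samples."""
    d = samples.shape[1]
    diff = (x - samples) / h
    return (np.mean(np.exp(-0.5 * np.sum(diff**2, axis=1)))
            / (h**d * (2 * np.pi)**(d / 2)))

def abc_mcmc_classify(x, sims0, sims1, c=0.5, h=0.5):
    """Approximate OBC: label 0 iff the class-0 effective density,
    weighted by the class-0 prior probability c, dominates."""
    p0 = gaussian_kernel_density(x, sims0, h)
    p1 = gaussian_kernel_density(x, sims1, h)
    return 0 if c * p0 >= (1 - c) * p1 else 1

# Stand-ins for profiles simulated at posterior draws of each class.
sims0 = rng.normal(0.0, 1.0, size=(1000, 2))
sims1 = rng.normal(2.0, 1.0, size=(1000, 2))
label = abc_mcmc_classify(np.array([0.1, -0.2]), sims0, sims1)
```

The Monte Carlo average over simulated profiles plays the role of the integral over the posterior, and the kernel supplies the smoothing mentioned in the text.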
Numerical Experiments
We demonstrate the application of the proposed ABC-MCMC classification rule to synthetic LC-MS data generated from a subset of the human proteome containing around 4000 drug targets, compiled as a FASTA file from DrugBank 15 – the same proteome used in the numerical experiments of Ref. 5 – and compare its performance against that of popular classification rules: linear support vector machines (SVMs), LDA, and 3NN. 16 As our interest is in small-sample performance, we selected methods that are simple, known to perform well with small samples, and resistant to overfitting: linear SVMs are widely used in the pattern recognition and machine learning communities and display minimal overfitting, while LDA and 3NN are classical methods well known for their superior small-sample performance. 17
We randomly select 500 of these proteins to play the role of background proteins, along with 20 proteins to serve as biomarkers. Synthetic LC-MS protein abundance data were generated using realistic sample preparation, LC-MS instrument characteristics, and protein quantification parameters – see Table 1. These are the “LC-MS experiment parameters” of Figure 1, which are assumed to be known and are held constant throughout the simulation. (For the peptide efficiency factor, values uniformly distributed in the indicated range are randomly generated for each peptide and then held constant.) As argued in Ref. 5, the values and ranges adopted in Table 1 adequately represent the peptide mixture, peptide abundance mapping, peptide detection and identification, and protein abundance roll-up that are typical of an LC-MS workflow.
The hyperparameter priors for k, θ, ϕ,
Hyperparameter priors used in the experiment.
We consider sample sizes from n = 10 through n = 50 per class, and select d = 3, 5, 8, or 10 proteins from the original 520 proteins using the two-sample t-test (notice that background proteins could be erroneously selected by the t-test, especially for small sample sizes, which makes the experiment realistic).
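The t-test feature selection step can be sketched as follows, using the pooled-variance two-sample t-statistic (the synthetic data below are illustrative):

```python
import numpy as np

def select_by_t_test(X0, X1, d):
    """Rank proteins by the absolute two-sample t-statistic (pooled
    variance) and return the indices of the top d."""
    n0, n1 = X0.shape[0], X1.shape[0]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    pooled = ((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2)
    t = (m0 - m1) / np.sqrt(pooled * (1 / n0 + 1 / n1))
    return np.argsort(-np.abs(t))[:d]

rng = np.random.default_rng(6)
X0 = rng.normal(0.0, 1.0, size=(10, 20))   # control profiles
X1 = rng.normal(0.0, 1.0, size=(10, 20))   # treatment profiles
X1[:, [3, 7]] += 4.0                        # two "biomarker" proteins
top = select_by_t_test(X0, X1, d=2)
```

With small n, a background protein can occasionally outscore a true marker, which is exactly the realistic selection noise mentioned above.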
For the MCMC step, M = 10,000 samples were drawn from the posterior distribution of
A total of 12 runs of the experiment were performed for each combination of sample size, dimensionality, and parameter settings, and the average true error rate for each classification rule was obtained using a large synthetic test set containing 1000 sample points. This is a comprehensive simulation, given the relatively large computational burden required for accurate prior calibration and ABC-MCMC computation.
The root mean square error (RMS) of the test set error estimator, which reflects the expected distance between the estimate and the true error, is bounded by equation (2.29) in Ref. 17 as follows
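If the cited bound takes the standard distribution-free form RMS ≤ 1/(2√m) for a holdout estimator with m independent test points (our assumption about equation (2.29)), then for the 1000-point test set used here:

```python
import math

# Distribution-free bound on the RMS of the test-set (holdout) error
# estimator with m independent test points: RMS <= 1/(2*sqrt(m)).
# (Assuming the cited bound takes this standard form.)
m = 1000
rms_bound = 1.0 / (2.0 * math.sqrt(m))
```

Under that assumption the test-set estimates reported below are accurate to within roughly 0.016 in RMS.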
Effect of sample size
Figure 2 displays the expected error rates of the various classification rules for varying sample size and fixed number of selected proteins d = 8. We can see that, as expected, the expected error rates of all classifiers tend to go down as sample size increases, but the ABC-MCMC classifier has the smallest expected error at small sample sizes. This is in agreement with the predicted superiority of the Bayesian approach in small-sample scenarios. Though the difference in performance among the classification rules may seem to be small, the point to be emphasized is that the ABC-MCMC displays a consistently smaller error rate for small sample sizes.

Expected classification error rates for varying sample size and fixed number of selected proteins d = 8.
Effect of dimensionality
Figure 3 displays the expected error rates of the various classification rules for varying number of selected proteins and fixed sample size n = 10 per class. Here we can see that, as the number of selected proteins increases, expected classification error rates tend to go down at first, but then increase slightly, which is in agreement with the well-known peaking phenomenon of classification. 18 We can see that the ABC-MCMC classification rule displays the smallest expected error rate when d is large, which once again agrees with the prediction that Bayesian methods perform comparatively well under small-sample scenarios (here, small n/d ratio).

Expected classification error rates for varying number of selected proteins and fixed sample size n = 10 per class.
Effect of coefficient of variation
Here we keep both the sample size and the dimensionality fixed at n = 10 per class and d = 8, respectively, and investigate the impact on the classification error rate of increased variability in the true protein concentration values, by changing the value of the coefficient of variation ϕ used to generate the LC-MS data. To accommodate this change, the hyperparameter prior for ϕ is changed from the value displayed in Table 2 to Unif(ϕ0 – 0.1, ϕ0 + 0.1), where ϕ0 is the value used to generate the data. Increasing the coefficient of variation corresponds to the effect of very noisy background proteins in the LC-MS channel. Accordingly, it can be seen in Figure 4 that, as ϕ increases, the expected error rates for all classification rules approach the no-information value 0.5, ie, the same error rate as flipping a coin. However, the expected error rate of the ABC-MCMC classification rule approaches 0.5 rather more slowly than the others, indicating superiority in classifying noisy data.

Expected classification error rates for fixed sample size n = 10 per class, fixed number of selected proteins d = 8, and varying coefficient of variation ϕ.
Effect of peptide efficiency factor
Finally, we investigate the impact of varying the peptide efficiency factor on the classification error rates. We do this by changing the lower bound α in the range for ei displayed in Table 1 from α = 0.1 to a value varying between 0 and 1. The peptide efficiency factor affects how many ions an instrument can detect for a given peptide. Larger values of ei imply a smaller transmission loss for the corresponding peptide. Increasing the lower bound α uniformly increases efficiency for all peptides, which corresponds to a better LC-MS instrument. We can see in Figure 5 that, indeed, the expected classification error rates tend to decrease with an increasing lower bound on the peptide efficiency factor, though somewhat modestly (all other things being equal). We can also observe that, among all algorithms, the ABC-MCMC classification rule displays the smallest error rate over nearly the entire range in the plot.

Expected classification error rates for fixed sample size n = 10 per class, fixed number of selected proteins d = 8, and varying lower bound α for the peptide efficiency factor ei ~ Unif(α, 1).
Conclusion
We proposed in this paper a model-based Bayesian approach for classification of LC-MS proteomics data with the ultimate goal of facilitating biomarker discovery for cancer research. Our approach combines state-of-the-art Bayesian computation techniques, namely, ABC and MCMC, for the calculation of the OBC. As expected, the proposed Bayesian classifier outperforms other approaches when sample size is small or the number of selected proteins to classify is large. We believe that our simulation using a subset of 4000 human protein drug targets and realistic parameter settings is indicative of the performance of the proposed methodology on real data. The challenges associated with designing experiments and obtaining appropriate real data to calibrate and validate the methodology go beyond the scope of the present paper and are intended to be part of future work.
Author Contributions
Conceived and designed the experiments: UB, UBN. Analyzed the data: UB. Wrote the first draft of the manuscript: UB. Contributed to the writing of the manuscript: UBN. Agree with manuscript results and conclusions: UB, UBN. Made critical revisions and approved final version: UB, UBN. Both authors reviewed and approved of the final manuscript.
