Abstract
Proteomics promises to revolutionize cancer treatment and prevention by facilitating the discovery of molecular biomarkers. Progress has been impeded, however, by the small-sample, high-dimensional nature of proteomic data. We propose the application of a Bayesian approach to address this issue in the classification of proteomic profiles generated by liquid chromatography-mass spectrometry (LC-MS). Our approach relies on a previously proposed model of the LC-MS experiment, as well as on the theory of the optimal Bayesian classifier (OBC). Computation of the OBC requires combining a likelihood-free methodology called approximate Bayesian computation (ABC) with Markov chain Monte Carlo (MCMC) sampling. Numerical experiments using synthetic LC-MS data based on an actual human proteome indicate that the proposed ABC-MCMC classification rule outperforms classical methods such as support vector machine, linear discriminant analysis, and 3-nearest neighbor classification rules when the sample size is small or the number of selected proteins used for classification is large.
Introduction
Recent advances in high-throughput technologies in proteomics promise to revolutionize cancer treatment and prevention by facilitating the discovery of molecular biomarkers, which can be used to improve diagnosis, guide targeted therapy, and monitor therapeutic response. 1 Among all high-throughput proteomic technologies, mass spectrometry has increasingly become the method of choice for the analysis of complex protein samples. 2 High molecular specificity and excellent detection sensitivity explain the widespread adoption of mass spectrometry (MS)-based proteomics as a popular tool for the identification and quantification of the composition of complex proteome mixtures.
However, to date, the rate of discovery of successful biomarkers is still unsatisfactory. In addition to challenges such as the high dynamic range of proteins 3 and inaccurate protein quantification, 4 an important impediment to progress is that, in clinical applications of mass spectrometry, the number of samples available is extremely small, whereas mass spectra contain hundreds of thousands of intensity measurements with signals generated by thousands of proteins/peptides. This small-sample, high-dimensionality problem requires the experiment and analysis to be carefully designed and validated in order to arrive at statistically meaningful results. Through model-based approaches and simulation using ground-truthed synthetic data, the problem of biomarker discovery can be studied and evaluated.
In this paper, we propose the application of a Bayesian approach to address the small-sample, high-dimensionality problem in the classification of proteomic profiles generated by liquid chromatography–mass spectrometry (LC-MS). Our approach relies on the detailed LC-MS experiment pipeline model developed in Ref. 5, as well as on the theory of the optimal Bayesian classifier (OBC), proposed in Ref. 6. However, the complexity of the LC-MS experiment, involving steps of sample preparation, protein digestion, peptide ionization, peptide detection, and protein quantification, implies that the likelihood function for the LC-MS model is exceedingly complex, requiring the application of a likelihood-free Bayesian approach. In this paper, we apply a new likelihood-free methodology called approximate Bayesian computation (ABC). 7 The basic ABC rejection sampling method generates candidate parameters by sampling from the prior distribution and creates a model-based simulated dataset. If the dataset conforms to the observed dataset, the candidate can be retained as a sample from the posterior distribution. Thus, one can avoid evaluating the likelihood function, which is essential for classical Bayesian posterior simulation methods. The ABC approach can also be implemented via a combination of rejection sampling and Markov chain Monte Carlo (MCMC) sampling. 8
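To illustrate the basic ABC rejection mechanism described above, here is a sketch on a deliberately simple toy model (a Gaussian mean, not the LC-MS model itself); the prior range, summary statistic, and tolerance are all illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data from an unknown model; here a toy stand-in for an LC-MS run.
observed = rng.normal(loc=2.0, scale=1.0, size=50)
obs_summary = observed.mean()  # summary statistic of the observed dataset

def simulate(theta, rng):
    """Forward-simulate a dataset given candidate parameter theta."""
    return rng.normal(loc=theta, scale=1.0, size=50)

def abc_rejection(n_samples, tolerance, rng):
    """Basic ABC rejection sampler: keep candidates whose simulated
    summary statistic lands within `tolerance` of the observed one."""
    accepted = []
    while len(accepted) < n_samples:
        theta = rng.uniform(-5.0, 5.0)   # draw candidate from the prior
        sim = simulate(theta, rng)       # model-based simulated dataset
        if abs(sim.mean() - obs_summary) < tolerance:
            accepted.append(theta)       # approximate posterior draw
    return np.array(accepted)

posterior = abc_rejection(n_samples=200, tolerance=0.3, rng=rng)
```

Note that the likelihood function is never evaluated: conformity between simulated and observed summary statistics takes its place.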
The detailed implementation of our approach involves first the prior calibration of the hyperparameters of the LC-MS model using an ABC approach via rejection sampling and then using the ABC method implemented via an MCMC procedure to obtain samples from the posterior distribution of the protein concentrations, which are used to approximate the OBC using Monte Carlo integration and kernel smoothing. Numerical experiments using synthetic LC-MS data based on an actual human proteome indicate that the ABC-MCMC classification rule outperforms classical methods such as support vector machines (SVMs), linear discriminant analysis (LDA), and 3-nearest neighbor (3NN) classifiers in the case when sample size is small or the number of selected proteins used to classify is large. We also quantify the effect of experimental parameters such as the coefficient of variation (noise) and instrument peptide efficiency factor on classification accuracy.
The paper is organized as follows. The “LC-MS Model” section surveys the LC-MS model proposed in Ref. 5, which is the basis for our inference approach. The ‘ABC-MCMC Classification Framework” section describes in detail the algorithms for prior calibration, sampling from the posterior, and computation of the ABC-MCMC classifier. The “Numerical Experiments” section presents the results of a numerical experiment using synthetic LC-MS data corresponding to a subset of the human proteome. Finally, the “Conclusion” section brings concluding remarks.
LC-MS Model
Here, we describe briefly the label-free LC-MS model proposed in Ref. 5. Two sample classes are considered, control (class 0) and treatment (class 1). There are n sample profiles from each class, sharing Npro protein species from a specified proteome, which is typically input into the model as a FASTA file. As argued in Ref. 9, protein concentration in the control sample is best described as a Gamma distribution,
According to whether there is a significant difference in abundance between control and treatment populations, proteins are divided into biomarker (differentially expressed) proteins and background (not differentially expressed) proteins. The difference in abundance for biomarker proteins is quantified by the fold change,
The multivariate Gaussian distribution is recommended as the model for protein concentration variations in each class. 10
Accordingly, the protein expression level for the lth protein in the jth sample profile is modeled as
In this paper, we assume a diagonal covariance matrix
The coefficient of variation ϕ is calibrated based on the observed data.
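As a minimal sketch of this sampling model, assuming (per the usual definition of the coefficient of variation) that each protein's standard deviation is ϕ times its mean, with a diagonal covariance so proteins are drawn independently:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_expression(mean_conc, phi, n_samples, rng):
    """Draw protein expression profiles from independent Gaussians
    (diagonal covariance), with standard deviation phi * mean so that
    phi plays the role of a per-protein coefficient of variation."""
    mean_conc = np.asarray(mean_conc, dtype=float)
    std = phi * mean_conc
    return rng.normal(loc=mean_conc, scale=std,
                      size=(n_samples, mean_conc.size))

# Illustrative baseline concentrations for three proteins.
profiles = sample_expression([100.0, 250.0, 40.0], phi=0.1,
                             n_samples=1000, rng=rng)
```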
In order to perform in silico tryptic digestion of the protein samples, we use the peptide mixture model from OpenMS. 11
Let ωi be the set of all proteins that contain the ith peptide. If there are Npep peptide species in total across all proteins in a given sample, then their molar concentrations are given as
In general, ion abundance in MS data bears the signature of the concentration of a peptide type, say peptide i in sample j. Taking measurement uncertainty factors into consideration, the expected readout μij of the abundance of this peptide can be modeled as
The true peptide abundance differs from its readout due to noise. Accordingly, the actual abundance of a peptide vij is modeled as vij = μij + ∊ij, where ∊ij is additive Gaussian noise and follows the distribution
Peptide signals observed in mass spectra are in fact the superposition of true signals, interfering noise, and signals from other peptides. Therefore, the signal-to-noise ratio (SNR) strongly affects the true positive rate (TPR). To account for this, we describe the SNR as
Taking interfering signals into consideration, the TPR of peptides is defined as
Finally, we consider in our model three peptide filters, in order: (1) nonunique peptides present in more than one protein of the proteome in study are discarded; (2) peptides with missing value rates greater than 0.7 are discarded; and (3) among the remaining peptides, those having correlation larger than 0.6 with all other peptides are kept.
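The three filters above can be sketched as follows; the mean imputation used in the correlation step and the literal reading of filter (3) (a kept peptide must correlate above 0.6 with every other surviving peptide) are our own assumptions:

```python
import numpy as np

def filter_peptides(abundance, parent_counts, miss_thresh=0.7, corr_thresh=0.6):
    """Apply the three peptide filters in order (sketch).
    abundance: (n_samples, n_peptides) array, np.nan marks missing readouts.
    parent_counts: number of proteome proteins containing each peptide."""
    keep = np.ones(abundance.shape[1], dtype=bool)

    # (1) discard non-unique peptides (present in more than one protein)
    keep &= (np.asarray(parent_counts) == 1)

    # (2) discard peptides whose missing-value rate exceeds miss_thresh
    miss_rate = np.isnan(abundance).mean(axis=0)
    keep &= (miss_rate <= miss_thresh)

    # (3) among the rest, keep peptides whose correlation with every other
    # surviving peptide exceeds corr_thresh; missing values are imputed
    # with the column mean before computing correlations (assumption)
    idx = np.flatnonzero(keep)
    if idx.size < 2:
        return keep  # correlation filter needs at least two peptides
    filled = np.where(np.isnan(abundance),
                      np.nanmean(abundance, axis=0), abundance)
    corr = np.corrcoef(filled[:, idx], rowvar=False)
    np.fill_diagonal(corr, 1.0)
    ok = (corr > corr_thresh).all(axis=1)
    final = np.zeros_like(keep)
    final[idx[ok]] = True
    return final
```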
The MS1 output provides information about detected peptides, their abundances, and related characteristics. The process of filtering these data and compiling the parent protein abundances from the raw peptide data is called protein abundance roll-up. To obtain the identities of the parent proteins from captured peptide sequence information, one will often use a second round of MS and search available MS/MS (MS2) databases. Alternatively, the accurate mass and time approach matches peptides to databases using monoisotopic mass and elution time predictors, obviating the need for a second step of MS. 13
We will assume here that data are available in the form of rolled-up abundances, whereby the readout of protein l in sample j can be written as
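As a rough sketch of the roll-up step, assuming a simple summation roll-up (one common scheme; the model in Ref. 5 may use a different aggregation):

```python
import numpy as np

def rollup_protein_abundance(peptide_abund, parent_protein, n_proteins):
    """Sum filtered peptide abundances into parent-protein abundances
    (summation roll-up; one of several common schemes, assumed here).
    peptide_abund: (n_samples, n_peptides); parent_protein: index of the
    unique parent protein of each peptide."""
    peptide_abund = np.asarray(peptide_abund, dtype=float)
    out = np.zeros((peptide_abund.shape[0], n_proteins))
    for pep, prot in enumerate(parent_protein):
        # missing peptide readouts contribute zero to the parent protein
        out[:, prot] += np.nan_to_num(peptide_abund[:, pep])
    return out
```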
ABC-MCMC Classification Framework
Bayesian analysis of the complex models used in recent applications involves intractable likelihood functions, which has prompted the development of new algorithms generally called approximate Bayesian computation (ABC). In this approach, one generates candidate parameters by sampling from the prior distribution and creates a model-based simulated dataset. If the dataset conforms to the observed dataset, the candidate can be retained as a sample from the posterior distribution. Thus, one can avoid evaluating the likelihood function, which is essential for classical Bayesian posterior simulation methods. The ABC approach can be implemented via rejection sampling, MCMC, and sequential Monte Carlo methods. 8 Utilizing the LC-MS proteomics model described in the last section, we first perform prior calibration of the hyperparameters using an ABC approach via rejection sampling, and then use the ABC method implemented via an MCMC procedure to obtain samples from the posterior distribution of the protein concentrations in order to derive the ABC-MCMC classifier for LC-MS data.
Overview of the inference procedure
The sample data S = S0 ∪ S1 consist of two subsamples S0 and S1 corresponding to the control group (eg, healthy volunteers) and treatment group (eg, cancer patients), respectively, where each subsample contains n protein abundance profiles. Given the sample data, the total number of proteins Npro is reduced via feature selection (eg, ranking by the two-sample t-test statistic) to a tractable number d of selected proteins. According to the adopted LC-MS model, described in the “LC-MS Model” section, the protein abundance profiles are a function of the baseline protein concentration vector

Relationship among all parameters of the LC-MS model (see text).
LC-MS parameters used in the experiment.
Our approach consists of treating
Prior calibration of k, θ, and ϕ using ABC rejection sampling.
Generate Mcal triplets of parameters {k(t), θ(t), ϕ(t)} such that,
Simulate a control sample set
Accept the triplet {k(t), θ(t), ϕ(t)} if |
Let
Similar Monte Carlo integrations are performed to calculate θcal and ϕcal.
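A toy version of this calibration loop, using a Gamma concentration model with illustrative hyperpriors and a (mean, standard deviation) summary statistic of our own choosing (the paper's actual summaries and tolerances differ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "observed" control sample: protein concentrations from a Gamma model.
k_true, theta_true = 2.0, 3.0
observed = rng.gamma(shape=k_true, scale=theta_true, size=(20, 30))
obs_stat = np.array([observed.mean(), observed.std()])

def calibrate(m_cal, tol, rng):
    """ABC rejection calibration of (k, theta): draw from hyperpriors,
    simulate a control set, accept when summary statistics are close,
    and return the mean of the accepted draws (Monte Carlo integration)."""
    accepted = []
    for _ in range(m_cal):
        k = rng.uniform(0.5, 5.0)      # illustrative hyperprior for k
        theta = rng.uniform(0.5, 6.0)  # illustrative hyperprior for theta
        sim = rng.gamma(shape=k, scale=theta, size=observed.shape)
        stat = np.array([sim.mean(), sim.std()])
        if np.linalg.norm(stat - obs_stat) < tol:
            accepted.append((k, theta))
    return np.mean(accepted, axis=0)   # calibrated hyperparameter values

k_cal, theta_cal = calibrate(m_cal=5000, tol=0.8, rng=rng)
```

The calibrated values concentrate near the generating hyperparameters because both summary statistics jointly pin down (k, θ).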
Prior calibration via ABC rejection sampling
Calibration of the hyperparameters k, θ, ϕ,
First, we calibrate k, θ, and ϕ using the control sample only, since these parameters are common across control and treatment populations and
Next we calibrate the fold change parameter
Posterior sampling via an ABC-MCMC procedure
After prior calibration, we would like now to draw samples from the posterior distribution of the protein baseline expression vector
Prior calibration of fl, l = 1, …, d, using ABC rejection sampling.
Generate Mcal baseline expression values
Simulate a control sample
Accept
Generate Mcal fold change parameters
Simulate a treatment sample
Accept
Let
If λ0 > λ1, then assign fcal,l = 1 (ie, background protein) and return from the algorithm.
Otherwise, fcal,l ≠ 1 (ie, marker protein). For all the accepted altered expression means, we perturb each of the fold changes
The mean of all accepted fold change parameters in step 9 is a reasonably accurate fold change fcal,l for the given protein.
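A single-protein sketch of this fold-change calibration, with an illustrative Gaussian abundance model, uniform fold-change prior, and tolerance (none of which are prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup: one protein with baseline mean 100 and true fold change 2
# between control and treatment populations.
baseline, phi, n = 100.0, 0.1, 25
f_true = 2.0
treat_obs = rng.normal(f_true * baseline, phi * f_true * baseline, size=n)

def calibrate_fold_change(m_cal, tol, rng):
    """ABC rejection sketch for one protein's fold change: count how often
    the background hypothesis (f = 1) versus a marker fold change drawn
    from the prior reproduces the observed treatment sample mean."""
    obs_mean = treat_obs.mean()
    lam0 = lam1 = 0
    accepted_f = []
    for _ in range(m_cal):
        # background candidate: f = 1
        sim0 = rng.normal(baseline, phi * baseline, size=n)
        if abs(sim0.mean() - obs_mean) < tol:
            lam0 += 1
        # marker candidate: f drawn from an (illustrative) prior
        f = rng.uniform(0.25, 4.0)
        sim1 = rng.normal(f * baseline, phi * f * baseline, size=n)
        if abs(sim1.mean() - obs_mean) < tol:
            lam1 += 1
            accepted_f.append(f)
    if lam0 > lam1:
        return 1.0                      # declared a background protein
    return float(np.mean(accepted_f))   # marker: mean of accepted f's

f_cal = calibrate_fold_change(m_cal=4000, tol=5.0, rng=rng)
```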
Optimal Bayesian classifier
Let ψ: Rd → {0, 1} be a classifier that takes a protein abundance profile
Posterior sampling of γ using an ABC-MCMC procedure.
Generate
Simulate control and treatment samples
Accept
For t = 0, 1, …, ts, ts + 1, …, ts + M, where ts is the burn-in period, repeat:
Generate
Simulate control and treatment samples
Let
Accept
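The MCMC steps above follow the general likelihood-free MCMC scheme of Ref. 8; a toy scalar version, with an illustrative flat prior and symmetric random-walk proposal, is:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy observed sample whose mean parameter we want a posterior for.
observed = rng.normal(5.0, 1.0, size=40)
obs_mean = observed.mean()

def log_prior(theta):
    # Flat prior on [0, 10] (illustrative choice).
    return 0.0 if 0.0 <= theta <= 10.0 else -np.inf

def abc_mcmc(n_keep, burn_in, eps, step, rng):
    """Likelihood-free MCMC: propose a move, simulate a dataset at the
    proposal, and accept only if the simulated summary lands within eps
    of the observed one AND the Metropolis prior-ratio test passes
    (the proposal is symmetric, so it cancels in the ratio)."""
    theta = 5.0   # start inside the prior support
    chain = []
    for t in range(burn_in + n_keep):
        prop = theta + step * rng.standard_normal()
        sim = rng.normal(prop, 1.0, size=observed.size)
        if abs(sim.mean() - obs_mean) < eps:
            if np.log(rng.uniform()) < log_prior(prop) - log_prior(theta):
                theta = prop
        if t >= burn_in:
            chain.append(theta)   # keep post-burn-in states
    return np.array(chain)

chain = abc_mcmc(n_keep=2000, burn_in=500, eps=0.4, step=0.5, rng=rng)
```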
Now, consider a Bayesian setting, where the joint distribution of (
The OBC 6 is the classifier that minimizes the quantity in (15):
In the present case of the LC-MS model discussed in the “LC-MS Model” section, the random parameter vector
Now, the densities p(
In addition, we will assume c to be known (eg, from epidemiological data) and fixed, so E[c | S] = c. After some simplification, the resulting OBC, which we call an ABC-MCMC Bayesian classifier, is a kernel-based classifier given by
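A sketch of such a kernel-based classifier, assuming a Gaussian kernel and taking as given two sets of profiles simulated at posterior parameter draws for each class (the stand-in samples below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_kernel_density(x, samples, h):
    """Kernel-smoothed density estimate at x from Monte Carlo samples."""
    d = samples.shape[1]
    diff = (x - samples) / h
    return (np.mean(np.exp(-0.5 * np.sum(diff**2, axis=1)))
            / (h**d * (2 * np.pi)**(d / 2)))

def abc_mcmc_classify(x, sims0, sims1, c=0.5, h=0.5):
    """Approximate OBC: label 0 iff the class-0 effective density,
    weighted by the class-0 prior probability c, dominates."""
    p0 = gaussian_kernel_density(x, sims0, h)
    p1 = gaussian_kernel_density(x, sims1, h)
    return 0 if c * p0 >= (1 - c) * p1 else 1

# Stand-ins for profiles simulated at posterior draws of each class.
sims0 = rng.normal(0.0, 1.0, size=(1000, 2))
sims1 = rng.normal(2.0, 1.0, size=(1000, 2))
label = abc_mcmc_classify(np.array([0.1, -0.2]), sims0, sims1)
```

The Monte Carlo average over simulated profiles plays the role of the integral over the posterior, and the kernel supplies the smoothing mentioned in the text.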
Numerical Experiments
We demonstrate the application of the proposed ABC-MCMC classification rule to synthetic LC-MS data generated from a subset of the human proteome containing around 4000 drug targets, compiled as a FASTA file from DrugBank 15 – the same proteome used in the numerical experiments of Ref. 5 – and compare its performance against that of popular classification rules: linear support vector machines (SVMs), LDA, and 3NN. 16 As our interest is in small-sample performance, we selected methods that are simple, known to perform well with small samples, and resistant to overfitting: linear SVMs are widely used in the pattern recognition and machine learning communities and display minimal overfitting, while LDA and 3NN are classical methods well known for their superior small-sample performance. 17
We randomly select 500 of these proteins to play the role of background proteins, along with 20 proteins to serve as biomarkers. Synthetic LC-MS protein abundance data were generated using realistic sample preparation, LC-MS instrument characteristics, and protein quantification parameters – see Table 1. These are the “LC-MS experiment parameters” of Figure 1, which are assumed to be known and are held constant throughout the simulation. (For the peptide efficiency factor, values uniformly distributed in the indicated range are randomly generated for each peptide and then held constant.) As argued in Ref. 5, the values and ranges adopted in Table 1 adequately represent the peptide mixture, peptide abundance mapping, peptide detection and identification, and protein abundance roll-up that are typical of an LC-MS workflow.
The hyperparameter priors for k, θ, ϕ,
Hyperparameter priors used in the experiment.
We consider sample sizes from n = 10 through n = 50 per class, and select d = 3, 5, 8, or 10 proteins from the original 520 proteins using the two-sample t-test (notice that background proteins could be erroneously selected by the t-test, especially for small sample sizes, which makes the experiment realistic).
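The t-test feature selection step can be sketched as follows, using the pooled-variance two-sample t-statistic (the synthetic data below are illustrative):

```python
import numpy as np

def select_by_t_test(X0, X1, d):
    """Rank proteins by the absolute two-sample t-statistic (pooled
    variance) and return the indices of the top d."""
    n0, n1 = X0.shape[0], X1.shape[0]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    pooled = ((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2)
    t = (m0 - m1) / np.sqrt(pooled * (1 / n0 + 1 / n1))
    return np.argsort(-np.abs(t))[:d]

rng = np.random.default_rng(6)
X0 = rng.normal(0.0, 1.0, size=(10, 20))   # control profiles
X1 = rng.normal(0.0, 1.0, size=(10, 20))   # treatment profiles
X1[:, [3, 7]] += 4.0                        # two "biomarker" proteins
top = select_by_t_test(X0, X1, d=2)
```

With small n, a background protein can occasionally outscore a true marker, which is exactly the realistic selection noise mentioned above.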
For the MCMC step, M = 10,000 samples were drawn from the posterior distribution of
A total of 12 runs of the experiment were performed for each combination of sample size, dimensionality, and parameter settings, and the average true error rate for each classification rule was obtained using a large synthetic test set containing 1000 sample points. This is a comprehensive simulation, given the relatively large computational burden required for accurate prior calibration and ABC-MCMC computation.
The root mean square error (RMS) of the test set error estimator, which reflects the expected distance between the estimate and the true error, is bounded by equation (2.29) in Ref. 17 as follows
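If the cited bound takes the standard distribution-free form RMS ≤ 1/(2√m) for a holdout estimator with m independent test points (our assumption about equation (2.29)), then for the 1000-point test set used here:

```python
import math

# Distribution-free bound on the RMS of the test-set (holdout) error
# estimator with m independent test points: RMS <= 1/(2*sqrt(m)).
# (Assuming the cited bound takes this standard form.)
m = 1000
rms_bound = 1.0 / (2.0 * math.sqrt(m))
```

Under that assumption the test-set estimates reported below are accurate to within roughly 0.016 in RMS.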
Effect of sample size
Figure 2 displays the expected error rates of the various classification rules for varying sample size and fixed number of selected proteins d = 8. We can see that, as expected, the expected error rates of all classifiers tend to go down as sample size increases, but the ABC-MCMC classifier has the smallest expected error at small sample sizes. This is in agreement with the predicted superiority of the Bayesian approach in small-sample scenarios. Though the difference in performance among the classification rules may seem to be small, the point to be emphasized is that the ABC-MCMC displays a consistently smaller error rate for small sample sizes.

Expected classification error rates for varying sample size and fixed number of selected proteins d = 8.
Effect of dimensionality
Figure 3 displays the expected error rates of the various classification rules for varying number of selected proteins and fixed sample size n = 10 per class. Here we can see that, as the number of selected proteins increases, expected classification error rates tend to go down at first, but then increase slightly, which is in agreement with the well-known peaking phenomenon of classification. 18 We can see that the ABC-MCMC classification rule displays the smallest expected error rate when d is large, which once again agrees with the prediction that Bayesian methods perform comparatively well under small-sample scenarios (here, small n/d ratio).

Expected classification error rates for varying number of selected proteins and fixed sample size n = 10 per class.
Effect of coefficient of variation
Here we keep both the sample size and the dimensionality fixed at n = 10 per class and d = 8, respectively, and investigate the impact on the classification error rate of increased variability in the true protein concentration values, by changing the value of the coefficient of variation ϕ used to generate the LC-MS data. To accommodate this change, the hyperparameter prior for ϕ is changed from the value displayed in Table 2 to Unif(ϕ0 – 0.1, ϕ0 + 0.1), where ϕ0 is the value used to generate the data. Increasing the coefficient of variation corresponds to the effect of very noisy background proteins in the LC-MS channel. Accordingly, it can be seen in Figure 4 that, as ϕ increases, the expected error rates for all classification rules approach the no-information value 0.5, ie, the same error rate as flipping a coin. However, the expected error rate of the ABC-MCMC classification rule approaches 0.5 rather more slowly than the others, indicating superiority in classifying noisy data.

Expected classification error rates for fixed sample size n = 10 per class, fixed number of selected proteins d = 8, and varying coefficient of variation ϕ.
Effect of peptide efficiency factor
Finally, we investigate the impact of varying the peptide efficiency factor on the classification error rates. We do this by changing the lower bound α in the range for ei displayed in Table 1 from α = 0.1 to a value varying between 0 and 1. The peptide efficiency factor affects how many ions an instrument can detect for a given peptide. Larger values of ei imply a smaller transmission loss for the corresponding peptide. Increasing the lower bound α uniformly increases efficiency for all peptides, which corresponds to a better LC-MS instrument. We can see in Figure 5 that, indeed, the expected classification error rates tend to decrease with an increasing lower bound on the peptide efficiency factor, though somewhat modestly (all other things being equal). We can also observe that, among all algorithms, the ABC-MCMC classification rule displays the smallest error rate over nearly the entire range in the plot.

Expected classification error rates for fixed sample size n = 10 per class, fixed number of selected proteins d = 8, and varying lower bound α for the peptide efficiency factor ei ~ Unif(α, 1).
Conclusion
We proposed in this paper a model-based Bayesian approach for classification of LC-MS proteomics data with the ultimate goal of facilitating biomarker discovery for cancer research. Our approach combines state-of-the-art Bayesian computation techniques, namely, ABC and MCMC, for the calculation of the OBC. As expected, the proposed Bayesian classifier outperforms other approaches when sample size is small or the number of selected proteins to classify is large. We believe that our simulation using a subset of 4000 human protein drug targets and realistic parameter settings is indicative of the performance of the proposed methodology on real data. The challenges associated with designing experiments and obtaining appropriate real data to calibrate and validate the methodology go beyond the scope of the present paper and are intended to be part of future work.
Author Contributions
Conceived and designed the experiments: UB, UBN. Analyzed the data: UB. Wrote the first draft of the manuscript: UB. Contributed to the writing of the manuscript: UBN. Agree with manuscript results and conclusions: UB, UBN. Made critical revisions and approved final version: UB, UBN. Both authors reviewed and approved of the final manuscript.
