Abstract
Proteomics studies based on mass spectrometry (MS) are gaining popular applications in biomedical research for protein identification/quantification and biomarker discovery, especially for potential early diagnosis and prognosis of severe disease before the occurrence of symptoms. However, MS data collected using current technologies are very noisy and appropriate data preprocessing is critical for successful applications of MS-based approaches. Among various data preprocessing steps, peak alignment from multiple spectra based on detected peak sample locations presents special statistical challenges when effective experimental calibration is not feasible due to relatively large peak location variation. To avoid intensive tuning parameter optimization, we propose a simple novel Bayesian algorithm “random grafting-pruning Markov chain Monte Carlo (RGPMCMC)” that can be applied to global MS peak alignment and to follow certain modelbased sample classification criterion for using aligned peaks to classify spectrum samples. The usefulness of our approach is demonstrated through simulation study by making extensive comparison with other algorithms in the literature. Its application to an ovarian cancer MALDI-MS data set achieves a smaller 10-fold cross validation error rate than other current large scale methodologies.
Keywords
Introduction
Genomics and proteomics technologies offer much promise in our understanding of fundamental biological processes by allowing us to simultaneously monitor the expression levels of tens of thousands of genes and proteins. Since proteins are the basic functioning units in the cells, there is a great interest to characterize individual molecular profiles based on proteomics for reliable biomarker discovery and effective disease diagnosis/prognosis/treatment. In proteomics research, mass spectrometry (MS) is the most widely used instrument to allow for the mass measurement of molecules, where a mass spectrometer determines chemical compounds’ molecular weight by ionizing, separating, and measuring molecular ions according to their mass-to-charge ratio (m/z: unit Da) and a mass spectrum is the standard data output for analysis and interpretation (Link et al. 1999), where the x-axis represents m/z value (Da) and the y-axis represents intensity (enrichment of particles with certain m/z). Recently, Yu et al. (2005) reviewed current approaches on the extraction of the most relevant information from the raw mass spectra to identify disease biomarkers. A standard MS data analysis usually involves background noise removal, smoothing, intensity normalization, peak identification and alignment, and biomarker identification. The false positive and false negative peaks may exist in taking local maxima as peaks (Coombes et al. 2003) and peak location variation may be due to differences in sample preparation, chemical noise, co-crystallization, deposition of the matrix-sample onto the target, and laser position on the target among others. Several statistical methods have been proposed to reduce the background noise (Coombes et al. 2003; Satten et al. 2004; Coombes et al. 2005a and Randolph and Yasui, 2006). Morris et al. (2005) applied translation-invariant wavelet transformations to the raw spectra and performed peak detection using the mean spectrum derived from a group of spectra, where they assumed calibration can be done in advance experimentally or by interpolation to make common peaks stay closely together, and only false negatives are possible. However, Coombes et al. (2005b) pointed out that, peak location variation, caused by a spread of initial particle velocities at the starting end of mass spectrometer tube, makes calibration more difficult. Compared to background noise removal, peak identification and alignment is more challenging and critical by providing the links of underlying peptides across all spectra. However, existing algorithms for peak alignment are mostly ad hoc and based on heuristic arguments (Nielsen et al. 1998; Johnson et al. 2003; Torgrip et al. 2003; Eilers, 2004; Tibshirani et al. 2004 and Randolph and Yasui 2006), where some parameters need to be optimized empirically and/or subjectively. In this paper, we work on the detected peak samples (possibly with associated intensities) coming from certain protocol. In view of hundreds of features (peaks) in each spectrum, due to complex chemical and physical mechanisms undergoing the mass spectrometry, throughout this paper, we assume that substantial individual peak location variation is existent, say up to one half of the interval between neighboring peaks, thus calibration may be in lack of power. Moreover, we consider statistically false positives and negatives to make our model more accountable. Overall, we will assume that each set of potential peak samples corresponding to the true peak follows an individual composite distribution regulated by true peak location, peak sample location variation, false positive and false negative rates. An effective simple Bayesian MCMC algorithm is proposed to do peak alignment (biomarker identification) and downstream sample classification, where all individual sets of parameters for each true potential peak are able to be estimated in a universal modeling framework without intensive tuning parameter optimization. Our algorithm is much computationally simpler than other available MCMC algorithms which could be applied to MS alignment while retaining competing performance.
The rest of this article is organized as follows: Section 2 introduces Bayesian dimension matching problem and proposes our simplified approach; Section 3 summarizes simulation results to demonstrate the efficiency and reliability of our algorithm; Section 4 illustrates the applications of our approach to a real MS data set; Section 5 considers joint analysis of peak alignment and sample classification; Section 6 concludes with discussions; and some technical details are given in the appendix.
Methods
Dimension-Matching Statistical Model
We now develop a simple novel MCMC algorithm, random grafting-pruning Markov chain Monte Carlo (RGPMCMC) which can be applied to MS peak alignment from mass charge ratio information. Within Bayesian framework, we are often interested in the posterior distribution of the dimension-varying parameter θ. The prior distribution π(θ)is represented as
where the denominator is the normalization constant not needed for posterior sampling. Change point model usually involves dimension-matching in two typical cases: one single ordered series where change points are taken as those separating successive discrete points, and multiple ordered series where change points correspond to physical locations in the continuous space. For the discrete case, the partitioning and wrapping up the segments leads to an exponentially increasing computational cost as the model space grows (Denison et al. 2002), and frequentist's approaches only either work on very few change points or special algorithms (Guan, 2004; Olshen and Venkatraman, 2004). Recently, Fearnhead (2005) proposed an exact non-MCMC sampler by recursive partition, and Loschi, Cruz and Arellano-Valle (2005) introduced the product partition model based Bayesian algorithm which stems from Yao (1984) and Barry and Hartigan (1992, 1993). For the continuous case, the readers are referred to sampling-based algorithms: the reversible jump MCMC (RJMCMC) by Green (1995), the birth-and-death process MCMC (BDMCMC) and continuous time process MCMC (CTMCMC) by Stephens (2000a, 2000b), Bayesian cluster detection in maps (Knorr-Held and Raßer, 2000) and others. Cappέ, Robert and Rydέn (2003) showed that, the acceptance probability of the usual MCMC methods is replaced by differential holding times in BDMCMC, and RJMCMC converges to a limiting continuous time birth-and-death process on an appropriate rescaling of time. They also demonstrated that, RJMCMC and CTMCMC have similar computational performance while the latter demands expensive death rate computation. Obviously, discretization is neither always suitable nor efficient for sophisticated change point identification in the continuous space. The present work introduces a simple MCMC algorithm in the context of multiple MS peak alignment by considering all uncertainties including peak number, peak locations, peak sample location variations, false negative and false positive rates. Neither the error-prone Jacobian terms in RJMCMC, the intensive death rate calculation in CTMCMC, nor the computationally expensive recursive partition in other algorithms is needed by our method.
Before starting with our statistical model, we assume local maxima (discrete locations) have been detected as peak samples for each raw mass spectrum. In MS peak alignment, peak sample location variation may linearly depend on m/z magnitude (Yasui et al. 2003), and log-transformation of m/z achieves peak sample location variation homogeneity, this observation will justify the identical prior specification for peak sample location variations across true peaks (details later). For notational simplicity, we use m/z instead of log (m/z) in the following discussions and assume that the m/z domain is [(m/z)min, (m/z)max]. The underlying K-dimensional true peak location vector is

Mass Spectrum Biomarker Model. (The left column is spectrum index, the vertical lines are putative true peaks, the horizontal circle lines are peak samples of each spectrum. The lower left dash rectangle represents a false negative case and the upper right dash rectangle represents a false positive case.)
We now describe the motivation for such a likelihood function which is crucial for Bayesian inference. The matching by minimum distance
Motivated by Green (1995), our algorithm is based on a redesigned universal “naively informative” parameter proposal involving peaks and the other parameters concurrently without the need for Jacobian terms. We also propose four move types (+, –, H, S), where “+” means peak birth proposal, “–” means peak death proposal, “H” means parameter (σ
2
, fn and fp) proposal excluding peaks and “S” means peak location mutation with peak number unchanged. We specify (+, –, H, S) probabilities as
Parameter Sampling Process
First we choose one of these four move types based on move type probabilities (π(+), π(–), π(H), π(S)), where π(+) = π(–).
For the “+” move type, we describe the parameter proposal process: If Kold = Kmax, we go to 1) since the upper threshold is reached; if Kold < Kmax, we randomly sample one of the Kold + 1 intervals formed by current Kold peaks, say (sj, sj+1), with equal probability 1/(Kold + 1). We may assign j + 1 to this new peak index * and the following indexes increase by one accordingly. Within this sampled interval, we propose (
True peak location proposal for peak birth:
where gs*(U1; sj, sj+1) is a one-to-one mapping from random variable U1 to peak location s* given sj and sj+1. We take
Peak sample location variance proposal for peak birth:
where
Peak sample false negative and false positive rate proposal for peak birth:
where
the false negative and false positive rate proposal is realized in two sequential steps:
O* proposal:
i.e. (O*/O)1/2/(Oj+1/O*)1/2 = g3(U3), where g3(·) is any monotonic function with domain [0, 1] and range [0, ∞), and U3 ~U[0, 1]. It can be seen that, O* is a monotonically increasing function of U3, we simply use g3(u) = u/(1 – u).
R* proposal:
i.e. (R*/Rj)1/2/(Rj+1/R*)1/2 = g4(U4), where g4(·) is any monotonic function with domain [0, 1] and range [0, ∞), and U4 ~ U[0, 1]. It can be seen that, R* is a monotonically increasing function of U4, we simply use g4(u) = u/(1 –u).
Note that the constraint 0 ≤ fn* + fp* ≤ 1 holds under this proposal. The
When the insertion of the peak birth candidate is before the first peak or after the last peak, there are no real double neighbors. In this case, we take the duplicates of peak birth candidate's unique succeeding or preceding neighbor as two virtual neighbors for proposal implementation. In this move type, we realize the sequential uniform lift for fn*+fp* and subsequent conditional uniform lift for fn* within fn* + fp*. We claim that, the peak birth proposal by “+” move type, along with the peak death proposal by the following “–” move type, constructs a symmetric transition, i.e. equally probable events (Proposition 3). So the acceptance probability in the Metropolis-Hastings algorithm within Gibbs sampler is simply
For the “–” move type, the parameter proposal process is: If Kold = Kmim we go to 1) since the lower threshold is reached; if Kold > Kmim we randomly sample one from current Kold peaks with equal probability 1/Kold, say index *, to delete. Then we simply abandon the associated
The “+/–” move type is demonstrated in Figure 2 for only peak location model.
Symmetric Transition for Peak Birth/death Proposal. (Each rectangle represents a uniform peak birth proposal domain, each internal vertical boundary represents a possible peak death proposal location.)
For the “S” move type, the parameter proposal process is: We randomly sample one of current peaks, say *, with equal probability 1/Kold, and take the two neighboring peak intervals which peak * separates as a fused interval. A peak location is uniformly randomly drawn within this fused interval for a new candidate peak to replace s*. The other parameters associated with this peak location mutation is kept unchanged, or they could be changed as a set within the uniformity framework. This is also a symmetric transition (Proposition 2). So the acceptance probability in the Metropolis-Hastings algorithm within Gibbs sampler is simply
For the “H” move type, the posterior distribution of each
peak (described by location, peak sample location variance, false negative and positive rates) birth or death changes the number of parameter sets by move type “+” or “–” in the aforementioned parameter sampling proposal;
peak location mutation does not change the number of parameter sets by move type “S” in the aforementioned parameter sampling proposal;
peak sample location variance or false negative/positive rate sampling does not change the number of parameter sets by move type “H” in the aforementioned parameter sampling proposal.
The associated propositions and the detailed description of RGPMCMC are given in the appendix. The following proposition justifies the correct convergence.
Interpretation: In view of independent uniform proposals for non-peak parameter set and diverse move types, given any arbitrarily small neighborhood of a current state there is a positive probability that the chain lies in that neighborhood after one sampling iteration, thus the aperiodicity is verified; the irreducibility is established since the chain can move from any state to any other state one step at a time.
In this section, we study the RGPMCMC performance under diverse circumstances and make comparison with other approaches in the literature.
Prior Sensitivity Analysis for Rgpmcmc
We consider different priors and study the discrepancy between the specified true peak locations and the estimated peak locations. Compaq Fortran 90 is our development package and we use IMSL Fortran Numerical Library to generate random numbers. 200 spectra are simulated for each of the six simulations, the set-ups and priors are listed in Table 1, where four true peaks are considered, the peak location vectors specify the true peak locations, the σ vectors specify the peak sample location variations at true peaks, the fn and fp vectors jointly specify the probabilities for us to simulate no peak sample (with probability fn), multiple peak samples (with probability fp), or single peak sample (with probability 1 – fn – fp) for each true peak from individual spectrum. The Inverse-Gamma and Dirichlet priors are for peak sample location variances and false negative and positive rates at putative true peaks. The peak number prior is K ~ U [1, 20], the starting peak number is 11, and move type probabilities are: π(+) = 0.45, π(–) = 0.45, π(H) = 0.05, and π(S) = 0.05. With burn-in 10,000 and thinning 1,000, each collection of 1,000 posterior samples takes several minutes on a PC powered by Celeron CPU. The results are given in Figure 3. In simulation 1, the true peaks are clearly clustered and the peak estimation is good. In simulation 2 with larger peak sample location variations and false positive/negative rates, the same number of peaks are identified as the true peak number under highly informative priors. By modifying peak sample location variation priors, we observe that, under large true variations, the peak estimation under less informative variation priors is worse than that under more informative variation priors. More uncertainties are introduced in simulation 3, and fewer peaks are identified than the true peak number. By modifying peak sample location variation priors, we observe that, when the variation priors are consistent with the true variation in terms of mean value, the peak number estimation seems to be better than inconsistent variation priors. Simulation 4 shows that, the informative peak sample location variation priors may lead to good estimation even when the peaks’ associated parameters are different, so does simulation 5, where unevenly distributed peaks are considered. In simulation 6, even when the parameter priors become much less informative, the peak estimation performs well, since the true peaks are sharply surrounded by corresponding peak samples. These observations show that, informative a priori knowledge is desirable for reliable estimation.
Simulation Configurations (IG: Inverse-Gamma prior for σ
2
, D: Dirichlet prior for (fn, fp,1 – fn – fp)).
Simulation Configurations (IG: Inverse-Gamma prior for σ 2 , D: Dirichlet prior for (fn, fp,1 – fn – fp)).

Simulations and Estimations (The simulation directly produces the peak samples for 200 patients which are plotted in each odd row of panels without the need for peak sample detection by data preprocessing, the y-axis is patient index and the x-axis is log(m/z). The alignments are compared by gridded walls given in each even row of panels. From top to bottom: true peaks, aligned peaks by RGPMCMC, aligned peaks by RJMCMC, aligned peaks by scale-space approach, aligned peaks by super-set approach and aligned peaks by PAM clustering algorithm.)
We first make comparisons with other non-MCMC algorithms represented by the recently developed scale-space approach (Yu et al. 2006a), super-set approach (Yu et al. 2006b), and Partitioning around Medoids (PAM) approach (Kaufman and Rousseeuw, 1990). For the six simulations in Section 3.1, the combined peak sample locations from 200 spectra along with the optimal cluster number minimizing the “median split silhouette” (Pollard and van der Laan, 2002) are taken as PAM inputs, S-plus function pam offers the medoid (cluster center) locations. These results are also included in Figure 3. Overall, RGPMCMC performs better than non-MCMC methods; the scale-space result is little better than the super-set result in terms of robustness; the PAM algorithm is not specifically designed for peak alignment, so it may perform poorly under certain circumstances, say simulations 2 and 3, where RGPMCMC can not recover all true peaks either.
Comparison with Reversible Jump Markov Chain Monte Carlo
We apply the same simulated data in Section 3.1 and make comparison with reversible jump Markov chain Monte Carlo (RJMCMC) algorithm by Green (1995). RGPMCMC and RJMCMC differ in the method for proposing move type “+” and/or “–” (peak birth and/or death), where the former conditions on the active Markov chain by making use of equally probable peak birth and death proposals, while the latter makes use of additional variables to construct a one-to-one matching for dimension changing (details are given in the appendix). The same prior specifications for RGPMCMC in Section 3.1 are applied to RJMCMC. We consider two starting peak numbers, 11 and 1, both equally partitions the m/z range.
Initial Peak Number = 11
Figure 3 also compares the alignments by RGPMCMC and RJMCMC, where no difference between RGPMCMC and RJMCMC exists.
The peak number iteration comparison is given in Figure 4, where the burn-in = thinning = 1,000. The third row of dual panels pinpoint 3 or 4 peaks, the 4:3 ratios are 0.124 and 0.122 for RGPMCMC and RJMCMC respectively, which is close to each other. The acceptance rate for peak birth and/or death proposals should be almost identical for these two algorithms.
Peak Number Sampling Series by RGPMCMC and RJMCMC from Simulation Study (burn in = thinning = 1,000, the initial peak number = 11.)
The peak number iteration comparison before reaching reasonable peak number (1,000th iteration) can be seen from Figure 5. For these six simulations, peak birth and/or death, and peak mutation acceptance rates are compared in Table 2, which are close to each other.
Peak Number Sampling Series by RGPMCMC and RJMCMC from Simulation Study (first 1,000 iterations with initial peak number 11)
The reasonable peak number is reached after almost the same number of iterations (~1,000) by GPMCMC and RJMCMC, which have similar efficiency for peak number identification.
Acceptance Rate Comparison between RGPMCMC and RJMCMC.
Initial Peak Number = 1
The peak number iteration comparison is given in Figure 6, where the burn-in = thinning = 1,000. The third row of dual panels pinpoint 3 or 4 peaks, the 4:3 ratios are 0.131 and 0.079 for RGPMCMC and RJMCMC respectively, where the ratio by RGPMCMC (0.131) is very close to peak number 11 case (0.124). Except for simulation 5, both RGPMCMC and RJMCMC identify the same number of peaks. For simulation 6, they all identify 3 peaks other than the true 4 peaks.
Peak Number Sampling Series by RGPMCMC and RJMCMC from Simulation Study (burn in = thining = 1,000 the initial peak number = 1.)
The peak number iteration comparison before 50,000th iteration can be seen from Figure 7. For these six simulations, peak birth and/or death, and peak mutation acceptance rates are also compared in Table 2. The rates are still close to each other.

Peak Number Sampling Series by RGPMCMC and RJMCMC from Simulation Study (first 50,000 interations with initial peak number 1)
We find that, starting from a relatively large peak number is more capable of identifying true peaks by RGPMCMC and RJMCMC.
We mimic the mass spectra using the R package developed by Coombes et al. (2005b), where wide-ranging factors are considered to create the uncertainty, including the acquisition time resolution of the detector, the distribution of initial particle velocities, isotope distribution and others. Ideally, our six simulated mass spectrum groups have 5, 10, 20, 40, 80 and 160 peaks without incorporating any uncertainty and the R simulator produces six mean spectra, where the particles with large mass value (>20,000 Da) have broader hills due to more isotopes (Figure 8). Each mean spectrum leads to 100 uncertainty involved random replicative spectra subject to peak sample detection. For them we smooth spectrum with a predefined Gaussian function (window size of 15), search local maxima in the local neighborhood of 15 data points as peak samples, the minimal intensity value of peaks should be not smaller than 100. The RGPMCMC pinpoints 5, 9, 18, 35, 62 and 112 peaks (Inverse-Gamma (6,500) is peak sample location variance prior, Dirichlet (5,5,90) is false negative and positive rate prior, the starting peak numbers are 10, 20, 40, 80, 160 and 320); the clustering method in Tibshirani et al. (2004) pinpoints 5, 9, 19, 36, 69 and 174 peaks (the tuning parameters are selected as suggested in Tibshirani et al. (2004)). The alignment comparison is also given in Figure 8, where the clustering method tends to identify redundant peaks at large masses (the arrows in the bottom panels of Figure 8), while RGPMCMC performs better in this region; both methods identify less peaks in certain dense regions (bottom panels in Figure 8), while RGPMCMC sometimes combine too concentrated peaks into one peak (the arrows in the middle panels of Figure 8). Overall, these two approaches have similar performance in this simulation scenario.

Mimic-MS Simulations and Estimations (The R simulator produces the spectrum profiles given in each odd row of panels, 100 random spectra were simulated for each of them for peak sample detection. The y-axis is intensity and the x-axis is m/z. After peak sample detection by data preprocessing, the alignments are compared by gridded walls given in each even row of panels. From top to bottom: true peaks, aligned peaks by RGPMCMC and aligned peaks by clustering algorithm.)
We model the same ovarian cancer data source as used by Wu et al. (2003) (available on-line at http://bioinformatics.med.yale.edu/MSDATA), where the healthy group has 77 patients and the cancer group has 93 patients. The individual spectrum has tens of thousands of (m/z, intensity) pairs and looks like a more complicated and error-involving version of those simulated profiles in Section 3.4 (Figure 8). The data preprocessing on original spectra involves baseline subtraction, smoothing, intensity normalization and peak sample detection by local maxima as described in Section 3.4. The median of the original peak sample numbers after pre-processing is 249 for the healthy group and 241 for the cancer group. The values of log(m/z) range from 6.565 to 8.200. We use move type probabilities (π(+) = 0.45, π(–) = 0.45, π(H) = 0.05, π(S) = 0.05) and the interval constraint d = 10–5. If this constraint is not met at one iteration, we simply resample the parameters. Richardson and Green (1997) observed that, proper posterior distributions may be not possible under fully noninformative priors, so we apply (α1 = 5, α2 = 5, α3 = 90) to Dirichlet priors, and (ν = 3, η = 2 6 ) to Inverse-Gamma priors. Different peak number priors, either truncated Poisson [100, 600|λ] or Uniform [100, 600] lead to similar results. The starting peak number is K = (Kmin + Kmax)/2, the initial peak locations are equal K-partition of log(m/z) range, burn-in is 10,000, and thinning is 1,000. We recommend to start with a relatively large peak number. The sampler approaches the reasonable peak number very quickly and usually sticks around and mostly does effective single peak mutations once approaching the true peak number. Occasional peak number jump ups are highly efficient for joint peak number and location estimation. The alignment results are given in Figure 9. The sampling series of peak number are given in Figure 10, where we empirically identify the modes of the posterior peak number distribution as 274 (healthy group) and 260 (cancer group). The false negative rate and false positive rate estimations are given in Figure 11 with a significant negative correlation. The posterior peak sample location variation medians are given in Figure 12, where the inconsistency of the first point possibly arises from edge effect, so do false negative and positive rate median plots (Figure 11). The average peak distance (~(8.200 – 6.565)/290 = 0.0058) dominates the estimated peak sample location variation (~0.001). The sampler takes several thousand iterations to approach the reasonable peak number, thus an adaptive strategy with varying move type probabilities may be more efficient than the brute birth/death dominating proposal after that time point. The collection of the posterior samples in Figure 10 takes several days on a PC powered by Celeron CPU.

Aligned Peaks with Peak Sample Background (y-axis: patient index, x-axis: log(m/z). The vertical lines represent aligned peaks (biomarker profile) for each group, the dots in the background are the deteced peak samples for all patients by data preprocessing.)

Sampling Series of Peak Number (y-axis: peak number, x-axis: iteration index. The healthy group seems to have more peaks than cancer group.)

False Negative Rate and False Positive Rate Estimation (The left panel of first row is the histogram of false negatives at aligned peaks from healthy group, the right panel of first row is the histogram of false negatives at aligned peaks from cancer group; the left panel of second row shows the false negative medians at aligned peaks (represented by log(m/z)) from healthy group, the right panel of second row shows the false negative medians at aligned peaks (represented by log (m/z)) from cancer group; the next two rows are for false positives; the bottom row of panels show (false positive rate [x-axis], false negative rate [y-axis]) at aligned peaks for healthy group and cancer group.)

Posterior Medians of Peak Sample Location Variation (The upper left panel is the histogram of peak sample location variation medians from healthy group, the upper right panel is the histogram of peak sample location variation medians from cancer group; the lower left panel shows peak sample location variation medians (healthy group) at aligned peaks represented by log (m/z), the lower right panel shows peak sample location variation medians (cancer group) at aligned peaks represented by log(m/z).)
Effective sample classification is as important as biomarker profile estimation. Wu et al. (2003) compared a number of sample classification methods without cross-validation and Tibshirani et al. (2004) reached error rates no less than 35%. Since the peak number is biologically variable between healthy and cancer groups, the current equal-peak-number based classifications may not be very suitable. Without considering cross-validation, we simply calculate the two log-likelihood functions for each preprocessed spectrum given fitted model (healthy group or cancer group) from Section 4, where the estimated parameters are taken from posterior medians. The histograms of log-likelihoods of all spectra from both healthy group and cancer group along with the log-likelihood difference histograms are given in Figure 13. From the log-likelihood difference empirical distributions in the bottom panels of Figure 13, we simply take the proportions of negative values as type I error rates: 28.6% for testing: Healthy vs. Cancer and 10.8% for testing: Cancer vs. Healthy, which are overly optimistic. Denote ỹ = (y1, y2, …, yN) as the spectrum peak sample location vector to be classified,
and
where
Sample Classification Error Rates from 10 fold Cross-validation (The error rates represent the proportions of the healthy or cancer patients in the testing set which are misclassified.)
Sample Classification Error Rates from 10 fold Cross-validation (The error rates represent the proportions of the healthy or cancer patients in the testing set which are misclassified.)

Hypothesis Test by Log-likelihood (The upper left panel is the log-likelihood histogram for healthy individuals with healthy group model, the upper right panel is the log-likelihood histogram for healthy individuals with cancer group model; the middle left panel is the log-likelihood histogram for cancer individuals with cancer group model, the middle right panel is the log-likelihood histogram for cancer individuals with healthy group model; the lower left panel is the log-likelihood difference histogram for “healthy individuals with healthy group model–-healthy individuals with cancer group model”, the lower right panel is the log-likelihood difference histogram for “cancer individuals with cancer group model–-cancer individuals with healthy group model”.)
In this article, we take a global viewpoint to avoid multiple edge effects under piecewise processing and incorporate flexible biomarker numbers to make our Bayesian model more accountable. The Jacobian term derivation, intensive death rate calculation, or lengthy recursive partition required by RJMCMC, CTMCMC and others in the literature may impede convenient application of Bayesian algorithm to change point identification (see the RJMCMC computational procedure in the appendix for example). For multiple change point identification problems where each segment has the same set of regulating parameters, we can see that, the superiority of RGPMCMC algorithm over other available algorithms is the most computational efficiency and simplicity by minor local adjustment of likelihood function and prior set armed with “naively informative” global treatment as introduced by the parameter sampling process in Section 2.2. Its competing computational performance has been demonstrated in this article by intensive comparison with others. Moreover, RGPMCMC can be easily modified to apply to multiple change point identification in circular domain (Liu et al. 2006) and others. Although our mass charge ratio (m/z) model already leads to promising sample classification, how to make use of relative intensity information beyond m/z value is a more challenging statistical problem, since peak samples in close proximity with disparate intensities are less likely to belong to the same putative true peak. Under the assumption of reproducibility and homogeneity of mass spectra, this algorithm is designed to be applied to each phenotype group separately (disease and control) at this point, leading to likely different peak location vectors for different phenotypes. Wu et al. (2006) observed that, a protein subset with considerable size, say 20~40 m/z features, may pose as signature between phenotypes, thus separate global statistical models are still desirable. Any peak sample detection protocol will cause inevitable peak sample location variation, false negative and false positive peak samples, which is obviously subject to statistical modeling. From algorithmic aspects, Green (1995) emphasized the importance of proposing parameter efficiently. Because the independent proposal from joint prior distribution of (K, θ K ) is not very efficient, our proposal works on the joint infinitesimal space to achieve more efficiency by a more fair birth/death move. Although the RGPMCMC does not need intensive tuning parameter optimization, running MCMC properly is never a simple automatic task, since from simulation study we find that, highly informative prior specification consistent with the truth is desirable, for which a solution could be a local small scale study out of the whole spectra picture. The mass spectra's quality and characteristics vary greatly depending on the platform, e.g. MALDI-TOF or SELDI-TOF, and certain experimental settings used for the measurements. This is not our concern here since it is not difficult to apply this global profile estimation algorithm to those spectra coming from the same source and enjoying high reproducibility and homogeneity. We anticipate that, the RGPMCMC developed in this article will shed light on a broad class of Bayesian multiple change point identification problems, not only MS data analysis. Lastly we emphasize that, diverse alignment problems arise from complicated scenarios in modern bioinformatics research. Beyond this m/z based mass spectra peak alignment which greatly benefits from Green (1995)'s seminal paper, Green and Mardia (2006) recently developed a novel Bayesian approach for simultaneous inference about the matching and the transformation between two protein 2D-gel images, and aligning active sites of proteins in three dimensions.
Footnotes
Acknowledgements
We are grateful to the editor and two referees for their valuable comments which improved our presentation, and Jeffrey Morris for his help with mass spectra simulation R package. The first three authors would like to thank Hongyu Zhao for the financial support and supervision during their stay at Yale center for statistical genomics and proteomics. This work was supported in part from NHLBI/NIH contract N01-HV-28186, NIDA/NIH grant 1 P30 DA018343-01, NIGMS grant R01GM59507, and NSF grant DMS-0241160.
