Abstract
Motivation
Our goal was to understand why the PLIER algorithm performs so well given its derivation is based on a biologically implausible assumption.
Results
In spite of a non-intuitive assumption regarding the PM and MM errors made as part of the derivation for PLIER, the resulting probe level error function does capture the key characteristics of the ideal error function, assuming MM probes only measure non-specific binding and no signal.
Introduction
The PLIER (Probe Logarithmic Intensity ERror) algorithm was developed by Affymetrix and released in 2004. It is part of several commercially available software packages that analyze Genechip® data such as Strand Genomic's Avadis and Stratagene's ArrayAssist®. The PLIER algorithm produces an improved gene expression value (a summary value for a probe set) for the GeneChip® microarray platform as compared to the Affymetrix MAS5 algorithm. It accomplishes this by incorporating experimental observations of feature behavior. Specifically, it uses a probe affinity parameter, which represents the strength of a signal produced at a specific concentration for a given probe. The probe affinities are calculated using data across arrays. The error model employed by PLIER assumes error is proportional to observed intensity, rather than to background-subtracted intensity. How ever, the derivation of the method also assumes that the error of the mismatch probe is the reciprocal of the error of the perfect match probe. We find this assumption counter-intuitive.
On the other hand, PLIER definitely performs well. It outperforms MAS5 in terms of the benchmark data and measures [3, 8] for assessing the quality of the summary statistic for a probe set. It also does fairly well compared to other methods that are commonly used to compute gene expression values for GeneChip probeset data. The Affycomp website (affycomp.biostat.jhsph.edu) currently shows results for 14 of the measures of accuracy [9 for 50+ normalization and analysis combinations; Figure 1 shows the ranks, on all 14 measures, for the default Affymetrix software MAS5, two early methods, robust multichip average (RMA) and dChip, and two more recent additions, chip calibration [11 and GC RMA version 1.1.3, along with those for PLIER. No one method dominates the competition in the sense of being the best or near best on all rankings, but for comparison GC RMA has the smallest average rank in the U133 data. In the U95 data, PLIER+16 does quite well, with an overall rank of sixteenth out of 54 entries. In particular, improvements over MAS5 include a higher reproducibility (lower coefficient of variation) without loss of accuracy and higher differential sensitivity for genes with lower expression values.
This inconsistency, good performance of an algorithm derived from a counter-intuitive error model assumption, prompted us to look more closely at the PLIER algorithm. Specifically, we looked at the error function for the algorithm and compared it to that for a more biologically based one. By examining the behavior of individual probes over a sequence of spiked-in RNA concentrations of a target gene, characterizations of the PLIER error function became clearer. The major finding is that the PLIER error model possesses many of the key characteristics of the ideal error function for fitting individual probe calibration curves.

Rankings of selected methods on 14 outcomes, from the Affycomp website.
Plier Description
This description of the PLIER algorithm is based upon a Technical Report [1 on the Affymetrix website. Consider a single probeset on an array and a set of j = 1, 2, …, n arrays in the experiment. We assume the probeset contains i = 1, 2, …, m probe pairs; a probe pair i consists of a perfect match (PM
ij
) and mismatch probe (MM
ij
). Let pm
ij
and mm
ij
represent the observed binding intensity for the perfect match and mismatch probe i on array j, respectively. The expected value for the observed binding for the perfect match and mismatch probes is assumed to be
B
ij
is background binding for probe pair i on array j (background is assumed to be the same for the PM and MM probes within a pair), μ
ij
is the binding level of probe i on array j, a
i
is the binding affinity of probe i, c
j
is the concentration of RNAin sample j, which is hybridized to array j.
The quantities B ij μ ij , a i , and c j represent the (unknown) true values of the background binding, probe binding, affinity, and concentration, respectively, whereas pm ij and mm ij are the observed intensity values.
It is fairly well established from empirical data that the logarithm (log) of the observed binding intensities is approximately equivariant; in other words, the error is multiplicative. This implies the following model
The assumption that the perfect match and mismatch error for probe pair i are equal, i.e.
The PLIER algorithm selects a and c such that the average “residual” r = log(∊) equals zero. Specifically, this is accomplished by minimizing a robust average of the r
2
values. The particular robust M-estimator used (Geman-McClure) is not of particular interest here. If the mismatch binding MM is zero, then
To more concretely understand how this algorithm works, consider a case of a single probeset on a single array. The goal is to obtain an estimate of the gene expression value for the probeset. For simplicity, assume there are only 3 probe pairs in the probeset. In this example, we use the first three probes of the U95A probeset 37777 at where the corresponding transcript of the target gene, protein tyrosine phosphatase receptor B (PTPRB), was spiked into a background of human pancreas RNA at a concentration of 32 PM. The observed (pm, mm) intensity pairs were: (1801,627), (542, 132), and (229,111). Figure 2 displays the r 2 curves for these probes as a function of the estimate for the true intensity (μ ij = aic j ), as well as the average r 2 across all three probesets. Average probeset error is minimized by an estimate of 220 as the the true expression level of this gene. The argument is similar for the complete probeset of 16 probe pairs; the plot would just be more crowded.

Error curves for various combinations.
The Direct Argument
Spike-in Data
To better understand why PLIER does well, we begin by examining characteristics of the Affymetrix data. A spike-in experiment dataset was created by Affymetrix and is publicly available at their web site http://www.affymetrix.com; search on the phrase “Latin square data” to find the link to the page containing a description of the experiment and the downloadable files of data. In this experiment, mixtures of a common RNA background, in which 16 probesets were spiked in according to 14 different concentrations (0, 0.25, 0.5, 1, 2, 4, …, 512, 1024 PM), were hybridized to a set of Affymetrix U95Av2 arrays. In most cases, each pattern of the 16 probeset concentrations was replicated three times. A cyclic latin square design was used for the spike-in pattern of the target RNAs. Irizarry et al. [7 provide a more detailed description of this experiment. Figure 3 contains the plots of the probes within the first spiked-in probeset 37777 at. Each plot contains the observations of the perfect match and mismatch probes. The probeset 37777 at was spiked in at 14 different concentrations (0, 0.5, 1, …, 1024 PM) across a total of 59 arrays. The observed expression values were plottedonthe j-axis and the spike-in concentrations were plotted on the x-axis; both on a log2 scale. (The data has not been normalized; however, this particular plot is almost unchanged by normalization.) A panel is shown for each probe in the probeset; the perfect match (PM) and mismatch (MM) values were plotted using different symbols. Fitted S shaped curves were superimposed on the data, where the PM function differed from the MM function only in the location of its infection point; a paper by Ballman and Therneau [2 contains the complete set of plots for this and other spike-in experiments. As can be seen in Figure 3, the S-shaped curves appear to fit the data well.

The probe pair intensity values versus the spiked concentration of a spiked transcript, probeset 37777 at, from the Affymetrix U95A spike-in experiment. There are 16 probe pairs. The open circles are the values of the PM probes and the filled squares are the values of the MM probes. The solid and dashed lines correspond to the fitted logistic calibration curves for the PM probes and MM probes, respectively.
Models of the Data
From the literature, there are at least two data models appropriate for the Affymetrix data. If we assume that binding to the chip surface (probeset) does not change the concentration of the target (cDNA) in solution, then the standard mass action laws lead directly to the Langmuir isotherm equation. Its appropriateness for modeling Affymetrix data is described nicely by Hekstra et al. [5. Let x be the specific RNA concentration, then the fraction of occupied probe sites θ is given by
The second model was described by Finney [4 for behavior of calibration curves of radioligand assays where x is the log of the (known) dose and y the log of the observed intensity from the assay. Finney suggested that for this, and most binding equations, a logistic or probit function adequately describes the relationship between x and y. As seen in Figure 3, an S-shaped curve such as a logistic appears appropriate because it captures the effect of background binding and/or lower limits of detection (i.e. the flat lower portion of the lefthand part of the curve) and the effect of biochemical saturation and/or the instrumentation (i.e. the upper portion of the righthand part of the curve).
Is one of these models more appropriate than the other? Figure 4 shows a logistic curve and a Langmuir isotherm curve, both scaled to the range of data values in Figure 3. Clearly, the two curves are virtually indistinguishable. In light of this, we fit logistic curves to the PM and MM data in Figure 3. The logistic curves were fit simultaneously where the PM curve only differed from the MM curves in the location of the infection point, i.e. assuming the same upper and lower thresholds a and b but a different binding affinity K. The PM and MM curves have identical shapes but the MM curve is shifted to the right of the PM curve (lower affinity). As can be seen in Figure 3, the theoretical curves fit the data well.

A logistic curve and Langmuir isotherm curve.
Graphical Comparison of Error Models
The ideal estimate of gene expression for an experiment would use the probe curves fromFigure 3 directly; which is unfortunately not possible since the curves are unknown. But let us assume for a moment that they were, with fi(x) being the calibration curve for the ith probe. Since the data y are approximately equivariant on the log scale, a rational approach for estimating the binding is to minimize the overall error
Over a set of probes, one wishes to choose the estimated binding x so as to make errors close to zero. (This is essentially how assays for which a binding curve is created as a part of the procedures, such as ELISA, proceed.)
To combine PM and MM values, our ideal error function will be the simple sum,
All other functions can be compared to this ideal error function. We compare the error functions for two different true concentrations for a probe, μ = 512 and μ = 256 (9 and 8 on the log2 scale, respectively), with a known binding background level of 64 (6 on the log2 scale). Specifically, we compare the error as a function of the estimated μ, i.e.

The functions for the ideal error (true), the MAS5 error (mas5), PLIER error(plier), and PLIER applied to background-adjusted data error(plier-bc). The true gene intensity (μ) and background (B) and the observed pm and mm values are on each plot.
As can be seen from the panel of plots, the implied error function for MAS5 differs dramatically from the ideal error function in the lower tail. This explains the poor behavior of MAS5 for estimating expression values for low RNA concentration levels, which has been cited extensively in the literature. Also note that there is no curve for the MAS5 model when pm = 256 and mm = 512. The reason is that when the mm value is larger than the pm value, the expression value is undefined. This is not technically true for MAS5 because in instances where pm < mm, the algorithm employs an ideal mismatch value; the ideal mismatch value is selected such that it is less than the pm value. However, when pm > mm, which occurs for the majority of the probe pairs, the error functions in Figure 5 are correct.
On the other hand, the implied error curve for the PLIER model has the correct shape for the left portion of the function. This explains the observation that PLIER yields improved estimations of expression values for low RNA concentration levels compared to MAS5. Neither PLIER nor MAS5 error functions have the correct shape for the right-hand portions of the plots. In practice, the effect of differences from the ideal error for the far right portion of the function is not as serious as differences in the left portion. This is because for actual experiments employing collected biospecimens of interest (cell lines, animal tissue, or human tissue), saturation of the probes is rarely reached. However, when the MM value is far above background, as it is for the spike in experiment when the observed MM values are greater than 256, the overly high lower threshold of the PLIER error function can cause overestimation. Finally, PLIER applied to global background adjusted data does not perform as well as PLIER applied to unadjusted data. There are several variations of PLIER, e.g. PLIER+16 and PLIER+32, which add the constants 16 and 32, respectively, back to global background adjusted data. These constants are on the order of magnitude of the values that were subtracted for global background adjustment and largely “undo” the global background adjustment; from above, we see that PLIER performs better on data that has not been globally background adjusted.
Properties of the Error Function
From the graphical display of the error functions for MAS5 and PLIER, it appears as though a possible explanation for why PLIER performs so well is that in the crucial part of the error function, it has characteristics similar to the ideal error curve. What are the general characteristics of the ideal error function?
Assume that the true assay binding calibration function is a logistic curve, or something quite like it, so
Let us assume that f has a lower threshold or background, log2(b), which corresponds to the scanner effect and non-specific binding when the target gene is not expressed. To the right of this threshold, assume f is linear or nearly linear on the log2 scale, and is smooth. Under these conditions, the error function would have the following properties.
For For The behavior described in 2 is independent of the value of pm
ij
.
For PLIER, we can verify 1 and 2 above, algebraically; this confirms the behavior observed in Figure 5 for the general case.
For property 1, the error in Equation (2) is placed on the log2 scale and multiplied by –2 to get ∊*,
Under the Affymetrix assumption, the PM probe measures the target gene concentration and the MM probe measures the background level. Since pm estimates the signal level and mm estimates MM, or background, respectively, this satisfies the first property. Note, it has been established that MM does not measure background alone but also measures signal. However, as the true concentration level, μ, becomes small, MM becomes a better estimate of background, i.e. it is less likely to also measure signal. Hence, the PLIER error function is reasonably consistent with property 1.
For property 2, we again place the error on the log2scale, drop the subscripts, and we get
PLIER is of course making the assumption that log(MM) = background + error; in particular, it assumes that mm does not measure any gene signal. If this assumption is true, we see from above that the PLIER error model has the characteristics of the ideal error model, especially in the region of the plot that is the hardest, the low end. This explains why it does perform better than MAS5. However, the more these assumptions are violated–-i.e. the more signal the observed mm measures in addition to non-specific binding, the more the PLIER error function will deviate from the ideal error function. As mentioned previously, it is fairly well established in the literature that mm does measure signal in addition to non-specific binding, which may explain why PLIER is not the best performing algorithm of those entered in Affycomp.
Conclusions
In light of the fact that the MM probes are not good estimates for probe background level, the PLIER algorithm could likely be improved with a better estimate of background binding, perhaps along the lines of that proposed by Naef et al. [10. Another question, one which we did not address here, is whether a robust average, such as that employed by PLIER, is really necessary. This is based on the fact that on a log scale, the spike-in data appear relatively equivariant, with few outliers. However, these considerations are of secondary importance. Of major concern is the fact that the error model is based upon an implausible assumption regarding the relationship between the error of the PM values and MM values.
Overall, we found that in spite of the non-intuitive assumption regarding the PM and MM errors made as part of the derivation for PLIER, the resulting model does capture the key characteristics of the ideal error curve, assuming MM probes only measure non-specific binding and no signal. Our only explanation for why this should be is good fortune.
This note has only considered the shape of the PLIER error function for a single probe. When averaging over multiple probes not only the shape but the relative shifts of the per probe error curves from one another will affect the effectiveness of the final estimate; our paper does not predict how PLIER will fair in comparison to other methods. In particular, we believe the deviations of the individual error functions from the ideal error functions likely will be compounded when performing the averaging across the probes in a probeset. Our belief is based on the observation that although PLIER performs better than MAS5, it does not perform as well as other algorithms entered in Affycomp, most of which are based on more biologically plausible assumptions.
Disclosure
The authors report no conflicts of interest.
