
Abstract

Chromatin immunoprecipitation sequencing (ChIP-seq) is a powerful method for analyzing protein interactions with DNA. It can be applied to identify the binding sites of transcription factors (TFs) and the genomic landscape of histone modification marks (HMs). Previous research has largely focused on developing peak-calling procedures to detect the binding sites for TFs. However, these procedures may fail when applied to ChIP-seq data of HMs, which have diffuse signals and multiple local peaks. In addition, it is important to identify genes with differential histone enrichment regions between two experimental conditions, such as different cellular states or different time points. Parametric methods based on the Poisson/negative binomial distribution have been proposed to address this differential enrichment problem, and most of these methods require biological replicates. However, many ChIP-seq data sets have few or even no replicates. We propose a nonparametric method to identify the genes with differential histone enrichment regions even without replicates. Our method is based on nonparametric hypothesis testing and kernel smoothing in order to capture the spatial differences in histone-enriched profiles. We demonstrate the method using ChIP-seq data on a comparative epigenomic profiling of adipogenesis of murine adipose stromal cells and the Encyclopedia of DNA Elements (ENCODE) ChIP-seq data. Our method identifies many genes with differential H3K27ac histone enrichment profiles at gene promoter regions between proliferating preadipocytes and mature adipocytes in murine 3T3-L1 cells. The test statistics also correlate well with the gene expression changes and are predictive of them, indicating that the identified differentially enriched regions are indeed biologically meaningful.

Keywords

kernel smoothing, normalization, nonparametric testing, spatial histone profiles

Introduction

Chromatin immunoprecipitation sequencing (ChIP-seq) technology is a powerful tool for analyzing protein interactions with DNA.¹ ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites of transcription factors (TFs) and genomic landscape of histone modification marks (HMs). This high-throughput technology creates millions of short parallel sequencing reads and provides more accurate mapping information for the binding regions in the whole genome with lower cost^2–5 than array-based methods.

Both TF binding and histone modification play important roles in gene regulation, where TFs bind to DNA at a promoter region to promote or block gene transcription. The signal of TFs usually shows one sharp peak at their binding sites. Multiple HMs have been reported to be associated with transcription initiation, open chromatin, and repression of transcription.^3,6

Most previous works in analysis of ChIP-seq data have focused on developing peak-calling procedures to find the binding sites for TFs.^7–11 Identifying the enriched regions of HMs is difficult since their signals are more spread out.¹² The signals of HMs are diffuse and usually have multiple local peaks, which are hard to identify by directly applying the existing peak-calling algorithms.

Another important question is to identify the genomic regions that show differential enrichment of histone modification between two experimental conditions, such as different cellular states or different time points.^3,13 Indeed, different types of differential histone enrichment have been observed, including shift of nucleosome positions, peak height differences and presence/absence of HMs.^14,15 Chen et al.¹⁴ further demonstrated that the spatial distributions of histone marks are predictive for promoter locations and promoter usage. Angel et al.¹⁶ showed that during cold, the H3K27me3 levels progressively increase at a tightly localized nucleation region in Arabidopsis, indicating the importance of studying the peak height, not just presence/absence of the peaks.

One common approach to identifying differentially enriched regions by histones is to apply a peak-calling algorithm to identify the enriched regions for each of the two conditions. The regions with peaks in one condition but without peaks in the other condition are then selected. However, selection of enriched regions often depends on the thresholds used in the peak-calling algorithm. Small differences in the calculated P values or the False Discovery Rate (FDR) threshold used by the peak-finding procedures can lead to very different sets of peaks. Furthermore, this simple procedure has limitations in detecting the differential enrichment in terms of different peak heights or different peak locations.

Several parametric methods based on the Poisson/negative binomial distribution, such as DiffBind and DBChIP, have been proposed to address this differential enrichment problem in ChIP-seq data.^13,17 Most of these methods require biological replicates to estimate the parameters, especially the dispersion parameter in the negative binomial model.⁸ However, many ChIP-seq data sets have few or even no replicates. Taslim et al.¹⁸ proposed a nonlinear method that uses locally weighted regression (Lowess) for ChIP-seq data normalization. Shao et al.¹⁹ developed a method to quantitatively compare ChIP-seq data sets. To circumvent the issue of differences in signal-to-noise ratios between samples, they focused on ChIP-enriched regions and introduced the idea that ChIP-seq common peaks could serve as a reference to build the rescaling model for normalization. All the methods mentioned rely on first identifying the enriched regions and then obtaining the total tag or read counts in these regions. Such approaches have two limitations. First, one has to identify the regions using peak-finding algorithms. Second, by summarizing the tags in a region into one single number, one can potentially lose important spatial profile differences such as shifts of the signal region or shapes of signals.

In this paper, we propose a nonparametric method to identify the genes with differentially enriched regions based on the ChIP-seq data of histones. Instead of first identifying the enriched regions or peaks as most of the existing methods do, we consider the regions close to genes that may contain important regulatory elements such as the promoter regions, the gene body, and downstream regions of the genes. For each of these regions, we summarize the data as counts of sequencing reads in each of the bins of a given length (eg, 25 bps). The counts in these candidate regions provide important information about different HM enrichment levels between two cellular states. After transforming the count data to approximately normal, we apply kernel smoothing to the differences of the data and develop a nonparametric hypothesis testing procedure based on the kernel smoothing. Applying smoothing to the data helps to eliminate the small local differences that are unlikely to be biologically relevant.

We demonstrate the method using ChIP-seq data on a comparative epigenomic profiling of adipogenesis of murine 3T3-L1 cells. Our method detects genes with differential H3K27ac levels at gene promoter regions between proliferating preadipocytes and mature adipocytes, which agrees with what was observed by Mikkelsen et al.³ The test statistics correlate well with the gene expression changes, indicating that the identified differences are indeed biologically meaningful. Our results also indicate that the combination of different histone modification profiles can predict the fold changes of gene expression very well.

Motivating Comparative ChIP-Seq Study, Data Transformation, and Statistical Model

We consider the ChIP-seq experiments reported by Mikkelsen et al.³ on murine 3T3-L1 cells undergoing adipogenesis. Specifically, they generated genome-wide chromatin state maps using ChIP-seq profiling, where they mapped six HMs and two TFs at four time points, including proliferating (day −2) and confluent (day 0) preadipocytes, immature adipocytes (day 2), and mature adipocytes (day 7). We focus our analysis on the H3K27ac mark, which is expected to be enriched at active promoters or enhancers. In order to identify the genes that show differential H3K27ac modification levels between the preadipocytes (day −2) and mature adipocytes (day 7), we consider the upstream 5000 bp region and downstream 2000 bp region around the transcription start site for each gene and divide this region into 280 bins of 25 bp. We map the raw data using Bowtie,²⁰ extend reads to the fragment size and then obtain the genome-wide coverage data with a fixed bin size of 25 bp. Since the two ChIP-seq samples are usually sequenced at different depths (total number of reads), we rescale the counts according to the sequencing depth ratio. Suppose that there are m genes and that n bins are observed for each gene. We denote by X_ikj the read count for gene i in bin k under condition j, for i = 1, …, m; k = 1, …, n; and j = 1, 2. Our goal is to identify the genes with differential H3K27ac levels at their promoter regions between mature adipocytes and preadipocytes.
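The binning step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each extended fragment is represented by a single genomic position and that the gene lies on the plus strand; the function name `promoter_bin_counts` and the toy coordinates are hypothetical and are not taken from the original pipeline.

```python
import numpy as np

def promoter_bin_counts(positions, tss, upstream=5000, downstream=2000, bin_size=25):
    """Count positions in fixed-width bins over [tss - upstream, tss + downstream]."""
    # 281 bin edges give (5000 + 2000) / 25 = 280 bins.
    edges = np.arange(tss - upstream, tss + downstream + bin_size, bin_size)
    counts, _ = np.histogram(np.asarray(positions), bins=edges)
    return counts

# Toy positions around a hypothetical plus-strand TSS; the last one is outside the window.
tss = 100_000
reads = [tss - 5000, tss - 1, tss, tss + 1999, tss + 5000]
counts = promoter_bin_counts(reads, tss)
```

For a minus-strand gene the window would need to be mirrored so that "upstream" refers to larger coordinates; that case is left out of this sketch.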

For each gene i and each condition j, we assume the data X_ikj, k = 1, …, n, are approximately Poisson with means μ_ikj. We first apply a variance-stabilizing transformation (VST) to transform the data to $X^{*}_{i k j} = 2 \sqrt{X_{i k j} + 0.25}$, as recommended by Brown et al.^21,22 We then treat the $X^{*}_{i k j}$ as approximately normal random variables with mean $2 \sqrt{μ_{i k j}}$ and variance 1. For the ith gene, in order to test for differential enrichment between the two conditions, we calculate the difference between the two conditions as $Y_{i k} = X^{*}_{i k 1} - X^{*}_{i k 2}$. If there is no differential enrichment, $Y_{i}^{T} = (Y_{i 1}, …, Y_{i n})$ should have a mean value of zero.
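As a small illustration of the transformation, the following sketch applies the VST and forms the bin-wise differences for one gene. The direction of the depth rescaling (condition 2 scaled to the depth of condition 1) is an assumption made here, and the function names and toy counts are hypothetical.

```python
import numpy as np

def vst(counts):
    """Variance-stabilizing transform 2 * sqrt(x + 0.25) for Poisson-like counts."""
    return 2.0 * np.sqrt(np.asarray(counts, dtype=float) + 0.25)

def transformed_difference(x1, x2, depth1, depth2):
    """Rescale condition 2 to the depth of condition 1 (assumed), then difference the VSTs."""
    x2_scaled = np.asarray(x2, dtype=float) * (depth1 / depth2)
    return vst(x1) - vst(x2_scaled)

# Toy gene with 8 bins (the analysis in the text uses n = 280 bins of 25 bp).
x_cond1 = np.array([0, 1, 3, 10, 12, 4, 1, 0])
x_cond2 = np.array([0, 2, 6, 20, 24, 8, 2, 0])  # same profile at twice the depth
y = transformed_difference(x_cond1, x_cond2, depth1=1e6, depth2=2e6)
```

In this toy case the two profiles differ only by sequencing depth, so every entry of `y` is zero, which is the behavior expected under no differential enrichment.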

We further denote $Y_{i} (t_{k}) = Y_{i k}$ for $t_{k} = k / n \in (0, 1]$, k = 1, …, n. We assume the following “signal + white noise” model for the count differences after the VST,

Y_{i} (t_{k}) = f_{i} (t_{k}) + σ_{i} W_{i} (t_{k})

(1)

where $f_{i} (t)$ is a smooth function that characterizes the difference of the ChIP-seq enrichment profiles, $W_{i} (t_{k})$ is Gaussian noise with mean 0 and variance 1, and $σ_{i}^{2}$ is the noise variance. For the ith gene, testing for no differential enrichment between the two conditions is equivalent to testing the null hypothesis

H_{0} : f_{i} (t) = 0.

(2)

Kernel-Smoothing-Based Nonparametric Tests

For a given gene i, we propose a kernel-smoothing based non-parametric test²³ to test the null hypothesis (2). For notational simplicity, we omit the subscript i in the following. Let K be a proper kernel, which is a symmetric, continuous density function with an expectation of zero. We use a normal kernel function, which satisfies all these regularity conditions and fits the real data well. For a fixed bandwidth value λ ∈ [0, 1], we consider the kernel estimator ${\tilde{Y}}_{λ} (t)$ with t ∈ [0, 1], s ∈ [0, 1] and its standard decomposition as

\begin{matrix} {\tilde{Y}}_{λ} (t) = \frac{1}{λ} \int_{0}^{1} K (\frac{t - s}{λ}) Y (s) d s \\ = \frac{1}{λ} \int_{0}^{1} K (\frac{t - s}{λ}) f (s) d s + \frac{σ}{λ} \int_{0}^{1} K (\frac{t - s}{λ}) W (s) d s \\ = f_{λ} (t) + σ ξ_{λ} (t), \end{matrix}

(3)

where

f_{λ} (t) = \frac{1}{λ} \int_{0}^{1} K (\frac{t - s}{λ}) f (s) d s \quad \text{and} \quad ξ_{λ} (t) = \frac{1}{λ} \int_{0}^{1} K (\frac{t - s}{λ}) W (s) d s .

Based on the study by Lepski and Spokoiny,²³ we use the integral of the squared kernel estimator T_λ, which is defined as

T_{λ} = \frac{{‖ {\tilde{Y}}_{λ} ‖}^{2}}{{\hat{σ}}^{2}} = \frac{\int_{0}^{1} {\tilde{Y}}_{λ}^{2} (t) d t}{{\hat{σ}}^{2}}

(4)

to test the null hypothesis $H_{0} : ‖ f ‖ = 0$, where ${\hat{σ}}^{2}$ is some estimate of the error variance, which we discuss in the Estimate σ for Each Gene section. Under the null H₀ one has

{\tilde{Y}}_{0 λ} (t) = σ ξ_{λ} (t)

(5)

and the test statistic becomes

T_{0 λ} = \int_{0}^{1} ξ_{λ}^{2} (t) d t .

Since W(t_i) follows N(0, 1), we have

ξ_{λ} (t) = \frac{1}{λ} \int_{0}^{1} k (\frac{t - s}{λ}) W (s) d s .

For the Gaussian kernel, the expectation of T_0λ is given by

E (T_{0 λ}) = \frac{1}{n λ} {‖ k ‖}^{2} = \frac{1}{n λ} \frac{1}{2 \sqrt{π}},

and its variance is

\mathrm{Var} (T_{0 λ}) = \frac{1}{n^{2} λ} \frac{1}{\sqrt{2 π}} .

We define the test statistic as

Z_{0 λ} = \frac{T_{λ} - E (T_{0 λ})}{\sqrt{\mathrm{Var} (T_{0 λ})}},

(6)

which follows N(0, 1) as n → ∞ under the null hypothesis.
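A minimal numerical sketch of the statistic (6) is given here, assuming the integrals are approximated by averages over the n equally spaced bins and that the variance is known. The function names, the simulated noise, and the Gaussian bump used as a signal are illustrative assumptions and are not part of the original analysis.

```python
import math
import numpy as np

def gaussian_smoother(n, lam):
    """Discretized kernel weights K((t_k - t_j) / lam) / (n * lam), K = N(0, 1) density."""
    t = np.arange(1, n + 1) / n
    u = (t[:, None] - t[None, :]) / lam
    return np.exp(-0.5 * u**2) / (math.sqrt(2.0 * math.pi) * n * lam)

def z_asymptotic(y, lam, sigma2=1.0):
    """Standardize T_lambda with the Gaussian-kernel null mean and variance."""
    y = np.asarray(y, dtype=float)
    n = y.size
    y_smooth = gaussian_smoother(n, lam) @ y
    t_stat = np.mean(y_smooth**2) / sigma2          # approximates the integral in (4)
    mean0 = 1.0 / (n * lam) / (2.0 * math.sqrt(math.pi))
    var0 = 1.0 / (n**2 * lam) / math.sqrt(2.0 * math.pi)
    return (t_stat - mean0) / math.sqrt(var0)

n, lam = 280, 20 / 280
t = np.arange(1, n + 1) / n
rng = np.random.default_rng(0)
noise = rng.standard_normal(n)                           # simulated null differences
bump = 3.0 * np.exp(-0.5 * ((t - 0.5) / 0.05) ** 2)      # simulated smooth signal f
z_null = z_asymptotic(noise, lam)
z_alt = z_asymptotic(noise + bump, lam)
```

With the simulated bump added, `z_alt` is far larger than `z_null`, since the smoothed signal contributes directly to the squared norm in the numerator.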

Alternative derivation of the test statistic

In this section, we present an alternative derivation of the test statistic that has better finite sample performance than the statistic (6) when n is not too large (see the Application to a Comparative ChIP-Seq Study During Mouse Adipogenesis section for an illustration). Note that the kernel smoother ${\tilde{Y}}_{λ} (t)$ can be written as a linear combination of Y^T = (Y₁, …, Y_n),

{\tilde{Y}}_{λ} (t) = S_{λ} Y,

(7)

where S_λ is considered as the hat matrix,

S_{λ} = \frac{1}{n λ} \left( \begin{matrix} K (\frac{t_{1} - s_{1}}{λ}) & \dots & K (\frac{t_{1} - s_{n}}{λ}) \\ ⋮ & ⋱ & ⋮ \\ K (\frac{t_{n} - s_{1}}{λ}) & \dots & K (\frac{t_{n} - s_{n}}{λ}) \end{matrix} \right) .

The trace of S_λ is the degrees of freedom (df) of the kernel smoother.²⁴
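The matrix S_λ and its trace can be computed directly. The sketch below assumes the evaluation points and the design points coincide (t_k = s_k = k/n) and uses the standard normal density as the kernel; the function name `smoother_matrix` is hypothetical.

```python
import numpy as np

def smoother_matrix(n, lam):
    """n x n matrix with entries K((t_k - s_j) / lam) / (n * lam), K = N(0, 1) density."""
    t = np.arange(1, n + 1) / n
    u = (t[:, None] - t[None, :]) / lam
    return np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * n * lam)

n, lam = 280, 20 / 280
S = smoother_matrix(n, lam)
df = np.trace(S)                       # every diagonal entry equals K(0) / (n * lam)
interior_row_sum = S[n // 2].sum()     # close to 1 away from the boundaries
```

Because every diagonal entry equals K(0)/(nλ), the degrees of freedom reduce to K(0)/λ, which is about 5.6 for λ = 20/280. Rows near the boundary sum to less than 1 because part of the kernel mass falls outside (0, 1].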

Based on equations (3), (4), and (7), the statistic T_λ can be approximated by

T_{λ} = \frac{1}{n σ^{2}} \sum_{k = 1}^{n} {\tilde{Y}}_{k λ}^{2} = \frac{1}{n σ^{2}} Y^{T} S_{λ}^{T} S_{λ} Y,

(8)

where the n × n matrix $S_{λ}^{T}$ is the transpose of S_λ. Let $M = S_{λ}^{T} S_{λ}$ with the following eigendecomposition, V^TMV = D, where D = diag(d₁, …, d_n), d₁ ≥ … ≥ d_n are the eigenvalues and V is the orthogonal matrix of the eigenvectors. Under the null hypothesis, based on equation (5), Y/σ follows a multivariate normal distribution N_n(0, I_n). Letting U = V^TY/σ with U^T = (U₁, …, U_n), we can rewrite T_λ as

T_{λ} = \frac{1}{n} U^{T} D U = \frac{1}{n} \sum_{k = 1}^{n} d_{k} U_{k}^{2} .

Since V is an orthogonal matrix, the vector U follows N_n(0, V^TV) = N_n(0, I_n) under the null hypothesis, and therefore the $U_{k}^{2}$ are i.i.d. random variables following $χ_{1}^{2}$, and T_λ follows a mixture of n χ² distributions with weights $d_{k} / n$. Furthermore, based on the study by Bentler and Xie,²⁵ under the null hypothesis, T_λ can be approximated by a weighted χ² distribution, $δ χ_{d}^{2}$, where

d = \frac{{(\sum_{k = 1}^{n} d_{k})}^{2}}{\sum_{k = 1}^{n} d_{k}^{2}}, δ = \frac{(\sum_{k = 1}^{n} \frac{d_{k}}{n})}{d} .

Alternatively, using the Wilson–Hilferty transformation,²⁶ we have

Z_{0 λ, W H} = \frac{\sqrt[3]{\frac{T_{λ}}{δ d}} - (1 - \frac{2}{9 d})}{\sqrt{\frac{2}{9 d}}},

(9)

which follows a N(0, 1) under the null hypothesis. We use this statistic in our analysis.
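The steps leading to the statistic (9) can be sketched numerically as follows, assuming a Gaussian kernel, equally spaced bins t_k = k/n, and a known variance. The function names and the simulated data are hypothetical and only illustrate the computation.

```python
import numpy as np

def smoother_matrix(n, lam):
    """n x n matrix with entries K((t_k - s_j) / lam) / (n * lam), K = N(0, 1) density."""
    t = np.arange(1, n + 1) / n
    u = (t[:, None] - t[None, :]) / lam
    return np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * n * lam)

def wh_statistic(y, lam, sigma2=1.0):
    """Return (Z_WH, d, delta) from the weighted chi-square approximation to T_lambda."""
    y = np.asarray(y, dtype=float)
    n = y.size
    S = smoother_matrix(n, lam)
    M = S.T @ S
    eig = np.linalg.eigvalsh(M)                    # eigenvalues d_k of the symmetric matrix M
    d = eig.sum() ** 2 / np.sum(eig**2)            # effective degrees of freedom
    delta = (eig.sum() / n) / d                    # scale so that delta * d matches E(T)
    t_stat = y @ M @ y / (n * sigma2)              # equation (8)
    z = (np.cbrt(t_stat / (delta * d)) - (1.0 - 2.0 / (9.0 * d))) / np.sqrt(2.0 / (9.0 * d))
    return z, d, delta

n, lam = 280, 20 / 280
t = np.arange(1, n + 1) / n
rng = np.random.default_rng(0)
noise = rng.standard_normal(n)                           # simulated null differences
bump = 3.0 * np.exp(-0.5 * ((t - 0.5) / 0.05) ** 2)      # simulated smooth signal
z_null, d, delta = wh_statistic(noise, lam)
z_alt, _, _ = wh_statistic(noise + bump, lam)
```

The pair (d, δ) is chosen so that δχ²_d has the same mean and variance as the mixture Σ d_k U_k²/n, which is why only the sums Σ d_k and Σ d_k² enter the computation.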

Estimate σ for each gene

In order to calculate the test statistic specified as equation (4) or (8), we need the variance estimate ${\hat{σ}}_{i}^{2}$ for each gene i. After the VST of the read counts, for each gene i, we assume that the observations Y_ik have the same variance $σ_{i}^{2}$. We consider the Nadaraya–Watson non-parametric regression with kernel smoothers as in (3),

{\tilde{Y}}_{λ} (t) = S_{λ} Y,

where df = tr(S_λ) is the degrees of freedom of the kernel smoother.²⁴ We estimate the variance $σ_{i}^{2}$ by calculating the residual sum of squares

{\hat{σ}}^{2} = \frac{{[{\tilde{Y}}_{λ} (t) - Y (t)]}^{T} [{\tilde{Y}}_{λ} (t) - Y (t)]}{n - d f} = \frac{\sum_{k = 1}^{n} {[Y_{k} - {\tilde{Y}}_{λ} (t_{k})]}^{2}}{n - d f}

(10)

Since we consider ChIP-seq data with very few or no replicates, the estimates ${\hat{σ}}_{i}^{2}$ can be too small for genes with very small counts. To improve precision, we use an approach similar to that used by Efron et al.²⁷ and Tusher et al.²⁸: we add a constant a₀, the 90th percentile of the estimated standard deviations, to the standard deviation of each gene in order to avoid false identification of genes with differential enrichment. The final modified estimator of the variance is ${({\hat{σ}}_{i} + a_{0})}^{2}$.

Finally, we choose the bandwidth λ in the kernel smoothing to be relatively large to avoid fitting the very small local changes. In our analysis of the real data sets with n = 280 observations, we choose λ = 20/280. The details of bandwidth selection are discussed in the Effects of Bandwidth Selection on Identifying the Differential Enrichment Genes section.
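The variance estimate (10) and the percentile adjustment can be sketched as follows. The sketch assumes that the 90th percentile is taken across genes, uses the unnormalized smoother matrix displayed earlier, and runs on simulated unit-variance noise; all function names are hypothetical.

```python
import numpy as np

def smoother_matrix(n, lam):
    """n x n matrix with entries K((t_k - s_j) / lam) / (n * lam), K = N(0, 1) density."""
    t = np.arange(1, n + 1) / n
    u = (t[:, None] - t[None, :]) / lam
    return np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * n * lam)

def residual_variance(Y, S):
    """Per-gene RSS / (n - df) for the rows of Y (genes x bins), with df = tr(S)."""
    resid = Y - Y @ S.T
    n = Y.shape[1]
    return np.sum(resid**2, axis=1) / (n - np.trace(S))

def inflate_variance(sigma2_hat, percentile=90.0):
    """Add a0 (a percentile of the per-gene standard deviations) before squaring."""
    sd = np.sqrt(sigma2_hat)
    a0 = np.percentile(sd, percentile)
    return (sd + a0) ** 2

rng = np.random.default_rng(1)
n, lam = 280, 20 / 280
S = smoother_matrix(n, lam)
Y = rng.standard_normal((50, n))        # 50 simulated null genes with unit variance
sigma2_hat = residual_variance(Y, S)
sigma2_mod = inflate_variance(sigma2_hat)
```

On the simulated noise the raw estimates average close to 1, and the adjusted estimates are uniformly larger, which makes the resulting test statistics more conservative.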

Application to a Comparative ChIP-Seq Study during Mouse Adipogenesis

We present results of our analysis of the comparative ChIP-seq data described in the Motivating Comparative ChIP-seq Study, Data Transformation and Statistical Model section. Our initial analysis focuses on H3K27ac at gene promoter regions, because it is known that H3K27ac is positively associated with gene expression.³ We divide the genomic region around the transcription start site (−5000 to 2000 bp) into n = 280 bins, where the length of each bin is 25 bp. The data set includes m = 29,716 genes. Our goal is to identify the genes with differential enrichment of H3K27ac at the promoter regions between proliferating preadipocytes (day −2) and mature adipocytes (day 7).

Comparison of the Z_{0λ, WH} statistics and fold-change statistics

For each gene, after the normal transformation as in the Motivating Comparative ChIP-seq Study, Data Transformation and Statistical Model section, we fit a kernel-smoothing function to the difference data using a bandwidth of λ = 20/280, which over-smooths the very small signals that are likely due to noise. We calculate the test statistic for each of the 29,716 genes. To compare the two test statistics Z_0λ and Z_{0λ, WH}, we plot the histograms of these two test statistics in Figure 1 for 9,874 genes with the maximum number of read counts in both days fewer than 5. Because of the very small read counts in these genes, they are most likely not differentially enriched and therefore the test statistics should follow the standard normal distribution. Clearly, Z_{0λ, WH} follows N(0, 1) more closely than Z_0λ. We therefore use this statistic in all our analyses.

Figure 1.

Histograms of two test statistics for the mouse adipogenesis ChIP-seq data, (A) Z_0λ and (B) Z_{0λ, WH}, for 9,874 genes with the maximum number of read counts in both day −2 and day 7 fewer than 5. The red curve in each plot represents the standard normal density.

Using the test statistic Z_{0λ, WH}, we observe that about one-third of the genes show differential enrichment between preadipocytes and mature adipocytes at a Bonferroni-adjusted P value of 0.05. This is expected since the cells are very different between these two days. Large-scale differential enrichment was also observed by Mikkelsen et al.³ We observe different patterns of differential enrichment. Figure 2 shows the observed data for 12 genes with the largest test statistics. Clearly, some genes are enriched for H3K27ac in only one condition. For genes that are enriched at both time points, Figure 2 shows that these genes have different H3K27ac enrichment levels or peak heights.

Figure 2.

Observed mouse adipogenesis ChIP-seq bin-counts for top 12 genes ranked by the test statistics Z_{0λ, WH} over the promoter region for day −2 (red) and day 7 (black). Vertical line represents the transcription start site.

As a comparison, for each of the genes, we also calculate the simple fold-change statistics and the statistics used in DBChIP.¹³ In general, we observe that large Z_{0λ, WH} statistics correspond to large fold changes or large DBChIP statistics. We observe a small set of genes that have very small Z_{0λ, WH} statistics, but with very large fold changes or DBChIP statistics. These genes tend to have very small read counts. We also observe that some genes have very small fold changes, but with large Z_{0λ, WH} statistics. Figure 3 shows the plots of 12 such genes. Many such genes show a clear shift of peaks between two different cell states, which cannot be captured simply using total read counts as in fold changes and the DBChIP statistics. This indicates the importance of modeling the spatial ChIP enrichment profiles.

Figure 3.

Observed ChIP-seq bin-counts over the promoter region for day −2 (red) and day 7 (black) for 12 genes with large Z_{0λ, WH} but small fold changes. Vertical line represents the transcription start site.

Differential enrichment statistics and gene expression changes

We next investigate the relationship between our test statistics Z_{0λ, W H} and changes in expressions of the genes between the two time points. The gene expression data contain two replicates for each time point, and we take the average of two replicates as the mean value W_ij for each gene i = 1, …, m and time point j = 1, 2. We define the log₂ of the fold change of the expression levels as

Δ W_{i} = \log_{2} \frac{W_{i 2}}{W_{i 1}}

for the ith gene. We then divide the genes into two groups depending on whether higher enrichment was observed at day 7 or day −2. Specifically, we fit the kernel smoothing curve to data for each gene under day 7 and day −2 and obtain the maximum of the curves. The genes are classified as being enriched at day 7 (or day −2) if the maximum height is higher at day 7 (or day −2). Figure 4 shows the gene expression fold changes against the test statistics Z_{0λ, WH} together with the Lowess fit for genes that are enriched at day −2. We observe that larger enrichment statistics correspond to down-regulation of these genes. Similarly, Figure 4 also shows the gene expression fold changes against the test statistics Z_{0λ, WH} together with the Lowess fit for genes that are enriched at day 7. We observe that larger statistics correspond to up-regulation of these genes. Both plots make biological sense since enrichment of H3K27ac is known to promote gene expression. As a comparison, similar plots are shown in Figure 4 for the fold-change statistics. The patterns from the fold-change statistics are not as clear as those from our proposed statistics Z_{0λ, WH}.
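The fold-change definition and the grouping rule can be illustrated with a short sketch. It assumes the smoothing is applied to the bin-level profiles as supplied (the scale used for this step is not stated in the text) and that ties are assigned to the first condition; the function names and toy values are hypothetical.

```python
import numpy as np

def smoother_matrix(n, lam):
    """n x n matrix with entries K((t_k - s_j) / lam) / (n * lam), K = N(0, 1) density."""
    t = np.arange(1, n + 1) / n
    u = (t[:, None] - t[None, :]) / lam
    return np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * n * lam)

def log2_fold_change(w1, w2):
    """Delta W = log2(W_2 / W_1) from replicate-averaged expression values."""
    return np.log2(np.asarray(w2, dtype=float) / np.asarray(w1, dtype=float))

def enriched_condition(x1, x2, S):
    """Return 1 or 2: the condition whose smoothed profile has the higher maximum."""
    return 1 if np.max(S @ x1) >= np.max(S @ x2) else 2

n, lam = 280, 20 / 280
S = smoother_matrix(n, lam)
t = np.arange(1, n + 1) / n
profile = np.exp(-0.5 * ((t - 0.7) / 0.05) ** 2)
x_first, x_second = 10.0 * profile, 30.0 * profile     # toy profiles, second is taller
group = enriched_condition(x_first, x_second, S)

# Toy expression values: two replicates per time point, averaged before the ratio.
w_first = np.array([[90.0, 110.0]]).mean(axis=1)
w_second = np.array([[380.0, 420.0]]).mean(axis=1)
delta_w = log2_fold_change(w_first, w_second)
```

In this toy case the second condition has the taller smoothed profile and a fourfold higher mean expression, so the gene would fall in the group enriched at the second time point with a log₂ fold change of 2.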

Figure 4.

Plots of gene expression fold changes as a function of two different test statistics. Top: proposed smoothing-kernel test statistics; bottom: fold changes. Left panel: genes with enriched H3K27ac binding at day −2; right panel: genes with enriched H3K27ac binding at day 7.

To demonstrate this further, we define gene i as being up-regulated if ΔW_i > 1 and down-regulated if ΔW_i < −1. In Figure 5A, we divide our test statistics Z_{0λ, WH} into equal-length intervals (<0, 0–5, 5–10, 10–15, 15–20, >20) for the genes that have higher enrichment at day −2. We observe that the proportion of down-regulated genes increases as the test statistics increase. On the other hand, the proportions remain almost constant and close to zero for up-regulated genes. In contrast, for the genes that have higher enrichment at day 7, we observe exactly the opposite (see Fig. 5B). This indicates that our statistics correspond to gene expression changes very well. As a comparison, we present similar plots of the genes based on fold changes of the total read counts (see Fig. 5C and D). We observe that the separations are not as clear as those obtained using our proposed statistics.

Figure 5.

Plots of proportions of up/down-regulated genes in different intervals of the test statistics for the mouse adipogenesis ChIP-seq data, (A)-(B): proposed smoothing-kernel test statistics; (C)-(D): fold change statistics; (A), (C): genes with enriched H3K27ac at day −2; (B), (D): genes with enriched H3K27ac at day 7.

Prediction of gene expression fold changes using histone modification profiles

We next evaluate how well our proposed statistics can be used for predicting the fold changes of gene expression using ChIP-seq data. Besides the H3K27ac ChIP-seq data, we also have data from another five HMs, including H3K4me1, H3K4me2, H3K4me3, H3K27me3, and H3K36me3. In addition, for each gene, besides the promoter region, we also consider the histone modifications in the gene body and downstream regions. We evaluate the prediction for fold changes of gene expression by randomly selecting half of the genes as the training set and fitting a linear regression model,

Δ W_{i} = β_{0} + \sum_{h = 1}^{6} \sum_{l = 1}^{3} β_{h l} T S_{i, h l},

(11)

where h indexes the six HMs and l indexes the promoter region, gene body, and downstream region. Using the fitted model, we then predict the gene expression for the left-out genes. We repeat this process 100 times and calculate the average R² for model fits for the training genes and the prediction error for genes in the testing sets. As a comparison, we also consider the same model as (11) using the simple fold change statistics as the predictors. Figure 6 shows the model fit for training genes and prediction results for testing genes using our proposed statistics Z_{0λ, WH} and the fold change statistics as predictors. Clearly, we observe that our proposed statistics give a much better model fit and better prediction results. The average R² over 100 random splits of the genes is 0.57 using our statistics and 0.46 using simple fold changes, and the average prediction error is 0.47 using our statistics and 0.59 using simple fold changes.
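The repeated split-half evaluation of model (11) can be sketched with ordinary least squares. The sketch assumes that the prediction error is the mean squared error on the held-out half (the text does not define it) and runs on synthetic predictors and responses, so the numbers it produces are unrelated to those reported in the paper; the function name is hypothetical.

```python
import numpy as np

def split_half_evaluation(X, y, n_repeats=100, seed=0):
    """Average training R^2 and held-out mean squared error over random half splits."""
    rng = np.random.default_rng(seed)
    m = len(y)
    A = np.column_stack([np.ones(m), X])             # intercept plus predictors
    r2, pe = [], []
    for _ in range(n_repeats):
        perm = rng.permutation(m)
        train, test = perm[: m // 2], perm[m // 2:]
        beta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        resid = y[train] - A[train] @ beta
        r2.append(1.0 - resid.var() / y[train].var())
        pe.append(np.mean((y[test] - A[test] @ beta) ** 2))
    return float(np.mean(r2)), float(np.mean(pe))

# Synthetic stand-in for 6 HMs x 3 regions = 18 test statistics per gene.
rng = np.random.default_rng(42)
m_genes = 400
X = rng.standard_normal((m_genes, 18))
beta_true = np.linspace(-1.0, 1.0, 18)               # made-up coefficients
y = 0.2 + X @ beta_true + rng.standard_normal(m_genes)
r2_mean, pe_mean = split_half_evaluation(X, y)
```

Replacing `X` with the matrix of test statistics (or fold changes) and `y` with the observed log₂ expression fold changes would give the comparison described in the text, under the stated assumption about the error measure.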

Figure 6.

Model fit (left panel) and prediction (right panel) for log of the gene expression fold changes using the proposed statistics Z_{0λ, WH} (top panel) and fold changes (bottom panel) of six histone-modification ChIP-seq data in promoter, gene body, and downstream region.

We also observe that histone modification dynamics in the promoter and gene body are more predictive than the signals in the downstream regions for predicting the gene expression changes (see Table 1 for details). This is expected since the HMs we used are associated with transcriptional initiation (H3K4me3), open chromatin and cis-regulatory activities (H3K4me2/me1 and H3K27ac), transcription elongation (H3K36me3), and polycomb-mediated repression (H3K27me3).

Table 1.

Comparison of model fit R² and prediction error (PE) of gene expression fold changes using the proposed statistic Z_{0λ, WH} and fold change, based on ChIP-seq data of all six HMs in the promoter, gene body, and downstream regions as predictors, and for models using all three regions.

Region	Z_{0λ, WH}: R²	Z_{0λ, WH}: PE	Fold change: R²	Fold change: PE
Promoter	0.45 (0.009)	0.60 (0.012)	0.35 (0.009)	0.72 (0.015)
Gene body	0.49 (0.008)	0.57 (0.015)	0.40 (0.011)	0.66 (0.014)
Downstream	0.30 (0.009)	0.78 (0.018)	0.18 (0.007)	0.90 (0.023)
All regions	0.57 (0.008)	0.47 (0.013)	0.46 (0.009)	0.59 (0.012)

The results are based on 100 runs of randomly selected half of the genes as training set and another half as testing set. Numbers in parentheses are standard errors.

Effects of Bandwidth Selection on Identifying the Differential Enrichment Genes

In applying our kernel-based test to the mouse ChIP-seq data, we used a global bandwidth of λ = 20/280 for all the genes. Any reasonable test should capture the spatial profiles of signals in the gene regions of interest. On the other hand, the test should also smooth out small local noise, which is not biologically interesting. We suggest using a relatively large bandwidth to reduce possible false positives. Alternatively, the standard method is to apply cross-validation to find the optimal rate c(1/n)^{1/5}.²⁹ Neumeyer and Dette³⁰ suggest obtaining the nonparametric variance estimator ${\hat{σ}}_{i}^{2}$ ³¹ for each gene and estimating the bandwidth as follows

λ = {\left( \frac{\mathrm{median} ({\hat{σ}}_{i}^{2}, i = 1, \dots, m)}{n} \right)}^{1 / 5}
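The two bandwidth rules can be written as small helper functions. This is only a sketch: the function names are hypothetical, the constant c defaults to 1, and the variance estimates passed in are made-up values.

```python
import statistics

def rate_bandwidth(n, c=1.0):
    """Bandwidth of the form c * (1 / n) ** (1 / 5)."""
    return c * (1.0 / n) ** 0.2

def variance_bandwidth(sigma2_hats, n):
    """Bandwidth (median of the per-gene variance estimates / n) ** (1 / 5)."""
    return (statistics.median(sigma2_hats) / n) ** 0.2

n = 280
lam_rate = rate_bandwidth(n)                        # about 0.324, i.e. roughly 90.7 / 280
lam_var = variance_bandwidth([0.5, 1.0, 2.0], n)    # made-up variance estimates
```

With n = 280 and c = 1 the rate-based rule gives a bandwidth of about 0.324, close to the value 90/280 considered in the sensitivity analysis. The two rules coincide whenever the median variance estimate equals 1.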

We study the sensitivity of the performance of our proposed kernel-based test to bandwidth selection by considering different bandwidth values, λ₁ = 5/280, λ₂ = 20/280, λ₃ = 60/280, and λ₄ = 90/280. Here, λ₃ and λ₄ correspond to the bandwidths chosen by the nonparametric variance estimation method³⁰ and the optimal rate (1/n)^{1/5},²⁹ respectively. We calculate the kernel-based test statistics and denote these statistics as $Z_{λ_{l}, W H}$, l = 1, 2, 3, 4. We present in Figure 7 the histograms of $Z_{λ_{l}, W H}$, l = 1, 2, 3, 4, for the 9,874 genes with the maximum number of read counts in both days fewer than 5, which are analogous to the plot in Figure 1B. Clearly, the statistics $Z_{λ_{1}, W H}$ with a relatively small bandwidth lead to false positive detection, where the distribution of null genes clearly deviates to the right side of N(0, 1). On the other hand, when a large bandwidth is used, as in statistics $Z_{λ_{3}, W H}$ and $Z_{λ_{4}, W H}$, the tests are conservative, although they still fit the standard normal density curves (red line) reasonably well.

Figure 7.

Histograms of the test statistics $Z_{λ_{l}, W H}$ with the different bandwidths: (A) λ₁ = 5/280, (B) λ₂ = 20/280, (C) λ₃ = 60/280, (D) λ₄ = 90/280 for 9,874 genes with the maximum number of read counts in both day −2 and day 7 fewer than 5 in mouse adipogenesis ChIP-seq data.

We also examine how different bandwidths affect the ability to identify differentially expressed genes, where a gene is defined as a true differentially expressed gene if |ΔW_i| > 1. Overall, we observe that it is essential to smooth out the small local signals in order to reduce false-positive identification of genes with differential enrichment. A larger bandwidth gives better results than the smaller ones.

Application to an ENCODE ChIP-Seq Data with Two Replicates

To further evaluate the possible false positives in identifying genes with differential histone modification, we analyze ChIP-seq data reported in the Encyclopedia of DNA Elements (ENCODE) project³² for the human B-lymphoblastoid cell line GM12878, which is also part of the 1000 Genomes project, and HeLa-S3 cervical carcinoma cells. Our analysis still focuses on the H3K27ac mark at the promoter regions of the genes with count data available in n = 280 bins for each gene. In this experiment, there are a total of m* = 23,807 genes. Besides the ChIP-seq data for two biological replicates, two input data sets are also available. Ideally, we should not expect any genes with differential enrichment between the two replicates. We apply the same procedure as in our analysis of the mouse data in the Application to a Comparative ChIP-seq Study During Mouse Adipogenesis section to the data between two ChIP-seq replicates and calculate test statistics Z_{new, i} for each gene i, i = 1, …, m*. The histogram of Z_new for all the genes in Figure 8 (top plot) shows that the majority of the test statistics follow the standard normal distribution. In addition, using a Bonferroni-adjusted P value of 0.05, our procedure identifies only 263 genes that show differential enrichment between the two replicates, which results in a less than 1.5% false discovery rate. This analysis further demonstrates that our proposed kernel-based nonparametric testing procedure is not only powerful enough to detect the true differentially enriched regions but also makes fewer false identifications.
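The thresholding step can be sketched as follows, assuming that P values are computed from the upper tail of the standard normal distribution (large statistics indicate differential enrichment) and that the Bonferroni correction divides the level by the number of genes tested. The function names and the toy statistics are hypothetical.

```python
import math

def upper_tail_p(z):
    """One-sided P value P(N(0, 1) > z), computed with the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def bonferroni_hits(z_values, alpha=0.05):
    """Indices of genes whose P value is below alpha divided by the number of tests."""
    m = len(z_values)
    return [i for i, z in enumerate(z_values) if upper_tail_p(z) < alpha / m]

hits = bonferroni_hits([0.0, 6.0, 1.0])     # toy statistics for three genes
called_fraction = 263 / 23807               # counts reported for the replicate comparison
```

Only the second toy gene passes the corrected threshold. For the replicate comparison, 263 of 23,807 genes is about 1.1%, consistent with the bound of 1.5% stated above.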

Figure 8.

Top: Histogram of differential enrichment test statistics Z_new between two biological replicates of the ENCODE data for all 23,807 genes. Bottom: Histogram of differential enrichment test statistics Z_new between two cell types (B-lymphoblastoid cell vs HeLa-S3 cervical carcinoma cells) of the ENCODE data for all 23,807 genes. The red curve represents the standard normal density.

We also perform an analysis to identify the genes with differential enrichment of histone modification between B-lymphoblastoid cells and HeLa-S3 cervical carcinoma cells. Figure 8 (bottom plot) shows the histogram of the test statistics for all 23,807 genes. Using a Bonferroni threshold for a genome-wide level of 0.05, we identify 6,647 genes that show differential H3K27ac enrichment at their promoter regions.

Conclusions and Discussion

We have proposed a kernel-smoothing-based nonparametric test to identify genes with differential histone enrichment from ChIP-seq data. Unlike the currently available methods, our method models the spatial histone enrichment profiles at the promoter regions of the genes, rather than simply modeling the total read counts in a given window. The method can therefore capture different types of differences in protein-enriched profiles between two experimental conditions. To detect differences in enrichment profiles, we constructed a nonparametric statistic based on kernel smoothing of the differences between the profiles after an approximately normalizing transformation of the data. We have shown that the proposed test statistic correlates with gene expression changes better than other statistics, and that models based on a combination of different HMs can effectively predict gene expression fold changes. Although prediction of gene expression from ChIP-seq data has been studied in many published works,^33,34 these papers focused only on predicting gene expression in a static state. Our results further demonstrate that changes in histone modifications and dynamic chromatin signatures can also be highly predictive of the fold changes in gene expression between two different cellular states.
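The contrast between modeling a spatial profile and modeling a total read count can be illustrated with a minimal sketch. It is not the test statistic of this paper: the Gaussian kernel, the bandwidth of 5 bins, the 1/4 offset inside the square root, and all function names are assumptions made for illustration only.

```python
import math

def sqrt_transform(counts, offset=0.25):
    # Square-root transform of bin counts; the 1/4 offset is an assumption.
    return [math.sqrt(c + offset) for c in counts]

def gaussian_kernel_smooth(values, bandwidth):
    # Nadaraya-Watson estimate at each bin, Gaussian kernel on bin indices.
    n = len(values)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return smoothed

def smoothed_difference(counts_a, counts_b, bandwidth=5.0):
    # Bin-wise difference of transformed counts, then kernel smoothing.
    diff = [a - b for a, b in zip(sqrt_transform(counts_a), sqrt_transform(counts_b))]
    return gaussian_kernel_smooth(diff, bandwidth)

# Toy promoter profiles with the same total read count but different shapes.
profile_a = [2] * 10 + [8] * 5 + [0] * 15
profile_b = [2] * 30

smoothed = smoothed_difference(profile_a, profile_b)
print("total counts:", sum(profile_a), sum(profile_b))
print("largest smoothed difference:", max(abs(x) for x in smoothed))
```

In this toy example a comparison of total counts sees no difference between the two profiles, whereas the smoothed difference curve is clearly nonzero over part of the region.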

We considered only the problem of identifying differentially enriched regions between two conditions, where we apply kernel smoothing to the differences of the normal-transformed data in order to smooth out small local changes that might be due to differences in GC content or mappability of the sequencing reads. Because of this smoothing, we expect our procedure to be robust to such small changes due to genomic features. If input data are available, one can take the difference of the square-root-transformed count data between ChIP and input and then apply our proposed test with kernel smoothing. The method requires users to specify the regions to test. Besides the regions close to genes that we tested in this paper, one can also first identify histone-enriched regions using existing methods such as MACS⁷ or SICER³⁵ and then test for differential enrichment using our proposed method. Finally, our proposed method can be extended to identify differential enrichment across multiple conditions. In such cases, we can define the test statistic as the mean or maximum of all the pairwise statistics proposed in this paper.
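The two extensions mentioned here, input correction and multiple conditions, might be organized as in the following sketch. The function names, the 1/4 offset in the square root, and the condition labels are hypothetical. In particular, `toy_stat` is only a placeholder for the pairwise test statistic of this paper, which is not reproduced here.

```python
import math
from itertools import combinations

def input_corrected_profile(chip_counts, input_counts, offset=0.25):
    # Difference of square-root-transformed ChIP and input counts per bin.
    # The 1/4 offset is an assumption, not taken from the text.
    return [math.sqrt(c + offset) - math.sqrt(i + offset)
            for c, i in zip(chip_counts, input_counts)]

def combine_pairwise(profiles, pairwise_stat, how="max"):
    # Apply a pairwise statistic to every pair of conditions and combine.
    stats = [pairwise_stat(profiles[a], profiles[b])
             for a, b in combinations(sorted(profiles), 2)]
    if how == "max":
        return max(stats)
    if how == "mean":
        return sum(stats) / len(stats)
    raise ValueError("how must be 'max' or 'mean'")

def toy_stat(x, y):
    # Placeholder only: mean squared bin-wise difference between two profiles.
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

# Hypothetical counts for one gene in three conditions (ChIP, input).
profiles = {
    "cond1": input_corrected_profile([3, 9, 4, 2], [1, 2, 1, 1]),
    "cond2": input_corrected_profile([3, 8, 5, 2], [1, 2, 1, 1]),
    "cond3": input_corrected_profile([9, 2, 1, 7], [1, 1, 2, 1]),
}

print("max of pairwise statistics:", combine_pairwise(profiles, toy_stat, "max"))
print("mean of pairwise statistics:", combine_pairwise(profiles, toy_stat, "mean"))
```

The reference distribution of such a combined statistic is not derived in the text, so a cutoff for the mean or maximum would have to be established separately.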

Author Contributions

Conceived and designed the experiments: HL, QW, KW. Analyzed the data: HL, QW, KW. Wrote the first draft of the manuscript: QW, HL. Contributed to the writing of the manuscript: HL, QW, KW. Agree with manuscript results and conclusions: QW, KW, HL. Jointly developed the structure and arguments for the paper: QW, HL, KW. Made critical revisions and approved final version: QW, KW, HL. All authors reviewed and approved of the final manuscript.

References

1. Park. ChIP-Seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009; 10: 669–80.

2. Johnson, Mortazavi, Myers, Wold. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007; 316: 1497.

3. Mikkelsen, Xu, Zhang. Comparative epigenomic analysis of murine and human adipogenesis. Cell. 2010; 143: 156–69.

4. Mortazavi, Williams, McCue, Schaeffer, Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5: 621–8.

5. Barski, Cuddapah, Cui. High-resolution profiling of histone methylations in the human genome. Cell. 2007; 129: 823–37.

6. Hon, Wang, Ren. Discovery and annotation of functional chromatin signatures in the human genome. PLoS Comput Biol. 2009; 5: e1000566.

7. Zhang, Liu, Meyer. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008; 9: R137.

8. Kuan, Chung, Pan, Thomson, Stewart, Keleş. A statistical framework for the analysis of ChIP-Seq data. J Am Stat Assoc. 2011; 106: 891–903.

9. Ji, Jiang, Ma, Johnson, Myers, Wong. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol. 2008; 26: 1293–300.

10. Schwartzman, Jaffe, Gavrilov, Meyer. Multiple testing of local maxima for detection of peaks in ChIP-Seq data. Ann Appl Stat. 2013; 7: 471–94.

11. Spyrou, Stark, Lynch, Tavaré. BayesPeak: Bayesian analysis of ChIP-seq data. BMC Bioinformatics. 2009; 10: 299.

12. O'Geen, Echipare, Farnham. Using ChIP-seq technology to generate high-resolution profiles of histone modifications. Methods Mol Biol. 2011; 791: 265–86.

13. Liang, Keleş. Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics. 2012; 28: 121–2.

14. Chen, Jørgensen, Kolde. Prediction of RNA polymerase II recruitment, elongation and stalling from histone modification data. BMC Genomics. 2011; 12: 544.

15. He, Meyer, Shin. Nucleosome dynamics defines transcriptional enhancers. Nat Genet. 2010; 42: 343–7.

16. Angel, Song, Dean, Howard. A polycomb-based switch underlying quantitative epigenetic memory. Nature. 2011; 476: 105–8.

17. Stark, Brown. DiffBind: differential binding analysis of ChIP-Seq peak data. Bioconductor. 2011. http://bioconductor.org/packages/release/bioc/vianettes/DiffBind/dec/DiffBinded/.

18. Taslim, Wu, Yan. Comparative study on ChIP-seq data: normalization and binding pattern characterization. Bioinformatics. 2009; 25: 2334–40.

19. Shao, Zhang, Yuan, Orkin, Waxman. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets. Genome Biol. 2012; 13: R16.

20. Langmead, Trapnell, Pop, Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10: R25.

21. Brown, Cai, Zhang, Zhao, Zhou. The root-unroot algorithm for density estimation as implemented via wavelet block thresholding. Probab Theory Relat Fields. 2010; 146: 401–33.

22. Brown, Gans, Mandelbaum. Statistical analysis of a telephone call center: a queueing science perspective. J Am Stat Assoc. 2005; 100: 36–50.

23. Lepski, Spokoiny. Minimax nonparametric hypothesis testing: the case of an inhomogeneous alternative. Bernoulli. 1999; 5: 333–58.

24. Hastie, Tibshirani. Generalized Additive Models. Vol 43. London: Chapman and Hall/CRC; 1990.

25. Bentler, Xie. Corrections to test statistics in principal Hessian directions. Stat Probab Lett. 2000; 47: 381–9.

26. Wilson, Hilferty. The distribution of chi-squared. Proc Natl Acad Sci U S A. 1931; 17: 684–8.

27. Efron, Tibshirani, Storey, Tusher. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001; 96: 1151–60.

28. Tusher, Tibshirani, Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001; 98: 5116–21.

29. Gasser, Kneip, Köhler. A flexible and fast method for automatic smoothing. J Am Stat Assoc. 1991; 86: 643–52.

30. Neumeyer, Dette. Nonparametric comparison of regression curves: an empirical process approach. Ann Stat. 2003; 31: 880–920.

31. Rice. Bandwidth choice for nonparametric regression. Ann Stat. 1984; 12: 1215–30.

32. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489: 57–74.

33. Karlic, Chung, Lasserre, Vlahovicek, Vingron. Histone modification levels are predictive for gene expression. Proc Natl Acad Sci U S A. 2010; 107: 2926–31.

34. Dong, Greven, Kundaje. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012; 13: R53.

35. Zang, Schones, Zeng, Cui, Zhao, Peng. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009; 25: 1952–8.

Nonparametric Tests for Differential Histone Enrichment with ChIP-Seq Data
