Abstract
DNA methylation (the addition of a methyl group to a cytosine) is an important epigenetic event in mammalian cells because it plays a key role in regulating gene expression. Most previous methylation studies assume that DNA methylation occurs on both positive and negative strands. However, a few studies have reported that in some genes, methylation occurs only on one strand (ie, hemimethylation) and has clustering patterns. These studies report that hemimethylation occurs on individual genes. It is unclear whether hemimethylation occurs genome-wide and whether there are hemimethylation differences between cancerous and noncancerous cells. To address these questions, we have developed the first-ever pipeline, named hemimethylation pipeline (HMPL), to identify hemimethylation patterns. Utilizing the available software and the newly developed Perl and R scripts, HMPL can identify hemimethylation patterns for a single sample and can also compare two different samples.
Introduction
DNA methylation is an epigenetic modification in a cell. This modification adds a methyl group (CH3) to the 5′ position of cytosine in a DNA sequence and is inheritable through cell division.1,2 For mammalian cells, it occurs only at CPG sites (cytosines paired with guanines). DNA methylation plays an essential role for both normal and cancerous cell development3–6 and is closely related to significant processes, including X-chromosome inactivation, genomic imprinting, and tumor growth.7–10
In a genome, there are different types of methylation patterns, including hypermethylation, hypomethylation, and hemimethylation. Hypermethylation occurs when samples in one group (eg, cancer patients) have more methylation than the samples in another group (eg, normal individuals). Hypomethylation occurs when samples in one group have less methylation than the samples in another group. Hemimethylation means that at a CPG site, only one strand of the DNA is methylated (denoted M in Fig. 1), and the other strand is unmethylated (denoted U in Fig. 1). Recent research studies11–13 show that, on a number of genes, hemimethylation may occur as a same-strand cluster (Fig. 1A), a polarity (or reverse) hemimethylation cluster (Fig. 1B) with only two CPG sites, or a different-strand cluster with more than two CPG sites (Fig. 1C). The cluster pattern means that two or more consecutive CPG sites are methylated only on one DNA strand, and not on the other strand, as shown in Figure 1A. The polarity (or reverse) hemimethylation clusters imply that, at the two consecutive CPG sites, the methylation patterns on the positive and negative strands are “MU and UM,” respectively, as shown in Figure 1B. “MU and UM” means that, at two adjacent CPG sites, the first site is hemimethylated as MU on the positive and negative strands, while the next adjacent site is hemimethylated as UM on two strands, which has a reversed pattern. The hemimethylation patterns shown in Figure 1 are like the footprints of DNA demethylation (ie, methyl groups are removed) in cancer. 12 This “footprint” role exists because hemimethylation is a transitional state between being methylated and having no methylation at specific CPG sites. Therefore, the identification of hemimethylation is important for understanding both methylation events and the establishment of different methylation patterns.

Examples of hemimethylation patterns. M and CmG represent a methylated site. U and CG represent an unmethylated site. A is an example of hemimethylation that occurs on the same strand. B is an example of polarity or reverse hemimethylation pattern with only two CPG sites. C is an example of hemimethylation on different strands with more than two CPG sites.
A major experimental limitation in hemimethylation studies has been the difficulty in obtaining methylation signals from the two complementary strands of DNA molecules for all CPG sites in an entire genome. Previous hemimethylation studies can only obtain the hemimethylation data for a few genes using the traditional Sanger sequencing and hairpin sequencing methods.11–13 Even though microarray technologies can obtain methylation levels genome-wide, they cannot produce methylation signals on two DNA strands separately. However, the next-generation sequencing (NGS) technology,14,15 combined with the bisulfite conversion technique (ie, C is converted to U and then becomes T), makes it possible to obtain methylation signals at the CPG site level on both DNA strands in an entire genome.16–19 During the last several years, a number of pioneering research groups have successfully used the bisulfite-converted methylation sequencing method on either
Methods
The Workflow of HMPL
The workflow of our pipeline HMPL (see, Fig. 2) consists of two parts. Part I (preprocessing) utilizes available software packages in three steps: sequencing data quality assessment, trimming, and alignment (ie, Steps 1–3 of the workflow). Part II (parsing) is the new feature developed by our group. This part includes data parsing and summary reports (ie, Steps 4 and 5 of the workflow). A more detailed description of these steps is introduced below. HMPL code and resource files can be downloaded from the following web link: http://hal.case.edu/˜sun/HMPL/HMPL.zip.

Workflow of the HMPL.
Step 1: assess sequencing qualities using FastQC. 28
FastQC is a software package for assessing sequencing qualities by generating basic and informative diagnostic plots for sequencing data. This package provides a modular set of analyses for users to obtain a quick impression as to whether or not there are any obvious and serious problems before they start any downstream data analysis. FastQC produces basic statistics plots and summary reports for (1) per base sequence quality, (2) sequence quality scores, (3) summary of per base sequence content, (4) per sequence GC content, (5) per base N content, (6) sequence length distribution, (7) duplicate sequences, (8) overrepresented sequences, (9) adapter content, (10) Kmer (or K-base) content, and (11) per tile sequence quality.
Step 2: trim sequencing data.
Quite often, sequencing quality is very low at the 3′ end in sequencing reads, and raw reads may include adapter sequences. Therefore, HMPL has included the quality trimming and adapter-trimming step. In particular, dynamic trimming (the
Step 3: align reads using BRAT-bw 31 and obtain methylation ratios at all cytosine sites.
After trimming, alignment is done using BRAT-bw.
31
After alignment, the methylation level (or ratio) at each cytosine (or C) site is obtained using the
Step 4: identify hemimethylation patterns and compare these patterns in two samples.
In this step, all individual CPG sites that are hemimethylated are first identified. These sites are then classified into two groups: hemimethylated singleton and hemimethylated cluster. A hemimethylated CPG site is defined as a singleton if its adjacent CPG sites within
Because high-throughput sequencing data may include sequencing and alignment errors, HMPL identifies a hemimethylated site using the following criteria. First, the user may determine a coverage cutoff value
Step 5: provide genetic annotation and report findings.
For each hemimethylated CPG site, HMPL provides the following genetic information.
Gene: If a hemimethylated CPG site is located on a gene, HMPL will report the name ofthat gene.
Pomoter: If a hemimethylated CPG site is within a
Input and Output
The HMPL uses raw sequencing reads (in FASTQ format) as input in Step 1 and Step 2. In Steps 3, 4, and 5, the input files are the output files from the previous step. More detailed information about the input and output files of HMPL can be found in the user manual, which can be downloaded from http://hal.case.edu/˜sun/HMPL/HMPL.user.manual.pdf.
Usage, Command Options, and Running Time
HMPL is written in Perl32,33 and R
34
scripts. It can be run as shown below in a LINUX or UNIX environment. The preprocessing step of HMPL (Part I) can be implemented with the following command (the command options of
The command options of HMPL Part I (
The preprocessing pipeline ties the software and source code together with the appropriate dataflow to ensure that the correct output is achieved. Users need to have Perl, Python 2.6, R, FastQC, and BRAT-bw software installed on their system. They can run HMPL by entering commands in a Unix/Linux environment. If users have finished the hemimethylation preprocessing pipeline and have obtained the files of combined CPG sites, they may only run the parsing analysis using the Part II of HMPL (ie,
The command options of HMPL Part II (
In order for users to test the HMPL, we have prepared a small dataset with 5 million sequencing reads (a FASTQ *.fastq file), the human chromosome 22 reference sequence (a FASTA *.fa file), and the example scripts. Users may align these 5-million reads to chromosome 22 and then identify hemimethylated sites using both Part I and Part II of HMPL. Users simply need to download the data, install the HMPL package, and then change the data path/directory in the example script accordingly to run the HMPL. It takes about 11 minutes to run this small dataset using a Linux computer with 4 GB RAM. In addition, in order for users to explore the different options and arguments of HMPL, we have prepared a file named “README.” This file includes example scripts of running the HMPL using different command options and also explains to the user what to expect in the screen output. The small dataset, example scripts, and the “README” file can be downloaded from the following web link: http://hal.case.edu/˜sun/HMPL/HMPL.zip.
In order to show the running time of HMPL in practice, we use two human example datasets. Each dataset has ˜50 million raw sequencing reads, which will be introduced in detail in the Results section. Using the Linux server with dual quad-core 2.66 GHz Xeon E5430 processor that has 4 GB RAM for each core, it takes ˜4 hours to run Part I (
Results
The HMPL pipeline can be used to compare any two samples of bisulfite sequencing data. In this paper, we demonstrate the use of HMPL using publicly available bisulfite-treated methylation sequencing datasets for cell lines MCF10A and MCF7. 25 These two samples are breast cancer cell lines. Because a number of hypermethylated genes have been reported for breast cancer cells 35 and there are hemimethylation patterns reported in individual genes,11–13 it is very likely that many CPG sites have been hemimethylated in these two cell lines. MCF10A is nontumorigenic and MCF7 is tumorigenic. We select these two cell lines because their hemimethylation patterns could be different. The bisulfite methylation sequencing data of MCF10A and MCF7 and more information about these two cell lines can be found from the corresponding references.25,36 The sequencing reads of MCF10A and MCF7 are generated using the reduced representative bisulfite sequencing (RRBS) protocol. 16 There are 54,295,326 and 50,054,248 sequencing reads for MCF10A and MCF7, respectively, and the read length is 50-base for each dataset.
In order to identify hemimethylation singletons and clusters in both MCF10A and MCF7 and compare these two samples, we have run both Part I (
The description of HMPL Part II (
The hemimethylated singleton and cluster patterns are compared and summarized in Figure 3. In this figure, “I. Singleton” means comparing the singleton hemimethylated CPG sites in MCF10A and MCF7. “II. Consecutive Polarity Cluster” shows the results of comparing the polarity (or reverse) clusters that include two consecutive CPG sites; no other CPG sites are located between these two sites. “III. Non-Consecutive Polarity Cluster” means comparing the polarity (or reverse) clusters that include two CPG sites that are not consecutive. There is at least one CPG site located between these two sites, but there are either no sequencing reads (or data) or no hemimethylation sites between them. “IV. Non-Polarity Cluster” refers to the clusters that do not have the polarity (or reverse) pattern (eg, the patterns shown in Fig. 1A and C).

MCF10Aand MCF7 hemimethylation pattern comparison results. In each Venn diagram, the “A” entry means the number of CPG sites or clusters that are hemimethylated in the MCF10A sample, but not in the MCF7 sample. The “B” entry shows the number of CPG sites or clusters that are hemimethylated in the MCF1OA sample, but there are no sequencing reads for these CPG sites in the MCF7 sample. The “C” entry represents the number of CPG sites or clusters that are hemimethylated in both MCF10A and MCF7. The “D” entry indicates the number of CPG sites or clusters that are hemimethylated in the MCF7 sample, but not in the MCF10A sample. The “E” entry means the number of CPG sites that are hemimethylated in the MCF7 sample, but there are no sequencing reads for these CPG sites in the MCF10A sample.
Figure 3 and Table 4 show the results of comparing MCF10A with MCF7, which are summarized based on the output files named “compare” as shown in the last row of Table 3. The comparison results indicate that the hemimethylation patterns between the non-tumorigenic sample MCF10A and tumorigenic sample MCF7 are different at some CPG sites and/or genomic regions. Therefore, it is important that these CPG sites are investigated further. Our pipeline HMPL has provided gene annotation files for all CPG sites by giving names of genes in which hemimethylated CPG sites are located, as well as the names of genes in whose promoter regions the hemimethylated CPG sites are located.
The summary of MCF10A and MCF7 hemimethylation patterns.
The results of comparing MCF10A and MCF7 show that there are more polarity (or reverse) clusters (eg, Fig. 1B) than the single-strand hemimethylation clusters (eg, Fig. 1A). This may be as a result of the fact that the methylation sequencing data we have used are generated by the RRBS protocol, 16 which only sequences a small percentage of CPG sites in a human genome. In fact, there are ˜5% of the CPG sites with at least 3x coverage in the RRBS data we have analyzed. If we use the whole genome bisulfite sequencing (WGBS) data, it is very likely that more single-strand hemimethylation clusters would be identified. Currently, we are not aware of any WGBS data for either MCF10A or MCF7. Therefore, we have used the available RRBS data to demonstrate the usage of HMPL. Even though RRBS data are not ideal for identifying all hemimethylation sites in an entire genome, we have found many hemimethylation singletons and clusters, which show the capability of our HMPL package. In fact, HMPL can be used to compare any two samples with data generated using either the RRBS or WGBS protocol.
Discussion
For the MCF7 sample, 532 genes have at least three hemimethylated CPG sites. In order to see if the genes with hemimethylated CPG sites are biologically important or meaningful, we have further investigated the 532 genes by comparing them with oncogenes, breast cancer methylated genes, and transcription factors. The comparison results show that seven of these genes are methylated, 17 are oncogenes, and 62 are transcription factors. We have also conducted the gene set enrichment analysis (GSEA) for these 532 genes using the GSEA software package and the molecular signature database provided by the Broad Institute.
37
The analysis results show that 87 genes are significantly represented in (or overlapped with) 10 cancer modules (with
Ten significant cancer modules identified using GSEA. The first column is the gene symbol, and the other columns indicate if a gene belongs to a specific cancer module. “X” shows that a gene belongs to that module, and a blank cell means that a gene does not belong to a specific module.
There are a number of alignment tools for bisulfite methylation sequencing data, such as BRAT, 29 BRAT-bw, 31 BSMAP, 39 BS Seeker, 40 Bismark, 41 MethylCoder, 42 RMAPBS, 43 Pash, 44 and BatMeth. 45 A comprehensive list of these tools can be found at omictools.com. 46 Among all these available tools, we have tested BRAT-bw, BRAT, and BSMAP. We have found that they provide similar results, but BRAT-bw is faster. BRAT-bw is user-friendly and has the following useful features: (1) it can align both single-end and paired-end reads, (2) it can produce the ACGT count for all cytosines in a genome, (3) it can account for overlapping paired-end reads, and (4) it can check strands. If users prefer another alignment tool, they can obtain the alignment results and methylation levels using their preferred alignment tool, and then run Part II of the HMPL to obtain hemimethylated singletons and clusters. If necessary, this can be done with some minor format changes of their alignment output files. Reformatting is easy because the parsing pipeline of HMPL (ie, Part II) only requires data with the following columns that most alignment tools provide for two DNA strands: chromosome, start position, end position, total number of sequencing reads, methylation level, and DNA strand.
The cutoff values used in HMPL, especially the ones provided in Table 2 for the parsing pipeline (ie, part II), should be determined depending on the sequencing quality and coverage. If the user has a sequencing dataset with very good quality and high coverage, changing the cutoff value may not significantly affect the results. However, if the sequencing dataset has low coverage and poor quality, changing the cutoff values may lead to very different results. For this case, we recommend that users set up very stringent cutoff values to reduce false discovery rates. We set up the default values based on findings in previous publications, some basic and common knowledge of methylation sequencing data, and our experience with bisulfite sequencing data; however, every dataset is different. We suggest that the users first determine the quality and coverage of their data, and then try different values to see if their results are dramatically different. If results are dramatically different, users may choose results that are obtained based on stringent cutoff values, especially when they plan to do experimental validation. Using results based on stringent cutoff values can ensure a high validation rate and then the user may expand their validation list by adding more hemimethylated singletons or clusters from the list obtained with less stringent cutoff values.
Ideally, it is best to have a list of known hemimethylated and nonhemimethylated sites to study the true and false discovery rates (or sensitivity and specificity) of HMPL. For the nonhemimethylated sites, we can choose the CPG sites that are located in the housekeeping genes, which are relatively stable and not likely to be methylated on either strand. In fact, housekeeping genes have been used for the purpose of “negative control” in previous methylation studies.47–49 For the example dataset MCF7, using the coverage cutoff of 5× (that is, at least five reads to cover each strand) and other default settings, there are 532 genes with at least three hemimethylated sites. We compare these 532 genes with the 205 known housekeeping genes used in a previous study, 49 and there is no overlap; therefore, none of these 532 genes are housekeeping genes. As for the known hemimethylated sites, to the best of our knowledge, no available list of genes or sites can be used as a “positive control.” This is because genome-wide hemimethylation study is rarely done, and a few hemimethylated sites have been experimentally validated. Another possible way of validating the HMPL is to use simulated data. However, we have decided not to use this approach because there is little knowledge about genome-wide hemimethylation patterns. With little known information, a simulation would be very arbitrary and the simulated data would not reflect true unknown patterns. The purpose of our HMPL development is to provide an exploratory tool for this research topic. For genes identified with a number of hemimethylated CPG sites, users may do further investigation by studying their biological functions and relationships with other genes using pathway analyses. Users may also do experimental validation to discover novel methylated or hemimethylated genes.
Our pipeline has two limitations. First, the HMPL is designed for identifying hemimethylation only at CPG sites, but not at any non-CPG sites (eg, CHG and CHH sites, where H represents A, C, or T in a DNA sequence). If users are interested, our algorithms may be modified to study the hemimethylation of non-CPG sites. Second, our pipeline is developed for identifying hemimethylated sites (or clusters) by comparing two samples, but it is not designed for comparing multiple samples in two or more groups. Because the hemimethylation study is an area with little research, HMPL is good for preliminary studies. As for the topic of identifying hemimethylation patterns in multiple samples and in multiple groups, more sophisticated statistical methods may be used, and our group is working on projects related to this approach.
For the first step of the HMPL workflow, we have used the software package FastQC. Even though FastQC is not designed for bisulfite-treated methylation sequencing data, it can provide informative diagnostic plots for methylation sequencing data. If users find some serious sequencing quality issues in their data, we recommend that they check data more thoroughly using other available software packages, such as SAAP-RRBS, 50 BSeQC, 51 and MethyQA, 27 before they interpret their HMPL results.
Conclusion
Hemimethylation patterns are useful for studying DNA methylation events. Therefore, it is important to develop a software package to identify such patterns. To address this need, we have developed a new software package, HMPL, which includes both preprocessing and data parsing. For two samples, each with 50 million reads, it takes a few hours for HMPL to align the sequencing reads, and it only takes a few minutes to process the methylation level data. If users have obtained their coverage and methylation ratio data, Part II of HMPL can identify hemimethylation patterns in minutes.
Author Contributions
Conceived and designed the experiments: SS. Analyzed the data: SS, PL. Wrote the first draft of the manuscript: SS. Contributed to the writing of the manuscript: SS, PL. Agreed with manuscript results and conclusions: SS, PL. Jointly developed the structure and arguments for the paper: SS, PL. Made critical revisions and approved the final version: SS, PL. Both authors reviewed and approved the final manuscript.
Footnotes
Acknowledgments
We appreciate the support provided by our IT colleagues at Case Western Reserve University and Texas State University.
