Abstract
The differentiation of embryonic stem cells into various lineages is highly dependent on the chromatin state of the genome and patterns of gene expression. To identify lineage-specific enhancers driving the differentiation of progenitors into pancreatic cells, we used a previously described computational framework called Total Functional Score of Enhancer Elements (TFSEE), which integrates multiple genomic assays that probe both transcriptional and epigenomic states. First, we evaluated and compared TFSEE as an enhancer-calling algorithm with enhancers called using GRO-seq-defined enhancer transcripts (method 1) versus enhancers called using histone modification ChIP-seq data (method 2). Second, we used TFSEE to define the enhancer landscape and identify transcription factors (TFs) that maintain the multipotency of a subpopulation of endodermal stem cells during differentiation into pancreatic lineages. Collectively, our results demonstrate that TFSEE is a robust enhancer-calling algorithm that can be used to perform multilayer genomic data integration to uncover cell type-specific TFs that control lineage-specific enhancers.
Introduction
Embryonic development is a complex process during which pluripotent cells differentiate into specialized cells of various lineages. The mechanisms underlying developmental competence are highly complex and are dependent on the tissue type and precise timing of signaling cues. 1 Lineage specification is dependent on the interactions of transcription factors (TFs) and chromatin states at enhancers. Enhancers and the TFs regulating their formation have been shown to play an important role in cell type–specific activation of gene expression.2,3 Although thousands of potential enhancers have been identified in cell types derived from various lineages and tissues, identification of the enhancers that are active (versus inactive or poised) remains a major challenge. 4 In addition, the ability to identify the TFs acting at numerous enhancers in each cell type is challenging.5,6
Enhancers have been shown to share several common features, such as increased chromatin accessibility (as measured by DNase-seq or ATAC-seq)7-9 and enrichment of posttranslational modification of the amino-terminal tails of core histone proteins (as assessed by ChIP-seq), including histone H3 lysine 4 monomethyl (H3K4me1) and histone H3 lysine 27 acetyl (H3K27ac).10-12 While these epigenomic features can reveal the location of many enhancers across the genome, they cannot readily differentiate between active and inactive enhancers.13,14 Recent genomic assays have shown that active enhancers are bound by RNA polymerase II (Pol II) and are transcribed, producing noncoding RNAs known as enhancer RNAs (eRNAs).14-16 While the full breadth of functions of eRNAs are unknown, we and others have shown that enhancer transcription (as measured by total RNA-seq, GRO-seq, or PRO-seq) can be used in the absence of any other genomic information to predict enhancer activity.3,15-23
In recent years, advances in technology have facilitated the large-scale functional characterization of enhancer activity22,24-26 and the annotation of TF-binding sites (TFBSs) genome-wide in various cell types and tissues.5,27 However, due to the large number of cell types, TFs, and experimental conditions, 28 integration of these independent data sets to achieve a comprehensive analysis of gene expression and actionable predictions of TFs driving cell type–specific gene expression is challenging. Analyses that predict TFBSs, which are usually 4 to 12 nucleotides in length, 29 using TF-binding profile databases29-31 fail to consider that such sequences occur frequently by chance throughout the genome and that TF binding is cell type specific. 32 To overcome these limitations, we previously established a computational pipeline and tool called Total Functional Score of Enhancer Elements (TFSEE), which can be used to identify location and activity of enhancers in any cell or tissue type together with their cognate TFs. 33
In this study, we aimed to (1) evaluate TFSEE as an enhancer-calling algorithm and (2) understand the TF-driven transcriptional programs differentiating human embryonic stem cells (hESCs) into pancreatic cells.1,34 This developmental model allowed us to explore spatiotemporal gene regulation during development by enhancers and TFs. In the studies presented herein, we provide a detailed characterization of TFSEE and demonstrate the broader use of TFSEE to identify enhancers and TFs during the differentiation of embryonic stem cells into pancreatic progenitor cells to uncover cell type–specific TFs that control lineage-specific enhancers (Figure 1).

Flowchart of data analysis and inputs into TFSEE. Flowchart describing the preprocessing steps, methods, and subsequent analysis described in this article. TFSEE accepts enhancer calls from different inputs: Method 1, enhancers called using enhancer transcription based on GRO-seq data. Method 2, enhancers called using enrichment of histone modifications. TF indicates transcription factors; TFSEE, Total Functional Score of Enhancer Elements.
Materials and Methods
Genomic data curation
We used previously published GRO-seq, ChIP-seq, and RNA-seq data from1,34 time course differentiation of hESCs to pancreatic endoderm (PE). All data sets are available from NCBI’s Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) or EMBL-EBI’s ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) repositories using the accession numbers listed in Table S1.
Analysis of ChIP-seq data
The raw reads were aligned to the human reference genome (GRCh37/hg19) using default parameters in Bowtie version 1.0.0. 35 The aligned reads were subsequently filtered for quality and uniquely mappable reads were retained for further analysis using Samtools version 0.1.19 36 and Picard version 1.127 (http://broadinstitute.github.io/picard/). Library complexity was measured using BEDTools version 2.17.0 37 and meets ENCODE data quality standards. 38 Relaxed peaks were called using MACS version 2.1.0 39 with a P value of 1 × 10−2 for each replicate, pooled replicates’ reads, and pseudoreplicates. Peak calls from the pooled replicates that were observed in either both replicates or in both pseudoreplicates were used for subsequent analysis.
Analysis of RNA-seq data
The raw reads were aligned to the human reference genome (GRCh37/hg19) using default parameters in STAR version 2.4.2a. 40 Quantification of genes against Gencode version 19 41 annotations was done using default parameters in RSEM version 1.2.31. 42
Analysis of GRO-seq data
The GRO-seq reads were trimmed to the first 36 bases to trim adapter and low-quality sequence, using default parameters of fastx_trimer in fastx-toolkit version 0.0.13.2 (http://hannonlab.cshl.edu/fastx_toolkit/). The trimmed reads were aligned to the human reference genome (GRCh37/hg19) using default parameters in BWA version 0.7.12. 36
Kernel density
Kernel density plot representations were used to express the univariate distribution of ChIP-seq reads under peaks, RNA-seq reads for protein-coding genes, and GRO-seq reads for short paired and short unpaired eRNAs. The kernel density plots were calculated in Python (ver. 2.7.11) using the kdeplot function from seaborn version 0.7.1 (http://seaborn.pydata.org/) with default parameters.
Defining transcription start sites and promoters
We made distinct transcription start sites (TSSs) for protein-coding genes from Gencode version 19 41 annotations using MakeGencodeTSS (https://github.com/sdjebali/MakeGencodeTSS). We identified active promoters using enrichment of H3K4me3. 43 An RPKM cutoff of ⩾1 for H3K4me3 in at least one cell line was used to identify a peak as an active enhancer (Figure S1A).
Enhancer calling by GRO-seq
Calling a universe of transcripts from GRO-seq data
Transcript calling was performed using a 2-state hidden Markov model using the groHMM data analysis package version 3.416,21 (https://bioconductor.org/packages/release/bioc/html/groHMM.html) on each individual cell line. The negative log transition probability of the switch between transcribed state to nontranscribed state and the variance in read counts in the nontranscribed state that are used to predict the transcription units for the cell lines in this study are listed in Table S2. We then built a universe of transcripts by merging the groHMM-called transcripts from individual cell lines and stratifying the boundaries to remove overlaps/redundancies occurring from the union of all transcripts.
Calling active enhancers using GRO-seq-defined enhancer transcripts
We filtered and collected a subset of short intergenic transcripts <9 kb in length and >3 kb away from known TSSs of protein-coding genes from Gencode version 19 annotations 41 and H3K4me3 peaks. These were further classified into (1) short paired eRNAs and (2) short unpaired eRNAs as described previously. 19 For the short paired eRNAs, the sum of the GRO-seq RPKM values for both strands of DNA was used to determine whether an enhancer transcript pair is expressed using a cutoff of RPKM ⩾ 0.5 (Figure S1B). An RPKM cutoff of ⩾ 1 was used to determine the universe of expressed short unpaired eRNAs (Figure S1C). The comprehensive universe of expressed eRNAs (short paired and short unpaired) assembled using the cutoffs noted above for each cell line was used for further analyses.
Motif analyses for GRO-seq-defined enhancers
De novo motif analyses was performed on a 1 kb region (±500 bp [base pairs]) surrounding the overlap center or the TSS for short paired and short unpaired eRNAs, respectively, using the command-line version of MEME from MEME Suite version 4.11.1. 44 The following parameters were used for motif prediction: (1) zero or one occurrence per sequence (-mod zoops); (2) number of motifs (-nmotifs 15); (3) minimum, maximum width of the motif (-minw 8, -maxw 15); and (4) search for motif in given strand and reverse complement strand (-revcomp). The predicted motifs from MEME were matched to known motifs in the JASPAR database (JASPAR_CORE_2016_vertebrates.meme) 30 using TOMTOM. 31
Enhancer calling by ChIP-seq
Calling active enhancers using histone modification ChIP-seq data
We built a universe of peak calls by merging the peaks from individual cell lines for histone modifications (H3K4me1 and H3K27ac) and stratifying the boundaries to remove overlaps/redundancies occurring from the union of all peaks. Potential enhancers were defined as peaks that were >3 kb from known TSSs, protein-coding genes from Gencode version 19 annotations, 41 and H3K4me3 peaks. An RPKM cutoff of ⩾1 for H3K4me1 and H3K27ac (Figure S1D and E) in at least one cell line was used to identify a peak as an active enhancer. The universe of active enhancers was assembled using the cutoffs noted above for each cell line and was used for further analyses.
Motif analyses for ChIP-seq-defined enhancers
De novo motif analyses were performed on a 1 kb region (±500 bp) surrounding the peak summit for the top 10 000 enhancers, using the command-line version of MEME-ChIP from MEME Suite version 4.11.1.44,45 The following parameters were used for motif prediction: (1) zero or one occurrence per sequence (-mod zoops); (2) number of motifs (-nmotifs 15); (3) minimum, maximum width of the motif (-minw 8, -maxw 15). All the other parameters were set at the default. The predicted motifs from MEME were matched to known motifs in the JASPAR database (JASPAR_CORE_2016_vertebrates.meme) 30 using TOMTOM. 31
Generating heatmaps and clusters
For each cell line, the functional scores were Z-score normalized. To identify cognate TFs by cell type, we performed hierarchical clustering by calculating the Euclidean distance using clustermap from seaborn version 0.7.1 (https://seaborn.pydata.org/). For visualization of the multidimensional TFSEE scores, we performed t-distributed stochastic neighbor embedding analysis (t-SNE) 46 using the TSNE function and labeled the clusters by calculating K-means clustering using the KMeans function with the expectation-maximization algorithm in scikit-learn version 0.17.1 (http://scikit-learn.org/).
Nearest neighboring gene analyses and box plots
The universe of expressed genes in each cell line was determined from the RNA-seq data using a FPKM cutoff of >0.4 (Figure S1F). The set of nearest neighboring expressed genes for each enhancer defined by an expressed eRNA or the enrichment of active histone marks was determined for each cell line. Box plot representations were used to express the levels of transcription or enrichment for each called enhancer and transcription of their nearest neighboring expressed genes. The read distribution (RPKM) for each enhancer or (FPKM) gene was calculated and plotted using the boxplot function from matplotlib version 2.0.2 (https://matplotlib.org/). Wilcoxon rank sum tests were performed to determine the statistical significance of all comparisons.
Overlapping enhancer analysis
Quantification of overlapping enhancers was assessed using a universe of enhancers for each identification method and counting the overlapping enhancers with other methods using BEDTools version 2.17.0. 37 The percentage of overlapping enhancers was calculated in Python (ver. 2.7.11) and plotted using the barplot function from matplotlib version 2.0.2 (https://matplotlib.org/). To visualize the intersection of enhancer marks, we used UpsetR version 1.4. 47
Results
The TFSEE model
The TFSEE model integrates data from multiple genomic assays, such as GRO-seq, RNA-seq, and ChIP-seq, with TF expression and motif information to predict (1) TFs driving the formation of active enhancers in a particular cell type and (2) the locations of their cognate enhancers. Enhancer identification methods using enrichment of enhancer histone modifications (eg, H3K4me1 and H3K27ac)10-12 or enhancer transcription 16 are quite well established. To explore the utility of TFSEE, we have analyzed using both methods of enhancer identification as inputs for TFSEE (Figure 1). In step 1 (Figure 2), a universe of active enhancers across the different constituent cell types was identified based on enhancer transcription as assessed by GRO-seq or total RNA-seq (method 1) (Figure S2A). One could also substitute this with the enhancers identified using the enrichment of epigenomic marks that are known to be enriched at enhancers, such as H3K4me1 and H3K27ac (method 2) (Figure S2B). After the enhancer calling step (by method 1 or method 2), the TFSEE model includes 5 key data processing steps (Figure 2), followed by data integration to calculate the enrichment and activity profiles, that is, the TFSEE score (Figure 3).

Data processing for Total Functional Score of Enhancer Elements (TFSEE) method. The TFSEE method has 5 data processing steps that are used to identify enhancer location and activity and their cognate transcription factors (TFs). In step 1, epigenomic (ChIP-seq) or the transcriptional (GRO-seq or total RNA-seq) profiles are used to generate a universe of active enhancers across the different constituent lineages. In step 2, TFSEE calculates the enrichment (H3K4me1 and H3K27ac) and enhancer transcription (GRO-seq and total RNA-seq) profiles under all identified active enhancers per lineage. Lineage-specific enhancers are used as input for step 3, where a de novo motif search is performed to identify potential TFs at each enhancer. If a motif is represented multiple times for a given enhancer location, TFSEE combines the probability of that motif into a single P value in step 4. Step 5 integrates the amount of enhancer transcription (GRO-seq or total RNA-seq) and the expression of the TFs whose motifs were predicted in step 3 and 4 for all cell types, to provide an output of TF expression profiles across every cell type.

Overview of Total Functional Score of Enhancer Elements (TFSEE) method. TFSEE combines diverse data sets to identify enhancer location and activity and their cognate transcription factors (TFs). An illustration of TFSEE data integration stage, taking the outputs generated at each step to identify the location, activity level, and predicted TFs at each enhancer across all cell types. (Top) All matrices represent scaled enhancer activity for each cell type in each enhancer prediction method (G, H, and M). All matrices are linearly combined into a resulting matrix A, to provide a total enhancer activity score. (Bottom) Enhancer activity matrix A, combined with motif prediction matrix T, represents scaled motif prediction P values for each enhancer, to form an intermediate matrix product. This matrix product is entrywise combined with TF expression matrix R (scaled TF expression for each cell type), into a resulting matrix Z, on which TFSEE clustering is performed.
Step 1—Method 1: enhancer calling based on enhancer transcripts defined by GRO-seq
In this approach, the active enhancers were identified based on enhancer transcripts (ie, eRNAs) called using GRO-seq data. The GRO-seq data were analyzed using groHMM version 3.416,21 and the transcript calling was performed as described in the methods section. The transcript calls were further filtered to identify a universe of short intergenic transcripts <9 kb in length and >3 kb away from the known TSSs of protein-coding genes and H3K4me3 peaks, which mark promoters. A final universe of expressed short paired eRNAs and short unpaired eRNAs for each cell type was identified with the cutoffs as mentioned in the methods section. Next, the expression profiles (using GRO-seq data) and histone mark enrichment profiles (using ChIP-seq data) were calculated at these active enhancers for each cell type to calculate the enhancer activity (A) (Figure 3).
Step 1—Method 2: enhancer calling based on histone modification defined by ChIP-seq
In this approach, the enhancers were identified based on histone modifications (H3K4me1 and H3K27ac) using ChIP-seq data. The enhancers called based on histone modification peak calls for each cell type were merged and the redundancies were removed as described in the methods section. Next, the potential intergenic enhancers were defined as the merged peaks that are >3 kb from known TSSs, protein-coding gene bodies, and H3K4me3 peaks. An RPKM cutoff of ⩾1 for H3K4me1 and H3K27ac (Figure S1D and E) in at least one cell type was used to call a peak as an active enhancer. The universe of active enhancers was then assembled for each cell type.
Step 2: Calculating enrichment and activity profiles
For the universe of enhancers for each cell type (identified in step 1), the enhancer activity levels were assessed genome-wide by calculating the enrichment of histone modifications (ie, H3K4me1 and H3K27ac) and enhancer transcription (GRO-seq or total RNA-seq) (Figure 2). Next, we generated an enhancer activity matrix ACxE for all cell types, C, for the universe of active enhancers, E. For this analysis, we assumed that the enhancer activity of each cell type is linearly correlated to the amount of enhancer transcription (GRO-seq or total RNA-seq, G) and to the epigenomic marks (H3K4me1, M, and H3K27ac, H). To reduce bias, the enrichment for each individual enhancer was scaled between 0 and 1. Enhancer activity, A, was calculated using the following formula:
When using only histone modifications (step 1, method 2), enhancer activity can be calculated using the following formula:
Steps 3 to 5: De novo motif searching and TF expression
The TFSEE was designed primarily to detect enhancer activity changes and TF-enhancer relationships for each cell type. The steps in this section are common for both the methods of enhancer calling, which includes de novo motif searching and postprocessing of the motif search results. In steps 3 to 4, the TF-enhancer relationships were determined using a de novo motif search, and a matrix of probabilities of the TFs was created by annotating every enhancer to TF relationships for each cell type (Figures 2 and 3). If a motif is represented multiple times for a given enhancer location, TFSEE combines the probability of that motif into a single P value using the Stouffer method. 48 In step 5, the expression profile of all the TFs from step 4 (Figures 2 and 3) was calculated from GRO-seq or RNA-seq data across all the cell types.
Calculating the TFSEE score by data integration
The final stage integrates all of the data compiled in steps 1 to 5 (Figure 3) to determine the TFSEE score matrix and generate a heatmap of TFSEE scores. First, the enhancer activity matrix, ACxE, was combined with the motif prediction matrix, TExF, to generate a scaled motif prediction P value, T, for each enhancer, E, to form an intermediate matrix product. This matrix product is combined entrywise with the TF expression matrix, R, from step 5, and the expression of each TF, F, for each cell type, C, into a resulting matrix, Z, composed of C cell types and F TFs. The TFSEE scores can be expressed as the following formula:
Using TFSEE for the unbiased identification of enhancers during pancreatic differentiation
To demonstrate the utility of TFSEE, we used it to define the enhancer landscape and identify TFs that maintain the multipotency of a subpopulation of endodermal stem cells during differentiation into pancreatic lineages. For these analyses, we mined previously published ChIP-seq data sets for 3 different histone modifications (ie, H3K4me1, H3K4me3, and H3K27ac), in addition to GRO-seq and RNA-seq data sets, at 5 defined stages of endoderm lineage differentiation: hESCs, definitive endoderm (DE), primitive gut tube (GT), posterior foregut (FG), and PE (Figure 4A, Table S1).

Comparison of approaches for genome-wide prediction of enhancers during pancreatic differentiation. (A) (Top) Schematic diagram of pancreatic differentiation starting from human embryonic stem cells (hESCs) to pancreatic endoderm (PE). (Bottom) Depiction of epigenomic (ChIP-seq) and transcriptional (GRO-seq and RNA-seq) profiles for each cell line used for analysis. (B) Stacked bar chart comparing the predicted activity of candidate enhancers categorized by (Top) H3K4me1 and H3K27ac enrichment or (bottom) enhancer transcription (GRO-seq). (C) Stacked bar chart comparing enhancer prediction methods in pancreatic differentiation. Enhancers were called using enhancer transcription (GRO-seq) or using H3K4me1 enrichment, H3K27ac enrichment, or a combination of both histone marks. The percentage of called enhancers from one prediction method that overlap with enhancers called using other methods is shown. (D) UpSet plot showing the set intersection of enhancer identification methods shown in panel (C). DE indicates definitive endoderm; FG, posterior foregut; GT, primitive gut tube; hESCs, human embryonic stem cells; PE, pancreatic endoderm; TF, transcription factors; TFSEE, Total Functional Score of Enhancer Elements.
Calling enhancers by independent methods
Using differentiation of pancreatic stem cells as a biological model (Figure 4A), we identified the enhancer universe for the cell types by 2 methods: (1) enhancer transcription signatures from GRO-seq data (Figure S2A) and (2) enrichment of epigenomic marks (ie, H3K4me1 or H3K27ac) (Figure S2B). To avoid complications associated with overlaps between enhancer transcription and promoter transcription, we only considered candidate enhancers >3 kb away from the annotated TSSs of active protein-coding genes, as identified by the enrichment of H3K4me3 43 (using GENCODE version 19 annotations 41 ) (Figure S2A and B).
We then predicted candidate enhancers using methods 1 and 2. We identified a set of 4974 candidate enhancers (Figure 4B) by method 1 using GRO-seq data, as described previously, 19 with RPKM cutoffs of ⩾0.5 or ⩾1 (Figure S1B and C) in at least one cell lineage. We also identified a set of enhancers by method 2 using histone modifications, by filtering the enhancer universe based on the enrichment of H3K4me1 and H3K27ac (RPKM cutoff of ⩾ 1 [Figure S1D and E] for both marks in at least one cell line), and identified a set of 218 731 candidate enhancers across all stages of pancreatic differentiation (Figure 4B). This stringent filter is necessary to reduce the false-positive enhancers from method 2 that could easily be annotated as alternative chromatin states using ChromHMM. 49 In addition, the majority of histone called enhancers were marked by only H3K4me1 (Figure 4B). These results confirm the enhancer landscape across pancreatic differentiation reported by Wang et al. 1
Comparison of enhancer calls by methods 1 and 2
Next, we compared the enhancer universes called by enhancer transcription (method 1) and histone modifications (method 2). We found that 12% of enhancers called based on enhancer transcription using GRO-seq data were also identified based on enrichment of H3K4me1, H3K27ac, or both marks combined (Figure 4C, Figure S3A). Interestingly, greater than 84% of the enhancers identified based solely on enhancer transcription were not called based on enrichment of H3K27ac or H3K4me1 (Figure 4C and D, Figure S3B). This is likely due to the fact that these may not be the primary or only chromatin marks denoting active enhancers in these systems, and other marks (ie, H4K8ac or H3K9ac) or combinations of marks might serve as better identifiers. 50 In contrast, less than 1% of enhancers called based on the enrichment of H3K4me1 and H3K27ac overlapped with the enhancers identified based on enhancer transcription or both H3K4me1 and H3K27ac combined (Figure 4C). This may be due, in part, to the fact that enhancer calling based on H3K4me1 or H3K27ac enrichment yields much larger numbers of putative enhancers (Figure 4D), many of which may be false positives or inactive as true regulatory elements (Figure S3C and D). Based on these findings, we decided to focus on the enhancers identified based on enhancer transcription using GRO-seq data (method 1), which had the highest percentage of enhancers that were called by all 3 methods, as an input to TFSEE for the subsequent analysis.
TFSEE identifies lineage-specific enhancers and their cognate TFs during pancreatic differentiation
TFSEE scores determined by using inputs from method 1
After determining the TFSEE scores globally across the lineage of pancreatic differentiation, we performed unsupervised hierarchical clustering on the enhancers predicted by method 1 based on enhancer transcription, which grouped the lineage-specific cell types into 2 major clades: (1) FG and PE and (2) hESC, DE, and GT (Figure 5A). To better understand the TF-enhancer dynamics across the 4 represented cell types in pancreatic differentiation, we clustered the TFSEE scores across all the differentiation stages, revealing 4 major clusters (Figure 5B). We then examined the enrichment of putative enhancers and their associated TFs across stages by quantifying their normalized TFSEE scores. This analysis revealed 4 major enhancer clusters: (1) those driving early (hESC, DE) and late pancreatic differentiation (FG and PE), (2) those enriched in GT, (3) those driving early pancreatic endodermal lineage formation (hESC, DE and GT), and (4) those driving late pancreatic differentiation (FG and PE) (Figure 5C).

TFSEE identifies cell type–specific enhancers and their cognate TFs that drive gene expression during pancreatic differentiation. (A) Unsupervised hierarchical clustering of cell type–normalized TFSEE scores shown in a heatmap representation. hESC (human embryonic stem cell); DE (definitive endoderm); GT (primitive gut tube); FG (posterior foregut); PE (pancreatic endoderm). (B) Biaxial t-SNE clustering plot of cell type–normalized TFSEE scores showing evidence of 4 distinct clusters, each point represents an individual TF. (C) Box plots of normalized TFSEE score for clusters identified in pancreatic differentiation (panel B), number of TFs are indicated in each cluster. Bars marked with different letters are significantly different (Wilcoxon rank sum test,
To investigate the distinct roles of lineage-specific enhancers and their cognate TFs, focusing on those that provide a clear demarcation of enrichment between early and late pancreatic differentiation, we first examined the expression levels of the messenger RNAs encoding the predicted TFs for each cluster in each of the stages. Our analysis revealed that TFs identified in early pancreatic differentiation show similar expression across the later stages of development, whereas TFs identified in late pancreatic differentiation are expressed most predominantly in the FG and PE stages (Figure 6A) coinciding with pancreatic induction at the FG stage (Figure 4A). In addition, we observed an enrichment of TFs in a stage-specific manner for TFs enriched early (hESC, DE) and late (FG and PE), but not for those maintaining GT pluripotency (Figure S4A).

TFSEE-predicted TFs are enriched in pre- and late pancreatic differentiation. (A to C) Box plots of normalized TF expression (panel A), enhancer transcription (panel B), and gene expression for the nearest neighboring genes to active enhancers (panel C) in pre- (cluster 3) and late pancreatic (cluster 4) differentiation across the different cell types. Bars marked with different letters are significantly different from each other (Wilcoxon rank sum test). hESC (human embryonic stem cell); DE (definitive endoderm); GT (primitive gut tube); FG (posterior foregut); PE (pancreatic endoderm). (A) TFs identified in cluster 3 by TFSEE show equal expression across differentiation, whereas cluster 4 highlights TFs highly expressed in FG and PE. TF expression as measured by RNA-seq. The number of TFs in each cluster are in parenthesis
Next, we determined whether enhancer transcription corresponding to the enriched TFs in each cluster, using the TFSEE scores, correlates with the regulation of nearby genes. To do so, we identified the enhancers corresponding to the predicted TFs using motif enrichment and binding site prediction. We then determined the level of transcription for each enhancer using GRO-seq (Figure 6B, Figure S4B) and the level of expression for the nearest neighboring gene (upstream or downstream) using RNA-seq (Figure 6C, Figure S4C). Interestingly, transcribed enhancers exhibited stage-specific enrichment, which did not correspond to the patterns observed based on TFSEE scores (Figure 6B, Figure S4B). This result likely reflects the fact that 1364 of the enhancers are shared between clusters (55%, n = 2465, for cluster 3; 99%, n = 1371, for cluster 4), and the variation between clusters is due to differences in TF expression and their affinity for motifs. Likewise, the expression of the nearest neighboring gene for each transcribed enhancer did not exhibit stage-specific enrichment (Figure 6C, Figure S4C) due to the vast abundance of enhancers with neighboring genes shared between the clusters. However, without further high-throughput data to study enhancer-promoter interactions (as measured by 4C, ChIA-PET, or Hi-C),51-53 it is difficult to discern the logic of the stage-specific regulatory network.
To further understand the potential regulators of each cluster, we determined a rank order frequency distribution for all TFs within each cluster (Figure 6D and E, Figure S4D and E). This analysis revealed enrichment of HINFP, RARG, ZIC3, and SP1-like family TFs (SP1 and SP8), which are important regulators of embryonic development54-57 (Figure 6D). In addition, the Onecut family (ONECUT2 and ONECUT3), EGR1, MITF, and FOXP1 TFs, which were enriched in cluster 4, have been shown to function in pancreatic and islet cell development58-61 (Figure 6E).
TFSEE scores determined using inputs from method 2
To determine the robustness of the method, we calculated TFSEE scores using enhancers predicted based on histone enrichment alone. We performed unsupervised hierarchical clustering and retrieved only 3 clusters, in contrast to the 4 clusters from enhancers predicted based on enhancer transcription (Figure S5A and B). These results highlight the potential role of TF-enhancer interactions driving the formation of the early pancreatic endodermal lineage (hESC, DE, and GT), as well as late pancreatic differentiation (FG and PE), but do not reveal any other stage-specific drivers (Figure S5C).
For comparable clusters from early and late pancreatic differentiation, we determined whether enhancer “activity” based on H3K27ac enrichment correlates with the regulation of nearby genes. We monitored TF expression by RNA-seq (Figure S6A) and then calculated the level of activity for each enhancer using H3K27ac ChIP-seq data (Figure S6B) and the expression of the nearest neighboring gene (upstream or downstream) using RNA-seq (Figure S6C). We found that H3K27ac-enriched enhancers exhibit stage-specific enrichment, as observed for method 1, which does not correspond to the patterns found from TFSEE enrichment (Figure S6B). This result reflects the fact that 1017 of the enhancers are shared between clusters (74%, n = 1375 for cluster 2; 62%, n = 1640 for cluster 3), and the variation between clusters is due to differences in TF expression and their affinity for the motifs. Furthermore, similar to method 1, the nearest neighboring gene for each transcribed enhancer does not exhibit stage-specific enrichment (Figures S6C). The potential regulators of clusters 2 and 3 were determined using a rank order frequency distribution for all TFs within each cluster (Figure S6D and E).
Comparison of TF identification using inputs from method 1 or method 2
To determine the robustness of the approaches, we compared TFs identified by TFSEE using inputs from method 1 or method 2 for pre- and late pancreatic differentiation. We found 9 and 12 TFs (out of the total of 44 and 46 unique TFs identified) enriched in common for early and late pancreatic differentiation, respectively (Figure S6F). These differences in the enriched TFs may be, in part, due to the much larger numbers of putative enhancers called using H3K4me1 and H3K27ac enrichment, many of which may be false positives or inactive as regulatory elements, producing a greater assortment of enriched TF motifs. Taken together, our results show that TFSEE can be used to identify cell type–specific TFs that control lineage-specific enhancers.
Discussion
The carefully choreographed mechanisms involved in driving the lineage-specific transcriptional responses during development remain poorly understood. In this study, we integrated a variety of publicly available high-throughput sequencing data from pancreatic lineage development to identify the potential regulators driving the early or late pancreatic lineage development. Our analysis revealed the enrichment of the Onecut family (ONECUT2 and ONECUT3), EGR1, MITF, and FOXP1 TFs in the late pancreatic differentiation phase (FG, PE), which have been shown to function in pancreatic and islet cell development.58-61 All but the Onecut family were identified as top-ranked TFs by both methods. Our analyses provide a detailed operational description of TFSEE, a previously published computational method 33 for identification of active enhancers and associated cognate TFs, during differentiation of hESCs toward pancreatic cell type. TFSEE employs a multiview clustering of multiple genomic assays that directly models changes in the transcriptional and epigenetic states across cell types. This approach allowed us to directly integrate disparate data while encoding assumptions and dependencies between data types in an interpretable and extendable model.
Evaluation and use of TFSEE
The TFSEE model gains power by both explicitly modeling the enhancer landscape for each cell type and detecting the enhancer activity changes and TF-enhancer relationships across all cell types. To date, we have applied TFSEE to transcriptional and epigenomic data from hESC differentiation time course experiments1,34 (analyses described herein) and a variety of breast cancer cell lines.33,62 Our results show that this method can identify cell type–specific TFs and their cognate enhancers that are biologically relevant and are good candidates for further biological validation. In particular, this method identifies TFs bound at active enhancers, which regulate gene expression, supporting the biological relevance of TFSEE predictions. In this study, we identified enrichment of HINFP, RARG, ZIC3, and SP1-like family TFs (SP1 and SP8), which are known important regulators of embryonic development,54-57 3 of which (ie, ZIC3, SP1, and SP8) were identified as the top TFs by both methods.
In addition, TFSEE enables analysis of driver TFs using a limited amount of data. The model was able to identify lineage-specific TFs with as little as 5 cell types and with only 2 data types, RNA-seq and ChIP-seq (for H3K4me3, H3K4me1, and H3K27ac) (Figure S5A and B). A limitation of the TFSEE method is that while the model can be used with a reduced number of data types for enhancer identification, it fails to identify additional subtype- or stage-specific drivers with reduced data input (Figure S5C). In this case, the overlapping clusters identified only a subset of TFs that are jointly enriched (Figure S6F).
Integrating additional genomic data into TFSEE
TFSEE could potentially be applied to any cell type with limited data, either GRO-seq or total RNA-seq (for enhancer calling by method 1), or histone modifications (for enhancer calling using H3K27ac and H3K4me1 by method 2). The integration of additional data into TFSEE could allow extension of the model, providing greater granularity about subtype-specific TFs and a better understanding of gene regulatory networks. Genomic data indicating open regions of chromatin (eg, ATAC-seq, 63 DNase-seq, 7 or MNase-seq 64 ) could extend the dynamic range of “enhancer activity” (Figure 3) and help eliminate false-positive enhancers in each cell type. Likewise, adding enrichment of transcriptional co-regulator p300 65 could serve the same role. Furthermore, integrating additional histone modifications, which could be used to annotate alternate chromatin states by ChromHMM, 49 may provide a finer filter for enhancer identification by method 2 than achieved here when limiting the analysis to the histone modifications used herein. Finally, to better understand the cluster-specific regulatory networks, the addition of chromatin looping data for enhancer-promoter interactions (as measured by 4C, ChIA-PET, or Hi-C)51-53 may provide some advantages. The looping data would provide a better understanding of how enhancers shared between clusters determine cell type–specific expression profiles. We believe that including any or all of data described above into TFSEE would improve the model and help to discern cell type–specific regulatory networks and can easily be added due to the flexibility of the model.
Conclusions
The increasing availability of different types of genomic data sets provides an opportunity to perform data integration to uncover cell type–specific TF drivers. To facilitate identification of these drivers, we developed and further evaluated TFSEE, which systemically identifies active enhancers and their cognate TFs. We showed that TFSEE can identify stage-specific TFs during differentiation of hESCs into pancreatic lineages. Collectively, our results show how TFSEE can be used to predict molecular drivers maintaining cell type–specific function and biology.
Supplemental Material
Malladi_TFSEE_-_Supplemental_Material_Revised3_052520_FINAL_xyz40901e2b318ca – Supplemental material for Total Functional Score of Enhancer Elements Identifies Lineage-Specific Enhancers That Drive Differentiation of Pancreatic Cells
Supplemental material, Malladi_TFSEE_-_Supplemental_Material_Revised3_052520_FINAL_xyz40901e2b318ca for Total Functional Score of Enhancer Elements Identifies Lineage-Specific Enhancers That Drive Differentiation of Pancreatic Cells by Venkat S. Malladi, Anusha Nagari, Hector L Franco and W Lee Kraus in Bioinformatics and Biology Insights
Footnotes
Acknowledgements
The authors thank members of the Kraus Laboratory for providing critical feedback on this work and helpful comments on the manuscript.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the following: (1) a grant from the Cancer Prevention and Research Institute of Texas (CPRIT; RP100417) to the LONESTAR Oncology Network for Epigenetics Therapy and Research, (2) a grant from CPRIT (RP160319) to W.L.K., (3) a grant from the NIH/NIDDK (R01 DK058110) to W.L.K., and (4) funds from the Cecil H. and Ida Green Center for Reproductive Biology Sciences Endowment to W.L.K.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
HLF, AN, VSM, and WLK designed the project. WLK conceived of the conceptual framework for TFSEE and VM, AN, and HLF developed it further. VM processed and analyzed the GRO-seq, RNA-seq, and ChIP-seq data and performed the integrative computational analyses of the genomic data. VM and AN prepared an initial draft of the figures and text, which were edited and finalized by WLK and AN, with input from HLF. WLK secured funding for the work and provided overall project direction.
Data Availability and Implementation Statement
The data sets analyzed in this study were obtained from the NCBI’s Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) or EMBL-EBI’s ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) repositories using the accession numbers listed in Table S1. Implementation of this method is available at:
.
Supplemental Material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
