Published August 01, 2025
The polymerase chain reaction (PCR) amplification process of deoxyribonucleic acid (DNA) libraries can introduce bias in the sequence ratios. Consequently, several recent genomic and transcriptomic methods employing next-generation sequencing (NGS) utilize in vitro transcription (IVT) to amplify template polynucleotide chains. IVT amplifies nucleic acid sequences linearly, making it less susceptible to bias than the exponential amplification of PCR. Chromatin integration labeling sequencing (ChIL-seq), a tool for analyzing transcription factor binding and histone modifications, has incorporated IVT by replacing PCR in the DNA amplification step, enabling the analysis of small sample sizes, including single cells. In this study, we discovered that many of the excluded sequences known as PCR duplicates during the pre-processing step of ChIL-seq data analysis contain amplification products derived from IVT. Furthermore, we developed an in silico method to selectively eliminate PCR duplicates from NGS data while retaining IVT-derived amplification products. The method prevents excessive data reduction and significantly improves the utilization efficiency of NGS data.
transcription, PCR duplicate, data pre-processing, amplification bias, chromatin
Next-generation sequencing (NGS) enables large-scale omics analyses. In some experiments, such as chromatin immunoprecipitation (ChIP), 1 only a very small amount of deoxyribonucleic acid (DNA) can be recovered as a template for the NGS library. In such cases, amplification of nucleic acid chains in vitro is necessary for NGS analysis. Typically, polymerase chain reaction (PCR) 2 is employed to attach common sequences for sequencing reactions to NGS libraries and to amplify the DNA library. However, PCR amplification can introduce sequence-dependent bias during the exponential amplification step. 3 Therefore, an in silico pre-processing step removing duplicated sequences (hereafter referred to as “PCR duplicates”) 4 from the raw data is often introduced. Recent omics techniques using NGS employ in vitro transcription (IVT) in place of or addition to PCR as the amplification step. 5 IVT is a linear amplification method for polynucleotide chains; therefore, it has been developed to experimentally avoid PCR-derived bias. 6 Therefore, in principle, NGS libraries amplified by IVT do not require in silico removal of duplicates as a pre-processing step. In the chromatin integration labeling sequencing (ChIL-seq) technique, an alternative to ChIP-seq, the IVT method using the bacteriophage-derived T7 promoter is combined with a low-cycle PCR reaction when constructing NGS libraries. 7 However, we found that DNA sequences derived from IVT amplification were eliminated along with PCR duplicates in the current data analysis pipeline for ChIL-seq. 7 In this study, we present a pre-processing method that distinguishes and selectively retains IVT-derived duplicate sequences from PCR-derived duplicate sequences using paired-end NGS data to identify the duplicates to be removed. Finally, we confirmed that the method prevented excessive and unnecessary data reduction, and significantly improved the utilization efficiency of NGS data.
Isolation of mouse embryonic fibroblasts
Mouse embryonic fibroblasts (MEFs) were prepared from C57BL6 female mouse embryos at 13.5 days post-coitum and cultured in MEF medium as described previously. 8
Retrovirus production
As described previously, 8 mouse Hnf4α and Foxa2 cDNAs were obtained by reverse transcription PCR and subcloned into pGCDNsam (a gift from M. Onodera). To produce recombinant retroviruses, plasmid DNA was transfected into Platinum-E (Plat-E) cells using linear polyethylenimine (PEI; Polysciences, Warrington, PA) following the method described previously. 9 Three days before transfection, Plat-E cells (1.6 × 106) were plated on poly L-lysine-coated 10 cm dishes. For transfection, 36 μL of 1 mg/mL PEI and 12 μg of retroviral plasmid DNA were diluted in 1 mL of Dulbecco’s modified Eagle medium (DMEM) and incubated for 15 minutes at room temperature. The mixture was then added dropwise to plated Plat-E cells. After 6 hours of incubation at 37°C under 5% CO2, the medium was replaced with fresh MEF medium, and the culture was continued. Supernatants from the transfected cells were collected at 24 hours after medium was replaced, filtered through 0.2 μm cellulose acetate filters (Sartorius), and concentrated by centrifugation at 10,000 × g for 16 hours at 4°C. Viral pellets were resuspended in Hanks’ balanced salt solution (Nissui, Tokyo, Japan) at 1/140 of the initial supernatant volume.
Transduction of cells and generation of induced hepatic progenitor cells
As described previously,9,10 MEFs were grown on gelatin-coated 12-well plates until they reached 20%–30% confluency. The cells were then incubated in MEF medium containing the concentrated viral supernatants and 5 μg/mL protamine sulfate (Nacalai Tesque, Kyoto, Japan) for 6 hours. Viral infection was performed twice using 20 uL virus solutions per factor. Two days after the final infection, the cells were re-plated in 6-well plates coated with type I collagen (Iwaki, Tokyo, Japan) and cultured in a hepato-medium. The hepato-medium consisted of a 1:1 mixture of DMEM and Ham’s F-12 medium (Nacalai Tesque), supplemented with 10% fetal bovine serum, 1 μg/mL insulin (Wako, Tokyo, Japan), 10−7 M dexamethasone (Sigma-Aldrich, St. Louis, MO), 10 mM nicotinamide (Sigma-Aldrich), 2 mM L-glutamine, 50 mM β-mercaptoethanol (Nacalai Tesque), penicillin/streptomycin 20 ng/mL hepatocyte growth factor (PeproTech, Cranbury, NJ), and 20 ng/mL epidermal growth factor (Sigma-Aldrich). After reaching confluence, cells were passaged at a density of 1 × 105 cells/passage.
ChIL-seq
ChIL-seq was performed after confirming the appearance of induced hepatic progenitor cells (iHepCs), which were defined as albumin/E-cadherin double-positive cells. A total of 1 × 104 iHepCs were re-plated in 96–well plates coated with type I collagen (Iwaki) and cultured in a hepato-medium. After 16 hours, cell fixation and subsequent ChIL-seq were performed as described previously. 7
Quality check and detection of duplicates
Raw reads were trimmed and duplicates were removed using fastp (v0.23.4) with the parameters -D –dup_calc_accuracy 6 (related to Figure 1B). The ratio of unique reads was calculated as the proportion of raw reads retained before and after duplicate removal, using fastp (related to Figure 3). The reads were counted using the fqtools count program (v2.1).
 
                                    
                                
                                    
                                
                                    
                                
                                
Mapping to the reference genome
Duplicate-removed reads were mapped to the reference genome (GRCm38/mm10 for mice and GRCh38/hg38 for humans) using HISAT2 (v2.0.5) with the parameters -k 1 –no-spliced-alignment. After mapping, reads aligned to the mitochondrial genome (chrM) were excluded from subsequent analyses using grep with the parameter -v chrM. SAM files with the mitochondrial genome removed were converted to BAM using SAMtools (v1.9) with the -F 4 option to remove unmapped reads. The BAM files were sorted and indexed. The strand-specific BAM files were filtered using the parameter f = 16.
Peak call
Peak calling was performed using Read1 processed in the single-end mode. The transcription factors (TFs) peaks were identified using MACS (v2.2.9.1; https://github.com/taoliu/MACS/) with the options: -g mm/hs -q 0.01 –nomodel –shift 0 –extsize 200 –keep-dup all (related to Figure 2), or -g mm -q 0.000000001 –nomodel –shift 0 –extsize 200 –keep-dup all (related to Figure 4).
 
                                    
                                
                                    
                                
                                
Signal quantification
Signal coverage was calculated using deepTools (v3.5.4) in the bamCoverage mode with the following parameters: –normalizeUsing CPM –smoothLength 100 –minMappingQuality 60 –binSize 10.
Signal aggregation around peaks
Aggregation plots were generated using the deepTools (v3.5.4) reference-point mode with the following parameters: –referencePoint center –upstream 500 –downstream 500 –skipZeros –missingDataAsZero. Strand-specific signals around the TF peaks were plotted using Read1 processed in single-end mode (related to Figure 2).
De novo motif detection and comparison with known motif datasets
The summits data obtained from MACS were extended by 1 kb on both sides using the bedtools (v2.31.1) and categorized based on their overlaps between the datasets. Motif analysis was performed for each category using the findMotifsGenome.pl program in the HOMER package (v4.11) with the following parameters: -size 500 -mask -p 16 S 50.
Data and code availability
The deep-sequencing data generated in this study, ie, ChIL-seq data, were deposited in the Gene Expression Omnibus (GEO) under the accession code GSE282869. Previously published datasets reanalyzed in this study, including ChIP-seq, CUT&RUN, and targeted insertion of promoter sequencing (TIP-seq) 11 data, are available under the following accession codes: GSE213460 (ChIP-seq, mouse liver, Hnf4α), GSE84474 (CUT&RUN, K562 cells, CTCF), and GSE188512 (TIP-seq, HCT116 cells, CTCF). The shell scripts used for the analyses in this study are available from the following GitHub repository: https://github.com/kenichihorisawa/pipeline_for_IVT_duplicates.
In ChIL-seq, Tn5 transposase inserts the T7 promoter sequence and Read1 adaptor sequence, which enables the use of NGS primer, 12 into the genomic DNA around the target proteins. T7 RNA polymerase, which is added to the sample, binds to the T7 promoter sequence and initiates the IVT reaction by transcribing RNAs including the Read1 adaptor sequence and genomic sequences around the target proteins. The IVT reaction products are then reverse transcribed into single-stranded DNAs using reverse transcriptase and a 9-mer random primer containing the Read2 adaptor sequence. Subsequently, the DNAs were added index sequences and converted into double-stranded DNAs through a low-cycle PCR reaction. Finally, the resulting NGS library is analyzed using NGS instruments (Figure 1A). The developers of ChIL-seq used Read1 sequences as single-end analysis data because the Read1 sequences were closer to the target sites than the Read2 sequences. Thus, they did not read or discard Read2 sequence data. In the series of library preparations, the Read1 sequence was considered to be dependent on the position of the transcription start site (TSS) of the IVT reaction, whereas the Read2 sequence was considered to be dependent on the termination point of the IVT reaction and/or the position where the random primer anneals (Figure 1A). 13 Therefore, it can be hypothesized that Read1 sequences have a high degree of overlap, whereas Read2 sequences have a high degree of diversity and a low degree of overlap. To test this model, we examined the proportion of overlap in the Read1 and Read2 sequences for IVT-based methods (ChIL-seq and TIP-seq) and non-IVT-based methods (ChIP-seq and CUT&RUN 14 ). Non-IVT-based methods showed very little difference in the duplication rates between Read1 and Read2 (Cohen’s h = 0.0052–0.0117). However, as expected, the IVT-based methods revealed a higher degree of overlapping sequences in Read1 compared than in Read2 (Cohen’s h = 0.3012–0.5767) (Figures 1B and S1). These results suggest that, in addition to PCR duplicates, a significant number of duplicate sequences derived from IVT amplification were present in the library, as confirmed by Read2. According to the principles of the IVT, this phenomenon is likely to be common in IVT-based methods. We define “IVT duplicates” as duplicates where only Read1 matches and Read2 does not.
During the IVT reaction, the T7 RNA polymerase binds to the T7 promoter and then transcribes the genome-integrated adaptor and genomic sequences in the 5′ to 3′ direction. Therefore, if Read1 sequences are prone to duplication, as hypothesized, and Read2 sequences exhibit greater diversity and are less susceptible to duplication, we would expect Read2 to be widely distributed and detected along the direction of IVT progression. To examine this, a DNA strand-specific aggregation plot was created around the ChIL-seq peak detected in the Read1 data, and the distributions of Read1 and Read2 were analyzed (Figures 2 and S2). The results showed that in the non-IVT-based method, the Read1 and Read2 sequences were distributed at almost the same position [Spearman’s rank correlation coefficient (ρ) = 0.9910–0.9992], whereas in the IVT-based method, the Read2 sequences were broadly distributed along the direction of IVT progression, which consequently resulted in a relatively low correlation between the two plots, as expected (ρ = 0.3508–0.8880). This result supported our hypothesis that duplicate IVT sequences were present in the resulting NGS data.
In the pre-processing of the analysis, PCR duplicates are often removed as they can lead to incorrect interpretations, also in the conventional ChIL-seq data pipeline, duplicates were removed from single-end data (Read1). However, the algorithm for removing duplicate sequences cannot distinguish IVT duplicates from PCR duplicates, which may lead to the excessive removal of sequences from raw data, including IVT-derived production. To resolve this issue, we acquired paired-end ChIL-seq data and used both sets of paired-end data (Read1 and Read2) to distinguish IVT and PCR duplicates. Consequently, a significantly larger number of sequences remained than when only Read1 data were used to remove PCR duplicates (Figures 3 and S3). Therefore, we concluded that we could selectively eliminate only PCR duplicates without losing IVT duplicates by using paired-end data and paired-end data-based pre-processing for PCR duplicate removal.
 
                                    
                                
                                    
                                
                                    
                                
                                    
                                
                                    
                                
                                
Furthermore, we examined whether differences in duplicate removal methods lead to qualitative differences in analysis outcomes. Using the same raw ChIL-seq data for Hnf4α, we applied two duplicate removal strategies: one based on Read1 alone and another using both Read1 and Read2, as described earlier. After the removal of the duplicates, only the retained Read1 data were then mapped to the genome, followed by peak calling. Read1 data were mapped into the reference genome sequence. As a result, duplicate removal using both Read1 and Read2 identified more peaks than the method using Read1 alone. Moreover, nearly all peaks detected with Read1-only deduplication were included in those obtained using both Read1 and Read2 (Figure 4). In addition, de novo motif analysis of sequences within the overlapping peaks and those uniquely detected by the method using both Read1 and Read2 revealed significant enrichment of the Hnf4α binding motif [HNF4a(NR), DR1/HepG2-HNF4a-ChIP-Seq(GSE25021)/Homer] in both cases. These findings suggest that duplicate removal using both Read1 and Read2 enables more comprehensive biological data extraction from the same dataset compared to the conventional pre-processing method (Figure 4).
 
                                    
                                
                                    
                                
                                    
                                
                                    
                                
                                
In the present study, we found that the NGS data obtained from IVT-based methods contained a large number of DNA fragments that completely match the Read1 data but not the Read2 data (IVT duplicates). Based on Cohen’s h, the IVT-based methods demonstrated effect sizes around the medium range (h = 0.5) in the difference in duplicate rates between Read1 and Read2, whereas the non-IVT-based methods exhibited effect sizes well below the threshold for a small effect (h = 0.2). Furthermore, we found that only PCR duplicates could be selectively eliminated using paired-end data and pre-processing pipelines. This improvement can be achieved with previously unused Read2 sequencing data and making only minor changes in the in silico pipeline, and without requiring any changes to the experimental protocol. As a result, this method prevents the unintended removal of sequences and significantly increases the number of unique reads available for analysis. Upon evaluation of the proportion of remaining unique reads after deduplication, the two types of approaches produced widely different effect sizes. This indicates an improved data usage efficiency. However, after duplicate removal in paired-end mode, single-ended analyses using Read1 may preferentially focus on regions near the target protein’s binding sites. IVT duplicates show different characteristics in Read1 and Read2 data. In Read1, the TSSs of the IVT reaction (5′ end of the IVT duplicates) are constrained by the T7 promoters which are integrated into the genome during the experimental procedure, whereas the termination of the IVT reaction is stochastic (3′ end of the IVT duplicates). In addition, it is assumed that there is diversity in the annealing positions of the random primers used for reverse transcription. A combination of these factors may be responsible for the diversity of IVT duplicates in Read2. The analysis reveals that the aggregation plots for Read1 and Read2 are nearly identical in non-IVT-based methods. Conversely, in IVT-based methods, the distributions show a lesser degree of overlap, even though they maintain a positive correlation. This observation is consistent with the above discussion.
During nucleic acid chain amplification, IVT amplifies nucleic acid chains linearly, unlike PCR, which amplifies DNA fragments exponentially.2,4,15 As a result, the artificial bias due to sequence differences is theoretically very low in IVT. Therefore, distinguishing IVT duplicates from PCR duplicates and selectively removing PCR duplicates should retain more unique sequence reads, thereby improving data usage efficiency. Indeed, ChIL-seq data processed while retaining IVT duplicates showed a significantly higher number of peaks compared to data processed with the conventional pre-processing protocol using the same raw data. Furthermore, the additionally detected peaks exhibited a strong concentration of Hnf4α-specific binding sequences, similar to the intersected peaks between the datasets from the two pre-processing methods. These findings suggest that the conventional pre-processing method using single-end data has excessively and unnecessarily reduced the data, and the deduplication method with paired-end data may provide deeper biological insights than the conventional method.
Although many IVT-based NGS analysis techniques have been developed, a standardized method for data processing has yet to be established. For example, some studies analyzed only Read1 data,7,11,16,17 whereas others utilized both Read1 and Read2 data. 18 Understanding the duplication bias present in the data and performing appropriate pre-processing are essential. However, the characteristics and effects of IVT duplicates on data analysis have not been fully elucidated; therefore, IVT duplicates may introduce unexpected biases or misinterpretations. Further validation and evaluation of the characteristics of IVT duplicates are required. If IVT duplicates are determined to be artifacts that should be removed, the appropriate use of IVT during pre-analytical processes and NGS library preparation should be reconsidered. These efforts are expected to improve the quality and productivity of NGS analysis by avoiding the erroneous removal of valid duplicates.
Unique molecular identifiers (UMIs) represent a modern approach to removing duplicates from NGS data. Multiplexed indexed unique molecule T7 amplification end-to-end sequencing (MINUTE-ChIP), an enhanced ChIP-seq protocol using both IVT and UMI technologies,19,20 exemplifies this by reducing the excessive filtering of reads compared to in silico deduplication via Picard—a process technically identical to our proposed paired-end deduplication. 19 This finding suggests the potential benefits of integrating UMIs into IVT-based protocols such as ChIL-seq and TIP-seq. Nevertheless, such an approach requires modifications to the library construction protocol. For this reason, and due to significant advantages like its compatibility with existing public data, we argue that our proposed analysis pipeline is also a valuable tool.
Our study presents an in silico method to distinguish and selectively retain IVT-derived duplicates while eliminating only PCR duplicates in NGS data generated by IVT-based methods. By leveraging paired-end sequencing, the proposed approach prevents unnecessary data loss, significantly increasing the number of usable reads without altering the experimental protocol. Our findings demonstrate that conventional single-end deduplication pipelines may overlook valuable IVT-derived reads, thereby underestimating true biological information. The enhanced data utilization achieved by this method enables more comprehensive analysis from the same dataset, offering deeper biological insights and improving the overall quality of IVT-based NGS investigations.
                                
                                
We thank Mariko Tasai, Yoshimi Iwasaki, Mitsuhiro Kurata, Yuuki Honda, Kanako Ichikawa, Ryo Ugawa, and Emiko Koba for excellent technical assistance. This study was performed on the infrastructure of Omics Science Center Secure Information Analysis System, Medical Institute of Bioregulation at Kyushu University.
 
        Ryoga Suzuki, Kenichi Horisawa, Kazumitsu Maehara, Yasuyuki Ohkawa, Atsushi Suzuki
Bioinformatics and Biology Insights
Vol 2025, Issue , pp. -
Issue published date: -01-
10.1177/11779322251365042