ClusterDE: A Statistical Software Package for Removing Double-Dipping Bias in Post-Clustering Differential Expression Analysis

Abstract

Typical pipelines for single-cell and spatial transcriptomics involve clustering cells or spatial spots, followed by post-clustering differential expression (DE) analysis to identify marker genes for annotating clusters as cell types or spatial domains. However, using the same data for both clustering and DE analysis—a problem known as double-dipping—can lead to spurious detection of DE genes. In particular, over-clustering can produce artificial clusters that are incorrectly interpreted as distinct cell types or spatial domains. To address this issue, the ClusterDE R package implements a statistical method using a synthetic null dataset, which consists of a single homogeneous cell population or spatial domain but is constructed to match the real dataset in terms of gene means, variances, and gene-gene rank correlations. By serving as a parallel negative control, the synthetic null data allow users to identify and remove false-positive DE genes arising from double-dipping. This article introduces the ClusterDE R package and provides practical guidance on installation and usage for more reliable marker gene detection following clustering.
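To make the negative-control idea concrete, the following Python sketch compares per-gene DE scores from the real data with scores obtained by running the same clustering and DE steps on the synthetic null data. It is an illustration only. The function name `select_de_genes`, the use of score differences as contrast scores, and the symmetry-based threshold rule are assumptions made for this sketch, not the ClusterDE package's API or its confirmed implementation.

```python
def select_de_genes(real_scores, null_scores, fdr=0.05):
    """Return indices of genes kept after comparing real vs. synthetic-null DE scores.

    real_scores and null_scores are per-gene DE scores (for example,
    -log10 p-values) listed in the same gene order. Hypothetical helper.
    """
    if len(real_scores) != len(null_scores):
        raise ValueError("score lists must have the same length")

    # Contrast score: a large positive value means the gene looks DE in the
    # real data but not in the homogeneous synthetic null data.
    contrast = [r - n for r, n in zip(real_scores, null_scores)]

    # Candidate thresholds: the distinct non-zero magnitudes, smallest first.
    candidates = sorted({abs(c) for c in contrast if c != 0})

    for t in candidates:
        n_neg = sum(1 for c in contrast if c <= -t)  # proxy for false positives
        n_pos = sum(1 for c in contrast if c >= t)   # candidate discoveries
        # Estimated false discovery proportion at threshold t (assumed rule).
        if (1 + n_neg) / max(n_pos, 1) <= fdr:
            return [i for i, c in enumerate(contrast) if c >= t]

    # No threshold meets the target FDR: report no DE genes.
    return []


# Toy example: 30 genes with strong real signal, 4 genes that look alike
# in the real and null data.
real = [9.0] * 30 + [1.0, 2.0, 1.0, 2.0]
null = [1.0] * 30 + [2.0, 1.0, 2.0, 1.0]
kept = select_de_genes(real, null, fdr=0.05)
print(len(kept), "genes retained")
```

Genes whose scores are just as large in the synthetic null data as in the real data receive contrast scores near zero or below, so they fall under the threshold and are dropped. This is the sense in which the null data act as a parallel negative control.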

Keywords

clustering, differential expression, single-cell RNA-seq, spatial transcriptomics
