Reproducible and Multi-Study Transcriptomic Integration with disint, Disease Integration and Clustering Toolkit, and Application to Drug Repositioning

Abstract

Integrating big data, such as large-scale transcriptomic datasets across diseases, remains a major challenge, in part because of inconsistent preprocessing and the lack of a standardized, reproducible analytical framework. Existing pipelines often rely on manual parameter tuning and fragmented scripts, which limits cross-dataset comparability and downstream interpretability. We developed disint (disease integration and clustering toolkit), an open-source Python framework for standardized cross-dataset expression integration, embedding, and clustering. The pipeline implements housekeeping gene-based normalization, disease-specific log₂ fold-change computation, automated hyperparameter optimization for Uniform Manifold Approximation and Projection (UMAP), and adaptive K-means clustering. Building on its outputs, we implemented a prototype downstream module, disease reposition, which extracts disease-specific gene signatures, evaluates their shared components, and explores potential drug repositioning candidates. The framework was validated on 28 transcriptomic datasets encompassing 34 disease categories and 386 samples (255 patient and 131 healthy control samples), covering 194,182 genes in total. These results highlight the reproducibility, scalability, and translational versatility of the proposed framework.
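
To make the first two pipeline steps named above more concrete, the following is a minimal sketch of housekeeping gene-based normalization followed by a per-gene log₂ fold change between patient and control samples. It is not the disint API: the function names, the choice of scaling each sample by the mean expression of its housekeeping genes, the pseudocount, and the example gene names and values are all assumptions made for illustration.

```python
import math
from statistics import mean


def housekeeping_normalize(sample, housekeeping_genes):
    """Scale one sample by the mean expression of its housekeeping genes.

    `sample` maps gene name -> expression value. Using the mean of the
    housekeeping genes as the scaling factor is an assumption here.
    """
    reference = mean(sample[gene] for gene in housekeeping_genes)
    if reference <= 0:
        raise ValueError("housekeeping reference expression must be positive")
    return {gene: value / reference for gene, value in sample.items()}


def log2_fold_change(patients, controls, pseudocount=1e-3):
    """Per-gene log2 ratio of mean patient to mean control expression.

    The pseudocount (an assumed value) avoids taking the log of zero.
    """
    fold_changes = {}
    for gene in patients[0]:
        patient_mean = mean(s[gene] for s in patients)
        control_mean = mean(s[gene] for s in controls)
        fold_changes[gene] = math.log2(
            (patient_mean + pseudocount) / (control_mean + pseudocount)
        )
    return fold_changes


# Hypothetical toy data: two housekeeping genes and one gene of interest.
HOUSEKEEPING = ["ACTB", "GAPDH"]
raw_patients = [{"ACTB": 100, "GAPDH": 100, "GENE_X": 400}]
raw_controls = [{"ACTB": 50, "GAPDH": 50, "GENE_X": 100}]

patients = [housekeeping_normalize(s, HOUSEKEEPING) for s in raw_patients]
controls = [housekeeping_normalize(s, HOUSEKEEPING) for s in raw_controls]
lfc = log2_fold_change(patients, controls, pseudocount=0.0)
```

In this toy case the raw GENE_X values differ fourfold, but after scaling by the housekeeping reference the difference is twofold, which is the kind of between-dataset depth effect that reference-gene normalization is meant to absorb before fold changes are compared across studies.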

Keywords

transcriptome integration, big data, drug repositioning, integrative biology, systems biology, diseasome, data analysis tools
