Abstract
Integrating Big Data, such as large-scale transcriptomic datasets across diseases, remains a major challenge, in part because of inconsistent preprocessing and the lack of a standardized, reproducible analytical framework. Existing pipelines often rely on manual parameter tuning and fragmented scripts, which limits cross-dataset comparability and downstream interpretability. We developed disint (disease integration and clustering toolkit), an open-source Python framework for standardized cross-dataset expression integration, embedding, and clustering. The pipeline implements housekeeping gene-based normalization, disease-specific log2 fold-change computation, automated hyperparameter optimization for Uniform Manifold Approximation and Projection (UMAP), and adaptive K-means clustering. Building on its outputs, we further implemented a prototype downstream module, disease reposition, which extracts disease-specific gene signatures, evaluates their shared components, and explores potential drug repositioning candidates. The framework was validated on 28 transcriptomic datasets encompassing 34 disease categories and 386 samples (255 patient and 131 healthy control samples), covering 194,182 genes in total. This validation highlights the reproducibility, scalability, and translational versatility of the proposed framework.
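As an illustration of the first two pipeline steps named above, the following is a minimal sketch and not the disint implementation. The abstract does not give the formulas, so the scaling rule (equalizing each sample's mean housekeeping expression), the pseudocount, the function names, and the toy gene list are all assumptions made for this example.

```python
import numpy as np


def housekeeping_normalize(expr, gene_names, housekeeping):
    """Rescale each sample so its mean housekeeping-gene expression matches
    the cross-sample average (assumed rule, not taken from the paper).

    expr: array of shape (n_genes, n_samples) with non-negative values.
    """
    expr = np.asarray(expr, dtype=float)
    hk = set(housekeeping)
    idx = [i for i, g in enumerate(gene_names) if g in hk]
    if not idx:
        raise ValueError("none of the housekeeping genes are present")
    hk_mean = expr[idx, :].mean(axis=0)   # one value per sample
    if np.any(hk_mean <= 0):
        raise ValueError("a sample has zero housekeeping expression")
    target = hk_mean.mean()               # common reference level
    return expr * (target / hk_mean)      # broadcast over genes


def log2_fold_change(expr, is_patient, pseudocount=1.0):
    """Per-gene log2 ratio of mean patient to mean control expression.

    The pseudocount (an assumption) avoids taking the log of zero.
    """
    expr = np.asarray(expr, dtype=float)
    is_patient = np.asarray(is_patient, dtype=bool)
    patient = expr[:, is_patient].mean(axis=1)
    control = expr[:, ~is_patient].mean(axis=1)
    return np.log2((patient + pseudocount) / (control + pseudocount))


# Toy data: 3 genes x 4 samples; the first two samples are patients.
genes = ["ACTB", "GAPDH", "TP53"]
raw = np.array([
    [10.0, 30.0, 10.0, 30.0],   # ACTB (housekeeping)
    [10.0, 30.0, 10.0, 30.0],   # GAPDH (housekeeping)
    [3.5, 10.5, 0.5, 1.5],      # TP53
])
labels = np.array([True, True, False, False])

norm = housekeeping_normalize(raw, genes, ["ACTB", "GAPDH"])
lfc = log2_fold_change(norm, labels)
```

In this toy input the second and fourth samples have three times the housekeeping signal of the others, so normalization brings all housekeeping values to a common level before patients and controls are compared. The resulting per-gene log2 fold-change vector is the kind of disease-level profile that an embedding and clustering step could then take as input.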
