Unsupervised clustering reveals a tri-phenotype model of hospitalized COVID-19 patients: Beirut cohort study and literature synthesis

Abstract

Introduction

COVID-19, caused by severe acute respiratory syndrome coronavirus 2, has posed unprecedented challenges globally, with diverse clinical manifestations ranging from asymptomatic and mild symptoms to severe and fatal illness. Identifying patient subgroups with distinct clinical profiles could enhance individualized treatment strategies. Clustering mixed clinical data offers a promising avenue for uncovering meaningful patterns; however, few algorithms effectively manage heterogeneous datasets. This study applied two evidence-based clustering algorithms, KAMILA and K-prototypes, to categorize COVID-19 patients on the basis of medical history and biochemical and radiological data.

Methods

A retrospective cohort study was conducted on 556 COVID-19 patients admitted to Hôtel Dieu de France Hospital in Beirut between March 2020 and October 2021. Only data collected within the first 24 hours of admission were used for clustering to ensure early prognostic relevance. After data cleaning, missing values were multiply imputed to produce 30 completed datasets. The KAMILA and K-prototypes algorithms were applied to these datasets, generating solutions with two to six clusters. The optimal clustering solution was determined via the silhouette, Calinski–Harabasz, and Dunn indices, followed by statistical analyses to characterize cluster-specific patient profiles and outcomes.

Results

Clustering identified three distinct patient groups, with the KAMILA algorithm providing the best fit. Cluster 1 primarily included middle-aged male patients exhibiting elevated inflammatory markers, consistent oxygen requirements, and extended hospital stays. Cluster 2 included elderly patients with multiple comorbidities and high intensive care unit (ICU) admission rates, requiring cautious anticoagulation and early antibiotic intervention. Cluster 3 included younger, generally healthier individuals who required minimal interventions and experienced low mortality.

Conclusions

Mixed-data clustering revealed three COVID-19 patient clusters that are clinically meaningful, consistent with phenotypes reported worldwide, and carry prognostic and therapeutic implications. This unsupervised approach may inform early triage and resource allocation. Further prospective validation in diverse, vaccinated populations is warranted.

Keywords

COVID-19, clustering, machine learning, K-prototypes, KAMILA

Introduction

Even after the World Health Organization (WHO) ended the Public Health Emergency of International Concern (PHEIC) on 5 May 2023, COVID-19 remains an ongoing health issue, with millions of reported deaths worldwide and far higher excess-mortality estimates for 2020–2021.¹ Lebanon's multiple crises have kept the health system under acute strain: the collapse of the gross domestic product since 2018 has been aggravated by ongoing hostilities since late 2023, attacks on health care, plummeting routine immunization coverage, and a sustained exodus of staff.^2–7 Clinically, COVID-19 still spans mild to critical disease; risk is concentrated in older adults, those with comorbidities, and immunocompromised individuals.^8,9 Protection from the updated 2024–2025 vaccines persists against severe outcomes, and the circulation of novel descendant lineages has not shown increased intrinsic severity.^8–11 Against this backdrop, predicting individual patient responses and optimizing treatment to mitigate severe outcomes are no longer considered novel approaches, with machine learning and clustering algorithms offering promising solutions.

Clustering is a unique method in data analysis, particularly for complex diseases with evolving knowledge, such as COVID-19. By grouping data into clusters on the basis of intrinsic similarity, cluster analysis uncovers patterns that are often concealed by high data complexity.¹² This approach aligns with the evolution of precision medicine, enabling personalized care strategies, a trend increasingly observed in fields such as oncology and metabolic disorders.¹³ For COVID-19, identifying patient subgroups on the basis of unique clinical profiles can enhance individualized care and targeted interventions.

Clinical datasets often contain mixed data, encompassing both continuous and categorical variables, which raises the still-debated question of how feasible it is to cluster such data. There is limited guidance on optimal clustering methods for such data, with a notable exception being Preud’homme et al.,¹⁴ who used real and generated data to identify suitable and recommended clustering algorithms for mixed data. Among their recommended methods, KAMILA and K-prototypes stand out for their unique features. K-prototypes extends the k-means algorithm by combining Euclidean and Hamming distances to handle mixed data efficiently. It usually expects scaled continuous variables and balanced categorical data. KAMILA, in contrast, adapts k-means for heterogeneous datasets by modeling continuous data with mixture distributions and categorical data with multinomial distributions, and it is robust against data imbalance.
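To make the K-prototypes idea concrete, the following minimal Python sketch computes the dissimilarity between one patient and one cluster prototype as a squared Euclidean term on the continuous variables plus a weighted Hamming (mismatch) term on the categorical variables. The weight gamma, the function name, and the two example records are assumptions introduced for illustration only; they are not taken from the study, whose analysis was run in R.

```python
from typing import Sequence


def kprototypes_dissimilarity(
    num_a: Sequence[float],
    num_b: Sequence[float],
    cat_a: Sequence[int],
    cat_b: Sequence[int],
    gamma: float = 1.0,
) -> float:
    """Squared Euclidean distance on numeric features plus gamma times the
    number of mismatching categorical features (Hamming distance)."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical_part = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric_part + gamma * categorical_part


# Hypothetical records: (z-scored age, z-scored CRP) and (sex, diabetes, hypertension).
patient_numeric, patient_categorical = [0.5, 1.2], [0, 1, 1]
prototype_numeric, prototype_categorical = [0.0, 1.0], [0, 0, 1]

d = kprototypes_dissimilarity(
    patient_numeric, prototype_numeric,
    patient_categorical, prototype_categorical,
    gamma=0.5,  # assumed trade-off between numeric and categorical parts
)
print(round(d, 3))
```

A larger gamma lets categorical mismatches dominate the assignment, which is one reason the method is usually applied to scaled continuous variables.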

In this study, we applied the KAMILA and K-prototype methods to cluster COVID-19 patients treated at Hôtel Dieu de France Hospital in Beirut on the basis of medical histories and initial biochemical and radiological data. Our goal was to delineate patient categories, enabling tailored care strategies for newly admitted patients by matching them to predefined clusters and offering evidence-based treatment recommendations rooted in their unique clinical profiles.

Material and methods

We conducted a single-center, retrospective cohort study at Hôtel Dieu de France Hospital, including 556 hospitalized patients with confirmed COVID-19 from 3 March 2020 to 12 October 2021 (pre-Omicron variant phase¹⁵). Our goal was to assess early prognostic indicators and treatment effects on clinical outcomes. All statistical analyses were performed via R 4.3.1 (The R Foundation for Statistical Computing, Vienna, Austria).

Data collection

Study data were extracted from the hospital's electronic medical records (EMR) system. Patients eligible for inclusion had a confirmed COVID-19 diagnosis and were admitted during the study period. Only the data points collected within 24 hours of admission were used for clustering (Table 1), comprising variables commonly accessible upon admission for standard COVID-19 patients. The early data points chosen are intended to help clinicians promptly classify new admissions and guide initial treatment decisions. Data extracted from the EMR were cross-checked for consistency and quality. All data entries were n-tuple-checked by a workforce of residents to confirm accuracy, with discrepancies resolved through consensus and review of the source documents. The prognostic value of the different treatments administered during the stay was assessed by studying their effects on severity variables such as contraction of nosocomial infections, development of pneumomediastinum, intubation, occurrence of thromboembolic or hemorrhagic events, duration of hospitalization and death.

Table 1.

Variables used for patient clustering analysis.

Continuous variables
Age	Lymphocytes	Day symptoms started
CRP	Baseline CT ground glass estimation	Weight
D-dimers	Procalcitonin	Baseline CT pulmonary artery diameter
Ferritin	LDH	Leucocytes

Categorical variables		Definition
Sex		Biological gender
0		Male
1		Female
Chronic renal failure		Progressive kidney damage causing a buildup of waste and toxins in the body
0		Absent
1		Present
Cardiovascular disease		Any cardiac or vascular disease, e.g. coronary disease and heart failure
0		Absent
1		Present
Diabetes mellitus		Chronic metabolic disorder caused by a buildup of glucose in vessels and organs
0		Absent
1		Present
Hypertension		Chronic high blood pressure increasing the risk of stroke and cardiac events
0		Absent
1		Present
Immunosuppression		Patients with low immunity, e.g. cancer, transplantation and autoimmunity
0		Absent
1		Present
Lung disease		Chronic respiratory disorder, e.g. COPD, asthma, and pulmonary fibrosis
0		Absent
1		Present
Smoker		Patients with a history of tobacco use
0		Absent
1		Present
Baseline CT lobar condensation		Increased density of a lobe in the lung due to fluid, inflammation, or infection
0		Absent
1		Present
O₂ requirements		The volume of additional oxygen needed to keep the blood well saturated
0		No O₂
1		O₂ < 4 L/min
2		O₂ 4–8 L/min
3		High flow

This table lists the clinical and demographic variables collected within the first 24 hours of admission and used to group patients. The variables are divided into two types: continuous and categorical. For each categorical variable, a definition is provided along with the numerical codes (and their meanings) used in the clustering algorithm. CRP: C-reactive protein; CT: computed tomography; LDH: lactate dehydrogenase; COPD: chronic obstructive pulmonary disease.

Inclusion and exclusion criteria

Patients who had received experimental or compassionate-use treatments prior to admission were included, as were those with pre-existing terminal illnesses (e.g. advanced malignancy with end-of-life care) that could support severity assessment of COVID-19; this was intended to ensure that the treatment effects studied reflected the standard protocols in use in Lebanon at the time. The exclusion criteria, which were applied at the time of admission or during the statistical analysis, covered patients with insufficient baseline data or incomplete hospital records during the 24-hour window. We also excluded patients admitted for conditions unrelated to COVID-19.

Statistical analysis

We prepared the data via systematic cleaning, imputation, and mixed-type clustering. Variables with more than 20% missing data were excluded from the analysis. The remaining data were multiply imputed into 30 completed datasets using multiple imputation by chained equations (MICE). Redundant variables were identified via appropriate correlation metrics: Spearman correlation for continuous pairs, the φ coefficient for binary pairs, and point–biserial correlation for mixed pairs (thresholds: |r| and φ ≥0.60; point–biserial ≥0.50), removing one variable from any flagged pair on the basis of clinical relevance and data quality. We then removed near-zero-variance predictors using caret defaults (frequency ratios ≥95:5 and ≤10% unique values). Continuous features were z-scored. Clustering was performed separately in each imputed set using two nonlikelihood algorithms suited to mixed data, KAMILA and K-prototypes, for k = 2 to 6. For each k, we formed a consensus partition across the 30 imputations via clue::cl_consensus (squared-Euclidean partition distance). To quantify internal validity while respecting mixed data, we computed Gower distances on the most representative imputed set (the one with the highest adjusted Rand index to the consensus) and evaluated silhouette, Calinski–Harabasz, and Dunn indices; a composite score (product of the three) guided selection of the final k. Because KAMILA and K-prototypes do not maximize a parametric likelihood, likelihood-ratio statistics are not applicable to the clustering step; instead, we report consensus stability and the above distance-based validity metrics. Cluster differences in baseline characteristics were assessed with the Kruskal–Wallis test (continuous; with Dunn's post hoc test) and the χ² test (categorical; with pairwise comparisons when relevant).
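As an illustration of the distance-based validation step, the sketch below computes a plain Gower distance matrix for mixed data and the mean silhouette width of a given partition. It is a simplified Python rendering under stated assumptions: equal feature weights, no missing values, and a small invented dataset of six patients. The function names and data are hypothetical and do not reproduce the R routines used in the study.

```python
def gower_matrix(numeric, categorical):
    """Pairwise Gower distances for rows of numeric and categorical features."""
    n = len(numeric)
    n_num = len(numeric[0])
    n_feat = n_num + len(categorical[0])
    # Range of each numeric column, used to scale absolute differences to [0, 1].
    ranges = [
        max(row[j] for row in numeric) - min(row[j] for row in numeric)
        for j in range(n_num)
    ]
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j in range(n_num):
                if ranges[j] > 0:
                    total += abs(numeric[a][j] - numeric[b][j]) / ranges[j]
            # Categorical features contribute 1 per mismatch, 0 per match.
            total += sum(x != y for x, y in zip(categorical[a], categorical[b]))
            dist[a][b] = dist[b][a] = total / n_feat
    return dist


def mean_silhouette(dist, labels):
    """Average silhouette width of a partition, given a distance matrix."""
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [dist[i][j] for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / n


# Invented toy data: (age, CRP) and (diabetes, hypertension) for six patients.
numeric = [[45, 10], [50, 15], [48, 12], [78, 120], [82, 150], [75, 110]]
categorical = [[0, 0], [0, 0], [0, 1], [1, 1], [1, 1], [1, 0]]
labels = [0, 0, 0, 1, 1, 1]

dist = gower_matrix(numeric, categorical)
score = mean_silhouette(dist, labels)
print(f"mean silhouette width: {score:.2f}")
```

Because Gower distances are bounded between 0 and 1 and mix scaled numeric differences with simple matching, silhouette values on real clinical data are typically far lower than on a clean toy example such as this one.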

For association analyses within clusters (intensive care unit (ICU) and non-ICU strata), we fit univariable models per predictor–outcome pair. For the continuous outcome (length of stay), we used Huber M-estimation (MASS::rlm) to mitigate outlier and nonnormality influence; nonlinearity was screened via component-plus-residual plots, adding a quadratic term when indicated. We examined heteroscedasticity (Breusch–Pagan) and used HC1 sandwich variances when needed; influence was screened with Cook's distance (D_i >4/n triggers refit). For binary outcomes (e.g. mortality and hemorrhage) we used binomial logistic regression (logit link), again cluster-specific and univariable, adding a quadratic term for evident nonlinearity and using HC1 robust standard errors. As rlm is an M-estimator without a true likelihood, likelihood-ratio tests are not defined for the length of stay (LOS) models; for logistic generalized linear models, effect sizes with confidence intervals were prioritized for interpretability.
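The following short Python sketch illustrates why Huber M-estimation dampens the influence of outlying lengths of stay. It is a deliberately reduced, location-only version (an intercept with no predictors) solved by iteratively reweighted averaging, with the scale fixed at the initial median absolute deviation; MASS::rlm fits full regression models and handles scale differently, so this should be read as an assumption-laden illustration rather than the study's procedure. The tuning constant 1.345 is the conventional Huber value, and the six stays are invented.

```python
import statistics


def huber_location(values, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted averaging."""
    mu = statistics.median(values)
    # Robust scale: median absolute deviation, rescaled for normal consistency.
    scale = 1.4826 * statistics.median([abs(v - mu) for v in values])
    if scale == 0:
        return mu
    for _ in range(max_iter):
        weights = []
        for v in values:
            r = abs(v - mu) / scale
            # Full weight inside the threshold, shrinking weight beyond it.
            weights.append(1.0 if r <= k else k / r)
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Invented lengths of stay (days) with one extreme admission.
stays = [4, 5, 6, 7, 8, 60]
robust = huber_location(stays)
plain = statistics.mean(stays)
print(f"mean = {plain:.1f} days, Huber estimate = {robust:.1f} days")
```

The single 60-day stay pulls the arithmetic mean well above the typical value, whereas the Huber estimate stays close to the bulk of the data because the outlier receives a small weight.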

Results

Thirty imputed datasets were generated. Each was clustered via KAMILA and K-prototypes into k = 2, 3, 4, 5, and 6 clusters each. For each k, the consensus cluster was extracted. The number k of clusters with the highest silhouette, Calinski–Harabasz, and Dunn indices was chosen to be the best clustering solution; namely, the KAMILA algorithm with k = 3 clusters was the best overall (silhouette = 0.11, Calinski–Harabasz = 0.04, Dunn = 85.01). Each cluster was then divided into two subgroups on the basis of whether they were admitted to the ICU. Table 2 summarizes the frequency per cluster and group.
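The adjusted Rand index (ARI) used to pick the imputed dataset closest to the consensus partition can be computed from a contingency table of two labelings. The Python sketch below is a minimal, self-contained version; the consensus vector and the two imputation-specific partitions are invented for illustration and are far smaller than the study's 556 patients.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    # Pairs of items placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical consensus partition and two imputation-specific partitions.
consensus = [1, 1, 1, 2, 2, 2, 3, 3, 3]
imputations = {
    "imp_01": [2, 2, 2, 3, 3, 3, 1, 1, 1],  # same grouping, different labels
    "imp_02": [1, 1, 2, 2, 2, 2, 3, 3, 1],  # two patients assigned differently
}

ari = {name: adjusted_rand_index(consensus, part) for name, part in imputations.items()}
most_representative = max(ari, key=ari.get)
print(ari, most_representative)
```

The ARI ignores the arbitrary numbering of clusters, which matters here because each imputed dataset is clustered independently and cluster labels need not line up across runs.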

Table 2.

Distribution of patients across clusters and hospital units.

	Cluster 1	Cluster 2	Cluster 3	Total
Total	155 (27.88)	174 (31.29)	227 (40.83)	556
ICU	55 (44.72)	47 (38.21)	21 (17.07)	123
Floors	100 (23.09)	127 (29.33)	206 (47.58)	433

This table shows how the 556 patients in the study were allocated into the 3 distinct clusters identified by the KAMILA algorithm. It provides the total number and percentage of patients (of the total) in each cluster. Furthermore, it breaks down each cluster's population by their location of care: the ICU or the general hospital floors.
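When reading Table 2, it helps to separate two different percentages: the share of all ICU patients who belong to a cluster (the figure printed in the ICU row) and the proportion of a cluster's own patients who were admitted to the ICU. The short Python sketch below derives both from the counts in Table 2; the variable names are illustrative.

```python
# Counts taken from Table 2.
icu = {"Cluster 1": 55, "Cluster 2": 47, "Cluster 3": 21}
floors = {"Cluster 1": 100, "Cluster 2": 127, "Cluster 3": 206}

icu_total = sum(icu.values())  # all ICU patients in the cohort

share_of_icu = {}   # percentage of ICU patients who belong to each cluster
icu_rate = {}       # percentage of each cluster admitted to the ICU
for name in icu:
    cluster_size = icu[name] + floors[name]
    share_of_icu[name] = round(100 * icu[name] / icu_total, 2)
    icu_rate[name] = round(100 * icu[name] / cluster_size, 2)
    print(f"{name}: {share_of_icu[name]}% of ICU patients, "
          f"{icu_rate[name]}% of the cluster admitted to ICU")
```

The first quantity reproduces the percentages shown in the ICU row of Table 2, while the second expresses ICU admission as a within-cluster rate.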

Patient characteristics

All the imaging findings and quantitative measurements differed significantly across the clusters (p < 0.001). Immunocompromised status did not predict patient prognosis or treatment response (p = 0.48). Cluster 1 comprised predominantly males (≈5:1), typically >50 years, with the greatest weight and a comorbidity burden comparable to that of age-matched peers (>10% but <30% per condition). Patients presented later to the emergency department and showed higher inflammatory/cytotoxic markers (elevated ferritin and lactate dehydrogenase (LDH) and lymphopenia) with a high percentage of lobar consolidation and ground-glass opacities on the index computed tomography (CT); this cluster accounted for nearly half of all ICU admissions (Table 2), and hospital stays were the longest. Cluster 2 had a male-to-female ratio of ≈2:1 and was the oldest group; it harbored the highest comorbidity prevalence (>30% for each disease), laboratory signs of infection and thrombosis, and the greatest baseline CT consolidation burden. Cluster 3 was the youngest, had an approximately 1:1 sex distribution, and carried very few comorbidities (<10% for each, hypertension most common); it had the highest lymphocyte counts of the three clusters, although these remained within the normal range, as well as the fewest complications and the shortest stays. Summary contrasts appear in Figure 1 and Tables 3 and 4.

Figure 1.

The three clusters exhibited varying levels of routine blood tests. All tests are significant at p < 0.001 according to the Kruskal–Wallis test.

Table 3.

Comparison of categorical patient characteristics across clusters.

		Cluster 1	Cluster 2	Cluster 3	χ², p
Sex	Male	127 (81.94)	112 (64.37)	134 (59.03)	<0.01
Sex	Female	28 (18.06)	62 (35.63)	93 (40.97)
Diabetes	No	117 (75.48)	76 (43.68)	206 (90.75)	<0.01
Diabetes	Yes	38 (24.52)	98 (56.32)	21 (9.25)
Immunocompromised (e.g. cancer, autoimmunity)	No	144 (92.9)	155 (89.08)	207 (91.19)	0.48
Immunocompromised (e.g. cancer, autoimmunity)	Yes	11 (7.1)	19 (10.92)	20 (8.81)
Smoker	No	117 (75.48)	111 (63.79)	199 (87.67)	<0.01
Smoker	Yes	38 (24.52)	63 (36.21)	28 (12.33)
Cardiovascular	No	132 (85.16)	71 (40.8)	212 (93.39)	<0.01
Cardiovascular	Yes	23 (14.84)	103 (59.2)	15 (6.61)
Hypertension	No	82 (52.9)	11 (6.32)	159 (70.04)	<0.01
Hypertension	Yes	73 (47.1)	163 (93.68)	68 (29.96)
Renal disease	No	147 (94.84)	117 (67.24)	209 (92.07)	<0.01
Renal disease	Yes	8 (5.16)	57 (32.76)	18 (7.93)
Lung disease	No	129 (83.23)	117 (67.24)	204 (89.87)	<0.01
Lung disease	Yes	26 (16.77)	57 (32.76)	23 (10.13)
Lobar consolidation on CT scan	No	137 (88.39)	150 (86.21)	215 (94.71)	0.01
Lobar consolidation on CT scan	Yes	18 (11.61)	24 (13.79)	12 (5.29)

This table displays the breakdown of patients’ baseline characteristics for each of the three clusters. The data, organized by personal medical history and initial CT scan findings, are presented as the number of patients followed by the percentage within their cluster in parentheses. This allows for a direct comparison of the prevalence of these conditions among the groups.

Table 4.

Comparison of continuous patient characteristics across clusters.

	Cluster 1	Cluster 2	Cluster 3	Kruskal–Wallis, p
Age	62 [54, 70]	76 [70, 83]	55 [44, 66.5]	<0.001
Weight	87 [75.75, 95]	79.5 [70, 89.62]	77 [65, 86.5]	<0.001
Onset of symptoms (days)	−9 [−11.5, −7]	−5 [−8, −2]	−6 [−10, −3]	<0.001
GGO on 1st scan	40 [30, 50]	20 [10, 30]	15 [5, 20]	<0.001
PA diameter on 1st scan	26 [24, 29]	28 [25, 30]	25 [23.5, 27]	<0.001

This table compares the distributions of continuous variables across the three patient clusters. The data are presented as the median value, with the first and third quartiles shown in [ ]. GGO: ground-glass opacity; PA: pulmonary artery.

Adverse outcomes and treatment effects

Chronic comorbidities and adverse outcomes differed significantly across the clusters (p < 0.001). Cluster 1 patients almost uniformly required oxygen, most at a flow rate >4 L/min, and had prolonged stays. Roughly three quarters of cluster 2 patients required oxygen (most at <10 L/min); this cluster had a greater bleeding risk and also had long stays. Cluster 3 patients had the shortest stays, and approximately half did not require supplemental oxygen (Table 5).

Table 5.

Comparison of clinical outcomes and complications by cluster.

		Cluster 1	Cluster 2	Cluster 3
Oxygen requirements	No	1 (0.65)	41 (23.56)	121 (53.3)
Oxygen requirements	Yes (total)	153 (99.35)	133 (76.44)	106 (46.7)
Oxygen requirements	Yes, <4 L/min	19 (12.34)	55 (31.61)	71 (31.28)
Oxygen requirements	Yes, <10 L/min	65 (42.21)	60 (34.48)	26 (11.45)
Oxygen requirements	Yes, >10 L/min, HFNC	69 (44.81)	18 (10.34)	9 (3.96)
Death	Survived	128 (82.58)	116 (66.67)	210 (92.51)
Death	Deceased	27 (17.42)	58 (33.33)	17 (7.49)
Bleeding risk †	No bleeding event	146 (94.19)	152 (87.36)	218 (96.04)
Bleeding risk †	Bled	9 (5.81)	22 (12.64)	9 (3.96)
Length of stay (days)		12 [7, 19]	11 [6, 20]	6 [4, 11]

This table outlines the frequency of key clinical outcomes and adverse events experienced by patients in each of the three clusters. All p-values are <0.001, except where marked † (p < 0.01). HFNC: high-flow nasal cannula.

In the ICU subgroup, there was no difference in ICU length of stay or intubation risk between clusters. Antibiotic therapy was associated with longer stays; patients without infection more often had pneumothoraces. Cluster 2 had 2.36-fold greater odds of death than cluster 1 did (95% confidence interval (CI) 1.07–5.32, p = 0.035), whereas clusters 1 and 3 did not differ in mortality (p = 0.99). Among cluster 2 patients who received glucocorticoid therapy upon ICU admission, the odds of mortality differed significantly from those of patients who did not (odds ratio (OR) 1.32; 95% CI 1.05–1.77; p = 0.035).
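For readers less familiar with odds ratios, the Python sketch below shows how an OR and a Wald 95% confidence interval are obtained from a 2×2 table, which is what a univariable logistic regression with a single binary predictor estimates. It uses the crude whole-cohort death counts for clusters 2 and 1 from Table 5 purely as a worked example. It is not the cluster-specific ICU model reported above, and it uses the classical (non-robust) standard error rather than the HC1 variance described in the methods, so its interval is not expected to match the reported one.

```python
from math import exp, log, sqrt


def odds_ratio_wald(events_exposed, non_events_exposed, events_ref, non_events_ref, z=1.96):
    """Odds ratio with a Wald confidence interval computed on the log scale."""
    odds_ratio = (events_exposed * non_events_ref) / (non_events_exposed * events_ref)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log_or = sqrt(
        1 / events_exposed + 1 / non_events_exposed + 1 / events_ref + 1 / non_events_ref
    )
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Table 5, whole cohort: cluster 2 (58 deceased, 116 survived)
# versus cluster 1 as reference (27 deceased, 128 survived).
crude_or, ci_low, ci_high = odds_ratio_wald(58, 116, 27, 128)
print(f"crude OR = {crude_or:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Working on the log scale keeps the interval asymmetric around the OR and strictly positive, which is why confidence limits for odds ratios are usually reported this way.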

In the ward (non-ICU), cluster 3 had a 2.4-day shorter length of stay than clusters 1 and 2 did (95% CI 1.24–3.58). Across all three clusters, none of the other tested interventions (hydroxychloroquine ± azithromycin, ivermectin, lopinavir/ritonavir, remdesivir, shifts from prophylactic to curative anticoagulation or antiplatelets, broad-spectrum antibiotics, or advanced ventilatory adjuncts) yielded statistically significant benefits in mortality, ICU transfer, intubation, or length of stay.

Discussion

Our principal finding is that hospitalized COVID-19 patients in our center reproducibly separate into three clinically coherent phenotypes that map onto a familiar severity gradient and carry distinct prognoses: a group with elevated serum markers of severe inflammation, the highest ICU demand and substantial mortality (critical illness, Infectious Diseases Society of America classification¹⁶), an older multimorbid group with a high risk of thrombosis and superinfections (severe illness), and a young, low-risk group with the shortest stays and fewest adverse events (mild to moderate illness). In our data this translated into, respectively, longer admissions and high escalation needs (cluster 1: median LOS 12 d, 45% of ICU admissions, 17% mortality), an intermediate but still fragile profile marked by older age and multiple comorbidities (cluster 2: median LOS 11 d, 38% of ICU admissions, 33% mortality), and a comparatively favorable course (cluster 3: median LOS 6 d, 17% of ICU admissions, 7.5% mortality). Clinically, assigning phenotypes at admission offers a rapid and interpretable summary of the patient's host response that can steer triage, monitoring intensity, and early therapy. Because some patients with mild illness develop long COVID-19, including rare delayed deaths, a mild initial presentation does not necessarily predict favorable long-term outcomes.^17–19 Even in patients with a less severe disease course, careful acute evaluation, structured follow up, and treatment are therefore still necessary.

Several classifications published between 2020 and 2025 yielded similar patterns. Unsupervised analyses usually divide hospitalized COVID-19 patients into three common types: mild to moderate, severe, or critical illness. Sometimes an additional type is added, giving four groups.^20–29 Their reproducibility confers clear clinical utility, although a minority of series report two clusters²³ or, in long-COVID-19 or highly granular datasets, four.²⁴ The severe to critical/hyperinflammatory phenotype comprises predominantly male older adults,^30–32 with higher body mass index (BMI)¹⁵ and the heaviest burden of chronic cardiac, renal, hepatic, pulmonary, and oncological disease.³¹ Biochemically, it is associated with a C-reactive protein–LDH–D-dimer surge, neutrophilia, lymphopenia, a high neutrophil-to-lymphocyte ratio (NLR)^33,34 and elevated markers of multiple organ injury (aspartate aminotransferase, alanine aminotransferase, bilirubin, creatinine, urea, N-terminal pro-B-type natriuretic peptide [NT-proBNP], and troponin).³³ Radiologically, extensive ground-glass opacities or lobar consolidation associated with hypoxemia predominate.^25,30 Intermediate clusters with older, multimorbid patients but less extreme inflammation/organ injury lie between the extremes,^25,31 whereas less severe clusters center on younger adults, often female-skewed,³⁵ with preserved lymphocyte/monocyte counts, the lowest inflammatory indices³¹ and the sparsest CT changes.²⁵ Thus, irrespective of geography, data granularity, or algorithms, COVID-19 admissions reliably resolve into a hyperinflammatory severe to critical phenotype, a multimorbid phenotype and a low-risk phenotype composed of younger patients, underscoring a biologically coherent continuum of host responses.

Contemporary COVID-19 stratification studies use various unsupervised learning techniques,³¹ including K-means, x-means, Partitioning Around Medoids,^25,32 hierarchical methods (Ward's criterion),^25,36 model-based approaches (Latent Class Analysis (LCA),³⁷ Gaussian-mixture models,³⁸ Mixture of Autoregressions (MoAR), Multiple Factor Analysis (MFA)³⁹), dimensionality-reduction tools (Factor Analysis of Mixed Data,²⁵ Principal Component Analysis⁴⁰), self-organizing maps,⁴¹ rigorous preprocessing (log-transforming,²⁵ z-scaling, categorical encoding³²), chained-equation imputation,³⁵ and algorithms tailored to mixed or longitudinal data (Kml3d,³¹ distribution-specific LCA³⁷). Internal validity indices (Bayesian Information Criterion, Akaike Information Criterion, silhouette, Dunn, and Davies–Bouldin)^25,31 and stability checks (ensemble consensus,³⁵ imaging-exclusion sensitivity,²⁵ train-test splits,⁴² and Adjusted Rand Index concordance²⁵) ensure robustness. Methodologically, the literature encompasses K-means,²⁷ agglomerative clustering,²⁵ K-prototypes for mixed data,²⁴ KAMILA for kernel-enhanced mixed clustering,^14,23 probabilistic LCA,^21,29 hybrid unsupervised–supervised pipelines,^21,22 expectation–maximization mixtures,⁴³ and self-organizing maps.⁴⁴ Robustness is achieved through algorithm ensembles or external validation,^25,28 whereas fragility arises from small samples, missing data, and time-variant therapies, which are partially mitigated by imputation and temporal sensitivity checks. The convergence of three analogous clusters across diverse tools might suggest methodological reliability when best practices are followed, but it is worth questioning the robustness of these findings. Heterogeneous studies seem to yield a similar schema: a hyperinflammatory critical cluster, a multimorbid elderly cluster, and a mild low-risk cluster. 
However, some datasets merge extreme age and inflammation, whereas others separate a younger hyperinflammatory group from an older multimorbid group, raising questions about the consistency of these categorizations. Sex-based nuances, such as female-skewed low-risk clusters or a sensory-symptom female cluster with altered long-COVID-19 risk, are noted. The one-to-one mapping between this consensus and our three clusters might validate our mixed-data KAMILA/K-prototype approach, which suggests segregation of adult COVID-19 patients into inflammatory-critical, older-comorbid (prone to thrombi and superinfections), and young mild-to-moderate phenotypes, each of which presumably requires tailored management (aggressive immunomodulation, enhanced anticoagulation, or standard care, respectively).

This study has important limitations. The cohort comes from a single institution and is modest in size, and stratification and subgroup analyses further reduce precision. Recruitment occurred before widespread vaccination and under two dominant variants (alpha⁴⁵ and delta⁴⁶), so generalizability to vaccinated populations and to newer variants (omicron and later) is uncertain. Imaging relies on chest CT only, and we do not include other modalities such as SPECT or VQ SPECT CT, which could reveal thrombotic and other vascular changes in small subsegmental vessels, including in younger patients⁴⁷; thus, clustering may have been influenced by imaging modality and timing, and relevant disease may have been missed. Treatments were not randomized, and symptoms were not captured in a fully standardized manner, limiting causal inference and clinical detail. Clusters represent clinical tendencies rather than deterministic outcomes, and outliers exist, so any triage or access-to-care guidance derived from our phenotypes should support rather than replace personalized clinical judgment. The literature synthesis we provide is illustrative given the very large amount of COVID-19 literature and should not be read as exhaustive. We therefore present our work as exploratory and hypothesis-generating, and we encourage validation in larger multicenter cohorts with contemporary case mixes, standardized data capture, comprehensive imaging that includes perfusion and ventilation modalities, and longitudinal follow up to assess long-term outcomes such as long COVID-19.

Conclusion

This single-center, mixed-data unsupervised analysis shows how routinely collected variables can reveal three clinically interpretable phenotypes in hospitalized COVID-19 patients. The observed patterns resemble those reported elsewhere but may reflect the case mix and tests performed. Phenotypes reflect clinical tendencies rather than deterministic outcomes. Outliers exist, and patients with initially mild presentations may still develop long COVID-19. These groups should support, not replace, personalized clinical judgment, acute evaluation, and structured follow up.

This study presents a practical workflow for clustering mixed-type data, incorporating multiple imputation and robust internal validation, and shows how phenotype assignment can be used to organize care pathways. We do not assert universal applicability across countries, populations, or viral variants. Future work should test these findings in larger, contemporary, multicenter cohorts, incorporate multimodal imaging, and follow patients longitudinally to quantify long-term outcomes. With external validation, phenotype-aware approaches may complement established triage and treatment decisions.

Footnotes

Acknowledgments

The author thanks Dr Georges Maalouly for his help throughout the journey of conceptualizing, writing and publishing this study. The author also appreciates Dr Mouin Jamal's help in study conceptualization, Dr Rindala Saliba's supervision, and the access to the hospital's COVID-19 dataset granted by Dr Moussa Riachy and Dr Ghassan Sleilaty.

ORCID iDs

Christopher El Hadi

Rindala Saliba

Georges Maalouly

Moussa Riachy

Ethical approval

Our study was approved by the ethical board at Hotel-Dieu de France Hospital. The ethics committee approval code is Tfem-2022’-30.

Informed consent

This retrospective study did not require patient consent for data interrogation because no intervention was introduced and all patients were anonymised.

Contributorship

CEH conceived the study, analysed the data, and wrote the manuscript. GM supervised the project, and all authors reviewed the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Guarantor

GM is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Peer review

All reviewer and editorial comments have been addressed in a satisfactory manner.

Data availability

The datasets analyzed during the current study are not publicly available due to strict regulations imposed by the Hotel Dieu de France Hospital; however, they are available from the corresponding author upon reasonable request.

References

1. Statement on the fifteenth meeting of the IHR (2005) Emergency Committee on the COVID-19 pandemic [Internet]. [cited 2025 Sept 8]. Available from: https://www.who.int/news/item/05-05-2023-statement-on-the-fifteenth-meeting-of-the-international-health-regulations-(2005)-emergency-committee-regarding-the-coronavirus-disease-(covid-19)-pandemic.

2. World Bank [Internet]. [cited 2024 Nov 7]. Overview. Available from: https://www.worldbank.org/en/country/lebanon/overview.

3. Khoury, Azar, Hitti. COVID-19 response in Lebanon: current experience and challenges in a low-resource setting. JAMA. 2020;324:548.

4. Thomson Reuters Foundation. FEATURE-Lebanon’s patients suffer as crisis causes health worker exodus. Reuters [Internet]. 2022 June 8 [cited 2025 Sept 8]; Available from: https://www.reuters.com/article/business/healthcare-pharmaceuticals/feature-lebanons-patients-suffer-as-crisis-causes-health-worker-exodus-idUSL8N2XT1MH/.

5. World Bank [Internet]. [cited 2025 Sept 8]. Lebanon Economic Monitor. Available from: https://www.worldbank.org/en/country/lebanon/publication/lebanon-economic-monitor.

6. Public Health Situation Analysis - Lebanon [Internet]. [cited 2025 Sept 8]. Available from: https://www.who.int/publications/m/item/public-health-situation-analysis---lebanon.

7. Lebanon: a conflict particularly destructive to health care [Internet]. [cited 2025 Sept 8]. Available from: https://www.who.int/news/item/22-11-2024-lebanon--a-conflict-particularly-destructive-to-health-care.

8. Loubet, Benotmane, Fourati, et al. Risk of severe COVID-19 in four immunocompromised populations: a French expert perspective. Infect Dis Ther 2025; 14: 671–733.

9. CDC. COVID-19. 2025 [cited 2025 Sept 8]. Underlying Conditions and the Higher Risk for Severe COVID-19. Available from: https://www.cdc.gov/covid/hcp/clinical-care/underlying-conditions.html.

10. Link-Gelles. Interim estimates of 2024–2025 COVID-19 vaccine effectiveness among adults aged ≥18 years — VISION and IVY Networks, September 2024–January 2025. MMWR Morb Mortal Wkly Rep [Internet]. 2025 [cited 2025 Sept 8];74. Available from: https://www.cdc.gov/mmwr/volumes/74/wr/mm7406a1.htm.

11. Statement on the antigen composition of COVID-19 vaccines [Internet]. [cited 2025 Sept 8]. Available from: https://www.who.int/news/item/15-05-2025-statement-on-the-antigen-composition-of-covid-19-vaccines.

12. Pina, Macedo, Henriques. Clustering clinical data in R. In: Matthiesen R, editor. Mass spectrometry data analysis in proteomics [Internet]. New York, NY: Springer New York; 2020 [cited 2023 Feb 4]. p. 309–43. (Methods in Molecular Biology; vol. 2051). Available from: http://link.springer.com/10.1007/978-1-4939-9744-2_14.

13. El Hadi, Ayoub, Bachir, et al. Polygenic and network-based studies in risk identification and demystification of cancer. Expert Rev Mol Diagn 2022; 22: 427–438.

14. Preud’homme, Duarte, Dalleau, et al. Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark. Sci Rep 2021; 11: 4202.

15. The MoPH confirms two cases of Omicron variant [Internet]. 2021. Available from: https://www.moph.gov.lb/en/DynamicPages/view_page/57232/21/moph-omicron-variant#:∼:text=The%20Ministry%20of%20Public%20Health%20·%20Our,·%20Hospital%20Accreditation%20·%20Primary%20Health%20Care.

16. Bhimraj, Morgan, Shumaker, et al. Infectious Diseases Society of America guidelines on the treatment and management of patients with COVID-19 (September 2022). Clin Infect Dis 2024; 78: e250–e349.

17. Munblit, O’Hara, Akrami, et al. Long COVID: aiming for a consensus. Lancet Respir Med 2022 July; 10: 632–634.

18.

Skevaki

Moschopoulos

Fragkou

, et al. Long COVID: pathophysiology, current concepts, and future directions. J Allergy Clin Immunol 2025; 155: 1059–1070.

19.

Greenhalgh

Sivan

Perlowski

, et al. Long COVID: a clinical update. Lancet 2024; 404: 707–724.

20.

Epsi

Powers

Lindholm

, et al. A machine learning approach identifies distinct early-symptom cluster phenotypes which correlate with hospitalization, failure to return to activities, and prolonged COVID-19 symptoms. Delanerolle G, editor. PLoS ONE 2023; 18: e0281272.

21.

Banoei

Rafiepoor

Zendehdel

, et al. Unraveling complex relationships between COVID-19 risk factors using machine learning based models for predicting mortality of hospitalized patients and identification of high-risk group: a large retrospective study. Front Med 2023; 10: 1170331.

22.

Bello

Bundey

Bhave

, et al. Integrating AI/ML models for patient stratification leveraging omics dataset and clinical biomarkers from COVID-19 patients: a promising approach to personalized medicine. Int J Mol Sci 2023; 24: 6250.

23.

Fernández

Perez-Alvarez

Molist

and on behalf of the DIVINE project. COVID-19 patient profiles over four waves in Barcelona metropolitan area: a clustering approach. Chang DW, editor. PLoS ONE 2024; 19: e0302461.

24.

Lau

KYY

Kwok

, et al. An unsupervised machine learning clustering and prediction of differential clinical phenotypes of COVID-19 patients based on blood tests—a Hong Kong population study. Front Med 2022; 8: 764934.

25.

Yamga

Mullie

Durand

, et al. Identifying COVID-19 phenotypes using cluster analysis and assessing their clinical outcomes [Internet]. 2022 [cited 2025 June 22]. Available from: http://medrxiv.org/lookup/doi/10.1101/2022.05.27.22275708.

26.

Sokolski

Trenson

Reszka

, et al. Phenotype clustering of hospitalized high-risk patients with COVID-19—a machine learning approach within the multicentre, multinational PCHF-COVICAV registry. Cardiol J 2024; 31: 512–521.

27.

Nalinthasnai

Thammasudjarit

Tassaneyasin

, et al. Unsupervised machine learning clustering approach for hospitalized COVID-19 pneumonia patients. BMC Pulm Med 2025; 25: 70.

28.

Zhang

Flory

, et al. Clinical subphenotypes in COVID-19: derivation, validation, prediction, temporal patterns, and interaction with social determinants of health. Npj Digit Med 2021; 4: 10.

29.

Wang

Jehi

, et al. Phenotypes and subphenotypes of patients with COVID-19. Chest 2021; 159: 2191–2204.

30.

Qaisieh

Al-Tamimi

El-Hammuri

, et al. Clinical, laboratory, and imaging features of COVID-19 in a cohort of patients: cross-sectional comparative study. JMIR Public Health Surveill 2021; 7: e28005.

31.

San-Cristobal

Martín-Hernández

Ramos-Lopez

, et al. Longwise cluster analysis for the prediction of COVID-19 severity within 72 h of admission: COVID-DATA-SAVE-LIFES cohort. J Clin Med 2022; 11: 3327.

32.

Benito-León

Del Castillo

Estirado

, et al. Using unsupervised machine learning to identify age- and sex-independent severity subgroups among patients with COVID-19: observational longitudinal study. J Med Internet Res 2021; 23: e25988.

33.

Borges Do Nascimento

Von Groote

O’Mathúna

, et al. Clinical, laboratory and radiological characteristics and outcomes of novel coronavirus (SARS-CoV-2) infection in humans: a systematic review and series of meta-analyses. Farag E, editor. PLoS ONE 2020; 15: e0239235.

34.

Lusczek

Ingraham

Karam

, et al. Characterizing COVID-19 clinical phenotypes and associated comorbidities and complication profiles. Lazzeri C, editor. PLoS ONE 2021; 16: e0248956.

35.

Han

Shen

Yan

, et al. Exploring the clinical characteristics of COVID-19 clusters identified using factor analysis of mixed data-based cluster analysis. Front Med 2021; 8: 644724.

36.

Sigwadhi

Tamuzi

Zemlin

, et al. Latent class analysis: an innovative approach for identification of clinical and laboratory markers of disease severity among COVID-19 patients admitted to the intensive care unit. IJID Reg 2022; 5: 154–162.

37.

Improvement of k-means clustering performance on disease clustering using Gaussian mixture model. J Syst Manag Sci [Internet]. 2023 Sept 28 [cited 2025 July 10];13(5). Available from: http://www.aasmr.org/jsms/Vol13/No.5/Vol.13%20No.5.11.pdf.

38.

Maleki

Bidram

Wraith

. Robust clustering of COVID-19 cases across U.S. counties using mixtures of asymmetric time series models with time varying and freely indexed covariates. J Appl Stat 2023; 50: 2648–2662.

39.

Tang

, et al. Identification of COVID-19 clinical phenotypes by principal component analysis-based cluster analysis. Front Med 2020; 7: 570614.

40.

Ilbeigipour

Albadvi

Akhondzadeh Noughabi

. Cluster-based analysis of COVID-19 cases using self-organizing map neural network and K-means methods to improve medical decision-making. Inform Med Unlocked 2022; 32: 101005.

41.

Kisiel

Lee

Malmquist

, et al. Clustering analysis identified three long COVID phenotypes and their association with general health status and working ability. J Clin Med 2023; 12: 3617.

42.

University Teknologi Malaysia, Mohamand Noor NF, Sipail HS, University Teknologi Malaysia, Ahmad N, University Teknologi Malaysia, et al. COVID-19: Symptoms Clustering and Severity Classification Using Machine Learning Approach. Int J Integr Eng [Internet]. 2023 July 31 [cited 2024 Nov 7];15(3). Available from: https://publisher.uthm.edu.my/ojs/index.php/ijie/article/view/12809/5811.

43.

Chen

Yao

. Trajectory tracking of COVID-19 epidemic risk using self-organizing feature map. Bull Chin Acad Sci 2022; 36: 2022003.

44.

First Coronavirus Case in Lebanon [Internet]. 2020. Available from: https://www.moph.gov.lb/en/DynamicPages/view_page/25530/22/minister-hasan-first-coronavirus-case-lebanon-#:∼:text=Hamad%20Hasan%20confirmed%20the%20first,WHO%20representative%20to%20Lebanon%2C%20Dr.

45.

Lebanon Records Three Cases of the COVID-19 Delta Variant [Internet]. Available from: https://www.moph.gov.lb/en/Drugs/index/0/52278/page:32/sort:Drug.atc/direction:asc#:∼:text=Assessment%20and%20Treatment-,Minister%20Hasan:%20Lebanon%20Records%20Three%20Cases%20of%20the%20COVID%2D19,على%20تحديد%20مصدر%20هذه%20الحالات%22.

46.

Evbuomwan

Endres

Tebeila

, et al. Identification and follow-up of COVID-19 related matching ventilation and perfusion defects on functional imaging using VQ SPECT/CT. Nucl Med Mol Imaging 2023; 57: 9–15.

47.

Zhong

Wang

, et al. Clinical determinants of the severity of COVID-19: a systematic review and meta-analysis. Lazzeri C, editor. PLoS ONE 2021; 16: e0250602.