Analyzing dengue outbreak patterns using integrated machine learning approaches: A study in Bangladesh

Abstract

Dengue fever remains a persistent global health threat, particularly in Southeast Asia, the Pacific, and the Americas. This study aims to improve early detection and prediction of dengue outbreaks by addressing the challenges of data scarcity and complex transmission factors through a hybrid machine learning approach. We developed a methodology that integrates clustering and classification techniques to identify and predict seasonal patterns of dengue risk. Using regional data from Bangladesh, clustering was performed to uncover latent patterns, with optimal clusters selected based on low inertia and high silhouette scores. The supervised machine learning models were then trained on labeled data to classify dengue risk levels using key meteorological and demographic characteristics. Clustering analysis revealed well-defined structures within the data, with a silhouette score of 0.774, indicating robust clustering quality. The classification models demonstrated exceptional performance, achieving more than 99% in accuracy, precision, recall, and F1 score metrics. These models effectively identified high-risk periods and regions with strong seasonal trends in dengue incidence. Overall, this study presents a data-driven framework for the early detection of dengue outbreaks, supporting proactive public health strategies, while also contributing to the identification of dengue patterns and serving as a tool for controlling infectious diseases.

Keywords

machine learning; clustering; classification; risk prediction; public health; health informatics

Introduction

Dengue fever is a significant global health concern, with estimates of the annual incidence of infections ranging from 50 million to almost 400 million.¹ The disease poses a continuing threat, with approximately 50 million dengue infections and around 500,000 individuals hospitalized with dengue hemorrhagic fever each year, particularly in Southeast Asia, the Pacific, and the Americas.² Urbanization has been found to accelerate the growth and survival of Aedes albopictus, a mosquito that spreads dengue fever. This could cause more outbreaks of dengue fever.³ Furthermore, the incidence of dengue fever has been observed to fluctuate over the years, with reported cases and deaths varying annually.⁴

The spread of dengue fever is influenced by various factors, including human travel patterns, international traffic between dengue-endemic and dengue-free countries, and the presence of suitable vectors for the virus.⁵ Furthermore, dengue outbreaks are significantly influenced by climate and socio-ecological settings; therefore, prevention strategies should focus on reducing the urban heat island effect, modifying vector control measures, and providing targeted health education to travelers at border points.⁶ Monsoonal precipitation has a significant impact on the dynamics of dengue epidemics because it increases mosquito populations, which in turn causes dengue outbreaks.⁷ Furthermore, dengue fever is becoming more commonplace worldwide, affecting public health.⁸ Climate-related risk factors also impact the spread of dengue fever.⁹ Aedes aegypti mosquitoes are the primary vectors of this fever, posing a serious health risk in many tropical nations.¹⁰

Dengue, a crucial arthropod-borne viral infection, causes millions of annual infections worldwide.¹¹⁻¹³ The importance of including socioeconomic covariates in assessing dengue risk is underscored by its global scale.¹⁴ Early and accurate diagnosis is imperative for disease surveillance and control, requiring practical analysis and detection methods.¹⁵ Distinguishing dengue from other dengue-like diseases is crucial, emphasizing the need for precise analytical approaches.¹⁶ Furthermore, the links between imported and indigenous cases of dengue highlight the importance of preventing disease spread and promptly identifying high-risk individuals in high-risk areas. This shows the need for thorough research and effective prevention methods. The potential development of antiviral drugs and vaccines underscores the importance of ongoing analysis and research for effective disease management.¹¹ The life-threatening nature of dengue also emphasizes the need to increase awareness among physicians in temperate areas, requiring continuous education and analysis of the disease.

Accurate analysis and prediction are crucial for effective disease management. Machine learning models have played an important role in understanding dengue population dynamics and virus evolution and in informing intervention strategies.¹⁷ However, challenges persist due to a limited understanding of transmission factors and concerns about data availability.¹⁸ Spatial analysis, age-related epidemiology, and the influence of immunological interactions further complicate the analysis.¹⁹,²⁰ Machine learning works better when it draws on a variety of data sets from genetics and epidemiology²¹ and on advanced statistical methods, such as principal component analysis.²²

In response to the growing threat of dengue, particularly in regions like Bangladesh, this study addresses key knowledge gaps through a hybrid machine learning methodology. The proposed approach integrates clustering and classification techniques supported by systematic data retrieval, cleaning, feature engineering, and multivariate trend analysis. The elbow method and the silhouette score guide the clustering to identify optimal groupings characterized by strong internal consistency, as indicated by low inertia and high silhouette values. Supervised learning models, including Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), and Logistic Regression (LR), achieved outstanding performance, with accuracy, precision, recall, and F1 scores that all exceeded 0.99. Experimental results show a seasonal trend in dengue incidence, particularly from August to October. The adaptability of this hybrid framework, along with informative visualizations of clusters, decision boundaries, and temporal trends, underscores its potential as a reliable tool for disease surveillance and early warning.

This study makes a significant contribution by introducing an innovative hybrid methodology that integrates clustering and classification techniques. The key contributions can be summarized as follows:

• Hybrid Methodology: The study employs a novel hybrid method that integrates data collection, cleaning, feature engineering, multivariate trend analysis, and the fusion of clustering and classification techniques to enhance the accuracy of dengue analysis.

• Structured Dengue Dataset: This article focuses on the preparation of a structured dengue dataset derived from the medical histories of Bangladeshi patients over 5 years.

• Clustering Analysis: The elbow method and silhouette score serve as guides for the clustering analysis, which finds ideal clusters with low inertia and high silhouette scores.

• Supervised Learning Models: It aims to validate the quality of the cluster separation and build predictive models capable of identifying the dengue risk level for new or future observations, thereby converting unlabeled historical data into a usable training set for future risk prediction.

• Versatile Tool for Surveillance: Positioned as a versatile and accurate tool for infectious disease surveillance and control, this methodology is complemented by insightful visualizations.

• Actionable Insights: Beyond filling crucial gaps in dengue analysis, the research provides actionable insights for public health interventions, resource allocation, and targeted strategies.

This article is structured as follows. Section Related works reviews related studies and highlights the key research gaps addressed in this study. Section Methodology describes the methodological framework and its associated modules, while Section Experiments and result analysis explains how the research was conducted and presents the findings in detail. Finally, Section Conclusion concludes the research and suggests future directions.

Related works

The literature on dengue prediction in Bangladesh encompasses various models and methodologies aimed at addressing the increasing public health problem. Naher et al. (2022) emphasized the importance of early warning systems in controlling dengue epidemics and recommended the ARIMA model for accurate forecasting. Their study identified an effective forecasting model and predicted a worrying rise in dengue cases.²³ Khan et al. (2022) addressed the challenge of predicting dengue outbreaks in Bangladesh using multilayer perceptron neural networks (NN) and decision tree models. Although their research outperforms conventional methods, achieving an accuracy of over 90% and detecting seasonal patterns remains challenging.²⁴

Krishnan et al. (2022) contribute to the literature by predicting dengue outbreaks using artificial neural networks and decision tree models, both of which are based on meteorological data. The study emphasizes climate variables in predicting outbreaks but identifies limitations in the dataset’s parameters.²⁵ Rana et al. (2022) propose the Dengue Fever Expert System using Machine Learning Analytics (DFES-MLA) for efficient dengue prediction. While their model showcases strengths, challenges in selecting vulnerable features and the correct classifier are acknowledged.²⁶

Chakraborty et al. (2019) introduce a hybrid ARIMA-NNAR model for forecasting dengue epidemics, emphasizing its reduction of model uncertainty. However, the study acknowledges the universal challenge of model applicability in diverse circumstances.²⁷ Lowe et al. (2014) present an innovative approach to forecasting dengue fever risk during the World Cup in Brazil, integrating real-time seasonal climate forecasts. The study provides valuable early warnings but is criticized for lacking a comprehensive discussion on limitations.²⁸

Hoyos et al. (2021) conducted a systematic literature review on dengue modeling using machine learning techniques, identifying challenges and opportunities.²⁹ Despite limitations in the scope of the search, the review offers valuable insights into improving dengue control through diagnostic, epidemiological, and intervention models.

Caicedo-Torres et al. (2016) suggest using machine learning to predict the severity of dengue early on; however, they acknowledge that their dataset is not fully balanced.³⁰ The review by Hoyos et al. (2021) also highlights the urgent need for models to manage data uncertainty and suggests future research priorities for improved dengue control strategies.²⁹

Research on dengue prediction and surveillance encompasses a diverse range of methodologies, revealing distinct strengths and limitations. Reiner et al. provide a global overview, offering a comprehensive foundation for understanding current and future disease distribution.³¹ In their systematic review, Johansson et al. emphasize the need for standardized approaches in modeling tools for dengue risk mapping, despite potential weaknesses.³²

Machine learning approaches, exemplified by Liu et al., leverage real-time data for accurate predictions, although complexities in model interpretation may pose challenges.³³ Regional predictions, such as those made by Salim et al. in Selangor, Malaysia, are effective at taking into account socioeconomic factors to provide a comprehensive picture of the region. However, they may not be applicable in other locations.³⁴

Yadav et al. integrate weather data into machine learning models for dengue prevalence prediction in India, enhancing models with climatic factors.³⁵ Similarly, Manjunath et al. utilize LSTM neural networks for multivariate time-series forecasting in the Mysore district, Karnataka.³⁶ Chen et al. focus on real-time forecasting at a neighborhood level in tropical urban Singapore, providing actionable insights for localized interventions.³⁷ Zheng et al.’s systematic review explores data-driven methods, emphasizing the integration of real-world and big data.³⁸

In the context of machine learning, decision tree algorithms, as employed by Thein et al., offer transparency and interpretability in early-phase diagnosis and outcome prediction for dengue fever. While DT contributes to a clear understanding of the complex disease process, potential weaknesses may include oversimplification of the intricate dynamics involved.³⁹

In summary, global dengue forecasting, especially in Bangladesh, reveals studies with notable strengths and identified weaknesses. For example, Naher et al.²³ highlighted the need for early warning systems and proposed the ARIMA model for accurate forecasting, yet they do not project potential dengue case surges. Existing studies on dengue prediction use a wide range of machine learning and statistical models, each with unique strengths and challenges. However, limitations such as data quality, feature selection, model generalizability, and early warning accuracy remain areas of active research. These gaps provide opportunities for hybrid and integrative approaches to improve risk prediction and outbreak management.

Methodology

Our research methodology combines both unsupervised and supervised techniques to develop a hybrid model that aims to understand and predict dengue disease patterns in Bangladesh. A pivotal aspect of our contribution is the meticulous structuring of a previously unavailable dataset, which addresses a critical gap in dengue epidemiology research for the Bangladeshi population. This effort not only establishes the foundation for our analyses but also serves as a resource for future studies in this field. Our methodical approach, as illustrated in Figure 1, employs advanced techniques to identify hidden patterns and make predictions, even in the absence of an organized dataset.

Figure 1.

Hybrid model integrating unsupervised and supervised techniques for dengue prediction in Bangladesh, addressing a critical data gap.

Data retrieval and structuring

Our research commences with the acquisition of dengue case information from a reliable source for healthcare data. The data collection involves a meticulous approach, utilizing manual extraction to ensure a comprehensive and representative dataset. After collecting the information, it is systematically organized into a structured format, laying the groundwork for subsequent analyses.

Data cleaning and feature engineering

The integrity of our dataset is paramount, and thus, we undertake a thorough data cleaning and feature engineering process. This process involves addressing missing values, outliers, and inconsistencies. Additionally, special attention is given to formatting dates and numerical values correctly. The outcome of this phase is a feature set that incorporates variables such as year, date, hospital patients, daily cases, daily deaths, and daily recoveries, establishing a robust foundation for our analyses.

Multivariate trend analysis

Our analytical journey includes a comprehensive exploratory data analysis (EDA) to unravel the dataset’s intricacies. Visualizations are employed to reveal temporal trends, geographical distribution, and any discernible patterns within the data. This phase serves as a crucial precursor, providing insights that shape subsequent stages of our research.

Feature scaling

Ensuring that features contribute equitably to our analyses requires normalizing the dataset. Where required, feature scaling and dimension reduction techniques are applied to bring variables to a consistent scale. Min-max feature scaling, as outlined in Algorithm 1, is the normalization technique used in this study. This step is crucial for fostering accurate interpretations of results and facilitating meaningful comparisons between different features. Additionally, dimension reduction methods may be employed to streamline the dataset, enhancing the efficiency of subsequent analyses, particularly when applying clustering algorithms. Incorporating dimension reduction in this way enhances the robustness of our methodology, enabling a more nuanced exploration of dengue disease patterns in the Bangladeshi population.
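As an illustration of the normalization step, min-max scaling maps each feature to the interval [0, 1] using x' = (x − min) / (max − min). The following is a minimal sketch with NumPy, not a reproduction of Algorithm 1; the function name `min_max_scale` and the sample values are assumptions made for illustration only.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against division by zero when a column is constant.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Hypothetical daily records: [patients, deaths, releases]
sample = np.array([[120, 2, 90], [480, 9, 400], [300, 5, 250]])
scaled = min_max_scale(sample)
```

After scaling, every column has minimum 0 and maximum 1, so features with large raw magnitudes (such as cumulative totals) no longer dominate distance-based methods like K-means.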

Cluster optimality assessment

The process of identifying inherent patterns within the dataset takes center stage through cluster analysis. Employing both the elbow method and silhouette analysis, we diligently determine the optimal number of clusters (K), as operationalized in Algorithm 3. Subsequently, the K-means clustering algorithm (as shown in Algorithm 2) is deployed to group similar patterns, fostering a nuanced understanding of the underlying structure of the dengue disease data.
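The selection of K described above can be sketched as follows, assuming scikit-learn is available. This is an illustrative outline rather than the study's Algorithm 2 or 3: the helper `evaluate_k`, the synthetic three-blob data, and the rule of picking the K with the highest silhouette score are assumptions. In practice the inertia curve would also be inspected visually for an elbow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def evaluate_k(X, k_values, seed=0):
    """Return {k: (inertia, silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results

# Synthetic stand-in for the scaled features: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.1, 0.1], [0.5, 0.9], [0.9, 0.2]])
X = np.vstack([c + 0.03 * rng.standard_normal((50, 2)) for c in centers])

scores = evaluate_k(X, range(2, 7))
# Pick the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k][1])
```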

Clustering performance evaluation

A meticulous evaluation of the K-means clustering algorithm was conducted, leveraging multiple validation metrics, including inertia, the Silhouette score, the Calinski-Harabasz index, and the Davies-Bouldin index. The resulting clusters undergo a comprehensive interpretability assessment, providing meaningful insights into the patterns encapsulated within the Dengue dataset.
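The four validation metrics named above are all available in scikit-learn, so a single helper can report them together. The sketch below uses synthetic two-blob data and a hypothetical function name `clustering_report`; it is not the study's evaluation code. Lower inertia and Davies-Bouldin values, and higher silhouette and Calinski-Harabasz values, indicate tighter and better-separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

def clustering_report(X, n_clusters, seed=0):
    """Fit K-means and collect the four validation metrics in one dict."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    labels = model.labels_
    return {
        "inertia": float(model.inertia_),
        "silhouette": float(silhouette_score(X, labels)),
        "calinski_harabasz": float(calinski_harabasz_score(X, labels)),
        "davies_bouldin": float(davies_bouldin_score(X, labels)),
    }

# Synthetic data (assumption): two compact groups in a scaled feature space.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.2, 0.04, size=(80, 2)),
    rng.normal(0.8, 0.04, size=(80, 2)),
])
report = clustering_report(X, n_clusters=2)
```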

Labeling dataset

To enrich our dataset, the K-means cluster labels are added as a new feature. These labels serve as a reference point, providing a clear representation of the clusters obtained through the clustering process.

Supervised learning and evaluation

As no pre-labeled dengue risk dataset exists for Bangladesh, we used unsupervised clustering to generate class labels based on patterns in key features such as total cases, deaths, and releases. This method enabled us to construct a structured, labeled dataset for training predictive models. The supervised learning step was applied to (i) verify the consistency and quality of the derived labels and (ii) develop models capable of predicting future dengue risk levels using new data. This approach lays the foundation for scalable, real-time risk classification and early warning systems in data-scarce public health contexts. The overall procedures are demonstrated in Algorithm 4.

The labeled dataset is divided into training and testing sets before the machine learning classifiers are applied. Classifier performance is then evaluated using metrics such as accuracy, precision, and recall to provide a nuanced understanding of their efficacy.
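The workflow of deriving labels by clustering and then training classifiers on them can be sketched as follows, assuming scikit-learn. This is not the study's Algorithm 4: the synthetic data, the 80/20 stratified split, the default hyperparameters, and macro averaging are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for scaled features (assumption): two groups of days.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.2, 0.05, (100, 3)), rng.normal(0.8, 0.05, (100, 3))])

# Step 1: derive class labels from unsupervised clustering.
y = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: hold out a test set, keeping the label proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 3: train the four classifier families and score each one.
models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LR": LogisticRegression(max_iter=1000),
}
results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Because the labels are derived from the same features the classifiers see, the scores in this setup measure how consistently each classifier reproduces the cluster assignments on held-out records.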

Identification of predictive features

The culminating phase of our methodology involves an in-depth exploration to identify predictive features. This analysis aims to uncover the key factors, including cluster labels, that significantly contribute to the accurate prediction of dengue outbreaks, as shown in Algorithm 5.
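One common way to rank predictive features is the impurity-based importance reported by a random forest. The sketch below is an assumption about how such a ranking could be produced, not the study's Algorithm 5; the feature names are borrowed from the dataset description, while the data and the rule generating the labels are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Total Affected", "Per Day Death", "Per Day Release"]

# Synthetic data (assumption): the label depends on the first feature only.
rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Impurity-based importances sum to 1 across features, so each value can be read as a relative share of the model's splitting power.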

Clinical framework and ethical considerations

To ensure the clinical relevance and ethical integrity of this study, we have incorporated a structured clinical framework into our methodology. Ethical considerations were addressed by exclusively using publicly available, anonymized data from the Directorate General of Health Services (DGHS), eliminating any direct patient involvement or identifiable personal information. The inclusion criteria focused on daily records of confirmed dengue cases reported by government and private hospitals in Bangladesh between 2019 and 2023. Entries were included if they contained complete information on patient counts, deaths, and recoveries. Records with missing or inconsistent values were excluded to maintain data integrity and accuracy. The study population comprises aggregated daily statistics representing a wide demographic range across urban and rural healthcare facilities, ensuring a broad and representative overview of the dengue-affected population in Bangladesh. This framework supports the integrity and applicability of our hybrid modeling approach.
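The inclusion and exclusion rules above translate naturally into a row filter. The following pandas sketch is illustrative only: the column names, the sample rows, and the treatment of negative counts as "inconsistent" are assumptions, since the source does not specify how inconsistency was detected.

```python
import pandas as pd

# Hypothetical column names for the required fields.
REQUIRED = ["patients", "deaths", "recoveries"]

def apply_inclusion_criteria(df):
    """Keep 2019-2023 records with complete, non-negative required fields."""
    in_period = df["year"].between(2019, 2023)
    complete = df[REQUIRED].notna().all(axis=1)
    # Assumption: a negative count is treated as an inconsistent value.
    non_negative = (df[REQUIRED].fillna(0) >= 0).all(axis=1)
    return df[in_period & complete & non_negative].reset_index(drop=True)

# Made-up rows that exercise each rule.
records = pd.DataFrame({
    "year": [2018, 2019, 2021, 2022, 2023],
    "patients": [10, 25, None, 40, 55],
    "deaths": [0, 1, 0, -1, 2],
    "recoveries": [8, 20, 5, 30, 50],
})
kept = apply_inclusion_criteria(records)
```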

Experiments and result analysis

This section presents the experimental setup and analyzes the obtained results. The analysis begins with a detailed overview of the experimental configuration, followed by a thorough examination of the results. The goal of this analysis is to offer insights and interpretations that contribute to the overall objectives of the research.

Dataset collection and preparation

Our experimental setup starts with the acquisition and preparation of dengue case information from the Directorate General of Health Services (DGHS) website, which serves as a reliable source for healthcare data in Bangladesh. The data collection process is systematic and involves meticulous manual extraction techniques. This approach ensures the development of a comprehensive and representative dataset. After gathering the information, it is organized into a structured format, providing a solid foundation for future analyses. The step-by-step process is as follows:

• Visit the DGHS Website¹

• Navigate to Dengue Section

• Identify Relevant Information

• Manually Extract Data:

– For each relevant piece of information:

* Store the contents into a spreadsheet or text document.

* Double-check the accuracy of

• Ensure Data Accuracy

• Organize Data Structure

• Document the Extraction Process

• Repeat as Needed

• Save Data in a Suitable Format

In crafting our dataset, we meticulously organized and structured the features, selecting columns that hold the extracted pertinent information. The chosen features encompass critical details such as ’Year,’ ’Month,’ ’Day,’ ’Patients in Government,’ ’Patients in Private,’ ’Total Death,’ ’Total Release,’ ’Per Day Death,’ ’Per Day Release,’ and ’Total Affected.’ Spanning the years 2019, 2020, 2021, 2022, and 2023, this dataset provides a comprehensive overview of dengue-related information, offering daily insights for each year.

An important aspect to note is the variation in the number of days per year, which indicates the temporal scope of our dataset. Specifically, the dataset comprises 899 records of dengue patients, including 125 days in 2019, 30 days in 2020, 19 days in 2021, 364 days in 2022, and 361 days in 2023. This variability in temporal coverage is crucial for understanding the dynamics of dengue-related factors over distinct periods. The dataset summarized in Table 1 encapsulates the key temporal information, setting the stage for our subsequent exploratory data analysis and predictive modeling endeavors.

Table 1.

Number of days in each month yearwise: temporal overview of dengue-related records.

Month	2019	2020	2021	2022	2023
January	0	30	0	31	29
February	0	0	0	28	27
March	0	0	0	31	31
April	0	0	0	30	30
May	0	0	0	31	31
June	0	0	0	30	30
July	0	0	0	30	31
August	5	0	0	31	31
September	28	0	0	30	29
October	31	0	0	31	31
November	30	0	0	30	30
December	31	0	18	31	31
Number of days	125	30	18	364	361

The presence of ’0’ in certain months (see Table 1) indicates that, according to the available information from the DGHS website, there were no dengue-related records for those particular months in the specified years. This lack of information might be due to various factors, such as the absence of reported cases or the unavailability of data. It is essential to consider such gaps in the dataset when interpreting and analyzing the temporal patterns of dengue occurrences.
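A coverage table of this kind (record days per month and year) can be derived directly from the record dates. The pandas sketch below uses a handful of made-up dates, not the DGHS records, and the column names are assumptions; month and year combinations without records appear as 0.

```python
import pandas as pd

# Made-up record dates for illustration only.
dates = pd.to_datetime([
    "2019-08-27", "2019-08-28", "2019-09-01",
    "2022-01-01", "2022-01-02", "2023-01-01",
])
records = pd.DataFrame({"date": dates})
records["Year"] = records["date"].dt.year
records["Month"] = records["date"].dt.month_name()

# Rows are months, columns are years, cells count the record days.
coverage = pd.crosstab(records["Month"], records["Year"])
days_per_year = coverage.sum(axis=0)
```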

Exploratory analysis of the dataset

Building on the data collection and preparation described in the previous section, we begin with a detailed exploratory study of the dengue case data obtained from the DGHS website. During this analytical stage, we focus on identifying trends and insights within the dataset, which spans the years 2019 through 2023. The key variables are visually represented in Figure 2, which shows patient counts in both the government and private sectors, total deaths, total releases, and the total number of affected individuals. Since every data point is associated with a particular year, it is easier to compare and evaluate trends within these crucial dengue-related data points. This visualization forms the basis for our further analysis and interpretation and facilitates a more detailed understanding of the dataset.

Figure 2.

Patients’ medical records over the years.

Figure 2 offers insightful observations regarding dengue cases from 2019 to 2023. Notably, 2023 stands out as the peak year, marked by the highest counts in key variables, including patients in the government sector, patients in the private sector, total deaths, total releases, and total affected individuals. This trend indicates a significant surge in dengue cases during that period. Additionally, the counts rise overall between 2019 and 2023, indicating a general increase over the specified period. The year 2022 also exhibits heightened figures, particularly in patients admitted to government hospitals, total deaths, and total affected, but these counts are surpassed by those in 2023. Conversely, 2020 and 2021 display relatively lower counts across all variables, indicating fluctuations during these years. Furthermore, the figure highlights a consistent difference in the impact of dengue on patients in government and private hospitals, with the latter consistently showing lower counts. These insights lay the groundwork for a deeper exploration, prompting an examination of the factors contributing to the observed patterns in dengue cases throughout this period.

Continuing our exploration of dengue case trends, the subsequent analysis delves into the temporal dynamics of hospital admissions. The graph presented in Figure 3 illustrates the trends in patients admitted to hospitals, with the horizontal axis representing the course of time in days and the vertical axis indicating the number of patients. Notably, the figure reveals that the number of patients admitted to government hospitals surpasses that in private hospitals over a period of days. This visualization enables us to identify patterns and fluctuations in hospital admissions, providing insights into the nuances of dengue cases within specific healthcare sectors. The analysis aims to unveil temporal correlations and variations in patient counts, providing a valuable perspective on the evolving nature of dengue-related hospitalizations.

Figure 3.

Temporal evolution of hospital admissions for dengue patients over the period.

The analysis, as shown in Figure 4, facilitates a comparison between the daily release and death counts, aiding in the identification of any potential trends or correlations. Analyzing daily release and death statistics through data-driven research reveals interesting patterns across various months. According to Figure 4(a), dengue-related activities or reporting peak during August, October, and September, when daily release counts approach 103,000, 97,011, and 99,000, respectively. November exhibits a similar pattern, with a notable daily release count of 67,282, suggesting the persistence of noteworthy dengue-related incidents. On the other hand, December shows a respectable daily release count of 47,429, even though it does not reach the peaks of the previously stated months.

Figure 4.

Analysis of daily counts for patient releases and deaths.

Examining daily death counts (Figure 4(b)), October, September, and November stand out with the highest counts at 449, 431, and 398, respectively. This observation aligns with the recorded high release counts during these months, suggesting a potential correlation between the severity of dengue cases and the number of deaths. In August and July, despite notable release counts, the daily death counts (356 and 213, respectively) are relatively lower, indicating a potentially lower fatality rate during these months. December, with a notable daily release count, shows a comparatively lower daily death count of 233, suggesting a lower fatality rate relative to the release count. Overall, these findings contribute to a comprehensive understanding of the temporal variations in dengue factors, shedding light on months of heightened activity and potential correlations between release and death counts.

The multivariate trend analysis presented earlier in this section, as shown in Figure 2, clearly indicates that 2023 was the year with the highest dengue outbreak nationwide. A notable pattern becomes apparent when looking at the monthly statistics of patients or dengue cases for this year (as shown in Figure 5). In September, which appears to be a critical month in the dengue trend for 2023, the data reveals the following: 396 individuals succumbed to dengue-related complications in September; the medical system released 77,621 patients in response to dengue cases during the same month; the number of individuals affected by dengue in September reached 76,775. This month, in particular, showcases a significant impact on healthcare resources, with a high number of deaths, releases, and individuals affected. The detailed monthly breakdown, as provided in the data, contributes to a comprehensive understanding of the dengue situation, emphasizing the importance of targeted interventions and resource allocation during peak periods.

Figure 5.

Monthly statistics for the year 2023 depict the peak of Dengue activity.

Feature scaling and dimensionality reduction

In this section, we examine the crucial preprocessing procedures of dimensionality reduction and feature scaling. These techniques play a pivotal role in enhancing the performance and interpretability of our dataset, paving the way for more effective analyses and modeling. Feature scaling ensures that all variables contribute equally to the analysis by bringing them to a standardized scale. At the same time, dimensionality reduction techniques help mitigate the curse of dimensionality and enhance computational efficiency. We explore the methods employed, the rationale behind their application, and the impact on the dataset’s overall quality and utility. This section sets the stage for a streamlined and optimized dataset ready for advanced analytics and modeling.

In Section Methodology, we initiated the preprocessing of our dataset by applying min-max normalization for feature scaling. For the correlation analysis presented in Figure 6, we selected seven key features that were later used as input variables in the supervised learning phase: Patients in Government Facilities, Patients in Private Facilities, Total Death, Total Release, Per Day Death, Per Day Release, and Total Affected. These features were chosen due to their relevance for predictive modeling and their continuous nature, which makes them suitable for Pearson correlation analysis. The remaining features, namely Year, Month, and Day (temporal identifiers) and Cluster Label (used as a target variable), were excluded because they are not suitable for correlation-based visualization.

Figure 6.

Correlation analysis among scaled features of the dataset.

The resulting correlation matrix, visualized in the heatmap presented in Figure 6, unveils insightful patterns within the interrelationships among these scaled features. Examining the heatmap, distinct patterns emerge:

• There is a strong positive correlation (0.885) between ‘Patients in Govt’ and ‘Patients in Private’, indicating that patient numbers in government and private hospitals rise together.

• There is an almost perfect positive correlation (0.999) between ‘Total Death’ and ‘Total Release’, indicating that the total numbers of deaths and releases move almost in lockstep.

• There is a near-perfect positive correlation (0.998) between ‘Total Death’ and ‘Total Affected’, suggesting that the total number of deaths rises alongside the overall number of affected individuals.

• There is a moderate positive correlation (0.635) between ‘Patients in Govt’ and ‘Total Affected’, implying a moderate association between the number of patients in government hospitals and the overall number of affected individuals.

These patterns clarify how the key features relate to one another and inform the subsequent analyses and modeling. In particular, the near-perfect correlations of ‘Total Death’ with ‘Total Release’ and ‘Total Affected’ indicate that these features carry largely overlapping information.
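
The Pearson coefficient behind each heatmap cell can be sketched as follows. This is a minimal illustration using only the Python standard library; the two series are hypothetical values and are not taken from the DGHS dataset. It also shows that min-max scaling, being a positive linear rescaling, leaves the Pearson coefficient unchanged.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def min_max(values):
    """Rescale values to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical series for illustration only (not DGHS records).
total_affected = [100, 250, 480, 900, 1500]
total_death = [1, 2, 5, 9, 14]

r_raw = pearson(total_affected, total_death)
r_scaled = pearson(min_max(total_affected), min_max(total_death))
print(round(r_raw, 4), round(r_scaled, 4))
```

Because the coefficient is invariant to this rescaling, the values in the heatmap would be the same whether computed on raw or min-max-scaled features.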

After feature scaling, we proceeded with dimensionality reduction using Principal Component Analysis (PCA). Consider a dataset $X$ with $m$ samples and $n$ features ($X \in \mathbb{R}^{m \times n}$). The steps for PCA are outlined in Algorithm 6. The resulting matrix $X_{pca}$ contains the projection of the original dataset onto its principal components. We also visualized the data points after PCA in two dimensions. The distribution of data points in the figure suggests the presence of 2 or 3 clusters, which provides an initial estimate of the number of clusters to consider when applying the k-means algorithm.
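
The PCA computation can be summarized compactly in its standard textbook form. This is a sketch rather than a restatement of Algorithm 6, whose notation may differ; the symbol $d$ for the number of retained components is an assumption introduced here ($d = 2$ for a two-dimensional visualization).

$$
\begin{aligned}
\bar{x} &= \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad X_c = X - \mathbf{1}_m \bar{x}^{\top} \\
C &= \frac{1}{m-1}\, X_c^{\top} X_c \in \mathbb{R}^{n \times n} \\
C\, w_j &= \lambda_j w_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0 \\
X_{pca} &= X_c W_d, \qquad W_d = [\, w_1, \dots, w_d \,] \in \mathbb{R}^{n \times d}
\end{aligned}
$$

Here $x_i$ is the $i$-th row of $X$ written as a column vector, and the fraction of variance retained by the first $d$ components is $\sum_{j=1}^{d} \lambda_j \big/ \sum_{j=1}^{n} \lambda_j$.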

Clustering analysis and evaluation

In this section, we describe the clustering analysis, which builds on the feature scaling and dimensionality reduction described above. We employed K-means clustering to uncover hidden patterns in dengue-related cases across multiple years. The goal was to identify natural groupings in the data, such as periods of high and low outbreak intensity, based on clinical and epidemiological features. K-means was selected for its simplicity, scalability, and effectiveness at separating compact, spherical clusters in normalized datasets.⁴⁰ We then assess the performance of the clustering and label the dataset by adding the cluster label as a new feature, which enhances interpretability.

We employed the elbow method to determine the optimal number of clusters (k) before applying the k-means clustering algorithm. The elbow method involves running k-means clustering on the dataset for a range of values of k and then plotting the sum of squared distances from each point to its assigned center against k. The resulting curve typically resembles an arm, and the “elbow” of the curve indicates the optimal number of clusters, where the rate of decrease in the sum of squared distances slows down.

Figure 7 shows the elbow curve obtained when the k-means clustering algorithm was applied to our dataset. The x-axis represents k values from 1 to 11, and the y-axis shows the Within-Cluster Sum of Squares (WCSS), also called the distortion score. WCSS measures the compactness of clusters as the sum of squared distances between each data point and its assigned cluster center. WCSS generally decreases as k increases, because the data are split into smaller, more compact groups, so the point of interest is where the decrease levels off. In the figure, the elbow occurs around k = 2, with an associated WCSS value of 76.59, suggesting that two clusters are appropriate for the dengue dataset. This result guides the subsequent application of k-means clustering.

Figure 7.

Elbow method illustrating the optimal number of clusters for the Dengue cases dataset using k-means clustering.
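
The elbow computation can be sketched as follows. This is a minimal pure-Python illustration on hypothetical two-dimensional points, not the DGHS data; the deterministic farthest-first seeding is an assumption made so that the example is reproducible and is not necessarily the initialization used in the study.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_wcss(points, k, iters=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean_point(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical data: two compact, well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 1.0), (0.9, 1.0), (1.0, 0.9), (0.9, 0.9)]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 5)}
for k, value in wcss.items():
    print(k, round(value, 4))
```

On this toy data the WCSS falls sharply from k = 1 to k = 2 and only marginally afterwards, which is the pattern the elbow method looks for.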

Furthermore, we employed the Silhouette Score Method to validate the optimal selection of k values in the K-means clustering algorithm, as illustrated in Figure 8. We selected a range of k values from 2 to 11 and observed that the highest silhouette score, 0.763, was achieved at k = 2, despite some fluctuation in scores for different k values.

Figure 8.

Silhouette score method illustrating the optimal number of clusters using k-means clustering.
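
The silhouette score itself can be sketched directly from its definition. The example below is a minimal standard-library illustration on hypothetical points and labels (not the study's data), comparing a sensible two-cluster labeling with an arbitrary one.

```python
from math import dist

def mean_distance(p, group):
    """Mean Euclidean distance from point p to the points in group."""
    return sum(dist(p, q) for q in group) / len(group)

def silhouette_score(points, labels):
    """Mean of s(i) = (b - a) / max(a, b); requires at least two clusters.

    a: mean distance from point i to the other members of its own cluster.
    b: smallest mean distance from point i to the members of another cluster.
    """
    cluster_ids = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: s(i) is defined as 0
            scores.append(0.0)
            continue
        a = mean_distance(p, own)
        b = min(
            mean_distance(p, [q for q, lab in zip(points, labels) if lab == c])
            for c in cluster_ids if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical data: two compact, well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 1.0), (0.9, 1.0), (1.0, 0.9), (0.9, 0.9)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]   # matches the visible groups
mixed_labels = [0, 1, 0, 1, 0, 1, 0, 1]  # ignores the visible groups

good = silhouette_score(points, good_labels)
mixed = silhouette_score(points, mixed_labels)
print(round(good, 3), round(mixed, 3))
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 indicate overlapping or misassigned points.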

After determining the optimal number of clusters, we performed experiments using the final preprocessed dataset, as shown in Figure 9. Cluster 1 contains 738 data points, while Cluster 2 has 161. In Figure 9(a), the clustered data is shown in a two-dimensional space using Principal Component Analysis (PCA), and Figure 9(b) shows the decision boundaries in the same 2D space. We used PCA to reduce the number of features and make the data easier to visualize while retaining most of the important information. The first principal component (PC1) is mostly influenced by ‘Total Affected’, ‘Total Death’, and ‘Total Release’, which represent the overall severity of the outbreak. The second principal component (PC2) is more affected by ‘Per Day Death’ and ‘Per Day Release’, which show how the situation changes daily. This PCA projection helps us see how the two clusters are separated. Cluster 1 appears more compact, whereas Cluster 2 is more spread out, indicating that it has greater variation within the group.

Figure 9.

Visualization of clustered data points and decision boundaries after determining the optimal number of clusters.

Further insights were drawn from the clustering results: the strong correlation within Cluster 1 suggests a more cohesive grouping of data points, indicating a potentially homogeneous subgroup within the dataset. On the other hand, the scattered nature of data points in Cluster 2 suggests a higher degree of variability or heterogeneity within this cluster. These findings provide valuable information for interpreting the characteristics and patterns within each cluster, facilitating a deeper understanding of the underlying structures in the data.

We conducted a comprehensive evaluation of the k-means clustering algorithm, employing various performance metrics. The results, presented in Table 2, provide insights into the effectiveness of the k-means clustering in revealing meaningful patterns within the dataset. The low inertia value (69.992) indicates that the clusters are tightly packed, suggesting good cohesion within the clusters. A high silhouette score (0.774) signifies well-separated clusters, indicating the algorithm’s effectiveness in creating distinct and meaningful groupings. The Calinski-Harabasz Index, with a value of 2434.746, further supports the notion that the clusters are well-defined and exhibit a clear structure. Additionally, the low Davies-Bouldin index (0.545) implies that the clusters are compact and well-separated. Collectively, these performance metrics affirm that the k-means clustering algorithm has successfully captured meaningful patterns within the given dataset.

Table 2.

Performance metrics for K-means clustering.

Metric	Value
Inertia	69.992
Silhouette score	0.774
Calinski-Harabasz index	2434.746
Davies-Bouldin index	0.545
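
For reference, the standard definitions of the three centroid-based metrics in Table 2 are given below. The notation is an assumption introduced here: $C_j$ is the set of points in cluster $j$, $c_j$ its centroid, $c$ the centroid of the whole dataset, $k$ the number of clusters, and $m$ the number of samples.

$$
\begin{aligned}
W &= \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2 && \text{(inertia, i.e., WCSS)} \\
\mathrm{CH} &= \frac{B / (k-1)}{W / (m-k)}, \qquad B = \sum_{j=1}^{k} \lvert C_j \rvert \, \lVert c_j - c \rVert^2 && \text{(Calinski-Harabasz index)} \\
\mathrm{DB} &= \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \frac{s_i + s_j}{\lVert c_i - c_j \rVert}, \qquad s_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} \lVert x - c_i \rVert && \text{(Davies-Bouldin index)}
\end{aligned}
$$

Lower $W$ and $\mathrm{DB}$ and higher $\mathrm{CH}$ indicate more compact and better separated clusters. Note that $W$ depends on the scale and size of the dataset, so it is most informative when compared across values of $k$ on the same data.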

As mentioned earlier in this section, we enhanced the dataset by adding cluster labels to make it suitable for supervised learning. After careful observation, we identified data points in Cluster 0 as representing low-risk dengue trends and Cluster 1 as indicating high-risk trends. The presence of these dengue risk patterns is visualized across the entire dataset in Figure 10. In this figure, the vertical axis represents the days of each month (1 to 31), and the horizontal axis spans 5 years, encompassing 60 months. Blank or white cells indicate dates when no medical records were available from the DGHS web source. A closer look at the figure reveals that the period from the last week of July to November in 2023 was the peak season for high dengue risk compared to other years. This detailed visualization provides valuable insight into the temporal distribution of dengue risk across the dataset.

Figure 10.

Visualizing dengue risk trends over 5 years, highlighting peak high-risk and low-risk seasons.

It is important to mention that one of our objectives was to analyze dengue risk patterns across months over multiple years rather than to detect intra-annual seasonal trends within a single year. Therefore, we applied min-max normalization (as shown in Algorithm 1) across the entire dataset to preserve the relative magnitudes between years. This scaling enabled us to compare risk levels across the same months in different years (e.g., August 2022 and August 2023), which would not have been possible with year-wise normalization.
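
The difference between global and year-wise min-max normalization can be illustrated with a small sketch. The monthly counts below are hypothetical and chosen only to make the contrast visible; the function follows the usual min-max formula and is not a transcription of Algorithm 1.

```python
def min_max(values, lo=None, hi=None):
    """Rescale values to [0, 1] using the given bounds (defaults: own min and max)."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo

# Hypothetical monthly case counts for two years (illustration only).
cases = {2022: [10, 40, 100], 2023: [50, 200, 500]}

# Global normalization: one minimum and maximum shared by all years.
all_values = [v for values in cases.values() for v in values]
g_lo, g_hi = min(all_values), max(all_values)
global_scaled = {year: min_max(values, g_lo, g_hi) for year, values in cases.items()}

# Year-wise normalization: each year rescaled to its own range.
year_scaled = {year: min_max(values) for year, values in cases.items()}

print(global_scaled[2022][-1], global_scaled[2023][-1])  # peaks remain comparable
print(year_scaled[2022][-1], year_scaled[2023][-1])      # both peaks become 1.0
```

With global scaling, the smaller 2022 peak stays well below the 2023 peak, so magnitudes remain comparable across years; year-wise scaling maps every year's peak to 1.0 and discards that information.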

Identifying features through supervised learning

In this section, we first evaluate our dataset using supervised machine learning approaches and then identify the most significant features in the dataset. We also demonstrate the effectiveness of combining clustering and classification in a hybrid approach.

The dataset comprises 11 features: (1) Year, (2) Month, (3) Day, (4) Patients in Government Facilities, (5) Patients in Private Facilities, (6) Total Death, (7) Total Release, (8) Per Day Death, (9) Per Day Release, (10) Total Affected, and (11) Cluster Label (generated during clustering). We excluded the three temporal features (Year, Month, and Day) from the supervised learning phase, as they hold no predictive value for classification purposes, leaving seven input features and Cluster Label as the target. We used six traditional classifiers: Naive Bayes (NB), DT, RF, Support Vector Machine (SVM), KNN, and LR. These classifiers were chosen because they work well and are recommended for datasets with few samples.⁴¹ The filtered dataset was used to train the models under two evaluation schemes: 5-fold cross-validation and a random 80:20 train-test split.
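
The two evaluation schemes can be sketched in terms of index splitting. This is a minimal standard-library illustration, not the authors' implementation; the function names and the seed are assumptions, and the record count of 899 is taken as the sum of the two cluster sizes reported in the text (738 + 161).

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out a fraction of them for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

n_records = 738 + 161  # cluster sizes reported in the text
train_idx, test_idx = train_test_split_indices(n_records)
folds = list(k_fold_indices(n_records, k=5))

print(len(train_idx), len(test_idx))
print([len(validation) for _, validation in folds])
```

Under these assumptions the 80:20 split holds out 180 records, and each cross-validation fold validates on 179 or 180 records, so every record is used for validation exactly once.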

The supervised learning results for the train-test split, shown in Table 3, indicate high performance across all classifiers. The DT, RF, KNN, and LR models achieved accuracy, precision, recall, and F1 scores all exceeding 0.99. NB and SVM also performed well, with all four scores above 0.98. These results indicate that the chosen classifiers learned patterns that generalize to the held-out test set for the given features. The confusion matrix for the RF classifier under the train-test split is shown in Figure 11; it gives a per-class breakdown of correct and incorrect predictions.

Table 3.

Supervised learning results.

Classifier	Accuracy	Precision	Recall	F1 score
NB	0.9833	0.9848	0.9833	0.9836
DT	0.9944	0.9946	0.9944	0.9945
RF	0.9944	0.9946	0.9944	0.9945
SVM	0.9833	0.9832	0.9833	0.9832
KNN	0.9944	0.9946	0.9944	0.9945
LR	0.9944	0.9946	0.9944	0.9945

Figure 11.

Confusion matrix for RF classifier (train-test split).
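
The relationship between a confusion matrix and the four reported metrics can be sketched as follows. The matrix below is hypothetical (one misclassification in 180 test samples) and is not the matrix shown in Figure 11; support-weighted averaging over classes is an assumption, since the averaging scheme is not stated in the text.

```python
def weighted_scores(cm):
    """Return accuracy and support-weighted precision, recall, and F1.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        tp = cm[c][c]
        support = sum(cm[c])                   # true samples of class c
        predicted = sum(row[c] for row in cm)  # samples predicted as class c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1

# Hypothetical confusion matrix: rows are true classes (low risk, high risk).
cm = [[147, 1],
      [0, 32]]

accuracy, precision, recall, f1 = weighted_scores(cm)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

With support-weighted averaging, recall always equals accuracy, which would be consistent with the identical Accuracy and Recall columns in Table 3. If the test set contains about 180 records (20% of 899), an accuracy of 0.9944 would correspond to a single misclassification (179/180) and 0.9833 to three (177/180); this is an inference from the reported figures rather than something stated in the source.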

The K-fold cross-validation results (Table 4) further confirm the robust performance of the classifiers. The DT, RF, KNN, and LR models consistently achieve average accuracy, precision, recall, and F1 scores above 0.99. NB and SVM also maintain strong performance, with average scores above 0.98. These outcomes indicate that the classifiers generalize well to the data held out in each fold, so the results do not depend on one particular split. The high precision and recall show that both false positives and false negatives are rare. Overall, these results support the robustness and generalization capability of the selected classifiers on this dataset.

Table 4.

K-fold cross-validation results.

Classifier	Average accuracy	Average precision	Average recall	Average F1 score
NB	0.9878	0.9887	0.9878	0.9880
DT	0.9978	0.9978	0.9967	0.9978
RF	0.9978	0.9989	0.9989	0.9978
SVM	0.9933	0.9933	0.9933	0.9933
KNN	0.9978	0.9978	0.9978	0.9978
LR	0.9956	0.9957	0.9956	0.9955

Our final analysis aims to identify predictive features within the dataset. For this purpose, we used the feature importance scores obtained from a tree-based model, specifically RF. Figure 12 shows the features ranked by their contribution to the model’s predictions. The importance scores are as follows: ‘Total affected’ with 0.318, ‘Total death’ with 0.258, ‘Total release’ with 0.208, ‘Per day release’ with 0.098, ‘Patients in government facilities’ with 0.085, ‘Per day death’ with 0.033, and ‘Patients in private facilities’ with a minimal contribution of 0.0004. These scores indicate the relative importance of each feature in predicting the target variable, that is, the cluster-derived risk label.

Figure 12.

Feature importance.
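
The reported importance scores can be summarized with a short calculation. The sketch below uses only the values quoted in the text; the cumulative-share computation is an illustrative addition and not part of the original analysis.

```python
# Importance scores as reported in the text for the RF model.
importance = {
    "Total affected": 0.318,
    "Total death": 0.258,
    "Total release": 0.208,
    "Per day release": 0.098,
    "Patients in government facilities": 0.085,
    "Per day death": 0.033,
    "Patients in private facilities": 0.0004,
}

ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
total = sum(importance.values())

# Cumulative share of total importance, from most to least important feature.
cumulative = []
running = 0.0
for name, score in ranked:
    running += score
    cumulative.append((name, running / total))

for name, share in cumulative:
    print(f"{name}: {share:.3f}")
```

The scores sum to approximately 1 (1.0004 with rounding), and the three cumulative totals together account for roughly 78% of the total importance.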

Discussion

The integration of clustering and classification techniques in our research has proven to be a highly effective and synergistic approach, bringing about a more profound understanding of the intricate patterns within the Dengue cases dataset. The utilization of clustering, specifically employing the k-means algorithm, played a crucial role in uncovering inherent structures and groupings within the data. The identified clusters not only served as distinct subgroups but also laid the groundwork for subsequent analyses, providing a foundation for a more nuanced exploration of dengue trends.

The optimal number of clusters, determined through the elbow method and silhouette score, was pivotal in ensuring that the clusters represented meaningful and well-defined structures. Visualizing the clustered data points revealed clear distinctions between different patterns, allowing for the identification of cohesive groups within the dataset. This initial clustering phase, therefore, sets the stage for a more focused and targeted analysis. Since no labeled dengue risk dataset exists in Bangladesh, clustering provided a data-driven way to define meaningful risk categories. These cluster-derived labels were then used to train predictive models, enabling future classification of dengue risk based on new, unlabeled data.

Subsequently, the incorporation of supervised learning classifiers, such as DT, RF, KNN, and LR, into the clustered dataset validated and extended the patterns identified during the clustering phase. The high accuracy, precision, recall, and F1 scores obtained from these classifiers underscored their effectiveness in accurately predicting dengue risk trends within the identified clusters. This two-step process ensured a robust and comprehensive analysis, leveraging the strengths of both clustering and classification methodologies.

Moreover, the feature importance analysis derived from the RF classifier shed light on the critical variables influencing dengue risk. This detailed understanding of feature importance not only validated the relevance of key variables but also provided actionable insights for prioritizing factors that significantly contribute to the occurrence and severity of dengue cases.

While this hybrid approach offers valuable insights, it has some limitations. First, the labels generated through clustering reflect internal patterns within the data but lack external validation. As a result, classification performance reflects internal consistency with the cluster assignments rather than accuracy against an external ground truth. Second, we applied global normalization across the dataset rather than normalizing within each year, which may cause high-incidence years, such as 2023, to disproportionately influence the clustering outcome. Future work could explore year-wise normalization and external validation using epidemiological ground truth to enhance the robustness of the findings.

In essence, the hybrid approach combining clustering and classification techniques has not only improved predictive accuracy but also provided a holistic and interpretable view of dengue-related data. This integrated methodology offers a scalable framework for future research, public health planning, and targeted interventions to mitigate the impact of dengue outbreaks. Unlike previous studies that relied on predefined thresholds or univariate analysis, our approach leverages unsupervised and supervised learning to dynamically uncover risk patterns using multivariate features.

Implications of the research

The implications of this research for dengue disease pattern analysis, based on the analyzed dataset, are multifaceted and carry significant relevance for public health, epidemiology, and healthcare planning. The key implications include:

(1) Early Detection and Intervention: The clustering and classification approaches employed in this research enable the identification of distinct patterns and trends in dengue cases. Early detection of these patterns can be crucial for public health authorities to implement timely interventions. By recognizing high-risk clusters early, authorities can deploy resources, conduct targeted awareness campaigns, and implement preventive measures to curb the spread of dengue.

(2) Resource Allocation and Capacity Planning: Understanding the temporal and spatial dynamics of dengue cases, as revealed through the analysis, allows for more effective resource allocation and capacity planning. Health facilities in regions identified as high-risk clusters can be better prepared to handle potential surges in dengue-related admissions, ensuring adequate medical resources, staffing, and treatment facilities are in place.

(3) Targeted Public Health Campaigns: The detailed analysis of dengue risk trends over time enables the design of targeted public health campaigns. Authorities can focus on specific months or seasons identified as high-risk periods to educate the public about preventive measures, symptoms, and the importance of seeking medical attention promptly. This targeted approach can enhance public awareness and adherence to preventive practices.

(4) Identification of Contributing Factors: The feature importance analysis provides insights into the variables that significantly contribute to dengue risk. This information is instrumental in identifying the contributing factors and understanding their impact on disease transmission. Such knowledge can inform public health policies and interventions, allowing for the development of strategies that address the specific factors influencing dengue outbreaks.

(5) Adaptive Strategies for Different Regions: Clustering analysis allows for the identification of distinct patterns in different regions. This regional variation can inform the development of adaptive strategies tailored to the specific characteristics of each cluster. Authorities can implement region-specific interventions, taking into account local environmental, socioeconomic, and healthcare factors that contribute to dengue transmission.

(6) Data-Driven Decision Making: The comprehensive analysis presented in this research provides a data-driven foundation for decision-making. Public health officials can rely on the identified patterns and trends to make informed decisions about resource allocation, intervention strategies, and healthcare planning. This shift towards evidence-based decision-making enhances the effectiveness of public health initiatives.

(7) Research and Policy Impact: The research findings contribute to a broader understanding of dengue disease patterns and can influence policy formulation. The identified patterns and factors can serve as valuable inputs for future research studies, guiding researchers and policymakers in developing strategies to mitigate the impact of dengue outbreaks.

In summary, the significance of this research lies in its capacity to revolutionize the prevention and monitoring of dengue. By employing a hybrid methodology that seamlessly integrates clustering and classification techniques, the study offers a proactive approach to disease management. The accurate prediction of dengue outbreaks, identification of key risk factors, and establishment of adaptable strategies empower public health authorities with valuable tools for timely interventions. This research not only enhances the precision of dengue surveillance but also provides a versatile framework applicable to diverse geographical regions. Ultimately, the findings contribute to a more effective and data-driven approach to global efforts aimed at preventing and monitoring dengue outbreaks.

Conclusion

This study provides a comprehensive analysis of dengue disease patterns in Bangladesh by employing a hybrid approach that combines unsupervised and supervised techniques. We utilized clustering to identify risk classes from unlabeled data, followed by classification to validate and predict risk levels (low risk vs high risk), offering a hybrid method for risk prediction in data-scarce public health settings. The study also demonstrates the conversion of unlabeled records into a structured and labeled dataset for dengue risk prognosis. The experimental results reveal two optimal clusters or class labels characterized by low inertia and high silhouette scores. Moreover, the supervised learning models—including DT, RF, KNN, and LR—consistently perform well, surpassing 0.99 in accuracy, precision, recall, and F1 scores. The results also highlight the peak season for dengue, emphasizing the months of August to October. Ultimately, this research enhances our understanding of dengue dynamics and provides a reliable tool for disease surveillance and control. Future work will incorporate environmental factors and patient medical records to improve dengue risk prediction.

Footnotes

Acknowledgements

We would like to express our gratitude to the Department of Computer Science and Engineering at Jagannath University for providing us with the necessary laboratory facilities to conduct the research presented in this article.

ORCID iD

Tanvir Ahammad

Ethical considerations

Not applicable. This study utilized publicly available, de-identified data from the Directorate General of Health Services (DGHS), Bangladesh, which did not require ethical approval. The data did not include any personal patient information such as names or addresses.

Author contributions

Tanvir Ahammad contributed to the conceptualization of the study, data analysis, and writing of the manuscript. Apurbo Kormokar assisted with the implementation of the machine learning models, data analysis, and validation of results. Sabina Yasmin contributed to data collection, preprocessing, and visualization of the results. Dr. Selina Sharmin provided critical feedback and contributed to manuscript editing.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets used and/or analyzed during the current study will be made available upon request.

References

1. Murray, Quam, Wilder-Smith. Epidemiology of dengue: past, present and future prospects. Clin Epidemiol 2013; 5: 299–309.

2. Guzmán, Halstead, Artsob, et al. Dengue: a continuing global threat. Nat Rev Microbiol 2010; 8: S7–S16.

3. Kamara, Zhou, et al. Urbanization increases Aedes albopictus larval habitats and accelerates mosquito development and survivorship. PLoS Neglected Trop Dis 2014; 8: e3301.

4. Sabchareon, Sirivichayakul, Limkittikul, et al. Dengue infection in children in Ratchaburi, Thailand: a cohort study. I. Epidemiology of symptomatic acute dengue infection in children, 2006–2009. PLoS Neglected Trop Dis 2012; 6: e1732.

5. Carrington, Simmons. Human to mosquito transmission of dengue viruses. Front Immunol 2014; 5: 290.

6. Sheridan, et al. Interaction of climate and socio-ecological environment drives the dengue outbreak in epidemic region of China. PLoS Neglected Trop Dis 2021; 15: e0009761.

7. Zhang, Ibaraki, Schwartz. Disease surveillance using online news: an extended study of dengue fever in India. Trop Med Health 2019; 47: 58.

8. Basurko, Matheus, Hilderal, et al. Estimating the risk of vertical transmission of dengue: a prospective study. Am J Trop Med Hyg 2018; 98: 1826–1832.

9. Nejad, Varathan. Identification of significant climatic risk factors and machine learning models in dengue outbreak prediction. Research Square 2019. doi:10.21203/rs.2.15755/v1.

10. Laura, Supriatna, Khumaeroh, et al. Biological and mechanical transmission models of dengue fever. Communication in Biomathematical Sciences 2019; 2: 12.

11. Effler, Pang, Kitsutani, et al. Dengue fever, Hawaii, 2001–2002. Emerg Infect Dis 2005; 11: 742–749.

12. Bhatt, Gething, Brady, et al. The global distribution and burden of dengue. Nature 2013; 496: 504–507.

13. Pok, Lai, Sng, et al. Evaluation of nonstructural 1 antigen assays for the diagnosis and surveillance of dengue in Singapore. Vector Borne Zoonotic Dis 2010; 10: 1009–1016.

14. Chang, Chen, Tien, et al. Epidemiological, clinical and climatic characteristics of dengue fever in Kaohsiung city, Taiwan with implication for prevention and control. PLoS One 2018; 13: e0190637.

15. Pan, Liu, et al. Epidemiological analysis of the Kaohsiung city strategy for dengue fever quarantine and epidemic prevention. BMC Infect Dis 2020; 20: 347.

16. Gubler. Dengue and dengue hemorrhagic fever. Clin Microbiol Rev 1998; 11: 480–496.

17. Lourenço, Tennant, Faria, et al. Challenges in dengue research: a computational perspective. Evol Appl 2017; 11(4): 516–533.

18. Andraud, Hens, Marais, et al. Dynamic epidemiological models for dengue transmission: a systematic review of structural approaches. PLoS One 2012; 7(11): e49085.

19. Rotela, Lopez, Céspedes, et al. Analytical report of the 2016 dengue outbreak in Córdoba city, Argentina. Geospatial Health 2017; 12: 564.

20. Chau TNB, Quyen NTH, Thuy, et al. Dengue in Vietnamese infants—Results of infection-enhancement assays correlate with age-related disease epidemiology, and cellular immune responses correlate with disease severity. J Infect Dis 2008; 198(4): 516–524.

21. Domínguez-de-la Cruz, Muñoz, Hernández-García, et al. Dataset on the epidemiology and genetic diversification of dengue virus (DENV) serotypes and genotypes in Mexico. Data Brief 2020; 32: 106077.

22. Bitetto, Cerchiello, Mertzanis. A data-driven approach to measuring epidemiological susceptibility risk around the world. Sci Rep 2021; 11(1): 24037.

23. Naher, Rabbi, Hossain, et al. Forecasting the incidence of dengue in Bangladesh—application of time series model. Health Sci Rep 2022; 5(4): e666.

24. Khan MAR, Akter, Ahammad, et al. Dengue outbreaks prediction in Bangladesh perspective using distinct multilayer perceptron NN and decision tree. Health Inf Sci Syst 2022; 10(1): 32.

25. Muhamad Krishnan, Zukarnain, Ahmad, et al. Predicting dengue outbreak based on meteorological data using artificial neural network and decision tree models. JOIV: Int J Inform Visualization 2022; 6(3): 597–603. https://www.joiv.org/index.php/joiv/article/view/600

26. Rana, Boruah, Biswas, et al. Dengue fever prediction using machine learning analytics. In: 2022 international conference on machine learning, big data, cloud and parallel computing (COM-IT-CON), Faridabad, 26–27 May 2022, vol 1, pp. 126–130.

27. Chakraborty, Chattopadhyay, Ghosh. Forecasting dengue epidemics using a hybrid methodology. Phys Stat Mech Appl 2019; 527: 121266. https://www.sciencedirect.com/science/article/pii/S0378437119307997

28. Lowe, Barcellos, Coelho, et al. Dengue outlook for the World Cup in Brazil: an early warning model framework driven by real-time seasonal climate forecasts. Lancet Infect Dis 2014; 14(7): 619–626. https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(14)70781-9/fulltext

29. Hoyos, Aguilar, Toro. Dengue models based on machine learning techniques: a systematic literature review. Artif Intell Med 2021; 119: 102157. https://www.sciencedirect.com/science/article/pii/S0933365720302429

30. Caicedo-Torres, Paternina, Pinzón. Machine learning models for early dengue severity prediction. In: Advances in Artificial Intelligence-IBERAMIA 2016: 15th IBERO-American conference on AI, San José, Costa Rica, 23–25 November 2016. Springer, pp. 247–258.

31. Reiner, Stoddard, Scott. The current and future global distribution and population at risk of dengue. Nat Microbiol 2019; 4(9): 1508–1515.

32. Johansson, Dominici, Glass. Modeling tools for dengue risk mapping - a systematic review. Int J Health Geogr 2014; 13(1): 50.

33. Liu, Zhou, et al. Developing a dengue forecast model using machine learning: a case study in China. PLoS Neglected Trop Dis 2017; 11(10): e0005973.

34. Salim NAM, Wah, Reeves, et al. Prediction of dengue outbreak in Selangor, Malaysia using machine learning techniques. Sci Rep 2021; 11(1): 939. doi:10.1038/s41598-020-79193-2.

35. Yadav, Kumar, Pal. Weather integrated multiple machine learning models for prediction of dengue prevalence in India. Int J Biometeorol 2022; 67(2): 285–297.

36. Manjunath, Anupama, Kumar, et al. Prediction of dengue incidents using hospitalized patients, meteorological and socio-economic data in Karnataka, India: a machine learning approach. Indian Journal of Public Health Research & Development 2022; 13(4): 284–289.

37. Chen, Ong JHY, Rajarethinam, et al. Neighbourhood level real-time forecasting of dengue cases in tropical urban Singapore. BMC Med 2018; 16(1). doi:10.1186/s12916-018-1108-5.

38. Zheng. Data-driven methods for dengue prediction and surveillance using real-world and big data: a systematic review. PLoS Neglected Trop Dis 2022; 16(1): e0010056.

39. Thein, Gan, Lye, et al. Decision tree algorithms predict the diagnosis and outcome of dengue fever in the early phase of illness. PLoS Neglected Trop Dis 2008; 2(3): e196.

40. Ahmed, Seraj, Islam SMS. The k-means algorithm: a comprehensive survey and performance evaluation. Electronics 2020; 9(8): 1295.

41. Tuomo, Suutala, Röning, et al. Better classifier calibration for small datasets. ACM Trans Knowl Discov Data 2020; 14(3): 1–19.