A Machine Learning Approach to Predicting Household Smoke Exposure Risk in Somalia: An Analysis With SHAP Explanations

Abstract

Introduction:

Household air pollution (HAP) from solid fuel combustion is a major global public health issue with a particularly high burden in sub-Saharan Africa. In Somalia, the extent and predictors of household smoke exposure risk (SER) remain underexplored due to data scarcity and analytical limitations. This study applies machine learning (ML) models to identify and predict SER in Somali households using the first Somalia Demographic and Health Survey (SDHS) and interpretable artificial intelligence (AI) techniques.

Methods:

A nationally representative sample of 15 838 households from the 2020 SDHS was analyzed using multivariate logistic regression. The SER was defined based on the cooking fuel type and location. Six supervised ML models (Logistic Regression, K-Nearest Neighbors, Decision Tree, Support Vector Machine, Random Forest, Gradient Boosting) were trained using an 80/20 train-test split. Performance was evaluated using accuracy, precision, recall, F1-score, and AUROC. Feature importance was assessed using Gini, permutation, and SHAP (SHapley Additive exPlanations) values.

Results:

The prevalence of household smoke exposure risk was 70.0%. Place of residence, region, and wealth were the dominant predictors. Gradient Boosting outperformed other models (AUC = 81% [95% CI: 79.11%-82.17%], F1 = 81%), followed by Random Forest. SHAP analysis confirmed that geographic and socioeconomic factors were the most impactful features. Notably, higher exposure was paradoxically associated with urban residence and higher wealth, diverging from the traditional patterns observed in similar settings.

Conclusion:

This is the first national study to apply machine learning to predict SER in Somalia, revealing urban and wealth-linked vulnerabilities that challenge conventional assumptions about poverty. These findings highlight the need for targeted clean cooking interventions in urban and peri-urban communities, alongside data innovations for real-time monitoring. ML-informed risk stratification may support more effective and equitable health policies in fragile states.

Keywords

household air pollution, machine learning, Somalia, SHAP, smoke exposure risk, health equity

Introduction

Household air pollution (HAP) from the domestic combustion of solid fuels is a critical global public health challenge, responsible for an estimated 3.8 million premature deaths annually.¹ The health burden is immense, contributing significantly to non-communicable diseases such as stroke, heart disease, chronic obstructive pulmonary disease (COPD), and lung cancer, as well as acute respiratory infections like pneumonia.² The toxic constituents of kitchen smoke, including particulate matter, carbon monoxide, and various volatile hydrocarbons, are directly linked to a range of respiratory symptoms and chronic illnesses.^2,3 Beyond serving as an environmental exposure, household air pollution represents a major pathway through which indoor environmental conditions shape population health, particularly among women and children who spend longer periods in cooking environments.¹

The burden of HAP is disproportionately borne by low- and middle-income countries, particularly in sub-Saharan Africa, where there is a heavy reliance on solid biomass fuels for cooking.⁴ Studies across the region have consistently linked HAP to adverse health outcomes, including increased under-five mortality, adverse pregnancy outcomes, and a significant share of the overall disease burden.^5-7 In East Africa, HAP remains a primary driver of mortality and disability, underscoring the urgent need for effective interventions and targeted public health strategies.^7,8 To better quantify and analyze this multifaceted risk, recent research has utilized Demographic and Health Survey (DHS) data to develop a composite indicator known as Household Smoke Exposure Risk (SER). This metric integrates information on both the type of cooking fuel used (smoke-producing vs non-smoke-producing) and the primary cooking location (indoor vs outdoor) to create a more nuanced risk profile.⁹ This approach has been successfully applied to analyze risk factors in countries like Tanzania and Bangladesh, revealing the profound impact of cooking practices on health outcomes such as adverse birth events and respiratory infections.^9,10 A recent regional analysis from Nigeria further highlights the utility of the SER metric in understanding regional disparities and identifying vulnerable populations.¹¹

Although descriptive and regression-based analyses of SER have proven valuable, recent advances in computational methods offer a powerful new approach to risk prediction. Machine learning (ML) algorithms, such as the random forest model, have demonstrated significant promise in predicting complex health behaviors and outcomes from large datasets.^12,13 These approaches excel at identifying nonlinear relationships and complex interactions between determinants, offering predictive power beyond that of traditional statistical models. The application of ML in public health is expanding rapidly, with recent studies successfully using these techniques to predict behaviors such as tobacco use among pregnant women in sub-Saharan Africa, thereby identifying key determinants that can inform targeted interventions.¹⁴ This growth is also reflected across multiple healthcare domains, where ML-based predictive modeling and evidence synthesis are increasingly used to support risk stratification, early detection, and outcome forecasting in diverse settings.^15-17 In addition, ML has shown strong potential in clinical decision support by improving pattern recognition and risk assessment to guide timely and personalized decision-making.^18,19

Despite this progress, there remains a critical research and policy gap in Somalia. As a nation recovering from decades of conflict and instability, Somalia faces unique and severe public health challenges and has historically lacked comprehensive, population-level evidence to support evidence-based policymaking. The first Somalia Health and Demographic Survey (SHDS/SDHS) provides an unprecedented opportunity to examine key health and environmental determinants at the national level. In a recent study, we reported that 60.34% of Somali households experience high household smoke exposure risk and 36.32% experience medium risk, underscoring a substantial and widespread burden.²⁰ However, this prior work relied on conventional analytical approaches and did not apply advanced computational or predictive modeling techniques. Accordingly, the present study addresses this gap by being the first to employ machine learning methods to predict household smoke exposure risk using data from the 2020 SHDS/SDHS. The specific objectives are: (1) to determine the prevalence and key sociodemographic determinants of household smoke exposure risk in Somalia; (2) to develop and evaluate the performance of machine learning models for predicting household SER; and (3) to identify the most influential predictors of high smoke exposure risk to inform targeted public health interventions and policy in Somalia.

Methods

Study Design and Data Source

This study employed a population-based cross-sectional design to analyze the determinants of household smoke exposure. The primary data source was the 2020 Somalia Health and Demographic Survey (SDHS), a nationally representative survey conducted by the Somalia National Bureau of Statistics in collaboration with international partners. The SDHS provides comprehensive data on a wide range of demographic, socioeconomic, and health indicators, making it a suitable dataset for this study.

Sampling Method and Sample Size Justification

The SDHS used a stratified, 2-stage cluster sampling methodology to ensure a representative sample of the Somali population. In the first stage, enumeration areas (EAs) were selected with a probability proportional to their size. In the second stage, a fixed number of households was systematically selected from each chosen EA. This sampling strategy minimized selection bias and enhanced the generalizability of the findings. For this study, a final weighted sample of 15 838 households was included in the analysis. This large sample size, derived from a rigorous national survey, provided sufficient statistical power for developing and validating complex machine learning models.

Study Variables

The primary outcome variable for this study was the Household Smoke Exposure Risk (SER). This was constructed as a binary variable, where households were classified as “Exposed” (coded as 1) if their primary cooking methods involved solid fuels in enclosed or semi-enclosed spaces, and “Not Exposed” (coded as 0) otherwise. In this study, SER was treated as a proxy indicator of household air pollution exposure, enabling stratification of households likely to experience environmentally driven adverse health outcomes associated with chronic smoke exposure. Based on a review of the existing literature on the determinants of household air pollution and data availability in the SDHS, the following predictor variables were selected: Geographic Factors: Place of Residence (categorized as urban, rural, or nomadic), and Administrative Region. Socioeconomic Factors: Wealth Quintile (poor, middle, rich) and Media Exposure (yes/no). Household Head Characteristics: School Attendance of the Household Head (yes/no), Sex of the Household Head, and Age of the Household Head. Household Demographics: Household Size.
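The binary coding described above can be sketched as follows. This is a minimal illustration rather than the study's own code: the fuel categories in `SOLID_FUELS`, the location labels in `ENCLOSED_LOCATIONS`, and the function name `code_ser` are hypothetical stand-ins for the corresponding SDHS variables and codes.

```python
# Hypothetical category labels; the actual SDHS response codes may differ.
SOLID_FUELS = {"charcoal", "wood", "straw", "animal dung"}
ENCLOSED_LOCATIONS = {"in the house", "separate building"}


def code_ser(cooking_fuel: str, cooking_location: str) -> int:
    """Return 1 ("Exposed") when a solid fuel is used in an enclosed or
    semi-enclosed space, and 0 ("Not Exposed") otherwise."""
    uses_solid_fuel = cooking_fuel.lower() in SOLID_FUELS
    cooks_enclosed = cooking_location.lower() in ENCLOSED_LOCATIONS
    return int(uses_solid_fuel and cooks_enclosed)


households = [
    ("Charcoal", "In the house"),     # solid fuel, enclosed space
    ("Charcoal", "Outdoors"),         # solid fuel, open air
    ("Electricity", "In the house"),  # clean fuel, enclosed space
]
ser = [code_ser(fuel, place) for fuel, place in households]
```

Under this coding, only the first household is classified as exposed, because exposure requires both a smoke-producing fuel and an enclosed cooking location.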

Data Pre-processing

The dataset was prepared for machine learning implementation using Python (version 3.13.5). The pre-processing pipeline included data cleaning to rectify any inconsistencies, handling missing values through mode imputation for categorical variables, and the final formatting of variables to ensure model compatibility. A final sample of 15 838 household data points was retained for analysis.
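Mode imputation for a categorical variable can be illustrated with a short plain-Python sketch. The helper name `impute_mode`, the use of `None` to mark a missing value, and the toy `residence` column are assumptions made for illustration; the study's actual pipeline is not shown in the text.

```python
from collections import Counter


def impute_mode(values):
    """Replace missing entries (None) with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


# Toy column with two missing entries.
residence = ["urban", None, "rural", "urban", "nomadic", None]
imputed = impute_mode(residence)
```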

Feature Selection

A comprehensive multi-stage feature selection process was conducted to identify the most relevant predictors of household smoke exposure risk (SER). The process began with exploratory data analysis (EDA), in which descriptive statistics and visualizations were used to gain insights into the distribution of variables and their initial associations with the outcome. For a more formal assessment, a bivariate analysis using survey-adjusted chi-square tests was performed to evaluate the statistical significance of the association between each predictor and the household SER. To further refine the feature set and ensure the robustness of the final models, correlations between categorical predictors were assessed using Cramer’s V statistic. This step was crucial for minimizing multicollinearity by identifying and managing highly redundant variables in the dataset. This combination of statistical prescreening, correlation analysis, and literature-informed review yielded a comprehensive and relevant set of predictors for the model development phase.
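For reference, Cramér's V for two categorical predictors is computed from the chi-square statistic of their contingency table as V = sqrt(χ² / (n · (min(r, c) − 1))). The sketch below is a plain-Python illustration on toy tables; it uses the uncorrected statistic and ignores survey weights, both of which are simplifying assumptions rather than details confirmed by the study.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Two predictors that carry identical information (fully redundant).
v_redundant = cramers_v([[50, 0], [0, 50]])
# Two predictors that are statistically independent.
v_independent = cramers_v([[25, 25], [25, 25]])
```

Values near 1 flag a pair of predictors as largely redundant, whereas values near 0 indicate that the pair can be retained together without concern about collinearity.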

Feature Importance

To understand the contribution of each predictor to the model’s decisions, feature importance was systematically evaluated after model training. For tree-based models (Decision Tree, Random Forest, Gradient Boosting), Gini importance was used. For Logistic Regression, the absolute values of the coefficients were assessed. For K-Nearest Neighbors and SVM, permutation importance was calculated. Furthermore, for the best-performing model, a more advanced explainable AI (XAI) technique, SHapley Additive exPlanations (SHAP), was employed to provide granular insights into both global and local feature contributions, enhancing the interpretability of the model predictions.
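Permutation importance, used here for the models without native importance scores, measures how much a performance metric degrades when one feature column is shuffled. The following plain-Python sketch illustrates the idea on toy data; the `predict` function is a hypothetical stand-in for a fitted classifier, and accuracy is assumed as the scoring metric.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [
            r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


def predict(row):
    """Hypothetical fitted classifier that relies only on feature 0."""
    return row[0]


# Toy data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [(data_rng.randint(0, 1), data_rng.randint(0, 1)) for _ in range(200)]
labels = [r[0] for r in rows]

imp_signal = permutation_importance(predict, rows, labels, column=0)
imp_noise = permutation_importance(predict, rows, labels, column=1)
```

Shuffling the informative feature produces a large drop in accuracy, while shuffling the noise feature leaves accuracy unchanged, which is the contrast that the importance ranking captures.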

Model Development

Six supervised machine learning algorithms were developed to predict the household SER, representing a range of complexities and methodologies. Logistic Regression (LR) served as a linear baseline model with a straightforward statistical foundation. K-Nearest Neighbors (KNN) was employed as a non-parametric, instance-based algorithm. Decision Tree (DT) provided a simple, non-linear tree-based model. The Support Vector Machine (SVM) aimed to identify an optimal separating hyperplane between classes. Random Forest (RF) was included as an ensemble method based on multiple decision trees. Gradient Boosting was included as an ensemble technique that builds models sequentially, with each new model correcting the errors of its predecessors.
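As an illustration of how the six classifiers might be set up, the sketch below uses scikit-learn with an 80/20 hold-out split. Several elements are assumptions that the text does not confirm: the use of scikit-learn itself, the default hyperparameters, the stratified split, and the synthetic data produced by `make_classification`, which stands in for the eight encoded SDHS predictors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the eight encoded predictors (not the SDHS data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Default settings are used here; the study tuned hyperparameters separately.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
```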

Model Training and Evaluation

The dataset was randomly partitioned into a training subset comprising 80% of the observations and a strict hold-out testing subset containing the remaining 20%, which was used to generate the performance metrics. All models were fitted exclusively on the training data, and their hyperparameters were tuned via k-fold cross-validation to curb overfitting and enhance generalizability. Performance was subsequently assessed on the unseen test set using a comprehensive suite of classification metrics, namely accuracy, precision, recall, F1-score, confusion matrix, and area under the receiver-operating characteristic curve (AUROC). To ensure the statistical reliability of the model performance, 95% Confidence Intervals (CIs) were calculated for the AUROC values. Accuracy provides an overall measure of correct predictions, whereas precision and recall quantify the positive predictive value and sensitivity, respectively. The F1-score, as the harmonic mean of precision and recall, gauges the balance between these two dimensions. The confusion matrix offered a granular comparison of true versus false classifications for both exposed and unexposed households, and the AUROC summarized each model’s capacity to discriminate between the 2 classes across all probability thresholds.
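The AUROC and an accompanying 95% CI can be illustrated with a short plain-Python sketch. The text does not state how the intervals were derived, so the percentile bootstrap used here is an assumption, as are the function names and the toy labels and scores.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # a resample must contain both classes
            continue
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy test set: 1 = "Exposed", 0 = "Not Exposed".
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.35, 0.2, 0.1]
point = auroc(labels, scores)  # 21 of 25 positive-negative pairs are ordered correctly
low, high = bootstrap_ci(labels, scores)
```

In this toy example the point estimate is 0.84. With only 10 observations the interval is wide, whereas a test set of several thousand households yields the much narrower intervals reported in this study.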

Model Selection

The selection of the optimal predictive model was based on a holistic comparison of the evaluation metrics generated from the unseen test set. The primary quantitative criteria for model selection were the Area Under the Receiver Operating Characteristic curve (AUROC) and the F1-Score. The AUROC was chosen as a key indicator of the model’s overall ability to discriminate between “Exposed” and “Not Exposed” households, independent of a specific classification threshold. The F1-Score, which represents the harmonic mean of precision and recall, was prioritized to ensure that the selected model demonstrated a robust balance in correctly identifying positive cases (recall) while minimizing false alarms (precision). This balanced approach is particularly critical in the public health context, where both identifying at-risk households and efficiently allocating resources are important. In addition to quantitative performance, qualitative aspects, such as model interpretability and computational efficiency, were considered secondary criteria. The algorithm that demonstrated the strongest and most well-rounded performance across these combined criteria was selected for in-depth feature importance analysis and final interpretation.

Statistical Analysis

All conventional statistical analyses were performed using STATA version 17. Descriptive statistics, including weighted frequencies and percentages, were generated to summarize household sociodemographic characteristics and the prevalence of smoke exposure. To account for the complex sampling design of the SDHS (stratification, clustering, and weighting), survey-adjusted chi-square tests were employed for inferential analysis, assessing the bivariate association between each predictor variable and the binary smoke exposure risk outcome. A P-value of <.05 was considered indicative of a statistically significant association. This initial statistical analysis provided a foundational understanding of the data and informed the preliminary feature selection for machine learning models.

Results

Prevalence and Household Characteristics

The overall weighted prevalence of household smoke exposure was 70.0% (Figure 1). This study included a weighted sample of 15 838 households from the 2020 Somalia Health and Demographic Survey (SDHS), the characteristics of which are presented in Table 1. Most households resided in urban settings (59.8%), followed by rural (27.5%) and nomadic (12.7%) settings. The largest share of households was classified in the poor wealth category (43.5%), with 36.4% classified as rich and 20.1% as middle. Most households were male headed (67.9%), and most household heads (58.8%) had never attended school. More than two-thirds of the households (67.5%) reported no media exposure. In terms of household size, 70.6% of households had more than 4 members. The age of the household head was distributed across various groups, with the largest proportion being 30 to 40 years old (35.0%). The sample was distributed across the 16 regions listed in Table 1, with Banadir having the largest proportion of households (24.7%).

Figure 1.

Weighted SER distribution among households in Somalia.

Table 1.

Sociodemographic Characteristics and Smoke Exposure Risk Among Households in Somalia.

Variables	Number (weighted %)	Smoke exposure risk: not exposed, % (weighted N)	Smoke exposure risk: exposed, % (weighted N)	P-value
Wealth quintile				<.001**
Poor	6883.7 (43.5)	44.4 (3056.4)	55.6 (3827.3)
Middle	3181.6 (20.1)	22.2 (706.3)	77.8 (2475.3)
Rich	5773.2 (36.4)	17.1 (987.2)	82.9 (4786.0)
Place of residence				<.001**
Urban	9468.6 (59.8)	20.9 (1979.0)	79.1 (7489.6)
Rural	4362.4 (27.5)	28.5 (1243.3)	71.5 (3119.1)
Nomadic	2007.5 (12.7)	76.1 (1527.7)	23.9 (479.8)
School attendance of head				<.001**
Yes	6524.2 (41.2)	25.5 (1663.7)	74.5 (4860.5)
No	9314.3 (58.8)	33.1 (3083.0)	66.9 (6231.3)
Media exposure				<.001**
Exposed	5144.1 (32.5)	21.3 (1095.7)	78.7 (4048.4)
Not exposed	10 694.4 (67.5)	34.2 (3657.5)	65.8 (7036.9)
Household size				<.001**
⩽4 Members	4650.7 (29.4)	35.4 (1646.3)	64.6 (3004.4)
>4 Members	11 187.8 (70.6)	27.7 (3099.0)	72.3 (8088.8)
Region				<.001**
Awdal	360.9 (2.3)	25.2 (91.0)	74.8 (269.9)
Woqooyi Galbeed	1505.5 (9.5)	23.3 (350.8)	76.7 (1154.7)
Togdheer	873.5 (5.5)	20.6 (179.9)	79.4 (693.6)
Sool	511.8 (3.2)	53.7 (274.8)	46.3 (237.0)
Sanaag	655.9 (4.1)	52.1 (341.7)	47.9 (314.2)
Bari	944.0 (6.0)	48.8 (460.6)	51.2 (483.4)
Nugaal	439.1 (2.8)	48.3 (212.1)	51.7 (227.0)
Mudug	1017.4 (6.4)	49.0 (498.5)	51.0 (518.9)
Galgaduud	945.4 (6.0)	48.3 (456.6)	51.7 (488.8)
Hiraan	750.2 (4.7)	37.7 (282.8)	62.3 (467.4)
Middle Shabelle	1190.8 (7.5)	23.3 (277.5)	76.7 (913.3)
Banadir	3910.0 (24.7)	11.8 (461.4)	88.2 (3448.6)
Bay	839.6 (5.3)	47.5 (398.8)	52.5 (440.8)
Bakool	369.6 (2.3)	49.6 (183.2)	50.4 (186.4)
Gedo	648.3 (4.1)	13.7 (88.8)	86.3 (559.5)
Lower Juba	876.2 (5.5)	21.5 (188.4)	78.5 (687.8)
Sex of household head				.3656
Male	10 747.5 (67.9)	29.7 (3192.0)	70.3 (7555.5)
Female	5091.0 (32.1)	30.7 (1562.9)	69.3 (3528.1)
Age of household head (years)				.1101
<30	2727.1 (17.2)	32.2 (878.1)	67.8 (1849.0)
30-40	5546.8 (35.0)	29.2 (1620.0)	70.8 (3926.8)
41-54	3594.9 (22.7)	29.3 (1053.3)	70.7 (2541.6)
>54	3969.7 (25.1)	30.1 (1194.9)	69.9 (2774.8)

Statistical significance was set at P < .05.

**P < .001.

Determinants of Smoke Exposure Risk

Bivariate analyses using survey-adjusted chi-square tests identified several significant determinants of household smoke exposure risk (Table 1). Place of residence and wealth quintile showed the largest differences in exposure (P < .0001). Urban households had the highest exposure rate (79.1%), whereas nomadic households had the lowest (23.9%). Smoke exposure risk demonstrated a clear positive gradient with wealth, increasing from 55.6% in poor households to 82.9% in rich ones. Additionally, the household head’s school attendance, media exposure, household size, and region were all significantly associated with smoke exposure risk (P < .0001). For instance, exposure was significantly higher in households where the head had attended school (74.5%) than in those where the head had not (66.9%). In contrast, the sex and age of the household head were not significantly associated with the risk of smoke exposure.

Predictive Performance of Machine Learning Models

Six machine learning models were trained and evaluated to predict the risk of household smoke exposure. The comparative performance across the 5 key metrics is presented in Table 2 and Figure 2. Gradient Boosting emerged as the top-performing model, achieving the highest accuracy (76%) and Area Under the ROC Curve (AUC) of 81% (95% CI: 79.11%-82.17%). It also demonstrated a robust F1-Score (81%), indicating a strong and well-balanced ability to correctly classify households. The other ensemble model, Random Forest, also showed strong predictive capability, with an accuracy of 74% and an AUC of 78% (95% CI: 75.82%-79.21%). Notably, the Support Vector Machine (SVM) model achieved the highest recall (91%), suggesting that it was the most effective at identifying all “Exposed” households. However, this superior sensitivity came at the cost of the lowest precision (71%), indicating a higher rate of false-positive results. The Decision Tree and K-Nearest Neighbors (KNN) models yielded moderate performance, whereas Logistic Regression showed the most limited discriminatory ability, as reflected in their respective AUC scores (Figure 2).

Table 2.

Predictive Performance of Machine Learning Models for Smoke Exposure Risk (SER).

Metric	Logistic regression	Random forest	Support vector machine (SVM)	Decision tree	Gradient boosting	K-nearest neighbors (KNN)
Accuracy (%)	74%	74%	72%	72%	76%	73%
Precision (Class 1: Exposed)	75%	77%	71%	78%	77%	76%
Recall (Class 1: Exposed)	85%	81%	91%	76%	86%	81%
F1-score (Class 1: Exposed)	80%	79%	80%	77%	81%	79%
AUC-ROC (%)	74%	78%	76%	75%	81%	76%
AUC 95% CI	(72.51-76.14)	(75.82-79.21)	(74.55-78.17)	(72.68-76.13)	(79.11-82.17)	(74.41-77.83)

Figure 2.

Comparison of model performance metrics.

Confusion Matrix

The confusion matrices provide a detailed breakdown of the classification performance for each of the 6 models (Figure 3). The Gradient Boosting model demonstrated a strong balance, correctly identifying 1653 “Exposed” households (True Positives) while keeping False Negatives low (276). This indicates high sensitivity for the at-risk group. In contrast, the Support Vector Machine (SVM) model, while achieving the highest recall by correctly classifying 1764 “Exposed” households (the lowest False Negative count), also produced the highest number of False Positives (710), suggesting a tendency to misclassify “Not Exposed” households as exposed. Random Forest also performed well, with 1570 True Positives. The remaining models, including Logistic Regression, KNN, and Decision Tree, showed more moderate performance with higher rates of misclassification for one or both classes (Figure 3).
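The counts quoted above can be tied back to the metrics in Table 2 with a short calculation. In the sketch below, the True Positive, False Negative, and False Positive counts are taken from the text, the Gradient Boosting precision of 0.77 is the rounded value from Table 2, and the function names are illustrative.

```python
def recall(tp: int, fn: int) -> float:
    """Share of truly exposed households that the model identifies."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of households predicted as exposed that are truly exposed."""
    return tp / (tp + fp)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Gradient Boosting: 1653 True Positives and 276 False Negatives.
gb_recall = recall(tp=1653, fn=276)

# SVM: 1764 True Positives and 710 False Positives.
svm_precision = precision(tp=1764, fp=710)

# Gradient Boosting F1, using the rounded precision reported in Table 2.
gb_f1 = f1_score(p=0.77, r=gb_recall)
```

These values round to a recall of 86% and an F1-score of 81% for Gradient Boosting and a precision of 71% for SVM, consistent with Table 2.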

Figure 3.

Confusion matrices for: (A) gradient boosting, (B) random forest, (C) logistic regression, (D) K-nearest neighbors, (E) decision tree, and (F) support vector machine.

AUROC Curve

The Receiver Operating Characteristic (ROC) curves presented in Figure 4 visually compare the ability of the 6 models to distinguish between “Exposed” and “Not Exposed” households. The Area Under the Curve (AUC) serves as a key metric for this discriminatory power. The Gradient Boosting model was the clear top performer, with an AUC of 0.81 (95% CI: 0.79-0.82), indicating its superior capability to correctly classify households across all thresholds. Random Forest followed with an AUC of 0.78 (95% CI: 0.76-0.79), positioning it as the second-best model. The SVM and K-Nearest Neighbors (KNN) models showed comparable moderate performance, each with an AUC of 0.76 (95% CI: 0.74-0.78), while Decision Tree (0.74, 95% CI: 0.73-0.76) and Logistic Regression (0.74, 95% CI: 0.73-0.76) displayed the most limited discriminatory power in this context.

Figure 4.

Receiver operating characteristic (ROC) curves of the 6 models.

Feature Importance Analysis

The relative importance of the predictors of smoke exposure risk was evaluated using different techniques across the 6 models, with the results for the top features presented in Figure 5. Gini importance was used for tree-based models, absolute coefficients for Logistic Regression, and permutation importance for K-Nearest Neighbors (KNN) and SVM. The analysis revealed that geographic and location-based variables were the most influential. Residence emerged as the single most dominant feature in 5 of the 6 models, demonstrating exceptionally high importance scores in Gradient Boosting (Figure 5(A); Importance = 0.55) and Logistic Regression (Figure 5(C); Importance = 0.78). Region was also consistently identified as a primary determinant, ranking as the most important feature for the Random Forest model (Figure 5(B); Importance = 0.40) and the second most important for both the Gradient Boosting (Importance = 0.31) and Decision Tree (Figure 5(E); Importance = 0.22) models. Socioeconomic and demographic factors had a moderate to low influence. Wealth Quintile consistently ranked among the top predictors, particularly for Gradient Boosting (Importance = 0.08) and KNN (Figure 5(D); Importance = 0.04). Household characteristics such as the age and sex of the household head (HH_Age, HH_Sex) and household size (HH_size) displayed varying levels of moderate influence depending on the model used. For instance, HH_Age was the third most important feature for Random Forest (Importance = 0.11), while HH_size was the third for Logistic Regression (Importance = 0.18). In contrast, Media Exposure and School attendance consistently demonstrated the lowest predictive power across all models, suggesting that they have a minimal direct impact on determining household smoke exposure risk in this analysis.

Figure 5.

Top feature importances for predicting smoke exposure risk: (A) gradient boosting, (B) random forest, (C) logistic regression, (D) K-nearest neighbors, (E) decision tree, and (F) support vector machine.

Feature Importance Analysis Using SHAP

SHAP analysis was employed to reveal the relative importance and directional impact of various predictors on the model output. The analysis utilized 4 complementary visualizations: a SHAP Summary Plot, a SHAP Waterfall Plot, a SHAP Beeswarm Plot, and a Feature Importance ranking plot, to provide a comprehensive interpretation of the model’s predictions. The feature importance plot clearly identifies Residence as the most influential predictor, with the highest mean absolute SHAP value. This is followed by Region and Wealth Quintile, which also demonstrate significant predictive power. Features such as HH_size, MediaExposure, and School_attendance showed a moderate influence, while HH_Age and HH_Sex had the least impact on the model’s predictions (Figure 6).

Figure 6.

Gradient boosting model feature importance based on mean absolute SHAP values.

The SHAP Summary (Figure 7) and Beeswarm plots (Figure 8) illustrate the distribution and direction of these impacts. For Residence, higher feature values (shown as red dots in the beeswarm plot) consistently pushed the model output higher, as indicated by the cluster of points on the positive side of the SHAP value axis (Figure 8). Similarly, higher Wealth Quintile values were associated with positive SHAP values, increasing the likelihood of a positive prediction. Conversely, the plots reveal that higher Media Exposure and School attendance values are associated with negative SHAP values, suggesting that they decrease the model’s output in this context.

Figure 7.

SHAP summary plot.

Figure 8.

SHAP beeswarm plot.

The SHAP Waterfall Plot provides a transparent view of how an individual prediction is formulated, starting from the base value (E[f(X)]) of 0.697 (Figure 9). For the specific instance shown, Residence (+0.61), Region (+0.56), and Wealth Quintile (+0.31) were the primary drivers pushing the prediction higher. In contrast, Media Exposure (−0.13) and School attendance (−0.06) exerted a negative influence. The cumulative effect of all features resulted in a final model output (f(x)) of 2.044 for this observation (Figure 9). Collectively, these visualizations confirm that geographic and socioeconomic factors, specifically Residence, Region, and Wealth Quintile, are the most critical determinants in the model’s predictions.
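Because SHAP values are additive, the base value plus all feature contributions must equal the final output, which allows a quick arithmetic check of the waterfall figures. The sketch below uses only the rounded values quoted in the text. The conversion to a probability assumes that the model output is expressed in log-odds units, which is suggested by an output greater than 1 but is not stated explicitly.

```python
import math

base_value = 0.697    # E[f(X)] reported for Figure 9
final_output = 2.044  # f(x) reported for Figure 9

# Contributions named in the text for this household (rounded values).
named_contributions = {
    "Residence": 0.61,
    "Region": 0.56,
    "Wealth": 0.31,
    "Media Exposure": -0.13,
    "School attendance": -0.06,
}

named_total = sum(named_contributions.values())

# Additivity: whatever the named features do not explain must come from
# the features that the text does not itemize (household size, age, sex).
remaining = final_output - base_value - named_total


def sigmoid(z: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Assumption: the SHAP output scale is log-odds.
p_base = sigmoid(base_value)
p_final = sigmoid(final_output)
```

The five named features contribute a net +1.29, leaving approximately +0.06 attributable to the remaining features. Under the log-odds assumption, the prediction for this household rises from a baseline probability of about 0.67 to about 0.89.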

Figure 9.

SHAP waterfall plot.

Discussion

The study revealed a striking 70% prevalence of household smoke exposure risk (SER) among Somali households and demonstrated that a gradient-boosting ensemble achieved the best discriminatory performance (AUC = 0.81; F1 = 0.81), driven chiefly by geographic features, particularly place of residence and region, while socioeconomic markers such as wealth quintile exerted a more modest yet consistent influence. In contextualizing these findings, it is noteworthy that Somalia remains one of the most heavily affected countries in East Africa, with 78% of households relying on solid cooking fuels in 2019, the highest proportion in the region.²¹ Our estimate aligns with these data and with a recent Somaliland survey reporting an even higher reliance (97%) on biomass and charcoal.²² Regionally, nearly four-fifths of sub-Saharan Africans still depended on polluting fuels as of 2021, underscoring the persistent magnitude of the exposure gap despite global declines.²³

Although the literature typically implicates rural poverty as the principal driver of household air pollution (HAP) exposure,²⁴ our models identified urban residence and higher relative wealth as the strongest positive predictors. This apparent paradox is plausibly explained by Somalia’s rapid, conflict-driven urbanization, which has concentrated internally displaced populations and low-income renters in dense peri-urban settlements, where charcoal is the de facto affordable cooking fuel and ventilation is poor.^25,26 Similar urban slum vulnerabilities have been documented in Nigeria and Kenya, where solid fuel use persists despite ostensibly higher household assets.^27,28 The modest protective influence of media exposure observed here resonates with Indian and Ethiopian studies showing that information access modestly accelerates the adoption of clean energy technologies but is rarely sufficient in the absence of subsidies and supply chain improvements.^29,30

From a methodological perspective, the gradient-boosting model’s superiority mirrors broader evidence that ensemble learners outperform single classifiers in air-quality prediction tasks across diverse settings,³¹ while our Support Vector Machine’s high recall but low precision reflects the classic sensitivity–specificity trade-off encountered in ML-based exposure screening.³² Importantly, SHAP decomposition reaffirmed that residence and region made the largest contributions to predicted SER, corroborating a recent interpretable-ML synthesis that flagged spatial determinants as dominant in both high- and low-income environments.³³ The positive wealth gradient we observed contrasts with multi-country DHS analyses, where lower quintiles typically exhibit higher solid-fuel dependence,²⁴ suggesting that Somalia’s wealth index, which is heavily weighted toward livestock and remittance-linked durable goods, may not accurately proxy liquidity for clean-energy purchases. Our findings therefore extend the environmental-justice discourse: contrary to global patterns where poverty predicts pollution exposure, Somali urban households with nominally higher wealth may paradoxically face the greatest smoke burden, a phenomenon echoed in recent machine learning work linking complex poverty metrics, urban siting, and ambient emissions in other fragile settings.³⁴ Importantly, these results provide direct environmental health insight because SER captures household-level exposure conditions that are strongly associated with environmentally mediated health burdens, including acute respiratory infections, chronic respiratory disease, cardiovascular outcomes, and premature mortality.^23,35 In this context, our predictive models help identify where exposure reduction strategies can deliver the greatest health gains, supporting practical translation from environmental exposure mapping to human health protection and prevention-oriented policy.

Internal validity was supported by the large nationally representative sample (n = 15 838), use of a strict 80/20 hold-out test set with cross-validated tuning, and consistent identification of key predictors (residence and region) across multiple models and SHAP explanations, reducing the likelihood that the findings reflect model-specific artifacts. However, SER remains a proxy exposure indicator derived from cooking fuel type and location rather than direct PM₂.₅ measurements or clinical outcomes; therefore, some exposure misclassification is possible. External validity is strengthened by the SDHS sampling design, supporting national generalizability; nonetheless, the partial under-representation of nomadic households due to missing covariates may limit the applicability to fully pastoralist communities.
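The hold-out design described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's actual pipeline: the number of folds (5), the random seed, the simple unstratified shuffle, and the function name `holdout_then_folds` are all hypothetical. The point it shows is that tuning folds are drawn only from the 80% training portion, so the 20% test rows never influence hyperparameter selection.

```python
import random
from typing import List, Tuple

Fold = Tuple[List[int], List[int]]  # (fit indices, validation indices)


def holdout_then_folds(
    n: int, test_frac: float = 0.2, k: int = 5, seed: int = 0
) -> Tuple[List[int], List[Fold]]:
    """Split row indices into a held-out test set and k tuning folds from the rest."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # reproducible shuffle
    n_test = round(n * test_frac)
    test, train = idx[:n_test], idx[n_test:]

    folds: List[Fold] = []
    for i in range(k):
        val = train[i::k]                   # every k-th training row forms one fold
        val_set = set(val)
        fit = [j for j in train if j not in val_set]
        folds.append((fit, val))
    return test, folds


# Sample size taken from the text; all other settings are illustrative.
test, folds = holdout_then_folds(n=15838)
print(len(test), [len(val) for _, val in folds])
```

Each candidate hyperparameter setting would be scored on the validation part of every fold, and only the final selected model would be evaluated once on `test`.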

Notwithstanding these insights, several limitations must be considered. First, the cross-sectional design of the SDHS precludes causal inference, and the SER outcome relies on proxy definitions of cooking practice rather than direct PM₂.₅ measurement, raising the potential for misclassification; ventilation characteristics, fuel stacking, and seasonality were unmeasured. Second, nomadic households (12.7% of the sample) may have been under-represented in model training because of missing covariate data, limiting generalizability to fully pastoral communities. Third, we confined modeling to 6 conventional algorithms; deep neural architectures and geospatial covariates (eg, satellite aerosol optical depth or land-use regression layers) were not incorporated, potentially limiting the achievable predictive performance.³⁶ Fourth, although survey weights were applied in descriptive statistics, they were not integrated into ML loss functions; future work should test survey-weighted learners to better respect complex sampling designs. Finally, all covariates were self-reported or census-derived, and residual confounding by unobservable socioeconomic or behavioral factors (eg, cooking duration, child presence, or ambient dust storms) cannot be excluded.
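To make the fourth limitation concrete, one way a survey-weighted learner could work is by scaling each household's contribution to the training loss by its sampling weight. The sketch below shows a weighted binary cross-entropy under that assumption; the function name `weighted_log_loss` and all labels, probabilities, and weights are hypothetical, and this was not part of the study's analysis.

```python
import math
from typing import Sequence


def weighted_log_loss(
    y_true: Sequence[int],
    p_pred: Sequence[float],
    weights: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Binary cross-entropy in which each row counts in proportion to its weight."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        p = min(max(p, eps), 1 - eps)       # guard against log(0)
        total += w * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)             # normalize by total weight


# Hypothetical outcomes, predicted probabilities, and survey weights.
y_true = [1, 0, 1]
p_pred = [0.9, 0.2, 0.4]

unweighted = weighted_log_loss(y_true, p_pred, [1.0, 1.0, 1.0])
weighted = weighted_log_loss(y_true, p_pred, [0.5, 0.5, 3.0])
print(f"unweighted={unweighted:.3f} survey-weighted={weighted:.3f}")
```

Here the poorly predicted third household represents more of the population, so the weighted loss penalizes that error more heavily than the unweighted one. In practice, many estimators accept per-row weights directly (for example, the `sample_weight` argument of scikit-learn's `fit` methods), which would avoid hand-writing the loss.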

Considering these constraints, 3 actionable recommendations were identified. First, Somalia’s clean cooking policies should focus on densely populated urban areas, particularly those accommodating internally displaced persons, by expanding subsidized liquefied petroleum gas (LPG) or ethanol programs and utilizing humanitarian channels that have previously distributed efficient stoves on a small scale.^25,27 Second, future population surveys should incorporate low-cost sensor modules or hybrid questionnaire–sensor approaches to directly measure particulate concentrations and ventilation metrics, thereby enabling the validation of proxy-based machine learning risk scores.³⁷ Third, the predictive workflow should be developed into a dynamic geospatial early warning system that integrates DHS microdata, routine media-penetration indicators, and remotely sensed environmental layers using interpretable ensemble methods to guide micro-targeted interventions and monitor progress toward the Sustainable Development Goals. Collectively, these measures could expedite equitable access to clean household energy in Somalia and provide a transferable model for other fragile, low-resource contexts facing the persistent challenge of household air pollution.

Conclusion

Household smoke exposure risk remains highly prevalent in Somalia, indicating widespread vulnerability to harmful household air pollution. By applying supervised machine learning models alongside explainable AI methods, this study demonstrated that smoke exposure risk can be predicted with good discriminatory performance, with Gradient Boosting providing the strongest overall balance across evaluation metrics. Geographic context, particularly residence and region, emerged as the most influential drivers of exposure risk, whereas socioeconomic factors such as household wealth also contributed meaningfully. The unexpected clustering of higher exposure risk among urban and relatively wealthier households highlights the need to reconsider conventional assumptions and prioritize clean cooking strategies in dense urban and peri-urban settings, including communities affected by displacement and those living in informal housing. Overall, interpretable ML-based risk stratification offers a practical and scalable approach for identifying high-risk households and supporting targeted, prevention-oriented environmental health policies in fragile and data-constrained settings.

Footnotes

ORCID iDs

Mohamed Abdirahim Omar

Mohamed Mustaf Ahmed

Ethical Considerations

This study used secondary data from the 2020 Somali Demographic and Health Survey (SDHS), conducted in accordance with established ethical guidelines. The research complied with ethical principles by obtaining the necessary approvals from the Somalia National Health Research Ethics Committee and the ICF Institutional Review Board.

Consent to Participate

Informed consent was secured from all participants prior to data collection, ensuring that their rights and confidentiality were upheld throughout the research process.

Author Contributions

MMA conceptualized the study. The study design was developed collaboratively by ASA, MMA, MAO, and YSAH. Material preparation and data collection were performed by YSAH and MMA. MAO analyzed and interpreted the data. The initial draft was written by ASA, MMA, and YSAH. All authors contributed to the writing, reviewing, and editing of subsequent versions of the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

This research did not involve the collection of primary data. The findings are based on secondary data from the 2020 Somali Demographic and Health Survey (SDHS), which is publicly accessible via the DHS website.

References

1. WHO. Household Air Pollution. World Health Organization; 2023. Accessed October 16, 2024. https://www.who.int/news-room/fact-sheets/detail/household-air-pollution-and-health.

2. Balmes JR. Household air pollution from domestic combustion of solid fuels and health. J Allergy Clin Immunol. 2019;143:1979-1987.

3. Juntarawijit. Cooking smoke exposure and respiratory symptoms among those responsible for household cooking: a study in Phitsanulok, Thailand. Heliyon. 2019;5:e01706.

4. Odame, Amoah. Household exposure to the risk of cooking smoke: evidence from Sub-Saharan Africa. Energy Nexus. 2023;12:100256.

5. Amegah, Quansah, Jaakkola JJ. Household air pollution from solid fuel use and risk of adverse pregnancy outcomes: a systematic review and meta-analysis of the empirical evidence. PLoS One. 2014;9:e113920.

6. Bickton, Ndeketa, Sibande, Nkeramahame, Payesa, Milanzi EB. Household air pollution and under-five mortality in sub-Saharan Africa: an analysis of 14 demographic and health surveys. Environ Health Prev Med. 2020;25:67.

7. Misganaw, Hailemariam, Moshago Berheto, Lakew, Derso Mengesha, Agachew, et al. Household air pollution impacts on mortality and disease burden in East Africa and Nile Basin African countries. Ethiop J Health Dev. 2023;37.

8. Misganaw, Naghavi, Walker, Mirkuzie, Giref, Berheto, et al. Progress in health among regions of Ethiopia, 1990–2019: a subnational country analysis for the Global Burden of Disease Study 2019. Lancet. 2022;399:1322-1335.

9. Ahamad, Tanin, Shrestha. Household smoke-exposure risks associated with cooking fuels and cooking places in Tanzania: a cross-sectional analysis of demographic and health survey data. Int J Environ Res Public Health. 2021;18:2534.

10. Khan MN, Nurs, Mofizul Islam, Islam, Rahman MM. Household air pollution from cooking and risk of adverse health and birth outcomes in Bangladesh: a nationwide population-based study. Environ Health. 2017;16:57.

11. Sani, Garba, Ahmed MM. Understanding household smoke exposure risks (SER) in Nigeria: a regional analysis from the 2018 NDHS. BMC Public Health. 2025;25:1351.

12. Kneer, Borchardt, Kärgel, et al. Diminished fronto-limbic functional connectivity in child sexual offenders. J Psychiatr Res. 2019;108:48-56.

13. Cao, Zhao, et al. Prediction of smoking behavior from single nucleotide polymorphisms with machine learning approaches. Front Psychiatry. 2020;11. doi:10.3389/fpsyt.2020.00416

14. Taye, Woubet, Hailie, et al. Random forest algorithm for predicting tobacco use and identifying determinants among pregnant women in 26 sub-Saharan African countries: a 2024 analysis. BMC Public Health. 2025;25:1506.

15. Ranjbar, Montazeri, Ghamsari, Mehrnoush, Roozbeh, Darsareh. Machine learning models for predicting preeclampsia: a systematic review. BMC Pregnancy Childbirth. 2024;24:6.

16. Taeidi, Ranjbar, Montazeri, Mehrnoush, Darsareh. Machine learning-based approach to predict intrauterine growth restriction. Cureus. 2023;15. doi:10.7759/cureus.41448

17. Banaei, Roozbeh, Darsareh, Mehrnoush, Farashah MSV, Montazeri. Utilizing machine learning to predict the risk factors of episiotomy in parturient women. AJOG Global Reports. 2025;5:100420.

18. Safarzadeh, Ardabili, Farashah, Roozbeh, Darsareh. Predicting mother and newborn skin-to-skin contact using a machine learning approach. BMC Pregnancy Childbirth. 2025;25:182.

19. Roozbeh, Montazeri, Farashah, Mehrnoush, Darsareh. Proposing a machine learning-based model for predicting nonreassuring fetal heart. Sci Rep. 2025;15:7812.

20. Hassan YSA, Omar, Ali, Ahmed. Smoke exposure risk among Somali households: prevalence, determinants, and regional disparities. BMC Public Health. 2025;25:3229.

21. Household air pollution impacts on mortality and disease burden in East Africa and Nile Basin African countries | Institute for Health Metrics and Evaluation. IHME; 2023. Accessed June 22, 2025. https://www.healthdata.org/research-analysis/library/household-air-pollution-impacts-mortality-and-disease-burden-east-africa?utm_source=chatgpt.com.

22. Ali, Abokor, Adam Farih, Abdikarim, Yousuf, Muse AH. Household solid fuel use and associated factors in Somaliland: a multilevel analysis of data from 2020 Somaliland demographic and health survey. Environ Health Insights. 2025;19:1-11. doi:10.1177/11786302251315893

23. Bennitt, Wozniak, Causey, Spearman Okereke, Garcia, et al. Global, regional, and national burden of household air pollution, 1990–2021: a systematic analysis for the Global Burden of Disease Study 2021. Lancet. 2025;405:1167-1181.

24. Azanaw, Endalew. Determinants of solid fuel use in Sub-Saharan Africa: a multilevel analysis using DHS data. PLoS One. 2025;20:e0321721.

25. Somalia Case Study | Climate Refugees | Othering & Belonging Institute. Othering & Belonging Institute at UC Berkeley; 2023. Accessed June 22, 2025. https://belonging.berkeley.edu/climatedisplacement/case-studies/somalia.

26. Ahmed, Asowe, Dirie, et al. The nexus of climate change, food insecurity, and conflict in Somalia: a comprehensive analysis of multifaceted challenges and resilience strategies. F1000Res. 2024;13:913.

27. As Pollution Kills, Africa needs billions for climate-ready stoves | Reuters. Reuters; 2024. Accessed June 22, 2025. https://www.reuters.com/business/environment/pollution-kills-africa-needs-billions-climate-ready-stoves-2024-05-14/.

28. Yonemitsu, Njenga, Iiyama, Matsushita. A choice experiment study on fuel preference of Kibera slum households in Kenya. Int J Environ Sci Dev. 2015;6:196-200.

29. Halder, Kasemi, Roy, Majumder. Impact of indoor air pollution from cooking fuel usage and practices on self-reported health among older adults in India: evidence from LASI. SSM Popul Health. 2024;25:101653.

30. Mulat, Tamiru, Abate KH. Exposure to household air pollution and childhood multimorbidity risk in Jimma, Ethiopia. Front Public Health. 2024;12. doi:10.3389/fpubh.2024.1473320

31. Song, Guo. Deep ensemble machine learning framework for the estimation of PM2.5 concentrations. Environ Health Perspect. 2022;130. doi:10.1289/ehp9752

32. Samad, Garuda, Vogt, Yang. Air pollution prediction using machine learning techniques – an approach to replace existing monitoring stations with virtual monitoring stations. Atmos Environ. 2023;310:119987.

33. Houdou, El Badisy, Khomsi, et al. Interpretable machine learning approaches for forecasting and predicting air pollution: a systematic review. Aerosol Air Qual Res. 2024;24:230151.

34. Magesh, Geng. A machine learning interpretation of the correlation between poverty and air pollution in the contiguous United States. Sci Rep. 2025;15:2407.

35. Lee, Bing, Kiang, et al. Adverse health effects associated with household air pollution: a systematic review, meta-analysis, and burden estimation study. Lancet Glob Health. 2020;8:e1427-e1434.

36. Jianyao, Yuan, Wang, Weng, Zhang. Machine learning-enhanced high-resolution exposure assessment of ultrafine particles. Nat Commun. 2025;16:1209.

37. Karmakar, Pradhan, Chakraborty. Indoor Air Quality Dataset with Activities of Daily Living in Low to Middle-income Communities. Adv Neur Inform Process Systems. 2024;37:70076-70100.