Abstract
Introduction
Stress can significantly affect human health and performance, often resulting in long-term mental health disorders across all demographic groups.1 Farmers’ mental health has become a growing concern due to their exposure to unpredictable challenges and hazardous working conditions. Identifying the underlying causes of mental health disorders is essential for effectively promoting farmers’ well-being. While research on farmer mental health stressors has largely focused on developed countries, few studies have examined these issues in developing nations such as Thailand. Only a limited number of investigations have explored associations between mental health and related risk factors. Yazd et al. identified key contributors to mental health risks in farmers, including pesticide exposure, financial strain, climate variability, drought, and poor physical health, findings that align with those of Hagen et al.1,2 Most studies to date have relied on descriptive analyses to examine mental health outcomes in farming populations. Although pesticide exposure has been associated with mental and sleep disorders in farmers,3–6 evidence linking agrochemical exposure to psychological distress remains inconclusive.7 Fewer studies have examined correlations between perceived farm stressors and mental health, particularly among Thai farmers. This pilot study employed a comprehensive farm stressor inventory questionnaire to better understand these stressors within this population.
To analyze variable relationships in studies on farmers’ stress, life satisfaction, and well-being, ordinary least squares regression and multiple logistic regression have been commonly applied.8–13 Many recent investigations have considered the mental health impacts of the COVID-19 pandemic.14–18 Several studies have also reported associations between pesticide use and mental health risks among Thai farmers,19–23 while additional work has explored the pandemic’s effects.24,25 A recent analysis combining descriptive analytics and multiple logistic regression further investigated the relationship between pesticide exposure and mental health disorders in Thai farmers.26
Given the complexity and multidimensionality of farm-related stressors, machine learning (ML) approaches offer notable advantages over traditional statistical models such as multiple linear regression, which primarily capture linear relationships. ML models are capable of identifying nonlinear patterns, modeling complex variable interactions, and accommodating heterogeneous data types. In agriculture, the interrelated nature of stressors presents challenges that require modeling tools able to detect these complex relationships without imposing predefined functional forms or distributional assumptions. When paired with advanced interpretability techniques, ML models can provide deeper insights into the factors most strongly influencing mental health outcomes among farming populations. In particular, ensemble tree-based models, combined with interpretability methods such as SHAP values, support both accurate prediction and transparent feature attribution.
This study comprises three analyses related to mental health among Thai farmers. The first focuses on the association between pesticide exposure and symptoms of stress, depression, and anxiety. The second examines the relationship between mental health and perceived stressors, including both occupational (farm-related) and non-occupational (financial and social) factors, while controlling for demographic variables and COVID-19-related stressors. The third specifically investigates the impact of COVID-19-related stressors on mental health outcomes.
Our work is distinguished by the incorporation of specific stressor variables, model development strategies, feature selection techniques, and SHAP-based model interpretation to estimate mental health outcomes. We conducted a detailed quantitative analysis to examine the relationships among various stressors, COVID-19-related variables, and mental health. Machine learning models were applied to investigate the complex interactions underlying mental health disorders among Thai farmers. In particular, ensemble tree-based models, including random forest, gradient boosting, and XGBoost, were utilized due to their superior predictive performance over baseline multiple linear regression. Several feature selection methods were employed, including p-value analysis, model-derived feature importance scores, Boruta, and Boruta-SHAP,27,28 while SHAP values facilitated interpretation of the results. This approach not only enhances the accuracy of mental health outcome predictions but also allows for transparent ranking and visualization of the relative importance of various stressors, thereby supporting evidence-based and practical interventions.
Methods
Study area and data collection
This study was based on Noomnual et al.26 and was approved by the Ethical Review Committee for Human Research, Faculty of Public Health, Mahidol University (COA No. MUPH2021-087). It was conducted in Amphoe Payuhakiri, Nakhon Sawan province, Thailand. Details regarding the inclusion and exclusion criteria have been described elsewhere.26 The sample size was calculated to estimate a population proportion in a finite population29 and adjusted for a 10% non-response rate, resulting in a total sample size of 270. Participants were selected using a purposive sampling technique. Thai farmers and their family members aged 18 years or older, who had engaged in agricultural activities as a primary or secondary occupation for at least one year, were eligible. Individuals with pre-existing mental disorders or health conditions were excluded.
All 270 participants completed the survey and were included in the raw dataset prior to data preparation. Each participant was interviewed by one of three trained interviewers using a structured, four-part questionnaire. The survey collected information on (1) demographics, household environment, and agricultural activities; (2) perceived farm stressors; (3) self-reported symptoms of stress, depression, and anxiety; and (4) COVID-19-related stress, including concerns about virus transmission, fear, and frequent information checking. The questionnaires are described in detail elsewhere.26 Briefly, parts 1 and 3 were adopted from surveys commonly used in previous studies among Thai populations. Parts 2 and 4 were translated from validated instruments used in prior research. The translated versions were reviewed and compared with the originals by three experts to ensure content validity and conceptual consistency.
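The sample size calculation described above can be sketched as follows. This is a minimal illustration using the standard formula for estimating a proportion in a finite population, n = N z² p(1 − p) / (e²(N − 1) + z² p(1 − p)). The population size, expected proportion, margin of error, and the convention of dividing by (1 − non-response rate) are all assumptions chosen for illustration; the values actually used to arrive at 270 are not stated in this section.

```python
import math

def finite_population_sample_size(N, p=0.5, e=0.05, z=1.96):
    """Sample size for estimating a proportion p in a finite population of size N.

    e is the margin of error and z the normal quantile (1.96 for 95% confidence).
    """
    numerator = N * z**2 * p * (1 - p)
    denominator = e**2 * (N - 1) + z**2 * p * (1 - p)
    return math.ceil(numerator / denominator)

def adjust_for_nonresponse(n, rate=0.10):
    """Inflate n so that the expected number of respondents is still n."""
    return math.ceil(n / (1 - rate))

# N = 1000 is a made-up population size, not the study's figure.
n = finite_population_sample_size(N=1000)
n_adjusted = adjust_for_nonresponse(n)
print(n, n_adjusted)
```

Some authors instead multiply by (1 + rate) when allowing for non-response; the two conventions give slightly different totals.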
A list of all variables used in this study to address Q1, Q2, and Q3.
Demographic variables were included in all three cases. For Cases I and II, household location with respect to farm areas and chemical use within the household were considered relevant contextual factors and were therefore included, along with summary COVID-19 scores. Case I additionally incorporated specific pesticide-related stressors, while Case II focused on farm-related, financial, and social stressors. In contrast, Case III was designed to specifically address COVID-19 stressors, and thus these variables were prioritized while pesticide- and farm-related stressors were not included. These three cases were constructed to approximate an ablation analysis to focus on a specific group of stressors. Particularly, this design allowed us to examine the contribution of each feature group to predictive performance.
After data collection, a comprehensive analysis was performed in three stages: data preparation, model development, and identification of key contributing factors. These stages were designed, respectively, to ensure data quality, construct robust predictive models, and extract meaningful insights. In the first stage, raw data were cleaned, transformed, and preprocessed to ensure suitability for downstream analysis. The second stage focused on developing predictive models using both baseline and advanced tree-based algorithms. Finally, the third stage involved interpreting model outputs and identifying important factors contributing to the outcomes of interest. The complete analytical workflow is shown in Figure 1, and the associated pseudocode is provided in Supplemental Material 8 to facilitate reproducibility.
Figure 1. Process workflow.
Data preparation
To prepare the raw data, columns with more than 75% missing values were removed, and the remaining gaps were imputed using standard techniques. After excluding incomplete and inconsistent entries, the dataset consisted of 211 samples. Continuous variables were scaled to a range of 0 to 1. The objective of this study was to investigate mental health issues among farmers, specifically focusing on stress, anxiety, and depression, which were used as response variables. Mental health scores were computed as proportions based on self-reported indicators. Features relevant to the three research questions were organized into separate datasets, referred to as Case I, Case II, and Case III. Model development and key factor identification were conducted independently for each case.
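The preparation steps described above can be illustrated with a short sketch. It assumes a purely numeric feature matrix with missing values encoded as NaN and uses column medians for imputation; the text only mentions "standard techniques", so the median is an assumption, and the function name and toy values are hypothetical.

```python
import numpy as np

def prepare_features(X, max_missing=0.75):
    """Drop sparse columns, impute remaining gaps, and scale each column to [0, 1]."""
    X = np.asarray(X, dtype=float)
    # Fraction of missing values per column; keep columns at or below the threshold.
    keep = np.isnan(X).mean(axis=0) <= max_missing
    X = X[:, keep]
    # Median imputation (assumed; the source does not name the technique).
    medians = np.nanmedian(X, axis=0)
    X = np.where(np.isnan(X), medians, X)
    # Min-max scaling; constant columns are left at 0 to avoid division by zero.
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range, keep

raw = np.array([
    [1.0, np.nan, 10.0],
    [2.0, np.nan, np.nan],
    [3.0, np.nan, 30.0],
    [4.0, np.nan, 40.0],
    [5.0, 7.0,    50.0],
])
scaled, kept = prepare_features(raw)
print(kept)     # which original columns survived the missingness filter
print(scaled)
```

When such preparation feeds a predictive model, the imputation values and scaling bounds are usually estimated on the training portion only and then applied to held-out data; the text does not say which order was used here.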
Model development and evaluation metrics
This study applied multiple linear regression (MLR) as a baseline alongside three tree-based models, namely random forest (RF), gradient boosting, and XGBoost, to estimate stress, anxiety, and depression levels, treating them as continuous response variables. These tree-based models are ensemble approaches that combine the predictions of individual trees, where each regression tree recursively partitions the input space into regions and assigns a constant predicted value within each region. These tree-based models were selected for their robustness against multicollinearity among predictors, their ability to capture potential nonlinear relationships between stressors and mental health outcomes, and their reduced tendency to overfit due to ensemble averaging. Among the tree-based ensemble models, we selected the best-performing algorithm as the representative for this group to be compared against the baseline MLR model.
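To make the idea of a regression tree and ensemble averaging concrete, the sketch below fits the simplest possible tree (a single split, often called a stump) and then averages stumps fitted on bootstrap resamples. This is a toy illustration of bagging, the averaging principle behind random forest; it is not gradient boosting or XGBoost, and it is not the implementation used in the study. All function names and data are hypothetical.

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 regression tree: one threshold, one constant prediction per side."""
    candidates = np.unique(x)[:-1]      # thresholds that leave both sides non-empty
    if candidates.size == 0:            # all x identical: no split is possible
        return float(x[0]), float(y.mean()), float(y.mean())
    best = None
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, float(t), float(left.mean()), float(right.mean()))
    return best[1:]

def predict_stump(stump, x_new):
    threshold, left_value, right_value = stump
    return np.where(x_new <= threshold, left_value, right_value)

def bagged_predict(x, y, x_new, n_trees=50, seed=0):
    """Average the predictions of stumps fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), size=len(x))
        preds.append(predict_stump(fit_stump(x[idx], y[idx]), x_new))
    return np.mean(preds, axis=0)

x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
stump = fit_stump(x, y)
single = predict_stump(stump, np.array([0.25, 0.75]))
averaged = bagged_predict(x, y, np.array([0.25, 0.75]))
print(stump, single, averaged)
```

A full tree applies the same split search recursively inside each region, and the ensemble methods named above differ mainly in how the individual trees are built and combined.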
To improve model stability and generalizability, a 5-fold cross-validation (CV) approach was used. The model was trained on four folds and validated on the fifth, rotating this process five times. Although the sample size was relatively small (n = 211), the tree-based model’s internal ensemble technique and feature subsampling mechanisms can yield stable estimates when combined with cross-validation. However, theoretical benefits may not always translate into practical performance improvements. To evaluate this, we conducted direct comparisons between tree-based models and the baseline MLR using the same features and cross-validation settings. This approach allowed us to determine whether the increased complexity of tree-based models resulted in a meaningful gain in predictive accuracy for the dataset.
Model performance was evaluated using the mean square error (MSE), with lower values indicating better accuracy. The full dataset was split into 80% training and 20% hold-out test sets. Cross-validation was applied only within the training set to tune hyperparameters and assess model stability across different parameter combinations. The goal was to minimize MSE while maintaining a balance between model complexity and generalization. The final model was then evaluated on the hold-out test set.
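The evaluation protocol above (an 80/20 split, 5-fold cross-validation inside the training portion, and a final check on the hold-out set) can be sketched as follows. The example uses synthetic data with 211 rows and a least-squares MLR fit so that it runs with NumPy alone; the data, coefficients, and function names are hypothetical. For the tree-based models, the same cross-validation loop would be repeated for each hyperparameter combination and the combination with the lowest mean MSE retained.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def fit_mlr(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def cv_mse(X, y, k=5, seed=0):
    """Train on k-1 folds, validate on the remaining fold, rotate k times."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = fit_mlr(X[train], y[train])
        scores.append(mse(y[val], predict_mlr(coef, X[val])))
    return scores

# Synthetic stand-in for the prepared dataset (not the study's data).
rng = np.random.default_rng(42)
X = rng.random((211, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.3]) + rng.normal(0, 0.05, 211)

# 80/20 split: the hold-out rows are never seen during cross-validation.
perm = rng.permutation(len(X))
n_test = int(round(0.2 * len(X)))
test_idx, train_idx = perm[:n_test], perm[n_test:]

fold_scores = cv_mse(X[train_idx], y[train_idx])
final_coef = fit_mlr(X[train_idx], y[train_idx])
holdout_mse = mse(y[test_idx], predict_mlr(final_coef, X[test_idx]))
print(fold_scores, holdout_mse)
```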
Key factors identification
For the MLR model, the significance of each variable was assessed using standard statistical t-tests. Variables with p-values less than 0.05 were retained, establishing a baseline for feature importance using classical statistical criteria. These results were compared with feature importance rankings derived from the tree-based models, Boruta, and SHAP. Feature importance in the tree-based models was calculated based on the cumulative reduction in prediction error attributed to each variable across all sub decision trees. Variables with higher importance scores contributed more substantially to accurate predictions. The Boruta algorithm provided a more robust feature selection approach by extending the tree-based model. It generated shadow features by randomly permuting the values of the original variables. Across multiple iterations, each original feature’s importance was statistically compared against its shadow counterpart. Boruta classified features as ‘important’ if they consistently outperformed their shadow versions, ‘potentially important’ if they had mixed performance, or ‘unimportant’ if they consistently underperformed. In addition, we explored SHAP (SHapley Additive exPlanations), a method based on the Shapley value from cooperative game theory. SHAP values quantify the marginal contribution of each variable to the model’s prediction relative to the average prediction. This approach allows interpretation of how much each variable contributes, either positively or negatively, to a prediction. Rather than relying on raw model-derived feature importance, the BorutaSHAP method used SHAP values as its importance metric. By combining SHAP’s interpretability with the robustness of Boruta, we identified key variables that most significantly influenced farmers’ mental health outcomes.
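The shadow-feature comparison at the core of Boruta can be sketched in a few lines. To keep the example dependency-light, the importance score here is the absolute Pearson correlation with the outcome, which is a stand-in for the tree-based (or SHAP-based) importance the study used. The decision rule is a plain one-sided binomial test on the hit counts without the multiple-testing correction that full Boruta implementations apply. Function names, thresholds, and data are hypothetical.

```python
import numpy as np
from math import comb

def abs_corr_importance(X, y):
    """Stand-in importance: |Pearson r| between each column and y."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

def boruta_sketch(X, y, n_iter=20, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_iter):
        # Shadow features: each original column with its values randomly permuted.
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
        imp = abs_corr_importance(np.hstack([X, shadows]), y)
        # A hit means the real feature beat the best-scoring shadow feature.
        hits += imp[:n_features] > imp[n_features:].max()
    labels = []
    for h in hits:
        h = int(h)
        p_high = sum(comb(n_iter, i) for i in range(h, n_iter + 1)) / 2 ** n_iter
        p_low = sum(comb(n_iter, i) for i in range(0, h + 1)) / 2 ** n_iter
        if p_high < alpha:
            labels.append("important")
        elif p_low < alpha:
            labels.append("unimportant")
        else:
            labels.append("potentially important")
    return hits, labels

rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, 200)   # only the first column carries signal
hits, labels = boruta_sketch(X, y)
print(hits, labels)
```

BorutaSHAP follows the same loop but replaces the importance score with the mean absolute SHAP value of each real and shadow feature.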
Statistical analysis
To evaluate and compare the predictive performance of the baseline MLR and the more complex tree-based models, an empirical analysis was conducted using a paired t-test on cross-validation results. The mean squared error (MSE) for each model was computed across multiple random train-test splits within a repeated cross-validation framework. This procedure allowed us to examine the difference in performance between the two models across consistent data partitions. For each split, we calculated the difference in MSE values (MSE_TREE − MSE_LR).
The hypotheses for the statistical test were defined as follows:
This hypothesis testing was conducted separately for each target variable, including ST, DASS-ST, 9Q, DASS-DEP, and DASS-ANX. For each case, we calculated the t-statistic and corresponding p-value to assess the significance of the observed performance difference. A p-value less than 0.05 was considered statistically significant and indicated sufficient evidence to reject the null hypothesis, suggesting that the predictive performance of the tree-based model differed significantly from that of the MLR model for that specific outcome. This testing framework provided insight into both average performance and model stability across multiple data partitions.
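A minimal sketch of the paired comparison is given below. It computes the paired t-statistic on the per-split differences MSE_TREE − MSE_LR, taking the null hypothesis to be a mean difference of zero, in line with the equal-performance null described in the Results; the exact form of the alternative hypothesis is not recoverable from this section. The MSE values are made-up numbers for illustration only.

```python
import numpy as np

def paired_t_statistic(mse_tree, mse_lr):
    """t-statistic and degrees of freedom for the mean of paired differences."""
    d = np.asarray(mse_tree, dtype=float) - np.asarray(mse_lr, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return float(t), n - 1

# Hypothetical per-split MSE values (not results from the study).
mse_tree = [0.020, 0.018, 0.025, 0.022, 0.019]
mse_lr = [0.024, 0.021, 0.026, 0.027, 0.022]

t_stat, dof = paired_t_statistic(mse_tree, mse_lr)
print(t_stat, dof)   # a negative t means the tree-based model had lower MSE on average
```

The p-value follows from a t distribution with n − 1 degrees of freedom; `scipy.stats.ttest_rel(mse_tree, mse_lr)` returns the same statistic together with a two-sided p-value.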
Results
Relationships among mental health disorder scores were initially assessed by calculating Pearson correlation coefficients between ST5-stress and DASS21-stress, and between 9Q-depression and DASS21-depression, to examine consistency across different assessment instruments. A correlation coefficient of 0.62 was observed between the stress measures, and 0.59 between the depression measures.
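For reference, the Pearson coefficient used for this consistency check can be computed as in the sketch below. The two score lists are invented placeholders standing in for paired instrument scores such as ST5-stress and DASS21-stress.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Placeholder scores for five hypothetical respondents.
scores_instrument_a = [1, 2, 3, 4, 5]
scores_instrument_b = [2, 1, 4, 3, 5]
r = pearson_r(scores_instrument_a, scores_instrument_b)
print(r)
```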
MSE using multiple linear regression (MLR) and the selected tree-based model.
To assess performance differences between the models, a cross-validation paired t-test was conducted, comparing MSE values from the selected tree-based model and MLR across five random data splits with multiple repetitions. P-values were calculated to test the null hypothesis that both models performed equivalently. The results indicated that the tree-based model significantly outperformed MLR for all mental health outcomes across all three cases, supporting the decision to emphasize the tree-based model in subsequent analyses.
A comparison of selected features from statistical p-values and Boruta algorithms along with feature importance values from the model for Case I.
In Case I, COVID-19-related stressors were positively associated with multiple mental health outcomes, while PPE usage was negatively associated with 9Q-depression scores. Specifically, a one-unit increase in PPE usage was associated with a 0.19-unit decrease in depression scores, assuming all other variables remained constant. Boruta and Boruta SHAP consistently identified similar important features for 9Q and DASS-DEP, whereas MLR coefficients showed inconsistencies. This trend was generally observed in the comparison between ST and DASS-ST with slight discrepancies. In Case II, COVID-19 stressors continued to show positive correlations with most mental health symptom levels, which aligned with SHAP values. However, only a few farm-related stressors (FS) showed consistent correlations between SHAP and MLR results. Boruta SHAP identified more significant features for all target variables than MLR, with larger discrepancies noted between ST versus DASS-ST and 9Q versus DASS-DEP compared to Case I. For FS-related features in MLR, five variables were identified as important for ST, but only two for DASS-ST. In Case III, numerous COVID-19-related variables were identified as significant contributors in the SHAP analysis, despite showing limited importance in MLR, regardless of the mental health response variable considered.
SHAP values of selected features were further analyzed using the Boruta SHAP approach with the tree-based model. Absolute SHAP values indicated the overall influence of each feature on the model’s predictions. The direction of feature contributions was also evaluated (Figure 2; Supplemental Materials 6 and 7). Each dot represents an individual data point, with pink and blue indicating high and low feature values, respectively. The sign of the SHAP value reflects the direction of impact on the target variable. Negative SHAP values with pink dots concentrated on the left suggest that high feature values are associated with lower predicted outcomes, and vice versa. A lack of clear separation between pink and blue indicates an ambiguous or non-directional effect of the feature on the prediction.
Figure 2. SHAP values for top features for Case I.
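How a signed SHAP value arises can be seen by computing exact Shapley values for a small toy model. The sketch below enumerates every feature coalition and replaces absent features with a single baseline value; this is a simplification, since SHAP libraries typically average over a background dataset and use tree-specific algorithms for efficiency. The model, its coefficients, and the feature names (covid, ppe, dist) are hypothetical and only loosely echo the variables discussed here.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in the coalition take their observed value, the rest the baseline.
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi

def toy_model(v):
    covid, ppe, dist = v
    # Hypothetical score: rises with covid stress, falls with PPE use,
    # with an interaction between covid stress and living near the farm.
    return 0.2 + 0.5 * covid - 0.3 * ppe + 0.4 * covid * dist

x = [1.0, 1.0, 1.0]          # a respondent with high values on all three features
baseline = [0.5, 0.5, 0.5]   # reference point standing in for the average respondent
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction at x and the prediction at the baseline, and the negative value for the PPE feature corresponds to a high-value (pink) dot lying to the left of zero in a summary plot.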
In Case I, higher values for current and past use of PPE were generally associated with lower levels of mental health disorder symptoms for most outcome variables (Figure 2). Clear separation between pink and blue dots was observed especially for the PPE feature, with pink dots predominantly appearing on the left side of the axis. Some variations were observed for the PEST_HIS5_PPE feature for DASS-ST and DASS-ANX target variables, for which no clear conclusion could be made. COVID-19-related stressors, a key confounding factor, exhibited a strong positive correlation with stress, depression, and anxiety levels. The variable HH_DIST, representing residence within 1 km of farming areas, also showed a positive association with mental health symptoms in most cases, suggesting increased stress and depression among farmers living near their work environment. In contrast, good agricultural work practices (PRAC) were negatively correlated with mental health symptom levels.
In Case II, the influence of perceived stressors related to farming (FRM), finance (FIN), and social factors (SOC) on mental health outcomes was examined using the FS1 to FS25 variables (Supplemental Material 6). For stress outcomes (ST and DASS-ST), FS13, representing lack of support from government policy, emerged as a key farm-related factor. Among social indicators, FS10, reflecting no leisure time with family members, showed a positive association with stress symptoms. For depression-related outcomes (9Q and DASS-DEP), significant farm-related indicators included FS13 and FS11 (concern about cultivated areas). FS1, denoting inconvenience in commuting, also emerged as a meaningful social factor contributing to elevated depression levels. It is noteworthy that COVID-19-related stressors and HH_DIST, as major confounding factors, showed strong positive correlations with stress, depression, and anxiety levels.
Focused analysis of COVID-19-related variables revealed minimal separation between pink and blue dots across most features, regardless of the mental health outcome (Supplemental Material 7). This indicated a lack of clear directional influence for many COVID-19 stressors. However, CST9, which concerns fear of individuals traveling from abroad being infected, consistently showed importance for stress, depression, and anxiety scores. Additionally, CST12 and CST14, which reflect lack of sleep and intrusive thoughts about the disease, respectively, were positively associated with higher levels of mental health disorder symptoms.
Discussion
This study employed the tree-based regressor to examine associations between mental health disorder symptoms and relevant independent variables, with the MLR model used as a baseline for comparison. Identifying important features was performed using the Boruta algorithm, and the resulting features were interpreted using SHAP values. These were compared against traditional linear regression coefficients to evaluate both the magnitude and direction of each variable’s effect on mental health outcomes. COVID-19-related stress consistently emerged as a key factor, supporting the reliability of the findings, while less influential variables showed inconsistencies across models.
In Case I, COVID-19-related stress was identified as the most significant confounding factor in both MLR and Boruta analyses. The direction of effects indicated by MLR coefficients in Table 3 aligned partially with the SHAP results shown in Figure 2. Both present and past PPE use were generally found to be important across most mental health outcomes according to SHAP values, though they were less prominent in MLR results. In Case II, the impact of perceived farm stressors on mental health symptom levels was assessed. Farm-related and social indicators consistently ranked among the top features in all models, whereas financial stressors were significant only for depression and anxiety symptoms, particularly in FS24. SHAP plots showed positive correlations for these stressors, illustrated by the dominance of pink dots on the right side of the axis. COVID-19-related stress was again found to be positively associated with all mental health outcomes. Household income (HH_INC) influenced almost all response variables except ST. The inconsistencies observed between MLR and SHAP results for ST versus DASS-ST and 9Q versus DASS-DEP may be attributed to the moderate correlations (approximately 0.6) between these questionnaire scores, highlighting the importance of consistent questionnaire design and terminology in mental health assessments. Case III provided additional insights into COVID-19-related factors, while demographic variables, such as age, appeared less influential. Fear of individuals traveling from abroad being infected with COVID-19 (CST9) emerged as one of the most important features. Both positive and negative coefficients were observed (Supplemental Material 5), consistent with the lack of clear separation between pink and blue dots in the SHAP visualizations (Supplemental Material 7).
The results indicated that good agricultural practices and PPE use were associated with reduced mental health disorder symptoms, as highlighted by SHAP values. These findings are consistent with the previous study by Kaewboonchoo et al.,22 reinforcing the importance of implementing safe pesticide practices among Thai farmers. To the best of current knowledge, no prior study has applied a comprehensive farm stressor inventory among Thai farmers. A positive association was observed between heightened perceptions of farm stress and adverse mental health outcomes. The stressor inventory enabled the identification of detailed risk factors and supported the development of targeted intervention programs and policies aimed at improving mental health and well-being in agricultural populations. This pilot study demonstrated the value of machine learning models in exploring occupational and non-occupational factors related to mental health disorders among Thai farmers. Tree-based models were employed to capture nonlinear relationships, in contrast to earlier studies that relied primarily on linear regression models. The integration of Boruta and SHAP facilitated the identification and interpretation of influential factors by combining the interpretability of SHAP values with Boruta’s robust feature selection capability.
The study highlighted the potential of machine learning in public health systems by contributing to evidence-based policy development, monitoring, forecasting, and structured data analysis. It demonstrated how machine learning can be used to identify risk factors and health behavior patterns, thereby informing more effective interventions and mental health promotion strategies.30–32 The present study illustrated how machine learning models can uncover complex relationships in survey-based public health research, using mental health among Thai farmers as a case study. Previous research applied machine learning to mental health data, primarily for diagnosis, treatment, and support.33,34 However, relatively few studies have used machine learning for public health surveillance, particularly to examine complex risk factor interactions in specific populations such as farmers. Survey data, as used in this study, proved useful for detecting mental health conditions. This analysis showed that machine learning can be applied to broad and complex variable sets, such as farm stressors, offering valuable insights for future public health interventions. Prior research has applied machine learning models to questionnaire data to predict mental health among adolescents and university students.35,36 Tree-based algorithms were often used, along with SHAP value analysis for interpreting feature importance. Other studies also examined the influence of COVID-19 on mental health, identifying key risk factors that inform intervention strategies.37,38
Several limitations were noted. First, the self-reported mental health disorders were not validated with external sources, such as biological measurements or personal journals. In addition, pesticide exposure levels were not quantitatively measured. These may introduce recall bias, reporting errors, and social desirability bias—especially in relation to mental health and chemical use. Our study captured participants’ self-reported adverse effects within the past year in order to minimize recall bias. Future studies should incorporate quantification of pesticide exposures. Second, the study was conducted in a single agricultural region in Thailand, which may reduce its applicability to other geographical, cultural, or occupational contexts. Although the sample size was calculated with a 10% allowance for non-response based on our previous research, only 211 complete records were available for model development after data cleaning. While this provides valuable continuity of research, the purposive sampling and relatively small post-cleaning sample size may limit statistical power and external validity. In addition, the small sample size, when combined with machine learning techniques, may increase the risk of overfitting, thereby limiting the models’ generalizability. Future work should therefore expand to larger, multi-regional, and longitudinal datasets and incorporate external validation to enhance model robustness and confirm the stability of key predictors. Although the current study employed a case-based design to address sample size limitations, future work should implement a full ablation analysis across all variables to provide a more rigorous evaluation of predictor importance. Despite these limitations, this pilot study demonstrated the utility of farm stressor assessments and machine learning in understanding mental health risks among Thai farmers. 
To strengthen the policy translation pathways in future work, we aim to incorporate advanced XAI-powered approaches to better support the decision-making process.39 These efforts would support comprehensive mental health interventions and contribute to improving well-being among agricultural populations.
Conclusion
Mental health disorders may be influenced by a range of factors, including demographic characteristics, cultural context, socioeconomic status, lifestyle, access to healthcare, and individual perceptions of health risks. This study applied a farm stressor inventory to capture both occupational and non-occupational factors associated with mental health disorder symptoms among Thai farmers using machine learning models for analysis. The findings indicated that lower levels of mental health disorder symptoms were associated with higher levels of both current and past PPE usage, as well as adherence to good agricultural work practices. Associations were also identified between mental health disorder symptoms and indicators related to agricultural and social stressors. Additionally, COVID-19-related factors were found to be significant confounders. This pilot study demonstrated the utility of machine learning approaches in examining complex public health issues involving multiple, interrelated variables.
Supplemental Material
Supplemental Material for Application of machine learning to identify key factors influencing agricultural workers’ mental health: A case study of Thai farmers by Papis Wongchaisuwat, Veerasit Kaewbundit, Saisattha Noomnual in Health Informatics Journal
Footnotes
Acknowledgements
Data collected for this research was supported by the Fogarty International Center of the National Institutes of Health under Award Number U2RTW010088. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Ethical considerations
This study was approved by the Ethical Review Committee for Human Research, Faculty of Public Health, Mahidol University (COA No. MUPH2021-087).
Consent to participate
All participants provided written and verbal informed consent prior to participating in the study.
Author contributions
Conceptualization: Papis W, Saisattha N. Data curation: Veerasit K. Formal analysis: Papis W, Veerasit K, Saisattha N. Funding acquisition: Papis W, Saisattha N. Methodology: Papis W, Veerasit K. Project administration: Saisattha N. Visualization: Veerasit K. Writing - original draft: Papis W, Saisattha N. Writing - review & editing: Papis W, Saisattha N.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Fogarty International Center of the National Institutes of Health under Award Number U2RTW010088 for the data collection phase. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The data analysis for this project was supported by Kasetsart University Research and Development Institute under grant number FF(KU) 51.67. However, any opinions, findings, and conclusions or recommendations in this document are those of the authors and do not necessarily reflect the views of the sponsor.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Supplemental Material
Supplemental material for this article is available online.
References
