Predicting Absenteeism at Workplace Using Machine Learning and Network Analysis

Abstract

Absenteeism at work, which can lead to productivity loss in business, is related to various psychological, social, and economic factors. Because predicting absenteeism involves complex associations among such factors, the analysis requires appropriate use of machine learning algorithms. Statistical pre-processing and machine learning methods have advanced the comprehensive analysis of massive social data on absenteeism. The aim of this study is to develop a quantitative approach that identifies the associations of factors and classifies absenteeism by including the effect of factors in high-dimensional data. This approach implements association analysis, including the odds ratio test and network analysis, and supervised learning with imbalanced classification using random forest, together with principal component analysis and penalized regression methods. The dataset in this study includes records of various types of absenteeism at a workplace in Brazil from July 2007 to July 2010. Our study shows that there exist strongly interacting factors and that specific factors are strongly associated with absenteeism. The proposed method is validated on publicly available data using random forest and penalized regression with k-fold cross validation in order to improve generalizability. One of the major findings of this study is the elucidation of the associations of factors affecting absenteeism. Application to similarly structured social data can improve the understanding of the complex interplay between social factors and absenteeism, which is important for people analytics and can help organizations resolve management difficulties.

Keywords

quantitative methods, network analysis, management, public health, absenteeism, big data, machine learning

Introduction

Absenteeism can be defined as taking an absence on a regularly planned workday (Riedy et al., 2020). Absenteeism is possibly related to various demographic, situational, socioeconomic factors, including transportation expense, workload, service time, age, body mass index, etc. (de Wit et al., 2018). The absenteeism may be affected by lack of educational opportunities at work, extended service time, and overwhelming work load (Aboagye et al., 2019). It has been shown that work-related absenteeism can be positively altered by the workplace’s improved work environment (Grimani et al., 2019). From a variety of public health studies, it has been proven that absenteeism at work may affect productivity loss associated with work-related features (van Den Heuvel et al., 2010). Workload and service time with emotional exhaustion may be associated with higher levels of absenteeism (Vignoli et al., 2016). It is crucial that the association of factors, or the relations between various factors and the outcome, absenteeism, should be appropriately examined and identified in terms of business and public health.

In a systematic analysis of social and psychological factors, absenteeism at work includes behaviors such as facing family-related interruptions or leaving early (Boise & Neal, 1996). Using demographic or situational information to estimate the outcome with socioeconomic status has been considered as an innovative statistical analysis to elucidate latent social and psychological factors and identify variables influencing workers’ absenteeism (Markussen et al., 2011). Absenteeism at work is a consequence of situational and environmental factors with social and psychological adjustments that may have an effect on the variables which are correlated to business growth (Hausknecht et al., 2008). Business growth is susceptible to the workers’ physical office environment, which may adversely affect productivity (Kamarulzaman et al., 2011). There is an extended literature on the management of working environment related to social and psychological effect on workers (Belloni et al., 2022). Studies have demonstrated that variables related to workplace environment and workers do not independently affect absenteeism but are strongly correlated in terms of labor economics and public health (Bubonya et al., 2017; D. W. Lee et al., 2021). Since absenteeism is a significant determinant related to the laboring processes of business, validating the associations of social factors affecting the absenteeism can estimate variables in workers’ workplace environment (Sarker et al., 2016). Therefore, it is crucial to utilize appropriate computational and statistical approaches to define complex interactions of occupational factors in various social and economic aspects.

Identifying the interactions of factors in workplace environment in the high-dimensional social data with outcomes is an emerging area in business and public health research. In the complex structures of high-dimensional social data, extracting features for the prediction of outcome is challenging. Estimating the associations of features in the high-dimensional data has been examined with machine learning methods such as penalized regression (Doerken et al., 2019; Mullah et al., 2021) with community detection (Fortunato, 2010). The network analysis of demographic, situational and social factors can not only provide the associations of strongly interacting factors but also construct a model to estimate an outcome based on relations of variables. To classify an outcome in a high-dimensional setting, lasso (Wang & Du, 2019), elastic net (Goutman et al., 2022), and ridge (Xu et al., 2012) methods are utilized. To design a model to analyze high-dimensional social data, an effective and efficient statistical approach is needed to classify the outcome while considering complex interactions among high-dimensional features of social and economic factors.

In this paper, a novel analytic approach is suggested in order to implement a network analysis on the integration of demographic and socioeconomic features and classify the absenteeism. The proposed approach will conduct a comprehensive computational and statistical experiment to evaluate interactions of social factors and obtain the relations between the social and psychological factors related to absenteeism. The estimated associations of features in work environment detected by the proposed methods are interpreted in the context of the business and public health literature.

Given the social and psychological relations from work environment variables related to absenteeism, the proposed methods can be applied to similarly structured workers’ demographic, situational, and socioeconomic datasets to assess health outcomes. Developing and utilizing the proposed method to examine the workers’ condition can explain how complex combinations of social and psychological factors can have an impact on absenteeism. Elucidating the patterns of associations between factors from work environment using computational and statistical methods can be bettered by considering prior knowledge in the proposed approach.

Conceptual Background

Absenteeism is an employee’s justified or unjustified absence from workplace (Putnam et al., 2004). It is expected that workers may take a certain number of absences every year. Consecutive or overwhelming absences affected by various social and psychological factors may lead to decreased productivity, influencing strategies, ethics, and other significant criteria in business (Amer et al., 2022; Bryan et al., 2021). These risk factors may indicate seasons, transportation expenses, distance from residence to work, service time, workload, age, body mass index, education level, drinking and smoking status and family-related variables. It is critical to develop and use effective and efficient methods to appropriately identify complex interactions of various demographic, situational, and socioeconomic factors and to accurately classify absenteeism by including the significant factors (Schmidt et al., 2021). The analysis of absenteeism at workplace can be performed with machine learning algorithms by including the variables of age, education, working hours, etc. (Park et al., 2024). Previous studies focus on designing a set of rules that connect factors to an outcome, absenteeism (de Oliveira et al., 2019; Rista et al., 2020).

In this study, there exist various reasons for absence such as certain diseases, pregnancy, childbirth, injury, external causes of morbidity and mortality, or factors affecting health conditions. In the data analysis of this study, the status of absenteeism is analyzed regardless of reasons for absence. There exist factors such as distance, service time, age, workload, education level, number of children, number of pets, the status of social drinking or smoking, BMI. It is significant to understand the associations of factors and the impact of factors on absenteeism (Martiniano & Ferreira, 2018). Specifically, there have been studies on relationship between alcohol consumption and workplace absenteeism (Bacharach et al., 2010; Marzan et al., 2023; S Hashemi et al., 2022). Utilizing workplace absenteeism dataset with various factors related to and reasons for absenteeism may provide significant clues about not only associations of factors and absenteeism but also co-worker support. To analyze associations of factors and absenteeism, two hypotheses will be presented and tested.

Research Model and Hypotheses

A research model with two hypotheses can be designed.

There Exists a Difference in the Structures of Networks Built Using Features Related to Drinkers and Non-drinkers

Drinkers and non-drinkers may have different characteristics or priorities in terms of social and psychological factors. The features in drinker and non-drinker groups have distinct levels of relations. In this hypothesis, the network structures built using features by drinkers and non-drinkers can be compared.

Age, Body Mass Index, and Family-Related Attributes Have a Positive Effect on Absenteeism

Various demographic, situational, and socioeconomic factors positively or negatively affect absenteeism. Machine learning methods should be utilized to estimate the impact of factors on the outcome and accurately classify absenteeism.

Materials and Methods

Data Description

The dataset (Martiniano & Ferreira, 2018) can be directly accessed via the Project DOI: 10.24432/C5X882 at https://archive.ics.uci.edu/ from the UC Irvine Machine Learning Repository. Various types of absenteeism were investigated at a courier company in Brazil. The data set includes 21 categorical, integer, or real features and 740 instances.

The dataset of absenteeism at work includes an outcome and features. The outcome is a binary variable which indicates unjustified absenteeism and other cases. The features are categorical or real variables which indicate demographic, situational, or socioeconomic characteristics. In the classification of the outcome of absenteeism, the association of categorical variables, the association of real variables, and the classification of an outcome in the imbalanced data may be considered in order to better understand the structures of data and conduct the prediction of an outcome (Sowjanya & Mrudula, 2022; Zhao et al., 2018). Computing the odds ratio using the contingency table for the association of categorical variables (Simon, 2001), network analysis (Luke & Harris, 2007) detecting and representing the interactions of real variables, and various machine learning methods such as principal component analysis (Jolliffe & Cadima, 2016), random forest (Khalilia et al., 2011), and penalized regression (Fu et al., 2017) including Lasso, Elastic Net, and Ridge can be implemented to analyze high-dimensional data.

Estimation of Odds Ratio

An odds ratio is an estimation of association between a factor and an outcome. Odds ratio represents the possibility that an outcome may occur given a factor, compared to the possibility of the outcome occurring without that factor (Ranganathan et al., 2015). The odds ratio between the event of disciplinary failure, drinking status, or smoking status, and absenteeism is computed.
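As an illustration of the computation described above, the following minimal sketch computes an odds ratio and a Wald-type 95% confidence interval from a 2x2 contingency table. The function name `odds_ratio_ci` and the cell counts are hypothetical and are not taken from the study data; the study may have used a different interval method.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: factor present, outcome present
    b: factor present, outcome absent
    c: factor absent,  outcome present
    d: factor absent,  outcome absent
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR), computed from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only (not from the absenteeism data).
or_value, ci_low, ci_high = odds_ratio_ci(a=20, b=80, c=10, d=90)

# The association is flagged only when the interval excludes 1.
significant = not (ci_low <= 1.0 <= ci_high)
print(round(or_value, 3), round(ci_low, 3), round(ci_high, 3), significant)
```

With these illustrative counts the point estimate is above 1, yet the interval still contains 1, which is why the interval rather than the point estimate is used to judge the association.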

Principal Component Analysis

Principal component analysis was conducted to evaluate how unjustified absenteeism, and presenteeism or justified absenteeism groups can be classified. As an unsupervised learning method, which detects the patterns in the high-dimensional data, principal component analysis can reduce the complexity by transforming data points to lower dimensions (Lever et al., 2017). In this process, principal component analysis aims to find the axis with the maximum variance in order to minimize the loss of information. Principal component analysis projects data onto lower dimensions by summarizing the data with a limited number of principal components.
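The idea of finding the axis with maximum variance can be sketched in a few lines. The example below is an assumption-laden illustration using only the Python standard library: it estimates the first principal component by power iteration on the sample covariance matrix. The function name and the toy data are hypothetical, and the study itself presumably relied on a standard statistical package.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained variance ratio, scores) for the first PC."""
    n, p = len(rows), len(rows[0])
    # Center each feature so that variance is measured around the mean.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    direction = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(p))
                   for i in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    variance = sum(direction[i] * cov[i][j] * direction[j]
                   for i in range(p) for j in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    # Scores are the projections of the centered rows onto the direction.
    scores = [sum(c[j] * direction[j] for j in range(p)) for c in centered]
    return direction, variance / total_variance, scores


# Hypothetical two-feature toy data (not from the absenteeism dataset).
toy = [[2.0, 1.9], [0.5, 0.7], [1.0, 1.1], [3.0, 2.8], [1.5, 1.6]]
direction, explained, scores = first_principal_component(toy)
print(direction, explained)
```

Plotting the scores of the first two components, colored by class, gives the kind of two-dimensional view that the study uses to judge whether the groups separate.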

Imbalanced Data With Random Forest

Imbalanced data include majority and minority classes, and the minority class is prone to misclassification (Fotouhi et al., 2019). It is crucial to investigate the effect of class imbalance in the model. In order to obtain the best balancing strategy for the model, random forest, an ensemble of decision trees based on the bagging strategy, which implements simple random sampling with replacement, can be utilized (Ganaie et al., 2022). To better design imbalanced ensemble learning, over-sampling or under-sampling methods are considered.
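The two resampling strategies can be sketched as follows. This is a minimal illustration of random over-sampling and random under-sampling using only the standard library; the paper does not state which resampling variant was used, so the helper names and the choice of purely random resampling are assumptions. The class counts (33 and 707) are the ones reported for this dataset in the results.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = majority + minority + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority_label, seed=0):
    """Keep a random subset of majority rows equal in size to the minority."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    keep = rng.sample(majority, len(minority)) + minority
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Placeholder feature rows; only the class counts mirror the dataset.
labels = [1] * 33 + [0] * 707
rows = [[i] for i in range(len(labels))]

_, over_labels = random_oversample(rows, labels, minority_label=1)
_, under_labels = random_undersample(rows, labels, minority_label=1)
print(Counter(over_labels), Counter(under_labels))
```

In practice, resampling is usually applied to the training portion only, so that the evaluation data keep the original class proportions.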

Network Analysis Using Graphical Lasso and Community Detection

The graphical lasso is a regularization technique that helps to identify a sparse inverse covariance matrix, effectively revealing the underlying structure of the data. A correlation is rigorously measured by utilizing the graphical lasso based on the partial correlation between features in work environment to build an optimized network model. Estimating the association between demographic, situational, or socioeconomic features is crucial in the network analysis. It is challenging to extract only important characteristics because all characteristics appear to have multiple connections with other features. However, using graphical lasso makes it convenient to select only significant properties (Huang et al., 2020). With a network built with n instances and p features, there exist strongly correlated features. The proposed regression model can detect the interactions between features and minimize the negative log likelihood with penalty terms to estimate the optimized regression coefficients for the analysis. The minimization formula follows Equation 1.

\min_{θ} \; - \log (\det (θ)) + \mathrm{tr} (S θ) + ρ {‖ θ ‖}_{1}

(1)

where S indicates the empirical covariance matrix, θ indicates a nonnegative definite matrix, tr indicates the trace of a matrix, and ${‖ θ ‖}_{1}$ is the L1 norm. In Equation 2, $β$ can be estimated by minimizing the objective function and choosing an optimal hyperparameter $ρ$.

\min_{β} \; \frac{1}{2} {‖ W β - b ‖}^{2} + ρ {‖ β ‖}_{1}

(2)

Modeling and visualizing a network of features can be conducted by using community detection methods and graphical lasso which estimates regression coefficients of features in the dataset.
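To make the link between a sparse inverse covariance matrix and a feature network concrete, the sketch below converts a precision matrix into partial correlations and keeps only the nonzero off-diagonal entries as edges. It assumes the precision matrix has already been estimated; the matrix values, the feature names `f1` to `f3`, and the helper names are hypothetical.

```python
import math


def partial_correlations(theta):
    """Partial correlation r_ij = -theta_ij / sqrt(theta_ii * theta_jj)."""
    p = len(theta)
    result = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                result[i][j] = -theta[i][j] / math.sqrt(theta[i][i] * theta[j][j])
    return result


def edges_from_precision(theta, names, tol=1e-8):
    """List (name_i, name_j, partial correlation) for nonzero entries."""
    pcor = partial_correlations(theta)
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(pcor[i][j]) > tol:
                edges.append((names[i], names[j], pcor[i][j]))
    return edges


# Hypothetical sparse precision matrix: f3 is conditionally independent
# of f1 and f2, so it receives no edge.
theta = [
    [2.0, -0.8, 0.0],
    [-0.8, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
edges = edges_from_precision(theta, ["f1", "f2", "f3"])
print(edges)
```

A larger penalty $ρ$ drives more off-diagonal entries to exactly zero, which is what yields a sparser network with fewer edges.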

Penalized Regression

Penalized regression is an extension of regression that can help with strongly correlated features and avoid overfitting in the high-dimensional data. To construct a network of features from workplace environment data, a method of computing partial correlation between features is utilized. When there exists a correlation with some subsets of network characteristics, the potential interactions of the features should be defined (Friedman et al., 2010; Liu et al., 2009). All the factors associated with the demographic, situational, and socioeconomic effects for each characteristic are validated to develop the network model that represents the interactions between the two features. Suppose there exist instances $x_{i} \in R^{p}$ and the outcome $y_{i} \in R$, i = 1, 2, …, N. The objective function for the Gaussian family follows Equation 3.

\min_{(β_{0}, β) \in R^{p + 1}} \; \frac{1}{2 N} \sum_{i = 1}^{N} {(y_{i} - β_{0} - x_{i}^{T} β)}^{2} + λ [(1 - α) {‖ β ‖}_{2}^{2} + α {‖ β ‖}_{1}]

(3)

where $λ$ indicates a nonnegative value and the value of $α$ ranges from 0 to 1. When the value of $α$ is 0, a ridge regression is implemented. When $α$ is greater than 0 and less than 1, an elastic net regression is implemented. When $α$ is 1, a lasso regression is implemented. $β$ can be estimated by minimizing the objective function to find an optimal hyperparameter $λ$. To predict absenteeism, lasso for logistic regression (Wong et al., 2023), elastic net for logistic regression (Engebretsen & Bohlin, 2019), and ridge for logistic regression (Arashi et al., 2021) are utilized. As an extended version of the regression, lasso selects variables in regression models with penalty terms. Given p features from work environment data and n independent and identically distributed instances, penalized logistic regression includes categorical and continuous features to classify a binary outcome.
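The role of $α$ in Equation 3 can be checked numerically. The sketch below evaluates the Gaussian-family objective exactly as written in Equation 3 for fixed coefficients; the function name and the toy data are hypothetical, and the sketch only evaluates the objective rather than fitting a model.

```python
def elastic_net_objective(X, y, beta0, beta, lam, alpha):
    """Evaluate the objective of Equation 3 for given coefficients."""
    n = len(y)
    # Residual sum of squares for the linear predictor beta0 + x^T beta.
    rss = sum(
        (y[i] - beta0 - sum(x * b for x, b in zip(X[i], beta))) ** 2
        for i in range(n)
    )
    l2_squared = sum(b * b for b in beta)
    l1 = sum(abs(b) for b in beta)
    penalty = lam * ((1 - alpha) * l2_squared + alpha * l1)
    return rss / (2 * n) + penalty


# Hypothetical toy data in which beta = [1, 2] fits perfectly, so that
# only the penalty term contributes to the objective.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta = [1.0, 2.0]

ridge_value = elastic_net_objective(X, y, 0.0, beta, lam=0.5, alpha=0.0)
enet_value = elastic_net_objective(X, y, 0.0, beta, lam=0.5, alpha=0.5)
lasso_value = elastic_net_objective(X, y, 0.0, beta, lam=0.5, alpha=1.0)
print(ridge_value, enet_value, lasso_value)
```

For the logistic models used to predict absenteeism, the squared-error term would be replaced by the negative log-likelihood of the binary outcome, while the penalty term keeps the same form.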

Data Analysis Results

Odds ratio values for the event of disciplinary failure, drinking status, or smoking status and absenteeism are computed. The 95% confidence intervals of odds ratio values between the event of disciplinary failure or smoking status and absenteeism contain the value of 1, indicating that there may not exist statistical significance, suggesting that the relationship between these two variables is likely due to chance. The odds ratio between drinking status and absenteeism is 2.468 with a 95% confidence interval of [1.098, 5.55] which does not contain 1, representing that there may exist statistical significance, indicating that drinking status may have an impact on absenteeism.

The result (Figure 1) obtained by the application of principal component analysis shows that the absenteeism classes are not clearly separated using the data from work environment.

Figure 1.

Principal component analysis on absenteeism.

Results of Imbalanced Data With Random Forest

The class of outcome variable contains 33 unjustified absenteeism cases and 707 presenteeism or justified absenteeism cases. Random forest algorithm gives accuracy of 0.957 with the original data, 0.940 with over-sampled data, and 0.880 with under-sampled data.
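Because the classes are highly imbalanced, it can help to compare these accuracies with a trivial reference point. The sketch below, which is an added illustration and not part of the original analysis, computes the accuracy of a classifier that always predicts the majority class, using the class counts reported above, together with a balanced accuracy helper (a hypothetical name) that weights both classes equally.

```python
# Class counts reported for this dataset.
n_minority = 33   # unjustified absenteeism
n_majority = 707  # presenteeism or justified absenteeism

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = n_majority / (n_minority + n_majority)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# The always-majority classifier finds no minority cases at all.
trivial_balanced = balanced_accuracy(tp=0, fn=n_minority, tn=n_majority, fp=0)
print(round(baseline_accuracy, 3), trivial_balanced)
```

The baseline is roughly 0.955, so accuracies in this range are best read alongside class-sensitive measures such as sensitivity or balanced accuracy.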

Network Analysis

Three different networks are constructed: one with all instances, one with instances for drinkers (Figure 2), and one with instances for non-drinkers (Figure 3). The network for all instances does not represent specific interactions other than the relation between height and body mass index. Since the formula for body mass index calculation includes height as a parameter, this result is plausible but not new.

Figure 2.

Network from Lasso model for drinkers only.

Figure 3.

Network from Lasso model for non-drinkers only.

In the network analysis of features for drinkers, the feature body mass index works as a hub with degree of 4 in the network. Four features including education, son, pet, and height have interactions with the feature, body mass index. Here, the variable, son, indicates the number of children and the variable, pet, indicates the number of pets. Co-existence of features such as body mass index, education, son, pet, and height is shown, and the change in a feature body mass index may affect the change in these four features.

In the network analysis of features for non-drinkers, there are two connected components. One connected component contains three features such as body mass index, pet and height. The feature, body mass index works as a hub with degree of 2 in the network. Two features including pet, and height have interactions with the feature, body mass index. The other connected component contains three features such as son, weight, and education. The feature, son works as a hub with degree of 2 in the network. Two features including weight and education have interactions with the feature, son. Co-existence of features such as body mass index, pet, and height, and the co-existence of features such as son, weight and education are shown. The change in the feature, body mass index may affect the change in two features pet and height, or the change in the feature, son, may affect the change in two features, weight and education.
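The hub and component structure described above can be reproduced with a short sketch. The edge lists below are transcribed from the textual description of the two networks only (the figures may contain additional edges), and the helper names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def connected_components(graph):
    """Return the connected components as a list of node sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Edges as described in the text for each group.
drinker_edges = [("bmi", "education"), ("bmi", "son"),
                 ("bmi", "pet"), ("bmi", "height")]
non_drinker_edges = [("bmi", "pet"), ("bmi", "height"),
                     ("son", "weight"), ("son", "education")]

drinkers = build_graph(drinker_edges)
non_drinkers = build_graph(non_drinker_edges)

# Degree of a node is the number of its neighbors.
drinker_degrees = {node: len(nbrs) for node, nbrs in drinkers.items()}
non_drinker_degrees = {node: len(nbrs) for node, nbrs in non_drinkers.items()}
print(drinker_degrees["bmi"], non_drinker_degrees["bmi"],
      non_drinker_degrees["son"], len(connected_components(non_drinkers)))
```

Comparing degrees and components in this way gives a simple, reproducible summary of how the drinker and non-drinker networks differ.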

In the prediction of absenteeism with features in the data, Ridge regression outperforms other penalized regression methods in terms of accuracy defined as the proportion of true positives and true negatives among all instances (Table 1).

Table 1.

Accuracy Results by Lasso, Elastic Net and Ridge Regression Methods.

Model	Ridge	Elastic net with alpha = .25	Elastic net with alpha = .5	Elastic net with alpha = .75	LASSO
Accuracy	0.930	0.916	0.913	0.912	0.910

Ridge regression method selected 6 factors as predictors which positively affect the absenteeism outcome (Table 2). Based on average coefficients, factors such as Transportation expense, age, hit target, son, weight, and body mass index positively affect the absenteeism at work. Since the month of absence and absenteeism time in hours directly affect the absenteeism, these two factors were selected but excluded.

Table 2.

Features Positively Correlated With Unjustified Absenteeism.

Factors	Averaged coefficients
Transportation expense	0.218
Age	0.264
Hit target	1.452
Son	11.436
Weight	1.229
Body mass index	4.029

The Ridge regression method selected 8 factors as predictors which negatively affect absenteeism (Table 3). Based on averaged coefficients, factors such as day of the week, seasons, distance from residence to work, service time, workload, education, pet, and height negatively affect absenteeism.

Table 3.

Features Negatively Correlated With Unjustified Absenteeism.

Factors	Averaged coefficients
Day of the week	−0.841
Seasons	−3.471
Distance from residence to work	−0.653
Service time	−1.528
Workload average day	−0.016
Education	−4.782
Pet	−13.882
Height	−1.843

Discussion and Implications

The statistical approach implements statistical analysis and machine learning methods considering the interactions of features selected and grouped by penalized regression and community detection (Kuzudisli et al., 2023). The proposed method computed odds ratio between the outcome, absenteeism, and categorical variables such as the event of disciplinary failure, drinking status and smoking status. The 95% confidence intervals of odds ratio between the outcome, absenteeism and categorical variables imply that disciplinary failure or smoking status are not directly related to absenteeism. Instead, the odds ratio between social drinking status and absenteeism shows that there is a possibility that drinkers are more likely to take absences than non-drinkers. When the odds ratio value is greater than 1 and the 95% confidence interval of odds ratio does not include the value of 1, it may indicate that the possibility that an outcome may occur given a factor is higher, compared to the possibility of the outcome occurring without that factor. To perform the association analysis of categorical variables, odds ratio values using a contingency table can be calculated.

We assume that studies on the analysis of absenteeism include the application of various machine learning algorithms and datasets from different sources. While the previous studies classify the specific outcome using machine learning algorithms with various factors, the proposed approach conducts machine learning algorithms and includes the associations of factors detected by network analysis. The proposed approach detects significant interactions of demographic, situational, and socioeconomic features in the workplace environment utilizing graphical lasso and community detection (Bakkeli, 2023). Graphical lasso estimates how each feature interacts with other features in the data. With the use of penalty term, graphical lasso produces an inverse covariance matrix after finding an optimized hyperparameter, avoiding overfitting in the model. Given estimated inverse covariance matrix, community detection method constructs a network which represents the interactions of features in the data. The network analysis can show hub nodes which have connections with many other nodes. It is crucial to identify hub nodes since any nodes having connections with many nodes may play a significant role in the mechanism of features in the network. Constructing network structures of features based on the stratification by a specific feature is needed in order to identify different network structures based on different groups of instances, inferring various important interpretations in the real-world data analysis.

The network from the original model and the drinking-specific models identified fundamental associations consistent with known epidemiological roles of the features (Beard et al., 2019). The common feature of these models is a separate network containing body mass index and other features. The feature body mass index plays a central role among the work environment features. Drinking-dependent functional differences in the workplace environment may underlie the observed drinking-specificity of responses to situational exposures to psychological decision (Rehm et al., 2017). The drinking-specificity of the social networks was investigated by grouping workers on the basis of whether they were social drinkers, producing subtly different networks. We observed differences in the connections with the features body mass index and son; in non-drinkers, son was connected with weight and education, and in drinkers, son was connected with body mass index.

Two-dimensional principal component analysis may not perfectly classify the outcomes. In this case, random forest or penalized regression can better predict the outcomes using the high-dimensional data. In the imbalanced data analysis, the outcome has majority and minority classes, and a new sampling strategy may be conducted to appropriately estimate the outcomes. In the real-world data application, random forest on the original data slightly outperformed the other classification methods, namely random forest on over-sampled or under-sampled data and penalized regression methods including Lasso, Elastic Net, and Ridge. To find the best performing machine learning methods for the analysis of workplace environment data, validation of various machine learning methods on differently sampled data can be conducted. The penalized regression analysis implements five penalized regression methods with alpha = 1, .75, .5, .25, or 0. When alpha is 1, Lasso is implemented to select the subset of features from the high-dimensional data. Since Lasso extracts selected significant features for the prediction of an outcome, Lasso might be expected to outperform other penalized regression methods in terms of accuracy. However, in the experiments described in this study, the Ridge regression method outperforms the other penalized regression methods. Based on the prediction result of ridge regression, features positively or negatively affecting absenteeism are estimated.

The feature age and employee absenteeism are associated in the classification analysis. Results indicate that both justified and unjustified absenteeism are associated with age (Martocchio, 1989). Employee absenteeism is associated with family-related attributes such as the number of children (Hysing et al., 2017). The results show that the feature son and employee absenteeism are significantly associated. Poor health conditions caused by obesity may adversely affect work productivity related to absenteeism (Destri et al., 2022). Obesity represented by high BMI implies a burden due to absenteeism. There may exist factors which may affect the interpretations of data analysis results. In a social group, cultural influences, individual differences and unique circumstances, organizational culture, the dynamic nature of absenteeism patterns over time, organizational policies and procedures, and organizational support may have an impact on absenteeism rates. For an individual employee’s attendance, mental and physical health issues, job flexibility, job satisfaction, job stress, job insecurity, and work-life balance may be significant factors.

The study’s approaches have certain limitations. Utilizing publicly available data may result in a limited sample size and a limited set of features, which may affect the specificity of data analysis. While the study conducts statistical methods and network analysis, appropriate regularization for the penalized regression and the repetition of k-fold cross validation may need to be reviewed. In addition, various machine learning algorithms and statistical tests other than odds ratio tests can be considered. As a future research direction for the valid design of statistical experiments, the control over external variables, potential confounding variables, potential biases or outliers in data collection, and interaction effects between factors can be examined.

Conclusion

An innovative approach evaluates the inverse covariance matrix representing the associations of features in the data and predicts an outcome using high-dimensional features. The proposed method implements odds ratio, network analysis, principal component analysis, random forest and penalized regression methods. This approach enables analysis on the relations of both categorical and continuous variables and evaluates the predictive performance of each machine learning algorithm. The proposed approach can be applied in the studies which use the machine learning techniques for public health and social impact problems such as multiple organizational studies (Leso et al., 2023).

The proposed approach is tested with workplace environment data with demographic, situational, or socioeconomic features which influence the status of absenteeism. The limitation of this method can be addressed by applying the proposed methods to data with various features. An optimal model to obtain the best predictive accuracy can be derived if we use network structures of features and prior knowledge. The generalizability of the proposed methods in the real data analysis may need to be examined further to assess their applicability to classify outcomes with a larger set of high-dimensional features in the domain of business and public health (Al-Raeei, 2024).

In summary, the aim of the proposed approach on real-world data is to efficiently and accurately predict the outcome. The proposed methods can play a significant role as a part of a research methodology to evaluate the interactions of social and psychological factors that may affect absenteeism related to workplace environment (Born et al., 2016; J. W. Lee et al., 2021; Mottaz & Potts, 1986). The proposed methods can be applied to similarly structured high-dimensional management data to estimate the complex associations between significant features and an outcome using information from workplace environment.

Footnotes

ORCID iDs

Donggeun Kim

Jai Woo Lee

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This article is financially supported by the 2025 College of Public Policy at Korea University.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability

Data used in this study were downloaded from a public data repository and have been described in the Materials and Methods section. The source codes of statistical simulation are available in the following link:

References

Aboagye

Björklund

Gustafsson

Hagberg

Aronsson

Marklund

Leineweber

Bergström

(2019). Exhaustion and impaired work performance in the workplace associations with presenteeism and absenteeism. Indian Journal of Occupational and Environmental Medicine, 61(11), E438–E444. https://doi.org/10.1097/jom.0000000000001701

Al-Raeei

(2024). When AI goes wrong: Fatal errors in oncological research reviewing assistance Open AI based. Oral Oncology Reports, 10, 100292. https://doi.org/10.1016/j.oor.2024.100292

Amer, S. A., Elotla, S. F., Ameen, A. E., Shah, & Fouad, A. M. (2022). Occupational burnout and productivity loss: A cross-sectional study among academic university staff. Frontiers in Public Health, 10, 861674. https://doi.org/10.3389/fpubh.2022.861674

Arashi, Roozbeh, Hamzah, N. A., & Gasparini (2021). Ridge regression and its applications in genetic studies. PLoS One, 16(4), e0245376. https://doi.org/10.1371/journal.pone.0245376

Bacharach, S. B., Bamberger, & Biron (2010). Alcohol consumption and workplace absenteeism: The moderating effect of social support. Journal of Applied Psychology, 95(2), 334–348. https://doi.org/10.1037/a0018018

Bakkeli, N. Z. (2023). Predicting COVID-19 exposure risk perception using machine learning. BMC Public Health, 23(1), 1377. https://doi.org/10.1186/s12889-023-16236-z

Beard, Brown, West, Kaner, Meier, Boniface, & Michie (2019). Associations between socio-economic factors and alcohol consumption: A population survey of adults in England. PLoS One, 14(2), e0209442. https://doi.org/10.1371/journal.pone.0209442

Belloni, Carrino, & Meschi (2022). The impact of working conditions on mental health: Novel evidence from the UK. Labour Economics, 76, 102176. https://doi.org/10.1016/j.labeco.2022.102176

Boise & Neal, M. B. (1996). Family responsibilities and absenteeism: Employees caring for parents versus employees caring for children. Journal of Managerial Issues, 8, 218–238. https://www.jstor.org/stable/40604102

Born, Akkerman, & Thommes (2016). Peer influence on protest participation: Communication and trust between co-workers as inhibitors or facilitators of mobilization. Social Science Research, 56, 58–72. https://doi.org/10.1016/j.ssresearch.2015.11.003

Bryan, M. L., Bryce, A. M., & Roberts (2021). The effect of mental and physical health problems on sickness absence. European Journal of Health Economics, 22(9), 1519–1533. https://doi.org/10.1007/s10198-021-01379-w

Bubonya, Cobb-Clark, D. A., & Wooden (2017). Mental health and productivity at work: Does what you do matter? Labour Economics, 46, 150–165. https://doi.org/10.1016/j.labeco.2017.05.001

de Oliveira, E. L., Torres, J. M., Moreira, R. S., & de Lima, R. A. F. (2019). Absenteeism prediction in call center using machine learning algorithms. In New knowledge in information systems and technologies (Vol. 1, pp. 958–968). Springer International Publishing.

Destri, Alves, Gregório, M. J., Dias, S. S., Henriques, A. R., Mendonça, Canhão, & Rodrigues, A. M. (2022). Obesity-attributable costs of absenteeism among working adults in Portugal. BMC Public Health, 22(1), 978. https://doi.org/10.1186/s12889-022-13337-z

de Wit, Wind, Hulshof, C. T. J., & Frings-Dresen, M. H. W. (2018). Person-related factors associated with work participation in employees with health problems: A systematic review. International Archives of Occupational and Environmental Health, 91(5), 497–512. https://doi.org/10.1007/s00420-018-1308-5

Doerken, Avalos, Lagarde, & Schumacher (2019). Penalized logistic regression with low prevalence exposures beyond high dimensional settings. PLoS One, 14(5), e0217057. https://doi.org/10.1371/journal.pone.0217057

Engebretsen & Bohlin (2019). Statistical predictions with glmnet. Clinical Epigenetics, 11(1), 123. https://doi.org/10.1186/s13148-019-0730-1

Fortunato (2010). Community detection in graphs. Physics Reports, 486(3), 75–174. https://doi.org/10.1016/j.physrep.2009.11.002

Fotouhi, Asadi, & Kattan, M. W. (2019). A comprehensive data level analysis for cancer diagnosis on imbalanced data. Journal of Biomedical Informatics, 90, 103089. https://doi.org/10.1016/j.jbi.2018.12.003

Friedman, Hastie, & Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. https://doi.org/10.18637/jss.v033.i01

Parikh, C. R., & Zhou (2017). Penalized variable selection in competing risks regression. Lifetime Data Analysis, 23(3), 353–376. https://doi.org/10.1007/s10985-016-9362-3

Ganaie, M. A., Tanveer, Suganthan, P. N., & Snasel (2022). Oblique and rotation double random forest. Neural Networks, 153, 496–517. https://doi.org/10.1016/j.neunet.2022.06.012

Goutman, S. A., Boss, Godwin, Mukherjee, Feldman, E. L., & Batterman, S. A. (2022). Associations of self-reported occupational exposures and settings to ALS: A case-control study. International Archives of Occupational and Environmental Health, 95(7), 1567–1586. https://doi.org/10.1007/s00420-022-01874-4

Grimani, Aboagye, & Kwak (2019). The effectiveness of workplace nutrition and physical activity interventions in improving productivity, work performance and workability: A systematic review. BMC Public Health, 19(1), 1676. https://doi.org/10.1186/s12889-019-8033-1

Hashemi, N. S., Skogen, J. C., Sevic, Thørrisen, M. M., Rimstad, S. L., Sagvaag, Riper, & Aas, R. W. (2022). A systematic review and meta-analysis uncovering the relationship between alcohol consumption and sickness absence. When type of design, data, and sickness absence make a difference. PLoS One, 17(1), e0262458. https://doi.org/10.1371/journal.pone.0262458

Hausknecht, J. P., Hiller, N. J., & Vance, R. J. (2008). Work-unit absenteeism: Effects of satisfaction, commitment, labor market conditions, and time. Academy of Management Journal, 51(6), 1223–1245. https://doi.org/10.5465/amj.2008.35733022

Huang, Y. J., T. P., & Hsiao, C. K. (2020). Application of graphical lasso in estimating network structure in gene set. Annals of Translational Medicine, 8(23), 1556. https://doi.org/10.21037/atm-20-6490

Hysing, Petrie, K. J., Bøe, & Sivertsen (2017). Parental work absenteeism is associated with increased symptom complaints and school absence in adolescent children. BMC Public Health, 17, 439. https://doi.org/10.1186/s12889-017-4368-7

Jolliffe, I. T., & Cadima (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202. https://doi.org/10.1098/rsta.2015.0202

Kamarulzaman, Saleh, A. A., Hashim, S. Z., Hashim, & Abdul-Ghani, A. A. (2011, July 11–12). An overview of the influence of physical office environments towards employees [Conference session]. Procedia Engineering [2nd international building control conference], Penang, Malaysia.

Khalilia, Chakraborty, & Popescu (2011). Predicting disease risks from highly imbalanced data using random forest. BMC Medical Informatics and Decision Making, 11, 51. https://doi.org/10.1186/1472-6947-11-51

Kuzudisli, Bakir-Gungor, Bulut, Qaqish, & Yousef (2023). Review of feature selection approaches based on grouping of features. PeerJ, 11, e15666. https://doi.org/10.7717/peerj.15666

Lee, D. W., Lee, Kim, H. R., & Kang, M. Y. (2021). Health-related productivity loss according to health conditions among workers in South Korea. International Journal of Environmental Research and Public Health, 18(14), 7589. https://doi.org/10.3390/ijerph18147589

Lee, J. W., Zhou, Moen, E. L., Punshon, Hoen, A. G., Romano, M. E., Karagas, M. R., & Gui (2021). Prediction of an outcome using NETwork clusters (NET-C). Computational Biology and Chemistry, 90, 107425. https://doi.org/10.1016/j.compbiolchem.2020.107425

Leso, B. H., Cortimiglia, M. N., & Ghezzi (2023). The contribution of organizational culture, structure, and leadership factors in the digital transformation of SMEs: A mixed-methods approach. Cognition, Technology & Work, 25(1), 151–179. https://doi.org/10.1007/s10111-022-00714-2

Lever, Krzywinski, & Altman (2017). Principal component analysis. Nature Methods, 14(7), 641–642. https://doi.org/10.1038/nmeth.4346

Liu, Palatucci, & Zhang (2009). Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery [Conference session]. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada. https://doi.org/10.1145/1553374.1553458

Luke, D. A., & Harris, J. K. (2007). Network analysis in public health: History, methods, and applications. Annual Review of Public Health, 28, 69–93. https://doi.org/10.1146/annurev.publhealth.28.021406.144132

Markussen, Røed, Røgeberg, O. J., & Gaure (2011). The anatomy of absenteeism. Journal of Health Economics, 30(2), 277–292. https://doi.org/10.1016/j.jhealeco.2010.12.003

Martiniano & Ferreira (2018). Absenteeism at work. UCI Machine Learning Repository. https://doi.org/10.24432/C5X882

Martocchio, J. J. (1989). Age-related differences in employee absenteeism: A meta-analysis. Psychology and Aging, 4(4), 409–414. https://doi.org/10.1037/0882-7974.4.4.409

Marzan, M. B., Callinan, Livingston, & Jiang (2023). Dose-response relationship between alcohol consumption and workplace absenteeism in Australia. Drug and Alcohol Review, 42(7), 1773–1784. https://doi.org/10.1111/dar.13726

Mottaz & Potts (1986). An empirical evaluation of models of work satisfaction. Social Science Research, 15(2), 153–173. https://doi.org/10.1016/0049-089X(86)90013-X

Mullah, M. A. S., Hanley, J. A., & Benedetti (2021). LASSO type penalized spline regression for binary data. BMC Medical Research Methodology, 21(1), 83. https://doi.org/10.1186/s12874-021-01234-9

Park, Sim, Lee, Kim, Yun, & Yoon, J. H. (2024). Comparison of the association between presenteeism and absenteeism among replacement workers and paid workers: Cross-sectional studies and machine learning techniques. Safety and Health at Work, 15(2), 151–157. https://doi.org/10.1016/j.shaw.2024.03.001

Putnam, McKibbin, & Wachs, J. E. (2004). Managing workplace depression: An untapped opportunity for occupational health professionals. AAOHN Journal, 52(3), 122–129, quiz 130–131.

Ranganathan, Aggarwal, & Pramesh, C. S. (2015). Common pitfalls in statistical analysis: Odds versus risk. Perspectives in Clinical Research, 6(4), 222–224. https://doi.org/10.4103/2229-3485.167092

Rehm, Gmel, G. E., Gmel, Hasan, O. S. M., Imtiaz, Popova, Probst, Roerecke, Room, Samokhvalov, A. V., Shield, K. D., & Shuper, P. A. (2017). The relationship between different dimensions of alcohol use and the burden of disease: An update. Addiction, 112(6), 968–1001. https://doi.org/10.1111/add.13757

Riedy, Dawson, Fekedulegn, Andrew, Vila, & Violanti, J. M. (2020). Fatigue and short-term unplanned absences among police officers. Policing: An International Journal, 43(3), 483–494. https://doi.org/10.1108/pijpsm-10-2019-0165

Rista, Ajdari, & Zenuni (2020). Predicting and analyzing absenteeism at workplace using machine learning algorithms [Conference session]. 2020 43rd international convention on information, communication and electronic technology (MIPRO 2020).

Sarker, A. R., Sultana, Mahumud, R. A., Ahmed, M. W., Hoque, M. E., Islam, Gazi, & Khan, J. A. M. (2016). Effects of occupational illness on labor productivity: A socioeconomic aspect of informal sector workers in urban Bangladesh. Journal of Occupational Health, 58(2), 209–215. https://doi.org/10.1539/joh.15-0219-FS

Schmidt, S. A. J., Sørensen, H. T., Langan, S. M., & Vestergaard (2021). Associations of lifestyle and anthropometric factors with the risk of herpes zoster: A nationwide population-based cohort study. American Journal of Epidemiology, 190(6), 1064–1074. https://doi.org/10.1093/aje/kwab027

Simon, S. D. (2001). Understanding the odds ratio and the relative risk. Journal of Andrology, 22(4), 533–536.

Sowjanya, A. M., & Mrudula (2022). Effective treatment of imbalanced datasets in health care using modified SMOTE coupled with stacked deep learning algorithms. Applied Nanoscience, 13, 1829–1840. https://doi.org/10.1007/s13204-021-02063-4

van Den Heuvel, S. G., Geuskens, G. A., Hooftman, W. E., Koppes, L. L., & van Den Bossche, S. N. (2010). Productivity loss at work; health-related and work-related factors. Journal of Occupational Rehabilitation, 20(3), 331–339. https://doi.org/10.1007/s10926-009-9219-7

Vignoli, Guglielmi, Bonfiglioli, & Violante, F. S. (2016). How job demands affect absenteeism? The mediating role of work–family conflict and exhaustion. International Archives of Occupational and Environmental Health, 89(1), 23–31. https://doi.org/10.1007/s00420-015-1048-8

Wang (2019). Factors associated with high psychological distress in primary carers of people with disability. Australian Journal of General Practice, 48(4), 234–238.

Wong, Kramer, S. C., Piccininni, Rohmann, J. L., Kurth, Escolano, Grittner, & Domenech de Cellès (2023). Using LASSO regression to estimate the population-level impact of pneumococcal conjugate vaccines. American Journal of Epidemiology, 192(7), 1166–1180. https://doi.org/10.1093/aje/kwad061

Ladouceur, Dastani, Richards, J. B., Ciampi, & Greenwood, C. M. (2012). Multiple regression methods show great potential for rare variant association tests. PLoS One, 7(8), e41694. https://doi.org/10.1371/journal.pone.0041694

Zhao, Wong, Z. S. Y., & Tsui, K. L. (2018). A framework of rebalancing imbalanced healthcare data for rare events' classification: A case of look-alike sound-alike mix-up incident detection. Journal of Healthcare Engineering, 2018, 1–11. https://doi.org/10.1155/2018/6275435