Abstract
This article integrates scholarship on determinants of labor market transitions with the flexible modeling strategies available through machine learning. Leveraging population registers from Finland, this study analyzes how well we can predict (un)employment and for whom we can (not) predict. The study finds that the predictability of long-term (un)employment improves when using tree-based nonparametric models compared to logistic regression but that predictability varies substantially across outcome groups. None of the models predict very accurately for the unemployed net of baseline likelihood to be unemployed—a group that is well researched and often of interest when designing policies and interventions. Overall accuracy is driven by correct predictions for the employed, while most of the unemployed are predicted incorrectly. Additionally, the outcomes for individuals with low parental education are overall more difficult to predict than the outcomes for individuals with mid/high parental education, although prediction of unemployment improves slightly in the low parental education group.
The transition into the labor market is one of the most important markers in the life course of young adults (Hogan and Astone 1986; Shanahan 2000). Typically, this transition occurs in the demographically dense life phase of early adulthood, meaning that it happens in combination with or in close proximity to several other transitions, such as finishing education, moving into independent housing, partnership formation, and having children (Rindfuss 1991). Most individuals transition into the labor market smoothly, but for those who do not, unemployment that occurs and persists in early adulthood can have lasting scarring effects on future life outcomes and societal attachment (e.g., Abebe and Hyggen 2019; Azzollini 2023; Helbling, Sacchi, and Imdorf 2019; Mousteri, Daly, and Delaney 2018).
The determinants of unemployment in early adulthood have been researched extensively and are understood to be complex processes (see e.g., Caspi et al. 1998; Doku et al. 2018; Lallukka et al. 2019). However, existing modeling strategies have not sufficiently operationalized this complexity. Regression models typically used in modeling assume linear associations, rarely account for interactions beyond two variables, generally rely on average effects, and are not optimized for simultaneously modeling highly correlated variables. Furthermore, the “all else being equal” hypothesis testing framework is not ideal for modeling complex disadvantages.
Recent studies have started to explore the potential of predictive modeling strategies that leverage machine learning (ML) methods to study life outcomes (e.g., Salganik et al. 2020; Savcisens et al. 2024). I argue that this approach is likely to be the most suitable also for learning about predictors of unemployment, given that it has the greatest potential fit (see Verhagen 2024). Supervised machine learning (SML) methods flexibly search for functions f(X) to predict an output (Y) given an input (X) in order to forecast (Y) for future inputs (X) (Molina and Garip 2019). They can account for nonlinearity and multiple simultaneous processes, create flexible higher-order interactions using the most predictive thresholds of variables, do not rely on average effects, and can deal with multicollinearity by averaging across uncorrelated models. Thus, these modeling approaches can address some of the shortcomings of theories for particularly complex processes (e.g., Sun 2024).
However, we still know fairly little about how well such models actually work for modeling and predicting individuals’ life outcomes and for whom. Lately, artificial intelligence (AI) applications relying on ML prediction have been increasingly used in policy contexts, such as aiding welfare interventions or court rulings. However, the scientific literature on social prediction is so far insufficient. Increasing our understanding of how well predictive models work for social prediction—and for whom they do not work—is thus needed.
The current literature in the field of life course prediction has explored the extent to which human life courses are predictable and what drives overall prediction error (Lundberg et al. 2024; Salganik et al. 2020; Savcisens et al. 2024). Building on this literature, I explore the overall predictability of (un)employment in a Nordic welfare state context and the relative (un)predictability of subgroups by (1) investigating if there are predictability differences by outcome category and (2) analyzing individuals with low socioeconomic status (SES) separately from the others.
The research questions are the following:
Research Question 1: To what extent can we predict (un)employment in early adulthood?
Research Question 2: Are there differences in predictability across outcome groups?
Research Question 3: Are some SES groups more predictable than others?
In this article, I analyze the full cohort of individuals born in Finland in 1987 and leverage domain-specific knowledge—theories and empirical work on acquiring human, social, and personal capital for employability and the intergenerational processes of transmitting them (see e.g., Caspi et al. 1998; Coleman 1988; Doku et al. 2018)—to choose potential predictors from life course data throughout ages 0 to 24, including intergenerational markers from parents, to predict long-term unemployment at ages 25 to 30.
Background
In this section, I summarize the theoretical approaches and empirical work important for predicting (un)employment. These are further used for choosing the set of independent variables for the analysis. I also summarize the current state of literature in social prediction, which this article builds on.
Theoretical Approaches
The way life course experiences contribute to becoming unemployed at a young age can be viewed from multiple theoretical perspectives. One comprehensive framework argues it is driven by an individual’s acquisition of human, social, and personal capital (Caspi et al. 1998). Human capital refers to the skills, qualifications, knowledge, and resources needed for becoming employed—for example, literacy, numeracy, educational attainment, and qualifications (Becker 1975). A higher level of human capital is thus likely to improve an individual’s employability. Social networks, in terms of, for example, family, friends, or colleagues, are a type of social capital that provides or controls access to resources needed for acquiring and maintaining a job (Coleman 1988). Having access to networks that can help in finding a first job or moving from a precarious job arrangement to a more stable one is likely to be particularly important for young individuals. Additionally, social networks operate through forms of social control and transmission of norms and values. Finally, personal capital consists of behavioral and psychological characteristics and resources linked to motivation and capacity to work (Caspi et al. 1998). Motivation to work has several explanations in the scientific field (Jahoda 1981) but is nevertheless one key factor in who finds employment and maintains it and who does not. Aspects of personal capital can be conceptualized, for example, in terms of symptoms of mental illness, antisocial behavior, and substance abuse.
There is good reason to believe that the predictability of unemployment in early adulthood is connected to social origin. Intergenerational transmission of SES and (dis)advantage has been studied for over a century with different approaches (see e.g., Ganzeboom, Treiman, and Ultee 1991). A vast body of literature has found positive associations between the human, social, and personal capital of parents and children (see e.g., Black and Devereux 2010; Black, Devereux, and Salvanes 2005; Caspi et al. 1998; Currie 2009; Doku et al. 2018; Mood 2017). One process through which human capital is transmitted is through parents making active investments into the child and their future employability, in other words, to the accumulation of the child’s human capital. The link between parents’ and child’s social capital has been suggested to originate from parents having access to useful networks that can benefit the child and from parents being physically and emotionally available for the child. This also translates to an interplay between social and human capital: If the family has strong social capital—including close relationships—the human capital of the parents can more easily be transmitted to the children (Coleman 1988). A lack of parental human and/or social capital can thus translate to lower chances for their child of finding or maintaining employment. Social capital in the family can also be understood in terms of structurally organized homes and family processes of informal social control, where children can be negatively affected if the home environment does not provide enough safety and stability in terms of parental resources to take care of the child. A disorganized and unstable home environment has been associated with antisocial and criminal behavior (Sampson and Laub 1994), which again has been connected to unemployment and precarious work arrangements in later life (Sanford et al. 1994). 
Such family conditions can also be a fertile ground for mental health and substance abuse problems that can operate either through diminished opportunities to accumulate human and social capital or psychological distress that affects abilities to apply for and maintain a job (Wadsworth et al. 2005). Moreover, motivation to work and values toward it can be transmitted intergenerationally. Finally, parental job loss has been found to reduce children’s school performance and educational attainment (Lindemann and Gangl 2019; Mooi-Reci et al. 2019; Rege, Telle, and Votruba 2011), and unemployment spells themselves can be reproduced intergenerationally (Clark and Lepinteur 2019). Taking all this into account, I expect that the (un)employment of individuals with low parental SES is more difficult to predict than for mid/high SES given that there exist complex, overlapping processes producing variance that is more difficult to capture with an overall pattern among low-SES individuals.
Empirical Studies on Risk Factors for Unemployment in Young Adulthood
Several factors in the life course together play a role in how young individuals find their way into employment and integrate into the labor market. The accumulation of measures of disadvantage is likely to increase the probability of unemployment (e.g., Lallukka et al. 2019). In addition to macro-level factors, such as economic conditions and educational systems, there are individual differences in skills, abilities, and opportunities that affect whether a person ends up unemployed (Caspi et al. 1998). Some of them include school performance, early work experience, health and well-being, delinquency behavior, and parental resources, such as SES.
A Finnish study on social- and health-related precursors of long-term unemployment in early adulthood (25–28 years) found that grade point average was strongly associated with early adulthood unemployment. The ego’s mental health was also among the most important variables associated with unemployment, as were parental education level, young age of the mother, and involvement with child protective services, specifically after the age of 12. One of their key findings was that accumulation of measures of disadvantage increased the probability of unemployment (Lallukka et al. 2019).
Other studies have also found that life course characteristics and events begin shaping unemployment trajectories well before the transition to labor market happens. Another study in the context of Finland on precursors of youth unemployment trajectories between ages 16 and 28 found poor school achievement and low education level to be strongly associated with high unemployment (Doku et al. 2018). In addition, both parental and grandparental low education level and SES were found to strongly increase ego’s risk for unemployment.
Another study of long-term unemployment between ages 15 and 21 in New Zealand used precursors measured at different stages during childhood (Caspi et al. 1998). Some of the most important precursors measured at age 15 were not having obtained a school certificate, having low reading skills, and showing symptoms of delinquency. In line with Lallukka et al. (2019), Caspi et al. (1998) found that accumulation of disadvantage factors, such as parent’s low occupational status, single-parent family, and ego’s behavioral problems, starting from early childhood and continuing to adolescence, increased unemployment risk. They also tested whether the precursors were the same for lack of education, potentially explaining the precursors of unemployment: They found that factors such as having not obtained a certificate, low reading skills, and delinquency continued to significantly contribute to unemployment after adjusting for lack of education.
Machine Learning for Life Course Research
Machine learning is a relatively new and underutilized approach in the social sciences and life course research, with more applications being called for (Hofman et al. 2021; Lundberg, Brand, and Jeon 2022; Verhagen 2022). Most previous quantitative research leans on an approach where hypotheses and statistical models are built by the researcher based on theory and then tested to assess the generalizability of the theory. ML instead approaches data agnostically, without a prespecified statistical model, and aims at capturing as much of the complexity it can, from, for example, life course data. In the context of computer sciences, Breiman (2001) referred to the former as the data modeling culture and the latter as the algorithmic. The algorithmic modeling culture makes use of ML algorithms and typically applies an exploratory approach, starting with observations, deriving a statistical model from the observations using an algorithm, testing the model on unseen data, and ending up with associations from the data. It is a highly data-driven approach to research, where the model reflects the data used for training it.
The current state of the life course prediction literature presents some very promising directions for further research. Savcisens et al. (2024) leveraged Danish population registers and survey data in combination with a predictive language model and found that their model outperformed other commonly used ones in predictive power. They also found some subgroup heterogeneity in the prediction outcomes in terms of sex and age group. Salganik et al. (2020) conducted a mass collaboration with longitudinal panel data from the United States and concluded that prediction error was strongly clustered within certain families. This was partly due to the mean variance in individuals’ outcomes in the given task and partly to the difference between the observed and predicted mean outcomes (Lundberg et al. 2024). The empirical examples in the literature thus give good reason to expect some groups to have good predictability and some to have poor predictability, suggesting that more predictive applications are needed in the social sciences to understand whether current theories reflect the social processes under observation and for whom they fail. The findings by Salganik et al. (2020) and Savcisens et al. (2024) teach us to (1) ask more questions about for whom the human life course is (not) predictable and why, (2) ask whether some social processes have better predictive capacity than others, and (3) not solely focus on average prediction accuracy but also to understand what data-driven methods can teach us about diverging life courses.
In this study, I take these emerging modeling techniques and systematically review them in comparison to traditional parametric modeling techniques in the context of predicting the transition process into the labor market for young adults using the Nordic “big data” for research—administrative population registers. Studying which groups are easier or more difficult to predict will inform discussions about whether the current theories are complete and the extent to which they are applicable and generalizable to a given population. Moreover, although AI systems based on predictive ML are actively used for social prediction, we need to build the science to unpack it.
Data and Methods
The analyses are conducted using governmental register data from a Finnish Birth Cohort project, where 60,254 children who were born in 1987 are followed throughout their lives (Paananen and Gissler 2012). The data include individual-level, longitudinal administrative information from 10 different population register holders, covering many dimensions of well-being, life events, and demographic transitions and linking them together with unique personal identification numbers. In addition to the children born in 1987, their parents are linked to the data, and administrative register information is collected for them as well, sometimes even before the birth of the child. Such rich data enable us to understand the human life course, even intergenerationally, and build models for predicting life outcomes. For the purpose of studying long-term unemployment as an outcome, individuals who died during the review period were excluded from the population, as were individuals with an intellectual disability, resulting in an analytical sample of 58,179 individuals (Westerinen 2018). The data are provided by the Finnish Institute for Health and Welfare, who collect them from several administrative register holders and coordinate research done using the data.
Long-Term Unemployment in Early Adulthood
To understand whether the transition into the labor market is successful, I derive a binary outcome measure of long-term unemployment in early adulthood. The criterion is being registered as a job seeker for more than 365 consecutive days at any time from January 1, 2012, to May 31, 2017, that is, when the individuals are ages 25 to 30: 27.2 percent of the individuals in the analytical sample fulfill this condition. This age stage is commonly referred to as early adulthood and is defined in terms of a period after the phase of emerging adulthood (Arnett 2000), when more stable workforce attachment is assumed to form and when most individuals in the sample have completed their education. The operationalization includes all job seekers, such as the full-time unemployed (a little more than half of the job seekers), individuals involuntarily working part-time, and individuals in training due to difficulties in employability1 (Official Statistics of Finland 2015). The measure can be seen to capture a real-life phenomenon related to not making a normative transition into the labor market.
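The derivation of the outcome can be illustrated with a short sketch. The column names and spell-level structure here are illustrative assumptions, not the actual register schema; the logic simply flags anyone with a job-seeking spell of more than 365 consecutive days overlapping the review window.

```python
import pandas as pd

# Hypothetical spell-level job-seeker data; "person_id", "spell_start",
# and "spell_end" are assumed column names for illustration only.
def long_term_unemployed(spells: pd.DataFrame,
                         window_start="2012-01-01",
                         window_end="2017-05-31",
                         min_days=365) -> pd.Series:
    """Flag individuals with any registered job-seeking spell longer than
    `min_days` consecutive days within the review window."""
    s = spells.copy()
    # Clip each spell to the review window before measuring its length.
    start = s["spell_start"].clip(lower=pd.Timestamp(window_start))
    end = s["spell_end"].clip(upper=pd.Timestamp(window_end))
    s["days"] = (end - start).dt.days
    # One row per person: 1 if any clipped spell exceeds the threshold.
    return (s.groupby("person_id")["days"].max() > min_days).astype(int)
```

In the actual registers, consecutive registration would likely require merging adjacent spells first; the sketch treats each spell as already consolidated.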
Predictors
The predictor variables (Table 1) are chosen based on theories and previous research as described in the Background section. All predictors were measured before the outcome variable review period starts. Some are measured once, and some are divided into four age periods: before school (0–6 years), early school years (7–12), late school years (13–15), and after compulsory school (16–24). The predictors aim to capture ego’s characteristics, resources, skills, and measures of well-being important for transitioning to working life; parental resources that can be intergenerationally transmitted; and attributes in the home environment that may affect the child when growing up. See operationalizations in Appendix A in the supplemental material.
Predictors.
Analytical Approach
The study design is a classification task, aiming to predict whether individuals belong to class 0 (employed) or class 1 (unemployed). I apply three commonly used nonparametric SML algorithms: decision tree classifier (CART), random forest (RF), and gradient booster (GB). I compare their performance to a parametric baseline model, specifically, logistic regression. All three algorithms are based on classification trees detecting nonlinearities and predictive interactions from the data by recursively splitting the data into branches based on the variable with the most predictive power in the current sample.
The choice of algorithms was made by balancing assumptions of complexity and nonlinearity in the underlying data against computational complexity. These algorithms are also known to be suitable for the data typical for social sciences (Verhagen 2024). The CART algorithm creates one tree of complex but parsimonious higher-order interactions found to be predictive of the outcome. RF builds several uncorrelated trees from different subsets of the data and produces smooth predictions by averaging across them. GB builds several trees sequentially, and each new tree corrects the errors from the previous. GB typically produces the best prediction of the three (followed by RF and then CART; see e.g., Hastie, Tibshirani, and Friedman 2009) but also has the tendency to be sensitive to noise in the data and can boost bias originating from the data. Comparing all the aforementioned models gives a good overall picture of the predictability of the outcome. The logic of tree-based models is explained in more detail in Appendix B in the supplemental material.
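The four model families can be sketched side by side in scikit-learn. The synthetic data and the hyperparameters below are illustrative defaults, not the tuned settings used in the study.

```python
# Minimal comparison of the parametric baseline and the three
# tree-based learners described above, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logit": LogisticRegression(max_iter=1000),      # parametric baseline
    "cart": DecisionTreeClassifier(max_depth=5),     # one pruned tree
    "rf": RandomForestClassifier(n_estimators=200),  # averaged, decorrelated trees
    "gb": GradientBoostingClassifier(),              # sequential error correction
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

On real register data each of these would additionally need the preprocessing and class-balancing steps described below.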
The analytical strategy is summarized in Figure 1. The data are first split into two parts: the 20 percent holdout sample and the rest. The holdout sample is set aside from the analysis and is only used after model training to validate the models with previously unseen data. The remaining 80 percent is split into a training sample (60 percent) and a testing sample (20 percent) five times—a process called fivefold cross-validation. Cross-validation is done to reduce potential error due to a particular sample split. Class-balancing approaches are used for the training data to rule out that the model is biased toward the majority outcome class (see details in Appendix C in the supplemental material). The SML models are trained, applying a standard scaler and a k-nearest neighbors (k = 5) approach for imputing missing data, separately for each algorithm, with optimal hyperparameter settings tested previously using a grid search on a subsample of the data. The test data are used to assess how the current model specifications work in predicting the outcome, and the best class-balancing approach (synthetic minority oversampling technique [SMOTE]) is chosen for further analysis. The final models are validated with the holdout data. Model performance is discussed in terms of several evaluation measures to understand both the overall predictability and subgroup variations for each outcome class separately. After performing the analysis on the full holdout data, the training and validation steps are repeated for subsamples of the holdout set, divided into a low parental SES group versus mid-to-high parental SES. All analysis is carried out in Python (version 3.11; Van Rossum and Drake 2009) using the scikit-learn package (Pedregosa et al. 2011). Reporting of the results follows the recommendations for ML-based science proposed by Kapoor et al. (2024).
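The splitting, preprocessing, and cross-validation steps above can be sketched as follows. The data here are synthetic and the setup simplified; in particular, the SMOTE step is omitted because it comes from the separate imbalanced-learn package and must be applied inside each training fold only, never to test or holdout data.

```python
# Sketch of the holdout split, preprocessing pipeline, and fivefold
# cross-validation; all data and settings are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[rng.random(X.shape) < 0.05] = np.nan        # sprinkle in missing values
y = (rng.random(1000) < 0.27).astype(int)     # ~27 percent positive class

# Set aside a 20 percent holdout sample, stratified on the outcome.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),              # NaNs are ignored in fitting
    ("impute", KNNImputer(n_neighbors=5)),    # k-nearest neighbors imputation
    ("clf", GradientBoostingClassifier()),
])
cv_scores = cross_val_score(
    pipe, X_rest, y_rest,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
holdout_score = pipe.fit(X_rest, y_rest).score(X_hold, y_hold)
```

Wrapping the preprocessing steps in a Pipeline ensures that scaling and imputation are fit on the training folds only, avoiding leakage into the test folds.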

Analytical strategy.
Results
Evaluation Metrics
I computed several evaluation measures to understand how well the models perform, all ranging from 0 to 1. Accuracy indicates the proportion of correct predictions in the whole model, where 1 equals 100 percent accuracy. Mean squared error (MSE) measures the squared differences between predictions and observations, with the ideal value being 0. Precision is the proportion of true positive outcomes (observed and predicted unemployed) out of all positive predictions (all predicted unemployed), and recall is the proportion of true positive outcomes out of all positive observations (all observed unemployed); for both, a value of 1 indicates perfect performance. The F1 score is the harmonic mean of precision and recall, with 1 as the ideal. The false-positive rate (FP rate) indicates the proportion of those who were in fact employed but who were predicted to be unemployed, and the false-negative rate (FN rate) indicates the proportion of those who were in fact unemployed but who were predicted to be employed, both with ideal values of 0, which would indicate that everyone was accurately predicted.
Table 2 shows the evaluation metrics for all models averaged across all five folds (cross-validations produced robust results; see Appendix D in the supplemental material). Compared to the baseline model (logistic regression), all three nonparametric, nonlinear algorithms perform better in terms of both accuracy and MSE, with 9 percentage points of improvement in performance for the best performing algorithm (GB), suggesting that for the full population, a nonparametric, nonlinear algorithm is a better choice for prediction. Note, however, that even in the best performing model in terms of accuracy, the differences across the outcome groups are major: 7.3 percent of all employed will falsely be predicted to be unemployed, and 74.4 percent of all unemployed will falsely be predicted to be employed. Due to the application of class-balancing approaches, these differences are not driven by the imbalanced distribution of the outcome groups. This means that the results do not stem from the fact that it is more common to be employed than unemployed. Further inspecting the FN and FP rates, the nonlinear algorithms generally predict the group observed as unemployed very poorly (high FN rate) and the group observed as employed very well (low FP rate). Logistic regression has a more balanced performance for the two groups while still performing more poorly overall than the nonparametric algorithms and misclassifying 40 percent of the unemployed. Additional analysis in Appendix E in the supplemental material provides more class-specific evaluation measures.
Evaluation Metrics for Full Sample.
Predicted Probabilities
To examine the differences for the two outcome groups more closely, Figures 2 through 5 show the distribution of predicted probabilities for being employed and being unemployed, respectively, for logistic regression (Figures 2 and 3) and GB (Figures 4 and 5). CART and RF have similar patterns to GB (see Appendix F in the supplemental material). None of the models produce many certain predictions, that is, where there is low discrepancy between the observed and predicted. Instead, logistic regression produces predicted probabilities close to the decision threshold of 0.5 for employed and unemployed, making the certainty of the predictions low for both groups: The model is struggling to distinguish these groups from one another and needs to make guesses around the decision threshold. GB (and the two other nonparametric algorithms), on the other hand, is fairly sure about the correct predictions for the employed group, with a modal discrepancy of 0.2, but also tends to classify the cases observed unemployed mostly as employed with rather high confidence: The model likely highlights some of the characteristics the unemployed share with the employed, ignoring the potential noise. In short, logistic regression performs more poorly in terms of accuracy and error and also predicts both groups with low certainty—generally a bad trait for a predictive model—whereas the GB performs better overall, giving confident predictions—a good trait for a predictive model—but the performance is driven by correctly predicting the group observed as employed. See discussion about the decision threshold of the models in Appendix G in the supplemental material.

Distribution of predicted probabilities for the employed (0), logistic regression.

Distribution of predicted probabilities for the unemployed (1), logistic regression.

Distribution of predicted probabilities for the employed (0), gradient booster.

Distribution of predicted probabilities for the unemployed (1), gradient booster.
Differences by Socioeconomic Groups
To shed light on what social groups are easier or more difficult to predict, I repeated the validation step of the analysis but split the holdout data into two groups—low SES and mid-to-high SES—and compared the performance with the full model. Fivefold cross-validation was performed as in the full sample analysis. Low SES is defined as having no parent who has obtained a secondary degree (6.6 percent of the full sample). Appendix H (in the supplemental material) displays the results for this analysis: The mid-to-high SES group is easier to predict, likely due to the majority of the training sample belonging to this group. Note that a more inclusive definition of low SES also yields similar patterns (see Appendix I in the supplemental material). To account for potential effects of distribution shift—where the predictors have different distributions in the training and the holdout sets—I trained models for the low SES and mid-to-high SES groups separately. Table 3 displays the evaluation measures for the low SES group when the model is trained on the low SES group only, and Table 4 displays the corresponding evaluations for the mid-to-high SES group. For the tree-based models, the results point to the low SES group having poorer overall predictability: Accuracy is lower and MSE higher than for the mid-to-high group. However, looking at the FN and FP rates, we find that the unemployed have a lower overall misclassification in the low SES group than in the mid-to-high SES group. Regardless, the unemployed are still misclassified more than 50 percent of the time. In summary, the low SES group is more difficult to predict overall, but simultaneously, the unemployed are slightly better captured in this group than in the mid-to-high SES group. For more details about the distribution shift analysis, see Appendix J in the supplemental material.
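The within-stratum training used to address distribution shift can be sketched as follows. The group indicator and data are synthetic; the point is only that each model is fit and evaluated inside its own SES stratum rather than transferred across strata.

```python
# Sketch of subgroup-specific training: one model per (hypothetical)
# parental-SES stratum, evaluated with cross-validation inside the stratum.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 8))
y = (rng.random(1500) < 0.27).astype(int)
low_ses = rng.random(1500) < 0.066   # ~6.6 percent low parental SES, as in the text

scores = {}
for name, mask in {"low_ses": low_ses, "mid_high_ses": ~low_ses}.items():
    # Fit and evaluate within the stratum only, so training and
    # evaluation data share the same predictor distribution.
    scores[name] = cross_val_score(
        GradientBoostingClassifier(), X[mask], y[mask], cv=5).mean()
```

With a stratum as small as the low SES group, fold sizes shrink considerably, which is one reason its cross-validated estimates are noisier.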
Low Socioeconomic Status Training and Validation.
Mid-to-High Socioeconomic Status Training and Validation.
Conclusion and Discussion
The aim of this article was to contribute to the literature on the predictability of the human life course, focusing on labor market transition processes and for whom they are predictable, by exploring (1) whether there are differences in predictability across the outcome groups and (2) whether some socioeconomic groups are more predictable than others. The results suggest that the overall predictive capacity for (un)employment systematically increases with the use of decision-tree-based nonparametric models compared to logistic regression. This is consistent with the social prediction literature, which has typically found similar trends for other outcomes in the life course prediction framework (e.g., Salganik et al. 2020; Savcisens et al. 2024). However, the results from this study highlight that overall accuracy in this case is driven by the outcome group of the employed, for which the models reach up to 93 percent accuracy, while most of the unemployed are predicted incorrectly. Importantly, the class-balancing approaches applied to the training data rule out that the differences across the outcome groups are driven by their imbalanced distribution. This means that the results are not due to it being more common to be employed than unemployed. Furthermore, the results show that groups with low SES (in terms of parental education) are more difficult to predict than mid-to-high SES groups. In terms of the outcome groups within the low SES group, the employed remain better predicted than the unemployed even though the predictability of the unemployed improves compared to the main analysis.
These findings highlight our limited understanding of the processes under observation. Using models that are particularly designed to capture complex patterns (here: life courses) and a wide set of predictors chosen based on theories and previous empirical findings of an extensively researched topic, we struggle to understand the outcome group that is more vulnerable—the unemployed. This leads to questioning why we cannot generalize a social process so well researched and whether our current understanding of unemployment is even close to complete.
Several factors may help explain why this study fails to capture the processes leading to unemployment. Given detailed data and several flexible models that together capture the employed group well, the models apply the same logic to the unemployed as to the employed (i.e., misclassifying the unemployed as employed). This means that two individuals can be identical in their observed characteristics yet have different outcomes (see Lundberg et al. 2024). It is possible that some unobserved characteristics of the ego or the parents are critical in the processes leading to unemployment. The data used in the analysis are, however, particularly detailed, span many domains, and were chosen on the basis of previous empirical literature and theoretical understanding. The analysis did not account for macro-level conditions, such as economic conditions, educational systems, or job availability. Because the sample consists of a full birth cohort from a single year in one country, however, most macro conditions can be assumed to be the same for everyone. There can still be some regional variation, for instance in job availability, that the analysis could not account for. Finally, one possible explanation is a high level of randomness in the life courses of those who end up unemployed. Random events that throw individuals off the track to employment, and the possible stacking of such events within individuals, is a plausible explanation both statistically and substantively: pattern-finding models cannot generalize when random events abound, and vulnerable groups tend to cumulatively stack adverse life events (e.g., Fridell Lif et al. 2017). This potential randomness can be viewed as the unemployed having risk factors, or the employed having resilience factors, beyond the patterns that this binary outcome reveals.
These results also open the door for more studies using explainable AI techniques to reveal subgroup-specific characteristics of predictive models.
Whether these results are context dependent is an interesting question for future research. Finland has a welfare system in which the state has designed numerous universal policies to mitigate the effects of adversity. The poorer predictability of the low SES group in terms of (un)employment can be expected to differ in a context with a weaker social safety net, such as the United States. Without empirical analysis, however, it is difficult to say in which direction predictability would change. On one hand, a society with weaker social safety nets can contribute to the stacking of more disadvantage and random life events and thus make groups with fewer socioeconomic resources even less predictable. On the other hand, if the social system provides less assistance and support, the groups that are worse off could in fact be more predictable in their outcomes, their stacking adversities translating into a "destiny." If we had data on these adverse events in a context with a weaker social safety net, we might be able to find the patterns and predict better. In a welfare state context, the safety net can, in fact, be part of the random events that help some individuals fight adversity but simultaneously make prediction more difficult.
The analysis in this study has systematically estimated the predictability of the outcome (un)employment and, by doing so, emphasized our lack of understanding of the processes under investigation, particularly for the unemployed group. The results highlight the importance of modeling social processes in a way that aims to understand subgroups, particularly the more disadvantaged ones. The modeling approach presented in this article offers one way to systematically explore whether we in fact understand the processes we are addressing. The results also highlight that, for a social scientist applying these modeling approaches, the choice of model and evaluation measure depends on the context of application and the stakeholders' interests. In a scenario where the interest is in how well the models predict overall, as in the current academic literature (e.g., Salganik et al. 2020; Savcisens et al. 2024), the aim is to understand to what extent the human life is predictable. This is a case where overall performance measures, such as accuracy and MSE, are particularly useful. If instead we want to design policies with early interventions for people at risk of unemployment, we might be interested in capturing as many of the unemployed as possible. This is where optimizing for a low false-negative (FN) rate can be preferred. This may come at the cost of misclassifying more of the employed and potentially oversimplifying the paths to unemployment, but if interventions for the employed come with a lower cost/risk than no interventions for the unemployed, this may be a decision worth making. The results from this study illustrate that there is no single solution better than the others when it comes to applying ML approaches to understand the human life course.
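To make the trade-off between overall accuracy and the false-negative rate concrete, the following minimal Python sketch (illustrative only, not the article's code; the class counts and predictions are hypothetical) shows how a model can score high overall accuracy while missing most of the minority class:

```python
# Illustrative sketch: overall accuracy can mask a high false-negative
# rate for the minority class. Labels: 1 = unemployed (positive class),
# 0 = employed. All counts below are hypothetical.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def false_negative_rate(tp, fn):
    """Share of the truly unemployed that the model misses."""
    return fn / (tp + fn)

# Hypothetical evaluation set: 90 employed, 10 unemployed.
y_true = [0] * 90 + [1] * 10
# A model that predicts "employed" for nearly everyone:
y_pred = [0] * 98 + [1] * 2

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"accuracy = {accuracy(tp, fp, fn, tn):.2f}")     # 0.92: looks strong
print(f"FN rate  = {false_negative_rate(tp, fn):.2f}")  # 0.80: misses most unemployed
```

In this toy setup the model is 92 percent accurate overall yet misses 80 percent of the unemployed, which is why a policy-oriented application would evaluate (and possibly optimize) the FN rate directly rather than relying on accuracy alone.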
The implications of the findings of this study should be discussed not only academically but also with regard to social policymaking. What does it mean for designing interventions and supportive welfare policies if the groups that society aims to support are the ones we do not understand well and cannot generalize? Furthermore, when using AI tools to target interventions at groups that are likely to have more variability and randomness in their life courses than the average individual, model training and the model's generalizability across subgroups need to be carefully inspected. In light of these findings, the recent calls for integrating prediction-based modeling strategies alongside traditional regression-based ones to better understand society (e.g., Hofman et al. 2021) gain further support.
Supplemental Material
sj-docx-1-srd-10.1177_23780231241286655 – Supplemental material for "The (Un)Predictability of Early (Un)Employment: A Machine Learning Approach" by Sanni Kuikka, published in Socius.
Footnotes
Acknowledgements
I am grateful for comments from Maria Brandén, Stefanie Möllborn, and Sunnee Billingsley.
Author’s Note
The study processes information defined as sensitive personal data by the GDPR. The study was conducted with anonymized data, stored on a secure server hosted by the data provider, and processed through secure remote access with two-factor authentication. Additionally, the study has been approved by the Swedish Ethical Review Authority (Dnr 2022-03844-01). The data underlying this article were provided by the Finnish Institute for Health and Welfare by permission and cannot be shared publicly because they contain sensitive personal data about the individuals.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was done in affiliation with the project "Understanding Society through Register Based Machine Learning" at the Institute for Analytical Sociology, Linköping University, generously funded by the Swedish Research Council (Dnr 2019-00245).
1. Being registered as a job seeker at the Employment and Economic Development Office is a requirement for receiving unemployment benefits, the primary welfare subsidy for unemployed adults in the workforce, yet not a guarantee of receiving them. Individuals registered as job seekers for the purpose of having access to unemployment benefits can be found in varying situations: unemployed and looking for work; part-time workers receiving partial unemployment benefits to compensate for the loss of income due to lack of full-time employment; or previously unemployed, currently full-time degree students for whom further study has been found to increase employability and who are granted unemployment benefits (which are higher than student benefits) as an incentive to reeducate. They can also be temporarily laid off, in training, or in services promoting employment, and the type and amount of potential unemployment benefits can vary depending on one's label in the register. This definition of "unemployment" can be contrasted with the term "unemployed job seeker," which governmental agencies use to distinguish individuals who are fully unemployed and looking for full-time work from all those registered as unemployed. Moreover, the operationalization of the outcome as binary means that there is likely heterogeneity in both groups. The unemployed group includes all cases who need the state-provided unemployment benefit (e.g., the fully unemployed or those in need of part-time income compensation), whereas the employed group includes everyone who does not need the benefit (e.g., the full-time employed, the part-time employed, and inactive people with sufficient compensatory household income or wealth).
Author Biography
References
