Abstract
This article integrates scholarship on determinants of labor market transitions with the flexible modeling strategies available through machine learning. Leveraging population registers from Finland, this study analyzes how well we can predict (un)employment and for whom we can (not) predict. The study finds that the predictability of long-term (un)employment improves when using tree-based nonparametric models compared to logistic regression but that predictability varies substantially across outcome groups. None of the models predict very accurately for the unemployed net of baseline likelihood to be unemployed—a group that is well researched and often of interest when designing policies and interventions. Overall accuracy is driven by correct predictions for the employed, while most of the unemployed are predicted incorrectly. Additionally, the outcomes for individuals with low parental education are overall more difficult to predict than the outcomes for individuals with mid/high parental education, although prediction of unemployment improves slightly in the low parental education group.
The transition into the labor market is one of the most important markers in the life course of young adults (Hogan and Astone 1986; Shanahan 2000). Typically, this transition occurs in the demographically dense life phase of early adulthood, meaning that it happens in combination with or in close proximity to several other transitions, such as finishing education, moving into independent housing, partnership formation, and having children (Rindfuss 1991). Most individuals transition into the labor market smoothly, but for those who do not, unemployment that occurs and persists in early adulthood can have lasting scarring effects on future life outcomes and societal attachment (e.g., Abebe and Hyggen 2019; Azzollini 2023; Helbling, Sacchi, and Imdorf 2019; Mousteri, Daly, and Delaney 2018).
The determinants of unemployment in early adulthood have been researched extensively and are understood to be complex processes (see e.g., Caspi et al. 1998; Doku et al. 2018; Lallukka et al. 2019). However, existing modeling strategies have not sufficiently operationalized this complexity. Regression models typically used in modeling assume linear associations, rarely account for interactions beyond two variables, generally rely on average effects, and are not optimized for simultaneously modeling highly correlated variables. Furthermore, the “all else being equal” hypothesis testing framework is not ideal for modeling complex disadvantages.
Recent studies have started to explore the potential of predictive modeling strategies that leverage machine learning (ML) methods to study life outcomes (e.g., Salganik et al. 2020; Savcisens et al. 2024). I argue that this approach is likely to be the most suitable also for learning about predictors of unemployment, given that it has the greatest potential fit (see Verhagen 2024). Supervised machine learning (SML) methods flexibly search for functions f(X) to predict an output (Y) given an input (X) in order to forecast (Y) for future inputs (X) (Molina and Garip 2019). They can account for nonlinearity and multiple simultaneous processes, create flexible higher-order interactions using the most predictive thresholds of variables, do not rely on average effects, and can deal with multicollinearity by averaging across uncorrelated models. Thus, these modeling approaches can address some of the shortcomings of theories for particularly complex processes (e.g., Sun 2024).
However, we still know fairly little about how well such models actually work for modeling and predicting individuals’ life outcomes and for whom. Lately, artificial intelligence (AI) applications relying on ML prediction have been increasingly used in policy contexts, such as aiding welfare interventions or court rulings. However, the scientific literature on social prediction is so far insufficient. Increasing our understanding of how well predictive models work for social prediction—and for whom they do not work—is thus needed.
The current literature in the field of life course prediction has explored the extent to which human life courses are predictable and what drives overall prediction error (Lundberg et al. 2024; Salganik et al. 2020; Savcisens et al. 2024). Building on this literature, I explore the overall predictability of (un)employment in a Nordic welfare state context and the relative (un)predictability of subgroups by (1) investigating if there are predictability differences by outcome category and (2) analyzing individuals with low socioeconomic status (SES) separately from the others.
The research questions are the following:
Research Question 1: To what extent can we predict (un)employment in early adulthood?
Research Question 2: Are there differences in predictability across outcome groups?
Research Question 3: Are some SES groups more predictable than others?
In this article, I analyze the full cohort of individuals born in Finland in 1987 and leverage domain-specific knowledge—theories and empirical work on acquiring human, social, and personal capital for employability and the intergenerational processes of transmitting them (see e.g., Caspi et al. 1998; Coleman 1988; Doku et al. 2018)—to choose potential predictors from life course data throughout ages 0 to 24, including intergenerational markers from parents, to predict long-term unemployment at ages 25 to 30.
Background
In this section, I summarize the theoretical approaches and empirical work important for predicting (un)employment. These are further used for choosing the set of independent variables for the analysis. I also summarize the current state of literature in social prediction, which this article builds on.
Theoretical Approaches
The way life course experiences contribute to becoming unemployed at a young age can be viewed from multiple theoretical perspectives. One comprehensive framework argues it is driven by an individual’s acquisition of human, social, and personal capital (Caspi et al. 1998). Human capital refers to the skills, qualifications, knowledge, and resources needed for becoming employed—for example, literacy, numeracy, educational attainment, and qualifications (Becker 1975). A higher level of human capital is thus likely to improve an individual’s employability. Social networks, in terms of, for example, family, friends, or colleagues, are a type of social capital that provides or controls access to resources needed for acquiring and maintaining a job (Coleman 1988). Having access to networks that can help in finding a first job or moving from a precarious job arrangement to a more stable one is likely to be particularly important for young individuals. Additionally, social networks operate through forms of social control and transmission of norms and values. Finally, personal capital consists of behavioral and psychological characteristics and resources linked to motivation and capacity to work (Caspi et al. 1998). Motivation to work has several explanations in the scientific field (Jahoda 1981) but is nevertheless one key factor in who finds employment and maintains it and who does not. Aspects of personal capital can be conceptualized, for example, in terms of symptoms of mental illness, antisocial behavior, and substance abuse.
There is good reason to believe that the predictability of unemployment in early adulthood is connected to social origin. Intergenerational transmission of SES and (dis)advantage has been studied for over a century with different approaches (see e.g., Ganzeboom, Treiman, and Ultee 1991). A vast body of literature has found positive associations between the human, social, and personal capital of parents and children (see e.g., Black and Devereux 2010; Black, Devereux, and Salvanes 2005; Caspi et al. 1998; Currie 2009; Doku et al. 2018; Mood 2017). One process through which human capital is transmitted is through parents making active investments into the child and their future employability, in other words, to the accumulation of the child’s human capital. The link between parents’ and child’s social capital has been suggested to originate from parents having access to useful networks that can benefit the child and from parents being physically and emotionally available for the child. This also translates to an interplay between social and human capital: If the family has strong social capital—including close relationships—the human capital of the parents can more easily be transmitted to the children (Coleman 1988). A lack of parental human and/or social capital can thus translate to lower chances for their child of finding or maintaining employment. Social capital in the family can also be understood in terms of structurally organized homes and family processes of informal social control, where children can be negatively affected if the home environment does not provide enough safety and stability in terms of parental resources to take care of the child. A disorganized and unstable home environment has been associated with antisocial and criminal behavior (Sampson and Laub 1994), which again has been connected to unemployment and precarious work arrangements in later life (Sanford et al. 1994). 
Such family conditions can also be a fertile ground for mental health and substance abuse problems that can operate either through diminished opportunities to accumulate human and social capital or psychological distress that affects abilities to apply for and maintain a job (Wadsworth et al. 2005). Moreover, motivation to work and values toward it can be transmitted intergenerationally. Finally, parental job loss has been found to reduce children’s school performance and educational attainment (Lindemann and Gangl 2019; Mooi-Reci et al. 2019; Rege, Telle, and Votruba 2011), and unemployment spells themselves can be reproduced intergenerationally (Clark and Lepinteur 2019). Taking all this into account, I expect that the (un)employment of individuals with low parental SES is more difficult to predict than for mid/high SES given that there exist complex, overlapping processes producing variance that is more difficult to capture with an overall pattern among low-SES individuals.
Empirical Studies on Risk Factors for Unemployment in Young Adulthood
Several factors in the life course together play a role in how young individuals find their way into employment and integrate into the labor market. The accumulation of measures of disadvantage is likely to increase the probability of unemployment (e.g., Lallukka et al. 2019). In addition to macro-level factors, such as economic conditions and educational systems, there are individual differences in skills, abilities, and opportunities that affect whether a person ends up unemployed (Caspi et al. 1998). Some of them include school performance, early work experience, health and well-being, delinquency behavior, and parental resources, such as SES.
A Finnish study on social- and health-related precursors of long-term unemployment in early adulthood (25–28 years) found that grade point average was strongly associated with early adulthood unemployment. The ego’s mental health was also among the most important variables associated with unemployment, as were parental education level, young age of the mother, and involvement with child protective services, specifically after the age of 12. One of their key findings was that accumulation of measures of disadvantage increased the probability of unemployment (Lallukka et al. 2019).
Other studies have also found that life course characteristics and events begin shaping unemployment trajectories well before the transition to labor market happens. Another study in the context of Finland on precursors of youth unemployment trajectories between ages 16 and 28 found poor school achievement and low education level to be strongly associated with high unemployment (Doku et al. 2018). In addition, both parental and grandparental low education level and SES were found to strongly increase ego’s risk for unemployment.
Another study of long-term unemployment between ages 15 and 21 in New Zealand used precursors measured at different stages during childhood (Caspi et al. 1998). Some of the most important precursors measured at age 15 were not having obtained a school certificate, having low reading skills, and showing symptoms of delinquency. In line with Lallukka et al. (2019), Caspi et al. (1998) found that accumulation of disadvantage factors, such as parent’s low occupational status, single-parent family, and ego’s behavioral problems, starting from early childhood and continuing to adolescence, increased unemployment risk. They also tested whether the precursors were the same for lack of education, potentially explaining the precursors of unemployment: They found that factors such as having not obtained a certificate, low reading skills, and delinquency continued to significantly contribute to unemployment after adjusting for lack of education.
Machine Learning for Life Course Research
Machine learning is a relatively new and underutilized approach in the social sciences and life course research, with more applications being called for (Hofman et al. 2021; Lundberg, Brand, and Jeon 2022; Verhagen 2022). Most previous quantitative research leans on an approach where hypotheses and statistical models are built by the researcher based on theory and then tested to assess the generalizability of the theory. ML instead approaches data agnostically, without a prespecified statistical model, and aims at capturing as much of the complexity it can, from, for example, life course data. In the context of computer sciences, Breiman (2001) referred to the former as the data modeling culture and the latter as the algorithmic. The algorithmic modeling culture makes use of ML algorithms and typically applies an exploratory approach, starting with observations, deriving a statistical model from the observations using an algorithm, testing the model on unseen data, and ending up with associations from the data. It is a highly data-driven approach to research, where the model reflects the data used for training it.
The current state of the life course prediction literature presents some very promising directions for further research. Savcisens et al. (2024) leveraged Danish population registers and survey data in combination with a predictive language model and found that their model outperformed other commonly used ones in predictive power. They also found some subgroup heterogeneity in the prediction outcomes in terms of sex and age group. Salganik et al. (2020) conducted a mass collaboration with longitudinal panel data from the United States and concluded that prediction error was strongly clustered within certain families. This was partly due to the mean variance in individuals’ outcomes in the given task and partly to the difference between the observed and predicted mean outcomes (Lundberg et al. 2024). The empirical examples in the literature thus give good reason to expect some groups to have good predictability and some to have poor predictability, suggesting that more predictive applications are needed in the social sciences to understand whether current theories reflect the social processes under observation and for whom they fail. The findings by Salganik et al. (2020) and Savcisens et al. (2024) teach us to (1) ask more questions about for whom the human life course is (not) predictable and why, (2) ask whether some social processes have better predictive capacity than others, and (3) not solely focus on average prediction accuracy but also to understand what data-driven methods can teach us about diverging life courses.
In this study, I take these emerging modeling techniques and systematically review them in comparison to traditional parametric modeling techniques in the context of predicting the transition process into the labor market for young adults using the Nordic “big data” for research—administrative population registers. Studying which groups are easier or more difficult to predict will inform discussions about whether the current theories are complete and the extent to which they are applicable and generalizable to a given population. Moreover, although AI systems based on predictive ML are actively used for social prediction, we need to build the science to unpack it.
Data and Methods
The analyses are conducted using governmental register data from a Finnish Birth Cohort project, where 60,254 children who were born in 1987 are followed throughout their lives (Paananen and Gissler 2012). The data include individual-level, longitudinal administrative information from 10 different population register holders, covering many dimensions of well-being, life events, and demographic transitions and linking them together with unique personal identification numbers. In addition to the children born in 1987, their parents are linked to the data, and administrative register information is collected for them as well, sometimes even before the birth of the child. Such rich data enable us to understand the human life course, even intergenerationally, and build models for predicting life outcomes. For the purpose of studying long-term unemployment as an outcome, individuals who died during the review period were excluded from the population, as were individuals with an intellectual disability, resulting in an analytical sample of 58,179 individuals (Westerinen 2018). The data are provided by the Finnish Institute for Health and Welfare, who collect them from several administrative register holders and coordinate research done using the data.
Long-Term Unemployment in Early Adulthood
To understand whether the transition into the labor market is successful, I derive a binary outcome measure of long-term unemployment in early adulthood. The criterion is being registered as a job seeker for more than 365 consecutive days at any time from January 1, 2012, to May 31, 2017, that is, when the individuals are ages 25 to 30: 27.2 percent of the individuals in the analytical sample fulfill this condition. This age stage is commonly referred to as early adulthood and is defined in terms of a period after the phase of emerging adulthood (Arnett 2000), when more stable workforce attachment is assumed to form and when most individuals in the sample have completed their education. The operationalization includes all job seekers, such as the full-time unemployed (a little more than half of the job seekers), individuals involuntarily working part-time, and individuals in training due to difficulties in employability1 (Official Statistics of Finland 2015). The measure can be seen to capture a real-life phenomenon related to not making a normative transition into the labor market.
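The derivation of the outcome can be illustrated with a short sketch. The column names and spell-level structure here are illustrative assumptions, not the actual register schema; the logic simply flags anyone with a job-seeking spell of more than 365 consecutive days overlapping the review window.

```python
import pandas as pd

# Hypothetical spell-level job-seeker data; "person_id", "spell_start",
# and "spell_end" are assumed column names for illustration only.
def long_term_unemployed(spells: pd.DataFrame,
                         window_start="2012-01-01",
                         window_end="2017-05-31",
                         min_days=365) -> pd.Series:
    """Flag individuals with any registered job-seeking spell longer than
    `min_days` consecutive days within the review window."""
    s = spells.copy()
    # Clip each spell to the review window before measuring its length.
    start = s["spell_start"].clip(lower=pd.Timestamp(window_start))
    end = s["spell_end"].clip(upper=pd.Timestamp(window_end))
    s["days"] = (end - start).dt.days
    # One row per person: 1 if any clipped spell exceeds the threshold.
    return (s.groupby("person_id")["days"].max() > min_days).astype(int)
```

In the actual registers, consecutive registration would likely require merging adjacent spells first; the sketch treats each spell as already consolidated.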
Predictors
The predictor variables (Table 1) are chosen based on theories and previous research as described in the Background section. All predictors were measured before the outcome variable review period starts. Some are measured once, and some are divided into four age periods: before school (0–6 years), early school years (7–12), late school years (13–15), and after compulsory school (16–24). The predictors aim to capture ego’s characteristics, resources, skills, and measures of well-being important for transitioning to working life; parental resources that can be intergenerationally transmitted; and attributes in the home environment that may affect the child when growing up. See operationalizations in Appendix A in the supplemental material.
Predictors.
Analytical Approach
The study design is a classification task, aiming to predict whether individuals belong to class 0 (employed) or class 1 (unemployed). I apply three commonly used nonparametric SML algorithms: decision tree classifier (CART), random forest (RF), and gradient booster (GB). I compare their performance to a parametric baseline model, specifically, logistic regression. All three algorithms are based on classification trees detecting nonlinearities and predictive interactions from the data by recursively splitting the data into branches based on the variable with the most predictive power in the current sample.
The choice of algorithms was made by balancing assumptions of complexity and nonlinearity in the underlying data against computational complexity. These algorithms are also known to be suitable for the data typical for social sciences (Verhagen 2024). The CART algorithm creates one tree of complex but parsimonious higher-order interactions found to be predictive of the outcome. RF builds several uncorrelated trees from different subsets of the data and produces smooth predictions by averaging across them. GB builds several trees sequentially, and each new tree corrects the errors from the previous. GB typically produces the best prediction of the three (followed by RF and then CART; see e.g., Hastie, Tibshirani, and Friedman 2009) but also has the tendency to be sensitive to noise in the data and can boost bias originating from the data. Comparing all the aforementioned models gives a good overall picture of the predictability of the outcome. The logic of tree-based models is explained in more detail in Appendix B in the supplemental material.
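The four model families can be sketched side by side in scikit-learn. The synthetic data and the hyperparameters below are illustrative defaults, not the tuned settings used in the study.

```python
# Minimal comparison of the parametric baseline and the three
# tree-based learners described above, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logit": LogisticRegression(max_iter=1000),      # parametric baseline
    "cart": DecisionTreeClassifier(max_depth=5),     # one pruned tree
    "rf": RandomForestClassifier(n_estimators=200),  # averaged, decorrelated trees
    "gb": GradientBoostingClassifier(),              # sequential error correction
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

On real register data each of these would additionally need the preprocessing and class-balancing steps described below.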
The analytical strategy is summarized in Figure 1. The data are first split into two parts: the 20 percent holdout sample and the rest. The holdout sample is set aside from the analysis and is only used after model training to validate the models with previously unseen data. The remaining 80 percent is split into a training sample (60 percent) and a testing sample (20 percent) five times—a process called fivefold cross-validation. Cross-validation is done to reduce potential error due to a particular sample split. Class-balancing approaches are used for the training data to rule out that the model is biased toward the majority outcome class (see details in Appendix C in the supplemental material). The SML models are trained, applying a standard scaler and a k-nearest neighbors (k = 5) approach for imputing missing data, separately for each algorithm, with optimal hyperparameter settings tested previously using a grid search on a subsample of the data. The test data are used to assess how the current model specifications work in predicting the outcome, and the best class-balancing approach (synthetic minority oversampling technique [SMOTE]) is chosen for further analysis. The final models are validated with the holdout data. Model performance is discussed in terms of several evaluation measures to understand both the overall predictability and subgroup variations for each outcome class separately. After performing the analysis on the full holdout data, the training and validation steps are repeated for subsamples of the holdout set, divided into a low parental SES group versus mid-to-high parental SES. All analysis is carried out in Python (version 3.11; Van Rossum and Drake 2009) using the scikit-learn package (Pedregosa et al. 2011). Reporting of the results follows the recommendations for ML-based science proposed by Kapoor et al. (2024).
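The splitting, preprocessing, and cross-validation steps above can be sketched as follows. The data here are synthetic and the setup simplified; in particular, the SMOTE step is omitted because it comes from the separate imbalanced-learn package and must be applied inside each training fold only, never to test or holdout data.

```python
# Sketch of the holdout split, preprocessing pipeline, and fivefold
# cross-validation; all data and settings are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[rng.random(X.shape) < 0.05] = np.nan        # sprinkle in missing values
y = (rng.random(1000) < 0.27).astype(int)     # ~27 percent positive class

# Set aside a 20 percent holdout sample, stratified on the outcome.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),              # NaNs are ignored in fitting
    ("impute", KNNImputer(n_neighbors=5)),    # k-nearest neighbors imputation
    ("clf", GradientBoostingClassifier()),
])
cv_scores = cross_val_score(
    pipe, X_rest, y_rest,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
holdout_score = pipe.fit(X_rest, y_rest).score(X_hold, y_hold)
```

Wrapping the preprocessing steps in a Pipeline ensures that scaling and imputation are fit on the training folds only, avoiding leakage into the test folds.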

Analytical strategy.
Results
Evaluation Metrics
I computed several evaluation measures to understand how well the models perform, all ranging from 0 to 1. Accuracy indicates the proportion of correct predictions in the whole model, where 1 equals 100 percent accuracy. Mean squared error (MSE) measures the squared differences between predictions and observations, with the ideal value being 0. Precision is the proportion of true positive outcomes (observed and predicted unemployed) out of all positive predictions (all predicted unemployed), and recall is the proportion of true positive outcomes out of all positive observations (all observed unemployed); for both, a value of 1 indicates perfect performance. The F1 score is the harmonic mean of precision and recall, with 1 as the ideal. The false-positive rate (FP rate) indicates the proportion of those who were in fact employed but who were predicted to be unemployed, and the false-negative rate (FN rate) indicates the proportion of those who were in fact unemployed but who were predicted to be employed, both with ideal values of 0, which would indicate that everyone was accurately predicted.
Table 2 shows the evaluation metrics for all models averaged across all five folds (cross-validations produced robust results; see Appendix D in the supplemental material). Compared to the baseline model (logistic regression), all three nonparametric, nonlinear algorithms perform better in terms of both accuracy and MSE, with 9 percentage points of improvement in performance for the best performing algorithm (GB), suggesting that for the full population, a nonparametric, nonlinear algorithm is a better choice for prediction. Note, however, that even in the best performing model in terms of accuracy, the differences across the outcome groups are major: 7.3 percent of all employed will falsely be predicted to be unemployed, and 74.4 percent of all unemployed will falsely be predicted to be employed. Due to the application of class-balancing approaches, these differences are not driven by the imbalanced distribution of the outcome groups. This means that the results do not stem from the fact that it is more common to be employed than unemployed. Further inspecting the FN and FP rates, the nonlinear algorithms generally predict the group observed as unemployed very poorly (high FN rate) and the group observed as employed very well (low FP rate). Logistic regression has a more balanced performance for the two groups while still performing more poorly overall than the nonparametric algorithms and misclassifying 40 percent of the unemployed. Additional analysis in Appendix E in the supplemental material provides more class-specific evaluation measures.
Evaluation Metrics for Full Sample.
Predicted Probabilities
To examine the differences for the two outcome groups more closely, Figures 2 through 5 show the distribution of predicted probabilities for being employed and being unemployed, respectively, for logistic regression (Figures 2 and 3) and GB (Figures 4 and 5). CART and RF have similar patterns to GB (see Appendix F in the supplemental material). None of the models produce many certain predictions, that is, where there is low discrepancy between the observed and predicted. Instead, logistic regression produces predicted probabilities close to the decision threshold of 0.5 for employed and unemployed, making the certainty of the predictions low for both groups: The model is struggling to distinguish these groups from one another and needs to make guesses around the decision threshold. GB (and the two other nonparametric algorithms), on the other hand, is fairly sure about the correct predictions for the employed group, with a modal discrepancy of 0.2, but also tends to classify the cases observed unemployed mostly as employed with rather high confidence: The model likely highlights some of the characteristics the unemployed share with the employed, ignoring the potential noise. In short, logistic regression performs more poorly in terms of accuracy and error and also predicts both groups with low certainty—generally a bad trait for a predictive model—whereas the GB performs better overall, giving confident predictions—a good trait for a predictive model—but the performance is driven by correctly predicting the group observed as employed. See discussion about the decision threshold of the models in Appendix G in the supplemental material.

Distribution of predicted probabilities for the employed (0), logistic regression.

Distribution of predicted probabilities for the unemployed (1), logistic regression.

Distribution of predicted probabilities for the employed (0), gradient booster.

Distribution of predicted probabilities for the unemployed (1), gradient booster.
Differences by Socioeconomic Groups
To shed light on what social groups are easier or more difficult to predict, I repeated the validation step of the analysis but split the holdout data into two groups—low SES and mid-to-high SES—and compared the performance with the full model. Fivefold cross-validation was performed as in the full sample analysis. Low SES is defined as having no parent who has obtained a secondary degree (6.6 percent of the full sample). Appendix H (in the supplemental material) displays the results for this analysis: The mid-to-high SES group is easier to predict, likely due to the majority of the training sample belonging to this group. Note that a more inclusive definition of low SES also yields similar patterns (see Appendix I in the supplemental material). To account for potential effects of distribution shift—where the predictors have different distributions in the training and the holdout sets—I trained models for the low SES and mid-to-high SES groups separately. Table 3 displays the evaluation measures for the low SES group when the model is trained on the low SES group only, and Table 4 displays the corresponding evaluations for the mid-to-high SES group. For the tree-based models, the results point to the low SES group having poorer overall predictability: Accuracy is lower and MSE higher than for the mid-to-high group. However, looking at the FN and FP rates, we find that the unemployed have a lower overall misclassification in the low SES group than in the mid-to-high SES group. Regardless, the unemployed are still misclassified more than 50 percent of the time. In summary, the low SES group is more difficult to predict overall, but simultaneously, the unemployed are slightly better captured in this group than in the mid-to-high SES group. For more details about the distribution shift analysis, see Appendix J in the supplemental material.
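The within-stratum training used to address distribution shift can be sketched as follows. The group indicator and data are synthetic; the point is only that each model is fit and evaluated inside its own SES stratum rather than transferred across strata.

```python
# Sketch of subgroup-specific training: one model per (hypothetical)
# parental-SES stratum, evaluated with cross-validation inside the stratum.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 8))
y = (rng.random(1500) < 0.27).astype(int)
low_ses = rng.random(1500) < 0.066   # ~6.6 percent low parental SES, as in the text

scores = {}
for name, mask in {"low_ses": low_ses, "mid_high_ses": ~low_ses}.items():
    # Fit and evaluate within the stratum only, so training and
    # evaluation data share the same predictor distribution.
    scores[name] = cross_val_score(
        GradientBoostingClassifier(), X[mask], y[mask], cv=5).mean()
```

With a stratum as small as the low SES group, fold sizes shrink considerably, which is one reason its cross-validated estimates are noisier.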
Low Socioeconomic Status Training and Validation.
Mid-to-High Socioeconomic Status Training and Validation.
Conclusion and Discussion
The aim of this article was to contribute to the literature on the predictability of the human life course, focusing on labor market transition processes and for whom they are predictable, by exploring (1) whether there are differences in predictability across the outcome groups and (2) whether some socioeconomic groups are more predictable than others. The results suggest that the overall predictive capacity for (un)employment systematically increases with the use of decision-tree-based nonparametric models compared to logistic regression. This is consistent with the social prediction literature, which has typically found similar trends for other outcomes in the life course prediction framework (e.g., Salganik et al. 2020; Savcisens et al. 2024). However, the results from this study highlight that overall accuracy in this case is driven by the outcome group of the employed, for which the models reach up to 93 percent accuracy, while most of the unemployed are predicted incorrectly. Importantly, the class-balancing approaches applied to the training data rule out that the differences across the outcome groups are driven by their imbalanced distribution. This means that the results are not due to it being more common to be employed than unemployed. Furthermore, the results show that groups with low SES (in terms of parental education) are more difficult to predict than mid-to-high SES groups. In terms of the outcome groups within the low SES group, the employed remain better predicted than the unemployed even though the predictability of the unemployed improves compared to the main analysis.
These findings highlight our limited understanding of the processes under observation. Using models that are particularly designed to capture complex patterns (here: life courses) and a wide set of predictors chosen based on theories and previous empirical findings of an extensively researched topic, we struggle to understand the outcome group that is more vulnerable—the unemployed. This leads to questioning why we cannot generalize a social process so well researched and whether our current understanding of unemployment is even close to complete.
Several factors may help explain why this study fails to capture the processes leading to unemployment. Given detailed data and several flexible models that together capture the employed group well, the models apply the same logic to the unemployed as to the employed (i.e., misclassifying the unemployed as employed). This means that two individuals can be identical in their observed characteristics yet have different outcomes (see Lundberg et al. 2024). It is possible that some unobserved characteristics of the ego or the parents are critical in the processes leading to unemployment. The data used in the analysis are, however, particularly detailed, span many domains, and were chosen on the basis of previous empirical literature and theoretical understanding. The analysis did not account for macro-level conditions, such as economic conditions, educational systems, or job availability. Because the sample consists of a full birth cohort from a single year in one country, however, most macro conditions can be assumed to be the same for everyone. There can still be some regional variation, for instance in job availability, that the analysis could not account for. Finally, one possible explanation is a high level of randomness in the life courses of those who end up unemployed. Random events that throw individuals off the track to employment, and the possible stacking of such events within individuals, is a plausible explanation both statistically and substantively: pattern-finding models cannot generalize when random events abound, and vulnerable groups tend to cumulatively stack adverse life events (e.g., Fridell Lif et al. 2017). This potential randomness can be viewed as the unemployed having risk factors, or the employed having resilience factors, beyond the patterns that this binary outcome reveals.
These results also open the door for more studies using explainable AI techniques to reveal subgroup-specific characteristics of predictive models.
Whether these results are context dependent is an interesting question for future research. Finland has a welfare system in which the state has designed numerous universal policies to mitigate the effects of adversity. The poorer predictability of the low SES group in terms of (un)employment can be expected to differ in a context with a weaker social safety net, such as the United States. Without empirical analysis, however, it is difficult to say in which direction predictability would change. On one hand, a society with weaker social safety nets can contribute to the stacking of more disadvantage and random life events and thus make groups with fewer socioeconomic resources even less predictable. On the other hand, if the social system provides less assistance and support, the groups that are worse off could in fact be more predictable in their outcomes, their stacking adversities translating into a "destiny." If we had data on these adverse events in a context with a weaker social safety net, we might be able to find the patterns and predict better. In a welfare state context, the safety net can, in fact, be part of the random events that help some individuals fight adversity but simultaneously make prediction more difficult.
The analysis in this study has systematically estimated the predictability of the outcome (un)employment and, by doing so, emphasized our lack of understanding of the processes under investigation, particularly for the unemployed group. The results highlight the importance of modeling social processes in a way that aims to understand subgroups, particularly the more disadvantaged ones. The modeling approach presented in this article offers one way to systematically explore whether we in fact understand the processes we are addressing. The results also highlight that, for a social scientist applying these modeling approaches, the choice of model and evaluation measure depends on the context of application and the stakeholders' interests. In a scenario where the interest is in how well the models predict overall, as in the current academic literature (e.g., Salganik et al. 2020; Savcisens et al. 2024), the aim is to understand to what extent the human life is predictable. This is a case where overall performance measures, such as accuracy and MSE, are particularly useful. If instead we want to design policies with early interventions for people at risk of unemployment, we might be interested in capturing as many of the unemployed as possible. This is where optimizing for a low false-negative (FN) rate can be preferred. This may come at the cost of misclassifying more of the employed and potentially oversimplifying the paths to unemployment, but if interventions for the employed come with a lower cost/risk than no interventions for the unemployed, this may be a decision worth making. The results from this study illustrate that there is no single solution better than the others when it comes to applying ML approaches to understand the human life course.
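To make the trade-off between overall accuracy and the false-negative rate concrete, the following minimal Python sketch (illustrative only, not the article's code; the class counts and predictions are hypothetical) shows how a model can score high overall accuracy while missing most of the minority class:

```python
# Illustrative sketch: overall accuracy can mask a high false-negative
# rate for the minority class. Labels: 1 = unemployed (positive class),
# 0 = employed. All counts below are hypothetical.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def false_negative_rate(tp, fn):
    """Share of the truly unemployed that the model misses."""
    return fn / (tp + fn)

# Hypothetical evaluation set: 90 employed, 10 unemployed.
y_true = [0] * 90 + [1] * 10
# A model that predicts "employed" for nearly everyone:
y_pred = [0] * 98 + [1] * 2

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"accuracy = {accuracy(tp, fp, fn, tn):.2f}")     # 0.92: looks strong
print(f"FN rate  = {false_negative_rate(tp, fn):.2f}")  # 0.80: misses most unemployed
```

In this toy setup the model is 92 percent accurate overall yet misses 80 percent of the unemployed, which is why a policy-oriented application would evaluate (and possibly optimize) the FN rate directly rather than relying on accuracy alone.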
The implications of the findings of this study should be discussed not only academically but also with regard to social policymaking. What does it mean for designing interventions and supportive welfare policies if the groups that society aims to support are the ones we do not understand well and cannot generalize? Furthermore, when using AI tools to target interventions at groups that are likely to have more variability and randomness in their life courses than the average individual, model training and the model's generalizability across subgroups need to be carefully inspected. In light of these findings, the recent calls for integrating prediction-based modeling strategies alongside traditional regression-based ones to better understand society (e.g., Hofman et al. 2021) gain further support.
Supplemental Material
sj-docx-1-srd-10.1177_23780231241286655 – Supplemental material for "The (Un)Predictability of Early (Un)Employment: A Machine Learning Approach" by Sanni Kuikka, published in Socius.
Footnotes
Acknowledgements
I am grateful for comments from Maria Brandén, Stefanie Möllborn, and Sunnee Billingsley.
Author’s Note
The study processes information defined as sensitive personal data by the GDPR. The study was conducted with anonymized data, stored on a secure server hosted by the data provider, and processed through secure remote access with two-factor authentication. Additionally, the study has been approved by the Swedish Ethical Review Authority (Dnr 2022-03844-01). The data underlying this article were provided by the Finnish Institute for Health and Welfare by permission and cannot be shared publicly because they contain sensitive personal data about the individuals.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was done in affiliation with the project "Understanding Society through Register Based Machine Learning" at the Institute for Analytical Sociology, Linköping University, generously funded by the Swedish Research Council (Dnr 2019-00245).
1. Being registered as a job seeker at the Employment and Economic Development Office is a requirement for receiving unemployment benefits, the primary welfare subsidy for unemployed adults in the workforce, yet not a guarantee of receiving them. Individuals registered as job seekers for the purpose of having access to unemployment benefits can be found in varying situations: unemployed and looking for work; part-time workers receiving partial unemployment benefits to compensate for the loss of income due to lack of full-time employment; or previously unemployed, currently full-time degree students for whom further study has been found to increase employability and who are granted unemployment benefits (which are higher than student benefits) as an incentive to reeducate. They can also be temporarily laid off, in training, or in services promoting employment, and the type and amount of potential unemployment benefits can vary depending on one's label in the register. This definition of "unemployment" can be contrasted with the term "unemployed job seeker," which governmental agencies use to distinguish individuals who are fully unemployed and looking for full-time work from all those registered as unemployed. Moreover, the operationalization of the outcome as binary means that there is likely heterogeneity in both groups. The unemployed group includes all cases who need the state-provided unemployment benefit (e.g., the fully unemployed or those in need of part-time income compensation), whereas the employed group includes everyone who does not need the benefit (e.g., the full-time employed, the part-time employed, and inactive people with sufficient compensatory household income or wealth).
Author Biography
References
