Abstract
Aims:
School non-completion is a public health and educational concern in most countries. This study sought to identify the strongest predictors of the non-completion of upper secondary education based on register data.
Methods:
A cross-validated elastic net regression analysis was used to predict school non-completion in a population of 2696 students in the city of Jyväskylä, Finland. The register data included data from the primary social and healthcare register and the educational register.
Results:
The non-completion rate was 13.1% (13.4% for males, 12.8% for females). The non-completion of upper secondary education was best predicted by the following seven features (ordered from strongest to weakest): unauthorized absences (odds ratio (OR) = 2.27), out-of-home placement (OR = 2.23), average grade when leaving lower secondary education (OR = 0.73), an anxiety/depression diagnosis (OR = 1.43), visits to child guidance and family counselling centres (OR = 1.17), family poverty (OR = 1.11) and the grade point average in the 5th Grade (OR = 0.95).
Conclusions:
Introduction
School non-completion is a public health and educational concern in most countries [1–4]. Non-completion rates differ across countries owing to differences in education systems and in the ways (non-)completion is measured [5,6]. Nonetheless, Scandinavian countries seem to have the lowest rates among European countries [2,7,8]. In Finland, for example, the discontinuation rate in general upper secondary education is 4% [7]. Finland has nine years of comprehensive education, which begins in the year the child turns seven and ends when children are 16 years old (in the 9th grade). In 2021, free compulsory education was extended to the age of 18 [8].
School non-completion has been shown to be associated with, for example, lower socioeconomic status, poorer school performance, poorer health, substance abuse, out-of-home placements during childhood, and mental health problems [4,9–16]. Although the phenomenon and its risk factors are reasonably well investigated and understood at the individual, family, student and school levels [16], gaps remain that need to be filled to advance our understanding and to target activities that keep students in school more precisely. For example, there is little research on the role of welfare system level factors, such as the use of social and healthcare services, in predicting school non-completion [17]. Investigating this issue is relevant at least in the Finnish context, where universal primary health and social service provision for families and children is crucial both for recognizing and preventing child welfare concerns and problems [18].
The purpose of this study is to identify the strongest predictors of non-completion of upper secondary education based on register data with a wide set of potential predictors. Using a machine learning approach enabled us to construct prediction models for school non-completion with the most accurate predictor subsets and, thus, to expand the previous literature [19,20].
Methods
Study population and data sources
The data for the present study included students from upper secondary institutions in the city of Jyväskylä, Finland. The data included students whose information from lower secondary school was available, who had started their studies between 2013 (1 January) and 2015 (31 December) and had either graduated (n=2344) or not completed their studies (n=352) by 2019 (3 August). This resulted in a final sample of 2696 adolescents.
The data on the educational records was drawn from the educational registry of Jyväskylä (from the school administration system Primus), and the health and social data were based on records from the Jyväskylä primary social and healthcare register (the Effica system).
This study was undertaken as part of the wider Finnish Youth Social Impact Bond Program [21] to provide evidence-based support for the city of Jyväskylä in setting targets for tackling school non-completion. The Ethics Committee of the Finnish Institute for Health and Welfare (THL) granted approval for the study. The register data was accessed after receiving permission from Hospital Nova in Central Finland (granted November 2021), Jyväskylä Educational Consortium Gradia (granted June 2021), the City of Jyväskylä’s social and health services (granted June 2021), and its cultural, education and sports services (granted June 2021). The data was anonymized before being handed over to the researchers.
Variables
For the purposes of our study, the non-completion of upper secondary education was defined as an adolescent not having graduated from upper secondary education and not continuing their studies at follow-up in August 2019. Students who were enrolled at an educational institution in 2019 but had not graduated were considered to be continuing their studies and were hence excluded from the analyses. This information was based on the educational records drawn from the educational registry.
Initially, a diverse set of potential predictors was selected to predict the non-completion of upper secondary education. The main criteria for choosing each predictor were a plausible association with school non-completion and availability in registers both in the study area and in other municipalities, which would enable wider application of the results. For descriptive purposes, the predictors were grouped into nine distinct categories, based on their content and the registry they were obtained from: 1) demographics; 2) school absenteeism; 3) school success; 4) learning support; 5) visits to child guidance and family counselling centres; 6) social welfare measures; 7) health conditions; 8) school behaviour; 9) other. Table I offers a complete list of the predictor variables included in this study, as well as their definitions and classifications.
Table I. Descriptive overview of the included registry-based variables used to predict upper secondary school non-completion.
Strategy for data analysis
The analysis used elastic net regression with cross-validation, a machine-learning approach. The dataset was randomly split into k parts (with k = 1000); in each round, k-1 parts served as training data to establish the prediction model, and the remaining part served as validation data to test the accuracy of the prediction model. In the elastic net regression analysis, variable selection is regulated by the mixing parameter α, which ranges from 0 to 1: lower values generally result in models with more explanatory variables (‘features’ in the machine-learning literature) and higher values result in models with fewer explanatory variables. We therefore ran the analysis with different levels of α (0.45, 0.55, 0.65 and 0.75). The predictive power of 53 different variables and their combinations, obtained from social and health service registries and education providers, was tested. The prediction accuracy was evaluated using the area under the receiver operating characteristic curve (AUC), which measures the discriminatory capacity of classification models. In general, an AUC of 0.5 suggests no discrimination, 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent and more than 0.9 is considered outstanding [22].
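The splitting and scoring steps described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used in the study. The function names `k_fold_indices` and `auc` are hypothetical helpers, the example uses k = 10 instead of the study's k = 1000 to keep it readable, and the model-fitting step is omitted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs in which the positive
    case receives the higher score; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Sample size taken from the text; k = 10 is an illustrative choice.
folds = k_fold_indices(n=2696, k=10)

for held_out, validation in enumerate(folds):
    # Train on the other k-1 folds, validate on the held-out fold.
    training = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    assert len(training) + len(validation) == 2696
```

Given predicted probabilities for a validation fold, `auc(labels, scores)` returns a value between 0 and 1, with 0.5 corresponding to no discrimination.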
Continuous features were scaled using the two-sigma scaling method [23] to ensure comparability with binary features. After the initial runs, the features were narrowed down: among overlapping indicators (e.g. grades, school absences and diagnoses, for which many indicators were available), those showing the greatest predictive power were retained.
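A minimal sketch of the scaling step follows, assuming that "two-sigma scaling" means subtracting the mean and dividing by two standard deviations. The function name `two_sigma_scale` and the example grades are hypothetical and not taken from the study data.

```python
from statistics import mean, stdev

def two_sigma_scale(values):
    """Centre a continuous feature and divide it by two sample standard deviations."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot scale a constant feature")
    return [(v - m) / (2 * s) for v in values]

# Hypothetical grade point averages, for illustration only.
grades = [6.0, 7.0, 8.0, 9.0, 10.0]
scaled = two_sigma_scale(grades)
```

After this transformation the feature has a standard deviation of 0.5, the same as a binary feature that takes each value half of the time. A one-unit change in the scaled feature therefore corresponds to a two-standard-deviation change in the original, which is roughly comparable to moving a binary feature from 0 to 1.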
Analyses were performed with the R statistical software (version 4.0.4) [24] and the ‘glmnet’ package [25]. Sensitivity models tested different numbers of folds (k-value) in the cross-validation (3, 5, 10, 100, and 1500), but the main predictors were not sensitive to these analytical choices.
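For reference, the objective that ‘glmnet’ minimizes for a binary outcome can be written as below. The notation is an assumption added here for illustration and does not appear in the original text: N is the number of students, y_i the non-completion indicator, x_i the feature vector, λ the overall penalty strength (selected by cross-validation) and α the mixing parameter discussed above.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}
\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
\;+\; \lambda\Big[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
      + \alpha\,\lVert\beta\rVert_1\Big]
```

With α = 1 the penalty reduces to the lasso and with α = 0 to ridge regression. Intermediate values such as those used here (0.45 to 0.75) combine the two, which is why higher α values tend to set more coefficients exactly to zero.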
Results
Descriptive results
Of the total study population, 86.9% (n=2344) completed their upper secondary education and 13.1% (n=352) did not complete it. There were slightly more females (50.6%) than males (49.4%) in the dataset. Non-completion rates, however, were fairly equally distributed between genders (13.4% for males and 12.8% for females). Table II presents descriptive information on the most important features in the present study, across the whole dataset and grouped by graduation status.
Table II. Descriptive information on the most important features in the study, in the whole dataset and grouped by graduation status. Percentage or mean (SD).
Figure 1 shows the distributions of grade point averages from the 5th grade to the 9th grade among those who completed upper secondary school and among those who did not. A figure was used because the grade point average was a continuous variable, whereas all the other variables were binary; it illustrates the results more informatively than a table listing the medians and quartiles for each school year. The final grade point averages in all school years from the 5th to the 9th grade were significantly higher among the students who completed their studies than among those who did not.

Figure 1. Average grades according to upper secondary school graduation.
Model fit
The results from the cross-validation were consistent when the number of folds was at least 100, regardless of the level of α. The strongest predictors of school non-completion were not sensitive to the level of α. The AUCs of the models with a reduced set of variables were almost equal to or even higher than those of the models with the full dataset (e.g. with α=0.65, the AUC was 0.814 in the reduced dataset and 0.813 in the full dataset), and hence the analysis for the reduced dataset was inspected in more detail.
Parameters
In the best-predicting models (α=0.65, k=1000/1500), the non-completion of upper secondary education was predicted with good accuracy (AUC=0.814) by the following seven features, ordered from the strongest to the weakest: unauthorized absences (odds ratio (OR) = 2.27), out-of-home placement (substitute care) (OR = 2.23), average grade when leaving lower secondary education (OR = 0.73), an anxiety or depression diagnosis (OR = 1.43), visits to child guidance and family counselling centres (OR = 1.17), family poverty (OR = 1.11) and average grade in the 5th grade (OR = 0.95). These results were essentially the same with different levels of α (Table III), although the OR estimate for the average grade in the 5th grade moved closer to 1 as α increased, indicating that this was the weakest predictor of the non-completion of upper secondary education.
Table III. Selected features and their estimated odds ratios from the model predicting non-completion of upper secondary education with different alpha levels (k=1000).
AUC: area under the curve; exp (b): the exponentiation of the B coefficient.
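Because the odds ratios are exponentiated coefficients, they act multiplicatively on the odds of non-completion. The sketch below illustrates this with the odds ratios reported for α=0.65. The baseline probability of 0.10 is a hypothetical value chosen for illustration, since the model intercept is not reported, and the helper names are likewise hypothetical.

```python
import math

# Odds ratios reported in the text for alpha = 0.65.
odds_ratios = {
    "unauthorized_absences": 2.27,
    "out_of_home_placement": 2.23,
    "grade_leaving_lower_secondary": 0.73,
    "anxiety_or_depression_diagnosis": 1.43,
    "counselling_centre_visits": 1.17,
    "family_poverty": 1.11,
    "grade_5th": 0.95,
}

def coefficient(odds_ratio):
    """Recover the regression coefficient b from exp(b)."""
    return math.log(odds_ratio)

def apply_odds_ratio(probability, odds_ratio):
    """Convert a probability to odds, multiply by the odds ratio, convert back."""
    odds = probability / (1.0 - probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

baseline = 0.10  # hypothetical baseline probability, not from the study

# One risk factor present, then two together (odds ratios multiply).
p_absences = apply_odds_ratio(baseline, odds_ratios["unauthorized_absences"])
p_both = apply_odds_ratio(p_absences, odds_ratios["out_of_home_placement"])
```

Under these assumptions, the probability rises from 0.10 to roughly 0.20 with unauthorized absences alone and to roughly 0.36 with both unauthorized absences and out-of-home placement. Odds ratios below 1, such as those for the grade averages, correspond to negative coefficients and lower the predicted probability. Note that penalized estimates are shrunk towards zero, so these odds ratios are likely to be conservative relative to unpenalized ones.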
These findings are further illustrated in Figure 2, which shows the relative importance of different predictors for the non-completion probability. The grade point average is presented on the x-axis because it was a continuous variable. For simplicity, the grade point average in the 5th grade was assumed to be 6.0. When the grade point average in the 9th grade was greater than the average in the 5th grade, the non-completion probability decreased.

Figure 2. The relative importance of different predictors on school non-completion probability.
Discussion
Based on register data, this study aimed to identify the predictors of the non-completion of upper secondary education. The results showed that the rate of school non-completion (13%) in a specific Finnish context (city of Jyväskylä) was substantially higher than in national statistics [7]. It is, thus, reasonable to suggest that a careful identification of the strongest predictors of school non-completion at the local level may help target preventative actions more specifically. The seven strongest predictors of school non-completion identified in the present context were multifaceted in nature. Below, these predictors are discussed mainly in order of their strength.
The emergence of unauthorized absence as the strongest predictor of school non-completion is in line with previous research. A similar finding was reported by Chung and Lee in South Korea [20], who also used machine-learning algorithms and found that the best predictor of school non-completion was unauthorized absence. Although research has consistently shown school absenteeism to be a strong risk factor for non-completion of education [10,16], the focus has predominantly been on the role of school absenteeism without distinguishing different types of absences. It is important to acknowledge that the reasons for absenteeism can be multiple and that authorized absences may also disrupt learning and subsequently influence the non-completion risk [26].
Out-of-home placement was the second strongest predictor of adolescent school non-completion. This finding is in line with other studies demonstrating that children who have been placed outside the home are less likely to complete their studies [27]. Average grade when leaving lower secondary education was found to be the third strongest predictor of school non-completion. Previous research supports this finding [10,16]. As a novel contribution to the existing literature, we observed that school grades were important as a predictor of school non-completion from as early as the 5th grade onwards. Even though the grade point average in the 5th grade was the weakest predictor (ranked 7th) of non-completion, it could be an important early signal to consider when identifying students at risk of school non-completion.
The fourth strongest predictor of school non-completion in our study was an anxiety or depression diagnosis, referring to internalizing mental health disorders. Overall, this finding accords with earlier studies showing that poor mental health is related to school non-completion and that early-onset mental disorders should be considered a key target to reduce non-completion rates [10,13–15]. On the other hand, our result is somewhat in contrast with the earlier literature which has identified that externalizing disorders predict school non-completion better than internalizing disorders [28]. More recent evidence is similar to ours and suggests that internalizing problems have a strong role in predicting school non-completion [29].
Some of the identified predictors of school non-completion in the current study reflect conditions or circumstances that are beyond the reach of the school. These notable predictors included visits to child guidance and family counselling centres (ranked 5th) and family poverty (ranked 6th). Both are signifiers of broader challenges and processes, which may influence the emergence of differences in educational outcomes among children. The pathways leading to these are diverse and too broad to be discussed in detail here. The significant effect of family poverty (commonly defined by education and income-based measures) on educational outcomes for children has been widely reported, with a lower socioeconomic family status being linked to poorer educational achievement for children [30]. In our findings, visits to child guidance and family counselling centres also predicted school non-completion. While these centres provide additional support and guidance for adolescents and their families, and thus can be seen as a positive resource in relation to school completion, the reverse finding here may reflect that those who use these services are typically characterized as high-risk students [17].
School non-completion as a phenomenon is complex and influenced by a variety of factors [10–16]. When considering the implications of the current study, the multifaceted nature of the identified predictors suggests that efforts to prevent non-completion are not straightforward. The identified predictors can support multidisciplinary actions to prevent non-completion by providing both early signals for targeting actions more specifically and indicators for monitoring the impact of preventative actions. Furthermore, these findings call for cross-sectoral actions and cooperation between the education, health and social sectors.
Study strengths and limitations
The strengths of this study were the use of objective register-based data and predictive modelling with machine learning. As the results were based on register information, they were not affected by the self-report biases that are typical of survey samples. Using a machine learning approach, we were able to select the most accurate predictor subsets and construct prediction models of school non-completion [19,20]. However, this study is not without limitations. The generalizability of the results outside Finland, particularly to countries with different welfare and educational systems, should be approached with caution. Additionally, some potentially relevant factors that may predict school non-completion, such as the students’ ethnicity, were not available in the register data. Another limitation to acknowledge is that the data was gathered prior to the COVID-19 pandemic, which led to disruptions in the education and lives of students. Furthermore, the follow-up time was two years shorter for those students who started their studies in 2015 than for those who started in 2013, which could have influenced the results because those who had started earlier had more time to complete their education. We assume, however, that this did not cause a major bias because the majority of the students at the upper secondary institutions graduated within the four-year follow-up. Finally, the data did not cover the whole country but was restricted to one area (city of Jyväskylä), which could represent a limitation. However, there is no clear indication that the main predictors would differ significantly between cities in Finland.
Footnotes
Acknowledgements
We would like to thank the city of Jyväskylä for granting access to its educational, social and health registries. We are also grateful to the late Hannu Pahkala from Steamlane Ltd for his contribution to the statistical preparation of this work. The funding body had no role in the study design, collection, analysis, or interpretation of data, writing the manuscript, or the decision to submit the manuscript for publication.
Declaration of conflicting interests
The authors have no conflicts of interest to declare.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the ITLA – Children’s Foundation, Finland.
