Abstract
Aims:
School non-completion is a public health and educational concern in most countries. This study sought to identify the strongest predictors of the non-completion of upper secondary education based on register data.
Methods:
A cross-validated elastic net regression analysis was used to predict school non-completion in a population of 2696 students in the city of Jyväskylä, Finland. The register data included data from the primary social and healthcare register and the educational register.
Results:
The non-completion rate was 13.1% (13.4% for males, 12.8% for females). The non-completion of upper secondary education was best predicted by the following seven features (ordered from strongest to weakest): unauthorized absences (odds ratio (OR) = 2.27), out-of-home placement (OR = 2.23), average grade when leaving lower secondary education (OR = 0.73), an anxiety/depression diagnosis (OR = 1.43), visits to child guidance and family counselling centres (OR = 1.17), family poverty (OR = 1.11) and the grade point average in the 5th Grade (OR = 0.95).
Conclusions:
Introduction
School non-completion is a public health and educational concern in most countries [1–4]. Non-completion rates differ across countries owing to differences in education systems and in the ways (non-)completion is measured [5,6]. Nonetheless, Scandinavian countries seem to have lower rates than other European countries [2,7,8]. In Finland, for example, the discontinuation rate in general upper secondary education is 4% [7]. Finland has nine years of comprehensive education, which begins in the year the child turns seven and ends when children are 16 years old (in the 9th grade). In 2021, free compulsory education was extended to the age of 18 [8].
School non-completion has been shown to be associated with, for example, a lower socioeconomic status, poorer school performance, poorer health, substance abuse, out-of-home placements during childhood, and mental health problems [4,9–16]. Although the phenomenon and its risk factors are reasonably well investigated and understood at the individual, family, student and school levels [16], there are still gaps to be filled to advance our understanding and to better enable the targeting of activities that keep students in school. For example, there is little research on the role of welfare system level factors, such as the use of social and healthcare services, in predicting school non-completion [17]. Investigating this issue is relevant at least in the Finnish context, where universal primary health and social service provision for families and children is crucial both for recognizing and for preventing child welfare concerns and problems [18].
The purpose of this study is to identify the strongest predictors of non-completion of upper secondary education based on register data with a wide set of potential predictors. Using a machine learning approach enabled us to construct prediction models for school non-completion with the most accurate predictor subsets and, thus, to expand the previous literature [19,20].
Methods
Study population and data sources
The data for the present study included students from upper secondary institutions in the city of Jyväskylä, Finland. The data included students whose information from lower secondary school was available, who had started their studies between 2013 (1 January) and 2015 (31 December) and had either graduated (
The data on the educational records was drawn from the educational registry of Jyväskylä (from the school administration system Primus), and the health and social data were based on records from the Jyväskylä primary social and healthcare register (the Effica system).
This study was undertaken as part of the wider Finnish Youth Social Impact Bond Program [21] to provide evidence-based support for the city of Jyväskylä to set targets for tackling school non-completion. The Ethics Committee of the Finnish Institute for Health and Welfare (THL) granted approval for the study. The register data was accessed after receiving permission from Hospital Nova in Central Finland (granted November 2021), Jyväskylä Educational Consortium Gradia (granted June 2021), the City of Jyväskylä’s social and health services (granted June 2021) and its cultural, education and sports services (granted June 2021). The data was anonymized before being handed over to the researchers.
Variables
For the purposes of our study, the non-completion of upper secondary education was defined as an adolescent not having graduated from upper secondary education and not continuing their studies at follow-up in August 2019. Students who were enrolled at an educational institution in 2019 but had not graduated were considered to be continuing their studies and were hence excluded from the analyses. This information was based on the educational records drawn from the educational registry.
Initially, a diverse set of potential predictors was selected to predict the non-completion of upper secondary education. The main criteria for choosing each predictor were a plausible association with school non-completion and availability in registers both in the study area and in other municipalities, which would enable wider application of the results. For descriptive purposes, the predictors were grouped into nine distinct categories, based on their content and the registry they were obtained from: 1) demographics; 2) school absenteeism; 3) school success; 4) learning support; 5) visits to child guidance and family counselling centres; 6) social welfare measures; 7) health conditions; 8) school behaviour; 9) other. Table I offers a complete list of the predictor variables included in this study, as well as their definitions and classifications.
Table I. Descriptive overview of the included registry-based variables used to predict upper secondary school non-completion.
Strategy for data analysis
Using a cross-validation elastic net regression analysis, a machine-learning approach, the dataset was randomly split into
Continuous features were scaled using the two-sigma scaling method [23] to ensure comparability with binary features. After the initial runs, overlapping features (e.g. the many available indicators of grades, school absences and diagnoses) were narrowed down so that only those showing the greatest predictive power were retained.
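The two-sigma scaling step can be illustrated with a minimal sketch. It assumes that the method centres each continuous feature on its mean and divides by two standard deviations, so that the scaled feature has a standard deviation of 0.5, similar to that of an evenly split binary feature. The function name, the use of the sample standard deviation and the grade values are illustrative assumptions, not taken from the study.

```python
from statistics import mean, stdev


def two_sigma_scale(values):
    """Centre a continuous feature on its mean and divide by two standard deviations."""
    centre = mean(values)
    spread = 2 * stdev(values)  # sample SD; the exact SD variant is an assumption
    if spread == 0:
        raise ValueError("feature is constant and cannot be scaled")
    return [(v - centre) / spread for v in values]


# Hypothetical grade point averages (illustrative values, not study data)
grades = [6.5, 7.0, 7.5, 8.0, 8.5, 9.0]
scaled = two_sigma_scale(grades)
```

Under this scaling, a one-unit change in a continuous feature corresponds to a change of two standard deviations, which is why its coefficient can be read alongside the coefficients of binary features.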
Analyses were performed with the R statistical software (version 4.0.4) [24] and the ‘glmnet’ package [25]. Sensitivity models tested different numbers of folds (
Results
Descriptive results
Of the total study population, 86.9% (
Descriptive information of the most important features in the study, in the whole dataset and grouped by graduation status. Percentage or mean (SD).
Figure 1 shows the distributions of grade point averages from the 5th grade to the 9th grade among those who completed upper secondary school and among those who did not. A figure was used because the grade point average was a continuous variable, whereas all the other variables were binary; the figure illustrates the results more informatively than a table listing the medians and quartiles for each school year. The final grade point averages in all school years from the 5th to the 9th grade were significantly higher among the students who completed their studies than among those who did not.

Figure 1. Average grades according to upper secondary school graduation.
Model fit
The results from the cross-validation were consistent when the number of folds was at least 100, regardless of the level of α. The strongest predictors for school completion were not sensitive to the level of α. The area under the curve (AUC) of the models with a reduced set of variables was almost equal to, or even higher than, that of the models with the full set (e.g. with α=0.65, the AUC was 0.814 in the reduced dataset and 0.813 in the full dataset); hence, the analysis of the reduced dataset was inspected in more detail.
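As a reminder of what the AUC values compared above measure, the following minimal sketch computes the AUC as the probability that a randomly chosen non-completer receives a higher predicted risk than a randomly chosen completer, with ties counted as one half. The function name and the labels and scores are hypothetical and serve only as an illustration; this is not the code used in the study, which relied on R.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs in which the positive scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = non-completion) and predicted risks, not study data
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.80, 0.30, 0.45, 0.50, 0.10, 0.70, 0.20, 0.40]
example_auc = auc(labels, scores)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so values around 0.81 indicate that the model ranks a non-completer above a completer in roughly four out of five such pairs.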
Parameters
In the best-predicting models, the non-completion of upper secondary education was predicted by the following seven features ordered from the strongest to the weakest when α=0.65,
Selected features and their estimated odds ratios from the model predicting non-completion of upper secondary education with different alpha levels (
AUC: area under the curve; exp(b): the exponentiated regression coefficient b, i.e. the estimated odds ratio.
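The alpha levels refer to the mixing parameter of the elastic net penalty. As a minimal sketch, and assuming the parameterization used by the 'glmnet' package, the penalty added to the model's loss can be written as follows. The function name, the coefficient values and the value of lambda are illustrative assumptions.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)


# Hypothetical coefficients and penalty strength, not estimates from the study
coefs = [0.5, -1.0, 0.0]
lasso_like = elastic_net_penalty(coefs, lam=0.1, alpha=1.0)   # pure L1 penalty
ridge_like = elastic_net_penalty(coefs, lam=0.1, alpha=0.0)   # pure L2 penalty
mixed = elastic_net_penalty(coefs, lam=0.1, alpha=0.65)       # blend of the two
```

With alpha close to 1 the penalty behaves like the lasso and tends to set weak coefficients to exactly zero, whereas with alpha close to 0 it behaves like ridge regression and shrinks coefficients without removing them.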
These findings are also further illustrated in Figure 2. The figure shows the relative importance of different predictors on the non-completion probability. The grade point is presented on the

Figure 2. The relative importance of different predictors on school non-completion probability.
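To relate the odds ratios to probabilities, the following minimal sketch converts a baseline probability and an odds ratio into an implied probability. It uses the overall non-completion rate (13.1%) as the baseline for a student without the risk factor. This is a simplifying assumption, because the model intercept is not reported in the text and coefficients from a penalized model are shrunk towards zero. The function name is hypothetical.

```python
import math


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Multiply the baseline odds by an odds ratio and convert back to a probability."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


baseline = 0.131        # overall non-completion rate; used as the baseline by assumption
or_absences = 2.27      # reported odds ratio for unauthorized absences
coef_absences = math.log(or_absences)  # coefficient b on the log-odds scale, since OR = exp(b)
prob_with_absences = apply_odds_ratio(baseline, or_absences)
```

Under these assumptions, an odds ratio of 2.27 raises the implied non-completion probability from about 13% to about 25%. For continuous features such as grade point averages, the odds ratio refers to a change of two standard deviations because of the two-sigma scaling.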
Discussion
Based on register data, this study aimed to identify the predictors of non-completion of upper secondary education. The results showed that the rate of school non-completion (13%) in a specific Finnish context (city of Jyväskylä) was substantially higher than in the national statistics [7]. It is, thus, reasonable to suggest that careful identification of the strongest predictors of school non-completion at the local level may help target preventative actions more specifically. The seven strongest predictors of school non-completion identified in the present context were multifaceted in nature. Below, these predictors are discussed mainly in order of their strength.
The fourth strongest predictor of school non-completion in our study was
Some of the identified predictors of school non-completion in the current study reflect conditions/circumstances that are unreachable by the school. These notable predictors included
School non-completion as a phenomenon is complex and influenced by a variety of factors [10–16]. When considering the implications of the current study, the multifaceted nature of the identified predictors suggests that efforts to prevent non-completion are not straightforward. The predictors can support multidisciplinary actions to prevent non-completion by providing both early signals for targeting actions more specifically and indicators for monitoring the impact of preventative actions. Furthermore, these findings call for cross-sectoral actions and cooperation between the education, health and social sectors.
Study strengths and limitations
The strengths of this study were the use of objective register-based data and predictive modelling with machine learning. As the results were based on register information, they were not affected by the self-report biases that are typically related to survey samples. Using a machine learning approach, we were able to select the most accurate predictor subsets and construct prediction models of school non-completion [19,20]. However, this study is not without limitations. The generalizability of the results outside Finland, particularly to countries with different welfare and educational systems, should be approached with caution. Additionally, there are potentially relevant factors that may predict school non-completion but were not available in the register data, such as the students’ ethnicity. Another limitation is that the data was gathered prior to the COVID-19 pandemic, which led to disruptions in the education and lives of students. Furthermore, the follow-up time was two years shorter for those students who started their studies in 2015 than for those who started in 2013, which could have influenced the results because those who had started earlier had more time to complete their education. We assume, however, that this did not cause a major bias because the majority of the students at the upper secondary institutions graduated within the four-year follow-up. Finally, the data did not cover the whole country but was restricted to one area (city of Jyväskylä), which could represent a limitation. However, there is no clear indication that the main predictors would differ significantly between cities in Finland.
Footnotes
Acknowledgements
We would like to thank the city of Jyväskylä for granting access to its educational, social and health registries. We are also grateful to the late Hannu Pahkala from Steamlane Ltd for his contribution to the statistical preparation of this work. The funding body had no role in the study design, collection, analysis, or interpretation of data, writing the manuscript, or the decision to submit the manuscript for publication.
Declaration of conflicting interests
The authors have no conflicts of interest to declare.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the ITLA – Children’s Foundation, Finland.
