Abstract
In this paper, we describe in detail the different approaches we used to predict the GPA of children at age 15 in the context of the Fragile Families Challenge. Our best prediction improved on a naive baseline prediction by about 18 percent in terms of mean squared error and performed less than 5 percent worse than the best prediction in the Fragile Families Challenge. After discussing the different predictions we made, we also discuss the predictors that tend to be robustly associated with GPA. One notable set of predictors comes from teacher observations at age nine. We end with a reflection on our participation in the Fragile Families Challenge and provide some suggestions for follow-up work.
Introduction
This paper describes our effort in predicting GPA of children in the context of the Fragile Families Challenge (FFC). The FFC was a mass collaboration effort to study six important outcomes in the lives of children and families when the child was 15. Participants in the FFC were diverse in terms of experience (from undergraduate students to tenured faculty) and fields of expertise (with backgrounds in sociology, physics, economics, etc.). Each participant was required to make predictions of the outcome variables using a large data set consisting of earlier waves of the Fragile Families and Child Wellbeing Study. Otherwise participants were completely free to use any approach they deemed appropriate. The participants were blind to the actual outcome but could submit predictions to a leaderboard that provided feedback in terms of predictive accuracy. 1
In this paper, we describe our approach to predict one outcome, GPA. This score reflects school performance and ranges from 1 (worst) to 4 (best).
The approach taken in this paper reflects our background as a social scientist. We browsed the relevant literature, read through the variable descriptions, and experimented with various methods. We started with a rather compact data set. Then, we created new data sets by adding (or removing) variables to this initial data set. The changes to this data set were based on our reading of the literature and the results of our predictions. This led to a total of 12 data sets we explored using different prediction models.
The approach we took grew organically throughout our participation in the FFC. As the goal of this paper is to explicitly describe what we did, we restrict ourselves to the predictions we made during the contest.
We found that a simple approach using only constructed variables already led to a substantial improvement in terms of predictive accuracy over a baseline prediction. Further improvements were modest. Our best prediction had a mean squared error (MSE) that was about 18 percent lower than a baseline prediction but was more than 4 percent off the best prediction in the FFC. Key variables are related to parental demographics and background and teacher observations at the age of nine.
This manuscript is structured as follows. We describe the creation of various data sets, present the prediction models, discuss the performance of our predictions, review the key covariates, and end with a conclusion.
Data Cleaning and Handling
We created 12 different data sets to predict GPA. To start, we restricted ourselves to the constructed variables. These variables are created by the Fragile Families research staff. They are a convenient starting point because they often have little missingness and combine raw survey variables into meaningful constructs.
During our participation in the FFC, we created different data sets by adding (and sometimes removing) variables. Both removing and adding variables can improve performance (Kuhn and Johnson 2013). One example is splitting up a categorical variable into a series of dummy variables.
We cleaned the data set, consisting of constructed variables, by removing variables that we deemed not to have a meaningful interpretation such as indicators of sample characteristics. 2 We also removed variables with no variability. In case of missing observations, we used median imputation. 3
Next, we reduced the number of variables related to poverty and household income. The data contain 40 such closely related variables. We reduced these to three variables via principal component analysis. Then we transformed categorical variables containing four or fewer categories to a series of dummy variables (one-hot encoding). We applied the same transformation to the categorical variables capturing the relationship status of the parents. This data set we call Constructed (see Table 1).
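As an illustration, these cleaning steps can be sketched as follows. This is a minimal sketch in Python (our actual analysis was done in R), and the toy data frame, column names, and number of category levels are hypothetical stand-ins for the constructed-variables data set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the constructed-variables data set.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 6)),
                  columns=[f"income_{i}" for i in range(5)] + ["other"])
df.iloc[::7, 0] = np.nan                       # some missing observations
df["constant"] = 1.0                           # a variable with no variability
df["relstat"] = rng.integers(0, 4, size=100)   # categorical, 4 levels

# 1. Remove variables with no variability.
df = df.loc[:, df.nunique() > 1]

# 2. Median imputation of missing observations.
num_cols = [c for c in df.columns if c != "relstat"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# 3. Reduce the correlated income variables to three principal components.
income_cols = [c for c in df.columns if c.startswith("income_")]
pcs = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(df[income_cols]))
df = df.drop(columns=income_cols)
df["inc_pc1"], df["inc_pc2"], df["inc_pc3"] = pcs[:, 0], pcs[:, 1], pcs[:, 2]

# 4. One-hot encode the categorical relationship-status variable.
df = pd.get_dummies(df, columns=["relstat"], prefix="relstat")
```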
Data Sets Constructed for the Fragile Families Challenge.
To this data set, we added different sets of variables or transformed the same set of variables differently. 4
We started by adding variables coming from the teacher survey at age nine, the most recent information available. We believe that information from teachers is important to consider because children spend a large amount of time in school (Hofferth and Sandberg 2001). We focused on questions related to classroom behavior and social skills, and we dealt with these questions in different ways. First, we used a narrow set of these questions focusing on social skills. These were added to the data set previously described to create the data set ConstrSocial. Then, we added all the questions on classroom behavior to Constructed to create data set ConstrTeacher.
The answers to some of these questions are highly correlated as they relate to the same underlying concept. For example, teachers had to rate children on items like: “Controls temper in conflict situations . . . ” or “Responds appropriately to teasing . . . .” We aimed at capturing the underlying construct using four different strategies. First, we created cumulative scores by summing the answers to related questions. These indices were added to Constructed to create ConstrTeacherIndex. Second, we added categorical variables of five to seven levels based on these indices. This led to ConstrTeacherCat. Third, we created a dummy variable for each level of the corresponding categorical variable to allow for nonlinear effects. The resulting data set is called ConstrTeacherDummy. Finally, we extracted principal components from these sets of related questions. The resulting data set is called ConstrTeacherPr.
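The four strategies can be illustrated on a hypothetical block of related teacher items. The item names, response scale (1 to 4), and number of bins below are illustrative assumptions, not the actual survey items.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy stand-in for a block of related teacher items (Likert-type, 1-4).
rng = np.random.default_rng(1)
items = pd.DataFrame(rng.integers(1, 5, size=(200, 6)),
                     columns=[f"item_{i}" for i in range(6)])

# Strategy 1: cumulative score (sum of related items) -> ConstrTeacherIndex.
index = items.sum(axis=1)

# Strategy 2: cut the index into a small number of ordered categories
# -> ConstrTeacherCat (five levels here; the paper used five to seven).
cats = pd.cut(index, bins=5, labels=False)

# Strategy 3: one dummy per level of that categorical variable, allowing
# nonlinear effects -> ConstrTeacherDummy.
dummies = pd.get_dummies(cats, prefix="beh")

# Strategy 4: principal components of the item block -> ConstrTeacherPr.
pc1 = PCA(n_components=1).fit_transform(items)[:, 0]
```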
We explored these seven data sets by assessing the predictive performance of various elastic net models (see the following) through cross-validation. From this, we learned that the data set ConstrTeacherPr tends to yield the lowest prediction errors.
Next, we searched for existing research using the Fragile Families and Child Wellbeing Study. We used Google Scholar to look for papers using the survey data and studying some of the outcome variables in the FFC. We considered the first 25 papers in the search results. Two were irrelevant, and two could not be accessed. We read the remaining 21 papers and listed relevant variables and constructs. We then browsed various questionnaires to add variables to our data set. We added variables coming from the impressions the interviewer formed when conducting the interviews at age nine. These questions concern the childhood environment, hygiene of the child, and so on. We also added questions from the mother related to her perception of the father, any hardship she might face, as well as the mother’s well-being (e.g., religious attendance, life satisfaction, smoking). This is ConstrTeacherPCAge9.
Next we considered questions asked to the parents when the child was five. These questions were related to how the mother raises the child (e.g., spanking, kindergarten) and how she feels about raising a child. The answers from the father concern any financial aid the father was receiving as well as the father’s involvement in raising the child. These variables were added to ConstrTeacherPCAge9 to yield ConstrTeacherPCAge9Age5. We also made a variation of this data set, ConstrTeacherPCAge9Age5PC, where we added principal components from subsets of related variables instead.
The variables we chose to add to the constructed variables thus come from the waves at ages five and nine. This does not mean that we disregard the importance of early childhood, but we thought that the early life conditions would be well captured by the constructed variables. 5
We guessed that more recent information, such as from a teacher in elementary school, would be more helpful in predicting outcomes at age 15 than earlier data.
Finally, we constructed two additional data sets focusing on interactions between variables we considered relevant (based on the predictions we already made). We used the lasso (see the following) to select the variables: we fitted a lasso model on ConstrTeacherPCAge9Age5, took all variables that were not set to zero, and manually created a range of interaction variables from them. This formed the data set Interaction. The final data set is InteractionPCAge9, which combines Interaction with the answers from the mother and father at age nine that we used earlier. Table 1 provides an overview.
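A sketch of this selection-then-interaction step, using Python's scikit-learn in place of the R tooling we actually used, with simulated data standing in for ConstrTeacherPCAge9Age5:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated stand-in: 20 predictors, of which two matter (plus their product).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))
y = (X[:, 0] - 0.5 * X[:, 3] + 0.3 * X[:, 0] * X[:, 3]
     + rng.normal(scale=0.5, size=300))

# Fit a lasso (penalty chosen by cross-validation) and keep the variables
# whose coefficients were not set to zero.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)

# Manually create pairwise interactions among the surviving variables,
# as was done to build the Interaction data set.
inter = np.column_stack([X[:, i] * X[:, j]
                         for k, i in enumerate(kept)
                         for j in kept[k + 1:]])
X_interaction = np.hstack([X, inter])
```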
In Table 1, we show the different data sets we constructed and explored in the FFC. The table shows which data sets were ultimately used in predictions we submitted. The third column indicates the number of variables in these data sets. The last column shows the relation between the different data sets.
Methods
In our prediction exercise, we used four types of prediction models. 6
The elastic net (Zou and Hastie 2005) is a penalized linear model that combines two penalties. First, there is regularization, which means that large coefficients are shrunk toward zero. Second, there is feature selection, which means that some coefficients are set exactly to zero. These penalties are governed by two tuning parameters: λ, the overall penalty strength, and α, which mixes the ridge and lasso penalties. Setting α = 0 yields ridge regression (shrinkage without feature selection), whereas α = 1 yields lasso regression (feature selection). Intermediate values of α mix both penalties.
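In scikit-learn's parameterization (we used the R package glmnet for our predictions), `alpha` is the overall penalty strength λ and `l1_ratio` plays the role of α above. A minimal sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Simulated data: only the first predictor truly matters.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] + rng.normal(size=200)

# l1_ratio corresponds to alpha in the text: 0 = ridge, 1 = lasso.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X, y)   # mix of both
lasso = ElasticNet(alpha=0.1, l1_ratio=1.0, max_iter=10000).fit(X, y)  # feature selection
ridge = ElasticNet(alpha=0.1, l1_ratio=0.0, max_iter=10000).fit(X, y)  # shrinkage only
```

The lasso variant zeroes out most of the noise predictors while keeping the informative one; the ridge variant only shrinks coefficients.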
Multivariate adaptive regression splines (Friedman 1991), or MARS, uses surrogate features instead of the original predictors. It is an inherently nonlinear model in which nonlinear relationships between predictors do not need to be specified in advance. MARS creates contrasted features of a predictor to enter the model. These new features are added to a basic linear regression to create a piecewise linear model where each feature models an isolated portion of the data (Kuhn and Johnson 2013). The MARS model sequentially searches for predictor/cut point combinations that achieve the smallest error. After an initial selection, the model searches for the next set of features that, given the initial set, provides the best model fit. Once the full set of features is determined, a pruning procedure removes superfluous features.
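The core of the forward pass can be illustrated with a single search for one pair of hinge ("contrasted") features on simulated data. This is a didactic sketch of one step, not the full MARS algorithm (we used the R package earth).

```python
import numpy as np

def hinge_pair(x, cut):
    """The two contrasted features MARS creates for predictor x at a cut
    point: max(0, x - cut) and max(0, cut - x)."""
    return np.maximum(0, x - cut), np.maximum(0, cut - x)

# Simulated data with a true kink at x = 0.5.
rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=500)
y = np.where(x > 0.5, 3 * (x - 0.5), 0) + rng.normal(scale=0.1, size=500)

# Search candidate cut points for the pair that minimizes the squared error
# of a linear fit on the two hinge features (one step of the forward pass).
best_cut, best_sse = None, np.inf
for cut in np.quantile(x, np.linspace(0.05, 0.95, 19)):
    h1, h2 = hinge_pair(x, cut)
    A = np.column_stack([np.ones_like(x), h1, h2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((A @ coef - y) ** 2)
    if sse < best_sse:
        best_cut, best_sse = cut, sse
```

The selected cut point lands near the true kink, illustrating how MARS finds piecewise linear structure without it being specified in advance.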
We also use two tree-based models. These are models that build on regression trees, where the idea is to partition the data into smaller groups that are homogeneous with respect to the response.
We consider generalized boosted modeling, or gbm (Ridgeway 2017). The underlying idea is to combine individual regression trees that may not have strong predictive performance to improve predictive performance. There are different approaches to construct an ensemble of models that yields the final prediction. Boosting is one approach (Friedman 2002) that is known to work well (Kuhn and Johnson 2013).
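A minimal boosting sketch using scikit-learn's GradientBoostingRegressor, a close analogue of the R gbm package we used, on simulated data. The relative-influence measure discussed later corresponds to `feature_importances_` here; the hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Simulated data: the outcome depends nonlinearly on the first two predictors.
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=400)

# Boosting fits many shallow trees sequentially, each correcting the
# residual errors of the ensemble built so far.
gbm = GradientBoostingRegressor(n_estimators=200, max_depth=2,
                                learning_rate=0.05, random_state=0).fit(X, y)

# Relative influence: how much each variable contributed to reducing the loss.
influence = gbm.feature_importances_
```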
We also consider Cubist, a rule-based model based on Quinlan (1987, 1992, 1993). Rules are distinct paths through a tree. Rule-based models try to simplify the rules generated from trees, cutting down model complexity and hopefully also improving predictive performance. Cubist draws on other approaches (e.g., boosting) to improve predictive performance. We chose this method because of its strong performance in the model comparisons ("horse race") presented by Kuhn and Johnson (2013).
All models require tuning parameters that were chosen using cross-validation.
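For example, a cross-validated grid search over the two elastic net tuning parameters can be sketched as follows; the parameter grids and simulated data are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Simulated data with two informative predictors.
rng = np.random.default_rng(6)
X = rng.normal(size=(150, 8))
y = X[:, 0] - X[:, 2] + rng.normal(scale=0.3, size=150)

# Choose the penalty strength and the ridge/lasso mixing parameter by
# cross-validated mean squared error.
grid = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0],
                "l1_ratio": [0.1, 0.5, 1.0]},
    scoring="neg_mean_squared_error", cv=5).fit(X, y)
best = grid.best_params_
```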
Performance
We submitted 11 predictions to the leaderboard. An overview of the models and data sets we used can be found in Table 2. This table provides the performance in terms of MSE given by the leaderboard and the performance on the holdout data. We notice that our predictions always did better on the holdout data than on the leaderboard data. This could be coincidental but could also be due to the fact that the leaderboard data contained randomly imputed data (Salganik et al. 2019). The performance on the leaderboard is also more variable than on the holdout data.
Predictive Performance (mean squared error).
Note: The lowest mean squared error on the holdout data is shown in bold. MARS = Multivariate adaptive regression splines; gbm = generalized boosted modeling.
We see that our best submission, based on the performance on the holdout data, is only 2.5 percent better than our worst in MSE terms. To see this in perspective, consider a naive baseline that uses the mean observed value as a prediction. Such a model yields an MSE of .4251. Our best prediction yields an MSE that is more than 18 percent lower. Consider on the other hand the best submission in the FFC. This submission obtained a score of slightly less than .344, which is an improvement of less than 5 percent over our best prediction.
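As a back-of-the-envelope check of these figures: the numbers below are taken from the text, and our exact best MSE is approximated by the quoted 18 percent improvement.

```python
baseline_mse = 0.4251                  # naive mean prediction (from the text)
best_ffc_mse = 0.344                   # approximate best FFC submission

# Our best prediction: about 18 percent lower MSE than the baseline.
our_mse = baseline_mse * (1 - 0.18)    # roughly 0.3486

improvement = (baseline_mse - our_mse) / baseline_mse   # 0.18 by construction
gap = (our_mse - best_ffc_mse) / our_mse                # well under 0.05
```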
From this, we conclude that the model we initially submitted already yielded a considerable improvement over the baseline model. We were then able to improve slightly, but we were still a bit removed from the best prediction in the FFC.
Table 2 shows the data sets and predictive models used in the different predictions we submitted to the Fragile Families Challenge. The MSE scores are rounded to four decimals. λmin is the lambda for which the mean cross-validated error is minimized. The lowest MSE on the holdout data is shown in bold. Additional information on our predictions can be found in the supporting online materials.
Interpretation
The discussion so far centered around predictive performance as this was the explicit goal of the prediction contest. Here we look at the variables that play a role in our predictions.
We do not discuss all submitted models. For the elastic net models, we focus on the models with α = 1 (lasso) because the other models have more predictors (with very small values) that encompass the predictors of the lasso models. We also do not show the variables in the models based on the data sets with interactions because these are by construction mostly based on the variables we report here (see the previous discussion on the construction of the data sets) and importantly because these models do not seem to outperform the ones we report here.
This leaves us with Models 2, 4, 5, and 6 (see also Table 2). For the elastic net models, we show which variables were not set to zero. For Model 5, we show the variables that are used to make a prediction (which are left after the pruning procedure; see our previous description of MARS). For Model 6, we show the 10 variables with the largest relative influence. These are the variables that contributed the most in reducing the loss function (Friedman 2001).
We see four variables playing a role in the four models. Model 5 only uses these four variables to predict GPA (after pruning).
We notice that all variables are either related to the socioeconomic background of the parents or capture classroom behavior at age nine. The variable with the largest relative influence in our best model (Model 6) is schoolpr1. This is the first principal component extracted from a set of 25 items in the teacher survey at age nine. This set contains items that reflect how much the child participates in educational activities, for example, whether homework is finished on time, the child attends to the teacher’s instructions, keeps the desk clean, and so on. Table 3 shows which variables played a role in different predictions.
Relevant Variables.
Note: PC = principal component.
Conclusion
Our first prediction proved to be a substantial improvement over a baseline prediction that simply predicts the mean value of all observations. Subsequently, we explored different data sets using different types of prediction models. This resulted in only a modest improvement.
Looking back on our participation in the Fragile Families Challenge, we have the following takeaways.
First, variables related to teacher observations are relevant to consider when making predictions about GPA at the age of 15. In our best performing model (Model 6, see “Interpretation” section), these variables played a major role (as measured by relative influence).
Second, in our predictions, we relied solely on median imputation. It is unlikely that this is the best approach (Kuhn and Johnson 2013), and some improvement could be expected by investigating alternative imputation schemes.
Third, the effort invested in creating new data sets based on our reading of the literature and the codebook as well as exploring new algorithms only led to modest improvements. This suggests that a simple prediction approach, using an elastic net on the constructed variables with median imputation, captures most of the low hanging fruit. After that, improvements were increasingly difficult to achieve.
We tried to improve by exploring different models and by reading through the literature. Implementing unfamiliar machine learning models took considerable effort and, in hindsight, yielded no substantial benefit. For this reason, we would recommend that researchers improve predictions by putting effort into approaches they are familiar with.
Supplemental Material
Supplemental material (SRD-17-0118) for “Predicting GPA at Age 15 in the Fragile Families and Child Wellbeing Study” by Louis Raes, published in Socius.
Footnotes
Appendix
Acknowledgements
This paper benefited from the feedback of the editors and three anonymous reviewers. The results in this paper were created with software written in R 3.4.1 (R Core Team 2017) using the following packages: caret (Kuhn 2017), Cubist (Kuhn et al. 2016), dplyr (Wickham et al. 2017), earth (Milborrow 2017), factoextra (Kassambara and Mundt 2017), gbm (Ridgeway 2017), glmnet (Friedman, Hastie, and Tibshirani 2010), and Hmisc. Funding for the Fragile Families and Child Wellbeing Study was provided by the Eunice Kennedy Shriver National Institute of Child Health and Human Development through grants R01HD36916, R01HD39135, and R01HD40421 and by a consortium of private foundations, including the Robert Wood Johnson Foundation. Funding for the Fragile Families Challenge was provided by the Russell Sage Foundation. Errors remain our own.
2
The raw data set contains 591 constructed variables.
3
With the term missing observations, we refer to all observations that are not available (coded with a negative value in the original data set).
4
Note that some prediction models are insensitive to having many (highly correlated) predictors and lack a strong theoretical foundation. Our aim here is to provide an accurate description of our predictions.
5
There is an extensive literature on the importance of early childhood. The overview paper by Heckman (2008) contains many references. There are government-sponsored programs aimed at helping disadvantaged children in their early years. Garces, Thomas, and Currie (2002) evaluate the longer term effects of one program.
Author Biography
References
