
Abstract

Quantitative sociologists frequently use simple linear functional forms to estimate associations among variables. However, there is little guidance on whether such simple functional forms correctly reflect the underlying data-generating process. Incorrect model specification can lead to misspecification bias, and a lack of scrutiny of functional forms leaves room for researcher degrees of freedom in sociological work. In this article, I propose a framework that uses flexible machine learning (ML) methods to provide an indication of the fit potential in a dataset containing the exact same covariates as a researcher’s hypothesized model. When this ML-based fit potential strongly outperforms the researcher’s self-hypothesized functional form, it implies a lack of complexity in the latter. Advances in the field of explainable AI, like the increasingly popular Shapley values, can be used to generate an understanding of the ML model such that the researcher’s original functional form can be improved accordingly. The proposed framework aims to use ML beyond solely predictive questions, helping sociologists exploit the potential of ML to identify intricate patterns in data to specify better-fitting, interpretable models. I illustrate the proposed framework using a simulation and real-world examples.

Keywords

Machine learning; Misspecification; Explainable A.I.; Computational methods

It is common knowledge that valid inference crucially depends on a correctly specified relationship between the outcome of interest, $y$ , and the explanatory variables, $X$ (Buja, Brown, et al. 2019; Cameron and Trivedi 2005; Long and Trivedi 1992). In practice, much more attention is typically paid to identifying relevant variables to include in a model rather than to making sure the functional relationship among these variables is correctly specified. This is evidenced by the fact that variables are often simply assumed to affect the outcome in a linear and additive way (Hindman 2015). However, there is little reason to believe linear additive models appropriately reflect the underlying data-generating process (DGP). At the same time, estimating incorrectly specified models can lead to biased findings; there are noteworthy examples throughout the social sciences—and likely many more that have gone unnoticed—where more complicated functional relationships, which might include nonlinearities or interactions, have led to reversed findings (Christensen and Christensen 2014; Dougherty et al. 2015; Freedman 2009; Heckman, Humphries, and Veramendi 2018; McClintock 2017; Muñoz and Young 2018). Despite this type of criticism of the standard linear additive model having been around for many decades, it remains the workhorse throughout most empirical sociology today (Abbott 1988; Berk 2004; Duncan 1984; Lundberg, Johnson, and Stewart 2021). In this article, I incorporate methods from machine learning (ML) and explainable A.I. (X-AI) into the standard empirical workflow to help sociologists (1) assess whether their hypothesized model fits the data well by comparing its fit against a flexible ML model, and (2) improve their model when it does not accurately represent the patterns in the data by unpacking the ML model using X-AI techniques.

In the past, simple models like the linear additive functional form were often a necessity due to computational constraints.¹ Yet these historical limitations are no longer a factor, giving rise to ML methods that instead exploit the exponential increase in available computational power. These methods let the data dictate the relationship among variables, rather than relying on the researcher to hypothesize the functional form between the two.² Often, this flexibility leads to better-fitting models that improve on researcher-hypothesized models in fitting the data (Brand et al. 2021; Grimmer, Roberts, and Stewart 2021); this has led to increased adoption of ML throughout industry and academia (Athey 2018; Rahal, Verhagen, and Kirk 2022).³ However, ML methods are almost exclusively applied within a predictive context due to their “black box” nature (Mullainathan and Spiess 2017; Shmueli 2010); their inclusion into sociological inquiry, which has an overwhelmingly explanatory focus, remains limited.⁴ This is unnecessary, as the fact that ML methods can identify associations among variables holds value for explanatory work as well.

Concretely, I propose to incorporate ML methods into the typical quantitative empirical workflow in the following way. Assume a researcher is interested in modeling the association between some interest variable and an outcome. To this effect, they apply a conditioning-on-observables approach to control for confounders and develop a hypothesized functional form, $\tilde{f} (\cdot)$ . However, it is unclear a priori whether this model correctly represents the underlying relationship among variables in the data. In the first step of the proposed approach, $\tilde{f} (\cdot)$ is estimated and, in parallel, a flexible ML model, or preferably an ensemble method like a Super Learner that combines multiple ML methods, is estimated to the same data (Baćak and Kennedy 2019). The flexible model thus uses the same set of variables as identified by the researcher in their hypothesized model and follows the same inferential logic.⁵ However, the functional form in which variables relate to one another is left to be decided by the data, effectively considering a much wider range of possible model specifications than a researcher would typically do during model-building. Second, the fit of both the hypothesized model and the flexible ML model is evaluated out-of-sample, either through cross-validation or a formal holdout set, to identify a possible difference in fit between the two models. Third, in case a lack of appropriate specification in $\tilde{f} (\cdot)$ is found, the flexible model is unpacked to better understand the lack of specification in $\tilde{f} (\cdot)$ , such that it can be improved.

The framework relies on a number of concepts from ML or closely adjacent fields. First, that out-of-sample estimates of model fit can be used to compare the ability of widely varying modeling approaches to fit the data (Rose 2013; Stone 1977; Verhagen 2022). Second, that ML methods can be used as efficient function approximators that can uncover relevant patterns in data (Baćak and Kennedy 2019; Van der Laan, Polley, and Hubbard 2007). Third, that methods from the rapidly evolving field of X-AI, like the Shapley values approach explored in this article that decomposes model predictions along covariates, allow researchers to distill understanding from ML methods, which can then be used to improve hypothesized models (Samek et al. 2019). These elements, in particular the last, allow for an iterative process where ML is used in a supporting role in model-building. The novelty of the framework thus lies in using ML methods as a guide, in such a way that the overarching goal of the framework is still to generate interpretable models (Agrawal, Peterson, and Griffiths 2020; Rudin 2019). This complementary role should be juxtaposed with the overwhelmingly predictive focus of ML methods in sociology up until now (Molina and Garip 2019). The framework is also in line with increasing calls to use ML in a supportive role throughout the empirical pipeline, rather than completely replacing interpretable methods for ML equivalents (Agrawal et al. 2020; Rudin 2019; Rudin et al. 2010).⁶,⁷

Incorporating ML and X-AI methods into empirical work should not only improve model-building but also improve transparency into the research process. A lack of emphasis on functional form specification provides researchers with relative freedom in evaluating multiple functional forms and choosing which one to report (Muñoz and Young 2018; Sala-i Martin 1997; Simonsohn, Simmons, and Nelson 2020). Such “researcher degrees of freedom” are to blame for a large number of important empirical findings turning out to be un-reproducible and a more general “crisis in science,” where results are often dependent on over-engineered models (Gelman and Loken 2013; Ioannidis 2005; Young 2018). Comparing a researcher’s carefully tailored model to the fit of an ML method estimated to the same data can help identify such an over-engineered model, by providing insights into model fit that are not a result of a researcher’s decision-making process but rather a feature of the data. More generally, overfitting is a central concern throughout ML, and the frequent re-estimating of models on subsets—standard practice throughout ML—can further protect against “p-hacking” (Gelman and Loken 2013).

More generally, the proposed framework provides a welcome realignment of empirical methods with today’s computational reality (Efron and Hastie 2016) and is in line with increasing calls for a re-appreciation of the predictive power of social theories that can be observed more broadly throughout sociology (Hofman, Sharma, and Watts 2017; Verhagen 2022; Watts 2014, 2017).

The rest of this article is structured as follows. I first introduce a toy example to illustrate the risk of model misspecification to inference. Then, I introduce the three steps of the proposed framework and illustrate each by applying them to the same toy example. I next compare the proposed approach to other computational frameworks. In the empirical section, I apply the framework to one simulated and two real-world case studies. In doing so, I illustrate how the framework identifies (1) missing nonlinearities and interactions in a simulated dataset on the association between schooling and earnings, (2) nonlinearities and spatial heterogeneity in a large dataset on London house sales, and (3) various intricate patterns in voting preferences among U.S. voters in the General Social Survey (GSS).

Misspecification and Risks to Inference

Consider the following toy example. Our interest lies in whether having grandchildren affects elderly individuals’ worries about climate change. We analyze a survey $(N = 900)$ in which respondents were asked their level of concern with climate change on a seven-point scale. In addition, we have information on respondents’ age, sex, whether they completed high school, and, our interest variable, whether they have grandchildren. We might propose the following model (Model 1) within a conditioning-on-observables framework:

$$y_{i} = \beta_{0} + \beta_{1} x_{\text{age}, i} + \beta_{2} x_{\text{sex}, i} + \beta_{3} x_{\text{high school}, i} + \beta_{4} x_{\text{grandchildren}, i} + \epsilon_{i}, \tag{1}$$

which simply plugs all the relevant controls and our interest variable into a linear additive model. We might also consider including a nonlinearity in the effect of age by adding a square (Model 2):

$$y_{i} = \beta_{0} + \beta_{1} x_{\text{age}, i} + \beta_{2} x_{\text{sex}, i} + \beta_{3} x_{\text{high school}, i} + \beta_{4} x_{\text{grandchildren}, i} + \beta_{5} x_{\text{age}, i}^{2} + \epsilon_{i}. \tag{2}$$

Results for estimating both models on a simulated dataset are presented in the first two columns in the left-hand panel of Table 1. We find a substantial difference between the two models in terms of the estimated association between having grandchildren and concerns with climate change. The effect is statistically insignificant for Model 1 with a point estimate of −0.02, whereas it is negative and significant $(p < 0.01)$ for Model 2 with a point estimate of −0.68. A Likelihood Ratio (LR) test strongly prefers the second model over the first ( $p < 0.001$ ). We might feel comfortable concluding our empirical analysis at this stage.

Table 1.

Table on the left shows regression results when estimating a functional form assuming linear age, quadratic age, and a step function for age; Plot on the right shows implied effect of age for the three model specifications.

	Model 1	Model 2	Model 3
(Intercept)	−19.02^***	−2.16	9.97^***
	(0.51)	(3.42)	(0.23)
Age: linear	0.52^***	0.03
	(0.01)	(0.10)
Grandchildren	−0.02	−0.68^**	0.57^*
	(0.21)	(0.25)	(0.23)
Sex: female	−0.97^***	−0.99^***	−0.95^***
	(0.10)	(0.10)	(0.08)
High school	0.41^***	0.38^***	0.36^***
	(0.10)	(0.10)	(0.08)
Age: squared		0.00^***
		(0.00)
Age: 50+			0.26^***
			(0.02)
Age: 65+			0.49^***
			(0.03)
Age: 75+			−0.41^***
			(0.04)
Age: 85+			−0.17
			(0.13)
R²	0.88	0.89	0.91
Adj. R²	0.88	0.89	0.91
Num. obs.	900	900	900
RMSE	1.44	1.42	1.23

Note: Table shows OLS regression coefficients. Standard errors are in parentheses. The outcome variable of interest is a seven-point scale indicating respondent’s concerns with climate change (7 = highest, 1 = lowest).

*p < 0.05. **p < 0.01. ***p < 0.001.

Imagine, however, that the true effect of age is indeed nonlinear, but piece-wise linear rather than a second-degree polynomial. Specifically, there is a stronger increase in concern by age among people aged 65 to 75, with more modest increases at other ages. If we had estimated a functional form in line with this nonlinearity, we would find results as reported in the third column of the left panel of Table 1. These results show that having grandchildren is in fact associated with a higher level of concern. The estimated age effect per functional form is illustrated in the right panel of Table 1, with the dashed line representing the “true” effect in the DGP and the colored lines the best-fitting effect given the flexibility of the model. The problem is that both the linear and the quadratic functions of age overshoot the true effect as the respondent’s age increases. Coupled with the fact that grandchildren are more prevalent among the older population, this effect leads to incorrect inference.
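The mechanism can be sketched in a small simulation. The sketch below is not the simulated dataset behind Table 1: the knots, slopes, coefficient values, age range, and the logistic link between age and grandchildren are all illustrative assumptions, so the printed estimates will differ from the table, and the sign and size of the distortion depend on these assumed values. It only shows that the grandchildren coefficient shifts when the age term is misspecified and is recovered when the piece-wise linear form is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 900  # sample size of the toy example

# Hypothetical DGP: every numeric value below is an illustrative assumption.
age = rng.uniform(50, 90, n)
female = rng.integers(0, 2, n).astype(float)
high_school = rng.integers(0, 2, n).astype(float)
# Grandchildren become more likely with age (assumed logistic form).
p_grand = 1.0 / (1.0 + np.exp(-(age - 70.0) / 5.0))
grandchildren = (rng.uniform(size=n) < p_grand).astype(float)


def hinge(x, knot):
    """Piece-wise linear basis term: max(x - knot, 0)."""
    return np.maximum(x - knot, 0.0)


# Assumed true age effect: modest slope, steeper between 65 and 75.
age_effect = 0.05 * (age - 50) + 0.45 * hinge(age, 65) - 0.45 * hinge(age, 75)
y = (1.0 + age_effect - 1.0 * female + 0.4 * high_school
     + 0.5 * grandchildren + rng.normal(0, 1, n))


def ols(columns, outcome):
    """OLS coefficients for an intercept plus the given list of columns."""
    X = np.column_stack([np.ones(len(outcome))] + columns)
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta


controls = [female, high_school, grandchildren]
b_linear = ols([age] + controls, y)
b_quadratic = ols([age, age ** 2] + controls, y)
b_piecewise = ols([age, hinge(age, 65), hinge(age, 75)] + controls, y)

# The grandchildren coefficient is the last entry of each coefficient vector.
for name, b in [("linear", b_linear), ("quadratic", b_quadratic),
                ("piece-wise", b_piecewise)]:
    print(f"{name:>10}: grandchildren coefficient = {b[-1]: .2f}")
```

Only the piece-wise specification matches the assumed DGP, so only its grandchildren estimate should land near the assumed value of 0.5.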

The issue with drawing inference from the first two models is that the assumption of exogeneity, that is, $E(\epsilon | X) = 0$, is violated. The error term for the first two models will be small for observations around ages 60, 70, and 80, as the estimated effect lies close to the true effect, but it will be larger at the ends of the age distribution. The coefficient estimates still reflect the best-fitting model given the functional form, but clearly lead to incorrect inference. The more general problem of violated exogeneity is further illustrated by the four bivariate relationships plotted in Figure 1.⁸ Three of the four models include a nonlinear relationship, although a linear coefficient estimated to the data (N = 80) is statistically significant at the conventional 5 percent level. These examples illustrate two typical risks of misspecification: that we wrongfully assume an effect to be linear when it is not, and that correlation of our interest variable with a misspecified control can lead to omitted variable bias (Cameron and Trivedi 2005; Long and Trivedi 1992).⁹

Figure 1.

Implied association from a linear model (dashed line) relative to the true effect (solid line).

Various statistical tests have been proposed to combat misspecification. The most popular ones are White’s test for functional misspecification (White 1980, 1981) and Ramsey’s RESET test (Ramsey 1969). White’s test statistic is based on a comparison of the estimated coefficients from a hypothesized model $\hat{\beta}$ with those from a weighted regression $\hat{\beta}_{WLS}$, whereas the Ramsey test statistic compares the hypothesized model with a model that includes higher-order versions of the explanatory variables and evaluates their statistical significance. Both provide computationally efficient statistics to assess misspecification of the functional form, although both assume the true model lies within a specific class of functions $M$ with possibly restrictive assumptions (Golden et al. 2016).¹⁰ Since the introduction of the White and Ramsey tests, more computationally intensive (semi-)parametric methods have been developed (Robinson 1988a, 1988b; Yatchew 1997), and substantial follow-up research has built on their principles (Golden et al. 2016). Unfortunately, misspecification tests have a number of well-known theoretical and practical problems (Buja, Brown, et al. 2019; Long and Trivedi 1992).

First, they are generally under-powered and can struggle to identify misspecification in multivariate settings (Buja, Kuchibhotla, et al. 2019:616).¹¹ Second, they provide limited insight into what to do next, if a test is rejected. Third, and perhaps most important, the actual implementation of misspecification tests is limited in published empirical work (Long and Trivedi 1992; Open Science Collaboration 2015). As a case in point, the two flagship journals in sociology—the American Sociological Review and American Journal of Sociology—totalled 70 research articles in 2022. Of these, 40 included quantitative analyses, of which 32 implemented a conditioning-on-observables approach. None of these articles implement any of the standard misspecification tests like the White or RESET tests.¹² In practice, researchers enjoy relative freedom in specifying their functional form, as well as the robustness and specification tests they use to verify its appropriateness. The crisis in reproducibility and the effect of researcher degrees of freedom on empirical results thus extends to the (lack of) specification checks chosen by researchers to validate their model assumptions (Simonsohn et al. 2020; Young 2018).
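For readers unfamiliar with the RESET test discussed above, the following is a minimal sketch of one common variant, which augments the hypothesized regression with powers of its fitted values and computes an F statistic for the added terms. The function name `reset_f_statistic`, the choice of powers 2 and 3, and the simulated data are assumptions made for illustration; the statistic would still have to be compared against an F distribution with the returned degrees of freedom.

```python
import numpy as np


def reset_f_statistic(X, y, max_power=3):
    """F statistic for adding powers 2..max_power of the fitted values.

    X must already contain an intercept column.
    Returns (F statistic, numerator df, denominator df).
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ssr_restricted = np.sum((y - fitted) ** 2)

    # Standardising the fitted values leaves the augmented column space
    # unchanged (X contains an intercept) but improves conditioning.
    z = (fitted - fitted.mean()) / fitted.std()
    powers = np.column_stack([z ** p for p in range(2, max_power + 1)])
    X_aug = np.column_stack([X, powers])
    beta_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    ssr_unrestricted = np.sum((y - X_aug @ beta_aug) ** 2)

    q = powers.shape[1]
    df_resid = n - k - q
    f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_resid)
    return f_stat, q, df_resid


# Illustrative data: one correctly specified and one misspecified linear model.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, n)
X = np.column_stack([np.ones(n), x])
noise = rng.normal(0, 1, n)
y_lin = 1.0 + 2.0 * x + noise            # linear model is correct
y_quad = y_lin + 1.5 * x ** 2            # linear model omits a squared term

f_lin, q, df_resid = reset_f_statistic(X, y_lin)
f_quad, _, _ = reset_f_statistic(X, y_quad)
print(f"F (correct model)      = {f_lin:.2f} on ({q}, {df_resid}) df")
print(f"F (misspecified model) = {f_quad:.2f} on ({q}, {df_resid}) df")
```

A large statistic flags the omitted curvature, but it does not say which variable is misspecified or how to repair the model, which is the second limitation noted above.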

Even without malicious intent, many functional forms are from an outdated age of computational limitations (Buja, Kuchibhotla, et al. 2019:615; Efron and Hastie 2016; Muñoz and Young 2018). The linear additive functional forms plugged into exponential family probability distributions were historically preferred due to their ease of estimation and interpretation, not their de facto appropriateness to study social life. This pragmatic preference for parsimony has led to the continued prevalence of simplistic functional forms, even though the computational constraints under which they were developed are no longer present, and the wealth of qualitative research in the social sciences consistently implies that social life is in fact highly complex and probably not appropriately modeled using such simple functional forms (Abbott 1988). As the toy example above illustrates, inference can easily go astray given the often blind acceptance of simplistic functional forms in empirical work.

A Computational Framework to Improve Model-Building

I propose a computational framework that exploits ML methods to obtain an indication of the potential model fit in a dataset. This potential can then be compared to the fit of a researcher’s own hypothesized model, and thus used to diagnose a possible lack of specification in the latter. This estimate provides researchers with an intuitive assessment of how well the data could be modeled versus how well the data is modeled by their hypothesized model. ML thus takes a guiding role in model-building. Whenever it is found that the ML model improves on the researcher-hypothesized model, methods from the X-AI domain can be implemented to unpack and better understand the ML model and subsequently improve the hypothesized model.

Given a dataset $D$ and a researcher-hypothesized functional form $\tilde{f} (\cdot)$ , relating outcome of interest $y$ with independent variables $X$ , the proposed framework consists of the following three simple steps (see Figure 2 for a schematic illustration):

1. Estimate hypothesized model $\tilde{f} (\cdot)$ to the data, as well as an ML model $\bar{f} (\cdot)$ .

2. Evaluate the model fit of both $\tilde{f} (\cdot)$ and $\bar{f} (\cdot)$ and assess a possible lack of fit in $\tilde{f} (\cdot)$ .

3. Diagnose why $\bar{f} (\cdot)$ improves on the hypothesized model and improve $\tilde{f} (\cdot)$ accordingly.

Figure 2.

Schematic illustration of the proposed framework.

Crucial to the framework is that the ML method is estimated using the same variables as present in the model originally hypothesized by the researcher. The framework is not designed for data-mining, where a large number of possible explanatory variables are included in the model. In such a case, one risks including variables that might improve model fit but could harm understanding of the underlying processes—typical examples would be (accidentally) including post-treatment variables or colliders. Instead, the functional form is scrutinized given the exact same inferential curation of variables and logic as present in the researcher’s originally hypothesized model.¹³ The following sections describe the three components of the proposed framework in more detail and illustrate them using the toy example.

Step I: The benchmarking model

The first component of the proposed framework is a flexible form model $\bar{f} (\cdot)$ , which serves as a benchmark of the possible model fit in the data. This model uses the same variables and thus follows the same inferential logic as the hypothesized model $\tilde{f} (\cdot)$ , but it evaluates patterns not necessarily considered by the researcher’s functional form. The ML methods do not rely on prespecified functional forms, but distill the functional form from the observed data (Baćak and Kennedy 2019; Grimmer et al. 2021). However, they retain more structure than do non-parametric approaches like local regression and often suffer less from the “curse of dimensionality,” where the number of data points required scales exponentially with the covariate space (Bishop 2006).

Many different ML methods can be applied to datasets commonly encountered in the social sciences (Athey 2018; Hastie, Tibshirani, and Friedman 2009). In fact, limiting the benchmarking step to a single researcher-curated ML model to serve as $\bar{f} (\cdot)$ would invite some of the very risks this framework is designed to address in terms of researcher degrees of freedom in model-building. Tuning parameters in ML models can easily be calibrated to diminish model fit and thus imply appropriate fit of $\tilde{f} (\cdot)$ , when this is not the case. Relatedly, some methods fit certain patterns in data better than others with often limited a priori guidance (Berk and Bleich 2013; Rose 2013). A principled approach would thus estimate not a single but an ensemble of flexible form models during the benchmarking phase (Baćak and Kennedy 2019).

The Super Learner is an example of such an ensemble method, consisting of multiple ML methods or the same method with different parameter settings. Among these, the most appropriate model is identified using cross-validation. The oracle result by Van Der Laan and Dudoit (2003) shows that a Super Learner is indeed optimal to identify the best-fitting model to approximate an unknown DGP among evaluated models, and the price of including large numbers of models into the Super Learner is minor in terms of performance (Baćak and Kennedy 2019; Van der Laan et al. 2007; Van der Vaart, Dudoit, and van der Laan 2006). Therefore, many models can and should be evaluated, and ensembles of various methods are typically preferred as function approximators over single models (Agrawal et al. 2020).¹⁴ A Super Learner can also be specified prior to model estimation and preregistered, improving transparency in the research process (Open Science Collaboration 2015). Naturally, the full set of models and the resulting fit of each should be part of the research output.¹⁵
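The selection logic can be sketched as follows. This is a simplified "discrete" version that only picks the candidate with the lowest cross-validated RMSE; a full Super Learner can additionally weight candidates into a combined predictor, and this sketch does not use any Super Learner software package. The candidate set, the simulated data, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 900
age = rng.uniform(50, 90, n)


def hinge(x, knot):
    """Piece-wise linear basis term: max(x - knot, 0)."""
    return np.maximum(x - knot, 0.0)


# Assumed DGP: piece-wise linear age effect plus unit-variance noise.
y = (0.05 * (age - 50) + 0.45 * hinge(age, 65) - 0.45 * hinge(age, 75)
     + rng.normal(0, 1, n))

# Candidate library: each entry maps raw age to an OLS design matrix.
candidates = {
    "linear": lambda x: np.column_stack([np.ones_like(x), x]),
    "quadratic": lambda x: np.column_stack([np.ones_like(x), x, x ** 2]),
    "piecewise": lambda x: np.column_stack(
        [np.ones_like(x), x, hinge(x, 65), hinge(x, 75)]),
}


def cv_rmse(features, x, outcome, folds):
    """Out-of-fold RMSE for each fold, refitting OLS without that fold."""
    scores = []
    for test_idx in folds:
        train = np.ones(len(outcome), dtype=bool)
        train[test_idx] = False
        beta, *_ = np.linalg.lstsq(features(x[train]), outcome[train], rcond=None)
        pred = features(x[test_idx]) @ beta
        scores.append(np.sqrt(np.mean((outcome[test_idx] - pred) ** 2)))
    return np.array(scores)


# The same 10 folds are used for every candidate so the scores are comparable.
folds = np.array_split(rng.permutation(n), 10)
cv_table = {name: cv_rmse(f, age, y, folds) for name, f in candidates.items()}
best = min(cv_table, key=lambda name: cv_table[name].mean())

for name, scores in sorted(cv_table.items(), key=lambda kv: kv[1].mean()):
    print(f"{name:<10} mean={scores.mean():.3f} sd={scores.std(ddof=1):.3f} "
          f"min={scores.min():.3f} max={scores.max():.3f}")
print("selected:", best)
```

Reporting the per-fold mean, spread, minimum, and maximum for every candidate, rather than only the winner, is one way to make the full library part of the research output.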

Among the broader set of ML methods, some approaches lend themselves better than others to modeling social science data (Chen and Guestrin 2016; Lundberg et al. 2020). In particular, tree-based methods deal well with various types of data often encountered in the social sciences (e.g., categorical and numerical variables) and can handle missing data. They also do not require exponential increases in sample size as the feature space increases. Furthermore, the most popular tree-based methods, like the Random Forest (RF) or Gradient Boosting (GB) model, require relatively little tuning on the part of the researcher and can be applied out-of-the-box or with limited effort to most social science datasets (Breiman 1996; Freund and Schapire 1996).¹⁶ These advantages make tree-based approaches an attractive class of ML methods to include in the benchmarking step. As a case in point, consider the nonlinear associations in Figure 3, which were introduced earlier. As before, the black lines illustrate true associations and the green diamonds are a hypothetical dataset generated by the true association and some white noise. The blue dots and red squares are out-of-sample predicted values based on a GB and an RF model trained to 80 observed data points and fed with 100 uniformly distributed $X$ values to generate $\hat{y}$ -predictions. Both models learn the nonlinearity in the data well and do so without requiring any pre-hypothesized functional relationship.

Figure 3.

Predicted outcome using a GB (blue dots) and RF (red squares) model fit to noisy data (green diamonds).
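To make the idea of learning a nonlinearity without a pre-specified functional form concrete, the sketch below hand-codes gradient boosting with depth-one trees (stumps) under squared-error loss. It is a didactic stand-in, not the implementation behind Figure 3: the sine-shaped association, the noise level, and the function names are assumptions, and a real analysis would use an established GB library.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-split stump on one feature under squared-error loss."""
    order = np.argsort(x)
    xs, rs = x[order], residual[order]
    n = len(xs)
    csum = np.cumsum(rs)
    left_n = np.arange(1, n)              # observations left of each split
    left_sum = csum[:-1]
    right_sum = csum[-1] - left_sum
    # Minimising the split's SSE is equivalent to maximising this gain.
    gain = left_sum ** 2 / left_n + right_sum ** 2 / (n - left_n)
    gain = np.where(xs[1:] > xs[:-1], gain, -np.inf)  # split between distinct x
    i = int(np.argmax(gain))
    threshold = 0.5 * (xs[i] + xs[i + 1])
    return threshold, left_sum[i] / left_n[i], right_sum[i] / (n - left_n[i])


def boost(x, y, n_rounds=200, eta=0.1):
    """Fit stumps sequentially to the current residuals, shrunk by eta."""
    base = float(y.mean())
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        thr, left, right = fit_stump(x, y - pred)
        pred += eta * np.where(x <= thr, left, right)
        stumps.append((thr, left, right))
    return base, eta, stumps


def predict(model, x_new):
    base, eta, stumps = model
    out = np.full(len(x_new), base)
    for thr, left, right in stumps:
        out += eta * np.where(x_new <= thr, left, right)
    return out


# Assumed nonlinear association: 80 noisy training points, 100 prediction points.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2 * np.pi, 80)
y_train = np.sin(x_train) + rng.normal(0, 0.2, 80)
x_new = np.linspace(0, 2 * np.pi, 100)
truth = np.sin(x_new)

model = boost(x_train, y_train)
slope, intercept = np.polyfit(x_train, y_train, 1)  # linear benchmark

rmse_gb = np.sqrt(np.mean((predict(model, x_new) - truth) ** 2))
rmse_linear = np.sqrt(np.mean((intercept + slope * x_new - truth) ** 2))
print(f"RMSE against the true curve: boosting={rmse_gb:.3f}, linear={rmse_linear:.3f}")
```

The number of rounds and the shrinkage rate `eta` play the role of the tuning parameters discussed above; poorly chosen values can make the flexible model look worse than it needs to.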

I illustrate the first step of the framework by estimating a Super Learner including a number of GB and RF models with various parameter settings to the toy example introduced earlier. I also include the two researcher-hypothesized models in the Super Learner, reflecting linear (SL.GLM_Linear) and quadratic (SL.GLM_Quadratic) specifications of age. For illustrative purposes, I add the true model (SL.GLM_PieceLinear), even though this model was not hypothesized originally. Table 2 shows a truncated output from the Super Learner and the root mean squared error (RMSE)¹⁷ of various models included in the Super Learner, estimated based on 10 folds using cross-validation (for the full output, see Appendix Table A1). The best-performing model is, unsurprisingly, the true model, but its performance is closely tracked by a GB model that strongly outperforms the linear and the quadratic specifications. Including the GB model with different parameters (in this case a slow learning rate combined with high or low tree depth) leads to overfitting and a strong underperformance in terms of fit. This further illustrates the necessity of including a large set of models into an ensemble like the Super Learner.

Table 2.

Truncated Super Learner Performance Using the Toy Example Data.

Model	Ave.	SD	Min.	Max.
SL.GLM_PieceLinear	1.218	0.060	1.046	1.320
SL.GB_200_1_0.1	1.229	0.059	1.107	1.327
SL.GB_500_1_0.1	1.231	0.059	1.104	1.323
SL.GB_100_1_0.1	1.241	0.059	1.129	1.351
RF_200_2	1.383	0.064	1.229	1.544
SL.GLM_Quadratic	1.384	0.063	1.174	1.536
SL.GLM_Linear	1.413	0.065	1.321	1.569
SL.GB_200_5_0.01	2.698	0.088	2.484	2.926
SL.GB_200_6_0.01	2.698	0.088	2.486	2.927

Note: The models included in the Super Learner are a GB model with parameter grid: ntrees = [100, 200, 500], max depth = [1, 2, 3, 4, 5, 6], shrinkage = [0.01, 0.1], and a Random Forest model with parameter grid: mtry = $[\sqrt{n}, 2 \sqrt{n}]$ , ntree = [100, 200, 500]. Super Learner performance based on RMSE & 10-fold CV. GB parameters: n_rounds_max_depth_eta, RF parameters: n_trees_mtry. Output is truncated for brevity, see Appendix Table A1 for the full output.

Step II: Estimating and comparing model fit

The second component of the proposed framework is a metric to compare the fit of benchmarking model $\bar{f} (\cdot)$ from the previous step with that of the hypothesized model $\tilde{f} (\cdot)$ . Typically, comparative statistics like the Akaike Information Criterion (AIC) or LR tests (in case of nested models) are used to compare two or more explicitly hypothesized models. These statistics are not easily transferred to ML methods, mainly because they rely on in-sample diagnostics and specified functional forms or require estimation of degrees of freedom, which are challenging for ML models (Janson, Fithian, and Hastie 2015). As a result, ML methods are typically evaluated on their out-of-sample performance (Hastie et al. 2009; Shmueli 2010). The broad comparability of out-of-sample predictions across modeling domains makes them an attractive fit metric for the framework (Stone 1977; Van Der Laan and Dudoit 2003; Verhagen 2022).

Formally, the out-of-sample estimation of model fit would require splitting the total dataset $D$ into two partitions, a training set, $D_{train}$ , used to estimate the model, and a test set, $D_{test}$ , used to evaluate the model’s fit. By making predictions with the model estimated based on $D_{train}$ but using data from $D_{test}$ to make predictions, the latter predictions can be compared with the actually observed outcomes, and summary metrics of fit like the RMSE can be calculated (Hastie et al. 2009). Separating off a testing set is generally preferred for a truly out-of-sample estimate of fit, but it does sacrifice part of the sample available for estimation (usually 20–30 percent).¹⁸ A common alternative is $K$ -fold cross-validation, where the data are split into $K$ equal-sized folds. The model is estimated $K$ times, each time omitting one fold and using the omitted fold to generate predictions (Kohavi 1995).¹⁹ The Super Learner approach discussed in Step 1 similarly relies on $K$ -fold cross-validation. A closely related approach is Monte Carlo cross-validation, where $M$ random splits of $D$ into training and testing sets are made rather than mutually exclusive folds. The latter can be helpful when the data have additional structure that complicates separation of $D$ into distinct subsets. The two approaches have been shown to be similar in practice (Yousef 2020). Another benefit of implementing cross-validation to obtain a metric of model fit is that bootstrapped estimates of the coefficients for the researcher-hypothesized model $\tilde{f} (\cdot)$ are obtained, which can further help identify a lack of robustness in an estimated coefficient.²⁰

As others have noted, fit need not automatically equate with the best model from an inferential perspective (Grimmer et al. 2021; Muñoz and Young 2018). This discussion is often concerned with including colliders or post-treatment variables in a model, or simply including as many variables as a researcher can possibly think of. Such practices typically improve fit, but they can invalidate inference. This is the central reason why the proposed framework uses the same inferential reasoning in terms of which explanatory variables are included in the flexible model as the originally hypothesized model. Much of the criticism regarding blind “fit-hunting,” where all causal or inferential logic is effectively abandoned in favor of model fit, is thus not applicable, and variable selection is driven by the substantive question rather than fit (Grimmer et al. 2021:412).²¹ However, it can be challenging to identify when a difference in fit between the flexible and researcher-hypothesized model warrants a re-evaluation of the latter. For example, improvements in fit for a covariate uncorrelated with the interest variable may have little bearing on the consistency of the interest variable.²² Ideally, we would like clear statistical guidance on the extent of bias in an underspecified model. In practice, the decision to unpack the flexible form model will likely depend on the willingness of both the researcher to defend a specification with an apparent lack of fit, and the research community to accept it. This is not a consequence of the framework, but a level of uncertainty resulting from embracing a state of the world where we acknowledge that we have very little knowledge about the true DGP and are unwilling to make stringent assumptions on it a priori.

Putting the second step of the framework into practice, I compare the model fit of the GB model identified in Step 1 with the two hypothesized alternatives for the toy example. I include two fit metrics: the R-squared $(R^{2})$ and the RMSE, and evaluate both using 1,000 splits of the dataset into train and test sets. The results are summarized in Table 3. For illustration’s sake, the true piece-wise model is included in this step as well. The bottom three rows for both metrics show the difference in the fit metric between the flexible model and the hypothesized models. The flexible model strongly outperforms the linear and quadratic models, but it does not improve on the true model, as should be expected.²³

Table 3.

Three linear additive models and a GB model estimated on a train set consisting of 80 percent of the total dataset.

Variable	N	Mean	SD	Min.	Pctl. 5	Pctl. 95	Max.
RMSE
GB	1000	1.241	0.06	1.062	1.145	1.34	1.414
Linear	1000	1.417	0.064	1.195	1.314	1.522	1.643
Quadratic	1000	1.391	0.065	1.181	1.284	1.495	1.602
PieceLinear	1000	1.222	0.06	1.049	1.124	1.316	1.414
GB vs. Linear	1000	–0.176	0.049	–0.315	–0.254	–0.097	–0.024
GB vs. Quadratic	1000	–0.149	0.046	–0.284	–0.222	–0.073	0.008
GB vs. PieceLinear	1000	0.019	0.02	–0.042	–0.013	0.051	0.109
$R^{2}$
GB	1000	0.656	0.005	0.638	0.647	0.664	0.671
Linear	1000	0.626	0.005	0.609	0.617	0.635	0.64
Quadratic	1000	0.630	0.005	0.613	0.621	0.639	0.645
PieceLinear	1000	0.656	0.005	0.639	0.648	0.665	0.672
GB vs. Linear	1000	0.030	0.002	0.024	0.027	0.033	0.037
GB vs. Quadratic	1000	0.025	0.002	0.019	0.022	0.028	0.032
GB vs. PieceLinear	1000	–0.001	0.000	–0.002	–0.001	0.000	0.000

Note: The three linear additive models assume the effect of age to be linear, quadratic, or piece-wise linear, respectively. A GB model is also fit using the xgboost package in R with the following parameters: depth = 1, nrounds = 200, eta = 0.1, which were shown to be optimal in the Super Learner routine (see Table 2). RMSE and OOS $R^{2}$ are calculated for each model based on the remaining 20 percent of the data. The last three rows of the RMSE and $R^{2}$ parts show bootstrapped results of the difference between the optimal GB model and each of the linear additive models. In total, 1,000 splits of the total dataset into train and test sets are evaluated.
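The repeated train/test comparison behind Table 3 can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are a hypothetical tent-shaped age effect with Gaussian noise, a binned-mean regressor stands in for the GB model, and the function names (`fit_linear`, `fit_binned`, `monte_carlo_cv`) are invented for this sketch rather than taken from the article's code.

```python
import math
import random


def fit_linear(xs, ys):
    """Closed-form one-variable OLS; returns a prediction function."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_binned(xs, ys, width=5.0):
    """Binned-mean regressor: a crude flexible stand-in for the GB model."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        b = int(x // width)
        sums[b] = sums.get(b, 0.0) + y
        counts[b] = counts.get(b, 0) + 1
    overall = sum(ys) / len(ys)
    means = {b: sums[b] / counts[b] for b in sums}
    # Fall back to the overall mean for bins never seen in training.
    return lambda x: means.get(int(x // width), overall)


def rmse(predict, xs, ys):
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys))


def monte_carlo_cv(xs, ys, n_splits=200, train_share=0.8, seed=1):
    """Out-of-sample RMSE difference (flexible minus linear) for each random split."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_train = int(train_share * len(idx))
    diffs = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
        x_te, y_te = [xs[i] for i in test], [ys[i] for i in test]
        flexible = fit_binned(x_tr, y_tr)
        linear = fit_linear(x_tr, y_tr)
        diffs.append(rmse(flexible, x_te, y_te) - rmse(linear, x_te, y_te))
    return diffs


def true_effect(age):
    """Hypothetical piece-wise linear effect: rising until 45, falling afterwards."""
    return 0.1 * age if age < 45 else 4.5 - 0.1 * (age - 45)


data_rng = random.Random(0)
ages = [data_rng.uniform(20, 70) for _ in range(2000)]
outcome = [true_effect(a) + data_rng.gauss(0, 0.5) for a in ages]

diffs = monte_carlo_cv(ages, outcome)
mean_diff = sum(diffs) / len(diffs)
share_better = sum(d < 0 for d in diffs) / len(diffs)
print(f"mean RMSE difference (flexible - linear): {mean_diff:.3f}")
print(f"share of splits where the flexible model wins: {share_better:.2f}")
```

Reporting the whole distribution of differences across splits, as Table 3 does, shows whether a gap in fit is systematic or an artifact of one particular split.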

Step III: Unpacking the ML model

The third and final component of the proposed framework is a method to generate understanding of the ML model whenever it outperforms the hypothesized model in a way that warrants re-evaluation of the originally hypothesized functional form. The increasing application of ML models in everyday life has led to pressure to improve understanding of the inner workings of ML models.²⁴ These developments have led to the emergence of X-AI, which focuses on understanding the patterns underlying ML models. Two general approaches can be identified within X-AI: global and local explainability. The former attempts to describe the mechanics of a model in general terms, that is, which variables tend to be important for the model’s overall performance. The reporting of variable importance measures is an example of this approach (Baćak and Kennedy 2019; Brand et al. 2021). In local explainability, the goal is to explain the drivers of singular predictions made by a model. Local explanation methods are more appropriate for the framework, as they provide insights into the actual functional patterns between covariates and the outcome.

Various local explanation methods have been developed for different substantive questions that may be asked of a model (Doshi-Velez and Kim 2017; Lipton 2018; Zhou et al. 2021).²⁵ For example, much explainability research is focused on providing additional ad-hoc context into a model’s predictions for practitioners making (high-stakes) decisions. In such cases, the ability to quickly calculate and extract information underlying a model’s prediction could be more relevant than emphasizing fidelity. Conversely, an ethical reviewer of a system-in-action might require more fine-grained and detailed explanations. Within the proposed framework, the onus is on understanding the underlying patterns found by the ML model, rather than any practical or regulatory concerns. This motivates the use of Shapley values, which have a number of unique and attractive properties with extensive theoretical grounding (discussed in more detail below).²⁶ Importantly, recent advances have made their estimation computationally feasible and rekindled a general interest in their use (Aas, Jullum, and Løland 2021; Heskes et al. 2020; Samek et al. 2019). Shapley values also align closely with human intuitions of model interpretation, making them appropriate for the iterative process envisioned in the framework (Doshi-Velez and Kim 2017; Lundberg and Lee 2017; Rudin 2019). Shapley values have been applied in various fields, notably medicine (Lundberg et al. 2020; Tang et al. 2021), but have also found applications in sociological research, for instance, as a way to understand complex network dynamics (van der Laan et al. 2022) and to resolve path dependencies in decomposition metrics like segregation indices (Elbers 2023).

Shapley values for model explanation

Shapley values were originally developed within cooperative game theory to assist in the task of distributing a game’s overall payout to its participants. Because a game’s payout can depend on its participants’ actions in a potentially complex manner, such an attribution function is not trivial to determine—for instance, some players’ participation may have no bearing on the outcome at all. Shapley (1953) developed a perturbation-based approach to determine such an allocation mechanism for games of arbitrary complexity that possesses various attractive theoretical guarantees.

When used for model explanation, every single prediction ${\hat{y}}_{i}$ that is made by a model is viewed as the payout of a potentially complex game, where the covariate values $x_{ik}$ underlying the prediction are viewed as the game’s $K$ participants. The goal of Shapley values is to attribute the value of the prediction ${\hat{y}}_{i}$ among its $K$ covariates $x_{ik}$ :

{\hat{y}}_{i} = ϕ_{0} + \sum_{k = 1}^{K} ϕ_{ik} .

(3)

Here, ${\hat{y}}_{i}$ is a single prediction made by some potentially complex model and $ϕ_{0}$ represents the overall mean across predictions. The Shapley values thus attempt to decompose the deviation of a specific prediction made by the model from the mean prediction across all observations. The additive decomposition of a prediction ${\hat{y}}_{i}$ amongst its covariates leads to the typical waterfall plots associated with Shapley values (see Figure 4). In this example, the difference between ${\hat{y}}_{i}$ and the overall mean $ϕ_{0}$ is decomposed among the four covariates. The first and fourth covariates, $x_{i 1}$ and $x_{i 4}$ , have a negative impact on ${\hat{y}}_{i}$ as their associated Shapley values drive the prediction to the left. The second and third covariates, $x_{i 2}$ and $x_{i 3}$ , have a positive impact. Taken together, they add up to the total difference between ${\hat{y}}_{i}$ and the mean. Each prediction ${\hat{y}}_{i}$ thus has its own $K$ Shapley values $ϕ_{ik}$ that precisely determine the difference between ${\hat{y}}_{i}$ and $ϕ_{0}$ .

Figure 4.

Prediction $\hat{y}$ and mean value $ϕ_{0}$ based on a model with four explanatory variables.

As every single prediction ${\hat{y}}_{i}$ is decomposed into $K$ Shapley values, one for each covariate, the approach effectively leads to a matrix of $N \times K$ Shapley values:

[\begin{matrix} ϕ_{11} & ϕ_{12} & \dots & ϕ_{1 K} \\ ϕ_{21} & ϕ_{22} & \dots & ϕ_{2 K} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ ϕ_{N 1} & ϕ_{N 2} & \dots & ϕ_{NK} \end{matrix}]

(4)

where all $K$ Shapley values per observation exactly add up to the model’s prediction for that observation:

ϕ_{0} [\begin{matrix} 1 \\ 1 \\ ⋮ \\ 1 \end{matrix}] + [\begin{matrix} \sum_{j = 1}^{K} ϕ_{1 j} \\ \sum_{j = 1}^{K} ϕ_{2 j} \\ ⋮ \\ \sum_{j = 1}^{K} ϕ_{Nj} \end{matrix}] = [\begin{matrix} {\hat{y}}_{1} \\ {\hat{y}}_{2} \\ ⋮ \\ {\hat{y}}_{N} \end{matrix}] .

(5)

Note that if the model were a standard linear additive model without an intercept (i.e., ${\hat{y}}_{i} = \sum_{k = 1}^{K} {\hat{β}}_{k} x_{ik}$ ), then calculating Shapley values that satisfy the additive function ${\hat{y}}_{i} = ϕ_{0} + \sum_{k = 1}^{K} ϕ_{ik}$ is simple. Every Shapley value should simply correspond to the estimated coefficient ${\hat{β}}_{k}$ times the covariate value $x_{ik}$ for that observation: $ϕ_{ik} := {\hat{β}}_{k} x_{ik}$ , as shown by Aas et al. (2021). If we were to plot the $N$ tuples consisting of the Shapley values $ϕ_{ik}$ and the covariates $x_{ik}$ for a single variable $k$ , this would result in a perfect linear relationship between the covariates and the Shapley values with slope ${\hat{β}}_{k}$ . When more complex predictive models are used, plotting the Shapley values $ϕ_{ik}$ and the covariate values $x_{ik}$ in such a joint manner provides a graphical way to study the implied association between a covariate and the outcome (Lundberg et al. 2020:59).

The process of calculating the $K$ Shapley values for a single prediction ${\hat{y}}_{i}$ relies on a perturbation-based approach, where we assess the effect of omitting information on a covariate $x_{ik}$ on that model’s prediction.²⁷ This is done by defining all possible “information sets” consisting of a set of $s$ out of the total $K$ covariates. Define $M_{i}^{k}$ to be some information set including covariate $k$ , and $M_{i}^{k} \setminus k$ to be its complement, excluding information on covariate $k$ . For each $M_{i}^{k}$ , we also make a prediction for its equivalent in $M_{i}^{k} \setminus k$ . This leads to two predictions, one including information on variable $k$ and one without, which are then differenced. The Shapley value $ϕ_{ik}$ for some prediction ${\hat{y}}_{i}$ and covariate $x_{ik}$ is defined as a weighted mean over all information sets $M^{k}$ :

ϕ_{ik} = \overbrace{\sum_{I \in M^{k}}}^{\text{Sum over all information sets}} \underbrace{\frac{| I |! \, (| M | - | I | - 1)!}{| M |!}}_{\text{Weighting function}} \overbrace{(\bar{f} (I \cup k) - \bar{f} (I))}^{\text{Difference in prediction}} .

(6)

This process is then repeated for every variable $k$ and for every prediction ${\hat{y}}_{i}$ , leading to the $N \times K$ matrix in Equation 4.
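For a small number of covariates, the weighted sum in Equation 6 can be evaluated by brute force. The sketch below is illustrative only: it assumes that $\bar{f} (I)$ is computed by averaging predictions over a background sample for the covariates outside the information set (the independence-based approach), and the model, background rows, and function names are hypothetical choices made for this example, not the article's implementation.

```python
from itertools import combinations
from math import factorial


def value(model, x, subset, background):
    """f-bar(I): mean prediction with covariates in `subset` fixed at x and the
    remaining covariates taken from each background row (independence assumption)."""
    total = 0.0
    for row in background:
        z = [x[k] if k in subset else row[k] for k in range(len(x))]
        total += model(z)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values for one observation by enumerating all information sets."""
    K = len(x)
    phi = []
    for k in range(K):
        others = [j for j in range(K) if j != k]
        phi_k = 0.0
        for size in range(K):
            # Weight |I|! (K - |I| - 1)! / K! for information sets of this size.
            weight = factorial(size) * factorial(K - size - 1) / factorial(K)
            for info_set in combinations(others, size):
                with_k = value(model, x, set(info_set) | {k}, background)
                without_k = value(model, x, set(info_set), background)
                phi_k += weight * (with_k - without_k)
        phi.append(phi_k)
    base = value(model, x, set(), background)  # phi_0: mean prediction
    return base, phi


def model(z):
    """Hypothetical model: linear in z[0], interaction of z[1] and z[2], z[3] unused."""
    return 2.0 * z[0] + z[1] * z[2]


background = [
    [0.0, 1.0, 1.0, 0.0],
    [2.0, 0.0, 2.0, 1.0],
    [1.0, 3.0, 0.0, 4.0],
    [3.0, 2.0, 1.0, 2.0],
]
x = [1.0, 2.0, 3.0, 5.0]

base, phi = shapley_values(model, x, background)
print("phi_0:", base)
print("phi_k:", phi)
print("phi_0 + sum(phi_k):", base + sum(phi), "prediction:", model(x))
```

Two things are worth checking in the output. The base value plus the Shapley values reproduces the prediction, and the unused covariate receives a value of zero. For the linear term, with $ϕ_{0}$ defined as the mean prediction, the value equals the coefficient times the covariate's deviation from its background mean. This differs from ${\hat{β}}_{k} x_{ik}$ only by a constant, so a plot against $x_{ik}$ has the same slope. The enumeration visits $2^{K - 1}$ information sets per covariate, which is why this approach is only workable for small $K$ .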

Shapley values have a number of unique theoretical properties that make them attractive as a method to generate understanding of models. First, the sum of all Shapley values $ϕ_{ik}$ and the mean prediction, $ϕ_{0}$ , matches the actually observed prediction ${\hat{y}}_{i}$ . Second, whenever the inclusion of variable $k$ into the information set has no effect on the model’s prediction ( $\bar{f} (I) = \bar{f} (I \cup k)$ for all $I$ ), its Shapley value is zero. Third, whenever two covariates $j$ and $k$ contribute equally to every prediction ( $\bar{f} (I \cup j) - \bar{f} (I) = \bar{f} (I \cup k) - \bar{f} (I)$ for all $I$ ), their Shapley values are the same: $ϕ_{ij} = ϕ_{ik}$ . Fourth, they are consistent with respect to addition and multiplication across models. Prior work shows Shapley values are the only local explanation method to possess these properties (for detailed discussions and proofs, see Lundberg and Lee 2017; Lundberg et al. 2020; Shapley 1953; Young 1985).

The computational burden of calculating Shapley values stems from three elements. First, the number of possible information sets is $2^{K}$ and scales exponentially with the number of covariates. Second, calculating the prediction $\bar{f} (I)$ when the information set $I$ consists of a subset $S \subset K$ requires an expectation $E [x | x^{*} = x_{s}]$ to be evaluated for the variables $\bar{S}$ not in the information set. Typically, the empirical density of the missing variables is used, but this assumes independence of the feature space. Joint Gaussian and copula-based methods have been proposed as an alternative, which further increase computation time (Aas et al. 2021; Heskes et al. 2020). Finally, many predictions have to be assessed to ensure sufficient tuples $[ϕ_{ik}, x_{ik}]$ to infer patterns in the model of interest, thus further scaling the computational requirements linearly. Fortunately, computationally efficient methods have been developed to calculate Shapley values. Here, I use the Shapley value estimation method optimized for tree-based models, Tree SHAP, following Lundberg et al. (2020).

Tree SHAP exploits the structure of the estimated tree, allowing for a much smaller set of relevant information sets to be assessed. Specifically, information sets that only differ from one another in terms of variables that do not feature in the branches of the tree can be ignored, as they will not lead to different predictions.²⁸ This has made estimation efficient to the point that exact Shapley values can be calculated, rather than having to rely on numerical approximations (Lundberg et al. 2020:64–65), and even renders pairwise Shapley values feasible to calculate in reasonable time (Lapuschkin et al. 2019; Lundberg et al. 2020). Pairwise Shapley values assess the joint omission of two variables from the model and the subsequent effect on the outcome, allowing interactions to be explicated. In the standard Shapley decomposition in Equation 3, interactions would be divided among the relevant interacting variables.²⁹ A final benefit of the Tree SHAP approach is that the independence assumption when making expectations for those variables outside of the information set, $\bar{S}$ , is relaxed. A more detailed discussion of Shapley values, pairwise Shapley values, and the Tree SHAP algorithm is provided in the Appendix.
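To make the idea of pairwise Shapley values concrete, the sketch below computes them by brute force for a three-covariate model. It is a hedged illustration, not the Tree SHAP algorithm: it assumes the interaction-value definition given by Lundberg et al. (2020), in which the joint contribution of covariates $j$ and $k$ is weighted by $| I |! (K - | I | - 2)! / (2 (K - 1)!)$ over information sets $I$ excluding both, and it evaluates omitted covariates by averaging over a hypothetical background sample. All names and data are invented for this example.

```python
from itertools import combinations
from math import factorial


def value(model, x, subset, background):
    """Mean prediction with covariates in `subset` fixed at x, others from background rows."""
    preds = [
        model([x[k] if k in subset else row[k] for k in range(len(x))])
        for row in background
    ]
    return sum(preds) / len(preds)


def interaction_value(model, x, background, j, k):
    """Pairwise Shapley (interaction) value for covariates j and k, with j != k."""
    K = len(x)
    others = [m for m in range(K) if m not in (j, k)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(K - size - 2) / (2 * factorial(K - 1))
        for info_set in combinations(others, size):
            s = set(info_set)
            # Effect of adding both covariates, net of adding each one separately.
            joint = (
                value(model, x, s | {j, k}, background)
                - value(model, x, s | {j}, background)
                - value(model, x, s | {k}, background)
                + value(model, x, s, background)
            )
            total += weight * joint
    return total


def model(z):
    """Hypothetical model with one interaction (z[0] * z[1]) and one additive term (z[2])."""
    return z[0] * z[1] + z[2]


# Background rows chosen so that z[0] and z[1] have mean zero and are uncorrelated.
background = [
    [1.0, 1.0, 0.0],
    [1.0, -1.0, 2.0],
    [-1.0, 1.0, 4.0],
    [-1.0, -1.0, 6.0],
]
x = [2.0, 3.0, 1.0]

phi_01 = interaction_value(model, x, background, 0, 1)
phi_10 = interaction_value(model, x, background, 1, 0)
phi_02 = interaction_value(model, x, background, 0, 2)
print("interaction (0, 1):", phi_01)
print("interaction (1, 0):", phi_10)
print("interaction (0, 2):", phi_02)
```

In this constructed case the interaction term contributes $2 \times 3 = 6$ to the prediction. Half of it is attributed to each ordering of the pair, while the pair involving the purely additive covariate receives zero.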

To illustrate how Shapley values can be used to generate understanding into an ML model, I calculate all $N$ Shapley values for the age variable based on predictions using the GB model identified in the toy example above. Figure 5 presents a joint scatterplot of the $N$ Shapley values $ϕ_{i, age}$ together with the $N$ values of the covariate $x_{i, age}$ . I normalize the Shapley values as well as the true underlying effect such that they can be visualized jointly.³⁰ The Shapley values accurately recover the piecewise linear association in the underlying data, and would provide concrete insights into how the hypothesized functional forms could be improved from the linear and quadratic specifications of age.

Figure 5.

Implied effect size of age through Shapley values versus the correctly specified model.

Resemblances to Other Computational Frameworks

Before applying the framework to three empirical cases, I briefly discuss a number of methods and frameworks that share a similar intuition and computational appetite to the proposed framework. In spirit, the first two steps of the framework are indebted to modern types of specification tests that rely on comparing residuals from classic non-parametric regression to those provided by a researcher’s own hypothesized model.³¹ Such approaches can identify a broader range of nonlinear functional forms than those considered by the RESET test, for example, although formal tests of improved fit remain challenging to compute due to the dependence of local estimates on the data at hand. More generally, the amount of data required to compute such local estimates scales exponentially with the number of independent variables (Bishop 2006; Yatchew 2003). The ML methods proposed for the framework presented here retain considerably more structure and are more efficient from a data perspective than non-parametric tests are, improving efficiency and allowing for better post-estimation analysis.³²

The proposed framework also shares similar goals to the field of model robustness. The emphasis in model robustness lies in exposing the potential role of researcher degrees of freedom during the model-building process (Sala-i Martin 1997; Simonsohn et al. 2020; Young 2019). A large number of “plausible” models that differ slightly from the hypothesized model are estimated to guard against researchers cherry-picking a specification. The key difference in model robustness is that variation is typically induced by excluding variables in the model, rather than assessing different functional relationships among variables (Young and Holsteen 2017).³³ The functional complexity and the fit of the models are of limited importance (Slez 2019; Young 2019). The proposed framework instead focuses on the functional relationship among a set number of variables and the emphasis is on specification.³⁴ Both approaches could be combined by applying the proposed framework to different sets of explanatory variables, as I will illustrate in two empirical case studies. Despite their differences, both frameworks are similarly indebted to the growth in computational power that allows researchers to consider many different functional forms and to use these computational riches to be more critical of modeling assumptions and improve transparency into the research process (Muñoz and Young 2018:4).³⁵

Closest in spirit to the proposed framework is an emerging literature in behavioral economics and psychology using ML models to find the predictive limit of behavioral theories (Agrawal et al. 2020; Kleinberg, Liang, and Mullainathan 2017; Peterson et al. 2021). The typical approach is to compare how well some behavioral theory predicts the outcomes of a stylized experiment (e.g., the fairness perception of two vignettes [Agrawal et al. 2020]) with a flexible ML model trained on the same data. If the ML method leads to better predictions than a behavioral theory, cases where the behavioral theory underperformed relative to the ML benchmark are studied. This research focuses solely on large-scale experimental data and has not been extended to empirical work as broadly as I propose here. The onus in this literature is on theory completeness rather than correct functional form specification. Post-estimation interpretation of the ML method, the third step in the proposed framework, is mostly ad hoc, if present at all.³⁶ However, the central premise of using ML to find patterns in data not necessarily hypothesized by a researcher is a key similarity between the two approaches.

A final strand of research similar to the proposed framework is the field of “autometrics,” which also aims to find better-fitting models than a researcher’s own hypothesizing might yield. The autometrics approach includes a large number of base transformations of the explanatory variables into a linear additive functional form. This “stacked” model is then iteratively trimmed to reach an optimal model using in-sample fit metrics and specification tests (Doornik and Hendry 2015). Autometrics is a more classical approach to flexible model-building compared to the ML methods proposed here. Specifically, the autometrics setup can be subsumed by the set of ML methods called “generalized additive models” and spline-based methods, which can be folded into Step I of the proposed framework (Hastie and Tibshirani 1987; Hastie et al. 2009).³⁷ Relatedly, scholars have put forward similarly stacked models that include interactions but then apply variable selection through least absolute shrinkage and selection operator regression (Blackwell and Olson 2022; Beiser-McGrath and Beiser-McGrath 2020). The proposed framework shares the same general intuition of the above approaches but uses a broader set of methods such that a wider range of patterns can be identified, and it emphasizes out-of-sample evaluation to reduce the risk of overfitting and path dependence in eliminating regressors from the model. As with all frameworks mentioned here, the inclusion of the third step in the proposed framework is another key difference with the above approach.

Applying the Framework in Practice

I apply the proposed framework to three case studies. The first is a simulation based on the Mincerian wage equation, a classic field of research investigating the returns of additional years of schooling on a person’s income (Lemieux 2006). The second is a hedonic regression of house prices applied to a large dataset of transactions in the London retail market (Malpezzi 2003). The third is a study of the demographic determinants of voting preferences in the United States using the General Social Survey (GSS) (Davis and Smith 1991). The Mincerian wage application is relevant because it provides an example of a functional form that has been actively innovated upon over the past decades and illustrates how the proposed framework can speed up model-building. I chose the hedonic regression and voting examples because both enjoy considerable academic interest, and models typically include a number of standard control variables with a (novel) interest variable. Because the latter is often correlated with the control variables, a correct functional relationship is crucial for inference. For these latter two case studies, I evaluate whether the typical complexity in which control variables feature in the functional form is sufficient, and improve them where necessary. Across these cases, I identify various nonlinearities and interaction effects using the framework.³⁸

Application I: Mincerian wage simulation

The Mincerian wage equation is a classic economic tool used to estimate the effect of an additional year of education on an individual’s wages—the “return to education” (Lemieux 2006). As mentioned above, the Mincerian wage equation has been actively innovated upon over the past decades. In the original functional form, log yearly wages was related to years of education and work experience in a linear additive fashion:

\ln (\text{wages}_{i}) = β_{0} + β_{1} x_{\text{educ}, i} + β_{2} x_{\text{exp}, i} + ϵ_{i} .

(7)

Subsequently, a square term was added to allow for a level of nonlinearity in the effect of work experience:

\ln (\text{wages}_{i}) = β_{0} + β_{1} x_{\text{educ}, i} + β_{2} x_{\text{exp}, i} + β_{3} x_{\text{exp}, i}^{2} + ϵ_{i} .

(8)

More recently, a step function was added in the effect of education to allow for different linear effects by level of education:

\begin{matrix} \ln (\text{wages}_{i}) = β_{0} + β_{1} x_{\text{educ\_0\_8}, i} + β_{2} x_{\text{educ\_9\_10}, i} + β_{3} x_{\text{educ\_11\_12}, i} + \\ β_{4} x_{\text{educ\_13\_14}, i} + β_{5} x_{\text{educ\_15+}, i} + β_{6} x_{\text{exp}, i} + β_{7} x_{\text{exp}, i}^{2} + ϵ_{i} . \end{matrix}

(9)

For illustrative purposes, I also study a fourth, hypothetical functional form where each coefficient slightly differs by sex:

\begin{matrix} \ln (\text{wages}_{i}) = I (x_{\text{sex}, i} = \text{Female}) [β_{0}^{*} + β_{1}^{*} x_{\text{educ\_0\_8}, i} + β_{2}^{*} x_{\text{educ\_9\_10}, i} + β_{3}^{*} x_{\text{educ\_11\_12}, i} + \\ β_{4}^{*} x_{\text{educ\_13\_14}, i} + β_{5}^{*} x_{\text{educ\_15+}, i} + β_{6}^{*} x_{\text{exp}, i} + β_{7}^{*} x_{\text{exp}, i}^{2} + ϵ_{i}] + \\ I (x_{\text{sex}, i} = \text{Male}) [β_{0} + β_{1} x_{\text{educ\_0\_8}, i} + β_{2} x_{\text{educ\_9\_10}, i} + β_{3} x_{\text{educ\_11\_12}, i} + \\ β_{4} x_{\text{educ\_13\_14}, i} + β_{5} x_{\text{educ\_15+}, i} + β_{6} x_{\text{exp}, i} + β_{7} x_{\text{exp}, i}^{2} + ϵ_{i}] . \end{matrix}

(10)

To illustrate the proposed framework, I simulate data using the four specifications above as DGPs, plugging coefficient estimates as found in the recent empirical literature into each specification. For the final specification, I vary the coefficients of sex to fall within a standard error of the coefficients found in the literature (see the Appendix for the DGPs) (Heckman et al. 2018; Lemieux 2006).

Based on these four DGPs and a synthetic sample of 50,000 individuals’ age, years of education, years of work experience, and sex, I generate four outcomes: one using each DGP. The synthetic sample is based on the GSS 2018, such that the distribution of the explanatory variables is representative of an actual working population. For illustrative purposes, I calibrate the error term in each functional form such that the proportion of explainable variance is constant in every dataset irrespective of the DGP used to generate the outcome variable. Descriptives for the dataset can be found in Appendix Table A2. Linear-I refers to the first functional form above provided by Equation 7, Linear-II refers to Equation 8, and so on.
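The calibration of the error term can be sketched as follows. If the systematic part of a DGP has variance $V$ and the target share of explainable variance is $R^{2}$ , independent noise with variance $V (1 - R^{2}) / R^{2}$ yields that share regardless of the functional form. The code below is a minimal illustration under assumptions: the coefficients, covariate ranges, and the target value of 0.3 are hypothetical placeholders, not the values used in the article (see the Appendix for the actual DGPs).

```python
import math
import random


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def add_calibrated_noise(signal, target_r2, rng):
    """Add Gaussian noise scaled so that var(signal) / var(outcome) is close to target_r2."""
    noise_sd = math.sqrt(variance(signal) * (1.0 - target_r2) / target_r2)
    return [s + rng.gauss(0.0, noise_sd) for s in signal]


def linear_1(educ, exper):
    """Form of Equation 7 with hypothetical coefficients."""
    return 0.5 + 0.10 * educ + 0.02 * exper


def linear_2(educ, exper):
    """Form of Equation 8 with hypothetical coefficients."""
    return 0.5 + 0.10 * educ + 0.04 * exper - 0.0007 * exper ** 2


rng = random.Random(42)
n = 20000
educ = [rng.randint(8, 20) for _ in range(n)]    # hypothetical years of education
exper = [rng.randint(0, 40) for _ in range(n)]   # hypothetical years of experience

target_r2 = 0.3  # assumed target share of explainable variance
realized = {}
for name, dgp in [("Linear-I", linear_1), ("Linear-II", linear_2)]:
    signal = [dgp(e, x) for e, x in zip(educ, exper)]
    outcome = add_calibrated_noise(signal, target_r2, rng)
    realized[name] = variance(signal) / variance(outcome)
    print(name, "realized share of explainable variance:", round(realized[name], 3))
```

Holding this share fixed across DGPs means that differences in out-of-sample fit between hypothesized models reflect functional form rather than differing noise levels.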

The above leads to four datasets consisting of a distinct vector of outcomes $y$ and the same matrix of explanatory variables $X$ , where each outcome vector follows from one of the four underlying functional forms in Equations 7 to 10. The first is based on a DGP where both education and work experience affect the outcome linearly, the second where work experience follows a second-degree polynomial relationship, and so on. I next propose four hypothetical functional forms, $\tilde{f} (\cdot)$ , to fit to each of the four datasets. The first three functional forms are equal to those defined in Equations 7 to 9; the fourth is equal to Equation 9 but includes a dummy for sex. This means none of the four hypothesized models aligns with the true DGP in the fourth dataset, which follows the DGP provided in Equation 10. I include this fourth form for illustrative purposes, as it is common practice to “control” for group differences by including a dummy variable, although this might simplify the true pattern in the data, which in this case concerns an interactive effect rather than a level difference. Estimating these four functional forms means that for the first dataset, all four hypothesized models should have the appropriate flexibility to model the data. For the second dataset, only the second, third, and fourth functional forms should, and for the third dataset, only the third and fourth functional forms should be able to fit the data well. None of the proposed functional forms has sufficient flexibility to estimate the underlying DGP in the fourth dataset appropriately.

In addition to the four hypothesized functional forms, I estimate a Super Learner including various tree-based ML methods—the first step of the framework. A GB model performs best for each of the four datasets (see Appendix Tables A3 and A4). In the second step of the framework, I compare the model fit of the flexible model with the four hypothesized models using Monte Carlo cross-validation. The results show that the flexible model is able to match the true functional form’s performance in the first three datasets and strongly outperforms the most flexible functional form in the fourth dataset (see Figure 6). The framework thus identifies the underlying model without requiring the researcher to hypothesize a functional form in all four datasets.

Figure 6.

Out-of-sample $R^{2}$ for the four datasets with varying DGPs, using four functional forms and the GB model.

Finally, the flexible model can be unpacked to infer why it improved on the hypothesized functional forms—the third step in the framework. We know the true DGP in this case, making it obvious why the flexible model outperformed the underspecified functional forms, but the Shapley values easily identify the correct association of the independent variables with the outcome (Figure 7). The Shapley values capture the linearity of both explanatory variables in the first dataset, the nonlinearity in years of work experience in the second dataset, and the step-wise function in the third dataset. The Shapley values also pick up the interaction with sex in the fourth dataset, as illustrated by varying the Shapley values by sex. Clearly, assuming linearity where none is present leads to a misrepresentation of the true returns to education; incorporating the correct flexibility is crucial for inference.

Figure 7.

Effect of work experience (red squares) and schooling (blue circles) as predicted by the four estimated functional forms, and as implied by Shapley values.

Application II: London house prices

As a second case study, I use a large dataset on transactions in the London housing market. The typical approach to modeling house prices is through a hedonic pricing model that assumes each house is composed of various traits for which buyers have certain preferences. Buyers effectively combine the value of individual traits to determine the (monetary) value of an entire house. In hedonic regression, interest typically lies in the price elasticity of certain traits, for example, how much the house price increases with an additional square meter of living space or the inclusion of a garden. In practice, linear additive models are typically estimated relating observed house characteristics with log prices, although many authors have suggested that the assumptions of additive linearity implicit in this functional form might be unreasonable (Fan, Ong, and Koh 2006; Malpezzi 2003).

In this application I use typical house characteristics, like house size, number of rooms, and property type as explanatory variables. I also include a number of neighborhood characteristics. In many applications, researchers attempt to address spatial heterogeneity by including neighborhood-level observables like crime indices, travel distances to local centers, or deprivation scores. Again, in the typical model these variables are added in a linear additive framework, although it is often argued that spatial heterogeneity is considerably more complex (Elhorst 2010). To estimate the hedonic regression, I use a dataset of nearly 630,000 house sales containing basic house characteristics and a neighborhood identifier. The transaction data were collected and merged by the Reshare project hosted at the UK Data Service (Chi et al. 2021). I include neighborhood-level data from the Department of Transport and Communities and Ministry of Housing, Communities and Local Government.³⁹ I also include the year and month of the transaction. Descriptive statistics of the dataset are shown in Appendix Table A5.

For the hypothesized model, I estimate a linear additive functional form relating log house prices to the variables depicted in Appendix Table A5. As is customary in the literature, I assume linear trends for all continuous variables, including the temporal variables. The exact functional form is as follows:

\begin{matrix} \ln (\text{House price}_{i}) = β_{0} + β_{1} x_{\text{area}, i} + β_{2} x_{\text{rooms}, i} + β_{3, \dots, 5} x_{\text{propertytype}, i} + β_{6} x_{\text{new}, i} + \\ β_{7} x_{\text{travel\_time}, i} + β_{8} x_{\text{crime}, i} + β_{9} x_{\text{deprivation}, i} + β_{10} x_{\text{year}, i} + \\ β_{11} x_{\text{month}, i} + β_{12, \dots, 16} x_{\text{transaction\_type}, i} + ε_{i} . \end{matrix}

(11)

Following the framework, I start by fitting both the hypothesized model and a Super Learner to the data. In addition to the model in Equation 11, I also apply the framework to subsets of the full model. The best-fitting flexible model is again a GB approach, which will be used as the flexible model (see Appendix Tables A6 and A7 for the Super Learner output for the full set of covariates, and the selected models for the covariate subsets, respectively). In the next step, the fit of the flexible model is compared to the hypothesized model. The results are summarized in Figure 8 and are striking. Already among the simplest housing characteristics, like size and property type, the flexible model improves on the linear additive functional form by about 5 pp in terms of the out-of-sample $R^{2}$ . We see the largest improvement when adding neighborhood variables like “travel time to hub,” with the difference in $R^{2}$ exceeding 30 pp. This strongly implies that spatial heterogeneity is poorly addressed by the functional form in Equation 11, a point to which I will return.

Figure 8.

Out-of-sample $R^{2}$ for the hypothesized model and the flexible model.

To evaluate why the flexible model outperforms the linear additive framework, I calculate Shapley values for the flexible model. These are visualized for six explanatory variables: the size of the house, the travel time to the nearest local hub, the crime index, number of rooms in the house, and the two temporal variables (year and month of sale). Figure 9 shows the results. We see nonlinearities for most of these variables, ranging from a slightly decreasing elasticity for the size of the house to piece-wise linearity in the effect for travel time and the crime index. It is also clear that the number of rooms, year, and month variables should all be modeled in a nonlinear way. When including these nonlinearities into the linear additive model, the fit improves and is strictly preferred according to an LR test,⁴⁰ although a remaining gap in model fit still points at further interactions and nonlinearities among variables, possibly across time. Specifically, the considerable noise in the Shapley values of the neighborhood-level variables implies that these variables do not follow a very precise pattern with respect to the neighborhood characteristics, and including nonlinearities may not suffice to capture the underlying patterns well.

Figure 9.

Plots show implied effect from a typical linear additive model (dashed line), when adding a squared term (dotted) and as implied by Shapley values (scatter).
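To make concrete what a Shapley value measures in plots such as Figure 9, the following sketch computes exact Shapley values by brute-force enumeration of feature coalitions. It is a didactic stand-in, not the tree-based algorithm used for the flexible model; the `model` function, the `background` grid, and the observation `x` are hypothetical and chosen only so the result can be checked by hand.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of model(x) relative to the mean prediction on `background`."""
    n = len(x)

    def value(coalition):
        # Features in the coalition are fixed at x; the rest are averaged over background rows.
        preds = [model([x[j] if j in coalition else row[j] for j in range(n)])
                 for row in background]
        return sum(preds) / len(preds)

    phi = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature j.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi

# Hypothetical fitted model: concave in house size, piecewise linear in travel time.
def model(z):
    size, travel = z
    return 0.8 * size ** 0.5 - 0.05 * min(travel, 30)

background = [[s, t] for s in (40, 80, 120, 160) for t in (10, 20, 40, 50)]
x = [100, 15]

phi = shapley_values(model, x, background)
baseline = sum(model(row) for row in background) / len(background)
print("Shapley values:", phi)
print("baseline + sum of Shapley values:", baseline + sum(phi), "prediction:", model(x))
```

Because this toy model is additive, each Shapley value equals the feature's own contribution minus its average contribution over the background data. Plotting such values against the feature, observation by observation, traces out the implied functional form in the way the scatter in Figure 9 does.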

Given the stark increase in model fit when adding neighborhood-level variables, I add random intercepts at the neighborhood level to a simple model including the house size and temporal explanatory variables. This effectively provides a fully nonlinear association between each neighborhood’s characteristics and the outcome variable, which seems reasonable based on the large increases in model fit when adding any of the neighborhood characteristics to the flexible model, combined with the variation in their Shapley values. Based on this modification of the functional form, the out-of-sample $R^{2}$ of the model strongly improves to 84 percent, which is much closer to that of the flexible model. In other words, the heterogeneity among neighborhoods could not be captured by the three variables when included in the model in either a linear or polynomial form, although this strategy is often encountered in the literature (Malpezzi 2003), and requires more intricate modeling. In this case, the flexible model’s ability to estimate flexible patterns led to the identification of individual neighborhoods, thus mimicking a random intercept approach, and the framework illustrates that spatial heterogeneity is highly predictive, but poorly accounted for by the available neighborhood characteristics (Elhorst 2010). As in the previous example, assuming linearity would have led to incorrect inferences about the elasticities of most house characteristics in the data. It would be particularly problematic for inference to ignore the considerable variation at the neighborhood level, as this clearly points to important omitted variables that could bias inference.
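The effect of neighborhood-level intercepts on out-of-sample fit can be sketched on simulated data. This is a simplified illustration under stated assumptions: it estimates a separate, unshrunken intercept per neighborhood (a within, or dummy-variable, estimator) rather than a full random-intercept model, which would additionally shrink each intercept toward the overall mean. The data-generating process, the number of neighborhoods, and all variable names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def r_squared(y_true, y_pred):
    """Out-of-sample R^2 on held-out observations."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def simulate(n_per_group, levels, rng):
    """Assumed DGP: outcome = neighborhood level + 0.5 * size + noise."""
    rows = []
    for g, level in enumerate(levels):
        for _ in range(n_per_group):
            size = rng.gauss(0, 1)  # standardized house size
            rows.append((g, size, level + 0.5 * size + rng.gauss(0, 0.3)))
    return rows

rng = random.Random(1)
levels = [rng.gauss(0, 1) for _ in range(30)]  # unobserved neighborhood effects
train = simulate(40, levels, rng)
test = simulate(20, levels, rng)
y_test = [y for _, _, y in test]

# Pooled model: one intercept and one slope for all neighborhoods.
mx, my = mean([x for _, x, _ in train]), mean([y for _, _, y in train])
slope_pooled = (sum((x - mx) * (y - my) for _, x, y in train)
                / sum((x - mx) ** 2 for _, x, _ in train))
pred_pooled = [my + slope_pooled * (x - mx) for _, x, _ in test]

# Neighborhood intercepts: estimate the slope from within-neighborhood variation,
# then give every neighborhood its own intercept.
centers = {}
for g in range(len(levels)):
    centers[g] = (mean([x for h, x, _ in train if h == g]),
                  mean([y for h, _, y in train if h == g]))
slope_within = (sum((x - centers[g][0]) * (y - centers[g][1]) for g, x, y in train)
                / sum((x - centers[g][0]) ** 2 for g, x, _ in train))
alpha = {g: cy - slope_within * cx for g, (cx, cy) in centers.items()}
pred_group = [alpha[g] + slope_within * x for g, x, _ in test]

r2_pooled = r_squared(y_test, pred_pooled)
r2_group = r_squared(y_test, pred_group)
print(f"pooled R2 = {r2_pooled:.2f}, with neighborhood intercepts R2 = {r2_group:.2f}")
```

When most of the variation in the outcome sits between neighborhoods, a model with neighborhood-specific levels recovers much of the fit that observed neighborhood characteristics alone cannot.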

Application III: Party identification in the United States

As a third and final case study, I evaluate the demographic determinants of party identification in the United States using the GSS (Davis and Smith 1991). Party identification is of substantive interest in the social sciences (Freeden, Sargent, and Stears 2013), and the GSS is an often-used resource for this purpose. Throughout this literature, a number of demographic variables are used as typical controls, including respondent’s age, sex, race, educational attainment, and income. Another variable of substantive interest is usually included in the analysis, like cognitive ability (Meisenberg 2015) or social class (Morgan and Lee 2017). These interest variables naturally tend to correlate with the control variables, making correct specification critical.

Appendix Table A8 shows descriptive statistics of the GSS containing information on voting preferences and demographic characteristics between 1974 and 2018. The outcome of interest is a seven-point scale indicating whether the respondent identifies strongly with the Democratic party (value of 1) or the Republican party (value of 7). The GSS also includes the respondent’s age, years of schooling, income level across 12 brackets, sex, and race, as well as the year of the survey wave. As the hypothesized model, I use a standard linear additive model as often encountered in the literature (Freeden et al. 2013). Most functional forms include a linear time trend for the year of the survey, and add most demographic variables as a linear determinant or dummy. The hypothesized functional form is as follows:

\begin{matrix} y_{i} = β_{0} + β_{1} x_{age, i} + β_{2} x_{female, i} + β_{3} x_{black, i} + β_{4} x_{other, i} + \\ β_{7} x_{education, i} + β_{8} x_{income, i} + β_{9} x_{survey\_year, i} + ϵ_{i}, \end{matrix}

(12)

and resembles that found in Meisenberg (2015).

I start by estimating Equation 12 as well as a Super Learner containing various tree-based ML models using both the full set of covariates and subsets. The best-performing model is again a GB model, although the RF also performs well (see Appendix Tables A9 and A10 for the Super Learner output for the full set of covariates, and the selected models for the covariate subsets, respectively). Figure 10 shows results from benchmarking the out-of-sample $R^{2}$ of the flexible model against that of the hypothesized model. The explanatory power of the flexible model is almost double that of the hypothesized model across subsets, indicating a considerable lack of appropriate specification in Equation 12. The flexible model improves most when race is added to the model, although the hypothesized model already underperforms when simply including temporal, age, and sex variables. It is also noteworthy that overall fit is comparatively low.

Figure 10.

Out-of-sample $R^{2}$ for the hypothesized model compared to the flexible model.

Using the full set of covariates, Figure 11 illustrates Shapley values and Shapley interaction values for the age, schooling, and income variables. These Shapley values show clear nonlinearities in the associations of age, income, and years of schooling with the outcome. Assuming these associations to be linear does not fit the implied effect well, and adding squared terms improves the model fit considerably (Figure 11A).⁴¹ Ignoring these complexities and assuming a linear functional form would wrongfully imply an effect that decreases with age, even though there is a clear upward association at later ages. Similar incorrect conclusions would be drawn for both the education and income associations when assuming a linear additive model.

Figure 11.

Overall implied effect as estimated by Shapley values for age, education, and income variables: (A) overall effect by variables, (B) age and survey year interaction, (C) age and race interaction, (D) race and sex interaction, and (E) income and race interaction.
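Whether a squared term suggested by the Shapley values improves the hypothesized model can be checked with a likelihood-ratio comparison of nested linear models. The sketch below uses simulated data and an assumed U-shaped age pattern; it is not the test reported in the article's notes. For Gaussian linear models the LR statistic reduces to $n \ln(RSS_{0}/RSS_{1})$, which is compared with a $χ^{2}$ critical value with one degree of freedom. The helper `ols_rss` is hypothetical.

```python
import math
import random

def ols_rss(columns, y):
    """Fit OLS via the normal equations and return the residual sum of squares."""
    k, n = len(columns), len(y)
    # Augmented system [X'X | X'y].
    a = [[sum(columns[i][r] * columns[j][r] for r in range(n)) for j in range(k)]
         + [sum(columns[i][r] * y[r] for r in range(n))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b * col[r] for b, col in zip(beta, columns)) for r in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

rng = random.Random(2)
n = 2000
age = [rng.uniform(18, 88) for _ in range(n)]
age_c = [(v - 53) / 10 for v in age]  # centered age, in decades
# Assumed DGP: a mild U-shape in age plus noise.
y = [4 - 0.1 * v + 0.08 * v ** 2 + rng.gauss(0, 1.5) for v in age_c]

ones = [1.0] * n
rss_linear = ols_rss([ones, age_c], y)                            # restricted model
rss_squared = ols_rss([ones, age_c, [v ** 2 for v in age_c]], y)  # adds age squared

lr = n * math.log(rss_linear / rss_squared)
critical = 3.841  # 5% critical value of a chi-square with 1 degree of freedom
print(f"LR statistic = {lr:.1f}, reject linear form: {lr > critical}")
```

The restricted model is nested in the extended one, so the extended model can never fit worse in-sample; the test asks whether the improvement is larger than chance alone would produce.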

These direct effects can be further decomposed into a number of interactions by estimating pairwise Shapley values, allowing each Shapley value to be decomposed into a direct and indirect effect. The indirect effects indicate how Shapley values for certain observed characteristics deviate from the overall effect of the variable, conditional on a second variable. The estimates in panels B to E all show indirect effects and illustrate that younger respondents associated less strongly with Democrats in earlier waves than in later waves (Figure 11B), but also that White individuals have a stronger positive age effect than do non-White individuals (Figure 11C). We see similarly interesting dynamics when interacting sex and race, which show that the sex effect is more pronounced for Black respondents but is less pronounced for others (Figure 11D). The effect of income shows similarly complicated dynamics, where higher income is associated more strongly with Republican identification for White individuals than for non-White individuals (Figure 11E). Adding race interactions for sex, income, and age as implied by the Shapley values improves the fit of the hypothesized model using conventional in-sample fit statistics.⁴²
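The pairwise decomposition described above can be sketched with a brute-force computation of Shapley interaction values, following the definition used for tree-based explanations (Lundberg et al. 2020): each pairwise value averages, over coalitions of the remaining features, the part of the joint contribution of two features that is not explained by their separate contributions. The `model`, `background` grid, and observation `x` are hypothetical and chosen so the result can be verified by hand.

```python
from itertools import combinations
from math import factorial

def interaction_value(model, x, background, i, j):
    """Exact Shapley interaction value between features i and j for observation x."""
    n = len(x)

    def value(coalition):
        # Features in the coalition are fixed at x; the rest are averaged over background rows.
        preds = [model([x[k] if k in coalition else row[k] for k in range(n)])
                 for row in background]
        return sum(preds) / len(preds)

    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 2) / (2 * factorial(n - 1))
        for subset in combinations(others, size):
            s = set(subset)
            # Joint contribution of i and j beyond their separate contributions.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total

# Hypothetical fitted model: the income slope is steeper for one group; sex is additive.
def model(z):
    income, white, female = z
    return 0.1 * income + 0.3 * income * white - 0.5 * female

background = [[inc, w, f] for inc in (-1.0, 0.0, 1.0) for w in (0, 1) for f in (0, 1)]
x = [1.0, 1, 0]

phi_income_white = interaction_value(model, x, background, 0, 1)
phi_income_female = interaction_value(model, x, background, 0, 2)
print("income x group:", phi_income_white, "income x sex:", phi_income_female)
```

A nonzero pairwise value signals that the association of one variable with the outcome depends on the other, as in Figure 11E, whereas a value of zero indicates that the two variables enter additively.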

Discussion

This article set out to address a key problem in quantitative sociology: that we do not know what the appropriate functional form might be to model a dataset. In addition, a historic preference for parsimonious and easy-to-estimate models due to computational limitations has led to researchers hypothesizing relatively simplistic functional forms with little scrutiny. Not only does this lead to risks of misspecification bias and incorrect inference, but a lack of emphasis on appropriate specification can lead to researcher degrees of freedom muddying empirical findings. I argue that instead of trusting researchers to conjure up the correct functional form, we should use methods that embrace the fundamental uncertainty regarding the underlying patterns in a dataset to help sociologists improve their model-building.

I proposed a framework using ML methods to generate a data-driven estimate of the fit potential in a dataset. This fit potential indicates how well an outcome could be modeled when the functional relationship among variables is dictated by the data. Such an estimate provides an indication of whether a researcher’s own functional form might miss important nuances like interactions or nonlinearities. Crucially, the fit potential is a feature of the data and not a result of the researcher’s choices, improving transparency in the empirical process. Whenever the ML method finds more intricate patterns in the data than our own models do, we can unpack the former to provide guidance on how to improve the latter. This runs contrary to the popular belief that ML models are fundamentally black boxes that cannot convey any intuition about the patterns they identify. More generally, the proposed framework provides a bridge between the ability of ML methods to identify intricate patterns in data and a desire for interpretable models. By incorporating existing methods into the standard empirical workflow, ML models become complementary tools in the sociologist’s empirical toolkit, as opposed to the near-exclusive use of ML for predictive questions that is the status quo in sociology.

Illustrating the framework, I showed how the historic process of model development could have been sped up considerably, using the example of the Mincerian wage equation. The framework effortlessly identified underspecification, and subsequent analysis of the ML models identified the necessary improvements to the functional form. The other two empirical examples, a hedonic regression of house prices and a model explaining party identification, illustrate how often-encountered modeling strategies lead to considerably lower fit than does a flexible ML model. In the case of house prices in London, a number of simple nonlinearities were identified. Most importantly, the fundamental inability to address spatial heterogeneity through the inclusion of standard neighborhood characteristics became clear through the considerable differences in fit between the hypothesized model and the ML method when including neighborhood data. For party identification, a lack of complexity was similarly evident from applying the framework, and unpacking the flexible models showed important interactions and nonlinearities between the explanatory variables and outcome, which are not commonly implemented in the functional forms used to study party identification.

Existing misspecification tests have known limitations, but appropriate use would have identified misspecification in some of the examples presented in this article.⁴³ However, classic misspecification tests are known to be underpowered, especially in multivariate settings, and limited in the types of misspecification they consider. Fully non-parametric tests are more flexible, but suffer heavily from the curse of dimensionality. In addition, misspecification tests do not provide clear guidance on how to improve a functional form after misspecification is identified. Perhaps most importantly, use of even the most basic misspecification tests is practically nonexistent in sociological work, and their selective implementation can suffer from the very same researcher degrees of freedom they are meant to address. Conversely, the only source of selectivity in the proposed framework is the choice of which models to include in a Super Learner to provide an estimate of the fit potential. As the price of considering a large number of models is small, this risk is minimal and we can simply include many different types of methods. In contrast to classic misspecification tests, the proposed framework also provides researchers with concrete guidance on the types of patterns that may have been missed.

The approach presented here also has limitations. First, uncovering intricate patterns will still depend on the available data and might provide limited guidance in low-$N$ settings, although many of the ML methods proposed here can be applied to datasets typically encountered in sociology. Second, although considerable progress has been made in recent years, unpacking flexible ML methods remains challenging. This is in part because patterns can become complex to the degree that even local explanation methods like Shapley values will not yield easily digestible insights into the underlying functional form, especially when multiple interactions may be at play. More generally, there is ongoing debate about the practical and philosophical question of how accurately explanation methods reflect the way ML methods operate. However, the constant developments in the X-AI field are reassuring and exciting. Finally, when causal questions rather than optimal fit are of interest, there is limited guidance on whether a difference in model fit between a hypothesized model and ML alternative is actually problematic. However, as the framework follows the same inferential curation of variables as a researcher’s own model, missed functional relationships among variables should be expected to affect inference.

Perhaps the most challenging part of the framework is determining how much improvement in fit is enough to warrant a re-evaluation of the functional form. Unfortunately, this question will fundamentally depend on a combination of the substantive research question being asked and the true underlying DGP. As a result, the choice to accept a functional form will likely remain a matter of debate in the academic community. This is not so much a consequence of the framework, but rather one of embracing the fact that we know very little about the true DGP and are unwilling either to make stringent assumptions about it or to blindly trust that a researcher-hypothesized model includes all the relevant intricacies in the data. A result of increasingly letting go of assumptions regarding the underlying DGP will be that we are left more frequently with questions like the one posed above, where we can rely less on statistical guidance and will instead have to rely on academic discussion.

At the root of most empirical sociological findings lies a functional form that is assumed to be correctly specified. Limited evaluation of whether this functional form is appropriate for the data leads to a number of serious risks. The model might not accurately reflect the patterns in the data, affecting the validity of statistical inference. Simply allowing researchers to report a single or curated number of functional forms without much scrutiny further exposes sociology to p-hacking. These practices stem from a time when limitations on computational power necessitated parsimony. These constraints are no longer applicable, yet the models we estimate retain a simplicity that likely belies the intricacies of the social mechanisms we are interested in. This is evidenced in the work of our qualitative colleagues, as well as the ever-increasing number of empirical examples where ML models outperform the linear additive models usually estimated by sociologists. I proposed a framework that addresses this issue by exploiting the ability of ML methods to find intricate patterns in data. This symbiosis of quantitative sociological work and the computational riches of today is long overdue, and it will only bear more fruit as the goals of the ML community increasingly align with the explanatory focus of sociologists.

ORCID iD

Mark D. Verhagen

Data Availability Statement

All code and publicly available data underlying the analyses in this article can be found at .

Author Biography

Mark Verhagen is a researcher at the Leverhulme Centre for Demographic Science and member of Nuffield College, Oxford. His work uses computational methods to improve the construction and understanding of explanatory models. He has published in a wide range of outlets, including The Proceedings of the National Academy of Sciences, BMC Medicine, PloS One, The International Review of Law and Economics, and Socius.

References

Aas

Kjersti

Jullum

Martin

Løland

Anders

. 2021. “Explaining Individual Predictions When Features Are Dependent: More Accurate Approximations to Shapley Values.” Artificial Intelligence 298:103502. https://doi.org/10.1016/j.artint.2021.103502.

Abbott

Andrew

. 1988. “Transcending General Linear Reality.” Sociological Theory 6(2):169–86.

Agrawal

Mayank

Peterson

Joshua C.

Griffiths

Thomas L.

2020. “Scaling Up Psychology via Scientific Regret Minimization.” Proceedings of the National Academy of Sciences 117(16):8825–35.

Athey

Susan

. 2018. “The Impact of Machine Learning on Economics.” Pp. 507–47 in The Economics of Artificial Intelligence: An Agenda, edited by Agrawal

Gans

Goldfarb

Chicago, IL: University of Chicago Press.

Baćak

Valerio

Kennedy

Edward H.

2019. “Principled Machine Learning Using the Super Learner: An Application to Predicting Prison Violence.” Sociological Methods & Research 48(3):698–721.

Bates

Stephen

Hastie

Trevor

Tibshirani

Robert

. 2021. “Cross-Validation: What Does It Estimate and How Well Does It Do It?” arXiv. https://doi.org/10.48550/arXiv.2104.00673

Beiser-McGrath

Janina

Beiser-McGrath

Liam F.

2020. “Problems with Products? Control Strategies for Models with Interaction and Quadratic Effects.” Political Science Research and Methods 8(4):707–30.

Berk

Richard A

. 2004. Regression Analysis: A Constructive Critique, Vol. 11. Thousand Oaks, CA: Sage.

Berk

Richard A.

Bleich

Justin

. 2013. “Statistical Procedures for Forecasting Criminal Behavior: A Comparative Assessment.” Criminology and Public Policy 12(3):513–44.

10.

Biau

Gérard

. 2012. “Analysis of a Random Forests Model.” Journal of Machine Learning Research 13:1063–95.

11.

Bishop

Christopher M

. 2006. Pattern Recognition and Machine Learning. New York, NY: Springer.

12.

Blackwell

Matthew

Olson

Michael P.

2022. “Reducing Model Misspecification and Bias in the Estimation of Interactions.” Political Analysis 30(4):495–514.

13.

Brand

Jennie E.

Jiahui

Bernard

Koch

Pablo

Geraldo

. 2021. “Uncovering Sociological Effect Heterogeneity Using Tree-Based Machine Learning.” Sociological Methodology 52(2):189–223.

14.

Breiman

Leo

. 1996. “Bagging Predictors.” Machine Learning 24(2):123–40.

15.

Breiman

Leo

. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16(3):199–231.

16.

Bucca

Mauricio

Urbina

Daniela R.

2019. “Lasso Regularization for Selection of Log-Linear Models: An Application to Educational Assortative Mating.” Sociological Methods & Research 50(4):1763–800.

17.

Buja

Andreas

Brown

Lawrence

Berk

Richard

George

Edward

Pitkin

Emil

Traskin

Mikhail

Zhang

Kai

Zhao

Linda

. 2019. “Models as Approximations I: Consequences Illustrated with Linear Regression.” Statistical Science 34(4):523–44.

18.

Buja

Andreas

Kuchibhotla

Arun Kumar

Berk

Richard

George

Edward

Tchetgen

Eric Tchetgen

Zhao

Linda

. 2019. “Models as Approximations–Rejoinder.” Statistical Science 34(4):606–20.

19.

Cameron

A. Colin

Trivedi

Pravin K.

2005. Microeconometrics: Methods and Applications. Cambridge, UK: Cambridge University Press.

20.

Chen

Tianqi

Guestrin

Carlos

. 2016. “Xgboost: A Scalable Tree Boosting System.” Pp. 785–94 in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: Association for Computing Machinery.

21.

Chi

Bin

Dennett

Adam

Oléron-Evans

Thomas

Morphet

Robin

. 2021. “A New Attribute-Linked Residential Property Price Dataset for England and Wales, 2011 to 2019.” UCL Open: Environment 3:e019.

22.

Christensen

Björn

Christensen

Sören

. 2014. “Are Female Hurricanes Really Deadlier Than Male Hurricanes?” Proceedings of the National Academy of Sciences 111(34):E3497–98.

23.

Davis

James A.

Smith

Tom W.

1991. The NORC General Social Survey: A User’s Guide. Thousand Oaks, CA: Sage.

24.

Doornik

Jurgen A.

Hendry

David F.

2015. “Statistical Model Selection with ‘Big Data.’” Cogent Economics & Finance 3(1):1045216.

25.

Doshi-Velez

Finale

Kim

Been

. 2017. “Towards a Rigorous Science of Interpretable Machine Learning.” arXiv. https://doi.org/10.48550/arXiv.1702.08608

26.

Dougherty

Michael R.

Thomas

Rick P.

Brown

Ryan P.

Chrabaszcz

Jeffrey S.

Tidwell

Joe W.

2015. “An Introduction to the General Monotone Model with Application to Two Problematic Data Sets.” Sociological Methodology 45(1):223–71.

27.

Duncan

Otis Dudley

. 1984. Notes on Social Measurement: Historical and Critical. New York, NY: Russell Sage Foundation.

28.

Efron

Bradley

Hastie

Trevor

. 2016. Computer Age Statistical Inference. Cambridge, UK: Cambridge University Press.

29.

Elbers

Benjamin

. 2023. “A Method for Studying Differences in Segregation across Time and Space.” Sociological Methods & Research 52(1):5–42.

30.

Elhorst

J. Paul

. 2010. “Applied Spatial Econometrics: Raising the Bar.” Spatial Economic Analysis 5(1):9–28.

31.

Fan

Gang-Zhi

Ong

Seow Eng

Koh

Hian Chye

. 2006. “Determinants of House Price: A Decision Tree Approach.” Urban Studies 43(12):2301–15.

32.

Freeden

Michael

Sargent

Lyman Tower

Stears

Marc

. 2013. The Oxford Handbook of Political Ideologies. Oxford, UK: Oxford University Press.

33.

Freedman

David A

. 2009. Statistical Models: Theory and Practice. Cambridge, UK: Cambridge University Press.

34.

Freund

Yoav

Schapire

Robert E.

1996. “Experiments with a New Boosting Algorithm.” ICML 96:148–56.

35.

Friedman

Jerome H.

1991. “Multivariate Adaptive Regression Splines.” Annals of Statistics 19(1):1–67.

36.

Gelman

Andrew

Loken

Eric

. 2013. “The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No ‘Fishing Expedition’ or ‘P-Hacking’ and the Research Hypothesis Was Posited Ahead of Time.”Department of Statistics, Columbia University, New York, NY.

37.

Golden

Richard M.

Henley

Steven S.

White

Halbert

Kashner

Michael T.

2016. “Generalized Information Matrix Tests for Detecting Model Misspecification.” Econometrics 4(4):46.

38.

Goodman

Bryce

Flaxman

Seth

. 2017. “European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’.” AI Magazine 38(3):50–57.

39.

Grimmer

Justin

Roberts

Margaret E.

Stewart

Brandon M.

2021. “Machine Learning for Social Science: An Agnostic Approach.” Annual Review of Political Science 24:395–419.

40.

Hastie

Trevor

Tibshirani

Robert

. 1987. “Generalized Additive Models: Some Applications.” Journal of the American Statistical Association 82(398):371–86.

41.

Hastie

Trevor

Tibshirani

Robert

Friedman

Jerome

. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer.

42.

Heckman

James J.

Humphries

John Eric

Gregory

Veramendi

. 2018. “Returns to Education: The Causal Effects of Education on Earnings, Health, and Smoking.” Journal of Political Economy 126(S1):S197–246.

43.

Heskes

Tom

Sijben

Evi

Bucur

Ioan Gabriel

Claassen

Tom

. 2020. “Causal Shapley Values: Exploiting Causal Knowledge to Explain Individual Predictions of Complex Models.” Advances in Neural Information Processing Systems 33:4778–89.

44.

Hindman

Matthew

. 2015. “Building Better Models: Prediction, Replication, and Machine Learning in the Social Sciences.” ANNALS of the American Academy of Political and Social Science 659(1):48–62.

45.

Hofman

Jake M.

Sharma

Amit

Watts

Duncan J.

2017. “Prediction and Explanation in Social Systems.” Science 355(6324):486–88.

46.

Hornik

Kurt

Stinchcombe

Maxwell

White

Halbert

. 1989. “Multilayer Feedforward Networks Are Universal Approximators.” Neural Networks 2(5):359–66.

47.

Ioannidis

John P. A.

2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2(8):e124.

48.

Janson

Lucas

Fithian

William

Hastie

Trevor J.

2015. “Effective Degrees of Freedom: A Flawed Metaphor.” Biometrika 102(2):479–85.

49.

Kleinberg

Jon

Liang

Annie

Mullainathan

Sendhil

. 2017. “The Theory Is Predictive, but Is It Complete? An Application to Human Perception of Randomness.” Pp. 125–26 in Proceedings of the 2017 ACM Conference on Economics and Computation. New York, NY: Association for Computing Machinery.

50.

Kohavi

Ron

. 1995. “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.” IJCAI 14:1137–45.

51.

Krishna

Satyapriya

Han

Tessa

Alex

Pombra

Javin

Jabbari

Shahin

Steven

Lakkaraju

Himabindu

. 2022. “The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective.” arXiv. https://doi.org/10.48550/arXiv.2202.01602

52.

Lapuschkin

Sebastian

Wäldchen

Stephan

Binder

Alexander

Montavon

Grégoire

Samek

Wojciech

Müller

Klaus-Robert

. 2019. “Unmasking Clever Hans Predictors and Assessing What Machines Really Learn.” Nature Communications 10(1):1–8.

53.

Lemieux

Thomas

. 2006. “The ‘Mincer Equation’ Thirty Years After Schooling, Experience, and Earnings.” Pp. 127–45 in Jacob Mincer a Pioneer of Modern Labor Economics, edited by Grossbard

New York, NY: Springer.

54.

Liao

Yung-Sheng

. 2017. “Machine Learning in Macro-Economic Series Forecasting.” International Journal of Economics and Finance 9(12):71–76.

55.

Lipton

Zachary C

. 2018. “The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability is Both Important and Slippery.” Queue 16(3):31–57.

56.

Long

J. Scott

Trivedi

Pravin K.

1992. “Some Specification Tests for the Linear Regression Model.” Sociological Methods & Research 21(2):161–204.

57.

Lundberg

Scott M.

Erion

Gabriel

Chen

Hugh

DeGrave

Alex

Prutkin

Jordan M.

Nair

Bala

Katz

Ronit

Himmelfarb

Jonathan

Bansal

Nisha

Lee

Su-In

. 2020. “From Local Explanations to Global Understanding with Explainable AI for Trees.” Nature Machine Intelligence 2(1):56–67.

58.

Lundberg

Ian

Johnson

Rebecca

Stewart

Brandon M.

2021. “What is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory.” American Sociological Review 86(3):532–65.

59.

Lundberg

Scott M.

Lee

Su-In

. 2017. “A Unified Approach to Interpreting Model Predictions.” Advances in Neural Information Processing Systems30: 4765–4774.

60.

Malpezzi

Stephen

. 2003. “Hedonic Pricing Models: A Selective and Applied Review.” Housing Economics and Public Policy 1:67–89.

61.

McClintock

Elizabeth Aura

. 2017. “Occupational Sex Composition and Gendered Housework Performance: Compensation or Conventionality?” Journal of Marriage and Family 79(2):475–510.

62.

Meisenberg

Gerhard

. 2015. “Verbal Ability as a Predictor of Political Preferences in the United States, 1974–2012.” Intelligence 50:135–43.

63.

Molina

Mario

Garip

Filiz

. 2019. “Machine Learning for Sociology.” Annual Review of Sociology 45:27–45.

64.

Morgan

Stephen L.

Lee

Jiwon

. 2017. “Social Class and Party Identification during the Clinton, Bush, and Obama Presidencies.” Sociological Science 4:394–423.

65.

Mullainathan

Sendhil

Spiess

Jann

. 2017. “Machine Learning: An Applied Econometric Approach.” Journal of Economic Perspectives 31(2):87–106.

66.

Muñoz

John

Young

Cristobal

. 2018. “We Ran 9 Billion Regressions: Eliminating False Positives Through Computational Model Robustness.” Sociological Methodology 48(1):1–33.

67.

O’Brien

Robert M.

2018. “Comment: Some Challenges When Estimating the Impact of Model Uncertainty on Coefficient Instability.” Sociological Methodology 48(1):34–39.

68.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349(6251):aac4716.

69.

Pearl

Judea

. 2009. Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press.

70.

Peterson

Joshua C.

Bourgin

David D.

Agrawal

Mayank

Reichman

Daniel

Griffiths

Thomas L.

2021. “Using Large-Scale Experiments and Machine Learning to Discover Theories of Human Decision-Making.” Science 372(6547):1209–14.

71.

Peysakhovich

Alexander

Naecker

Jeffrey

. 2017. “Using Methods from Machine Learning to Evaluate Behavioral Models of Choice under Risk and Ambiguity.” Journal of Economic Behavior & Organization 133:373–84.

72.

Polley

Eric

LeDell

Erin

Kennedy

Chris

Lendle

Sam

van der Laan

Mark

. 2021. “Package Superlearner.”https://cran.r-project.org/web/packages/SuperLearner/index.html. Accessed 1 November 2022.

73.

Polley

Eric C.

Rose

Sherri

Van der Laan

Mark J.

2011. “Super Learning.” Pp. 43–66 in Targeted Learning, edited by van der Laan

M. J.

Rose

New York, NY: Springer.

74.

Rahal

Charles

Verhagen

Mark D.

Kirk

David S.

2022. “The Rise of Machine Learning in the Academic Social Sciences.” AI and Society. Advance online publication. https://doi.org/10.1007/s00146-022-01540-w

75.

Ramsey

James Bernard

. 1969. “Tests for Specification Errors in Classical Linear Least-Squares Regression Analysis.” Journal of the Royal Statistical Society: Series B (Methodological) 31(2):350–71.

76.

Robinson

Peter M

. 1988a. “Root-N-Consistent Semiparametric Regression.” Econometrica: Journal of the Econometric Society 56(4):931–54.

77.

Robinson

Peter M

. 1988b. “Semiparametric Econometrics: A Survey.” Journal of Applied Econometrics 3(1):35–51.

78.

Rose

Sherri

. 2013. “Mortality Risk Score Prediction in an Elderly Population Using Machine Learning.” American Journal of Epidemiology 177(5):443–52.

79.

Rudin

Cynthia

. 2019. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence 1(5):206–15.

80.

Rudin

Cynthia

Passonneau

Rebecca J.

Radeva

Axinia

Dutta

Haimonti

Ierome

Steve

Isaac

Delfina

. 2010. “A Process for Predicting Manhole Events in Manhattan.” Machine Learning 80(1):1–31.

81.

Sala-i Martin

Xavier X

. 1997. “I Just Ran Four Million Regressions.” Technical report. Cambridge, MA: National Bureau of Economic Research.

82.

Salganik

Matthew J.

Lundberg

Ian

Kindel

Alexander T.

Ahearn

Caitlin E.

Al-Ghoneim

Khaled

Almaatouq

Abdullah

Altschul

Drew M.

, et al. 2020. “Measuring the Predictability of Life Outcomes with a Scientific Mass Collaboration.” Proceedings of the National Academy of Sciences 117(15):8398–403.

83.

Samek

Wojciech

Montavon

Grégoire

Vedaldi

Andrea

Hansen

Lars Kai

Müller

Klaus-Robert

. 2019. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. New York, NY: Springer Nature.

84.

Shapley

Lloyd S

. 1953. “Stochastic Games.” Proceedings of the National Academy of Sciences 39(10):1095–100.

85.

Shmueli

Galit

. 2010. “To Explain or to Predict?” Statistical Science 25(3):289–310.

86.

Simonsohn

Uri

Simmons

Joseph P.

Nelson

Leif D.

2020. “Specification Curve Analysis.” Nature Human Behaviour 4(11):1208–14.

87.

Slez

Adam

. 2019. “The Difference between Instability and Uncertainty: Comment on Young and Holsteen (2017).” Sociological Methods & Research 48(2):400–30.

88.

Stone

Mervyn

. 1977. “An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion.” Journal of the Royal Statistical Society: Series B (Methodological) 39(1):44–47.

89.

Tang

Siyi

Ghorbani

Amirata

Yamashita

Rikiya

Rehman

Sameer

Dunnmon

Jared A.

Zou

James

Rubin

Daniel L.

2021. “Data Valuation for Medical Imaging Using Shapley Value and Application to a Large-Scale Chest X-Ray Dataset.” Scientific Reports 11(1):1–9.

90.

Van der Laan

Jan

de Jonge

Edwin

Das

Marjolijn

Te Riele

Saskia

Emery

Tom

. 2022. “A Whole Population Network and Its Application for the Social Sciences.” European Sociological Review 39(1):145–60.

91.

Van Der Laan

Mark J.

Dudoit

Sandrine

. 2003. “Unified Cross-Validation Methodology for Selection among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples.” U.C. Berkeley Division of Biostatistics Working Paper Series. Berkeley, CA: The Berkeley Electronic Press.

92. Van der Laan Mark J., Polley Eric C., Hubbard Alan E. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6(1):1–23.

93. Van der Vaart Aad W., Dudoit Sandrine, van der Laan Mark J. 2006. “Oracle Inequalities for Multi-Fold Cross Validation.” Statistics & Decisions 24(3):351–71.

94. Vehtari Aki, Gelman Andrew, Gabry Jonah. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27(5):1413–32.

95. Verhagen Mark D. 2022. “A Pragmatist’s Guide to Using Prediction in the Social Sciences.” Socius 8. https://doi.org/10.1177/23780231221081702

96. Watts Duncan J. 2014. “Common Sense and Sociological Explanations.” American Journal of Sociology 120(2):313–51.

97. Watts Duncan J. 2017. “Should Social Science Be More Solution-Oriented?” Nature Human Behaviour 1(1):1–5.

98. White Halbert. 1980. “Using Least Squares to Approximate Unknown Regression Functions.” International Economic Review 21(1):149–70.

99. White Halbert. 1981. “Consequences and Detection of Misspecified Nonlinear Regression Models.” Journal of the American Statistical Association 76(374):419–33.

100. Yatchew Adonis. 1997. “An Elementary Estimator of the Partial Linear Model.” Economics Letters 57(2):135–43.

101. Yatchew Adonis. 2003. Semiparametric Regression for the Applied Econometrician. Cambridge, UK: Cambridge University Press.

102. Young H. Peyton. 1985. “Monotonic Solutions of Cooperative Games.” International Journal of Game Theory 14(2):65–72.

103. Young Cristobal. 2018. “Model Uncertainty and the Crisis in Science.” Socius 4. https://doi.org/10.1177/2378023117737206

104. Young Cristobal. 2019. “The Difference between Causal Analysis and Predictive Models: Response to ‘Comment on Young and Holsteen (2017).’” Sociological Methods & Research 48(2):431–47.

105. Young Cristobal, Holsteen Katherine. 2017. “Model Uncertainty and Robustness: A Computational Framework for Multimodel Analysis.” Sociological Methods & Research 46(1):3–40.

106. Young Cristobal, Stewart Sheridan A. 2021. “Functional Form Robustness: Advancements in Multiverse Analysis.” Unpublished manuscript. http://cristobalyoung.com/development/wp-content/uploads/2021/08/Multiverse-Aug-2021.pdf. Accessed 1 November 2022.

107. Yousef Waleed A. 2020. “A Leisurely Look at Versions and Variants of the Cross Validation Estimator.” stat 1050:9.

108. Zhou Jianlong, Gandomi Amir H., Chen Fang, Holzinger Andreas. 2021. “Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics.” Electronics 10(5):593. https://doi.org/10.3390/electronics10050593
