Abstract
In recent years, machine learning has propagated into different aspects of psychological research, and supervised machine-learning methods have increasingly been used as a tool for predicting human behavior or psychological characteristics when there is a large number of possible predictors. However, researchers often face practical challenges when using machine-learning methods on psychological data. In this article, we identify and discuss four key challenges that often arise when applying machine learning to data collected for psychological research. The four challenge areas cover (a) limited sample size, (b) measurement error, (c) nonindependent data, and (d) missing data. Such challenges are extensively discussed in the “traditional” statistical literature but are often not explicitly addressed, or at least not to the same extent, in the applied-machine-learning community. We present how each of these challenges is dealt with, first from a traditional-statistics perspective and then from a machine-learning perspective, and discuss the strengths and weaknesses of these solutions by comparing the approaches. We argue that the boundary between traditional statistics and machine learning is fluid and emphasize the need for cross-disciplinary collaboration to better tackle these core challenges and improve replicability.
Psychology has seen a dramatic surge in a new class of analysis methods from the field of machine learning in recent years (Elhai & Montag, 2020; Harlow & Oswald, 2016). Machine-learning methods are seen as an appropriate tool to analyze large (and often unstructured; for a glossary, see Box 1) data with many possible predictors (Faraway & Augustin, 2018). Machine learning constitutes a subfield of artificial intelligence (AI) that broadly encompasses techniques that enable learning from data without explicit instructions, for example, models or “algorithms” that automatically detect patterns in data and approaches to data partitioning for model training and evaluation (Hastie et al., 2009). Applying these methods allows the detection of complex (e.g., nonlinear) relationships in data that generalize to new data with the same distribution (although see Quiñonero-Candela et al., 2022). This has enabled accurate predictions of a number of different behaviors, for example, suicide attempts and risk (C. R. Cox et al., 2020; Gradus et al., 2020; Walsh et al., 2017), environmental behaviors (Lavelle-Hill et al., 2020), life outcomes (Deininger et al., 2025; Lavelle-Hill et al., 2024; Salganik et al., 2020; Savcisens et al., 2024), psychological constructs (Donnellan et al., 2022; Gruda & Hasan, 2019; Jach et al., 2024; Neuendorf et al., 2025; Stachl et al., 2020; Youyou et al., 2015), developmental trajectories (Karch et al., 2015; Van Lissa et al., 2023), hiring decisions (Liem et al., 2018), clinical diagnoses (Dinga et al., 2018; Lei et al., 2022), and treatment outcomes (Chekroud et al., 2016; Fabbri et al., 2018; Jankowsky et al., 2022).
A Glossary of Machine-Learning-Related Terms
In line with its popularity, there has been an increasing number of prominent introductory articles for psychologists that explain machine-learning concepts, opportunities, and limitations (Adjerid & Kelley, 2018; Bleidorn & Hopwood, 2019; Bzdok, 2017; Dwyer et al., 2018; Elhai & Montag, 2020; Hofman et al., 2017; Hullman et al., 2022; Liem et al., 2018; Orrù et al., 2020; Rocca & Yarkoni, 2021; Tay et al., 2022; Van Lissa, 2022; Yarkoni & Westfall, 2017) and provide tutorials for specific methods or approaches (Boedeker & Kearns, 2019; E. E. Chen & Wojcik, 2016; De Rooij & Weeda, 2020; Henninger et al., 2025; Henninger & Strobl, 2023; Jacobucci et al., 2019; Pargent et al., 2023; Rosenbusch et al., 2021). However, there has been a critical lack of work that directly addresses the compatibility of machine learning with specific characteristics of psychological data (for an exception, see Jacobucci & Grimm, 2020). Although there are various forms, psychological data are often quite different from the type of data machine-learning algorithms were initially designed around (Liem et al., 2018). First, psychological data are normally collected for a specific research purpose by the researchers themselves. Whether the data are experimental or observational, there tends to be a limited number of observations because of limited resources. Second, researchers are typically interested in the hypothetical psychological constructs behind the measurements, such as personality traits, not the observed measurements themselves (e.g., responses to items in a survey). Third, psychological data are often collected from the same individuals, providing multiple observations per person. As a natural consequence, psychological data often have a clustered structure. Finally, because humans are the subjects, psychological data often contain substantial amounts of nonrandom missing data because of reasons such as participant dropout, inattention, and survey questions not being applicable.
In this article, we highlight and discuss the key challenges that are inherent with these common characteristics of psychological data: (a) limited sample size, (b) measurement error, (c) nonindependent data, and (d) missing data. Of course, we are not claiming that they are characteristics unique to data collected in psychology—there are many nonpsychological data sets that have these characteristics. In addition, there are many psychological data sets that do not have some of these characteristics. Nevertheless, in our experience, these are the topics that have kept emerging in discussions with researchers using machine learning in psychology. In addition, these aspects are thoroughly discussed in the psychometric community and have some established solutions (e.g., Enders, 2025; Kyriazos, 2018; Raudenbush & Bryk, 2002; Schmidt & Hunter, 1996). In comparison, we find them to be discussed to a lesser extent in the applied-machine-learning community, or at least these discussions are not well communicated to researchers in psychology.
A key goal of this article is to bridge the gap in communication between the fields of psychology and machine learning by discussing each point from both perspectives and highlighting the similarities and the differences in the approaches. First, we discuss the characteristics of a machine-learning approach. We argue that traditional-statistics and machine-learning methods are not completely separate categories but, rather, fall along a continuum. This perspective sets the scene for understanding the challenges and potential solutions discussed in this article. Subsequently, each aforementioned challenge is outlined first from a traditional-statistics perspective and second by looking at comparable possible solutions in the machine-learning literature. Because of the dynamic development in methods and approaches at the intersection of the two fields, in this article, we do not intend to prescribe definitive solutions. Instead, we highlight how discussing these issues from both perspectives can produce important insights for both the machine-learning and psychology communities—and in doing so, we hope to help further stimulate methodological research and advancement where the two fields meet.
In the current article, we focus on a class of machine-learning models called “supervised” methods. These might include linear-regression-based approaches (e.g., linear regression, logistic regression, lasso, Tibshirani, 1996; ridge regression, Hoerl & Kennard, 1970; elastic net, Zou & Hastie, 2005), tree-based approaches (e.g., regression/classification trees, Breiman et al., 1984; random forests, Breiman, 2001a; extreme gradient boosting, i.e., XGBoost, T. Chen et al., 2015), kernel approaches (e.g., support vector machines [SVMs], Hearst et al., 1998), or neural networks (e.g., long short-term memory, Hochreiter & Schmidhuber, 1997) or convolutional neural networks used to process images (Gu et al., 2018).
These supervised-machine-learning models have distinct algorithms, but there is a common purpose: predicting a measured outcome variable from a set of predictors (also called “features”). Another collection of machine-learning methods, unsupervised methods, focuses on clustering or finding order/patterns in the data without a specific outcome variable. We do not focus on this class of models, but we do mention unsupervised methods in places where they can be used as a method within a supervised-machine-learning pipeline (see Box 1). For a comprehensive and accessible review of machine-learning methods in general, we recommend Hastie et al. (2009), Pargent et al. (2023), and Rosenbusch et al. (2021). For implementing machine-learning methods in Python, we recommend the Scikit-learn package (Pedregosa et al., 2011a), and in R, we recommend either the caret package (Kuhn et al., 2020) or mlr3 for more advanced applications (Lang et al., 2019).
The Traditional Statistics–Machine Learning Continuum
In the traditional statistical approach, researchers in psychology approach a given problem or question from a hypothetico-deductive perspective (Hempel & Oppenheim, 1948). Researchers formulate a hypothesis, collect data, and conduct a statistical analysis to test the theoretical prediction derived from the hypothesis. The hypothesis should be selected a priori, and researchers decide on a statistical model to estimate the magnitude of the hypothesized effect. Because of the resources required to collect data to answer a specific research question, the data sets tend to be relatively small compared with the population. The statistical significance of the hypothesized effect is then evaluated in relation to the estimated sampling error. This is the so-called “data modeling” approach in statistics (Breiman, 2001b), which is referred to in this article as the “traditional statistical approach” because of its long tradition and high prevalence in psychological research (Blanca et al., 2018). This approach is considered optimal when researchers are trying to understand an outcome in relation to a small number of theoretically conceived independent variables (Faraway & Augustin, 2018). Thus, the focus is on explaining the data, which is viewed “through the lens of the theoretical model” (Shmueli, 2010).
This approach, however, is not necessarily compatible with large data sets—specifically, data with a large number of possible predictor variables (relative to the number of observations; Faraway & Augustin, 2018). For example, a large number of predictors can easily lead to overfitting (the lack of generalization to a new sample) if not carefully controlled for (Yarkoni & Westfall, 2017), and hypothesized models may not be able to capture more complex relationships present in the data (Breiman, 2001b). In addition, a large number of possible predictors normally increases the interdependence of predictors, which can cause various problems in most statistical models (D. R. Cox, 2015). Furthermore, many statistical approaches do not scale up well computationally with large data (Jordan, 2011; Reid, 2018). Thus, the traditional statistical approach could be regarded as limited given the recent influx of large digital data sets that are increasingly becoming available for psychology research (Harlow & Oswald, 2016).
Machine learning is considered a suitable methodology when researchers face such a “big data” situation. Contrary to the traditional statistical approach, machine learning follows an “algorithmic modeling” approach (Breiman, 2001b). Typically, no assumptions are made about the underlying data-generating mechanism (e.g., whether the relationships are linear or whether there are interactions), and instead, this approach seeks to find the function that best predicts the outcome from a set of possible predictors (for an example of a typical machine-learning pipeline, see Fig. 1). This often involves comparing different classes of models (e.g., a random forest vs. a neural network) and different hyperparameters (e.g., different tree depths in a random forest; see Box 1). Machine-learning models are not evaluated on how well they fit the sample they were built on but on new data that are unseen by the model. This might be data in the future that are not yet available when the model is fit or data that are technically available but have been deliberately held back for the purpose of evaluating the model. Either way, there is generally an assumption that the out-of-sample data have similar underlying distributional properties as the data the model has been trained on. This out-of-sample prediction performance assesses how well the model generalizes and helps to protect against inflated performance estimates as a result of overfitting to one sample (Yarkoni & Westfall, 2017). In machine learning, overfitting is reduced through processes such as cross-validation, regularization, and hyperparameter tuning (see Box 1).
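To make this concrete, the following is a minimal sketch of such a pipeline in Python using scikit-learn (recommended later in this article): synthetic data are split into training and held-out test sets, a hyperparameter (tree depth) is tuned with k-fold cross-validation on the training data only, and out-of-sample performance is evaluated once on the test set. The data, model class, and hyperparameter grid are purely illustrative choices, not recommendations.

```python
# A minimal sketch of the pipeline described above (synthetic data;
# model and hyperparameter grid are illustrative, not prescriptive).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# Hold back data that the model never sees during training or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Tune a hyperparameter (tree depth) with 5-fold cross-validation,
# using the training data only.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [2, 5, None]},
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

# Evaluate out-of-sample performance once, on the held-out test set.
print("best max_depth:", search.best_params_)
print("test R2:", r2_score(y_test, search.predict(X_test)))
```

In a full analysis, any preprocessing steps (e.g., scaling or imputation) would also be fit inside the cross-validation loop to avoid information leakage into the test data (see Box 1).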

Fig. 1. An example of a supervised-machine-learning pipeline. XAI = explainable artificial intelligence.
Several prominent previous articles have articulated, either directly or indirectly, the broad epistemological differences between the traditional statistical approach used in psychology and machine learning. For example, there have been discussions on the different utilities of explanation versus prediction (Hofman et al., 2021; Shmueli, 2010; Yarkoni & Westfall, 2017), data modeling versus algorithmic modeling (Breiman, 2001b), deductive (or “theory-testing”) versus inductive (or “discovery-oriented”) methods (Oberauer & Lewandowsky, 2019; Rothchild, 2006; Van Lissa, 2022, 2023), and basic science versus applied science (Simon, 2001). Although these dichotomies are important, we feel that they have increased the perceived gap between machine-learning and psychometric approaches. In fact, Bzdok (2017) lamented that there is a “scarcity of scientific papers . . . that provide an explicit account on how concepts and tools from classical statistics and statistical learning are exactly related to each other.”
In this article, we rather highlight the continuous nature of these approaches (also noted in Hao & Ho, 2019; Orrù et al., 2020) with the aim of reconciling them in relation to the key challenges of analyzing psychological data. This is with the view that, in the words of Efron and Hastie (2021), a “healthy duality . . . is bound to improve both branches.” Figure 2 illustrates the continuum. On the left, we show the traditional statistical approach, where the focus is on applying a theoretically conceived model to data and understanding the data-generation mechanisms. Thus, the approach aims to derive an explanation. On the right, there is a machine-learning approach, where the focus is to make a prediction from the input to the output without assuming a theoretically motivated model. Thus, this approach does not explicitly consider the data-generation model (Breiman, 2001b), although there are also many exceptions (Pearl, 1995; Peters et al., 2017; Schölkopf et al., 2021; Singh et al., 2019). In fact, with a large number of possible predictors, trying to find the true data-generation model is often considered “too ambitious, if meaningful at all” (Nevo & Ritov, 2017). See also Fisher et al. (2019), Letham et al. (2016), Statnikov et al. (2013), and Tulabandhula and Rudin (2014). In Figure 2, the two ends of the continuum look completely different, but in reality, there is also a large middle ground, where the model is theoretically constrained but to a much lesser extent than in the traditional approach. We believe that this applies to many instances when psychologists attempt to use machine learning on psychological data. For example, one might constrain the model to be able to model only linear relationships (e.g., using lasso; Tibshirani, 1996) but let the model decide which features from a large pool (perhaps still of theoretically relevant predictors) to include in the final model. To be clear, we cannot truly have both approaches—there is a fundamental trade-off between explaining a sample and predicting new data. However, this does not mean that there are two clear categories but, rather, a spectrum.

Fig. 2. The continuum between a traditional statistical approach and a machine-learning approach. p = set of possible predictors.
We take linear regression as an example, a model that is commonly used in psychology. Linear regression is a traditional statistical technique that aims to predict the outcome variable from predictor variables using additive linear terms. Many advanced statistical methods commonly used in psychology, such as structural equation models (SEMs) and mixed-effects models, are extensions of linear regression. When the number of predictors increases, there is likely an issue of multicollinearity, which makes parameter estimation and prediction unstable. To stabilize the prediction, linear regression models can be extended to regularized regression models, which include lasso (Tibshirani, 1996), ridge regression (Hoerl & Kennard, 1970), and elastic-net models (Zou & Hastie, 2005). These models can provide relatively stable predictions with many predictor variables and are often regarded as machine-learning models (Hastie et al., 2009). When the functional relationship becomes more complicated, researchers can alternatively use kernel regression models (Nadaraya, 1964; Watson, 1964), which enable good predictions even when nonlinear relationships exist. However, when doing so, the interpretability of the model suffers compared with the simpler linear model.
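As a hedged illustration of this point, the sketch below fits ordinary least squares, ridge, and lasso to synthetic data with two nearly collinear predictors, of which only one truly relates to the outcome. The regularization strengths are arbitrary illustrative values.

```python
# With highly correlated predictors, OLS coefficients are unstable,
# whereas ridge and lasso shrink them toward zero (synthetic data).
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.5, size=n)     # only x1 truly matters

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# OLS typically spreads large, offsetting weights across x1 and x2;
# ridge splits a moderate weight between them; lasso tends to keep one.
```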
We can also think about the other direction. Deep neural-network models are regarded as one of the most powerful classes of machine-learning algorithms across a variety of applications (Shinde & Shah, 2018). But the simplest version of a neural network, called a “perceptron” (McCulloch & Pitts, 1943), can be equivalent to regression with a binary outcome (e.g., logistic regression). Generally speaking, as a model accommodates an increased number of predictors and complex nonlinear relationships, the model naturally becomes more of a black box and is referred to as a machine-learning model. Of course, this is a somewhat simplistic view, and not all statistical tools fit with this continuous perspective. However, the implication of this perspective is that statistical concepts and ideas in one tradition can, to some extent, be transferable to the other tradition. 1 At least by trying to connect the statistical concepts in both fields, rather than learning about them completely separately, we believe psychologists should be able to reach a better understanding of machine-learning methods and vice versa.
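To illustrate the perceptron/logistic-regression correspondence noted above, the short sketch below shows that a single unit with a sigmoid activation computes exactly the functional form of logistic regression, sigmoid(w·x + b). The weights are arbitrary, and no training procedure is shown; this only illustrates the shared functional form.

```python
# A single "neuron" with a sigmoid activation computes the same
# functional form as logistic regression: sigmoid(w . x + b).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.2])  # illustrative weights
b = 0.3                    # illustrative bias
x = np.array([1.5, 0.7])   # one input pattern

p = sigmoid(w @ x + b)     # predicted probability of the positive class
print(p)
```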
Note that when we use the phrase “machine-learning approach” here, we encompass not only the predictive models themselves but also the whole analysis pipeline required to perform machine-learning analyses (for an example of a typical machine-learning pipeline, see Box 2 and Fig. 1). This includes data-preprocessing steps, practices for training and testing models, and a wide variety of different algorithms available to detect patterns in data to use for making predictions. Some of the machine-learning preprocessing procedures can be adapted to the traditional statistical approach or vice versa, for example, certain approaches to missing data. This supports our point that the two approaches are not qualitatively so different, and in the current article, we discuss some critical preprocessing procedures when relevant to our argument (e.g., dimensionality reduction and centering). That being said, in certain machine-learning domains, such as computer vision or natural language processing, data preprocessing is a large and active area of research in its own right (X. Chu et al., 2016). Furthermore, many of the techniques involved in these fields, such as data augmentation (e.g., Shorten & Khoshgoftaar, 2019) or the process of extracting tabular data from unstructured data (e.g., Church, 2017; Mars, 2022; Medhat et al., 2014; Wallach, 2006; Y. Zhang et al., 2010), are not easily comparable with the preprocessing procedure in a traditional statistical approach. Therefore, in the current article, we do not attempt to cover the whole array of literature in this regard.
A Typical Supervised-Machine-Learning Pipeline
Prediction R², which is somewhat different from the R² obtained when fitting a regression model, evaluates how well a model predicts new observations (Scheinost et al., 2019).
For classification problems, the choice of metric should take into account the distribution of data across classes and the cost of an error (Sterner et al., 2023).
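The following toy sketch illustrates why this choice matters: with imbalanced classes, plain accuracy can look impressive even for a degenerate classifier, whereas balanced accuracy reveals chance-level performance. All values are synthetic.

```python
# With 10% positive cases, a "model" that always predicts the majority
# class reaches 90% accuracy but chance-level balanced accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)   # imbalanced classes
y_pred = np.zeros_like(y_true)           # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.90
print(balanced_accuracy_score(y_true, y_pred))  # 0.50
```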
In the following sections, we discuss each of the focal challenges that are common in psychological data: (a) a limited sample size, (b) measurement error, (c) nonindependent data, and (d) missing data. In each section, we first introduce how they are approached from a traditional statistical perspective and then suggest possible solutions when using machine learning.
A Limited Sample Size
Approach in traditional statistical methods
Psychological data are often collected directly by researchers to address specific research questions. This has the advantage of allowing researchers to collect precisely the information they need. However, it also tends to be costly and time-consuming. As a result, psychological data sets are usually relatively small, especially in terms of the number of observations (i.e., participants, typically ranging from dozens to hundreds in experiments and hundreds to thousands in surveys). These limited sample sizes make it more challenging to generalize findings from the collected sample to broader populations.
To draw generalized conclusions, psychometrics relies primarily on inferential statistics. Here, researchers assume that the observed sample (i.e., the data they have) is a random sample from the population of interest. Parameters (e.g., means, regression coefficients) estimated from the observed sample are expected to deviate, to some extent, from the true population parameter values. This deviation is referred to as “sampling error.” Although researchers cannot directly observe sampling errors—because the true population values are unknown—under certain distributional assumptions (e.g., multivariate normality), they can evaluate the distribution of sampling errors, known as the “sampling distribution.” In fact, researchers can typically derive the standard deviation of the sampling distribution, known as the standard error, either analytically (i.e., using a statistical formula) or via numerical estimation (i.e., using iterative procedures to approximate solutions). These standard errors are often the basis for determining the statistical significance of parameter values. 2 When distributional assumptions are not met, researchers often use simulation-based approaches to evaluate the sampling distribution (e.g., bootstrapping; see Box 1). Generally, as the sample size (N) increases, sampling error decreases probabilistically (i.e., precision increases; D. R. Cox et al., 2018; Faraway & Augustin, 2018; Riley et al., 2021). In other words, with larger data sets, researchers obtain more information about the population, and the estimated parameter values are less likely to deviate from the true population parameters (i.e., see central-limit theorem, Kwak & Kim, 2017).
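For a concrete (if simplified) illustration of these two routes to a standard error, the sketch below computes the analytic standard error of a sample mean (SD/√N) and a simulation-based (bootstrap) approximation of the same quantity on synthetic data.

```python
# Analytic vs. simulation-based standard error of a mean (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)

# Analytic route: the formula sd / sqrt(n).
analytic_se = sample.std(ddof=1) / np.sqrt(len(sample))

# Simulation route: the standard deviation of bootstrap means.
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(2000)]
bootstrap_se = np.std(boot_means, ddof=1)

print(analytic_se, bootstrap_se)  # the two estimates should be close
```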
Related to inferential statistics, sample-size planning is another important practical step in psychology research because researchers cannot easily collect massive amounts of data. Specifically, researchers can judge whether a certain sample size is sufficient to achieve a high probability of detecting the effects when the effects truly exist. This is called a “statistical power analysis” (Cohen, 1992), and the evaluation of sampling error (e.g., standard errors) also plays a key role here.
How machine learning considers a limited sample size
Ideally, researchers using machine learning will have a large amount of data. But in psychology research, data sets are still expected to contain a limited number of instances. When machine-learning methods are used in this setting, there is no obvious quantification of sampling error in most cases (Faraway & Augustin, 2018). For example, when one uses a random-forest model (Breiman, 2001a), one can calculate a metric to assess the predictive performance of the model (e.g., root mean square error) and extract a predictor-importance metric (e.g., permutation importance; Breiman, 2001a); but in most applications, researchers do not see standard errors or p values associated with these quantities (R. Sambasivan et al., 2020). 3 In other words, an explicit metric that reflects uncertainty related to the sample size is typically not seen. With machine-learning methods, then, how can one make a generalized conclusion from a limited sample size?
In a sense, sampling errors are implicitly handled in the machine-learning methodology. This handling occurs through cross-validation and regularization (or feature selection) procedures (see Box 1). By using cross-validation and hyperparameter tuning, researchers aim to find a predictive model that generalizes to new data sets with similar distributional properties. In other words, cross-validation and regularization aim to minimize the effect of idiosyncratic features present in the training data. Therefore, cross-validation and regularization can be regarded as methods for selecting the best machine-learning model while accounting for sampling errors. Note, however, that there are key differences. In machine learning, “generalization” typically refers to the ability of a model to perform well on new data drawn from a similar underlying distribution. “New data” (i.e., the test data) are often a subsample (e.g., 15%–50%) of the full observed sample, future data, or data from a different source. These hold-out samples are eventually known, and how well the model explains the new sample in comparison with the sample it was fit to (i.e., the difference between training error and test error) can be explicitly measured. On the other hand, the traditional statistical approach conceptualizes generalization with regard to the true population, from which the observed data (a tiny sample in comparison) are randomly sampled. This is analogous to randomly sampling data as the training set in machine learning. In both cases, through the random-sampling procedure, the aim is to have distributional similarity—between the sample and the general population in psychology and the training set and test set in machine learning. However, in the traditional statistical approach, the population can never be observed in full because it is too large. In this respect, the machine-learning and the traditional-statistics views conceptualize sampling errors in a slightly different manner. However, both approaches consider how well a model built on a sample generalizes to different data.
To be more concrete, consider a case in which the data contain five possible predictor variables, of which only one has a true linear relationship with the outcome. In the traditional statistical approach, the four irrelevant predictors are likely to yield nonzero beta values that are not statistically significant. In the machine-learning approach, when an unregularized model is fit to training data, not only the true predictive variable but also many of the other variables will likely show nonzero contributions to the prediction, partly as a result of sampling error. In machine learning, this is referred to as “overfitting” to noise in the data. However, through the model-training and selection process using cross-validation to find the right level of regularization (e.g., L1 norm; Tibshirani, 1996), the contribution of the four nonpredictive features should eventually be eliminated from the final predictive model. The resultant, more parsimonious model is expected to predict the outcome well in new data. This means that after a model has been trained and selected appropriately and the retained features show good predictive performance in the hold-out data, one can say that the model and its features were selected with sampling error taken into account.
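A minimal simulation of this five-predictor scenario, assuming a cross-validated lasso (LassoCV in scikit-learn) as the regularized model, might look as follows; the coefficients of the four irrelevant predictors should be shrunk to (near) zero.

```python
# Only the first of five predictors truly relates to the outcome.
# Cross-validated lasso should zero out the other four coefficients.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + rng.normal(size=100)  # only predictor 1 matters

ols = LinearRegression().fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)  # regularization strength tuned by CV

print("OLS coefficients:  ", np.round(ols.coef_, 2))    # all nonzero
print("Lasso coefficients:", np.round(lasso.coef_, 2))  # mostly zeroed
```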
Many introductory textbooks in machine learning emphasize generalization to a new sample as a critical feature of machine learning but do not compare this with the concept of “sampling error.” We believe that this situation makes the perceived gap between machine-learning and traditional statistical analysis bigger than the reality. However, as discussed earlier, the concepts of traditional statistics and machine learning are often closely related. As another example, for certain classes of models (e.g., linear regression based), model selection based on cross-validation, specifically, leave-one-out cross-validation (see Box 1), is asymptotically equivalent to model selection based on the Akaike information criterion (AIC; Akaike, 1973, 1974), which is commonly used in traditional statistics to strike a balance between goodness of fit and model complexity. In fact, in cases in which large sample sizes are not available and there are not a large number of models to compare, AIC can be used to select from different machine-learning models on the training data (Midway, 2022). 4 Therefore, both cross-validation and AIC penalize unnecessary complexity with the goal of preventing overfitting. In other words, traditional statistical approaches also implicitly incorporate a model-selection procedure similar to cross-validation.
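The sketch below illustrates this correspondence on synthetic data: three nested linear models are ranked both by AIC (via statsmodels) and by leave-one-out cross-validated error (via scikit-learn). With a true one-predictor model, the two criteria will typically favor the same model. This is an illustration of the asymptotic relationship, not a proof of it.

```python
# Ranking nested linear models by AIC and by leave-one-out CV error.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(size=100)  # only the first predictor matters

for k in (1, 2, 3):  # candidate models using the first k predictors
    Xk = X[:, :k]
    aic = sm.OLS(y, sm.add_constant(Xk)).fit().aic
    loo_mse = -cross_val_score(LinearRegression(), Xk, y,
                               cv=LeaveOneOut(),
                               scoring="neg_mean_squared_error").mean()
    print(f"k={k}: AIC={aic:.1f}, LOO MSE={loo_mse:.3f}")
```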
Acknowledging the link between a traditional-statistics approach and machine-learning approach, we can also think of ways to leverage the knowledge of one approach to enrich the other. One such example is the quantification of sampling error for predictor-importance estimates from machine-learning models. In traditional statistics, researchers consider not only whether a parameter is meaningful after taking sampling error into account (i.e., statistical significance) but also how precise the parameter estimates are likely to be (e.g., by using confidence intervals). Imagine researchers who have a machine-learning model that shows good performance in hold-out test data. The model may also include some interpretable values to represent predictor importance (e.g., Gini impurity for a random forest; Breiman, 2001a). However, if the training and/or test data are small, the model-performance and predictor-importance values are likely to be inaccurate because of sampling error. In machine learning, sampling error is typically reflected only in poor model performance and not in the explicit quantification or estimation of sampling error related to the predictors’ importance. Thus, by simply analyzing the predictor-importance estimates, particularly if they are calculated on the training data (for further discussion on what data should be used, see Lavelle-Hill et al., 2025), the researchers do not know how generalizable their variable importance findings are to other samples.
Although it is not common practice in the machine-learning community to routinely quantify sampling error explicitly (Faraway & Augustin, 2018), the computation of uncertainty related to machine-learning performance estimates has nonetheless been an active field of research (for a review, see Nemani et al., 2023). One useful method is bootstrapping (see Box 1). This method refits the selected model to predict the outcome on multiple generated (e.g., bootstrapped) data sets from the same distribution, providing researchers with an indicator of sampling error (e.g., see Bouthillier et al., 2021; Lavelle-Hill et al., 2021; Michelucci & Venturini, 2021; Nadeau & Bengio, 2003). Bootstrapping is a flexible and powerful method that is used in both traditional statistical analysis and machine learning because it does not require distributional assumptions (Efron & Tibshirani, 1994). In machine learning, it has been used to quantify the sampling error associated with the overall prediction performance (Bouthillier et al., 2021; Dietterich, 1998; McPherron et al., 2022; Michelucci & Venturini, 2021). However, in principle, it can also be used to quantify the effect of sampling error on estimates of predictor importance, given a model.
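A hedged sketch of this idea is given below: a chosen model is refit on bootstrap resamples of the training data, and the spread of the resulting test-set performance and permutation-importance estimates serves as an indicator of sampling error. The model, data, and number of replicates are illustrative.

```python
# Bootstrapping the training data to approximate sampling variability
# in test performance and permutation importance (synthetic data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
scores, importances = [], []
for _ in range(50):  # 50 bootstrap replicates (more in practice)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    model = RandomForestRegressor(random_state=0).fit(X_train[idx], y_train[idx])
    scores.append(model.score(X_test, y_test))
    imp = permutation_importance(model, X_test, y_test,
                                 n_repeats=5, random_state=0)
    importances.append(imp.importances_mean)

print("test R2: mean =", np.mean(scores), "SE =", np.std(scores, ddof=1))
print("importance SEs:", np.std(importances, axis=0, ddof=1))
```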
Sample-size planning/evaluation, which is highly related to generalizability in the traditional-statistics approach, has attracted even less attention in the field of machine learning. Unlike traditional statistics, the adequacy of the sample size in machine learning is usually only implicitly evaluated in relation to the model performance and has not been seriously considered beyond this (e.g., Dhiman et al., 2023). Some fields (e.g., psychiatry) have developed their own heuristics, such as the “number of predictors × 10” rule (Concato et al., 1995; Peduzzi et al., 1995, 1996). However, recent research has demonstrated that many human data sets used to build and evaluate prediction models have been undersized, increasing the risk of (upward) bias (Vabalas et al., 2019; Wynants et al., 2020) and potentially giving rise to inaccurate conclusions (Dhiman et al., 2023). Several studies have provided formulas to evaluate the adequacy of sample size in certain applied-machine-learning fields (e.g., in medicine; Riley et al., 2019a, 2019b, 2020; van Smeden et al., 2019), but the extent to which these methods can be extended to different classes of machine-learning models and other types of data, such as psychological data, is still unclear.
Measurement Error
Approach in traditional statistical methods
In the natural sciences, researchers are typically dealing with quantities such as chemical compounds or physical matter, which can be measured directly. In contrast, in psychology, researchers often want to measure intangible phenomena, such as attitudes, beliefs, emotions, states, and traits. To assess these ambiguous concepts, it is unavoidable that measures are accompanied by some amount of measurement error—the deviation of the measured values from the true value of the concept (given the sample; McDonald, 2013). Measurement errors reflect various factors that are irrelevant to the concept researchers want to measure. They exist because the “true” constructs can be measured only indirectly, for example, by asking questions about phenomena that theoretically reflect the construct. The problem is that almost all standard statistical models—especially regression models and their extensions—assume that variables are perfectly measured. However, when measurement errors are present, parameter estimates become biased (Bound et al., 2001).
One may have the intuition that measurement error simply attenuates the results, and thus, researchers need only keep this in mind when interpreting the effect size and p values. This is wrong. When there are multiple predictors in a regression model, for example, measurement error can wrongly strengthen a positive association or even reverse the effect (Bound et al., 2001; Brunner & Austin, 2009; Cohen et al., 2013). The direction of the bias is almost completely unpredictable when there are several correlated predictors. Even if a variable is perfectly measured, if another variable in the model is measured with error, the regression coefficient for the perfectly measured variable can be biased. 5 Because of the bias, Type 1 error rates also naturally inflate—even if the variable has no effect (and regardless of whether the predictor itself was perfectly measured), the variable may show a significant effect if other predictors in the model suffer from measurement error (Brunner & Austin, 2009; Cole & Preacher, 2014). Increasing the sample size cannot counter the bias (van Smeden et al., 2020). In fact, when the bias is present, a larger sample size actually increases Type 1 error rates.
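This counterintuitive behavior is easy to reproduce in a small simulation. In the sketch below, two correlated predictors both have true coefficients of 1; adding measurement error to one of them attenuates its estimated coefficient and inflates the coefficient of the other, perfectly measured predictor. All quantities are synthetic and illustrative.

```python
# Measurement error on one of two correlated predictors biases the
# coefficient of the *other*, perfectly measured predictor.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)  # correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # true betas are both 1

x1_noisy = x1 + rng.normal(scale=1.0, size=n)  # add measurement error to x1

b_clean = LinearRegression().fit(np.column_stack([x1, x2]), y).coef_
b_noisy = LinearRegression().fit(np.column_stack([x1_noisy, x2]), y).coef_
print("clean:", np.round(b_clean, 2))  # approximately [1.0, 1.0]
print("noisy:", np.round(b_noisy, 2))  # x1 attenuated, x2 inflated
```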
Although some sophisticated models have been proposed for variables with measurement errors (especially in econometrics; e.g., Fuller, 2009), the issue is hard to resolve because the measurement error needs to be quantified in the first place. One straightforward strategy commonly used in psychology is to assess constructs (e.g., personality traits) using multiple items. Inconsistent responses across the multiple items assessing the same construct are then deemed measurement error. With multiple items per construct, latent variable modeling can then be applied, for example, through the commonly used SEM framework (Bollen, 1989). The SEM framework is a combination of factor analysis (FA; Fruchter, 1954; P. Kline, 2014) and path models to estimate the commonality of individual items by supposing a latent factor (an unobserved variable that is inferred from observable variables) that researchers can flexibly include in a regression model. The fundamental assumption is the existence of a hidden “true score” behind the measurement, which reflects a psychological construct of interest, and researchers are concerned with the relationship between this true score and the outcome.
How machine learning can address measurement error
In the machine-learning community, it is well known that the quality of the data plays a large role in the predictive capabilities of models, and thus, much attention needs to be paid to data cleaning and augmentation (X. Chu et al., 2016; N. Sambasivan et al., 2021; Shorten & Khoshgoftaar, 2019). For example, Polyzotis et al. (2019) described data validation as “on par with the algorithm and infrastructure used for learning.” Despite this, and aside from the potential impact on prediction performance, there has been limited discussion on the effect of measurement error on the machine-learning model and its interpretation (for exceptions, see Datta & Zou, 2020; Jacobs & Wallach, 2021; Jacobucci & Grimm, 2020; McNamara et al., 2022). This may be because strong assumptions on the validity of data are not needed for prediction in machine learning. However, the difference between an unobservable construct and its measurement can lead to misinterpretation and biased algorithms (e.g., Jacobs & Wallach, 2021). Furthermore, measurement error, more generally, is not unique to psychological data. Many of the data sets used to train machine-learning algorithms will also contain measurement error (McNamara et al., 2022; Morgenstern et al., 2021). But is this error taken into account in machine-learning models, and does it need to be?
On the one hand, certain machine-learning approaches are arguably highly successful at reducing measurement error because they have the capability to learn underlying representations of the feature space. This can be done using unsupervised methods and is an alternative to manually engineering features (e.g., by taking the mean or median of the multiple measurements). The process typically occurs within an (optional) step of the supervised-machine-learning pipeline known as “dimensionality reduction” (Ghojogh et al., 2023) before the model is fit (Bartal et al., 2019). 6 Dimensionality reduction in machine learning is the process by which the number of input features is reduced, either by feature-selection or feature-compression methods (the combining of many features into a smaller number of model inputs). In machine learning, this is typically performed with the goal of avoiding the curse of dimensionality (Bellman, 1957; see Box 1), thus aiding prediction performance and easing computational costs.
When the feature space is compressed, higher-order dimensional features (or “components”) are arguably more reliable and thus suffer less from measurement error. This is because the selected components are capturing the dimensions in the data with the highest variance, making them less sensitive to noise (Greenacre et al., 2022). Furthermore, simplifying the representation of the data (reducing the dimensions) means there are fewer opportunities for measurement error to occur (Hellton & Thoresen, 2014). In addition to principal-components analysis (PCA), which is also commonly used in psychology (Bryant & Yarnold, 1995), methods used in machine learning include nonnegative matrix factorization (NMF; D. Lee & Seung, 2000; Y.-X. Wang & Zhang, 2012) and independent-component analysis (ICA; T.-W. Lee & Lee, 1998). 7 Similar to FA (Fruchter, 1954; P. Kline, 2014) and PCA, these methods produce components that the original variables load onto. Although (confirmatory) FA and PCA are similar (Hinton et al., 1994; Roweis & Ghahramani, 1999) and capture variance or covariance structures (for further discussion on the similarities and differences, see Widaman, 2007), ICA aims to maximize the statistical independence between the extracted components (T.-W. Lee & Lee, 1998). NMF uses matrix factorization to compress (nonnegative) data with additive properties (e.g., images; Aonishi et al., 2022; Guillamet et al., 2003) into two matrices (plus some error)—the H matrix, containing the loading of features onto components, and the W matrix, containing the loadings of data instances onto components. In addition, a number of nonlinear methods have been proposed, for example, kernel PCA (Schölkopf et al., 1998). For an overview of nonlinear methods, see Van Der Maaten et al. (2009), and for visualization methods, see Rudin et al. (2022). For unstructured text data, there are also natural-language-processing methods to extract higher-order topics or linguistic features (e.g., topic modeling; Blei & Lafferty, 2009; Kao & Poteet, 2007).
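As a brief illustration of these feature-compression methods, the sketch below applies PCA, ICA, and NMF (all available in scikit-learn) to the same synthetic item responses generated from two latent dimensions. The number of components and all data-generating choices are illustrative.

```python
# Compressing 12 noisy "items" generated from 2 latent dimensions.
import numpy as np
from sklearn.decomposition import NMF, PCA, FastICA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                  # two "true" constructs
loadings = rng.normal(size=(2, 12))
items = latent @ loadings + rng.normal(scale=0.5, size=(200, 12))

pca_scores = PCA(n_components=2).fit_transform(items)
ica_scores = FastICA(n_components=2, random_state=0).fit_transform(items)

nmf = NMF(n_components=2, random_state=0, max_iter=500)
W = nmf.fit_transform(items - items.min())  # NMF requires nonnegative data
H = nmf.components_                         # item loadings on components

print(pca_scores.shape, ica_scores.shape, W.shape, H.shape)
```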
In a supervised-machine-learning pipeline, the feature-compression process often occurs without being guided by theoretical concepts. Instead, the goal is to find a lower-dimensional data representation while retaining as much of the original information as possible (Montano, 2014). This is in contrast to SEMs/FA, which usually work on theoretical or conceptual grounds (although see also exploratory SEMs; Marsh et al., 2014). Although unsupervised-machine-learning methods can reduce measurement error to some extent (i.e., increase measurement reliability; Greenacre et al., 2022), their effect remains somewhat limited. It is also possible that all variables are measured poorly and that the compressed components pick up on the systematic (i.e., correlated) measurement error. Furthermore, it is known from the psychometrics literature that such point estimates of aggregated features—regardless of the method used—are still contaminated by measurement error because they do not account for randomness (Lechner et al., 2021). In addition, particularly when driven by data rather than theory, the resultant components are not guaranteed to be interpretable (Rudin et al., 2022). Thus, as input features get merged, this can reduce model interpretability and make it more difficult to understand the contribution of individual features.
Aside from the reduction of measurement error that may occur as a (typically unintended) consequence of feature-compression approaches, there is relative inattention to measurement error in machine learning and its effects in the model-fitting phase (i.e., its impact on coefficients in a regression-based machine-learning model). This is because machine learning, at the extreme end of the continuum, is primarily aimed at maximizing the prediction accuracy given the features (i.e., measurements) available (Breiman, 2001b; Hofman et al., 2021; Mullainathan & Spiess, 2017; R. Sambasivan et al., 2020). Given this goal, it may not make sense to consider or discuss the predictive accuracy of a feature before being disturbed by measurement error (the true scores) or how the predictive accuracy of the true scores might be biased. Taking this perspective, measurement errors are typically of concern only if they attenuate the overall prediction accuracy and, hence, if this information can be used to change data collection or the modeling process in a way that would improve predictive performance. In other words, when there is a true relationship, it is better to have features with high reliability not because such features more closely reflect true scores but because they generally predict the outcome better. In addition, from a purely prediction standpoint, arguably, if the measurement error of a feature turns out to increase prediction accuracy (because the error happens to be correlated with the outcome), one should retain such a “contaminated” feature (as viewed from a traditional-statistics perspective) in the predictive model.
That said, we believe that there is a clear benefit of thinking more explicitly and seriously about measurement error in machine learning. First, by explicitly accounting for measurement error in machine-learning models, researchers may be able to achieve a better prediction. This is one potential consequence of dimensionality reduction (Bellman, 1957). Yet a more explicit method of dealing with measurement error could yield even bigger predictive gains. More broadly, by assuming measurement errors, researchers implicitly impose theoretical constraints on features. A substantial body of literature suggests that (valid) theoretical constraints can enhance machine-learning performance (e.g., Karniadakis et al., 2021; Schölkopf et al., 2021; Singh et al., 2019). Second, when the importance of individual features is interpreted, measurement error could lead to misleading results (see also Jacobucci & Grimm, 2020; Luijken et al., 2019; McNamara et al., 2022). At least in the case of regression-based models, as noted above, measurement error could change the regression weight even in the opposite direction. In more complex models, such as random forests or XGBoost, it is possible that there is a positive bias toward selecting features with lower measurement error, similar to biases in favor of features with higher cardinality (Kononenko, 1995; A. P. White & Liu, 1994). Yet there has been little research that has explicitly examined the impact of measurement error on predictor importance and selection in machine-learning models.
One potentially fruitful avenue may be to combine machine-learning algorithms with the SEM framework, for example, using regularized SEM (Brandmaier & Jacobucci, 2023; Jacobucci et al., 2016, 2019; X. Li & Jacobucci, 2022) or SEM tree/forest algorithms (Brandmaier et al., 2013, 2016). With these models, theoretical latent variables can be used as features to predict the outcome, in combination with regularization mechanisms and complex nonlinear functions. At the same time, SEMs typically define measurement error in a very specific way (although there are some exceptions), that is, the unique variance that is not commonly shared by the items (the shared part forms a latent variable; Bollen & Lennox, 1991). In other words, some important item-specific elements might also be regarded as measurement error and discarded from the analysis (Donnellan et al., 2023). This is despite recent research showing that a model including item-specific effects could have higher predictive power than a model with only scale scores (McClure et al., 2021). Therefore, although there is promise in these recent methodological developments, they address only one specific definition of measurement error commonly found in psychological data.
Nonindependent Data
Approach in traditional statistical methods
Most of the traditional statistical analysis methods in psychology assume data independence, in which each data point is treated as independent of the others. However, psychological data often exhibit a hierarchically nested structure. For instance, data points can be nested within individuals or students within schools. In such cases, the data points are no longer independent from one another because data points within the same upper unit (e.g., person, school) are likely to be more similar than data points from different upper units. The degree of dependency between groups can be quantified using measures such as intraclass correlation (ICC; Bartko, 1966).
If one analyzes nonindependent data with standard statistical methods, generally, two issues emerge, and the extent of the problem becomes greater as ICC increases. First, the analysis could bias causal estimates because the upper unit (e.g., person, school) could be regarded as a confounder, directly influencing the variables of interest (Rohrer & Murayama, 2023). Second, the analysis would tend to underestimate the sampling error because the analysis assumes that all the cases have unique and independent information despite the data actually including redundant, shared information. To address these issues, researchers have proposed various statistical methods, such as mixed-effects modeling (or hierarchical linear modeling; Gałecki et al., 2013), robust standard errors (Hoechle, 2007), and generalized estimating equations (Zeger et al., 1988). These methods appropriately take into account the variance attributed to the upper units, allowing researchers to make appropriate statistical inferences. These methods can also address the issue of confounds by simple extensions, such as using a cluster-mean-centering approach (Enders & Tofighi, 2007) or latent variable modeling (Silva et al., 2019). They eliminate the between-clusters differences from the data manually (i.e., cluster-mean centering) or statistically (i.e., latent variable modeling); thus, the features associated with the clusters can no longer be confounders. Moreover, mixed-effects modeling, which is widely used in psychology, takes advantage of hierarchical data structures to flexibly answer various research questions (Raudenbush & Bryk, 2002). For example, if regression slopes are different between clusters (random slopes), researchers can include predictors at the cluster level to explain these cluster differences (i.e., cross-level interactions).
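For readers working in Python, the sketch below shows a minimal random-intercept model using statsmodels on synthetic student-within-school data and derives the ICC from the fitted variance components. The variable names ("score", "hours", "school") are hypothetical.

```python
# A minimal random-intercept model on synthetic clustered data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
schools = np.repeat(np.arange(20), 30)        # 20 schools, 30 students each
school_effect = rng.normal(scale=2.0, size=20)[schools]
hours = rng.normal(size=600)
score = 50 + 3 * hours + school_effect + rng.normal(scale=3.0, size=600)
df = pd.DataFrame({"score": score, "hours": hours, "school": schools})

result = smf.mixedlm("score ~ hours", df, groups=df["school"]).fit()
print(result.summary())

# ICC = between-school variance / (between-school + within-school variance)
icc = result.cov_re.iloc[0, 0] / (result.cov_re.iloc[0, 0] + result.scale)
print("ICC:", round(icc, 2))
```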
How machine learning can address nonindependent data
In practical applications of machine learning at the extreme end of the continuum, one could argue that the data structure is not crucial in the context of machine learning’s primary goal, to find an optimal predictive function (Breiman, 2001b). From this purist perspective, one may believe that the data’s underlying structure need not be a concern as long as the predictions are accurate. However, when viewed through the lens of traditional statistics, it can be pointed out that the search for an optimal prediction is undermined by ignoring the nonindependence of data. More specifically, neglecting the presence of nonindependence leads to the underestimation of sampling error, ultimately resulting in overfitting, as we illustrate below (for simulation work, see also Hu et al., 2023).
Consider an example in which a researcher has data from 10 groups, each comprising 100,000 individuals in the training data set. The individuals within each group are highly similar, whereas those from different groups differ markedly. The ICC would be high in such a scenario. If these group distinctions are disregarded and a standard cross-validation methodology (e.g., K-fold) is employed without considering nonindependence, the model is likely to exhibit overfitting. This occurs because the modeling and cross-validation procedures assume that the 10 × 100,000 data points are independent. Consequently, the model might incorrectly identify certain group-specific features as important predictors even if these features are noncausal and correlate with the outcome by chance (Hornung et al., 2023; Roberts et al., 2017). In other words, the model would be falsely confident (exaggerated by the large sample sizes; Reid, 2018) that the effects of these features were not the result of sampling error.
Although there is not yet a “gold standard” for how to deal with nested data structures, there are two general strategies employed in the current supervised-machine-learning literature to address the underestimation of sampling errors. 8 The first is a simple, practical solution, which we call the “clusters-as-features” approach: Researchers include the upper units as predictors (e.g., a categorical class variable for tree-based approaches or a set of dummy variables for classes in a regularized regression; see also Kilian et al., 2023). This method is analogous to a fixed-effects model in the traditional-statistics literature (Gardiner et al., 2009; McNeish & Kelley, 2019; Sommet & Lipps, 2024), which is known to address the underestimation of standard error. This clusters-as-features approach could also help to improve predictions. For example, if there are big differences between schools in academic achievement, information about which school a student attends has good predictive utility for the student’s achievement score. Using this method, potential interactions between the upper units and lower units (i.e., subgroup effects) could also be modeled. However, one fundamental limitation of the clusters-as-features approach is that it cannot easily make a prediction about a new cluster (below, we discuss the implication in relation to inflated performance estimates)—this is also the limitation of fixed-effects models in traditional statistical approaches. If there are 50 schools in the data, the clusters-as-features approach could make a good prediction for only those 50 schools, and the model cannot predict a new school because there is no (dummy) variable representing that school in the model. Note that by integrating certain unsupervised approaches, such as similarity-based methods (Ding et al., 2014) or neural-network approaches that learn continuous embeddings (e.g., Mars, 2022), one could get around this problem. However, in a typical supervised pipeline, the utility of such a predictive model would be limited in this context.
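A minimal sketch of the clusters-as-features approach, assuming a regularized linear model and dummy-coded clusters, might look as follows; the comments note why such a model cannot score observations from an unseen cluster. Variable names and data are hypothetical.

```python
# Clusters-as-features: the cluster identifier enters the model as a
# set of dummy variables (synthetic data; names are hypothetical).
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "school": np.repeat(np.arange(5), 40),  # 5 schools, 40 students each
    "hours": rng.normal(size=200),
})
df["score"] = 3 * df["hours"] + 2 * df["school"] + rng.normal(size=200)

# One-hot encode the cluster variable and append it to the feature set.
X = pd.get_dummies(df[["hours", "school"]], columns=["school"])
model = Ridge().fit(X, df["score"])
# Note: this model cannot score students from a school it has not seen,
# because no dummy column exists for a new school.
```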
The second is a more technical solution. Researchers have developed new machine-learning methodologies that statistically deal with the nonindependence of data. These methods are useful in situations in which data collection might be biased by the data clusters (e.g., the data were collected from different labs) and one wants to control for this effect rather than use it as a predictor. The focus has been on combining mixed-effects models with different machine-learning models to account for the nonindependence, for example, combining linear mixed-effects models with lasso regression (Groll & Tutz, 2014; Pan & Huang, 2014; Schelldorfer et al., 2011), decision trees (Fokkema et al., 2018; Fu & Simonoff, 2015; Hajjem et al., 2011; Kundu & Harezlak, 2019; Ngufor et al., 2019; Sela & Simonoff, 2012), random forests (Calhoun et al., 2021; Capitaine et al., 2021; Hajjem et al., 2014), gradient tree boosting (Salditt et al., 2023; Sigrist, 2022), and neural networks (Mandel et al., 2023; J. Wang, 2025; Xiong et al., 2019). Promising recent work has also focused on developing generalized model-agnostic frameworks (Kilian et al., 2023). The general idea here is that the fixed-effects and random-effects parts are separately estimated via alternate iteration: The fixed-effects part of the model is estimated by a standard machine-learning model, and the random effects are estimated by a standard method in mixed-effects modeling (e.g., restricted-maximum-likelihood method).
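To convey the flavor of this alternation, the following is a heavily simplified, purely conceptual sketch with random intercepts only (no random slopes, variance-component estimation, or convergence checks), loosely inspired by the mixed-effects random-forest idea. It is not a substitute for the dedicated implementations cited above.

```python
# Conceptual sketch of alternating between a machine-learning model for
# the fixed part and cluster intercepts for the random part.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_mixed_ml(X, y, clusters, n_iter=10):
    """X, y: numpy arrays; clusters: array of cluster labels per row."""
    b = pd.Series(0.0, index=np.unique(clusters))  # random intercepts
    model = RandomForestRegressor(random_state=0)
    for _ in range(n_iter):
        # (1) Fit the fixed (machine-learning) part to the outcome minus
        #     the current random-intercept estimates.
        model.fit(X, y - b[clusters].to_numpy())
        # (2) Re-estimate each cluster's intercept from the residuals
        #     (a full estimator would also shrink these toward zero).
        resid = y - model.predict(X)
        b = pd.Series(resid).groupby(clusters).mean()
    return model, b
```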
In many cases, incorporating the random effects has improved predictive performance beyond linear mixed-effects models and the original machine-learning algorithm (e.g., Hu et al., 2023; Ngufor et al., 2019; Sela & Simonoff, 2012). However, these hybrid models are not yet commonplace in the applied-machine-learning field, and the availability of the software, for example, in Python (the programming language used most commonly for machine learning; Hao & Ho, 2019; Raschka, 2015), is still somewhat limited. For example, Scikit-learn (Pedregosa et al., 2011a), the most popular Python package for machine learning (Hao & Ho, 2019), does not include such models. 9 Hierarchical Bayesian models can also be used to capture dependencies between different levels, offering robust estimates (Bharadiya, 2023). However, both Bayesian machine-learning and mixed-effects machine-learning methods present challenges in terms of computational complexity and scalability when dealing with data sets with a large number of predictors (Bharadiya, 2023; Kilian et al., 2023).
The issue of confounding from the cluster level can be addressed in the same way as the traditional statistical approach. Specifically, researchers could cluster-mean center the variables (both the outcome and the predictors) before applying machine-learning methods (e.g., as in Deininger et al., 2023). 10 In psychology, this variable transformation is applied to examine “within-clusters” effects, which are less confounded by the cluster-level effects (Hamaker & Muthén, 2020). However, from the perspective of pure prediction, such centering approaches do not make as much sense because they would typically decrease the prediction performance by potentially throwing away important predictive information (i.e., cluster-level effects) from features. That said, the cluster-mean-centering method can provide a clearer interpretation of the effects by eliminating potential confounds. The decision to do cluster-mean centering will depend on the purpose of the analysis—whether one prioritizes prediction or interpretation. Researchers should be aware that the decision fundamentally changes the way the results can be interpreted.
In situations in which the ICC is high, it can be advantageous to leverage the clustered structure of the data to enhance predictive accuracy. One effective approach is to create additional model features derived from each cluster, such as the mean, median, mode, standard deviation, or entropy of relevant predictor variables within that cluster. By incorporating these cluster-level statistics as predictors, researchers can exploit the shared variance within clusters, ultimately improving the performance of predictive models. In psychology, several studies have shown that cluster-level predictors contribute substantially to various outcomes beyond the original predictors—an approach often called “contextual analysis” (Firebaugh, 1978) or “compositional models” (Harker & Tymms, 2004). However, by including additional features in the model, researchers increase the dimensionality, which can lead to both unstable predictions (Bellman, 1957) and longer run times. Researchers will likely need to be guided by theory or require additional feature-selection methods to decide which variables to aggregate at the cluster level and include in the model.
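A minimal pandas sketch of such cluster-level feature construction follows; the column names ("school", "score") and the choice of summary statistics are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "school": ["a", "a", "b", "b", "b"],
    "score": [2.0, 4.0, 1.0, 3.0, 5.0],
})

# Attach cluster-level summaries of a predictor as additional features.
# Note that each added statistic increases the dimensionality of the input.
for stat in ["mean", "std", "median"]:
    df[f"score_school_{stat}"] = df.groupby("school")["score"].transform(stat)
```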
In machine learning, analysts must consider data nonindependence when choosing a model-evaluation method and reporting on performance in relation to the research question. For example, when applying cross-validation, researchers can split the data into training and test/validation data either between clusters (between-clusters split) or within clusters (within-clusters split; Roberts et al., 2017). If the ICC is high, the within-clusters split should show a higher predictive performance in the test/validation data because both training and test data include examples from the same clusters. For a between-clusters split, also referred to as “block” (Roberts et al., 2017) or “grouped” cross-validation, the test set contains data from clusters that the model has not seen in training. If a researcher is interested in prediction for only those clusters the model has seen examples of, a within-clusters split makes sense (and a clusters-as-features approach would be useful). However, if one wanted to estimate the performance of a model predicting a new cluster, a between-clusters split would give researchers more accurate, realistic estimates (Hornung et al., 2023). The choice should be clearly articulated and justified so that any information leaked from the test set to the training set (see “data/information leakage” in Box 1) caused by a high ICC (i.e., similar data occurring in both training and test sets) does not give misleading, overly optimistic performance estimates (Farokhi & Kaafar, 2020; Gibney, 2022; Guignard et al., 2024). This example highlights that data nonindependence creates a multitude of epistemic questions that researchers should consider before making analytic decisions.
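In scikit-learn, the two splitting strategies can be contrasted directly; the following is a minimal sketch with synthetic data, in which GroupKFold implements a between-clusters ("grouped") split and an ordinary shuffled KFold yields a within-clusters split.

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
groups = np.repeat(np.arange(10), 10)               # 10 clusters of 10 observations
y = X[:, 0] + groups * 0.5 + rng.normal(size=100)   # outcome with cluster effects

within = cross_val_score(Ridge(), X, y, cv=KFold(5, shuffle=True, random_state=0))
between = cross_val_score(Ridge(), X, y, groups=groups, cv=GroupKFold(5))
# With a high ICC, `within` will typically look better than `between` because
# the within-clusters split leaks cluster information into the training folds.
```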
Missing Data
Approach in traditional statistical methods
Missing data are common in psychological research. For example, in surveys, missingness can occur because of dropout, technical errors, or the skipping of responses because of inattention, confusion, boredom, or study design (e.g., Graham et al., 2006). In longitudinal data, dropout typically occurs over time. Sometimes, researchers might be tempted to simply throw away these missing data before the analysis (called “complete-case analysis”). However, this brings about two issues. First, complete-case analysis eliminates all partially informative data (i.e., cases that have missingness for only a few but not all of the variables), reducing the precision (e.g., inflating the standard errors) of parameter estimates. Second, when the missingness does not occur completely at random, systematic bias may be introduced into the results (Enders, 2022, 2025).
The traditional statistical literature has offered a number of techniques to handle missing data (for a comprehensive overview, see Enders, 2025), most notably multiple imputation (Little & Rubin, 2019; Rubin, 1987, 1996; Schafer, 1997; Van Buuren, 2018) and (full information) maximum likelihood (Arbuckle et al., 1996; Baraldi & Enders, 2010; Beale & Little, 1975; Dempster et al., 1977). Full information maximum likelihood is a method in which missing values are not replaced or imputed but are instead handled directly when calculating the likelihood function. This method is especially popular in SEMs (R. B. Kline, 2012). Multiple-imputation methods impute missing data from a model that a researcher specifies (e.g., multiple regression). Unlike single-imputation methods, these methods incorporate the uncertainty of the imputed values by generating multiple different imputed data sets (typically five to 20, depending on the proportion of missingness; I. R. White et al., 2011) using (model-based) random errors. The analysis is then carried out in each of the imputed data sets, and the results are either aggregated or adjusted to account for the uncertainty (e.g., using Rubin’s rules; Rubin, 2018).
One critical but often overlooked assumption of multiple imputation is that the researcher correctly specifies the model for the missing data (Van Buuren, 2018). For example, if a researcher believes that a missing value is best predicted by interaction effects of other variables, these interaction terms should be included in the model predicting the missing values (Enders, 2025). When assumptions are met, full-information-maximum-likelihood and multiple-imputation approaches will typically produce similar results (Collins et al., 2001). Both methods allow researchers to make full use of the data and, more notably, can provide unbiased parameter estimates with correct standard errors even when missingness is not completely at random. Specifically, if the missingness of a variable is systematic and the systematic missingness can be explained by the other variables in the model (a condition called “missing at random” [MAR]), the methods produce unbiased parameter estimates and correct standard errors (Enders, 2025). Of course, one cannot know whether the MAR assumption is met in practice, but in general, these methods are more robust to systematic missingness than complete-case analysis (Sterne et al., 2009).
How machine learning can address missing data
In machine learning, maximum-likelihood-based approaches are available for some models (e.g., Ghahramani & Jordan, 1993; Williams et al., 2005); however, many models (e.g., tree-based algorithms) do not use a likelihood function to estimate parameters. Instead, several machine-learning models can handle missing data as inputs or treat missingness as a meaningful feature (You et al., 2020). For example, many tree-based models have built-in methods to deal with missing data (Breiman et al., 1984; T. Chen & Guestrin, 2016; Friedman, 2001): incorporating missingness as a feature in the model (i.e., allowing missingness to be a criterion to split on), making available only the nonmissing information for each split decision and distributing the missing data across the daughter nodes in accordance with the distribution of the nonmissing values (i.e., the C4.5 algorithm; Quinlan, 2014), or using surrogate splitters that emulate the primary splitter (the best split decision) and are called on when the primary splitter is missing (Breiman et al., 1984; Umezawa et al., 1995). For further strategies, see Gavankar and Sawarkar (2015) and Twala (2009). Likewise, the XGBoost algorithm, an ensemble decision-tree method that builds trees sequentially, can also handle missing data. At each split point, it assigns a default branch for the missing data to follow, namely, the branch that minimizes the loss function (the quantification of the difference between the predicted and the actual values) when the missing data follow it (T. Chen & Guestrin, 2016). In other words, it attempts to handle the missing data in a manner that improves prediction. In certain applications, such as predicting mental-health outcomes, leveraging the extent and presence of missingness has been shown to bolster predictive performance (Wu et al., 2022), and many machine-learning methods can capitalize on this.
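Because XGBoost learns a default branch for missing values, NaN entries can be passed to the model directly; the following is a minimal sketch with synthetic data, assuming the xgboost Python package is installed.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + rng.normal(size=200)
X[rng.random(X.shape) < 0.2] = np.nan  # inject 20% missingness

model = xgb.XGBRegressor(n_estimators=50)  # NaN is treated as missing by default
model.fit(X, y)
preds = model.predict(X)                   # prediction also accepts NaN inputs
```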
Extensions to other machine-learning models have also been proposed to enable the handling of missing data, for example, for random forests (Xia et al., 2017), SVMs (Chechik et al., 2008; Pelckmans et al., 2005), and deep-learning models (Bengio & Gingras, 1995; Goodfellow et al., 2013; Śmieja et al., 2018). Using algorithms that can take missing data as inputs allows the data to be modeled in relation to the missing-data context rather than trying to emulate a context in which no data are missing. This prevents impossible scenarios from occurring in the imputed data, such as pregnant fathers (Van Buuren et al., 2006). It can also produce significant computational savings (Chechik et al., 2008). However, it has been shown that some of these algorithms that handle missing data are inferior to first imputing the data and then fitting a model (e.g., Feelders, 1999). Furthermore, selecting a certain machine-learning model a priori because it handles missing data or using this as a criterion to restrict the model search space may undermine the model-selection process by causing a suboptimal model to be chosen. How a particular model handles missing data may also introduce unknown bias (You et al., 2020). Thus, many researchers prefer to keep the handling of missing data as a preprocessing step, separate from the model-selection process (Nijman et al., 2022).
In line with this idea, a flexible and commonly used strategy in machine learning is single imputation (Nijman et al., 2022; You et al., 2020). Single imputation involves creating a single imputed data set (without adding random error) and using the imputed data to make predictions. In their review, Nijman et al. (2022) found that 61% of machine-learning analyses that imputed data used this approach. Single imputation can take various forms (Thomas et al., 2020). The most naive strategy is to simply fill in the missing values with a constant, such as the mean, median, mode, or most common category—often referred to as “simple imputation” (Pedregosa et al., 2011c). A more rigorous method, multivariate imputation (e.g., Pedregosa et al., 2011b; Van Buuren & Groothuis-Oudshoorn, 2011), uses a statistical model or supervised-machine-learning algorithm to predict the missing data points (as outputs) given the existing values (inputs). A common implementation is to iterate over the columns in a round-robin fashion, predicting the missing values in the target column using the other features in the data (e.g., Pedregosa et al., 2011b). In addition to regression (Z. Zhang, 2016b), different machine-learning algorithms have been proposed for multivariate imputation, many of which have been shown to yield more accurate predictions than statistical methods (e.g., Jerez et al., 2010). Commonly used methods include K-nearest neighbors (Z. Zhang, 2016a), Bayesian ridge regression (Tipping, 2001), and random forests (Breiman, 2001a). In addition, state-of-the-art methods continue to be developed, including the use of generative adversarial networks (Luo et al., 2018; Yoon et al., 2018), graph neural networks (You et al., 2020), autoencoders (Gondara & Wang, 2018; Vincent et al., 2008), and different deep-learning methods (e.g., Chai et al., 2020). Such methods can additionally be used for data augmentation and the generation of new synthetic data (e.g., Y. Chen et al., 2020). The attractiveness of single imputation is that once the imputed data are created, any type of machine-learning method can be applied as usual. However, single-imputation methods can be biased by the model selected and the specific initialization of the default values (You et al., 2020). Thus, machine-learning researchers may try out different imputation methods in the training phase to find the one that maximizes prediction accuracy in the validation data. In this way, imputation can be viewed as a hyperparameter and part of the pipeline that is optimized for prediction.
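Both flavors of single imputation are available in scikit-learn; the following minimal sketch, using synthetic data, shows constant-fill simple imputation alongside round-robin multivariate imputation (which uses Bayesian ridge regression by default).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan

# Simple imputation: fill each column's missing values with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate (round-robin) imputation: each column with missing values is
# predicted in turn from the other columns.
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```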
When using single imputation, machine-learning researchers commonly apply the following two constraints. First, it is common practice to impute missing data based solely on the training data (but see also Gunn et al., 2023). This simulates the real-world scenario in which the test data are unknown during training and all instances in the test data may not have been seen when predicting a given instance. If information from the test data is used in the training phase to impute the missingness, it contaminates the training data by making it artificially more similar to the hold-out test set. This is another example of data/information leakage (see Box 1) and can lead to the potential overestimation of the prediction performance in test data (Gibney, 2022). Second, given the goal of finding the best predictive function that generalizes to out-of-sample data, researchers in machine learning typically do not impute the outcome variable (Jerez et al., 2010). This is because supervised-machine-learning methods are evaluated by comparing the model’s predictions with “ground truth” labels (the actual values); the evaluation would be compromised if the model’s predictions were compared against values produced by an imputation function with unknown fidelity to the true values. Moreover, imputing the outcome would simply duplicate the primary modeling effort: Both the imputation model and the main analysis would learn a mapping from the same features to the outcome, so the imputation step adds no new information. For these reasons, outcome imputation is rarely used in machine-learning practice.
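The first constraint is easy to enforce in scikit-learn by placing the imputer inside a Pipeline, so that its statistics are learned from each training fold alone during cross-validation and merely applied to the held-out fold; a minimal sketch with synthetic data follows.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(size=100)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer's medians are re-estimated on each training fold and then
# applied to the corresponding test fold, preventing information leakage.
pipe = make_pipeline(SimpleImputer(strategy="median"), Ridge())
scores = cross_val_score(pipe, X, y, cv=5)
```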
Given the goal of prediction, single-imputation methods make sense because they attempt to impute data in a way that preserves the existing predictive relationships without throwing away missing data. From the perspective of traditional statistical analysis, however, we raise a point for discussion—whether imputation methods in machine learning need to take into account the uncertainty arising from missingness. As an (extreme) example, imagine a predictor variable has 90% missing data and the rest of the data (10%) show some observed correlation with the outcome. It is reasonable not to take this observed correlation at face value (i.e., not to generalize it to the full data set) because the amount of data is comparatively small. Multiple imputation naturally accommodates this intuition by taking into account the uncertainty associated with the amount of missing data. Single imputation does not: A singly imputed data set contains no information distinguishing the imputed values (which carry uncertainty) from the actually observed values. Consequently, the prediction model selected on the imputed training data may overfit to the imputation function because the model considers the imputed data as equally valid as the observed data—potentially resulting in suboptimal performance in the test data. Furthermore, how would one interpret a model that found the feature with 90% imputed data highly predictive?
In practice, when one has a very large number of possible predictors, it is usually deemed reasonable to discard features with high missingness (Nijman et al., 2022), like the one in the above example, because of the high amount of uncertainty (unless there are theoretical or causal reasons to retain them; Batista & Monard, 2003). This can be considered part of a feature-selection process (García et al., 2016). Following this intuition, researchers might use some rule of thumb (e.g., > 50% missing) to decide whether to drop or impute a predictor variable (Lin & Tsai, 2020). Such a threshold will depend on domain knowledge, the proposed/selected model, and the number of features and instances in the data (i.e., whether it is important to preserve features; Batista & Monard, 2003). However, there are no well-established rules, and the decision is usually left to the researcher’s discretion (Emmanuel et al., 2021).
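Such a filter takes only a few lines of pandas; the following minimal sketch applies a hypothetical 50% threshold, with the function name chosen for illustration.

```python
import pandas as pd

def drop_high_missingness(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose proportion of missing values exceeds `threshold`."""
    keep = df.columns[df.isna().mean() <= threshold]
    return df[keep]
```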
Because single-imputation methods do not account for the uncertainty arising from missingness, should multiple-imputation methods be used instead for machine-learning analyses? This sounds like a reasonable suggestion, but in practice, it is more complicated (Khan & Hoque, 2020). Aside from the obvious computational cost of handling multiple data sets, the problem is that there is no established method like Rubin’s rules (Rubin, 2018) to integrate the results from multiple data sets (in both the training data and the test data). For example, training a model on different imputed data sets will likely result in different hyperparameters and features being selected in each data set (Molnar et al., 2020), and it is not clear whether it makes sense to integrate results from models with different hyperparameters/predictors. There are other procedural considerations for which different researchers have different proposals (e.g., whether imputed data should be analyzed separately or stacked together and analyzed as a single data set; Gunn et al., 2023), but there has been little consensus on which procedure to use. Practical applications of multiple imputation in machine learning seem to focus mostly on averaging the results and less on the variance across different imputations (e.g., Khan & Hoque, 2020; but see Alasalmi et al., 2015). This means that the uncertainty of the imputed data is typically not taken into account by the model (Emmanuel et al., 2021; Graham, 2009), which could result in misleading interpretations.
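As an illustration of the prediction-averaging workflow just mentioned (and of how the spread across imputations could be retained rather than discarded), the following is a minimal sketch, not an established standard; the function name and model choices are hypothetical. Setting sample_posterior=True makes IterativeImputer draw varying imputations, as multiple imputation requires, and the outcome is never imputed.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def mi_predict(X_train, y_train, X_test, m=5):
    preds = []
    for i in range(m):
        # Each iteration draws a different imputation of the features,
        # fitted on the training data only to avoid leakage.
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        Xtr = imputer.fit_transform(X_train)
        Xte = imputer.transform(X_test)
        model = RandomForestRegressor(random_state=i).fit(Xtr, y_train)
        preds.append(model.predict(Xte))
    preds = np.array(preds)
    # Mean prediction across imputations plus the between-imputation spread,
    # which reflects (part of) the uncertainty due to missingness.
    return preds.mean(axis=0), preds.std(axis=0)
```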
General Discussion
In this article, we have taken the reader through some of the challenges of using a machine-learning approach that are inherent in the typical characteristics of psychological data. Specifically, these involve handling a limited sample size, measurement error, nonindependent data, and missing data. Several prominent articles have discussed the application of a machine-learning approach in psychology (Adjerid & Kelley, 2018; Bzdok, 2017; Dwyer et al., 2018; Elhai & Montag, 2020; Hofman et al., 2017; Hullman et al., 2022; Orrù et al., 2020; Rocca & Yarkoni, 2021; Tay et al., 2022; Van Lissa, 2022; Yarkoni & Westfall, 2017). For example, Liem et al. (2018) outlined different options for integrating machine learning into analysis pipelines in psychology, including directly substituting a machine-learning model for a traditional statistical model. However, such integration is typically not straightforward. Switching to a machine-learning model can affect decisions along the whole analysis pipeline, including the data preprocessing (Shmueli, 2010). In general, we found there to be a lack of work that directly addresses the compatibility of machine learning with characteristics that are typical of psychological data. In Table 1, we provide an overview of the potential solutions we have discussed and their remaining limitations. In this section, we reflect on our discussion of each challenge, followed by a broader discussion of future research possibilities.
A Summary of the Four Challenges Discussed That Emerge When Using Machine Learning to Analyze Psychological Data
Note: SEM = structural equation modeling; ML = machine learning; PCA = principal-components analysis; NMF = nonnegative matrix factorization; ICA = independent-component analysis; MI = multiple imputation.
In the traditional statistical approach, given the limited sample size in typical psychological data, the evaluation of sampling errors plays a central role in data analysis. On the other hand, although the machine-learning approach implicitly takes into account sampling errors to some extent in the cross-validation procedure, the explicit quantification of sampling errors is still rare. This is problematic—when researchers see a standardized prediction R2 of .3 in the test data, for example, the interpretation should be different depending on whether the results are based on a large or small sample size. Without the explicit quantification of sampling errors or a standardized evaluation of sample-size adequacy, objective interpretation of the results is difficult. We recommended bootstrapping as a flexible method for estimating the sampling distribution of a model performance metric or variable importance value. Bootstrapping can also enable the use of statistical tests (e.g., to see if the model performs significantly better than a baseline model; Dietterich, 1998) without needing to meet parametric assumptions.
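As one concrete possibility, a percentile bootstrap interval for a test-set metric can be computed by resampling the (observed, predicted) pairs; the following minimal sketch assumes y_true and y_pred are NumPy arrays from an already fitted model, and the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

def bootstrap_r2(y_true, y_pred, n_boot=2000, seed=0):
    """95% percentile bootstrap interval for test-set R-squared."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = [
        r2_score(y_true[idx], y_pred[idx])          # metric on each resample
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ]
    return np.percentile(stats, [2.5, 97.5])
```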
At the same time, more effort is needed to establish concrete guidelines for using such procedures to quantify sampling errors. For example, when using bootstrapping to assess the uncertainty of either the model’s predictions or the predictor-importance values, the question arises as to how the model’s hyperparameters are estimated before refitting it to each new sample. One possibility is to fix the hyperparameters before refitting (Michelucci & Venturini, 2021). This is a practical option, mostly for computational reasons but also because of the high likelihood of the model and the explanation it produces varying drastically across samples if hyperparameters are allowed to vary (e.g., see Molnar et al., 2020). For example, decision trees have been shown to change substantially in response to very small changes in the input data (Breiman, 2001a). Thus, if the goal is to understand the uncertainty of a particular model’s prediction, as opposed to the underlying data, it makes most sense to fix the model’s hyperparameters. However, little research has empirically examined the consequences of such decisions. By establishing guidelines through empirical studies, researchers should in the future be able to set more robust standards for evaluating the adequacy of a sample size in machine-learning analyses.
Measurement errors are difficult to deal with in traditional statistics but even more so in machine learning. We suggested the use of feature-compression methods, such as PCA (Bryant & Yarnold, 1995), NMF (D. Lee & Seung, 2000; Y.-X. Wang & Zhang, 2012), and ICA (T.-W. Lee & Lee, 1998), when data have a large number of predictors and the researcher can accept a potential loss of interpretability at the individual feature level. If the model’s inputs need to correspond to established theoretical components, we suggest exploring methods that incorporate the SEM structure into the machine-learning model, for example, regularized SEM (Brandmaier & Jacobucci, 2023; Jacobucci et al., 2016, 2019; Liang & Jacobucci, 2020) or SEM tree/forest algorithms (Brandmaier et al., 2013, 2016). However, if the model interpretation in relation to “true” predictor values is not important, measurement error is less of a concern, and it could even be exploited to improve the model’s predictions.
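In practice, such feature compression can be placed directly inside the supervised pipeline; the following minimal scikit-learn sketch is illustrative, and the number of components would need to be tuned (e.g., as a hyperparameter).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Standardize the items, compress them to 10 components, then classify.
# Fitting this pipeline inside cross-validation keeps the PCA loadings
# learned from the training folds only.
pipe = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression())
```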
Despite the difficulties, future studies in machine learning could think about measurement error more seriously and discuss a way to incorporate psychometric-measurement models into the machine-learning pipeline in a more generalized manner. As noted earlier, practical usage of SEMs relies mainly on a very specific definition of measurement errors, but perhaps the implementation of SEMs in machine learning would be an important first step. There have already been great efforts on this front, as noted earlier (e.g., Brandmaier et al., 2013, 2016; Brandmaier & Jacobucci, 2023; Jacobucci et al., 2016, 2019; X. Li & Jacobucci, 2022), but the methods are generally specific to a particular machine-learning model and thus cannot be easily substituted into a generic machine-learning pipeline. Future studies could benefit tremendously from the development of a generalizable SEM framework for machine-learning methods and empirical studies to assess the effect of the measurement error commonly found in psychology on feature selection and importance estimates (e.g., as in medicine; Luijken et al., 2019).
For handling nonindependent data, for example, data that are hierarchically clustered, we suggested two approaches. The first involves including the upper-unit labels as predictors in the model (the clusters-as-features approach), and the second is to use a machine-learning algorithm that incorporates random effects in the model (e.g., Calhoun et al., 2021; Mandel et al., 2023; Salditt et al., 2023). These two potential solutions roughly correspond to the fixed-effects and random-effects models for addressing clustered data in the psychometrics literature (Gardiner et al., 2009; McNeish & Kelley, 2019). The clusters-as-features approach can in principle be used with any type of machine-learning model. However, this approach is limited when predicting information about new clusters that do not appear in the training data. Random-effects models do not suffer from the same issue, but they are typically available only for specific types of machine-learning models, and the software availability is relatively limited in Python (e.g., in Scikit-learn; Pedregosa et al., 2011a), the programming language most commonly used for machine learning (Hao & Ho, 2019).
It is important to highlight that on top of these analytic options, a clustered data structure requires researchers to think deeply about what question they want to answer in their application of machine-learning models. For example, the way to perform cross-validation (e.g., using within-clusters or between-clusters splits) depends on whether researchers want to generalize the results to a new cluster. Researchers should also decide whether to center predictor variables or include aggregated predictors (i.e., predictors averaged at the upper unit) based on the goal of the analysis. When making such decisions, researchers should also be aware of the risk of data/information leakage artificially inflating performance. Research in traditional statistics has long discussed these conceptual issues for clustered data (Raudenbush & Bryk, 2002), and these discussions should help researchers use machine learning on clustered data.
In the traditional statistical approach, there are two “gold standards” to handle missing data—full-information-maximum-likelihood and multiple-imputation methods. These methods have been shown to obtain unbiased estimates and correct standard errors under certain assumptions (Enders, 2025). Neither of these methods is popular in the applied-machine-learning community. In contexts in which the ratio of available data to missing data is high, machine-learning researchers may simply throw away data without considering the possible mechanisms for missingness and the potential bias introduced (Nijman et al., 2022). Otherwise, missing data are typically handled by machine-learning models that can handle missing data as inputs (e.g., tree-based methods) or pipelines that use single-imputation methods before fitting the machine-learning model (Nijman et al., 2022). Although these offer sensible, practical solutions, there has been limited research that has evaluated the performance, uncertainty, and potential bias introduced by these commonly used strategies to handle missing data.
One clear future direction is to consider how best to implement multiple-imputation methods into the machine-learning pipeline. This method is not yet very popular in the applied-machine-learning community (Nijman et al., 2022), but it has the advantage of (a) producing complete data sets that can be used by any machine-learning method afterward and (b) taking into account the uncertainty of missing data, which other common methods in machine learning do not consider. As noted earlier, there are some practical complications (e.g., see Gunn et al., 2023), and more work would be needed to establish a robust standard that is computationally viable and also prevents information leakage (Thomas et al., 2020).
Note that these four challenges (limited sample size, measurement error, clustered data, and missing data) are not completely independent. One common feature among these challenges is that they all relate to the fundamental uncertainty inherent in psychological data. Psychological data are full of uncertainty because the data are usually limited in amount (limited sample size), contaminated by large amounts of noise (measurement errors), and include frequent missingness for reasons over which researchers often have no control (missing data). The nature of this uncertainty is further complicated by the fact that psychological data are often clustered. Specifically, all four problems primarily magnify epistemic uncertainty (variability that exists because there is too little or imperfect information) because they deprive the analyst of information. This is, in theory, separable from aleatoric uncertainty, which is the inherent randomness in the data-generating process, for example, in human behavior (Hüllermeier & Waegeman, 2021). As reviewed in this article, machine learning has developed various excellent strategies to deal with such uncertainty even if it is not always quantified, for example, with missing data or sampling error. Still, many of these strategies are not yet attuned to the type and scale of uncertainty in psychological data. We hope this article sparks inspiration and momentum for future research endeavors in this dynamic field.
The four challenges discussed in this article do not constitute the whole landscape of issues that arise when conducting machine-learning analyses in psychology. We chose to focus on these challenges because they are closely related to the common characteristics of psychological data and have been heavily discussed in the traditional statistical approach. However, there are certainly other topics at the intersection of machine learning and psychology that are worth drawing attention to. For example, longitudinal data with a small number of time points are commonly observed in psychological research as panel data, yet it remains unclear how best to analyze such data using machine learning (for possible directions, see Karch et al., 2020; Lavelle-Hill et al., 2024; Lim et al., 2021; Ovchinnik et al., 2022; Sarkar & De Bruyn, 2021). Another example is how and whether it is necessary to adapt machine-learning methods for ordinal data (e.g., Likert scales; Cheng et al., 2008; W. Chu & Keerthi, 2007), as has been done in traditional statistics (e.g., Bürkner & Vuorre, 2019). Furthermore, in machine learning, categorical variables are typically coded as dummy variables (also called “one-of-K” or “one-hot” encoding; Pedregosa et al., 2011a), but this procedure has been shown to cause problems for feature selection and prediction (e.g., for lasso; Huang et al., 2023). Rather than trying to adapt machine-learning methods to fit psychological data, perhaps psychologists should more carefully consider how to design studies that generate data better suited to the goal of prediction, following in the footsteps of the emerging discipline of data-centric AI (Zha et al., 2023). As the various open methodological questions posed in this article indicate, the different data types and research goals in psychology present distinct challenges for machine learning. At the same time, this also means that there are new opportunities for methodological contributions at this intersection of the two fields.
At a more general level, finding good, reliable solutions to these challenges hinges heavily on researchers valuing transparency, reproducibility, and replicability. Although research design in psychology has been shaped by these concerns for more than a decade (Open Science Collaboration, 2015), the spotlight has only comparatively recently been placed on the machine-learning community (Bell & Kampman, 2021; Gibney, 2022; Hofman et al., 2021). Because the challenges we identified in this article have typically not been discussed to the same extent in the machine-learning community, it may be tempting for applied researchers to ignore them and report the results that just “worked.” However, as detailed in this article, such an approach can, in many cases, produce inaccurate performance estimates (often inflated because of overfitting or information leakage) and/or biased estimates of predictor importance (for more discussion on biases to predictor importance, see Lavelle-Hill et al., 2025). As discussed, psychological data contain large amounts of uncertainty (e.g., because of missingness, measurement error, and limited samples). Without adequate knowledge of how to deal with this uncertainty using machine-learning models and little consensus on how best to evaluate and communicate it, studies using machine learning on psychological data are at risk of adding to the replication problem (Hullman et al., 2022). To obtain replicable findings, we believe that tackling the challenges discussed in this article and establishing common standards is key for research in psychology using machine learning. Continued exchange of methods and tools at the intersection of the two fields could also help the situation. Preregistration and registered reports, practices increasingly common in psychology, should also be useful instruments for increasing replicability and transparency in machine-learning research. However, there are limited guidelines on how to preregister studies using machine learning (but see the workshop at NeurIPS, 2021, and Hofman et al., 2023; Jankowsky et al., 2022). Conversely, the applied-machine-learning tradition of using well-annotated, structured, and tracked code repositories such as GitHub to make code open source and runnable in virtual environments is something that psychology might take more note of. The cross-field exchange of ideas and tools for reproducibility (e.g., Bell et al., 2022; Kidwell et al., 2016; Peikert et al., 2021; Rocca & Yarkoni, 2021; Van Lissa et al., 2021) could further help research at the intersection of the two fields to establish itself as a rigorous and reputable research area.
Acknowledgements
We thank the editor, Rogier A. Kievit; the three anonymous reviewers; and the named reviewer, Caspar van Lissa, for their thoughtful feedback and comments that helped improve the article. We also acknowledge Mirka Henninger, Hannah Deininger, Babette Bühler, and Tim Brailsford for their valuable feedback on earlier drafts.
Transparency
Action Editor: Rogier Kievit
Editor: David A. Sbarra
Author Contributions
