Abstract
Machine-learning (ML) algorithms are being rapidly incorporated into the work of psychologists given their capability and flexibility in analyzing large-scale, complex, or otherwise messy data sets. In this context and in the spirit of open science, ML research should be conducted in a transparent, understandable, and ethical manner. However, publications by psychology researchers and practitioners show a troubling lack of consistency in reporting ML information. Given that ML offers a wide range of analytical options, in this article, we address an important need by providing a comprehensive, open-science checklist that specifies the information researchers should disclose at each stage of a supervised-ML project—from data collection and preprocessing to model selection, evaluation, interpretation, and code sharing. We hope that psychological researchers will benefit from this checklist when reporting ML results and will adapt and extend this checklist further in the future.
Driven by the promise and confluence of big data and advances in computational power, many in psychological science anticipate that applying machine-learning (ML) algorithms will enhance, if not revolutionize, key areas of research and practice. Likewise, researchers and practitioners expect that ML tools can mine big data to capitalize on complex relationships that could not be discovered before. One broad distinction divides ML algorithms into two types (see Alloghani et al., 2020). “Supervised”-learning algorithms share the same purpose as linear regression: making predictions. These algorithms are trained on a large set of predictors (i.e., “inputs” or “features” in ML terminology) and a known criterion variable (i.e., a “labeled” variable or “output” variable). Once the algorithm is trained and models the relationship between the predictors and the criterion, it is then applied to external data sets of a similar nature in which the criterion data are not yet observed or are otherwise unseen. “Unsupervised”-learning algorithms instead find hidden structure in a data set, for example, determining how many reliable consumer profiles exist in a large data set of consumer purchases. In the current article, we focus on supervised-ML algorithms. Although the use of big data and ML has surged in various research disciplines, their integration is still developing in psychological-research and graduate-training curricula across areas such as personality psychology (Bleidorn & Hopwood, 2019), social psychology (e.g., Sheetal et al., 2020), and industrial-organizational psychology (Tonidandel et al., 2018).
Compared with commonly used analytic methods in psychology research, such as analysis of variance (ANOVA) or multiple linear regression, the application of supervised-ML algorithms requires a significantly higher degree of decision-making, especially in the model-development and -evaluation stages. For instance, two researchers analyzing the same data set using ordinary least squares (OLS) regression are expected to obtain identical results regardless of the statistical software used (e.g., the regression coefficients, standard errors, R², and adjusted R² for the model should all be the same). By contrast, if these two researchers are tasked with fitting the same data set using random forests, a supervised-ML method, it is highly likely that they would yield divergent outcomes (e.g., different mean squared errors of prediction, different assessments of how important each variable is for making predictions in the model). This variation arises in part from the numerous critical decisions involved in developing the algorithm. Changing even one of these decisions (e.g., the number of variables considered at each node of a tree during hyperparameter tuning) could lead to vastly different results when replicating findings or applying the method more generally. Therefore, we apply recent guidance promoting methodological transparency in psychological research (e.g., Aguinis et al., 2018; Weston et al., 2019) to the unique and important decisions involved in implementing supervised-ML algorithms, which are still new to psychological research. The current understanding of these algorithms by researchers and readers is uneven at best. Therefore, more consistent and transparent (if not standardized) approaches to ML reporting are crucial for advancing open science in psychology.
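To make this concrete, consider the following minimal sketch (in Python with scikit-learn, on simulated data, purely for illustration): OLS returns identical results on every run, whereas two random forests that differ only in a random seed and in one hyperparameter yield different prediction errors.

# Illustrative sketch: same simulated data, two defensible random-forest
# setups, two different results -- whereas OLS is deterministic.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# OLS: any researcher running this obtains identical coefficients and error.
ols = LinearRegression().fit(X_train, y_train)
print("OLS test MSE:", mean_squared_error(y_test, ols.predict(X_test)))

# Two random forests differing only in the seed and in max_features
# (the number of predictors considered at each split of each tree).
for seed, max_feat in [(1, "sqrt"), (2, 0.5)]:
    rf = RandomForestRegressor(n_estimators=200, max_features=max_feat,
                               random_state=seed).fit(X_train, y_train)
    print("RF seed", seed, "max_features", max_feat, "test MSE:",
          mean_squared_error(y_test, rf.predict(X_test)))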
To assess the current state of ML reporting, we conducted a literature search across 11 selected psychology journals covering general, social, personality, and methodological psychology. The search spanned from 1965 to 2024 and yielded 106 review/theoretical and empirical articles relevant to ML techniques (see Supplemental Material 1 in the Supplemental Material available online). Of the 76 empirical studies, 33 used supervised learning, 15 used unsupervised learning, and the remaining 28 used ML for data preprocessing (e.g., extracting data) rather than as their main analytic method. Among the 33 empirical studies applying supervised-ML techniques, there is an extreme lack of consistency in reporting the key information necessary for fully understanding the ML algorithms applied and their findings. Moreover, although we identified good examples of reporting particular pieces of information across multiple studies, very few studies reported adequate information for replication, as we detail below. Based on these findings, we provide a checklist critical for the transparency and replication of supervised-ML research.
The overall purpose of the current article is fourfold. First, informed by recent practices and publications involving ML in psychology, we develop and offer a set of reporting guidelines for future researchers applying supervised-ML techniques. The guidelines outline the key steps to report when conducting ML research, including how to frame supervised-ML-based research questions, how to document key data-preprocessing procedures, what to report when selecting among algorithms, and how to report and interpret the results from supervised-ML algorithms. These recommendations aim to enhance the comprehension, transparency, and accessibility of supervised-ML methods in both research and practical contexts. In this article, we offer guidelines for comprehensive reporting that attempt to strike a balance. That is, we seek to describe the general key steps and decisions involved in using supervised ML in psychological research in the interest of establishing a shared understanding of how researchers may report ML findings transparently. This information is not intended to prescribe the specific ML reporting practices appropriate to one’s local research setting; often, such specific details will depend on the domain, data set, prediction problem, and ML algorithms available at the time.
Second, the guidelines from our article also contribute to open science by increasing transparency and replicability in psychological research that increasingly incorporates supervised ML. Scientific progress depends on both innovation and replication: “Innovation points out paths that are possible; replication points out paths that are likely” (Open Science Collaboration, 2015, p. 7). Given that applications of supervised-ML techniques involve many critical decisions, such as data cleaning, model selection, and model-evaluation metrics, inadequate reporting can obscure understanding and hinder serious replication attempts (Vollmer et al., 2020). The reporting checklist provided in the current article can help mitigate such risks. Moreover, if psychological scholars and reviewers rely on (and in the future provide input to) these supervised-ML reporting guidelines and the guidelines are generally acceptable to the broader research community, then psychological researchers are more likely to communicate effectively with and contribute to multidisciplinary projects involving supervised-ML research.
Third, in the spirit of open science, our checklist is designed to reduce new and emerging forms of questionable research practices (QRPs) in supervised-ML studies in psychology. Whereas traditional QRPs (e.g., p-hacking, post hoc theorizing) rely on null hypothesis significance testing (i.e., p values), ML research is evaluated mainly through predictive-performance metrics. Researchers may instead justify feature choices or hyperparameter tweaks after the fact, cycle through countless algorithms and random seeds until a “best” score emerges, or omit key baselines that reveal the true improvement of an ML model (e.g., reporting 64% retention accuracy without noting that always predicting “stay” yields 63%). Rigorous documentation of data preprocessing, model-selection pathways, hyperparameter tuning, and baseline comparisons is therefore critical to reduce the publication of irreproducible findings, the provision of recommendations to practitioners based on uncertain evidence, and the diversion of limited resources toward unproductive avenues.
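To illustrate the baseline point, the hypothetical sketch below (Python with scikit-learn, simulated retention data; the 64%/63% figures above are not reproduced exactly) shows how reporting a majority-class baseline alongside the model reveals the true improvement.

# Illustrative sketch: always report a naive baseline alongside the ML model.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated, imbalanced "retention" data: roughly 63% of employees stay.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.63],
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(random_state=7).fit(X_train, y_train)

# Reporting only the model's accuracy, without the baseline, overstates the gain.
print("Always predict the majority class:", baseline.score(X_test, y_test))
print("Random forest:                    ", model.score(X_test, y_test))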
Fourth, our checklist also seeks to enhance replicability in cases in which researchers have high degrees of freedom. “Researcher degrees of freedom” refers to defensible choices among multiple empirically supported options (Manapat et al., 2024). For example, in supervised ML, choosing among several cross-validation techniques falls under researcher degrees of freedom. When such decisions are poorly documented, however, this constitutes a QRP (Wigboldus & Dotsch, 2016), ranging from incomplete reporting of technical details to outright misrepresentation (e.g., cherry-picking from the research options that were actually explored). Questionable reporting is likely to contribute to irreproducible findings. Although no guidelines can guard against bad actors, in the current article, we promote transparency in conducting supervised-ML analyses and reporting procedures, such as clearly specifying all key aspects of the predictive models used and detailing the methods researchers may use to improve model performance.
The need for distinct guidelines to enhance reporting transparency for supervised ML arises primarily from the unique steps inherent to data preprocessing, model development, and model evaluation, which diverge from more conventional methods of modeling and prediction (e.g., ANOVA, regression, and structural equation modeling). The absence of transparency in these specific steps can result in difficulties when researchers attempt to replicate supervised-ML results. In addition, given the novelty and varying familiarity with supervised-ML techniques by psychological researchers, specific reporting guidelines not only establish more consistent reporting practices but also serve as an educational tool to meet the evolving needs of psychological science.
A Brief Introduction to Supervised ML in the Psychological Literature
ML is a subfield of computer science that aims to construct computer algorithms that can learn and improve with experience automatically (T. Mitchell, 1997). The distinction between ML algorithms and traditional statistics in psychological research can be somewhat fuzzy. For example, some of the traditional statistical methods widely used in psychological research (e.g., logistic regression) can be considered components of certain ML algorithms. Typically, when researchers apply ML techniques, they often encounter settings with a large number of predictors (e.g., Sheetal et al., 2020) and/or situations in which the number of predictors (k) exceeds the number of observations (n; Oswald et al., 2020). In these contexts, researchers often incorporate procedures such as cross-validation and regularization to reduce overfitting and improve the predictive accuracy of supervised-ML techniques. For explanations of ML-related terminology, see Table 1.
Explanations of Machine-Learning-Related Terminology
Supervised ML involves statistically learning a set of mapping rules from a group of predictors (i.e., features or inputs) to a criterion (i.e., labels or outputs). Once trained, the supervised-ML model can be applied to new data sets with similar structures in which the criterion values are unknown or unobserved. Supervised ML has numerous applications in people’s daily lives, such as predicting mental illness (e.g., Jiang et al., 2020), inferring personality traits (e.g., Alexander et al., 2020), and detecting faking in psychological assessments (e.g., Calanna et al., 2020). These algorithms are typically categorized into two main types: regression (for continuous outcomes) and classification (for categorical outcomes).
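As a minimal sketch of this train-then-apply logic (Python with scikit-learn; the toy data and variable meanings are hypothetical), a classification model is fit on labeled cases and then used to predict the criterion for new, unlabeled cases:

# Illustrative sketch of the supervised-ML workflow: learn a mapping from
# features (inputs) to labels (outputs), then predict labels for unseen cases.
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled training data: e.g., satisfaction and tenure as
# predictors, turnover (1 = left, 0 = stayed) as the known criterion.
X_train = [[3.2, 1], [4.5, 0], [2.1, 1], [4.9, 0]]
y_train = [1, 0, 1, 0]

clf = LogisticRegression().fit(X_train, y_train)

# New cases in which the criterion is not yet observed.
X_new = [[3.0, 1], [4.7, 0]]
print(clf.predict(X_new))        # predicted class labels
print(clf.predict_proba(X_new))  # predicted class probabilities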
To understand how supervised-ML algorithms are used in psychology research, we searched 11 psychology journals (publication dates from 1965 to September 2024) using relevant keyword stems combined with truncation symbols (i.e., machine learn*, big data, predictive model*, data min*, text min*, natural language process*; for the complete list, see Supplemental Material 1 in the Supplemental Material).
Studies integrating ML are growing rapidly in psychological research: Only two articles appeared before 2014, whereas 24 appeared in 2023 alone. Of the 106 total articles identified, 33 used supervised learning. The search results show that, as a research community, psychologists are not merely discussing ML abstractly; they are now increasingly using these tools in their work. In these empirical studies, psychologists have applied ML techniques to answer a wide variety of questions. For example, can ML predict personality-assessment responses to unseen items, thereby overcoming the limitations of traditional surveys (Abdurahman et al., 2024)? What are the novel predictors (among more than 900 variables) of people’s willingness to justify unethical behaviors (Sheetal et al., 2020)? Studies like these shed light on how ML techniques can be applied to enhance the understanding of complex psychological phenomena.
However, reporting practices for ML procedures and results vary widely. The supervised-ML empirical studies we identified rarely provide the essential information required for future researchers to understand, replicate, and extend the research findings. For instance, many studies did not report the details of their data-preprocessing or cross-validation procedures. In addition, most supervised-ML empirical studies reported their results inconsistently. For instance, two studies both used supervised ML to predict mental-health-diagnosis results; however, one reported accuracy (i.e., the proportion of correct predictions out of all predictions), and the other reported precision (i.e., the proportion of correct positive predictions out of all positive predictions). This makes it impossible for readers to directly compare the performance of the two supervised-ML models.
Admittedly, this situation is entirely understandable given that the application of supervised-ML techniques in psychological research is still in its infancy, a situation common to many advanced techniques when they are first introduced to the field (e.g., structural equation modeling, multilevel analysis). However, this situation not only introduces barriers to understanding research findings but also inhibits the scientific contribution and replication of those findings. Failing to address this issue hinders the psychology community’s understanding of the benefits, drawbacks, and broad applicability of supervised-ML techniques in the long run.
In the current work, we take an initial step to mitigate the inconsistency of reporting supervised-ML study procedures and results, recognizing that most ML studies in psychological research to date have used supervised ML. 1 We conducted a systematic review of articles across different disciplines about how to report ML results. Based on our literature review, we summarize the decision-making for each step when conducting psychological research using supervised ML and specify the key pieces of information required when reporting a supervised-ML study. We also list examples that report the key information. The examples provided in this article primarily focus on general, social, and personality psychology, reflecting the authors’ collective expertise in these areas. However, we acknowledge that supervised ML is also widely used in many other subfields of psychology, such as cognitive neuroscience and neuropsychology. For a broader overview, see Supplemental Material 2 in the Supplemental Material, which includes a table of example articles from these additional domains. The methodological challenges highlighted in the current study are also relevant and applicable across these subfields.
Method
Following the literature-search procedure recommended by Harari et al. (2020), we conducted a literature search on two electronic databases (i.e., Google Scholar and Web of Science) using two lists of keywords. The first list of keywords is related to ML, including “machine learn*,” “big data,” “predictive model*,” “data min*,” “text min*,” and “natural language process*.” The second list includes “guideline” and “best practice.” 2 The search yielded a total of 3,661 articles. A trained graduate research assistant read the titles and abstracts of these articles based on specific criteria. To qualify, an article had to (a) be written in English, (b) be available through the university library or interlibrary loan or made publicly accessible by the author(s), (c) originate from a peer-reviewed journal, and (d) not be an empirical study applying ML to solve one specific problem, because our goal was to summarize methodological-guidance articles. This resulted in 51 articles for full-text screening.
H. Min and F. Guo independently read 10 articles and coded them into three categories. Category 1 includes articles that provide guidelines specifically on supervised-ML reporting regardless of discipline. Category 2 includes articles that provide guidelines on specific aspects of applying ML and are relevant to psychological research, for example, applying ML to analyze longitudinal data (Sheetal et al., 2022) or ML-algorithm configuration (Eggensperger et al., 2019). Category 3 includes articles whose scope was too narrow (e.g., empirical studies applying ML), too broad (e.g., general introductions to artificial intelligence across multiple disciplines; Xu et al., 2021), or not relevant (e.g., best practices for species-distribution models; Robinson et al., 2017). The interrater agreement of the two coders was 90%; they discussed their rating discrepancies until a consensus was reached. The two coders then coded the remaining articles independently. In the full-text review, we identified 10 articles in Category 1 (highlighted in the references using asterisks) and 24 articles in Category 2 (Table 2). Backward searches identified one additional article for Category 1 and two additional articles for Category 2. For a flowchart of the literature-review process, see Figure 1.
Summary of Guideline Articles in Other Aspects of Machine Learning

Flowchart of systematic literature review.
Results
To synthesize insights from the literature, we focused our analysis on the 11 articles identified in Category 1, which provide direct guidance on supervised-ML reporting; they come from multiple disciplines, including biomedicine, pathology, health care, orthopedics, education, and chemistry. We summarized the recommended items from these studies and adapted them to apply to psychological research. Based on these items, we developed a structured checklist for reporting supervised-ML applications in psychological studies (see Table 3). The checklist is organized into five main sections, each outlining key components and recommended practices. Items marked with an asterisk in Table 3 indicate that reporting these items is strongly recommended to increase research transparency and reproducibility.
Reporting Checklist When Using Supervised-Machine-Learning Techniques
Note: Asterisks indicate items whose reporting is strongly recommended. ML = machine learning.
Introduction section
Research questions
When authors write an article about their research, they typically outline their hypotheses and/or research questions, study design, data collection, and statistical analyses (Brownstein et al., 2019; W. Luo et al., 2016). The use of supervised-ML techniques is different in many respects: Whether variables in a large data set can predict an outcome of interest using ML techniques may be based on a theory, but it is more likely to be based on a rationale or even just a hope (e.g., what novel antecedents will predict unethical behavior).
Hofman et al. (2021) proposed a framework for categorizing research activities along two dimensions: the extent to which researchers aim to identify and estimate causal effects and the extent to which researchers focus on accurately predicting outcomes. Based on these dimensions, the research framework includes four quadrants: descriptive modeling (neither causal nor predictive), explanatory modeling (focused on causal inference), predictive modeling (focused on forecasting outcomes), and integrative modeling (focused on both causal explanation and predictive accuracy). Generally, psychologists use supervised ML to explore (mine) the data and detect complex relationships wherever they exist and hold up under cross-validation, which falls under the predictive-modeling quadrant. Supervised ML can also be used to answer research questions in other quadrants, such as to improve matching quality and further facilitate causal inferences (explanatory modeling; e.g., Liu et al., 2021) or to predict how patients’ mental and physical health will change when one characteristic of a substance-abuse intervention is altered (integrative modeling). Hofman et al. suggested that researchers clarify which quadrant their research questions fall into.
Given the flexibility of supervised ML in addressing a wide range of questions, we recommend that researchers explicitly state their research questions and document all necessary details throughout the research process to promote transparency and reproducibility by third parties when data can be shared.
Rationale for using supervised-ML techniques
When authors articulate their research questions, they also need to explain why supervised-ML techniques were used over traditional statistical methods 3 (W. Luo et al., 2016). This rationale typically centers on characteristics such as the need to handle high-dimensional predictor sets, detect complex nonlinearities and interactions, accommodate weaker distributional assumptions, or prioritize predictive accuracy over parameter estimation (Chapman et al., 2016; Hastie et al., 2009; Jordan & Mitchell, 2015). For example, when predicting turnover intentions, traditional statistical methods (e.g., OLS regression) provide interpretable results, whereas supervised-ML models (e.g., deep neural networks) may provide more accurate predictions but sacrifice interpretability. Thus, there may be a trade-off. When appropriate, researchers should conduct and report traditional methods as an analytical baseline against supervised-ML techniques in terms of predictive accuracy (Artrith et al., 2021; Vollmer et al., 2020). This practice is recommended not only to compare their predictive power but also because even in relatively simple and interpretable ML models, such as regularized regression (e.g., lasso or elastic net), interpreting coefficients can still present challenges. In the service of prediction, regularized regression coefficients are shrunk toward zero in hopes of improving predictive performance in an independent sample (Hastie et al., 2009); this will, of course, be more likely to the extent that future samples come from the same population as the original sample on which the ML model was trained and tested. Sometimes, traditional statistical methods are simply not feasible, such as when the number of variables far exceeds the number of cases, which may, in fact, motivate a researcher’s decision to conduct a supervised-ML analysis in the first place. In these cases, interpretable-ML models that provide variable coefficients (e.g., lasso or elastic net) are recommended as a baseline against more complex supervised-ML models. 4
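A baseline comparison of this kind might be implemented as in the following sketch (scikit-learn on simulated data, not any particular study’s analysis), which fits cross-validated OLS and lasso side by side and shows how the lasso shrinks coefficients to exactly zero:

# Illustrative sketch: report a traditional/interpretable baseline (OLS, lasso)
# alongside more complex supervised-ML models.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=20.0, random_state=0)

ols = LinearRegression()
lasso = LassoCV(cv=5, random_state=0)  # penalty strength chosen by cross-validation

for name, model in [("OLS", ols), ("Lasso", lasso)]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean cross-validated R^2:", r2.mean())

# The lasso shrinks many coefficients exactly to zero, aiding interpretation.
lasso.fit(X, y)
print("Nonzero lasso coefficients:", np.sum(lasso.coef_ != 0), "of", X.shape[1])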
An example of explaining the rationale for using supervised ML can be found in Joel et al. (2020), in which ML was used to investigate relationship quality. Note that their discussion addressed both the general rationale for adopting supervised ML and the specific justification for selecting random forests as the algorithm: Each dataset was analyzed using Random Forests (24), a machine-learning method designed to handle many predictors at once while minimizing overfitting (i.e., fitting a model so tightly to a particular dataset that it will not replicate in other datasets). The Random Forests method builds on classification and regression trees (25). Specifically, using a random subset of predictors and participants, the Random Forests method tests the strength of each available predictor one at a time through a process called recursive partitioning. It builds a decision tree out of the strongest available predictors and tests the tree’s overall predictive power on a subset of data that were not used to construct the tree (also called the “out of bag” sample). The Random Forests method does this repeatedly, separately bootstrapping thousands of decision trees and then averaging them together. Results reveal how much variance in the dependent measure was predictable and which predictors made the largest contributions to the model. Random Forests are nonparametric—they do not impose a particular structure on the data—and as such they are able to capture nonlinear relationships, including interactions among the predictors (26). For example, a model with actor- and partner-reported predictors would detect any robust actor × partner interactions (e.g., moderation, attenuation effects, matching effects) that could not be captured in a model featuring actor- or partner-reported predictors alone. (Joel et al., 2020, p. 19063)
Researchers should articulate why they chose to use supervised-ML techniques to answer their research questions, and when feasible, researchers should compare the application and results of those techniques with those of traditional (or more interpretable) statistical techniques. Overall, we argue that the use of supervised-ML algorithms should be explicitly justified in the context of the research question(s) and data rather than relying solely on the novelty of the ML methods.
Framing research questions as a supervised-ML task
Apart from specifying the research questions and providing the rationale for using supervised-ML techniques, authors also need to translate or operationalize their research questions in terms of specific ML tasks (W. Luo et al., 2016; Moons et al., 2015). For example, the research question of predicting employee turnover (a binary outcome) from job satisfaction (a continuous predictor) would be considered a classification problem in supervised ML. Understanding how this computer-science terminology maps onto the terminology used in psychology for the same purpose (e.g., logistic regression in this case) facilitates cross-disciplinary collaboration (König et al., 2020). For example, for a common organizational problem of job-role assignments, Varshney et al. (2014) phrased their research question in computer-science terms: We approach the predictive modeling problem as a classification problem. Since we have veracious data on a reasonable fraction of employees’ job roles and job role specialties, we can further formulate the problem via supervised multi-category classification, using the veracious fraction of employees’ data as a training set. (p. 1730)
We recommend that when the goal is cross-disciplinary collaboration, psychology researchers should translate their research questions into ML tasks right from the outset, in the introduction section. For instance, if supervised ML is used, is it a regression or classification task? What variables are predictors (i.e., inputs), and what variable is the outcome (i.e., output)?
Selecting and justifying the supervised-ML models used
Once the authors phrase their research question as a supervised-ML prediction problem, the next step is to report the specific supervised-ML algorithm being used and explain why it was selected (e.g., its benefits, any limitations, and why it might have been preferred over other ML algorithms). That is, researchers can begin to evaluate the types of ML models that are suitable given the nature of the defined task and the data set(s) employed.
Some traditional statistical methods may suggest only one obvious analysis for a research question, such as a linear regression analysis when multiple predictors predict a continuous outcome of interest or a t test when two group means are being compared on a single continuous variable. In contrast, other widely used analytic approaches may require specifying multiple models and formally comparing them before researchers determine which model best fits the data, such as models with different numbers of latent factors in confirmatory factor analysis or full versus partial mediation models in structural equation modeling.
In a somewhat analogous manner, there is a wide range of ML algorithms to choose from when conducting supervised ML (e.g., more than 100 are listed at https://topepo.github.io/caret/available-models.html). Therefore, researchers typically do not test and compare all supervised-ML models; however, comparing at least a few makes good sense. For instance, when predicting turnover, one could compare cross-validated logistic regression with regularization, support vector machines, tree-based models (e.g., random forests or gradient boosting), neural networks, and many other variants or alternative supervised-ML models. When selecting the set of ML models to answer a specific research question, several factors may affect researchers’ decision-making process. Researchers may begin by identifying categories of supervised-ML models that allow for exploring different advantages and limitations when setting up the prediction problem. For instance, does the researcher seek to understand the nature of relationships, and if so, are they linear or nonlinear (e.g., additive or multiplicative, respectively)? Or is the prediction of outcomes the sole goal? As discussed above, if the primary goal is to gain insights into underlying relationships, supervised ML using more explainable regression models is more suitable, such as cross-validated OLS, lasso, and logistic regression. Conversely, when the primary goal is accurate prediction rather than relationship interpretation, more complex supervised-ML methods may be appropriate, such as tree-based models (e.g., random forests, extreme gradient boosting), support vector machines, and neural networks (Y. Luo et al., 2019).
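A model comparison of the kind just described might be sketched as follows (Python with scikit-learn on simulated data; the candidate set and settings are illustrative), evaluating several model families under identical cross-validation folds:

# Illustrative sketch: compare a few candidate model families under identical
# cross-validation before committing to one.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=25, random_state=42)

candidates = {
    "Regularized logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": SVC(),
    "Random forest": RandomForestClassifier(random_state=42),
    "Gradient boosting": GradientBoostingClassifier(random_state=42),
}

# Identical folds for every model make the comparison fair and reportable.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(name, "mean accuracy:", scores.mean(), "SD:", scores.std())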
In practice, researchers can select one or two specific supervised-ML models in each category to effectively examine these assumptions. If researchers find that a tree-based model performs much better predictively than regression-based models, for instance, this suggests that the relationships between the predictors and the outcome may be nonlinear in ways far more complex than the interaction or quadratic effects in regression models. Researchers should also ensure that, within each category, models adopting distinct approaches are selected. For instance, among tree-based models, researchers can select random forests, a bootstrap-aggregating (i.e., bagging) tree-based model, and extreme gradient boosting, a stagewise tree-based model. In addition, supervised ML allows for various forms of model “ensembling,” in which the final model is based on multiple predictions that are weighted and averaged either within or across different ML methods and data sets. If model ensembling is employed, it must be transparently reported.
Moreover, the nature and size of the data set also influence the selection of supervised-ML algorithms. For example, algorithms such as logistic regression, lasso, or elastic net are well suited for structured numeric data, whereas natural-language-processing techniques combined with models such as support vector machines or neural networks are more suitable for textual (word-based) data. In addition to the data type, the volume of data also influences the selection of the algorithm. When the data set is very large, algorithms such as stochastic gradient descent linear models and extreme gradient boosting may be preferred for their scalability and ability to handle high-dimensional data (Bottou, 2010; Chen & Guestrin, 2016).
Overall, researchers should base their model-selection decisions on their research questions, data type, data volume, and the strengths/weaknesses of different supervised-ML models (e.g., considering any trade-offs between ML interpretation and prediction). We strongly recommend that researchers report more transparently how they decided among the available supervised-ML models with respect to addressing the research question before they selected the ML algorithm(s) used. Such information is vital for future researchers not only to understand a given ML research project but also to compare and extend study findings and learn from these key methodological decisions (Artrith et al., 2021).
Varshney et al. (2014) provided a good example of explaining how their research question and data formats affected their choice of ML models and why some other similar or popular models were not chosen: For the skills analytics problem we are facing, the most direct and appropriate formulation is supervised multi-category classification. Moreover, due to the business structure, we learn separate classifiers for the different LOBs [line of businesses] within IBM because there are different valid class labels and different feature distributions among the different LOBs. Multi-task learning could be possible in this setting to do joint training for different LOBs, but we choose not to pursue this direction because it introduces unnecessary complexity. . . . We compare four one-against-the-rest multi-category classification algorithms: linear logistic regression with l2 and l1 regularization, linear support vector machine, and naïve Bayes. The regularization parameters for the first three models are found by cross-validation. (p. 1732)
In summary, we recommend authors (a) articulate the necessity and utility of supervised-ML techniques in addressing their research question, (b) briefly summarize prior methods applied to address the current problem (e.g., traditional statistical methods or other supervised-ML techniques), and (c) to the best of their knowledge, state the benefits of selecting their ML algorithms over other alternatives vis-à-vis the research question being investigated. As with traditional statistical methods, we advocate parsimony in ML algorithms, starting with simple algorithms for the research problem and data at hand and using a less parsimonious model only if the predictive benefit justifies the added complexity and any reduction in interpretability.
Sample section
After introducing the research question(s) and justifying the chosen ML methods, the next step is to describe the data sets used and the sample characteristics.
Data source
Given the abundance of data (both structured and unstructured) analyzed by supervised ML, psychologists need to clarify the data sources in their studies (W. Luo et al., 2016; Moons et al., 2015). This is particularly true because even though analytic toolboxes are expanding, some data selected by one researcher may be considered unfamiliar or novel to the general audience. Oswald et al. (2020) suggested that in the era of big data, researchers may consider incorporating some forms of “incidental data” (e.g., emails, video data, tweets) to address relevant research questions. Compared with the more “intentional data” that come from traditional measures common in the literature and familiar to most researchers (e.g., survey responses), there is often less knowledge or certainty regarding the nature of incidental data and therefore, its predictive value. Thus, researchers should provide detailed descriptions of where and how their data were obtained and what the data look like (e.g., human-resources information-system records, annual climate-survey archival data, social media posts; Winkler-Schwartz et al., 2019). For example, for a data source that relies on web scraping, the data-procurement process should be clearly stated. Detailed information should be provided on where and what data were scraped, the chosen time frame, and the inclusion and exclusion criteria. For instance, Min et al. (2021) provided specific information about how their data set was obtained: Our text data for emotion analysis were collected via the Twitter API. We queried all available Tweets related to WFH [work from home] (i.e., Tweets including the following seven keywords: “WFH”, “work from home”, “working from home”, “work remotely”, “working remotely”, “remote work”, and “remote working”) over a four months period (March 01, 2020 to July 01, 2020) by searching through the historical Tweets database. In total, we collected 1.56 million Tweets posted by 706,142 distinct Twitter users. (p. 219)
One’s data source directly affects whether the relevant constructs are measured and how predictive those data are of the outcomes (Vollmer et al., 2020). Geiger et al. (2021) pointed out that in supervised ML, the predictive models derived from labeled training data are “only as good as the quality of that (training) data” (p. 1; also see Farrow et al., 2021). This echoes the long-standing phrase of “garbage-in, garbage-out,” first introduced in the early days of computing to emphasize how flawed or low-quality input data inevitably lead to flawed outputs (Babbage, 1864; Mellin, 1957). In the ML context, the labeled training data set guides the ML algorithm’s learning process, ultimately producing a trained algorithm that can be applied to independent data sets (e.g., holdout data, external data sets).
Thus, it will increase transparency if researchers provide comprehensive information about how the training data are obtained and about the labeling task procedure(s) if the training data are human-labeled. This information helps readers and reviewers better understand the nature of the data, thereby enhancing the construct validity and generalizability of the trained ML model and facilitating the interpretation of results. Even when ML models are cross-validated on a given data set, there is no guarantee that they will generalize to other data sets. However, a detailed description of the original data does provide some insight into the model’s potential for generalization.
When multiple data sets are available to address a research question using supervised ML, psychologists should explain why they chose a particular one. The general principles regarding research decision-making in the Ethical Principles of Psychologists and Code of Conduct (American Psychological Association [APA], 2017) also apply to data selection. The general principles are beneficence and nonmaleficence, fidelity and responsibility, integrity, justice, and respect for people’s rights and dignity. Selecting a data set solely because it produces favorable results constitutes a form of outcome-driven bias that undermines scientific transparency and reproducibility (Simmons et al., 2011; Weston et al., 2019). Ethical decision-making further requires that the chosen data set be collected under appropriate consent and confidentiality protections (Barchard & Williams, 2008; Kaiser, 2009). Considerations of fairness are also essential. For example, a researcher investigating adolescent depression could choose between a small, locally recruited sample and a large, publicly available longitudinal data set. Which data set is chosen should depend on the researcher’s goals. A locally recruited sample may simply have been chosen out of convenience, without consideration of whether it represents the population to which the research results are supposed to generalize. Alternatively, the local data set may be highly relevant precisely because generalizability is intended for that local setting, with its own unique demographic features. A large, publicly available data set, by contrast, may come with higher statistical power and wider external validity, depending on the degree of correspondence between the sample and the population of interest. Overall, data-set selection should be motivated not only by methodological fit but also by explicit ethical considerations of factors such as transparency, fairness, and participant protection.
Finally, and most importantly, any ethical concerns associated with data reporting, such as privacy or legal issues, should be considered (Richards & King, 2014). For example, sharing verbatim quotes is often discouraged in studies that use social media data (e.g., tweets) because it allows other people to trace the quote back to the individual who posted it (Golder et al., 2017). Below is an example of authors reporting the ethical concerns in their data-collection process: No private data or non-public information is used in this work. For human annotation (Section 6.1), we recruited our annotators from the linguistics departments of local universities through public advertisement with a specified pay rate. All of our annotators are senior undergraduate students or graduate students in linguistic majors who took this annotation as a part-time job. We pay them 60 CNY an hour. The local minimum salary in the year 2023 is 25.3 CNY per hour for part-time jobs. The annotation does not involve any personally sensitive information. (Li et al., 2024, p. 44)
Stamatis et al. (2021) also reported details about their participants and data-collection procedures: Participants included patients with OCD [obsessive compulsive disorder] (n = 90) and healthy controls (n = 59). Demographic information is provided in Table 1, which also contains information on OCD and sensory phenomena symptoms for the patient group. Participants were recruited from treatment centers, self-help groups, and Internet advertisements. All participants provided informed consent, and procedures were approved by the institutional review board. Participants completed a computerized neuropsychological battery and structured clinical interviews about sociodemographic and clinical symptoms, with only the OCD group responding to OCD-related symptom measures. (p. 81)
In summary, we strongly recommend researchers (a) provide detailed descriptions of where, when, and how research data are obtained for supervised ML and clarify what the raw data are and show what they look like if necessary; (b) provide key information about the data set(s) used (e.g., source of the data, demographics); (c) explain the usefulness of “new” data sources that are unfamiliar to many readers; and (d) ensure compliance with privacy or legal regulations when reporting data sources and examples.
Sample characteristics
Once the raw data are specified, sample characteristics should be described. These characteristics include demographic information (e.g., race/ethnicity, gender, age) and basic descriptive statistics (e.g., sample size, variable means and standard deviations, prevalence and patterns of missing data; W. Luo et al., 2016; Moons et al., 2015). When appropriate, intercorrelations among key variables should also be reported. 5 Initial data exploration is a tedious but critical step before conducting any further statistical analyses. It facilitates a better understanding of the data and informs decision-making in other analytical steps (e.g., preprocessing, model choice). We strongly recommend incorporating data visualizations because they can be especially effective in identifying trends, patterns, and anomalies that may be overlooked by means, standard deviations, correlations, and other statistical summaries. Interested readers can refer to Tay et al. (2018) for more information on various visualization tools.
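A minimal sketch of such initial data exploration (Python with pandas and matplotlib; the file and column names are hypothetical) might look as follows:

# Illustrative sketch: basic descriptives, missingness, and visual checks
# before any supervised-ML modeling (hypothetical file and column names).
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical data file

print(df.describe())                             # means, SDs, ranges
print(df.isna().mean().sort_values())            # proportion missing per variable
print(df["group"].value_counts(normalize=True))  # check for class imbalance

# Correlations and distributions often reveal what summary tables hide.
print(df.corr(numeric_only=True).round(2))
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()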
In addition, depending on the specific supervised-ML task, relevant statistics should be reported. For example, in a classification task, the existence of imbalanced data (i.e., disproportionate sample sizes for each group) should be reported, as should any method for addressing this issue. For example, in a study using supervised ML to examine how workstation types influence the coupling of neural and vascular activities of the prefrontal cortex, Alyan et al. (2020) provided detailed information about the participants included in their experiment: Twenty-three healthy adult volunteers with no history of psychological illness, musculoskeletal problems, or substance dependence took part in this study (mean age 28.6 ± 3.4 years, males, right-handed, mean height 1.7 ± 0.038 m, BMI > 18 and < 25). The research was approved by the Medical Research Ethics Committee (MREC) of Universiti Kuala Lumpur Royal College of Medicine Perak (UniKL RCMP). All procedures were conducted following the approved regulations and guidelines. All subjects signed informed consent under the MREC approval stipulations. (p. 218912)
To summarize, for sample characteristics, we strongly recommend that researchers report demographic information and basic descriptive statistics, as is commonly done in traditional studies. In addition, we recommend that researchers use visualization tools to report important initial findings, such as intercorrelations and prevalence of missing data, and to illustrate overall data trends and patterns (e.g., Moons et al., 2015).
Procedural information
As with traditional statistical analyses, empirical studies involving supervised ML must include detailed documentation of the data-handling procedures and the implementation of ML algorithms. Arguably, the importance of this procedural transparency requirement is heightened in ML research given the magnitude and complexity of the data and the fact that ML techniques are still relatively new to psychologists. Below, we discuss common concepts and techniques of various data-preprocessing, ML-modeling, and evaluation techniques that are most relevant to psychological research.
Description of statistical software and platforms
As with any analytic method, researchers need to document and report the analytic software (and packages, if any), including version numbers, used at each step of supervised-ML research, such as data preprocessing, model training, model testing, and model evaluation (Artrith et al., 2021; Vollmer et al., 2020).
Data preprocessing
The primary objective of data preprocessing is to transform raw data into a form that can be consumed by supervised-ML models (W. Luo et al., 2016; Winkler-Schwartz et al., 2019). In other words, in most situations, directly feeding raw data into a supervised-ML model is neither feasible (e.g., because of data types and missingness) nor good practice (e.g., raw data that are not cleaned and otherwise transformed can lead to unreliable results; Chu et al., 2016). The “data wrangling” process is essential (i.e., cleaning, transforming, and otherwise modifying data into a usable format), not only providing deeper insights into the data but also helping refine the research question and subsequent application of supervised ML (Artrith et al., 2021; Braun et al., 2018).
First, in the data-transformation step (i.e., the actions taken to change the form of the data), the data set is expected to change in the number of variables (also called “features”). This is applicable in many scenarios, such as when the data format is not readily consumable by supervised-ML models (e.g., audio or text data), when one wants to derive new data fields from existing ones (e.g., extract country and city from a GPS variable) or to combine and aggregate existing variables (e.g., aggregate multiple daily sales transactions to form a monthly revenue variable), and when data types need to be altered to address the research question (e.g., converting between categorical and continuous variables). Below is one example of reporting data transformation described by Hickman et al. (2021): Participant responses were transcribed using IBM Watson Speech to Text . . . and their full interview response was combined into a single document. Then, we first used Linguistic Inquiry and Word Count (LIWC; . . . ) to quantify verbal behavior. We used all directly counted non-punctuation variables from LIWC, including word count. (p. 1331)
Second, in the data-cleaning step (i.e., the actions taken to “fix” irregularities in the data), the data shape can change in both its length and width (i.e., the number of rows and columns, respectively). Several factors affect researchers’ decisions regarding how data cleaning is conducted. For example, when cleaning cases (the lowest unit of analysis, such as people), it is recommended to consider outlying data points and cases, missing data, careless responders, and undersampling/oversampling (Artrith et al., 2021). For cleaning variables, particularly when interpretability is a key objective for supervised ML (not just prediction), one needs to consider variables that are low-frequency, low-variance, linearly dependent, highly correlated, and more (Langley & Sage, 1994; Wu et al., 2008). For instance, in an article examining the reliability and validity of ML models for analyzing video interviews, Hickman et al. (2021) reported their data-cleaning methods in processing verbal behavior in video interviews: Before extracting the words and phrases, we first removed all numbers and punctuation from the transcripts, removed common stop words, transformed all text to lowercase, handled negation by appending words preceded by “not”, “n’t”, “cannot”, “never”, and “no” with the negator and an underscore, and stemmed the corpus. We removed all one- and two-word phrases that did not occur in at least 2% of the interviews. (p. 1331)
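The following sketch illustrates text cleaning of the kind Hickman et al. (2021) describe; it is not their actual code, and the documents and thresholds are illustrative:

# Illustrative sketch of text cleaning: lowercase, strip numbers and
# punctuation, drop stop words, and remove rare terms.
import re
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I never liked the 2nd task!", "The task was great, truly great."]

def clean(text):
    text = text.lower()                    # transform to lowercase
    text = re.sub(r"[0-9]+", " ", text)    # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    return text

# min_df=0.02 drops terms appearing in fewer than 2% of documents.
vectorizer = CountVectorizer(preprocessor=clean, stop_words="english",
                             min_df=0.02)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # resulting text features
print(X.toarray())                         # document-term counts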
For data transformation and cleaning, we strongly recommend that researchers report detailed procedural information on (a) data transformation and cleaning, (b) original data fields and formats, (c) data-processing procedures and techniques, and (d) resulting data fields and formats.
Model training and evaluation
The modeling stage of a supervised-ML study involves model training, evaluation, and tuning procedures. The main objective of supervised ML is to tune a model that minimizes prediction errors yet does not overfit the data (Artrith et al., 2021; Winkler-Schwartz et al., 2019). In the training step of supervised ML, the model learns from a training data set by improving model estimates that reduce errors in predicting the outcome. Critically, the trained ML model is then evaluated on independent holdout “test” sets of data to determine if the model generalizes, thereby helping ensure that the trained model ultimately selected will not be based on overfitting the training data in the first step. There are various ways to engage in this train-then-test process (e.g., single holdout, cross-validation, and bootstrap; Kohavi, 1995; Refaeilzadeh et al., 2009), but ultimately, the goal is to minimize prediction errors in the holdout test set. For more details of the specific methods, see Supplemental Material 3 in the Supplemental Material.
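The train-then-test process might be sketched as follows (Python with scikit-learn on simulated data), using both a single holdout set and k-fold cross-validation:

# Illustrative sketch of the train-then-test process: fit on training data,
# then evaluate on a holdout test set never used during training.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=800, n_features=30, noise=15.0, random_state=3)

# Single holdout: e.g., 80% of cases for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=3)
model = GradientBoostingRegressor(random_state=3).fit(X_train, y_train)
print("Holdout MSE:", mean_squared_error(y_test, model.predict(X_test)))

# k-fold cross-validation: every case serves exactly once as test data.
cv_mse = -cross_val_score(GradientBoostingRegressor(random_state=3), X, y,
                          cv=5, scoring="neg_mean_squared_error")
print("5-fold cross-validated MSE:", cv_mse.mean())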
In engaging with the process above, researchers are contending with a challenge known as the “bias-variance trade-off” (e.g., Pargent et al., 2023; Putka et al., 2018). “Bias” refers to the error that arises from a model that does not fully use the predictive information in the training data set, resulting in poorer predictions of the outcome variable (Kohavi, 1995). In supervised ML, high levels of bias indicate that the model is underfitting the data (e.g., underweighting and/or excluding relevant predictors). In contrast with bias, “variance” refers to the error introduced by an ML model’s sensitivity to minor fluctuations or noise present in the training data (e.g., including variables that are predictive only in a specific sample). The bias-variance trade-off suggests that one should optimize prediction by finding an equilibrium between underfitting and overfitting—that is, bias and variance as described above. Achieving this balance is of paramount importance when constructing supervised-ML models designed to perform effectively on unseen data, thus reducing generalization error (Boehmke & Greenwell, 2019).
To manage the bias-variance trade-off in supervised ML, various resampling methods are used to mimic how well the trained model would perform when it is applied to a new data set (Pargent et al., 2023). For a brief introduction to the widely used resampling methods in psychological studies, see Supplemental Material 3 in the Supplemental Material. One example of reporting resampling methods is Hickman et al. (2021), who reported the nested cross-validation information in their model training: Nested cross-validation with k = 10 involves splitting the data into ten equally sized parts (the outer folds). Then, nine of these parts (the outer training folds) are used to conduct a separate 10-fold (the inner folds) cross-validation to select the optimal elastic net hyperparameters (i.e., model selection) based solely on these nine outer folds. Next, the final model is trained on those nine folds using the optimal hyperparameters, and then that model’s accuracy is estimated on the outer test fold (i.e., model assessment). This process is repeated 10 times, using each of the ten outer folds only once for testing. (p. 1331)
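A nested cross-validation of the kind Hickman et al. (2021) describe can be sketched as follows (scikit-learn on simulated data; this is not the original authors’ code): the inner folds select the elastic net hyperparameters, and the outer folds estimate the tuned model’s accuracy.

# Illustrative sketch of nested cross-validation: inner folds select
# hyperparameters (model selection); outer folds estimate accuracy
# (model assessment).
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=400, n_features=60, noise=10.0, random_state=5)

param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}
inner = KFold(n_splits=10, shuffle=True, random_state=5)
outer = KFold(n_splits=10, shuffle=True, random_state=6)

# GridSearchCV handles hyperparameter selection on the inner folds ...
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=inner)

# ... while cross_val_score assesses the tuned model on the outer folds.
outer_scores = cross_val_score(search, X, y, cv=outer, scoring="r2")
print("Nested 10x10 CV: mean R^2 =", round(outer_scores.mean(), 3))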
Hyperparameter tuning
“Hyperparameter tuning” refers to adjusting ML-model hyperparameters (or a subset of them) to help find values that optimize the prediction of unseen data in a test set. For example, in the random-forests algorithm, adjusting hyperparameters such as the number of trees, the depth of the trees, and the number of variables considered at each node can improve subsequent model performance. To reduce the risk of overfitting in models, hyperparameter tuning is typically conducted using nested cross-validation procedures (Cawley & Talbot, 2010).
Frequently used techniques for hyperparameter optimization include model-free optimization, gradient-based optimization, and Bayesian optimization, although there are others (Feurer & Hutter, 2019; Yang & Shami, 2020). In model-free optimization, grid search and random search are two common options. Grid search exhaustively considers all possible combinations of hyperparameters in a defined space, whereas random search randomly samples points in a defined hyperparameter space. Gradient-based optimization attempts to move hyperparameter estimates efficiently toward a global error-minimizing criterion. Bayesian optimization determines hyperparameter estimates based on information about previous estimates (Brochu et al., 2010), employing probabilistic models to iteratively guide the search for optimal hyperparameters by balancing exploration of unexplored regions and exploitation of promising areas in the hyperparameter space (Shahriari et al., 2015). This method efficiently improves model performance by learning from each iteration’s results and adapting its search strategy accordingly. To summarize, hyperparameter tuning is recommended, and its methods and use need to be described in sufficient detail for the purpose of reproducibility. Below is an example of reporting hyperparameter tuning in training supervised-ML models (i.e., random forest; Douglas et al., 2023): Random forest analyses were conducted using the ranger R-package (Wright & Ziegler, 2015). The forest consisted of 1000 trees. Two tuning parameters of random forests are the number of candidate variables to consider at each split of each tree, and the minimum node size resulting from a split. The optimal tuning parameters were selected by minimizing the out-of-bag mean squared error (MSE) using model-based optimization with the R-package tuneRanger; in large datasets, this approach is equivalent to cross-validation (Probst et al., 2019). The best model considered 30 candidate variables at each split, and a minimum of seven cases per terminal node. We report the results of this best model. (Douglas et al., 2023, p. 1195)
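The model-free approaches above can be sketched as follows (Python with scikit-learn on simulated data; the grids and parameter ranges are illustrative, and Bayesian optimization, which typically requires additional packages, is omitted):

# Illustrative sketch of model-free hyperparameter optimization:
# grid search (exhaustive) versus random search (sampled).
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=20, random_state=11)

# Grid search: every combination in a defined hyperparameter space.
grid = GridSearchCV(
    RandomForestClassifier(random_state=11),
    param_grid={"n_estimators": [200, 500], "max_depth": [5, 10, None],
                "max_features": ["sqrt", 0.5]},
    cv=5,
).fit(X, y)
print("Grid search best:", grid.best_params_)

# Random search: a fixed number of randomly sampled points in that space.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=11),
    param_distributions={"n_estimators": randint(100, 600),
                         "max_depth": randint(3, 15)},
    n_iter=20, cv=5, random_state=11,
).fit(X, y)
print("Random search best:", rand.best_params_)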
A large number of metrics are available for evaluating the performance of supervised-ML models. Naturally, researchers must understand and justify the criteria used to evaluate the performance of their supervised-ML model in light of the research questions, the nature of the data, and the model selected (W. Luo et al., 2016). Then, the performance metrics and their calculations must be clearly described so that others can understand this evaluation process and replicate it if needed. For more details on evaluation metrics, see Supplemental Material 4 in the Supplemental Material.
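For instance, a regression model’s performance is often summarized with several complementary metrics; a minimal Python sketch (scikit-learn; toy numbers) shows how such metrics can be computed and reported together:

```python
# Illustrative computation of common regression metrics (toy data).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.2])  # observed criterion values
y_pred = np.array([2.8, 4.6, 3.0, 6.5, 4.0])  # model predictions

mse = mean_squared_error(y_true, y_pred)
print(f"MSE  = {mse:.3f}")
print(f"RMSE = {mse ** 0.5:.3f}")  # same units as the criterion
print(f"MAE  = {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R^2  = {r2_score(y_true, y_pred):.3f}")
```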
In addition, fundamental concerns for reliability and validity are also worth investigating, particularly in psychological research (e.g., Koenig et al., 2023; Luciano et al., 2018; Tay et al., 2020), even if they are reported in metrics relevant to ML instead of psychometrics. Whereas standard ML metrics, such as accuracy or area under the curve (AUC), quantify predictive performance, they do not assess the empirical stability (reliability) of model variables or the construct relevance (validity) of model predictions. Reliability in psychological studies using ML can be examined through test-retest consistency (e.g., whether model predictions change across measurement occasions; Fan et al., 2023; Hickman et al., 2021). Validity, especially construct validity, helps to ensure that the values predicted by supervised ML align closely with the underlying theoretical constructs. That is, information about reliability and validity plays a crucial role in evaluating ML models in psychological research. Just because ML models are sophisticated does not mean that concerns for reliability and validity disappear; in fact, those concerns may be heightened because the investigation and evidence of reliability and validity may be ignored, diminished, or dismissed under the lure of ML sophistication.
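One simple way to examine such test-retest consistency is to correlate a trained model’s predictions for the same participants across two occasions; the following Python sketch uses simulated predictions purely for illustration:

```python
# Illustrative test-retest check: correlate one model's predictions for the
# same participants at two time points (simulated data, not a real analysis).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
pred_t1 = rng.normal(size=100)                        # predictions at Time 1
pred_t2 = pred_t1 + rng.normal(scale=0.3, size=100)   # predictions at Time 2

r, p = pearsonr(pred_t1, pred_t2)
print(f"Test-retest correlation of predictions: r = {r:.2f}")
```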
Sharing program code
We strongly recommend that researchers share fully commented program code to facilitate open science, credibility, and replicability of the research (e.g., Artrith et al., 2021; Heil et al., 2021; Vollmer et al., 2020). The process of modeling involves several steps (as outlined above) in which even small variations can result in drastically different outcomes. For example, during the model-training process, optimization methods (e.g., stochastic gradient descent) and their configuration hyperparameters (e.g., stopping criteria, learning rate, and random seeds) are critical for replicability. Because it may be difficult for researchers to report every single detail given the limited space in a write-up, we encourage using open-source platforms that facilitate code sharing and collaboration (e.g., GitHub, OSF). We acknowledge that there may be cases in which sharing code is not possible, for example, when researchers collaborate with corporations that restrict researchers from openly sharing code or when code is built on components with restrictive licenses. In such situations, pseudocode (i.e., a verbal description of the programming code) may be shared as an alternative to actual code to help readers understand and replicate the studies on their own. For an example of pseudocode, see Supplemental Material 5 in the Supplemental Material.
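As a small example of the details that shared code can make explicit, the following Python sketch (illustrative; the file name and configuration values are hypothetical) fixes random seeds and saves the exact model configuration alongside the analysis:

```python
# Illustrative reproducibility practices for shared analysis code:
# fix random seeds and log the exact configuration used.
import json
import random

import numpy as np
from sklearn.ensemble import RandomForestRegressor

SEED = 12345  # arbitrary example; report whatever seed was actually used
random.seed(SEED)
np.random.seed(SEED)

config = {"model": "RandomForestRegressor", "n_estimators": 1000,
          "min_samples_leaf": 7, "seed": SEED}
model = RandomForestRegressor(n_estimators=config["n_estimators"],
                              min_samples_leaf=config["min_samples_leaf"],
                              random_state=SEED)

# Save the configuration next to the code (e.g., in a GitHub or OSF repository).
with open("model_config.json", "w") as f:
    json.dump(config, f, indent=2)
```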
Results section
Following a thorough description of the ML-model information, it is essential to provide details on model-evaluation criteria and performance (e.g., Kakarmath et al., 2020). Subsequently, it is recommended to explain and interpret the implications of these performance metrics in the context of the study. For an introduction to different performance metrics for regression and classification tasks, see Supplemental Material 4 in the Supplemental Material. Different research questions may warrant different emphases on these metrics. For example, in a security-clearance task, recall (sensitivity) may be the most useful metric because correctly identifying weapons (true positives) should be a priority even if it comes at the expense of flagging more nonweapons as weapons (false positives). We recommend that authors (a) consider reporting multiple evaluation metrics to provide a more multifaceted picture of how their ML algorithm performs and (b) explain their rationale for selecting the evaluation metrics based on their research question or sample characteristics. Rationales that authors can provide for selecting appropriate evaluation metrics include (a) the research question being a supervised-ML prediction/classification task, (b) the distribution of the outcome variable (e.g., continuous: normally distributed or skewed; categorical: balanced or unbalanced), and (c) the relative costs of missing a true positive versus missing a true negative in a classification task, which determine whether metrics that weigh both error types (e.g., accuracy, the receiver-operating characteristic curve, and AUC) are appropriate for identifying both true positives and true negatives.
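A short Python sketch (scikit-learn; toy labels and an arbitrary 0.5 threshold) illustrates what reporting multiple complementary classification metrics can look like:

```python
# Illustrative multi-metric evaluation of a binary classifier (toy data).
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.35, 0.7, 0.5])
y_pred = (y_prob >= 0.5).astype(int)  # 0.5 threshold is an arbitrary example

print(f"Accuracy  = {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision = {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP)
print(f"Recall    = {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN)
print(f"F1        = {f1_score(y_true, y_pred):.2f}")
print(f"AUC       = {roc_auc_score(y_true, y_prob):.2f}")    # uses probabilities
```

Reporting these side by side makes the trade-offs visible; for example, an imbalanced outcome can yield high accuracy even when recall is poor.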
Model explanation
Supervised ML holds great promise for psychological science by producing more accurate and generalizable predictions than traditional statistical modeling. That said, we note that psychological science is not limited to maximizing prediction; it also seeks to understand why an algorithm is predictive. The field relies on explainable methods and results so that researchers can advance the fundamental knowledge of psychological science and guide practice (e.g., Yarkoni & Westfall, 2017).
When relating supervised-ML results back to the originating research questions, researchers may also consider providing both global and local explanations for the findings, allowing them to be interpreted from both empirical and conceptual perspectives (Lundberg et al., 2020; Ribeiro et al., 2016). For example, in a model predicting turnover behavior (e.g., Min et al., 2024), researchers may want to identify the most important features for predicting employee turnover and, in particular, who is likely to leave a company. A global explanation addresses the former question: the process of determining the extent to which the predictors or their interactions contribute to the overall prediction of an ML model across all the data (Du et al., 2019). A local explanation addresses the latter question: the process of understanding how a prediction is derived for a specific observation (Molnar et al., 2021).
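To make the distinction concrete, the following Python sketch computes a global explanation with scikit-learn’s permutation importance and a local explanation for a single observation with SHAP values; it assumes the third-party shap package is installed, and the data are synthetic:

```python
# Illustrative global vs. local explanations (synthetic data).
import shap  # third-party package: pip install shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Global explanation: how much does shuffling each feature hurt performance
# across the whole data set?
global_imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("Global importances:", global_imp.importances_mean.round(3))

# Local explanation: how did each feature push this one case's prediction?
explainer = shap.TreeExplainer(model)
print("Local SHAP values:", explainer.shap_values(X[:1]))
```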
Discussion section
Interpretation of ML results
Apart from the numerical results one should report in the Results section, it is also necessary to interpret those results as best one can. In other words, ML research should strive to explain not only how or what findings were obtained with ML but also why those ML results might matter (Moons et al., 2015). Even when ML algorithms lack full transparency and interpretability, research often benefits from making some attempt to connect the data and ML findings to the research literature and theoretical frameworks that have been used to develop related hypotheses or research questions. In addition, supervised-ML findings may have future theoretical and practical implications that the authors can reflect on in their writing.
As with any algorithm, supervised-ML algorithms require humans to interpret the real-world meaning or usefulness of their predictive findings. For example, obtaining a more accurate prediction than traditional methods does not necessarily mean that supervised ML is better if interpretability is sacrificed through the opaqueness of ML. Indeed, interpretability can be broadly understood as the ability to present information in understandable terms to a human (Doshi-Velez & Kim, 2017), and it is key in reporting supervised-ML-model results because it helps researchers understand, appropriately trust, and effectively manage answers to research questions (Rossi et al., 2022). Conversely, a lack of interpretability can occur when a complex model, such as gradient boosting or random forests, achieves high predictive accuracy but offers little insight into how individual predictors influence the outcome. For example, a supervised-ML model predicting personality traits might claim 90% accuracy in predicting people’s level of extraversion from the available data yet provide no clear explanation of what variables might be driving the predictions, such as social media usage patterns, preferred leisure activities, or daily sleep patterns. How extraversion and its correlates are measured in the first place is also a key question here before one even begins to collect, analyze, and interpret the data.
Consistent with our assertions, psychological scientists have already been underscoring the critical need to better understand how to use supervised-ML techniques and explain them (e.g., Tonidandel et al., 2015). To improve interpretability in supervised-ML models, researchers can draw from several techniques across different stages of analysis/modeling. Premodeling approaches include using construct-relevant, well-labeled input data and conducting exploratory data analysis to better understand variable relationships (Aguinis & Edwards, 2014; Tukey, 1977). During the modeling stage, researchers can choose intrinsically interpretable models—such as lasso regression and other sparse linear models—that offer greater transparency through simulatability and decomposability (Lipton, 2017; Rudin, 2019). In the postmodeling stage, widely used tools include permutation feature importance (Breiman, 2001), partial dependence plots (Hastie et al., 2009), and Shapley values (Lundberg & Lee, 2017), which help explain how features contribute to predictions at both global and local levels.
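As one example of these postmodeling tools, a partial dependence plot can be generated in a few lines with scikit-learn (an illustrative sketch on synthetic data; the file name is hypothetical):

```python
# Illustrative partial dependence plot for a single feature (synthetic data).
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Plot how the model's average prediction changes across values of feature 0.
PartialDependenceDisplay.from_estimator(model, X, features=[0])
plt.savefig("pdp_feature0.png")  # save for scripted (noninteractive) runs
```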
Limitations
As with any study, authors should acknowledge the limitations of their supervised-ML study, such as the potential biases of the data sets, the extent to which the findings appear to be generalizable, and whether the results are robust across different analytic methods. Regarding this latter point, it can often be useful to compare different ML models with at least one traditional analytic method (as recommended above; Kakarmath et al., 2020; W. Luo et al., 2016). For a checklist summarizing all recommended steps for reporting psychological ML research, see Table 3.
General Discussion
Supervised-ML techniques are being increasingly adopted among psychological researchers and practitioners. As their use increases, we hope that the checklist we offer, although general in nature, will promote open-science practices, including more transparent reporting in published supervised-ML studies. When possible and of interest, reproducible and replicable research can also be conducted. Without such a checklist in hand, psychological researchers stand to report ML results much less efficiently. For example, when designing a study, individual researchers may invest a substantial amount of time and effort in learning how other researchers and disciplines report their ML results to help determine their own approach. Then, after their manuscript is submitted to a journal, they are likely to be required to revise the ML reporting in their manuscript further given feedback from reviewers and editors who themselves lack helpful guidance. We suggest this possibility because many of us have experienced this issue firsthand. Thus, the checklist offered here reflects an initial step toward specifying the necessary information in psychological studies involving supervised ML, which we hope future researchers will adapt and extend further.
Following previous articles providing methodological guidelines in psychological research (e.g., Aguinis et al., 2018; Eby et al., 2020; Newman, 2014) and guidelines for reporting ML studies in computer science (e.g., M. Mitchell et al., 2019; Pineau et al., 2021), in this article, we provide general guidelines for reporting psychological-research results that use supervised-ML techniques. In the current article, we explicitly list key information that authors need to report when writing up a supervised-ML-technique-based study. Supporting the need for this information, we provide examples of reporting results from previous ML-based studies. We also offer a useful but abbreviated version of a supervised-ML checklist (Table 3) that researchers can follow in their study design and manuscript writing, which helps authors provide all necessary information for future researchers to understand and replicate their findings. The checklist can also be highly beneficial for journal editors and reviewers in that (a) it can inform and improve the standardization of instructions to authors and review procedures and (b) it helps reviewers “identify and attempt to minimize questionable research practices and the exploitation of methodological gray areas in submitted manuscripts” (Aguinis et al., 2020, p. 47).
Supervised-ML techniques are useful tools for psychological researchers and practitioners seeking to mine large data sets and work with other disciplines in doing so. Supervised ML is also an area in which inductive research may see progress (e.g., McAbee et al., 2017), although the choice of ML algorithms requires further attention (e.g., when selecting more interpretable algorithms is crucial). Moreover, supervised-ML techniques enable psychological researchers to analyze diverse, large, and messy data sets, such as text narratives, two-dimensional and three-dimensional images, and video. By making data from multiple sources more accessible, supervised ML holds the promise (subject to empirical support) of reducing common method bias and forms of social desirability (e.g., Podsakoff et al., 2024). In addition, supervised ML has been shown to be useful in adding the convenience of automation to text and video analyses, reducing both time and cost (e.g., Guo et al., 2024; Hickman et al., 2021; Iliev et al., 2015).
Despite this promise, a lack of transparency in reporting major decisions in supervised-ML analysis presents scientific, ethical, and legal impediments to the growth of such applications in the field. In response, the supervised-ML checklist presented here offers a path forward in fostering transparency and replicability in supervised-ML-based studies, ultimately increasing the accessibility of these techniques to psychological research. But a checklist is not meant to encourage box-checking by psychological researchers; quite the opposite. The checklist helps researchers systematically consider each of their decisions when conducting supervised-ML analysis and then explain those decisions in their manuscripts. We make this statement while acknowledging that when submitting a manuscript, authors must often adhere to the word limits imposed by a journal. However, by referring to open platforms (e.g., GitHub and OSF), authors can share all additional materials relevant to their work, such as annotated code (e.g., preprocessing and analysis code), materials (e.g., variable codebooks, measures, and detailed sample information), and the data set, if possible. In these ways, the current article fosters open science by improving reporting quality and increasing the transparency of supervised-ML applications in psychological research.
We hope our work inspires future research in several ways. First, we focus on supervised-ML reporting; we do not cover reporting procedures for other ML paradigms, such as unsupervised learning or reinforcement learning. Clearly, unsupervised ML and reinforcement learning will also continue to be important analytical tools for researchers. Valtonen et al.’s (2022) article focused on conducting unsupervised ML in organizational research, specifically, text mining, outlining the major steps and decisions in the analysis, and providing guidelines for promoting reproducibility and accountability. Reinforcement learning involves learning from environmental feedback rather than labeled data (Sutton & Barto, 1998). Likewise, the rapid adoption of large language models leads to new challenges and opportunities in reporting and interpretability. Future work could adopt a similar approach to develop guidelines for other ML paradigms in psychological research.
Another limitation of our work is that we emphasize only the common key steps in supervised-ML reporting; we cannot cover all supervised-ML algorithms and the corresponding aspects of the analysis procedures in each specific technique. For example, deep-learning algorithms typically involve more complex decision-making processes and adopt additional metrics for model evaluation, which cannot be introduced in detail in this article. However, those procedures should also be reported transparently by annotating and sharing the complete code. In addition, in this article, we provide a general set of recommendations that may not be applicable to all studies. Authors are encouraged to adapt these recommendations to the specific nature of their own studies. Moreover, we fully expect that the guidelines and checklist provided here should evolve as supervised ML progresses; on the other hand, we argue that most of the critical questions raised in the current guidelines will apply universally for current and future ML methods because the information provided is general and essential for scientific transparency, guidance, and understanding regardless of the specific techniques applied.
Conclusion
Just as artificial intelligence, machine learning, and other major advances in technology are profoundly influencing various aspects of life and society, supervised-ML techniques are beginning to have a strong presence in psychological-research publications and practical applications. Like structural equation modeling, multilevel modeling, and many other statistical techniques when they were in their infancy in psychological research, supervised ML in psychological research will also greatly benefit from increased guidance in analytic decision-making. Furthermore, reporting those decisions in supervised-ML analyses using the current checklist to provide structure will improve open science in terms of research transparency and the sharing of knowledge that informs and improves future psychological research and applications.