Abstract
Machine-learning (ML) algorithms are being rapidly incorporated into the work of psychologists given their capability and flexibility in analyzing large-scale, complex, or otherwise messy data sets. In this context and in the spirit of open science, ML research should be conducted in a transparent, understandable, and ethical manner. However, publications by psychology researchers and practitioners show a troubling lack of consistency in reporting ML information. Given that ML offers a wide range of analytical options, in this article, we address an important need by providing a comprehensive, open-science checklist that specifies the information researchers should disclose at each stage of a supervised-ML project—from data collection and preprocessing to model selection, evaluation, interpretation, and code sharing. We hope that psychological researchers will benefit from this checklist when reporting ML results and will adapt and extend this checklist further in the future.
Driven by the promise and confluence of big data and advances in computational power, many in psychological science anticipate that applying machine-learning (ML) algorithms will enhance, if not revolutionize, key areas of research and practice. Likewise, researchers and practitioners expect that ML tools can mine big data to capitalize on complex relationships that could not be discovered before. One broad distinction divides ML algorithms into two types (see Alloghani et al., 2020). “Supervised”-learning algorithms share the same purpose as linear regression: making predictions. These algorithms are trained on a large set of predictors (i.e., “inputs” or “features” in ML terminology) and a known criterion variable (i.e., a “labeled” variable or “output” variable). Once the algorithm is trained and models the relationship between the predictors and the criterion, it is then applied to external data sets of a similar nature in which the criterion data are not yet observed or are otherwise unseen. “Unsupervised”-learning algorithms instead find hidden structure in a data set, for example, determining how many reliable consumer profiles exist in a large data set of consumer purchases. In the current article, we focus on supervised-ML algorithms. Although the use of big data and ML has surged in various research disciplines, their integration is still developing in psychological-research and graduate-training curricula across areas such as personality psychology (Bleidorn & Hopwood, 2019), social psychology (e.g., Sheetal et al., 2020), and industrial-organizational psychology (Tonidandel et al., 2018).
Compared with commonly used analytic methods in psychology research, such as analysis of variance (ANOVA) or multiple linear regression, the application of supervised-ML algorithms requires a significantly higher degree of decision-making, especially in the model-development and -evaluation stages. For instance, two researchers analyzing the same data set using ordinary least squares (OLS) regression are expected to obtain identical results regardless of the statistical software used (e.g., the regression coefficients, standard errors, R², and adjusted R² for the model should all be the same). By contrast, if these two researchers are tasked with fitting the same data set using random forests, a supervised-ML method, it is highly likely that they would yield divergent outcomes (e.g., different mean squared errors of prediction, different assessments of how important each variable is for making predictions in the model). This variation arises in part from the numerous critical decisions involved in developing the algorithm. Changing even one of these decisions (e.g., the number of variables considered at each node of a tree during hyperparameter tuning) could lead to vastly different results when replicating findings or applying the method more generally. Therefore, we apply recent guidance promoting methodological transparency in psychological research (e.g., Aguinis et al., 2018; Weston et al., 2019) to the unique and important decisions involved in implementing supervised-ML algorithms, which are still new to psychological research. The current understanding of these algorithms by researchers and readers is uneven at best. Therefore, more consistent and transparent (if not standardized) approaches to ML reporting are crucial for advancing open science in psychology.
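To make this concrete, consider the following minimal sketch (in Python with scikit-learn, on simulated data, purely for illustration): OLS returns identical results on every run, whereas two random forests that differ only in a random seed and in one hyperparameter yield different prediction errors.

# Illustrative sketch: same simulated data, two defensible random-forest
# setups, two different results -- whereas OLS is deterministic.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# OLS: any researcher running this obtains identical coefficients and error.
ols = LinearRegression().fit(X_train, y_train)
print("OLS test MSE:", mean_squared_error(y_test, ols.predict(X_test)))

# Two random forests differing only in the seed and in max_features
# (the number of predictors considered at each split of each tree).
for seed, max_feat in [(1, "sqrt"), (2, 0.5)]:
    rf = RandomForestRegressor(n_estimators=200, max_features=max_feat,
                               random_state=seed).fit(X_train, y_train)
    print("RF seed", seed, "max_features", max_feat, "test MSE:",
          mean_squared_error(y_test, rf.predict(X_test)))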
To assess the current state of ML reporting, we conducted a literature search across 11 selected psychology journals covering general, social, personality, and methodological psychology. The search spanned from 1965 to 2024 and yielded 106 review/theoretical and empirical articles relevant to ML techniques (see Supplemental Material 1 in the Supplemental Material available online). Of the 76 empirical studies, 33 used supervised learning, 15 used unsupervised learning, and the remaining 28 used ML for data preprocessing (e.g., extracting data) rather than as their main analytic method. Among the 33 empirical studies applying supervised-ML techniques, there is an extreme lack of consistency in reporting the key information necessary for fully understanding the ML algorithms applied and their findings. Moreover, although we identified good examples of reporting particular pieces of information across multiple studies, very few studies reported adequate information for replication, as we detail below. Based on these findings, we provide a checklist critical for the transparency and replication of supervised-ML research.
The overall purpose of the current article is fourfold. First, informed by recent practices and publications involving ML in psychology, we develop and offer a set of reporting guidelines for future researchers applying supervised-ML techniques. The guidelines outline the key steps to report when conducting ML research, including how to frame supervised-ML-based research questions, how to document key data-preprocessing procedures, what to report when selecting among algorithms, and how to report and interpret the results from supervised-ML algorithms. These recommendations aim to enhance the comprehension, transparency, and accessibility of supervised-ML methods in both research and practical contexts. In this article, we offer guidelines for comprehensive reporting that attempt to strike a balance. That is, we seek to describe the general key steps and decisions involved in using supervised ML in psychological research in the interest of establishing a shared understanding of how researchers may report ML findings transparently. This information is not intended to prescribe the specific ML reporting practices appropriate to one’s local research setting; often, such specific details will depend on the domain, data set, prediction problem, and ML algorithms available at the time.
Second, the guidelines from our article also contribute to open science by increasing transparency and replicability in psychological research that increasingly incorporates supervised ML. Scientific progress depends on both innovation and replication: “Innovation points out paths that are possible; replication points out paths that are likely” (Open Science Collaboration, 2015, p. 7). Given that applications of supervised-ML techniques involve many critical decisions, such as data cleaning, model selection, and model-evaluation metrics, inadequate reporting can obscure understanding and hinder serious replication attempts (Vollmer et al., 2020). The reporting checklist provided in the current article can help mitigate such risks. Moreover, if psychological scholars and reviewers rely on (and in the future provide input to) these supervised-ML reporting guidelines and the guidelines are generally acceptable to the broader research community, then psychological researchers are more likely to communicate effectively with and contribute to multidisciplinary projects involving supervised-ML research.
Third, in the spirit of open science, our checklist is designed to reduce new and emerging forms of questionable research practices (QRPs) in supervised-ML studies in psychology. Whereas traditional QRPs (e.g., p-hacking, post hoc theorizing) rely on null hypothesis significance testing (i.e., p values), ML research is evaluated mainly through predictive-performance metrics. Researchers may instead justify feature choices or hyperparameter tweaks after the fact, cycle through countless algorithms and random seeds until a “best” score emerges, or omit key baselines that reveal the true improvement of an ML model (e.g., reporting 64% retention accuracy without noting that always predicting “stay” yields 63%). Rigorous documentation of data preprocessing, model-selection pathways, hyperparameter tuning, and baseline comparisons is therefore critical to reduce the publication of irreproducible findings, the provision of recommendations to practitioners based on uncertain evidence, and the diversion of limited resources toward unproductive avenues.
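To illustrate the baseline point, the hypothetical sketch below (Python with scikit-learn, simulated retention data; the 64%/63% figures above are not reproduced exactly) shows how reporting a majority-class baseline alongside the model reveals the true improvement.

# Illustrative sketch: always report a naive baseline alongside the ML model.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated, imbalanced "retention" data: roughly 63% of employees stay.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.63],
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(random_state=7).fit(X_train, y_train)

# Reporting only the model's accuracy, without the baseline, overstates the gain.
print("Always predict the majority class:", baseline.score(X_test, y_test))
print("Random forest:                    ", model.score(X_test, y_test))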
Fourth, our checklist also seeks to enhance replicability in cases in which researchers have high degrees of freedom. “Researcher degrees of freedom” refers to defensible choices among multiple empirically supported options (Manapat et al., 2024). For example, in supervised ML, choosing among several cross-validation techniques falls under researcher degrees of freedom. When such decisions are poorly documented, however, this constitutes a QRP (Wigboldus & Dotsch, 2016), ranging from incomplete reporting of technical details to outright misrepresentation (e.g., cherry-picking from the research options that were actually explored). Questionable reporting is likely to contribute to irreproducible findings. Although no guidelines can guard against bad actors, in the current article, we promote transparency in conducting supervised-ML analyses and reporting procedures, such as clearly specifying all key aspects of the predictive models used and detailing the methods researchers may use to improve model performance.
The need for distinct guidelines to enhance reporting transparency for supervised ML arises primarily from the unique steps inherent to data preprocessing, model development, and model evaluation, which diverge from more conventional methods of modeling and prediction (e.g., ANOVA, regression, and structural equation modeling). The absence of transparency in these specific steps can result in difficulties when researchers attempt to replicate supervised-ML results. In addition, given the novelty and varying familiarity with supervised-ML techniques by psychological researchers, specific reporting guidelines not only establish more consistent reporting practices but also serve as an educational tool to meet the evolving needs of psychological science.
A Brief Introduction to Supervised ML in the Psychological Literature
ML is a subfield of computer science that aims to construct computer algorithms that can learn and improve with experience automatically (T. Mitchell, 1997). The distinction between ML algorithms and traditional statistics in psychological research can be somewhat fuzzy. For example, some of the traditional statistical methods widely used in psychological research (e.g., logistic regression) can be considered components of certain ML algorithms. Typically, when researchers apply ML techniques, they often encounter settings with a large number of predictors (e.g., Sheetal et al., 2020) and/or situations in which the number of predictors (k) exceeds the number of observations (n; Oswald et al., 2020). In these contexts, researchers often incorporate procedures such as cross-validation and regularization to reduce overfitting and improve the predictive accuracy of supervised-ML techniques. For explanations of ML-related terminology, see Table 1.
Explanations of Machine-Learning-Related Terminology
Supervised ML involves statistically learning a set of mapping rules from a group of predictors (i.e., features or inputs) to a criterion (i.e., labels or outputs). Once trained, the supervised-ML model can be applied to new data sets with similar structures in which the criterion values are unknown or unobserved. Supervised ML has numerous applications in people’s daily lives, such as predicting mental illness (e.g., Jiang et al., 2020), inferring personality traits (e.g., Alexander et al., 2020), and detecting faking in psychological assessments (e.g., Calanna et al., 2020). These algorithms are typically categorized into two main types: regression (for continuous outcomes) and classification (for categorical outcomes).
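As a minimal sketch of this train-then-apply logic (Python with scikit-learn; the toy data and variable meanings are hypothetical), a classification model is fit on labeled cases and then used to predict the criterion for new, unlabeled cases:

# Illustrative sketch of the supervised-ML workflow: learn a mapping from
# features (inputs) to labels (outputs), then predict labels for unseen cases.
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled training data: e.g., satisfaction and tenure as
# predictors, turnover (1 = left, 0 = stayed) as the known criterion.
X_train = [[3.2, 1], [4.5, 0], [2.1, 1], [4.9, 0]]
y_train = [1, 0, 1, 0]

clf = LogisticRegression().fit(X_train, y_train)

# New cases in which the criterion is not yet observed.
X_new = [[3.0, 1], [4.7, 0]]
print(clf.predict(X_new))        # predicted class labels
print(clf.predict_proba(X_new))  # predicted class probabilities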
To understand how supervised-ML algorithms are used in psychology research, we searched 11 psychology journals (publication dates from 1965 to September 2024) using relevant keyword stems combined with truncation symbols (i.e., machine learn*, big data, predictive model*, data min*, text min*, natural language process*; for the complete list, see Supplemental Material 1 in the Supplemental Material).
Studies integrating ML are growing rapidly in psychological research: Only two articles appeared before 2014, whereas 24 appeared in 2023 alone. Of the 106 total articles identified, 33 used supervised learning. The search results show that, as a research community, psychologists are not merely discussing ML abstractly; they are now increasingly using these tools in their work. In these empirical studies, psychologists have applied ML techniques to answer a wide variety of questions. For example, can ML predict personality-assessment responses to unseen items, thereby overcoming the limitations of traditional surveys (Abdurahman et al., 2024)? What are the novel predictors (among more than 900 variables) of people’s willingness to justify unethical behaviors (Sheetal et al., 2020)? Studies like these shed light on how ML techniques can be applied to enhance the understanding of complex psychological phenomena.
However, reporting practices for ML procedures and results vary widely. The supervised-ML empirical studies we identified rarely provide the essential information required for future researchers to understand, replicate, and extend the research findings. For instance, many studies did not report the details of their data-preprocessing or cross-validation procedures. In addition, most supervised-ML empirical studies reported their results inconsistently. For instance, two studies both used supervised ML to predict mental-health-diagnosis results; however, one reported accuracy (i.e., the proportion of correct predictions out of all predictions), and the other reported precision (i.e., the proportion of correct positive predictions out of all positive predictions). This makes it impossible for readers to directly compare the performance of the two supervised-ML models.
Admittedly, this situation is entirely understandable given that the application of supervised-ML techniques in psychological research is still in its infancy, a situation common to many advanced techniques when they are first introduced to the field (e.g., structural equation modeling, multilevel analysis). However, this situation not only introduces barriers to understanding research findings but also inhibits the scientific contribution and replication of those findings. Failing to address this issue hinders the psychology community’s understanding of the benefits, drawbacks, and broad applicability of supervised-ML techniques in the long run.
In the current work, we take an initial step to mitigate the inconsistency of reporting supervised-ML study procedures and results, recognizing that most ML studies in psychological research to date have used supervised ML. 1 We conducted a systematic review of articles across different disciplines about how to report ML results. Based on our literature review, we summarize the decision-making for each step when conducting psychological research using supervised ML and specify the key pieces of information required when reporting a supervised-ML study. We also list examples that report the key information. The examples provided in this article primarily focus on general, social, and personality psychology, reflecting the authors’ collective expertise in these areas. However, we acknowledge that supervised ML is also widely used in many other subfields of psychology, such as cognitive neuroscience and neuropsychology. For a broader overview, see Supplemental Material 2 in the Supplemental Material, which includes a table of example articles from these additional domains. The methodological challenges highlighted in the current study are also relevant and applicable across these subfields.
Method
Following the literature-search procedure recommended by Harari et al. (2020), we conducted a literature search on two electronic databases (i.e., Google Scholar and Web of Science) using two lists of keywords. The first list of keywords is related to ML, including “machine learn*,” “big data,” “predictive model*,” “data min*,” “text min*,” and “natural language process*.” The second list includes “guideline” and “best practice.” 2 The search yielded a total of 3,661 articles. A trained graduate research assistant read the titles and abstracts of these articles based on specific criteria. To qualify, an article had to (a) be written in English, (b) be available through the university library or interlibrary loan or made publicly accessible by the author(s), (c) originate from a peer-reviewed journal, and (d) not be an empirical study applying ML to solve one specific problem, because our goal was to summarize methodological-guidance articles. This resulted in 51 articles for full-text screening.
H. Min and F. Guo independently read 10 articles and coded them into three categories. Category 1 includes articles that provide guidelines specifically on supervised-ML reporting regardless of discipline. Category 2 includes articles that provide guidelines on specific aspects of applying ML and are relevant to psychological research, for example, applying ML to analyze longitudinal data (Sheetal et al., 2022) or ML-algorithm configuration (Eggensperger et al., 2019). Category 3 includes articles whose scope was too narrow (e.g., empirical studies applying ML), too broad (e.g., general introductions to artificial intelligence across multiple disciplines; Xu et al., 2021), or not relevant (e.g., best practices for species-distribution models; Robinson et al., 2017). The interrater agreement of the two coders was 90%; they discussed their rating discrepancies until a consensus was reached. The two coders then coded the remaining articles independently. In the full-text review, we identified 10 articles in Category 1 (highlighted in the references using asterisks) and 24 articles in Category 2 (Table 2). Backward searches identified one additional article for Category 1 and two additional articles for Category 2. For a flowchart of the literature-review process, see Figure 1.
Summary of Guideline Articles in Other Aspects of Machine Learning

Flowchart of systematic literature review.
Results
To synthesize insights from the literature, we focused our analysis on the 11 articles identified in Category 1, which provide direct guidance on supervised-ML reporting; they come from multiple disciplines, including biomedicine, pathology, health care, orthopedics, education, and chemistry. We summarized the recommended items from these studies and adapted them to apply to psychological research. Based on these items, we developed a structured checklist for reporting supervised-ML applications in psychological studies (see Table 3). The checklist is organized into five main sections, each outlining key components and recommended practices. Items marked with an asterisk in Table 3 indicate that reporting these items is strongly recommended to increase research transparency and reproducibility.
Reporting Checklist When Using Supervised-Machine-Learning Techniques
Note: Asterisks indicate items whose reporting is strongly recommended. ML = machine learning.
Introduction section
Research questions
When authors write an article about their research, they typically outline their hypotheses and/or research questions, study design, data collection, and statistical analyses (Brownstein et al., 2019; W. Luo et al., 2016). The use of supervised-ML techniques is different in many respects: Whether variables in a large data set can predict an outcome of interest using ML techniques may be based on a theory, but it is more likely to be based on a rationale or even just a hope (e.g., what novel antecedents will predict unethical behavior).
Hofman et al. (2021) proposed a framework for categorizing research activities along two dimensions: the extent to which researchers aim to identify and estimate causal effects and the extent to which researchers focus on accurately predicting outcomes. Based on these dimensions, the research framework includes four quadrants: descriptive modeling (neither causal nor predictive), explanatory modeling (focused on causal inference), predictive modeling (focused on forecasting outcomes), and integrative modeling (focused on both causal explanation and predictive accuracy). Generally, psychologists use supervised ML to explore (mine) the data and detect complex relationships wherever they exist and hold up under cross-validation, which falls under the predictive-modeling quadrant. Supervised ML can also be used to answer research questions in other quadrants, such as to improve matching quality and further facilitate causal inferences (explanatory modeling; e.g., Liu et al., 2021) or to predict how patients’ mental and physical health will change when one characteristic of a substance-abuse intervention is altered (integrative modeling). Hofman et al. suggested that researchers clarify which quadrant their research questions fall into.
Given the flexibility of supervised ML in addressing a wide range of questions, we recommend that researchers explicitly state their research questions and document all necessary details throughout the research process to promote transparency and reproducibility by third parties when data can be shared.
Rationale for using supervised-ML techniques
When authors articulate their research questions, they also need to explain why supervised-ML techniques were used over traditional statistical methods 3 (W. Luo et al., 2016). This rationale typically centers on characteristics such as the need to handle high-dimensional predictor sets, detect complex nonlinearities and interactions, accommodate weaker distributional assumptions, or prioritize predictive accuracy over parameter estimation (Chapman et al., 2016; Hastie et al., 2009; Jordan & Mitchell, 2015). For example, when predicting turnover intentions, traditional statistical methods (e.g., OLS regression) provide interpretable results, whereas supervised-ML models (e.g., deep neural networks) may provide more accurate predictions but sacrifice interpretability. Thus, there may be a trade-off. When appropriate, researchers should conduct and report traditional methods as an analytical baseline against supervised-ML techniques in terms of predictive accuracy (Artrith et al., 2021; Vollmer et al., 2020). This practice is recommended not only to compare their predictive power but also because even in relatively simple and interpretable ML models, such as regularized regression (e.g., lasso or elastic net), interpreting coefficients can still present challenges. In the service of prediction, regularized regression coefficients are shrunk toward zero in hopes of improving predictive performance in an independent sample (Hastie et al., 2009); this will, of course, be more likely to the extent that future samples come from the same population as the original sample on which the ML model was trained and tested. Sometimes, traditional statistical methods are simply not feasible, such as when the number of variables far exceeds the number of cases, which may, in fact, motivate a researcher’s decision to conduct a supervised-ML analysis in the first place. In these cases, interpretable-ML models that provide variable coefficients (e.g., lasso or elastic net) are recommended as a baseline against more complex supervised-ML models. 4
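A baseline comparison of this kind might be implemented as in the following sketch (scikit-learn on simulated data, not any particular study’s analysis), which fits cross-validated OLS and lasso side by side and shows how the lasso shrinks coefficients to exactly zero:

# Illustrative sketch: report a traditional/interpretable baseline (OLS, lasso)
# alongside more complex supervised-ML models.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=20.0, random_state=0)

ols = LinearRegression()
lasso = LassoCV(cv=5, random_state=0)  # penalty strength chosen by cross-validation

for name, model in [("OLS", ols), ("Lasso", lasso)]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean cross-validated R^2:", r2.mean())

# The lasso shrinks many coefficients exactly to zero, aiding interpretation.
lasso.fit(X, y)
print("Nonzero lasso coefficients:", np.sum(lasso.coef_ != 0), "of", X.shape[1])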
An example of explaining the rationale for using supervised ML can be found in Joel et al. (2020), in which ML was used to investigate relationship quality. Note that their discussion addressed both the general rationale for adopting supervised ML and the specific justification for selecting random forests as the algorithm: Each dataset was analyzed using Random Forests (24), a machine-learning method designed to handle many predictors at once while minimizing overfitting (i.e., fitting a model so tightly to a particular dataset that it will not replicate in other datasets). The Random Forests method builds on classification and regression trees (25). Specifically, using a random subset of predictors and participants, the Random Forests method tests the strength of each available predictor one at a time through a process called recursive partitioning. It builds a decision tree out of the strongest available predictors and tests the tree’s overall predictive power on a subset of data that were not used to construct the tree (also called the “out of bag” sample). The Random Forests method does this repeatedly, separately bootstrapping thousands of decision trees and then averaging them together. Results reveal how much variance in the dependent measure was predictable and which predictors made the largest contributions to the model. Random Forests are nonparametric—they do not impose a particular structure on the data—and as such they are able to capture nonlinear relationships, including interactions among the predictors (26). For example, a model with actor- and partner-reported predictors would detect any robust actor × partner interactions (e.g., moderation, attenuation effects, matching effects) that could not be captured in a model featuring actor- or partner-reported predictors alone. (Joel et al., 2020, p. 19063)
Researchers should articulate why they chose to use supervised-ML techniques to answer their research questions, and when feasible, researchers should compare the application and results of those techniques with those of traditional (or more interpretable) statistical techniques. Overall, we argue that the use of supervised-ML algorithms should be explicitly justified in the context of the research question(s) and data rather than relying solely on the novelty of the ML methods.
Framing research questions as a supervised-ML task
Apart from specifying the research questions and providing the rationale for using supervised-ML techniques, authors also need to translate or operationalize their research questions in terms of specific ML tasks (W. Luo et al., 2016; Moons et al., 2015). For example, the research question of predicting employee turnover (a binary outcome) from job satisfaction (a continuous predictor) would be considered a classification problem in supervised ML. Understanding how this computer-science terminology maps onto the terminology used in psychology for the same purpose (e.g., logistic regression in this case) facilitates cross-disciplinary collaboration (König et al., 2020). For example, for a common organizational problem of job-role assignments, Varshney et al. (2014) phrased their research question in computer-science terms: We approach the predictive modeling problem as a classification problem. Since we have veracious data on a reasonable fraction of employees’ job roles and job role specialties, we can further formulate the problem via supervised multi-category classification, using the veracious fraction of employees’ data as a training set. (p. 1730)
We recommend that when the goal is cross-disciplinary collaboration, psychology researchers should translate their research questions into ML tasks right from the outset, in the introduction section. For instance, if supervised ML is used, is it a regression or classification task? What variables are predictors (i.e., inputs), and what variable is the outcome (i.e., output)?
Selecting and justifying the supervised-ML models used
Once the authors phrase their research question as a supervised-ML prediction problem, the next step is to report the specific supervised-ML algorithm being used and explain why it was selected (e.g., its benefits, any limitations, and why it might have been preferred over other ML algorithms). That is, researchers can begin to evaluate the types of ML models that are suitable given the nature of the defined task and the data set(s) employed.
Some traditional statistical methods may suggest only one obvious analysis for a research question, such as a linear regression analysis when multiple predictors predict a continuous outcome of interest or a t test when two group means are being compared on a single continuous variable. In contrast, other widely used analytic approaches may require specifying multiple models and formally comparing them before researchers determine which model best fits the data, such as models with different numbers of latent factors in confirmatory factor analysis or full versus partial mediation models in structural equation modeling.
In a somewhat analogous manner, there is a wide range of ML algorithms to choose from when conducting supervised ML (e.g., more than 100 are listed at https://topepo.github.io/caret/available-models.html). Therefore, researchers typically do not test and compare all supervised-ML models; however, comparing at least a few makes good sense. For instance, when predicting turnover, one could compare cross-validated logistic regression with regularization, support vector machines, tree-based models (e.g., random forests or gradient boosting), neural networks, and many other variants or alternative supervised-ML models. When selecting the set of ML models to answer a specific research question, several factors may affect researchers’ decision-making process. Researchers may begin by identifying categories of supervised-ML models that allow for exploring different advantages and limitations when setting up the prediction problem. For instance, does the researcher seek to understand the nature of relationships, and if so, are they linear or nonlinear (e.g., additive or multiplicative, respectively)? Or is the prediction of outcomes the sole goal? As discussed above, if the primary goal is to gain insights into underlying relationships, supervised ML using more explainable regression models is more suitable, such as cross-validated OLS, lasso, and logistic regression. Conversely, when the primary goal is accurate prediction rather than relationship interpretation, more complex supervised-ML methods may be appropriate, such as tree-based models (e.g., random forests, extreme gradient boosting), support vector machines, and neural networks (Y. Luo et al., 2019).
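A model comparison of the kind just described might be sketched as follows (Python with scikit-learn on simulated data; the candidate set and settings are illustrative), evaluating several model families under identical cross-validation folds:

# Illustrative sketch: compare a few candidate model families under identical
# cross-validation before committing to one.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=25, random_state=42)

candidates = {
    "Regularized logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": SVC(),
    "Random forest": RandomForestClassifier(random_state=42),
    "Gradient boosting": GradientBoostingClassifier(random_state=42),
}

# Identical folds for every model make the comparison fair and reportable.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(name, "mean accuracy:", scores.mean(), "SD:", scores.std())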
In practice, researchers can select one or two specific supervised-ML models in each category to effectively examine these assumptions. If researchers find that a tree-based model performs much better predictively than regression-based models, for instance, this suggests that the relationships between the predictors and the outcome may be nonlinear in ways far more complex than the interaction or quadratic effects in regression models. Researchers should also ensure that, within each category, models adopting distinct approaches are selected. For instance, among tree-based models, researchers can select random forests, a bootstrap-aggregating (i.e., bagging) tree-based model, and extreme gradient boosting, a stagewise tree-based model. In addition, supervised ML allows for various forms of model “ensembling,” in which the final model is based on multiple predictions that are weighted and averaged either within or across different ML methods and data sets. If model ensembling is employed, it must be transparently reported.
Moreover, the nature and size of the data set also influence the selection of supervised-ML algorithms. For example, algorithms such as logistic regression, lasso, or elastic net are well suited for structured numeric data, whereas natural-language-processing techniques combined with models such as support vector machines or neural networks are more suitable for textual (word-based) data. In addition to the data type, the volume of data also influences the selection of the algorithm. When the data set is very large, algorithms such as stochastic gradient descent linear models and extreme gradient boosting may be preferred for their scalability and ability to handle high-dimensional data (Bottou, 2010; Chen & Guestrin, 2016).
Overall, researchers should base their model-selection decisions on their research questions, data type, data volume, and the strengths/weaknesses of different supervised-ML models (e.g., considering any trade-offs between ML interpretation and prediction). We strongly recommend that researchers report more transparently how they decided among the available supervised-ML models with respect to addressing the research question before they selected the ML algorithm(s) used. Such information is vital for future researchers not only to understand a given ML research project but also to compare and extend study findings and learn from these key methodological decisions (Artrith et al., 2021).
Varshney et al. (2014) provided a good example of explaining how their research question and data formats affected their choice of ML models and why some other similar or popular models were not chosen: For the skills analytics problem we are facing, the most direct and appropriate formulation is supervised multi-category classification. Moreover, due to the business structure, we learn separate classifiers for the different LOBs [line of businesses] within IBM because there are different valid class labels and different feature distributions among the different LOBs. Multi-task learning could be possible in this setting to do joint training for different LOBs, but we choose not to pursue this direction because it introduces unnecessary complexity. . . . We compare four one-against-the-rest multi-category classification algorithms: linear logistic regression with l2 and l1 regularization, linear support vector machine, and naïve Bayes. The regularization parameters for the first three models are found by cross-validation. (p. 1732)
In summary, we recommend authors (a) articulate the necessity and utility of supervised-ML techniques in addressing their research question, (b) briefly summarize prior methods applied to address the current problem (e.g., traditional statistical methods or other supervised-ML techniques), and (c) to the best of their knowledge, state the benefits of selecting their ML algorithms over other alternatives vis-à-vis the research question being investigated. As with traditional statistical methods, we advocate parsimony in ML algorithms, starting with simple algorithms for the research problem and data at hand and using a less parsimonious model only if the predictive benefit justifies the added complexity and any reduction in interpretability.
Sample section
After introducing the research question(s) and justifying the chosen ML methods, the next step is to describe the data sets used and the sample characteristics.
Data source
Given the abundance of data (both structured and unstructured) analyzed by supervised ML, psychologists need to clarify the data sources in their studies (W. Luo et al., 2016; Moons et al., 2015). This is particularly true because even though analytic toolboxes are expanding, some data selected by one researcher may be considered unfamiliar or novel to the general audience. Oswald et al. (2020) suggested that in the era of big data, researchers may consider incorporating some forms of “incidental data” (e.g., emails, video data, tweets) to address relevant research questions. Compared with the more “intentional data” that come from traditional measures common in the literature and familiar to most researchers (e.g., survey responses), there is often less knowledge or certainty regarding the nature of incidental data and therefore, its predictive value. Thus, researchers should provide detailed descriptions of where and how their data were obtained and what the data look like (e.g., human-resources information-system records, annual climate-survey archival data, social media posts; Winkler-Schwartz et al., 2019). For example, for a data source that relies on web scraping, the data-procurement process should be clearly stated. Detailed information should be provided on where and what data were scraped, the chosen time frame, and the inclusion and exclusion criteria. For instance, Min et al. (2021) provided specific information about how their data set was obtained: Our text data for emotion analysis were collected via the Twitter API. We queried all available Tweets related to WFH [work from home] (i.e., Tweets including the following seven keywords: “WFH”, “work from home”, “working from home”, “work remotely”, “working remotely”, “remote work”, and “remote working”) over a four months period (March 01, 2020 to July 01, 2020) by searching through the historical Tweets database. In total, we collected 1.56 million Tweets posted by 706,142 distinct Twitter users. (p. 219)
One’s data source directly affects whether the relevant constructs are measured and how predictive those data are of the outcomes (Vollmer et al., 2020). Geiger et al. (2021) pointed out that in supervised ML, the predictive models derived from labeled training data are “only as good as the quality of that (training) data” (p. 1; also see Farrow et al., 2021). This echoes the long-standing phrase of “garbage-in, garbage-out,” first introduced in the early days of computing to emphasize how flawed or low-quality input data inevitably lead to flawed outputs (Babbage, 1864; Mellin, 1957). In the ML context, the labeled training data set guides the ML algorithm’s learning process, ultimately producing a trained algorithm that can be applied to independent data sets (e.g., holdout data, external data sets).
Thus, it will increase transparency if researchers provide comprehensive information about how the training data are obtained and about the labeling task procedure(s) if the training data are human-labeled. This information helps readers and reviewers better understand the nature of the data, thereby enhancing the construct validity and generalizability of the trained ML model and facilitating the interpretation of results. Even when ML models are cross-validated on a given data set, there is no guarantee that they will generalize to other data sets. However, a detailed description of the original data does provide some insight into the model’s potential for generalization.
When multiple data sets are available to address a research question using supervised ML, psychologists should explain why they chose a particular one. The general principles regarding research decision-making in the Ethical Principles of Psychologists and Code of Conduct (American Psychological Association [APA], 2017) also apply to data selection. The general principles are beneficence and nonmaleficence, fidelity and responsibility, integrity, justice, and respect for people’s rights and dignity. Selecting a data set solely because it produces favorable results constitutes a form of outcome-driven bias that undermines scientific transparency and reproducibility (Simmons et al., 2011; Weston et al., 2019). Ethical decision-making further requires that the chosen data set be collected under appropriate consent and confidentiality protections (Barchard & Williams, 2008; Kaiser, 2009). Considerations of fairness are also essential. For example, a researcher investigating adolescent depression could choose between a small, locally recruited sample and a large, publicly available longitudinal data set. Which data set is chosen should depend on the researcher’s goals. A locally recruited sample may simply have been chosen out of convenience, without consideration of whether it represents the population to which the research results are supposed to generalize. Alternatively, the local data set may be highly relevant precisely because generalizability is intended for that local setting, with its own unique demographic features. A large, publicly available data set, by contrast, may come with higher statistical power and wider external validity, depending on the degree of correspondence between the sample and the population of interest. Overall, data-set selection should be motivated not only by methodological fit but also by explicit ethical considerations of factors such as transparency, fairness, and participant protection.
Finally, and most importantly, any ethical concerns associated with data reporting, such as privacy or legal issues, should be considered (Richards & King, 2014). For example, sharing verbatim quotes is often discouraged in studies that use social media data (e.g., tweets) because it allows other people to trace the quote back to the individual who posted it (Golder et al., 2017). Below is an example of authors reporting the ethical concerns in their data-collection process: No private data or non-public information is used in this work. For human annotation (Section 6.1), we recruited our annotators from the linguistics departments of local universities through public advertisement with a specified pay rate. All of our annotators are senior undergraduate students or graduate students in linguistic majors who took this annotation as a part-time job. We pay them 60 CNY an hour. The local minimum salary in the year 2023 is 25.3 CNY per hour for part-time jobs. The annotation does not involve any personally sensitive information. (Li et al., 2024, p. 44)
Stamatis et al. (2021) also reported details about their participants and data-collection procedures: Participants included patients with OCD [obsessive compulsive disorder] (n = 90) and healthy controls (n = 59). Demographic information is provided in Table 1, which also contains information on OCD and sensory phenomena symptoms for the patient group. Participants were recruited from treatment centers, self-help groups, and Internet advertisements. All participants provided informed consent, and procedures were approved by the institutional review board. Participants completed a computerized neuropsychological battery and structured clinical interviews about sociodemographic and clinical symptoms, with only the OCD group responding to OCD-related symptom measures. (p. 81)
In summary, we strongly recommend researchers (a) provide detailed descriptions of where, when, and how research data are obtained for supervised ML and clarify what the raw data are and show what they look like if necessary; (b) provide key information about the data set(s) used (e.g., source of the data, demographics); (c) explain the usefulness of “new” data sources that are unfamiliar to many readers; and (d) ensure compliance with privacy or legal regulations when reporting data sources and examples.
Sample characteristics
Once the raw data are specified, sample characteristics should be described. These characteristics include demographic information (e.g., race/ethnicity, gender, age) and basic descriptive statistics (e.g., sample size, variable means and standard deviations, prevalence and patterns of missing data; W. Luo et al., 2016; Moons et al., 2015). When appropriate, intercorrelations among key variables should also be reported. 5 Initial data exploration is a tedious but critical step before conducting any further statistical analyses. It facilitates a better understanding of the data and informs decision-making in other analytical steps (e.g., preprocessing, model choice). We strongly recommend incorporating data visualizations because they can be especially effective in identifying trends, patterns, and anomalies that may be overlooked by means, standard deviations, correlations, and other statistical summaries. Interested readers can refer to Tay et al. (2018) for more information on various visualization tools.
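A minimal sketch of such initial data exploration (Python with pandas and matplotlib; the file and column names are hypothetical) might look as follows:

# Illustrative sketch: basic descriptives, missingness, and visual checks
# before any supervised-ML modeling (hypothetical file and column names).
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("study_data.csv")  # hypothetical data file

print(df.describe())                             # means, SDs, ranges
print(df.isna().mean().sort_values())            # proportion missing per variable
print(df["group"].value_counts(normalize=True))  # check for class imbalance

# Correlations and distributions often reveal what summary tables hide.
print(df.corr(numeric_only=True).round(2))
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()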
In addition, depending on the specific supervised-ML task, relevant statistics should be reported. For example, in a classification task, the existence of imbalanced data (i.e., disproportionate sample sizes for each group) should be reported, as should any method for addressing this issue. For example, in a study using supervised ML to examine how workstation types influence the coupling of neural and vascular activities of the prefrontal cortex, Alyan et al. (2020) provided detailed information about the participants included in their experiment: Twenty-three healthy adult volunteers with no history of psychological illness, musculoskeletal problems, or substance dependence took part in this study (mean age 28.6 ± 3.4 years, males, right-handed, mean height 1.7 ± 0.038 m, BMI > 18 and < 25). The research was approved by the Medical Research Ethics Committee (MREC) of Universiti Kuala Lumpur Royal College of Medicine Perak (UniKL RCMP). All procedures were conducted following the approved regulations and guidelines. All subjects signed informed consent under the MREC approval stipulations. (p. 218912)
To summarize, for sample characteristics, we strongly recommend that researchers report demographic information and basic descriptive statistics, as is commonly done in traditional studies. In addition, we recommend that researchers use visualization tools to report important initial findings, such as intercorrelations and prevalence of missing data, and to illustrate overall data trends and patterns (e.g., Moons et al., 2015).
Procedural information
As with traditional statistical analyses, empirical studies involving supervised ML must include detailed documentation of the data-handling procedures and the implementation of ML algorithms. Arguably, the importance of this procedural transparency requirement is heightened in ML research given the magnitude and complexity of the data and the fact that ML techniques are still relatively new to psychologists. Below, we discuss common concepts and techniques of various data-preprocessing, ML-modeling, and evaluation techniques that are most relevant to psychological research.
Description of statistical software and platforms
As with any analytic method, researchers need to document and report the analytic software (and packages, if any), including version numbers, used at each step of supervised-ML research, such as data preprocessing, model training, model testing, and model evaluation (Artrith et al., 2021; Vollmer et al., 2020).
Data preprocessing
The primary objective of data preprocessing is to transform raw data into a form that can be consumed by supervised-ML models (W. Luo et al., 2016; Winkler-Schwartz et al., 2019). In other words, in most situations, directly feeding raw data into a supervised-ML model is neither feasible (e.g., because of data types and missingness) nor good practice (e.g., raw data that are not cleaned and otherwise transformed can lead to unreliable results; Chu et al., 2016). The “data wrangling” process is essential (i.e., cleaning, transforming, and otherwise modifying data into a usable format), not only providing deeper insights into the data but also helping refine the research question and subsequent application of supervised ML (Artrith et al., 2021; Braun et al., 2018).
First, in the data-transformation step (i.e., the actions taken to change the form of the data), the data set is expected to change in the number of variables (also called “features”). This is applicable in many scenarios, such as when the data format is not readily consumable by supervised-ML models (e.g., audio or text data), when one wants to derive new data fields from existing ones (e.g., extract country and city from a GPS variable) or to combine and aggregate existing variables (e.g., aggregate multiple daily sales transactions to form a monthly revenue variable), and when data types need to be altered to address the research question (e.g., converting between categorical and continuous variables). Below is one example of reporting data transformation described by Hickman et al. (2021): Participant responses were transcribed using IBM Watson Speech to Text . . . and their full interview response was combined into a single document. Then, we first used Linguistic Inquiry and Word Count (LIWC; . . . ) to quantify verbal behavior. We used all directly counted non-punctuation variables from LIWC, including word count. (p. 1331)
Second, in the data-cleaning step (i.e., the actions taken to “fix” irregularities in the data), the data shape can change in both its length and width (i.e., the number of rows and columns, respectively). Several factors affect researchers’ decisions regarding how data cleaning is conducted. For example, when cleaning cases (the lowest unit of analysis, such as people), it is recommended to consider outlying data points and cases, missing data, careless responders, and undersampling/oversampling (Artrith et al., 2021). For cleaning variables, particularly when interpretability is a key objective for supervised ML (not just prediction), one needs to consider variables that are low-frequency, low-variance, linearly dependent, highly correlated, and more (Langley & Sage, 1994; Wu et al., 2008). For instance, in an article examining the reliability and validity of ML models for analyzing video interviews, Hickman et al. (2021) reported their data-cleaning methods in processing verbal behavior in video interviews: Before extracting the words and phrases, we first removed all numbers and punctuation from the transcripts, removed common stop words, transformed all text to lowercase, handled negation by appending words preceded by “not”, “n’t”, “cannot”, “never”, and “no” with the negator and an underscore, and stemmed the corpus. We removed all one- and two-word phrases that did not occur in at least 2% of the interviews. (p. 1331)
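The following sketch illustrates text cleaning of the kind Hickman et al. (2021) describe; it is not their actual code, and the documents and thresholds are illustrative:

# Illustrative sketch of text cleaning: lowercase, strip numbers and
# punctuation, drop stop words, and remove rare terms.
import re
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I never liked the 2nd task!", "The task was great, truly great."]

def clean(text):
    text = text.lower()                    # transform to lowercase
    text = re.sub(r"[0-9]+", " ", text)    # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    return text

# min_df=0.02 drops terms appearing in fewer than 2% of documents.
vectorizer = CountVectorizer(preprocessor=clean, stop_words="english",
                             min_df=0.02)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # resulting text features
print(X.toarray())                         # document-term counts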
For data transformation and cleaning, we strongly recommend that researchers report detailed procedural information on (a) data transformation and cleaning, (b) original data fields and formats, (c) data-processing procedures and techniques, and (d) resulting data fields and formats.
Model training and evaluation
The modeling stage of a supervised-ML study involves model training, evaluation, and tuning procedures. The main objective of supervised ML is to tune a model that minimizes prediction errors yet does not overfit the data (Artrith et al., 2021; Winkler-Schwartz et al., 2019). In the training step of supervised ML, the model learns from a training data set by improving model estimates that reduce errors in predicting the outcome. Critically, the trained ML model is then evaluated on independent holdout “test” sets of data to determine if the model generalizes, thereby helping ensure that the trained model ultimately selected will not be based on overfitting the training data in the first step. There are various ways to engage in this train-then-test process (e.g., single holdout, cross-validation, and bootstrap; Kohavi, 1995; Refaeilzadeh et al., 2009), but ultimately, the goal is to minimize prediction errors in the holdout test set. For more details of the specific methods, see Supplemental Material 3 in the Supplemental Material.
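The train-then-test process might be sketched as follows (Python with scikit-learn on simulated data), using both a single holdout set and k-fold cross-validation:

# Illustrative sketch of the train-then-test process: fit on training data,
# then evaluate on a holdout test set never used during training.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=800, n_features=30, noise=15.0, random_state=3)

# Single holdout: e.g., 80% of cases for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=3)
model = GradientBoostingRegressor(random_state=3).fit(X_train, y_train)
print("Holdout MSE:", mean_squared_error(y_test, model.predict(X_test)))

# k-fold cross-validation: every case serves exactly once as test data.
cv_mse = -cross_val_score(GradientBoostingRegressor(random_state=3), X, y,
                          cv=5, scoring="neg_mean_squared_error")
print("5-fold cross-validated MSE:", cv_mse.mean())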
In engaging with the process above, researchers are contending with a challenge known as the “bias-variance trade-off” (e.g., Pargent et al., 2023; Putka et al., 2018). “Bias” refers to the error that arises from a model that does not fully use the predictive information in the training data set, resulting in poorer predictions of the outcome variable (Kohavi, 1995). In supervised ML, high levels of bias indicate that the model is underfitting the data (e.g., underweighting and/or excluding relevant predictors). In contrast with bias, “variance” refers to the error introduced by an ML model’s sensitivity to minor fluctuations or noise present in the training data (e.g., including variables that are predictive only in a specific sample). The bias-variance trade-off suggests that one should optimize prediction by finding an equilibrium between underfitting and overfitting—that is, bias and variance as described above. Achieving this balance is of paramount importance when constructing supervised-ML models designed to perform effectively on unseen data, thus reducing generalization error (Boehmke & Greenwell, 2019).
To manage the bias-variance trade-off in supervised ML, various resampling methods are used to mimic how well the trained model would perform when it is applied to a new data set (Pargent et al., 2023). For a brief introduction to the widely used resampling methods in psychological studies, see Supplemental Material 3 in the Supplemental Material. One example of reporting resampling methods is Hickman et al. (2021), who reported the nested cross-validation information in their model training: Nested cross-validation with k = 10 involves splitting the data into ten equally sized parts (the outer folds). Then, nine of these parts (the outer training folds) are used to conduct a separate 10-fold (the inner folds) cross-validation to select the optimal elastic net hyperparameters (i.e., model selection) based solely on these nine outer folds. Next, the final model is trained on those nine folds using the optimal hyperparameters, and then that model’s accuracy is estimated on the outer test fold (i.e., model assessment). This process is repeated 10 times, using each of the ten outer folds only once for testing. (p. 1331)
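A nested cross-validation of the kind Hickman et al. (2021) describe can be sketched as follows (scikit-learn on simulated data; this is not the original authors’ code): the inner folds select the elastic net hyperparameters, and the outer folds estimate the tuned model’s accuracy.

# Illustrative sketch of nested cross-validation: inner folds select
# hyperparameters (model selection); outer folds estimate accuracy
# (model assessment).
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=400, n_features=60, noise=10.0, random_state=5)

param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}
inner = KFold(n_splits=10, shuffle=True, random_state=5)
outer = KFold(n_splits=10, shuffle=True, random_state=6)

# GridSearchCV handles hyperparameter selection on the inner folds ...
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=inner)

# ... while cross_val_score assesses the tuned model on the outer folds.
outer_scores = cross_val_score(search, X, y, cv=outer, scoring="r2")
print("Nested 10x10 CV: mean R^2 =", round(outer_scores.mean(), 3))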
Hyperparameter tuning
“Hyperparameter tuning” refers to adjusting ML-model hyperparameters (or a subset of them) to help find values that optimize the prediction of unseen data in a test set. For example, in the random-forests algorithm, adjusting hyperparameters such as the number of trees, the depth of the trees, and the number of variables considered at each node can improve subsequent model performance. To reduce the risk of overfitting in models, hyperparameter tuning is typically conducted using nested cross-validation procedures (Cawley & Talbot, 2010).
Frequently used techniques for hyperparameter optimization include model-free optimization, gradient-based optimization, and Bayesian optimization, although there are others (Feurer & Hutter, 2019; Yang & Shami, 2020). In model-free optimization, grid search and random search are two common options. Grid search exhaustively considers all possible combinations of hyperparameters in a defined space, whereas random search randomly samples points in a defined hyperparameter space. Gradient-based optimization attempts to move hyperparameter estimates efficiently toward a global error-minimizing criterion. Bayesian optimization determines hyperparameter estimates based on information about previous estimates (Brochu et al., 2010), employing probabilistic models to iteratively guide the search for optimal hyperparameters by balancing exploration of unexplored regions and exploitation of promising areas in the hyperparameter space (Shahriari et al., 2015). This method efficiently improves model performance by learning from each iteration’s results and adapting its search strategy accordingly. To summarize, hyperparameter tuning is recommended, and its methods and use need to be described in sufficient detail for the purpose of reproducibility. Below is an example of reporting hyperparameter tuning in training supervised-ML models (i.e., random forest; Douglas et al., 2023): Random forest analyses were conducted using the ranger R-package (Wright & Ziegler, 2015). The forest consisted of 1000 trees. Two tuning parameters of random forests are the number of candidate variables to consider at each split of each tree, and the minimum node size resulting from a split. The optimal tuning parameters were selected by minimizing the out-of-bag mean squared error (MSE) using model-based optimization with the R-package tuneRanger; in large datasets, this approach is equivalent to cross-validation (Probst et al., 2019). The best model considered 30 candidate variables at each split, and a minimum of seven cases per terminal node. We report the results of this best model. (Douglas et al., 2023, p. 1195)
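The model-free approaches above can be sketched as follows (Python with scikit-learn on simulated data; the grids and parameter ranges are illustrative, and Bayesian optimization, which typically requires additional packages, is omitted):

# Illustrative sketch of model-free hyperparameter optimization:
# grid search (exhaustive) versus random search (sampled).
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=20, random_state=11)

# Grid search: every combination in a defined hyperparameter space.
grid = GridSearchCV(
    RandomForestClassifier(random_state=11),
    param_grid={"n_estimators": [200, 500], "max_depth": [5, 10, None],
                "max_features": ["sqrt", 0.5]},
    cv=5,
).fit(X, y)
print("Grid search best:", grid.best_params_)

# Random search: a fixed number of randomly sampled points in that space.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=11),
    param_distributions={"n_estimators": randint(100, 600),
                         "max_depth": randint(3, 15)},
    n_iter=20, cv=5, random_state=11,
).fit(X, y)
print("Random search best:", rand.best_params_)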
A large number of metrics are available for evaluating the performance of supervised-ML models. Naturally, researchers must understand and justify the criteria used to evaluate the performance of their supervised-ML model in light of the research questions, the nature of the data, and the model selected (W. Luo et al., 2016). Then, the performance metrics and their calculations must be clearly described so that others can understand this evaluation process and replicate it if needed. For more details on evaluation metrics, see Supplemental Material 4 in the Supplemental Material.
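For instance, a regression model’s performance is often summarized with several complementary metrics; a minimal Python sketch (scikit-learn; toy numbers) shows how such metrics can be computed and reported together:

```python
# Illustrative computation of common regression metrics (toy data).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.2])  # observed criterion values
y_pred = np.array([2.8, 4.6, 3.0, 6.5, 4.0])  # model predictions

mse = mean_squared_error(y_true, y_pred)
print(f"MSE  = {mse:.3f}")
print(f"RMSE = {mse ** 0.5:.3f}")  # same units as the criterion
print(f"MAE  = {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R^2  = {r2_score(y_true, y_pred):.3f}")
```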
In addition, fundamental concerns for reliability and validity are also worth investigating, particularly in psychological research (e.g., Koenig et al., 2023; Luciano et al., 2018; Tay et al., 2020), even if they are reported in metrics relevant to ML instead of psychometrics. Whereas standard ML metrics, such as accuracy or area under the curve (AUC), quantify predictive performance, they do not assess the empirical stability (reliability) of model variables or the construct relevance (validity) of model predictions. Reliability in psychological studies using ML can be examined through test-retest consistency (e.g., whether model predictions change across measurement occasions; Fan et al., 2023; Hickman et al., 2021). Validity, especially construct validity, helps to ensure that the values predicted by supervised ML align closely with the underlying theoretical constructs. That is, information about reliability and validity plays a crucial role in evaluating ML models in psychological research. Just because ML models are sophisticated does not mean that concerns for reliability and validity disappear; in fact, those concerns may be heightened because the investigation and evidence of reliability and validity may be ignored, diminished, or dismissed under the lure of ML sophistication.
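One simple way to examine such test-retest consistency is to correlate a trained model’s predictions for the same participants across two occasions; the following Python sketch uses simulated predictions purely for illustration:

```python
# Illustrative test-retest check: correlate one model's predictions for the
# same participants at two time points (simulated data, not a real analysis).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
pred_t1 = rng.normal(size=100)                        # predictions at Time 1
pred_t2 = pred_t1 + rng.normal(scale=0.3, size=100)   # predictions at Time 2

r, p = pearsonr(pred_t1, pred_t2)
print(f"Test-retest correlation of predictions: r = {r:.2f}")
```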
Sharing program code
We strongly recommend that researchers share fully commented program code to facilitate open science, credibility, and replicability of the research (e.g., Artrith et al., 2021; Heil et al., 2021; Vollmer et al., 2020). The process of modeling involves several steps (as outlined above) in which even small variations can result in drastically different outcomes. For example, during the model-training process, optimization methods (e.g., stochastic gradient descent) and their configuration hyperparameters (e.g., stopping criteria, learning rate, and random seeds) are critical for replicability. Because it may be difficult for researchers to report every single detail given the limited space in a write-up, we encourage using open-source platforms that facilitate code sharing and collaboration (e.g., GitHub, OSF). We acknowledge that there may be cases in which sharing code is not possible, for example, when researchers collaborate with corporations that restrict researchers from openly sharing code or when code is built on components with restrictive licenses. In such situations, pseudocode (i.e., a verbal description of the programming code) may be shared as an alternative to actual code to help readers understand and replicate the studies on their own. For an example of pseudocode, see Supplemental Material 5 in the Supplemental Material.
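As a small example of the details that shared code can make explicit, the following Python sketch (illustrative; the file name and configuration values are hypothetical) fixes random seeds and saves the exact model configuration alongside the analysis:

```python
# Illustrative reproducibility practices for shared analysis code:
# fix random seeds and log the exact configuration used.
import json
import random

import numpy as np
from sklearn.ensemble import RandomForestRegressor

SEED = 12345  # arbitrary example; report whatever seed was actually used
random.seed(SEED)
np.random.seed(SEED)

config = {"model": "RandomForestRegressor", "n_estimators": 1000,
          "min_samples_leaf": 7, "seed": SEED}
model = RandomForestRegressor(n_estimators=config["n_estimators"],
                              min_samples_leaf=config["min_samples_leaf"],
                              random_state=SEED)

# Save the configuration next to the code (e.g., in a GitHub or OSF repository).
with open("model_config.json", "w") as f:
    json.dump(config, f, indent=2)
```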
Results section
Following a thorough description of the ML-model information, it is essential to provide details on model-evaluation criteria and performance (e.g., Kakarmath et al., 2020). Subsequently, it is recommended to explain and interpret the implications of these performance metrics in the context of the study. For an introduction to different performance metrics for regression and classification tasks, see Supplemental Material 4 in the Supplemental Material. Different research questions may warrant different emphases on these metrics. For example, in a security-clearance task, recall (sensitivity) may be the most useful metric because correctly identifying weapons (true positives) should be a priority even if it comes at the expense of flagging more nonweapons as weapons (false positives). We recommend that authors (a) consider reporting multiple evaluation metrics to provide a more multifaceted picture of how their ML algorithm performs and (b) explain their rationale for selecting the evaluation metrics based on their research question or sample characteristics. Rationales that authors can provide for selecting appropriate evaluation metrics include (a) the research question being a supervised-ML prediction/classification task, (b) the distribution of the outcome variable (e.g., continuous: normally distributed or skewed; categorical: balanced or unbalanced), and (c) the relative costs of missing a true positive versus missing a true negative in a classification task, which determine whether metrics that weigh both error types (e.g., accuracy, the receiver-operating characteristic curve, and AUC) are appropriate for identifying both true positives and true negatives.
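A short Python sketch (scikit-learn; toy labels and an arbitrary 0.5 threshold) illustrates what reporting multiple complementary classification metrics can look like:

```python
# Illustrative multi-metric evaluation of a binary classifier (toy data).
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.35, 0.7, 0.5])
y_pred = (y_prob >= 0.5).astype(int)  # 0.5 threshold is an arbitrary example

print(f"Accuracy  = {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision = {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP)
print(f"Recall    = {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN)
print(f"F1        = {f1_score(y_true, y_pred):.2f}")
print(f"AUC       = {roc_auc_score(y_true, y_prob):.2f}")    # uses probabilities
```

Reporting these side by side makes the trade-offs visible; for example, an imbalanced outcome can yield high accuracy even when recall is poor.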
Model explanation
Supervised ML holds great promise for psychological science by producing more accurate and generalizable predictions than traditional statistical modeling. That said, we note that psychological science is not limited to maximizing prediction; it also seeks to understand why an algorithm is predictive. The field relies on explainable methods and results so that researchers can advance the fundamental knowledge of psychological science and guide practice (e.g., Yarkoni & Westfall, 2017).
When relating supervised-ML results back to the originating research questions, researchers may also consider providing both global and local explanations for the findings, allowing them to be interpreted from both empirical and conceptual perspectives (Lundberg et al., 2020; Ribeiro et al., 2016). For example, in a model predicting turnover behavior (e.g., Min et al., 2024), researchers may want to identify the most important features for predicting employee turnover and, in particular, who is likely to leave a company. A global explanation addresses the former question: the process of determining the extent to which the predictors or their interactions contribute to the overall prediction of an ML model across all the data (Du et al., 2019). A local explanation addresses the latter question: the process of understanding how a prediction is derived for a specific observation (Molnar et al., 2021).
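To make the distinction concrete, the following Python sketch computes a global explanation with scikit-learn’s permutation importance and a local explanation for a single observation with SHAP values; it assumes the third-party shap package is installed, and the data are synthetic:

```python
# Illustrative global vs. local explanations (synthetic data).
import shap  # third-party package: pip install shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Global explanation: how much does shuffling each feature hurt performance
# across the whole data set?
global_imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("Global importances:", global_imp.importances_mean.round(3))

# Local explanation: how did each feature push this one case's prediction?
explainer = shap.TreeExplainer(model)
print("Local SHAP values:", explainer.shap_values(X[:1]))
```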
Discussion section
Interpretation of ML results
Apart from the numerical results one should report in the Results section, it is also necessary to interpret those results as best one can. In other words, ML research should strive to explain not only how or what findings were obtained with ML but also why those ML results might matter (Moons et al., 2015). Even when ML algorithms lack full transparency and interpretability, research often benefits from making some attempt to connect the data and ML findings to the research literature and theoretical frameworks that have been used to develop related hypotheses or research questions. In addition, supervised-ML findings may have future theoretical and practical implications that the authors can reflect on in their writing.
As with any algorithm, supervised-ML algorithms require humans to interpret the real-world meaning or usefulness of their predictive findings. For example, obtaining a more accurate prediction than traditional methods does not necessarily mean that supervised ML is better if interpretability is sacrificed through the opaqueness of ML. Indeed, interpretability can be broadly understood as the ability to present information in understandable terms to a human (Doshi-Velez & Kim, 2017), and it is key in reporting supervised-ML-model results because it helps researchers understand, appropriately trust, and effectively manage answers to research questions (Rossi et al., 2022). Conversely, a lack of interpretability can occur when a complex model, such as gradient boosting or random forests, achieves high predictive accuracy but offers little insight into how individual predictors influence the outcome. For example, a supervised-ML model predicting personality traits might claim 90% accuracy in predicting people’s level of extraversion from the available data yet provide no clear explanation of what variables might be driving the predictions, such as social media usage patterns, preferred leisure activities, or daily sleep patterns. How extraversion and its correlates are measured in the first place is also a key question here before one even begins to collect, analyze, and interpret the data.
Consistent with our assertions, psychological scientists have already been underscoring the critical need to better understand how to use supervised-ML techniques and explain them (e.g., Tonidandel et al., 2015). To improve interpretability in supervised-ML models, researchers can draw from several techniques across different stages of analysis/modeling. Premodeling approaches include using construct-relevant, well-labeled input data and conducting exploratory data analysis to better understand variable relationships (Aguinis & Edwards, 2014; Tukey, 1977). During the modeling stage, researchers can choose intrinsically interpretable models—such as lasso regression and other sparse linear models—that offer greater transparency through simulatability and decomposability (Lipton, 2017; Rudin, 2019). In the postmodeling stage, widely used tools include permutation feature importance (Breiman, 2001), partial dependence plots (Hastie et al., 2009), and Shapley values (Lundberg & Lee, 2017), which help explain how features contribute to predictions at both global and local levels.
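As one example of these postmodeling tools, a partial dependence plot can be generated in a few lines with scikit-learn (an illustrative sketch on synthetic data; the file name is hypothetical):

```python
# Illustrative partial dependence plot for a single feature (synthetic data).
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Plot how the model's average prediction changes across values of feature 0.
PartialDependenceDisplay.from_estimator(model, X, features=[0])
plt.savefig("pdp_feature0.png")  # save for scripted (noninteractive) runs
```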
Limitations
As with any study, authors should acknowledge the limitations of their supervised-ML study, such as the potential biases of the data sets, the extent to which the findings appear to be generalizable, and whether the results are robust across different analytic methods. Regarding this latter point, it can often be useful to compare different ML models with at least one traditional analytic method (as recommended above; Kakarmath et al., 2020; W. Luo et al., 2016). For a checklist summarizing all recommended steps for reporting psychological ML research, see Table 3.
General Discussion
Supervised-ML techniques are being increasingly adopted among psychological researchers and practitioners. As their use increases, we hope that the checklist we offer, although general in nature, will promote open-science practices, including more transparent reporting in published supervised-ML studies. When possible and of interest, reproducible and replicable research can also be conducted. Without such a checklist in hand, psychological researchers stand to report ML results much less efficiently. For example, when designing a study, individual researchers may invest a substantial amount of time and effort in learning how other researchers and disciplines report their ML results to help determine their own approach. Then, after their manuscript is submitted to a journal, they are likely to be required to revise the ML reporting in their manuscript further given feedback from reviewers and editors who themselves lack helpful guidance. We suggest this possibility because many of us have experienced this issue firsthand. Thus, the checklist offered here reflects an initial step toward specifying the necessary information in psychological studies involving supervised ML, which we hope future researchers will adapt and extend further.
Following previous articles providing methodological guidelines in psychological research (e.g., Aguinis et al., 2018; Eby et al., 2020; Newman, 2014) and guidelines for reporting ML studies in computer science (e.g., M. Mitchell et al., 2019; Pineau et al., 2021), in this article, we provide general guidelines for reporting psychological-research results that use supervised-ML techniques. In the current article, we explicitly list key information that authors need to report when writing up a supervised-ML-technique-based study. Supporting the need for this information, we provide examples of reporting results from previous ML-based studies. We also offer a useful but abbreviated version of a supervised-ML checklist (Table 3) that researchers can follow in their study design and manuscript writing, which helps authors provide all necessary information for future researchers to understand and replicate their findings. The checklist can also be highly beneficial for journal editors and reviewers in that (a) it can inform and improve the standardization of instructions to authors and review procedures and (b) it helps reviewers “identify and attempt to minimize questionable research practices and the exploitation of methodological gray areas in submitted manuscripts” (Aguinis et al., 2020, p. 47).
Supervised-ML techniques are useful tools for psychological researchers and practitioners seeking to mine large data sets and work with other disciplines in doing so. Supervised ML is also an area in which inductive research may see progress (e.g., McAbee et al., 2017), although the choice of ML algorithms requires further attention (e.g., when selecting more interpretable algorithms is crucial). Moreover, supervised-ML techniques enable psychological researchers to analyze diverse, large, and messy data sets, such as text narratives, two-dimensional and three-dimensional images, and video. By making data from multiple sources more accessible, supervised ML holds the promise (subject to empirical support) of reducing common method bias and forms of social desirability (e.g., Podsakoff et al., 2024). In addition, supervised ML has been shown to be useful in adding the convenience of automation to text and video analyses, reducing both time and cost (e.g., Guo et al., 2024; Hickman et al., 2021; Iliev et al., 2015).
Despite this promise, a lack of transparency in reporting major decisions in supervised-ML analysis presents scientific, ethical, and legal impediments to the growth of such applications in the field. In response, the supervised-ML checklist presented here offers a path forward in fostering transparency and replicability in supervised-ML-based studies, ultimately increasing the accessibility of these techniques to psychological research. But a checklist is not meant to encourage box-checking by psychological researchers; quite the opposite. The checklist helps researchers systematically consider each of their decisions when conducting supervised-ML analysis and then explain those decisions in their manuscripts. We make this statement while acknowledging that when submitting a manuscript, authors must often adhere to the word limits imposed by a journal. However, by referring to open platforms (e.g., GitHub and OSF), authors can share all additional materials relevant to their work, such as annotated code (e.g., preprocessing and analysis code), materials (e.g., variable codebooks, measures, and detailed sample information), and the data set, if possible. In these ways, the current article fosters open science by improving reporting quality and increasing the transparency of supervised-ML applications in psychological research.
We hope our work inspires future research in several ways. First, we focus on supervised-ML reporting; we do not cover reporting procedures for other ML paradigms, such as unsupervised learning or reinforcement learning. Clearly, unsupervised ML and reinforcement learning will also continue to be important analytical tools for researchers. Valtonen et al.’s (2022) article focused on conducting unsupervised ML in organizational research, specifically, text mining, outlining the major steps and decisions in the analysis, and providing guidelines for promoting reproducibility and accountability. Reinforcement learning involves learning from environmental feedback rather than labeled data (Sutton & Barto, 1998). Likewise, the rapid adoption of large language models leads to new challenges and opportunities in reporting and interpretability. Future work could adopt a similar approach to develop guidelines for other ML paradigms in psychological research.
Another limitation of our work is that we emphasize only the common key steps in supervised-ML reporting; we cannot cover all supervised-ML algorithms and the corresponding aspects of the analysis procedures in each specific technique. For example, deep-learning algorithms typically involve more complex decision-making processes and adopt additional metrics for model evaluation, which cannot be introduced in detail in this article. However, those procedures should also be reported transparently by annotating and sharing the complete code. In addition, in this article, we provide a general set of recommendations that may not be applicable to all studies. Authors are encouraged to adapt these recommendations to the specific nature of their own studies. Moreover, we fully expect that the guidelines and checklist provided here should evolve as supervised ML progresses; on the other hand, we argue that most of the critical questions raised in the current guidelines will apply universally for current and future ML methods because the information provided is general and essential for scientific transparency, guidance, and understanding regardless of the specific techniques applied.
Conclusion
Just as artificial intelligence, machine learning, and other major advances in technology are profoundly influencing various aspects of life and society, supervised-ML techniques are beginning to have a strong presence in psychological-research publications and practical applications. Like structural equation modeling, multilevel modeling, and many other statistical techniques when they were in their infancy in psychological research, supervised ML in psychological research will also greatly benefit from increased guidance in analytic decision-making. Furthermore, reporting those decisions in supervised-ML analyses using the current checklist to provide structure will improve open science in terms of research transparency and the sharing of knowledge that informs and improves future psychological research and applications.