Abstract
Concerns for the replicability, reliability, and generalizability of MRI and functional MRI (fMRI) research have led to debates over the contributions of sample size, open-science practices, and recruitment methods, particularly in the psychological sciences. Key to understanding the state of a science is an assessment of reporting practices. In this structured review, we evaluated select reporting practices across three domains: (a) demographic (e.g., age), (b) methodological (e.g., inclusion/exclusion criteria), and (c) open science and generalizability (e.g., preregistration, target-population identification). Included were 919 published MRI and fMRI studies from 2019 in nine top-ranked journals. Reporting across domains was infrequent; participant racial-ethnic identity (14.8%), reasons for missing imaging data (31.2%), and identification of a target population (19.4%) were particularly low. Reporting likelihood varied by study characteristics (e.g., journal) and was correlated across domains. Finally, study sample size but not reporting frequency was positively associated with 2-year citation counts. Results call for recentering transparency in reporting practices in MRI and fMRI studies, with direct implications for study generalizability.
Over the last several decades, structural and functional neuroimaging have become widespread tools to study thoughts, emotions, behavior, and health in psychological science. The human brain is a topic of public fascination; neuroscience research has affected major legal decisions (e.g., Roper v. Simmons, 2005) and is the subject of global scientific initiatives (e.g., the Brain Research Through Advancing Innovative Neurotechnologies initiative; Yuste & Bargmann, 2017). In combination with psychological inquiry, brain science further informs other fields (e.g., philosophy; Bennett & Hacker, 2022) and affects what people think makes them human (J. Greene, 2014) and even which behaviors people believe are “normal” versus “abnormal” (Gazzaniga, 2005).
However, for any scientific research to be useful to practitioners, policymakers, or the public, study findings must be reliable (i.e., the construct can be reproduced from repeated measurements), replicable (i.e., analysts can reproduce the same results given the same data and methodological plan), and externally valid (i.e., study findings generalize to a broader population). Most discussions on how to improve the utility of human-neuroimaging research have centered on the links between sample size, statistical power, and reliability (Flournoy et al., 2024) and the adoption of open-science practices that promote reproducibility and replicability (Nichols et al., 2017). A strong focus on analytical transparency is critical (e.g., Botvinik-Nezer et al., 2020), and several guidelines for reporting study practices, such as preprocessing workflows, scan parameters, and data sharing, already exist (Nichols et al., 2017; Niso et al., 2022). However, the decisions that researchers make (and report) in the production of study samples—from the strategies to recruit participants to the data filtering that leads to the ultimate analytic sample size—also need to be understood. In this structured review, we evaluate select reporting practices in nearly 1,000 published MRI and functional MRI (fMRI) neuroimaging studies across three domains: sociodemographic composition (e.g., sex, racial-ethnic identity), methods (e.g., recruitment method, quality control), and open science and generalizability.
Sample Sociodemographic Composition and Population Generalizability
Following long-standing calls to diversify sample representation in psychological research (Henrich et al., 2010), scientists have begun to attend to the sociodemographic composition of human-neuroimaging studies as a key indicator of population generalizability (Falk et al., 2013). Empirical evaluations of sample composition in select journals (Dotson & Duarte, 2020) or topical areas (Qu & Telzer, 2017) and perspective pieces on the state of the field (Garcini et al., 2022) indicate that participants in neuroimaging studies are more likely to be socioeconomically advantaged; of majority racial-ethnic and cultural groups; and/or from White, educated, industrialized, rich, democratic populations (Henrich et al., 2010). Yet there is not a comprehensive field-wide assessment of reporting practices in neuroimaging research that goes beyond racial-ethnic identity to include other key sociodemographic factors.
Sample diversity along intersecting sociodemographic dimensions has direct implications for the basic scientific understanding of human beings and the application of precision-medicine techniques to reduce disease burden and promote well-being. When predictive models using neuroimaging data are trained in homogeneous samples, the models perform worse in samples with greater sociodemographic diversity (A. S. Greene et al., 2022). In another example, Gard et al. (2023) leveraged the Adolescent Brain Cognitive Development Study to report that accounting for sampling weights, which up- or down-weighted participants so that the sample demographics matched the target population, led to substantially different patterns of associations between socioeconomic resources and several metrics of brain development.
Sampling and Recruitment
Once participants are recruited to participate in a neuroimaging study, eligibility requirements and quality-control procedures further reduce the number of participants with “usable” data to a final analytic sample, a process described as the “flow of participants” (American Psychological Association, 2020; Strengthening the Reporting of Observational Studies in Epidemiology [STROBE] reporting guidelines, von Elm et al., 2008). Each decision with respect to eligibility/ineligibility, quality control (e.g., removing participants with low behavioral performance), and missing data (e.g., listwise deletion) influences the sociodemographic makeup of the final analytic sample. To evaluate the degree of population generalizability in human-neuroimaging studies and to develop protocols to improve recruitment and retention, the field needs a clear understanding of from where and how participants are recruited and the decisions that contribute to the composition of the final analytic sample.
Open Science and Generalizability
Even in the case of successfully recruiting and retaining a generalizable sample, some research practices can undermine reliability, replicability, and generalizability. Open-science tools work to guard against questionable scientific practices, such as p-hacking and hypothesizing after the results are known (Kerr, 1998). “Multiverse analyses” (Botvinik-Nezer et al., 2020; Simonsohn et al., 2019); preregistration (Nosek et al., 2018); posting unthresholded statistical maps, parcellations, and atlases in open repositories, such as NeuroVault (Gorgolewski et al., 2015); and leveraging standardized Brain Imaging Data Structure (BIDS) file formats (Poldrack et al., 2024) have all been cited as paths to improving reliability and replicability (see also Nichols et al., 2017; Niso et al., 2022). However, authors’ assessment of the generalizability of their scientific findings is also a measure of transparency (Simons et al., 2017).
Empirical Evaluation of Reporting Practices
In this study, we attempt to address these knowledge gaps through a systematic evaluation of select reporting practices of all MRI and fMRI studies published in 2019 in top-ranked psychology-related journals. The first aim was to document reporting rates regarding sample demographic features (e.g., sex, racial-ethnic identity), methods (e.g., recruitment, quality control), and open-science and generalizability practices (e.g., preregistration). Although decisions related to analytic flexibility are also central in discussions of reliability, reproducibility, and generalizability, we opted to focus on researcher decisions and reporting practices most central to generalizability. Given the foundational role of study design in research-methods training, we hypothesized that studies would report methodological features at the highest rates but made no specific hypotheses about the exact rates of reporting. The second aim was to evaluate how study characteristics were associated with reporting in all three domains, but we made no explicit hypotheses. Third, the associations between study characteristics and sample size were evaluated, providing a much-needed update of previous metascience investigations (e.g., Button et al., 2013). We hypothesized that studies with larger sample sizes would be more likely to be composed of adult-age participants (i.e., because of lower compliance and data quality in child samples) and to leverage structural MRI (sMRI; i.e., because of the large influence of motion artifacts in fMRI). The last aim was to evaluate the degree to which reporting likelihood and study sample size were “valued” by the field, using citation counts as an indirect measure of value. Although we hypothesized that both reporting likelihood and study sample size would be positively associated with citation count, no comparative hypotheses were generated.
Method
Identification and selection of studies
This systematic review was conducted using Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009), and the protocol (but not the hypotheses) was preregistered (https://osf.io/6tpsh/) before target articles were identified, coded, or analyzed. Training procedures and deviations from the registered protocol are described in Table S4 in the Supplemental Material available online.
We first selected psychology-related journals to identify target articles published in 2019. The Web of Science Journal Citation Reports tool (Clarivate, 2024) was used to sort journals by impact factor in the categories of “psychiatry,” “neuroscience,” and “neuroimaging.” To balance the scope of the review with team resources, we chose the two highest-ranked journals in each category with the following exclusion criteria: (a) journals that primarily publish review articles, (b) journals that are not indexed on PubMed or PsycInfo, and (c) journals that do not include the words “neuroimaging” or “neuroscience” in the public statement of aims and scope. We also included three specialty journals for their coverage of human-neuroimaging research in cognitive, affective, and developmental domains (i.e., Developmental Cognitive Neuroscience, Journal of Cognitive Neuroscience, and Social Cognitive and Affective Neuroscience). In total, nine journals were included in the structured review (Table S5 in the Supplemental Material). In July 2020, articles published during 2019 in all selected journals were imported from PubMed or PsycInfo into the systematic-review software DistillerSR (2023).
Titles and abstracts were screened by two trained research assistants (RAs; A. A. Albrecht and A. Lurie), and conflicts were adjudicated by A. M. Gard (Polanin et al., 2019). Exclusion criteria included the following: (a) The article is a review article, commentary, systematic review, meta-analysis, case study, or methods article/technical report; (b) the study subjects are nonhuman (e.g., animals, cell lines); and (c) the study does not include MRI or fMRI data. Articles that included both MRI/fMRI and another imaging modality (e.g., functional near-infrared spectroscopy [fNIRS], electroencephalography [EEG], magnetoencephalography [MEG], positron emission tomography [PET], single photon emission computed tomography [SPECT], transcranial magnetic stimulation [TMS]) were excluded.
Data extraction
A coding system was developed by A. M. Gard, C. Mitchell, and L. W. Hyde. Codes were designed to capture the process of conducting a neuroimaging study, from recruitment to generalizing study findings to a broader population. Categories of codes included (a) global study features (e.g., participant age, study type [observational/intervention], imaging modality, smallest and largest analytic sample size), (b) sociodemographic information (e.g., gender or sex, race-ethnicity, socioeconomic resources), (c) methods (e.g., report of recruitment procedures, inclusion/exclusion criteria, quality-control procedures, missing values analyses), and (d) generalizability and open science (e.g., power analyses, preregistration, identification of target population). The codebook was generated using reporting guidelines from the seventh edition of the American Psychological Association (APA) style guide (APA, 2020) and STROBE (von Elm et al., 2008). Table 1 provides a summary of the codebook; for the complete codebook, see Table S1 in the Supplemental Material. Throughout the training phase and during the first several months of coding, the coding manual was updated iteratively with definitions of terms (e.g., intervention study criteria) and data-entry instructions (e.g., round numbers to the nearest percentage).
Full-Text Data-Extraction Codebook and Interrater Reliability
Note: One hundred thirty-one articles were double-coded (14.4% of the 919 total articles analyzed) in six research assistant–A. M. Gard pairs. However, interrater reliability could be estimated only for the three research assistants who coded the majority (97%) of the articles. The reported estimates reflect Conger’s kappas (i.e., for categorical codes) and ICCs (1, 1) (i.e., for continuous codes) weighted by the number of articles double-coded by each research assistant–A. M. Gard pair (weights = 0.56, 0.32, 0.12). ICCs were calculated as one-way random-effects models with consistency and a single measurement. ICC = intraclass correlation; κ = Conger’s kappa; fMRI = functional MRI; sMRI = structural MRI; NaN = not a number.
To ensure that all data were collected from each article, RAs coded whether an article cited previously published work to describe the methods. Subsequently, two coders referenced the cited articles to determine if the information was available. If so, the information in the cited article was used to code that article’s reporting practices (see Supplemental Methods in the Supplemental Material).
Articles that passed the abstract- and title-screening phase were randomized to one RA for full-text data extraction. To ensure reliability in the data-extraction phase, A. M. Gard extracted data for a random 15% of articles reviewed by each RA (Polanin et al., 2019). Interrater reliability between each RA and A. M. Gard was calculated for every extracted data point using Conger’s kappa (unweighted) for categorical data and intraclass correlations (one-way random effects, consistency, single measurement) for continuous data (Table 1).
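To make these reliability statistics concrete, the following sketch (our construction on toy data, not the authors’ published code, which is available on OSF) computes both metrics. With exactly two raters per pair, Conger’s kappa reduces to Cohen’s kappa, so sklearn’s implementation suffices; the ICC(1, 1) is computed with the pingouin package.

```python
# Toy illustration of the interrater-reliability statistics described above.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Categorical code (e.g., "Was race-ethnicity reported?") for six articles,
# double-coded by a research assistant (RA) and A. M. Gard (AMG).
ra_codes = ["yes", "no", "yes", "yes", "no", "yes"]
amg_codes = ["yes", "no", "yes", "no", "no", "yes"]
# With two raters, Conger's kappa is equivalent to Cohen's kappa.
print("kappa =", round(cohen_kappa_score(ra_codes, amg_codes), 3))

# Continuous code (e.g., analytic sample size) in long format:
# one row per article x rater combination.
long = pd.DataFrame({
    "article": list(range(6)) * 2,
    "rater": ["RA"] * 6 + ["AMG"] * 6,
    "rating": [55, 120, 33, 40, 87, 210,
               55, 118, 33, 44, 87, 205],
})
# ICC(1, 1): one-way random effects, single measurement (pingouin's "ICC1").
icc = pg.intraclass_corr(data=long, targets="article",
                         raters="rater", ratings="rating")
print(icc.loc[icc["Type"] == "ICC1", ["Type", "ICC", "CI95%"]])
```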
Analytic plan
The first aim was to document current reporting practices in three domains: (a) demographic characteristics; (b) methods, from recruitment to analysis; and (c) generalizability and open science. In addition to item-level frequencies, we constructed a cumulative reporting index for each domain. The demographics reporting index consisted of four items: race or ethnicity, a measure of socioeconomic resources, sample age, and sex or gender. The methods reporting index included seven items: recruitment strategy, initial recruitment efforts, date of data collection, inclusion/exclusion criteria, reasons for no imaging data, quality-control exclusions, and missing-values analyses. The generalizability and open-science reporting index included four items: identification of a target population to which the study would generalize, description of the limitations of study generalizability, preregistration, and prestudy power analyses. In the construction of the reporting indices, dichotomous items (e.g., Was sample sex or gender composition reported?) were coded as 1 = yes, reported or 0 = no, not reported. Categorical items with more than two options were coded in “strict” or “loose” form to account for variability in field definitions of what constitutes complete reporting. For example, for sample age, the strict reporting index assigned a value of 1 to studies that reported both age range and average age, studies that reported only average age or age range were assigned 0.5, and studies that reported nothing about sample age were assigned 0; the loose reporting index assigned studies with any information about sample age (i.e., average or range) a value of 1 (see Table S1 in the Supplemental Material). In the main text, we present the results of features coded in strict form; results for loose definitions of reporting are in Supplemental Results in the Supplemental Material.
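As an illustration of the strict versus loose scoring rules, consider the age item. The minimal sketch below (function names are ours, purely for illustration) encodes the rules just described.

```python
# Illustrative scoring of the age item under the "strict" and "loose" rules.

def score_age_strict(reported_mean: bool, reported_range: bool) -> float:
    """1 if both mean age and age range reported, 0.5 if only one, 0 if neither."""
    if reported_mean and reported_range:
        return 1.0
    if reported_mean or reported_range:
        return 0.5
    return 0.0

def score_age_loose(reported_mean: bool, reported_range: bool) -> float:
    """1 if any age information is reported, else 0."""
    return 1.0 if (reported_mean or reported_range) else 0.0

# A study reporting only mean age contributes 0.5 (strict) or 1.0 (loose)
# to its demographics reporting index, alongside the race-ethnicity,
# socioeconomic-resources, and sex-or-gender items.
assert score_age_strict(True, False) == 0.5
assert score_age_loose(True, False) == 1.0
```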
Second, study characteristics associated with reporting sociodemographic characteristics, methods, and open-science and generalizability practices were identified using multivariate linear regression. Study characteristics used to predict each of the three reporting indices included analytic sample size, sample age group, imaging modality, journal, consortia-study status, study type, and whether the study was composed of a patient (vs. community) sample. Categorical variables were entered as dummy variables such that the reference group was always the largest (e.g., for modality, sMRI studies). We also examined whether study characteristics were associated with analytic sample size (i.e., sample size was the dependent variable) using independent-samples t tests and one-way analysis of variance tests for categorical variables. Sample-size outliers (±3 SD) were first winsorized, and then sample size was log-transformed.
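Under the assumption that winsorizing means clipping values to the ±3-SD boundaries, and that the group comparisons were Welch’s t tests (consistent with the non-integer degrees of freedom reported in the Results), the sample-size preprocessing can be sketched as follows (data illustrative; the authors’ analytic code is on OSF):

```python
# Sketch of the sample-size preprocessing and one group comparison.
import numpy as np
from scipy import stats

def winsorize_3sd(x: np.ndarray) -> np.ndarray:
    """Clip values lying more than 3 SD from the mean to the 3-SD boundary."""
    lo, hi = x.mean() - 3 * x.std(), x.mean() + 3 * x.std()
    return np.clip(x, lo, hi)

n = np.array([5, 28, 55, 140, 900, 45_615], dtype=float)
log_n = np.log(winsorize_3sd(n))  # winsorize first, then log-transform

# e.g., consortium vs. nonconsortium studies (Welch's t test)
consortium = np.array([False, False, False, False, True, True])
t, p = stats.ttest_ind(log_n[consortium], log_n[~consortium], equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```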
Finally, we examined the extent to which reporting practices in each domain were “valued” in the field. Leveraging Clarivate’s Cited Reference Search tool (Clarivate, 2024), we constructed a measure of “value” defined as the number of times each article was cited. Citation frequency was recorded in March 2022. Multivariate linear-regression models were used to predict citation frequency; reporting indices, study characteristics, and sample size were entered as predictors. As with sample size, citation outliers (±3 SD) were first winsorized, and citation frequency was log-transformed following the addition of a constant = 1 (i.e., to account for articles with zero citations). Because both citation frequency and study sample size were log-transformed, unstandardized estimates can be interpreted in percentage-change terms.
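To make the percentage-change interpretation explicit, the citation model takes a log-log form (notation ours, matching the transformations described above):

```latex
\ln(c_i + 1) = \beta_0 + \beta_1 \ln(n_i) + \boldsymbol{\gamma}^{\top}\mathbf{x}_i + \varepsilon_i
```

where c_i is the citation count, n_i the winsorized sample size, and x_i the reporting indices and study characteristics. A 1-unit increase in ln(n_i) multiplies expected citations by e^{β1}, a 100(e^{β1} − 1)% ≈ 100·β1% change for small β1; equivalently, β1 approximates the percentage change in citations per 1% change in sample size.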
Transparency and openness
We adhered to the PRISMA 2020 guidelines for systematic reviews (Page et al., 2021). All data gathered as part of this empirical exercise and the analytic code used in the current article are publicly available in the OSF repository (https://osf.io/6tpsh/). The protocol for the structured review (but not the hypotheses or analytic plan) was preregistered prospectively, before data were gathered (also available at the OSF link above).
Results
What were the characteristics of studies included for review?
Initially, 3,856 articles were identified (Fig. 1). During the title- and abstract-screening phase, 2,847 articles were excluded because the subjects were nonhuman (n = 694); the study was a review article, commentary, systematic review, meta-analysis, case study, or methods article/technical report (n = 1,342); and/or the neuroimaging modality was not MRI or fMRI (n = 1,772). A total of 1,009 articles were submitted to full-text screening and data extraction, of which another 90 were excluded. The final sample size was 919 articles. Table 1 summarizes the codebook with estimates of interrater reliability.

Modified Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram for study identification.
Most studies were published in NeuroImage (n = 393; 43%) or Human Brain Mapping (n = 260; 28%); the fewest were published in Nature Neuroscience (n = 16; 1.7%) and Neuron (n = 7; 0.8%). Nearly 75% of the included studies were conducted in adult samples, followed by child samples (n = 142; 15%) and studies with participants across the life span (n = 99; 11%). One-quarter (n = 228) of the studies were classified as patient samples, in which participants met criteria for a medical diagnosis. Most studies were observational (n = 878; 96%) and implemented exclusively fMRI (n = 624; 68%) rather than exclusively sMRI (25%) or combined fMRI and sMRI (7.5%). Finally, 13% of the studies leveraged consortium-level data, in that data from multiple sites or studies were combined into a single analysis. See also Table S2 in the Supplemental Material.
How large were the included studies, and which study characteristics were associated with sample size?
Across all studies, the largest analytic sample size reported ranged from N = 5 to N = 45,615 (Mdn = 55, M = 253). Sample size was associated with other study features such that larger studies were more likely to leverage consortium data, t(121.47) = 5.08, p < .001, and to adopt an observational design, t(449.85) = 6.35, p < .001. Sample size was also associated with participant developmental stage, F(2, 916) = 17.60, p < .001, and imaging modality, F(2, 916) = 32.34, p < .001. Post hoc Tukey tests revealed that life-span studies (i.e., participants of all ages) tended to be larger than child-only (<18 years) or adult-only studies (M differences = 290.08 and 339.16, respectively; ps < .001). The fMRI studies tended to be smaller than both sMRI-only studies (M difference = 300.45, p < .001) and multimodal fMRI/sMRI studies (M difference = 287.22, p < .001). Analytic sample size was not significantly associated with whether the study examined a patient population (p = .97). Results were indistinguishable when using the smallest analytic sample size reported (e.g., in cases when the article reported a sensitivity analysis with a smaller sample size; Supplemental Results in the Supplemental Material).
Do studies report sociodemographic information about their samples?
Figure 2 displays the proportion of studies that reported demographic, methodological, and generalizability and open-science features. Of the studies reviewed (N = 919), 14.8% reported the racial or ethnic identity of participants, and 27.9% reported something about the socioeconomic background of participants (e.g., income, education, poverty ratio, Hollingshead score), whereas 96% reported gender or sex, and 98% reported complete (41.7%; both age range and mean age) or partial (56.3%; age range or mean age) information about participant age. Across all four demographic features coded, the average number of features reported per study was 2 (50% of the total number of features). For results with an alternative coding scheme (Table S1 in the Supplemental Material), which produced similar reporting rates, see Figure S1 in the Supplemental Material.

Demographics, methods, and generalizability/open-science reporting in 919 studies published in 2019. N = 919. For definitions of whether a feature was reported, partially reported, or not reported, see the main text and, in detail, Table S1 in the Supplemental Material available online. The proportions shown here reflect “strict” definitions (e.g., for age, complete reporting was operationalized as reporting both age range and mean age, and partial reporting was operationalized as reporting either age range or mean age). For a display of reporting proportions using “loose” definitions, see Figure S1 in the Supplemental Material.
In the 136 (≈15%) studies that reported participant racial-ethnic identity, most participants were racialized as White (M = 56.15%, Mdn = 64.5%), followed by Hispanic/Latinx (M = 15.10%, Mdn = 6.0%), Black (M = 12.89%, Mdn = 4.0%), Asian (M = 16.65%, Mdn = 0.0%), and biracial or multiracial (M = 2.92%, Mdn = 0.0%); very small proportions of participants were racialized as American Indian and Alaskan Native or Hawaiian and Pacific Islander (Fig. 3; ordered by median). We acknowledge that these categories are fluid constructs that have changed throughout the historical record and reflect imperfect and incomplete measures of identity and social position (Cardenas-Iniguez & Gonzalez, 2024). These racial-ethnic categories are also highly U.S.-centric but nevertheless capture the degree to which sociodemographic diversity is represented in the neuroimaging articles reviewed.

Race-ethnicity reporting. Among studies that report sample race-ethnicity, most participants are racialized as White. Includes 136 (14.8%) studies that reported participant race, ethnicity, or both. Box plots depict study-specific proportions of participants across seven U.S.-centric racial-ethnic groupings. Proportions do not sum to 100% because not every study reported a racial-ethnic breakdown for all of the identity categories coded in this structured review.
Of the 256 (27.9%) articles that reported something about the socioeconomic background of participants, most (n = 155) reported education continuously (i.e., average education in years) or categorically (e.g., percentage less than high school degree, percentage college-educated). The average educational attainment of participants in studies that reported education continuously was 14.3 years (minimum = 6; maximum = 21), suggesting that, on average, participants had some education beyond high school or secondary school. Because there is wide variability in educational attainment globally (Goujon et al., 2016), we calculated the difference between a study sample’s average education in years and the average years of schooling in the country of recruitment (Fig. 4) for the 140 studies that reported education continuously and reported a single country of recruitment. Eleven studies reported sample education levels below the country-of-recruitment average, and 29 studies reported sample education levels within 1 year of the country average. The remaining 100 studies included participants whose average years of schooling exceeded the country-specific average (Fig. 4).

Education reporting. Studies overrepresent highly educated participants relative to the recruitment country’s average years of schooling. For the 140 (15.2%) studies that reported both a country of recruitment (n = 317; 34.5%) and mean education in years (n = 155; 16.9%), the average education of study samples recruited in each country is plotted against that country’s average years of schooling. Country-level population estimates of average years of schooling were drawn from Our World in Data (Hannah et al., 2023). Studies that included several countries of recruitment were excluded from this analysis (n = 7). Nineteen countries are represented in this figure.
Do studies report the flow of participants from recruitment to final analytic sample?
As with sociodemographic reporting, there was wide variability in the reporting of methodological features related to the flow of participants from recruitment to final analytic sample (Fig. 2; Fig. S1 in the Supplemental Material). Few studies reported the number of participants initially contacted for recruitment (4.7%), the dates of data collection (6.7%), or a missing-values analysis comparing participants with and without usable data (4.4%). Far more studies reported information about recruitment procedures; 29.1% stated from where participants were recruited and through what method. Forty-six percent of studies reported inclusion/exclusion or eligibility/ineligibility criteria, 31.2% reported reasons for missing imaging data (if applicable), and 56.4% reported the quality-control criteria that resulted in the additional exclusion of participants from the analytic sample(s). Across the seven methodological features coded, the average number of features reported per study was 2 (≈30%).
Do studies report open-science practices and attention to generalizability?
Finally, in the open-science and generalizability domain, the average number of features reported by each study was fewer than 1. Only 19.4% of studies explicitly defined the target population for inference, 23.2% of studies commented on the limitations of the generalizability of their study sample, and 1.7% and 6% of studies preregistered their aims and hypotheses or conducted a prestudy power analysis, respectively (Fig. 2). Results using the alternative loose coding scheme were similar (see Supplemental Results and Fig. S1 in the Supplemental Material).
Which study characteristics are associated with reporting likelihood?
We next sought to explore the characteristics of studies that report demographic, methodological, and generalizability and open-science features. Zero-order correlations indicated that reporting in one domain was associated with reporting in another domain (.19 < r < .28, all ps < .001, adjusted for false-discovery rate [FDR]). Sample size was not associated with the demographic, methods, or overall reporting indices, but larger studies were more likely to report study features related to open science and generalizability (r = .12, FDR-adjusted p < .001). The same patterns of association were observed when we used the looser definitions of reporting and the smallest sample size that was reported (Supplemental Results in the Supplemental Material). Next, four multivariate models were estimated: one for each domain-specific reporting index (demographic; methods; open science and generalizability) and one for the overall reporting index, each entered as the dependent variable (Table 2). All study characteristics and sample size were entered as independent variables. Sample age group, imaging modality, and journal were each uniquely associated with overall reporting frequency. Child and life-span studies were more likely to report study features than adult-only studies (overall reporting index: Fig. 5a), sMRI and multimodal (sMRI/fMRI) studies were more likely to report than fMRI-only studies (overall reporting index: Fig. 5b), and studies published in Developmental Cognitive Neuroscience, the American Journal of Psychiatry, and Molecular Psychiatry were more likely to report study features than studies published in Neuron, Journal of Cognitive Neuroscience, and NeuroImage (overall reporting index: Fig. 5c). Patterns were similar using the domain-specific (sociodemographic, methods, or open science and generalizability) reporting indices (Table 2). Note that nonconsortia studies were more likely than consortia studies to report demographic features, and studies using patient samples were more likely to report methods and open-science and generalizability features than studies that did not recruit patients. Consistent with the zero-order associations, sample size was not associated with the overall, sociodemographic, or methods reporting indices and was no longer associated with reporting in the open-science and generalizability domain after study characteristics were included in the multivariate models. Multivariate results were comparable in terms of both strength and direction using the loose reporting criteria (Table S3 in the Supplemental Material).
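The article does not name the FDR procedure; assuming the common Benjamini-Hochberg method, the adjustment of a family of correlation p values can be sketched as follows (p values illustrative):

```python
# Benjamini-Hochberg FDR adjustment across a family of correlation tests.
from statsmodels.stats.multitest import multipletests

pvals = [0.0002, 0.0004, 0.0007, 0.0001, 0.0300, 0.0005]  # illustrative
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(list(zip([round(p, 4) for p in p_adj], reject)))
```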
Study Characteristics Are Associated With Reporting Frequency
Note: N = 910. “Sample size” refers to the largest sample size reported, which was first winsorized to ±3 SD from the mean and then log-transformed. “Multimodal” refers to studies that included both structural and functional MRI. One-way analyses of variance for categorical variables evaluate the change in model fit when that predictor was removed from the model compared with the full model. The reporting indices here reflect “strict” definitions (see Supplemental Methods in the Supplemental Material available online); Supplemental Results in the Supplemental Material includes models using the reporting indices constructed with “loose” definitions and models with the smallest sample size reported.
*p < .05. **p < .01. ***p < .001.

Sample age, imaging modality, and journal associations with overall reporting frequency. N = 919. (a) Overall index of the number of study features reported (out of a possible 15 features), plotted by the reported sample age group. Adult-only studies were less likely than both child-only and life-span studies to report study features across domains (Table 2). (b) Overall index of the number of study features reported, plotted by imaging modality. Studies employing multimodal or structural MRI only were more likely to report study features across domains than functional-MRI-only studies (Table 2). (c) Overall index of the number of study features reported, plotted by journal (for multivariate results, see Table 2).
What are the associations between reporting frequency, sample size, and citation counts?
Given the ubiquity of reporting standards across research fields (e.g., APA, 2020; STROBE, von Elm et al., 2008) combined with the wide variability in reporting identified here, in the last empirical exercise we evaluated the extent to which reporting features within and across domains were valued in terms of citation count within 2 years of publication. Operationalizing value through citation frequency is imperfect and subject to biases, including weak prediction of research quality (Dougherty & Horne, 2022). At the same time, more frequently cited articles are also likely more visible to both the research community and the public (McKiernan et al., 2019; Sternberg, 2016). In the ≈2 years after publication (January 2020–March 2022), the number of times each article was cited ranged from 0 to 112 (M = 11.51, SD = 12.15, Mdn = 8). After accounting for all study characteristics (i.e., journal, participant age group, modality, study type, consortia status, patient population) and sample size, none of the domain-specific reporting indices (demographics index: B = −0.004, SE = 0.032, p = .90; methods index: B = −0.016, SE = 0.021, p = .43; open-science and generalizability index: B = 0.014, SE = 0.035, p = .68) were associated with citation frequency. By contrast, study sample size was strongly associated with citation frequency (B = 0.127, SE = 0.025, p < .001): For every 1-unit increase in log-transformed sample size, citation frequency increased by approximately 12.7%. This figure was consistent in models with the overall reporting index and using the loosely defined transparency indices (Supplemental Results in the Supplemental Material).
Discussion
In the last 30 years, advances in MRI and fMRI technology have enabled scientists to study the brain bases of human behavior, cognition, and health and disease. In turn, imaging methods have captured the attention of scientists, policymakers, and the public at large. Metascience inquiries are needed to evaluate the state of this field, with particular emphasis on the generalizability and reproducibility of study findings. In this structured review of select methodological practices in human MRI and fMRI studies, we documented reporting practices across three domains (i.e., demographic, methods, open science and generalizability). Results indicated that although some study features were widely reported (e.g., gender or sex), many were not (e.g., reasons for missing imaging data). Coded study characteristics such as sample age group, journal, and imaging modality were related to a study’s likelihood of reporting across domains. Moreover, when examining associations between study characteristics and citation frequency, we found that study sample size but not reporting frequency was strongly associated with greater visibility in the field.
Two surprising results were low reporting rates of inclusion/exclusion criteria and reasons for missing MRI or fMRI data. Without this information, it is impossible to know the characteristics of the population that are not represented in each study (APA, 2020; von Elm et al., 2008). More than half (54%) of included studies in this structured review did not explicitly list inclusion/exclusion or eligibility/ineligibility criteria. We found widespread nonspecific statements such as “All participants were healthy and right-handed” or “Participants were free of major psychiatric disorders.” On the other hand, studies with clear reporting included numbered lists of exclusion criteria and definitions for each criterion (e.g., van Harmelen et al., 2014). An even larger proportion of studies (68.8%) failed to report the contributions of missing MRI or fMRI data to the flow of participants. More often, studies stated the number of participants used in the current analysis without reference to any of the reasons why participants did not have imaging data. By identifying characteristics of individuals who do not enter the scanning environment (e.g., anxiety or hesitation related to the scan), researchers could adjust protocols to promote retention (e.g., mock scanners, hiring staff with similar experiences to participants; Gard et al., 2023). One excellent example of reporting the flow of participants can be found in Hahm et al. (2019), who provided a detailed account of how participant data were lost at every stage of filtering.
Another concerning finding was the low reporting of select demographic characteristics, including the racial-ethnic identity and socioeconomic background of participants. Race-ethnicity is not a biological construct; it reflects a constellation of environmental, personal, and community experiences, including historical and current structural racism (Cardenas-Iniguez & Gonzalez, 2024; Varnum & Kitayama, 2017). Racial-ethnic identity is an especially salient sociodemographic characteristic that describes participants’ lived experiences and leads to culturally specific trajectories of development (Iruka et al., 2022). In turn, a body of studies has shown that racialized experiences (both promotive and potentially harmful) shape brain function and structure (Constante et al., 2023; Hyde et al., 2020). Thus, studies that ignore racialized experiences and environmental factors related to the provision of economic resources may neglect contributors to heterogeneity in brain development. Indeed, a recent investigation using a large study of early adolescent youths reported greater heterogeneity in neurodevelopment in marginalized and structurally disadvantaged groups compared with their more advantaged peers (Bottenhorn et al., 2024).
That human neuroscience emerged from animal neuroscience may help to explain why reporting rates of sample sociodemographic features are so low. All humans, regardless of social status, resources, identity, and/or geographical location, share fundamental biological processes. Dendritic arborization, cell death, synaptic pruning, and myelination are examples of brain-based molecular processes that function in the same way across all humans. However, in contrast to animal neuroscience, human-neuroimaging studies do not measure processes at a cellular level. Rather, the indirect measures of brain structure and function used in neuroscience research capture multiple molecular processes simultaneously. As the level of inference shifted from molecules and cells in animal neuroscience to individuals and groups of people in human neuroscience, so, too, did the importance of attending to variation in environmental contexts (Falk et al., 2013). In short, one brain is not representative of all brains. However, among the studies included in this structured review, most (76.8%) did not comment on the limitations of sample generalizability or identify the target population to which the study sought to generalize (80.6%). Instead, many studies invoked generic language (DeJesus et al., 2019) that implied universalisms (e.g., “Furthermore, these results suggest that the right IFG [inferior frontal gyrus] plays a crucial role in supporting pitch encoding in the typical brain”).
A related result is the recognition of the large role that study sample size plays in both the interpretations of generalizability and the visibility of research products. There were several examples of falsely equating sample size with population representation (e.g., “Since our sample size was comparatively larger than in previous studies . . . the found partial correlations can be seen as more representative of the true effect sizes in the population”). Larger sample sizes indeed enable greater statistical power to detect small effect sizes (Turner et al., 2018). However, the notion that sample size in and of itself is a marker of generalizability is faulty at best. Decades of research highlight the insidious impact of nonresponse bias on deriving population estimates (Groves et al., 2009). Although large sample sizes are necessary for detecting small effect sizes (Marek et al., 2022), studies using these same data sets also reveal that larger sample sizes may not produce estimates that generalize to a broader population of individuals (e.g., Gard et al., 2023; LeWinn et al., 2017). And yet, the current results made clear that reporting larger sample sizes was highly “valued” by the field in terms of citation count. Estimates from multivariate models demonstrated that for every additional 1-unit increase in log-transformed study sample size, article citations increased by nearly 13%. By contrast, the association between reporting frequency and citation count was effectively zero.
Perhaps less surprising but no less important were the exceedingly low rates of preregistration (1.7%) and prestudy power analyses (6%). A previous survey of 283 largely cognitive neuroscientists found that 57.6% of respondents reported engaging in at least one format of preregistration (Paret et al., 2022). Given known selection biases and low response rates common to survey designs, we see the reporting rates in the current investigation as a lower bound on the frequency of preregistration implementation in MRI and fMRI studies.
The importance of complete reporting and adoption of open-science practices in MRI and fMRI research is not a new topic of discussion. In response to concerns for the reproducibility of neuroimaging research, the Organization for Human Brain Mapping (OHBM) in 2014 created the Committee on Best Practices in Data Analysis and Sharing (COBIDAS) to generate best practices for reporting and data sharing (Nichols et al., 2017). Recommendations centered on reporting related to experimental design, image acquisition, preprocessing, statistical modeling, results, and best practices for data sharing and enhancing reproducibility. Several study characteristics that were coded and reported in the current investigation (i.e., age, sex, socioeconomic status, race-ethnicity, number of participants scanned vs. analyzed, exclusion criteria, power analyses, population and recruitment strategy, clinical criteria if applicable) were also included in the COBIDAS report. As revealed by the current structured review, however, many of these study features are not sufficiently reported. As a complement to the current review, an opportunity exists to evaluate reporting practices in the other methodological domains highlighted by the COBIDAS report.
Recommendations
Although individual studies and researchers could be targeted for intervention to increase reporting transparency in human-neuroscience studies, intervention at the journal level might prove more effective and efficient for improving reporting culture. Indeed, such structural requirements already exist; journals in the Nature publishing group require authors to complete an editorial-policy checklist that includes a data-availability statement and confirmation of practices for the protection of research participants and biological samples. Forms such as these, which authors can complete upon manuscript submission to a journal, could be expanded to include information about sample demographics, methodological details, and open-science and generalizability practices. A complementary approach, inspired by the work of COBIDAS (Nichols et al., 2017), is to update field-wide software (e.g., BIDS and BIDS apps such as fMRIPrep) to necessitate the inclusion of these study features in data structures themselves. For example, fMRIPrep could integrate prompts that ask users to input participant sociodemographic characteristics (e.g., socioeconomic status, age, gender, and sex), which are then stored in participant-level files in BIDS-formatted data streams. Likewise, in the “derivatives” folder, a quality-control file could be automatically generated that identifies, with dummy codes, participants who passed successive preprocessing steps; this would be an extension of MRIQC (Provins et al., 2023), which can be used to evaluate the quality of raw acquired data (Esteban et al., 2020). These technical advances are necessary and will take time to implement.
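To make the proposal concrete, the sketch below writes hypothetical sociodemographic fields into a BIDS participants.tsv file and participant-flow dummy codes into a derivatives file. The file layout follows BIDS naming conventions, but the specific columns and the quality-control file are our invention, not existing features of fMRIPrep or MRIQC.

```python
# Hypothetical sketch: storing sociodemographics and participant-flow
# dummy codes in a BIDS-style dataset (columns are illustrative).
from pathlib import Path
import pandas as pd

root = Path("bids_dataset")
(root / "derivatives" / "qc").mkdir(parents=True, exist_ok=True)

# participants.tsv: one row per participant, with sociodemographic columns.
participants = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03"],
    "age": [24, 31, 27],
    "sex": ["F", "M", "F"],
    "gender": ["woman", "man", "woman"],
    "education_years": [16, 12, 18],
})
participants.to_csv(root / "participants.tsv", sep="\t", index=False)

# Derivatives QC file: dummy codes flag which filtering steps each
# participant passed, documenting the flow from recruitment to analysis.
flow = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03"],
    "scan_acquired": [1, 1, 0],       # 0 = no imaging data collected
    "passed_motion_qc": [1, 0, 0],    # 0 = excluded for excessive motion
    "in_final_sample": [1, 0, 0],
})
flow.to_csv(root / "derivatives" / "qc" / "participant_flow.tsv",
            sep="\t", index=False)
```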
To bolster reporting practices in human-neuroimaging studies more immediately, we advocate for inclusion of a journal-level reporting checklist that authors complete at the time of manuscript submission (Table 3). This same checklist can augment the existing OHBM COBIDAS checklist that, at present, is more centrally focused on acquisition parameters, preprocessing steps, and data-analysis and -sharing practices. Checklist items might include the three domains of reporting that we evaluated in this review—sample sociodemographic characteristics, methodological features, and open-science and generalizability practices—in addition to those identified by COBIDAS (Nichols et al., 2017). For each item, authors select whether the manuscript includes the information, just as current submission checklists require authors to acknowledge that all authors have approved the manuscript. Even if journals do not have the capacity to enforce all reporting features, the simple act of acknowledging whether said information is included in the manuscript may encourage researchers to increase their reporting practices more widely. Although not ideal, researchers can also acknowledge that the requested information was not collected (e.g., reasons for ineligibility).
Proposed Submission Reporting Checklist to Promote Increased Transparency and Generalizability in Human-Neuroimaging Research
One major challenge to this proposal is that racial-ethnic identity is arguably a difficult study feature to code and compare worldwide. Here, we endorse a recommendation advanced by the ManyBabies international consortium for studies to adopt a measure of community of descent that is locally valid and captures aspects of heritage and identity (Singh et al., 2024). Multiple constructs are captured under the term “community of descent” proposed by Singh et al. (2024), including ancestry, race-ethnicity, religion, national origin, cultural practices, and native language use, among others.
Limitations, future directions, and conclusions
Despite the strengths of this structured review, we cannot address all methodological elements (e.g., preprocessing pipelines, statistical-modeling procedures) and open-science practices (e.g., data sharing) that contribute to the reliability, reproducibility, and generalizability of human-neuroimaging research. Our focus on MRI and fMRI further limits understanding of methodological practices outside of these neuroimaging modalities. Finally, our team was unable to collect information about biological sex and gender separately. During training, it became clear that few studies reported participant gender as distinct from sex, and nearly all articles categorized sex or gender as binary constructs. Thus, to balance team resources with data availability, we opted to code whether sex or gender was reported and if reported, only the percentage of the sample identified as female. Because growing evidence points to robust sex and gender differences in phenotypic presentation and underlying brain systems (Eliot et al., 2023), the fact that we do not know the complete sex and gender makeup of participants in neuroscience studies is a limitation of this article and a major shortcoming of the field. Thus, to fully gauge the state of the science, additional metascience inquiries should be undertaken to document reporting of more methodological elements, open-science practices, and sociodemographic characteristics; the coding procedures adopted here may be useful to other researchers interested in pursuing this work.
The question of whether and how to report methodological practices is distinct from efforts to increase diversity in human-neuroimaging research. The former can be implemented now (Table 3); the latter requires specialized training, experience, and resources (Habibi et al., 2015). Supporting researchers with specialized skills to recruit and retain communities historically excluded from scientific research is essential to increasing the generalizability of the field’s science. Ultimately, efforts to increase sample sociodemographic representation will require radical shifts in how researchers conduct neuroimaging studies (Gard et al., 2022; La Scala et al., 2023).
The “Age of the Brain” is very much upon us; MRI and fMRI studies inform public life in policy, medicine, and public-health domains. At the same time, the opportunity to use human-neuroimaging data to understand human behavior and support human well-being comes with responsibility to participants, funders, and the public at large. Promoting greater transparency in reporting practices is one of many efforts that are needed to gain a more generalizable, reliable, and reproducible understanding of the human brain across environmental contexts.
Supplemental Material
Supplemental material for “A Window Into the State of the Science: Current Reporting Practices Related to Generalizability in MRI and Functional-MRI Studies” by Arianna M. Gard, Deena Shariq, Alison A. Albrecht, Alaina Lurie, Hyung Cho Kim, Colter Mitchell, and Luke W. Hyde (Advances in Methods and Practices in Psychological Science) is available online.
Transparency
Action Editor: Rogier Kievit
Editor: David A. Sbarra
Author Contribution(s)
D. Shariq, Al. A. Albrecht, and A. Lurie contributed equally.