Abstract
This reliability generalization meta-analysis of the Modified Dental Anxiety Scale (MDAS) synthesized 128 α coefficients from 118 studies encompassing 95,588 participants, using three-level random-effects models with CR2 inference and the Bonett transformation. The pooled internal consistency was α = 0.882 (95% CI (0.873, 0.889)), with a 95% PI of (0.759, 0.942). Language was a significant moderator, and its effect persisted after adjustment for continent. Mean age and sample size were also reliably associated with α, whereas the remaining moderators showed no reliable effects. PET/PEESE and Egger tests indicated minimal small-study effects, and reliability induction was prevalent (76.1%). Overall, MDAS scores show high reliability but remain sensitive to measurement context. Studies in this field should report sample-specific alpha values with CIs, supplemented by omega or ordinal alpha. Practical implications for dental health professionals and researchers are presented.
Introduction
Dental anxiety is a construct that encompasses learned responses shaped by prior aversive experiences and trait vulnerability, as captured by established theories of anxiety. The Modified Dental Anxiety Scale (MDAS) has primarily been used as a trait-oriented screening tool, although some studies have incorrectly treated it as a measure of peri-procedural state anxiety (Kok et al., 2023; Steenen et al., 2024). Instruments should be matched to their target construct (e.g. STAI-State or VAS-Anxiety for state anxiety; MDAS for trait dental anxiety), and clarifying this distinction is essential for both psychometric interpretation and clinical decision-making.
Dental anxiety significantly influences oral health behaviors. Individuals with heightened anxiety tend to postpone or avoid dental visits altogether, leading to delayed diagnoses, worsening oral health outcomes, and increased treatment complexity (Armfield, 2010; Locker et al., 1996). Such behaviors are especially problematic in populations with limited access to preventive care. From a psychological perspective, dental anxiety represents a key modifiable factor affecting health behavior, making its assessment and management a public health priority (Chan and Chin, 2017).
Because the MDAS is widely used across clinical and research settings, including multiple cross-linguistic adaptations, it is important to understand how consistently it performs across populations. Reliability estimates vary considerably due to methodological and sample-related differences, and single-study coefficients cannot answer this question. A reliability generalization (RG) meta-analysis allows us to estimate the pooled internal consistency, quantify between- and within-study heterogeneity, and examine study-level moderators, thereby providing more accurate and context-sensitive guidance for the use of MDAS scores in research and clinical practice (Vacha-Haase et al., 2000).
Evidence on the reliability and validity of the modified dental anxiety scale
The MDAS is a modified version of the Dental Anxiety Scale (DAS), consisting of a total of five items (Corah, 1969; Humphris et al., 1995). It retains the first four items of the DAS, scores each item on a 5-point Likert scale, and addresses the DAS's omission of local anesthetic injection by adding a fifth item. Respondents report their own condition, and the item scores are summed to yield a total ranging from 5 to 25; participants with a total score of ⩾19 are classified as having "high dental anxiety." In the original study, Humphris and colleagues reported Cronbach's alpha internal consistency estimates ranging from 0.72 to 0.93 (Humphris et al., 1995). A recent study suggested that the scale could be considered two-factorial (i.e. the first two items as "anticipatory dental anxiety" and the last three items as "treatment-related dental anxiety"); nevertheless, the researchers reported an overall Cronbach's alpha of 0.915 (Humphris and Newton, 2025). Findings from Italy, Switzerland, Taiwan, and Finland, in addition to the UK, support the two-dimensional factorial structure of the scale (Gremigni et al., 2014; Höglund et al., 2024; Lin et al., 2021; Tolvanen et al., 2017). In another study using the MDAS (n = 1108), a total score of ⩾19 was found to assist in the diagnosis of dental phobia (King and Humphris, 2010). The MDAS has been translated into 22 different languages, with reliability estimates published in some of these studies (Facco et al., 2015; Gremigni et al., 2014; Yeung, 2024), and versions in 26 different languages (e.g. German, Arabic, Chinese, Persian, Hebrew) have appeared in the literature.
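For readers who implement the scoring rule directly, the following R sketch illustrates the summation and the ⩾19 cut-off; the data frame and column names are hypothetical and serve only as an example.

```r
# Illustrative MDAS scoring: five items, each rated 1-5, summed to 5-25;
# totals >= 19 flag "high dental anxiety" (column names are hypothetical).
score_mdas <- function(items) {
  stopifnot(ncol(items) == 5, all(items >= 1 & items <= 5, na.rm = TRUE))
  total <- rowSums(items)
  data.frame(mdas_total = total, high_dental_anxiety = total >= 19)
}

# Two hypothetical respondents
responses <- data.frame(q1 = c(2, 5), q2 = c(3, 5), q3 = c(2, 4),
                        q4 = c(3, 5), q5 = c(4, 5))
score_mdas(responses)
```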
Most studies implemented the conventional five-item, single-total-score MDAS and reported Cronbach's α. However, some studies proposed a two-factor solution, distinguishing expectation-related from treatment-related anxiety (Humphris and Newton, 2025; Yuan et al., 2008), and a few reports, such as Zsido et al. (2025), also provided alternative coefficients (e.g. ω). These inconsistencies across the literature motivated moderator analyses based on factorial confirmation.
Purpose of the study and research questions
This reliability generalization (RG) meta-analysis aims to: (a) estimate the pooled internal consistency of the MDAS total score using a three-level random-effects model (effects nested within studies), and report its 95% confidence intervals (CI) and prediction intervals (PI); (b) quantify heterogeneity and examine study-level moderators, including factorial confirmation (none/ exploratory factor analysis-EFA/confirmatory factor analysis-CFA), intended construct (trait vs peri-procedural state), language, continent, participant type, publication/analysis type, and continuous characteristics (mean age, age SD, female proportion, scale mean, sample size) as coded in Supplemental A; (c) estimate the reliability induction rate (by omission or reuse of prior reliability/by report) and the share of cumulative sample size contributed by studies without sample-specific reliability; (d) express the pooled measurement-error proportion as (1−α) and provide its PI to anticipate reliability in new samples.
Research questions: (a) What is pooled Cronbach’s alpha for the MDAS total score, and what is the 95% PI across future populations/samples? (b) Which moderator variables (categorical and continuous) account for between-study variability in Cronbach’s alpha? (c) What is the prevalence of reliability induction, and what proportion of the cumulative sample size in the MDAS literature is affected by it? (d) What proportion of observed-score variance is attributable to measurement error (i.e. (1−α)) at the pooled estimate and across the PI?
Alpha coefficients and moderator fields were extracted and systematically coded (Supplemental A). Some primary studies lacked values for particular moderators; therefore, the analytic k for a given moderator may be smaller than the overall k, and per-moderator sample sizes are reported alongside model results rather than in fixed tables.
Methods
This RG meta-analysis adheres to the Reliability Generalization Meta-Analysis (REGEMA) guidelines outlined by Sánchez-Meca et al. (2021). The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-analysis) standard was employed to report the protocol steps and outcomes of the screening process (Page et al., 2021). Given that Cronbach’s alpha coefficients are commonly reported in the literature, they were utilized as the reliability estimate in this RG meta-analysis. This study is a systematic review and reliability generalization meta-analysis based solely on previously published data. No new data was collected, and no human or animal participants were involved. Therefore, ethical approval and informed consent were not required in accordance with institutional and journal guidelines.
Selection of information sources and studies
A comprehensive literature search was performed across five electronic databases—Web of Science™, Scopus©, ProQuest©, PubMed®, and APA PsycNet®—to identify relevant studies utilizing the MDAS. The search covered the period from 1995, the year the MDAS was first introduced, through 2025 and was executed on May 5, 2025, using the keyword "modified dental anxiety" across all platforms. Additionally, studies citing key works by Humphris and colleagues published in 1995, 2000, 2006, 2007, 2009, 2011, 2013, 2016, and 2025 were identified (Humphris et al., 1995, 2000, 2006, 2009, 2013, 2016; Humphris and Hull, 2007; Humphris and King, 2011; Humphris and Newton, 2025). The resulting dataset was then cross-referenced with previously compiled MDAS usage lists by Yeung (2024) and Gremigni et al. (2014), leading to the inclusion of nine additional studies not originally captured. The initial dataset of 2954 records was reduced to 941 after duplicates were removed using Zotero (n = 2005) and manual review (n = 8). Remote access to full-text articles was provided through institutional library services. Despite these efforts, 16 works remained inaccessible because of interlibrary loan and database access restrictions at Eastern Mediterranean University (North Cyprus). Consequently, only English-language studies (118 studies contributing 128 α coefficients) were included in the final PRISMA flow diagram (Supplemental Figure 2).
Inclusion and exclusion criteria
The current analysis applied a predefined set of inclusion criteria to select appropriate studies. First, the studies had to be published in peer-reviewed academic journals. Second, the full text of the publication had to be available in English. Third, unpublished master’s and doctoral theses were also included in the evaluation along with published articles. Fourth, the included studies had to explicitly state that they used the MDAS. Fifth, the studies had to report the overall Cronbach’s alpha coefficient of the MDAS. Sixth, the MDAS had to be administered using a five-point Likert-type response format consistent with its original structure. Finally, the publication date of the eligible studies had to be between 1995 and 2025.
Studies with the following characteristics were excluded: (a) bibliometric analyses, (b) meta-analysis, (c) reviews, (d) qualitative research, (e) books, (f) non-English publications, (g) working papers, and (h) case reports. Additionally, studies presenting duplicate reliability estimates based on the same sample, those omitting or ambiguously reporting reliability values (e.g. presenting a range rather than a single coefficient), and those using a reliability coefficient other than Cronbach’s alpha were excluded. Pilot studies that reported alpha coefficients based on limited participant samples prior to the main study were also excluded.
No restrictions were applied based on participants’ age, ethnicity, gender, geographical location, or study design. Furthermore, the abstracts or full texts of seven studies could not be retrieved due to inactive online access. A comprehensive overview of the exclusion process, including the number of studies excluded at each stage, is presented in the PRISMA flow diagram (Supplemental Figure 2).
Data extraction and coding
At this stage of the study, the data were classified in accordance with the RG coding recommendations outlined in the existing literature (Henson and Thompson, 2002). The coding procedure was initially conducted by the primary researcher. To enhance the reliability and validity of the process, an experienced assistant independently performed parallel coding. In instances of disagreement between the researcher and the assistant regarding the classification of a particular study, the dataset was re-evaluated collaboratively until a consensus was reached. The full scope of the coding scheme is presented in Supplemental A.
To assess inter-coder reliability, Cohen’s kappa coefficient was calculated. This statistical measure evaluates the level of agreement between two raters beyond chance, with values ranging from −1 to 1. According to interpretation guidelines, kappa values approaching 1 indicate stronger agreement, while values closer to 0 reflect weaker reliability (Coleman et al., 2024).
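Although κ was computed in SPSS in this study, the statistic itself is straightforward; the following R sketch illustrates the computation from two coders' categorical codes. The example ratings are invented and do not reproduce the reported agreement values.

```r
# Minimal Cohen's kappa from two coders' categorical codes (illustrative data)
cohen_kappa <- function(r1, r2) {
  lv  <- union(r1, r2)
  tab <- table(factor(r1, levels = lv), factor(r2, levels = lv))  # agreement table
  p_obs <- sum(diag(tab)) / sum(tab)                              # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2          # chance-expected agreement
  (p_obs - p_exp) / (1 - p_exp)
}

coder1 <- c("CFA", "EFA", "none", "CFA", "none", "CFA")
coder2 <- c("CFA", "EFA", "none", "EFA", "none", "CFA")
cohen_kappa(coder1, coder2)  # 0.75 for this toy example
```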
Statistical analysis
Analyses were conducted in R (version 4.4.3) using the metafor package (three-level models), clubSandwich (cluster-robust variance estimation, CR2), and ggplot2 (graphics); data preparation was performed in Microsoft Excel, and Cohen's κ for inter-coder agreement was computed in SPSS (version 26; Coleman et al., 2024; Pustejovsky and Tipton, 2018; Viechtbauer, 2010; Wickham, 2016).
To stabilize variances and approximate normality, Cronbach’s α was transformed using Bonett’s log transformation (metafor’s measure = ABT), and model estimates were back-transformed to the α-scale for presentation. Sampling variances for ABT were computed by Bonett’s method as implemented in metafor. Pooled estimates, CIs, and PIs are reported on the α-scale (Bonett, 2002; Viechtbauer, 2010).
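As an illustration of this step, the following R sketch (not the study's actual script) applies the ABT transformation with metafor; the rows and column names (alpha, n_items, n, study, effect_id, language, mean_age) are placeholders standing in for the coded dataset.

```r
library(metafor)

# Illustrative rows standing in for the coded dataset (placeholder values only):
# alpha = reported Cronbach's alpha, n_items = 5 for the MDAS, n = sample size,
# study/effect_id = identifiers, language and mean_age = example moderators.
dat <- data.frame(
  study     = c(1, 1, 2, 3, 4, 5, 6, 7),
  effect_id = 1:8,
  alpha     = c(0.89, 0.91, 0.85, 0.93, 0.80, 0.88, 0.90, 0.86),
  n_items   = 5,
  n         = c(210, 180, 540, 1100, 95, 320, 760, 150),
  language  = c("English", "English", "Other", "English", "Other",
                "Other", "English", "Other"),
  mean_age  = c(34, 34, 28, 41, 22, 36, 45, 30)
)

# Bonett's transformation (measure = "ABT") and its sampling variances
dat <- escalc(measure = "ABT", ai = alpha, mi = n_items, ni = n, data = dat)

# Back-transform ABT-scale values to the alpha scale for reporting
transf.iabt(dat$yi)
```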
We fitted three-level random-effects models via REML, accounting for dependence among effects nested within studies. Heterogeneity was partitioned into between-study and within-study components; we report Q and I² at the total, between-study, and within-study levels, together with the variance components. Inference used CR2 standard errors with Satterthwaite degrees of freedom (Bell and McCaffrey, 2002; Cheung, 2014; Cochran, 1954; Higgins et al., 2003; Higgins and Thompson, 2002; Konstantopoulos, 2011; Pustejovsky and Tipton, 2018; Satterthwaite, 1946; Tipton, 2015).
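Continuing the illustrative pipeline above, a minimal sketch of the three-level model, CR2 inference, back-transformed pooled estimates, and one commonly used multilevel I² decomposition (Cheung, 2014) might look as follows; the exact computations reported in this paper may differ in detail.

```r
library(clubSandwich)

# Three-level random-effects model: effects nested within studies (REML)
m0 <- rma.mv(yi, vi, random = ~ 1 | study/effect_id, data = dat, method = "REML")

# Cluster-robust (CR2) coefficient test with Satterthwaite degrees of freedom
coef_test(m0, vcov = "CR2", cluster = dat$study)

# Pooled alpha with 95% CI and 95% PI, back-transformed from the ABT scale
predict(m0, transf = transf.iabt)

# One common multilevel I^2 decomposition: each variance component relative to
# the total of both components plus a "typical" sampling variance
w     <- 1 / dat$vi
v_typ <- (m0$k - 1) * sum(w) / (sum(w)^2 - sum(w^2))
i2    <- m0$sigma2 / (sum(m0$sigma2) + v_typ)
round(100 * c(between_study = i2[1], within_study = i2[2]), 1)
```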
Categorical moderators included factorial confirmation (none/EFA/CFA), intended construct (trait vs peri-procedural state), language, continent, participant type, and publication/analysis type. These were tested using Wald omnibus tests under CR2, with Holm-adjusted pairwise contrasts where relevant. Continuous moderators (mean age, age SD, female proportion, sample size, scale mean, publication year) were examined via meta-regression; continuous predictors were centered and scaled to aid interpretation. Moderator-specific sample sizes (k) are reported with results due to occasional missingness in primary reports. Variable definitions follow Supplemental A (Coding of Studies; Cheung, 2015; Hedges et al., 2010; Holm, 1979; Pustejovsky and Tipton, 2018).
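A hedged sketch of how such moderator models can be specified with metafor and clubSandwich is shown below, continuing the pipeline above; the moderator names (language, mean_age) are placeholders, the Holm-adjusted pairwise contrasts are not shown, and the paper's exact model specifications may differ.

```r
# Categorical moderator (placeholder "language"): CR2 Wald omnibus test
# with the HTA small-sample correction
m_lang <- rma.mv(yi, vi, mods = ~ language, random = ~ 1 | study/effect_id,
                 data = dat, method = "REML")
Wald_test(m_lang, constraints = constrain_zero(2:length(coef(m_lang))),
          vcov = "CR2", cluster = dat$study, test = "HTA")

# Continuous moderator (placeholder "mean_age"), centered and scaled per 10 years
dat$age10 <- (dat$mean_age - mean(dat$mean_age, na.rm = TRUE)) / 10
m_age <- rma.mv(yi, vi, mods = ~ age10, random = ~ 1 | study/effect_id,
                data = dat, method = "REML")
coef_test(m_age, vcov = "CR2", cluster = dat$study)
```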
We prioritized three-level precision-effect test (PET) and precision-effect estimate with standard error (PEESE) models with CR2 inference (regressing ABT on the SE and on the SE², respectively) to test and adjust for small-study effects. As a sensitivity check, we ran Egger's regression with CR2 (cluster = study). We report bias-adjusted intercepts on the α-scale and present the PET/PEESE funnel plots in the Supplemental material (Supplemental Figure 1; Egger et al., 1997; Stanley, 2017; Stanley and Doucouliagos, 2014).
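The PET and PEESE regressions can be sketched as follows, again continuing the illustrative pipeline; this is one common way to implement the approach and is not presented as the study's exact code.

```r
# PET: regress the ABT effect sizes on their standard errors;
# PEESE: regress on the squared standard errors (i.e. the sampling variances)
dat$sei <- sqrt(dat$vi)
m_pet   <- rma.mv(yi, vi, mods = ~ sei, random = ~ 1 | study/effect_id,
                  data = dat, method = "REML")
m_peese <- rma.mv(yi, vi, mods = ~ I(sei^2), random = ~ 1 | study/effect_id,
                  data = dat, method = "REML")

# CR2 slope tests for small-study effects
coef_test(m_pet,   vcov = "CR2", cluster = dat$study)
coef_test(m_peese, vcov = "CR2", cluster = dat$study)

# Bias-adjusted intercepts back-transformed to the alpha scale
c(PET = transf.iabt(coef(m_pet)[1]), PEESE = transf.iabt(coef(m_peese)[1]))
```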
Robustness was examined using leave-one-effect-out and leave-one-study-out analyses under the three-level model. We report the maximum change in pooled α (|Δα|), DFBETAS for the intercept, and a Cook-like deviance ranking to flag influential rows (Cook, 1977; Viechtbauer and Cheung, 2010).
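A simple leave-one-study-out loop, continuing the sketch above, conveys the logic of these diagnostics; metafor also provides cooks.distance() and related methods for rma.mv objects.

```r
# Leave-one-study-out: refit the three-level model without each study and track
# the change in the back-transformed pooled alpha
alpha_full <- transf.iabt(coef(m0)[1])
loo <- sapply(unique(dat$study), function(s) {
  m_s <- rma.mv(yi, vi, random = ~ 1 | study/effect_id,
                data = dat[dat$study != s, ], method = "REML")
  transf.iabt(coef(m_s)[1]) - alpha_full
})
max(abs(loo))  # maximum |change in pooled alpha| after removing any single study
```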
Alongside the pooled α and its 95% CI, we report the 95% PI to reflect the expected dispersion for a single new study (Rivera et al., 2024). For interpretability, we also report the measurement-error proportion as (1 − α), together with its CI- and PI-based ranges.
Results
In this REGEMA, a total of 128 coefficient alpha values were extracted from 118 individual studies (Supplemental B). The aggregated dataset comprised responses from 95,588 participants (M = 746.8, SD = 2383.9) who completed the MDAS. All included studies were published in full-text English between the years 2000 and 2025 (M publication year = 2016.6, SD = 6.62). The reported raw alpha coefficients ranged from 0.72 to 0.96, with M = 0.872 and SD = 0.05. Among all participants, 46,760 were female (M = 386.4, SD = 1203.1), accounting for approximately 49% of the total sample. The mean age of participants across studies ranged from 7.52 to 72.78 years (M = 33.81, SD = 13.06).
Induction rates
As indicated in the study selection phase (Supplemental Figure 2), only 118 out of 493 studies (24%) met the pre-specified inclusion and exclusion criteria and were thus included in the meta-analysis. The induction findings derived from the current research process indicate that 302 studies (61.3%) did not report any reliability estimate, while 73 studies (14.8%) relied solely on prior reliability estimates reported in previous literature (total induction rate 76.1%). Additionally, two studies presented a range of reliability values, and nine studies reported reliability coefficients other than the coefficient alpha (e.g. intraclass correlation coefficient).
When induction cases were analyzed by country of origin, a total of 363 reliability induction cases were identified across 51 countries. Eleven countries exceeded the average of 7.12 induction cases per country, with the top three contributors being Turkiye (46 studies), India (42 studies), and the United Kingdom (33 studies).
In this study, the participants in studies exhibiting reliability induction are treated as "missing data." As presented in Supplemental Figure 2, the total number of participants counted as missing data across 375 studies—irrespective of country—is 170,767. In other words, although the MDAS was administered, these 375 studies lacking sample-specific alpha coefficients account for 64.11% of the cumulative sample size.
Inter-coder reliability
Inter-coder agreement was evaluated based on independently coded variables using Cohen’s Kappa. The analysis yielded an average Kappa coefficient of 0.933, with a standard deviation of 0.066 and a 95% CI of (0.872, 0.994). All coefficients were statistically significant (p < 0.001), indicating a consistently high level of agreement between coders. Discrepancies were addressed through collaborative re-evaluation and consensus.
Results related to the overall alpha coefficient
Across k = 128 reliability coefficients from m = 118 studies, the pooled internal consistency of the MDAS total score was α = 0.882, with a 95% CI of (0.873, 0.889; back-transformed from the Bonett ABT scale). The 95% PI was (0.759, 0.942), indicating that reliability in a new study is expected to fall within this range (Supplemental C: Table 1). Substantial heterogeneity was observed both between and within studies: Q(127) = 4131.37, p < 0.001, with I²(total) = 88.9%, decomposed into I²(between) = 44.2% and I²(within) = 44.7%. The variance components on the ABT scale were τ²(between) = 0.0645 and τ²(within) = 0.0653 (Supplemental C: Table 1).
To facilitate interpretation, the pooled measurement-error proportion (defined as (1−α)) was 0.118, with a CI-based range of (0.111, 0.127) and a PI-based range of (0.058, 0.241). These findings are consistent with high average reliability but meaningful dispersion across settings (Supplemental C: Table 2).
Robustness checks revealed that the pooled estimate was highly stable. Leave-one-study-out analyses resulted in an adjustment to the pooled α of at most ±0.001. Meanwhile, leave-one-effect-out diagnostics (Cook’s deviance and DFBETAS) did not identify any materially influential effects. The rankings are reported in Supplemental C (Tables 4 and 5).
Categorical moderators and alpha coefficient relationship
For the language moderator (English vs non-English), the omnibus test was significant under the three-level REML model (CR2/Satterthwaite; Bonett ABT α), F(1, 43.1) = 12.1, p = 0.0012. The pooled estimates were α = 0.896 for English and α = 0.871 for non-English samples, with 95% CIs of (0.885, 0.907) and (0.860, 0.881) and 95% PIs of (0.793, 0.948) and (0.743, 0.935), respectively. The approximate difference between the two estimates was 0.025 (Supplemental D: Table CM-6).
After adjusting for continent, the language effect remained significant: t(33.69) = 3.724, p = 0.000715. The adjusted pooled values were α = 0.909 (95% CI (0.898, 0.920); 95% PI (0.823, 0.954)) for English and α = 0.883 (95% CI (0.870, 0.894); 95% PI (0.772, 0.940)) for other languages, with a post-adjustment difference of Δα ≈ 0.027 (Supplemental D: Table CM-6A).
The CR2–HTA omnibus tests for analysis type, factor analysis, publication/analysis type, participant type, continent, and intended construct (adjusted/unadjusted), which were included in the current study as other categorical moderators, were not statistically significant; level-based α estimates, CIs/PIs, and multiple comparison results are detailed in Supplemental D: Tables CM-1–CM-5, CM-7, CM-7A, and CM-8.
Continuous moderators and alpha coefficient relationship
The analyses employed a three-level REML framework with CR2/Satterthwaite inference, and estimates were back-transformed to the α scale using the Bonett (ABT) transformation. The slope for mean age was statistically significant: β_ABT = 0.0684 per 10 years (SE = 0.0268), t(22.8) = 2.55, p = 0.0179; model k = 79, m = 74. After back-transformation, the expected reliability was α = 0.871 (95% CI (0.855, 0.885); 95% PI (0.747, 0.934)) at approximately 20.8 years, α = 0.882 (95% CI (0.872, 0.892); 95% PI (0.770, 0.940)) at approximately 33.8 years, and α = 0.892 (95% CI (0.880, 0.903); 95% PI (0.788, 0.945)) at approximately 46.9 years. Relative to the null model, τ²_between decreased from 0.0789 to 0.0594, corresponding to a pseudo-R²_between of approximately 24.7% (Supplemental E: Table CM-9).
The slope for sample size was statistically significant: β_ABT = 0.00314 per 100 participants (SE = 0.000456), t(2.35) = 6.88, p = 0.0131; model k = 128, m = 118. After back-transformation to the α scale, the expected reliability at representative sample sizes was α = 0.878 (95% CI (0.861, 0.894); 95% PI (0.757, 0.939)) at approximately n = 5 (mean − 1 SD), α = 0.881 (95% CI (0.865, 0.895); 95% PI (0.763, 0.941)) at approximately n = 747 (mean), and α = 0.890 (95% CI (0.877, 0.901); 95% PI (0.779, 0.945)) at approximately n = 3131 (mean + 1 SD). Relative to the null model, τ²_between decreased from 0.0645 to 0.0580, corresponding to pseudo-R²_between ≈ 10.1%. Detailed coefficients and summaries are provided in Supplemental E: Table CM-13.
The CR2-robust slope tests for the remaining continuous moderators included in the current study, namely age SD (Supplemental E: Table CM-10), female proportion (Supplemental E: Table CM-11), publication year (Supplemental E: Table CM-12), and scale mean (Supplemental E: Table CM-14), were not statistically significant. Level-representative α estimates, CI/PI intervals, and model summaries are provided in the corresponding Supplemental E tables.
Publication bias
The impact of small-study effects was assessed through the implementation of a three-level PET/PEESE (CR2) approach. The PET intercept corresponded to α = 0.888 (95% CI (0.868, 0.904)) with a non-significant slope (p = 0.468), and the PEESE intercept to α = 0.882 (95% CI (0.871, 0.893)) with a non-significant slope (p = 0.860). These findings suggest limited small-study bias and alignment with the main pooled estimate (Supplemental C: Table 3). For visual inspection, PET/PEESE funnel plots are provided in Supplemental Figure 1.
Discussion
Overall alpha coefficient and measurement error
In the three-level RG meta-analysis, the pooled internal consistency of the MDAS total score was α = 0.882 (95% CI (0.873, 0.889)), and the 95% PI was (0.759, 0.942), indicating non-trivial dispersion expected across new populations and settings (Supplemental C: Table 1). When interpreting this magnitude against conventional guidance, internal consistency is typically expected to exceed 0.80 for applied use and 0.70 for exploratory work; a value approaching 1.00 is not desirable, however, because it would signal item redundancy (Haktanir et al., 2024; Nunnally and Bernstein, 1994). By this yardstick, the pooled MDAS reliability is comfortably within the recommended range.
These findings are consistent with the original MDAS validation by Humphris et al. (1995), which reported high internal consistency alongside the United Kingdom norms. The pooled estimate aligns with the single-study evidence, while the PI brackets the likely range for future applications beyond the original UK context, underscoring both robustness and contextual variability.
In the context of classical test theory, the measurement-error proportion is defined as (1−α). The pooled estimate of this proportion is 0.118; the 95% CI ranges approximately from 0.111 to 0.127; and the PI ranges approximately from 0.058 to 0.241 (Supplemental C: Table 2). The presentation of the point estimate, along with both the CI and the PI, elucidates the variability in the proportion of observed-score variance attributable to measurement error, contingent upon the sampling context.
Categorical moderators
Across the range of categorical moderators, only language showed a statistically significant (F (1, 43.1) =12.1, p = 0.0012) moderator effect on internal consistency, and this difference persisted after adjustment for continent (Supplemental D: Tables CM-6 and CM-6A). This pattern is theoretically coherent with cross-cultural measurement guidance: translation quality, semantic and idiomatic equivalence, and pretesting procedures can influence item intercorrelations and thus α (Beaton et al., 2000; Diercke et al., 2013; Sousa and Rojjanasrirat, 2011). In a broader perspective, the bias–equivalence framework distinguishes construct, method, and item bias as potential sources of cross-linguistic differences in scores and reliability (van de Vijver and Tanzer, 2004). The continent-adjusted analysis indicates that the language effect is not merely a proxy for geographic clustering; however, unmeasured context (e.g. administration mode; clinical vs community settings) may still contribute.
MDAS-specific evidence is consistent with this interpretation: multiple adaptations have reported high internal consistency while noting differences in structure or administration context (e.g. Italian, Swedish, Taiwanese, and Chinese versions; Gremigni et al., 2014; Höglund et al., 2024; Lin et al., 2021; Yuan et al., 2008). Taken together with the present meta-analytic finding, these reports imply that cross-linguistic variation in α is modest but detectable and may reflect adherence to recommended adaptation steps (committee review, back-translation, cognitive debriefing) and whether dimensionality is explicitly checked.
Continuous moderators
In three-level REML meta-regressions with CR2/Satterthwaite inference and Bonett (ABT) back-transformation to the α scale, two continuous predictors—mean age and sample size—exhibited statistically reliable slopes, while the remaining predictors showed no reliable effects.
For 10-year increments in mean age, the ABT-scale slope was β = 0.0684 (p = 0.0179; Supplemental E: Table CM-9). The back-transformed estimates at representative values were α = 0.871 (95% CI (0.855, 0.885); 95% PI (0.747, 0.934)) at approximately 20.8 years, α = 0.882 (95% CI (0.872, 0.892); 95% PI (0.770, 0.940)) at approximately 33.8 years, and α = 0.892 (95% CI (0.880, 0.903); 95% PI (0.788, 0.945)) at approximately 46.9 years. Including mean age reduced between-study τ² from 0.0789 to 0.0594 (pseudo-R²_between ≈ 24.7%) while slightly increasing within-study τ² (pseudo-R²_within ≈ −24.1%), suggesting that age accounts for variability across studies while within-study heterogeneity persists. Conceptually, lifespan research documents age-related shifts in affect dynamics and emotion regulation that can alter item intercorrelations (and thus α), offering a plausible mechanism (Charles and Carstensen, 2010). In the context of dental anxiety, studies have reported age-related differences in symptom levels and onset, with older age groups tending to exhibit lower anxiety, which is consistent with the modest but statistically significant increase in reliability observed at higher mean ages (Armfield, 2006; Hittner and Hemmo, 2009; Yuan et al., 2008).
For sample size, the ABT-scale slope was β = 0.00314 per 100 participants (p = 0.0131; Supplemental E: Table CM-13). After back-transformation to the α scale, expected reliability at representative sample sizes was α = 0.878 (95% CI (0.861, 0.894); 95% PI (0.757, 0.939)) at n ≈ 5, α = 0.881 (95% CI (0.865, 0.895); 95% PI (0.763, 0.941)) at n ≈ 747, and α = 0.890 (95% CI (0.877, 0.901); 95% PI (0.779, 0.945)) at n ≈ 3131. Relative to the null model, between-study τ² decreased from 0.0645 to 0.0580 (pseudo-R²_between ≈ 10.1%), while within-study τ² changed only slightly, from 0.0653 to 0.0659 (pseudo-R²_within ≈ −0.9%). Methodologically, sample size primarily governs precision (i.e. interval width) rather than the population reliability itself; very small samples inflate uncertainty and can destabilize α through sampling fluctuation (Bonett, 2002). The modest positive slope likely reflects improved precision and co-occurring design/administration features in larger studies, not a causal effect of n on the true-score structure.
Homogeneity and induction rates
In the context of RG meta-analysis, researchers have offered diverging interpretations of non-significant moderator findings. Some argue that such results reflect the resilience of scale reliability to variation in study-level characteristics rather than a lack of meaningful moderation (Onwuegbuzie and Daniel, 2002). Prior RG meta-analyses show that, in several cases, no single moderator significantly explained the variance in overall alpha coefficients; however, such outcomes should not be attributed exclusively to invariant reliability or to limited statistical power. The substantial heterogeneity at both the between- and within-study levels in the present dataset suggests that reliability is a property of scores rather than of tests, and that sample- and administration-specific conditions can systematically shift Cronbach's α. At the categorical level, the persistence of the language effect after adjustment for continent, and at the continuous level, the finding that mean age and sample size account for only a modest share of between-study variance, together indicate that the observed heterogeneity at least partly reflects the measurement context (Supplemental D–E). The variability that remains unexplained is consistent with unreported design or administration details (e.g. administration mode; clinical vs community settings), variation in adaptation quality, and potential differences in dimensionality. Accordingly, interpretation should not rely solely on the point estimate; the breadth of the PI underscores that the reliability expected in any single new study is contingent on context.
In the absence of statistically significant moderator effects, the observed reliability coefficients may reflect the robustness of scale scores to contextual variation. This stability can be interpreted as evidence for the generalizability and psychometric strength of the MDAS across populations (Botella et al., 2010; Yörük and Sen, 2023). Nonetheless, methodological interpretations vary; as Helms et al. (2006) have emphasized, the underlying evidence should be evaluated systematically before firm conclusions about null moderator effects are drawn. Against this backdrop, the high prevalence of reliability induction in the literature suggests that MDAS applications frequently omit sample-specific reliability estimates and instead carry coefficients borrowed from prior studies into the evidentiary record. Such practices (i) impede context-sensitive interpretation of reliability, (ii) attenuate the detection of genuine moderator effects by diluting the cross-study signal, and (iii) weaken comparability across studies. Three pragmatic improvements are therefore recommended: first, report sample-specific alpha with CIs and, where feasible, PIs; second, document sample composition transparently, especially the age distribution, along with translation/adaptation and administration details; and third, accompany alpha with coefficients that relax the tau-equivalence assumption or accommodate ordinal item scaling (e.g. omega, ordinal alpha).
In the present meta-analysis, the overall reliability induction rate for studies using the MDAS was 76.1%. This rate reflects a substantial portion of the literature relying on previously published estimates. Similarly, López-Pina et al. (2015) found that in their RG analysis, 84.7% of studies failed to report any reliability coefficient (induction by omission), while 15.3% reused estimates from earlier applications of the scale (induction by report). Such practices, though common, may limit the precision and contextual relevance of reliability interpretations (Sen, 2022).
Publication bias
In this RG meta-analysis, efforts were made to minimize the impact of publication bias by including both published and unpublished studies (Rosenthal, 1995). As part of the literature search strategy, the ProQuest database was systematically reviewed to identify English-language master’s and doctoral theses. This search yielded three unpublished theses that met all inclusion and exclusion criteria.
The assessment of small-study effects was conducted through the implementation of three-level PET/PEESE with CR2 inference. This methodology has been demonstrated to effectively differentiate size-related artifacts from selective reporting, while concurrently providing bias-adjusted intercepts on the α scale (Stanley, 2017; Stanley and Doucouliagos, 2014). In the models under consideration, the PET intercept was estimated to be α ≈ 0.888 (95% CI (0.868, 0.904)), and the PEESE intercept was α ≈ 0.882 (95% CI (0.871, 0.893)). These intercepts closely tracked the pooled estimate. The corresponding slope terms were not statistically significant. These results indicate limited small-study effects and suggest that selective reporting does not materially distort the central reliability estimate (Supplemental C: Table 3).
To verify robustness, we conducted Egger's regression with CR2 standard errors, clustered by study. The intercept tests did not indicate directional asymmetry, corroborating the PET/PEESE conclusions (Egger et al., 1997; Supplemental C: Table 4). Additional robustness analyses with alternative specifications and restricted samples yielded substantively similar inferences (Supplemental C: Table 5).
The PET/PEESE funnel plot (three-level specification) provides a visual complement to these findings: the pattern is broadly consistent with symmetry, with no concentration of small, extreme estimates that would suggest publication-related distortions (Supplemental Figure 1). Taken together, the bias-adjusted intercepts, sensitivity tests, and graphical evidence converge on the same conclusion: small-study and selection effects are improbable explanations for the observed pooled alpha or its dispersion. Consequently, the substantive interpretation of reliability remains unaltered by bias adjustments.
Limitations
Although this RG meta-analysis indicates that the reliability of MDAS scores is high, it has several limitations. First, the MDAS is best interpreted as trait-oriented; our construct-based subgrouping relied on study descriptions rather than direct state–trait assessments, so residual conflation cannot be ruled out (Steenen et al., 2024). Second, the meta-analysis focused on α because alternatives (ω, ordinal α) are reported infrequently, and inconsistent subscale reporting precluded factor-specific reliability synthesis. Third, despite three-level REML with CR2 inference and PIs, residual heterogeneity remained, suggesting that unmeasured features such as administration mode, clinical versus community setting, adaptation quality, and dimensionality checks contribute to the observed heterogeneity. Because item-level data were unavailable, measurement invariance/differential item functioning (DIF) analyses were not feasible, and cross-linguistic comparability could be examined only indirectly. Fourth, although PET/PEESE and Egger's tests suggested limited small-study effects, bounded outcomes and dependence structures constrain their detection. Fifth, reliability induction (the omission of sample-specific reliability or the reuse of coefficients from earlier studies) was prevalent, occurring both by report and by omission, which limits moderator detection. Finally, the scope of the study was constrained to internal consistency, excluding test–retest reliability, sensitivity to change, and validity.
Conclusions
The objective of this reliability generalization was twofold: firstly, to verify the global reliability of a brief, widely utilized instrument for an essential clinical construct—dental anxiety—and secondly, to elucidate the extent to which that reliability varies across different settings. Given the impact of dental anxiety on avoidance and oral health outcomes, a transparent, evidence-based statement of the MDAS’s score reliability is a necessary foundation for both research and clinical decision-making (Armfield, 2006; Locker, 2003; Murray et al., 2019).
On average, MDAS total scores demonstrate high internal consistency, comfortably exceeding conventional thresholds for applied use (⩾0.80) and exploratory work (⩾0.70; Nunnally and Bernstein, 1994). Concurrently, reliability is a property of scores in context, not a fixed property of the test. The PI indicates that reliability for a new sample can be somewhat higher or lower than the pooled estimate, and the share of observed-score variance due to measurement error (1 – α) changes accordingly (Bonett, 2002; Lord and Novick, 1968). This combination—strong average reliability with bounded dispersion—captures both the robustness and the contextual sensitivity of MDAS scores.
Clinicians should ensure that the instrument is matched to the intended construct and adapted accordingly. Language exhibited a significant effect that remained after adjustment for continent, consistent with the assertion that translation quality, semantic/idiomatic equivalence, and pre-testing can modify item intercorrelations and thus α (Beaton et al., 2000; Sousa and Rojjanasrirat, 2011; van de Vijver and Tanzer, 2004). Among continuous variables, mean age demonstrated a positive association with reliability, consistent with shifts in affect dynamics and regulation over the lifespan that can influence responses. Larger studies yielded slightly higher and, more importantly, more precise reliability estimates, reflecting reduced sampling fluctuation rather than a causal effect of sample size on the true-score structure (Bonett, 2002; Charles and Carstensen, 2010). The other moderators did not demonstrate statistically reliable effects under robust inference, indicating residual heterogeneity that is likely attributable to unreported design or administration features, as well as to differences in adaptation or dimensionality checking.
Moreover, we identified a pervasive phenomenon of reliability induction, characterized by the reuse of coefficients from previous studies or the omission of sample-specific reporting. This practice undermines context-sensitive interpretation, hinders moderator detection, and diminishes the comparability across studies (Helms et al., 2006; López-Pina et al., 2015; Onwuegbuzie and Daniel, 2002; Vacha-Haase et al., 2000). To strengthen the evidence base, it is recommended that routine reporting of sample-specific alpha with CIs (and, where feasible, PIs) be implemented. In addition, transparent documentation of sample composition (especially age distribution) and adaptation/administration procedures is advised. Finally, pairing alpha with omega or ordinal alpha is necessary to address tau-equivalence and ordinal scaling.
The findings of this study indicate that small-study and selection effects are limited. The bias-adjusted PET/PEESE intercepts closely tracked the pooled estimate, and Egger’s test did not suggest directional asymmetry. Therefore, the substantive interpretation of reliability is unchanged by bias adjustment (Egger et al., 1997; Stanley, 2017; Stanley and Doucouliagos, 2014).
Conclusions for clinicians
The MDAS should be used as a first-line, trait-focused screening assessment, while situational tools such as the STAI-State or VAS-Anxiety should be used for peri-procedural (state) anxiety within the clinical workflow. This distinction prevents day-to-day management decisions (communication, appointment scheduling, sedation/psychological support referrals) from being conflated with longitudinal monitoring of trait-level risk.
Based on the pooled α = 0.882 (95% CI (0.873, 0.889); 95% PI (0.759, 0.942)), clinicians are advised to interpret interval information rather than single scores. Reporting sample-specific α with its 95% CI at the institutional or service level both makes local measurement error visible (the (1−α) principle) and ensures that scores close to thresholds are treated cautiously. When interpreting meaningful change at follow-up, the standard error of measurement based on the local α should be used where possible, because small differences may fall within measurement error.
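As a minimal illustration of this recommendation, the following sketch computes the standard error of measurement (SEM) and a 95% minimal detectable change from a locally estimated α and total-score SD; the numeric values are hypothetical and not taken from the meta-analysis.

```r
# Standard error of measurement (SEM) and 95% minimal detectable change (MDC95)
# from a locally estimated alpha and total-score SD (values below are hypothetical)
sem_mdas <- function(sd_total, alpha) sd_total * sqrt(1 - alpha)

sd_local    <- 4.5    # hypothetical SD of MDAS total scores in the local service
alpha_local <- 0.88   # hypothetical locally estimated Cronbach's alpha
sem   <- sem_mdas(sd_local, alpha_local)
mdc95 <- 1.96 * sqrt(2) * sem   # smallest change unlikely to be measurement error
round(c(SEM = sem, MDC95 = mdc95), 2)
```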
In this RG meta-analysis study, the language moderator was found to be significant and remained so after continent-specific adjustment; therefore, for non-English applications, the use of certified translations, following standard steps such as translation–back-translation/committee review–cognitive debriefing, and dimensionality CFA control are recommended for clinical comparability. Direct “equivalence” between languages should not be assumed; adaptation quality should be reported.
The statistically significant slope for mean age (a slight increase in α with age) suggests that score consistency may be somewhat higher in older populations, while the small but significant positive relationship with sample size indicates that recruiting a sufficiently large sample matters for interval width and stability in clinical monitoring and service studies. For small units (e.g. sub-clinics), we recommend estimating local α by pooling data at monthly or quarterly intervals so that the resulting intervals are acceptably narrow.
Footnotes
Ethical considerations
This study is a systematic review and reliability generalization meta-analysis based solely on previously published data. No new data was collected, and no human or animal participants were involved. Therefore, ethical approval and informed consent were not required in accordance with institutional and journal guidelines.
Consent to participate
Consent to participate is not applicable to this review article as no data were collected from participants.
Consent for publication
Consent for publication is not applicable to this review article as no identifiable participant data are included.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Supplemental material
Supplemental material for this article is available online.
