Abstract
Introduction
The reliability of performance assessment scores can be affected by several factors, such as the number of students, the number of raters, and rater performance. Minimizing inter-rater variability while preserving the feasibility of the assessment is important when evaluating medical education programs. This study aimed to examine the reliability coefficients derived from a generalizability study and a decision study (D-study) conducted within a two-facet crossed design using generalizability theory (G-theory) to assess performance in medical education.
Method
This study employed a two-facet, crossed-mixed design [b × p × m]. A total of 40 randomly selected students were evaluated by five raters (random facets) using 35 items (fixed facets) in a performance assessment setting. Data were analyzed using EduG software.
Results
Of the participants, 24 (60%) were female and 16 (40%) were male. The total mean score for the crossed-design set of skills was 62.11, the percentage of variance attributed to individuals was 33.90%, and the G-coefficient was 0.94. For the D-study, the reliability coefficients were 0.86 for two raters, 0.90 for three raters, 0.92 for four raters, 0.94 for five raters, 0.95 for six raters, and 0.96 for seven raters. In the G-facet analyses, there were no differences between the raters.
Conclusions
Inter-rater variability is a potential risk in limited performance evaluations, regardless of the application design. Rater standardization is recommended to reduce this risk. In our study, rater standardization and a D-study were applied within a crossed-mixed design. As these analyses become more widely used, crossed designs strengthened by rater standardization may become the preferred approach in assessment and evaluation practices in medical education. Suitable rater standardization can make crossed designs a practical option for performance assessment, and analyzing the results with G-theory provides feedback for subsequent ratings.
Introduction
Undergraduate medical education consists of two main phases: preclinical and clinical phases. Skill training is an essential part of the medical curriculum in both phases of the program. Basic clinical skills training aims to equip students with the necessary knowledge and abilities to perform tasks (e.g., communication, history taking, physical examination, and intravenous catheter placement) in the preclinical phase. 1 This training plays a crucial role in undergraduate medical education by providing students with the foundational skills they need during their years of education. 2 In the clinical phase, students engage in practical training in genuine healthcare settings, including outpatient and inpatient clinics and emergency rooms. This hands-on exposure enables students to apply and enhance their fundamental clinical skills, fostering their development as competent and adept healthcare professionals.
Skills training programs entail hands-on educational activities that demand dedicated resources, including appropriate facilities, materials, proficient workforce, and efficient logistical support. 3
Assessment is a fundamental cornerstone of medical education. 4 Evaluating student achievement not only informs pedagogical practices at various educational stages but also places substantial responsibility on institutions to ensure the reliability and validity of assessment outcomes. This responsibility is particularly important because consequential decisions concerning individuals’ academic pursuits are frequently based on the results of these measurements.
In medical education, performance assessment is a distinctive aspect of evaluating student accomplishments. Performance assessment is a testing method in which students are assigned tasks that require the generation or execution of responses to practical and meaningful assignments. This approach emphasizes the application of advanced psychomotor skills and is evaluated based on explicit and well-defined criteria. 5
Two characteristics of performance assessment are mentioned in the literature: the task is based on real professional situations, and both the learning product and the process of creating that product are evaluated. 6 Performance assessment comes in two forms: assessment in a real environment, where the student “performs the task,” is called comprehensive performance assessment, whereas assessment in a simulated environment, where the student “demonstrates how the task is done,” is referred to as limited performance assessment. 4
Several performance assessment methods have been used in various educational contexts. These diverse methodologies serve as pivotal tools for comprehensively gauging the proficiency and capabilities of individuals in their respective study domains.7,8
In medical education, the assessment process is often intricate, and numerous factors contribute to potential measurement error. These factors include external elements such as raters, variations in cases, the difficulty of procedures, the nature of the procedure itself, and engagement with other individuals such as supervisors. 9 An increase in the number of students can also impose a considerable constraint on the feasibility of the chosen assessment approach because of the increased workload of raters or evaluators. 10 A rater’s own performance may also be influenced by factors such as the assessment tool, the number of students, and the number of performances to be evaluated. Typically, raters assess different groups, and despite efforts to achieve rater standardization, variations can occur among raters, prolonging the evaluation process.11–13 These challenges may compromise the reliability of student-performance scores.
Generalizability theory (G-theory) facilitates a robust assessment of reliability by incorporating a comprehensive array of sources and factors that contribute to the variance in performance and measurement errors.14–16 G-studies involve the calculation of multiple sources of variance in a single analysis. They determine the magnitude of each source of variance and yield reliability coefficients for relative decisions based on individual performance, as well as for absolute decisions about individual performance. These coefficients, namely the G- and Phi-coefficients, are used to assess reliability.17,18
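For a fully random two-facet crossed [b × p × m] design, these coefficients are conventionally defined from the estimated variance components as shown below (a standard textbook formulation rather than the study's own derivation; in the mixed design used here, with the item facet fixed, the error terms are adjusted accordingly):

```latex
% Relative (G) and absolute (Phi) error variances for a random b x p x m design,
% with n_p raters and n_m items in the measurement design.
\sigma^{2}_{\delta} = \frac{\sigma^{2}_{bp}}{n_p} + \frac{\sigma^{2}_{bm}}{n_m} + \frac{\sigma^{2}_{bpm,e}}{n_p n_m},
\qquad
\sigma^{2}_{\Delta} = \sigma^{2}_{\delta} + \frac{\sigma^{2}_{p}}{n_p} + \frac{\sigma^{2}_{m}}{n_m} + \frac{\sigma^{2}_{pm}}{n_p n_m}

E\rho^{2} = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}_{\delta}},
\qquad
\Phi = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}_{\Delta}}
```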
In decision studies (D-studies), information from the generalizability study (G-study) is used to make decisions for a specific purpose. 15 A D-study provides estimates that help determine the required number of raters, items, or tasks.
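In practice, this means recomputing the error variance with alternative facet sample sizes, for example, varying the number of raters from n_p to n'_p in the relative-decision case (a generic illustration of the principle, not the study's specific computation):

```latex
E\rho^{2}(n'_p) = \frac{\sigma^{2}_{b}}
{\sigma^{2}_{b} + \dfrac{\sigma^{2}_{bp}}{n'_p} + \dfrac{\sigma^{2}_{bm}}{n_m} + \dfrac{\sigma^{2}_{bpm,e}}{n'_p\, n_m}}
```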
Numerous studies have been conducted in the medical education context to investigate factors associated with the reliability of performance assessment scores. Among the factors under investigation, variations associated with participants and raters are the two most frequently explored. 9 However, to the best of our knowledge, few studies have explored assessor standardization and its effects on the variance in student-performance scores.
This study aimed to examine the reliability coefficients derived from a G-study and D-study conducted within a two-facet cross-design using G-theory to assess performance in medical education.
Method
This study used a quasi-experimental and quantitative research design. This study was approved by the Süleyman Demirel University (SDSoM) Clinical Research Ethics Committee (No. 169740; date: 21.12.2020). All participants were informed of the study protocols, and written informed consent was obtained from all of them. Since the participants were students, they were also provided with a commitment document stating that “there will be no sanctions regarding the education program of the students.” Data were collected for this study on 26.01.2022.
Study design
Crossed design: in this design, each facet is sampled at all levels along with the others; that is, all the conditions of one facet are observed in combination with all the conditions of the other facets. 14 An “×” mark is used to indicate the relationship between the facets. For example, in our study, all raters (p) in the crossed design observed all items/tasks (m) for all students (b), which is expressed as [b × p × m]. In this two-facet crossed design, all 40 students (individuals) were rated separately by five raters on 35 items.
Participants
Data were collected from first-year medical students enrolled in the 2021–2022 academic year at SDSoM in Türkiye. A power analysis was performed to determine the sample size, which was calculated as 40 at a 95% confidence level, taking into account the number of trainers, the number of students, and scoring logistics. The calculation was based on a population size of 240 students, an expected frequency of 50%, and an acceptable margin of error of 14%. Students were randomly selected from the medical school enrollment list and invited via email to participate in this study. For those who chose not to participate (n = 8), we employed the same selection process, ultimately achieving the desired sample size through a second round of invitations. All students were randomly sampled from each group.
Data collection tool
Five basic clinical skills were selected for evaluation in the study: hygienic hand washing; wearing a surgical cap, mask, and gown; wearing sterile gloves; removing the gloves; and removing the cap, mask, and gown. The researchers developed a scoring instrument for these five skills using national and international skills training guidelines. This instrument was independently reviewed by subject matter experts, including three medical educators and two public health specialists. Modifications were made as suggested by the experts, and the instrument was returned for review and agreement by the researchers. Ultimately, a rating scale comprising 35 items was created to assess the five skills used in the study. The scoring system was as follows: 0 points for no or inadequate performance, 1 point for performance that needed improvement, and 2 points for satisfactory performance. 19
A balanced dataset, which is frequently used in G-theory analyses, was used in this study’s design. In a balanced design, there are no missing data and, for any nested facet, the sample size is constant at each level of that facet, so the dataset contains equal numbers of observations for all facets. Accordingly, in our study, the same number of items was scored for every student; in a nested design, this would correspond to assigning an equal number of students to each rater.
Raters
Residents from the Public Health and Family Medicine Departments of the SDSoM were invited to participate as raters in the study, with 41 accepting the invitation (n = 14 and n = 27, respectively). To achieve rater standardization in the study, as recommended in the literature, a 30-min standardization session was conducted prior to the evaluations. 20 During this session, an ideal sample performance video was shared with the raters before scoring. The features and scoring criteria of the tool were reviewed based on videos. The raters’ questions regarding these items were also addressed and evaluated.
Data collection and analysis
In the application, the student was asked to perform the five skills sequentially in front of the five raters within a 5-min timeframe. The raters were asked to simultaneously rate the students’ performance using a rating scale. This study utilized a mixed design, as the item facet remained fixed throughout the analysis.
Given the consecutive and hierarchical relationships among the items in the skills training, it was not possible to remove any items from the training sequence. Consequently, the scoring instrument items were included as a fixed facet in the G-study analysis. The raters were designated as the random facet of the study, and both the G-study and the D-study were conducted on this random facet. Because fixed facets are not varied in a D-study, the item facet was not included in those analyses.
This study used a two-facet crossed-mixed design [b × p × m] in which 40 students were evaluated by five raters (random facets) using 35 items (fixed facets). The G-study and D-study were conducted for this design. EduG software was used for data analysis.14,21–23
The distribution of the number of students in the dataset was balanced. For our skill set, the scores for all five skills were combined to calculate the total score. Data from the scores obtained for the same skill set in the crossed design were analyzed using G-theory. In this study, the students were coded as b (individuals), raters as p (raters), and skill items as m (items). The two-facet crossed-mixed [b × p × m] design involved three main effects (b, p, and m) and four interaction effects (bp, bm, pm, and bpm,e), resulting in seven effects.
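EduG carries out this variance decomposition internally. As an illustration of the underlying computation (a minimal sketch, not the authors' code, assuming a hypothetical NumPy array `scores` of shape (40, 5, 35) holding the student × rater × item ratings), the seven variance components of a fully crossed random [b × p × m] design with one observation per cell can be estimated from the ANOVA mean squares as follows:

```python
import numpy as np

def variance_components(scores: np.ndarray) -> dict:
    """Estimate variance components for a fully crossed b x p x m design
    (one observation per cell) via expected mean squares.

    scores: array of shape (n_b, n_p, n_m) -- students x raters x items.
    """
    n_b, n_p, n_m = scores.shape
    grand = scores.mean()

    # Marginal means for main effects and two-way interactions
    m_b = scores.mean(axis=(1, 2))   # per student
    m_p = scores.mean(axis=(0, 2))   # per rater
    m_m = scores.mean(axis=(0, 1))   # per item
    m_bp = scores.mean(axis=2)       # student x rater
    m_bm = scores.mean(axis=1)       # student x item
    m_pm = scores.mean(axis=0)       # rater x item

    # Sums of squares for the seven effects (bpm confounded with residual e)
    ss_b = n_p * n_m * np.sum((m_b - grand) ** 2)
    ss_p = n_b * n_m * np.sum((m_p - grand) ** 2)
    ss_m = n_b * n_p * np.sum((m_m - grand) ** 2)
    ss_bp = n_m * np.sum((m_bp - m_b[:, None] - m_p[None, :] + grand) ** 2)
    ss_bm = n_p * np.sum((m_bm - m_b[:, None] - m_m[None, :] + grand) ** 2)
    ss_pm = n_b * np.sum((m_pm - m_p[:, None] - m_m[None, :] + grand) ** 2)
    ss_bpm = np.sum(
        (scores - m_bp[:, :, None] - m_bm[:, None, :] - m_pm[None, :, :]
         + m_b[:, None, None] + m_p[None, :, None] + m_m[None, None, :]
         - grand) ** 2
    )

    # Mean squares
    ms = {
        "b": ss_b / (n_b - 1),
        "p": ss_p / (n_p - 1),
        "m": ss_m / (n_m - 1),
        "bp": ss_bp / ((n_b - 1) * (n_p - 1)),
        "bm": ss_bm / ((n_b - 1) * (n_m - 1)),
        "pm": ss_pm / ((n_p - 1) * (n_m - 1)),
        "bpm,e": ss_bpm / ((n_b - 1) * (n_p - 1) * (n_m - 1)),
    }

    # Random-effects variance components from the expected mean squares
    var = {"bpm,e": ms["bpm,e"]}
    var["bp"] = (ms["bp"] - ms["bpm,e"]) / n_m
    var["bm"] = (ms["bm"] - ms["bpm,e"]) / n_p
    var["pm"] = (ms["pm"] - ms["bpm,e"]) / n_b
    var["b"] = (ms["b"] - ms["bp"] - ms["bm"] + ms["bpm,e"]) / (n_p * n_m)
    var["p"] = (ms["p"] - ms["bp"] - ms["pm"] + ms["bpm,e"]) / (n_b * n_m)
    var["m"] = (ms["m"] - ms["bm"] - ms["pm"] + ms["bpm,e"]) / (n_b * n_p)
    # Negative estimates are conventionally truncated at zero
    return {k: float(max(v, 0.0)) for k, v in var.items()}
```

Percentages such as those reported in Table 4 follow by dividing each component by the sum of all seven; in the mixed design used here, the fixed item facet changes how the components are combined into universe-score and error variances rather than the decomposition itself.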
This study conforms to the STROBE statement for observational studies. 24
Results
The study was conducted on January 26, 2022, at the Interprofessional Applied Training Laboratory of the Süleyman Demirel University Medical Faculty. The study involved a planning team of three faculty members, five raters, and two staff members, as well as 40 students (n = 40). Of these students, 24 (60%) were female and 16 (40%) were male. The average age of the students was 18.62 ± 1.15 years. Among the raters, one (20%) was female and four (80%) were male. The average age of the raters was 28.62 ± 3.34 years. To describe the scores, the average score for the skills was calculated as 1.56 ± 0.11. A detailed score analysis for each item is presented in Table 1.
Descriptive Findings.
In the analysis of the average skill scores in the crossed-mixed design of the students’ performance ratings, the average score for hygienic hand rubbing was 3.58 (min: 0, max: 4). The average score for wearing the cap, mask, and clean apron was 19.97 (min: 0, max: 22). The average score for wearing sterile gloves was 13.76 (min: 0, max: 16), and for sterile glove removal it was 6.77 (min: 0, max: 8). Finally, the average score for removing the cap, mask, and apron was 18.04 (min: 0, max: 20). The overall average skill-set score was 62.11 (min: 0, max: 70) (Table 2).
Two-Facet Crossed-Mixed Design [b × p × m] Mean Score Values.
The mean score of rater 1 was 62.13 ± 9.51, that of rater 2 was 60.48 ± 11.93, that of rater 3 was 61.08 ± 11.60, that of rater 4 was 64.00 ± 10.89, and that of rater 5 was 62.88 ± 13.11. The detailed score values and standard deviations for the crossed-mixed [b × p × m] design are presented in Table 3.
Mean Values and Standard Deviations of Scores in the Crossed-Mixed [b × p × m] Design.
G-study: G-theory was used to evaluate the reliability of the skills assessment. When the application was evaluated in the crossed-mixed design [b × p × m], the percentage of the variance component estimated for individuals was 34.10%, that for items/tasks was 3.00%, and that for raters was 0.30%. The variance component percentages estimated for the individual-item/task, individual-rater, item/task-rater, and individual-rater-item/task interactions were 11.30%, 11.10%, 3.70%, and 36.60%, respectively (Table 4).
Variance Values and Total Variance Explanation Ratios Estimated Via the G-Study for the [b × p × m] Design.
Estimated variance component interpretations for the two-facet crossed pattern
In G-theory, the variance component estimated for individuals in the two-facet crossed-mixed design is the universe score variance, corresponding to the true score variance in classical test theory. This parameter indicates the extent to which individuals differ in their measured characteristics, and differences between individual characteristics can be determined through measurement. Therefore, the share of the variance estimated for individuals in the total variance should be large. 14 In this study, the percentage of the variance component estimated for individuals (b = 33.90%) was large. The large relative share of this variance in the total variance indicates that systematic differences between individuals can be revealed and that the power of the observed scores to represent the universe (true) scores increases. Therefore, in line with the literature, this measurement tool can reveal differences in the measured traits and represent the population. 25
The estimated variance for the rater main effect indicates whether a particular rater’s ratings across all individuals are more generous or stricter than those of other raters. 14 A small variance percentage indicates that the raters’ scores are consistent across all individuals. When the variance component estimated for the raters approaches zero, the ratings given to all individuals are similar; when its share of the total variance is zero, the raters behave with the same strictness/generosity in their scores for all individuals. In this study, the percentage of the variance component estimated for the raters was 0.30%, and the raters were therefore evaluated as having similar generosity and strictness behaviors. This finding is consistent with the effect of rater standardization, as rater variability has been expressed as an important source of error in the literature.11,12,20
The interpretation of the variance component estimated for the item/task (m) main effect is similar to that of the rater main effect. The mean value for any task is taken as its level of difficulty; the variance estimated for the task main effect therefore reflects differences in task difficulty. In this study, the percentage of the variance component estimated for the item/task facet was 3.00%. This small variance component (m = 3.00%) estimated in the G-study for the item/task main effect indicates that item difficulties are not very different from one another. 26
The variance component estimated for the individual-rater common effect indicates whether a certain rater scores a certain individual more strictly or generously than other raters do. A relatively high variance indicates that some raters score some individuals more strictly or generously than others. In this study, the variance component estimated for the individual-rater interaction was 11.10%. This relatively high value can be interpreted as some raters scoring some individuals more strictly or generously than others.
The variance component estimated for the individual-item/task common effect shows variation in the relative position of a given individual from one task to another. The larger the share of this variance in the total variance, the greater the differences in the relative positions of some individuals from task to task. In this study, the variance component estimated for the individual-item/task interaction was 11.30%, the third highest value estimated in this study. This finding is interpreted as being compatible with the inclusion of five skills of differing difficulty (hygienic hand rubbing (two items), wearing the cap, mask, and clean apron (11 items), wearing sterile gloves (eight items), removing sterile gloves (four items), and removing the cap, mask, and apron (10 items)).
The variance component estimated for the rater-item/task common effect indicates the extent to which raters score individuals consistently from task to task. When the estimated variance approaches zero, the raters score each task consistently. In this study, the variance component for the rater-item/task interaction was estimated to be 3.70%. This value is interpreted as stable rater behavior, even though there are differences in difficulty between the tasks.
The final component reflects the individual-rater-item/task (bpm,e) joint effect together with unmeasured sources of variance as a composite. The unmeasured sources of variance can be divided into two groups: systematic (e.g., some students having practiced certain skills more) and non-systematic/random (students’ individual differences). In this study, this composite variance component had the highest estimated value (37.1%). This finding is consistent with those of many studies that have evaluated real-world data.25,27 The high variance can be interpreted as reflecting the fact that some students had already received this training for other reasons (e.g., during the COVID-19 pandemic period), that the skills are relatively easy and quickly learned, and that uncontrolled sources of error were involved in the process.
In the analysis of the reliability coefficients using G-theory, the G-coefficient for the crossed design [b × p × m] was calculated to be 0.94.
Decision study: The D-study provides an estimate of the required number of raters. It was conducted by increasing and decreasing the number of raters, the random facet of the crossed-mixed design, in which 40 students were evaluated on 35 items. The detailed findings are presented in Table 5.
D-Study Results for Raters in the Crossed-Mixed Design [b × p × m] (Optimization).
D-study: decision study.
The G-coefficient obtained with the five raters working within the scope of the study was 0.94, and the G values obtained when this number was reduced to four, three, and two were estimated as 0.92, 0.90, and 0.86, respectively. A slight decrease in the G-coefficient was observed when the number of raters decreased. When the number of raters increased from five to six or seven, the G-coefficients were estimated to be 0.95 and 0.96. At this point, a relatively small increase in the G-coefficient was observed.
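Because, with the item facet fixed, both rater-related relative error terms scale inversely with the number of raters, these D-study values can be approximately reproduced from the five-rater coefficient alone. The following back-of-the-envelope sketch (assuming only this proportionality, not the exact EduG computation) illustrates the projection:

```python
def project_g(g_obs: float, n_obs: int, n_new: int) -> float:
    """Project a G-coefficient to a different number of raters, assuming the
    relative error variance is inversely proportional to the number of raters."""
    error_ratio = (1.0 - g_obs) / g_obs   # relative error / universe-score variance at n_obs raters
    per_rater = error_ratio * n_obs       # the same ratio scaled to a single rater
    return 1.0 / (1.0 + per_rater / n_new)

for n in range(2, 8):
    print(n, round(project_g(0.94, 5, n), 2))
# Approximately: 2 -> 0.86, 3 -> 0.90, 4 -> 0.93, 5 -> 0.94, 6 -> 0.95, 7 -> 0.96
# (Table 5 reports 0.92 for four raters; the small difference reflects rounding
#  of the 0.94 input coefficient.)
```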
In the context of G-theory, an analysis called facet analysis can be performed. With this analysis, the effect of removing the conditions of the facets on reliability can be examined further. This calculation was performed over the rater facet, which was the only random facet considered in this study. The results are presented in Table 6.
G-Facet Analysis Results.
Table 6 shows how the G-coefficient, estimated as 0.94 when 40 students were evaluated on 35 items by all five raters, changes when individual raters are removed. The largest decrease was observed when rater 4 was removed; in other words, removing rater 4 lowered the G-coefficient relatively more than removing any other rater. The smallest decrease was observed for rater 2. However, when the entire table is considered, none of these changes led to a significant decrease or increase in the G-coefficient.
Discussion
This study demonstrated that G-theory can be effectively applied to assess the reliability of performance-based assessments in undergraduate medical education, even in the context of basic clinical skills.
Our findings highlight that G- and D-studies can serve as a pragmatic framework for improving the reliability of performance assessment in medical education. While it is neither feasible nor necessary to conduct G- and D-studies for every clinical skills assessment, our results suggest that applying these methods selectively to high-stakes or frequently used assessments can meaningfully minimize inter-rater variability and guide rater standardization efforts. Once standardized procedures are established for a given assessment, the principles of rater calibration can be extended to other similar evaluations, thereby reducing the need for repeated, full-scale analyses. In this sense, our study provides a proof-of-concept model that demonstrates the strengths and limitations of using G-theory in practice. We argue that this approach offers a realistic and scalable strategy for institutions seeking to balance methodological rigor with practical constraints, positioning rater standardization and G-theory-based analyses as valuable tools for enhancing the fairness and credibility of clinical skills evaluations.
In our study, basic professional skills training, which is frequently assessed through limited performance evaluations, was chosen. In such assessments, differences between raters are important in terms of fairness and applicability. 28 In addition, performance assessments applied to skills training in undergraduate medical education are considered ideal for G-studies because they involve repeated measurements of the same construct. 29 Recent studies have shown that collecting repeated measurements of the same construct increases reliability. 30 This is because random errors can cancel each other out across multiple measurements of the same construct. 16
The main aim of the study design was to minimize errors when identifying real differences between the groups of interest. In this study, rater standardization was performed. Studies on raters have discussed different types of rater bias and their impact on scores and judgments. 28 Each station is more or less difficult, and each rater is more or less lenient; both have unique effects on students’ scores. Instead of ignoring these effects, we can measure them using G-theory. In this study, in line with the literature, differences between assessors were a source of error in the design.
Although foundational work on G-theory has often relied on simulated datasets, many studies use, and recommend using, data obtained from real applications.22,23,27,28,31
This study was designed with the idea that the use of G-theory applications in real-time limited performance evaluation in skills training, which has an important role in pre-graduation medical education programs, would contribute to the evaluation of many sources of error by providing feedback and enabling prospective decision making. In this study, the data obtained by scoring the actual application of skill training were evaluated in accordance with the literature.32–34
There are many different patterns in G-theory applications.16,35,36 In Turkey, there has been an increase in studies on G-theory over the years. 37 In a study evaluating 60 studies conducted in Turkey between 2004 and 2017, the majority used a two-facet crossed design with balanced datasets. 37 A mixed design was used in a few of these studies. 37 Given the mixed design preferred in our study, this study contributes to the literature in terms of design. 38
In our study, the data obtained from the real-time limited performance evaluation of the five skills included in the training program were evaluated. In this context, 40 students were scored by five raters. The student and rater characteristics were in line with those of many previous studies.10,12,19
After scoring, the data were analyzed using EduG software, which was developed for G-theory applications and is widely used in the literature.14,15
In this study, the percentage of the variance component estimated for individuals in the two-facet crossed-mixed design was large in the G-studies conducted using G-theory. Therefore, the measurement tool can reveal differences in terms of the measured traits in accordance with the literature and can represent the universe. 25
In the variance estimated for the main effect of the raters, the raters were evaluated as having similar generosity/harshness behaviors. This finding is consistent with the effect of rater standardization, as rater variability has been expressed as an important source of error in the literature.11,12,20
The variance component estimated for the item/task main effect was interpreted in terms of differences in task difficulty. The small variance component estimated for this main effect indicates that item difficulties are not very different from one another. 26
The relatively high variance component estimated for the individual-rater common effect can be interpreted as some raters scoring some individuals more strictly or generously than others.
The variance component estimated for the individual-item/task common effect was the third-highest value estimated in this study. This finding is interpreted as compatible with the effects of the five skills of different difficulty levels found in the study (hygienic hand-rubbing skill (two items), wearing cap–mask–clean apron skills (11 items), wearing sterile glove skills (eight items), sterile glove-removal skill (four items), and cap–mask–apron removal skill (10 items)).
The variance component estimated for the rater–item/task common effect was interpreted as the stable behavior of the raters, although there were differences in task difficulty.
The variance component estimated for the individual-rater-item/task (bpm,e) common effect was consistent with those reported in many studies that evaluated real practice data.25,27 This high variance can be interpreted as reflecting the fact that some students had already received this training for other reasons (e.g., during the COVID-19 pandemic period), that the skills are relatively easy and quickly learned, and that uncontrolled sources of error were involved in the process.
G-coefficients
In reliability analysis, the number of test items, the relationships between items, and dimensionality affect alpha values. There are different reports of acceptable alpha values, ranging from .70 to .95.39–41 According to G-theory, G- and Phi-coefficients above 0.70 are accepted as generalizable and reliable in reliability analyses conducted to evaluate the internal consistency of measurement tools.27,40,42–44 In this study, the reliability coefficient was determined from the assessment results. Because measurement and evaluation practices in medical education programs inform decisions about the achievement of graduation goals, a high reliability coefficient is recommended.
In our study, the G-coefficient for the crossed design was calculated as 0.94 in the analysis of the reliability coefficients based on G-theory. This finding is considered generalizable and reliable, which is in line with the literature.14,15,27,45
Decision studies
In the evaluation of inter-rater reliability, G-theory, which allows the evaluation of many sources of error, is at the forefront of research. 46 In the D-study conducted for raters in the crossed-mixed design [b × p × m], the G-coefficients were 0.86 for two raters, 0.90 for three raters, 0.92 for four raters, 0.94 for five raters, 0.95 for six raters, and 0.96 for seven raters. The D-study thus provided feedback that the number of raters could be reduced while maintaining acceptable reliability, which is consistent with the literature.
In addition to examining differences between raters in a G-study, such differences can be examined in more detail in a G-facet study. In our study, there was no difference between the raters in the G-study. A G-facet study quantifies the change in the reliability coefficient when a rater is removed; our study revealed no significant change in the reliability coefficient after the removal of any single rater. This was considered a positive effect of rater standardization.11,12,20
Rater supply and standardization are among the problems frequently encountered during skills training. The optimum rater requirement was determined using the D-study. In addition, differences between raters can be evaluated using G-studies and G-facet studies. This information may contribute to skill training in the field of medical education.
Assessments should be conducted in the context of educational programs. A score calculated in isolation carries little meaning. 47 A score gains meaning and begins to serve its context when there are standards against which it can be compared. 40 G-theory allows us to develop an approach in this context. Variance can be thought of as the distribution of the ingredients in a single slice of cake: it can sometimes work in favor of the test-taker and sometimes against them. Sometimes the variance behaves in predictable ways (systematic errors), and at other times it behaves unpredictably (random errors). Consequently, these aspects of variance must be considered to fully assess reliability.
In our study, when evaluated in terms of reliability coefficients and the feedback it yields, the crossed design provided information about many error sources that cannot be obtained with other designs. Its applicability can be further improved by acting on the D-study suggestions regarding the number of raters.
In the crossed design, each rater evaluated all students on all items, which was somewhat time-consuming. Nevertheless, the D-study suggests that such a process could be operated with as few as two raters while still achieving highly reliable measurement.
G-facet analyses revealed no difference between raters, in accordance with the variance component estimated for raters. This finding was considered a positive contribution of rater standardization, which has been emphasized in the literature.
Conclusion
This study demonstrated that G-theory can be effectively applied to assess the reliability of performance-based assessments in undergraduate medical education, even in the context of basic clinical skills. However, an important question for medical educators is whether these results can be extrapolated to more complex clinical skills assessments, such as the evaluation of communication skills, diagnostic reasoning, or procedural competence in simulated or real clinical environments. In such cases, variability may increase due to greater cognitive complexity, longer task durations, and higher inter-task heterogeneity. Although our findings are promising, they should be cautiously generalized. Future studies are needed to examine whether similar reliability can be achieved in high-fidelity assessments involving affective or interpersonal domains. Ultimately, this study contributes to medical education by showcasing a practical model for applying G-theory in real-world assessment contexts. By identifying and controlling the sources of measurement error, educators can improve the fairness and defensibility of student evaluations. In competency-based education models, where accurate assessments are critical, methodological rigor is essential.
Footnotes
Author Note
This manuscript was defended as a thesis at the Health Sciences Institute, Ege University, to fulfill the requirements of the Medical Education PhD Program.
Acknowledgments
The authors express their gratitude to Prof. Dr Hakan Atılgan and Prof. Dr Halil İbrahim Durak for their support during this study. Additionally, the authors extend their appreciation to the medical students and residents of the SDSoM who dedicated their time, commitment, and willingness to participate in this study. Finally, the authors would like to extend their deep and sincere gratitude to Neşe Güler and Gülşen Taşdelen Teker for their substantial support and guidance in the research design and for conducting the statistical analysis of this research.
Ethics Approval
This study was approved by the Süleyman Demirel University Clinical Research Ethics Committee (No. 169740, Date: 21.12.2020). This study was conducted in accordance with the principles of the Declaration of Helsinki.
Consent for Publication
All contributors were informed of the study.
Participant Consent
Before the study, all participants were informed of the study.
Author Contributions
Conception: GK and SAÇ; design: GK; supervision: GK; sources: GK and SAÇ; data collection and/or processing: GK; analysis and/or interpretation: GK and SAÇ; literature review: GK and SAÇ; main text: GK and SAÇ; and critical review: GK and SAÇ.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Declaration of AI-Assisted Technologies
The authors declare that artificial intelligence (AI)-assisted technologies, including large language models (LLMs), were not used in this study.
Data Access Statement and Material Availability
The data are stored in a data warehouse and can be accessed by others if requests are approved by the data warehouse.
