Abstract
Introduction
The reliability of performance assessment scores can be affected by several factors, such as the number of students, the number of raters, and rater performance. Minimizing inter-rater variability while preserving the feasibility of the assessment is important when evaluating medical education programs. This study aimed to examine the reliability coefficients derived from a generalizability study and a decision study (D-study) conducted within a two-facet crossed design using generalizability theory (G-theory) to assess performance in medical education.
Method
This study employed a two-facet, crossed-mixed design [b × p × m]. A total of 40 randomly selected students were evaluated by five raters (random facets) using 35 items (fixed facets) in a performance assessment setting. Data were analyzed using EduG software.
Results
Of the participants, 24 (60%) were female and 16 (40%) were male. The total mean score for the crossed-design set of skills was 62.11, the percentage of variance attributed to individuals was 33.90%, and the G-coefficient was 0.94. For the D-study, the reliability coefficients were 0.86 for two raters, 0.90 for three raters, 0.92 for four raters, 0.94 for five raters, 0.95 for six raters, and 0.96 for seven raters. In the G-facet analyses, there were no differences between the raters.
Conclusions
Inter-rater variability is a potential risk in limited performance evaluations, regardless of the application design. Rater standardization is recommended to reduce this risk. In our study, rater standardization and a D-study were applied within a crossed-mixed design. As these analyses become more widely used, crossed designs strengthened by rater standardization may become the preferred approach in assessment and evaluation practices in medical education. Suitable rater standardization can make crossed designs a practical option for performance assessment, and analyzing the results with G-theory provides feedback for subsequent ratings.
Introduction
Undergraduate medical education consists of two main phases: preclinical and clinical phases. Skill training is an essential part of the medical curriculum in both phases of the program. Basic clinical skills training aims to equip students with the necessary knowledge and abilities to perform tasks (e.g., communication, history taking, physical examination, and intravenous catheter placement) in the preclinical phase. 1 This training plays a crucial role in undergraduate medical education by providing students with the foundational skills they need during their years of education. 2 In the clinical phase, students engage in practical training in genuine healthcare settings, including outpatient and inpatient clinics and emergency rooms. This hands-on exposure enables students to apply and enhance their fundamental clinical skills, fostering their development as competent and adept healthcare professionals.
Skills training programs entail hands-on educational activities that demand dedicated resources, including appropriate facilities, materials, proficient workforce, and efficient logistical support. 3
Assessment is a fundamental cornerstone of medical education. 4 Evaluating student achievement not only informs pedagogical practices at various educational stages but also places substantial responsibility on institutions to ensure the reliability and validity of assessment outcomes. This responsibility is particularly important because consequential decisions concerning individuals’ academic pursuits are frequently based on the results of these measurements.
In medical education, performance assessment is a distinctive aspect of evaluating student accomplishments. Performance assessment is a testing method in which students are assigned tasks that require the generation or execution of responses to practical and meaningful assignments. This approach emphasizes the application of advanced psychomotor skills and is evaluated based on explicit and well-defined criteria. 5
Two characteristics of performance assessment are mentioned in the literature: the task is based on real professional situations, and both the learning product and the process of creating that product are evaluated. 6 Performance assessment comes in two forms: assessment in a real environment, where the student “performs the task,” is called comprehensive performance assessment, whereas assessment in a simulated environment, where the student “demonstrates how the task is done,” is referred to as limited performance assessment. 4
Several performance assessment methods have been used in various educational contexts. These diverse methodologies serve as pivotal tools for comprehensively gauging the proficiency and capabilities of individuals in their respective study domains.7,8
In medical education, the assessment process is often intricate, and numerous factors contribute to potential measurement error. These factors include external elements such as raters, variations in cases, the difficulty of procedures, the nature of the procedure itself, and engagement with other individuals such as supervisors. 9 An increase in the number of students can also impose a considerable constraint on the feasibility of the chosen assessment approach because of the increased workload of raters or evaluators. 10 A rater’s own performance may also be influenced by factors such as the assessment tool, the number of students, and the number of performances to be evaluated. Typically, raters assess different groups, and despite efforts to achieve rater standardization, variations can occur among raters, prolonging the evaluation process.11–13 These challenges may compromise the reliability of student-performance scores.
Generalizability theory (G-theory) facilitates a robust assessment of reliability by incorporating a comprehensive array of sources and factors that contribute to the variance in performance and measurement errors.14–16 G-studies involve the calculation of multiple sources of variance in a single analysis. They determine the magnitude of each source of variance and yield reliability coefficients for relative decisions based on individual performance, as well as for absolute decisions about individual performance. These coefficients, namely the G- and Phi-coefficients, are used to assess reliability.17,18
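For a fully random two-facet crossed [b × p × m] design, these coefficients are conventionally defined from the estimated variance components as shown below (a standard textbook formulation rather than the study's own derivation; in the mixed design used here, with the item facet fixed, the error terms are adjusted accordingly):

```latex
% Relative (G) and absolute (Phi) error variances for a random b x p x m design,
% with n_p raters and n_m items in the measurement design.
\sigma^{2}_{\delta} = \frac{\sigma^{2}_{bp}}{n_p} + \frac{\sigma^{2}_{bm}}{n_m} + \frac{\sigma^{2}_{bpm,e}}{n_p n_m},
\qquad
\sigma^{2}_{\Delta} = \sigma^{2}_{\delta} + \frac{\sigma^{2}_{p}}{n_p} + \frac{\sigma^{2}_{m}}{n_m} + \frac{\sigma^{2}_{pm}}{n_p n_m}

E\rho^{2} = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}_{\delta}},
\qquad
\Phi = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}_{\Delta}}
```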
In decision studies (D-studies), information from the generalizability study (G-study) is used to make decisions for a specific purpose. 15 A D-study provides estimates that help determine the required number of raters, items, or tasks.
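In practice, this means recomputing the error variance with alternative facet sample sizes, for example, varying the number of raters from n_p to n'_p in the relative-decision case (a generic illustration of the principle, not the study's specific computation):

```latex
E\rho^{2}(n'_p) = \frac{\sigma^{2}_{b}}
{\sigma^{2}_{b} + \dfrac{\sigma^{2}_{bp}}{n'_p} + \dfrac{\sigma^{2}_{bm}}{n_m} + \dfrac{\sigma^{2}_{bpm,e}}{n'_p\, n_m}}
```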
Numerous studies have been conducted in the medical education context to investigate factors associated with the reliability of performance assessment scores. Among the factors under investigation, variations associated with participants and raters are the two most frequently explored. 9 However, to the best of our knowledge, few studies have explored assessor standardization and its effects on the variance in student-performance scores.
This study aimed to examine the reliability coefficients derived from a G-study and D-study conducted within a two-facet cross-design using G-theory to assess performance in medical education.
Method
This study used a quasi-experimental and quantitative research design. This study was approved by the Süleyman Demirel University (SDSoM) Clinical Research Ethics Committee (No. 169740; date: 21.12.2020). All participants were informed of the study protocols, and written informed consent was obtained from all of them. Since the participants were students, they were also provided with a commitment document stating that “there will be no sanctions regarding the education program of the students.” Data were collected for this study on 26.01.2022.
Study design
Crossed design: in this design, each facet is sampled at all levels along with the others; that is, all the conditions of one facet are observed in combination with all the conditions of the other facets. 14 An “×” mark is used to indicate the relationship between the facets. For example, in our study, all raters (p) in the crossed design observed all items/tasks (m) for all students (b), which is expressed as [b × p × m]. In this two-facet crossed design, all 40 students (individuals) were rated separately by five raters on 35 items.
Participants
Data were collected from first-year medical students enrolled in the 2021–2022 academic year at SDSoM in Türkiye. A power analysis was performed to determine the sample size, which was calculated as 40 at a 95% confidence level, taking into account the number of trainers, the number of students, and scoring logistics. The calculation was based on a population size of 240 students, an expected frequency of 50%, and an acceptable margin of error of 14%. Students were randomly selected from the medical school enrollment list and invited via email to participate in this study. For those who chose not to participate (n = 8), we employed the same selection process, ultimately achieving the desired sample size through a second round of invitations. All students were randomly sampled from each group.
Data collection tool
Five basic clinical skills were selected for evaluation in the study: hygienic hand washing; wearing a surgical cap, mask, and gown; wearing sterile gloves; removing the gloves; and removing the cap, mask, and gown. The researchers developed a scoring instrument for these five skills using national and international skills training guidelines. This instrument was independently reviewed by subject matter experts, including three medical educators and two public health specialists. Modifications were made as suggested by the experts, and the instrument was returned for review and agreement by the researchers. Ultimately, a rating scale comprising 35 items was created to assess the five skills used in the study. The scoring system was as follows: 0 points for no or inadequate performance, 1 point for performance that needed improvement, and 2 points for satisfactory performance. 19
A balanced dataset, which is frequently used in G-theory analyses, was used in this study’s design. In a balanced design, there are no missing data and, for any nested facet, the sample size is constant at each level of that facet, so the dataset contains equal numbers of observations for all facets. Accordingly, in our study, the same number of items was scored for every student; in a nested design, this would correspond to assigning an equal number of students to each rater.
Raters
Residents from the Public Health and Family Medicine Departments of the SDSoM were invited to participate as raters in the study, with 41 accepting the invitation (n = 14 and n = 27, respectively). To achieve rater standardization in the study, as recommended in the literature, a 30-min standardization session was conducted prior to the evaluations. 20 During this session, an ideal sample performance video was shared with the raters before scoring. The features and scoring criteria of the tool were reviewed based on videos. The raters’ questions regarding these items were also addressed and evaluated.
Data collection and analysis
In the application, the student was asked to perform the five skills sequentially in front of the five raters within a 5-min timeframe. The raters were asked to simultaneously rate the students’ performance using a rating scale. This study utilized a mixed design, as the item facet remained fixed throughout the analysis.
Given the consecutive and hierarchical relationships among the items in the skills training, it was not possible to remove any items from the training sequence. Consequently, the scoring instrument items were included as a fixed facet in the G-study analysis. The raters were designated as the random facet of the study, and both the G-study and the D-study were conducted on this random facet. Because fixed facets are not varied in a D-study, the item facet was not included in those analyses.
This study used a two-facet crossed-mixed design [b × p × m] in which 40 students were evaluated by five raters (random facets) using 35 items (fixed facets). The G-study and D-study were conducted for this design. EduG software was used for data analysis.14,21–23
The distribution of the number of students in the dataset was balanced. For our skill set, the scores for all five skills were combined to calculate the total score. Data from the scores obtained for the same skill set in the crossed design were analyzed using G-theory. In this study, the students were coded as b (individuals), raters as p (raters), and skill items as m (items). The two-facet crossed-mixed [b × p × m] design involved three main effects (b, p, and m) and four interaction effects (bp, bm, pm, and bpm,e), resulting in seven effects.
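EduG carries out this variance decomposition internally. As an illustration of the underlying computation (a minimal sketch, not the authors' code, assuming a hypothetical NumPy array `scores` of shape (40, 5, 35) holding the student × rater × item ratings), the seven variance components of a fully crossed random [b × p × m] design with one observation per cell can be estimated from the ANOVA mean squares as follows:

```python
import numpy as np

def variance_components(scores: np.ndarray) -> dict:
    """Estimate variance components for a fully crossed b x p x m design
    (one observation per cell) via expected mean squares.

    scores: array of shape (n_b, n_p, n_m) -- students x raters x items.
    """
    n_b, n_p, n_m = scores.shape
    grand = scores.mean()

    # Marginal means for main effects and two-way interactions
    m_b = scores.mean(axis=(1, 2))   # per student
    m_p = scores.mean(axis=(0, 2))   # per rater
    m_m = scores.mean(axis=(0, 1))   # per item
    m_bp = scores.mean(axis=2)       # student x rater
    m_bm = scores.mean(axis=1)       # student x item
    m_pm = scores.mean(axis=0)       # rater x item

    # Sums of squares for the seven effects (bpm confounded with residual e)
    ss_b = n_p * n_m * np.sum((m_b - grand) ** 2)
    ss_p = n_b * n_m * np.sum((m_p - grand) ** 2)
    ss_m = n_b * n_p * np.sum((m_m - grand) ** 2)
    ss_bp = n_m * np.sum((m_bp - m_b[:, None] - m_p[None, :] + grand) ** 2)
    ss_bm = n_p * np.sum((m_bm - m_b[:, None] - m_m[None, :] + grand) ** 2)
    ss_pm = n_b * np.sum((m_pm - m_p[:, None] - m_m[None, :] + grand) ** 2)
    ss_bpm = np.sum(
        (scores - m_bp[:, :, None] - m_bm[:, None, :] - m_pm[None, :, :]
         + m_b[:, None, None] + m_p[None, :, None] + m_m[None, None, :]
         - grand) ** 2
    )

    # Mean squares
    ms = {
        "b": ss_b / (n_b - 1),
        "p": ss_p / (n_p - 1),
        "m": ss_m / (n_m - 1),
        "bp": ss_bp / ((n_b - 1) * (n_p - 1)),
        "bm": ss_bm / ((n_b - 1) * (n_m - 1)),
        "pm": ss_pm / ((n_p - 1) * (n_m - 1)),
        "bpm,e": ss_bpm / ((n_b - 1) * (n_p - 1) * (n_m - 1)),
    }

    # Random-effects variance components from the expected mean squares
    var = {"bpm,e": ms["bpm,e"]}
    var["bp"] = (ms["bp"] - ms["bpm,e"]) / n_m
    var["bm"] = (ms["bm"] - ms["bpm,e"]) / n_p
    var["pm"] = (ms["pm"] - ms["bpm,e"]) / n_b
    var["b"] = (ms["b"] - ms["bp"] - ms["bm"] + ms["bpm,e"]) / (n_p * n_m)
    var["p"] = (ms["p"] - ms["bp"] - ms["pm"] + ms["bpm,e"]) / (n_b * n_m)
    var["m"] = (ms["m"] - ms["bm"] - ms["pm"] + ms["bpm,e"]) / (n_b * n_p)
    # Negative estimates are conventionally truncated at zero
    return {k: float(max(v, 0.0)) for k, v in var.items()}
```

Percentages such as those reported in Table 4 follow by dividing each component by the sum of all seven; in the mixed design used here, the fixed item facet changes how the components are combined into universe-score and error variances rather than the decomposition itself.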
This study conforms to the STROBE statement for observational studies. 24
Results
The study was conducted on January 26, 2022, at the Interprofessional Applied Training Laboratory of the Süleyman Demirel University Medical Faculty. The study involved a planning team of three faculty members, five raters, and two staff members, as well as 40 students (n = 40). Of these students, 24 (60%) were female and 16 (40%) were male. The average age of the students was 18.62 ± 1.15 years. Among the raters, one (20%) was female and four (80%) were male. The average age of the raters was 28.62 ± 3.34 years. To describe the scores, the average score for the skills was calculated as 1.56 ± 0.11. A detailed score analysis for each item is presented in Table 1.
Descriptive Findings.
In the analysis of the average skill scores in the crossed-mixed design of the students’ performance ratings, the average score for hygienic hand rubbing was 3.58 (min: 0, max: 4). The average score for wearing the cap, mask, and clean apron was 19.97 (min: 0, max: 22). The average score for wearing sterile gloves was 13.76 (min: 0, max: 16), and for sterile glove removal it was 6.77 (min: 0, max: 8). Finally, the average score for removing the cap, mask, and apron was 18.04 (min: 0, max: 20). The overall average skill-set score was 62.11 (min: 0, max: 70) (Table 2).
Two-Facet Crossed-Mixed Design [b × p × m] Mean Score Values.
The mean score of rater 1 was 62.13 ± 9.51, that of rater 2 was 60.48 ± 11.93, that of rater 3 was 61.08 ± 11.60, that of rater 4 was 64.00 ± 10.89, and that of rater 5 was 62.88 ± 13.11. The detailed score values and standard deviations for the crossed-mixed [b × p × m] design are presented in Table 3.
Mean Values and Standard Deviations of Scores in the Crossed-Mixed [b × p × m] Design.
G-study: G-theory was used to evaluate the reliability of the skills assessment. When the application was evaluated in the crossed-mixed design [b × p × m], the percentage of the variance component estimated for individuals was 34.10%, that for items/tasks was 3.00%, and that for raters was 0.30%. The variance component percentages estimated for the individual-item/task, individual-rater, item/task-rater, and individual-rater-item/task interactions were 11.30%, 11.10%, 3.70%, and 36.60%, respectively (Table 4).
Variance Values and Total Variance Explanation Ratios Estimated Via the G-Study for the [b × p × m] Design.
Estimated variance component interpretations for the two-facet crossed pattern
In G-theory, the variance component estimated for individuals in the two-facet crossed-mixed design is the universe score variance, corresponding to the true score variance in classical test theory. This parameter indicates the extent to which individuals differ in their measured characteristics, and differences between individual characteristics can be determined through measurement. Therefore, the share of the variance estimated for individuals in the total variance should be large. 14 In this study, the percentage of the variance component estimated for individuals (b = 33.90%) was large. The large relative share of this variance in the total variance indicates that systematic differences between individuals can be revealed and that the power of the observed scores to represent the universe (true) scores increases. Therefore, in line with the literature, this measurement tool can reveal differences in the measured traits and represent the population. 25
The estimated variance for the rater main effect indicates whether a particular rater’s ratings across all individuals are more generous or stricter than those of other raters. 14 A small variance percentage indicates that the raters’ scores are consistent across all individuals. When the variance component estimated for the raters approaches zero, the ratings given to all individuals are similar; when its share of the total variance is zero, the raters behave with the same strictness/generosity in their scores for all individuals. In this study, the percentage of the variance component estimated for the raters was 0.30%, and the raters were therefore evaluated as having similar generosity and strictness behaviors. This finding is consistent with the effect of rater standardization, as rater variability has been expressed as an important source of error in the literature.11,12,20
The interpretation of the variance component estimated for the item/task (m) main effect is similar to that of the rater main effect. The mean value for any task is taken as its level of difficulty; the variance estimated for the task main effect therefore reflects differences in task difficulty. In this study, the percentage of the variance component estimated for the item/task facet was 3.00%. This small variance component (m = 3.00%) estimated in the G-study for the item/task main effect indicates that item difficulties are not very different from one another. 26
The variance component estimated for the individual-rater common effect indicates whether a certain rater scores a certain individual more strictly or generously than other raters do. A relatively high variance indicates that some raters score some individuals more strictly or generously than others. In this study, the variance component estimated for the individual-rater interaction was 11.10%. This relatively high value can be interpreted as some raters scoring some individuals more strictly or generously than others.
The variance component estimated for the individual-item/task common effect shows variation in the relative position of a given individual from one task to another. The larger the share of this variance in the total variance, the greater the differences in the relative positions of some individuals from task to task. In this study, the variance component estimated for the individual-item/task interaction was 11.30%, the third highest value estimated in this study. This finding is interpreted as being compatible with the inclusion of five skills of differing difficulty (hygienic hand rubbing (two items), wearing the cap, mask, and clean apron (11 items), wearing sterile gloves (eight items), removing sterile gloves (four items), and removing the cap, mask, and apron (10 items)).
The variance component estimated for the rater-item/task common effect indicates the extent to which raters score individuals consistently from task to task. When the estimated variance approaches zero, the raters score each task consistently. In this study, the variance component for the rater-item/task interaction was estimated to be 3.70%. This value is interpreted as stable rater behavior, even though there are differences in difficulty between the tasks.
The final component reflects the individual-rater-item/task (bpm,e) joint effect together with unmeasured sources of variance as a composite. The unmeasured sources of variance can be divided into two groups: systematic (e.g., some students having practiced certain skills more) and non-systematic/random (students’ individual differences). In this study, this composite variance component had the highest estimated value (37.1%). This finding is consistent with those of many studies that have evaluated real-world data.25,27 The high variance can be interpreted as reflecting the fact that some students had already received this training for other reasons (e.g., during the COVID-19 pandemic period), that the skills are relatively easy and quickly learned, and that uncontrolled sources of error were involved in the process.
In the analysis of the reliability coefficients using G-theory, the G-coefficient for the crossed design [b × p × m] was calculated to be 0.94.
Decision study: The D-study provides an estimate of the required number of raters. It was conducted by increasing and decreasing the number of raters, the random facet of the crossed-mixed design, in which 40 students were evaluated on 35 items. The detailed findings are presented in Table 5.
D-Study Results for Raters in the Crossed-Mixed Design [b × p × m] (Optimization).
D-study: decision study.
The G-coefficient obtained with the five raters working within the scope of the study was 0.94, and the G values obtained when this number was reduced to four, three, and two were estimated as 0.92, 0.90, and 0.86, respectively. A slight decrease in the G-coefficient was observed when the number of raters decreased. When the number of raters increased from five to six or seven, the G-coefficients were estimated to be 0.95 and 0.96. At this point, a relatively small increase in the G-coefficient was observed.
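Because, with the item facet fixed, both rater-related relative error terms scale inversely with the number of raters, these D-study values can be approximately reproduced from the five-rater coefficient alone. The following back-of-the-envelope sketch (assuming only this proportionality, not the exact EduG computation) illustrates the projection:

```python
def project_g(g_obs: float, n_obs: int, n_new: int) -> float:
    """Project a G-coefficient to a different number of raters, assuming the
    relative error variance is inversely proportional to the number of raters."""
    error_ratio = (1.0 - g_obs) / g_obs   # relative error / universe-score variance at n_obs raters
    per_rater = error_ratio * n_obs       # the same ratio scaled to a single rater
    return 1.0 / (1.0 + per_rater / n_new)

for n in range(2, 8):
    print(n, round(project_g(0.94, 5, n), 2))
# Approximately: 2 -> 0.86, 3 -> 0.90, 4 -> 0.93, 5 -> 0.94, 6 -> 0.95, 7 -> 0.96
# (Table 5 reports 0.92 for four raters; the small difference reflects rounding
#  of the 0.94 input coefficient.)
```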
In the context of G-theory, an analysis called facet analysis can be performed. With this analysis, the effect of removing the conditions of the facets on reliability can be examined further. This calculation was performed over the rater facet, which was the only random facet considered in this study. The results are presented in Table 6.
G-Facet Analysis Results.
Table 6 shows how the G-coefficient, estimated as 0.94 when 40 students were evaluated on 35 items by all five raters, changes when individual raters are removed. The largest decrease was observed when rater 4 was removed; in other words, removing rater 4 lowered the G-coefficient relatively more than removing any other rater. The smallest decrease was observed for rater 2. However, when the entire table is considered, none of these changes led to a significant decrease or increase in the G-coefficient.
Discussion
This study demonstrated that G-theory can be effectively applied to assess the reliability of performance-based assessments in undergraduate medical education, even in the context of basic clinical skills.
Our findings highlight that G- and D-studies can serve as a pragmatic framework for improving the reliability of performance assessment in medical education. While it is neither feasible nor necessary to conduct G- and D-studies for every clinical skills assessment, our results suggest that applying these methods selectively to high-stakes or frequently used assessments can meaningfully minimize inter-rater variability and guide rater standardization efforts. Once standardized procedures are established for a given assessment, the principles of rater calibration can be extended to other similar evaluations, thereby reducing the need for repeated, full-scale analyses. In this sense, our study provides a proof-of-concept model that demonstrates the strengths and limitations of using G-theory in practice. We argue that this approach offers a realistic and scalable strategy for institutions seeking to balance methodological rigor with practical constraints, positioning rater standardization and G-theory-based analyses as valuable tools for enhancing the fairness and credibility of clinical skills evaluations.
In our study, basic professional skills training, which is frequently assessed through limited performance evaluations, was chosen. In such assessments, differences between raters are important in terms of fairness and applicability. 28 In addition, performance assessments applied to skills training in undergraduate medical education are considered ideal for G-studies because they involve repeated measurements of the same construct. 29 Recent studies have shown that collecting repeated measurements of the same construct increases reliability. 30 This is because random errors can cancel each other out across multiple measurements of the same construct. 16
The main aim of the study design was to minimize errors when identifying real differences between the groups of interest. In this study, rater standardization was performed. Studies on raters have discussed different types of rater bias and their impact on scores and judgments. 28 Each station is more or less difficult, and each rater is more or less lenient; both have unique effects on students’ scores. Instead of ignoring these effects, we can measure them using G-theory. In this study, in line with the literature, differences between assessors were a source of error in the design.
Although foundational work on G-theory has often relied on simulated datasets, many studies use, and recommend using, data obtained from real applications.22,23,27,28,31
This study was designed with the idea that the use of G-theory applications in real-time limited performance evaluation in skills training, which has an important role in pre-graduation medical education programs, would contribute to the evaluation of many sources of error by providing feedback and enabling prospective decision making. In this study, the data obtained by scoring the actual application of skill training were evaluated in accordance with the literature.32–34
There are many different patterns in G-theory applications.16,35,36 In Turkey, there has been an increase in studies on G-theory over the years. 37 In a study evaluating 60 studies conducted in Turkey between 2004 and 2017, the majority used a two-facet crossed design with balanced datasets. 37 A mixed design was used in a few of these studies. 37 Given the mixed design preferred in our study, this study contributes to the literature in terms of design. 38
In our study, the data obtained from the real-time limited performance evaluation of the five skills included in the training program were evaluated. In this context, 40 students were scored by five raters. The student and rater characteristics were in line with those of many previous studies.10,12,19
After scoring, the data were analyzed using EduG software, which was developed for G-theory applications and is widely used in the literature.14,15
In this study, the percentage of the variance component estimated for individuals in the two-facet crossed-mixed design was large in the G-studies conducted using G-theory. Therefore, the measurement tool can reveal differences in terms of the measured traits in accordance with the literature and can represent the universe. 25
In the variance estimated for the main effect of the raters, the raters were evaluated as having similar generosity/harshness behaviors. This finding is consistent with the effect of rater standardization, as rater variability has been expressed as an important source of error in the literature.11,12,20
The variance component estimated for the item/task main effect was interpreted in terms of differences in task difficulty. The small variance component estimated for this main effect indicates that item difficulties are not very different from one another. 26
The relatively high variance component estimated for the individual-rater common effect can be interpreted as some raters scoring some individuals more strictly or generously than others.
The variance component estimated for the individual-item/task common effect was the third-highest value estimated in this study. This finding is interpreted as compatible with the effects of the five skills of different difficulty levels found in the study (hygienic hand-rubbing skill (two items), wearing cap–mask–clean apron skills (11 items), wearing sterile glove skills (eight items), sterile glove-removal skill (four items), and cap–mask–apron removal skill (10 items)).
The variance component estimated for the rater–item/task common effect was interpreted as the stable behavior of the raters, although there were differences in task difficulty.
The variance component estimated for the individual-rater-item/task (bpm,e) common effect was consistent with those reported in many studies that evaluated real practice data.25,27 This high variance can be interpreted as reflecting the fact that some students had already received this training for other reasons (e.g., during the COVID-19 pandemic period), that the skills are relatively easy and quickly learned, and that uncontrolled sources of error were involved in the process.
G-coefficients
In reliability analysis, the number of test items, the relationships between items, and dimensionality affect alpha values. There are different reports of acceptable alpha values, ranging from .70 to .95.39–41 According to G-theory, G- and Phi-coefficients above 0.70 are accepted as generalizable and reliable in reliability analyses conducted to evaluate the internal consistency of measurement tools.27,40,42–44 In this study, the reliability coefficient was determined from the assessment results. Because measurement and evaluation practices in medical education programs inform decisions about the achievement of graduation goals, a high reliability coefficient is recommended.
In our study, the G-coefficient for the crossed design was calculated as 0.94 in the analysis of the reliability coefficients based on G-theory. This finding is considered generalizable and reliable, which is in line with the literature.14,15,27,45
Decision studies
In the evaluation of inter-rater reliability, G-theory, which allows the evaluation of many sources of error, is at the forefront of research. 46 In the D-study conducted for raters in the crossed-mixed design [b × p × m], the G-coefficients were 0.86 for two raters, 0.90 for three raters, 0.92 for four raters, 0.94 for five raters, 0.95 for six raters, and 0.96 for seven raters. The D-study thus provided feedback that the number of raters could be reduced while maintaining acceptable reliability, which is consistent with the literature.
In addition to examining differences between raters in a G-study, such differences can be examined in more detail in a G-facet study. In our study, there was no difference between the raters in the G-study. A G-facet study quantifies the change in the reliability coefficient when a rater is removed; our study revealed no significant change in the reliability coefficient after the removal of any single rater. This was considered a positive effect of rater standardization.11,12,20
Rater supply and standardization are among the problems frequently encountered during skills training. The optimum rater requirement was determined using the D-study. In addition, differences between raters can be evaluated using G-studies and G-facet studies. This information may contribute to skill training in the field of medical education.
Assessments should be conducted in the context of educational programs. A score calculated in isolation carries little meaning. 47 A score gains meaning and begins to serve its context when there are standards against which it can be compared. 40 G-theory allows us to develop an approach in this context. Variance can be thought of as the distribution of the ingredients in a single slice of cake: it can sometimes work in favor of the test-taker and sometimes against them. Sometimes the variance behaves in predictable ways (systematic errors), and at other times it behaves unpredictably (random errors). Consequently, these aspects of variance must be considered to fully assess reliability.
In our study, when evaluated in terms of reliability coefficients and the feedback it yields, the crossed design provided information about many error sources that cannot be obtained with other designs. Its applicability can be further improved by acting on the D-study suggestions regarding the number of raters.
In the crossed design, each rater evaluated all students on all items, which was somewhat time-consuming. Nevertheless, the D-study suggests that such a process could be operated with as few as two raters while still achieving highly reliable measurement.
G-facet analyses revealed no difference between raters, in accordance with the variance component estimated for raters. This finding was considered a positive contribution of rater standardization, which has been emphasized in the literature.
Conclusion
This study demonstrated that G-theory can be effectively applied to assess the reliability of performance-based assessments in undergraduate medical education, even in the context of basic clinical skills. However, an important question for medical educators is whether these results can be extrapolated to more complex clinical skills assessments, such as the evaluation of communication skills, diagnostic reasoning, or procedural competence in simulated or real clinical environments. In such cases, variability may increase due to greater cognitive complexity, longer task durations, and higher inter-task heterogeneity. Although our findings are promising, they should be cautiously generalized. Future studies are needed to examine whether similar reliability can be achieved in high-fidelity assessments involving affective or interpersonal domains. Ultimately, this study contributes to medical education by showcasing a practical model for applying G-theory in real-world assessment contexts. By identifying and controlling the sources of measurement error, educators can improve the fairness and defensibility of student evaluations. In competency-based education models, where accurate assessments are critical, methodological rigor is essential.
Footnotes
Author Note
This manuscript was defended as a thesis at the Health Sciences Institute, Ege University, to fulfill the requirements of the Medical Education PhD Program.
Acknowledgments
The authors express their gratitude to Prof. Dr Hakan Atılgan and Prof. Dr Halil İbrahim Durak for their support during this study. Additionally, the authors extend their appreciation to the medical students and residents of the SDSoM who dedicated their time, commitment, and willingness to participate in this study. Finally, the authors would like to extend their deep and sincere gratitude to Neşe Güler and Gülşen Taşdelen Teker for their substantial support and guidance in the research design and for conducting the statistical analysis of this research.
Ethics Approval
This study was approved by the Süleyman Demirel University Clinical Research Ethics Committee (No. 169740, Date: 21.12.2020). This study was conducted in accordance with the principles of the Declaration of Helsinki.
Consent for Publication
All contributors were informed of the study.
Participant Consent
Before the study, all participants were informed of the study.
Author Contributions
Conception: GK and SAÇ; design: GK; supervision: GK; sources: GK and SAÇ; data collection and/or processing: GK; analysis and/or interpretation: GK and SAÇ; literature review: GK and SAÇ; main text: GK and SAÇ; and critical review: GK and SAÇ.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Declaration of AI-Assisted Technologies
The authors declare that artificial intelligence (AI)-assisted technologies, including large language models (LLMs), were not used in this study.
Data Access Statement and Material Availability
The data are stored in a data warehouse and can be accessed by others if requests are approved by the data warehouse.
