Abstract
Background
Objective structured clinical examinations (OSCEs) are widely used to assess nursing students’ clinical skills. However, ensuring consistent scoring between examiners remains difficult, particularly for subjective areas such as communication.
Purpose
This study aimed to evaluate inter-rater agreement between two faculty examiners in a pre-graduation OSCE recommended for final-year pre-registration nursing students in Japan, comparing technical skill stations with communication-focused stations.
Methods
A total of 90 final-year nursing students completed two OSCE stations: one assessing technical procedures, the other assessing communication and patient education. Two examiners independently scored each aspect of performance using binary checklists. Inter-rater agreement was calculated using Cohen's kappa and the prevalence-adjusted bias-adjusted kappa (PABAK).
Results
Higher inter-rater agreement was found for psychomotor items (e.g., auscultation) than for verbal or empathy-based items. In the technical station, agreement improved across successive circuits, suggesting examiner calibration. In contrast, in the communication station, agreement remained consistently low. Empathy-related items showed the greatest discrepancy between kappa and PABAK, highlighting challenges in evaluating subjective skills.
Conclusions
Among pre-registration nursing students, OSCE inter-rater reliability was higher for objective technical skills than for subjective communication skills and empathy-related behaviors.
Implications for Practice
Improving checklist clarity and providing targeted examiner training for communication and empathy-related items may enhance the reliability of OSCE scoring in nursing education.
Introduction
Rise of Competency-Based Education
In recent years, competency-based education has gained traction in nursing education, aiming to develop practice-ready nursing professionals with clearly defined competencies. In Japan, pre-registration nursing students are encouraged to undertake comprehensive clinical performance assessments, such as objective structured clinical examinations (OSCEs), in their final year before graduation. Competency-based education is an outcomes-driven approach in which essential skills and competencies are explicitly defined, and curriculum design, implementation, and assessment are systematically aligned with these learning goals (Pijl-Zieber et al., 2014). This educational paradigm represents a shift from traditional process-oriented models emphasizing course completion and clinical hours toward a focus on the demonstrable competence a learner acquires as a nurse. Recent studies have further developed the model by advocating for the integration of capability-based assessment and holistic competence development within nursing programs (Howland et al., 2024). Although capability has recently been conceptualized as broader than competence, this study adopts the term “competence” in alignment with national nursing education guidelines (Howland et al., 2024). Reflecting this transformation, nursing universities in Japan have increasingly integrated OSCEs into their assessment procedures to provide a comprehensive and standardized evaluation of students’ clinical performance (Ministry of Education, Culture, Sports, Science and Technology [MEXT], Japan, 2024). OSCEs, structured as circuits of stations involving clinical tasks or simulations, enable direct observation and objective measurement of practical competence (Mitchell et al., 2009). They continue to be regarded as the gold-standard assessment method for evaluating both technical and communication skills in pre-registration nursing education (de Beer et al., 2023; Hyde et al., 2022). Internationally, OSCEs have demonstrated effectiveness as assessment tools: a recent systematic review by Kassabry (2023) reported that, compared with conventional evaluation methods, OSCEs significantly enhance nursing students’ knowledge, clinical reasoning, self-confidence, and overall satisfaction. However, despite their educational advantages, concerns remain about the consistency and fairness of OSCE scoring across examiners.
Comparing Psychomotor and Communication Skills
The current study aimed to evaluate inter-rater agreement in a pre-graduation OSCE conducted at a Japanese nursing university, with a particular focus on how agreement varies across different types of evaluation items. Specifically, the study sought to identify patterns and challenges associated with distinct checklist categories, such as items assessing psychomotor nursing skills versus those targeting communication, explanation, or patient understanding. By employing both Cohen's kappa and the prevalence-adjusted bias-adjusted kappa (PABAK), the study also aimed to demonstrate how the choice of statistical index can influence the interpretation of inter-rater reliability in OSCE assessments. Ultimately, this study provides evidence-based insights that inform improvements in examiner training and checklist development, thereby enhancing the overall reliability and fairness of this critical evaluation method.
Review of Literature
Challenges in OSCE Scoring Reliability
Despite the recognized benefits of OSCEs, inter-rater reliability is a known concern with this assessment method. Previous research has reported that when two examiners independently evaluate the same OSCE performance, their scores often differ. For example, de Beer et al. (2023) examined an OSCE with stations on urologic history taking, respiratory examination, and gynecological skills, and found that in most cases the two raters’ final scores disagreed by more than 5%. Only one rater pair in that study achieved substantial agreement (weighted kappa = 0.74) on a station, highlighting the challenge of consistency in OSCE scoring. The literature suggests that each examiner may bring certain biases or idiosyncrasies to their ratings. Wood (2014) noted that examiners’ first impressions of a candidate can influence subsequent ratings, indicating the potential for unconscious bias in rater-based assessments. Similarly, Williams et al. (2003) discussed various cognitive, social, and environmental sources of bias in clinical performance ratings, such as an examiner's personality traits or preconceived expectations, which can introduce variability into OSCE scoring. Accordingly, a previous study reported that the evidence supporting the validity, reliability, and internal consistency of OSCE scoring remains limited and inconclusive (Bobos et al., 2021).
Perceived Fairness vs. Scoring Inconsistencies
Despite the limitations mentioned above, both nursing students and faculty have been found to generally perceive the OSCE format as objective, fair, valid, and reliable. For example, a systematic review by Vincent et al. (2022) found that students favored OSCEs over traditional clinical examinations, viewing them as more credible assessments of clinical competence. Faculty perceptions mirrored this sentiment, with many regarding OSCEs as rigorous and meaningful evaluation tools (Vincent et al., 2022). However, despite this widespread confidence regarding the value of OSCEs, it remains essential to examine potential threats to their fairness and validity, particularly inconsistencies in scoring between different evaluators.
Training to Improve Inter-Rater Agreement
Efforts to enhance OSCE reliability have primarily concentrated on examiner training and calibration. Previous research has demonstrated that training OSCE raters improves scoring consistency. For instance, Guerrero et al. (2024) compared three rater training approaches (lecture-based, online, and simulation-based) and found that joint simulation-based training for both examiners and students was the most effective method for standardizing scoring behaviors. Regular rater training and calibration sessions (moderation meetings) are widely recommended to strengthen the objectivity and reliability of OSCE assessments. While examiner-focused interventions have received substantial attention, less is known about the contribution of the evaluation checklist items themselves to scoring variability. In particular, few studies have investigated how inter-rater agreement varies by the type of competency assessed (e.g., psychomotor nursing procedures versus communication or clinical reasoning tasks) or how scoring consistency changes as examiners gain experience over multiple OSCE administrations. Furthermore, conventional reliability metrics such as Cohen's kappa may underestimate agreement in educational contexts with imbalanced performance distributions. Therefore, incorporating additional indices, such as PABAK, may yield a more nuanced understanding of scoring reliability. These gaps underscore the need for a detailed, item-level analysis of inter-rater agreement within OSCE assessments.
Methods
Study Design and Setting
This cross-sectional analytical study used data from a pre-graduation OSCE administered as the capstone clinical skills examination at a nursing university in Japan. The OSCE, held in July 2023, marked the institution's first implementation of this assessment format. A total of 90 fourth-year nursing students (undergraduate seniors) participated in the OSCE as part of their final clinical competency evaluation prior to graduation.
The OSCE consisted of two clinical station scenarios designed to reflect common nursing practice situations. Each student was required to complete both Tasks 1 and 2 at separate stations. A total of 38 faculty members from the university's nursing program served as examiners across the stations. To ensure standardization, each student's performance at a station was independently assessed by a pair of faculty raters observing the student simultaneously.
The OSCE tasks were developed by the institutional OSCE working group in alignment with the school's diploma policy, drawing on established practices from other universities. Standardized patients who had received specialized training through the OSCE program were recruited and performed scripted roles to ensure consistent interactions across examinees. In Task 1, a high-fidelity simulator was employed to ensure consistent physiological responses and scenario realism. Each student completed both tasks within a single day. The number of examination rooms was determined by the availability of private spaces, and 12 sequential OSCE circuits were organized to accommodate the entire cohort of 90 students. These arrangements were intended to ensure both logistical feasibility and the standardization of the examination process.
Study Population and Recruitment
Students (OSCE Examinees)
This study included all 90 fourth-year nursing students at a nursing university in Japan who participated in the mandatory pre-graduation OSCE held in July 2023. Because the OSCE was a compulsory component of the curriculum, no exclusion criteria were applied, and no students were absent.
Faculty Examiners
A total of 38 faculty members from the School of Nursing served as OSCE examiners. Their participation was part of their routine teaching responsibilities; therefore, all eligible faculty members were assigned to this role by the department. For the research component, the study's purpose was communicated in advance via information sheets displayed on campus and distributed by email. Faculty members were given the option to decline inclusion of their data; however, no examiner opted out, and all examiner identities were anonymized during analysis. All nursing faculty members were serving as OSCE examiners for the first time. Available demographic data were limited to gender and academic rank: 34 examiners were women and 4 were men; 9 were professors, 2 associate professors, 10 lecturers, and 17 assistant professors. No further demographic data were collected.
OSCE Station Scenarios and Evaluation Items
Task 1: “Assessment and Initial Care for an Elderly Patient with Aspiration Pneumonia”
This station focused on the acute assessment and initial nursing interventions for a hospitalized elderly patient presenting with aspiration pneumonia and respiratory distress. Each student was given 8 min to perform the clinical task, followed by 2 min for examiner feedback. A high-fidelity patient simulator capable of exhibiting signs of respiratory distress was used to represent the patient, with a teaching assistant providing vocal responses and interaction based on a scripted role. Students were expected to assess the patient's condition and implement appropriate immediate nursing actions.
For Task 1, examiners used a checklist comprising 22 evaluation items. Each item represented a specific action or behavior expected of the student and was scored dichotomously: 1 for “performed correctly” and 0 for “not performed” or “performed incorrectly.” The 22 checklist items for Task 1 were categorized into three conceptual domains:
Communication (seven items): These items assessed the student's ability to engage effectively and empathetically with a patient experiencing respiratory distress. Examples include: “Uses closed-ended questions appropriately in consideration of the patient's dyspnea and difficulty speaking,” and “Adjusts volume and speech rate to accommodate mild hearing impairment.”

Nursing Technical Skills (10 items): These items evaluated essential nursing procedures related to respiratory assessment and acute care. Examples include: “Performs chest auscultation accurately” and “Correctly repositions the nasal cannula.”

Reporting (five items): These items assessed the student's ability to convey key information and clinical actions to another healthcare professional, simulating handover or communication with a supervisor. Examples include: “Reports that the patient's SpO₂ dropped to 92% upon entry but recovered during the encounter,” and “Describes the interventions taken to relieve the patient's dyspnea.”
Each checklist item in Task 1 carried equal weight, with 1 point awarded for actions performed correctly and 0 points for those omitted or incorrectly executed, yielding a maximum possible score of 22 for the station. Students’ performance was observed in real time by two faculty examiners, who independently completed their checklists without discussion during the assessment. Following the 8-min scenario, the examiners provided a brief 2-min feedback session to the student. This feedback was not included in the scoring.
Task 2: “Discharge Planning and Patient Education for a Middle-Aged Patient with Chronic Heart Failure Aiming for Social Reintegration”
This scenario emphasized communication, assessment of patient understanding, and educational support for a middle-aged individual with chronic heart failure preparing for discharge and return to daily life in the community. The station was structured as a 7-min student–patient interaction, followed by a 3.5-min feedback session. A standardized patient (trained actor) portrayed the patient, enabling a more authentic and interactive nurse–patient dialogue.
For Task 2, examiners used a checklist of 16 evaluation items, each scored dichotomously: 1 for “achieved” and 0 for “not achieved.” The items reflected core competencies in communication, assessment of the patient's perspective, and delivery of patient education. The 16 checklist items for Task 2 were grouped into the following domains:
Communication Skills (seven items): These items assessed fundamental therapeutic communication and rapport-building behaviors. Examples include: “Introduces oneself to the patient at the beginning of the encounter,” and “Listens attentively without interrupting.”

Confirmation of Patient's Symptoms and Concerns (10 items): These items evaluated the student's ability to explore the patient's understanding, concerns, and subjective experience of illness. Representative items include: “Checks the patient's regimen for post-discharge medication management,” “Asks about the patient's thoughts and feelings regarding their heart failure and its treatment,” and “Verifies the patient's understanding of the physician's explanation.”

Explanations and Education (four items): These items focused on the clarity and appropriateness of the student's patient education and health counseling. Examples include: “Explains necessary self-care behaviors to prevent exacerbation of heart failure after discharge,” and “Provides guidance on self-monitoring for heart failure symptoms.”
Although some skills overlapped across categories, the checklist for Task 2 comprehensively addressed essential elements of discharge planning communication. As in Task 1, each item was scored as 1 point if performed satisfactorily, resulting in a maximum possible score of 16 points for the station. In this scenario, a live standardized patient allowed students to engage in more natural and realistic dialogue. Each performance was observed in real time by two faculty examiners, who independently recorded their binary ratings for each of the 16 items without discussion.
Moderation and OSCE Examination Flow
To promote examiner calibration, a preliminary one-hour briefing was held prior to the OSCE. All examiners attended this session, which was led by the developers of the evaluation checklists. During the meeting, assessment criteria were thoroughly explained, questions were addressed, and demonstrations were provided to illustrate the expected behaviors of examinees. This session also offered a space for examiners to discuss item interpretation and served as the initial step in aligning their evaluation standards.
In addition, moderation sessions were interspersed throughout the OSCE. Specifically, after selected student circuits, faculty examiners convened brief meetings to discuss scoring practices. In this OSCE, three moderation sessions were held: between circuits 2 and 3, circuits 5 and 6, and circuits 9 and 10. These sessions provided opportunities to compare interpretations and clarify any major discrepancies, with the goal of enhancing alignment for subsequent assessments. Importantly, no previously recorded scores were altered; rather, the moderation meetings functioned as real-time calibration efforts to promote consistency moving forward.
Each student completed two stations (Tasks 1 and 2), each of which was assessed by a distinct pair of faculty examiners. The OSCE was structured with eight parallel stations per task, allowing up to eight students to be evaluated simultaneously, each at a separate station for that task. To accommodate all 90 students, 12 sequential circuits (rounds) were conducted. By the end of the examination, each examiner pair had evaluated multiple students on the same task, providing repeated exposure to the same checklist items and scenarios. Faculty examiners were allocated to examination rooms in pairs with consideration given to balancing academic ranks (e.g., pairing senior with junior faculty), rather than by strict randomization. Students were scheduled in order of student identification numbers, and each student was assessed by the fixed pair of examiners assigned to that room.
Data Collection and Analysis
After the OSCE, score sheets from each examiner were collected. For both tasks, the checklists completed by the two examiners who observed the same student's performance were paired. The analysis focused on inter-rater agreement at two levels: individual checklist items and total station scores, as described below.
For each checklist item in both tasks, inter-rater agreement between the two examiners was calculated across all students. Because each item was scored dichotomously (1 or 0), each examiner pair's ratings formed a 2 × 2 contingency table representing agreement or disagreement (i.e., performed vs. not performed). From these tables, Cohen's kappa coefficient (Kappa) was computed for each item to quantify agreement beyond chance.
In addition, the PABAK was calculated for each item. This statistic adjusts for both the prevalence of positive ratings and systematic rater bias, offering a more stable measure of agreement when distributions are highly skewed. PABAK was calculated using the formula PABAK = 2Po – 1, where Po represents the observed proportion of agreement between raters. PABAK is particularly useful for mitigating the so-called “kappa paradox,” in which high agreement can produce a low kappa because of imbalanced prevalence. By comparing Kappa and PABAK for each item, items in which the two measures diverged were identified, suggesting potential issues related to rating prevalence or examiner bias.
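To make the two indices concrete, the following minimal Python sketch (an illustration added here, not the study's SPSS workflow; the rating vectors are hypothetical) computes Kappa and PABAK for a single dichotomous checklist item:

```python
# Minimal sketch: Cohen's kappa and PABAK for one dichotomous checklist item.
# The rating vectors below are invented for illustration only.

def cohen_kappa(r1, r2):
    """Cohen's kappa for two binary (0/1) rating vectors of equal length."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n      # observed agreement
    p1, p2 = sum(r1) / n, sum(r2) / n                 # each rater's "performed" rate
    pe = p1 * p2 + (1 - p1) * (1 - p2)                # agreement expected by chance
    return (po - pe) / (1 - pe)

def pabak(r1, r2):
    """Prevalence-adjusted bias-adjusted kappa: PABAK = 2*Po - 1."""
    po = sum(a == b for a, b in zip(r1, r2)) / len(r1)
    return 2 * po - 1

# Hypothetical ratings for one item across 10 students (1 = performed correctly).
rater_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
rater_b = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
print(f"Kappa = {cohen_kappa(rater_a, rater_b):.2f}")  # -0.11
print(f"PABAK = {pabak(rater_a, rater_b):.2f}")        #  0.60
```

With these invented ratings, the raters agree on 8 of 10 students (Po = 0.80), yet Kappa is negative because nearly every rating is “performed”; PABAK, which depends only on Po, remains 0.60. This is the divergence pattern examined in the item-level analysis.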
The item-wise performance rate—the percentage of students who received a score of 1 (“performed”)—was calculated for each checklist item. This served as an indicator of item difficulty or the frequency with which the behavior was achieved. Associations between performance rates and inter-rater agreement metrics were then examined to explore whether items that were very easy or very difficult tended to yield higher or lower levels of agreement.
Overall inter-rater agreement was calculated for each station within each circuit. For both tasks and for each circuit (i.e., group of students evaluated in a given round), all checklist item ratings were pooled, and PABAK (and Cohen's kappa where relevant) was computed across all items in that circuit. This allowed examination of whether inter-rater agreement improved as examiners gained experience across successive circuits. Agreement statistics (PABAK and Kappa) were then plotted by circuit number for both Tasks 1 and 2 to visualize trends over time.
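Circuit-level agreement can be sketched in the same way (again with hypothetical data): all (rater A, rater B) ratings recorded for every item and student in a circuit are concatenated before applying PABAK = 2Po − 1.

```python
# Sketch: pooled PABAK for one circuit. `pairs` concatenates the two examiners'
# ratings for every checklist item x student assessed in that round.
def circuit_pabak(pairs):
    po = sum(a == b for a, b in pairs) / len(pairs)   # pooled observed agreement
    return 2 * po - 1

# Hypothetical: 3 students x 4 items from one circuit.
pairs = [(1, 1), (1, 1), (0, 0), (1, 0),
         (1, 1), (0, 1), (1, 1), (1, 1),
         (1, 1), (1, 1), (0, 0), (1, 1)]
print(f"Circuit PABAK = {circuit_pabak(pairs):.2f}")  # 10/12 agree -> 0.67
```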
Additionally, checklist items were aggregated by content domain to compare inter-rater agreement across different types of competencies. For Task 1, items were grouped into three predefined domains (Communication, Nursing Technical Skills, and Reporting) and the average PABAK was calculated for each domain, on the basis of all relevant ratings across students and raters. For Task 2, items were similarly grouped into Communication, Patient Condition Confirmation, and Explanation domains (as defined in the checklist), and the mean PABAK was computed within each category.
Finally, to obtain an overall measure of scoring agreement for each task, the intraclass correlation coefficient (ICC) was calculated on the basis of the total scores assigned by the two examiners. A two-way random-effects model with single-rater measurement (ICC[2,1]) was used, which is appropriate for assessing absolute agreement between two raters on a continuous outcome—namely, the sum of item scores for each student.
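A self-contained sketch of ICC(2,1) via the standard two-way ANOVA decomposition (the Shrout and Fleiss formulation) is shown below; the paired totals are invented for illustration and do not reproduce the study data.

```python
# Sketch: ICC(2,1) -- two-way random effects, absolute agreement, single rater.
import numpy as np

def icc_2_1(scores):
    """scores: (n_subjects, k_raters) array of total station scores."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_r = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between subjects
    ms_c = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between raters
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    ms_e = (resid ** 2).sum() / ((n - 1) * (k - 1))              # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical paired totals (rater A, rater B) for five students:
totals = [[13, 14], [10, 10], [16, 15], [8, 9], [12, 12]]
print(f"ICC(2,1) = {icc_2_1(totals):.2f}")  # 0.96 with these invented totals
```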
All statistical analyses, including the computation of Kappa and PABAK, were conducted using SPSS version 26 (IBM Corp., Armonk, NY, USA). Kappa values were interpreted according to standard benchmarks: <0.00 = poor, 0.00–0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial, and 0.81–1.00 = almost perfect agreement (Landis & Koch, 1977). PABAK values were interpreted along the same 0–1 scale (Byrt et al., 1993). The difference between PABAK and Kappa (PABAK – Kappa) was examined to identify checklist items where extreme prevalence (very high or very low) may have suppressed Kappa despite high observed agreement.
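This interpretation step can be expressed compactly as a sketch; the two example items reuse Kappa/PABAK values reported in the Results, and the ≥ 0.20 flag mirrors the gap criterion reported there.

```python
# Sketch: benchmark labels (Landis & Koch) plus a PABAK - Kappa gap flag.
def label(k):
    """Benchmark label for a kappa-type coefficient."""
    if k < 0.00:  return "poor"
    if k <= 0.20: return "slight"
    if k <= 0.40: return "fair"
    if k <= 0.60: return "moderate"
    if k <= 0.80: return "substantial"
    return "almost perfect"

# (item, Kappa, PABAK) -- values taken from this study's Results section.
items = [("Performs chest auscultation", 0.95, 0.98),
         ("Listens without interrupting the patient", 0.23, 0.78)]
for name, k, pb in items:
    flag = "  <- possible prevalence effect" if pb - k >= 0.20 else ""
    print(f"{name}: Kappa {k:.2f} ({label(k)}), PABAK {pb:.2f}{flag}")
```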
Results
Task 1: “Assessment and Initial Care for an Elderly Patient with Aspiration Pneumonia” OSCE Performance
Overall Student Performance
Out of a maximum of 22 points, the mean score achieved by students in Task 1 was 13.0 points (59.3%, standard deviation = 21.3). The highest-scoring item was “Maintains eye contact at the patient's eye level and keeps an appropriate distance (proper body positioning),” which was successfully performed by 89.4% of students. Other items with similarly high success rates included “Performs chest auscultation” (88.3%) and “Speaks at a volume and pace appropriate for a patient with mild hearing loss” (84.4%). In contrast, the most challenging item was “Encourages the patient to take deep breaths,” which only 14.4% of students performed successfully.
Inter-Rater Agreement by Item
Figure 1 presents the Kappa and PABAK values for each of the 22 checklist items in Task 1. The item with the lowest inter-rater agreement was “Reports that, since the SpO₂ drop to 92% was transient and lung sounds remained as earlier, there is no actual deterioration in the patient's condition,” a reporting-related item with a PABAK of 0.42 and a Kappa of 0.39. Another item with comparatively low agreement was “Explains the reason why the patient is unable to ingest water” (PABAK = 0.58, Kappa = 0.54). In contrast, several items demonstrated almost perfect agreement. For instance, “Performs chest auscultation” showed very high agreement (PABAK = 0.98, Kappa = 0.95), as did “Raises the bed (gatch bed) from the foot end correctly” (PABAK = 0.98, Kappa = 0.97). Only one item in Task 1 exhibited a substantial divergence between the two agreement measures (PABAK and Kappa). The item “Speaks at a volume and speed appropriate for a patient with mild hearing loss,” which had a high performance rate (84.4%), showed a PABAK of 0.78 but a lower Kappa of 0.58.

Figure 1. Kappa and PABAK values for the 22 Task 1 checklist items. Blue bars represent Cohen's Kappa values and orange bars represent PABAK values for each checklist item. Higher values indicate stronger inter-rater agreement.
Trend of Agreement Over Successive Circuits
As examiners progressed through successive student assessments, their scoring agreement for Task 1 showed a notable upward trend (Figure 2). In the first circuit, the overall inter-rater agreement for Task 1—calculated across all checklist items—was PABAK = 0.70, indicating substantial agreement. By the final circuit (12th), this value had increased to PABAK = 0.89, suggesting improved alignment among examiners over time.

Figure 2. Trends in examiner agreement (PABAK) across 12 OSCE circuits for Task 1 and Task 2.
When analyzed by item domain, the mean PABAK across all circuits was highest for the 10 nursing skill items (0.87), reflecting substantial agreement. For the seven communication-related items, the mean PABAK was 0.74, while the five reporting items yielded a lower average of 0.69.
The total score agreement for Task 1 reached an excellent level, as indicated by ICC(2,1) of 0.88 (95% confidence interval [0.84, 0.93]).
Task 2: “Discharge Planning and Patient Education for a Middle-Aged Patient with Chronic Heart Failure Aiming for Social Reintegration” OSCE Performance
Overall Student Performance
Out of a maximum of 16 points, the average student score for Task 2 was 10.9 points (68.2%, standard deviation = 26.6). Students performed particularly well on patient education and safety-related items. The highest-scoring item was “Confirms post-discharge dietary management with the patient,” which 95.6% of students completed successfully. Similarly high success rates were observed for “Confirms post-discharge medication management” and “Explains self-care behaviors to prevent heart failure exacerbation,” both at 95.0%. In contrast, the lowest-performing item was “Verifies the patient's identity using their full name” at the start of the encounter, achieved by only 9.4% of students. Another frequently missed item was “Checks if the patient understood the content of the doctor's explanation about their illness and treatment,” with a success rate of just 11.7%.
Inter-Rater Agreement by Item
The item with the lowest inter-rater agreement in Task 2 was “Explores the patient's thoughts/feelings about their heart failure condition and treatment,” which showed only moderate agreement (PABAK = 0.47, Kappa = 0.43). Another item with a marked discrepancy between measures was “Asks about the patient's activity level and daily living in relation to cardiac workload,” which had a PABAK of 0.58 but a notably low Kappa of 0.26, indicating only fair agreement by Kappa standards. Similarly, “Shows an empathetic attitude toward the patient who has been living with heart failure” yielded PABAK = 0.62 and Kappa = 0.30. In contrast, “Verifies patient identity using full name,” although performed by only 9.4% of students, demonstrated high inter-rater agreement (PABAK = 0.93, Kappa = 0.81). Another item, “Confirms post-discharge dietary management,” had similarly high agreement by PABAK (0.91) but a lower Kappa (0.48), illustrating how high prevalence can suppress Kappa despite near-universal agreement (Figure 3).

Figure 3. Kappa and PABAK values for the 16 Task 2 checklist items. Blue bars represent Cohen's Kappa values and orange bars represent PABAK values for each checklist item. Higher values indicate stronger inter-rater agreement.
In total, eight of the 16 checklist items in Task 2 exhibited a difference of ≥ 0.20 between PABAK and Kappa. The largest discrepancy was observed for the item “Listens without interrupting the patient,” which had a PABAK of 0.78 but a much lower Kappa of 0.23, a difference of 0.55. This item had a high performance rate, with 92.2% of students marked as having completed it. Other items showing a pronounced PABAK–Kappa gap (≥ 0.40) included high-success items such as “Confirms dietary management,” “Confirms medication management,” and “Explains self-care behaviors,” as well as the low-success item “Checks understanding of physician's explanation.”
Trend of Agreement Over Successive Circuits
As shown in Figure 2, the overall inter-rater agreement for Task 2, measured by PABAK, began at 0.72 in the first circuit and fluctuated slightly across rounds, ending at 0.77 in the final circuit.
When analyzed by item category, the five core communication skill items (e.g., introducing oneself, active listening, and expressing empathy) demonstrated substantial agreement, with an average PABAK of 0.80. In contrast, the six items related to confirming patient symptoms or understanding (e.g., checking comprehension of explanations, asking about concerns) had a lower average PABAK of 0.69. The four items focusing on patient education and explanation showed intermediate agreement, with an average PABAK of 0.74.
Finally, the overall agreement on total scores for Task 2 was excellent, with an ICC(2,1) of 0.89 (95% confidence interval [0.84, 0.93]).
Discussion
By examining inter-rater reliability and scoring trends in a nursing OSCE, this study demonstrated that the consistency of examiner evaluations varied depending on the nature of the checklist items. In general, items involving direct observation of psychomotor behaviors, such as performing auscultation or executing a technical nursing procedure, showed higher levels of agreement between raters. In contrast, items requiring assessment of verbal communication, clinical reasoning, or empathetic responses tended to yield lower agreement. Moreover, in the skills-focused station (Task 1), inter-rater agreement improved over successive circuits, suggesting that repeated exposure led to better calibration among examiners. However, in the communication-focused station (Task 2), agreement levels remained relatively stable despite multiple repetitions and moderation efforts. Finally, items assessing verbal behaviors were more likely to exhibit large discrepancies between Kappa and PABAK, particularly when the item performance was highly imbalanced (i.e., very high or very low). These findings highlight the importance of selecting appropriate reliability indices when evaluating subjective or low-prevalence behaviors in performance-based assessments.
The finding that technical skill items demonstrated high inter-rater agreement is unsurprising. Behaviors such as performing a procedure are either completed or not, and are typically straightforward to observe and evaluate consistently. This likely reflects the fact that such items have clearer, more objective evaluation criteria, making it easier for different raters to reach consensus.
In contrast, the lower agreement observed for items related to reporting clinical findings or checking patient understanding suggests that these behaviors were perceived as more subjective or ambiguously defined. This is consistent with findings by Ishikawa et al. (2017), who reported that OSCE items requiring clinical reasoning or history-taking exhibited greater variability between raters compared with concrete physical examination techniques. The current results illustrate this pattern. In Task 1, for instance, whether a student auscultated the chest is a binary and highly observable action, while the item “adequately reported the patient's status” is more open to interpretation and may vary depending on each rater's expectations or clinical experience.
More recently, examiner-related factors, such as gender, professional background, and prior OSCE experience, have been identified as influential variables affecting inter-rater reliability in communication-focused OSCEs (Mortsiefer et al., 2017). In parallel, systematic reviews of OSCE communication checklists have revealed ongoing heterogeneity in rubric design and a lack of consensus regarding evaluation standards (Setyonugroho et al., 2015). International studies have echoed these concerns, underscoring persistent challenges in maintaining examiner consistency and ensuring checklist validity across varied educational contexts, including nursing OSCEs (de Beer et al., 2023; Guerrero et al., 2024; Hyde et al., 2022). This issue also reflects findings reported by Cazzell and Howe (2012), who categorized OSCE checklist items by cognitive, psychomotor, and affective learning domains. Their results revealed that items assessing the cognitive (knowledge) and psychomotor (skills) domains achieved acceptable inter-rater reliability, whereas items targeting the affective domain, such as empathy or professional demeanor, showed notably lower reliability. In the current study, Task 2 included an affective-domain item: “Shows an empathetic attitude toward the patient who has been living with heart failure.” This item had a relatively low PABAK of 0.62, aligning with Cazzell and Howe's observations. These findings suggest that soft-skill evaluations may be inherently more difficult to standardize and may require more nuanced rubrics, clearer behavioral anchors, or enhanced rater training to be assessed reliably.
The finding that inter-rater agreement improved over time in Task 1 but not in Task 2 is particularly noteworthy. Built-in moderation sessions were implemented throughout the OSCE, anticipating that periodic discussions among examiners would help calibrate scoring. This expectation aligns with prior research by Watari et al. (2022), who found that mid-OSCE moderation meetings can significantly enhance scoring consistency across examiners. In the current study, despite Task 1 having more checklist items (22 vs. 16) and a longer scenario duration, inter-rater agreement increased noticeably after each moderation session, suggesting that calibration efforts were effective for that station. A likely reason is that Task 1 included more concrete behaviors, such as physical assessments and simple communication acts, making it easier for examiners to reach consensus on what constituted acceptable performance during moderation. In contrast, Task 2, which was shorter and centered on dialogue and patient education, showed no clear trend of improvement. One possible explanation is examiner fatigue or cognitive overload, because the station required raters to assess multiple qualitative elements within a limited time frame. Chong et al. (2017) noted that OSCE examiners are susceptible to fatigue, especially when evaluating many candidates, which can result in scoring drift and inconsistent ratings later in the session. This risk underscores the importance of examiner rotation, scheduled breaks, and/or workload management to maintain scoring reliability in extended OSCE administrations. In the current study, all faculty members were relatively new to OSCE rating and may have found it challenging to sustain consistent attention to nuanced communication behaviors by the 12th circuit.
Another important consideration is the experience level of the examiners. All faculty raters in the OSCE were first-time examiners who had received only one or two brief training sessions. Hyde et al. (2022) found that novice OSCE examiners are more susceptible to rating inconsistencies compared with those with greater experience. In the current study, despite initial training, some raters may have lacked confidence or consistency in applying the checklist, an issue supported by prior findings that insufficiently prepared examiners often feel uncertain during performance-based evaluations (Hyde et al., 2022). Nevertheless, overall agreement on total scores was excellent for both tasks, with ICC(2,1) values of 0.88 for Task 1 and 0.89 for Task 2, suggesting a high degree of consistency at the aggregate level. In the future, more extensive examiner preparation may be necessary. Guerrero et al. (2024) reported that simulation-based rater training outperformed didactic and online methods in improving examiner readiness, suggesting that engaging faculty in mock OSCE sessions, such as scoring recorded student encounters followed by group calibration discussions, may enhance inter-rater agreement, particularly for subjective items. These findings underscore the importance of robust rater preparation: a brief 1–2-h orientation may be insufficient for first-time examiners to reliably assess complex communication and counseling behaviors.
A key methodological insight from the current study is the “high agreement, low Kappa” paradox, which was observed in several items with highly skewed performance rates. Feinstein and Cicchetti (1990) attributed this phenomenon to prevalence bias, in which one response category dominates, and rater bias, reflecting systematic differences in how examiners apply scoring criteria. In the present data, items that almost all students passed or failed often yielded Kappa values that underestimated the true level of agreement, occasionally even suggesting chance-level reliability despite substantial observed concordance. This issue is especially relevant in OSCE settings, where certain checklist behaviors are nearly universally performed or missed. In these cases, relying solely on Kappa can be misleading. These findings reinforce the recommendation by Cazzell and Howe (2012) that both Kappa and PABAK should be reported to provide a more robust and comprehensive analysis of inter-rater reliability in simulation-based assessments.
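A hypothetical 2 × 2 table makes the mechanism concrete (the counts below are invented and do not correspond to any specific item in this study):

```python
# Hypothetical 2x2 table for a near-universally performed item (90 students):
# both raters score 1 for 85 students, both score 0 for 1, and they split on 4.
a, b, c, d = 85, 2, 2, 1                    # a = both 1, b/c = split, d = both 0
n = a + b + c + d
po = (a + d) / n                            # observed agreement = 0.956
p1, p2 = (a + b) / n, (a + c) / n           # each rater's "performed" rate
pe = p1 * p2 + (1 - p1) * (1 - p2)          # chance agreement given the margins
kappa = (po - pe) / (1 - pe)                # ~0.31 ("fair") despite 95.6% agreement
pabak = 2 * po - 1                          # ~0.91 ("almost perfect")
print(f"Po = {po:.3f}, Kappa = {kappa:.2f}, PABAK = {pabak:.2f}")
```

With 95.6% observed agreement, Kappa falls to roughly 0.31 because chance agreement under such skewed margins is already about 0.94, whereas PABAK remains about 0.91.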
Limitations
While this study provides meaningful insights into OSCE inter-rater reliability in nursing education, two limitations should be noted:
1. Examiner-Related Factors and Training Limitations: All nursing faculty members were first-time OSCE examiners with no prior experience in OSCE-based evaluation, and they received only minimal training. This likely influenced inter-rater agreement and limits the generalizability of the findings to populations with more experienced raters. It is possible that agreement rates would be higher in settings with seasoned examiners; thus, the results may partially reflect a learning curve effect among raters. In addition, because each examiner pair assessed only a subset of students, pair-level comparisons of inter-rater reliability were not conducted; potential variability across examiner pairs should be addressed in future research. Furthermore, although examiners attended a pre-examination briefing and demonstration session, a full mock OSCE with structured pre-moderation was not conducted. Such pre-moderation activities may further enhance examiner calibration and should be considered in future implementations. Finally, examiner fatigue over the course of the examination day was not assessed; fatigue may have contributed to variability in ratings and should be considered in future research.

2. Single-Institution Context and Custom Checklists: The OSCE scenarios and evaluation checklists were developed internally and tailored to the curricular goals of the university. Several checklist items were unique to the program's approach, which may limit the generalizability of the specific agreement values reported. Variations in scenario design, checklist content, or cultural expectations across institutions may yield different inter-rater reliability outcomes. However, the broader patterns observed, particularly the differences in agreement between psychomotor and communication-focused items, are likely to be applicable in similar educational settings.
Finally, although the sample size of 90 students and 38 examiners was appropriate for evaluating inter-rater agreement within this institutional context, no formal power analysis was conducted for item-level comparisons. Future studies with larger, multi-institutional samples may provide more robust estimates of reliability across diverse contexts.
Implications for Nursing Practice
This study underscores the importance of structured examiner training to enhance OSCE reliability, especially in subjective areas such as communication, where scoring often varies. Consistent, simulation-based calibration can enable examiners to develop a shared understanding of nuanced behaviors like empathy. Equally vital is the refinement of OSCE checklists to ensure greater clarity and specificity, thus promoting more consistent evaluations. In the context of nursing education, integrating examiner training with improved checklist design contributes to assessments that are both more reliable and equitable. Future research should assess the effectiveness of these interventions and explore additional strategies to better align examiner judgments, ultimately enhancing the quality of OSCEs and equipping graduates for real-world clinical demands.
Conclusion
This study demonstrated that the reliability of OSCE scoring in pre-registration nursing education is strongly influenced by the type of skills and competencies being evaluated. Observable technical skills consistently yielded higher inter-rater agreement, whereas items related to communication and empathy tended to exhibit lower consistency and larger discrepancies between agreement indices. These variations were largely shaped by vague item phrasing and skewed score distributions. To improve the reliability and fairness of OSCEs in nursing education, educators should emphasize structured examiner training and the development of clearer, more precise checklists. Such enhancements may bolster the objectivity of performance assessments and further solidify OSCEs as essential instruments for evaluating clinical competence.
Acknowledgments
The authors wish to express sincere gratitude to the faculty members who served as OSCE examiners and to the students who generously contributed their time and effort to this study. Special thanks are also extended to the faculty members who supported the organization and coordination of the examination sessions. We thank Benjamin Knight, MSc., from Edanz for editing a draft of this article.
Ethical Approval and Informed Consent
This study was approved by the university's ethics review committee (Approval ID: 2023354). Participation in the OSCE was a standard curricular requirement for students; however, approval was obtained to use performance data for research purposes. An information sheet outlining the study's aims and procedures was posted on the campus bulletin board and distributed via email to all eligible students and faculty, offering the opportunity to opt out of data inclusion. Confidentiality was strictly maintained: individual student scores and identities were anonymized in the analysis, and examiner identities were also kept confidential. The use of performance data for research had no impact on students’ course grades or examination results.
Author Contributions
SY designed and conceptualized the study. AO and AM contributed to data acquisition. SY conducted the data analysis and drafted the manuscript with intellectual input and critical revisions from AO and KY. KY acquired the funding that supported this research. All authors contributed to the interpretation of the data, reviewed the manuscript for important intellectual content, and approved the final version. All authors agree to be accountable for all aspects of the work.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this study was supported by the KMU Faculty of Nursing Research Consortium (2023).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets used and/or analyzed during the current study will be available from the corresponding author on reasonable request.
Supplemental Material
Supplemental material for this article is available online.
References
