Sage Journals: Discover world-class research

Abstract

Objective

This study reports the development and preliminary evaluation of SAAAR, a multimodal system designed to assess and support the development of situation awareness (SA).

Background

SA is critical in anesthesiology, yet existing assessment methods lack standardized tools tailored to the complexities of anesthetic practice. Systems developed in other domains have limited applicability, highlighting the need for a purpose-built approach for anesthesia residents.

Method

The SAAAR comprises two components: a 16-item behavioral marker scale and a structured debriefing with eye-tracking. Thirteen anesthesiology faculty tested interrater and test-retest reliability, while five experts conducted content validation of the scale. Both components were implemented in a simulation-based training program for preliminary system evaluation.

Results

The behavioral marker scale demonstrated moderate content validity and high reliability. Internal consistency was strong (McDonald’s Ω = 0.928), test-retest reliability high (Spearman’s ρ = 0.952), and interrater agreement moderate (Kendall’s W = 0.412). Faculty reported the scale to be clear, comprehensive, and easy to use. Pilot implementation showed significant improvements across domains (Wilcoxon signed-rank test), indicating the system’s potential to provide targeted feedback and guide educational interventions.

Conclusions

Grounded in HFE principles, the SAAAR provides a structured approach to assessing SA in anesthesia residents and demonstrates preliminary potential to inform educational strategies. Further research is required to determine its impact on clinical performance.

Application

The SAAAR offers residency programs and human factors experts a practical tool for assessing SA and designing targeted training. Its adaptable framework suggests potential applicability in other high-pressure medical contexts, pending further evaluation.

Keywords

situation awareness, anesthesia and perioperative care, medical simulation/training and assessment, training evaluation, patient safety

Introduction

Effective decision making in high-stakes, dynamic environments—such as the operating room—depends on clinicians’ ability to interpret and anticipate evolving scenarios. In this context, situation awareness (SA) is essential to anesthetic practice, enabling professionals to perceive relevant cues, comprehend their significance, and project future states. SA, as defined by Endsley (1995b), comprises three levels: perception, comprehension, and projection. This model has been widely applied in aviation, military operations, and, increasingly, in healthcare, where failures in SA have been linked to adverse events and compromised patient safety (Endsley & Garland, 2000; Schulz et al., 2016).

SA failures have been identified in approximately 81.5% of adverse events in anesthesiology, underscoring the central role of SA in clinical performance (Schulz et al., 2016). Despite this, SA training remains largely informal, relying on experiential learning over standardized tools or pedagogical strategies (Haber et al., 2017). Existing SA assessment systems, developed for other domains, are ill-suited to the complexities of anesthesiology residency education.

Several efforts have been made to adapt SA measurement to medical simulation. Wright et al. (2004) emphasized the need for objective, validated metrics in human patient simulation and identified the Situation Awareness Global Assessment Technique (SAGAT) as an alternative to traditional performance indicators. Crozier et al. (2015) developed the Team Situation Awareness Global Assessment Technique (TSAGAT), demonstrating its reliability in trauma teams. Hogan et al. (2006) validated SAGAT’s ability to distinguish expertise levels in trauma simulations. Shelton et al. (2013) refined the Situation Present Assessment Method (SPAM), enabling real-time SA queries via handheld devices. Schulz et al. (2016) identified SAGAT as the gold standard, SART as a post hoc self-report, and the ANTS behavioral rating scale (Fletcher et al., 2003) as the most common observational method.

While these tools offer valuable insights, their applicability to anesthesia training is limited. The freeze-probe SAGAT interrupts task flow, making it unsuitable for real operating rooms. The SART posttask self-evaluation scale is easier to use but raises validity concerns due to memory bias and limited insight into SA failures (Endsley, 1995a). Observer-based tools such as ANTS can assess nontechnical skills, including SA, but observable behaviors often fail to capture the underlying cognitive processes—particularly at Level 2 (comprehension). Furthermore, most SA tools lack rigorous longitudinal evaluation (Stanton et al., 2013), restricting innovation and pedagogical advancement. Therefore, SA assessment in anesthesiology should aim not only to measure performance but also to foster SA development as an educational competency.

These tools were designed to evaluate expert performance rather than to facilitate learning in formative training contexts. Their limited scope, feasibility, and pedagogical applicability underscore the need for assessments specifically tailored to anesthesiology residency education, where distinctive SA demands arise from the dynamic and complex nature of anesthetic practice. In this setting, residents must integrate physiological data, direct observations, team interactions, equipment function, and environmental cues (perception) into a coherent mental model that anticipates potential complications (comprehension), while forecasting risks, planning contingencies, and managing competing goals under high cognitive load, time constraints, and team demands (Dishman et al., 2020; Fioratou et al., 2010; Gaba et al., 1995). Although simulation-based education effectively improves SA (Walshe et al., 2019), and constructive instructor-led debriefing enhances learning (Savoldelli et al., 2006), anesthesiology training still relies mainly on workplace learning, the backbone of postgraduate education (Olmos-Vega, 2018). Consequently, SA assessment methods must be adaptable to both simulated scenarios and real clinical environments.

This study aimed to design and conduct a preliminary evaluation of the Situation Awareness Assessment for Anesthesia Residents (SAAAR), a multimodal system composed of a behavioral marker scale and a structured debriefing supported by eye-tracking. The system was developed for use in both simulated and clinical environments, aligned with residency education goals and the ACGME Anesthesiology Milestones 2.0—specifically Patient Care 7 (PC7): Situational Awareness and Crisis Management (ACGME, 2021). The SAAAR supports SA as a core competency through assessment, continuous feedback, and progressive skill development.

Methods

Design of the SAAAR System

The Situation Awareness Assessment for Anesthesia Residents (SAAAR) was designed from an HFE perspective to support the assessment and development of SA in anesthesiology training. It adopts Endsley’s definition of SA as the perception of environmental elements, their comprehension, and the projection of future status (Endsley, 1995b). The SAAAR comprises two components: a 16-item behavioral marker scale and a structured debriefing supported by eye-tracking. The scale is organized into four domains: three reflecting Endsley’s theoretical levels of SA (perception, comprehension, and projection) and one pedagogical domain, communication.

Although not part of the original SA taxonomy, communication emerged during the formative design process as a key behavioral indicator for making residents’ mental models visible during critical events. A prior study on SA training needs (Daza-Beltrán et al., 2025) found many SA errors stemmed from communication failures. In this context, four observable behaviors were included in the SAAAR—providing necessary information clearly, communicating assertively, demonstrating active listening, and ensuring closed-loop communication—not as a theoretical SA dimension, but as a means to externalize reasoning and enable timely, targeted pedagogical feedback to shape mental models.

The domains targeted for assessment were: perception (Level 1), defined as the ability to detect relevant information and manage attention, measured via gaze overlap in eye-tracking videos; comprehension (Level 2), the development and application of mental models, inferred from performance and verbalizations; and projection (Level 3), anticipating contingencies and responding to errors, evaluated through actions and reasoning. Communication—conveying and obtaining key information—was also targeted for assessment within the system. These behavior-based observations (SA as product) were enriched by structured video debriefing, accessing cognitive processes (SA as process) and supporting pedagogical feedback.

Within this academic context, the SAAAR was developed to: (1) assess SA during both clinical and simulated procedures; (2) identify SA-related behavioral gaps and support the implementation of targeted educational interventions to strengthen SA competencies; (3) foster reflective learning through video-based debriefing supported by eye-tracking; and (4) support the evaluation of the effectiveness of SA-focused instruction.

Its design was informed by previously identified training needs in anesthesiology (Daza-Beltrán et al., 2025), but it was conceived as a flexible formative tool that can be adapted to different educational programs and applied in both clinical and simulated environments. The system provides structured feedback to enhance situational awareness learning, requires minimal observer training, and addresses the limitations of freeze-probe and self-report methods, which are often impractical and lack immediate pedagogical value in clinical settings.

The development of the system was informed by an exploratory literature review on SA-related skills in anesthesiology. This included foundational studies by Gaba et al. (1994, 1995, 1998) on production pressure, SA, and behavioral performance during simulated crises, and studies by Schulz et al. (2011, 2013, 2016, 2018) on SA-related errors. We also reviewed behavioral marker systems for nontechnical skills, particularly ANTS and ANTS-AP (Fletcher et al., 2004; Rutherford et al., 2015). The needs identified in the previous research (Daza-Beltrán et al., 2025) were aligned with the findings in the literature and guided the development of the initial item pool for the SAAAR.

The system includes two components:

Behavioral Marker Evaluation Format

This includes a 16-item behavioral marker scale organized into four domains: three reflecting Endsley’s theoretical levels of SA (perception, comprehension, and projection) and one pedagogical domain, communication. Each item is rated on a 5-point Likert scale (1 = Never, 5 = Always), indicating how often the resident demonstrates the behavior during the scenario.

Supervising professors, acting as expert observers, complete the 16-item format in real time while observing the simulated or clinical case and the live feed from the Tobii Pro Glasses 2 (Tobii Technology, 2014), a wearable eye-tracking device used by the residents. This first-person view helps verify attentional focus and interactions during the task. Combining direct observation with this perspective enhances interpretation of perception and projection behaviors, allows inference of comprehension through verbal reasoning, and provides opportunities to capture communication as a cross-cutting pedagogical complement that externalizes residents’ mental models.

In contrast to previous quantitative studies that used eye-tracking to examine perceptual differences between novice and expert anesthetists in simulated scenarios (Desvergez et al., 2019), our study did not conduct quantitative analysis of eye-tracking data. Instead, gaze-overlay videos were qualitatively reviewed to support formative observation and feedback. This approach aligns with the formative purpose of the SAAAR, enabling immediate interpretation and personalized feedback during debriefing. Quantitative processing, in contrast, requires specialized software and delays analysis, limiting its applicability for real-time reflection—one of the main constraints of current SA assessment methods.

The gaze-overlay recordings helped evaluators identify observed undesirable events (OUEs)—observable deviations from expected behavior that may signal SA failures (Daza-Beltrán et al., 2025)—which guided the debriefing process. The first-person video provided evaluators with access to critical information not visible from their physical position, such as the resident’s gaze direction, missed cues, and visual interactions with equipment or the environment. Faculty consistently described this feature as one of the most valuable components of the system, as it enabled a more accurate interpretation of behavior during high-pressure moments.

Importantly, gaze behaviors were not interpreted in isolation or used as definitive indicators of good or poor situation awareness. Potentially ambiguous behaviors, such as brief glances, were intentionally examined within the video debriefing session, where residents were invited to explain their intentions, reasoning, and contextual constraints at that moment. In this way, eye-tracking data functioned as a prompt for guided reflection rather than as a standalone evaluative metric, helping to mitigate subjective interpretations and support access to situation awareness as a cognitive process.

Video Debriefing Session

Immediately after each simulation, residents participate in individual debriefing sessions structured according to the formative debriefing model by Rudolph et al. (2008). The session begins with participants’ reactions to the scenario, followed by performance analysis using segments from the eye-tracking recordings that capture OUEs, along with results from the Behavioral Marker Evaluation Format (SA as process).

By revisiting OUEs from their own visual field, residents are able to identify discrepancies between perceived and actual performance, reconstruct their decision-making processes, and mitigate memory bias. This process supports guided reflection on their cognitive strategies and enables specific feedback grounded in real, SA-related behaviors.

Integrating real-time observation with self-reflection supported by eye-tracking video ensures that feedback remains timely, individualized, and pedagogically meaningful. It also allows evaluators to assess residents’ SA, from the interpretation of perceived information to the mental models and anticipatory processes that guide their decisions (SA as process). This approach complements the evaluation of SA as a product, and the session concludes with a summary of the key lessons learned by the residents for future performance improvement. This strategy is grounded in a dual understanding of SA: as a process, informing skill development; as a product, enabling performance assessment and gap identification.

Figure 1 summarizes this process.

Figure 1.

Sequential stages of pilot implementation of SAAAR in simulation-based training. Note. The figure describes the five stages of SAAAR pilot implementation: (a) calibration of eye-tracking glasses, (b) scenario start, (c) live assessment of 16-item behavioral marker format with gaze-overlay, (d) scenario end, and (e) video-based debriefing. AI-generated images were used to protect confidentiality, based on real study scenarios.

Simulation-Based Implementation of the SAAAR

The SAAAR was implemented in a simulation-based setting as part of a pilot study with the full 2023 cohort of anesthesiology residents at Pontificia Universidad Javeriana (18 participants: six per training year). Inclusion criteria were: (1) active enrollment in the residency, (2) full participation in the training module for their level, and (3) signed informed consent. No exclusion criteria were applied, as the intervention was educational and designed for full cohort participation.

The intervention was designed by the research team specifically to train situation awareness (SA) skills across three residency levels (R1–R3). Each group received a training module that included a pretraining simulation scenario (15 min), a theoretical session covering core SA principles for all participants with differentiated emphasis according to residency year (2 h), and a posttraining simulation scenario of similar difficulty conducted approximately eight days later (15 min). No debriefing was conducted between the two simulation scenarios. At completion of the module, participants engaged in a video-based debriefing session (30 min). The topics addressed were aligned with the SA training needs previously identified for each residency level (Daza-Beltrán et al., 2025), and the simulation scenarios were tailored to each group’s clinical responsibilities and experience.

Two structured scenarios—pretraining and posttraining—were conducted in the critical care room of the Clinical Simulation Center. The room was equipped with an Ohmeda Excel 110 anesthesia machine (GE Healthcare, UK), a Laerdal SimMan® 3G mannequin (Laerdal USA), and all necessary medical supplies. Each scenario involved an actor playing a nurse assistant and a biomedical engineer operating the simulator.

Residents wore Tobii Pro Glasses 2 (Tobii Technology, Sweden) to capture visual attention and behavior. Two anesthesiology faculty conducted synchronous SA evaluations using the behavioral marker format. Eye-tracking videos were used in structured debriefing sessions to develop individualized improvement plans. One faculty member led the simulation and assessment; the other coordinated logistics, including actors, equipment, recordings, and informed consent. Assessment data were used to evaluate the SAAAR’s internal reliability.

The implementation involved the complete multimodal SAAAR system, integrating both components: the 16-item behavioral marker scale and the structured debriefing guide supported by eye-tracking, which together provided complementary quantitative and qualitative evidence of residents’ SA performance.

Preliminary Evaluation of the SAAAR Behavioral Marker Format

The preliminary evaluation of the SAAAR presented in this section refers exclusively to the 16-item Behavioral Marker Evaluation Format. This component was examined for content validity, internal consistency, temporal stability, and interrater agreement within the context of anesthesiology residency training (see Figure 2). The structured debriefing component, by contrast, was not subjected to psychometric testing, as it serves a qualitative and formative feedback function within the training process. This evaluation represents an initial step in the instrument’s development and does not constitute a full psychometric validation of the complete SAAAR system.

Figure 2.

Validation and reliability assessment techniques applied in the study. Note. This figure includes only the validation techniques implemented and reported in this study.

Content Validity

Content validity was assessed through expert judgment using the criteria proposed by Escobar-Pérez and Cuervo-Martínez (2008): sufficiency, clarity, coherence, importance, and relevance. Table 1 summarizes the indicators associated with each rating level (1–4) across these criteria. Five experts participated. Three were senior psychology professors with extensive experience in test development and psychometric validation; two were physicians with clinical and research experience in SA, one with psychometric experience and the other an internist and anesthesiologist who had contributed to prior SA studies. All experts had over ten years of professional experience.

Table 1.

Content Validity Evaluation Criteria and Indicators Used by Expert Judges

Criterion	Indicators by Score Level
Sufficiency	1 = Items are not sufficient to measure the dimension
	2 = Items address parts of the dimension but do not cover it fully
	3 = Some items should be added to complete the dimension
	4 = Items are sufficient
Clarity	1 = Item is unclear
	2 = Requires major wording or phrasing changes
	3 = Requires specific term modifications
	4 = Clear semantics and syntax
Coherence	1 = No logical relation to the dimension
	2 = Tangential relation
	3 = Moderate logical relation
	4 = Fully aligned with the dimension
Importance	1 = Item can be removed without loss of understanding
	2 = Somewhat relevant, may be omitted
	3 = Relatively important
	4 = Very relevant and should be included
Relevance	1 = Can be removed with no impact on study purpose
	2 = Weak relevance to the purpose
	3 = Relatively relevant
	4 = Highly relevant and should be included

Note. Indicators for each rating level across the five criteria used in the expert judgment process.

Kendall’s W was employed to measure interrater agreement. After two evaluation rounds, nine of the sixteen items were revised based on expert feedback. Changes included rewording for clarity, refining behavioral descriptions to reduce ambiguity, and reassigning items to more appropriate SA dimensions. These modifications aimed to strengthen conceptual alignment and ensure that each item reflected a distinct, observable behavior relevant to its target SA component.

Internal Reliability

The internal consistency was evaluated using McDonald’s Omega coefficient to estimate the overall reliability of the SAAAR. The analysis used the 36 evaluations from the simulation-based implementation with the full 2023 cohort of anesthesiology residents (18 participants: six per training year). Each resident was assessed twice—before and after a training session—allowing for the evaluation of the instrument’s reliability across repeated applications and training levels (homogeneity). This structure, including level-specific scenarios and training content, was detailed in the “Simulation-Based Implementation” section.

Test-Retest Reliability

To assess the temporal stability of the SAAAR, test-retest reliability was evaluated using eye-tracking recordings from four first-year residents, collected during the pilot implementation. The same recordings were scored by faculty at two different time points, so that the input was identical across sessions. Thirteen anesthesiology faculty members (all with over ten years of experience) participated in the first evaluation round; nine completed the second round. In both sessions, evaluators independently scored the recordings using the behavioral marker format. Score consistency was analyzed using Spearman’s correlation coefficient. At the end of the second round, evaluators provided open-ended feedback on the format’s clarity and usability.
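As an illustration of the statistic used here, the following is a minimal sketch of Spearman's correlation computed as the Pearson correlation of average ranks. How first-round and second-round scores were paired in the study is not specified in the text, so the pairing, the function names, and the example values are all assumptions for illustration only.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(first_round, second_round):
    """Spearman's rho for two equally long lists of paired scores."""
    if len(first_round) != len(second_round) or len(first_round) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    rx, ry = average_ranks(first_round), average_ranks(second_round)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / sqrt(vx * vy)


# Hypothetical paired scores (not study data).
round_1 = [1, 2, 3, 4, 5]
round_2 = [1, 3, 2, 5, 4]
rho = spearman_rho(round_1, round_2)
```

The sketch returns only the coefficient; a significance test would normally come from a statistics package.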

Interrater Reliability

To assess the equivalence of the SAAAR, interrater reliability was analyzed using the data from the first round of the test-retest evaluation. In that session, thirteen anesthesiology faculty members independently rated four simulation videos, each showing a first-year resident performing a comparable clinical task. Using the 16-item behavioral marker format, each evaluator generated 64 item-level ratings (4 videos × 16 items). Agreement among raters was calculated using Kendall’s coefficient of concordance (W), providing a robust measure of scoring consistency across evaluators observing identical performances.
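For readers unfamiliar with the coefficient, the sketch below computes Kendall's W in its textbook form, with the usual correction for tied scores and the chi-square approximation χ² = m(n − 1)W on n − 1 degrees of freedom. It assumes a layout in which each of m raters scores the same n objects; the layout, the function names, and the example ratings are illustrative assumptions and are not taken from the study's analysis.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Return (W, chi2, df) where ratings[r][i] is rater r's score for object i."""
    m = len(ratings)                      # number of raters
    n = len(ratings[0]) if m else 0       # number of rated objects
    if m < 2 or n < 2 or any(len(row) != n for row in ratings):
        raise ValueError("need at least 2 raters scoring the same 2+ objects")
    rank_rows = [average_ranks(row) for row in ratings]
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie correction: sum of (t^3 - t) over each rater's groups of tied scores.
    ties = sum(t ** 3 - t for row in ratings for t in Counter(row).values())
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater gives constant scores")
    w = 12 * s / denom
    return w, m * (n - 1) * w, n - 1


# Hypothetical 5-point ratings from three raters on four objects (not study data).
example = [
    [2, 3, 4, 5],
    [1, 3, 4, 5],
    [2, 2, 5, 4],
]
w, chi2, df = kendalls_w(example)
```

W ranges from 0 (no agreement) to 1 (identical rankings). Because 5-point Likert ratings produce many ties, the tie correction matters in practice.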

Ethical Considerations

The project was approved by the ethics and research committees of the Faculty of Engineering (FID-22-366, November 17, 2022) and the Faculty of Medicine (FM-CIEI-0199-23, March 15, 2023) at Pontificia Universidad Javeriana. In accordance with the Declaration of Helsinki, participant anonymity was guaranteed, and all participants provided written informed consent before participating in the study, including consent for publication and the use of their photographs or other images. Additionally, all collected information has been anonymized. No incentives were offered.

Results

Evaluation System

The SAAAR was successfully implemented as a tool for evaluating and providing feedback on SA skills during simulated scenarios. This instrument integrated behavioral markers and video debriefing sessions, identifying gaps in perception, comprehension, projection, and communication. Faculty highlighted its applicability and usefulness for delivering specific, structured feedback.

Instrument Validation

Content Validity

The 16 items of the behavioral marker scale were evaluated by experts using a 4-point scale according to the criteria proposed by Escobar-Pérez and Cuervo-Martínez (2008): sufficiency, clarity, coherence, importance, and relevance. As shown in Table 2, the criteria of sufficiency, coherence, importance, and relevance received an average score of 3.10, while clarity scored lower at 2.60, highlighting the need for improved wording.

Table 2.

Expert Ratings on Sufficiency, Clarity, Coherence, Importance, and Relevance of SAAAR Items

Item Range	Criterion	Definition	Average Score
1–16	Sufficiency	Items within the same dimension adequately measure that dimension	3.10
	Clarity	Items are easily understood, with appropriate syntax and semantics	2.60
	Coherence	Logical relation of the item to the dimension or indicator being measured	3.10
	Importance	Essential for understanding the object of study	3.10
	Relevance	Close alignment with the established purpose	3.10

Note. Scale: 1 = does not meet the criterion, 4 = fully meets the criterion.

Kendall’s W coefficient (W = 0.090) indicated low initial interrater agreement, although the chi-square test (χ² = 28.917, p < 0.001) confirmed that this agreement was statistically significant and unlikely to have occurred by chance. The results suggest a high level of agreement for sufficiency, coherence, importance, and relevance (mean rank = 3.10), and a moderate level of agreement for clarity (mean rank = 2.60).

Qualitative feedback suggested refinements in item clarity and relevance for greater semantic precision and alignment with dimensions, and the reassignment of one task management item from the perception domain to the projection domain. A second round of review was conducted with three of the original five experts. Through iterative discussion, qualitative consensus was achieved on all items; however, no quantitative reassessment of interrater agreement was performed in this round. As a result, while semantic and conceptual improvements (particularly in clarity, coherence, and dimensional alignment) were substantial, the degree of statistical agreement in the final version could not be empirically verified. Additionally, the reduced number of participating experts may have facilitated consensus but limits the generalizability of these results.

Table 3 presents the final 16-item version of the SAAAR, organized into four domains: the three SA dimensions—perception, comprehension, and projection—and the cross-cutting pedagogical component of communication. All subsequent reliability analyses reported in this manuscript were conducted using this final version.

Table 3.

Final 16-Item Version of the SAAAR Instrument Organized by Domains

Situation Awareness Assessment for Anesthesia Residents	Never (1)	Rarely (2)	Sometimes (3)	Almost Always (4)	Always (5)
Level 1 SA: Perception
The resident is able to identify relevant information
The resident appropriately distributes attention according to the task type
The resident effectively discriminates haptic and auditory stimuli
Level 2 SA: Comprehension
The resident is able to identify possible complications
The resident uses the taught mental models and algorithms
The resident appropriately handles specialized equipment required for the case
The resident demonstrates competence in procedures appropriate to their training level
Level 3 SA: Projection
The resident prioritizes and manages their tasks effectively
The resident anticipates possible contingencies in complex medical situations
The resident evaluates the risks associated with their decisions
The resident appropriately handles team and time-related pressures
The resident effectively manages errors
Additional: Communication
The resident provides the necessary information clearly to other team members
The resident communicates assertively
The resident demonstrates active listening skills
The resident ensures clear, concise, complete, and correct closed-loop communication
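The layout of Table 3 can be sketched as a small data structure in which each domain holds a fixed number of items rated from 1 to 5, and a domain score is the sum of its item ratings. The names below are hypothetical and the code is only an illustration of the scale's structure, not software used in the study.

```python
# Response anchors from the header row of Table 3.
LIKERT = {1: "Never", 2: "Rarely", 3: "Sometimes", 4: "Almost Always", 5: "Always"}

# Number of items per domain, counted from Table 3 (3 + 4 + 5 + 4 = 16).
ITEMS_PER_DOMAIN = {
    "perception": 3,
    "comprehension": 4,
    "projection": 5,
    "communication": 4,
}


def domain_scores(ratings):
    """Sum item ratings within each domain.

    `ratings` is a list of 16 integers (1-5) in the item order of Table 3.
    """
    expected = sum(ITEMS_PER_DOMAIN.values())
    if len(ratings) != expected:
        raise ValueError(f"expected {expected} ratings, got {len(ratings)}")
    if any(r not in LIKERT for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 5")
    scores = {}
    start = 0
    for domain, n_items in ITEMS_PER_DOMAIN.items():
        scores[domain] = sum(ratings[start:start + n_items])
        start += n_items
    return scores


def max_domain_scores():
    """Highest possible score per domain (every item rated 5)."""
    top = max(LIKERT)
    return {domain: top * n for domain, n in ITEMS_PER_DOMAIN.items()}


# Hypothetical observation: every item rated "Sometimes" (3).
example = domain_scores([3] * 16)
```

With this layout the domain maxima are 15, 20, 25, and 20 points, which matches the maxima reported for the aggregated domain scores in the simulation-based results.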

Internal Reliability

Internal consistency, estimated with McDonald’s Omega coefficient (ω = 0.928; 95% CI [0.805, 1.0]), showed high reliability. By domains, values ranged from 0.844 (Comprehension) to 0.940 (Projection), with Communication scoring lower (0.671), indicating the need to adjust related items to enhance their contribution to overall consistency.

Test-Retest Reliability

A strong correlation was found between scores from the first and second evaluation rounds using Spearman’s correlation (ρ = 0.952, p < 0.001), indicating excellent test-retest reliability and supporting the temporal stability and robustness of the SAAAR instrument across repeated measures.

Interrater Reliability

Interrater agreement was analyzed using Kendall’s coefficient of concordance (W). The analysis showed moderate agreement among the evaluators (W = 0.412), and the associated chi-square test indicated that this agreement was statistically significant (χ² = 316.579, df = 12, p < 0.001). These results suggest the potential value of future calibration or training efforts to further improve evaluator consistency, particularly if the instrument is scaled beyond pilot use.

Evaluator Feedback on Format Usability

As noted in the Methods section, faculty evaluators who participated in the test-retest phase provided open-ended feedback on the clarity and usability of the SAAAR format. Their comments consistently described it as clear, complete, and easy to use. No suggestions for changes in item structure or content were reported. While not a formal usability study, these practitioner insights offer valuable qualitative evidence of the instrument’s practicality and formative utility in anesthesiology training.

Results of the Simulation-Based Implementation of the SAAAR

To assess the instrument’s utility in detecting meaningful changes in SA performance, the SAAAR was applied before and after a simulation-based training intervention with the full 2023 cohort of anesthesiology residents. Each resident completed a training module that included a theoretical session, a pretraining simulation scenario and posttraining simulation scenario evaluated quantitatively with the 16-item behavioral marker format (SA as a product), and a video-based debriefing session evaluated qualitatively (SA as a process).

For the quantitative analysis, item scores were summed within each domain of the SAAAR to obtain an aggregated score per domain: perception (maximum 15 points), comprehension (20 points), projection (25 points), and communication (20 points). Using the Wilcoxon signed-rank test with a one-tailed hypothesis, statistically significant improvements were observed across all domains of the SAAAR. Specifically, perception (W = 37.00, p = .031, r = .52), comprehension (W = 5.00, p < .001, r = .93), projection (W = 17.00, p = .001, r = .80), and communication (W = 19.00, p = .003, r = .76) showed consistent increases posttraining.
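To make the paired comparison concrete, the sketch below computes the Wilcoxon signed-rank sums for pre- and posttraining domain scores, together with the matched-pairs rank-biserial correlation as one common effect size. The text does not state which convention was used for W or r, so both choices are assumptions, as are the function name and the example scores. No p-value is computed here.

```python
def wilcoxon_signed_rank(pre, post):
    """Return (w_plus, w_minus, rank_biserial) for paired scores.

    Differences are post - pre; zero differences are dropped, and tied
    absolute differences share the average of their ranks.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    if not diffs:
        raise ValueError("all paired differences are zero")
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Matched-pairs rank-biserial correlation: ranges from -1 to 1.
    rank_biserial = (w_plus - w_minus) / (w_plus + w_minus)
    return w_plus, w_minus, rank_biserial


# Hypothetical perception scores for five residents (not study data).
pre_scores = [10, 12, 9, 11, 8]
post_scores = [12, 15, 10, 11, 7]
w_plus, w_minus, r_rb = wilcoxon_signed_rank(pre_scores, post_scores)
```

In this made-up example four residents change and one does not, so only four differences are ranked. A one-tailed test of improvement would then ask whether the positive rank sum is larger than expected by chance.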

These results suggest that the SAAAR system detects improvements in resident performance across SA domains. Figure 3 illustrates the pre- and posttraining scores for each domain.

Figure 3.

Pre- and posttraining SAAAR scores by domains. Note. Each graph compares pre- and posttraining scores for each domain. Data were collected using the SAAAR during the pilot implementation with anesthesiology residents. Bars represent means displayed for descriptive purposes, while squares indicate medians connected by lines that show the direction and magnitude of change per domain. Statistically significant improvements were found across all domains using the Wilcoxon signed-rank test (one-tailed).

During debriefing, instructors used gaze-overlay videos to corroborate the causes of SA errors (EOND), facilitate reflective discussion, and connect behavioral observations with cognitive processes. Qualitative feedback from residents highlighted the usefulness of this integrated approach, particularly the opportunity to visualize their own decision making, identify overlooked cues, and receive targeted feedback. The satisfaction survey indicated high levels of acceptance and perceived pedagogical value of the training.

These results suggest the potential utility of the SAAAR not only as an evaluation tool but also as a means to guide pedagogical interventions and specific training.

For example, in a simulation scenario involving a blood transfusion, units were intentionally mismatched to simulate a latent error. Most residents failed to verify the blood type, and the patient developed an anaphylactic reaction. Gaze-overlay review during debriefing revealed that residents had not checked the blood unit information themselves. Their verbal explanations showed difficulties anticipating such complications and verifying critical information, indicating a projection failure (Level 3 SA) and incomplete mental models (Level 2 SA). Communication issues also emerged, as verification steps were not confirmed verbally. This example illustrates how the two core components of the SAAAR, the Behavioral Marker Evaluation Format and the Video Debriefing Session supported by eye-tracking, enable evaluators to detect SA deficiencies in residents and reinforce core SA competencies through targeted feedback.

Discussion

Summary of Key Findings and Connection to Study Objective

This study presented the development and preliminary evaluation of the SAAAR, a multimodal system created to assess situation awareness (SA) in anesthesia residents. The system integrates behavioral markers, eye-tracking, and structured debriefing to identify gaps across the three SA levels (perception, comprehension, and projection) while supporting formative feedback. From a Human Factors and Ergonomics perspective, the SAAAR offers a context-adapted approach for simulation-based training, with potential applicability to clinical settings.

Comparison with Prior Research

Existing SA tools such as SAGAT (Endsley, 1988), SART (Taylor, 2017), and behavioral marker systems like ANTS (Fletcher et al., 2003) and ANTS-AP (Rutherford et al., 2015) have advanced the assessment of nontechnical skills in healthcare; however, their application to anesthesiology training remains constrained. These instruments often rely on retrospective or observer-only methods, offering limited access to how residents actually perceive, interpret, and anticipate evolving clinical situations.

In contrast, the SAAAR was designed to both assess and foster SA development by triangulating behavioral markers, first-person eye-tracking, and structured video debriefing. This integration enables the identification of attentional gaps through eye-tracking (Level 1), manifestations of internal comprehension expressed through communication behaviors (Level 2), and OUEs, including omitted or delayed anticipatory actions captured on video (Level 3), while transforming observed errors into opportunities for guided reflection and feedback.

These events are analyzed during debriefing to reveal residents’ reasoning processes, enabling access to Level 2 (comprehension) and Level 3 (projection), and complementing SA evaluation as both product and process. Unlike tools focused on scoring, the SAAAR fosters pedagogical dialogue—transforming errors into learning and feedback into reflection. Faculty reported ease of use, clearer evaluations, and deeper resident reflection. This approach aligns with educational goals and HFE principles by promoting resilience and safety in complex healthcare systems (Carayon et al., 2014).

Theoretical, Methodological, and Practical Implications

From a theoretical perspective, the SAAAR addresses SA as both product and process, treating these dimensions as complementary but distinct. This design responds to concerns raised by Endsley (2007) regarding the risks of inferring cognitive processes solely from observable behavior. By separating performance indicators from reflective access to reasoning, the SAAAR supports a more balanced understanding of how SA is constructed, maintained, and degraded in complex settings.

Methodologically, the SAAAR integrates eye-tracking, behavioral observation, and structured debriefing to infer attentional focus, reasoning, and anticipatory actions, allowing evaluators to contrast observed behaviors with residents’ mental models and thereby support guided reflection, targeted feedback, and learning.

From a practical standpoint, the pilot implementation demonstrated the system’s feasibility and pedagogical value. Faculty reported clear evaluations and strong resident engagement, and interrater reliability was acceptable. Whereas prior work often infers SA using quantitative eye-tracking metrics such as saccadic frequency or fixation time (Cha & Yu, 2021), the SAAAR adopts a qualitative approach that facilitates its integration into educational settings. This enabled immediate, personalized feedback and meaningful reflection without the complexity of quantitative data analysis, and required minimal faculty training, addressing limitations previously noted in simulation-based SA evaluations (Gaba et al., 1998). Communication, included as a pedagogical complement, further enhanced practicality by making residents’ mental models explicit and guiding supervisory feedback when interpretive errors emerged.

Based on these outcomes, the anesthesiology department and the simulation center where the pilot was conducted have expressed interest in formally incorporating the SAAAR into the residency curriculum, and have begun investing in dedicated infrastructure to support its continued development.

Although tested only in simulation, the SAAAR shows promise for clinical implementation. Its structured format enables identification of perception, comprehension, and projection errors, along with communication breakdowns that may compromise safety. The tool’s adaptability also suggests potential for other high-stakes areas such as emergency medicine, critical care, or perioperative transitions.

Limitations and Future Directions

Despite its strengths, the SAAAR presents several limitations. First, the psychometric evaluation was preliminary and limited to the 16-item behavioral marker format; therefore, the findings should not be interpreted as conclusive. The debriefing component was not subjected to psychometric testing because it is a pedagogical process rather than a measurement instrument. Its purpose is to explore residents’ reasoning, clarify mental models, and facilitate targeted feedback, contributing to learning and access to SA as a cognitive process rather than to the generation of quantitative scores.

The weighting assigned to the Communication component also represents a preliminary design decision. Although Communication is not conceptualized as a dimension of situation awareness, its higher weighting reflects its pedagogical relevance and observability in training contexts and should be considered provisional and subject to refinement in future studies.

Second, interrater reliability in the pilot study was moderate (Kendall’s W = 0.412), suggesting the need to enhance evaluator training. Third, although the system was designed for both simulated and clinical environments, its current evaluation is limited to simulation-based settings and to a single residency cohort, restricting generalizability. Additionally, despite the use of different scenarios separated by approximately eight days, some degree of familiarization with the simulation environment and task structure may have partially influenced performance.
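For readers less familiar with the agreement statistic reported above, Kendall's W can be computed as in the following minimal sketch. It assumes a hypothetical data layout with one row of scores per evaluator and one column per rated performance, uses the standard tie-corrected formula, and relies on illustrative function names; it is not the procedure or software used in the study. W ranges from 0 (no agreement among evaluators) to 1 (identical rankings).

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the last position holding the same value as position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for ratings[rater][subject], with tie correction."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(row) for row in ratings]
    # Total rank received by each subject across all raters.
    totals = [sum(row[s] for row in rank_rows) for s in range(n)]
    mean_total = m * (n + 1) / 2
    s_stat = sum((t - mean_total) ** 2 for t in totals)
    # Tie correction: sum of (t^3 - t) over each group of tied scores per rater.
    ties = 0
    for row in ratings:
        for value in set(row):
            t = row.count(value)
            ties += t ** 3 - t
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater ties all subjects")
    return 12 * s_stat / denom
```

With three hypothetical evaluators ranking four performances as `[1, 2, 3, 4]`, `[2, 1, 3, 4]`, and `[1, 3, 2, 4]`, the rank totals are 4, 6, 8, and 12, and the function returns W = 7/9 ≈ 0.78.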

Future research should address these limitations by testing the SAAAR in real clinical contexts, expanding psychometric evaluation to include convergent, discriminant and predictive validity, and comparing its outcomes with established tools such as SAGAT or ANTS. Beyond individual assessment, the SAAAR could be adapted to team-based SA assessment in high-stakes environments such as emergency medicine or intensive care. Exploring its application in interdisciplinary teams could expand its scope and contribute to a better understanding of shared awareness and collaborative decision making in patient care.

Conclusions

Building on this foundation, the present study introduces and preliminarily evaluates the SAAAR as a structured multimodal system for assessing SA in anesthesiology training. By integrating behavioral observation, verbal reasoning, and eye-tracking evidence, the SAAAR enables a deeper understanding of SA and supports targeted pedagogical feedback. Its design responds to the specific training needs of anesthesia residents and offers a feasible approach for improving SA and patient safety in complex environments.

Key Points

• This study presents the SAAAR, a system for assessing SA in anesthesia residents, and reports its preliminary evaluation in simulated settings.

• It integrates innovative tools such as behavioral markers, eye-tracking, and video debriefing, enabling comprehensive evaluation across the three theoretical SA levels (perception, comprehension, and projection), together with communication as a pedagogical component.

• The behavioral marker format demonstrated high internal reliability (ω = 0.928) and test-retest reliability (ρ = 0.952, 95% CI: [0.752, 0.992]), supporting its robust implementation in training programs.

• The SAAAR’s adaptable design allows for potential applications in clinical environments and other high-complexity medical domains.

Footnotes

Acknowledgments

This study was conducted as part of the first author’s doctoral dissertation. We extend our gratitude to the Department of Anesthesiology at Pontificia Universidad Javeriana and the Hospital Universitario San Ignacio for their collaboration in this research, as well as to the operating room staff, residents, and anesthesia supervisors who generously volunteered their time. Special thanks to the staff at the Clinical Simulation Center at Pontificia Universidad Javeriana and its director, Leonar Aguiar, for providing their resources and time for the tests conducted with residents.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is supported by Pontificia Universidad Javeriana; 20854.

ORCID iDs

Carolina Daza-Beltrán

Angélica Paola Fajardo Escolar

Martha Caro

Daniel R. Suárez

Author Biographies

Carolina Daza-Beltrán is an assistant professor in the Design Department at Pontificia Universidad Javeriana, Bogotá, Colombia. She earned her PhD in engineering (2025) from Pontificia Universidad Javeriana and specializes in human factors and ergonomics.

Angélica Paola Fajardo Escolar is an anesthesiologist with an MSc in pediatric anesthesiology (2014) from Universidad Nacional Autónoma de México (UNAM). She is an assistant professor and the current Head of the Anesthesiology Department at Pontificia Universidad Javeriana. She also works as a clinical supervisor and researcher at Hospital Universitario San Ignacio.

Martha Caro is an associate professor in the Industrial Engineering Department at Pontificia Universidad Javeriana, Bogotá, Colombia. She earned her PhD in engineering (2018) from Pontificia Universidad Javeriana and specializes in human factors and ergonomics.

Daniel R. Suárez is a full professor in the Industrial Engineering Department at Pontificia Universidad Javeriana, Bogotá, Colombia. He holds a PhD in biomedical engineering (2012) from Delft University of Technology, The Netherlands. His work focuses on engineering design and human factors applied to healthcare and biomechanics.

References

Carayon, Xie, & Kianfar (2014). Human factors and ergonomics as a patient safety practice. BMJ Quality and Safety, 23(3), 196–205. https://doi.org/10.1136/bmjqs-2013-001812

Cha, J. S., & Yu (2021). Objective measures of surgeon non-technical skills in surgery: A scoping review. Human Factors, 64(1), 42–73. https://doi.org/10.1177/0018720821995319

Crozier, M. S., Ting, H. Y., Boone, D. C., O’Regan, N. B., Bandrauk, Furey, Squires, Hapgood, & Hogan, M. P. (2015). Use of human patient simulation and validation of the team situation awareness global assessment technique (TSAGAT): A multidisciplinary team assessment tool in trauma education. Journal of Surgical Education, 72(1), 156–163. https://doi.org/10.1016/j.jsurg.2014.07.009

Daza-Beltrán, Olmos-Vega, F. M., Caro Gutiérrez, M. P., Muñoz Larsson, & Suárez Venegas, D. R. (2025). Identifying situation awareness training needs in anesthesia residents: A qualitative study. Colombian Journal of Anesthesiology, 53(4), 1–10. https://doi.org/10.5554/22562087.e1154

Desvergez, Winer, Gouyon, J.-B., & Descoins (2019). An observational study using eye tracking to assess resident and senior anesthetists’ situation awareness and visual perception in postpartum hemorrhage high fidelity simulation. PLoS One, 14(8), Article e0221515. https://doi.org/10.1371/journal.pone.0221515

Dishman, Fallacaro, M. D., Damico, & Wright, M. C. (2020). Adaptation and validation of the situation awareness global assessment technique for nurse anesthesia graduate students. Clinical Simulation in Nursing, 43(1), 35–43. https://doi.org/10.1016/j.ecns.2020.02.003

Endsley, M. R. (1988). Situation awareness global assessment technique (SAGAT). In Aerospace and Electronics Conference, 1988. NAECON 1988, Proceedings of the IEEE 1988 National (pp. 789–795). IEEE.

Endsley, M. R. (1995a). Measurement of situation awareness in dynamic systems. Human Factors, 37(1), 65–84. https://doi.org/10.1518/001872095779049499. https://www.ida.liu.se/∼729A71/Literature/SA_M/Endsley_1995.pdf

Endsley, M. R. (1995b). Toward a theory of situation awareness in dynamic systems. In Situational awareness (pp. 9–42). Routledge. https://doi.org/10.4324/9781315092898-13

Endsley, M. R. (2007). Theoretical underpinnings of situation awareness: A critical review. Journal of Operations Management, 25(6), 1141–1160. https://doi.org/10.1201/b12461

Endsley, M. R., & Garland, D. J. (Eds.). (2000). Situation awareness analysis and measurement (1st ed.). CRC Press. https://doi.org/10.1201/b12461

Escobar-Pérez, & Cuervo-Martínez, Á. (2008). Validez de contenido y juicio de expertos: una aproximación a su utilización. Avances en Medición, 6(1), 27–36.

Fioratou, Flin, Glavin, & Patey (2010). Beyond monitoring: Distributed situation awareness in anaesthesia. British Journal of Anaesthesia, 105(1), 83–90. https://doi.org/10.1093/bja/aeq137

Fletcher, Flin, McGeorge, Glavin, Maran, & Patey (2003). Anaesthetists’ non-technical skills (ANTS): Evaluation of a behavioural marker system. British Journal of Anaesthesia, 90(5), 580–588. https://doi.org/10.1093/bja/aeg112

Fletcher, Flin, McGeorge, Glavin, Maran, & Patey (2004). Rating non-technical skills: Developing a behavioural marker system for use in anaesthesia. Cognition, Technology & Work, 6(3), 165–171. https://doi.org/10.1007/s10111-004-0158-y

Gaba, D. M., Howard, S. K., Flanagan, Smith, B. E., Fish, K. J., & Botney (1998). Assessment of clinical performance during simulated crises using both technical and behavioral ratings. Anesthesiology, 89(1), 8–18. https://doi.org/10.1097/00000542-199807000-00005

Gaba, D. M., Howard, S. K., & Jump (1994). Production pressure in the work environment. California anesthesiologists’ attitudes and experiences. Anesthesiology, 81(2), 488–500. https://doi.org/10.1097/00000542-199408000-00028

Gaba, D. M., Howard, S. K., & Small, S. D. (1995). Situation awareness in anesthesiology. Human Factors, 37(1), 20–31. https://doi.org/10.1518/001872095779049435

Haber, J. A., Ellaway, R. H., Chun, & Lockyer, J. M. (2017). Exploring anesthesiologists’ understanding of situational awareness: A qualitative study. Canadian Journal of Anesthesia-Journal Canadien D Anesthesie, 64(8), 810–819. https://doi.org/10.1007/s12630-017-0904-2

Hogan, M. P., Pace, D. E., Hapgood, & Boone, D. C. (2006). Use of human patient simulation and the Situation Awareness Global Assessment Technique in practical trauma skills assessment. The Journal of Trauma, 61(5), 1047–1052. https://doi.org/10.1097/01.ta.0000238687.23622.89

Olmos-Vega, F. M. (2018). Workplace learning through interaction. [Maastricht: Datawyse / Universitaire Pers Maastricht.]. https://doi.org/10.26481/dis.20181207fo

Rudolph, J. W., Simon, Raemer, D. B., & Eppich, W. J. (2008). Debriefing as formative assessment: Closing performance gaps in medical education. Academic Emergency Medicine, 15(11), 1010–1016. https://doi.org/10.1111/j.1553-2712.2008.00248.x

Rutherford, J. S., Flin, Irwin, & McFadyen, A. K. (2015). Evaluation of the prototype Anaesthetic Non-technical Skills for Anaesthetic Practitioners (ANTS-AP) system: A behavioural rating system to assess the non-technical skills used by staff assisting the anaesthetist. Anaesthesia, 70(8), 907–914. https://doi.org/10.1111/anae.13127

Savoldelli, G. L., Naik, V. N., Park, Joo, H. S., Chow, & Hamstra, S. J. (2006). Value of debriefing during simulated crisis management: Oral versus video-assisted oral feedback. Anesthesiology, 105(2), 279–285. https://doi.org/10.1097/00000542-200608000-00010

Schulz, C. M. (2016). Quality and safety in anesthesia and perioperative care. In Ruskin, K. J., Stiegler, M. J. I. P., & Rosenbaum, S. H. (Eds.), Quality and safety in anesthesia and perioperative care (pp. 98–113). Oxford University Press.

Schulz, C. M., Burden, Posner, K. L., Mincer, S. L., Steadman, Wagner, K. J., & Domino, K. B. (2018). Frequency and type of situational awareness errors contributing to death and brain damage - A closed claims analysis. Anesthesiology, 127(2), 326–337. https://doi.org/10.1097/ALN.0000000000001661

Schulz, C. M., Endsley, M. R., Kochs, E. F., Gelb, A. W., & Wagner, K. J. (2013). Situation awareness in anesthesia: Concept and research. Anesthesiology, 118(3), 729–742. https://doi.org/10.1097/ALN.0b013e318280a40f

Schulz, C. M., Krautheim, Hackemann, Kreuzer, Kochs, E. F., & Wagner, K. J. (2016). Situation awareness errors in anesthesia and critical care in 200 cases of a critical incident reporting system. BMC Anesthesiology, 16(1), 1–10. https://doi.org/10.1186/s12871-016-0172-7

Schulz, C. M., Schneider, Fritz, Vockeroth, Hapfelmeier, Wasmaier, Kochs, E. F., & Schneider (2011). Eye tracking for assessment of workload: A pilot study in an anaesthesia simulator environment. British Journal of Anaesthesia, 106(1), 44–50. https://doi.org/10.1093/bja/aeq307

Shelton, C. L., Kinston, Molyneux, A. J., & Ambrose, L. J. (2013). Real-time situation awareness assessment in critical illness management: Adapting the situation present assessment method to clinical simulation. BMJ Quality and Safety, 22(2), 163–167. https://doi.org/10.1136/bmjqs-2012-000932

Stanton, N. A., Salmon, P. M., Rafferty, L. A., Walker, G. H., Baber, & Jenkins, D. P. (2013). Human factors methods: A practical guide for engineering and design (2nd ed.). CRC Press. https://doi.org/10.1201/9781315587394

Taylor, R. M. (2017). Situational awareness rating technique (SART): The development of a tool for aircrew systems design. In Situational awareness (pp. 111–128). Routledge.

Walshe, N. C., Crowley, C. M., O’Brien, Browne, J. P., & Hegarty, J. M. (2019). Educational interventions to enhance situation awareness: A systematic review and meta-analysis. Simulation in Healthcare, 14(6), 398–408. https://doi.org/10.1097/SIH.0000000000000376

Wright, M. C., Taekman, J. M., & Endsley, M. R. (2004). Objective measures of situation awareness in a simulated medical environment. Quality and Safety in Health Care, 13(Suppl 1), 65–71. https://doi.org/10.1136/qshc.2004.009951

Situation Awareness Assessment for Anesthesia Residents (SAAAR): Development and Preliminary Evaluation of a Multimodal System
