Abstract
Firstly, I comment on the lack of support for the predictions of the lumberjack model when tested with professionally qualified operators in high-fidelity work simulations (Jamieson & Skraaning, 2020a). I highlight the advantages that Bayesian statistics provide for quantifying the degree of evidence for null hypotheses, issues concerning situation awareness measurement, and the alternative techniques available for studying experts. Secondly, I comment on the innovative taxonomy of automation failure presented by Skraaning and Jamieson (2024), pointing out some issues with overlapping definitions and a lack of cause-effect relationships. I then discuss the substantial opportunity this taxonomy presents to guide future research, such as the design of transparent automation. To conclude, I identify some other key problems regarding how we currently study human-automation teaming (e.g. presenting randomized automation failures unlinked to task context) and invite discussion from the research community on the relevance of computational modelling to this field of research.
Introductory Statements
Automated systems have profoundly improved safety and productivity in domains such as healthcare, transportation, finance, aviation, and defence (National Academies of Sciences Engineering & Medicine, 2022). Given rapid progress in AI, it is increasingly critical to improve human-automation/AI design. Most, if not all, industrial contexts which support economic growth and keep society safe utilize some form of automation/AI.
The concept of degree of automation (DOA; Wickens et al., 2010) describes the level of responsibility of automation (Sheridan & Verplank, 1978) across four stages of information processing (Parasuraman et al., 2000): information acquisition, information analysis, decision recommendation, and action execution. The combination of higher levels and later processing stages constitutes higher DOA. A meta-analysis reported that as DOA increased, workload decreased and performance improved, but situation awareness (SA) and performance in responding to automation failures degraded, supporting predictions of the lumberjack model (Onnasch et al., 2014).
In this paper, I provide opinions on two recent debates. The first is the debate regarding the extent to which DOA effects have replicated, or should be expected to replicate, from laboratory studies with naïve participants, which largely constitute the Onnasch et al. (2014) meta-analysis, to professionally qualified operators in high-fidelity work simulations (Jamieson & Skraaning, 2018; Wickens, 2018). Jamieson and Skraaning (2020a) reported what they viewed as weak evidence supporting predictions from the lumberjack model in the field. While acknowledging (as I do) the value of testing the ecological predictive validity of the lumberjack model, Wickens et al. (2020) critiqued Jamieson and Skraaning (2020a) on the basis of low statistical power, measurement issues, and their perceived downplaying of expert subjective experience (see the response by Jamieson & Skraaning, 2020b).
Another issue raised by Wickens et al. (2020) concerned the type of automation examined by Jamieson and Skraaning (2020a). In reply, Skraaning and Jamieson (2024) offered what I view as an innovative and thought-provoking initial taxonomy of automation failure. Skraaning and Jamieson (2024) highlight that much automation relies on an underlying system (sensors, equipment, functions, and logic) that can fail and disrupt automated system functions even while the automation performs, and is used by the human, as designed/trained (citing several airline incidents/accidents). I agree with Skraaning and Jamieson (2024) that our current definition of automation failure is narrow, and my second focus is to comment on the opportunity this initial taxonomy of automation failure presents to guide future research.
As an active researcher in this research space, I have found these debates thought-provoking and progressive and have learnt much. I thank the aforementioned authors for their innovative exchanges and have passed them on to my graduate students and post-docs.
Failure to Replicate the Lumberjack Model
Why not Use Bayesian Statistics?
It is important not to generalize broadly from initial failures to replicate the lumberjack model, given that the evidence base appears to comprise only a handful of studies of experts completing automation-aided high-fidelity tasks (e.g. Calhoun et al., 2009; Jamieson & Skraaning, 2020a; Metzger & Parasuraman, 2005).
The applied studies under the microscope (and, admittedly, the laboratory studies included in Onnasch et al., 2014, and much of my own research) have used frequentist statistics. Wickens et al. (2020) argue that the null effects reported by Jamieson and Skraaning (2020a) neither confirm nor disprove the lumberjack model, and Jamieson and Skraaning (2020b) replied that they see no logical reason for dismissing null findings as uninformative, particularly because they suspect that lumberjack effect sizes outside the laboratory are trivially small.
In one sense, I see it as a judgement call what can be interpreted from a null hypothesis test, and there are various methods to aid interpretation (see Cumming, 2012). However, the best way forward is to use Bayesian statistics in future research, and also to apply them retrospectively to the field studies with experts included in the Onnasch et al. (2014) meta-analysis and in Jamieson and Skraaning (2020a). I understand that in Human Factors, Bayesian statistics are relatively novel (as they were for me until I was educated by my graduate students/post-docs). Bayes factors (BFs) can be interpreted as the strength of evidence for one hypothesis over another (thus, including the null) (Vandekerckhove et al., 2018). The key advantage is the ability to quantify evidence for the null hypothesis (as opposed to frequentist statistics merely failing to reject it on the basis of a p-value point estimate; Wagenmakers et al., 2018). Bayesian statistics also provide graded benchmarks for expressing the strength of evidence for or against a hypothesis (e.g. BF>3 as ‘weak evidence’, BF>5 as ‘moderate evidence’, BF>10 as ‘strong evidence’, and BF>100 as ‘very strong evidence’; Etz & Vandekerckhove, 2016), avoiding the use of a single arbitrary cut-off (p < .05) for hypothesis testing.
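To make the logic concrete, below is a minimal sketch (my own illustration, not drawn from any of the studies cited above) of how a Bayes factor favouring the null can be approximated from Bayesian Information Criterion (BIC) values for a two-group comparison; the data, group labels, and effect size are hypothetical.

```python
# Minimal illustration (hypothetical data): approximating a Bayes factor for a
# two-group comparison via the standard BIC approximation,
# BF01 ~ exp((BIC_alternative - BIC_null) / 2).
import numpy as np

def bic_bayes_factor_01(group_a, group_b):
    """Return an approximate BF01 (evidence for the null of equal group means)."""
    y = np.concatenate([group_a, group_b])
    n = y.size

    # Null model: one grand mean (2 free parameters: mean, variance).
    rss_null = np.sum((y - y.mean()) ** 2)
    bic_null = n * np.log(rss_null / n) + 2 * np.log(n)

    # Alternative model: separate group means (3 parameters: two means, variance).
    rss_alt = (np.sum((group_a - group_a.mean()) ** 2)
               + np.sum((group_b - group_b.mean()) ** 2))
    bic_alt = n * np.log(rss_alt / n) + 3 * np.log(n)

    # Larger BF01 = more evidence for the null relative to the alternative.
    return np.exp((bic_alt - bic_null) / 2)

# Hypothetical automation-failure response times under low vs high DOA.
rng = np.random.default_rng(1)
low_doa = rng.normal(10.0, 2.0, size=40)
high_doa = rng.normal(10.2, 2.0, size=40)
print(f"Approximate BF01 = {bic_bayes_factor_01(low_doa, high_doa):.2f}")
```

In practice, tools such as JASP or the BayesFactor R package compute default-prior Bayes factors directly; the BIC shortcut above is simply intended to illustrate how evidence for the null, rather than a mere failure to reject it, can be quantified.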
Measuring the Situation Awareness Underlying Automation Monitoring
Jamieson and Skraaning (2020a) reported improved SA with increasing DOA (Important Parameter Assessment Questionnaire; IPAQ). Admittedly, there is no universally accepted theory/measure of SA (Pritchett, 2015). SA broadly refers to an individual’s or team’s understanding of the relevant elements of their task(s) and how these elements might change through environmental conditions or interactions with operator control actions. There is ongoing debate regarding the extent to which SA is a state of conscious, reportable knowledge (Endsley, 1995a, 2021) that can be measured by pausing the task and blanking information displays during a scenario (Situational Awareness Global Assessment Technique; Endsley, 1995b), or whether SA constitutes knowledge of where to find relevant task information through interactions with displays and/or team members (situated SA/distributed SA; Chiappe et al., 2012; Stanton et al., 2015), and thus whether SA is better measured without pausing/blanking displays (e.g. Situation Present Assessment Method; Durso & Dattel, 2004). The IPAQ used by Jamieson and Skraaning (2020a) did not follow the theoretical underpinnings of either, asking participants to rate whether process parameters were important or not (i.e. a dichotomous forced choice) after completion of the scenarios. The SA queries were developed by SMEs (reflecting an SA requirements analysis; Endsley & Jones, 2012), but given the dichotomous response scale and the timing of administration, it is questionable whether rating the importance of process parameters after scenario completion reflected the real-time SA required to monitor automation. I therefore agree with Wickens et al. (2020) that the IPAQ is unlikely to have measured real-time SA of the dynamically changing values of specific process parameters during the period leading up to and during automation failure (i.e. SA of current and predicted future system parameters). Of course, it is often difficult, and not well received, to pause a high-fidelity simulation or field exercise, or to intrude with on-line SA queries (Loft et al., 2015; Pierce, 2012), so in that sense I understand the SA measurement choice Jamieson and Skraaning (2020a) made.
Alternatives to Assessing Experts in High Fidelity Task Contexts
Wickens et al. (2020) noted that experts in Jamieson and Skraaning (2020a) reported decreased human-automation cooperation and out-of-the-loop performance issues with increased DOA. These effects were relatively strong, and industry reports often point to lumberjack model–related variables as contributors to workplace incidents/accidents (cf. Wickens et al., 2020). Although subjective reports can lack validity (Matthews et al., 2020), I agree with Wickens et al. (2020) that these expert opinions should be given weight, despite no change in objective performance. Experts learn to adapt to dynamic conditions, concurrent task demands, time pressure, and tactical constraints (Loft et al., 2009; Sheridan, 2002). Workload, SA, and performance are intricately related (Loft et al., 2023), but workload is not something imposed upon a passive operator; rather, it is managed through dynamic choice of work method (task prioritization, satisficing, task shedding, etc.; Gray & Fu, 2004; Loft et al., 2007; Simon, 1956; Sperandio, 1971). Choice of work method depends on metacognitive knowledge (i.e. the monitoring and control of cognition; Efklides, 2008). It is thus a critical finding that experts in Jamieson and Skraaning (2020a) held negative perceptions of increased DOA, because it indicates that the experts held the metacognitive knowledge that an increase in task demand or an unexpected event could exceed their capacity (the ‘red zone’ of workload; Strickland et al., 2019; Wickens et al., 2015), and thus could be problematic for managing automation. The overarching point is that understanding expert performance requires converging evidence from several different types of research methods (Dismukes, 2010), and while there is no immediate solution to transferring knowledge from the laboratory to the field (Loft, 2014; Stokes, 1997), techniques such as ethnographic observation, self-report, diary studies, and accident/incident reports can be very insightful.
The Skraaning and Jamieson (2024) Taxonomy of Automation Failure
Skraaning and Jamieson (2024) made the astute observation that current definitions of automation failure are either too narrow or too broad, and their initial taxonomy defined three types of automation failure. I commend Skraaning and Jamieson (2024) for their innovation (i.e. thinking outside the box). Researchers prepared to do this are extremely valuable for making larger than incremental scientific progress.
Skraaning and Jamieson (2024) contend that Elementary Automation Failures arise from isolated failures of components or functions localized to the automation (e.g. failures in automation control logic, programming errors, malfunctioning hardware, and loss of power) that lead to the unexplained loss of automation capability. I agree with Skraaning and Jamieson (2024) that the literature is replete with examples of Elementary Automation Failures. For example, in my own work with colleagues using simulated air traffic control, aircraft conflict detection automation fails to detect some conflicts (i.e. aircraft that will violate minimum separation in the future), but from the participants’ perspective there is no apparent underlying reason why the automation failed to detect a particular aircraft conflict (e.g. Gegoff et al., 2024), and the same can be said of other tasks we use, such as maritime surveillance (e.g. Hutchinson et al., 2023) and submarine track management (e.g. Tatasciore et al., 2020).
Skraaning and Jamieson (2024) introduce a common, but understudied, form of automation failure referred to as Systemic Automation Failures, which results from situationally triggered failures of integrated functions that support automation. A prime example Skraaning and Jamieson (2024) focus on is where sensors feed incomplete/incorrect information to automation and falsely trigger an automated function, or provide other forms of invalid data that create confusion from which operators are unable to recover (e.g. B737 MAX, Turkish Airlines Flight 1951). Other examples of Systemic Automation Failures include parallel automated systems performing in contradictory ways or the control/decision logic of automation containing latent (hidden) problems that cause automation failure. I agree with Skraaning and Jamieson (2024) that these forms of automation failure need to be distinguished from Elementary Automation Failures, which reflect the unexplained (non-identifiable) failure of automation controlling a single well-defined function.
Skraaning and Jamieson (2024) refer to a third category of automation failure as Human-Automation Interaction Breakdowns, which reflect non-alignment between the design of automation and human capabilities (e.g. concealed operation modes, misleading decision support, and automation presenting unrealistic capabilities to the human). Interestingly, and highly related to my musings in the paragraphs that follow, Skraaning and Jamieson (2024) also refer to Human-Automation Interaction Breakdowns that result from automation being unreliable and the underlying logic/workings of automation being unavailable to operators.
Definition Overlaps Across the Taxonomy Categories: Unavoidable?
Skraaning and Jamieson (2024) were prudent to point out that they present an initial taxonomy of automation failure and are open to discussion/further development. A current issue, in my opinion, is the degree of, and at times inconsistent, overlap between the definitions of the three types of automation failure (see Figure 1, p. 7; Skraaning & Jamieson, 2024). These overlaps were possibly unavoidable but are nonetheless noteworthy. To highlight an example, Elementary Automation Failures are largely referred to as resulting from failures in automation control, logic/programming, or malfunctioning hardware producing degraded/inaccurate output. What is the difference between that and the examples provided for Human-Automation Interaction Breakdowns resulting from automation providing misleading support to operators (low-reliability automation)? Skraaning and Jamieson (2024) refer to Systemic Automation Failures being caused by automation working as intended but not conveying its limitations, or by the automation having ‘hidden’ logic issues. What is the difference between that and the examples provided for Human-Automation Interaction Breakdowns caused by hidden modes of operation, failure modes that are not recognizable, or automation goals and capabilities being inaccessible to operators? Any one of such factors could contribute to an incident/accident (and many cross-relate), but the initial Skraaning and Jamieson (2024) taxonomy does not speak to how operators would be expected to respond differently to these categories of automation failure, which is ultimately the understanding required for practitioners to design work interventions. A potential variation to the initial taxonomy could be to more cleanly delineate the precipitating events (e.g. faulty sensor inputs, workload/fatigue, and environmental conditions) that cause Elementary Automation Failures and Systemic Automation Failures, and then describe how the design of automation (e.g. display transparency, reliability, mode salience, DOA, and communication) and organizational-level factors (e.g. training and safety culture) could moderate how operators cognitively and behaviourally respond (e.g. loss of SA, attention tunnelling, mode confusion, and false expectations).
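To illustrate how such a delineation might be operationalized for incident coding or study design, the sketch below is my own hypothetical illustration, not part of the Skraaning and Jamieson (2024) taxonomy; the field names and example values are assumptions for exposition only.

```python
# Hypothetical sketch: one way to encode the delineation suggested above
# (precipitating events -> failure category, with design/organizational moderators
# and operator responses recorded separately). All field names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AutomationFailureCase:
    precipitating_events: List[str]      # e.g. "faulty sensor input", "loss of power"
    failure_category: str                # e.g. "elementary" or "systemic"
    design_moderators: List[str] = field(default_factory=list)    # e.g. "low transparency"
    org_moderators: List[str] = field(default_factory=list)       # e.g. "limited training"
    operator_responses: List[str] = field(default_factory=list)   # e.g. "mode confusion"

case = AutomationFailureCase(
    precipitating_events=["faulty sensor input"],
    failure_category="systemic",
    design_moderators=["low display transparency", "high DOA"],
    org_moderators=["limited failure-mode training"],
    operator_responses=["mode confusion", "delayed intervention"],
)
print(case)
```

Coding incidents and experimental scenarios in some such structured form would make it easier to test whether the different failure categories are in fact associated with different operator responses.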
Further Issues Common to Understanding Automation Failure
Inspired by the innovation of Skraaning and Jamieson (2024), I believe there are other core issues in the manner in which we study human-automation teaming. Skraaning and Jamieson (2024) identify Elementary Automation Failures as the category on which researchers have focused almost exclusively (i.e. presenting unexplained losses of automatic function). Indeed, Systemic Automation Failures have more potentially identifiable underlying causes and patterns of occurrence. Nonetheless, a major limitation of human-automation research (including my own) is that participants are typically exposed to fixed quotas of randomized automation failures, allowing little opportunity to develop an understanding of the automation they are using and limiting their capacity to predict when automation failures might occur. In most studies, the only learnable context for system reliability is the frequency of automation failure. This contrasts sharply with my observations in aviation and defence field settings, in which automation reliability is dynamic/context-driven, allowing a nuanced human understanding of automation capabilities and limitations that enables prediction of when intervention is required. Indeed, trust calibration becomes increasingly sophisticated with expertise. Trust is multifactorial and affected by a range of operator characteristics, contexts, and automation characteristics, but perceived reliability is a major driver (Hoff & Bashir, 2015). The Human-Automation Trust Expectation Model (HATEM) recently published by Carter et al. (2024) asserts that trust in automation becomes increasingly calibrated over time through human understanding of automation reliability. Trust thus, at least partly, reflects the difference (closeness) between experienced automation reliability and expected reliability (prediction error). Observed automation performance is evaluated against expectation and either increases or decreases confidence in future predictions, thereby refining understanding. We need to keep this in mind, regardless of the form of automation failure taxonomy on which consensus is formed.
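The prediction-error idea can be illustrated with a simple delta-rule update. The sketch below is my own generic illustration rather than the HATEM itself, and the learning rate, starting expectation, and outcome history are hypothetical.

```python
# Generic illustration (not the HATEM itself): a delta-rule sketch of the idea that
# expected automation reliability is refined by prediction error against experience.
def update_expected_reliability(expected, observed_success, learning_rate=0.1):
    """Nudge expected reliability toward each observed outcome (1 = correct, 0 = failure)."""
    prediction_error = observed_success - expected
    return expected + learning_rate * prediction_error

expected = 0.5                           # naive starting expectation
outcomes = [1, 1, 1, 0, 1, 1, 1, 1]      # hypothetical automation performance history
for outcome in outcomes:
    expected = update_expected_reliability(expected, outcome)
print(f"Calibrated expectation after experience: {expected:.2f}")
```

The point of the illustration is simply that expectations can only become calibrated when operators are given reliability experience that has learnable structure, which fixed quotas of randomized failures do not provide.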
Also common to any taxonomy of automation failure (but admittedly somewhat contradictory to my previous paragraph) is the fact (first recognized by Molloy & Parasuraman, 1996) that studies typically (including Jamieson & Skraaning, 2020a) present multiple automation failures within a single testing session, with only a handful of studies examining responses to a single automation failure (e.g. Bailey & Scerbo, 2007; Bowden et al., 2023, 2024; Metzger & Parasuraman, 2005). In the modern workplace, humans increasingly monitor near-perfect automated systems (Foroughi et al., 2023). While detection of a first automation failure is often poor, detection of subsequent failures improves (Merlo et al., 2000). We need research that evaluates the detection of rare automation failures.
As discussed earlier, Skraaning and Jamieson (2024) highlight the importance of automated system goals, modes, logic/rationale, and capabilities being available to operators. This is also critical for Elementary Automation Failures, in addition to the other two failure categories. For example, if there are identifiable failures in automation control logic or programming errors, they should be made transparent. Further, given the focus by Skraaning and Jamieson (2024) on Systemic Automation Failures using aviation examples (Qantas Flight 72, Turkish Airlines 1951) that resulted from incorrect sensor information being fed to automated cockpit systems, I was surprised that Skraaning and Jamieson (2024) did not mention the automation transparency literature. Automation transparency is intended to aid understanding of the rationale underlying automation. A leading model is the Situation Awareness Agent-Based Transparency model (Chen et al., 2014), which outlines three levels of transparency: the automation’s goals, purpose, and intentions (Level 1); the automation’s rationale underlying its information/advice (Level 2); and the automation’s projected future outcomes if the information/advice is followed, together with any associated uncertainty (Level 3). Reviews/meta-analyses indicate that automation transparency can improve SA and automation use, including recovery from automation failure (Bhaskara et al., 2020; Sargent et al., 2023; Van de Merwe et al., 2024). Yet at the same time, the types of automation failure in these transparency studies, such as those using uninhabited vehicle management/control tasks (e.g. Griffiths et al., 2024; Loft et al., 2023; Mercado et al., 2016; Stowers et al., 2020; Tatasciore & Loft, 2024), do not fit neatly into the Skraaning and Jamieson (2024) taxonomy. Automation failures (incorrect automated decision advice) in these studies stem from changes in military commander intent, changes in vehicle capability (e.g. payload, speed, and fuel reserve), and environmental constraints (e.g. fog, wind, and road blocks). The automation functions as intended and the sensors feeding the automation are not faulty.
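As a concrete illustration of how advice might be surfaced at the three transparency levels described above (Chen et al., 2014), the sketch below is my own hypothetical example for an uninhabited vehicle routing recommendation; the content, field names, and numbers are illustrative assumptions, not drawn from any cited study.

```python
# Hypothetical sketch: surfacing automated route advice at increasing transparency levels.
advice = {
    "recommendation": "Reroute vehicle via waypoint B",
    "level_1_goal": "Minimise transit time while avoiding restricted airspace",
    "level_2_rationale": "Current route blocked by reported fog; waypoint B adds 4 min",
    "level_3_projection": {"eta_min": 42, "fuel_margin_pct": 18, "uncertainty": "moderate"},
}

def render_advice(advice, transparency_level):
    """Return the advice text shown to the operator for a given transparency level (1-3)."""
    lines = [advice["recommendation"]]
    if transparency_level >= 1:
        lines.append("Goal: " + advice["level_1_goal"])
    if transparency_level >= 2:
        lines.append("Rationale: " + advice["level_2_rationale"])
    if transparency_level >= 3:
        p = advice["level_3_projection"]
        lines.append(f"Projection: ETA {p['eta_min']} min, fuel margin {p['fuel_margin_pct']}%"
                     f" ({p['uncertainty']} uncertainty)")
    return "\n".join(lines)

print(render_advice(advice, transparency_level=3))
```

In such tasks the “failure” is incorrect advice arising from changed commander intent, vehicle capability, or environmental constraints, which is precisely why it does not map cleanly onto the Skraaning and Jamieson (2024) categories.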
Conclusion
I conclude by thanking Jamieson, Skraaning, and Wickens (and their colleagues; e.g. Onnasch) for their innovative and enlightening discussion. It is obviously critical that the research we do has relevance to operational (field) settings, and we need to continue to think about how to improve the manner in which we do that. I hope I have at least made an incremental contribution in this paper with my opinions. I end by inviting the research community to comment on the relevance of a subset of my own recent work with colleagues regarding computational modelling of how humans integrate automated advice to make decisions (e.g. Strickland et al., 2021; Strickland et al., 2023; see the review by Boag et al., 2023), how humans learn to track variation in automation reliability (Strickland et al., 2024), and, more generally, our computational models of the human cognitive control and capacity mechanisms underlying workload management and multi-tasking in complex dynamic task environments (see the review by Boag et al., 2023). Does the research community see this work as having relevance to applied environments and to understanding automation failure, and if not, how could it be made more impactful from a practical-use perspective?
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by an Australian Research Council Future Fellowship (FT190100812) awarded to Loft.
Shayne Loft is Professor at the University of Western Australia and currently holds an Australian Research Council Future Fellowship (Research). He received his PhD in Experimental Psychology/Human Factors in 2004 from the University of Queensland. He has 119 refereed publications in Human Factors/Applied Cognitive Psychology.
