Abstract
To address the increasing need for efficient data analysis and decision-making in defense, the U.S. Navy is prioritizing AI/ML systems capable of handling multi-source data and suggesting courses of action. Historically, many such systems failed due to technical issues and a lack of usability or mission relevance. Research on human-AI collaboration aims to create AI systems that better integrate with frontline operators. A recent National Academies of Sciences, Engineering, and Medicine (NASEM) report outlined 57 research goals, but the U.S. Navy requires a more focused set of priorities. A workshop with 23 experts from various fields was held at the Naval Information Warfare Center Pacific, resulting in five key research priorities spanning different time frames. This panel discusses the workshop’s findings, highlighting critical questions going into the workshop and those coming out of it. Panel participants come from government, academia, and industry, providing unique perspectives on big questions for human-AI teaming.
Introduction
The U.S. Navy, along with the other branches of the U.S. Armed Forces, recognizes the potential of AI to aid warfighters in nearly every aspect of their mission. Systems that can ingest multi-source data, analyze and identify patterns, and then recommend courses of action can generate new insights and creative solutions to challenging problems.
Research into human-AI teaming is seen as a key enabler for the adoption of AI by the U.S. Navy Fleet, as many previous technologies have been adopted too slowly (or not at all) due to technical, usability, and supportability challenges. While the National Academies recently released a report detailing 57 research objectives to better align, support, and measure human-AI teams, the U.S. Navy desired additional scoping work to narrow these research directions down to meet three objectives:
(1) Identify specific units of work to support proposals, funding, and execution.
(2) Bin identified work units into near-, mid-, and far-term research priorities based on Navy need.
(3) Further align these priorities with time frames based on research feasibility and difficulty.
To support these goals, a workshop was held at the Naval Information Warfare Center (NIWC) Pacific, attended by 23 human factors professionals and computer scientists from academia, industry, and government, plus three active-duty sailors serving as real-world Navy subject matter experts (SMEs).
Results from the workshop were narrowed down to a set of five research priorities that span near-, mid-, and far-term investment time frames. The two near-term priorities were (1) developing human-AI team effectiveness metrics and (2) building human-AI team testbeds. The single mid-term priority, human-AI team task sharing, emerged during workshop discussion as the question of how best to allocate tasks between humans and AI, beyond traditional function allocation. The two far-term priorities focused on (1) developing AI awareness of human teammates and (2) establishing human-AI team development teams (i.e., multidisciplinary approaches for developing successful human-AI teams). These were binned as far-term due to the technical complexity and warfighter organizational challenges they present.
Ambiguity around how to prioritize research, including the existence of potentially competing strategies, contributed to a rich discussion among workshop participants. Major discussion points highlighted areas that would benefit from further strategic thinking and finer-grained prioritization, such as how to approach testbed development and human-AI-teaming metrics. Addressing these two near-term priorities would enable many human-AI teaming research activities but involves first defining what is being assessed (and why). This panel discussion will pick up where we left off and invite meeting participants to contribute their perspectives.
Panel Motivation
While this panel will provide a summary of the workshop findings, most of the discussion will present varied perspectives on the hard problems facing human-AI teaming research and AI adoption in the Fleet. Defense scientists, industry engineers, and academic researchers working in this area largely share the goal of establishing, measuring, and supporting effective human-AI teams. However, each of these organizations faces unique challenges in working in unison.
Each speaker will propose and present potential answers to “big questions” surrounding human-AI teaming and cover four multi-disciplinary topics: (1) the uniqueness of Navy-specific challenges where AI would be most beneficial; (2) human-systems integration and academic undertakings toward measurement of human-AI teams; (3) barriers and opportunities in bridging between human-AI teaming research and application; and (4) how to prepare a workforce for the new roles and responsibilities needed to maximize the benefits of AI.
This panel discussion will also invite the audience to engage with the posed “big questions,” as it is anticipated that many attendees will have some experience working in or thinking about human-AI teaming. The research roadmap that resulted from the workshop is nascent and malleable; therefore, leveraging the experience of others in the field, from seasoned experts to students with fresh perspectives, will help the Navy meet its unique human-AI teaming needs more quickly and deliver systems that maximize overall team mission performance.
“Big Questions” from the Panelists
Jason H. Wong, Naval Information Warfare Center Pacific
The critical question for integrating AI into the Navy is this: how can sailors, submariners, airmen, and Marines be encouraged to adopt these AI systems? There are many reasons warfighters fail to adopt new technology. In the early 2010s, a new optimization visualization integrated into the submarine combat system went ignored by many. One anecdotal cause was that course instructors (retired Chiefs) felt that because they did not need such a “fancy” tool, junior crew members did not need it either (Wong & Nguyen, 2012). More recent experience has shown that putting a poorly tested AI prototype in front of warfighters led to immediate distrust and low utilization that did not recover over 2 weeks of use and multiple software patches (NATO Science & Technology Organization [STO], in press). New AI systems must be not just accurate and resilient; they must meet the warfighter on their terms. The workshop-identified far-term research goal of establishing and developing human-AI teams is important here. Making a solid first impression is not easy, but it is critical for adoption. It is also a different task from developing and sustaining teaming relationships in the long run.
Another question pertinent to the Navy and other organizations dealing with multi-agent sociotechnical systems (multiple humans and machines) is: what factors cause an AI to be viewed as a teammate instead of a tool? Talk of human-AI teaming often implies that the AI is a teammate that can flexibly execute multiple functions independently toward a larger shared goal, in contrast to a tool that is specialized for specific functions and directly aligned with its user’s intent. For example, there is a strong push in the Navy for planning aids, and many of those are taking the form of a Course of Action (COA) recommender (Szatkowski et al., 2023). These COA tools optimize over many parameters and present an easily digestible plan. Some even allow for adjustments (e.g., “This platform has a maintenance issue, so it won’t be ready in time.”). Would sailors in a command center view this as a tool or as a teammate? This perception will drive all interaction and teaming behaviors. Research into human-AI task sharing (the mid-term goal identified in this workshop) will be critical for addressing this.
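To make the adjustment behavior described above concrete, the sketch below shows one minimal way a constraint-adjustable COA recommender could respond to an operator flagging a maintenance issue. The `Platform`, `CourseOfAction`, and `recommend` names are hypothetical illustrations, not any fielded system.

```python
from dataclasses import dataclass

@dataclass
class Platform:
    name: str
    available: bool = True  # operator can flag a maintenance issue

@dataclass
class CourseOfAction:
    description: str
    platforms: list[str]
    score: float  # utility assigned by the (hypothetical) optimizer

def recommend(coas: list[CourseOfAction],
              platforms: list[Platform]) -> list[CourseOfAction]:
    """Rank candidate COAs by score, dropping any COA that depends
    on a platform the operator has marked unavailable."""
    ready = {p.name for p in platforms if p.available}
    feasible = [c for c in coas if set(c.platforms) <= ready]
    return sorted(feasible, key=lambda c: c.score, reverse=True)

# Operator adjustment: "This platform has a maintenance issue."
platforms = [Platform("DDG-101"), Platform("UAV-7", available=False)]
coas = [
    CourseOfAction("Strike via UAV-7", ["UAV-7"], 0.92),
    CourseOfAction("Strike via DDG-101", ["DDG-101"], 0.85),
]
for coa in recommend(coas, platforms):
    print(f"{coa.score:.2f}  {coa.description}")
```

Notably, this interaction pattern is tool-like: the operator adjusts constraints and the system re-optimizes. A teammate, by contrast, might notice the maintenance issue itself or negotiate the task split.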
A third important question is: how can AI systems effectively communicate to be considered “one of the team”? Warfighters use and abide by unique vocabulary, semantics, syntax, and customs, and these help identify individuals as part of the team. Training AI models to incorporate jargon and acronyms (vocabulary) is an active area of research. Of equal importance is an understanding of syntax, as the Navy often has a specific litany in which orders are given and repeated or information is conveyed (e.g., in ship driving or submarine periscope operations). Differences in semantics are also likely between areas of operation (e.g., East and West Coast Marines each have their own quirks). Similarly, individual ships will have their own customs. Hierarchy is imperative in any operations center, but some run more loosely (encouraging discussion and forceful backup) and others more strictly. To form a human-AI team, warfighters will need to adopt the AI as one of their own, or at least as one that can quickly adapt to their customs. As human-AI teaming testbeds are built, one- or low-shot learning (i.e., rapid adaptation from a few examples) must be built into the models, just as it is expected of human teammates.
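As one illustration of what low-shot adaptation to a crew’s litany could look like, the sketch below adapts a language model’s report phrasing through few-shot prompting. The sample reports and the `call_llm` stub are invented placeholders, not an implemented Navy capability.

```python
# Minimal sketch: a handful of crew-specific examples are placed in
# the prompt so a language model mimics the local report litany.
# The sample reports below are invented for illustration only.
CREW_EXAMPLES = [
    "Conn, Sonar: new contact bearing zero-four-five, designate Sierra-1.",
    "Conn, Sonar: Sierra-1 classified merchant, bearing drifting left.",
]

def build_prompt(observation: str) -> str:
    shots = "\n".join(f"Example report: {ex}" for ex in CREW_EXAMPLES)
    return ("Report the observation using the same phrasing "
            f"conventions as the examples.\n{shots}\n"
            f"Observation: {observation}\nReport:")

def call_llm(prompt: str) -> str:
    # Stand-in for any LLM inference API; not a real call.
    raise NotImplementedError

print(build_prompt("unknown surface contact bearing one-two-zero"))
```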
A final question for consideration is: how does AI represent uncertainty and allow for source data exploration? Especially during active conflict, communications are expected to be minimal or nonexistent, leading to high (or infinite) latency and stale data, which may degrade AI accuracy. Warfighters understand this uncertainty, and they are accustomed to communicating it. AI models are famously regarded as “black boxes,” but warfighters will often demand the ability to analyze the source data the system used to generate a solution. Explainable AI encompasses this issue, but specifically, the Navy will require AI to “show its work.” It could be argued that the pace of warfare will be so fast that there will be no time to explore, but this could result in the operator either over-relying on the AI or ignoring it entirely, and neither is a desirable outcome.
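A hypothetical sketch of how a recommendation might “show its work” appears below: each output carries a confidence estimate and traceable source reports, including their staleness under degraded communications, that the operator can drill into. All field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class SourceReport:
    sensor: str          # originating source (illustrative)
    age_minutes: float   # staleness under degraded communications

@dataclass
class Recommendation:
    action: str
    confidence: float            # calibrated probability in [0, 1]
    sources: list[SourceReport]  # inputs the operator can inspect

    def summarize(self) -> str:
        stale = max(s.age_minutes for s in self.sources)
        return (f"{self.action} (confidence {self.confidence:.0%}; "
                f"oldest supporting report {stale:.0f} min old)")

rec = Recommendation(
    action="Reposition escort to sector 7",
    confidence=0.62,
    sources=[SourceReport("ESM", 12.0), SourceReport("HF intel", 95.0)],
)
print(rec.summarize())  # the operator can also drill into rec.sources
```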
Robert S. Gutzwiller, Arizona State University
The U.S. Navy has showcased its desire for unmanned systems (UxS) to be operated by a distributed group of controllers and commands, as recently demonstrated in exercises such as RIMPAC 2022 and Integrated Battle Problem 23.1. For example, recent sinking exercises are now testing distributed integration of data and intelligence across UxS platforms and other assets to create an optimal firing solution (SINKEX at RIMPAC; U.S. Navy, 2022). It is likely that AI and automation will be incorporated in this and many other situations of distributed control. The desire for distributed control creates a large gap in understanding of how AI-controlled or AI-managed UxS can be teamed across the numerous commands and roles in the military. While distributed command and control (C2) is a familiar concept (once labeled “network-centric warfare”), increasing AI integration will likely help realize its benefits by taking on some of the burden of coordination. UxS are an obvious starting point. As multifaceted assets with many different capabilities, UxS serve many different mission purposes. The goals of those who exercise C2 over these systems may or may not be aligned across the various command and mission levels (one can imagine trying to control a UAV to gain surveillance in one area of the battlefield, only to find out that it is needed in another, conflicting area to provide a firing solution).
One part of this problem is technical, as with the SINKEX example: figuring out how to literally share information across platforms and coordinate who fires what, from where, against the target in a tactically sound manner. The other part is the Human Systems Integration (HSI) challenge of ensuring that, in this complex distributed situation, human and AI elements can align their goals in a maximally productive way, and with as little confusion as possible. The HSI test and evaluation methodology for such a case of distributed C2 is not obvious. The need to expand and improve HSI methodologies poses a true challenge to the improvement of HAT and will need to leverage new areas of research to evaluate systems in the field when the time comes. This requires a broadening of scope in HSI methods and improvements in researchers’ ability to study and evaluate more dynamic shifts in authority and C2.
Challenge 1: The broader scope of HAT configurations will challenge current methods of evaluation by expanding them to much broader pairings of people, AI, and tasks. Whereas there has been continual research advancement in smaller-scale human-automation interaction measurement (e.g., measuring situation awareness, over- or under-reliance in various capabilities, and span-of-control issues), these methods are used under strict definitions of the “who” (humans, AI) and “what” (robots, UxS). Even in real-world cases where autonomous systems are introduced and the range of human roles is expanded (e.g., Ho et al.’s [2017] AutoGCAS work), the interactions with AI that need to be evaluated generally boil down, in the moment of use, to a few people and a specific system tied to a specific event. One set of solutions could be to emphasize the expansion to team-based HAT metrics; a minimal sketch of one such measure follows this paragraph. This also requires improvements to research platforms and paradigms, but that should be achievable (and was identified as a near-term goal from the Navy HAT Roadmap workshop). Finally, researchers (especially academics) will be called on to understand far more about the mission, goals, and capabilities of platforms and their controllers, including the basics of military C2 hierarchies and rules of engagement. Violating any of these elements while building a system will reduce the warfighter’s ability to know what is going on, reduce the ability to ensure commander’s intent is being followed, and reduce technology adoption.
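As a minimal sketch of what one such reliance measure could compute (the trial-log format here is an assumption, not an established instrument), consider splitting operator compliance by whether the aid was correct:

```python
def reliance_rates(trials: list[dict]) -> dict:
    """Each trial records whether the aid was correct and whether the
    operator followed it. Following the aid when it was wrong suggests
    over-reliance; rejecting it when it was right, under-reliance."""
    def follow_rate(subset):
        return (sum(t["followed"] for t in subset) / len(subset)
                if subset else float("nan"))
    right = [t for t in trials if t["aid_correct"]]
    wrong = [t for t in trials if not t["aid_correct"]]
    return {"followed_when_right": follow_rate(right),
            "followed_when_wrong": follow_rate(wrong)}

log = [{"aid_correct": True,  "followed": True},
       {"aid_correct": True,  "followed": False},
       {"aid_correct": False, "followed": True},
       {"aid_correct": False, "followed": False}]
print(reliance_rates(log))
```

Extending such dyad-level measures to full HAT configurations, with multiple humans, multiple AI agents, and shifting tasks, is precisely the open methodological problem.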
Challenge 2: AI will create dynamic shifts in authority and C2 as part of teaming. AI is expected to influence (disrupt? improve?) decisions and actions in multiple OODA loops across commands. For example, AI could suggest a change in decision-making command to improve operational efficiency. This could include a change to which humans (or human-AI combinations) are in charge, or other authority shifts based on operator/team skill sets, proximity, or experience. These shifts would normally fall under function allocation challenges, but when these AI systems incorporate machine learning or GPT-style models, function allocation schemas may be more difficult to engineer ahead of time. Testing function allocation schemes is itself only just emerging and maturing (Pritchett et al., 2013; Roth et al., 2019) and could be productively applied in HSI test and evaluation contexts, but it will need to be understood by practitioners first.
Maia B. Cook, Pacific Science & Engineering Group, Inc.
For human factors practitioners working at the intersection of human-AI research and AI applications for Naval use, bridging between research and application is critical for delivering AI capabilities that can be safely and effectively used by warfighters. Multi-disciplinary stakeholder teams of program managers, systems engineers, software developers, AI developers, representative end users, and human factors practitioners are supporting science and technology (S&T) and acquisition programs that are developing and fielding AI capabilities, defining software interfaces for AI services, and managing and transitioning AI capabilities. In these programs, AI offers transformative capabilities across a spectrum of classic human-system activities of information acquisition and analysis, decision making, and action execution (Parasuraman et al., 2000), all with numerous Naval applications. And while aspects of these AI capabilities and their applications may be novel, the core human factors aspects of designing for human-AI teaming are well established from decades of human-automation interaction research dating back to at least the 1950s (e.g., Fitts, 1951) and from extensive research on human-AI teaming (e.g., National Academies of Sciences, Engineering, and Medicine [NASEM], 2022).
With such promise from AI capabilities combined with extensive literatures on the human factors of automation and AI, the Naval community is well positioned to develop and field AI capabilities that safely and effectively team with humans. However, taking a broad view across S&T and acquisition programs, shortfalls are starting to emerge in bridging between human-AI teaming research and application and in applying HSI methods in the design of human-AI systems (NASEM, 2022). If left unchecked, these shortfalls will compromise the safety and effectiveness of fielded AI capabilities—and slow progress in developing and fielding AI. Two emerging barriers to human-AI applications are described next, along with questions to stimulate ideas for targeting these barriers.
Barrier 1: Flawed beliefs about human-AI relationships. Stakeholders outside the human factors community often proclaim that AI will, as a rule, reduce and simplify human work and improve outcomes (Cook, 2024). Such proclamations imply that these desirable effects and outcomes will innately be achieved by AI. They overlook that automation and AI transform—and sometimes complicate—human work (Bainbridge, 1983; Endsley, 2023). Additionally, they overlook the potential for negative outcomes from teaming automation and AI with humans in safety-critical systems (e.g., Cummings, 2004; Hawley, 2007; NTSB, 2020). How can we correct program managers’ and stakeholders’ beliefs about human-AI relationships to more productively shape technical approaches in AI capability programs?
Barrier 2: Flawed approaches to human-AI system design. Technology-focused stakeholders often promote an approach that develops AI and then gives it to users to see what they do with it. These stakeholders tend to be aware of some human-AI teaming constructs (e.g., trust and explainability), but tend to be unaware that a wealth of expertise exists about the human use of automation and AI for these constructs (e.g., NASEM, 2022). Further, they tend to be unfamiliar with involving HSI practitioners who actively identify and design for user needs early and throughout AI development (OUSD [R&E], 2022). Moreover, they have limited awareness of the potential impact of human-automation/human-AI system design decisions on effectiveness and safety outcomes (e.g., Hawley, 2007; NTSB, 2020). In practice, technology stakeholders are often making human-AI system design decisions (e.g., about human-AI function allocation, decision authorities, etc.) without involving skilled human factors practitioners who possess the needed expertise. Of particular concern, this emerging pattern in AI development parallels prior technology-centric automation approaches implicated in accidents and lost lives (e.g., Hawley, 2017). How can we effect a paradigm shift towards mainstreaming HSI in AI development to prioritize user needs and avoid prior pitfalls?
Corey K. Fallon, Pacific Northwest National Laboratory
AI brings new capabilities to the Navy, such as rapid, complex pattern matching and advanced decision support. The introduction of these capabilities is expected to reduce operator workload, increase efficiency, and free up time for more advanced problem solving. However, integrating AI tools into operational environments presents challenges because these tools differ from traditional technology in fundamental ways. For example:
AI tools may not generalize to novel data.
AI tool predictions and recommendations may change as the tool learns.
AI tools may require feedback from the user to support their learning.
AI tool processes may be more complex and less transparent than traditional systems.
Organizations must prepare operators for these challenges; otherwise, the lack of preparation for these new classes of tools could offset anticipated benefits such as reduced workload. The Human-AI Teaming workshop and subsequent Navy-focused report provide guidance for Human Systems Integration (HSI) and Human Factors researchers to address these challenges. The following three research areas expand on several of the priorities highlighted in the report and are important for preparing the workforce for the introduction of AI.
Defining New Roles and Core Competencies
As stated in the HAT Research Priorities for the Navy report, it is important for the Navy to incorporate HSI best practices early in the development of AI tools. Job Analyses will be important to identify the specific tasks associated with monitoring the tool, predicting its errors, and accelerating the tool’s learning through feedback (Brannick et al., 2007). In addition to task identification, the Job Analysis should include a method for identifying the Knowledge, Skills, and Abilities (KSAs) necessary for completing each task. Once complete, the set of tasks and associated KSAs can be used to inform standard operating procedures and training.
Training
The workshop and subsequent report also highlight the importance of training. This training should go beyond buttonology and additionally focus on helping users develop a mental model of the tool’s performance boundaries (i.e., under what conditions is the tool likely to be accurate, and under what conditions is it likely to fail?). Consistent with the research priorities emphasized in the report, this type of training can help manage warfighter expectations of the AI. As an example, PNNL developed a method for training users on an AI image classifier’s strengths and limitations (Fallon & Yin, 2023). A small group of people were presented with a curated subset of the model’s training data that allowed them to inspect instances where the model correctly classified an image as well as instances where it misclassified one. The individuals who received the training were able to learn the visual characteristics that confused the model, as well as those the model appeared to cue in on correctly. After receiving this training, individuals could correctly predict how the model would perform.
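The sketch below shows one step such a training method might involve: balanced sampling of correct and incorrect classifications for trainees to inspect. It is a loose illustration under assumed data structures, not the published PNNL procedure.

```python
import random

def curate_familiarization_set(examples, k_per_cell=5, seed=0):
    """Sample a balanced mix of the model's correct and incorrect
    outputs so trainees can study where the classifier succeeds and
    where it fails. `examples` holds (image_id, truth, prediction)."""
    hits = [e for e in examples if e[1] == e[2]]
    misses = [e for e in examples if e[1] != e[2]]
    rng = random.Random(seed)
    return (rng.sample(hits, min(k_per_cell, len(hits))) +
            rng.sample(misses, min(k_per_cell, len(misses))))

data = [("img01", "ship", "ship"), ("img02", "ship", "buoy"),
        ("img03", "buoy", "buoy"), ("img04", "buoy", "ship")]
for image_id, truth, pred in curate_familiarization_set(data, 2):
    print(image_id, "truth:", truth, "model:", pred)
```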
Support Feedback
The HAT Research Priorities for the Navy report highlights the importance of task allocation between humans and AI. One of the tasks allocated to the human in this pairing is providing corrective feedback to improve AI performance (Wenskovitch et al., 2021). End users have first-hand experience observing the AI’s performance, including its errors. Users also have a deep understanding of their operational environment and can identify subtleties in this environment that may impact the AI’s performance. To capitalize on end user feedback, the process for generating and sharing corrective feedback should be clearly defined and supported by technology to reduce the reporting burden. If feedback must be sent to the developers to improve the model, incident repositories (Cummings, 2023) and feedback pipelines will be needed for storing and efficiently transferring this information. In the case of interactive ML, the tool may receive corrective feedback directly from the end user.
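A minimal sketch of what a structured feedback record and local incident repository could look like is shown below; the fields and file format are assumptions for illustration, not a deployed pipeline.

```python
import json
import time
from pathlib import Path

REPOSITORY = Path("incident_repository.jsonl")  # hypothetical local store

def log_correction(tool_id: str, model_input: str, model_output: str,
                   user_correction: str, context: str = "") -> None:
    """Append one corrective-feedback record; a feedback pipeline could
    later batch these records back to the model developers."""
    record = {
        "timestamp": time.time(),
        "tool_id": tool_id,
        "input": model_input,
        "model_output": model_output,
        "user_correction": user_correction,
        "operational_context": context,  # subtleties only the user sees
    }
    with REPOSITORY.open("a") as f:
        f.write(json.dumps(record) + "\n")

log_correction("classifier-v2", "img-4471", "buoy", "ship",
               context="heavy fog; degraded EO sensor")
```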
Panel Summary
Erin K. Chiou, Arizona State University
To summarize, the four panelists assembled for this discussion identify research priorities and use cases that reflect current open questions and areas where human factors professionals could be active contributors. Wong raises the difference between an entity perceived as a tool and one perceived as a teammate, and the challenge of designing for first impressions to aid appropriate technology adoption. At the same time, there is the longer-term challenge of designing and developing effective human-AI communication such that the AI could be adopted and integrated as “one of the team.” Part of this communication challenge involves considering how to represent uncertainty and how to give human counterparts the ability to explore source data. Beyond interactional challenges, Gutzwiller highlights that in complex distributed human-machine work systems, the central need is to align goals among participating entities productively and with as little confusion as possible. Yet, HSI test and evaluation methods for achieving this are not obvious. A path forward could involve advancing current testbeds to study novel and dynamic function allocation schemes, with closer involvement from researchers on customer-relevant use cases. From an even broader perspective, Cook discusses the perennial challenge felt by many practitioners and human factors organizations: the urgent need to move beyond the attitude of technology as a panacea for current problems. This attitude is exacerbated in the current climate of AI excitement following decades of software dominance and tech startup culture. Fallon brings us back to another perennial challenge: how to best situate human and machine counterparts at work, identify corresponding core competencies, and provide the requisite workforce training. What is novel about the AI context is that attributing performance becomes a complex endeavor (is the issue the underlying data or the model itself?), especially given the ability to provide corrective feedback to the machine during operation.
One final thought, given the motivation of this panel, is to involve the broader community in this work. Gutzwiller mentions that academics could be more deeply involved in testing user-relevant use cases. Ideally, this would distribute the vast scope of work that remains in the human-AI teaming research and defense space and include more minds in it. Challenges with achieving this were noted during our initial workshop: there is a lack of open-source, easily configurable testbeds to address the types of teaming issues known to arise in more open-world environments, and a lack of access to a central repository of user-relevant and open-source use cases. Although identifying shared needs was a valuable outcome of the workshop, it is also worth noting the heterogeneous goals within our HFES community. For example, a central mission of many higher education institutions is inclusive and accessible education, which does not necessarily conflict with the research priorities highlighted, but it does result in a wider range of goals and interests that must be considered. A common complaint underscores the dearth of behavior-based studies with experts and an oversaturation of survey-based studies with college students. This is a fair point to make about our lacking state of knowledge; but in the absence of privilege, resources, and the right network, these latter types of studies still hold value. Beyond their scientific value (assuming they are well justified, reasonably designed, rigorously conducted, and responsibly reported), these studies are how many early-career scholars can go from “zero to one” within a limited time; they are crucial outputs of general education, research training, and continuous learning in our field. Supporting a common research mission necessarily involves supporting collective education, heterogeneous interests, and human learning.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
