Abstract
Objective
We investigated how various error patterns from an AI aid in a nonbinary decision scenario influence human operators’ trust in the AI system and their task performance.
Background
Existing research on trust in automation/autonomy predominantly uses signal detection theory (SDT) to model autonomy performance. SDT classifies the world into binary states and hence oversimplifies the interactions observed in real-world scenarios. Allowing multi-class classification of the world reveals intriguing error patterns unexplored in prior literature.
Method
Thirty-five participants completed 60 trials of a simulated mental rotation task assisted by an AI with 70–80% reliability. Participants’ trust in and dependence on the AI system and their performance were measured. Combining participants’ initial performance with the AI aid’s performance yielded five distinct patterns. Mixed-effects models were built to examine the effects of the different patterns on trust adjustment, performance, and reaction time.
Results
Varying error patterns from AI impacted performance, reaction times, and trust. Some AI errors provided false reassurance, misleading operators into believing their incorrect decisions were correct, worsening performance and trust. Paradoxically, some AI errors prompted safety checks and verifications, which, despite causing a moderate decrease in trust, ultimately enhanced overall performance.
Conclusion
The findings demonstrate that the types of errors made by an AI system significantly affect human trust and performance, emphasizing the need to model the complicated human–AI interaction in real life.
Application
These insights can guide the development of AI systems that classify the state of the world into multiple classes, enabling the operators to make more informed and accurate decisions based on feedback.
Keywords
Introduction
Consider the following hypothetical scenario: Sarah, a highly skilled pharmacist, is responsible for filling the medication bottles for each prescription order. Recently, the pharmacy Sarah works for has introduced an AI computer vision system that can scan the filled bottle and identify the specific medication that has been filled. The AI system is introduced as another layer of verification before medication dispensing. Today, Sarah receives a prescription order for one patient, Noah, who needs to take medication X.
Upon receiving the order, one of the following five cases could happen:
Case A: Sarah correctly fills the bottle with medication X, and the AI system correctly identifies the filled medication as X.
Case B: Sarah correctly fills the bottle with medication X, but the AI system incorrectly identifies the filled medication as another medication Z.
Case C: Sarah has a lapse when filling the bottle and incorrectly fills it with medication Y, and the AI system correctly identifies the filled medication as Y.
Case D: Sarah has a lapse when filling the bottle and incorrectly fills it with medication Y, and the AI system incorrectly identifies the filled medication as another medication Z.
Case E: Sarah has a lapse when filling the bottle and incorrectly fills it with medication Y, and the AI system incorrectly identifies the filled medication as the prescribed medication X.
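To make the taxonomy concrete, here is a minimal sketch in R (the function name, argument names, and drug labels are ours, not from the article) that maps a single dispensing event onto Cases A–E.

```r
# Illustrative sketch (not from the article): classify a dispensing event into
# Cases A-E from the prescribed drug, the drug actually filled, and the drug
# the AI identifies in the bottle.
classify_case <- function(prescribed, filled, ai_identified) {
  filled_correctly <- filled == prescribed
  ai_correct <- ai_identified == filled
  if (filled_correctly && ai_correct) return("A")   # correct fill, AI confirms it
  if (filled_correctly && !ai_correct) return("B")  # correct fill, AI flags it wrongly
  if (!filled_correctly && ai_correct) return("C")  # wrong fill, AI catches it
  # Wrong fill and wrong AI identification: the direction of the AI error matters.
  if (ai_identified == prescribed) return("E")      # AI falsely reassures (matches prescription)
  return("D")                                       # AI is wrong but still triggers a check
}

classify_case("X", "Y", "X")  # "E": false reassurance
classify_case("X", "Y", "Z")  # "D": an error that still prompts verification
```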
This hypothetical scenario, which reflects recent developments in medical dispensing (Q. Chen et al., 2024; J. Y. Kim et al., 2025; Lester et al., 2021; Tsai et al., 2025; Zheng et al., 2023), presents unique characteristics that are largely overlooked in existing research on trust in and dependence on automation/autonomy and AI systems. Existing research typically involves a human subject performing tasks, with certain tasks being automated. For example, an automated combat identification (CID) aid can scan the environment, identify a friend or a foe, and make recommendations to soldiers (Guo et al., 2023, 2024; Du et al., 2020; Neyedli et al., 2011; Wang et al., 2009). To model the performance of the automation, most studies used the signal detection theory (SDT) (Hautus et al., 2021; Tanner & Swets, 1954). However, SDT categorizes the world into binary states—signal present or absent—which oversimplifies the complex dynamics and variances observed in real-world scenarios.
Let’s examine the five cases presented in the hypothetical scenario, focusing particularly on Case D and Case E shown in Figure 1(b), which are especially thought-provoking. In Case D, although the AI erroneously identifies medication Y as Z, this error can be considered beneficial from a utilitarian perspective. The incorrect identification differs from the prescribed medication X, which rightly triggers a safety alert. Conversely, Case E also involves an error by the AI, but this mistake could potentially lead to catastrophic results, as it mistakenly confirms the incorrectly filled medication as correct, misleading the human operator and providing false reassurance. Interestingly, as shown in Figure 1(a), if we assume the AI’s prediction is only binary, Cases D and E would be combined into a single case. Difference in the performance patterns when choices are binary or nonbinary.
This research aims to explore how people trust and depend on automation/autonomy when the predictions made by these systems extend beyond simple binary decisions. We first review the SDT, followed by existing studies on trust and dependence that utilize SDT within both single- and dual-task frameworks. After that, we introduce and detail the current study.
SDT, Trust, Automation Compliance, and Reliance
SDT models the relationship between signals and noise, as well as the automation’s ability to detect signals among noise (Sorkin & Woods, 1985; Tanner & Swets, 1954). The state of the world is characterized by either “signal present” or “signal absent,” which may or may not be identified correctly by automation. The combination of the state of the world and the automation’s detection results in four possible states: hit, miss, false alarm (FA), and correct rejection (CR) (see Figure 2). Signal detection theory (SDT).
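For contrast with the nonbinary cases above, a minimal sketch of the SDT bookkeeping, which reduces every trial to one of four outcomes (the labels follow Figure 2; the function itself is our illustration):

```r
# Sketch: the four SDT outcomes from a binary state of the world (signal
# present or absent) and a binary automation response (alert or no alert).
sdt_outcome <- function(signal_present, automation_alerts) {
  if (signal_present && automation_alerts) return("hit")
  if (signal_present && !automation_alerts) return("miss")
  if (!signal_present && automation_alerts) return("false alarm")
  return("correct rejection")
}

sdt_outcome(signal_present = TRUE, automation_alerts = FALSE)  # "miss"
```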
SDT is widely employed to examine trust in and dependence on automation/autonomy and AI systems. Typically, SDT is used to model the reliability of automation with which the human operators interact in both single-task (Bhat et al., 2022, 2024; Lacson et al., 2005; Madhavan et al., 2006; Neyedli et al., 2011; Wang et al., 2009; Wiegmann et al., 2001) and multi-task paradigms (Chung & Yang, 2024; Du et al., 2020; McBride et al., 2011; Wiczorek & Manzey, 2014; Yang et al., 2017). The general findings indicate that as automation reliability increases, trust and dependence on automation increase accordingly.
When error types (FAs and misses) are examined more closely, findings suggest they lead to distinct behavioral patterns (Chancey et al., 2017; Meyer, 2001, 2004). Dependence on automation can be categorized into compliance and reliance. Compliance is characterized by operators responding as though a signal is present when the system issues an alert. FAs often result in commission errors, in which the operator follows the automation’s false indication. Reliance, by contrast, occurs when operators believe the automated system’s indication of safety. When operators fail to respond to events because of automation misses, omission errors arise (Meyer, 2001, 2004; Skitka et al., 1999). Research suggests that FAs primarily influence compliance, though some studies indicate they may also affect reliance, whereas misses predominantly impact reliance (Chancey et al., 2015; Chancey et al., 2017; Dixon & Wickens, 2006; Dixon et al., 2007; Meyer, 2001).
For example, Chancey et al. (2017) investigated automation reliability and error types (FA-prone vs. miss-prone) in a simulated flight task. Participants performed a primary task of maintaining level flight and a secondary resource management task supported by automated aids with high (90%) and low (60%) reliability. Trust was measured at an aggregate level halfway through each session. The results showed that reliability significantly affected compliance in the FA-prone condition, with participants more likely to comply with the highly reliable systems. In contrast, in the miss-prone condition, reliability primarily affected reliance, with greater reliance observed on highly reliable systems. Additionally, FAs had a stronger impact on trust than misses; trust mediated the relationship between FAs and compliance but not between misses and reliance (Chancey et al., 2017).
Other research indicates that misses may cause operators to trust automation less. For instance, in a study simulating an unmanned aerial vehicle mission with an automation-assisted secondary weapon-deployment task, Davenport and Bustamante (2010) investigated participants’ trust, reliance, and compliance following FA-prone and miss-prone aids. Trust was assessed through postsession questionnaires, revealing a greater trust decrement for systems with more misses compared to FAs. Compliance was greater for the miss-prone aid, while reliance was greater for the FA-prone aid (Davenport & Bustamante, 2010). Conversely, other studies indicate that operators may exhibit lower trust in systems with a high frequency of FAs, possibly because FAs are highly salient: operators must actively verify each alert, which makes the errors more prominent and reduces the system’s credibility (Breznitz, 2013; Johnson et al., 2004).
Some studies implicitly assessed compliance and reliance using performance and reaction time metrics. Dixon et al. (2007) examined error types in tracking and system monitoring tasks. Participants were grouped by automation aid type: no help, perfectly reliable aid, FA-prone aid (60% reliability), and miss-prone aid (60% reliability). Reliance was evaluated through performance deficits, while compliance was measured by comparing response times to those observed in the perfectly reliable aid condition. Results indicated that the FA-prone system reduced both compliance and reliance, while the miss-prone system only reduced reliance. In the FA-prone condition, failure detection rates and response times in the monitoring task were worse than in the perfectly reliable automation condition, reflecting reduced compliance. Both error types affected reliance, as evidenced by poorer tracking task performance after silent automation trials. Notably, the FA-prone condition resulted in lower performance and longer detection times compared to the miss-prone condition (Dixon et al., 2007). In the multi-task simulation experiment of Sanchez et al. (2014), participants performed a collision avoidance driving task supported by 94.5% reliable automation and a secondary tracking task. Reliance was assessed based on whether participants chose to double-check after receiving support from the automated aid. Results revealed a trend of increasing reliance in the FA condition as participants became more familiar with the automation, whereas reliance decreased in the miss condition (Sanchez et al., 2014).
In high-stakes scenarios where the consequences of missing critical events could be catastrophic, automation thresholds are often calibrated to favor FAs to minimize the chance of overlooking abnormalities (Sanchez et al., 2014; Sorkin & Woods, 1985). However, excessive FAs may affect trust more than misses, leading to disuse (neglect or under-dependence) and thereby deviating from the original purpose (Chancey et al., 2017; Dixon & Wickens, 2006; Parasuraman & Riley, 1997). These studies treated error type (FA-prone vs. miss-prone) as a between-subjects variable. Future research could benefit from evaluating within-subject differences to provide a more comprehensive understanding of human behavioral patterns in response to automation errors.
Properties of Trust Dynamics
Over the past 30 years, many researchers have investigated people’s trust in automated/autonomous and AI systems. Historically, most studies have approached trust from a snapshot view, typically using end-of-experiment questionnaires to evaluate trust levels. Research within this view has identified many factors that can influence people’s trust in these systems (Hoff & Bashir, 2015; Kaplan et al., 2023). More recently, acknowledging that a person’s trust can change dynamically while interacting with automated/autonomous technologies, there is a shift of research focus from snapshot trust to trust dynamics—how humans’ trust in autonomy forms and evolves due to moment-to-moment interaction with automated/autonomous technologies (de Visser et al., 2020; Wischnewski et al., 2023; Yang, Guo, & Schemanske, 2023; Yang, Schemanske, & Searle, 2023).
One body of research on trust dynamics explores how trust can be violated and restored after trust violation. Trust violations happen after negative experiences (false alarms and misses), which substantially decrease trust below its original level (Baker et al., 2018; Esterwood & Robert, 2022; Esterwood & Robert, 2023; P. H. Kim et al., 2004; T. Kim & Song, 2021; Lewicki & Brinsfield, 2017; Sebo et al., 2019). Some trust violation research suggests that misses may lead to lower trust in automation. For example, Azevedo-Sa et al. (2020) investigated how failures in semi-autonomous driving systems influence drivers’ trust. Participants engaged in a nondriving task in a semi-autonomous vehicle, which occasionally provided FAs or missed obstacles. Trust dynamics were evaluated after each interaction, revealing that misses had a more harmful effect on trust than FAs (Azevedo-Sa et al., 2020). In contrast, other studies suggest no differences between error types. For instance, Guzman-Bonilla and Patton (2024) investigated participants’ responses to automation errors (FAs or misses) during a block pair matching task. Trust was measured after each block, and their findings indicated that error type did not affect the participants’ trust (Guzman-Bonilla & Patton, 2024). After trust violations occur, efforts can be made to restore trust, often through actions/strategies such as apologies (Kohn et al., 2019; Mahmood et al., 2022; Natarajan & Gombolay, 2020), denials (Kohn et al., 2019), explanations (Ashktorab et al., 2019; M. Faas et al., 2021; Natarajan & Gombolay, 2020), and promises (Albayram et al., 2020).
The second body of research is focused on developing real-time trust prediction models (Bhat et al., 2022; M. Chen et al., 2018; Guo & Yang, 2021; Xu & Dudek, 2015). Examples of trust prediction models include the online probabilistic trust inference model (OPTIMo) by Xu and Dudek (2015), the Beta random variable model by Guo and Yang (2021), and the Bayesian model combining Gaussian processes and recurrent neural networks by Soh et al. (2020). For a detailed review, please refer to Kok and Soh (2020).
The third body of research, which is examined in this study, uncovers general properties that govern how a person’s trust in automated/autonomous and AI systems changes over time. Recently, Yang, Guo, and Schemanske (2023) summarized three key properties of trust dynamics: continuity (trust at the present moment i is significantly associated with trust at the previous moment i − 1), negativity bias (negative experiences due to autonomy failures have a greater influence on trust than positive experiences due to autonomy successes), and stabilization (a person’s trust stabilizes over repeated interactions with the same autonomy). The first two properties have been reported consistently in prior research (Lee & Moray, 1992; Manzey et al., 2012; Yang et al., 2017; Yang, Schemanske, & Searle, 2023). The last property, stabilization, was first empirically found in Yang, Guo, and Schemanske (2023). Using these properties, a predictive computational model was developed, which posits that trust at any point follows a Beta distribution (Guo & Yang, 2021). This model surpasses previous models in prediction accuracy while ensuring generalizability and explanatory power, indicating the importance of the three properties of trust dynamics.
The Present Study
Table 1. Description of the Five Cases and Two Outcomes. Bold Text Indicates Potential Incorrect Outcomes Resulting From Automation Errors.
The aims of the study are multifaceted. The primary interest of the study is to investigate the influence of AI predictions following wrong initial decisions, specifically by comparing patterns WC (i.e., Case C), WI (i.e., Case D), and WI_ref (i.e., Case E).
Both patterns WI and WI_ref involve the wrong medication being filled and subsequently misidentified by the AI system. The critical difference emerges in the outcome of the AI’s error: in pattern WI, the AI’s mistaken identification of medication Y as Z prompts a reevaluation of the dispensed medication, potentially averting a mistake. Conversely, in pattern WI_ref, the AI incorrectly identifies medication Y as X, misleading the human operator and providing false reassurance, thereby compounding the error. Therefore, we hypothesize the following.
In terms of trust change, we hypothesize that both pattern WI and pattern WI_ref will lead to a trust decrement, compared with a trust increment for pattern WC. In addition, there will be a more significant trust decrement for pattern WI_ref.
Regarding the time taken to make a final decision after receiving the AI’s prediction, we hypothesize an interaction between the three patterns and the final outcome. As mentioned earlier, pattern WI_ref misleads the human operator and provides false reassurance. Therefore, we speculate that when participants are indeed misled by pattern WI_ref, they will react very quickly. In contrast, it should take participants the longest to recognize and rectify the error in pattern WI, because the incorrect prediction aligns with neither the reference nor the initial answer.
Along with the primary interest, we wish to provide further evidence for the three properties of trust dynamics, namely continuity, negativity bias, and stabilization.
Method
This research complied with the American Psychological Association code of ethics and was exempt from the Institutional Review Board oversight at the University of Michigan (HUM00230326). Informed consent was obtained from each participant.
Participants
A total of 35 University of Michigan undergraduate and graduate students (average age = 22.43 years, SD = 3.27) participated in the experiment. Participants were required to have normal or corrected-to-normal vision. Participants received a base compensation of 10 dollars, with a bonus of up to 10 dollars based on their performance. Participants completed a three-dimensional spatial visualization and recognition task (J. Kim et al., 2023; J. Y. Kim et al., 2024; Vandenberg & Kuse, 1978), assisted by an AI. In each trial, participants were presented with a reference image selected from a pool of 30 images. They were then shown five options and asked to choose the one that matched the reference. Following their selection, the simulated AI predicted the shape of the participant’s chosen image. Finally, participants indicated whether their initial selection was correct. This workflow mirrors a medication dispensing task, where a pharmacist first receives a prescription (e.g., vitamin C) and retrieves the corresponding medication from an inventory containing hundreds of options. The retrieved medication is then verified for accuracy by pharmacy technicians or an AI system, comparing it to the prescription order. Based on the verification, the pharmacist decides whether to dispense the medication or to rectify errors by retrieving the appropriate item.
Experimental Apparatus and Stimuli
Stimuli Development
The study employed a simulated mental rotation task (MRT), and we followed Shepard and Metzler (1971) to develop the stimuli. We created 30 three-dimensional (3D) objects, half with 10 cubes and the other half with 8 cubes. For each object, we created a reference two-dimensional image of the object, positioned at a specific rotation (see Figure 3). Subsequently, for each reference, we created 14 rotated images by rotating the 3D object every 45 degrees along the horizontal and vertical axes. This resulted in 30 reference images, each with 14 corresponding rotated images, denoted as correct alternatives (of the reference). Illustration of stimuli development. After creating the (a) reference images, one (b) mirrored image, fourteen 45-degree (c) horizontal-axis rotated figures, and fourteen 45-degree (d) vertical-axis rotated figures were created for each reference image.
The difficulty of the stimuli was determined through a separate study using Qualtrics, with participants recruited via Amazon Mechanical Turk (MTurk). Following the mental rotation study of Shepard and Metzler (1971), participants viewed multiple image pairs. Each left image was a randomly chosen reference image, and the right image was either a correct alternative, a distractor, or the reference image itself (Figure 4). Distractors were randomly chosen from the remaining 29 reference images, which included 14 images with the same number of cubes as the reference image (i.e., 8 or 10) and 15 images with a different number of cubes. Participants had 10 seconds to indicate whether the image pairs showed the same item in different rotations (by clicking “Yes, they are the same item”) or different items (by clicking “No, they are different”). Data from participants who failed to match the validation pairs (reference–reference image pairs) correctly were discarded. The mean hit rate determined stimuli difficulty.
Experimental Task
The task was adapted from a three-dimensional spatial visualization test (Vandenberg & Kuse, 1978) and was developed using the Python Tkinter package. Figure 5 shows the flowchart of the experimental task. Flowchart of the experiment.
Before the experiment, participants completed a demographics survey (i.e., age and gender) and a trust propensity survey estimating their propensity to trust automation (Merritt et al., 2013).
During the experiment, participants performed the MRT, which consisted of 60 trials presented in random order. Each trial contained several steps. First, participants were shown a reference image and five choices, containing one correct alternative and four distractors (Figure 6(a)). The correct alternative was identical to the reference image in structure but was shown in a rotated position. Two distractors were randomly chosen from rotated mirrored images of the reference and the other two were randomly chosen from rotated images of other reference images. Participants were asked to choose the image that portrayed the same 3D object as the one portrayed in the reference image, as accurately as possible and within 15 seconds, then to click the “Next” button. The 15 second duration was determined based on pilot testing. If no selection was made within the time limit, the text on the initial choice page displayed “You did not make a selection within the time limit” (Figure 6(b)), and participants needed to click the “Next” button to start the next trial. Participants were asked to (a) make their initial answer choice within 15 seconds. (b) If no selection was made within the time limit, participants needed to skip to the next trial.
After making the initial choice, participants rated their confidence in their initial choice using a visual analog scale, with the leftmost point labeled “Not confident at all” and the rightmost point “Absolutely confident” (Figure 7). Participants rated their confidence using a visual analog scale.
Next, participants were shown the AI’s prediction, displayed below their initial selection to illustrate the rotation of the AI-predicted shape. In Figure 8(a), the participant initially selected the third image (indicated by an arrow), but the AI misidentified the initial selection and presented the reference image of its predicted shape (i.e., Case B described in the Introduction section). In contrast, in Figure 8(b), the participant initially selected the fifth image (indicated by an arrow). The AI misidentified the initial selection and displayed the reference image of the correct alternative (i.e., Case E described in the introduction). Participants were presented with the AI system’s recognition and chose between sticking with or rejecting their initial choice within 10 seconds. If no selection was made, they were skipped to the next trial. (a) Right initial choice, incorrect AI prediction (Case B, Pattern RI). (b) Wrong initial choice, incorrect AI prediction (Case E, Pattern WI_Ref).
After viewing the AI’s prediction, participants chose between sticking with their initial answer by clicking “I was right” or rejecting their initial answer by clicking “I was wrong” within 10 seconds. Similarly, the duration was determined based on a pilot study. If no selection was made in 10 seconds, the text on the final choice page displayed a message similar to Figure 6(b), and participants needed to press the “Next” button to move to the next trial.
Participants were then presented with their performance feedback and the validity of AI’s prediction (Figure 9). Participants were presented with the performance feedback page, showing their performance on the initial and final answer choices and the validity of the AI’s prediction. (a) Correct AI prediction. (b) Incorrect AI prediction.
At last, participants rated their trust in and perceived reliability of the AI system using a visual analog scale (Figure 10). Following Yang, Schemanske, and Searle (2023)’s approach, participants of this experiment were asked to rate their trust after each trial. The leftmost anchor of the trust scale was labeled “I don’t trust the decision aid at all” and the rightmost anchor of the trust scale was labeled “I absolutely trust the decision aid.” The perceived reliability rating was on a 0 to 100 scale. Participants rated their trust and perceived reliability using a visual analog scale.
Experimental Design
Table 2. Mean and SD of Confidence Rating, Trust Adjustment, Final Performance, Final Reaction Time, Number of Patterns, and Number of Participants Grouped by Patterns and Outcomes of Performance.
Note: Performance is based on the final answer choice. Therefore, for each pattern, there is only one output.
To induce the occurrence of different patterns, the AI system was set as follows. The AI system made a correct recognition 70% of the time, irrespective of whether the participant’s initial choice was right or wrong. Pattern RC (Case A) occurred when the participant’s initial choice was correct, coupled with a correct recognition from the AI system; pattern WC (Case C) occurred when the participant’s initial choice was wrong, coupled with a correct recognition from the AI system. The AI system made an incorrect recognition 20% of the time by recognizing the participant’s selected choice as a shape other than the reference shape. This led to pattern RI (Case B) if the participant’s initial choice was correct and pattern WI (Case D) if the participant’s initial choice was wrong. The AI system recognized the participant’s initially selected choice as the reference shape 10% of the time. This led to pattern RC (Case A) if the participant’s initial choice was right and pattern WI_ref (Case E) if the participant’s initial choice was wrong. Therefore, depending on each participant’s initial choice, the AI reliability ranged from 70% to 80%. This reliability is above the baseline threshold of around 70% required for automation to be perceived as useful (Wickens & Dixon, 2007). In the present study, participants were informed that the AI aid’s recommendations were imperfect, but they were not provided with specific information about its reliability level.
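The sampling scheme above can be summarized in a short simulation sketch (our own illustration; the helper name, and the assumption that the 20% “other shape” recognitions also differ from the true shape of a wrongly selected option, are ours):

```r
# Sketch of the AI behaviour described above. With probability .70 the AI
# reports the true shape of the selected option, with probability .20 it
# reports a shape other than the reference (assumed also to differ from the
# selected option's true shape), and with probability .10 it reports the
# reference shape itself.
simulate_pattern <- function(initial_correct) {
  draw <- sample(c("true_shape", "other_shape", "reference_shape"),
                 size = 1, prob = c(0.70, 0.20, 0.10))
  if (draw == "true_shape") return(ifelse(initial_correct, "RC", "WC"))
  if (draw == "other_shape") return(ifelse(initial_correct, "RI", "WI"))
  ifelse(initial_correct, "RC", "WI_ref")  # reference shape: correct only if the choice was right
}

# Reliability check: ~80% correct AI recognitions after right initial choices,
# ~70% after wrong initial choices.
mean(replicate(1e4, simulate_pattern(TRUE)) %in% c("RC", "WC"))
mean(replicate(1e4, simulate_pattern(FALSE)) %in% c("RC", "WC"))
```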
Measures
We are interested in the following measures.
Confidence
After participants completed step 1 of each trial, they rated their confidence in their initial selection on a visual analog scale from 0 to 100. The leftmost anchor was labeled “I am not confident in my answer at all,” and the rightmost anchor was labeled “I am absolutely confident in my answer.”
Trust Adjustment
After each trial i, participants reported their trust(i) in the decision aid. We calculate a trust adjustment as:
Trust adjustment(i) = Trust(i) − Trust(i − 1), where i = 2, 3, …, 60
Since the moment-to-moment trust is reported after each trial, only 59 trust adjustments are obtained from each participant.
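As a minimal sketch of this computation (the data frame name and the columns participant, trial, and trust are our assumptions):

```r
# Sketch: trust adjustment as the first difference of the 60 post-trial trust
# ratings within each participant, yielding 59 adjustments per participant.
library(dplyr)

trust_adj <- ratings %>%            # 'ratings' is an assumed data frame
  arrange(participant, trial) %>%
  group_by(participant) %>%
  mutate(trust_adjustment = trust - lag(trust)) %>%  # NA for each participant's first trial
  ungroup()
```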
Performance
Another dependent variable was the final performance of the experimental task, which was calculated as the percentage of correct final answers (determining the validity of the initial answers) for each pattern for each participant (Figure 8(a)).
Reaction Time
The reaction time was measured in seconds from when the AI prediction appeared until the participant pressed either the “I was right” or “I was wrong” button (Figure 8(a)).
Experimental Procedure
After participants signed the consent form and completed the online demographics and trust propensity surveys (Merritt et al., 2013), they watched a video explaining the experimental task. Participants were informed that the AI aid’s recommendations were imperfect and may or may not be correct. Each participant completed 60 trials. Upon completion, participants completed a postexperiment survey.
Statistical Analysis
We constructed linear mixed-effects models, accounting for random effects of participants, to examine our hypotheses using the “lme4” package in R (Bates et al., 2014). Mixed-effects models have advantages in dealing with repeated measures and missing values and are, therefore, particularly appropriate for our analysis. We used an iterative approach to construct the models. Following the standard procedure for building mixed-effects models (Field et al., 2012), we started simple and gradually added complexity to our models. Using the likelihood ratio test (LRT), comparisons of models were conducted, and the simpler models were used when no significant differences were found. The level of significance for this study was set to α = 0.05.
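A minimal sketch of this procedure with lme4 (the data frame and column names are assumptions; models are fit with maximum likelihood so that the likelihood ratio test on fixed effects is valid):

```r
# Sketch: start from an intercept-only model with a random intercept per
# participant, add the pattern fixed effect, and compare the two via the LRT.
library(lme4)

m0 <- lmer(trust_adjustment ~ 1 + (1 | participant), data = dat, REML = FALSE)
m1 <- lmer(trust_adjustment ~ pattern + (1 | participant), data = dat, REML = FALSE)
anova(m0, m1)  # likelihood ratio test; retain the simpler model if not significant
```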
The mixed-effects model equation is:
yij = β0 + u0j + β1·CaseBij + β2·CaseCij + β3·CaseDij + β4·CaseEij + ϵij
where yij is the outcome variable for participant j under case pattern i; β0 is the fixed-effect intercept; u0j is the random intercept for the jth participant; CaseBij, CaseCij, CaseDij, and CaseEij are dummy variables for the case patterns, comparing each case to the reference case (i.e., Case A); β1, β2, β3, and β4 are the fixed-effect coefficients for the case patterns (B, C, D, and E) relative to the reference Case A; and ϵij is the residual error for participant j in case pattern i.
Post hoc t-tests were conducted to analyze the differences between the patterns. The Kenward-Roger method of estimating degrees of freedom was used, which adjusts the variance-covariance matrix of the fixed effects using a Taylor series expansion, then calculates degrees of freedom (Kenward & Roger, 1997). This method is widely recognized in mixed-effects models for its ability to mitigate Type 1 errors, generate precise p-values, and maintain reliability across varying sample sizes (Kuznetsova et al., 2017; Luke, 2017; McNeish & Stapleton, 2016).
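A sketch of the post hoc comparisons with Kenward-Roger degrees of freedom, using the emmeans package (the model object and factor name carried over from the sketch above are assumptions):

```r
# Sketch: pairwise comparisons between case patterns with Kenward-Roger
# denominator degrees of freedom.
library(emmeans)

emm <- emmeans(m1, ~ pattern, lmer.df = "kenward-roger")
pairs(emm)  # post hoc t-tests between patterns
```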
Results
After all trials were completed, the number of occurrences for each pattern coupled with the outcome was calculated. As the occurrence of each pattern could only be determined a posteriori, participants did not necessarily display every performance pattern. Table 2 shows the number of occurrences for each performance pattern and the descriptive statistics of all the measurements.
Confidence
Analyzing participants’ confidence data can be regarded as a check of data quality. We expected that participants would have higher confidence when their initial choices were correct. We organized the patterns into two classifications: right human initial choice (RC, RI) and wrong human initial choice (WC, WI, WI_ref). The confidence ratings revealed a significant difference between the right (Mean = 72.74, SD = 1.98) and the wrong (Mean = 56.77, SD = 2.00) initial choices (t(1962.72) = −20.20, p < .001).
Comparing Patterns WC, WI, and WI_ref
Significance indicators are displayed in the figures to highlight statistically significant differences between the performance patterns.
Trust Adjustment
There was a significant difference between the three patterns on trust adjustment (F(2, 875.68) = 61.2, p < .001). Pattern WC had a significantly larger trust increase than pattern WI (t(875) = 6.57, p < .001) and pattern WI_ref (t(878) = 9.96, p < .001). Pattern WI_ref had a significantly greater trust decrement than pattern WI (t(876) = −4.46, p < .001) (Figure 11). Trust adjustment by patterns of performance. The error bars represent +/− 2 standard errors.
Performance
There was a significant difference between the three patterns in the final performance (F(2, 858.11) = 29.43, p < .001). Post hoc results show that pattern WI_ref had a significantly worse performance compared to WC (t(856.62) = −7.67, p < .001) and to WI (t(856.47) = −5.73, p < .001) (Figure 12). Performance by patterns of performance. The error bars represent +/− 2 standard errors.
Reaction Time
For the statistical analyses of reaction times, a log transformation was applied to the response time data. Both patterns (F(2, 852.35) = 26.82, p < .001) and outcomes (F(1, 862.59) = 24.64, p < .001) significantly affected reaction time. In addition, a significant interaction effect was observed (F(2, 855.15) = 10.38, p < .001). Because of the significant interaction effect, we compared the three patterns within each level of outcome. When the final outcome was right (Figure 13(a)), there was no significant difference between the three patterns (F(2, 475.49) = .35, p = .71); when the final outcome was wrong (Figure 13(b)), patterns significantly affected reaction time (F(2, 363.47) = 22.09, p < .001). Post hoc comparisons revealed that WI_ref resulted in significantly quicker reaction times than WC (t(362.36) = 5.82, p < .001) and WI (t(358.65) = 6.14, p < .001). Reaction time by patterns of performance for (a) the right outcomes and (b) the wrong outcomes. The error bars represent +/− 2 standard errors.
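A sketch of this analysis (the column names reaction_time, pattern, outcome, and participant are assumptions):

```r
# Sketch: log-transformed reaction time modelled with pattern, outcome, and
# their interaction, plus a random intercept per participant.
library(lmerTest)  # lmer() wrapper that adds denominator df and p-values

m_rt <- lmer(log(reaction_time) ~ pattern * outcome + (1 | participant), data = dat)
anova(m_rt, ddf = "Kenward-Roger")  # F-tests for the main effects and the interaction
```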
Evaluating Continuity, Negativity Bias, and Stabilization
To examine the three properties of trust dynamics, the patterns were organized into two classifications: correct AI predictions (i.e., patterns RC and WC) and incorrect AI predictions (i.e., patterns RI, WI, and WI_ref).
Continuity
Comparing correct AI predictions to incorrect AI predictions shows that AI successes led to a trust increment, whereas AI prediction failures led to a decline in trust. To investigate the relationship between the present trust rating and its historical ratings over time, we computed the autocorrelation of the trust ratings. The results, depicted in Figure 14, show that trust at time t is highly correlated with trust at the previous moment t − 1. In addition, the mean autocorrelation decreased as the time gap between ratings increased, indicating a decline in correlation with longer time intervals (Figure 15). Autocorrelation of trust as a function of time separation. The error bars represent +/− 2 standard errors.
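A sketch of how the autocorrelation can be computed (assuming each participant’s 60 trust ratings are ordered by trial; the data frame and column names are ours):

```r
# Sketch: per-participant autocorrelation of trust ratings at lags 1-10,
# then averaged across participants.
acf_by_participant <- lapply(split(dat$trust, dat$participant),
                             function(x) acf(x, lag.max = 10, plot = FALSE)$acf[-1])
mean_acf <- Reduce(`+`, acf_by_participant) / length(acf_by_participant)
mean_acf  # mean autocorrelation at lags 1 through 10
```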
Negativity Bias
To evaluate the property of negativity bias, we compared the magnitude of the trust increment following correct AI predictions with that of the trust decrement following incorrect AI predictions. Results show a significant difference between the correct (Mean = 1.01, SD = 0.13) and the incorrect (Mean = −3.16, SD = 0.22) AI predictions (t(1968.00) = −16.58, p < .001), with the decrement being larger in magnitude than the increment.
Stabilization
To evaluate the stabilization property, we conducted an analysis using a linear mixed model of trust adjustment as a function of time. Results show a significant decrease in the magnitude of trust adjustment over time for both the correct AI predictions (β = −0.035, t(1448) = −4.58, p < .001) and the incorrect AI predictions (β = −0.203, t(469.27) = −3.50, p < .001).
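A sketch of this analysis (using the absolute trust adjustment as the magnitude, and splitting by prediction correctness, are our assumptions, as are the column names):

```r
# Sketch: magnitude of trust adjustment regressed on trial number, fit
# separately for trials with correct and incorrect AI predictions.
library(lmerTest)

m_correct <- lmer(abs(trust_adjustment) ~ trial + (1 | participant),
                  data = subset(dat, ai_correct))
m_incorrect <- lmer(abs(trust_adjustment) ~ trial + (1 | participant),
                    data = subset(dat, !ai_correct))
summary(m_correct)$coefficients   # a negative trial slope indicates stabilization
summary(m_incorrect)$coefficients
```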
Discussion
In this section, we discuss the findings in relation to our aims and hypotheses.
The primary interest of the study is to compare patterns WC, WI, and WI_ref. The results partially support our hypotheses.
In prior work examining human–automation/autonomy interaction, risk has been identified as a significant factor influencing trust, dependence, and task performance (Hoff & Bashir, 2015). The risk of an event is the product of the probability of the event and the consequence of that event (Sheridan, 2008; Slovic & Lichtenstein, 1968). Mathematically, Sheridan (2008) expressed risk R as the “product of the probability PE of an initial reference event E and the sum of products of probability Pi|E of each consequence occurring given that E occurs, as well as the cost of each consequence Ci.” Therefore, R = PE × Σi (Pi|E × Ci).
Regarding the time taken to make a final decision after receiving the AI’s prediction, we hypothesized an interaction between the three patterns and the final outcome. The results supported this hypothesis.
Along with our primary objectives, we also wished to validate the existence of the three properties of trust dynamics, namely, continuity, negativity bias, and stabilization.
Conclusion
Most studies on trust in and dependence on automation/autonomy and AI systems used SDT to model the performance of the systems. However, SDT categorizes the world into binary states—signal present or absent—which oversimplifies the complex dynamics observed in real-world scenarios. This binary categorization fails to capture the complexity inherent in multi-class classification scenarios, which are increasingly prevalent in advanced AI systems across various sectors. Examples include medical diagnosis through image classification (classifying chest X-rays into normal, pneumonia, or lung cancer), object detection in autonomous vehicles (recognizing various objects encountered on the road), and product categorization in e-commerce (classifying products as electronics, clothing, home appliances, or books), among others (Gao et al., 2018; Kozareva, 2015; Rahman et al., 2008). As far as we know, ours is one of the few studies, if not the only one, that examines trust in and dependence on automation/autonomy and AI systems beyond a binary classification of the state of the world.
Our study revealed differences in performance, response time, and trust between three patterns: WC, WI, and WI_ref. These findings emphasize the need for careful consideration of error patterns in automated systems. In scenarios where the AI falsely reassures the human operator, as in pattern WI_ref, we observed a high likelihood that operators would make expedited decisions that often led to errors. This highlights a significant danger in AI design: systems that mislead operators can severely undermine safety and efficiency. Conversely, errors that lead to safety checks and verifications, even though incorrect, can enhance the overall reliability of human–AI systems by preventing potential mishaps. Our results also illustrate the generalizability of the three properties of trust dynamics (continuity, negativity bias, and stabilization), showing how people’s trust can change over time due to moment-to-moment interaction with AI.
There are several limitations to this study. First, in the present study, we did not manipulate the cost of an error or the cost of success (i.e., Ci in the risk equation), which often varies in safety-critical domains. For instance, Aldhwaihi et al. (2016) found that although most dispensing errors caused no harm, some errors had high consequences that could potentially cause harm or even death, depending on the dosage and patients’ vulnerability. Future research could incorporate varying costs into experiments to examine the impact of various AI error types. Also, future research should validate the results in more realistic contexts, particularly in safety-critical domains. Second, inspired by the medical dispensing examples described at the beginning of the article, the testbed we developed focused on a mental rotation and recognition task. Future studies could replicate the study using testbeds focusing on other human information processing components such as memory and decision making. Third, there were only 96 WI_ref cases, which could have limited the statistical power of some comparisons.
Key Points
• AI systems that offer a multi-class classification of the state of the world can lead to types of errors that SDT does not traditionally capture.
• Varying error patterns from AI significantly impacted performance, reaction times, and trust adjustments.
• AI errors can provide false reassurance, misleading operators into believing their incorrect decisions were correct. This leads to worsened performance and a significant decline in trust. Conversely, and paradoxically, some AI errors can prompt safety verification, ultimately enhancing overall performance.
• We illustrate the generalizability of the three properties of trust dynamics: continuity, negativity bias, and stabilization.
Acknowledgments
The authors would like to thank Szu-Tung Chen for her assistance in testbed development, and Yili Liu for lively intellectual discussions.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is in part supported by the University of Michigan Industrial and Operations Engineering department’s fellowship, the University of Michigan Rackham Graduate School’s graduate student research grant, and the National Library of Medicine of the National Institutes of Health under award number R01LM013624.
