Abstract
Objective
We investigated how people used cues to make Judgments of Difficulty (JODs) while observing automation perform a task and when performing this task themselves.
Background
Task difficulty is a factor affecting trust in automation; however, no research has explored how individuals make JODs when watching automation or whether these judgments are similar to or different from those made while watching humans. Additionally, it is unclear how cue use when observing automation differs as a function of experience.
Method
The study involved a visual search task. Some participants performed the task first, then watched automation complete it. Others watched and then performed, and a third group alternated between performing and watching. After each trial, participants made a JOD by indicating whether the task was easier or harder than before. Task difficulty randomly changed every five trials.
Results
A Bayesian regression suggested that cue use while observing automation is both similar to and different from cue use while observing humans. For central cues, support for the Unique Agent Hypothesis (UAH) was bounded by experience: those who performed the task first underweighted central cues when making JODs, relative to their counterparts in a previous study involving humans. For peripheral cues, support for the Media Equation Hypothesis (MEH) was unequivocal, and participants weighted cues similarly across observation sources.
Conclusion
When watching automation perform a task, people weighted cues in ways both similar to and different from when they watched humans, supporting the Media Equation and Unique Agent Hypotheses.
Application
This study adds to a growing understanding of judgments in human–human and human–automation interactions.
Introduction
Automated systems increasingly support and perform tasks previously performed by humans (Parasuraman & Riley, 1997). Consider luggage screening: for many years, Transportation Security Administration (TSA) agents manually searched passengers’ luggage for prohibited items. Although TSA agents still maintain a prominent role in luggage screening—possessing the sole authority of selecting luggage for additional screening—their roles are now supported by automated systems which aid in cueing operator attention to suspicious items (Huegli et al., 2020). When determining whether to act on the system’s recommendations, operators must recognize when the screening task, based on X-ray imagery, might be too difficult for the system to complete accurately (e.g., weapon occlusion or rotation; Mery et al., 2020). Recognizing system limitations requires an understanding of how people evaluate the difficulty of the tasks that they perform, relative to those automation performs. Within the luggage screening example, cues to difficulty can be used to anticipate when a task might be too difficult for automation and easier for a TSA agent to perform manually (i.e., physically inspecting luggage). This provides an opportunity to intervene before accidents occur.
This paper discusses the kinds of information involved in evaluations of task difficulty and how this information is used by people as they perform a task or watch automation do so. We believe this topic is important because of the purported role judgments of difficulty (JODs) play in calibrating Trust in Automation (TiA; Hoff & Bashir, 2015). Because the existing literature on JODs and TiA involves implicit evaluations of task difficulty (Huegli et al., 2023; Ingram et al., 2021; Madhavan et al., 2006; Merritt et al., 2013; Monroe & Vangsness, 2022; Schwark et al., 2010), understanding this relationship requires a discussion about cue-based metacognitive frameworks. Furthermore, because task difficulty and TiA research often requires individuals to perform a task with automation, understanding this relationship also requires a focus on models of decision making for the self as compared to others. The Media Equation Hypothesis (Reeves & Nass, 1996) and the Unique Agent Hypothesis (de Visser et al., 2016) offer two predictions for patterns of judgment and decision making. These hypotheses informed an experiment that determined how cues to difficulty informed JODs during a visual search task where participants alternated between performing the task themselves and watching automation perform the same task.
Metacognitive Judgments
Metacognition is how one monitors or thinks about one’s own or others’ cognition (Flavell, 1979). Research suggests that individuals use cues (e.g., study time and reaction time) to make metacognitive judgments (Desender et al., 2017; Koriat, 1997; Undorf & Erdfelder, 2013; Vangsness & Young, 2019). Although metacognition research typically focuses on one’s own cognition, the principles and methods used to study metacognition (i.e., cue-based frameworks) can be used within the context of human–automation interaction, in which humans must anticipate the needs of their automated counterparts (Dekker & Hollnagel, 1999; Dekker & Woods, 2002; Hancock, 2022). Within human–automation interaction, applicable cues might include cues related to task difficulty (Hoff & Bashir, 2015).
Understanding which cues contribute most strongly to metacognitive judgments provides a window into the underlying strategy used to anticipate and support human and automated agents’ states. Imagine a situation where an individual is leveraging automation to conduct a luggage screening task. Certain object configurations (e.g., bundled cords) can make the task more difficult due to poor image clarity (Huegli et al., 2023; Ingram et al., 2021). Task-based characteristics (e.g., image clarity) reflecting objective task difficulty are a cue that individuals can monitor to inform reliance (Goddard et al., 2014; Huegli et al., 2023) and compliance decisions (Ingram et al., 2021). Therefore, studying individuals’ metacognitive assessments of automation provides a way to quantify how calibrated individuals are in weighing task-based characteristics and their sensitivity to changes in these characteristics over time. As long as humans are required to monitor and observe automation, there is a need to understand how humans track and integrate cues to make metacognitive judgments about automation. One metacognitive judgment relevant to TiA is the judgment of difficulty (JOD; Goddard et al., 2014; Hoff & Bashir, 2015; Huegli et al., 2023; Ingram et al., 2021).
Judgments of Difficulty
Some perspectives define difficulty as a function of task-based characteristics—static qualities of a task which make an individual more or less likely to succeed in their goals (Monroe & Vangsness, 2022; Vangsness & Young, 2019, 2021). Others relate difficulty to completion time (Gray et al., 2007), the proportion of time that task performance requires attention (Potts & Rosenbaum, 2021), and anticipated task duration (Potts et al., 2018)—the results of a person’s engagement with a task. While some perspectives also define task difficulty as the subjective assessment of the effort required to complete a task (Bedwell et al., 2014), this paper defines difficulty as a function of task-based characteristics. Defining difficulty in this way allows for cue grouping with respect to their relationships with the task and human. This enables researchers to identify cues equally available to all people and those available to individuals with task experience (Vangsness & Young, 2019).
Central versus Peripheral Cues to Difficulty
The TiA literature commonly defines trust as “the attitude that an agent will help achieve an individual’s goals in a situation characterized by uncertainty and vulnerability” (Lee & See, 2004, p. 51). One quality that contributes to uncertainty and vulnerability is task difficulty (Hoff & Bashir, 2015). Previous research demonstrates that people use JODs to inform reliance (Madhavan et al., 2006; Schwark et al., 2010), and under-rely on automation when it performs “easy” tasks poorly (Madhavan et al., 2006). Calibrated trust leads to appropriate use, and reduced misuse and disuse, of automation (Lee & See, 2004; Parasuraman & Riley, 1997). Therefore, studying the factors that contribute to people’s ability to estimate task difficulty is essential to understanding trust calibration.
Research on the relationship between task difficulty and TiA is divided. Sometimes the two share a direct relationship, such that increased task difficulty leads to increased TiA (Huegli et al., 2023; Merritt et al., 2013); other times, this relationship is reversed (Goddard et al., 2014; Ingram et al., 2021; Monroe & Vangsness, 2022). Task difficulty can also act as a moderating factor between reliability and TiA. For example, while Ingram et al. (2021) found higher levels of TiA when automation correctly labeled images, these findings were attenuated by task difficulty (i.e., image clarity). Despite the existing research on how task difficulty impacts TiA, no research identifies which information people attend to or how they weigh these cues within the context of tasks performed by automation. This gap motivated the first research question (RQ1): How do people make JODs while observing automation? We evaluated this question by systematically manipulating task difficulty in a visual search task and requiring participants to make repeated JODs while observing automation.
JODs When Observing Others
While past research does not address how JODs are made while observing automation, there is existing research on how humans make JODs while observing other humans. Differences emerge when individuals make JODs while observing others, and previous task experience moderates these differences (Kardas & O’Brien, 2018; Vangsness, 2019). Recent research suggests that people overestimate their ability to perform a task when observing humans; however, hands-on experience can attenuate this effect. Participants with early task exposure were better calibrated in their JODs, while participants with delayed task exposure reported lowered JODs until they received task experience (Kardas & O’Brien, 2018; Vangsness, 2019). This suggests that people use cues differently when they have task experience than when they are naïve observers, and these differences can lead to poorly calibrated JODs. To date, no research extends these findings to environments where individuals observe automation perform a task, or to circumstances in which people have previous task experience. This gap, viewed in conjunction with present advances in automation (i.e., performing tasks previously performed by humans; Parasuraman & Riley, 1997; Rahwan et al., 2019), motivated the second research question (RQ2): How does experience with a task, performed by oneself or by automation, impact JODs? We answered this research question by manipulating the source of experience (i.e., performing the task vs. observing automation) between-subjects. As a starting point for understanding how individuals might make JODs for automation, we can look to theories of human–automation relationships.
Theories of Human–Automation Relationships
Parasuraman and Riley (1997) predicted a shift in the structure of human activities due to projected increases in automation power, speed, and intelligence. Automation increasingly replaces or supports human judgments (Rieger et al., 2022). For example, luggage screening is now supplemented with automated recommender systems (Merritt & Ilgen, 2008; Pop et al., 2015). While the literature evaluating human relationships with automated recommender systems is vast (for a review, see Hoff & Bashir, 2015), it is unclear how humans make judgments following handoffs to decision aids. Although there is existing research on the attitudes of passive (Montague & Asan, 2012; Montague et al., 2014) or incidental (Domeyer et al., 2020, 2022) users of automation that they do not directly engage with or control, it is still unclear how these attitudes extend to handoff situations. Given how difficult it is for humans to imagine how another entity (e.g., an automated system) might think or act (Gray et al., 2007) and the tendency of experts to underestimate the difficulty of tasks performed by others (Hinds, 1999; Nickerson, 1999), it is imperative to understand the judgments associated with automation we passively observe. Understanding how humans make judgments regarding passively observed automation can aid in improving the calibration of TiA. The Media Equation Hypothesis (Reeves & Nass, 1996) and the Unique Agent Hypothesis (de Visser et al., 2016) offer two possible explanations for how human observers might evaluate automation following handoffs.
Media Equation Hypothesis
The Computers are Social Actors (CASA) paradigm suggests that the human–technology relationship is fundamentally social (Reeves & Nass, 1996). Under the CASA paradigm, people apply attitudes towards technology that are typically reserved for humans. A revised version of the paradigm is referred to as the Media Equation Hypothesis (MEH). The MEH assumes that the mechanisms driving human–human and human–automation trust are similar (Nass et al., 1999; Reeves & Nass, 1996). While some studies in support of the MEH suggest that this relationship persists because of humans’ tendency to assign human-like traits to technology (Abubshait & Wiese, 2017; de Visser et al., 2016, 2017; Pak et al., 2012), behaviors that might suggest anthropomorphism do not necessarily emerge from anthropomorphic perceptions (i.e., ethopoeia; Nass & Moon, 2000; Takayama, 2009). Instead, these behaviors may emerge from the mindless application of social scripts to machines (Nass & Moon, 2000).
Unique Agent Hypothesis
The Unique Agent Hypothesis (UAH) asserts that humans respond to automation differently than they do to other humans. Under the UAH, interactions with automation are subject to biases unique to nonhuman entities (de Visser et al., 2016), which can lead to poorly calibrated trust. One consequence is that humans can sometimes over-trust automation and believe that it is always perfect (i.e., the Perfect Automation Schema/PAS; Dzindolet et al., 2002). Negative consequences of this bias emerge when automation performs imperfectly and can impact how humans respond to automation.
MEH and UAH
Although the MEH and UAH are often pitted against one another in the literature (Alarcon et al., 2021, 2023; de Visser et al., 2016, 2017), recent research suggests that attitudes towards automation are more nuanced than previously suggested. While people might perceive automated systems as social agents, they can still maintain biases about automation’s ability to outperform humans (Alarcon et al., 2023). Similarly, humans can shift degrees of verbal anthropomorphism towards automation, which might occur because they come to perceive agents as less human-like and more tool-like over time (Cohen et al., 2021). And while humans might have an in-the-moment social response to automation, on reflection they often believe there are distinctions between human and automated agents (Takayama, 2009).
Study Objective
We sought to understand how humans make JODs (i.e., weigh cues to difficulty) while observing automation (RQ1) and how JODs might be moderated by experience (RQ2). We also wished to compare these results to those obtained from a study in which people observed a human (Vangsness, 2019). In doing so, we sought to provide support for one of two hypotheses:
1. When observing automation perform a task, participants will weigh cues to difficulty differently from how they weigh these cues when observing other humans, supporting the UAH (de Visser et al., 2016).
2. When observing automation perform a task, participants will weigh cues to difficulty similarly to how they weigh these cues when observing other humans, supporting the MEH (Reeves & Nass, 1996).
We acknowledge it is possible that our study provides mixed support for both theories. In this case, we hope to identify the boundary conditions under which one hypothesis or the other is supported.
Method
This research complied with the American Psychological Association Code of Ethics and was approved by the Institutional Review Board at Wichita State University. We obtained informed consent from each participant. Eighty participants recruited through the psychology subject pool completed the experiment. Participants received 1 hour of research credit as compensation.
The Conjunction Visual Search Task
[Figure 1. A schematic of the manipulations presented during the visual search task. Note. Role order was manipulated between-subjects. Difficulty blocks were manipulated within-subjects and Latin-square counterbalanced. Participants experienced one difficulty condition per block.]

[Table 2. Difficulty Conditions (Within-Subjects). Note. The random sampling values represent the possible range of values in a given difficulty condition. The fixed values represent the value for the dimension during unassociated difficulty conditions.]

[Figure 2. Sequence of Events During the Experiment.]
Task Difficulty
To understand sensitivity to changes in objective task difficulty, trials were split into three 70-trial difficulty blocks whose presentation order was counterbalanced across subjects: clicks, set size, and timing. A fourth dimension, feedback, manipulated the salience of peripheral cues while task difficulty was held constant. The difficulty dimension changed every 70 trials. The screen notified participants when the dimension changed but did not provide details on which dimension changed. To ensure that participants experienced levels of difficulty reflecting the full range of performance, a Halton sequence (Halton & Smith, 1964) semi-randomly assigned an objective task difficulty level at the beginning of each trial, based on the condition to which a participant was assigned; this level changed every five trials. The ranges of values across the difficulty conditions were determined through algebraic calculations using the regression equations from Vangsness and Young (2021), selecting sampling values that equated the slopes across the difficulty conditions. Participants observed 35 trials and performed 35 identical trials within each difficulty condition block. A summary of these conditions is provided in Table 2. Stimulus array characteristics unassociated with the participant’s difficulty condition remained set at a fixed value throughout the block (see Table 2). Because the characteristics across dimensions differed, difficulty was standardized (i.e., normalized) by taking each difficulty condition value for a given trial, subtracting the minimum value, and dividing by the range of random sampling values. The standardized difficulty level ranged from 0 to 1, with values closer to 0 indicating easier levels and values closer to 1 indicating harder levels.
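To illustrate the semi-random assignment, a minimal sketch of a base-2 Halton (van der Corput) sequence is shown below; the base, the clicks range of 2–10, and the mapping are illustrative assumptions rather than the study’s exact implementation.

```r
# A minimal sketch of semi-random difficulty assignment. A base-2 Halton
# (van der Corput) sequence spreads values evenly across [0, 1), ensuring
# participants experience the full range of difficulty.
halton <- function(i, base = 2) {
  f <- 1
  r <- 0
  while (i > 0) {
    f <- f / base
    r <- r + f * (i %% base)
    i <- i %/% base
  }
  r
}

n_changes <- 70 / 5                      # difficulty changed every 5 trials
u <- sapply(seq_len(n_changes), halton)  # low-discrepancy values in [0, 1)
# Map onto a hypothetical clicks range of 2-10 (illustrative values only):
clicks_per_change <- round(2 + u * (10 - 2))
```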
Role Order
To understand the impact of experience on JODs, we manipulated role order between-subjects. Participants either observed-first, performed-first, or interleaved these two roles (see Figure 1). Those who observed first watched automation perform the task for the first 35 trials of the block, then completed the second 35 trials of the block. Those who performed first completed the first 35 trials of the block, then observed for the remaining 35 trials of the block. Those in the interleaved condition alternated between observing and performing every 5 trials throughout the block.
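The following minimal sketch maps a trial number within a 70-trial block onto a role under each schedule; which role begins the interleaved sequence is an assumption, as the text does not specify it.

```r
# A sketch of the three role-order schedules described above. Whether
# interleaved participants performed or observed first is an assumption.
role_for_trial <- function(trial, order) {
  stopifnot(trial >= 1, trial <= 70)
  switch(order,
    observe_first = if (trial <= 35) "observe" else "perform",
    perform_first = if (trial <= 35) "perform" else "observe",
    interleaved   = if (((trial - 1) %/% 5) %% 2 == 0) "perform" else "observe"
  )
}

role_for_trial(36, "observe_first")  # "perform" (roles switch at trial 36)
role_for_trial(6,  "interleaved")    # "observe" (second 5-trial sub-block)
```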
Automated System Details
Prior to starting the experiment, participants watched a 23-s video of the automated system successfully performing four trials (https://www.youtube.com/watch?v=YuI2pZEqC7k). In the clip, the system’s characteristics were based on the fixed values (see Table 2) across all four difficulty conditions. Participants did not receive information regarding how the automation worked.
During the experiment, a predictive modeling equation that mimicked participants’ average performance during a pilot study (Vangsness, 2019) controlled the system’s behavior. The system’s performance was probabilistic, depended on difficulty condition and standardized difficulty (see Figure 3), and was 62.3% accurate overall. Because the automation’s performance was based on past participants’ performance, the general decline in performance as a function of standardized difficulty suggests that people were sensitive to changes in difficulty.

[Figure 3. Automated system performance as a function of standardized difficulty. Note. Automated system performance generally decreased as standardized difficulty increased. Participants observed the full range of automated system performance. From “Changes in Trust in Automation (TIA) After Performing a Visual Search Task with an Automated System,” by Driggs & Vangsness (2022), 2022 IEEE 3rd International Conference on Human-Machine Systems (ICHMS). Copyright 2022 by IEEE. Reprinted with permission.]
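As an illustration, the sketch below mimics the kind of probabilistic behavior described, with success probability declining with standardized difficulty through a logistic link; the coefficients are illustrative placeholders, not the study’s fitted values from the pilot data.

```r
# A hedged sketch of probabilistic automation behavior: success probability
# declines as standardized difficulty increases. b0 and b1 are illustrative.
p_success <- function(difficulty, b0 = 1.0, b1 = -2.5) {
  plogis(b0 + b1 * difficulty)  # inverse-logit maps onto (0, 1)
}

automation_succeeds <- function(difficulty) {
  runif(1) < p_success(difficulty)  # Bernoulli draw on each trial
}
```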
JODs
After each trial, regardless of role, participants rated whether each trial was easier or harder than before by clicking one of two buttons (Vangsness & Young, 2019, 2021). After selecting an option, the buttons would disappear, and the next trial would load. Participants provided a JOD after receiving trial feedback. Repeating this JOD after each trial ensured that participants anchored each judgment to their experiences during the previous trial. Comparative judgments reduce idiosyncratic interpretations of scale anchors and produce more reliable measurements (Böckenholt, 2004).
Demographics Survey/Trust Measurement
Before starting, participants provided demographic information and completed the Propensity to Trust Machines Questionnaire (Merritt & Ilgen, 2008). Participants completed the Trust in Automation Questionnaire (Jian et al., 2000), before and after the experiment. Preliminary analyses confirmed that trust in the automated system decreased pre- to post-experiment (Driggs & Vangsness, 2022).
Results
We cleaned our data and conducted all analyses in R (R Core Team, 2023). We used the brms package (Bürkner, 2017) for the Bayesian regression, conducted post-hoc tests using the emmeans package (Lenth, 2022), and generated all graphs using ggplot2 (Wickham, 2016). The full R Markdown file and data set are provided here: https://osf.io/jzcg2/?view_only=dc380535b9774e19961a1d49048ce735.
Defining Variables
To allow for comparisons of objective task difficulty across difficulty conditions (i.e., clicks, set size, and timing), we standardized (i.e., normalized) the difficulty predictor by taking each difficulty condition’s value on a given trial, subtracting the minimum difficulty condition value, and dividing by the sampling range. For example, a trial requiring 6 clicks, the midpoint of the clicks sampling range, would have a standardized difficulty value of (6 − minimum)/(maximum − minimum) = 0.5.
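A minimal sketch of this min–max standardization follows; the clicks sampling range of 2–10 is a hypothetical placeholder used purely for illustration.

```r
# Min-max standardization of difficulty, as described above.
standardize_difficulty <- function(value, minimum, maximum) {
  (value - minimum) / (maximum - minimum)  # 0 = easiest, 1 = hardest
}

# E.g., with an assumed clicks range of 2-10, 6 clicks falls at the midpoint:
standardize_difficulty(6, minimum = 2, maximum = 10)  # returns 0.5
```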
We tracked central cue use by modeling the relationship between this predictor (standardized difficulty) and JODs. This allowed us to understand the extent to which participants tracked objective task difficulty as a cue. We tracked peripheral cue use using target identification, a dichotomous variable indicating whether the target was identified on a given trial. For succinctness, unsuccessful (0) and successful (1) identifications will be described as “misses” and “hits,” respectively. Similarly, for our outcome measure (JODs), we converted easier/harder JODs to 0’s and 1’s, respectively.
We used trial-in-block as a predictor to model changes in JODs that occurred as participants gained experience within a 70-trial difficulty block. We log-transformed trial-in-block to account for learning effects, which occurred most rapidly at the beginning of a 70-trial difficulty block, before participants adjusted to the change in difficulty condition (i.e., clicks, set size, and timing). Role order, a three-level categorical predictor, represented the between-subjects assignment (i.e., perform-first, observe-first, and interleaved). Prior to analysis, we effect coded all categorical predictors and median-centered continuous predictors so that the model intercept represented average cue use halfway through a block. We subset trials to include only those on which the objective task difficulty changed (i.e., every fifth trial). Additionally, we removed all feedback difficulty condition trials from the data, since the objective task difficulty did not change during those trials.
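A sketch of this preprocessing is shown below, assuming hypothetical column names (condition, trial_in_block, std_difficulty, target_identified, role_order, jod) and an assumed change-trial schedule; these are not necessarily the authors’ variable names.

```r
# A minimal preprocessing sketch under the stated assumptions.
library(dplyr)

dat <- read.csv("trials.csv")                  # hypothetical file name

dat <- dat |>
  filter(condition != "feedback") |>           # drop feedback-block trials
  filter(trial_in_block %% 5 == 1) |>          # keep trials on which difficulty
                                               # changed (assumed schedule)
  mutate(
    log_trial_c  = log(trial_in_block) - median(log(trial_in_block)),
    difficulty_c = std_difficulty - median(std_difficulty),
    hit_ec       = ifelse(target_identified, 1, -1),  # effect code: miss = -1
    jod          = ifelse(jod == "harder", 1, 0)      # outcome: harder = 1
  )

dat$role_order <- factor(dat$role_order)
contrasts(dat$role_order) <- contr.sum(3)      # effect code the 3-level factor
```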
Bayesian Regression
We used a Bayesian regression to answer RQ1 (how do people make JODs while observing automation?) and RQ2 (how does experience with a task, performed by oneself or by automation, impact JODs?). The model included the trial-in-block, standardized difficulty, target identification, and role order predictors, along with their two- and three-way interactions (see Appendix B for the full list of terms).
Bayesian analyses incorporate a study’s data with a prior distribution (often derived from the statistical results of past studies) to produce a posterior distribution that represents the possible values of model parameters, given this previously observed data (Gelman & Hill, 2014; Young, 2019). Prior distributions represent researchers’ current knowledge about a relationship; here, we derived our priors from a regression model of how humans make JODs while observing humans (Vangsness, 2019). By integrating these priors with the data from a study that involves a slightly different manipulation—such as how humans make JODs while observing automation—we can ascertain whether the new data is sufficiently different to conclude a unique process is at work in a second study (Young, 2019).
In lieu of p-values, Bayesian regressions generate 95% credible intervals (CI) for every model parameter, indicating likely intervals that contain the true parameter estimates. When new data falls within a prior’s CI, the CI narrows due to increased confidence in the estimate; data outside the CI shifts it. CIs excluding zero suggest a statistically significant effect (Gelman & Hill, 2007).
Media Equation Hypothesis versus Unique Agent Hypothesis
We informed the posterior distribution using the estimated regression weights from Vangsness (2019) as normally distributed priors. Estimates are provided in Appendix A and are available on the OSF (https://osf.io/8ghmt/?view_only=295d8f0aac9d40aa928913cf6b6e06fc). If humans treat automation like other humans (MEH), the prior distribution parameter estimates should fall within the 95% CIs of the posterior distribution. Conversely, if humans treat automation differently (UAH), the prior distribution estimates should fall outside the 95% CIs. We direct readers interested in learning more about this approach to Young (2019), who more thoroughly discusses the process of comparing posterior distribution estimates with prior distribution estimates. We estimated the posterior distribution by drawing 8,000 samples across four chains, discarding the first 1,000 samples per chain as a burn-in period. All R-hat values (fit indices) for each model parameter equaled one, indicating model convergence (Gelman & Hill, 2014). A summary of the parameter estimates, standard errors, and 95% CIs can be found in Appendix B and on the OSF.
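A sketch of the model specification in brms is shown below. It assumes the data frame `dat` and predictor names from the earlier preprocessing sketch; the formula reproduces the terms listed in Appendix B, the prior shown (for target identification) is taken from Vangsness (2019), and the by-participant random intercept is an assumption not stated in the text.

```r
# A hedged sketch of the Bayesian logistic regression under the stated
# assumptions; priors for the remaining coefficients would be set the same way.
library(brms)

fit <- brm(
  jod ~ log_trial_c * role_order * (hit_ec + difficulty_c) + (1 | participant),
  data   = dat,
  family = bernoulli(),
  prior  = set_prior("normal(1.17, 0.10)", class = "b", coef = "hit_ec"),
  chains = 4,
  iter   = 2000,   # 8,000 total draws across four chains
  warmup = 1000    # first 1,000 draws per chain discarded as burn-in
)

summary(fit)       # check that all R-hat values equal 1.00
```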
JODs While Observing Automation
[Table. Pairwise Comparisons Testing the Three-Way Interaction Between Peripheral Cues, Role Order, and Trial in Block. Note. B Human refers to the parameter estimates from Vangsness (2019); rows containing estimates in which the current study differs are bold.]
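For post-hoc comparisons like those reported in the table above and in Appendix C, a minimal sketch using emmeans (the package the analysis used) follows; the model object `fit` and predictor names come from the earlier sketches and are assumptions.

```r
# Sketch: estimate the standardized-difficulty slope (central-cue use) for
# each role order and contrast the groups pairwise, here at the block midpoint.
library(emmeans)

emtrends(fit, pairwise ~ role_order, var = "difficulty_c",
         at = list(log_trial_c = 0))  # 0 = median-centered block midpoint
```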
The Impact of Experience on JODs
The role order predictor manipulated participants’ source of experience with the task (i.e., performing first or observing automation first). Because participants switched roles halfway through a 70-trial block, the trial-in-block and role order predictors together provide an opportunity to understand the impact of experience on JODs. Observe-first participants became increasingly likely to weight misses over the course of a difficulty block (B = 0.25, 95% CI [0.05, 0.42]).
Central and Peripheral Cue Use for Humans versus Automated Systems
[Tables. Slope Estimates for the Two-Way Interaction Between Central Cues and Role Order; Slope Estimates for the Three-Way Interaction Between Central Cues and Role Order at Trials 36, 53, and 70; and Slope Estimates for the Three-Way Interaction Between Peripheral Cues, Role Order, and Trial in Block. Note. B Human refers to the parameter estimates from Vangsness (2019); rows containing estimates in which the current study differs are bold.]

[Figure. Bayesian (top) and logistic (bottom) regressions illustrating similarities in how humans weigh peripheral cues in JODs for automated systems and humans. Note. Panels represent whether the target was successfully identified (hit) or not (miss) on a given trial. The dashed line represents the halfway point at which participants shifted roles during a block (trial 35 for the automated system study and trial 40 for the human study). The bottom graph was reproduced from the data in Vangsness (2019) with permission.]
Discussion
We addressed two research questions: how do people make JODs while observing automation, and how does past experience with a task performed by oneself or by automation impact JODs? JODs were most strongly impacted by peripheral cues. Participants were most likely to rate the task as harder after receiving feedback that they (or the automated system) had missed the target. Although we intentionally manipulated central cues, participants did not use them to make JODs; this did not change as a function of role order or time. The weighting of misses significantly differed between perform-first and observe-first participants, with observe-first participants becoming increasingly likely to rate the task as harder over time in response to misses. Lastly, the results suggest that the process by which people make JODs for automation is both similar to and different from the way they make JODs for humans, given the previous evidence from a study involving humans (Vangsness, 2019). Support for the MEH was unequivocal with respect to peripheral cues; with respect to central cues, role order bounded support for the UAH: those who performed the task first underweighted central cues when making JODs during periods of automation observation, relative to their counterparts in the previous study involving humans.
Unique Agent Hypothesis versus Media Equation Hypothesis
Our analyses evaluated two hypotheses in the TiA literature. The results provided support for both the MEH and UAH: people used peripheral cues similarly while observing automation or humans, while those with earlier task experience differed in central cue use across observation sources. Although we did not manipulate who or what was observed, we used a Bayesian regression to compare these results with Vangsness (2019), who did. The posterior distribution credible intervals encompassed the parameter estimates from Vangsness (2019) for peripheral cue use, but not for central cue use among perform-first participants.
Similarities and Differences in Cue Weighting While Observing Other Humans and Automation
Regarding peripheral cue use, participants in our study viewed automation in the same way as other humans (i.e., MEH; Reeves & Nass, 1996), which suggests that there are similar judgment and decision making processes that underlie both situations (Epley et al., 2007; Madhavan & Wiegmann, 2007). Because the automated system mirrored the imperfect performance of humans, it is possible that participants might have relied on learned social scripts from past human–human interactions (Nass & Moon, 2000), which could explain the similarities in peripheral cue weighting across observation sources.
Regarding central cue use, perform-first participants in our study viewed automation differently from other humans (i.e., UAH; de Visser et al., 2016). It is possible that perform-first participants may have exhibited an automation bias (Dzindolet et al., 2002; Parasuraman & Manzey, 2010) while observing, which could explain the underweighting of central cues while observing automation—relative to the study involving humans. Together, these findings are consistent with the points highlighted in Madhavan and Wiegmann (2007)—although humans possess a propensity to apply rules governing human–human interactions to automated systems, there are subtle differences to the manner in which humans perceive and react to automation—relative to other humans. Lastly, these findings align with recent work which suggests that the MEH and UAH can be simultaneously supported (Alarcon et al., 2023; Cohen et al., 2021), underscoring the nuances of human attitudes towards automation.
Study Limitations/Future Directions
We instructed participants to observe passively, without the option to engage or disengage automation. An alternative interpretation, therefore, is that observation simply provided a period of task familiarization before participants completed the task themselves. This limits our ability to generalize our findings to contexts where people actively engage with automation (i.e., level 2–level 8 automation; Parasuraman et al., 2000). Future research should determine whether and how active engagement impacts how cues are used to make JODs. It is also possible that participants in the observe-first condition experienced high monitoring workload or automation-related vigilance decrements (Warm et al., 1996), providing an alternative explanation of the differences between observe-first and perform-first participants in peripheral cue use. Lastly, visual search performance is known to be related to individual difference factors (e.g., attention and memory abilities, video game experience); our work is limited in that it did not account for these factors.
Our research evaluated the extent to which humans track difficulty as a cue. Future research would benefit from understanding how people weight different cues when judging whether automation will succeed at a task and how humans weight different trust cues (e.g., automation performance; de Visser et al., 2014). It is worth noting that because we based our model priors on JOD data from human observation, there is an inherent bias in favor of the MEH. We encourage future studies to build on our work by using our parameter estimates as a prior distribution or by including a true between-subjects manipulation of observation source. Lastly, there is emerging research (de Visser et al., 2018; Somon et al., 2019) monitoring brain activity while people observe automation, or another human, perform a task (e.g., a flanker task). Future research would benefit from expanding our work to include such physiological measures.
Conclusion
Our work supports both the MEH and UAH, suggesting that the cues used to make JODs while observing automation are both similar to and different from those used while observing humans. Together, these findings underscore the nuances of our interactions with automation and highlight the moderating impact of experience: performing a task first can contribute to an underweighting of central cues while observing automation, while observing automation first can lead to lower anchored JODs that eventually come into alignment with those of participants with earlier task experience.
At the outset, automation came online to replace tasks previously performed by humans (Parasuraman & Riley, 1997; Rahwan et al., 2019). This context aligns with the perform-first condition in our paradigm, resembling instances where humans hand off tasks to highly automated systems (i.e., Level 10; Parasuraman et al., 2000). Much TiA research, however, involves humans observing automation perform unfamiliar tasks. Our work calls into question the generalizability of research in which humans observe automation performing tasks with which they have no previous experience.
Our findings serve as a warning to system designers: anchoring to easier JODs for automation could lead to inappropriate levels of reliance (Parasuraman & Riley, 1997). Understanding how experience moderates JODs can help designers create automation-use scenarios that more closely mirror real-life situations, which is crucial for promoting calibrated TiA.
Acknowledgments
We would like to thank Olivia Tipton, Laila Robinson, and Courtney Brown for their assistance with data collection. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not reflect the official guidance or position of the United States Government, the Department of Defense, or the United States Air Force.
Key Points
• Patterns of cue use while observing an automated system support the Media Equation and Unique Agent Hypotheses.
• JODs were most strongly impacted by peripheral cues to difficulty and were unaffected by central cues to difficulty.
• Task experience moderates JODs while observing automated systems. Observing automation first can lead to lower anchored JODs that eventually come into alignment with those of participants with earlier task experience. Performing a task first can contribute to an underweighting of central cues.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Appendix A
Parameter Estimates from Vangsness (2019). Note. Trial in block was centered on the median (M = 40.00). Hits and role order were effect coded, with misses serving as the {−1} baseline for Hits, and Observe-first serving as the {−1, −1} baseline for experimental condition.

Parameter                                                    B       SE      z       p
Intercept                                                   −0.23    0.09   −2.66    0.01
Trial in block                                              −0.03    0.04    0.73    0.47
Standardized difficulty                                      0.11    0.15    0.78    0.44
Target identification                                        1.17    0.10   12.09   <.001
Perform-first                                                0.21    0.12    1.72    0.08
Interleaved                                                 −0.17    0.12   −1.40    0.16
Standardized difficulty × Trial in block                     0.01    0.12    0.10    0.92
Standardized difficulty × Perform-first                      0.44    0.20    2.19    0.03
Standardized difficulty × Interleaved                       −0.15    0.20   −0.78    0.44
Trial in block × Perform-first                               0.08    0.06    1.28    0.20
Trial in block × Interleaved                                −0.10    0.06   −1.75    0.08
Trial in block × Target identification                      −0.02    0.04   −0.41    0.68
Target identification × Perform-first                        0.18    0.13    1.33    0.19
Target identification × Interleaved                          0.13    0.13    0.99    0.32
Trial in block × Target identification × Perform-first       0.02    0.17    0.05    0.96
Trial in block × Target identification × Interleaved        −0.10    0.06   −1.73    0.08
Trial in block × Standardized difficulty × Perform-first     0.02    0.06    0.40    0.69
Trial in block × Standardized difficulty × Interleaved      −0.10    0.06   −1.73    0.08
Appendix B
Parameter Estimates from the Bayesian Regression Predicting JODs. Note. Trial in block was centered on the median (M = 35.00). Hits and role order were effect coded, with misses serving as the {−1} baseline for Hits, and Observe-first serving as the {−1, −1} baseline for experimental condition.

Parameter                                                    B       SE     95% Credible Interval
Intercept                                                   −0.22    0.07   [−0.35, −0.08]
Trial in block                                               0.03    0.03   [−0.03, 0.10]
Standardized difficulty                                     −0.01    0.10   [−0.21, 0.19]
Target identification                                        1.19    0.07   [1.04, 1.33]
Perform-first                                                0.22    0.09   [0.05, 0.40]
Interleaved                                                 −0.10    0.09   [−0.28, 0.08]
Standardized difficulty × Trial in block                    −0.01    0.10   [−0.20, 0.18]
Standardized difficulty × Perform-first                      0.17    0.13   [−0.09, 0.41]
Standardized difficulty × Interleaved                       −0.16    0.14   [−0.42, 0.11]
Trial in block × Perform-first                               0.01    0.05   [−0.08, 0.10]
Trial in block × Interleaved                                −0.11    0.05   [−0.20, −0.02]
Trial in block × Target identification                      −0.03    0.03   [−0.10, 0.03]
Target identification × Perform-first                        0.09    0.09   [−0.09, 0.27]
Target identification × Interleaved                          0.17    0.10   [−0.02, 0.36]
Trial in block × Target identification × Perform-first      −0.04    0.05   [−0.13, 0.05]
Trial in block × Target identification × Interleaved        −0.11    0.05   [−0.20, −0.01]
Trial in block × Standardized difficulty × Perform-first    −0.13    0.14   [−0.40, 0.14]
Trial in block × Standardized difficulty × Interleaved      −0.15    0.14   [−0.42, 0.13]
Appendix C
Pairwise Comparisons Testing Participants’ Central Cue Use Over Time.

Trial 1: difficulty slope (central cue)
Contrast                         B       95% Credible Interval
Perform-first − Interleaved      0.23    [−1.34, 1.73]
Perform-first − Observe-first    1.65    [−0.44, 3.51]
Interleaved − Observe-first      1.43    [−0.68, 3.41]

Trial 17: difficulty slope (central cue)
Contrast                         B       95% Credible Interval
Perform-first − Interleaved      0.31    [−0.20, 0.81]
Perform-first − Observe-first    0.47    [−0.17, 1.05]
Interleaved − Observe-first      0.16    [−0.46, 0.79]

Trial 35: difficulty slope (central cue)
Contrast                         B       95% Credible Interval
Perform-first − Interleaved      0.32    [−0.08, 0.74]
Perform-first − Observe-first    0.18    [−0.34, 0.67]
Interleaved − Observe-first     −0.15    [−0.66, 0.38]

Trial 36: difficulty slope (central cue)
Contrast                         B       95% Credible Interval
Perform-first − Interleaved      0.32    [−0.08, 0.74]
Perform-first − Observe-first    0.16    [−0.37, 0.65]
Interleaved − Observe-first     −0.17    [−0.68, 0.36]

Trial 53: difficulty slope (central cue)
Contrast                         B       95% Credible Interval
Perform-first − Interleaved      0.34    [−0.13, 0.78]
Perform-first − Observe-first    0.01    [−0.58, 0.57]
Interleaved − Observe-first     −0.33    [−0.93, 0.24]

Trial 70: difficulty slope (central cue)
Contrast                         B       95% Credible Interval
Perform-first − Interleaved      0.35    [−0.17, 0.86]
Perform-first − Observe-first   −0.11    [−0.79, 0.54]
Interleaved − Observe-first     −0.45    [−1.14, 0.20]
