Sage Journals: Discover world-class research

Abstract

Objective

We investigate the impact of event uncertainty, decision support (DS) display format, and DS sensitivity on participants’ behavior, performance, subjective workload, and perception of DS usefulness and performance in a classification task.

Background

DS systems can positively and negatively affect decision accuracy, performance time, and workload. The ability to access DS selectively, based on informational needs, might improve DS effectiveness.

Method

Participants performed a sensory classification task in which they were allowed to request DS on a trial-by-trial basis. DS was presented in separated-binary (SB), separated-likelihood (SL), or integrated-likelihood (IL) formats. Access preferences, task performance, performance time, subjective workload, and perceived DS usefulness and performance were recorded.

Results

Participants accessed DS more often when it was highly sensitive, stimulus information was highly uncertain, or the DS cue and stimulus information were perceptually integrated. Effective sensitivity was highest with the integrated likelihood DS. Although the separated likelihood DS provided more information than the binary likelihood DS, it was accessed less often, leading to lower sensitivity.

Conclusion

Participants are most likely to access DS when raw stimulus information is highly uncertain and appear to make effective use of likelihood DS only when DS cues are integrated with raw stimulus information within a display.

Application

Results suggest that DS use will be most effective when likelihood DS cues and raw stimulus information are perceptually integrated. When DS cues and raw stimuli cannot be perceptually integrated, binary cues from the DS will be more effective than likelihood cues.

Keywords

human–automation interaction decision making display design principles

Introduction

Studies on the use of decision support (DS) typically manipulate the availability of support between blocks of experimental trials. Consequently, participants have DS in situations in which the decision was difficult and in situations in which it was easy to decide even without the DS. Here, we examine the decision to use DS at the single-trial level as a function of trial difficulty, the quality of the information provided by the DS, and the DS display format.

The performance of a DS system is determined by the information the DS receives and the DS algorithm. Responses to DS depend not only on the performance of the DS but also on its display format. For example, it was more difficult for decision makers to ignore unhelpful cues from a DS when they were integrated with the raw data than when they were spatially separated (Meyer, 2001). This pattern is consistent with the Proximity Compatibility Principle (PCP) of display design (Wickens & Carswell, 1995), which holds that the perceptual grouping of display channels facilitates the integration of information from multiple sources (see also, Ayres & Sweller, 2005; Moreno & Mayer, 1999).

A related aspect of DS design is whether the DS judgment should be shown as a simple binary cue or in a finer grain. Sorkin et al. (1988) proposed the idea of likelihood alarms, diagnostic DS systems, initially explored within industrial process control settings, highlighting the potential of graded measures. In theory, DS systems that provide confidence estimates should engender better human performance than binary DS systems because they provide more information (Bahrami et al., 2010; Bartlett & McCarley, 2017). However, empirical comparisons of binary and likelihood DS have produced mixed results. Several studies compared the effectiveness of graded decision support (DS) cues to binary ones, finding no significant advantage. Bartlett and McCarley (2017) had participants act as geologists sorting fictional minerals by identifying the dominant color in orange and blue random-dot images, aided by a DS that provided judgments in binary or likelihood display forms. Carragher and Hancock (2023) tasked participants with face matching; they manipulated the existence of an Automated Facial Recognition System with binary or likelihood output from the system. Pharmer et al. (2022) conducted a controlled experiment to assess the influence of DS transparency in a maritime collision avoidance scenario. Participants either received recommendations from a collision avoidance system or the same recommendations alongside confidence ratings, revealing the system’s certainty in its suggestions. Wickens and Colcombe (2007) assessed pilots managing a dual task of tracking and conflict detection with binary or likelihood alerts.

Other research, though, has suggested the superiority of graded DS cues over binary. Zirk et al. (2020) presented a control room operation task requiring the management of two simultaneous tasks with the support of a binary or a likelihood alarm system. Participants showed improved decision making strategies with graded DS. Similarly, Meyer et al. (2020) engaged participants in distinguishing between malicious and innocuous downloads, finding that graded DS cues led to quicker decisions and greater trust in the system. These mixed results can be attributed to a range of factors. The effectiveness of binary versus likelihood cues may vary significantly depending on the complexity of the task and the context in which it is applied. For simpler tasks, binary cues might suffice, whereas more complex decision making scenarios could benefit from the nuanced information from likelihood cues (Sorkin et al., 1988). The cognitive load can also be a significant factor; users might prefer binary cues as they are easier to process. However, when users can manage cognitive demands, likelihood cues could offer advantages by enabling more informed decision making (Sorkin et al., 1988). An additional factor might be individual differences. This complexity emphasizes the importance of tailoring DS to their intended use cases’ specific needs and conditions to support optimal decision making.

When users choose whether to consult DS support, they often choose nonoptimally. As expected, they consult a reliable DS more often than a less reliable one (Montanari & Nunnari, 2019) and can generally distinguish when it is worth purchasing information (Ben Yaakov et al., 2019; Pescetelli et al., 2021). However, they sometimes access information that is not worth the cost and sometimes decline information that would benefit them (Ben-Yaakov et al., 2021). A possible explanation for this nonoptimal behavior is “confirmation bias,” a tendency for people to gather information supporting their beliefs even when the additional information is worthless (Baron & Hershey, 1988). Evidence of “confirmation bias” can be seen in recent studies where operators seek confirmation when using DS, preferring to consult systems that are biased to agree with them (Montanari & Nunnari, 2019). We predict that participants will adapt their decision in response to raw information quality, showing a higher tendency to use DS in ambiguous events. This prediction aligns with the idea that individuals are more inclined to seek external support when faced with high uncertainty (Pescetelli et al., 2021).

Studies have also addressed the effect of using DS on different aspects of performance. Many (e.g., Bartlett & McCarley, 2017, 2019; Ben Yaakov et al., 2020; Boskemper et al., 2022; Munoz et al., 2022; Wang et al., 2009) found that DS improved decision accuracy in signal detection tasks and could reduce operator workload (Rottger et al., 2009; Todd & Benbasat, 1992) and shorten decision times (Botzer et al., 2013). However, DS is not guaranteed to be helpful (Parasuraman & Manzey, 2010) and can even compromise performance (Wickens & Dixon, 2007). In a previous study, we found that participants could identify when to acquire additional information but did not invest optimally. Overinvestment occurred when the expected benefit was low, and underinvestment happened when benefits were high. This suboptimal investment behavior extended to DS utilization, suggesting a discrepancy between actual and optimal use (Ben Yaakov et al., 2019).

This study introduces a novel approach to DS implementation by allowing users to access DS on a trial-by-trial basis, diverging from traditional block design methodologies where DS is either constantly available (e.g., Bartlett & McCarley, 2017; Meyer et al., 2020; Pharmer et al., 2022) or its usage is predetermined by users at the start of a block (Ben Yaakov et al., 2019; Ben-Yaakov et al., 2021). Our trial-by-trial design requires users to consider whether to actively seek additional information, unlike traditional block designs where users decide whether to use or ignore available information. This innovation seeks to mimic real-world decision making scenarios more closely, where the need for additional information can vary significantly from one situation to another, depending on, among other things, the uncertainty users encounter. By employing the trial-by-trial approach, the study explores how DS Display Format and event uncertainty affect access decisions to use DS, task performance, performance time, perception of DS usefulness and performance, and cognitive workload. This contribution bridges a gap in the literature regarding the dynamics of DS utilization, offering theoretical insights into how and in which conditions individuals choose to engage with DS actively, as the current literature does not differentiate when DS information is used or ignored. The approach can potentially inform the design of more effective DS systems better aligned with users’ decision making needs, ultimately enhancing performance in complex tasks.

Method

Participants

We recruited 171 participants (mean age = 59.71 years, SD = 16.67, range = 18–99, 94 female and 77 male) from the sourcing platform Cloud Research. Participants were paid according to the platform’s payment systems and received a base payment of US$1.5 for their participation. In a pilot study, we examined the effect of stimulus strength and DS Format on DS access, revealing an interaction odds ratio of 5.26. We then converted this odds ratio to Cohen’s d effect size, resulting in a computed effect size of 0.91 (Cohen, 1988). For the main study’s sample size determination, we conducted a power analysis using R (version 4.2.2) software with the “pwr” package (version 1.30) for a two-sample t-test. We used the computed effect size of .91, alpha = .05, and power = .8. The analysis, based on a two-sample t-test, suggested a sample size of 20 participants per group. Additionally, we conducted a power analysis using G*Power (3.1.9.4) for ANOVA: Fixed effects, special, main effects, and interactions. With a Cohen’s f effect size of .46, and maintaining the same alpha (.05) and power (.80) settings, the analysis indicated a required sample size of 9 participants per group. However, to ensure robust results across our study’s various conditions, we opted to adhere to the more conservative sample size estimate derived from the two-sample t-test analysis. Therefore, we set the minimum number of participants in each condition at 20.

Apparatus

We developed the experiment as a web-based system with a back-end side implemented with a Python infrastructure (using the Django library), a database (MySQL), and a front-end side developed in HTML. Participants performed the experiment on their own devices, which were restricted to desktop or laptop computers. The research complied with the American Psychological Association Code of Ethics and was approved by the Tel Aviv University Ethics Committee for Research with Human Participants. Informed consent was obtained from each participant.

Stimuli

Stimuli were 426 × 426 orange and blue random-dot images, where each dot was assigned the color orange with probability p_O and the color blue with probability 1-p_O. The value of p_O in each trial was determined by randomly sampling a value X_H from a Gaussian distribution, computing the Probability Density Function (PDF) under the Orange and Blue hypothesis, and then computing the posterior probability of orange given X_H using Bayes’ theorem, as can be seen in equations (1)–(3):

P (X_{H} | O r a n g e) = G a u s s i a n P D F (X_{H}, μ_{O r a n g e}, σ_{O r a n g e})

(1)

P (X_{H} | B l u e) = G a u s s i a n P D F (X_{H}, μ_{B l u e}, σ_{B l u e})

(2)

P_{0} = P (O r a n g e | X_{H}) = \frac{P (X_{H} | O r a n g e)}{P (X_{H} | O r a n g e) + P (X_{H} | B l u e)}

(3)

Values of X_H images were sampled with equal probability from distributions with μ = −.75 for the orange-dominant stimuli and μ = .75 for the blue-dominant stimuli with σ = 1 for both distributions. As described below, the participant’s task was to discriminate orange-dominant from blue-dominant stimuli. The best-achievable d', as determined by the variability in the stimuli, was approximately 1.5.

Procedure

In each trial, participants decided whether the stimulus image reflected a blue-dominant or orange-dominant distribution (see Bartlett & McCarley, 2017; Voss et al., 2004). A cover story asked them to imagine themselves as geologists sorting samples of a fictional mineral, vibranium, into blue and orange strains. They were instructed:

“Unfortunately, the two strains are difficult to tell apart. Both are speckled blue and orange. The only visual difference is that one strain tends to have a little more orange, and the other tends to have a little more blue. For simplicity, we will call them VBN-ORANGE and VBN-BLUE. However, there is a lot of overlap in their appearance, and it is almost impossible to sort them with 100% accuracy by eye.”

A point system was used to incentivize performance. Participants began the experiment with 30 points and were awarded 2 points for each correct response and penalized 2 points for each incorrect response.

Participants could request assistance from DS for 0.1 points per trial. The DS displayed its cue in one of three formats, as shown in Figure 1: Separated-Binary (SB), Separated-Likelihood (SL), or Integrated-Likelihood (IL). The sensitivity of the DS was low (d' = .75; accuracy = 65% for unbiased binary judgments) or high (d’ = 2.25; accuracy = 87% for unbiased binary judgments). All DS systems operated by first sampling an evidence value X_DS from a normal distribution with σ = 1. For the low-sensitivity DS, evidence values were sampled from a distribution with μ = −0.375 for orange and μ = 0.375 for blue-dominant trials. For the high-sensitivity DS, evidence values were sampled from a distribution with μ = −1.125 for orange-dominant trials and μ = 1.125 for blue-dominant trials. Using an unbiased response criterion, the SB DS rendered a binary orange or blue judgment. Binary cues were displayed as an orange or blue colored bar beneath the random dot stimulus. The SL DS rendered its judgment as a random dot image in which $p_{0}$ was determined similar to the human $p_{0}$ using X_DS, and presented alongside the original stimulus image. Cues in the SB and SL Display Formats were presented as graphic displays rather than verbal descriptions to maintain consistency with the visual nature of the human stimuli (Figure 1(a) and (b)).

Figure 1.

Original stimulus (panels A and B) and the DS display formats for the SB (panel C), IL (panel D), and SL (panel E) formats. Stimuli are represented as smaller in the SL condition in this figure for clarity and do not reflect actual experimental sizes. (a) A sample orange-dominant stimulus image (p = .4); (b) a sample blue-dominant stimulus image (p = .4); (c) a sample orange-dominant stimulus image (p = .4), with the SB display format; (d) IL display format, orange-dominant stimulus image (p = .3), based on the orange-dominant stimuli in (a) and (b) and the DS informations; (e) SL sample orange-dominant stimulus image (p = .4) for the human stimulus (left) and (p = .2) for the DS stimulus (right).

The IL DS integrated the aid’s evidence value with the human’s evidence by updating the random dot display, shifting the proportion of blue dots in the direction suggested by the aid’s observation. As illustrated in Figure 1(c), the proportion of orange dots in the IL display is increased relative to the original stimulus shown in Figure 1(a), reflecting the DS’s judgment. The proportion of blue dots in the IL display was determined by using the probability of X_H as the prior probability for a Bayesian calculation of the posterior probability in light of X_DS, as shown in equation (4). Here, $P (B l u e | X_{H}, X_{D S})$ is the conditional probability of the event being blue-dominant given both human $X_{H}$ and system $X_{D S}$ , that is, the posterior probability of the event being blue-dominant given the Bayesian-optimal combination of both the human and DS observations.

P (B l u e | X_{H}, X_{D S}) = \frac{P_{S} \cdot P (X_{D S} | B l u e) \cdot P (X_{H} | B l u e)}{P_{S} \cdot P (X_{D S} | B l u e) \cdot P (X_{H} | B l u e) + (1 - P_{s}) \cdot P (X_{D S} | O r a n g e) \cdot P (X_{H} | O r a n g e)}

(4)

Each participant was randomly assigned to one of the 6 between-subjects conditions, resulting from the factorial combination of 3 DS formats and 2 levels of DS sensitivity. Due to factors related to the assignment process used by the experimental software and the platform on which participants were recruited, the between-subjects conditions were unbalanced, as shown in Table 1.

Table 1.

Number of participants in Each Experimental Condition.

DS Sensitivity	DS Format
DS Sensitivity	Separated Binary (SB)	Separated Likelihood (SL)	Integrated Likelihood (IL)
Low sensitivity	34	31	37
High sensitivity	24	21	24

Participants performed the task in 3 blocks. During the first 2 blocks, each consisting of 16 trials, the participants familiarized themselves with the task and the system. The first block was without the DS aid. In the second block, the aid was available by default on all trials. We did not analyze the results of these blocks. The third block consisted of 50 trials, half sampled from the distribution with more blue and the other half with more orange dots. Values of X_H and X_DS for all trials were pregenerated. Values of X_H were similar for all participants, and values of X_DS were the same for all participants within a given level of DS sensitivity. The order of trials was randomized between subjects.

We recorded DS access choice (yes or no), classification decisions (blue or orange), and classification time for each trial. At the end of the experiment, participants reported their subjective workload by completing the NASA-TLX questionnaire separately for events for which they had and had not used the DS. They also answered two questions regarding their perception of DS’s usefulness and performance on a 1-10 scale: (1) How much did the automation help you in the task? (2) How good was the automation in distinguishing between blue and orange Vibranium strains?

Results

To address the unbalanced design in our data analyses, we employed Type III sums of squares in our ANOVA. For post-hoc comparisons, we used Tukey’s HSD test, which is robust to unbalanced designs and effectively controls the family-wise error rate. We also used mixed regressions which efficiently address unbalanced group sizes.

Tendency to access DS

We conducted a two-way analysis of variance (ANOVA) to compare the effects of the DS format (SB, SL, or IL) and Sensitivity (low or high) on the tendency to access the DS. Figure 2 and Table 2 present the data and results of the analysis. Both main effects and the interaction were significant. Simple-effects ANOVAs for each level DS sensitivity showed that the DS format had no significant effect when DS sensitivity was low, $F (2, 82) = 0.6, η^{2} = . 01, p = . 55$ , but significantly affected the tendency to access DS when its sensitivity was high, $F (2, 83) = 47.76, η^{2} = . 54, p < 0.01$ . Pairwise comparisons using Tukey’s HSD test showed that the IL DS $(M = . 47, S D = . 06)$ was accessed more in the high-sensitivity condition than the SL DS $(M = . 33, S D = . 05), p < . 001$ , or the SB DS $(M = . 42, S D = . 05), p < . 01$ . The SB DS was also accessed more than the SL DS, p < .001, even though the SL cues conveyed more information.

Figure 2.

Tendency to access DS as a function of the DS format and Sensitivity. Note: In this and other data figures, violin plots represent distributions of empirical scores. Violin width is scaled to the number of observations.

Table 2.

ANOVA Results for the Tendency to Access DS Depending on the DS Display Format and Sensitivity.

	$d f$	$F$	$η^{2}$	$M S e$
DS display format	2	38.57***	.03	.08
DS sensitivity	1	2248***	.88	4.68
Display format X sensitivity	2	35.28***	.03	.07

***p < .001 **p < .01, *p < .05.

To analyze data at the single-trial level, we computed two independent variables for each trial: the strength of the signal, measured as the absolute difference of the proportion of dominant-color dots from 0.5, and the direction of the deviation, either Blue or Orange. We analyzed the decision to access the DS as a function of the strength and direction of the signal, display format, and DS sensitivity using mixed effects logistic regression. Participants served as a random effect, and display format, sensitivity, strength, and direction and the two-way interaction of Display format × DS Sensitivity were fixed effects. Figure 3 and Table 3 present the data and results of the model.

Figure 3.

Tendency to Access DS as a function of the DS Sensitivity, DS Format, Signal Strength, and Signal Direction. Note: Data from the low and high DS sensitivity conditions are presented in the left and right panels, respectively.

Table 3.

Results of the Mixed Effect Logistic Regression on the Tendency to Access DS.

	B	SE b	95% CI
Baseline	2.26***	.1	(7.89, 11.59)
Strength	−9.13***	.26	(.0, .0)
Direction [+]	−.02	.06	(.86, 1.11)
DS display format [LB]	−.57***	.09	(.48, .67)
DS display format [IL]	.29**	.09	(1.12, 1.62)
DS sensitivity [low]	−2.8***	.13	(.05, .08)
Display format [SL] X sensitivity [low]	.42*	.19	(1.05, 2.2)
Display format [LI] X sensitivity [low]	.42*	.18	(.46, .93)

***p < .001 **p < .01, *p < .05.

Performance

We computed the effective sensitivity (d') and the participants’ total scores to compare the performance between DS formats and sensitivity levels. We analyzed them with the same ANOVA as before. Figure 4 and Table 4 present data and results of the model. Both main effects and the interaction were significant for both measures. Separate one-way ANOVAs for each level of DS sensitivity showed that for participants working with the low-sensitivity DS, the format had no significant effect on the effective sensitivity, $F (2, 82) = . 96, η^{2} = . 02, p = . 39$ , or the score $F (2, 82) = 1.2, η^{2} = . 03, p = . 31$ . For participants working with the high-sensitivity DS, both effective sensitivity, $F (2, 83) = 5, η^{2} = . 1, p = 0.009$ , and score, $F (2, 83) = 5.2, η^{2} = . 11, p = 0.007$ , differed between formats. Pairwise comparisons between formats of the high-sensitivity DS, with Tukey’s HSD test, showed that the effective sensitivity was higher with the IL DS $(M = 2.27, S D = . 27)$ than with the SL format $(M = 1.93, S D = . 43), p < . 01$ . The IL format also produce higher scores $(M = 101.15, S D = 5.23)$ than did the SL format $(M = 93.27, S D = 10.81), p < . 01$ .

Figure 4.

Distributions of Effective Sensitivity (left) and Score (right) as a function of the DS display format and sensitivity.

Table 4.

ANOVA Results for Effective Sensitivity and Score, Depending on the DS Display and Sensitivity.

4	Effective Sensitivity					Score
4	$d f$	$F$	$η^{2}$	$M S e$	$F$	$η^{2}$	$M S e$
DS display format	2	4.35*	.01	.84	5.02**	.01	584
DS sensitivity	1	295***	.62	28.4	298***	.62	17,351
Interaction	2	5.6**	.02	1.08	5.78**	.02	671

***p < .001 **p < .01, *p < .05.

Classification time

We analyzed the classification time, measured from stimulus onset to the execution of the classification response, including the decision whether to access DS, at the single-trial level. For each trial, we computed the strength of the signal and the direction of the deviation as defined above. We used mixed effects linear regression to analyze the relationship between classification time and the strength and direction of the signal, display format, DS sensitivity, and the participants’ decision on whether to access the DS. Participants served as a random effect, and the AD, display format, DS sensitivity, strength, direction, and all two-way interactions of AD with the other factors were fixed effects. Figure 5 and Table 5(a) present data and results of the model. Classification time was longer when participants accessed the DS $(M = 6.46, S D = 2.9)$ than when they did not $(M = 3.99, S D = 1.39)$ . To analyze the interaction between access decision and DS format, we conducted Mixed Effects Linear Regression analyses separately for trials where the DS was and was not accessed. Table 5(b) and (c) summarize the results of the models with and without DS, respectively. Post-hoc analyses using the Sidak correction for multiple comparisons found that all pairwise differences between display formats were significant for trials on which the DS was accessed, $p s < . 001$ . Classification times were shorter for participants who used SB display format $(M = 5.89, S D = 2.19)$ than for participants who used IL display format $(M = 6.61, S D = 3.21)$ or SL display format $(M = 7.1, S D = 3.25)$ . They were also longer with low-sensitivity DS $(M = 8.03, S D = 3.51)$ than with high-sensitivity DS $(M = 6.2, S D = 2.7)$ .

Figure 5.

Classification time distributions as a function of display format and DS sensitivity for events on which DS was accessed (left) and not accessed (right).

Table 5.

Results of the Mixed Effect Linear Regressions on Classification Time.

		b	SE b	95% CI
(a)	Baseline	4.35***	.08	(4.2, 4.51)
	AD [True]	3.1***	.12	(2.86, 3.34)
	Strength	−1.14***	.16	(−1.46, −.82)
	Direction [+]	−.07	.04	(−.15, .02)
	DS display format [SL]	.04	.05	(−.06, .15)
	DS display format [LI]	.04	.05	(−.07,.14)
	DS sensitivity [low]	.01	.05	(−.08, .1)
	AD [True]: Strength	−7.1***	.36	(−7.8, −6.4)
	AD [True]: Direction [+]	−.35***	.09	(−.52, −.17)
	AD [True]: DS display format [SL]	1.04***	.11	(.82, 1.25)
	AD [True]: DS display format [IL]	.46***	.11	(.25, .67)
	AD [True]: DS sensitivity [low]	2.04***	.12	(1.8, 2.27)
(b)	Baseline	7.44***	.14	(7.17, 7.72)
	Strength	−8.25***	.47	(−9.18, −7.33)
	Direction [+]	−.41***	.12	(−.64, −.19)
	DS display format [SL]	1.09***	.14	(.81, 1.38)
	DS display format [IL]	.5***	.14	(.22,.79)
	DS sensitivity [low]	2.04***	.17	(1.72, 2.38)
(c)	Baseline	4.35***	.06	(4.23, 4.47)
	Strength	−1.14***	.13	(−1.4, −.89)
	Direction [+]	−.07	.03	(−.13, .00)

***p < .001 **p < .01, *p < .05.

Subjective workload

A three-way Mixed ANOVA was performed to test for differences in the NASA-TLX ratings between display formats and DS sensitivity levels (between-subject variables) and the questionnaire (within-subject variable). Figure 6 and Table 6 present the data and results of the model. All effects were statistically significant.

Figure 6.

Subjective workload distributions for the DS display formats and Sensitivity with, and without the DS.

Table 6.

ANOVA Results for Subjective Workload, Depending on the DS’s Display Format and Sensitivity and Whether the DS Was Used.

	$d f$	$F$	$η_{g}^{2}$	$M S E$
DS’s display format	2	869.1***	.83	3169
DS’s sensitivity	1	1145.37***	.77	5469
Questionnaire	1	3328.98***	.91	17,853
Display format X sensitivity	2	14.13***	.08	67.87
Display format X questionnaire	2	794.04***	.83	3313
Sensitivity X questionnaire	1	933.57***	.75	4882
3-way interaction	2	8.17***	.05	42.95

***p < .001 **p < .01, *p < .05.

No effects were significant in a two-way ANOVA for the events without the DS. In the analysis for events with the DS, the main effects of DS format and Sensitivity were significant, as seen in Table 7(a). Therefore, we conducted a one-way ANOVA for each level of DS sensitivity for the questionnaire regarding events with DS. The effect of format was significant for both high and low-sensitivity DS, as seen in Table 7(b) and (c). Post-hoc analyses using Tukey’s HSD test indicated that all pairwise differences between displays were significant for both levels of Sensitivity. Workload was highest for the SL condition and lowest for the IL condition. In the high-sensitivity IL DS condition, participants reported lower workload with the DS

(M = 23.48, S D = 2.46)

than without it

(M = 29.29, S D = 2.17)

, (see Table 7(d)). In contrast, other conditions exhibited the opposite workload patterns, with a higher workload with DS.

Table 7.

ANOVA Results for Subjective Workload When Using DS.

		$d f$	$F$	$η^{2}$	$M S e$
(a)	Display format	2	1347.84***	.68	6470
	DS’s sensitivity	1	2154.38***	.43	10,342
	Display format X sensitivity	2	20.97***	.01	101
(b) Low sensitivity	Display format	2	1031***	.96	4262
(c) High sensitivity	Display format	2	748.9***	.95	4090
(d) High sensitivity–IL	Questionnaire	1	18.26*	.12	354.38

***p < .001 **p < .01, *p < .05.

Perceived usefulness and performance of the DS

Responses to the two questions in the perceived usefulness and performance (PUP) questionnaire were strongly correlated $r (169) = . 79, p < . 001$ . We averaged them to create a single estimate of PUP per participant. Figure 7 and Table 8 present the model’s data and results. A two-way ANOVA with DS format and Sensitivity as factors showed that both main effects and the interaction were significant.

Figure 7.

PUP distributions as a function of the DS format and Sensitivity.

Table 8.

ANOVA Results for PUP, Depending on the DS Display Format and Sensitivity.

	Perceived Usefulness and Performance
	$d f$	$F$	$η^{2}$	$M S e$
DS display format	2	3.19*	.02	5.3
DS sensitivity	1	327.42***	.59	542.9
Display format X sensitivity	2	26.14***	.09	43.3

***p < .001 **p < .01, *p < .05.

Separate one-way ANOVAs for each sensitivity level showed significant effects of format both when DS sensitivity was low and when it was high, $F (2, 82) = 3.37, η^{2} = . 08, p < 0.04$ , and $F (2, 82) = 37.22, η^{2} = . 47, p < 0.001$ , respectively. Pairwise comparisons using Tukey’s HSD test revealed that for the participants with the low-sensitivity DS, only the IL $(M = 1.84, S D = 1.74)$ and the SB $(M = 2.79, S D = 1.22)$ formats differed significantly, p $< . 05 .$ Surprisingly, participants perceived the binary DS as more useful and effective than the IL DS even though it provided less information. All pairwise differences between display formats were significant for participants aided by the high-sensitivity DS, all $p < . 01$ . PUP was highest for the IL system $(M = 7.38, S D = 1.6)$ , lower for the SL DS $(M = 5.93, S D = . 93)$ , and lowest for the SB DS $(M = 4.79, S D = . 78)$ .

Discussion

We examined the selective decision to use DS on a trial-by-trial basis as a function of event uncertainty and DS display format and reliability. We compared three DS formats. In the first two conditions, the raw data and cues from the DS were perceptually separated, requiring participants to mentally integrate them. These conditions differed in that the DS showed either binary or likelihood information. In the third condition, the DS integrated raw data and the DS cue and presented them in a likelihood display. This integration process is essentially the “automation” of information integration. We investigated the impact of information presentation on preferences to use the DS, task performance, performance time, subjective workload, and perceived usefulness and performance (PUP) of the DS.

Participants were more likely to access DS when stimulus uncertainty was high, a result consistent with previous studies showing that users can identify situations where accessing additional information is beneficial (Ben Yaakov et al., 2019; Ben Yaakov & Meyer, 2024; Ben-Yaakov et al., 2021; Pescetelli et al., 2021). Participants were also influenced by the DS format when choosing to access support, but only when the DS was highly sensitive. In that case, participants accessed the IL DS more often than either form of separated DS, suggesting that they preferred cues that were integrated with raw stimulus information. Amongst the separated DS formats, participants used the binary cues more than the likelihood cues despite the likelihood cues providing more information. Participants rarely accessed the low-sensitivity DS regardless of its display format, and accordingly, the display format did not affect performance with low-sensitivity DS. With high-sensitivity DS, participants using the IL format had higher effective sensitivity and scores than those using the SL format. Still, no significant difference was found between the IL and SB display formats. These findings differ from previous ones (Meyer, 2001), who found an effect of presentation form only for low-sensitivity DS sensitivity. The difference may have resulted from the stronger tendency to access the high-sensitivity DS in the current study. Future studies should isolate the effects of access choice. Regardless of the display format, participants reached an overall sensitivity with the better DS that was lower than the Sensitivity of the DS alone, which means that automating the task and letting the DS alone make the decision would have led to better performance than having the participants involved in the decision process (Meyer, 2001; Wickens & Dixon, 2007). However, it is crucial to note that this does not necessarily imply a recommendation to exclude the human operator from the loop. Rather, it underscores the importance of carefully considering the role and interaction between human operators and DS to optimize performance.

Importantly, our data suggest that a barrier to using likelihood aids is the challenge of effectively integrating the information they provide with the raw data. These results align with the mixed findings observed in prior research regarding unintegrated likelihood versus binary cues, with some indicating that graded DS cues were not more helpful than binary cues (Bartlett & McCarley, 2017; Carragher & Hancock, 2023; Pharmer et al., 2022; Wickens & Colcombe, 2007), and others indicating that likelihood led to improved performance (Meyer et al., 2020; Zirk et al., 2020). The results underscore the need for further investigations into the factors influencing the choice and efficacy of DS with various display formats.

Our results also point to a tradeoff between the effects of the DS on performance and classification time. Here, classification times were significantly longer for trials in which the participant accessed DS than for trials in which they did not. This contrasts with the results of Botzer et al. (2013), who found that DS reduced decision times. A possible explanation for the difference is that in our experiment, unlike in Botzer et al. (2013), DS use was selective, and participants tended to invoke support only when facing highly uncertain events. This would have affected classification times in two ways. First, as the data showed, more certain information engendered shorter classification times regardless of the decision to access DS. Mean classification time will, therefore, tend to be longer for high-uncertainty events than for a mixture of high- and low-uncertainty events. Second, classification times on DS-aided trials included the time needed to access the DS in the current experiment, whereas those in Botzer et al. (2013) did not. When the DS was accessed, classification time was shorter for participants using high-sensitivity DS than for those using low-sensitivity DS, as expected. Binary DS format led to faster classification than likelihood formats, presumably reflecting the additional time needed to interpret the continuous cues.

Allowing users to choose DS on a trial-by-trial basis instead of having it by default or deciding at the block level carries various implications that warrant exploration. Most important is the possibility that selective access might improve overall DS effectiveness. On the one hand, providing DS by default ensures that users have access to additional information even when they might ignore it. However, it carries the risk that when DS gives a poor recommendation, it could lead to errors, especially in easy trials (Tikhomirov et al., 2023). On the other hand, selective DS access will improve performance only if the operator effectively requests support when needed. Future research can directly compare selective access to access-by-default conditions to resolve this question.

We also examined the subjective cognitive workload with different DS formats and sensitivity levels. Participants in all conditions but one reported higher workloads for DS-aided trials compared to nonaided trials, contradicting previous findings that indicated a decrease in subjective workload for aided trials (Rottger et al., 2009). As with classification times, higher workload ratings for DS-aided trials likely reflect the fact that participants tended to invoke the DS primarily in response to high-uncertainty events, that is, trials on which the classification was inherently difficult. In contrast, Rottger et al. (2009) provided the DS by default, thereby agglomerating high- and low-uncertainty events in the participants’ workload ratings. Notably, we observed lower workload ratings for aided trials than for unaided trials among participants using the high-sensitivity IL DS, indicating that a reliable and well-designed support system can reduce cognitive load.

Subjective workload did not differ significantly between the DS format conditions for trials on which the participants did not access DS. For DS-aided trials, participants using the high-sensitivity DS reported lower workload than those using the low-sensitivity DS. Subjective workload also varied between the DS formats for both high and low-sensitivity DS. The reported workload was lower for the integrated format than for either of the separated formats. Among the two separated formats, it was lower for the binary than for the likelihood displays. These findings underscore the advantages of automating information integration and suggest that presenting binary cues may alleviate the human workload when integrated information displays are impossible.

The results also showed that the display format and sensitivity affected users’ PUP. Participants who used high-sensitivity likelihood DS, whether of separated or integrated format, reported higher PUP than those who used binary DS, consistent with prior findings regarding trust in DS (Meyer et al., 2020). One possible explanation for the higher PUP scores associated with the likelihood display is that graded cues provide users with more transparent and interpretable information about the DS. By presenting a probability estimate rather than a binary indication, the likelihood display format allows users to understand the level of uncertainty in the DS prediction. However, this additional PUP did not translate into a preference for accessing the SB DS over the SL DS.

Participants using the IL format reported higher PUP than those using the SL format. These effects imply that users may perceive DS as better and more useful when it integrates information from multiple sources rather than burdening them with the task themselves.

The value of providing users autonomy to access the DS may depend on the nature of the task and the intended objective of the DS. For example, it may be more suitable in domains like diagnostic medicine, where decisions are unspeeded, and DS aims to maximize decision quality. In contrast, tasks requiring prompt decision making, such as in financial markets, may be less advisable. Nonetheless, when the decision regarding DS usage is based on individual events, it grants the decision maker flexibility. When the information is unambiguous, users might opt not to use DS, even though the goal is to maximize performance. Conversely, when the information is highly ambiguous, using DS might be preferred, even if minimizing performance time is crucial. Examining selective access to DS can provide insight into users’ decision making in tasks involving multiple objectives. This analysis has implications for designing and implementing DS systems across diverse domains.

Constraints on Generality

The study used an abstract task designed for nonexpert decision makers. Of necessity, it differed from many naturalistic tasks in ways that might limit the generalizability of our conclusions. For example, our participants used a binary classification task with equal probabilities and symmetric payoffs to avoid bias toward one outcome. These characteristics may not reflect some real-world situations, for instance, in the medical domain, where multiple types of events are possible, some forms of event are rare, and the payoff attached to different forms of error may be asymmetrical. Future research should explore the effects of biased event distributions and classification costs on decision making. An additional limitation is the absence of a comparison between selective DS access, access by default, and no DS conditions. This comparison would have allowed us to determine whether the observed benefits of the integrated display are primarily due to increased usage or to the more effective utilization of information. Our use of graphic design for the DS stimuli, rather than verbal descriptions, can also affect the study’s generalizability. While our choice of graphic displays was intended to maintain consistency with the visual nature of the stimuli participants saw before accessing the DS, it may limit the applicability of our findings to scenarios where two conditions hold: first, the raw data themselves are visual, and second, the DS’s recommendations can be encoded in a format similar to the raw data. Future studies should address this limitation and manipulate graphic versus verbal DS formats to explore their effects on performance and cognitive workload.

Key Points

• The ambiguity of information plays a significant role in the decision to access Decision Support (DS), which might demonstrate the value of DS in resolving uncertainty.

• The tendency to access DS might vary with the nature of the task and its objectives. DS proves effective for enhancing task performance but might not be ideal for swift decisions like financial market decisions.

• A DS’s display format plays a prominent and influential role in the inclination to utilize it, especially when it provides reliable information. The likelihood-integrated display format was favored over binary or likelihood-separated display formats.

Footnotes

Acknowledgments

This research was supported by the Israeli Science Foundation (ISF) (grant No. 2029/19) and the Center for AI and Data Science at Tel Aviv University. The abstract of the study was presented at the Human Factors and Ergonomics Society 2023 Annual Meeting (Ben-Yaakov et al., 2023).

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Israeli Science Foundation (ISF) (grant No. 2029/19) and the Center for AI and Data Science at Tel Aviv University.

ORCID iDs

Yoav Ben Yaakov

Jason S. McCarley

Joachim Meyer

Author Biographies

Yoav Ben Yaakov is a PhD candidate in the Department of Industrial Engineering at Tel Aviv University, Israel. He received his MSc in industrial engineering from Tel Aviv University in 2020.

Jason S. McCarley is a professor at Oregon State University. He received his PhD in experimental psychology from the University of Louisville in 1998.

Joachim Meyer is the Celia and Marcos Maus Professor of Data Sciences at the Department of Industrial Engineering at Tel Aviv University, Israel. He received his MA in psychology and PhD in industrial engineering (1994) from the Ben-Gurion University of the Negev in Beer Sheva, Israel.

References

Ayres

Sweller

(2005). The split-attention principle in multimedia learning. In The Cambridge Handbook of Multimedia Learning (pp. 135–146). Cambridge University Press.

Bahrami

Olsen

Latham

P. E.

Roepstorff

Rees

Frith

C. D.

(2010). Optimally interacting minds. Science, 329(5995), 1081–1085. https://doi.org/10.1126/science.1185718

Baron

Hershey

J. C.

(1988). Heuristics and biases in diagnostic reasoning: I. Priors, error costs, and test accuracy. Organizational Behavior and Human Decision Processes, 41(2), 259–279. https://doi.org/10.1016/0749-5978(88)90030-1

Bartlett

M. L.

McCarley

J. S.

(2017). Benchmarking aided decision making in a signal detection task. Human Factors, 59(6), 881–900. https://doi.org/10.1177/0018720817700258

Bartlett

M. L.

McCarley

J. S.

(2019). No effect of cue format on automation dependence in an aided signal detection task. Human Factors, 61(2), 169–190. https://doi.org/10.1177/0018720818802961

Ben-Yaakov

Bitan

Meyer

(2021). Do users know when to buy decision support and how to use it? Proceedings of the Human Factors and Ergonomics Society - Annual Meeting, 65(1), 1021. https://doi.org/10.1177/1071181321651120

Ben-Yaakov

McCarley

Meyer

(2023, October). Selective access to decision support as a function of event uncertainty. Proceedings of the Human Factors and Ergonomics Society - Annual Meeting, 67(1). https://doi.org/10.1177/21695067231192540

Ben Yaakov

Meyer

(2024). Choose and use: Users’ selection of information sources for decision support. In Proceedings of the annual meeting of the cognitive science society (Vol. 46).

Ben Yaakov

Meyer

Wang

(2020). User detection of threats with different security measures. In 2020 IEEE international conference on human-machine systems (ICHMS) (pp. 1–6). Rome, Italy, IEEE.

10.

Ben Yaakov

Wang

Meyer

(2019). Choosing protection: User investments in security measures for cyber risk management. In International conference on decision and game theory for security (pp. 33–44). GameSec.

11.

Boskemper

M. M.

Bartlett

M. L.

McCarley

J. S.

(2022). Measuring the efficiency of automation-aided performance in a simulated baggage screening task. Human Factors, 64(6), 945–961. https://doi.org/10.1177/0018720820983632

12.

Botzer

Meyer

Parmet

(2013). Mental effort in binary categorization aided by binary cues. Journal of Experimental Psychology: Applied, 19(1), 39–54. https://doi.org/10.1037/a0031625

13.

Carragher

D. J.

Hancock

P. J. B.

(2023). Simulated automated facial recognition systems as decision-aids in forensic face matching tasks. Journal of Experimental Psychology: General, 152(5), 1286–1304. https://doi.org/10.1037/xge0001310

14.

Cohen

(1988). Statistical power analysis for the behavioral sciences. Erlbaum.

15.

Meyer

(2001). Effects of warning validity and proximity on responses to warnings. Human Factors, 43(4), 563–572. https://doi.org/10.1518/001872001775870395

16.

Meyer

Dembinsky

Raviv

(2020). Alerting about possible risks vs. Blocking risky choices: A quantitative model and its empirical evaluation. Computers & Security, 97, Article 101944. https://doi.org/10.1016/j.cose.2020.101944

17.

Montanari

Nunnari

(2019). Audi alteram partem: An experiment on selective exposure to information. Unpublished manuscript.

18.

Moreno

Mayer

R. E.

(1999). Cognitive principles of multimedia learning: The role of modality and contiguity. Journal of Educational Psychology, 91(2), 358–368. https://doi.org/10.1037/0022-0663.91.2.358

19.

Munoz

F. G. A.

Duncan-Reid

McCarley

J. S.

(2022). The effect of correlated observations on human-automation interaction in a signal detection task. Proceedings of the Human Factors and Ergonomics Society - Annual Meeting, 66(1), 2062–2066. https://doi.org/10.1177/1071181322661290

20.

Parasuraman

Manzey

D. H.

(2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52(3), 381–410. https://doi.org/10.1177/0018720810376055

21.

Pescetelli

Hauperich

A.-K.

Yeung

(2021). Confidence, advice seeking and changes of mind in decision making. Cognition, 215(10), Article 104810. https://doi.org/10.1016/j.cognition.2021.104810

22.

Pharmer

R. L.

Wickens

C. D.

Clegg

B. A.

(2022). When transparency fails: Compliance with decision support in nautical collision avoidance. Proceedings of the Human Factors and Ergonomics Society - Annual Meeting, 66(1), 397–401. https://doi.org/10.1177/1071181322661321

23.

Röttger

Bali

Manzey

(2009). Impact of automated decision aids on performance, operator behaviour and workload in a simulated supervisory control task. Ergonomics, 52(5), 512–523. https://doi.org/10.1080/00140130802379129

24.

Sorkin

R. D.

Kantowitz

B. H.

Kantowitz

S. C.

(1988). Likelihood alarm displays. Human Factors, 30(4), 445–459. https://doi.org/10.1177/001872088803000406

25.

Tikhomirov

Lana

Bartlett

Megan

L. Duncan-Reid

Jackson

McCarley

Jason

S. (2023). Journal of Experimental Psychology: Applied, 29(4), 869–886. https://doi.org/10.1037/xap0000484

26.

Todd

Benbasat

(1992). The use of information in decision making: An experimental investigation of the impact of computer-based decision aids. MIS Quarterly, 16(3), 373–393. https://doi.org/10.2307/249534

27.

Voss

Rothermund

Voss

(2004). Interpreting the parameters of the diffusion model: An empirical validation. Memory & Cognition, 32(7), 1206–1220. https://doi.org/10.3758/bf03196893

28.

Wang

Jamieson

G. A.

Hollands

J. G.

(2009). Trust and reliance on an automated combat identification system. Human Factors, 51(3), 281–291. https://doi.org/10.1177/0018720809338842

29.

Wickens

Colcombe

(2007). Dual-task performance consequences of imperfect alerting associated with a cockpit display of traffic information. Human Factors, 49(5), 839–850. https://doi.org/10.1518/001872007X230217

30.

Wickens

C. D.

Carswell

C. M.

(1995). The proximity compatibility principle: Its psychological foundation and relevance to display design. Human Factors, 37(3), 473–494. https://doi.org/10.1518/001872095779049408

31.

Wickens

C. D.

Dixon

S. R.

(2007). The benefits of imperfect diagnostic automation: A synthesis of the literature. Theoretical Issues in Ergonomics Science, 8(3), 201–212. https://doi.org/10.1080/14639220500370105

32.

Zirk

Wiczorek

Manzey

(2020). Do we really need more stages? Comparing the effects of likelihood alarm systems and binary alarm systems. Human Factors, 62(4), 540–552. https://doi.org/10.1177/0018720819852023

Selective Access to Decision Support as a Function of Event Uncertainty

Abstract

Objective

Background

Method

Results

Conclusion

Application

Keywords

Introduction

Method

Participants

Apparatus

Stimuli

Procedure

Results

Tendency to access DS

Performance

Classification time

Subjective workload

Perceived usefulness and performance of the DS

Discussion

Constraints on Generality

Key Points

Footnotes

Acknowledgments

Declaration of Conflicting Interests

Funding

ORCID iDs

Author Biographies

References