Abstract
The study tested the role of cue utilization and cognitive reflection tendencies in email users’ phishing decision capabilities in both controlled and naturalistic settings. Ninety-four university students completed measures of their phishing cue utilization and cognitive reflection, a phishing decision task, and a naturalistic simulated phishing campaign, in which simulated phishing emails were sent to their personal inboxes. For the phishing decision task, results revealed that participants with lower cognitive reflection tendencies were more likely to misclassify genuine emails as phishing, compared to participants with higher cognitive reflection. Further, participants with higher cognitive reflection and lower cue utilization took the most time to diagnose emails, whereas participants low in both cue utilization and cognitive reflection demonstrated the shortest response latencies. These findings suggest that greater cognitive reflection can offset lower levels of cue utilization. For the naturalistic simulation, neither cue utilization nor cognitive reflection predicted an increased propensity to interact with a suspicious email. This result highlights a potential gap between phishing investigations conducted in controlled and naturalistic settings. The implications extend to future research, emphasizing the need for studies that employ naturalistic methodologies to better understand and address phishing threats in real-world environments.
Introduction
Online criminal activity is a serious and escalating threat impacting individuals, organizations, and governments both in Australia and world-wide. Social engineering attacks, which involve the psychological manipulation of individuals to alter their behavior and prompt actions that compromise security, are particularly concerning. Notably, 31% of these attacks are conducted via phishing emails (Verizon, 2024), which are fraudulent messages designed to deceive recipients into revealing sensitive information, such as financial details or passwords. As the last line of defense, email users must determine whether an email is legitimate or fraudulent, making human error a significant vulnerability in cybersecurity systems (Herzberg, 2009).
A popular framework for understanding decision-making in naturalistic contexts is the heuristics and biases approach (Tversky & Kahneman, 1974). When equipped with limited information, or under conditions of uncertainty or urgency, people will frequently employ heuristics (i.e., mental “rules of thumb”) that leverage previous experience to problem-solve and conserve cognitive resources (Dumm et al., 2020; Williams et al., 2018). However, while typically useful, an overreliance on heuristics can result in judgment errors and faulty decision-making (i.e., biases) (Gigerenzer & Gaissmaier, 2011; Kahneman & Egan, 2011). Senders use social engineering techniques within the body of an email (e.g., creating a sense of urgency or offering a reward) to capitalize on these judgment errors, and as such, research has investigated if a more rational or reflective approach to reasoning would reduce the potential for human error (Moody et al., 2017; Workman, 2008).
Cognitive Reflection
The dual-processing system (DPS; Evans, 2008) outlines two distinct cognitive pathways for decision-making: System 1 (S1) is characterized by intuitive thinking, which describes an unconscious, instinctive, and rapid process; and System 2 (S2) is a more deliberate, slow, and intentional thinking pathway. The DPS proposes that heuristics (S1) generate automatic behavioral responses, often described as unskilled intuition, unless analytic reasoning (S2) intervenes and inhibits the behavior. Similarly, the Elaboration Likelihood Model (ELM) presents a dual-cognitive model (Petty & Cacioppo, 1986). The model suggests that individuals using the central information processing pathway activate two sub-processes: attention, which is described as mental focus, and elaboration, which involves forming connections between the present experience and past knowledge (Harrison et al., 2016).
Although the supposition of pathway independence has been questioned (Baron et al., 2015; Klein, 2011; Trémolière & Bonnefon, 2014), a central tenet of this model is the proposed benefit of the S2 pathway, whereby individuals engage deeply with stimuli or contextual features such as those found in phishing emails (Stanovich & West, 2000). The assumption is that upon receipt of an email, the S2 pathway may prompt a recipient to continue the information search and carefully examine the email for cues of authenticity, thus decreasing their chances of victimization in cybercrime (Harrison et al., 2016; Luo et al., 2013; Vishwanath et al., 2016; Vishwanath et al., 2011).
A body of research has implicated cognitive reflection as a significant activator of the systematic thinking pathway (Isler et al., 2020; Kahneman & Klein, 2009; Pennycook & Rand, 2019). Cognitive reflection is defined as an information processing technique and is measured by behavioral outcomes characterized along a spectrum of rapid, impulsive decision-making or slower, rational, and stepwise approaches (Frederick, 2005). Frederick’s (2005) Cognitive Reflection Test (CRT) operationalizes cognitive reflection as a measure of impulsivity requiring participants to complete a three-item mathematical questionnaire. Each item is designed to evoke an instinctively incorrect answer, which avoids cognitive strain or reflection (i.e., S2 processing) and relies on the heuristics-biases pathway (i.e., S1 processing) for decision-making.
The CRT has been previously used to examine decision-making with reference to phishing emails. In a study by Jones et al. (2019), 224 university students and staff completed an email judgment task requiring them to discriminate between legitimate and phishing emails, as well as complete a battery of cognitive tasks relating to individual differences, including participants’ cognitive reflection. The results revealed that individuals with a greater capacity for cognitive reflection were more successful in discriminating phishing emails from authentic communication (Jones et al., 2019). Thus, measuring cognitive reflection tendencies can be useful in predicting phishing victimization.
The CRT has faced criticism in the literature regarding its validity and reliability, with some studies noting that the widespread use of the test, particularly the original version (Frederick, 2005), has created over-exposure to and familiarity with the test items (Baron et al., 2015; Chandler et al., 2014). Other reports suggest that variables such as mathematical ability and general intelligence factors have the potential to confound participants’ typical cognitive impulsivity (Toplak et al., 2011). To address these issues, Thomson and Oppenheimer (2016) designed the CRT-2, which uses questions that limit reliance on respondents’ mathematical ability to generate correct answers, thus absolving it of one of the key critiques of the original CRT.
Cheng and Janssen (2019) aimed to validate the CRT-2 by examining its relationship with the conceptually related intertemporal choice task, which measures preference for immediate smaller rewards (i.e., impulsivity) or later larger rewards (i.e., reflective, long-term orientation). Results from 139 college students showed a significant positive association between correct responses on the CRT-2 (i.e., higher cognitive reflection) and fewer impulsive choices across two hypothetical gain and payment conditions. The study also reported a positive correlation between impulsive choices and intuitive errors (compared to non-intuitive errors) on the CRT-2. This supports the measure’s construct validity in assessing information processing preferences and implications for reflective decision-making (Cheng & Janssen, 2019).
In line with these findings, Isler et al. (2020) compared methods of activating reflective thinking using the CRT-2 to measure participants’ performance on common behavioral manipulations used in online research. A total of 1,748 adult participants were assigned to five experimental conditions (e.g., time delay, memory recall, and decision justification). The results demonstrated that reflective thinking (i.e., higher scores on the CRT-2) was successfully activated when participants were asked to justify their decisions or were given training to increase reflection and awareness of cognitive biases. This suggests that the CRT-2 may be useful in distinguishing between reflective thinking and critical analysis, and other aspects of information processing such as attention and recall (Harrison et al., 2016; Petty & Cacioppo, 1986; Thomson & Oppenheimer, 2016).
Skilled Intuition
It is important to note that not all decision-making will neatly assemble into either pathway presented in the dual processing models (Stanovich & West, 2000; Trémolière & Bonnefon, 2014). While unskilled intuitive processes (e.g., heuristics and biases) are theorized to reside or operate via S1 channels, the development of skilled intuition is also likely to leverage S1 architecture. This describes the expert ability to react rapidly to an extenuating circumstance, often demonstrating better performance outcomes compared to their less experienced counterparts (French & Nevett, 1993; Klein et al., 1986).
Klein and Klinger (1991) pioneered an investigation into expert decision-making by interviewing fire commanders, and reported that under critical and complex conditions, these experts typically generate only one plausible course of action rather than the hypothesized systemic appraisal of several plausible actions. The Recognition-Primed Decision (RPD; Klein & Klinger, 1991) model describes a process for developing skilled intuition, a blended S1 and S2 information processing pathway where domain experts capitalize on pattern recognition skills to create mental simulations for how the events or process might unfold and then evaluate the merits of the first viable course of action. The repertoire of memories available to skilled decision-makers is presumed to reduce uncertainty and accelerate appropriate responses (Baron et al., 2015; Baylor, 2001; French & Nevett, 1993; Klein, 1993, 2011; Klein & Klinger, 1991; Weick et al., 2005).
Cue Utilization
The foundation of skilled intuition is thought to originate with cues, which are the associations between features and events that exist in the environment or in memory (Wiggins et al., 2018). With repeated exposure to cues, a connection to an object or event is reinforced for later recall (Wiggins, 2015). The grouping of multiple cues is presumed to create a mental model (i.e., pattern) of the expected cue sequence, which leads to a comprehensive understanding or sensemaking of the entire process, event, or object (Gacasan & Wiggins, 2017; Weick et al., 2005). Cue utilization is therefore the capacity to recognize and respond to the demands of a task or process, by mentally organizing specific cues (Gacasan & Wiggins, 2017). Studies have linked cue utilization with performance across several domains including piloted aircraft (Renshaw & Wiggins, 2017; Wiggins et al., 2018), disaster recovery project management (Gacasan & Wiggins, 2017), driving simulations (Yuris et al., 2019), collision avoidance at sea (Chauvin & Lardjane, 2008), and emergency triaging for nurses (Reay & Rankin, 2013).
Schriver et al. (2008) used a flight simulator to examine the differences between expert and less-skilled pilot decision-making. Results showed that expert pilots made more accurate and timely decisions, used fewer cues overall, and applied more attention to relevant diagnostic cues when compared to the novice pilots. This outcome reflects the RPD model, which describes domain experts’ use of cue pattern recognition skills as an important facet or precursor to skilled intuition and cue utilization (Gacasan & Wiggins, 2017; Stanovich & West, 2000). However, contrary to expectations, in the failure simulation (i.e., flight problems), both expert and novice pilots demonstrated greater accuracy in problem diagnoses for conditions where cues were less correlated with each other (i.e., randomized) than for more correlated cues (i.e., typical or expected).
The authors suggested that in the absence of an existing mental model, both novice and expert pilots applied similar attentional and analytic resources (i.e., S2 thinking pathway) for effective decision making (Schriver et al., 2008). Activation of S2 decision making pathways (i.e., cognitive reflection) may therefore be the first step in facilitating skill development via the acquisition of new diagnostic cues (i.e., experience), which over time develops into pattern recognition and subsequently cue utilization occurring in the S1 thinking pathway (Kahneman & Klein, 2009; Klein, 2008; Klein & Klinger, 1991). As such there may be an important interaction between cognitive reflection and cue utilization in explaining how individuals make accurate decisions (Endsley, 1995; Jones et al., 2019; Pennycook & Rand, 2019; Schriver et al., 2008).
Cue utilization may also play an important role in a cyber security context by reducing cognitive load where the failure to recognize and take appropriate action can result in a significant safety breach (Wiggins, 2015; Wiggins et al., 2014). Wiggins (2021) suggests that participants are classified as having greater or lesser cue utilization skills based on domain-specific performance indicators, specifically, their ability for Recognition, Association, Prioritization, Identification, and Discrimination (RAPID) of cues.
Bayl-Smith et al. (2020) used the phishing edition of EXPERTise 2.0 (Brouwers et al., 2016; Wiggins, 2016; Wiggins et al., 2015), a web-based tool, to assess participants’ cue utilization across the five (RAPID) facets. Participants were instructed to complete five scenario-based tasks in which they had to recognize or classify email authenticity, assign strength of association between phishing concepts, prioritize or rank feature importance, identify diagnostic features, and discriminate between domain-related or unrelated features (Wiggins, 2016; Wiggins et al., 2018). Participants were also required to identify suspicious features in emails and classify them as legitimate or untrustworthy. Using a k-means cluster analysis, participants were successfully grouped into those demonstrating relatively higher or lower cue utilization skills. Participants with higher cue utilization were more accurate in identifying phishing features than those with lower cue utilization (Bayl-Smith et al., 2020). Additionally, it was reported that the average email deliberation time (i.e., cognitive reflection) had a positive impact upon participants’ ability to recognize key email features that indicated suspicion. This supports the notion that participants demonstrating higher cue utilization (i.e., S1 thinking) or higher cognitive reflection (i.e., S2 thinking) were better at identifying features that do not fit with the expected patterns of the situation.
In a similar study investigating the relationship between email users’ cue utilization and their phishing detection, Nasser et al. (2020) asked 50 participants to complete a dual-task exercise, which required them to complete a phishing detection task while also completing a rail control task with increasing complexity (i.e., cognitive load). Participants’ relative cue utilization was again distinguished using the cyber version of EXPERTise 2.0. Results revealed that users with relatively higher cue utilization had greater accuracy in discriminating email authenticity. However, in contrast to expectations, high cue utilizers did not demonstrate an advantage over low cue utilizers under conditions of increasing cognitive load. This may be attributed to practice effects (Duff et al., 2007) and design limitations when reporting cue sources (Nasser et al., 2020). Overall, the findings, taken together with those of Bayl-Smith et al. (2020) suggest a clear advantage for email users higher in cue utilization in the detection of malicious emails.
Ackerley et al. (2022) recently extended the work of Nasser et al. (2020), for the first time examining the potential interplay between cue utilization and cognitive reflection in email users’ ability to efficiently differentiate between phishing and genuine emails. Participants completed the original Cognitive Reflection Test (CRT), a laboratory-based phishing diagnostic task, and the EXPERTise 2.0 battery. The results revealed an interaction between users’ cue utilization and cognitive reflection, whereby participants relatively low in both domains performed significantly worse in diagnosing phishing emails compared to others. They concluded that a high level of cognitive reflection was able to compensate for a lower level of cue utilization, and vice versa.
While a novel contribution, Ackerley et al. (2022) note the artificial nature of the phishing diagnostic performance measure they used, which may have yielded several experimental artefacts. For example, requiring participants to detect phishing emails from a sample of emails may have elicited expectation effects not present during real-world decision making. Further, such explicit directions may have primed participants to engage in greater cognitive reflection during their decision making, resulting in a distorted view of their natural tendencies beyond the study. The authors underline the need to test the generalizability of their results beyond controlled phishing diagnosis tasks via the use of naturalistic study techniques, for instance, employing simulated phishing emails sent to participants’ real inboxes sporadically.
Study Aims
The aim of this study is to test: 1) the role of cue utilization and cognitive reflection tendencies in email users’ phishing email diagnostic capabilities and 2) whether any differences exist when comparing capabilities in controlled versus naturalistic settings.
The notion of skilled intuition brought awareness to the fact that not all individuals need to access S2 thinking pathways for good decision-making (Kahneman & Klein, 2009; Klein, 1993, 2011; Klein & Klinger, 1991; Weick et al., 2005). Previous research suggests that individuals with relatively high cue utilization will perform better than those lower in cue utilization for complex tasks such as discriminating legitimate emails from phishing emails in a cyber-attack (Ackerley et al., 2022; Bayl-Smith et al., 2020; Nasser et al., 2020). Consistent with their findings, we hypothesize that (H1) the cyber security edition of the EXPERTise 2.0 battery will enable identification of two “clusters” of participants, with one group demonstrating relatively higher cue utilization than the other group (i.e., lower cue utilization). Additionally, we hypothesize that (H2) participants with higher levels of cue utilization will demonstrate, a) greater accuracy on the phishing decision task (i.e., higher rates of true positives, and lower rates of false positives) and b) shorter response latencies in making their judgments, compared to those with lower levels of cue utilization.
It is postulated that S2 thinking may come more easily for individuals with a greater tendency towards cognitive reflection, and this may be key for the careful analysis of email legitimacy (Evans, 2008; Isler et al., 2020). Indeed, evidence has revealed advantages in phishing detection capabilities among those with greater cognitive reflection (Ackerley et al., 2022). Therefore, we hypothesize that (H3) participants who demonstrate a greater tendency to engage in cognitive reflection will demonstrate, a) greater accuracy on the phishing decision task (i.e., higher rates of true positives, and lower rates of false positives) and b) longer response latencies in making their judgments, compared to those participants who are less inclined to engage in cognitive reflection.
As noted, this study also aims to add to the limited phishing email research set in naturalistic conditions. This is actioned by sending participants simulated phishing emails to their university email addresses, differentiated by either greater or fewer phishing cues. It is hypothesized that (H4) cue utilization groupings will be predictive of participants’ engagement with a naturalistic phishing simulation, whereby participants with lower cue utilization will demonstrate greater engagement with phishing emails (i.e., opening an email or clicking on an embedded link) compared to those with higher cue utilization. Likewise, we hypothesize that (H5) cognitive reflection groupings will be predictive of participants’ engagement with a naturalistic phishing simulation, whereby participants with lower cognitive reflection will demonstrate greater engagement with phishing emails (i.e., opening an email or clicking on an embedded link) compared to those with higher cognitive reflection.
It was presumed that users with both higher cue utilization levels and cognitive reflection tendencies possess cognitive resources that increase their sensitivity to phishing cues, compared to the respective lower groupings. As such, we ask (RQ1) do any differences in engagement with a naturalistic phishing simulation (H4 and H5) relate to the number of phishing cues embedded within the phishing emails (i.e., are emails with fewer phishing cues less likely to be engaged than those with a greater number of phishing cues)? Additionally, (RQ2) are any differences based on phishing cue numbers contingent on cue utilization or cognitive reflection groupings?
Method
Participants
The convenience sample of 94 participants (50 female, 42 male, and two non-binary) was recruited from first- and second-year undergraduate psychology students enrolled at Macquarie University, Australia. Female ages ranged from 18 to 42 years (Mage = 21.03, SDage = 6.14), male ages ranged from 17 to 43 years (Mage = 20.38, SDage = 5.55), and non-binary ages ranged from 19 to 31 years (Mage = 25, SDage = 8.49). In addition to completing the online research activities, participants consented to being sent three simulated phishing emails within a six-week period after completion. These emails presented no risk to participants’ computers or devices.
Materials
CRT-2
The Cognitive Reflection Test-2 (CRT-2; Thomson & Oppenheimer, 2016) measures participants’ cognitive reflection tendencies. The CRT-2 is a revised version of the original three-item Cognitive Reflection Test (CRT; Frederick, 2005). The CRT-2 is a four-item short answer questionnaire that measures participants’ tendency for impulsivity. Its theoretical framework is underpinned by the “System 1 and 2” dual process reasoning models of cognition (Stanovich & West, 2000). The measure required participants to respond to four “trick” items, with the expectation that they would either arrive at an intuitive but incorrect answer or engage in systematic and reflective thinking to arrive at the correct answer.
Participants could attain a maximum score of four, with each correct answer equaling one point, as per the original method of scoring (Frederick, 2005). No points were given for intuitive errors (i.e., the intuitive but incorrect answer) or non-intuitive errors (e.g., “I don’t know”). The CRT-2 has previously shown average internal reliability across items (Cronbach’s α > .50; Thomson & Oppenheimer, 2016), and although this is a less-than-ideal statistic, Cheng and Janssen (2019) suggest that the very small number of items may influence the power available for calculating reliability. A strong correlation has been found between the CRT-2 and the original CRT (r > .50, p < .001; Thomson & Oppenheimer, 2016).
Phishing Decision Task
The phishing decision task was accessed via Qualtrics (Qualtrics, 2021). The task has been used to measure participants’ ability to correctly differentiate between trustworthy or suspicious emails (Ackerley et al., 2022; Bayl-Smith et al., 2020; Nasser et al., 2020). A total of 40 images were presented to each participant, consisting of half genuine and half phishing emails. The emails were sampled from a compilation of real phishing attempts that the research team had received over a 6-month period. Phishing cues (e.g., unknown email address) were included, but the content was deidentified. Each email was displayed in a randomized order for a maximum of 20 seconds, after which the email disappeared, and participants were asked to select if the email was trustworthy or suspicious. The total time each email was viewed was also collected for each participant.
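As an illustration of the outcome measures described for this task, the following is a minimal sketch of how accuracy (true-positive and false-positive rates) and mean viewing time could be computed for one participant. The `Trial` record and the `summarize` function are hypothetical names introduced here for illustration; the sketch assumes that a "true positive" is a phishing email judged suspicious and a "false positive" is a genuine email judged suspicious, and it is not the scoring code used in the study.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One email judgment (hypothetical record layout)."""
    is_phishing: bool        # ground truth for the email
    judged_suspicious: bool  # participant selected "suspicious"
    seconds_viewed: float    # viewing time, at most 20 s in the task


def summarize(trials):
    """Return true-positive rate, false-positive rate, and mean viewing time.

    Assumes the trial list contains at least one phishing and one genuine email.
    """
    phishing = [t for t in trials if t.is_phishing]
    genuine = [t for t in trials if not t.is_phishing]
    true_positives = sum(t.judged_suspicious for t in phishing)
    false_positives = sum(t.judged_suspicious for t in genuine)
    return {
        "true_positive_rate": true_positives / len(phishing),
        "false_positive_rate": false_positives / len(genuine),
        "mean_seconds_viewed": sum(t.seconds_viewed for t in trials) / len(trials),
    }
```

In the task as described, each participant would contribute 40 such trials (20 phishing, 20 genuine).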
EXPERTise 2.0
The cyber security edition of Expert Intensive Skills Evaluation Program Version 2.0 (EXPERTise 2.0; Wiggins et al., 2015) is a situational judgment test software platform that comprises five tasks, each testing different components of behavior indicative of cue utilization. These are the Feature Recognition Task (FRT), Feature Association Task (FAT), Feature Prioritization Task (FPT), Feature Identification Task (FIT), and the Feature Discrimination Task (FDT). The statistical properties of EXPERTise 2.0 have been assessed using domain-specific stimuli, and predictive validity and test-retest reliability have been established for clinician audiology training (Watkinson et al., 2018), power system controllers (Loveday et al., 2013), and pilot weather decision-making (Wiggins et al., 2014).
The stimuli for this study were designed to reflect the most ecologically valid or realistic experience for participants receiving phishing emails (Bayl-Smith et al., 2020; Nasser et al., 2020). These diagnostic elements were selected via collaboration with a subject-matter expert to ensure content validity. The five scenario-based EXPERTise 2.0 tasks incorporate cybersecurity cues represented through text, auditory, and visual elements (Sturman et al., 2024).
Feature Identification Task (FIT)
The FIT assesses participants’ ability to quickly discern if an email contains suspicious visual features (phishing email) or is a legitimate communication (non-phishing). Fifteen email images are presented on a screen individually for 20 seconds, and participants use their cursor to click on any element in the email that causes suspicion or click on a green box on the screen labeled “trustworthy.” Participants with relatively higher cue utilization are expected to demonstrate shorter response latency in identifying features and formulating a diagnosis (Loveday et al., 2013).
Feature Recognition Task (FRT)
The FRT measures the accuracy with which participants classify key diagnostic features of emails as either phishing or genuine (Wiggins et al., 2018). Fifteen email images are presented on a screen individually for a short duration (1 second), and participants are then asked to classify each email as trustworthy, untrustworthy, or impossible to tell. Despite the limited exposure, skilled decision makers are expected to successfully access their repertoire of cues in memory, which enables rapid recognition of the diagnostic features, and respond accurately (Loveday et al., 2013; Wiggins et al., 2018). The FRT generates a count of correct judgments, with greater levels of accuracy being indicative of higher levels of skilled cue utilization (Bayl-Smith et al., 2020; Morrison et al., 2018).
Feature Association Task (FAT)
The FAT measures participants’ ability to consider the strength of association between specific diagnostic features of phishing emails. Two conceptual phrases are presented on the screen for a limited time, and participants indicate their relatedness using a 7-point Likert-type scale (1 = Extremely Unrelated to 7 = Extremely Related). Participants complete two parts of the FAT. In the first part, phrases are presented adjacently in pairs and participants rate their association. In the second part, the phrases are presented sequentially, one after the other, to test for improvements in decision-making. Research suggests that associated concepts are more rapidly distinguished due to pre-existing neural connections within memory (Morrison et al., 2013). Therefore, individuals with relatively higher levels of cue utilization are expected to show greater variance in their ratings across concept pairs.
Feature Discrimination Task (FDT)
The FDT tests a participant’s capacity to discriminate the relative importance of features of a suspected phishing email. Participants are given one detailed scenario regarding a potential phishing email along with a picture of the email. Participants are then given several choices (e.g., pay as requested and ignore the email) and are asked to select their decision. Following this, features of the scenario (e.g., time sent and hyperlink) are presented on a 10-point Likert-type scale (1 = Not Important at All to 10 = Extremely Important) for participants to rate each feature on its perceived importance for decision-making. Research suggests that participants who can discriminate relevant from less relevant email features via the rating scales (i.e., greater levels of variance between responses) tend to demonstrate relatively higher levels of cue utilization (Loveday et al., 2013).
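To make the "variance between responses" idea concrete, the following minimal sketch computes the variance of one participant's importance ratings on the 10-point scale. The function name `rating_variance` is hypothetical, and the use of population variance (rather than sample variance or another index) is an assumption; the source does not specify the exact formula used by EXPERTise 2.0.

```python
from statistics import pvariance


def rating_variance(ratings):
    """Population variance of a participant's feature-importance ratings.

    Assumption: ratings are integers on the 1-10 scale described for the FDT.
    Higher variance indicates that the participant differentiated more
    strongly between features they saw as important and unimportant.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 10 for r in ratings):
        raise ValueError("ratings must lie between 1 and 10")
    return pvariance(ratings)
```

Under this sketch, a participant who rates every feature 5 obtains a variance of 0, whereas one who alternates between 1 and 10 obtains a much larger value, reflecting sharper discrimination between features.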
Feature Prioritization Task (FPT)
The FPT measures participants’ ability to prioritize email features in an information-search task. Participants are given an introduction sentence to a scenario and are required to click on individual drop-down menus one at a time describing different email features (e.g., company logo and knowledge of sender). For the first scenario, participants are given 60 seconds to decide their course of action, and for the second scenario participants are given 120 seconds. Relatively higher levels of cue utilization are associated with accessing drop-down menu items in order of importance rather than sequentially down the webpage (Crane et al., 2018).
Naturalistic Phishing Simulation
Following their participation in the first part of the research, participants were sent three simulated phishing emails to their student email address (one per week). One email was blocked by spam filters and was excluded from the analysis. The decision to send a small number of emails in the phishing simulation was made to reflect the scarcity of phishing emails that successfully bypass spam filters and reach student inboxes. This approach aimed to create a more authentic scenario, aligning with the actual frequency with which students encounter phishing attempts. The emails differed in sophistication (i.e., the number of phishing cues) and included persuasive elements (e.g., urgent message and university logos) to mislead the recipient. The three possible behavioral outcomes for participants were disregarding the email, opening the email but not clicking the embedded URL, or clicking on the embedded URL in the phishing email. By clicking on the URL in a phishing email, participants were directed to a webpage with educational content regarding phishing email identification, as well as a description of the study.
Procedure
Participants accessed the online study via the advertised link from SONA, the online research participation system associated with Macquarie University. Participants landed on the Qualtrics page, an online research survey platform (Qualtrics, 2021), where they read a Participant Information and Consent Form (PICF) that described the study as approved by the ethics board at Macquarie University. Participants were advised that the study was about phishing email detection associated with cue identification. They were informed that they would receive course credit upon completion of several tasks, which included an online survey, a series of email image evaluations, and a cue utilization task.
Participants answered demographic questions and the CRT-2 items (Thomson & Oppenheimer, 2016), and then completed the phishing decision task on the Qualtrics platform (Qualtrics, 2021). Participants were instructed to continue to the next task by clicking on the arrow at the bottom of the page, upon which they were redirected to the EXPERTise 2.0 platform. Upon completion, participants were redirected to the debriefing statement. Over the following weeks, three simulated phishing emails were sent from Macquarie University’s IT department to participants’ university email addresses.
Design and Statistical Analysis
The study employed a quasi-experimental, 2 × 2 between-subjects design to examine the effects of cognitive reflection (IV; high vs. low) and cue utilization (IV; high vs. low) on phishing email decision accuracy and response latency (DVs) in the phishing decision task, as well as email engagement (DV) with the simulated phishing emails. A k-means cluster analysis was used to identify the cue utilization groups, and cognitive reflection scores were classified as either higher or lower (calculation method described in “Data Reduction and Preliminary Analysis”). Analysis of variance (ANOVA) and multinomial regression were used to test the hypotheses.
Results
Data Reduction and Preliminary Analysis
Cases were visually assessed for missing data and those with substantially incomplete records were removed. A total of 94 participants were retained for the analyses, and the error rate was set at α = 0.05.
Participants’ responses on the CRT-2 were scored as either the correct (1) or incorrect (0) answer and tallied to yield a total score out of 4. Participants were divided into two distinct groups, with those who scored 0 through 2 being categorized as “lower” in cognitive reflection (n = 38), and those who scored 3 or 4 categorized as “higher” (n = 56).
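The scoring and grouping rule described above can be sketched in Python as follows. This is a minimal illustration rather than the scoring script used in the study; the function names and the example response vectors are hypothetical.

```python
from typing import Sequence


def crt2_total(item_correct: Sequence[int]) -> int:
    """Sum four CRT-2 items, each scored 1 (correct) or 0 (incorrect)."""
    if len(item_correct) != 4 or any(x not in (0, 1) for x in item_correct):
        raise ValueError("expected four items scored 0 or 1")
    return sum(item_correct)


def crt2_group(total: int) -> str:
    """Totals of 0-2 map to 'lower'; totals of 3-4 map to 'higher'."""
    if not 0 <= total <= 4:
        raise ValueError("CRT-2 total must be between 0 and 4")
    return "higher" if total >= 3 else "lower"


# Hypothetical participants: two correct answers vs. three correct answers.
print(crt2_group(crt2_total([1, 1, 0, 0])))  # lower
print(crt2_group(crt2_total([1, 1, 1, 0])))  # higher
```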
Consistent with previous methods, cue utilization was established based on participants’ performance across the five EXPERTise 2.0 tasks. A k-means cluster analysis was performed on standardized scores from each task, forcing a two-cluster model (i.e., higher and lower cue utilization), consistent with previous approaches (Bayl-Smith et al., 2020; Brouwers et al., 2016; Sturman et al., 2021). The FIT, FRT, FAT, and FDT yielded statistically significant mean differences between the two groups. Only one FAT (sequential) was included in the cluster analysis because the two FAT tasks were strongly correlated, r = .943, p < .001. Scores on the FPT failed to reveal the expected direction of performance across the two groups and were excluded from further analysis.
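To illustrate the general idea of a forced two-cluster k-means solution on standardized task scores, a minimal pure-Python sketch is given below. It is not the procedure or software used in the study: the starting centroids, the two illustrative tasks, and all data values are assumptions made for the example, and statistical packages typically use different initialization rules.

```python
from statistics import mean, stdev


def standardize(scores):
    """Convert one task's raw scores to z-scores (sample SD)."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]


def squared_distance(a, b):
    """Squared Euclidean distance between two score profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_cluster_kmeans(rows, max_iter=100):
    """Lloyd's algorithm with k = 2 on rows of standardized scores."""
    # Assumption: deterministic start from the rows with the lowest and
    # highest summed z-scores.
    ordered = sorted(rows, key=sum)
    centroids = [list(ordered[0]), list(ordered[-1])]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if squared_distance(r, centroids[0]) <= squared_distance(r, centroids[1]) else 1
            for r in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels, centroids


# Hypothetical raw scores for six participants on two illustrative tasks.
task_a = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
task_b = [10, 11, 9, 20, 21, 19]
rows = list(zip(standardize(task_a), standardize(task_b)))
labels, centroids = two_cluster_kmeans(rows)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Which cluster is labeled “higher” cue utilization depends on the scoring direction of each task, so the centroids need to be inspected before the groups are named.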
Standardized Means From EXPERTise Tasks: Centroid Values for the Four Retained EXPERTise 2.0 Task Clusters.
Note. The F test differences between clusters were statistically significant (p < .05).
Participants’ responses on the phishing decision task were separated into “true positives,” relating to the correct detection of the 20 phishing emails (1 = correct, 0 = incorrect), and “false positives,” relating to the false detection of phishing when presented with the 20 genuine emails (1 = false detection, 0 = no detection). Participants’ response latency (i.e., time taken to diagnose an email as trustworthy or suspicious) was recorded in seconds (s) and averaged to yield a mean speed score. The participants’ raw scores for cue utilization, total CRT-2, and the phishing task (i.e., true positives, false positives, and response latency) were transformed into standardized z-scores. No participants were identified as having extreme scores, with all absolute z-scores below 3.29 (Osborne & Overbay, 2004; Tabachnick & Fidell, 2007). Participants’ engagement with each of the two simulated phishing emails was classified into one of three behavioral outcomes, each assigned a numerical value: disregarding the email (0), opening the phishing email without clicking the embedded URL (1), and clicking the embedded URL (2).
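The screening for extreme scores and the coding of engagement outcomes can be sketched as follows. This is an illustrative example only: the helper names and the latency values are hypothetical, and the sample standard deviation is assumed for the z-transformation.

```python
from statistics import mean, stdev

# Numerical codes for the three behavioral outcomes described in the text.
ENGAGEMENT_CODES = {"disregarded": 0, "opened_no_click": 1, "clicked_url": 2}


def z_scores(values):
    """Standardize raw scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def extreme_cases(values, cutoff=3.29):
    """Return the indices of cases whose absolute z-score exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > cutoff]


# Hypothetical mean latencies (s): 19 typical cases and one very slow case.
latencies = [8.0] * 10 + [9.0] * 9 + [40.0]
print(extreme_cases(latencies))  # [19]
```

Applied to the study data, a screen of this kind would return an empty list, since no participant was reported as exceeding the cutoff.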
Main Analysis
The phishing decision task scores (i.e., true or false positive detection) and response latency scores for the participants (N = 94) were examined using a series of 2 × 2 factorial between-groups analyses of variance (ANOVA). All scores were examined for violations of normality and homogeneity of variance for all groups. Interpretation of effect sizes based on partial eta squared (η2) was guided by Cohen (1988).
Cue Utilization, Cognitive Reflection, and Phishing Decision Task Accuracy
For true positive scores, results revealed non-significant main effects for both cue utilization, F (1,90) = 3.55, p = .063, partial η2 = .038, obs. power = 0.461, and cognitive reflection, F (1,90) = .180, p = .673, partial η2 = .002, obs. power = 0.07. No main effect was found for cue utilization on the false positive scores, F (1,90) = 1.08, p = .301, partial η2 = .012, obs. power = .178. Contrary to expectations, this finding suggested that those with higher levels of cue utilization did not detect phishing more accurately (i.e., in terms of true positives or false positives) than those with lower levels of cue utilization on the phishing decision task (H2).
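For readers who wish to check the reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity partial η2 = (F × df effect) / (F × df effect + df error). The short sketch below applies this identity to the values reported above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from F and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values reported above, each with (1, 90) degrees of freedom.
print(round(partial_eta_squared(3.55, 1, 90), 3))   # 0.038
print(round(partial_eta_squared(0.180, 1, 90), 3))  # 0.002
print(round(partial_eta_squared(1.08, 1, 90), 3))   # 0.012
```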
Results revealed a statistically significant effect of cognitive reflection grouping on false positive scores, F (1,90) = 4.88, p = .030, partial η2 = .051 (small effect), obs. power = .589. Participants with a lower tendency for cognitive reflection (n = 38) incorrectly diagnosed genuine emails as phishing emails more often (M = 6.50, SE = .43) than participants with higher cognitive reflection (n = 56; M = 5.32, SE = .32). This supported the expectation that participants who demonstrate a greater tendency for cognitive reflection will also demonstrate greater accuracy on the phishing decision task (H3a; see Figure 1).
Figure 1. Mean true positive and false positive scores across cognitive reflection conditions. Note. Error bars represent standard errors (±1 SE).
The findings failed to reveal an interaction between cue utilization and cognitive reflection for either true positive scores, F (1,90) = .225, p = .637, partial η2 = .002, obs. power = 0.08, or false positive scores, F (1,90) = .827, p = .365, partial η2 = .009, obs. power = 0.15. Therefore, the mean phishing decision accuracy for participants in either of the two cue utilization clusters (i.e., higher or lower) was not contingent on participants’ tendency to engage in cognitive reflection, and vice versa (see Figures 2 and 3).
Figure 2. Interaction of mean true positive scores across conditions. Note. Error bars represent standard errors (±1 SE).
Figure 3. Interaction of mean false positive scores across conditions. Note. Error bars represent standard errors (±1 SE).

Cue Utilization, Cognitive Reflection, and Phishing Decision Task Response Latency
A between-subjects ANOVA was conducted to test the effect of cue utilization and cognitive reflection groupings on participants’ response latencies in the phishing decision task. No main effect was found for cue utilization, F (1,90) = .725, p = .397, partial η2 = .008, obs. power = 0.13. However, results revealed a significant main effect for cognitive reflection, F (1,90) = 9.68, p = .002, partial η2 = .097 (moderate effect), obs. power = 0.87, with participants with greater cognitive reflection taking more time to provide a response (H3b). Additionally, there was a significant interaction between cognitive reflection and cue utilization, F (1,90) = 5.00, p = .028, partial η2 = .053 (small effect), obs. power = .60. The interaction is shown in Figure 4.
Figure 4. Interaction of mean average response latency across conditions. Note. Error bars represent standard errors (±1 SE).
Four simple effects tests were conducted to further analyze the interaction, using a Bonferroni-adjusted alpha of .0125 to maintain the familywise error rate at .05 (Field, 2013). The simple effect was not statistically significant for participants in either the high cognitive reflection group, F (1,90) = 1.11, p = .294, partial η2 = .012, or the high cue utilization group, F (1,90) = .461, p = .499, partial η2 = .00. For participants in the low cognitive reflection group, the simple effect of cue utilization was significant at the unadjusted .05 level, although it did not meet the Bonferroni-adjusted threshold of .0125, F (1,90) = 4.18, p = .044, partial η2 = .044 (small effect), with greater average response latency for those with higher cue utilization levels (M = 10.17, SE = .77) than for those with low cue utilization levels (M = 7.88, SE = .81). For participants in the low cue utilization group, the simple effect of cognitive reflection was statistically significant, F (1,90) = 12.24, p = .001, partial η2 = .120 (moderate effect), with greater average response latency for those with higher cognitive reflection levels (M = 11.84, SE = .79) than for those with low cognitive reflection levels (M = 7.88, SE = .81).
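The Bonferroni adjustment used for these follow-up tests is simply the familywise alpha divided by the number of tests. The sketch below, with illustrative labels for the four simple effects, compares the reported p-values against the adjusted threshold; it is a reader's check rather than part of the original analysis.

```python
def bonferroni_alpha(familywise_alpha, n_tests):
    """Per-test alpha that keeps the familywise error rate at the target."""
    return familywise_alpha / n_tests


# p-values for the four simple effects as reported in the text.
reported_p = {
    "cue utilization within high cognitive reflection": 0.294,
    "cognitive reflection within high cue utilization": 0.499,
    "cue utilization within low cognitive reflection": 0.044,
    "cognitive reflection within low cue utilization": 0.001,
}

alpha = bonferroni_alpha(0.05, len(reported_p))  # 0.05 / 4 = 0.0125
below_adjusted = [name for name, p in reported_p.items() if p < alpha]
below_unadjusted = [name for name, p in reported_p.items() if p < 0.05]
print(below_adjusted)
print(below_unadjusted)
```

On these values, only the effect of cognitive reflection within the low cue utilization group (p = .001) falls below .0125, while the effect of cue utilization within the low cognitive reflection group (p = .044) falls below .05 only.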
Overall, participants with higher levels of cognitive reflection, cue utilization, or both responded at similar speeds. However, higher cue utilization levels unexpectedly appeared to slow responses among participants in the lower cognitive reflection group and, predictably, higher cognitive reflection tendencies appeared to increase latency for participants with lower levels of cue utilization during the phishing decision task. Participants lower in both cognitive reflection and cue utilization demonstrated the shortest response latencies.
Cue Utilization, Cognitive Reflection, and Phishing Simulation Engagement
Descriptive Statistics for Multinomial Regression Across Naturalistic Phishing Simulations.
Results revealed that for both phishing email attempts (i.e., emails with relatively greater or fewer phishing cues), the relationship between email engagement and cue utilization was not statistically significant, χ2 (2) = 2.45, p = .293 and χ2 (2) = 1.26, p = .533. This means that cue utilization did not predict participant engagement with the simulated phishing emails (H4). Likewise, there was no statistically significant relationship between email engagement and cognitive reflection for either phishing email attempt, χ2 (2) = .76, p = .685 and χ2 (2) = 3.38, p = .185 (H5).
A chi-square test of independence was used to investigate whether participants’ email engagement differed as a function of the number of phishing cues embedded within the emails. There was no significant relationship between the form of email interaction and the type of email viewed (i.e., greater or fewer cues), χ2 (2, N = 174) = 5.29, p = .071, Cramér’s V = .17. Thus, the proportions of the various types of email interaction did not differ depending on the number of cues present in the email. As such, results demonstrated that engagement with a naturalistic phishing simulation was not contingent on the number of phishing cues or on the interaction of cue utilization and cognitive reflection groupings (RQ1 and RQ2).
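The reported effect size can be checked from the chi-square statistic using the standard definition of Cramér’s V, the square root of χ2 divided by N × (min(rows, columns) − 1). The sketch below assumes a 2 × 3 table (two email types by three engagement outcomes), which matches the two degrees of freedom reported; the function name is illustrative.

```python
from math import sqrt


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramér's V for an n_rows x n_cols contingency table."""
    return sqrt(chi_square / (n * (min(n_rows, n_cols) - 1)))


# Two email types (greater vs. fewer cues) by three engagement outcomes.
print(round(cramers_v(5.29, 174, 2, 3), 2))  # 0.17
```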
Correlations Between Phishing Decision Task and Simulated Phishing Emails
To further investigate the non-significant results, Pearson’s correlations were computed to examine the relationships between participants’ performance on the phishing decision task (i.e., true and false positive scores and response latency scores) and participants’ engagement with the simulated phishing emails (i.e., opening an email or clicking on an embedded link). There were no statistically significant correlations between the phishing decision task components and participants’ engagement in the simulated phishing campaign.
Discussion
The Study’s Hypotheses and Research Questions.
Key findings
Cue Utilization, Cognitive Reflection, and Phishing Decision Task
Practical implications for the EXPERTise 2.0 battery may be gleaned from its ability to distinguish differences in the broader population and therefore its potential use as a diagnostic tool for employee training needs, as well as a tool for measuring skill acquisition from relevant interventions (Morrison et al., 2018). This result is consistent with the recent literature demonstrating that the software was able to differentiate cue utilization abilities in several operational contexts, including phishing detection (Ackerley et al., 2022), radiology (Carrigan et al., 2021), aviation (Renshaw & Wiggins, 2017; Wiggins et al., 2018), and electricity distribution (Wiggins et al., 2020). The results may have been limited by experimental design choices. Cue utilization was clustered into either higher or lower groupings, and this method of division may have been too coarse to detect significant differences. Future studies may examine more granular differences in cue utilization expertise (e.g., novice, beginner, and intermediate). Potential ceiling effects are also noted given the high average accuracy in the phishing decision task. Participants were explicitly instructed to identify emails as either genuine or phishing, which could have primed participants to attend to phishing cues, introducing experimental artefacts such as expectation effects. Lastly, the non-significant result may also be attributed to the poverty of cues within a phishing email when compared to other operational domains (e.g., firefighting, driving, and flight simulation). The limited visual markers may disadvantage participants in their efforts to acquire skill and cue expertise, which require sufficient detail to create a mental model representative of the complex task (French & Nevett, 1993; Harré et al., 2012; Klein & Klinger, 1991).
This is especially evident in spear phishing emails, which are designed to appear authentic and personalized, and as such phishing cues may not be as apparent. Therefore, cue utilization may not be as valuable within the cyber security domain compared to other domains (Benenson et al., 2017; Lin et al., 2019). Speculatively, participants with less of an inclination towards reflective processes may have assumed that phishing emails were presented at a higher rate than they actually were. The Truth-Default Theory (TDT) states that people on average tend to trust others, and because the phishing decision task alerted participants to the presence of phishing emails, this may have created demand characteristics that disproportionately calibrated the default decision-making of participants who generally engage less in systematic interrogation (Levine, 2014). Organizations may benefit from creating procedures that support staff in carving out dedicated time for focused email processing, as well as from investment in educational and training programs to mitigate cyber security risks, especially for individuals who are less inclined to engage in cognitive reflection processes. Importantly, training and education should be conducted frequently, as some studies note that behavioral changes are short-lived because employees continue to rely on heuristics and reinforce co-processing habits in demanding environments (Canova et al., 2014; Vishwanath, 2015).
Cue Utilization, Cognitive Reflection and Naturalistic Phishing Simulation
The authors also acknowledge the idiosyncratic nature of how people manage their personal inboxes, and the difficulty in measuring confounds or complexities associated with the real-world. Previous studies examining similar relationships note that there are many confounds that influence how people respond to unsolicited communications, including contextual expectations (Harré et al., 2012; Vishwanath, 2015), authority and urgency cues (Williams et al., 2018), age and sex (Lin et al., 2019; Sheng et al., 2010), and users’ propensity for curiosity, risk, and general Internet usage (Moody et al., 2017). Further, motivational variables may play a role (e.g., organizational commitment and job satisfaction), which would vary across organizations (Cooke et al., 2004).
Limitations
A limitation of the study was the size of the final sample and its homogeneous demographics (i.e., first year university students). It is noted that a larger sample may have yielded significant relationships between cue utilization, cognitive reflection, and phishing decision task performance, which were not detected in the current study. Alternatively, future studies may wish to cluster cue utilization into more specific levels of expertise (e.g., novice, beginner, and intermediate) to identify greater detail in the existing relationships.
Another limitation to be considered is that most participants performed well in the phishing decision task, and the subsequent ceiling effects likely contributed to the non-significant results. This may have been a result of the 50:50 ratio of legitimate to phishing emails, which disproportionately increased the likelihood of detection. Previous studies suggest that sufficient task exposure may progressively improve participant performance (Nasser et al., 2020), but studies have also shown that participants anticipate a 50:50 ratio in experimental tasks and adjust their responses accordingly, whereas real-life phishing emails occur far less frequently (ACSC, 2021; Canfield et al., 2016).
A further potential limitation is that the phishing emails were sent within six weeks of completing the phishing cue and performance assessments. It is possible that these measures sensitized participants to phishing cues, thereby enhancing their detection capabilities during the naturalistic simulation. Additionally, participants were informed that they would be sent phishing emails, which may have made them more vigilant and conscious of the impending emails. These factors could have influenced the results, potentially leading to an overestimation of participants’ ability to detect phishing attempts. Future research should consider extending the time interval between measures and utilizing limited disclosure to reduce the likelihood of priming effects and to better simulate real-world conditions.
A final consideration is the approach to measuring email engagement in the naturalistic condition. Participants who did not open the email generally had fewer phishing cues to assess, relying only on “pre-opening” cues such as sender information or subject line. In contrast, participants who opened the email but did not click on the malicious link had access to additional cues within the email body. Although these individuals were better performers than those who clicked on the link, they were still vulnerable to the risk of malware installation from simply opening the email. This approach provides a staged view of user performance. However, future research could benefit from further investigating the differentiation between cues visible before and after opening the email. Such an approach would offer a more comprehensive evaluation of participants’ phishing detection abilities across various stages of email interaction.
Conclusion and future directions
To the authors’ current knowledge, this study is the first to investigate cue utilization (using the EXPERTise 2.0 phishing decision battery) and decision-making using a naturalistic phishing simulation. The study’s novelty is further extended by the division of phishing decision accuracy into true positive and false positive response rates and by examining the potential for interaction between cue utilization and cognitive reflection. Results revealed that participants with relatively lower cognitive reflection were more likely to falsely diagnose genuine emails as phishing emails, implicating cognitive reflection tendencies as an important information processing mechanism for phishing-related decision-making. Furthermore, task response was slowest for participants with both greater cognitive reflection tendencies and lower levels of cue utilization, compared to other groups. This suggests that cognitive reflection tendencies may influence deliberation time and provide a greater chance for deeper email engagement for users with relatively lower cue utilization skills.
Although cognitive reflection is considered a stable personality trait, future research should investigate interventions that might encourage greater focus and attention, such as mindfulness, which has shown positive outcomes in the workplace (Althammer et al., 2021; Creswell, 2017; Dobie et al., 2016). Aspects of system design and software (e.g., warning banners) could also be leveraged to discourage concurrent task or information processing and to encourage a systematic consideration of emails, supporting workers in highly complex and demanding environments.
Further, studies should investigate cyber security risks within a naturalistic context by utilizing simulated phishing trials across various demographics and contexts. The few significant results from the current study’s phishing simulation highlight the knowledge gap between experimental settings and naturalistic confounds, and reflect the scarcity of scientific understanding and the increasing vulnerability of the email user. Additionally, usability testing using the naturalistic decision paradigm across a wide demographic could map commonalities around email usage, and therefore advance knowledge of the confounds and complexities in phishing email detection. Overall, the findings of this study have implications for training and educational approaches that encourage email users to engage in systematic and effortful decision-making, as well as for software design interventions that serve as both a technological barrier and a warning system for vulnerable workers.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
