Abstract
Confidence is commonly assumed to monitor the accuracy of responses. However, intriguing results, examined in the light of philosophical discussions of epistemic justification, suggest that confidence actually monitors the reliability of choices rather than (directly) their accuracy. The focus on reliability is consistent with the view that the construction of truth has much in common with the construction of reality: extracting reliable properties that afford prediction. People are assumed to make a binary choice by sampling cues from a “collective wisdomware,” and their confidence is based on the consistency of these cues, in line with the self-consistency model. Here, however, I propose that internal consistency is taken to index the reliability of choices themselves—the likelihood that they will be repeated. The results of 10 studies using binary decisions from different domains indicated that confidence in a choice predicts its replicability both within individuals and across individuals. This was so for domains for which choices have a truth value and for those for which they do not. For the former domains, differences in replicability mediated the prediction of accuracy whether confidence was diagnostic or counterdiagnostic of accuracy. Metatheoretical, methodological, and practical implications are discussed.
Questions about truth and its justification have intrigued philosophers for centuries. These questions have also been important in many applied areas, such as jury decisions and medical diagnoses (Dunning et al., 2004).
In psychological research, assessments of subjective confidence have been used and investigated in a wide range of domains, including perception and psychophysics, memory, decision-making, eyewitness testimony, and social cognition. A great deal of research has examined the correspondence between confidence and the accuracy of first-order decisions. Two aspects of correspondence have been distinguished: monitoring resolution and monitoring calibration. Monitoring resolution (Liberman & Tversky, 1993; Nelson, 1984; Yates, 1990) is generally indexed by the within-person confidence-accuracy correlation across items, which reflects participants’ ability to discriminate between correct and incorrect answers. Calibration, in turn, concerns the absolute discrepancy between confidence and performance—the extent to which confidence judgments are realistic or disclose an overconfidence or underconfidence bias (Lichtenstein et al., 1982). Although this article has implications for both resolution and calibration, the discussion and the empirical analyses will center on monitoring resolution.
In this article, I take a bird’s-eye view of the results of a large number of studies, some of which have been conducted in the context of the self-consistency model (SCM) of subjective confidence (Koriat, 2012a, 2018). In these studies, participants chose the answer to two-alternative forced-choice (2AFC) questions and indicated their confidence in their choice. In some of the domains (e.g., general knowledge, perceptual judgments), the choice has a truth value, whereas in others (e.g., social attitudes, social beliefs) it does not. These studies yielded some intriguing results that motivated a new look at confidence judgments.
I begin by sketching the general structure of the article, which relies on previous results but proposes a different view of what confidence judgments monitor. My hope is that this view will contribute to a greater rapprochement between psychological research on confidence and the rich discussions in philosophy about truth and its justification.
The proposal advanced in this article was motivated by two observations that have been found to display impressive generality across different domains. The first concerns the extent to which confidence judgments monitor the accuracy of first-order beliefs. Discussions in the philosophy of truth are characterized by strong doubts about people’s ability to discriminate between true and false beliefs. These doubts stand in sharp contrast to the empirical results in psychology indicating that people’s confidence judgments are largely diagnostic of the accuracy of first-order beliefs in many domains. However, recent results obtained in the context of the SCM clearly indicate that the confidence-accuracy correlation is positive only when calculated across items for which first-order choices are more likely to be correct than wrong. These items are more representative of our natural ecology. In contrast, when the confidence-accuracy correlation is calculated across items for which first-order responses are more likely to be wrong, the correlation is consistently negative. This pattern of results may contribute to a reconciliation between philosophers and experimental psychologists, suggesting that people do not have privileged access to the truth of their beliefs. Rather, they rely on a heuristic that has some validity but only within the confines of the natural ecology.
The second observation concerns confidence judgments in domains for which first-order choices do not have a truth value (e.g., social beliefs). If the question of confidence accuracy is ignored, the results for such judgments have been found to display similar regularities to those observed for binary choices that have a truth value. This observation suggests that confidence may be monitoring a property other than “correctness” or “truth,” unlike what has been assumed by theories of confidence, most of which concern binary choices with truth value.
I propose that people’s confidence in a choice monitors the extent to which the choice is “trustworthy” or “reliable”—likely to be made again in many future encounters with the question. This is the essence of the replicability hypothesis tested in this article. This hypothesis has two important implications. First, at the metatheoretical level, it implies that the construction of truth has much in common with the construction of reality, which rests on the extraction of reliable properties in sense experience that can aid predictions. Second, at the experimental level, this proposal implies that the immediate performance criterion for confidence is replicability rather than accuracy: Confidence in a choice should predict the replicability of the choice both within individuals and (as follows from the SCM) between individuals. This is expected to be the case whether the choice has a truth value or not, and for domains for which the choice has a truth value, whether confidence is diagnostic or counterdiagnostic of the accuracy of the belief.
The introduction of this article consists of three sections. In the first section, I outline the SCM and review the results that gave rise to it. In the second section, I discuss the replicability hypothesis in the light of the debates in the philosophy of mind on beliefs and their epistemic justification. The third section examines concerns about the replicability hypothesis, particularly the assumption that confidence in the outcome (choice) of a random process can predict the replicability of that outcome.
I use the results from 10 studies to test the replicability hypothesis for confidence and response speed. I then summarize the experimental and correlational results that corroborate the replicability hypothesis.
The Self-Consistency Model
In this section I review the SCM and the observations that support it. Three sets of results are described. The first of these is also used to describe the model.
Monitoring resolution: the accuracy of confidence judgments
Many studies indicate that confidence judgments are diagnostic of the accuracy of first-order judgments. This observation points to the value of subjective confidence as a guide for behavior. Indeed, results indicate that people rely heavily on their confidence in a belief in deciding whether to translate that belief to action (Gill et al., 1998; Goldsmith & Koriat, 2008; Koriat, 2011; Tullis & Goldstone, 2020; Wixted & Wells, 2017).
Koriat (2018), however, noted that because of people’s adaptation to reality through evolution and learning (Gigerenzer et al., 1991; Hoffrage & Hertwig, 2006; Juslin, 1994; Koriat et al., 2000), first-order responses tend to be correct for many domains. For example, for 2AFC general-knowledge questions that were sampled representatively from their reference classes (Gigerenzer et al., 1991), the percentage of correct answers is around 75% rather than around 50% (see Koriat, 2018). Therefore, the confidence-accuracy correlation has typically been calculated across samples of items for which first-order responses are more likely to be correct than wrong. What happens when items are deliberately selected for which people’s first-order responses tend to be wrong?
This question was examined for several domains (Koriat, 2018) by deliberately including 2AFC items that yield a preponderance of wrong choices. The items were then divided for each domain into two classes: Those yielding more correct than wrong answers were labeled consensually correct (CC), and those yielding more wrong answers than correct answers were labeled consensually wrong (CW).
The results were clear: The CC items yielded the typical positive correlation between confidence and accuracy. The CW items, in contrast, yielded a negative correlation: For these items, participants were more confident in their wrong answers than in their correct answers, and for each item, participants who chose the wrong answer tended to endorse it with greater confidence than those who chose the correct answer. This pattern of results was observed for 16 different tasks. A similar pattern of results was reported by Brewer and associates: Whereas a positive relationship was observed for typical (“nondeceptive”) items, this relationship was negative for “deceptive” items that tend to elicit mostly erroneous responses (Brewer et al., 2005; Brewer & Sampaio, 2006, 2012; Sampaio & Brewer, 2009). Other studies also demonstrated a similar pattern of results for face recognition (Sampaio et al., 2017), syllogistic reasoning (Bajšanski et al., 2019; Bajšanski & Žauhar, 2019), and the diagnosis of mammograms (Litvinova et al., 2022).
Two general assumptions underlie the SCM’s account of these results. The first is a sampling assumption that is common to many models (see Fiedler et al., 2023). Participants are assumed to construct their response to a 2AFC item on the spot (Schwarz, 2007) by retrieving a sample of cues from a population of cues associated with the item. They choose the option that is favored by the majority of the cues (Shafir et al., 1993; Tversky & Koehler, 1994; Vickers, 2001), and their confidence is based on the consistency with which that option was supported across the sampled cues (Brewer & Sampaio, 2012; Slovic, 1966).
The second assumption is that in many domains, people with similar backgrounds retrieve their cues from distinct but overlapping distributions of cues that are relevant to the item (see Koriat et al., 2020). The repository of cues that are largely shared is referred to herein as the “collective wisdomware.” Because of people’s adaptation to reality through evolution and learning, most of the cues in the collective wisdomware tend to support the correct answer for most items, therefore yielding a positive confidence-accuracy correlation for these items (Hertwig, 2012; Kurvers et al., 2016). For the less representative, CW items, in contrast, most of the cues lean toward the wrong answer, thus yielding a preponderance of wrong answers as well as a negative confidence-accuracy correlation.
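The two assumptions can be illustrated with a minimal simulation. This is only a sketch: the cue-population proportions (0.75 for a CC-type item, 0.30 for a CW-type item) and the sample size of seven cues are hypothetical values chosen for illustration, not parameters taken from the studies reviewed. Each virtual respondent draws binary cues, chooses the option favored by the majority of the sample, and bases confidence on the consistency of the sample.

```python
import random

def respond(p_correct, n_cues=7):
    """One virtual respondent: draw n_cues binary cues, choose the
    majority option; confidence = proportion of cues backing the choice."""
    support = sum(random.random() < p_correct for _ in range(n_cues))
    correct = support > n_cues / 2           # odd n_cues, so no ties
    confidence = max(support, n_cues - support) / n_cues
    return correct, confidence

def mean_conf(results, want_correct):
    confs = [c for ok, c in results if ok == want_correct]
    return sum(confs) / len(confs)

random.seed(1)
cc = [respond(0.75) for _ in range(20000)]   # CC item: cues mostly support the correct answer
cw = [respond(0.30) for _ in range(20000)]   # CW item: cues mostly support the wrong answer

# CC item: correct choices are endorsed with higher mean confidence.
assert mean_conf(cc, True) > mean_conf(cc, False)
# CW item: wrong choices are endorsed with higher mean confidence.
assert mean_conf(cw, False) > mean_conf(cw, True)
```

The same consistency heuristic thus yields a positive confidence-accuracy relationship for the CC-type item and a negative one for the CW-type item, with no change in the process itself.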
The majority effect for confidence judgments
A second set of results is the prototypical majority effect (Koriat et al., 2016, 2018): For each item, people who opt for the majority response endorse it faster and with stronger confidence than those who choose the minority response, with the majority–minority difference increasing as a function of the degree of cross-person consensus. The prototypical majority effect is consistent with the assumption that choice and confidence are based on the sampling of cues from a collective wisdomware for each item. Somewhat surprisingly, this assumption turned out to hold true even for domains such as social attitudes, social beliefs, and personal preferences, for which the choice does not have a truth value (Koriat, 2012a).
The prototypical majority effect is precisely what social psychologists expect to stem from conformity to group influence (Bassili, 2003; Festinger, 1954). However, results suggest that this effect can occur independent of any social influence. In fact, it was observed even when people fail to guess the majority decision (Koriat et al., 2018) and was also observed within individuals: When participants were presented with the same list of items on several occasions, their confidence was higher for their majority choice across presentations than for their minority choice, with the difference in confidence increasing with the degree of choice consistency across presentations.
Collective decision-making
A final set of results concerns collective decision-making. Many studies have documented an advantage for group-based decisions over individual decisions. Some of these studies relied on the wisdom-of-crowds idea (Surowiecki, 2004), indicating that the aggregation of judgments across individuals yields better accuracy than that of the average individual (see Clemen, 1989; Hanea et al., 2021; Herzog et al., 2019; Kurvers et al., 2016; Larrick et al., 2012). Other studies that involved joint decisions reached by interacting group members indicated that cooperative groups generally perform better than independent individuals (e.g., Hill, 1982; Koriat, 2015b; Laughlin, 2011; Steyvers & Miller, 2015).
Koriat (2012c) examined a maximum confidence-slating algorithm that was applied to virtual dyads. For each item, the decision that was made with higher confidence by one member of the dyad was selected, and all selected decisions were compiled to form a dummy high-confidence participant. This algorithm was found to yield more accurate performance than the best member of a dyad. However, this was true only for the CC items. For the CW items, in contrast, the algorithm yielded worse performance than the worst member of the dyad (Koriat, 2012c; see also D. Bang et al., 2017; Hasan et al., 2021; Hertwig, 2012; Litvinova et al., 2022).
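The confidence-slating procedure itself is simple and can be sketched as follows; the choices and confidence ratings here are invented for illustration, and the tie-breaking rule (prefer the first member on equal confidence) is an assumption of the sketch.

```python
def confidence_slate(member_a, member_b):
    """Maximum confidence slating: for each item, keep the decision of
    whichever dyad member endorsed it with higher confidence.
    Each member is a list of (choice, confidence) tuples, item by item."""
    return [a if a[1] >= b[1] else b
            for a, b in zip(member_a, member_b)]

# Hypothetical three-item dyad.
a = [('x', 0.9), ('y', 0.6), ('x', 0.7)]
b = [('y', 0.8), ('y', 0.9), ('y', 0.8)]
dummy = confidence_slate(a, b)   # the "dummy high-confidence participant"
assert dummy == [('x', 0.9), ('y', 0.9), ('y', 0.8)]
```

Because the compiled decisions are those held with the stronger confidence, the procedure amplifies whatever confidence tracks: for CC items it selects mostly correct answers, whereas for CW items it selects the consensual wrong answers, consistent with the pattern reported by Koriat (2012c).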
In a subsequent study (Koriat, 2015b), members of a dyad made individual decisions (and confidence judgments) before interacting to reach a joint decision. The maximum confidence-slating algorithm applied to the individual decisions yielded a similar pattern to that observed in Koriat (2012c). However, dyadic interaction was found to accentuate this pattern further, increasing accuracy significantly for the CC items but decreasing it significantly for the CW items. Thus, both confidence slating and group deliberation affected performance in the same direction, improving accuracy when individual accuracy was better than chance but impairing it when individual accuracy was below chance. These results suggest that collective decision-making tends to amplify the effects of the collective wisdomware on choice and confidence.
It should be noted that all the effects mentioned above for confidence judgments were also observed for response speed under the assumption that response speed also increases with self-consistency (Koriat, 2012a).
Taken together, the results reviewed above suggest the following generalizations: First, choice and confidence are based on the sampling of cues largely from a collectively shared wisdomware. Because each choice is based on a small number of cues, it may vary between individuals as well as within individuals. Second, confidence increases with the extent to which the vote of the sampled cues agrees with that of the shared wisdomware. Thus, majority choices are endorsed with higher confidence than minority choices, and for items for which the choice has a truth value, confidence increases with the consensuality of the choice rather than with its accuracy. Finally, the contribution of collective wisdomware to choice and confidence increases with factors that boost collective decision-making.
The Current Proposal: Metatheoretical Considerations
Philosophical theories of knowledge and its justification
The proposal advanced in this article can be introduced by reference to the extensive debates in the philosophy of mind about beliefs and their justification (Alston, 1989; BonJour & Sosa, 2003; Goldman, 1979; Sosa, 1992). In psychology, the term “knowledge” is used somewhat liberally to include both true and false beliefs (see Koriat, 1993). In philosophy, in contrast, propositional knowledge is commonly defined as justified true belief. That is, people are said to “know” P if they believe P, if P is true, and if the belief is justified, that is, not a mere lucky guess.
A great deal of discussion in philosophy has revolved around the notion of epistemic justification, which refers roughly to metacognitive monitoring—the justification of one’s first-order beliefs (see Alston, 1989; Kirkham, 1992; Williamson, 2010). The standard theory of truth since the Greek philosophers is the correspondence theory—that the truth or falsity of a statement is determined by how that statement relates to the world and whether it describes objects or facts accurately.
The main problem with correspondence theories concerns epistemic justification: How can people ascertain that a belief is true? It was argued that even if I believe P and P is true, there is no way I can ascertain that P is true. To do so, I have to “get outside” my beliefs, so to speak, to compare my beliefs to the objective facts. In psychological terms: I do not have extra knowledge (“metaknowledge”) that allows me to compare my “knowledge” to reality. The controversies around this question have led some philosophers to challenge the definition of knowledge as a justified true belief (e.g., Gettier, 1963) and have led others to doubt whether knowledge itself is possible (see Greco, 2008).
The objections to correspondence theory gave rise to coherence theory, which has several versions (BonJour, 1985; Rescher, 1973; Walker, 1989). According to coherence theories, the truth or falsity of a statement is determined by its relation to other statements rather than its relation to the world. From this perspective, a person’s beliefs are justified insofar as they belong to a system of beliefs that are mutually supportive.
Some puzzles
The debates in philosophy surrounding correspondence and coherence theories raise two questions that help introduce the thesis of this article. First, the objection to correspondence theories mentioned above stands in sharp contrast to the empirical evidence in psychology, which indicates that people are quite skilled in monitoring the accuracy of their first-order beliefs.
With regard to this question, the SCM may be seen to contribute toward a reconciliation between philosophers and psychologists regarding people’s ability to discriminate between true and false beliefs. As reviewed earlier, confidence judgments are diagnostic of accuracy only across items for which first-order beliefs are mostly correct. The implication is that people’s ability to monitor the correctness of their responses is due to a heuristic—the self-consistency heuristic—that yields a positive confidence-accuracy relationship only for our natural ecology but may go awry in other ecologies. Thus, the doubts about people’s general ability to monitor the correctness of their beliefs are warranted independent of the theory of truth that is entertained.
The second question concerns coherence theories: How do coherence theories offer a solution to the problem faced by correspondence theories—ascertaining that a belief corresponds to reality? Philosophers have debated whether coherence can be said to be conducive to truth (e.g., BonJour, 1985; Olsson, 2002). BonJour (1999), for example, raised doubts about the possibility that coherence, in itself, is sufficient for justification, arguing that a pure coherence theory would still require input from the outside world.
As far as this second question is concerned, the SCM may be seen to posit a link between coherence, in a particular sense, and correspondence. That sense of coherence is implied in Kant’s (1885) criticism of correspondence theory for failing to afford a justification of one’s beliefs: “For as the object is external to me, and the knowledge is in me, I can only judge whether my knowledge of the object agrees with my knowledge of the object” (p. 40).
The SCM actually implies that the ability to assess the agreement between one’s pieces of knowledge about the object can provide a basis for the subjective “justification” of one’s belief about the object. Thus, people may be said to use coherence as a cue for correspondence. Although Kant seems to dismiss one’s ability to assess the internal reliability of one’s own knowledge (“I can only judge”), that ability can constitute an asset in judging the validity of one’s own memory and knowledge (Bolte & Goschke, 2005; M. Ross, 1997; see also BonJour, 1999).
These ideas raise the question of whether research on metacognition can contribute to the philosophical discussions about truth and its justification. In recent years, work on metacognition has received a great deal of attention by philosophers (see Carruthers, 2011; De Sousa, 2009; Dokic, 2012; Proust, 2013; Proust & Fortier, 2018), and some philosophers have examined whether metacognitive judgments can offer epistemic justification of one’s own beliefs (Nagel, 2007; Proust, 2013). In addition, in psychology, Reber and Unkelbach (2010) discussed the possibility that the use of processing fluency as a cue for truth is epistemically justified under the assumption that most statements one is exposed to are true. On the basis of this logic, coherence or self-consistency may also be seen to afford epistemic justification.
However, the assumption that coherence or self-consistency serves as a cue to accuracy encounters two problems. The first, of course, is raised by the findings indicating that the self-consistency heuristic is not inherently diagnostic of accuracy and cannot offer a justification of one’s first-order beliefs when it is unclear whether the environment in question is “kind” or “wicked” (Hertwig, 2012; Litvinova et al., 2020).
In my previous discussions I argued that the self-consistency heuristic has been tailored specifically to the statistical structure of the natural environment, for which first-order responses are generally correct (Koriat, 2018). I made the same argument with regard to the feeling of knowing (Koriat, 1993, 1995). Indeed, Unkelbach (2007) reported results supporting the possibility that the processing fluency heuristic is learned.
However, the idea that the self-consistency heuristic is learned presupposes some access to the accuracy of one’s own beliefs in the first place. At present, it is unclear how people (possibly children) learn that most of the cues that come to mind when making confidence judgments or feeling-of-knowing judgments are generally correct.
The second problem concerns confidence in tasks for which the response does not have a truth value. As noted earlier, some of the results obtained for confidence in these tasks (e.g., the majority effect; Koriat et al., 2016) are very similar to those for tasks for which the response has a truth value.
Confidence as a monitor of choice replicability
These two problems suggest that confidence may actually monitor a different property than accuracy or truth. They raise the question: Do confidence judgments monitor a property that is (a) independent of the statistical structure of the environment in question and (b) indifferent to the question of accuracy?
It is proposed that confidence judgments monitor the reliability of the choice: People use the internal consistency among the sampled cues to infer the reliability of the choice itself—the likelihood that the same choice will be made across many presentations. To draw an analogy from test theory, the reliability of a test can be assessed by the correlation between different items on the same test (internal consistency) or by the consistency of a test score from one use to another (test–retest reliability). Researchers often use internal consistency to estimate test–retest reliability. In this article, I argue that in the same way, when people make confidence judgments, they rely on cue consistency (in line with the SCM) but use it to infer the replicability of the choice. In assessing confidence in a choice, people ask: Is this the response that I am eventually bound to agree on? Thus, rather than being aimed at assessing the accuracy of the choice, self-consistency is aimed at inferring another aspect of reliability—the likelihood that the choice will be made across multiple occasions. Such is the case also for choices that do not have a truth value. The interpretation of “confidence” as an assessment of replicability is what confidence judgments have in common whether they apply to the correctness of a choice that has a truth value or to the certainty in a social attitude or a social belief (Tormala & Rucker, 2007). In what follows, I use the terms “self-consistency” or “cue consistency” to designate the consistency across the sampled cues and “replicability” to designate the reproducibility of the choice. I use the term “reliability” more loosely to designate both.
The idea that confidence judgments monitor replicability rather than accuracy has been suggested recently by several authors. Of particular interest is the theoretical framework of Mamassian and colleagues (see Caziot & Mamassian, 2021; Mamassian & de Gardelle, 2022). In their studies on confidence in perceptual judgments, they argued that rather than attempting to monitor the correctness of a choice, perceivers aim at being self-consistent with themselves. Confidence judgments were said to reflect “an evaluation of the extent to which the current perceptual decision is self-consistent with other decisions the observer could have taken, previously or in the future, in the same conditions. In other words, self-consistency is a measure of reproducibility of perceptual decisions” (Caziot & Mamassian, 2021, p. 2).
These authors argued that people focus on self-consistency even when they are instructed to judge the correctness of their response, and do so even when they are capable of monitoring that correctness.
Another model that stresses the reliability of a choice was proposed by Boundy-Singer et al. (2022). According to their model, confidence reflects a noisy decision-reliability estimate. It represents the person’s estimate of the reliability of the decision, but the quality of that estimate is limited by the person’s uncertainty about the uncertainty of the variable that informs the decision (“meta-uncertainty”). The model predicts a systematic dependency of confidence on choice consistency.
It is of particular interest that the focus on reliability has emerged in the context of studies of perceptual confidence. Confidence in perceptual decisions has been the subject of a great deal of research in recent years (see Rahnev et al., 2022). As noted by Rahnev et al., recent research on perceptual confidence has been exhibiting increased interest in understanding self-evaluation itself rather than simply using confidence as a tool to understand perception.
The focus on reliability departs from the traditional focus on accuracy. Accuracy has been the dominant criterion in evaluating the validity of confidence judgments as well as other metacognitive judgments. This focus is also characteristic of work on the wisdom of crowds and on collective decision-making (see Kurvers et al., 2021). In philosophy too, correspondence to reality is typically used as the main criterion for epistemic justification (although other epistemic norms have also been discussed; see Proust, 2012). The emphasis on accuracy is understandable in view of the critical role of accuracy for the adaptation to reality. In fact, people take the accuracy of their judgments for granted and do so even when reliance on these judgments turns out to be detrimental (e.g., Koriat, 2011).
Metatheoretical implications of the replicability hypothesis
The focus on reliability and replicability has metatheoretical implications that tie together self-consistency, reliability, and prediction. The proposed conceptualization of confidence implies that the process underlying the psychological construction of truth has much in common with that underlying the construction of reality. The construction of reality involves an attempt to optimize the prediction of the environment by extracting the reliable regularities in sense experience. This attempt is illustrated by the observations that give rise to the notions of object permanence and object constancy. Consider the central distinction in epistemology dating back to Galileo, and espoused by Locke and Descartes, between primary and secondary qualities. Primary qualities (such as size, shape, and motion) have been assumed to be real properties of physical objects. Secondary qualities, in contrast (such as taste, color, and smell), are said to constitute merely the effects of these properties on the mind, not the properties of real objects. It may be argued that unlike secondary qualities, primary qualities afford mutual validation across different senses, giving rise to the subjective understanding that these proximal qualities convey information about distal objects that have enduring existence. The merit of primary qualities is that they allow reliable predictions: When I see an object in front of me, I know that I should walk around it rather than through it. Thus, the goal of the cognitive apparatus is to extract those properties of sense experience that afford reliable predictions. That extraction is achieved by focusing on the consistencies in sense experience. In like manner, cue consistency—the coherence among the cues accessed in making a choice—helps support the assessed reliability of the choice endorsed.
In what follows, I present empirical evidence for the replicability hypothesis. However, before doing so, I discuss certain concerns that are raised by the idea that confidence in a choice predicts the replicability of that choice.
A Critical Examination of the Replicability Hypothesis
The replicability hypothesis implies that if we compare the two responses made to the same binary item, either by the same person or by different people, the response endorsed with stronger confidence should evidence higher replicability than that endorsed with lower confidence. This hypothesis runs counter to our intuitions, particularly because the process underlying subjective confidence was modeled by analogy to the procedure underlying statistical confidence (Koriat, 2012a). Suppose, as was assumed by the simulation used to test the SCM, that different people respond to a 2AFC item by drawing a random sample of the same size from the same population of binary cues. Is it possible that samples that are entirely equivalent on a priori grounds can be distinguished a posteriori in terms of the likelihood that their binary outcome will be replicated?
The replicability hypothesis also contrasts with discussions of the outcome bias (Baron & Hershey, 1988), which refers to the error of evaluating the quality of a decision on the basis of its ultimate outcome. The replicability hypothesis, in contrast, implies that the outcome of an allegedly random process carries potentially valuable information, thus implying an outcome benefit in hindsight.
I first examined people’s beliefs about whether samples that are equivalent on a priori grounds can be distinguished a posteriori in terms of the likelihood that their binary outcome is trustworthy (for details, see the Supplemental Material available online). Most participants denied that possibility.
To show that these intuitions are false, I carried out two simulation experiments (see the Supplemental Material). The results (Figs. S1 and S2) clearly indicate that when it comes to deciding which of two events is the more frequent in a two-class population, some samples can be shown in retrospect to yield more reliable decisions than others among random samples that are equivalent on a priori grounds. Another simulation yielded a similar outcome benefit for a two-class population whose members differed along a continuous, interval variable.
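The logic of this kind of simulation can be sketched in a few lines. The sketch below is illustrative, not the actual supplemental simulation; the majority-class proportion (.6) and sample size (seven binary cues) are invented parameters. It draws pairs of independent same-size samples and shows that the larger a sample's internal majority margin, the more likely an independent sample is to reach the same binary decision:

```python
import random
from statistics import mean

random.seed(1)

P_MAJ = 0.6    # assumed proportion of the majority class in the population
N = 7          # cues per sample (odd, to avoid ties)
TRIALS = 20000

# For each random sample, record its internal majority margin
# ("consistency") and whether an independent sample of the same size
# reaches the same binary decision ("replication").
by_margin = {}
for _ in range(TRIALS):
    s1 = [random.random() < P_MAJ for _ in range(N)]
    s2 = [random.random() < P_MAJ for _ in range(N)]
    choice1 = sum(s1) * 2 > N        # majority decision of sample 1
    choice2 = sum(s2) * 2 > N
    margin = abs(sum(s1) - N / 2)    # internal consistency of sample 1
    by_margin.setdefault(margin, []).append(choice1 == choice2)

for margin in sorted(by_margin):
    print(f"margin {margin}: replication rate {mean(by_margin[margin]):.3f}")
```

Samples that are equivalent a priori thus differ a posteriori: the replication rate rises monotonically with the sample's internal margin.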
In sum, for the type of decision that is assumed by the SCM to underlie the response to a 2AFC item, the outcome of random sampling contains information about the reliability of the decision: It predicts the likelihood that the same binary outcome will be found again. The results suggest an outcome benefit.
Empirical Evidence for the Replicability Hypothesis
I turn now to the empirical results that tested the predictions of the replicability hypothesis. These results overlap with those reported previously on the prototypical majority effect (Koriat et al., 2016). However, the analyses are specifically targeted at the replicability proposition and its corollaries.
The assumption that confidence monitors the reliability of first-order responses implies that the immediate performance criterion for confidence is the replicability of the choice endorsed. Assuming that confidence judgments are valid predictors of replicability, it follows that confidence in a choice should predict the likelihood that the same choice will be made by oneself in subsequent presentations of the question and will also be made by others. This should be the case whether the choice has a truth value or not.
For choices that have a truth value, confidence should predict within- and between-individual replicability whether confidence is diagnostic or counterdiagnostic of accuracy. Thus, confidence should correlate positively with replicability for both the CC and CW items.
Finally, for domains for which the choice has a truth value, differences in replicability should mediate the diametrically opposed confidence-accuracy correlations observed for the CC and CW items.
These predictions were tested on the results of several published studies, reported here in two sections. The first section examines the replicability of the choice as a function of the confidence with which it was endorsed; the second reports correlational analyses that lend further corroboration to the replicability results.
Confidence as a predictor of replicability
I used the results from nine published studies to test the predictions of the replicability hypothesis. In these studies, 2AFC items were used from different domains. For four of these domains (general knowledge, perceptual comparisons—lines, perceptual comparisons—shapes, and word associations), the answer has a truth value, whereas for the remaining five domains (category membership, social beliefs, social attitudes, personal preferences, self-report personality questions), the response does not have a truth value. In some of the studies, the task was presented between five and seven times. However, the analyses of cross-person replicability were based only on the first presentation.
Confidence and response speed as predictors of others’ responses
The first analysis involved cross-person replicability. I examined the hypothesis that in comparing the two responses made by different individuals to the same binary-choice item, the relative confidence with which the two responses are endorsed predicts which of them is the more likely to be made by others. Assuming that response speed also reflects self-consistency (see Koriat, 2012a), I tested the hypothesis that response speed is also predictive of the replicability of the choice.
The following analytical procedure was applied to the confidence results of each study. First, confidence judgments were standardized to nullify individual differences in the characteristic level of confidence (see Kleitman & Stankov, 2001). Second, for each item, participants were divided at the median confidence into a low-confidence group and a high-confidence group. Finally, for each participant, the percentage of other participants who made the same choice was calculated for each item. This same-choice percentage was then averaged for each item for the low-confidence and high-confidence groups, and the two means were averaged across all items. A similar analytical procedure was applied to the response-speed results.
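As a rough illustration, the three steps of this procedure can be rendered in code. The data below are randomly generated (so no confidence-replicability relationship should emerge); the point is only to make the computation of the same-choice percentage concrete:

```python
import random
from statistics import mean, pstdev

random.seed(0)

# Invented data: responses[p][i] is participant p's binary choice on item i;
# confidence[p][i] is the raw confidence rating. Both are random here, so
# this demonstrates the procedure, not the empirical effect.
N_PART, N_ITEMS = 40, 30
responses = [[random.random() < 0.7 for _ in range(N_ITEMS)]
             for _ in range(N_PART)]
confidence = [[random.randint(50, 100) for _ in range(N_ITEMS)]
              for _ in range(N_PART)]

# Step 1: z-score confidence within each participant to remove individual
# differences in characteristic confidence level.
def zscores(xs):
    m, s = mean(xs), pstdev(xs) or 1.0
    return [(x - m) / s for x in xs]

zconf = [zscores(row) for row in confidence]

# Steps 2-3: per item, median-split participants on standardized confidence,
# then average the percentage of other participants making the same choice.
low_means, high_means = [], []
for i in range(N_ITEMS):
    order = sorted(range(N_PART), key=lambda p: zconf[p][i])
    low, high = order[:N_PART // 2], order[N_PART // 2:]

    def same_choice_pct(p):
        others = [q for q in range(N_PART) if q != p]
        return 100 * mean(responses[q][i] == responses[p][i] for q in others)

    low_means.append(mean(same_choice_pct(p) for p in low))
    high_means.append(mean(same_choice_pct(p) for p in high))

print(f"low-confidence mean agreement:  {mean(low_means):.1f}%")
print(f"high-confidence mean agreement: {mean(high_means):.1f}%")
```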
Methodological details as well as the complete results appear in the Supplemental Material. Here I present only a summary of the results.
The results, summarized in Tables S2 and S3, are clear. First, for each of the nine studies, the responses endorsed with higher confidence were significantly more likely to be made by other individuals than those endorsed with lower confidence. In addition, for each of the studies, the number of items that yielded the predicted pattern (“consistent”) was significantly higher than that yielding the opposite pattern (“inconsistent”).
Second, a very similar pattern of results was observed for response speed (Table S3), suggesting that response speed is also diagnostic of the replicability of the response.
Note that the results described above were observed for tasks for which the choice has a truth value as well as for those for which it does not. Of particular interest is the observation that confidence and speed predicted cross-person replicability even for social attitudes, personal preferences, and self-report personality questions, which are known to yield reliable individual differences.
Predicting others’ responses for CC and CW items
It might be argued that for the tasks for which the response has a truth value, the replicability results derive from the more confident responses also being the more consensual ones. This possibility was evaluated using the stimuli from the study of Koriat (2018, Project 2), which included an equal number of CC and CW items that were closely matched on the percentage of participants who chose the consensual response.
The same analyses as those described above were applied to the confidence and response-speed results. The analyses across all items supported the cross-person replicability hypothesis for confidence and response speed. This was also true for each of the two types of items, CC and CW, when analyzed separately, and there was hardly any difference between them in this respect (Table S4).
A within-individual analysis
In the last analysis in this section, the replicability hypothesis was tested within individuals. In seven of the nine studies listed in Table S2, participants responded to the same task between five and seven times, sometimes on different days. The analysis was performed across these studies and across all presentations, focusing on cases in which participants changed their response to the same item from the first presentation to the second presentation. For these cases, after neutralizing systematic differences in confidence between the first and second presentations, it was found that the relative confidence with which the first two responses were endorsed predicted which of them was the more likely to be repeated across the subsequent three-to-five presentations. The same pattern of results was found for response speed. Thus, the confidence and speed with which a response is endorsed can discriminate between choices that are more likely and those that are less likely to be repeated in the future.
Correlational Analyses: Monitoring Replicability and Accuracy
I now present correlational results that supplement those reported in the last section. The replicability hypothesis invites analyses of the relationship between confidence and replicability and allows an assessment of the degree to which replicability mediates the correlation between confidence and accuracy. A confidence-replicability index was defined as the correlation between confidence and the percentage of other participants who made the same choice as the target participant. This index can be calculated between individuals for each item (and then averaged across items) or within individuals across items (and then averaged across individuals). The full results are presented in Tables S5 and S6. Here I present a summary.
The correlational analyses provided further support for the replicability hypothesis. First, for each of the 10 studies examined (Tables S2 and S4), the confidence-replicability correlation was significant both in the between-individual analyses and in the within-individual analyses. Thus, for each item, interindividual differences in confidence predicted interindividual differences in choice replicability. In turn, for each individual, interitem differences in confidence predicted interitem differences in replicability. These results support the idea that confidence tracks properties of first-order choices regardless of who makes these choices.
Second, for all studies for which the response has a truth value, the between-individual confidence-accuracy correlation was positive for the CC items but negative for the CW items, mirroring the pattern observed for the within-individual correlations.
Third, despite the diametrically opposed confidence-accuracy correlations observed for the CC and CW items, both within individuals and between individuals, the confidence-replicability correlation was positive both within individuals and between individuals for both types of items.
Finally, when replicability was partialed out from the confidence-accuracy correlation (for domains for which the choice has a truth value), the results suggested that replicability can account for both the positive and negative confidence-accuracy correlations observed.
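The partialing step can be illustrated with a first-order partial correlation. In the invented toy data below, a common "replicability" factor drives both confidence and accuracy; the raw confidence-accuracy correlation is high, but it drops sharply, even reversing sign, once replicability is controlled:

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def partial_r(x, y, z):
    # First-order partial correlation of x and y, controlling for z.
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Invented toy data: a common "replicability" factor drives both
# confidence and accuracy; the unshared noise is slightly anticorrelated.
repl = [10, 20, 30, 40, 50, 60, 70, 80]
conf = [r + n for r, n in zip(repl, [1, -2, 2, -1, 0, 1, -1, 0])]
acc  = [r + n for r, n in zip(repl, [-1, 1, 0, 2, -2, 0, 1, -1])]

print(f"raw confidence-accuracy r:     {pearson(conf, acc):+.3f}")
print(f"controlling for replicability: {partial_r(conf, acc, repl):+.3f}")
```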
Altogether, the results are consistent with a conclusion that is difficult, if not impossible, to prove: When you are presented with a 2AFC item, your confidence monitors the replicability of the choice that you happened to settle on. I recommend that reporting confidence-replicability correlations become standard practice in future research that assesses the accuracy of confidence judgments for 2AFC items.
Discussion
This article examined the idea that confidence judgments monitor the reliability of a choice rather than (directly) its accuracy. In what follows, I first summarize the results of the empirical analyses and examine the boundaries of the replicability hypothesis. I then elaborate on the two assumptions underlying the SCM and on the replicability hypothesis.
The replicability hypothesis and its boundaries
The analyses of the empirical results provided consistent support for the replicability hypothesis. The confidence and speed with which a choice was endorsed predicted the likelihood that the same choice would be made by others. This was so across nine studies that included domains for which the response has a truth value and those for which it does not. A similar pattern of results was observed for the CC and CW items, with little difference between them in the extent to which confidence predicted replicability. The predicted pattern was also observed within individuals.
The correlational analyses corroborated these results: Positive within- and between-individual confidence-replicability correlations were observed in each of the studies. These correlations were positive even for the CW items, which yielded negative confidence-accuracy correlations both within and between individuals. Finally, for all choices that have a truth value, the positive confidence-accuracy correlations for the CC items and the negative confidence-accuracy correlations for the CW items were largely accounted for by differences in replicability.
The generality of the replicability results across domains is quite impressive. However, the generality of the results across participants requires the use of participant populations that are more heterogeneous than those used in the reviewed studies.
As far as cross-person replicability is concerned, consider the results of an early study (Koriat, 1975) that ultimately led to the SCM, in which Israeli participants successfully matched antonyms in Thai, Kannada, and Yoruba with their corresponding English antonyms, consistent with the idea of a universal phonetic symbolism (see Blasi et al., 2016). Surprisingly, they were also successful in monitoring the accuracy of their matches. However, a subsequent study (Koriat, 1976) that included CW items found these items to yield a negative confidence-accuracy correlation. These results suggest that for some domains collective wisdomware is shared to some extent even across cultures: The sound-meaning associations that influenced the construction and evolution of words in the noncognate languages possibly contributed to the choices and confidence judgments of the Israeli participants for these languages.
Other results, in contrast, suggest some boundaries to cross-person replicability. Previous results have suggested that the more heterogeneous the participants are in a particular domain, the weaker cross-person replicability is relative to within-person replicability (Koriat et al., 2020). More importantly, recent studies on the effects of political partisanship, homophily, and network dynamics reveal systematic differences between different groups (e.g., Bak-Coleman et al., 2022; Galesic et al., 2018) that emerge even for trivia questions. Thus, the boundaries of cross-person replicability require further investigation.
I now discuss the two assumptions underlying the replicability hypothesis. I first elaborate on the assumption of collective wisdomware and then turn to the assumption of cue sampling.
Two layers of collective wisdom
Work on the wisdom of crowds (Armstrong, 2001; Budescu & Chen, 2015; Herzog et al., 2019; Surowiecki, 2004) has concentrated on first-order judgments made by a group of participants, and specifically on judgments that have a truth value. Underlying the SCM instead is an extended view of collective wisdom that includes a latent, deeper layer than that of the judgments themselves (see Lee & Danileiko, 2014; Lee et al., 2011). It has been proposed that people with similar backgrounds have much in common in terms of the wisdomware underlying their choices even when it comes to choices that do not have a truth value. This is due to people’s adaptation to the same natural ecology through evolution and learning (Dhami et al., 2004), to the shared social and cultural traditions (Koriat & Adiv, 2011, 2012), and to the use of similar reasoning and heuristics (Alter & Oppenheimer, 2009; Bajšanski et al., 2019). Although some variation in the sampled cues would be expected, confidence and response speed can still track the replicability of a response across people and occasions.
The hypothesis that one’s confidence should predict others’ choices follows from the assumption of collective wisdomware and from the principle of “knowing by doing” (Koriat, 2015a) assumed to characterize metacognitive judgments. According to this principle, metacognitive data-driven judgments are parasitic on the very processes underlying first-order responses, relying on the feedback from these processes. Thus, confidence in a choice is based on the feedback from the very process of making a choice—the consistency among the cues underlying the choice. Because of that, confidence judgments are expected to track properties of the choice itself. Assuming that people draw their cues from a shared pool of cues, confidence should predict the likelihood that others will make the same choice. Indeed, the confidence-replicability correlations that were observed between individuals were very similar to those observed within individuals, suggesting that confidence tracks properties of first-order choices regardless of who makes these choices.
The notion of collective wisdomware has not received a systematic theoretical analysis. However, its importance has been acknowledged by several researchers. For example, it has been proposed that group performance can be optimized by weighting judgments by confidence, consensus, and consistency (see Steyvers & Miller, 2015). Such weighting possibly entails increased reliance on the collective wisdomware.
Researchers have also argued that diversity is beneficial for achieving aggregation gains (Page, 2008; see also Krebs et al., 2023). Indeed, research has indicated that the aggregation of judgments can benefit from cross-person diversity (Keck & Tang, 2020) and from within-person diversity (Herzog & Hertwig, 2009, 2014; Hourihan & Benjamin, 2010).
The assumption of a shared wisdomware is also implied by studies that explored interitem differences to shed light on the cues underlying metacognitive judgments across participants (e.g., Ackerman, 2023; Alter & Oppenheimer, 2009; Koriat, 2008; Koriat & Bjork, 2005; Koriat et al., 2006; Koriat & Lieblich, 1977; Metcalfe et al., 1993).
In sum, the results just mentioned, together with those obtained in the current study and in the context of the SCM, suggest that research on the wisdom of crowds can benefit from a consideration of the role of collective wisdomware. However, it should be stressed that the notion of collective wisdomware also applies to domains for which the judgment does not have a truth value.
The notion of cue sampling
Turning next to the notion of cue sampling: this idea is common to many models (see Fiedler et al., 2023). It has been proposed that although people share a repertoire of item-specific cues, the cues accessed on each occasion may differ even for the same individual, depending on contextual circumstances (Koriat & Sorka, 2017; Schwarz, 2007).
In recent years, a sampling assumption has been incorporated in a modified view of the Bayesian approach to cognition, in which the mind is conceived as a Bayesian sampler rather than as a full-blown probability-inference machine (Chater et al., 2020; Sanborn & Chater, 2016). It has been proposed that because of computational constraints, cognitive processes only approximate probabilistic calculations by drawing finite samples from an internal model.
A unique feature of the SCM, however, is that it uses the sampling assumption to make predictions about systematic differences between different responses to the same binary item. This assumption, together with the idea of collective wisdomware, was sufficient to account for the systematic, between-response differences in replicability. It is of interest to examine whether other models of decision-making that incorporate a sampling assumption also predict between-response differences.
A concern that has been noted about the replicability hypothesis is that it appears to violate our intuitions about the notion of randomness. The results of the simulation experiments (see the Supplemental Material) help allay that concern. It should be stressed that these experiments simulated the process assumed by the SCM to underlie the response to a 2AFC item: deciding which of two events is the more frequent in a two-class population. For this type of decision, when several same-size samples are drawn randomly from a population, a statistical index was shown to track the potential replicability of the outcome. Thus, an outcome that is unforeseeable ex ante can be shown to be foreseeable ex post.
I now compare accuracy and replicability as the performance criteria for confidence, examine the relations between the current work and other models of confidence, and end by mentioning some implications of the replicability effects.
Predicting replicability versus predicting accuracy
As noted by Kurvers et al. (2021), research on the wisdom of crowds has focused almost exclusively on accuracy as the performance criterion. In fact, this is also true of most of the work in metacognition—what Nelson and Narens (1990) referred to as “meta-object correspondence.” The focus on metacognitive accuracy is understandable in view of the functional role of metacognitive judgments in guiding self-regulation (Bjork et al., 2013; Dunlosky & Metcalfe, 2008; Koriat & Goldsmith, 1996; Tullis & Goldstone, 2020).
This article, instead, focuses on the replicability of the choice as the performance criterion for confidence judgments (see also Caziot & Mamassian, 2021; Mamassian & de Gardelle, 2022). Thus, unlike the confidence-accuracy relationship, which was found to be positive only for “kind” environments (Hertwig, 2012), in which first-order judgments tend to be correct, the confidence-replicability relationship was also positive for “wicked” (or “demonic”; Sosa, 1992, 2008) environments and for choices that do not have a truth value. These results are consistent with the idea that the self-consistency heuristic is applied across the board to yield an assessment of choice replicability.
The relations to other theories
I now comment briefly on the relationship of the SCM to other models of confidence. The SCM is much less elaborate than other theories of confidence, attempting to capture the gross architecture of the process underlying choice and confidence across different binary-choice tasks. Perhaps because of that, it has yielded several results, including the replicability results reported in this article, that evidence impressive generality and robustness (see Mazancieux et al., 2023). For example, unlike theories that assume qualitatively different models for confidence in different tasks (e.g., Juslin & Olsson, 1997), the SCM assumes a gross architecture that is common to many tasks. In addition, unlike models that focus on explicit “reasons” for and against the alternative options (e.g., Koriat et al., 1980; Shafir et al., 1993), the SCM accommodates many types of cues, some consisting of explicit considerations and others affecting choice and confidence below full consciousness. What matters to confidence, according to the SCM, is the overall consistency with which the cues support the endorsed option.
The SCM also differs from other inferential models that reserve the option of “direct retrieval” of the choice (Metcalfe, 2000; Unkelbach & Stahl, 2009). For example, the influential theory of Gigerenzer et al. (1991) includes a strategy in which the answer to a general-knowledge question is based on a direct solution by memory. Only when that strategy fails does a person rely on probabilistic information. The SCM instead leaves open the possibility of a direct access strategy (see Koriat, 2012b) but tentatively assumes that the entire process is inferential and based on cues. In addition, participants are not assumed to have access to the validity of each cue.
I comment now specifically on the relationship between the work presented in this article and some of the research on social judgments in which participants were asked to predict others’ characteristics and behavior. The results for these social predictions have documented several effects, such as the false consensus effect (L. Ross et al., 1977) and the self-enhancement effect (Festinger, 1954). These effects are not directly relevant to the replicability hypothesis, which entails the implicit prediction of others’ choices by one’s confidence. This is also true for the studies in which participants were asked to make explicit predictions of other people’s binary choices (Koriat, 2011; Koriat & Adiv, 2011, 2012). The results of these studies suggested that confidence in these explicit predictions also depends on the consistency among the cues underlying them. It has been argued that these cues may differ from those underlying people’s own choices (see also Sun et al., 2018).
Other research on social judgments, however, is relevant, raising concerns about the SCM. The social sampling model (Galesic et al., 2012, 2018) posits that people judge different characteristics in the population by sampling instances from memory. The results suggest that the sampled instances come from people’s social circles rather than constituting a representative sample of the general population. These results are inconsistent with the assumption of the SCM that people sample their cues from a shared population of cues.
These results raise the question: When is the sampling space broad enough to yield the expected replicability results, and when is it more constrained, so that it may differ for different groups? One possibility is that sampling from a largely shared wisdomware occurs only for cue-based inferences, particularly when the cues are sampled from semantic memory (e.g., Gigerenzer et al., 1991). In contrast, the sampling tends to be tuned to people’s specific social environments when the task primes instance-based inferences that encourage sampling from episodic memory (Galesic et al., 2018; Pachur et al., 2013). This and other possibilities deserve investigation (see Hertwig et al., 2005), the results of which may require a qualification of the replicability hypothesis.
Some implications of the replicability hypothesis
I finally comment on the metatheoretical, methodological, and practical implications of the replicability hypothesis. I first mention the metatheoretical considerations that helped motivate the replicability hypothesis. This hypothesis is consistent with the view that assigns a fundamental role to reliability and prediction in the construction of truth and reality. It has been proposed that underlying the construction of reality and truth is the attempt to focus on the reliable, invariant properties that afford prediction. Several philosophers have discussed the relationship between “true” and “real” (e.g., Rasmussen, 2013; Toohey, 1939), and Toohey noted that the words “real” and “true” would hardly have been invented if humans had not fallen into error. An examination of philosophical discussions reveals similarities between the ideas raised about truth and its justification, and those raised about the conception of a reality that has independent existence. These ideas, together with the replicability results reported in this article, invite a joint exploration by philosophers and psychologists.
As for the methodological implications, future research may use the between-participant confidence-replicability correlation introduced in this article to examine the extent to which confidence relies on a shared wisdomware for the task used and across the specific group of participants sampled. Because this index can be calculated for each item, it can help delimit the items for which the underlying cues are shared across different groups of participants. For example, if social judgments are based on the sampling of instances from people’s social circles (Galesic et al., 2018), confidence in binary social judgments (see Pachur et al., 2013) may help map the social-circle environments from which instances are sampled in each case. This possibility, of course, depends on the assumption that the process underlying choice and confidence is similar to that described by the SCM.
The merit of the confidence-replicability index is that it can be calculated on many available data sets with no need to collect additional results. Of course, the confidence-replicability index may be useful for testing the idea of Mamassian and colleagues (Caziot & Mamassian, 2021; Mamassian & de Gardelle, 2022) that confidence in perceptual judgments reflects the reliability of these judgments.
I end by mentioning some practical implications of the replicability results. These results suggest that confidence judgments and response latency can be harnessed to assist in the prediction of people’s knowledge, opinions, attitudes, and preferences (Barasz et al., 2016; Dunning, 2007; Epley & Dunning, 2000; Ramnani & Miall, 2004; Tirso & Geraci, 2020; Tullis, 2018; Tunney & Ziegler, 2015). These judgments may also prove useful in economic predictions and prediction markets (e.g., Dreber et al., 2015).
The replicability hypothesis is also relevant to research on group decisions. Given the finding that collaborative, group decisions are more accurate than individual decisions (Bahrami et al., 2010; Hautz et al., 2015), confidence can help simulate the judgments made by groups, thus saving the heavy costs entailed in the use of interacting teams and committees (Maciejovsky & Budescu, 2020). As noted earlier, when the confidence judgments of noninteracting participants were taken into account, the results were found to predict the pattern observed for group decisions (Koriat, 2012c, 2015b). Meyen et al. (2021) reported that simulated group decisions that incorporated individual confidence judgments matched the accuracy of real group decisions better than those that placed equal weights on group members.
Several studies have found that even without the “benefit of omniscience”—knowing whether the environment in question is kind or wicked—confidence judgments can be exploited in harnessing the wisdom of crowds (Herzog et al., 2019) by selecting for each item the answer of the most confident person in a group (B. Bang et al., 2014; Koriat, 2012c). The potential value of this algorithm was demonstrated for the diagnosis of breast and skin cancer (Kurvers et al., 2016) and for simulated emergency-room decisions (Kämmer et al., 2017). Hasan et al. (2021) applied the maximum-confidence slating algorithm to participants’ medical-image interpretations (cancerous vs. noncancerous). This algorithm was found to improve the diagnostic accuracy of both novices (undergraduates) and experts (medical professionals), with the performance of groups of novices reaching that of individual experts.
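The slating rule itself is simple to state in code. The sketch below uses invented answers and confidence ratings for a three-member group and picks, for each item, the answer of the member who endorsed it most confidently:

```python
# Invented data: per-item answers and confidence ratings for a three-member
# group; the rule selects, per item, the most confident member's answer.
def max_confidence_slate(group_answers, group_confidence):
    n_items = len(group_answers[0])
    slate = []
    for i in range(n_items):
        best = max(range(len(group_answers)),
                   key=lambda p: group_confidence[p][i])
        slate.append(group_answers[best][i])
    return slate

answers = [["A", "B", "A"],   # member 1
           ["B", "B", "B"],   # member 2
           ["A", "A", "B"]]   # member 3
confid  = [[90, 55, 60],
           [70, 80, 95],
           [60, 85, 50]]

print(max_confidence_slate(answers, confid))  # → ['A', 'A', 'B']
```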
Bennett et al. (2018) capitalized on the observation that participants generally withhold responses in which they have low confidence (see Goldsmith & Koriat, 2008). They observed that crowds that are composed of volunteered judgments are more accurate than crowds composed of forced judgments. These are but a few of the potential applications of the results reported in this article.
Supplemental Material
Supplemental material for “Subjective Confidence as a Monitor of the Replicability of the Response” by Asher Koriat, Perspectives on Psychological Science (sj-docx-1-pps-10.1177_17456916231224387).
Acknowledgements
I thank Joelle Proust for her help with the philosophical literature. I am also grateful to Miriam Gil, Dan Manor, and Noam Yehudai for their help in the statistical analyses, and to Shiri Adiv-Mashinsky for her assistance in the self-consistency project. Shiri Adiv-Mashinsky passed away on September 20, 2017. I also thank Etti Levran (Merkine) for her help in copyediting.
Transparency
Action Editor: Mirta Galesic
Editor: Interim Editorial Panel
