Four Internal Inconsistencies in Tversky and Kahneman’s (1992) Cumulative Prospect Theory Article: A Case Study in Ambiguous Theoretical Scope and Ambiguous Parsimony

Abstract

Scholars heavily rely on theoretical scope as a tool to challenge existing theory. We advocate that scientific discovery could be accelerated if far more effort were invested into also overtly specifying and painstakingly delineating the intended purview of any proposed new theory at the time of its inception. As a case study, we consider Tversky and Kahneman (1992). They motivated their Nobel-Prize-winning cumulative prospect theory with evidence that in each of two studies, roughly half of the participants violated independence, a property required by expected utility theory (EUT). Yet even at the time of inception, new theories may reveal signs of their own limited scope. For example, we show that Tversky and Kahneman’s findings in their own test of loss aversion provide evidence that at least half of their participants violated their theory, in turn, in that study. We highlight a combination of conflicting findings in the original article that make it ambiguous to evaluate both cumulative prospect theory’s scope and its parsimony on the authors’ own evidence. The Tversky and Kahneman article is illustrative of a social and behavioral research culture in which theoretical scope plays an extremely asymmetric role: to call existing theory into question and motivate surrogate proposals.

Keywords

cumulative prospect theory, median responses, theoretical scope, theoretical parsimony

As Paul Meehl (1978) famously stated, theories in many “areas of Psychology lack the cumulative character of scientific knowledge. They tend neither to be refuted nor corroborated” (p. 806). Although Meehl was primarily referring to paradigms that grow in and out of fashion, we contend that the statement applies to some of the most famous and enduring scholarly contributions. In addition, whereas Meehl was primarily referring to what he called “soft” areas of psychology, his characterization can apply to high-profile and mathematically formal theories. In this article, we consider one of the most prolifically cited articles in behavioral science as a case study in both ambiguous theoretical scope and ambiguous parsimony. That article uses theoretical scope as a tool to challenge prior theories, yet a closer look at the article’s own evidence calls its own scope into question by the very same criteria. Both proponents and opponents of this theory can cherry-pick the ways in which they characterize its scope and/or its parsimony, solely on the basis of the original article, in ways that serve their goals. In some domains of science, scholars have reached broad consensus about many theories’ “edge conditions,” their actual scope, and their flexibility. Behavioral research is yet to find even minimal consensus about whom its theories describe and under what circumstances.

The article is organized as follows. We start by examining what it might mean to specify theoretical scope and parsimony, first with a simple example from the natural sciences and then with a few exemplary articles from psychology. This prepares the ground for our case study, the 1992 article by Amos Tversky and Daniel Kahneman on cumulative prospect theory (CPT). In the context of that article, we discuss the difference between theoretical scope and parsimony in more detail. In particular, we lay out how scholars can cherry-pick aspects of the same evidence on the same theory to draw diametrically opposite conclusions. Next, we dig more deeply into Tversky and Kahneman (1992) to discuss four internal inconsistencies within that single seminal article. Citing the example of such a stellar theory, we make our case that social scientists need to consider the intended scope and the parsimony of their theoretical proposals much more carefully and with less bias. We then turn to the general question of how behavioral scientists can clarify the scope and parsimony of their theories. We review symptoms that flag problems, and we sketch some ideas for better practice.

What Is Theoretical Scope? What Is Parsimony?

Physicists share virtually perfect consensus about the “theoretical scope” of Newton’s law of gravity: The law applies to all objects in all locations at all times as long as the objects in question are not “too small” and do not “move too fast.” Much of engineering is based on understanding the situations in which Newton’s laws of motion, Pascal’s law of pressure, Boyle’s law of gases, and so on either do or do not apply verbatim. There is also broad consensus on how one can cover a broader range of phenomena using more complex (i.e., less “parsimonious”) theories.¹

What are theoretical scope and parsimony in psychology? If one considers any behavioral regularity, whether it is in cognition, personality, social interaction, or another domain, is there a consensus in the field as to who displays this behavior and under what circumstances? Do the behavioral sciences strive to develop broad agreement on delineating the range of conditions in which a theory applies and the characteristics of people whose behavior it explains?

Imagine that Archimedes’s principle was subject to major exceptions: If the weight of the displaced water matched only some boats’ weight, how would that affect boat design? Psychologists take it for granted that their theories hold only with exceptions. The logic of permissible exceptions is baked into much of our statistical methodology: Whenever a statistically significant proportion of participants in a study, but nowhere close to all, show a certain behavioral regularity, say, they remember more, or take larger risks, or are more cooperative, we infer that the phenomenon or effect is ‘real.’ But what does that tell us about who actually satisfies that regularity when, how, and why? If it is too ambitious to characterize people to this level of detail, do psychologists at least estimate how large a proportion of the population obeys their hypotheses about decision-making, memory, perception, personality, reasoning, or social interaction? How does the discipline identify and interpret conditions, individuals, stimuli, or tasks in which a theoretical claim does not actually apply? In other words, how does psychology conceptualize theoretical scope, and how does it handle limitations in a theory’s scope? Going one step further, how does the field ensure the parsimony of new theories that aim to encompass phenomena not captured by earlier theories?

To illustrate current best practice in stating the intended purview of new theories, we briefly consider three very recent high-profile articles from cognitive psychology: Popov and Reder (2020) in long-term memory research, Schneegans et al. (2020) in working-memory research, and Lleras et al. (2020) in visual attention. These are three research paradigms in which individual differences may plausibly play a muted role compared with, say, clinical, developmental, or social psychology. Popov and Reder proposed a new theory and computational model that purports to explain a wide range of “frequency effects” on memory while also providing a process-level explanation for how working-memory capacity gives rise to these effects. Popov and Reder acknowledged:

Our extensions to recall tasks could be considered lacking in some respects. For example, while our serial recall model does a good job capturing the interaction between word-frequency and serial position, it does not reproduce the one item recency effect, which has been attributed to access from a [working-memory] buffer (Anderson, et al., 1998). Furthermore, we have made no attempt to model the full specter of contiguity effects in free recall, which have been a crucial benchmark for models of free recall. (p. 38)

Schneegans et al. proposed a framework grounded on a specific neural model of visual working memory. A purported strength of this model is that it links visual working-memory limits to a concrete neural substrate. Schneegans et al. discussed limitations of their theory, stating, for example, “In keeping with most previous work on [visual working-memory] limits, we have not here attempted to reproduce the variations in bias and precision that are observed for different feature values” (pp. 8–9). Lleras et al. proposed a novel visual search model, in part to account for the specific form of “search functions.” They reported that their theory could account for a wide range of effects in visual search tasks, particularly the effects of the similarity between target and distractor items in the visual search array for simple and real-world objects. Lleras et al. acknowledged, “A . . . limitation is that TCS [Target Contrast Signal Theory] is currently mostly focused on parallel processing and efficient search. More work is needed to flesh out what happens after parallel evidence accumulation is stopped and several target likely locations need to be inspected” (p. 422).

These articles stand out in that unlike many scholarly articles in psychology, they at least acknowledge phenomena that their theories do not explain. However, although they do take the admirable step of reviewing limitations, these articles are nonetheless reflective of a research culture in which theoretical scope is usually treated like a set of moving goalposts. Often, stating that “more work is needed” can be a diplomatic way to acknowledge that the scope of a theory is inherently ambiguous. For instance, even in these cognitive tasks, the extent to which every effect holds in every individual person and across a broad range of contexts is ambiguous. Hence, it is not really clear how desirable it is to model all effects jointly in the same individuals without committing a conjunction fallacy, for instance. Even in these exemplary articles, the challenging question of how to weigh one’s own theory’s limitations against its ability to account for an enlarged range of phenomena remains unanswered. Schneegans et al. (2020) used some heuristic statistical measures of parsimony (known as Akaike information criterion [AIC] and Bayesian information criterion [BIC]) based on counting the number of free parameters in the theory. The other two articles compared their own work with other work conceptually. Ultimately, despite the authors’ best efforts, all three articles are ambiguous about both the scope and the parsimony of their theories. Our article aims to raise awareness of the asymmetric role that theoretical scope plays in the development of social and behavioral theory.

The asymmetric role of theoretical scope

Notwithstanding the higher level of nuance in the aforementioned three articles, it is common to use theoretical scope almost exclusively to motivate new theories.

Often, the scope of an existing theory is delineated through “critical tests” or “unaccounted-for effects” by selecting an empirical paradigm and carefully designing certain stimuli for which a substantial number of people generate data that conflict with the existing theory’s predictions. By focusing on critical tests or unaccounted-for effects, scholars deliberately place pressure on existing theory. Proposing a new theory that passes those same hurdles creates an inherent bias in favor of the new theory. By the design of the research paradigm, this bias is immune to detection by even the best standard statistical model selection criteria. This is because model-selection methods typically apply post hoc, only after the scholar has already selected a suitable paradigm and crafted the relevant diagnostic stimuli to stress test the old theory. Yet support for a new theory may, in fact, already be ambiguous at the time of its inception because some participants may already provide some evidence against the new theory on some of the stimuli. Indeed, because presumably no behavioral theory performs universally well for everyone on all stimuli and in all contexts, a new difficulty arises as soon as the new theory reveals some of its weaknesses. Different schools of thought may disagree on whether the cracks in the new theory represent new critical tests that create the need for yet another theory or whether these fissures are merely examples of the imperfections that are inherent to even the best behavioral theories.

In this article, we unpack these ideas for one article that proposed the most prominent theory of decision-making: CPT. Although the authors did not highlight them, the first fissures are already visible in the original article proposing the theory (Tversky & Kahneman, 1992). As for how to weigh this theory’s improvement over prior theories against its own limitations, that question, even decades later, has been neither settled nor discussed in much depth.

CPT: A Case Study

Few theories from the social and behavioral sciences are as prominent across all of science, and even popular science, as CPT. Yet since its inception, CPT has also become a routine lightning rod for countless competing proposals about decision behavior. The theory enjoys extremely broad use in applied settings, in which it guides much policy development while also being the target of abundant skepticism, especially in basic research. Although some critics call it extremely narrow and easy to refute, others think it is too flexible and even irrefutable. We consider both scenarios later in this article. One possible explanation for finding both strong support and strongly mutually contradictory criticisms of one and the same theory might be that this theory may perform extremely well in accounting for some people’s behavior in some circumstances but not others. Intuitively speaking, the theory may have limited scope.

Tversky and Kahneman (1992) premised CPT on showing that many people displayed phenomena in violation of prior models, such as EUT (see also Kahneman & Tversky, 1979; Tversky & Kahneman, 1981). Specifically, Tversky and Kahneman (pp. 303–304 and their Tables 1 and 2) reported two studies in which 53% (money managers) and 46% (Stanford students) of the participants violated a core property of EUT called “independence.” Let E denote the event that the Dow Jones changes by a certain number between today and tomorrow. Suppose that two lotteries f and g both yield $25,000 if E occurs. Suppose that $f'$ and $g'$ are the same as f and g, respectively, except that they both yield $0 if E occurs. If event E occurs, it does not matter whether the decision maker has chosen f or g today. Indeed, if E occurs, then they receive $25,000 tomorrow. Likewise, if E occurs, then it does not matter whether they have chosen $f'$ or $g'$ today because $f'$ and $g'$ cause them neither to win nor to lose anything tomorrow. Through the lens of EUT, the outcomes under event E simply ‘cancel out.’ Therefore, anyone who prefers f to g must prefer $f'$ to $g'$ and vice versa because these preferences should depend only on what happens when something other than E occurs, and in that situation, f and $f'$ are identical to g and $g',$ respectively. However, as we already mentioned, 53% of Tversky and Kahneman’s money managers chose f over g and chose $g'$ over $f'$ . If roughly half of all participants violate a property required by standard theory, Tversky and Kahneman argued, then a new theory is needed. Later in this article, we show that Tversky and Kahneman’s own findings elsewhere in the same article provide evidence that, in turn, at least half of all participants in their test of another property, called “loss aversion,” likewise violated their own theory. This and other incongruences in Tversky and Kahneman raise questions about how one should conceptualize CPT’s theoretical scope.
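The cancellation argument can be written out in two lines. As a sketch, let $P(E)$ be the decision maker’s subjective probability of E, let u be the utility function, and let $R_f$ and $R_g$ denote the expected-utility contributions of f and g from everything that can happen when E does not occur. These symbols are our shorthand for illustration, not notation from Tversky and Kahneman (1992).

$$
\begin{aligned}
EU(f) - EU(g) &= \left[P(E)\,u(25{,}000) + R_f\right] - \left[P(E)\,u(25{,}000) + R_g\right] = R_f - R_g, \\
EU(f') - EU(g') &= \left[P(E)\,u(0) + R_f\right] - \left[P(E)\,u(0) + R_g\right] = R_f - R_g.
\end{aligned}
$$

Both differences have the same sign, so under EUT a preference for f over g goes together with a preference for $f'$ over $g'$.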

Table 1.

Preference Patterns Among Loss Gambles of Tversky and Kahneman’s (1992) Table 3 Prospects With $p \leq 0.1$ .

Our label	I	II	III	IV	V	VI	VII	VIII		I	II	III	IV	V	VI	VII	VIII
Eight loss prospects:										Eight loss prospects:
Outcome A	−50	−100	−200	−200	−400	−100	−150	−200		−50	−100	−200	−200	−400	−100	−150	−200
p	0.1	0.05	0.01	0.1	0.01	0.1	0.05	0.05		0.1	0.05	0.01	0.1	0.01	0.1	0.05	0.05
Outcome B	0	0	0	0	0	−50	−50	−100		0	0	0	0	0	−50	−50	−100
$1 - p$	0.9	0.95	0.99	0.9	0.99	0.9	0.95	0.95		0.9	0.95	0.99	0.9	0.99	0.9	0.95	0.95
EV	−5	−5	−2	−20	−4	−55	−55	−105		−5	−5	−2	−20	−4	−55	−55	−105
Predicted preferences according to CPT: Equations 1 and 3									Row ∑	Predicted preferences according to the “toy” theory: Equations 1 and 3 with the modification that $1 < β, δ < 2$
	1	1	1	1	1	1	1	1	8	1	1	1	1	1	1	1	1
	1	1	1	1	1	1	1	0	7
	1	1	1	1	1	1	0	0	6
	1	1	1	1	1	0	1	0	6
	1	1	1	1	1	0	0	0	5
	1	1	0	1	0	1	1	1	6
	1	1	0	1	0	1	1	0	5
	1	1	0	1	0	1	0	0	4
	1	1	0	1	0	0	0	0	3
	1	0	0	1	0	1	0	0	3
	1	0	0	1	0	0	0	0	2
									6	0	1	1	0	1	1	1	1
									5	0	0	1	0	1	1	1	1
									3	0	0	0	0	0	1	1	1
									2	0	0	0	0	0	1	0	1
									1	0	0	0	0	0	0	0	1
	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Note: A “1” indicates a “risk-seeking” preference for a lottery over the sure amount equal to the lottery’s expected value (EV). The left side shows the predictions of CPT under Equations 1 and 3, whereas the right side shows the predictions of a “toy” theory that uses Equations 1 and 3 with $1 < β, δ < 2$. The predicted number of “risk-seeking” choices is tallied in the center column. The preference pattern predicted by expected utility theory (the first row, all ones) is underlined. CPT = cumulative prospect theory.

Table 2.

Percentages of Risk-Seeking Choices Among Subjects 4, 6, and 21 and Average Percentage, Adapted From Table 4 of Tversky and Kahneman (1992)

Column 1	Column 2	Column 3	Column 4	Column 5
Subject	Gain lotteries	Gain lotteries	Loss lotteries	Loss lotteries
	$p \leq 0.1$	$p \geq 0.5$	$p \leq 0.1$	$p \geq 0.5$
4	71%	0%	30%	58%
6	100%	5%	0%	100%
21	100%	0%	0%	100%
Average	78%	10%	20%	87%

The goal of reporting these disparities is not to take a stance about either the validity or value of CPT, or the lack thereof (recall that we focus on one article, Tversky & Kahneman, 1992, not an entire research program). In our view, every behavioral theory has limited scope. Rather, we aim to highlight the one-sidedness in much behavioral science in which theoretical scope is disproportionately used to censure a leading theory, often with little discussion of the intended or expected scope of the new theory. Building on earlier preparatory work (Davis-Stober & Regenwetter, 2019; Regenwetter & Robinson, 2017), we advocate that social and behavioral science should develop more constructive ways to reconcile, synergize, and weigh the theoretical proposals associated with different schools of thought. These recommendations reach beyond the replication crisis to more general questions, including the need to think through unintended consequences of successful replication (see also Davis-Stober & Regenwetter, 2019; Irvine, 2021; Kellen, 2020; Regenwetter & Robinson, 2017; Rotello et al., 2015; Yarkoni, 2022).

What is the difference between theoretical scope and parsimony?

Consider a prospect that leads to outcome A with probability p and outcome B otherwise in which A and B are monetary amounts. Positive values for A and/or B are monetary gains, whereas negative values are monetary losses. Table 1 shows eight such lotteries taken from Tversky and Kahneman (1992). For example, the first column shows a lottery in which the decision maker runs a 10% chance ($p = 0.1$) of losing $50 (i.e., $A = -50$), otherwise (with probability $1 - p = 0.9$) neither winning nor losing anything. This lottery has an expected value (EV) of $0.1 \times (-\$50) = -\$5$, as mentioned in Table 1. In line with Tversky and Kahneman’s terminology, choosing to play this lottery over paying $5 (its EV) is called a “risk-seeking” choice. The opposite choice is called “risk averse.”

EUT transforms dollar amounts to subjective utilities and calculates the expected value of the subjective utilities to determine whether it is preferable to play this lottery or (in this example) to accept the sure loss. Using a utility function for losses of the form $v(x) = -(-x)^{β}$, in which $0 < β < 1$, EUT implies that in each lottery of Table 1, the lottery is preferable to the sure loss of paying the lottery’s EV (regardless of the value of $β$ within that range). Accordingly, showing a risk-seeking preference as a “1,” the first preference pattern in Table 1 contains a string of ones across the eight lotteries. This preference pattern is underlined in the table. Disregarding the knife-edge possibility of being indifferent between a lottery and its EV, there are $2^{8} = 256$ possible binary preference patterns for eight pairwise decisions. Of these, EUT with the above utility function permits only one preference pattern here: At least as far as these lotteries are concerned, EUT (with that utility function) is rather parsimonious. If everyone consistently acted in a risk-seeking fashion when confronted with any of these choices, then EUT would also have excellent scope, again, at least as far as these lotteries are concerned. However, if there are one or more lotteries among the eight in which some, many, or all people make risk-averse choices (i.e., they choose to pay the sure loss of the EV rather than play the lottery), then EUT suffers from limited scope: In that case, either EUT may apply only at best to certain decisions but not others, or it may capture only the behavior at best of some people but not others, and so on.
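The EUT prediction for these eight lotteries can be checked numerically. The following is a minimal sketch, assuming the power utility for losses given above; the function names and the three sample values of β are our own illustrative choices, not part of Tversky and Kahneman (1992).

```python
# Eight loss lotteries (A, p; B, 1 - p) transcribed from Table 1.
LOTTERIES = [
    (-50, 0.10, 0), (-100, 0.05, 0), (-200, 0.01, 0), (-200, 0.10, 0),
    (-400, 0.01, 0), (-100, 0.10, -50), (-150, 0.05, -50), (-200, 0.05, -100),
]


def expected_value(a, p, b):
    """Expected monetary value of the lottery (a, p; b, 1 - p)."""
    return p * a + (1 - p) * b


def utility(x, beta):
    """Power utility for losses, v(x) = -(-x)**beta, defined for x <= 0."""
    return -((-x) ** beta)


def eut_pattern(beta):
    """Return 1 (risk seeking) where the lottery's expected utility strictly
    exceeds the utility of the sure loss equal to its EV, else 0."""
    pattern = []
    for a, p, b in LOTTERIES:
        eu_lottery = p * utility(a, beta) + (1 - p) * utility(b, beta)
        eu_sure = utility(expected_value(a, p, b), beta)
        pattern.append(1 if eu_lottery > eu_sure else 0)
    return tuple(pattern)


if __name__ == "__main__":
    # EVs should match the EV row of Table 1.
    print([round(expected_value(*lot), 2) for lot in LOTTERIES])
    # Illustrative beta values strictly between 0 and 1.
    for beta in (0.1, 0.5, 0.88):
        print(beta, eut_pattern(beta))
```

Because this utility function is strictly convex over losses for any β strictly between 0 and 1, the lottery's expected utility exceeds the utility of its EV in every case, which is why the sketch returns a string of ones regardless of which such β is tried.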

Having established limitations in scope of a given theory, it is common, although not ubiquitous, that scholars will propose a less parsimonious revision that includes the original theory as a special case. Such is the case with CPT: It contains EUT as a special case, but CPT introduces extra flexibility that accommodates a broader range of possible preference patterns. As stated in Tversky and Kahneman (1992), CPT transforms money into subjective utilities in a fashion similar to the above version of EUT, it transforms probabilities into subjective weights, and it switches to a cumulative weighted average calculation. We omit the details of the latter because they are not important here. To keep mathematical formulas to a minimum, we restate only CPT’s core building blocks: subjective utility and probability weighting. According to Equations 5 and 6 of Tversky and Kahneman (1992), CPT invokes a value function v and probability weighting functions $w^{+}$ and $w^{-}$ for gains and losses with parameters $α, β, γ, δ, λ$, of the form

$$v(x) = \begin{cases} x^{α} & \text{if } x \geq 0 \quad (0 \leq α \leq 1), \\ -λ(-x)^{β} & \text{if } x < 0 \quad (0 \leq β \leq 1;\ 0 < λ), \end{cases} \tag{1}$$

$$w^{+}(p) = \frac{p^{γ}}{\left(p^{γ} + (1 - p)^{γ}\right)^{1/γ}} \quad (0 < γ \leq 1), \tag{2}$$

$$w^{-}(p) = \frac{p^{δ}}{\left(p^{δ} + (1 - p)^{δ}\right)^{1/δ}} \quad (0 < δ \leq 1), \tag{3}$$

in which x are money amounts and p are probabilities. See Tversky and Kahneman for the details on how they combined these functions to derive subjective utilities of lotteries and to model preferences among lotteries.

For now, we consider only the left half of Table 1. It shows 12 different preference patterns predicted by CPT for those eight prospects. Because there are no mixed prospects (i.e., lotteries involving both gains and losses), we set $λ = 1$ without loss of generality, thereby effectively dropping it from Equation 1 and reducing the utility function to the same form we used for EUT. Setting $β = 0.44$ and $δ = 0.9$ , for example, generates the first pattern (also generated by EUT), and setting $β = 0.35$ and $δ = 0.89$ generates the second pattern in the table. We derived the 12 preference patterns by plugging 1,000 distinct values greater than zero and smaller than one for each of $β$ and $δ$ (1 million combinations of values) into Equations 1 and 3. Rather than the single pattern predicted by EUT, CPT accommodates 12 out of the 256 possible binary preference patterns for these eight decision problems.² In other words, as far as these decision problems are concerned, although arguably still parsimonious, CPT is less parsimonious than EUT. If more people made decisions in accordance with these 12 preference patterns rather than just the one predicted by EUT, then, at the price of a reduction in parsimony, CPT would have purchased greater theoretical scope. If, like various laws of physics or chemistry, ‘everyone in a certain well-delineated population’ made decisions in accordance with these 12 preference patterns and, more generally, made decisions consistent with CPT for ‘all decisions of a certain well-delineated type,’ then we would have a clear sense of CPT’s (possibly ‘immense’) theoretical scope. On the flip side, the cost in parsimony might not be commensurate with the enhancement of its theoretical scope if ‘too many’ people violated CPT in ‘too many situations.’
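The grid search described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure actually used to produce Table 1. For lotteries with two nonzero losses, it assumes the cumulative form from Tversky and Kahneman (1992) in which the worse outcome A receives weight $w^{-}(p)$ and the other outcome receives $1 - w^{-}(p)$. The grid (midpoints of equal-width bins, 100 per parameter) and all function names are hypothetical choices of ours.

```python
import itertools

# Eight loss lotteries (A, p; B, 1 - p) transcribed from Table 1, with A < B <= 0.
LOTTERIES = [
    (-50, 0.10, 0), (-100, 0.05, 0), (-200, 0.01, 0), (-200, 0.10, 0),
    (-400, 0.01, 0), (-100, 0.10, -50), (-150, 0.05, -50), (-200, 0.05, -100),
]


def weight(p, delta):
    """Equation 3: probability weighting function for losses."""
    return p ** delta / (p ** delta + (1 - p) ** delta) ** (1 / delta)


def value(x, beta):
    """Equation 1 for x <= 0 with lambda = 1: v(x) = -(-x)**beta."""
    return -((-x) ** beta)


def cpt_pattern(beta, delta):
    """Risk-seeking (1) or risk-averse (0) prediction for each lottery."""
    pattern = []
    for a, p, b in LOTTERIES:
        w = weight(p, delta)
        # ASSUMPTION: cumulative form for two nonpositive outcomes, in which
        # the worse outcome a receives weight w^-(p) and b receives 1 - w^-(p).
        v_lottery = w * value(a, beta) + (1 - w) * value(b, beta)
        v_sure = value(p * a + (1 - p) * b, beta)
        pattern.append(1 if v_lottery > v_sure else 0)
    return tuple(pattern)


def enumerate_patterns(n_grid=100):
    """Distinct patterns over an n_grid x n_grid grid of (beta, delta) in (0, 1)."""
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]  # hypothetical grid
    return {cpt_pattern(beta, delta)
            for beta, delta in itertools.product(grid, grid)}


if __name__ == "__main__":
    patterns = enumerate_patterns()
    for pat in sorted(patterns, reverse=True):
        print(pat, sum(pat))
    print(len(patterns), "distinct patterns on this grid")
```

The number of distinct patterns returned may depend on the grid resolution and on how knife-edge ties are handled, so the sketch should not be expected to reproduce Table 1 exactly. Extremely small values of δ make the exponent $1/δ$ very large and may overflow in floating point, so a finer grid may need a lower bound above zero.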

One thing is clear: Permitting more than one preference pattern is an important step toward enhancing theoretical scope. Tversky and Kahneman (1992) reported extensive individual differences in behavior, a finding that virtually all scholars agree with, at least in principle. Put differently, the theoretical scope of any theory of decision-making hinges, at least in part, on its ability to balance individual differences with theoretical parsimony. The combination of a single set of mathematical formulas (Equations 1–3) with parameters that can each vary across a continuum of permitted values creates the potential for a theory that is both parsimonious and enjoys great scope.³

Tversky and Kahneman (1992) summarized some of their key findings as follows:

The median exponent of the value function was 0.88 for both gains and losses, in accord with diminishing sensitivity. The median $λ$ was 2.25, indicating pronounced loss aversion, and the median values of $γ$ and $δ$ , respectively, were 0.61 and 0.69. . . . The parameters estimated from the median data were essentially the same. (pp. 311–312)

Following Regenwetter and Robinson (2017), we refer to CPT with these parameter values as $\mathrm{CPT}_{\mathrm{MED}}$. Returning to Table 1, a decision maker who satisfies $\mathrm{CPT}_{\mathrm{MED}}$ will prefer to incur the sure loss rather than play the lottery in every one of the eight decision problems. That predicted preference pattern, shown at the bottom of Table 1, is the diametrical opposite of the EUT prediction. Together, these 12 preference patterns provide an example of CPT’s ability to model huge diversity in a parsimonious fashion, including two preference patterns that are diametrical opposites. It may be tempting to cite CPT’s ability to produce two diametrically opposite preference patterns as ‘proof’ that it lacks parsimony and maybe even that it ‘can fit anything.’ However, because Table 1 still rules out 244 of 256 binary preference patterns, such speculation would be premature.
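As a worked illustration, consider Lottery I of Table 1 under the median parameters $β = 0.88$ and $δ = 0.69$. For a lottery with a single nonzero loss $A < 0$, $v(0) = 0$, so the lottery’s value is $w^{-}(p)\,v(A)$ and the sure loss of the EV has value $v(pA)$. The factor $λ$ appears on both sides and cancels. The values below are our own rounded calculations from Equations 1 and 3.

$$
\begin{aligned}
\text{sure loss preferred} &\iff w^{-}(p)\,(-A)^{β} > (-pA)^{β} \iff w^{-}(p) > p^{β}, \\
w^{-}(0.1) &= \frac{0.1^{0.69}}{\left(0.1^{0.69} + 0.9^{0.69}\right)^{1/0.69}} \approx \frac{0.204}{1.200} \approx 0.170, \\
0.1^{0.88} &\approx 0.132 < 0.170.
\end{aligned}
$$

The overweighting of the 10% chance of loss outweighs the diminishing sensitivity of the value function, so the sure loss of $5 is preferred and the entry for Lottery I is 0.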

We have already asserted that scholars often use theoretical scope in a one-sided fashion to establish limitations in the scope of someone else’s theory. We have also mentioned that CPT has served as a lightning rod for alternative proposals about decision-making. In fact, it is common practice to reduce CPT down to $\mathrm{CPT}_{\mathrm{MED}}$ and to question the entire theory by offering evidence that many people violate the predictions of $\mathrm{CPT}_{\mathrm{MED}}$ on some stimuli (for particularly prominent examples of this line of reasoning in support of alternate proposals, see e.g., Birnbaum, 2008; Brandstätter et al., 2006).⁴ To see how counterproductive this approach can be, consider Table 1 once more. Because EUT is a special case of CPT, anyone who satisfies EUT also satisfies CPT. Yet anyone who satisfies EUT has the exact opposite preference from $\mathrm{CPT}_{\mathrm{MED}}$ for every one of the lotteries in Table 1. Hence, an EUT decision maker (who, by design, also satisfies CPT) violates every prediction of $\mathrm{CPT}_{\mathrm{MED}}$ in Table 1. Using the number of violations to $\mathrm{CPT}_{\mathrm{MED}}$ as a measure of performance of CPT as a whole would lead one to conclude falsely that ‘every prediction of CPT is violated’ when considering the preference pattern of an EUT decision maker. Whereas unilaterally challenging the theoretical scope of one theory to advance another theory already stacks the odds against existing theory, the common practice of attacking CPT by testing only $\mathrm{CPT}_{\mathrm{MED}}$ goes much further in that it misrepresents CPT while calling its scope into question.

We round out our comparison of theoretical scope and parsimony by showing that defining the latter is just as elusive as defining the former. Consider Table 1 once more. We derived preference patterns also for a hypothetical “toy” theory that satisfies Equations 1 through 3 but with parameter ranges $1 < α, β, γ, δ < 2$. The resulting preference patterns for the same loss gambles are given in the right half of Table 1. In one sense, this theory contradicts CPT: Diametrically contrary to CPT, it uses convex utility for gains and concave utility for losses. Yet both CPT and the toy theory use the same mathematical formula, and both also use four parameters that are each defined on a unit interval. Hence, by the most popular heuristic conceptualization of model complexity, both theories are equally parsimonious. Yet for these stimuli, the toy theory predicts only seven binary preference patterns rather than CPT’s 12, two of which are shared with CPT (namely the pattern also predicted by EUT and the pattern also predicted by $\mathrm{CPT}_{\mathrm{MED}}$). In the next section, we show that matters are even more ambiguous and that using this empirical paradigm and these stimuli, one could make a case that CPT is more, equally, or less parsimonious than the ‘toy’ theory.

Four Inconsistencies

We now discuss four internal inconsistencies within Tversky and Kahneman (1992) and how they affect our understanding of CPT’s theoretical scope. As these inconsistencies reveal, Tversky and Kahneman provided evidence that their own theory suffers from limited scope almost in the same way as EUT. More generally, the inconsistencies reinforce the point that social scientists need to meticulously consider the intended scope of their own theory and how evidence stacks for or against it rather than disproportionately focus on the limited scope of contending theories.

Inconsistency 1

Tversky and Kahneman (1992) stated:

The most distinctive implication of prospect theory is the fourfold pattern of risk attitudes. For the nonmixed prospects used in the present study, the shapes of the value and the weighting functions imply risk-averse and risk-seeking preferences, respectively, for gains and for losses of moderate or high probability. Furthermore, the shape of the weighting functions favors risk-seeking for small probabilities of gains and risk aversion for small probabilities of loss, provided the outcomes are not extreme. (p. 306)

Similar statements appeared in the article’s abstract (Tversky & Kahneman, 1992). The authors aimed to document this fourfold pattern in their Table 4 using four different types of prospects described in their Table 3. We provide an adapted excerpt of their Table 4 in our Table 2. All of these prospects were of the form $(A, p; B, 1 - p)$. Our Table 1 shows the collection of eight loss lotteries with $p \leq 0.1$ that underlie the data in Column 4 of Tversky and Kahneman’s (1992) Table 4. The corresponding eight gain lotteries with $p \leq 0.1$ underlying Column 2 of their Table 4 can be obtained by taking the absolute values of all the outcomes in our Table 1. Tversky and Kahneman also used a collection of 17 gain lotteries and a collection of 17 loss lotteries, all with $p \geq 0.5$, for Columns 3 and 5 of their Table 4. We do not repeat those stimuli here.

Table 3.

Test of Loss Aversion in Tversky and Kahneman (1992)

Problem	a	b	c	Median x	Median $θ$	x from $\mathrm{CPT}_{\mathrm{MED}}$	$θ$ from $\mathrm{CPT}_{\mathrm{MED}}$
1	0	0	−25	61	2.44	68.5	2.74
2	0	0	−50	101	2.02	137	2.74
3	0	0	−100	202	2.02	274	2.74
4	0	0	−150	280	1.87	411	2.74
5	−20	50	−50	112	2.07	132	2.73
6	−50	150	−125	301	2.01	357	2.76
7	50	120	20	149	0.97	169	1.63
8	100	300	25	401	1.35	429	1.72

Note: For each problem, participants chose a value of x that made the prospect $(a, .5; b, .5)$ equally attractive as $(c, .5; x, .5)$. The fixed values of a, b, and c; median values of x; and median values of $θ = | (x - b) / (c - a) |$ are from Tversky and Kahneman (1992). The values of x and $θ$ derived from $C P T_{M E D}$ are shown in the right two columns. CPT_MED = cumulative prospect theory with median parameter values.
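As a quick arithmetic check, the two $θ$ columns of Table 3 can be recomputed from the a, b, c, and x columns with the formula in the table note. The following is a minimal sketch; the names `TABLE_3`, `theta`, and `theta_table` are ours and purely illustrative.

```python
# Rows of Table 3: (problem, a, b, c, median x, x derived from CPT_MED).
TABLE_3 = [
    (1, 0, 0, -25, 61, 68.5),
    (2, 0, 0, -50, 101, 137),
    (3, 0, 0, -100, 202, 274),
    (4, 0, 0, -150, 280, 411),
    (5, -20, 50, -50, 112, 132),
    (6, -50, 150, -125, 301, 357),
    (7, 50, 120, 20, 149, 169),
    (8, 100, 300, 25, 401, 429),
]


def theta(a, b, c, x):
    """theta = |(x - b) / (c - a)|, as defined in the note to Table 3."""
    return abs((x - b) / (c - a))


def theta_table(rows=TABLE_3):
    """Map each problem to (theta at the median x, theta at the CPT_MED x)."""
    return {
        problem: (round(theta(a, b, c, x_med), 2), round(theta(a, b, c, x_cpt), 2))
        for problem, a, b, c, x_med, x_cpt in rows
    }


if __name__ == "__main__":
    for problem, (theta_median, theta_cpt) in theta_table().items():
        print(problem, theta_median, theta_cpt)
```

In every row, the $θ$ implied by the $C P T_{M E D}$ value of x is larger than the $θ$ implied by the median x, which is the pattern taken up under Inconsistency 4.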

According to the fourfold pattern, for these prospects, decision makers make risk-seeking choices for gain prospects with $p \leq 0.1$ and loss prospects with $p \geq 0.5$, whereas they make risk-averse choices for loss prospects with $p \leq 0.1$ and gain prospects with $p \geq 0.5$. Tversky and Kahneman’s (1992) Table 4 reports, separately for each of 25 subjects, the percentages of risk-seeking choices in gains with $p \leq 0.1$ in Column 2, gains with $p \geq 0.5$ in Column 3, losses with $p \leq 0.1$ in Column 4, losses with $p \geq 0.5$ in Column 5, and aggregated results. Again, see our Table 2 for an adapted excerpt. Perfectly error-free adherence to the fourfold pattern would mean entries of “100” in Columns 2 and 5 and “0” in Columns 3 and 4.

As we saw earlier, we derived the 12 preference patterns in Table 1 by plugging 1,000 distinct values greater than zero and smaller than one for each of $β$ and $δ$ into Equations 1 and 3. Plugging the same 1 million values for $α$ and $γ$ into Equations 1 and 2 generates 12 preference patterns for the eight gain lotteries with small p by switching zeros and ones in the lower part of Table 1. The center column of our Table 1 shows the associated number of risk-seeking choices, predicted in an error-free decision maker, for each preference pattern. For any integer N from 0 to 8 (except 1), we have found values of $β, δ$ with which CPT predicts a percentage of $\frac{N \times 100}{8}$ in Column 4 of Tversky and Kahneman’s (1992) Table 4 (and in our Table 2). Replacing losses by gains in the top and switching around zeros and ones in the lower part of Table 1 shows that for any number $N'$ from 0 to 8 (except 1), we have found values of $α, γ$ for which CPT predicts $\frac{(8 - N') \times 100}{8}$ in their Column 2. Taken together, CPT permits almost⁵ any possible combination of values in Columns 2 and 4 of their Table 4, even in error-free choice.

This finding leads to several noteworthy insights. First, because Equations 1 through 3 can accommodate almost any conceivable data in Columns 2 and 4 of Tversky and Kahneman’s (1992) Table 4, CPT does not actually imply a fourfold pattern on the prospects that Tversky and Kahneman used in their fourfold pattern study. In particular, the theory is less parsimonious than advertised. Second, any scholar who focused only on these stimuli and on the number of predicted risk-seeking choices among them might mistakenly infer that CPT is a nearly vacuous and essentially irrefutable theory. Third, on these stimuli, the number of risk-seeking choices is a metric that obstructs a clear view of both CPT’s scope and its parsimony. Fourth, the stimuli we reproduced in Table 1 (which we took from the original article) are not diagnostic of CPT’s empirical performance when viewed through the lens of the number of risk-seeking choices.

As we show in the Proofs section, Equations 1 through 3 do, however, indeed predict 0% in Column 3 and 100% in Column 5 of Tversky and Kahneman’s (1992) Table 4 (our Table 2) almost regardless⁶ of the parameter values in Equations 1 through 3. This finding also leads to several noteworthy insights. First, on these stimuli, a portion of the fourfold pattern does indeed follow from the theory. In particular, with respect to the stimuli in Tversky and Kahneman’s Columns 3 and 5, the theory is, indeed, extremely parsimonious. Second, any scholar who focused only on these stimuli and on the number of predicted risk-seeking choices among them might mistakenly infer that CPT is extremely narrow and easy to refute. Third, on these stimuli, the metric of counting risk-seeking pairwise preferences happens to be informative because a value of 0 (or 100) implies a risk-averse (or seeking) preference for every stimulus. Fourth, these stimuli are extremely diagnostic of CPT’s empirical performance, as assessed through the number of risk-seeking choices. In particular, just six out of 25 participants (including Subject 21 in our Table 2) were perfectly aligned with CPT’s prediction in that they showed 0 in Column 3 and 100 in Column 5. To label the other 19 participants as “consistent” with CPT, one needs to permit anywhere from 5% (Subject 6, Column 3) to 42% (Subject 4, Column 5) response errors. Fifth, one way to evaluate CPT’s scope on these stimuli would be to develop a model of within- and between-persons variability that allows one to infer who or how many people satisfy/violate the theory’s predictions on which stimuli after accounting, for example, for response errors.

In all, Tversky and Kahneman’s (1992) discussion of fourfold patterns creates an internal tension in that (a) in contrast to their claim, their theory does not actually imply a fourfold pattern on the stimuli they used to document that pattern; (b) their theory is virtually immune to rejection on two columns in their Table 4 because of undiagnostic stimuli (or an undiagnostic performance statistic); and at the opposite extreme, (c) the other two columns can be viewed as supporting CPT only if one is willing to permit substantial error rates in responses in many participants. Their own study of fourfold patterns thus paints an ambiguous picture about both the parsimony and the theoretical scope of their theory.

This leads us back to considering the inherent difficulty, in current-day psychology, to properly define what we mean by either “parsimony” or “scope” of a theory and how we can go about assessing either of these concepts. One way of looking at this challenge is to consider the substantial ambiguity associated with asking what constitutes a suitable study design to either reject or support a theory: Tversky and Kahneman (1992) used two studies, each with just two decision problems, but a large number of participants, to challenge EUT’s scope in their test of independence. In contrast, their assessment of CPT involved far fewer participants but many more stimuli for each study. How do these experimental design choices affect the balance of evidence between competing theories? A contemporary statistical analysis could evaluate the parsimony and statistical power of each approach, given a suitable probabilistic specification of the theory, for given stimuli. It is far less clear how it would evaluate and take into account the very design of the tasks, studies, and stimuli themselves (for important related challenges, see also Broomell & Bhatia, 2014). The perplexing trade-off between ‘diagnostic’ and ‘undiagnostic’ stimuli is particularly striking in the fourfold pattern study.

In this context, it is useful to return to the toy theory we discussed earlier. We mentioned earlier that both theories use the same mathematical formula, the same number of parameters, and two different but equally sized (unit interval) domains for their parameters. Yet we also saw that CPT permits almost twice as many binary preference patterns on these stimuli. We omit a proof, but when concentrating only on the number of risk-seeking choices, the toy theory can accommodate almost any imaginable data in Tversky and Kahneman’s (1992) Table 4 with little or no reliance on response errors. This makes the toy theory almost irrefutable by that particular statistic on those stimuli. In contrast, as we have seen, through the lens of the same statistic, CPT is more parsimonious in that it predicts values of 0 in Column 3 and 100 in Column 5. This creates a tension: CPT can be viewed as equally parsimonious, more parsimonious, or less parsimonious than the toy theory depending on the viewpoint. At the same time, by almost any measure of fit, CPT does not fit the data nearly as well as the toy theory in two columns. Comparing both the scope and the parsimony of CPT versus the toy theory would only become more complicated if we were to expand from these stimuli to include other stimuli, from binary choices and lotteries to include other tasks, or from these participants in this lab to include other participants in other labs. All in all, we face an inherent ambiguity as to which theory performs better and at what cost. Psychology is yet to develop agreed-on ways to weigh experimental task, study design, stimulus design, parsimony, and goodness of fit. Many aspects of this trade-off reach beyond model-selection methods in statistics largely because the concept of degrees of freedom is ill-defined without a specified data structure. 
For related points, see Yarkoni (2022), who warned that sampling individuals from a population, sampling stimuli from a universe of possible stimuli, and sampling tasks or other study features from a design space creates many additional and unaccounted for sources of variance:

Failing to model such factors appropriately (or at all) means that a researcher will end up either (a) running studies with substantially higher-than-nominal false positive rates, or (b) drawing inferences that technically apply only to very narrow, and usually uninteresting, slices of the universe the researcher claims to be interested in. (Yarkoni, 2022, p. 5)

Returning to Tversky and Kahneman (1992), besides the fourfold pattern, they also discussed a phenomenon called “loss aversion,” according to which decision makers may go to great lengths to avoid losing something. Tversky and Kahneman asked decision makers to determine the value of x that makes a 50/50 chance of receiving either \$a or \$b equally attractive as a 50/50 chance of receiving either \$c or \$x. Our Table 3 shows the eight decision problems they used (see also their Table 6). In many cases, but not all, some among a, b, c are negative numbers, hence denoting monetary losses. The three remaining inconsistencies, which we discuss next, are all related to that study. Each, again, highlights the asymmetric and ambiguous role of theoretical scope in Tversky and Kahneman’s article.

Inconsistency 2

We first reevaluate Tversky and Kahneman’s (1992) findings for their Problem 7 (in which $a = 50$, $b = 120$, $c = 20$), for which they reported a median x value of 149. Suppose for a moment that they identified the median of population x values with perfect accuracy. A median x of \$149 would mean that given the choice between a 50/50 chance of winning either \$50 or \$120, on the one hand, and a 50/50 chance of winning either \$20 or \$149, on the other hand, half of the population would either be indifferent or prefer $(\$50, \frac{1}{2}; \$120, \frac{1}{2})$, and half of the population would either be indifferent or prefer $(\$20, \frac{1}{2}; \$149, \frac{1}{2})$.

Alas, it is impossible, according to CPT, to be indifferent between $(\$50, \frac{1}{2}; \$120, \frac{1}{2})$ and $(\$20, \frac{1}{2}; \$x, \frac{1}{2})$ for any $x \leq 149$. As we show in the Appendix, it is mathematically impossible to obtain a value of $x \leq 149$ for Problem 7 using Equations 1 through 3 above.⁷ This means that the theory is incompatible with half of the population having x values at or below 149, as would be implied by the definition of a median. In other words, if we take Tversky and Kahneman’s (1992) median x of 149 in Problem 7 at face value, then just as roughly half the participants violated prior theories in Tversky and Kahneman’s test of independence, so did at least half the participants violate CPT in their test of loss aversion. For this particular decision problem, CPT suffers from almost exactly the same limitation in theoretical scope as the limitation that Tversky and Kahneman used to hamstring EUT (using two decision problems). Yet this limitation in CPT’s theoretical scope, to our knowledge, has not been previously discussed in the literature. Inconsistency 2 illustrates the strikingly asymmetric role of theoretical scope in decision research: It is routine to cite Tversky and Kahneman’s findings only to question EUT but not to question CPT.
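The Appendix gives the formal argument; a coarse numerical illustration can complement it. The sketch below assumes that Equations 1 through 3 take the standard forms from Tversky and Kahneman (1992), namely $v(y) = y^{α}$ for gains and $w^{+}(p) = p^{γ} / (p^{γ} + (1 - p)^{γ})^{1/γ}$, and it restricts $α$ and $γ$ to a grid on $(0, 1]$. The function names and the grid are ours and purely illustrative.

```python
import math


def w_plus(p, gamma):
    """Assumed gain weighting function: p^g / (p^g + (1 - p)^g)^(1/g)."""
    return p ** gamma / (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)


def indifference_x(a, b, c, alpha, gamma):
    """Solve V(a, .5; b, .5) = V(c, .5; x, .5) for x, given 0 <= c < a < b.

    Assumes v(y) = y**alpha and the rank-dependent form
    V(low, .5; high, .5) = w * v(high) + (1 - w) * v(low), with w = w_plus(.5).
    Rearranging yields x**alpha = b**alpha + ((1 - w) / w) * (a**alpha - c**alpha).
    """
    w = w_plus(0.5, gamma)
    x_pow = b ** alpha + (1.0 - w) / w * (a ** alpha - c ** alpha)
    return math.exp(math.log(x_pow) / alpha)


def smallest_x_on_grid(a=50.0, b=120.0, c=20.0, steps=20):
    """Smallest indifference x over alpha, gamma in {1/steps, ..., 1}."""
    grid = [i / steps for i in range(1, steps + 1)]
    return min(indifference_x(a, b, c, al, ga) for al in grid for ga in grid)


if __name__ == "__main__":
    print(smallest_x_on_grid())                      # Problem 7, whole grid
    print(indifference_x(50.0, 120.0, 20.0, 0.88, 0.61))  # median alpha, gamma
```

Under these assumptions, the smallest x on the grid occurs at the corner $α = γ = 1$, where the indifference condition reduces to $x = 120 + 50 - 20 = 150$, so no grid point yields $x \leq 149$. A grid search is of course only suggestive and does not replace the proof.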

Inconsistency 3

We now consider Problem 8, in which $a = 100$, $b = 300$, $c = 25$ and for which Tversky and Kahneman (1992) reported a median x of 401. Here, we find a different type of internal inconsistency. Tversky and Kahneman, and many scholars since, like to characterize the theory and empirical findings through summary statistics, such as median parameter estimates and median responses. Their median x of 401 in Problem 8 does not directly contradict CPT as a whole. However, when taken at face value, it is nonetheless incompatible with their reported median $γ$ value of 0.61 regardless of the other parameter values in CPT. We provide a demonstration in the Appendix.
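To illustrate the claim numerically, one can fix $γ = 0.61$ and sweep the only other parameter that matters in this pure-gain problem, $α$. The sketch below again assumes the standard Tversky and Kahneman (1992) forms $v(y) = y^{α}$ and $w^{+}(p) = p^{γ} / (p^{γ} + (1 - p)^{γ})^{1/γ}$ for Equations 1 and 2, and it assumes $α \in (0, 1]$; the names `x_problem_8` and `smallest_x_over_alpha` are hypothetical.

```python
import math


def w_plus(p, gamma):
    """Assumed gain weighting function: p^g / (p^g + (1 - p)^g)^(1/g)."""
    return p ** gamma / (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)


def x_problem_8(alpha, gamma=0.61, a=100.0, b=300.0, c=25.0):
    """Indifference x for (a, .5; b, .5) versus (c, .5; x, .5), all gains.

    Uses x**alpha = b**alpha + ((1 - w) / w) * (a**alpha - c**alpha),
    which follows from w * v(b) + (1 - w) * v(a) = w * v(x) + (1 - w) * v(c).
    """
    w = w_plus(0.5, gamma)
    x_pow = b ** alpha + (1.0 - w) / w * (a ** alpha - c ** alpha)
    return math.exp(math.log(x_pow) / alpha)


def smallest_x_over_alpha(gamma=0.61, steps=100):
    """Smallest indifference x over alpha in {0.05, 0.06, ..., 1.00}."""
    alphas = [i / steps for i in range(5, steps + 1)]
    return min(x_problem_8(al, gamma) for al in alphas)


if __name__ == "__main__":
    print(smallest_x_over_alpha())   # compare with the reported median of 401
```

On this grid, the smallest value arises at $α = 1$ and is roughly 403.3, slightly above the reported median of 401; smaller values of $α$ push x further up.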

Inconsistency 4

Table 3 summarizes some information from Tversky and Kahneman’s (1992) direct test of loss aversion (see their Table 6). Besides providing median x values, they also reported that the median value of $θ = | (x - b) / (c - a) |$ approximately equaled 2 in Problems 1 through 6. In their view, in addition to a median $λ$ of 2.25, this empirically supports loss aversion and the popular stylized claim that ‘losses loom roughly twice as large as gains.’ We challenge these pervasive interpretations and conclusions using Tversky and Kahneman’s own published data.

As we quoted earlier, Tversky and Kahneman reported (1992, p. 312), without giving any details, that the “parameters estimated from the median data were essentially the same” as $C P T_{M E D}$ . We derived x and $θ$ values from $C P T_{M E D}$ (right two columns of Table 3). The monetary amounts are systematically larger than the median x values, especially strongly so in Problem 4. Likewise, switching from x to $θ$ , the values of $θ$ predicted by $C P T_{M E D}$ are systematically larger than their empirical counterparts. In all, Tversky and Kahneman’s median responses, median parameter estimates, and theory are internally misaligned. To our knowledge, it is an open question how to combine these different points of view on Tversky and Kahneman’s own empirical evidence into a concise assessment of the theory’s scope. Inconsistency 4 tells us that even very broad statements, for example, about the size of the subpopulation whose parameters satisfy various stylized properties about risk attitudes and/or probability weighting, are not well founded in the original CPT article. In particular, it is not at all clear from their study for whom, for how many people, and for which decisions it is actually the case that losses loom twice as large as gains. In our view, this inherent ambiguity of theoretical scope has huge policy implications. It is not at all clear, from Tversky and Kahneman’s or other scholars’ evidence, who is served and who is hurt by policies built on the presumption that losses loom about twice as large as gains.
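The derivation of the right two columns of Table 3 can be sketched in a few lines. The code below assumes that $C P T_{M E D}$ uses the median estimates reported by Tversky and Kahneman (1992), $α = β = 0.88$, $λ = 2.25$, $γ = 0.61$, and $δ = 0.69$, together with their power value function and one-parameter weighting function; the bisection solver and all function names are ours and only illustrative.

```python
def value(y, alpha=0.88, beta=0.88, lam=2.25):
    """Assumed value function: y^alpha for gains, -lambda * (-y)^beta for losses."""
    return y ** alpha if y >= 0 else -lam * (-y) ** beta


def weight(p, c):
    """Assumed weighting function p^c / (p^c + (1 - p)^c)^(1/c)."""
    return p ** c / (p ** c + (1 - p) ** c) ** (1 / c)


def cpt_5050(y, z, gamma=0.61, delta=0.69):
    """CPT value of the 50/50 prospect (y, .5; z, .5)."""
    low, high = min(y, z), max(y, z)
    w_gain, w_loss = weight(0.5, gamma), weight(0.5, delta)
    if low >= 0:    # pure gains: weight w+ goes to the better outcome
        return w_gain * value(high) + (1 - w_gain) * value(low)
    if high <= 0:   # pure losses: weight w- goes to the worse outcome
        return w_loss * value(low) + (1 - w_loss) * value(high)
    return w_gain * value(high) + w_loss * value(low)   # mixed prospect


def solve_x(a, b, c, lower=0.0, upper=10_000.0, tol=1e-6):
    """Bisection for the x >= 0 with cpt_5050(c, x) equal to cpt_5050(a, b)."""
    target = cpt_5050(a, b)
    while upper - lower > tol:
        mid = (lower + upper) / 2
        if cpt_5050(c, mid) < target:
            lower = mid
        else:
            upper = mid
    return (lower + upper) / 2


# (a, b, c) for Problems 1 through 8 of Table 3.
PROBLEMS = [(0, 0, -25), (0, 0, -50), (0, 0, -100), (0, 0, -150),
            (-20, 50, -50), (-50, 150, -125), (50, 120, 20), (100, 300, 25)]

if __name__ == "__main__":
    for number, (a, b, c) in enumerate(PROBLEMS, start=1):
        x = solve_x(a, b, c)
        print(number, round(x, 1), round(abs((x - b) / (c - a)), 2))
```

Under these assumptions, the solver should return values close to the x column derived from $C P T_{M E D}$ in Table 3 (for example, roughly 68.5 for Problem 1), each of which exceeds the corresponding median x.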

How Can Scope and Parsimony Be Clarified?

Recently, there has been much activity to protect psychology from fraud, improve the quality of research, and strengthen theory. This work has led to prominent recommendations for good practice with respect to a variety of goals. We now review and comment on some of these from our perspective of theoretical scope and parsimony.

One prominent recommendation is to preregister studies. Preregistration is often promoted as a way to decrease post hoc analyses and theorizing because it forces researchers to identify key hypotheses before data collection (e.g., Mistler, 2012; Moore, 2016; Simmons et al., 2021; Wagenmakers et al., 2012). However, as noted by others, preregistration is not a panacea for poor theory development, mediocre methods, or undiagnostic data (see e.g., Lakens & DeBruine, 2021; Szollosi & Donkin, 2021; Szollosi et al., 2020). It is unclear how preregistration guards against the problems and errors we have discussed here. Preregistered studies can engage in the same logical fallacies, use the same stylized statistics, perpetuate the same double standards, repeat the same asymmetric philosophies of science, and be as internally inconsistent as nonpreregistered studies. Preregistering a design that perpetuates ambiguous scope and ambiguous parsimony merely documents study flaws in advance. On the upside, preregistration provides an opportunity for scholars to discuss issues of scope, parsimony, diagnosticity of stimuli, fairness of model selection, and so on ahead of running a study if they so choose.

A second prominent recommendation is increased emphasis on replication (see e.g., Pashler & Harris, 2012; Simons, 2014). Replication helps to improve measurement precision and assess the reliability of an effect of interest. This is inherently useful and can also be leveraged to compute lower and upper bounds on the number of people who satisfy a theoretical claim or display a phenomenon (see e.g., Bogdan et al., n.d.; Davis-Stober & Regenwetter, 2019; Heck, 2021). Thus, replication can help assess scope. However, along with others, we advocate that replicability is far from a panacea: For one thing, efforts invested into reproducing and replicating a prior study as identically as possible are efforts not invested into exploring how the finding extends to other people, novel stimuli, different tasks, or new contexts. Relatedly, for arguments on the relative merits (e.g., of direct and conceptual replication), see also Carpenter (2012), Nosek et al. (2012), Pashler and Harris (2012), Schmidt (2009), and Simons (2014). In other words, replication can be orthogonal to explorations of theoretical scope. Just as importantly, like preregistration, successful replications of a phenomenon would not guard against most of the errors we identify here, such as fallacies of sweeping generalization, conjunction fallacies, and other problems associated with stylized statistics. To the contrary, they can repeat, reinforce, and even perpetuate reasoning errors and scientific biases (Davis-Stober & Regenwetter, 2019; Irvine, 2021; Regenwetter & Robinson, 2017, 2019a, 2019b; Rotello et al., 2015; Yarkoni, 2022). On the upside, we can envision situations in which scholars could both reproduce a prior study and enhance it with additional features that aim to bring theoretical scope and parsimony into better focus. We also advocate that scholars preface replication with a discussion of its impact on understanding scope.

A third recommendation, which is gaining traction especially in cognitive psychology, is to replace or supplement verbal theories with formal computational or mathematical models (e.g., Borsboom et al., 2021; Grahek et al., 2021; Guest & Martin, 2021; Navarro, 2021; Oberauer & Lewandowsky, 2019; Robinaugh et al., 2021; van Rooij & Baggio, 2020). We agree that formal modeling can force researchers to think more explicitly about both the intended scope and the flexibility of their theories. However, formal modeling on its own is not sufficient for addressing many of the double standards and ambiguity problems that we have identified. Clearly, CPT is a formal model. Yet our discussion above demonstrates that the value of formal modeling hinges on how it is implemented (and this point is reinforced by all of the references in this paragraph). Formal modeling can give the appearance of rigor and mask systemic errors (Chen et al., 2021). Formally precise models often force simplifying assumptions or omit hidden variables. These can become counterproductive (for related general points, see also Kellen et al., 2021; Yarkoni, 2022). To keep formal models tractable, scholars may limit themselves to overly simple tasks or simple stimuli (for a discussion, see e.g., Navarro, 2021). On the upside, in contrast to verbal theories, which we would consider inherently ambiguous, formal modeling does provide a common language (logic, computer code, mathematics, and/or statistics) through which to discuss theoretical scope, parsimony, and standards of scientific discourse openly and rigorously. However, as we show when we review the fifth recommendation, although mathematical formulas ostensibly eschew rhetoric, the connection between the mathematics and the substantive questions of interest can also be ambiguous. Our discussion of CPT in this article highlights an example of that broad problem.

A fourth prominent recommendation is to supplement or replace data fitting with prediction to other tasks or unseen data (e.g., Busemeyer & Wang, 2000; Erev et al., 2010, 2017; Pitt et al., 2003; Yarkoni & Westfall, 2017). This has been advocated as a tool for addressing overfitting and for developing theories that generalize. We agree that prediction is an important step toward recognizing and avoiding heuristic approaches to parsimony. However, to avoid asymmetries and double standards, scholars should provide a clear explanation why the participants, tasks, stimuli, and contexts are designed in such a way as not to provide an unfair advantage to some theories over others. Notice that searching a parameter space for best fitting parameters need not make a theory unparsimonious or cause overfitting. The number of parameters is no more than a heuristic measure of parsimony.⁸ CPT is a prominent example in which prediction has gone awry: ‘Refuting’ CPT by testing predictions from $C P T_{M E D}$ or other stylized distortions of the theory gives no consideration to either the theory’s scope or its parsimony. In some prediction tournaments (e.g., Erev et al., 2010), although the parameters of some theories could characterize specific properties of individuals,⁹ the tournament rules required participants to reduce their theory down to a single set of specific parameter values, thereby obfuscating individual differences and collapsing the scope of each theory to a single set of stylized predictions. Likewise and more generally, claiming that a theory performs poorly in predictions is counterproductive when those predictions hinge on untested and unquestioned auxiliary assumptions, such as off-the-shelf statistical models. We turn to this next.

A final major recommendation is increased attention to the problem of “coordination” in psychological research: Theory simultaneously presumes and guides measurement of latent constructs (Irvine, 2021; Kellen et al., 2021; Singmann et al., 2021; van Fraassen, 2008). Some of these articles warn that heated debates about the relative merit of competing theories often pay no attention to the pivotal role of technical and auxiliary assumptions, such as analysis of variance or other off-the-shelf models. Attending to the circular connection between theory and measurement (e.g., attending to auxiliary assumptions) forces researchers to consider both jointly. For a related literature, see the extensive work on meaningfulness in psychological measurement and theory (Falmagne & Doble, 2016; Falmagne & Narens, 1983; Narens, 2002, 2007; Roberts, 1985; Roberts & Rosenbaum, 1986). Attention to the coordination problem and to meaningfulness may lead to a more nuanced understanding of both theoretical scope and theoretical parsimony.

Flagging symptoms of ambiguous scope or parsimony

We have touched on a number of features of Tversky and Kahneman (1992) whose parallels and analogues in other paradigms can flag ambiguous scope or ambiguous parsimony in psychological theory more broadly. First, the most prominent flags are all forms of asymmetric reasoning in which scholars point out shortcomings of others’ theories or evidence without discussing the possible shortcomings of the replacements they propose. Double standards, such as using many more or many fewer stimuli to test the old theory than the new one, using stimuli (even if picked ‘randomly’) that pressure the old theory but not the new one, and pushing a novel theory merely on the basis of its ability to accommodate some ‘anomalies’ that the old theory does not explain, may all create systematic biases against existing theory and in favor of the proposed new theory. When the latter is custom designed to handle certain phenomena, it is important to also understand the associated cost in parsimony. Extreme forms of asymmetric reasoning occur when scholars provide evidence only against special cases of a theory (e.g., $C P T_{M E D}$), thereby literally misrepresenting the theory they question. Second, serious questions of scope arise with mathematical errors or omissions. For example, the mathematical model in Tversky and Kahneman (1992) does not actually imply a fourfold pattern on their own fourfold pattern study stimuli. Third, more broadly, any internal inconsistencies in reasoning can flag problems with parsimony and/or scope. A very troubling yet extremely common practice is strawman null hypothesis testing. Testing hypotheses whose violation is a foregone conclusion (e.g., perfectly calibrated coin flipping as a null model of behavior, the null that two groups are identical) cannot legitimately provide evidence in favor of a proposed theoretical claim (see also Cohen, 1994; Meehl, 1978). 
More broadly, meaningless statistics, such as the ‘number of correct predictions’ or the number of correct modal choices in decision-making, generate useless evidence. Fourth, claims of “converging evidence” are often unsubstantiated. Here, it is useful to see whether it is possible to calculate or estimate how many people satisfy the conjunction of evidentiary phenomena (see also Davis-Stober & Regenwetter, 2019; Regenwetter et al., in press). Fifth, nontechnical readers should be aware that some commonly used model selection criteria, such as AIC and BIC, are heuristic in nature and thus may give an analysis a sheen of rigor that need not be warranted. A major improvement is the use of Bayes’s factors, especially in cases in which the researchers provide information on the possible range of Bayes’s factors for a given study.

How parsimonious is CPT?

We end with an illustration of Bayes’s factors as a quantitative measure of parsimony for CPT. We concentrate on the 8 + 8 + 17 + 17 = 50 stimuli from Tversky and Kahneman’s (1992) fourfold pattern study used in their Table 4 (we show and label some of them in our Table 2). As we have already seen, there are 12 possible preference patterns for the 25 gain prospects and 12 possible preference patterns for the 25 loss prospects. We briefly consider two “probabilistic specifications” of CPT (with Equations 1–3) from Regenwetter et al. (2014) and Zwilling et al. (2019). According to the “aggregation-based” model, each individual has one of the $12 \times 12 = 144$ allowed patterns (out of $2^{50} > 10^{15}$ possible ones) as his or her single “true” preference state. If the person prefers prospect f to prospect g and if we allow the person to make a response error with probability¹⁰ at most $τ$, then the person will choose f with probability $\geq 1 - τ$. According to the “random preference model,” the probability of choosing f over g is the total probability of those preference patterns in which f is preferred to g in an unspecified probability distribution over all $144$ possible preference patterns. Considering just the prospects and preference patterns in our Table 1, note that a decision maker who prefers the lottery in Prospect I also does so in Prospect IV and vice versa. The same holds for Prospects III and V. Moreover, in any pattern in which the lottery is preferable in Prospect II, the lottery is also preferable in Prospect IV, but the converse does not hold. Using similar reasoning or using polyhedral combinatorics,¹¹ one can show that no matter what probability distribution we consider over these preferences, writing $P_{i}$ for the probability of choosing the lottery in Prospect i,

$$1 \geq P_{I} = P_{IV} \geq P_{II} \geq P_{V} = P_{III} \geq 0; \quad P_{II} \geq P_{VII} \geq P_{VIII} \tag{4}$$

$$P_{IV} \geq P_{VI} \geq P_{VIII} \geq 0; \quad P_{V} + P_{VI} \geq P_{VII}. \tag{5}$$

The remaining choice probabilities for the other 17 loss prospects are 1. Similar constraints hold among the choice probabilities for the 25 gain prospects.
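To make the constraints concrete, the following minimal sketch checks whether a vector of lottery-choice probabilities for Prospects I through VIII satisfies Inequalities 4 and 5. The function name, the tolerance, and both example vectors are hypothetical and serve only as an illustration; they are not data from any study.

```python
def satisfies_constraints(P, tol=1e-9):
    """Check Inequalities 4 and 5 for a dict of lottery-choice probabilities.

    Keys are the Roman numerals "I" through "VIII" that label the prospects.
    """
    def ge(x, y):       # x >= y, up to rounding error
        return x >= y - tol

    def eq(x, y):       # x == y, up to rounding error
        return abs(x - y) <= tol

    inequality_4 = (
        ge(1.0, P["I"]) and eq(P["I"], P["IV"]) and ge(P["IV"], P["II"])
        and ge(P["II"], P["V"]) and eq(P["V"], P["III"]) and ge(P["III"], 0.0)
        and ge(P["II"], P["VII"]) and ge(P["VII"], P["VIII"])
    )
    inequality_5 = (
        ge(P["IV"], P["VI"]) and ge(P["VI"], P["VIII"]) and ge(P["VIII"], 0.0)
        and ge(P["V"] + P["VI"], P["VII"])
    )
    return inequality_4 and inequality_5


# Hypothetical choice probabilities, invented for illustration only.
CONSISTENT = {"I": 0.9, "II": 0.7, "III": 0.3, "IV": 0.9,
              "V": 0.3, "VI": 0.6, "VII": 0.5, "VIII": 0.2}
# Same vector, but with P_VII raised above P_II, which violates Inequality 4.
VIOLATING = dict(CONSISTENT, VII=0.95)

if __name__ == "__main__":
    print(satisfies_constraints(CONSISTENT), satisfies_constraints(VIOLATING))
```

In an actual analysis, these probabilities are unknown parameters to be tested against observed choice frequencies, so a check of this kind only describes the shape of the model, not its statistical evaluation.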

Regenwetter et al. (2018) and Zwilling et al. (2019) reviewed how to calculate the range of possible Bayes’s factors between such a given model and an unconstrained “encompassing” model. The aggregation-based model can generate a Bayes’s factor anywhere between 0 and $\frac{1}{144 \times τ^{50}}$ . For $τ = 0.5$ , the upper bound exceeds $10^{12}$ , and for $τ = \frac{1}{4}$ , it exceeds $10^{27}$ . Because the random preference model predicts deterministic choice of the lottery in Tversky and Kahneman’s (1992) 17 “high-probability” loss prospects and deterministic choice of the sure gain in Tversky and Kahneman’s 17 “high-probability” gain prospects, the Bayes’s factor, either in favor or against the random preference model, is unbounded! To conclude about CPT’s parsimony through the lens of Bayes’s factors, for both of these probabilistic specifications, there is no limit as to how much evidence could be provided against CPT on these stimuli.
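The two numerical bounds quoted for the aggregation-based model follow directly from the expression $\frac{1}{144 \times τ^{50}}$. A minimal sketch with exact rational arithmetic (the function name is ours):

```python
from fractions import Fraction


def max_bayes_factor(tau, n_patterns=144, n_stimuli=50):
    """Upper bound 1 / (n_patterns * tau**n_stimuli) quoted in the text."""
    tau = Fraction(tau)
    return 1 / (n_patterns * tau ** n_stimuli)


if __name__ == "__main__":
    for tau in (Fraction(1, 2), Fraction(1, 4)):
        bound = max_bayes_factor(tau)
        print(tau, f"{float(bound):.3e}")
```

For $τ = 0.5$, the bound equals $2^{50} / 144$, and for $τ = \frac{1}{4}$, it equals $2^{100} / 144$, in line with the thresholds of $10^{12}$ and $10^{27}$ stated above.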

Conclusion and Discussion

Since the publication of prospect theory some 40 years ago (Kahneman & Tversky, 1979), scholars have prolifically cited Tversky and Kahneman’s (and others’) findings that EUT suffers from limited scope. Yet for nearly 30 years, it has gone unnoticed that Tversky and Kahneman (1992) provided evidence for the exact same limitation in CPT’s scope: Just like half their participants violated EUT early in their article, so did ostensibly half the respondents violate CPT in Problem 7 of their loss aversion study later in the same article. In addition, some aggregate measures do not align with each other, and the exact role of fourfold patterns is ambiguous. All in all, this raises the question about the overall balance of evidence that the original CPT article ultimately provided in favor of or against its own theory. Likewise, moving beyond the 1992 article and considering CPT’s entire “functional menagerie” (Stott, 2006) of potential utility and weighting functions, it is unclear to us whether these modifications make the theory’s scope or parsimony any less ambiguous and in what way. We are not aware of any consensus in the field as to how to weigh the many versions of the theory against each other and against competing theories or how to determine which version of the theory provides unambiguously the best trade-off between scope and parsimony across all possible stimuli and tasks.

The primary purpose of this article is not to question the validity of CPT as a theory or to provide further grounds to endorse it but, rather, to call attention to the ambiguity of the evidence provided by the authors in support of their own theory. Nor is our goal to single out Tversky and Kahneman for a practice that appears rather widespread. Our goal is conceptual: How should one really think of theoretical scope in psychology? History has been repeating itself in that much behavioral decision research turns CPT’s limitations against it. Scientific cost-benefit analysis in decision science all too often appears to focus on highlighting the cost of others’ theories and the benefits of one’s own proposals. The resulting fault lines have left us with various ‘camps’: Some endorse EUT as their preferred theoretical idealization. Others consider CPT as their sweet spot for a theory of risky choice. Meanwhile, countless articles expand or modify prior theories to accommodate behavioral regularities or irregularities that have been reported as evidence against those theories. Over time, each new generation has tended to develop its new proposals by calling out limitations in previous ideas¹² with little attention to the limitations of the new theses (for examples of notable exceptions, see Brandstätter et al., 2006, 2008; Loomes, 2010, who acknowledged and specified limitations of their own models) and, perhaps more importantly, with little discussion about what limitations are acceptable or unacceptable. The literature in risky choice often appears to follow a three-pronged research strategy: (a) Scrutinize the old theory by showing certain weaknesses and promote the new theory by showing that it overcomes that particular set of weaknesses, (b) leave it to others to explore the new proposed theory’s weaknesses, and (c) either ignore the other camps or defend the new theory vigorously against their challenges. 
We see little effort toward reconciliation among schools of thought that highlight different aspects of and approaches to decision-making. We also detect little effort to weigh strengths and weaknesses of competing theories in a comprehensive manner.

Although our discussion centered on decision-making, our conclusions apply to the discipline more widely. In our view, psychology should move from using theoretical scope primarily as a bludgeon to attack others’ theories and proceed toward pursuing more constructive goals. Every scientific theory has some limitations, especially in psychology. When proposing a new idea, behavioral scientists should make every effort to spell out the intended scope of this new theory. Proposals for new theories are far more interesting when they also delineate what would constitute critical tests, what would qualify as refutation of a new proposal, what is considered beyond a theory’s intended scope, and to whom the theory applies when, where, and why. Likewise, more adversarial collaborations between camps would help bring sense to the balkanized landscape of entrenched schools of thought (for useful guidelines on how to run such projects, see, e.g., Mellers et al., 2001, Table 1).

So, what is one to make of the fact that half of Tversky and Kahneman’s (1992) participants in one of their studies appear to have violated their own theory on one stimulus? What is one to make of the internal inconsistencies among reported findings within one article? It is unclear how to weigh the evidence in favor of CPT on the one hand and the limitations of CPT on the other hand against the corresponding strengths and weaknesses of competing theories. Statistical science actively studies the trade-off between complexity and parsimony in statistical models by counting parameters and degrees of freedom, computing heuristic model selection indices such as AIC and BIC, and applying quantitative model selection tools such as Bayes factors. As Yarkoni (2022) argued in somewhat different words, the ‘sampling’ of stimuli, participants, and design features of psychological research effectively hides uncounted degrees of freedom in the data. Psychology still needs to properly define theoretical scope and theoretical parsimony beyond post hoc statistical models. The discipline should move beyond weighing, say, ‘good’ and ‘bad’ stimuli or study designs heuristically and develop methods, concepts, and standards for weighing theoretical scope against scientific simplicity. A first step is for scholars of different schools of thought to cooperate more systematically in synergizing the strengths of different theories. A second and easy-to-implement step is for scholars to specify what they mean by “diagnostic stimuli” and/or “critical tests” not only for existing theory but also for their own proposed new theory. A third step is for scholars to be more cognizant that every behavioral theory has limitations and therefore to spell out, as explicitly as they can, what scope they envision for their proposed theory.
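For readers less familiar with the model selection indices mentioned above, the following minimal sketch shows how AIC and BIC penalize the number of free parameters. It is an illustration only: the two models, their log-likelihoods, their parameter counts, and the sample size are hypothetical values chosen for this example and are not taken from Tversky and Kahneman (1992) or from any analysis in this article.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical maximized log-likelihoods and parameter counts (assumed values).
models = {
    "simpler model": {"log_likelihood": -130.0, "n_params": 2},
    "more flexible model": {"log_likelihood": -126.0, "n_params": 5},
}
n_obs = 200  # hypothetical number of observations

for name, m in models.items():
    a = aic(m["log_likelihood"], m["n_params"])
    b = bic(m["log_likelihood"], m["n_params"], n_obs)
    print(f"{name}: AIC = {a:.2f}, BIC = {b:.2f}")
```

With these made-up numbers, AIC is lower for the more flexible model whereas BIC is lower for the simpler one, so even two standard indices need not agree on the preferred trade-off. Neither index accounts for the uncounted degrees of freedom that arise from the choice of stimuli, participants, and design features.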

Footnotes

Appendix

Acknowledgements

We thank Meichai Chen, Brittney Currie, Yu Huang, Emily Line, Xiaozhi Yang, and our referees for comments on earlier drafts. We are grateful to Lyle Regenwetter for providing an elegant mathematical proof of Claim I in the Appendix. The National Science Foundation (NSF) and Army Research Office (ARO) had no other role besides financial support. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARO or the U.S. Government, the NSF, colleagues, or the authors’ home institutions. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation herein. Part of this article was previously presented orally at the annual Society for Mathematical Psychology satellite meeting of Psychonomics in 2019 and at the 57th Annual Edwards Bayesian Research Conference in 2020.

Transparency

Action Editor: Frederick L. Oswald

Editor: Frederick L. Oswald

Author Contributions

M. Regenwetter guided and supervised the project and contributed most of the writing. M. M. Robinson and C. Wang discovered the inconsistencies when exploring how much preference heterogeneity cumulative prospect theory permits. They provided most of the mathematical proofs in joint work while M. M. Robinson was a PhD student in psychology at University of Illinois at Urbana-Champaign and C. Wang was an undergraduate mathematics/economics double major at University of Illinois at Urbana-Champaign. All of the authors approved the final manuscript for submission.

ORCID iD

Michel Regenwetter

Notes

References

1. Birnbaum (2008). New paradoxes of risky decision making. Psychological Review, 115, 463–501.

2. Bogdan, P. C., Cervantes, V. H., & Regenwetter (n.d.). Bridging the model-theory gap in mediation analysis: What does mediation reveal about individual people? Manuscript under review.

3. Borsboom, van der Maas, H. L., Dalege, Kievit, R. A., & Haig, B. D. (2021). Theory construction methodology: A practical framework for building theories in psychology. Perspectives on Psychological Science, 16, 756–766.

4. Brandstätter, Gigerenzer, & Hertwig (2006). The priority heuristic: Making choices without trade-offs. Psychological Review, 113, 409–432.

5. Brandstätter, Gigerenzer, & Hertwig (2008). Risky choice with heuristics: Reply to Birnbaum (2008), Johnson, Schulte-Mecklenbeck, and Willemsen (2008), and Rieger and Wang (2008). Psychological Review, 115, 281–290.

6. Broomell, S. B., & Bhatia (2014). Parameter recovery for decision modeling using choice data. Decision, 1, 252–274.

7. Busemeyer & Wang, Y.-M. (2000). Model comparisons and model selections based on generalization criterion methodology. Journal of Mathematical Psychology, 44(1), 171–189.

8. Carpenter (2012). Psychology’s bold initiative. Science, 335, 1558–1561.

9. Chen, Regenwetter, & Davis-Stober (2021). Collective choice may tell nothing about anyone’s individual preferences. Decision Analysis, 18, 1–24.

10. Cohen (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

11. Davis-Stober & Regenwetter (2019). The ‘paradox’ of converging evidence. Psychological Review, 126, 865–879.

12. Erev, Ert, Plonsky, & Cohen (2017). From anomalies to forecasts: Toward a descriptive model of decisions under risk, under ambiguity, and from experience. Psychological Review, 124, 369–409.

13. Erev, Ert, Roth, Haruvy, Herzog, Hau, Hertwig, Stewart, West, & Lebiere (2010). A choice prediction competition: Choices from experience and from description. Journal of Behavioral Decision Making, 23(1), 15–47.

14. Erev, Ert, & Yechiam (2008). Loss aversion, diminishing sensitivity, and the effect of experience on repeated decisions. Journal of Behavioral Decision Making, 21, 575–597.

15. Falmagne, J.-C., & Doble (2016). On meaningful scientific laws. Springer-Verlag.

16. Falmagne, J.-C., & Narens (1983). Scales and meaningfulness of quantitative laws. Synthese, 55, 287–325.

17. Grahek, Schaller, & Tackett (2021). Anatomy of a psychological theory: Integrating construct-validation and computational-modeling methods to advance theorizing. Perspectives on Psychological Science, 16, 803–815.

18. Guest & Martin (2021). How computational modeling can force theory building in psychological science. Perspectives on Psychological Science, 16, 789–802.

19. Heck (2021). Assessing the ‘paradox’ of converging evidence by modeling the joint distribution of individual differences: Comment on Davis-Stober and Regenwetter. Psychological Review, 128(6), 1187–1196. https://doi.org/10.1037/rev0000316

20. Irvine (2021). The role of replication studies in theory building. Perspectives on Psychological Science, 16, 844–853.

21. Kahneman & Tversky (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291.

22. Kellen (2020). The limited value of replicating classic patterns of prospect theory. https://go.nature.com/2YqwWXR

23. Kellen, Davis-Stober, Dunn, & Kalish (2021). The problem of coordination and the pursuit of structural constraints in psychology. Perspectives on Psychological Science, 16, 767–778.

24. Kellen, Pachur, & Hertwig (2016). How (in)variant are subjective representations of described and experienced risk and rewards? Cognition, 157, 126–138.

25. Lakens & DeBruine (2021). Improving transparency, falsifiability, and rigor by making hypothesis tests machine-readable. Advances in Methods and Practices in Psychological Science. Advance online publication. https://doi.org/10.1177/2515245920970949

26. Lleras, Zang, Ballew, & Buetti (2020). A target-contrast signal theory of parallel processing in goal-directed search. Attention, Perception, & Psychophysics, 82, 394–425.

27. Loomes (2010). Modeling choice and valuation in decision experiments. Psychological Review, 117, 902–924.

28. Meehl (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.

29. Mellers, Hertwig, & Kahneman (2001). Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration. Psychological Science, 12(4), 269–275.

30. Mistler (2012). Planning your analyses: Advice for avoiding analysis problems in your research. Psychological Science Agenda. http://www.apa.org/science/about/psa/2012/11/planning-analyses

31. Moore (2016). Preregister if you want to. American Psychologist, 71(3), 238–239.

32. Murphy & ten Brincke (2018). Hierarchical maximum likelihood parameter estimation for cumulative prospect theory: Improving the reliability of individual risk parameter estimates. Management Science, 64, 308–326.

33. Narens (2002). Theories of meaningfulness. Erlbaum.

34. Narens (2007). Introduction to the theories of measurement and meaningfulness and the use of invariance in science. Erlbaum.

35. Navarro, D. J. (2021). If mathematical psychology did not exist we would need to invent it: A case study in cumulative theoretical development. Perspectives on Psychological Science, 16, 707–716.

36. Nilsson, Rieskamp, & Wagenmakers, E.-J. (2011). Hierarchical Bayesian parameter estimation for cumulative prospect theory. Journal of Mathematical Psychology, 55, 84–93.

37. Nosek, Spies, & Motyl (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631.

38. Oberauer & Lewandowsky (2019). Addressing the theory crisis in psychology. Psychonomic Bulletin & Review, 26(5), 1596–1618.

39. Pachur, Schulte-Mecklenbeck, Murphy, & Hertwig (2018). Prospect theory reflects selective allocation of attention. Journal of Experimental Psychology: General, 147, 147–169.

40. Pashler & Harris (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7(6), 531–536.

41. Pitt, Kim, & Myung, I.-J. (2003). Flexibility versus generalizability in model selection. Psychonomic Bulletin & Review, 10(1), 29–44.

42. Popov & Reder (2020). Frequency effects on memory: A resource-limited theory. Psychological Review, 127(1), 1–46.

43. Regenwetter, Cavagnaro, Popova, Guo, Zwilling, Lim, & Stevens (2018). Heterogeneity and parsimony in intertemporal choice. Decision, 5, 63–94.

44. Regenwetter & Davis-Stober, C. P. (2012). Behavioral variability of choices versus structural inconsistency of preferences. Psychological Review, 119(2), 408–416.

45. Regenwetter, Davis-Stober, C. P., Lim, S. H., Guo, Popova, Zwilling, Cha, Y. A., & Messner (2014). QTEST: Quantitative testing of theories of binary choice. Decision, 1(1), 2–34.

46. Regenwetter & Robinson (2017). The construct-behavior gap in behavioral decision research: A challenge beyond replicability. Psychological Review, 124(5), 533–550. https://doi.org/10.1037/rev0000067

47. Regenwetter & Robinson (2019a). The construct-behavior gap revisited: Reply to Hertwig and Pleskac (2018). Psychological Review, 126, 451–454.

48. Regenwetter & Robinson (2019b). Tutorial: Nuisance or substance? Leveraging heterogeneity of preference. The Spanish Journal of Psychology, 22, Article e60. https://doi.org/10.1017/sjp.2019.50

49. Regenwetter, Robinson, & Wang (2022). Are you an exception to your favorite decision theory? Behavioral decision research is a Linda problem! Decision. Advance online publication. https://doi.org/10.1037/dec0000161

50. Roberts (1985). Applications of the theory of meaningfulness to psychology. Journal of Mathematical Psychology, 29, 311–332.

51. Roberts & Rosenbaum (1986). Scale type, meaningfulness and the possible psychophysical laws. Mathematical Social Sciences, 12, 77–95.

52. Robinaugh, Haslbeck, Ryan, Fried, & Waldorp (2021). Invisible hands and fine calipers: A call to use formal theory as a toolkit for theory construction. Perspectives on Psychological Science, 16(4), 725–743.

53. Rotello, C. M., Heit, & Dubé (2015). When more data steer us wrong: Replications with the wrong dependent measure perpetuate erroneous conclusions. Psychonomic Bulletin & Review, 22, 944–954.

54. Scheibehenne & Pachur (2015). Using Bayesian hierarchical parameter estimation to assess the generalizability of cognitive models of choice. Psychonomic Bulletin & Review, 22, 391–407.

55. Schmidt (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13, 90–100.

56. Schneegans, Taylor, & Bays (2020). Stochastic sampling provides a unifying account of visual working memory limits. Proceedings of the National Academy of Sciences, USA, 117(34), 20959–20968.

57. Shepard (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.

58. Simmons, Nelson, & Simonsohn (2021). Pre-registration: Why and how. Journal of Consumer Psychology, 31(1), 151–162.

59. Simons (2014). The value of direct replication. Perspectives on Psychological Science, 9(1), 76–80.

60. Singmann, Cox, Kellen, Chandramouli, Davis-Stober, Dunn, Gronau, Q. F., Kalish, McMullin, S. D., Navarro, & Shiffrin (2021). Statistics in the service of science: Don’t let the tail wag the dog. PsyArXiv. https://doi.org/10.31234/osf.io/kxhfu

61. Stott (2006). Cumulative prospect theory’s functional menagerie. Journal of Risk and Uncertainty, 32, 101–130.

62. Szollosi & Donkin (2021). Arrested theory development: The misguided distinction between exploratory and confirmatory research. Perspectives on Psychological Science, 16, 717–724.

63. Szollosi, Kellen, Navarro, Shiffrin, van Rooij, Van Zandt, & Donkin (2020). Is preregistration worthwhile? Trends in Cognitive Sciences, 24, 94–95.

64. Tversky & Kahneman (1981). The framing of decisions and the psychology of choice. Science, 211, 453–458.

65. Tversky & Kahneman (1992). Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5, 297–323.

66. van Fraassen (2008). Scientific representation: Paradoxes of perspective. Oxford University Press.

67. van Rooij & Baggio (2020). Theory before the test: How to build high-verisimilitude explanatory theories in psychological science. Perspectives on Psychological Science, 16, 682–697.

68. Wagenmakers, E.-J., Wetzels, Borsboom, van der Maas, H. L., & Kievit (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638.

69. Yarkoni (2022). The generalizability crisis. Behavioral and Brain Sciences, 45, Article E1. https://doi.org/10.1017/S0140525X20001685

70. Yarkoni & Westfall (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122.

71. Zwilling, Cavagnaro, Regenwetter, Lim, Fields, & Zhang (2019). QTEST 2.1: Quantitative testing of theories of binary choice using Bayesian inference. Journal of Mathematical Psychology, 91, 176–194.