Abstract
As a discipline that concerns itself with the future, planning relies on forecasts to inform and guide action. With this reliance comes a concern that the best possible forecasts be produced. This review identifies three distinct ways in which forecasts may be evaluated (methodology, accuracy, and usefulness) and describes challenges associated with evaluating forecasts along any of these three dimensions. By way of example, this general discussion of forecasting is applied to the specific case of demand forecasts for transportation infrastructure, with an emphasis on transit infrastructure. There is a continuing need for planners to engage with interdisciplinary forecasting literature.
Planning has been described as a set of activities aimed at “designing the future” (Mazza 1996, 5). As such, it relies on beliefs about the future: both the likely consequences of planning actions and the context in which communities will experience those consequences. Planning actions aimed at increasing employment opportunities, providing affordable housing, reducing crime, improving community cohesion, improving urban mobility, or achieving other goals are based on planners’ beliefs. These include beliefs about what the plan is likely to achieve and what would happen if the plan were not implemented. Such beliefs may or may not be formalized as quantitative forecasts. Thus, as a prerequisite to designing the future, planners must either predict the future or rely on the predictions of others. Good forecasts are therefore often a necessary (if insufficient) precondition for effective plans. Indeed, the subfield of transportation planning relies so heavily upon forecasts that some have (erroneously) defined transportation planning to be synonymous with travel demand forecasting (Ortúzar and Willumsen 2011). Given the connection between effective plans and good forecasts, planners must carefully consider the question: What makes a good forecast?
Planners may seek to temper expectations for perfect forecasts. They may argue that for all their expertise, they do not have crystal balls with which to predict the future. However, in many cases, their predictions are based on equally mysterious artifacts: sophisticated mathematical models. There is nothing magical about these modern “crystal balls”; modern forecasting methods are built on plausible assumptions. They apply empirical relationships that suggest how present conditions will likely evolve in the future. But as forecasting models become increasingly data intensive and complex, the process by which they convert input data into predictions about the future can be far from “crystal clear.” Hence, another metaphor has arisen to describe models that may not be well understood even by the technical experts who use them: a black box.
If forecasters do not fully understand the models they apply to producing forecasts, this is cause for concern. However, an even greater concern is that forecasters understand their models well enough to manipulate their forecasts to produce whatever result they, or their clients, may wish (Pickrell 1992; Flyvbjerg, Skamris Holm, and Buhl 2005; Richmond 2005; Kain 1990).
The purpose of this article is to offer a framework for evaluating forecasts. Is the best forecast always the most accurate forecast? What other characteristics of forecasts are relevant to forecast evaluation? The question of how to evaluate forecasts is understudied. Moreover, planners are not the only experts who produce and rely on predictions about the future. Thus, much can be gained by looking beyond the disciplinary boundaries of planning for insights on forecast evaluation. In this article, I draw on the work of scholars from diverse disciplines including planning, politics, public health, psychology, computer science, and meteorology to address these questions. Based on this literature, I describe a framework for evaluating forecasts based on three dimensions: methodology, accuracy, and usefulness. By way of example, I apply this framework to published studies evaluating demand forecasts for transportation infrastructure.
Literature Search Methodology
The interdisciplinary literature scan took four book-length works as a starting point. The first, Forecasting: An Appraisal for Policy-makers and Planners (Ascher 1979), is widely considered to be a classic on forecasting within the planning discipline. The second, Expert Political Judgment: How Good Is It? How Can We Know? (Tetlock 2009), addresses forecasting from the perspective of political psychology and describes the success of predictions of rare political events. The third, Thinking, Fast and Slow (Kahneman 2013), summarizes decades of insight into prediction by Daniel Kahneman and Amos Tversky, for which Kahneman was awarded the Nobel Memorial Prize in Economic Science in 2002. The fourth, The Signal and the Noise (Silver 2015), is a reflection on forecasting based on the author’s experience as a popular political pollster. In addition to these four works, the works they cite, and the works that cite them, papers from the general forecasting literature were gathered from the International Journal of Forecasting, as well as the works citing and cited by relevant papers from that journal.
Studies of transportation demand forecasts were selected for inclusion in this analysis based on searches of research databases including Transportation Research International Documentation (TRID) and Google Scholar and by recommendations from other transportation planning scholars. Search terms included “travel demand forecast accuracy,” “ridership forecast evaluation,” and “infrastructure demand forecast evaluation.” Works on tourism demand and nontransportation infrastructure were excluded from this review. Works on cost estimation for transportation infrastructure projects were included if they also discussed demand forecasts. Works on the development of travel demand models were also excluded.
Forecast Evaluation Framework
Four scholars, Ascher (1979), Tetlock (2009), Murphy (1993), and Kahneman (2013), have addressed the question of forecast goodness from their respective disciplinary perspectives: public policy, political psychology, meteorology, and behavioral economics. Table 1 summarizes the criteria that each of these scholars has proposed; the terms used in this table are discussed below.
Table 1. Categories of Criteria for Evaluating Forecasts.
In introducing his evaluation of population, energy, economic, transportation, and technology forecasts, Ascher (1979) differentiates the “insider’s approach” from the “outsider’s approach.” The insider’s approach focuses on the scientific validity of the forecasting technique and the correct use of the information that was available to the forecaster at the time the forecast was made. In contrast, the “outsider’s approach” focuses on the accuracy of the forecasts rather than the method that is used to produce them.
Tetlock (2009) likewise proposes two categories of tests to evaluate a political forecaster’s judgment. The first category is “correspondence tests rooted in empiricism” (Tetlock 2009, 7), which correspond neatly with Ascher’s (1979) outsider’s approach. The second is “coherence and process tests rooted in logic” (Tetlock 2009, 7), which are closely related to, if somewhat distinct from, Ascher’s (1979) insider’s approach. Coherence and process tests address the question of whether the forecast is based on a set of internally consistent beliefs that the forecaster updates in response to new evidence.
In reference to weather forecasts, Murphy (1993) identifies three distinct types of forecast goodness: consistency, quality, and value. Consistency is the degree to which a published forecast corresponds to the forecaster’s best judgment. This is closely related to (but again somewhat distinct from each of) Ascher’s (1979) insider’s approach and Tetlock’s (2009) coherence and process tests. Quality is the degree to which a published forecast corresponds to subsequently observed conditions. This corresponds neatly with Ascher’s (1979) outsider’s approach and Tetlock’s (2009) correspondence tests. Murphy’s quality is typically what we refer to when we discuss forecast accuracy. Murphy’s (1993) final type of forecast goodness, value, refers to the usefulness of a forecast to the forecast’s users. While Ascher (1979) and Tetlock (2009) both discuss the concept of forecast usefulness, neither suggests it as a separate criterion for evaluating a forecast.
Kahneman (2013) does not directly address how a forecast should be evaluated but rather discusses two approaches to preparing project-management forecasts: the “inside view” and the “outside view.” A forecast prepared with an inside view is based on the forecaster’s knowledge of specific characteristics of the planned project. The inside view is closely related to forecast evaluation based on Ascher’s (1979) insider’s approach, Tetlock’s (2009) coherence and process tests, and Murphy’s (1993) consistency. A forecast prepared with an outside view is based on observed data from completed projects with similar characteristics to the planned project. The outside view is closely related to Ascher’s (1979) outsider’s approach, Tetlock’s (2009) correspondence tests, and Murphy’s (1993) quality.
Taken together, the loosely related frameworks these four scholars propose suggest three dimensions that are useful in evaluating forecasts: (1) methodology, (2) accuracy, and (3) usefulness. I review each of these dimensions in greater detail below, followed by a discussion of the application of these dimensions to the case of travel demand forecasting.
Method
Forecast methodology comprises three components: (1) forecast input data on historical or existing conditions, (2) theoretical and/or mathematical models, and (3) the forecaster’s judgment.
Forecast Input Data
Ascher (1979) evaluates many forecasts from a variety of disciplines. He finds that improvements in the accuracy of forecast inputs have a much greater influence on forecast accuracy than improvements in the theoretical/mathematical model used to produce the forecast. Alonso (1968) demonstrates how model improvements may be especially ineffective when they increase model complexity. This is because complex models tend to magnify the uncertainty of model inputs. Input data are usually uncertain, either due to error in the measurement of existing conditions or because the inputs are themselves forecasts. Thus, a less conceptually complete, but simpler, model may yield a more reliable forecast than a more conceptually complete, more complex model.
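One way to see how complexity can magnify input uncertainty is first-order error propagation. The sketch below is illustrative only: it assumes (beyond what the source states) that input errors are independent and that the model multiplies its inputs together, and the function name and the 10 percent error figure are hypothetical.

```python
import math

def relative_error_of_product(relative_errors):
    """First-order relative error of a product of independent inputs.

    Assumption: errors are independent, so relative errors add in quadrature.
    """
    return math.sqrt(sum(e ** 2 for e in relative_errors))

# Hypothetical inputs, each measured with 10 percent relative error.
simple_model = relative_error_of_product([0.10])       # one uncertain input
complex_model = relative_error_of_product([0.10] * 4)  # four uncertain inputs multiplied

print(f"simple model relative error:  {simple_model:.0%}")
print(f"complex model relative error: {complex_model:.0%}")
```

Under these assumptions, adding inputs of equal quality to a multiplicative model can only increase the relative uncertainty of the output.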
Theoretical and Mathematical Models
The effect of model complexity described above may help explain Ascher’s (1979) curious finding that the introduction of sophisticated modeling techniques in a variety of forecasting disciplines has generally not improved forecast accuracy. Any improvement in the quality of input data is likely to be offset by the magnification of uncertainty that usually accompanies increasing model complexity.
Ascher’s (1979) finding is underscored by the work of Makridakis, a leading scholar on forecasting, who organized a series of four competitions referred to in the forecasting literature as the Makridakis competitions or M competitions (Makridakis et al. 1982, 1993; Makridakis, Spiliotis, and Assimakopoulos 2018; Makridakis and Hibon 2000; Hyndman and Koehler 2006). The purpose of these competitions was to compare the performance of a variety of extrapolative methods for forecasting time-series data. Two findings that were consistent in all four competitions are relevant in evaluating forecast methodologies. First, more complex methods generally do not perform better than simpler methods. Second, the results obtained by averaging forecasts produced using a variety of methods (or applying a hybrid method) tend to be more accurate than the forecasts produced by any individual method included in the average (Makridakis and Hibon 2000; Makridakis et al. 1982, 1993; Makridakis, Spiliotis, and Assimakopoulos 2018). Interestingly, the latter finding is consistent with Tetlock’s (2009) observation about political forecasts prepared based on expert judgment alone. Tetlock (2009) found that averaging predictions by a group of experts will generally yield a more accurate set of predictions than any one expert will produce.
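The benefit of combining forecasts can be illustrated with a small sketch. The three method names and all values below are hypothetical. The only general guarantee is that the absolute error of a simple average never exceeds the average of the individual absolute errors; beating every individual method, as happens here, is a tendency rather than a certainty.

```python
from statistics import mean

# Hypothetical forecasts of the same quantity from three simple methods.
forecasts = {"trend": 120.0, "no_change": 90.0, "moving_average": 96.0}
actual = 100.0

combined = mean(forecasts.values())  # simple average of the three forecasts
individual_errors = {name: abs(f - actual) for name, f in forecasts.items()}
combined_error = abs(combined - actual)
mean_individual_error = mean(individual_errors.values())

print("individual absolute errors:", individual_errors)
print("combined forecast:", combined, "absolute error:", combined_error)
```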
These findings would suggest that forecasting resources would be better allocated to applying a variety of simple models and averaging or combining the results than to developing a single complex model. However, the increasing availability and affordability of computing power has served to sharply reduce the cost of increasing model complexity.
Judgment
As forecasting models have become more complex and reliant on high-power computing, some have expressed concern that these models are increasingly becoming “black boxes”: mysterious machines that transform inputs into outputs using methods that even the forecaster operating the machine may not fully understand (Ascher 1979; Silver 2015). An argument could be made that one advantage of a black box model is that it reduces reliance on the modeler’s subjective and biased judgment. However, Fair (1971) argues that if the modeler has the ability to tinker with model parameters and inputs to achieve his preferred result, the model offers little advantage over the modeler’s subjective judgment. Ascher (1979) and Silver (2015) both describe this concern with an analogy to the Mechanical Turk, the subject of an essay by Edgar Allan Poe about a machine that astonished audiences throughout Europe in the late eighteenth and early nineteenth centuries by its apparent ability to play chess independently against a human opponent. Poe (1836) correctly surmised that the machine was actually operated by a concealed human chess master. Just as the chess master’s abilities were more impressive when it appeared that they were those of a machine, is it possible that the primary function of computer models is to add credibility to the human modeler’s subjective opinions?
There is some evidence that good judgment on the part of a human model operator improves model performance. Ascher (1979) cites “[n]umerous studies [which] have shown that, contrary to expectations, short-term economic forecasts that require the forecaster to project the values for exogenous variables often are much more accurate than parallel forecasts that incorporate the actual values for these exogenous variables” (p. 68, emphasis in original). This occurs because the forecaster iteratively adjusts model inputs to bring the forecasts into a range that is consistent with the forecaster’s judgment, which can compensate to some degree for errors or omissions in the model specification, but may obscure serious weaknesses in the validity of the model. This finding does not necessarily mean that the model does not add value or that we would do equally well to rely on expert judgment alone:
“The model serves primarily to assess judgmentally the general implications of the forecaster’s assumptions on future exogenous developments, including his ad hoc adjustments for anticipated changes in structure since the sample period, and for the correction of apparent specification errors.” (Hickman 1972, 17)
“Be wary…when you come across phrases like ‘the computer thinks…’ [I]f you get the sense that the forecaster means this…literally—that he thinks of the computer as a sentient being or the model as having a mind of its own—it may be a sign that there isn’t much thinking going on at all. Whatever biases and blindspots the forecaster has are sure to be replicated in his computer program.” (Silver 2015, 293)
A forecaster’s judgment of whether a forecast is within a reasonable range may be somewhat subjective, but the forecaster will ideally update his subjective beliefs based on data. Kahneman (2013) describes how project performance forecasts prepared based on the characteristics of the project itself (described as the “inside view”) can be improved when the forecaster incorporates information about the performance of similar projects that have already been completed (the “outside view”). Thus, the forecaster’s judgment about whether the initial forecast is reasonable is informed by data on similar projects. Kahneman and Tversky (1979) suggest reference class forecasting as a means of going beyond simply anchoring the forecaster’s judgment based on prior performance and applying a formal, quantitative technique to actually develop forecasts based on probability distributions derived from data on other projects. This method was first operationalized and applied to project planning by Flyvbjerg (2006).
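The general idea of adjusting an inside-view forecast with an outside-view distribution might be sketched as follows. This is a simplified illustration, not the procedure published by Flyvbjerg (2006): the reference-class ratios, the inside-view forecast, and all variable names are hypothetical.

```python
from statistics import median, quantiles

# Hypothetical reference class: ratio of actual to forecast ridership
# for completed projects judged similar to the proposed one.
reference_ratios = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]

inside_view_forecast = 10_000  # hypothetical daily riders from the project's own model

# Outside-view adjustment: scale the inside-view forecast by the
# distribution of ratios observed in the reference class.
adjusted_median = inside_view_forecast * median(reference_ratios)
q1, _, q3 = quantiles(reference_ratios, n=4)  # quartiles of the reference ratios
adjusted_range = (inside_view_forecast * q1, inside_view_forecast * q3)

print(f"median adjusted forecast: {adjusted_median:,.0f}")
print(f"interquartile range: {adjusted_range[0]:,.0f} to {adjusted_range[1]:,.0f}")
```

Reporting a range alongside the median keeps the uncertainty observed in the reference class visible to the forecast user.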
Tetlock (2009) and Silver (2015) both discuss how a forecaster should update beliefs according to new evidence by applying Bayes’s theorem. Based on the work of Reverend Thomas Bayes (Bayes and Price 1763) and further refined by the Marquis de Laplace (1902), Bayes’s theorem suggests that new evidence should update the estimated probability that a hypothesis is true based on equation (1), where H represents the hypothesis and E represents the evidence.
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} \tag{1}$$
P(H) in equation (1) is the prior probability, or the probability that the analyst would have assigned to the hypothesis before the new evidence was available. P(E|H) is the probability of observing the new evidence, given that the hypothesis is true, P(E) is the total overall probability of observing the evidence (regardless of whether the hypothesis is true), and P(H|E) is the posterior probability, or the probability that the hypothesis is true, given the evidence that was observed. Thus, when new evidence becomes available, the analyst’s prior beliefs should be updated according to the ratio of P(E|H) to P(E).
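A minimal sketch of this update follows. It assumes P(E) is expanded with the law of total probability, which requires one quantity the text does not name: the probability of the evidence if the hypothesis is false. The function name and the numerical values are hypothetical.

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """Return the posterior P(H|E) from P(H), P(E|H), and P(E|not H)."""
    # Total probability of observing the evidence, whether or not H is true.
    p_e = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / p_e

# Hypothetical values: a 30 percent prior, and evidence four times as
# likely to be observed if the hypothesis is true as if it is false.
posterior = bayes_update(prior=0.30, p_e_given_h=0.80, p_e_given_not_h=0.20)
print(f"posterior probability: {posterior:.3f}")
```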
The degree to which Bayes’s theorem would suggest that a forecaster should update her beliefs in response to new evidence depends on the strength of her prior beliefs. For example, as P(H) approaches one (i.e., the forecaster already believes that the hypothesis is definitely true), P(E) approaches P(E|H), so the ratio of P(E|H) to P(E) approaches one; as P(H) approaches zero (i.e., the forecaster already believes that the hypothesis is definitely false), the posterior probability remains close to zero unless that ratio is very large. In either case, ordinary evidence can have little effect on the posterior probability. This helps to explain Tetlock’s (2009) finding that forecasters with a particular cognitive style, characterized by a strong need for closure and a low tolerance for ambiguity, were more likely to assign very high probabilities to the events they predicted and were unlikely to change the beliefs on which they based their predictions, even after their predictions proved to be false. As a result, forecasters who expressed the greatest certainty about their forecasts generally proved to produce the least accurate forecasts. In other words, “There is often a curiously inverse relationship between how well forecasters thought they were doing and how well they did” (Tetlock 2009, 106–107).
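The influence of prior strength can be checked numerically. In the sketch below, the likelihood values (0.8 and 0.2) and the three priors are hypothetical; the same evidence is applied to a near-certain disbeliever, an undecided forecaster, and a near-certain believer.

```python
def posterior(prior, p_e_given_h=0.8, p_e_given_not_h=0.2):
    """Posterior P(H|E) via Bayes's theorem, with hypothetical default likelihoods."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / p_e

# Apply identical evidence to three forecasters with different priors.
for prior in (0.001, 0.5, 0.999):
    shift = posterior(prior) - prior
    print(f"prior={prior:.3f}  posterior={posterior(prior):.3f}  shift={shift:+.3f}")
```

With these values, the undecided forecaster moves by thirty percentage points, while both near-certain forecasters move by less than one.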
In fact, there is a real danger that forecasters will not only fail to update their beliefs in response to new information, but that information that conflicts with deeply held beliefs may even serve to strengthen those beliefs. This response to contradictory information has been demonstrated in two studies, one on beliefs about the dangers of vaccines (Nyhan et al. 2014) and another on political misperceptions (Nyhan and Reifler 2010).
Methodologies and judgments used to produce a forecast can be evaluated independently from forecast accuracy. Ascher (1979) refers to this type of evaluation as the insider’s approach. The insider’s approach requires an insider’s knowledge of all the assumptions, methods, and judgments that were used to create the forecast. It also requires sufficient expertise to determine whether these assumptions, methods, and judgments were appropriate, given the information available when the forecast was made and the current state of practice. Murphy (1993) goes even further and argues that no one besides the forecaster herself can ascertain whether a forecast was consistent with the forecaster’s own judgment.
Accuracy and Bias
In contrast to evaluating methodology, evaluating forecast accuracy may seem to be a relatively straightforward task of comparing forecast values to observed values. However, a variety of metrics are available to make such comparisons for point forecasts (e.g., those that predict the future value of a quantitative variable), and other approaches are needed to evaluate probabilistic forecasts (including the probability of discrete qualitative events and the probability that a future value will fall within a particular range). Moreover, evaluators of forecast accuracy must also acknowledge the problem of causality in forecast accuracy: some forecasts may actually influence observed future events in ways that affect forecast accuracy.
Accuracy of Point Forecasts
The selection of a metric to describe point forecast accuracy can influence conclusions about the overall accuracy of a particular forecaster or forecasting method. Each of the first three M competitions (Makridakis et al. 1982; Makridakis and Hibon 2000; Makridakis et al. 1993) evaluated forecast accuracy using five different metrics and found that the relative performance of the specific methodologies they evaluated varied depending on the accuracy metric that was used. A substantial body of literature has emerged to compare and propose metrics for evaluating forecast accuracy. Table 2 lists a selection of these metrics.
Table 2. Selected Metrics for Evaluating Point Forecast Accuracy.
Note: f = forecast value; a = actual value; f* = a forecast produced by a “naive” model such as random walk or no change; n = the number of forecasts being evaluated.
Metrics for evaluating forecast accuracy may be either dimensioned or dimensionless. Dimensioned measures are expressed in the units of the quantity being forecast; dimensionless measures are expressed as ratios or percentages.
Dimensioned measures for forecast accuracy
The most basic dimensioned measure for forecast accuracy is error. Error can be calculated as the forecast value minus the actual value. Error will be positive when the forecast is higher than the actual value and negative when the forecast is lower than the actual value. By indicating both the magnitude and direction of the difference between the forecast and the actual value, error is useful for measuring both accuracy and bias for a single forecast. However, for a set of forecasts, average error can indicate average bias, but not average accuracy, since large positive errors have the potential to cancel out large negative errors.
Two common dimensioned measures for the accuracy of a set of forecasts are average absolute error (AAE) and root mean squared error (RMSE; Ascher 1979). AAE for a set of forecasts is calculated as the difference between the forecast value and the actual value (expressed as a positive number regardless of the direction of the error), averaged across all forecasts. Thus, while average error measures bias rather than accuracy, AAE measures accuracy rather than bias. RMSE is calculated by averaging the squares of the errors for each forecast in a set, then taking the square root of that average. The RMSE for a given set of forecasts will always be at least as large as the AAE, but RMSE imposes a greater penalty for extreme errors (Ascher 1979). Thus, two sets of forecasts might have the same AAE if one set of forecasts is moderately accurate and the other includes several very accurate forecasts and a few very inaccurate ones. However, the latter set will have a higher RMSE.
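The contrast among average error, AAE, and RMSE can be sketched with two hypothetical sets of forecasts that share the same AAE. The function names and data are illustrative, and RMSE is computed here with the standard root-of-the-mean definition.

```python
import math

def mean_error(forecasts, actuals):
    """Average signed error (forecast minus actual): a measure of bias."""
    return sum(f - a for f, a in zip(forecasts, actuals)) / len(forecasts)

def average_absolute_error(forecasts, actuals):
    """Average of the absolute errors: a measure of accuracy."""
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(forecasts)

def rmse(forecasts, actuals):
    """Square root of the mean squared error: penalizes extreme errors more."""
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(forecasts))

actuals = [100, 100, 100, 100]
moderate = [110, 90, 110, 90]  # every forecast is off by 10
mixed = [101, 99, 81, 119]     # two very accurate forecasts, two poor ones

for name, forecasts in (("moderate", moderate), ("mixed", mixed)):
    print(name, mean_error(forecasts, actuals),
          average_absolute_error(forecasts, actuals),
          round(rmse(forecasts, actuals), 2))
```

Both sets are unbiased on average and have the same AAE, but only RMSE distinguishes the set containing the larger misses.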
A feature of dimensioned measures of accuracy is that they preserve the scale of the error, and this may or may not be an advantage. For example, in cost estimating, an error of 3 percent might be of greater concern if it represents a difference of one million dollars than if it represents a difference of one hundred dollars. However, this may also make it impractical to compare forecast accuracy across projects or phenomena of widely different scales. Moreover, the amount of error that a forecast user will accept may depend on the scale of the project or phenomenon, so that the ratio of the error to the actual or forecast value is more important than the magnitude of the error. Beginning in the 1980s, econometric forecasting researchers began to shift from a preference for dimensioned to dimensionless measures of forecast accuracy (Armstrong and Collopy 1992).
Dimensionless measures for forecast accuracy
The most common dimensionless measure for forecast accuracy is mean absolute percentage error (MAPE; Tofallis 2015; Armstrong and Collopy 1992; Gneiting 2011). MAPE is calculated as the average of the absolute difference between the forecast and the actual value, divided by the actual value (see the equation in Table 2). MAPE is closely related to mean percentage error, which is used colloquially to describe accuracy. However, as described above in the case of average error, mean percentage error is a measure of bias rather than accuracy. An unbiased but highly inaccurate model may produce a very low mean percentage error because very large errors will be balanced by errors of equal magnitude in the opposite direction.
Mean magnitude of error relative to the estimate (MMER) is another related measure that divides the absolute error by the forecast value rather than by the actual value (see the equation in Table 2). Kitchenham et al. (2001) argue that MMER is a more useful measure for forecast users than MAPE is, since it allows the user to estimate the likely error that she may expect for new forecasts produced by a forecaster or methodology with an established MMER.
A major flaw of MAPE and MMER (and mean percentage error as a measure of bias) is asymmetry: they impose a greater penalty for errors in one direction than in the other. For each metric, a value of zero indicates perfect accuracy and higher values indicate less accuracy. However, assuming that the quantity being forecast is strictly positive (i.e., there can be no meaningful negative values), a forecaster or methodology that consistently underpredicts the actual values will have a maximum MAPE of one (or 100 percent), but there will be no upper limit on its MMER. Conversely, a forecaster or methodology that consistently overpredicts the actual values will have a maximum MMER of one (or 100 percent), with no upper limit on its MAPE. Thus, MAPE imposes a greater penalty for overprediction than for underprediction, and MMER imposes a greater penalty for underprediction than for overprediction.
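These bounds follow directly from the per-forecast terms of each metric. As a brief derivation, writing f for the forecast and a for the actual value as in the note to Table 2, and assuming both are positive:

```latex
\begin{align*}
\text{Underprediction } (0 < f < a):\quad
  & \frac{|f - a|}{a} = 1 - \frac{f}{a} < 1,
  & \frac{|f - a|}{f} = \frac{a}{f} - 1 \to \infty \ \text{as } f \to 0^{+}, \\
\text{Overprediction } (f > a > 0):\quad
  & \frac{|f - a|}{f} = 1 - \frac{a}{f} < 1,
  & \frac{|f - a|}{a} = \frac{f}{a} - 1 \to \infty \ \text{as } f \to \infty.
\end{align*}
```

The term divided by a is the one averaged in MAPE and the term divided by f is the one averaged in MMER, so each metric is bounded in one direction of error and unbounded in the other.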
A second but less serious flaw of MAPE and MMER is that both may be undefined in specific cases. MAPE will be undefined if any actual values are zero; MMER will be undefined if any forecast values are zero.
Armstrong (1978) proposed the symmetric mean absolute percentage error (SMAPE) to address both of these flaws of MAPE and MMER. SMAPE is calculated by dividing the absolute error by the average of the forecast and the actual value (see Table 2). As a result of this modification, the maximum SMAPE for both over- and underprediction is 2.0 or 200 percent, and the penalty for overprediction is the same as for underprediction. For example, if a pessimistic forecaster predicts a value of 50, an optimistic forecaster predicts a value of 200, and the actual value is 100, MAPE would suggest that the pessimist was more accurate, with a MAPE of 50 percent compared to the optimist’s MAPE of 100 percent. However, MMER would suggest that the optimist was more accurate, with an MMER of 50 percent compared to the pessimist’s MMER of 100 percent. SMAPE resolves this discrepancy with a SMAPE for both forecasters of 67 percent (Armstrong and Collopy 1992).
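The arithmetic for this example can be verified with a short sketch. The function names are illustrative, and each function computes the per-forecast term that the corresponding metric averages.

```python
def ape(forecast, actual):
    """Absolute percentage error: the term averaged in MAPE."""
    return abs(forecast - actual) / actual

def mer(forecast, actual):
    """Magnitude of error relative to the estimate: the term averaged in MMER."""
    return abs(forecast - actual) / forecast

def sape(forecast, actual):
    """Symmetric absolute percentage error: the term averaged in SMAPE."""
    return abs(forecast - actual) / ((forecast + actual) / 2)

actual = 100
for label, forecast in (("pessimist", 50), ("optimist", 200)):
    print(f"{label}: MAPE={ape(forecast, actual):.0%}  "
          f"MMER={mer(forecast, actual):.0%}  SMAPE={sape(forecast, actual):.0%}")
```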
In some cases, a researcher may want an accuracy measure that preserves information about the direction of the error. Kitchenham et al. (2001) propose that this can be achieved by using the average of the ratio of the forecast value to the actual value (see Table 2). Kitchenham et al. (2001) and Lokan (2005) refer to this metric as z; Tofallis (2015) refers to the same metric as Q; and Lo and Gao (1997) have used this metric without naming it. Perfect accuracy will yield an average Q of one; underprediction will yield values less than one, and overprediction will yield values greater than one. Q is equal to one plus percentage error, and it suffers from the same primary flaw as average percentage error: when aggregated to an average, it measures bias rather than accuracy. It is possible for a very inaccurate forecaster or method to achieve an average ratio close to one if forecasts are relatively unbiased. The ratio also has the same problems with asymmetry as MAPE and MMER, although Tofallis (2015) and Lo and Gao (1997) recommend correcting for this by taking the log of the ratio (shown as ln(Q) in Table 2).
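The effect of the log transformation can be seen in a small sketch with two hypothetical forecasts, one half and one double the actual value. Variable names are illustrative.

```python
import math
from statistics import mean

forecasts = [50, 200]  # one forecast is half the actual value, the other is double
actuals = [100, 100]

q = [f / a for f, a in zip(forecasts, actuals)]  # ratio of forecast to actual
ln_q = [math.log(ratio) for ratio in q]          # log of the ratio

print("mean Q:    ", mean(q))
print("mean ln(Q):", mean(ln_q))
```

The mean of Q exceeds one, which suggests overprediction, whereas the mean of ln(Q) is zero because halving and doubling are treated as errors of equal size in opposite directions.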
An analyst may be interested in whether a forecaster or methodology offers more value than a “naive” forecast, such as one produced by random guessing or simple rules (e.g., a no-change model). If a set of reference forecasts is available, a few metrics are available to compare a set of forecasts to the reference forecasts. The simplest of these is percent better (see Table 2), which is the percentage of forecasts with a smaller absolute error than the absolute error of the reference forecast. A disadvantage of percent better is that it does not account for how much improvement a specific forecast offers relative to the reference forecast (although this may also be an advantage of percent better in the sense that it is insensitive to outliers). Theil’s (1966) U2 statistic and the geometric mean of the relative absolute error (GMRAE) both incorporate information about the magnitude of error. Theil’s U2 is the ratio of the RMSE for a set of forecasts to the RMSE for a set of reference forecasts (see Table 2). GMRAE is the geometric mean of the ratios of the forecasts’ absolute errors to the reference forecasts’ absolute errors (see Table 2).
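The three comparison metrics can be sketched as follows, using the definitions given in the text. All data are hypothetical, and the variable names are illustrative; note that GMRAE as computed here requires every absolute error to be nonzero.

```python
import math
from statistics import geometric_mean

actuals = [100, 110, 120, 130]
forecasts = [102, 108, 135, 128]   # hypothetical forecasts under evaluation (f)
reference = [95, 100, 110, 120]    # hypothetical naive reference forecasts (f*)

errors = [abs(f - a) for f, a in zip(forecasts, actuals)]
ref_errors = [abs(r - a) for r, a in zip(reference, actuals)]

# Percent better: share of forecasts with a smaller absolute error than the reference.
percent_better = 100 * sum(e < r for e, r in zip(errors, ref_errors)) / len(errors)

# Theil's U2 as described in the text: ratio of the two RMSEs.
def rmse(abs_errors):
    return math.sqrt(sum(e ** 2 for e in abs_errors) / len(abs_errors))

theil_u2 = rmse(errors) / rmse(ref_errors)

# GMRAE: geometric mean of the ratios of absolute errors.
gmrae = geometric_mean([e / r for e, r in zip(errors, ref_errors)])

print(f"percent better: {percent_better:.0f}%")
print(f"Theil's U2: {theil_u2:.3f}")
print(f"GMRAE: {gmrae:.3f}")
```

Values of U2 and GMRAE below one indicate that the forecasts outperform the reference forecasts on that measure.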
Accuracy of Probabilistic and Range Forecasts
All of the metrics in Table 2 are useful for evaluating point forecasts. However, they cannot be directly applied to forecasts that express uncertainty by giving ranges of possible values or by assigning probabilities to particular events. Yet, as Murphy (1993) argues, forecasts should be accurate not only in terms of forecast magnitude but also in terms of the degree of uncertainty associated with the forecast. Since there is never perfect certainty about the future, this notion suggests that forecasts of continuous variables should always be expressed as ranges of possible or likely values.
To assess the accuracy of forecasts that are expressed as ranges, Ascher (1979) takes the midpoint of the range and applies one of the formulae from Table 2, arguing that forecast users are most likely to interpret the midpoint as the most likely value within the range. However, this approach assigns less accuracy to a forecast when the actual value is just within a wide forecast range (in which case the forecast was correct) than when the actual value is just outside a narrow forecast range (in which case, the forecast was incorrect).
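The weakness of the midpoint approach can be illustrated with two hypothetical range forecasts. The function name, ranges, and actual values below are illustrative, and absolute percentage error stands in for whichever Table 2 metric is applied to the midpoint.

```python
def evaluate_range(low, high, actual):
    """Return (absolute percentage error of the midpoint, whether the range held)."""
    midpoint = (low + high) / 2
    midpoint_error = abs(midpoint - actual) / actual
    return midpoint_error, low <= actual <= high

wide_error, wide_held = evaluate_range(50, 150, actual=145)      # actual just inside
narrow_error, narrow_held = evaluate_range(95, 105, actual=107)  # actual just outside

print(f"wide range:   midpoint error {wide_error:.1%}, range held: {wide_held}")
print(f"narrow range: midpoint error {narrow_error:.1%}, range held: {narrow_held}")
```

The wide forecast was correct yet scores far worse on midpoint error than the narrow forecast that missed.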
Murphy and Winkler (1977) suggest reliability as an indication of the quality of probabilistic forecasts that consist of a probability assigned to a dichotomous event. Reliability is the degree to which the forecast probabilities correspond with observed relative frequencies. The most common application of reliability is in weather forecasting. If, when a forecaster predicts a 20 percent chance of rain, it rains 20 percent of the time, the forecaster may be said to be perfectly reliable. The concept of reliability can also be applied to any forecast expressed as a confidence interval. For example, 90 percent confidence interval forecasts are perfectly reliable if observed values fall within the confidence interval 90 percent of the time. Reliability is a useful measure for forecasts of events that occur frequently enough that a meaningful sample of comparison observations can be measured (hence its common application in weather forecasting). However, as Tetlock (2009) notes, probabilistic forecasts of rare events, or events that are conditional on rare circumstances, are much more difficult to evaluate.
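Reliability in this sense can be checked by grouping forecasts by their stated probability and comparing each group with the observed relative frequency. The sketch below uses hypothetical rain forecasts, and the function name is illustrative.

```python
from collections import defaultdict

def reliability_table(probabilities, outcomes):
    """Map each stated probability to the observed relative frequency of the event."""
    counts = defaultdict(lambda: [0, 0])  # stated probability -> [events, forecasts]
    for p, occurred in zip(probabilities, outcomes):
        counts[p][0] += int(occurred)
        counts[p][1] += 1
    return {p: events / total for p, (events, total) in sorted(counts.items())}

# Hypothetical record: five days forecast at 20 percent, five days at 80 percent.
stated = [0.2] * 5 + [0.8] * 5
rained = [True, False, False, False, False,   # rain on 1 of 5 days at 20 percent
          True, True, True, False, False]     # rain on 3 of 5 days at 80 percent

table = reliability_table(stated, rained)
for p, observed in table.items():
    print(f"stated {p:.0%} -> observed {observed:.0%}")
```

In this hypothetical record the 20 percent forecasts are perfectly reliable, while the 80 percent forecasts overstate the chance of rain. A sample this small would not support firm conclusions in practice.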
Causation in Forecast Accuracy
Another obstacle to evaluating forecast accuracy arises when decisions made as a result of the forecast have an influence on observed values. In such cases, the forecast itself may be a cause of the observed value. This may be a result of self-defeating or self-fulfilling forecasts, as described by Ascher (1979). When performance forecasts are used as a basis for project selection, this also introduces selection bias into estimates of forecast accuracy. Eliasson and Fosgerau (2013) demonstrate that when forecasts of project performance cannot achieve perfect accuracy, and projects are selected based on those forecasts so that accuracy is only observable if the forecast is above a particular threshold, observable errors will always be optimistically biased, even if the errors of the full set of forecasts would have been unbiased had all projects been completed. This effect is illustrated in Figure 1, which represents forecast and actual values for sixteen hypothetical proposed projects. Only those projects for which the forecast lies above the selection threshold are selected for completion and thus have observable errors. Based on the observable errors, there appears to be a substantial optimistic bias in the forecasts. However, if all projects had been completed, the full set of forecasts would have been unbiased.

Figure 1. Effect of selection bias on observable forecast accuracy (based on Eliasson and Fosgerau 2013).
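The selection effect can also be reproduced in a small simulation. This is a sketch under assumed values rather than a reproduction of Eliasson and Fosgerau's (2013) analysis: the normal distributions, the threshold of 110, and the number of projects are all hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N_PROJECTS = 10_000
THRESHOLD = 110  # hypothetical selection threshold

# True performance varies across projects; forecast errors are unbiased by construction.
true_values = [random.gauss(100, 20) for _ in range(N_PROJECTS)]
forecasts = [true + random.gauss(0, 20) for true in true_values]
errors = [forecast - true for forecast, true in zip(forecasts, true_values)]

# Only projects whose forecast clears the threshold are built, so only their errors are observable.
selected_errors = [err for forecast, err in zip(forecasts, errors) if forecast > THRESHOLD]

mean_error_all = statistics.mean(errors)                # close to zero
mean_error_selected = statistics.mean(selected_errors)  # clearly positive

print(round(mean_error_all, 2), round(mean_error_selected, 2))
```

Because a high forecast is partly the product of a positive error, conditioning on high forecasts selects for positive errors, even though no individual forecast was biased.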
Usefulness and Value
Murphy (1993) describes the value or usefulness of a forecast as the degree to which the forecast influences forecast users to make better decisions than they would have made without the forecast (or perhaps with a different forecast). Thus, measuring forecast usefulness requires speculation about two counterfactuals that cannot be directly observed:
Would another decision have been made with no (or with a different) forecast?
Would the unselected alternative have led to a better outcome?
Returning to Figure 1, a more accurate forecast would only have led to a different project selection decision for the projects where the line representing the observable or unobservable error crosses the selection threshold. Even when a forecast has substantial error, a more accurate forecast would not have led to a different decision as long as the forecast value and the actual value are both on the same side of the selection threshold.
In the simplified scenario illustrated in Figure 1, the decision is made based exclusively on the forecast. This is not the case for all decisions that might nevertheless be influenced by expert forecasts. Ascher (1979) finds that expert forecasters in fields related to policy and planning tend to be skeptical about the influence of their forecasts on decisions by forecast users, although forecast users themselves “seemed quite convinced of the utility and influence of experts’ forecasts” (p. 16). Ascher attributes this difference not to differing perceptions of how influential forecasts actually are in decision-making but rather to differing perceptions of how influential forecasts ought to be in decision-making: [T]he forecaster’s perspective focuses on securely implanting the forecast in the decision-making routine, which is aided (but not guaranteed) by making consideration of expert forecasts a necessary step in the policy-making process. In contrast, the policy-maker’s perspective calls for forecasts (and technical expertise in general) to be useful in his deliberation, but without reducing his flexibility in the policy choice. (p. 17)
A further danger of consistent bias is its potential to undermine credibility in the long term, diminishing the usefulness of the forecast. In response to the perception of persistent bias, decision makers may seek to counterbalance with additional forecasts from sources perceived to have the opposite bias. De Bruijn and Leijten (2007) describe how such circumstances lead to “contested information” as those for and against a particular decision each produce forecasts that support their respective positions. They argue that the resulting proliferation of conflicting information makes informed decision-making impossible and undermines the usefulness of all forecasts. As a remedy, De Bruijn and Leijten (2007) suggest that this proliferation may be avoided through the creation of “negotiated knowledge” if, prior to analysis, stakeholders agree on a consistent set of assumptions and methods that will be used to produce forecasts and the ways in which the forecast will inform decision-making.
Murphy (1993) argues that, when a forecaster has imperfect information about how the forecast will be used, the usefulness of a forecast is most likely maximized when its accuracy is maximized.
Beyond the aforementioned difficulty in determining whether a different forecast would have led to a different decision, it is likewise difficult to judge whether another decision would have led to a better outcome. Tetlock (2009) refers to this problem with regard to evaluating past political decisions more generally. For example, if a choice between two competing alternative projects is made based on forecasts of each project’s performance, the selected project may underperform so that its actual performance is worse than the forecast performance of the unselected alternative. In that case, it is clear that a more accurate forecast for the selected project would have led to the selection of the other alternative. However, we cannot know that this would have been a better alternative without also knowing how accurate the forecasts for the unselected alternative would have proven to be. If the forecasts for both alternatives had errors of a similar magnitude and in the same direction, the selected alternative would still be the best decision.
Demand Forecasts for Transportation Infrastructure
The preceding section describes three criteria by which planners might evaluate forecasts: methodology (and judgment), accuracy (and bias), and usefulness. The remainder of this article analyzes how each of these criteria has been applied in the literature on forecasts of the demand for transportation infrastructure. Studies of transportation demand forecasts were selected for inclusion in this analysis based on searches of research databases including TRID and Google Scholar and by recommendations from other transportation planning scholars.
Methodology in Transportation Infrastructure Demand Forecasts
As defined in the previous section, forecast methodology comprises three components: (1) input data on historical or existing conditions, (2) a theoretical and/or mathematical model, and (3) the forecaster’s judgment. Siemiatycki’s (2009) work on disciplinary differences in explanations for cost overruns for transportation projects is also relevant to understanding how different disciplines may emphasize different explanations for failure in demand forecasting. Siemiatycki (2009) describes how two groups (academics and government auditors) who study inaccuracy in cost estimates for transportation projects emphasize different explanations for cost overruns. Academics tend to emphasize political and psychological explanations for failures in judgment. Auditors are more likely to emphasize technical explanations for cost overruns, including both errors in input data and errors in the models used to produce estimates. Likewise, the engineering literature has tended to emphasize technical critiques of the inputs and models used for transportation demand forecasts, while the planning literature has critiqued forecasting judgment. This difference illustrates the importance of taking an interdisciplinary approach to understanding and evaluating forecasts.
Forecast input data
With regard to forecast input data describing historical or existing conditions, evaluating demand forecasts for transportation infrastructure is challenging since there can be uncertainty associated with both the forecast values and the observed values. Observed values may be used to evaluate forecast accuracy (as discussed in the following section) and may also be used as forecast inputs or for calibration and validation of mathematical forecast models. Uncertainty in observed, historical demand values may be a result of measurement error or temporal variation in demand, since measurements of actual demand will vary depending on the particular days and times at which demand measurements are made. In spite of this uncertainty and variation in observable transportation demand, estimates of existing demand and other forecast inputs are rarely accompanied by measures of spread, such as 95 percent confidence intervals or standard deviations. Monte Carlo simulation may be used to generate such ranges based on probability distributions of input variables.
Zhao and Kockelman (2002) have demonstrated this application of Monte Carlo simulation to generate ranges of travel demand forecasts based on ranges of input variables. Their findings are consistent with Alonso’s (1968) conclusion that complex models tend to magnify the errors associated with input variables. This phenomenon partly explains Pickrell’s (1989) finding that errors in eight input variables (rail headway, rail operating speed, rail fare, feeder bus headway, auto operating cost, parking cost, population, and employment) account for less than half of the observed forecast error. He concludes that the remaining error must then be explained by “less obvious sources, including the structure of the…models…, the way in which they were applied, or the misinterpretation of their numerical outputs during the planning process” (Pickrell 1989, 29).
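The general idea of using Monte Carlo simulation to turn input distributions into a forecast range can be sketched as follows. This is not Zhao and Kockelman's (2002) model: the three-factor ridership function, the input distributions, and all parameter values are hypothetical and chosen only to keep the example short.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable


def toy_ridership(population, trips_per_person, transit_share):
    """Hypothetical stand-in for a demand model: daily transit trips."""
    return population * trips_per_person * transit_share


draws = []
for _ in range(10_000):
    # Each input is drawn from an assumed distribution instead of a single point value.
    population = random.gauss(500_000, 25_000)
    trips_per_person = random.gauss(3.0, 0.15)
    transit_share = random.gauss(0.05, 0.005)
    draws.append(toy_ridership(population, trips_per_person, transit_share))

# statistics.quantiles with n=20 returns 19 cut points at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(draws, n=20)
low, high = cuts[0], cuts[-1]  # 5th and 95th percentiles
point_forecast = toy_ridership(500_000, 3.0, 0.05)

print(round(low), round(point_forecast), round(high))
```

The resulting range expresses the spread in the output that follows from the assumed spread in the inputs alone; it says nothing about errors in the model structure itself.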
Some of the input data to a demand forecast must be based on survey data, which is subject to sampling error. When a survey sample is representative of the population, sampling error is quantifiable and unbiased. However, a nonrepresentative sample can introduce bias. For example, in a 2010 evaluation of demand forecasts that had been prepared for the proposed California High Speed Rail project, Brownstone, Hansen, and Madanat (2010) note that the forecasters had developed their demand model based in part on the results of a survey in which air travelers were overrepresented relative to car travelers, which may have biased the model to overestimate the effect of travel time on mode choice.
Theoretical and mathematical models
One way that model structure can contribute to forecast error without necessarily being technically incorrect is by magnifying input errors in each step of a complex model (Zhao and Kockelman 2002). Transportation demand forecasts are generally based on regional travel demand models. Travel demand models apply a series of several regression equations to estimate the total number of trips that people will take between every possible pair of origin and destination neighborhoods within the region, as well as the share of those trips that will take place by each mode, and the specific routes that travelers will take (Ortúzar and Willumsen 2011).
At each stage of the modeling process, the forecaster must make assumptions about future changes in the population and economic characteristics of the region and how people will respond to changes in travel times and costs. The outputs of one step in the modeling process are inputs to the next, so small differences in these assumptions can be magnified with each step to have a large effect on the total ridership estimate.
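A simple calculation illustrates how small errors might compound across sequential steps. The sketch assumes that the steps combine multiplicatively and that every step errs in the same direction; in practice, errors may partially offset one another, and real travel demand models are not purely multiplicative.

```python
def compounded_error(step_errors):
    """Relative error in the final output when each step's output is off by the
    given relative error and the steps multiply together."""
    factor = 1.0
    for err in step_errors:
        factor *= 1.0 + err
    return factor - 1.0


# Hypothetical case: four sequential steps, each overestimating by 5 percent.
four_steps = compounded_error([0.05, 0.05, 0.05, 0.05])
print(round(four_steps, 4))  # 0.2155
```

Under these assumptions, four errors of 5 percent each produce an error of roughly 22 percent in the final estimate.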
In their 2010 review of demand forecasts for the California High Speed Rail project, Brownstone, Hansen, and Madanat (2010) primarily emphasize the details of the model itself, including the model’s mathematical forms and the use of engineering judgment to select constraints for model parameters. The application of engineering judgment to develop, validate, and calibrate a model raises important questions about potential for misplaced incentives and motivations to influence the judgment of the forecaster.
Judgment
In an article discussing the accuracy of cost and ridership forecasts for a subset of the projects discussed in the 1989 report (Pickrell 1989), Pickrell (1992) suggests: The most effective way to induce planners and decision-makers to choose projects on the basis of more accurate ridership and cost projections would be to transfer the financial risk of forecasting error from the federal treasury to local government. (p. 170)
Differing measures of success for different types of projects may play a role in creating different incentives for biased forecasts. For example, both rail projects and road projects may be justified by a goal of reducing roadway congestion, either by increasing roadway capacity (through a road project) or by shifting a roadway’s vehicle trips to rail trips (through a rail project). For rail projects, success at achieving congestion relief seems most likely if ridership forecasts are high. For road projects, success at achieving congestion relief seems most likely if traffic volume forecasts are low. Indeed, Parthasarathi and Levinson (2010) have found that, in a sample of traffic forecasts for roadway projects in Minnesota, there has been a tendency to underestimate traffic volumes. Toll roads create a different set of incentives than nontolled facilities, since project promoters must show that the project will generate enough toll revenue for the project to be financially feasible. This may partially explain the finding that demand forecasts for toll road projects in Norway have generally been more accurate than those for nontolled roads (Welde and Odeck 2011; Bain 2009).
Forecasts for public transit infrastructure have been of particular interest to researchers because of the relatively unique separation between the entities preparing forecasts and those bearing the financial risks associated with infrastructure construction.
Gomez-Ibanez (1985) examines the performance of the first three modern (post-1950) light-rail systems to be constructed in North America, including two in Canada (Edmonton and Calgary) and one in the United States (San Diego), all of which were completed between 1978 and 1981. Rather than comparing project performance to specific demand forecasts prepared for those projects, Gomez-Ibanez (1985) compares performance to general claims light-rail advocates had made about the benefits of light-rail. For example, advocates had claimed that, for modestly higher capital costs than bus service, light-rail can attract more transit passengers and serve those passengers at a lower operating cost per passenger mile. He finds neither increases in ridership nor reductions in operating costs, relative to his estimates of what the bus lines the new systems replaced would have achieved. Gomez-Ibanez’s (1985) argument thus rests on assumptions about counterfactuals. As summarized earlier in this article, Tetlock (2009) and Eliasson and Fosgerau (2013) each discuss how the inability to observe counterfactuals can make it difficult to evaluate past decisions. However, since all three cities hosting the light-rail projects Gomez-Ibanez (1985) evaluates had existing bus service that continued after limited replacement by light-rail service, his estimates of how the discontinued routes would have performed are likely to be reliable.
Kain (1990) presents a case study of ridership forecasts prepared for the first urban rail system in Dallas. Since the urban rail system had not yet opened for service and the forecasts were for the years 2000 and 2010, Kain’s (1990) study could not have applied Ascher’s (1979) “outsider’s approach” to forecast evaluation by measuring forecast accuracy. Instead, Kain (1990) takes Ascher’s (1979) “insider’s approach,” focusing on the forecast methodology, with an emphasis on assumptions for inputs such as highway congestion, central business district employment levels, and parking costs. Kain (1990) finds that these assumptions were largely incorrect and points to evidence that the selection of input assumptions was politically motivated to produce inflated ridership forecasts.
In his criticism of the ridership forecasts for the first urban rail lines in Los Angeles, Richmond (2005) likewise takes Ascher’s (1979) “insider’s approach” to forecast evaluation and goes on to claim that empirical tests of forecast accuracy are irrelevant since traditional methods of travel demand modeling are so inadequate that the accuracy of the forecasts they produce can only be coincidental.
Kain (1990) stops short of presenting an explanation for Dallas officials’ consistent efforts to use intentionally misleading analysis to justify a project that was not economically viable: While some advocates were clearly acting out of perceived self-interest, the unswerving and blind commitment of many others to rail is difficult to explain in these terms. I leave it to others, more skilled in bureaucratic and political analysis of psychology to provide an explanation. (p. 193)
Wachs (1990) likewise describes the influence of politics on forecasting for transit projects. After interviews with “public officials, consultants, and planners” in which many shared stories of being pressured to revise forecasts in support of politically popular projects, Wachs concludes, I am absolutely convinced that the cost overruns and patronage overestimates were not the result of technical errors, honest mistakes, or inadequate methods.…The forecasts had to be “cooked” in order to produce numbers which were dramatic enough to gain federal support for the projects whether or not they could be fully justified on technical grounds. (p. 144)
Schmitt (2016) has compiled a database of ridership forecast accuracy for transit projects in the United States that is intended for use by travel demand modelers who wish to apply the principles of reference-class forecasting suggested by Kahneman and Tversky (1979) and Flyvbjerg (2006). Schmitt (2016) evaluates the percentage by which forecasts have tended to overestimate demand by project type and recommends forecasts be adjusted based on these values to account for likely forecast bias. However, if, as Flyvbjerg, Skamris Holm, and Buhl (2005) suggest, forecast bias is intentional, this approach could incentivize forecasters to further inflate their unadjusted demand forecasts in anticipation of such an adjustment.
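The adjustment and the incentive problem can both be illustrated with a brief sketch. The adjustment formula and the 25 percent historical overestimate are assumptions made for illustration; they are not taken from Schmitt (2016).

```python
def reference_class_adjust(raw_forecast, historical_overestimate):
    """Deflate a forecast by the share by which past forecasts in the same class
    exceeded actual demand (0.25 means past forecasts ran 25 percent high)."""
    return raw_forecast / (1.0 + historical_overestimate)


HISTORICAL_OVERESTIMATE = 0.25  # hypothetical value for one project type

# A forecaster who reports an honest estimate sees it adjusted downward.
honest_raw = 10_000
adjusted_honest = reference_class_adjust(honest_raw, HISTORICAL_OVERESTIMATE)  # 8000.0

# A forecaster who anticipates the adjustment can pad the raw forecast to cancel it.
padded_raw = honest_raw * (1.0 + HISTORICAL_OVERESTIMATE)
adjusted_padded = reference_class_adjust(padded_raw, HISTORICAL_OVERESTIMATE)  # 10000.0

print(adjusted_honest, adjusted_padded)
```

In this hypothetical case, the padded forecast survives the adjustment unchanged, while the honest one is penalized.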
Accuracy and Bias in Transportation Infrastructure Demand Forecasts
Early studies of demand forecasts for modern North American transit infrastructure projects necessarily focused on methodology because too few projects had been completed to allow for a large enough sample size to make statistically significant observations about the overall accuracy and bias of demand forecasts for transit projects. As more publicly funded transit projects have been completed, empirical studies of travel demand forecast accuracy for transit projects (such as Schmitt’s [2016] study described above) have become more feasible.
A few researchers have compared the accuracy of demand forecasts among different types of transportation infrastructure projects. In a study comparing demand forecasts for road and rail transportation projects in Europe, Næss, Flyvbjerg, and Buhl (2006) found that demand forecasts for road projects tend to be more accurate than rail project forecasts. This difference could be a function of the low mode shares that are typical on transit, including rail transit. When rail mode shares are much lower than road mode shares, a small error in the overall share of travelers that use rail will result in larger percentage errors in demand for rail travel than in demand for road travel.
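A worked example shows how the same error in mode share translates into very different percentage errors in demand. The total number of trips, the mode shares, and the size of the error are hypothetical.

```python
TOTAL_TRIPS = 100_000
SHARE_ERROR = 0.02  # forecast rail share is two percentage points too high

actual_rail_share = 0.05
actual_road_share = 0.95

actual_rail = TOTAL_TRIPS * actual_rail_share
actual_road = TOTAL_TRIPS * actual_road_share

# The two shares must sum to one, so the rail overestimate is a road underestimate.
forecast_rail = TOTAL_TRIPS * (actual_rail_share + SHARE_ERROR)
forecast_road = TOTAL_TRIPS * (actual_road_share - SHARE_ERROR)

rail_pct_error = (forecast_rail - actual_rail) / actual_rail  # about 0.40
road_pct_error = (forecast_road - actual_road) / actual_road  # about -0.021

print(round(rail_pct_error, 3), round(road_pct_error, 3))
```

Here a shift of 2,000 trips amounts to a 40 percent error in rail demand but only about a 2 percent error in road demand.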
Pickrell (1989) finds that forecasts for ridership on the new rail systems exceeded actual ridership by 28–85 percent, with an average of 65 percent; forecasts of capital costs exceeded actual costs by 17–156 percent, with an average of 77 percent.
A 2003 study (Spielberg et al. 2003) and a 2008 study (Lewis-Workman et al. 2008) by the Federal Transit Administration (FTA) follow up on Pickrell’s (1989) study by comparing his findings with observations of projects that had been completed between 1990 and 2002 and between 2002 and 2006. Both studies found that the accuracy of cost estimates and ridership forecasts completed after Pickrell’s (1989) study was better than that of the forecasts completed before the 1989 study. The authors of the 2003 study (Spielberg et al. 2003) found that unanticipated inflation was a primary source of error in cost estimation and that, when cost estimates were adjusted to reflect inflation, estimates were generally within 20 percent of actual costs. They also suggest that some of the improvement in cost estimate accuracy can be explained by reduced delays in project development and construction, and the resulting shorter time period between cost estimation and project opening. With regard to the accuracy of ridership forecasts, Spielberg et al. (2003) suggest four reasons for observed improvements: increases in experience, greater scrutiny, improved forecasting methods, and improvements in computing power (although, as described earlier in this article, Alonso [1968] argued that the increased model complexity that relies on increased computing power may serve to exacerbate forecast error). They also find that ridership forecasts for initial lines of new systems have been less accurate than those for expansions of existing systems and that forecasts for transitways and downtown people movers were particularly inaccurate.
In spite of the gains documented by Spielberg et al. (2003), Lewis-Workman et al. (2008) did not find that the accuracy of cost estimates or ridership forecasts had continued to improve in the five years following Spielberg et al.’s (2003) study. Both Spielberg et al. (2003) and Lewis-Workman et al. (2008) find that actual service levels have generally been well below those assumed when generating ridership forecasts, but note that it is not clear whether service was reduced in response to low ridership or ridership was lower than anticipated in response to lower-than-anticipated service.
As shown in Table 2, there are many possible metrics that can describe forecast accuracy, and the choice of an accuracy metric can have nontrivial consequences in forecast evaluation. Evaluations of transportation demand forecast accuracy have often used a version of MMER, which is the difference between the forecast and observed values, divided by the forecast value (Welde and Odeck 2011; Parthasarathi and Levinson 2010; Bain 2009; Lewis-Workman et al. 2008; Spielberg et al. 2003; Schmitt 2016; Flyvbjerg, Skamris Holm, and Buhl 2005; Kriger, Shiu, and Naylor 2007; Odeck and Welde 2017; Nicolaisen and Driscoll 2014). As discussed earlier in this article, this metric is asymmetrical in the sense that it imposes a greater penalty for underestimates than for overestimates. Since most of these studies find that demand forecasts have been optimistically biased, it is remarkable that this finding is generally based on a metric that understates the degree of that bias.
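The asymmetry can be seen by scoring two hypothetical forecasts that each miss by a factor of two, one too high and one too low. The function names are illustrative, and an error divided by the actual value is included only for comparison.

```python
def error_relative_to_forecast(forecast, actual):
    """Signed error divided by the forecast value, as in the metric described above."""
    return (forecast - actual) / forecast


def error_relative_to_actual(forecast, actual):
    """Signed error divided by the observed value, shown for comparison."""
    return (forecast - actual) / actual


# Overestimate: forecast of 200 against an actual of 100.
over_by_forecast = error_relative_to_forecast(200, 100)  # 0.5
over_by_actual = error_relative_to_actual(200, 100)      # 1.0

# Underestimate: forecast of 100 against an actual of 200.
under_by_forecast = error_relative_to_forecast(100, 200)  # -1.0
under_by_actual = error_relative_to_actual(100, 200)      # -0.5

print(over_by_forecast, under_by_forecast)
print(over_by_actual, under_by_actual)
```

Dividing by the forecast reports the overestimate as a 50 percent error and the equivalent underestimate as a 100 percent error, so a set of mostly optimistic forecasts appears less biased under this metric than it would if errors were divided by actual values.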
Usefulness in Transportation Infrastructure Demand Forecasts
Few studies evaluating transportation demand forecasts are primarily focused on the question of forecast usefulness, although most explicitly or implicitly adopt the argument expressed by Pickrell (1989): If the divergence between a project’s forecast and actual cost-effectiveness in attracting new transit passengers exceeds the margin by which the chosen alternative was preferred to others that were rejected, the planning process may not have led to selection of the most desirable project. (p. vi)
Pickrell’s argument also assumes that forecasts have an important influence on decision makers’ selection of a preferred alternative. This assumption may seem to conflict to some degree with claims by other researchers suggesting that the early selection of an alternative (based on considerations other than performance forecasts) influences forecasts to a greater degree than forecasts influence the selection of an alternative (Kain 1990; Richmond 2005; Wachs 1990). However, the two claims are not mutually exclusive, since “decision makers” are not a monolithic group: project promoters may successfully insert optimism into forecasts in order to win over more skeptical stakeholders.
Pickrell (1992) suggests that ridership forecasts be expressed as confidence intervals, a practice which would be consistent with Murphy’s (1993) argument that a good forecast must represent the forecaster’s best judgment, not only with regard to the prediction itself but also with regard to the uncertainty associated with that prediction. Interval forecasts would certainly be more accurate than point forecasts, but would they be more useful? In many cases, forecast users may convert forecast ranges to point forecasts by simply using the midpoint of the ranges.
Ultimately, the usefulness of a forecast depends on the purpose for which it is used. If a transportation demand forecast is used for capacity planning, the consequences of underestimating demand are more serious than the consequences of overestimating demand. In this case, forecasts that are less accurate and biased toward overestimation may be more useful than more accurate forecasts. Conversely, when demand forecasts are used to allocate public investment to the most cost-effective projects, more accurate and less biased forecasts are desirable. In these cases, there remains a question with regard to the level of accuracy that is needed. Further research is needed on questions of how forecasts are used in transportation planning practice and how a diversity of uses may be best served by different forecast methodologies and different levels of forecast accuracy and bias.
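The capacity planning case can be illustrated with an asymmetric cost function. The cost weights (a unit of unmet demand costing five times as much as a unit of excess capacity) and the two sets of forecasts are hypothetical.

```python
def capacity_cost(forecast, actual, under_cost=5.0, over_cost=1.0):
    """Cost of building to the forecast: unmet demand is assumed to cost more
    per unit than unused capacity."""
    shortfall = max(actual - forecast, 0)
    excess = max(forecast - actual, 0)
    return under_cost * shortfall + over_cost * excess


def mean_abs_error(forecasts, actuals):
    """Average size of the forecast error, ignoring direction."""
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)


actuals = [100, 120, 90, 110]
unbiased = [95, 125, 95, 105]      # errors of -5, +5, +5, -5
biased_high = [108, 128, 98, 118]  # every forecast is 8 too high

mae_unbiased = mean_abs_error(unbiased, actuals)   # 5.0
mae_biased = mean_abs_error(biased_high, actuals)  # 8.0

cost_unbiased = sum(capacity_cost(f, a) for f, a in zip(unbiased, actuals))   # 60.0
cost_biased = sum(capacity_cost(f, a) for f, a in zip(biased_high, actuals))  # 32.0

print(mae_unbiased, cost_unbiased)
print(mae_biased, cost_biased)
```

Under these assumed costs, the less accurate, upward-biased forecasts lead to a lower total cost. With symmetric costs, the ranking would reverse.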
Conclusion
Returning to the broader topic of the uses of forecasts in planning: in a future-oriented discipline, researchers and practitioners should base their work on the best possible predictions about the future. To do this, we require a framework for evaluating what makes one forecast or prediction better than another. This article presents a framework of three forecast characteristics that are relevant to forecast evaluation: the methodology used to produce the forecast, the accuracy of the forecast, and the usefulness of the forecast.
The review of the general literature on forecast methodology brings into sharp relief the degree to which human judgment is used to shape forecasts that may be presented as the outcome of objective mathematical models. Forecast users should therefore be aware that the biases of forecasters will necessarily be reflected in the forecasts they produce. Since these sources of bias cannot, and perhaps should not, be removed from the forecasting process, forecast users should seek a variety of forecasts produced by a diversity of methods and forecasters.
In evaluating forecast accuracy, forecast users should also recognize the assumptions associated with the metrics they use to describe and evaluate forecast accuracy. Are errors in the positive direction more acceptable than errors in the negative direction? Is a small number of high-magnitude errors more acceptable than a large number of low-magnitude errors? Is there a threshold of acceptable error? The answers to questions such as these should guide the selection of an accuracy metric.
Finally, forecasters should have a clear understanding of how the forecasts they produce are likely to be used and how influential they will be in decision-making. Conversely, forecast users should be intentional in determining the criteria for their decision-making. In selecting these criteria, they should be explicit about the weight they will place on forecasts.
The three characteristics discussed above are closely related. Sound methodologies are necessary to consistently produce accurate forecasts, and an inaccurate or untrustworthy forecast is generally less useful than a more accurate one. Thus, although methodology, accuracy, and usefulness are distinct concepts, forecast evaluation often rests on an (implied or explicit) assumption that one may be a proxy for the other two.
In the specific case of evaluating transportation demand forecasts, most researchers have emphasized forecast accuracy, likely because accuracy lends itself better to ostensibly objective measurement than methodology and usefulness do. However, it is worth emphasizing that accuracy evaluation requires the evaluator to select an accuracy metric and a period of time over which to observe demand for comparison to forecast values, and these decisions can have an important influence on the perceived quality of a set of forecasts. Researchers and policy makers have responded to shortcomings in accuracy by calling for methodological improvements (e.g., reference class forecasting; Schmitt 2016; Flyvbjerg 2006) and for improved ethics and incentives for forecasters (Wachs 1990; Flyvbjerg, Skamris Holm, and Buhl 2005). Few researchers have directly addressed the question of what level of accuracy is necessary for transportation demand forecasts to be useful, although Pickrell’s (1992) call for forecasts to be presented as ranges or confidence intervals rather than as point values certainly addresses the question of forecast usefulness.
Although the methodological and technical activities associated with producing forecasts vary by the topic or value being forecast, many important considerations in forecasting are consistent across disciplinary boundaries. Planners would do well to draw on lessons learned from a wide variety of planning subdisciplines as well as from disciplines beyond planning.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
