Abstract
For a sport that is approaching a state of analytic saturation, major league baseball is devoid of one seemingly critical metric: a manager-value estimator. Whether managers have ever meaningfully influenced their teams’ performances and still do today are matters of significant dissensus. Without a manager-value estimator, that debate (critical, at a minimum, to informed front-office decisionmaking) cannot be convincingly resolved. Drawing on a sample of over 500 managers spanning the history of AL/NL seasons since 1901, this paper develops a manager-value estimator based on manager performance in relation to team records predicted by aggregate player WARs. Simulation and Bayesian methods are used to estimate False Discovery Rates and to form posterior estimates in relation to Regions of Practical Equivalence. Results suggest that a substantial fraction of managers (including current and recently active ones) have over their careers shifted team winning percentages by at least ±0.012, the equivalent of ±2 wins per 162 games. In addition to enabling historical and contemporary comparisons, the mWAR Estimator can also be calibrated to reflect the asymmetric-error costs and risk preferences that characterize the tournament structure of contemporary MLB economics.
Introduction: What difference does the manager make? An analytics lacuna
Do managers matter as much today as they did before the baseball analytics revolution? Debate cleaves commentators down the middle (Keating, 2019; Stark, 2019). Moreover, even those who deny that analytics have diluted the significance of managers are not of one mind: about half say that managers have always been important to the fates of their clubs (Miller, 2025), and half that they never were (Baumann, 2019; Paine, 2014a).
Debates over the determinants of team success in major league baseball are no shock; what is, though, is the silence of contemporary analytics itself in this particular dispute. Baseball is suffused with recently developed, data-driven metrics for quantifying the impact that player performances have on team records. But to date, no value estimator akin to WAR has ever been developed for gauging the consequence of managers.
This is not only weird but disconcerting. A central mission of sports analytics is to guide enlightened front-office decisionmaking. If managers do matter, executive management is effectively flying blind when it makes the multi-million-dollar annual investments now devoted to securing the services of field managers (Elsey, 2025).
This paper aims to fill this measurement void. Using Bayesian statistics, it derives a manager-value estimator by comparing the records of over 500 managers to the performances of the teams they managed as predicted by aggregate player WARs. The estimator scores represent latent managerial skill in terms of wins added to (or subtracted from) the season total of a putative .500 team under an average manager. These mWAR scores can be used to address the question of whether “managers matter” both collectively and individually. Concluding that the answer is yes, the paper shows how the mWAR Estimator can be fashioned into tools that enable users—whether fans, analysts, or front offices—to make informed appraisals of managers as their careers unfold.
Background: The manager-measurement problem
What are we looking for?
We can plainly see what determines baseball outcomes: the scoring of runs. The more runs a team scores, and the fewer of them it allows, the more likely it is to win individual games and to finish atop the standings in a season.
What we can’t see directly is precisely what generates the ratio of runs scored to runs allowed. To be sure, the answer is good hitting, good pitching, and good fielding; but how do we know exactly how good players are at hitting, pitching, and fielding, and how important each is for producing runs and stifling the same?
From the beginning of baseball, fans, commentators, and front-office decisionmakers have been seeking to quantify these attributes with statistical measures (Schwarz, 2004). These statistics—from batting average to ERA to fielding percentages—are value estimators: metrics for quantifying on the basis of observable indicators the not directly observable player characteristics that create and prevent runs.
The history of baseball analytics consists in the refinement and perfection of these estimators. Metrics like OPS, wOBA, Fielding Independent Pitching, and Defensive Efficiency (Blabac, 2010) have relegated the traditional performance statistics to historical curios. The epitome of this process of statistical maturation is “Wins Above Replacement” or WAR (Brill and Wyner, 2024; Smith, 2024)—an empirically validated measure of “true” player value that enjoys tremendous influence, maybe not uniformly among everyday fans but universally with informed analysts and front-office executives (Castrovince, 2019; Yomtov, 2025).
There is no equivalent to WAR for managers—no established value estimator that translates observable indicators into a serviceable index of managers’ impact on outcomes. Commonly cited metrics, like career wins or winning percentage, are the equivalent of the primitive player estimators—batting average, ERA, and fielding percentage—that prevailed before the advent of modern analytics. The most blatant flaw in these measures is their conflating of managerial proficiency with the skill of their players and with the contributions that their teams’ front offices made to assembling them.
But on the basis of successful player-value estimators, we can identify the criteria that a genuine manager-value estimator would need to satisfy. First, such an instrument would need to be valid. That is, it would need to be based on inferentially sound methods that measure what they purport to measure. Earned Run Average, for example, is an invalid estimator: it credits pitchers with run-stifling effects that reflect the proficiency of a team's fielders (Click and Keri, 2006). Fielding percentage is likewise invalid: it equates fielding proficiency with success per play attempted, ignoring that the number of plays a fielder is able to attempt (by virtue of speed and timing) must be considered as well (Humphreys, 2011). These metrics have been superseded by FIP and Defensive Efficiency, respectively.
The declining emphasis on batting average is arguably tied to lack of validity, too: the fraction of at-bats that result in a base hit does not indicate a batter's true contribution to team run-scoring, particularly in comparison to a measure like OPS (Thorn et al., 1984), a truth the book and later movie Moneyball (Lewis, 2004) exposed to common knowledge.
Second, any serviceable value estimator must be informative. It must be expressed in or readily translatable into units that figure in the judgments of those who use them. WAR exemplifies an informative evaluator: it purports to tell us how many wins—exactly what, say, a front-office decisionmaker is trying to maximize—a player is worth. FIP also satisfies this criterion: it tells us how many runs a pitcher is responsible for avoiding independently of his supporting fielders. A metric like OPS is not expressed in runs or wins produced but can easily be converted into an internally and externally valid estimate of exactly those quantities. In contrast, analyses that tell us merely that some aspect of performance is “statistically significant” are generally uninformative—telling us (supposedly) that something matters more than nothing but without indicating how much more or in what way (Cohen, 1994).
What we’ve seen
The field of manager-value estimation has not borne fruit but nevertheless is not barren (Pavitt, 2025: ch. 8). Studying existing systems sharpens understanding of the difficulties that must be overcome to create a workable manager-value estimator.
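Previous investigations can be grouped into three classes. The first makes use of the Pythagorean Expectation formula. Devised by James (1983), the formula posits that team winning percentage equals \( RS^2 / (RS^2 + RA^2) \), where RS is runs scored and RA is runs allowed. Studies in this class attribute to the manager the gap between a team's actual record and its Pythagorean Expectation.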
This approach, however, does not satisfy the validity criterion. The Expectation formula captures—mathematically—the expected ratio of wins to losses when runs scored and runs allowed vary in a normal way (Kaplan and Rich, 2017; Miller, 2007). If we know the runs-scored/runs-allowed ratio, the Pythagorean Expectation formula does not dictate a unique season record—only one located somewhere in the probability distribution of winning percentages associated with that ratio (Braunstein, 2010). Accordingly, any deviation from the Pythagorean Expectation indicates only what the particular random distribution of runs scored by a team happened to look like in a given season. If a team exceeds its expectation, then the frequency of runs scored was necessarily smoother than usual, a pattern that naturally conserves runs for effective use in close games. If a team falls short of its exact Expectation total, then the occurrence of run scoring must have been atypically ragged: blowout wins “wasted” runs that would have had higher marginal value in games lost by one- or two-run margins. But there is no valid evidence that a manager has anything to do with any of this; he is just along for the random-variable ride (Ruggiero et al., 1997).
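The size of the gap at issue can be illustrated with a minimal Python sketch of the Pythagorean Expectation using the classic exponent of 2. The function name, the run totals, and the actual win total are illustrative assumptions, not data from the study.

```python
def pythagorean_expectation(runs_scored: float, runs_allowed: float,
                            exponent: float = 2.0) -> float:
    """Expected winning percentage implied by runs scored and runs allowed."""
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    if scored + allowed == 0:
        raise ValueError("at least one run total must be positive")
    return scored / (scored + allowed)


# Hypothetical team: 800 runs scored, 700 allowed over a 162-game season.
expected_pct = pythagorean_expectation(800, 700)
expected_wins = 162 * expected_pct

# Hypothetical actual record. The gap below is the quantity that
# Pythagorean-based studies credit to the manager, and that the text
# argues is indistinguishable from random variation.
actual_wins = 95
gap = actual_wins - expected_wins
print(f"expected pct {expected_pct:.3f}, "
      f"expected wins {expected_wins:.1f}, gap {gap:+.1f}")
```

The sketch only computes the benchmark; it does nothing to separate a manager's contribution from chance, which is the validity objection raised above.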
Another approach purports to find manager value in the gap between team records and pre-season forecasts. Bill James, in his book Guide to Baseball Managers (1997), proposes a formula for predicting a team's upcoming season winning percentage based on a weighted sum of the club's records in the past three seasons. The difference between the team's actual and predicted winning percentages is attributed to the manager.
This system, too, does not meet the validity standard. To start, James offers no evidence that his forecasting metric actually predicts team performance; if it doesn’t, the wins and losses this method assigns to managers will be arbitrary. In addition, this system assumes—implausibly—that the full allotment of extra wins or losses is due entirely to the manager; surely some fraction of the residual between expected and realized team winning percentage—if not the entirety of it—is a product of simple random variation. Finally, even if deviation from past performance measures something, there is no reason to believe that managers are the source of it. Any such effect is much more likely to be a consequence of front office decisions either to rejuvenate a poorly performing club with fresh talent or to trade off or passively accept migration of a successful team's stars via free agency.
The final class of empirical assessments evaluates managers’ won-loss records controlling for team quality. The idea is that if teams managed by a particular manager do better or worse than predicted by measures of player skill, then that manager can be inferred to be adding—or subtracting—independent value. Some researchers employing this type of analysis have found that managers matter (Kahn, 1993; Porter and Scully, 1982; Singell, 1993), some not (Smart et al., 2008).
This approach is strategically sound in theory but has been implemented imperfectly. Some of its proponents again assign the entire residual of team performance to managers with no attempt to measure what fraction of it might be reasonably attributed to them as opposed to chance variation or other systematic influences (Jaffe, 2010; Porter and Scully, 1982). 1 Others use multivariate methods but rely on weak measures of player performance devised before the advent of modern analytics (Kahn, 1993; Porter and Scully, 1982; Singell, 1993). The variance explained by these measures tends to be much lower than that associated with the state-of-the-art metrics reflected in the prevailing WAR systems. Leaving explainable player-performance variance on the table risks both the creation of spurious manager effects and the inflation of genuine ones. These are validity defects.
Just as important, analyses in this class are often not informative. As is characteristic of many traditional empirical analyses, work in this area is dominated by null hypothesis testing. Investigators are content to show that a manager-effect term had a significant coefficient or added to model R² without either specifying the practical import of that effect or identifying it with particular managers (Crosby, 2021; Kahn, 1993; Paul et al., 2016). Investigations that culminate in a bland pronouncement that manager impact generally can be shown to “differ from zero at p < 0.05” are of little help to analysts trying to appraise the performances of past managers or to front-office decisionmakers trying to gauge the expected value of retaining or replacing any particular manager today.
It is not possible to say that these methodological defects explain the absence of a widely accepted manager-value metric. But it can be said that no measure that fails to correct them merits acceptance. Constructing a metric that satisfies the criteria of validity and informativeness was the motivation for the current inquiry.
Overview of sample, methods, and analytical strategy
Sample
The sample for the current study consisted of 512 managers. This total included every individual who had been entrusted to manage a team between 1901 and 2025 with the exception of those handling only short-term stints (15 games or fewer), trial periods that added little evidentiary value and complicated model fitting. The records of the sampled managers were compiled at the game level and aggregated as appropriate over seasons and careers. The data were compiled using Retrosheet. 2
Previous empirical studies have limited their consideration to only a select number of managers—either ones who managed a substantial number of games over their careers (e.g., Horowitz, 1994) or ones whose tenures spanned a specified set of seasons (e.g., Kahn, 1993; Porter and Scully, 1982; Singell, 1993). Truncating the sample in this fashion not only limits power but also risks bias by extracting model parameters from potentially nonrepresentative classes of managers. These problems are avoided by a sampling method as close as feasible to universal.
Value benchmark: Team WAR
An examination of previous works suggests that the most analytically solid reference point for assessing manager value is the skill of his team's players (Ruggiero et al., 1997). WAR is now recognized as the best estimator of player skill. It stands to reason, then, that the capacity to match or exceed the record projected by aggregate player WAR should furnish the most robust and reliable measure of a manager's own value. 3
One characteristic of WAR that supports this conclusion is its explanatory power. The R² for team WAR as a predictor of wins has consistently been between 0.80 and 0.85. Because it accounts for such a large percentage of the variance in team records, using WAR as a performance benchmark largely vanquishes the specter of omitted-variable bias that has afflicted various previous studies.
Even more importantly, using WAR as the control variable for estimating manager impact neutralizes the two most obvious confounds to manager influence: front-office management and player performance. If a front office acquires better players—or fails to retain ones better than their replacements—this effect will be reflected in team WAR; thus, the inclusion of the WAR predictor offsets the risk that a manager will be credited with this form of front-office contribution. Likewise, nothing the front office does to affect the composition of the team can be expected to influence whether the team exceeds or falls short of the record predicted by the WARs of the players so assembled. Accordingly, there are good grounds to believe that it is neither the front office nor the players but the manager himself who is responsible if one of these patterns regularly occurs on his watch.
That said, there is one potential disadvantage of relying on WAR as a manager-performance benchmark. As James (1997: 148) notes, managers “manage through the talent” of their players. That is, they use their players (or if they are poor managers don’t) to their players’ best effect. To the extent that a manager strategically deploys his individual players in a manner calculated to exploit their strengths and shield their weaknesses (or inappropriately fails to do the same), there should be some correlation between manager quality and player WAR.
The practical effect of this relationship, however, is to understate manager effects. The good manager who inflates his players’ WARs will get less credit, and the bad one who deflates them will suffer less discredit, than they in truth deserve: the winning-percentage impact of their use of player skills will be attributed to their teams’ WARs, not to them. Accordingly, when aggregate player WAR is used as a control variable, any statistical analysis will be biased to some degree toward a finding of no manager effect. This result is unfortunate. But it poses much less of a threat to proper inference than a confound that erroneously attributes to managers effects originating elsewhere. With regard to any signal of manager influence that persists, the endogeneity between WAR and manager quality can be corrected for by recognizing that resulting estimates of manager influence will tend to be uniformly conservative. 4
Two-step analysis
Data analysis was anticipated to involve two steps. The first, preliminary test would be to confirm existence of a genuine signal of manager influence on team performance after taking account of player WARs. Such a signal is a precondition of a valid estimator; it is the absence of one that invalidates studies based on the Pythagorean Expectation formula.
The test contemplated evaluation of the posited manager influence in relation to a realistic simulated null (Beasley and Rodgers, 2012; Westfall, 2011; Zhang, 2020; Zhang and Sun, 2019). The simulation would be structured to measure how far sample managers’ teams would be expected to deviate from their projected WAR records in the absence of any genuine manager decisionmaking. These pure-chance deviations would reveal the effective random-variable perimeter of team WAR. To support the conclusion that managers matter—that teams under their guidance have out- or underperformed WAR by margins that exceed those expected to occur by chance—it would be necessary to show deviations even larger than these. This is exactly the test that manager-value studies based on the Pythagorean Expectation formula fail.
Assuming this test generated a different conclusion here, the next step would be to form a full, Bayesian estimator. Using a multi-level hierarchical regression model, team performance would be modeled at the first level as a function of the observed team WAR, and at the second a manager-specific latent effect drawn from a population distribution. The parameters at both levels would be estimated iteratively via Markov-chain Monte Carlo, a process that repeatedly updates the team-level coefficients, the manager-specific latent effects, and the dispersion of the manager effects until the estimates stabilize consistent with the model's assumptions and the observed data (Gelman and Hill, 2007: 408–409). Such a model would yield posterior estimates of individual managers’ impacts—ones based on appropriately combining observed outcomes with a prior distribution of manager effects centered at zero. The posterior estimates would represent the latent skill of the managers—the results not of cumulative performance but of cumulative measurement of an unobserved managerial acumen presumed constant from the outset of their careers.
These estimates would satisfy the criterion of informativeness. Not mere recitations of “statistical significance,” the estimator results would take the form of probabilistic assessments of the additional wins a manager's guidance yields for what would otherwise be a .500 team under an “average” manager. Because an “average” manager is effectively always available as a replacement in the supply-heavy manager market (Silvers and Susmel, 2014; cf. Becker, 2024), the resulting score was to be designated as the manager's “mWAR,” for managerial wins above replacement. 5
Operationalization
After completion of the formation of the mWAR Estimator, refinements related to its optimal use were to be addressed. These included development of a single-season variant of the estimator and a customization feature aimed at conforming Estimator assessments to heterogeneous utility functions and risk preferences anticipated to characterize responses to the distinctive tournament economic structure of Major League Baseball.
Data analyses
Step 1: Confirming a signal
A GLM binomial logit was fit to the data:
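A sketch of a specification consistent with the description in this section. Only zWPG and the manager term \(u_i\) (written ui in the text) are named in the paper; \(W_{is}\) and \(G_{is}\) (wins and games for manager \(i\) in season \(s\)), \(p_{is}\), \(\beta_0\), and \(\beta_1\) are assumed notation.

```latex
W_{is} \sim \mathrm{Binomial}\!\left(G_{is},\, p_{is}\right),
\qquad
\operatorname{logit}(p_{is}) = \beta_0 + \beta_1\,\mathrm{zWPG}_{is} + u_i
```

In this first step \(u_i\) enters as a fixed effect, one coefficient per manager.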
A parametric bootstrap was used to simulate a realistic null (Beasley and Rodgers, 2012; Efron, 2008; Westfall, 2011). In this simulation, the mWAR model was estimated without the manager fixed-effect term ui; because whatever influence particular managers exert was deliberately removed, this process effectively estimated the impact that WAR would have were every team piloted by the “average” manager (Hardy, 1993: 68). Then 20,000 replicate manager careers were generated by substituting new win totals drawn randomly from a binomial distribution pegged to each team's WAR-per-game-estimated winning percentage. At that point, the full mWAR Model, with manager fixed effects restored, was fit to each simulated career set. The Model's “average” manager effect shares (0 on net) were thus distributed to each sample member's 20,000 phantom fill-ins in amounts that varied solely on account of the random statistical churning of team wins and the impacts of team WAR per game (zWPG) upon the same (Davison and Hinkley, 1997).
The use of a realistic simulated null offers a number of advantages. Ordinarily, empirical results are evaluated in relation to a “theoretical null,” one that assumes the data-generation process being examined will generate values with normally distributed test statistics (Efron, 2008). That assumption, however, is contestable; if, in fact, the process generates a non-normal distribution, the theoretical null will generate biased results, ones either too conservative or too liberal. An appropriately designed simulated null can be expected to generate a more accurate representation of what the random distribution of outcomes associated with it actually looks like. This distribution can then be used to form a more trustworthy estimate of the frequency of extreme outcomes one would expect such a process to generate sheerly by chance (Westfall, 2011; Zhang and Sun, 2019).
The simulated realistic null also addresses an inferential problem associated with diverse career durations. Some long-term managers will acquire the appearance of non-chance influence sheerly by virtue of the higher level of measurement precision attributable to a large number of games managed (Kennedy, 2003). Short-term managers might have their impact concealed for similar reasons—but a certain number will also misleadingly appear consequential by virtue of a chance trough in the effect size of WAR as a random-variable influence on winning percentage (Gelman and Weakliem, 2009). The sum effect is a state of uncertainty about the number of non-extreme outcomes a “true” null benchmark would actually contain.
That is exactly the difficulty that a well-formed realistic null simulation is crafted to dispel. Each manager is judged in relation to his own null—his zombie counterparts’ 20,000 career-level encounters with the random influence of WAR on team winning percentage. The extremeness of a manager's record is not being measured against a “one size fits all,” sample-wide standard deviation but against the full probability distribution of outcomes for a career identical in length, and in team quality, to his own; each manager's z-statistic is just a report, placed on a sample-wide common scale, of where his personal winning percentage ranks in relation to his individual null-sample stand-ins. All that needs to be done at that point is to compare the respective number of observed managers who exceed the specified |z| ≥ 1.96 threshold with the number of zombie ones who do in each simulation replicate. How much more frequently (if at all) the real-world managers exceed that threshold can then be used to compute the ratio of genuine to spurious instances of manager effects in the observed sample (Efron, 2008, 2012: ch. 5; Storey, 2003).
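The per-manager null can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used in the study: it compares career win totals directly rather than refitting the full model to each replicate, it uses 2,000 replicates instead of 20,000 to keep the run short, and the season data, function names, and mid-rank conversion to a z-score are all hypothetical choices.

```python
import random
from statistics import NormalDist


def simulate_career_wins(seasons, rng):
    """Draw one null career: binomial wins per season at the WAR-implied rate."""
    return sum(
        sum(rng.random() < p for _ in range(games))
        for games, p in seasons
    )


def null_z(observed_wins, seasons, n_reps=2000, seed=0):
    """Place a manager's career wins within his own simulated null, as a z-score."""
    rng = random.Random(seed)
    sims = [simulate_career_wins(seasons, rng) for _ in range(n_reps)]
    below = sum(w < observed_wins for w in sims)
    ties = sum(w == observed_wins for w in sims)
    # Mid-rank empirical CDF, kept strictly inside (0, 1) so inv_cdf is defined.
    cdf = (below + 0.5 * ties + 0.5) / (n_reps + 1)
    return NormalDist().inv_cdf(cdf)


# Hypothetical career: (games, WAR-predicted winning percentage) per season.
seasons = [(162, 0.52), (162, 0.55), (162, 0.48), (162, 0.50), (162, 0.53)]

z_hot = null_z(450, seasons)   # about 32 wins above the WAR projection
z_avg = null_z(418, seasons)   # essentially on the WAR projection
print(f"z for 450 wins: {z_hot:.2f}; z for 418 wins: {z_avg:.2f}")
```

Because each manager is ranked against replicates that share his own seasons, games, and team quality, career length is built into the benchmark rather than handled through a sample-wide standard deviation.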
A final benefit is the realistic simulated null's freedom from dependence on the selection of any particular strategy for calculating standard errors. Where, as here, observations are clustered and reflect differing exposure levels, one or another robust standard-error estimation strategy is ordinarily selected; none of these choices, however, can be presumed to correct the model-specification errors generated by observations that are not genuinely i.i.d. (King and Roberts, 2015). Where a realistic null is simulated, a cumulative distribution function can be used to relate simulated and observed values. Such a process involves identifying where a particular observed outcome (here a manager's career-level marginal impact on team winning percentage after controlling for team WAR) falls in the distribution of simulated outcomes for that same manager. Any test statistic (whether a p, a t, or a z) derived from this information is pivotal, meaning that it remains valid even when true variance is unknown (Casella and Berger, 2024: 430–440; Efron and Hastie, 2016; Amaral et al., 2007).
Analyzed in relation to the simulated null, the observed results suggest systematic and not merely random influences. This effect is open to visual inspection in Figure 1. On average, a replication of the simulated manager careers had 26 values at or above |z| = 1.96. None of the 20,000 career-level replications produced as many as the 57 |z| ≥ 1.96 exceedances found in the observed sample; the largest number, achieved in 1 of the replications, was 49 (Figure 1.A). The probability of a proportion of exceedances this high arising by chance is roughly 1 in 45 million. These results imply that a substantial fraction of the particularly strong and particularly weak performances within the real-life sample are more plausibly viewed as a product of genuine individual differences than as haphazard chance landing points within the distribution of team WAR itself as a random variable.


Formal analysis supports this conclusion. Based on the relative densities of the observed and simulated nulls, the “False Discovery Rate” furnishes a Bayesian estimate of the proportion of observed exceedances that are attributable to systematic rather than chance influences (Efron, 2004; Storey, 2003). In this case, the global FDR for the tail regions (|z| ≥ 1.96) was 0.31—indicating that the rate of “genuine non-nulls” in those regions was 69%.
Again, the point of this analysis was not to identify which managers’ observed effects merit being labeled “truly statistically significant.” No arbitrary threshold was set for the minimal “proportion” of “true” versus “false” nulls. The goal was only to assess whether the observed sample exhibited a manager effect sharp enough to pierce the random-variable perimeter of WAR; so long as such a signal exists, Bayesian methods can be relied upon to assign it the weight that it deserves, in a manner that is more discerning than p-values and not constrained by their arbitrary characteristics (Goodman, 1999a, 1999b). In this case, the conclusion that such a signal exists is reasonably well supported.
Step 2: Formation of an estimator
Here the outcome variable is again the probability that a particular manager will win a game in a particular season based on his team's season winning percentage in that season. Team WAR per game (zWPG) is again treated as a fixed effect. But now ui, the manager's career marginal impact on the probability of winning, is estimated as a second-level, random intercept. Modeled in this manner, manager effects are latent or unobserved individual differences, which, when estimated independently of the fixed-effect team WAR predictor, naturally reflect managers’ marginal impacts relative to the sample grand mean, namely, the winning percentage of a .500 team under an “average manager.” The two-level structure of the model also naturally accommodates the clustered nature of the manager-season observations (Gelman and Hill, 2007).
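In symbols, a sketch of the two-level structure just described. The indices, \(p_{is}\), \(\beta_0\), and \(\beta_1\) are assumed notation, and \(\tau\) is the between-manager standard deviation that the text discusses in connection with the prior.

```latex
\operatorname{logit}(p_{is}) = \beta_0 + \beta_1\,\mathrm{zWPG}_{is} + u_i,
\qquad
u_i \sim \mathrm{Normal}\!\left(0,\ \tau^2\right)
```

Treating \(u_i\) as a draw from a common distribution, rather than as a free coefficient per manager, is what pulls estimates for short or noisy careers toward zero.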
Weakly informed priors centered at no effect were selected:
Ample MCMC draws were made. The model ran 6 chains with 1500 warmup and 2500 post-warmup draws each—15,000 posterior draws total.
The operations of the hierarchical Bayes model that produces these estimates, while intricate, can nevertheless be represented in terms consistent with a straightforward rendering of Bayes's Theorem. In this case, the prior estimate of all managers is posited to be a probability distribution with a mean of zero and a standard deviation of τ, which itself represents differences in manager ability as revealed in the model-fitting process (in which τ, too, is conceived of as a random variable). The data, in effect, are the managers’ observed season winning percentages after controlling for team WAR; they can reasonably be equated with the fixed-effect ui estimates of the MLE model before regularization (cf. Robinson, 1991), and take the form of probability distributions associated with each manager's measurement uncertainty. The data so conceived and the prior are then combined to form a posterior estimate. It, too, takes the form of a probability distribution (a posterior mass), the mode of which—the maximum a posteriori or MAP—is the most likely value of the manager's true skill level,
The manager MAP can be expressed either in terms of a manager's marginal impact on the winning percentage of what would otherwise be a .500 team or, more accessibly, as an estimate of the number of additional season wins (positive or negative) that such a team would experience under his direction as opposed to that of an “average” manager. The latter formulation is reported here as the manager's mw162.
A select list of the managers with the 39 highest and 39 lowest mw162s in AL/NL history appears in Table 2. Additional mw162 summaries—including ones of Hall of Fame Managers and of currently and recently active ones—appear in the Supplemental Information (SI).
Table 2 also indicates 0.95 highest-density intervals (HDIs), or Bayesian “credible intervals”: these identify the densest portion of the manager's estimated posterior mass containing 95% of its values. The posterior mass encodes uncertainty about θi. It can be used to determine the probability (or degree of belief) that the true value of θi occupies any particular range of values. The 0.95 HDI, then, identifies the portion of the range that has a 95% probability of containing the true value θi. However, the values within the range are not themselves equally probable, because the posterior distribution is not uniform; their probabilities can be determined with precision only by calculating the fraction of the posterior mass that those values themselves occupy.
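A minimal Python sketch of how an HDI and the probability of any sub-range can be read off a set of posterior draws. The draws here are generated from a hypothetical normal posterior purely for illustration; the function names and the numbers are assumptions, not output from the study's model.

```python
import math
import random


def hdi(draws, mass=0.95):
    """Return the narrowest interval that contains `mass` of the draws."""
    xs = sorted(draws)
    n = len(xs)
    k = math.ceil(mass * n)  # number of draws the interval must cover
    # Slide a window of k consecutive sorted draws and keep the shortest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def prob_between(draws, low, high):
    """Share of posterior draws (posterior mass) lying in [low, high]."""
    return sum(low <= x <= high for x in draws) / len(draws)


# Hypothetical posterior draws for one manager's skill, in wins per 162 games.
rng = random.Random(42)
draws = [rng.gauss(1.0, 0.8) for _ in range(15_000)]

low, high = hdi(draws)
inside = prob_between(draws, low, high)
above_zero = prob_between(draws, 0.0, math.inf)
print(f"0.95 HDI: [{low:.2f}, {high:.2f}]; P(skill > 0) = {above_zero:.2f}")
```

The second function is the operation the paragraph describes: the probability attached to any range of skill values is simply the fraction of the posterior mass that falls inside it.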
Table 2 also reports the data scores, that is, the MLE fixed-effect estimates for the managers. The discrepancies between these and managers’ mw162s reflect how much the fixed-effect model (1) scores were “shrunk” or discounted toward the prior estimate of zero effect. Managers who consistently out- or under-performed the mean over relatively long careers retained 65%–70% of their data scores. But many of the managers’ raw model estimates were shrunk by 50% or more. This feature of the Estimator, and the pervasive re-ranking of managers it entailed, demonstrate how heavily Bayesian updating penalizes the individual MLE model estimates as a result of the prior, the existence of between-manager variance (τ²), and the imprecision of within-manager measurements.
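The mechanics of this shrinkage can be illustrated with a normal-normal approximation on the wins-per-162 scale. This is a simplified sketch, not the hierarchical logit model fit by MCMC in the study, and the values of the between-manager standard deviation (`tau`), the data scores, and the standard errors are hypothetical.

```python
import math


def shrink(data_score, se, tau):
    """Combine a data score with a Normal(0, tau^2) prior.

    Returns the retained share of the data score, the posterior mean,
    and the posterior standard deviation.
    """
    weight = tau ** 2 / (tau ** 2 + se ** 2)
    post_mean = weight * data_score
    post_sd = math.sqrt((tau ** 2 * se ** 2) / (tau ** 2 + se ** 2))
    return weight, post_mean, post_sd


TAU = 1.5  # hypothetical between-manager SD, in wins per 162 games

# Long career: precise data score (small standard error), modest shrinkage.
long_w, long_mean, long_sd = shrink(data_score=4.0, se=1.0, tau=TAU)

# Short career: same data score but noisy, so most of it is discounted.
short_w, short_mean, short_sd = shrink(data_score=4.0, se=2.5, tau=TAU)

print(f"long career keeps {long_w:.0%} -> {long_mean:.2f} wins")
print(f"short career keeps {short_w:.0%} -> {short_mean:.2f} wins")
```

Two managers with the same raw score can therefore end up far apart in the rankings: the retained share depends on how the within-manager measurement error compares with the spread of true ability across managers.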
The Table 2 scores include ones at the tails of the overall θ distribution. John McGraw's mw162 of 6.4 (a marginal winning-percentage probability of .040) is 4.7 SDs from the mean. An |mw162| ≥ 3 is extremely rare—it occurs in less than 3.5% of the sample. About 62% of the 512 managers in the sample had mw162s between −1.0 and 1.0, and 33% between −0.5 and 0.4 (Figure 3).

Player WAR is supposed to measure the value that players deliver to their team. The best evidence that it does is its consistently powerful explanation of differences in team records. Across AL/NL history, aggregated player WAR has accounted for 83% of the variance in team winning percentages, a figure that has remained impressively uniform season over season.
When team winning percentages are regressed on team WAR and manager mWAR scores, the share of variance explained increases to 87%. That increment is about 24% of the variance in team performance left over after the impact of WAR is accounted for. Adding managers’ mWAR scores also reduces winning-percentage RMSE by 0.004, meaning that estimated total wins are about 0.65 per team closer to the mark in a 162-game season. These effects, too, have been remarkably consistent over the history of the American and National Leagues (Figure 4). The same test that validates player WAR thus validates mWAR as a measurement of team value imparted, here by managers.
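The figures in this paragraph can be checked with simple arithmetic. The sketch below uses only the rounded values reported in the text, so it reproduces the stated 24% and 0.65 approximately rather than exactly.

```python
r2_war = 0.83        # variance explained by team WAR alone (from the text)
r2_war_mwar = 0.87   # variance explained after adding mWAR (from the text)
rmse_drop = 0.004    # reduction in winning-percentage RMSE (from the text)
games = 162

# Share of the variance left unexplained by WAR that mWAR picks up
share_of_residual = (r2_war_mwar - r2_war) / (1 - r2_war)

# RMSE reduction expressed in wins over a full schedule
wins_closer = rmse_drop * games

print(f"share of leftover variance: {share_of_residual:.3f}")
print(f"wins closer per team-season: {wins_closer:.2f}")
```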

Operationalizing the mWAR estimator
Isolating consequential differences
Up to this point, the analysis has been aimed at constructing and validating the mWAR Estimator. How to use it when appraising individual managers is a distinct and more subtle question. To answer it, a reflective judgment must be made about the margin by which a manager's latent skill estimate must differ from the mean before it is viewed as meaningful. Once that judgment has been made, the Estimator can be used to apply it.
To illustrate, the individual-manager mw162 scores were assessed in relation to a consequence threshold of ± 0.012 marginal-win probability: the equivalent of .488 and .512 for a .500 team, or 2 wins or losses in a conventional 162-game schedule. On this analysis, then, mw162 skill levels between −2 and 2 are lumped together as the effect of an “average” (or effectively replacement) manager.
This form of bracketing is conventionally referred to as ROPE, or “region of practical equivalence,” testing (Kruschke, 2018). Using any manager's posterior mass, it is possible to calculate the probability that his “true managerial value” falls outside the ROPE, that is, in the |mw162| > 2 region (Figure 5). Were an analyst of the belief that the consequence threshold should be either higher or lower, he or she could adjust the mw162 ROPE accordingly. All the information necessary to set a threshold of |mw162| > 1.6, |mw162| > 2.5, or any other value, and to determine the probability that a particular manager satisfies it, can be gleaned from the managers’ posterior means and variances (included in the supporting materials).
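A ROPE calculation of this kind can be sketched from a posterior mean and standard deviation alone. The example below assumes the posterior is well approximated by a normal distribution, which the text does not state, and the manager values are hypothetical; `prob_outside_rope` is an illustrative name, not part of the Estimator.

```python
from statistics import NormalDist

GAMES = 162


def prob_outside_rope(post_mean, post_sd, rope=2.0):
    """Return (P(theta < -rope), P(theta > rope)) under a normal posterior."""
    post = NormalDist(mu=post_mean, sigma=post_sd)
    below = post.cdf(-rope)        # mass below the lower edge of the ROPE
    above = 1.0 - post.cdf(rope)   # mass above the upper edge of the ROPE
    return below, above


# The +/- 2 win threshold expressed as marginal winning percentage
print(f"threshold in winning percentage: {2 / GAMES:.3f}")

# Hypothetical manager: posterior mean 2.6 wins per 162, posterior SD 1.1
for rope in (1.6, 2.0, 2.5):
    below, above = prob_outside_rope(2.6, 1.1, rope)
    print(f"ROPE +/-{rope}: P(below) = {below:.3f}, P(above) = {above:.3f}")
```

Changing the `rope` argument is all that is needed to apply a stricter or looser consequence threshold to the same posterior.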

mWAR Estimator_p
The mWAR Estimator rates managers on their career performances, in toto or to date. A manager's score is not cumulative; the scoring of it is. That is, each season's performance is treated as data for updating the model's prior estimate of θi, which is presumed to correspond to a stable unobserved quality.
This updating can be expected to converge and (in the vast majority of cases) stabilize on a particular value. But the process takes time. How much time varies from manager to manager, but generalizations are possible. For managers whose careers last that long, the Estimator value for mw162 usually converges on a stable value within 900 games or about 5.5 modern seasons. It takes less time for it to generate a manager MAP beyond a specified critical threshold of significance: the median number of games it took the Estimator to recognize that a manager had a |mw162| of ≥ 2 was 522 games, or 3.2 modern seasons.
Those studying baseball history are unlikely to find the pace of the mWAR Estimator's learning curve to be problematic, but users who are making decisions in real time sometimes might. They will often want to make assessments within time frames more compact than those likely to yield stable estimates of θi, particularly for managers early in their careers.
A responsive strategy is to adapt the mWAR Estimator into a season-level estimator: the provisional “Estimator_p.” The Estimator_p assigns managers the mean manager-attributable share of the difference between their teams’ actual and expected wins.
Each team’s expected winning percentage (wt) is computed using zWPG, the non-manager predictor in the mWAR models (1, 2):
Estimator_p, then, simply takes this fraction of a team's residual under this manager-free, player-WAR model in any season and imputes it to the team's manager. So if a team, for example, won five more games than expected under this model, its manager would be credited with 1.2 wins; if it won five fewer, he would be charged 1.2 losses. This can be referred to as the manager’s mw162_p. As further explained in the SI, a manager’s mw162_p is a close functional approximation of the data estimate that informs the full Estimator.
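The worked example can be restated as a short calculation. In this sketch the manager-attributable share is taken to be 1.2 / 5 = 0.24, which is only what the five-game example implies; the expected-win totals are hypothetical inputs, since the expected-winning-percentage formula is not reproduced here.

```python
# Share implied by the text's example (1.2 wins credited per 5-win residual).
# Treat it as an assumption of this sketch, not a parameter quoted from the model.
MANAGER_SHARE = 1.2 / 5


def manager_credit(actual_wins, expected_wins, share=MANAGER_SHARE):
    """Wins (positive) or losses (negative) imputed to the manager for one season."""
    residual = actual_wins - expected_wins  # team over- or under-performance
    return share * residual


# Hypothetical team expected to win 81 games under the player-WAR model
print(round(manager_credit(86, 81), 2))  # five wins above expectation
print(round(manager_credit(76, 81), 2))  # five wins below expectation
```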
The single-season manager effects generated by Estimator_p are sensible on their face (Figure 6). That is to say, in part, mw162_ps are modest—much more so, certainly, than estimates that assign to managers the entire effect of what's left over after team performance is accounted for (e.g., Jaffe, 2010; James, 1997: 149–152). The scores are reasonably in line with those generated by the full mWAR Estimator—although again they are more akin to the full Estimator data than to its posterior estimates.

They are also reliable. Table 3 presents the Estimator_p scores for the last two seasons. The rankings across the seasons display a substantial level of agreement (Kendall's W = 0.68). Three managers finished in the top six in both 2024 and 2025. Stephen Vogt topped the list in both. This result fortifies the conclusion that the mWAR Estimator_p can reasonably be relied upon pending accumulation of a more robust assessment of a manager's latent skill via the full mWAR Estimator.
2024-2025 mw162p scores. mWAR Estimator_p estimates in wins above average per 162 games (mw162p) for managers in 2024-2025 seasons. For partial seasons, mw162p score is pro-rated to proportion of team-games managed.
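For readers unfamiliar with the agreement statistic cited above, Kendall's W can be computed directly from rank sums. The sketch below handles untied rankings only, and the two seasons of ranks are hypothetical placeholders rather than the Table 3 values.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    rankings: list of m lists, each a permutation of the ranks 1..n
              assigned to the same n items (here, managers).
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2  # expected rank sum per item
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m**2 * (n**3 - n))


# Hypothetical ranks for six managers in two consecutive seasons
ranks_season_1 = [1, 2, 3, 4, 5, 6]
ranks_season_2 = [1, 3, 2, 6, 4, 5]
print(round(kendalls_w([ranks_season_1, ranks_season_2]), 3))
```

W runs from 0 (for two rankings, complete reversal) to 1 (identical rankings). With exactly two untied rankings it equals (1 + ρ)/2, where ρ is Spearman's rank correlation.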
What makes the mWAR Estimator_p provisional is the expectation that it will be supplanted by the full mWAR Estimator after a sufficient number of games. The Estimator_p's rapid, data-based assessment of manager ability makes it a sensible strategy for mitigating the inconvenience of the full Estimator's deliberate learning curve. But ultimately the full Estimator's Bayesian machinery is required to connect season-by-season data to a latent quality of managerial acumen. Those interested in making accurate appraisals of this disposition will thus want to switch to reliance on the full mWAR Estimator score for guidance as soon as feasible—certainly no later than the completion of a manager's third full season.
That said, user goals will be heterogeneous. Users with a practical stake in shaping or predicting team performance will attach significant value to the greater discernment of the full Estimator. But ordinary fans, and the outlets that serve them, are likely to be more interested in making comparative assessments of manager performance season over season. This is not what the full mWAR Estimator produces; it is designed to estimate a stable, unobserved skill level, using season performance as data only. For that reason, the Estimator_p might actually furnish a better fit than the full Estimator for pure entertainment assessments of the sort that figure in online and media “leaderboards.” A comprehensive list of mw162ps can be accessed via the SI.
Calibration for decision-theory loss functions
But consequential managerial skill is actually not as rare as the individual-level estimates might make it seem. Because in every season there are at least as many managers as teams, the Estimator acquires many times more information about the skill levels of managers as a whole (30 times as much in today's MLB) than it does about the skill level of any individual manager. Accordingly, the Estimator will be able to form a much more confident conclusion that a fraction of the manager population meets a particular skill threshold than that any identified individual manager does.
Figure 7 illustrates that point in relation to the |mw162| ≥ 2 threshold. When the Estimator is applied to the manager population as a whole, the expected number of managers who meet it is 206: 103 with mw162 > 2 and 103 with mw162 < -2. This is over 40% of the sample as a whole (Figure 7).

Manager population distribution of θ. Derived by count of |mw162| > 2 across 7.68 million MCMC draws. The areas on either side of the gray band constitute the portions of the distribution outside of it. Approximately 20% of the sample would be expected to have mw162 > 2, and 20% < -2.
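The contrast between population-level and individual-level conclusions can be sketched with posterior draws. The example below uses synthetic draws (not the paper's MCMC output) and far fewer of them than the 7.68 million reported; the 0.95 confidence level for flagging an individual is likewise an assumption made for illustration.

```python
import random

random.seed(0)
N_MANAGERS = 512
N_DRAWS = 400  # hypothetical; chosen only to keep the example fast


def expected_count_outside(draws, rope=2.0):
    """Average, over posterior draws, of the number of managers with |theta| > rope."""
    per_draw = [sum(abs(t) > rope for t in draw) for draw in draws]
    return sum(per_draw) / len(per_draw)


def count_confident_individuals(draws, rope=2.0, level=0.95):
    """Number of managers whose own posterior puts >= level mass outside the ROPE."""
    n_draws = len(draws)
    confident = 0
    for i in range(len(draws[0])):
        share = sum(abs(draw[i]) > rope for draw in draws) / n_draws
        if share >= level:
            confident += 1
    return confident


# Synthetic posterior: each manager gets a hypothetical mean and SD
means = [random.gauss(0, 1.5) for _ in range(N_MANAGERS)]
sds = [random.uniform(1.0, 2.5) for _ in range(N_MANAGERS)]
draws = [[random.gauss(mu, sd) for mu, sd in zip(means, sds)] for _ in range(N_DRAWS)]

expected = expected_count_outside(draws)
confident = count_confident_individuals(draws)
print(f"expected managers outside ROPE: {expected:.1f}")
print(f"individually confident cases:   {confident}")
print(f"reported share, 206 of 512:     {206 / 512:.3f}")
```

The first quantity sums each manager's probability of lying outside the ROPE, so it can be large even when very few managers individually clear a high confidence bar.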
The gap between the number of specific individual managers identified as outside the |mw162| ≤ 2 ROPE and the sample-wide incidence of that level of θ creates a practical difficulty for those (GMs, gamblers, professional prognosticators) seeking to use the Estimator to make decisions. Should they be guided only by the ROPE estimates of individual managers when making judgments of consequence? If they adopt this stance, they will necessarily be excluding from their consideration a potentially much larger number who genuinely possess the level of skill (or ineptitude) that they regard as consequential. But if in response such users expand the zone of consideration, they must endure a correspondingly higher risk of error in individual cases. What is the optimal balance of these considerations?
The mWAR Estimator uses Bayesian statistical methods because they dominate alternatives in predictive error reduction. They achieve this result by optimizing the tradeoff between bias and variance, accepting slightly more of the former in order to reduce the latter by an amount that generates more predictive accuracy on net (Efron and Morris, 1973). Mathematicians often say that Bayesian estimators thus have the lowest “loss function,” meaning that their point-estimate MAPs miss the predictive target by the smallest margin, measured in terms of mean squared error or MSE (James and Stein, 1961).
But this understanding of “loss” reflects an understandably statisti-centric outlook. Conventional Bayesian estimation most effectively reduces loss if what one is trying to maximize is accuracy simpliciter.
A user of a Bayesian estimator, however, might be trying to maximize something that does not perfectly align with error reduction in this sense (Berger, 1985). This situation is especially likely to be true where the user disvalues certain errors more than others, either because of their asymmetric costs or because of the user's risk preferences (Elliott and Timmermann, 2008, 2016; Patton and Timmermann, 2007).
Whether to retain a manager might fall into this class. Both good and bad managers can be consequential, the analyses so far confirm. But managers who are consequentially good or bad are rare; they make up a smaller than normal fraction of the overall manager population, whose ranks are bloated by managers of average ability.
The practical upshot of this distribution of latent managerial skill arguably supports viewing errors in manager retention as asymmetric. On this account, holding on too long to a potentially poor manager is more costly than releasing and replacing him prematurely. The former mistake results in palpable costs for the team. In contrast, the most likely effect (by far) of the latter mistake is the substitution of one average manager for another: allowed more time, the prematurely released manager would most likely have proven himself to be average in ability—which is the most likely value of his replacement as well.
The opposite is true at the other end of the spectrum. That is, prematurely replacing a potentially good manager is worse than holding on to him too long. The former mistake costs the team the services of the rare manager who would have netted it additional wins. The latter mistake, in contrast, is most likely to be costless or nearly so: the manager who fails to rise to the ranks of the elite will most likely be revealed as average; and again, average is the expected value of any replacement.
This asymmetry can be amplified by the tournament structure of baseball-revenue distribution. Hitting a threshold in wins can result in outsize returns associated with post-season playoff berths; the incremental cost of additional losses when teams fall short of that threshold is close to zero (Gennaro and Beattie, 2013). Under this “winner,” or select winners, “take all” payout system (Frank and Cook, 1996), teams can be expected to adopt toward manager hiring and retention a stance that is risk-preferring with respect to wins and risk-averse with respect to losses (Becker and Huselid, 1992; Lee and Lee, 2004). That is, they will quickly dispense with the services of a potentially poor manager, but stick it out (take a flier) on the manager they perceive to have a high but as-yet unrealized upside.
This pattern (risk aversion toward poor performers, risk preference toward relatively neutral ones) is familiar in corporate management retention. Econometricians have long observed that boards do not process information in a purely Bayesian fashion (understood in MSE terms) but instead react more quickly to evidence that a CEO is performing worse than average than they do to evidence that a CEO is not performing above average (Jenter and Lewellen, 2021). One explanation is that failing to take heed of the former type of information exposes firms to greater risks, and denies them smaller opportunities for potential upside gain, than failing to take heed of the latter.
The distinction between error reduction simpliciter and optimal decisionmaking generally can be straightforwardly addressed. To handle it, a Bayesian estimator can be calibrated to a decision theory loss function that maximizes utility in terms of the user's choosing (Berger, 1985). The mWAR Estimator can be adjusted in exactly this way. Figure 8 illustrates how recalibrating the Estimator to reflect the front-office loss function described above would alter the Estimator's assessments of manager latent skill. In it, the Estimator has been programmed to treat errors asymmetrically. Erroneously under-estimating a manager who does have a true mw162 ≥ 2—that is, mistakenly short-changing him, and assigning him a mw162 < 2—is two times more costly than erroneously identifying him as a manager who does possess a mw162 ≥ 2 when he does not. The same for a manager who has a true mw162 ≤ -2: the penalty is two times more severe for incorrectly scoring him as possessing a mw162 > -2 than for incorrectly scoring him as below that threshold.

When the Estimator is made to respect this loss function rather than an MSE one, it adjusts the mw162 score of every manager in accord with the resulting expected error cost, as determined by both the size of the penalty and the fraction of his posterior mass that crosses either of the two asymmetric error thresholds. This operation doesn’t just arbitrarily change the mw162s of a handful of managers who otherwise would fall just short of the upper threshold or just barely exceed the lower one. Rather it nudges some closer to the critical thresholds; pushes some others across one or the other; and relocates still other managers (ones with posterior means only marginally beyond the |mw162| ≥ 2 thresholds) even farther out of harm's way. It should be noted, though, that the impact is not uniform or monotonic; the degree to which mw162s are recomputed is sensitive to the information in variance levels unique to individual managers. The result is a picture of manager skill level comprehensively redrawn to match the user's own utility. More information on the formal properties of this loss function and how the mWAR Estimator was adjusted to implement it can be found in the SI.
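The logic of a 2:1 asymmetric penalty can be sketched as a simple threshold decision. This is not the score adjustment documented in the SI: it is a standard expected-loss classification rule, it assumes a normal posterior, and the manager values and function names are hypothetical.

```python
from statistics import NormalDist


def flag_upper(post_mean, post_sd, threshold=2.0, miss_cost=2.0, false_alarm_cost=1.0):
    """Decide whether to treat a manager as having theta >= threshold.

    miss_cost:        loss for scoring a true theta >= threshold as below it
    false_alarm_cost: loss for scoring a true theta < threshold as at or above it
    Returns (flagged, probability that theta >= threshold).
    """
    p_above = 1.0 - NormalDist(post_mean, post_sd).cdf(threshold)
    loss_if_flagged = false_alarm_cost * (1.0 - p_above)
    loss_if_not_flagged = miss_cost * p_above
    return loss_if_flagged < loss_if_not_flagged, p_above


def flag_lower(post_mean, post_sd, threshold=-2.0, **costs):
    """Mirror-image rule for theta <= threshold, obtained by reflecting the posterior."""
    return flag_upper(-post_mean, post_sd, threshold=-threshold, **costs)


# Same posterior mean, different precision (hypothetical managers)
print(flag_upper(1.6, 1.2))                 # asymmetric 2:1 costs
print(flag_upper(1.6, 1.2, miss_cost=1.0))  # symmetric costs
print(flag_upper(1.6, 0.5))                 # tighter posterior, 2:1 costs
```

With costs of 2 and 1, the rule flags a manager whenever more than one third of his posterior mass lies beyond the threshold, instead of more than one half under symmetric costs. Because the decision depends on posterior spread as well as posterior mean, two managers with the same point estimate can be treated differently.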
Adjusting the Estimator loss function in this manner, then, ameliorates the dilemma associated with the gap between individual-level estimates and the manager-population estimates of θ (Table 2, Figure 7). As discussed, the gap is a direct consequence of the difference in the volume of information the mWAR Estimator acquires about the population of managers, on the one hand, and the volume it acquires about individual managers, on the other. This disparity implies that there is a rich vein of talent in the vicinity of the mw162 ≥ 2 cutoff, and a deep chasm of ineptitude in the vicinity of the mw162 ≤ −2 one. Modifying the Estimator's default MSE-loss function at those thresholds maximizes the probability that users will successfully mine the former and avoid plummeting into the latter.
Necessarily, though, this adjustment also tolerates a greater level of risk that some managers will be misclassified. But because the costs of those errors can be expected to be negligible, and because MLB's tournament payoff structure generates asymmetric risk preferences with respect to managers proximate to the mw162 −2 and +2 thresholds, a utilitarian loss function of this kind dominates the MSE-minimization one that guides the full mWAR Estimator when run without adjustment (Elliott and Timmermann, 2016).
Or at least it might. Whether to make such an adjustment, and how much risk of error to tolerate in exchange, are matters of judgment—ones that should be informed by still more empirical analysis.
Any utility-maximizing loss function identified in this manner can be substituted for the default MSE-minimization one that otherwise guides the Estimator. The amenability of the mWAR Estimator to a user-defined decision-theory loss function magnifies what users can learn about manager potential, especially early on in their careers.
Concluding observations
What the data say
So do managers matter today? Did they ever?
The study presented here suggests the answers are yes, and yes.
The evidence consisted of the records of 512 managers in relation to the performances of their teams as predicted by aggregate player WARs. Using Bayesian methods, the data were used to form a manager-value estimator. Capable of discerning differences in latent managerial skill, the mWAR Estimator indicated that a sizable fraction of the managerial population (in the vicinity of 40% of it) has over the history of the American and National Leagues added or subtracted at least 2 wins per season relative to the contribution made by their teams’ players.
The manager impact that the mWAR Estimator detected, moreover, was not peculiar to any particular era. Managers of consequence—positive and negative—are found over all MLB time periods. This includes a cohort of active and recently active managers whose careers started concurrently with or well after the advent of modern analytics (including Fredi González, Craig Counsell, Dave Roberts, Bruce Bochy, and Stephen Vogt).
Just as important, it was shown that the mWAR Estimator admits of ready use for practical decisionmaking. It can detect reliable indications of above- and below-average latent managerial skill within several seasons. Moreover, the information it yields can be fashioned into interim strategies that enable reasonable assessments to be made even more quickly—including after two or three seasons early on in a manager's career. Finally, it can be calibrated to yield adjusted estimates tailored to the unique loss functions and risk preferences of those with a stake in hiring and retaining managers.
But are the estimator results really true?
The mWAR estimates, while seemingly plausible as a whole, feature numerous surprises at the individual level. Certain legendary managers, such as Sparky Anderson (mw162 = 0.8; 118th overall) and Earl Weaver (0.2; 194th), are buried in the pack. Other seemingly undistinguished ones such as Cecil Cooper (2.7; 13th) and Steve O’Neill (2.9; 10th) rank unexpectedly high. Does the conflict between these outcomes and the “eye test” (Burroughs, 2020) cast doubt on the mWAR Estimator?
The answer should be a resounding no (Paine, 2014b). One can’t observe the trait of managerial acumen directly; that is why the development of a manager-value estimator is necessary. A valid estimator is supposed to be normative for what we see; making what we see normative for the validity of the estimator is a classic example of confirmation bias (Nickerson, 1998).
The credibility of the mWAR Estimator or any other scheme for assessing manager-value depends on a critical appraisal of the system's methods, not its results. We cannot trust our eyes; but we can and must trust our reason. Are the methods that such an instrument employs, and inferences drawn from it, sound?
Even if the answer is yes, however, it does not follow that the Estimator results should be afforded the status of “truth.” Indeed, the question whether the Estimator outcomes are “true” is ill-formed and misconceives the basic philosophy of knowledge acquisition that animates it.
Just as the Estimator uses Bayesian methods to continuously adjust its probabilistic assessments based on new evidence, so a meta-Bayesian orientation should be applied to determine the effect one gives the Estimator's own results. Rather than passively deferring to them as final verdicts of managerial acumen, one should treat mw162s as evidence, and use them to update one's prior beliefs based on the weight that critical reflection suggests these scores are due. Indeed, this posture of provisionality is ingrained in Estimator posteriors themselves, which as probability distributions anticipate their own going-forward status as priors amenable to revision based on whatever additional evidence subsequently emerges.
When one must appraise and (in the case of front office executives and gamblers) act, it is sensible to treat one's current best-reasoned understanding as if it were true. But only for now, and with no expectation that one will not change one's mind when new data arrive. In the Bayesian universe, there are no singing fat ladies.
Beyond baseball managers
There is no reason why these methods cannot be generalized. Individual player WAR, for example, cumulates raw performance outputs and neglects Bayesian or comparable estimation methods geared to measuring latent skill (Brill and Wyner, 2024).
The real question: Not how much managers matter—but how much more they could
The oft-stated claim that analytics has rendered contemporary managers irrelevant essentially gets things backwards. Analytics, this paper suggests, not only show that managers matter; they can also be used to make managers an even more consequential aspect of the game. Front offices can employ the measures and protocols developed in this paper to help them more reliably identify the managers most likely to improve their teams’ performances. Future researchers, moreover, can, in the normal course of scholarly exchange, be expected to devise additional measures that complement or even supersede the estimator developed here. In sum, there is no reason for the effect of analytics on measuring the proficiency of those who make real-time strategic decisions during the game to fall short of the impact it has had on measuring the quality of those who play it.
Supplemental Material
sj-docx-1-san-10.1177_22150218261452597 - Supplemental material for mWAR: A Bayesian estimator of manager value
Supplemental material, sj-docx-1-san-10.1177_22150218261452597 for mWAR: A Bayesian estimator of manager value by Dan M. Kahan in Journal of Sports Analytics
Footnotes
Acknowledgements
I wish to thank John Ruggiero, Michael Schell, and Abraham Wyner for incisive comments, and participants in the 2026 Sloan-MIT Sports Analytics Conference for helpful feedback, on an earlier draft.
Ethical approval/informed consent
The reported study did not involve human subjects.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
