Abstract
Penalty shootouts in association football are sometimes criticized by fans and pundits as an imperfect tie-breaking procedure. In this study, we use a Bayesian model to assess whether shootouts are governed more by skill or by chance. Using a dataset of knockout matches from twelve recent European club seasons, we fit a hierarchical logistic model with random effects for shooters, goalkeepers, and teams, together with a within-shootout latent autoregressive state that captures evolving pressure. The model is fitted by Hamiltonian Monte Carlo. The framework quantifies the amount of skill involved in shootouts as the proportion of logit-scale variance attributable to persistent heterogeneity, as opposed to idiosyncratic and state noise. We also compare the full specification to nested alternatives via PSIS-LOO, stacking, and decision-oriented scores computed from leave-one-shootout-out predictive distributions. Empirically, persistent individual effects are found to be small: the posterior
Introduction
In association football (also known as soccer in North America, Australia and some other parts of the world), the penalty shootout is the tie-breaking procedure used in knockout matches when the score remains level after regulation time and extra time (where applicable). Penalty shootouts decide a nontrivial fraction of high-stakes football matches. For instance, in the dataset we consider, 9.54% of all knockout matches marked as
A quick overview of the penalty shootout system in football is warranted here. 1 In the shootout, players from the two teams take penalty kicks (i.e., shots from the penalty spot, typically marked 12 yards from the goal line and centered between the goalposts) at the same goal in alternating sequence (ABAB format), with the opposing goalkeeper trying to prevent a goal. On the face of it, the setup of the shootout can be considered fair for both teams. Only the players still on the field at the final whistle are eligible to take kicks: if one team has fewer eligible players (due to a red card or injury), the other team must reduce its number to match, so that both sides have the same number of kickers. A coin toss determines which team kicks first and at which goal the shootout takes place. The teams then alternate for up to five attempts each, with the contest ending early if one side attains an unassailable lead. If the score is still level after five kicks apiece, the procedure continues in sudden-death pairs until one team scores and the other fails. Each eligible player must take one kick before any player is permitted a second. All kicks are taken under the Laws of the Game. 2
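The stopping rules described above can be written as a short function. The following is a minimal sketch, not the authors' implementation: the function name `shootout_winner` and its arguments are hypothetical, team A is assumed to kick first in every round, and the rule on player eligibility is ignored.

```python
from typing import Optional, Sequence


def shootout_winner(kicks_a: Sequence[int], kicks_b: Sequence[int],
                    n_regular: int = 5) -> Optional[str]:
    """Return 'A', 'B', or None (still undecided) for an ABAB shootout.

    kicks_a[i] and kicks_b[i] are 1 (scored) or 0 (missed) for each team's
    kick in round i; team A is assumed to kick first in every round.
    """
    goals_a = goals_b = 0
    for rnd in range(max(len(kicks_a), len(kicks_b))):
        # Team A's kick in this round.
        if rnd >= len(kicks_a):
            return None
        goals_a += kicks_a[rnd]
        if rnd < n_regular:
            left_a = n_regular - rnd - 1   # regular kicks A still has
            left_b = n_regular - rnd       # regular kicks B still has
            if goals_a > goals_b + left_b:
                return "A"                 # unassailable lead
            if goals_b > goals_a + left_a:
                return "B"
        # Team B's kick in this round.
        if rnd >= len(kicks_b):
            return None
        goals_b += kicks_b[rnd]
        if rnd < n_regular:
            left = n_regular - rnd - 1     # regular kicks left for each side
            if goals_a > goals_b + left:
                return "A"
            if goals_b > goals_a + left:
                return "B"
        elif goals_a != goals_b:
            # Sudden death: decided once a completed pair of kicks differs.
            return "A" if goals_a > goals_b else "B"
    return None


# Three goals against three misses already settles a best-of-five.
print(shootout_winner([1, 1, 1], [0, 0, 0]))
# Level after five kicks each, decided in the second sudden-death pair.
print(shootout_winner([1, 1, 1, 1, 1, 0, 1], [1, 1, 1, 1, 1, 0, 0]))
```

Returning `None` for an unfinished sequence keeps the function usable on partial records, for example when a listing is incomplete.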
Several studies (see Hurley, 2005; Lopez and Schuckers, 2017; Wood et al., 2015; Wunderlich et al., 2020, among others) have discussed whether shootouts in different sports depend primarily on skill or are games of chance. In studying this problem in football, two key modeling challenges arise. First, the data are clustered at multiple levels: kicks are nested within shootouts and are executed by specific shooters facing specific goalkeepers, who represent teams with specific playing styles and strategies. Ignoring this structure conflates idiosyncratic kick-level noise with persistent heterogeneity attributable to players or preparation. Second, the sequence of kicks exhibits temporal and strategic features (for instance, the shooting order as well as the pre-kick score differential) that plausibly induce serial dependence beyond measured covariates. Analyses that treat kicks as independent Bernoulli trials, or that aggregate over sequences, are therefore ill-suited to infer the contribution of skill relative to chance.
We address these issues with a Bayesian hierarchical framework that models each kick-level outcome via a logistic link with additive random effects for shooters, goalkeepers, and teams, and augments the linear predictor with a latent autoregressive (AR) component that evolves within a shootout. This specification captures persistent ability while accommodating serial correlation consistent with evolving pressure or momentum. The shootout result is treated as a deterministic function of the kick sequence under the Laws of the Game, ensuring that inference follows the observational unit that generates the outcome of interest. The procedure is explained in Section “Methodology” below, whereas the data and the analysis are presented in Section “Results”. Before closing the paper, we discuss some important implications of the research in Section “Managerial implications” and present our concluding remarks on the generalizability and future scope of our method in Section “Concluding remarks”.
As we shall see below, our contribution in this work is two-fold. Methodologically, we provide an integrated design that (i) separates persistent heterogeneity from idiosyncratic noise through partial pooling, (ii) permits within-shootout dependence via a latent AR state, and (iii) yields transparent variance decompositions on the logit scale that quantify the share of variation attributable to skill versus chance. Empirically, the framework supports out-of-sample evaluation against independent and identically distributed benchmarks and nested alternatives, enabling a direct assessment of whether the data are more consistent with a lottery or with systematic differences in ability. As a by-product of our approach, we can also estimate the shooting efficiency of different players and thereby deduce the best kicking order for a team. Collectively, these elements deliver an interpretable and testable answer to the substantive question of whether shootouts are skill-based. Although a few earlier studies have assessed this question, they do not capture all of the above components, which we combine in a single Bayesian procedure. A succinct account of the existing literature is provided in the next section.
A brief review of relevant literature
One of the most important branches of the literature in this domain focuses on the concept of “first mover advantage.” Taking inspiration from economic studies, especially since the seminal work of Apesteguia and Palacios-Huerta (2010), there has been considerable attention in this direction for shootout-like games. Several researchers (see, e.g., Arrondel et al., 2019; Feri et al., 2013; Jordet et al., 2012; Santos, 2023; Vandebroek et al., 2018) have discussed whether psychological pressure has any impact on the outcome of shootouts in sports, and whether the second movers suffer from it. The conclusions are mixed, with some finding evidence in favor of this while most argue against it. Indeed, the last two studies mentioned above report no significant difference between the winning probability of the first mover and the second mover in penalty shootouts in football. Earlier, Kocher et al. (2012) discussed this in detail by extending the study of Apesteguia and Palacios-Huerta (2010) and demonstrated that the conclusion changes with a larger sample. Cohen-Zada et al. (2018) established the same result in the context of tie-breaks in tennis and showed that the advantage of serving first is not present there. In another recent study, by analyzing a large number of penalty shootouts in football, Pipke (2025) showed that there is no first mover advantage in penalty shootouts, which is in line with what we also find in our analysis.
In connection with this aspect, it is important to note that research has also been carried out to develop theoretically fair designs for the ordering of kicks in penalty shootouts, which would remove the first mover advantage, if it exists. For example, Rudi et al. (2020) discussed how much the ordering of the shots matters in a shootout-like game and, accordingly, how best to construct the sequence so that the game becomes fair. A similar analysis through a mathematical lens was carried out by Lambers and Spieksma (2021). The reader is further referred to the works of Csató and Petróczy (2022) and Vollmer et al. (2024), who discussed the same problem from different directions.
We now turn to the extant literature on analyzing players’ skills and capabilities in taking or saving penalties. In one of the earliest works, McGarry and Franks (2000) proposed a strategy of ranking the players and deciding the penalty-taking order accordingly, so as to maximize the chance of winning the shootout. Jordet et al. (2007) analyzed penalty kicks from the FIFA World Cup, European Championships, and Copa America between 1976 and 2004, and found that psychological pressure has a bigger impact on the success rate than skill, physiology or chance. Meanwhile, Baumann et al. (2011) demonstrated that more skillful players have a higher degree of specialization in taking penalty kicks, but that this has neither an adverse nor a beneficial impact on their success rate, which can be used to argue in favor of the idea that shootouts are more of a game of chance. In a similar vein, borrowing information from previous data and psychological aspects, statistical analyses of the best strategies for penalty shootouts were conducted by multiple researchers, e.g., Bar-Eli and Azar (2009), Memmert et al. (2013), and Brinkschulte (2025). Based on these studies, there is general agreement that skill matters in penalty shootouts. As we shall demonstrate later, along with identifying the role of chance in the shootout, our proposed approach also helps in quantifying the shooting ability of the penalty-takers as well as the saving ability of the goalkeepers.
Methodology
Before presenting the formal notation, it is useful to summarize the logic of the framework in intuitive terms. We model each penalty kick as a binary event whose success probability depends on three broad sources of variation. First, observed covariates related to the situations, teams, or competitions capture the immediate match context. Second, random effects for shooters, goalkeepers, and teams represent persistent heterogeneity, which we interpret as the stable component of skill. Third, a latent within-shootout autoregressive term allows the probability of success to evolve over the sequence of kicks, thereby capturing transient dependence that may reflect pressure, momentum, or other short-run psychological effects. Any remaining variability, together with the baseline uncertainty implied by the logistic link, is interpreted as idiosyncratic variation or chance. In this way, the model provides a structured decomposition of shootout outcomes into context, persistent skill, and transient randomness.
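To make this three-part logic concrete, the sketch below simulates scoring probabilities for one shootout and computes a logit-scale share of variance due to persistent effects. It is an illustration under stated assumptions rather than the paper's specification: all function and argument names are hypothetical, covariates are collapsed into a single intercept, the latent state is taken to be a stationary AR(1) process, any over-dispersion term is omitted, and the parameter values are invented.

```python
import math
import random


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_kick_probs(intercept, shooter_fx, keeper_fx, team_fx,
                        rho, sigma_state, rng):
    """Scoring probabilities for the successive kicks of one shootout.

    The three lists give the shooter, goalkeeper, and team effect that
    applies to each kick; the latent state is a stationary AR(1) process
    (requires |rho| < 1).
    """
    state = rng.gauss(0.0, sigma_state / math.sqrt(1.0 - rho ** 2))
    probs = []
    for t in range(len(shooter_fx)):
        if t > 0:
            state = rho * state + rng.gauss(0.0, sigma_state)
        eta = intercept + shooter_fx[t] + keeper_fx[t] + team_fx[t] + state
        probs.append(inv_logit(eta))
    return probs


def persistent_share(sd_shooter, sd_keeper, sd_team, rho, sigma_state):
    """Share of logit-scale variance due to persistent random effects."""
    persistent = sd_shooter ** 2 + sd_keeper ** 2 + sd_team ** 2
    state = sigma_state ** 2 / (1.0 - rho ** 2)  # stationary AR(1) variance
    residual = math.pi ** 2 / 3.0                # standard logistic variance
    return persistent / (persistent + state + residual)


# Illustrative values only; an intercept of 1.1 is roughly logit(0.75).
rng = random.Random(2025)
n_kicks = 10
shooters = [rng.gauss(0.0, 0.2) for _ in range(n_kicks)]
keeper_a, keeper_b = rng.gauss(0.0, 0.2), rng.gauss(0.0, 0.2)
# Team A shoots on even kicks (facing keeper B), team B on odd kicks.
keepers = [keeper_b if t % 2 == 0 else keeper_a for t in range(n_kicks)]
teams = [0.1 if t % 2 == 0 else -0.1 for t in range(n_kicks)]
probs = simulate_kick_probs(1.1, shooters, keepers, teams, 0.5, 0.3, rng)
print([round(p, 3) for p in probs])
print(round(persistent_share(0.2, 0.2, 0.1, 0.5, 0.3), 3))
```

Because the logistic residual variance is fixed at about 3.29, modest random-effect standard deviations translate into a small persistent share, which is the intuition behind reading that share as a measure of skill.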
Let us formally set some notations now. Throughout this article, wherever used,
Our main objective is to analyze penalty shootouts
Model specification
For each
In the above,
Bayesian estimation
We adopt a fully Bayesian approach and fit the model with Hamiltonian Monte Carlo (HMC) as implemented in Stan (Carpenter et al., 2017). All analyses were run in
We emphasize that these prior choices are weakly informative on the logit scale while still regularizing extreme parameter values that are difficult to identify from sparse shootout data. In particular, the Gaussian prior for the fixed effects is broad enough to allow substantial covariate effects on the odds scale, while discouraging unrealistically large coefficients. For the standard deviation parameters, the chosen half-
To derive the complete posterior density in a clean form, let us first define the data vector
Further, for ease of explanation, let us index all the kicks in the data by
We next incorporate the concept of Pólya-Gamma augmentation (hereafter abbreviated as
To find the conditional posterior of the latent state vector, we note that for the shootout in the
Next, for each observation, we can get the conditional posterior of the Pólya-Gamma variables as
For the variance components, on the other hand, we first recall that the half-
Finally, turn attention to the autoregressive coefficient
For the main analysis, we run four independent NUTS chains with warm-up for step-size and mass-matrix adaptation, target acceptance
Inference, model comparison, and interpretation
For inference, we rely on the posterior samples of the quantities of interest obtained via the Bayesian computation described in the previous section. We present the results based on the posterior means of the parameters, corresponding standard errors, posterior standard deviations, posterior medians and 95% equal-tailed credible intervals. Since the key aim is to differentiate between skill and chance, we first distinguish between “persistent skill,” “idiosyncratic variation,” and “chance,” three terms we shall use repeatedly in the remainder of the article. By persistent skill, we mean stable heterogeneity attributable to shooters, goalkeepers, and teams, as represented by the random effects in the model. By idiosyncratic variation, we mean kick-specific unexplained variability that is not tied to persistent entities. We use the term chance in a broader sense to refer to the combined contribution of idiosyncratic and transient components, including the latent within-shootout state and the logistic residual variance. Thus, while the terms noise and chance are related, the former is used in a narrower statistical sense, whereas the latter refers generally to the part of the outcome not explained by persistent skill. Accordingly, in model (2), the random effect terms corresponding to shooters, goalkeepers, and teams can be attributed to skill. In terms of chance, we recognize that on the logit scale, the latent-variable representation of the logistic model implies a residual variance
A brief remark on the fixed variance term
As the first candidate model, we remove the state
In the next model (
Further, to isolate only potential momentum or psychological carryover within a shootout, we remove all random effects and retain the latent within-shootout component with AR(1) structure. This model (hereafter called
Finally, as a parsimonious benchmark, we suppress all random effects and retain only measured covariates and the over-dispersion term:
We find it imperative to point out that the competing models are carefully chosen to peel away the structure of
Coming to the aspect of comparison, to do it in a way that respects within-shootout dependence, we evaluate out-of-sample performance by leaving out one shootout at a time and refitting implicitly via Pareto-smoothed importance sampling leave-one-out cross-validation, or the PSIS-LOO approach (Vehtari et al., 2017). For implementation in
We report elpd differences relative to the best model and their standard errors using the usual delta method. We also compute stacking weights
Furthermore, because a shootout winner is a deterministic function of the penalty sequence, we form the leave-one-out predictive win probability by propagating kick-level probabilities through a Monte Carlo mapping. For the
It is worth emphasizing that these comparison criteria target related but not identical predictive tasks. The PSIS-LOO approach evaluates how well a model predicts the full sequence of kick outcomes, aggregated at the shootout level, and therefore rewards a good description of the entire data-generating process. Stacking weights summarize which models contribute most to the best leave-one-out predictive mixture under that same objective. In contrast, Brier score, log-score, and classification accuracy are based on the leave-one-out predictive probability of the final shootout winner, which is a path-dependent functional of the kick sequence. Consequently, it is possible for one model to describe kick-level likelihoods better, while another performs better at predicting the winner. Such differences should therefore be interpreted as reflecting distinct predictive targets rather than as contradictions across evaluation methods.
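The winner-level quantities discussed above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, team A is assumed to kick first, kicks are drawn independently given the supplied probabilities, and the sign convention for the log score (higher is better) is an assumption. In a full analysis the kick-level probabilities would themselves come from leave-one-shootout-out posterior draws.

```python
import math
import random


def simulate_winner(p_a, p_b, rng, n_regular=5, max_rounds=1000):
    """Play one ABAB shootout; return True if team A (kicking first) wins.

    p_a[i] and p_b[i] are round-i scoring probabilities; rounds beyond the
    supplied lists reuse the last value (an assumption of this sketch).
    """
    goals_a = goals_b = 0
    for rnd in range(max_rounds):
        pa = p_a[min(rnd, len(p_a) - 1)]
        pb = p_b[min(rnd, len(p_b) - 1)]
        goals_a += rng.random() < pa
        if rnd < n_regular:
            left_b = n_regular - rnd          # B's regular kicks still to come
            if goals_a > goals_b + left_b:
                return True
            if goals_b > goals_a + left_b - 1:
                return False
        goals_b += rng.random() < pb
        if rnd < n_regular:
            left = n_regular - rnd - 1        # regular kicks left for each side
            if goals_a > goals_b + left:
                return True
            if goals_b > goals_a + left:
                return False
        elif goals_a != goals_b:              # sudden death
            return goals_a > goals_b
    raise RuntimeError("shootout did not terminate")


def win_probability(p_a, p_b, n_sims=20000, seed=0):
    """Monte Carlo estimate of P(team A wins)."""
    rng = random.Random(seed)
    return sum(simulate_winner(p_a, p_b, rng) for _ in range(n_sims)) / n_sims


def brier_score(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes, eps=1e-12):
    """Mean log predictive probability of the observed winner."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p if y == 1 else 1.0 - p)
    return total / len(probs)


def accuracy(probs, outcomes, threshold=0.5):
    hits = sum((p > threshold) == (y == 1) for p, y in zip(probs, outcomes))
    return hits / len(probs)


# Invented example: predicted win probabilities for team A in three
# shootouts, with outcomes coded 1 if A won.
preds = [win_probability([0.78] * 5, [0.72] * 5),
         win_probability([0.70] * 5, [0.75] * 5),
         win_probability([0.75] * 5, [0.75] * 5)]
observed = [1, 0, 1]
print(brier_score(preds, observed), log_score(preds, observed),
      accuracy(preds, observed))
```

A forecaster that always predicts 0.5 has a Brier score of 0.25, which gives a natural reference point for the random baseline.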
To wrap up the discussion in this section, we revisit the skill versus chance question. Under our framework, a model that meaningfully captures persistent skill should deliver higher elpd, better Brier and log scores than a random baseline model, and stacking weights that concentrate on specifications with nonzero random-effect variances. Conversely, if elpd differences are small, stacking allocates substantial weight to the simplest models, and the Brier or log-scores are comparable to the random baseline, this provides quantitative evidence that, conditional on observed context, shootout outcomes are largely indistinguishable from chance.
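As an illustration of what a "small" elpd difference means in practice, the sketch below computes a paired elpd difference and a standard error from per-shootout log predictive densities. It is one common paired estimate, assumed here for illustration; the function name and the numbers are hypothetical, and it is not claimed to reproduce the exact computation used in the paper.

```python
import math


def elpd_difference(pointwise_a, pointwise_b):
    """Paired elpd difference (model A minus model B) and its standard error.

    Inputs are per-shootout log predictive densities, aligned so that entry i
    of both lists refers to the same held-out shootout.
    """
    if len(pointwise_a) != len(pointwise_b):
        raise ValueError("models must be scored on the same shootouts")
    n = len(pointwise_a)
    diffs = [a - b for a, b in zip(pointwise_a, pointwise_b)]
    total = sum(diffs)
    mean = total / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return total, math.sqrt(n * var)


# Invented numbers: a difference that is small relative to its standard
# error gives little reason to prefer the richer model.
full_model = [-5.6, -6.1, -4.9, -7.2, -5.8]
baseline = [-5.7, -6.0, -5.1, -7.1, -5.9]
diff, se = elpd_difference(full_model, baseline)
print(round(diff, 3), round(se, 3))
```

Working with per-shootout rather than per-kick contributions keeps the comparison consistent with leaving out one shootout at a time.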
Results
Data and exploratory analysis
We assemble a kick-level panel of penalty shootouts from European club football. The raw data are extracted from the Transfermarkt website. 3 The cleaned and processed data are publicly available in a GitHub repository maintained by the corresponding author. 4 The dataset provides match details for leagues and cups from 14 countries in Europe, along with the European club competitions Champions League, Europa League and Conference League, for the last 12 seasons (we consider all matches until the last completed season, i.e., 2024/25). Within this set, we focus on the knockout matches in the domestic and European cups, since penalty shootouts do not apply to domestic leagues or group matches. Further, the “Penalty shoot-out” sections of the corresponding Transfermarkt match URLs are parsed to ensure consistency and accuracy of the data. For each match, the lineups are extracted to map each shooter to a team and to identify the opposing goalkeeper. In the process, name variants are resolved and players are de-duplicated across matches. Finally, we exclude matches lacking a full shootout listing or with conflicting entries, and retain all complete shootouts with unambiguous per-kick records. This leads to an overall collection of 2356 penalty kicks from 230 matches during the mentioned period. The overall conversion rate is approximately 75.1%.
To better understand the distribution of penalty kicks, Figure 1 presents the number of penalties per season in the top left panel and the number of shootouts per season in the bottom left panel. These plots give a slight indication that the number of matches ending in shootouts is generally increasing over the years. However, the conversion rates have largely remained around 75%, except for 2018, when the rate dropped to an uncharacteristically low value of 67.2%. From the top right panel of the same figure, we note that the number of penalties per shootout has mostly stayed around 9 or 10, although a large proportion of shootouts went to sudden death. Numerically, the number of penalties per shootout has both mean and median around 10, about 80% of the shootouts end within 12 penalties, and the maximum number of penalties recorded in our dataset for one shootout is 22. If we further look at the conversion rate for different penalty kicks in the sequence (refer to the bottom right panel of Figure 1), there is a decreasing trend, which aligns with the idea that the best penalty-takers typically go in the first part of the sequence (McGarry and Franks, 2000).

Figure 1. (Top left) Distribution of penalty kicks for different seasons, along with the overall conversion rates. (Bottom left) Number of shootouts across the seasons. (Top right) Number of matches corresponding to number of kicks in the shootouts for the entire data. (Bottom right) Conversion rates at different points of the sequence of the shootouts.
In our main analysis, we use four key categorical covariates: shooting order to capture first mover advantage, pre-kick score status, stage, and competition type. Further, as discussed in Section “A brief review of relevant literature”, team strength is often found to be a determinant behind their success rates in penalty shootouts. While we plan to capture this primarily through the random effect
Table 1 below summarizes the conversion rates across different labels of the four categorical covariates. At a descriptive level, conversion does not appear to differ by shooting order: kicks taken first in a round convert at 74.9% versus 75.3% for the team shooting second. We also perform an analysis of variance (ANOVA) test to detect a significant difference between the two samples, and in this case the
Table 1. Summary of conversion rate according to different labels of the covariates.
The
Main analysis
We start with a summarization of the posterior samples obtained from the proposed hierarchical logistic model (
Posterior summaries for the proposed model (
The intercept is positive, with posterior median
Turning attention to the standard deviations of the random effects, we notice that they are modest on the logit scale:
While the above points towards shootouts being a game of chance, because an exceedingly large latent scale can mask genuine structure in fixed and random effects, our inferential strategy does not rely on
Comparison of the models using
The grouped PSIS-LOO comparison ranks
These two findings are not contradictory; they reflect different targets. Grouped
As a final piece of discussion, we compute the

Overall, for the central question—skill versus chance—three quantitative signals emerge. First, across models that retain persistent heterogeneity by defining random effects of shooter, keeper, or team, the gains in decision-level predictive accuracy over the random baseline are small in absolute terms (e.g., improvements in Brier score are around
Model diagnostics
Our key conclusions in the previous section are primarily based on the output of
Next, acknowledging that our analysis relies on the Bayesian implementation with the aforementioned prior specifications, we find it relevant to assess the robustness of our main conclusions to the choices of the priors. Here, we present a brief yet focused prior sensitivity analysis around the baseline setting used in the main model. Specifically, we vary three classes of hyperparameters while keeping the likelihood and the model structure unchanged. First, to examine the effect of stronger or weaker regularization on the fixed effects, we change the Gaussian prior scale
Sensitivity of key posterior summaries in model
Note: Reported are posterior medians with 95% credible intervals in parentheses.
We can observe that the main substantive conclusions of the paper are broadly robust to the range of prior specifications, although the magnitude of some posterior summaries naturally changes when the priors are made substantially tighter or looser. Throughout, the posterior
As another diagnostic step, we note that the posterior computation was stable across all reported parameters. In particular, the split-

Posterior predictive check based on conversion rates by kick number. The shaded band represents the 95% interval for the median under
A by-product: Ranking of shooters and goalkeepers
A natural by-product of the hierarchical specification in
Specifically, for each posterior draw, we rank shooters by their effect

Posterior ranks and credible intervals of the top 20 penalty takers (top) and the top 20 penalty stoppers (bottom) according to the proposed model.
We notice that a few famous players like Bruno Fernandes (captain of Manchester United during 2024–25), Robert Lewandowski (played at the top level in both Bundesliga and La Liga), Ousmane Dembélé (2025 Ballon d’Or winner), Martin Ødegaard (captain of Arsenal during 2024–25), Mohamed Salah (Liverpool’s key player during 2017–2025) and Pierre-Emerick Aubameyang (Arsenal star during 2018–2022) appear in the list of top penalty takers, with Fernandes leading the pack by a substantial margin. Numerically, these shooters displayed posterior means for
Although the above is an interesting by-product of our main analysis, these rankings should be interpreted with caution. First, the rank uncertainty is substantial for both shooters and goalkeepers. This wide dispersion reflects limited per-player shootout exposure (each shooter has fewer than 10 attempts in our dataset, while the number of attempts against a keeper is at most 20) and the intended regularization of partial pooling. Further, many players have probabilities of being above average close to 0.5. Accordingly, the rankings should not be viewed as definitive statements about a stable underlying hierarchy of penalty-taking ability. Rather, they are best understood as probabilistic summaries that identify players whose posterior mass lies somewhat above or below the population average, while explicitly reflecting the substantial uncertainty inherent in shootout data. These results also align with our main findings: persistent individual effects exist but are modest relative to contextual and idiosyncratic variation. Overall, the rankings are informative only insofar as they help identify players whose posterior mass lies above average. In the next subsection, we show how such an analysis may partially help in deducing an optimum order of penalty takers for specific matches.
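The rank summaries discussed in this subsection can be computed from posterior draws along the following lines. This is a minimal sketch with hypothetical names and synthetic draws: it assumes the player effects are centered at zero, so that "above average" means a positive effect, and it uses crude empirical quantiles for the rank interval.

```python
import random


def rank_summaries(draws):
    """Summarize posterior ranks; draws[s][j] is player j's effect in draw s.

    Rank 1 is the largest effect within a draw. The probability of being
    above average is taken as the share of draws with a positive effect.
    """
    n_draws, n_players = len(draws), len(draws[0])
    ranks = [[] for _ in range(n_players)]
    above = [0] * n_players
    for draw in draws:
        order = sorted(range(n_players), key=lambda j: -draw[j])
        for position, j in enumerate(order, start=1):
            ranks[j].append(position)
        for j in range(n_players):
            above[j] += draw[j] > 0.0
    summaries = []
    for j in range(n_players):
        r = sorted(ranks[j])
        lower = r[int(0.025 * (n_draws - 1))]   # crude 2.5% rank quantile
        upper = r[int(0.975 * (n_draws - 1))]   # crude 97.5% rank quantile
        summaries.append({"mean_rank": sum(r) / n_draws,
                          "rank_interval": (lower, upper),
                          "p_above_avg": above[j] / n_draws})
    return summaries


# Synthetic draws: five players with small true differences and wide
# posterior spread, mimicking sparse per-player data.
rng = random.Random(7)
true_effects = [0.3, 0.1, 0.0, -0.1, -0.3]
synthetic = [[m + rng.gauss(0.0, 0.25) for m in true_effects]
             for _ in range(2000)]
for j, summary in enumerate(rank_summaries(synthetic)):
    print(j, summary)
```

Ranking within each draw, rather than ranking posterior means, is what propagates the uncertainty in the effects into the rank intervals.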
Optimum order for shootout: A case study
It is a fair assumption that a team would choose their best five penalty takers for a shootout. Thus, in this section, we are going to see how the fitted hierarchical model
For illustration, we pick one of the most famous shootouts of 2025: the UEFA Champions League Round of 16 match between Real Madrid and Atlético Madrid, two Spanish giants who faced each other in the second leg on 13th March 2025. After the two-legged fixture ended level on aggregate, the teams moved to a penalty shootout that was eventually won by Real Madrid. The original sequence of the penalty shootout and how it unfolded in favor of a Real Madrid victory are provided in Table 5. In the same table, we also show the sequence recommended by the above-mentioned procedure. Note that the information on the fifth penalty-taker of Atlético is not available in the data and therefore, we work with four shooters for their team in this case study.
Table 5. Original sequence (outcome given in parentheses) and suggested sequence (probability of above average conversion rate given in parentheses) of penalties in the shootout between Real Madrid and Atlético Madrid on 13th March 2025.
Using the match-specific covariate vector for this tie, the posterior shooter abilities are found to be tightly clustered around zero for both teams. For Real Madrid, all five shooters have posterior means within
Two key implications follow from this exercise. First, when player-level posteriors are tight, the sequencing edge arises less from “who is best overall” and more from “who fits best in each round’s context”. Second, and more importantly, the resulting uplift in win probability is minimal, suggesting that even with an optimum order, player-specific contributions in a shootout are not substantial. The case study should therefore be viewed primarily as an illustration of how the fitted model can be translated into a decision-support tool, rather than as evidence of large practically exploitable gains. Consistent with our main findings, it implies that such recommendations may be useful only at the margin and should be interpreted together with their associated uncertainty.
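An ordering exercise of this kind can be sketched by brute force over the 120 possible orders of five shooters. The code below is a simplified stand-in, not the procedure used in the case study: the names are hypothetical, kicks are treated as independent (so the latent within-shootout state is ignored), a level score after the regular kicks is treated as a coin flip, and the probability table is invented.

```python
from itertools import permutations


def goal_distribution(probs):
    """Distribution of total goals from independent kicks."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)
            new[k + 1] += mass * p
        dist = new
    return dist


def win_prob_regular(probs_a, probs_b):
    """P(A ahead after the regular kicks) + 0.5 * P(level)."""
    da, db = goal_distribution(probs_a), goal_distribution(probs_b)
    win = sum(x * y for i, x in enumerate(da) for j, y in enumerate(db) if i > j)
    tie = sum(x * y for i, x in enumerate(da) for j, y in enumerate(db) if i == j)
    return win + 0.5 * tie


def best_order(conv, probs_b):
    """conv[j][r] is the probability that shooter j scores in round r.

    Returns the order (shooter index per round) with the highest win
    probability, together with that probability.
    """
    def value(perm):
        return win_prob_regular([conv[j][r] for r, j in enumerate(perm)],
                                probs_b)
    best = max(permutations(range(len(conv))), key=value)
    return best, value(best)


# Invented probabilities: rows are shooters, columns are rounds 1 to 5.
conv = [[0.80, 0.79, 0.78, 0.76, 0.74],
        [0.77, 0.77, 0.77, 0.77, 0.77],
        [0.76, 0.75, 0.75, 0.74, 0.70],
        [0.74, 0.74, 0.73, 0.73, 0.73],
        [0.73, 0.72, 0.72, 0.72, 0.72]]
opponent = [0.75] * 5
order, prob = best_order(conv, opponent)
print(order, round(prob, 4))
print(round(win_prob_regular([conv[j][j] for j in range(5)], opponent), 4))
```

When the rows of the table are close to one another, as the tightly clustered posteriors suggest, the gap between the best order and a default order is necessarily small, in line with the minimal uplift reported above.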
Managerial implications
This study proposes a hierarchical, leave-one-shootout-out validated framework for penalty shootouts that decomposes variation in kick outcomes into observed context, persistent skills attributable to individuals and teams, and an idiosyncratic component. The comparative evidence indicates that the player-level effects are present but modest, team-level heterogeneity helps stabilize likelihood fit, and a simple state-dependence term can aid winner prediction; yet the dominant share of variation at the level of individual kicks remains idiosyncratic. The managerial question, therefore, is not whether skill exists, but how to act optimally when skill differentials are small relative to noise and when the decision problem is path-dependent.
From a coaching perspective, the model suggests treating shootouts as high-variance environments. Rather than relying on a deterministic list of the best five shooters (potentially derived from raw conversion rates), teams should adopt probability-weighted shortlists that acknowledge posterior uncertainty and the match context. Because the skill differences among takers are typically modest, an effective ordering may place two or three above-average shooters early to mitigate adverse momentum, reserve one reliable option for kick five or the onset of sudden death, and evaluate alternatives through fast simulation under the fitted model and match-specific covariates. Player evaluation also benefits from such a model-based approach. Posterior means and rank intervals for shooter and goalkeeper effects provide a principled procedure that tempers the small-sample volatility of raw percentages. In terms of training, the dominance of the idiosyncratic term implies that investments which reduce situational variance can pay larger dividends than marginal searches for outlier talent. Emphasis on stable pre-kick routines, pressure inoculation, and clear role assignment is consistent with the data-generating process our model uncovers. For goalkeepers, preparation should prioritize anticipatory cues and opponent-specific tendencies, translated into concise decision aids. In-game and roster strategy should reflect the same logic. Substitutions made solely to access an alleged penalty specialist warrant caution unless supported by meaningful posterior evidence and adequate attempt histories. Otherwise, expected gains are often smaller than perceived.
We further note from our analysis that, conditional on observed context, kicks from the mark are largely chance-dominated, with only modest and unstable gains from persistent individual skill. This motivates exploring tie-break formats that (i) mitigate procedural biases and (ii) place more weight on team-level or dynamic, open-play skills that our models indicate are more stable. For example, within the shootout framework, instead of ABAB, teams may take the penalties in an ABBA pattern, which has also been suggested by Anbarci et al. (2015) and Da Silva and Matsushita (2024). On the other hand, Carrillo (2007) earlier proposed a framework in which the shootout is carried out before extra time and argued that it may make the game of football more interesting. Another possibility is a one-versus-one dribble shootout, as in field hockey. Here, each attempt starts with an attacker in possession a few meters from the goal, and the objective is to beat the goalkeeper and score within a short time limit (e.g., 8 seconds). Compared with a static penalty, this format rewards ball-carrying, feints, and decision-making under pressure, and lets keepers express timing and positioning skill in a dynamic setting. In terms of our modeling, it should increase the role of player-specific random effects and transient within-sequence dynamics (in both, we should see stronger signals) and should lessen the dominance of the random error that overwhelms spot-kick outcomes.
The golden goal format used in various sports can also be explored. 6 It is a tie-breaking method in which the match continues into extra time under the usual laws until the first goal is scored, and that goal decides the winner. In football, this rule (and its variant, the silver goal) was in place from 1993 to 2004. Brocas and Carrillo (2004) discussed how this rule or its modification can make football more exciting. Based on our findings on the relevance of skill in penalty shootouts, the governing bodies may look at this rule again and consider its variants. For instance, the teams may play an additional brief period (e.g., 10 min) with a smaller team size (say, 7-versus-7), which would enlarge the space per player, thereby increasing the probability of an open-play goal. Such rules would shift resolution toward coordinated team actions and set-play execution, and would better align with our model’s evidence that team-level structure is more reliable than persistent player-specific effects.
From a design perspective, ABBA is perhaps the least disruptive option and the easiest to implement. The other possibilities mentioned above are stronger interventions that explicitly tilt the tie-break toward skills that our results identify as more informative and less noise-dominated. We highlight that any such option can be evaluated ex ante against the current format within our framework, by simulating sequences under the proposed rules and comparing the expected log predictive density as well as various scoring rules.
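To make the idea of simulating sequences under alternative rules concrete, the following is a minimal Python sketch and not the implementation used in this paper. It compares the ABAB and ABBA orders under a deliberately simple data-generating process: every kick scores with a common probability `p_score`, reduced by `lag_penalty` when the kicker's team is trailing at the moment of the kick. Both parameter values, the function names, and the trailing-team mechanism itself are hypothetical assumptions chosen for illustration; in an actual evaluation the kick probabilities would be drawn from the fitted posterior, and the summary would be the expected log predictive density or a scoring rule and not the raw win rate reported here.

```python
import random


def kick_order(pattern, n_kicks):
    """Return the team ("A" or "B") taking each of the first n_kicks kicks."""
    units = {"ABAB": "AB", "ABBA": "ABBA"}
    if pattern not in units:
        raise ValueError(f"unknown pattern: {pattern!r}")
    unit = units[pattern]
    return [unit[i % len(unit)] for i in range(n_kicks)]


def simulate_shootout(pattern, rng, p_score=0.75, lag_penalty=0.05, max_rounds=50):
    """Simulate one shootout and return the winning team, "A" or "B".

    p_score and lag_penalty are hypothetical values, not estimates from the
    paper. A kicker whose team is trailing at the moment of the kick scores
    with probability p_score - lag_penalty, otherwise with probability p_score.
    """
    order = kick_order(pattern, 2 * max_rounds)
    score = {"A": 0, "B": 0}
    taken = {"A": 0, "B": 0}
    for rnd in range(max_rounds):
        for team in order[2 * rnd: 2 * rnd + 2]:
            other = "B" if team == "A" else "A"
            p = p_score - (lag_penalty if score[team] < score[other] else 0.0)
            score[team] += int(rng.random() < p)
            taken[team] += 1
            if rnd < 5:
                # Best-of-five phase: stop once a lead cannot be overturned.
                if score[team] > score[other] + (5 - taken[other]):
                    return team
                if score[other] > score[team] + (5 - taken[team]):
                    return other
        if rnd >= 5 and score["A"] != score["B"]:
            # Sudden death: decided once a completed pair leaves scores unequal.
            return "A" if score["A"] > score["B"] else "B"
    # Safeguard for the sketch only; practically never reached.
    return rng.choice(["A", "B"])


def first_kicker_win_rate(pattern, n_sims=20000, seed=0, **kwargs):
    """Monte Carlo estimate of the probability that team A (kicking first) wins."""
    rng = random.Random(seed)
    wins = sum(simulate_shootout(pattern, rng, **kwargs) == "A" for _ in range(n_sims))
    return wins / n_sims


if __name__ == "__main__":
    for pattern in ("ABAB", "ABBA"):
        print(pattern, round(first_kicker_win_rate(pattern), 3))
```

With `lag_penalty=0.0` the two teams are exchangeable, so the estimate should be close to one half under either order. This gives a quick sanity check before any format-specific mechanism is switched on.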
Concluding remarks
To summarize, this article proposed a first-of-its-kind hierarchical Bayesian framework for penalty shootouts that allows an evaluation of skill and chance. The model is compared with nested alternatives using various metrics. Overall, the empirical evidence points to a chance-dominated environment, as the model allocates a negligible share of variance to persistent skill. In comparison, the team-only model finds essentially no stable club signal, and the best winner prediction is obtained by a simple state-dependent specification without any player-specific effects. We also find that the first-mover advantage is not statistically significant in penalty shootouts, in contrast to several other studies in this domain (e.g., Huang et al., 2026; van Hemert and van der Kamp, 2026; Vandebroek et al., 2018). Broadly, our results align with those of Pipke (2025) and Wunderlich et al. (2020), while providing a more in-depth understanding of the outcome variability. We emphasize that our conclusion is supported jointly by the variance decomposition, the PSIS-LOO comparison, the stacking weights, and the decision-oriented predictive scores. While these pieces of evidence discourage strong claims about enduring individual superiority, they still support small, actionable differences that can be exploited at the margin (e.g., the ordering of penalty-takers in a shootout).
Several extensions naturally follow. An empirical extension would be to validate the key findings for international tournaments. Methodologically, our model treats the individual effects as time-invariant and exchangeable; it does not allow for learning, aging, or player-keeper interaction terms. A natural future direction is therefore to allow for learning or adaptation over time. In the current model, player-specific effects are treated as stable latent traits, but in practice these abilities may evolve across seasons as players accumulate experience, change clubs, or face repeated high-pressure situations. One possible extension would be to let the player-specific random effects vary dynamically over seasons or calendar time, for example through random-walk or autoregressive state equations. Another possibility would be to include explicit measures of prior shootout exposure, such as the cumulative number of previous shootout attempts or saves, as predictors of current performance. Such formulations would make it possible to distinguish short-run adaptation from long-run ability and to assess whether repeated exposure to shootouts reduces the apparent role of chance. This would broaden the framework from a static decomposition of skill versus randomness to a dynamic analysis of how skill is acquired, retained, or eroded over time. It would also be interesting to consider richer state processes (e.g., hidden Markov models or score-dependent AR terms). From a design standpoint, our approach can be adapted to evaluate alternative tie-break procedures under counterfactual rules, reporting the expected log predictive density and decision scores ex ante.
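As an illustration of the dynamic extension described above, one possible specification is sketched below. The notation is introduced here purely for exposition and is an assumption, not the notation of the fitted model: kick $i$ is taken by player $k(i)$ against goalkeeper $g(i)$ in season $s(i)$, $z_i$ denotes the within-shootout latent state, and $n_i$ is the number of shootout attempts taken by player $k(i)$ before kick $i$.

```latex
% Hypothetical notation, not taken from the paper's model section.
% u_{k,s}: effect of player k in season s, evolving as a random walk.
% gamma:   coefficient on prior shootout exposure n_i.
\begin{align*}
  y_i \mid p_i &\sim \mathrm{Bernoulli}(p_i), \\
  \operatorname{logit}(p_i) &= \alpha + u_{k(i),\,s(i)} + v_{g(i)} + z_i
      + \gamma \log(1 + n_i), \\
  u_{k,1} &\sim \mathcal{N}(0, \sigma_u^2), \\
  u_{k,s} &= u_{k,s-1} + \eta_{k,s}, \qquad
      \eta_{k,s} \sim \mathcal{N}(0, \tau_u^2), \quad s \ge 2.
\end{align*}
```

Under this sketch, setting $\tau_u = 0$ and $\gamma = 0$ gives a time-invariant player effect $u_{k,s} = u_{k,1}$, so the static case is nested within the dynamic one. Replacing the random walk with $u_{k,s} = \phi\, u_{k,s-1} + \eta_{k,s}$, $|\phi| < 1$, would give the autoregressive alternative.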
We conclude with a discussion of the generalizability of our method. The proposed framework is modular and readily portable beyond football shootouts. At its core is a Bernoulli–logit observation model, with partially pooled random effects for persistent heterogeneity, and a low-dimensional latent state for within-sequence dynamics. Each component can be adapted to new settings, for example, other tie-break mechanisms in sports, or sequential responses that are multinomial or ordered. We recommend conducting similar assessment studies for other sports. Although direct numerical benchmarks are not readily comparable across sports, we believe that the estimated
Footnotes
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The author serves on the Editorial Board of the Journal of Sports Analytics as an Associate Editor and will not have any involvement in the editorial handling or the evaluation process of the paper in any way. There are no other competing interests to declare that are relevant to the contents of this article.
Data availability statement
The raw data used in the main analysis of this paper were extracted from the Transfermarkt website (link: https://www.transfermarkt.co.uk/). The cleaned and processed data, along with the R code, are publicly available in a GitHub repository maintained by the author (link:
).
