
Abstract

Penalty shootouts in association football are sometimes criticized by fans and pundits as an imperfect tie-breaking procedure. In this study, we use a Bayesian model to analyze whether shootouts are governed more by skill or by chance. Using a representative dataset from twelve recent European seasons, we fit a hierarchical logistic model with appropriate random effects and a within-shootout latent autoregressive state to capture evolving pressure. The model is implemented via Hamiltonian Monte Carlo. Our proposed framework allows us to quantify the amount of skill involved in shootouts by the proportion of logit-scale variance attributable to persistent heterogeneity versus idiosyncratic and state noise. We also compare the full specification to nested alternatives via PSIS-LOO, stacking, and decision-oriented scores computed from leave-one-shootout-out predictive distributions. Empirically, persistent individual effects are found to be small: the posterior $SkillShare$ is near zero in the full model, suggesting that shootouts are primarily chance-dominated. As a by-product of our approach, we show how the proposed model can also be used to rank the shooters as well as to optimize penalty-taking orders. We also discuss a few alternative tie-breaking procedures as future recommendations, which can be evaluated in a similar modeling framework.

Keywords

Bayesian modeling; Hamiltonian Monte Carlo; mixed effects model; OR in sports; soccer analytics

Introduction

In association football (also known as soccer in North America, Australia and some other parts of the world), the penalty shootout is the tie-breaking procedure used in knockout matches when the score remains level after regulation time and extra time (wherever applicable). Penalty shootouts decide a nontrivial fraction of high-stakes football matches. For instance, in the dataset we consider, 9.54% of all knockout matches marked as Final, Semi-Finals and Quarter-Finals had to be decided by penalty shootouts. Shootouts are not unique to football: several other sports (e.g., hockey, water polo, handball, rugby) use them in various tournaments, following different rules. Although the rules of these shootouts differ (see Csató, 2021, for a detailed discussion of penalty shootout designs in football), there is a long-running debate about whether their outcomes primarily reflect persistent differences in ability or resemble a lottery, i.e., whether a shootout is simply a game of chance. Simple summaries such as overall conversion rates or team-level success frequencies are suggestive but cannot separate stable skill from randomness, nor can they account for evolving pressure and strategic context within a shootout. In this article, we develop a principled statistical model that attributes variation to identifiable, persistent sources while also allowing for transient, within-shootout dynamics, and thereby helps us assess whether the outcomes of penalty shootouts are functions of skill or are equivalent to tossing a coin.

A quick overview of the penalty shootout system in football is warranted here.¹ In the shootout, players from the two teams take penalty kicks (i.e., shots from the penalty spot, marked 12 yards from the goal line and centered between the goalposts) at the same goal in alternating sequence (ABAB format), with the opposing goalkeeper trying to prevent a goal. On the face of it, the setup of the shootout can be considered fair for both teams. Only the players still on the field at the final whistle are eligible to take the kicks: if one team has fewer eligible players (due to a red card or injury), the other team must reduce its number to match, so that both sides have the same number of kickers. A coin toss determines which team kicks first and at which goal the shootout will take place. The teams then alternate up to five attempts each, with the contest ending early if one side attains an unassailable lead. If the score is still level after five apiece, the procedure continues in sudden-death pairs until one team scores and the other fails. Each eligible player must have taken one kick before any player is permitted a second. All kicks are taken under the Laws of the Game.²

Several studies (see Hurley, 2005; Lopez and Schuckers, 2017; Wood et al., 2015; Wunderlich et al., 2020, among others) have discussed whether shootouts in different sports depend primarily on skill or are games of chance. In studying this problem in football, two key modeling challenges arise. First, the data are clustered at multiple levels: kicks are nested within shootouts and are executed by specific shooters facing specific goalkeepers, who represent teams with specific playing styles and strategies. Ignoring this structure conflates idiosyncratic kick-level noise with persistent heterogeneity attributable to players or preparation. Second, the sequence of kicks exhibits temporal and strategic features (for instance, the shooting order as well as the pre-kick score differential) that plausibly induce serial dependence beyond measured covariates. Analyses that treat kicks as independent Bernoulli trials, or that aggregate over sequences, are therefore ill-suited to infer the contribution of skill relative to chance.

We address these issues with a Bayesian hierarchical framework that models each kick-level outcome via a logistic link with additive random effects for shooters, goalkeepers, and teams, and that augments the linear predictor with a latent autoregressive (AR) component that evolves within a shootout. This specification captures persistent ability while accommodating serial correlation consistent with evolving pressure or momentum. The shootout result is treated as a deterministic function of the kick sequence under the Laws of the Game, ensuring that inference follows the observational unit that generates the outcome of interest. The procedure is explicated in Section “Methodology” below, whereas the data and the analysis are presented in Section “Results”. Before closing the paper, we discuss some important implications of the research in Section “Managerial implications” and present our concluding remarks on the generalizability and future scope of our method in Section “Concluding remarks”.

As we shall see below, our contribution in this work is two-fold. Methodologically, we provide an integrated design that (i) separates persistent heterogeneity from idiosyncratic noise through partial pooling, (ii) permits within-shootout dependence via a latent AR state, and (iii) yields transparent variance decompositions on the logit scale that quantify the share of variation attributable to skill versus chance. Empirically, the framework supports out-of-sample evaluation against independent and identically distributed benchmarks and nested alternatives, enabling a direct assessment of whether the data are more consistent with a lottery or with systematic differences in ability. As a by-product of our approach, we can also obtain the shooting efficiency of different players and thereby deduce the best kicking order for a team. Collectively, these elements deliver an interpretable and testable answer to the substantive question of whether shootouts are skill-based. Although a few earlier studies have addressed this question, they do not capture all of the above-mentioned components, which we combine in a single, statistically principled Bayesian procedure. A succinct account of the existing literature is provided in the next section.

A brief review of relevant literature

One of the most important branches of literature in this domain focuses on the concept of “first mover advantage.” Taking inspiration from economic studies, and especially since the seminal work of Apesteguia and Palacios-Huerta (2010), shootout-like games have received considerable attention in this direction. Several researchers (see, e.g., Arrondel et al., 2019; Feri et al., 2013; Jordet et al., 2012; Santos, 2023; Vandebroek et al., 2018) have discussed whether psychological pressure has any impact on the outcome of shootouts in sports, and whether the second movers suffer from it. The conclusions are mixed, with some studies finding evidence in favor of this effect while most argue against it. Indeed, the last two studies mentioned above report no significant difference between the winning probability of the first mover and the second mover in penalty shootouts in football. Earlier, Kocher et al. (2012) discussed this in detail by extending the study of Apesteguia and Palacios-Huerta (2010) and demonstrated that the conclusion changes with a larger sample. Cohen-Zada et al. (2018) reached the same conclusion in the context of tie-breaks in tennis, showing that there is no advantage to serving first. In another recent study, by analyzing a large number of penalty shootouts in football, Pipke (2025) showed that there is no first mover advantage in penalty shootouts, which is in line with what we also find in our analysis.

In connection with this aspect, it is important to note that research has also been carried out to devise theoretically fair orderings of kicks in penalty shootouts, which would take away the first mover advantage, if it exists. For example, Rudi et al. (2020) discussed how much the ordering of the shots matters in a shootout-like game and, accordingly, how best to arrange the sequence so that the game becomes fair. A similar analysis through a mathematical lens was carried out by Lambers and Spieksma (2021). The reader is further referred to the works of Csató and Petróczy (2022) and Vollmer et al. (2024), who discussed the same problem in varied directions.

We now turn to the extant literature on analyzing players’ skills and capabilities in taking or saving penalties. In one of the earliest works, McGarry and Franks (2000) proposed a strategy of ranking the players and accordingly deciding the penalty-taking order that should maximize the chance of winning the shootout. Jordet et al. (2007) analyzed penalty kicks from the FIFA World Cup, European Championships, and Copa America between 1976 and 2004, and identified that psychological pressure has a bigger impact on the success rate than skill, physiology or chance. Meanwhile, Baumann et al. (2011) demonstrated that more skillful players have a higher degree of specialization in taking penalty kicks, but that this has neither an adverse nor a beneficial impact on their success rate, which can be used to argue in favor of the idea that shootouts are more of a game of chance. In a similar thread, borrowing information from previous data and psychological aspects, statistical analyses of the best strategies for penalty shootouts were conducted by multiple researchers, e.g., Bar-Eli and Azar (2009), Memmert et al. (2013), Brinkschulte (2025). Based on these studies, we see that there is general agreement that skill matters in penalty shootouts. As we shall demonstrate later, along with identifying the role of chance in the shootout, our proposed approach also helps in quantifying the shooting ability of the penalty-takers as well as the saving ability of the goalkeepers.

Methodology

Before presenting the formal notation, it is useful to summarize the logic of the framework in intuitive terms. We model each penalty kick as a binary event whose success probability depends on three broad sources of variation. First, observed covariates related to the situations, teams, or competitions capture the immediate match context. Second, random effects for shooters, goalkeepers, and teams represent persistent heterogeneity, which we interpret as the stable component of skill. Third, a latent within-shootout autoregressive term allows the probability of success to evolve over the sequence of kicks, thereby capturing transient dependence that may reflect pressure, momentum, or other short-run psychological effects. Any remaining variability, together with the baseline uncertainty implied by the logistic link, is interpreted as idiosyncratic variation or chance. In this way, the model provides a structured decomposition of shootout outcomes into context, persistent skill, and transient randomness.

We now formally set up the notation. Throughout this article, wherever used, $\mathbb{N}, \mathbb{Z}, \mathbb{R}$ denote the set of natural numbers, the set of integers, and the set of real numbers, respectively. The notation $I(A)$ is used for the indicator function of the event $A$, i.e., it takes the value 1 when $A$ is satisfied and 0 otherwise. The symbol $I_{n}$ will be used to denote the identity matrix of order $n \times n$, but we may omit the subscript if the order is clear from the context. Similarly, $0$ will indicate a vector of all zeros. The abbreviation ‘‘iid,’’ as is ubiquitous in the statistics literature, will be used to indicate ‘‘independent and identically distributed’’ variables.

Our main objective is to analyze penalty shootouts $P_{m}$, indexed by matches $m = 1, \dots, M$. As we shall see later in Section “Data and exploratory analysis”, $M = 230$ in our application. Within the shootout $P_{m}$, penalties occur in sequence, indexed by $t = 1, \dots, \tau_{m}$ (including sudden death, if applicable). Let $Z_{m, t} \in \{0, 1\}$ denote the binary outcome of the $t$th kick in the $m$th shootout ($1$ indicates scored, $0$ indicates missed or saved), whereas the corresponding identities of the shooter, goalkeeper, and team are denoted by $s(m, t) \in \{1, \dots, S\}$, $g(m, t) \in \{1, \dots, G\}$, and $v(m, t) \in \{1, \dots, V\}$, respectively. In addition, let $X_{m, t} \in \mathbb{R}^{p}$ denote an appropriate set of observed covariates which may impact the conversion probability. For our application, this is explained in more detail in Section “Data and exploratory analysis”. The key response variable is the shootout-level result $Y_{m} \in \{0, 1\}$ (with 1 indicating that the team shooting first in the shootout won, and 0 otherwise), which is deterministically induced by the sequence $Z_{m} = (Z_{m, 1}, \dots, Z_{m, \tau_{m}})$ under the Laws of the Game. In our proposed procedure, we focus on modeling $Z_{m}$ and treat $Y_{m}$ as a derived quantity.
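To make the deterministic map from the kick sequence to the shootout result concrete, the following is a minimal sketch under the rules summarized in the Introduction (ABAB order, up to five kicks each with early termination, then sudden-death pairs). The function name `shootout_result` and its return convention are illustrative assumptions, not part of the authors' implementation.

```python
def shootout_result(kicks):
    """Return (Y, n_used) for an ABAB sequence of kick outcomes.

    kicks: iterable of 0/1 (1 = scored, 0 = missed or saved); positions
    0, 2, 4, ... belong to the team kicking first.
    Y = 1 if the team kicking first wins, 0 otherwise; n_used is the
    number of kicks taken when the shootout is decided.
    Raises ValueError if the sequence ends before a winner is decided.
    """
    goals = [0, 0]   # goals[0]: team kicking first, goals[1]: team kicking second
    taken = [0, 0]   # kicks taken so far by each team
    for t, z in enumerate(kicks):
        team = t % 2
        goals[team] += z
        taken[team] += 1
        if max(taken) <= 5:
            # Best-of-five phase: stop as soon as a lead is unassailable.
            remaining = [5 - taken[0], 5 - taken[1]]
            if goals[0] > goals[1] + remaining[1]:
                return 1, t + 1
            if goals[1] > goals[0] + remaining[0]:
                return 0, t + 1
        elif taken[0] == taken[1] and goals[0] != goals[1]:
            # Sudden death: decided only after a completed pair of kicks.
            return (1 if goals[0] > goals[1] else 0), t + 1
    raise ValueError("sequence ends before the shootout is decided")


# Three scored kicks against three misses: decided after six kicks.
print(shootout_result([1, 0, 1, 0, 1, 0]))  # (1, 6)
```

Applying such a function to simulated sequences of $Z_{m}$ is one way to obtain draws of the derived quantity $Y_{m}$.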

Model specification

For each $(m, t)$ , we consider the framework

$$Z_{m, t} \mid p_{m, t} \sim \text{Bernoulli}(p_{m, t}), \qquad \text{logit}(p_{m, t}) = \log\left(\frac{p_{m, t}}{1 - p_{m, t}}\right) = \eta_{m, t}, \tag{1}$$

where the linear link term is modeled as a linear combination of fixed effects for observed covariates, additive random effects for shooter, goalkeeper, and team, and a latent within-shootout component:

$$\eta_{m, t} = X_{m, t}^{\top} \beta + \theta_{1}\, \text{Order}_{m, t} + \theta_{2}\, \text{ScoreStatus}_{m, t} + a_{s(m, t)} + b_{g(m, t)} + c_{v(m, t)} + u_{m, t}. \tag{2}$$

In the above, $\text{Order}_{m, t} \in \{0, 1\}$ indicates whether the shooting team takes the first kick in the current round, and $\text{ScoreStatus}_{m, t} \in \{\text{Leading}, \text{Trailing}, \text{Equal}\}$ is the pre-kick situation (based on goals for minus goals against) for the shooting team. From an implementation perspective, both of these factors are included within $X_{m, t}$, but they are shown here explicitly for interpretability. The random effects corresponding to shooter, goalkeeper and team are represented by the symbols $a, b, c$, respectively. These terms are modeled as exchangeable effects

$$\begin{aligned} a_{s} \mid \sigma_{a}^{2} &\overset{iid}{\sim} N(0, \sigma_{a}^{2}), & s &= 1, \dots, S; \\ b_{g} \mid \sigma_{b}^{2} &\overset{iid}{\sim} N(0, \sigma_{b}^{2}), & g &= 1, \dots, G; \\ c_{v} \mid \sigma_{c}^{2} &\overset{iid}{\sim} N(0, \sigma_{c}^{2}), & v &= 1, \dots, V; \end{aligned} \tag{3}$$

with sum-to-zero constraints $\sum_{s} a_{s} = \sum_{g} b_{g} = \sum_{v} c_{v} = 0$ for identifiability of the intercept absorbed into $\beta$. Finally, to allow for serial dependence beyond the measured covariates and the random effects, we introduce the Gaussian error term $\{u_{m, t}\}$ that evolves within each shootout according to the formulation

$$u_{m, 1} \sim N\left(0, \frac{\sigma_{u}^{2}}{1 - \phi^{2}}\right), \quad \text{and} \quad u_{m, t} = \phi u_{m, t - 1} + \varepsilon_{m, t}, \quad \varepsilon_{m, t} \overset{iid}{\sim} N(0, \sigma_{u}^{2}) \ \text{ for } t \geq 2, \tag{4}$$

with $|\phi| < 1$. This induces appropriately specified serial dependence among the logits $\eta_{m, t}$ conditional on covariates and random effects.
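As an illustration of how (1), (2) and (4) fit together, the sketch below simulates the latent AR(1) state for one shootout and draws kick outcomes from the implied probabilities. The argument `base_logits` is a hypothetical stand-in for the fixed-effect and random-effect part of $\eta_{m, t}$; all numerical values are illustrative only.

```python
import math
import random


def inv_logit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def simulate_ar1_state(n_kicks, phi, sigma_u, rng):
    """Draw u_1, ..., u_n from the stationary AR(1) process in (4)."""
    if abs(phi) >= 1:
        raise ValueError("phi must satisfy |phi| < 1")
    # Stationary initial draw with variance sigma_u^2 / (1 - phi^2).
    u = [rng.gauss(0.0, sigma_u / math.sqrt(1.0 - phi ** 2))]
    for _ in range(1, n_kicks):
        u.append(phi * u[-1] + rng.gauss(0.0, sigma_u))
    return u


def simulate_kicks(base_logits, phi, sigma_u, rng):
    """Simulate Z_t ~ Bernoulli(inv_logit(base_t + u_t)) for one shootout.

    base_logits: hypothetical values of the non-dynamic part of eta
    (covariates plus shooter, goalkeeper and team effects), one per kick.
    """
    u = simulate_ar1_state(len(base_logits), phi, sigma_u, rng)
    probs = [inv_logit(b + ut) for b, ut in zip(base_logits, u)]
    kicks = [1 if rng.random() < p else 0 for p in probs]
    return u, probs, kicks


rng = random.Random(1)
u, probs, kicks = simulate_kicks([1.1] * 10, phi=0.5, sigma_u=0.3, rng=rng)
```

Setting `sigma_u` to zero switches the latent state off, so every kick then has the same success probability `inv_logit(base)`.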

Bayesian estimation

We adopt a fully Bayesian approach and fit the model with Hamiltonian Monte Carlo (HMC) as implemented in Stan (Carpenter et al., 2017). All analyses were run in R (version 4.5.1) and RStudio (version 2025.05.0+496) using the interface rstan (Stan Development Team, 2025). To set up the Bayesian computation, suitable priors must be specified for the parameters of model (2). For the benefit of the reader, we briefly recall the two main ingredients of Bayesian inference: prior distributions encode weak initial regularization on the parameters before the data are observed; the posterior distribution combines these priors with the likelihood and therefore reflects the updated uncertainty after seeing the data. For our analysis, we propose weakly informative priors for the fixed effects in the additive terms of the model to regularize estimation while allowing substantive signal. Specifically, for each of the coefficients $\theta_{1}, \theta_{2}, \beta_{j}$ (for $j = 1, \dots, p$), we assume

$$\theta_{1}, \theta_{2}, \beta_{j} \sim N(0, \xi^{2}), \tag{5}$$

where $\xi$ can be taken in the range $[3, 10]$; we choose $\xi = 5$ in our implementation. Next, for each of the standard deviation parameters that govern the variability of the random effects and the Gaussian error term, we consider the half-$t$ prior distribution supported on the positive real line, which is a special case of the folded $t$ distribution (Psarakis and Panaretos, 1990). We set the prior degrees of freedom ($\nu$) at 3 and the scale term ($\zeta$) at 0.25:

$$\sigma_{a}, \sigma_{b}, \sigma_{c}, \sigma_{u} \sim \text{half-}t(3, 0.25).$$

Finally, for the autoregressive parameter $\phi$ in (4), we impose the standard normal prior, truncated to the region where the parameter is defined, i.e.,

$$\phi \sim N(0, 1^{2})\, I(|\phi| < 1).$$

We emphasize that these prior choices are weakly informative on the logit scale while still regularizing extreme parameter values that are difficult to identify from sparse shootout data. In particular, the Gaussian prior for the fixed effects is broad enough to allow substantial covariate effects on the odds scale, while discouraging unrealistically large coefficients. For the standard deviation parameters, the chosen half- $t$ priors place most mass on small-to-moderate latent-scale heterogeneity but retain heavy tails for larger values when supported by the data. This reflects the expectation that persistent differences among shooters, goalkeepers, and teams may exist, but are unlikely to be extremely large once contextual covariates are taken into account. The same reasoning applies to the latent dynamic component, where the prior is intended to allow serial dependence without a priori forcing large transient volatility.
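To give a feel for what these priors imply, the following minimal sketch draws from them using only the Python standard library. The function names and the dictionary layout are illustrative assumptions; the half-$t$ draw uses the standard construction as a scaled absolute Student-$t$ variate, and the truncated normal is sampled by simple rejection.

```python
import math
import random


def draw_half_t(nu, scale, rng):
    """Draw from half-t(nu, scale) as scale * |Z| / sqrt(chi2_nu / nu)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    return scale * abs(z) / math.sqrt(chi2 / nu)


def draw_truncated_std_normal(rng, lower=-1.0, upper=1.0):
    """Draw from N(0, 1) restricted to (lower, upper) by rejection."""
    while True:
        x = rng.gauss(0.0, 1.0)
        if lower < x < upper:
            return x


def draw_prior(rng, p=2, xi=5.0, nu=3.0, zeta=0.25):
    """One joint prior draw; p (number of covariates) is a hypothetical choice."""
    return {
        "beta": [rng.gauss(0.0, xi) for _ in range(p)],
        "theta1": rng.gauss(0.0, xi),
        "theta2": rng.gauss(0.0, xi),
        "sigma_a": draw_half_t(nu, zeta, rng),
        "sigma_b": draw_half_t(nu, zeta, rng),
        "sigma_c": draw_half_t(nu, zeta, rng),
        "sigma_u": draw_half_t(nu, zeta, rng),
        "phi": draw_truncated_std_normal(rng),
    }


rng = random.Random(2024)
prior_draws = [draw_prior(rng) for _ in range(1000)]
```

Under these settings, the prior median of each standard deviation is roughly 0.19 on the logit scale, consistent with the statement that most prior mass sits on small-to-moderate heterogeneity.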

To derive the complete posterior density in a clean form, let us first define the data vector $Z = (Z_{m, t})$ (arranged according to matches and, within each match, in the order of the kicks), the parameter set $\Theta = (\beta, \theta_{1}, \theta_{2}, \{a_{s}\}, \{b_{g}\}, \{c_{v}\}, \phi, \sigma_{a}, \sigma_{b}, \sigma_{c}, \sigma_{u})$, and the Gaussian error process $U = (u_{m, t})$. Then, the complete-data likelihood is

$$L(Z \mid \Theta, U) = \prod_{m = 1}^{M} \prod_{t = 1}^{\tau_{m}} \text{Bernoulli}\left(Z_{m, t} \mid \text{logit}^{-1}(\eta_{m, t})\right),$$

and the posterior density can be obtained from

$$\pi(\Theta, U \mid Z) \propto L(Z \mid \Theta, U)\, \pi(\Theta)\, \pi(U \mid \phi, \sigma_{u}), \tag{6}$$

where $\pi(\Theta)$ represents the product of all prior densities and $\pi(U \mid \phi, \sigma_{u})$ denotes the AR(1) Gaussian state density for the complete data.

Further, for ease of explanation, let us index all the kicks in the data by $i = 1, \dots, N$ after sorting them according to match and penalty sequence as mentioned above. Set $y_{i} = Z_{m, t}$ and $\kappa_{i} = y_{i} - 0.5$. Then, concatenate the predictors into a design vector $w_{i}$ so that

$$\eta_{i} = w_{i}^{\top} \theta + u_{i}, \qquad \theta = (\beta^{\top}, \theta_{1}, \theta_{2}, a^{\top}, b^{\top}, c^{\top})^{\top},$$

where $a, b, c$ are the vectors of the random effect terms. Now, the Bernoulli-logit likelihood in (1) can be factorized as

$$L(Z \mid \theta, u) = \prod_{i = 1}^{N} \frac{\exp(y_{i} \eta_{i})}{1 + \exp(\eta_{i})}. \tag{7}$$

We next incorporate Pólya-Gamma augmentation (hereafter abbreviated as $PG$), which is a common approach in Bayesian models involving Bernoulli random variables (see, e.g., Polson et al., 2013). Specifically, we introduce latent variables $\omega_{i}$, for $i = 1, \dots, N$, which are iid $PG(1, 0)$ a priori with density $p(\omega_{i})$. Using the identity underlying this augmentation, the Bernoulli-logit term can be written as

$$\frac{\exp(y_{i} \eta_{i})}{1 + \exp(\eta_{i})} = \frac{1}{2} \exp(\kappa_{i} \eta_{i}) \int_{0}^{\infty} \exp\left(- \frac{\omega_{i} \eta_{i}^{2}}{2}\right) p(\omega_{i})\, d\omega_{i},$$

which implies that, as a function of $(\theta, u)$, the joint augmented likelihood is proportional to

$$L^{*}(Z, \omega \mid \theta, u) \propto \prod_{i = 1}^{N} \exp\left(\kappa_{i} \eta_{i} - \frac{\omega_{i} \eta_{i}^{2}}{2}\right),$$

with $\omega = (\omega_{1}, \dots, \omega_{N})$. It is easy to observe that, conditional on $\omega$, the augmented likelihood is Gaussian in $\eta = (\eta_{1}, \dots, \eta_{N})^{\top}$. This property will be critical in the Bayesian computation. For further discussions, let $W$ be the $N \times q$ design matrix ($q$ being the dimension of $\theta$) with rows $w_{i}^{\top}$, $\Omega = \text{diag}(\omega_{1}, \dots, \omega_{N})$, and $\kappa = (\kappa_{1}, \dots, \kappa_{N})^{\top}$. Then,

$$\log L^{*}(Z, \omega \mid \theta, u) = - \frac{1}{2} (u + W\theta)^{\top} \Omega\, (u + W\theta) + \kappa^{\top} (W\theta + u) + \text{constant}. \tag{8}$$

One may complete the square to obtain Gaussian full conditionals for $\theta$ and $u$ (the key parameters of interest in the estimation procedure) given $\omega$ and the hyperparameters. For our purpose, first define the Gaussian prior for $\theta$ as $\theta \sim N(0, Q_{0}^{-1})$, where $Q_{0}$ is block-diagonal with entries $\xi^{-2}$ for $(\beta, \theta_{1}, \theta_{2})$ and $\sigma_{a}^{-2} I_{S}$, $\sigma_{b}^{-2} I_{G}$, $\sigma_{c}^{-2} I_{V}$ for $(a, b, c)$, together with linear constraints $\sum_{s} a_{s} = \sum_{g} b_{g} = \sum_{v} c_{v} = 0$ (enforced in practice by centering). The conditional posterior of $\theta$ is given by

$$\theta \mid \omega, u, \text{hyperparameters}, Z \sim N(m_{\theta}, Q_{\theta}^{-1}),$$

with precision and mean

$$Q_{\theta} = Q_{0} + W^{\top} \Omega W, \qquad m_{\theta} = Q_{\theta}^{-1} W^{\top} (\kappa - \Omega u).$$

To find the conditional posterior of the latent state vector, we note that for the shootout in the $m$th match, the AR(1) prior for $u_{m} = (u_{m, 1}, \dots, u_{m, \tau_{m}})^{\top}$ is Gaussian with a tridiagonal precision matrix $Q_{m}^{\text{AR1}}$, given by

$$Q_{m}^{\text{AR1}} = \frac{1}{\sigma_{u}^{2}} \begin{bmatrix} 1 & -\phi & 0 & \dots & 0 \\ -\phi & 1 + \phi^{2} & -\phi & \ddots & \vdots \\ 0 & -\phi & 1 + \phi^{2} & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & -\phi \\ 0 & \dots & 0 & -\phi & 1 \end{bmatrix}, \tag{9}$$

where the $(1, 1)$ entry encodes the stationary initial variance: the initial distribution in (4) contributes $(1 - \phi^{2}) / \sigma_{u}^{2}$ and the first transition contributes $\phi^{2} / \sigma_{u}^{2}$. Stacking across matches, let $Q^{\text{AR1}}$ denote the block diagonal matrix with blocks $Q_{1}^{\text{AR1}}, \dots, Q_{M}^{\text{AR1}}$. The augmented likelihood contributes $\Omega$, thereby implying that the conditional posterior for $u$ is

$$u \mid \omega, \theta, \text{hyperparameters}, Z \sim N(m_{u}, Q_{u}^{-1}),$$

with precision and mean

$$Q_{u} = Q^{\text{AR1}} + \Omega, \qquad m_{u} = Q_{u}^{-1} (\kappa - \Omega W \theta).$$

It must be noted that the tridiagonal (per match) structure makes the computation of $Q_{u}^{-1} r$, for any vector $r$, linear in $N$.
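The linear-time claim can be illustrated with the Thomas algorithm for tridiagonal systems. The sketch below is a minimal, assumption-laden example for a single shootout: it takes the per-match precision to be the standard stationary AR(1) form (unit corner entries, $1 + \phi^{2}$ on the interior diagonal, $-\phi$ off the diagonal, all divided by $\sigma_{u}^{2}$), adds the diagonal matrix $\Omega$, and solves $Q_{u} x = r$ without forming any dense matrix. The function names are hypothetical.

```python
def ar1_precision_bands(n, phi, sigma_u):
    """Main diagonal and off-diagonal of the stationary AR(1) precision."""
    s2 = sigma_u ** 2
    if n == 1:
        return [(1.0 - phi ** 2) / s2], []
    diag = [1.0 / s2] + [(1.0 + phi ** 2) / s2] * (n - 2) + [1.0 / s2]
    off = [-phi / s2] * (n - 1)
    return diag, off


def solve_tridiagonal(diag, off, rhs):
    """Thomas algorithm for a symmetric tridiagonal system, O(n) operations."""
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = off[i] / denom
        d[i] = (rhs[i] - off[i - 1] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def posterior_mean_u(phi, sigma_u, omega, rhs):
    """Solve (Q_AR1 + Omega) x = rhs for one shootout.

    omega: Polya-Gamma variables for the kicks of this shootout;
    rhs: the corresponding block of kappa - Omega W theta.
    """
    diag, off = ar1_precision_bands(len(omega), phi, sigma_u)
    diag = [q + w for q, w in zip(diag, omega)]
    return solve_tridiagonal(diag, off, rhs)
```

Because $Q_{u}$ is block diagonal across matches, the same solve can be applied shootout by shootout, so the total cost grows linearly with the number of kicks.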

Next, for each observation, the conditional posterior of the Pólya-Gamma variables is

$$\omega_{i} \mid \theta, u, Z \sim PG(1, \eta_{i}), \qquad \eta_{i} = w_{i}^{\top} \theta + u_{i}.$$

For the variance components, on the other hand, we first recall that the half-$t(\nu, \zeta)$ prior admits a representation via the inverse-gamma ($IG$) distribution:

$$\sigma_{*}^{2} \mid \lambda_{*} \sim IG\left(\frac{\nu}{2}, \frac{\nu}{\lambda_{*}}\right), \qquad \lambda_{*} \sim IG\left(\frac{1}{2}, \frac{1}{\zeta^{2}}\right), \tag{10}$$

for $* \in \{a, b, c, u\}$. With Gaussian prior $a_{s} \sim N(0, \sigma_{a}^{2})$ etc., the conditionals can be derived as

$$\sigma_{a}^{2} \mid a, \lambda_{a} \sim IG\left(\frac{S + \nu}{2}, \frac{\sum_{s = 1}^{S} a_{s}^{2}}{2} + \frac{\nu}{\lambda_{a}}\right), \qquad \lambda_{a} \mid \sigma_{a}^{2} \sim IG\left(\frac{\nu + 1}{2}, \frac{\nu}{\sigma_{a}^{2}} + \frac{1}{\zeta^{2}}\right), \tag{11}$$

with analogous forms for $(\sigma_{b}^{2}, \lambda_{b})$ and $(\sigma_{c}^{2}, \lambda_{c})$. For the term $\sigma_{u}^{2}$, we need to replace $S$ by $N$ and $\sum_{s} a_{s}^{2}$ in the last expression by the AR(1) quadratic form $u^{\top} R_{\phi} u$, where $R_{\phi} = \sigma_{u}^{2} Q^{\text{AR1}}$ is the scale-free tridiagonal matrix (observe that this matrix is updated in every step as a function of $\phi$, but we omit that from the notation for convenience), which yields

$$\sigma_{u}^{2} \mid u, \phi, \lambda_{u} \sim IG\left(\frac{N + \nu}{2}, \frac{u^{\top} R_{\phi} u}{2} + \frac{\nu}{\lambda_{u}}\right), \qquad \lambda_{u} \mid \sigma_{u}^{2} \sim IG\left(\frac{\nu + 1}{2}, \frac{\nu}{\sigma_{u}^{2}} + \frac{1}{\zeta^{2}}\right). \tag{12}$$

Finally, we turn to the autoregressive coefficient $\phi$. There is no conjugate conditional for this parameter, and the conditional log-density (up to an additive constant) can be written as

$$\log \pi(\phi \mid u, \sigma_{u}^{2}) = \sum_{m = 1}^{M} \left[\frac{1}{2} \log(1 - \phi^{2}) - \frac{(1 - \phi^{2})\, u_{m, 1}^{2}}{2 \sigma_{u}^{2}} - \frac{1}{2 \sigma_{u}^{2}} \sum_{t = 2}^{\tau_{m}} (u_{m, t} - \phi u_{m, t - 1})^{2}\right] - \frac{\phi^{2}}{2}, \tag{13}$$

where the sum is over all matches in the data. In the above expression, the final term stems from the $N(0, 1)$ prior, and we also need to impose the required condition $|\phi| < 1$. Maximization or slice sampling can be used within a Gibbs scheme; in Stan we sample $\phi$ jointly with all parameters via HMC. In that regard, when sampling with HMC in our implementation, we work with the non-augmented posterior

$$\begin{aligned} \log \pi(\theta, u, \sigma^{2}, \phi \mid Z) = {} & \sum_{i = 1}^{N} \left[y_{i} \eta_{i} - \log(1 + \exp(\eta_{i}))\right] \\ & - \frac{1}{2} u^{\top} Q^{\text{AR1}} u - \frac{1}{2} \sum_{*} \frac{\| r_{*} \|^{2}}{\sigma_{*}^{2}} + \log \pi(\text{hyper}), \end{aligned}$$

where $r_{*} \in \{a, b, c\}$ and $\sigma_{*}$ is defined accordingly, $\eta_{i} = w_{i}^{\top} \theta + u_{i}$, and $\log \pi(\text{hyper})$ collects the log-priors for $(\beta, \theta_{1}, \theta_{2}, \sigma^{2}, \phi)$ (including the Jacobian for any transforms and the indicator $I(|\phi| < 1)$). Note that the matrix $Q^{\text{AR1}}$ is a function of the parameters $\phi$ and $\sigma_{u}^{2}$; that dependence is understood throughout the following discussion. Gradients of the above with respect to $(\theta, u)$ are

$$\frac{\partial \log \pi}{\partial \theta} = W^{\top} (y - p) - Q_{0} \theta, \qquad \frac{\partial \log \pi}{\partial u} = y - p - Q^{\text{AR1}} u, \tag{14}$$

where $y = (y_{1}, \dots, y_{N})^{\top}$, $p = (p_{1}, \dots, p_{N})^{\top}$ with $p_{i} = \text{logit}^{-1}(\eta_{i})$, and $Q_{0}$ is the prior precision on $\theta$ (after enforcing sum-to-zero by centering). The gradient with respect to $\phi$ follows from $\partial Q^{\text{AR1}} / \partial \phi$ and the stationary-variance term. In the HMC framework, we use the No-U-Turn Sampler (abbreviated as NUTS; see Hoffman and Gelman, 2014, for details) as implemented in Stan. There, we consider the formulation of the random effects

$$\tilde{a}_{s} \overset{iid}{\sim} N(0, 1), \qquad a_{s} = \sigma_{a} \tilde{a}_{s} - \frac{\sigma_{a}}{S} \sum_{s'} \tilde{a}_{s'}, \tag{15}$$

with analogous formulas for $b_{g}$ and $c_{v}$, enforcing $\sum_{s} a_{s} = \sum_{g} b_{g} = \sum_{v} c_{v} = 0$. Meanwhile, the AR(1) state is parameterized via standardized innovations

$$u_{m, 1} = \frac{\sigma_{u}}{\sqrt{1 - \phi^{2}}}\, \tilde{\varepsilon}_{m, 1}, \qquad u_{m, t} = \phi u_{m, t - 1} + \sigma_{u} \tilde{\varepsilon}_{m, t}, \qquad \tilde{\varepsilon}_{m, t} \overset{iid}{\sim} N(0, 1). \tag{16}$$

We impose $-1 < \phi < 1$ through a bounded parameter or, equivalently, $\phi = 2\, \text{logit}^{-1}(\psi) - 1$ with $\psi \sim N(0, 1)$ (the latter adds a Jacobian term automatically handled by Stan’s transform).
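The non-centered parameterizations in (15) and (16), together with the bounded transform for $\phi$, can be sketched as plain deterministic maps from standardized quantities to model parameters. This is an illustrative Python rendering with hypothetical function names; the authors' actual implementation is in Stan and is not reproduced here.

```python
import math


def centered_effects(raw, sigma):
    """Map standardized draws to sum-to-zero effects as in (15).

    a_s = sigma * raw_s - (sigma / S) * sum(raw) = sigma * (raw_s - mean(raw)).
    """
    mean_raw = sum(raw) / len(raw)
    return [sigma * (r - mean_raw) for r in raw]


def ar1_from_innovations(eps, phi, sigma_u):
    """Build u_1, ..., u_n from standardized innovations as in (16)."""
    u = [sigma_u / math.sqrt(1.0 - phi ** 2) * eps[0]]
    for e in eps[1:]:
        u.append(phi * u[-1] + sigma_u * e)
    return u


def phi_from_unconstrained(psi):
    """phi = 2 * inv_logit(psi) - 1, written in the equivalent form tanh(psi / 2)."""
    return math.tanh(psi / 2.0)


a = centered_effects([0.3, -1.2, 0.5, 2.0], sigma=0.4)
u = ar1_from_innovations([1.0, 0.0, 0.0], phi=0.5, sigma_u=1.0)
phi = phi_from_unconstrained(0.8)
```

The sampler then explores the standardized quantities (`raw`, `eps`, `psi`), whose prior geometry does not depend on the scale parameters, which is the usual motivation for this kind of reparameterization.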

For the main analysis, we run four independent NUTS chains with warm-up for step-size and mass-matrix adaptation, target acceptance $0.95$, and maximum tree depth $12$. Convergence is assessed using $\hat{R} \approx 1$, bulk and tail effective sample sizes, and inspection of the energy Bayesian fraction of missing information (E-BFMI). Divergent transitions trigger increases in the target acceptance and checks of (i) covariate scaling, (ii) strength of priors on scale parameters, and (iii) adequacy of the non-centered parameterizations. Posterior predictive checks are based on replicated sequences drawn from the fitted model. For these diagnostics and checks, we use the R packages posterior (Bürkner et al., 2025) and bayesplot (Gabry et al., 2019).

Inference, model comparison, and interpretation

For inference, we rely on the posterior samples of the quantities of interest obtained via the Bayesian computation described in the previous section. We shall present the results based on the posterior means of the parameters, corresponding standard errors, posterior standard deviations, posterior medians and 95% equal-tailed credible intervals. Since the key aim is to differentiate between skill and chance, we first avoid ambiguity by distinguishing between ‘‘persistent skill,’’ ‘‘idiosyncratic variation,’’ and ‘‘chance,’’ three terms we shall use repeatedly in the remainder of the article. By persistent skill, we mean stable heterogeneity attributable to shooters, goalkeepers, and teams, as represented by the random effects in the model. By idiosyncratic variation, we mean kick-specific unexplained variability that is not tied to persistent entities. We use the term chance in a broader sense to refer to the combined contribution of idiosyncratic and transient components, including the latent within-shootout state and the logistic residual variance. Thus, while the terms noise and chance are related, the former is used in a narrower statistical sense, whereas the latter refers generally to the part of the outcome not explained by persistent skill. Accordingly, in model (2), the random effect terms corresponding to shooters, goalkeepers, and teams can be attributed to skill. In terms of chance, we recognize that on the logit scale, the latent-variable representation of the logistic model implies a residual variance of $\pi^{2} / 3$. Therefore, the contribution of persistent skill components can be summarized by

$$\text{SkillShare} = \frac{\sigma_{a}^{2} + \sigma_{b}^{2} + \sigma_{c}^{2}}{\sigma_{a}^{2} + \sigma_{b}^{2} + \sigma_{c}^{2} + \sigma_{u}^{2} / (1 - \phi^{2}) + \pi^{2} / 3}. \tag{17}$$

A brief remark on the fixed variance term $\pi^{2} / 3$ is warranted here. It stems from the idea that in logistic regression, the observed binary outcome is obtained by thresholding a latent continuous variable whose error term follows a standard logistic distribution, the variance of which is known to be $\pi^{2} / 3$. This can be verified directly from its density or moment-generating function. Unlike the random-effect variances or the latent AR variance, this term is not estimated from the data; rather, it is a fixed contribution implied by the choice of the logit link. In our setting, it serves as a baseline source of unexplained kick-level variation on the latent scale, against which the contribution of persistent skill can be compared. It further implies that the numerical magnitude of $SkillShare$ is inherently scale-dependent and should be interpreted comparatively across model specifications rather than as an absolute measure of sporting skill. Hence, posterior concentration of $SkillShare$ away from zero indicates a nontrivial systematic component attributable to players and teams. In the same fashion, one can also compute the marginal shares of skill due to the shooters, goalkeepers or teams. Our results are illustrated in Section “Main analysis” below. In the same section, we also describe the predictive performance of our proposed model and compare it against other candidate approaches. In that regard, to separate skill from chance without over-parameterizing within-shootout dynamics, we consider a sequence of simpler models starting from the model described in detail above (hereafter denoted as $M_{0}$).
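As a worked illustration of (17), the sketch below evaluates $SkillShare$ for a single parameter draw and then summarizes it over a set of posterior draws. The draws in the example are made-up numbers used only to show the mechanics, and the helper names are hypothetical.

```python
import math
import statistics


def skill_share(sigma_a, sigma_b, sigma_c, sigma_u, phi):
    """Evaluate (17): persistent variance over total latent-scale variance."""
    skill = sigma_a ** 2 + sigma_b ** 2 + sigma_c ** 2
    state = sigma_u ** 2 / (1.0 - phi ** 2)   # stationary variance of the AR(1) state
    logistic = math.pi ** 2 / 3.0             # fixed residual variance of the logit link
    return skill / (skill + state + logistic)


def summarize_skill_share(draws):
    """Posterior mean and 95% equal-tailed interval of SkillShare.

    draws: list of (sigma_a, sigma_b, sigma_c, sigma_u, phi) tuples,
    one per posterior draw (at least two draws are required).
    """
    shares = [skill_share(*d) for d in draws]
    cuts = statistics.quantiles(shares, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return {"mean": statistics.mean(shares), "lower": cuts[0], "upper": cuts[-1]}


# Made-up draws, for illustration only.
example_draws = [
    (0.10, 0.08, 0.05, 0.30, 0.20),
    (0.15, 0.05, 0.07, 0.25, 0.10),
    (0.05, 0.12, 0.04, 0.35, 0.30),
    (0.08, 0.09, 0.06, 0.28, 0.15),
]
summary = summarize_skill_share(example_draws)
```

Because the denominator always contains $\pi^{2} / 3 \approx 3.29$, random-effect standard deviations well below one on the logit scale necessarily give a small share, which is one reason the quantity is best read comparatively across specifications.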

As the first candidate model, we remove the state $u_{m, t}$ that induces serial correlation and replace it with an iid Gaussian white noise term. Specifically, we modify the logit equation in the modeling framework and rewrite (2) as

\begin{aligned} η_{m, t} & = X_{m, t}^{⊤} β + θ_{1} {Order}_{m, t} + θ_{2} {ScoreStatus}_{m, t} + a_{s (m, t)} + b_{g (m, t)} + c_{v (m, t)} + ε_{m, t}, \\ ε_{m, t} & \overset{iid}{\sim} N (0, σ_{u}^{2}) . \end{aligned}

We shall denote this model as $M_{1}$. Akin to before, the random effects $a, b, c$ follow the centered Gaussian priors in (3) with sum-to-zero constraints, the fixed effects have weakly informative Gaussian priors, and $σ_{u}$ receives a half-$t$ prior. By removing serial dependence, $M_{1}$ delivers a cleaner variance decomposition in which $σ_{a}^{2}, σ_{b}^{2}, σ_{c}^{2}$ quantify skills due to shooter, keeper, and team respectively whereas $σ_{u}^{2}$ captures idiosyncratic noise, allowing a direct assessment of the proportion of latent variation attributable to skill.

In the next model ( $M_{2}$ ), we attempt to examine if institution-level factors alone can explain the variability instead of the individual-level effects, i.e., whether success rate relies on the teams and not on the players. Thus, we retain only the team random effect along with the over-dispersion term, and define

η_{m, t} = X_{m, t}^{⊤} β + θ_{1} {Order}_{m, t} + θ_{2} {ScoreStatus}_{m, t} + c_{v (m, t)} + ε_{m, t}, ε_{m, t} \overset{iid}{\sim} N (0, σ_{u}^{2}),

(18)

where $c_{v} \overset{iid}{\sim} N (0, σ_{c}^{2})$ with $\sum_{v} c_{v} = 0$ and $σ_{c}$ carrying a half-$t$ prior.

Further, to isolate only potential momentum or psychological carryover within a shootout, we remove all random effects and retain the latent within-shootout component with AR(1) structure. This model (hereafter called $M_{3}$ ) is defined as

\begin{aligned} η_{m, t} & = X_{m, t}^{⊤} β + θ_{1} {Order}_{m, t} + θ_{2} {ScoreStatus}_{m, t} + u_{m, t}, \\ u_{m, 1} & \sim N (0, \frac{σ_{u}^{2}}{1 - ϕ^{2}}), u_{m, t} = ϕ u_{m, t - 1} + ε_{m, t}, \end{aligned}

with $| ϕ | < 1$ and $ε_{m, t} \overset{iid}{\sim} N (0, σ_{u}^{2})$. Priors on $ϕ$ and $σ_{u}$ are weakly informative.
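The latent state of $M_{3}$ can be illustrated by simulation. The sketch below draws $u_{m, 1}$ from the stationary distribution and then applies the AR(1) recursion, exactly as in the display above. The numerical values of $ϕ$, $σ_{u}$ and the constant fixed part of $η_{m, t}$ are hypothetical and chosen only for illustration, and the function names are ours.

```python
import math
import random


def inv_logit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def simulate_ar1_state(n_kicks, phi, sigma_u, rng):
    """Draw u_{m,1}, ..., u_{m,n_kicks} from the stationary AR(1) state."""
    if abs(phi) >= 1.0:
        raise ValueError("stationarity requires |phi| < 1")
    # u_{m,1} ~ N(0, sigma_u^2 / (1 - phi^2)), the stationary distribution
    u = [rng.gauss(0.0, sigma_u / math.sqrt(1.0 - phi ** 2))]
    for _ in range(1, n_kicks):
        # u_{m,t} = phi * u_{m,t-1} + eps_{m,t}, with eps_{m,t} ~ N(0, sigma_u^2)
        u.append(phi * u[-1] + rng.gauss(0.0, sigma_u))
    return u


def conversion_probabilities(fixed_part, state):
    """Kick-level probabilities from eta_{m,t} = fixed_part_t + u_{m,t}."""
    return [inv_logit(f + s) for f, s in zip(fixed_part, state)]


rng = random.Random(2025)
state = simulate_ar1_state(10, phi=0.5, sigma_u=0.4, rng=rng)      # hypothetical values
probs = conversion_probabilities([1.15] * 10, state)               # hypothetical fixed part
print([round(p, 3) for p in probs])
```

Starting from the stationary distribution means every kick position has the same marginal state variance $σ_{u}^{2} / (1 - ϕ^{2})$, which is the quantity entering the denominator of (17).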

Finally, as a parsimonious benchmark, we suppress all random effects and retain only measured covariates and the over-dispersion term:

η_{m, t} = X_{m, t}^{⊤} β + θ_{1} {Order}_{m, t} + θ_{2} {ScoreStatus}_{m, t} + ε_{m, t}, ε_{m, t} \overset{iid}{\sim} N (0, σ_{u}^{2}) .

(19)

This simple specification (to be denoted as $M_{4}$) sets the lower bound on $SkillShare$ (zero by construction for unobserved persistent heterogeneity) and calibrates how far one can go by covariates alone. Empirically, we note that $M_{2}$ provides a bridge between the fixed-effects baseline $M_{4}$ and the full heterogeneity of $M_{1}$, whereas the model $M_{3}$ asks whether serial dependence alone improves fit or prediction relative to $M_{4}$, thereby separating momentum from persistent skill. In predictive comparisons, we shall also include a baseline approach which assigns success randomly with a $Bernoulli (0.5)$ probability distribution.

We find it imperative to point out that the competing models are carefully chosen to peel away the structure of $M_{0}$ in a controlled fashion. If the posterior evidence in favor of $SkillShare$ (captured via $σ_{a}^{2}, σ_{b}^{2}, σ_{c}^{2}$ as mentioned earlier) is small and predictive accuracy is comparable to $M_{4}$ or random baseline, it supports the interpretation that, conditional on observed context, most variability is indistinguishable from chance. If $M_{2}$ performs similarly to $M_{1}$ , club-level factors dominate such persistent differences; conversely, a gap between the two models implies shooter or keeper-specific skills. Finally, comparing $M_{3}$ to $M_{4}$ tests whether within-shootout dependence (momentum) has material explanatory power; a posterior $ϕ$ near zero and no predictive gain argue against strong serial effects. On a related note, we recognize that a potential identification issue in $M_{0}$ is that the persistent random effects $(a, b, c)$ and the latent AR(1) state $u_{m, t}$ can both absorb non-independence in the kick outcomes. Conceptually, these components play different roles: the random effects represent entity-specific heterogeneity attributable to shooters, goalkeepers, and teams, whereas the AR(1) term captures within-shootout dependence that evolves over the sequence of kicks. In finite samples, however, these sources of structure are not perfectly separable, and some degree of confounding is possible. This is another reason why we compare $M_{0}$ against the aforementioned reduced specifications that selectively remove the AR(1) term or the random effects. Collectively, these comparisons help assess whether the near-zero $SkillShare$ in the full model reflects a genuine dominance of transient variation or whether it arises because the latent state absorbs variation that could otherwise be attributed to persistent heterogeneity.

Coming to the aspect of comparison, to do it in a way that respects within-shootout dependence, we evaluate out-of-sample performance by leaving out one shootout at a time and refitting implicitly via Pareto-smoothed importance sampling leave-one-out cross-validation, or the PSIS-LOO approach (Vehtari et al., 2017). For implementation in R, we use the loo package (Vehtari et al., 2024) in this regard. To understand the approach, recall that $Z_{m} = (Z_{m, 1}, \dots, Z_{m, τ_{m}})$ denotes all kicks in the $m$ th shootout. For a posterior draw $Θ^{(d)}$ of all parameters, define the shootout-level log-likelihood

ℓ_{m} (Θ^{(d)}) = \sum_{t = 1}^{τ_{m}} \log p (Z_{m, t} ∣ Θ^{(d)}),

(20)

where $p (\cdot)$ stands for the Bernoulli likelihood. Also, define the raw importance ratios $r_{m}^{(d)} = \exp \{- ℓ_{m} (Θ^{(d)})\}$. PSIS replaces the upper tail of $\{r_{m}^{(d)}\}$ by a generalized Pareto fit to stabilize variance, yielding smoothed weights ${\tilde{w}}_{m}^{(d)}$ that are normalized. The leave-one-shootout-out expected log predictive density (elpd) is next estimated by

{\hat{elpd}}_{loo} = \sum_{m = 1}^{M} \log (\sum_{d = 1}^{D} {\tilde{w}}_{m}^{(d)} \exp \{ℓ_{m} (Θ^{(d)})\}) .

(21)
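The structure of (20) and (21) can be sketched in plain Python as follows. This is a simplified illustration, not the procedure implemented in the loo package: the generalized Pareto smoothing of the upper tail is omitted, so the weights are the raw normalized ratios $r_{m}^{(d)}$. The function names and the toy posterior draws are hypothetical.

```python
import math


def shootout_loglik(outcomes, probs):
    """Equation (20): Bernoulli log-likelihood summed over the kicks of one shootout.

    Assumes every probability lies strictly between 0 and 1.
    """
    return sum(math.log(p) if z == 1 else math.log(1.0 - p)
               for z, p in zip(outcomes, probs))


def loo_lpd_raw(logliks):
    """One summand of (21) with raw (unsmoothed) normalized importance weights.

    With r_d = exp(-l_d) and w_d = r_d / sum(r), the weighted sum
    sum_d w_d * exp(l_d) reduces to D / sum_d exp(-l_d).
    """
    neg = [-l for l in logliks]
    top = max(neg)
    log_sum_ratios = top + math.log(sum(math.exp(v - top) for v in neg))  # log-sum-exp
    return math.log(len(logliks)) - log_sum_ratios


def elpd_loo_raw(logliks_by_shootout):
    """Sum of the shootout-level contributions, as in (21)."""
    return sum(loo_lpd_raw(draws) for draws in logliks_by_shootout)


# Toy example: one shootout with four kicks and three hypothetical posterior draws,
# each draw implying a common conversion probability for all kicks.
outcomes = [1, 1, 0, 1]
example_logliks = [shootout_loglik(outcomes, [p] * 4) for p in (0.70, 0.75, 0.80)]
print(round(elpd_loo_raw([example_logliks]), 4))
```

The raw estimator is a harmonic-type mean of the per-draw likelihoods and can have very high variance when a few draws dominate, which is the motivation for the Pareto smoothing step described above.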

We report elpd differences relative to the best model and their standard errors using the usual delta method. We also compute stacking weights $\{w_{j}\}$ across the model set $\{M_{j}\}_{0 ⩽ j ⩽ 4}$ by maximizing the leave-one-out log predictive density of the convex mixture predictor (Yao et al., 2018):

\max_{w_{0}, \dots, w_{4}} \sum_{m = 1}^{M} \log (\sum_{j = 0}^{4} w_{j} \sum_{d = 1}^{D_{j}} {\tilde{w}}_{m, j}^{(d)} p (Z_{m} ∣ Θ_{j}^{(d)})) \quad \text{subject to} \quad w_{j} ⩾ 0, \sum_{j = 0}^{4} w_{j} = 1.

(22)

These stacking weights quantify how much each model contributes to the best LOO predictive mixture, and are particularly informative when models are similarly ranked by elpd.
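The optimization in (22) is a concave problem over the simplex, and a small sketch may help fix ideas. Below, the inner sums of (22) are assumed to be already available as log densities, one per shootout and model. The weights are then obtained with a fixed-point (EM-type) update for mixture proportions. This update is our own illustrative choice and is not claimed to be the optimizer used by any particular package.

```python
import math


def stacking_weights(loo_log_density, n_iter=500):
    """Maximize sum_m log(sum_j w_j * exp(loo_log_density[m][j])) over the simplex.

    loo_log_density[m][j] is the log LOO predictive density of shootout m under
    model j. Each row is rescaled by its maximum, which leaves the update unchanged
    and avoids underflow for very small densities.
    """
    scaled = []
    for row in loo_log_density:
        top = max(row)
        scaled.append([math.exp(v - top) for v in row])
    n_obs, n_models = len(scaled), len(scaled[0])
    w = [1.0 / n_models] * n_models          # start in the interior of the simplex
    for _ in range(n_iter):
        new_w = [0.0] * n_models
        for row in scaled:
            mix = sum(wj * pj for wj, pj in zip(w, row))
            for j, pj in enumerate(row):
                new_w[j] += w[j] * pj / mix  # responsibility of model j for this shootout
        w = [v / n_obs for v in new_w]
    return w


# Hypothetical input: model 0 predicts every shootout better than model 1.
dominant = stacking_weights([[math.log(0.5), math.log(0.1)]] * 4)
print([round(v, 4) for v in dominant])
```

When one model has the higher density for every shootout, the weights collapse onto it, which is the pattern seen whenever a single specification receives a stacking weight of one.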

Furthermore, because a shootout winner is a deterministic function of the penalty sequence, we form the leave-one-out predictive win probability by propagating kick-level probabilities through a Monte Carlo mapping. For the $d$ th posterior draw of the parameters, let $p_{m, t}^{(d)}$ denote the conversion probability implied by the estimated parameters and the observed covariates for the $m$ th shootout. With $B$ number of Monte Carlo replicates per draw, we compute

\hat{Pr} (Y_{m} = 1 ∣ Θ^{(d)}) = \frac{1}{B} \sum_{b = 1}^{B} I (\text{team 1 wins when kicks are simulated with } \{p_{m, t}^{(d)}\}_{t = 1}^{τ_{m}}) .

(23)

The PSIS-LOO predictive win probability for the $m$th shootout is then calculated by

{\hat{p}}_{m}^{loo} = \sum_{d = 1}^{D} {\tilde{w}}_{m}^{(d)} \hat{Pr} (Y_{m} = 1 ∣ Θ^{(d)}) .

(24)
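The mapping in (23) and (24) can be sketched as below. Several details are assumptions of this sketch rather than statements about our implementation: kicks alternate between the two teams, team 1 is the team shooting first, the regular phase has five kicks per team with early stopping once the result is decided, sudden death proceeds in pairs, and the last supplied probability of a team is reused if sudden death runs past the supplied list. A fair coin resolves the rare case where the round cap is reached.

```python
import random


def simulate_winner(p_first, p_second, rng, regular_rounds=5, max_rounds=60):
    """Return 1 if the team shooting first wins one simulated shootout, else 0."""
    goals_first = goals_second = 0
    for rnd in range(max_rounds):
        pa = p_first[min(rnd, len(p_first) - 1)]
        pb = p_second[min(rnd, len(p_second) - 1)]
        goals_first += int(rng.random() < pa)
        if rnd < regular_rounds:
            # After the first team's kick: kicks still available to each side
            left_first = regular_rounds - rnd - 1
            left_second = regular_rounds - rnd
            if goals_first > goals_second + left_second:
                return 1
            if goals_second > goals_first + left_first:
                return 0
        goals_second += int(rng.random() < pb)
        if rnd < regular_rounds:
            left = regular_rounds - rnd - 1
            if goals_first > goals_second + left:
                return 1
            if goals_second > goals_first + left:
                return 0
        elif goals_first != goals_second:
            # Sudden death: decided as soon as a completed pair of kicks differs
            return 1 if goals_first > goals_second else 0
    return int(rng.random() < 0.5)  # assumption: coin flip if the cap is reached


def win_probability(p_first, p_second, n_rep, rng):
    """Equation (23): Monte Carlo win probability for one posterior draw."""
    return sum(simulate_winner(p_first, p_second, rng) for _ in range(n_rep)) / n_rep


def loo_win_probability(draw_probs, weights, n_rep, rng):
    """Equation (24): weighted average over draws, with normalized weights."""
    return sum(w * win_probability(pf, ps, n_rep, rng)
               for w, (pf, ps) in zip(weights, draw_probs))


rng = random.Random(11)
print(win_probability([0.75] * 5, [0.75] * 5, 5000, rng))  # hypothetical equal teams
```

With identical kick probabilities for both teams the simulated win probability is close to one half, so any departure from 0.5 in (24) reflects differences in the modeled kick-level probabilities.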

We finally summarize performances using two proper scoring rules (Gneiting and Raftery, 2007) at the shootout level:

Brier = \frac{1}{M} \sum_{m = 1}^{M} {({\hat{p}}_{m}^{loo} - Y_{m})}^{2}, log-score = \frac{1}{M} \sum_{m = 1}^{M} [Y_{m} \log ({\hat{p}}_{m}^{loo}) + (1 - Y_{m}) \log (1 - {\hat{p}}_{m}^{loo})] .

(25)

Uncertainty in these scores is assessed by a nonparametric block bootstrap at the shootout level. Essentially, we resample the $M$ shootouts with replacement, recompute each score on the resample, and report the median with a central 95% interval across 1000 replicates. This respects within-shootout dependence and provides transparent uncertainty on the model ranking.
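The two scores in (25) and the shootout-level bootstrap can be sketched as follows. The function names, the clipping constant that guards against taking the logarithm of zero, and the toy data are assumptions of this sketch. The example uses the random baseline, for which every predictive probability equals 0.5.

```python
import math
import random
import statistics


def brier_score(p, y):
    """Mean squared difference between win probabilities and outcomes."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)


def log_score(p, y, eps=1e-12):
    """Mean log predictive probability of the realized winner (larger is better)."""
    total = 0.0
    for pi, yi in zip(p, y):
        pi = min(max(pi, eps), 1.0 - eps)   # assumption: clip to avoid log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)


def bootstrap_interval(score_fn, p, y, n_boot=1000, seed=1):
    """Median and central 95% interval from resampling shootouts with replacement."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(score_fn([p[i] for i in idx], [y[i] for i in idx]))
    cuts = statistics.quantiles(stats, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
    return statistics.median(stats), cuts[0], cuts[-1]


# Random baseline on 230 hypothetical shootout outcomes
p_random = [0.5] * 230
y_toy = [1, 0] * 115
print(brier_score(p_random, y_toy), round(log_score(p_random, y_toy), 3))
print(bootstrap_interval(brier_score, p_random, y_toy, n_boot=200))
```

Because the baseline assigns 0.5 to every shootout, its scores do not depend on which shootouts are resampled, so its bootstrap interval is degenerate.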

It is worth emphasizing that these comparison criteria target related but not identical predictive tasks. The PSIS-LOO approach evaluates how well a model predicts the full sequence of kick outcomes, aggregated at the shootout level, and therefore rewards a good description of the entire data-generating process. Stacking weights summarize which models contribute most to the best leave-one-out predictive mixture under that same objective. In contrast, Brier score, log-score, and classification accuracy are based on the leave-one-out predictive probability of the final shootout winner, which is a path-dependent functional of the kick sequence. Consequently, it is possible for one model to describe kick-level likelihoods better, while another performs better at predicting the winner. Such differences should therefore be interpreted as reflecting distinct predictive targets rather than as contradictions across evaluation methods.

To wrap up the discussion in this section, we revisit the skill versus chance question. Under our framework, a model that meaningfully captures persistent skill should deliver higher elpd, better Brier and log scores than a random baseline model, and stacking weights that concentrate on specifications with nonzero random-effect variances. Conversely, if elpd differences are small, stacking allocates substantial weight to the simplest models, and the Brier or log-scores are comparable to the random baseline, this provides quantitative evidence that, conditional on observed context, shootout outcomes are largely indistinguishable from chance.

Results

Data and exploratory analysis

We assemble a kick-level panel of penalty shootouts from European club football. The raw data is extracted from the Transfermarkt website.³ The cleaned and processed data is publicly available in a GitHub repository maintained by the corresponding author.⁴ The dataset provides match details for leagues and cups from 14 countries in Europe, along with the European club competitions Champions League, Europa League and Conference League, for the last 12 seasons (we consider all matches until the last completed season, i.e., 2024/25). In this entire set, we focus on the knockout matches in the domestic and European cups since the concept of penalty shootout does not apply for domestic leagues or in the group matches. Further, the ‘‘Penalty shoot-out’’ sections of the corresponding Transfermarkt match URLs are parsed to ensure consistency and accuracy of the data. For each match, the lineups are extracted to map each shooter to a team and to identify the opposing goalkeeper. In the process, name variants are resolved and players are de-duplicated across matches. Finally, we exclude matches lacking a full shootout listing or with conflicting entries, and retain the information of all complete shootouts with unambiguous per-kick records. This leads to an overall collection of 2356 penalty kicks from 230 matches during the mentioned period. The overall conversion rate is approximately 75.1%.

To better understand the distribution of penalty kicks, in Figure 1, we present the number of penalties in each season in the top left panel, and the number of shootouts in each season in the bottom left panel. These plots give a slight indication that the number of matches that end in shootouts is generally increasing over the years. However, the conversion rate has largely remained around 75%, except in 2018, when it dropped to an uncharacteristically low value of 67.2%. From the top right panel of the same figure, we note that the number of penalties per shootout has mostly stayed around 9 or 10, although a large proportion of shootouts went to sudden death. Numerically, we note that the number of penalties in the shootouts has both mean and median around 10, about 80% of the shootouts end within 12 penalties, and the maximum number of penalties recorded in our dataset for one shootout is 22. If we further look at the conversion rate for different penalty kicks in the sequence (refer to bottom right panel of Figure 1), there is a decreasing trend, which directly aligns with the idea that the best penalty-takers typically go in the first part of the sequence (McGarry and Franks, 2000).

Figure 1.

(Top left) Distribution of penalty kicks for different seasons, along with the overall conversion rates. (Bottom left) Number of shootouts across the seasons. (Top right) Number of matches corresponding to number of kicks in the shootouts for the entire data. (Bottom right) Conversion rates at different points of the sequence of the shootouts.

In our main analysis, we use four key categorical covariates: shooting order to capture first mover advantage, pre-kick score status, stage, and competition type. Further, as discussed in Section “A brief review of relevant literature”, team strength is often found to be a determinant behind their success rates in penalty shootouts. While we plan to capture this primarily through the random effect $c_{v}$ , as an additional team-level proxy for organizational scale and status, we include the home-stadium seating capacity of the shooting team as a fixed effect (matched by team and season). Stadium capacity is strongly correlated with long-run fan base, facilities, and financial resources. Thus, it serves as a reasonable surrogate for club size in the absence of direct measures such as wage bill, market value, or rating indices (refer to Van Ours, 2021, who adopted a similar strategy). UEFA Intelligence Centre⁵ also notes that such variables can be utilized as key barometers of club health, underscoring the link between stadium infrastructure and a club’s scale. In terms of implementation in our model, we use the transformation $\log (capacity)$ to mitigate scale effects and outliers. We emphasize that this covariate, which we shall refer to as “club scale” hereafter, is descriptive rather than causal: it captures systematic effect of the larger clubs, not venue-specific crowd effects at the match site. Indeed, we do not intend to interpret this variable as a direct determinant of the technical quality of the players on the pitch. Rather, it is intended to capture one observable dimension of long-run institutional stature. In our model, this fixed effect and the team-specific random effect play different roles: the former represents measured team-level structure, while the latter absorbs residual unobserved heterogeneity at the club level.

Below, in Table 1, the conversion rates across different labels of the four categorical covariates are summarized. At a descriptive level, conversion does not appear to differ by shooting order: kicks taken first in a round convert at 74.9% versus 75.3% for the team shooting second. We also perform an analysis of variance (ANOVA) test to detect significant differences between the two samples, and in this case the $p$-value is found to be 0.8462, indicating statistically similar rates for the two cases. The same inference is drawn for the pre-kick score status, as we see close conversion rates for cases when the shooter’s team is leading (74.1%), or trailing (74.4%), or the scoreline is equal (76.0%). Differences across tournament types are also negligible. By contrast, rates vary more markedly across tournament stages: while finals show a conversion rate of 77.0%, the rates are lower in quarter-finals and round of 16 matches, and drop further for the semi-final matches. According to ANOVA, these differences are significant. We however note that the empirical approach adopted here ignores binomial sampling variability at the kick level, heteroskedasticity arising from unequal group sizes, and clustering by shooter, goalkeeper, team, and match. Thus, it is critical to look at the formal inference relying on the proposed model, which accounts for kick-level variability, persistent effects, and within-shootout dependence. These results are presented next.

Table 1.

Summary of conversion rate according to different labels of the covariates.

Variable	Label	Count	Conversion rate (%)	ANOVA $p$ -value
Order	Shooting first	1220	74.9	0.8462
	Shooting second	1136	75.3
ScoreStatus	Leading	301	74.1	0.6662
	Trailing	982	74.4
	Equal	1073	76.0
Stage	Final	461	77.0	0.006 $*$
	Semi-final	302	66.9
	Quarter-final	358	74.0
	Round of 16	419	75.2
	Other matches	816	77.5
Tournament type	Domestic cup	1680	74.8	0.6351
	International cup	391	77.0
	Other tournaments	285	74.4

The $p$ -values in the ANOVA procedure indicate if empirically there is any significant difference (Note: $*$ indicates significance at 5% level).
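As a quick consistency check on Table 1, the group-level rates can be pooled back to the overall conversion rate. The short sketch below uses only the counts and rates printed in the table; the variable and function names are ours.

```python
# (count, conversion rate in %) for every label, copied from Table 1
table1 = {
    "Order": [(1220, 74.9), (1136, 75.3)],
    "ScoreStatus": [(301, 74.1), (982, 74.4), (1073, 76.0)],
    "Stage": [(461, 77.0), (302, 66.9), (358, 74.0), (419, 75.2), (816, 77.5)],
    "Tournament type": [(1680, 74.8), (391, 77.0), (285, 74.4)],
}


def pooled_rate(groups):
    """Total number of kicks and count-weighted average conversion rate (in %)."""
    total = sum(n for n, _ in groups)
    return total, sum(n * r for n, r in groups) / total


pooled = {name: pooled_rate(groups) for name, groups in table1.items()}
for name, (total, rate) in pooled.items():
    print(f"{name}: {total} kicks, pooled rate {rate:.2f}%")
```

Each covariate partitions the same 2356 kicks, and the pooled rates agree with the overall conversion rate of approximately 75.1% up to the rounding of the tabulated percentages.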

Main analysis

We start with a summarization of the posterior samples obtained from the proposed hierarchical logistic model ( $M_{0}$ ), focusing on the estimated parameters and associated uncertainty. In Table 2, first, the posterior mean of the fixed effects and their standard error, along with the posterior standard deviation, posterior median and 95% credible intervals are reported. Recall that the coefficients are on the log-odds (logit) scale. For the four categorical features, baseline levels correspond to kicks being taken by the team shooting second, at score parity, in other matches (i.e., not knockout stages explicitly listed) and in other tournaments (not domestic or international cups). In the same table, we also report the same posterior summaries for the standard deviation parameters corresponding to the random effects and the latent AR(1) component.

Table 2.

Posterior summaries for the proposed model ( $M_{0}$ ): reported are the posterior mean, standard error (SE) of the mean, posterior standard deviation (SD), and the posterior median with its 95% credible interval in parentheses.

Parameter	Mean	SE (mean)	SD	Median (Credible interval)
Intercept	$1.16$	$0.01$	$1.01$	$1.15 (- 0.83, 3.12)$
Order (shooting first)	$- 0.85$	$0.05$	$2.48$	$- 0.83 (- 5.99, 4.28)$
Score status: Leading	$- 0.80$	$0.05$	$2.96$	$- 0.74 (- 6.86, 5.00)$
Score status: Trailing	$- 1.34$	$0.04$	$2.43$	$- 1.33 (- 6.16, 3.68)$
Stage: Final	$0.55$	$0.05$	$2.88$	$0.44 (- 4.79, 6.54)$
Stage: Semi-final	$- 8.55$	$0.04$	$3.19$	$- 8.37 (- 15.26, - 2.84)$
Stage: Quarter final	$- 1.99$	$0.04$	$2.86$	$- 2.03 (- 7.67, 4.00)$
Stage: Round of 16	$- 1.11$	$0.05$	$2.82$	$- 1.16 (- 6.65, 4.75)$
Tournament: Domestic cup	$1.94$	$0.04$	$2.75$	$1.92 (- 3.60, 7.46)$
Tournament: International cup	$2.00$	$0.04$	$3.13$	$1.95 (- 4.34, 8.34)$
Club scale	$2.90$	$0.03$	$1.36$	$2.65 (1.01, 6.26)$
$σ_{a}$ (shooter SD)	$0.29$	$0.01$	$0.37$	$0.19 (0.01, 1.20)$
$σ_{b}$ (keeper SD)	$0.27$	$0.00$	$0.30$	$0.19 (0.01, 1.00)$
$σ_{c}$ (team SD)	$0.27$	$0.00$	$0.29$	$0.19 (0.01, 1.01)$
$σ_{u}$ (noise SD)	$44.18$	$0.49$	$20.19$	$40.17 (16.93, 95.78)$
$ϕ$ (AR coefficient)	$- 0.06$	$0.00$	$0.04$	$- 0.06 (- 0.15, 0.02)$

The intercept is positive, with posterior median $1.15$ and 95% credible interval $(- 0.83, 3.12)$. On the probability scale, this corresponds to a baseline conversion probability near $0.76$, consistent with modern football (Veldkamp and Koning, 2023). The estimated first-mover effect is negative but very imprecise: the posterior mean of the effect of shooting first as compared to shooting second is $- 0.85$ (posterior median $- 0.83$), implying an odds ratio of about $\exp (- 0.85) \approx 0.43$ at the posterior mean. Considering that the first-mover advantage is a central issue in the penalty-shootout literature, we find it critical to point out that the corresponding 95% credible interval $(- 5.99, 4.28)$ is extremely wide and comfortably includes the null value, so the effect cannot be distinguished from zero. Thus, conditional on the covariates, random effects, and the latent AR component, the posterior distribution provides no compelling evidence of a systematic first-mover advantage in kick conversion. Pre-kick score status effects are likewise modest relative to their uncertainty. The posterior mean of the coefficient for the trailing situation is $- 1.34$, suggesting that being behind may depress conversion odds somewhat, but the credible interval still admits negligible effects once other terms are included. Stage effects show a structured signal: semi-final matches carry a distinctly negative association, with a posterior median of $- 8.37$ and a credible interval that excludes zero. This is consistent with the descriptive dip seen in Table 1. To our knowledge, prior work has not directly compared psychological pressure between semifinals and finals; our dataset suggests semifinals may elicit stronger pressure, consistent with general choking-under-pressure mechanisms. The Final effect is mildly positive with posterior median $0.44$, whereas the parameters for Quarter final and Round of 16 are slightly negative with intervals straddling zero.
Tournament-type effects are also imprecise and compatible with negligible differences once the other factors are controlled. In contrast, the scale of the club turns out to be highly significant, offering a positive impact on the conversion rates. It indicates that while some variability in the data can be attributed to the random effect of the teams (see the discussions below), the scale of the club can indeed help with handling the pressure of shootouts in a positive way.
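Since the coefficients in Table 2 are on the logit scale, the conversions quoted above follow from the inverse-logit and exponential maps. The sketch below reproduces them from the tabulated values; the helper name `inv_logit` is ours.

```python
import math


def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Intercept: posterior median and 95% credible interval from Table 2
baseline = inv_logit(1.15)
baseline_interval = (inv_logit(-0.83), inv_logit(3.12))

# Shooting first: odds ratio implied by the posterior mean of -0.85
odds_ratio_first = math.exp(-0.85)

print(f"baseline probability: {baseline:.2f}")
print(f"interval on the probability scale: "
      f"({baseline_interval[0]:.2f}, {baseline_interval[1]:.2f})")
print(f"odds ratio (shooting first): {odds_ratio_first:.2f}")
```

Mapping the interval endpoints as well shows how wide the uncertainty is on the probability scale, which is why point estimates alone should not be over-interpreted here.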

Turning attention to the standard deviations of the random effects, we notice that they are modest on the logit scale: $σ_{a} = 0.29$ (shooters), $σ_{b} = 0.27$ (goalkeepers), and $σ_{c} = 0.27$ (teams); but the residual scale is extremely large: $σ_{u} = 44.18$ . The autoregressive parameter is close to zero, suggesting negligible within-shootout serial dependence after accounting for covariates and random effects. In combination, these features imply that the AR(1) state contributes little persistence while its innovation scale effectively acts as an unstructured over-dispersion term that absorbs most variation. Substantively, this pushes $M_{0}$ toward a chance-dominated description of kick outcomes: persistent heterogeneity attributable to shooters, keepers, or teams is small relative to the idiosyncratic component captured by $σ_{u}$ .

While the above points towards shootouts being a game of chance, because an exceedingly large latent scale can mask genuine structure in fixed and random effects, our inferential strategy does not rely on $M_{0}$ alone. Instead, as mentioned in Section “Inference, model comparison, and interpretation”, we are going to contrast $M_{0}$ with reduced specifications using various measures. If simpler models attain comparable predictive performance and stacking weights concentrate on the parsimonious specifications, it constitutes quantitative evidence that, conditional on observed context, shootout outcomes are largely indistinguishable from chance; conversely, sustained gains for models retaining random effects would support a non-negligible role for persistent skill. We present these results in Table 3. Recall that $elpd$ tells us how well a model is expected to predict unseen data. We compute it via PSIS-LOO and report $elpd$ differences as it lets us rank models on a common predictive scale. On the other hand, stacking weights choose the best convex combination of models (a mixture) to maximize LOO predictive performance. They show how much each model contributes to the best predictor. Classification accuracy is the fraction of correctly predicted winners (threshold 0.5) for the LOO predictive probability.

Table 3.

Comparison of the models using ${elpd}_{diff}$ (relative to the best, larger is better) with its standard error in parentheses, stacking weights, classification accuracy, Brier scores and log-scores for leave-one-shootout-out predictive performance (95% bootstrap intervals reported in parentheses).

Model	${elpd}_{diff}$ (SE)	Stacking weight	Brier (95% CI)	Log score (95% CI)	Class. accuracy
$M_{0}$	$- 67348.7 (2370.6)$	$0.000$	$0.274 (0.246, 0.299)$	$- 2.90 (- 3.50, - 2.28)$	$0.600$
$M_{1}$	$- 1048.4 (25.1)$	$0.000$	$0.237 (0.225, 0.248)$	$- 0.667 (- 0.690, - 0.641)$	$0.604$
$M_{2}$	$0.0 (0.0)$	$1.000$	$0.277 (0.250, 0.304)$	$- 1.34 (- 1.56, - 1.10)$	$0.578$
$M_{3}$	$- 746.0 (24.4)$	$0.000$	$0.223 (0.212, 0.233)$	$- 0.635 (- 0.657, - 0.613)$	$0.691$
$M_{4}$	$- 1051.5 (25.1)$	$0.000$	$0.236 (0.225, 0.247)$	$- 0.665 (- 0.687, - 0.641)$	$0.622$
Random	—	—	$0.250 (0.250, 0.250)$	$- 0.693 (- 0.693, - 0.693)$	$0.470$

The grouped PSIS-LOO comparison ranks $M_{2}$ (team random effect with iid noise) at the top in terms of shootout-grouped expected log predictive density. The corresponding stacking weight is $1$, with all other models receiving negligible weight. In contrast, models that include serial dependence or richer random effects structure exhibit substantial $elpd$ deficits, and the fully specified $M_{0}$ is heavily penalized by PSIS-LOO. This indicates that, when the target is the kick-sequence likelihood aggregated at the shootout level, a parsimonious structure with team-level heterogeneity plus iid over-dispersion fits most stably. Interestingly, decision-oriented predictive scoring yields a complementary perspective. Using leave-one-shootout-out win probabilities as the target, $M_{3}$ (model with AR(1) state in the noise, but no random effects) attains the best predictive performance according to classification accuracy, Brier score and log-score. Models $M_{1}$ and $M_{4}$ perform similarly to one another and improve modestly upon the random baseline, while $M_{2}$ has the worst Brier score and classification accuracy among the structured specifications for predicting the ultimate winner, despite its $elpd$ dominance. The fully specified $M_{0}$ also underperforms on decision metrics and has the worst log-score, consistent with its very large residual scale concentrating probability mass away from the empirically realized sequences.

These two findings are not contradictory; they reflect different targets. Grouped $elpd$ rewards models that describe the distribution of kick outcomes well (penalizing heavy tails and instability), whereas the winner prediction is a path-dependent functional of the sequence, sensitive to clustering of successes and failures and to the order of kicks. The fact that $M_{3}$ performs best on winner prediction suggests that capturing within-shootout clustering via a simple latent state can be beneficial for the decision-level target, even if the same model is not preferred by $elpd$ . Conversely, the team heterogeneity in model $M_{2}$ improves likelihood fit but does not translate into better out-of-sample winner prediction.

As a final piece of discussion, we compute the $SkillShare$ values for the three models $M_{0}, M_{1}, M_{2}$ using (17), and present them in Figure 2. Since the other two models do not have specific random effects, the same cannot be computed for $M_{3}$ or $M_{4}$ . Across specifications, the estimated $SkillShare$ , i.e., the fraction of logit-scale variance attributable to persistent random effects, is minimal. In the complete model $M_{0}$ , it is near zero, with median $\approx 0.014 %$ , indicating that within-shootout idiosyncrasy dominates kick-level variation. After removing the serial dependence, $M_{1}$ reallocates a meaningful but still non-dominant portion to persistent heterogeneity (median $\approx 15.3 %$ ). Finally, the team-only specification $M_{2}$ yields virtually zero $SkillShare$ (median $\approx 0.0019 %$ ). Taken together, these values suggest that conditional on the observed covariates and under our modeling assumptions, penalty shootouts appear to be predominantly driven by transient and idiosyncratic variation, with only limited evidence of persistent individual skill. In other words, a low $SkillShare$ means that, relative to transient variation on the latent logit scale, persistent heterogeneity explains only a small proportion of total variation. Furthermore, depending on different models, we can say that skill is present but its effect is mild and not robust across model specifications.

Figure 2.

$SkillShare$ values for the three models $M_{0}, M_{1}, M_{2}$ .

Overall, for the central question—skill versus chance—three quantitative signals emerge. First, across models that retain persistent heterogeneity by defining random effects of shooter, keeper, or team, the gains in decision-level predictive accuracy over the random baseline are small in absolute terms (e.g., improvements in Brier score are around $0.01$ to $0.03$ ). Second, the model that predicts winners best in our framework, $M_{3}$ , does so without any persistent random effects; its advantage is consistent with transient within-shootout dynamics (clustering, or momentum) rather than stable, player-specific skill. Third, the stacked model places all weight on $M_{2}$ , indicating that, when the goal is to represent the kick sequence likelihood, systematic structure at the club level suffices; and any player-specific variances are not needed to improve $elpd$ materially. With these in mind, along with the earlier variance decomposition results showing small random effect scales relative to the idiosyncratic component, we get substantial evidence in favor of a chance-dominated interpretation of penalty shootouts conditional on observed context. There is limited but detectable structure: team-specific effects help stabilize the likelihood fit, while a simple serial mechanism improves winner prediction. However, we find no strong, robust signal of persistent individual skill that materially changes shootout outcomes. From a practical standpoint, the improvements over a coin-flip baseline are statistically real but modest, suggesting that even with careful modeling the role of chance remains preponderant.

Model diagnostics

Our key conclusions in the previous section are primarily based on the output of $M_{0}$, but we recognize that they could be an artifact of the latent AR component absorbing variation that might otherwise be attributed to persistent heterogeneity. To examine this possibility formally, we compute posterior correlations between the stationary AR variance $σ_{u}^{2} / (1 - ϕ^{2})$ and the random-effect variances $σ_{a}^{2}$, $σ_{b}^{2}$, and $σ_{c}^{2}$. These correlations are all extremely close to zero, both on the Pearson scale (between $- 0.018$ and $0.004$) and on the Spearman scale (between $- 0.024$ and $0.016$). The correlation between the AR variance and the total persistent variance is also negligible. This suggests that, within the posterior distribution, the AR component is not strongly trading off against the other variance components. Therefore, the near-zero $SkillShare$ in the full model does not appear to arise simply from mechanical over-allocation of variation to the latent AR term. Rather, it is consistent with the broader evidence from the reduced-model comparisons and predictive evaluation, which together indicate that persistent skill plays only a limited role relative to transient and idiosyncratic variation.
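The correlation diagnostic described above can be sketched on generic posterior draws. The code below implements Pearson and Spearman correlations directly and applies them to simulated, independent draws. These draws are purely hypothetical stand-ins for MCMC output and are not taken from our fitted model.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical, independent draws standing in for posterior samples
rng = random.Random(3)
sigma_u = [rng.uniform(20.0, 80.0) for _ in range(2000)]
phi = [rng.uniform(-0.15, 0.02) for _ in range(2000)]
sigma_a = [abs(rng.gauss(0.0, 0.3)) for _ in range(2000)]

ar_var = [s ** 2 / (1.0 - p ** 2) for s, p in zip(sigma_u, phi)]  # stationary AR variance
shooter_var = [s ** 2 for s in sigma_a]
print(round(pearson(ar_var, shooter_var), 3), round(spearman(ar_var, shooter_var), 3))
```

A strongly negative correlation between the two variance series would indicate that the sampler trades one component off against the other; values near zero, as reported above, point away from such a trade-off.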

Next, acknowledging that our analysis relies on the Bayesian implementation with the aforementioned prior specifications, we find it relevant to assess the robustness of our main conclusions to the choices of the priors. Here, we present a brief yet focused prior sensitivity analysis around the baseline setting used in the main model. Specifically, we vary three classes of hyperparameters while keeping the likelihood and the model structure unchanged. First, to examine the effect of stronger or weaker regularization on the fixed effects, we change the Gaussian prior scale $ξ$ from its baseline value of $5$ to $2.5$ (this setting is identified as tighter_beta) and $10$ (looser_beta). Second, to study sensitivity to shrinkage of the variance components, we alter the common half-$t$ scale for $σ_{a}, σ_{b}, σ_{c},$ and $σ_{u}$ from $0.25$ to $0.10$ (tighter_sigma) and $0.50$ (looser_sigma), while keeping the degrees of freedom fixed at $3$. Third, to evaluate the effect of stronger prior concentration on the serial-dependence parameter, we reduce the prior standard deviation of $ϕ$ from $1.0$ to $0.3$ (tighter_phi). These perturbations are chosen to provide a moderate but substantively meaningful range of weakly informative priors, so that the sensitivity analysis probes whether the main findings are stable to plausible alternative regularization choices rather than being driven by a single specification. For implementation, in the interest of computation time, we run three independent NUTS chains and ensure convergence in each case. Below, Table 4 summarizes the prior sensitivity analysis for a selected set of key parameters. For other parameters, the results largely remain the same and are hence omitted for brevity.

Table 4.

Sensitivity of key posterior summaries in model $M_{0}$ under alternative prior specifications.

Setting	$SkillShare$	$σ_{u}$	$ϕ$	Order (shooting first)
$M_{0}$ (original priors)	$0.00014 (0.000006, 0.0032)$	$41.2 (17.0, 98.9)$	$- 0.064 (- 0.147, 0.021)$	$- 0.70 (- 5.59, 4.22)$
tighter_beta	$0.00052 (0.00002, 0.015)$	$19.9 (8.19, 49.8)$	$- 0.065 (- 0.147, 0.021)$	$- 0.36 (- 3.01, 2.04)$
looser_beta	$0.00004 (0.000002, 0.0007)$	$78.3 (33.7, 177.0)$	$- 0.064 (- 0.143, 0.018)$	$- 1.55 (- 10.6, 8.41)$
tighter_sigma	$0.0183 (0.0008, 0.271)$	$0.08 (0.003, 0.47)$	$- 0.163 (- 0.954, 0.852)$	$- 0.08 (- 0.31, 0.14)$
looser_sigma	$0.00054 (0.00003, 0.013)$	$40.0 (16.9, 96.5)$	$- 0.064 (- 0.149, 0.020)$	$- 0.86 (- 5.65, 4.11)$
tighter_phi	$0.00013 (0.000006, 0.0027)$	$41.4 (17.6, 104.0)$	$- 0.062 (- 0.144, 0.021)$	$- 0.73 (- 6.02, 4.37)$

Note: Reported are posterior medians with 95% credible intervals in parentheses.

We can observe that the main substantive conclusions of the paper are broadly robust to the range of prior specifications, although the magnitude of some posterior summaries naturally changes when the priors are made substantially tighter or looser. Throughout, the posterior $SkillShare$ remains very small. This supports the main finding that, in the full model $M_{0}$, persistent heterogeneity attributable to shooters, goalkeepers, and teams explains only a very small proportion of the total latent variation once the AR(1) component and the logistic residual variance are taken into account. Varying the prior scale for the fixed effects changes the absolute magnitudes of several regression coefficients (and of $σ_{u}$, whose posterior median ranges from $19.9$ under tighter_beta to $78.3$ under looser_beta), but does not alter the qualitative message. The most substantial sensitivity appears when the priors on the variance parameters are tightened. Under tighter_sigma, the posterior medians of $σ_{b}$, $σ_{c}$, and especially $σ_{u}$ are pulled strongly downward, and the median $SkillShare$ increases to approximately $0.018$, with a much wider upper tail (the upper limit of the 95% credible interval reaches $0.271$). This is not surprising, since stronger shrinkage on the scale parameters mechanically reduces the amount of latent variance that can be attributed to the AR(1) component. Even in this case, however, the posterior median $SkillShare$ remains small in absolute terms, and the qualitative interpretation continues to favor a predominantly chance-driven view. We also observe that the prior on the autoregressive coefficient $ϕ$ has very little impact on the substantive conclusions. Overall, the parameter summaries reinforce that the principal empirical message of the paper is robust.
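To see why shrinking $σ_{u}$ mechanically raises $SkillShare$, the sketch below evaluates a variance share on the logit scale. The formula is an assumption reconstructed from the verbal description above (persistent variance over persistent variance plus stationary AR(1) variance plus the logistic residual variance $π^{2}/3$); it is not copied from the paper's definition. The persistent scales used are hypothetical, and only $σ_{u}$ and $ϕ$ are taken from the medians in Table 4.

```python
import math

LOGISTIC_VAR = math.pi ** 2 / 3  # variance of the standard logistic distribution


def skill_share(sigma_a, sigma_b, sigma_c, sigma_u, phi):
    """Assumed form: persistent / (persistent + stationary AR(1) + logistic residual)."""
    persistent = sigma_a ** 2 + sigma_b ** 2 + sigma_c ** 2
    ar_var = sigma_u ** 2 / (1 - phi ** 2)
    return persistent / (persistent + ar_var + LOGISTIC_VAR)


# Hypothetical persistent scales, held fixed so that only the AR term changes.
persistent_scales = (0.3, 0.1, 0.1)

# sigma_u and phi set to the posterior medians reported in Table 4.
share_original = skill_share(*persistent_scales, sigma_u=41.2, phi=-0.064)
share_tighter_sigma = skill_share(*persistent_scales, sigma_u=0.08, phi=-0.163)

print(f"original priors: {share_original:.5f}")
print(f"tighter_sigma:   {share_tighter_sigma:.5f}")
```

Plugging medians into the formula is not the same as taking the posterior median of $SkillShare$, so the printed values should not be compared with Table 4 digit by digit; the point is only the direction of the effect when the AR variance collapses.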

As another diagnostic step, we note that the posterior computation was stable across all reported parameters. In particular, the split-$\hat{R}$ values were close to 1 for all main parameters, indicating satisfactory chain mixing, and the effective sample sizes were sufficiently large for reliable posterior summarization. Together with the absence of divergent transitions and pathological energy behavior, these diagnostics suggest that the NUTS sampler explored the posterior distribution adequately. Furthermore, as a representative posterior predictive check, we compare the observed conversion rate at each kick position with the corresponding posterior predictive distribution under the fitted model (see Figure 3). The observed rates are closely aligned with the posterior predictive medians across the sequence, and all observed values lie within the 95% posterior predictive intervals. This indicates that the model captures the main sequential pattern in kick success rates reasonably well. The uncertainty bands widen in later kicks, which is expected because relatively few shootouts extend that far into the sequence.

Figure 3.

Posterior predictive check based on conversion rates by kick number. The shaded band represents the 95% interval for the median under $M_{0}$ .

A by-product: Ranking of shooters and goalkeepers

A natural by-product of the hierarchical specification in $M_{0}$ is a set of posterior distributions for the individual random effects of shooters ($a_{s}$) and goalkeepers ($b_{g}$). Because these effects are centered and exchangeable with sum-to-zero constraints, they are interpretable as deviations from the population average conditional on the observed covariates. To communicate these effects in a way that is useful for practitioners, we construct player-level rankings together with associated uncertainty.

Specifically, for each posterior draw, we rank shooters by their effect $a_{s}$ (larger value implies higher conversion tendency) and goalkeepers by $-b_{g}$ (larger value implies stronger shot-stopping tendency). This yields a posterior distribution of ranks for every player, from which we report the mean rank and a credible interval. In addition, we map the effects to a reference conversion probability by evaluating $p_{s}^{⋆} = \operatorname{logit}^{-1}(x^{⊤} β + a_{s})$ for shooters and $p_{g,\mathrm{save}}^{⋆} = 1 - \operatorname{logit}^{-1}(x^{⊤} β + b_{g})$ for goalkeepers, where $x$ is a neutral covariate profile (for illustration, we set it at shooting first, parity score, other matches, international cup, and average club scale). We also report the probabilities of being above average under this reference context. For visualization, we present a rank-interval forest plot in Figure 4 which shows the top 20 shooters and goalkeepers in the shootout context. Each player appears as a horizontal segment corresponding to the credible interval with a point at the posterior mean rank, and the color representing the probability of being above average.
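The per-draw ranking step can be sketched as follows. The draws are synthetic (three hypothetical players with known means), the function names are illustrative, and the probability of being above average is approximated here as the share of draws with a positive centered effect, which is an assumption about how that quantity is computed.

```python
import random
import statistics


def ranks_desc(values):
    """Rank players within one draw; rank 1 goes to the largest effect."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for pos, idx in enumerate(order, start=1):
        r[idx] = pos
    return r


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize_ranks(draws):
    """draws[d][p] is the effect of player p in posterior draw d."""
    rank_draws = [ranks_desc(d) for d in draws]
    summary = []
    for p in range(len(draws[0])):
        rk = sorted(rd[p] for rd in rank_draws)
        summary.append({
            "mean_rank": statistics.fmean(rk),
            "lo": quantile(rk, 0.025),
            "hi": quantile(rk, 0.975),
            # Assumed definition: posterior probability that the centered effect is positive.
            "p_above_avg": sum(d[p] > 0 for d in draws) / len(draws),
        })
    return summary


# Synthetic draws for three hypothetical shooters (not the paper's posterior).
rng = random.Random(7)
true_means = [0.5, 0.0, -0.5]
draws = [[rng.gauss(m, 0.2) for m in true_means] for _ in range(500)]
summary = summarize_ranks(draws)
for p, s in enumerate(summary):
    print(p, s)
```

For goalkeepers, the same function would be applied to draws of $-b_{g}$ so that rank 1 again corresponds to the strongest shot-stopping tendency.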

Figure 4.

Posterior ranks and credible intervals of the top 20 penalty takers (top) and the top 20 penalty stoppers (bottom) according to the proposed model.

We notice that a few famous players like Bruno Fernandes (captain of Manchester United during 2024–25), Robert Lewandowski (played at the top level in both Bundesliga and La Liga), Ousmane Dembélé (2025 Ballon d’Or winner), Martin Ødegaard (captain of Arsenal during 2024–25), Mohamed Salah (Liverpool’s key player during 2017–2025) and Pierre-Emerick Aubameyang (Arsenal star during 2018–2022) appear in the list of top penalty takers, with Fernandes leading the pack by a substantial margin. Numerically, these shooters displayed posterior means for $a_{s}$ between approximately $0.36$ and $0.67$ on the logit scale, which translate to multiplicative odds gains of around 1.43 to 1.95 relative to an average taker. Their conversion probability values lie around $0.85$ , which is a substantial improvement over the baseline probability of around $0.75$ . Among goalkeepers, interestingly, not many well-known names appear. The top rank is secured by former Germany and Borussia Dortmund goalkeeper Roman Weidenfeller, followed by his countryman Finn Dahmen. We indeed notice that the posterior means for $- b_{g}$ are quite small, in the range of $0.007$ to $0.018$ even for the top penalty stoppers. The bottom panel of Figure 4 also suggests that there is limited separation among keepers once uncertainty and shrinkage are accounted for. The reference save probabilities are extremely small as well, which is expected given the high baseline conversion.
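The conversion from a logit-scale effect to an odds ratio and to a probability can be checked with a few lines. This sketch assumes that the baseline probability of $0.75$ quoted above applies to the reference covariate profile; the function name `shifted_probability` is hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


def shifted_probability(baseline_p, effect):
    """Probability after adding a logit-scale effect to a baseline probability."""
    return inv_logit(logit(baseline_p) + effect)


baseline_p = 0.75  # assumed to be the reference-profile conversion probability
for a_s in (0.36, 0.67):
    odds_ratio = math.exp(a_s)  # multiplicative change in the odds of scoring
    p_star = shifted_probability(baseline_p, a_s)
    print(f"a_s={a_s:.2f}  odds ratio={odds_ratio:.2f}  p*={p_star:.3f}")
```

Under this assumption the two endpoints give odds ratios of about 1.43 and 1.95, and conversion probabilities of roughly 0.81 and 0.85.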

Although the above is an interesting by-product of our main analysis, it is important to interpret these rankings with caution. First, we observe that the rank uncertainty is substantial for both shooters and goalkeepers. This wide dispersion reflects limited per-player shootout exposure (the number of attempts for each shooter is less than 10 in our dataset while the number of attempts against a keeper is at most 20) and the intended regularization of partial pooling. Further, we find that many players have probabilities of being above average close to 0.5. Accordingly, the rankings should not be viewed as definitive statements about a stable underlying hierarchy of penalty-taking ability. Rather, they are best understood as probabilistic summaries that identify players whose posterior mass lies somewhat above or below the population average, while explicitly reflecting the substantial uncertainty inherent in shootout data. Indeed, these results also align with our main findings: persistent individual effects exist but are modest relative to contextual and idiosyncratic variation. Overall, the rankings are informative only to the extent that they help identify players whose posterior mass lies clearly above average. In the next subsection, we are going to see how such analysis may partially help in deducing an optimum order of penalty takers for specific matches.

Optimum order for shootout: A case study

It is a fair assumption that a team would choose their best five penalty takers for a shootout. Thus, in this section, we are going to see how the fitted hierarchical model $M_{0}$ can be used for an actionable recommendation, where we compute the win probability for each possible ordering of the five designated takers and select the order that maximizes this probability. For a given order $O$ and a choice of starting first or second, we simulate the sequence of kicks under $M_{0}$ . Unseen players, keepers, or teams are handled by treating their random effects as realizations from the corresponding Gaussian distributions. The objective is the posterior predictive win probability:

$$\hat{π}(O) = \frac{1}{D} \sum_{d=1}^{D} \frac{1}{B} \sum_{b=1}^{B} I\{\mathrm{win}(O; Θ^{(d)}, z^{(b)})\}, \tag{26}$$

where $Θ^{(d)}$ are draws from the posterior of $M_{0}$ and $z^{(b)}$ are Monte Carlo shootout paths. We evaluate all $5! = 120$ orders (for both starting positions) using a two-stage procedure: fast screening with small $(D, B)$ followed by refinement on the top candidates with larger $(D, B)$ and common random numbers. This is done to stabilize the comparisons and to manage computational burden. This simulation-based search procedure respects path dependence (score pressure and turn order) and interactions with the opposing goalkeeper, yielding a model-driven recommendation rather than relying on marginal player ranks alone.
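The two-stage search over orders can be sketched as below. This is a deliberately simplified stand-in rather than the simulation under $M_{0}$: each taker has a fixed scoring probability per posterior draw, and the score-status, goalkeeper, and AR(1) terms are omitted, so the order matters here only through early termination of the shootout. All names, the cap `MAX_ROUNDS`, the screening and refinement sizes, and the synthetic draws are assumptions.

```python
import itertools
import random

MAX_ROUNDS = 30  # hypothetical cap: 5 regular rounds plus sudden death


def team_wins(p_team, p_opp, u, team_first=True):
    """Play one path; u[2*r] and u[2*r + 1] drive the two kicks of round r."""
    goals_t = goals_o = 0
    for r in range(MAX_ROUNDS):
        scored_t = u[2 * r] < p_team[r % len(p_team)]
        scored_o = u[2 * r + 1] < p_opp[r % len(p_opp)]
        kicks = [("t", scored_t), ("o", scored_o)]
        if not team_first:
            kicks.reverse()
        taken_t = taken_o = r
        for side, scored in kicks:
            if side == "t":
                goals_t, taken_t = goals_t + scored, taken_t + 1
            else:
                goals_o, taken_o = goals_o + scored, taken_o + 1
            if r < 5:  # regular rounds: stop once the trailing side cannot catch up
                if goals_t > goals_o + (5 - taken_o):
                    return True
                if goals_o > goals_t + (5 - taken_t):
                    return False
        if r >= 4 and goals_t != goals_o:  # sudden death is settled after a full round
            return goals_t > goals_o
    return u[-1] < 0.5  # cap reached: break the tie with a fair coin


def win_probability(order, theta_draws, paths, team_first=True):
    """Monte Carlo estimate: average win indicator over draws d and paths b."""
    wins = 0
    for theta in theta_draws:
        p_team = [theta["team"][i] for i in order]
        for u in paths:
            wins += team_wins(p_team, theta["opp"], u, team_first)
    return wins / (len(theta_draws) * len(paths))


def best_order(theta_draws, n_takers=5, screen=(5, 40), refine=(20, 200), keep=10, seed=0):
    """Screen all orders with small (D, B), then refine the top `keep` with larger (D, B)."""
    rng = random.Random(seed)

    def make_paths(n_paths):
        # Common random numbers: every order is scored on the same uniforms.
        return [[rng.random() for _ in range(2 * MAX_ROUNDS + 1)] for _ in range(n_paths)]

    orders = list(itertools.permutations(range(n_takers)))
    paths = make_paths(screen[1])
    ranked = sorted(orders, key=lambda o: -win_probability(o, theta_draws[:screen[0]], paths))
    paths = make_paths(refine[1])
    return max((win_probability(o, theta_draws[:refine[0]], paths), o) for o in ranked[:keep])


# Synthetic "posterior draws" of per-player scoring probabilities (illustration only).
rng = random.Random(42)
means = [0.80, 0.78, 0.75, 0.72, 0.70]
theta_draws = [
    {"team": [min(0.99, max(0.01, rng.gauss(m, 0.03))) for m in means],
     "opp": [min(0.99, max(0.01, rng.gauss(0.75, 0.03))) for _ in range(5)]}
    for _ in range(20)
]
prob, order = best_order(theta_draws)
print(order, round(prob, 3))
```

Reusing the same uniforms for every candidate order removes most of the Monte Carlo noise from the differences between orders, which is the purpose of common random numbers in the refinement stage.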

For illustration purposes, we pick one of the most famous shootouts of 2025: the UEFA Champions League Round of 16 tie between Real Madrid and Atlético Madrid, two Spanish giants who faced each other in the second leg on 13th March 2025. After the two-legged fixture ended level on aggregate, the match went to a penalty shootout that was eventually won by Real Madrid. The original sequence of the shootout and how it unfolded in favor of Real Madrid are provided in Table 5. In the same table, we also show the sequence recommended by the above-mentioned procedure. Note that the fifth penalty taker of Atlético is not available in the data (as Table 5 shows, the shootout was decided before that kick), and therefore we work with four shooters for their team in this case study.

Table 5.

Original sequence (outcome given in parentheses) and suggested sequence (probability of above average conversion rate given in parentheses) of penalties in the shootout between Real Madrid and Atlético Madrid on 13th March 2025.

Original sequence		Recommended sequence
Real Madrid	Atlético Madrid	Real Madrid	Atlético Madrid
Kylian Mbappé (scored)	Alexander Sørloth (scored)	Jude Bellingham	Julián Álvarez
Jude Bellingham (scored)	Ángel Correa (scored)	Antonio Rüdiger	Alexander Sørloth
Federico Valverde (scored)	Julián Álvarez (missed)	Lucas Vázquez	Ángel Correa
Lucas Vázquez (missed)	Marcos Llorente (missed)	Kylian Mbappé	Marcos Llorente
Antonio Rüdiger (scored)		Federico Valverde

Using the match-specific covariate vector for this tie, the posterior shooter abilities are found to be tightly clustered around zero for both teams. For Real Madrid, all five shooters have posterior means within $\pm 0.01$ on the logit scale, with Rüdiger slightly positive ($\hat{a} \approx 0.008$) and the others slightly negative. Their probabilities of above-average performance are within the range $(0.49, 0.51)$, so whether any given player is above average is essentially a $Bernoulli(0.5)$ coin flip. Atlético’s quartet is even more tightly clustered, again with the probability of above-average performance being equal to 0.5. Given this near-indistinguishability in shooter skill, the optimized shootout orders are driven primarily by structural terms in the model: the round-level order, the score status adjustments, the opponent goalkeeper effect, and the latent state with AR(1) persistence.

Two key implications follow from this exercise. First, when player-level posteriors are tight, the sequencing edge arises less from “who is best overall” and more from “who fits best in each round’s context”. Second, and more importantly, the resulting uplift in win probability is minimal, suggesting that even with an optimum order, player-specific contributions in a shootout are not substantial. The case study should therefore be viewed primarily as an illustration of how the fitted model can be translated into a decision-support tool, rather than as evidence of large practically exploitable gains. Consistent with our main findings, it implies that such recommendations may be useful only at the margin and should be interpreted together with their associated uncertainty.

Managerial implications

This study proposes a hierarchical, leave-one-shootout-out validated framework for penalty shootouts that decomposes variation in kick outcomes into observed context, persistent skills attributable to individuals and teams, and an idiosyncratic component. The comparative evidence indicates that the player-level effects are present but modest, team-level heterogeneity helps stabilize likelihood fit, and a simple state-dependence term can aid winner prediction; yet the dominant share of variation at the level of individual kicks remains idiosyncratic. The managerial question, therefore, is not whether skill exists, but how to act optimally when skill differentials are small relative to noise and when the decision problem is path-dependent.

From a coaching perspective, the model suggests treating shootouts as high-variance environments. Rather than relying on a deterministic list of the best five shooters (potentially derived from raw conversion rates), teams should adopt probability-weighted shortlists that acknowledge posterior uncertainty and the match context. Because the skill differences among takers are typically modest, an effective ordering may place two or three above-average shooters early to mitigate adverse momentum and reserve one reliable option for kick five or the onset of sudden death, with alternatives evaluated through fast simulation under the fitted model and match-specific covariates. Player evaluation also benefits from such a model-based approach: posterior means and rank intervals for shooter and goalkeeper effects temper the small-sample volatility of raw percentages. In terms of training, the dominance of the idiosyncratic term implies that investments which reduce situational variance can pay larger dividends than marginal searches for outlier talent. Emphasis on stable pre-kick routines, pressure inoculation, and clear role assignment is consistent with the data-generating process our model uncovers. For goalkeepers, preparation should prioritize anticipatory cues and opponent-specific tendencies, translated into concise decision aids. In-game and roster strategy should reflect the same logic. Substitutions made solely to access an alleged penalty specialist warrant caution unless supported by meaningful posterior evidence and adequate attempt histories. Otherwise, expected gains are often smaller than perceived.

We further note from our analysis that, conditional on observed context, kicks from the mark are largely chance-dominated, with only modest and unstable gains from persistent individual skill. This motivates exploring tie-break formats that (i) mitigate procedural biases and (ii) place more weight on team-level or dynamic, open-play skills that our models indicate are more stable. For example, within the shootout framework, instead of ABAB, teams may take the penalties in an ABBA pattern, which has also been suggested by Anbarci et al. (2015) and Da Silva and Matsushita (2024). On the other hand, Carrillo (2007) earlier proposed a framework in which the shootout is carried out before extra time and argued how it may make the game of football more interesting. Another possibility is a one-versus-one dribble shootout, as used in field hockey. Here, each attempt starts with an attacker in possession a few meters from the goal and the objective is to beat the goalkeeper and score within a short time limit (e.g., 8 seconds). Compared with a static penalty, this format rewards ball-carrying, feints, and decision-making under pressure, and lets keepers express timing and positioning skill in a dynamic setting. In terms of our modeling, it should increase the role of player-specific random effects and transient within-sequence dynamics (in both, we should see stronger signals) and should lessen the dominance of the random error that overwhelms spot-kick outcomes.

Exploration can also be done in the golden goal format used in various sports.⁶ It is a tie-breaking method in which the match continues into extra time under the usual laws until the first goal is scored, and that goal decides the winner. In football, this rule (and its variant, the silver goal) was in place from 1993 to 2004. Brocas and Carrillo (2004) discussed how this rule or its modifications can make football more exciting. Based on our findings on the relevance of skill in penalty shootouts, the governing bodies may look at this rule again and consider its variants. For instance, the teams may play an additional brief period (e.g., 10 min) with a smaller team size (say, 7-versus-7), which would enlarge the space per player and thereby increase the probability of an open-play goal. Such rules would shift resolution toward coordinated team actions and set-play execution, and would better align with our model’s evidence that team-level structure is more reliable than persistent player-specific effects.

From a design perspective, ABBA is perhaps the least disruptive option and the easiest to implement. The other possibilities mentioned are stronger interventions that explicitly tilt the tie-break toward skills which our results identify as more informative and less noise-dominated. We want to highlight that any such option can be evaluated against the current format ex ante within our framework, by simulating sequences under the proposed rules and comparing expected log predictive density as well as different scoring rules.

Concluding remarks

To summarize, this article proposed a first-of-its-kind hierarchical Bayesian framework for penalty shootouts that allows an evaluation of skill and chance. The model is compared with nested alternatives using various metrics. Overall, the empirical evidence points to a chance-dominated environment, as the model allocates a negligible share of variance to persistent skill. In comparison, the team-only model finds essentially no stable club signal, and the best winner prediction is obtained by a simple state-dependent specification without any player-specific effects. We also find that the first-mover advantage is statistically insignificant in penalty shootouts, which differs from several other studies in this domain (e.g., Huang et al., 2026; van Hemert and van der Kamp, 2026; Vandebroek et al., 2018). Broadly, our results align with the studies by Pipke (2025) and Wunderlich et al. (2020), while providing a more in-depth understanding of the outcome variability. We emphasize that our conclusion is supported jointly by variance decomposition, PSIS-LOO comparison, stacking weights, and decision-oriented predictive scores. While these pieces jointly discourage strong claims about enduring individual superiority, they still support small, actionable differences that can be exploited at the margin (e.g., order selection of penalty-takers in a shootout).

Several extensions naturally follow. An empirical extension would be to validate the key findings for international tournaments as well. Methodologically, our model treats the individual effects as time-invariant and exchangeable. We do not allow for learning, aging, or player-keeper interaction terms. A future direction would be to allow learning or adaptation over time. In the current model, player-specific effects are treated as stable latent traits, but in practice these abilities may evolve across seasons as players accumulate experience, change clubs, or face repeated high-pressure situations. One possible extension would be to let the player-specific random effects vary dynamically over season or calendar time, for example through random-walk or autoregressive state equations. Another possibility would be to include explicit measures of prior shootout exposure, such as the cumulative number of previous shootout attempts or saves, as predictors of current performance. Such formulations would make it possible to distinguish short-run adaptation from long-run ability and to assess whether repeated exposure to shootouts reduces the apparent role of chance. This would broaden the framework from a static decomposition of skill versus randomness to a dynamic analysis of how skill is acquired, retained, or eroded over time. It would also be interesting to consider richer state processes (e.g., hidden Markov models or score-dependent AR terms). From a design standpoint, our approach can be adapted to evaluate alternative tie-break procedures under counterfactual rules, reporting expected log predictive density and decision scores ex ante.

Let us end the paper with a discussion about the generalizability of our method. The proposed framework is modular and readily portable beyond football shootouts. At its core is a Bernoulli–logit observation model, with partially pooled random effects for persistent heterogeneity, and a low-dimensional latent state for within-sequence dynamics. Each component can be adapted to new settings, for example, other tie-break mechanisms in sports, or when the response variable is multinomial or ordered and appears in a sequence. We indeed recommend conducting a similar assessment study for other sports. Although direct numerical benchmarks are not readily comparable across sports, we believe that the estimated $SkillShare$ in this study is low relative to what one might expect in repetitive, highly individualized tasks such as basketball free throws, where player-specific skill is known to be stable and strongly measurable. Conversely, tie-break mechanisms that retain large state dependence and one-off interactions, such as hockey shootouts, may more closely resemble the variance-dominated pattern observed here. Different links (probit, cloglog, etc.) can also be incorporated in the model if needed. Finally, recalibration to new environments can be achieved by refitting hyperparameters, and validating with PSIS-LOO, thereby yielding a broadly applicable template for sequence-of-play decisions where agents, context, and momentum interact.

Footnotes

ORCID iD

Soudeep Deb

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The author serves on the Editorial Board of the Journal of Sports Analytics as an Associate Editor and will not have any involvement in the editorial handling or the evaluation process of the paper in any way. There are no other competing interests to declare that are relevant to the contents of this article.

Data availability statement

The raw data used in the main analysis of this paper is extracted from the Transfermarkt website (link: https://www.transfermarkt.co.uk/). The cleaned and processed data, along with the R codes, are publicly available in a GitHub repository maintained by the author (link: ).

Notes

References

1. Anbarci, Sun C-J and Ünver (2015) Designing fair tiebreak mechanisms: The case of FIFA penalty shootouts. Available at SSRN 2558979.

2. Apesteguia and Palacios-Huerta (2010) Psychological pressure in competitive environments: Evidence from a randomized natural experiment. American Economic Review 100(5): 2548–2564.

3. Arrondel, Duhautois and Laslier J-F (2019) Decision under psychological pressure: The shooter’s anxiety at the penalty kick. Journal of Economic Psychology 70: 22–35.

4. Bar-Eli and Azar (2009) Penalty kicks in soccer: An empirical analysis of shooting strategies and goalkeepers’ preferences. Soccer & Society 10(2): 183–191.

5. Baumann, Friehe and Wedow (2011) General ability and specialization: Evidence from penalty kicks in soccer. Journal of Sports Economics 12(1): 81–105.

6. Brinkschulte (2025) Influencing factors on the success in football penalty kicks: The role of nationality, skill, and pressure. PhD thesis.

7. Bürkner P-C, Gabry, Kay, et al. (2025) posterior: Tools for Working with Posterior Distributions. R package version 1.6.1. Available at: https://mc-stan.org/posterior/.

8. Brocas and Carrillo (2004) Do the “three-point victory” and “golden goal” rules make soccer more exciting? Journal of Sports Economics 5(2): 169–185.

9. Carpenter, Gelman, Hoffman, et al. (2017) Stan: A probabilistic programming language. Journal of Statistical Software 76: 1–32.

10. Carrillo (2007) Penalty shoot-outs: Before or after extra time? Journal of Sports Economics 8(5): 505–518.

11. Cohen-Zada, Krumer and Shapir (2018) Testing the effect of serve order in tennis tiebreak. Journal of Economic Behavior & Organization 146: 106–115.

12. Csató (2021) A comparison of penalty shootout designs in soccer. 4OR 19(2): 183–198.

13. Csató and Petróczy (2022) Fairness in penalty shootouts: Is it worth using dynamic sequences? Journal of Sports Sciences 40(12): 1392–1398.

14. Da Silva and Matsushita (2024) Cognitive biases in penalty shootouts: Evaluating fairness in ABAB and ABBA formats. Psychology International 6(4): 827–841.

15. Feri and Innocenti (2013) Is there psychological pressure in competitive environments? Journal of Economic Psychology 39: 249–256.

16. Gabry, Simpson, Vehtari, et al. (2019) Visualization in Bayesian workflow. Journal of the Royal Statistical Society Series A: Statistics in Society 182(2): 389–402.

17. Gneiting and Raftery (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477): 359–378.

18. Hoffman and Gelman (2014) The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15(1): 1593–1623.

19. Huang, Liang, Dai, et al. (2026) A Markovian model for football penalty shootout. Journal of Applied Statistics 1–23.

20. Hurley (2005) Overtime or shootout: Deciding ties in hockey. In: Anthology of Statistics in Sports. SIAM, 193–196.

21. Jordet, Hartman and Jelle Vuijk (2012) Team history and choking under pressure in major soccer penalty shootouts. British Journal of Psychology 103(2): 268–283.

22. Jordet, Hartman, Visscher, et al. (2007) Kicks from the penalty mark in soccer: The roles of stress, skill, and fatigue for kick outcomes. Journal of Sports Sciences 25(2): 121–129.

23. Kocher, Lenz and Sutter (2012) Psychological pressure in competitive environments: New evidence from randomized natural experiments. Management Science 58(8): 1585–1591.

24. Lambers and Spieksma (2021) A mathematical analysis of fairness in shootouts. IMA Journal of Management Mathematics 32(4): 411–424.

25. Lopez and Schuckers (2017) Predicting coin flips: Using resampling and hierarchical models to help untangle the NHL’s shoot-out. Journal of Sports Sciences 35(9): 888–897.

26. McGarry and Franks (2000) On winning the penalty shoot-out in soccer. Journal of Sports Sciences 18(6): 401–409.

27. Memmert, Hüttermann, Hagemann, et al. (2013) Dueling in the penalty box: Evidence-based recommendations on how shooters and goalkeepers can win penalty shootouts in soccer. International Review of Sport and Exercise Psychology 6(1): 209–229.

28. Pipke (2025) No evidence of first-mover advantage in a large sample of penalty shootouts. Journal of Economic Psychology 108: 102816.

29. Polson, Scott and Windle (2013) Bayesian inference for logistic models using Pólya–gamma latent variables. Journal of the American Statistical Association 108(504): 1339–1349.

30. Psarakis and Panaretos (1990) The folded t distribution. Communications in Statistics - Theory and Methods 19(7): 2717–2734.

31. Rudi, Olivares and Shetty (2020) Ordering sequential competitions to reduce order relevance: Soccer penalty shootouts. PLoS One 15(12): e0243786.

32. Santos (2023) Effects of psychological pressure on first-mover advantage in competitive environments: Evidence from penalty shootouts. Contemporary Economic Policy 41(2): 354–369.

33. Stan Development Team (2025) RStan: the R interface to Stan. R package version 2.32.7. Available at: https://mc-stan.org/.

34. van Hemert and van der Kamp (2026) Team strength and penalty shootouts in soccer: Market value predicts the winner. International Journal of Sports Science & Coaching 17479541261416771.

35. Van Ours (2021) Common international trends in football stadium attendance. PLoS One 16(3): e0247761.

36. Vandebroek, McCann and Vroom (2018) Modeling the effects of psychological pressure on first-mover advantage in competitive interactions: The case of penalty shoot-outs. Journal of Sports Economics 19(5): 725–754.

37. Vehtari, Gabry, Magnusson, et al. (2024) LOO: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 2.8.0. Available at: https://mc-stan.org/loo/.

38. Vehtari, Gelman and Gabry (2017) Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5): 1413–1432.

39. Veldkamp and Koning (2023) Waiting to score. Conversion probability and the video assistant referee (VAR) in football penalty kicks. Journal of Sports Sciences 41(18): 1692–1700.

40. Vollmer, Schoch and Brandes (2024) Penalty shoot-outs are tough, but the alternating order is fair. PLoS One 19(12): e0315017.

41. Wood, Jordet and Wilson (2015) On winning the “lottery”: Psychological preparation for football penalty shoot-outs. Journal of Sports Sciences 33(17): 1758–1765.

42. Wunderlich, Berge, Memmert, et al. (2020) Almost a lottery: The influence of team strength on success in penalty shootouts. International Journal of Performance Analysis in Sport 20(5): 857–869.

43. Yao, Vehtari, Simpson, et al. (2018) Using stacking to average Bayesian predictive distributions (with Discussion). Bayesian Analysis 13(3): 917–1007.

Skill or chance? A Bayesian analysis of dependence and heterogeneity in penalty shootouts in football
