Abstract
In this work, we study the ranking algorithm used by the Fédération Internationale de Football Association (FIFA); we analyze the parameters it currently uses, show the formal probabilistic model from which it can be derived, and optimize the latter. In particular, analyzing games since the introduction of the algorithm in 2018, we conclude that a game’s “importance” (defined by FIFA and used by the algorithm) is counterproductive from the point of view of the predictive capacity of the algorithm. We also postulate that the algorithm should be rooted in a formal modeling principle, where the Davidson model proposed in 1970 seems to be an excellent candidate, preserving the form of the algorithm currently used. The results indicate that the predictive capacity of the algorithm is considerably improved by using the home-field advantage (HFA) as well as an explicit model for the draws. A moderate but notable improvement may be achieved by weighting the results with the goal differential, which, although not rooted in a formal modeling principle, is compatible with the current algorithm and can be tuned to the characteristics of the football competition.
Introduction
In this work we evaluate the algorithm used by the Fédération Internationale de Football Association (FIFA) to rank men’s national teams, and we propose and study simple modifications that improve the prediction capacity of the algorithm.
Rating and ranking are important elements of sport competitions and the surrounding entertainment environments. In general, the rating has an informative function that provides fans and casual observers with quick insight into the relative strength of the teams. For example, the press is often interested in the “best” teams or in the national team reaching some record position in the ranking.
More importantly, ranking leads to consequential decisions such as (i) seeding, i.e., defining which teams play against each other in competitions (e.g., used to establish the composition of the groups in the qualification rounds of the FIFA World Cup), (ii) promotion / relegation (e.g., determining which teams move between the English Premier League (EPL) and the English Football League Championship, or teams that move between the groups in Nations Leagues), or (iii) defining the participants in prestigious (and lucrative) end-of-season competitions (such as Champions League in European football, Stanley Cup series in National Hockey League (NHL)).
Most of the currently used ratings simply count wins/losses (and draws, where applicable), but some sport governing bodies have gone beyond these simple methods and implemented more sophisticated rating algorithms in which the rating levels attributed to the teams are meant to represent their “skills” or “strengths”; the ranking, obtained by sorting these numbers, is also known as a “power ranking”.
In particular, FIFA started a new ranking/rating algorithm in 2018, where the rating levels (skills) assigned to the teams are calculated from the game outcome, of course, but also from the skills of the teams before the game. The resulting rating algorithm has the virtue of being simple and defined in a (mostly) transparent manner.
Considering that association football is, by any measure, the most popular sport in the world, and given the importance of the rating/ranking, the main objective of this work is to analyze the FIFA ranking using statistical methods. This evaluation has a value of its own and follows the line of many works that analyzed the past ranking strategies used by FIFA, e.g., (Lasek, Szlávik, & Bhulai, 2013; Ley, Van de Wiele, & Van Eetvelde, 2019). Furthermore, the approach we propose can also be applied to evaluate other rating algorithms, such as the one used by the Fédération Internationale de Volleyball (FIVB), (FIVB, 2020).
In this work we will: (i) derive the FIFA algorithm from first principles; in particular, we will define the probabilistic model underlying the algorithm and identify the estimation method used to estimate the skills; (ii) assess the relevance of the parameters used in the current algorithm; in particular, we will evaluate the role played by the change of the adaptation step according to game importance (as defined by FIFA); (iii) optimize the parameters of the proposed model; as a result, we derive an algorithm which is equally as simple as FIFA’s, but allows us to improve the prediction of the game results; and (iv) propose modifications of the algorithm that take into account the goal differential, also known as the margin of victory (MOV); we consider legacy-compliant algorithms and a new version of the rating.
Our work is organized as follows. In Section 2 we describe the FIFA algorithm in the framework that simplifies the manipulation of models and the evaluation of the results. This is also where we clarify the origin of the data, make a preliminary evaluation of the relevance of the game-importance parameters currently used to control the size of the adaptation step, and assess the impact of the shootout/knockout rules present in the FIFA algorithm.
The algorithm is then formally derived in Section 3 where we also discuss the evaluation of the results and the batch estimation approach we use. Incorporation of the MOV in the rating is evaluated in Section 4 using two different strategies. In Section 5 we return to the on-line rating, evaluating and re-optimizing the proposed algorithms, and discussing the practical role played by the scale. We conclude the work in Section 6 summarizing our findings and in Section 6.1 we make an explicit list of recommendations which may be introduced to improve the current version of the FIFA algorithm.
FIFA ranking algorithm
We consider the scenario in which there are a total of M teams playing against each other in the games indexed with
Let θ_{t,m} denote the skill of team m ∈ {1, …, M} before game t. The skills of all teams are gathered in a vector
Game results
The basic rules of the FIFA rating for a team m ∈ {i_t, j_t} are defined as follows:
When the team m does not play, its skills do not change, i.e., θt+1,m ← θt,m.
The steps I_c are defined by FIFA, and we divide them into two components:
Game-categories c and the corresponding update steps I_c = Kξ_c (FIFA, 2018), where K = 5 and ξ_c = I_c/K. The number of games T_c and their frequency f_c = T_c/T in the observed categories between June 4, 2018 and March 31, 2022 is also given (the total number of games is T = 3444)
The basic equation governing the change in skills in (2) is next supplemented with the following rules: Knockout rule: in the knockout stage of any competition (which follows the group stage), instead of (2) we use
Shootout rule: If the team m wins the game in the shootouts, we use
This rule, however, does not apply in two-legged qualification games if the shootout is required to break the tie.
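To make the mechanics concrete, the update rules above can be sketched in code. The logistic expected-score curve with scale s = 600 and the step I_c = Kξ_c follow the FIFA description; the function names and the exact handling of the knockout rule (the losing team keeps its points) are our own simplified reading, not FIFA’s specification.

```python
def expected_score(z, s=600.0):
    """Expected score of team i given the skill difference z = theta_i - theta_j."""
    return 1.0 / (1.0 + 10.0 ** (-z / s))

def fifa_update(theta_i, theta_j, y, step, knockout=False):
    """One rating update; y is the result for team i (1 win, 0.5 draw, 0 loss).

    `step` is I_c = K * xi_c for the game category c.  With knockout=True the
    losing team keeps its points, a sketch of the knockout rule (6).
    """
    e = expected_score(theta_i - theta_j)
    delta_i = step * (y - e)
    delta_j = step * ((1.0 - y) - (1.0 - e))  # equals -delta_i
    if knockout:
        # knockout rule: negative updates are suppressed
        delta_i = max(delta_i, 0.0)
        delta_j = max(delta_j, 0.0)
    return theta_i + delta_i, theta_j + delta_j
```

Note that without the knockout rule the two updates cancel, so the total number of points in the system stays constant, as discussed later in the text.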
The rating we describe has been published by FIFA since August 2018, roughly once a month. The algorithm was initialized on June 4, 2018, with the initialization values
To run the algorithm, we need to know the initialization
In our discussion of models and algorithms, we want (i) to understand the rationale behind the current FIFA rating algorithm and (ii) to propose new and simple rating algorithms.
We start by asking simple questions: Are the parameters I c defining the “importance” of the game suitably set? If not, how should we define them to improve the results? These questions are interesting in their own right because the concept of game-importance is not unique to the FIFA rating: it also appears in the FIVB rating, (FIVB, 2020) and in the statistical literature, e.g., (Ley et al., 2019, Sec. 2.1.2).
In statistics, a conventional approach to performance evaluation is to rely on a metric, called a scoring function, which relates the result y_t to the prediction obtained from the estimates at hand (here,
At this point we want to use only the elements that are clearly defined in the FIFA ranking and since the only explicit predictive element defined in the FIFA algorithm is the expected score (3),
Using the squared prediction error,
The MSE in (9) may be treated as an estimate of the expectation,
Therefore, reducing the (absolute value of the) bias B(z_t, y_t), that is, improving the calculation of the expected score F(z_t/s), should manifest itself in a lower value of the MSE calculated as in (9).
Using the MSE, we are now able to assess how the values of the importance parameters I c (or alternatively, K and ξ c ) affect the expected value of the score (i.e., the estimate of the mean).
We find the coefficients K and/or ξ_c by minimizing the MSE (9) using the following alternate optimization, which turned out to converge quickly (and to be independent of the initialization):
The one-dimensional optimizations (11) and (12) only require a one-dimensional line search (e.g., over a grid); we prefer to avoid derivative-based methods, which are not well suited to the complicated functional relationship resulting from the recursive rating algorithm.
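A minimal sketch of such an alternate grid search is shown below. Here `mse_of(K, xi)` stands for running the recursive rating with the given parameters and evaluating the MSE (9); that function, and the grids, are assumptions of this sketch.

```python
def alternate_grid_min(mse_of, K0, xi0, K_grid, xi_grid, max_iter=20, tol=1e-9):
    """Alternate minimization of MSE(K, xi) by one-dimensional grid searches.

    `mse_of(K, xi)` must return the MSE obtained by running the rating
    algorithm with common step K and per-category weights xi (a tuple).
    """
    K, xi = K0, tuple(xi0)
    best = mse_of(K, xi)
    for _ in range(max_iter):
        # step (11): optimize K with the weights xi fixed
        K = min(K_grid, key=lambda k: mse_of(k, xi))
        # step (12): optimize each weight xi_c with the others fixed
        xi = list(xi)
        for c in range(len(xi)):
            xi[c] = min(xi_grid,
                        key=lambda w: mse_of(K, tuple(xi[:c] + [w] + xi[c + 1:])))
        xi = tuple(xi)
        new = mse_of(K, xi)
        if best - new < tol:   # stop when no further improvement
            break
        best = new
    return K, xi, best
```

On a toy separable objective the procedure converges in one pass, which matches the fast convergence reported in the text.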
The results are shown in Table 2 and we observe the following. The common update step K increases ten-fold in the optimized solution, and it seems to be the most important contributor to the improvement of the MSE (which changes from …). For the games in the categories well represented in the data, i.e., c ∈ {0, 1, 2, 4, 5}, the relative importance of the games ξ_c does not seem to be critically different and, for sure, does not match the values used in the FIFA algorithm. Overall, the optimized weights ξ_c yield a very small improvement in MSE when compared to the use of constant weights, ξ_c ≡ 1. In fact, the Friendlies played in the International Match Calendar (IMC) window are weighted down (ξ_1 = 0.8) compared to the Friendlies played outside the window, which is contrary to what the FIFA algorithm does. Estimates of ξ_c for the categories c ∈ {3, 6, 7, 8} should not be considered very reliable because the number of games in each of these categories is rather small (less than 6% of the total). Furthermore, games in the categories c = 7 and c = 8 were observed only in June 2018, during the 2018 World Cup; therefore, their effect is most likely very weak in the games from the second half of the observed batch, see (9), which starts in Sept. 2020.
Parameters K and ξ_c in (5) are either fixed (shadowed cells) or obtained by minimizing the MSE (9). The last three columns show the values of the MSE obtained after the FIFA algorithm is modified by removing the shootout and/or knockout rules.
Using a very simple MSE criterion derived from the definitions used by the FIFA algorithm, we obtain results that cast doubt on the optimality/utility of the game-importance parameters I_c proposed by FIFA.
However, drawing conclusions at this point may be premature. For example, regarding K (which, after optimization, is much larger than 5), it is possible that the relatively short observation period (34 months) is not sufficient for a small K to reach convergence; a small K may still pay off in the long run, improving the performance once convergence is reached. We will return to this issue in Section 5.1 when analyzing the interaction between the initialization of the algorithm and the scale s.
On the other hand, the situation of the weights ξ_c is quite different: even after convergence, the weights associated with different categories should affect the results in a meaningful way. To elucidate this point, we take a more formal approach and, in Section 3, go back to the “drawing board” to derive the rating algorithm from first principles.
Before that, however, we will evaluate the impact of the knockout/shootout rules.
The basic algorithmic Equations (1)–(2) guarantee that the teams “exchange” the rating points so that their total stays constant, i.e.,
In the absence of known mathematical principles from which the knockout/shootout rules could be derived, our initial hypothesis is that the knockout rule is a heuristic introduced to compensate for the increased value of I_c in the advanced stages of competitions. To test this hypothesis, we remove the shootout and/or knockout rules from the algorithm and observe the impact of these removals on the MSE.
The results are shown in the last three columns of Table 2; we conclude that the prediction capacity of the algorithm is negligibly affected by the shootout rule, and that it slightly but still notably deteriorates if the knockout rule is removed. Thus, our hypothesis is not supported by the results, even if the argument in favor of using the knockout rule is rather weak, as we will also see later in Section 5.2.
To obtain an intuitive understanding of how the removal of shootout/knockout rules affects the results, Table 3 compares the ranking obtained using the FIFA algorithm (first column) with the rating resulting from the modified algorithm in which we (i) eliminate the shootout rule (second column), (ii) eliminate the knockout rule (third column), as well as (iii) eliminate both rules (fourth column).
Ranking of the top teams: Brazil (BRA), Belgium (BEL), France (FRA), Argentina (ARG), and England (ENG). The first row shows the Spearman correlation coefficient, ρ, calculated between the modified rankings and the ranking obtained by the original FIFA algorithm, shown in the first column (as of March 31, 2022); the next three columns display the results when the shootout and/or knockout rules are removed
The Spearman correlation coefficient, ρ (Myers & Well, 2003, Ch. 18.5.3), which quantifies the difference in the rankings of all teams (perfect agreement yields ρ = 1.0), indicates that the changes in the ranking are similar to what was observed with the MSE: comparing the shootout and knockout rules, the latter affects the results more significantly.
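For tie-free rankings, the Spearman coefficient reduces to a simple closed form, which can be sketched as follows (the tie-free assumption is ours; with ties, the rank-based Pearson correlation would be needed instead):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation between two tie-free rankings.

    rank_a and rank_b list the rank of each team under the two rankings;
    without ties, rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the rank difference of team i.
    """
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Identical rankings give ρ = 1.0 and fully reversed rankings give ρ = −1.0, as expected.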
We also show the rankings of the top teams, where the changes are not major; the most notable is the switch of ranks between Belgium (BEL) and France (FRA), which can be attributed to a different number of times these teams benefited from the knockout rule. Indeed, by analyzing the results of the games, we observed that in the original ranking BEL benefited four times from the knockout rule, for a total of 85 points (which would have been lost without the rule (6)), while FRA benefited only once, gaining 14 points.
Although the knockout rule provides a slight but notable improvement from the prediction point of view, we may still debate whether this heuristic is fair and desirable.
In particular, we note that the points-preserving knockout rule partially ignores the direct comparison between the teams. For example, games in which BEL was not penalized (for losing in the knockout stages) were played against FRA (twice). Thus, despite direct evidence indicating that FRA was able to beat BEL, the knockout rule preserved the points earned by BEL in other games.
In fact, such situations are not surprising and, indeed, the top teams are likely to make it to the final stages of the important competitions and then play against each other in the games where knockout rules are applicable (in case of BEL’s games: World Cup 2018, Euro 2020, and UEFA Nations League 2021). Although these games will provide direct comparison results, the current knockout rule will preserve the points of the losing team.
To understand and eventually modify the rating algorithm used by FIFA, we propose to cast it in a well-defined probabilistic framework. To this end, we explicitly define a model relating the game outcome y_t to the skills of the home team (θ_{t,i_t}) and the away team (θ_{t,j_t}), where the most common assumption is that the probability that a random variable Y_t takes the value y_t depends on the skill difference z_t = θ_{t,i_t} − θ_{t,j_t}, i.e.,
We are interested in on-line rating algorithms, in which the skills of the participating teams are changed immediately after the result of the game is known. Nevertheless, we start the analysis with batch processing, i.e., assuming that the skills
Assuming that the observations are independent when conditioned on skills, the rating may be based on the weighted maximum likelihood (ML) estimation principle
The weighting of the log-likelihoods (by ξ_{c_t} in (16)) is used in the estimation literature to account for model mismatch (Hu & Zidek, 2001; Amiguet, 2010): the less confidence we have that the observations are generated according to the assumed model, the smaller the weights that should be applied.
In our problem, the confidence is associated with the game category c, so a smaller ξ_c means that we have less confidence that the game outcomes in category c are well described by the model (13).
Since multiplying all ξ_c by a common factor is irrelevant to the minimization, we remove any ambiguity by again setting ξ_0 = 1.
We may solve (16) using the steepest descent
The on-line version of (18) is obtained by replacing the batch optimization with the stochastic gradient (SG), which updates the solution each time a new observation becomes available, i.e.,
The rating now depends on the choice of the likelihood function L(z; y); here we opt for the Davidson model (Davidson, 1970), a particular case of the multinomial model also used in Egidi and Torelli (2021).
Using (21)–(23) in (19), with straightforward algebra we obtain (see Appendix A and (Szczecinski & Djebbi, 2020, Sec. 3.1))
Therefore, the SG algorithm (20) becomes
It is easy to see that for η = 0 and κ = 0 (i.e., when L (z ;
Although we conclude that the FIFA rating algorithm may be seen as an instance of maximum weighted likelihood estimation, this is, of course, a “reverse-engineered” hypothesis: the FIFA document (FIFA, 2018) does not mention any remotely similar concept.
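The Davidson model with HFA and the resulting SG update can be sketched as follows. The base-10 parameterization with half the (scaled, HFA-shifted) skill difference in the exponent is one plausible reading of (21)–(23); the exact placement of η and the scale s in the formulas is our assumption. Setting κ = 0 recovers the logistic (Elo/FIFA) expected-score curve.

```python
def davidson_probs(z, s=150.0, eta=0.0, kappa=1.0):
    """Home win/draw/loss probabilities under a Davidson-type model.

    z is the skill difference, s the scale, eta the HFA shift, and kappa
    the draw parameter; kappa = 0 removes draws and recovers the
    logistic model used by the FIFA algorithm.
    """
    x = z / s + eta
    w = 10.0 ** (0.5 * x)   # proportional to the win probability
    l = 10.0 ** (-0.5 * x)  # proportional to the loss probability
    den = w + l + kappa
    return w / den, kappa / den, l / den

def davidson_sg_update(theta_i, theta_j, y, K, s=150.0, eta=0.0, kappa=1.0):
    """One stochastic-gradient step: move skills by K * (score - expected score)."""
    p_win, p_draw, _ = davidson_probs(theta_i - theta_j, s, eta, kappa)
    e = p_win + 0.5 * p_draw          # expected score of the home team
    delta = K * (y - e)
    return theta_i + delta, theta_j - delta
```

With equal skills, no HFA, and κ = 1, the three outcomes are equiprobable, and the expected score is 0.5, so a home win moves each team by K/2 points in opposite directions.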
While our goal is to obtain the on-line rating algorithms where the skills at time t + 1 are calculated from the observations up to time t, we will, for a moment, ignore this on-line rating aspect and rather focus on the evaluation of the model and the optimization criterion that underlie the algorithms.
We thus concentrate on the original problem defined in (16) for the entire data set; in this way, we (i) do not need to remove a significant portion of the data (otherwise needed to eliminate the initialization effects during evaluation, see (9)), and (ii) eliminate the limitation of the SG optimization, where fine-tuning the adaptation step K trades the estimation error off against the convergence speed.
We start by noting that the problem (16) is, in general, ill-posed: since the solution depends only on the differences between the skills, z_t, all solutions
Under the model (21)–(23), the regularized batch-optimization problem (28) is useful to resolve another difficulty. Namely, if there is a team m having registered only wins, i.e., when ∀t : i_t = m, y_t =
The estimated skills
Regarding the optimization criterion, we recall that the FIFA algorithm only specifies the expected score, so the quadratic error (8) allowed us to evaluate the algorithm while staying within the boundaries of its definitions. Now, however, with an explicit skills-outcome model, we may go beyond this limitation and use prediction metrics known in machine learning, such as the (negated) log-score (Gelman et al., 2014)
Furthermore, thanks to the batch-rating, we are able to consider the entire data set in the performance evaluation by averaging the scoring functions (30) or (31) over all games
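The two metrics can be sketched as follows. The encoding of outcomes as y ∈ {0, 0.5, 1} follows the expected-score convention used throughout; counting a game as correctly predicted when the observed outcome has the highest predicted probability is our assumption about how the accuracy (33) is computed.

```python
import math

_IDX = {0.0: 0, 0.5: 1, 1.0: 2}  # loss / draw / win of the home team

def neg_log_score(probs, y):
    """Negated log-score of a predicted distribution over {loss, draw, win}.

    probs = (p_loss, p_draw, p_win); y in {0, 0.5, 1} is the observed result.
    """
    return -math.log(probs[_IDX[y]])

def average_metrics(predictions, outcomes):
    """Average negated log-score and accuracy over all games.

    A game counts as accurately predicted when the observed outcome has
    the highest predicted probability (ties resolved by the first maximum).
    """
    T = len(outcomes)
    ls = sum(neg_log_score(p, y) for p, y in zip(predictions, outcomes)) / T
    hits = sum(1 for p, y in zip(predictions, outcomes)
               if max(range(3), key=lambda k: p[k]) == _IDX[y])
    return ls, hits / T
```

Lower log-score and higher accuracy are better; the log-score additionally rewards well-calibrated probabilities, not just correct point predictions.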
In simple words, for given parameters (α, κ, η, ξ_c), we find the skills
Although both the average log-score (32) and the accuracy (33) can now be optimized with respect to α, κ, η, and/or ξ_c, we only optimize the log-score, whose optimal value is denoted as
we will do this to determine how useful it is to optimize only some of the parameters.
Again, we use alternate minimization similar to the one shown in (11)–(12):
The optimized α, κ, η, and ξ_c are shown in Table 4 and indicate the following. The data does not provide evidence for using category-dependent weights ξ_c. In fact, the results obtained using the FIFA weights ξ_c are worse than those obtained using constant weights ξ_c = 1 (i.e., essentially ignoring the possibility of weighting). Although it may be argued that the results are affected by the small number of games in some categories (such as a World Cup), it is very unlikely that observing more games would speak in favor of variable weights, and almost surely not in favor of the highly disproportionate weights used in the FIFA algorithm. Note that the optimal weight ξ_1 (Friendlies within the IMC) and the weight ξ_2 (group phase of the Nations Leagues) are smaller than those of the regular Friendlies. This result stands in contrast with the FIFA algorithm, which doubles the weight ξ_1 of the Friendlies played in the IMC and triples the weight ξ_2. But, of course, we should note a shallow minimum of the objective function, which attains the same values … A notable improvement in the prediction capacity measured by the log-score is obtained by considering the HFA. The value η ∈ {0.3, 0.4} emerges from the optimization, and we note that η = 0.25 was used in eloratings.net (2020).
A more important improvement is obtained by optimizing the parameter κ, which takes into account the draws and their frequency, as discussed in Szczecinski and Djebbi (2020).
Batch-rating parameters obtained via minimization of the log-score (32). The parameters (α, κ, η, ξ_c) are either fixed (shadowed cells) or obtained via optimization. The upper-part results correspond to the conventional FIFA algorithm: using κ = 2 and η = 0, the expected score is calculated using a logistic function. The last line corresponds to the parameters η and κ obtained via (36)–(37).
Here, it is interesting to compare the parameters found by optimization with the simplified formulas proposed in Szczecinski and Djebbi (2020, Sec. 3.2)
The parameter η_hfa predicted by (36) is practically equal to the one obtained by optimization. And while the parameters κ_hfa and κ_neut. are slightly different from those predicted by (37), using them in the rating, we obtained
In the search for a possible improvement of the rating, we now consider the use of the MOV variable, defined as the difference of the goals scored by each team and denoted by d_t. In this regard, recent works adopt two conceptually different approaches.
The first keeps the structure of the known rating algorithm (such as the FIFA algorithm) and modifies it by changing the adaptation step size as a function of d t . This was already done in eloratings.net (2020), Hvattum and Arntzen (2010), Silver (2014), Ley et al. (2019), and Kovalchik (2020), and is conceptually similar to the weighting according to the game-category we consider in the previous section.
The second approach, studied before in Maher (1982), Ley et al. (2019), Lasek and Gagolewski (2020), and Szczecinski (2022), changes the model between the skills and the MOV variable d_t. We will focus on the simple proposition from Lasek and Gagolewski (2020) based on the formulation of Karlis and Ntzoufras (2008).
MOV via weighting
For context, Table 5 shows the number of games for each value of the MOV variable d. While it is, in principle, possible to use d directly, it is customary to consider its absolute value, |d|.
Number of games T|d| which finished with the goal difference |d| and their relative frequency f|d| = T|d|/T
The Elo/FIFA algorithms (27) can be easily modified as follows, to take into account the MOV variable:
Integrating the MOV-weight into the on-line rating defined in (27) (by replacing therein Kξ_{c_t} with Kξ_{c_t}ζ_{v_t}) yields the Davidson-MOV algorithm.
For example, (eloratings.net, 2020) uses
To elucidate how useful such heuristics are, we note that the problem is very similar to the importance weighting analyzed before; the difference lies in the fact that the weighting now depends on the product ξ_c ζ_d. Therefore, we may reuse our optimization strategy to find the optimal weights for games with different values of |d|.
To this end, we discretize |d| into V + 1 MOV-categories, v = 0, …, V, using the very simple mapping v = |d| for |d| < V and v = V ⇔ |d| ≥ V. For example, with V = 2, ζ_0 weights the draws (|d| = 0), ζ_1 the games with a one-goal difference (|d| = 1), and ζ_2 the games with more than a one-goal difference (|d| ≥ 2).
By breaking with the predefined functional relationship shown in (41), we are more general than the latter, e.g., we treat the cases |d| = 0 and |d| = 1 separately. This makes sense since these events are not only the most frequent ones (covering, respectively, 23% and 36% of the total; see Table 5), but also correspond to the events of draw and win/loss, which are treated differently by the algorithm.
On the other hand, we are also less general due to the merging of the events |d| ≥ V, although this effect decreases with V simply because there are very few such observations, as may be seen in Table 5. For example, with V = 4, the weight ζ_4 is the same for the events |d| = 4 and |d| > 4, but the latter make up only about 5% of the total.
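The MOV-category mapping and the resulting adaptation step can be sketched in a few lines; the helper names and the illustrative ζ values are ours, not values from the paper’s tables.

```python
def mov_category(d, V=2):
    """Map a goal difference d to its MOV-category: v = min(|d|, V)."""
    return min(abs(d), V)

def mov_step(K, xi_c, zeta, d, V=2):
    """Adaptation step K * xi_c * zeta_v used by a Davidson-MOV-style update.

    zeta is a list of V + 1 weights indexed by the MOV-category v.
    """
    return K * xi_c * zeta[mov_category(d, V)]
```

For instance, with V = 2 a 5–1 win and a 3–0 win land in the same category v = 2, while draws (v = 0) and one-goal wins (v = 1) keep their own weights, as argued above.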
We again consider the game categories defined in Table 1 and thus we now solve the following problem:
Parameters ξ_c, ζ_v, η, κ, and α are optimized again using the ALO approach we described in Section 3.2, that is, by minimizing the log-score criterion (32) using an alternate optimization similar to that defined in (11)–(12). The results shown in Table 6 allow us to conclude the following. The weighting of the MOV-categories is more beneficial than the weighting of the game-categories: optimizing the MOV-weights ζ_v (and keeping ξ_c = 1) yields … The optimization indicates that the ζ_v defined by (41) are suboptimal. In particular, the optimal MOV weights ζ_v are monotonically growing (as foreseen by the heuristic (41)) only for |d| ≥ 1, while the draws (i.e., |d_t| = 0) have a weight that is more important than that of the events |d_t| = 1; thus, these two events (d_t = 0 and |d_t| = 1) should not be combined, nor should we impose a particular functional form on the weights ζ_v. The best prediction improvement is obtained, again, by optimizing the parameters η and κ of the Davidson model together with the MOV weights ζ_v.
Batch-rating parameters obtained via minimization of the log-score (32) with weighting of the MOV variables. The parameters (α, κ, η, ξ_c, ζ_v) are either fixed (shadowed cells) or obtained via optimization.
The MOV modelling consists in defining a formal relationship between the skills
The model (43) is a particular case of the more general form shown in Karlis and Ntzoufras (2008), which models the offensive and the defensive skills. Here, however, we are interested in rating, and thus one skill per team should be used. As noted in Ley et al. (2019) and Lasek and Gagolewski (2020), this offers sufficient prediction capacity while avoiding the overparameterization due to doubling the number of skills.
Using (45) in (43), the following log-likelihood is obtained:
The derivative of (46) is given by
The batch rating then consists in solving the following problem:
To calculate the log-score, we have to calculate the probabilities
The results shown in Table 7 indicate that, with this very simple approach (only two parameters of the model must be optimized), we are able to improve over the MOV-weighting strategy; this should be attributed to the use of a formal skills-outcome model. The price to pay for the improvement lies in abandoning the legacy of the Elo algorithm.
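The Skellam distribution of the goal difference D = H − A, with H and A Poisson-distributed, can be evaluated with a truncated convolution sum, avoiding Bessel functions. How the rates depend on the skills follows (45) in the paper and is not reproduced here; the sketch below takes the two rates as given and assumes they are strictly positive.

```python
import math

def _pois(k, lam):
    # Poisson pmf computed via logs to avoid overflow; assumes lam > 0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def skellam_pmf(d, lam_home, lam_away, kmax=100):
    """P(D = d) for D = H - A with H ~ Poisson(lam_home), A ~ Poisson(lam_away).

    Truncated convolution: P(D = d) = sum_k P(H = k + d) * P(A = k);
    kmax = 100 is ample for football-like scoring rates.
    """
    lo = max(0, -d)  # need k + d >= 0
    return sum(_pois(k + d, lam_home) * _pois(k, lam_away)
               for k in range(lo, lo + kmax))
```

The mean of D is λ_home − λ_away, so the home advantage and the skill difference shift the whole goal-difference distribution, which is what the Skellam rating exploits.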
Moreover, possible implementation issues may arise since the expected score (49) is theoretically unbounded. Thus, whether the improvement of the log-score from
Batch-rating parameters obtained via minimization of the log-score (32) using the Skellam model (46).
Before starting a metrics-based comparison of the on-line algorithms, in Section 5.1 we will address the practical issue of setting the scale.
Scale adjustment
The scale is obviously irrelevant in batch optimization, and the on-line update can also be written in a scale-invariant manner by dividing (20) by s:
However, in the FIFA ranking, a non-zero initialization
It is easy to see that using s > s0 will force the algorithm to change significantly
Since scaling the skills up/down changes their empirical moments, we suggest choosing the scale s in a moment-preserving manner. To this end, we define the empirical standard deviation of the skills
In fact, the initialization used by FIFA yields σ_0 = 220 and, after running the original FIFA algorithm, we obtain σ_T = 252.
Changing the scale s, we obtain different values of σ_T; the idea is thus to run the algorithm for different values of the scale s (e.g., for multiples of 50) and to choose the one that produces a standard deviation σ_T ≈ σ_0. In practice, this might be done using historical data before the new rating is deployed.
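This moment-matching selection can be sketched as a simple loop over candidate scales; `run_rating(s)`, which returns the final skill vector produced by running the on-line algorithm with scale s, is an assumption of this sketch.

```python
import statistics

def choose_scale(run_rating, scales, sigma0):
    """Pick the scale whose end-of-run skill spread best matches sigma0.

    `run_rating(s)` must return the final skill vector produced by the
    on-line algorithm run with scale s; sigma0 is the spread of the
    initialization that we want to preserve.
    """
    def spread(s):
        return statistics.pstdev(run_rating(s))
    return min(scales, key=lambda s: abs(spread(s) - sigma0))
```

With σ_0 = 220 and the Davidson algorithm, this procedure is what leads to the choice s = 150 reported below.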
In this manner, we found s = 150 to be suitable for the Davidson algorithm: we obtained σ_T ≈ 220 for κ = 2 and σ_T ≈ 210 for κ = 1.
This indicates that the scale s = 600 was too large for the FIFA rating. This can be seen by comparing the result of the FIFA rating with ξ_c = 1 (in Table 8a) to the results of the Davidson algorithm (with η = 0 and κ = 2). Both algorithms are essentially the same (although FIFA uses the shootout/knockout rules, which have a rather small impact on performance), and the main difference resides in the scale. Since the Davidson algorithm (η = 0, κ = 2) with the scale s = 150 is equivalent to the FIFA algorithm with the scale s = 300, the latter scale value would ensure a better performance of the FIFA rating. However, this effect appears only due to the limited observation time at our disposal and will vanish after a sufficiently large number of games.
Similarly, for the Davidson-MOV algorithm, using s = 200 and different values of V, we obtained σ_T ≈ 220, while using the scale s = 300 in the Skellam algorithm yields σ_T = 225.
To evaluate the SG algorithms, we use the same methodology applied in the preliminary evaluation of the FIFA rating in Section 2.1. That is, the first (approximate) half of the observation period is used for initialization, and the second half is used to calculate the performance metrics. The difference from Section 2.1 is that we now use the log-score and the accuracy metrics
We consider the original and modified FIFA algorithm (Table 8a), the Davidson algorithm (Table 8b), the Davidson-MOV algorithm (Table 8c), and the Skellam algorithm (Table 8d).
Parameters and performance of the on-line rating algorithms obtained by minimizing the log-score (60) for a) FIFA algorithm, b) Davidson algorithms, c) Davidson-MOV algorithm from Section 4.1, and d) Skellam algorithm from Section 4.2
a) FIFA algorithm, s = 600. The log-score obtained by removing the knockout/shootout rules is indicated by
b) Davidson algorithm, s = 150.
c) Davidson-MOV algorithm, s = 200.
d) Skellam algorithm, s = 300.
In all cases but the original FIFA algorithm, we ignore the game-category weighting (i.e., we use ξ_c ≡ 1) because, as already shown, its effect is negligible. This is clearly seen in the first row of Table 8a, where the FIFA weighting produces worse results than ignoring the weighting altogether. This is essentially the same result as that shown in Table 2, but we repeat it here to show the log-score metric, which we could not calculate without first introducing the Davidson model underlying the FIFA algorithm.
In Table 8a we also show the log-score obtained by removing the knockout/shootout rules. The most notable improvements are due, in similar measure, to two elements: the introduction of the HFA coefficient η and the explicit use of the Davidson model (and thus the optimization of the coefficient κ). Additional small but still perceivable gains are obtained by introducing the MOV-weighting, where, following the lesson learned in Section 4.1, we independently weight the draws and the home/away wins. It is sufficient to use only two weights (V = 1), i.e., the very concept of the MOV is, de facto, reduced to a distinction between the draws and the home/away wins. The MOV-modeling using the Skellam distribution again brings a small benefit.
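To illustrate how these two elements enter the model, the following sketch computes the three outcome probabilities under the Davidson model, with the HFA η added in z/s units on non-neutral venues (one plausible parametrization, consistent with the description in this work); the function name is ours, and the default parameter values are taken from the optimized Davidson column of Table 9:

```python
def davidson_probs(theta_home, theta_away, s=150.0, eta=0.3, kappa=1.0,
                   neutral=False):
    """Probabilities (home win, draw, away win) under the Davidson model,
    with the HFA eta added in z/s units when the venue is not neutral.
    A sketch of the parametrization discussed in the text, not FIFA's code."""
    d = (theta_home - theta_away) / s + (0.0 if neutral else eta)
    x = 10.0 ** (d / 2.0)
    denom = x + 1.0 / x + kappa
    return x / denom, kappa / denom, (1.0 / x) / denom

p_home, p_draw, p_away = davidson_probs(1600.0, 1550.0)
assert abs(p_home + p_draw + p_away - 1.0) < 1e-12
```

Note that, unlike the logistic curve alone, the model assigns an explicit, non-zero probability to the draw, which is what makes the log-score computable.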
In Table 9 we present the ratings of the top teams obtained with the new rating algorithms. Of course, due to the different scales we used, the skills obtained with different algorithms cannot be compared directly.
Ranking of the top teams using the algorithms compared in Table 8: FIFA with ξ c ≡ 1, K = 55, Davidson with K = 35, η = 0.3, κ = 1.0, Davidson-MOV with V = 1, K = 40, η = 0.3, κ = 0.9, and Skellam with K = 7.5, η = 0.2, c = -0.07
We emphasize that the quality of these rankings (i.e., ordered skills) cannot be assessed because there is no reference order of the teams to which the shown rankings can be compared. The only tool we have to assess their validity is to calculate the performance criteria as we did in Table 8 using the log-score.
Noting that even the rather mild differences between the Davidson and the Davidson-MOV algorithms alter the final order/ranking, Table 9 should be treated as a cautionary illustration that the ranking/order of the teams can be very easily changed by relatively benign modifications of the rating algorithm. With that caveat, the different algorithms based on different models consistently put the same group at the top of the list. In fact, the algorithms are rather consensual regarding the current (as of March 31, 2022) official top team, BRA, which either remains at the top of the list or has a skill within a fraction of a percent of the top team’s rating. On the other hand, BEL’s second position in the official ranking (see Table 3) is much more questionable. While the second spot is preserved with the optimized FIFA-like algorithm (the first column of Table 9, where we use K = 55 and ξ c ≡ 1, but the knockout and shootout rules are kept), the new algorithms consistently demote BEL to the fifth or lower position.
In this work, we analyzed the FIFA ranking using the methodology conventionally used in probabilistic modeling and statistical inference. In the first step, we made a preliminary evaluation of the algorithm using the probabilistic concepts explicitly used in the FIFA description. In this way, we were already able to question the need for weighting the outcomes according to the FIFA-defined game category.
We also evaluated the heuristic shootout/knockout rules used in the FIFA rating. We concluded that their usefulness is questionable: their impact on the overall performance is small, and they may distort the relationship between the ratings of the strong teams, which often face each other in the final stages of the competitions.
To go beyond the limitation of the rudimentary probabilistic concepts of the FIFA algorithm, we identified the model that relates the game results to the parameters that must be optimized (skills). More precisely, we have shown that the FIFA algorithm can be formally derived as the stochastic gradient (SG) optimization of the weighted maximum likelihood (ML) criterion in the Davidson model (Davidson, 1970).
This step allows us to define the performance metrics related to the predictive performance of the algorithms we study. This is particularly important in the case of the FIFA ranking algorithm, which does not model the outcomes of the game but only explicitly specifies the expected score; and this is not sufficient to accurately assess the rating results. It also allows us to apply the batch approach to rating and skills estimation. This conventional machine learning strategy frees us from considerations related to scale, initialization, or modeling of skills dynamics.
Using the batch rating, we have shown that the game-category weighting is negligible at best and counterproductive at worst, the latter being the case for the weighting used by the FIFA rating. This observation is interesting in its own right because, while the concept of weighting is used in the rating literature, e.g., (Ley et al., 2019), the literature does not show any evidence that it is in any way beneficial, and our findings consistently indicate the contrary.
Next, we considered extensions of the algorithm obtained by including the home-field advantage (HFA) and by optimizing the parameter responsible for the draws. These two elements seem to be particularly important from the point of view of the performance of the rating algorithm. While the HFA is well known and is part of the FIFA Women’s rating (FIFA, 2007), the possibility of generalizing the Elo algorithm by using the Davidson model has only recently been shown in Szczecinski and Djebbi (2020).
We also evaluated the possibility of using the margin of victory (MOV) given by the goal differential: we analyzed the inclusion of the MOV through weighting, as well as the explicit modeling of the MOV variables using the Skellam distribution. These two methods further improve the results at the cost of greater complexity. Here, optimization of the weights also yields interesting and somewhat counterintuitive results. That is, we have shown that games won with a small margin should have smaller weights than the draws. This stands in stark contrast with the weighting strategies proposed before, e.g., by Hvattum and Arntzen (2010), Silver (2014), or Kovalchik (2020), which, using monotonically increasing functions of the MOV variable, do not allow for a separate treatment of the draws.
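This finding can be illustrated with the V = 1 weighting scheme, where a single weight is assigned to the draws and another to all home/away wins; the numerical values below are hypothetical, chosen only to show that a draw may legitimately receive a larger weight than a narrow win:

```python
def mov_weight(goal_diff, draw_weight=1.0, win_weight=0.8):
    """Separate weights for draws and wins (the V = 1 case). The values are
    hypothetical, not the optimized ones from this work; the point is that
    a draw (goal_diff == 0) may be weighted more heavily than a narrow win."""
    return draw_weight if goal_diff == 0 else win_weight

assert mov_weight(0) > mov_weight(1)  # draw outweighs a one-goal win
```

The weight would multiply the rating update step; a monotonically increasing weight function, as used in the earlier works cited above, cannot produce this ordering.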
Recommendations
Given the analysis and the observations we made, if the FIFA rating is to be changed, the following steps are recommended:

1. Add the home-field advantage (HFA) parameter to the model, because playing at the home venue is a strong predictor of victory. This well-known fact is already exploited in the Women’s FIFA ranking, and such a modification is most likely the simplest and least debatable element. In our opinion, it is surprising that the current rating, adopted in 2018, does not include the HFA. The HFA can be obtained through optimization, or it can be calculated using the simple formula (36).

2. Use an explicit model to relate skills to outcomes. Not only is the expressiveness increased by providing the explicit probability of the draws, but the prediction results are also improved. Note that the rating algorithm introduced recently by FIVB adopts this approach and specifies the probability of each of the game outcomes. In the context of the FIFA ranking, the Davidson model we used in this work is an excellent candidate for that purpose: it relies on a natural generalization of the Elo algorithm, preserving the legacy of the current algorithm. Again, to find the parameter of the model, we may use optimization or the simple formula (37).

3. Remove the weighting of the games according to their assumed importance, because the data does not provide any evidence of its utility, or rather indicates that the weighting in its current form is counterproductive. If the concept of game importance is of extra-statistical nature (such as entertainment), it is preferable to diminish its role, e.g., by shrinking the gap between the largest and the smallest values of ξ c used.

4. Remove the shootout and knockout rules, which are not rooted in any solid statistical principle. As far as the knockout rule is concerned, its beneficial effect on the prediction quality is negligible compared to the advantages of using the HFA and the draw model. Regarding the shootout rule, from a rating perspective, we recommend that shootouts be treated as draws. Overall, the small frequency of events to which the shootout/knockout rules apply, and the marginal change in the obtained score, make their impact negligible, while their fairness is very debatable (although the heuristics behind the knockout rule deserves more study).

5. If the rating were to consider the MOV, the simplest solution would be to weight the update step using the goal differential. On the other hand, changing the model to the Skellam distribution may cause numerical problems, and the relatively small performance gains hardly justify the added complexity.
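The recommended changes can be condensed into a single Elo-style update step. The sketch below assumes the Davidson expected score with the HFA added in z/s units, and treats shootouts as draws; it illustrates the structure of the recommendation, not FIFA's official implementation, and the formulas (36)-(37) for setting η and κ are not reproduced here (the defaults are the optimized values from Table 9's caption):

```python
def update_ratings(theta_h, theta_a, outcome, K=35.0, s=150.0, eta=0.3,
                   kappa=1.0, neutral=False):
    """One Elo-style stochastic-gradient step under the Davidson model.
    outcome: 1.0 = home win, 0.5 = draw (shootouts treated as draws),
    0.0 = away win. A structural sketch, not FIFA's official algorithm."""
    d = (theta_h - theta_a) / s + (0.0 if neutral else eta)
    x = 10.0 ** (d / 2.0)
    expected = (x + 0.5 * kappa) / (x + 1.0 / x + kappa)  # E = P(win) + P(draw)/2
    delta = K * (outcome - expected)
    return theta_h + delta, theta_a - delta

# A home win on a neutral venue between equally rated teams moves 17.5 points.
h, a = update_ratings(1500.0, 1500.0, 1.0, neutral=True)
```

As in the current algorithm, the update is zero-sum: the points gained by one team are lost by the other.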
Further work
The analysis we carried out in this work should not be considered exhaustive by any means and was meant (i) to provide an understanding of the current FIFA rating and (ii) to propose the simplest and yet meaningful modifications of the current algorithm.
Our recommendations regarding further work on the improvement of the rating are the following:

1. Besides the simple weighting strategies we analyzed here, to deal with the MOV we should consider alternative solutions similar to those already used in the Women’s teams FIFA ranking. Again, the latter should be studied, e.g., using the methodology we used in this work and basing the results on a formal probabilistic model.

2. To improve the tracking capabilities of the algorithm and to reduce its sensitivity to the randomness of the game outcomes, we should consider the Bayesian estimation methods proposed previously in Glickman (1999) (Glicko algorithm) and in Herbrich and Graepel (2006) (True Skill algorithm). These algorithms explicitly estimate the reliability of the skills estimates, which improves the predictive capacity and provides a more nuanced interpretation of the rating. However, it should be noted that the Glicko and True Skill algorithms are not model-agnostic, and using them with the Davidson model, as we postulated here, is not straightforward. In this regard, the recent formulation in Szczecinski and Tihon (2021), which is applicable in any model, may streamline the development.

3. We should design a uniform treatment of teams that play very infrequently, which may happen naturally (e.g., due to geographic isolation) or be done intentionally (to preserve the ranking position). Among the possible avenues to deal with this issue are (i) an automatic rating-point penalty similar to what is done in the FIVB ranking and/or (ii) a decrease of the estimation reliability similar to what is done in the Glicko and True Skill algorithms.

4. Adding new teams to the rating should be handled with more care. This issue was highlighted by the recent (March 2022) reintroduction of the Cook Islands to the rating after many years without playing any FIFA-recognized game. In fact, many national teams, e.g., among those already recognized by CONCACAF, may, at some point, also be recognized by FIFA, which will then have to decide on their initial rating. In this case, taking into account games that are not recognized by FIFA is likely the most efficient approach.
Footnotes
Appendix
We had to deal with minor exceptions:

- We recognized the victory of Guyana (GUY) over Barbados (BRB) in the game played on Sept. 6, 2019 already on the date of the game, while in the FIFA rating a draw was originally registered and GUY’s victory was recognized only later, when BRB was disqualified for having fielded an ineligible player.

- We removed the Côte d’Ivoire (CIV) vs. Zambia (ZAM) game, played on June 19, 2019, where CIV (the winner) and ZAM exchanged 2.21 points. The removal of this game from the FIFA-recognized list seems to be the reason why FIFA changed the ratings of both teams between the two official publications on Dec. 19, 2019 and Feb. 20, 2020: CIV’s rating was changed from 1380 to 1378 and ZAM’s from 1277 to 1279, despite neither team playing at all in this period of time.

- The game of the Cook Islands (COK) against the Solomon Islands (SOL) on March 17, 2022 was removed because it is the only COK game in the entire period, and thus COK’s rating, which was assumed by FIFA to be equal to 908 before the game, is not based on any recent results. The disadvantage of this removal is that it affects the rating of SOL, which played three more games before March 31, 2022. We recognize that the introduction of a new team to the system is indeed a challenging issue from a rating perspective.
The role of the scale is to ensure that the values of the skills θt,m are situated in a visually comfortable range; the interplay between the scale and the initialization
The team ranked r was assigned the rating θm,0 = 1600 - 4 (r - 1); thus Germany (ranked first, r = 1) was assigned θm,0 = 1600, and the ratings of the other teams were decreased by four points with each position, so Brazil was assigned θm,0 = 1596, Belgium θm,0 = 1592, etc. Tied positions in the previous ranking were dealt with by skipping the lowest of the tied ranking positions; e.g., if two teams were ranked r = 10, the next available ranking position was r = 12.
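This initialization rule can be expressed directly (a sketch of the rule described above, not FIFA's published procedure):

```python
def initial_rating(r):
    """Initial 2018 rating from the previous ranking position r (1-based):
    1600 for the top team, minus four points per position."""
    return 1600 - 4 * (r - 1)

assert initial_rating(1) == 1600  # Germany
assert initial_rating(2) == 1596  # Brazil
assert initial_rating(3) == 1592  # Belgium
```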
Information provided by Football Rankings (2021) is highly valuable because it is far from straightforward to verify which games are included in the rating and what their importance I c is. In particular, games in the same tournament can be included or excluded from the rating, and in some cases the changes can be made retroactively, further complicating the understanding of the rating results.
We emphasize that
We note that, in (9) we assume that all scoring functions
Alternatively, we may remove (12), e.g., set K = 5, and carry out only the optimization (11) for c = 0, 1, …, 8. The optimal solution will then be obtained by exploiting (5) as K ← Kξ0 and ξ c ← ξ c /ξ0, c = 1, …, 8.
The immediate question is whether the optimization of the
Of course, due to recursive calculations in the FIFA algorithm, eliminating the knockout/shootout rules is not the same as evaluating the points (not lost in the original algorithm) and discarding them from the final results.
The negation in (17) allows us to use a minimization in (16) which is a very common formulation.
Out of the T = 3444 games we considered, 948 were played at neutral venues. To automatically verify the venues, we used (The Roon Ba, 2022; SoccerWay, 2022); only for the game Djibouti vs. Mauritius played on Nov. 23, 2019 was the venue (home) not registered in the databases, and we found it manually.
The same problem arises, of course, when a team registers a sequence of pure losses. This is not a hypothetical issue: in the official FIFA games, three teams registered such losing streaks: Tonga (three games), Eritrea (two), and American Samoa (four). Thus, the attempt to solve the batch-optimization problem without regularization (i.e., with α = 0) would yield
A quick comment may be useful regarding the interpretation of the performance metrics. The accuracy (33) is easily understandable: it is the average number of events that were predicted correctly (as those y which yield the largest likelihood L(zt/s; y)). On the other hand, the metric (32) may be represented as
However, the fundamental difference between the two metrics is that we can use the accuracy without specifying the distribution for all possible outcomes, but we cannot calculate the log-score in such a case.
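The distinction can be made concrete with a small sketch for a three-outcome game (outcomes coded 0 = home win, 1 = draw, 2 = away win); the prediction vectors are illustrative toy values, not results from this work:

```python
import math

def accuracy(predictions, outcomes):
    """Fraction of games whose observed outcome received the largest
    predicted probability."""
    hits = sum(1 for p, y in zip(predictions, outcomes)
               if y == max(range(len(p)), key=p.__getitem__))
    return hits / len(outcomes)

def log_score(predictions, outcomes):
    """Average negative log-probability of the observed outcomes; it is
    undefined (log of zero) if an observed outcome has probability zero."""
    return -sum(math.log(p[y]) for p, y in zip(predictions, outcomes)) / len(outcomes)

# Toy predictions over (home win, draw, away win) and observed outcomes.
preds = [(0.5, 0.3, 0.2), (0.2, 0.3, 0.5), (0.6, 0.25, 0.15)]
obs = [0, 1, 0]
assert abs(accuracy(preds, obs) - 2 / 3) < 1e-12
```

Note that the accuracy only needs the argmax of each prediction, while the log-score requires a full, strictly positive probability for every possible outcome.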
We also note that a common confusion is to interpret the function F(zt/s) in the Elo/FIFA algorithm as the probability of the home win, and the value 1 - F(zt/s) as the probability of the away win. This, of course, implies that the draw probability is equal to zero. With this interpretation, we can still calculate the accuracy metric even if we never predict the draw. On the other hand, we cannot calculate the log-score because, when a draw occurs, we have an undefined metric (the logarithm of zero).
Therein, the unnormalized value ηs = 100 is reported and since s = 400, we obtain η = 0.25.
We can rewrite (45) as
The True Skill algorithm uses the Rao-Kupper (Rao & Kupper, 1967) model for the draws. On the other hand, the model used in the Glicko algorithm may be recognized as the Davidson model, but the development is done only for κ = 1.
