Sage Journals: Discover world-class research

Abstract

The Indian Premier League (IPL) is the most prestigious cricket league globally. There are significant finances in terms of both team ownership and player salaries. It is, therefore, essential to understand whether a team’s record is due to luck (good or bad) or to the team’s overall performance. The research presented here is motivated by how to accurately predict a team’s winning percentage in the IPL based on underlying statistics. A similar analysis has been done in other sports, mainly based on the concept of the Pythagorean expectation. This research derives a similar model for the IPL based on historical data. However, the structure of a match in the IPL is fundamentally different from the structure of games in other sports. As a result of this structural difference, this study creates additional models using both least absolute shrinkage and selection operator (LASSO) and stepwise regression to identify variables that are good predictors for calculating the expected winning percentage. These models compare favorably to the Pythagorean expectation model. This article presents a model combining both the determined variables and Pythagorean expectation.

Keywords

Baseball, cricket, luck, Pythagorean expectation, sport analytics

Introduction

The game of cricket is a bat-and-ball game where the objective is to score more runs than the opposing team. Cricket is among the most popular sports globally, with viewership second only to football/soccer. The world’s most financially lucrative cricket league is the Indian Premier League (IPL); in 2019, the value of the IPL was over US$6 billion. This research investigates the calculation of the expected winning percentage of a team during a season based on their underlying statistics. The format of T20 cricket played by the IPL differs from other sports. For most other sports, a clock controls the game’s length. In cricket, there is no clock, so the game ends after each team has completed their overs rather than the clock expiring. However, there is an additional significant difference between cricket and other sports concerning the change from offense to defense. In most sports, teams continually shift between offense and defense as possession of the ball changes. For cricket, there is a single change between offense and defense.

As will be seen in the literature review, there has been substantial work investigating baseball winning percentages. Baseball has many similarities to cricket (both sports have similar origins). For example, both are bat-and-ball games and do not involve a clock. However, the switch from offense to defense is significantly different between the two sports. Each team has 27 outs in baseball, with the outs split up into three-out innings. In a baseball game, one team has three outs, then the teams switch from offense to defense, and the second team gets three outs. The process repeats until all 27 outs have been utilized. In T20 cricket, one team utilizes all of its overs, and then there is a single instance where the teams switch from offense to defense. In baseball and T20 cricket, if one team has no offensive opportunities left and has a lower score, the game ends without playing out the remaining opportunities. In baseball, this is not as significant a concern because, at most, three outs are not played, and, based on the average rate of scoring, less than one run would be scored by the team during the skipped offensive opportunities. In T20 cricket, a more significant percentage of the total number of overs could go unplayed, and the scoring rate per over is substantially greater than 1. Based on the data for this article, an average of 17.8 overs per game are played out of the 20 possible overs. As shown later, this single switch from offense to defense provides a challenge in evaluation that is not seen in other sports because the total score is not as significant as the scoring rate.

Literature review

As both sports analytics and the size of the IPL grow, the amount of academic literature is also increasing. While not directly related to team performance, research has investigated the efficiency of spending by IPL franchises. Singh¹ uses data envelopment analysis (DEA) to measure the technical efficiency of spending versus performance. Jana et al.² use DEA and structural equation modeling (SEM) to perform a similar analysis.

Several researchers have applied machine learning techniques to predict the winner of a match immediately before the match commences. Nimmagadda et al.³ implement a random forest algorithm to predict the winner of the match based on predicted scores determined using multiple variable regression and logistic regression. Kapadia et al.⁴ use filter methods to identify key features and then use this data to solve the classification problem of which team will win. Raja et al.⁵ use machine learning techniques to predict player performance and use these estimates to solve the classification problem of the winner of the match. Jayanth et al.⁶ use support vector machines (SVM) to predict the winner of a match by grouping players at different levels in the batting order. While not explicitly applied to the IPL, Wickramasinghe⁷ uses a Naive Bayes approach to predict the winners of a One Day International (ODI) cricket match. Bhattacharjee and Talukdar⁸ investigate the application of discriminant analysis using a defined pressure index. Their cross-validated results show good predictive performance.

In addition to building a model for predicting match winners, Singh and Kaur⁹ and Raviteja et al.¹⁰ both use data visualization to help show player performance. Tekade et al.¹¹ use different supervised machine learning algorithms to predict the result of a match. Similarly, Vistro et al.¹² use machine learning algorithms to predict the winner before the match begins. Tripathi et al.¹³ also predict the winner of a single match but have the added contribution of solving multicollinearity issues. Sinha et al.¹⁴ provide results that not only predict the winner of a match using machine learning algorithms but also provide a model that helps identify the order batters should bat in and bowlers should bowl in.

Rather than having predictions made at the beginning of the match, Shah¹⁵ uses Duckworth–Lewis par score to predict the winner of the match on a ball-by-ball basis. Because of the nature of the game, the expected winner can change significantly as the match progresses. Bose and Chakraborty¹⁶ investigate a ball-by-ball method using control charts applied to the second innings of the match. Weeraddana and Premaratne¹⁷ use XGBoost to predict the winning team and the score on an over-by-over basis.

Prakash et al.¹⁸ implemented a Deep Mayo Predictor model that was applied to all matches of the IPL in 2016 and could predict most of the outcomes correctly. However, the model is based on a game-by-game basis rather than investigating the performance over the entire season.

Jayalath¹⁹ uses classification and regression tree (CART) and logistic regression to investigate factors contributing to the likelihood of winning a match and show that the home field is an important consideration for ODI cricket. Factors such as home-field advantage are influential for predicting single matches. Throughout the IPL season, a team plays one match at their home and one match at the other team’s home, so the winning percentage should be less impacted by the home field over the entirety of the season.

While creating a model to predict which team will win the match provides essential information, Dhonge et al.²⁰ also use linear, least absolute shrinkage and selection operator (LASSO), and ridge regression to predict the final score of a match. Patil and Dalgade²¹ use machine learning to predict the score of the team batting first and the win probability of the chasing team.

Most models are based on batter and bowler statistics, such as wickets and runs. In contrast, Scholes and Shafizadeh²² show that for Champions League T20, a model can use fielding indicators to predict the winner of a match.

Rather than focus on a single season or league, Khan et al.²³ investigate the factors contributing to Bangladesh’s performance in ODI cricket. They compare the use of a logistic regression model and a modified Poisson model, with both models providing good results but with the Poisson regression having smaller confidence intervals.

Many models face a challenge: when two teams compete against each other, the model will always pick the same team to win. However, it is not uncommon for two teams that play each other repeatedly to split which team wins and which team loses. Lemmer et al.²⁴ introduce a consistency adjustment to increase the accuracy of their predicted model.

The team that bats first is unaware of how the second team will perform offensively. In comparison, the team that bats second knows what it must accomplish to win the match. Modekurti²⁵ developed a deterministic model that determines an appropriate target for the team batting first to estimate what level of offensive production will likely win the match. This model provides a similar target to the target a team batting second has.

While individual match results are essential, the question investigated by this research focuses on the performance over the entire season. Sudhamathy and Meenakshi²⁶ use historical data and machine learning to predict who will win the series. Singh et al.²⁷ use machine learning to identify the ICC Men’s T20 Cricket World Cup winner in 2020.

To the author’s best knowledge, no scholarly research has been presented to date investigating the predicted win-loss record of a team in the IPL for a season. However, such research has been performed for other sports, especially baseball. The most well-known method is the Pythagorean expectation, which Miller formally derived.²⁸

Contributions

After every season, team ownership attempts to enact changes that will enhance the club’s performance during the following season. Many times, the win-loss record of the team influences these decisions. A team that won most of its matches is unlikely to make significant changes, while a team that lost most matches is likely to make substantial changes. However, a team’s win-loss record is not deterministic; if a season were somehow played multiple times, the results would not be identical with each replication. Because luck plays a role in the final win-loss record, management should base decisions on the expected win-loss record rather than the actual win-loss record. To the best of the author’s knowledge, no research has been done investigating luck in cricket. However, Bill James²⁹ first investigated luck in baseball, and subsequent work has built on his findings. In baseball, a lucky team that won significantly more than they should have won will likely lose more in the upcoming season if significant changes do not occur. Meanwhile, a very unlucky team will probably see considerable improvement the following season, even if no significant changes occur. It is logical to conclude that luck similarly impacts a team in any sport.

The significant contribution of this research is that it uses machine learning techniques to determine various models that determine the expected win-loss record for a team in the IPL based on underlying statistics. The Pythagorean expectation created for baseball is adapted and applied using nonlinear regression. This study uses LASSO and stepwise regression to identify essential variables for consideration. A final model combines the Pythagorean expectation approach and the identified critical linear elements.

By better understanding the impact of luck on a team’s record, it is easier to determine what changes should or should not be made to a team. In addition, understanding the statistics contributing to a team’s record provides information that can be used in selecting players most likely to increase a team’s expected winning percentage.

Methods

As mentioned in the introduction, other sports have existing methods for predicting the winning percentage based on underlying statistics. These expected winning percentages are often published daily to give additional insight into the performance of teams. By developing an equation for the expected winning percentage for the IPL, the expected number of wins could be calculated throughout the season, giving fans insight into how lucky (or unlucky) their team is as the season progresses. One of the first such methods is Bill James’s Pythagorean expectation, which was applied to baseball.²⁹ Equation (1) is the formula used for the Pythagorean expectation. Because this method uses a single exponent in calculating the winning percentage, models in the results section based on this formulation are referred to as single exponent models.

w = \frac{r_{o}^{2}}{r_{o}^{2} + r_{d}^{2}}

(1)

where $w$ is the expected winning percentage for the particular team for the season, $r_{o}$ is the total number of runs scored over the entire season for the particular team while on offense, and $r_{d}$ is the total number of runs conceded during the whole season for the particular team while on defense.
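As a minimal sketch of equation (1), the function below computes the Pythagorean expectation from season run totals. The function name and the season totals in the usage line are hypothetical and are not taken from the IPL dataset.

```python
def pythagorean_expectation(runs_scored: float, runs_conceded: float) -> float:
    """Expected winning percentage w from equation (1), with the exponent fixed at 2."""
    if runs_scored <= 0 or runs_conceded <= 0:
        raise ValueError("run totals must be positive")
    scored_sq = runs_scored ** 2
    conceded_sq = runs_conceded ** 2
    return scored_sq / (scored_sq + conceded_sq)


# Hypothetical season totals, for illustration only.
print(round(pythagorean_expectation(2300, 2200), 4))
```

A team that scores exactly as many runs as it concedes gets an expected winning percentage of 0.5, and the value rises above 0.5 as runs scored exceed runs conceded.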

Pythagorean expectation alternatives

While the Pythagorean expectation formula provides reasonable estimates, improvements have been suggested to show better performance. Other approaches have adjusted the exponent from the integer value of 2 to 1.83 to fit the data better. Rather than using a static exponent for all teams, formulas that use a different exponent for each team have been developed, such as equation (2) developed by Clay Davenport.³⁰ Because this method uses a logarithm in calculating the exponent for each data point, models in the results section based on this formulation are referred to as logarithm models.

e = 1.5 \log_{10} (\frac{r_{o} + r_{d}}{g}) + 0.45

(2)

where $e$ is the calculated exponent for each team, $g$ is the number of games, and the other terms are as defined previously. For the IPL, the number of “games” is the number of “matches.” An alternative calculation by David Smyth³¹ has also been put forth and is presented in equation (3). Because this method uses a power in calculating the exponent for each data point, models in the results section based on this formulation are referred to as power models.

e = {(\frac{r_{o} + r_{d}}{g})}^{0.287}

(3)

The results presented below include using regression to determine the static value of $e$ in equation (4) that minimizes the sum of the squares of the residuals of the actual and calculated winning percentages. Equations (2) and (3) provide the value of $e$ for equation (4). Other results presented below use regression to alter equation (2) to determine the multiplier (1.5 for baseball) and intercept (0.45 for baseball) that minimize the sum of the squares of the residuals of actual and calculated winning percentage for the IPL. The results also use regression to find the best exponent from equation (3) (0.287 for baseball) that minimizes the sum of the squares of the residuals of the actual and calculated winning percentages for the IPL.

w = \frac{r_{o}^{e}}{r_{o}^{e} + r_{d}^{e}}

(4)
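The relationship between equations (2) to (4) can be sketched as follows: the logarithm and power formulas each produce a team-specific exponent, which is then passed to the general form in equation (4). The function and parameter names are hypothetical; the default constants are the baseball values quoted in the text.

```python
import math


def logarithm_exponent(runs_scored, runs_conceded, games, multiplier=1.5, intercept=0.45):
    """Team-specific exponent e from equation (2); defaults are the baseball constants."""
    return multiplier * math.log10((runs_scored + runs_conceded) / games) + intercept


def power_exponent(runs_scored, runs_conceded, games, power=0.287):
    """Team-specific exponent e from equation (3); the default is the baseball constant."""
    return ((runs_scored + runs_conceded) / games) ** power


def expected_win_pct(runs_scored, runs_conceded, exponent):
    """Expected winning percentage w from equation (4) for a given exponent e."""
    scored = runs_scored ** exponent
    conceded = runs_conceded ** exponent
    return scored / (scored + conceded)


# Hypothetical 14-match season totals, for illustration only.
e_log = logarithm_exponent(2300, 2200, 14)
e_pow = power_exponent(2300, 2200, 14)
print(expected_win_pct(2300, 2200, e_log), expected_win_pct(2300, 2200, e_pow))
```

Exposing the multiplier, intercept, and power as parameters mirrors how the text later refits these constants for the IPL rather than reusing the baseball values.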

Model creation and metrics

A concern with using regression to fit coefficients is the potential of overfitting the data.³² Although the Pythagorean expectation was initially used in baseball, the idea has been expanded to other sports,³³ with adjustments made to the exponents to fit the available data. In the results section, we apply a sample of the previously defined exponents from baseball and use nonlinear regression to fit a single exponent. To guard against overfitting, multiple folds of the data were used to validate the calculated coefficients. The dataset was split into five random groups; four groups were used to train the model, and the fifth group was used to test the model’s performance. Each test was run five times, with each of the five random groups in turn excluded from the training data and used as the testing data. These same five groups were used for all presented results.
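The five-group procedure described above can be sketched as below. This is only an illustration of the splitting scheme: the function name, the seed, and the use of Python's `random` module are assumptions, and the actual group assignments used in this study are not reproduced.

```python
import random


def five_fold_splits(n_points=100, n_folds=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold.

    Every index appears in exactly one test group, and the same groups
    are reused for every model by keeping the returned list fixed.
    """
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)  # hypothetical seed, for repeatability
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits


for train, test in five_fold_splits():
    print(len(train), len(test))
```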

Three different metrics are used to evaluate empirically derived values: the mean squared error (MSE), the coefficient of determination ( $R^{2}$ ), and the mean absolute error (MAE), which are given by equations (5) to (7) respectively

\mathrm{MSE} = \frac{1}{n} \sum_{i = 1}^{n} {(W_{i} - w_{i})}^{2}

(5)

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(W_{i} - w_{i})}^{2}}{\sum_{i = 1}^{n} {(W_{i} - \bar{W})}^{2}}

(6)

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} | W_{i} - w_{i} |

(7)

where $n$ is the number of data points, $W_{i}$ is the observed winning percentage, $w_{i}$ is the calculated winning percentage, and $\bar{W}$ is the mean of the observed winning percentages. The three metrics provide us with valuable but slightly different information. The MSE measures the error between the numerical model and the observed data and is sensitive to outliers. In contrast, the $R^{2}$ gives the proportion of variability in the data explained by the model. The MAE measures the error between the numerical model and the observed data and is less sensitive to outliers. In addition, the MSE for the test dataset and the overall MSE are provided for each model produced below.
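The three metrics in equations (5) to (7) can be computed directly from paired lists of observed and calculated winning percentages. The sketch below uses hypothetical function names and a small made-up example rather than the study's data.

```python
def mean_squared_error(observed, calculated):
    """MSE from equation (5)."""
    return sum((W - w) ** 2 for W, w in zip(observed, calculated)) / len(observed)


def r_squared(observed, calculated):
    """Coefficient of determination from equation (6)."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((W - w) ** 2 for W, w in zip(observed, calculated))
    ss_total = sum((W - mean_observed) ** 2 for W in observed)
    return 1.0 - ss_residual / ss_total


def mean_absolute_error(observed, calculated):
    """MAE from equation (7)."""
    return sum(abs(W - w) for W, w in zip(observed, calculated)) / len(observed)


# Made-up winning percentages, for illustration only.
observed = [0.50, 0.25, 0.75]
calculated = [0.50, 0.50, 0.50]
print(mean_squared_error(observed, calculated))
print(r_squared(observed, calculated))
print(mean_absolute_error(observed, calculated))
```

In this example the model always predicts the mean, so $R^{2}$ is 0: the model explains none of the variability in the observed values.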

In addition, the Akaike information criterion (AIC)³⁴ is included for some of the models. Equation (8) is used to calculate the AIC

\Delta \mathrm{AIC} = 2 k + n \ln \left[ \sum_{i = 1}^{n} {(W_{i} - w_{i})}^{2} \right]

(8)

where $k$ is the number of parameters of the model, and the other terms are as described previously.
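As a rough check of equation (8), the sketch below evaluates it for the group 3 single exponent model. It assumes the criterion is computed on the 80 training points and recovers the sum of squared residuals from the rounded training MSE of 0.0127 reported in Table 2; neither assumption is stated explicitly in the text, so small differences from Table 5 are expected.

```python
import math


def delta_aic(sum_squared_residuals, n_points, n_params):
    """Equation (8): 2k + n * ln(sum of squared residuals)."""
    return 2 * n_params + n_points * math.log(sum_squared_residuals)


# Assumed inputs: 80 training points, one fitted parameter, and the
# rounded group 3 training MSE from Table 2 (0.0127).
n_train = 80
sse_group_3 = 0.0127 * n_train
print(round(delta_aic(sse_group_3, n_train, 1), 2))
```

Under these assumptions the value is close to the 3.3 listed for group 3 in Table 5, and adding a second parameter raises it by 2, which matches the gap between the single exponent and logarithmic exponent columns.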

Raw statistics versus rates

While the total score-based methods have provided good results for other sports and are somewhat reasonable for the IPL, further investigation demonstrated that different team performance measures are better suited for the IPL. Our results indicate that rates are better than raw totals for the IPL. In particular, wickets also have a significant impact on the winning percentage. There is a continual alternation between offense and defense in most other sports. As mentioned above, an IPL match involves a single switch between the offensive and defensive teams.

It is not uncommon for a team to score only one more run than its opposition, but this is problematic for the Pythagorean methods based on run totals. The results are skewed when a team that bats second wins by only one run (or a few runs) but has a significant number of overs remaining. Net run rate (NRR) in equation (9) is a common statistic used in cricket to account for this discrepancy and uses the economies rather than the total number of runs; the ratio of runs to overs is also known as the economy

\mathrm{NRR} = \frac{r_{o}}{o_{o}} - \frac{r_{d}}{o_{d}}

(9)

where $o_{o}$ is the number of overs the team utilized while batting on offense and $o_{d}$ is the number of overs the team bowled to the opposing team while on defense.
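Equation (9) is simple enough to sketch in a few lines. The function name is hypothetical, and overs are assumed to be expressed as decimal fractions of an over (so 17.5 means seventeen and a half overs), not in the scorecard notation where the digit after the point counts balls.

```python
def net_run_rate(runs_scored, overs_batted, runs_conceded, overs_bowled):
    """NRR from equation (9): run rate while batting minus run rate while bowling."""
    return runs_scored / overs_batted - runs_conceded / overs_bowled


# Hypothetical chase: the team bowls 20 overs, concedes 150, and then
# scores 151 in only 15 overs. The run totals differ by one run, but the
# scoring rates differ substantially.
print(round(net_run_rate(151, 15.0, 150, 20.0), 3))
```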

The use of economy rather than the total number of runs illustrates that the differences in the sport require a different approach to calculating the expected winning percentage than in other sports. Therefore, the desire was to investigate whether other team statistics might better estimate the winning percentage of an IPL team because of the different construction of the play of a match.

Algorithms

Two different methods were used to investigate other potential calculations of the winning percentage. The LASSO³² and stepwise regression³² were used to find models to fit the data. Over 40 different variables were considered for the model. The Appendix includes all considered variables.

Because both techniques have the potential to overfit models, all ten produced models were analyzed to determine commonalities. There are five datasets and two algorithms. Therefore, ten models are produced, with the first five models created with LASSO and each of the five datasets and the second five models created with stepwise regression and the five datasets. Each of the 10 models has, potentially, different independent variables that they use to calculate the expected winning percentage. Independent variables that appear in a vast majority of the 10 models are likely to influence the winning percentage. Independent variables that appear in only a few models are more likely a product of overfitting.

LASSO includes a regularization parameter³⁵; rather than selecting a single value of this parameter, 100 values were applied. Of the 100 resulting models, the model that produced the best mean squared error while not requiring more than seven decision variables was chosen for the LASSO results. The choice of a maximum of seven decision variables was made empirically as a tradeoff between accuracy and overfitting. LASSO was selected over ridge regression because it can force coefficients to be 0.
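The selection rule described above can be sketched independently of how the LASSO models are fitted. The function below assumes the candidate models have already been produced, one per regularization value, and are supplied as hypothetical `(mse, coefficients)` pairs; the fitting step itself is not shown.

```python
def select_lasso_candidate(candidates, max_variables=7):
    """Pick the lowest-MSE candidate that uses at most `max_variables` variables.

    `candidates` is a list of (mse, coefficients) pairs, where a coefficient
    of exactly 0 means the variable was dropped by LASSO.
    """
    eligible = [
        (mse, coefficients)
        for mse, coefficients in candidates
        if sum(1 for c in coefficients if c != 0.0) <= max_variables
    ]
    if not eligible:
        raise ValueError("no candidate satisfies the variable limit")
    return min(eligible, key=lambda pair: pair[0])


# Made-up candidates: the first has the best MSE but uses nine variables.
candidates = [
    (0.0070, [0.1] * 9),
    (0.0080, [0.1] * 7 + [0.0] * 2),
    (0.0090, [0.1] * 3 + [0.0] * 6),
]
best_mse, best_coefficients = select_lasso_candidate(candidates)
print(best_mse)
```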

For the stepwise regression results, the maximum p-value for entry into the model was 0.05, and the minimum p-value for removal was 0.1. Interaction terms were not included in the stepwise analysis.

Results and discussion

Data from the 2008 through 2019 seasons were used for this analysis. Each data point includes the winning percentage and accumulated statistics during non-playoff matches for a team over an entire season; in most instances, the winning percentage and other statistics are for the 14 matches a particular team played. There were 100 data points, and all 100 data points were used in the analysis. Each test group consisted of 20 randomly selected data points, and the remaining 80 data points were used for model training. Each of the five test datasets contained unique data, so each of the 100 data points was a member of one and only one test dataset.

Traditional baseball coefficients

Table 1 compares results using standard coefficients from baseball as described in the previous section. Of the two results, the best mean squared error is 0.0172, which corresponds to a root-mean-square error in the winning percentage of about 0.13. As a result, for the typical 14-match season, the expected error in the predicted number of wins is about 1.8. While the results are not terrible, improvement is still possible.
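The conversion from an MSE to an error in wins is the square root of the MSE multiplied by the number of matches. A short sketch, with a hypothetical function name, using the MSE values quoted in this section and the ones that follow:

```python
import math


def expected_win_error(mse, matches=14):
    """Root-mean-square error in winning percentage, scaled to a number of wins."""
    return math.sqrt(mse) * matches


# MSE values taken from the text: baseball exponent of 2, fitted single
# exponent on run totals, and fitted single exponent on economies.
for mse in (0.0172, 0.0138, 0.0105):
    print(mse, round(expected_win_error(mse), 2))
```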

Table 1.

Baseball Pythagorean coefficients.

Coefficient	MSE	MAE
2	0.0172	0.1056
1.83	0.0174	0.1064

MSE: mean squared error; MAE: mean absolute error.

Fitting coefficients

As a first method to improve the results, nonlinear regression was used with cross-validation to determine the appropriate exponent empirically. Table 2 provides those results. Of note are the results from groups 1 and 2. These two groups had the lowest MSE of the five groups for the testing data, and both testing MSE values are lower than all training data MSEs; the two groups also have the highest training data MSE values. Investigating the MSE over all 100 data points for all five calculated exponents shows that the results are not very sensitive to the calculated exponent within the range of values in the table.

Table 2.

Fit single exponent.

Group	Exponent	MSE train	$R^{2}$	MAE train	MSE test	MSE all
1	6.6737	0.0150	0.2993	0.0975	0.0090	0.0138
2	6.5851	0.0142	0.3388	0.0956	0.0119	0.0138
3	6.4355	0.0127	0.3427	0.0902	0.0180	0.0138
4	7.6227	0.0136	0.4088	0.0906	0.0151	0.0139
5	5.9513	0.0131	0.3078	0.0888	0.0167	0.0138

MSE: mean squared error; MAE: mean absolute error.

The average of the five exponents is approximately 6.65. As the next step of the investigation, exponents of 6.6 through 6.7 in steps of 0.01 were evaluated. For all 11 values of the exponent, the MSE agrees to six decimal places. However, 6.65 is the local minimum of the set. If a single exponent is to be used, 6.65 is the recommended choice because it is the average of the exponents from the five different training sets and the local minimum of the evenly spaced tested exponents. Due to the low sensitivity of the MSE over all 100 data points, any value between 6 and 7 would produce similar results.
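The grid evaluation described above can be sketched as a one-dimensional search over the exponent in equation (4). The function names and the three seasons in the usage example are hypothetical; the example data are generated from a known exponent so that the search has a clear answer, and they are not the IPL data.

```python
def sum_squared_error(seasons, exponent):
    """Sum of squared residuals of equation (4) over (runs_scored, runs_conceded, win_pct) rows."""
    total = 0.0
    for runs_scored, runs_conceded, win_pct in seasons:
        # Algebraically equal to r_o^e / (r_o^e + r_d^e).
        predicted = 1.0 / (1.0 + (runs_conceded / runs_scored) ** exponent)
        total += (win_pct - predicted) ** 2
    return total


def best_exponent_on_grid(seasons, start, stop, step):
    """Evaluate evenly spaced exponents from start to stop and return the best one."""
    n_steps = round((stop - start) / step)
    grid = [start + i * step for i in range(n_steps + 1)]
    return min(grid, key=lambda e: sum_squared_error(seasons, e))


# Synthetic seasons built from an exponent of 6.65, for illustration only.
totals = [(2300, 2200), (2100, 2250), (2400, 2350)]
seasons = [(ro, rd, 1.0 / (1.0 + (rd / ro) ** 6.65)) for ro, rd in totals]
print(best_exponent_on_grid(seasons, 6.6, 6.7, 0.01))
```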

While using the derived exponent is an improvement, the $R^{2}$ value being between 0.3 and 0.4 means there is still room for improvement. Considering the MSE, the best results of the single exponent still result in an error of approximately 1.6 wins for a 14-match season. Regression was used to fit the constant terms in equations (2) and (3) for the IPL data. Table 3 provides the results for the logarithmic exponent based on equation (2), and Table 4 provides the results for the power exponent based on equation (3). The significant takeaway from these results is that the performance of the team-specific exponents is essentially the same as that of the single exponent. Therefore, based on the principle of Occam’s Razor, the single exponent is the best choice of the methods investigated so far. To numerically validate the selection of the single exponent model, AIC is included in Table 5. As is seen, the single exponent model has the lowest (or tied for lowest) AIC in four of the five groups and is within 0.1 of the lowest in the remaining group.

Table 3.

Logarithmic exponent.

Group	Multiplier	Constant	MSE train	$R^{2}$	MAE train	MSE test	MSE all
1	7.7160	-12.5348	0.0150	0.2996	0.0975	0.0091	0.0138
2	11.0075	-20.7986	0.0142	0.3396	0.0953	0.0122	0.0138
3	-11.4351	34.9037	0.0127	0.3437	0.1035	0.0180	0.0138
4	-22.2692	63.1173	0.0135	0.4119	0.1014	0.0155	0.0139
5	-19.9819	55.8194	0.0131	0.3109	0.1116	0.0169	0.0138

MSE: mean squared error; MAE: mean absolute error.

Table 4.

Power exponent.

Group	Exponent	MSE train	$R^{2}$	MAE train	MSE test	MSE all
1	0.3312	0.0150	0.2995	0.0975	0.0091	0.0138
2	0.3291	0.0142	0.3393	0.0954	0.0121	0.0138
3	0.3242	0.0127	0.3415	0.0902	0.0180	0.0138
4	0.3532	0.0136	0.4068	0.0906	0.0150	0.0139
5	0.3097	0.0132	0.3064	0.0888	0.0167	0.0139

MSE: mean squared error; MAE: mean absolute error.

Table 5.

AIC values.

Group	Single exponent	Logarithmic exponent	Power exponent
1	16.4	18.3	16.3
2	12.3	14.2	12.3
3	3.3	5.2	3.5
4	8.6	10.2	8.9
5	5.9	7.5	6.1

AIC: Akaike information criterion.

Rates rather than totals

As mentioned previously, for the IPL, the total number of runs scored and conceded is not as informative as using the economies that consider the total number of runs and the rate at which those runs were scored. For brevity, rather than repeating the results with the baseball coefficients, we will proceed to fit a single coefficient using equation (10)

w_{i} = (\frac{{(r_{o} / o_{o})}^{c}}{{(r_{o} / o_{o})}^{c} + {(r_{d} / o_{d})}^{c}})

(10)

Table 6 presents these results.
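To make the difference between run totals and economies concrete, the sketch below evaluates equation (10) next to the total-based form for a hypothetical successful chase. The function names, the match figures, and the exponent of 7.5 used for both forms are assumptions chosen for illustration.

```python
def economy_expectation(runs_scored, overs_batted, runs_conceded, overs_bowled, exponent):
    """Equation (10): Pythagorean form applied to runs per over rather than run totals."""
    rate_for = runs_scored / overs_batted
    rate_against = runs_conceded / overs_bowled
    return rate_for ** exponent / (rate_for ** exponent + rate_against ** exponent)


def total_expectation(runs_scored, runs_conceded, exponent):
    """Equation (4): Pythagorean form applied to run totals."""
    return runs_scored ** exponent / (runs_scored ** exponent + runs_conceded ** exponent)


# Hypothetical chase: 151 runs scored in 15 overs after conceding 150 in 20 overs.
print(round(total_expectation(151, 150, 7.5), 3))
print(round(economy_expectation(151, 15.0, 150, 20.0, 7.5), 3))
```

On run totals the two teams look almost evenly matched, while the economy-based form credits the chasing team for reaching the target with overs to spare.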

Table 6.

Single exponent for economies.

Group	Exponent	MSE train	$R^{2}$	MAE train	MSE test	MSE all
1	7.3903	0.0114	0.4644	0.0874	0.0069	0.0105
2	7.5051	0.0112	0.4796	0.0852	0.0078	0.0105
3	7.0436	0.0105	0.4548	0.0823	0.0106	0.0106
4	8.2348	0.0097	0.5784	0.0779	0.0143	0.0106
5	7.2354	0.0096	0.4924	0.0776	0.0142	0.0105

MSE: mean squared error; MAE: mean absolute error.

Comparing Tables 2 and 6, it can be seen that using economies rather than the total number of runs improves the MSE, $R^{2}$, and MAE for all groups. This improvement is seen in the training, testing, and overall datasets for all five groups. Similar to before, the mean value of the exponents is approximately 7.48. With steps of 0.01 from 7.4 to 7.6, the local minimum for the dataset is 7.5. However, the differences among all values are in the sixth decimal place. For brevity, the tabled results of the fitted multiple coefficients are not included. However, it is noted that the multiple coefficient results were similar to the single exponent result.

Our results show that the IPL differs from other leagues because economies are more predictive of winning percentages than total runs. While using economies rather than total runs has produced a superior model, the expected error of approximately 1.4 wins per 14-match season could still be improved. Still, there is the question of whether other statistics could potentially provide superior results.

Independent variable selection

LASSO was applied to each of the five training datasets, and the chosen solution was the one with the minimum MSE that had no more than seven variables. There were over 40 variables under consideration. Because the results above indicate that per-over rates are likely to be more effective than raw totals, all non-rate variables were converted to a per-over basis. The limit of seven variables was chosen empirically. Table 7 has the numerical results, and Table 8 has the LASSO-identified variables; all variables are on a per-over basis.

Table 7.

LASSO result metrics.

Group	MSE train	$R^{2}$	MAE train	MSE test	MSE all
1	0.0083	0.6124	0.0731	0.0077	0.0082
2	0.0075	0.6525	0.0690	0.0165	0.0093
3	0.0068	0.6465	0.0662	0.0126	0.0080
4	0.0081	0.6494	0.0696	0.0131	0.0091
5	0.0077	0.5952	0.0705	0.0132	0.0088

MSE: mean squared error; MAE: mean absolute error; LASSO: least absolute shrinkage and selection operator.

Table 8.

LASSO variables included.

Group	Elements
1	Runs Scored, Penalty Runs Scored, Wide Runs Against, 4’s Scored, Wickets Conceded, Wickets Taken, Bowling Average
2	Penalty Runs Scored, Wickets Conceded, Wickets Taken, Hit Wicket Bowling, Batting Average, Bowling Average
3	Runs Scored, Bye Runs Against, 4’s Scored, Wickets Conceded, 2’s Conceded, Wickets Taken, Bowling Average
4	Penalty Runs Scored, Penalty Runs Against, 4’s Scored, Wickets Conceded, Wickets Taken, Batting Average, Bowling Average
5	Penalty Runs Scored, Legbye Runs Against, 4’s Scored, Wickets Conceded, Caught Batting, Wickets Taken, Bowling Average

LASSO: least absolute shrinkage and selection operator.

The five folds of data are intended to prevent overfitting. However, the variables included in the model differ considerably across the five cases. Since all five groups are random subsamples of the same population, this variation points to the potential of overfitting the data when a single model is used. As an alternative approach, stepwise regression was implemented to identify appropriate models. Table 9 has the numerical results, and Table 10 lists the included variables for each model; all variables are on a per-over basis.

Table 9.

Stepwise regression metrics.

Group	MSE train	$R^{2}$	MAE train	MSE test	MSE all
1	0.0107	0.5237	0.0826	0.0091	0.0104
2	0.0088	0.6065	0.0726	0.0168	0.0104
3	0.0072	0.6611	0.0617	0.0156	0.0089
4	0.0070	0.7308	0.0613	0.0131	0.0082
5	0.0075	0.6408	0.0645	0.0107	0.0081

MSE: mean squared error; MAE: mean absolute error.

Table 10.

Stepwise regression variables included.

Group	Elements
1	Wide Runs Against, Wickets Conceded, Wickets Taken
2	Wickets Conceded, Wickets Taken
3	Bye Runs Against, 3’s Scored, 4’s Scored, Wickets Conceded, 2’s Conceded, Bowling Average
4	Runs Scored, Runs Against, Penalty Runs Scored, Penalty Runs Against, 4’s Scored, Wickets Conceded, 2’s Conceded, Wickets Taken
5	Runs Scored, Runs Against, 5’s Scored, Wickets Conceded, 6’s Conceded, Wickets Taken

Final model construction

Comparing Tables 6, 7 and 9, it can be seen that the results for all methods are reasonably comparable, with each technique performing best (or tied for best) on at least one of the test datasets. Examining Tables 8 and 10, it can be noted that the only variables present in at least nine of the 10 models are wickets taken and wickets conceded. Notably, these are the only two variables in the stepwise regression model for the second dataset. As mentioned previously, these values are based on a per-over rate. Regression was used to fit a linear model of only these two variables and an intercept term. Table 11 includes these results. It was expected that the $R^{2}$ would be lower for the wickets-only model because, by formulation, a regression model with a subset of variables will have an $R^{2}$ that is no larger than that of a model with a larger number of variables. However, the larger $R^{2}$ of the bigger models can be due to overfitting the training data rather than an actual relationship in the population. The test dataset provides insight into the model’s expected performance when applied to data that was not used to train the model. When comparing the testing data MSE, the wickets-only model values are less than those of the LASSO and stepwise regression models for some testing groups and are relatively close to the others. A simpler model is usually best, and the wickets-only models seem to perform virtually the same as the LASSO and stepwise regression models but benefit from simplicity.
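The tally of variables across the 10 models can be reproduced from the lists in Tables 8 and 10. The sketch below transcribes those lists and counts how many models include each variable; the variable and function names in the code are illustrative.

```python
from collections import Counter

# Variables per model, transcribed from Table 8 (LASSO) and Table 10 (stepwise).
LASSO_MODELS = [
    ["Runs Scored", "Penalty Runs Scored", "Wide Runs Against", "4's Scored",
     "Wickets Conceded", "Wickets Taken", "Bowling Average"],
    ["Penalty Runs Scored", "Wickets Conceded", "Wickets Taken",
     "Hit Wicket Bowling", "Batting Average", "Bowling Average"],
    ["Runs Scored", "Bye Runs Against", "4's Scored", "Wickets Conceded",
     "2's Conceded", "Wickets Taken", "Bowling Average"],
    ["Penalty Runs Scored", "Penalty Runs Against", "4's Scored",
     "Wickets Conceded", "Wickets Taken", "Batting Average", "Bowling Average"],
    ["Penalty Runs Scored", "Legbye Runs Against", "4's Scored",
     "Wickets Conceded", "Caught Batting", "Wickets Taken", "Bowling Average"],
]
STEPWISE_MODELS = [
    ["Wide Runs Against", "Wickets Conceded", "Wickets Taken"],
    ["Wickets Conceded", "Wickets Taken"],
    ["Bye Runs Against", "3's Scored", "4's Scored", "Wickets Conceded",
     "2's Conceded", "Bowling Average"],
    ["Runs Scored", "Runs Against", "Penalty Runs Scored", "Penalty Runs Against",
     "4's Scored", "Wickets Conceded", "2's Conceded", "Wickets Taken"],
    ["Runs Scored", "Runs Against", "5's Scored", "Wickets Conceded",
     "6's Conceded", "Wickets Taken"],
]


def variable_counts(*model_groups):
    """Count the number of models in which each variable appears."""
    counts = Counter()
    for group in model_groups:
        for variables in group:
            counts.update(set(variables))
    return counts


counts = variable_counts(LASSO_MODELS, STEPWISE_MODELS)
frequent = sorted(name for name, n in counts.items() if n >= 9)
print(frequent)  # → ['Wickets Conceded', 'Wickets Taken']
```

Wickets conceded appears in all 10 models and wickets taken in nine; the next most common variables (bowling average and 4's scored) each appear in six.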

Table 11.

Wickets only results.

Group	MSE train	$R^{2}$	MAE train	MSE test	MSE all
1	0.0108	0.4943	0.0843	0.0070	0.0100
2	0.0085	0.6065	0.0726	0.0168	0.0101
3	0.0094	0.5160	0.0768	0.0127	0.0100
4	0.0108	0.5319	0.0829	0.0081	0.0102
5	0.0101	0.4649	0.0795	0.0102	0.0101

MSE: mean squared error; MAE: mean absolute error.
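The wickets-only fit described above can be sketched as an ordinary least-squares regression on the two per-over wicket rates plus an intercept. This is a minimal illustration and not the author's code: the function names and the six sample seasons are hypothetical, and the sample winning percentages were generated from made-up coefficients (0.5, -1.0, 1.0) so that the fit can be checked.

```python
def fit_wickets_only(batting_wkt_rate, bowling_wkt_rate, win_pct):
    """Least-squares fit of win_pct = b0 + b1 * batting_rate + b2 * bowling_rate."""
    rows = [[1.0, x1, x2] for x1, x2 in zip(batting_wkt_rate, bowling_wkt_rate)]
    # Normal equations (X^T X) b = X^T y, stored as an augmented 3x4 matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * y for r, y in zip(rows, win_pct))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(3):
            if k != col:
                f = a[k][col] / a[col][col]
                a[k] = [v - f * p for v, p in zip(a[k], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical team-seasons (NOT data from the paper).
BATTING = [0.30, 0.35, 0.28, 0.40, 0.33, 0.31]  # wickets lost per over batted
BOWLING = [0.36, 0.30, 0.34, 0.29, 0.38, 0.27]  # wickets taken per over bowled
WINS = [0.56, 0.45, 0.56, 0.39, 0.55, 0.46]     # winning percentage

coef = fit_wickets_only(BATTING, BOWLING, WINS)
fitted = [coef[0] + coef[1] * x1 + coef[2] * x2
          for x1, x2 in zip(BATTING, BOWLING)]
print("intercept, batting slope, bowling slope:", coef)
print("training MSE:", mse(WINS, fitted))
```

With real data, the same `mse` function would be applied separately to the training and test seasons to produce figures comparable to those in Table 11.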

The Pythagorean solution for the economies reasonably predicts winning percentages, and the linear combination of wicket rates also predicts winning percentages. The five training datasets were individually used to fit equation (11).

$$w = c_{1} \frac{\Omega_{o}}{o_{o}} + c_{2} \frac{\Omega_{d}}{o_{d}} + c_{3} \left( \frac{1}{1 + \left( \frac{r_{d} / o_{d}}{r_{o} / o_{o}} \right)^{c_{4}}} \right) + c_{5} \tag{11}$$

where $c_{1}, \ldots, c_{5}$ are the unknown values fit by regression, $o$ are overs, $\Omega$ are wickets, $r$ are runs, the subscript $o$ is for offense, and the subscript $d$ is for defense. The first term on the right-hand side of the equation is the wicket rate while batting, the second term is the wicket rate while bowling, the third term is from equation (10), and the last term is a constant intercept. Table 12 provides the results of using nonlinear regression to combine these two sets of variables. Of the five training datasets, the first produced the lowest MSE for the test dataset and the lowest MSE for the entire population, as shown in Table 12. Table 13 includes the coefficients for each group fitting equation (11). Figure 1 provides a visual comparison of the calculated winning percentages with the actual winning percentages.

Figure 1.

Calculated versus actual winning percentages.

Table 12.

Combination of wicket rates and economies.

Group	MSE train	$R^{2}$	MAE train	MSE test	MSE all
1	0.0087	0.5936	0.0776	0.0055	0.0080
2	0.0072	0.6663	0.0688	0.0123	0.0082
3	0.0080	0.5874	0.0715	0.0084	0.0081
4	0.0080	0.6536	0.0713	0.0091	0.0082
5	0.0078	0.5878	0.0709	0.0093	0.0081

MSE: mean squared error; MAE: mean absolute error.

Table 13.

Final model coefficients.

Group	$c_{1}$	$c_{2}$	$c_{3}$	$c_{4}$	$c_{5}$
1	$-0.8173$	1.2981	2.4227	1.7804	$-0.8659$
2	$-1.0446$	1.6584	4.1229	0.8358	$-1.7690$
3	$-0.9253$	1.3457	1.4190	2.6156	$-0.3472$
4	$-0.9512$	1.1096	2.2706	2.2835	$-0.6885$
5	$-0.7026$	1.1102	3.3233	1.4129	$-1.2914$
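As an illustration of how equation (11) is evaluated, the sketch below plugs the group 1 coefficients from Table 13 into the formula. The function name, argument names, and the two sample seasons are hypothetical and are not data from the study; the sample totals assume 14 matches of 20 overs (280 overs batted and bowled).

```python
# Group 1 coefficients (c1..c5) copied from Table 13.
GROUP_1 = (-0.8173, 1.2981, 2.4227, 1.7804, -0.8659)


def expected_win_pct(runs_for, overs_batted, wickets_lost,
                     runs_against, overs_bowled, wickets_taken, c=GROUP_1):
    """Evaluate equation (11) for one team-season."""
    c1, c2, c3, c4, c5 = c
    economy_off = runs_for / overs_batted       # r_o / o_o
    economy_def = runs_against / overs_bowled   # r_d / o_d
    # Pythagorean term on economies (the third term of equation (11)).
    pythagorean = 1.0 / (1.0 + (economy_def / economy_off) ** c4)
    return (c1 * wickets_lost / overs_batted     # wicket rate while batting
            + c2 * wickets_taken / overs_bowled  # wicket rate while bowling
            + c3 * pythagorean
            + c5)


# Hypothetical balanced team: identical batting and bowling totals.
balanced = expected_win_pct(2200, 280, 90, 2200, 280, 90)
# Hypothetical stronger team: more runs, fewer wickets lost, more wickets taken.
stronger = expected_win_pct(2350, 280, 80, 2200, 280, 95)
print(f"balanced: {balanced:.3f}, stronger: {stronger:.3f}")
```

For the balanced team the Pythagorean term equals 0.5, and the group 1 coefficients return an expected winning percentage of approximately 0.50, which is a useful sanity check on the fitted values.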

The outliers in Figure 1 illustrate the value of calculating the expected winning percentage. As one example, consider the point in the lower right of the figure: the team had a winning percentage of 64% (9 wins), but its underlying statistics estimated a winning percentage of 37% (5.2 wins). The team was fortunate and had a good season, and there was a desire to replicate the same level of success the following year. However, luck is usually not consistent in sports from one season to the next.³³ In comparison, consider the team in the upper right of the figure. The team had a winning percentage of 57% (8 wins) and an expected winning percentage of 68% (9.6 wins). While this second example is not as extreme as the first, it is a team that had a great year statistically but a less impressive overall record. Even without significant changes, the team will likely perform better in the following season as its luck regresses to the mean.
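The arithmetic behind these two examples can be sketched as follows. The 14-match season length is an assumption inferred from the quoted figures (9 wins at 64%), and the function names are hypothetical.

```python
def expected_wins(expected_pct, matches=14):
    """Convert an expected winning percentage into expected wins."""
    return expected_pct * matches


def luck_in_wins(actual_wins, expected_pct, matches=14):
    """Positive values mean more wins than the underlying statistics suggest."""
    return actual_wins - expected_wins(expected_pct, matches)


# First example from the text: 9 actual wins, 37% expected.
print(round(expected_wins(0.37), 1))    # about 5.2 expected wins
print(round(luck_in_wins(9, 0.37), 1))  # about 3.8 wins attributable to luck

# Second example: 8 actual wins, 68% expected (a negative result means unlucky).
print(round(luck_in_wins(8, 0.68), 1))
```

Because the percentages quoted in the text are rounded, values computed this way can differ slightly from the win totals reported above.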

Logically, the wicket rate on offense should have a negative slope, since losing fewer wickets while batting tends to lead to higher scores and, thus, a better chance to win. By the same logic, the bowling wicket rate should have a positive slope. It was unexpected that the exponent ( $c_{4}$ ) is significantly smaller than in the economy-only model (equation (10)). However, this is likely because there is a correlation between runs scored/conceded and wickets fallen/taken, as Stevenson and Brewer showed.³⁶

Conclusions

In many sports, the Pythagorean expectation approach provides reasonable estimates of the winning percentage of teams. For the IPL, the Pythagorean expectation approach offers reasonably accurate estimates but leaves room for improvement. Data from 12 seasons of the IPL was used along with nonlinear regression to find the exponent for the Pythagorean expectation approach that best predicted the winning percentage.

The results presented in this paper show that the estimates improve when economies are used rather than total runs. Because of the unique nature of the IPL compared to other sports, the economies, rather than the total number of runs, were used to determine the expected winning percentages. This approach provided more accurate estimates for the training data and, more importantly, the testing data. However, this conclusion is already implied by the common use of NRR as a team performance metric.

A follow-up question was whether other statistics, in addition to or in place of economy, might be a reasonable means of determining the expected winning percentage. To have an unbiased approach to considering alternative variables, both LASSO and stepwise regression were used to fit models for the various training datasets. The ten models built by these two approaches over the five datasets included wicket rates, and most included additional variables. Linear regression was then used to fit models with only wicket rates as inputs. The results compared well with the Pythagorean economy model and the models with other variables determined by LASSO and stepwise regression.

Both the Pythagorean-based economy model and the linear wicket rate model were found to predict winning percentages reasonably well. The final question was whether combining these two approaches would provide superior results. The combined model does perform better than either of the individual models.

The IPL has played significantly fewer seasons than other major sports leagues such as Major League Baseball (MLB), the National Basketball Association (NBA), and the National Hockey League (NHL), and there are fewer teams in the IPL than in these other leagues. As a result of these factors, there are fewer data points for the regression models. As additional data becomes available, the accuracy of these models may improve. Nevertheless, three different approaches were applied to determine models that reasonably predict a team’s winning percentage in the IPL. In all instances, rates needed to be considered rather than totals due to the unique structure of cricket. While the combination of the two base models has the best performance measures, the simpler base models have the advantage of having fewer variables while still providing reasonably accurate results.

There is value in understanding whether a team has been lucky, unlucky, or is performing as expected. The results presented here help explain how much of a team’s record in the IPL is due to luck and how much is due to the underlying statistics. Knowing the expected win rate of a team will enable a better decision-making process for the team. In addition, understanding the impact of variables such as wickets and runs on overall winning percentages can help determine how management should construct a team. Finally, by using the expected winning percentage throughout the season, fans and club officials can avoid panicking if their team is not winning (but should be winning) and can temper their excitement if their team has been winning a lot but has simply been very lucky.

Footnotes

Acknowledgements

I want to thank Vamsi Penumatsa for his initial concept that led to this work.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

ORCID iD

Aaron B Hoskins

Appendix – Considered Team Statistics

Runs Scored

Runs Against

Wide Runs Scored

Bye Runs Scored

Legbye Runs Scored

Noball Runs Scored

Penalty Runs Scored

Wide Runs Against

Bye Runs Against

Legbye Runs Against

Noball Runs Against

Penalty Runs Against

1 Scored

2 Scored

3 Scored

4 Scored

5 Scored

6 Scored

Wickets Conceded

Caught Batting

Bowled Out Batting

Runout Batting

LBW Batting

C&B Batting

Stumped Batting

Hit Wicket Batting

Obstruction Batting

1 Conceded

2 Conceded

3 Conceded

4 Conceded

5 Conceded

6 Conceded

Wickets Taken

Caught Bowling

Bowled Out Bowling

Runout Bowling

LBW Bowling

C&B Bowling

Stumped Bowling

Hit Wicket Bowling

Obstruction Bowling

Dot Balls Batting

Dot Balls Bowling

Average Batting

Average Bowling

References

1. Singh. Measuring the performance of teams in the Indian Premier League. Am J Oper Res 2011; 1: 180–184.

2. Jana, Ghosh and Guha. IPL 2019: Evaluating the performance of teams by DEA & SEM. Malaya J Mat 2021; S: 41–45.

3. Nimmagadda, Kalyan, Venkatesh, et al. Cricket score and winning prediction using data mining. Int J Adv Res Dev 2018; 3: 299–302.

4. Kapadia K, Abdel-Jaber H, Thabtah F and Hadi W. Sport analytics for cricket game results using machine learning: An experimental study. Appl Comput Inform 2022; 18: 256–266. DOI: 10.1016/j.aci.2019.11.006.

5. Raja MAM, Manasa VVL, Reddy DSN, et al. Applying data science for cricket predictions. Ann Rom Soc Cell Biol 2021; 25: 1853–1863.

6. Jayanth, Anthony, Abhilasha, et al. A team recommendation system and outcome prediction for the game of cricket. J Sports Analy 2018; 4: 263–273.

7. Wickramasinghe. Naive Bayes approach to predict the winner of an ODI cricket game. J Sports Analy 2020; 2: 75–84.

8. Bhattacharjee and Talukdar. Predicting outcome of matches using pressure index: evidence from Twenty20 cricket. Commun Stat-Simul Comput 2020; 49: 3028–3040.

9. Singh and Kaur. IPL visualization and prediction using HBase. Procedia Comput Sci 2017; 122: 910–915.

10. Raviteja, Macha and Anantharaman. Predicting and analyzing the performance of the IPL cricket using regression models. Complex Int J 2019; 23: 353–359.

11. Tekade, Markad, Amage, et al. Cricket match outcome prediction using machine learning. Int J Adv Sci Res Eng Trends 2020; 5: 44–50.

12. Vistro, Rasheed and David. The cricket winner prediction with application of machine learning and data analytics. Int J Sci Technol Res 2019; 8: 985–990.

13. Tripathi, Islam, Khandor, et al. Prediction of IPL matches using machine learning while tackling ambiguity in results. Indian J Sci Technol 2020; 13: 4013–4035.

14. Sinha, Tripathi, Vishwakarma, et al. IPL win prediction system to improve team performance using SVM. Int J Future Generat Commu Netw 2020; 13: 17–23.

15. Shah. Predicting outcome of live cricket match using Duckworth-Lewis Par score. Int J Syst Sci Appl Math 2017; 2: 83–86.

16. Bose and Chakraborty. Managing in-play run chases in limited overs cricket using optimized Cusum charts. J Sports Analy 2019; 5: 335–346.

17. Weeraddana and Premaratne. Unique approach for cricket match outcome prediction using XGBoost algorithms. J Theor Appl Inform Technol 2021; 99: 2162–2173.

18. Prakash, Patvardhan and Lakshmi. Data analytics based deep mayo predictor for IPL-9. Int J Comput Appl 2016; 152: 6–10.

19. Jayalath. A machine learning approach to analyze ODI cricket predictors. J Sports Analy 2018; 4: 73–84.

20. Dhonge, Dhole, Wavre, et al. IPL cricket score and winning prediction using machine learning techniques. Int Res J Modernization Eng Technol Sci 2021; 3: 1723–1730.

21. Patil and Dalgade. Cricket prediction using random forest regression. Int Res J Modernization Eng Technol Sci 2021; 3: 2372–2375.

22. Scholes and Shafizadeh. Prediction of successful performance from fielding indicators in cricket: Champions League T20 tournament. Sports Technol 2014; 7: 62–68.

23. Khan, Biswas and Kabir. A quantitative approach to influential factors in one day international cricket: analysis based on Bangladesh. J Sports Analy 2019; 5: 57–63.

24. Lemmer, Bhattacharjee and Saikia. A consistency adjusted measure for the success of prediction methods in cricket. Int J Sports Sci Coach 2014; 9: 497–512.

25. Modekurti DPV. Setting final target score in T-20 cricket match by the team batting first. J Sports Analy 2020; 6: 205–213.

26. Sudhamathy and Meenakshi. Prediction on IPL data using machine learning techniques in R package. ICTACT J Soft Comput 2020; 11: 2199–2204.

27. Singh, Aggarwal and Kundu. Quantitative analysis of forthcoming ICC Men’s T20 world cup 2020 winner prediction using machine learning. Int J Comput Appl 2020; 176: 46–51.

28. Miller. A derivation of the Pythagorean won-loss formula in baseball. Chance 2007; 20: 40–48.

29. James, Albert and Stern. Answering questions about baseball using statistics. Chance 1993; 6: 17–30.

30. Davenport KWJ, Davenport, et al. Revisiting the Pythagorean theorem: Putting Bill James’ Pythagorean theorem to the test, 1999. https://www.baseballprospectus.com/news/article/342/revisiting-the-pythagorean-theorem-putting-bill-james-pythagorean-theorem-to-the-test/.

31. Prospectus. Baseball prospectus 2022. LLC: Stylus Publishing, 2022.

32. Chandramouli. Machine learning. 1st ed. New Delhi: Pearson Education India, 2018. ISBN 93-89588-13-8.

33. Beneventano, Berger and Weinberg. Predicting run production and run prevention in baseball: the impact of sabermetrics. Int J Bus Humanit Technol 2012; 2: 67–75.

34. Burnham and Anderson. Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. New York: Springer, 2002.

35. Hastie, Tibshirani and Friedman. Linear methods for regression. In: The elements of statistical learning. Springer, 2009, pp.43–99.

36. Stevenson and Brewer. Bayesian survival analysis of batsmen in test cricket. J Quant Anal Sports 2017; 13: 25–36.

Calculating expected win percentage of an Indian Premier League team
