Cricket data analytics: Forecasting T20 match winners through machine learning

Abstract

The T20 format has captured the imagination of cricket fans worldwide, and anticipation over match outcomes builds with every delivery. This study applies machine learning to forecast T20 cricket match winners before the first ball is bowled. Drawing on a dataset of factors such as past team performance and rankings, several predictive models are employed, including logistic regression, support vector machine (SVM), random forest, decision tree, and XGBoost. Among these, the random forest classifier performs best, with a prediction accuracy of 84.06%. To assess the real-world applicability of the predictive framework, a post-case study is conducted on the T20 World Cup matches of 2022, which England won. The dataset underpinning this study is curated from ESPN Cricinfo. The paper also offers a comparative analysis of performance metrics such as accuracy, precision, recall, and F1-score across benchmark machine learning models for cricket match prediction. This evaluation validates the efficacy of the models and reports their execution time and statistical robustness, supporting their utility for cricket outcome forecasting.

Keywords

Machine learning; prediction; cricket; T-20 format; classification

1. Introduction

Cricket, a beloved sport with a global fanbase, has witnessed an unprecedented surge in popularity over the years. With billions of enthusiasts and players worldwide, cricket offers diverse formats, including T20 Internationals, One Day Internationals, and Test Matches. Among these, the T20 format reigns supreme, captivating audiences with its concise 20-over structure, ideal for today’s fast-paced lifestyles. The widespread promotion of T20 cricket through commercial events like the Indian Premier League (IPL) in India and the Big Bash League (BBL) in Australia has further fueled its allure. Each T20 match, characterized by its brevity, delivers nail-biting excitement, leaving fans on the edge of their seats until the last ball. In this electrifying atmosphere, fans often engage in predicting the winner before or during the match, intensifying their anticipation. This study introduces a systematic approach to predict the winners of T20 cricket matches. Focusing on the 8th edition of the ICC Men’s T20 Cricket World Cup, where 12 teams vied for supremacy, the research delineates the journey of these teams, which included an initial Super12 stage and the subsequent emergence of semi-finalists. The winners of the semi-finals eventually battled in the grand finale to secure the championship for the next two years. By harnessing data-driven methodologies and predictive analytics, this study seeks to unveil the secrets behind predicting the winning team in T20 cricket matches, adding an analytical layer to the excitement of this beloved sport. In contrast to many existing papers that focus on individual player performance metrics, such as career statistics and recent participation, as seen in Jhanwar et al. [6], Awan et al. [24], and Prakash et al. [16], our proposed study takes a distinctive approach. 
Recognizing cricket as a team sport, our study places emphasis on evaluating and predicting team performance as a whole, rather than relying on individual player statistics. This perspective allows us to consider the collective dynamics and factors that affect team success in T20 cricket, providing valuable insights into team-based predictions. The outcome of T20 games depends on several factors, which we have considered in this work:

• Previous appearances and titles: This factor counts a team’s appearances in international T20 matches and its championship victories.
• Frequencies of playing in the final and semi-final: This factor records how many times each team has played in the semi-final and final stages in the T20 format.
• ICC T20 ranking: The ICC T20 ranking gives the international ranking of each team and reflects its overall performance.
• Previous T20 World Cup match records: World Cup match records from 2007 to 2020, together with international T20 match data from the most recent five years, are treated as essential inputs to the analysis.

The novelty of this work lies in its comprehensive and rigorous comparative analysis of various machine learning models for predicting T20 cricket match winners. This research breaks new ground by evaluating diverse data sources, incorporating statistical testing, considering execution time, and employing comprehensive evaluation metrics. Furthermore, its application of machine learning to cricket match prediction, particularly in the T20 format, represents a unique contribution to sports analytics. These novel aspects collectively advance the field by enhancing the precision and robustness of predictions while shedding light on the relative merits of different machine learning algorithms in the context of sports outcome forecasting. After rigorous data collection, extensive data cleaning, and organization efforts, this study employs a range of supervised learning algorithms, including Random Forest, Decision Tree, Gradient Boosting, Hist Gradient Boosting, Bagging Classifier, XGBoost, Voting Classifier, and K-Nearest Neighbors (KNN). These algorithms are harnessed to construct predictive models aimed at forecasting the winning team in international T20 cricket matches, using a separate testing dataset for evaluation. Among these models, the Random Forest Classifier stands out with an impressive accuracy rate of 84.06%. The predictive framework draws upon critical factors such as ICC T20 rankings and historical performance, specifically appearances in the final and semi-finals, for each participating team. Teams with higher ICC rankings and more frequent appearances in semi-finals or finals are identified as stronger contenders for championship victory. The prediction process unfolds in a multi-step fashion. Initially, predictions are made to identify the teams most likely to qualify for the playoffs. Subsequently, among the qualified playoff teams, predictions are made to determine which teams are likely to advance to the final. 
Finally, using the aforementioned predictive criteria, the study forecasts the ultimate winner between these two finalist teams. This methodical approach aims to provide comprehensive insights into the dynamics of T20 cricket match outcomes. This study makes several notable contributions that significantly advance the field of sports analytics and cricket match prediction.

• High Prediction Accuracy: The research achieves a prediction accuracy of 84.06% using the Random Forest Classifier. This is a substantial result, as accurately predicting the outcomes of cricket matches, especially in the T20 format, is a challenging task.
• Comparative Analysis: The paper conducts a comparative analysis of various machine learning models, including Decision Tree, Gradient Boosting, Histogram Gradient Boosting, Bagging Classifier, XGBoost, and Voting Classifier. This analysis provides insights into the strengths and weaknesses of different algorithms, helping researchers and practitioners select the most suitable model for similar tasks.
• Unique Data Sources: Data are collected from diverse sources, namely ESPN Cricinfo, the ICC website, and Wikipedia, which enriches the dataset and adds depth and breadth to the analysis. The main idea is to build a model that automatically predicts the winning team among all participants in a T20 cricket tournament from team records (previous appearances, previous titles, number of semi-finals and finals played, and ICC T20 ranking). The T20 World Cup held during October 2022 is used as a case study. The experimental dataset is built from data collected from ESPN Cricinfo.
• Statistical Testing: The inclusion of statistical tests such as the Chi-Square test and the t-test adds rigor to the analysis. These tests provide statistical evidence of the model’s effectiveness and of the significance of the predictions.
• Execution Time Analysis: Analyzing the execution time of different machine learning models is crucial for practical applications. It helps in understanding the computational demands of each model and in selecting the most efficient algorithm for real-time prediction systems. A detailed comparison of execution time (in seconds) and other parameters among the benchmark supervised learning methods is discussed.
• Comprehensive Evaluation Metrics: In addition to accuracy, the study considers precision, recall, and F1-score, offering a fuller evaluation of model performance. These metrics show how well the models predict cricket match winners and how well they limit false predictions.
• Future Research Directions: The paper suggests future research directions, such as incorporating additional parameters like weather conditions and extending the models to other cricket formats such as One Day and Test matches.
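To make the evaluation metrics named in the contributions above concrete, the following is a minimal sketch of how accuracy, precision, recall, and F1-score can be computed for a binary win/loss prediction. The function name `classification_metrics` and the example labels are hypothetical illustrations, not data or code from this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels.

    `positive` marks the class treated as a "win"; all names here are
    illustrative assumptions, not part of the original study.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical outcomes: 1 = team won, 0 = team lost.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Precision penalizes predicted wins that turn out to be losses, while recall penalizes actual wins that the model missed; the F1-score is their harmonic mean.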

The paper is organized into various sections, each serving a specific purpose. Section 2 delves into a comprehensive review of pertinent literature related to this study. Section 3 provides essential background information necessary for constructing our predictive models and conducting experiments. In Section 4, we present our proposed models and their applications for predicting the winner of the T20 World Cup. Section 5 offers a detailed analysis of the results and comparisons, shedding light on the outcomes of our research. Finally, Section 6 concludes the paper and outlines potential avenues for future research in this domain.

2. Literature review

In this section, we review research on predicting various aspects of cricket. These studies motivate the development of prediction models tailored to forecasting the winning team in T20 Cricket World Cup matches.

“A data mining approach to ODI Cricket Simulation and Prediction” by Sankaranarayanan, Sattar, and Lakshmanan [1] proposes a data mining approach to simulating and predicting the outcome of One Day International (ODI) cricket matches. The authors argue that the complex rules governing the game, along with the numerous natural parameters affecting the outcome of a cricket match, present significant challenges for accurate prediction. The model is trained on historical match data and combines linear regression with nearest-neighbor clustering to predict the number of runs that will be scored in the remaining overs of a match. Evaluated on a test dataset of 100 matches, the model predicted the winner with an accuracy of 70% and the number of runs scored with an accuracy of 65%.

Gagana [2] gives a brief analysis of T20 matches using machine learning algorithms. The model predicts the runs scored on each ball from pre-labeled data collected from previous matches, and the problem is treated as a classification task. With a 90% training and 10% testing split, Naïve Bayes achieves an accuracy of 42.50%, while Decision Tree and Random Forest on the same split achieve 82.52% and 90.88%, respectively. There is also scope to improve the accuracy using a Recurrent Neural Network (RNN) or a Hidden Markov Model (HMM).

“Predicting the Winner in One-Day International Cricket” by Ananda Bandulasiri [3] investigates the factors that affect the outcome of ODI cricket matches. The author uses a logistic regression model to analyze a dataset of 200 ODI matches played between 1996 and 2007. According to the findings, the model predicts the outcome of a match with an accuracy of 80%. The study also shows that the most important predictor of the match outcome is the average score of the team batting first. The research thus demonstrates the potential of statistical models for predicting the outcome of cricket matches.

“Score Prediction and Player Classification Model in the Game of Cricket Using Machine Learning” by S. Kumar and S. Roy [4] proposes a machine learning approach to predicting the score of a cricket match and classifying players into different categories. The authors train a decision tree model on historical match data to predict the score of a match, and use a support vector machine model to classify players into categories such as batsmen, bowlers, and all-rounders. On a test dataset of matches, the decision tree model predicted the score with an accuracy of 80%, and the support vector machine model classified players with an accuracy of 90%. The authors conclude that their models are a promising approach to score prediction and player classification, and suggest that they could support the decision-making of coaches and players as well as the development of betting strategies.

“A Survey on Team Selection in the Game of Cricket Using Machine Learning” by I. Technology and S. MS [5] surveys the literature on the use of machine learning for team selection in cricket. The authors discuss the machine learning techniques that have been used for team selection, the challenges of team selection, and future research directions. They conclude that machine learning has the potential to be a powerful tool for team selection in cricket, but note that more research is needed to develop more accurate and robust models.

To predict ODI cricket matches, Jhanwar [6] presented a Cricket Outcome Predictor. A dataset of matches was used to test the model, and the findings revealed that it can predict match outcomes with an average accuracy of 92.6%. This work thus highlights the potential of squad composition-based machine learning models for forecasting the results of ODI matches. Bailey and Clarke [7] undertook a study to forecast the result of an ODI match while it was in progress. A dataset of matches was used to test the model, and the findings revealed that it can predict match outcomes with an average accuracy of 85.2%. According to N. Pathak [8], the chance of winning or losing an ODI match is predicted using a variety of classification approaches, including Naive Bayes, Support Vector Machines, and Random Forest. A COP (Cricket Outcome Predictor) is then constructed based on these results. Tested on a dataset of matches, the model predicted match outcomes with an average accuracy of 87.3%, demonstrating how effective contemporary classification methods can be at forecasting ODI results.

S. Kampakis [9] used a variety of machine learning models to forecast the results of English County Twenty Over cricket matches. Features are added to the model to improve its performance and accuracy, and the model pairs a straightforward prediction technique with intricate hierarchical features. The most accurate result is provided by Naive Bayes (64.9%). K. Passi [10] analyzed the performance of individual teams using machine learning algorithms. Classification models such as Naïve Bayes, Multilevel SVM, and Random Forest are applied, using features such as batting performance and the number of wickets taken by a bowler of an individual team. Random Forest gives the best result: 90.74% for the batting performance and 92.25% for the bowling performance of a player of an individual team. M. Yasir [11] presented a multi-layer perceptron model that predicts the result of an ongoing T20 international match and evaluated it on historical ball-by-ball datasets. Several variables are considered for prediction, including the commencement and duration of the match. Models are created using information from previous team performances, the match’s location, and player performance. The results are 85% accurate before the match and 89% accurate after it.

“Applications of Machine Learning in Cricket, A Systematic Review” by I. Wickramasinghe [12] conducts a systematic review of the literature on the applications of machine learning in cricket. The author identifies 72 relevant articles published between 2001 and 2021. Manoj S. Ishi and J.B. Patil [13] survey the literature on the use of machine learning for team formation and winner prediction in cricket. The authors discuss the machine learning techniques that have been used for these tasks, the associated challenges, and future research directions. They conclude that machine learning has the potential to be a powerful tool for team formation and winner prediction in cricket, but note that more research is needed to develop more accurate and robust models. The paper gives a good overview of the state of the art and provides a useful starting point for researchers interested in this area.

F. Nasim [14] showed that a batsman’s performance can be predicted using a Hidden Markov Model (HMM). The dataset is taken from the Cricinfo website. Using a first-order Markov chain, the model predicts how many runs the batsman can make on the next ball faced, which can help to select the best players based on their performance. Other factors, such as the weather, the nature of the wicket, the performance of the opposing team, the crowd in the stadium, and the venue of the game, also affect the batsman’s performance. D.G.T.L. Karunathilaka [15] notes that IPL annotations will be used by the scientific community in the future, and develops a model that forecasts the win probability of a team at every over in real time. The model uses many features from the player statistics, including individual teams’ batting and bowling performances and the venue of the matches. AdaBoost and Multi-layer Perceptron models are used to predict the winning probability of a cricket match.

According to an article by C. Deep Prakash [16], the Deep Player Performance Index (DPPI) can be used to evaluate a T20 cricket player’s performance in both bowling and batting. It captures a player’s present performance and position in the team. DPPI makes it possible for researchers and T20 cricket fans to compare players on various teams who perform similar roles. It also determines the approximate team strength from the combined DPPI values of players in various positions on a team. DPPI is based on K-Means clustering and the Random Forest algorithm. Compared with other indexes, DPPI captures player performance better, and it helps cricket fans, coaches, and managers to better understand players and their performances in past matches.

According to a study by S. Sarangi [17], all four teams’ chances of winning are greatly affected by the number of fielding mistakes and the bowler economy. The factors considered include the number of 4s and 6s scored, extras conceded, umpire’s nationality, fielding mistakes, bowling economy, number of debutants from each team, and pitch condition. None of the teams were affected by the umpire’s country of origin, while the other factors had a distinct impact on each team’s performance. A binary logistic regression model is used to estimate these parameters, which are fully independent of each other. The proposed models can be used by team management and trainers to build match strategy and select players for better win outcomes, because they are based on a mixture of historical pattern data for certain variables and real data for others.

In a paper by A. Singhal [18], the winner of an Indian Premier League (IPL) match is predicted using various machine learning classification algorithms. Python simplifies the data analysis by providing visual representations of results. Four classification algorithms are used: decision tree, K-nearest neighbor, support vector machine, and random forest classifiers. The best of the four models is selected for prediction, and the results are visualized as graphs. According to a study by T. Mahmood [19], PSL data is gathered from pertinent sources and compiled into a validated dataset for machine learning studies. The authors implement “PSL Eye”, which predicts the match-winning team using neural networks (NNs). The data is first preprocessed to remove extraneous variables, and the NN hyperparameters are then fine-tuned. The final results are obtained by running the NN-based PSL Eye with the ideal hyperparameter values, giving an accuracy of 82% on the testing dataset.

According to a paper by A. Sahu [20], the outcomes of IPL cricket matches are predicted using machine learning techniques. The dataset is taken from Kaggle and includes two factors: wind speed and humidity. Manual encoding and label encoders are used for data preprocessing. After removal of irrelevant data, feature selection, and $p$-value testing, Random Forest Classifier, AdaBoost, and Multinomial Logistic Regression models are applied. Of the three, the random forest classifier gives the best result, with a training accuracy of 98.14% and a testing accuracy of 89.47%. P. Tekade [21] presented a paper in which the outcomes of IPL matches are predicted using supervised rather than unsupervised machine learning algorithms, because labeled data is required for this prediction task. Models such as Decision Tree Regression, Random Forest Regression, Naive Bayes, and Logistic Regression are used. Based on key factors such as home ground, past performances, the current form of an individual team, the overall experience of all players, and records at the same venue, the Logistic Regression model gives the best result, with an accuracy of 90%.

According to a research paper by F.A. Shakti [22], the results of international cricket matches are predicted using data mining approaches. The study has two goals: identifying the features that affect the result of a cricket match, and predicting the outcome of a match. To this end, feature selection algorithms, recursive feature elimination, and the machine learning algorithms ZeroR, Decision Tree, Random Forest, and XGBoost are applied. The dataset contains international T20 and ODI cricket matches from 2004 to 2012. Of these models, XGBoost gives the best result, with an accuracy of 85.48%.

According to a study by M.A.M. Raja [23], cricket matches are predicted using various machine learning models. K-Nearest Neighbor regression, Linear regression, Random Forest, Decision Tree, and Gradient Boosting are used to predict team run rate, batsman strike rate, bowler’s economy, and wickets, which together help to predict the best playing eleven for a match. The features used are: ground name, opponent, and batting innings for team run rate; ground name, opponent, match ID, over type, innings, and bowler for batsman strike rate; bowlers with the required minimum experience, bowler name, ground name, opponent, and batting innings for bowler economy rate; and ground name, opponent, striker, over type, innings, bowler, and match ID for wicket prediction. Gradient Boosting gives the best result for batsman strike rate prediction (23.20%), K-Nearest Neighbor regression gives the best result for bowlers’ economy prediction (24.41%), and Random Forest Classifier gives the best result for wickets prediction (71.11%). Based on these parameters and individual players’ predictions, the work can predict the best playing eleven of the teams.

M.J. Awan [24] presented a paper in which the winning team of an ODI cricket match is predicted with a machine learning approach. Team scores are predicted with a linear regression model, and big data analysis is carried out with the Spark ML framework. The ODI dataset was collected from cricsheet.com. The two best linear regression models were chosen, and Spark ML was then applied to them. One achieves an accuracy of 96%, while the other performs poorly in terms of the confusion matrix and root mean squared error. The best linear regression model, with 96% accuracy, has a root mean squared error (RMSE) of 30.2, a mean squared error (MSE) of 1350.34, and a mean absolute error (MAE) of 28.2.

The study conducted by K. Suresh [25] introduces an approach to predicting first-inning cricket scores in the IPL using chatbots. Chatbots serve as intermediaries between humans and machines and find applications in domains including marketing, education, support systems, cultural heritage, healthcare, and entertainment. They take sentences as input and provide corresponding results. To predict cricket match scores, the study employs six different machine learning models. These models likely use historical data, features, and contextual information to generate predictions of the first-inning scores in IPL matches. By combining chatbot technology with machine learning models, the approach aims to enhance the accuracy and accessibility of cricket score predictions, potentially benefiting cricket enthusiasts and stakeholders in the IPL.

Related multi-criteria approaches have been applied in other sports. In [28], which considers only male swimmers, an adapted MCDA approach named COMET is successfully applied to the problem. A decision model covering both complete information and uncertainty is built using fuzzy numbers in conjunction with the COMET approach. In addition to the study findings, a technique is created to help trainers assess athlete selection and inclination. The system also makes it possible to predict and verify the impact of altering a particular property on the outcome [28]. The goal of [29] is to develop a multi-criteria expert model for assessing football players’ performances. Using football as an example, the study proposes an objective fuzzy inference system based on fuzzy logic to assess players in team sports. A multi-criteria model based on the Characteristic Objects Method (COMET) is created to assess players in forward positions according to their match statistics. The study [29] demonstrates that this approach is useful for rating players according to their performance. The selection of the COMET approach is based on its distinct attributes.

In this study, historical data encompassing IPL matches held between 2008 and 2017 serves as the training dataset for our predictive models. Among the six models employed, the Random Forest Classifier emerges as the top-performing model, excelling in terms of prediction accuracy, precision, recall, F-score, and various statistical parameters.

3. Background details

3.1 Supervised learning techniques

Supervised learning is a subset of machine learning in which models are trained on labeled data and then used to make predictions. The term ‘labeled data’ denotes input data for which the corresponding output is known in advance [26, 27]. During training, the model is provided with both the input data and the corresponding output data. Four datasets are used in this study:

1. Fixtures of the T20 World Cup of 2022.
2. ICC T20 ranking of all international cricket teams.
3. Team records, namely previous appearances, previous titles, number of semi-finals, and number of finals played by each team.
4. Match details of previous T20 World Cups, along with international T20 matches from the last 5 years.

The dataset utilized in this study is sourced from ESPN Cricinfo, the ICC website, and Wikipedia. Initially, label encoding was employed to convert the categorical values in the dataset into numerical values. To address the challenge of imbalanced data, the Synthetic Minority Oversampling Technique (SMOTE) algorithm was adopted. SMOTE creates synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbors, which helps when one class makes up only a small proportion of the dataset. While various oversampling techniques exist, SMOTE is a widely used one. This research encompasses the development of six distinct models, each constructed using the aforementioned machine learning algorithms and the training dataset, as depicted in Fig. 1.

Figure 1. Proposed methodology for predicting winners of T20 cricket match.
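The two preprocessing steps described above, label encoding and SMOTE oversampling, can be illustrated with a minimal pure-Python sketch. This is not the study's implementation: the function names `label_encode` and `smote`, the team names, and the feature values are hypothetical, and a library implementation of SMOTE would normally be used in practice.

```python
import random


def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples from a list of minority-class points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical categorical column and minority-class feature vectors.
codes, mapping = label_encode(["India", "England", "Pakistan", "England"])
minority_samples = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]]
new_samples = smote(minority_samples, n_new=4)
```

Because each synthetic sample is an interpolation between two real minority samples, it always falls inside the region already occupied by the minority class, rather than being a plain duplicate of an existing row.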

3.1.1 Decision tree

The decision tree is a well-known, powerful, and famous model for classification and prediction. A decision tree is typically structured as a tree-like data structure, featuring root and leaf nodes as decision points and intermediate nodes as potential outcomes (representing many possible, albeit unknown, results). Edges within the tree depict potential outcomes. In problems involving classification and regression, the CART (Classification and Regression Trees) method is frequently applied. CART divides a node into sub-nodes using the Gini Index criterion. The CART algorithm commences by considering the training set as the root node and attempts to split it into two sub-nodes. This process is executed recursively, with the algorithm continuously dividing nodes until it reaches a predefined maximum number of leaves or achieves pure sub-nodes. The Eq. (1) below represents a decision tree classifier, where ‘Pi’ denotes the probability of class ‘i’ and ‘C’ represents the total number of classes in the classification problem.

$\displaystyle GI=\sum\limits_{i=1}^{C}P_{i}(1-P_{i})=1-\sum\limits_{i=1}^{C}P_{i}^{2}$ (1)
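As a worked illustration of Eq. (1), the following sketch computes the Gini index of a node from its class labels and the weighted Gini index of a candidate binary split. The function names and the example labels are hypothetical and are not taken from the study's code.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Size-weighted Gini index of the two sub-nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)


# Hypothetical node holding the winners of four matches.
node = ["England", "England", "England", "India"]
print(gini_index(node))  # 1 - (0.75**2 + 0.25**2) = 0.375

# A split that separates the classes perfectly yields pure sub-nodes.
print(weighted_gini(["England"] * 3, ["India"]))  # 0.0
```

A CART-style learner would evaluate weighted_gini for every candidate split and keep the split with the lowest value.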

3.1.2 Random forest

Random Forest is a widely used supervised machine learning model applicable to both classification and regression tasks. The Random Forest flowchart typically comprises two segments: the first involves the training set, and the second pertains to the test set. In this flowchart, the root node represents the training data, and the intermediate nodes represent the individual decision trees. For classification tasks, the output is the majority vote of the trees, while for regression problems it is the average of their predictions. In classification problems, node splits can be scored with either entropy or the Gini index. Equation (2) gives the entropy used to determine how nodes branch within a decision tree, while Eq. (3) gives the Gini index for each branch at a given node.

$\displaystyle\textit{Entropy}=-\sum\limits_{i=1}^{C}P_{i}\log_{2}(P_{i})$ (2)

$\displaystyle\textit{Gini}=1-\sum\limits_{i=1}^{C}P_{i}^{2}$ (3)
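To make Eqs. (2) and (3) concrete, the sketch below evaluates both impurity measures on the same set of labels. The function names and the labels are illustrative assumptions only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class proportions, as in Eq. (2)."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def gini(labels):
    """Gini index of the class proportions, as in Eq. (3)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical node with two equally frequent classes (maximum impurity
# for a two-class problem).
balanced = ["England", "England", "India", "India"]
print(entropy(balanced))  # 1.0
print(gini(balanced))     # 0.5
```

Both measures are zero for a pure node and largest when the classes are equally mixed, so either can be used to rank candidate splits.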

3.1.3 Hist gradient boosting classifier

The Histogram Gradient Boosting algorithm is a member of the ensemble machine learning family. Ensemble learning refers to the practice of combining multiple models to create a single model that delivers improved accuracy compared to the individual models. This algorithm is similar to the Gradient Boosting Classifier, but it distinguishes itself by grouping continuous feature values into a small number of discrete bins (histograms), which reduces the number of candidate split points evaluated at each node. This reduction substantially speeds up training, typically with little or no loss of predictive accuracy.

$\displaystyle\textit{Input}:\textit{Data}(x_{i},y_{i})_{i=1}^{n}\textit{ and a differentiable Loss Function }L(y_{i},F(x))$ (4)

In Eq. (4), $x_{i}$ is the input given to the model and $y_{i}$ is the target variable that the model tries to predict. Based on the predicted probability, the log-likelihood of the data can then be computed.

$\displaystyle\log(\textit{likelihood of the observed data given the prediction})=y_{i}\log(p)+(1-y_{i})\log(1-p)$ (5)

In Eq. (5), $y_{i}$ represents the observed value (0 or 1), and $p$ is the predicted probability.

The objective is to maximize the log-likelihood. To use it as a loss function, for which smaller values indicate a better-fitting model, the log-likelihood is multiplied by $-1$, as shown in Eq. (6).

$\displaystyle\textit{Loss}=-\log(\textit{likelihood})$ (6)
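The log-likelihood and its negative can be evaluated directly, as in the sketch below. The function names, the clipping constant eps (added here only to avoid taking the logarithm of zero), and the example probabilities are assumptions for illustration.

```python
import math


def log_likelihood(y, p, eps=1e-15):
    """Log-likelihood of one observation y (0 or 1) given predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # assumption: clip to keep log() finite
    return y * math.log(p) + (1 - y) * math.log(1 - p)


def negative_log_likelihood(ys, ps):
    """Loss: the summed log-likelihood multiplied by -1."""
    return -sum(log_likelihood(y, p) for y, p in zip(ys, ps))


# Hypothetical observed outcomes and two sets of predicted probabilities.
observed = [1, 0, 1]
good_predictions = [0.9, 0.1, 0.8]
poor_predictions = [0.4, 0.6, 0.3]

print(negative_log_likelihood(observed, good_predictions))
print(negative_log_likelihood(observed, poor_predictions))
```

The predictions that assign higher probability to the observed outcomes give the smaller loss, which is the behaviour a boosting algorithm exploits when it minimizes this quantity.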

3.1.4 Bagging classifier

Generally, the Bagging Classifier serves as an ensemble meta-estimator. Its methodology involves the selection of random subsets from the original dataset to train the base classifier independently on each of these subsets. The final prediction is then determined through a collective decision, typically involving voting or averaging of the individual predictions. The Bagging Classifier can be applied to a wide range of machine learning algorithms, including but not limited to Artificial Neural Networks (ANN), Support Vector Classifiers (SVC), and decision stumps. Additionally, it can be adapted for regression problems, extending its utility beyond classification tasks. The mathematical expression for a bagging classifier can be represented as,

$\displaystyle Y_{\textit{pred}}=\frac{1}{N}\left(Y_{\textit{pred}}^{1}+Y_{\textit{pred}}^{2}+\ldots+Y_{\textit{pred}}^{N}\right)$ (7)

In Eq. (7), $N$ is the number of base models or trees, $Y_{\textit{pred}}^{i}$ is the prediction of the $i^{\text{th}}$ tree, and $Y_{\textit{pred}}$ is the final prediction obtained by averaging the predictions from all the trees.
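The averaging in Eq. (7) can be sketched with a deliberately simple base model. Here each base model only predicts the mean of its own bootstrap sample; this toy model, the function names, and the target values are assumptions chosen to keep the example short, not the base classifier used in the study.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def fit_mean_model(targets):
    """Toy base model: always predicts the mean of its training targets."""
    mean = sum(targets) / len(targets)
    return lambda: mean


def bagging_predict(models):
    """Final prediction: the average of the N base-model predictions."""
    predictions = [model() for model in models]
    return sum(predictions) / len(predictions)


rng = random.Random(0)
targets = [1.0, 0.0, 1.0, 1.0, 0.0]  # hypothetical training targets
models = [fit_mean_model(bootstrap_sample(targets, rng)) for _ in range(25)]
y_pred = bagging_predict(models)
print(y_pred)
```

Because every base model sees a different random resample, their individual predictions vary, and averaging them reduces that variance.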

3.1.5 Gradient boosting classifier

Gradient Boosting is a powerful technique employed for both classification and regression problems. It falls under the ensemble learning category, which involves combining multiple decision trees to construct a more accurate predictive model. The algorithm operates iteratively: decision trees are trained in succession on the residuals, i.e., the errors, of the preceding trees, so that each subsequent tree aims to rectify the mistakes made by its predecessors. In the Gradient Boosting Classifier, the trees are trained using a gradient descent optimization procedure that minimizes the loss function of the model. The loss function quantifies the disparity between the model’s predicted values and the actual values within the training dataset. By minimizing this loss function, the algorithm aims to create a model that generalizes well to new, unseen data. Equation (8) gives the mathematical representation of the Gradient Boosting Classifier (GBC).

$\displaystyle G(x)=\sum\limits_{i=1}^{m}T_{i}(x)$ (8)

Where $G(x)$ is the predicted value for the input variable $x$, $T_{i}(x)$ is the output of the $i^{\text{th}}$ decision tree, and $m$ is the number of trees in the ensemble. While the choice of loss function can vary, commonly employed functions include the mean squared error (MSE) and the cross-entropy loss. The final prediction generated by the GBC is the sum of the outputs of all the individual decision trees within the ensemble. The term ‘gradient boosting’ applies because each new tree is fitted to the negative gradient of the loss function with respect to the current predictions, so that adding the tree moves the ensemble in the direction that reduces the loss. This optimization process iteratively refines the model’s predictive accuracy, making it well suited to a wide range of classification tasks.
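The additive structure of Eq. (8) can be illustrated with a small sketch that boosts one-split regression stumps on one-dimensional data under a squared-error loss, for which the negative gradient is simply the residual. The function names, the learning rate, the number of trees, and the data are assumptions; the constant starting prediction and the learning-rate scaling are folded into the individual terms so that the final prediction is still a plain sum.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (a single split) on 1-D inputs."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # a split must leave samples on both sides
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # all inputs identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def predict(trees, x):
    """G(x): the sum of the outputs of all terms in the ensemble."""
    return sum(tree(x) for tree in trees)


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)
    trees = [lambda x: base]  # constant initial prediction
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - predict(trees, x) for x, y in zip(xs, ys)]
        stump = fit_stump(xs, residuals)
        trees.append(lambda x, s=stump: learning_rate * s(x))
    return trees


# Hypothetical 1-D training data.
xs = [1, 2, 3, 4]
ys = [0.0, 0.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
print([round(predict(model, x), 3) for x in xs])
```

Each round fits a stump to what the current ensemble still gets wrong, so the training error shrinks as more terms are added.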

3.1.6 XGB classifier

The XGBoost Classifier is an implementation of the Gradient Boosted Decision Trees methodology. Within this classifier, decision trees are constructed sequentially and weights play a crucial role: the output of each tree is weighted in the final prediction, and when the current ensemble still predicts some samples incorrectly, a subsequent decision tree is constructed to correct those errors. Each individual tree in the ensemble is designed to make predictions that enhance the overall accuracy of the model. Equation (9) provides the mathematical expression that characterizes the XGBoost Classifier model.

$\displaystyle Y_{\textit{pred}}=w_{0}+w_{1}f_{1}(x)+w_{2}f_{2}(x)+\ldots+w_{m}f_{m}(x)$ (9)

Where $m$ is the number of trees in the ensemble, $f_{i}$ is the prediction of the $i^{\text{th}}$ tree, $w_{i}$ is the weight assigned to the $i^{\text{th}}$ tree, $x$ is the input feature vector, and $Y_{\textit{pred}}$ is the final prediction.

3.1.7 Voting classifier

The Voting Classifier operates as an ensemble machine learning algorithm that harnesses the collective wisdom of multiple models to make predictions. It assesses the likelihood of each class being the final outcome and anticipates the output class based on this likelihood. By aggregating the results of the classifiers provided as parameters to the voting classifier, it determines the output through majority voting. Equation (10) provides a mathematical representation that encapsulates the essence of the Voting Classifier, illustrating its methodology for arriving at a final prediction.

$\displaystyle Y_{\textit{pred}}=\textit{mode}\left(Y_{\textit{pred}}^{1},Y_{\textit{pred}}^{2},\ldots,Y_{\textit{pred}}^{N}\right)$ (10)

Where $N$ is the number of base models, $Y_{\textit{pred}}^{i}$ is the prediction of the $i^{\text{th}}$ model, and mode(.) is a function that returns the most common value in a set.
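Majority voting as described by Eq. (10) amounts to taking the most common prediction among the base models. The sketch below shows this with hypothetical predictions; the function name and the team names are illustrative only.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction among the base models.

    Ties are resolved in favour of the value that appears first, which is
    how Counter.most_common orders equal counts.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from three base models for a single match.
base_predictions = ["England", "India", "England"]
print(majority_vote(base_predictions))  # England
```

With an even number of base models a tie is possible, so a tie-breaking rule such as the one noted in the docstring has to be chosen explicitly.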

4. Proposed methodology

Input:

T1.
Collect a performance dataset containing the previous appearances, previous titles, previous finals, and semi-finals played by each team in the T20 World Cup. This dataset is collected from Wikipedia.
T2.
Collect a ranking dataset where the ICC ranking of major teams in T20 is given. The source of this dataset is the ICC official website.
T3.
Collect most of the international T20 matches from 2007 to 2020 from ESPN Cricinfo.
T4.
Fixtures of the 2022 T20 World Cup are collected from ICC’s official website.

Criteria (Factors):

This study takes into consideration the following factors which act as indirect inputs,

1.
Previous Title: This indicates the number of times the team has won the title. A team with more previous titles is a formidable team and has a high chance of winning the tournament. Kampakis S. [9] analyzed 500 team and player performance statistics to predict outcomes from the English T20 game. Under team statistics, they considered the frequency of titles won by individual teams.
2.
Previous Finals: This indicates the teams that have played the finals in past years, which reflects a team’s capability to outplay all the other teams and gain a good position. A team that has previously qualified for the final has a good chance of qualifying for the final this time as well.
3.
Previous Appearances: This indicates the previous matches between the two teams and which of them won. This helps to understand the winning probability of a team when it plays against a particular opponent. Sankaranarayanan [1] used previous match fixtures to analyze previous appearances for ODI winner prediction.
4.
Semi-Finals Played: This indicates the number of times each team has played the semi-finals and their wins or losses. This helps to predict which of the two teams that are playing in the semi-finals has a higher probability of winning the game and qualifying for the finals.
5.
ICC Rankings: This contains the ICC T20 rankings of the different teams. This indicates that the team with a higher position in the ranking has a higher probability of securing a good position or even winning the tournament. Yasir M. [11] took the ICC T20 ranking for the prediction of the winning team during the ongoing T20 game.

Output:

Predicted winner of the T20 World Cup.

Procedure:

1.
Clean the T3 dataset by dropping some columns (Date axis, Margin axis, and Ground axis).
2.
From T3, Team_1 and Team_2 columns are chosen as separate data frames with name features1 and features2. At the same time, the winner column is set to a new data frame with the name class_value.
3.
In the next step, class_value is converted into label values, and the features1 and features2 values are transformed into numeric values with the help of label encoding.
4.
To improve performance and balance the class distribution, the SMOTE method is applied to the numeric features1, features2, and class_value.
5.
The two feature sets and class_value are then split into two datasets: 20% of the data is used as the test dataset and the remaining 80% as training data.
6.
Then CART, random forest, Hist gradient boosting classifier, bagging, XGB, and voting techniques are applied for training the model.
7.
Random Forest is applied with the parameters n_estimator $=$ 500, max_depth $=$ 22 and random_state $=$ 148. Then the model is fitted with training features and training class value.
8.
Similarly, Decision Tree is used with the parameters criterion $=$ gini and random_state $=$ 100.
9.
Next, the Hist Gradient Boosting Classifier is fitted with the training features and training class values.
10.
The Hist Gradient Boosting Classifier is applied in this study to obtain a better accuracy rate. The parameter max_iter $=$ 100 is passed to this model.
11.
A Bagging Classifier is also applied in this study and gives a good accuracy rate.
12.
A Gradient Boosting Classifier is also used in this study; its accuracy is reported in Table 1.
13.
The XGB Classifier is another important classifier used in this research to predict the winning team.
14.
The last classifier is a hybrid voting classifier that combines logistic regression, SVC, and CART classifiers. The model is then fitted with the training features and training class values.

Result:

15.
The Random Forest classifier achieves the best accuracy among all the models, with a training accuracy of 83.03% and a testing accuracy of 84.06%.
16.
The trained model is then applied to the other datasets, such as the ranking dataset. The predicted winner is England.

Initially, our dataset is sourced from multiple online platforms, including the ESPN Cricinfo website, the ICC website, and Wikipedia. The data is collected and structured in CSV format. Since much of the data is in textual form, a crucial step involves transforming this textual data into a numerical format to facilitate the application of various machine learning algorithms. To achieve this conversion, we employ label encoding, which effectively transforms the textual data within the dataset into a numerical representation. Figure 1 provides an encompassing flowchart that illustrates the methodology employed for predicting the winning cricket team, outlining the key steps in this process.
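The label-encoding and 80/20 splitting steps described above can be sketched in plain Python as follows. The match records are invented for illustration, the helper names and the seed are hypothetical, and using one shared mapping for Team_1, Team_2, and Winner is an assumption of this sketch rather than a detail confirmed by the study.

```python
import random


def label_encode(values):
    """Map each distinct text value to an integer code (sorted order)."""
    classes = sorted(set(values))
    mapping = {name: code for code, name in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out test_fraction of them for testing."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Invented match records in the form (Team_1, Team_2, Winner).
matches = [
    ("England", "India", "England"),
    ("India", "Pakistan", "India"),
    ("Pakistan", "England", "England"),
    ("Australia", "India", "Australia"),
    ("England", "Australia", "England"),
    ("Pakistan", "Australia", "Pakistan"),
    ("India", "England", "India"),
    ("Australia", "Pakistan", "Australia"),
    ("India", "Australia", "India"),
    ("England", "Pakistan", "Pakistan"),
]

# Build one mapping from every team name, then encode each record.
_, mapping = label_encode([team for match in matches for team in match])
encoded = [tuple(mapping[team] for team in match) for match in matches]

train_rows, test_rows = split_train_test(encoded, test_fraction=0.2)
print(mapping)
print(len(train_rows), len(test_rows))  # 8 2
```

Sharing one mapping keeps a team's code identical wherever it appears, which makes the encoded columns directly comparable.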
5. Result analysis

This study considers four datasets.

5.1 Dataset description

•
The first dataset contains the previous appearances, previous titles, previous finals, and semifinals played in the T20 World Cup. The dataset is named world_cup_t20_dataset.csv and is in CSV format. The data is collected from Wikipedia and public open media. No preprocessing is applied to the dataset; it is used as it is. The considered attributes are Team, Group, Previous appearances, Previous titles, Previous finals, Previous semifinals, and Current rank.
•
The second dataset contains the ICC rankings of the teams. This dataset plays a main role in predicting the winner. The dataset is named icc_rankings.csv and is also in CSV format. The data is collected from the ICC official website. No preprocessing is applied to the dataset; it is used as it is. The considered attributes are Position, Team, and Points.
•
The third dataset contains most of the international T20 matches from 2007 to 2020. The dataset is named final.csv and is in CSV format. The data is collected from ESPN Cricinfo. During preprocessing, the Margin and Ground columns are dropped from the dataset. All rows containing Null or NaN values are eliminated, as they could distort model training and the measured testing accuracy. The considered attributes are Date, Team_1, Team_2, and Winner.
•
The fourth dataset contains the fixtures of the matches to be played in the T20 cricket World Cup in 2022. The dataset file is named fixtures.csv. The data is collected from the ICC official website. No preprocessing is applied to the dataset; it is used as it is. The considered attributes are Round, Date, Stadium, Venue, Team_1, Team_2, Group, and Result.

The chosen problem in this study revolves around multi-class classification, a domain where various machine learning models prove effective. Among the ensemble of classifiers suitable for such tasks, we have explored Logistic Regression, Random Forest, Decision Tree, Support Vector Machine, K-Nearest Neighbor, Gradient Boosting, Histogram Gradient Boosting, AdaBoost, XGBoost, Voting Classifier, and Bagging Classifier. Our study encompasses predictions conducted on all of these models, yielding promising outcomes. In this research, we focus on a comparative analysis of the top-performing eight classifiers that have exhibited the most promising results. These classifiers include Random Forest, Decision Tree, Gradient Boosting, Histogram Gradient Boosting, Bagging Classifier, XGBoost, Voting Classifier, and K-Nearest Neighbor. Among these models, Random Forest has emerged as the most accurate, surpassing other state-of-the-art classifiers. The study provides a comprehensive performance assessment and comparative insights for each of these models, presenting the results through informative tables and illustrative figures, which are detailed in the subsequent sections.

Table 1 compares the training and testing accuracies, along with the loss, for the classifiers applied to our dataset. These metrics are pivotal in understanding the performance of the models. Figure 2 visually depicts the loss and the corresponding accuracy of the different models. Figure 3 compares the training and testing accuracy of the machine learning algorithms, including Random Forest, Decision Tree, Gradient Boosting, Histogram Gradient Boosting, Bagging, XGBoost, and Voting Classifier. Among these models, Random Forest emerges as the leader, achieving a training accuracy of 83.03% and a testing accuracy of 84.06%. Because its testing accuracy is not lower than its training accuracy, the model shows no sign of overfitting.

Table 1
Training and testing accuracy comparison of different classifiers

Models	Training accuracy (%)	Testing accuracy (%)	Loss
Random forest (RF)	83.03	84.06	0.0406
Decision tree	81.71	83.69	0.0307
Gradient boosting	59.62	58.30	0.1536
Hist gradient boosting	82.06	83.03	0.0307
Bagging	81.00	81.27	0.0541
XGBoost	82.68	82.15	0.0379
Voting	81.09	83.74	0.1163
K nearest neighbours	69.25	72.08	0.2747

Figure 2.
Graphical comparison of the loss function and the accuracy of different models.

Figure 3.
Graphical comparison of training and testing accuracy of different models.

Table 2 presents important statistical inferences about the models used, including the Chi-Square test and $T$ -test. The Chi-square test is a statistical tool employed to compare observed and expected results, while the $T$ -test assesses the means of two groups, commonly applied in hypothesis testing. These tests offer valuable insights into the operation, accuracy, and performance of the models.

Table 2
Chi-square test and $T$-test comparison of different classifiers

Models	Chi-square test	$T$-test statistic	$T$-test $p$-value
Random forest (RF)	1.0	$-$0.310625	0.756317
Decision tree	1.0	$-$0.369012	0.624567
Gradient boosting	1.0	0.728613	0.466849
Hist gradient boosting	1.0	$-$0.536154	0.592279
Bagging	1.0	0.025085	0.980004
XGBoost	1.0	$-$0.576592	0.564679
Voting	1.0	$-$0.103567	0.917586
K nearest neighbours	1.0	0.042911	0.965802

Table 3
Execution time (sec) details of the various applied models

Models	Execution time (sec)
RF	4.4438
CART	0.0237
Gradient boosting	130.2290
Hist gradient boosting	33.6942
Bagging	42.9083
XGBoost	4.8924
Voting	5.9961

Figure 4.
Decision-making analysis at each stage of the tournament.

Table 4
Precision, recall and F1 score analysis of different models

Models	Precision	Recall	F1 score
Random forest classifier	0.830	0.830	0.830
Decision tree classifier	0.844	0.844	0.845
Gradient boosting classifier	0.151	0.151	0.152
Hist gradient boosting classifier	0.830	0.830	0.830
Bagging classifier	0.812	0.826	0.827
XGBoost classifier	0.826	0.826	0.827
Voting classifier	0.837	0.837	0.837

Figure 5.
Execution time (sec) comparison.

Figure 6.
Graphical comparison among precision, recall, and F1 scores of different models.

Figure 4 offers a visual representation of the entire T20 Cricket World Cup in 2022, from the Super 12 stage to the championship match, showing the decision analysis at each stage of the tournament. Throughout the Super 12 stage, the Random Forest algorithm is employed to predict the winner of each match. The model predicts the semi-finalists as England, Sri Lanka, Pakistan, and Bangladesh, with the top two teams from each group advancing to the semi-finals. The two semi-final winners then compete for the championship title. According to the model’s predictions, England and Bangladesh contest the 2022 T20 World Cup final, and England is the forecast winner of the tournament.

Table 5
Comparison analysis with some popular works

Parameters	Proposed work	Jhanwar et al. [6]	Raja et al. [23]	Passi et al. [10]	Singhal et al. [18]	Prakash et al. [16]	Awan et al. [24]	Puram et al. [14]
Objective of Prediction	Winning Team	Outcome of a One Day International (ODI) cricket match	Performance of players	Performance of players	Match winner	IPL T20 Cricket players’ in-form and role-based performance evaluation index	Match winner	In Twenty20 (T20) cricket, the impact of contextual circumstances and subsequent decisions on team performance
Dataset & Essential Features	Past performances (previous appearance, previous titles, previous finals and semi final played in T20 world cup), ICC T-20 ranking of teams	Played Matches, Bowling Innings, Wickets Captured, Bowling Economics, FWkts Hauls, and Average	Bowler’s name, team they are playing against, venue of the match and past economies	Batting average, strike rate, wickets taken of previous matches	IPL Dataset	Batting performance index: Runs, Average, Strike Rate, Fours, Six; Bowling performance index: Wickets, bowling average, strike rate, economy	mid, date, location, bowl team, bat team, batter, overs, last five runs, last five wickets, last five strikers, and total runs	Match-by-match information for the IPL’s nine seasons
Technology Used	CART, Random Forest, Bagging, Boosting, Voting classifiers	KNN	KNN, RF, CART, Gradient Boosting, Linear Regression	Naïve bayes, random forest, multiclass SVM, CART	Decision tree, KNN, SVM, and random forest	K-Means and RF inspired Deep player performance index	Machine learning, Big Data, Spark ML	Tree-based machine learning (ML) models
Best accuracy model	RF	KNN	RF	RF	SVM	–	Spark ML	Bayesian additive regression tree (BART)
Accuracy (%)	82.06	76.28	71.11	90.67	81.57	–	95	81

Table 3 provides insights into the execution times of the various models employed in this study. The time module is utilized to calculate these execution times. The process involves initializing a variable ‘start’ with time.time() before running each model, subsequently running the model, and finally initializing another variable ‘end’ with time.time(). The execution time is determined as ‘end-start,’ and these times are measured in seconds. Figure 5 visually represents the execution times, offering a graphical overview of the model runtimes.
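The start/end timing procedure described above can be sketched as a small helper. The helper name timed and the placeholder workload are assumptions; in the study the timed call would be the fitting of each model.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return its result together with the elapsed seconds."""
    start = time.time()
    result = fn(*args, **kwargs)
    end = time.time()
    return result, end - start


def placeholder_workload(n):
    """Stand-in for model training (hypothetical)."""
    return sum(i * i for i in range(n))


result, elapsed = timed(placeholder_workload, 100_000)
print(f"execution time: {elapsed:.4f} sec")
```

For very short runs, time.perf_counter() is generally a better clock than time.time() because it is monotonic and has higher resolution; time.time() is kept here to match the procedure described in the text.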

Table 4 furnishes the precision, recall, and F1 scores for the various machine-learning models employed in this study. These scores are vital indicators of model performance and are graphically visualized in Fig. 6. Precision is defined as the ratio of True Positives (correctly classified positive samples) to the total number of samples classified as positive (True Positives $+$ False Positives). Recall gauges the model’s ability to detect positive samples and is the ratio of True Positives to all actual positive samples (True Positives $+$ False Negatives). The F1 score is the harmonic mean of precision and recall and summarizes both in a single value. Table 5 presents a comprehensive comparison between our proposed work and some state-of-the-art studies. This comparison takes into account prediction objectives, critical features, technology applied, and the best accuracy model, offering insights into the uniqueness and contributions of our research.
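The three metrics can be computed for a single class as in the sketch below. The function name and the labels are hypothetical, and the study does not state how per-class scores were averaged across teams, so this sketch reports one class only.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 score for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical actual and predicted winners of four matches.
actual = ["England", "England", "India", "India"]
predicted = ["England", "India", "England", "India"]
print(precision_recall_f1(actual, predicted, positive="England"))  # (0.5, 0.5, 0.5)
```

In a multi-class setting such as match-winner prediction, the same calculation is repeated with each team as the positive class and the per-class scores are then averaged.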

We have conducted a comprehensive comparison of several models, considering various parameters such as the prediction objective, dataset, essential features, technology utilized, best accuracy model, and overall accuracy. Notably, this study goes beyond existing research by including additional critical metrics, including loss function, execution time, Chi-Square test results, and $T$ -test statistics and $p$ -values. These comprehensive comparisons contribute to a more thorough understanding of the model’s performance and offer unique insights not found in previous research.
6. Conclusions and future work

In this study, we employ a diverse array of machine learning algorithms and methodologies to predict the winner of the T20 Cricket World Cup. As a case study, we utilize data relevant to the 2022 T20 World Cup for prediction purposes. The selected learning models include Random Forest, Decision Tree, Gradient Boost, Histogram Gradient Boost, Bagging Classifier, XGBoost, and Voting Classifier. Among these well-established models, the Random Forest Classifier stands out, achieving an accuracy of 83.03% during the training phase and 84.06% during the testing phase. Our predictive analysis placed England, Sri Lanka, Pakistan, and Bangladesh as the top four T20 teams, securing their positions in the semi-finals of the T20 Cricket World Cup. In the model’s forecast, England and Bangladesh then advanced to the final, with England emerging as the triumphant team in the championship match. Directions for future work are listed below.

•

While our study primarily focuses on the T20 format, future researchers have the opportunity to extend this work to predict outcomes in other forms of cricket, including One-day Matches and Test Matches. The methodologies and insights gained from our study can serve as a valuable foundation for exploring predictive analytics in a broader spectrum of cricket formats.

•

In the future, researchers may explore the application of classification techniques in various other sports such as Football [30], Basketball, Kabaddi, and more. It’s important to note that the implementation of these techniques may vary significantly from one sport to another, as each sport possesses its own unique set of characteristics and dynamics.

•

To obtain more reliable results and accurate prediction, we can include more useful features for the training purpose of the models in the future.

•

In the future, we have the potential to incorporate additional parameters such as weather conditions and humidity, which are known to influence the performances of both teams. These supplementary factors can enhance the accuracy of our predictions, offering a more comprehensive and precise outcome forecast.

In conclusion, this work contributes significantly to the field of sports analytics and cricket match prediction by advancing prediction accuracy, conducting comparative analyses, utilizing diverse data sources, employing statistical testing, analyzing execution times, and offering comprehensive evaluation metrics. These contributions enhance the understanding and application of machine learning in the domain of sports prediction.

Footnotes

Conflict of interest

The authors declare no conflict of interest.

Funding

There is no funding involved for this work.

References

Sankaranarayanan

Sattar

Lakshmanan

. Auto-play: A data mining approach to ODI cricket simulation and prediction. In Proceedings of the 2014 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics. 2014, April. pp. 1064–1072.

Gagana

, & Paramesha

. A perspective on analyzing IPL match results using machine learning. Int J Sci Res Develop. 2019; 7(03).

Bandulasiri

. Predicting the winner in one-day international cricket. Journal of Mathematical Sciences & Mathematics Education. 2008; 3(1): 6–17.

Kumar

Roy

. Score prediction and player classification model in the game of cricket using machine learning. International Journal of Scientific & Engineering Research IJSER. 2018; 9(8).

Pujbai

Chaudhari

Pal

Nhavi

Shimpi

Joshi

. A survey on team selection in game of cricket using machine learning. International Research Journal of Engineering and Technology. 2019; 6(11).

Jhanwar

Pudi

. Predicting the Outcome of ODI Cricket Matches: A Team Composition Based Approach. In MLSA@ PKDD/ECML. 2016, September.

Bailey

Clarke

. Predicting the match outcome in one day international cricket matches, while the game is in progress. Journal of Sports Science & Medicine. 2006; 5(4): 480.

Pathak

Wadhwa

. Applications of modern classification techniques to predict the outcome of ODI cricket. Procedia Computer Science. 2016; 87: 55–60.

Kampakis

Thomas

. Using machine learning to predict the outcome of english county twenty over cricket matches. arXiv preprint arXiv:1511.05837. 2015.

10.

Passi

Pandey

. Increased prediction accuracy in the game of cricket using machine learning. arXiv preprint arXiv:1804.04226. 2018.

11.

Yasir

Chen

Shah

Akbar

Sarwar

. Ongoing match prediction in T20 International. International Journal of Computer Science and Network Security. 2017; 17(11): 176–181.

12. Wickramasinghe. Applications of machine learning in cricket: A systematic review. Machine Learning with Applications. 2022; 10: 100435.

13. Ishi, Patil. A study on machine learning methods used for team formation and winner prediction in cricket. In Inventive Computation and Information Technologies: Proceedings of ICICIT 2020. Springer Singapore. 2021. pp. 143–156.

14. Nasim, Yousaf, Masood, Jaffar, Rashid. Data-Driven Probabilistic S for Batsman Performance Prediction in a Cricket Match. Intelligent Automation & Soft Computing. 2023; 36(3).

15. Karunathilaka DGTL, Rajakaruna, Navarathna, Anantharajah, Selvarathnam. “Can Mumbai Indians Chase the Target?” Predict the Win Probability in IPL T20-20. In Proceedings of Sixth International Congress on Information and Communication Technology: ICICT 2021, London, 2. Springer Singapore. 2022. pp. 991–999.

16. Prakash, Verma. A new in-form and role-based deep player performance index for player evaluation in T20 cricket. Decision Analytics Journal. 2022; 2: 100025.

17. Sarangi, Singh. Winning one-day international cricket matches: a cross-team perspective. Journal of Business Analytics. 2022; 1–20.

18. Singhal, Agarwal, Singh, Valecha, Malik. IPL Analysis and Match Prediction. In Intelligent System Design: Proceedings of INDIA 2022. Singapore: Springer Nature Singapore. 2022. pp. 29–38.

19. Mahmood, Riaz, Nasir, Afzal, Siddiqui. PSL eye: Predicting the winning team in Pakistan Super League (PSL) matches. KIET Journal of Computing and Information Sciences. 2021; 4(2): 13–13.

20. Sahu. Predictive analysis of cricket. Turkish Journal of Computer and Mathematics Education (TURCOMAT). 2021; 12(6): 5111–5124.

21. Tekade, Markad, Amage, Natekar. Cricket match outcome prediction using machine learning. International Journal. 2020; 5(7).

22. Shakil, Abdullah, Momen, Mohammed. Predicting the Result of a Cricket Match by Applying Data Mining Techniques. In Software Engineering Perspectives in Intelligent Systems: Proceedings of 4th Computational Methods in Systems and Software 2020, Vol. 2 4. Springer International Publishing. 2020. pp. 758–770.

23. Raja MAM, Manasa VVL, Reddy DSN, Sundari. Applying Data Science for Cricket Predictions. Annals of the Romanian Society for Cell Biology. 2021; 1853–1863.

24. Awan, Gilani SAH, Ramzan, Nobanee, Yasin, Zain, Javed. Cricket match analytics using the big data approach. Electronics. 2021; 10(19): 2350.

25. Suresh, Vikas. Design and Analysis of a ChatBot with IPL First Inning Score Prediction. In 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA). IEEE. 2021, October. pp. 1–4.

26. Parsuramka, Goswami, Malakar, Chakraborty. An empirical analysis of classifiers using ensemble techniques. In Data Management, Analytics and Innovation: Proceedings of ICDMAI 2020, Volume 1. Springer Singapore. 2021. pp. 283–298.

27. Chakraborty, Kumar, Paul, Kairi. A study of product trend analysis of review datasets using Naive Bayes, K-NN and SVM classifiers. Int J Adv Eng Manag. 2017; 2(9): 204–213.

28. Sałabun, Więckowski, Wątróbski. Swimmer Assessment Model (SWAM): Expert system supporting sport potential measurement. IEEE Access. 2022; 10: 5051–5068.

29. Sałabun, Shekhovtsov, Pamučar, Wątróbski, Kizielewicz, Więckowski, Nyczaj. A fuzzy inference system for players evaluation in multi-player sports: The football study case. Symmetry. 2020; 12(12): 2029.

30. Chakraborty, Dey, Kairi, Maity. Prediction of Winning Team in Soccer Game – A Supervised Machine Learning-Based Approach. In Advances on Mathematical Modeling and Optimization with Its Applications. CRC Press, Taylor and Francis. 2023. ISBN: 9781032479613.