Sage Journals: Discover world-class research

Abstract

Predicting the outcome of cricket matches prior to their start is challenging due to the inability to generate real-time information from the thousands of data points generated by each match. Twitter (now “X”) and other social media sites have demonstrated their capacity to produce content in real time. This study proposes a machine learning-based method for predicting the results of T20 international matches (before they begin) using historical and Twitter data. A sentiment analysis was carried out to create a feature by considering three leading methods, revealing that the fine-tuned RoBERTa-based model outperformed those using LSTM and VADER with an F1 score of 95.3%. Three datasets were formulated, with historical, Twitter, and both data types. The XGBoost classifier produced the best models for all three datasets, and it was found that removing correlated and least important variables enhances their performance. Models using Twitter data outperformed those using historical data, while models using both data (with a 73.7% F1 score) outperformed both individual data models. Moreover, this best model performed better than the bookmakers’ predictions and could generate a profit by accurately predicting 11 matches out of 14 for the T20I World Cup 2022 data.

Keywords

cricket, sports statistics, sentiment analysis, natural language processing, machine learning

Introduction

Sentiment analysis on social media has become an essential task for understanding public opinion and sentiments regarding various topics (Omar and Abd El-Hafeez, 2023). Recent studies have demonstrated that social media crowds, especially those posting tweets on Twitter (“X”), have the ability to predict real-world events (Cheng et al., 2021). The popularity of Twitter has increased since its launch in 2006. Of the 556 million active users of the platform, 61.2% use it to stay up to date on news and current events (DIGITAL 2023: GLOBAL OVERVIEW REPORT, 2023). Unlike Facebook postings, which can be private to specific viewers, tweets are public by default. Even though Twitter doubled message length to 280 characters in 2017, tweets continue to be shorter and more to the point than other social media posts. Twitter hashtags allow users to search for and categorize specific topics, which makes it possible to sort sentiment by topic. Moreover, researchers and analysts have shown that social media data, particularly tweets, can be a valuable addition to historical data when making predictions.

Football, considered the most popular sport in the world, has been the subject of a significant amount of research of this kind (Kampakis and Adamides, 2014). As a real-world example, in 2021, Phil Lynch, head of media at Manchester United Football Club, disclosed that the club uses fan sentiment graphs based on social media content for each player (“Phil Lynch on Manchester United's media strategy and ‘the Ronaldo effect,’” 2021). In cricket, however, far fewer studies have attempted to predict match outcomes than in football.

With more than 2.5 billion estimated fans, cricket is the second most popular sport in the world behind football (World Atlas, 2020). It is popular primarily in the United Kingdom and certain former British colonies, such as India, Pakistan, and Sri Lanka. At the international level, cricket has three main formats: Test, One Day International (ODI), and Twenty20 International (T20I). The Test format, the oldest, consists of four innings and is played over up to five days, while ODI matches consist of fifty overs (300 deliveries) per side. T20 is the shortest format of cricket, with each team batting for 20 overs (120 deliveries) and a match usually lasting around 3–4 h. T20 cricket does not have a long history: the first T20I was played on February 17, 2005, between Australia and New Zealand. There is tremendous social and commercial interest in cricket, especially T20 matches, due to their high-scoring, energetic, and short nature. Based on this format, many cricket-playing countries now have their own franchise-based cricket tournaments. For example, India has the most famous premier league, the Indian Premier League (IPL), Australia has the Big Bash League (BBL), and Sri Lanka has the Lanka Premier League (LPL). These leagues also support the development and popularity of cricket internationally.

According to ICC Men's Twenty20 International Playing Conditions, 2022, between 30 and 15 min prior to the planned start time, the toss for the decision of bat first or field first is conducted under the supervision of an ICC match referee. Before the toss, each captain must provide a written list of 11 players, along with a maximum of four substitutes. Because of these reasons, team management, coaches, and players always try to prepare well in advance before a match. If they can get an idea of what's going to happen in the match and the result even before the toss, it will be a great help in making their decisions. On the other hand, companies and franchises want to know about public sentiment toward each team so they can decide which team to invest in. Even broadcast channels invite experts to their studios to discuss, analyze, and predict the match winners to boost viewership and ad revenue. Moreover, cricket has a large pre-match betting industry, and millions of dollars are wagered on cricket matches. Due to these reasons, developing an accurate model to predict the outcome of a T20 match before it begins will be extremely beneficial to many parties.

However, due to the complexity, predicting and obtaining high accuracy rates in cricket matches, particularly in the T20 format, is still one of the most difficult tasks. When it comes to predicting the match outcome before the match starts, it becomes even more difficult. Every cricket match generates numerous data points, but extracting certain real-time information like players’ injuries, sentiment towards the teams, and match-banned players, which plays a crucial role, from those historical data points is not possible. The prediction models developed in earlier cricket research have either focused on historical features or Twitter data. To the best of the author's knowledge, this study is the first effort to develop a model to predict the winners of T20I cricket matches before starting, utilizing both historical and Twitter data.

The study's main objective is to build a suitable model using machine learning to predict the winner of the Twenty20 International Cricket matches before the match starts by considering both historical data and features derived from Twitter. In addition, there are several secondary objectives: to build a model for classifying the pre-match tweets as positive or negative using an efficient sentiment analysis technique, to study whether Twitter-derived features can provide useful information to predict match outcomes, and to check whether there is any potential profit from pre-match betting using the proposed model.

The paper is divided into five sections. The second section, Literature Review, reviews the relevant literature on the study's subject. Section three describes the research design, the data collection, sentiment analysis, data analysis, and model building. Section four provides the results of the study. The last section summarizes the study's main findings, identifies the limits, and provides suggestions for further research.

Literature review

Studies on machine learning regarding cricket have slightly increased since 2010. Around 35% of those research studies have focused on game outcome prediction, while only 4% of studies have considered cricket commentary or media for the studies (Wickramasinghe, 2022). When predicting the outcome of a cricket match, past studies focused on two main approaches: using historical data and using collective (social) knowledge (Hatharasinghe and Poravi, 2019). Out of those two approaches, the first one is the most popular among researchers.

The objective of Kaluarachchi and Varde (2010) was to predict the chances of victory in the ODI matches and to develop a tool for that prediction. They considered Naïve Bayes, Decision Tree classification using C4.5, Bagging, and Boosting. Naïve Bayes was the best classifier for the concerned dataset. Further, they observed that winning the toss does not significantly affect the outcome. However, they didn't consider the previous match results and statistics of the two teams. Sankaranarayanan et al. (2014) used historical data and the instantaneous state of an ongoing match to predict the ODI match outcome and progression of a match. According to the findings, accuracy was between 68% and 70%. They not only considered past data but also instantaneous features.

When considering the T20 format, Kampakis and Thomas (2015) carried out a study to predict the outcome of the English County twenty-over cricket matches before the commencement of the match. This study considered only statistically-based historical data. They developed two main models: the first consisted only of team data, and the second consisted of both team and player data. Out of Logistic Regression with PCA, Naïve Bayes, Random Forest, and Gradient Boosting, the Naïve Bayes classifier produced the best results for both models, with the first model, which used solely team data, having an accuracy of 62.4%. They concluded that, from their second model, which consisted of both team and player data, it was possible to predict the match outcome in almost two-thirds of instances. Since they didn't consider external data, such as social media comments, they think that if they did, they could increase the accuracy of the models.

Lamsal and Choudhary (2018) considered the most famous franchise-based premier league of all time, the Indian Premier League (IPL), and trained six machine-learning models to predict the outcome of the 2018 IPL matches 15 min before the match started, after knowing the toss decision. They got more than 60% accuracy for SVM, Logistic Regression, Random Forest, and Multilayer Perceptron Models. The best F1 score came from the Multilayer Perceptron model (72%). Since the model considers the toss decision, it is not useful to parties who are interested in knowing the outcome before the toss.

Ul Mustafa et al. (2017) evaluated how well machine learning models performed in predicting the outcome of a cricket match before the start using pre-match tweets gathered from Twitter pages. After creating a list of nicknames along with the most popular hashtags for each team, they scraped tweets and calculated three features: tweet volume, aggregated fans’ sentiment score, and average predicted score. To find the sentiment of a tweet, they selected a set of linguistic features using the TF-IDF score and used it to group the tweets into positive or negative categories. However, when meeting a new linguistic feature, it will not be possible to find the sentiment of tweets since this analysis is based on a pre-defined set of features. Support Vector Machine, Naïve Bayes, and Logistic Regression were considered and got up to 75% correct predictions for the large-scale data analysis and verified them on CWC15 and IPL2014 data. The final results showed that SVM performed better than the other two classifiers.

All the studies discussed above were conducted by considering either historical data or social media data, but not both. Wickramasinghe and Yapa (2018) considered both statistical data and social media data. By considering the complete dataset for the considered IPL matches, three models were built: one based only on natural variables, the second based only on features retrieved from Twitter, and the last based on both natural parameters and tweets. A second set of models was built to investigate how prediction accuracy changed after every ten overs with only tweets. Up to 85% accuracy was obtained from the proposed sentimental model. Logistic Regression, SVM, Naïve Bayes, and Random Forest were considered when training the prediction models. Logistic regression was the best classification approach for both tweet-based and natural parameter-based models, and SVM was the best for the combined model. The combined model performed better than the other two individual models. However, this study merely took into account IPL data and was not specifically designed to predict the T20 match outcome before the commencement.

Unlike cricket, football has been the subject of many studies of this kind. Kampakis and Adamides (2014) investigated whether features derived from tweets can be used to predict whether an English Premier League game will end in a home team win, an away team win, or a tie before the game begins. For the study, they considered three datasets: the Twitter dataset, the historical dataset, and the combined dataset merged from the two initial datasets. Tweets were gathered using the Twitter streaming API. They made a list of hashtags closely related to the respective teams, along with any possible nicknames. When creating the dataset, they didn’t consider the tweets that contained hashtags for more than one team. A bag of unigrams or bigrams represented the home and away features of the Twitter model, and each club's statistics represented the features of the historical model. Finally, they were able to show that it is possible to predict the outcome of English Premier League games with features extracted from Twitter. The Twitter-based model performed better than the historical statistics model, while the combined model outperformed both individual models. For both the Twitter model and the combined model, Random Forest gave the best results, and for the historical model, Naïve Bayes gave the best result. However, the study conducted by Schumaker et al. (2016) for match and point spread prediction using only Twitter data handled tweets that contained hashtags for more than one team differently. They labeled tweets with more than one hashtag with the first club hashtag. They believed that when two or more clubs are mentioned in a tweet, the tweeter may have wanted to place more focus on the first club mentioned. Godin et al. (2015) also wanted to predict the winner of soccer games in the English Premier League (EPL) 2013–14 before the match began, utilizing both statistical data and collective wisdom extracted from tweets. The top 10,000 unigrams were used to create a sentiment classifier for this study using SVM, which was trained on 3400 manually annotated tweets. Then, using this classifier, they classified the filtered tweets out of the 50 million scraped tweets posted 24 h before the game into three categories: positive, negative, and neutral. Naïve Bayes, Logistic Regression, and SVM were applied to this dataset. Finally, they came to the conclusion that a mix of statistical and Twitter-based information could outperform expert and bookmaker predictions.

Almost every study that utilized tweets to predict the match outcome considered a rule-based or traditional machine-learning approach to find the sentiment of tweets. In order to find a fresh and more accurate approach to finding the sentiment of tweets, different study domains were taken into consideration. Lay et al. (2019) noticed that in most of the studies related to sentiment analysis, researchers use manual data labeling, which takes a lot of time and money. Therefore, they suggested a semi-supervised learning strategy using mostly unlabeled data and only a small portion of labeled data from the IMDB dataset. The outcomes demonstrated that the unlabeled data did assist in model training without having a negative effect on the model's performance. Al-Shabi (2020) focused on five of the most significant lexicons: VADER, SentiWordNet, SentiStrength, Liu and Hu opinion lexicon, and AFINN-111. The study assessed these lexicons for sentiment analysis using data from Twitter, and the findings indicated that VADER has the highest accuracy for classifying tweets as positive or negative. Ghasiya and Okamura (2021) classified headlines using the cutting-edge RoBERTa sentiment classification algorithm. They achieved 90% validation accuracy with that model, which outperformed the other traditional models.

In conclusion, many studies on cricket focused on the ODI format. Out of the few studies done on Twenty20 matches, the majority have been based on league matches. These studies considered either historical match data or social media opinions, and most of them have taken into consideration instantaneous features and ongoing match status as well. Among the studies carried out to predict the match outcome before the match starts, most predict the match winner only after the toss. Therefore, this study addresses these research gaps by developing a more accurate model to predict the winner of T20 international matches even before the toss, based on both historical and social media data. Moreover, the previous studies developed their sentiment analysis models using a rule-based approach or traditional machine learning models with limitations. Therefore, in this research, a new approach is introduced for sentiment analysis as well.

Methodology

Research design

The study involved three phases to develop prediction models: first phase, focusing on historical and basic statistical data; second phase, incorporating features from tweets; and finally, building combined models with both historical and Twitter-derived attributes (Figure 1).

Figure 1.

Three phases of the research design.

Each of these phases primarily consisted of the steps outlined in Figure 2.

Figure 2.

Steps in phases.

Data collection

The datasets were created by considering 519 matches played among the top 9 teams (as of October 14, 2022) from the 1st of January 2011 to the 14th of October 2022. The study only considered matches with a clear outcome (won or lost); tied matches, matches with no results, and abandoned matches were excluded. Afghanistan (10th team as of October 14, 2022) had only played 29 matches, which was a very low count compared to the other teams in the top 10. Therefore, only the top nine teams—India, England, the West Indies, Pakistan, Australia, Sri Lanka, South Africa, New Zealand, and Bangladesh—were taken into consideration for the study.

Historical dataset

The historical dataset consisted of 16 variables, as detailed in Table 1.

Table 1.

Variables in the historical dataset.

Variable	Description
Venue	Continent in which the match was played
T1_Mat	Total number of matches played by Team1
T1_AvgRunsScored	Average runs scored per over by Team1
T1_AvgBound	Average number of boundaries earned per over by Team1
T1_AvgRunsConceded	Average runs conceded per over by Team1
T1_AvgWktsTaken	Average number of wickets taken per over by Team1
T1_W/L	Win/Loss ratio of Team1
T1_AvgWktsLost	Average number of wickets lost per over by Team1
T2_Mat	Total number of matches played by Team2
T2_AvgRunsScored	Average runs scored per over by Team2
T2_AvgBound	Average number of boundaries earned per over by Team2
T2_AvgRunsConceded	Average runs conceded per over by Team2
T2_AvgWktsTaken	Average number of wickets taken per over by Team2
T2_W/L	Win/Loss ratio of Team2
T2_AvgWktsLost	Average number of wickets lost per over by Team2
Winner	Winner of the match (Team1 or Team2)

To create the above variables, Statsguru, a cricket statistics database accessible through ESPNcricinfo (Statsguru | Searchable Cricket Statistics database | ESPNcricinfo.com, 2000), was used.

Each instance of the historical dataset represented a match, and the instance was mainly split into three parts: Team1 features, Team2 features, and the target variable. In addition, the dataset also contained the venue variable. The aim was to predict the outcome of a match as a win for Team1 or a win for Team2. Most studies in this field treat the home team and the away team as Team1 and Team2. However, this study covered international matches and contained data not only for bilateral series, so a home/away distinction does not always apply. Therefore, Team1 is the team with the higher win/loss ratio for past matches, and Team2 is the team with the lower win/loss ratio.

$$\mathrm{Win\_Loss\_Ratio} = \frac{\text{Number of matches won}}{\text{Number of matches lost}} \tag{1}$$
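The Team1/Team2 assignment described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the handling of a team with no losses, the tie-breaking rule, and the win/loss counts are all assumptions.

```python
def win_loss_ratio(won: int, lost: int) -> float:
    """Equation (1): matches won divided by matches lost."""
    if lost == 0:
        # Assumption: the source does not say how an unbeaten record is handled.
        return float("inf")
    return won / lost


def assign_teams(team_a, team_b, records):
    """Return (Team1, Team2), where Team1 has the higher past win/loss ratio."""
    ratio_a = win_loss_ratio(*records[team_a])
    ratio_b = win_loss_ratio(*records[team_b])
    # Assumption: on a tie the input order is kept; the source does not cover this case.
    return (team_a, team_b) if ratio_a >= ratio_b else (team_b, team_a)


# Illustrative (won, lost) counts, not real data.
records = {"India": (30, 15), "Bangladesh": (10, 20)}
print(assign_teams("Bangladesh", "India", records))
```

With these made-up records, India (ratio 2.0) becomes Team1 regardless of the order in which the two teams are passed.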

Twitter dataset

A Python library called Snscrape was used to scrape the tweets. The employed search query for scraping tweets had four main inputs: (1) Hashtags related to a particular match, (2) User handles related to a particular match, (3) Language of tweets, and (4) Dates and time range.

To fill in the first two inputs, a list of hashtags and user handles associated with each team, depending on familiarity, was manually created and verified using search queries on Twitter. The user handles were extracted from each team's official Twitter accounts (Table 2).

Table 2.

List of hashtags and handles.

Team	Hashtags	Handles
India	#TeamIndia #MenInBlue #BCCI #IndianCricket #IndianCricketTeam #bharatarmy	@BCCI
England	#ECB #englandcricket #cricketengland #englandteam #EnglandCricketTeam	@englandcricket
Pakistan	#BackTheBoysInGreen #TheGreenArmy #packistancricket #packistancricketteam #TheRealPCB #PakCricket #cricketpakistan #PCB	@TheRealPCB
South Africa	#ProteaFire #Proteas #PureProtea #southafricacricket #CricketSouthAfrica #rsacricket #sacricket #SouthAfricanCricket #CSA	@ProteasMenCSA
New Zealand	#BACKTHEBLACKCAPS #CricketNation #blackcaps #NZC #newzealandcricket #nzcricket	@BLACKCAPS
Australia	#australiancricket #CricketAustralia #australiacricket #CricketAus	@CricketAus
West Indies	#MenInMaroon #WiAllin #WestIndies #Windies #windiescricket #westindiescricket	@windiescricket
Sri Lanka	#OneTeamOneNation #RoaringForGlory #ApeKollo #TeamSriLanka #SriLankanCricketTeam #SriLankanTeam #SrilankaCricket #GemmakThamai #OfficialSLC	@OfficialSLC
Bangladesh	#RiseOfTheTigers #BCB #BCBtigers #BangladeshCricket #bangladeshcricketteam #bdcricket #bdtigers #bdcricket_team	@BCBtigers

When considering the third input of the search query, English was taken as the preferred language due to the limitations of other languages. The fourth input to the query was added to include that specific time range. For this study, tweets that were posted between 24 h and an hour prior to the match were considered. The starting time of each previous match was manually extracted using the archived schedule reports provided by Cricbuzz.
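The four query inputs can be combined into a single search string, as in the sketch below. It only builds the string and does not call Snscrape. The `build_query` helper, the use of Twitter's `lang:`, `since_time:`, and `until_time:` search operators, and the example start time are assumptions; the exact query used in the study is not given.

```python
from datetime import datetime, timedelta, timezone


def build_query(hashtags, handles, match_start, lang="en"):
    """Build a search string covering 24 h to 1 h before match_start."""
    window_start = match_start - timedelta(hours=24)
    window_end = match_start - timedelta(hours=1)
    # Any of the hashtags or handles may match.
    terms = " OR ".join(list(hashtags) + list(handles))
    return (
        f"({terms}) lang:{lang} "
        f"since_time:{int(window_start.timestamp())} "
        f"until_time:{int(window_end.timestamp())}"
    )


# Hypothetical match start time in UTC, for illustration only.
start = datetime(2022, 10, 23, 8, 0, tzinfo=timezone.utc)
print(build_query(["#TeamIndia", "#PakCricket"], ["@BCCI", "@TheRealPCB"], start))
```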

The scraped tweets were converted to lowercase and duplicates were removed. Tweets that raise a question offer no benefit for this kind of task, so a function based on a regular expression pattern was defined to recognize and remove them.
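A minimal sketch of this cleaning step is shown below. The regular expression actually used in the study is not given, so the pattern here (any tweet containing a question mark) is an assumption, as is the `clean_tweets` name.

```python
import re

# Assumption: a tweet "raises a question" if it contains a question mark.
QUESTION_PATTERN = re.compile(r"\?")


def clean_tweets(tweets):
    """Lowercase tweets, drop duplicates (keeping the first), and remove questions."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = tweet.lower()
        if text in seen or QUESTION_PATTERN.search(text):
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


print(clean_tweets(["India will win", "india will win", "Who will win?"]))
```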

The dataset analysis also revealed that the majority of tweets contained several hashtags. Therefore, tweets with multiple hashtags and handles were classified according to the first hashtag or first handle.
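The first-hashtag rule might look like the following sketch. The mapping is an abbreviated, lowercased excerpt of Table 2, and skipping unrecognized tags until a known one is found is an assumption about how the rule was applied.

```python
import re

# Abbreviated, lowercased excerpt of Table 2; the full lists would be used in practice.
TAG_TO_TEAM = {
    "#teamindia": "India",
    "@bcci": "India",
    "#pakcricket": "Pakistan",
    "@therealpcb": "Pakistan",
}
TAG_PATTERN = re.compile(r"[#@]\w+")


def assign_team(tweet):
    """Return the team of the first recognized hashtag or handle, or None."""
    for tag in TAG_PATTERN.findall(tweet.lower()):
        team = TAG_TO_TEAM.get(tag)
        if team is not None:
            return team
    return None


print(assign_team("Come on #PakCricket, beat #TeamIndia today @BCCI"))
```

Here the tweet mentions both teams but is attributed to Pakistan, because `#PakCricket` appears first.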

Like in the “Historical” dataset, in this dataset, a record was mainly split into three parts: Team1 features, Team2 features, and the target variable. Table 3 presents the variables included in the final "Twitter" dataset.

Table 3.

Variables in the Twitter dataset.

Variable	Description
T1_TwitterVol	Volume of pre-match tweets for Team1
T1_FansSent	Aggregated fans’ sentiment for Team1
T1_FansPred	Average mentions of Team1 as the winner
T2_TwitterVol	Volume of pre-match tweets for Team2
T2_FansSent	Aggregated fans’ sentiment for Team2
T2_FansPred	Average mentions of Team2 as the winner
Winner	Winner of the match (Team1 or Team2)

The volume of pre-match tweets and aggregated fans’ sentiment variables for each team were calculated based on the study of Ul Mustafa et al. (2017).

$$\mathrm{T}j\_\mathrm{TwitterVol}^{i} = \frac{\text{Count of tweets}_{j}^{i}}{\text{Total number of tweets}^{i}} \tag{2}$$

The variable $\mathrm{T}j\_\mathrm{TwitterVol}^{i}$ indicates the pre-match tweet volume of Team j for the $i^{\text{th}}$ match. This variable was calculated by dividing the count of tweets of Team j for the $i^{\text{th}}$ match by the total number of tweets posted for that particular match.

$$\mathrm{T}j\_\mathrm{FansSent}^{i} = \frac{\text{Count of positive tweets}_{j}^{i}}{\text{Total number of tweets}^{i}} \tag{3}$$

This variable indicates the sentiment score of Team j for the $i^{\text{th}}$ match, and it was created by dividing the count of positive tweets of Team j for the $i^{\text{th}}$ match by the total number of tweets posted for the $i^{\text{th}}$ match.

A sentiment analysis was carried out to create the two variables T1_FansSent and T2_FansSent.

The fans’ prediction variable (average mentions as the winner) for each team was newly introduced in the study. While analyzing the scraped tweets, it was noticed that fans primarily use two patterns (“Team j win” or “Team j will win”) to express their predictions. The variable was calculated by dividing the number of tweets containing the selected patterns for Team j (“Team j win” or “Team j will win”) in the $i^{\text{th}}$ match by the number of tweets containing those selected patterns for both teams in the same match.

$$\mathrm{T}j\_\mathrm{FansPred}^{i} = \frac{\text{Count of tweets with selected patterns}_{j}^{i}}{\text{Total number of tweets with selected patterns for both teams}^{i}} \tag{4}$$
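Equations (2) to (4) can be computed together for one team in one match, as sketched below. The tweet record layout (`team`, `sentiment`, `predicts_win`) and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def twitter_features(tweets, team):
    """Compute Equations (2)-(4) for one team in one match.

    Each tweet is a dict with hypothetical keys: 'team', 'sentiment'
    ('positive' or 'negative'), and 'predicts_win' (True if the tweet
    matches "Team j win" or "Team j will win" for its own team).
    """
    total = len(tweets)
    team_tweets = [t for t in tweets if t["team"] == team]
    positive = sum(t["sentiment"] == "positive" for t in team_tweets)
    team_pred = sum(t["predicts_win"] for t in team_tweets)
    all_pred = sum(t["predicts_win"] for t in tweets)
    # Assumption: a zero denominator yields 0.0; the source does not cover this case.
    return {
        "TwitterVol": len(team_tweets) / total if total else 0.0,
        "FansSent": positive / total if total else 0.0,
        "FansPred": team_pred / all_pred if all_pred else 0.0,
    }


# Four illustrative tweets for a single match.
match_tweets = [
    {"team": "Team1", "sentiment": "positive", "predicts_win": True},
    {"team": "Team1", "sentiment": "negative", "predicts_win": False},
    {"team": "Team1", "sentiment": "positive", "predicts_win": False},
    {"team": "Team2", "sentiment": "positive", "predicts_win": True},
]
print(twitter_features(match_tweets, "Team1"))
```

Note that the sentiment score in Equation (3) divides by all tweets for the match, not only the team's own tweets, so it is bounded above by the team's tweet volume.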

The “Historical + Twitter” dataset was created by combining the above two datasets (Table 4).

Table 4.

Overview of datasets.

Dataset Name	Number of Observations	Number of Predictive Variables
Historical	519	15
Twitter	519	6
Historical + Twitter	519	21

Note: The target variable for all datasets is “Winner.”


Sentiment analysis

Due to a lack of resources and time constraints, only 3000 tweets were randomly selected from the collection, and those tweets were labeled to perform and evaluate the sentiment analysis. For this job, three annotators were employed, and final labels were assigned based on the majority vote. The objective of following such a procedure was to reduce subjectivity and human bias to some extent. Labeled tweets were then split into three sets called training (70%), validation (15%), and testing (15%). When carrying out the sentiment analysis, a RoBERTa-based model was mainly considered, and its performance was compared with VADER and LSTM.
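The labeling and splitting procedure can be sketched as follows. The function names and the random seed are hypothetical; the source specifies only three annotators, majority voting, and the 70/15/15 proportions.

```python
import random
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators (three votes, two classes, so no ties)."""
    return Counter(votes).most_common(1)[0][0]


def split_70_15_15(items, seed=0):
    """Shuffle and split into training, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary illustrative choice
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


print(majority_label(["positive", "negative", "positive"]))
train, val, test = split_70_15_15(range(3000))
print(len(train), len(val), len(test))
```

For the 3000 labeled tweets, this yields 2100 training, 450 validation, and 450 test tweets.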

The whole study was conducted using the Python programming language, with the majority of the code written and executed in Jupyter notebooks within the Anaconda environment (conda version 22.11.1). Google Colab with GPU acceleration was used to guarantee effective processing for more computationally demanding operations, such as training the RoBERTa model.

RoBERTa-based model

“Twitter-roberta-base-sentiment-latest (https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)” was chosen as the appropriate RoBERTa model from the Hugging Face model hub for this analysis. It is based on RoBERTa (Robustly Optimized BERT Pretraining Approach), proposed by Liu et al. (2019), which is a variation of BERT (Bidirectional Encoder Representations from Transformers). BERT, a pre-trained transformer model introduced by Devlin et al. (2018), can be used to perform sentiment analysis on a new dataset using transfer learning. RoBERTa improves on BERT in several ways, including training on more data, removing the next-sentence prediction loss, and using larger batch sizes (Zhao et al., 2021). In the preprocessing step, a tokenizer called “AutoTokenizer” was used for padding and truncation. After fine-tuning the selected model on the dataset, the fine-tuned model was uploaded to the hub (Hugging Face, 2022) for later use. This model can be accessed at https://huggingface.co/sppm/cric-tweets-sentiment-analysis/.

The hyperparameters used for model training are listed in Table 5.

Table 5.

Best hyperparameters of the RoBERTa-based model.

Hyperparameters	Value
learning_rate	5e-05
per_device_train_batch_size	16
per_device_eval_batch_size	16
seed	223
optimizer	Adam
num_epochs	200

VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool and is well suited for short and informal text data in social media, like tweets (Chiny et al., 2021). Based on its compound score, predictions were made in this study. When the score was greater than or equal to zero, tweets were labeled as positive; otherwise, they were labeled as negative. Lastly, the predictions were compared to the ground truth labels provided by annotators.
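The labeling rule applied to the compound score can be sketched as below. The compound scores and ground-truth labels are made up for illustration and are taken as given rather than computed by VADER here.

```python
def label_from_compound(compound):
    """Map a VADER compound score in [-1, 1] to a binary label (>= 0 is positive)."""
    return "positive" if compound >= 0 else "negative"


def accuracy(predicted, truth):
    """Share of predictions that match the annotators' labels."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


compound_scores = [0.64, 0.0, -0.42, 0.31]  # illustrative scores, not study data
truth = ["positive", "negative", "negative", "positive"]
predicted = [label_from_compound(c) for c in compound_scores]
print(predicted, accuracy(predicted, truth))
```

Under this rule a tweet with a compound score of exactly zero, which VADER assigns to text with no detected sentiment, is counted as positive.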

LSTM

LSTM (Long Short-Term Memory), an updated version of RNN, is widely utilized when building models from scratch for NLP applications like sentiment analysis, and it overcomes the vanishing gradient problem in RNN (Hochreiter and Schmidhuber, 1997). For this study, the training, validation, and test sets were cleaned using the “neattext” library for the task. WordNetLemmatizer was used for lemmatization, and NLTK's RegexpTokenizer was taken into consideration for tokenization. Only the words from the 400,000-word GloVe embedding list were retained after all of these steps. The 50-dimensional GloVe model was selected for the study. Simple data analysis was performed on the training set to identify the maximum number of tokens in a tweet, and that value was 50. The length of each tweet was equalized by taking that value into account. If a tweet didn’t have the defined number of tokens, that was padded with zeros. Then the architecture of the LSTM model was defined, and the model was trained on the training set using suitable hyperparameters.
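The length-equalization step can be sketched as follows. Padding at the end of the sequence and truncating over-long tweets are assumptions; the source states only that shorter tweets were padded with zeros to 50 tokens.

```python
MAX_TOKENS = 50  # maximum tweet length found in the training set, per the text


def pad_sequence(token_ids, max_len=MAX_TOKENS, pad_value=0):
    """Right-pad with zeros to max_len; longer inputs are truncated."""
    # Assumption: padding goes at the end, and tweets longer than max_len
    # (possible only outside the training set) are cut off.
    trimmed = list(token_ids)[:max_len]
    return trimmed + [pad_value] * (max_len - len(trimmed))


padded = pad_sequence([5, 3, 9])
print(len(padded), padded[:5])
```

With each of the 50 positions mapped to a 50-dimensional GloVe vector, every tweet becomes a 50 × 50 array, which matches the `input_shape` of (50,50) listed in Table 6.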

The hyperparameters and their corresponding values are presented in Table 6.

Table 6.

Best hyperparameters of the LSTM model.

Hyperparameters	Value
input_shape	(50,50)
return_sequences	True
units	First LSTM layer: 128 units; subsequent two layers: 64 units each
dropout	0.2
activation	Sigmoid function
epochs	200
optimizer	Adam
loss	BinaryCrossEntropy
learning_rate	1e-04

An intermediate evaluation was done on the validation set, and the final performance was calculated on the test set.

The optimal technique for sentiment analysis was then found by comparing the test-set metrics of these three models, and the scores were as given in Table 7.

Table 7.

Final results comparison of sentiment analysis.

Model	Precision	Recall	Accuracy	F1 score
Fine-Tuned RoBERTa	93.9%	96.7%	92.2%	95.3%
VADER	90.8%	95.7%	88.4%	93.1%
LSTM	92.2%	83.7%	80.9%	87.8%

The fine-tuned RoBERTa model worked better than the other two models. The class-wise F1 scores of this model were 95.3% for the positive class and 76.8% for the negative class, respectively.
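As a consistency check, the F1 scores in Table 7 can be approximately recovered from the reported precision and recall, since F1 is their harmonic mean. The sketch below uses the rounded table values, so the result may differ from the published figure in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 7 (rounded values).
table7 = {
    "Fine-Tuned RoBERTa": (0.939, 0.967),
    "VADER": (0.908, 0.957),
    "LSTM": (0.922, 0.837),
}
for model, (p, r) in table7.items():
    print(f"{model}: F1 = {f1_score(p, r):.3f}")
```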

The RoBERTa model was trained on Google Colab using an NVIDIA Tesla T4 GPU and its free-tier resources. It took 244.82 s (around 4 min) to complete.

Finally, the RoBERTa model was applied to determine the sentiment of the entire collection of tweets. Then, the values for the T1_FansSent and T2_FansSent variables were calculated by dividing the count of positive tweets of each team for a particular match by the total number of tweets posted for that specific match.

Data analysis

When considering the number of matches won, more than half of the matches (58.55%) were won by Team1. Therefore, the dataset was imbalanced. The correlation-association plot, shown in Appendix A, suggested that multicollinearity exists in all three datasets. Factor analysis groups highly correlated variables together into factors in order to reveal the underlying concept of a set of variables (Gie Yong and Pearce, 2013). Therefore, factor analysis was carried out to identify the highly correlated variables group-wise and to provide a solution for the multicollinearity. For that, both Bartlett's test and the KMO test should be satisfied. If the p-value in Bartlett's test is less than a specified significance level (e.g., 0.05), the correlation matrix is not an identity matrix, and factor analysis can be performed. The KMO test evaluates whether the sample is adequate for factor analysis. Usually, a KMO value greater than 0.6 is considered adequate (Sharma, 1995). The Bartlett's p-values for both the Historical and Historical + Twitter datasets were zero, and the overall KMO values for the Historical and Historical + Twitter datasets were 0.68 and 0.73, respectively. However, the overall KMO value for the Twitter dataset was 0.54. These values indicated that the Historical and Historical + Twitter datasets were appropriate for factor analysis, but the Twitter dataset was not. Therefore, factor analysis was not performed for the Twitter dataset.
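For reference, the two statistics are commonly defined as follows. These are the standard textbook forms, given here as an aid to the reader; the source does not state which formulas or software were used. Here $n$ is the number of observations, $p$ the number of variables, $R$ the correlation matrix with entries $r_{ij}$, and $a_{ij}$ the partial correlations between variables $i$ and $j$.

```latex
% Bartlett's test of sphericity: H0 is that R is an identity matrix.
\chi^{2} = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R\rvert,
\qquad \text{df} = \frac{p(p-1)}{2}

% Overall Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy.
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}}
```

A KMO value close to 1 means the partial correlations are small relative to the raw correlations, which is the situation in which shared factors are plausible.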

Model building

Categorical variables were encoded using one-hot encoding, and all the variables were standardized using StandardScaler(). Since the dataset was imbalanced, SMOTE was applied to each dataset. After that, each dataset was split into training (75%) and testing (25%) sets (Table 8).
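As a rough illustration of two of these preprocessing steps, the sketch below standardizes a column and derives the train/test counts for a 75/25 split. It uses only the standard library and is not the authors' pipeline: one-hot encoding and SMOTE are omitted, the function names are hypothetical, and rounding the test share up is an assumption, chosen because it reproduces the counts reported in Table 8.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Z-score a column: subtract the mean and divide by the population
    standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test), rounding the test share up (an assumption)."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

# Hypothetical raw values for one numeric variable.
scaled = standardize([1.0, 2.0, 3.0])
print([round(v, 4) for v in scaled])

# Total observations per dataset after resampling, taken from Table 8.
n_train, n_test = split_sizes(389 + 130)
print(n_train, n_test)
```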

Table 8.

Summary of train-test split (number of observations).

Dataset Name	Training set (75%)	Testing set (25%)
Historical	389	130
Twitter	389	130
Historical + Twitter	389	130

Prediction models were developed under three steps:

● Step 1: Models were trained with all variables available.

● Step 2: Models were built after removing unimportant variables.

● Step 3: The most important variables from every factor were chosen, and models were created.

After Step 1, the feature importance plot for the best-performing model was generated. In Step 2, models were built by removing the three least important variables, the two least important variables, and the least important variable sequentially according to that feature importance plot. In Step 3, the two most important variables from each factor were selected based on the importance plot of the best model in Step 2. Then various possible combinations of those variables were taken into consideration, and the combination of variables that gave the best results was identified. This process was carried out for all three datasets.
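The selection logic of Steps 2 and 3 can be sketched as follows. This is a simplified outline under stated assumptions: the importance values and the factor groupings are hypothetical, the function names are invented for the example, and model training and scoring are left out.

```python
from itertools import combinations

def drop_least_important(importances, k):
    """Step 2: return the variable names left after removing the k least important."""
    ranked = sorted(importances, key=importances.get)  # ascending importance
    dropped = set(ranked[:k])
    return [name for name in importances if name not in dropped]

def top_per_factor(factors, importances, n=2):
    """Step 3: keep the n most important variables from each factor."""
    shortlist = []
    for members in factors.values():
        ranked = sorted(members, key=importances.get, reverse=True)
        shortlist.extend(ranked[:n])
    return shortlist

def candidate_subsets(variables):
    """Step 3: yield every non-empty combination of the shortlisted variables."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)

# Hypothetical importances; only the variable names come from the paper.
importances = {
    "T1_FansSent": 0.24, "T2_FansSent": 0.21, "T2_AvgBound": 0.15,
    "T2_W/L": 0.13, "T1_AvgBound": 0.11, "T1_Mat": 0.09,
    "T2_AvgRunsConceded": 0.07,
}
# Hypothetical factor groupings.
factors = {
    "Factor 1": ["T2_AvgBound", "T2_W/L", "T2_AvgRunsConceded"],
    "Factor 2": ["T1_AvgBound", "T1_Mat"],
}

# Step 2: remove three, two, and one least important variables in turn.
for k in (3, 2, 1):
    print(k, drop_least_important(importances, k))

# Step 3: shortlist two variables per factor, then enumerate combinations.
shortlist = top_per_factor(factors, importances)
subsets = list(candidate_subsets(shortlist))
print(shortlist, len(subsets))
```

In practice each candidate subset would be used to train and evaluate a model, and the subset with the best test score would be retained.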

For Random Forest and XGBoost models, feature importance plots were generated based on the Mean Decrease in Impurity (MDI). For Support Vector Machine models, feature importance plots were created using the absolute values of the coefficients. The feature importance plots for the best models at each step for each dataset are provided in Appendix B.

Results

Performance of the final models

The final results for the three datasets are provided in Table 9.

In addition, a Multi-Layer Perceptron (MLP) classifier was trained on each dataset with the most important variables to determine whether or not the results could be further improved, but there was no improvement. The results are presented in Table 10.

Both tables show that models built using the Historical + Twitter dataset gave the best results compared to the other two. The best results for the Historical and Historical + Twitter datasets were given by Step 3 (Table 9), while for the Twitter dataset, the best result was given by Step 2 (Table 9). It is important to note that, when comparing the models, the F1 score was primarily taken into account due to the imbalance in the dataset. The Twitter data model performed better than the Historical data model with an F1 score of 71.5%, and the Historical + Twitter data model surpassed both the Historical data model and the Twitter data model with an F1 score of 73.7% (Table 9).

Table 9.

Summary of final results (training-set values are given in brackets).

Dataset	Step	Best Classifier	ROC AUC	Precision	Recall	Accuracy	F1 score
Historical	Step 1	Random Forest	61.9% (64.6%)	66.0% (66.1%)	56.4% (60.1%)	61.5% (64.6%)	60.8% (62.9%)
	Step 2	XGBoost	60.6% (65.0%)	63.5% (67.0%)	60.0% (59.3%)	60.6% (65.0%)	61.7% (62.9%)
	Step 3	XGBoost	61.1% (64.0%)	62.3% (62.9%)	69.1% (69.5%)	61.5% (64.0%)	65.5% (65.9%)
Twitter	Step 1	XGBoost	61.0% (67.5%)	61.9% (64.6%)	70.9% (77.4%)	61.5% (67.5%)	66.1% (70.4%)
Twitter	Step 2	XGBoost	65.5% (72.6%)	64.7% (69.5%)	80.0% (80.7%)	66.3% (72.6%)	71.5% (74.7%)
Historical + Twitter	Step 1	SVM	69.0% (71.8%)	70.2% (72.1%)	72.7% (71.2%)	69.2% (71.8%)	71.4% (71.6%)
	Step 2	SVM	67.0% (70.6%)	67.8% (71.0%)	72.7% (69.5%)	67.3% (70.6%)	70.2% (70.3%)
	Step 3	XGBoost	70.8% (76.5%)	71.2% (75.9%)	76.4% (77.8%)	71.2% (76.5%)	73.7% (76.8%)
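Since the F1 score is the harmonic mean of precision and recall, the test-set F1 values in Table 9 can be checked directly from the reported precision and recall. The short sketch below does this for the best model of each dataset; the function name and dictionary layout are illustrative choices, while the numbers are the test-set values from Table 9.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for the best test-set model per dataset.
reported = {
    "Historical (Step 3)": (0.623, 0.691, 0.655),
    "Twitter (Step 2)": (0.647, 0.800, 0.715),
    "Historical + Twitter (Step 3)": (0.712, 0.764, 0.737),
}

for name, (precision, recall, reported_f1) in reported.items():
    computed = round(f1_score(precision, recall), 3)
    print(f"{name}: computed {computed}, reported {reported_f1}")
```

All three recomputed values agree with the reported F1 scores to one decimal place in percentage terms.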

Table 10.

Results of the MLP classifier (training-set values are given in brackets).

Dataset	ROC AUC	Precision	Recall	Accuracy	F1 score
Historical	57.8% (59.3%)	56.6% (61.0%)	66.7% (65.5%)	57.8% (59.6%)	61.2% (63.2%)
Twitter	64.8% (66.5%)	65.1% (63.4%)	74.5% (77.8%)	65.4% (66.5%)	69.5% (69.9%)
Historical + Twitter	72.4% (73.2%)	73.2% (76.5%)	70.8% (70.9%)	72.4% (73.1%)	72.0% (73.6%)

The input variables for the best Historical + Twitter model were T1_FansSent, T2_FansSent, T2_AvgBound, T2_W/L, T1_AvgBound, T1_Mat, and T2_AvgRunsConceded.

The best model was trained based on the optimal hyperparameters given in Table 11, obtained through GridSearchCV.

Table 11.

The optimal hyperparameters of the best model.

Hyperparameters	Value
learning_rate	1e-04
n_estimators	2000
max_depth	7
min_child_weight	1
subsample	0.25
colsample_bytree	0.2
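A grid search evaluates every combination of the candidate values supplied for each hyperparameter. The sketch below enumerates such a grid with the standard library to show how quickly the number of candidates grows. The `best_params` values are those of Table 11, but the search grid itself is hypothetical, since the grid used in the study is not reported here.

```python
from itertools import product

# Optimal values reported in Table 11.
best_params = {
    "learning_rate": 1e-4, "n_estimators": 2000, "max_depth": 7,
    "min_child_weight": 1, "subsample": 0.25, "colsample_bytree": 0.2,
}

# Hypothetical search grid, assumed for illustration only.
param_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "n_estimators": [500, 1000, 2000],
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3],
    "subsample": [0.25, 0.5],
    "colsample_bytree": [0.2, 0.5],
}

def grid_candidates(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

candidates = list(grid_candidates(param_grid))
print(len(candidates))            # 3 * 3 * 3 * 2 * 2 * 2 combinations
print(best_params in candidates)  # the reported optimum lies inside this grid
```

With k-fold cross-validation, each candidate is fitted k times, so the total number of model fits is the number of candidates multiplied by k.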

Moreover, the feature importance plot for this best model indicated that the two new features created through sentiment analysis were the most important variables.

The marginal impact of one or two variables on the outcomes predicted by the machine learning model can be displayed by presenting partial dependence plots (PDPs/PD plots) (Friedman, 2001). Selected partial dependence plots for certain variables are shown in Figure 3.

Figure 3.

Partial dependence plots.

Note that the X-axis of these graphs shows the standardized values.

According to the partial dependence plot for T1_FansSent, the higher Team1's sentiment score, the higher its chance of winning. The second partial dependence plot, for the Team2 fans’ sentiment score, shows that Team1's winning chance decreases as T2_FansSent increases up to a value of nearly 0.9, after which it begins to increase. This reversal was found to be due to a lack of data in that range. Therefore, it can be concluded that, overall, the probability of Team1 winning decreases as T2_FansSent increases. Among the historical variables, T2_AvgBound and T2_W/L are both negatively associated with Team1's winning probability: if Team2 had a higher win-loss ratio and took more boundaries per over, Team1 would have a lower chance of winning.
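For readers unfamiliar with how a partial dependence curve is obtained, the sketch below computes one by hand: the feature of interest is fixed at each grid value for every observation, and the model's predictions are averaged. The logistic `toy_model`, its coefficients, and the sample rows are hypothetical stand-ins for the fitted XGBoost model and the standardized data, chosen only to mirror the directions of the effects described above.

```python
import math

def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(predictions) / len(predictions))
    return curve

def toy_model(row):
    """Hypothetical stand-in for the fitted classifier: P(Team1 wins)."""
    z = 1.2 * row["T1_FansSent"] - 0.9 * row["T2_FansSent"]
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical standardized observations.
rows = [
    {"T1_FansSent": -0.8, "T2_FansSent": 0.4},
    {"T1_FansSent": 0.1, "T2_FansSent": -0.3},
    {"T1_FansSent": 1.0, "T2_FansSent": 0.9},
]
grid = [-1.0, 0.0, 1.0]

pd_t1 = partial_dependence(toy_model, rows, "T1_FansSent", grid)
pd_t2 = partial_dependence(toy_model, rows, "T2_FansSent", grid)
print([round(v, 3) for v in pd_t1])  # rises with Team1 fan sentiment
print([round(v, 3) for v in pd_t2])  # falls with Team2 fan sentiment
```

Where few observations exist near a grid value, the averaged predictions there rest on extrapolation, which is one way a curve can change direction at its extremes.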

Check the effectiveness of the proposed methodology

To check the effectiveness of the proposed methodology, predictions from the best model for the T20 World Cup 2022 matches were compared with the bookmakers’ pre-match predictions for those matches. One hour before the start of each T20 World Cup 2022 match, the pre-match betting odds were manually gathered from oddsportal.com (https://www.oddsportal.com/). Only 14 of the matches contested among the top nine teams had a clear winner. Of those 14 matches, the bookmakers correctly predicted nine, whereas the best model proposed in this study correctly predicted 11. However, bookmakers express their predictions as odds rather than as a count of correctly predicted matches (Ul Mustafa et al., 2017), so comparing profit and loss is a better way to evaluate the two approaches. Suppose $1 was bet on each match, for a total stake of $14 across all 14 matches. Betting according to the bookmakers’ predictions would have returned a total payout of only $12.13, a loss of $1.87. Following the proposed methodology would have returned a total payout of $16.96, a profit of $2.96. This implies that the model proposed by this study follows a different prediction approach than that of the bookmakers.
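The profit and loss comparison can be reproduced with a small calculation. The sketch below assumes decimal odds, under which a winning $1 bet returns the quoted odds and a losing bet returns nothing; the individual odds and outcomes in `hypothetical_bets` are invented for illustration, while the stake and payout totals at the end are those reported above.

```python
def settle(bets, stake=1.0):
    """Return (total_staked, total_payout, profit) for a list of bets.

    Each bet is a (decimal_odds, won) pair. With decimal odds, a winning
    bet returns stake * odds and a losing bet returns nothing.
    """
    staked = stake * len(bets)
    payout = sum(stake * odds for odds, won in bets if won)
    return staked, payout, payout - staked

# Hypothetical odds and outcomes for three matches.
hypothetical_bets = [(1.80, True), (2.10, False), (1.50, True)]
staked, payout, profit = settle(hypothetical_bets)
print(staked, round(payout, 2), round(profit, 2))

# Totals reported in the text: $14 staked over 14 matches.
total_stake = 14.00
model_profit = round(16.96 - total_stake, 2)      # proposed model
bookmaker_profit = round(12.13 - total_stake, 2)  # bookmakers' favourites
print(model_profit, bookmaker_profit)
```

Profit depends on the odds of the correctly predicted matches as well as on their number, which is why the payout comparison is more informative than the raw count of correct predictions.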

Discussion and conclusion

Discussion

Wickramasinghe and Yapa (2018) achieved 85% accuracy in sentiment analysis for cricket tweets, while our fine-tuned RoBERTa model achieved 92.2% accuracy and a high F1 score of 95.3%, making it a better model for sentiment detection in cricket tweets. In contrast to their approach, our model was pre-trained on a sizable tweet set to identify tweet sentiment precisely, which might explain the improved performance.

The primary goal of this study was to use historical and Twitter data to predict the outcome of T20I matches before they began. The study's final results for the best predictive models are summarized in Table 12.

Table 12.

Predictive model comparison using F1 scores.

Model	Using the complete dataset	After removing the least important variables	After using the variable reduction method
Historical	60.8% (Random Forest)	61.7% (XGBoost)	65.5% (XGBoost)
Twitter	66.1% (XGBoost)	71.5% (XGBoost)	-
Historical + Twitter	71.4% (SVM)	70.2% (SVM)	73.7% (XGBoost)

To the authors' knowledge, no study to date has built a model that predicts the T20I match outcome before the match starts by integrating both Historical and Twitter data. The best model proposed in this study, the Historical + Twitter model, achieved ROC AUC, precision, recall, accuracy, and F1 score values of 70.8%, 71.2%, 76.4%, 71.2%, and 73.7%, respectively, surpassing both individual data models.

Limitations

The study used ESPNcricinfo Statsguru's vast collection of historical data (Statsguru | Searchable Cricket Statistics database | ESPNcricinfo.com, 2000), but it was challenging to gather every data point. Only batting, bowling, and team statistics tables were scraped, and features were chosen based on previous works. Due to time and cost limitations, only 3000 tweets were used for training and evaluating sentiment analysis models. The study was limited to a short time period due to unavailability and shortage of tweets.

Future work

Future research could involve adding more features to the historical dataset, labeling more tweets for sentiment analysis, and adding new features to the Twitter dataset, considering metrics like retweet count and likes. Creating a web or mobile application for the prediction model would be beneficial, and a prediction model that uses historical and Twitter data might be used for various cricket formats and sports.

Conclusion

This study aimed to predict the outcomes of T20I matches. The combined model outperformed each individual model, with the Twitter model performing better than the historical model. The models used data collected one hour before the match, allowing them to predict match outcomes even before the toss. This can help team management when deciding the final squad. The RoBERTa-based transfer learning model surpassed VADER and LSTM + GloVe for sentiment analysis. The final prediction performance improved when the two datasets were combined. The fans’ sentiment score variables for the two teams were the most important variables in the best model, indicating that Twitter data provides information that historical data does not. The best model was applied to the data from the T20 World Cup 2022. It was found that this model could outperform bookmakers’ forecasts and generate profits.

Footnotes

ORCID iDs

Pavanthi Sudasinghe

Sameera Viswakula

Pemantha Lakraj

Author contributions

Pavanthi Sudasinghe performed data collection, developed the models, carried out coding and analyses, and wrote the manuscript. Supervisors, Sameera Viswakula and Pemantha Lakraj, provided guidance throughout the project, including conceptual design, methodology, and critical revisions of the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The complete dataset is available at: .

Appendices

References

Al-Shabi (2020) Evaluating the performance of the most important Lexicons used to Sentiment analysis and opinions Mining. IJCSNS International Journal of Computer Science and Network Security 20(1): 51–57.

Cheng, Heyl, Lad, et al. (2021) Evaluation of twitter data for an emerging crisis: An application to the first wave of COVID-19 in the UK. Scientific Reports 11(1): Nature Publishing Group UK: 1–14.

Chiny, Chihab, et al. (2021) LSTM, VADER and TF-IDF based hybrid sentiment analysis model. International Journal of Advanced Computer Science and Applications 12(7): 265–275.

Devlin, Chang M-W, Lee, et al. (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Friedman (2001) Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5): 1189–1232.

Ghasiya and Okamura (2021) Investigating COVID-19 news across four nations: A topic modeling and sentiment analysis approach. IEEE Access 9: Institute of Electrical and Electronics Engineers Inc.: 36645–36656.

Gie Yong and Pearce (2013) A beginner’s guide to factor analysis: Focusing on exploratory factor analysis. Tutorials in Quantitative Methods for Psychology 9(2): 79–94.

Godin, Zuallaert and Vandersmissen (2015) Beating the bookmakers: Leveraging statistics and twitter microposts for predicting soccer results. In: Proceedings of the 2014 KDD International Workshop on Large-Scale Sports Analytics, 2015, pp.17–21.

Hatharasinghe and Poravi (2019) Data Mining and Machine Learning in Cricket Match Outcome Prediction: Missing Links. In: 2019 IEEE 5th International Conference for Convergence in Technology, I2CT 2019, 2019, pp. 1–4. IEEE. Available at: https://ieeexplore.ieee.org/abstract/document/9033698.

Hochreiter and Schmidhuber (1997) Long short-term memory. Neural Computation 9(8): 1735–1780.

Kaluarachchi and Varde (2010) CricAI: A classification based tool to predict the outcome in ODI cricket. In: Proceedings of the 2010 5th International Conference on Information and Automation for Sustainability, ICIAfS 2010, 2010, pp.250–255.

Kampakis and Adamides (2014) Using Twitter to predict football outcomes. Epub ahead of print 2014.

Kampakis and Thomas (2015) Using Machine Learning to Predict the Outcome of English County twenty over Cricket Matches: 1–17.

Lamsal and Choudhary (2018) Predicting Outcome of Indian Premier League (IPL) Matches Using Machine Learning. Epub ahead of print 2018.

Lay, Lee, Gan, et al. (2019) Semi-supervised learning for sentiment classification using small number of labeled data. Procedia Computer Science 161: 577–584.

Liu, Ott, Goyal, et al. (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. Epub ahead of print 26 July 2019.

Omar and Abd El-Hafeez (2023) Quantum computing and machine learning for Arabic language sentiment classification in social media. Scientific Reports 13(1), Nature Research.

Sankaranarayanan, Sattar and Lakshmanan LVS (2014) Auto-play: A data mining approach to ODI cricket simulation and prediction. In: SIAM International Conference on Data Mining 2014, SDM 2014, April 2014, pp. 1064–1072. Society for Industrial and Applied Mathematics.

Schumaker, Jarmoszko and Labedz (2016) Predicting wins and spread in the premier league using a sentiment analysis of twitter. Decision Support Systems 88: 76–84.

Sharma (1995) Applied Multivariate Techniques. New York, NY: Wiley & Sons, Inc.

Ul Mustafa, Nawaz, Lali MIU, et al. (2017) Predicting the cricket match outcome using crowd opinions on social networks: A comparative study of machine learning methods. Malaysian Journal of Computer Science 30(1): 63–76.

Wickramasinghe (2022) Applications of machine learning in cricket: A systematic review. Machine Learning with Applications 10: 100435.

Wickramasinghe and Yapa (2018) Cricket match outcome prediction using tweets and prediction of the man of the match using social network analysis: Case study using IPL data. In: 18th International Conference on Advances in ICT for Emerging Regions, ICTer 2018 - Proceedings, September 2018, pp. 1–1. IEEE.

Zhao and Zheng (2021) A BERT based Sentiment Analysis and Key Entity Detection Approach for Online Financial Texts. In: 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), May 2021, pp. 1233–1238. IEEE.

DIGITAL 2023: GLOBAL OVERVIEW REPORT (2023) DataReportal. Available at: https://datareportal.com/reports/digital-2023-global-overview-report (Accessed: December 30, 2023).

Hugging Face (2022) “cric-tweets-sentiment-analysis.” Available at: https://huggingface.co/sppm/cric-tweets-sentiment-analysis (Accessed: December 30, 2023).

ICC Men’s Twenty20 International Playing Conditions (2022) ICC Cricket. Available at: https://resources.pulse.icccricket.com/ICC/document/2022/10/13/28311034-6d66-436e-a20a6d9eaf1d2c24/ICC-Men-s-T20I-Pla-October-2022.p (Accessed: March 9, 2023).

“Phil Lynch on Manchester United’s media strategy and ‘the Ronaldo effect’” (2021) Google Podcasts. Available at: https://podcasts.google.com/feed/aHR0cHM6Ly9hdWRpb2Jvb20uY29tL2N oYW5uZWxzLzUwNjMyMDIucnNz/episode/dGFnOmF1ZGlvYm9vbS5jb2.

Statsguru | Searchable Cricket Statistics database | ESPNcricinfo.com (2000) https://stats.espncricinfo.com/ci/engine/stats/index.html .

World Atlas (2020) “The Most Popular Sports In The World.” Available at: https://www.worldatlas.com/articles/what-are-the-most-popular-sports-in-the-world.html (Accessed: December 30, 2023).

Can social media opinions add value to historical data?: A study for T20I cricket match outcome prediction using machine learning
