Abstract
To find the determinants of non-penalty goals scored per match in association football (soccer), this paper develops a regression model with 8 explanatory variables, based on observations for 98 teams playing in the top tiers of club football in England, Spain, Germany, France and Italy. We started with a framework of twenty-one pitch-action variables, both technical and tactical, and, using data for the 2015-16 football season, narrowed it down to the 8-variable model. The paper used a log-linear regression model in order to remove heteroscedasticity. The model estimated the number of non-penalty goals per game with an absolute error of less than 0.33 for 93 of the 98 teams. For 52 teams the absolute error was less than 0.1. Shots from the penalty box per game, the share of shots from the goal box in total shots and long pass accuracy were found to have a statistically significant positive impact on non-penalty goals scored per game. The share of long passes in total passes and crosses per game have a significant negative impact.
Keywords
Introduction
Over the last ten years, performance analysis in association football¹ (soccer) has made considerable progress. A sizeable section of this body of research attempts to identify factors that influence team performance. Researchers have attempted to identify performance indicators that differentiate between successful and unsuccessful teams, both in tournament-format competitions and in league competitions. Hughes and Bartlett (2002) defined performance indicators as a set of action variables that attempts to define at least some aspects of a performance. In the case of tournaments, success has generally been defined by the stage of the competition reached by the team. For leagues, points scored and standing in the league table define success. Success may depend on possession (Collet, 2013, James et al., 2004), high-intensity running and sprints undertaken (Di Salvo et al., 2009), passing (Saito et al., 2013, Scoulding et al., 2004), chance (Lagos, 2007), or even analysis of game related statistics (Lago-Penas et al., 2010). While success in a final game might depend on a few factors like shots on goal and effective goalkeeping (Szwarc, 2007), success in a league depends on multiple factors like goals to shots ratio, percentage of goals scored from outside the box, ratio of short to long passes, number of crosses, number of goals conceded and even number of yellow cards (Oberstone, 2009). There are also studies that attempt to identify the determinants of the performance indicators themselves. For example, Lago and Martin (2007) investigated the determinants of possession.
The most important determinant of success in football is scoring more goals than the number of goals conceded. While success depends on both the offensive and the defensive prowess of the team, a very low scoring team cannot win a season-long league. This in effect makes goal-scoring the most important activity on the pitch in football leagues. Also, spectators spend their money and effort primarily to see goals. Scoring goals or creating goal-scoring opportunities depends on various technical and tactical parameters, as well as on the situation of the game. Research papers like Ensum et al. (2005), Hughes and Franks (2005), Konstadinidou and Tsigilis (2005), Janković et al. (2011), Lago-Penas et al. (2010a), Tenga and Sigmundstad (2011) and Wright et al. (2011) identified various determinants including passing accuracy, shooting accuracy and success, possession, types of passes and passing sequences, attacking third entry, position of attempt and type of shot, distance covered and formation. Another strand of literature focuses on identifying goal scoring patterns (Garganta et al., 1997, Yiannakos & Armatas, 2006, Armatas et al., 2007, Redwood-Brown, 2008, Armatas et al., 2009, Lago-Penas et al., 2010b, Tenga et al., 2010, Ridgewell, 2011, Mitrotasios & Armatas, 2014, Pratas et al., 2012) depending on time of goal scored, sequence of actions prior to goal, passing pattern before goal scoring, area of scoring attempt, type of attack and other situational variables.
Method of analysis
Most of the studies mentioned in the previous section used data from international knock-out tournaments. One reason for choosing international knock-out tournaments like the FIFA World Cup or the UEFA European Championship is the presence of a larger number of teams vis-à-vis domestic leagues. A larger number of teams, and hence a larger number of observations, allows researchers to consider a larger number of factors or explanatory variables that might have an effect on goal scoring. It is not possible to study a single domestic league, where only 20 teams participate, with a large number of explanatory variables: there are very few degrees of freedom left if the number of explanatory variables is increased with only 20 observations. In order to consider a large number of explanatory variables, in this paper we used 98 observations from the English Premier League, La Liga, Bundesliga, Ligue 1 and Serie A for the 2015-16 season. In order to find the determinants of the average number of non-penalty² goals scored per game, we considered 8 technical or skill related variables, 11 tactical variables and 2 set-piece related variables as plausible determinants.
Data source
We used data from whoscored.com, which is now an influential website for football (soccer) statistics. The data sources for whoscored.com³ are Opta Sports and eNetPlus, which are reliable and accepted sources. The website provides ratings for players as well as for teams and keeps the data available in the public domain.
Variables
Goal scoring ability of a football team may depend on seven different kinds of pitch actions – (1) shots, (2) passes, (3) crosses, (4) set-pieces, (5) dribbles, (6) aerial balls and (7) possession. Some of these pitch actions can be broken into finer details. We considered pitch actions as illustrated in Fig. 1.

Fig. 1. Pitch actions that create goal scoring opportunities.
Goal box is the six yard box. Penalty box is the 18 yard box. “Shots” means shots taken at goal with intent of scoring. “Shots from penalty box” means the shots taken from inside the 18 yard box but outside the six yard box. Penalty kicks are also taken from the spot inside the 18 yard box, but outside the six yard box. However, penalty kicks are not included in “shots from penalty box”. The pitch actions are explained in Table 1.
Table 1. Explanation of pitch actions
Out of these pitch actions we created 21 variables, which can be classified into three categories – (A) technical or skill related, (B) tactical and (C) set-pieces earned, as summarized in Table 2. The technical variables are measures of accuracy and of success of different pitch actions, and depend on the skill level of the players and coordination among team-mates. However, whether to play long passes or short passes, whether to attempt a shot on goal from outside the box or from within the box, whether to attempt dribbles or rely on passing, whether to play from the wide positions and attempt crosses, and whether to have possession or to let the opponent have possession are tactical decisions made by the manager and the coaching staff. We have classified such variables as tactical variables. Earning free-kicks and corners depends on how much a team can press the opponent as well as on the referee. That is why we kept those variables in a separate category. We calculated the values of each of these explanatory variables for each of the 98 teams in the five leagues using the data collected from whoscored.com. The data was collected on 18th May 2016, after all the games in all five leagues were completed.
Table 2. Definitions of explanatory variables
All passes were classified as either short passes (less than 25 yards long) or long passes (more than 25 yards long). Therefore, percentage share of short passes is only (100 – percentage share of long passes). Hence, instead of considering percentage share of long passes as well as that of short passes, we considered only the percentage share of long passes. Similarly, all shots were classified as either from outside of the box, or from inside the penalty box (but outside the goal box), or from inside the goal box. Since we considered percentage share of shots from penalty box as well as that from goal box, there is no reason to take the percentage share of shots from out of box separately.
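Since we are interested in finding the determinants of non-penalty goals scored per game (NPGPG)⁴, it becomes our dependent variable. NPGPG is defined as
$$\mathrm{NPGPG} = \frac{\text{Non-penalty goals scored by the team in the season}}{\text{Number of matches played by the team in the season}}$$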
Among the five leagues from which we took data, all except Bundesliga had 20 teams and hence each team played 38 matches during the season. But Bundesliga had 18 teams and hence the Bundesliga teams played 34 matches each during the season. Because of this asymmetry in number of games played by teams, we took non-penalty goals scored per game as our dependent variable, instead of total non-penalty goals.
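The per-game normalisation described above can be illustrated with a minimal sketch. The function name and the season totals below are hypothetical and are not taken from the paper's data set; the sketch also assumes that non-penalty goals are obtained by subtracting goals scored from penalty kicks from total goals.

```python
def non_penalty_goals_per_game(goals, penalty_goals, matches):
    """Return non-penalty goals scored per match for one team-season."""
    if matches <= 0:
        raise ValueError("matches must be positive")
    return (goals - penalty_goals) / matches


# Hypothetical season totals (illustration only, not data from the paper).
# A 20-team league gives 38 matches per team; the 18-team Bundesliga gives 34.
la_liga_team = non_penalty_goals_per_game(goals=60, penalty_goals=3, matches=38)
bundesliga_team = non_penalty_goals_per_game(goals=54, penalty_goals=3, matches=34)

# The season totals differ (57 vs 51 non-penalty goals),
# but the per-game rates are directly comparable.
print(la_liga_team, bundesliga_team)
```

The two hypothetical teams score different totals but the same rate per game, which is why the per-game measure is the comparable quantity across leagues of different size.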
We understand that some of the 21 explanatory variables defined in Table 2 may be highly correlated, resulting in multicollinearity⁵. After checking pairwise correlations, we removed at least one of the variables among those that had pairwise correlation coefficients higher than 0.8 in absolute value. In order to retain the maximum number of variables we used a simple rule: if a variable is pairwise correlated with more than one variable, but the variables with which it is correlated are correlated only with this variable, then we removed this variable only. The five variables that we eliminated are SHGB, SPACC, SPPG, FKACC and POSSH. The correlation matrix is given in the Appendix (Table A1).
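The pairwise screening step can be sketched as follows. This is only an illustration of flagging pairs whose correlation exceeds 0.8 in absolute value; it does not implement the paper's rule for choosing which variable of a pair to drop. The toy values and the column `SH_SHORT` (a hypothetical share of short passes, equal to 100 minus SHLP) are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.8):
    """List variable pairs whose |r| exceeds the threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Toy values for five hypothetical teams (not the paper's data).
toy = {
    "SHLP": [10.0, 12.0, 15.0, 18.0, 20.0],      # share of long passes (%)
    "SH_SHORT": [90.0, 88.0, 85.0, 82.0, 80.0],  # hypothetical: 100 - SHLP
    "CRPG": [22.0, 17.0, 25.0, 16.0, 21.0],      # crosses per game
}
flagged = highly_correlated_pairs(toy)
print(flagged)
```

In this toy case only the long-pass and short-pass shares are flagged, which mirrors the reason given earlier for keeping only the share of long passes.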
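Using EViews 6, we ran the following linear regression model.
$$\mathrm{NPGPG}_i = \alpha + \sum_{k=1}^{16} \beta_k X_{ki} + u_i \tag{1}$$
where $i$ indexes the team, $i = 1, \dots, 98$, $X_{ki}$ is the value of the $k$th of the 16 retained explanatory variables for team $i$, $\beta_k$ is its coefficient, $\alpha$ is the constant term and $u_i$ is the residual term for the $i$th observation.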
The regression result is given in Table A2 (see Appendix). Though the adjusted R² is high (0.7998) and the probability value of the F-statistic is 0, indicating that the model is overall statistically significant, we can see from Table A2 that the t-statistic is significant (higher than 1.98)⁶ for only 5 variables. This might be due to further presence of multicollinearity, or due to the presence of heteroscedasticity⁷, or because the residuals are not normally distributed. Looking at the scatter diagrams for NPGPG against some of the explanatory variables, we suspected the presence of heteroscedasticity. Since our sample is sufficiently large, we ran a White test for model (1). The result of the test is given in Table 3. Since the probability values for both the F-statistic and the χ² statistic are less than 0.05, we could not rule out the presence of heteroscedasticity at the 5% level.
In the presence of heteroscedasticity the estimators fail to be BLUE (Best Linear Unbiased Estimators), and model (1) is not acceptable. As an additional diagnostic test we ran the Jarque-Bera test on model (1) to see if the residuals are nearly normally distributed. The result is shown in Fig. 2. Since the Jarque-Bera (JB)⁸ statistic is high and the probability is low, we reject the hypothesis that the residuals are normally distributed.
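For readers who want to reproduce the normality check outside EViews, the following is a minimal sketch of the Jarque-Bera statistic, JB = (n/6)[S² + (K − 3)²/4], computed from a list of residuals. The residual values are hypothetical. The p-value uses the fact that the upper-tail probability of a χ² distribution with 2 degrees of freedom is exp(−JB/2).

```python
from math import exp


def jarque_bera(residuals):
    """Return (JB statistic, asymptotic p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments (population form, dividing by n).
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    p_value = exp(-jb / 2.0)  # survival function of chi-square with 2 df
    return jb, p_value


# Hypothetical residuals (illustration only).
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed = [0.0, 0.0, 0.0, 0.0, 10.0]
jb_sym, p_sym = jarque_bera(symmetric)
jb_skew, p_skew = jarque_bera(skewed)
print(jb_sym, p_sym, jb_skew, p_skew)
```

A strongly skewed sample produces a larger JB and a smaller p-value than a symmetric one; with samples this small the asymptotic p-value is only indicative.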

Fig. 2. Histogram of residuals for model (1).
Table 3. White Heteroscedasticity Test for Model (1)
Since heteroscedasticity exists and the residuals are not normally distributed, we need to change model (1). A log transformation is likely to reduce heteroscedasticity because it compresses the scales in which the variables are measured. Taking a log transformation of model (1), we constructed the following model and ran the regression.
$$\ln(\mathrm{NPGPG}_i) = \alpha^{*} + \sum_{k=1}^{16} \beta^{*}_{k}\,\ln(X_{ki}) + u_i \tag{2}$$
where $i$ indexes the team, $i = 1, \dots, 98$, $X_{ki}$ is the value of the $k$th explanatory variable for team $i$, $\beta^{*}_{k}$ is the coefficient of the $k$th variable, $\alpha^{*}$ is the constant term and $u_i$ is the residual term for the $i$th observation.
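A short note on interpretation may help here. Writing $X_{ki}$ for the value of the $k$th explanatory variable for team $i$ (notation assumed for this note), both sides of the log-log model are in logarithms, so each coefficient can be read as an elasticity:

```latex
\frac{\partial \ln(\mathrm{NPGPG}_i)}{\partial \ln(X_{ki})} = \beta^{*}_{k}
\quad\Longrightarrow\quad
\frac{\Delta \mathrm{NPGPG}_i}{\mathrm{NPGPG}_i} \approx \beta^{*}_{k}\,\frac{\Delta X_{ki}}{X_{ki}}
```

That is, holding the other variables fixed, a 1% increase in the $k$th variable is associated with a change of roughly $\beta^{*}_{k}$ percent in non-penalty goals per game. The approximation holds for small changes.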
The result of the regression run on model (2) is given in Table A3 in the Appendix. The high adjusted R² (0.7598) and the 0 probability value of the F-statistic indicate that the model is overall statistically significant. Though the adjusted R² is slightly lower than that of model (1), we chose model (2) over model (1) on the basis of the AIC (Akaike Information Criterion)⁹ and the SIC (Schwarz Information Criterion)¹⁰.
We developed model (2), rejecting model (1), because of the presence of heteroscedasticity in model (1). As a diagnostic test we therefore ran the White test on model (2). The result of the test is given in Table 4.
Table 4. White Heteroscedasticity Test for Model (2)
Since the probability values for both the F-statistic and the χ² statistic are more than 0.05, we can rule out the presence of heteroscedasticity at the 5% level. We also ran the Jarque-Bera test on model (2) to see if the residuals are nearly normally distributed. The result is shown in Fig. 3. Since the JB statistic is low (less than 1) and the probability is high (0.6275), we cannot reject the hypothesis that the residuals are normally distributed. The Durbin-Watson d-statistic is 1.8825, suggesting that there is no autocorrelation¹¹. This means that model (2) satisfies all conditions for the estimators to be BLUE. Despite that, the t-statistics are not significant for most of the variables (refer to Table A3 in the Appendix). That is likely due to the further presence of multicollinearity. In such a scenario the practice is to first remove the explanatory variables whose t-statistic is less than 1 in absolute value. From Table A3 it can be seen that the t-statistic is in the interval (–1, 1) for ln(SHACC), ln(SHOB), ln(SHSPB), ln(LPPG), ln(DRSUC), ln(DRPG) and ln(ARSUC). Removing these seven explanatory variables we reconstructed the regression model as:
$$\ln(\mathrm{NPGPG}_i) = \alpha^{*} + \sum_{k=1}^{9} \beta^{*}_{k}\,\ln(X_{ki}) + u_i \tag{3}$$
where $i$ indexes the team, $i = 1, \dots, 98$, $X_{ki}$ is the value of the $k$th of the nine remaining explanatory variables for team $i$, $\beta^{*}_{k}$ is the coefficient of the $k$th variable, $\alpha^{*}$ is the constant term and $u_i$ is the residual term for the $i$th observation.

Fig. 3. Histogram of residuals for model (2).
The result of the regression run on model (3) is given in Table A4 (see Appendix). The adjusted R² (0.7662) is higher than that of model (2). More importantly, the AIC (–0.9475) and SIC (–0.6837) values are lower than those for model (2). This indicates that the variables removed were irrelevant and hence model (3) is a better model than model (2). To be sure, we ran the White test (to check heteroscedasticity) and the Jarque-Bera test (to check normality of the residuals) on model (3). The results of both tests were negative, i.e., we found no evidence of heteroscedasticity and could not reject the hypothesis that the residuals are normally distributed. The Durbin-Watson d-statistic is 1.9986, which indicates that there is no autocorrelation either. The t-statistic is significant¹² for ln(SHPB), ln(SHSGB), ln(LPACC), ln(SHLP) and ln(CRPG). For the other variables, except ln(COPG), the t-statistics are larger than 1 in absolute value.
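Since the t-statistic for ln(COPG) is –0.6166, we removed the variable in our next iteration and reconstructed the regression model as follows:
$$\ln(\mathrm{NPGPG}_i) = \alpha^{*} + \sum_{k=1}^{8} \beta^{*}_{k}\,\ln(X_{ki}) + u_i \tag{4}$$
where $i$ indexes the team, $i = 1, \dots, 98$, $X_{ki}$ is the value of the $k$th of the eight remaining explanatory variables for team $i$, $\beta^{*}_{k}$ is the coefficient of the $k$th variable, $\alpha^{*}$ is the constant term and $u_i$ is the residual term for the $i$th observation.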
The result of the regression run on model (4), as given in Table A5 of the Appendix, suggests that model (4) is the most suitable regression model for estimating the determinants of non-penalty goals per game. There is no explanatory variable with a t-statistic in the interval (–1, 1). The adjusted R² (0.7678), AIC (–0.9636) and SIC (–0.7261) are all better than those of model (3). To be sure, we ran the White test to rule out heteroscedasticity and the Jarque-Bera test to check that the residuals are normally distributed. The tests affirmed homoscedasticity (i.e., ruled out heteroscedasticity) and normality of residuals. The Durbin-Watson d-statistic is 2.0082, indicating that there is no autocorrelation.
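The Durbin-Watson d-statistic quoted for models (2) to (4) can be computed from the residuals as the sum of squared successive differences divided by the sum of squared residuals. The sketch below uses hypothetical residual sequences. Note that d depends on the order in which the observations are listed, and the ordering of teams used in the paper is not stated.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences (illustration only).
alternating = [1.0, -1.0, 1.0, -1.0]  # strong negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]       # strong positive autocorrelation
d_alt = durbin_watson(alternating)
d_const = durbin_watson(constant)
print(d_alt, d_const)
```

Values of d near 2, such as those reported above, lie between these two extremes and are consistent with no first-order autocorrelation.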
The estimated coefficients along with standard error, t-statistic and probability values for the explanatory variables of model (4) are given in Table 5.
Table 5. Estimated coefficients for model (4)
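Since the model has 89 degrees of freedom, the t-statistics are significant when greater than 1.98 in absolute value. As can be seen from Table 5, the t-statistics are significant for ln(SHPB), ln(SHSGB), ln(LPACC), ln(SHLP) and ln(CRPG). Using the coefficients from Table 5 we can write our estimation equation as:
$$\ln(\widehat{\mathrm{NPGPG}}_i) = \hat{\alpha}^{*} + \sum_{k=1}^{8} \hat{\beta}^{*}_{k}\,\ln(X_{ki}) \tag{4E}$$
or,
$$\widehat{\mathrm{NPGPG}}_i = e^{\hat{\alpha}^{*}} \prod_{k=1}^{8} X_{ki}^{\hat{\beta}^{*}_{k}} \tag{4E'}$$
where $i$ indexes the team, $i = 1, \dots, 98$, $X_{ki}$ is the value of the $k$th explanatory variable of model (4) for team $i$, and $\hat{\alpha}^{*}$ and $\hat{\beta}^{*}_{k}$ are the estimates reported in Table 5.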
Using equation (4E’) and the actual values of the explanatory variables, we estimated the non-penalty goals scored per game for each of the 98 teams and compared the estimates against the actual values of NPGPG. The comparison of actual NPGPG and estimated NPGPG for the top 14 teams (in terms of actual NPGPG) is given in Table 6.
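The back-transformation from the log model to a per-game estimate can be sketched as below. The coefficients, the zero intercept and the team values are invented so that the arithmetic is easy to check by hand; they are not the estimates reported in Table 5, and the function name is hypothetical.

```python
from math import exp, log


def predict_npgpg(intercept, coefficients, values):
    """Estimate NPGPG as exp(intercept + sum_k coefficient_k * ln(value_k))."""
    log_prediction = intercept + sum(
        coefficients[name] * log(values[name]) for name in coefficients
    )
    return exp(log_prediction)


# Illustrative numbers only: NOT the coefficients estimated in the paper.
coefficients = {"SHPB": 0.5, "CRPG": -0.25}
team_values = {"SHPB": 9.0, "CRPG": 16.0}

estimate = predict_npgpg(intercept=0.0, coefficients=coefficients, values=team_values)
actual = 1.6  # hypothetical observed NPGPG
difference = actual - estimate  # positive means the model underestimates
print(round(estimate, 4), round(difference, 4))
```

With these made-up numbers the estimate is 9^0.5 × 16^(–0.25) = 3 / 2 = 1.5, so the hypothetical team is underestimated by about 0.1.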
Table 6. Estimated NPGPG for 14 top scoring (per game) teams
The scatter plot of estimated NPGPG against actual NPGPG for all the 98 teams is shown in Fig. 4. We have marked the points for the top 14 teams in the scatter diagram. Our estimates almost perfectly matched the actual values for Barcelona, Dortmund, Napoli, Manchester City and Lyon among the top 14 teams, and for many other teams.
Among the top 14, we underestimated Paris Saint Germain, Juventus and Tottenham by a margin of less than 0.2. Atletico Madrid and Borussia M.Gladbach were underestimated by margins of less than 0.3. Bayern and Arsenal were overestimated, while Real Madrid and Roma were underestimated, by margins of more than 0.33. The margin for Bayern was just –0.335. Among all 98 teams we underestimated only 2 teams (Real Madrid and Roma) and overestimated only 3 teams (Arsenal, Sevilla and Bayern) by a margin of more than 0.33. For 93 teams our absolute margin of error was less than 0.33 and for 52 teams it was less than 0.1. Refer to Table A6 in the Appendix.
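The margin counts for the top 14 teams can be checked directly against the Difference column of Table A6. The sketch below copies those 14 rounded differences from the table; the helper function names are hypothetical.

```python
# Differences (actual minus estimated NPGPG) for the top 14 teams, from Table A6.
differences = {
    "Real Madrid": 0.74, "Barcelona": 0.01, "Paris Saint Germain": 0.15,
    "Borussia Dortmund": 0.03, "Roma": 0.60, "Bayern Munich": -0.34,
    "Napoli": -0.05, "Borussia M.Gladbach": 0.28, "Manchester City": 0.01,
    "Juventus": 0.19, "Tottenham": 0.12, "Lyon": -0.04,
    "Atletico Madrid": 0.29, "Arsenal": -0.45,
}


def outside_margin(diffs, margin):
    """Teams whose absolute estimation error exceeds the margin."""
    return sorted(team for team, d in diffs.items() if abs(d) > margin)


def within_margin(diffs, margin):
    """Teams whose absolute estimation error is below the margin."""
    return sorted(team for team, d in diffs.items() if abs(d) < margin)


large_errors = outside_margin(differences, 0.33)
close_fits = within_margin(differences, 0.1)
print(large_errors)
print(close_fits)
```

Among these 14 teams, four exceed the 0.33 margin and five fall within 0.1, matching the teams named in the text. Sevilla, the fifth team beyond 0.33, is outside the top 14.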

Fig. 4. Scatter diagram of estimated NPGPG against actual NPGPG.
In this paper we tried to identify the pitch actions (both technical and tactical) that significantly affect goal scoring. Regression models developed on observations from five leagues in Europe during the 2015-16 season show that the number of shots from the penalty box per game is the most important determinant of non-penalty goals per game. This result is supported by our log-linear regression model developed on the basis of observations for all 98 teams, as well as by the model developed on the basis of the observations for the 35 teams that scored an above-average number of non-penalty goals per game. From regression model (4) we conclude that increasing the share of shots from the goal box increases the number of goals. That means it is a better strategy to attempt shots from close range than from a distance.
We believe that coaches and managers may find the following result useful. The share of long passes in total passes and the number of crosses played per game adversely affect goal scoring, but the accuracy of long passes positively impacts it. Technical perfection in long passes, and in passes in general, is required, but strategically it is better to increase the number of short passes played per long pass. This is what Johan Cruyff and his spiritual disciples in football strategy, like Arsene Wenger or Pep Guardiola, have been saying for ages, and we have seen great teams like Ajax (1971-74), the Netherlands national team (1972-78), Barcelona (1992-94 and 2008 to present), Bayern Munich (2012 to present) and Arsenal (1997–2007) successfully employ the strategy. In the 2015-16 season we have seen teams like Barcelona, Bayern, Dortmund, Manchester City, Arsenal and Paris Saint Germain apply that strategy.
The number of crosses per game increases if a team tends to attack from wide positions. While it might be a good strategy to employ full backs to go on occasional overlaps, playing from wide positions reduces the goal scoring opportunity. When a team attacks from wide positions, the centre backs of the opposition get more time and can anticipate the crosses. This result contrasts with Mara et al. (2012), who showed that in the 2010-11 season of the W-League¹³, 24% of goals were scored from crosses. That might point to a significant difference between the women’s game and the men’s game.
Footnotes
Appendix
Table A6. Difference between actual and estimated NPGPG (all 98 teams)
| Sl | Team | Actual NPGPG | Estimated NPGPG | Difference (actual – estimated) |
| 1 | Real Madrid | 2.68 | 1.95 | 0.74 |
| 2 | Barcelona | 2.58 | 2.57 | 0.01 |
| 3 | Paris Saint Germain | 2.5 | 2.35 | 0.15 |
| 4 | Borussia Dortmund | 2.24 | 2.21 | 0.03 |
| 5 | Roma | 2.08 | 1.48 | 0.6 |
| 6 | Bayern Munich | 2.06 | 2.39 | –0.34 |
| 7 | Napoli | 1.87 | 1.92 | –0.05 |
| 8 | Borussia M.Gladbach | 1.79 | 1.51 | 0.28 |
| 9 | Manchester City | 1.74 | 1.73 | 0.01 |
| 10 | Juventus | 1.68 | 1.5 | 0.19 |
| 11 | Tottenham | 1.66 | 1.54 | 0.12 |
| 12 | Lyon | 1.66 | 1.69 | –0.04 |
| 13 | Atletico Madrid | 1.61 | 1.32 | 0.29 |
| 14 | Arsenal | 1.61 | 2.06 | –0.45 |
| 15 | West Ham | 1.58 | 1.33 | 0.25 |
| 16 | Liverpool | 1.58 | 1.48 | 0.1 |
| 17 | Leicester | 1.5 | 1.29 | 0.21 |
| 18 | Bayer Leverkusen | 1.44 | 1.5 | –0.06 |
| 19 | Athletic Club | 1.42 | 1.09 | 0.33 |
| 20 | Southampton | 1.42 | 1.36 | 0.06 |
| 21 | Everton | 1.37 | 1.32 | 0.05 |
| 22 | Chelsea | 1.34 | 1.51 | –0.16 |
| 23 | Mainz 05 | 1.32 | 1.17 | 0.15 |
| 24 | Rayo Vallecano | 1.32 | 1.15 | 0.16 |
| 25 | Nice | 1.32 | 1.17 | 0.14 |
| 26 | Fiorentina | 1.32 | 1.19 | 0.12 |
| 27 | VfB Stuttgart | 1.29 | 1.31 | –0.01 |
| 28 | Schalke 04 | 1.29 | 1.4 | –0.1 |
| 29 | Werder Bremen | 1.26 | 1.23 | 0.03 |
| 30 | Wolfsburg | 1.26 | 1.46 | –0.2 |
| 31 | Monaco | 1.26 | 1.2 | 0.07 |
| 32 | Bordeaux | 1.24 | 1.05 | 0.18 |
| 33 | Celta Vigo | 1.24 | 1.26 | –0.02 |
| 34 | Inter | 1.24 | 1.26 | –0.03 |
| 35 | Marseille | 1.21 | 1.16 | 0.05 |
| 36 | Rennes | 1.18 | 1.21 | –0.02 |
| 37 | Guingamp | 1.16 | 0.92 | 0.24 |
| 38 | Sassuolo | 1.16 | 0.98 | 0.17 |
| 39 | Montpellier | 1.16 | 0.99 | 0.17 |
| 40 | Sevilla | 1.16 | 1.51 | –0.35 |
| 41 | Real Sociedad | 1.13 | 1.14 | –0.01 |
| 42 | Manchester United | 1.13 | 1.16 | –0.03 |
| 43 | Lazio | 1.13 | 1.21 | –0.07 |
| 44 | AC Milan | 1.13 | 1.22 | –0.09 |
| 45 | Hertha Berlin | 1.12 | 1.18 | –0.06 |
| 46 | Sampdoria | 1.11 | 0.88 | 0.23 |
| 47 | Eibar | 1.11 | 0.96 | 0.15 |
| 48 | Sunderland | 1.08 | 0.95 | 0.13 |
| 49 | Reims | 1.08 | 1.03 | 0.05 |
| 50 | Darmstadt | 1.06 | 0.9 | 0.15 |
| 51 | Hoffenheim | 1.06 | 1.21 | –0.15 |
| 52 | Deportivo La Coruna | 1.05 | 0.99 | 0.07 |
| 53 | Genoa | 1.05 | 1.03 | 0.02 |
| 54 | Newcastle United | 1.05 | 1.04 | 0.02 |
| 55 | Bournemouth | 1.05 | 1.07 | –0.02 |
| 56 | Torino | 1.05 | 1.14 | –0.09 |
| 57 | FC Cologne | 1.03 | 1.15 | –0.12 |
| 58 | Toulouse | 1.03 | 0.92 | 0.1 |
| 59 | Villarreal | 1.03 | 1.01 | 0.02 |
| 60 | Lorient | 1.03 | 1.03 | –0.01 |
| 61 | Valencia | 1.03 | 1.09 | –0.06 |
| 62 | Granada | 1 | 0.87 | 0.13 |
| 63 | Sporting Gijon | 1 | 0.95 | 0.05 |
| 64 | Empoli | 1 | 0.97 | 0.03 |
| 65 | Hamburger SV | 1 | 1.07 | –0.07 |
| 66 | Las Palmas | 1 | 1.07 | –0.07 |
| 67 | Chievo | 0.97 | 0.91 | 0.06 |
| 68 | Norwich | 0.97 | 0.94 | 0.04 |
| 69 | Espanyol | 0.97 | 1.1 | –0.13 |
| 70 | Augsburg | 0.97 | 1.05 | –0.08 |
| 71 | Angers | 0.95 | 0.85 | 0.1 |
| 72 | Saint-Etienne | 0.95 | 0.9 | 0.05 |
| 73 | Palermo | 0.95 | 0.93 | 0.01 |
| 74 | Swansea | 0.95 | 1.04 | –0.1 |
| 75 | Getafe | 0.92 | 0.99 | –0.06 |
| 76 | Malaga | 0.92 | 0.99 | –0.07 |
| 77 | Stoke | 0.92 | 1.02 | –0.09 |
| 78 | Lille | 0.92 | 1.07 | –0.15 |
| 79 | Atalanta | 0.89 | 0.92 | –0.02 |
| 80 | Levante | 0.89 | 0.98 | –0.09 |
| 81 | Crystal Palace | 0.89 | 0.98 | –0.09 |
| 82 | Eintracht Frankfurt | 0.88 | 1.02 | –0.14 |
| 83 | GFC Ajaccio | 0.87 | 0.86 | 0.01 |
| 84 | Caen | 0.87 | 0.97 | –0.11 |
| 85 | SC Bastia | 0.84 | 0.62 | 0.22 |
| 86 | Udinese | 0.84 | 1.07 | –0.23 |
| 87 | Hannover 96 | 0.82 | 0.95 | –0.13 |
| 88 | Frosinone | 0.82 | 0.74 | 0.08 |
| 89 | Bologna | 0.82 | 0.77 | 0.05 |
| 90 | West Bromwich Albion | 0.82 | 0.85 | –0.03 |
| 91 | Watford | 0.82 | 0.88 | –0.07 |
| 92 | Real Betis | 0.82 | 0.92 | –0.1 |
| 93 | Nantes | 0.79 | 0.98 | –0.19 |
| 94 | Carpi | 0.74 | 0.86 | –0.12 |
| 95 | Verona | 0.74 | 0.96 | –0.22 |
| 96 | Ingolstadt | 0.71 | 0.84 | –0.13 |
| 97 | Troyes | 0.63 | 0.86 | –0.23 |
| 98 | Aston Villa | 0.58 | 0.86 | –0.28 |
1. Henceforth "football" means association football (soccer) in this paper.
2. Goals excluding those scored from penalty kicks.
5. Some of the regressors (explanatory variables) are collinear.
6. For 81 degrees of freedom, the critical value of t at the 5% level of significance is 1.98.
7. The variances of the residuals are not equal.
8. $JB = \frac{n}{6}\left[S^{2} + \frac{(K-3)^{2}}{4}\right]$, where n is the number of observations, S is skewness and K is kurtosis. The JB statistic asymptotically follows a χ² distribution with 2 degrees of freedom. If the residuals are normally distributed, JB is close to 0 and the probability value is high.
9. In the AIC formula, k is the number of regressors, n is the number of observations and $\hat{u}_i$ is the estimated residual for the ith observation. When multiple models are compared, the model with the lowest AIC is preferred.
10. In the SIC formula, n, k and $\hat{u}_i$ are as defined in footnote 9. The model with the lower SIC value is preferred.
11. Autocorrelation means the residuals for different teams are correlated. Logically there is no reason for the existence of autocorrelation in the present data. Autocorrelation can be ruled out if dL < d < (4 – dL). For 98 observations and 16 variables, dL = 1.203.
12. At 88 degrees of freedom the t-statistic is significant if it is greater than 1.98 in absolute value.
13. National women’s soccer league in Australia.
