Abstract
There is a growing body of research into how footballer attributes, collected before, during and after matches, may address the demands of clubs, media pundits and gaming developers. Focusing upon individual player performance analysis and prediction, we examined the body of research which considers different player attributes. This resulted in the selection of 132 relevant papers published between 1999 and 2020. From these we have compiled a comprehensive list of player attributes, categorising them as static, such as age and height, or dynamic, such as pass completions and shots on target. To indicate their accuracy, we classified each attribute as objectively or subjectively derived, and finally by their implied accessibility and their likely personal and club sensitivity. We assigned these attributes to 25 logical groups such as passing, tackling and player demographics. We analysed the relative research focus on each group and noted the analytical methods deployed, identifying which statistical or machine learning techniques were used. We reviewed the use of character trait attributes in the selected papers and discuss more formal approaches to their use. Based upon this we have made recommendations on how this work may be developed to support elite clubs in the consideration of transfer targets.
Introduction and motivation for study
There has been significant progress in the development of techniques to deliver more effective automated and intelligent analysis of footballer and team performance (de Sousa, 2011). The demands of broadcasters, media pundits, gaming developers and the clubs themselves to gather accurate and timely player attributes have continued to grow. In all cases the financial rewards which may result from the interpretation of these data are a very significant driver. For example, the annual transfer fee investments in the five major European championships (English Premier League, Spanish La Liga, German Bundesliga, Italian Serie A and French Ligue 1) increased by 429% to Euro 6,622M between 2010 and 2019 (Poli et al., 2019). In the gaming industry, FIFA 19 generated $786M in 2019 (Saed, 2020). For the gaming developers, continuing to improve the realism of their products is a key business driver. For the increasing number of broadcasters and pundits, the ability to present and discuss player and team activities and performances better than the competition is a major component of their ability to attract audiences and therefore maximise their subscriptions and advertising revenues. For example, in 2018/19 Sky TV’s global football revenues were Euro 28.9Bn (Deloitte, 2020). For the clubs themselves, the pursuit of all opportunities to improve the performance of individual players and the team as a whole is vital to their businesses. The combined revenues of clubs in the five major European championships are projected to grow by over 42% from Euro 11.3Bn in season 2013/14 to Euro 16.1Bn in 2020/21 (Deloitte, 2017; Deloitte, 2020). The pressure to identify successful transfer targets, at the right fee and consequent salary and bonus package, is a very significant issue for all clubs and particularly for the elite clubs facing seemingly unending price escalation.
There is considerable ongoing research into how player and team attributes, both static and dynamic during matches, may be collected automatically, for example using automated video data collection and analysis (Filetti et al., 2017). This is often supplemented by input from experts, usually ex-players (PA Sport, 2020), and in the case of SoFIFA by input from a community of 8,000 coaches, scouts and season ticket holders (SoFIFA, 2020).
Across a variety of player and match attributes and scenarios, statistical methods (Gelade & Hvattum, 2020) and, increasingly, artificial intelligence methods, in the main machine learning (Stanojevic & Gyarmati, 2016), have been deployed to draw conclusions and make useful predictions of individual and team performances.
We have, however, found very few examples of analyses including player character traits such as motivation, cognitive functions, self-control, sustained attention etc. This is in stark contrast to other industry recruitment activities where the calibration of such traits is considered critical. We suggest that the inclusion of an appropriate selection of such attributes presents the opportunity for a game-changing step forward in footballer analytics, in particular, in the selection of potential transfer targets.
Methods
Data collection
A systematic review of papers relevant to sporting analytics, with a specific focus on those addressing football (soccer), was conducted. No historical time limit was placed upon the papers considered, with over 1,500 initially selected papers falling within a timeframe of January 1999 to January 2021. All papers identifying footballer attributes, such as passing, tackling, assists etc., for review, analysis or predictive purposes, were curated. A focus upon eleven-a-side competitive professional football was maintained, and papers addressing the analyses of small-sided games (such as five-a-side games, training/practice games and video game matches) were excluded unless novel footballer attributes were identified. This resulted in a collection of 132 directly relevant papers (Table 12). With the aim of achieving a comprehensive review of relevant research, the identification of these papers included the review of relevant papers referenced by each, as well as those citing them, and where appropriate these were included for curation. In each case the publishing journal, conference or organization was noted. Additionally, where analyses were conducted, the analytical methods (statistical analysis, machine learning, mixed) were recorded. To determine whether the analyses used statistical or machine learning methods, we adopted the accepted definition that statistical models (e.g. ANOVA, chi-squared analysis, Spearman correlation test) are designed for inference and description of the relationships between variables, whereas machine learning models (e.g. decision trees, neural networks) are designed to make the most accurate predictions possible (Rajula et al., 2020).
Selected papers
For each paper, its main findings and conclusions were summarized (Table 14).
Selected papers’ main findings and conclusions
A comprehensive list of attributes was then compiled from all papers selected, resulting in a total of 2,537 attributes, including duplicates. Analyses were made to establish the frequency and predominance of the types of individual attributes addressed in the papers selected.
Where these papers exploited footballer attributes extracted from available football datasets such as SoFIFA (SoFIFA, 2020), Stats Perform (Stats Perform, 2020) etc., this was noted in order to develop a full list of available datasets (Table 1). In most cases these are freely available: however, where not the case this is noted.
Sources of player data
1 Number of player attributes extracted from selected papers.
These datasets assign values to the selected attributes and often apply their own formulae to create an overall score for each player as a measure of their rank compared to other players. For example, the SoFIFA dataset comprises 80 attributes for each of 18,944 international players. The SoFIFA overall score is calculated as the sum of each attribute value multiplied by a coefficient specific to the position of the individual player, added to a value representing the player’s international reputation (SoFIFA, 2020). As an example, the SoFIFA attributes, including the calculated overall value, are shown in Table 13, which lists the actual attribute values for Robert Lewandowski and Kevin De Bruyne. This table illustrates the diversity of the player attributes collected, ranging from age, weight, height and other demographic data to measures of technical skills such as shooting and passing, as well as mentality measures.
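The overall-score calculation described above can be sketched as a position-weighted sum plus a reputation term. This is a minimal illustration only: the function name, attribute names, coefficients and reputation value below are hypothetical placeholders and are not SoFIFA's actual values.

```python
def overall_score(attributes, position_weights, reputation_bonus=0.0):
    """Position-weighted sum of attribute values plus a reputation term.

    attributes: attribute name -> value for one player
    position_weights: attribute name -> coefficient for the player's position
    reputation_bonus: value representing international reputation
    """
    weighted = sum(position_weights[name] * attributes[name]
                   for name in position_weights)
    return weighted + reputation_bonus


# Hypothetical striker coefficients (illustrative only, not SoFIFA's).
STRIKER_WEIGHTS = {"finishing": 0.5, "positioning": 0.3, "ball_control": 0.2}

# Hypothetical player; attributes without a coefficient are ignored.
player = {"finishing": 90, "positioning": 80, "ball_control": 70, "tackling": 30}

# 0.5*90 + 0.3*80 + 0.2*70, plus a reputation value of 2
score = overall_score(player, STRIKER_WEIGHTS, reputation_bonus=2)
```

Because the coefficients depend on position, the same attribute values would yield a different overall score if the player were evaluated with, say, a defender's coefficients.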
SoFIFA player attributes illustrated by Robert Lewandowski and Kevin De Bruyne values
Each attribute was classified by data type (Wakelam et al., 2016), integrity, temporality, accessibility and sensitivity (Table 2).
Data classifications
2,3 Where the source paper(s) are unclear or conflicting in data type specification or data attribute, the authors have done their best to select the most appropriate.
4 Alternatively, such data may be given a subjective measure by club scouts/coaches/psychologists.
5 Where in any doubt in the identification of sensitivity of data items, the authors have selected the more sensitive definition.
Attributes were then allocated to 25 logical groups: Player data & history; Speed & movement; Pass; Goals, shots & shooting; Tackles; Aerial & header; Possession; Fouls & cards; Dribble; Free kick; Cross; Interception; Block; Duel; Clearance; Error, mistake, fail; Ball; Ball recovery; Assist; Offside; Injury; Outfielder position specific; Goalkeeper; Data applicable to any player; Character traits. Given the very wide variety of player attributes, it is possible to select these groups in a variety of different ways. For the purposes of this paper we have tried to align our selection with what appear from our research to be groups of interest to clubs and researchers, whilst keeping the groups as logical as possible. For example, while Free kicks may be considered as a component of Goals, shots and shooting, free kicks tend to be taken by so-called “free kick specialists” in teams and we therefore chose to allocate them to a group of their own. In the Player data and history group we have included the data that describe player demographics such as age and national origin, physical attributes such as height and BMI, statistical attributes such as games played and international caps, and those attributes which attempt to define the player such as their specific skills and strengths.
Where an attribute was allocatable to more than one group this was done. For example, ball recovery by tackle is relevant to both the Tackles and the Ball recovery groups, and running while in possession to both the Possession and the Speed and movement groups.
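The multi-group allocation can be illustrated with a short sketch in which each attribute maps to one or more groups and each allocation is counted once per group. The attribute names and allocations below are hypothetical examples, not the study's data.

```python
from collections import Counter

# Hypothetical allocations; an attribute may belong to more than one group.
allocations = {
    "ball recovery by tackle": ["Tackles", "Ball recovery"],
    "running while in possession": ["Possession", "Speed & movement"],
    "successful tackles": ["Tackles"],
    "pass completions": ["Pass"],
}

# Count every (attribute, group) allocation, so a shared attribute
# contributes to each of its groups.
group_counts = Counter(g for groups in allocations.values() for g in groups)

# Share of allocations per group; the denominator is the number of
# allocations, which can exceed the number of distinct attributes.
total = sum(group_counts.values())
group_share = {group: n / total for group, n in group_counts.items()}
```

One consequence of this design is that group totals sum to more than the number of distinct attributes whenever any attribute is shared between groups.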
Papers
The complete list of the 132 papers selected is provided in Table 12 and the main findings and conclusions of each paper are summarized in Table 14.
The papers are sourced from a wide range of publishers, 78 in total. Four sources together account for 29% of the total: the International Journal of Performance Analysis in Sport with 14 of the selected papers, the Journal of Sports Sciences with 11, the MIT Sloan Sports Analytics Conference proceedings with 8, and the Journal of Sports Analytics with 5. The next highest sources are Human Movement Science (4) and Cornell University Library’s arXiv (4), although we must note that arXiv is a pre-publication distribution service and open-access archive for scholarly articles, and its publications are not peer-reviewed. Sports Medicine, Perceptual and Motor Skills and PLOS ONE follow with 3 papers each, and the remaining sources contribute one or two papers each.
An analysis of the publication dates of the 132 relevant papers shows how the growth of research interest in the field of footballer analytics accelerated between 1999 and 2020 (Fig. 1). Nineteen of the selected papers were published in the 14 years from 1999 to 2012, an average of fewer than 1.5 per year, whereas 113 were published in the 8 years from 2013 to 2020, an average of just over 14 papers per year.

Number of relevant papers published between 1999 and 2020.
Where player attributes were analysed, either statistical, machine learning or a mixture of both techniques were applied (Table 3), with 117 of the 132 papers conducting some form of analysis, and over two thirds of these solely applying descriptive statistical techniques. The remaining 15 used combinations of machine learning and statistical techniques. Where machine learning was deployed, linear regression techniques were the most frequently used; however, as we might expect, a variety of other common ML techniques were also applied (Table 4). It should be noted that the counts in Table 4 reflect some papers deploying more than one technique. A noteworthy example is the combination of artificial neural networks, case-based reasoning systems and k-nearest neighbour algorithms in the paper “A study of Prediction models for football player valuations by quantifying statistical and economic attributes for the global transfer market” (Patnaik et al., 2019). Table 14 illustrates the very wide variety of research topics to which both statistical and machine learning techniques are applied.
Data analysis methods
Analysis of machine learning techniques
The resulting database comprised 2,537 extracted attributes, including those attributes duplicated across papers (noted to permit analyses of their frequency of use). Following the removal of duplicates, a master list of 1,518 attributes was produced for future analysis.
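The step from the full extracted list to a de-duplicated master list, while retaining frequency of use, can be sketched as follows. The normalisation rule (case and whitespace folding) and the sample attribute names are assumptions made for illustration; the source does not state how duplicates were identified.

```python
from collections import Counter


def normalise(name):
    """Fold case and collapse whitespace so trivially different spellings match.
    This matching rule is an assumption, not the authors' stated method."""
    return " ".join(name.lower().split())


def build_master_list(extracted):
    """Return (frequency counter, sorted de-duplicated master list)."""
    counts = Counter(normalise(name) for name in extracted)
    return counts, sorted(counts)


# Hypothetical raw extraction with duplicates across papers.
raw = ["Pass completions", "pass  completions", "Shots on target",
       "Age", "age", "Age"]

counts, master = build_master_list(raw)
most_used = counts.most_common(1)[0]  # attribute used by the most papers
```

Keeping the counter alongside the master list preserves the frequency-of-use information that would otherwise be lost when duplicates are removed.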
After allocation of attributes to each of the 25 selected groups, comparisons between the predominance of attributes in the different groups were calculated (Table 5).
Attribute groups
6 Where an attribute is appropriate to more than one group, it has been included in each.
Perhaps unsurprisingly, the groups Pass and Goals, shots & shooting comprised the two with the highest proportion of attributes analysed by researchers. These were very closely followed by Player data & history. This attribute group includes player demographic data and history, attributes such as age, international caps and playing position, and assessments of motivation, potential and specialties such as free kicks and playmaking.
Similarly, the group Outfielder position specific, which directly addresses attributes for each of defenders, attackers, midfielders etc., followed closely in terms of proportion of attributes collected, including attributes such as wide midfielder interceptions, forward successful aerial duels and central midfielder shots.
The next most analysed attributes are those measuring player speed and movement such as locations of play, speeds and percentages of times spent jogging/walking or running.
These first 5 of the 25 groups accounted for 60% of the attributes selected by researchers for collection and analysis.
Despite each being a critical part of success in matches, it is a little surprising that related attributes such as possession, dribbling, ball recovery, interceptions and blocking are not more highly placed in analyses; none of these were higher than 2% of the attributes analysed.
As football fans will recognize, while pundits, coaches and fans spend a great deal of time discussing players’ skills such as speed, passing vision, shooting and free kick taking, a great deal of emphasis also appears to be placed upon their character traits such as attitude, composure, influence and motivation. Given this, it is somewhat surprising that character traits account for only 3% of the attributes considered for analysis in the selected papers.
An analysis of attribute data types is presented in Table 6 below. More than four fifths (81%) of all player attributes are numeric, allowing analysis by a wide range of statistical and machine learning techniques and a further 7% are ordinal.
Attribute data types
7 Where an attribute is appropriate to more than one group, it has been included in each.
Of the remaining 12% nominal attributes, almost 30% (91 of 325) are player demographic attributes, such as name, team, position, dominant foot, in the Player data and history group. This is followed by 23% (74 of 325) and 13% (42 of 325) in the Goals, shots and shooting and Pass groups respectively.
Noting the data types present in the data set is essential, as not all machine learning techniques are suitable for combined numeric and nominal data, and while it is possible to encode the nominal data as numeric, this does not exploit the strengths of the technique. For example, in the case of k-nearest neighbours, the distance measurement needs to be adjusted to cope with a data set involving both continuous and nominal values. Decision trees, random forest and naïve Bayes techniques, however, are suitable for the analysis of mixed data.
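One way the k-nearest neighbours distance might be adjusted for mixed data is a Gower-style measure: a range-scaled absolute difference for numeric features and a simple match/mismatch for nominal ones. The sketch below is one possible approach under that assumption, not a method taken from the reviewed papers; the feature names, players and labels are hypothetical.

```python
from collections import Counter


def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance: range-scaled absolute difference for numeric
    features, 0/1 mismatch for nominal features, averaged over all features."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def knn_predict(query, samples, labels, numeric_ranges, k=3):
    """Return the majority label among the k samples nearest to the query."""
    ranked = sorted(
        range(len(samples)),
        key=lambda i: mixed_distance(query, samples[i], numeric_ranges),
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical players: numeric pass accuracy and age, nominal dominant foot.
samples = [
    {"pass_accuracy": 88, "age": 24, "foot": "left"},
    {"pass_accuracy": 85, "age": 27, "foot": "left"},
    {"pass_accuracy": 62, "age": 31, "foot": "right"},
    {"pass_accuracy": 60, "age": 33, "foot": "right"},
]
labels = ["midfielder", "midfielder", "defender", "defender"]
ranges = {"pass_accuracy": 28, "age": 9}  # max minus min over the samples

query = {"pass_accuracy": 86, "age": 25, "foot": "left"}
prediction = knn_predict(query, samples, labels, ranges, k=3)
```

Scaling each numeric feature by its range keeps every feature's contribution between 0 and 1, so a nominal mismatch is not swamped by a feature measured on a larger scale.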
For most attributes their measurement may be either quantitative or qualitative. For example, passing could be measured as the number of passes during a specified period or as the quality of passing (where quality could be defined on a Likert scale - poor, average, good, very good) or as a nominal value such as passing back (yes/no).
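The three ways of recording passing described above can be sketched as a single record holding a numeric, an ordinal and a nominal value. The field names and values are hypothetical; the four-point scale follows the Likert example in the text.

```python
from enum import IntEnum


class PassQuality(IntEnum):
    """Ordinal Likert scale: the order is meaningful, the spacing is not."""
    POOR = 1
    AVERAGE = 2
    GOOD = 3
    VERY_GOOD = 4


# One hypothetical player's passing, recorded in three ways.
passing = {
    "passes_completed": 47,            # numeric: a count over a match
    "pass_quality": PassQuality.GOOD,  # ordinal: a rank on the Likert scale
    "passed_back": True,               # nominal: a yes/no category
}

# Ordinal values support ranking comparisons.
is_above_average = passing["pass_quality"] > PassQuality.AVERAGE
```

Using an ordered enumeration keeps the ranking explicit while discouraging arithmetic, such as averaging, that would treat the scale points as equally spaced.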
With the exception of the Player data and history and the Character traits attribute groups, every group comprises at least 67% numeric attributes, and in total numeric and ordinal attributes comprise almost 90% of all attributes.
An analysis of attribute data accuracy is presented in Table 7 below. The majority (84%) of player attributes are objectively measured, i.e. are capable of unambiguous measurement, for example, the number of goals scored, the percentage of time running or jogging, the position of a player on the pitch at any given time. It is important to identify which attributes fall into this category as analyses based upon objective data are fundamentally more reliable.
Attribute accuracy
8 Where an attribute is appropriate to more than one group, it has been included in each.
However, that is not to say that subjective data are not valuable. For example, the assessment of a player’s potential is likely to remain most accurately assessed by the subject matter experts, in this case managers and coaches. Other subjective attributes include ball control skill and composure.
It is also important to note that in some of the collections of freely available attribute data (Table 1), elements of the data collection are delegated to selected fans attending matches. These data also have value, but they must be clearly identified as subjective, distinguished from the assessments of subject matter experts, and treated with care in any scientific analysis.
As we identified in the analysis of attribute data types, it is the Player data and history and the Character traits attribute groups that depend upon the highest numbers of subjective assessments, for example, self-confidence, motivation, playing style and degree of ball control. In the case of data accuracy we can add to this the attribute group Data applicable to any player. This group includes attributes such as ball control skill, effective/balanced defensive play and performance rating at a given position, all measurable subjectively. However, upon close inspection of individual attributes in the Player data and history and the Data applicable to any player groups, although they were treated as subjective in the source research papers, it is clear that many may be collected objectively. For example, pass accuracy can also be measured as the percentage of successful pass completions.
In the case of Character traits, although the majority (78%) have been identified as subjective, there is a significant body of scientific evidence supporting how a number of these may be more rigorously measured using cognitive psychometric testing. We discuss this later under the section Potential for exploitation of character trait attributes.
Very few player attributes derived from a mixture of objective and subjective data were identified. An example is Number of man of the match awards: although the number of awards is an objective value, each award is itself a subjective selection by a person or group of people.
An analysis of attribute data temporality is presented in Table 8 below.
Attribute temporality
9 Where an attribute is appropriate to more than one group, it has been included in each.
The majority of published research activity into footballer analytics focuses upon performance during matches, and this is reflected in the high proportion (83%) of player attributes categorised as dynamic. As we would therefore expect, these focus upon player activities such as assists, passes and duels. As with our data type and accuracy metrics, it is the attribute groups Character traits and Player data and history that have the fewest dynamic measurements.
It is important to note, however, that in a number of attribute groups we can see player attributes which although they may be viewed as a static statement of a player’s ability or performance, are also capable of change: these are therefore categorised as evolving static. For example, the quality of free kick taking or shooting accuracy are examples of capabilities which may be improved through practice and coaching on the training ground and match experience. Similarly in the group Player data and history, a player’s strength and fitness levels may be developed as part of their inter match training routines. Also, in the group Character traits, a player’s self-confidence and a selection of mentality traits are good examples of player attributes which may be developed.
An analysis of attribute data accessibility and sensitivity is presented in Table 9 below.
Attribute Accessibility and Sensitivity
Accessibility of player attributes alongside sensitivity (privacy/ethical) issues is critically important in all analysis activities.
In terms of accessibility, there is a considerable difference between those attributes which are readily accessible and measurable, such as the number of passes or shots and data which may only be collected through direct interaction and cooperation with the player, such as the level of family support.
A great deal of effort is being invested in the development of automated vision systems to recognise and count such metrics in real time, both for in-match punditry and for post-match analysis by clubs (Castellano et al., 2014). These systems rely upon accurate tracking of momentary position, speed and acceleration measures of players using stereo camera technology (Linke et al., 2020). For example, the application of appropriate computer vision techniques to extract trajectory data from match video input (Stein et al., 2017) allows the automatic collection of metrics such as pass distance, player movement and dominant regions of the pitch.
Of the 25 attribute groups, 23 comprise attributes which are readily available to anyone for collection and analysis. It is only in the Player data and history and the Character traits groups that we find attributes where player input/cooperation is required. Examples in the former group include attributes such as sleep patterns and parental/social support, which in total represent fewer than 20% of the attributes in this group. However, in the latter group, Character traits, the proportion of attributes where player input/cooperation is required is almost two thirds (65%). This high proportion is consistent with the potentially intrusive nature of character trait assessments, with their predisposition to psychometric testing.
We see a similar pattern in the assessment of attribute sensitivity in terms of privacy and ethical issues. It is only the Player data and history and Character traits groups where this is an issue. In respect of character traits, by their very nature it is appropriate to categorise all (100%) of these attributes as sensitive. Even where an individual player may be happy for publication of attributes such as game influence or decision making, where these have been rigorously measured as opposed to pundit opinions in the media, the club would likely consider these data commercially sensitive.
In respect of the group Player data and history, we see a clear split between sensitive (18%) and readily available attributes (72%); however, we have also categorised a modest number (10%) as potentially sensitive. These include attributes such as body type, provocation, hours of practice and market value. In each case these tend to be attributes where some assessments external to the player and club may be made. Nevertheless, ethical and privacy decisions made by the player and the club will take precedence in these and all cases of attribute accessibility and sensitivity.
Inclusion of character traits in the reviewed papers
As described above, very few occurrences of player character traits were identified (proportionally 3% of the total attributes collected). Of the 2,537 attributes identified from the selected papers, only 83 may be categorized as character traits, reducing to 72 after the removal of duplicates. In fact, only 3 of the 132 papers (2%) included a significant number (between 8 and 15) of such attributes in their analyses (Table 10).
Papers including character trait attributes
The lack of such attributes in the identified body of research is likely to be related to the perceived and actual difficulty of measuring them.
This is surprising given the importance assigned to such characteristics in other businesses. Furthermore, football fans seem to regard attributes such as tenacity, composure and determination very highly. Indeed, managers and coaches often refer to these characteristics when discussing individual players in media situations, as do commentators during matches and media pundits in their post-match analyses. Most important, however, is their potential role in the identification of suitable transfer targets.
It is worth noting that in other industries interviewing and psychometric testing is permissible prior to making recruitment decisions. This is not the case in professional football where in transfer considerations no approach to a player is permissible before clubs have agreed terms. Typically, club staff may only meet the player when the subsequent medical and personal terms negotiation is taking place.
It would appear that the development of in-roads into the inclusion of selected character traits in footballer analytics could provide a step change in the improvement of successful transfer selection for elite clubs.
In order for the use of a player’s attributes such as self-control, aggression or self-confidence to be useful for analytical or predictive purposes it is critical that some authenticity is given to their measurement.
There would appear to be two alternatives: either, the use of formal psychological testing methods based upon established research-based character trait theory; or, expert-based subjective scoring.
For the latter we may consider a scoring (for example, on a scale of 1 to 10) against each selected attribute, made by each of a psychologist and a club appointed football expert, for example the team coach. The combined, perhaps averaged, score would provide an ordinal value for the attribute. Over time, the measured feedback of results versus prediction scores may allow improvement of the efficacy of the process, however these would remain subjective data.
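The expert-based scoring described above can be sketched as two raters scoring the same traits on a 1 to 10 scale, with the scores averaged per trait. This is only an illustration of the averaging option mentioned in the text: the function name, the validation rules and all scores are hypothetical.

```python
from statistics import mean


def combined_scores(psychologist, coach, low=1, high=10):
    """Average two raters' scores per trait after checking trait sets and scale."""
    if psychologist.keys() != coach.keys():
        raise ValueError("both raters must score the same traits")
    combined = {}
    for trait in psychologist:
        pair = (psychologist[trait], coach[trait])
        if not all(low <= score <= high for score in pair):
            raise ValueError(f"score for {trait!r} is outside {low}-{high}")
        combined[trait] = mean(pair)
    return combined


# Hypothetical scores from a psychologist and a club-appointed coach.
psychologist = {"self-control": 7, "aggression": 4, "self-confidence": 8}
coach = {"self-control": 5, "aggression": 6, "self-confidence": 9}

scores = combined_scores(psychologist, coach)
```

Averaging hides disagreement between the raters: in this example the two raters differ by two points on self-control and on aggression, which a club might wish to record alongside the combined value.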
For the former method, in order to take advantage of the established body of psychological research, a suitable and more objective starting point may be to consider those categorisations already in use in the field of psychology. In particular it may then be feasible to exploit proven methods of character trait measurement. Previous research in this area includes several different categorisations of character/personality traits. For the purposes of this paper, we have included four respected categorisations for illustrative purposes.
Many personality psychologists believe that there are five basic dimensions of personality, often referred to as the “Big 5” personality traits (Digman, 1990). These are openness, conscientiousness, extraversion, agreeableness, and neuroticism, sometimes described by the acronym OCEAN, each of which is sub-dividable into on average five sub-traits.
Another approach is the “Alternative five model of personality” (Zuckerman, 1992), which focusses upon Neuroticism, Aggression, Impulsiveness, Sociability and Activity, each of which sub-divides into on average eight sub-traits.
The Eysenck Personality Questionnaire (Eysenck, 1975) focuses upon temperament, measuring Extraversion, Neuroticism, Psychoticism and Dissimulation (lying) tendencies. Each of these is sub-divided into nine further sub-traits.
Lastly, Cattell’s 16 Personality Factors (Cattell, 2008) includes Abstractedness, Apprehension, Dominance, Emotional stability, Liveliness, Openness to change, Perfectionism, Privateness, Reasoning, Rule-consciousness, Self-reliance, Sensitivity, Social boldness, Tension, Vigilance and Warmth.
An examination of the character trait attributes included for analysis in the selected papers (Table 11) indicates that many of these are potentially alignable with one or another of the above formal categorisations, in some cases with appropriate football specific interpretation.
Character traits used in selected papers
Note: The 72 character trait attributes tabulated correspond to a total of 83 identified from the selected papers, less duplicates.
We discuss potential next steps under Recommendations for future work.
A systematic review of the literature shows a steep increase in the number of studies involving football analytics research in the past seven years.
There appears to be scope for increasing and intensifying the application of machine learning analyses given that of the 103 papers conducting some form of analysis, 65% solely applied statistical techniques and only 21% applied ML techniques with the remaining 6% applying a mixture of both. Where machine learning was used, linear regression techniques were the most frequently deployed; however, as we might expect, a variety of other commonly used ML techniques were also used, for example neural networks, clustering, random forest, decision tree, k-nearest neighbour and support vector machines.
The sport of football allows the identification and measurement of a very large number of attributes. Over 1,500 different footballer attributes were curated from the selected papers.
However, of the 1,518, only 72 could be categorised as character traits. Experience from other industries indicates that analyses of footballers’ potential may benefit from consideration of these traits (Tett, 1991).
A significant majority of all attributes (81%) are numeric (measurable) and a further 7% ordinal, therefore lending themselves to rigorous analysis and predictive techniques. The remaining 12% nominal attributes were mainly in the character trait and player base data groups and may be analysed separately in the first instance by proven statistical and machine learning techniques.
The majority (84%) of all attributes were categorised as objective, similarly supporting more scientifically credible analyses.
As with the remaining subjective data, attribute accessibility and sensitivity issues were also entirely focused on the player data and history and the character trait groups.
Because of this it would be appropriate to treat these two groups with more care in future analyses.
In respect of attribute subjectivity, where analyses include attributes which are collected by fans it is important that the results of subsequent analysis and predictions are noted as such.
Clearly, the very large number of attributes (over 1,500) warrants examination in terms of their independence and usefulness. Although some papers have applied principal component analysis (PCA) methods to reduce dimensionality, there does not appear to be a comprehensive study available. Such a study may be able to reduce the attributes list for analysis and prediction purposes.
Recommendations for future work
It would be interesting to apply dimensionality reduction methods, for example principal component analysis, to the comprehensive attribute set, populated from freely available data. This research may allow the identification of a useful but reduced attribute set.
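As an indication of what such a study might involve, the sketch below finds the first principal component of a small attribute table using power iteration on the sample covariance matrix, in pure Python. It is a simplified illustration under stated assumptions: the data are hypothetical, the attributes are not standardised, and a real study would more likely use an established numerical library and retain several components.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, share of total variance) of the first
    principal component, found by power iteration on the covariance matrix.
    Assumes at least two rows and some non-zero variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Repeated multiplication converges to the dominant eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical players: passes attempted, passes completed, fouls committed.
rows = [[50, 45, 3], [60, 54, 1], [40, 36, 2], [70, 63, 2], [30, 27, 2]]
direction, share = first_principal_component(rows)
```

In this hypothetical table passes completed is an exact multiple of passes attempted, so a single component captures nearly all of the variance, which is the kind of redundancy a reduced attribute set would aim to remove.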
The comparative predictive accuracy of appropriately selected machine learning techniques, e.g. decision trees, neural networks, k nearest neighbors, random forest, etc. may be analysed, applied to the reduced attribute set.
The allocation of attributes to the selected groups would benefit from the input of club subject matter experts in order to better align groups, for example the Player data and history and the Outfielder position specific attribute groups. Similarly, club expert input into the selection of those character traits deemed critical to player selection would be beneficial.
The identification of an appropriate mapping of those character trait attributes identified in this paper to the traits defined within proven methods of character trait measurement may be of benefit, as may be the exploration of methods that involve using such data in the analysis of football transfer targets.
