Abstract

In judged sports, such as rhythmic gymnastics, figure skating, and baton twirling, inter-judge variability in scoring, that is, the degree to which judges differ in their scoring of the same athlete, is a common concern. This study quantitatively examined inter-judge variability using freestyle scores from the World Baton Twirling Championships held in 2018 and 2022. Data were collected from the preliminary rounds of senior and junior women's divisions, with scores assigned by seven judges. Welch's analyses of variance were performed to assess the effects of athlete ranking group (high, middle, low) on inter-judge variability across two scoring axes: technical merit (TM) and artistic expression (AE). These analyses were conducted separately for each competition year (2018 and 2022) and division (senior and junior). In the senior division in 2018, inter-judge variabilities for both TM and AE were significantly lower in the high-ranking group than in the other ranking groups. In the junior division in 2018, inter-judge variabilities for both TM and AE were significantly lower in the high-ranking group than in the low-ranking group. No significant effects of ranking group were observed in 2022. These findings are interpreted in terms of the interaction between the competitive structure of baton twirling and the cognitive processes involved in judging.

Keywords

aesthetic sports, artistic sports, cognitive process, judged sports, ranking

Introduction

In judged sports, such as rhythmic gymnastics, figure skating, and baton twirling, both the scoring system and judges’ evaluation processes are widely recognised as persistent concerns (Flessas et al., 2015; Huang and Foote, 2011; Osório, 2020; Premelč et al., 2019). With the expansion of international media coverage, what was once considered a matter confined to judges and athletes has evolved into a broader societal issue (Collins, 2010). Sports such as figure skating have introduced revised scoring systems to mitigate the impact of scoring bias on final rankings (Cheng and Gonzalez, 2022). The first author, with extensive experience as a member of judging panels, including as a judge's chair at international baton twirling competitions, has personally observed similar issues in baton twirling. Among the various challenges related to evaluation, the World Baton Twirling Federation (WBTF) has identified inter-judge variability, that is, the degree to which judges differ in their scoring of the same athlete, as a major concern. To address this issue, the WBTF implemented several countermeasures, including rule revisions and enhanced judge education programs (World Baton Twirling Federation, 2020). However, the numerical target for acceptable variability in these efforts has largely been determined through subjective considerations based on long-standing experience rather than objective data-driven metrics, which is a common tendency in human decision making (Ariely et al., 2003; Plessner and Haar, 2006). Accordingly, this study aimed to quantify inter-judge variability in scoring and examine one factor that may influence it, namely athlete ranking group. By providing an objective framework to assess scoring variability, these findings may contribute to the development of empirically grounded assessment protocols and training initiatives. Furthermore, a deeper understanding of the cognitive processes underlying judgments in judged sports may advance performance research in sports sciences.

In evaluations of baton twirling, field observations indicate that mid-ranked athletes’ scores fluctuate more than those of top-ranked athletes. These observations are consistent with previous research. For example, in studies on rhythmic gymnastics, scoring variability was particularly high for athletes competing at intermediate performance levels (Leandro et al., 2017). Similarly, research on wine contests has shown a high degree of variability in judges’ scoring of wines that are awarded gold medals, whereas lower-quality wines tend to receive more consistent evaluations (Hodgson, 2008). However, such variability-related behaviors have not yet been objectively verified among baton twirling judges. Such patterns may also be influenced by cognitive processes uniquely tied to the specific characteristics of competition (Carvalho et al., 2023; Findlay and Ste-Marie, 2004; Ste-Marie, 1999).

The purpose of this study was to quantitatively clarify the effect of ranking group on inter-judge score variability in international baton twirling competitions. In addition, on the basis of the obtained findings, the potential background factors underlying this variability are discussed from the perspectives of competition characteristics and cognitive processes.

Methods

Data acquisition

This study analysed the scores assigned by judges at the World Baton Twirling Championships organised by the WBTF. Data were acquired from publicly available live-streaming broadcasts of championships. The purpose and methodology of the study were explained to the WBTF president, and written consent was obtained for the use of anonymised data. The Institutional Review Board of Waseda University determined that ethical review was exempt (2025-HN003).

Data selection criteria

Freestyle scores from the WBTF World Baton Twirling Championships, held biennially, were used as the dataset for this study. Specifically, scores from the 2018 and 2022 championships were selected (the 2020 championship was cancelled due to the COVID-19 pandemic). These years were selected because both competitions employed the same scoring system and because the rule known as “Overall Degree of Excellence,” which constrains judges’ scoring ranges to within one point, was not in effect in either year, whereas it was applied in other years. The absence of this rule makes the data suitable for investigating inter-judge variability.

Among all the freestyle events, the preliminary rounds in the senior (Sr: 18 years or older, n = 64) and junior (Jr: under 18 years of age, n = 62) women's divisions were selected because of their larger sample sizes. The scores used in this study were the original and unadjusted judge scores recorded before any deductions or modifications were made. Both world championships were judged by qualified international judges who had been selected through a formal examination. The demographic characteristics of the judges are summarised in Table 1. The panel of judges differed by year and by division. The panel of judges numbered seven for each division in each year. Six judges served on more than one occasion; therefore, the total number of individual judges was 20.

Table 1.

Demographics of judges.

Year	2018			2022
Division	ID	Nationality	Experience (years)	ID	Nationality	Experience (years)
Junior	J1	Canada	27	J2	United States	25
	J2	United States	25	J15	France	24
	J3	Japan	24	J3	Japan	24
	J4	Switzerland	11	J11	Canada	11
	J5	Italy	8	J14	Italy	8
	J6	France	6	J16	Netherlands	3
	J7	Australia	1	J17	England	1
Senior	J8	Netherlands	26	J8	Netherlands	26
	J9	United States	23	J2	United States	25
	J10	United Kingdom	20	J3	Japan	24
	J11	Canada	11	J18	France	24
	J12	France	10	J19	Spain	19
	J13	Japan	9	J11	Canada	11
	J14	Italy	8	J20	Italy	8

Note: Years of experience are calculated as of 2022.

In baton twirling competitions, performance is evaluated using two scoring axes: technical merit (TM) and artistic expression (AE). TM assesses the variety and difficulty level of techniques incorporated into performance and determines the corresponding developmental stage, while AE evaluates the athletic artistry of a performance in relation to musical interpretation. Both categories employ a build-up scoring method with a maximum score of 10, recorded to one decimal place. For example, in TM evaluation, the seven judges may assign raw scores such as 7.8, 8.1, 8.0, 7.9, 8.2, 7.7, and 8.0 to a single performance; AE is evaluated in the same manner. According to the WBTF Judges Manual (World Baton Twirling Federation, 2020), the developmental stages are classified into the following discrete ranges: fair (0.00–2.00), average (2.01–4.59), good (4.60–7.00), excellent (7.01–9.00), and superior (9.01–10.00). Each stage is supplemented with performance examples that visually illustrate the corresponding evaluation standards.
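As an illustration only, the stage boundaries quoted above can be expressed as a small lookup. The following is a minimal sketch; the names `STAGES` and `developmental_stage` are hypothetical and do not come from the WBTF manual, and the sketch assumes each stage is identified by its inclusive upper bound.

```python
# Developmental stage ranges as quoted in the text from the WBTF Judges Manual.
# Each entry is (stage name, inclusive upper bound) on the 0-10 scale.
# STAGES and developmental_stage are hypothetical names used for illustration.
STAGES = [
    ("fair", 2.00),
    ("average", 4.59),
    ("good", 7.00),
    ("excellent", 9.00),
    ("superior", 10.00),
]


def developmental_stage(score: float) -> str:
    """Return the name of the stage whose range contains the score."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must lie between 0 and 10")
    for name, upper in STAGES:
        if score <= upper:
            return name
    raise AssertionError("unreachable: 10.00 is the last upper bound")


# The example TM scores from the text all fall into a single stage.
tm_scores = [7.8, 8.1, 8.0, 7.9, 8.2, 7.7, 8.0]
print([developmental_stage(s) for s in tm_scores])
```

Because scores are recorded to one decimal place, a score such as 4.6 falls into "good" and 4.5 into "average" under these boundaries.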

Statistical analysis

In this study, inter-judge variability, measured as the sample variance of the scores of seven judges for a single athlete, was used as the dependent variable. As a preliminary check prior to calculating the sample variance, the range of the seven judges’ scores for each athlete was examined. The range was defined as the difference between the maximum and minimum scores and served as a simple descriptive indicator of potential extreme dispersion in the raw data. For most performances, the score range was 2 points or lower, although larger ranges were generally observed in 2022 than in 2018, particularly in the Sr division, where several performances showed ranges greater than 2 points, with a maximum of 2.8 points. Importantly, the value of 2 points was not used as a formal criterion to define outliers but only as a descriptive reference to summarize the observed dispersion. Because larger ranges reflect lower inter-judge agreement, which is central to the research question of the present study, cases with large ranges were retained and not excluded from subsequent analyses.
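The two quantities described above can be computed for the example TM scores given earlier (7.8, 8.1, 8.0, 7.9, 8.2, 7.7, and 8.0). This is a minimal sketch; the function names are hypothetical, and the sample variance is assumed to use the usual n − 1 denominator.

```python
from statistics import variance


def inter_judge_variability(scores):
    """Sample variance (n - 1 denominator) of one athlete's judge scores."""
    return variance(scores)


def score_range(scores):
    """Difference between the highest and lowest judge score."""
    return max(scores) - min(scores)


# Example raw TM scores from seven judges for a single performance.
tm_scores = [7.8, 8.1, 8.0, 7.9, 8.2, 7.7, 8.0]

print(round(inter_judge_variability(tm_scores), 4))  # about 0.0295
print(round(score_range(tm_scores), 1))              # 0.5
```

In this example the range of 0.5 points is well below the descriptive 2-point reference mentioned above.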

The sample variance was log-transformed; outliers were addressed using the interquartile range (IQR) method; and one-way Welch ANOVAs were conducted separately by year (2018 and 2022), division (Sr and Jr), and scoring axes (TM and AE), with ranking group specified as the sole factor. The normality of the dependent variable was visually assessed using Q–Q plots (see Supplementary Material), which indicated no substantial deviations from normality. Post-hoc pairwise Games–Howell comparisons were conducted following a significant main effect.
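The pipeline described above (log transformation, IQR-based outlier handling, Welch's one-way ANOVA) can be sketched with the Python standard library. Several details are assumptions because the text does not specify them: the natural logarithm, fences at 1.5 × IQR, the "inclusive" quartile definition, and all function names. The sketch returns Welch's F statistic and its degrees of freedom only; a p-value additionally requires the F distribution from a statistics library.

```python
import math
from statistics import mean, quantiles, variance


def log_variance(scores):
    """Log of the sample variance of one athlete's judge scores.

    Assumes the natural log; undefined when all judges give the same score.
    """
    return math.log(variance(scores))


def drop_iqr_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]


def welch_anova(groups):
    """Welch's one-way ANOVA: returns (F, df1, df2) without a p-value."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Each group is weighted by n_i / s_i^2, so noisier groups count less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)
    grand = sum(w * m for w, m in zip(weights, means)) / total_w
    between = sum(w * (m - grand) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_w) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


# Made-up values standing in for log-variances in three ranking groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df1, df2 = welch_anova([drop_iqr_outliers(g) for g in groups])
print(round(f_stat, 3), df1, round(df2, 2))
```

The fractional denominator degrees of freedom reported in the Results, such as F(2, 17.64), arise from the `df2` term, which depends on the group variances and sizes rather than on the total sample size alone.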

The ranking group factor was categorised into three levels based on the competitor rankings: top third (high-ranking group), middle third (middle-ranking group), and bottom third (low-ranking group) (Table 2). In the actual competition, athletes’ final rankings are determined by discarding the highest and lowest scores among the seven judges and subtracting any applicable penalties. However, because the present study focuses on variability among the seven judges, research-specific rankings were determined by summing the mean TM score and the mean AE score across all seven judges, and athletes were then grouped based on these rankings. This classification approach is considered appropriate as it approximately corresponds to the advancement structure in the freestyle individual event at the WBTF World Baton Twirling Championships, where 20 athletes progress to semifinals and 10 advance to finals.
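The research-specific ranking and three-way grouping can be sketched as follows. The function names and the input layout are hypothetical. The rule for handling group sizes that do not divide evenly by three (extra athletes go to the higher groups first) is an assumption; it reproduces the group sizes listed in Table 2, but the text does not state it, and ties are not handled.

```python
from statistics import mean


def research_score(tm_scores, ae_scores):
    """Sum of the mean TM score and the mean AE score across all judges."""
    return mean(tm_scores) + mean(ae_scores)


def tertile_sizes(n):
    """Split n athletes into three groups; any remainder goes to higher groups."""
    base, rem = divmod(n, 3)
    return [base + (1 if i < rem else 0) for i in range(3)]


def assign_ranking_groups(athletes):
    """Map each athlete to 'high', 'middle', or 'low'.

    athletes: dict of name -> (list of TM scores, list of AE scores).
    """
    ranked = sorted(athletes, key=lambda a: research_score(*athletes[a]), reverse=True)
    groups = {}
    start = 0
    for label, size in zip(("high", "middle", "low"), tertile_sizes(len(ranked))):
        for name in ranked[start:start + size]:
            groups[name] = label
        start += size
    return groups


# Made-up athletes with three judges each, for illustration only.
example = {
    "A": ([8.0, 8.2, 8.1], [8.3, 8.1, 8.2]),
    "B": ([6.0, 6.4, 6.2], [6.1, 6.3, 6.2]),
    "C": ([4.0, 4.5, 4.2], [4.1, 4.4, 4.3]),
}
print(assign_ranking_groups(example))
print(tertile_sizes(35), tertile_sizes(31))
```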

Table 2.

Number of baton twirlers in the high-, middle-, and low-rank groups within each division for each year. Numbers in parentheses represent the score ranges.

Ranking group	Year and division
	2018		2022
	Junior (No. of judges: 7)	Senior (No. of judges: 7)	Junior (No. of judges: 7)	Senior (No. of judges: 7)
High	12 (TM: 8.39–6.59; AE: 8.50–6.63)	11 (TM: 9.30–7.09; AE: 9.34–7.10)	9 (TM: 8.61–6.23; AE: 8.67–6.29)	11 (TM: 9.01–6.89; AE: 9.16–6.90)
Middle	12 (TM: 6.50–5.17; AE: 6.56–5.23)	11 (TM: 7.00–5.43; AE: 7.06–5.43)	9 (TM: 6.23–4.54; AE: 6.19–4.54)	10 (TM: 6.77–5.91; AE: 6.81–6.07)
Low	11 (TM: 5.10–3.24; AE: 5.20–3.24)	11 (TM: 5.27–3.73; AE: 5.27–3.74)	9 (TM: 4.51–3.50; AE: 4.49–3.49)	10 (TM: 5.80–3.96; AE: 5.87–3.89)

Results

Figure 1 presents the inter-judge variance scores, which were log-transformed to reduce skewness, categorized by ranking group in the Sr group for 2018 and 2022. Prior to statistical analysis, outliers were identified and removed using the IQR method. For TM, a one-way Welch ANOVA revealed a significant main effect of ranking group in 2018 (F(2, 17.64) = 9.365, p = 0.0017, ηp² = 0.473), whereas no significant main effect was observed in 2022 (F(2, 16.56) = 0.949, p = 0.407, ηp² = 0.050) (Figure 1(a) and (b)). For AE, a significant main effect was also found in 2018 (F(2, 16.37) = 14.74, p = 0.0002, ηp² = 0.512), but not in 2022 (F(2, 17.51) = 0.510, p = 0.609, ηp² = 0.030) (Figure 1(c) and (d)).

Figure 1.

Log-transformed inter-judge variance of scores by ranking group in the Sr group, after IQR-based outlier removal. TM variance in 2018 (a) and 2022 (b). AE variance in 2018 (c) and 2022 (d). L, low; M, middle; H, high.

For conditions showing a significant main effect in the Sr group, post-hoc pairwise comparisons were conducted using the Games–Howell test. The results for TM are summarized in Table 3, revealing significant differences between the Low- and High-ranking groups and between the Middle- and High-ranking groups, whereas no significant difference was observed between the Low- and Middle-ranking groups. The results for AE are summarized in Table 4, revealing significant differences between the Low- and High-ranking groups and between the Middle- and High-ranking groups, whereas no significant difference was observed between the Low- and Middle-ranking groups.

Table 3.

Results of post-hoc pairwise Games–Howell comparisons for TM in the Sr group in 2018.

Comparison	Mean difference	95% CI	Effect size	DoF	t-value	p-value
Low–High	1.153	(0.467, 1.840)	1.454	16.486	3.552	0.007
Low–Middle	−0.170	(−0.611, 0.272)	−0.346	16.341	−0.813	0.700
Middle–High	1.323	(0.677, 1.969)	1.807	13.595	4.405	0.002

Table 4.

Results of post-hoc pairwise Games–Howell comparisons for AE in the Sr group in 2018.

Comparison	Mean difference	95% CI	Effect size	DoF	t-value	p-value
Low–High	1.187	(0.433, 1.941)	1.361	17.314	3.318	0.011
Low–Middle	−0.472	(−0.937, −0.007)	−0.883	13.934	−2.178	0.110
Middle–High	1.659	(0.978, 2.340)	2.138	11.789	5.318	0.001

Figure 2 shows boxplots of the results for the Jr group, after IQR-based outlier removal. For the Jr group, a one-way Welch ANOVA revealed a significant main effect of ranking group on TM in 2018 (F(2, 19.74) = 6.084, p = 0.0087, ηp² = 0.274), whereas no significant main effect was observed in 2022 (F(2, 15.73) = 0.305, p = 0.741, ηp² = 0.020) (Figure 2(a) and (b)). For AE, a significant main effect was also found in 2018 (F(2, 19.83) = 5.013, p = 0.017, ηp² = 0.239), but not in 2022 (F(2, 14.28) = 0.070, p = 0.932, ηp² = 0.0084) (Figure 2(c) and (d)).

Figure 2.

Log-transformed inter-judge variance of scores by ranking group in the Jr group, after IQR-based outlier removal. TM variance in 2018 (a) and 2022 (b). AE variance in 2018 (c) and 2022 (d). L, low; M, middle; H, high.

For conditions showing a significant main effect in the Jr group, post-hoc pairwise comparisons were conducted using the Games–Howell test. The results for TM are summarized in Table 5, revealing a significant difference between the Low- and High-ranking groups, whereas no significant differences were observed for the other pairwise comparisons. The results for AE are summarized in Table 6, likewise revealing a significant difference between the Low- and High-ranking groups, whereas no significant differences were observed for the other pairwise comparisons.

Table 5.

Results of post-hoc pairwise Games–Howell comparisons for TM in the Jr group in 2018.

Comparison	Mean difference	95% CI	Effect size	DoF	t-value	p-value
Low–High	1.102	(0.364, 1.840)	1.225	15.305	3.178	0.016
Low–Middle	0.499	(0.073, 0.926)	0.999	19.777	2.445	0.060
Middle–High	0.603	(−0.134, 1.340)	0.686	15.404	1.740	0.222

Table 6.

Results of post-hoc pairwise Games–Howell comparisons for AE in the Jr group in 2018.

Comparison	Mean difference	95% CI	Effect size	DoF	t-value	p-value
Low–High	1.057	(0.316, 1.798)	1.169	15.105	3.039	0.021
Low–Middle	0.423	(−0.077, 0.923)	0.702	19.258	1.767	0.207
Middle–High	0.634	(−0.146, 1.414)	0.673	18.130	1.707	0.229
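For readers unfamiliar with the Games–Howell procedure, the mean difference, t-value, and DoF columns in Tables 3 to 6 are of the kind produced by the following minimal sketch, which uses the unpooled standard error and the Welch–Satterthwaite degrees of freedom. The function name is hypothetical and the data are made up. The p-value, which Games–Howell obtains from the studentized range distribution, and the effect size, whose definition is not stated in the text, are omitted.

```python
import math
from statistics import mean, variance


def games_howell_pair(a, b):
    """Return (mean difference, t-value, degrees of freedom) for one pair."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each group mean, without pooling variances.
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    diff = mean(a) - mean(b)
    t_value = diff / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; usually not an integer.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return diff, t_value, dof


# Made-up log-variances for two ranking groups.
low = [1.0, 2.0, 3.0]
high = [2.0, 3.0, 4.0]
diff, t_value, dof = games_howell_pair(low, high)
print(round(diff, 3), round(t_value, 3), round(dof, 3))
```

Because the degrees of freedom depend on each group's variance and size, they differ from one comparison to the next, as the DoF columns show.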

Discussion

This study aimed to clarify the effect of ranking group on inter-judge score variability in international baton twirling competitions. Inter-judge variability was quantified using the sample variance of judges’ scores, and one-way Welch ANOVA was applied to examine differences across ranking groups. The analyses were conducted separately by year, division, and scoring axis. This discussion addresses the significant main effects observed and offers implications for judging systems.

Similarity between TM and AE

As shown in Figures 1 and 2, TM and AE exhibit similar overall patterns. This similarity may reflect certain characteristics of baton twirling competitions. In baton twirling, TM and AE scores tend to show similar trends because AE is generally dependent on TM, as TM provides the technical foundation for AE. When technique and skills are insufficiently developed, it may become difficult to fully express artistic intentions, no matter how strong those intentions may be. In contrast, TM evaluates the accuracy of baton and body manipulation (i.e., whether skills are successfully executed), as well as the difficulty of and variation in techniques. These evaluations can be made without information related to musical interpretation or narrative progression; for example, scores can be determined even if the athlete performs with a neutral facial expression. Accordingly, the observed similarity between TM and AE scores in the present study may be explained by AE's considerable dependence on TM.

Effect of ranking group

In the Sr division in 2018, inter-judge variabilities for both TM and AE were significantly lower in the high-ranking group than in the other ranking groups. In the Jr division in 2018, inter-judge variabilities for both TM and AE were significantly lower in the high-ranking group than in the low-ranking group. These results suggest that different cognitive processes may be involved in evaluating athletes across ranking groups. This possibility is explored by considering the judges’ introspective experiences in light of cognitive models.

The lower inter-judge variability observed for the high-ranking group aligns with the judges’ introspective reports during evaluation: scoring for high-ranking athletes tends to be decisively clear, whereas scoring for middle-ranking athletes often involves hesitation. This hesitation stems from the difficulty of determining the most appropriate developmental stage corresponding to an athlete's techniques and skills. The first author, who served as a judge's chair at international competitions organised by the WBTF, observed this introspective pattern among judges. For instance, in TM evaluation, when assessing the combined execution of a thumb toss and a grand jeté, judges must classify 12 evaluation items that capture multiple dimensions of the technique into five developmental stages and further categorize each stage into upper, middle, and lower subgroups. A similar process applies to AE, which consists of seven evaluation items divided into five developmental stages, each further stratified into three subranges. High-ranking athletes consistently performed at an advanced level across all evaluation items, enabling the judges to rely on intuitive judgments. In contrast, middle-ranking athletes exhibited greater variability in evaluation combinations, making instantaneous judgments more cognitively demanding. This introspective account by judges is consistent with previous research on rhythmic gymnastics, in which scoring variability was particularly high for athletes competing at intermediate performance levels (Leandro et al., 2017). Additionally, findings from clinical examinations indicate that examiner bias tends to be more pronounced for borderline candidates than for those who clearly pass or fail (Shulruf et al., 2018). These parallels suggest that an increased cognitive load contributes to higher inter-judge variability when evaluating athletes in the middle performance range.

The cognitive processes observed in the evaluation of the high- and middle-ranking groups may reflect two distinct reasoning systems: a heuristic system and an analytic system (De Neys, 2006, 2022; Kahneman, 2011). The heuristic system is an automatic and rapid process that enables intuitive evaluation, whereas the analytic system is a slower and resource-demanding process. Although both systems operate simultaneously to some extent (De Neys, 2022), judges’ introspective reports suggest that evaluations of high-ranking athletes rely primarily on the heuristic system, whereas evaluations of middle-ranking athletes rely on the analytic system. As these cognitive processes can be verified experimentally (De Neys, 2006), the findings of this study should be validated in future research.

Furthermore, if this explanation were correct, one would expect the judges’ scores for the low-ranking group to converge similarly to those of the high-ranking group. However, this pattern was not observed in the present study. This discrepancy may be attributed to methodological factors, such as the criteria used to define ranking groups or the inherently high skill level of competitors in international championships. These factors also warrant further investigation in future studies.

Potential effect of differences in judge preparation

In the present study, significant main effects were observed only in the 2018 championships, in both the Sr and Jr divisions, whereas no significant effects were found in 2022. This pattern may be related to differences in the extent of judge training between the two competition years. Under normal circumstances, judge training is typically conducted systematically over several months prior to the championships, allowing for a shared understanding of how evaluation criteria are interpreted and applied. However, due to the COVID-19 pandemic, judge training for the 2022 championships became more limited. When such training is limited, the consistency of judges’ evaluations may be reduced, and individual interpretations may play a larger role in scoring. Therefore, inter-judge variability may become more diffusely distributed, rather than being associated with specific ranking groups. Under these conditions, systematic differences in variability between ranking groups may be attenuated and, therefore, less likely to be detected as statistically significant effects. Accordingly, the absence of significant ranking-group effects in the 2022 championships does not necessarily indicate that such effects were absent; rather, it may reflect a situation in which variability in the evaluation process increased and became less structured, making these effects more difficult to detect. However, other contextual factors, such as changes in athletes or competition venues, may also have influenced the results.

Implications for judging systems

The WBTF has made concerted efforts to reduce inter-judge score variability; however, the underlying evaluation criteria have traditionally been based on subjective indicators. In this context, the present findings—such as the similarity between TM and AE scores and the smaller inter-judge variability observed in higher-ranking groups, particularly in 2018—provide objective patterns that align with judges’ practical understanding of the sport and illustrate the potential value of integrating objective indicators into the judging process. The present findings therefore serve as a useful resource for fostering constructive communication among judges from diverse linguistic and cultural backgrounds with the aim of minimizing inter-judge variability.

Supplemental Material

sj-docx-1-san-10.1177_22150218261436207 - Supplemental material for Inter-judge variability in scoring at the world baton twirling championships

Supplemental material, sj-docx-1-san-10.1177_22150218261436207 for Inter-judge variability in scoring at the world baton twirling championships by Tomoko Natsuda, Yuto Kurihara, Takahide Etani, Taketoshi Sugisawa and Akito Miura in Journal of Sports Analytics

Footnotes

Acknowledgements

ChatGPT (versions 4.5 and 5.2; OpenAI) was used to improve the readability of the English text. The text was subsequently reviewed and the authors assume full responsibility for the final content.

ORCID iDs

Tomoko Natsuda

Yuto Kurihara

Takahide Etani

Taketoshi Sugisawa

Akito Miura

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by JSPS KAKENHI (Grant Number 25H01237).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The judges’ scoring data are available at the following repository: .

Supplemental material

Supplemental material for this article is available online.

References

Ariely, Loewenstein and Prelec (2003) “Coherent arbitrariness”: Stable demand curves without stable preferences. The Quarterly Journal of Economics 118(1): 73–106.

Carvalho, Esteves, Nunes, et al. (2023) The assessment of the match performance of association football referees: Identification of key variables. PLOS One 18(9): e0291917.

Cheng and Gonzalez Jr (2022) Technical and program component scores frozen together: Difficulty bias and outcome prediction in international figure skating. Maths and Sports 4(1). https://doi.org/10.5149/ms.1220

Collins (2010) The philosophy of umpiring and the introduction of decision-aid technology. Journal of the Philosophy of Sport 37(2): 135–146.

De Neys (2006) Dual processing in reasoning: Two systems but one reasoner. Psychological Science 17(5): 428–433.

De Neys (2022) Advancing theorizing about fast-and-slow thinking. Behavioral and Brain Sciences 46: e111.

Findlay and Ste-Marie (2004) A reputation bias in figure skating judging. Journal of Sport and Exercise Psychology 26(1): 154–166.

Flessas, Mylonas, Panagiotaropoulou, et al. (2015) Judging the judges’ performance in rhythmic gymnastics. Medicine and Science in Sports and Exercise 47(3): 640–648.

Hodgson (2008) An examination of judge reliability at a major U.S. wine competition. Journal of Wine Economics 3(2): 105–113.

Huang and Foote (2011) Using generalizability theory to examine scoring reliability and variability of judging panels in skating competitions. Journal of Quantitative Analysis in Sports 7(3). https://doi.org/10.2202/1559-0410.1241

Kahneman (2011) Thinking, fast and slow. New York, NY, USA: Farrar, Straus and Giroux.

Leandro, Ávila-Carvalho, Sierra-Palmeiro, et al. (2017) Judging in rhythmic gymnastics at different levels of performance. Journal of Human Kinetics 60: 159–165.

Osório (2020) Performance evaluation: Subjectivity, bias and judgment style in sport. Group Decision and Negotiation 29(4): 655–678.

Plessner and Haar (2006) Sports performance judgments from a social cognitive perspective. Psychology of Sport and Exercise 7(6): 555–575.

Premelč, Vučković, James, et al. (2019) Reliability of judging in DanceSport. Frontiers in Psychology 10: 1001.

Shulruf, Adelstein, Damodaran, et al. (2018) Borderline grades in high stakes clinical examinations: Resolving examiner uncertainty. BMC Medical Education 18(1): 272.

Ste-Marie (1999) Expert–novice differences in gymnastic judging: An information-processing perspective. Applied Cognitive Psychology 13(3): 269–281.

World Baton Twirling Federation (2020) Official WBTF judges’ manual.