Abstract
There is a lack of data-driven training instructions for sports shooters, as instruction has commonly been based on subjective assessments. Many studies have correlated body posture and balance to shooting performance in rifle shooting tasks, but have mostly focused on single aspects of postural control. This study has focused on finding relevant rifle shooting factors by examining the entire body over sequences of time. A data collection was performed with 13 human participants carrying out live rifle shooting scenarios while being recorded with multiple body tracking sensors. A pre-processing pipeline produced a novel skeleton sequence representation, which was used to train a transformer model. The predictions from this model could be explained on a per-sample basis using the attention mechanism, and visualised in an interactive format for humans to interpret. It was possible to separate the different phases of a shooting scenario from body posture with a high classification accuracy (80%). Shooting performance could be detected to an extent by separating participants using their strong and weak shooting hand. The dataset and pre-processing pipeline, as well as the techniques for generating explainable predictions presented in this study, have laid the groundwork for future research in the sports shooting domain.
Introduction
There are many factors affecting a shooter’s performance in rifle shooting. Important factors are the time spent in the aiming process, weapon movement before triggering, and body postural sway as a result of poor balance [1]. Because the eyes are focused on the target at the moment of shooting, they cannot be used to control postural stability [2]. Several studies correlate poor posture control with poor shooting results in rifle shooting [3, 4, 5, 6]. These studies mainly define posture control as body sway calculated from the force exerted by each foot on the ground, measured with force plates. Because only balance in the legs is measured, force plate sensors may not capture the full picture of the entire body’s postural control. Force plates also restrict the subject to a stationary position. It is therefore worth examining other approaches that could give a more comprehensive understanding of how postural stability affects shooting performance.
Sports shooting can be broadly divided into two domains: static shooting and dynamic shooting. This study focuses on dynamic shooting, which has a higher level of complexity due to the movements of both the shooter and the targets. Both static and dynamic rifle shooting training is commonly conducted with one or multiple practitioners being instructed by a supervisor. For novices, training mostly focuses on striking static targets at fixed distances, where a supervisor attends to factors such as stability, aiming, control, and movement [7]. Different shooting instructors can give contradicting feedback due to personal biases in the interpretation of data [8]. Consequently, there is a need for objective and consistent data-driven feedback based on statistical analysis of data from real shooting scenarios.
Bio-mechanic pose estimation has been researched for almost half a century, often by using physical markers placed on human participants to build 3D body representations from visual sensors [9]. New techniques developed during the last decade allow for the tracking of joints by using image processing and machine learning. These technologies can build accurate 3D depth maps, producing so-called skeleton representations of the tracked human body.
As a result of these accurate low-cost technologies, the use of skeleton data together with machine learning is being studied more actively, especially in the area of action recognition [11, 12, 13, 14, 15]. Skeleton data has also been used successfully to measure motor functions [16] and postural stability [17, 18, 19], as well as, to some extent, to assess skill level in sports, e.g. handball [20]. There is a scientific consensus that postural balance is an important factor for rifle shooting performance [3, 4, 21, 6, 2]. Therefore, it is interesting to examine whether there are other postural factors besides postural sway that affect shooting performance, as such knowledge could assist in the training of novice shooters.
Aim and scope
There is a research gap in machine learning approaches for skeleton data that explain their reasoning, as well as in using machine learning for decision support within the sports shooting domain. This paper presents a fully encompassing workflow ranging from data collection in live shooting tasks, to sensor merging, data processing, and the construction of explainable machine learning models.
It is expected that, as a result of the proposed work, an increased understanding is gained of how to model experiments for shooting scenarios that use multiple body tracking sensors. The study will identify the possibilities and limitations of the gathered skeleton data, as well as give an initial insight into how skeleton data from live shooting scenarios can be modelled to generate relevant explainable factors.
Related work
Postural stability and shooting performance
Several studies have used force plate sensors to measure posture control in shooting tasks. However, these force plates are prohibitively expensive and can be impractical in situations outside controlled research studies and clinical environments [23]. The studies that have examined body tracking in shooting tasks have taken a narrow approach, mainly focusing on correlating shooting performance to mean sway velocity calculated from joint angles [5]. Skilled and experienced shooters have generally shown better postural stability than novices [2, 4, 3]. Gun barrel stability has also been correlated to postural stability in several studies [5, 4, 3]. Although some studies have correlated posture to shooting performance between individuals [4], other studies have only shown intra-individual correlations [6]. These conflicting results, as well as a general lack of studies examining the effects of the posture and movements of the entire body on shooting performance, motivate further study.
Kinect and body tracking in scientific research
The Azure Kinect sensor [10] is a viable low-cost alternative to classical marker-based motion capture systems [19] as well as force plates [18] for measuring balance. Compared to motion capture suits, the Kinect has performed moderately to excellently depending on the tasks. The upside is that the participants are not restricted by wearing a motion capture suit. However, poorer tracking of the feet and ankles has been observed [16, 19]. The Kinect sensor holds up well as a balance measurement when compared to force plate sensors [18, 19], sometimes even outperforming them, with the upside of not limiting subjects to a stationary position. Downsides are limitations in accurately representing anterior-posterior swaying movements, overestimating large swaying movements, and having a framerate that is limiting for certain tasks [24].
Despite an exhaustive literature search, very few studies were found that use skeleton data to assess skill levels in humans. Most such studies have focused on isolated aspects of the body, such as angles between key joints in handball throws [20], or the take-off velocity of jumping motions [24].
Because of occluded joints forcing the Kinect sensor to estimate positions, Gao et al. [25] found that using two Kinect sensors placed in front of the target at an angle led to higher body tracking accuracy. Núñez et al. [26] proposed data augmentations on the skeleton data by shifting the graph slightly, because limited amounts of data would cause generalisability issues for machine learning models. Vemulapalli et al. [27] suggested that model performance could be improved by making the skeletons invariant to the absolute locations of subjects. They did this by transforming the coordinates to a body-centric view with the hip center as the origin, and the x-axis running parallel to the hip.
Action classification
Action classification tasks use time-series skeleton data, often with deep learning techniques, to classify human actions such as jumping, running, throwing etc. This section provides a brief chronological survey of the area.
Du et al. [11] used a hierarchically bi-directional RNN approach, and found that it was beneficial to divide the skeleton into five body parts (arms, legs, torso) before fusing them iteratively into one feature representation. Liu et al. [28] achieved good results with Long Short Term Memory (LSTM) and hierarchical tree traversal techniques, in large part due to the LSTM’s inherent strengths in discarding irrelevant information such as weak sensor data. Adding the attention mechanism to this approach improved the LSTM’s global context memory and increased predictive performance [29]. Attention has also been able to locate the most discriminative time frames across longer sequences of skeletons, e.g. being able to identify the point at which the arm begins to stretch in a punching motion [30].
Ke et al. [12] used convolutional neural networks (CNNs) to classify actions with skeleton data, arguing that they could remember long time sequences better than LSTMs. They restructured skeleton sequences into three cylindrical coordinate channels, each composed of four frames. Convolutions on graphs have also been used to automatically extract feature maps from joints connected spatially between each other, as well as temporally through time [15]. Plizzari et al. [31] combined a high level local feature extracting Graph Convolutional Network (GCN) with transformers and self-attention, allowing for explanations of what features were mutually important for predictions, as well as alleviating the long temporal relation issues of LSTMs [22]. Similarly to Song et al. [30], they separated the temporal and spatial skeleton graphs into separate streams, which allowed the model to attend both to the important time frames and the joints with the most discriminative power. Transformers are composed solely of attention layers, and have an added benefit compared to other models of being able to explain their reasoning using the attention matrices produced from each attention layer [22].
Method
Experiment
The research question will be answered with the help of an experiment relying on data gathered from several human participants. The participants will perform a rifle shooting task, as described in the data collection section. The experiment is a quasi-experiment, meaning that there is no random assignment, and that any discovered relations in the data may not be true cause and effect relations [32]. The experiment is controlled in the sense that there is full control over the manipulation of the data, and as many iterations as needed can be performed. The dependent variables are the various shooting poses of participants as well as which orientation (strong or weak hand) the participants use. The independent variables are the sequential body tracking data produced by the Azure Kinect sensors.
Scenario illustration. Sequential illustration of the shooting scenario for the data collection. Source: [33].
The data collection for this study relied on one well-defined scenario, with a focus on identifying the postural effects on shooting performance with a semi-automatic rifle. The data collection was performed in an indoor shooting range to limit the effects of weather and wind on the participants and sensors. The scenario was designed to measure shooting performance when switching between different shooting targets and body postures (Fig. 1).
The participant is equipped with a semi-automatic rifle (HK416, 5.56 mm calibre; Fig. 2).
HK416. The HK416 semi-automatic rifle (5.56 mm calibre).
Two pop-up targets are placed 20 metres away from the shooter, with roughly 1.5 metres between them (Fig. 3).
Targets and shot detection system. (a): The yellow and red squares show the left and right pop-up targets used in the data collection. The blue square shows the LOMAH system [34], responsible for detecting shot positions. (b): The International Practical Shooting Confederation (IPSC) target plate with scoring zones that was used in the data collection, the size of an A4 paper sheet.
The scenario proceeds as follows:
1. The participant starts in a standing, shooting-ready position.
2. The left target pops up; the participant aims at the target and fires a set of three shots.
3. The participant switches back to a shooting-ready position.
4. 15 seconds after the left target popped up, the right target pops up. The participant switches targets and fires a set of three shots at the right target.
5. The participant switches to a kneeling position and fires another set of three shots at the right target.
6. The participant switches targets and fires a final set of three shots at the left target.
7. The participant secures and unloads the weapon.


Figure 3b shows the shooting target that was used in the data collection scenario. The participants were instructed to hit as close to the centre as possible while still maintaining a high speed of execution.
In order to build a robust and diverse dataset it would have been ideal to have shooting participants of different experience levels. However, due to safety constraints only experienced shooters with official weapons training could participate. Thirteen shooters took part in the data collection; 12 of them had a military background, three had a sports shooting background, and three had a background in hunting. All participants were male and ranged in age from 31 to 62, with an average age of 48. In an attempt to simulate poor posture, the participants performed half of the scenarios with their weak hand. Each participant performed the scenario six times, resulting in a total of 78 recorded shooting scenarios. Figure 4a,b shows a participant performing the scenario.
Data collection participant and resulting skeletons. (a,b): A participant performing the data collection scenario, first standing (a) and then kneeling (b). (c,d): The resulting skeleton representations.
The Azure Kinect body tracking sensor [10] was used to capture the body movements of the participants during the scenario. The accuracy of the Kinect sensor can be affected by several factors such as occlusion of body parts, e.g. a hand behind the body from the view of the camera. Other factors include poor lighting conditions, or disruption of the sensors’ infrared signals from e.g. sunlight or infrared heaters. To mitigate the effects of poor sensor readings, three Kinect sensors were used: one behind the participant, and two to the front on either side of the participant at roughly a 45 degree angle, as suggested by Gao et al. [25]. The sensor behind the participant was placed slightly to the right, because when placed straight behind the participant the sensor would have difficulties with determining in which direction the participant was facing. To ensure a consistent and coordinated trigger timing of 30 frames per second on all three devices, the Kinect sensors were connected together with a synchronisation wire signalling when to capture new frames.
The Location of Miss and Hit (LOMAH) system (Fig. 3a) was used to record the time and position of shots by detecting the sound waves produced by bullets [34]. These shot positions were recorded in an external system as horizontal and vertical distances in millimetres from a defined centre-of-target position. A microphone mounted on a backpack worn by the participant detected shots and ensured that the Kinect data could be matched to the shots detected by the LOMAH system.
The large computational resource requirements of the Kinect sensors necessitated the use of two different machines to process the data, in order to retain the highest possible frame rate (30 hertz). This meant that the Kinect sensors were unaware of each other’s time systems, making it difficult to match which frames represented the same movements. Consequently, each participant was instructed to perform a synchronisation movement before each scenario by extending their non-shooting arm upwards and moving it slowly outwards in an arc away from the body and down to the side (Fig. 5).
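One way such a temporal alignment could be computed — a sketch assuming the vertical (Y) trajectory of a joint on the raised arm is available from each sensor, which may differ from the exact procedure used in the study — is to cross-correlate the height signals and pick the lag with the strongest agreement:

```python
import numpy as np

def sync_offset(height_a, height_b):
    """Estimate the frame offset between two sensors by cross-correlating
    the vertical trajectory of a joint on the raised arm during the
    synchronisation movement.  Returns the lag such that
    height_a[n] ~ height_b[n - lag]."""
    a = height_a - np.mean(height_a)
    b = height_b - np.mean(height_b)
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(b) - 1)
```

Once the lag is known, the frame streams from the two machines can be shifted into a common timeline before spatial merging.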
Skeleton synchronisation and merging. (a–c): A participant performing the synchronisation movement from the views of the front left (a), front right (b), and back (c) sensors, with the origins of their coordinate systems at the sensor positions. (d): A captured frame of the participant from the view of the front left sensor. (e): The merged version of the sensors, with the coordinate system aligned with the body positions.
The Kinect produced 32 skeleton joints for each person present in each frame (roughly 30 frames per second). The joints broadly represented physical joints in a human body, e.g. hips, shoulders, knees etc. Because hand joints received poor estimations when further than 1.5 metres away from the sensor, they were excluded from the data. For each joint the following variables were collected:
X position in millimetres from the sensor, horizontally from the view of the sensor.
Y position in millimetres from the sensor, vertically from the view of the sensor.
Z position in millimetres from the sensor, extending straight out from the sensor.
Confidence level for the joint (specifying whether the joint was in view of, occluded from, or too distant from the sensor).
Because the raw data from the three Kinect sensors were expressed in their own absolute Cartesian coordinate systems with the origins located at each sensor’s position (Fig. 5a–c), some pre-processing was required. The first steps involved matching the raw data from the three devices both temporally and spatially, and isolating the body of interest, i.e., the shooter. A body-centric coordinate system was used, as suggested by Vemulapalli et al. [27]. The synchronisation movement performed by the participant was used to synchronise sensor data temporally across Kinect devices, and as a frame of reference for calculating new unit vectors for each device. From the synchronisation skeleton frame, a new basis was computed for each sensor: the normalised unit vector x̂ running parallel to the hip, from the left to the right hip joint; the unit vector ŷ pointing upwards along the spine, orthogonalised against x̂; and the unit vector ẑ = x̂ × ŷ, orthogonal to both. The origin o was placed at the centre of the hip joints. Each joint vector v was then expressed in the body-centric system as v′ = [x̂ ŷ ẑ]⊤(v − o).
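A minimal sketch of such a body-centric transformation, with the origin at the hip centre and the x-axis parallel to the hip as described by Vemulapalli et al. [27] (the orthogonalisation of the spine direction and the joint choices are illustrative assumptions):

```python
import numpy as np

def body_centric_transform(joints, left_hip, right_hip, spine):
    """Transform absolute joint positions (N x 3) into a body-centric
    coordinate system: origin at the hip centre, x-axis parallel to the
    hip, y-axis along the spine, z-axis orthogonal to both."""
    origin = (left_hip + right_hip) / 2.0
    x = right_hip - left_hip
    x = x / np.linalg.norm(x)
    # Orthogonalise the spine direction against the hip axis.
    y = spine - origin
    y = y - np.dot(y, x) * x
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)
    basis = np.stack([x, y, z])          # rows are the new unit vectors
    return (joints - origin) @ basis.T   # project each joint onto the basis
```

Applying the same transformation to every frame of every sensor makes the skeletons invariant to each sensor's absolute position.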
Despite the sensor data being transformed into the same coordinate system, the joint estimations and body rotations from the three sensors differed slightly, which made skeleton merging difficult. To overcome this, the sensor on the side of the weapon-holding hand was used as a reference to attach body-centric joint positions from the other sensors to this sensor’s skeleton. These joint positions were calculated by performing new body-centric transformations on each skeleton frame for each sensor, ensuring that the body orientation was not affected by the differing sensor estimations. All joints with a high confidence value (i.e. not occluded by other body parts from the view of a sensor) were used to compute an average joint position across the sensors for each frame.
Although three sensors were used in order to prevent the occlusion of body parts, joints were sometimes out of view from any of the sensors, resulting in outlier positions. To simulate the actual trajectory of these joints, their positions were estimated with a loosely fit 4th degree polynomial regression from the surrounding high confidence frames. To remove twitching joint movements, a median smoothing was performed on each frame from the surrounding five frames, followed by a mean smoothing from the surrounding three frames, resulting in a smooth merged skeleton representation (Fig. 5e). The shots detected by the LOMAH system were matched temporally to the skeleton data by detecting the shots with a microphone. Any samples containing shots that the microphone sensor failed to detect were discarded.
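The repair-and-smoothing steps above can be sketched per joint coordinate as follows (treating each coordinate track independently, and fitting one polynomial per track, are simplifying assumptions):

```python
import numpy as np

def repair_and_smooth(series, confident, poly_deg=4):
    """Estimate low-confidence joint positions with a loosely fit
    polynomial over the high-confidence frames, then apply median
    smoothing (surrounding five frames) followed by mean smoothing
    (surrounding three frames).  `series` is a 1-D coordinate track and
    `confident` a boolean mask of trustworthy frames."""
    t = np.arange(len(series))
    coeffs = np.polyfit(t[confident], series[confident], deg=poly_deg)
    fitted = np.polyval(coeffs, t)
    # Replace outlier (low-confidence) frames with the fitted trajectory.
    repaired = np.where(confident, series, fitted)
    # Median smoothing over the surrounding five frames.
    med = np.array([np.median(repaired[max(0, i - 2):i + 3]) for i in t])
    # Mean smoothing over the surrounding three frames.
    return np.array([med[max(0, i - 1):i + 2].mean() for i in t])
```

In practice this would be applied to each of the X, Y, and Z tracks of every joint in the merged skeleton.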
Features were extracted from the merged skeletons based on the angles of the bones connecting the joints: each bone was represented by its angles to the X, Y, and Z axes, in radians, for each frame (Fig. 6).
Skeleton bone feature representation. Each bone was defined by three features per frame (X, Y, and Z angles), here illustrated for the right femur bone from a frame in a skeleton sequence.
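One plausible reading of these bone features — the angle between each bone vector and the X, Y, and Z axes (its direction cosines, in radians) — can be computed as:

```python
import numpy as np

def bone_axis_angles(parent, child):
    """Angles (radians) between the bone running from `parent` to `child`
    and the X, Y, and Z axes -- three features per bone and frame.
    For a unit vector, each component is the cosine of the angle to the
    corresponding axis, so arccos of the components gives the angles."""
    v = np.asarray(child, dtype=float) - np.asarray(parent, dtype=float)
    v = v / np.linalg.norm(v)
    return np.arccos(np.clip(v, -1.0, 1.0))
```

For example, a perfectly vertical femur would yield angles of π/2, 0, and π/2 to the X, Y, and Z axes respectively.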
Model architecture. The input patches produced from a skeleton sequence, and how they were processed by the ViT model. Adapted from Dosovitskiy et al. [35].
To make the models more robust, a set of data augmentations was used: for each skeleton sequence, six new orientations were produced from rotations around the Y axis of 10, 20, and 40 degrees to both the left and the right, and used to train the model in addition to the original orientation, as suggested by Núñez et al. [26]. Gaussian noise was also added randomly to some of the bone features during training, in an attempt to make the models more robust to imperfect data.
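These augmentations can be sketched as follows (rotation angles as stated above; the noise level, and applying noise to the rotated positions rather than to the bone features, are illustrative simplifications):

```python
import numpy as np

def rotate_y(joints, degrees):
    """Rotate joint positions (N x 3) around the vertical Y axis."""
    a = np.radians(degrees)
    rot = np.array([[np.cos(a), 0.0, np.sin(a)],
                    [0.0,       1.0, 0.0],
                    [-np.sin(a), 0.0, np.cos(a)]])
    return joints @ rot.T

def augment(sequence, angles=(-40, -20, -10, 10, 20, 40),
            noise_std=0.0, rng=None):
    """Produce rotated copies of a skeleton sequence (frames x joints x 3),
    optionally with Gaussian noise, alongside the original orientation."""
    rng = rng or np.random.default_rng()
    out = [sequence]
    for deg in angles:
        rotated = rotate_y(sequence.reshape(-1, 3), deg).reshape(sequence.shape)
        out.append(rotated + rng.normal(0.0, noise_std, sequence.shape))
    return out
```

Each recorded sequence thus yields seven training samples before any noise is applied.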
Because of the strength of Transformers on sequential data tasks, as well as their inherent explainability capabilities [22], a Vision Transformer (ViT) was chosen for the learning tasks. ViTs divide images into equally sized rectangular patches, and embed them into projected one-dimensional vector representations by feeding them through trainable embedding layers. These embeddings are then treated in the same way as positional tokens in standard transformers. This means that ViTs are not limited solely to use on images, but can be used on any data that can be represented with patch embeddings. ViTs can explain their predictions by using the attention produced from the patch embeddings [35].
Each skeleton sequence was represented with time frames as columns, and the 25 different bones as rows with the X, Y, and Z radian values stacked on each other, in a fixed feature representation of 3 × 25 = 75 rows and 60 columns (two seconds of data at 30 frames per second).
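A sketch of this feature representation and its division into ViT input patches (the patch width of five frames, and the exact row ordering of the stacked X, Y, and Z values, are illustrative assumptions):

```python
import numpy as np

def to_feature_matrix(angles):
    """Stack a skeleton sequence of bone angles (60 frames x 25 bones x 3)
    into the fixed 75 x 60 representation: bone angle values as rows
    (X, Y, Z stacked per bone), time frames as columns."""
    frames, bones, channels = angles.shape
    return angles.reshape(frames, bones * channels).T

def to_patches(matrix, patch_w=5):
    """Split the feature matrix into equally sized patches along the time
    axis, flattened into vectors for the ViT embedding layer."""
    rows, cols = matrix.shape
    return [matrix[:, i:i + patch_w].ravel() for i in range(0, cols, patch_w)]
```

With a patch width of five frames, a 75 × 60 matrix yields twelve patches of 375 values each, which are then linearly embedded as tokens.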
For each prediction, the model produced 32 attention matrices from the eight heads of the four attention layers, which were summed up into one representation of the total attention. Figure 8 shows how the attention map was processed from the raw attention map produced by the model into a format that was deemed easier for humans to interpret, explaining which joints affected the prediction for each time frame.
Attention extraction. Pipeline describing the process from an input skeleton sequence to an explainable attention visualisation for a model prediction.
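Summing the per-layer, per-head attention matrices into a single total-attention map, as described above, can be sketched as follows (the tensor layout is assumed):

```python
import numpy as np

def total_attention(attn):
    """Sum per-layer, per-head attention matrices (e.g. 4 layers x 8 heads
    x tokens x tokens = 32 matrices) into one total-attention map, with
    rows renormalised to sum to one."""
    summed = attn.sum(axis=(0, 1))
    return summed / summed.sum(axis=-1, keepdims=True)
```

The resulting token-level map can then be projected back onto bones and time frames for visualisation.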
A pose classification task was used to demonstrate the capabilities of the workflow presented in this study. Four pose classes were identified and labelled from the data collection scenario (Table 1).
Class distribution for the pose estimation task
Due to the relatively low number of participants and data samples, skill could not be predicted as a continuous variable in a regression task with the shooting score as the target. Instead, a binary classification task was constructed to test whether our approach could separate skilled and unskilled shooters from each other. The samples produced from the scenarios where the participants used their strong shooting hand were used to denote good posture, and the samples using the weak hand were used to denote poor posture. The reasoning was that each shooter was much more accustomed to shooting with their strong hand, and would therefore make mistakes when shooting with their weak hand. The shooting accuracy and speed differences between strong and weak hand shooting showed that this was the case for most of our participants; with the exception of one participant, performance became worse when using the weak hand. The average time taken for a three-shot series with the strong hand was 2.6 seconds with an average shot score of 2.19, whereas it took 3.4 seconds with an average shot score of 1.89 with the weak hand. Those who had a more similar performance between their strong and weak hand were also observed to have a more similar posture than those who had a large performance difference when using different shooting hands.
Class distribution for the shooting hand estimation task
The same feature representation, model architecture, and non-random repeated cross-validation were used as in the pose estimation task (13 folds, 156 models). To limit the scope of this task, only standing shots were investigated. We extracted two-second samples by looking at the frames from 1.5 seconds before the first shot of a standing three shot series to 1.5 seconds after the last shot in the same series, extracting new sequences starting at every third frame. The class distribution was very close to equal between the two classes (Table 2). All skeletons were adjusted to a right-handed orientation by mirroring the left-handed shooting scenarios against the plane formed by the Y and Z axes, and switching the positions of the joints on the left and right side of the body. This ensured that the model could not learn to simply classify left- and right-handed scenarios, but that it would have to learn relevant factors of the actual poor posture that was produced from the shooters using their weak hand.
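The mirroring described above can be sketched as follows (the joint index pairs to swap are illustrative; the Azure Kinect SDK defines its own left/right joint indices):

```python
import numpy as np

def mirror_left_handed(joints, swap_pairs):
    """Mirror a left-handed skeleton frame (joints x 3) against the plane
    formed by the Y and Z axes (negating X), and swap each (left, right)
    joint index pair so the result matches a right-handed orientation."""
    mirrored = joints.copy()
    mirrored[:, 0] = -mirrored[:, 0]   # reflect across the Y-Z plane
    for left, right in swap_pairs:
        mirrored[[left, right]] = mirrored[[right, left]]
    return mirrored
```

Without the index swap, a mirrored skeleton would still be distinguishable by which side of the body each labelled joint sits on, which is why both steps are needed.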
Pose estimation metrics
Pose estimation attention statistics
Shows the top 10 (out of 25) bones for predicting poses. Note that left-handed samples were mirrored, and thus FOREARM_RIGHT denotes the forearm of the arm that pulls the trigger for both left- and right-handed samples.
Pose estimation
Table 3 shows the results for the pose estimation task, together with the 95% confidence intervals based on the results from each fold (participant) in the experiment. Because the class distribution was imbalanced, both Accuracy and Cohen’s kappa are reported.
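Cohen's kappa corrects the observed agreement for the agreement expected by chance from the class marginals, which makes it more informative than plain accuracy under class imbalance. A minimal implementation:

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance given the label and prediction marginals."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    observed = np.mean(y_true == y_pred)
    # Chance agreement: product of marginal frequencies, summed per class.
    expected = sum(np.mean(y_true == c) * np.mean(y_pred == c)
                   for c in classes)
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement, 0 agreement no better than chance, and negative values agreement worse than chance.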
Attention statistics were calculated for the model by ranking the importance of each bone feature for each fold (participant), and then computing the average of these rankings across all participants. The attention often focused on the position of the arms or the femur bones (Table 4), which was reasonable, as the angle of the arms could help determine whether a participant was aiming or not, and the angles of the femur bones could help the model determine whether a participant was standing or kneeling. Figure 9 shows an example, visualising the attention for a sequence in which a participant switches from a standing to a kneeling position.
Attention visualisation: Switching position. The input skeleton sequence of one of the participants switching from a standing to a kneeling position, the joint attention produced from the attention matrix from the pose estimation model, and a skeleton visualisation of the joint attentions for one time frame. Darker colour indicates stronger attention.
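The per-fold ranking and cross-participant averaging of bone importance described above can be sketched as follows (the array shape, folds × bones of summed attention, is an assumption):

```python
import numpy as np

def average_bone_ranking(attn_per_fold):
    """Rank bones by attention within each fold, then average the ranks
    across folds (participants).  `attn_per_fold` is a folds x bones
    array of summed attention per bone; lower mean rank means the bone
    was more important overall."""
    # argsort of argsort yields each bone's rank within its fold
    # (rank 0 = highest attention).
    ranks = np.argsort(np.argsort(-attn_per_fold, axis=1), axis=1)
    return ranks.mean(axis=0)
```

Sorting bones by this mean rank gives a table like Table 4 of the most attended bones across all participants.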
Table 5 shows the results from the shooting hand estimation experiment. Overall, the results indicate that it is possible, to some extent, to separate participants using their strong hand from participants using their weak hand. However, a large difference was observed in prediction performance between the participants, as well as for each individual shooter depending on which other participant acted as validation data and thus as the stopping criterion for the training. It is not entirely clear why there was such a large difference in model performance between participants, but most likely more training data is needed. Another reason for the difference in performance could be the differing shooting styles, and too many pose faults in relation to the number of collected samples. The model may have learned simple indicators of good versus bad posture, but these could be difficult to generalise between shooters with e.g. aggressive or more relaxed poses.
Shooting hand estimation metrics
Table 6 shows the computed attention statistics from all trained models for the shooting hand estimation task. These statistics indicate that the model has learned that the main difference in posture for strong versus weak hand shooting lies in the position of the forearm of the trigger arm. Other important features came from the head and centre torso body parts. This is in line with observations that were made from studying skeleton samples, where the shooting samples using the weak hand often had a more unnatural pose where the participants leaned into the weapon differently. Figure 10 highlights the difference in pose for one participant while actively firing the weapon during the scenario. This figure shows that the elbow of the trigger arm is raised when using the strong hand, whereas the participant is more huddled together and leaning unnaturally into the weapon with the weak hand. This is in line with many of the other top attended features from Table 6. Overall, domain experts noted an unrelaxed shooting position of the head and shoulders when studying individual samples using the weak shooting hand.
Shooting hand estimation attention statistics
Shows the top 10 (out of 25) bones for predicting whether shooting with the strong or weak hand. Note that left-handed samples were mirrored, and thus FOREARM_RIGHT denotes the forearm of the arm that pulls the trigger for both left- and right-handed samples.
Strong and weak hand comparison. Shows a comparison of one of the participants in the act of firing their weapon with their (a) strong hand and (b) weak hand. Highlighted are the top 10 bones from the overall attention statistics for the shooting hand task (Table 6). Note that all left-handed samples were swapped to a right-handed orientation.
An attempt was also made to predict skill by modelling a regression task where the target variable was the shooting score achieved by each participant, but this yielded close to random predictions. Most likely, substantially more data is needed for such a task.
This study could correlate shooting performance to posture to an extent by using strong and weak hand scenarios to denote good versus bad posture (Table 5). However, the results varied widely both between different participants, as well as between which validation fold was used as stopping criterion for the training. It is unclear why such a difference in performance between participants was produced; more experiments are needed. Because the experiment shows such volatility, it is likely that more training data is needed for the models to be more stable and statistically significant. There are also other factors that could affect the performance, such as poor sensor readings. Additionally, body postures in general may not have a large enough effect on shooting performance to be used as a sole factor; many other factors such as trigger jerking, small weapon movements, and eye movements could potentially have a larger impact on performance [6]. There have also been conflicting reports on whether postural stability correlates to shooting performance across multiple individuals [4, 6]. Various shooting styles were observed from the participants; some had a very forward leaning, aggressive shooting pose, whereas others had a more upright pose while still achieving high scores. With only 13 participants, these differences in body postures may have been detrimental to the discriminative powers of the model. The results that were attained from estimating shooting performance through shooting score as a regression target were essentially random. For the regression approach to work it is likely that much more training data is needed. It is also possible that using shooting score as a regression target is simply not a good approach to predict skill. 
A potential future approach could be to use a combination of simulated poor posture from using the weak hand, expert judgement of shooter posture, and a wide variety of skill levels between shooters to bin participants into different levels of posture quality.
The pose estimation task saw stronger performance (Table 3), likely due to the more apparent differences between the poses in different phases of the shooting scenario compared to the differences between good and poor posture. The model learned a reasonable representation, as the poses that were classified incorrectly were usually the most similar to each other. Because the pose labels were produced automatically by using shot moments, the labels were not always accurate. All frames up until 0.5 seconds before the first kneeling shot were labelled as belonging to the preceding pose class, even when the participant had already begun transitioning, which introduced some label noise.
The attention mechanism produced explainable predictions from the models, often attending to the relevant frames and bones, as was also observed by Song et al. [30] and Plizzari et al. [31]. Because neither task saw excellent results, one also has to consider that a non-negligible part of the attention statistics were computed from incorrect predictions, and should therefore be seen as an indication of bone importance rather than as absolute truth. Figure 9 shows how a frame from a prediction could be explained through visualisation by colouring the joints of a skeleton frame with the values from a processed attention matrix. Longer sequences were visualised using 3D modelling or videos of skeletons with continuously shifting levels of attention on joints. It is the opinion of the authors that the visualisation provides an intuitive insight into the model's reasoning, which can aid both in better model development and as a basis for decision making during rifle shooting training.
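As an illustration of how such per-joint values can be derived, the following sketch collapses a token-level attention matrix into per-frame, per-bone scores suitable for colouring a skeleton. The frame-major token layout and the averaging over heads are assumptions made for illustration, not necessarily the exact processing used in this study.

```python
import numpy as np

def joint_attention_scores(attn, n_frames, n_bones):
    """Collapse a token-level attention matrix into per-frame,
    per-bone scores for colouring skeleton joints.

    attn: (heads, tokens, tokens) attention weights, where tokens are
          assumed to be laid out frame-major: token = frame * n_bones + bone.
    """
    avg = attn.mean(axis=0)           # average attention over heads
    received = avg.sum(axis=0)        # total attention each token receives
    scores = received.reshape(n_frames, n_bones)
    # normalise to [0, 1] so the values map directly onto a colour scale
    denom = scores.max() - scores.min() + 1e-8
    return (scores - scores.min()) / denom
```

A visualisation layer can then map each score to a joint colour (e.g. blue for low attention, red for high) for a single frame, or animate the scores over frames for longer sequences.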
A limiting factor of the proposed transformer model was its fixed input size, i.e. all bones and their radian X, Y, and Z values over a sequence of 60 time frames (two seconds). Using longer sequences would require aggregating the data along the time dimension, which would cause the model to perceive subjects as moving much faster than they did in reality. A potential workaround would be to include a time token in the input sequence, indicating to the model that it should judge samples differently based on the time taken. Although standard transformer models can use different input sizes, longer sequences could be more difficult to train on and thus require more data. Other sequential models such as LSTMs, or GCNs using graphs across time, could work well for the learning tasks, but would not produce the same level of explainability.
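The time-token workaround could be sketched as follows, assuming a hypothetical 25-bone skeleton and nearest-neighbour resampling to the fixed 60-frame input; both choices are illustrative, not the study's implementation.

```python
import numpy as np

N_BONES, N_DIMS = 25, 3   # hypothetical skeleton layout

def build_input(angles, duration_s, target_frames=60):
    """Warp a variable-length sequence of bone angles (radians) to a
    fixed number of frames and prepend a scalar time token so the
    model can account for the true duration of the movement.

    angles: (frames, N_BONES, N_DIMS) array of radian values
    """
    frames = angles.shape[0]
    # nearest-neighbour resampling along the time axis
    idx = np.linspace(0, frames - 1, target_frames).round().astype(int)
    warped = angles[idx]                                   # (60, 25, 3)
    tokens = warped.reshape(target_frames, N_BONES * N_DIMS)
    # broadcast the duration into a token-shaped row
    time_token = np.full((1, N_BONES * N_DIMS), duration_s)
    return np.concatenate([time_token, tokens], axis=0)    # (61, 75)
```

In a real model the time token would typically be a learned embedding conditioned on the duration rather than a raw broadcast scalar; the sketch only shows where such a token would sit in the input sequence.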
Some compromises were made in the selection of participants due to safety constraints, both in the number of participants and in the number of scenarios they could perform. Because of this, the data may be skewed towards one type of behaviour. Ideally, a wide variety of skill levels would have been represented among the participants. Having half the scenarios performed with the weak shooting hand to simulate different skill levels worked quite well to increase the diversity of the dataset; shooters shot better and faster with their strong hand, and their body postures and movements were generally more in line with shooting doctrine according to domain experts.
The use of three body tracking sensors to limit the poor pose estimations caused by occluded body parts worked to an extent. However, joints were still occasionally occluded and would stutter substantially between frames, which may have limited the machine learning models' ability to focus on small-scale details in shooting posture. Although standing poses were represented reasonably well, kneeling or sitting postures often had very poor joint estimations in the legs. Using polynomial regression to interpolate low-confidence joints from surrounding high-confidence frames of the same joints served as a good heuristic for the actual joint positions, making the skeletons represent reality better. The synchronisation movement helped to match the three sensors' skeleton data temporally and spatially, although the poor sensor readings made a perfect matching difficult. Despite transforming the different sensor data to the same coordinate system, differences in intra-skeleton joint estimations were an additional issue for skeleton merging; one sensor would estimate a joint to be at a slightly different distance and angle from the other joints of the body than another sensor. Performing a transformation on each frame, using one sensor's skeleton as the main skeleton, helped to circumvent some of these issues.
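The interpolation heuristic might look like the following minimal sketch for a single joint coordinate; the confidence threshold and polynomial degree are illustrative values rather than those used in the pipeline.

```python
import numpy as np

def interpolate_joint(positions, confidence, threshold=0.5, degree=3):
    """Replace low-confidence estimates of a single joint coordinate
    with values from a polynomial fitted to the surrounding
    high-confidence frames.

    positions:  (frames,) one coordinate of one joint over time
    confidence: (frames,) per-frame confidence in [0, 1]
    """
    t = np.arange(len(positions))
    good = confidence >= threshold
    # fit only to trusted frames, then evaluate over the whole window
    coeffs = np.polyfit(t[good], positions[good], deg=degree)
    fitted = np.polyval(coeffs, t)
    out = positions.copy()
    out[~good] = fitted[~good]   # only low-confidence frames are replaced
    return out
```

In practice this would run per joint and per coordinate over a sliding window, so that the polynomial only needs to model a short segment of the trajectory.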
Future work
There are many possibilities for future studies to expand on this work. For better shooting performance estimation, there may be a need for more manual features based on expert knowledge of the shooting domain, such as mean sway velocity and other established factors [2, 3, 4]. Although deep learning models can generally find important features in raw data, the relatively small dataset could benefit from handcrafted features with more discriminative power. Such features could be used either in classical machine learning models or as additional dimensions in the architecture proposed in this study. Another future step could be to examine different ways of modelling shooting skill through classification or regression tasks. Different input sequence lengths would allow for studying both long- and short-term movement patterns. This could be done by warping inputs of different lengths to one size and including a time indication token, by using multiple models with different input sizes, or by using models capable of handling varying input sizes. It could also be interesting to examine whether a different coordinate representation (e.g. cylindrical [12] or spherical) could affect model performance.
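As an example of such a handcrafted feature, the mean sway velocity of a tracked joint could be computed along these lines; the frame rate and the choice of joint are assumptions for illustration.

```python
import numpy as np

def mean_sway_velocity(xy, fps=30.0):
    """Mean sway velocity of a tracked point (e.g. a hip or head joint),
    computed as the average frame-to-frame displacement multiplied by
    the frame rate.

    xy: (frames, 2) horizontal-plane position of one joint over time
    """
    step = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # per-frame displacement
    return step.mean() * fps                            # m/s if xy is in metres
```

A feature vector of such values (one per joint of interest, or statistics over phases of the scenario) could then feed a classical model or be concatenated onto the transformer input.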
If shooting performance estimation were more successful in the future, the attention explanations could be used as direct visual feedback to a novice shooter in a product for live training scenarios, highlighting important aspects of their posture control. This study has examined and demonstrated one way in which attention maps can be visualised, through simple sums of the attention placed on a bone over time, but there are many additional possibilities for explaining predictions with visualisations. Because it is known which pairs of time/bone patches attend to each other (or to themselves), visualisations of the attention between pairs of bones/joints could extend the explanations further. Attention could also help during model development as a tool to identify incorrectly learned patterns, or where the features were lacking. It would also be interesting to perform user studies with shooting instructors to determine whether the attention visualisations are helpful and intuitive, and to use the expert assessments to improve the explanations further.
Additional data quality improvement techniques such as smoothing could be examined further, as the poor joint estimations produced by the sensors were potentially detrimental to model performance. Other body tracking approaches, such as body tracking suits, could be an alternative for producing more accurate skeleton representations [16, 19]. More data samples are probably needed overall, with increased diversity in shooter styles and skill levels. Different shooting scenarios with a broader range of movements could also be of interest for future data collections.
Conclusions
This study has examined postural factors in rifle shooting scenarios through the use of multi-sensor body tracking. It has described the difficulties of large-scale data collection involving human participants and body tracking sensors, and the mitigating measures taken to produce high-quality data. A system was developed for merging data from multiple body tracking sensors with differing time and coordinate systems, along with pre-processing steps to smooth the data and represent it as input features for machine learning algorithms. The approach to generating explainable predictions from multi-sensor body tracking is general, and can be adapted to domains other than rifle shooting. It can be argued that relevant factors for rifle shooting tasks have been extracted from postures and body movements to an extent; the shooting performance of participants could be classified using strong and weak hand shooting scenarios, although model performance varied widely between participants. However, it was possible to separate the different phases of a shooting scenario through pose estimation with high accuracy. This was done using a Vision Transformer (ViT) model, which could explain its predictions on a per-sample basis through the attention mechanism. These explanations were processed and visualised to be interpretable by humans, presenting skeletons in an interactive 3D environment with continuously shifting attention values per joint over time. Although much more work can be done in the sports shooting domain with body tracking data, this study has laid the groundwork for future studies to build on.
Acknowledgments
We wish to thank the anonymous reviewers, whose suggestions helped improve and clarify this manuscript. Thank you to Max Pettersson and Saga Bergdahl, whose ideas and collaboration in the execution of this study were valuable and rewarding. A big thank you to Anders Johanson, Per Lexander, Olof Bengtsson, Robert Andrén, and everyone else at Saab AB, Training and Simulation who have assisted in the data collection and idea stages of this study. Thank you also to the 13 anonymous participants who took part in the data collection. This work has been performed within the
