Abstract
Interpersonal coordination in musical ensembles often involves multisensory cues, with visual information about body movements supplementing co-performers’ sounds. Previous research on the influence of movement amplitude of a visual stimulus on basic sensorimotor synchronization has shown mixed results. Uninstructed visuomotor synchronization seems to be influenced by amplitude of a visual stimulus, but instructed visuomotor synchronization is not. While music performance presents a special case of visually mediated coordination, involving both uninstructed (spontaneously coordinating ancillary body movements with co-performers) and instructed (producing sound on a beat) forms of synchronization, the underlying mechanisms might also support rhythmic interpersonal coordination in the general population. We asked whether visual cue amplitude would affect nonmusicians’ synchronization of sound and head movements in a musical drumming task designed to be accessible regardless of musical experience. Given the mixed prior results, we considered two competing hypotheses. H1: higher amplitude visual cues will improve synchronization. H2: different amplitude visual cues will have no effect on synchronization. Participants observed a human-derived motion capture avatar with three levels of movement amplitude, or a still image of the avatar, while drumming along to the beat of tempo-changing music. The moving avatars were always timed to match the music. We measured temporal asynchrony (drumming relative to the music), predictive timing, ancillary movement fluctuation, and cross-spectral coherence of ancillary movements between the participant and avatar. The competing hypotheses were tested using conditional equivalence testing. This method involves using a statistical equivalence test in the event that standard hypothesis tests show no differences. Our results showed no statistical differences across visual cue types. Therefore, we conclude that there is not a strong effect of visual stimulus amplitude on instructed synchronization.
Introduction
In ensemble music performance, musicians use multisensory cues to achieve a synchronized sound. Such cues likely include: auditory feedback to reduce asynchronies and asynchrony variability (Chen et al., 2002); intrapersonal somatic cues such as head movements to reinforce a sense of musical meter (Phillips-Silver & Trainor, 2007, 2008); and visual cues to facilitate anticipation of upcoming temporal patterns in the music (Colley et al., 2018). Assuming co-performers in a musical environment can see each other, intrapersonal somatic cues may also become interpersonal visual cues, such that one person’s rhythmic body movements might be seen by another person. Indeed, mutual visual access among partners in previous work (a dyadic sensorimotor-synchronization task with musical sequences) was found to improve the synchrony of partners’ ancillary head movements, as well as their synchronization with the target auditory stimulus (Colley et al., 2020).
Studies on pure visuomotor synchronization (no audio component) have shown mixed results regarding the effect of amplitude of a periodic visual stimulus on one’s ability to synchronize with the stimulus. Participants were found to spontaneously synchronize forearm movements with an oscillating circle better with larger amplitudes of circle movement, even when the period duration was kept the same (Varlet et al., 2012). Additionally, postural sways showed greater phase entrainment with larger environmental stimulus movements (Dijkstra et al., 1994). In both cases, synchronization with the visual stimulus was considered uninstructed, meaning participants were spontaneously synchronizing their movements, possibly without awareness. On the other hand, research on instructed rhythmic synchronization suggests there is no effect of stimulus amplitude (de Rugy et al., 2008; Peper & Beek, 1998). Similarly, synchronizing finger taps with an image of a finger featuring apparent motion was not affected by the amplitude of the apparent motion (Hove & Keller, 2010). Additionally, synchronization tapping with a virtual conductor was not influenced by the amplitude of conductor gestures (Wöllner et al., 2012).
Regarding music-related ancillary movements, previous studies have demonstrated that ancillary movements generally play a role in communicating a performer’s expressive intentions, with larger movements signaling increased expressive intensity (Davidson & Broughton, 2016; Luck et al., 2014; Thompson & Luck, 2011). However, the influence of the size of ancillary movements on co-performers’ synchronization abilities has not been tested. This would be difficult to test, as any benefit of a co-performer on a partner’s synchronization depends to some extent on the skill and reliability of the co-performers (Pecenka & Keller, 2011) as well as social motives (Lumsden et al., 2012). As such, it would be hard to have a consistent visual cue in the form of ancillary movements.
To further explore the role of visual stimulus amplitude on synchronization we focused on the role of range of motion—or movement amplitude—of a high-performing co-performer’s movements on one’s ability to synchronize with a concurrent musical beat. To address the issue of not having a reliable stimulus, we programmed a virtual co-performer. With this controllable stimulus, we tested whether larger body movements of a very accurate co-performer could improve the synchronization accuracy of an observer. Also, assuming the co-performer’s movements were always matched to the musical beat (which we controlled for), then larger movements would produce higher velocities. Velocity has been shown to be an important factor in visually mediated synchronization in earlier work (Luck & Sloboda, 2008, 2009; Luck & Toiviainen, 2006; Varlet et al., 2014). Velocity is also important in conductor gestures such that musicians and nonmusicians synchronize best with movement featuring high rates of vertical velocity change (Colley et al., 2018).
Overall, there is some evidence that uninstructed visuomotor coordination is affected by stimulus amplitude, but there is also evidence that stimulus amplitude has weak to no effects on instructed visuomotor coordination. The aim of the current study was to test whether movement amplitude of a visual stimulus affects one’s ability to synchronize in a musical situation, where synchronization among co-performers often involves visual cues. Another interesting aspect of musical synchronization is that synchrony is not necessarily instructed. Certainly the main objective in most music is to match sounds in time, and as such, audio-motor synchronization among performing musicians is instructed. However, occasional apparent visuomotor synchronization may be uninstructed, or ancillary.
We tested the influence of stimulus amplitude by having research volunteers drum to the beat of specially composed pieces of tempo-changing ensemble music, while observing a virtual co-performer (avatar), whose movements were manipulated to exhibit various amplitudes of motion, but were always matched to the musical beat. We used drumming as opposed to finger-tapping because individuals tend to miss fewer beats when drumming (Madison et al., 2013; Manning et al., 2017), thus yielding higher-quality data. We recorded their drumming in order to measure the asynchrony of their drum strokes, and to quantify their predictive timing, which is the ability to anticipate upcoming beat intervals (Colley et al., 2017, 2018). We also recorded participants with motion capture during the drumming task (Colley et al., 2018) to measure the synchrony of their ancillary head movement with the avatar using cross-spectral coherence (Richardson et al., 2005; Schmidt & O’Brien, 1997; Varlet et al., 2015), as well as to quantify the determinism of their ancillary movements using detrended fluctuation analysis (DFA; Wang & Yang, 2012). We used DFA alongside cross-spectral coherence to understand the impact of visual cues on postural sway independent of synchrony with the visual cue. Coherence alone would not capture the structure or rigidity of posture, and previous work has shown that movements associated with postural control while standing tend to default to pink noise type fluctuations (Blázquez et al., 2009), but this is altered by rhythmic visual cues (Colley et al., 2018).
Given the mixed prior research, we had two separate hypotheses regarding the effect of avatar movement amplitude on one’s ability to synchronize with a musical beat. Based on work on uninstructed coordination, temporal asynchronies relative to a musical pacing signal will be lower when participants observe an avatar with a large movement amplitude, compared to avatars with relatively small, or no movement amplitude. Based on work on instructed coordination, temporal asynchronies will be lower with a moving avatar compared to a still image, but will not change with different movement amplitudes.
Our other measure from the musical drumming task was predictive timing. Based on the finding that temporally relevant biological motion (compared to temporally relevant non-biological motion) facilitates predictive timing (Colley et al., 2018), we structured our hypothesis in a similar manner to the previous hypothesis. Predictive timing will be higher when participants observe an avatar with a large movement amplitude, compared to avatars with relatively small, or no movement amplitude. Predictive timing will be higher with a moving avatar compared to a still image, but will not change with different movement amplitudes.
Regarding our motion capture measures (cross-spectral coherence and DFA), we also had two possible hypotheses. Coherence (between the participant and avatar) and αDFA will be higher when participants observe avatars with larger movement amplitudes, compared to relatively small movement amplitudes, or no movement. Coherence and DFA will be higher with avatars featuring any movement compared to a still image, but will be the same across movement amplitudes.
To test these hypotheses, we used the method of conditional equivalence testing (Campbell & Gustafson, 2018).
Methods
Participants
Participants (N = 30, 23 male, Mage = 19) were recruited through Western Sydney University’s research participation programme, and given course credit for completing the experiment. Participants were accepted regardless of musical experience, as we were interested in synchronization abilities in the general population. However, we assessed musical training with a questionnaire. Three participants had more than 5 years of musical training, and were currently involved in instrumental music performance. Of the remaining 27 participants, 12 people reported having 1 academic year or less of music education, and 15 people reported having no formal music education.
In a previous study, Varlet et al. (2012) report an effect size of partial η2 = .29 for the effect of stimulus amplitude on unintended visuomotor synchronization. Based on a post-hoc power analysis, our study with 30 participants had greater than .95 power to detect an effect of movement amplitude on synchrony, if the effect generalizes across dependent measures and to other synchronization tasks.
Design
The experiment used a one-factor repeated-measures design; we call the factor visual cue (see Figure 1). Visual cue refers to the magnitude of movements in the visual stimulus, and had four levels: normal movement, movement amplified by 100%, movement amplified by 200%, and no movement (control). As a shorthand, the four conditions will be referred to as Regular, Amp1, Amp2, and Still, respectively.
Auditory Stimuli
The music with which participants drummed was made for a previous experiment (Colley et al., 2018) and is described in greater detail in the associated article. The duration of each piece was 2 min (and therefore the trial duration was also 2 min). It was composed using MIDI instruments (xylophone, glockenspiel, harp) with short sound envelopes (150–250 ms) so that notes in the melody would not overlap, thereby avoiding ambiguous beat onsets. The musical texture was homophonic and the harmonies were common in Western voice leading. There was no change in rhythm in any of the three instrument parts, so that the lines of music created a single target pulse stream. The average inter-onset interval (IOI) was 500 ms, but there were tempo changes throughout the music (IOI range: 332–668 ms) in order to assess participants’ anticipatory timing abilities. The range of these tempo changes was on the order of those observed in expressive musical performance (e.g., Repp, 1992, 1998). There were six rates of change for the tempo changes: ±10, ±16, and ±22 ms per beat. There were three pieces of music. All three were similar in style but featured the tempo changes at different times in the music. It should be noted that the tempo changes were randomly generated for each of the three pieces when the stimuli were made but were not randomly generated at each experimental session. In other words, all participants heard the same music. Further details about the music structure, timing, and composition can be found in a previous study (Colley et al., 2018).
Visual Stimuli
The avatar used in the visual stimuli was made by averaging the motion capture recordings of 10 high-performing participants (i.e., relatively good synchronizers) from a previous experiment (Colley et al., 2018), in which they drummed to the same music used here. Thus there were three versions of the avatar, one for each of the three pieces of music. In order for a former participant’s data to be included in an avatar, a participant had to be right-handed, have no missed beats, and an average absolute asynchrony below 30 ms for all three pieces of music. With 10 of these participants identified, we reduced the data in their recordings by selecting a subset of motion capture markers that gave the impression of a human body. We removed the left arms from the motion capture recordings used in the visual stimuli, as the former participants tended to exhibit task-irrelevant movements with the left hand (e.g., scratching their head, or resetting a loose marker). Each of these motion capture datasets contained xyz coordinates (represented as distance from an origin point) of the aforementioned markers for each frame of the recording. In brief, these coordinate values were averaged across the 10 model participants. Further details about the averaging procedure used to create the avatars can be found in similar work (Colley et al., 2018).
Once the base avatar was made, we manipulated its movement trajectory to create the other visual cue conditions. The Amp1 condition was made by expanding the range of motion of all markers along all spatial axes (x, y, z) by 100%. In other words, the position coordinates of the base avatar were linearly mapped to fit in between new minimum and maximum values. Thus the timing and relative shape of the avatars stayed the same, but the range of motion increased. The same was done for the Amp2 condition, but the range was increased by 200%. The Still condition (control) was an image of the avatar in its first frame of the animation.
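The linear remapping described above can be illustrated with a minimal Python sketch. It is not the code used to build the stimuli. It assumes that the expansion is centred on the midpoint of each coordinate's original range (the text does not state the reference point) and that an increase of 100% means a doubled range. The function name amplify_range and the toy trajectory are hypothetical.

```python
def amplify_range(coords, percent_increase):
    """Linearly remap one coordinate series so its range grows by a percentage.

    Assumption: the expansion is centred on the midpoint of the original
    range; the source does not say which reference point was used.
    """
    lo, hi = min(coords), max(coords)
    mid = (lo + hi) / 2.0
    gain = 1.0 + percent_increase / 100.0  # 100% -> 2x range, 200% -> 3x range
    return [mid + gain * (c - mid) for c in coords]


# Toy vertical trajectory of one marker (arbitrary units), one value per frame.
regular = [0.0, 1.0, 2.0, 1.0, 0.0]
amp1 = amplify_range(regular, 100)  # range doubles
amp2 = amplify_range(regular, 200)  # range triples
```

Because the frame count and the order of the samples are untouched, the timing and relative shape of the trajectory are preserved while the distance covered per frame, and hence the velocity, scales with the gain.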
Apparatus
An Alesis Percpad (tapping pad) was used to collect the drumming data in MIDI format. Participants’ movements were recorded with a 12-camera Vicon motion capture system at 100 Hz sampling rate, with reflective markers arranged using a custom model with four markers on the head and one marker on each of the following locations: central on the back of the neck, left shoulder, right shoulder, right shoulder blade, right elbow, right inner wrist, and right outer wrist (all participants were right-handed). The motion capture recording and the drum recording were synced by sending a serial trigger signal to Nexus (the motion capture software) at the onset of each trial. The experimental procedure (avatar animations, stimuli presentation, trigger signals, and data collection) was programmed using C++ in the Xcode coding environment on a 2015 MacBook Pro. Auditory stimuli were sent through stereo speakers, and visual stimuli were presented on a 17” monitor with a 60 Hz refresh rate.
Procedure
Participants received a study information and consent form by email after signing up for the experiment. They were given a paper copy to sign when they arrived for the experiment. Next, with permission from the participant, the experimenter attached motion capture markers to the body parts listed in the Apparatus section. While attaching the markers, the experimenter explained the task and answered questions.

Figure 1. Three images of the three moving visual cues that participants observed during the task. The Regular condition (left) was the averaged motion profile of natural movements. Amp1 (center) increased the range of motion of the Regular condition by 100% along the horizontal and vertical planes. Amp2 (right) increased the range of motion of the Regular condition by 200%. The Still control condition maintained the image of the avatar shown in this figure for the entire trial (without the scales and arrows). Note that depth of movement was represented by the changing diameters of the circles, but there was very little movement along this axis.
Participants were instructed to “drum along to the beat of the music,” to “be aware that the speed of the music would sometimes change,” and to “always be watching the visuals on the monitor.” In an attempt to ensure participants watched the visual cues, we used catch letters, wherein a letter would appear at the center of the screen at pseudo-random timepoints during a trial. Participants were told to say these letters out loud so the experimenter could verify that they were observing the screen and reporting the correct letters. Letter appearances were timestamped to assess whether they had any influence on drumming asynchrony (see Data Analysis section). No specific instructions regarding movement were given. Instead, participants were told to stand however they felt comfortable throughout the trial, so long as their feet and eyes were facing the monitor. There were 24 trials of duration 2 min. Participants had one 30 s-long practice trial with no visuals, which they could repeat upon request. There was no electronically generated auditory feedback from the drum pad, though participants could hear and feel the drum stick hitting the drum pad. After the experiment, participants were given a short musical background questionnaire to assess their musical training (if any) and music-listening habits.
Data Analysis
Drumming Analysis
To check for unusual influence of the catch letters on asynchronies, we used the seasonal hybrid extreme studentized deviate (SH-ESD) test on the asynchrony time series. SH-ESD detects outliers in seasonal time series data, “seasonal” meaning the time series has periods of fixed length, as in our tempo-changing music. SH-ESD is similar to Grubbs’ test for outliers, but is preferred for time series. To check whether the catch letters successfully sustained participant attention, the experimenter confirmed that the letter said by the participant matched what appeared on the screen throughout the experiment session.
From our drumming recordings we produced two measurements: asynchrony and predictive timing. Asynchrony was calculated as the average of absolute time differences in ms between the cumulated sequence of musical beat intervals (or inter-onset intervals [IOIs]) and the cumulated sequence of participant drum intervals (or inter-tap intervals [ITIs]). To quantify predictive timing we used the prediction/tracking index (Colley et al., 2017; Pecenka & Keller, 2009). This measure is the ratio of a prediction coefficient over a tracking coefficient. The prediction coefficient represents the strength of the statistical relationship between the ITI and IOI series. The tracking coefficient is the statistical relationship between the ITI series and the lag-1 IOI series. Thus the prediction coefficient is high if participants are anticipating the changing beat intervals and thereby closely matching the intervals, and the tracking coefficient is high if participants are responding to changing beat intervals one beat later, thereby resembling the lagged IOI series. For asynchrony and P/T Index, we used Grubbs’ test to identify outliers.
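The two drumming measures can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code of the study. It assumes that the first tap is aligned with the first beat, that no beats are missed, and that the prediction and tracking coefficients are plain lag-0 and lag-1 Pearson correlations (the text only calls them statistical relationships). All function names and the toy interval series are hypothetical.

```python
import math
from itertools import accumulate


def mean_abs_asynchrony(iois, itis):
    """Mean absolute difference (ms) between cumulated beat and tap times."""
    beats = list(accumulate(iois))
    taps = list(accumulate(itis))
    return sum(abs(t - b) for t, b in zip(taps, beats)) / len(beats)


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def pt_index(iois, itis):
    """Ratio of the lag-0 (prediction) to the lag-1 (tracking) coefficient."""
    prediction = pearson(itis[1:], iois[1:])  # ITI_n against IOI_n
    tracking = pearson(itis[1:], iois[:-1])   # ITI_n against IOI_(n-1)
    return prediction / tracking


# Toy tempo change: beat intervals (ms) slow down and then speed up.
iois = [500, 510, 520, 530, 540, 530, 520, 510, 500, 490]
predictor = list(iois)        # matches each interval as it happens
tracker = [500] + iois[:-1]   # reproduces the previous interval one beat late

pt_predictor = pt_index(iois, predictor)
pt_tracker = pt_index(iois, tracker)
```

In this toy case the idealized predictor yields a ratio above 1 and the idealized tracker a ratio below 1, which is the direction of interpretation described in the text.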
Motion Capture Analysis
All reported analyses of motion capture recordings considered participants’ head movements. Previous research using these experimental procedures found that the head moved more than any other part of the body besides the arm; arm movement in drumming is instrumental, whereas our hypotheses concern ancillary movements of the body. Also, the head is the most visible part of the body in most musical ensemble contexts, and therefore would presumably serve as the most salient cue with which an individual might synchronize their own ancillary movements. This assumption is supported by the fact that recent work on interpersonal coordination in ensembles has focused on head movements (Bishop et al., 2019; Chang et al., 2019).
From our motion capture recordings we produced two measures: cross-spectral coherence and DFA. For both measures, we used the root-sum-square of the raw motion capture data. This produces a directionless signal that incorporates features from all three spatial planes (x, y, z), and we had no specific hypotheses regarding the direction of participant movements. We reduced the motion capture data further by down-sampling to 50 Hz from 100 Hz, and filtering the resulting signal with a 10 Hz low-pass filter.
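As a rough illustration of the data reduction, the sketch below collapses xyz coordinates into a root-sum-square signal and halves the sampling rate. It is a simplified example, not the study's pipeline: the 10 Hz low-pass filter is omitted (in practice a filter from a signal-processing library would be applied), and the down-sampling simply keeps every second sample. Function names and the toy coordinates are hypothetical.

```python
import math


def root_sum_square(frames):
    """Collapse (x, y, z) coordinates into one directionless value per frame."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in frames]


def downsample_by_two(signal):
    """Keep every second sample, e.g. 100 Hz -> 50 Hz (no filtering here)."""
    return signal[::2]


# Toy head-marker positions (distance units from the origin), four frames.
frames = [(3.0, 4.0, 0.0), (0.0, 0.0, 5.0), (6.0, 8.0, 0.0), (1.0, 2.0, 2.0)]
magnitude = root_sum_square(frames)       # [5.0, 5.0, 10.0, 3.0]
reduced = downsample_by_two(magnitude)    # [5.0, 10.0]
```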
Cross-spectral coherence measures the consistency of phase relationships among multiple frequencies in a signal. It produces a value between zero (no synchrony) and one (perfect synchrony at all measured frequencies). In this case, we are measuring the phase relationships among different frequencies of movement between the participant and the avatar. As there was no movement in the control stimulus (a still image), we used a pseudo-pair control. This means that to analyze control trials, we compared the movement of a participant with the movement of the same participant from a different trial featuring the same music. The coherence window size was set at 512, and the overlap size at 50%. The range of measured frequencies was .1 Hz to 8 Hz, and the reported coherence scores are the average of all coherence values from within this range.
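For readers unfamiliar with the measure, the following Python sketch computes a band-averaged magnitude-squared coherence. Several choices here are assumptions rather than details from the study: Welch-style averaging over segments, a Hann window, removal of each segment's mean, and a window size counted in samples. Only the window size, overlap, and frequency range defaults come from the text. The function name mean_coherence and the toy signals are hypothetical, and a library routine would normally be preferable to this slow direct transform.

```python
import cmath
import math
import random


def mean_coherence(x, y, fs, nperseg=512, overlap=0.5, f_lo=0.1, f_hi=8.0):
    """Welch-style magnitude-squared coherence, averaged over f_lo..f_hi (Hz)."""
    step = max(1, int(nperseg * (1.0 - overlap)))
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (nperseg - 1))
              for i in range(nperseg)]
    bins = [k for k in range(1, nperseg // 2 + 1)
            if f_lo <= k * fs / nperseg <= f_hi]
    twiddles = {k: [cmath.exp(-2j * math.pi * k * i / nperseg)
                    for i in range(nperseg)] for k in bins}
    pxx = dict.fromkeys(bins, 0.0)
    pyy = dict.fromkeys(bins, 0.0)
    pxy = dict.fromkeys(bins, 0j)

    def prepared(segment):
        # Remove the segment mean, then apply the Hann window.
        mean = sum(segment) / len(segment)
        return [w * (v - mean) for w, v in zip(window, segment)]

    n = min(len(x), len(y))
    for start in range(0, n - nperseg + 1, step):
        xs = prepared(x[start:start + nperseg])
        ys = prepared(y[start:start + nperseg])
        for k in bins:
            fx = sum(a * t for a, t in zip(xs, twiddles[k]))
            fy = sum(b * t for b, t in zip(ys, twiddles[k]))
            pxx[k] += abs(fx) ** 2          # auto-spectrum of x
            pyy[k] += abs(fy) ** 2          # auto-spectrum of y
            pxy[k] += fx.conjugate() * fy   # cross-spectrum
    values = [abs(pxy[k]) ** 2 / (pxx[k] * pyy[k]) for k in bins]
    return sum(values) / len(values)


# Toy signals at 50 Hz: a and b share a common component, c is unrelated.
random.seed(1)
shared = [random.gauss(0, 1) for _ in range(3000)]
a = [s + random.gauss(0, 0.5) for s in shared]
b = [s + random.gauss(0, 0.5) for s in shared]
c = [random.gauss(0, 1) for _ in range(3000)]

identical = mean_coherence(a, a, fs=50, nperseg=128)
coupled = mean_coherence(a, b, fs=50, nperseg=128)
uncoupled = mean_coherence(a, c, fs=50, nperseg=128)
```

The shorter window in the toy call only keeps the example fast; the defaults mirror the parameters reported in the text.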
DFA quantifies the noise color of a signal. Briefly, signals can exhibit white noise (random values within a narrow range), pink noise (some degree of predictable patterns; some drift), or Brownian noise (highly predictable pattern; heavy drift). Body sway during passive standing tends to exhibit pink noise (Wang & Yang, 2012). If participants entrain to a rhythmic stimulus, we expect DFA to show values above pink noise, as ancillary body movements become more rhythmic and predictable. The output from DFA is α, which typically ranges from 0.5 (white noise) to 1.5 (Brownian noise) with 1.0 (pink noise) in between. For both coherence and DFA we again used Grubbs’ test to identify outliers.
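A compact Python sketch of how α can be estimated is given below. It is an illustration only: the window sizes, the use of non-overlapping windows, and first-order (linear) detrending are assumptions, as the text does not report these settings. The names linear_fit and dfa_alpha are hypothetical.

```python
import math
import random


def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def dfa_alpha(signal, scales=(16, 32, 64, 128, 256)):
    """Estimate the DFA scaling exponent with linear detrending per window."""
    mean = sum(signal) / len(signal)
    profile, total = [], 0.0
    for value in signal:                 # integrate the mean-centred signal
        total += value - mean
        profile.append(total)
    log_n, log_f = [], []
    for n in scales:
        t = list(range(n))
        sq_resid, count = 0.0, 0
        for start in range(0, len(profile) - n + 1, n):  # non-overlapping windows
            seg = profile[start:start + n]
            slope, intercept = linear_fit(t, seg)
            sq_resid += sum((y - (slope * i + intercept)) ** 2
                            for i, y in zip(t, seg))
            count += n
        log_n.append(math.log(n))
        log_f.append(math.log(math.sqrt(sq_resid / count)))  # RMS fluctuation
    alpha, _ = linear_fit(log_n, log_f)  # slope of log F(n) against log n
    return alpha


# Toy check: white noise and its running sum (Brownian noise).
random.seed(0)
white = [random.gauss(0, 1) for _ in range(4096)]
brown, level = [], 0.0
for increment in white:
    level += increment
    brown.append(level)

alpha_white = dfa_alpha(white)
alpha_brown = dfa_alpha(brown)
```

On these synthetic series the estimate for white noise should fall near 0.5 and the estimate for Brownian noise near 1.5, matching the reference values given in the text.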
Equivalence Test
We used conditional equivalence testing (Campbell & Gustafson, 2018) to address our divergent hypotheses. In traditional hypothesis testing, non-significant test statistics indicate that one should not reject the null hypothesis that two means are equal, but this does not speak to the equivalence of the two or more conditions being compared. In other words, one cannot accept the null hypothesis that two or more means are equal. With conditional equivalence testing, one first uses a standard hypothesis test (in our case, ANOVA). If there are null results in a comparison of two means of interest, and if it is relevant to the hypothesis, one then uses an equivalence test to determine whether the means are statistically equal, or if their relationship is inconclusive with the given data.
The equivalence test we used was the two one-sided tests (TOST) method (Lakens et al., 2018). This involves three basic steps.
First, equivalence bounds [-EQlow, EQhigh] are set. The equivalence bounds form the range of difference scores that are treated as non-significant, so that comparisons falling within this range are considered equal. The bounds are set to include effect sizes that are considered theoretically equal. If this range is not known or there is no theoretical reason to set a particular set of equivalence bounds, then one uses the smallest detectable effect size given the current data distribution and sample size to set the bounds.
Second, one tests whether the difference score of interest falls within the equivalence bounds. This is done by running two one-sided t-tests (also called one-tailed tests), with H01 that the mean group difference between conditions is greater than EQhigh, and H02 that the mean group difference is less than -EQlow. Another way to think of this is as a 90% confidence interval of the estimate of interest (difference scores in this case) that is generated by the two t-tests.
Third, the outcome is interpreted. If the 90% confidence interval of the difference score falls entirely within the equivalence bounds, as indicated by two significant p-values, then we reject the null hypotheses that the difference score is either greater than the high equivalence bound, or less than the low equivalence bound, and declare equivalence. If one t-test is non-significant, the confidence interval will exceed the equivalence bounds, and we declare inconclusive results. If both one-sided t-tests are non-significant, then the original ANOVA comparison was significant (this is just a conceptual example; an equivalence test would be unnecessary in this case since the ANOVA was significant).
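The decision rule above can be sketched for paired difference scores as follows. This is a minimal illustration, not the study's analysis code. To stay dependency-free, the one-sided critical t value is passed in by the caller; 1.699 is the tabled value for alpha = .05 with 29 degrees of freedom, and a Bonferroni-adjusted alpha would need a different critical value. The bounds are given as signed numbers, and the function name tost_paired and the toy data are hypothetical.

```python
import math


def tost_paired(diffs, eq_low, eq_high, t_crit):
    """Equivalence decision for paired differences via two one-sided t-tests.

    t_crit is the one-sided critical t value for the chosen alpha and
    df = n - 1, supplied by the caller (assumption: taken from a t table).
    """
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    se = sd / math.sqrt(n)
    # Confidence interval implied by the two one-sided tests (90% if alpha = .05).
    ci = (mean - t_crit * se, mean + t_crit * se)
    t_lower = (mean - eq_low) / se    # tests H0: difference <= eq_low
    t_upper = (mean - eq_high) / se   # tests H0: difference >= eq_high
    equivalent = t_lower > t_crit and t_upper < -t_crit
    return {"ci": ci, "t_lower": t_lower, "t_upper": t_upper,
            "equivalent": equivalent}


# Toy paired differences for 30 participants (log-transformed units).
diffs = [0.01, -0.01] * 15
wide = tost_paired(diffs, eq_low=-0.02, eq_high=0.02, t_crit=1.699)
narrow = tost_paired(diffs, eq_low=-0.002, eq_high=0.002, t_crit=1.699)
```

With the wide bounds the confidence interval lies inside the bounds and equivalence is declared; with the narrow bounds the same data are inconclusive, which shows how strongly the conclusion depends on how the bounds are set.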
To set our equivalence bounds we used the data-driven smallest detectable effect size method, as we had no theoretical reason to identify a priori non-significant effect sizes for our measures. We considered basing our equivalence bounds for asynchrony on a just noticeable difference (JND) for asynchronous beats, but studies on this topic have had mixed results (Drake & Botte, 1993; Halpern & Darwin, 1982), and a JND for asynchrony would depend on IOI size (Friberg & Sundberg, 1995; Lerens et al., 2014), which is not constant in our stimuli. An asynchrony JND would likely also depend on the acoustical features of a sound (London et al., 2019) and of the room. As such, the smallest detectable effect size method of setting equivalence bounds seemed appropriate. We corrected for multiple comparisons using Bonferroni correction. Figure 3 shows the results of the equivalence tests, with the larger of the two p-values shown for each test.
In addition to the frequentist statistics, Bayes factors were also used to quantify the evidence in favor of the alternative hypothesis over the null hypothesis (BF10). They are reported alongside p-values and are consistent with the results of both the ANOVAs and equivalence tests.
Results
Asynchrony
We first checked whether participants succeeded in the catch-letter task. All participants correctly named all letters, so we believe the task was effective. We then tested for outliers in participants’ asynchrony series due to the catch letters. The SH-ESD test showed, on average, 2.6 outlying asynchrony scores for each participant. This is far fewer than the number of letters that appeared in a trial, and only 5 of 78 total outliers across all participants occurred within 500 ms after a letter appearing. As such, we have little reason to believe the letters influenced asynchronies.
Prior to the asynchrony ANOVA, we used a log10 transform as the average asynchrony scores were positively skewed in the Regular and Amp2 conditions. No participants were outliers. The ANOVA showed no statistically significant differences between any of the four visual cue conditions (Regular, Amp1, Amp2, and Still), F(3, 87) = 1.25, p = .30, η2 = .01, BF10 < 1 for all comparisons (see Figure 2). Therefore, we used a series of equivalence tests to determine if the different condition comparisons were statistically equal, or inconclusive given the current data. This is best summarized visually in Figure 3, top row, which shows the 90% confidence intervals that correspond to each TOST comparison. Intervals within the equivalence bounds are statistically equal. We see that asynchrony was statistically equivalent when comparing the following conditions: Regular to Amp2, Regular to Still, and Amp2 to Still. While only marginally non-significant, the remaining comparisons are considered inconclusive, meaning we cannot conclude a statistical difference or equivalence with the current dataset. The results for individual comparisons were: Reg-Amp1 (tlow (29) = -.003, thigh (29) = .04, p = .04); Reg-Amp2 (tlow (29) = -.04, thigh (29) = .03, p = .001); Reg-Still (tlow (29) = -.02, thigh (29) = .03, p = .002); Amp1-Amp2 (tlow (29) = -.05, thigh (29) = .003, p = .04); Amp1-Still (tlow (29) = -.03, thigh (29) = .008, p = .01); Amp2-Still (tlow (29) = -.02, thigh (29) = .04, p = .003).

Figure 2. The untransformed mean asynchrony scores expressed in ms. Note that the statistical tests used the log10 transformed data but the untransformed distributions are shown here. Error bars represent standard error of the mean.

Figure 3. The equivalence bounds and corresponding TOST results to test for statistical equivalence. Each row corresponds to one of our four dependent variables. Each column corresponds to a particular pair-wise comparison of the four conditions. The error bars represent 90% confidence intervals of difference scores.
P/T Index
The P/T distributions were positively skewed for all conditions, so we used a log10 transform on the data. Three participants were removed as outliers after the transform. The ANOVA showed no statistically significant differences between any of the four visual cue conditions (Regular, Amp1, Amp2, and Still), F(3, 78) = 1.90, p = .14, η2 = .02, BF10 < 1 for all comparisons (see Figure 4). The equivalence tests (Figure 3, second row) showed equivalence for all comparisons: Reg-Amp1 (tlow (26) = -.18, thigh (26) = .13, p < .001); Reg-Amp2 (tlow (26) = -.22, thigh (26) = .08, p = .02); Reg-Still (tlow (26) = -.23, thigh (26) = .01, p = .02); Amp1-Amp2 (tlow (26) = -.23, thigh (26) = .07, p = .04); Amp1-Still (tlow (26) = -.24, thigh (26) = .07, p = .01); Amp2-Still (tlow (26) = -.13, thigh (26) = .12, p < .001).

Figure 4. The non-transformed mean P/T Index scores expressed as a ratio of leading/lagging ARMA coefficients (see Methods). Note that the statistical tests used the log10 transformed data, but the natural distributions are shown here. Error bars represent standard error of the mean.
DFA
DFA distributions were all normal. No participants were identified as outliers. DFA values were generally slightly above 1.0 (Figure 5), and within the range observed in previous work on ancillary motion (Colley et al., 2018). The ANOVA did not yield significant effects between any of the four visual cue conditions (Regular, Amp1, Amp2, and Still), F(3, 87) = 1.90, p = .14, η2 = .004, BF10 < 1 for all comparisons. The equivalence tests (Figure 3) showed the following statistical equivalences: Reg-Amp1 (tlow (29) = -.03, thigh (29) = .007, p = .01); Reg-Amp2 (tlow (29) = -.03, thigh (29) = .01, p < .001); Reg-Still (tlow (29) = -.009, thigh (29) = .03, p = .01); Amp1-Amp2 (tlow (29) = -.02, thigh (29) = .02, p < .001); Amp2-Still (tlow (29) = -.006, thigh (29) = .04, p = .02). There was one inconclusive comparison, Amp1-Still (tlow (29) = .002, thigh (29) = .04, p = .09).

Figure 5. The mean DFA scores of participants.
Coherence
The distributions for cross-spectral coherence were normal, and there were no outliers. Coherence values were generally between 0.5 and 0.6 (Figure 6), which is in line with previous work (Colley et al., 2020). The ANOVA was significant, F(3, 87) = 531, p < .001, η2 = .77. A Bonferroni-corrected post-hoc test showed that the pseudo-pair control condition showed lower coherence than all other conditions. There were no other statistical differences. The difference between the pseudo-pair and other conditions was reflected in the Bayes factor as well, BF10 > 10^18 (all other comparisons had BF10 < 1). The equivalence test (Figure 3) reflected this: there were statistical equivalences for all comparisons of Regular, Amp1, and Amp2: Reg-Amp1 (tlow (29) = -.02, thigh (29) = .03, p < .001); Reg-Amp2 (tlow (29) = -.02, thigh (29) = .02, p < .001); Amp1-Amp2 (tlow (29) = -.02, thigh (29) = .01, p < .001). However, when Regular, Amp1, or Amp2 conditions were compared to the pseudo-pair control, the confidence intervals were well above the equivalence bounds: Reg-Control (tlow (29) = .44, thigh (26) = .53, p = .1); Amp1-Control (tlow (29) = .43, thigh (29) = .52, p = 1); Amp2-Control (tlow (29) = .44, thigh (29) = .53, p = 1).

Figure 6. The mean cross-spectral coherence scores between participants and the avatar (or a pseudo-pair).
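The contrast between real pairs and the pseudo-pair control can be illustrated with a small, self-contained Python sketch of magnitude-squared coherence. This is a sketch under stated assumptions rather than the pipeline used in this study: the segment-averaging scheme (Hann-windowed, half-overlapping segments), the segment length, and the synthetic signals are all hypothetical choices made for illustration.

```python
import cmath
import math
import random


def _spectrum(segment):
    """DFT (bins 0..n/2) of one mean-removed, Hann-windowed segment."""
    n = len(segment)
    mean = sum(segment) / n
    windowed = [
        (v - mean) * (0.5 - 0.5 * math.cos(2 * math.pi * t / (n - 1)))
        for t, v in enumerate(segment)
    ]
    return [
        sum(w * cmath.exp(-2j * math.pi * k * t / n) for t, w in enumerate(windowed))
        for k in range(n // 2 + 1)
    ]


def coherence(x, y, seg_len=64):
    """Magnitude-squared coherence for frequency bins 1..seg_len/2.

    Auto- and cross-spectra are averaged over half-overlapping segments
    before the ratio |Pxy|^2 / (Pxx * Pyy) is taken.
    """
    n_bins = seg_len // 2 + 1
    pxx = [0.0] * n_bins
    pyy = [0.0] * n_bins
    pxy = [0j] * n_bins
    usable = min(len(x), len(y))
    for start in range(0, usable - seg_len + 1, seg_len // 2):
        fx = _spectrum(x[start:start + seg_len])
        fy = _spectrum(y[start:start + seg_len])
        for k in range(n_bins):
            pxx[k] += abs(fx[k]) ** 2
            pyy[k] += abs(fy[k]) ** 2
            pxy[k] += fx[k] * fy[k].conjugate()
    # Bin 0 (the mean) is skipped because the mean was removed per segment.
    return [abs(pxy[k]) ** 2 / (pxx[k] * pyy[k]) for k in range(1, n_bins)]


def mean(values):
    return sum(values) / len(values)


# Synthetic signals (assumed for illustration, not data from the study).
random.seed(1)
shared = [random.gauss(0.0, 1.0) for _ in range(512)]
avatar = [s + 0.5 * random.gauss(0.0, 1.0) for s in shared]
participant = [s + 0.5 * random.gauss(0.0, 1.0) for s in shared]
unrelated = [random.gauss(0.0, 1.0) for _ in range(512)]  # stands in for a pseudo-pair

print("real pair:  ", round(mean(coherence(participant, avatar)), 2))
print("pseudo-pair:", round(mean(coherence(participant, unrelated)), 2))
```

In this toy example, two signals that share a common component show clearly higher mean coherence than a signal paired with an unrelated one, which is the pattern that distinguishes the avatar conditions from the pseudo-pair control.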
Discussion
This experiment investigated the role of movement amplitude of a visual stimulus in facilitating musical synchronization and influencing ancillary movements. The visual stimulus whose amplitude we manipulated was a high-performing virtual co-performer (a motion capture avatar). The rationale for this is that a co-performer can be beneficial to a partner if the co-performer is good at the task (Pecenka & Keller, 2011). Additionally, higher amplitudes of movement that are timed to a fixed musical sequence produce higher velocities (by covering more distance in the same time), which have been shown to improve musical synchronization (Colley et al., 2018). Given mixed prior results on movement amplitude and visuomotor synchronization, we advanced two hypotheses: if overall musical synchrony (i.e., instructed and uninstructed movements) is influenced by the amplitude of co-performer movements, then higher amplitudes of stimulus movement will result in lower asynchrony and higher coherence; alternatively, if musical synchrony is not influenced by the amplitude of a co-performer, then higher amplitudes of stimulus movement will not produce differences in our dependent measures. We also considered the determinism of ancillary movements (DFA), which is not a measure of synchrony but quantifies the extent to which movements are predictable. If stimulus amplitude influences movements, then we would expect larger amplitudes to produce higher DFA values, as movements linked to the musical structure would be relatively predictable. If stimulus amplitude does not influence movements, then we would expect no difference in DFA values across amplitude conditions.
Overall, our results suggest that there is no reliable effect of movement amplitude of a visual stimulus on synchronization accuracy, predictive timing, ancillary movement fluctuations, or the synchrony of ancillary movements between the participant and the avatar. A number of comparisons between the moving visual stimulus conditions were statistically equivalent, suggesting that our amplitude manipulation produced three effectively identical stimuli (despite physical differences in the visual displays), and so we have greater support for our second set of hypotheses. What is surprising is that the movement conditions were generally no different from the control condition, in which participants observed a still image. The exception to this was the cross-spectral coherence measure, which showed higher coherence between participants’ head movements and the moving avatars’ head movements than between participants’ head movements and a copy of their own movements from another trial (a pseudo-pair). This finding, alongside the apparent success of the catch letters, suggests that participants were not ignoring the visual display. If they were not observing the visual cues, then their ancillary coherence in the experimental trials would likely resemble the coherence from the pseudo-pair control. Note that in the non-pseudo-pair conditions, mean coherence was around 0.6, which is a moderately high degree of coordination. This is likely because the stimuli were rhythmic, providing some degree of predictability for corresponding body movements.
First, we will discuss the drumming dependent variables: asynchrony and P/T Index. It seems that the instructed synchronization of our participants was not affected by the moving visual cues, even compared to a still image visual cue. This could be due to participants’ generally limited musical training, which was reflected in the average absolute asynchrony across conditions (about 45 ms, compared to 25 ms for the highly synchronized individuals used in creating the avatar). This is consistent with another synchronization study that tested nonmusicians with similar tempo-changing stimuli (Mills et al., 2015). Expertise also appears to shape how movement is perceived: motor experts (people with experience executing deliberate movements in a given domain) tend to be more perceptually sensitive to gross body movements in their domain. Basketball players predict shot success better than referees, who typically observe but do not play the game (Aglioti et al., 2008). Similarly, violinists predict tone onsets better than musicians who play other instruments when observing video of a violinist performing a cueing motion, a movement meant to help observers predict a tone onset (Wöllner & Canal-Bruland, 2010). More recent work has shown that gestures can effectively convey a beat and tempo in musical duos, but only expert musicians were tested, and musicians with more ensemble experience synchronized better (Bishop & Goebl, 2018a). In another study, musicians were generally able to perceive audiovisual asynchronies in musical performance videos, but pianists showed more perceptual sensitivity when observing other pianists (Bishop & Goebl, 2018b). Given the results of these studies, musical expertise may be beneficial for integrating temporal information from a moving body.
Furthermore, musicians in one study looked at the conductor only 28% of the time during the performance of a piece of music, and each glance was less than 1 s in duration (Fredrickson, 1994), suggesting that they have developed the ability to extract temporal information from brief glances. Only three of our participants had extensive musical training, and only two had ensemble training, meaning the sample consisted mostly of nonmusicians. The three musicians’ asynchrony scores were among the four lowest values in the sample, so they performed well relative to the rest of the sample. However, they did not qualify as outliers, so we have no reason to treat them as a separate group. Furthermore, removing the three musicians from the sample (resulting in N = 27) did not change the significance of the results of the hypothesis tests. As such, the participants may have observed the stimuli as instructed, but may not have been able to extract relevant temporal information from a full upper-body display, which had multiple moving parts. In other words, participants did not have experience using a complex rhythmic stimulus to form a temporal prediction.
Expanding on this, a previous study showed that a video of a conductor (from the waist up, similar to our avatars) yielded more precise tapping than a video of a metronome for musicians, but not nonmusicians (Ono et al., 2015). Both groups performed the same in the metronome condition, perhaps because the metronome had a single moving part that corresponds directly to the beat. In the same study, neural activation in the superior frontal gyrus correlated positively with the amount of time spent practicing with a conductor. Another study (Colley et al., 2018) showed that both musicians and nonmusicians benefitted from a virtual conductor, which was presented as a single moving circle. This suggests that visual cues for instructed synchronization are most effective for the general population if they are kept simple (i.e., one moving part). Complex whole-body movements likely require training to analyze in real time. Indeed, it has been shown that body movements exhibit multiple periodicities when dancing (Burger et al., 2014; Su, 2016), and that tracking multiple moving objects simultaneously complicates action prediction (Atmaca et al., 2013). Thus, segments of the body that move in relation to a musical beat might be perceived as individually moving parts rather than as a whole phase-locked system, which in turn might diminish the value of a visual cue. Future studies might test this explicitly by manipulating the number of visible limbs/moving parts in an avatar, and comparing synchronization performance between ensemble musicians, solo musicians, and nonmusicians.
Our motion capture results reinforced one common finding: individuals tend to entrain their movements to a visual rhythm (Clayton, 2007; Kotz et al., 2014; Schmidt et al., 2007; Schmidt & Turvey, 1994; Varlet et al., 2015). However, this uninstructed visuomotor entrainment of the head does not appear to be increased by the amplitude of the visual stimulus, at least in a multisensory context such as music performance. Again, the effect of stimulus amplitude on synchrony may be a matter of expertise, such that experienced ensemble musicians would be more likely to show greater ancillary movement coherence with the amplified avatars, particularly if auditory stimuli were removed (Goebl & Palmer, 2009). Alternatively, an effect of stimulus amplitude on synchrony in a musical task might be more prominent among pairs of live co-performers, without any virtual avatar, as suggested by relevant findings in dyadic synchronization tasks (Colley et al., 2020; Goebl & Palmer, 2009; Keller & Appel, 2010). As for the fluctuations of movements as measured by DFA, there was no difference across conditions. Importantly, participants’ DFA scores for all conditions were centered just above 1.0, suggesting that people tended to move with little more structure than passive standing balance (Blázquez et al., 2009). We had expected the amplitude manipulation to increase DFA scores, indicating more rhythmically structured movements of the participants. If our participants were in fact unable to extract temporal information from the avatars, then they may have neglected the visual information entirely because they deemed it unreliable (Elliott et al., 2010).
In general, it is possible that our participants exhibited ceiling effects in their behavior. To address this in future work, task difficulty could be increased by introducing large-scale discontinuous tempo changes and pauses into the musical pacing signal. Studies of musical duo performance have shown that the benefits of visual cues are enhanced in the presence of such features (e.g., Bishop et al., 2019; Kawase, 2014). Larger amplitude movements of an avatar may therefore be beneficial when auditory cues are characterized by greater temporal uncertainty than was the case in our study.
On the topic of movement amplitudes, the amplitude manipulation in this experiment was not natural, in that we edited the visual recordings to exaggerate the movements. As humans are especially sensitive to biological/natural movements (e.g., Ueda et al., 2018), the unnatural manipulation may have reduced processing efficiency or even caused the motion cues not to be processed as behaviorally relevant signals. Previous research has found that perceptual judgments of performer identity in dancing avatars are influenced by variations in movement amplitude induced by asking the models to dance expressively versus unexpressively (Sevdalis & Keller, 2011). Future studies on sensorimotor synchronization could take a similar approach by directly recording model participants who drum at different amplitudes instead of using artificial modulations.
Finally, it should be noted that our sample came from a healthy population. However, an individual’s ability to control periodic movements can be impaired by a motor disorder such as Parkinson’s Disease (Hove et al., 2012; Nombela et al., 2013). Research on rehabilitation in Parkinson’s Disease has shown that external rhythmic cues, both auditory and visual, can restore some functionality to patients (Ghai et al., 2018; Hove & Keller, 2015). The moving visual stimuli presented in this experiment might therefore provide some benefit to patients with movement disorders, even though healthy participants received no advantage relative to the control stimulus.
To conclude, our finding that co-performer movement amplitude did not have reliable effects on instructed or uninstructed synchronization suggests that this specific visual cue might not be functionally relevant to basic aspects of interpersonal timing in musical contexts, at least in samples of individuals with little musical training. We draw this conclusion based not only on statistically non-significant differences, but on several statistically equivalent comparisons as well. Future studies of visuomotor and audio-visuomotor synchronization should consider the influence of expertise, especially in musical synchronization. Other possible variables of interest are the complexity or richness of the musical material (e.g., the potential for expressive variation) and stimulus movement (as measured by the number of moving parts or distinct movement frequencies). Musical expertise and complexity may be influential to the extent that ancillary movements play a greater role in providing cues for flexibly aligning expressive performance parameters than in facilitating strictly synchronized timing (Keller, 2014).
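The logic of drawing conclusions from both non-significant and statistically equivalent comparisons can be sketched in a few lines of Python. This is a simplified illustration, not the procedure as run in this study: it substitutes a paired t-test for the ANOVA and post-hoc tests, expresses the two one-sided tests as a 90% confidence interval, and uses a hypothetical equivalence bound and synthetic difference scores. The default critical values are the standard t quantiles for 29 degrees of freedom (2.045 two-sided and 1.699 one-sided at α = .05).

```python
import math


def conditional_equivalence(diffs, bound,
                            t_crit_two_sided=2.045, t_crit_one_sided=1.699):
    """Classify a paired comparison as 'different', 'equivalent', or 'inconclusive'.

    diffs: per-participant difference scores between two conditions.
    bound: half-width of a symmetric equivalence region (hypothetical value).
    The default critical values assume df = 29, i.e. 30 participants.
    """
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    se = sd / math.sqrt(n)

    # Step 1: standard two-sided paired t-test at alpha = .05.
    if abs(mean / se) > t_crit_two_sided:
        return "different"

    # Step 2: only if step 1 is non-significant, run the equivalence test.
    # Two one-sided tests at alpha = .05 are equivalent to checking that
    # the 90% confidence interval lies entirely inside (-bound, +bound).
    low = mean - t_crit_one_sided * se
    high = mean + t_crit_one_sided * se
    if -bound < low and high < bound:
        return "equivalent"
    return "inconclusive"


# Synthetic difference scores for 30 participants (assumed for illustration).
tight = [0.01 * math.sin(i) for i in range(30)]        # tiny, centred near zero
shifted = [0.5 + 0.01 * math.sin(i) for i in range(30)]  # large systematic shift
noisy = [0.02 + 0.1 * math.sin(i) for i in range(30)]    # small shift, wide spread

print(conditional_equivalence(tight, bound=0.03))    # equivalent
print(conditional_equivalence(shifted, bound=0.03))  # different
print(conditional_equivalence(noisy, bound=0.03))    # inconclusive
```

The third case shows why an inconclusive outcome is distinct from the other two: the difference is not significant, yet the confidence interval is too wide to fit within the equivalence bounds.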
Footnotes
Contributorship
IC conceived the study, prepared the stimuli, ran the experiment, analyzed the data, and drafted the manuscript. PK, MV, and JM were involved in the conception of the study. They also advised on data analysis and interpretation, and reviewed and provided feedback on each version of the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Action editor
Andrew Goldman, Indiana University, Department of Music Theory, Jacobs School of Music.
Peer review
Birgitta Burger, Universität Hamburg, Institut für Systematische Musikwissenschaft.
Two anonymous reviewers.
