Abstract
A “rhythmic agent” is simulated on the foundation of a previously published behavioral sensorimotor synchronization (SMS) model. The model is adjustable to control the auditory and tactile modalities of the tap's feedback. In addition to the conventional mechanisms of phase and period error correction, as well as their activation conditions, the period is estimated by modeling a central timekeeper impacted by a novel short-term memory. Inspired by the ADaptation and Anticipation Model (ADAM), a mechanism for linearly extrapolating anticipation is also tested. To better match the perceptual and motor cognitive functions, the model's parameters have been tuned to observations from experimental neurosensory literature with an emphasis on transduction delays. The agent is programmed to synchronize with various external rhythmic input signals while accounting for both adaptive and predictive mechanisms. The definition of the agent is based on a minimal set of rules yet has successfully replicated results of real-world observations: against a metronome, it produces the well-known negative mean asynchrony. The simulation of joint delayed coordination shows a behavior previously observed in human subjects: in a rhythmic collaboration, a moderate amount of delay is necessary to keep the tempo steady, and below that threshold, the rhythm tends to speed up. It is also shown that giving more weight to the tactile afferent feedback than to the auditory feedback intensifies this effect. Moreover, it is observed that including anticipation in addition to the reactive mechanism decreases the effect. The proposed model as a rhythmic engine, combined with other standard modules such as a beat detection algorithm, can be used to implement musical co-performers that could improvise with a human rhythmically or perform a given score in a way that feels human-like.
Introduction
Behavioral studies in sensorimotor synchronization (SMS) have a long history (Blumenthal, 1975; Michon, 1967; Repp, 2005; Repp & Su, 2013) and, in general, are more mature than neurocognitive approaches to studying rhythm. To let the behavioral approaches benefit from existing knowledge in neuroscience and facilitate interdisciplinary discussions, it is valuable to connect the two bodies of knowledge (Buhusi & Meck, 2005). To explain the mechanisms underlying rhythmic behavior, recent neurocognitive studies have started to frame results from behavioral experiments (Keller et al., 2014; Nozaradan et al., 2018; Schultz et al., 2021; Schwartze et al., 2011). We will attempt to make such a link by looking at classic models of SMS while considering the peripheral properties of the body, such as transduction delays across auditory and tactile modalities. This work is a mathematical attempt to simulate the mechanisms behind synchronizing motor actions with sensory events in time. First, we provide a literature review of different approaches relevant to computational SMS. Then, a minimal set of rules is used to implement the structure of the agent and its algorithmic function, informed by the two bodies of knowledge mentioned earlier, neurosensory studies and behavioral SMS. Finally, the rhythmic agent is tested against different inputs, such as responding to a step change in tempo or performing with another agent. Where available, the results are compared with known experiments involving humans. The aim is to replicate the results of some known SMS experiments observed in previous research, namely the agent's behavior in response to a simple metronome, a sudden tempo change, and rhythmic joint collaboration.
Traditionally, two main theoretical approaches are distinguished in the SMS literature (Repp, 2005): the information-processing approach usually models responses with event-based discrete time series and focuses on cycle-to-cycle error corrections. On the other hand, approaches inspired by dynamic systems theory (Large, 2008) represent movement as a trajectory in phase space and deal with continuous, nonlinear, and within-cycle coupling (Repp, 2005).
In the former approach, synchronization to external rhythmic stimuli is typically controlled by two error correction processes which are “asynchrony-based” and “interval-based” (Schulze et al., 2005). The first process, the phase correction mechanism, corrects phase error (asynchrony), is considered mostly automatic and unconscious, and does not affect the tempo (Repp, 2001a; Repp & Penel, 2002). The latter process, period correction, is usually intentional and deals with the discrepancy, that is, errors in intervals and changes in tempo (Repp & Su, 2013).
Recently, hybrid models have evolved to incorporate elements from each approach, such as following the classic adaptive formulation of phase-correction, while modeling period correction dynamically (Loehr et al., 2011). Models based on continuous-time dynamical systems can also incorporate event-based error correction rules (Large et al., 2023). Since the classical pacemaker accumulator models do not reveal the neural mechanisms of counting pulses (Zemlianova et al., 2022), continuous-time neuromechanistic models combine error-correction with neuronal entrainment concepts by achieving internally generated timings from parameters of a dynamical neuronal system. In a biophysically-based neuronal framework, Bose et al. (2019) showed that a neuronal-level oscillator could learn both the period and phase of an external isochronous rhythm by utilizing discrete clocks formed by gamma rhythms and synchronizing spike times to achieve rhythmic timekeeping across a range of musically-relevant frequencies. Byrne et al. (2020) proposed a neural system that could adapt its oscillatory behavior through iterative error-correction of internal parameters, described by a two-dimensional event-based map.
In this work, in line with the former, information-processing approach, we will base the formulation of a rhythmic agent on a cycle-to-cycle design. We assume a central timekeeper to keep track of time events and intervals and apply linear error correction of phase and period to synchronize the motor commands to an external stimulus sequence (Vorberg & Schulze, 2002; Vorberg & Wing, 1996). The terminology and the choice of variables primarily follow the model Mates developed in his twin papers, explaining the synchronization mechanism between motor actions and sensory events (Mates, 1994a, 1994b). In addition to the conventional adaptive/reactive mechanisms of phase and period error correction, we have incorporated a separate mechanism of period estimation based on the impact of a novel, optional, short-term memory on the central timekeeper. Following the framework laid out by Van Der Steen and Keller (2013) in their ADaptation and Anticipation Model (ADAM), we have also tested the anticipatory mechanisms involved in SMS through linear extrapolation. We have chosen the model's parameters based on experimental literature, drawing on recent anatomical and neurosensory studies to quantify the model constants. In contrast with the recent neuromechanistic models, instead of investigating the dynamics of beat generation at a neuronal level, we will focus on the biophysical properties of transduction delay.
The Model
The agent's architecture in Figure 1 is inspired by the anatomical structure of a human performer and the connections between the parts involved in the synchronization task. The input sequence,

The structural architecture of the SMS agent used in the simulation.
Variables involved in the structure of the SMS agent.
Description of the Variables
Mates’ notation is widely used in SMS research and is followed in this paper too. He described the temporal data from an SMS task either by event variables (“reading of a clock”), denoted by capital letters, or interval variables (temporal differences between two events) symbolized by lower-case letters. For example,
The Internal Representation of the External Events
Concerning the notion of the dedicated timekeeper in the information-processing framework, external and objective temporal events are assumed to have internal representations in the central nervous system (CNS). The distinction between the internal and external events in neurocognitive studies has roots in the “perception latency hypothesis”: there is a delay between stimulus input and the temporal availability of its representation in the CNS (Pöppel et al., 1990). Literature indicates two main theories to explain such delays in perception: the “nerve conduction hypothesis” (Paillard–Fraisse hypothesis) and the “sensory accumulator model” (SAM). The nerve conduction hypothesis accounts for neural transmission delay of the sensory information as the primary source of the perceived latency (Aschersleben, 2002). Alternatively, the more comprehensive SAM attempts to explain such latencies based on the central processing time of perceptual information (Fraisse, 1980; Repp & Su, 2013), instead of the peripheral conduction time. The simulation in this study is more in line with the nerve conduction hypothesis, where such lags are the results of constant transduction delays. Therefore, the neural transmission delay of the sensory input or other time-consuming processes of information is reduced to a constant temporal delay that bridges between the external variables and their central representation. Small boxes in Figure 1 represent these delays. We will attempt to quantify them with a value, or a range of values, informed by experimental literature. Note that in an SMS task that does not incorporate the visual channel, the transduction delay can appear only in auditory and tactile forms.
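The constant-delay assumption described above can be illustrated with a minimal Python sketch. The function name and the two delay values are hypothetical placeholders chosen only for illustration (with the tactile delay longer than the auditory one); they are not the values used in the model.

```python
# Hypothetical placeholder delays (ms); NOT the values used in the paper.
AUDITORY_DELAY_MS = 10.0
TACTILE_DELAY_MS = 30.0


def internal_time(external_time_ms: float, delay_ms: float) -> float:
    """Central (internal) time of an external event under a constant delay."""
    return external_time_ms + delay_ms


# A click and a tap that coincide on the external clock...
click_external = 1000.0
tap_external = 1000.0

# ...are registered centrally at different times.
click_internal = internal_time(click_external, AUDITORY_DELAY_MS)
tap_internal = internal_time(tap_external, TACTILE_DELAY_MS)
print(click_internal, tap_internal)  # 1010.0 1030.0
```

Under these assumed values, two externally simultaneous events are registered 20 ms apart, which is the kind of mismatch between external and internal asynchrony that the constant-delay view implies.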
In Mates’ terminology, both events,
The response
Roman et al. (2019) have accounted for the presence of auditory feedback from one's own produced onsets, using hypotheses similar to those of the current work, and by simulating an oscillator receiving its own delayed activity as input. However,
The tactile feedback element, on the other hand, is delayed by a tactile transduction delay of
The two afferent, auditory and tactile, representations of a response, respectively delayed by the values
Initiation of the Next Motor Command
After presenting the agent's structure, we discuss its function by modeling the inner working of the CNS box in Figure 1. CNS is the main component in the modeling of the motor system's dynamic behavior in planning and control (Wolpert et al., 1995). An information-processing approach assumes that the output of this component, the trigger of the motor act, is merely determined by the received stimuli and performed responses, or in our model from their internal representations. In this section, we explain the algorithmic inner working of the model of the CNS.
The CNS module in Figure 1 calculates the next motor trigger as a function of previously observed stimulus and response streams of data. This function defines the SMS model in use and is usually expressed in terms of external events. Here, we present it based on historical values of two sequences of internal events, the feedback from an already performed sequence,

Timeline of the events and intervals used to decide the motor trigger,
Central Interval
In a one-to-one SMS task,
Variables involved in the function of the SMS agent.
With central timekeeper interval being
Tranchant et al. (2022) make a distinction in this regard between musicians and musically untrained individuals. They show that in non-musicians, relying more heavily on the innate spontaneous production rates (lower
Short-Term Memory
In musical terminology, tempo is defined as the speed or pace of a given rhythmic piece. For an isochronous sequence of stimuli
The central interval used here still does not account for the role of short-term memory, since it is only based on the last interval.
Several memory models incorporate a “decaying factor” to explain how information fades in short-term memory with time. Exponential decay is an arbitrary function used to represent such decline in the probability of information retrieval (Atkinson & Shiffrin, 1968) or in remembering a sequence of numbers in short-term memory (Shepard & Teghtsoonian, 1961). This function can also be observed in the context of auditory memory, such as in the loudness of a recently heard tone in short-term memory (Lu & Sperling, 2003) or the recurrence frequency of a song as involuntary musical imagery (Byron & Fowles, 2015). Although we did not discover a specific source detailing a time-based exponential decay for weightings of recent intervals in tempo inference, we were inspired by its appearance in other contexts and generalized our model by incorporating the role of short-term memory in accounting for the current tempo.
To implement, we took the n most recent IOIs as a vector and their weighted average as another vector with the same length n, called memory vector,
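One way to picture an exponentially decaying memory over recent intervals is the following Python sketch. It is an illustration under stated assumptions, not the model's own formulation: the function name, the `decay` parameter, and the choice to return a single scalar estimate are all hypothetical.

```python
import math


def memory_weighted_interval(iois_ms, decay=0.5):
    """Weighted average of recent IOIs (oldest first, most recent last).

    Weights decay exponentially with the age of an interval, so the most
    recent interval counts the most. `decay` is a hypothetical parameter:
    0 gives a plain mean, large values approach the last interval alone.
    """
    if not iois_ms:
        raise ValueError("need at least one interval")
    n = len(iois_ms)
    # Age 0 is the most recent interval, age n-1 the oldest one.
    weights = [math.exp(-decay * (n - 1 - k)) for k in range(n)]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, iois_ms)) / total


# A sequence that slowed from 500 ms to 600 ms intervals.
estimate = memory_weighted_interval([500.0, 500.0, 600.0], decay=1.0)
print(round(estimate, 1))
```

With a non-zero decay the estimate lies between the plain mean of the intervals and the most recent interval, so the remembered tempo follows a change gradually rather than jumping to the last value.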
Calculating the Planned Interval
The default interval of equations (8) and (9) can take the value of the last received ISI, that is,
The correction term,
Next, to account for the correction of mismatch between the performed and received sequences, we will define error variables and correction mechanisms that attempt to correct these errors.
Phase Error and its Correction
The temporal mismatch between stimulus and response is called asynchrony, synchronization error, or phase error. The time difference between the corresponding stimulus and response variables reflects the external asynchrony, denoted with the
Period Error and its Correction
In addition to the asynchrony between event variables, another type of error measures the mismatch between stimulus and response intervals and gives a fundamentally different sense of error. Called period error or discrepancy, this error is typically derived from the temporal difference between ISI and IRI, that is,
Note that in our model, the period estimation of the central interval based on the degree of attention (
Combining Dual Correction Processes
Mates (1994a, 1994b) assumed that correction for synchronization errors is made directly on the timing of the motor output and is independent of corrections for period errors. The phase correction decides the next tap and the period correction determines the next time interval, thereby applying both terms in the same equation.
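The way the two correction terms can enter a single update is sketched below in Python. This is a generic linear dual-correction step written for illustration only; the function name, argument names, the period gain `beta`, and the gain values are assumptions and should not be read as the model's exact equations.

```python
def next_tap(prev_tap_ms, timekeeper_ms, asynchrony_ms,
             stimulus_interval_ms, alpha=0.5, beta=0.2):
    """One cycle of a generic linear dual error correction (illustrative).

    alpha: phase correction gain, applied to the last asynchrony.
    beta:  period correction gain (hypothetical name), applied to the
           mismatch between the timekeeper and the stimulus interval.
    """
    # Period correction: pull the timekeeper interval toward the stimulus.
    new_timekeeper = timekeeper_ms - beta * (timekeeper_ms - stimulus_interval_ms)
    # Phase correction: shift the next tap against the last asynchrony.
    tap_ms = prev_tap_ms + new_timekeeper - alpha * asynchrony_ms
    return tap_ms, new_timekeeper


# The agent tapped 20 ms late; the timekeeper already matches the stimulus.
tap, timekeeper = next_tap(1000.0, 600.0, 20.0, 600.0)
print(tap, timekeeper)  # 1590.0 600.0
```

In this example only the phase term is active: the next tap is advanced by half of the 20 ms lag while the timekeeper interval stays at 600 ms, which mirrors the idea that phase correction does not by itself alter the tempo.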
The threshold for detecting asynchrony in the auditory domain is reported to be under 10 ms (Lauzon et al., 2020), although other reports cite values between 15 and 20 ms for trained subjects and 60 ms for untrained subjects (Babkoff, 1975). Although such values for conscious detection of asynchronies are reported, various experiments have also shown that phase correction can operate below these thresholds. Repp (2001a) argues that subliminal asynchronies, even well below the level of awareness, can still be perceptually registered and utilized in the correction process, as such control mechanisms may involve lower-level, old brain structures such as the cerebellum, which do not require conscious awareness (Ivry, 1997). While a theoretical lower bound for registering asynchronies is below conscious awareness, we did not find evidence to set it to zero. Therefore, for the phase error correction to be activated in our model, an asynchrony still needs to be registered above a minimal theoretical threshold, even if it is below the awareness threshold and not consciously detected. Mathematically speaking, if the central representations of the stimulus and the response take place temporally closer to each other than a certain threshold, δphase, the corresponding process is switched off by setting the phase correction gain to zero, α = 0. This means the synchronization error is within an asynchrony tolerance threshold and will not be registered in the model. While this threshold can be set to zero, δphase = 0, to simplify the model, we take the lower bound of the values reported for the asynchrony detection threshold, i.e., δphase ≃ 10 ms. Above this adjustable value, the model will register the phase error and correct for it, although it may still be below conscious awareness.
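The activation condition described above can be written as a small Python sketch. The function and argument names are hypothetical, and the default gain of 0.5 is an arbitrary illustrative value; only the 10 ms threshold and the rule of setting α to zero inside the tolerance come from the text.

```python
def effective_phase_gain(stimulus_internal_ms, response_internal_ms,
                         alpha=0.5, delta_phase_ms=10.0):
    """Return the phase correction gain actually applied on this cycle.

    If the central representations of stimulus and response are closer
    than delta_phase_ms, the asynchrony is not registered and the gain
    is set to zero; otherwise the nominal gain alpha is used.
    """
    asynchrony = response_internal_ms - stimulus_internal_ms
    if abs(asynchrony) < delta_phase_ms:
        return 0.0
    return alpha


print(effective_phase_gain(1000.0, 1005.0))  # 0.0 (5 ms, inside tolerance)
print(effective_phase_gain(1000.0, 985.0))   # 0.5 (15 ms early, corrected)
```

Setting `delta_phase_ms=0.0` recovers the simplified case in which every non-zero asynchrony is corrected.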
Anticipation
The dual error processes, described in sections “Phase Error and its Correction” to “Combining Dual Correction Processes,” have traditionally been studied as the major models in SMS. In addition to these reactive models, more attention has recently been paid to predictive models that attempt to describe how individuals can extract and predict a sequential pattern from the stimulus train (Schubotz, 2007). In a modular approach proposed by Wolpert and Kawato (1998), the distinction between reactive and anticipatory processes is modeled by inverse (controller) or forward (predictor) models. Forward models represent the causal relationship between the input and output of the SMS agent. Given the system's current state, they predict the effect a particular motor command will have upon the body and the dynamic environment. Inverse models, on the other hand, provide the motor command that is necessary to produce a desired change in state of the body and the environment. By showing how the auditory environment may trigger involuntary action in the absence of prediction, Schultz et al. (2021) suggest that predictive and reactive audio-motor integration mechanisms could operate independently or interactively to optimize human behavior.
Van Der Steen and Keller (2013) have defined two different modules: an ADaptation module to implement the reactive mechanisms, and an Anticipation module to account for predictive mechanisms. Anticipation in their ADA model, or ADAM, works based on a temporal extrapolation process that generates a prediction about the timing of the participant's next tap based on the most recent series of IOIs. Extending systematic patterns of tempo changes enables this module to model tempo accelerations, unlike the reactive processes. For example, a decelerating sequence with increasing intervals leads to a prediction that the next response will occur after an even longer interval. We use a linear regression for the last m values of the central interval,
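Linear extrapolation over the last m intervals can be sketched in Python as an ordinary least-squares line fitted to the recent values and evaluated one step ahead. The function name and the default m = 3 are assumptions for illustration; the sketch does not claim to match the model's exact regression setup.

```python
def extrapolate_next_interval(intervals_ms, m=3):
    """Predict the next interval by fitting a line to the last m values."""
    if not intervals_ms:
        raise ValueError("need at least one interval")
    recent = list(intervals_ms[-m:])
    k = len(recent)
    if k < 2:
        return recent[-1]  # nothing to extrapolate from a single value
    # Least-squares fit of interval against index 0..k-1.
    mean_x = (k - 1) / 2
    mean_y = sum(recent) / k
    sxx = sum((x - mean_x) ** 2 for x in range(k))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(recent))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Evaluate the fitted line at the next index, k.
    return intercept + slope * k


# A decelerating sequence: each interval is 10 ms longer than the last.
print(extrapolate_next_interval([500.0, 510.0, 520.0]))  # 530.0
```

For the decelerating example the prediction is an even longer interval (530 ms), whereas a purely reactive rule based on the last interval alone would plan for 520 ms.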
Real-World Range of Intervals
Humans are able to perceive rhythms in the range of 0.5 to 8 Hz, with optimal beat perception around 2 Hz (Repp, 2005). The interval time range involved in real-world scenarios of rhythmic SMS, such as playing music in an ensemble, finger tapping to an external beat, or preferred rates of self-paced tapping, is typically on the order of a few hundred milliseconds. Drake et al. (2000) report a preferred inter-tap interval of about 500 ms in self-paced, isochronous tapping. A similar preferred IRI of 600 ms has also been reported (Collyer et al., 1997; Fraisse, 1982). Etani et al. (2018) reported the optimal tempo for groove-based music to induce body movements to be around 100–120 bpm, corresponding to IRIs of 500–600 ms. McAuley et al. (2006) also reported that participants tend to prefer tapping at an IOI of around 600 ms when they can choose freely. With respect to these preferred ranges, the agent will be set to start the performance with an initial tempo (
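The correspondence between tempo in beats per minute, interval in milliseconds, and rate in hertz used in this section follows from simple arithmetic, sketched here in Python with hypothetical helper names.

```python
def bpm_to_interval_ms(bpm: float) -> float:
    """Interval between beats, in ms, for a tempo given in beats per minute."""
    return 60000.0 / bpm


def hz_to_interval_ms(rate_hz: float) -> float:
    """Interval between events, in ms, for a rate given in hertz."""
    return 1000.0 / rate_hz


print(bpm_to_interval_ms(100))  # 600.0
print(bpm_to_interval_ms(120))  # 500.0
print(hz_to_interval_ms(2))     # 500.0
```

The 100–120 bpm range thus maps to intervals of 600–500 ms, and the 2 Hz rate cited for optimal beat perception corresponds to a 500 ms interval.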
Human Rate Limits to Intervals
Due to anatomical features of the human body, there are temporal limits to the length of intervals. Such limitations constitute human SMS, both regarding the perception of rhythm and its performance. However, to restrain our model's behavior, we will consider them the limits of an otherwise ideal system. There are two types of limitations involved in the action: Central limits residing in the CNS and biomechanical limits due to the muscular system, known as peripheral limits (Burnley & Jones, 2018). When the frequency of impulses exceeds a typical range of 5 Hz to 7 Hz, even though the sequence of shorter intervals can still be perceived as rhythmic, biomechanical rate limits impose a maximal rate of finger tapping (Repp, 2006). On the other hand, for longer intervals, external frames of reference, such as a watch, are needed to identify them as isochronous or not. To complete the model, we define the shortest intervals at which the motor act is still physically feasible, and the longest ones where the performance still makes a rhythmic sense, as lower and upper rate limits, respectively (Repp, 2006). Both central limits and peripheral limits can pose a lower bound to IOI and are easily measurable in the lab, while the perceptual limits are somewhat harder to identify.
Lower Limit
The lowest limit involved in any SMS task is perceptual and reflects the ability to determine the temporal order of two beats, known as the order threshold. The auditory order threshold is defined as the minimum temporal interval between two auditory stimuli that must exist before a person is able to identify the correct order of two successive events (Fink et al., 2006). This threshold has been reported to be between 20 and 40 ms in a number of studies, for auditory, tactile, and visual stimuli (Kanabus et al., 2002). Temporal-order judgments (TOJs) are then a subset of SMS tasks dedicated to investigating processing times of information in different modalities (Rorden et al., 2018). TOJ studies have shown that temporal order decisions can be influenced by stimulus characteristics (Hendrich et al., 2012). As one example, Friberg and Sundström observed that for a tone to be perceived as singular it had to be separated by 100 ms or more from the nearest tone (Friberg & Sundström, 2002). The manner of performance can also affect the TOJ; for example, crossing the hands over the midline can impair the ability to correctly judge the order of a pair of tactile stimuli, delivered in rapid succession, one to each hand (Sambo et al., 2013).
For the successful performance of an SMS task, detecting the order of temporal events is necessary but not sufficient. There is another perceptual lower limit posed on the perception of the fastest possible rhythm: How fast can a rhythm still be perceived as rhythmic? To assess the fastest rates of rhythmic perception, one needs to remove the burden of biomechanical limits. To do that, using
Upper Limit
The upper rate limits are less distinct than the lower rate limits, but we can assess them by measuring where a phase transition would take place from an anticipatory rhythmic pattern that maintains synchronization between stimulus and response to a reactionary delayed response (Repp, 2005). Repp showed that tapping is a rather effortless activity up to an IOI of 1500 ms but exceeding 1800 ms becomes a difficult task requiring cognitive effort. Repp also showed that the typical anticipation tendency, which is recognized as the critical feature of SMS, turns into reaction rather than prediction (Repp, 2006). Bååth and Madison (2012) established the relation between the subjective difficulty of performance and tempo by testing Repp's hypothesis and thereby reported a steep shift in the subjective difficulty around an IOI of 1800 ms. They also verified that there is a qualitative difference between tapping at “fast” (< 1200 ms) and “slow” (> 2400 ms) tempi. To implement the upper limit, we use a conditional in the algorithm that halts the trial if the stimulus interval exceeds the upper limit, that is, if
Simulating Duets
A single agent is used to replicate experiments where a human individual plays against a machine. For situations where more agents are rhythmically collaborating, agents’ exchange of inputs and outputs is defined concerning the scenario. To account for a duet, we expose two agents to each other by feeding one's output to the other's input and simulating the collaboration over a delayed line (see Figure 3). While the internal delays discussed in the previous section are inherent parameters of an agent, the external delays are varied in this simulation and studied as a parameter of interest.

Two co-performer agents against each other over a delayed line.
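The wiring of two agents over a delayed line can be sketched in Python as below. This is a deliberately reduced toy: each agent applies only linear phase correction toward the other's delayed onset, with a fixed period, no internal transduction delays, no period correction, and no noise. All names and parameter values are hypothetical, and the sketch is not expected to reproduce the simulation results reported in this paper.

```python
def simulate_duet(n_taps, period_ms=600.0, alpha=0.5, external_delay_ms=0.0):
    """Two identical toy agents, each correcting toward the other's onset.

    Each agent hears the partner's previous onset after external_delay_ms
    and shifts its next onset by alpha times the perceived asynchrony.
    Returns the two lists of onset times (ms).
    """
    a, b = [0.0], [0.0]
    for _ in range(n_taps - 1):
        heard_by_a = b[-1] + external_delay_ms  # B's onset as it reaches A
        heard_by_b = a[-1] + external_delay_ms  # A's onset as it reaches B
        next_a = a[-1] + period_ms - alpha * (a[-1] - heard_by_a)
        next_b = b[-1] + period_ms - alpha * (b[-1] - heard_by_b)
        a.append(next_a)
        b.append(next_b)
    return a, b


a0, _ = simulate_duet(4, external_delay_ms=0.0)
a20, _ = simulate_duet(4, external_delay_ms=20.0)
print(a0)   # [0.0, 600.0, 1200.0, 1800.0]
print(a20)  # [0.0, 610.0, 1220.0, 1830.0]
```

In this reduced form, the external delay makes each agent perceive the partner as late, so every interval is lengthened by alpha times the delay. Interactions with internal delays and period correction, which the full agent includes, are outside the scope of this toy.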
The external delays
Implementation and Results
Based on the knowledge from behavioral and neurosensory research, the previous section described the structure and function of an SMS agent. To evaluate our approach, this section presents the results of implementing the agent with values randomly selected within the ranges defined in the previous section. We will test the agents’ behavior across different values of delays and tempi, other parameters of interest, such as
Scenario 1: Human Against a Metronome
Consider an agent called A (the results of which are plotted as blue curves in the upcoming figures) representing a human listening to an input sequence of
Figure 4(a) shows the output IRI of agent A,

Scenario 1: A simulated trial for an agent A (blue) performing an SMS task against a 100-bpm metronome (red). Some “jitter” is added to the timing of the agent, drawn from a Gaussian distribution with a mean of zero and a standard deviation of 10 ms. (a) Output IRI,
Figure 4(b) presents agent A's phase error (external asynchrony) and the tolerance range defined by its asynchrony tolerance threshold (see equation (20)). The area chart marked by dark blue shows IRIs for agent A, within which, the simulation tolerates (ignores) the phase error. If
Figure 4(c) shows agent A's period error (discrepancy). The wider area chart marked by light blue represents the tolerance range for discrepancy according to equation (21). Since the discrepancy tolerance ratio for agent A is set to
One interesting property to study in this scenario is the mean asynchrony that agent A exhibits against a metronome. We define mean asynchrony as the average of objective asynchrony (based on the reference clock) over the course of one trial with the length of n onsets:
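The averaging just described can be sketched in Python as follows. The function name is hypothetical, and the sign convention (response onset minus stimulus onset, so that taps preceding the clicks give negative values) is an assumption chosen to agree with the usual meaning of negative asynchrony.

```python
def mean_asynchrony(stimulus_onsets_ms, response_onsets_ms):
    """Average of (response - stimulus) over the n onsets of one trial.

    Both sequences are measured on the same reference clock and must be
    paired one-to-one. Negative values mean the taps lead the stimuli.
    """
    if len(stimulus_onsets_ms) != len(response_onsets_ms):
        raise ValueError("sequences must have the same length")
    if not stimulus_onsets_ms:
        raise ValueError("need at least one onset")
    n = len(stimulus_onsets_ms)
    diffs = (r - s for s, r in zip(stimulus_onsets_ms, response_onsets_ms))
    return sum(diffs) / n


clicks = [0.0, 600.0, 1200.0]
taps = [-20.0, 570.0, 1190.0]
print(mean_asynchrony(clicks, taps))  # -20.0
```

In the example the taps lead the clicks by 20, 30, and 10 ms, giving a mean asynchrony of -20 ms.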

Mean asynchrony (
As reflected in equation (15), such a distinction itself arises from the difference between the transduction delays,
This observed phenomenon, known as NMA, is one of the oldest behaviors known to the researchers of SMS and has generated a considerable amount of research (Repp, 2005; Repp & Su, 2013; Stephen et al., 2008; Yang et al., 2019). Some of the earliest investigators of the field noted that while subjects tap to a metronome, their taps tend to precede the sequence tones they hear by a few tens of milliseconds rather than being distributed symmetrically around the tone onsets (Miyake, 1901; Woodrow, 1932). A wide range of explanations for NMA has been suggested: an anticipatory tapping necessary for individuals to gain the subjective impression of tapping in synchrony with the stimuli (Aschersleben, 2002), different nerve transmission times from the finger and the ears to the brain and an asymmetric cost function of the error tolerance (Vos & Helsper, 1992), a slower central nervous system registration of tactile as compared to auditory information (Aschersleben et al., 2001), or a tendency to underestimate the IOI duration (Wohlschläger, 1999). The NMA has been reported to vary with IOI duration. An increase in drummers’ NMA has been reported as the metronome IOI increased from 300 to 1,000 ms (Wohlschläger, 1999). In another study, Repp and Doggett examined 1:1 tapping at slow metronome tempi with IOIs ranging from 1,000 to 3,500 ms. Non-musicians’ NMAs were found to increase linearly as the IOI increased, whereas musicians’ NMAs were smaller and nearly constant (Repp & Doggett, 2007). The NMA can also change with musical training. It tends to be smaller for musicians than for non-musicians (Repp & Doggett, 2007) and is also reported to be larger for untrained participants (Yang et al., 2019). In a tapping study on drummers, professional pianists, amateur pianists, singers, and non-musicians, drummers showed the smallest NMA (about 20 ms), whereas others had NMAs in the vicinity of 50 ms for an IOI of 800 ms (Krause et al., 2010).
In another study, professional drummers showed mean asynchronies ranging from 0 ms to 13 ms in synchrony with a metronome, depending on the instrument and tempo (Fujii et al., 2011). This phenomenon can also be recognized in experiments aiming to achieve objective synchronization through additional instructions provided to trained subjects. For example, the elimination of NMA in objective terms is reported to lead to the perception of positive asynchrony. When non-musicians were trained to abolish their NMA using feedback on the direction and size of their asynchronies, after some practice, they managed to tap without an observable NMA but also reported that they perceived their taps behind the received stimuli (Aschersleben, 2003).
In section “The internal representation of the external events,” we touched upon two different assumptions behind delayed perception: the nerve conduction hypothesis and the SAM. The former views the NMA as a necessary mechanism for correcting intrinsic delays in the perceptual system due to the greater peripheral delay of tactile signals relative to auditory signals. The latter, however, while accounting for observations similar to the nerve conduction hypothesis, leaves room to include other mechanisms. According to the SAM, the accumulator function for the auditory modality is steeper than that of the tactile modality; therefore, in addition to the shorter transduction delays for auditory signals, this constitutes quicker central processing of auditory information, requiring taps to precede tones to be registered simultaneously. However, the steepness of the accumulator function is not constant according to the SAM and can be a function of other factors specific to an SMS task. As an example, the magnitude of the sensory input is shown to influence the sensory accumulator functions: signals with a lower amplitude take longer to accumulate toward the synchronization threshold (Aschersleben et al., 2001).
In our modeling, according to the Paillard–Fraisse hypothesis (Aschersleben & Prinz, 1995), we assumed that the NMA arises from differences in nerve conduction times between click and tap and their corresponding central representations. Thus, when anesthesia eliminates the slower feedback component by blocking the tactile feedback and keeping the faster auditory feedback (as well as the earlier kinesthetic feedback from joints and muscles), a decrease in the amount of negative asynchrony is expected. Figure 5 confirms this expectation and shows that while NMA is observable for the whole range of values of
Conversely, the SAM assumes that tapping is planned at a “late” brain site, not affected by afferent nerve conduction times, but instead by the amount of activation arising from the taps, and therefore argues that a nerve block can lead to an increase in NMA (Aschersleben et al., 2001). Our model is more in agreement with the nerve conduction hypothesis. As an alternative, it is possible to make our model comply more with the SAM, for example, by considering that at higher tempi, a steeper accumulation of tactile feedback is caused by subjects’ more forceful tapping (Peters, 1989). The larger force that is applied to fingers at lower IOIs, that is, higher tempi, leads to an increased amplitude for the tactile feedback (Kaernbach et al., 2004). The stronger tactile feedback translates to a smaller λ in equation (3), which can lead to a larger intensity of NMA at higher tempi.
Negative asynchrony can also be explained based on the strong anticipation hypothesis. Strong anticipation is characterized by predictions that arise from the regular operations of a system, as opposed to weak anticipation, which relies on explicit internal simulations or models of the system's dynamics (Stepp & Turvey, 2010). Roman et al. (2019) model the brain's synchronization as an oscillator with delayed recurrent feedback to account for latencies in neural processes. This delayed feedback allows the brain to predict upcoming beats and compensate for delays by tapping earlier. In this model, musicians, who process feedback more efficiently, show less negative asynchrony compared with non-musicians. Thus, negative asynchrony arises from the brain's proactive adjustments based on ongoing interaction and feedback with external stimuli.
Asymmetric error correction models of the NMA hypothesize that an asymmetry in the error correction process for positive and negative asynchronies is one mechanism behind NMA: the error correction gain for positive asynchrony is greater than that for negative asynchrony (Tomyta et al., 2023). We did not account for this in the design of our model, which assumes a fixed phase error correction gain. We nevertheless tested whether introducing the asynchrony tolerance threshold would replicate such an asymmetry in the effective rate of phase correction observed in simulated trials.
Tomyta et al. (2023), using the dataset presented by Yang et al. (2020), plotted

Plotting the asynchrony of each onset against that of the previous one, following Tomyta et al. (2023), our simulation resulted in a symmetric distribution of asynchrony, irrespective of whether the onset asynchrony was positive or negative.
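The comparison of correction rates for positive and negative asynchronies can be sketched as follows. This is a minimal illustration and not the analysis code of Tomyta et al. (2023): the function names `ols_slope` and `correction_gain_by_sign`, the use of an ordinary least-squares slope, and the synthetic asynchrony series (a first-order process with one fixed gain and Gaussian noise) are all assumptions made for this example.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def correction_gain_by_sign(asyn):
    """Effective phase correction gain (1 - slope of asyn[n+1] on asyn[n]),
    estimated separately for positive and negative previous asynchronies."""
    pairs = list(zip(asyn[:-1], asyn[1:]))
    selectors = {"positive": lambda a: a > 0, "negative": lambda a: a < 0}
    gains = {}
    for label, keep in selectors.items():
        xs = [prev for prev, _ in pairs if keep(prev)]
        ys = [nxt for prev, nxt in pairs if keep(prev)]
        gains[label] = 1.0 - ols_slope(xs, ys)
    return gains


# Hypothetical synthetic series: one fixed gain for both signs (values in ms).
rng = random.Random(0)
alpha = 0.5  # assumed fixed phase correction gain
asyn = [0.0]
for _ in range(20000):
    asyn.append((1.0 - alpha) * asyn[-1] + rng.gauss(0.0, 10.0))

gains = correction_gain_by_sign(asyn)
print(gains)
```

With a single fixed gain, the two estimates should come out close to each other, which corresponds to a symmetric pattern; a markedly larger gain on the positive side would indicate the asymmetry discussed above.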
Scenario 2: Delayed Joint Action
In the next scenario, we simulate a rhythmic duo by coupling two agents with each other, according to Figure 3. Agents A and B are defined by quantifying their parameters based on the constants or distributions presented in chapter “The Model,” and, hence, acquire slightly different parameter values; however, these parameters will be constant throughout the performance of each agent across each simulated trial and all its repetitions. Figure 7 shows the results for one trial performed by agents A (blue) and B (red) without any transmission latency, or delay, between them (

Results for simulations of scenario 2: coordinated joint tapping of agent A (blue) against agent B (red), coupled according to Figure 3. Both agents are given an initial tempo of 100 bpm, and without a transmission delay, that is,
Another real-world phenomenon that can be replicated by this simulation emerges when the two agents in Figure 7 are set to perform the rhythmic duo under the influence of external delays. In joint tapping experiments where two performers are tasked with synchronizing their actions over a delayed line, it is observed that a moderate delay is necessary to maintain steady rhythmic collaboration. Without this delay, the trials appear to accelerate. Chafe & Gurevich (2004) first reported this phenomenon in a mutual hand-clapping experiment over an adjustable delayed line, where pairs of subjects were instructed to play in synchrony (see also Chafe et al., 2010). They found that shorter delays (<11.5 ms) produced a modest but surprising acceleration, which we refer to as “the Chafe effect” below. In a similar experiment by Farner et al. (2009), this counterintuitive finding was confirmed in various acoustic environments, observing that during a duo-clapping with short delays up to about 15 ms, the tempo increased. In another study, Darabi et al. (2008) showed that a strategy function could describe this effect algorithmically. Based on a mathematical interpretation of the behavioral data from both experiments, it was concluded that for latencies below a critical boundary, performers tend to compensate for a suspected delay, which, if larger than the physical delay, will lead to an acceleration.
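The argument about compensating for a suspected delay can be illustrated with a deliberately simplified toy model. It is not the agent model of this paper and not the strategy function of Darabi et al. (2008): the function `simulate_duo`, its parameters, the absence of noise, and the naive period update are assumptions chosen only to show how a suspected delay larger than the physical delay shortens the intervals.

```python
def simulate_duo(delay_ms, suspected_delay_ms, period_ms=600.0, n_beats=20):
    """Inter-tap intervals of two identical, noise-free toy agents.

    Each agent hears its partner after `delay_ms`, then schedules its next
    tap one period later, minus the delay it suspects on the line.
    Because the agents are identical, one timeline describes both.
    """
    t = 0.0
    period = period_ms
    intervals = []
    for _ in range(n_beats):
        heard = t + delay_ms                      # partner's onset as received
        next_tap = heard + period - suspected_delay_ms
        interval = next_tap - t
        intervals.append(interval)
        period = interval                         # naive period update (assumption)
        t = next_tap
    return intervals


fast = simulate_duo(delay_ms=5.0, suspected_delay_ms=20.0)
steady = simulate_duo(delay_ms=20.0, suspected_delay_ms=20.0)
slow = simulate_duo(delay_ms=40.0, suspected_delay_ms=20.0)
print(fast[:3], steady[:3], slow[:3])
```

In this sketch, each interval changes by the difference between the physical and the suspected delay, so the duo accelerates when the line delay is below the suspected value, holds tempo when they match, and slows down above it.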
To test whether the results of Chafe et al. (Chafe et al., 2010; Chafe & Gurevich, 2004) could be replicated, we let 24 pairs of randomly chosen agents perform against each other over a symmetrical delay line. We used the same 12 delays, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, and 78 ms, as were used in Chafe & Gurevich (2004) and simulated each randomly chosen pair of agents to play once at each tempo. Since a complementary hand-clapping pattern was used (clapping ××○× against ×○××), with tapping events (×) as opposed to silent notes (○), which are not physically generated but have hypothetical counterparts in the CNS, we applied re-indexing to ensure that external events and their internal representations correctly match the corresponding onset. In accordance with the original experiment, the initial tempo was randomly chosen from 84, 90, and 96 bpm. For both agents, the value of

Simulated trials with 12 mutual delays given to 24 random pairs of agents. The agents had jitter picked from a normal distribution with a mean of 0 and an std of 10 ms.
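The layout of these simulated trials can be sketched as follows. The 12 delays, the 24 pairs, the three initial tempi, and the 10 ms jitter standard deviation come from the text; the names `build_trials` and `jittered_onset`, the choice of one trial per pair and delay, and the uniform random choice of tempo are assumptions for this illustration.

```python
import itertools
import random

# The 12 delays in the text are the first 12 triangular numbers (in ms).
DELAYS_MS = [n * (n + 1) // 2 for n in range(1, 13)]
TEMPI_BPM = [84, 90, 96]
N_PAIRS = 24
JITTER_SD_MS = 10.0


def build_trials(seed=0):
    """One trial per (pair, delay), each with a randomly chosen initial tempo.
    The one-trial-per-cell layout is an assumption of this sketch."""
    rng = random.Random(seed)
    trials = []
    for pair_id, delay in itertools.product(range(N_PAIRS), DELAYS_MS):
        trials.append(
            {"pair": pair_id, "delay_ms": delay, "tempo_bpm": rng.choice(TEMPI_BPM)}
        )
    return trials


def jittered_onset(planned_ms, rng):
    """Executed onset time: planned time plus zero-mean Gaussian motor jitter."""
    return planned_ms + rng.gauss(0.0, JITTER_SD_MS)


trials = build_trials()
print(len(trials), DELAYS_MS)
```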
In this simulation, we also include the tactile afferent feedback. To compare the results with the original experiment, we follow the same methodology of analysis presented in Chafe et al. (2010) with the same lead/lag factor to quantify the onset asynchrony for every trial in a delay group. Observing a normal distribution in the lead/lag factor, their quantification is used to estimate 95% confidence intervals, see Figure 9 (top chart with the module introduced in the section “Anticipation” off, and the bottom chart with turning this module on), for three values of

Simulating the “Chafe effect”: Mean lead/lag factor value according to Chafe et al. (2010), aggregated at each onset of all trials simulated for 24 randomly chosen rhythmic agents, presented as a function of line delay with 95% confidence interval. A positive critical delay for three values of
So far, we have shown that our model can demonstrate both the NMA and the acceleration of joint tapping at minimal latencies. These observations seem to support that both NMA and the “Chafe effect” can be artifacts of peripheral and central processing delays. The speeding up by two agents in the absence of a delay is an artifact of NMA in each agent and does not necessarily represent a higher cognitive process in coping with rhythmic interactions. It has been mentioned since very early work on SMS that in duet tasks or musical performances, the precedence of the motor output over the auditory input ensures that both subjects experience synchrony (Gasser & Grundfest, 1939; Loewenstein et al., 1958). Nevertheless, the speeding-up effect itself had not been explicitly reported until Chafe and Gurevich (2004), Farner et al. (2009), and Chafe et al. (2010), and the quality of attributing this effect to NMA under various settings needs further investigation.
Scenario 3: Agent Versus Step-Changing Metronome
In Figure 10, we plot the same charts as in Figure 4 when agent A is exposed to a step-changing metronome with a sudden jump from 100 to 140 bpm (a shift in ISI from 600 to 429 ms). We call this a negative step change because it comes with a decrease in the interval size. The output generates an overshoot, as expected from the literature (Mates, 1994b; Michon, 1967; Repp & Su, 2013). Michon (1967) demonstrated the initial overshoot that is typically observed after a sudden change of tempo. Friberg and Sundberg (1995) claimed that the occurrence of overshoot in response to a step change in tempo does not depend on the amplitude of the step change, but rather on awareness of whether the step change has taken place. Darabi and Svensson (2021) and Darabi et al. (2010a) studied the qualities of such overshoot in the frequency domain instead of the time domain. To test whether their experimental results of tapping against step-changing metronomes could be replicated, we followed a similar methodology, both in data collection and analysis. In their experiment, human participants took part in an SMS task in which they tapped a finger on a keyboard, following a metronome that changed tempo from 100 bpm to a higher tempo between 102 and 200 bpm and the other way around. The discrete tapping events of the input stimuli and the output responses were aggregated over repetitions for each step size. A dynamic systems model was used to interpret the result, requiring the interpolation of the discrete tapping events into effectively continuous-time signals with a sampling frequency of 60 Hz (with an uncertainty of ±8.3 ms). The upsampled signal was then fed to the MATLAB system identification toolbox (Ljung, 1999) to identify the transfer function that describes the relationship between the input and the output of the system.
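The tempo and interval values above, and the resampling onto a 60 Hz grid, can be made concrete with a short sketch. An interval in milliseconds is 60000 divided by the tempo in bpm, so 600 ms and about 429 ms correspond to 100 and 140 bpm, and half of the 60 Hz sample period (about 8.3 ms) gives the stated timing uncertainty. The function names are hypothetical, and plain linear interpolation is used here only as a standard-library stand-in; it is not the interpolation scheme of the original analysis.

```python
def bpm_to_interval_ms(bpm):
    """Inter-onset interval in milliseconds for a tempo in beats per minute."""
    return 60000.0 / bpm


def upsample_linear(onsets_ms, values, rate_hz=60.0):
    """Linearly interpolate (onset, value) pairs onto a uniform time grid.

    Requires at least two strictly increasing onsets.
    """
    step_ms = 1000.0 / rate_hz
    grid, samples = [], []
    k = 0  # index of the segment [onsets_ms[k], onsets_ms[k + 1]]
    i = 0
    while True:
        t = onsets_ms[0] + i * step_ms
        if t > onsets_ms[-1]:
            break
        while onsets_ms[k + 1] < t:
            k += 1
        t0, t1 = onsets_ms[k], onsets_ms[k + 1]
        w = (t - t0) / (t1 - t0)
        samples.append(values[k] + w * (values[k + 1] - values[k]))
        grid.append(t)
        i += 1
    return grid, samples


# Hypothetical metronome: three beats at 100 bpm, then three at 140 bpm.
intervals = [bpm_to_interval_ms(100)] * 3 + [bpm_to_interval_ms(140)] * 3
onsets, t = [], 0.0
for interval in intervals:
    t += interval
    onsets.append(t)  # each interval is attached to the onset that ends it

grid, samples = upsample_linear(onsets, intervals)
half_sample_ms = 1000.0 / 60.0 / 2.0  # about 8.3 ms
```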
The dynamic system model using Laplace transformation (Widder, 2015) allowed a formulation of the system in the complex frequency domain (the so-called s-domain), instead of the time domain. The time response to the step change in tempo was modeled by quantifying five parameters presented in the following equations (the gain

Scenario 3: Simulation of an agent tapping against a step-changing metronome, with a tempo jumping from 100 to 140 bpm (that is, ISI from 600 ms to 429 ms), showing (a) IRI and ISI, (b) phase error, and (c) period error. The dark blue area marks the IRI range within which the asynchrony is tolerated (that is, outside of which the phase error correction process is activated, marked by +). The light blue area depicts the tolerance range for the period error correction process. Onsets for agent A are marked by × if the latter mechanism is activated. Planned intervals that fall within both light and dark blue ranges are marked by ○ and are not corrected by either process, although they can still be executed with a jitter.
Similar to the experimental data reported by Darabi and Svensson (2021), we chose three randomly generated agents, aggregated and upsampled the simulated trials according to the algorithm described in the original experiment, and compared the results with the aggregated observations from the three human subjects over the same reported step changes and number of repetitions. With forcing the gain
Figure 11 shows one example of the 27 analyzed step responses to sudden tempo changes. The positive step change (Figure 11(a)) shows an increase in the interval, in this example, from 429 to 600 ms, equivalent to a decrease in tempo from 140 to 100 bpm, after normalization to a unit step response. Conversely, in line with Figure 10, the negative step change (Figure 11(b)) shows a sudden reduction in the interval or an increase in tempo by the same values. Both charts are normalized by the step size, so the step input ranges between 0 and 1 (or −1). The green curves show the experimental data with the brown curves representing their simulated counterparts. The thicker, lighter curves show the upsampled aggregated step responses for this step size, aggregated over all participants/agents and their repetitions (observed or simulated). The thinner, darker curves show the aggregated step response modeled by a pair of complex poles, a delay, and a zero according to the dynamical systems method in equation (24) (observed or simulated).
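The normalization by step size described above can be sketched in a few lines. The helper names `normalize_step` and `overshoot` are hypothetical, and the response values in the example are made up purely to exercise the functions; they are not data from the experiment or the simulation.

```python
def normalize_step(iri_ms, start_ms, end_ms):
    """Rescale an interval series so the input step runs from 0 to +1
    (interval increase) or from 0 to -1 (interval decrease)."""
    size = abs(end_ms - start_ms)
    return [(x - start_ms) / size for x in iri_ms]


def overshoot(normalized):
    """Peak excursion beyond the unit step, as a fraction of the step size."""
    return max(abs(v) for v in normalized) - 1.0


# Illustrative (made-up) response to a positive step from 429 to 600 ms.
positive = normalize_step([429.0, 520.0, 634.2, 610.0, 600.0], 429.0, 600.0)
# The same shape mirrored for a negative step from 600 to 429 ms.
negative = normalize_step([600.0, 509.0, 394.8, 419.0, 429.0], 600.0, 429.0)
print(overshoot(positive), overshoot(negative))
```

Dividing by the absolute step size keeps the sign of the step, so positive and negative steps can be compared on the same scale.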

Step response to a sudden tempo change between 100 and 140 bpm. A positive step (a) shows an increase in the interval or a decrease in tempo. Conversely, a negative step (b) shows a reduction in the interval (increase in tempo). Aggregated trials from three participants in a real-world experiment (green) are compared with those of three randomly chosen simulated agents (brown). The thicker, lighter curves represent the aggregated IRIs over all participants and repetitions, upsampled with PCHIP interpolation. The darker, thinner lines are modeled with the MATLAB system identification toolbox using a delay, one zero, and two poles (also known as a P2DUZ model). The accuracy of the model is reported with a fit ratio based on a normalized root-mean-square error (NRMSE).
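A fit ratio based on NRMSE can be computed as in the sketch below. It uses a common definition, 100 × (1 − ‖y − ŷ‖ / ‖y − mean(y)‖); whether this matches the exact setting used in the reported analysis is an assumption, and the function name `nrmse_fit_percent` is hypothetical.

```python
import math


def nrmse_fit_percent(measured, modeled):
    """Fit percentage: 100 for a perfect match, 0 when the model is no better
    than the mean of the measured signal, negative when it is worse.
    Undefined (division by zero) for a constant measured signal."""
    mean_y = sum(measured) / len(measured)
    err = math.sqrt(sum((y - m) ** 2 for y, m in zip(measured, modeled)))
    spread = math.sqrt(sum((y - mean_y) ** 2 for y in measured))
    return 100.0 * (1.0 - err / spread)


# Illustrative (made-up) normalized step response and two candidate models.
measured = [0.0, 0.5, 1.2, 1.05, 1.0]
perfect = list(measured)
flat = [sum(measured) / len(measured)] * len(measured)
print(nrmse_fit_percent(measured, perfect), nrmse_fit_percent(measured, flat))
```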
For subliminal step changes, where the relative change in the tempo is below 7% (Repp, 2001b), the results of the identified parameters are overall noisy, particularly for the experimental data. For the supraliminal step changes in the range of 108 to 200 bpm, we observed similar trends for both observation and simulation, and checked if a linear regression can predict the identified values.
In Figure 12, the top two charts (

Identified model parameters expressed as a function of the relative step size for P2DUZ model (the first two poles,
The damping ratio (
In the supraliminal range, however, there is a good agreement in estimating the zero value between the observation and the simulation, as seen in the good alignments between the linear regressions in each subplot. To summarize, Figure 13 shows the estimated parameters of the experimental data against those calculated from the simulation. The unit on both axes is seconds, except for the damping ratio (

Identified parameters for the experiment versus simulation for the P2DUZ model. The unit on both axes is seconds, except for the damping ratio (
Another statistical method we use to analyze the agreement between the simulation and the experiment in the time domain is the mean-difference (Altman) plot (Cleveland, 1993). Consider two IRI arrays of the same length, both from the same step size and direction, one from observation and the other from the simulation. Assume

Altman plots, showing the agreement between the model outcomes and human performance for scenario 3. The horizontal axis represents the tempo related to the average of corresponding IRIs from the experiment and simulation data (in bpm). The vertical axis shows the time difference between corresponding IRIs of the simulation and the experiment when expressed in terms of tempo (in bpm). The solid horizontal line marks the mean difference between the two arrays. This line does not differ significantly from 0 with respect to the dashed lines (the mean of differences ± 1.96 standard deviation of the difference), also known as the 95% limits of agreement.
In this figure, the horizontal axis shows the average of experiment and simulation, and the vertical axis shows their difference. The solid horizontal line shows the mean difference between the two arrays. This line does not differ significantly from 0 in comparison with the dashed lines, which show the mean of differences ± 1.96 standard deviations of the difference, also known as the 95% limits of agreement. The plot therefore does not indicate the presence of a systematic bias. If a consistent bias is observed, it can be adjusted for by subtracting the mean difference from the new method.
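The quantities shown in such a mean-difference plot can be computed as in the following sketch. The function name `mean_difference_stats` and the example tempo values are hypothetical and serve only to demonstrate the calculation of the mean difference and the 95% limits of agreement.

```python
import statistics


def mean_difference_stats(experiment, simulation):
    """Per-point means and differences, the mean difference (bias), and the
    95% limits of agreement (bias ± 1.96 sample standard deviations)."""
    means = [(e + s) / 2.0 for e, s in zip(experiment, simulation)]
    diffs = [s - e for e, s in zip(experiment, simulation)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
    }


# Illustrative (made-up) tempo values in bpm for paired observations.
experiment = [100.0, 110.0, 120.0, 130.0]
simulation = [101.0, 109.0, 122.0, 128.0]
stats = mean_difference_stats(experiment, simulation)
print(stats["bias"], stats["lower"], stats["upper"])
```

When 0 lies well inside the interval between the lower and upper limits and the bias is small relative to their width, the two arrays show no consistent offset.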
Conclusion
We have simulated a “rhythmic agent” by deconstructing and modifying Mates’ behavioral model of SMS. The auditory and tactile components of the tap's feedback were adjustable with a weighting factor. In addition to the adaptive/reactive error correction processes, a mechanism to extrapolate anticipation linearly has been introduced. Period estimation and period correction were both incorporated because, with the application of short-term memory, both were deemed necessary to produce human-like tempo adjustment. The simulation confirmed observed patterns of human synchronization across three scenarios.
In scenario 1, exposing the agent to a simple metronome recreated the well-known human behavioral phenomenon of NMA: when subjects tap to a metronome, their taps tend to precede the sound stimulus onset by a few tens of milliseconds instead of being distributed symmetrically around the sound onsets. A single parameter
In scenario 2, the presented model was tested in a joint delayed rhythmic collaboration, and the so-called “Chafe effect” was reproduced. That is, if a communication delay is introduced, the tempo decreases in a manner similar to that observed in real-world experiments. In addition, a speed-up effect is observed for transmission delays smaller than around 10 ms. The introduction of jitter in our model generates case-to-case variation similar to real experiments. The weighting factor,
In scenario 3, an agent performing against a step-changing metronome generated overshoot in its reaction to a tempo step in a manner similar to that observed in real-world experiments. Fitting a dynamical system model to the simulated data, some modeled parameters of the overshoot, namely
We can think of two applications for this model. Practically, to implement an automatic musical accompaniment application, the model can be combined with a real-time beat tracking algorithm to dynamically track the location of a received input signal and compare it with a given musical score (Lin et al., 2020). The application would take the role of another instrumentalist or a whole orchestra to accompany the solo musician at the right tempo, with human-like behavior resembling a real-life performance (Arzt, 2016). In addition, the numerical values of the identified parameters of the transfer functions used in the last scenario could inspire SMS models that do not constrain their formulation to the time domain, and eventually inform the value of error correction gains or their formulation.
Footnotes
Action Editor
Jessica Grahn, Western University, Brain and Mind Institute and Department of Psychology.
Peer Review
Jonathan Cannon, McMaster University, Department of Psychology, Neuroscience and Behaviour as well as one anonymous reviewer.
Contributorship
Nima Darabi is the main author who wrote the paper and conducted the experiments. The work is done under U. Peter Svensson's close supervision, and Paul Mertens has provided neurological insights and parts of the literature review while being a part of the pilot research.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Approval
This study is essentially a secondary “analysis,” simulation, or modeling of participant data collected in a previous experiment from three human subjects (e.g., Darabi & Svensson, 2021), which received ethical approval: “The studies involving human participants were reviewed and approved by Q2S Centre of Excellence, NTNU. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.”
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Norges Teknisk-Naturvitenskapelige Universitet, and Uninett through the “Centre for Quantifiable Quality of Service in Communication Systems, Centre of Excellence,” appointed by the Research Council of Norway.
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
