Abstract
“Earworms” have been proposed as a particular type of involuntary musical imagery (INMI) where musical material is repeated in the mind. The structure of the repetition is investigated by proposing a spreading activation model (SAM), where mental experience consists of priming and activation of nodes that represent objects, events, and relationships, including music. Music consists of chaining together nodes representing small music segments within hierarchical structures. Listening to music at a point in time activates the music's represented segment, which then primes the node representing the segment that follows. Repeating musical segments are coded recursively, with an additional layer of “context” nodes tracking global, structural location. From this basis, two hypotheses were proposed: H1 “Contiguous repetition at encoding” and H2 “Low environmental focus.” H1 predicts that when an INMI episode is a contiguously repeating segment, it must be based on music that contains contiguous repetition: it will be perceived as a subset of INMI – involuntary, limited, and contiguously repeating musical imagery (InLaCReMI). H1 challenges current views about preferred segments for looping, such as the “hook” of a tune. H2 predicts that InLaCReMI occurs when an individual is not focused on the immediate environment. In such a state there is less social imperative to activate high attentional-demand contextual information and so adherence to contextual integrity in thought is relaxed, leading to looping of recurrently activated nodes that were encoded with contiguous repetition. Additional predictions were made using SAM, demonstrating the potential for SAM to provide a unifying understanding of INMI. InLaCReMI is proposed as a frequently occurring species of INMI and confirmation of this phenomenon through more structured empirical investigation will provide novel insights into mental operation, and the nature of INMI.
Keywords
The spontaneous, involuntary, innocuous, non-psychotic appearance of a music fragment heard in one's mind, without the same music sounding simultaneously in the environment, is now believed to be a common experience (Liikkanen & Jakubowski, 2020). The experience can take several forms. A piece of music, or part of it, may appear in the mind once and then stop. This is referred to as a mind-pop (Elua et al., 2012). Another form is when a fragment of music is repeated in part or in full at least once, but possibly many times. This type of experience is referred to as involuntary musical imagery (INMI), and colloquially is typically referred to as an “earworm.” Williams (2015) argued that the INMI terminology was not used consistently to describe the repetitious aspect of the phenomenon and proposed that the INMI terminology is more appropriate to use as the superordinate level in a taxonomy that encompasses the repetitious form of the experience and mind-pops, among others. The term “earworms,” argued Williams, more consistently refers to the repeating species of the experience, but again, the terminology is not used consistently, a matter that is at the heart of the present investigation.
While an identifying characteristic of an earworm is repetition, the definition of an earworm would benefit from specificity about whether the repetition occurs contiguously (the fragment recalled repeats immediately, i.e., “loops”), or whether there may be additional musical material inserted between the repeated material (e.g., taken from the piece of music with which the fragment is associated) or the repetition takes place in different ways. A theoretically driven understanding of this aspect of the experience is all but absent in the literature, but may provide a key to the mechanisms that underlie earworms and other internal, mental activities. This is because, as will be argued later, a key aspect of mental organization of temporally based sequential material, such as music, is the efficiency with which recurring material is cognitively organized, but under certain circumstances at the expense of structural precision of the recalled material.
This article will argue that contemporary understanding of the cognitive organization of sequential material presents clear-cut, testable hypotheses that predict earworms to be common experiences under specifiable, preparatory conditions. Furthermore, the nature of the material used for such an experience, as well as the nature of its repetition can also be predicted. The model of mental organization that will be presented is also applicable to other aspects of INMI, potentially presenting a unifying theory, that is much needed (Beaman & Williams, 2010, p. 693; Margulis, 2014, p. 75).
To date, the definition provided for earworms and INMI (when repetition of INMI material is involved) is not explicit on the matter of contiguous repetition, but implies that a fragment of music does repeat contiguously. Williams (2015) scoured the literature to find 26 definitions of INMI. Of these, 20 mentioned repetition as an integral part of the definition, with words such as “repeatedly,” “sticking,” “persistent,” and “replay.” However, only three of the 26 definitions contained descriptions of the experience that can be unambiguously interpreted as contiguous repetition: “endless repetition” (Sacks, 2011), “over and over again” (Hyman et al., 2013), and “looping” (Williamson et al., 2014). It is therefore risky to assume that earworms, and therefore INMI, necessarily consist of contiguous repetition. While those definitions that omit reference to some kind of contiguous repetition may do so as a result of oversight or viewing it as sufficiently axiomatic, it may also be the case that INMI consisting of contiguous repetition is better thought of as a separate species of INMI. The definitions published since have started to refer to contiguous repetition more frequently, or to something that can be seen as reasonably likely to mean contiguous repetition, suggesting that either a correction in the definition is underway, or that INMI can still incorporate episodes where the repetition is not necessarily or exclusively contiguous in its repetition (as discussed by Liikkanen & Jakubowski, 2020). The approach taken here reflects the continuing lack of unanimity about this aspect of INMI.
To distinguish the mental experience of contiguous repetition of a music fragment from other INMI experiences, the former will be referred to as involuntary, limited, and contiguously repeating mental imagery (InLaCReMI; pronounced “in-lack-re-mi,” similar to the Lachrimæ theme found in compositions by the English Renaissance composer John Dowland, such as the ayre and lute arrangement of “Flow, my tears”). A few caveats about this proposed terminology should be stressed. “Limited” needs to be specified to avoid interpreting the experience as being completely beyond the conscious control of the individual. While InLaCReMI can end as it began – spontaneously – it is likely that the individual experience can be consciously controlled to some extent. Otherwise, if completely out of the volitional control of the individual, or interminable, the experience may be seen as slipping into psychosis (or, as referred to in much of the literature, “pathology” – see Priest, 2018; Williams, 2015). Reference is also made to a “fragment” of music. Again, this refers to extant understandings of InLaCReMI experience. The material appropriated for the InLaCReMI may be of a section of music and (less likely) the entire piece, but, as will be demonstrated later, it is currently unclear exactly what, and whether, there is any pattern to the section that makes it a candidate for InLaCReMI.
Aim
Given the limitations and mixed results on the nature of the musical material recalled in InLaCReMI episodes and the lack of theory to guide our understanding, this article (1) draws together extant, related findings concerned with (a) the musical fragment that is recalled and (b) whether the fragment repeats contiguously, as well as (2) presenting a contemporary theory of memory and cognition to help situate the data and future research programs on this and other matters related to INMI. Testable hypotheses and revised definitions will be proposed by reconciling theory and extant data.
Extant Evidence
Music Features Conducive to InLaCReMI
The nature of the music that is recruited for InLaCReMI episodes can be broken into three areas: (1) the musical features, (2) the quality/section of the music itself, and (3) duration of the music fragment. Musical features have received the bulk of research attention regarding the nature of INMI material. This is partly because there has been a subsidiary interest in how precisely (“veridically”) the encoded musical material is recalled during INMI episodes.
A study by Jakubowski et al. (2015) tracked INMI experiences of 17 individuals. Upon becoming aware of such an experience, the participant was asked to tap to the beat of the internally perceived music. The tapping data were collected via accelerometers worn by the participant. The participant was also asked to identify the piece. Given the large amount of music to which people are exposed is normally through canonical sound recording media (meaning, a single, likely recording to which the individual is exposed), it was possible to compare the tapping data with the tempo of the recorded material that was likely to have formed the basis of the INMI.
Psychophysical and music perception research on error in tempo tracking of remembered music (e.g., singing a well-known song) is roughly 8% (Levitin & Cook, 1996), meaning that known music is not recalled at exactly the tempo at which it was performed, but will be slightly faster or slower by about this amount. This is close to, and possibly within the limit of, human auditory temporal processing capacity, which has been estimated as between 3% and 11% of the canonical tempo, depending on method of measurement, nature of the task, and other factors (for an overview, see Foster et al., 2021; Levitin & Cook, 1996).
In the Jakubowski et al. (2015) study, tapping was typically at 15% of the canonical recording tempo. Another study using a similar design with 20 participants found the tempo of an INMI episode was typically with a less than 10% deviation from the canonical recording, with the mental version usually being slower (Jakubowski et al., 2018). An arresting example of tempo veridicality at recall was the analysis of 54 field recordings of the orally transmitted Australian Aboriginal dance-song “Djanba 14,” where average tempo deviation was just 2% (Bailes & Barwick, 2011). These data suggest that absolute tempo can be a feature of musical memories and thus points to the nature of the music played in memory as being essentially that which is heard during encoding of (i.e., exposure to) the musical material. The literature on INMI is in line with general findings in mental imagery that the temporal structure of sounded music is preserved when deliberately recalled in memory (Halpern, 1988; Zatorre et al., 1996).
Bailes (2007) asked participants experiencing musical imagery in general to indicate the vividness of a range of the episode's musical features, including dynamics, expression, harmony, lyrics, melody, texture, and timbre. Of these, overall higher vividness scores were given to melody and lyrics, while the remaining rated musical features were on average each rated near or below the scale midpoint. The study was conducted with musically trained participants, however, similar findings were reported in a larger-scale study by Bailes (2015). Melody perception is considered a cognitively salient part of music (Fujioka et al., 2005; Lee et al., 2009), however, the finding of strong vividness for lyrics over other features may seem surprising if it is assumed that INMI is encoded veridically, since by definition “veridically” means all aspects of the sounded music should be encoded into memory. Bailes (2007) suggested that lyrics, above other features, were significant because of the ease of reproducing lyrics through the voice. This interpretation foreshadowed more recent investigations of musical imagery that now typically include the desire or activation of movement response, such as subvocalizing, singing, and moving in synchrony with the episode (Floridou et al., 2015; Killingly et al., 2021). Furthermore, memory for melodies is strengthened when remembered as lyrics (Weiss et al., 2019).
Since melody and lyrics are favored, does this mean that an INMI episode is typically not veridical with the richness of the sounded musical experience? Does it mean that harmony, dynamics, and other musical features are not reproductions of the canonical performance during the INMI experience? It is worth noting that in the Bailes study, it was the vividness of the feature that was rated, not the presence of the feature. The relatively low ratings of vividness for several features could also be attributed to the individual not being asked to draw their attention to the other features or, relatedly, the participant might not have a sufficient understanding of what to attend to – for example, if the individual does not have a clear concept of what harmony or dynamics in music mean – reflecting a possible impact of levels of expertise (e.g., Sears et al., 2012) and listening styles (Kreutz et al., 2008). Furthermore, these studies report experience of INMI material, which is not necessarily the same as the type of musical material itself that INMI favors.
A more explicit interrogation of the melodic characteristics that INMI prefers was conducted by Jakubowski et al. (2017). Based on a large-scale survey, they selected 100 melodies that were identified by participants as being experienced as INMI material. These melodies were matched with 100 melodies taken from songs of similar popularity and style, but never named as INMI in the study. Eighty-three statistical aspects of the songs were examined using music information retrieval techniques developed by Müllensiefen (2009). The analysis included comparison of melodic features compared with the features extracted from a comparable corpus of music. The analysis of the corpus can be interpreted as the degree to which a melodic feature is archetypal. The researchers found that the melodic contour of songs used as INMI material was well matched with the song corpus, while contour between melodic turning points (rises and falls in melodic contour) tended to be different when compared with the corpus. The other main finding was that the tempi of songs that appeared as INMI were overall faster than that of the typical corpus.
Jakubowski et al. (2017, p. 131) observed in this research that findings from earlier studies were not replicated, but noted that in these earlier studies, corpus (“second order”) related information was not accounted for, and so they argued that their study encompasses a greater range of possible variables involved in melodic processing, improving the likelihood of identifying relevant features. The study, therefore, may also better reflect the underlying cognitive organization of music because the corpus of music to which an individual is exposed creates a foundation according to which newly perceived music is organized (as will be discussed later). Importantly, the Jakubowski et al. (2017) study considered the possibility of identifying a formula for generating melodies that would be conducive to forming INMI episodes, presenting a solution to the current research aims. However, and after some sophisticated modeling, they still concluded that multiple formulae would be needed, suggesting a limitation in the explanatory value of a stable, fixed, identifiable set of patterns that are consistently preferred by INMI episodes.
Musical Structures Conducive to INMI Episodes
While the preceding review of the literature suggests that the prospect of identifying musical features that are preferred by INMI are slim, there are some researchers who examine the larger-scale structures that INMI episodes may prefer. Two regularly reported ones are the hook, and the chorus.
The “hook” or being “catchy” (e.g., Floridou et al., 2015) has been considered a strong contender for material that INMI would appropriate for its episodes. Yet what are the characteristics of a hook or of music that is catchy? Jakubowski et al. (2017) tackled the matter drawing on the work of Burgoyne et al. (2013) who defined the hook as “the most salient, easiest-to-recall fragment of a piece of music” (p. 1, cited by Jakubowski et al., 2017). The characteristics of this definition also reflect those of the INMI episode, making the concept of the hook potentially circular and descriptive, and above all difficult to apply in a testable manner on INMI material. As several authors conclude, identifying consistent, global characteristics of a catchy hook is a worthwhile research enterprise, but until an unambiguous definition is arrived at, the concept has questionable utility (see also Burns, 1987; Traut, 2005). Similar to the assessment by Jakubowski et al. (2017) on the search for melodic features conducive to INMI (discussed earlier), the idea of the “hook” may end up consisting of the same results – that several formulations of “hooks” would be required, because there is no straightforward solution to identifying unique characteristics of the (currently) nebulous concept of the hook that can meaningfully reveal the optimal structural content of INMI.
Related to the concept of the hook, but more easily defined, is the chorus. The chorus is a section of a song that contains material unique to it (but which may be derived from other material used in the song), and is repeated several times in the song. The repeated versions consist of the identical (or nearly identical) lyrics. This is in contrast to the verse section of a song, where musical material is also repeated, but the lyrics change on each repetition (Dunbar-Hall & Hodge, 1988). Studies by Bailes (2007), Beaman and Williams (2010), and McCullough Campbell and Margulis (2015) found that the chorus is the part of a song most often experienced as INMI. In their review of the literature, Liikkanen and Jakubowski (2020) also observed a focus on the chorus or fragments from it as the material conducive to INMI episodes.
Fragment Duration
The musical material used for INMI episodes could be explained by the operation of “short-term store” as presented in the influential theory of human memory by Atkinson and Shiffrin (1968) that consists of three memory structures: the sensory register, short-term store, and long-term store. The theory aligns well because of the typical duration for short-term store reported in the literature. For musical stimuli, long-term store contains entire pieces of music that have been previously heard, and that stay available in memory for long periods, possibly over the life span (Bartlett & Snelus, 1980; Spivack et al., 2019). Short-term store (sometimes referred to as working memory) interacts with long-term store and can code sensory input into a recognizable form. For example, it can store a phrase or so of music, but the memory will vanish in a short amount of time without some deliberate strategy to retain it. The typical time duration of retention for the short-term store is 4–8 s, and rarely exceeds 30 s (Atkinson & Shiffrin, 1971; Peterson & Peterson, 1959). The sensory register is an initial capture of sensory input. In auditory sensory input the mechanism is sometimes referred to as echoic memory (Neisser, 1967). It lasts for less than 2 s, and is activated without conscious attention, and is generally not accessible to consciousness as meaningful units – the musical equivalent of the linguistic “morpheme.” Seeger (1960, p. 76) coined the smallest musical unit that would present itself to short-term store as the “museme.” Echoic memory is a holding tank for incoming auditory information that can become accessible to short-term store for further processing (Clark, 1987).
Consistent with this theory, Levitin (2006, p. 155) proposed that short-term store capacity would set the boundaries for INMI fragment duration, suggesting a range of 15–30 s. Beaman and Williams (2010) surmised that this duration represents that found in “theme music for television shows [which] seem to get stuck more often – or to affect more people – than complex pieces of music” (p. 639). Using a retrospective, self-report design for gathering data on general INMI experiences, Pitman et al. (2021) concurred, finding fragment durations were typically in the range 10–30 s. Liikkanen and Jakubowski (2020) cautioned against retrospective self-report methods because of their limitations in estimating the duration of INMI episodes (see also Beaman, 2018), and such estimates would benefit from diverse methods of investigation.
Moeck et al. (2018) developed an INMI induction procedure that was not dependent on retrospective self-report, because the participant was exposed to a piece of music, with the undisclosed intention of implicitly triggering an earworm in a laboratory setting. After the listening-exposure phase, and as a surprise task, the participant was asked to monitor any spontaneous earworm experiences, and if present, to hold down a computer keyboard key for the duration of the earworm. The monitoring period was for five minutes. Based on these data, the researchers presented a calculation for the “earworm duration” (which could be taken to mean the earworm fragment duration, but was more likely the episode duration). The average length was shorter than the 10–20 s range, with an average duration of 8.77 s. An introspective case study by Brown (2006) reported experiencing INMI fragment durations of 5 to 15 s. Thus, while evidence of fragment duration is within the bounds of the short-term store of the Atkinson and Shiffrin (1968) model, the range of the durations alone (ranging from 5 to 30 s across the studies cited) presents a limited understanding of the nature of the earworm fragment.
In sum, current evidence suggests that INMI (and therefore InLaCReMI) appear as veridical representations of a canonical source, and that, to date, no particular regularity in the musical features or patterns has been identified that are prevalent in the episodes. Furthermore, the literature is not always clear as to whether the experiential durations refer to the INMI episode, or the material that is recruited for repetition within the INMI episode.1 While it seems likely that those studies refer to episode duration, the significance of understanding the duration (and indeed the content) of the fragment being repeated cannot be overstated. While the methods for gathering INMI material durations are problematic, there appears to be a plausible link between these durations and the operation of the short-term store structure of the Atkinson and Shiffrin (1968) theory of human memory. Atkinson and Shiffrin's theory, while influential, does not reflect the current state of understanding of memory and cognition. A more contemporary theory is therefore proposed aimed at explaining and predicting the nature of INMI material, and InLaCReMI material in particular.
Theoretical Position, Building on Spreading Activation Models of Memory and Cognition
The Basics of Spreading Activation Models
Spreading activation models (SAMs) are inspired by neuroarchitecture and brain function, and are used to explain a wide range of memory and cognition processes, such as solving problems and making (human-like) errors and corrections (Danner & Thøgersen, 2022; Mak et al., 2021; Pace-Sigge, 2018; Siew, 2022; Völker, 2021; Wang et al., 1988; Werner et al., 2018). SAMs have been used to provide mechanistic explanations of the cognitive processing of music, such as the organization of chord progressions (Bharucha & Stoeckig, 1987), links from pieces of music being brought to mind as a result of hearing another (Faubion-Trejo & Mantell, 2022, p. 305), affective responses to music (Schubert et al., 2014; Völker, 2022), and the creative process (Duch, 2007; Gabora & Ranjan, 2013; Schubert, 2012, 2021). Mace's research on involuntary memories has been influential, particularly through his application of spreading activation models of cognition (Mace, 2005a, 2005b; Mace et al., 2018, 2020), but has yet to be developed into a fully fledged theory to explain INMI in general and InLaCReMI specifically.
SAMs are based on a network architecture that consists of two basic components: nodes (or “cognitive units”) and links (Anderson, 1980; Collins & Loftus, 1975). While there are different nomenclatures for these components, such as “network vertex” and “network edges,” respectively (Mattar & Bassett, 2020; Newman, 2003), we will use the “nodes and links” terminology (Siew, 2019). The nodes are analogous to physical neurons and links are analogous to physical synapses in the brain. Nodes are points in a network that can receive excitation (where “excitation” can be thought of as transmission of information) from incoming links, and then may or may not (usually depending on past excitation occurrences) transmit the excitation to outgoing (target) links that each lead to other nodes.
To aid the present discussion, when a node is receiving excitation from a link, we will refer to the node as a target. When the node is transmitting excitation to one or more other links, we will refer to that node as a source. A node is both a target and a source, but differentiating between these two perspectives will make the ensuing discussion easier to digest.
A link connects a particular node output (source) and another particular node input (target). The links have a dynamic weighting system, where weighting is commensurate with the potential to prime or activate (or inhibit, a matter we can put aside in the present discussion) the target node to which it continues transmission.
“Priming” refers to a small amount of excitation that “prepares” the target node for activation (as in an unfulfilled expectation), while “activation” refers to sufficient excitation from incoming nodes to activate the target node. If activated, that target node itself passes excitation though links to its target nodes.
The weighting of the links adjust themselves dynamically based on previous attempts to activate the target node. Frequent activation of a target and source node at the same time will generally lead to an increase in the link weighting between those two nodes, resulting in greater ease of activation by the source node of its target if ever that source should again become activated. This process of developing a strong connection between two nodes through repeated activation is referred to by the adage “nodes that fire together, wire together” (Shamay-Tsoory, 2021). It is related directly to the principle of learning and, at a more basic level, association. For example, the relationship “where there is smoke, there's fire,” if known to an individual, means that they have a source node representing “smoke” that, when activated, also primes a node representing “fire.”
Spatial and Semantic Coding
A node can act as a percept (the mental identification of something beyond the collection of features that constitute the thing – see Imperatore, 1970; Jepson & Richards, 1993), and the link between two nodes can act as a relationship between the two nodes, be it semantic information (i.e., conceptual and propositional – see Jones et al., 2015) or information about spatial relationships. Semantic and spatial relationships are represented by the SAM network because they occur, indeed must occur, together in the physical, observable world. One of the marvels of the SAM network is that an object, feature, or concept can be linked to other objects, features, and/or concepts through the dynamic and numerous links that are attached to each node.
However, the SAM is not a simple, associator system. It is considerably more sophisticated because a target node may itself first become activated as a result of other links that terminate at that target, even if a source that “normally” activates it is not itself activated (and so unable to excite the target node). The node is not solely dependent on one source node for activation. Multiple source nodes leading to a single target node may each provide small amounts of excitation, insufficient to activate the target node, but upon summation the excitation is sufficient. For example, activation of the “smoke” node in the preceding example might alone not activate the “fire” node, but additional information such as “do you know an adage about [smoke]?” might.
SAM processes can be closely linked to human behaviors and conscious experiences: perception corresponds to node activation; being reminded of something corresponds to one node activating another; learning corresponds to the asymptotic (continually increasing toward a limit) strengthening of links among two or more nodes; expectation corresponds to priming of a node; and satisfaction or realization of expectation is the activation of the previously primed node (Schubert, in press). The individual also has conscious access to some of the SAM networks through thought and attentive perception (and supposedly the concept of will; see Moskowitz et al., 1999), but activity also takes place without consciously accessible thought (Banks, 2021).
Nodes operate on two proximity-based principles: (1) spatial/semantic coding, which refers to coding of objects and their meanings, discussed previously, but also (2) temporal coding, which refers to coding of time-based events. These two principles build on the philosophically grounded view that space and time “precede and structure how humans experience the environment” (Dehaene & Brannon, 2010, p. 517; see also Mattar & Bassett, 2020). Furthermore, contextual information – the information that accompanies and surrounding objects and events – appropriates these two principles, and is also coded in both space and time.
Temporal Coding
Temporal coding in SAM networks uses the same node–link–node activation principles as spatial/semantic coding, but here a node represents a discrete, time-dependent event. When such a source node is activated, it will only activate another temporally coded target node with which it has been previously temporally linked (i.e., as part of a time-dependent sequence) at the conclusion of its own activation (Howard et al., 2008; Veliz-Cuba et al., 2015). The activation of one node representing a segment of music by a previously activated node representing the previous portion of the music is referred to as veridical chaining (Schubert, 2021; Schubert & Pearce, 2016). Because in SAM theory temporal nodes encoding music are best thought of as representing musical segments (structures that can be represented in isolation, but also optimal for recombination with other node-represented segments), we no longer need to refer to the more generic terminology of “fragments” of music.
In spatial/semantic coding, the same nodes are recruited to represent repeated experiences of the same percept. This will also be the case for time-based coding. If a temporal event has been previously coded (e.g., a segment of music has become familiar due to previous exposure), upon rehearing or imaging that event, the same nodes employed to encode this information will be reactivated. Even if an event is repeated several times within a sequence, such as a phrase of music that is repeated (“intraopus” repetition), the node representing that phrase will be reactivated on each occurrence (e.g., on each sounding) of that phrase of music.
A common, well-established way of representing this in SAMs is the recurrent network (Botvinick & Plaut, 2006; Elman, 1990; Fernández & Vico, 2013; Todd, 1989). In its simplest form, a recurrent network will treat contiguously repeating sequential information as the reactivation of the same node (or collection/layer of nodes). The termination of the first activation of a node layer reactivates itself via an intermediate node layer. The intermediary nodes are referred to as the context layer of nodes (Elman, 1990). These nodes provide information about the global location of the sequence, meaning the large-scale structure of the sequence event based on the overall sequence of events during encoding.
Consider the YY elements in the temporal sequence XYYZ, where each letter (element) represents a unique phrase or subphrase of music. If each phrase (X, Y, and Z) is represented by unique, time-dependent nodes, activation of X will prime the second element, Y, and then when X has completed its time-dependent activation, Y becomes activated as part of the chain, priming the next Y (i.e., itself). Without contextual information, the second Y activation recruits the identical network used to code the Y element, and so leading to an incorrect (or more aptly, non-veridical) sequence recall of XYYYY …. The Y element is repeated ad nauseum (example shown in Figure 1). What is the evidence for this kind of contiguous recurrency error?

Illustration of simple melody (“Twinkle twinkle little star” or “Ah vous dirai-je, Maman”) segmentation using recurrent node coding with contextual information: (i) score of melody labeled to illustrate its A B A phrase structure (with subphrases labeled as WX YY WX); (ii) melody in (i) shown as a recurrent network using only three nodes (representing the unique subphrases). Letters denote music segment encoded into each node (circle) with corresponding melody score excerpt (subphrase) shown above the node, taken from (i). Numerals on arrows (links) denote temporal sequence position that provides structural context. Without the numerically sequenced structure (context), the active node (W, X, or Y) will activate any of its outgoing links (not necessarily as determined by the veridical version of the melody shown in (i)), meaning that repeated looping may occur at node Y, either with itself or back to the beginning of the piece (node W). Repeated looping may also occur at the larger time scale if contextual information is not available to indicate that segment X has been activated a second time as part of the sequence.
Evidence of Contiguous Recurrency Error
Pfordresher and Palmer's seminal studies on sequential ordering of music memory (Palmer & Van de Sande, 1995; Pfordresher & Palmer, 2006; Pfordresher et al., 2007) demonstrated that memory errors for recall during music performance can be arranged into two broad types. One concerns recall of contiguous surface (local) feature events, such as the sequence of notes in a melody. Here, if an error (in comparison to the veridically encoded sound sequence) occurs as a replacement of the “correct” event, that error will not be a completely random event, but most likely an intrusion from a nearby event, either (i) a note nearby that has already been activated, that is, an error due to “delay,” or (ii) a note nearby that is about to be activated, that is, an error due to “prelay” (Pfordresher & Palmer, 2006). The closer a nearby serial event is to the target event of the sequence in question, the greater the likelihood that it will intrude upon or replace the “correct” event, to produce the error. The language equivalent is the spoonerism. This memory error is referred to as the serial distance hypothesis (de Graaff & Schubert, 2016; Palmer & Van de Sande, 1995). It is a kind of error that occurs for events that are at the surface structure (note by note, if you like) time scale, with event durations less than a few seconds, corresponding to sensory register processing discussed earlier (Atkinson & Shiffrin, 1968).
The other type of error, more pertinent to the question of larger-scale recurrence, concerns recall of the ordering of a previously learnt sequence at the global structural level. A simple example is a melody that is repeated several times in the course of a piece of music, for example with a structure that can be represented as ABACAD, where A, B, and C are each unique melodies or melodic segments. An error in recall of such a piece along these lines is that the shared material across the repeated sections creates structural slippage in a performance. Each time the A section is performed, it might not be followed by the veridically proposed sequence and, for example, continually slip back to the B section, to produce an ABABAB … sequence instead of ABACAD. That is, when material is about to be repeated (immediately after the B section in this case), insufficient context is available to the performer about serial location, meaning that they skip to one of the earlier (or later) repeated sections (as another example, see the Y subphrase in Figure 1). But the typical situation in musical structures, particularly in popular music, is even more perilous.
Consider the excerpt in Figure 2. The two-bar subphrase shown is repeated four times, in a structure that can be represented as b b b b. These four subphrases constitute the chorus of the song, which if denoted in total as B, gives the overall song a structure that can be represented as IABAIB′, where A is the verse of the song, and I is the introductory, instrumental material (neither shown in the figure). Here, B′ indicates additional repetition of b material, which continues until a fade out in the canonical recording. The mental storage of this song would entail recycling activation of the b material recurrently, with a context layer keeping track of the number of repetitions performed/activated, and therefore maintaining the overall structural integrity of the song (IABAIB′).2 If the contextual information activation were relaxed or ignored, then the lower-level song location tracking alone may lead to errors of the kind discussed earlier (see, in particular, Figure 1), with the possibility that the b subphrase continues to prime and reactivate itself ad nauseum, producing an endless loop effect.

Example of a chorus containing multiple internal contiguous repetitions. (Taken from Example 2.37: The Steve Miller Band, “Take the Money and Run” (1976) in Nobile, 2014, p. 83). Reproduced with permission.
In language research, this kind of repetition error is analogous to “backtracking” (Mandler, 1978; Mandler & Johnson, 1977) and occurs when the individual recalling text, such as a story, has been unsuccessful in keeping track of the overall structural progress of the story. That is, the positional association of items in a sequence provides cues for recall. If A follows B, and A also follows A at some point, then A is more likely to be recalled in a sequence erroneously (Lee & Estes, 1981).
Another characterization of this recall failure is the “associative intrusion,” where an item from a previously remembered sequence of events will activate the member of a list that followed it during encoding, regardless of the overall structural serial position of the item during encoding. That is, if a contiguous pair in a sequence is recalled correctly, it may still appear in the wrong overall position (Wickelgren, 1966).
Backtracking and associative intrusions, however, suggest that when material is repeated in a time-dependent sequence there is a reasonable probability that at recall the individual will get “stuck in a loop.” For example, when recalling the sequence AABAACAAD, the individual may simply continue to recall the A part of the sequence, initially correctly. However, since A is followed by A, at a local level, A is most likely to follow A, and to simplify processing in the SAM network, sounding the A sequence will prime A at the local (surface feature) level. In the absence of larger-scale contextual information, priming and then activating B in AAB, rather than A again, as in AAA, is not assured. According to this characterization of the mental architecture, the more complex the larger-scale structure, and the more repeated material within the structure of the sequence, the greater the cognitive load required to activate the sequence veridically.
Apart from recall error literature, the tendency to seek to continue the same, contiguously repeated source material is also evident in the “Pulse Continuity Phenomenon,” where a fade out of a repeating chorus at the end of an audio recorded song initiates a perceptual continuation of the pulse, even when the fade out has completed and the sounded song has ended (Kopiez et al., 2015). As a final example, the implication-realization model (Narmour, 1990) asserts as its first hypothesis that there is a bottom-up, automated, and subconscious music expectancy system that, upon processing a structure consisting of short sequences (such as a musical interval) each with a similar pitch trajectory, a followed by a, creates an expectancy of “continuation” that the next short sequence will also contain the structure of a. For example, if a consists of two tones, with one tone rising above another, and the next instantiation of a consists of another pair of rising notes, then a third pair of rising notes is expected, as in a + a + a. Narmour (1991) puts it in the following way: “[i]n terms of subconscious expectation, [the] hypothesis of continuation [can be notated] in the following way: if a + a in registral direction or intervallic motion occurs, then listeners expect another a” (p. 4, and for further discussion see p. 19). This theoretical stance gives further credence to the propensity for the human cognitive apparatus to expect and to seek-to-continue a pattern that has already been repeated in the physical world.
Context and Temporal Coding
Evidence suggests that these recurrent “loop” errors occur fairly rarely in deliberate sequence recall tasks (Caplan et al., 2022; Henson, 1998; Kahana et al., 2022). Yet does that mean that the associative intrusion mechanism is false? Botvinick and Plaut (2006) and Anderson and Matessa (1997) developed network based models that better describe human-like recall behavior of serial lists, where contextual information (e.g., the hierarchical structure of the to-be-remembered list and tracking the serial position of each item in the list) is encoded along with the serial association of any element with the element that immediately follows it. That is, two processes are in operation in parallel. One is the surface-level association processing, where an item triggers the next item of the sequence in memory. The other is contextual processing, where the information about the position of an item within the larger sequence is stored (in the context of the overall list). This dual processing explanation is built on data where deliberate memorization and recall tasks are investigated. They concur with research in sport, story, and word recall tasks where accurate recall of sequences is supported by high degrees of expertise, concentration, and, consequently, cognitive load (Amichetti et al., 2013; Rubínová et al., 2021; Tenenbaum et al., 1994).
If these principles are applied to free, involuntary (as distinct from deliberate) recall, such as that which takes place during INMI and more specifically InLaCReMI episodes, we are in a position to make clear predictions. The two that emerge from the preceding discussion are that:
the INMI fragment will be one that is already encoded recurrently, and a key enabling of the (“erroneous,” non-veridical) recurrent recall is a relaxation of contextual tracking.
There is already ample evidence that music consists of considerable amounts of intraopus recurrency (Margulis, 2014; Ockelford, 2006). Yet what is the evidence that contextual information is relaxed during INMI episodes?
Default Mode Network
In periods of relaxation and mind-wandering a so-called default mode network of the brain is believed to be activated (Raffaelli et al., 2018). According to Buckner et al. (2008), the network is active when individuals are focused on thoughts and matters not directly concerned with the current external environment. Here, the constraints of the veridical world do not need to be considered by the individual. It is a more private world in which the past is reflected upon and the future is planned without the necessary imposition of logical and social constraints.
In an experience sampling study, Floridou and Müllensiefen (2015) asked participants to report experiences of mind-wandering, INMI, and a range of emotions and activities at the time of the request. Through statistical modeling, the researchers concluded that mind-wandering enabled INMI experiences. Mind-wandering was itself associated with absence of high-cognitive load, absence of external audio-visual stimulation, and absence of socializing. SAM is able to explain how this comes about from a cognitive perspective. This conclusion is also consistent with the finding that the active default mode network can produce spontaneous repetitive thoughts (Baars, 2010).
When music enters such levels of consciousness, the mind need not activate more complex, large-scale mental representations of music sequences that were encoded during exposure to the physical music signal, and the mind can more freely cycle through representations. That is, if an individual is not compelled to consciously reproduce a veridical performance of a previously encoded piece, the network will be free to jump to whichever events is most strongly primed outside, or with reduced reliance on, the usual global, contextual constraints. If the music fragment being processed in a piece is of the form AABA, then the lower-level associative, recurrent links are more likely to be activated (e.g., if A is activated, it may prime B, but more likely will prime A because A follows two out of the four sections of the piece) without necessarily instantiating the larger-scale structure (in this case, from the global contextual perspective, if the second A is activated, B is most strongly primed). In times of default mode network activation, higher-level, contextual information (“A is followed by A again which is then followed by B which is then followed by A a third time”) is less primed than more local, surface feature associations (“A is sounding now, A probably comes next” …).
Discussion and Implications: Reconciling SAM and Extant Evidence about InLaCReMI Data
With the preceding theory, we are in a position to more clearly define the InLaCReMI and other species of INMI and to understand the consistent findings of earworms within a theoretical framework. These understandings are presented as two hypotheses about INMI in general and two hypotheses about InLaCReMI specifically. Methodological issues in testing the InLaCReMI hypotheses are also discussed.
Definition: InLaCReMI is a Special Case of INMI that Recruits a Contiguously Repeating Musical Fragment Already Available during Encoding
The SAM-based theory suggests that contiguous repetition at encoding will be an important aspect in structuring InLaCReMI experiences. No empirically verifiable research has been cited that explicitly examines when and whether INMI episodes consisting of contiguously repeated material can be traced to music that already had some form of contiguous recurrence. From a SAM perspective, this is an essential question, since the proposed mental architecture suggests that this kind of experience is likely to occur, particularly under certain enabling circumstances (see H2). Therefore the definition of INMI can be further clarified through a consequent, revised taxonomy of the nature of INMI experiences. Building on the taxonomies developed by Williamson et al. (2012) and Liikkanen and Jakubowski (2020), different INMI species need to be clearly differentiated according to the musical structure of the episode, and the taxonomy proposed in Figure 3 aims to do this, demonstrating how InLaCReMI is a specialised species of INMI.

Taxonomy of musical imagery contents, focusing on involuntary musical imagery (INMI) and identifying the species that is central to the present investigation, involuntary limited and contiguously repeating musical imagery (InLaCReMI) – shown in grey background. “Veridical” here refers to “in correspondence with the music contents and structure found in the canonical sound-recording/performance.”
INMI H1: Mentally Represented Music is Recruited for INMI – “Familiarity” Hypothesis
INMI will be of music that is already mentally represented, usually represented as a result of previous exposure because, according to the SAM-based theory, nodes activate during INMI conditions, meaning that when music representations are activated, they must have been previously encoded, and this encoding usually takes place through previous exposures, where nodes come to represent external stimuli (objects and events). Under self-report experimental conditions, exposure can be approximated by familiarity (Freitas et al., 2018).
The majority of studies on INMI have employed participant selected stimuli, and those studies almost unanimously support the familiarity hypothesis (e.g., Byron & Fowles, 2015; Hyman et al., 2013; Williamson et al., 2012). Even when novel tunes are presented in training sessions for the purpose of inducing novel (not previously available) mental representations, INMI occurrences of the specially composed stimuli are reported more frequently when the training stage exposure to the stimulus was more frequent (Byron & Fowles, 2015).
Using novel music for training and induction of INMI allows a degree of experimental control of the stimulus involved, but the compositions employed may be confounded with the individual's existing mental library of previously experienced music fragments, particularly if the compositions share characteristics of the individual's autobiographical representations of music segments (Bayard, 1950; Jan, 2004, 2007, 2015; Merkley, 1988; Meyer, 1956; Pendlebury, 2020; Schubert, 2021; Volk & Van Kranenburg, 2012). An individual in contemporary Western culture is likely to have a vast mental library of music (Folkestad, 2012; Hargreaves, 2012). This is why even when supposedly novel music is used to falsify the hypothesis, there must be some assurance that stylistic aspects of the novel music are not already mentally represented (Gjerdingen, 1985, 1990; Gjerdingen & Perrott, 2008).
INMI H2: Preference for Primed Music for INMI – “Recency” Hypothesis
The material that will be used for INMI will be that which has been recently activated. This is because once a node has ceased being activated, some residual excitation remains in the node, which then decays over time (Smith & Queller, 2004). The decay time may be in the order of seconds, or minutes (Huyck, 2020; Stawarczyk et al., 2019). Hearing the music again in the physical world, or activation of a semantically, spatially, or contextually related link can also reignite (prime or activate) the node (Berkowitz, 1984; Smith & Queller, 2004). Yet the more recent the node activation, the greater the possibility that it will be reactivated by a smaller than usual amount of additional excitation. This hypothesis is well supported by extant data. Recency is among the most frequently reported predictors in INMI research (e.g., Bailes, 2007; Byron & Fowles, 2015; Halpern & Bartlett, 2011; Hyman et al., 2013; Williamson et al., 2012).
InLaCReMI H1: Material Contiguously Recurrent at Encoding is Preferred Material for InLaCReMI Episodes – “Contiguous Repetition at Encoding” Hypothesis
InLaCReMI H1 hypothesizes that contiguous repetition during encoding (most commonly when exposed to the music) provides the material that will be recruited for an InLaCReMI episode. Material that is not contiguously repeated, such as music of the form ABACAD will be less likely to evoke an InLaCReMI episode (because the repeated material A is never repeated contiguously) than, for example, music of the form BAACDA because of the presence of contiguous repetition in the latter (the A section after the B section is immediately repeated). In the former example, looping repetition is possible, such as ABA that then loops back repeatedly to the first A, hence ABABAB … . However, this requires that either the duration of the AB sequence together is sufficiently short to allow it to be treated as a segment, or that more contextual tracking is possible than the second example that has the immediately repeated shorter segment. This hypothesis is complicated by issues concerned with nested repetition and individual differences, which also concern InLaCReMI H2, discussed later. InLaCReMI H1 also raises some theoretical and definitional matters, concerned with the veridicality of repetition, and the necessity of contiguity, which will now be considered.
Veridicality of Repetition
Through a mechanistic interpretation of SAM, the veridical looping of a segment of music should consist of precise repetitions, since it is the same fragment of music stored in memory that is being replayed, like a sound recording. However, the biological reality and the astounding complexity of SAM mean that this cannot be treated too literally. When a fragment of music is repeated, whether intraopusly or interopusly, small changes can occur with the repeating segments, and yet each version of the segment is still perceived as repeating. In fact, such small variations are an essential part of music making, without which music sounds mechanical, robotic, or even non-human (Margulis, 2014, pp. 117–119; Schubert et al., 2017). The production and perception of “expressive timing” is a case in point. Different performances of the same piece will necessarily be performed differently, and thus play a role in identifying the performer, as well as the time and place of the performance. Yet the piece retains its identity. Western-trained musicians frequently refer to these differences between performances as “interpretation” (for detailed discussions, see Fabian et al., 2014). In short, small variations in a repeated segment are integral to the sophisticated, dynamic encoding process of music into mental representations. Nevertheless, how they realize themselves during InLaCReMI experience cannot be predicted by the hypothesis as it stands.
Necessity of Contiguity
Repetition alone is not a sufficient criterion for the InLaCReMI species of INMI, according to SAM. Repetition may occur frequently in a piece of music, but if the repeating fragment is distributed throughout the piece, it can only be a candidate for InLaCReMI if there are at least some occurrences of the fragment that is repeated contiguously. This conceptual distinction may be easier to understand through a natural language example. If Haruma says to Xiao “Can you hear me?”, Xiao replies “What did you say?”, and Haruma then repeats “Can you hear me?”, we have repetition of the question “Can you hear me?”, but it is not contiguous. If, on the other hand, Haruma says “Xiao! Can you hear me? Can you hear me?”, we have contiguous repetition of the same expression. Consider the study by Jakubowski et al. (2017) in which indices measuring repetition were reported – referred to as “m-type” features. Such measures provided information about the amount of repetition of a particular order of notes that occur in a phrase of a song, and also a comparison of such indices against the amount of repetition found in the corpus of songs used for comparison. While repetition across phrases provides information about the amount of repetition in the stimulus, it does not necessarily indicate contiguous repetition. Furthermore, according to the SAM-based prediction, contiguous repetition is more akin to a dichotomous variable – the music either has contiguously repeating material (some or a lot), or it does not. As it turns out, musical forms, particularly pop music forms, have considerable amounts of contiguous repetition as part of their structure, such as the highly common AABA form (Graziano, 2008; Owens, 2003) – the first two sections, AA, illustrating contiguous repetition. Thus, both typical and even quite sophisticated algorithms that to date estimate the amount of repetition in music do not necessarily provide the best means of assessing music stimulus suitability for InLaCReMI episodes.
This mechanistic interpretation of the SAM-based explanation is that there will be no gap or any other material between contiguous mental renderings of the looping segment. The node is simply reactivated when it has completed its own activation, hence the music segment shares a temporal boundary only with itself (it ends and then immediately begins again). However, in practice this is unlikely to be an immutable rule. Given that InLaCReMI episodes occur during or as a result of mind-wandering, the mind will be free of rule-based confines of any literal, rule-imposed contiguous repetition. At any time during the looping process, associative intrusions are free to become activated, potentially interrupting the looping process (“means of structuring” such interruptions between otherwise contiguous loop boundaries are proposed by Huovinen & Tuuri, 2019) and even ending the episode. However, it is the InLaCReMI episodes that are of theoretical interest here, and that which the hypothesis predicts. That is, the regularity of the InLaCReMI species of INMI should be first demonstrated, and if supported, such refinements should be further investigated.
InLaCReMI H1 provides an alternative to the search for musical features typifying INMI, and also provides a hypothesis for the type of music that favors InLaCReMI species of INMI. This is because the SAM-based theory is agnostic about the characteristics of the music: It is the organization of musical material at a structural level that is the critical factor in determining potential material for InLaCReMI. Music fragment organization in the mind, rather than the nature of isolated music fragments, drives the potential for InLaCReMI episodes, according to the SAM-based theory. Contiguous repetition of a musical fragment, when coded mentally in a recurrent node–link–node network will set up the basic enabling conditions for an InLaCReMI episode.
This SAM-based theory does not rely on fluid and nebulous concepts of catchiness and “hooks.” It is the interlinking of the musical fragments, and less so their qualities, that determines InLaCReMI suitability. InLaCReMI gives preference to low-level structural recurrence, but there may be more opportunities for InLaCReMI-friendly structures than common song analyses permit.
A designated section of a piece of music such as the chorus or verse is likely to contain repetition within a single instance of the section (e.g., see Figure 2), and furthermore, the end of such songs frequently repeat the already repeating structures within the chorus. In the example of Figure 2, contiguous repetition can be observed at four levels – every two bars, which is nested within four-bar repetition (system one and two of the score shown in Figure 2). Furthermore, in the canonical recording of the piece shown in the figure, the chorus continues to repeat at the end of the song until a fade out. The evidence that the chorus of a song constitutes INMI material (e.g., Beaman & Williams, 2010; Hyman et al., 2015) supports the notion of preference for material that recurs contiguously during encoding.
Figure 2 shows the chorus section of a pop song, and it may or may not be viewed as a “hook,” but from a structural organization perspective, based on SAM, the contiguous repetition of the single phrase makes it a clear-cut candidate for INMI and for InLaCReMI in particular. Current research has not explicitly investigated whether contiguously repeated material, such as that shown in Figure 2, also constitutes the material for InLaCReMI, with the encoded repeated material being preferred and experienced with continuing, contiguous repetition. However, to test this explicitly brings with it some methodological challenges.
Methodological Challenge – Musical Structure
Caution is needed in assuming that a section within a formal structure is without any further internal (nested) recurrence, and methodology needs to carefully consider this matter. In the example shown in Figure 2, nested recurrence is easily identified. However, there is further contiguous repetition, namely the harmonic progression, which is repeated four times in the excerpt shown (displayed as chord labels “G F C”). This harmonic progression is also the same as the chord progression in the verse (not shown in the figure), and the entire song, suggesting another layer of repetition. To simplify testing of InLaCReMI H1, such nested and ambiguous intraopus repetition should be avoided or controlled.
While some contiguous repetition is required to warrant a piece of music as an InLaCReMI candidate (as per InLaCReMI H1), the nature of the repetition presents interesting methodological challenges. A piece of music with the form ADDBCDD should be a better candidate for INMI (particularly the D section) than a song with identical but rearranged sections, such as DADBDCD (no contiguous repetition, but contiguous repetition at a higher structure level, such as DBDBDB … is possible). Specially composed pieces may need to be employed. However, even with specially composed pieces, care must be taken about the relationship between the musical knowledge of the individual and the style of the composition. If the individual's mental library of music were to become available to the researcher, though, this would prove an immensely lucrative resource for such research (for an approach to gathering autobiographical playlists through self-report techniques, see Istvandity, 2022). Other methodological challenges, particularly those concerned with individual differences, are discussed for InLaCReMI H2.
InLaCReMI H2: Reduced Contextual Activation – “Low Environmental Focus” Hypothesis
InLaCReMI H2 proposes that the episodes will occur when the individual is not strongly focused on the immediate environment. This is because such states are conducive to activation of the default mode network, which leads to low focus on, or demand for, veridical recall of (in this case musical) thoughts, as would be the case in higher demand situations (e.g., for public performance) (Fareri et al., 2020). Priming of contextual nodes, those involved in recalling intraopus contextual information (such as large-scale music structure), is diminished in such a state, and so the recall process can employ less-demanding, lower-level, surface feature strategies, facilitating recurrent excitation of music fragments (i.e., looping). This hypothesis is well supported (for a review, see Liikkanen & Jakubowski, 2020). For example, in ecologically typical (as distinct from laboratory) settings, individuals experience more imagined music when alone (Liikkanen, 2012) and probably less frequently when with strangers than with family members and friends (a possible interpretation of data reported by Bailes, 2007, p. 560). However, not explicitly tested is whether the contents of the INMI episodes consist of contiguously repeating material, a prediction that the SAM-based theory affords during periods of low environmental focus.
Methodological Challenge – Musical Expertise
If the “Low environmental focus” hypothesis (InLaCReMI H2) is further supported, research will be needed to ascertain other factors that enable the InLaCReMI species of INMI. For example, different levels of expertise and experience may influence the initiation of an InLaCReMI species of INMI. High experience or expertise might mean that structural information is easy to store, and such individuals are less likely to experience the InLaCReMI species of INMI, or do so on a larger scale; for example, a sequence ABCADEA could trigger an InLaCReMI as a loop around ABC that may also skip to another loop around ADE, and skip back to looping around ABC (because of the recurrency of the structural pivot material A).
Methodological Challenge – Individual Differences
Other individual differences, apart from musical expertise, may also be implicated; for example, those who have a cognitive style preferential to processing music as structures and relationships, so-called “music systemisers” (Kreutz et al., 2008), or those who favor thinking styles that involve considerable repetition in everyday life, such as people with obsessive compulsive (Beaman & Williams, 2010; Müllensiefen et al., 2014; Wendler & Schubert, 2019) and ruminative tendencies (Dalgleish et al., 2009; Garrido & Schubert, 2011; Szasz, 2009). The evidence for these personality factors is extremely limited and should be considered with caution.
Methodological Challenge – Self-Report Interference with Default Mode Network
While activating the default mode network seems to be feasible in the laboratory, the matter of probing the participant for a response during the activation of an episode is an imposition of an external demand that necessarily requires some degree of focus and attention to the external task, and hence the immediate external environment. Thus, if an INMI episode has been induced in the laboratory, asking the individual to report on the content of the episode (if something was heard, whether it involved repetition, and if so, was the repetition contiguous, and so on) could interfere with the INMI episode being probed. For example, in the study by Moeck et al. (2018), the participant holding down the space bar during an INMI episode allowed the researcher to estimate duration. But it is unclear if the vigilance required to perform a task for the experimenter, such as the act of holding down the space bar, would require so much self-interrogation of a spontaneous experience that free flow of consciousness is dragged too far away from its internally associative state to represent the underlying effect being investigated (see also Farrugia et al., 2015, pp. 73–74).
While the self-report method is an indispensable part of the empirical researcher's tool kit, other options should also be considered to enrich data from the self-report technique where the participant's consciousness is at risk of being drawn away from the experience under scrutiny (Gelding et al., 2022; Hubbard, 2022). One solution is to gather responses without notice or implicitly. For example, once the INMI has commenced, interrupt it with a surprise task to see if the participant can recall any of the immediate structure of the episode (e.g., Moeck et al., 2018). An option that avoids self-report altogether is brain imaging. While brain imaging studies of (voluntary) musical imagery are available (Halpern, 2001; Regev et al., 2021; Zatorre, 1999; Zatorre & Halpern, 2005; Zatorre et al., 1996), no studies have been cited reporting brain imaging during INMI episodes, and there is little known about what trackable, time-based brain activity corresponds to imagined musical structures, but there are some promising developments (Lloyd, 2020).
Conclusion
This article investigated the nature of material used for INMI episodes through an examination of extant data and contemporary theory of cognition and memory. A SAM of mental cognition and memory was applied to understand how INMI comes about and, central to the present investigation, the nature of the musical material used for such episodes. The SAM model proposes that spatial, semantic, and temporal phenomena are coded via a vast network of interconnected nodes and that the perception of a particular, familiar object or event will activate the same set of nodes that were initially recruited during exposure to the corresponding object or event. From this, hypotheses on recency and familiarity were proposed, and the extant literature provides initial support for them.
Repetition is an important part of INMI experience, but researchers have yet to systematically delineate the different ways that the musical episodes organize the repetition. Based on the theoretical positioning, it became evident that more work needs to be done in taxonomizing the way INMI material is organized. A possibly common species of INMI was proposed, namely InLaCReMI, where the organization of musical material during the INMI episode consists of musical material that contiguously repeats. Such repetition has been implied by much past research, occasionally stipulated, but has not, to the knowledge of the author, been explicitly and systematically investigated.
Through SAM's propensity to recycle previously encoded, temporally based mental representations, two hypotheses were proposed regarding InLaCReMI:
The first is the “Contiguous repetition at encoding” hypothesis. This hypothesis predicts that the InLaCReMI species of INMI is more likely to occur with music that already has contiguously repeating material. Evidence of the availability of such music is widespread, but the hypothesis predicts that it is the contiguously repeating material that will be appropriated for the InLaCReMI episode, rather than an arbitrary fragment, a chorus section of a song or the somewhat nebulous “hook,” proposed in previous literature. While this hypothesis has not been explicitly tested, the widespread reporting of the chorus of a song being stuck in one's mind provides support because the chorus section, as it turns out, commonly consists of contiguous repetition. The second, the “low environmental focus” hypothesis, is concerned with an enabling condition of InLaCReMI, and the activation of the default mode network. This translates to the individual being in states of daydreaming or relaxation, and can also be aided by being away from strangers and cultural expectations of veridical (authentic) performance. The hypothesis is derived from the notion that contextual information about the large-scale structure of a piece of music is coded in parallel with local, surface-level information, and that the “higher demand,” large-scale contextual information is itself untracked when the default mode network is activated, facilitating the activation of more surface-level recurrence alone – as according to the “contiguous repetition at encoding” hypothesis.
There are still several puzzles that INMI research needs to address to understand not just the nature and other aspects of INMI, but also the insights it can provide for cognition and memory. The structure of the INMI episode requires more systematic attention, and may benefit from implicit methods of data collection, such as brain imaging techniques, which for INMI research are in their infancy (Farrugia et al., 2015), and ways to carefully control the structure of the musical material for inducing INMI are also needed.
Theoretically founded investigation and consistent, unambiguous definitions (Williams, 2015) are essential to help move forward understanding the nature of cognition and memory. In an effort to continue to refine definitions and to further develop theory on INMI, and InLaCReMI in particular, the present article provides an initial, potentially unifying theoretical grounding, from which focused INMI research programs can be developed.
Footnotes
Action Editor
Mats Küssner, Humboldt-Universität zu Berlin, Department of Musicology and Media Studies.
Peer Review
Freya Bailes, University of Leeds, School of Music. Kelly Jakubowski, Durham University, Department of Music.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Approval
This research did not require ethics committee or IRB approval. This research did not involve the use of personal data, fieldwork, or experiments involving human or animal participants, or work with children, vulnerable individuals, or clinical populations.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
