
Abstract

The tritone paradox is a bistable auditory phenomenon where two Shepard tones can be interpreted as either ascending or descending. Previous studies have demonstrated that preceding auditory context can bias the direction of tritone perception. Here, we systematically manipulated both the quantity (anywhere between 1 and 10) and types (higher, lower, same as first target tone, or silent) of context tones before presenting a target tritone pair. We found that the contextual biasing effect can emerge with as few as 1–2 context tones, and plateaus quickly within this small window. Notably, low-frequency context tones produced a more pronounced and immediate bias than high-frequency tones. Together, this study demonstrates a narrow window of the auditory context effect, where minimal contextual cues are sufficient to guide perceptual interpretation of ambiguous auditory stimuli. The findings pave the way for more detailed investigations into the cognitive mechanisms of auditory perception, emphasizing the swift influence of immediate auditory contexts on perceptual outcomes.

Keywords

auditory perception, Shepard tones, tritone paradox, contextual influence, pitch bias, auditory memory

How to cite this article

Hou, C., Wang, J., Lo, Y., & Tseng, P. (2025). Rapid biasing effect of prior auditory contexts on bistable tritone perception. i-Perception, 16(6), 1–14. https://doi.org/10.1177/20416695251409272

Introduction

Perception is a complex process that allows us to interpret the world around us through information gathered by our senses. To do this, the auditory system relies on prior knowledge and context to disambiguate sensory information and make accurate perceptual decisions. This is particularly useful when one is confronted with ambiguous stimuli. Indeed, research using ambiguous auditory stimuli has shown that the perceptual disambiguation of such stimuli is heavily shaped by both long-term (e.g., childhood exposure) (Deutsch et al., 2004) and short-term immediate auditory contexts (e.g., preceding tones) (Chambers et al., 2017). The influence of context on the perception of subsequent ambiguous sounds is a well-established principle that extends across a variety of auditory phenomena. For instance, short-term prior auditory context has been demonstrated to significantly affect the perceptual organization of bistable auditory streams, such as those formed by ABA pure-tone patterns (Snyder et al., 2008). Furthermore, longer-term influences like linguistic experience have also been shown to shape auditory stream segregation, for example, in the perception of ambiguous speech sounds (Billig et al., 2013). However, the precise temporal characteristics and dynamics of this contextual influence on bistable perception remain to be fully delineated, which is the primary focus of the present study.

One of the tools for investigating perceptual (here, auditory) decisions is the bistable paradigm. Bistable phenomena involve ambiguous stimuli that can be interpreted in two, often opposite, ways. In auditory perception, one example of a bistable stimulus is the tritone paradox, which consists of two Shepard tones that can be perceived as either ascending or descending. In short, a Shepard tone comprises a series of pure tones, each separated by an octave, whose amplitudes fade as they approach the upper or lower limit of the audible range (Shepard, 1964). A tritone paradox consists of two successive Shepard tones separated by six semitones (a tritone). The result is an ambiguous sequence: the same tritone pair can be heard as ascending or descending, depending on the listener and the context (Deutsch, 1986).

Previous studies have found that, among native English speakers, listeners from South England versus California had almost opposite patterns of pitch preferences, suggesting that it is not the language, but the accent, that shapes one's perceptual preference (Deutsch, 1991). Furthermore, the preference for different accents can likely be traced back to development since childhood, such as one's mother tongue. Deutsch et al. (2004) collected data from two Vietnamese groups: adults who arrived in the U.S. late and spoke fluent Vietnamese but had limited English proficiency, versus another group that arrived in the U.S. as infants or young children and spoke fluent English but were not proficient in Vietnamese. Despite these marked differences in their current language proficiency and usage (Vietnamese-dominant vs. English-dominant), the two groups showed no significant differences in their perception of the tritone paradox. Deutsch et al. inferred from this lack of difference that the current language environment was not the primary determinant of tritone perception in these individuals. Instead, they proposed that because both groups were likely exposed to Vietnamese from their primary caretakers during their critical early developmental periods (even if the latter group later became English-dominant), it was this early linguistic exposure, specifically the mother tongue of the primary caretaker, that played a crucial role in shaping their (similar) perceptual tendencies for the tritone paradox. Robust evidence remains scarce: the proposed connection between language and Shepard tone perception was largely speculative and rested on null findings, and a subsequent study only partially replicated the regional differences (see Repp, 1994). Nevertheless, these results point to long-term influences as one possible factor in determining tritone percepts.

Recently, a novel effect of immediate prior context on auditory perception has been reported, showing that short-term context, involving just a few seconds of preceding tones, can also be instrumental in biasing one's subsequent auditory perception. One important study by Chambers et al. (2017) presented participants with sequences of tones that served as pretarget auditory “context” (i.e., context tones), followed by a target tritone pair, and participants had to indicate whether the pair was ascending or descending in pitch. These authors demonstrated that preceding context tones had a significant effect on participants’ perceptual judgments. Specifically, when the target tone was preceded by 10 context tones that were lower than the first target tone, participants were more likely to perceive the first target tone as higher in pitch, causing the second target tone to be lower (i.e., a descending pattern), and vice versa. Furthermore, the authors manipulated the number of context tones and found that the biasing effect could reach a plateau with merely three context tones, such that three context tones were enough to achieve the same perceptual biasing effect as 10.

In this study, we aimed to build upon the findings of Chambers et al. by systematically examining the temporal and structural properties of auditory contextual influence on the tritone paradox. Specifically, we sought to determine: (1) how many context tones are sufficient to elicit a perceptual bias, (2) whether the directionality of context tones (higher vs. lower) differentially modulates the percept, and (3) whether the biasing effect is additive (accumulating with more tones) or driven primarily by recent stimuli (e.g., winner-take-all). These questions address important gaps in the literature regarding the resolution and mechanism of auditory contextual effects. Our approach goes beyond prior work by varying both the quantity and type of contextual tones (including neutral and silent conditions), allowing us to isolate the temporal dynamics and perceptual weighting of auditory context in bistable pitch perception.

Methods

Participants

A total of 28 participants (11 male, 17 female, mean age 22.1 years, SD = 9.37 years) without hearing impairment participated in this experiment. All participants were naïve about the purpose of the experiment and gave written informed consent prior to their participation. All protocols were approved by the Joint Institutional Review Board of Taipei Medical University, and all methods were performed in accordance with the relevant guidelines and regulations. Detailed musical training histories were not collected. However, participants were recruited from the general university population without any known bias toward musical expertise.

Apparatus and Stimuli

The original auditory stimuli were Shepard tones extracted from the online demos shared by Chambers et al. (2017) (http://audition.ens.fr/dp/illusion/). Shepard tones are chords constructed from a fundamental frequency (Fb) at the base frequency, which is then superimposed with up to eight additional octave-related pure tones and a fixed Gaussian amplitude envelope. This fixed Gaussian amplitude envelope was centered at 960 Hz with a standard deviation of 1 octave, rendering the perceived pitch ambiguous and cyclical. Each trial consisted of 10 context tones, C1 ∼ C10, followed by two target tones, T1 and T2. All tones were Shepard tones, and T1 and T2 were separated by a tritone. T1's base frequency (Fb) was randomly drawn from 60 to 120 Hz, while T2's Fb was set six semitones (a tritone) from T1. In this ambiguous configuration, T1's frequency components were positioned midway between T2's components on a log-frequency scale, enhancing perceptual ambiguity.
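As a rough illustration of the construction described above, the following Python sketch computes the octave-spaced components of a Shepard tone and their Gaussian weights on a log-frequency axis. The envelope centre (960 Hz), its standard deviation (1 octave), and the tritone spacing come from the text; the function names, the choice of nine components, the sample rate, and the normalisation are assumptions made for illustration and are not taken from the original stimulus code.

```python
import math

ENVELOPE_CENTER_HZ = 960.0   # envelope centre reported in the text
ENVELOPE_SD_OCTAVES = 1.0    # envelope standard deviation reported in the text
N_COMPONENTS = 9             # assumed: base tone plus eight octave-related tones


def shepard_components(base_hz, n_components=N_COMPONENTS):
    """Return (frequency, weight) pairs for one Shepard tone."""
    components = []
    for k in range(n_components):
        freq = base_hz * 2 ** k
        # Distance from the envelope centre, measured in octaves.
        distance_oct = math.log2(freq / ENVELOPE_CENTER_HZ)
        weight = math.exp(-0.5 * (distance_oct / ENVELOPE_SD_OCTAVES) ** 2)
        components.append((freq, weight))
    return components


def tritone_partner(base_hz):
    """Base frequency six semitones (half an octave) above base_hz."""
    return base_hz * 2 ** (6 / 12)


def synthesize(base_hz, duration_s=0.125, sample_rate=44100):
    """Sum the weighted sinusoids into a list of samples (hypothetical helper)."""
    nyquist = sample_rate / 2
    # Drop components at or above the Nyquist frequency to avoid aliasing.
    comps = [(f, w) for f, w in shepard_components(base_hz) if f < nyquist]
    total = sum(w for _, w in comps)
    n_samples = int(duration_s * sample_rate)
    return [
        sum(w * math.sin(2 * math.pi * f * i / sample_rate) for f, w in comps) / total
        for i in range(n_samples)
    ]
```

Because the components of T2 sit exactly half an octave from those of T1, moving up or down by a tritone yields the same set of pitch classes, which is what makes the pair directionally ambiguous.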

Unlike the Chambers et al. (2017) study, where all 10 context tones were either higher or lower than T1, in this study, we manipulated the number of higher and lower tones. To do this, the base frequencies of C1 ∼ C10 were all kept within the interval between T1 and T2. Specifically, two half-octave wide frequency regions were defined relative to the components of T1. First, a pool of candidate Shepard tones was established by identifying all tones with base frequencies falling within the half-octave region above (High, H; blue notes in Figure 1) or below (Low, L; orange notes in Figure 1) T1. For the H and L conditions, each context tone in the sequence was randomly sampled from the respective pool. For example, if T1 is 100 Hz, then each high context tone is independently drawn between 101 and 141 Hz. In the Same condition, the context tones were identical to T1 (S; green notes in Figure 1). In the Silent condition, no tones were presented (X; x marks in Figure 1).
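The sampling rule above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the text describes drawing from a pool of candidate tones, whereas the sketch assumes a continuous draw that is uniform on a log-frequency axis within the half-octave region. The function name and the single-letter codes passed as `kind` are likewise illustrative.

```python
import random


def sample_context_base(t1_hz, kind, rng=random):
    """Draw one context-tone base frequency relative to T1.

    kind: "H" (above T1), "L" (below T1), "S" (same as T1), "X" (silence).
    Returns None for a silent slot.
    """
    if kind == "S":
        return t1_hz
    if kind == "X":
        return None
    if kind not in ("H", "L"):
        raise ValueError(f"unknown context type: {kind!r}")
    # Assumption: log-uniform shift strictly inside the half-octave region.
    shift_oct = 0.0
    while shift_oct == 0.0:
        shift_oct = 0.5 * rng.random()  # value in (0, 0.5) octaves
    factor = 2 ** shift_oct
    return t1_hz * factor if kind == "H" else t1_hz / factor
```

With T1 at 100 Hz, high context tones drawn this way fall between 100 Hz and roughly 141 Hz, in line with the example in the text, and low context tones fall between roughly 71 Hz and 100 Hz.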

Figure 1.

Schematic illustration of sample auditory context sequences preceding the target tritone pair (T1, T2).

The duration of each context tone was 125 ms, and the intertone interval was 125 ms. The interval between the context and target was 500 ms. These parameters matched those from the Chambers et al. study to maximize generalizability between studies.
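To make the trial timeline concrete, the sketch below derives tone onsets from the reported parameters. It assumes that both the 125 ms intertone interval and the 500 ms context-target interval are measured from the offset of one tone to the onset of the next; the text does not state this explicitly, and the constant and function names are illustrative.

```python
TONE_MS = 125            # context-tone duration (from the text)
GAP_MS = 125             # intertone interval, assumed offset-to-onset
CONTEXT_TARGET_MS = 500  # context-target interval, assumed offset-to-onset


def context_onsets(n_tones=10):
    """Onset time in ms of each context slot, with C1 starting at 0 ms."""
    return [k * (TONE_MS + GAP_MS) for k in range(n_tones)]


def t1_onset(n_tones=10):
    """Onset time in ms of the first target tone."""
    last_offset = context_onsets(n_tones)[-1] + TONE_MS
    return last_offset + CONTEXT_TARGET_MS
```

Under these assumptions the full 10-slot context occupies a little under 2.4 s before the gap that precedes T1.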

Procedure

The experiment was conducted online using a custom-built web platform. Participants completed the task individually in a quiet environment using their own headphones. They were instructed to adjust the volume to a comfortable listening level before beginning. A total of 56 randomized trials were presented to each participant. Participants used their mouse to select either “ascending” or “descending” after each trial. There was no time limit for responses, and participants completed the task at their own pace. The average task duration was approximately 10 min across all participants.

The task of the participants was to indicate whether T1 ascended or descended to T2 in pitch. As mentioned above, our pre- and postcontext tones can be categorized into three types: conflict (opposite condition: precontext and postcontext tones are high vs. low or low vs. high), nonconflict (same condition: precontext tones are same as T1, followed by high or low postcontext tones), or nonexistent (silent condition: no precontext tones; direct replication of the Chambers et al. study) (Figure 1). Each type can be divided into two variations (high-then-low and low-then-high), resulting in a total of 6 trial-type conditions (Table 1):

Context tones starting with H, ending with L (opposite condition): 9H1L, 8H2L, 7H3L, 6H4L, 5H5L, 4H6L, 3H7L, 2H8L, 1H9L.

Context tones starting with L, ending with H (opposite condition): 9L1H, 8L2H, 7L3H, 6L4H, 5L5H, 4L6H, 3L7H, 2L8H, 1L9H.

Context tones starting with S, ending with L (same condition): 9S1L, 8S2L, 7S3L, 6S4L, 5S5L, 4S6L, 3S7L, 2S8L, 1S9L.

Context tones starting with S, ending with H (same condition): 9S1H, 8S2H, 7S3H, 6S4H, 5S5H, 4S6H, 3S7H, 2S8H, 1S9H.

Context tones starting with X, ending with L (silent condition): 9X1L, 8X2L, 7X3L, 6X4L, 5X5L, 4X6L, 3X7L, 2X8L, 1X9L.

Context tones starting with X, ending with H (silent condition): 9X1H, 8X2H, 7X3H, 6X4H, 5X5H, 4X6H, 3X7H, 2X8H, 1X9H.

Table 1.

Trial-type conditions.

Trial Type	Condition Label	Context Configuration Examples
H→L	Conflict	9H1L, 8H2L, …, 1H9L
L→H	Conflict	9L1H, 8L2H, …, 1L9H
S→L	Non-conflict	9S1L, 8S2L, …, 1S9L
S→H	Non-conflict	9S1H, 8S2H, …, 1S9H
X→L	Silent	9X1L, 8X2L, …, 1X9L
X→H	Silent	9X1H, 8X2H, …, 1X9H
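The configurations in Table 1 can be enumerated programmatically. The sketch below generates the labels for the six families and expands a label such as 8H2L into its 10-slot sequence C1 to C10; the function names are illustrative and do not come from the experiment code.

```python
# (pre-context type, post-context type) for the six trial-type families.
FAMILIES = [("H", "L"), ("L", "H"), ("S", "L"), ("S", "H"), ("X", "L"), ("X", "H")]


def condition_labels():
    """Return labels such as '9H1L', from 9:1 down to 1:9 within each family."""
    labels = []
    for pre, post in FAMILIES:
        for n_pre in range(9, 0, -1):
            labels.append(f"{n_pre}{pre}{10 - n_pre}{post}")
    return labels


def expand(label):
    """Expand a label such as '8H2L' into the 10-slot sequence C1..C10."""
    n_pre, pre, n_post, post = int(label[0]), label[1], int(label[2]), label[3]
    return [pre] * n_pre + [post] * n_post
```

For example, `expand("8H2L")` gives eight H slots followed by two L slots.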

Results

We first examined the possible effect of the number of context tones. That is, how many context tones does it take for the biasing effect to reach its peak? To investigate how the overall 10-tone context sequence (C1–C10) influenced the perception of the subsequent target tritone pair (T1–T2), we analyzed the probability of specific perceptual outcomes. This contextual influence is termed the “biasing effect.” Figure 2 illustrates these effects. Specifically, Figure 2A plots “P(Bias),” which for this main illustrative panel is defined as the probability of participants reporting an “upward” shift in the T1–T2 target pair when the 10-tone context sequence ended with High (H) type tones (this mirrors the data in Figure 2C). This is shown as a function of the pre–post context tone ratio across the three trial-type conditions (conflict, nonconflict, and silent). Figure 2B and C provide more detailed breakdowns, plotting the probability of downward shifts for L-ending contexts and upward shifts for H-ending contexts, respectively (see the full Figure 2 caption for detailed descriptions of each panel). In the conflict condition, context tones start with either high or low context tones and end with tones of the opposite direction (Figure 2, blue). In the nonconflict condition, context tones start with the same tones as T1 and end with either high or low context tones (Figure 2, red). In the silent condition, context tones start with silent intervals and end with either high or low context tones (Figure 2, yellow). For example, an “8:2” on the X-axis means eight high tones followed by two low tones (or vice versa) in the conflict condition, eight T1 tones followed by two low tones in the nonconflict condition, and eight silent intervals followed by two low tones in the silent condition.

Figure 2.

(A) P(Bias) as a function of different pre–post context tone ratios across trial types, which can be further divided into (B) Bias toward a descending percept for low-ending contexts, and (C) Bias toward an ascending percept for high-ending contexts. The discrepancy between (B) and (C) at 9:1 suggests that descending-biased low tones can bias the tritone percept with one single note, as long as there are neighbors (regardless of whether it's conflicting or nonconflicting) preceding the single low. Note. The same trend is observed in ascending-biased high tones, but it takes at least 2 notes for the biasing effect to surface. Error bars represent ±1 SEM. Blue: conflict condition, red: nonconflict, yellow: silent condition.

We conducted a two-way repeated-measures ANOVA with the within-subjects factors of trial type (conflict vs. nonconflict vs. silent) and pre–post context ratio (from 9:1 to 1:9) to investigate the possible effect of the three types of context tones on pitch-shift judgments. The Geisser–Greenhouse correction was applied when the assumption of sphericity was violated. There was no significant main effect of trial type, F(2, 54) = 0.824, p = 0.444, partial η² = 0.030. However, there was a significant main effect of pre–post context ratio, F(2.298, 62.050) = 4.675, p = 0.010, partial η² = 0.148. The interaction between trial type and pre–post context ratio was not significant, F(7.020, 189.549) = 1.549, p = 0.153, partial η² = 0.054. Post hoc comparisons with Holm–Bonferroni correction revealed that the 9:1 ratio differed significantly from every other ratio, whereas no other pairwise comparison was significant (9:1 vs. 8:2: t(27) = −4.312, p < 0.001, 9:1 vs. 7:3: t(27) = −3.980, p = 0.003, 9:1 vs. 6:4: t(27) = −4.201, p = 0.001, 9:1 vs. 5:5: t(27) = −4.864, p < 0.001, 9:1 vs. 4:6: t(27) = −4.201, p = 0.001, 9:1 vs. 3:7: t(27) = −3.648, p = 0.010, 9:1 vs. 2:8: t(27) = −4.312, p < 0.001, 9:1 vs. 1:9: t(27) = −5.307, p < 0.001). Together, these results suggest that pre–post context ratio, and specifically the 9:1 ratio, is perhaps more influential than our intended manipulation of trial types. A summary of all ANOVA results, including effect sizes and significance levels, is provided in Table 2 for clarity.

Table 2.

Repeated measures ANOVA table with within-subjects factors of trial type (conflict, non-conflict, silent) and pre–post context ratio (from 9:1 to 1:9).

Variables	F	p-Value	Partial η²
Trial type	0.824	0.444	0.030
Ratio	4.675	0.010	0.148
Trial x Ratio	1.549	0.153	0.054
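The partial η² values in Table 2 can be cross-checked against the reported F statistics and degrees of freedom using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to the three effects; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and (sphericity-corrected) degrees of freedom as reported in the text.
reported = {
    "Trial type": (0.824, 2, 54),
    "Ratio": (4.675, 2.298, 62.050),
    "Trial x Ratio": (1.549, 7.020, 189.549),
}

for name, (f_value, df1, df2) in reported.items():
    print(name, round(partial_eta_squared(f_value, df1, df2), 3))
```

Since the sphericity correction scales both degrees of freedom by the same factor, the identity gives the same value with corrected or uncorrected degrees of freedom.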

To examine the robustness of the 9:1 ratio effect descriptively, we conducted exploratory analyses within each trial type (conflict, nonconflict, and silent) to assess whether the 9:1 ratio yielded a different bias from the other ratios. Although the interaction between trial type and pre–post context ratio was not statistically significant, we aimed to explore whether the apparent influence of 9:1 held consistently across different contextual configurations. For the conflict condition, 9:1 was different from 2:8 and 1:9, 9:1 versus 2:8: t(28) = −3.457, p = 0.024, 9:1 versus 1:9: t(28) = −3.457, p = 0.024 (Holm–Bonferroni correction). None of the other pairwise comparisons was significant. For the nonconflict (same) condition, 9:1 was different from 5:5, 9:1 versus 5:5: t(28) = −3.441, p = 0.025 (Holm–Bonferroni correction). None of the other pairwise comparisons was significant. For the silent condition, 9:1 was significantly different from all other ratios, 9:1 versus 8:2: t(27) = −4.529, p < 0.001, 9:1 versus 7:3: t(27) = −4.529, p < 0.001, 9:1 versus 6:4: t(27) = −4.313, p < 0.001, 9:1 versus 5:5: t(27) = −4.313, p < 0.001, 9:1 versus 4:6: t(27) = −3.666, p = 0.009, 9:1 versus 3:7: t(27) = −4.097, p = 0.002, 9:1 versus 2:8: t(27) = −3.450, p = 0.020, 9:1 versus 1:9: t(27) = −4.960, p < 0.001 (Holm–Bonferroni correction). Therefore, 9:1 is the only ratio that differs from at least one other ratio in all three conditions.

If we home in on the 9:1 ratio and compare across the three trial types (conflict vs. nonconflict vs. silent), the one-way ANOVA of trial type is significant, F(1.599, 43.181) = 5.421, p = 0.012, partial η² = 0.167. Post hoc comparisons with Holm–Bonferroni correction showed that the silent condition differed from the other two conditions, silent versus conflict: t(28) = −2.529, p = 0.028, silent versus nonconflict: t(28) = −3.091, p = 0.009. There was no difference between the conflict and nonconflict conditions, t(28) = −0.562, p = 0.576. The same comparison at 8:2 yields no significant difference across the three trial types, F(2, 54) = 1.000, p = 0.375, partial η² = 0.036, consistent with Figure 2A, where all three trial types have plateaued from 8:2 onward. Therefore, it appears that the auditory context can achieve its biasing effect in as few as two context tones. We note that these exploratory comparisons are not intended to imply statistical interaction and are reported here to provide descriptive insight into the pattern of effects. Conclusions drawn from these should therefore be interpreted with caution.
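For readers unfamiliar with the Holm–Bonferroni correction applied to the post hoc comparisons above, the following sketch shows the standard step-down adjustment. The p-values in the usage example are hypothetical and are not taken from this study; the function name is illustrative.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise comparisons.
print(holm_adjust([0.01, 0.04, 0.03]))
```

The procedure is less conservative than a plain Bonferroni correction because only the smallest p-value is multiplied by the full number of comparisons.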

Regression

To investigate the effect of the number of tones, as well as its possible interaction with the types of tones, we conducted multiple logistic regression models, with participants’ responses from each trial as the binary dependent variable (1 = upward, 0 = downward).

In model 1, the fixed effects were numbers of high context tone, low context tone, and other nonconflict context tones (including same tone and no tone). We found that numbers of high and low context tones affected the judgment of the tritone pair (high: β = 0.206, SE = 0.035, p < 0.001; low: β = −0.086, SE = 0.034, p = 0.012), but not nonconflict context tones (same: β = 0.056, SE = 0.039, p = 0.153; no tone: β = 0.402, SE = 0.207, p = 0.052). In other words, the number of conflict tones (i.e., opposite condition), whether high or low context tones, has a significant impact on the participants’ pitch-shift judgment, while the number of nonconflict tones (i.e., same and silent condition) has no effect. This result highlights the importance of the number of high and low context tones, although it is unable to differentiate the various orders of those tones (e.g., 5H5L and 5L5H would be treated the same in this model). Therefore, in the following model, we included a postcontext last-tone factor (high vs. low), so that we can differentiate the order in which the high/low tones appeared.
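The predictors used in these regression models can be derived directly from a condition label. The sketch below is one plausible coding, not the authors' analysis script: it counts the tone types for Model 1 and also extracts the last and penultimate tone used in the later models. The dictionary keys and level names are assumptions made for illustration.

```python
def code_predictors(label):
    """Derive regression predictors from a label such as '5H5L'."""
    n_pre, pre, n_post, post = int(label[0]), label[1], int(label[2]), label[3]
    sequence = [pre] * n_pre + [post] * n_post  # slots C1..C10

    def level(tone):
        # Same (S) and silent (X) slots are pooled as "nonconflict" (assumed coding).
        return {"H": "high", "L": "low"}.get(tone, "nonconflict")

    return {
        "n_high": sequence.count("H"),
        "n_low": sequence.count("L"),
        "n_same": sequence.count("S"),
        "n_silent": sequence.count("X"),
        "last": level(sequence[-1]),
        "penultimate": level(sequence[-2]),
    }
```

Under this coding, 5H5L and 5L5H share identical tone counts and differ only in the last and penultimate tone, which is why a count-only model cannot tell them apart.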

In model 2, we added the very last tone as a fixed effect, so the fixed-effect factors were the numbers of high context tones, low context tones, and nonconflict context tones (including same tone and no tone), plus the last tone (high vs. low). We found that both the number of high context tones and the nature of the last context tone (C10, coded as high vs. low) significantly predicted participants’ judgments. A greater number of high context tones slightly increased the likelihood of an “upward” response (β = 0.084, SE = 0.021, p < 0.001). More prominently, if the last tone was high (compared to low), this significantly increased the likelihood of an “upward” response (β = 1.861, SE = 0.132, p < 0.001). As Table 4 shows, the last tone is the strongest predictor of judgments. This finding may suggest that context tone(s) closer to the target tones have a stronger impact, although it is not known from this design whether this proximity effect is primarily driven by temporal factors (i.e., recency in time) or by ordinal position (i.e., the number of intervening stimuli). Since we are interested in understanding the duration for which this influence can be sustained (or the number of tones involved), we incorporated the penultimate tone into the regression model in Model 3.

Table 3.

Regression model 1 for the tritone judgment.

Parameter	Estimate	Std. Error	z Value	p-Value
(Intercept)	−0.808	0.343	−2.357	0.018
High tone	0.206	0.035	5.887	<0.001
Low tone	−0.086	0.034	−2.517	0.012
Same tone	0.056	0.039	1.431	0.153
No tone (silent)	0.402	0.207	1.942	0.052

In Model 3, we added the penultimate context tone (C9) as an additional fixed-effect factor to the previous logistic regression model (Model 2) to assess its contribution. The fixed effect factors included numbers of high context tone, low context tone, same context tone, no tone (i.e., silent condition), last tone (high vs. low tone), and penultimate tone (high vs. low vs. nonconflict tone). We found that the last tone (C10, high vs. low) and the penultimate tone (C9, with “Same/Nonconflict” as the reference level) were significant predictors. A high last tone continued to strongly increase the likelihood of an “upward” response (β = 1.430, SE = 0.255, p < 0.001). Similarly, if the penultimate tone was high (compared to “Same/Nonconflict”), this also significantly increased the likelihood of participants reporting an “upward” tritone percept (β = 1.342, SE = 0.330, p < 0.001). The effect for a low penultimate tone (compared to “Same/Nonconflict”) was not significant (β = 0.371, SE = 0.358, p = 0.301). When we include the last and penultimate tone factors into the model, the number of conflict tones becomes less statistically significant, as both last and penultimate tones have now become the key factors that affect participants’ pitch-shift judgment.

In Model 4, the third-to-last context tone (C8, coded as high vs. low vs. same/nonconflict) was added as a further fixed-effect factor to Model 3. Even with the inclusion of the third-to-last tone, a high last tone robustly predicted more “upward” responses (β = 1.425, SE = 0.256, p < 0.001), and a high penultimate tone also maintained its significant positive association with “upward” responses (β = 1.348, SE = 0.330, p < 0.001), both relative to their respective reference levels. Therefore, we stopped generating further models for this analysis. Together, according to these models, we can conclude that the last tone (i.e., C10) and penultimate tones (i.e., C9) have the most influence on tritone judgment.
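To give a sense of scale for the logistic regression coefficients, the sketch below converts a coefficient into an odds ratio and evaluates the inverse-logit for the Model 2 estimates reported in Table 4. It assumes simple dummy coding of the last tone (1 = high, 0 = low) and uses only the three Model 2 terms that Table 4 lists, so the probabilities are illustrative rather than a reproduction of the fitted model. The names are hypothetical.

```python
import math

# Model 2 estimates as reported in Table 4.
MODEL2 = {"intercept": -1.419, "last_high": 1.861, "n_high": 0.084}


def odds_ratio(beta):
    """Multiplicative change in the odds of an 'upward' response per unit of the predictor."""
    return math.exp(beta)


def p_upward(last_is_high, n_high, coef=MODEL2):
    """Illustrative predicted probability of an 'upward' response."""
    # Assumption: dummy-coded last tone and no other terms in the linear predictor.
    z = coef["intercept"] + coef["last_high"] * int(last_is_high) + coef["n_high"] * n_high
    return 1.0 / (1.0 + math.exp(-z))


print(odds_ratio(MODEL2["last_high"]))
print(p_upward(last_is_high=True, n_high=1), p_upward(last_is_high=False, n_high=0))
```

On this reading, a high last tone multiplies the odds of an “upward” response several-fold, whereas each additional high context tone changes the odds only slightly.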

Discussion

In this study, we aimed to examine the role of context quantity (i.e., number of context tones) and type (i.e., opposite, same, or silent), as well as their possible interactions on the perception of the subsequent tritone paradox. To this end, we played anywhere between 1 and 10 context tones and varied these context tones so that they were higher than T1, lower than T1, same as T1, or completely silent. We found that just 1–2 context tones were enough to bias the subsequent percept (Figure 2). Specifically, these context tones were able to “attract” the target tones such that high context tones led to upward tritone percept and low context tones led to downward tritone percept. This is consistent with the findings of Chambers et al. (2017), who called this an attractive effect.

One way such an attractive effect could emerge is through the process of perceptual grouping. That is, perhaps the high context tones biased T1 to be perceived as slightly higher via grouping. Such perceptual grouping might cause T1 to be perceived as occupying a generally higher pitch region. Shepard tones are known for their pitch class circularity; while their absolute octave is ambiguous, their pitch class (e.g., C, D-flat, and D) is clear. The interval between T1 and T2 in a tritone paradox is six semitones—exactly half an octave—meaning they are diametrically opposite on this pitch class circle. If the preceding context biases T1 to be perceived as higher, the auditory system might then resolve the T1–T2 ambiguity by selecting the (now shorter) ascending path along this pitch class circle to T2's pitch class. This resolution would result in the T1–T2 sequence being perceived as “ascending,” thus T2 is heard as higher relative to the contextually elevated T1. This choice could be favored as it might represent a more coherent or smaller perceptual step in pitch height from the already biased T1. One potential functional advantage of such an attractive contextual mechanism—whereby the percept is aligned toward the direction of previous context—is that it can serve to disambiguate ongoing sensory inputs. This particular mode of disambiguation would be advantageous for maintaining perceptual consistency when processing continuous or related streams of auditory information.

The grouping account, however, is not the only account that can explain the current results. One recent study by Englitz et al. (2023) recorded single-unit responses from a population of auditory cortical cells in awake ferrets and found that high context tones actually pushed T1 downward and led to a low percept of T1. T2 was similarly influenced, though the overall perception may shift relatively upward due to the already lowered T1. The reverse was true for low context tones. This “repulsive” effect on the target tones is more analogous to perceptual contrast, as opposed to perceptual grouping. Recent studies have extended our understanding of how perceptual context shapes stimulus interpretation. Perceptual grouping mechanisms, beyond traditional Gestalt principles, have been shown to influence how sequential stimuli are bound and retrieved in memory, which may underlie the attractive effect observed in our results (Schmalbrock et al., 2022). At the same time, the interplay of attraction and repulsion in perceptual judgments closely parallels the literature on serial dependence, where current perception is biased by prior stimuli either toward (attractive) or away from (repulsive) past inputs. This has been most thoroughly documented in the visual domain (Fischer and Whitney, 2014), but likely reflects a shared computational principle across modalities—including audition. As of now, both accounts can explain our results from the present study.

Regardless of the grouping versus contrast account, our results here suggest that such an attractive/repulsive process likely takes place within a small window of 2 tones/notes, where perceptual upward/downward bias stabilized after just 1–2 context tones (Figure 2 and Table 4). In this small window, for high context tones (Figure 2C), perceptual bias went from low (i.e., 9:1) to high (i.e., 8:2), and plateaued from that point on. Thus, there is no evidence of averaging within this two-tone window. Similarly, in low context tones (Figure 2B), just one context tone was enough to bring the downward bias to its highest point, and it remained high from 1:9 to 9:1. Therefore, the effect of auditory contexts seems to operate in a winner-take-all manner, within a very short time window.

Table 4.

Regression models 2–4 for the tritone judgments.

Model	Parameter	Estimate	Std. Error	z Value	p-Value
2	(Intercept)	−1.419	0.096	−14.826	<0.001
	Last tone (high)	1.861	0.132	14.070	<0.001
	High tone	0.084	0.021	4.061	<0.001
3	(Intercept)	−1.743	0.353	−4.937	<0.001
	Last tone (high)	1.430	0.255	5.599	<0.001
	Penultimate tone (low)	0.371	0.358	1.035	0.301
	Penultimate tone (high)	1.342	0.330	4.065	<0.001
4	(Intercept)	−1.74	0.353	−4.93	<0.001
	Last tone (high)	1.425	0.256	5.579	<0.001
	Penultimate tone (low)	0.382	0.358	1.068	0.286
	Penultimate tone (high)	1.348	0.330	4.080	<0.001

One possible mechanism underlying this rapid contextual influence involves auditory short-term memory (STM). Auditory STM is known to have a limited capacity—typically allowing for the retention of about three to four discrete items—and its effectiveness declines when this limit is exceeded (Visscher et al., 2007; Cowan, 2001; Saults and Cowan, 2007). While recognition accuracy for items held in auditory STM may only decrease modestly over certain retention intervals in some paradigms (Schmalbrock et al., 2022), the specific contextual information that biases bistable perception, as in our study, might be particularly susceptible to rapid overwriting by subsequent stimuli or operate within a highly constrained temporal window. Our finding that only 1–2 context tones are sufficient to establish a stable bias aligns with a system where this limited capacity is quickly saturated, or where only the most recent auditory events are weighted heavily for this type of perceptual disambiguation. Consequently, once this effective capacity or temporal window is exceeded, information from the more distant auditory past may contribute less significantly to the immediate perceptual outcome. These capacity constraints may account for why earlier tones (e.g., C1–C7) exert little influence on perceptual outcomes, while more recent tones remain accessible and contribute more directly to bias formation.

We are uncertain whether the small time window of the context effect is truly due to “time” per se, or perhaps due to the number of stimuli or “slots.” The time window account suggests that auditory STM has a limited temporal span, a notion compatible with our findings, as the impact of tones diminishes with time elapsed since their presentation. This also aligns with the understanding that echoic memory primarily encompasses the most immediate stimuli, thus emphasizing the temporal constraints within which auditory information is processed (Tseng et al., 2012). However, because we used a consistent 125 ms duration for every context tone, as Chambers et al. did, elapsed time is inevitably confounded with the number of context tones.

Importantly, the “short-lasting” window we report refers to the exposure needed to form a contextual bias, not to how long that bias can be maintained once established. Chambers et al. (2017) showed that after only three to five 125-ms context tones the bias plateaued, matching our own rapid build-up, but that the induced bias then persisted across silent gaps of up to 32 s (and for some listeners 64 s). Their later experiment therefore quantified the durability of an already-formed trace, whereas our design—with no intervening gap—isolates the minimal induction period. Taken together, the two studies imply a two-phase process: a very fast binding mechanism that leverages echoic/working-memory resources to set up the prior, followed by a slower, activity-silent storage process (e.g., synaptic adaptation) that can preserve the trace for tens of seconds. This reconciliation also explains why both studies converge on the importance of the most recent one or two tones while diverging in the temporal span over which the effect can still be observed.

This mixture of item-based and time-based limitations for the observed explicit contextual biasing effect remains to be dissociated in future work. It is important to note that this very constrained window (1–2 tones) for contextual influence on an explicit perceptual judgment may differ from other forms of auditory memory. For instance, evidence suggests that implicit auditory memory, such as that involved in tracking statistical regularities in sound sequences, can integrate information over considerably longer periods, potentially spanning 20 tones or more (Barascud et al., 2016). This distinction highlights that the auditory system likely employs multiple memory mechanisms operating on different timescales, depending on the specific task demands (e.g., explicit disambiguation versus implicit regularity detection). The rapid contextual effect on bistable tritone perception observed in our study may thus reflect the operation of a highly specific, quickly-updating buffer primarily relevant for immediate perceptual interpretation, rather than representing a universal limit for all types of auditory sequential processing.

Differential Contextual Effects of High Versus Low Tones

The reason for the observed differences between low (L) and high (H) context tones in their effects on perceptual bias (specifically, the stronger bias seen at the 9:1 ratio in low contexts compared to the 8:2 ratio in high contexts) is currently unclear. This discrepancy may suggest differential weighting of context tones in auditory processing, and we offer two speculations about its origin. First, Deutsch and colleagues demonstrated that the range of one's vocal frequency can affect tritone percepts (Visscher et al., 2007). Subsequent studies have suggested that individuals’ perception of auditory stimuli can be influenced by the frequency characteristics of their own voice. Similarly, Perrachione et al. (2011) found that the frequency range of an individual's voice can affect how they perceive pitch, which may explain why low-frequency context tones seem to produce stronger perceptual biases. This implies that low-frequency tones, being closer to the frequency range of one's own voice, might be more influential in shaping auditory biases. Second, Deutsch and colleagues have also shown that native language can bias one's tritone percepts (Deutsch, 1991). Additionally, Lee and Lee (2010) demonstrated that speakers of tonal languages, such as Mandarin, process pitch information differently, often showing heightened sensitivity to specific frequency ranges that are prominent in their native language. This suggests that native language background might influence the differential processing of high and low context tones. In this regard, it is important to note that all participants in our study were native Mandarin speakers. Although the exact mechanisms underlying these H versus L differences are unclear, it is plausible that a combination of one's vocal frequencies and native language contributes to the observed differential effects of auditory contexts.

One limitation of the current study is the lack of detailed information on participants’ musical background. While musical expertise may influence certain aspects of pitch processing, prior work on the tritone paradox suggests that bistable pitch percepts can emerge robustly in nonmusicians as well (Malek, 2018; Deutsch et al., 1987). Future studies may benefit from explicitly stratifying participants based on musical training to assess its potential modulatory effects.

Conclusion

In this study, we investigated the role of context tones on tritone pairs by manipulating context quantity (i.e., number of context tones) and type (i.e., opposite, same, or silent). Unlike previous research that focused on broader context effects, our experimental design precisely manipulated the ratios and arrangements of context tones. We found that just 1–2 context tones were enough to bias the subsequent percept, and that biases no longer fluctuated after two tones, suggesting a very narrow window within which auditory contexts operate. We also observed that low-tone contexts produced a faster biasing effect (i.e., 1 instead of 2 tones) than their higher counterparts, a difference whose origin remains to be investigated.

Supplemental Material

sj-docx-1-ipe-10.1177_20416695251409272 - Supplemental material for Rapid biasing effect of prior auditory contexts on bistable tritone perception

Supplemental material, sj-docx-1-ipe-10.1177_20416695251409272 for Rapid biasing effect of prior auditory contexts on bistable tritone perception by Cheng-You Hou, Jyun-Jhe Wang, Yu-Hui Lo and Philip Tseng in i-Perception

sj-docx-2-ipe-10.1177_20416695251409272 - Supplemental material for Rapid biasing effect of prior auditory contexts on bistable tritone perception

Supplemental material, sj-docx-2-ipe-10.1177_20416695251409272 for Rapid biasing effect of prior auditory contexts on bistable tritone perception by Cheng-You Hou, Jyun-Jhe Wang, Yu-Hui Lo and Philip Tseng in i-Perception

Footnotes

Author Contribution(s)

Cheng-You Hou: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Software; Visualization; Writing – original draft.

Jyun-Jhe Wang: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Software; Visualization; Writing – original draft.

Yu-Hui Lo: Formal analysis; Methodology; Software; Visualization; Writing – original draft.

Philip Tseng: Conceptualization; Methodology; Resources; Supervision; Writing – original draft; Writing – review & editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Taiwan University (NTU-CDP-113L7775, NTU-RPG-113L7324) and National Science and Technology Council, Taiwan (109-2423-H-002-004-MY4, 112-2410-H-002-252, 113-2628-H-002-013-MY3, 114-2423-H-008-003).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data supporting the findings of this study are available on request from the corresponding author.

Supplemental Material

Supplemental material for this article is available online.

References

Barascud, Pearce, M. T., Griffiths, T. D., Friston, K. J., & Chait (2016). The ‘roving’ RAD: An index of implicit auditory memory. Proceedings of the National Academy of Sciences, 113(22), 6325–6330. https://doi.org/10.1073/pnas.1508523113

Billig, A. J., Davis, M. H., Deeks, J. M., Monstrey, & Carlyon, R. P. (2013). Lexical influences on auditory streaming. Current Biology, 23(16), 1585–1589. https://doi.org/10.1016/j.cub.2013.06.042

Chambers et al. (2017). Prior context in audition informs binding and shapes simple features. Nature Communications, 8, 15027. https://doi.org/10.1038/ncomms15027

Cowan (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24(1), 87–114. https://doi.org/10.1017/S0140525X01003922

Deutsch (1986). A musical paradox. Music Perception, 3(3), 275–280. https://doi.org/10.2307/40285337

Deutsch (1991). The tritone paradox: An influence of language on music perception. Music Perception, 8(4), 335–347. https://doi.org/10.2307/40285517

Deutsch, Henthorn, & Dolson (2004). Absolute pitch, speech, and tone language: Some experiments and a proposed framework. Music Perception, 21(3), 339–356. https://doi.org/10.1525/mp.2004.21.3.339

Deutsch, Kuyper, W. L., & Fisher (1987). The tritone paradox: Its presence and form of distribution in a general population. Music Perception, 5(1), 79–92. https://doi.org/10.2307/40285386

Englitz et al. (2023). Decoding contextual influences on auditory perception from Primary Auditory Cortex. bioRxiv.

Fischer & Whitney (2014). Serial dependence in visual perception. Nature Neuroscience, 17(5), 738–743. https://doi.org/10.1038/nn.3689

Lee, C.-Y., & Lee, Y.-F. (2010). Perception of musical pitch and lexical tones by Mandarin-speaking musicians. The Journal of the Acoustical Society of America, 127(1), 481–490. https://doi.org/10.1121/1.3266683

Malek (2018). Pitch class and envelope effects in the tritone paradox are mediated by differently pronounced frequency preference regions. Frontiers in Psychology, 9, 1590. https://doi.org/10.3389/fpsyg.2018.01590

Perrachione, T. K., Del Tufo, S. N., & Gabrieli, J. D. (2011). Human voice recognition depends on language ability. Science, 333(6042), 595. https://doi.org/10.1126/science.1207327

Repp, B. H. (1994). The tritone paradox and the pitch range of the speaking voice: A dubious connection. Music Perception, 12(2), 227–255. https://doi.org/10.2307/40285653

Saults, J. S., & Cowan (2007). A central capacity limit to the simultaneous storage of visual and auditory arrays in working memory. Journal of Experimental Psychology: General, 136(4), 663–684. https://doi.org/10.1037/0096-3445.136.4.663

Schmalbrock et al. (2022). The role of perceptual grouping in stimulus-response binding and retrieval. Journal of Cognition, 5(1), 1–10. https://doi.org/10.5334/joc.217

Shepard, R. N. (1964). Circularity in judgments of relative pitch. The Journal of the Acoustical Society of America, 36(12), 2346–2353. https://doi.org/10.1121/1.1919362

Snyder, J. S., Carter, O. L., Lee, S. K., Hannon, E. E., & Alain (2008). Effects of context on auditory stream segregation. Journal of Experimental Psychology: Human Perception and Performance, 34(4), 1007. https://doi.org/10.1037/0096-1523.34.4.1007

Tseng et al. (2012). Neural mechanisms of implicit visual probability learning. Chinese Journal of Psychology, 54(1), 115–131. https://doi.org/10.6129/CJP.2012.5401.07

Visscher, K. M., et al. (2007). Auditory short-term memory behaves like visual short-term memory. PLoS Biology, 5(3), e56. https://doi.org/10.1371/journal.pbio.0050056
