Abstract
The natural environment is filled with a variety of auditory events such as wind blowing, water flowing, and fire crackling. It has been suggested that the perception of such textural sounds is based on the statistics of natural auditory events. Inspired by a recent spectral model of visual texture perception, we propose a model that describes perceived sound texture using only the linear spectrum and the energy spectrum. We tested the validity of the model using synthetic noise sounds that preserve the two-stage amplitude spectra of the original sound. A psychophysical experiment showed that our synthetic noises were perceived as similar to the original sounds for 120 real-world auditory events. The performance was comparable with that of synthetic sounds produced by the McDermott–Simoncelli model, which incorporates various classes of auditory statistics. The results support the notion that the perception of natural sound textures is predictable from two-stage spectral signals.
Introduction
In the natural environment, we hear a variety of auditory events such as wind blowing, water flowing, and fire crackling. These textural sounds not only play an important role in auditory scene analysis and the perception of physical events (e.g., Gygi et al., 2004) but also provide a fundamental basis for the subjective richness of our auditory world. In contrast to the perception of speech and music, where not only phoneme features but also temporal order is decisive, the perception of sound textures is thought to be based on the statistical acoustic features over a certain temporal period (Attias & Schreiner, 1997; McDermott & Simoncelli, 2011; McDermott et al., 2013; McWalter & McDermott, 2018).
McDermott and colleagues have proposed a computational model of sound texture perception which is based on several classes of sound statistics (McDermott & Simoncelli, 2011; McDermott et al., 2013). Specifically, the model argues that marginal moments and pairwise correlations computed from subband envelopes decomposed from the original sound can characterize a particular sound texture. Importantly, the model is supported by strong evidence that a variety of natural sound textures can be synthesized from white noise by matching these statistics to the original sounds.
In fact, the McDermott–Simoncelli (MS) model of sound texture perception is derived from the Portilla–Simoncelli (PS) model of visual texture perception (Portilla & Simoncelli, 2000). The PS model assumes that the perception of a visual texture is based on classes of low- and high-level image statistics, including moments within orientation and spatial-frequency subbands as well as autocorrelations and cross-band correlations within and across subbands. As with the MS model, the PS model can synthesize a wide range of natural texture images by matching the PS statistics in a white noise image.
As described above, both the MS model for sound texture and the PS model for visual texture have a complicated structure consisting of many classes of statistics. Recently, for visual texture, Okada and Motoyoshi (2021) showed that most of the PS statistics can be simplified by using a two-stage spectral representation (Figure 1a). The Okada–Motoyoshi model assumes that a visual texture is described by only two amplitude spectra (plus pixel moment statistics): the two-dimensional spectrum of the luminance image (Fx, Fy) and the four-dimensional spectrum of the subband energy image (Fx, Fy, Fori, Ffreq). This spectral model, especially in its second stage, can be considered a dimensional extension (from two-dimensional [2D] to four-dimensional [4D]) of the psychophysical Filter-Rectify-Filter model of visual texture discrimination (Bergen & Adelson, 1988; Bergen & Landy, 1991), as well as a Fourier-transformed representation of the auto/cross-correlation of subband energy in the PS model. Okada and Motoyoshi (2021) demonstrated the validity of their model by showing that a noise image that preserves the two-stage spectra of a natural texture (e.g., gravel) appears very similar to the original texture, to a degree comparable to PS-synthesized textures.

Two-stage spectral representation of visual and auditory textures. (a) Two-stage spectral representation of visual texture (2D spectrum of luminance and 4D spectrum of subband energy). (b) Two-stage spectral representation of auditory texture (1D spectrum of sound waves and 2D spectrum of cochlear subband envelopes). See text for details.
Since the perception of visual texture can be explained by a simple two-stage spectral space, the perception of sound texture might likewise be explained by a simple two-stage spectrum, obtained by considering the MS statistics in the frequency domain. Specifically, part of the MS statistics can be represented by two amplitude spectra: the one-dimensional spectrum of the input sound, defined over temporal frequency (Ft), and the two-dimensional spectrum of its subband envelopes, defined over temporal modulation frequency (Ft) and frequency along the cochlear subband axis (Ffreq), as shown in Figure 1b. Similar two-stage spectral representations have been employed in previous analyses of human and animal vocalizations (Singh & Theunissen, 2003).
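As a concrete illustration of this representation, the sketch below computes the two stages for a monaural signal: the 1D amplitude spectrum of the waveform, and the amplitude spectrum of the compressed Hilbert envelope of each of a small set of log-spaced subbands (a crude stand-in for an ERB-spaced cochlear filterbank). All parameter choices here (band count, band edges, envelope rate, compressive exponent) and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def _analytic(sig):
    """Analytic signal via the frequency-domain Hilbert transform (last axis)."""
    n = sig.shape[-1]
    spec = np.fft.fft(sig, axis=-1)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * h, axis=-1)

def two_stage_spectra(x, fs, n_bands=8, f_lo=100.0, f_hi=8000.0, env_fs=400):
    """Stage 1: 1D amplitude spectrum of the waveform (Ft).
    Stage 2: amplitude spectrum of each subband's compressed envelope
    (one row per subband; columns index modulation frequency)."""
    n = len(x)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    X = np.fft.rfft(x)
    stage1 = np.abs(X)                        # 1D spectrum of the sound wave
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    step = max(1, int(round(fs / env_fs)))    # crude decimation to ~env_fs Hz
    stage2 = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        Xb = np.where((freqs >= lo) & (freqs < hi), X, 0)   # rectangular band
        band = np.fft.irfft(Xb, n)
        env = np.abs(_analytic(band)) ** 0.3  # envelope + compressive nonlinearity
        env = env[::step]                     # no anti-alias filter: sketch only
        stage2.append(np.abs(np.fft.rfft(env)))
    mod_freqs = np.fft.rfftfreq(len(env), step / fs)
    return stage1, np.array(stage2), mod_freqs
```

For example, a 1 kHz tone amplitude-modulated at 10 Hz yields a stage-1 peak at the carrier frequency and, in the subband containing the carrier, a stage-2 peak at the 10 Hz modulation frequency.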
Can this two-stage auditory spectrum model (Figure 1b) account for the perception of various natural sound textures, as the two-stage spectral model of visual texture perception (Figure 1a) does? One of the strongest and most ecologically valid demonstrations of a perceptual model is to synthesize a “metamer” stimulus that mimics the perception of the original natural stimulus (Freeman & Simoncelli, 2011; McDermott & Simoncelli, 2011; Portilla & Simoncelli, 2000). Here, we examine whether the two-stage spectrum model can synthesize a wide range of natural sound textures. The underlying idea is simple: noise sounds that match the two spectra of an original sound may be perceived as similar to it. If so, this would constitute strong evidence in favor of a two-stage spectral representation for the perception of natural sound textures.
Sound Texture Synthesis Based on Two Spectra
We synthesized sound textures using a method previously applied to visual images (Okada & Motoyoshi, 2021): randomizing the phase while preserving the two-stage spectra. The synthesized sound is generated by the simple procedure shown in Figure 2 (see Supplemental Material for demos). (1) A phase-randomized (PR) sound is obtained from the original sound by discarding the original phase spectrum and substituting the phase spectrum of a white noise. As a result, the linear amplitude spectrum of the original is preserved in the PR sound, but the energy amplitude spectrum is not. In the spectral calculation, the first and last 600 ms of the signal (typically 3 s segments) were cosine-tapered to avoid boundary artifacts. (2) Both the original sound and the PR sound are passed through 30 band-pass filters plus a high-pass and a low-pass filter, yielding 32 filtered sounds from which the subband energies (envelopes) are computed. For computational efficiency, the energy is downsampled to 400 Hz. A compressive nonlinearity (exponent 0.3) is then applied to the energies to simulate cochlear transduction. The band-pass filters are equally spaced on the equivalent rectangular bandwidth (ERB_N) scale (Glasberg & Moore, 1990) over the range 20–10,000 Hz, with 3 dB bandwidths equal to the equivalent rectangular bandwidth of the human ear. (3) The amplitude spectrum of the original sound's subband energies is obtained by a two-dimensional Fourier transform; the corresponding phase spectrum is obtained by a two-dimensional Fourier transform of the PR sound's subband energies. The number of data points in the energy amplitude spectrum obtained by this 2D FFT is 400 × 1/2 × s × 32 (where s is the segment duration in seconds), and the width of the bins is 400/(3 s × 1/2).
(4) The resulting two-dimensional amplitude and phase spectra are then inverse Fourier transformed to recover random-phase subband energies that preserve the amplitude spectrum of the original sound's energies. (5) These new envelopes are imposed on the corresponding PR subbands, which are then summed to obtain a phase-randomized sound (le-PR) that preserves the two-stage spectra.
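Steps (1)–(5) can be sketched compactly. The code below is an illustrative simplification under stated assumptions, not the authors' implementation: it uses a handful of rectangular, log-spaced bands in place of the 32-channel ERB filterbank; omits the cosine tapering, envelope downsampling, and compressive exponent; and imposes the new envelopes by per-sample rescaling of the PR subbands. All function and variable names are our own.

```python
import numpy as np

def _analytic(sig):
    """Analytic signal (frequency-domain Hilbert transform), last axis."""
    n = sig.shape[-1]
    spec = np.fft.fft(sig, axis=-1)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * h, axis=-1)

def synthesize_le_pr(x, fs, n_bands=8, f_lo=100.0, f_hi=7000.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    X = np.fft.rfft(x)

    # (1) PR sound: original amplitude spectrum, white-noise phase.
    #     DC and Nyquist phases stay zero so the signal remains real.
    ph = rng.uniform(-np.pi, np.pi, size=X.shape)
    ph[0] = 0.0
    if n % 2 == 0:
        ph[-1] = 0.0
    pr = np.fft.irfft(np.abs(X) * np.exp(1j * ph), n)

    # (2) decompose both sounds into subbands and take Hilbert envelopes
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    def subbands(sig):
        S = np.fft.rfft(sig)
        return np.array([np.fft.irfft(np.where((freqs >= lo) & (freqs < hi), S, 0), n)
                         for lo, hi in zip(edges[:-1], edges[1:])])
    b_orig, b_pr = subbands(x), subbands(pr)
    e_orig = np.abs(_analytic(b_orig))
    e_pr = np.abs(_analytic(b_pr))

    # (3)+(4) combine the 2D amplitude spectrum of the original envelopes
    #         with the 2D phase spectrum of the PR envelopes
    E = np.abs(np.fft.fft2(e_orig)) * np.exp(1j * np.angle(np.fft.fft2(e_pr)))
    e_new = np.maximum(np.real(np.fft.ifft2(E)), 1e-8)  # envelopes stay nonnegative

    # (5) impose the new envelopes on the PR fine structure and sum the bands
    y = np.sum(b_pr / np.maximum(e_pr, 1e-8) * e_new, axis=0)
    return y, pr
```

By construction, the intermediate PR sound retains the original's linear amplitude spectrum exactly, while the le-PR output additionally inherits (approximately, after the nonnegativity clipping) the 2D amplitude spectrum of the original subband envelopes.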

Schematic diagram of synthesized sound with two-stage spectral representation. PR denotes phase-randomized sound. le-PR denotes a phase-randomized sound that preserves the two-stage spectra. The original sound, PR, and le-PR are represented by waveforms. See text for details.
Figure 3(a) shows examples of spectrograms of the original natural sounds (top) and three types of synthetic sounds. Our casual observations indicated that the MS synthesis and the two-stage spectrally preserved PR sound (le-PR) were perceptually similar to the original for many sounds. On the other hand, PR sounds with only a linear spectrum were often perceived as monotonous bandpass noise. Figure 3(b) shows spectrograms for several exemplars of le-PR sounds generated from different seeds of randomization.

(a) Example spectrograms of the original sounds and the three types of synthesized sounds. From top to bottom: the original sounds, le-PR, PR, and the MS-synthesized sounds. Each column is labeled with the name of the original natural sound. (b) Examples of le-PR spectrograms generated with different random phase spectra.
Psychophysical Experiment
In a simple psychophysical rating experiment, we tested how perceptually similar (i.e., how successfully synthesized) the two-stage spectrally preserved noise sounds (le-PR) are to the original sounds. We also obtained ratings for two other types of sound synthesis, namely MS synthesis and spectrally matched noise (PR). MS synthesis was performed using the toolbox provided by McDermott's lab with the recommended set of statistics: subband variance, envelope mean, envelope variance, envelope skewness, envelope correlation, modulation power, and C1 correlation.
Methods
Apparatus
Auditory stimuli were presented to both ears at 70 dB SPL through Sennheiser HD280 PRO headphones driven by a PC-controlled Komplete Audio 1 audio interface (48 kHz sampling rate, 24-bit D/A converter).
Observers
Eight naïve participants without hearing impairment and two of the authors (mean age 23.3 years) participated in the experiment. All participants were nonprofessional, untrained listeners, and the naïve participants were unaware of the purpose of the experiment. All experiments were conducted in accordance with the Declaration of Helsinki and with the permission of the Ethical Review Committee for Experimental Research on Human Subjects, Graduate School of Arts and Sciences, the University of Tokyo. All observers provided written informed consent.
Stimuli
The original auditory stimuli were 120 textural sounds related to a variety of natural events, including writing on paper, sleigh bells, thunderstorms, car horns, radio noise, sirens, and so on (see Supplemental Material). These were digital audio data downloaded from an online repository. Each sound was converted to mono, resampled at 48,000 Hz, and normalized to a fixed rms amplitude. For each of these natural sounds, three classes of synthetic sounds were generated: sounds synthesized from the MS statistics, phase-randomized sounds preserving the two-stage spectrum (le-PR, Figure 4), and phase-randomized sounds preserving only the linear spectrum (PR). The MS-synthesized sounds were generated using 5 s segments from the original sound, as required by the algorithm, while the le-PR and PR sounds were generated using 3 s segments. In the experiment, 1 s segments were extracted from these sounds so that the sound category could be easily recognized.

The perceptual similarity rating for the three types of synthesized sounds to the original natural sound: the MS statistics, the two-stage spectrum (le-PR), and the linear spectrum (PR). (a) Joint histograms of similarity ratings between different types of synthetic sounds. Each panel shows the comparison in ratings for le-PR versus PR (left), MS versus PR (middle), and MS versus le-PR (right). (b) Similarity ratings averaged across 120 natural sounds. Error bars represent ±1 SEM between participants.
Procedure
In each trial, the original natural sound and its synthesized sound were presented for 1 s each, with a 1 s silence in between. The type of synthesized sound was chosen at random. Participants rated how similar the synthesized sound was to the original natural sound by pressing a button on a 9-point scale: the same as or very similar to the original (8), close to the original but slightly degraded (7), close to the original but slightly synthetic (6), clearly heard as a synthetic sound (5), clearly synthetic with degraded sound quality (4), original recognizable but synthesis clearly unsuccessful (3), synthesis of the original recognizable but noisy (2), closer to noise than to a synthesis of the original sound (1), and noise unrelated to the original (0). Note that all of these instructions were given in Japanese. Participants were also instructed not to use apparent loudness alone as a cue. To establish rating criteria, participants were presented with the full set of original sounds in random order immediately before the experiment. All participants rated each stimulus three times, and the average was used for analysis.
Results
Figure 4a shows a joint histogram for the distribution of similarity ratings to the original sound for each of the three types of synthetic sound pairs (MS vs. le-PR, le-PR vs. PR, MS vs. PR). The brightness of each cell represents the relative frequency of responses. The gray diagonal line indicates that the ratings for the two types of synthesized sounds are equivalent; the area below this line indicates that the synthetic sound on the horizontal axis was more similar to the original sound than the synthetic sound on the vertical axis.
The two joint histograms on the left show that PR sounds, in which only the first-order linear spectrum is preserved, are often perceived as mere band-passed noise (i.e., a rating of 0) and are rated as far less similar to the original than the other two synthesis types. While synthetic sounds from the two-stage spectral model (le-PR) are rarely rated as less similar than PR sounds, some MS-synthesized sounds are rated as slightly less similar than PR sounds. The joint histogram on the right shows that the ratings are similar between le-PR sounds and MS sounds. Figure 4b shows the mean ratings across the 120 stimuli. The MS- and le-PR-synthesized sounds are clearly more similar to the original sounds than the PR sounds, and le-PR is slightly better than MS. A one-way repeated-measures ANOVA showed a significant main effect of synthesis type, F(2, 27) = 8.6992, p < .00005. Two-sided paired t-tests showed significant differences for all combinations: le-PR versus MS, t(9) = 2.351, p < .05, le-PR versus PR, t(9) = 16.659, p < .0000001, and MS versus PR, t(9) = 7.877, p < .0001. This suggests that the perceptual quality of the synthesis based on the two-stage spectrum (le-PR) is comparable with, or slightly better than, the MS synthesis.
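For readers who wish to run the same kind of paired comparison on their own rating data, the t statistic can be computed directly from per-participant mean ratings. The sketch below uses made-up placeholder ratings, not the study's data; obtaining a p-value additionally requires a t-distribution CDF (e.g., from scipy.stats), which is omitted here.

```python
import numpy as np

def paired_t(a, b):
    """Two-sided paired t statistic and degrees of freedom for two
    matched samples (one pair of scores per participant)."""
    d = np.asarray(a, float) - np.asarray(b, float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1

# hypothetical per-participant mean ratings (n = 10), not the study's data
le_pr = np.array([6.8, 7.1, 6.5, 7.0, 6.9, 7.2, 6.4, 6.7, 7.0, 6.6])
pr    = np.array([2.1, 2.5, 1.9, 2.8, 2.3, 2.0, 2.6, 2.2, 2.4, 2.7])
t_stat, df = paired_t(le_pr, pr)  # df = 9 for 10 participants
```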
Discussion
The present study proposed a simple model that describes the perception of sound textures in terms of first-stage (one-dimensional [1D]) and second-stage (2D) amplitude spectra. Psychophysical examination with 120 natural sounds revealed that synthetic-noise sounds that preserve the two-stage spectra of the original perceptually mimic the original sounds with a quality comparable to the sounds synthesized by MS’s statistics. These results support the idea that the perception of sound texture can be accounted for by two-stage spectral representations.
As mentioned earlier, the two-stage spectral model is analogous to a part of the MS model. The MS model employs sound statistics in terms of subband amplitudes, moments of energy, and correlations between energy or modulation filter outputs. Our two-stage model effectively represents these as amplitude spectra in the Fourier domain. However, the two-stage spectral model omits several classes of high-level statistics, such as the C1 and C2 correlations (McDermott & Simoncelli, 2011), and it has a simpler architecture than the MS model. At the same time, our model has many more parameters (i.e., amplitude data) than the MS model and can therefore represent the spectral structure of a sound with much higher accuracy. Given this, it is not surprising that the synthetic sounds based on our model (le-PR) were perceptually similar to the original sounds with equal or slightly higher quality than the synthetic sounds based on the MS statistics. It is also possible that the MS-synthesized sounds we generated with the recommended McDermott-lab toolbox were somewhat degraded compared with those generated with the code originally used in McDermott and Simoncelli (2011). In addition, the high performance of our model may be partially ascribed to the fact that we selected a wide range of textural sound samples without any particular criterion. In the development of the MS model, sound samples that cannot be successfully synthesized without considering specific statistics, such as the C1 and C2 correlations, were intentionally included to illustrate the effects of the various statistics (McDermott & Simoncelli, 2011). Our experimental results might have been different if such challenging samples had been included. Should our interpretation prove correct, our results would also suggest that the spectral features assumed in the present model are sufficient to model the perceptual representation of the majority of sounds in nature.
As discussed below, the MS model may be attempting to capture higher-order features that are not relevant for sounds perceived as “textural.”
Some sounds that were successfully synthesized with the two-stage spectral model were repetitive, such as writing on paper, thunder, sirens, and sleigh bells. On the other hand, some sounds, such as knocking on a door, a baby's cry, and the opening and closing of a door, were not synthesized successfully by either the two-stage spectral model or the MS model. These poorly synthesized sounds may have contained higher-order acoustic features. Such sounds may not be considered “sound textures,” but it is difficult to objectively define which stimuli are texture-like, in either vision or audition. In vision research, textural images are loosely defined as repetitions of similar patterns, but many images are perceived as textural even when their lower-order image features are not spatially uniform. Similarly, sounds that are considered textural do not always have temporally stationary statistics. Kurosawa et al. (2021) recently examined what kinds of images are perceived as textures using a wide range of natural images, including objects and scenes. Their results revealed a robust law: PS synthesis succeeds for natural images that are perceptually classified as “textures” by observers and fails for those that are not. This suggests that the image features assumed in the PS model are sufficient to describe the perception of “texture” and that images mischaracterized by the PS statistics are not processed by the visual system as “textures.” If we can define an analogous domain for sound texture, we may be able to specify what information is sufficient for describing auditory texture perception. Unlike the case of visual texture, however, it is unclear whether humans can easily distinguish between “textural” and “non-textural” sounds.
The two-stage spectral model of auditory texture perception has an analogous structure to the two-stage spectral model of visual texture (Okada & Motoyoshi, 2021), with the only difference being the dimensionality of the spectral space. Both models assume that texture perception is essentially based on two-stage filtering or convolution, suggesting a common computational principle in visual and auditory neural processing. The multistage convolution in the visual and auditory texture models is consistent with the basic scheme of image processing in the early visual cortex (Baker & Mareschal, 2001; Freeman & Simoncelli, 2011; Hubel & Wiesel, 1968; Ziemba et al., 2016; Zipser et al., 1996), and with sound processing in the auditory system as well (Baumann et al., 2011; Chi et al., 2005; Greenwood, 1990; Joris et al., 2004; Rodríguez et al., 2010). Specifically, it is well known that the auditory cortex has neurons that are selectively sensitive to particular time-frequency modulations (Depireux et al., 2001; Kowalski et al., 1996a, 1996b). The two-stage spectral analysis is also consistent with computations in deep neural networks based on convolution and pooling within multiple layers (Fukushima & Miyake, 1982; LeCun et al., 2015; Riesenhuber & Poggio, 2000), as well as with wavelet scattering networks used to compute translation-invariant image representations for classification (Bruna and Mallat, 2013; Mallat, 2012). We expect that such multiorder spectral analysis would be a general principle in low-level sensory information processing, including other modalities such as touch.
It should be noted that two-stage spectral analyses are still local in space (vision) and in time (audition), and the perception of a texture is determined by a summary (e.g., average) of these local signals over a particular spatial (vision) or temporal (audition) range. The amplitude spectra in our model and the statistics in the MS model correspond to such summary representations. Thus, our model and the MS model implicitly assume an additional mechanism that temporally integrates (sums) the outputs of the two-stage analyzers. For visual texture, such a summarization mechanism is known to exist in V4 (Freeman & Simoncelli, 2011; Okazawa et al., 2015; Ziemba et al., 2016), where neurons respond to particular image statistics within large spatial receptive fields. Based on these findings, analogous neural mechanisms in high-level auditory cortex would be expected to encode the statistical properties of a sound over a long temporal period (cf. Feather & McDermott, 2018; Norman-Haignere & McDermott, 2018). However, it is unclear how such a mechanism ought to be implemented, or how long the temporal integration period should be.
Supplemental Material
Supplemental material for "A two-stage spectral model for sound texture perception: Synthesis and psychophysics" by Hironori Maruyama, Kosuke Okada, and Isamu Motoyoshi in i-Perception:
sj-docx-1-ipe-10.1177_20416695231157349 (document)
sj-wav-2-ipe-10.1177_20416695231157349 through sj-wav-30-ipe-10.1177_20416695231157349 (sound files)
Supplemental Material
sj-wav-31-ipe-10.1177_20416695231157349 - Supplemental material for A two-stage spectral model for sound texture perception: Synthesis and psychophysics
Supplemental material, sj-wav-31-ipe-10.1177_20416695231157349 for A two-stage spectral model for sound texture perception: Synthesis and psychophysics by Hironori Maruyama, Kosuke Okada and Isamu Motoyoshi in i-Perception
Supplemental Material
sj-wav-32-ipe-10.1177_20416695231157349 - Supplemental material for A two-stage spectral model for sound texture perception: Synthesis and psychophysics
Supplemental material, sj-wav-32-ipe-10.1177_20416695231157349 for A two-stage spectral model for sound texture perception: Synthesis and psychophysics by Hironori Maruyama, Kosuke Okada and Isamu Motoyoshi in i-Perception
Supplemental Material
sj-wav-33-ipe-10.1177_20416695231157349 - Supplemental material for A two-stage spectral model for sound texture perception: Synthesis and psychophysics
Supplemental material, sj-wav-33-ipe-10.1177_20416695231157349 for A two-stage spectral model for sound texture perception: Synthesis and psychophysics by Hironori Maruyama, Kosuke Okada and Isamu Motoyoshi in i-Perception
Supplemental Material
sj-txt-34-ipe-10.1177_20416695231157349 - Supplemental material for A two-stage spectral model for sound texture perception: Synthesis and psychophysics
Supplemental material, sj-txt-34-ipe-10.1177_20416695231157349 for A two-stage spectral model for sound texture perception: Synthesis and psychophysics by Hironori Maruyama, Kosuke Okada and Isamu Motoyoshi in i-Perception
Supplemental Material
sj-m-35-ipe-10.1177_20416695231157349 - Supplemental material for A two-stage spectral model for sound texture perception: Synthesis and psychophysics
Supplemental material, sj-m-35-ipe-10.1177_20416695231157349 for A two-stage spectral model for sound texture perception: Synthesis and psychophysics by Hironori Maruyama, Kosuke Okada and Isamu Motoyoshi in i-Perception
Supplemental Material
sj-m-36-ipe-10.1177_20416695231157349 - Supplemental material for A two-stage spectral model for sound texture perception: Synthesis and psychophysics
Supplemental material, sj-m-36-ipe-10.1177_20416695231157349 for A two-stage spectral model for sound texture perception: Synthesis and psychophysics by Hironori Maruyama, Kosuke Okada and Isamu Motoyoshi in i-Perception
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Japan Society for the Promotion of Science (grant numbers KAKENHI 20H01782 and KAKENHI 20K21803).
Supplemental Material
Supplemental material for this article is available online.
