Abstract
Visual cues facilitate speech perception during face-to-face communication, particularly in noisy environments. These visually driven enhancements arise from both automatic lip-reading behaviors and attentional tuning to auditory-visual signals. However, in crowded settings, such as a cocktail party, how do we accurately bind the correct voice to the correct face, allowing visual cues to benefit speech perception? Previous research has emphasized that spatial and temporal alignment of the auditory-visual signals determines which voice is integrated with which speaking face. Here, we present a novel illusion demonstrating that when multiple faces and voices are presented with ambiguous temporal and spatial information as to which pairs of auditory-visual signals should be integrated, the perceptual system relies on identity information extracted from each signal to determine pairings. Data from three experiments demonstrate that expectations about an individual's voice (based on their identity) can change where listeners perceive that voice to arise from.
