Abstract
Extracting talker identity from speech signals is a core perceptual function, yet the mechanisms underlying second-language (L2) talker-identity processing remain unclear. Grounded in the DIVA model and Source-Filter Theory, this study investigated the coupling between perception and production in Tibetan–Mandarin bilinguals. Participants performed a delayed imitation task and a talker-identity discrimination task. Imitation performance was quantified within a multidimensional acoustic space defined by fundamental frequency (F0), harmonics-to-noise ratio (HNR), and formant dispersion (FD). Results indicated that learners achieved significant acoustic convergence toward the L2 model speaker, and that this convergence persisted as episodic traces across short delays. In the discrimination task, sensitivity improved with acoustic distance but plateaued between medium and large distances, while a significant negative response bias in the near condition revealed a tendency toward perceptual assimilation. Crucially, regression and machine-learning analyses revealed that only FD distance was significantly associated with discrimination sensitivity. Unlike source-related cues such as F0, which fluctuate with context, FD reflects relatively invariant vocal-tract structures. These findings suggest that the formation of L2 talker-identity representations involves a functional anatomical alignment with the target speaker through sensorimotor inverse mapping. By locking onto structural invariants such as FD, learners can overcome within-person variability to form detailed episodic identity representations. This study extends the scope of auditory targets in speech production models from the segmental to the indexical level.
