Abstract
In this commentary, we respond to Archibald’s assertion that phonology is underrepresented in Lx speech acquisition literature due to the field’s predominant focus on surface phonetic aspects rather than underlying phonological structures. While we appreciate Archibald’s call to bring phonology into the spotlight, we believe that his characterization overlooks essential aspects of existing Lx speech models such as the revised Speech Learning Model (SLM-r), the Perceptual Assimilation Model for L2 speech learning (PAM-L2), and the Second Language Linguistic Perception (L2LP) model, since these frameworks already incorporate abstract phonological representations such as features, syllables, and tones, alongside phonetic components. We argue that the prominence of phonetic approaches in the field is not due to theoretical shortcomings of these models but rather to a compelling need to deal with the highly variable nature of actual Lx speech data, for which traditional phonological theories lack sufficient explanatory and predictive power. To bridge this gap, we propose that integrating probabilistic phonological grammars (e.g. Stochastic Optimality Theory and Noisy Harmonic Grammar) with existing Lx speech acquisition models offers a particularly promising direction for the future of Lx phonetics and phonology as a unified field.
I Introduction
In his keynote article, Archibald (2025) argues that Lx phonology has been underrepresented in the field of generative approaches to language acquisition because its fundamental similarities to Lx morphosyntax have been underappreciated. He emphasizes that phonology is ‘rich, hierarchical, recursive, governed by universal grammar (UG), subject to poverty-of-the-stimulus, and algebraic’ (Archibald, 2025: 4), just like morphology and syntax, and thus not merely a system of externalization as posited in the Minimalist Program (Chomsky, 1995) and its predecessors. Although we, as psycholinguists (or Lx speech acquisitionists), are more aligned with a domain-general emergentist approach to language acquisition (e.g. Boersma, 1998; Escudero, 2005; Saito et al., 2020) than with the school of generative linguistics, we certainly agree that phonology deserves more attention and treatment in the Lx acquisition literature. Indeed, most current research on Lx speech acquisition focuses heavily on the phonetic aspects of individual segmental categories (i.e. vowels and consonants), leaving other types of phonological representations such as features, syllables, morae, tone, metrical feet, and prosodic phrases largely unexplored. Filling this gap is crucial for advancing the field (as is proposed in Colantoni et al., 2015), which is admittedly biased towards ‘surface aspects of Lx phonetics’ (Archibald, 2025: 2). In this sense, many Lx speech researchers should find the keynote article highly relevant and useful.
In our view, however, Archibald’s characterization of the current ‘unjust’ state of Lx phonology misses two key points. First, there appears to be a misrepresentation of the existing models of Lx speech acquisition. Archibald (2025: 2) explicitly states: ‘Work within cross-linguistic models of production and/or perception (Best and Tyler, 2007; Flege and Bohn, 2021; van Leussen and Escudero, 2015) is ubiquitous. Yet this work, for the most part, is concerned with the properties of the physical systems (i.e. phonetics) rather than how the sounds are connected to meaning in a mental grammar.’
However, the models in question – namely, the (revised) Speech Learning Model (SLM(-r); Flege, 1995; Flege and Bohn, 2021), the Perceptual Assimilation Model for L2 speech learning (PAM-L2; Best, 1995; Best and Tyler, 2007, 2024), and the Second Language Linguistic Perception (L2LP) model (Escudero, 2005; Escudero and Yazawa, 2024; van Leussen and Escudero, 2015) – in fact go beyond the physical aspects of speech to address how native and nonnative sounds are represented in the minds of Lx learners. Second, while Archibald discusses at length how formal phonological analyses would benefit the field, he leaves unexplained why most Lx speech researchers have adopted phonetic approaches instead. Although it may be true that ‘the number of journal articles in any of L2 morphology, syntax, semantics, or processing within a generative perspective far outstrips the number of papers in L2 phonology’ (Archibald, 2025: 4), the number of studies on Lx speech acquisition per se is not necessarily much smaller than the number of studies on Lx morphosyntactic or semantic acquisition. A more pertinent question, then, is why only Lx phonology has been ‘waiting in the wings’ while Lx phonetics has taken center stage.
In what follows, we first discuss how Archibald’s characterization of cross-linguistic speech acquisition models such as SLM(-r), PAM-L2, and L2LP as primarily phonetic rather than phonological is inaccurate (Section II). We then offer our perspective on why quantitative phonetic approaches have gained methodological preference over formal phonological analyses in Lx speech research and how probabilistic extensions of phonological theory may help to resolve this situation (Section III). We conclude with brief remarks on the future prospects for Lx phonology and phonetics as a more unified field (Section IV).
II Lx speech acquisition models and phonological representations
Work within cross-linguistic models of speech acquisition is indeed ubiquitous, with most studies focusing on the articulatory and acoustic relationship between native and target segmental categories. Archibald’s call for more research on Lx phonology is therefore understandable and notable. However, his claim that currently dominant models such as SLM(-r), PAM-L2, and L2LP do not take the learnability of abstract mental representations seriously is not true. We will illustrate this point by reviewing how phonological components are integrated into each of the three models below.
SLM(-r) is arguably the most phonetically oriented of all three models, focusing primarily on the interaction between native and nonnative phonetic categories within a common L1–L2 phonetic space. Nevertheless, a careful look at the model’s constructs reveals an implicit assumption of higher-level abstract representations. Specifically, SLM(-r) claims that the mapping of L2 to L1 sounds occurs at the level of position-sensitive allophones, where ‘position’ seems to be defined in terms of syllables and/or words. For example, in discussing how SLM(-r) can be applied to L1 Japanese speakers’ acquisition of L2 English [ɹ] and [l], Flege et al. (2021: 85) suggested considering whether the liquids ‘occur as singletons or clusters in word-initial position (e.g. lead, read, breed, bleed), as intervocalic singletons, or as word-final singletons or clusters’, since the allophonic realizations of English liquids vary considerably depending on their position within the syllable/word. Thus, phonetic categories, specified in the model as ‘long-term memory representations’ (Flege, 1995: 239), are seen as embedded in higher-level phonological structures. 
Flege (1995: 265) also made an explicit reference to lower-level sound representations, namely features, noting: ‘It may be that, in certain instances, the positionally defined allophone is too coarse a unit of analysis to provide accurate predictions concerning L2 sound production.’ This note was followed by several intriguing hypotheses regarding the role of features in L2 speech learning (Flege, 1995: 267–68), including: ‘Some production difficulties may arise because features used in the L2 are not used in the L1’; ‘the features used to distinguish L1 sounds can probably not be freely recombined to produce new L2 sounds’; ‘[t]he phenomenon of “differential substitution” shows that we need recourse to more than just a simple listing of features used in the L1 and L2 to explain certain L2 production errors’; and ‘[c]ertain features may enjoy an advantage over others because of the nature of their acoustic (or gestural) specification, or their reliability of occurrence.’ The first of these hypotheses was later incorporated into SLM as the ‘feature’ hypothesis (McAllister et al., 2002) and, although it has recently been replaced by the ‘full access’ hypothesis in the revised model (Flege and Bohn, 2021), Flege’s view of features as a ‘unit’ of speech is still worth noting today.
PAM-L2 differs from SLM(-r) in several important respects, and of particular relevance to our discussion is its presupposition of not only phonetic but also phonological categories. This distinction is perhaps best illustrated by Best and Tyler’s (2007) example of how native English listeners perceive French /r/. Although the French rhotic (voiced uvular fricative [ʁ]) and the English rhotic (voiced alveolar approximant [ɹ]) share little phonetic similarity, L1 English learners of L2 French tend to equate these sounds across the two languages. According to Best and Tyler (2007: 28), this cross-linguistic perceptual assimilation is due not only to orthographic identity (since both rhotics are written with ‘r’) but also to similarities in terms of ‘syllable structure, phonotactic regularities, and allophonic and morphophonemic alternations’. Thus, learners perceptually assimilate the French and English rhotics at the phonological or lexical-functional level, while still perceiving a notable phonetic difference between them. One prediction that follows is that, as L2 learning progresses, learners will eventually establish two phonetic categories (i.e. allophonic variants) – [ʁ] and [ɹ] – under the umbrella phonological category (i.e. phoneme) /r/. This distinction between phonetic and phonological assimilation contrasts with the SLM(-r) view that L1–L2 sound mappings occur strictly at the allophonic level. It is also worth noting that phonological categories in PAM-L2 are not necessarily restricted to the segmental level. For example, Reid et al. (2015) applied PAM to investigate the perception of Thai lexical tones by native listeners of another tonal language, Mandarin.
It was predicted that the discriminability of nonnative tones would be higher when the target tone pair was perceptually assimilated to two different native tones than to a single native tone and, in the latter case, the phonetic goodness of fit between the native and nonnative tones would mediate discriminability. These predictions were borne out, providing support for the applicability of PAM’s principles to the suprasegmental level.
Finally, the L2LP model is similar to PAM-L2 in making explicit reference to both phonetic and phonological levels of linguistic representation, but it goes even further by positing four levels of representation instead of two: the [auditory] level (i.e. incoming speech sounds in the peripheral auditory system); the /surface/ level (language-specific and invariant representations of speech sounds, including allophonic details); the |underlying| level (canonical phonemic contrasts that can change the meaning of a word); and the <lexical> level (words and morphemes stored in the mind). Following the Bidirectional Phonology and Phonetics (BiPhon) framework (Boersma, 1998, 2011), the L2LP model defines speech comprehension as gradual abstraction from sound to meaning, consisting of pre-lexical perception (the [auditory]-to-/surface/ mapping) and lexical recognition (the /surface/-to-|underlying| and |underlying|-to-<lexical> mappings). An important yet often overlooked characteristic of L2LP is that the linguistic representations listeners perceive and recognize differ not only in level of abstraction but also in the size of the unit. Escudero (2005: 7) makes this very clear by defining speech comprehension as ‘the act by which listeners map continuous and variable speech onto… discrete and abstract phonological units, such as phonemes, phonological segments, phonological features, autosegments, or prosodic structures’. Thus, the phonological representations onto which phonetic cues are mapped are not restricted to segmental categories, as was the case in PAM-L2 illustrated earlier.
Yazawa et al. (2023) provide a good example of L2LP’s flexibility in handling different units of speech for modeling Lx speech perception. The study compared two versions of perceptual simulations based on L2LP – one mapping auditory cues to segmental categories and the other to features – to evaluate which better captured how L1 Japanese listeners establish a new sound representation for L2 American English /æ/ (which is perceived as an unusually fronted variant of Japanese /a/, according to both previous experiments and their own). The segmental model fell short because it could not explain how learners perceive the deviance of L2 /æ/ from L1 /a/; the simulated L2 learning results were also unrealistic in being too native English-like compared to real learners’ perception. In contrast, the featural model showed that the deviance of /æ/ could be perceived due to an ill-formed combination of height and backness features (*/low, front/); the predicted perception was also qualitatively different from native English perception and more in line with real learners’ data. As this study demonstrates, although previous studies within the L2LP framework have often assumed segmental categories as the basic unit of perception, the model does not inherently require this. Similar to BiPhon, the L2LP model is capable of handling a wide range of phonological phenomena in perception (and production; for a recent description of the model, see Escudero and Yazawa, 2024), including those discussed in Archibald (2025).
III Probabilistic grammars for Lx phonology
As we have seen above, current models of Lx speech acquisition are intended to cover more than surface phonetic phenomena and are readily applicable to the study of phonological structures such as features, syllables, and tones. In practice, however, most research on Lx speech acquisition to date has not exploited the full potential of these models (or grounded itself in a specific theoretical framework) to explore such phonological representations. Archibald’s (2025) characterization of the current state of affairs is therefore justified, although a broader question remains: why do Lx speech researchers generally favor phonetic approaches over phonological ones? We believe that this preference has emerged from a discrepancy between traditional phonological theories, which predict categorically defined phenomena based on binary branching representations, and Lx speech data, which are highly variable in nature. Phonetic approaches, on the other hand, have been ‘good enough’ to describe such variability.
To illustrate this point, let us consider the case of vowel epenthesis in the production of L2 English initial consonant clusters by L1 Egyptian Arabic speakers, as discussed in Archibald (2025: 19–20). This is a compelling example of the need to take syllable structure into account, because otherwise it would be difficult to explain, for example, (1) why epenthesis occurs at all, (2) why [i] is inserted rather than the other possible vowels of Egyptian Arabic (i.e. [a] or [u]), and (3) why English ‘study’ is produced as [istadi] while English ‘floor’ becomes [filor], with [i] inserted before and after the first consonant, respectively (Broselow, 1992; personal communication). However, it is very likely that actual epentheses in Arabic learners’ English do not always conform to the described patterns, since epenthesis is partly optional in at least some dialects of Arabic (Plug et al., 2019; Watson, 2007), and epenthetic patterns tend to be far more variable in L2 speech than in native or loanword phonology (Mattingley et al., 2019; Yazawa et al., 2015). We should thus expect significant variability in (1) how often epenthesis actually occurs, (2) what kind of vowel is inserted, and (3) at which position, both within and across learners (for actual examples of such variation, see Broselow, 2025). This kind of variability, frequently encountered by Lx speech researchers, is difficult to capture with traditional phonological grammars, which typically assume a single deterministic output. Lx phonologists have thus tended to relegate the explanation of variability in real data to the domain of phonetics, where Lx phoneticians have been quite successful; for example, Plug et al. (2019) conducted an acoustic investigation of epenthetic vowels in Tripolitanian Libyan Arabic, demonstrating that not all epenthetic vowels are phonologically inserted ‘epenthesis proper’; some are phonetically induced ‘intrusive vocoids’. However, this kind of ‘relegation’ could be the very reason why Lx phonology has suffered its ‘unjust’ state of affairs.
We propose here that probabilistic phonological grammars would help to resolve the current situation. The importance of incorporating probability into formal phonological analyses has been increasingly acknowledged in recent years because, just as stress is not a single property that is either ‘present’ or ‘absent’ in a language (Peperkamp and Dupoux, 2002), many phonological phenomena (e.g. phonological alternation, incomplete neutralization, and well-formedness judgement) exhibit variation and gradience even in native speech (Alderete and Finley, 2023). Probabilistic grammars, when applied to the above example of vowel epenthesis, would have greater explanatory and predictive power (e.g. ‘novice L1 Arabic learners of L2 English insert [i] before the beginning of “study” X% of the time’) than traditional nonprobabilistic grammars (e.g. ‘they insert [i] before “study” with no exceptions’). By combining this approach with complementary phonetic analyses (e.g. actual quality and duration of inserted vowels), we may finally arrive at a comprehensive picture of what is really going on in Lx speech perception and production.
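To make this concrete, below is a minimal, self-contained sketch of how a probabilistic grammar derives variable outputs, written in the style of Stochastic Optimality Theory: each constraint carries a ranking value that is perturbed by Gaussian noise at every evaluation. It pits two candidate realizations of English ‘study’ against two standard constraints, *COMPLEX (no complex onsets) and DEP (no epenthesis). The ranking values, noise level, and simplified single-violation evaluation are our own illustrative assumptions, not parameters from any of the cited studies.

```python
import random

# Hypothetical ranking values for two constraints (Stochastic OT style):
# at evaluation time, each value is perturbed by Gaussian noise, so the
# grammar's output varies from one evaluation to the next.
RANKING = {"*COMPLEX": 100.0, "DEP": 98.0}  # illustrative values only
NOISE_SD = 2.0

# Violation profiles for two candidate realizations of English /st.../ 'study'.
CANDIDATES = {
    "stadi": ["*COMPLEX"],   # faithful form keeps the illegal onset cluster
    "istadi": ["DEP"],       # epenthesis repairs the cluster but inserts [i]
}

def evaluate():
    # Add noise to each ranking value, then pick the candidate whose violated
    # constraint has the lowest noisy ranking (standard OT evaluation,
    # simplified to one violation per candidate).
    noisy = {c: v + random.gauss(0.0, NOISE_SD) for c, v in RANKING.items()}
    return min(CANDIDATES, key=lambda cand: max(noisy[c] for c in CANDIDATES[cand]))

def epenthesis_rate(n=10000):
    # Estimate how often the grammar outputs the epenthesized form.
    return sum(evaluate() == "istadi" for _ in range(n)) / n

if __name__ == "__main__":
    print(f"epenthesis rate ~ {epenthesis_rate():.2f}")
```

With the hypothetical values above, the epenthesized form wins roughly three-quarters of the time rather than categorically, which is precisely the kind of quantitative statement (‘X% of the time’) that a nonprobabilistic grammar cannot express.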
There have been relatively few attempts to apply probabilistic phonological models to Lx speech acquisition, but the number has been growing in recent years. One of the earliest was by Escudero and Boersma (2004), who sought to explain why L1 Spanish learners of L2 Scottish and Southern British English exhibit a peculiar strategy for perceiving English /iː/ and /ɪ/ that is found in neither Spanish nor English. Empirical evidence shows that many learners, especially those learning Southern British English as the target variety, rely primarily on the duration cue to distinguish the vowels, even though length is not contrastive in their L1 vowels and spectral cues are also important for distinguishing the L2 vowels. Escudero and Boersma explained this puzzling behavior as follows: learners initially map both English /iː/ and /ɪ/ to Spanish /i/, but as they notice the durational differences that seem to be informative for distinguishing the vowels, they establish a new length feature in their phonological system so that the two vowels are represented as /i, long/ and /i, short/, respectively. The hypothesized learning process was computationally modeled by formalizing the learners’ perception grammar and acquisition mechanisms using Stochastic Optimality Theory (StOT; Boersma, 1998) and the Gradual Learning Algorithm (GLA; Boersma and Hayes, 2001), which are probabilistic extensions of Optimality Theory (OT; Prince and Smolensky, 2002) and the Error-Driven Constraint Demotion (EDCD) algorithm (Tesar and Smolensky, 1998, 2000), respectively. Crucially, Escudero and Boersma (2004: 583) noted that traditional nonstochastic OT with EDCD would not have sufficed for their purpose because it is unable to handle variable mapping of auditory cues onto linguistic representations.
While many Lx phonologists may see perceptual modeling as outside the scope of phonology, Escudero and Boersma (2004: 553) argued that speech perception should be ‘a natural subject matter for linguistic theory’; Escudero (2005, 2009) and Boersma (2009) further developed this idea by formalizing cue constraints and their weighting as phonological rather than phonetic phenomena. Again, excluding certain phenomena in Lx speech acquisition as merely ‘surface’ and ‘extralinguistic’ risks diminishing the scope of what remains for Lx phonology.
Computational modeling based on StOT and GLA has been applied to various other Lx learning scenarios, such as L1 Dutch listeners’ perception of L2 Spanish vowels (Boersma and Escudero, 2008), L1 Japanese listeners’ perception of L2 American English vowels (Yazawa et al., 2020, 2023), and L1 Brazilian Portuguese speakers’ production of L2 English word-internal codas (Schmitt and Alves, 2014). It has also been extended to model the entire process of speech comprehension, encompassing the mapping from sound to meaning through multiple levels of representation (van Leussen and Escudero, 2015). Another type of grammar that has been recently used is Noisy Harmonic Grammar (Noisy HG; Boersma and Pater, 2016), a probabilistic extension of Harmonic Grammar (HG; Legendre et al., 1990). For example, Zhang and Tessier (2024) combined Noisy HG and GLA to model L1 Beijing Mandarin speakers’ acquisition of low vowel + nasal coda (loV-N) sequences in L2 North American English. Both Mandarin and English have three nasal consonants /m, n, ŋ/ and two low vowels contrasting in backness (/a, ɑ/ for Mandarin and /æ, ɑ/ for English), but Mandarin has two phonological restrictions that do not apply in English: (1) only [n] and [ŋ] are allowed in coda position but not [m], and (2) low vowels must agree in backness with the following nasal coda. The surface realizations of the legal Mandarin loV-N sequences (i.e. /an/ and /ɑŋ/) are also quite variable, with the coda nasal often lenited or entirely deleted. Zhang and Tessier’s model provided a quantitative prediction of how Mandarin speakers’ L1 phonotactic and phonetic knowledge shapes their production of L2 English loV-N sequences, including that coda deletion should be more frequent when the input nasal coda is /m/ and/or its backness is mismatched with the following vowel. 
Another recent study that used Noisy HG and GLA is Zhou and Hamann (2024), who addressed three phenomena observed in L1 Mandarin speakers’ acquisition of L2 Portuguese tap /ɾ/: (1) individual variability, (2) syllable-position effects, and (3) orthographic influences. For our purposes, we focus on individual variability here. Since Mandarin does not have a tap in its phonemic inventory, L1 Mandarin listeners tend to perceive the Portuguese tap as either /l/ or /t/, but the perceptual mapping patterns exhibit both inter- and intra-learner variation. Specifically, listeners who give more perceptual weight to spectral cues (e.g. formant values) than to the closure cue (e.g. brief silence) would perceive /t/, whereas those with reversed cue weighting would perceive /l/. However, this distinction is not strictly binary, since spectrally oriented perceivers may sometimes hear /l/ and closure-oriented perceivers may sometimes hear /t/, reflecting the probabilistic nature of speech perception noted earlier. Noisy HG grammars as adopted by Zhou and Hamann are well suited to modeling such individual variability because the constraint weights that are used to formalize perceptual cue weighting are probabilistic themselves, a property that would be incompatible with binary nanoparameters (for the problematic use of the term ‘parameter’, see Leivada, 2020).
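The Noisy HG evaluation underlying this kind of individual variability is straightforward to illustrate: each constraint carries a numeric weight, Gaussian noise is added to the weights at evaluation time, and the candidate with the highest harmony (the smallest noise-perturbed weighted sum of violations, negated) wins. The toy grammar below, loosely inspired by the tap example, maps an input [ɾ] onto /t/ or /l/ via two cue constraints; the constraint names, weights, and violation profiles are our own illustrative assumptions, not values from Zhou and Hamann (2024).

```python
import random

# Hypothetical cue-constraint weights for a 'spectrally oriented' listener:
# the spectral cue is weighted more heavily than the closure cue.
WEIGHTS = {"MatchSpectrum": 3.0, "MatchClosure": 2.5}  # illustrative values
NOISE_SD = 1.0

# For a tap input, each percept mismatches one cue (illustrative profiles):
# /t/ is favored by the spectral cue but violates the closure cue, /l/ the reverse.
VIOLATIONS = {
    "/t/": {"MatchSpectrum": 0, "MatchClosure": 1},
    "/l/": {"MatchSpectrum": 1, "MatchClosure": 0},
}

def harmony(cand, noisy_w):
    # Harmony = negative weighted sum of violations; higher is better.
    return -sum(noisy_w[c] * v for c, v in VIOLATIONS[cand].items())

def perceive():
    # Sample noisy weights anew for each evaluation, then pick the percept
    # with the highest harmony.
    noisy_w = {c: w + random.gauss(0.0, NOISE_SD) for c, w in WEIGHTS.items()}
    return max(VIOLATIONS, key=lambda cand: harmony(cand, noisy_w))

if __name__ == "__main__":
    n = 10000
    rate_t = sum(perceive() == "/t/" for _ in range(n)) / n
    print(f"/t/ rate ~ {rate_t:.2f}")
```

Because the weights themselves are resampled on every evaluation, this listener mostly, but not invariably, perceives /t/, capturing intra-listener variation without any change to the grammar; shifting the relative weights models inter-listener variation.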
As the above studies show, probabilistic constraint-based grammars can yield quantitative and testable predictions for any phonological (and ‘phonetic’) phenomenon that falls within the scope of OT or HG, thus offering a promising avenue for future research in the field of Lx phonology (and phonetics). The computational implementation of these grammars is also more accessible than is often assumed. For example, Praat (Boersma and Weenink, 2024), the de facto standard software for phonetics research, provides functionality specifically designed for probabilistic constraint-based modeling through its ‘OTGrammar’ objects, which both Zhang and Tessier (2024) and Zhou and Hamann (2024) utilized to implement their Noisy HG grammars. Lx phonologists who do not subscribe to constraint-based frameworks may also find probabilistic extensions applicable to their own approaches, such as the Variable Rules framework (Labov, 1969) for SPE-like rules (for a review, see Alderete and Finley, 2023).
IV Concluding remarks
Although our theoretical perspective on Lx speech acquisition diverges from that of Archibald (2025), we share a common belief in the need for more research on phonological representations beyond segmental categories. Building on Archibald’s arguments, we identify two promising directions for future research: (1) integrating insights from formal phonological analyses into existing models of Lx speech acquisition, as has been done within the L2LP model, and (2) acknowledging the role of probability within formal phonological theories, as has been done within BiPhon and L2LP. Similar to how Lx phoneticians have much to learn from Lx phonologists’ exploration of abstract, non-observable phenomena, Lx phonologists may find value in Lx phoneticians’ quantitative, data-driven approaches. As suggested by Archibald (2025), the academic world would certainly be a richer place if both fields were recognized as complementary rather than separate.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
