Abstract
We present an implementation of DIANA, a computational model of spoken word recognition, to model responses collected in the Massive Auditory Lexical Decision (MALD) project. DIANA is an end-to-end model, including an activation and decision component that takes the acoustic signal as input, activates internal word representations, and outputs lexicality judgments and estimated response latencies. Simulation 1 presents the process of creating acoustic models required by DIANA to analyze novel speech input. Simulation 2 investigates DIANA’s performance in determining whether the input signal is a word present in the lexicon or a pseudoword. In Simulation 3, we generate estimates of response latency and correlate them with general tendencies in participant responses in MALD data. We find that DIANA performs fairly well in free word recognition and lexical decision. However, the current approach for estimating response latency provides estimates opposite to those found in behavioral data. We discuss these findings and offer suggestions as to what a contemporary model of spoken word recognition should be able to do.
1 Introduction
The question of how a listener understands the meaning of what is being said is central to the field of speech perception and spoken word recognition. After decades of research, it is clear that the process of spoken word recognition is very complex. The sheer number of moving parts affecting the processing of speech makes it very difficult to keep track of all of them—let alone create precise predictions about particular situations—using a verbal model “stored” in one researcher’s mind. Computational models, however, allow (or rather, force) researchers to formalize their theories and generate measures or estimates that can be directly compared to behavioral data (see, for example, Perfors et al., 2011; Scharenborg & Boves, 2010; Tan et al., 2021 for similar arguments). Importantly, they allow researchers to observe the interplay of multiple relevant factors at the same time. By switching models, model characteristics, or model parameter values, we effectively change our assumptions about the process of spoken word recognition. We can observe the consequences of alterations regarding any of a number of major points of contention in the field. For example, we can test the adequacy of different prelexical representations, compare the performance of a model with and without top-down effects (or even fine-tune the impact of top-down information), pit different mechanisms of competitor activation and retention against each other, and so on. Overviews of models of spoken word recognition and how they attempted to solve these various problems are given in the works of Magnuson et al. (2012); McQueen (2007); Protopapas (1999); Scharenborg and Boves (2010); and Weber and Scharenborg (2012).
A crucial step in the process of spoken word recognition is isolating characteristics of the speech signal that act as reliable cues of its content. This particular problem has proven difficult due to lack of invariance, leading to a long debate and numerous explanations of how this process unfolds. Still, most abstract models of spoken word recognition, which are the focus of the present paper, sidestep the problem of analyzing the acoustic speech signal and “instead use an artificial, often-hand crafted, idealised discrete (prelexical) representation of the acoustic signal as input” (Scharenborg & Boves, 2010, p. 144).
The main reason for eschewing the acoustic signal was the technical limitations that all first- and second-generation models of spoken word recognition faced, not a lack of understanding of its importance. Topics ranging from acoustic-phonetic invariance to prosodic cues were central in the development of the Lexical Access From Spectra (LAFS) model proposed by Klatt (1979). The acoustic-phonetic representation in bottom-up approaches to spoken word recognition is also discussed by Pisoni and Luce (1987) in their overview of what are mostly considered first-generation models of spoken word recognition, as well as the Cohort model (Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978) and TRACE (McClelland & Elman, 1986b). Where the more recent, second-generation models of spoken word recognition are concerned, TRACE I was built around the acoustic signal being used as input, but TRACE II, the model that was actually implemented, employed acoustic pseudofeatures instead (McClelland & Elman, 1986b; Strauss et al., 2007). Similarly, Shortlist (Norris, 1994) used phoneme strings as input, but Shortlist B (Norris & McQueen, 2008) made a step toward representing their variability better using sequences of phoneme probabilities calculated over time slices, which were obtained from a diphone gating study with human listeners. (Despite the differences in model input, note that both TRACE and Shortlist can provide information over time, i.e., as the signal unfolds.) One consequence of using pseudo-acoustic input is the difficulty of establishing a direct comparison between model output and human performance, that is, of forming a linking hypothesis (see, for example, Magnuson et al., 2012; Tanenhaus, 2004).
However, although many of the technical limitations have since been alleviated, most abstract models of spoken word recognition continue to settle for pseudo-acoustic input. Even the more recently developed time-invariant string kernel (TISK; Hannagan et al., 2013; You & Magnuson, 2018) model uses phonemes as input, assuming that the process of phoneme recognition has already been successfully completed. This issue becomes even more acute when we take into account other domains within the broader field of speech perception. Classic models of speech perception focused on finding invariant acoustic cues in the speech signal, although these largely investigated recognition of individual speech sounds (for an early speech perception model focusing on lexical access, see Klatt, 1979). Some of the earliest implementations of episodic models of speech perception also used actual sound recordings (Johnson, 1997). More recently, researchers have managed to computationally extract features or categories from the acoustic signal using unsupervised learning (see, for example, Lee et al., 2015; Schatz et al., 2021). Similarly, studies show that information collected from spectra can be used to successfully categorize speech sounds (e.g., McMurray & Jongman, 2011; Stevens & Blumstein, 1978). A large body of research has continued to focus on using actual phonetic input to describe and model a number of phenomena in speech perception, such as categorical perception (Kronrod et al., 2016), perceptual recalibration (Norris et al., 2003; Theodore & Monto, 2019; Xie et al., 2021), unsupervised and supervised adaptation (Clayards et al., 2008; Kleinschmidt & Jaeger, 2016), or foreign-accent adaptation (Hitczenko & Feldman, 2016; Tan et al., 2021); for a summary of recent progress, see Kleinschmidt and Jaeger (2015) and Kurumada and Roettger (2022).
Still, even if we restrict our discussion to the abstract models of spoken word recognition that mostly rely on the activation–competition process to determine what the target word is, there are notable exceptions to the trend of avoiding acoustic input (although these models of spoken word recognition do not always investigate or explain how low-level acoustic features influence the process of spoken word recognition). Two early examples are SpeM and Fine-Tracker (Scharenborg, 2008, 2009; Scharenborg et al., 2005), while the other two examples are the most recent additions to the group of models of spoken word recognition, the discriminative lexicon approach to spoken word recognition (Baayen et al., 2019) and EARSHOT (Magnuson et al., 2020). Fine-Tracker maps the acoustic signal to a set of articulatory features, and it is capable of simulating durational and fine-phonetic detail effects captured in behavioral experiments (see, for example, Andruski et al., 1994; Salverda et al., 2003). The discriminative lexicon approach extracts frequency band summary features (Arnold et al., 2017), strings describing a particular frequency band of a temporal chunk of the signal in terms of its initial, final, median, and maximum amplitude. Both approaches are promising, but may require additional fine-tuning, as is the case with Fine-Tracker (see Scharenborg & Merkx, 2018), or additional testing, as in the case of frequency band summary features. EARSHOT uses spectrographic input and a two-layer neural network architecture to connect this input to pseudo-semantic vectors. Initial model accuracy in word recognition does show promise, but further model testing is necessary. Besides an inspection of the hidden units that showed similarities to human neural responses, EARSHOT’s output has for now only been compared to the results of the visual world paradigm study performed by Allopenna et al. (1998) and previous simulations of that study using TRACE.
To the best of our knowledge, the only other model to also implement the acoustic signal as input is a process-oriented model, DIANA (ten Bosch, Boves, & Ernestus, 2015). Like other notable models of spoken word recognition, DIANA is an activation and competition model. The current setup uses speech corpora to develop acoustic models (AMs), which are then used to calculate phone activations in novel acoustic signals presented to DIANA. What sets DIANA apart from models like Fine-Tracker, the discriminative lexicon approach, and EARSHOT is that it acts as an end-to-end model of spoken word recognition that is a computational approximation of (1) a general perceptual decision-making framework with a linking function for recognition and estimation of response latency, (2) built on the word activation and competition principles most models of spoken word recognition agree on, and (3) extended to take raw acoustic input (a feature shared with, for example, EARSHOT). Importantly, all of these features are an integral part of the model in the case of DIANA, so that the user does not have to rely on multiple different theoretical or computational models (or to choose from multiple different specifications of, for example, decision-making) to simulate a behavioral task. DIANA is built upon the computational modeling of an interpretable interaction between two human cognitive functionalities (namely, activation and decision) in such a way that it can take acoustic input (thus supporting ecological validity) while being able to predict overt outcomes (such as response latency). In DIANA, estimated response latencies are the result of underlying processes, rather than the focus of explicit modeling. Among all current computational models of spoken word recognition that can handle audio input, it is the only one that includes an explicit decision component. This decision component overlaps with computational approaches to human decision-making, such as diffusion models and ballistic accumulation models (see, for example, Ratcliff et al., 2004). In the case of DIANA, however, the drift is nonlinear and hypothesis-dependent.
In the present paper, we first give a more detailed description of DIANA and previous simulations of the auditory lexical decision and word repetition experiments performed using this model. We then motivate the present simulations, presenting the goal of the current study. The central part of the paper describes the simulations we performed and discusses the findings to both develop DIANA and further scrutinize the process of spoken word recognition.
1.1 DIANA
A coarse visual representation of DIANA’s components and parameters is given in Figure 1 (adapted from ten Bosch, Boves, Tucker, & Ernestus, 2015). The first major component is the activation component: The model takes the speech signal as input and uses existing automatic speech recognition AMs (described in detail in later sections) to activate subword units, which are phones in the current setup, and words in the mental lexicon represented as phone strings. This aspect of DIANA was partly inspired by an observation made by Norris and McQueen (2008). In Shortlist B, the direct connection between the model and the input audio is lost due to the use of intermediate phone-phone confusion probability tables instead of the acoustic signal itself. That step was justified by stating that Shortlist B aims at modeling a part of the cascade from audio to word and that it was not clear how humans perform this step. To create a full end-to-end model, one has to bridge that gap in some way. One possible implementational option for closing this gap is to use speech decoding techniques that (necessarily) operate from real audio, such as those developed in research on automatic speech recognition. Importantly, automatic speech recognition is not a panacea to solve the problem of how humans make the connection from audio to phone-like or word-like symbols. Instead, speech decoding techniques can be used as a proxy to compute activations of symbolic representations based on their bottom-up support from the (sub-symbolic) audio. The model can also weigh the activations using top-down information, such as word frequency. The impact of top-down information is controlled by changing the parameter γ.

Figure 1. DIANA takes the acoustic signal as input and has three components (activation, decision, and execution). Word activation depends on the input signal, the acoustic models, and the impact of top-down information, adjustable by changing the parameter γ.
The decision component operates in parallel with the activation component, reassessing whether a winner can be found at each time step. As words in the lexicon gain activation, they compete without lateral inhibition. Conceptually, DIANA follows the approach used in the Cohort model (Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978) to determine plausible candidates. Under this simplifying assumption, candidates that do not match the input will be discarded as the signal unfolds. Note, however, that in the Cohort model early mismatches are heavily penalized, as they depend on the strict and correct identification of individual phones. In DIANA’s current implementation, the match between the signal and the internal representations is probabilistic (as in Shortlist B) and early mismatches do penalize the word score, but do not necessarily exclude these words from competition. Activation changes and competition unfold until a winner is selected based on the difference in activation between the leading candidate and the runner-up. This difference is determined by an adjustable threshold.
Finally, the third major component of the model is the execution component. The execution happens after the decision has been made and represents the time taken to actually respond (e.g., press a button). In our implementation of DIANA, this time is set to 200 ms. This estimate of the execution time is based on existing measures of response times (RT) in different tasks (see Kosinski, 2008, for a review). Note, however, that increasing or reducing the execution time is a linear transformation that would not impact the correlation between model estimates and some existing behavioral measure. The more important question is whether an approximation that is a fixed number can represent the variability in human reaction time well—not all humans react equally fast. Since the intent of the model is to represent general tendencies in human behavior and since it is unclear how a distribution of execution times should be modeled, we employ the standard approach and use a fixed number.
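To make the interplay of the three components concrete, the following minimal sketch implements a frame-by-frame loop in which activations are weighted by frequency, a winner is declared once its advantage over the runner-up reaches a threshold, and a fixed execution time is added. The additive frequency weighting and the parameter names (gamma, theta) are simplifying assumptions of ours; DIANA’s actual activation and decision functions are nonlinear and hypothesis-dependent, as described above.

```python
import math

def diana_estimate_rt(frame_activations, word_freq, gamma=0.0,
                      theta=50.0, execution_ms=200.0, frame_ms=10.0):
    # frame_activations: one dict per 10-ms frame, mapping each word in the
    # lexicon to its activation (a scaled log probability) at that frame
    for t, acts in enumerate(frame_activations):
        # activation component: bottom-up support, optionally weighted by
        # top-down word frequency through gamma
        weighted = {w: a + gamma * math.log(word_freq.get(w, 1))
                    for w, a in acts.items()}
        ranked = sorted(weighted.values(), reverse=True)
        # decision component: commit as soon as the leader's advantage over
        # the runner-up reaches the threshold theta
        if len(ranked) == 1 or ranked[0] - ranked[1] >= theta:
            # execution component: a fixed motor time is added
            return (t + 1) * frame_ms + execution_ms
    # no decision before signal offset: decide at offset
    return len(frame_activations) * frame_ms + execution_ms
```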
Besides these integral processes of activation, decision, and execution, DIANA’s other elements could be replaced. For example, the analysis of the acoustic signal can be performed in many different ways or the representation of the lexicon could be made to match TRACE or the Distributed Cohort Model (Gaskell & Marslen-Wilson, 1997, 1999, 2002). The current setup can then be viewed as a task architecture focusing on the process of spoken word recognition, primarily, in word repetition and auditory lexical decision tasks (but see ten Bosch, Giezenaar, et al., 2016, for a simulation of L2 listener errors in comprehension of reduced word forms in a sentence dictation task). We provide more technical detail about the current task architecture of DIANA when we describe the setup of our simulations.
The simulations of word repetition and auditory lexical decision task using DIANA were performed almost exclusively in Dutch. The first such instance (ten Bosch et al., 2013) modeled auditory lexical decision responses to 613 disyllabic monomorphemic Dutch words made by 20 participants. The model showed comparable error rates to human participants as its accuracy was 96% for the “word” (participant average: 94%) and 93% for the “not a word” (participant average: 95%) response. Model estimates of when the decision should be made also correlated quite well with tendencies in participant response latency. The average correlation between the model estimates and each of the human participants separately was
In a subsequent report describing a simulation of participant performance in a word repetition task using the same stimuli, ten Bosch et al. (2014) explain the lack of between-participant correlation using the notion of “local speed effects” (Ernestus & Baayen, 2007). Local speed effects explain the tendency of response latencies to a certain stimulus to correlate with response latencies to a number of previous stimuli. It is assumed that these correlations are a product of, for example, learning, fatigue, or shifts in attention (see also Phaf et al., 1990, for an early example of a computational model taking participant attention into account). Since the variation induced by these factors can be treated as noise in comparison to long-term effects, such as general cognitive abilities, their effect should be attenuated, especially considering that a computational model of spoken word recognition is not susceptible to similar effects. Therefore, ten Bosch et al. (2014) took into account response latencies to five preceding stimuli when estimating the “true” response latency to a stimulus, similarly to taking into account previous RT in statistical modeling (see also ten Bosch et al., 2018). The number of relevant preceding stimuli was selected to achieve maximum participant-to-participant correlation in response latency. After local speed effects have been attenuated, the correlation between participants and the correlation of DIANA to the average participant response latency both increased. The highest correlation between DIANA and the average participant response latency recorded for this word repetition task dataset was
The same procedure that removed local speed effects was used in a later study (ten Bosch, Boves, & Ernestus, 2015) that again simulated participant responses to 613 Dutch words in the word repetition task. This time, however, different model parameters were also varied, showing that word frequency plays an important role in approximating participant response latencies, that the model should not just take the word with the highest activation score as the winner but also should add extra choice time if there is a close competitor at word offset, and that a word should in general have a substantial advantage to be selected as the winner. Applying these rules increased the average correlation between DIANA’s estimates and actual participant response latencies to
DIANA was also implemented outside of Dutch, albeit only once (ten Bosch, Boves, Tucker, & Ernestus, 2015). The dataset included responses from 10 to 12 native and non-native listeners of English to 1,200 words. The results of the simulation still showed satisfactory performance of DIANA, with the correlation between model estimates and average participant response latency in an auditory lexical decision task being
1.2 The present study
One of many advantages of studies with very large item and/or participant sample sizes, often called megastudies (see Balota et al., 2012), is that they enable testing how well model estimates correspond to human performance by providing a behavioral database for comparison. The results of these comparisons are necessary for further model development. An extensive overview of existing megastudies is given in the work of Keuleers and Balota (2015), while a more recent list is available at http://crr.ugent.be/archives/2141.
Most large studies investigate responses to visually presented, written stimuli. The number of existing databases and their sizes are smaller for the auditory domain. Still, such databases are instrumental in the development of many models of spoken word recognition. One of the first larger databases was created by Luce and Pisoni (1998) and the data gathered in this study were used in the development of the Neighborhood Activation Model (see also Luce, 1986). Another example is the study conducted by Smits et al. (2003), as the collected data were used in the development of Shortlist B (Norris & McQueen, 2008). The Biggest Auditory Lexical Decision Experiment Yet (BALDEY; Ernestus & Cutler, 2015) collected responses to 5,541 Dutch content words and pseudowords from 20 young native Dutch listeners and was instrumental in testing DIANA (ten Bosch, Boves, & Ernestus, 2015, 2016).
Massive Auditory Lexical Decision (MALD; Tucker et al., 2019) is a still-ongoing project designed to provide an even larger database of responses to isolated words presented in the auditory modality, with the goal of complementing the existing databases in the visual domain, such as the English Lexicon Project (Balota et al., 2007). One of the purposes of building a large database of MALD responses is to test existing computational models of spoken word recognition. The goal of the present study is to implement DIANA in English and test how well it matches participant performance in an auditory lexical decision task using MALD data. Although correspondence to actual participant behavior is only one of the criteria for estimating the adequacy of models of spoken word recognition (see Scharenborg & Boves, 2010, for an extended discussion), an acceptable fit is still necessary for a model to be considered credible.
DIANA aims to be language-independent and, in our simulations, we want to investigate the challenges of implementing DIANA for the first time—just as any researcher would for a language in which DIANA has not yet been tested. Therefore, although DIANA was already tested in English on a smaller scale (ten Bosch, Boves, Tucker, & Ernestus, 2015), we develop new AMs from spontaneous speech corpora, completing the entire process a researcher in any language would have to undertake to implement DIANA for their own purposes. These AMs are adapted for a single speaker, the same speaker that MALD participants listened to in the auditory lexical decision experiment. This approach restricts the model input variability to the variability present between segments produced by the same speaker in different words, not between different speakers. Once the AMs are created, we test DIANA’s performance in recognizing words in novel speech signals by calculating between-word competition as a function of time, and, most importantly, by simulating the lexical decision task. In addition, we compare model estimates to actual participant performance in MALD on a large scale and test model adequacy in that way. The original data, together with the DIANA and statistical analysis scripts, are available as supplementary material at https://doi.org/10.7939/r3-jdpa-dn72.
2 Behavioral experiment
As we noted in the introduction, we compare DIANA model estimates to human performance in the MALD project database (Tucker et al., 2019). We use the first version of the dataset (MALD1) available at mald.artsrn.ualberta.ca. In the present paper, we provide only the necessary information about the MALD experiment and the word and pseudoword recordings. More details about the stimuli and procedure are available in the work of Tucker et al. (2019).
2.1 Sample
The MALD1 dataset includes responses from 231 native monolingual English listeners (180 females, 51 males; age M = 20.11, SD = 2.39). All participants were recruited from the University of Alberta, receiving partial course credit for their participation.
2.2 Stimuli
Stimuli recordings were made by one 28-year-old male speaker of western Canadian English. The speaker was recorded reading isolated words and pseudowords presented on a computer monitor. He was instructed to produce the words written in their standard spelling as naturally as possible. Pseudowords were presented in their International Phonetic Alphabet (IPA) phonemic transcription and the speaker was instructed to read them as if they were words. All word and pseudoword recordings are available as separate wave files and have been aligned using the Penn Forced Aligner (Yuan & Liberman, 2008).
The recording procedure and post-processing of the stimuli yielded 26,800 words and 9,600 pseudowords used in the experiment. The words were split into 67 sets, and the pseudowords were split into 24 sets. Each word and pseudoword set contains 400 unique items. A total of 134 pairings of one word and one pseudoword set were made (i.e., each word list was paired separately with two different pseudoword lists), creating 134 balanced 800-item lists used in the behavioral experiment.
The simulations described in the following sections have many steps, and a small number of words/pseudowords was lost between these steps for various technical reasons (mostly recording errors, mismatches between item lists, or missing MALD1 RT data). In the interest of clarity and brevity, we do not document all of these losses in the paper because they are minor and because the number of items used always remains in the hundreds or thousands. We do provide the exact number of items used for critical simulations and comparisons to MALD data. Detailed information about the simulation process can be found in our supplementary materials, and following the scripts allows more detailed tracking of item loss.
2.3 Procedure
The participants were seated in a sound-attenuated booth for the experiment. A single 800-item list of stimuli was presented using the E-Prime experimental software (Schneider et al., 2012). Stimuli order was randomized. After a visual fixation cross lasting 500 ms, a word or a pseudoword was presented over headphones and the task for the participants was to decide whether the signal was a word of English or not by pressing the “yes” or “no” button on the button box. Responding during stimulus presentation would interrupt it and the experiment would proceed to the next trial. If no response was made within 3 seconds, the following trial was presented. The participants had the option of returning for another session and a new experimental list up to three times. Some participants, therefore, completed more than one list (but never the same word or pseudoword set), and a total of 284 responses to experimental lists were recorded.
3 Simulation 1—AMs
The first goal of Simulation 1 was to follow the process of setting up DIANA from scratch. We developed our own AMs and compared their performance with the performance of existing AMs for English in a free word recognition test. We do not compare model estimates to participant data in this simulation.
3.1 Simulation setup
Acoustic models can be trained using careful (read) speech corpora, such as TIMIT (Garofolo et al., 1993) or LibriSpeech (Panayotov et al., 2015), which was used in the development of the Montreal Forced Aligner (McAuliffe et al., 2017). AMs can also be trained using spontaneous speech corpora, such as SCOTUS (Yuan & Liberman, 2008), which was used in the development of the Forced Alignment and Vowel Extraction (FAVE; Rosenfelder et al., 2014) suite. We used two unpublished spontaneous speech corpora as a baseline for creating AMs. The WCE spontaneous speech corpus contains telephone call recordings made by 11 speakers, while the Corpus of Spontaneous Multimodal Interactive Language (CoSMIL) contains conversation recordings of eight pairs of speakers. We decided to use WCE and CoSMIL to train AMs for three reasons. First, many languages do not have extensive support in terms of previously available speech corpora. Using our own corpora, we show that an independent researcher could create a spontaneous speech corpus for their language of interest and use it to create AMs for DIANA. Second, the speakers in WCE and CoSMIL speak the western Canadian variety of English, the same variety as the MALD speaker. Third, human listeners are more often exposed to spontaneous, conversational speech than to careful enunciations. It is best when a model of spoken word recognition can be presented with the same input as the human listener; in our study, both are presented with MALD items in the test phase. However, we also wanted to represent the kind of “practice” human listeners receive as faithfully as possible, so we used spontaneous speech in the training phase.
In our implementation of DIANA, similar to previous implementations, we trained the AMs using automatic speech recognition training in the Hidden Markov Model Toolkit (HTK; Young et al., 2006). HTK is a modular open source system for speech decoding. The process of creating the AMs described below is time-consuming. In ideal circumstances, starting from a few hundred hours of speech, this process may be completed in several days, computational time included. However, the amount of time will vary depending on a number of factors, one of them being familiarity with the technicalities involved. This estimate does not include the process of creating and organizing acoustic recordings and compiling reference data about this material.
WCE and CoSMIL recordings were separated into brief speech intervals, and we further split the longer transcribed intervals to create speech chunks shorter than 10 seconds. We excluded speech chunks that entirely consisted of silent pauses, laughter, or other non-speech noise. In total, just over 9 hours of speech were isolated and split into 20,086 speech chunks each shorter than 10 seconds. We downsampled the speech chunks to 16 kHz, and excluded 31 speech chunks due to potential sound clipping.
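A simple heuristic for detecting potential clipping, shown below as a minimal sketch, is to flag files containing runs of consecutive samples at or near full scale; the function name and threshold values are illustrative, not the exact procedure we used.

```python
import numpy as np
import soundfile as sf

def is_potentially_clipped(path, level=0.999, min_run=3):
    # flag recordings with a run of consecutive samples at or near full
    # scale, a common heuristic for detecting clipping
    x, _ = sf.read(path)
    amp = np.abs(x) if x.ndim == 1 else np.abs(x).max(axis=1)
    run = best = 0
    for at_ceiling in amp >= level:
        run = run + 1 if at_ceiling else 0
        best = max(best, run)
    return best >= min_run
```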
The first step in the training procedure takes the speech chunk input and creates estimates for all transcribed units (in this case, phones) as three-state hidden Markov models (HMMs), while the acoustic characteristics of phones are represented by Gaussian mixture models (GMMs). An HMM is a classical, elegant, and computationally feasible solution for representations that correspond to time-varying signals, such as words and syllables. An HMM is a network (often a linear sequence with self-loops) of nodes or “states”; each state in an HMM is then associated with a statistical distribution over a collection of very short speech sounds (e.g., all transients into an /a/, stable portions of an /u/, spectral properties of the burst of a /t/). Those statistical distributions can be modeled by a GMM or by other statistical models (nowadays, deep neural networks have become more widely used; see Abdel-Hamid et al., 2014). Allophonic variation is encoded in terms of deviations from acoustic trajectories matched by an HMM, and individual Gaussians in a GMM may capture a short stretch of such an allophonic trajectory. The HMM is then a truly representational structure, while the Gaussians or the deep network are mathematical constructs used to associate the audio signal with the HMM states. The HMMs themselves do not have a cognitive status, but they act as an operational computational unit that simulates the activation between audio and representational units. The modeling power of an HMM depends on its topology (linear, or multiple paths in parallel, the number of nodes per path, and the number of statistical distributions used to link the acoustic statistics with each node). The number of Gaussians per node is indirectly related to the number of allophones.
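As an illustration of the emission side of this architecture, the following sketch scores a single MFCC frame against the GMM attached to one HMM state, assuming diagonal covariances; all names are ours, and the code is a conceptual sketch rather than HTK’s implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_state_loglik(frame, weights, means, variances):
    # log p(frame | state): one MFCC frame scored against the GMM attached
    # to a single HMM state (log-sum-exp over mixture components)
    components = [np.log(w) + multivariate_normal.logpdf(frame, mean=m,
                                                         cov=np.diag(v))
                  for w, m, v in zip(weights, means, variances)]
    return np.logaddexp.reduce(components)
```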
Speech chunks from conversational speech corpora often included two or more connected words. Therefore, we expanded the created AMs to also include estimates for short pauses in speech, that is, we created the so-called “sp models” (for more information, see Young et al., 2006).
Increasing the number of Gaussian components per state may reliably reduce error rates in word recognition (Vertanen, 2006), so in the second step of creating the AMs, we increased the number of components per HMM state from 1 to 2, then 4, 8, 16, and finally to the usually recommended 32. The currently employed monophone system assumes that phones are context-independent. In reality, they are not, so with larger training material, triphone models can be created to take phonetic context into account. We kept our models simple due to our limited training material, and also because HTK is just a technical mechanism to bridge audio on one hand and activations of words as items in a dictionary on the other.
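In HTK, this stepwise increase is typically performed with HHEd’s MU (mixture-up) command, followed by Baum-Welch re-estimation with HERest (see Young et al., 2006, for the full recipe). The sketch below shows the loop; the file names and directory layout are purely illustrative.

```python
import subprocess

for n_mix in (2, 4, 8, 16, 32):
    with open("split.hed", "w") as hed:
        # HHEd "MU" (mixture-up) command: raise the number of components in
        # the emitting states (2-4) of every phone HMM to n_mix
        hed.write(f"MU {n_mix} {{*.state[2-4].mix}}\n")
    subprocess.run(["HHEd", "-H", "hmm/hmmdefs", "-M", f"hmm_{n_mix}mix",
                    "split.hed", "monophones"], check=True)
    for _ in range(4):  # a few Baum-Welch re-estimation passes per split
        subprocess.run(["HERest", "-I", "train.mlf", "-S", "train.scp",
                        "-H", f"hmm_{n_mix}mix/hmmdefs", "monophones"],
                       check=True)
```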
The third and final step in creating the AMs was speaker adaptation. In this step, recordings from the MALD speaker (the speaker that the model will be tested on) are introduced to realign AM estimates. This process enables the AM to perform substantially better when recognizing speech recorded by a particular individual. Besides being a technical necessity, speaker adaptation is also cognitively relevant; inter-talker variability is high (see, for example, Kleinschmidt, 2019; Magnuson et al., 2021) and humans too show improved performance upon adapting or being exposed to particular speakers or speaker groups (see, for example, Bradlow & Bent, 2008; Eisner & McQueen, 2005; Xie et al., 2021).
Using a portion of speakers’ recordings for training purposes limits the amount of material remaining for the test phase. Considering that the amount of material from the same speaker used in a behavioral experiment may be small to begin with, we wanted to test how much material is required to create adequate AMs. Starting with the “sp model” described above, we created separate speaker-adapted models differing in the number of MALD word recordings used for adaptation. Pseudoword recordings were not used in training. The smallest adaptation set included only three MALD word lists with a total of 1,200 words. Larger adaptation sets were created in increments of three (six lists, nine lists, 12 lists, etc.) up to 45 MALD word lists with a total of 18,000 words. Each list includes just under 4 minutes of speech.
We compared speaker-adapted models in their ability to recognize the input signal from a list of candidates comprising all 26,000 MALD words. In other words, all MALD words were used as the lexicon and we used six MALD lists (46–51) as test (input) material. In the current implementation of DIANA, the activation component analyzes the acoustic input by converting it into vectors of mel-frequency cepstral coefficients (MFCCs), while the acoustic characteristics of every phone in the lexicon, as we stated above, are represented by GMMs specifying the distribution of MFCC vectors for the three states in the HMM that each phone has. MFCCs are built upon the characteristics of the human auditory pathway (critical bands, energy compression) and have been, since the 1990s, the dominant audio representation in many analyses based on audio (and speech in particular; Rabiner & Juang, 1993) as well as in computational psycholinguistics. The MFCC representation describes the audio signal as a sequence of low-dimensional vectors. Each MFCC vector contains a number of coefficients, each representing an aspect of the spectrum of a small portion of the speech signal. The MFCC vector sequence is computed using a sliding window—an analysis window of 25 ms is standard, as it is considered long enough to cover both steady and transient portions in the speech signal—with a window shift of 10 ms. The number of MFCCs mentioned in the literature may differ: while certain studies mention the use of 12 coefficients, other studies mention 13. Often, 12 MFCCs (c1, ..., c12) are combined (augmented) with log(E) (i.e., log energy), resulting in log(E), c1, c2, ..., c12, while studies using 13 MFCCs always refer to the additional use of the “zeroth coefficient” (c0), which is also a measure of energy, resulting in c0, c1, ..., c12. The matching is performed using a Bayesian framework that is often employed in conventional automatic speech recognition. Briefly, the probability that a certain sequence (word) is observed given the acoustic signal depends both on the probability that such an acoustic signal would be created for that sequence (this first factor is related to the AM) and on the probability that such a sequence would occur at all (this second factor is related to the language model and captures effects of expectations based on, for example, frequency of previous occurrence); in Bayesian terms, P(word | signal) ∝ P(signal | word) × P(word). Therefore, the activation values are scaled log probabilities. Calculations are performed for every 10 ms of input, as per the HTK default settings. Since the goal was to assess the quality of the AMs, activation values were not weighted by word frequency (γ = 0). Furthermore, we did not use the decision component of DIANA; we simply observed whether the correct, target word had the highest activation value.
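For concreteness, the standard 25-ms window and 10-ms shift described above can be reproduced with an off-the-shelf library; the sketch below uses librosa purely as an illustration (it is not part of our DIANA toolchain, where HTK computes the equivalent features internally).

```python
import librosa

# 13 coefficients (c0, c1, ..., c12); 25-ms analysis window, 10-ms shift
y, sr = librosa.load("stimulus.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),       # 400 samples
                            hop_length=int(0.010 * sr))  # 160 samples
print(mfcc.shape)  # (13, number_of_frames)
```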
To further assess AM quality, we also compared our AMs based on spontaneous speech corpora with FAVE AMs (Rosenfelder et al., 2014), likewise adapted for the MALD speaker. In addition, one may be concerned whether the obtained model performance is a consequence of which MALD lists were used for training and which for testing. For that reason, we created another set of AMs based on CoSMIL and WCE which were adapted on MALD lists 67–23, again in steps of three lists (i.e., the model adapted on three lists used lists 67–65, the model adapted on six lists used lists 67–62, etc.), and tested the performance of these models on MALD lists 22–17.
Finally, we created n-best lists to show the top candidates and their activations at word offset. These lists allow us to see whether the competitors considered alongside the word with the top activation are sensible, and also to inspect the cases in which the wrong winner is selected. We created 20-best lists, that is, observed the top 20 candidates for every target word. The choice of the number of candidates considered was arbitrary and made to ascertain that no important candidates would be omitted, and also to allow feasible computation and data manipulation. The number of retained candidates is comparable to those used in established measures, such as orthographic Levenshtein distance 20 and phonological Levenshtein distance 20 (OLD20 and PLD20; Yap & Balota, 2009; Yarkoni et al., 2008).
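Operationally, an n-best list is simply the top of the activation ranking at word offset, as the following sketch illustrates (the activation values in the usage comment are made up).

```python
def n_best(word_activations, n=20):
    # rank the lexicon candidates by their activation at word offset and
    # keep the top n
    return sorted(word_activations.items(),
                  key=lambda item: item[1], reverse=True)[:n]

# usage with made-up activation values:
# n_best({"tales": -1402.1, "tails": -1402.1, "pales": -1409.8}, n=2)
```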
3.2 Results
Free word recognition results are presented in Figure 2. We can see that free word recognition accuracy is relatively low when models unadapted to the MALD speaker are used. In this initial step, the FAVE model performs slightly better than our own models. Adapting the AMs on more MALD words leads to a large improvement in free word recognition at first, but this effect is reduced for adaptations performed on more than nine MALD word lists, especially for models based on WCE/CoSMIL. AMs created based on WCE and CoSMIL (circles) slightly outperform those based on FAVE AMs (triangles), but this difference becomes smaller as more words are added and disappears when the adaptation is performed on 40 MALD word lists or more. AMs trained and tested using a different set of MALD word lists (squares) show that the obtained free word recognition accuracy (circles) is not an artifact of the training/test set used; in fact, these models even have slightly higher accuracy. As another point of comparison, the AMs used by ten Bosch, Boves, Tucker, and Ernestus (2015) in the pilot DIANA simulations of MALD data had an accuracy of 82% when 500 words were tested with a lexicon of 36,000 word competitors.

Figure 2. Accuracy in free word recognition on six MALD word lists. The number of MALD word lists used for speaker adaptation is given on the x-axis. Average recognition accuracy and confidence intervals across the six MALD word lists used for testing are given on the y-axis. The legend distinguishes between models based on WCE and CoSMIL and adapted on MALD word lists 1–45 (circles), models based on the FAVE AM and adapted on MALD word lists 1–45 (triangles), and models based on WCE and CoSMIL and adapted on MALD word lists 67–23 (squares).
We selected the model adapted on MALD word lists 1–30, henceforth referred to as AM30, for all subsequent simulations. The difference in average accuracy between AM30 and the AM adapted on 45 MALD lists is only 1%. The model adapted on 33 MALD lists is the first model where we see a slight decline rather than an increase in free word recognition accuracy, indicating that any additional realigning may be volatile. AM30 still offers slightly higher accuracy (by 1.4%) than the model adapted on 15 MALD word lists, as well as a smaller difference in accuracy across the six test lists. Choosing AM30 as the model to be used leaves 37 MALD word lists available for testing purposes.
We also used AM30 to extract the 20-best candidates for the target words in the six MALD test lists. We noted sensible competitors in all cases, regardless of whether the correct word was selected as the winner or not. Table 1 shows the winning word and the top 3 competitors for the target words tales and proceed. For the first word, the string of phones was correctly recognized, although the target word shares the same activation level as its heterographic homophone tails. The correct word was selected as the winner because it appears earlier in an alphabetized list of words. (Note that weighting activations by word frequency would change the activation values of the homophones so that the more frequent homophone would be selected as the winner; regardless, in later simulations we treat a win by any homophone of the target word as correct.) High activations of the rhyme competitors pales and hails indicate that the model is considering candidates with initial phone mismatches. The remaining candidates not shown in Table 1 for the word tales, in order of decreasing activation, were fails, sales, sails, bales, trails, veils, nails, ails, jails, rails, wails, whales, feels, Brailles, males, and scales. The word proceed was incorrectly recognized as precede by a very small difference in activation, indicating that small differences in vowel characteristics may be difficult for the model to tease apart. Other close competitors include words that have the same lemma as the target word, while the remaining candidates were preceded, perceived, proceeding, preceding, poppyseed, airspeed, proceedings, preseason, and concede (in some cases, the output presents fewer than 20 candidates as there are not as many suitable options).
Table 1. Activation of Top 4 Candidates at Word Offset for Two Example Words.
Note. Higher values indicate higher activation (e.g., −100 is better than −200). Activation level is also dependent on signal length, with longer words reaching more negative values than shorter words. For the word tales, the correct phone string was detected and selected as the winner. For the word proceed, precede was incorrectly detected as the winner, with the target word being a close second.
Out of 2,403 words considered, only 14 were not one of the top 20 candidates for their signal: bow, curb, dear, tongues, desirous, boors, brazier, juggle, bairn, beer, betrothed, croquette, mowing, and priority. We found no errors in the recordings of these words and no commonalities between them, except perhaps that they all begin with an oral or a nasal stop. The 20 closest candidates for these words still seemed sensible, at least based on a subjective assessment made by the authors. In all other cases, even when the correct word is not selected as the winner, it is at least a close competitor. In 59% of the remaining cases the correct word is the runner-up and in 88% it is within the top 5 competitors. A closer (manual) inspection of errors showed that some of them were made because there is a heterographic homophone among the competitors, such as in urns and earns or genes and jeans. Other errors mainly occur due to uncertainty whether there should be an initial stop or not (e.g., breast winning instead of arrest and aiding winning instead of bathing), due to omitting the final stop (individualize instead of individualized), or due to the wrong vowel being activated (cake instead of kick). Complete information on 20-best lists can be found in our supplementary material.
3.3 Discussion
Setting up the HTK AMs required for DIANA simulations was relatively successful. It seems that approximately 9 hours of transcribed and labeled spontaneous speech is sufficient to create AMs that will, after speaker adaptation, perform on par with certain existing AMs. Where speaker adaptation itself is concerned, we selected the model trained on 30 MALD word lists (AM30), that is, we used slightly less than 2 hours of careful speech from the MALD speaker. It also seems that similar results in free word recognition can be obtained with the equivalent of 40 minutes of speech (approximately, 10 MALD word lists).
Currently, DIANA certainly requires more adaptation to a particular speaker than a human listener does. Humans are capable of understanding even those speakers they have never heard before despite the fact that the speech signal is highly varied between speakers (see, for example, Johnson & Sjerps, 2018; Kleinschmidt, 2019; Magnuson et al., 2021; Peterson & Barney, 1952; Weatherholtz & Jaeger, 2016). Still, DIANA performs well without years of experience humans have with speech, and it seems that the amount of required recording material for training is not unfeasibly high.
Free word recognition accuracy never exceeded 93% in any of the AMs for any of the test lists—we see that free word recognition accuracy is roughly between 85% and 90%. This is a result that could perhaps be improved using a larger, more varied training set or by providing the AM with transitional probabilities between segments. However, the competition process included 26,000 competitors for every word and even when a mistake was made, the target word was often among the closest competitors. Therefore, we decided to use AM30 in subsequent DIANA simulations of the auditory lexical decision task.
4 Simulation 2—lexical decision
When simulating the auditory lexical decision task, DIANA’s task architecture treats the process as containing two distinct decisions—(1) the decision of whether the signal is a word or a pseudoword and (2) the decision of which word is the winning candidate and when it is selected as the winner. In Simulation 2, we use DIANA to simulate the first decision: whether a signal is a word or a pseudoword. We also compare DIANA errors to MALD1 participant errors.
4.1 Simulation setup
The end result of the activation process in DIANA is a string of phones (one or more) that aligns with the acoustic input to some degree. When a lexicon of words is presented for the model to choose from, the model only uses strings of phones that exist in the lexicon (i.e., that are stored as words) as valid options. In those cases, the signal is placed in a kind of Procrustean bed, as it is forced to align with the presented options. The best of these options, that is, the word whose string of phones best matches the acoustic input, will have the highest activation. We saw this in Simulation 1, where one word is selected as the winner for having the highest activation, while other candidates have lower activation values. We refer to this kind of activation as word activation.
However, there may be a string of phones that would fit the acoustic signal better than any of those present in the mental lexicon. This can be tested using a language model that does not include a mental lexicon with a list of word candidates. Instead, it only contains phones and, optionally, probabilities of moving from one phone to another. In our simulation, we treated all possible phone transitions as equally probable. Note that this is not likely the case for real listeners, as they show a tendency to learn and use phonotactic constraints or transitional probabilities in behavioral tasks (e.g., Bailey & Hahn, 2001; Warker & Dell, 2006). When there is no mental lexicon of words, there is no Procrustean bed for the signal to be forced into; the model simply chooses the string of phones that yields the highest possible activation level. We refer to this activation as free phone activation.
Since free phone activation is the highest activation obtainable for a particular acoustic input (given the existing phone AMs), word activation can never exceed free phone activation. Words form a subset of the set of all word-like, phonotactically licensed phone sequences: if a particular word is indeed the optimal fit for a signal, free phone activation would simply yield that same phone string (the same word) and match the activation level registered. Conversely, if the signal does not perfectly match any of the words in the lexicon, forcing it onto one of them leads to imperfect matching and therefore lower word activation levels, while free phone activation would still come up with the optimal string of phones and would remain high.
DIANA uses this difference between word activation and free phone activation to perform lexical decision. The decision whether a signal is a word or a pseudoword is made by comparing the best possible activation of a word candidate present in the mental lexicon (best word activation) to the best possible activation achieved if any phone sequence is allowed (free phone activation). The larger the difference in word activation and free phone activation, the less the signal resembles the given word (though keep in mind that the same result in word activation and free phone activation does not necessarily mean that the model recognized the correct, intended word). When a pseudoword is presented to the model, free phone activation should be significantly higher than word activation for any word in the mental lexicon, simply because phone strings comprising pseudowords are not present in the mental lexicon. Words, in turn, should have similar free phone and word activations. This should yield two distinct distributions of differences between free phone and word activation, forming a group in which the difference is 0 or close to 0 (words) and a group in which the difference is larger (pseudowords).
Ideally, there would be no overlap between these two groups of stimuli, allowing the model to perfectly distinguish between them. However, this would require AMs that perform perfectly, in addition to all word and pseudoword recordings having very careful enunciations of every phone in the word that align well with the AM. Instead, DIANA employs a threshold θ_lb that specifies how small the difference between free phone activation and word activation must be for a signal to be considered a word. This threshold is adjustable, and we investigate which value leads to the best accuracy in word and pseudoword classification. We test which word versus pseudoword accuracy levels are obtained for a range of θ_lb values and compare them to those recorded in human responses.
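The resulting decision rule is compact enough to state as code; in the sketch below, the default threshold anticipates the value we eventually select in this simulation.

```python
def lexical_decision(best_word_activation, free_phone_activation,
                     theta_lb=70.0):
    # "word" response iff the best lexical candidate comes close enough to
    # the ceiling set by unconstrained (free phone) decoding; the default
    # of 70 anticipates the value selected later in Simulation 2
    return (free_phone_activation - best_word_activation) <= theta_lb
```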
Besides calculating free phone activation and introducing pseudowords, we made additional changes in comparison to Simulation 1. We performed the simulation on all MALD words from lists 31 to 67 (i.e., all lists that were not used in adapting the model AM30, a total of 14,800 words) and on all MALD pseudowords. Instead of using all of the MALD words as the lexicon of candidates, we created separate lexicons for every word and pseudoword. Since DIANA conceptually endorses a Cohort-like competition (Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978), the lexicon included all short words (three phones or fewer) and all words that share the first three phones with the target word. The competitors were selected from the Carnegie Mellon University Pronouncing Dictionary (CMU; Weide, 2005). This procedure yielded lexicons of approximately 25,000 words, mostly comprised of words with three phones or fewer. In other words, the intention was to limit the competitor list under the assumption that the first three phones would be correctly recognized as the signal unfolds, but to expand the number of close competitors by including more similar-sounding words.
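A minimal sketch of this lexicon construction, assuming the CMU dictionary is available as a mapping from words to phone lists, is given below.

```python
def cohort_lexicon(target_pron, cmu_dict):
    # cmu_dict: word -> phone list; competitors are all words of three
    # phones or fewer, plus all words sharing the target's first three phones
    prefix = tuple(target_pron[:3])
    return {word: pron for word, pron in cmu_dict.items()
            if len(pron) <= 3 or tuple(pron[:3]) == prefix}
```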
We noticed in Simulation 1 that sometimes the wrong word is selected as the winner because the target word has a heterographic homophone in the lexicon (as in the urns vs. earns example mentioned previously). Using the entire CMU dictionary to create separate lexicons of competitors for each target word introduced many such heterographic homophones. To avoid this issue, we treated all cases in which a homophone of the target word was selected as the winner as accurate, given that homophones have identical activations in DIANA and that in an auditory task where single words are recognized, both are technically correct.
The activation scores for words were still not weighted by word frequency; that is, the weighting parameter γ remained set to 0.
4.2 Results
For word recordings, the maximum difference between word and free phone activations was 371.87, recorded for the word depopulation. The mean difference in activation was 29.64, while the median was 16.93. A total of 3,303 words (22.26%) had an activation difference of 0, meaning that free phone activation perfectly matched word activation. For pseudoword recordings, the differences between word and free phone activations were, as expected, higher. The highest difference was 1,482.45 and it was recorded for the pseudoword /εkmɪsieɪskləɹoʊsiz/, which the model fitted as Izzy’s—competitors with more than three phones beginning with /εkm/ were rare and not similar to this pseudoword. The mean difference in activations for pseudoword recordings was 222.34 and the median was 184.38. Still, 337 pseudowords (3.5%) had an activation difference of 0, meaning that the model incorrectly interpreted them in a way that perfectly matched a word in the lexicon.
Figure 3 shows the distribution of word and free phone activation differences for word and pseudoword recordings. We set the x-axis limit to activation difference of 500 to make the distribution in the lower values more visible, but the long tail of differences continues for pseudoword recordings up to 1,482.45. In the case of words, as the activation difference increases, the number of words with that difference between word and free phone activation decreases; most words tend to have a small difference between word and free phone activation. For pseudowords, this trend can also be noted, but with a much smaller slope, as the distribution is, especially toward the lower hundreds, nearly uniform.

Histogram of differences in word activation and free phone activation for MALD words (lists 31–67) given in dark gray and MALD pseudowords given in light gray. The x-axis is limited to activation difference of 500, but the same trend continues to the maximum activation difference recorded which is 1,482.45.
We then examined the ratio of “word” versus “pseudoword” responses and model accuracy in predicting whether the input signal is a word or a pseudoword as a function of θ_lb. The percent of “word” responses increases as θ_lb increases. When θ_lb is 0, only 13.01% of the stimuli are selected as real words. With θ_lb of 150, the percent of “word” responses in all stimuli rises to 70.44%. Figure 4 shows this relationship and also includes three points of special interest on the curve. These points mark the quartiles of the percent of “word” responses in all experimental sessions in MALD1. The middle half of MALD1 sessions (the interquartile range) is found between points Q1 (48.12% “word” responses) and Q3 (54.83% “word” responses). These results indicate that in most sessions MALD1 participants had a fairly balanced response regime, making roughly an equal number of “word” versus “pseudoword” responses. Since DIANA aims to simulate general tendencies in participant behavior, it seems that θ_lb should be set so that the model’s percent of “word” responses falls within this interquartile range.

Figure 4. The relationship between the threshold θ_lb and the percent of “word” responses DIANA makes. Points Q1, Q2, and Q3 are added for comparison and represent quartiles of the percent of “word” responses in MALD1 sessions.
DIANA’s accuracy in classifying words versus pseudowords is also dependent on θ lb. As the threshold rises, so does the percent of word signals correctly recognized as words. At the same time, the number of false alarms increases, as more and more pseudowords are mistakenly taken for words. Figure 5 shows how accuracy for words and pseudowords changes as a function of θ lb and again includes quartiles from MALD1 sessions for comparison. We see that in our current setup DIANA inevitably performs worse than an average MALD1 participant, as it cannot achieve a performance higher than the value of Q1 for both words and pseudowords at the same time. Since the focus of our simulations was responses to words and since we wanted to match the participants’ balanced response regime, we settled on a θ lb value of 70. With this threshold value the model made 55.65% “word” responses and had an accuracy of 87.92% when responding to words and 76.44% when responding to pseudowords.

Figure 5. DIANA’s accuracy in lexical decision for words and pseudowords as a function of threshold θ lb. Points Q1, Q2, and Q3 connected to the word and pseudoword curves represent quartiles from MALD1 session data. The vertical dashed line marks the value of θ lb selected as optimal in the present simulations.
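As a concrete illustration, the decision rule just described can be sketched as follows. This is a minimal sketch: the activation values would come from DIANA’s decoding, and the function name and layout are ours, not DIANA’s actual interface.

```python
# Minimal sketch of the lexical decision rule from Simulation 2.
# Activations are assumed to be (negative) log scores produced by the
# decoder; only their difference matters for the decision.

THETA_LB = 70  # threshold value selected in the present simulations

def lexical_decision(word_activation, free_phone_activation,
                     theta_lb=THETA_LB):
    """Respond "word" when the best free phone string does not outscore
    the best word candidate by more than theta_lb."""
    difference = free_phone_activation - word_activation
    return "word" if difference < theta_lb else "pseudoword"

# Example: an item whose best word candidate nearly matches the free
# phone decoding is accepted as a word (difference 17 < 70).
print(lexical_decision(word_activation=-500.0, free_phone_activation=-483.0))
```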
Cross-tabulation of lexical decision and winner selection accuracy at word offset showed that 78.46% of word items were both selected as a word by the lexical decision process and had the target, correct word win at signal offset. In 10.57% of the cases, the correct word was the winner based on word activation, but the word activation was still smaller than the free phone activation by more than 70, so these signals were incorrectly flagged as pseudowords. In 9.45% of the cases, the signal was accurately selected as a word (the difference between word and free phone activation was less than 70), but the wrong word had the highest activation at word offset. Only 1.52% of word recordings were both mistakenly marked as pseudowords and had the wrong word with the highest activation at word offset. Words tend to be misinterpreted as another word rather than a pseudoword if they are shorter in duration (Welch’s unequal variances t test).
As noted in Simulation 1, longer recordings reach lower negative activation values. One concern we had is whether higher differences between word activation and free phone activation would simply be a product of longer signals offering greater opportunity for mismatch between the two. Figure 6 shows that this is indeed the case, but mostly for pseudowords.

Figure 6. The relationship between the number of frames (duration) of a stimulus and the difference between word and free phone activation for that stimulus, presented separately for words and pseudowords. The black dashed line marks the selected difference value of θ lb = 70. The eventual score a hypothesis (phone string) receives is based on the summation of local scores associated with individual MFCC frames—in principle, the longer the stimulus, the larger the deviations between the scores of competing candidates.
We do not see such a strong correlation between duration and accuracy in MALD1 participants: there is no correlation between word duration and the proportion of correct responses to that word.
4.3 Discussion
The goal of Simulation 2 was to test DIANA’s approach to modeling lexical decision, that is, the word/pseudoword decision all human participants make in an auditory lexical decision experiment. Specifically, we wanted to establish the best value of the threshold θ lb, which determines whether a stimulus will be recognized as a word by DIANA. We found that the approach can distinguish between the two types of stimuli fairly well, although the model in our current setup does perform somewhat worse than an average MALD1 participant. It is important to note that DIANA’s response accuracy could be increased by selecting an unrealistic response strategy—in our case, by increasing the number of “word” responses the model makes. However, this is a poor approach if the goal of the simulation is matching participant behavior. The goal of cognitive simulation is to explain a process, such as spoken word recognition in humans, using plausible solutions, not to maximize model performance.
Where words are concerned, although we changed the lexicon of candidates, tailoring it for every target stimulus, word recognition accuracy remained as high as in Simulation 1. However, in a portion of cases, a pseudoword has higher activation than the target word. In addition, among correct lexical decisions, there are cases in which the wrong word had the highest activation. Both of these kinds of errors in DIANA’s word recognition stem from the same root cause—relatively low activation of the target word.
The reasons for making mistakes are only partly shared between DIANA and human participants. Both can “mishear” the signal, taking a pseudoword for a word, a word for a pseudoword, or mistaking the word for some other word. However, participants also make mistakes because they do not know a word, whereas DIANA has all the MALD/CMU words stored in its lexicon. In addition, a human participant can simply lose attention and press the wrong button (Prins, 2012; ten Bosch et al., 2019), whereas DIANA always performs on the same level. In the current simulation, DIANA’s performance fully depends on the quality of the AMs, the characteristics of the incoming novel acoustic signal, and the available competitors in the mental lexicon.
In the case of pseudowords, we note a trend in which longer pseudowords are more accurately categorized by DIANA. We explain this finding in terms of cumulative activation and lexicon structure. There are more opportunities for longer pseudowords to mismatch with an existing word. In addition, the number of plausible word candidates is smaller for longer pseudowords, and with it the odds of the pseudoword signal being mistaken for an existing word. Still, due to imperfect AMs, we see that certain short pseudowords are mistaken for words using the threshold θ lb = 70.
Although a similar relationship between pseudoword duration and accuracy exists in the MALD1 data, it is much less pronounced. But why do MALD1 participants not benefit (as much) from more opportunities for mismatch and fewer plausible word candidates when listening to longer pseudowords? We argue that, unlike our current DIANA setup, MALD1 participants are aware of the morphological and even semantic characteristics of pseudowords, making certain long pseudowords more word-like to a human listener. MALD pseudowords were created from actual words of English by replacing a third of their subsyllabic constituents with another phonotactically licit and probable segment, yielding pseudowords with some apparent morphological complexity (Tucker et al., 2019). One example is the pseudoword /εnspeɪzd/, which was correctly classified in only 36% of occurrences in MALD1 sessions. Morphologically, this pseudoword may resemble a combination of en plus spaced, as in, for example, encircled (although we do note that the final sounds are voiced in the pseudoword, unlike in the word spaced /speɪst/). Another example is the pseudoword /trænzvɑɹmɪŋ/. Although there are differences in comparison to existing words, such as transforming, or a potential “word” transwarming, the prefix trans and the suffix ing in conjunction with a central part that sounds like existing words are likely the reason why this pseudoword was correctly responded to in only 27% of its MALD1 trials. DIANA is not sensitive to this kind of similarity: the top word competitor to /εnspeɪzd/ is the word inspires, while the top competitor for the pseudoword /trænzvɑɹmɪŋ/ is tensiometer. Furthermore, processing written pseudowords is not free of item-level frequency or semantic effects, as pseudowords do not necessarily have a frequency of 0 and as form-meaning patterns learned from words can extend to pseudoword processing to yield pseudoword positions in a semantic space (Cassani et al., 2020; Hendrix & Sun, 2021).
5 Simulation 3—response latency
The goal of Simulation 3 was to test how well DIANA’s estimates of when a word is recognized match general tendencies in participant response latency from MALD1 data.
5.1 Simulation setup
Statistical modeling of participant response latencies collected in auditory lexical decision tasks ordinarily considers only trials in which the participant responded correctly. Our approach to estimating response latency using DIANA was the same: we only considered the words that were correctly classified as words in Simulation 2 for response latency estimation in Simulation 3. However, when a participant makes the correct lexical decision to a word in a behavioral task, we cannot know whether they actually accessed the correct, target word or some other word. DIANA, in turn, outputs information about the winning word. The researcher needs to decide whether words that were correctly classified but incorrectly recognized will be included in the simulation.
We conducted separate simulations and considered both possibilities. The results were not qualitatively different in any way. The results of the simulation using all the words that were correctly classified as words in Simulation 2 (even if the wrong word won) are presented in the supplementary material. In the remainder of the text we present the results of the simulation in which we only considered the words that were both correctly recognized at word offset and treated as words (not pseudowords) by DIANA in Simulation 2. When the simulation results were merged with MALD1 data, this number was further reduced by approximately 100 words, to a final total of 11,465 words, due to missing or skewed data and technical issues in item matching. We used the same lexicons of competitors for these target words as in Simulation 2. However, in Simulation 3 we calculated word activation using a gating procedure. We split all word recordings into 20 ms frames. Model estimates were made upon the addition of every new frame. Since the process is computationally demanding and since the initial stages of word competition are uninformative, we only observed the activation of the top 20 candidates in the last 300 ms of the sound signal. In effect, the gating procedure allows us to estimate candidate activation and observe the activation–competition process as the signal unfolds. In addition, DIANA’s decision component can make a decision at every selected point in time during the signal presentation.
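For concreteness, the gating loop can be sketched as follows. This is a minimal illustration rather than DIANA’s actual interface: we assume a hypothetical decode(ms) function that returns candidate activations given the first ms milliseconds of the signal.

```python
# Sketch of the gating procedure used in Simulation 3, assuming a
# hypothetical decode(t) that returns {word: activation} after t ms.

FRAME_MS = 20          # gating step used in the simulation
TOP_N = 20             # number of candidates tracked at each step
FINAL_WINDOW_MS = 300  # only the final 300 ms of the signal are observed

def gate(signal_ms, decode):
    """Yield (time in ms, top TOP_N candidates) at every gating step
    within the final 300 ms of a recording of length signal_ms."""
    start = max(0, signal_ms - FINAL_WINDOW_MS)
    for t in range(start + FRAME_MS, signal_ms + 1, FRAME_MS):
        activations = decode(t)  # candidate activations after t ms
        top = sorted(activations.items(), key=lambda kv: kv[1],
                     reverse=True)[:TOP_N]
        yield t, top
```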
The activation at the final phase of the gating procedure (word offset) is identical to the activation used in the lexical decision simulation from Simulation 2. We had already determined the value of the lexical decision threshold θ lb based on the difference in free phone and word activation when the entirety of the signal was available to the model. The majority of responses in auditory lexical decision experiments are made after signal offset, and our reasoning was that one viable strategy for the listener would be to make the best possible decision when all of the information is available. In addition, varying all parameters in DIANA at the same time would create too many combinations for feasible computation and analysis of results, so we determined θ lb independently from the remaining parameters and thresholds.
We followed similar reasoning when determining plausible values for parameter β and for parameter γ, which weighs the contribution of word frequency to word activation. We assessed which values of each parameter lead to plausible model behavior, again using MALD1 responses as a benchmark.
The decision of which word is the winning candidate in DIANA is regulated by a threshold θ wc determining the required difference in activation between the leading candidate and the runner-up. Since there are many heterographic homophones in the dictionary that will have identical activation (e.g., tails and tales), we only considered non-homophone competitors when we determined the difference between the leading candidate and the runner-up. We calculated this difference at every step in the gating procedure. When determining the range of acceptable values for threshold θ wc , we again used MALD1 responses as a benchmark. Increasing θ wc increases the required difference between the top candidate and the runner-up for a winner to be selected, and therefore increases the number of word signals which do not have a clear winner before word offset. A very low value of θ wc will in turn yield many winners before word offset—which can also lead to many wrong competitors being selected as winners based on early activation. We decided to adjust the value of θ wc so that the percent of words that win before word offset is roughly equal to the percent of word responses that happen before word offset in MALD1 data. When determining this percent for MALD1, we added 200 ms to word duration to take into account the time required to execute the response, as assumed by DIANA.
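As an illustration of this selection rule, consider the following minimal sketch. The (word, pronunciation, activation) layout and all names are ours, not DIANA’s internal representation; homophones are detected by identical pronunciations.

```python
# Sketch of the winner-selection rule with threshold theta_wc. Each
# candidate is a hypothetical (word, pronunciation, activation) tuple;
# heterographic homophones share a pronunciation and hence an activation.

def select_winner(candidates, theta_wc):
    """Return the leading word if it exceeds the best non-homophone
    runner-up by at least theta_wc in activation, else None."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    leader = ranked[0]
    # Ignore homophones of the leader (e.g., tails/tales), which would
    # otherwise make the required activation margin unattainable.
    runners = [c for c in ranked[1:] if c[1] != leader[1]]
    if not runners or leader[2] - runners[0][2] >= theta_wc:
        return leader[0]
    return None  # no clear winner at this gating step yet

# Example: "tales" wins because it leads "pales" by more than theta_wc.
step = [("tales", "t eɪ l z", -210.0), ("tails", "t eɪ l z", -210.0),
        ("pales", "p eɪ l z", -390.0)]
print(select_winner(step, theta_wc=150))  # -> "tales"
```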
In this simulation, we only selected words that were correctly recognized at signal offset in Simulation 2 (but see the supplementary material for the alternative option). However, a wrong word may be the leading candidate prior to signal offset, especially considering that top-down information now affected competitor activation. Therefore, we also tested which word is the leading candidate at the time frame when the winner is selected.
When a winner is selected prior to word offset, DIANA takes the time at which it was selected and adds the aforementioned 200 ms for execution. When the required difference between the top candidate and the runner-up (controlled by threshold θ wc) is not attained at stimulus offset, a controllable parameter β is used to estimate the added time for the final winner decision. The time needed to decide on the final winner depends on the number of remaining plausible competitors, that is, all the words with an activation difference of less than θ wc from the top candidate. However, when simulating the lexical decision task, DIANA assumes that the listener is at this stage also considering viable phone strings which are not present in the mental lexicon. In other words, pseudowords are also competing with real words, increasing the perplexity of the decision at signal offset. Unlike for highly activated word competitors, we cannot obtain the activation values for all potential pseudowords. The number of pseudoword competitors at word offset is approximated by raising 3 to the power of the number of phones in the target word. The formula for estimating choice reaction time then follows the Hick–Hyman law (Hick, 1952; Hyman, 1953) by calculating the logarithm of the total number of remaining word and pseudoword competitors, weighted by parameter β (Equation 2). Choice reaction time is finally added to the total duration of the signal, in addition to the 200 ms required for execution.
We note that this formula is a rough estimate in itself, one that also assumes these non-word competitors are still plausible at stimulus offset. Also note that the plausible pseudoword competitors retained using this formula would likely far outnumber the remaining word competitors, especially for longer target words and if phoneme recognition is precise enough that most word competitors are excluded as implausible. However, the formula is based on a deeper conceptual idea about the (virtual) role of pseudowords during decoding. DIANA is based on the idea that the total reaction time that is the empirical outcome of a lexical decision experiment is composed of three parts: (1) the time it takes for the stimulus to unfold and to build activations, (2) the time it takes to make a decision based on these activations, and (3) some additional time (here assumed constant) for the signal to travel from the motor cortex to the finger muscles along the neural pathway. The contribution from (1) is dependent on the stimulus, and primarily on stimulus duration. The contribution from (2) is directly related to Hick’s law, which says that the time needed to make a decision grows with the logarithm of the number of response alternatives; in DIANA’s case, on the order of β · log(N + 1), where N is the total number of remaining word and pseudoword competitors.
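Putting these pieces together, the latency estimate can be sketched as follows. The 200 ms execution constant and the 3-to-the-power-of-phones approximation come from the text above; the base-2 logarithm and all function and variable names are our assumptions.

```python
import math

# Sketch of DIANA's response latency estimate as described above. The
# log base is our assumption; in practice it is absorbed into beta.

EXECUTION_MS = 200  # assumed constant motor execution time

def estimated_rt_ms(signal_ms, winner_time_ms, n_word_competitors,
                    n_phones, beta):
    """Return the estimated response latency in milliseconds."""
    if winner_time_ms is not None:
        # A winner was selected before word offset:
        # selection time plus execution time.
        return winner_time_ms + EXECUTION_MS
    # Otherwise, add Hick-Hyman choice time over the remaining word
    # competitors plus the approximated pseudoword competitors.
    n_pseudo = 3 ** n_phones
    choice_ms = beta * math.log2(n_word_competitors + n_pseudo + 1)
    return signal_ms + choice_ms + EXECUTION_MS
```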
With acceptable ranges for parameters γ and β and threshold θ wc established in this manner, we turn to the results of the simulation.
5.2 Results
We first tested how word recognition accuracy at signal offset changes when top-down frequency effects are introduced to the model. We tested a range of values of the frequency parameter γ.
We then assessed plausible ranges for threshold θ wc by comparing the percent of decisions made before word offset in DIANA and in MALD1. When making this comparison, we decided to take into account only the percent of correct responses made before word offset per MALD1 session (as a reminder, a MALD1 experimental session includes a single participant completing a single 800-stimulus MALD1 list). This was done for two reasons. First, the correlation between the percent of responses made before word offset (with 200 ms added for execution time) and the percent of correct lexical decisions made to word stimuli was very low in MALD1 sessions.
Figure 7 shows how the number of winner selections that happen before word offset decreases as the required difference in activation between the top competitor and the runner-up (θ wc) increases. This relationship is nearly identical for all considered levels of γ.

Figure 7. DIANA’s percent of decisions made prior to word offset as a function of threshold θ wc. Separate lines are drawn for different values of parameter γ.
However, making a decision prior to word offset also introduces the risk of choosing the wrong word as the winner: at some point during the activation–competition process, a candidate may become highly activated and win, even though the remainder of the signal would reduce its activation. (Remember that we previously excluded all the words that were incorrectly recognized in Simulation 2, so all words are correctly recognized at word offset.) Therefore, we tested how accuracy in selecting the right word as the winner changes as a function of θ wc. Figure 8 shows that the number of wrong selections for responses prior to word offset decreases as θ wc increases. When the model is more conservative in selecting the winning word and fewer words are recognized before word offset, there is less of a chance that the wrong word will be selected as the winner. Frequency again plays only a minor role when the value of θ wc is high. However, for θ wc of 150, the percent of correctly recognized words can be as low as 77.78%, depending on the value of γ.

Figure 8. DIANA’s percent of accurate responses prior to word offset as a function of threshold θ wc. Separate lines are drawn for different values of parameter γ.
We considered θ wc values between 140 and 240 in steps of 20 for estimating response latency in DIANA. We excluded very early DIANA estimates of a word winning (before 420 ms have passed, including the 200 ms for response execution) as unrealistic. We also excluded words that were RT outliers in MALD1 data. As stated above, the final number of words used to correlate DIANA’s response latency estimates with response latency from MALD1 data was 11,465. DIANA estimates of when the target word should be selected were compared to de-trended MALD1 data from Sessions 31 to 67.
To adjust plausible levels of β, we assessed the final response latency estimate in milliseconds provided by DIANA and observed whether this duration fits within the general time frame of responses in MALD1. Figure 9 shows how the average response latency estimate by DIANA changes as a function of β (ranging from 0 to 100 in steps of 10) for different levels of γ.

Figure 9. DIANA’s estimated RT as a function of parameter β for different values of γ.
However, for the range of β values between 25 and 60, we noticed that the correlation between DIANA’s logged estimated RT and participant de-trended RT (dRT) decreases as β increases. We decided to use a broader range of β values to better explore this trend. Figure 10 represents the change in the correlation between DIANA and MALD1 response latency for values of β ranging from −50 to 100 in steps of 10. Different lines stand for different values of γ.

Figure 10. Correlation between DIANA estimates and MALD1 dRT as a function of parameter β for different values of γ.
5.3 Discussion
The goal of Simulation 3 was to simulate participant response latency when responding to words in the auditory lexical decision task. We used words that were correctly classified as words and correctly recognized in Simulation 2 (although note that the results do not change if all correctly classified words are used; see the supplementary material) and adjusted the values of parameters γ and β and threshold θ wc to calculate DIANA estimates of participant response latency. Crucially, we selected parameter and threshold values that lead to plausible model behavior by comparing model performance to human performance.
Increasing γ increases the impact of word frequency on word activation. In our simulations, however, varying γ had only a minor effect on model performance, particularly at higher values of θ wc.
Varying the threshold θ wc leads to conceptually similar results. A lower threshold leads to an increased percent of wrong word selections prior to word offset, where some of the decisions are made implausibly early. In addition, all correlations of MALD1 participant response latency with DIANA estimates favored the highest θ wc we used (240), which indicates that better matching with participant RT is achieved if more decisions are made after the entirety of the word signal has been presented to the model.
DIANA also includes the parameter β, which weighs the formula accounting for choice reaction time—once the signal has reached its end, the model calculates the added time needed to make a decision between the remaining plausible candidates. DIANA assumes that in the lexical decision task (but not in the word repetition task) the set of plausible candidates at word offset includes the remaining word competitors whose activation is within the threshold θ wc of the activation of the leading candidate. The correlations obtained with MALD1 response latency were moderate.
Simulating MALD1 response latency data shows a shortcoming of DIANA in the sense that the transformation from choice entropy to choice RT is not precise enough. This could indicate either that the Hick–Hyman law is not applicable in its full form, or that the computation of the entropy is not precise enough—for example, due to the quite rough estimation of the number of pseudoword competitors at stimulus offset. We offer a more thorough discussion of the theoretical implications of these findings in the following section.
6 General discussion
In this study, we used DIANA (ten Bosch, Boves, & Ernestus, 2015) to simulate participant performance in an auditory lexical decision task. In three simulations, we (1) created new AMs for western Canadian English, (2) simulated the lexical decision, and (3) correlated DIANA’s estimates of when the winning word is selected with general tendencies in participant responses from the MALD project (Tucker et al., 2019). The results of these simulations can be used to guide future development of models of spoken word recognition and inform the theory regarding the process of spoken word recognition.
6.1 Model input
In Simulation 1, we show that setting up DIANA in a new language is possible even without existing AMs: we used our own, relatively small, in-house spontaneous speech corpora to make new AMs. This process is labor-intensive, as it requires recording and annotating a speech corpus, training AMs, and recording enough additional material by the speaker whose recordings are used in experiments to adapt these AMs. Although possible, it would be time-consuming for an independent researcher to take DIANA as an off-the-shelf model even with existing AMs, given that speaker adaptation must be performed regardless. We provide the AMs we developed and adapted for the MALD speaker as part of our supplementary materials. These adapted AMs should allow researchers to perform DIANA simulations using MALD recordings as material.
DIANA is not alone with regard to model setup complexity. SpeM and Fine-Tracker (Scharenborg, 2008; Scharenborg et al., 2005) require similar preparatory work. Shortlist B (Norris & McQueen, 2008) depends on a large database of listener responses to gated diphones, which is likely the reason this model has only been implemented in Dutch. For comparison, using instantiations of the TRACE model, jTRACE (Strauss et al., 2007) and the TISK model (You & Magnuson, 2018), requires installations that can be completed in a matter of hours. However, the additional work yields a crucial advantage: DIANA deals with actual acoustic input (but see Norris & McQueen, 2008, for criticism of spoken word recognition models based on automatic speech recognition).
One consequence of a good representation of the variability in the acoustic signal is that DIANA performs well in free word recognition. Accuracy in selecting the correct word as the winner from a corpus of approximately 26,000 words was between 85% and 90%. This level of word recognition accuracy is much higher than those we obtained using TISK, where lexicons with close competition never yielded word recognition accuracy higher than approximately 30% (Nenadić & Tucker, 2020, but note that the competitor structure was different in TISK simulations). DIANA also exceeds the accuracy recorded in the initial tests of the EARSHOT model, which was 67% when novel items from trained-on speakers were presented to the model (Magnuson et al., 2020). Shortlist and SpeM never exceeded 75% recognition accuracy in free word recognition, although it is important to note that these values were obtained for spontaneous speech recordings and multiple speakers (see Scharenborg et al., 2005). At the same time, DIANA was tested using a larger lexicon: Shortlist and SpeM were tested on 10,509 utterances representing 2,398 unique lexicon entries, while EARSHOT used a lexicon of 1,000 words. Word recognition accuracy using the discriminative lexicon approach reached up to 25%, although that simulation analyzed word recordings isolated from spontaneous speech and human participants generally did not perform better on the same material (Arnold et al., 2017). Finally, our AMs work on par with the AMs based on the FAVE suite (Rosenfelder et al., 2014). Even higher accuracy may be obtained with improved base AMs or extended model adaptation. A high standard of model performance in terms of input (free word) recognition is crucial for simulations that involve large word sets—that is, for any simulation that aims to be more than a proof of concept using a toy example.
Another important advantage of using the acoustic signal as input is that candidate activation is dependent on the characteristics of the sound signal, not on preconceptions about which words should sound similar (see, for example, Hawkins, 2003, for an extended discussion). For example, TRACE (McClelland & Elman, 1986b; Strauss et al., 2007) relies on acoustic pseudofeatures to determine phoneme identity. A phoneme always has the same pseudofeature values, meaning that every occurrence of a phoneme is always the same (barring some pseudofeature overlap of neighboring phonemes that accounts for coarticulation; see also Salverda et al., 2014). DIANA, in turn, can analyze any number of unique recordings of the same word, each time generating a different activation–competition pattern. Besides providing much better estimates of which words the signal actually resembles most, this allows researchers to explore and simulate phenomena that were not part of our simulation, such as subphonemic, acoustic effects (e.g., Andruski et al., 1994; Marslen-Wilson & Warren, 1994), effects of prosody (e.g., Kemps et al., 2005; Salverda et al., 2003), or effects related to processing reduced variants of a word (Dilley & Pitt, 2010; Ernestus & Baayen, 2007; Ernestus & Warner, 2011; Tucker, 2011; Tucker & Ernestus, 2016). Although DIANA’s AMs are currently necessarily adapted for a particular speaker, models that use actual acoustic input have the potential of also probing inter-speaker or inter-group differences (see, for example, Kleinschmidt & Jaeger, 2015; Liu & Jaeger, 2018; McMurray & Jongman, 2011; Theodore & Monto, 2019; Theodore et al., 2020; Xie & Jaeger, 2020), as was recently attempted with the EARSHOT model (Magnuson et al., 2020).
6.2 Competitor selection
We noticed that even toward the end of a signal, many competitors had high activation despite an initial mismatch with the target word. For example, pales and hails are the highest activated competitors for the recording of the word tales. This model performance is in contrast to the Cohort-like competition endorsed by DIANA, but is in line with human performance. Listener flexibility in analyzing an ambiguous initial part of a signal so that it matches its later parts is the basis of the well-established Ganong (1980) effect. The fact that listeners maintain more than just categorical information from the preceding signal was also shown in various studies utilizing the visual world paradigm (see, for example, Allopenna et al., 1998). Although we deal with isolated word recognition in our simulation, it is worth mentioning that these effects apparently extend even beyond word boundaries (see, for example, Burchill et al., 2018; Connine et al., 1991; Falandays et al., 2020; Szostak & Pitt, 2013). A recent experiment using magnetoencephalography also supports the claim that subsequent contextual information influences the perception of preceding segments, as subphonemic detail is preserved in the auditory cortex and reanalyzed as additional signal becomes available (Gwilliams et al., 2018).
Models of spoken word recognition in general attempt to include this kind of flexibility in word recognition and not discard a candidate based on differences in early phonemes the way the original Cohort model did (Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978). Cohort II (Marslen-Wilson, 1987; Marslen-Wilson et al., 1988) was specifically adapted to be able to recognize the correct word despite initial mismatch (see also Weber & Scharenborg, 2012). TRACE (McClelland & Elman, 1986b) also retains candidates even if there is some initial mismatch: besides cohorts (words sharing the first two phonemes with the target word), the model also considers rhymes (words sharing the last two phonemes with the target word) and embeddings (words that are fully embedded in the target word) to be close competitors to a target word. The authors of Shortlist B (Norris & McQueen, 2008) also make a point of that model’s ability to successfully recognize words despite some (initial) mismatch with the target.
Although DIANA’s current implementation allows the creation of lexicons including as many as 36,000 words, it still cannot deal with lexicons the size of the entire CMU dictionary (approximately 135,000 words). Therefore, the initial set of plausible competitors needed to be decided by the researcher. DIANA conceptually endorses a Cohort-like competition, so we created separate lexicons in Simulations 2 and 3 to include all words with up to three phones and all words that share the first three phones with the target word. The assumption behind this procedure was that DIANA should resolve initial inconsistencies for longer words and that only the closest of competitors will matter toward word offset. However, since the AMs and the enunciations of every segment in MALD recordings are not perfect, we saw in Simulation 1 that DIANA made mistakes in recognizing the correct word due to, among other things, initial stop elision. The approach currently endorsed by DIANA, which we used to preselect competitors, thus seems faulty, as it disqualifies words that could have won instead of the target word. In addition, even if the model successfully resolves initial inconsistencies, that does not mean that competitors with initial mismatch are not some of the closest competitors to the target word. By pre-excluding competitors based on initial phone mismatch, we inevitably affected both the potential winner of the activation–competition process and the structure of close competitors (which may be relevant for response latency estimation).
How, then, should we select the competitors to include in the limited lexicon created for every target word? It is important to note that competitor selection depends on the task the model of spoken word recognition is simulating. An example is the visual world paradigm, where limiting the lexicon to include only the competitors that are visually presented on the screen may be justified (see, for example, Allopenna et al., 1998), although Dahan et al. (2001) show evidence that a close competitor to the target word may be activated and compete even if it is not visually presented. In many other tasks—such as the word repetition or the auditory lexical decision task with unrestricted stimulus selection—the constraints on competitor selection are even laxer, as any word in the lexicon could be activated if a fitting acoustic signal is presented. If we continue to think of close competitors to words in terms of the phonemes they share, using the competitor selection criteria from TRACE (McClelland & Elman, 1986b) seems like a better approach. Note that the criteria from TRACE encompass word neighbors from the Neighborhood Activation Model (Luce & Pisoni, 1998) and word cohorts from the Cohort model (Marslen-Wilson & Welsh, 1978).
Still, it is possible (although not too probable) for a word competitor to be highly activated and not belong to any of these three groups of TRACE competitors, especially prior to word offset. This issue may be solved through brute force, that is, by the sheer size of the lexicon that the current implementation of DIANA can handle. The number of TRACE close competitors extracted from the CMU dictionary for 442 English words ranges from 17 to 2,243, with an average of 605 close competitors (Nenadić & Tucker, 2020). DIANA, in turn, can handle quite sizable lexicons. Therefore, we propose using the competitor selection approach from TRACE while also capitalizing on DIANA’s capacity for large lexicons by selecting the 30,000 words that have the lowest phone edit distances from the target word, as sketched below. This approach is yet to be tested, but all cohorts, rhymes, and embeddings should be present in these 30,000 selected competitors—in fact, most of the words that are not within the 30,000 most similar competitors to the target word (based on phonemic transcription) are likely not very similar to the target word at all. One downside of this approach is that it is more computationally demanding to use large lexicons of 30,000 words than to build smaller, targeted lexicons.
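To make the proposal concrete, the following sketch ranks a lexicon by phone edit distance to the target and keeps the closest entries. It is a minimal illustration under our own assumptions: transcriptions are sequences of phone symbols, the lexicon is a plain dictionary, and the distance is ordinary Levenshtein distance over phones.

```python
# Sketch of the proposed competitor preselection: keep the k lexicon
# entries with the smallest phone edit distance to the target word.
# The lexicon layout is illustrative, not the CMU dictionary format.

def phone_edit_distance(a, b):
    """Plain Levenshtein distance over phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution
        prev = cur
    return prev[-1]

def preselect(target_phones, lexicon, k=30000):
    """Return the k words closest to the target in phone edit distance."""
    ranked = sorted(lexicon.items(),
                    key=lambda kv: phone_edit_distance(target_phones, kv[1]))
    return [word for word, _ in ranked[:k]]

# Toy example with a three-word lexicon.
lexicon = {"tales": ("t", "eɪ", "l", "z"),
           "pales": ("p", "eɪ", "l", "z"),
           "tensiometer": ("t", "ε", "n", "s", "i", "ɑ", "m", "ə", "t", "ɚ")}
print(preselect(("t", "eɪ", "l", "z"), lexicon, k=2))  # ['tales', 'pales']
```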
However, relying on categories such as the phoneme, as we have previously noted, misses much of the variability present in fine phonetic detail and stemming from reduction or other pronunciation variants of words. The close competitors could instead be determined using similarity in the acoustic signal, rather than generalized categories. For example, with our MALD word set, we could calculate acoustic distances between word recordings and use those to form sets of competitors for each word (Kelley, 2018). An even better alternative would be to calculate acoustic distances between many recordings of many words—although this would require a very large (truly representative) set of word recordings—and use those as a benchmark.
We also note that the necessity of preselecting competitors in a model of spoken word recognition has been at least as much a question of its technical implementation as it has been of its theory. If a model always considers all the options stored in the mental lexicon, there would be no need to discard a candidate before the activation–competition process has even started. Our implementation of DIANA relied on the HTK (Young et al., 2006) and the lexicon limitations we had were a matter of technique: the model could be implemented using, for example, KALDI (Povey et al., 2011), allowing for better performance and a much more fine-grained view of unfolding activations. The lexicon size could also be dramatically increased to hundreds of thousands of words, removing the issue of candidate preselection. Technical limitations and novel advances will certainly continue to shape models of spoken word recognition and in part determine which questions regarding their architecture are considered relevant.
6.3 Lexical decision
DIANA’s lexical decision accuracy was fairly high. The model uses a simple but powerful solution of comparing the best possible activation of a word in the mental lexicon with the best possible activation of any phone string. Note, however, that DIANA and the human participant have different causes of errors, both in free word recognition and lexical decision. For the computational model, the only cause of error is a poor match between the acoustic signal and the existing AMs, leading to a misinterpretation of the input. Listener errors may have other causes besides issues in interpreting the acoustic signal. For example, a human may not have the target word stored in their mental lexicon (i.e., the person may not know a word), may not be able to retrieve the target word at that particular time, may miss portions of the signal, or simply press the wrong button.
We note two specific issues in the current approach. The first issue is that pseudoword accuracy highly depends on signal length. We will address this finding in more detail below, when we consider the representation of the mental lexicon in DIANA. The second issue is that word frequency does not affect the outcome of the lexical decision, while MALD1 and other lexical decision data generally show that word frequency predicts response accuracy. As we said above, some of the correlation between accuracy and word frequency in behavioral experiments is certainly due to the fact that lower frequency words are known by fewer participants. In addition, perhaps signals of low frequency words require a higher threshold of attention due to less practice with that signal; it is easier to get confused and make a mistake for a word one does know if that word is encountered rarely. Given high-performing AMs, future simulations could include a parameter that would estimate the probability of a word being responded to as a pseudoword based on that word’s frequency (or other characteristics that prove relevant).
6.4 Estimating response latency
The central aim of our simulations was to simulate the time needed to make a response from the onset of the signal. The effect of word frequency, regularly registered in statistical analyses of behavioral responses, was found to be almost negligible. It is important to note that the current implementation of the frequency effect in DIANA is not as straightforward as it may appear. In statistical modeling of auditory lexical decision data, word frequency is ordinarily included as a predictor of response latency to that word. In DIANA, the impact of frequency is instead best described as an interaction between a word’s frequency and the frequency of its close acoustic competitors (cf. Vitevitch & Luce, 2016). If a high-frequency target word has a high-frequency runner-up, then the activation difference between the two will remain dependent on acoustic activation alone and the winner may be selected rather late. In contrast, a high-frequency target word that has no high-frequency competitors will become the sole plausible candidate much sooner. Statistical analyses of participant responses should investigate whether this sort of frequency relationship between top acoustic competitors is a better predictor of human response latency than the frequency of the target word alone.
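As a sketch of what such a predictor might look like (entirely our illustration, not part of DIANA), one could compute the log-frequency advantage of the target word over its top acoustic competitor:

```python
import math

# Hypothetical predictor suggested above: the log-frequency advantage of
# the target over its top acoustic competitor. Near zero when both are
# similarly frequent; large when the target dominates its competition.

def frequency_advantage(target_freq, runner_up_freq):
    """Both arguments are raw corpus counts; the +1 avoids log(0)."""
    return math.log(target_freq + 1) - math.log(runner_up_freq + 1)

# A frequent target with a rare runner-up vs. a frequent runner-up.
print(frequency_advantage(5000, 10))    # large advantage: early winner
print(frequency_advantage(5000, 4800))  # near zero: late winner
```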
Another reason for the low impact of word frequency is that the model estimates it is better to wait until word offset, as higher values of θ wc are favored (i.e., a larger difference in activation between the leading candidate and the runner-up is required for a winner to be selected), leaving word frequency little opportunity to influence the timing of the decision before offset.
In line with participant data, we found that optimal values of parameter β lie toward the lower end of the range we tested.
6.5 Future directions and concluding remarks
Besides the above-mentioned adaptations that may be performed on DIANA’s auditory lexical decision task architecture, we discuss two more important aspects of models of spoken word recognition that could be part of future DIANA developments. First, the current representation of the mental lexicon in DIANA ignores a very important characteristic of words—their meaning and the contexts in which they are used. This ties directly to our interpretation of how the mental lexicon is organized and accessed. Currently, DIANA represents the mental lexicon as a list of unconnected strings of phones (words), focusing on form alone. Under this setup, recognizing a word is in no way affected by the word’s meaning beyond its frequency of occurrence, which is the case in most (abstract) models of spoken word recognition.
However, effects of word meaning in spoken word recognition, even when words are presented in isolation, extend beyond frequency of occurrence (e.g., Goh et al., 2016; Sajin & Connine, 2014; Tucker et al., 2019). In addition, we have seen in Simulation 2 that a representation of the mental lexicon that stores information on form and frequency alone leads to lexical decisions to pseudowords being mostly guided by direct acoustic mismatch, making long pseudowords very easy to discard for DIANA. Human participants, however, do not have this sort of certainty when responding to long pseudowords, given that these pseudowords share, for example, morphological characteristics with existing English words (see also Hendrix & Sun, 2021, for a detailed discussion of factors affecting pseudoword processing in the visual modality).
Therefore, we argue that not just DIANA, but any model of spoken word recognition would benefit from a representation of the mental lexicon that does not consider word form (and frequency) only. This is not a novel notion. The authors of both Cohort (Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Welsh, 1978) and TRACE (McClelland & Elman, 1986a, 1986b), for example, recognize the impact semantic or contextual factors can have on processing (but see also Gaskell & Marslen-Wilson, 1997, 2002). However, primarily due to technical limitations, such factors remained underdeveloped. Many of these technical limitations are now alleviated and certain models of spoken word recognition attempt to expand on the representation of the mental lexicon. The Distributed Cohort Model (Gaskell & Marslen-Wilson, 1997, 1999, 2002), EARSHOT (Magnuson et al., 2020), and the discriminative lexicon approach (Baayen et al., 2019) represent units in the mental lexicon as semantic vectors. These vectors are correlated, creating a network of word meaning. Similar solutions could be implemented in DIANA in the future, by altering DIANA’s lexicon and the information it holds or even by grafting a solution from another model to DIANA’s existing task architecture.
The second development direction we discuss is estimating when the model (and therefore the human listener) should make a “pseudoword” decision. Most models of spoken word recognition do not mention or simulate the process of responding to a signal that is not present in the mental lexicon, although this is an important situation that occurs even outside of the laboratory, for example, when the listener is presented with unknown or foreign words. We envision an approach similar to the one used to determine that the input is a word: a threshold that marks the difference between free phone activation and word activation needed to discard the input as a pseudoword. This threshold would be higher than the threshold used to determine that the signal is a word, creating a zone of uncertainty between the two thresholds in which DIANA would wait for more information. This waiting could continue until signal offset. At signal offset, an elegant solution would be to calculate the added choice reaction time using the same formula as for words. A minimal sketch of this regime is given below.
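The following sketch illustrates the envisioned two-threshold decision regime. Both thresholds and all names are hypothetical: theta_word plays the role of θ lb above, and theta_pseudo is the higher, yet-to-be-determined pseudoword threshold.

```python
# Sketch of the envisioned two-threshold regime. The zone between the
# two thresholds defers the decision until more signal is available.

def decide(word_activation, free_phone_activation,
           theta_word, theta_pseudo):
    """Return "word", "pseudoword", or "wait" (gather more signal)."""
    difference = free_phone_activation - word_activation
    if difference < theta_word:
        return "word"
    if difference > theta_pseudo:
        return "pseudoword"
    return "wait"  # zone of uncertainty: defer until more signal/offset

print(decide(-500.0, -490.0, theta_word=70, theta_pseudo=150))  # word
print(decide(-500.0, -300.0, theta_word=70, theta_pseudo=150))  # pseudoword
print(decide(-500.0, -400.0, theta_word=70, theta_pseudo=150))  # wait
```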
The value of this “pseudoword” threshold would likely have to be high. Otherwise, too many “pseudoword” decisions would occur before pseudoword signal offset, contradicting the behavioral data showing that responses to pseudowords tend to be slower than responses to words (Tucker et al., 2019). It seems that human listeners prefer to wait until more information is provided even in the cases when there may be an early point of disambiguation that should disqualify the signal as a word. We believe this is because they allow for a possibility that something was pronounced in a strange manner, that they misheard something, or that they are simply coming up short in their search through the mental lexicon and that they need a bit more time and information to find the word to which the signal corresponds. The more the pseudoword deviates from words stored in the mental lexicon, however, the bigger the chance that it would be discarded before signal offset, so an exception to this general rule could be very unlikely strings of segments and especially speech signals that break the phonotactic rules of the language—nonwords (cf. Ziegler et al., 1997)—though such stimuli do not occur in MALD and other standard auditory lexical decision studies.
Despite many challenges that the current implementation of DIANA faces, we believe that DIANA is very promising. DIANA successfully uses the acoustic signal as input and has no binding limitations in terms of the language it can be used for (as long as AMs exist or can be created) or the lexicon size that can be implemented. Furthermore, we hope that DIANA will be applied to other behavioral tasks. This would also show how robust the current task architecture actually is, that is, whether or not performance in these tasks can be described using the same processes as those used for auditory lexical decision and word repetition. The development of the field of spoken word recognition depends on its models being tested against various behavioral data and improved based on the findings. We argue that the primary frontier for current models of spoken word recognition is to simulate spoken word recognition phenomena using realistic conditions (e.g., realistic input and realistic competitor sets) and to be adaptable enough to simulate data from a plethora of different behavioral experiments used in the field.
Acknowledgements
The authors thank Terrance M. Nearey and Matthew C. Kelley for their help in shaping this study.
Authors’ note
Prior publication: An earlier version of some simulations in this paper was presented at Interspeech 2018 (DOI: 10.21437/Interspeech.2018-2081).
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was funded by the Social Sciences and Humanities Research Council of Canada: Grant #435-2014-0678.
Supplemental material
Supplemental material for this article is available online.