Abstract
Matrix sentence tests in noise can be challenging to the listener and time-consuming. A trade-off should be found between testing time, listener’s comfort and the precision of the results. Here, a novel test procedure based on an updated maximum likelihood method was developed and implemented in a German matrix sentence test. It determines the parameters of the psychometric function (threshold, slope, and lapse-rate) without constantly challenging the listener at the intelligibility threshold. A so-called “credible interval” was used as a mid-run estimate of reliability and can be used as a termination criterion for the test. The procedure was evaluated and compared to a STAIRCASE procedure in a study with 20 cochlear implant patients and 20 normal hearing participants. The proposed procedure offers comparable accuracy and reliability to the reference method, but with a lower listening effort, as rated by the listeners (
Introduction
Speech intelligibility tests in noise are essential instruments in the assessment and follow-up of patients treated with hearing aids and cochlear implants (CIs) (British Society of Audiology, 2019; Nilsson et al., 2011). Measuring a person’s speech reception threshold (SRT) often requires a trade-off between accuracy, test duration, and the listener’s effort. Longer test runs may provide more accurate results but are time-consuming and can be tiring for the participant. Moreover, the listening effort related to being tested near the individual listener’s intelligibility threshold may provoke stress (Zekveld et al., 2011; Mackersie & Cones, 2011).
Matrix sentence tests can be used to determine speech intelligibility in noise and offer a balance between accuracy, test duration and listening effort (Kollmeier et al., 2015). Hagerman (1982) describes how to form syntactically correct sentences out of a matrix of predefined words. These are then used to find a subject’s SRT (e.g., in noise). To this end, an adaptive procedure alters the signal-to-noise ratio (SNR) depending on the listener’s responses. Typically, the SNR is adjusted to reach 50% correct responses to determine the SRT in noise (i.e., the SNR that yields 50% intelligibility). Testing at this intelligibility level is efficient due to the steepness of the psychometric function (PF) (Green, 1990). However, many listeners feel uncomfortable when trying to repeat sentences they can barely understand. Furthermore, the SNRs corresponding to the SRTs may be lower than those encountered in everyday situations (Smeds et al., 2015; Wu et al., 2018) and may fall below the range of the SNRs where hearing aids’ nonlinear features perform optimally (Naylor, 2016).
The German matrix sentence test Oldenburger Satztest (OLSA) uses five-word sentences and requires 20–30 sentences to evaluate the SRT. To determine the SRT, it alters the SNR in a so-called STAIRCASE procedure. In this procedure, the SNR is adapted in predefined steps according to the listener’s responses (Wagener et al., 1999). While the procedure is simple and widely used, one of its limitations is that it is not suited to determining the slope of the underlying PF. Knowing not only the SRT but also the slope would facilitate the identification of points of interest on the PF (e.g., SRT75) and allow the effect of SNR changes on intelligibility to be estimated (MacPherson & Akeroyd, 2014).
Besides the STAIRCASE procedure, there are other adaptive methods to determine the SRT in noise. Bayesian methods, for instance, combine prior knowledge and experimental results by using Bayes’ theorem. This can be used to model a PF and infer its parameters based on prior values and listener responses. Herbert et al. (2022) simulated seven different adaptive procedures for speech-in-noise tests and compared their accuracy and reliability for SRT50 and SRT75. They report promising results for Bayesian procedures and suggest investigating these procedures with human participants.
The aim of our study was to design a procedure for matrix sentence tests in noise that requires less listening effort and thus offers more comfort to the participants. The method does not constantly challenge the listeners at their SRT, but also tests at higher intelligibility levels and uses as few trials as possible. To this end, Bayesian inference in an updated maximum likelihood (UML) procedure was used (Shen & Richards, 2012). This adaptive procedure has proved reliable in various psychophysical tests; for instance, in the field of psychoacoustics (Carcagno & Plack, 2021; Fischer et al., 2021a,b; Lee & Müllensiefen, 2020; Jurado et al., 2020). The method can estimate the reliability of the results by calculating a credible interval, the Bayesian equivalent of the confidence interval in frequentist statistics (Comotto, 2022). The credible interval can serve as the termination criterion for the adaptive procedure. The proposed method was named the BPACE (Bayesian PAtient-CEntered) procedure since it focuses not only on efficiency but also on the patient’s listening effort. Besides determining the SRT, the procedure can estimate the slope and the ceiling performance level of a listener’s PF.
Methods
Test Procedure
Idea and Principle
The proposed test procedure assumes the speech intelligibility in noise to be a PF modeled by a logistic function of the SNR according to Equation 1, where

Listener’s responses (black dots) depending on the signal-to-noise ratio (SNR) with the resulting psychometric function and its sweet-points (
The lower bound of this function is zero, as used in the original version of the test (Wagener et al., 1999) and as confirmed by a small pilot study. The parameter
Deducing a PF from previous responses allows testing subsequent sentences at SNR levels deviating from the currently assumed SRT. A UML procedure identifies optimal SNRs, so-called sweet-points, for the next trial, based on the results of all previous trials. Testing the next sentence at one of those sweet-points will minimize the expected variance for the respective parameter and thus result in maximum information gain. Figure 1 shows an example of responses to trials at different SNRs and the estimated psychometric function. The sweet-points are marked on this function. The
After each trial, all the previous responses are evaluated to find the most probable PF.
Then the algorithm decides which sweet-point on this PF to use for the next trial. Whenever fewer than 50% of the words were correctly repeated in the last trial, the algorithm selects a sweet-point with a higher SNR, and vice versa. With the aim of not discouraging the listener, the lower
The test can be terminated after a predefined number of trials or as soon as the reliability of the results is sufficient. At the end of the adaptive procedure, the first sentence is presented once more to the listener without noise to offer an easy finish, since Kahneman et al. (1993) found that the end of a test contributes substantially to the participant’s subjective experience. This sentence is neither evaluated nor saved.
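The PF model underlying the procedure can be illustrated with a short Python sketch. The parametrization below is an assumption made for illustration (a logistic function with a lower bound of zero and an upper asymptote of one minus the lapse rate) and may differ in detail from Equation 1; the function and argument names are hypothetical. Under this assumed form, the threshold parameter marks the midpoint of the function, which coincides with 50% intelligibility only when the lapse rate is zero.

```python
import math


def psychometric(snr_db, threshold_db, slope, lapse):
    """Proportion correct at snr_db for an assumed logistic PF.

    Lower bound is 0, upper asymptote is 1 - lapse (assumed parametrization).
    """
    return (1.0 - lapse) / (1.0 + math.exp(-slope * (snr_db - threshold_db)))


def snr_for_proportion(p, threshold_db, slope, lapse):
    """Invert the PF: SNR at which the proportion correct equals p."""
    upper = 1.0 - lapse
    if not 0.0 < p < upper:
        raise ValueError("p must lie strictly between 0 and the upper asymptote")
    return threshold_db - math.log(upper / p - 1.0) / slope


# Illustrative values only, not taken from the study data.
srt50 = snr_for_proportion(0.50, threshold_db=-6.0, slope=1.0, lapse=0.0)  # -6.0 dB SNR
srt75 = snr_for_proportion(0.75, threshold_db=-6.0, slope=1.0, lapse=0.0)  # about -4.9 dB SNR
```

Once threshold, slope, and lapse rate are known, any point of interest on the PF (such as SRT75) follows from the inverse function, which is the practical benefit of estimating more than the SRT.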
Implementation
MATLAB R2020a (The MathWorks, Natick, MA, USA) was used to implement the test procedure algorithm with a graphical user interface. It stores the individual test steps as well as the test results in a database. The source code is available as Supplemental Material. It may be adapted to work with any matrix sentence test.
The UML toolbox version 4 provided by Shen et al. (2015) calculates the posterior parameter distribution, the sweet-points and the credible intervals. The procedure starts with normal prior distributions for each parameter according to Table 1. The prior distributions were defined and parametrized based on pilot tests and simulations with virtual listeners. For each parameter, a range with a number of evenly spaced discrete values was defined (Table 1).
Parameter Space and Prior Distributions.
Together they span a three-dimensional parameter space in which the maximum likelihood for the posterior parameters is calculated using Bayes’ theorem according to Equation 2 with
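The grid-based Bayesian update can be sketched as follows. This is a minimal illustration, not the UML toolbox code: the grid values and prior parameters are hypothetical and are not those of Table 1, the logistic PF form is assumed, and each of the five words in a sentence is treated as an independent Bernoulli trial, which is a simplifying assumption of this sketch.

```python
import itertools
import math


def pf(snr_db, alpha, beta, lam):
    """Assumed logistic PF: lower bound 0, upper asymptote 1 - lam."""
    return (1.0 - lam) / (1.0 + math.exp(-beta * (snr_db - alpha)))


def log_normal_prior(value, mean, sd):
    """Unnormalized log density of a normal prior."""
    return -0.5 * ((value - mean) / sd) ** 2


def init_log_posterior(alphas, betas, lams, prior_means, prior_sds):
    """Independent normal priors evaluated on a discrete 3-D parameter grid."""
    log_post = {}
    for params in itertools.product(alphas, betas, lams):
        log_post[params] = sum(
            log_normal_prior(v, m, s)
            for v, m, s in zip(params, prior_means, prior_sds)
        )
    return log_post


def update(log_post, snr_db, n_correct, n_words=5):
    """Bayes' theorem in log form: add the trial log-likelihood at each grid point."""
    for params in log_post:
        p = min(max(pf(snr_db, *params), 1e-9), 1.0 - 1e-9)  # avoid log(0)
        log_post[params] += (
            n_correct * math.log(p) + (n_words - n_correct) * math.log(1.0 - p)
        )
    return log_post


def map_estimate(log_post):
    """Grid point (alpha, beta, lam) with the highest posterior probability."""
    return max(log_post, key=log_post.get)


# Hypothetical grid and priors (threshold in dB SNR, slope, lapse rate).
alphas = [float(a) for a in range(-10, 1)]
betas = [0.5, 1.0, 1.5, 2.0]
lams = [0.0, 0.05, 0.1]
log_post = init_log_posterior(
    alphas, betas, lams, prior_means=(-4.0, 1.0, 0.05), prior_sds=(4.0, 0.5, 0.05)
)

# Hypothetical trials: (SNR in dB, number of correctly repeated words out of 5).
trials = [(-10.0, 0), (-8.0, 0), (-6.0, 1), (-4.0, 4), (-2.0, 5), (0.0, 5)]
for snr_db, n_correct in trials:
    update(log_post, snr_db, n_correct)

alpha_hat, beta_hat, lam_hat = map_estimate(log_post)
```

Working with log probabilities keeps the update numerically stable over many trials; normalizing the exponentiated values yields the posterior distribution from which marginal estimates and credible intervals can be derived.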
Technical Validation
We performed simulations to parametrize and validate our new procedure and to compare it to the conventional procedure. A detailed summary of the simulation experiments is provided in the Supplemental Material. Although simulations are useful to estimate the performance and accuracy of the different test methods, a study with real participants is required to validate the test.
Study Design and Ethics
The BPACE procedure was evaluated and compared to the original STAIRCASE procedure in a prospective study to test its accuracy and reliability. Since the novel procedure is parametrized with prior values, it is important to test different groups of listeners; therefore, both CI patients and normal hearing (NH) participants were included. The study was approved by the local institutional review board (BASEC-ID 2021-01828).
Participants
A total of 20 CI users and 20 NH listeners were evaluated in the study. All subjects were native German speakers and gave their informed consent. Pure tone thresholds for NH participants had to be 25 dB
Demographics
The mean age of the study participants was 39 years (SD = 17 years); the CI patients had a mean age of 47 years (SD = 21 years) and the NH participants 32 years (SD = 6 years). A total of 27 participants were women and 13 were men. The CI patients were implanted with Advanced Bionics, Cochlear and Med-El implants and had been using their systems for 1–24 years (mean 9 years, SD = 7 years). Details of the CI systems and the NH participants’ pure tone thresholds are provided as Supplemental Material.
Audiometric Setting
All experiments were performed in an acoustic chamber with a calibrated clinical audiometer (Equinox, Interacoustics A/S, Assens, Denmark). Pure tone air conduction hearing thresholds (in dB
The CI patients used only the specified implant; in case of residual hearing in the contralateral ear, that ear was occluded with an earplug.
Test Protocol
Figure 2 illustrates the study timeline, starting with an assessment of the prerequisites and a training list consisting of 20 sentences in quiet. Subsequently, two adaptive tests identified two SRTs for the first procedure (STAIRCASE or BPACE) before the listener was asked to rate the listening effort. The effort rating scale ranged from 0 to 10, with labels ranging from no effort to very high effort (Zekveld et al., 2011). Afterward, the same sequence was repeated with the other test procedure. The order of the two test procedures (STAIRCASE/BPACE, A/B, respectively) was systematically alternated and not communicated to the listeners (single-blinded design).

Timeline of the study.
To minimize learning effects, the participants were allowed to see 10 example sentences containing all possible words during the training session. However, the test setting was not closed-set: the listeners could, and frequently did, give nonconforming responses or no response, so the effective guessing rate was close to zero. For both test procedures, each test run included 30 sentences; the STAIRCASE procedure was not terminated early, even after seven trials at the same SNR.
The STAIRCASE method adapted the signal level, starting at 0 dB
Statistical Analysis
Demographic data and functional outcomes are summarized using descriptive statistics. For the BPACE method, the credible interval for parameter
The agreement between the two methods is graphically explored in a Bland–Altman plot (Bland & Altman, 1986). Furthermore, a linear mixed-effects model was used to investigate the differences between the SRT estimates obtained with the two test methods. The test method (i.e., STAIRCASE vs. BPACE), the test sequence number (i.e., 0–3), the hearing condition (i.e., NH vs. CI) and the age (years) were used as fixed effects. The subject ID was included as a random intercept to account for repeated measurements. The subjective listening effort was evaluated with a similar model excluding the test sequence number. The residuals were visually inspected in a residuals versus fitted plot.
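The quantities behind a Bland–Altman plot can be computed in a few lines. The sketch below is illustrative only: the SRT values are hypothetical, the function name is an assumption, and the 1.96 multiplier gives the conventional 95% limits of agreement under approximately normally distributed differences. Differences are taken as BPACE minus STAIRCASE.

```python
import statistics


def bland_altman(bpace_srt, staircase_srt):
    """Return bias and 95% limits of agreement for paired SRTs (BPACE - STAIRCASE)."""
    diffs = [b - s for b, s in zip(bpace_srt, staircase_srt)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical per-participant mean SRTs in dB SNR, not study data.
bpace = [-5.5, -6.0, -4.5, -7.0]
staircase = [-5.0, -6.0, -5.0, -7.5]
bias, lower, upper = bland_altman(bpace, staircase)
```

A bias close to zero with narrow limits of agreement indicates that the two methods can be used interchangeably for the SRT.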
The test reliability was determined by calculating the intraclass correlation coefficient (ICC) using a two-way effects, absolute agreement, single measurement model. ICC can take values between 0 and 1 and is a measure of the consistency of test and retest results (< 0.5 poor, 0.5–0.75 moderate, 0.75–0.9 good, and > 0.9 excellent) (Koo & Li, 2016; Weir, 2005). RStudio (R Core Team, 2017) with the lme4 and irr packages (Bates et al., 2015; Gamer et al., 2019) was used for the statistical analysis.
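For readers who want to see what the ICC measures, the following sketch computes the absolute-agreement, single-measurement ICC from the mean squares of a two-way layout. It is a plain-Python illustration with hypothetical data and a hypothetical function name, not the irr package implementation used in the study.

```python
def icc_absolute_single(data):
    """ICC for absolute agreement, single measurement, two-way model.

    data: list of rows, one per subject, each holding k repeated measurements.
    """
    n = len(data)
    k = len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between test runs
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical test and retest values for three subjects.
perfect = icc_absolute_single([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
offset = icc_absolute_single([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
```

The second example shows why absolute agreement is the stricter criterion: a constant shift between test and retest leaves the ranking of subjects intact but still lowers the coefficient.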
Results
Accuracy and Agreement of the Two Methods
Figure 3 summarizes the SRTs measured with each test method for CI patients and NH participants separately. CI patients in this study showed a mean SRT of

SRT for the STAIRCASE and the BPACE methods separated for CI patients and NH listeners. Abbreviations: SRT = speech reception threshold; CI = cochlear implant; NH = normal hearing; BPACE = Bayesian PAtient-CEntered.
Figure 4 shows the learning curve over the four tests for NH listeners and CI patients.

SRT depending on test sequence number for CI patients and NH listeners. Abbreviations: SRTs = speech reception thresholds; CI = cochlear implant; NH = normal hearing.
Only the BPACE method estimates the slope and the asymptote of the PF. For NH listeners, the mean slope was 0.18 dB
The left panel in Figure 5 shows the Bland–Altman plot representing each test method by the mean of the two SRTs measured, subtracting STAIRCASE from BPACE results. In the right panel, the mean SRTs obtained with the BPACE method are plotted against the mean SRTs of the STAIRCASE method.

Left: Bland–Altman plot for STAIRCASE and BPACE test methods. Right: Scatterplot of the SRT, the color separates NH listeners from CI patients. Abbreviations: SRTs = speech reception thresholds; CI = cochlear implant; NH = normal hearing.
A linear mixed-effects model including the test method, test sequence number, the hearing condition, and age as fixed effects and the subject as a random effect was computed to analyze the 160 data points. No significant dependence of the SRT on the test method (STAIRCASE or BPACE) was found (see Table 2). Unsurprisingly, the NH group performed considerably better than the CI patients. In the course of the investigation, the SRT of the participants improved with each test run by 0.33 dB
Linear Mixed-Effects Model Summary.
Abbreviations: BPACE = Bayesian PAtient-CEntered; NH = normal hearing.
Test–retest reliability was determined by calculating the ICC of the two measurements for each test method. The results are listed in Table 3. The absolute test–retest difference of the SRT was found to be below 1.1 dB for 33 out of 40 STAIRCASE test pairs and for 33 out of 40 BPACE test pairs.
Test–Retest Reliability.
Abbreviations: BPACE = Bayesian PAtient-CEntered; ICC = intraclass correlation coefficient.
Listening Effort
The subjective listening effort ratings and the proportion correct are summarized in Figure 6. Overall, the participants rated the effort at 6.4 (SD = 2) on the scale from 0 to 10. For the STAIRCASE method, the rating was 7.3 (SD = 1.7) and for the BPACE method, it was 5.5 (SD = 1.9). The BPACE method reduced the mean subjective listening effort from 7.3 to 5.4 for the CI patients and from 7.4 to 5.6 for the NH participants. The proportion correct for the CI patients was 51.5% (SD = 2.8%) with the STAIRCASE method and 60.3% (SD = 3.9%) with the BPACE method. The NH listeners achieved a proportion correct of 53.8% (SD = 2.3%) with the STAIRCASE method and 63.2% (SD = 2.8%) with the BPACE method.

Subjective listening effort and mean proportion correct in the adaptive test for the CI patients and the NH participants. Abbreviations: CI = cochlear implant; NH = normal hearing.
Table 4 gives a summary of the fitting of the linear mixed-effects model with method, age, and hearing condition as fixed effects and the subject as random effect. No specific pattern or outlier was detected in the residuals versus fitted plot.
Linear Mixed-Effects Model Summary for the Listening Effort.
Abbreviations: BPACE = Bayesian PAtient-CEntered; NH = normal hearing.
Test Duration
The mean test duration for 30 sentences in noise was 4.6 min (SD = 0.8 min) overall, with 4.8 min (SD = 0.8 min) for the STAIRCASE method and 4.4 min (SD = 0.7 min) for the BPACE method. Applying a termination criterion of 2 dB for the credible interval reduces the testing time to 3.1 min (SD = 1.2 min) for the BPACE method. The termination criterion was reached after 21.5 trials (SD = 6.3 trials) on average. Figure 7 compares the duration for 30 trials with the duration for termination based on the running credible interval. The data are grouped for NH and CI listeners since the reduction in test duration is more pronounced for NH participants. For the training list in quiet, the mean duration was 2.5 min (SD = 0.6 min).
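A termination rule based on the credible interval can be sketched as follows. The sketch assumes a discrete marginal posterior for the threshold on an ascending grid, an equal-tailed 95% interval, and that the 2 dB criterion refers to the full interval width; these are assumptions of this illustration, and the function names and posterior values are hypothetical.

```python
def credible_interval(values, probs, mass=0.95):
    """Equal-tailed credible interval from a discrete marginal posterior.

    values: ascending grid of threshold candidates (dB SNR).
    probs: posterior weights for those candidates (need not be normalized).
    """
    total = sum(probs)
    tail = (1.0 - mass) / 2.0
    lower = upper = None
    cumulative = 0.0
    for value, prob in zip(values, probs):
        cumulative += prob / total
        if lower is None and cumulative >= tail:
            lower = value
        if cumulative >= 1.0 - tail:
            upper = value
            break
    if upper is None:  # guard against floating-point rounding
        upper = values[-1]
    return lower, upper


def should_stop(values, probs, max_width_db=2.0, mass=0.95):
    """True once the credible interval is no wider than max_width_db."""
    lower, upper = credible_interval(values, probs, mass)
    return (upper - lower) <= max_width_db


# Hypothetical marginal posteriors over the threshold, not study data.
grid = [-7.0, -6.0, -5.0, -4.0, -3.0]
narrow = [0.01, 0.04, 0.90, 0.04, 0.01]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]
stop_narrow = should_stop(grid, narrow)  # interval (-6, -4), width 2 dB
stop_flat = should_stop(grid, flat)      # interval (-7, -3), width 4 dB
```

Checking this rule after every trial lets a run end as soon as the posterior is sufficiently concentrated, which is why listeners with consistent responses finish earlier.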

BPACE test duration for 30 trials or until reaching the desired credible interval for the CI patients and NH participants. The dashed line indicates the mean duration of the STAIRCASE method (4.8 min). Abbreviations: BPACE = Bayesian PAtient-CEntered; CI = cochlear implant; NH = normal hearing.
Discussion
Speech-in-noise testing is an important audiological outcome measure; however, testing time and listening effort limit its applicability in clinical routine. The BPACE procedure was designed to alleviate these drawbacks and still return reliable SRT estimates. The study results demonstrate very good agreement between the SRT estimates found by the STAIRCASE and BPACE methods. Figure 3 shows almost equal mean SRTs for the two methods and comparable standard deviations. Figure 4 shows a decreasing SRT during the test sequence, a learning effect known from the literature (Wagener et al., 1999; Heyn, 2019; Rudolf & Kaiser, 2019). A longer training period with several training runs could reduce this learning effect. However, in this study the learning effect is distributed equally across the STAIRCASE and the BPACE results.
The mean SRT of
Besides providing the SRT in agreement with conventional existing procedures, the BPACE method yields more information about the entire PF. Knowing the parameters
Accuracy and Agreement of the Two Methods
The mean SRT results for STAIRCASE and BPACE tests are almost equal. For CI patients, the mean and the standard deviation strongly depend on the selected patient sample and therefore cannot be compared to published data. For the NH participants, the SRT standard deviation was slightly higher for BPACE than for STAIRCASE tests and lower than the value reported for the OLSA, but in good agreement with other matrix sentence tests (HörTech, 2013; Kollmeier et al., 2015).
The Bland–Altman plot in Figure 5 and the linear mixed-effects model show a high agreement between the two test methods, suggesting that the SRTs obtained by the new method are valid. Moreover, the test–retest reliability is similar for both test methods.
The absolute test–retest difference of the SRT was found to be below 1.1 dB for 33 out of 40 tests for both methods. The credible interval calculated by the BPACE method would be able to detect five out of the seven results with more than 1.1 dB difference by virtue of their credible interval lying above the limit of 2 dB. With the STAIRCASE method, on the other hand, there is no possibility to discern the reliable results from those with a high test–retest difference.
Listening Effort
The CI patients and NH participants rated the listening effort for the BPACE method significantly lower than for the STAIRCASE method. It is worth mentioning that the better rating for the subjective listening effort is achieved with only a moderate increase in the overall proportion correct (see right graph in Figure 6). Apparently, for most listeners, it is already sufficient to get relief after a tough sentence and not to struggle permanently at their limit. Nonetheless, finding an SRT in noise remains a challenging task, and most participants rated the listening effort rather high. One NH participant criticized the noise level as being too loud for the BPACE method and rated the listening effort higher than for the STAIRCASE method. This phenomenon could be avoided by keeping the overall level of signal and noise constant and changing the SNR by simultaneously raising one level while lowering the other (Kaandorp et al., 2014).
We did not find a significant influence of the SRTs on the effort ratings. Intuitively, one could expect better performers to rate the listening effort lower, in particular because the BPACE method is effectively more challenging for listeners with lower speech intelligibility in quiet. The upper asymptote of their PF is relatively low and, therefore, the
The linear mixed-effects model revealed a small but significant increase of the listening effort with the listener’s age. This effect has been found and documented in previous studies (Cardin, 2016; Tun et al., 2009). However, the current work was not designed to investigate the influence of the listener’s age on subjective listening effort.
Test Duration
For the BPACE method, a test termination based on the credible interval can substantially shorten the test for NH listeners. Unfortunately, only few CI patients reach an acceptable credible interval with few sentences and may thus benefit from this termination criterion. Poor performers with a shallow PF often require test runs with 30 sentences to produce reliable results. Therefore, Figure 7 demonstrates a smaller reduction of the test duration for CI patients compared to NH participants. Nevertheless, the testing time can be optimized individually when tests are terminated based on the updated credible interval.
The test duration with the STAIRCASE method was longer than with the BPACE method for 30 sentences. We assume that the difference is caused by the overall difference in intelligibility and that the listeners responded more rapidly to the more intelligible sentences presented at the upper
It might be possible to define a termination criterion for the STAIRCASE procedure also, based on the variability or the reversals of the adapting SNR. Thereby, the test duration might be optimized, but such an advancement for the OLSA procedure is not in this project’s scope. Termination after seven consecutive trials at the same SNR is another way to reduce the test duration but, according to experience, only rarely occurs. In this study, only two of the 80 STAIRCASE tests produced seven consecutive trials at equal SNR.
Limitations and Outlook
It is important to keep in mind that the calculated credible interval for the parameter
Even though the slope and the asymptote of the PF can be estimated with the BPACE method, the accuracy of these values is limited. This is due to the sweet-point selection rule favoring the determination of the SRT. Unlike the method proposed by Prins (2013), the BPACE procedure applies no mathematical weighting of the parameters.
Last, but not least, our evaluation is limited to NH participants and CI patients. It might be useful to evaluate the procedure with hearing aid users in a future study.
Conclusion
We developed a Bayesian procedure for matrix speech tests in noise. The method aims for reliable results in a short time with an acceptable listening effort and was compared to the existing procedure in a study. The evaluation shows good agreement of the novel test method BPACE with the conventional STAIRCASE method, with similar test–retest reliability. The listeners rated the listening effort with the novel method significantly lower than with the STAIRCASE method. The estimation of a credible interval for the current SRT allows individual test termination. Thereby, the test duration can be reduced by 1.3 min on average, particularly for well-performing listeners with consistent responses. The credible interval specifies the quality of the results, information not exploited by traditional methods.
Supplemental Material
Supplemental material, sj-pdf-1-tia-10.1177_23312165231191382 for BPACE: A Bayesian, Patient-Centered Procedure for Matrix Speech Tests in Noise by Christoph Schmid, Wilhelm Wimmer and Martin Kompis in Trends in Hearing
Footnotes
Acknowledgments
We thank the two anonymous reviewers for their critical reading of the manuscript and the constructive comments. Thank you to Benjamin von Gunten and David Tschanz for their support in data collection.
Declaration of Conflicting Interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study was funded by the scientific fund of the ENT clinic Inselspital Bern. Open access funding was provided by the University of Bern.
