Abstract
At present, intelligent oral English training systems mostly perform discrimination directly on input speech, an approach with a high misjudgment rate that cannot deliver targeted pronunciation correction. This study addresses two issues in oral English training: insufficient generalization of speech recognition under stress interference (e.g., a WER of 13.38% on the LibriSpeech test set) and low accuracy of speech-to-animation synchronization. It proposes a novel diarticulation model that quantifies the coarticulation effect through consonant-vowel and vowel-vowel visual weight functions and optimizes mouth-shape parameters through a dynamic feedback mechanism. Experimental results show a significant improvement in recognition performance, with an F1 score of 0.91 and a character error rate (CER) of 6–8.5% on the domain-matched dataset. The pronunciation-correction effect is also strong: the vowel error rate falls by 43% (23.1% → 13.2%) and consonant-linking accuracy improves by 38.5%. In addition, synchronization delay is reduced by 15.3% relative to PhoneBERT (p < 0.001), with a DTW distance of only 0.18. The system thus mitigates the speech-recognition generalization problem through data-adaptive training and dual pronunciation modeling, while its synchronization accuracy and real-time performance (48 FPS) reach a practical level, providing a reliable technical solution for intelligent oral training.
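The synchronization quality reported above is scored with a DTW distance between parameter trajectories. As a minimal illustrative sketch (not the paper's implementation), the standard DTW recurrence over two per-frame mouth-shape parameter sequences looks like this; the sequence values and function name are assumptions for illustration:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D
    sequences (e.g. per-frame mouth-opening parameters)."""
    n, m = len(a), len(b)
    # D[i, j] = minimal accumulated cost aligning a[:i] with b[:j]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return float(D[n, m])

# A pure time-shift is absorbed by the warping path, so the distance stays 0.
ref = [0.0, 0.2, 0.8, 0.4, 0.1]
shifted = [0.0, 0.0, 0.2, 0.8, 0.4, 0.1]
print(dtw_distance(ref, ref))      # 0.0
print(dtw_distance(ref, shifted))  # 0.0
```

A small DTW distance between the synthesized and reference trajectories therefore indicates that the animation tracks the audio closely even under local timing jitter.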
