Articulatory Representation and Speech Technology

Abstract

In this paper we demonstrate the feasibility and usefulness of articulation-based approaches in two major areas of speech technology: speech recognition and speech synthesis. Our articulatory recognition model estimates probabilities for categories of manner and place of articulation, which together form the articulatory feature vector. The transformation from the articulatory level to the symbolic level is performed by hidden Markov models or multi-layer perceptrons. Evaluations show that the articulatory approach is a good basis for speaker-independent and speaker-adaptive speech recognition. We are now working on a more realistic articulatory model for speech recognition. An algorithm based on an analysis-by-synthesis model maps the acoustic signal to 10 articulatory parameters that describe the positions of the articulators. Electromagnetic articulograph (EMA) measurements recorded at the University of Munich provide good initial estimates of tongue coordinates. To improve articulatory speech synthesis, we investigated an accurate physical model of glottal source generation by means of numerical simulation. This model takes into account nonlinear vortical flow and its interaction with sound waves. The simulation results can be used to improve the articulatory synthesis model developed by Ishizaka and Flanagan (1972).

Keywords

automatic speech recognition, speech synthesis, articulatory modeling

References

Aktas, A., Schmidbauer, O., Maier, K.H., and Feix, W., (1990). Classification of coarse phonetic categories in continuous speech: Statistical classifiers vs. temporal flow connectionist networks. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 89–92). Albuquerque.

Atal, B.S., (1978). Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer sorting technique. Journal of the Acoustical Society of America, 63, 1535–1555.

Barney, A.M., Shadle, C.H., and Thomas, D.W., (1990). Airflow measurement in a dynamic mechanical model of the vocal folds. Proceedings International Conference on Spoken Language Processing (pp. 165–168). Kobe.

Coker, C., (1976). A model of articulatory dynamics and control. Proceedings IEEE, pp. 452–460.

Hegerl, G.C., and Höge, H., (1991). Numerical simulation of the glottal flow by a model based on the compressible Navier-Stokes equations. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 447–481). Toronto.

Hegerl, G.C., (1991a). Numerical simulation of the glottal flow and glottal excitation. Proceedings of XIIth International Conference of Phonetic Sciences, Aix-en-Provence, 2, 477–481.

Hegerl, G.C., (1991b). Numerische Lösung der kompressiblen zweidimensionalen Navier-Stokes Gleichungen in einem zeitabhängigen Gebiet mit Hilfe energie-vermindernder Randbedingungen. Ph.D. thesis, Mathematics Department of the Ludwig-Maximilians University, Munich.

Hon, H.-W., and Lee, K.-F., (1990). On vocabulary-independent speech modeling. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 445–449). Edinburgh.

Iijima, H., Miki, N., and Nagai, N., (1990). Glottal flow analysis based on a finite element simulation of a two-dimensional unsteady viscous fluid. Proceedings International Conference on Spoken Language Processing, 1, 77–80, Kobe.

Ishizaka, K., and Flanagan, J.L., (1972). Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell System Technical Journal, 51, 1233–1268.

Ishizaka, K., and Matsudaira, M., (1972). Fluid mechanical considerations of vocal cord vibration. S.C.R.L. Monograph, 8, Speech Communication Research Lab., Santa Barbara, CA.

Kobayashi, T., and Yagyu, M., (1991). Application of neural nets to articulatory motion estimation. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 489–492). Toronto.

Kubala, F., Schwarz, R., and Barry, C.O., (1990). Speaker adaptation from a speaker-independent training corpus. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 137–140). Albuquerque.

Lee, K.F., (1988). Large Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System. Ph.D. Thesis, Computer Science Department, Carnegie Mellon University, Pittsburgh.

Maeda, S., (1982). The role of the sinus cavities in the production of nasal vowels. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 911–914). Paris.

Mermelstein, P., (1973). Articulatory model for the study of speech production. Journal of the Acoustical Society of America, 53, 1070–1082.

Milenkovic, P., Xue, Q., and Hu, Y.H., (1990). Analyses of the hidden units of the multi-layer perceptron and its application in acoustic-to-articulatory mapping. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 869–872). Albuquerque.

Sanchez, J.A., and Casacuberta, F., (1991). The use of articulatory features in isolated word recognition with hidden Markov models. Report R4R3 and PPR2 of ACCOR Projekt, University of Valencia.

Schmidbauer, O., (1989a). Ein System zur Lauterkennung auf der Basis artikulatorischer Merkmale. Dissertation, Fakultät für Elektrotechnik und Informationstechnik, Technische Universität, München.

Schmidbauer, O., (1989b). Robust statistic modelling of systematic variabilities in continuous speech incorporating acoustic-articulatory relations. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 616–619). Edinburgh.

Schmidbauer, O., (1990). An algorithm for automatic formant extraction in continuous speech. Proceedings EUSIPCO, Fifth European Signal Processing Conference, pp. 1151–1154, Barcelona.

Schmidbauer, O., (1991). Speaker adaptation based on articulatory features. Proceedings EUROSPEECH, 2nd European Conference on Speech Communication and Technology, pp. 1099–1102, Genova.

Schroeter, J., Larar, J.N., and Sondhi, M.M., (1987). Speech parameter estimation using a vocal tract/cord model. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 308–311). Dallas.

Schroeter, J., and Sondhi, M.M., (1989). Dynamic programming search of articulatory codebooks. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 588–591). Edinburgh.

Sondhi, M.M., and Resnick, J.R., (1983). The inverse problem for the vocal tract: Numerical methods, acoustic experiments and speech synthesis. Journal of the Acoustical Society of America, 73, 985–1002.

Sondhi, M.M., and Schroeter, J., (1987). A hybrid time-frequency domain articulatory speech synthesizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35, 965–967.

Sotschek, J., (1984). Sätze für Sprachgütemessungen und ihre phonologische Anpassung an die Deutsche Sprache. Proceedings Deutsche Arbeitsgemeinschaft für Akustik (DAGA), pp. 873–876, Darmstadt.

Strube, H.W., and Roesler, S., (1989). Measurement of the glottal impedance with a mechanical model. Journal of the Acoustical Society of America, 86, 1708–1716.

Sundberg, J., (1987). From sagittal distance to area. Phonetica, 44, 76–90.

Thomas, T.J., (1986). A finite element model of fluid flow in the vocal tract. Computer Speech and Language, 1, 131–151.

Tseng, H.P., Sabin, M.J., and Lee, E.A., (1987). Fuzzy vector quantization applied to hidden Markov models. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 641–644). Dallas.