On automatic estimation of articulatory parameters in a text-to-speech system

Computer Speech & Language, pp. 37-75

It is often conjectured that articulatory modelling of speech is desirable, provided that training data can be obtained. Articulatory parameters are likely to interpolate smoothly over long intervals and to reproduce many acoustic details from simple constraints governed by physical laws. For a text-to-speech application, it may also be easier to formulate rules in the articulatory domain.

An analysis-synthesis scheme for estimating phoneme-level articulatory parameters that best fit natural speech, in the context of a text-to-speech (TTS) system, is presented. The units of optimization are the parameters of an articulatory model (one vector per phoneme) together with vectors of transition time and transition speed for each parameter. The TTS system is used to initialize these parameters.
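By way of illustration, the sketch below (Python; the function names, array layout, and the sigmoid transition shape are assumptions, since the paper's actual transition function is not reproduced here) turns per-phoneme target vectors plus per-parameter transition times and speeds into smooth parameter trajectories.

```python
import numpy as np

def trajectories(targets, trans_time, trans_speed, t):
    """Articulatory parameter trajectories over time.

    targets:     (n_phonemes, n_params) target vector per phoneme
    trans_time:  (n_phonemes - 1, n_params) centre time of each transition (s)
    trans_speed: (n_phonemes - 1, n_params) slope of each transition (1/s)
    t:           (n_frames,) sampling instants (s)

    Each parameter moves from one target to the next through a sigmoid;
    transitions are additive, so slow and fast parameters can overlap.
    """
    out = np.tile(targets[0], (len(t), 1))      # start at the first target
    for k in range(len(targets) - 1):
        step = targets[k + 1] - targets[k]      # (n_params,)
        s = 1.0 / (1.0 + np.exp(-trans_speed[k] * (t[:, None] - trans_time[k])))
        out += step * s                         # smooth move toward the next target
    return out

# Two phonemes, two parameters (e.g. tongue-body position, lip opening):
targets = np.array([[0.2, 1.0],
                    [0.8, 0.3]])
tt = np.array([[0.10, 0.12]])        # per-parameter transition times (s)
ts = np.array([[80.0, 40.0]])        # per-parameter transition speeds (1/s)
t = np.linspace(0.0, 0.25, 50)
print(trajectories(targets, tt, ts, t)[[0, 25, -1]])
```

Because the control points live at the phoneme level, the trajectory between two targets is smooth by construction, which is what makes frame-level artifacts avoidable.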

Problems encountered in analysis-synthesis are due to the following: (1) Automatic optimization methods work well only when the model is capable of matching the data closely; otherwise, the best solutions according to an objective error criterion are not necessarily sensible compromises according to perception. (2) Very different vocal tract shapes can produce speech spectra that are very similar (non-uniqueness). (3) Local minima can result from artifacts of the comparison strategy or from the non-uniqueness of the spectrum-to-area transformation.
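Problem (2) can be made concrete with the classic lossless concatenated-tube model, one standard vocal tract idealization (not necessarily the model used in the paper): its reflection coefficients depend only on ratios of adjacent cross-sectional areas, so uniformly scaling the entire area function leaves the transfer function, and hence the formants, unchanged. The sketch below derives the all-pole denominator from an area function via the LPC step-up recursion; the area values are illustrative, not taken from the paper.

```python
import numpy as np

def tube_poly(areas):
    """Denominator A(z) of the lossless concatenated-tube (Kelly-Lochbaum)
    transfer function, built with the LPC step-up recursion."""
    # Reflection coefficients depend only on *ratios* of adjacent areas.
    r = [(areas[k + 1] - areas[k]) / (areas[k + 1] + areas[k])
         for k in range(len(areas) - 1)]
    a = np.array([1.0])
    for rk in r:
        a = np.concatenate([a, [0.0]]) + rk * np.concatenate([[0.0], a[::-1]])
    return a

def formants(areas, fs=8000.0):
    """Formant frequencies in Hz from the pole angles of 1 / A(z)."""
    roots = np.roots(tube_poly(areas))
    roots = roots[np.imag(roots) > 0.01]     # keep one pole per conjugate pair
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

a1 = np.array([2.6, 8.0, 10.5, 10.5, 8.0, 5.0, 0.65, 0.65])  # area function, cm^2
a2 = 3.0 * a1                                                 # same shape, scaled
print(formants(a1))
print(formants(a2))  # identical formants: the spectrum fixes areas only up to scale
```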

Our solutions consist mainly of the following: (1) In each phoneme, adapt only those variables that have a large effect on the acoustic result (the critical variables). (2) Use multiple error criteria sequentially: match large-scale spectral features in a first stage, then match the finer details with a high-resolution criterion. Both ideas are sketched below.
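A minimal sketch of both ideas, under assumed details (the smoothing orders 8 and 40, the step sizes, and all function names are hypothetical, not the paper's): a cepstrally smoothed log-spectral distance whose resolution is raised between stages, with a simple coordinate search restricted to the critical variables of the current phoneme.

```python
import numpy as np

def smoothed_log_spectrum(frame, n_keep, n_fft=512):
    """Log magnitude spectrum, cepstrally smoothed: keeping only the
    n_keep lowest quefrencies retains large-scale spectral shape."""
    spec = np.log(np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) + 1e-8)
    c = np.fft.irfft(spec)                       # real cepstrum
    c[n_keep:len(c) - n_keep + 1] = 0.0          # zero the high quefrencies
    return np.fft.rfft(c).real

def spectral_error(synth, natural, n_keep):
    """Mean squared distance between smoothed log spectra, frame by frame."""
    e = 0.0
    for s, n in zip(synth, natural):
        d = smoothed_log_spectrum(s, n_keep) - smoothed_log_spectrum(n, n_keep)
        e += np.mean(d * d)
    return e / len(natural)

def fit_phoneme(params, critical, synthesize, natural_frames):
    """Coordinate search over the critical variables only, first against a
    coarse criterion, then against a high-resolution one."""
    for n_keep in (8, 40):                       # stage 1: coarse; stage 2: fine
        for i in critical:                       # acoustically important variables
            for step in (0.05, -0.05):
                trial = params.copy()
                trial[i] += step
                if (spectral_error(synthesize(trial), natural_frames, n_keep)
                        < spectral_error(synthesize(params), natural_frames, n_keep)):
                    params = trial
    return params
```

Here `synthesize` stands in for the articulatory synthesizer (it should return a list of waveform frames aligned with `natural_frames`). The point of the staging is that the coarse criterion steers the search toward the right spectral envelope before the high-resolution criterion, with its many local minima, takes over.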

It is demonstrated that good-quality speech synthesis is possible through phoneme-level control of an articulatory model. Because the control is parameterized at the phoneme level, the synthesis is free of frame-level artifacts, even in transitions. The quality of the synthetic speech is comparable, at least in non-nasal voiced segments, to that of a good-quality, frame-by-frame linear predictive coder. The models need further refinement to reproduce the other phonemes closely.