The primary objective of this dissertation is to document the development of an acoustic-to-articulatory transformation for non-nasalized vowel-like speech samples. The transformation relies on a database of 11,385 synthetic speech samples, generated using a version of Mermelstein's articulatory model (DOI:10.1121/1.1913427) that represents an average adult male speaker. This model was used to determine an area function for each of the synthetic articulatory states. The speech samples were synthesized by applying the transmission-line analogue of the human vocal tract. Both time- and frequency-domain applications of this representation were investigated; the frequency domain was selected for the final implementation because it permits accurate representation of the distributed vocal-tract losses and of the radiation load. No glottal termination was used. The peaks of the synthesized power spectra were then determined, so that each synthetic state has known acoustic and articulatory attributes, both of which are stored in the database.
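The core of the frequency-domain synthesis step can be illustrated with a minimal sketch: a chain-matrix evaluation of the volume-velocity transfer function for a concatenation of uniform tube sections, followed by peak picking to locate the formants. This simplified version is lossless, with an ideal open termination at the lips and no radiation load, so it is an illustration of the technique rather than the dissertation's implementation; all function names and constants here are illustrative.

```python
import numpy as np

C = 35000.0  # speed of sound in air, cm/s
RHO_C = 1.0  # rho*c normalized out; it does not affect formant locations

def transfer_gain(areas, lengths, f):
    """Volume-velocity transfer gain |U_lips / U_glottis| at frequency f (Hz)
    for a lossless chain of uniform tube sections, ordered lips to glottis."""
    k = 2.0 * np.pi * f / C
    # Start at the lips with P = 0 (ideal open end) and U = 1, then chain
    # the pressure/flow pair back toward the glottis, section by section.
    P, U = 0.0 + 0j, 1.0 + 0j
    for A, l in zip(areas, lengths):
        Z = RHO_C / A  # characteristic impedance of this section
        P, U = (P * np.cos(k * l) + 1j * Z * U * np.sin(k * l),
                1j * (P / Z) * np.sin(k * l) + U * np.cos(k * l))
    return 1.0 / abs(U)

def formants(areas, lengths, fmax=5000.0, df=1.0):
    """Estimate formants as local peaks of the transfer gain on a grid."""
    freqs = np.arange(df, fmax, df)
    gain = np.array([transfer_gain(areas, lengths, f) for f in freqs])
    return [float(freqs[i]) for i in range(1, len(gain) - 1)
            if gain[i] > gain[i - 1] and gain[i] > gain[i + 1]]

# A uniform 17.5 cm tube (schwa-like) resonates near odd multiples of
# c / (4L), i.e. about 500, 1500, 2500 Hz.
print(formants([5.0] * 10, [1.75] * 10, fmax=3000.0))
```

The ten-section uniform tube above reduces to the classic quarter-wavelength resonator, which gives a quick sanity check on the chain-matrix arithmetic before an arbitrary area function is supplied.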
To transform a natural speech sample into an articulatory state, formant information is first estimated from segments of vowel-like speech using pitch-synchronous linear predictive analyses of individual pitch periods. The estimated formants are then used to access the database of synthetic speech samples, yielding a candidate list from which the best-estimate state is chosen. This final selection employs a distance function that considers not only acoustic distance but also factors relating to articulatory continuity and neutrality.
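The final selection step can be sketched as a weighted sum of three terms: an acoustic mismatch against the measured formants, an articulatory distance from the previously selected state (continuity), and an articulatory distance from a neutral state. The weights, distance metrics, and data layout below are illustrative assumptions, not the dissertation's actual values.

```python
import math

def select_state(candidates, target_formants, prev_state, neutral_state,
                 w_acoustic=1.0, w_cont=0.5, w_neut=0.1):
    """Choose the best-estimate state from a candidate list.
    Each candidate is (formant_tuple, articulatory_param_tuple).
    Weights are illustrative placeholders."""
    def acoustic(f):
        # Sum of squared relative formant errors.
        return sum(((a - b) / b) ** 2 for a, b in zip(f, target_formants))

    def artic(p, q):
        # Euclidean distance in articulatory parameter space.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    return min(candidates, key=lambda c: (
        w_acoustic * acoustic(c[0])
        + w_cont * artic(c[1], prev_state)      # articulatory continuity
        + w_neut * artic(c[1], neutral_state))) # preference for neutrality
```

With such a cost, a candidate that is acoustically tied with another but articulatorily closer to the previous state wins, which is exactly the behaviour the continuity term is meant to enforce.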
The results of a number of experimental tests that validate the operation of the acoustic-to-articulatory transformation are presented. The database storage and access techniques have been verified using a number of examples based on published formant data and synthetically generated vowels. The capabilities of the complete transformation are demonstrated using a selection of natural speech samples produced by an adult male Australian English speaker. Specifically, these samples consist of isolated "h-d" words containing both vowels and diphthongs, as well as words, taken from passages of continuous speech, that demonstrate consonant-vowel and vowel-consonant co-articulation effects.
As a secondary objective of this dissertation, the design of a speech therapy aid, based on this acoustic-to-articulatory transformation, is outlined. This system is intended for use by speech therapists and teachers, and provides the speaker with acoustic and articulatory information to supplement the aural feedback normally used in learning to speak. The aid is expected to be especially valuable for hearing-impaired speakers who lack effective aural feedback. The suggested implementation of the aid utilizes state-of-the-art 16-bit microprocessor and VLSI digital signal processing technology. Particular attention is paid to the development of a low-cost multi-feature aid that operates in near-real-time, is easy to use, and is capable of addressing many of the instrumentation problems of modern speech therapy and training.
A number of other topics are examined, as they are required for the implementation of the proposed speech therapy aid. These topics include the segmentation of natural speech into basic phonetic classes, the validation of the raw formant data by tracking the formants through successive pitch periods, and the normalization of speaker-dependent features. This last topic is especially important for a speech therapy aid, where the speakers are likely to range from young children to adults. Algorithms for each of these topics are selected on the basis of maximum return for minimum effort.
The segmentation procedure used is an enhanced version of a previously published algorithm (DOI:10.1109/ICASSP.1982.1171793), and is computationally efficient, requiring only measures of signal amplitude, zero-crossing rate, and wide-band energy. The enhanced procedure is capable of accurately delineating segments of natural speech to reflect the four basic speech classes: voiced, unvoiced, voiced-unvoiced, and silence. Plosives are also identified, although with an accuracy somewhat less than for the basic classes.
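A per-frame classification using only these inexpensive measures can be sketched as follows. This is a deliberately reduced illustration, assuming a three-way decision (silence/voiced/unvoiced) from mean amplitude and zero-crossing rate alone; the thresholds and function name are placeholders, and the published procedure's additional logic (the mixed voiced-unvoiced class, plosive detection, and the wide-band energy measure) is omitted.

```python
def classify_frame(samples, amp_thresh=0.02, zcr_thresh=0.25):
    """Classify one frame of normalized samples as 'silence', 'voiced',
    or 'unvoiced'. Thresholds are illustrative, not calibrated values."""
    n = len(samples)
    # Mean absolute amplitude: low values indicate silence.
    amp = sum(abs(s) for s in samples) / n
    # Zero-crossing rate: high values indicate unvoiced (noise-like) speech.
    zcr = sum(1 for a, b in zip(samples, samples[1:])
              if (a < 0) != (b < 0)) / (n - 1)
    if amp < amp_thresh:
        return 'silence'
    return 'unvoiced' if zcr > zcr_thresh else 'voiced'
```

The appeal of this family of methods is precisely its cost: both measures are computed in a single pass over the frame with no transforms, which is what makes a near-real-time implementation on modest hardware plausible.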
A novel formant tracking algorithm was developed to track the formants accurately through consecutive pitch periods. The ability of this new procedure to track formant transitions during diphthongs and co-articulation effects is demonstrated by a number of examples. As with the segmentation algorithm, the formant tracking algorithm is computationally efficient, and requires only a small amount of a priori speaker-dependent information in addition to the formant data. Within this dissertation, the algorithm is applied to vowel-like sounds only; it is not limited to these sounds, however, and could readily be extended to the general formant-tracking task.
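The essence of tracking formants through consecutive pitch periods can be shown with a greedy continuity-based sketch: each raw formant from the current pitch period continues the nearest track that ended on the previous period, provided the frequency jump is within a tolerance; otherwise it starts a new track. This is a simplified stand-in for the dissertation's algorithm, and the `max_jump` tolerance is an assumed parameter.

```python
def track_formants(raw_frames, max_jump=150.0):
    """Greedy formant tracker across consecutive pitch periods.
    raw_frames: list of per-period raw formant lists (Hz).
    Returns a list of tracks, each a list of (period_index, frequency)."""
    tracks = []
    for i, frame in enumerate(raw_frames):
        for f in frame:
            best = None
            for t in tracks:
                # Only tracks updated on the previous period may continue,
                # which also prevents two formants sharing one track.
                if t[-1][0] == i - 1 and abs(t[-1][1] - f) <= max_jump:
                    if best is None or abs(best[-1][1] - f) > abs(t[-1][1] - f):
                        best = t
            if best is not None:
                best.append((i, f))
            else:
                tracks.append([(i, f)])
    return tracks
```

Because broken tracks simply restart rather than absorbing implausible jumps, short spurious LP roots end up as short orphan tracks that are easy to discard, which is one way raw formant data can be validated.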
A survey of published speaker normalizations is presented, and from this survey representative examples of production-based and perception-based normalizations are chosen. Experimental results for a speaker-normalized version of the acoustic-to-articulatory transformation, using each of the selected procedures, are provided.
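As one concrete example of the production-based family, uniform vocal-tract-length normalization scales all of a speaker's formants by a single factor derived from a reference formant, mapping, say, a child's formant space onto the adult male model speaker's space. The reference value and function below are illustrative; this is not necessarily one of the procedures selected in the dissertation.

```python
def normalize_formants(formants, speaker_ref_f3, model_ref_f3=2500.0):
    """Uniform vocal-tract-length normalization: scale all formants by the
    ratio of the model speaker's reference F3 to this speaker's F3.
    model_ref_f3=2500.0 is an assumed placeholder for the model speaker."""
    k = model_ref_f3 / speaker_ref_f3
    return [f * k for f in formants]
```

A single multiplicative factor is the crudest member of this family; perception-based alternatives instead warp the frequency axis nonlinearly, but the database-access step is unchanged either way, since normalization happens before lookup.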