Automatic Time-Alignment of Spoken Sentences
Two spoken sentences are never identical, even when they contain the same textual material. For example, two people reading the same poem, reciting the same pledge, repeating the same chant, or performing the same script will inevitably differ in their timing, their emphasis, and their intonation. Basic to the study of such differences is the ability to (automatically) parse the sentences and align the corresponding utterances, locating the times at which the phonemes of one speaker align with the (same) phonemes of another. This enables analysis of the micro-timing of events in the speech and has uses in linguistic studies, speech recognition systems, speech therapy, and audio/video synchronization. This talk presents preliminary results on a method for the automated audio alignment of multiple speakers uttering the same sentence. The primary technical innovations in the algorithm are the use of variable-length windows (in the analysis stage) and a piecewise phase vocoder (in the resynthesis stage). Switching between analysis windows of different lengths, triggered by a basic characteristic of the speech (whether it is voiced or unvoiced), ameliorates the time-frequency resolution trade-off. The algorithm is tested on the TIMIT speech corpus, and the voiced/unvoiced classification is compared against existing machine learning algorithms. A series of sound examples demonstrates the performance of the algorithm.
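To make the two stages concrete, here is a minimal Python sketch, not the talk's implementation. The voiced/unvoiced decision is approximated by a zero-crossing-rate and energy heuristic (the talk compares a classifier against machine learning baselines; this heuristic is only a stand-in), the window lengths, hop sizes, and thresholds are illustrative, and the anchor arrays `anchors_x`/`anchors_ref` are hypothetical outputs of the alignment stage (e.g., matched phoneme boundaries).

```python
import numpy as np

def is_voiced(frame, zcr_thresh=0.15, energy_thresh=1e-4):
    """Crude voiced/unvoiced decision: voiced speech tends to have high
    short-time energy and a low zero-crossing rate. A stand-in for the
    classifier evaluated in the talk, not the actual one."""
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    return np.mean(frame ** 2) > energy_thresh and zcr < zcr_thresh

def variable_window_stft(x, hop=128, long_win=2048, short_win=256):
    """Analysis stage: long windows on voiced frames (harmonics need
    frequency resolution), short windows on unvoiced frames (transients
    need time resolution). Window sizes are illustrative."""
    frames = []
    for start in range(0, len(x) - long_win, hop):
        n = long_win if is_voiced(x[start:start + short_win]) else short_win
        frames.append((start, n, np.fft.rfft(np.hanning(n) * x[start:start + n])))
    return frames

def phase_vocoder(x, rate, n_fft=1024, hop=256):
    """Textbook phase vocoder: time-stretch x by 1/rate, keeping STFT
    magnitudes and advancing phases by the measured per-bin
    instantaneous frequency."""
    win = np.hanning(n_fft)
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft
    positions = np.arange(0, len(x) - n_fft - hop, rate * hop)
    phase = np.angle(np.fft.rfft(win * x[:n_fft]))
    out = np.zeros(len(positions) * hop + n_fft)
    for i, pos in enumerate(positions):
        p = int(pos)
        s1 = np.fft.rfft(win * x[p:p + n_fft])
        s2 = np.fft.rfft(win * x[p + hop:p + hop + n_fft])
        dphi = np.angle(s2) - np.angle(s1) - omega        # deviation from bin frequency
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))  # wrap to (-pi, pi]
        phase += omega + dphi                             # accumulate true phase advance
        out[i * hop:i * hop + n_fft] += win * np.fft.irfft(np.abs(s2) * np.exp(1j * phase))
    return out / 1.5  # approximate overlap-add gain for a Hann window at 75% overlap

def piecewise_align(x, anchors_x, anchors_ref):
    """Resynthesis stage: stretch each segment of x between consecutive
    (hypothetical) anchor samples so its duration matches the reference
    utterance's corresponding segment. Segments must exceed n_fft + hop."""
    pieces = []
    for a0, a1, b0, b1 in zip(anchors_x, anchors_x[1:], anchors_ref, anchors_ref[1:]):
        rate = (a1 - a0) / (b1 - b0)  # >1 shortens the segment, <1 lengthens it
        pieces.append(phase_vocoder(x[a0:a1], rate))
    return np.concatenate(pieces)
```

Given anchor times from the alignment stage, `piecewise_align(x, anchors_x, anchors_ref)` warps one reading onto another's timing; applying a separate stretch factor to each inter-anchor segment is what makes the phase vocoder "piecewise."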
presentation slides available here