The Rhythm of Speech: How Much Information Does It Carry?
How much information is carried in the rhythm of speech, and at what level (word, phoneme, or microtiming) is that information contained? This paper uses a classification paradigm to investigate the time scales at which information may be encoded in normal speech. For example, is it possible to tell, from the rhythm of the speech alone, whether the speaker is male or female? Is it possible to tell whether they are a native or nonnative speaker? Traditional investigations into speech rhythm approach this problem by manually annotating the speech and examining a preselected collection of features, such as vowel durations or inter-phoneme timings. We present a method that can automatically align the audio of multiple speakers uttering the same sentence. The output of the alignment procedure is a mapping (from the microtiming of one speaker to that of another) that can be used as a surrogate for speech rhythm. The accuracy of the alignment is evaluated by a technique of transitive validation, and the timing maps are used to form a feature vector for classification. Applied to the large online Speech Accent Archive corpus, the method shows that speakers can be classified from these mappings alone. In the gender discrimination experiments, the proposed method was only about 5% less accurate than a state-of-the-art classifier based on spectral feature vectors. In the native speaker discrimination task, rhythm-based classification was about 15% more accurate than classification using spectral information. The experiments thus clarify the level of speech at which rhythmic information is encoded.