While many of us may be familiar with the IVR voice talent that you hear over the automated answering machine or the increasingly human-like responses of virtual assistant technology such as Alexa and Siri, not many of us know how they came to be. Especially in the case of fully automated programs such as the artificial intelligence (A.I.) seen in sci-fi, it may seem improbable or nearly impossible that they are able to respond to just about any question without hesitation.
The process behind how these voices are curated and generated is complex and labor intensive, and certainly takes more than a single recording session to develop. Not to mention the fact that such programs have to be able to “read” any piece of text thrown at them, no matter the subject matter. Furthermore, over the years, software developers have been doing their best to make A.I.s seem human-like, both in terms of their responses and how they sound. Although to some it may cross into uncanny-valley territory, the fact of the matter is that as the capabilities of voice synthesis software have continued to develop, the voices on everything from Global Positioning Systems (GPS) to the latest action movies are increasingly able to pass as human. So how do they do it?
Break It Down
All speech, no matter if it comes from an organic vocal tract or a computer’s speakers, is made from small units of sound. The technical name for these small units is “phonemes.” These units are captured in the International Phonetic Alphabet (IPA), in which each symbol represents a single type of sound. However, it is remarkably difficult to produce some of these sounds in isolation, and even more difficult to create them artificially in a way that sounds accurate to the human ear.
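To make the idea concrete, here is a minimal sketch of how text gets broken into phonemes before any audio is generated. The word list and IPA spellings below are hand-picked illustrations; real systems rely on large pronunciation dictionaries (such as CMUdict) plus letter-to-sound models for words they have never seen.

```python
# Illustrative only: a tiny hand-made mapping from words to IPA
# phoneme sequences. Real synthesizers use dictionaries with
# hundreds of thousands of entries.
PRONUNCIATIONS = {
    "cat": ["k", "æ", "t"],
    "ship": ["ʃ", "ɪ", "p"],
    "sing": ["s", "ɪ", "ŋ"],
}

def to_phonemes(sentence):
    """Break a sentence into the phoneme sequence a synthesizer would voice."""
    phonemes = []
    for word in sentence.lower().split():
        if word not in PRONUNCIATIONS:
            raise KeyError(f"no pronunciation known for {word!r}")
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes

print(to_phonemes("cat ship"))  # ['k', 'æ', 't', 'ʃ', 'ɪ', 'p']
```

Note how a single letter does not map to a single sound: “sh” in “ship” is one phoneme (ʃ), which is exactly why synthesizers work from phonemes rather than spelling.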
So how do programmers and developers get past this? Well, the simplest solution is to bring in experienced voice actors and have them read a variety of texts. This description is, of course, deceptively simple. With over 160 symbols in the IPA and thousands of possible combinations even for simple consonant-vowel pairs, recording the whole scope of sounds a human is capable of making in a single session is nearly impossible.
Does This Sound Right?
To be completely fair, no language in the world uses the full suite of 160 IPA symbols, and in fact most languages use only a small proportion of those sounds. However, the increasingly multi-ethnic and global reach of the internet brings in many new sounds and words that a given language might not use on a regular basis. For example, monolingual speakers of English may find it challenging to distinguish the tones in tonal languages such as the Chinese languages and Vietnamese. So what happens if Siri is asked to read an article about tourist attractions in China that have Chinese names?
To handle cases like these, trained voice actors often have to listen carefully to the original source language and study how the words are pronounced in order to replicate them. These actors are acutely aware of how they sound, and will do their best to convey the correct pronunciation of a word while still sounding natural. This is no mean feat, and it involves constant updates and recording sessions drawing on a variety of texts in order to maximize the range of sounds captured from a single voice actor.
Get The Right Tools For The Job
While it may seem that all you need to do is remix the sounds from an audio recording in order to form a new sentence, it’s not that easy. Although anyone can get their hands on cheap (or even free) audio mixing software, developing a program that can understand questions and respond appropriately is a lot more difficult.
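Even the “simple” remixing step has subtleties. In concatenative synthesis, recorded sound units are stitched together, and the joins have to be smoothed or the result sounds choppy. The sketch below is a toy version of that idea: the sample values are plain numbers standing in for audio, and the two-sample crossfade is a simplified stand-in for the signal-processing real systems do at each seam.

```python
# Toy concatenative-synthesis sketch: join two recorded units
# (lists of samples) with a short linear crossfade at the seam.
def crossfade_join(a, b, overlap=2):
    """Concatenate sample lists a and b, blending `overlap` samples."""
    if overlap == 0 or not a or not b:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    # Fade out the tail of `a` while fading in the head of `b`.
    blended = [
        t * (1 - i / overlap) + s * (i / overlap)
        for i, (t, s) in enumerate(zip(tail, b[:overlap]))
    ]
    return head + blended + b[overlap:]

# Two fake "units": a loud segment and a silent one.
joined = crossfade_join([1, 1, 1, 1], [0, 0, 0, 0])
print(joined)  # [1, 1, 1.0, 0.5, 0, 0]
```

The crossfade is why you can’t just paste clips end to end in a free audio editor and call it a day: without blending, every join produces an audible click.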
Think about it: Google handles over 5 billion searches each day, and Siri fields hundreds of millions of requests on top of that. How can you possibly anticipate all the wacky internet searches that come with a user base of that size?
The current ability of many such programs is mainly due to advances in how we train software. Many use either deep neural networks or Hidden Markov Models, and the math behind them is more than a little complicated. However, the main thing to note is that such training methods are able to cycle through thousands of possibilities and permutations with every new piece of information you feed them.
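The counting-and-updating idea behind that training can be sketched in a few lines. The example below is a plain bigram Markov chain over words, which is far simpler than the hidden-state models or deep networks the article mentions, and the three-sentence corpus is invented for illustration; but it shows the core mechanic: every new piece of text updates the model’s counts, and predictions come from the likeliest recorded transition.

```python
from collections import defaultdict, Counter

# A heavily simplified, Markov-style training sketch: count which
# word follows which, then predict the likeliest continuation.
def train(corpus):
    transitions = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            transitions[prev][nxt] += 1  # every sentence updates the counts
    return transitions

def most_likely_next(transitions, word):
    followers = transitions.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

model = train(["the cat sat", "the cat ran", "the dog sat"])
print(most_likely_next(model, "the"))  # 'cat'
```

Feed the model more sentences and the counts shift, which is the toy version of “cycling through possibilities with every new piece of information.” Real speech systems apply the same principle to sound units and acoustic features rather than words.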
Computers have come a long way from the room-sized machines of the 1960s that ran on punch cards. Now they are found in a majority of households and are an integral part of daily life. Today’s programs are becoming more and more human-like, finding their voice and seemingly gaining more awareness. As technology continues to progress, we are likely to see this trend toward life-likeness continue.