
Insight

A Platform for Diverse Voices


It Speaks!: How Computers Get Their Voice

January 5, 2022 by admin

Many of us are familiar with the IVR voice talent heard on automated phone systems, or with the increasingly human-like responses of virtual assistants such as Alexa and Siri, but few of us know how those voices came to be. With fully automated programs, such as the artificial intelligence (A.I.) seen in sci-fi, it may seem improbable or nearly impossible that they can respond to just about any question without hesitation.

The process of curating and generating these voices is complex and labor-intensive, and it certainly takes more than a single recording session. On top of that, such programs have to be able to “read” any piece of text thrown at them, no matter the subject matter. Over the years, software developers have also been doing their best to make A.I.s seem human-like, both in how they respond and in how they sound. Although to some it may cross into uncanny-valley territory, the fact of the matter is that as voice synthesis software has continued to develop, the voices on everything from Global Positioning System (GPS) devices to the latest action movies are increasingly able to pass as human. So how do they do it?

Break It Down

All speech, whether it comes from an organic vocal tract or a computer’s speakers, is made from small units of sound. The technical name for these units is “phonemes.” The individual speech sounds are catalogued in the International Phonetic Alphabet (IPA), in which each symbol represents a single type of sound. However, it is remarkably difficult to produce some of these sounds in isolation, and even more difficult to create them artificially in a way that sounds accurate to the human ear.
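To make the idea of sound units concrete, here is a minimal sketch in Python. The tiny pronunciation table and the `to_sounds` helper are hypothetical, and the IPA transcriptions are rough approximations of one English accent rather than output from any real speech system.

```python
# Toy pronunciation table: each word maps to a list of IPA symbols.
# The entries are illustrative approximations, not data from a real lexicon.
PRONUNCIATIONS = {
    "cat": ["k", "æ", "t"],
    "ship": ["ʃ", "ɪ", "p"],
    "think": ["θ", "ɪ", "ŋ", "k"],
}


def to_sounds(sentence):
    """Split a sentence into words and look up each word's sound units."""
    sounds = []
    for word in sentence.lower().split():
        if word not in PRONUNCIATIONS:
            raise KeyError(f"no pronunciation known for {word!r}")
        sounds.extend(PRONUNCIATIONS[word])
    return sounds


print(to_sounds("Think cat"))
```

Notice that “ship” has four letters but only three sounds in this table, which is why speech software works with sound units rather than spelling.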

So how do programmers and developers get past this? The simplest solution is to bring in experienced voice actors and have them read a variety of texts. This description is, of course, deceptively simple. With over 160 symbols in the IPA and thousands of ways to combine them, even into simple consonant-vowel pairs, recording the whole scope of sounds a human is capable of making in a single session is nearly impossible.

Does This Sound Right?

To be completely fair, no language in the world uses the full suite of 160 IPA symbols; in fact, most use only a small proportion of them. However, the increasingly multi-ethnic and global reach of the internet brings in many sounds and words that a language might not use on a regular basis. For example, monolingual speakers of English may find it challenging to distinguish the tones of tonal languages such as the Chinese languages and Vietnamese. So what happens if Siri is asked to read an article about tourist attractions in China that have Chinese names?

Therefore, trained voice actors often have to listen carefully to the source language and study how its words are pronounced in order to replicate them as closely as possible. They are often acutely aware of how they sound and will do their best to convey the correct pronunciation of a word while still sounding natural. This is, of course, no mean feat, and it involves regular updates and recording sessions that use a variety of texts in order to maximize the range of sounds captured from a single voice actor.

Get The Right Tools For The Job

While it may seem that all you need to do is remix the sounds from an audio recording to form a new sentence, it’s not that easy. Although anyone can get their hands on cheap (or even free) audio mixing software, developing a program that can understand questions and respond appropriately is a lot more difficult.
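The “remixing” idea is usually called concatenative synthesis. The sketch below shows only the stitching step, under loud assumptions: the `CLIPS` table, the sample values, and the `crossfade_join` and `synthesize` helpers are all made up for illustration, and real systems choose among thousands of recorded clips instead of one per sound.

```python
# Hypothetical recorded units: each sound maps to a short list of audio samples.
# The numbers are placeholders, not real audio.
CLIPS = {
    "k": [0.0, 0.4, 0.1],
    "æ": [0.2, 0.6, 0.6, 0.2],
    "t": [0.1, 0.5, 0.0],
}


def crossfade_join(left, right, overlap=1):
    """Join two clips, blending `overlap` samples so the seam is less abrupt."""
    if overlap == 0 or not left or not right:
        return left + right
    overlap = min(overlap, len(left), len(right))
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # shifts from the left clip to the right clip
        a = left[len(left) - overlap + i]
        b = right[i]
        blended.append((1 - weight) * a + weight * b)
    return left[:-overlap] + blended + right[overlap:]


def synthesize(sounds):
    """Stitch together the stored clip for each sound, in order."""
    audio = []
    for sound in sounds:
        audio = crossfade_join(audio, CLIPS[sound])
    return audio


samples = synthesize(["k", "æ", "t"])
print(len(samples))
```

Even in this toy version, the seams between clips need blending, which hints at why naive cut-and-paste audio tends to sound robotic.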

Think about it: by some estimates there are over 5 billion searches on Google each day, and around 500 million of them are reportedly made through Siri. How can you possibly anticipate all the wacky internet searches that come with a user base of that size?

The current ability of many such programs is mainly due to advances in how we train software. Many use either deep neural networks (DNNs) or hidden Markov models (HMMs), and the math behind them is more than a little complicated. The main thing to note, however, is that the training process is able to cycle through thousands of possibilities and permutations with every new piece of information that you feed it.
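As one illustration of what weighing many possibilities can look like, here is a toy hidden Markov model decoded with the Viterbi algorithm, which finds the most probable sequence of hidden states behind a series of observations. Every state, observation, and probability below is invented for the example. Real speech models are far larger and learn their numbers from recordings rather than having them written by hand.

```python
STATES = ["consonant", "vowel"]

# All probabilities are invented for illustration.
START = {"consonant": 0.6, "vowel": 0.4}
TRANSITION = {
    "consonant": {"consonant": 0.3, "vowel": 0.7},
    "vowel": {"consonant": 0.6, "vowel": 0.4},
}
EMISSION = {
    "consonant": {"quiet": 0.8, "loud": 0.2},
    "vowel": {"quiet": 0.1, "loud": 0.9},
}


def viterbi(observations):
    """Return the most probable hidden-state sequence for the observations."""
    # Each column maps a state to (probability of the best path ending there,
    # previous state on that path).
    best = [{s: (START[s] * EMISSION[s][observations[0]], None) for s in STATES}]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            column[s] = max(
                (best[-1][p][0] * TRANSITION[p][s] * EMISSION[s][obs], p)
                for p in STATES
            )
        best.append(column)

    # Trace back from the most probable final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


print(viterbi(["quiet", "loud", "quiet"]))
```

With these made-up numbers, a quiet-loud-quiet stretch of audio is decoded as consonant, vowel, consonant.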

Computers have come a long way from the huge monstrosities of the 1960s that ran on punch cards. They are now found in a majority of households and are an integral part of life. Today’s programs are becoming more and more human-like, finding their voice, and seemingly gaining more and more awareness. As technology continues to progress, we are likely to see this trend toward life-likeness continue.

