Week 10

7 April: Automatic Speech-to-Text (STT) Recognition

 

Subject:  We now pass between text and speech on a regular basis, using Siri or other voice assistants or dictating messages in WhatsApp.  Week 10 will include a discussion of speech-to-text recognition and a demonstration of NYU Stream and its ability to create subtitles (if you go to the short videos I made last week, you will notice the subtitles that I created for them).  These subtitles serve an interesting double purpose: they make videos accessible, as we learned in our section on accessibility, and they generate transcripts that we can use for other purposes such as text analysis.

Format:  Class on Tuesday will be mixed delivery, consisting of pre-recorded parts, a short synchronous meeting and a quick follow-up in a forum.  There will be no synchronous class on Thursday, but attendance will be counted via your participation in the forum in the form of “quick writing” (a few sentences in response to a prompt).

The “audio book” is a very important media form, not only for those who find reading a printed book difficult, but also in the context of social distancing in Spring 2020. Many people have found solace in listening to books when they are not able to read as they normally would.  Human-read audio books provide access to books, but not all of them are free (even if many providers have opened access to them during the COVID pandemic).  Some sites always offer audio books for free.  For English, OpenCulture.com has hundreds of audio books to choose from, as well as movies, lectures and lists of recommended books by famous people.  Many of these can be downloaded from Soundcloud, although not all books are available as free downloads. You can also check out librivox.org, Lit2Go, LoyalBooks, MindWebs (for speculative fiction) and StoryNory (for kids).  There are also Spotify, YouTube and podcast sites. For political speeches, try history.com or archive.org.

Project Gutenberg even has a way for its plain text, public domain books to be read aloud.  Listen to a part of this book, “The South Pole”, as a computer-generated audio book.  Would you enjoy listening to literature (or any text) read aloud this way?

Last week we tested the deep learning environment for turning handwriting into machine-readable text.  This week we are going to test the speech-to-text algorithm of NYU Stream to see how well it handles different kinds of input: an audio book, a podcast and speech in different accents of English.

To prepare before class: Please take a look at the sites above–look in languages other than English if you know any–and make a recommendation for a set of three books (not available in Project Gutenberg or elsewhere on the web in open format) that you would like to transform into text if you wanted to “read them like a computer.”

To read before class: 

Are audio books as good for you as reading? (Heid)
If we all end up sounding like Americans, you can probably blame voice assistants (Olyeinka)

To watch/listen before class:

Listen to a part of this book “The South Pole” as a computer-generated audio book.
Reading Call Center phone calls like a computer
What is Cloud Speech API?
Voice Recognition Elevator in Scotland

Our class experiment with NYU Stream:

We are going to do two things with NYU Stream this week.  First, we will take an audio recording (not included in Project Gutenberg) and test to see how well we might use automatic captioning to create plain text from it.

The audio recording I chose is an episode of “World View – The Foreign Affairs Podcast” from the Irish Times about the burning of Notre Dame cathedral in Paris. I have trimmed the podcast to the relevant 18 minutes.  You can find it here and also in NYU Stream here with captions. The captions have been placed in the data folder of our class Drive.
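To turn a caption file into plain text for analysis, the cue numbers and timestamps have to be removed first. The following is a minimal sketch in Python, assuming that the captions were downloaded in SRT format (an assumption; check which format your download actually uses). The function name srt_to_text and the sample captions are invented for illustration and are not taken from the podcast.

```python
import re

# Matches SRT timing lines such as "00:00:01,000 --> 00:00:04,500".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)

def srt_to_text(srt: str) -> str:
    """Drop cue numbers and timing lines; join the caption text into one string.

    Limitation: a caption line consisting only of digits (e.g. "2020")
    is treated as a cue number and dropped.
    """
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        kept.append(line)
    return " ".join(kept)

# Invented sample captions, for illustration only.
sample = """1
00:00:00,000 --> 00:00:03,200
Hello and welcome to the show.

2
00:00:03,200 --> 00:00:06,900
Today we look at a cathedral.
"""

print(srt_to_text(sample))
```

To use it on a real file, read the file into a string first, for example with open("captions.srt", encoding="utf-8").read(), where the file name is again only a placeholder.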

Second, we will see how well the speech-to-text algorithms work for a variety of different audio recordings, global English accents and recording conditions. I have downloaded and trimmed some speech clips: one in an Australian accent, another in Cockney and a third in the computer-generated voice from Project Gutenberg above.  I have uploaded segments of them into NYU Stream and requested captioning to see how well STT works.

Inside the Ropes #157 (Lucas Michael and Mark Unwin) – Australian golf podcast – Stream
The computer-generated audio book “The South Pole” – from Project Gutenberg – Stream
Ray Winstone talks about boxing and acting in a Cockney accent – Stream

Listen to the captioned segments in NYU Stream and note how differently the STT algorithm deals with different voices.
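One common way to put a number on "how differently" is the word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the machine captions into a hand-corrected transcript, divided by the length of the corrected transcript. The sketch below is an illustration in Python, not something NYU Stream provides; the function names and the two example sentences are invented.

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase and keep only letters, digits and straight apostrophes,
    # so punctuation and capitalization do not count as errors.
    return re.findall(r"[a-z0-9']+", text.lower())

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Standard dynamic-programming edit distance, computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Invented example: what was said versus what the captions show.
corrected = "The boxer trained every morning"
captions = "the box are trained every morning"
print(word_error_rate(corrected, captions))
```

A lower value means the captions are closer to what was actually said. Comparing the value across the three clips gives a rough measure of how much harder one voice is for the algorithm than another.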

 

 

Quick writing: Similar to call center conversations that begin “THIS CALL MAY BE RECORDED FOR QUALITY ASSURANCE PURPOSES”, today’s recorded Zoom lessons now begin with the message “THIS CALL IS BEING RECORDED”, to which everyone on the call must consent by clicking on a pop-up box.  A lot of articles have emerged in recent days about Zoom not respecting user privacy. What do you think about privacy and the increase in voice recordings in an era when social distancing is being enforced?  Where do you use voice recognition algorithms in your daily life? What do you think the possibilities are for someone “reading” your socially distanced conversations “like a computer”, the way that the YouTube video used R to analyze call center calls?  Find another participant in the class and add a few comments on their post as well.

Articles about Zoom:
Singapore Bans Teachers Using Zoom
Is it Safe to Use Zoom?
Using Zoom from Working from Home?

 

Blog 3: For the third blog, I would like you to reproduce the exercise that we have done in class this week with audio books.  You should (1) choose a downloadable audio book that interests you (in any language that is supported by NYU Stream).  Choose a text that is not included in Project Gutenberg, trim a chapter or a segment from it and try creating captions using NYU Stream. You will need to wait a bit for the captions to be generated and made available from the download/attachments pull-down bar. Analyze the results using any tools you know how to use.

Then (2) pick a podcast or audiocast in English with a pronounced accent, trim a segment of about 5 minutes and try to make captions for it.  I have listed some possibilities below. Your blog posting should, as usual, be about 500 words, include visuals and make good use of the affordances of WordPress. If you would like some help in choosing your two audio recordings, be in touch or attend virtual office hours (due: Sunday, 19 April).
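If you are not sure where to start with analyzing a transcript, counting the most frequent words is one simple option. The sketch below is only one possibility, written in Python; the function name top_words, the very short stopword list and the example sentence are all invented for illustration, and any tool you already know works just as well.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stopword list; extend it for real use.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def top_words(text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent words in text, ignoring stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

# Invented example text standing in for a caption transcript.
transcript = "The fire and the roof. The roof fell."
for word, count in top_words(transcript, 3):
    print(word, count)
```

Running the same count on the machine captions and on a hand-corrected version can also show which words the algorithm tends to get wrong.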

 

Here are some other interesting examples I found (and Soundcloud seems to have many more): 

(1) The Late Wire EP 5 (interview with Raffy Akinwande) – Nigerian social podcast
(2) Iraq Matters #30: (interview with Moussa AlNasari) Remembering Mutanabbi Street 10 Years Later
(3) SG Explained (Willy, Elliot, Rovik) Talking about Racism – Singaporean “regular guys” podcast
(4) Scotland Outdoors, Mark and Euan Visit the Mysterious Goblin Ha’ – BBC Radio Scottish nature podcast
(5) Chini Ya Maji podcast (interview with Don Okoth) – Kenyan podcast on startup culture
(6) Cornish Soccer Talking Football (interview with Andy Watkins) – football podcast from the SW United Kingdom
(7) AWR Colloquial English Sudan – a Christian podcast from Sudan

 

Short video instructions for the assignment:

Trimming audio with QuickTime Player (5 min)
How to upload an mp3 to NYU Stream and request subtitles/captions (correct, and download) (2 min)
My explanation of how you can find the text of your subtitles/captions for viewing or reuse (4 min)