My idea is to use a speech-to-text converter and train a model to recognize my sentences (regardless of tone, etc.)… I would most likely ask my friends to read some random sentences too, both to build up that data set and to see how well it recognizes different words regardless of accent or tone.
Then I would collect a lot of movies/TV shows (I think just in English to make it easier on myself) and have the voice input bring up clips of those specific quotes. So imagine if you said “How you doin?” and a clip of Joey from Friends came up and said the iconic line back to you. This requires text matching across two sets of text: the transcript of what you said and the lines from the movies and shows.
The whole idea came about when I thought about my terrible habit of taking gifs of my favorite movies and making them into WeChat stickers. But sometimes I can only remember the line. I’m so bad with names and faces that I’ll just type in the quote and hope it’s iconic enough for the scene to come up, but if I am a couple of words off, I’m completely stuck.
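Being a couple of words off is exactly the case a fuzzy text match is meant to handle. Here is a rough sketch of how that step might look, using Python's built-in difflib. The function names, the tiny quote list, and the 0.6 cutoff are all placeholders I made up for illustration; a real version would run against the full scripts and would probably need a smarter similarity measure.

```python
import difflib
import re


def normalize(text):
    """Lowercase and drop punctuation so spoken text and script lines compare fairly."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def best_match(spoken, quotes, cutoff=0.6):
    """Return (quote, score) for the closest quote, or None if nothing clears the cutoff.

    `cutoff` is an arbitrary starting value, not a tuned threshold.
    """
    if not quotes:
        return None
    target = normalize(spoken)
    # Score every candidate quote by character-level similarity (0.0 to 1.0).
    scored = [
        (difflib.SequenceMatcher(None, target, normalize(q)).ratio(), q)
        for q in quotes
    ]
    score, quote = max(scored)
    return (quote, score) if score >= cutoff else None


# Hypothetical stand-in for lines pulled from movie and TV scripts.
quotes = [
    "How you doin?",
    "We were on a break!",
    "I'll be back.",
]

# The speech-to-text output is a couple of words off from the real line.
print(best_match("how are you doing", quotes))
```

Comparing against every line like this is fine for a handful of quotes, but with whole scripts it would get slow, so some kind of index or pre-filter would likely be needed.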
This project is personal to me because I love movies. I have seen so many that at times I get into arguments with my friends about the details (usually it’s about when something happened on the Marvel timeline or something). Imagine having this tool just there to help prove to all your friends that you are right!
The main challenge I see with this project is getting a hold of the videos. Netflix doesn’t allow downloading full films, but there are ways around this. The scripts of most films are available online… (most notably the Bee Movie… because that gem… I mean who wouldn’t want to have just the entire script). But for TV shows, it may be more difficult.
In terms of references, when I met with Aven we talked about similar projects including a sample that takes a random squiggle and matches it to landforms within the trained data set. Imagine that happening with audio and matching with a video.
Same logic, just a slightly different implementation.