How AI Dubbing Will Change Everything: Building A Video Translator
There is a very interesting piece of content we are looking at today. A team from IIIT-H (International Institute of Information Technology, Hyderabad) has built an AI which translates and lip-syncs a video from one language to another.
Here at Video Translator we are obviously interested in the work these scientists are doing. Our approach is a little bit different, and we cannot do lip syncing, so maybe it's a moot point.
Video Translator: Face-To-Face Translation
So what is the team at IIIT-H doing? From the paper,
only provide textual transcripts or translated speech for talking face videos to also translate the visual modality i.e. lip and mouth movements. Consequently, our proposed pipeline produces fully translated talking face videos with corresponding lip synchronisation.
So this is really very cool. What is happening here is:
- First there is a text-to-text translation happening
- Next there is a speech-to-speech translation happening
- This gets added to the visual translation (which is the bit the IIIT-H team worked on)
Together these give what the researchers are calling Face-To-Face Translation. One member of the research team has a YouTube channel, and a sample from it is below.
While the technology is very cool, it's not something we really do. The approach we have taken is quite different.
Video Translator: Speech-To-Speech Translation (Our App!)
So how is this technology different to what is happening here at video translator?
The main difference is that we expect human intervention at each stage. That is:
- Do a Speech-To-Text AI Transcription (human post-editing expected)
- With the transcript do a Text-To-Text Translation (human post-editing expected)
- With the translated transcript, do a Text-To-Speech Dubbing (human post-editing expected)
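To make the three steps above a little more concrete, here is a minimal Python sketch of a pipeline where every AI stage is followed by a human post-editing step. This is an illustration only, not our production code: the class name `DubbingPipeline`, the stage callables and the `post_edit` hook are all hypothetical, and the real speech-to-text, translation and text-to-speech engines are left as pluggable functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical types for illustration; not a real library API.
Stage = Callable[[str], str]          # an AI step: input in, draft out
PostEdit = Callable[[str, str], str]  # a human step: (stage name, draft) -> approved text


@dataclass
class DubbingPipeline:
    transcribe: Stage   # speech-to-text, e.g. audio path -> transcript
    translate: Stage    # text-to-text, transcript -> translated transcript
    synthesize: Stage   # text-to-speech, translated transcript -> dubbed audio reference
    post_edit: PostEdit # called after every stage so a human can fix the draft
    log: List[str] = field(default_factory=list)

    def _run_stage(self, name: str, stage: Stage, payload: str) -> str:
        draft = stage(payload)
        approved = self.post_edit(name, draft)
        # Record whether the human changed the AI draft at this stage.
        status = "edited" if approved != draft else "accepted"
        self.log.append(f"{name}: {status}")
        return approved

    def run(self, audio_path: str) -> str:
        transcript = self._run_stage("transcription", self.transcribe, audio_path)
        translation = self._run_stage("translation", self.translate, transcript)
        return self._run_stage("dubbing", self.synthesize, translation)
```

In this sketch the human reviewer sees the output of each stage before the next one starts, so a transcription mistake is fixed once rather than being carried into the translation and the dub.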
Generally this means the end-to-end flow is better suited to assets which are expected to be online for a long time. Hence we put the extra effort into making the asset really nice.
What do such assets look like? This is an English video a client recently provided us.
This is the AI Vietnamese version.
Obviously the work that IIIT-H has done is a scientific paper, whereas you can try our technology for free, because it's a production SaaS app.
Cheap too! :)
Which Is Better?
Clearly we are biased, and we think our approach is superior. But let's talk about why.
Our clients report that you always want to over-disclose with AI.
Ok, here is what is happening.
You always want to tell people it is an AI. This is because if they don't know, most people feel like you are trying to fool them somehow. And then they react badly.
Basically AI Dubbing is pretty good, but it's NOT that good. So a human will always work out that something is up. If you do not disclose, people get cranky.
Disclosing that it's an AI is very good. Mostly because normal people (outside tech) are excited about technology, so will (paradoxically) pay more attention. That is, you get a win when you disclose, and people are cranky when you do not disclose.
The lip-syncing is a very cool feature, but comes awfully close to fake news, and communities worldwide have deep concerns (totally legitimate concerns too!) about fake news.
No really - disclose that you are using an AI!
We think you need standards/regulation, and lip syncing with AI is probably not going to reassure the community. That being said, if properly disclosed, there is almost certainly a place for this new technology.
We wish the team at IIIT-H the best, and hopefully we see their tech out in the wild sooner rather than later.
Best of luck gents, and very nicely done!