How To Transcribe Lots of Different Accents Accurately

Today, we will look at how to transcribe many different accents correctly. First off, what does this question even mean in terms of AI?

The video we are working with is from Institute for Communication in Health Care (ICH) at the Australian National University (ANU).

Looking at the ICH’s mission statement, “The ICH’s mission is to build an international, collaborative research and training hub in healthcare communication to improve patient safety and quality of healthcare practice around the world…”

This goal aligns with what Video Translator is doing, but from the point of view of this post, there are a number of different English accents spoken in their video, which makes it an ideal test case.

We followed this process to get the output:

  • Use the Speech-To-Text AI Transcription to transcribe the video.
  • Fix up the results of the transcription.
  • Use the AI Translation from English to Chinese to translate the video.
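The three steps above form a human-in-the-loop pipeline: machine transcription, manual fix-up, then machine translation. The sketch below is purely illustrative - the function names, caption structure, and sample text are assumptions for this post, not the platform's actual API.

```python
from dataclasses import dataclass

@dataclass
class Caption:
    start: float  # seconds
    end: float
    text: str

def transcribe(video_path: str, dialect: str) -> list[Caption]:
    """Hypothetical stand-in for the Speech-To-Text AI Transcription step."""
    # A real implementation would send the video to a speech-to-text
    # engine configured with the chosen dialect model.
    return [Caption(0.0, 3.5, "welcome to the institoot")]

def fix_up(captions: list[Caption], corrections: dict[str, str]) -> list[Caption]:
    """Human review pass: apply manual corrections to the raw transcription."""
    return [Caption(c.start, c.end, corrections.get(c.text, c.text)) for c in captions]

def translate(captions: list[Caption], glossary: dict[str, str]) -> list[Caption]:
    """Hypothetical stand-in for the AI Translation step."""
    return [Caption(c.start, c.end, glossary.get(c.text, c.text)) for c in captions]

# The three steps, in order: transcribe -> human fix-up -> translate.
raw = transcribe("ich.mp4", dialect="en-GB")
fixed = fix_up(raw, {"welcome to the institoot": "Welcome to the Institute"})
translated = translate(fixed, {"Welcome to the Institute": "欢迎来到研究所"})
```

The key design point is that the fix-up step sits between the two AI steps: correcting the transcription before translating means errors are not compounded by the translation model.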

Please note, we will not be fixing up the Chinese translation after the AI translation step, as this post covers the process itself. For a client-facing output, please work with a language-aware subject matter specialist to ensure a high-quality artefact at the end of your process.

ICH: English Original With Chinese Captions

First we will look at the process by which this video translation was completed. Then, we discuss some challenges and options around the video translation. Finally, we cover some extra options and how they can be used to meet your client’s requirements.


  1. The original video was sourced from the ICH website.

  2. Please direct your browser to the platform, and then click on the Login button. Select myTemplate, or your preferred template, and create a new item. Note: to follow this visual guide, you will require a template with only a video component.

  3. Once in the new item, please upload your video. In this example, we upload the ICH video, and your screen should look like the screenshot below.

    Upload the ICH video
  4. Next, click on Actions -> Transcribe, and you should see something like the screenshot below. We have selected English (United Kingdom) here; exactly why will be addressed in the discussion below.

    Trigger the Transcription, with UK English for our dialect choice
  5. After triggering this action, the platform will close this item and lock it. Once transcription is complete, it will automatically unlock.

  6. Open up the item and have a look at the captions. The main task here is to fix up the transcription. The heuristics section below covers some of the issues around this specific video. After fixing up, the end result is below.

    ICH: English Original With English Captions

  7. Sweet! Now that we have our transcription sorted, we can translate. Click Actions -> Translate to trigger translation of the video captions. This is a pretty simple process, and can be seen in the image below.

    Trigger the Translation, with Simplified Chinese for our language/dialect choice
  8. Once the translation process is complete, the result is available in the application, as shown below. Also, please note the Origin toggle, allowing a user to flick back and forth between the original and translated captions - this functionality is for use when a human translator is checking the work of the AI.

    Captions post translation, in Simplified Chinese
  9. The final version of the video is shown below. In a real workflow, please ask your subject matter expert to eyeball the translation and verify suitability for your stakeholders.

    ICH: English Original With Chinese Captions
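Caption tools commonly exchange subtitles as SubRip (.srt) files - whether this platform exports SRT is an assumption on our part, but the format itself is a de facto standard and is useful to understand when moving captions between the transcription and translation steps. A minimal parser sketch:

```python
import re

def parse_srt(srt_text: str) -> list[dict]:
    """Parse SubRip captions into {index, start, end, text} dicts.

    SRT blocks are separated by blank lines: an index line, a timing
    line ("HH:MM:SS,mmm --> HH:MM:SS,mmm"), then one or more text lines.
    """
    entries = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = lines[1].split(" --> ")
        entries.append({
            "index": int(lines[0]),
            "start": start.strip(),
            "end": end.strip(),
            "text": "\n".join(lines[2:]),
        })
    return entries

sample = (
    "1\n00:00:01,000 --> 00:00:04,000\nHello\n"
    "\n"
    "2\n00:00:04,500 --> 00:00:07,000\nLine one\nLine two\n"
)
captions = parse_srt(sample)
```

Keeping the timing fields untouched while only the `text` field is corrected or translated is what lets the Origin toggle line up original and translated captions.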


How do you work with many different dialects - specifically, which dialect AI should you use? In this video, we see the following features:

  • There are five people speaking, with the following accents: Australian English, American English, English as spoken in Hong Kong, and English as spoken by a person of European descent.
  • This is the crux of the problem: which AI should we use? While experimenting, we tried both English (Australian) and English (UK). The results were different, but it seemed like the English (UK) worked a little bit better - this is probably because the first speaker, Professor Diana Slade, does not have a classic Australian accent, but more of a mix of Australian English and British English.
  • For each of the speakers, the AI had trouble during a transition from one speaker to another. The first instance of this is at 0:18, which is a simple (and nifty) cut scene, but it totally threw the AI. In essence, the AI prefers a consistent tone and a single speaker.
  • Dr Elizabeth Rider’s section, from 1:11, worked fairly well. This is likely because the underlying AI has been trained on a significant volume of American content. The exact opposite was true for the voice-over by Dr Angela Chan. Additionally, switching between the Hong Kong English dialect and the European English dialect was ugly.
  • All in all, using the UK English model worked better, in terms of the number of post-AI corrections required. If there is a mix of accents, going with a base English, for lack of a better descriptor, is probably the way to go.
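"Number of post-AI corrections required" can be made concrete with word error rate (WER): the word-level edit distance between the raw AI output and your hand-corrected reference, divided by the reference length. Below is a standard WER implementation; the sample sentences are invented for illustration, not taken from the ICH video.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

To pick between dialect models, score each model's raw output against the same corrected reference and choose the one with the lower WER - that is the model that needs the fewest fix-ups.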


In this post we looked at some of the trade-offs between using different dialect AIs for transcription. This is a valid concern for English, Arabic and Spanish, because these languages have the largest number of possible dialects.
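Dialect choices like these are usually expressed as BCP 47 language tags: a lowercase language subtag plus an uppercase region subtag. A few tags relevant to the languages mentioned above - note that the exact set of dialects supported varies by speech engine, so treat this list as illustrative and check your provider's documentation:

```python
# BCP 47 tags: language subtag + region subtag (e.g. "en" + "GB").
DIALECTS = {
    "English (Australia)":      "en-AU",
    "English (United Kingdom)": "en-GB",
    "English (United States)":  "en-US",
    "English (Hong Kong)":      "en-HK",
    "Arabic (Egypt)":           "ar-EG",
    "Arabic (Saudi Arabia)":    "ar-SA",
    "Spanish (Spain)":          "es-ES",
    "Spanish (Mexico)":         "es-MX",
    "Chinese (Simplified)":     "zh-CN",  # script-based "zh-Hans" is also common
}
```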

The platform is currently in closed beta, while we work with early users to test/iron out issues. If you are interested in trying out our technology, please drop us an email at

Please connect with us on LinkedIn, YouTube or Facebook for any comments, questions, or just to keep up to date with the work we do!

We are very grateful for your support!