Speech-Enabling Our Future One Step at a Time

Speech recognition has come a long way from its humble roots in the 1950s, when a machine was built to recognize the numbers one to nine. Today we can turn our voices into text, speak to automated services to get through to the right people, and program our Siris, Alexas, and Cortanas to be our personal assistants, saving us time on so many of our menial tasks.

But this is all very English-oriented. What do these advances mean for the multitude of other languages that the world speaks?

Automatic Speech Recognition

Okay, so not all automatic speech recognition (ASR) is purely in English. Google Cloud Speech-to-Text allows developers to convert audio to text in 120 languages, and there is a rapidly growing market of start-ups tailoring multilingual ASR for everything from telecommunications to transcription, and just about everything in between.

But multilingual isn’t happening all at once with speech recognition. It’s not so easy to talk to a device in, say, Russian and have it produce text in Spanish. It’s either a case of speaking in your mother tongue and having the words appear in front of you, or a limited straight translation between one language and another. To be truly effective, the speech-enabled technology of the future will have to give us whatever we want in all the languages we speak.

What is speech-enabling?

Speech-enabling is a catch-all term for making our lives easier by having the technology around us do what we want just by asking it. We can talk to our tablets to get them to, for example, type out our thoughts, and we can have entire websites read aloud to us if, say, we can’t read their pages or simply prefer to listen.

Our virtual assistants like Siri have streamlined grocery shopping and choosing mood-appropriate playlists. Speech recognition in our cars has made making calls truly hands-free. And we can control the air-conditioning and heating in our homes without so much as pressing a button. There is so much speech-enabling can do for us that we’re spoilt for choice (and greedy for more!).

Speech-enabled conversations

If a speech-enabled future is what we’re aiming for, think of the possibilities in terms of language and translation. We could reach an international client base without having to translate a single sales pitch or press release, and we could go to a restaurant in any country and order the exact dish that we want without tripping over our own tongues. Seeking healthcare when overseas wouldn’t seem half as daunting as it currently does for some. And okay, going through passport control and customs wouldn’t be any more exciting, but at least we might be able to get the officers to crack a smile if we use an interface to speak to them in their own tongue.


The problem

Speaking one language and having it come out as another sounds like something from Doctor Who or Futurama, and our technology doesn’t sound quite as exciting as a TARDIS or any one of Farnsworth’s creations. Application programming interfaces (APIs), for example, are essentially the sets of rules that determine how the different components of software interact. For our interfaces to be able to recognize all our languages, they first need to learn them. And before we put an image in your head of a classroom full of student robots, that’s not quite what we mean!

The machine learning behind these interfaces needs to be able to recognize more than just the vocabulary that makes up an individual language. It needs to know the language’s semantics, every aspect of its grammar, and even each individual phoneme that makes up its sound system.

As well as this, effective ASR will need to recognize things like idiomatic expressions and colloquialisms, which are also hurdles for the everyday human learner. But whereas a human learner can identify features like loanwords and cognates, a machine is going to find it even more difficult to tell them apart.

Language Diversity

Finding one standard solution to conversing in every language on our planet without having to learn them all for ourselves is never going to be an overnight discovery. Companies within the speech-enabling market are creating and concentrating on their niches, whether that be the widest range of languages available for communication, or difficult dialects for educational use in areas with low literacy.

The ultimate language recognition software will not only be able to distinguish between each individual language and accurately translate it, with every nuance and turn of phrase, into every other language, but will also be able to understand and replicate the different dialects of those languages as well. We can’t wait to see the kind of technology that will be able to do that!