Automated Speech Recognition: Taking Stock
By: Stephane Attal (CEO, AskKinjo, www.AskKinjo.com)
There are significant business opportunities for speech recognition applications in the wireless market. This is especially true now that many provinces, states and countries require a hands-free voice solution to use a mobile phone while driving. In this two-part blog post I intend to explore the technological side of automated speech recognition (ASR), having dealt with the commercial side of ASR for mobile phones in my previous post.
Whatever the science fiction movie (2001: A Space Odyssey, Blade Runner, Star Wars, Star Trek, etc.), people are shown communicating with machines the way we communicate with other human beings. This is now getting much closer to reality! ASR technologies enable human interaction with computers and are finally reaching maturity, resulting in a proliferation of applications.
Work on voice interaction goes back to the late 1700’s with the development of speaking machines. Subsequently, in the 1880’s, both Bell and Edison came up with dictating machines. In the 1930’s Bell Laboratories introduced the first mathematical model of speech analysis and synthesis. From then on, phenomenal progress was made, from a machine understanding a few words from a known speaker to a device responding meaningfully to naturally spoken sentences from any speaker.
In the 60’s, ASR systems could handle 10-100 words in a very constrained set of conditions. By the 70’s this had risen to 100-1000 words, by the 80’s to 1000+ connected words, and then in the 90’s to unconstrained continuous speech with a very large vocabulary. There were of course many confusing instances; you have all had the same frustration with a machine that would understand “Recognize speech” as “Wreck a nice beach”. In the new millennium, asking for what one wants and getting it is a natural interaction with a system whose voice is closer to our own than to a machine’s.
This is an amazing achievement when you contemplate the fundamental challenge of voice recognition:
No two utterances of the same linguistic content from the same speaker are ever the same. In plain English, you never say the same words the same way, ever.
Consider the incredible number of parameters around a spoken word: speaking mode (isolated words or sentences), speaking style (reading or spontaneous), speed (clearly enunciating or zipping through), energy (whispering or shouting), vocabulary (practical set of 1000 words or a literary universe of 20000 words), background noise level, dialect, and accent. Moreover, to complicate matters, these words can be spoken into a mobile phone and heard over the air half the world away or straight into a high fidelity microphone.
The fact that today’s best speech systems understand any speaker, irrespective of dialects, accents, and head colds, is a testament to the incredible research and development efforts of the past 20 years. More is coming! In November 2008, IBM’s “Five Innovations That Will Change Our Lives” included three innovations around speech recognition, including “You will talk to the Web . . . and the Web will talk back”. That’s surfing the Internet, hands-free, by using your voice.
Speech recognition relies on the latest mathematical advances in acoustic modeling (the most likely word for a given sound) and language modeling (the next most likely word in the sentence, based on grammar), combined with hypothesis search (the next most plausible word). The first approach, using Hidden Markov Models and related data-driven statistical techniques, provided a significant breakthrough in the 70's and 80's. The last two approaches are the focus of intense research efforts.
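To make the interplay between these three pieces concrete, here is a minimal Python sketch. It is not how any particular product works: the two candidate transcriptions come from the "Recognize speech" example earlier in this post, and the probabilities, the names ACOUSTIC, LANGUAGE, score and best_hypothesis, and the lm_weight parameter are all invented for illustration. The acoustic score says how well the words match the sound, the language score says how plausible the words are as a sentence, and the search simply keeps the candidate with the best combined score.

```python
import math

# Hypothetical numbers, invented for illustration only.
# ACOUSTIC[h] stands in for P(audio | words): how well the words fit the sound.
ACOUSTIC = {
    "recognize speech": 0.40,
    "wreck a nice beach": 0.45,
}
# LANGUAGE[h] stands in for P(words): how plausible the word sequence is.
LANGUAGE = {
    "recognize speech": 0.01,
    "wreck a nice beach": 0.0001,
}


def score(hypothesis, lm_weight=1.0):
    """Combined log score: acoustic evidence plus weighted language evidence."""
    acoustic = math.log(ACOUSTIC[hypothesis])
    language = math.log(LANGUAGE[hypothesis])
    return acoustic + lm_weight * language


def best_hypothesis(candidates, lm_weight=1.0):
    """Hypothesis search reduced to its simplest form: keep the top scorer."""
    return max(candidates, key=lambda h: score(h, lm_weight))


if __name__ == "__main__":
    candidates = list(ACOUSTIC)
    # With the language model switched on, the sensible sentence wins.
    print(best_hypothesis(candidates))
    # With it switched off, the slightly better acoustic match wins instead.
    print(best_hypothesis(candidates, lm_weight=0.0))
```

With these made-up numbers, the two phrases sound almost alike, so the acoustic score alone prefers the wrong one; it is the language score that tips the balance. A real system searches over a vast number of partial hypotheses rather than two complete ones.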
Want a headache?
The few keystones in this ongoing march to success include classification and regression trees, artificial neural networks, hidden Markov models, multivariate Gaussian distributions, nonparametric estimation, stochastic processes, clustering and finite state automata theory!
Got it?
Part 2 - What’s Next?
Author: Stephane Attal is CEO of AskKinjo Inc. (www.Askkinjo.com)
AskKinjo, a Canadian mobile location based service company, uses speech recognition to communicate with drivers and provide real-time information, such as nearby gas prices, traffic, parking lots and various points of interest. For more check www.AskKinjo.com
Posted by: Yoram Shalmon
