Leading breakthroughs in speech recognition software at Microsoft, Google, IBM
Groundbreaking work on speech recognition software by the University of Toronto's Department of Computer Science (DCS) is transforming Microsoft, Google and IBM.
At a conference in Asia recently, Microsoft's Chief Research Officer demonstrated an almost instantaneous translation of spoken English to Chinese speech, with software that maintained the sound of the speaker's voice. It was the latest in a series of breakthroughs in the field involving U of T faculty and students.
"A few years ago, researchers at Microsoft Research and the University of Toronto came together to develop another breakthrough in the field of speech recognition," Rick Rashid told the crowd. "The idea that they had was to use a technology in a way patterned after the way the human brain works: it's called deep neural networks.
"That one change, that particular breakthrough increased recognition rates by approximately thirty percent. That's a big deal."
The breakthrough gives the computer better recognition of what are called phonemes, the small units of sound that make up speech, and it has led to a reduction in errors, said Rashid.
"That's the difference between going from 20 to 25 per cent errors - or about one out of every five words - to roughly 15 per cent errors, or roughly one out of every seven or perhaps one out of every eight words," Rashid said. "It's still not perfect, there's still a long way to go, but I think you can see that we have already made a significant amount of progress in the recognition of speech."
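The arithmetic behind Rashid's figures can be checked directly: a roughly 30 per cent relative reduction applied to a 20 to 25 per cent word error rate lands near 15 per cent, or about one misrecognized word in seven. A quick sketch (function name and values are illustrative, not from the researchers' code):

```python
def relative_reduction(old_wer, reduction=0.30):
    """Apply a relative error-rate reduction to an absolute word error rate."""
    return old_wer * (1 - reduction)

# Starting WERs of 20% and 25%, each cut by ~30% relative.
for old in (0.20, 0.25):
    new = relative_reduction(old)
    print(f"WER {old:.0%} -> {new:.1%} (~1 word in {round(1 / new)})")
```

Note that the 30 per cent figure is relative to the old error rate, not an absolute drop of 30 percentage points, which is why 20 per cent errors become roughly 14 per cent rather than zero.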
DCS research in speech recognition is conducted by Professors Geoffrey E. Hinton (Machine Learning) and Gerald Penn (Computational Linguistics), with this latest breakthrough drawing on Hinton's deep neural networks.
Graduate students Abdel-rahman Mohamed and George Dahl began collaborating in 2009, applying deep neural networks to speech recognition. (Artificial neural networks are simplified mathematical models of neural circuits in the human brain.)
"Even before I started my PhD at U of T with Gerald Penn, I was always thinking about how I might make a breakthrough in the speech recognition field," said Mohamed, "bringing Automatic Speech Recognition (ASR) technology closer to the end users."
Inspired by one of Hinton's lectures on deep neural networks, Mohamed began applying them to speech, but deep neural networks required too much computing power for conventional computers, so Hinton and Mohamed enlisted Dahl. A student in Hinton's lab, Dahl had discovered how to train and simulate neural networks efficiently using the same high-end graphics cards that make vivid computer games feasible on personal computers.
"They applied the same method to the problem of recognizing fragments of phonemes in very short windows of speech," said Hinton. "They got significantly better results than previous methods on a standard three-hour benchmark."
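The core idea can be sketched in a few lines: a feed-forward network takes a short window of acoustic feature frames and outputs a probability for each phoneme class. The sketch below is purely illustrative (random weights stand in for trained parameters, and the layer sizes and class count are invented), not the researchers' actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

N_FRAMES, N_FEATS, N_PHONES = 11, 13, 40   # hypothetical window and class sizes
HIDDEN = 64

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random weights stand in for parameters a real system would learn.
W1 = rng.normal(0, 0.1, (N_FRAMES * N_FEATS, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W3 = rng.normal(0, 0.1, (HIDDEN, N_PHONES))

def phone_posteriors(window):
    """Forward pass: a window of feature frames -> phoneme posteriors."""
    h = relu(window.reshape(-1) @ W1)   # flatten the window into one vector
    h = relu(h @ W2)                    # a second hidden layer makes it "deep"
    return softmax(h @ W3)              # probability for each phoneme class

window = rng.normal(size=(N_FRAMES, N_FEATS))  # stand-in acoustic features
p = phone_posteriors(window)
print(p.argmax(), p.sum())  # most likely phoneme id; probabilities sum to 1
```

In a real system these per-window phoneme probabilities are then combined by a decoder into word hypotheses; the GPU speedup Dahl found matters because training the weight matrices over many hours of speech is the expensive step.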
Dahl and Mohamed presented the results of their work at a 2009 Neural Information Processing Systems (NIPS) workshop to a mixed reaction.
"Many participants in the workshop were excited about our results," recalled Dahl, "but at the time there was a lot of healthy skeptical concern that our results might not translate into similar gains on more realistic speech recognition problems."
Researchers at Microsoft, however, were interested enough to invite both students to internships at Microsoft Research in Redmond the following year. There, Mohamed and Dahl successfully applied their methods to larger speech tasks, involving much larger vocabularies.
Fellow CS graduate student Navdeep Jaitly also became involved in the research, and worked with Google to implement it in their system. Google now uses a deep neural network for voice search in the Android 4.1 operating system, its answer to the iPhone's Siri conversational agent.
"I was expecting this move," said Mohamed, "given the great results our model achieved consistently on so many benchmarks."
Dahl continued: "It is very gratifying, particularly because there was a lot of initial resistance from the speech community to using deep neural networks for acoustic modeling."
Today, most top speech labs are embracing the technology, including IBM, a long-time leader in speech recognition research, with whom Mohamed has also worked on this topic. Penn's speech lab has also since developed an alternative neural network model in collaboration with York University Professor Hui Jiang and graduate student Ossama Abdel-Hamid. Abdel-Hamid has also worked on neural networks at Microsoft Research.
And the U of T researchers say the new business opportunities they've helped create are just the beginning. Hinton's lab has already applied deep neural networks to several other pattern recognition problems. And Penn's speech lab is in the process of digitizing the last 23 years of CBC NewsWorld video to develop search algorithms for large collections of speech.
Unlike Google voice search, which uses voice queries for searching web pages of text, this work uses text queries to search through speech data for related news coverage or interviews.
"This is important not just for speech researchers," said Penn, "but for journalists, historians and anyone else who is interested in documenting the Canadian perspective on world affairs. Having all of this data around is great, but it's of limited application if we can't somehow navigate or search through it for topics of interest."
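The inversion Penn describes, text queries pointing back into audio, rests on keeping a timestamp with every transcribed segment. A minimal sketch of the idea, using invented example data (not CBC NewsWorld content) and hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_sec: float   # where this stretch of speech begins in the recording
    text: str          # its automatic transcription

# Toy transcript index; a real system would build this from hours of ASR output.
transcript = [
    Segment(0.0,  "good evening and welcome to the news"),
    Segment(12.5, "today the prime minister announced a new policy"),
    Segment(47.2, "in other news researchers demonstrated speech translation"),
]

def search(query, segments):
    """Return (timestamp, text) for segments containing all query terms."""
    terms = query.lower().split()
    return [(s.start_sec, s.text) for s in segments
            if all(t in s.text.lower() for t in terms)]

print(search("speech translation", transcript))
```

Unlike a web search, the result is a position in a recording, so a journalist could jump straight to the relevant moment of an interview; the hard research problems are upstream, in transcribing the archive accurately enough for such lookups to work.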