Researchers developing automatic speech recognition software for under-resourced languages

By Emily Scott
April 10, 2016

Preethi Jyothi thinks that in 50 years, the term “under-resourced language” should not exist.

To help make that happen, Jyothi, a postdoctoral fellow at Beckman Institute, and her colleague Mark Hasegawa-Johnson, faculty member of the artificial intelligence group at Beckman Institute, started a research project that aims to transcribe under-resourced languages in a way no one has ever attempted before.

They hope their efforts will give speakers of under-resourced languages access to the automatic speech recognition technology used on electronic devices, among other applications.

Hasegawa-Johnson said that while there is good speech recognition technology available for about 10 languages, the majority of languages spoken are still incompatible with automatic speech recognition.

“(For) the other 6,990 languages, there’s no reasonable audio technology that you can use in those languages because it’s really hard to recruit people to create the labeled speech data that you need in order to create it,” he said.

Jyothi explained that mainstream approaches to this problem build language-specific models, for which researchers collect audio of a spoken language and corresponding transcriptions from native speakers of the language.

“It’s a very expensive resource, especially if you’re trying to recognize languages which are minority languages or languages that are very hard to reach native speakers online,” Jyothi said.

In order to collect data about under-resourced languages inexpensively, but still effectively, Jyothi and Hasegawa-Johnson considered the fact that there are many common sounds across languages.

“Even if you play sound in a new language to someone who doesn’t speak the language, there is some useful information that is perceived by this non-native speaker of the language,” Jyothi said.

Their research involves playing audio files to non-native speakers of a language and asking them to write down English text that most closely matches what they heard. This information is then refined with algorithms and acts as a substitute for transcriptions from a native speaker. So far, they have worked with 10 different languages.
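The article does not describe the refinement algorithms themselves. As a rough illustration only, the sketch below shows one simple way several listeners' English-letter guesses for the same clip could be merged: it picks the response that is closest, in total edit distance, to all the others. The function names and the listener responses are hypothetical, and this is not presented as the researchers' actual method.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def consensus_transcript(transcripts):
    """Return the response with the smallest total edit distance to all responses.

    This is an assumed stand-in for the refinement step, not the published algorithm.
    """
    if not transcripts:
        raise ValueError("need at least one transcript")
    normalized = [t.strip().lower() for t in transcripts]
    return min(normalized, key=lambda t: sum(edit_distance(t, o) for o in normalized))


# Hypothetical responses from four non-native listeners to one short clip.
responses = ["namastay", "namaste", "nomaste", "namaste"]
print(consensus_transcript(responses))
```

Choosing an actual listener response, rather than voting letter by letter, keeps the sketch short because the responses do not need to be aligned first.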

Their approach could solve the commercial problem involved with building automatic speech recognition technology for under-resourced languages — a problem that Hasegawa-Johnson described as simple supply and demand.

He said some minority languages could have only 2,000 native speakers, and it’s possible that of those native speakers, very few of them have Internet access.

“Maybe there are a couple of people, but they’re busy. They’re doctors, bankers, lawyers, or whatever — they don’t have time to sit down and transcribe 100 hours of speech,” Hasegawa-Johnson said.

The smaller the market, the more expensive it becomes to develop automatic speech recognition technology in that language.

“Which means that the people who might be able to use it are left out in the cold,” Hasegawa-Johnson said. “There’s nothing to help them.”

Jyothi and Hasegawa-Johnson’s approach could solve this problem, but the idea is so novel that they said it’s hard to convince others it will work.

It was an idea that came out of Jyothi’s frustration when she was trying to develop automatic speech recognition technology in Hindi and couldn’t find a reliable collection of text in order to do so.

“And I said, well, why don’t we just have people write down what they hear, even if they don’t speak Hindi?” Hasegawa-Johnson said. “If you listen carefully to another language, you can hear consonants versus vowels, you can hear things that sound more ‘e’ like versus more ‘ah’ like — you can hear some distinctions that do carry across languages.”

It’s the type of approach that some have proposed as a joke, Hasegawa-Johnson said. Jyothi said that, for a researcher, this poses a problem when it comes to convincing others of the approach’s validity, but it makes it even more rewarding when they show that it works.

“It’s nice to introduce the problem — because it’s kind of really out there — and then show convincing results,” Jyothi said. “Especially when you compare those results with baselines which people trust . . . then it allows us to justify our techniques better.”

Hasegawa-Johnson said it’s the “wait-and-see phenomenon” that commonly occurs in scientific research — where others aren’t willing to jump into the research until they see that it works.

“I think the only people besides us who have tried this now are people who are working with us,” he said.

Moving forward, Jyothi and Hasegawa-Johnson said their biggest challenge will be scaling their approach so that it can be competitive with systems based on transcriptions from native speakers.

Ultimately, they would like to see their work reduce the cost of entry into under-resourced languages, and make speech technology available to those who may not want to learn a language such as English or Mandarin Chinese.

It all comes back to Jyothi and Hasegawa-Johnson’s belief that this technology will allow people to communicate and live their lives comfortably, despite the fact that they speak an under-resourced language. With their novel approach, they hope they can make this term a thing of the past.

“This mismatched crowdsourcing has been a really effective new tool that nobody else has done,” Hasegawa-Johnson said. “But our ultimate goal, really, is to make it possible to create a speech technology in any language.”
