My Private Collections: internationalization

Friday, July 26, 2013

Under the hood of Croatian, Filipino, Ukrainian, and Vietnamese in Google Voice Search

Posted by Eugene Weinstein and Pedro Moreno, Google Speech Team

Although we’ve been working on speech recognition for several years, every new language requires our engineers and scientists to tackle unique challenges. Our most recent additions - Croatian, Filipino, Ukrainian, and Vietnamese - required creative solutions to reflect how each language is used across devices and in everyday conversations.

For example, since Vietnamese is a tonal language, we had to explore how to take tones into consideration. One simple technique is to model the tone and vowel combinations (tonemes) directly in our lexicons. This, however, has the side effect of a larger phonetic inventory, so we had to come up with special algorithms to handle the increased complexity. Additionally, Vietnamese is a heavily diacritized language, with tone markers on a majority of syllables. Since Google Search is very good at returning valid results even when diacritics are omitted, our Vietnamese users frequently omit the diacritics when typing their queries. This creates difficulties for the speech recognizer, which selects its vocabulary from typed queries. To address this, we created a special diacritic restoration algorithm which enables us to present properly formatted text to our users in the majority of cases.
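To make the diacritic problem concrete, here is a minimal sketch of one simple way restoration could work: learn, from text that does carry diacritics, the most frequent diacritized spelling for each stripped word. This is an illustrative assumption, not the algorithm used in Voice Search; the function names and the tiny corpus are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove tone and vowel marks, as a user typing without diacritics would."""
    # The letter d-with-stroke has no Unicode decomposition, so map it by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_restoration_table(corpus_sentences):
    """Map each stripped word to its most frequent spelling in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus_sentences:
        for word in unicodedata.normalize("NFC", sentence.lower()).split():
            counts[strip_diacritics(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore_diacritics(query, table):
    """Replace each word with its most likely diacritized form, if known."""
    words = query.lower().split()
    return " ".join(table.get(strip_diacritics(w), w) for w in words)


# Hypothetical toy corpus: mostly diacritized text plus one stripped query.
corpus = ["thời tiết hà nội", "hà nội hôm nay", "ha noi"]
table = build_restoration_table(corpus)
restored = restore_diacritics("thoi tiet ha noi", table)
```

A word-level table like this ignores context, so it cannot choose between two valid words that share a stripped form; a production system would need to look at neighboring words as well.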

Filipino also presented interesting challenges. Much like people in other multilingual societies such as Hong Kong, India, and South Africa, Filipinos often mix several languages in their daily life. This is called code switching. Code switching complicates the design of pronunciation, language, and acoustic models. Speech scientists are effectively faced with a dilemma: should we build one system per language, or should we combine all languages into one?

In such situations we prefer to model the reality of daily language use in our speech recognizer design. If users mix several languages, our recognizers should do their best to model this behavior. Hence our Filipino voice search system, while mainly focused on the Filipino language, also allows users to mix in English terms.

The algorithms we’re using to model how speech sounds are spoken in each language make use of our distributed large-scale neural network learning infrastructure (yes, the same one that spontaneously discovered cats on YouTube!). By partitioning the gigantic parameter set of the model, and by evaluating each partition on a separate computation server, we’re able to achieve unprecedented levels of parallelism in training acoustic models.
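The partitioning idea can be sketched on a single network layer: split the layer's weight matrix into blocks, have each worker compute the outputs for its own block, and concatenate the results. The sketch below is a toy illustration only; worker threads stand in for separate servers, and the function names are hypothetical rather than part of any Google infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(weights, num_partitions):
    """Split a weight matrix (a list of rows) into contiguous row blocks."""
    size = -(-len(weights) // num_partitions)  # ceiling division
    return [weights[i:i + size] for i in range(0, len(weights), size)]


def evaluate_partition(rows, inputs):
    """Work done by one 'server': dot each of its rows with the input vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in rows]


def parallel_layer(weights, inputs, num_partitions):
    """Compute a layer's pre-activation outputs, one partition per worker."""
    if not weights:
        return []
    blocks = partition_rows(weights, num_partitions)
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        # map() preserves block order, so outputs line up with the rows.
        partial_outputs = pool.map(evaluate_partition, blocks, [inputs] * len(blocks))
        return [y for part in partial_outputs for y in part]


weights = [[1, 0], [0, 1], [2, 3]]
outputs = parallel_layer(weights, [4, 5], num_partitions=2)
```

Each worker only needs its own block of parameters plus the shared input vector, which is what lets a parameter set too large for one machine be spread across many.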

The more people use Google speech recognition products, the more accurate the technology becomes. These new neural network technologies will help us bring you lots of improvements and many more languages in the future.

Wednesday, January 4, 2012

Google Correlate expands to 49 additional countries

Posted by Matt Mohebbi, Software Engineer

In May of this year we launched Google Correlate on Google Labs. This system enables a correlation search between a user-provided time series and millions of time series of Google search traffic. Since our initial launch, we've graduated to Google Trends and we've seen a number of great applications of Correlate in several domains, including economics (consumer spending, unemployment rate, and housing inventory), sociology, and meteorology. The correspondence between gas prices and search activity for fuel-efficient cars was even briefly discussed in a Fox News presidential debate, and NPR recently covered correlations related to political commentators.
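As a rough illustration of what a correlation search does, the sketch below ranks candidate series by their Pearson correlation with a target series. It assumes Pearson correlation as the similarity measure and scans every candidate by brute force, which would not scale to millions of series; the function names and sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no trend to correlate with
    return cov / math.sqrt(var_x * var_y)


def top_correlated(target, candidates, k=3):
    """Return the k (name, correlation) pairs most correlated with target."""
    scored = [(pearson(target, series), name) for name, series in candidates.items()]
    scored.sort(reverse=True)
    return [(name, r) for r, name in scored[:k]]


# Hypothetical weekly series: a user-provided target and three query series.
target = [1, 2, 3, 4, 5]
candidates = {
    "rising": [2, 4, 6, 8, 10],
    "falling": [5, 4, 3, 2, 1],
    "flat": [1, 1, 1, 1, 1],
}
best = top_correlated(target, candidates, k=2)
```

Here "rising" ranks first because it moves in lockstep with the target, while "falling" is perfectly anti-correlated and ranks last.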

Health has always been an area of particular interest to our team (Matt Mohebbi, Julia Kodysh, Rob Schonberger and Dan Vanderkam). Correlate was inspired by Google Flu Trends and many of us worked on both systems. So we were very excited when the BioSense division at the CDC published a page which shows correlations between some of their national trends in patient diagnosis activity and Google search activity. With just three years of weekly data, relevant search terms are surfaced. For example, the time series for bloody nose surfaces "bloody snot" and "blood in snot".

While these terms shouldn't come as a surprise, there are others which are more interesting, including searches related to static electricity, dry skin, and red cheeks. Of course, correlation is not causation, but we hope that Correlate can be used as a method for researchers to generate new hypotheses with their data.

To help researchers outside the United States, we're pleased to announce support for 49 additional countries in Google Correlate. It's now possible to see correlations like "snorkeling" in Australia, "cherry blossoms" in Japan, and "beer garden" in Germany. We look forward to seeing what new correlations researchers can find with this data!