My Private Collections: June 2010

Tuesday, June 15, 2010

Google Search by Voice now available in France, Italy, Germany and Spain

Posted by Thad Hughes, Martin Jansche, and Pedro Moreno, Google Research

Google’s speech team is composed of people from many different cultural backgrounds. Indeed, if we count the languages spoken by our teammates, the number comes to well over a dozen. Given our own backgrounds and interests, we are naturally excited to extend our software to work with many different languages and dialects. After testing the waters with English, Mandarin Chinese, and Japanese, we decided to tackle four main European languages which are often referred to as FIGS - French, Italian, German and Spanish.

Developing Voice Search systems in each of these languages presented its own challenges. French and Spanish required special work to deal with diacritic and accent marks (e.g. ç in French, ñ in Spanish). When we develop a new language we tweak our dictionaries based on user-generated content. To our surprise, we found that a lot of this content in French and Spanish uses non-standard orthography. For example, a French speaker might type “francoise” into a search engine and still expect it to return results for “Françoise”. Likewise, in Spanish a user might type “espana” and expect results for the term “España”. Of course, a lot of this has to do with the fact that, until recently, domain names (like www.elpais.es) did not allow diacritics, and that entering special characters is often painful, whereas omitting diacritics is usually not an obstacle to communication. However, non-standard spellings distort the intended pronunciations. For example, if “francoise” were a real French word, one would expect it to be pronounced “franquoise”. In order to capture the intended pronunciation of the non-standard spellings, we fixed the orthography in our dictionaries for Spanish and French automatically. While this is not perfect, it deals with many of the offending cases.
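To make the idea of automatically fixing orthography concrete, here is a minimal sketch in Python. It is not the method we actually used; it simply assumes a table of word counts (the `counts` dictionary below is made-up illustration data) and maps every spelling to the most frequent form that shares the same letters once diacritics are stripped. The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks, so "françoise" becomes "francoise"."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_restoration_table(word_counts: dict) -> dict:
    """Map each diacritic-free key to its most frequent observed spelling."""
    best = {}
    for word, count in word_counts.items():
        form = word.lower()
        key = strip_diacritics(form)
        if key not in best or count > best[key][1]:
            best[key] = (form, count)
    return {key: form for key, (form, _) in best.items()}


def restore(word: str, table: dict) -> str:
    """Return the preferred spelling, or the word unchanged if unknown."""
    return table.get(strip_diacritics(word.lower()), word)


# Hypothetical counts, for illustration only.
counts = {"françoise": 120, "francoise": 40, "españa": 900, "espana": 300, "madrid": 500}
table = build_restoration_table(counts)

print(restore("francoise", table))
print(restore("espana", table))
```

A frequency vote like this is one reason such a fix cannot be perfect: when both spellings are real words (Spanish “ano” and “año”, for instance), the more frequent form wins regardless of what the user meant.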

Since our Voice Search systems typically understand more than a million different words in each language, developing pronunciation dictionaries is one of the most critical tasks. We need the dictionary to match what the user said with the written form. Not surprisingly, we found dictionary development for some languages, like Spanish and Italian, to be extremely easy, as they have very regular orthographies. In fact, the core of our Spanish pronunciation module consists of fewer than 100 lines of source code. Other languages, like German and French, have more complex orthographies. For example, in French “au”, “eaux” and “hauts” are all pronounced “o”.
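To illustrate why a regular orthography keeps such a module small, here is a rough Python sketch of rule-based letter-to-sound conversion for Spanish. It is not our production code: the rule list, the phoneme symbols and the `pronounce` function are assumptions made for this example, and it ignores stress, “ü”, “x” and regional variation.

```python
import re

# Ordered rewrite rules: the first pattern that matches at a position wins.
# Both the rule set and the phoneme symbols are simplified assumptions.
RULES = [
    (r"ch", "tʃ"),
    (r"ll", "ʎ"),
    (r"rr", "rr"),
    (r"qu", "k"),
    (r"gu(?=[ei])", "g"),  # "gu" before e/i: the u is silent
    (r"c(?=[ei])", "θ"),   # Castilian "c" before e/i
    (r"g(?=[ei])", "x"),
    (r"c", "k"),
    (r"h", ""),            # "h" is silent
    (r"j", "x"),
    (r"ñ", "ɲ"),
    (r"v", "b"),
    (r"z", "θ"),
]
COMPILED = [(re.compile(pattern), phoneme) for pattern, phoneme in RULES]

# Stress is ignored in this sketch, so acute accents are simply dropped.
ACCENTS = str.maketrans("áéíóú", "aeiou")


def pronounce(word: str) -> list:
    """Convert a Spanish word into a list of phoneme symbols."""
    word = word.lower().translate(ACCENTS)
    phonemes = []
    i = 0
    while i < len(word):
        for pattern, phoneme in COMPILED:
            match = pattern.match(word, i)
            if match:
                if phoneme:
                    phonemes.append(phoneme)
                i = match.end()
                break
        else:
            # No rule applies: the letter stands for its own sound.
            phonemes.append(word[i])
            i += 1
    return phonemes


print(pronounce("queso"))
print(pronounce("España"))
```

A comparable table for French would need many more context-dependent rules and exceptions, since several different letter sequences can map to the same sound.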

A notable aspect of German (especially “Internet German”) is that a lot of English words are in common usage. We do our best to recognize thousands of English words, even though English contains some sounds that don’t exist in German, like “th” in “the”. One of the trickiest examples we came across was when one of our volunteers read “nba playoffs 2009”, saying “nba playoffs” in English followed by “zwei tausend neun” in German. So go ahead and search for “Germany’s Next Topmodel” or “Postbank Online”, and see if it works for you.

German is also notorious for having long, complex words. Our favorite examples include:

Berufskraftfahrerqualifikationsgesetz (or shorter: BKrFQG)
Eierschalensollbruchstellenverursacher
Verkehrsinfrastrukturfinanzierungsgesellschaft
Stichpimpulibockforcelorum
Hypothalamus-Hypophysen-Nebennierenrinde-Achse

Just for fun, compare how long it takes you to say these to Voice Search vs. typing them.

Even though a vocabulary size of one million words sounds like a large number, each of these languages has even more words, so we need a procedure to select which ones to model. We obviously do not do this manually; instead, we use statistical procedures to identify the list of words we will allow, looking at many sources of data and at how frequently each word occurs in them. As a result, our algorithms sometimes select really weird terms. For example in Spanish we found these unusual words:

So, in the unlikely event that you ever try a Spanish voice search query like this “imágenes del músculo supercalifragilisticoespialidoso chiripitiflautico esternocleidomastoideo” you may be surprised to see that it works.
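The frequency-based selection described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenization, a single global count across sources, made-up sample texts), not a description of our actual pipeline; `select_vocabulary` and its parameters are hypothetical names.

```python
from collections import Counter


def select_vocabulary(sources, max_words, min_count=1):
    """Pick the most frequent words across several text sources.

    sources: an iterable of text collections, each an iterable of strings.
    Returns at most max_words words, most frequent first (ties broken
    alphabetically), keeping only words seen at least min_count times.
    """
    counts = Counter()
    for texts in sources:
        for text in texts:
            counts.update(text.lower().split())
    frequent = [(word, n) for word, n in counts.items() if n >= min_count]
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in frequent[:max_words]]


# Made-up sample data, for illustration only.
queries = ["imágenes del músculo", "imágenes de madrid"]
pages = [
    "el músculo esternocleidomastoideo",
    "imágenes del músculo esternocleidomastoideo",
]

print(select_vocabulary([queries, pages], max_words=4, min_count=2))
```

Because the choice is driven purely by counts, a long and obscure word makes the cut as soon as it appears often enough in the data, which is how unusual terms can end up in the vocabulary.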

French, Italian, German, and Spanish are spoken in many parts of the world. In this first release of Google Search by Voice in these languages, we initially only support the varieties spoken in France, Italy, Germany, and Spain, respectively. The reason is that almost all aspects of a Voice Search system are affected by regional variation: French speakers from different regions have slightly different accents, use a number of different words, and will want to search for different things. Eventually, we plan to support other regions as well, and we will work hard to make sure our systems work well for all of you.

So, we hope you find these new voice search systems useful and fun to use. We definitely had a “supercalifragilisticoespialidoso chiripitiflautico” time developing them.

Thursday, June 10, 2010

Google Fusion Tables celebrates one year of data management

Posted by Alon Halevy, Google Research and Rebecca Shapley, User Experience

A year ago we launched Google Fusion Tables, an easy way to integrate, visualize and collaborate on data tables in the Google cloud. You used it, saw the potential, and told us what else you wanted. Since then, we’ve responded by offering programmatic access through the Fusion Tables API, math across data columns owned by multiple people, and search on the collection of tables that have been made public. We published papers on Fusion Tables at SIGMOD 2010 and at the First Symposium on Cloud Computing. And since the map visualizations were such a hit, we made them even better by supporting large numbers of points, lines and polygons, custom HTML in map pop-up balloons complete with tutorials, and integration with the Google Maps API. We’ve made all this capability available on Google’s cloud and are excited to see examples every day of how our cloud approach to data tables is changing the game and making structured data management, collaboration, and publishing fast, easy, and open.

But more exciting than all the features we’ve been releasing are the things that people have been *doing* with Fusion Tables. News agencies have been taking advantage of Fusion Tables to map data that governments make public, and tell a more complete story (see the L.A. Times, Knoxville News, and Chicago Tribune). Just this month the State of California kicked off an application development contest, hosting data sets like this one in Fusion Tables for easy API access for developers. And the US Department of Health and Human Services held the Community Health Data Forum, where attendees presented data applications such as the heart-friendly and people-friendly hospital-finder, built with Google Fusion Tables.

It continues to astound us how quickly our users are able to pull together these kinds of compelling data applications with Fusion Tables, again showing the power of a cloud approach to data. Fusion Tables has served as the multimedia extension to Joseph Rossano’s art exhibit on Butterflies and DNA barcodes, as an easy way to map real estate in Monterey County or potholes in Spain, as the geo-catalog for wind power data and ethanol-selling stations, and even as the data backend for a geo portal to organize water data for Africa, among many, many other uses.

As we head into our second year, we’re looking forward to delivering more tools that make data management easier and more powerful on the web. What’s next for Fusion Tables? Request your favorite features on our Feature Request (a special implementation of Google Moderator), and follow the latest progress of Fusion Tables on our User Group, Facebook, and Twitter. We love to hear from you!