Wednesday, December 31, 2008

Translation is Risky Business



At Google, we like search. So it's no surprise that we treat language translation as a search problem. We build statistical models of how one language maps to another (the translation model) and models of what the target language is supposed to look like (the language model), and then we search for the best translation according to those models (combined into one big log-linear model, for those of you taking notes).
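
To make the search concrete, here is a minimal sketch of how a log-linear combination of model scores can be used to pick a translation. The candidate strings, feature names, scores, and weights are all made up for illustration; they are not Google's models or tuned weights.

    # A toy log-linear decoder: each candidate translation has feature scores
    # (log-probabilities from a translation model and a language model), and the
    # combined score is a weighted sum. The numbers below are invented.
    def loglinear_score(features, weights):
        return sum(weights[name] * value for name, value in features.items())

    candidates = {
        "the house is small": {"translation_model": -2.1, "language_model": -1.3},
        "the home is little": {"translation_model": -1.9, "language_model": -3.0},
    }
    weights = {"translation_model": 1.0, "language_model": 0.8}

    # Choose the candidate with the highest combined (log-linear) score.
    best = max(candidates, key=lambda c: loglinear_score(candidates[c], weights))
    print(best)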

But, just as putting all of your money in the investment with the highest historical return is not always the best idea, choosing the translation with the highest probability is not always the best idea either, especially when the distribution over the top candidates is relatively flat. Instead, we can use the Minimum Bayes Risk (MBR) criterion. Essentially, we look at a sample of the best candidate translations (the so-called n-best list) and choose the safest one: the one likely to do the least damage, where 'damage' is defined by our measure of translation quality. You can think of this as choosing a translation that is a lot like the other good translations, instead of choosing that strange one that happened to have the best model score.
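
As an illustration of the MBR idea, here is a toy sketch of MBR selection over an n-best list. It uses a simple unigram-overlap similarity as a stand-in for a real translation-quality metric (in practice something BLEU-like), and the candidates and posterior probabilities are invented; the point is only that the candidate most similar to the other likely candidates wins, not the one with the single best score.

    from collections import Counter

    def similarity(a, b):
        # Toy stand-in for a translation-quality metric: unigram overlap.
        ca, cb = Counter(a.split()), Counter(b.split())
        return 2.0 * sum((ca & cb).values()) / (sum(ca.values()) + sum(cb.values()))

    def mbr_select(nbest):
        # nbest: list of (translation, posterior probability).
        # Pick the candidate with the highest expected similarity to the others,
        # i.e. the lowest expected "damage" under the model's distribution.
        return max((c for c, _ in nbest),
                   key=lambda c: sum(p * similarity(c, other) for other, p in nbest))

    nbest = [("feline located upon the rug", 0.30),   # odd candidate, best model score
             ("the cat sat on the mat", 0.26),
             ("the cat sits on the mat", 0.24),
             ("a cat is on the mat", 0.20)]
    print(mbr_select(nbest))   # prints "the cat sat on the mat", not the top-scoring one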

If this is our 'diversification' strategy, how can we make things even safer? Exactly the same way we do with investments: we diversify even more. That is, we look at more of the candidate translations when making the MBR decision. A lot more. The way to do that is to build a lattice of translations during the search and then run the MBR search over the lattice. Instead of the 100 or 1,000 best translations we would use with the n-best approach, lattices give us access to a number of candidates that rivals the number of particles in the visible universe (really, it's huge).

 Interested? You can read all about it here.

Tuesday, November 11, 2008

plop: Probabilistic Learning of Programs



Cross-posted with Open Source at Google blog

Traditional machine learning systems work with relatively flat, uniform data representations, such as feature vectors, time-series, and probabilistic context-free grammars. However, reality often presents us with data which are best understood in terms of relations, types, hierarchies, and complex functional forms. The best representational scheme we computer scientists have for coping with this sort of complexity is computer programs. Yet there are comparatively few machine learning methods that operate directly on programmatic representations, due to the extreme combinatorial explosions involved and the semantic complexities of programs.

The plop project is part of a new approach to learning programs, being developed at Google and elsewhere, that takes on these challenges through a unified strategy: reducing programs to a hierarchical normal form, building sequences of specialized representations for programs as search progresses, maintaining alternative representations, managing uncertainty probabilistically by applying estimation-of-distribution algorithms over program spaces, and exploiting probabilistic background knowledge.
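
To give a flavor of what "estimation-of-distribution algorithms over program spaces" means, here is a deliberately tiny cartoon in Python: programs are fixed-length pipelines of unary operations, the distribution over programs is a per-position categorical model, and the model is repeatedly re-estimated from the best samples. This is only a toy under those simplifying assumptions, not plop's actual representation or algorithm.

    import random

    # Toy EDA over a tiny program space: a program is a fixed-length pipeline of
    # unary ops applied to x. We keep an independent categorical distribution per
    # position, sample programs, score them against a target function, and
    # re-estimate the distribution from the best samples.
    OPS = {"inc": lambda v: v + 1, "dbl": lambda v: 2 * v,
           "sqr": lambda v: v * v, "neg": lambda v: -v}
    LENGTH, POP, ELITE, GENERATIONS = 3, 200, 20, 30
    target = lambda x: (2 * x + 1) ** 2          # the function we want the program to compute
    xs = range(-5, 6)

    def run(program, x):
        for op in program:
            x = OPS[op](x)
        return x

    def fitness(program):                        # higher is better (negative squared error)
        return -sum((run(program, x) - target(x)) ** 2 for x in xs)

    dist = [{op: 1.0 / len(OPS) for op in OPS} for _ in range(LENGTH)]
    for _ in range(GENERATIONS):
        pop = [[random.choices(list(d), weights=list(d.values()))[0] for d in dist]
               for _ in range(POP)]
        elite = sorted(pop, key=fitness, reverse=True)[:ELITE]
        for i, d in enumerate(dist):             # re-estimate each position (with smoothing)
            for op in d:
                d[op] = (sum(p[i] == op for p in elite) + 0.1) / (ELITE + 0.1 * len(OPS))

    best = max(pop, key=fitness)
    print(best, fitness(best))                   # often ['dbl', 'inc', 'sqr'], i.e. (2x+1)**2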

For more information on this approach to learning programs, see my doctoral dissertation. For more on the overall philosophy and where things are going, see the plop wiki on Google Code.

Saturday, October 4, 2008

New Technology Roundtable Series



We've just posted the first three videos in the Google Technology Roundtable Series.  Each one is a discussion with senior Google researchers and technologists about one of our most significant achievements. We use a talk show format, where I lead a discussion on the technology. 

While the videos are intended for a reasonably technical audience, I think they may be interesting to many as an overview of the key challenges and ideas underlying Google's systems. And of course they offer a glimpse into the people behind Google.

The first one we made is "Large-Scale Search System Infrastructure and Search Quality." In it, I interview Google Fellows Jeff Dean and Amit Singhal about their insights into how search works at Google.

The next title is "Map Reduce," a discussion of this key technology (developed first at Google, and now having a great impact across the field) for harnessing the parallelism provided by very large-scale clusters of computers, while mitigating the component failures that inevitably occur in such big systems. My discussion is with four of our Map Reduce expert engineers: Sanjay Ghemawat and Jeff Dean again, plus Software Engineers Jerry Zhao and Matt Austern, who discuss the origin, evolution and future of Map Reduce. By the way, this type of infrastructure underlies the concepts in our recent post on "The Intelligent Cloud."
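
For readers who haven't seen the programming model, here is a minimal in-memory sketch of the MapReduce dataflow in Python: a mapper emits key/value pairs, the framework groups values by key, and a reducer aggregates each group. The real system shards inputs across thousands of machines and re-executes tasks when machines fail; none of that is shown here.

    from collections import defaultdict

    def map_fn(doc):
        # Mapper: emit (word, 1) for every word in a document.
        for word in doc.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reducer: sum the counts emitted for one word.
        return (word, sum(counts))

    def mapreduce(inputs, map_fn, reduce_fn):
        groups = defaultdict(list)
        for item in inputs:                      # map phase
            for key, value in map_fn(item):
                groups[key].append(value)        # shuffle: group values by key
        return [reduce_fn(k, v) for k, v in groups.items()]   # reduce phase

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
    print(sorted(mapreduce(docs, map_fn, reduce_fn)))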

The third video, "Applications of Human Language Technology," is a discussion of our enormous progress in large-scale automated translation of languages and speech recognition. Both of these technology domains are coming of age with capabilities that will truly impact what we expect of computers on a day-to-day basis. I discuss these technologies with human language technology experts Franz Josef Och, an expert in the automated translation of languages, and Mike Cohen, an expert in speech processing.

We hope to produce more of these, so please leave feedback at YouTube (in the comments field for each video), and we will incorporate your ideas into our future efforts.

Tuesday, September 30, 2008

Doubling Up



Machine translation is hard. Natural languages are so complex and have so many ambiguities and exceptions that teaching a computer to translate between them turned out to be a much harder problem than people thought when the field of machine translation was born over 50 years ago. At Google Research, our approach is to have the machines learn to translate by using learning algorithms on gigantic amounts of monolingual and translated data. Another knowledge source is user suggestions. This approach allows us to constantly improve the quality of machine translations as we mine more data and get more and more feedback from users.

A nice property of the learning algorithms that we use is that they are largely language independent -- we use the same set of core algorithms for all languages. So if we find a lot of translated data for a new language, we can just run our algorithms and build a new translation system for that language.

As a result, we were recently able to significantly increase the number of languages on translate.google.com. Last week, we launched eleven new languages: Catalan, Filipino, Hebrew, Indonesian, Latvian, Lithuanian, Serbian, Slovak, Slovenian, Ukrainian, and Vietnamese. This increases the total number of languages from 23 to 34. Since we offer translation between any of those languages, this increases the number of language pairs from 506 to 1122 (well, depending on how you count simplified and traditional Chinese, you might get even larger numbers). We're very happy that we can now provide free online machine translation for many languages that didn't have any available translation system before.
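
The pair counts follow directly from the language counts: with n languages, every ordered source/target combination is a pair, so there are n × (n − 1) of them. A two-line check:

    # n languages yield n * (n - 1) ordered (source, target) pairs.
    for n in (23, 34):
        print(n, "languages ->", n * (n - 1), "language pairs")   # 506 and 1122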

So how far can we go with adding new languages in the future? Can we go to 40, 50 or even more languages? It is certainly getting harder, as less data is available for those languages and as a result it is harder to build systems that meet our quality bar. But we're working on better learning algorithms and new ways to mine data, so even if we haven't covered your favorite language yet, we hope that we will have it soon.

Saturday, July 26, 2008

Remembering Randy Pausch



It is with great sadness that we note the passing of Randy Pausch, who taught computer science at Carnegie Mellon University. Randy was well-known by many within the research community, including quite a number of us here at Google. Alfred Spector, our Vice President of Research, was his Ph.D. advisor. Rich Gossweiler, a Senior Research Scientist, was his first Ph.D. student. Several other former colleagues and coauthors (Joshua Bloch, Adam Fass, and Ning Hu) now work here.

All of us strive to make an impact with our research, and Randy was no exception. He will be remembered for his work, but also for his contributions to humanity at large. Millions have watched the video on YouTube from his lecture titled Achieving your Childhood Dreams. The strength of his character was already known to his family, his colleagues, and the broader computer science research community. The courage and optimism that he displayed at the end of his life became inspirational to millions more.

I've seen Randy repeatedly go to bat for what is right. As a leader, he consistently evoked incredible enthusiasm and optimism for the subjects he embraced. Randy had a very human passion for people: not just who they were, but their potential, despite any flaws or obstacles in their way. His contributions will be remembered for generations to come. - Rich Gossweiler

Randy was one of the most vibrant, passionate people I've ever known. His passion was inspirational not only to his family and colleagues, but also, because of his courageous presentations beginning with his well-known Last Lecture, he has influenced millions more. - Alfred Spector


We will miss Randy very much, and remember him fondly.

Wednesday, May 21, 2008

Machine Learning Meeting



Machine Learning is a branch of Artificial Intelligence in which, naturally enough, the aim is to get computers to learn: things like improving performance over time, and recognizing general tendencies among a number of specific cases.  We have many ways to exploit Machine Learning programs, and a lot of data to give them.  Machine Learning helps us to estimate what content users like most, what content is even legitimate, and how to match ads to content.  It also plays key roles in products such as Google Translate and 1-800-GOOG-411.

Last week, Google held a Machine Learning Summit in New York, gathering together engineers from around the world, including participants from Beijing, Zurich, Boston, Haifa, and Mountain View. The program included invited talks by CJ Lin and Peng Xu, and many shorter presentations by Googlers on topics including broadly applicable Machine Learning software, especially challenging applications, and some mathematical analysis.

Wednesday, May 7, 2008

Can You Publish at Google?



As part of the interview process at Google we try to describe what it is like to do research here. A common question I get is "How hard is it to publish at Google?" I want to dispel the myth that it is hard. It is easy to publish, easy to put code into open source, easy to give talks, etc. But it is also easy for great research to become great engineering, and that is an incredible lure. Great barriers of despair exist between research and development at many companies; researchers can find it hard to have impact beyond demos, papers, or patents.

Here at Google it is not uncommon for researchers to work on products (and it is the modus operandi for me); you can come up with something interesting, experiment to convince yourself and others that it is worthwhile, and then work as part of the team to build it and help create world-changing products. But are you willing to do that? That decision is the hard part, not publishing a paper.

From a Google standpoint, I think we need to make sure these barriers don't form, and that making products, experimenting, and having a venue for trying bold new approaches continue to be part of the culture.

Friday, May 2, 2008

VisualRank



At WWW-2008, in Beijing, China, we presented our paper "PageRank for Product Image Search". In this paper, we presented a system that used visual cues, instead of solely text information, to determine the rank of images. The idea was simple: find common visual themes in a set of images, and then find a small set of images that best represented those themes. The resulting algorithm wound up being PageRank, but on an entirely inferred graph of image similarities. Since the release of the paper, we've noticed lots of coverage in the press and have received quite a few questions. We thought we could answer a few of them here.
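
To make the idea concrete, here is a minimal power-iteration sketch of PageRank run over a small, entirely made-up image-similarity graph. In the actual work the similarities are inferred from local image features over a large candidate set; here a hand-written symmetric matrix stands in for that graph.

    import numpy as np

    # Hypothetical similarity scores between five images returned for one product query.
    S = np.array([[0.0, 0.8, 0.7, 0.1, 0.1],
                  [0.8, 0.0, 0.9, 0.1, 0.2],
                  [0.7, 0.9, 0.0, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.0, 0.3],
                  [0.1, 0.2, 0.1, 0.3, 0.0]])

    def visual_rank(S, damping=0.85, iters=100):
        # Power iteration for PageRank on the similarity graph:
        # r <- damping * P @ r + (1 - damping) / n, with P column-stochastic.
        n = len(S)
        P = S / S.sum(axis=0, keepdims=True)
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = damping * (P @ r) + (1.0 - damping) / n
        return r

    print(visual_rank(S))   # images 0-2, which resemble each other most, rank highest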


"Why did we choose to use products for our test case?" First and foremost, product queries are popular in actual usage; addressing them is important. Second, users have strong expectations of what results we should return for these queries; therefore, this category provides an important set of examples that we need to address especially carefully. Third, on a pragmatic note, they lend themselves well to the type of "image features" that we selected in this study. Since the publication of the paper, we've also extended our results to other query types, including travel-related queries. One of the nice features of the approach is that (we hope) it will be easy to extend to new domains; as research in measuring image or object similarity continues, the advances can easily be incorporated into the similarity calculation to compute the underlying graph; the computations on the graph do not change.

"Where are we going from here?" Besides broadening the sets of queries (and sets of features) for which we can use this approach, there are three directions we're exploring. First, estimating similarity measures for all of the images on the web is computationally expensive; approximations or alternative computations are needed. Second, we hope to evaluate our approach with respect to the large number of recently proposed alternative clustering methods. Third, many variations of PageRank can be used in quite interesting ways for image search. For example, we can use some of these previously published methods to reintroduce, in a meaningful manner, the textual information that the VisualRank algorithm removed. In the end, we have an approach that has an easy integration with both text and visual clues. Stay tuned for more on that in the coming months.

And now to answer the most commonly asked question, "Is it live?" Not yet. Currently, it is research in progress (click here to help speed up the process). In the meantime, though, if you'd like another sneak peek of our research on large graphs, this time in the context of YouTube datamining, just follow the link.

Finally, we want to extend our deepest thanks to the people who helped on this project, especially the image-search team; without their help, this research would not have been possible.

Thursday, April 24, 2008

Research in the Cloud: Providing Cutting Edge Computational Resources to Scientists



The emergence of extremely large datasets, well beyond the capacity of almost any single computer, has challenged traditional and contemporary methods of analysis in the research world. While a simple spreadsheet or modest database remains sufficient for some research, problems in the domain of "computational science," which explores mathematical models via computational simulation, require systems that provide huge amounts of data storage and computer processing (current research areas in computational science include climate modeling, gene sequencing, protein mapping, materials science and many more). As an added hurdle, this level of computational infrastructure is often not affordable to research teams, who usually work with significant budgetary restrictions.

Fortunately, as the Internet technology industry expands its global infrastructure, accessing world-class distributed computational and storage resources can be as simple as visiting a website. Building on the Academic Cloud Computing Initiative (ACCI) announced last October, Google and IBM, together with the National Science Foundation, announced the CluE initiative in February to address this particular need. After coordinating the technical details with Google and IBM, the NSF posted the official solicitation of proposals last week.

Our primary goal in participating in the CluE initiative is to encourage the understanding, further refinement and -- importantly -- targeted application of the latest distributed computing technology and methods across many academic disciplines. Engaging educators and researchers with the new potential of distributed computing for processing and analyzing extremely large datasets is an invaluable investment for any technology company to make, and Google in particular is pleased to make a contribution to the academic community that has enabled so many recent advances in the industry.

We're looking forward to an eclectic collection of proposals from the NSF's solicitation. We believe many will leverage the power of distributed computing to produce a diverse range of knowledge that will provide long term benefit to both the research community and the public at large. We also hope that Google's contribution to this low cost, open source approach to distributed computing will allow many more in the academic community to take advantage of this pervasive technological shift.

More details, including information on how to apply for access to these resources, are available on the NSF site.

Saturday, March 29, 2008

Deploying Goog411



A couple of years ago, a few of us got together and decided to build Goog411. It would be a free phone service that users could call to connect to any business in the US, or simply to browse through a list of businesses such as "bookstores" in a given city. Everything would be fully automated, with no operator in the background, just a speech recognition system to converse with the user, and Google Maps to execute the business search.

We knew that speech recognition is not a solved problem; there would be users for whom the system wouldn't work well, and queries that would be harder to recognize than others. But hosting the service ourselves gave us big advantages: we could iterate as often as we wanted on any component of the system, we'd have access to all the data, and we could measure whatever seemed relevant to callers. So we built Goog411, started taking some traffic, defined metrics, and iterated many, many times.

We learned a few interesting things in the process (see our ICASSP paper). For example, we discovered that databases with lists of business names are almost useless for training a language model of how users answer the question "What business name or category?"; aggregated web query logs from Google Maps yield far better performance. And we found that the speech data we collect through our own service is almost as useful for modeling new queries as the web data, even though we have orders of magnitude less of it. After all, you may type "real estate" in Google Maps to glance at a few properties, but would you ask for it over the phone while driving your car?
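
As a cartoon of why query logs help, here is a toy comparison of two unigram language models: one estimated from a made-up, frequency-weighted query log, the other from a flat list of formal business names. Real systems use much larger aggregated logs and proper n-gram models; this only illustrates where the probability mass ends up.

    from collections import Counter

    query_log = {"pizza": 500, "pizza hut": 300, "starbucks": 400,
                 "24 hour pharmacy": 120, "cheap hotels": 80}       # invented counts
    business_list = ["Hut of Pizza LLC", "Starbucks Coffee Company #1017",
                     "Walgreens Pharmacy Store 4821"]               # invented listings

    def unigram_model(weighted_texts):
        # Estimate P(word) from (text, count) pairs.
        counts = Counter()
        for text, count in weighted_texts:
            for word in text.lower().split():
                counts[word] += count
        total = sum(counts.values())
        return {w: n / total for w, n in counts.items()}

    lm_from_logs = unigram_model(query_log.items())
    lm_from_list = unigram_model((name, 1) for name in business_list)

    # The log-based model puts its mass on what callers actually say ("pizza"),
    # while the listings-based model spends mass on tokens like "llc" and "#1017".
    print(lm_from_logs.get("pizza"), lm_from_list.get("llc"))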

Today Goog411 has grown from an experiment into a product, and we're working on expanding the service to Canada. As calls flow through the system, our focus is still on making the best use of the increasing data, defining metrics that best correlate to the user's experience, and taking advantage of the computer resources and data sources available within Google.

Maybe our most rewarding experience so far has been to see our traffic grow, and to see repeat callers succeed more and more often with the system. Have you tried it already? Just call 1-800-GOOG-411, and don't hesitate to send us feedback!

Tuesday, February 12, 2008

This year's scalability conference



Managing huge repositories of data and large clusters of machines is no easy task -- and building systems that use those clusters to usefully process that data is even harder. Last year, we held a conference on scalable systems so a bunch of people who work on these challenges could get together and share ideas. Well, it was so much fun that we've decided to do it again.

This year, the conference is taking place in Seattle on Saturday, June 14. (Registration is free.) If you'd like to talk about a topic on scalable or large-scale systems that is near and dear to your heart, we'd love to hear from you. Potential topics include:

Development, deployment and production:
  • Systems, environments and languages for building, deploying and debugging complex datacenter-scale apps, or for allowing teams of collaborating engineers to work together on such apps more effectively
Mobile applications:
  • Unique challenges of scaling services for mobile devices
  • Location-aware scaling techniques
  • Experiences designing scalable apps involving mobile devices
Of course, you've probably got more ideas. Send a 500-word abstract of your 30-minute presentation to scalabilityconf@google.com no later than Friday, April 11, and we'll post registration details in the next couple of months.