My Private Collections: 2007

Friday, October 19, 2007

Google Education Summit

Posted by Jeff Walz and Kevin McCurley

The world's research and educational infrastructures are tightly intertwined. Research universities enable students to participate in research activities, and research contributes to the vitality of the educational experience. At Google, we also recognize the importance of education to our research and engineering activities. In addition to our own in-house activities, we maintain strong ties to academic institutions through visiting faculty programs and summer internships. In recognition of the importance of education to Google's mission, we also recently organized a Google Education Summit. Mehran Sahami has more to say about this in a recent blog post.

Monday, September 24, 2007

OpenHTMM Released

Posted by Ashok C. Popat, Research Scientist

Statistical methods of text analysis have become increasingly sophisticated over the years. A good example is automated topic analysis using latent models, two variants of which are Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation.

Earlier this year, Amit Gruber, a Ph.D. student at the Hebrew University of Jerusalem, presented a technique for analyzing the topical content of text at the Eleventh International Conference on Artificial Intelligence and Statistics in Puerto Rico.

Gruber's approach, dubbed Hidden Topic Markov Models (HTMM), was developed in collaboration with Michal Rosen-Zvi and Yair Weiss. It differs notably from others in that, rather than treating each document as a single "bag of words," it imposes a temporal Markov structure on the document. This lets it account for topics that shift within a document, which yields a topic segmentation of the document. It also seems to distinguish effectively among the multiple senses that the same word may have in different contexts within the same document.
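To make the idea of a temporal Markov structure concrete, here is a minimal Python sketch. It is not the OpenHTMM code and not the authors' inference procedure. It assumes the word probabilities of each topic are already known (hand-picked here), assumes the topic can change only between sentences, and uses a standard Viterbi pass to pick the most likely topic for each sentence. The function name segment_topics, the switch_prob parameter, and the toy vocabulary are all hypothetical.

```python
import math


def segment_topics(sentences, topic_word_probs, topic_prior, switch_prob=0.1):
    """Return the most likely topic index for each sentence.

    Assumed model (illustrative only): between sentences, the topic is kept
    with probability 1 - switch_prob, otherwise a topic is redrawn from
    topic_prior. All words in a sentence share that sentence's topic.
    """
    if not sentences:
        return []
    n_topics = len(topic_prior)

    def emission(sentence, k):
        # Log-probability of the sentence's words under topic k.
        # Words missing from the table get a small floor probability.
        return sum(math.log(topic_word_probs[k].get(w, 1e-6)) for w in sentence)

    def transition(prev, cur):
        # Redrawing from the prior may land on the same topic again.
        p = switch_prob * topic_prior[cur]
        if prev == cur:
            p += 1.0 - switch_prob
        return math.log(p)

    # Viterbi forward pass over sentences.
    scores = [math.log(topic_prior[k]) + emission(sentences[0], k)
              for k in range(n_topics)]
    back = []
    for sentence in sentences[1:]:
        new_scores, pointers = [], []
        for cur in range(n_topics):
            best_prev = max(range(n_topics),
                            key=lambda prev: scores[prev] + transition(prev, cur))
            new_scores.append(scores[best_prev] + transition(best_prev, cur)
                              + emission(sentence, cur))
            pointers.append(best_prev)
        scores = new_scores
        back.append(pointers)

    # Backtrack from the best final topic.
    path = [max(range(n_topics), key=lambda k: scores[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy data (made up): topic 0 is about finance, topic 1 about rivers.
toy_topics = [
    {"money": 0.45, "bank": 0.45, "river": 0.05, "water": 0.05},
    {"river": 0.40, "water": 0.40, "bank": 0.15, "money": 0.05},
]
toy_document = [["money", "bank"], ["bank"], ["river", "water"], ["bank"]]
print(segment_topics(toy_document, toy_topics, [0.5, 0.5]))
```

In this toy document the word "bank" appears on its own twice. A bag-of-words model would score both occurrences identically, whereas the Markov chain assigns the first to the finance topic and the last to the river topic, because each follows a sentence that strongly indicates its topic.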

Amit is currently doing a graduate internship at Google. As part of his project, he has developed a fresh implementation of his method in C++. We are pleased to release it as the OpenHTMM package to the research community under the Apache 2 license, in the hope that it will be of general interest and facilitate further research in this area.

Thursday, September 20, 2007

The Sky is Open

Posted by Jeremy Brewer

We've gotten an incredible amount of positive feedback about Sky in Google Earth, which lets Google Earth users explore the sky above them with hundreds of millions of stars and galaxies taken from astronomy imagery.

From the start, though, we have wanted to open the sky up to everyone. As a first step, we've been hard at work developing tools to let astronomers add their own imagery, and we think we've come up with something that does the job nicely. We're pleased to announce the availability of wcs2kml, an open source project for importing astronomical imagery into Sky.

Modern telescopes output imagery in the FITS binary format that contains a set of headers known as a World Coordinate System (that's the "wcs" part) specifying the location of the image on the sky. Wcs2kml handles the task of transforming this imagery into the projection system used by Google Earth (the "kml" part) so that it can be viewed directly in Sky. Wcs2kml also includes tools to simplify uploading this data to a web server and sharing it with friends.
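As a rough illustration of what a World Coordinate System header encodes, the Python sketch below maps a pixel position to sky coordinates. It is not how wcs2kml works internally. It uses the standard FITS keywords CRPIX, CRVAL and CDELT but assumes a flat-sky linear model that ignores rotation and the map projection, so it is only reasonable for small images away from the celestial poles. The function names and header values are made up, and the final step assumes that Sky-mode KML takes longitude as right ascension in degrees minus 180, with latitude equal to declination.

```python
import math


def pixel_to_sky(header, px, py):
    """Approximate (RA, Dec) in degrees for pixel (px, py).

    Flat-sky linear model (an assumption of this sketch): offsets from the
    reference pixel are scaled by CDELT and added to the reference
    coordinate. Rotation and projection terms are ignored.
    """
    dx = header["CDELT1"] * (px - header["CRPIX1"])  # degrees along RA axis
    dy = header["CDELT2"] * (py - header["CRPIX2"])  # degrees along Dec axis
    dec = header["CRVAL2"] + dy
    # RA circles shrink by cos(Dec), so an on-sky offset spans more RA degrees.
    ra = header["CRVAL1"] + dx / math.cos(math.radians(header["CRVAL2"]))
    return ra % 360.0, dec


def sky_to_kml(ra_deg, dec_deg):
    """Map (RA, Dec) in degrees to an assumed Sky-mode (longitude, latitude)."""
    return ra_deg - 180.0, dec_deg


# Hypothetical header: reference pixel (100, 100) at RA 180, Dec 0,
# with 0.001 degrees per pixel and RA increasing to the left.
example_header = {
    "CRPIX1": 100.0, "CRPIX2": 100.0,
    "CRVAL1": 180.0, "CRVAL2": 0.0,
    "CDELT1": -0.001, "CDELT2": 0.001,
}
ra, dec = pixel_to_sky(example_header, 200.0, 100.0)
print(sky_to_kml(ra, dec))
```

A real converter also has to reproject every pixel of the image, which is the part that wcs2kml takes care of.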

We were astounded at the imagery and novel applications people created when we opened the Google Earth API to our users. Now, by opening Sky in Google Earth to the astronomy community, we hope to open a floodgate of new imagery for Sky!

Wednesday, August 22, 2007

Introducing Sky in Google Earth

Posted by Andy Connolly and Ryan Scranton

At Google we are always interested in creating new ways to share ideas and information and applying these techniques to different research fields. Astronomy provides a great opportunity, with an abundance of images and information that are accessible to researchers and, indeed, anyone with an interest in the stars. With the release of the Google Earth 4.2 client, the new Sky feature acts as a virtual telescope that provides a view of some of the most detailed images ever taken of the night sky. By clicking on the Sky button, you can explore the universe, seamlessly zooming from the familiar views of the constellations and stars to the deepest images ever taken of galaxies and more. From planets moving across the sky to supernovae exploding in distant galaxies, Sky provides a view of a dynamic universe that we hope you will enjoy.

In addition to allowing educators, amateurs or anyone with an interest in space to visually explore the sky, one of the most exciting aspects of Sky is its capability for research and discovery in astronomy. With the latest features in KML you can connect astronomical image and catalog databases directly to the visualization capabilities of Sky (e.g., searching the Sloan Digital Sky Survey database for the highest redshift quasars or correlating the infrared and optical sky to detect the presence of dust within our Galaxy). From releasing new data about the latest discovery of planets around nearby stars to identifying the host galaxy of a gamma ray burst, the possibilities are endless. Examples of how to build research applications, such as a view of the microwave background emission from the remnant of the Big Bang, can be found in the Google Earth Gallery.

It has been a lot of fun creating Google's first astronomical observatory. Go check it out and explore the sky from the comfort of your home. If you find something interesting, let us know on the Sky section of the Google Earth Community, or author your own KML applications to share your discoveries and data with everyone else. You can also find more Sky resources on our website.

Friday, July 27, 2007

Drink from the firehose with University Research Programs

Posted by Michael Lancaster and Josh Estelle, Software Engineers

Whenever we talk to university researchers, we hear a consistent message: they wish they had Google infrastructure. In pursuit of our company mission, we have built an elaborate set of systems for collecting, organizing, and analyzing information about the web. Operating and maintaining such an infrastructure is a high barrier to entry for many researchers. We recognize this and want to share some of the fruits of our labor with the research community. Today, in conjunction with the Google Faculty Summit, we're making two services available under the new University Research Programs: access to web search and access to machine translation.

University Research Program for Google Search

Google is focused on the success of the web, which is essentially an organism in and of itself with extremely complex contents and an ever-evolving structure. The primary goal of the University Research Program for Google Search is to promote research that creates a greater understanding of the web. We want to make it easy for researchers to analyze millions of queries in a reasonably short amount of time. We feel that such research can benefit everyone. As such, we've added a proviso that all research produced through this program must be published in a freely accessible manner.

University Research Program for Google Translate

The web is a global information medium with content from many cultures and languages. In order to break the language barrier, many researchers are hard at work building high quality, automatic, machine translation systems. We've been successful with our own statistical machine translation system, and are now happy to provide researchers greater access to it. The University Research Program for Google Translate provides researchers access to translations, including detailed word alignment information and lists of the n-best translations with detailed scoring information. We hope this program will be a terrific resource to help further the state of the art in automatic machine translation.

The web holds a wealth of untapped research potential and we look forward to seeing great new publications enabled by these new programs. Go ahead - surprise us!

By the way, since many researchers lead a double life as educators, we want to let you know about a site that recently launched: Google Code for Educators, designed to make it easy for CS faculty to integrate cutting-edge computer science topics into their courses. Check it out.

Tuesday, June 19, 2007

New Conference on Web Search and Data Mining

Posted by Ziv Bar-Yossef and Kevin McCurley, Research Team

The pace of innovation on the World Wide Web continues unabated more than fifteen years after the first servers went live. The web was initially used by only a small community of scientists, but there are now over a billion people on the planet who use the web in their lives. The World Wide Web grows and changes as a young organism might, reflecting the social forces of the users and information producers. Each year seems to bring a radical new change, including the movement of commerce to the web, the availability of realtime news on the web, mobile users being able to access the web from anywhere, new forms of media such as video, and the emergence of blogs changing politics and publishing.

This rapid pace of innovation and scale presents many interesting research questions. At Google our goal is to organize information in ways that are useful to users, and we regularly find ourselves solving problems that seemed like ridiculous thought experiments just a few years ago. We therefore welcome the arrival of a new conference on Web Search and Data Mining, prosaically named with the acronym WSDM (pronounced as wisdom). WSDM is intended to be complementary to the World Wide Web Conference tracks in search and data mining. The soaring volume of submissions to these two tracks over the past few years justifies the foundation of a new top-tier conference on web search and mining. WSDM is a joint effort of researchers from the three large search engines (Google, Yahoo, MSN) as well as top-notch scientists from academia (such as Jon Kleinberg from Cornell, Rajeev Motwani from Stanford, and Monika Henzinger from Google and EPFL). The first WSDM conference will take place at Stanford University (the place where both Google and Yahoo! were conceived by their founders). The conference will be held in February of 2008, and the deadline for submissions is July 30, 2007. For further information see the WSDM web site. If you have good papers on search or data mining in the pipeline, please consider sending them to WSDM.

We look forward to seeing you there!

Videos of talks

Posted by Kevin McCurley, Research Team

We've recently launched a Google Research web site that we'll be updating to provide information about research activities at Google. Among other things, you'll find there the ability to search and view videos of talks at Google.

One of the best features of working at Google is the rich variety of talks that we can attend, both technical and general interest. Most of these are videotaped for later viewing. This has multiple benefits:

In case of a scheduling conflict, Google employees may view talks at a later time (yes, some of us do have other things to do in the day).

Talks are available for viewing by Google employees at other sites. This provides us with a much more cohesive intellectual culture than most global companies.

When appropriate, speakers may opt to have their talks available on the World Wide Web. This provides a benefit to both viewers and speakers, since it allows speakers to reach a much broader audience, and it allows viewers to hear interesting talks without the need to be physically present.

The World Wide Web started out as a means for scientists to communicate among themselves. In the early days it provided a less formal and more timely means of distributing information than archival refereed publications, and it's now routine for a scientist to have a home page from which they distribute their writings and thoughts. Moreover, it's also now commonplace to find a large fraction of current scientific literature through the web, both refereed and unrefereed. In fact, the situation has evolved to the point where scientists often consult the web for publications before going to a library.

Archival publications are but one means of communication that has typically been used by scientists. Another mode of communication that has a long history of use is the presentation of talks at meetings and during visits to other institutions. Oral presentations have historically been less formal, and allow the speaker to be more speculative and interactive.

In the last few years, several technological developments have made it possible to distribute high quality video of talks on the web in addition to written publications. This distribution of videos from talks holds the promise of changing the way that scientists think about communication. Imagine what lessons would be available to us if we had the ability to view lectures by Kepler, Einstein, Turing, Shannon, or von Neumann! Imagine also what it would be like to be able to watch and listen to selected talks from conferences that are across the world, without having to suffer the burden of traveling to the remote location. Such media are unlikely to ever completely supplant the richness of communication that arises from personal interaction in physical proximity, but they will probably still change scientific communication as much as email and the web have already.

Saturday, February 17, 2007

Seattle conference on scalability

Posted by Amanda Camp, Software Engineer

We care a lot about scalability at Google. An algorithm that works only on a small scale doesn't cut it when we are talking global access, millions of people, millions of search queries. We think big and love to talk about big ideas, so we're planning our first ever conference on scalable systems. It will take place on June 23 at our Seattle office. Our goal: to create a collegial atmosphere for participants to brainstorm different ways to build the robust systems that can handle, literally, a world of information.

If you have a great new idea for handling a growing system or an innovative approach to scalability, we want to hear from you. Send a short note about who you are and a description of your 45-minute talk in 500 words or less to scalabilityconf@google.com by Friday, April 20.

With your help, we can create an exciting event that brings together great people and ideas. (And by the way, we'll bring the food.) If you'd like to attend but not speak, we'll post registration details later.

Thursday, February 15, 2007

Hear, here. A Sample of Audio Processing at Google.

Posted by Shumeet Baluja, Michele Covell, Pedro Moreno & Eugene Weinstein

Text isn't the only source of information on the web! We've been working on a variety of projects related to audio and visual recognition. One of the fundamental constraints that we have in designing systems at Google is the huge amount of data that we need to process rapidly. A few of the research papers that have come out of this work are shown here.

In the first pair of papers, to be presented at the 2007 International Conference on Acoustics, Speech and Signal Processing (Waveprint Overview, Waveprint-for-Known-Audio), we show how computer vision processing techniques, combined with large-scale data stream processing, can create an efficient system for recognizing audio that has been degraded by various means such as cell phone playback, lossy compression, echoes, time-dilation (as found on the radio), competing noise, etc.

It is also fun and surprising to see how often in research the same problem can be approached from a completely different perspective. In the third paper to be presented at ICASSP-2007 (Music Identification with WFST) we explore how acoustic modeling techniques commonly used in speech recognition, and finite state transducers used to represent and search large graphs, can be used in the problem of music identification. Our approach learns a common alphabet of music sounds (which we call music-phones) and represents large song collections as a big graph where efficient search is possible.

Perhaps one of the most interesting aspects of audio recognition goes beyond the matching of degraded signals, and instead attempts to capture meaningful notions of similarity. In our paper presented at the International Conference on Artificial Intelligence (Music Similarity), we describe a system that learns relevant similarities in music signals, while maintaining efficiency by using these learned models to create customized hashing functions.

We're extending these pieces of work in a variety of ways, not only in the learning algorithms used, but also in the application areas. If you're interested in joining Google Research and working on these projects, be sure to drop us a line.