Wednesday, August 22, 2007
Introducing Sky in Google Earth
At Google we are always interested in creating new ways to share ideas and information, and in applying these techniques to different research fields. Astronomy offers a great opportunity, with an abundance of images and information accessible to researchers and, indeed, anyone with an interest in the stars. With the release of the Google Earth 4.2 client, the new Sky feature acts as a virtual telescope, providing a view of some of the most detailed images ever taken of the night sky. By clicking the Sky button, you can explore the universe, zooming seamlessly from familiar views of the constellations and stars to the deepest images ever taken of galaxies and more. From planets moving across the sky to supernovae exploding in distant galaxies, Sky offers a view of a dynamic universe that we hope you will enjoy.
In addition to allowing educators, amateurs, or anyone with an interest in space to visually explore the sky, one of the most exciting aspects of Sky is its capability for research and discovery in astronomy. With the latest features in KML, you can connect astronomical image and catalog databases directly to the visualization capabilities of Sky (e.g., searching the Sloan Digital Sky Survey database for the highest-redshift quasars, or correlating the infrared and optical sky to detect the presence of dust within our Galaxy). From releasing new data about the latest discovery of planets around nearby stars to identifying the host galaxy of a gamma-ray burst, the possibilities are endless. Examples of how to build research applications, such as a view of the microwave background emission from the remnant of the Big Bang, can be found in the Google Earth Gallery.
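To make the KML connection concrete, here is a minimal sketch of generating a KML placemark for a sky object from a catalog entry. The element names follow standard KML; the mapping of right ascension and declination onto KML longitude and latitude is an assumption for illustration, not a documented Sky convention.

```python
# Hypothetical sketch: emit a minimal KML placemark for one sky object.
# The RA-to-longitude shift below is an assumed convention.

def sky_placemark(name, ra_deg, dec_deg):
    """Return a KML string marking one object on the celestial sphere."""
    lon = ra_deg - 180.0  # assumed mapping of right ascension to longitude
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        '  <Placemark>\n'
        f'    <name>{name}</name>\n'
        '    <Point>\n'
        f'      <coordinates>{lon:.2f},{dec_deg:.2f},0</coordinates>\n'
        '    </Point>\n'
        '  </Placemark>\n'
        '</kml>\n'
    )

print(sky_placemark("M31 (Andromeda Galaxy)", ra_deg=10.68, dec_deg=41.27))
```

A script could loop this over rows returned from a catalog query to produce a KML layer that Sky loads directly.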
It has been a lot of fun creating Google's first astronomical observatory. Go check it out, and explore the sky from the comfort of your home. If you find something interesting, let us know on the Sky section of the Google Earth Community, or author your own KML applications to share your discoveries and data with everyone else. You can also find more Sky resources on our website.
Friday, July 27, 2007
Drink from the firehose with University Research Programs
Posted by Michael Lancaster and Josh Estelle, Software Engineers
Whenever we talk to university researchers, we hear a consistent message: they wish they had Google infrastructure. In pursuit of our company mission, we have built an elaborate set of systems for collecting, organizing, and analyzing information about the web. Operating and maintaining such an infrastructure is a high barrier to entry for many researchers. We recognize this and want to share some of the fruits of our labor with the research community. Today, in conjunction with the Google Faculty Summit, we're making two services available under the new University Research Programs: access to web search and machine translation.
University Research Program for Google Search
Google is focused on the success of the web, which is essentially an organism in and of itself with extremely complex contents and an ever-evolving structure. The primary goal of the University Research Program for Google Search is to promote research that creates a greater understanding of the web. We want to make it easy for researchers to analyze millions of queries in a reasonably short amount of time. We feel that such research can benefit everyone. As such, we've added a proviso that all research produced through this program must be published in a freely accessible manner.
University Research Program for Google Translate
The web is a global information medium with content from many cultures and languages. In order to break the language barrier, many researchers are hard at work building high quality, automatic, machine translation systems. We've been successful with our own statistical machine translation system, and are now happy to provide researchers greater access to it. The University Research Program for Google Translate provides researchers access to translations, including detailed word alignment information and lists of the n-best translations with detailed scoring information. We hope this program will be a terrific resource to help further the state of the art in automatic machine translation.
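The post does not describe the research API's response format, so the following is purely an illustrative sketch: a plain data structure standing in for an n-best translation list with model scores, and a toy re-ranking step of the kind a researcher might experiment with. The candidate strings and score values are invented.

```python
# Illustrative only: a hypothetical n-best list and a re-ranking helper.

nbest = [  # (candidate translation, model log-score) -- invented values
    ("the house is small", -2.1),
    ("the house is little", -2.4),
    ("the home is small", -3.0),
]

def rerank(candidates, bonus_fn):
    """Re-sort candidates after adding an external feature score."""
    return sorted(candidates,
                  key=lambda c: c[1] + bonus_fn(c[0]),
                  reverse=True)

# Toy external feature: a mild preference for shorter outputs.
best = rerank(nbest, lambda text: -0.1 * len(text.split()))
print(best[0][0])  # → the house is small
```

Access to word alignments and scored n-best lists is exactly what makes experiments like this possible without building a full translation system first.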
The web holds a wealth of untapped research potential and we look forward to seeing great new publications enabled by these new programs. Go ahead - surprise us!
By the way, since many researchers lead a double life as educators, we want to let you know about a site that recently launched: Google Code for Educators, designed to make it easy for CS faculty to integrate cutting-edge computer science topics into their courses. Check it out.
Tuesday, June 19, 2007
New Conference on Web Search and Data Mining
Posted by Ziv Bar-Yossef and Kevin McCurley, Research Team
The pace of innovation on the World Wide Web continues unabated more than fifteen years after the first servers went live. The web was initially used by only a small community of scientists, but there are now over a billion people on the planet who use the web in their lives. The World Wide Web grows and changes as a young organism might, reflecting the social forces of the users and information producers. Each year seems to bring a radical new change, including the movement of commerce to the web, the availability of realtime news on the web, mobile users being able to access the web from anywhere, new forms of media such as video, and the emergence of blogs changing politics and publishing.
This rapid pace of innovation and scale presents many interesting research questions. At Google our goal is to organize information in ways that are useful to users, and we regularly find ourselves solving problems that seemed like ridiculous thought experiments just a few years ago. We therefore welcome the arrival of a new conference on Web Search and Data Mining, prosaically named with the acronym WSDM (pronounced "wisdom"). WSDM is intended to be complementary to the World Wide Web Conference tracks in search and data mining. The soaring volume of submissions to these two tracks over the past few years justifies founding a new top-tier conference on web search and mining. WSDM is a joint effort of researchers from the three large search engines (Google, Yahoo!, MSN) as well as top-notch scientists from academia (such as Jon Kleinberg from Cornell, Rajeev Motwani from Stanford, and Monika Henzinger from Google and EPFL). The first WSDM conference will take place at Stanford University (the place where both Google and Yahoo! were conceived by their founders). The conference will be held in February of 2008, and the deadline for submissions is July 30, 2007. For further information see the WSDM web site. If you have good papers on search or data mining in the pipeline, please consider sending them to WSDM.
We look forward to seeing you there!
Videos of talks
Posted by Kevin McCurley, Research Team
We've recently launched a Google Research web site that we'll be updating to provide information about research activities at Google. Among other things, you'll find there the ability to search and view videos of talks at Google.
The World Wide Web started out as a means for scientists to communicate among themselves. In the early days it provided a less formal and timely means of distributing information than archival refereed publications, and it's now routine for a scientist to have a home page from which they distribute their writings and thoughts. Moreover, it's also now commonplace to find a large fraction of current scientific literature through the web, both refereed and unrefereed. In fact, the situation has evolved to the point where scientists often consult the web for publications before going to a library.
Archival publications are but one means of communication that has typically been used by scientists. Another mode of communication that has a long history of use is the presentation of talks at meetings and during visits to other institutions. Oral presentations have historically been less formal, and allow the speaker to be more speculative and interactive.
In the last few years, several technological developments have made it possible to distribute high quality video of talks on the web in addition to written publications. This distribution of videos from talks holds the promise of changing the way that scientists think about communication. Imagine what lessons would be available to us if we had the ability to view lectures by Kepler, Einstein, Turing, Shannon, or von Neumann! Imagine also what it would be like to be able to watch and listen to selected talks from conferences that are across the world, without having to suffer the burden of traveling to the remote location. Such media are unlikely to ever completely supplant the richness of communication that arises from personal interaction in physical proximity, but it will probably still change scientific communication as much as email and the web have already.
One of the best features of working at Google is the rich variety of talks that we can attend, both technical and general interest. Most of these are videotaped for later viewing. This has multiple benefits:
- In case of a scheduling conflict, Google employees may view talks at a later time (yes, some of us do have other things to do in the day).
- Talks are available for viewing by Google employees at other sites. This provides us with a much more cohesive intellectual culture than most global companies.
- When appropriate, speakers may opt to have their talks available on the World Wide Web. This benefits both viewers and speakers: speakers reach a much broader audience, and viewers can hear interesting talks without needing to be physically present.
Saturday, February 17, 2007
Seattle conference on scalability
Posted by Amanda Camp, Software Engineer
We care a lot about scalability at Google. An algorithm that works only on a small scale doesn't cut it when we are talking global access, millions of people, millions of search queries. We think big and love to talk about big ideas, so we're planning our first ever conference on scalable systems. It will take place on June 23 at our Seattle office. Our goal: to create a collegial atmosphere for participants to brainstorm different ways to build the robust systems that can handle, literally, a world of information.
If you have a great new idea for handling a growing system or an innovative approach to scalability, we want to hear from you. Send a short note about who you are and a description of your 45-minute talk in 500 words or less to scalabilityconf@google.com by Friday, April 20.
With your help, we can create an exciting event that brings together great people and ideas. (And by the way, we'll bring the food.) If you'd like to attend but not speak, we'll post registration details later.
Thursday, February 15, 2007
Hear, here. A Sample of Audio Processing at Google.
Posted by Shumeet Baluja, Michele Covell, Pedro Moreno & Eugene Weinstein
Text isn't the only source of information on the web! We've been working on a variety of projects related to audio and visual recognition. One of the fundamental constraints in designing systems at Google is the huge amount of data we need to process rapidly. A few of the research papers that have come out of this work are shown here.
In the first pair of papers, to be presented at the 2007 International Conference on Acoustics, Speech and Signal Processing (Waveprint Overview, Waveprint-for-Known-Audio), we show how computer vision processing techniques, combined with large-scale data stream processing, can create an efficient system for recognizing audio that has been degraded by various means such as cell phone playback, lossy compression, echoes, time-dilation (as found on the radio), competing noise, etc.
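A toy sketch of the fingerprint-and-match idea behind such systems (this is not the Waveprint algorithm itself, which works on spectrograms with wavelet and min-hash stages): summarize a signal as a compact bit pattern of coarse energy comparisons, which survives mild degradation, then compare fingerprints by Hamming distance.

```python
import math

def fingerprint(samples, cells=16):
    """One bit per cell: 1 if the cell's mean energy beats the global mean."""
    n = len(samples) // cells
    energies = [sum(x * x for x in samples[i*n:(i+1)*n]) / n
                for i in range(cells)]
    mean = sum(energies) / cells
    return [1 if e > mean else 0 for e in energies]

def hamming(a, b):
    """Number of differing fingerprint bits."""
    return sum(x != y for x, y in zip(a, b))

# Toy signal: a quiet passage followed by a loud one, then a lightly
# degraded copy of it (small added interference).
clean = [(0.1 if i < 400 else 1.0) * math.sin(0.05 * i) for i in range(800)]
noisy = [s + 0.01 * math.sin(1.3 * i) for i, s in enumerate(clean)]

print(hamming(fingerprint(clean), fingerprint(noisy)))  # → 0
```

Because the fingerprint keeps only coarse relative energy, the degraded copy still matches exactly; real systems add many more stages to tolerate the harsher degradations listed above.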
It is also fun and surprising to see how often in research the same problem can be approached from a completely different perspective. In the third paper to be presented at ICASSP-2007 (Music Identification with WFST) we explore how acoustic modeling techniques commonly used in speech recognition, and finite state transducers used to represent and search large graphs, can be used in the problem of music identification. Our approach learns a common alphabet of music sounds (which we call music-phones) and represents large song collections as a big graph where efficient search is possible.
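A highly simplified stand-in for that pipeline: once songs are transcribed into sequences over a small "music-phone" alphabet, identifying a snippet becomes a search over the collection. The real system compiles the collection into a weighted finite-state transducer and searches that graph efficiently; the alphabet, transcriptions, and titles below are invented for illustration.

```python
# Hypothetical music-phone transcriptions for a tiny song collection.
songs = {
    "song_a": "m1 m4 m2 m2 m7 m3",
    "song_b": "m5 m1 m1 m6 m4 m2",
}

def identify(snippet_phones, collection):
    """Return titles whose transcription contains the snippet's phones."""
    query = " ".join(snippet_phones)
    return [title for title, phones in collection.items()
            if query in phones]

print(identify(["m2", "m7"], songs))  # → ['song_a']
```

The transducer representation replaces this brute-force scan with shared structure across songs, which is what makes search over large collections tractable.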
Perhaps one of the most interesting aspects of audio recognition goes beyond the matching of degraded signals, and instead attempts to capture meaningful notions of similarity. In our paper presented at the International Conference on Artificial Intelligence (Music Similarity), we describe a system that learns relevant similarities in music signals, while maintaining efficiency by using these learned models to create customized hashing functions.
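The hashing-for-similarity idea can be sketched with random-hyperplane locality-sensitive hashing: similar feature vectors tend to agree on more hash bits, so candidate lookups stay cheap. Note the hedge: in the paper the hash functions are learned from similarity data, whereas here they are drawn at random for brevity.

```python
import random

def make_hash(dim, bits, seed=0):
    """Build a random-hyperplane hash: one sign bit per hyperplane."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
    def h(vec):
        return tuple(int(sum(p * v for p, v in zip(plane, vec)) > 0)
                     for plane in planes)
    return h

h = make_hash(dim=4, bits=8)
a = [1.0, 0.2, 0.0, 0.5]
b = [1.0, 0.25, 0.05, 0.5]   # near-duplicate of a
c = [-1.0, 0.9, -0.3, 0.0]   # unrelated vector

same_ab = sum(x == y for x, y in zip(h(a), h(b)))
same_ac = sum(x == y for x, y in zip(h(a), h(c)))
print(same_ab, same_ac)  # near-duplicates usually agree on more bits
```

Learning the hyperplanes instead of sampling them is what lets the hash reflect a notion of musical similarity rather than raw vector geometry.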
We're extending these pieces of work in a variety of ways, not only in the learning algorithms used, but also in the application areas. If you're interested in joining Google Research and working on these projects, be sure to drop us a line.
Tuesday, December 12, 2006
Google Research Picks for Videos of the Year
Posted by Peter Norvig
Everyone else is giving you year-end top ten lists of their favorite movies, so we thought we'd give you ours, but we're skipping Cars and The Da Vinci Code and giving you autonomous cars and open source code. Our top twenty (we couldn't stop at ten):
- Winning the DARPA Grand Challenge: Sebastian Thrun stars in the heartwarming drama of a little car that could.
- The Graphing Calculator Story: A thriller starring Ron Avitzur as the engineer who snuck into the Apple campus to write code.
- Should Google Go Nuclear?: Robert Bussard (former Asst. Director of the AEC) talks about inertial electrostatic fusion.
- A New Way to Look at Networking: Van Jacobson as the old pro discovering that the old problems have not gone away.
- Python 3000: Guido van Rossum always looks on the bright side of life in this epic look at the future of Python.
- How to Survive a Robot Uprising: Daniel Wilson stars in this sci-fi horror story.
- The New "Bill of Rights of Information Society": Raj Reddy talks about how to get the right information to the right people at the right time.
- Practical Common Lisp: In this foreign film, Peter Seibel introduces the audience to a new language. Subtitles in parentheses.
- Debugging Backwards in Time: Starring Bil Lewis in this sequel to Back to the Future.
- Building Large Systems at Google: Narayanan Shivakumar takes us behind the scenes to see how Google builds large distributed systems. Like Charlie and the Chocolate Factory but without the Oompa-Loompas.
- The Science and Art of User Experience at Google: Jen Fitzpatrick continues the behind-the-scenes look.
- Universally Accessible Demands Accessibility for All of Humanity: MacArthur "Genius Award" Fellow Jim Fruchterman talks about accessibility for the blind and others.
- DNA and the Brain: Nobel Laureate James Watson explains how the key to understanding the brain is in our genes.
- Steve Wozniak: This one-man show is playing to boffo reviews.
- Jane Goodall: The celebrated primatologist discusses her mission to empower individuals to improve the environment.
- Computers Versus Common Sense: Doug Lenat reprises his role as the teacher trying to get computers to understand.
- The Google Story: David Vise talks about his book on Google.
- The Search: John Battelle talks about his book on Google.
- The Archimedes Palimpsest: Like Da Vinci Code, only true.
- The Paradox of Choice - Why More is Less: With Barry Schwartz. Hmm, maybe I should have made this a top three list?
Wednesday, November 29, 2006
CSCW 2006: Collaborative editing 20 years later
Posted by Lilly Irani & Jens Riegelsberger, User Experience team
9am Mountain View, California. 6pm Zurich, Switzerland. The two of us sit separated by thousands of miles, telephones tucked under our ears, talking about this blog post and typing words and edits into Google Docs. As we talk about the title, we start typing into the same paragraph -- and Lilly gets a warning: "You've edited a paragraph that Jens has been editing!" Lilly stops typing so she doesn't lose her thoughts and coordinates with Jens over the phone. Then we realize: "We just talked about this problem at the conference we're writing about!"
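The warning in that anecdote can be sketched as optimistic concurrency control at paragraph granularity: each paragraph carries a version number, and a save against a stale version is rejected so the writer can merge by hand. This is an illustrative model only, not how Google Docs is actually implemented.

```python
class SharedDoc:
    """Toy shared document with per-paragraph version counters."""

    def __init__(self, paragraphs):
        self.text = list(paragraphs)
        self.version = [0] * len(paragraphs)

    def read(self, i):
        return self.text[i], self.version[i]

    def save(self, i, new_text, base_version):
        """Apply an edit only if nobody else edited paragraph i first."""
        if self.version[i] != base_version:
            return False  # stale edit: surface a conflict warning instead
        self.text[i] = new_text
        self.version[i] += 1
        return True

doc = SharedDoc(["Title", "Intro paragraph"])
_, v = doc.read(1)
print(doc.save(1, "Jens's edit", v))   # → True (first writer wins)
print(doc.save(1, "Lilly's edit", v))  # → False (second writer is warned)
```

Rejecting the stale save, rather than silently overwriting, is what preserves the first writer's words and pushes the merge decision back to the people editing.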
Two weeks ago four Googlers ventured north to attend ACM CSCW in Banff, Alberta, Canada. CSCW is ACM's conference on Computer Supported Cooperative Work and brings together computer scientists, social scientists, and designers interested in how people live their lives -- at work, at play, and in between -- with and around technology, with a focus on understanding the design of technological systems. Topics like the challenges and implementation of collaborative editing are staples at CSCW.
As this year was the conference's 20th anniversary, we had a chance to hear from many of the founders of CSCW: Irene Greif, Jonathan Grudin, Tom Malone, Judy Olson, Lucy Suchman, among others. Not surprisingly, the mood was introspective, with many speakers tracing the impact of the community over time and looking critically and constructively at the future paths the research community might take. Many sessions focused on less traditional areas of research, such as how Facebook figures into college students' school transitions and how tagging vocabularies evolve and are shaped by technology in a movie community. Jens also gave a talk on his pre-Google research on how photos and voice profiles affect people's choice of gaming partners. And he participated in a workshop exploring how people trust -- and learn to trust -- in online environments.
Apart from actively taking part in the debates and Q&As, we also demoed Google's tools for getting things done, collaboratively or solo: Google Docs & Spreadsheets and Google Notebook. These were met with much interest, as these publicly available Google tools build on insights gained in the CSCW field over the last 20 years.
If you're interested in these issues, you'd be a great addition to our team. Learn about available positions in user experience research and design.
Friday, September 22, 2006
And the Awards Go To ...
Posted by Proud Googlers
We're usually a modest bunch, but we couldn't help but let you know about some honors and awards bestowed on Googlers recently:
- Ramakrishnan Srikant is the winner of the 2006 ACM SIGKDD Innovation Award for his work on pruning techniques for the discovery of association rules, and for developing new data mining approaches that respect the privacy of people in the database.
- Henry Rowley and Shumeet Baluja, along with CMU professor Takeo Kanade, received the Longuet-Higgins prize for "a contribution which has stood the test of time," namely their 1996 paper Neural Network based face detection. The award was given at the 2006 Computer Vision and Pattern Recognition (CVPR) Conference.
- Team Smartass, consisting of Christopher Hendrie, Derek Kisman, Ambrose Feinstein and Daniel Wright won first place in the ICFP (International Conference on Functional Programming) programming contest, using a combination of C++, Haskell and 2D. Third place went to Can't Spell Awesome without ASM, a team consisting of Google engineer Jon Dethridge, former Google interns Ralph Furmaniak and Tomasz Czajka, and Reid Barton of Harvard. They got the judges at the functional programming conference to admit "Assembler is not too shabby."
- Peter Norvig was named a Berkeley Distinguished Alumni in Computer Science, and gave the keynote commencement address. We'd also like to congratulate Prabhakar Raghavan, Head of Yahoo Research, who was a co-recipient of this award.
- Simon Quellen Field's book Return of Gonzo Gizmos was a selection of the Scientific American Book Club.
- Google summer intern Rion Snow (along with Stanford professors Dan Jurafsky and Andrew Ng) got the best paper award at the 2006 ACL/COLING (computational linguistics) conference for his paper titled Semantic taxonomy induction from heterogenous evidence.
- Google summer intern Lev Reyzin won the outstanding student paper award at ICML (International Conference on Machine Learning) for work with Rob Schapire of Princeton on How Boosting the Margin Can Also Boost Classifier Complexity.
- As we mentioned earlier, Michael Fink, Michele Covell and Shumeet Baluja won a best paper award for Social- and Interactive-Television Applications Based on Real-Time Ambient-Audio Identification.
- Update 13 Oct 2006: Paul Rademacher has been named one of the top innovators under 35 by MIT's Technology Review. He was cited for his mashup of Google Maps and Craigslist housing data at housingmaps.com.
- Update 31 Oct 2006: We forgot Alon Halevy, who won the VLDB 10 Year Best Paper Award for Querying Heterogeneous Information Sources Using Source Descriptions with Anand Rajaraman and Joann J. Ordille.
Friday, August 4, 2006
All Our N-gram are Belong to You
Posted by Alex Franz and Thorsten Brants, Google Machine Translation Team
Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.
We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
Watch for an announcement from the Linguistic Data Consortium (LDC), which will be distributing the dataset soon, and then order your set of 6 DVDs. And let us hear from you - we're excited to hear what you will do with the data, and we're always interested in feedback about this dataset, or other potential datasets that might be useful for the research community.
Update (22 Sept. 2006): The LDC now has the data available in their catalog. The counts are as follows:
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
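These figures also allow a quick back-of-the-envelope check on the corpus. A minimal sketch in Python (the numbers are copied directly from the list above): dividing tokens by sentences gives the average sentence length in the crawled text.

```python
# Figures from the LDC release counts listed above.
tokens = 1_024_908_267_229     # number of tokens
sentences = 95_119_665_584     # number of sentences

avg_sentence_len = tokens / sentences
print(f"average sentence length: {avg_sentence_len:.1f} tokens")  # ~10.8
```

An average of roughly 10.8 tokens per sentence is plausible for Web text, which tends toward shorter sentences than edited prose.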
The following is an example of the 3-gram data contained in this corpus:
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community , 109
ceramics community . 212
ceramics community for 61
ceramics companies . 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company , 133
ceramics company . 92
ceramics company </S> 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85
ceramics composites ferrites 56
ceramics composition as 41
ceramics computer graphics 51
ceramics computer imaging 52
ceramics consist of 92
The following is an example of the 4-gram data in this corpus:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the infrastructure 500
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the injector 56
serve as the inlet 41
serve as the inner 87
serve as the input 1323
serve as the inputs 189
serve as the insertion 49
serve as the insourced 67
serve as the inspection 43
serve as the inspector 66
serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
serve as the institution 279
serve as the institutional 461
serve as the instructional 173
serve as the instructor 286
serve as the instructors 161
serve as the instrument 614
serve as the instruments 193
serve as the insurance 52
serve as the insurer 82
serve as the intake 70
serve as the integral 68
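The samples above suggest the simplest way to use the counts: treat each line as an n-gram followed by its count, group the counts by the first n-1 words, and estimate conditional probabilities by relative frequency. Here is a minimal sketch, assuming whitespace-separated lines as shown in the excerpts (hedged: the released files may separate the n-gram from its count with a tab, which this split also handles), using a few trigram lines from the 3-gram sample:

```python
from collections import defaultdict

# A few trigram lines copied from the 3-gram sample above.
sample = """ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics comes from 660"""

counts = {}                        # (context, word) -> count
context_totals = defaultdict(int)  # context -> total over listed continuations

for line in sample.splitlines():
    *words, count = line.split()   # last field is the count
    context, word = tuple(words[:-1]), words[-1]
    counts[(context, word)] = int(count)
    context_totals[context] += int(count)

def p(word, *context):
    """Relative frequency of `word` after `context`, among the listed continuations only."""
    ctx = tuple(context)
    return counts.get((ctx, word), 0) / context_totals[ctx]

print(p("of", "ceramics", "collection"))  # 76 / (43 + 52 + 68 + 76) ~ 0.318
```

Note that in the full dataset the denominator for a true conditional probability would come from the bigram count of "ceramics collection", not just the trigram continuations excerpted here; this sketch normalizes only over what is shown.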