Tuesday, December 21, 2010

More researchers dive into the digital humanities



When we started Google Book Search back in 2004, we were driven by the desire to make books searchable and discoverable online. But as that corpus grew -- we’ve now scanned approximately 10% of all books published in the modern era -- we began to realize how useful it would be for scholarly work. Humanities researchers have started to ask and answer questions about history, society, linguistics, and culture via quantitative techniques that complement traditional qualitative methods.

We’ve been gratified at the positive response to our initial forays into the digital humanities, from our Digital Humanities Research Awards earlier this year, to the Google Books Ngram Viewer and datasets made public just last week. Today we’re pleased to announce a second set of awards focusing on European universities and research centers.

We’ve given awards to 12 projects led by 15 researchers at 13 institutions:
  • Humboldt-Universität zu Berlin. Annotated Corpora in Studying and Teaching Variation and Change in Academic German, Anke Lüdeling
  • LIMSI/CNRS, Université Paris Sud. Building Multi-Parallel Corpora of Classical Fiction, François Yvon
  • Radboud Universiteit. Extracting Factoids from Dutch Texts, Suzan Verberne
  • Slovenian Academy of Sciences and Arts, Jožef Stefan Institute. Language models for historical Slovenian, Matija Ogrin and Tomaž Erjavec
  • Université d'Avignon, Université de Provence. Robust and Language Independent Machine Learning Approaches for Automatic Annotation of Bibliographical References in DH Books, Articles and Blogs, Patrice Bellot and Marin Dacos
  • Université François Rabelais-Tours. Full-text retrieval and indexation for Early Modern French, Marie-Luce Demonet
  • Université François Rabelais-Tours. Using Pattern Redundancy for Text Transcription, Jean-Yves Ramel and Jean-Charles Billaut
  • Universität Frankfurt. Towards a “Corpus Caucasicum”: Digitizing Pre-Soviet Cyrillic-Based Publications on the Languages of the Caucasus, Jost Gippert
  • Universität Hamburg. CLÉA: Literature Éxploration and Annotation Environment for Google Books Corpora, Jan-Christoph Meister
  • Universität zu Köln. Integrating Charter Research in Old and New Media, Manfred Thaller
  • Universität zu Köln. Validating Metadata-Patterns for Google Books' Ancient Places and Sites, Reinhard Foertsch
  • University of Zagreb. A Profile of Croatian neo-Latin, Neven Jovanović
Projects like these, blending empirical data and traditional scholarship, are springing up around the world. We’re eager to see what results they yield and what broader impact their success will have on the humanities.

(Cross-posted from the European Public Policy Blog)

Saturday, December 18, 2010

Robot hackathon connects with Android, browsers and the cloud



With a beer fridge stocked and music blasting, engineers from across Google—and the world—spent the month of October soldering and hacking in their 20% time to connect hobbyist and educational robots with Android phones. Just two months later we’re psyched to announce three ways you can play with your iRobot Create(R), LEGO(R) MINDSTORMS(R) or VEX Pro(R) through the cloud:

For the month of October, we invited any Googler who wanted to help connect robots to Google’s services in the cloud to pool their 20% time and participate in as much of the process as they could, from design to hard-core coding.

Thanks to our hardware partners (iRobot, LEGO Group, and VEX Robotics), we never suffered a shortage of supplies. Designers flew in from London, and prototypes were passed between engineers in Tel-Aviv, Hyderabad, Zurich, Munich and California. In Mountain View, we gathered around every Thursday night, rigging up a projector against the wall to share our week’s worth of demos while chowing on pizza. And here is what we produced (so far!):



We hope these applications provide some fun and inspire you to build upon this lightweight connectivity between robots, Android, the cloud and your browser.

Friday, December 17, 2010

Find out what’s in a word, or five, with the Google Books Ngram Viewer



[Cross-posted from the Google Books Blog]

Scholars interested in topics such as philosophy, religion, politics, art and language have employed qualitative approaches such as literary and critical analysis with great success. As more of the world’s literature becomes available online, it’s increasingly possible to apply quantitative methods to complement that research. So today Will Brockman and I are happy to announce a new visualization tool called the Google Books Ngram Viewer, available on Google Labs. We’re also making the datasets backing the Ngram Viewer, produced by Matthew Gray and intern Yuan K. Shen, freely downloadable so that scholars will be able to create replicable experiments in the style of traditional scientific discovery.

Comparing instances of [flute], [guitar], [drum] and [trumpet] (blue, red, yellow and green respectively) in English literature from 1750 to 2008

Since 2004, Google has digitized more than 15 million books worldwide. The datasets we’re making available today to further humanities research are based on a subset of that corpus, weighing in at 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish. The datasets contain phrases of up to five words with counts of how often they occurred in each year.
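For readers who want to work with the raw files, each row of the downloadable datasets pairs an n-gram with a year and its counts. As a sketch only (the file path is a placeholder, and the exact column order assumed here — n-gram, year, match count, then further counts — should be checked against the dataset documentation), here is how one might tally the yearly counts for a single phrase in Python:

```python
import csv
from collections import defaultdict

def yearly_counts(path, phrase):
    """Sum occurrence counts per year for one n-gram.

    Assumes tab-separated rows whose first three columns are:
    ngram, year, match_count (further count columns are ignored).
    """
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            ngram, year, match_count = row[0], int(row[1]), int(row[2])
            if ngram == phrase:
                counts[year] += match_count
    return dict(counts)
```

Dividing each year’s count by the total number of words published that year would give the relative frequencies that the Ngram Viewer plots.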

These datasets were the basis of a research project led by Harvard University’s Jean-Baptiste Michel and Erez Lieberman Aiden published today in Science and coauthored by several Googlers. Their work provides several examples of how quantitative methods can provide insights into topics as diverse as the spread of innovations, the effects of youth and profession on fame, and trends in censorship.

The Ngram Viewer lets you graph and compare phrases from these datasets over time, showing how their usage has waxed and waned over the years. One of the advantages of having data online is that it lowers the barrier to serendipity: you can stumble across something in these 500 billion words and be the first person ever to make that discovery. Below I’ve listed a few interesting queries to pique your interest:

World War I, Great War
child care, nursery school, kindergarten
fax, phone, email
look before you leap, he who hesitates is lost
virus, bacteria
tofu, hot dog
burnt, burned
flute, guitar, trumpet, drum
Paris, London, New York, Boston, Rome
laptop, mainframe, microcomputer, minicomputer
fry, bake, grill, roast
George Washington, Thomas Jefferson, Abraham Lincoln
supercalifragilisticexpialidocious

We know nothing can replace the balance of art and science that is the qualitative cornerstone of research in the humanities. But we hope the Google Books Ngram Viewer will spark some new hypotheses ripe for in-depth investigation, and invite casual exploration at the same time. We’ve started working with some researchers already via our Digital Humanities Research Awards, and look forward to additional collaboration with like-minded researchers in the future.

Thursday, December 16, 2010

Letting everyone do great things with App Inventor



In July, we announced App Inventor for Android, a Google Labs experiment that makes it easier for people to access the capabilities of their Android phone and create apps for their personal use. We were delighted (and honestly a bit overwhelmed!) by the interest that our announcement generated. We were even more delighted to hear the stories of what you were doing with App Inventor. All sorts of people (teachers and students, parents and kids, programming hobbyists and programming newbies) were building Android apps that perfectly fit their needs.

For example, we’ve heard of people building vocabulary apps for their children, SMS broadcasting apps for their community events, apps that track their favorite public transportation routes and—our favorite—a marriage proposal app.

We are so impressed with the great things people have done with App Inventor, we want to allow more people the opportunity to do great things. So we’re excited to announce that App Inventor (beta) is now available in Labs to anyone with a Google account.

Visit the App Inventor home page to get set up and start building your first app. And be sure to share your App Inventor story on the App Inventor user forum. Maybe this holiday season you can make a new kind of homemade gift—an app perfectly designed for the recipient’s needs!

Thursday, December 9, 2010

$6 million to faculty in Q4 Research Awards



We've just completed the latest round of Google Research Awards, our program that identifies and supports faculty pursuing research in areas of mutual interest. We had a record number of submissions this round, and are funding 112 awards across 20 different areas—for a total of more than $6 million. We’re also providing more than 150 Android devices for research and curriculum development to faculty whose projects rely heavily on Android hardware.

The areas that received the highest level of funding, due to the large number of proposals in these areas, were systems and infrastructure, human computer interaction, security and multimedia. We also continue to support international research; in this round, 29 percent of the funding was awarded to universities outside the U.S.

Some examples from this round of awards:
  • Injong Rhee, North Carolina State University. Experimental Evaluation of Increasing TCP Initial Congestion Window (Systems)
  • James Jones, University of California, Irvine. Bug Comprehension Techniques to Assist Software Debugging (Software Engineering)
  • Yonina Eldar, Technion, Israel. Semi-Supervised Regression with Auxiliary Knowledge (Machine Learning)
  • Victor Lavrenko, University of Edinburgh, United Kingdom. Interactive Relevance Feedback for Mobile Search (Information Retrieval)
  • James Glass, MIT. Crowdsourcing to Acquire Semantically Labelled Text and Speech Data for Speech Understanding (Speech)
  • Chi Keung Tang, The Hong Kong University of Science and Technology. Quasi-Dense 3D Reconstruction from 2D Uncalibrated Photos (Geo/Maps)
  • Phil Blunsom, Oxford, United Kingdom. Unsupervised Induction of Multi-Nonterminal Grammars for Statistical Machine Translation (Machine Translation)
  • Oren Etzioni, University of Washington. Accessing the Web utilizing Android Phones, Dialogue, and Open Information Extraction (Mobile)
  • Matthew Salganik, Princeton. Developments in Bottom-Up Social Data Collection (Social)

The full list of this round’s award recipients can be found in this PDF. For more information on our research award program, visit our website. And if you’re a faculty member, we welcome you to apply for one of next year’s two rounds. The deadline for the first round is February 1.

Wednesday, December 8, 2010

Four Googlers elected ACM Fellows this year



I am delighted to share with you that, like last year, the Association for Computing Machinery (ACM) has announced that four Googlers have been elected ACM Fellows in 2010, the most this year from any single corporation or institution.

Luiz Barroso, Dick Lyon, Muthu Muthukrishnan and Fernando Pereira were chosen for their contributions to computing and computer science that have provided fundamental knowledge to the field and have generated multiple innovations.

On behalf of Google, I congratulate our colleagues, who join Google’s 10 existing ACM Fellows and other professional society awardees in exemplifying our extraordinarily talented people. I’ve been struck by the breadth and depth of their contributions, and I hope that they will serve as inspiration for students and computer scientists around the world.

You can read more detailed summaries of their achievements below, including the official citations from ACM—although it’s really hard to capture everything they’ve accomplished in one paragraph!

Dr. Luiz Barroso: Distinguished Engineer
For contributions to multi-core computing, warehouse scale data-center architectures, and energy proportional computing
Over the past decade, Luiz has played a leading role in the definition and implementation of Google’s cluster architecture which has become a blueprint for the computing systems behind the world’s leading Internet services. As the first manager of Google’s Platforms Engineering team, he helped deliver multiple generations of cluster systems, including the world’s first container-based data center. His theoretical and engineering insights into the requirements of this class of machinery have influenced the processor industry roadmap towards more effective products for server-class computing. His book "The Datacenter as a Computer" (co-authored with Urs Hoelzle) was the first authoritative publication describing these so-called warehouse-scale computers for computer systems professionals and researchers. Luiz was among the first computer scientists to recognize and articulate the importance of energy-related costs for large data centers, and identify energy proportionality as a key property of energy efficient data centers. Prior to Google, at Digital Equipment's Western Research Lab, he worked on Piranha, a pioneering chip-multiprocessing architecture that inspired today’s popular multi-core products. As one of the lead architects and designers of Piranha, his papers, ideas and numerous presentations stimulated much of the research that led to products decades later.
Richard Lyon: Research Scientist
For contributions to machine perception and for the invention of the optical mouse
In the last four years at Google, Dick led the team developing new camera systems and improved photographic image processing for Street View, while leading another team developing technologies for machine hearing and their application to sound retrieval and ranking. He is now writing a book with Cambridge University Press, and will teach a Stanford course this fall on "Human and Machine Hearing," returning to a line of work that he carried out at Xerox, Schlumberger, and Apple while also working on the optical mouse, bit-serial VLSI computing machines, and handwriting recognition. The optical mouse (1980) is especially called out in the citation, because it exemplifies the field of "semi-digital" techniques that he developed, which also led to his work on the first single-chip Ethernet device. More recently, as chief scientist at Foveon, Dick invented and developed several new techniques for color image sensing and processing, and delivered acclaimed cameras and end-user software. A hallmark of Dick’s distinguished career has been a fruitful interplay between theory, including biological theory, and practical computing.
Dr. S. Muthukrishnan: Research Scientist
For contributions to efficient algorithms for string matching, data streams, and Internet ad auctions
Muthu has made significant contributions to the theory and practice of Internet ad systems during his more than four years at Google. Muthu's breakthrough WWW’09 paper presented a general stable matching framework that produces a (desirable) truthful mechanism capturing all of the common variations and more, in contradiction to prevailing wisdom. In display ads, where image, video and other types of ads are shown as users browse, Muthu led Ad Exchange at Google, to automate placement of display ads that were previously negotiated offline by sales teams. Prior to Google, Muthu was well known for his pioneering work in the area of data stream algorithmics (including a definitive book on the subject), which led to theoretical and practical advances still in use today to monitor the health and smooth operation of the Internet. Muthu has a talent for bringing new perspectives to longstanding open problems, as exemplified in the work he did on string processing. Muthu has made influential contributions to many other areas and problems including IP networks, data compression, scheduling, computational biology, distributed algorithms and database technology. As an educator, Muthu’s avant-garde teaching style won him the Award for Excellence in Graduate Teaching at Rutgers CS, where he is on the faculty. As a student remarked in his blog: "there is a magic in his class which kinda spellbinds you and it doesn't feel like a class. It’s more like a family sitting down for dinner to discuss some real world problems. It was always like that even when we were 40 people jammed in for cs-513."
Dr. Fernando Pereira: Research Director
For contributions to machine-learning models of natural language and biological sequences
For the past three years, Fernando has been leading some of Google’s most advanced natural language understanding efforts and some of the most important applications of machine learning technology. He has just the right mix of forward thinking ideas and the ability to put ideas into practice. With this balance, Fernando has helped his team of research scientists apply their ideas at the scale needed for Google. From when he wrote the first Prolog compiler (for the PDP-10, with David Warren) to his days as Chair at the University of Pennsylvania, Fernando has demonstrated a unique understanding of the challenges and opportunities facing companies like Google, with their unprecedented access to massive data sets and their application to the world of speech recognition, natural language processing and machine translation. At SRI, he pioneered probabilistic language models at a time when logic-based models were more popular. At AT&T, his work on a toolkit for finite-state models became an industry standard, both as a useful piece of software and in setting the direction for building ever larger language models. And his year at WhizBang had an influence on other leaders of the field, such as Andrew McCallum at the University of Massachusetts and John Lafferty and Tom Mitchell at Carnegie Mellon University, with whom Fernando developed the Conditional Random Field model for sequence processing that has become one of the leading tools of the trade.

Finally, we also congratulate Professor Christos Faloutsos of Carnegie Mellon, who is on sabbatical and a Visiting Faculty Member at Google this academic year. Professor Faloutsos is cited for contributions to data mining, indexing, fractals and power laws.

Update 12/8: Updated Dick Lyon's title and added information about Professor Faloutsos.

Friday, December 3, 2010

Google Launches Cantonese Voice Search in Hong Kong



On November 30th 2010, Google launched Cantonese Voice Search in Hong Kong. Google Search by Voice has been available in a growing number of languages since we launched our first US English system in 2008. In addition to US English, we already support Mandarin for Mainland China, Mandarin for Taiwan, Japanese, Korean, French, Italian, German, Spanish, Turkish, Russian, Czech, Polish, Brazilian Portuguese, Dutch, Afrikaans, and Zulu, along with special recognizers for English spoken with British, Indian, Australian, and South African accents.

Cantonese is widely spoken in Hong Kong, where it is written using traditional Chinese characters, similar to those used in Taiwan. Chinese script is much harder to type than the Latin alphabet, especially on mobile devices with small or virtual keyboards. People in Hong Kong typically use either “Cangjie” (倉頡) or “Handwriting” (手寫輸入) input methods. Cangjie (倉頡) has a steep learning curve and requires users to break the Chinese characters down into sequences of graphical components. The Handwriting (手寫輸入) method is easier to learn, but slow to use. Neither is an ideal input method for people in Hong Kong trying to use Google Search on their mobile phones.

Speaking is generally much faster and more natural than typing. Moreover, some Chinese characters – like “滘” in “滘西州” (Kau Sai Chau) and “砵” in “砵典乍街” (Pottinger Street) – are so rarely used that people often know only the pronunciation, and not how to write them. Our Cantonese Voice Search begins to address these situations by allowing Hong Kong users to speak queries instead of entering Chinese characters on mobile devices. We believe our development of Cantonese Voice Search is a step towards solving the text input challenge for devices with small or virtual keyboards for users in Hong Kong.

There were several challenges in developing Cantonese Voice Search, some unique to Cantonese, some typical of Asian languages and some universal to all languages. Here are some examples of problems that stood out:
  • Data Collection: In contrast to English, there are few existing Cantonese datasets that can be used to train a recognition system. Building a recognition system requires both audio and text data so it can recognize both the sounds and the words. For audio data, our efficient DataHound collection technique uses smartphones to record and upload large numbers of audio samples from local Cantonese-speaking volunteers. For text data, we sample from anonymized search query logs from http://www.google.com.hk to obtain the large amounts of data needed to train language models.
  • Chinese Word Boundaries: Chinese writing doesn’t use spaces to indicate word boundaries. To limit the size of the vocabulary for our speech recognizer and to simplify lexicon development, we use characters, rather than words, as the basic units in our system and allow multiple pronunciations for each character.
  • Mixing of Chinese Characters and English Words: We found that Hong Kong users mix more English into their queries than users in Mainland China and Taiwan. To build a lexicon for both Chinese characters and English words, we map English words to a sequence of Cantonese pronunciation units.
  • Tone Issues: Linguists disagree on the best count of the number of tones in Cantonese – some say 6, some say 7, or 9, or 10. In any case, it’s a lot. We decided to model tone-plus-vowel combinations as single units. In order to limit the complexity of the resulting model, some rarely-used tone-vowel combinations are merged into single models.
  • Transliteration: We found that some users use English words while others use the Cantonese transliteration (e.g., “Jordan” vs. “佐敦”). This makes it challenging to develop and evaluate the system, since it’s often impossible for the recognizer to distinguish between an English word and its Cantonese transliteration. During development we use a metric that simply checks whether the correct search results are returned.
  • Different Accents and Noisy Environment: People speak in different styles with different accents. They use our systems in a variety of environments, including offices, subways, and shopping malls. To make our system work in all these different conditions, we train it using data collected from many different volunteers in many different environments.
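The character-as-unit and English-mixing points above can be illustrated with a toy tokenizer. This is only an illustrative sketch, not Google’s implementation (the regex-based splitting and the function name are assumptions, and the real system also maps each unit to its Cantonese pronunciations): it keeps runs of Latin letters whole while treating each Chinese character as its own recognizer unit.

```python
import re

# One Latin-letter run (an English word) OR one character in the main
# CJK Unified Ideographs block; everything else is dropped in this sketch.
TOKEN = re.compile(r"[A-Za-z]+|[\u4e00-\u9fff]")

def to_units(query):
    """Split a mixed Cantonese/English query into per-character units,
    keeping English words intact (hypothetical helper for illustration)."""
    return TOKEN.findall(query)
```

Treating characters rather than words as the vocabulary sidesteps the word-boundary problem entirely, at the cost of allowing multiple pronunciations per character in the lexicon.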
Cantonese is Google’s third spoken language for Voice Search in the Chinese linguistic family, after Mandarin for Mainland China and Mandarin for Taiwan. We plan to continue to use our data collection and language modeling technologies to help speakers of Chinese languages easily input text and look up information.