Allison Druin, University of Maryland, Elizabeth Foss, University of Maryland, Hilary Hutchinson, Evan Golub, University of Maryland, and Leshell Hatley, University of Maryland
Large Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings
European Conference on Machine Learning (ECML), Best Paper
Jason Weston, Samy Bengio, and Nicolas Usunier, Université Paris 6 - LIP6
In this paper, we introduce a generic framework for learning a joint representation of images and their labels, which can then be used for various tasks, including image ranking and image annotation. We also propose an efficient training algorithm that scales to tens of millions of images and hundreds of thousands of labels, while focusing training on making good predictions at the top of the ranked list. The models are fast at prediction time and have low memory usage, making it possible to run such systems on a laptop or mobile device.
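The core idea can be sketched in a few lines: map image features and labels into one shared low-dimensional space, and score an (image, label) pair by a dot product in that space. Everything below (the dimensions, the labels, the random parameters) is illustrative; the paper's actual model is trained with a ranking loss at vastly larger scale.

```python
import random

random.seed(0)
DIM = 8  # dimension of the shared embedding space (illustrative)

def embed(vec, matrix):
    """Linearly map a feature vector into the joint space."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Hypothetical "learned" parameters: one projection for image features,
# and one embedding vector per label, all living in the same DIM-d space.
image_proj = [[random.gauss(0, 0.1) for _ in range(4)] for _ in range(DIM)]
label_emb = {lbl: [random.gauss(0, 0.1) for _ in range(DIM)]
             for lbl in ["cat", "dog", "car"]}

def score(image_feats, label):
    """Relevance of a label to an image = dot product in the joint space."""
    img = embed(image_feats, image_proj)
    return sum(a * b for a, b in zip(img, label_emb[label]))

def annotate(image_feats):
    """Rank all labels for an image by their joint-space score."""
    return sorted(label_emb, key=lambda l: score(image_feats, l), reverse=True)

print(annotate([0.5, -1.2, 0.3, 0.9]))
```

Because a label is just a point in the same space as the images, annotation is a nearest-neighbor lookup rather than running one classifier per label, which is what keeps prediction fast and memory small.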
Overlapping Experiment Infrastructure: More, Better, Faster Experimentation
Knowledge Discovery and Data Mining (KDD)
Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer
Google's data-driven culture requires running a large number of live traffic experiments. This paper describes Google's overlapping experiment infrastructure, in which a single event (e.g. a web search) can be assigned to multiple simultaneous large experiments. The infrastructure and supporting tools provide a framework for taking experiments from design to decision making and launch, and can be generalized to many other web applications.
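A toy sketch of the layered-assignment idea: experiments in the same layer are mutually exclusive, while each event picks up one experiment from every layer, so experiments in different layers overlap on the same traffic. The layer names, experiment names, and bucket count below are all hypothetical.

```python
import hashlib

N_BUCKETS = 1000  # diversion granularity within each layer (illustrative)

# Hypothetical layer configuration: within a layer, experiments are
# mutually exclusive; across layers they overlap, so a single event
# can be in several experiments at once.
LAYERS = {
    "ui_layer":      ["blue_links", "green_links", "control"],
    "ranking_layer": ["new_ranker", "control"],
}

def bucket(event_id, layer):
    """Deterministically hash an event into one of N_BUCKETS per layer."""
    digest = hashlib.md5(f"{layer}:{event_id}".encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS

def assign(event_id):
    """Pick exactly one experiment per layer based on the event's bucket."""
    assignments = {}
    for layer, experiments in LAYERS.items():
        b = bucket(event_id, layer)
        # Evenly partition the layer's buckets among its experiments.
        assignments[layer] = experiments[b * len(experiments) // N_BUCKETS]
    return assignments

print(assign("search_event_42"))
```

Hashing per layer (rather than once globally) is what makes the assignments independent across layers, so one layer's experiment split does not bias another's.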
NLP
North American Chapter of the Association for Computational Linguistics (NAACL)
Slav Petrov
It is well known that the Expectation Maximization algorithm can converge to widely varying local maxima. This paper shows that this can be advantageous when learning latent variable grammars for syntactic parsing. By combining multiple state-of-the-art individual grammars into an unweighted product model, parsing accuracy can be improved from 90.2% to 91.8% for English, and from 80.3% to 84.5% for German.
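The combination step itself is simple and can be sketched as follows: score each candidate parse by the unweighted product of the individual grammars' probabilities and take the argmax. The trees and probabilities below are made up for illustration; a real parser combines scores over a shared parse forest rather than a tiny fixed candidate list.

```python
import math

# Hypothetical per-grammar probabilities for three candidate parse trees
# of one sentence; each grammar was trained from a different random seed,
# so EM converged to a different local maximum for each.
grammar_scores = [
    {"tree_a": 0.5, "tree_b": 0.30, "tree_c": 0.20},   # grammar 1
    {"tree_a": 0.4, "tree_b": 0.45, "tree_c": 0.15},   # grammar 2
    {"tree_a": 0.6, "tree_b": 0.25, "tree_c": 0.15},   # grammar 3
]

def product_model_best(scores_per_grammar):
    """Score each parse by the unweighted product of the grammars'
    probabilities (computed as a sum of logs) and return the argmax."""
    trees = scores_per_grammar[0].keys()
    def log_product(tree):
        return sum(math.log(g[tree]) for g in scores_per_grammar)
    return max(trees, key=log_product)

print(product_model_best(grammar_scores))  # -> tree_a
```

Note that tree_b is the single best parse under grammar 2, yet tree_a wins the product (0.5 * 0.4 * 0.6 = 0.12 vs. 0.03375): a parse must be plausible under *every* grammar to score well, which is the intuition behind the accuracy gains.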
Software Engineering
International Symposium on Code Generation and Optimization (CGO)
Jason Mars, University of Virginia, Neil Vachharajani, Robert Hundt, Mary Lou Soffa, University of Virginia
This paper takes a big step forward on an important and pressing problem in the field of computer science today. It presents a lightweight runtime solution that significantly improves the utilization of datacenter servers, by up to 58%. The work also received the CGO 2010 Best Presentation Award.
Speech
Interspeech
Maryam Kamvar and Doug Beeferman
Say what? Have you been speaking your search queries into your mobile device rather than typing them? Spoken search is available on Android, iPhone, and BlackBerry devices, and we see an increasing number of searches coming in by voice on these phones. In our paper “Say What: Why users choose to speak their web queries” we investigate, at an aggregate level, which factors are most predictive of spoken queries. Understanding the context in which speech-driven search is used (or, conversely, not used) can help improve recognition engines and spoken interface design. So, save keystrokes and say your query!
Query Language Modeling for Voice Search
IEEE Workshop on Spoken Language Technology
Ciprian Chelba, Johan Schalkwyk, Thorsten Brants, Vida Ha, Boulos Harb, Will Neveitt, Carolina Parada*, Johns Hopkins University, and Peng Xu
The paper describes language modeling for google.com query data, and its application to speech recognition for Google Voice Search.
Our empirical findings include:
- 10% relative gains in WER from large scale modeling,
- a lesser-known yet potentially quite detrimental interaction between Kneser-Ney smoothing and entropy pruning (approx. 10% relative increase in WER),
- evidence that hints at non-stationarity of the query stream, and
- surprisingly strong dependence across three English locales---USA, Britain and Australia.
Structured Data
Very Large Data Bases (VLDB)
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, Google Inc.
Dremel is a scalable, interactive ad-hoc query system. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system is widely used at Google and serves as the foundational technology behind BigQuery, a product launched in limited preview mode.
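A toy illustration of why the columnar layout matters for aggregation queries (Dremel's actual format also handles nested records via repetition and definition levels, which this sketch ignores):

```python
# Row-oriented vs. column-oriented layout for a simple aggregation.
# In a columnar store, SUM(clicks) touches only the 'clicks' column,
# which is why Dremel-style systems read far less data per query
# than a system that must scan whole records.

rows = [
    {"url": "a.com", "clicks": 10, "country": "US"},
    {"url": "b.com", "clicks": 3,  "country": "DE"},
    {"url": "a.com", "clicks": 7,  "country": "US"},
]

# Transpose into a columnar layout: one contiguous list per field.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# The aggregation reads a single column, never the 'url' or 'country' data.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # -> 20
```

Combined with the multi-level execution trees described above (each level aggregating partial results from the level below), this is what lets aggregations over trillion-row tables finish in seconds.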
Systems and Infrastructure
USENIX Symposium on Operating Systems Design and Implementation (OSDI)
Daniel Peng and Frank Dabek
In the past, Google accumulated a whole day’s worth of changes to the web and ran a series of enormous MapReduces to apply this batch of changes to our index of the web. This system led to a delay of several days between crawling a document and presenting it to users in search results. To meet our goal of reducing the indexing delay to minutes, we needed to update the index as each individual document was crawled, rather than in daily batches. No existing infrastructure supported this kind of incremental transformation at web scale, so we built Percolator: a framework for transforming a large repository using small ACID transactions.
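The pattern can be sketched with a toy in-memory stand-in (Percolator itself runs snapshot-isolation transactions on top of Bigtable; the table, locking, and observer names below are purely illustrative):

```python
import threading

# A toy stand-in for Percolator's programming model: writes happen inside
# small transactions, and "observers" registered on a column run whenever
# that column changes, so the repository is transformed incrementally,
# one document at a time, instead of in giant daily batches.

table = {}               # (row, column) -> value
lock = threading.Lock()  # coarse lock standing in for real ACID machinery
observers = {}           # column -> list of callbacks

def on_change(column):
    def register(fn):
        observers.setdefault(column, []).append(fn)
        return fn
    return register

def transact(writes):
    """Apply a batch of writes atomically, then fire matching observers."""
    with lock:
        for (row, col), value in writes.items():
            table[(row, col)] = value
    for (row, col), value in writes.items():
        for fn in observers.get(col, []):
            fn(row, value)

@on_change("raw_document")
def index_document(row, value):
    # Incrementally derive the index entry for just this one document.
    transact({(row, "indexed"): value.upper()})

transact({("doc1", "raw_document"): "hello web"})
print(table[("doc1", "indexed")])  # -> HELLO WEB
```

Chaining observers this way (a write triggers work, whose writes trigger further work) is what replaces the rigid sequence of MapReduce stages.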
Availability in Globally Distributed Storage Systems
USENIX Symposium on Operating Systems Design and Implementation (OSDI)
Daniel Ford, Francois Labelle, Florentina Popovici, Murray Stokely, Van-Anh Truong*, Columbia University, Luiz Barroso, Carrie Grimes, and Sean Quinlan
In our paper, we characterize the availability of cloud storage systems, based on extensive monitoring of Google's main storage infrastructure, along with the sources of failure that affect availability. We also present statistical models for reasoning about the impact of design choices such as data placement, recovery speed, and replication strategies, including replication across multiple data centers.
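A back-of-the-envelope version of this kind of reasoning (the paper's models account for correlated, bursty failures, which make independence assumptions like the one below optimistic; all probabilities here are made up for illustration):

```python
# Data is unavailable only if every replica is down at the same moment.
# Under an (optimistic) independence assumption, each extra replica
# multiplies the unavailability by the per-node failure probability.

def unavailability(node_fail_prob, replicas):
    return node_fail_prob ** replicas

p = 0.01  # chance a given node is unavailable at any moment (assumed)
for r in (1, 2, 3):
    print(f"{r} replica(s): P(unavailable) = {unavailability(p, r):.2e}")

# Cross-data-center replication: an outage takes out a whole center's
# replicas at once, so a center is unavailable if the center itself is
# down OR both of its local replicas are down.
p_center = 0.001  # chance an entire data center is down (assumed)
per_center = p_center + (1 - p_center) * p ** 2
two_centers = per_center ** 2  # both centers unavailable simultaneously
print(f"2 centers, 2 replicas each: P(unavailable) = {two_centers:.2e}")
```

Even this crude model shows why placement matters: with correlated failures inside a center, where the replicas live can matter more than how many there are.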
Vision
IEEE International Conference on Data Mining (ICDM)
Sergey Ioffe
Given the huge amounts of very high-dimensional data such as images and videos, we frequently need to "sketch" the data -- that is, represent it in a much more compact form while still allowing us to accurately determine how different any two images or videos are. In this paper, we describe a sketching method for L1, one of the most common distance measures. It works by first hashing the data with a new algorithm, and then compressing each hash to a small number of bits, which is learned from the data. The method is fast and allows distances to be estimated accurately, while reducing storage requirements by a factor of 100.
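For intuition about L1 sketching in general, here is a classic approach based on Cauchy random projections (Indyk's method, not the algorithm in this paper, and without the learned bit compression): project both vectors onto shared Cauchy-distributed directions, and the median of the absolute coordinate-wise differences estimates their L1 distance.

```python
import random
import statistics

random.seed(1)
K = 200   # number of sketch coordinates (more -> more accurate estimate)
DIM = 50  # dimensionality of the original vectors

# Shared random projection directions. The ratio of two independent
# standard Gaussians is Cauchy-distributed, which is the 1-stable
# distribution needed for L1.
proj = [[random.gauss(0, 1) / random.gauss(0, 1) for _ in range(DIM)]
        for _ in range(K)]

def sketch(v):
    """Compress a DIM-d vector to K projection values."""
    return [sum(p * x for p, x in zip(row, v)) for row in proj]

def est_l1(sk_a, sk_b):
    """Each coordinate difference is Cauchy with scale = true L1 distance,
    so the median of absolute differences recovers that distance."""
    return statistics.median(abs(a - b) for a, b in zip(sk_a, sk_b))

a = [random.random() for _ in range(DIM)]
b = [random.random() for _ in range(DIM)]
true_l1 = sum(abs(x - y) for x, y in zip(a, b))
est = est_l1(sketch(a), sketch(b))
print(true_l1, est)  # the two values should be close
```

The paper's contribution goes further: a new hashing algorithm plus a data-driven compression of each hash to a few bits, which is what yields the ~100x storage reduction quoted above.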
*) work carried out while at Google