Friday, July 29, 2011

President's Council Recommends Open Data for Federal Agencies



Cross-posted with the Public Sector and Elections Lab Blog

One of the things I most enjoy about working on data management is the ability to work on a variety of problems, both in the private sector and in government. I recently had the privilege of serving on a working group of the President’s Council of Advisors on Science and Technology (PCAST) studying the challenges of conserving the nation’s ecosystems. The report, titled “Sustaining Environmental Capital: Protecting Society and the Economy,” was presented to President Obama on July 18, 2011. The full report is now available to the public.

The press release announcing the report summarizes its recommendations:
The Federal Government should launch a series of efforts to assess thoroughly the condition of U.S. ecosystems and the social and economic value of the services those ecosystems provide, according to a new report by the President’s Council of Advisors on Science and Technology (PCAST), an independent council of the Nation’s leading scientists and engineers. The report also recommends that the Nation apply modern informatics technologies to the vast stores of biodiversity data already collected by various Federal agencies in order to increase the usefulness of those data for decision- and policy-making.

One of the key challenges we face in assessing the condition of ecosystems is that much of the data pertaining to these systems is locked up in individual databases. Even though this data is often collected using government funds, it is not always available to the public, and in other cases it is available but not in usable formats. This is a classic example of a data integration problem that occurs in many other domains.

The report calls for creating an ecosystem, EcoINFORMA, around data. The crucial piece of this ecosystem is making the relevant data publicly available in a timely manner and, most importantly, in a machine-readable form. Publishing data embedded in a PDF file is a classic example of what does not count as machine readable. For example, if you are publishing a tabular data set, a computer program should be able to access the meta-data (e.g., column names, date collected) and the data rows directly, without having to heuristically extract them from surrounding text.
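
To make the contrast concrete, here is a minimal sketch of what machine-readable publishing enables; the file names and fields are invented for illustration and are not part of any actual EcoINFORMA specification. A program can read the meta-data and the rows directly, with no scraping of text out of a PDF.

    import csv
    import json

    # Hypothetical files a publishing agency might provide alongside a data set.
    with open("stream_survey_metadata.json") as f:
        metadata = json.load(f)          # e.g. {"columns": [...], "date_collected": "2010-06-01"}

    with open("stream_survey.csv") as f:
        rows = list(csv.DictReader(f))   # each data row keyed by column name

    print(metadata["date_collected"], metadata["columns"])
    print(len(rows), "observations; first row:", rows[0])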

Once the data is published, it can be discovered by search engines. Data from multiple sources can be combined to provide additional insight, and the data can be visualized and analyzed by sophisticated tools. The main point is that innovation should be pursued by many parties (academics, commercial, government), each applying their own expertise and passions.

There is a subtle point about how much meta-data should be provided before publishing the data. Unfortunately, requiring too much meta-data (e.g., standard schemas) often stymies publication. When meta-data exists, that’s great, but when it’s not there or is not complete, we should still publish the data in a timely manner. If the data is valuable and discoverable, there will be someone in the ecosystem who will enhance the data in an appropriate fashion.

I look forward to seeing this ecosystem evolve and am excited that Google Fusion Tables, our own cloud-based service for visualizing, sharing and integrating structured data, can contribute to its development.

Thursday, July 21, 2011

Studies Show Search Ads Drive 89% Incremental Traffic



Advertisers often wonder whether search ads cannibalize their organic traffic. In other words, if search ads were paused, would clicks on organic results increase, and make up for the loss in paid traffic? Google statisticians recently ran over 400 studies on paused accounts to answer this question.

In what we call “Search Ads Pause Studies,” our group of researchers observed organic click volume in the absence of search ads. They then built a statistical model to predict click volume at given levels of ad spend, using spend and organic impression volume as predictors. These models generated estimates of the incremental clicks attributable to search ads (IAC), or in other words, the percentage of paid clicks that are not made up for by organic clicks when search ads are paused.
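
As a rough illustration of the idea (this is a minimal sketch, not the statistical model actually used in the studies), one could fit a least-squares model of organic clicks on ad spend and organic impressions across a pause, then compare the predicted organic clicks with and without ads. All numbers below are made up.

    import numpy as np

    # Hypothetical weekly observations spanning an ad pause (spend is zero while paused).
    spend         = np.array([120., 130., 0., 0., 0., 125.])
    organic_impr  = np.array([9.1, 9.4, 9.0, 9.2, 9.3, 9.5])      # organic impressions (thousands)
    organic_clks  = np.array([410., 420., 455., 460., 462., 415.])
    paid_clks_avg = 500.                                           # average paid clicks/week when ads run

    # Fit organic clicks ~ a*spend + b*impressions + c by ordinary least squares.
    X = np.column_stack([spend, organic_impr, np.ones_like(spend)])
    coef, *_ = np.linalg.lstsq(X, organic_clks, rcond=None)

    # Predicted organic clicks at the typical ad spend and at zero spend.
    organic_with_ads    = coef @ np.array([120., 9.3, 1.])
    organic_without_ads = coef @ np.array([0.,   9.3, 1.])

    # Organic clicks that "make up" for the paused ads, and the incremental share (IAC).
    made_up = organic_without_ads - organic_with_ads
    iac = 1.0 - made_up / paid_clks_avg
    print(f"Estimated incremental ad clicks: {iac:.0%}")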

The results were surprising. On average, the incremental ad clicks percentage across verticals is 89%. This means that a full 89% of the traffic generated by search ads is not replaced by organic clicks when ads are paused. This number was consistently high across verticals. The full study can be found here.

Faculty from across the Americas meet in New York for the Faculty Summit



(Cross-posted from the Official Google Blog)

Last week, we held our seventh annual Computer Science Faculty Summit. For the first time, the event took place at our New York City office; nearly 100 faculty members from universities in the U.S., Canada and Latin America attended. The two-day Summit focused on systems, artificial intelligence and mobile computing. Alfred Spector, VP of research and special initiatives, hosted the conference and led lively discussions on privacy, security and Google’s approach to research.

Google’s Internet evangelist, Vint Cerf, opened the Summit with a talk on the challenges involved in securing the “Internet of things”—that is, uniquely identifiable objects (“things”) and their virtual representations. With almost 2 billion Internet users and 5 billion mobile devices worldwide, Vint expounded upon the idea that Internet security is not just about technology, but also about policy and global institutions. He stressed that our new digital ecosystem is complex and large in scale, and includes both hardware and software. It also has multiple stakeholders, diverse business models and a range of legal frameworks. Vint argued that making and keeping the Internet secure over the next few years will require technical innovation and global collaboration.

After Vint kicked things off, faculty spent the two days attending presentations by Google software engineers and research scientists, including John Wilkes on the management of Google's large hardware infrastructure, Andrew Chatham on the self-driving car, Johan Schalkwyk on mobile speech technology and Andrew Moore on the research challenges in commerce services. Craig Nevill-Manning, the engineering founder of Google’s NYC office, gave an update on Google.org, particularly its recent work in crisis response. Other talks covered the engineering work behind products like Ad Exchange and Google Docs, and the range of engineering projects taking place across 35 Google offices in 20 countries. For a complete list of the topics and sessions, visit the Faculty Summit site. Also, a few of our attendees heeded Alfred’s call to recap their breakout sessions in verse—download a PDF of one of our favorite poems, about the future of mobile computing, penned by NYU professor Ken Perlin.

A highlight of this year’s Summit was Bill Schilit’s presentation of the Library Wall, a Chrome OS experiment featuring an eight-foot tall full-color virtual display of ebooks that can be browsed and examined individually via touch screen. Faculty members were invited to play around with the digital-age “bookshelf,” which is one of the newest additions to our NYC office.

We’ve already posted deeper dives on a few of the talks—including cluster management, mobile search and commerce. We also collected some interesting faculty reflections. For more information on all of our programs, visit our University Relations website. The Faculty Summit is meant to connect forerunners across the computer science community—in business, research and academia—and we hope all our attendees returned home feeling informed and inspired.

Wednesday, July 20, 2011

Google Americas Faculty Summit: Reflections from our attendees



Last week, we held our seventh annual Americas Computer Science Faculty Summit at our New York City office. About 100 faculty members from universities in the Western Hemisphere attended the two-day Summit, which focused on systems, artificial intelligence and mobile computing. To finish up our series of Summit recaps, I asked four faculty members to share their perspectives on the Summit, thinking their views would complement our own blog posts: Jeannette Wing from Carnegie Mellon, Rebecca Wright from Rutgers, Andrew Williams from Spelman and Christos Kozyrakis from Stanford.

Jeannette M. Wing, Carnegie Mellon University
Fun, cool, edgy and irreverent. Those words describe my impression of Google after attending the Google Faculty Summit, held for the first time at its New York City location. Fun and cool: The Library Wall prototype, which attendees were privileged to see, is a peek at the future where e-books have replaced physical books, but where physical space, equipped with wall-sized interactive displays, still encourages the kind of serendipitous browsing we enjoy in the grand libraries of today. Cool and edgy: Being in the immense old Port Authority building in the midst of the Chelsea district of Manhattan is just plain cool and adds an edgy character to Google not found at the corporate campuses of Silicon Valley. Edgy, or more precisely “on the edge,” is Google as it explores new directions: social networking (Google+), mobile voice search (check out the microphone icon in your search bar) and commerce (e.g., selling soft goods on-line). Why these directions? Some are definitely for business reasons, but some are also simply because Google can (self-driving cars) and because it’s good for society (e.g., emergency response in Haiti, Chile, New Zealand and Japan). “Irreverent” is Alfred Spector’s word and sums it up—Google is a fun place to work, where smart people can be creative, build cool products and make a difference in untraditional ways.

But the one word that epitomizes Google is “scale.” How do you manage clusters on the order of hundreds of thousands of processors where the focus is faults, not performance or power? What knowledge about humanity can machine learning discover from 12 million scanned books in 400 languages, amounting to five billion digitized pages and two trillion words? Beyond Google, how do you secure the Internet of Things when eventually everything from light bulbs to pets will all be Internet-enabled and accessible?

One conundrum. Google’s hybrid model of research clearly works for Google and for Googlers. It is producing exciting advances in technology and having an immeasurable impact on society. As was evident from our open and intimate breakout sessions, Google stays abreast of cutting-edge academic research, often by hiring our Ph.D. students. The challenge for computer science research is, “how can academia build on the shoulders of Google’s scientific results?”

Academia does not have access to the scale of data or the complexity of system constraints found within Google. For the good of the entire industry-academia-government research ecosystem, I hope that Google continues to maintain an open dialogue with academia—through faculty summits, participation and promotion of open standards, robust university relations programs and much more.
-----

Rebecca Wright, Rutgers University
This was my first time attending a Google Faculty Summit. It was great to see it held in my "backyard," which emphasized the message that much of Google's work takes place outside their Mountain View campus. There was a broad variety of excellent talks, each of which only addressed the tip of the iceberg of the particular problem area. The scope and scale of the work being done at Google is really mind-boggling. It both drives Google’s need for new solutions and allows the company to consider new approaches. At Google’s scale, automation is critical and almost everything requires research advances, engineering advances, considerable development effort and engagement of people outside Google (including academics, the open source community, policymakers and "the crowd").

A unifying theme in much of Google’s work is the use of approaches that leverage its scale rather than fight it (such as MapMaker, which combines Google's data and computational resources with people's knowledge about and interest in their own geographic areas). In addition to hearing presentations, the opportunity to interact with the broad variety of Googlers present as well as other faculty was really useful and interesting. As a final thought, I would like to see Google get more into education, particularly in terms of advancing hybrid in-class/on-line technologies that take advantage of the best features of each.
-----

Andrew Williams, Spelman College
At the 2011 Google Faculty Summit in New York, the idea that we are moving past the Internet of computers to an "Internet of Things" became a clear theme. After hearing presentations by Googlers, such as Vint Cerf dapperly dressed in a three piece suit, I realized that we are in fact moving to an Internet of Things and People. The pervasiveness of connected computing devices and very large systems for cloud computing all interacting with socially connected people were expounded upon both in presentations and in informal discussions with faculty from around the world. The "Internet of people" aspect was also evident in emerging policies we touched on, involving security, privacy and social networks (like the Google+ project). I also enjoyed the demonstration of the Google self-driving car as an advanced application of artificial intelligence that integrates computer vision, localization and decision making in a real world transportation setting. I was impressed with how Google volunteers its talent, technology and time to help people, as it did with its crisis response efforts in Haiti, Japan and other parts of the world.

As an educator and researcher in humanoid robotics and AI at a historically black college for women in Atlanta, the Google Faculty Summit motivated me to improve how I educate our students to eventually tackle the grand challenges posed by the Internet of Things and People. It was fun to learn how Google is actively seeking to solve these grand challenges on a global scale.
-----

Christos Kozyrakis, Stanford University
What makes the Google Faculty Summit a unique event to attend is its wide-reaching focus. Our discipline-focused conferences facilitate in-depth debates over a narrow set of challenges. In contrast, the Faculty Summit is about bringing together virtually all disciplines of computer science to turn information into services with an immediate impact on our everyday lives. It is fascinating to discuss how large data centers and distributed software systems allow us to use machine learning algorithms on massive datasets and get voice based search, tailored shopping recommendations or driver-less cars. Apart from the general satisfaction of seeing these applications in action, one of the important takeaways for me is that specifying and managing the behavior of large systems in an end-to-end manner is currently a major challenge for our field. Now is probably the best time to be a computer scientist, and I am leaving with a better understanding of what advances in my area of expertise can have the biggest overall impact.

I also enjoyed having the summit at the New York City office, away from Google headquarters in Silicon Valley. It’s great to see in practice how the products of our field (networking, video-conferencing and online collaboration tools) allow for technology development anywhere in the world.
-----

As per Jeannette Wing’s comments about Google being “irreverent,” I own up to using the term—initially about a subject on which Aristophanes once wrote (I’ll leave that riddle open). As long as you take my usage in the right way (that is, we’re very serious about the work we do, but perhaps not about all the things one would expect of a large company), I’m fine with it. There’s so much in the future of computer science and its potential impact that we should always be coming at things in new ways, with the highest aspirations and with joy at the prospects.

Tuesday, July 19, 2011

Google Americas Faculty Summit Day 2: Shopping, Coupons and Data



On July 14 and 15, we held our seventh annual Faculty Summit for the Americas with our New York City offices hosting for the first time. Over the next few days, we will be bringing you a series of blog posts dedicated to sharing the Summit's events, topics and speakers. --Ed

Google is ramping up its commitment to making shopping and commerce fun, convenient and useful. As a computer scientist with a background in algorithms and large scale artificial intelligence, what's most interesting to me is the breadth of fundamental new technologies needed in this area. They range from the computer vision technology that recognizes fashion styles and visually similar items of clothing, to a deep understanding of (potentially) all goods for sale in the world, to new and convenient payments technologies, to the intelligence that can be brought to the mobile shopping experience, to the infrastructure needed to make these technologies work on a global scale.

At the Faculty Summit this week, I took the opportunity to engage faculty in some of the fascinating research questions that we are working on within Google Commerce. For example, consider the processing flow required to present a user with an appropriate set of shoes from which to choose, given the input of an image of a high heel shoe. First, we need to segment, or identify, the object of interest in the input image. If the input is an image of a high heel with the Alps in the background, we don’t want to find images of different types of shoes with the Alps in the background; we want images of high heels.

The second step is to extract the object’s “visual signature” and build an index using color, shape, pattern and metadata. Then, a search is performed using a variety of similarity measures. The implementation of this processing flow raises several research challenges. For example, the calculations required to determine similar shoes could be slow due to the number of factors that must be considered. Segmentation can also pose a difficult problem because of the complexity of the feature extraction algorithms.
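
As a toy illustration of the signature-and-index idea (not the features, index or similarity measures Google actually uses), the sketch below treats a coarse color histogram as the “visual signature” of a segmented image and ranks catalog items by histogram intersection.

    import numpy as np

    def color_signature(image, bins=8):
        """Toy visual signature: a normalized joint RGB color histogram.
        `image` is an H x W x 3 uint8 array of the segmented object."""
        hist, _ = np.histogramdd(image.reshape(-1, 3),
                                 bins=(bins, bins, bins), range=[(0, 256)] * 3)
        hist = hist.ravel()
        return hist / hist.sum()

    def most_similar(query_sig, catalog_sigs, k=5):
        """Rank catalog items by histogram intersection with the query signature.
        `catalog_sigs` is an N x (bins**3) matrix of precomputed signatures."""
        sims = np.minimum(catalog_sigs, query_sig).sum(axis=1)
        return np.argsort(-sims)[:k]

    # Usage sketch: query = color_signature(segmented_shoe_pixels)
    #               top_items = most_similar(query, catalog_sigs)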

Another important consideration is personalization. Consumers want items that correspond to their interests, so we should include results based on historical search and shopping data for a particular person (who has opted in to such features). More importantly, we want to downweight styles that the shopper has indicated they do not like. Finally, we also need to include some creative items to simulate the serendipitous connections one makes when shopping in a store. This is a new kind of search experience, which requires a new kind of architecture and new ways to infer shopper satisfaction. As a result, we find ourselves exploring new kinds of statistical models and the underlying infrastructure to support them.
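
A toy re-ranking sketch along these lines (the field names, weights and heuristics are purely illustrative, not how our ranking actually works) might look like:

    def rerank(candidates, disliked_styles, interest_boost):
        """Toy re-ranking: start from visual similarity, boost styles matching the
        shopper's opted-in history, and downweight styles they have disliked."""
        def score(item):
            s = item["similarity"]                        # from the visual search step
            s *= interest_boost.get(item["style"], 1.0)   # e.g. {"high heel": 1.3}
            if item["style"] in disliked_styles:
                s *= 0.2                                  # strong downweight, not removal
            return s
        return sorted(candidates, key=score, reverse=True)

    results = rerank(
        candidates=[{"id": 1, "style": "high heel", "similarity": 0.91},
                    {"id": 2, "style": "sandal", "similarity": 0.91}],
        disliked_styles={"sandal"},
        interest_boost={"high heel": 1.3})
    print([item["id"] for item in results])   # [1, 2]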


Saturday, July 16, 2011

Google Americas Faculty Summit Day 1: Cluster Management



On July 14 and 15, we held our seventh annual Faculty Summit for the Americas with our New York City offices hosting for the first time. Over the next few days, we will be bringing you a series of blog posts dedicated to sharing the Summit's events, topics and speakers. --Ed

At this year’s Faculty Summit, I had the opportunity to provide a glimpse into the world of cluster management at Google. My goal was to brief the audience on the challenges of this complex system and explain a few of the research opportunities that these kinds of systems provide.

First, a little background. Google’s fleet of machines is spread across many data centers, each of which consists of a number of clusters (a set of machines with a high-speed network between them). Each cluster is managed as one or more cells. A user (in this case, a Google engineer) submits jobs to a cell for it to run. A job could be a service that runs for an extended period, or a batch job that runs, for example, a MapReduce updating an index.

Cluster management operates on a very large scale: whereas a storage system that can hold a petabyte of data is considered large by most people, our storage systems send us an emergency page when they have only a few petabytes of free space remaining. This scale gives us opportunities (e.g., a single job may use several thousand machines at a time), but also many challenges (e.g., we constantly need to worry about the effects of failures). The cluster management system juggles the needs of a large number of jobs in order to achieve good utilization, trying to strike a balance between a number of conflicting goals.

To complicate things, data centers can have multiple types of machines, different network and power-distribution topologies, a range of OS versions and so on. We also need to handle changes, such as rolling out a software or hardware upgrade, while the system is running.

Our current cluster management system is about seven years old now (several generations for most Google software) and, although it has been a huge success, it is beginning to show its age. We are currently prototyping a new system that will replace it; most of my talk was about the challenges we face in building this system. We are building it to handle larger cells, to look into the future (by means of a calendar of resource reservations) to provide predictable behavior, to support failures as a first-class concept, to unify a number of today’s disjoint systems and to give us the flexibility to add new features easily. A key goal is that it should provide predictable, understandable behavior to users and system administrators. For example, the latter want to know answers to questions like “Are we in trouble? Are we about to be in trouble? If so, what should we do about it?”
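
To make the “calendar of resource reservations” idea concrete, here is a minimal sketch of an admission check against such a calendar; the names are invented, only CPUs are considered, and the real system handles many more resources and constraints.

    from dataclasses import dataclass

    @dataclass
    class Reservation:
        start: int   # time slot index (e.g., hour of day)
        end: int
        cpus: int

    def can_admit(new, existing, cell_cpus):
        """Toy check: can the new reservation fit alongside existing ones in every slot?"""
        for t in range(new.start, new.end):
            used = sum(r.cpus for r in existing if r.start <= t < r.end)
            if used + new.cpus > cell_cpus:
                return False
        return True

    # Example: a cell with 10,000 CPUs and one standing service reservation.
    calendar = [Reservation(start=0, end=24, cpus=6000)]
    print(can_admit(Reservation(start=8, end=12, cpus=5000), calendar, cell_cpus=10_000))   # False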

Putting all this together requires advances in a great many areas. I touched on a few of them, including scheduling and ways of representing and reasoning with user intentions. One of the areas that I think doesn’t receive nearly enough attention is system configuration—describing how systems should behave, how they should be set up, how those setups should change, etc. Systems at Google typically rely on dozens of other services and systems. It’s vital to simplify the process of making controlled changes to configurations that result in predictable outcomes, every time, even in the face of heterogeneous infrastructure environments and constant flux.
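
One simple way to think about controlled configuration changes (a sketch of the general idea, not how Google's systems are actually configured) is to treat a configuration as declarative data and compute an explicit diff before any change is rolled out:

    def config_diff(current, desired):
        """Return the changes needed to move from the current config to the desired one."""
        changes = {}
        for key in set(current) | set(desired):
            if current.get(key) != desired.get(key):
                changes[key] = (current.get(key), desired.get(key))
        return changes

    # Hypothetical service configuration; all keys and values are illustrative.
    current = {"replicas": 20, "cpu_per_task": 2, "os_version": "prodimage-37"}
    desired = {"replicas": 24, "cpu_per_task": 2, "os_version": "prodimage-38"}

    for key, (old, new) in sorted(config_diff(current, desired).items()):
        print(f"{key}: {old} -> {new}")   # review the diff before applying it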

We’ll be taking steps toward these goals ourselves, but the intent of today’s discussion was to encourage people in the academic community to think about some of these problems and come up with new and better solutions, thereby raising the level for us all.

Google Americas Faculty Summit Day 1: Mobile Search



On July 14 and 15, we held our seventh annual Faculty Summit for the Americas with our New York City offices hosting for the first time. Over the next few days, we will be bringing you a series of blog posts dedicated to sharing the Summit's events, topics and speakers. --Ed

Google’s mobile speech team has a lofty goal: recognize any search query spoken in English and return the relevant results. Regardless of whether your accent skews toward a Southern drawl, a Boston twang, or anything in between, spoken searches like “navigate to the Metropolitan Museum,” “call California Pizza Kitchen” or “weather, Scarsdale, New York” should provide immediate responses with a map, the voice of the hostess at your favorite pizza place or an online weather report. The responses must be fast and accurate or people will stop using the tool, and—given that the number of speech queries has more than doubled over the past year—the team is clearly succeeding.

As a software engineer on the mobile speech team, I took the opportunity of the Faculty Summit this week to present some of the interesting challenges surrounding developing and implementing mobile search. One of the immediate puzzles we have to solve is how to train a computer system to recognize speech queries. There are two aspects to consider: the acoustic model, or the sound of letters and words in a language; and the language model, which in English is essentially grammar, or what allows us to predict words that follow one another. The language model we can put together using a huge amount of data gathered from our query logs. The acoustic model, however, is more challenging.
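
As a toy illustration of the language-model side (real systems use far larger n-gram models with smoothing; the queries below are made up), a bigram model over a query log can be built like this:

    from collections import Counter

    queries = ["weather scarsdale new york",
               "weather new york",
               "navigate to new york"]

    unigrams, bigrams = Counter(), Counter()
    for q in queries:
        words = ["<s>"] + q.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))

    def p_next(prev, word):
        """Maximum-likelihood bigram probability P(word | prev); no smoothing."""
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    print(p_next("new", "york"))   # 1.0 in this tiny log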

To build our acoustic model, we could conduct “supervised learning” where we collect 100+ hours of audio data from search queries and then transcribe and label the data. We use this data to translate a speech query into a written query. This approach works fairly well, but it doesn’t improve as we collect more audio data, since each new hour of audio would also need to be transcribed and labeled by hand. Thus, we use an “unsupervised model” where we continuously add more audio data to our training set as users do speech queries.
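
One common way to realize this kind of unsupervised training (a sketch under the assumption of a recognizer object whose transcribe method returns a transcript and a confidence score; the actual pipeline is not described here) is to keep only the machine transcriptions the recognizer is already confident about:

    def self_training_pass(recognizer, unlabeled_clips, confidence_threshold=0.9):
        """Hypothetical confidence-filtered self-training step."""
        new_examples = []
        for clip in unlabeled_clips:
            text, confidence = recognizer.transcribe(clip)   # assumed interface
            if confidence >= confidence_threshold:
                new_examples.append((clip, text))            # treat the transcript as a label
        return new_examples

    # The accepted (clip, text) pairs are added to the acoustic-model training set,
    # and the cycle repeats as more voice queries arrive.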

Given the scale of this system, another interesting challenge is testing accuracy. The traditional approach is to have human testers run assessments. Over the past year, however, we have determined that our automated system has the same or better level of accuracy as our human testers, so we’ve decided to create a new method for automated testing at scale, a project we are working on now.

The current voice search system is trained on over 230 billion words and has a one million word vocabulary, meaning it understands all the different contexts in which those one million words can be used. It requires multiple CPU decades for training and data processing, plus a significant amount of storage, so this is an area where Google’s large infrastructure is essential. It’s exciting to be a part of such cutting edge research, and the Faculty Summit was an excellent opportunity to share our latest innovations with people who are equally inspired by this area of computer science.

Wednesday, July 13, 2011

What You Capture Is What You Get: A New Way for Task Migration Across Devices



We constantly move from one device to another while carrying out everyday tasks. For example, we might find an interesting article on a desktop computer at work, then bring the article with us on a mobile phone during the commute and keep reading it on a laptop or a TV when we get home. Cloud computing and web applications have made it possible to access the same data and applications on different devices and platforms. However, there are not many ways to easily move tasks across devices that are as intuitive as drag-and-drop in a graphical user interface.

Last year, our research team started developing new technologies for users to easily migrate their tasks across devices. In a project named Deep Shot, we demonstrated how a user can easily move web pages and applications, such as Google Maps directions, between a laptop and an Android phone by using the phone camera. With Deep Shot, a user can simply take a picture of their monitor with a phone camera, and the captured content automatically shows up and becomes instantly interactive on the mobile phone.

This project was inspired by our observations that many people tend to take a picture of map directions on the monitor using their mobile phone camera, rather than using other approaches such as email. Taking pictures feels more direct and convenient, and fits well with our everyday activities, which are often opportunistic. Instead of just capturing raw pixels, Deep Shot recovers the actual contents and applications on the mobile phone based on these pixels. You can find out how Deep Shot keeps user interaction simple and what happens behind the scenes here. Similar to WYSIWYG—What You See Is What You Get—for graphical user interfaces, Deep Shot demonstrates WYCIWYG—What You Capture Is What You Get—for cross-device interaction. We are exploring this interaction style for various task migration situations in our everyday life.
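
The post does not spell out the recognition step, but one plausible sketch, assuming the phone matches its photo against screenshots of the user's open applications (real systems would use robust local image features rather than this toy pixel comparison), is a simple normalized-correlation match:

    import numpy as np

    def normalized(img):
        """Flatten a grayscale image and normalize it to zero mean, unit norm."""
        v = img.astype(float).ravel()
        v -= v.mean()
        return v / (np.linalg.norm(v) + 1e-9)

    def best_matching_app(photo, screenshots):
        """Return the open application whose screenshot best matches the photo.
        `photo` and every screenshot are same-sized grayscale arrays (a toy assumption)."""
        scores = {name: float(normalized(photo) @ normalized(shot))
                  for name, shot in screenshots.items()}
        return max(scores, key=scores.get)

    # Once the application is identified, its current state (e.g., a Google Maps
    # directions URL) could be sent to the phone and reopened there.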



Deep Shot remains a research project at Google. With the increasing capabilities of mobile phones and fast-growing web applications, we hope to explore more exciting ways to help users carry out their everyday activities.

Friday, July 8, 2011

Languages of the World (Wide Web)



The web is vast and infinite. Its pages link together in a complex network, containing remarkable structures and patterns. Some of the clearest patterns relate to language.

Most web pages link to other pages on the same web site, and the few off-site links they have are almost always to other pages in the same language. It's as if each language has its own web which is loosely linked to the webs of other languages. However, there are a small but significant number of off-site links between languages. These give tantalizing hints of the world beyond the virtual.

To see the connections between languages, start by taking the several billion most important pages on the web in 2008, including all pages in smaller languages, and look at the off-site links between these pages. The particular choice of pages in our corpus reflects decisions about what is 'important'. For example, in a language with few pages every page is considered important, while for languages with more pages some selection method is required, based, for example, on PageRank.

We can use our corpus to draw a very simple graph of the web, with a node for each language and an edge between two languages if more than one percent of the offsite links in the first language land on pages in the second. To make things a little clearer, we only show the languages which have at least a hundred thousand pages and have a strong link with another language, meaning at least 1% of off-site links go to that language. We also leave out English, which we'll discuss more in a moment. (Figure 1)
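
To make the construction concrete, here is a minimal sketch with a made-up link-count matrix, where counts[a][b] is the number of off-site links from pages in language a to pages in language b; English is skipped as a target, as in Figure 1.

    counts = {
        "de": {"de": 9000, "en": 1800, "fr": 150, "nl": 60},
        "fr": {"fr": 8000, "en": 1500, "de": 120, "ar": 30},
        "nl": {"nl": 3000, "en": 700, "de": 90},
    }

    def language_edges(counts, threshold=0.01, skip=("en",)):
        """Edges a -> b drawn when more than `threshold` of a's off-site links land on b."""
        edges = []
        for a, row in counts.items():
            total_offsite = sum(row.values())   # every off-site link from language a
            for b, n in row.items():
                if b != a and b not in skip and n / total_offsite > threshold:
                    edges.append((a, b, round(n / total_offsite, 3)))
        return edges

    print(language_edges(counts))
    # [('de', 'fr', 0.014), ('fr', 'de', 0.012), ('nl', 'de', 0.024)]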

Looking at the language web in 2008, we see a surprisingly clear map of Europe and Asia.
The language linkages invite explanations around geopolitics, linguistics, and historical associations.


Figure 1: Language links on the web. 

The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, which suggests geographic rather than purely linguistic associations.

Examining links between other languages, it seems that many are explained by people and communities which speak both languages.

The language webs of many former Soviet republics link back to the Russian web, with the strongest link from Ukrainian. While Russia is the major importer of Ukrainian products, the bilingual nature of Ukraine is a more plausible explanation. Most Ukrainians speak both languages, and Russian is even the dominant language in large parts of the country.

The link from Arabic to French speaks to the long connection between France and its former colonies. In many of these countries Arabic and French are now commonly spoken together, and there has been significant emigration from these countries to France.

Another strong link is between the Malay/Malaysian and Indonesian webs. Malaysia and Indonesia share a border, but more importantly the languages are nearly eighty percent cognate, meaning speakers of one can easily understand the other.

What about the sizes of each language web? Both the number of sites in each language and the number of urls seen by Google's crawler follow an exponential distribution, although the ordering for each is slightly different (Figure 2). The exact number of pages in each language in 2008 is unknown, since multiple urls may point to the same page and some pages may not have been seen at all. However, the language of an un-crawled url can be guessed by the dominant language of its site. In fact, calendar pages and other infinite spaces mean that the web really does contain an unlimited number of pages, though some are more useful than others.

Figure 2: The number of sites and seen urls per language are roughly exponentially distributed. 

The largest language on the web, in terms of size and centrality, has always been English, but where is it on our map?

Every language on the web has strong links to English, usually with around twenty percent of offsite links and occasionally over forty five percent, such as from Tagalog/Filipino, spoken in the Philippines, and Urdu, principally spoken in Pakistan (Figure 3). Both the Philippines and Pakistan are former British colonies where English is one of the two official languages.

Figure 3: Language links to and from English 

You might wonder whether off-site links landing on English pages can be explained simply by the number of English pages available to be linked to. The webs of other languages in our corpus typically have sixty to eighty percent of their out-language links to English pages. However, only 38 percent of the pages and 42 percent of sites in our set are English, while English attracts 79 percent of all out-language links from other languages.

Chinese and Japanese also seem unusual because there are relatively few links from pages in these languages to pages in English. This is despite the fact that Japanese and Chinese sites are the most popular non-English sites for English sites to link to. However, the number of sites in a language is a strong predictor of its 'introversion', or fraction of off-site links to pages in the same language. Taking this into account shows that Chinese and Japanese webs are not unusually introverted given their size. In general, language webs with more sites are more introverted, perhaps due to better availability of content. (Figure 4)

Figure 4: Language size vs introversion. 

There is a roughly linear relationship between the (log) number of sites in a language and the fraction of off-site links which point to pages in the same language, with a correlation of 0.9 if English is removed. However, only 45 percent of off-site links from English pages are to other English pages, making English the most extroverted web language given its size. Other notable outliers are the Hindi web, which is unusually introverted, and the Tagalog and Malay webs, which are unusually extroverted.
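
The relationship can be illustrated with a tiny made-up data set (the real analysis uses every language in the corpus): regress introversion on the log of the site count and check the correlation.

    import numpy as np

    # Hypothetical per-language statistics: number of sites, and the fraction of
    # off-site links that stay within the same language (introversion).
    sites        = np.array([2.0e6, 4.5e5, 1.2e5, 3.0e4, 8.0e3])
    introversion = np.array([0.80, 0.72, 0.66, 0.58, 0.50])

    r = np.corrcoef(np.log(sites), introversion)[0, 1]
    slope, intercept = np.polyfit(np.log(sites), introversion, 1)
    print(f"correlation={r:.2f}, slope={slope:.3f}, intercept={intercept:.2f}")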

We can generate another map by connecting languages if the number of links from one to the other is 50 times greater than expected given the number of out-of-language links and the size of the language linked to (Figure 5). This time, the native languages of India show up clearly. Surprising links include those from Hindi to Ukrainian, Kurdish to Swedish, Swahili to Tagalog and Bengali, and Esperanto to Polish.
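
A sketch of this “surprise” measure with invented numbers: the expected number of links from language a to language b is a's total out-of-language links multiplied by b's share of the web, and an edge is drawn when the observed count exceeds 50 times that expectation.

    def surprising_links(out_links, language_share, observed, factor=50):
        """Edges a -> b where observed links exceed `factor` times the expected count.
        out_links[a]: total out-of-language links from a; language_share[b]: b's share
        of all sites; observed[(a, b)]: actual link count. All numbers are illustrative."""
        edges = []
        for (a, b), n in observed.items():
            expected = out_links[a] * language_share[b]
            if expected > 0 and n > factor * expected:
                edges.append((a, b, round(n / expected, 1)))
        return edges

    print(surprising_links(
        out_links={"eo": 40_000},            # Esperanto
        language_share={"pl": 0.002},        # Polish's share of sites
        observed={("eo", "pl"): 6_000}))     # [('eo', 'pl', 75.0)]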

Figure 5: Unexpected connections, given the size of each language. 

What's happened since 2008? The languages of the web have become more densely connected. There is now significant content in even more languages, and these languages are more closely linked. We hope that tools like Google page translation, voice translation, and other services will accelerate this process and bring more people in the world closer together, whichever languages they speak.


UPDATE 9 July 2011: As has been pointed out in the comments, in both the Philippines and Pakistan, English is one of the two official languages; however, the Philippines was not a British colony.