Thursday, May 31, 2012
Explore historic sites with the World Wonders Project
The World Wonders Project enables you to discover 132 historic sites from 18 countries, including Stonehenge, the archaeological areas of Pompeii and the ancient Kyoto temples. In addition to man-made sites, you can explore natural places: wander the sandy dunes of Australia’s Shark Bay or gaze up at the rock domes of Yosemite National Park in California.
World Wonders uses Street View technology to take you on a virtual trip to each iconic site. Most could not be filmed by car, so we used camera-carrying trikes to pedal our way close enough. The site also includes 3D models and YouTube videos of the historical places, so you can dig in and get more information and a broader view of each site. We also partnered with several prestigious organizations, including UNESCO, the World Monuments Fund, Getty Images and Ourplace, who provided official information and photographs for many of the sites.
We hope World Wonders will prove to be a valuable educational resource for students and scholars. A selection of educational packages is available to download for classroom use; you can also share the site content with friends.
World Wonders is part of our commitment to preserving culture online and making it accessible to everyone. Under the auspices of the Google Cultural Institute, we’re publishing high resolution images of the Dead Sea Scrolls, digitizing the archives of famous figures such as Nelson Mandela and presenting thousands of artworks through the Art Project.
Find out more about the project on the World Wonders YouTube channel, and start exploring at www.google.com/worldwonders.
Wednesday, May 30, 2012
Local—now with a dash of Zagat and a sprinkle of Google+
Today, we’re rolling out Google+ Local, a simple way to discover and share local information featuring Zagat scores and recommendations from people you trust in Google+. Google+ Local helps people like my husband turn a craving—“Wow, I need brunch”—into an afternoon outing: “Perfect, there’s a dim sum place with great reviews just two blocks from here. Let’s go.” It’s integrated into Search, Maps and mobile and available as a new tab in Google+—creating one simple experience across Google.
Local information integrated across Google
From the new “Local” tab on the left-hand side of Google+, you can search for specific places or browse for ones that fit your mood. If you click on a restaurant, or a museum (or whatever), you’ll be taken to a local Google+ page that includes photos, Zagat scores and summaries, reviews from people you know, and other useful information like address and opening hours.
Google+ Local is also integrated across other products you already use every day. If you’re looking for a place on Search or Maps, you get the same great local information there too. You can also take it on the go with Google Maps for mobile on your Android device, and soon on iOS devices.
Better decisions with Zagat
Since Zagat joined the Google family last fall, our teams have been working together to improve the way you find great local information. Zagat has offered high-quality reviews, based on user-written submissions and surveys, of tens of thousands of places for more than three decades. All of Zagat’s accurate scores and summaries are now highlighted on local Google+ pages.
Each place you see in Google+ Local will now be scored using Zagat’s 30-point scale, which tells you all about the various aspects of a place so you can make the best decisions. For example, a restaurant that has great food but not great decor might be 4 stars, but with Zagat you’d see a 26 in Food and an 8 in Decor, and know that it might not be the best place for date night.
Recommendations and reviews from people you know and trust
Your friends know what you like, and they probably like the same things you do. That’s why the opinions of people in your circles are front and center. If you search for [tacos] on Google+ Local, your results might include a friend’s rave review of the Baja-style taco stand in your neighborhood. And if you’re searching on Google or Google Maps for a great place to buy a gift for that same friend, your results might include a review from her about a boutique she shops at all the time.
You can also share your opinions and upload photos. These reviews and photos will help your friends when they’re checking out a place, and are also integrated into the aggregate score that other people see. The more you contribute, the more helpful Google+ Local will be for your friends, family and everyone else.
Whether it’s a block you’ve lived on for years or a city you’ve never been to before, we hope Google+ Local helps you discover new gems.
Today is just the first step, and you’ll see more updates in the coming months. If you’re a business owner, you can continue to manage your local listing information via Google Places for Business. Soon we’ll make it even easier for business owners to manage their listings on Google and to take full advantage of the social features provided by local Google+ pages. Get more information on our Google and Your Business Blog.
Posted by Avni Shah, Director of Product Management
(Cross-posted on the Zagat and Lat Long Blogs)
Next step in the Chrome OS journey
a new kind of computer
This is the next step
All of you haiku fans (like many of us on the Chrome team) can stop here; the rest can read on for more details.
A year ago we introduced a new model of computing with the launch of Chromebooks. We’ve heard from many of you who’ve enjoyed the speed, simplicity and security of your Chromebooks at home, at school or at work. (Thanks for all the wonderful feedback and stories!) Today, we want to share some developments with you—new hardware, a major software update and many more robust apps—as we continue on our journey to make computers much better.
Next-generation devices
Our partner Samsung has just announced a new Chromebook and the industry’s first Chromebox. Like its predecessor, the newest Chromebook is a fast and portable laptop for everyday users. The Chromebox is a compact, powerful and versatile desktop perfect for the home or office.
Speed
Speed is integral to the Chrome experience. The new Chromebook and Chromebox, based on Intel Core processors, are nearly three times as fast as the first-generation Chromebooks. And support for hardware-accelerated graphics, a built-from-scratch multi-touch trackpad and an open-source firmware stack provide a much faster and more responsive computing experience. The new Chromebook boots in less than seven seconds and resumes instantly. With the Chromebox, you can be on a video conference while continuing to play your favorite role-playing game on the side.
An app-centric user interface
With the new user interface you can easily find and launch apps, and use them alongside your browser or other apps. You can pin commonly-used apps for quick access, display multiple windows side-by-side or experience your favorite apps in full-screen mode without any distractions.
Be much more productive...or not
- Get more stuff done, online or offline: With the built-in ability to view Microsoft Office files and dozens of the most common file formats, you can access all your content without the hassle of installing additional software. Google Drive makes it easy to create, store and share with just one click. Drive will be seamlessly integrated with the File Manager and support offline access with the next release of Chrome OS in six weeks. With Google Docs offline support (rolling out over the next few weeks), you can keep working on your documents even when offline and seamlessly sync back up when you re-connect. In addition, there are hundreds of offline-capable web apps in the Chrome Web Store.
- Have more fun: The revamped media player and a built-in photo editor and uploader enable you to easily play and manage your personal media collections. Through the Chrome Web Store, you can access entertainment apps such as Google Play, Netflix, Kindle Cloud Reader and Pandora, and thousands of games including popular games like Angry Birds and console titles such as Bastion.
- Carry your other computers...inside your Chromebook: With Chrome Remote Desktop Beta, you can now securely connect to your PC or Mac from your Chromebook or Chromebox. With the underlying VP8 technology, it’s almost like you’re in front of your other computers in real time.
We’ve released eight stable updates over the past year, adding a number of major features and hundreds of improvements to all Chromebooks through our seamless auto-update mechanism. There’s a lot more on the way, so all you need to do is sit back and enjoy the benefits of the (always) new computer.
For those who want to try the Chromebook and Chromebox first-hand, we’re expanding the Chrome Zone experience centers. In the U.S., Chromebooks will be available to try out in select Best Buy stores in the coming weeks. In the U.K., they’re now available in a growing list of PC World and Currys stores.
Starting today, you can get the new Chromebook and Chromebox from our online retail partners in the U.S. and U.K., and in other select countries over the coming weeks.
Posted by Linus Upson, Vice President, Engineering and Caesar Sengupta, Director of Product Management
(Cross-posted from the Chrome Blog)
Friday, May 25, 2012
The fight against scam ads—by the numbers
Last month, I shared an overview of the technology Google has built to prevent bad ads from showing on Google and our partner sites, including our efforts to review accounts, sites and ads. Today I’d like to provide some metrics that give greater insight into the scale of the problem we’re combating.
Bad ads have a disproportionately negative effect on our users; even a single bad ad slipping through our defenses is one too many. That’s why we’re constantly working to improve our systems and utilize new techniques to prevent bad ads from appearing on Google and our partner sites. In fact, billions of ads are submitted every year for a wide variety of products. We have a set of ads policies that cover a huge array of areas in more than 40 different languages. For example, because we aim to show safe, truthful and accurate ads to our users, we don’t allow ads for misleading claims, ad spam or malware.
Ads that are in violation of our ads policies aren’t allowed to be shown on Google and our AdSense partner sites. For many repeat offenders, we ban not just ads but also advertisers who seek to abuse our advertising system to take advantage of people. In the case of ads that are promoting counterfeit goods, we typically ban the advertiser after only one violation. Here are some metrics that give some insight into the scale of the impact we have had over time, showing the numbers of actions we’ve taken against advertiser accounts, sites and ads. You can see that the numbers are growing—and growing faster over time.
Year | Advertiser Accounts Suspended for Terms of Service and Advertising Policies | Sites Rejected for Site Policy | Ads Disapproved |
---|---|---|---|
2011 | 824K | 610K | 134M |
2010 | 248K | 398K | 56.7M |
2009 | 68.5K | 305K | 42.5M |
2008 | 18.1K | 167K | 25.3M |
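To put the growth in the table in concrete terms, the year-over-year ratios can be worked out directly from the figures. The following is only a minimal Python sketch: the dictionary layout, the column keys and the function name are illustrative assumptions, not part of any Google tool.

```python
# Figures copied from the table above
# (accounts and sites in thousands, ads in millions).
ACTIONS = {
    2008: {"accounts_k": 18.1, "sites_k": 167, "ads_m": 25.3},
    2009: {"accounts_k": 68.5, "sites_k": 305, "ads_m": 42.5},
    2010: {"accounts_k": 248, "sites_k": 398, "ads_m": 56.7},
    2011: {"accounts_k": 824, "sites_k": 610, "ads_m": 134},
}


def year_over_year(metric):
    """Return {year: ratio to the previous year} for one column of the table."""
    years = sorted(ACTIONS)
    return {
        later: ACTIONS[later][metric] / ACTIONS[earlier][metric]
        for earlier, later in zip(years, years[1:])
    }


if __name__ == "__main__":
    for metric in ("accounts_k", "sites_k", "ads_m"):
        ratios = year_over_year(metric)
        print(metric, {year: round(r, 2) for year, r in ratios.items()})
```

A ratio above 1 means more actions were taken than in the previous year; comparing the ratios across years shows where the increase itself sped up.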
Even in this ever-escalating arms race, our efforts are working. One method we use to test the success of our efforts is to ask human raters to tell us how we’re doing. These human raters review a set of sites that are advertised on Google. We use a large set of sites in order to get an accurate statistical reading of our efforts. We also weight the sites in our statistical sample based on the number of times a particular site was displayed so that if a particular site is shown more often, it’s more likely to be in our sample set. By using human raters, we can calibrate our automated systems and ensure that we’re improving our efforts over time. In 2011, we reduced the percentage of bad ads by more than 50 percent compared with 2010. That means the proportion of bad ads showing on Google was more than halved in just a year.
Google’s long-term success is based on people trusting our products. We want to make sure that the ads on Google are safe and trustworthy, and we won’t be satisfied until they are.
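The impression-weighted sampling described above can be illustrated with a short Python sketch. This is not our production system: the site names, the function names and the choice to sample with replacement are all assumptions made for the example, and the rater's judgment is stood in for by a simple callback.

```python
import random


def sample_sites(impressions, k, seed=None):
    """Draw k sites, with replacement, with probability proportional to impressions."""
    rng = random.Random(seed)
    sites = list(impressions)
    weights = [impressions[site] for site in sites]
    return rng.choices(sites, weights=weights, k=k)


def estimated_bad_rate(sample, is_bad):
    """Share of sampled entries that the (hypothetical) rater flagged as bad."""
    return sum(1 for site in sample if is_bad(site)) / len(sample)


if __name__ == "__main__":
    # Made-up impression counts for three made-up sites.
    impressions = {"shop.example": 9000, "news.example": 900, "scam.example": 100}
    sample = sample_sites(impressions, k=1000, seed=0)
    print(estimated_bad_rate(sample, lambda site: site == "scam.example"))
```

Because frequently shown sites are drawn more often, the estimate reflects what people actually see rather than treating every advertised site equally.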
Posted by David W. Baker, Director of Engineering, Advertising
Transparency for copyright removals in search
Two years ago we launched the Transparency Report, showing when and what information is accessible on Google services around the world. We started off by sharing data about the government requests we receive to remove content from our services or for information about our users. Then we began showing traffic patterns to our services, highlighting when they’ve been disrupted.
Today we’re expanding the Transparency Report with a new section on copyright. Specifically, we’re disclosing the number of requests we get from copyright owners (and the organizations that represent them) to remove Google Search results because they allegedly link to infringing content. We’re starting with search because we remove more results in response to copyright removal notices than for any other reason. So we’re providing information about who sends us copyright removal notices, how often, on behalf of which copyright owners and for which websites. As policymakers and Internet users around the world consider the pros and cons of different proposals to address the problem of online copyright infringement, we hope this data will contribute to the discussion.
For this launch we’re disclosing data dating from July 2011, and moving forward we plan on updating the numbers each day. As you can see from the report, the number of requests has been increasing rapidly. These days it’s not unusual for us to receive more than 250,000 requests each week, which is more than what copyright owners asked us to remove in all of 2009. In the past month alone, we received about 1.2 million requests made on behalf of more than 1,000 copyright owners to remove search results. These requests targeted some 24,000 different websites.
Fighting online piracy is very important, and we don’t want our search results to direct people to materials that violate copyright laws. So we’ve always responded to copyright removal requests that meet the standards set out in the Digital Millennium Copyright Act (DMCA). At the same time, we want to be transparent about the process so that users and researchers alike understand what kinds of materials have been removed from our search results and why. To promote that transparency, we have long shared copies of copyright removal requests with Chilling Effects, a nonprofit organization that collects these notices from Internet users and companies. We also include a notice in our search results when items have been removed in response to copyright removal requests.
We believe that the time-tested “notice-and-takedown” process for copyright strikes the right balance between the needs of copyright owners, the interests of users, and our efforts to provide a useful Google Search experience. Google continues to put substantial resources into improving and streamlining this process. We already mentioned that we’re processing more copyright removal requests for Search than ever before. And we’re also processing these requests faster than ever before; last week our average turnaround time was less than 11 hours.
At the same time, we try to catch erroneous or abusive removal requests. For example, we recently rejected two requests from an organization representing a major entertainment company, asking us to remove a search result that linked to a major newspaper’s review of a TV show. The requests mistakenly claimed copyright violations of the show, even though there was no infringing content. We’ve also seen baseless copyright removal requests being used for anticompetitive purposes, or to remove content unfavorable to a particular person or company from our search results. We try to catch these ourselves, but we also notify webmasters in our Webmaster Tools when pages on their website have been targeted by a copyright removal request, so that they can submit a counter-notice if they believe the removal request was inaccurate.
Transparency is a crucial element to making this system work well. We look forward to making more improvements to our Transparency Report—offering copyright owners, Internet users, policymakers and website owners the data they need to see and understand how removal requests from both governments and private parties affect our results in Search.
Update December 11, 2012: Starting today, anyone interested in studying the data can download all the data shown for copyright removals in the Transparency Report. We are also providing information about how often we remove search results that link to allegedly infringing material. Specifically, we are disclosing how many URLs we removed for each request and specified website, the overall removal rate for each request and the specific URLs we did not act on. Between December 2011 and November 2012, we removed 97.5% of all URLs specified in copyright removal requests. Read more on Policy by the Numbers.
Posted by Fred von Lohmann, Senior Copyright Counsel
Thursday, May 24, 2012
Google+ for Android: polish and performance
Start a hangout from anywhere, and ring the folks that matter most
With Hangouts we want to help people connect face-to-face-to-face—at any time, from anywhere. Of course, there's really only one device that's always by your side—your phone—so we've invested in mobile hangouts since early on. Today we're adding another important feature to the mix: the ability to start a hangout directly from your mobile device.
To get started, tap “Hangout” in the (new) navigation ribbon, add some friends and tap “Start.” We'll ring their phones (if you want), and if someone misses the hangout, they can ring you back with a single tap.
Share your favorites, and feel awesome afterward
When you share with your circles, we owe you an experience that's both intimate and immersive. Your time and your relationships are precious, after all, so your posts should make you feel proud. Today's new Android app takes this to heart, with full-screen media in the stream, conversations that fade into view and instantly-touchable actions like +1.
Do more, in less time
We think you’ll find today’s app nicer to look at, but we’re also making it easier to use. Improvements include:
- A navigation ribbon that slides in and out, providing quick access to just about everything
- The ability to download photos directly from Google+, and turn them into wallpaper
- The chance to edit posts inline, in case you make any mistakes while on the go
Posted by Vic Gundotra, Senior Vice President
Wednesday, May 23, 2012
A faster, simpler Google Search app for iPhone
Get results, fast
When you’re on the go, you usually want to get things done quickly. Autocompletion of search suggestions is significantly faster in this latest version of the app, bringing you search predictions instantly with each letter you type. You’ll also notice that results load faster, and checking out webpages is easy with the slide-in panel. Quickly swipe back and forth between webpages and your search results, and swap between search modes like Images and Places with a swipeable menu. Finding text within a webpage is a snap as well; just try tapping the magnifying glass on the bottom menu option on any page.
Easily switch between search modes using the swipeable menu at the bottom | Swipe the slide-in panel to instantly return to your search results |
Beautiful Image Search
Searching for images will never again be a chore. Tap the images button at the bottom of the search results page, and watch high-resolution images load into a beautiful grid. Browse the images by scrolling down the full-screen grid, or tap on a single image to get details about it and then quickly swipe from image to image. You can also tap and hold an image to save it to your camera roll to use as your wallpaper or share with a friend.
Simple access
We’ve put all of your favorite Google services in one place for easy access. You can choose to browse Google web apps, or see just the apps that you have on your phone. Sign in once, and you’ll never need to sign in again to check a quick email, view your next calendar appointment or see what’s hot on Google+.
Download the Google Search app now for a fast, beautiful, simple search experience on your iPhone.
Posted by Noah Levin, Interaction Designer, Google Search app
Software downloads in Syria
The fine details of these restrictions evolve over time, and we’re always exploring how we can better offer tools for people to access and share information. For example, last year we were able to make some of our products available for download in Iran. And today we’re pleased to make Google Earth, Picasa and Chrome available for download in Syria.
As a U.S. company, we remain committed to full compliance with U.S. export controls and sanctions. We remain equally committed to continue exploring how we can help more people around the globe use technology to communicate, find and create information.
Posted by Neil Martin, Export Compliance Programs Manager
A tribute to Bob Moog, sonic doodler
When people hear the word “synthesizer” they often think “synthetic”—fake, manufactured, unnatural. In contrast, Bob Moog’s synthesizers produce beautiful, organic and rich sounds that are, nearly 50 years later, regarded by many professional musicians as the epitome of an electronic instrument. “Synthesizer,” it turns out, refers to the synthesis embedded in Moog’s instruments: a network of electronic components working together to create a whole greater than the sum of the parts.
With his passion for high-tech toolmaking in the service of creativity, Bob Moog is something of a patron saint of the nerdy arts and a hero to many of us here. So for the next 24 hours on our homepage, you’ll find an interactive, playable logo inspired by the instruments with which Moog brought musical performance into the electronic age. You can use your mouse or computer keyboard to control the mini-synthesizer’s keys and knobs to make nearly limitless sounds. Keeping with the theme of 1960s music technology, we’ve patched the keyboard into a 4-track tape recorder so you can record, play back and share songs via short links or Google+.
Much like the musical machines Bob Moog created, this doodle was synthesized from a number of smaller components to form a unique instrument. When experienced with Google Chrome, sound is generated natively using the Web Audio API—a doodle first (for other browsers the Flash plugin is used). This doodle also takes advantage of JavaScript, Closure libraries, CSS3 and tools like Google Web Fonts, the Google+ API, the Google URL Shortener and App Engine.
Special thanks to engineers Reinaldo Aguiar and Rui Lopes and doodle team lead Ryan Germick for their work, as well as the Bob Moog Foundation and Moog Music for their blessing. Now give those knobs a spin and compose a tune that would make Dr. Moog smile!
Update May 30: We're so glad you enjoyed last week's synthesizer doodle for Bob Moog. Worldwide, you recorded 57 years' worth of synthesized tunes—more than 54 million songs! And those songs were played back 3.6 million times. You can still play on our doodle site. Even if you've composed a song already, create another one—the range of sounds you can create with the knobs is virtually limitless.
Posted by Joey Hurst, Software Engineer
A world of opportunity at the G(irls)20 Summit
Research shows that investing in girls and women can help the global economy. Consider the following examples:
- According to Plan UK, an extra year of education increases a girl’s income by 10 to 20% and is a significant step on the road to breaking the cycle of poverty.
- In Kenya, adolescent pregnancies cost the economy $500 million per year, while investing in girls could potentially add $32 billion to the economy (NIKE Foundation, 2009, Girl Effect).
- If men and women had equal influence in decision-making, an additional 1.7 million children would be adequately nourished in sub-Saharan Africa (International Labour Organization, 2009).
Launched in 2010 at the Clinton Global Initiative, the G(irls)20 Summit precedes the G20 Leaders Summit, and brings together one girl aged 18 to 20 from each G20 country plus the African Union. The delegates attend workshops and participate in panel discussions to come up with tangible, scalable solutions for how to engage and empower girls and women around the world. Then, at the end of the summit, they lead a press conference and present a set of recommendations for the G20 leaders to consider.
This year, the Summit will take place in Mexico City from May 28-31. But the impact of the Summit will be ongoing, thanks in part to the power of the Internet and social media. Take past Summit participants July Lee of the U.S. and Noma Sibayoni of South Africa, who launched Write With A Smile to encourage teens to continue with their education. Or Riana Shah of India, who co-founded Independent Thought & Social Action (ITSA India), an education reform organization that aims to empower socially responsible youth leaders. And the African Union’s Lilian Kithiri continues to persevere in raising awareness of reproductive health among communities living in the rural areas of Kenya.
There are a few ways you can experience the Summit:
- Join us in Mexico City on May 28
- Sign up for your number in support of girls and women
- Join the conversation via our live stream on www.girls20summit.com on May 28, 29 and 31
Posted by Farah Mohamed, President & CEO, G(irls)20 Summit
Tuesday, May 22, 2012
We’ve acquired Motorola Mobility
It’s why I’m excited to announce today that our Motorola Mobility deal has closed. Motorola is a great American tech company that has driven the mobile revolution, with a track record of over 80 years of innovation, including the creation of the first cell phone. We all remember Motorola’s StarTAC, which at the time seemed tiny and showed the real potential of these devices. And as a company that made a big, early bet on Android, Motorola has become an incredibly valuable partner to Google.
Sanjay Jha, who was responsible for building the company and placing that big bet on Android, has stepped down as CEO. I would like to thank him for his efforts and am tremendously pleased that he will be working to ensure a smooth transition as long-time Googler Dennis Woodside takes over as CEO of Motorola Mobility.
I’ve known Dennis for nearly a decade, and he’s been phenomenal at building teams and delivering on some of Google’s biggest bets. One of his first jobs at Google was to put on his backpack and build our businesses across the Middle East, Africa, Eastern Europe and Russia. More recently he helped increase our revenue in the U.S. from $10.8 billion to $17.5 billion in under three years as President of the Americas region. Dennis has always been a committed partner to our customers and I know he will be an outstanding leader of Motorola. As an Ironman triathlete, he’s got plenty of energy for the journey ahead—and he’s already off to a great start with some very strong new hires for the Motorola team.
It’s a well-known fact that people tend to overestimate the impact technology will have in the short term, but underestimate its significance in the longer term. Many users coming online today may never use a desktop machine, and the impact of that transition will be profound—as will the ability to just tap and pay with your phone. That’s why it’s a great time to be in the mobile business, and why I’m confident Dennis and the team at Motorola will be creating the next generation of mobile devices that will improve lives for years to come.
Posted by Larry Page, CEO
Monday, May 21, 2012
Announcing the 90 regional finalists of the Google Science Fair 2012
This year’s competition was even more international and diverse than last year’s. We had thousands of entries from more than 100 countries, and topics ranging from improving recycling using LEGO robots to treating cancer with a substance created by bees to tackling meth abuse. Our judges were impressed by the quality of the projects, and it was no easy task to evaluate the creativity, scientific merit and global relevance of each submission to narrow down the entries to just 90 finalists.
Thirteen of our 90 finalists have also been nominated for the Scientific American Science in Action award, the winner of which will be announced on June 6 along with our 15 finalists. These top 15 and the Science in Action winner will be flown out to Google’s headquarters in California in July for our celebratory finalist event and for the last round of judging, which will be conducted by our panel of renowned scientists and innovators.
Thanks to all of the students around the world who submitted projects to the Google Science Fair and congratulations to all the young scientists who were selected as regional finalists.
Posted by Sam Peter, Google Science Fair Team
Saturday, May 19, 2012
A look inside our 2011 diversity report
In the U.S., fewer and fewer students are graduating with computer science degrees each year, and enrollment rates are even lower for women and underrepresented groups. It’s important to grow a diverse talent pool and help develop the technologists of tomorrow who will be integral to the success of the technology industry. Here are a few of the things we did last year aimed at this goal in the U.S. and around the world:
- We held our third annual HBCU (Historically Black Colleges and Universities) Faculty Summit at Google New York, hosting 50 professors and administrators from 16 HBCUs, who came together to collaborate, share insights and engage with Googlers.
- We helped 100,000 students and faculty at 22 HBCUs in the U.S. “go Google;” they now use Google Apps for Education.
- To date, 3,000 students in 77 countries have received Google scholarships and we also expanded our scholarship programs for women in technology.
- We piloted the Top Black Talent U.K. program to help the U.K.’s top black engineering and business students transition into the tech industry. We also partnered with the African Caribbean Society to offer 100 students workshops and mentoring with Googlers from engineering, sales and marketing.
- We had more than 10,000 members participate in one of our 18 Global Employee Resource Groups (ERGs). Membership and reach expanded as Women@Google held the first ever Women’s Summit in both Mountain View, Calif. and Japan; the Black Googler Network (BGN) made their fourth visit to New Orleans, La., contributing 360 volunteer hours in just two days; and the Google Veterans Network partnered with GoogleServe, resulting in 250 Googlers working on nine Veteran-related projects from San Francisco to London.
- Googlers in more than 50 offices participated in the Sum of Google, a celebration of diversity and inclusion, in their respective offices around the globe.
- We sponsored 464 events in 70 countries to celebrate the anniversary of International Women's Day. Google.org collaborated with Women for Women International to launch the “Join me on the Bridge” campaign. Represented in 20 languages, the campaign invited people to celebrate by joining each other on bridges around the world—either physically or virtually—to show their support.
- We introduced ChromeVox, a screen reader for Google Chrome, which helps people with vision impairment navigate websites. It's easy to learn and free to install as a Chrome Extension.
- We grew Accelerate with Google to make Google’s tools, information and services more accessible and useful to underrepresented communities and diverse business partners.
- On Veterans Day in the U.S., we launched a new platform for military veterans and their families. The Google for Veterans and Families website helps veterans and their families stay connected through products like Google+, YouTube and Google Earth.
Posted by Yolanda Mangolini, Director, Global Diversity & Inclusion/Talent & Outreach Programs
Friday, May 18, 2012
From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas
Yet in each word some concept there must be...
from Goethe's Faust (Part I, Scene III)
Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google's core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.
How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia's groupings of articles into hierarchical categories.
The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article's canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept's url. Our database thus includes weights that measure degrees of association. For example, the top two entries for football indicate that it is an ambiguous term, which is almost twice as likely to refer to what we in the US call soccer:
text=football | url | count |
1. | Association football | 44,984 |
2. | American football | 23,373 |
⋮ |
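As a rough illustration of the triple format, the sketch below loads the two rows shown above into a nested dictionary and compares their counts. The in-memory layout and the name `build_dictionary` are assumptions made for this example only; the released files have their own format, which is described in the README.

```python
from collections import defaultdict

# The two triples from the table above: (text, url, count).
TRIPLES = [
    ("football", "Association football", 44984),
    ("football", "American football", 23373),
]

def build_dictionary(triples):
    """Map each text string to a {url: count} dict (hypothetical layout)."""
    dictionary = defaultdict(dict)
    for text, url, count in triples:
        # Sum counts in case the same (text, url) pair appears more than once.
        dictionary[text][url] = dictionary[text].get(url, 0) + count
    return dictionary

words_to_concepts = build_dictionary(TRIPLES)
counts = words_to_concepts["football"]
ratio = counts["Association football"] / counts["American football"]
print(f"{ratio:.2f}")  # prints 1.92, i.e. "almost twice as likely"
```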
An inverted index can be used to perform reverse look-ups, identifying salient terms for each concept. Some of the highest-scoring strings — including synonyms and translations — for both sports, are listed below:
Associated counts can easily be turned into percentages. The following table illustrates the concept-to-words dictionary direction — which may be useful for paraphrasing, summarization and topic modeling — for the idea of soft drink, restricted to English (and normalized for punctuation, pluralization and capitalization differences):
url=Soft_drink | text | | % |
1. | soft drink | (and soft-drinks) | 28.6 |
2. | soda | (and sodas) | 5.5 |
3. | soda pop | | 0.9 |
4. | fizzy drinks | | 0.6 |
5. | carbonated beverages | (and beverage) | 0.3 |
6. | non-alcoholic | | 0.2 |
7. | soft | | 0.1 |
8. | pop | | 0.1 |
9. | carbonated soft drink | (and drinks) | 0.1 |
10. | aerated water | | 0.1 |
11. | non-alcoholic drinks | (and drink) | 0.1 |
12. | soft drink controversy | | 0.0 |
13. | citrus-flavored soda | | 0.0 |
14. | carbonated | | 0.0 |
15. | soft drink topics | | 0.0 |
⋮ |
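The reverse look-up and the conversion of counts into percentages can be sketched in a few lines. This is a minimal example under stated assumptions: the function names `invert` and `to_percentages` are hypothetical, and the counts in `toy_triples` are made up for illustration rather than taken from the data set.

```python
from collections import defaultdict

def invert(triples):
    """Build a concept-to-words index: url -> {text: count}."""
    index = defaultdict(dict)
    for text, url, count in triples:
        index[url][text] = index[url].get(text, 0) + count
    return index

def to_percentages(counts):
    """Convert raw counts into percentages of their total, highest first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(key, 100.0 * value / total) for key, value in ranked]

# Made-up counts, chosen only for illustration; they are NOT from the data set.
toy_triples = [
    ("soft drink", "Soft_drink", 60),
    ("soda", "Soft_drink", 30),
    ("pop", "Soft_drink", 10),
    ("pop", "Pop_music", 90),
]

concepts_to_words = invert(toy_triples)
for text, pct in to_percentages(concepts_to_words["Soft_drink"]):
    print(f"{text}: {pct:.1f}%")
```

The same `to_percentages` helper works in the other direction too, since a words-to-concepts entry is also just a mapping from keys to counts.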
The words-to-concepts dictionary direction can disambiguate senses and link entities, which are often highly ambiguous, since people, places and organizations can (nearly) all be named after each other. The next table shows the top concepts meant by the string Stanford, which refers to all three (and other) types:
text=Stanford | url | % | type |
1. | Stanford University | 50.3 | ORGANIZATION |
2. | Stanford (disambiguation) | 7.7 | a disambiguation page |
3. | Stanford, California | 7.5 | LOCATION |
4. | Stanford Cardinal football | 5.7 | ORGANIZATION |
5. | Stanford Cardinal | 4.1 | multiple athletic programs |
6. | Stanford Cardinal men's basketball | 2.0 | ORGANIZATION |
7. | Stanford prison experiment | 2.0 | a famous psychology experiment |
8. | Stanford, Kentucky | 1.7 | LOCATION |
9. | Stanford, Norfolk | 1.0 | LOCATION |
10. | Bank of the West Classic | 1.0 | a recurring sporting event |
11. | Stanford, Illinois | 0.9 | LOCATION |
12. | Leland Stanford | 0.9 | PERSON |
13. | Charles Villiers Stanford | 0.8 | PERSON |
14. | Stanford, New York | 0.8 | LOCATION |
15. | Stanford, Bedfordshire | 0.8 | LOCATION |
⋮ |
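One simple way to use such a table for entity linking is a context-free baseline that always picks the most frequently linked concept. The sketch below is only that baseline, not a method described in this post; the function `rank_concepts`, its `min_pct` threshold and the choice to skip disambiguation pages are assumptions for illustration. The percentages are copied from a few rows of the table above.

```python
# A few rows of the Stanford table above: (url, percentage).
STANFORD = [
    ("Stanford University", 50.3),
    ("Stanford (disambiguation)", 7.7),
    ("Stanford, California", 7.5),
    ("Stanford Cardinal football", 5.7),
    ("Leland Stanford", 0.9),
]

def rank_concepts(candidates, min_pct=0.0, skip_disambiguation=True):
    """Return candidate concepts sorted by percentage, highest first.

    Hypothetical helper: drops rare candidates below min_pct and,
    optionally, pages whose title ends in "(disambiguation)".
    """
    kept = [
        (url, pct)
        for url, pct in candidates
        if pct >= min_pct
        and not (skip_disambiguation and url.endswith("(disambiguation)"))
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)

ranked = rank_concepts(STANFORD, min_pct=1.0)
best_url, best_pct = ranked[0]
print(best_url, best_pct)  # prints: Stanford University 50.3
```

A real linker would normally combine these prior weights with the surrounding context, since the most common sense is wrong whenever the text is about, say, Leland Stanford.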
The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data.
We hope that this release will fuel numerous creative applications that haven't been previously thought of!