My Private Collections: datasets


Saturday, January 14, 2012

CDC Birth Vital Statistics in BigQuery

Posted by Dan Vanderkam, Software Engineer

Google’s BigQuery Service lets enterprises and developers crunch large-scale data sets quickly. But what if you don’t have a large-scale data set of your own?

To help the data-less masses, BigQuery offers several large, public data sets. One of these is the natality data set, which records information about live births in the United States. The data is derived from the Division of Vital Statistics at the Centers for Disease Control and Prevention, which has collected an electronic record of birth statistics since 1969. It is one of the longest-running electronic records in existence.

Each row in this database represents a live birth. Using simple queries, you can discover fascinating trends from the last forty years.

For example, you can compute the average age of women giving birth to their first child, year by year.

The average age has increased from 21.3 years in 1969 to 25.1 years in 2008. Using more complex queries, one could analyze the factors that have contributed to this increase, for example whether it can be explained by the changing racial/ethnic composition of the population.
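
As a rough sketch of the kind of aggregation behind these numbers, the Python below groups birth records by year and averages the mother's age over first births only. The field names (`year`, `mother_age`, `prior_births`) and the sample rows are illustrative assumptions, not the actual natality schema or real data; in BigQuery itself the same idea would be written as a SQL query that filters on first births and groups by year.

```python
from collections import defaultdict


def average_first_birth_age(rows):
    """Return {year: mean mother's age} over rows that represent a first birth.

    Each row is a dict with the hypothetical keys 'year', 'mother_age'
    and 'prior_births' (number of children previously born to the mother).
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [sum of ages, row count]
    for row in rows:
        # Skip later births and rows where the mother's age was not recorded.
        if row["prior_births"] != 0 or row["mother_age"] is None:
            continue
        acc = totals[row["year"]]
        acc[0] += row["mother_age"]
        acc[1] += 1
    return {year: age_sum / count for year, (age_sum, count) in sorted(totals.items())}


# Toy rows invented for illustration; these are not real natality records.
sample_rows = [
    {"year": 1969, "mother_age": 20, "prior_births": 0},
    {"year": 1969, "mother_age": 22, "prior_births": 0},
    {"year": 1969, "mother_age": 30, "prior_births": 2},    # not a first birth
    {"year": 2008, "mother_age": 24, "prior_births": 0},
    {"year": 2008, "mother_age": 26, "prior_births": 0},
    {"year": 2008, "mother_age": None, "prior_births": 0},  # age missing
]

averages = average_first_birth_age(sample_rows)
```

Filtering before averaging matters here: including later births would pull the average age up and hide the trend in first births.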

You can see more examples like this one on the BigQuery site.

Wednesday, November 30, 2011

More Google Cluster Data

Posted by John Wilkes, Principal Software Engineer

Google has a strong interest in promoting high quality systems research, and we believe that providing information about real-life workloads to the academic community can help.

In support of this, we published a small (7-hour) sample of resource-usage information from a Google production cluster in 2010 (see the research blog post on Google Cluster Data). Approximately a dozen researchers at UC Berkeley, CMU, Brown, NCSU, and elsewhere have made use of it.

Recently, we released a larger dataset. It covers a longer period of time (29 days) for a larger cell (about 11k machines) and includes significantly more information, including:

the original resource requests, to permit scheduling experiments
request constraints and machine attributes
machine availability and failure events
some of the reasons for task exits
(obfuscated) job and job-submitter names, to help identify repeated or related jobs
more types of usage information
CPI (cycles per instruction) and memory traffic for some of the machines

Note that this trace primarily provides data about resource requests and usage. It contains no information about end users, their data, or access patterns to storage systems and other services.

More information can be found via this link, which will (after a short questionnaire) take you to a site that provides access instructions, a description of the data schema, and information about how the data was derived and its meaning.

We hope this data will facilitate a range of research in cluster management. Let us know if you find it useful, are willing to share tools that analyze it, or have suggestions for how to improve it.

Tuesday, March 1, 2011

Slicing and dicing data for interactive visualization

Posted by Benjamin Yolken, Google Public Data Product Manager

A year ago, we introduced the Google Public Data Explorer, a tool that allows users to interactively explore public-interest datasets from a variety of influential sources like the World Bank, IMF, Eurostat, and the US Census Bureau. Today, users can visualize over 300 metrics across 31 datasets, including everything from labor productivity (OECD) to Internet speed (Ookla) to gender balance in parliaments (UNECE) to government debt levels (IMF) to population density by municipality (Statistics Catalonia), with more data being added every week.

Last week, as part of the launch of our dataset upload interface, we released one of the key pieces of technology behind the product: the Dataset Publishing Language (DSPL). We created this format to address a key problem in the Public Data Explorer and other, similar tools, namely, that existing data formats don’t provide enough information to support easy yet powerful data exploration by non-technical users.

DSPL addresses this by adding an additional layer of metadata on top of the raw, tabular data in a dataset. This metadata, expressed in XML, describes the concepts in the dataset, for instance “country”, “gender”, “population”, and “unemployment”, giving descriptions, URLs, formatting properties, etc. for each. These concepts are then referenced in slices, which partition the former into dimensions (i.e., categories) and metrics (i.e., quantitative values) and link them with the underlying data tables (provided in CSV format). This structure, along with some additional metadata, is what allows us to provide rich, interactive dataset visualizations in the Public Data Explorer.
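
To make the relationship between concepts, slices, and data tables more concrete, here is a minimal Python sketch of that structure. It is not the DSPL XML format itself: the class names, field names, helper functions, and sample table are all assumptions made for illustration. It only shows how declaring which columns are dimensions and which are metrics lets a tool aggregate a CSV table without further guidance.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Concept:
    """A named idea in the dataset, e.g. 'country' or 'population'."""
    id: str
    description: str


@dataclass
class Slice:
    """Links concepts to one data table, split into dimensions and metrics."""
    id: str
    dimensions: list  # concept ids used as categories
    metrics: list     # concept ids used as quantitative values
    table_csv: str    # CSV text standing in for the linked data table


def undeclared_concepts(slice_, concepts):
    """Return concept ids the slice references but the dataset never declares."""
    declared = {c.id for c in concepts}
    return [cid for cid in slice_.dimensions + slice_.metrics if cid not in declared]


def metric_by_dimension(slice_, dimension, metric):
    """Sum one metric over each value of one dimension in the slice's table."""
    if dimension not in slice_.dimensions or metric not in slice_.metrics:
        raise ValueError("unknown dimension or metric for this slice")
    result = {}
    for row in csv.DictReader(io.StringIO(slice_.table_csv)):
        key = row[dimension]
        result[key] = result.get(key, 0.0) + float(row[metric])
    return result


# Hypothetical dataset: the concepts and numbers are invented for illustration.
concepts = [
    Concept("country", "Country of residence"),
    Concept("gender", "Gender"),
    Concept("population", "Number of residents"),
]
population_slice = Slice(
    id="population_by_country_and_gender",
    dimensions=["country", "gender"],
    metrics=["population"],
    table_csv="country,gender,population\nA,female,10\nA,male,12\nB,female,5\nB,male,4\n",
)

totals = metric_by_dimension(population_slice, "country", "population")
```

Because the slice states up front that `country` and `gender` are categories and `population` is a quantity, a visualization tool can offer sensible groupings and charts without the user describing the table by hand.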

With the release of DSPL, we hope to accelerate the process of making the world’s datasets searchable, visualizable, and understandable, without requiring a PhD in statistics. We encourage you to read more about the format and try it yourself, both in the Public Data Explorer and in your own software. Stay tuned for more DSPL extensions and applications in the future!