I’ve been thinking about adding a “Data Sources” category to Data Sci Guide, and I’ve been bookmarking good online data resources for future use and possible inclusion on this site.
Today, I was interviewed for a series that BPDM is recording (I’ll link to the interview when it’s live!), and they asked me to recommend good data sources and APIs. I thought that would make a good blog post, so I figured I’d go ahead and share some of my own bookmarks here:
The UCI Machine Learning Repository has a bunch of great datasets for practicing machine learning, and even suggests the types of tasks each set would be good for.
Data.gov is the United States Government’s open data website. You’ll find data on agriculture, education, health, scientific research, and more. Right now the headline says they have almost 190,000 datasets available!
The US Census APIs site has all kinds of demographic data for the United States population, as well as data on poverty, business, and manufacturing. Some demographic information on other countries is also available via API.
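If you want to try pulling Census data from code, here is a minimal Python sketch of how a request might look. It rests on some assumptions that are not from this post: that the API lives under `https://api.census.gov/data`, that it takes `get` and `for` query parameters, and that it returns JSON as a list of rows with the header row first. The dataset path and variable code in the usage note are placeholders too, so check them against the Census API documentation before relying on them. The helper names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the Census Data API; confirm it in the official docs.
BASE_URL = "https://api.census.gov/data"


def build_query_url(dataset_path, variables, geography):
    """Build a query URL from a dataset path, variable codes, and a geography filter."""
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    params = urlencode(
        {"get": ",".join(variables), "for": geography},
        safe=",:*",
    )
    return f"{BASE_URL}/{dataset_path}?{params}"


def rows_to_records(rows):
    """Turn a header-first list of rows into a list of dicts keyed by column name."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def fetch_records(dataset_path, variables, geography):
    """Download one query and return it as a list of dicts (needs network access)."""
    url = build_query_url(dataset_path, variables, geography)
    with urlopen(url) as response:
        return rows_to_records(json.load(response))
```

A call might look like `fetch_records("2019/acs/acs5", ["NAME", "B01001_001E"], "state:*")`, where the dataset path and the variable code are examples to verify against the site's variable lists. Every value comes back as a string, so convert the numeric columns before doing any math on them.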
Speaking of data.gov and census.gov, Lynda.com has a course called Up and Running with Public Data Sets that talks about how to find these and other publicly available data sets and how to do some basic analyses on them in Excel.
There’s a site called The Setup (aka UsesThis) where people from various industries are interviewed about the hardware and software they use. It has an API, and I used it in a project I mentioned in my BPDM interview, which I wrote about on my Becoming a Data Scientist blog.
The Natural History Museum in London recently announced that they were putting their species inventory, specimen collections, and other biological and chemical datasets online for the public to access.
Prescriber-level Medicare prescription drug data was released for the first time this year by CMS.gov.
NASA posts Earth Observation Data and other cool datasets online. They also recently released almost every photo ever taken by an astronaut on the Apollo missions via Flickr, which could be used for image analysis.
Speaking of image datasets, I found this collection of image sets for computer vision projects, posted by Dr. Kristen Grauman of UTexas Computer Science.
The Open Science Data Cloud has an impressive collection of “public data sets of scientific interest”, such as human genetic sequences, public data from the City of Chicago, Enron emails, Global Land Survey imagery, Million Song Dataset, telescope imagery of space, and more!
KDNuggets also has a large collection of links to data repositories.
What are your favorite data sources for data science projects? Share them in the comments below!