11 Websites to Find Free, Interesting Datasets

If you're new to the data space, have recently learned a new skill, or are trying to build a more robust data science or analyst portfolio, a great way to solidify what you've learned is to complete a few mini-projects that put those skills to work. Below we outline a few places where you can find publicly available data for your next project.

If you're interested in practicing real data scientist and analyst interview questions, feel free to sign up for our email newsletter, where we send a few curated questions per week to help you prepare for interviews at top companies.

1. FiveThirtyEight

FiveThirtyEight is an interactive news and sports site that has some incredible data visualizations (which you should totally check out). They make a lot of their data open to the public, meaning you can download and play with the source data yourself! Here are some examples:

2. BuzzFeed News

BuzzFeed makes the data sets, analysis, libraries, tools, and guides used in its articles available on GitHub. Check them out to learn from some of the best! Here are some examples:

3. Kaggle

Kaggle, acquired by Google in 2017, is a place where you can learn, practice, and fine-tune your data science and analytics skills. It has tons of data that’s open to the public and lets users of the platform share code, so you can learn best practices within the data space. It also hosts competitions where you can win real money if you have a top-ranking model! Here are some examples:

4. Socrata

Socrata hosts cleaned, open data sets spanning government, business, and education. Here are some examples:

5. Awesome-Public-Datasets on GitHub

This GitHub repository hosts a library of awesome public datasets! They are sorted by category, and each entry links straight to the hosting website. Here are some examples:

6. Google Public Datasets

Google lists all of its public data sets on a single page. They are hosted on Google's cloud service, Google Cloud Platform (GCP), and you can explore them by running SQL queries with a tool called BigQuery. You'll need to sign up for a GCP account, but the first 1 TB of queries you run each month is free! Here are some examples:

  • US Name Data Set — contains all names from social security card applications from births that occur after 1879.
  • Major League Baseball Data — data includes pitch-by-pitch data for Major League Baseball (MLB) games in 2016.
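To give a feel for what querying one of these datasets looks like, here is a minimal Python sketch against the US names data. It rests on a few assumptions that are not in the article: the table name `bigquery-public-data.usa_names.usa_1910_current` and its `name`, `year`, and `number` columns, the `google-cloud-bigquery` client library, and a GCP project with credentials already configured. The helper names `top_names_query` and `run_query` are made up for illustration.

```python
def top_names_query(year: int, limit: int = 10) -> str:
    """Build a SQL query for the most common names in one birth year."""
    # Cast to int so only numeric values are interpolated into the SQL text.
    year, limit = int(year), int(limit)
    return (
        "SELECT name, SUM(number) AS total\n"
        # Assumed public table name; check the dataset page before relying on it.
        "FROM `bigquery-public-data.usa_names.usa_1910_current`\n"
        f"WHERE year = {year}\n"
        "GROUP BY name\n"
        "ORDER BY total DESC\n"
        f"LIMIT {limit}"
    )


def run_query(sql: str) -> list:
    """Run the query on BigQuery. Needs a GCP project and credentials."""
    # Imported here so the query builder works without the library installed.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


if __name__ == "__main__":
    # Printing the SQL needs no account; paste it into the BigQuery console
    # or pass it to run_query() once your credentials are set up.
    print(top_names_query(2000))
```

Building the SQL separately from running it means you can inspect the query, or try it in the BigQuery web console, before spending any of your free quota.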

7. UCI Machine Learning Repository

The University of California, Irvine hosts 440 data sets as a service to the machine learning community. These data sets are nice because most of them are squeaky clean and ready for modeling! Here are some examples:

  • Iris Data Set — the most famous pattern recognition dataset.

  • Wine Data Set — using chemical analysis to determine the origin of wine.

  • Forest Fires — try to predict the burn area of forest fires using this dataset.
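As a rough illustration of how little preparation these files need, the sketch below parses rows in the Iris file layout (comma-separated, no header row, four measurements followed by the species) using only the Python standard library. The six rows in `SAMPLE` are a small excerpt for demonstration, and the column names and the helpers `parse_iris` and `mean_by_species` are our own labels, not part of the dataset.

```python
import csv
import io
from collections import defaultdict

# Column names are our own labels; the raw file has no header row.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# A six-row excerpt in the same layout as the full 150-row file.
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica

"""


def parse_iris(text):
    """Turn raw CSV text into a list of dicts with float measurements."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines, such as those at the end of the file
            continue
        *measurements, species = record
        row = dict(zip(COLUMNS, map(float, measurements)))
        row["species"] = species
        rows.append(row)
    return rows


def mean_by_species(rows, column):
    """Average one measurement column for each species."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["species"]].append(row[column])
    return {species: sum(vals) / len(vals) for species, vals in groups.items()}


if __name__ == "__main__":
    for species, mean in mean_by_species(parse_iris(SAMPLE), "petal_length").items():
        print(f"{species}: {mean:.2f}")
```

To work with the full data set, download the file from the repository and pass its contents to `parse_iris` in place of `SAMPLE`.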

8. Data.gov

Data.gov allows you to download and explore data from multiple US government agencies, ranging from government budgets to climate data. The data is very well documented, so you should have an easy time navigating the sources. You can browse the data sets directly on Data.gov without registering, either by topic area or by searching for a specific data set. Here are some examples:
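Searching can also be scripted. The sketch below assumes the Data.gov catalog exposes a CKAN-style search endpoint at `https://catalog.data.gov/api/3/action/package_search` that returns JSON with matches under `result.results`. That endpoint and response shape are assumptions not stated in this article, so check the current Data.gov developer documentation before relying on them. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed CKAN-style endpoint; verify against the Data.gov documentation.
CATALOG_API = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build a catalog search URL for a free-text query."""
    return f"{CATALOG_API}?{urlencode({'q': query, 'rows': rows})}"


def search_titles(query, rows=5):
    """Fetch matching data set titles. Requires network access."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        payload = json.load(response)
    # Assumed response shape: {"result": {"results": [{"title": ...}, ...]}}
    return [dataset["title"] for dataset in payload["result"]["results"]]


if __name__ == "__main__":
    # Only builds and prints the URL; call search_titles() to hit the network.
    print(build_search_url("climate data"))
```

Keeping URL construction separate from the network call makes it easy to paste the URL into a browser first and confirm the response looks the way the code expects.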

9. Academic Torrents

Academic Torrents is a site geared toward sharing the data sets from scientific papers. It has tons of interesting data sets, which you can browse and download directly on the site. Here are some examples:

  • Enron Emails — a set of many emails from executives at Enron, a company that famously went bankrupt.
  • Student Learning Factors — a set of factors that measure and influence student learning.
  • News Articles — contains news article attributes and a target variable.

10. Quandl

Quandl is a repository of economic and financial data. Some of the datasets are free, while others are available for purchase. Here are some examples:

11. Jeremy Singer-Vine

Jeremy Singer-Vine collects awesome data sets across multiple sources. If you're interested in getting data sets straight to your inbox, you should consider signing up for his newsletter.