Scraping Reddit to find the most popular domains


Below, we'll show you how to scrape Reddit using PRAW (the Python Reddit API Wrapper). In this example, our goal is to scrape the top submissions of the past year across a few subreddits, storing the following for each one: submission URL, domain (the website the submission links to), and submission score. Ultimately, we want to see which domains generate the highest-scoring posts in a given subreddit.

1) Import packages, set up PRAW, select subreddits

Here we set up our PRAW credentials and choose the list of subreddits we want to analyze.

import pandas as pd
import praw

# set up PRAW with your own credentials
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent')

# create list of subreddits to include:
# enter the subreddits you want as comma-separated strings
s_list = ['news', 'datascience']
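
Hard-coding credentials in a script makes them easy to leak. As a minimal sketch, the three values could be read from environment variables instead. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and not part of PRAW.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_reddit_credentials(environ=os.environ):
    """Return keyword arguments for praw.Reddit, read from the environment."""
    missing = [name for name in ENV_KEYS.values() if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {kwarg: environ[name] for kwarg, name in ENV_KEYS.items()}
```

With the variables set, the client could then be created with `praw.Reddit(**load_reddit_credentials())`.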

2) Grab the score, domain (url), and subreddit for each top yearly submission

In this section we loop through the list of subreddits from above and collect the score, domain, and subreddit of each submission. Each attribute goes into its own dataframe (three in total), keyed by submission ID, so that the dataframes can later be merged on that ID.

#set up dictionaries to store submission information
domains_sub = {}
domains = {}
domains_score = {}
domains_url = {}

# Loop through our selected list of subreddits
for i in s_list:
    # pull in top submissions for the year for the subreddit specified in the list above
    subreddit = reddit.subreddit(i)

    #--Grab the score for a given submission--#
    submissions = subreddit.top(time_filter='year', limit=50)
    # sum score across submissions
    for s in submissions:
        if s.id in domains_score:
            domains_score[s.id] += s.score
        else:
            domains_score[s.id] = s.score

    #--Grab domain for given submission ID--#
    submissions = subreddit.top(time_filter='year', limit=50)
    for s in submissions:
        domains[s.id] = s.domain

    #--Grab subreddit for given submission ID--#
    submissions = subreddit.top(time_filter='year', limit=50)
    for s in submissions:
        domains_sub[s.id] = s.subreddit.display_name

# Convert each dictionary into a two-column dataframe keyed by submission ID
df_score = pd.DataFrame.from_dict(domains_score, orient='index').reset_index()
df_score.columns = ['id', 'score']

df_domain = pd.DataFrame.from_dict(domains, orient='index').reset_index()
df_domain.columns = ['id', 'domain']

df_subreddit = pd.DataFrame.from_dict(domains_sub, orient='index').reset_index()
df_subreddit.columns = ['id', 'subreddit']
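
The three loops above each request the same listing from Reddit. As an alternative sketch (not the approach used in this post), all attributes could be collected in a single pass. The helper `collect_rows` is hypothetical, and the `SimpleNamespace` objects are stand-ins that only mimic the attributes read from PRAW submissions, so the example runs without network access.

```python
from types import SimpleNamespace


def collect_rows(submissions):
    """Build one record per submission in a single pass over the listing."""
    rows = []
    for s in submissions:
        rows.append({
            "id": s.id,
            "subreddit": s.subreddit.display_name,
            "score": s.score,
            "domain": s.domain,
        })
    return rows


# Stand-in objects; the domain values are made up for illustration.
fake_submissions = [
    SimpleNamespace(id="78tulq", score=42729, domain="example.com",
                    subreddit=SimpleNamespace(display_name="todayilearned")),
    SimpleNamespace(id="76bn5s", score=25024, domain="example.org",
                    subreddit=SimpleNamespace(display_name="science")),
]

rows = collect_rows(fake_submissions)
print(rows[0]["subreddit"], rows[0]["score"])
```

Passing the resulting list of dictionaries to `pd.DataFrame` should give one table with all four columns, which would make the separate merge step unnecessary.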

3) Merge dataframes

Now that we have dataframes containing the score, domain, and subreddit, we can merge the three tables together, using the submission ID as the join key.

#merge the three tables together, using submission ID as primary key
df_sub_score = df_subreddit.merge(df_score, how='left', on="id")
df_final = df_sub_score.merge(df_domain, how='left', on='id')
# Add in the submission URL using the 'id'
base_url = ''  # prefix to prepend to each submission ID
df_final['url'] = base_url + df_final['id'].astype(str)
   id      subreddit      score  domain  url
0  78tulq  todayilearned  42729
1  76bn5s  science        25024
2  7871xy  science        30642
3  77pnk6  science        13176
4  75eydj  gaming         64510
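
To make the behavior of the left merges concrete, here is a small self-contained sketch on toy tables. The IDs and domain values are invented for illustration; only the table and column names follow the code above.

```python
import pandas as pd

# Toy tables keyed by submission ID; the values are invented for illustration.
df_subreddit = pd.DataFrame({"id": ["a1", "b2", "c3"],
                             "subreddit": ["science", "science", "gaming"]})
df_score = pd.DataFrame({"id": ["a1", "b2", "c3"], "score": [10, 20, 30]})
df_domain = pd.DataFrame({"id": ["a1", "b2"],
                          "domain": ["example.com", "example.org"]})

# Left joins keep every row of the left table, even when the right table has no match.
df_final = (df_subreddit
            .merge(df_score, how="left", on="id")
            .merge(df_domain, how="left", on="id"))

print(df_final)
```

Submission `c3` has no entry in the domain table, so it stays in the result with a missing `domain` value rather than being dropped.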

Done! Explore the output:

We now have a clean dataframe of the top submissions of the year from each chosen subreddit, which lets us see which domains racked up the highest total scores. I dumped the dataframe into a Google Sheet for you to explore.
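
The dataframe holds one row per submission, so ranking domains takes one more aggregation. A minimal sketch, assuming the `subreddit`, `domain`, and `score` columns from above and using made-up rows in place of real data:

```python
import pandas as pd

# Toy stand-in for df_final; the rows and domains are made up for illustration.
df_final = pd.DataFrame({
    "subreddit": ["science", "science", "science", "gaming"],
    "domain": ["example.com", "example.org", "example.com", "example.net"],
    "score": [100, 250, 75, 300],
})

# Total score per (subreddit, domain), highest first within each subreddit.
domain_totals = (df_final
                 .groupby(["subreddit", "domain"], as_index=False)["score"]
                 .sum()
                 .sort_values(["subreddit", "score"], ascending=[True, False])
                 .reset_index(drop=True))

print(domain_totals)
```

Each row of `domain_totals` is then a domain's summed score within one subreddit, which is the figure this analysis set out to compare.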
