Scraping Reddit to find the most popular domains

July 2019

Overview

Below, we'll show you how to scrape Reddit using PRAW (the Python Reddit API Wrapper). For this example, our goal is to scrape the year's top submissions across a few subreddits, storing the following for each one: submission URL, domain (the website the submission links to), and submission score. Ultimately, we want to see which domains generate the highest-scoring posts in a given subreddit.

1) Import packages, set up PRAW, select subreddits

Here we set up our PRAW credentials and choose the list of subreddits we want to analyze.

#packages
import pandas as pd
import praw

#set up praw - setup here: http://praw.readthedocs.io/en/latest/getting_started/quick_start.html
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent')

#create list of subreddits to include
#enter the subreddits you want as comma-separated strings; 'news' and 'datascience' are just examples
s_list = ['news', 'datascience']

2) Grab the score, domain (url), and subreddit for each top yearly submission

In this section we loop through our list of subreddits from above and store the score, domain, and subreddit of each submission. Each attribute goes into its own dataframe (three in total), and we then merge them together using the submission ID.

#set up dictionaries to store submission information, keyed by submission ID
domains_sub = {}
domains = {}
domains_score = {}

#loop through our selected list of subreddits
for i in s_list:
    #pull in top submissions for the year for the subreddit specified in the list above
    subreddit = reddit.subreddit(i)
    submissions = subreddit.top(time_filter='year', limit=50)

    #a single pass over the listing fills all three dictionaries
    for s in submissions:
        #--Grab the score for a given submission--#
        domains_score[s.id] = s.score
        #--Grab domain for given submission ID--#
        domains[s.id] = s.domain
        #--Grab subreddit for given submission ID--#
        domains_sub[s.id] = s.subreddit.display_name

#build one dataframe per attribute once the loop has finished
df_score = pd.DataFrame.from_dict(domains_score, orient='index').reset_index()
df_score.columns = ['id','score']

df_domain = pd.DataFrame.from_dict(domains, orient='index').reset_index()
df_domain.columns = ['id','domain']

df_subreddit = pd.DataFrame.from_dict(domains_sub, orient='index').reset_index()
df_subreddit.columns = ['id','subreddit']
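To make the dictionary-to-dataframe step easier to follow, here is a minimal sketch that runs without Reddit credentials. The IDs and scores are made up for illustration; only the from_dict / reset_index pattern mirrors the code above.

```python
import pandas as pd

# Hypothetical submission IDs and scores, standing in for what the PRAW loop collects
sample_scores = {'abc123': 150, 'def456': 75}

# orient='index' turns each dictionary key into a row label;
# reset_index() then moves those labels into an ordinary column
df_sample = pd.DataFrame.from_dict(sample_scores, orient='index').reset_index()

# Rename the two resulting columns so the ID can be used as a merge key later
df_sample.columns = ['id', 'score']

print(df_sample)
```

Each dictionary ends up as a two-column table with 'id' in the first column, which is what lets the three tables line up in the merge step.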

3) Merge dataframes

Now that we have dataframes containing score, domain (url), and subreddit, we can merge the three tables together, using the submission ID as the key.

#merge the three tables together, using submission ID as primary key
df_sub_score = df_subreddit.merge(df_score, how='left', on='id')
df_final = df_sub_score.merge(df_domain, how='left', on='id')

#add in submission URL using the 'id' (a plain string prefix broadcasts across every row)
df_final['url'] = 'www.reddit.com/' + df_final['id'].astype(str)

df_final.head()

	id	subreddit	score	domain	url
0	78tulq	todayilearned	42729	atlasobscura.com	www.reddit.com/78tulq
1	76bn5s	science	25024	ns.umich.edu	www.reddit.com/76bn5s
2	7871xy	science	30642	acsh.org	www.reddit.com/7871xy
3	77pnk6	science	13176	jech.bmj.com	www.reddit.com/77pnk6
4	75eydj	gaming	64510	i.redd.it	www.reddit.com/75eydj
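If you'd rather skip the merge, one possible alternative is to collect a row per submission and build a single dataframe directly. This is only a sketch: the helper name submissions_to_frame is hypothetical, and the stand-in objects below carry made-up values in place of real PRAW submissions so the example runs offline.

```python
from types import SimpleNamespace

import pandas as pd


def submissions_to_frame(submissions):
    """Build one dataframe from an iterable of submission-like objects."""
    rows = [
        {
            'id': s.id,
            'subreddit': s.subreddit.display_name,
            'score': s.score,
            'domain': s.domain,
        }
        for s in submissions
    ]
    df = pd.DataFrame(rows, columns=['id', 'subreddit', 'score', 'domain'])
    # Same URL construction as in the merge-based version
    df['url'] = 'www.reddit.com/' + df['id'].astype(str)
    return df


# Stand-in objects with made-up values (assumption: only these four attributes are needed)
fake_submissions = [
    SimpleNamespace(id='abc123', score=150, domain='example.com',
                    subreddit=SimpleNamespace(display_name='news')),
    SimpleNamespace(id='def456', score=75, domain='example.org',
                    subreddit=SimpleNamespace(display_name='datascience')),
]

df_direct = submissions_to_frame(fake_submissions)
print(df_direct)
```

With real data you would pass in the submissions returned by subreddit.top(...) for each subreddit and concatenate the results.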

Done! Explore the output

We now have a nice clean dataframe of the top yearly posts from each chosen subreddit, allowing us to see which domains racked up the highest total scores. I dumped the dataframe into a Google Sheet for you to explore.
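To actually rank domains by total score, one option is a groupby on subreddit and domain. The sketch below rebuilds the five rows from the df_final.head() output by hand so it runs without Reddit access; the names df_example and domain_totals are illustrative, and with real data you would group df_final instead.

```python
import pandas as pd

# The five rows shown in the df_final.head() output, rebuilt by hand
df_example = pd.DataFrame({
    'id': ['78tulq', '76bn5s', '7871xy', '77pnk6', '75eydj'],
    'subreddit': ['todayilearned', 'science', 'science', 'science', 'gaming'],
    'score': [42729, 25024, 30642, 13176, 64510],
    'domain': ['atlasobscura.com', 'ns.umich.edu', 'acsh.org',
               'jech.bmj.com', 'i.redd.it'],
})

# Sum the scores for each (subreddit, domain) pair, highest total first
domain_totals = (
    df_example.groupby(['subreddit', 'domain'], as_index=False)['score']
    .sum()
    .sort_values('score', ascending=False)
    .reset_index(drop=True)
)

print(domain_totals)
```

In this small sample every domain appears only once, so the totals equal the individual scores; across a full set of submissions, domains that appear repeatedly would accumulate.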