Below, we'll show you how to scrape Reddit using PRAW (the Python Reddit API Wrapper). In this example, we pull the top submissions of the past year from a few subreddits and store three things for each one: the submission URL, the domain (the website the submission links to), and the submission score. Ultimately, we want to see which domains produce the highest-scoring posts in a given subreddit.
First, we set up our PRAW credentials and choose the list of subreddits we want to analyze.
#packages
import pandas as pd
import praw
import operator
#set up praw - setup here: http://praw.readthedocs.io/en/latest/getting_started/quick_start.html
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent')
#create list of subreddits to include
#enter the subreddits you want as comma separated strings, e.g. 'news', 'datascience', etc
s_list = ['news', 'datascience']
In this section we loop through the list of subreddits from above and record the score, domain, and subreddit of each top submission. Each attribute goes into its own dataframe (three in total), and we then merge them together using the submission ID.
#set up dictionaries to store submission information
domains_sub = {}
domains = {}
domains_score = {}
domains_url = {}
#Loop through our selected list of subreddits
for i in s_list:
    #pull in top submissions for the year for subreddit specified in list above
    subreddit = reddit.subreddit(i)
    submissions = subreddit.top(time_filter='year', limit=50)
    #one pass over the listing fills all three dictionaries, keyed by submission ID
    #(submission IDs are unique, so plain assignment is enough)
    for s in submissions:
        domains_score[s.id] = s.score                   #--score for given submission ID--#
        domains[s.id] = s.domain                        #--domain for given submission ID--#
        domains_sub[s.id] = s.subreddit.display_name    #--subreddit for given submission ID--#

#once the loop has finished, build one dataframe per attribute
df_score = pd.DataFrame.from_dict(domains_score, orient='index').reset_index()
df_score.columns = ['id','score']
df_domain = pd.DataFrame.from_dict(domains, orient='index').reset_index()
df_domain.columns = ['id','domain']
df_subreddit = pd.DataFrame.from_dict(domains_sub, orient='index').reset_index()
df_subreddit.columns = ['id','subreddit']
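As an aside, the same three attributes could also be collected as one row per submission, which gives a single dataframe directly. The following is only a sketch of that alternative, not the approach used in this post: `submissions_to_frame` and `fake_submissions` are hypothetical names, and the stand-in objects carry made-up values so the example runs without Reddit credentials.

```python
from types import SimpleNamespace

import pandas as pd


def submissions_to_frame(submissions):
    """Build one DataFrame with a row per submission.

    `submissions` is any iterable of objects exposing id, score, domain and
    subreddit.display_name (the attributes read in the loop above).
    """
    rows = [
        {
            "id": s.id,
            "subreddit": s.subreddit.display_name,
            "score": s.score,
            "domain": s.domain,
        }
        for s in submissions
    ]
    # Passing `columns` fixes the column order and keeps the columns
    # present even when no submissions were supplied.
    return pd.DataFrame(rows, columns=["id", "subreddit", "score", "domain"])


# Stand-in objects with made-up values (hypothetical, not real Reddit data).
fake_submissions = [
    SimpleNamespace(id="abc123", score=150, domain="example.com",
                    subreddit=SimpleNamespace(display_name="news")),
    SimpleNamespace(id="def456", score=90, domain="example.org",
                    subreddit=SimpleNamespace(display_name="datascience")),
]

df_single = submissions_to_frame(fake_submissions)
print(df_single)
```

With real data, the iterable passed in would be the listing returned for each subreddit; the frames for several subreddits could then be combined with `pd.concat`.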
Now that we have dataframes containing the score, domain, and subreddit, we can merge the three tables together, using the submission ID as the key.
#merge the three tables together, using submission ID as primary key
df_sub_score = df_subreddit.merge(df_score, how='left', on="id")
df_final = df_sub_score.merge(df_domain, how='left', on='id')
# Add in submission URL using the 'id'
df_final['url'] = 'https://www.reddit.com/' + df_final['id'].astype(str)
df_final.head()
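To get from this table to the question we started with (which domains generate the highest-scoring posts in a subreddit), one option is to group by subreddit and domain and summarize the scores. The sketch below uses a small made-up frame, `df_example`, shaped like `df_final`; the names `df_example` and `domain_summary` and all values in it are illustrative assumptions, and it assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Made-up rows with the same columns as df_final (id, subreddit, score, domain).
df_example = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "subreddit": ["news", "news", "news", "datascience"],
    "score": [100, 300, 50, 80],
    "domain": ["example.com", "example.com", "example.org", "example.net"],
})

# Per subreddit and domain: number of posts, total score, and mean score,
# with the highest mean score listed first inside each subreddit.
domain_summary = (
    df_example.groupby(["subreddit", "domain"])["score"]
    .agg(posts="count", total_score="sum", mean_score="mean")
    .reset_index()
    .sort_values(["subreddit", "mean_score"], ascending=[True, False])
    .reset_index(drop=True)
)
print(domain_summary)
```

Swapping `df_example` for `df_final` should give the same summary for the scraped data. Whether to rank by mean or total score is a judgment call: the mean favors domains with a few very popular posts, while the total favors domains that appear often.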