
Which news publications have the most positive headlines? (simple sentiment analysis with Textblob)

Analyzing more than a hundred thousand news headlines with Textblob to determine which publications are the most positive.

October 1, 2019
7 mins read

Inspiration/base dataset

Kaggle provides a great dataset containing news headlines for most major publications. I decided to run some simple sentiment analysis using Textblob, a Python library for processing textual data that comes with some pre-trained sentiment classifiers. One could of course train their own model and probably obtain more accurate results overall, but I wasn't able to quickly find a clean dataset of news headlines tagged with sentiment. Textblob should work fine for comparing the publications relative to each other.

To start, from Kaggle we're given a table as follows:

| | author | content | date | month | publication | title | url | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Avi Selk | Uber driver Keith Avila picked up a p... | 2016-12-30 | 12.0 | Washington Post | An eavesdropping Uber driver saved his 16-year... | https://web.archive.org/web/20161231004909/htt... | 2016.0 |
| 1 | Sarah Larimer | Crews on Friday continued to search L... | 2016-12-30 | 12.0 | Washington Post | Plane carrying six people returning from a Cav... | https://web.archive.org/web/20161231004909/htt... | 2016.0 |
| 2 | Renae Merle | When the Obama administration announced a... | 2016-12-30 | 12.0 | Washington Post | After helping a fraction of homeowners expecte... | https://web.archive.org/web/20161231004909/htt... | 2016.0 |
| 3 | Chelsea Harvey | This story has been updated. A new law in... | 2016-12-30 | 12.0 | Washington Post | Yes, this is real: Michigan just banned bannin... | https://web.archive.org/web/20161231004909/htt... | 2016.0 |
| 4 | Christopher Ingraham | The nation’s first recreational marijuana... | 2016-12-29 | 12.0 | Washington Post | What happened in Washington state after voters... | https://web.archive.org/web/20161231004909/htt... | 2016.0 |

After poking around the n-counts by year, it looks like 2016 and 2017 are the only years with a decent amount of data (more than roughly 2,000 headlines for each publication). So, I filtered the dataset to these two years:

#years came in string.0 format, just filtering them as they are for now
years = ['2016.0','2017.0']
df = df.loc[df['year'].isin(years)]  
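
The per-year counts mentioned above can be checked with a groupby before filtering. This is a minimal sketch, assuming the same `publication` and `year` columns as the Kaggle table; the toy rows and the names `toy`, `counts` and `filtered` are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the Kaggle table (hypothetical rows, same column names)
toy = pd.DataFrame({
    'publication': ['Vox', 'Vox', 'Vox', 'NPR', 'NPR', 'NPR'],
    'year': ['2015.0', '2016.0', '2017.0', '2016.0', '2016.0', '2017.0'],
    'title': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Number of headlines per publication (rows) and year (columns)
counts = toy.groupby(['publication', 'year']).size().unstack(fill_value=0)
print(counts)

# Keep only the years with enough data, mirroring the filter above
years = ['2016.0', '2017.0']
filtered = toy.loc[toy['year'].isin(years)]
print(len(filtered))
```

Looking at the counts as a publication-by-year grid makes it easy to spot years where some publications have too few headlines to compare.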

Running our data through Textblob to determine news headline sentiment

Great, so now we have a pretty clean dataframe containing headlines by news publication. Next, I'll create a utility function to compute the polarity of each headline (ranging from -1 as most negative to 1 as most positive):

from textblob import TextBlob

#used to analyze sentiment w/ textblob, leaving the raw output
def analize_sentiment_raw(headline):
    '''
    Function to classify the polarity of a headline
    using textblob.
    '''
    analysis = TextBlob(headline)
    #polarity = negative (-1) to positive (1)
    return analysis.sentiment.polarity

Next, we'll just feed the headlines into the function we created above:

import numpy as np

#score every headline and store the polarity in a new 'SA' column
df['SA'] = np.array([analize_sentiment_raw(headline) for headline in df['title']])

We can now see the polarity score (again ranging from -1 to 1) reflected in our dataframe:

| | title | publication | author | date | year | month | url | content | SA | publish_year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | An eavesdropping Uber driver saved his 16-year... | Washington Post | Avi Selk | 2016-12-30 | 2016.0 | 12.0 | https://web.archive.org/web/20161231004909/htt... | Uber driver Keith Avila picked up a p... | 0.00 | 2016 |
| 1 | Plane carrying six people returning from a Cav... | Washington Post | Sarah Larimer | 2016-12-30 | 2016.0 | 12.0 | https://web.archive.org/web/20161231004909/htt... | Crews on Friday continued to search L... | -0.40 | 2016 |
| 2 | After helping a fraction of homeowners expecte... | Washington Post | Renae Merle | 2016-12-30 | 2016.0 | 12.0 | https://web.archive.org/web/20161231004909/htt... | When the Obama administration announced a... | -0.05 | 2016 |
| 3 | Yes, this is real: Michigan just banned bannin... | Washington Post | Chelsea Harvey | 2016-12-30 | 2016.0 | 12.0 | https://web.archive.org/web/20161231004909/htt... | This story has been updated. A new law in... | 0.20 | 2016 |
| 4 | What happened in Washington state after voters... | Washington Post | Christopher Ingraham | 2016-12-29 | 2016.0 | 12.0 | https://web.archive.org/web/20161231004909/htt... | The nation’s first recreational marijuana... | 0.00 | 2016 |
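
To build some intuition for the scores above, note that Textblob's default analyzer (the one used when no analyzer is passed, as here) scores text from a word lexicon. The sketch below is a toy illustration of that general idea, not Textblob's actual implementation: the `TOY_LEXICON` words and values and the `toy_polarity` function are invented for this example.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1] (values are made up)
TOY_LEXICON = {'great': 0.8, 'win': 0.5, 'saved': 0.4, 'crash': -0.6, 'terrible': -1.0}

def toy_polarity(headline):
    """Average the polarity of the known words; return 0.0 if none are known."""
    words = re.findall(r"[a-z']+", headline.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity('Plane crash investigation continues'))  # one negative word
print(toy_polarity('City council meets on Tuesday'))         # no known words
print(toy_polarity('Great win after terrible crash'))        # mixed words
```

In a scheme like this, a headline with no recognized words lands on exactly 0.0, which is one plausible reason short, factual headlines (such as rows 0 and 4 above) come out neutral.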

Pivoting the data to rank news publications by headline sentiment

Now that we have a sentiment score attached to each news headline (the 'SA' column above), we can pivot to summarize which publications have the most positive/negative headlines:

#grab a slice of the original df with just publication and polarity score
df_clean = df[['SA', 'publication']]
#pivoting to grab the average sentiment by publication
df_clean2 = df_clean.groupby(['publication']).mean().reset_index()
#preview the dataframe, sorting by most positive
df_clean2.sort_values(['SA'],ascending=False).head(15)

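Below, we can see that (according to Textblob's pre-trained classifier) Business Insider has the most positive headlines on average, with Breitbart coming in at the bottom. Note that every publication's average is slightly above zero, so Breitbart's headlines are the least positive on average rather than net negative.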

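| | publication | SA |
|---|---|---|
| 2 | Business Insider | 0.060214 |
| 12 | Vox | 0.051213 |
| 7 | NPR | 0.043666 |
| 11 | Talking Points Memo | 0.035564 |
| 6 | Guardian | 0.032476 |
| 3 | Buzzfeed News | 0.028023 |
| 0 | Atlantic | 0.026798 |
| 13 | Washington Post | 0.022752 |
| 10 | Reuters | 0.020006 |
| 9 | New York Post | 0.019992 |
| 5 | Fox News | 0.016228 |
| 4 | CNN | 0.015503 |
| 8 | National Review | 0.014110 |
| 1 | Breitbart | 0.009544 |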
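
All of these averages sit close to zero, which is what you would expect if many headlines score exactly 0. A small sketch of how to check that, using made-up scores and hypothetical publications `A` and `B`: it reports the mean alongside the headline count and the mean over non-neutral headlines only.

```python
import pandas as pd

# Toy polarity scores (hypothetical values, same column names as above)
toy = pd.DataFrame({
    'publication': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'SA': [0.0, 0.0, 0.0, 0.4, 0.1, 0.1, 0.1, 0.1],
})

# Mean polarity and headline count per publication
summary = toy.groupby('publication')['SA'].agg(['mean', 'count'])

# Mean over non-neutral headlines only, to see how much the zeros dilute the average
nonzero_mean = toy[toy['SA'] != 0].groupby('publication')['SA'].mean()
summary['mean_nonzero'] = nonzero_mean
print(summary)
```

Here both toy publications have the same overall mean, but one gets there with a single strongly positive headline and the other with many mildly positive ones, so the mean alone does not tell the whole story.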

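Next, let's check out the n-counts on these, and add a view that shows the % of headlines that are positive (score > 0), neutral (score = 0), or negative (score < 0). First, the counts:

#flag each headline as positive/neutral/negative
#(.copy() avoids pandas' SettingWithCopyWarning when adding columns to a slice)
df_clean = df[['SA', 'publication']].copy()
df_clean['positive'] = np.where(df_clean['SA'] > 0, 1, 0)
df_clean['neutral'] = np.where(df_clean['SA'] == 0, 1, 0)
df_clean['negative'] = np.where(df_clean['SA'] < 0, 1, 0)
df_clean2 = df_clean[['publication','positive','neutral','negative']]
#sum the flags to get the number of headlines of each type per publication
df_clean3 = df_clean2.groupby(['publication']).sum().reset_index()
#preview the counts by publication
df_clean3.head(15)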
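
| | publication | positive | neutral | negative |
|---|---|---|---|---|
| 0 | Atlantic | 1435 | 4781 | 963 |
| 1 | Breitbart | 5469 | 13536 | 4701 |
| 2 | Business Insider | 2585 | 2852 | 1320 |
| 3 | Buzzfeed News | 1354 | 2549 | 951 |
| 4 | CNN | 1591 | 5056 | 1279 |
| 5 | Fox News | 990 | 2493 | 839 |
| 6 | Guardian | 2366 | 4558 | 1700 |
| 7 | NPR | 3451 | 6359 | 2010 |
| 8 | National Review | 1203 | 4064 | 904 |
| 9 | New York Post | 4677 | 9196 | 3573 |
| 10 | Reuters | 2423 | 6600 | 1664 |
| 11 | Talking Points Memo | 670 | 1499 | 393 |
| 12 | Vox | 1730 | 2134 | 1014 |
| 13 | Washington Post | 3175 | 5555 | 2348 |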
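
An alternative way to get the same kind of count table in one step is to label each score by its sign and cross-tabulate. This is only a sketch on made-up data, with hypothetical publications `A` and `B`; it is not the code used for the table above.

```python
import numpy as np
import pandas as pd

# Toy polarity scores (hypothetical values, same column names as above)
toy = pd.DataFrame({
    'publication': ['A', 'A', 'A', 'B', 'B'],
    'SA': [0.5, 0.0, -0.2, 0.0, 0.3],
})

# np.sign gives -1.0, 0.0 or 1.0, which we map to a readable label
labels = np.sign(toy['SA']).map({1.0: 'positive', 0.0: 'neutral', -1.0: 'negative'})

# Rows = publication, columns = label, cells = number of headlines
counts = pd.crosstab(toy['publication'], labels)
print(counts)
```

`pd.crosstab` fills combinations that never occur with 0, so a publication with no negative headlines still gets a `negative` cell.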

Great, we can now see the number of headlines Textblob identified as positive/neutral/negative. Let's add some percentages to this dataframe too:

#fields for positive/neutral/negative percents (stored as fractions of each publication's total)
total = df_clean3[['positive', 'neutral', 'negative']].sum(axis=1)
df_clean3['positive_perc'] = df_clean3['positive'] / total
df_clean3['neutral_perc'] = df_clean3['neutral'] / total
df_clean3['negative_perc'] = df_clean3['negative'] / total
#sort and preview dataframe
df_clean3.sort_values(['positive_perc'],ascending=False).head(15)
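
| | publication | positive | neutral | negative | positive_perc | neutral_perc | negative_perc |
|---|---|---|---|---|---|---|---|
| 2 | Business Insider | 2585 | 2852 | 1320 | 0.382566 | 0.422081 | 0.195353 |
| 12 | Vox | 1730 | 2134 | 1014 | 0.354654 | 0.437474 | 0.207872 |
| 7 | NPR | 3451 | 6359 | 2010 | 0.291963 | 0.537986 | 0.170051 |
| 13 | Washington Post | 3175 | 5555 | 2348 | 0.286604 | 0.501444 | 0.211952 |
| 3 | Buzzfeed News | 1354 | 2549 | 951 | 0.278945 | 0.525134 | 0.195921 |
| 6 | Guardian | 2366 | 4558 | 1700 | 0.274351 | 0.528525 | 0.197124 |
| 9 | New York Post | 4677 | 9196 | 3573 | 0.268084 | 0.527112 | 0.204803 |
| 11 | Talking Points Memo | 670 | 1499 | 393 | 0.261514 | 0.585090 | 0.153396 |
| 1 | Breitbart | 5469 | 13536 | 4701 | 0.230701 | 0.570995 | 0.198304 |
| 5 | Fox News | 990 | 2493 | 839 | 0.229061 | 0.576816 | 0.194123 |
| 10 | Reuters | 2423 | 6600 | 1664 | 0.226724 | 0.617573 | 0.155703 |
| 4 | CNN | 1591 | 5056 | 1279 | 0.200732 | 0.637901 | 0.161368 |
| 0 | Atlantic | 1435 | 4781 | 963 | 0.199889 | 0.665970 | 0.134141 |
| 8 | National Review | 1203 | 4064 | 904 | 0.194944 | 0.658564 | 0.146492 |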
Visualizing news publication sentiment

Using plotly, we can make a quick bar chart to visualize the average headline sentiment by publication from our pivoted data above:

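import plotly.graph_objs as go
#plotly.plotly was moved to the chart_studio package in plotly 4+;
#on older versions use: import plotly.plotly as py
import chart_studio.plotly as py

#bar chart
#average sentiment by publication, recomputed from df because df_clean2 was reassigned above
df_avg = df.groupby('publication', as_index=False)['SA'].mean()
x = df_avg['publication']
y = df_avg['SA']

data = [
    go.Bar(
        x=x,
        y=y,
        marker=dict(
            color='#8bffc0f2',
            line=dict(
                color='#d8ffea96',
                width=1.5
            ),
        ),
        opacity=1
    )
]
layout = go.Layout(
    title='News publications with the most/least positive headlines',
    xaxis=dict(
        title='News publication'
    ),
    yaxis=dict(
        title='Sentiment (higher = more positive)',
    ),
    #a plain dict works for the margin across plotly versions (go.Margin is deprecated)
    margin=dict(
        l=140,
        r=90,
        b=140,
        t=60,
        pad=4
    ),
)
fig = go.Figure(data=data, layout=layout)
#py.iplot uploads the figure to a Chart Studio account and embeds it in the notebook
py.iplot(fig, filename='news_pub_sentiment')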

Voilà! We were able to quickly use Textblob to classify and rank sentiment across more than a hundred thousand news headlines.

For those interested, the Jupyter notebook with the complete code can be found in our GitHub repository.