Which news publications have the most positive headlines? (simple sentiment analysis with TextBlob)

November 2019

Inspiration/base dataset

Kaggle provides a great dataset containing news headlines for most major publications. I decided to run some simple sentiment analysis using TextBlob, a Python library for processing textual data that comes with some pre-trained sentiment classifiers. One could of course train a custom model and probably get more accurate results overall, but I wasn't able to quickly find a clean dataset of news headlines tagged with sentiment. TextBlob should work fine for comparing the publications relative to each other.

To start, Kaggle gives us a table like this:

	author	content	date	month	publication	title	url	year
0	Avi Selk	Uber driver Keith Avila picked up a p...	2016-12-30	12.0	Washington Post	An eavesdropping Uber driver saved his 16-year...	https://web.archive.org/web/20161231004909/htt...	2016.0
1	Sarah Larimer	Crews on Friday continued to search L...	2016-12-30	12.0	Washington Post	Plane carrying six people returning from a Cav...	https://web.archive.org/web/20161231004909/htt...	2016.0
2	Renae Merle	When the Obama administration announced a...	2016-12-30	12.0	Washington Post	After helping a fraction of homeowners expecte...	https://web.archive.org/web/20161231004909/htt...	2016.0
3	Chelsea Harvey	This story has been updated. A new law in...	2016-12-30	12.0	Washington Post	Yes, this is real: Michigan just banned bannin...	https://web.archive.org/web/20161231004909/htt...	2016.0
4	Christopher Ingraham	The nation’s first recreational marijuana...	2016-12-29	12.0	Washington Post	What happened in Washington state after voters...	https://web.archive.org/web/20161231004909/htt...	2016.0

After poking around the headline counts by year, it looks like 2016-2017 are the only years with decent data (more than roughly 2,000 headlines for each publication). So, I filtered the dataset to these two years:

#years came in string.0 format, just filtering them as they are for now
years = ['2016.0','2017.0']
df = df.loc[df['year'].isin(years)]
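
The count check behind that filter can be sketched as follows. This is only an illustration on a made-up stand-in frame: `toy`, `counts` and `toy_filtered` are hypothetical names, the rows are invented, and the only assumption about the real data is that it has the `publication` and `year` columns shown above.

```python
import pandas as pd

# Toy stand-in for the Kaggle table (rows invented for illustration).
toy = pd.DataFrame({
    'publication': ['Vox', 'Vox', 'Vox', 'CNN', 'CNN', 'CNN'],
    'year': ['2015.0', '2016.0', '2017.0', '2016.0', '2016.0', '2017.0'],
})

# Headline counts per publication (rows) and year (columns);
# combinations that never occur show up as 0.
counts = pd.crosstab(toy['publication'], toy['year'])
print(counts)

# Keep only the two well-covered years, mirroring the filter above.
years = ['2016.0', '2017.0']
toy_filtered = toy.loc[toy['year'].isin(years)]
```

Scanning a table like `counts` makes it easy to spot years where some publications have few or no headlines.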

Running our data through TextBlob to determine news headline sentiment

Great, so now we have a pretty clean dataframe containing headlines by news publication. Next, I'll create a utility function that returns the polarity of each headline (ranging from -1 as most negative to 1 as most positive):

from textblob import TextBlob

#used to analyze sentiment w/ textblob, leaving the raw output
def analize_sentiment_raw(headline):
    '''
    Function to classify the polarity of a headline
    using textblob.
    '''
    analysis = TextBlob(headline)
    #polarity = negative (-1) to positive (1)
    return analysis.sentiment.polarity

Next, we'll just feed the headlines into the function we created above:

df['SA'] = np.array([ analize_sentiment_raw(headline) for headline in df['title']])

We can now see the polarity score (again ranging from -1 to 1) reflected in our dataframe:

	title	publication	author	date	year	month	url	content	SA	publish_year
0	An eavesdropping Uber driver saved his 16-year...	Washington Post	Avi Selk	2016-12-30	2016.0	12.0	https://web.archive.org/web/20161231004909/htt...	Uber driver Keith Avila picked up a p...	0.00	2016
1	Plane carrying six people returning from a Cav...	Washington Post	Sarah Larimer	2016-12-30	2016.0	12.0	https://web.archive.org/web/20161231004909/htt...	Crews on Friday continued to search L...	-0.40	2016
2	After helping a fraction of homeowners expecte...	Washington Post	Renae Merle	2016-12-30	2016.0	12.0	https://web.archive.org/web/20161231004909/htt...	When the Obama administration announced a...	-0.05	2016
3	Yes, this is real: Michigan just banned bannin...	Washington Post	Chelsea Harvey	2016-12-30	2016.0	12.0	https://web.archive.org/web/20161231004909/htt...	This story has been updated. A new law in...	0.20	2016
4	What happened in Washington state after voters...	Washington Post	Christopher Ingraham	2016-12-29	2016.0	12.0	https://web.archive.org/web/20161231004909/htt...	The nation’s first recreational marijuana...	0.00	2016

Pivoting the data to rank news publications by headline sentiment

Now that each headline has a sentiment score in the 'SA' column above, we can group by publication to see which ones have the most positive and most negative headlines:

#grab a slice of the original df with just publication and polarity score
#(.copy() so columns added later don't trigger SettingWithCopyWarning)
df_clean = df[['SA', 'publication']].copy()
#pivoting to grab the average sentiment by publication
df_clean2 = df_clean.groupby(['publication']).mean().reset_index()
#preview the dataframe, sorting by most positive
df_clean2.sort_values(['SA'],ascending=False).head(15)

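Below, we can see that (according to TextBlob's pre-trained classifier) Business Insider has the most positive headlines on average, with Breitbart coming in at the bottom with the least positive. Note that every publication's average is slightly above zero, so the ranking reflects small relative differences rather than strongly positive or negative coverage.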

	publication	SA
2	Business Insider	0.060214
12	Vox	0.051213
7	NPR	0.043666
11	Talking Points Memo	0.035564
6	Guardian	0.032476
3	Buzzfeed News	0.028023
0	Atlantic	0.026798
13	Washington Post	0.022752
10	Reuters	0.020006
9	New York Post	0.019992
5	Fox News	0.016228
4	CNN	0.015503
8	National Review	0.014110
1	Breitbart	0.009544
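
An average on its own says nothing about how many headlines went into it, so it can help to look at both at once. Here is a minimal sketch on invented data; the `toy` frame and the `summary` name are placeholders and not part of the original notebook.

```python
import pandas as pd

# Toy polarity scores for two made-up publications.
toy = pd.DataFrame({
    'publication': ['A', 'A', 'A', 'B', 'B'],
    'SA': [0.5, 0.0, -0.2, 0.0, 0.1],
})

# Mean polarity and headline count per publication, most positive first.
summary = (
    toy.groupby('publication')['SA']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
       .reset_index()
)
print(summary)
```

A publication with a small `count` deserves more caution when comparing its `mean` against the others.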

Next, let's check out the n-counts on these, and add a view that shows the % of articles that are positive (score >0), negative (score <0), or neutral (score = 0):

#flag each headline as positive, neutral, or negative
df_clean['positive'] = np.where(df['SA'] > 0, 1, 0)
df_clean['neutral'] = np.where(df['SA'] == 0, 1, 0)
df_clean['negative'] = np.where(df['SA'] < 0, 1, 0)
#sum the flags by publication
df_clean2 = df_clean[['publication','positive','neutral','negative']]
df_clean3 = df_clean2.groupby(['publication']).sum().reset_index()
df_clean3.sort_values(['positive'],ascending=False).head(15)


|    | publication         | positive | neutral | negative |
| -- | ------------------- | -------- | ------- | -------- |
| 0  | Atlantic            | 1435     | 4781    | 963      |
| 1  | Breitbart           | 5469     | 13536   | 4701     |
| 2  | Business Insider    | 2585     | 2852    | 1320     |
| 3  | Buzzfeed News       | 1354     | 2549    | 951      |
| 4  | CNN                 | 1591     | 5056    | 1279     |
| 5  | Fox News            | 990      | 2493    | 839      |
| 6  | Guardian            | 2366     | 4558    | 1700     |
| 7  | NPR                 | 3451     | 6359    | 2010     |
| 8  | National Review     | 1203     | 4064    | 904      |
| 9  | New York Post       | 4677     | 9196    | 3573     |
| 10 | Reuters             | 2423     | 6600    | 1664     |
| 11 | Talking Points Memo | 670      | 1499    | 393      |
| 12 | Vox                 | 1730     | 2134    | 1014     |
| 13 | Washington Post     | 3175     | 5555    | 2348     |
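
The same positive/neutral/negative breakdown can also be sketched in one pass with a single label column and `pd.crosstab`. This is an alternative to the flag columns, not the approach used in the notebook. The data is invented, and `toy`, `bucket`, `bucket_counts` and `bucket_shares` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy polarity scores for two made-up publications.
toy = pd.DataFrame({
    'publication': ['A', 'A', 'A', 'A', 'B', 'B'],
    'SA': [0.5, 0.0, -0.2, 0.3, 0.0, -0.1],
})

# np.sign maps each score to -1, 0 or 1; translate those into labels.
labels = {-1.0: 'negative', 0.0: 'neutral', 1.0: 'positive'}
toy['bucket'] = np.sign(toy['SA']).map(labels)

# Headline counts per publication and bucket.
bucket_counts = pd.crosstab(toy['publication'], toy['bucket'])

# The same table as row-wise shares (each row sums to 1).
bucket_shares = pd.crosstab(toy['publication'], toy['bucket'], normalize='index')

print(bucket_counts)
print(bucket_shares)
```

Using one label column means the three buckets cannot overlap, and `normalize='index'` gives the per-publication shares without a separate division step.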

Great, we can now see the # of headlines TextBlob identified as positive/neutral/negative. Let's add some %s into this df too:

```python
#total headlines per publication
total = df_clean3['positive'] + df_clean3['neutral'] + df_clean3['negative']
#fields for positive/neutral/negative percents (stored as fractions of 1)
df_clean3['positive_perc'] = df_clean3['positive'] / total
df_clean3['neutral_perc'] = df_clean3['neutral'] / total
df_clean3['negative_perc'] = df_clean3['negative'] / total
#sort and preview dataframe
df_clean3.sort_values(['positive_perc'],ascending=False).head(15)
```

	publication	positive	neutral	negative	positive_perc	neutral_perc	negative_perc
2	Business Insider	2585	2852	1320	0.382566	0.422081	0.195353
12	Vox	1730	2134	1014	0.354654	0.437474	0.207872
7	NPR	3451	6359	2010	0.291963	0.537986	0.170051
13	Washington Post	3175	5555	2348	0.286604	0.501444	0.211952
3	Buzzfeed News	1354	2549	951	0.278945	0.525134	0.195921
6	Guardian	2366	4558	1700	0.274351	0.528525	0.197124
9	New York Post	4677	9196	3573	0.268084	0.527112	0.204803
11	Talking Points Memo	670	1499	393	0.261514	0.585090	0.153396
1	Breitbart	5469	13536	4701	0.230701	0.570995	0.198304
5	Fox News	990	2493	839	0.229061	0.576816	0.194123
10	Reuters	2423	6600	1664	0.226724	0.617573	0.155703
4	CNN	1591	5056	1279	0.200732	0.637901	0.161368
0	Atlantic	1435	4781	963	0.199889	0.665970	0.134141
8	National Review	1203	4064	904	0.194944	0.658564	0.146492

Visualizing news publication sentiment

Using Plotly, we can make a quick bar chart to visualize the average sentiment by news publication from our pivoted data above:

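import plotly.graph_objs as go
#py is assumed to be Chart Studio's plotting module
#(it was plotly.plotly in plotly versions before 4)
import chart_studio.plotly as py

#bar chart of average sentiment by publication
#(recomputed from df because df_clean2 was reassigned above)
avg_sentiment = df.groupby('publication')['SA'].mean().reset_index()
x = avg_sentiment['publication']
y = avg_sentiment['SA']

data = [
    go.Bar(
        x=x,
        y=y,
        marker=dict(
            color='#8bffc0f2',
            line=dict(
                color='#d8ffea96',
                width=1.5
            ),
        ),
        opacity=1
    )
]
layout = go.Layout(
    title='News publications with the most/least positive headlines',
    xaxis=dict(
        title='News publication'
    ),
    yaxis=dict(
        title='Sentiment (higher = more positive)',
    ),
    #a plain dict works here; go.Margin is deprecated
    margin=dict(
        l=140,
        r=90,
        b=140,
        t=60,
        pad=4
    ),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='news_pub_sentiment')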

Voilà! We were able to quickly use TextBlob to classify and rank sentiment across more than 100,000 news headlines (roughly 128,000, going by the counts above).

For those interested, the Jupyter notebook with the complete code can be found in our GitHub repository.