In this tutorial I'll walk you through a simple methodology to correlate various stocks against each other. We'll grab the prices of the selected stocks using python, drop them into a clean dataframe, run a correlation, and visualize our results.
(1) Import libraries, select our list of stocks to correlate
import numpy as np
import pandas as pd
#used to grab the stock prices, with yahoo
import pandas_datareader as web
from datetime import datetime
#to visualize the results
import matplotlib.pyplot as plt
import seaborn
#select start date for correlation window as well as list of tickers
start = datetime(2017, 1, 1)
symbols_list = ['AAPL', 'F', 'TWTR', 'FB', 'AAL', 'AMZN', 'GOOGL', 'GE']
(2) Pull stock prices, push into clean dataframe
#array to store prices
symbols=[]
#pull price using iex for each symbol in list defined above
for ticker in symbols_list:
r = web.DataReader(ticker, 'yahoo', start)
# add a symbol column
r['Symbol'] = ticker
symbols.append(r)
# concatenate into df
df = pd.concat(symbols)
df = df.reset_index()
df = df[['Date', 'Close', 'Symbol']]
df.head()
We're now left with a table in this format:
date | close | symbol |
---|---|---|
2017-01-03 | 113.4101 | AAPL |
2017-01-04 | 113.2832 | AAPL |
2017-01-05 | 113.8593 | AAPL |
3) However, we want our symbols represented as columns so we'll have to pivot the dataframe:
df_pivot = df.pivot('date','symbol','close').reset_index()
df_pivot.head()
date | AAL | AAPL | AMZN | F | FB | GE | GOOGL | TWTR | |
---|---|---|---|---|---|---|---|---|---|
0 | 2017-01-03 | 45.7092 | 113.4101 | 753.67 | 11.4841 | 116.86 | 30.1370 | 808.01 | 16.44 |
1 | 2017-01-04 | 46.1041 | 113.2832 | 757.18 | 12.0132 | 118.69 | 30.1465 | 807.77 | 16.86 |
2 | 2017-01-05 | 45.3044 | 113.8593 | 780.45 | 11.6483 | 120.67 | 29.9754 | 813.02 | 17.09 |
3 | 2017-01-06 | 45.6203 | 115.1286 | 795.99 | 11.6392 | 123.41 | 30.0610 | 825.21 | 17.17 |
4 | 2017-01-09 | 46.4792 | 116.1832 | 796.92 | 11.5206 | 124.90 | 29.9183 | 827.18 | 17.50 |
(4) Next, we can run the correlation. Using the Pandas 'corr' function to compute the Pearson correlation coeffecient between each pair of equities
corr_df = df_pivot.corr(method='pearson')
#reset symbol as index (rather than 0-X)
corr_df.head().reset_index()
del corr_df.index.name
corr_df.head(10)
symbol | AAL | AAPL | AMZN | F | FB | GE | GOOGL | TWTR |
---|---|---|---|---|---|---|---|---|
AAL | 1.000000 | 0.235239 | 0.226061 | 0.068024 | 0.356063 | -0.332681 | 0.494075 | 0.169487 |
AAPL | 0.235239 | 1.000000 | 0.868763 | 0.184501 | 0.911380 | -0.895210 | 0.903191 | 0.781755 |
AMZN | 0.226061 | 0.868763 | 1.000000 | 0.108351 | 0.744732 | -0.937415 | 0.864455 | 0.955373 |
F | 0.068024 | 0.184501 | 0.108351 | 1.000000 | 0.206055 | -0.216064 | 0.189753 | 0.161078 |
FB | 0.356063 | 0.911380 | 0.744732 | 0.206055 | 1.000000 | -0.814703 | 0.900033 | 0.650404 |
GE | -0.332681 | -0.895210 | -0.937415 | -0.216064 | -0.814703 | 1.000000 | -0.882526 | -0.866871 |
GOOGL | 0.494075 | 0.903191 | 0.864455 | 0.189753 | 0.900033 | -0.882526 | 1.000000 | 0.789379 |
TWTR | 0.169487 | 0.781755 | 0.955373 | 0.161078 | 0.650404 | -0.866871 | 0.789379 | 1.000000 |
5) Finally, we can plot a heatmap of the correlations (with Seaborn and Matplotlib) to better visualize the results:
#take the bottom triangle since it repeats itself
mask = np.zeros_like(corr_df)
mask[np.triu_indices_from(mask)] = True
#generate plot
seaborn.heatmap(corr_df, cmap='RdYlGn', vmax=1.0, vmin=-1.0 , mask = mask, linewidths=2.5)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
Here we can see that, as expected, the tech companies are generally pretty highly correlated (as indicated by dark green -- AMZN/GOOGL, FB/AAPL, etc). Conversely, Ford (F) and General Electric (GE) are either not correlated or negatively correlated with the rest of the group.
We send 3 questions each week to thousands of data scientists and analysts preparing for interviews or just keeping their skills sharp. You can sign up to receive the questions for free on our home page.