# Introduction to Linear Regression using Python

Python

This post will walk through a practice problem which analyzes NBA player's season stats. We've also provided the practice problem's Colab Notebook so you can follow along, just copy the notebook and the two Google sheets (season stats and player data) to your Google Drive.

This practice problem has a data set that contains NBA players and their individual season stats. This practice will take you through fitting a linear model using player stats to estimate win shares per 48 minutes.

Topics covered in this tutorial:

• Importing data from Google Sheets into Colab
• Basic data cleaning / analysis
• Splitting data into test and training sets
• Training a single variable regression model
• Training a multivariate regression model
• Manually calculating Adjusted R-squared and R-squared
• Calculating Root Mean Square Error
• Comparing results of 2 regression models

Importing and cleaning data

``````# Importing packages.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
import scipy.stats as ss
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set(color_codes = True)
%matplotlib inline

# Setting figure sizes
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14``````
``````# Code to connect Colab to Google Drive (save the spreadsheet provided into your drive)

auth.authenticate_user()

``````### Importing sheet

# Open our new sheet and read some data.
worksheet = gc.open('Seasons_Stats').sheet1
# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
# Convert to a DataFrame and render.
seasonPlayerStats = pd.DataFrame.from_records(rows)

seasonPlayerStats.columns = seasonPlayerStats.iloc
#drop the first row after we rename the header
seasonPlayerStats = seasonPlayerStats.reindex(seasonPlayerStats.index.drop(0))``````
``````### Importing sheet

# Open our new sheet and read some data.
worksheet = gc.open('player_data').sheet1
# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
# Convert to a DataFrame and render.
playerData = pd.DataFrame.from_records(rows)

#rename header so it's the first row
playerData.columns = playerData.iloc
#drop the first row after we rename the header
playerData = playerData.reindex(playerData.index.drop(0))``````
``````# Merging dataframes and dropping null values so we avoid errors
# joining
playerAllStats = pd.merge(seasonPlayerStats, playerData, how='left', on='Player')
# dropping null value columns to avoid errors
playerAllStats = playerAllStats[pd.notnull(playerAllStats['height'])]
playerAllStats = playerAllStats[pd.notnull(playerAllStats['Points'])]
playerAllStats = playerAllStats[pd.notnull(playerAllStats['Games'])]
``````# Viewing all columns that we have in the dataframe. I'll be referencing this figure out which fields I need to clean
playerAllStats.columns``````

Output:

``````
Index(['', 'Year', 'Player', 'Position', 'Age', 'Team', 'Games',
'GamesStarted', 'MinutesPlayed', 'PlayerEfficiencyRating',
'TrueShootingPerct', 'ThreePointAttemptRate', 'FreeThrowRate',
'OffensiveReboundPercentage', 'DefensiveReboundPercentage',
'TotalReboundPercentage', 'AssistPercentage', 'StealPercentage',
'BlockPercentage', 'TurnoverPercentage', 'UsagePercentage',
'OffesniveWinShares', 'DefensiveWinShares', 'WinShares',
'WinsSharesPer48Minutes', 'OffensiveBoxPlusMinus',
'DefensiveBoxPlusMinus', 'BoxPlusMinus', 'ValueOverReplacement',
'FieldGoals', 'FieldGoalAttempts', 'FieldGoalPercentage',
'ThreePointFieldGoals', 'ThreePointFieldGoalAttempts',
'ThreePointFieldGoalPercentage', 'TwoPointFieldGoals',
'TwoPointFieldGoalAttempts', 'TwoPointFieldGoalPercentage',
'EffectiveFieldGoalPercentage', 'FreeThrows', 'FreeThrowAttempts',
'FreeThrowPercentage', 'OffensiveRebounds', 'DefensiveRebounds',
'TotalRebounds', 'Assists', 'Steals', 'Blocks', 'Turnovers',
'PersonalFouls', 'Points', 'year_start', 'year_end', 'position',
'height', 'weight', 'birth_date', 'college', 'feet', 'inches',
'heightInches', 'PointsPerGame', 'TotalReboundsPerGame',
'AssistsPerGame'],
dtype='object', name=0)``````

Cleaning data even more

• Converting height to inches (so we can use it if we decide to)
• Converting some of the stats to a numeric value so we can do analysis
• Calculating points per game, total rebounds per game, and assists per game
``````playerAllStats['feet'], playerAllStats['inches'] = playerAllStats['height'].str.split('-', 1).str
playerAllStats['feet'], playerAllStats['inches'], playerAllStats['weight'] = pd.to_numeric(playerAllStats.feet), \
pd.to_numeric(playerAllStats.inches), pd.to_numeric(playerAllStats.weight)
playerAllStats['heightInches'] = playerAllStats['feet']*12 + playerAllStats['inches']

playerAllStats['Games'], playerAllStats['Points'] = pd.to_numeric(playerAllStats.Games), pd.to_numeric(playerAllStats.Points)
playerAllStats['MinutesPlayed'], playerAllStats['FieldGoalPercentage']= pd.to_numeric(playerAllStats.MinutesPlayed), pd.to_numeric(playerAllStats.FieldGoalPercentage)
playerAllStats['Games'], playerAllStats['TotalRebounds']= pd.to_numeric(playerAllStats.Games), pd.to_numeric(playerAllStats.TotalRebounds)
playerAllStats['Assists'], playerAllStats['WinsSharesPer48Minutes']= pd.to_numeric(playerAllStats.Assists), pd.to_numeric(playerAllStats.WinsSharesPer48Minutes)
playerAllStats['WinShares'], playerAllStats['Year'] = pd.to_numeric(playerAllStats.WinShares), pd.to_numeric(playerAllStats.Year)

playerAllStats['PointsPerGame'] = playerAllStats['Points']/ playerAllStats['Games']
playerAllStats['TotalReboundsPerGame'] = playerAllStats['TotalRebounds']/ playerAllStats['Games']
playerAllStats['AssistsPerGame'] = playerAllStats['Assists']/ playerAllStats['Games']
``````

Question: Can we create a linear regression model using points per game to estimate win shares per 48 minutes? 1 . Create a plot with the 2 statistics Before we create the model, let's plot the points per game vs win shares per 48 minutes. This will let us understand the data set and see if we need to remove outliers to improve model accuracy.

``sns.regplot(x="WinsSharesPer48Minutes", y = "PointsPerGame", data=playerAllStats)``

There's obviously some outliers and data that doesn't below. Let's see what kind of data we're working with.

``playerAllStats.describe()``

Output: There's a few things I see immediately that would make our data hard to plot.

• We have seasons from 11400...to minimize noise, we're only going to look at recent player's seasons. So I'm going to limit my analysis from 2014 to present.

• We have players who've played 1 game all season. We're also going to limit the number of games a player has played in a given season to be >70% of all games. A given season has 82 games.

• This isn't in the descriptive statistics above, but there are different positions in basketball. For the purpose of this analysis let's limit our data to include players who play Point Guard (PG) (or other positions along with PG). These are typically the players who score the most points, and it makes sense for us to group players who have even scoring potential to reduce skewness.

``````#filtering on players who have played since 2014 and have played at least 70% of the games in a given season (assuming 82 games per season)
playerAllStatsSince2014 = playerAllStats.loc[(playerAllStats['Year'] >= 2014) &(playerAllStats['Games'] >= 58)]
#filteirng on PG players
playerAllStatsSince2014PG = playerAllStatsSince2014[playerAllStatsSince2014['Position'].str.contains('PG')].reset_index(drop=True)
sns.regplot(x="WinsSharesPer48Minutes", y = "PointsPerGame", data=playerAllStatsSince2014PG)``````

Output: The plot looks much better between points per game vs win shares per 48 minutes. Our data is in a format we can use. Let's move on to creating a training and test data set.

*2. Splitting data into training and test set * We're going to split our data into dependent variables and an independent varaibles for both training and test sets using SKLearn's train_test_split function.

``````playerAllStatsSince2014PG_WinShare = playerAllStatsSince2014PG['WinsSharesPer48Minutes']
playerAllStatsSince2014PG_PPG = playerAllStatsSince2014PG['PointsPerGame']

#importing packages and splitting out data into training + testing sets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(playerAllStatsSince2014PG['PointsPerGame'], \
playerAllStatsSince2014PG['WinsSharesPer48Minutes'], test_size=0.20, random_state=42)``````

*3. Fitting the data to a linear regression and getting R-squared * Here we're creating a linear regression model called model_1. We're going to fit the training data to this model. After that we're going to get the R-squared value. R-squared is a value between 0 and 1 which is a statistical measure of how close the data are to the fitted regression line. 1 = perfect fit, 0 = no fit

``````model_1 = LinearRegression()
model_1.fit(train_x.reshape(-1, 1), train_y.reshape(-1, 1))
model_1.score(train_x.reshape(-1, 1), train_y.reshape(-1, 1))``````

Output:

``0.49632072484943907``

A r-squared of 0.496 tells us we have a moderately effective fit. Since we're also going to be creating a multi-variate model, let's also calculate Adjusted R-squared and Root Mean Square Error (RMSE) so we have more statistics to compare. The below code calculates R-squared and Adjusted R-squared.

``````X = train_x.values.reshape(-1, 1)
y = train_y.values.reshape(-1, 1)

# compute with formulas from the theory
yhat = model_1.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - np.divide(SS_Residual,SS_Total)
#r_squared = 1 - np.exp((float(SS_Residual))/SS_Total)

Output:

``[0.49632072] [0.49326812]``

*4. Estimate values for the test data * Last thing to do is to estimate Win Shares with our test data. We'll compare our estimate to the true values and calculate the Root Mean Square Error.

``````model_1_predict_values = model_1.predict(test_x.values.reshape(-1, 1))
model_1_predict = pd.DataFrame({'Points Per Game': test_x.values, 'Actual Win Share': test_y.values,'Predicted Win Share': model_1_predict_values.reshape(1, -1)})``````

The code below calculate Root Mean Square Error. The lower the Root Mean Square Error the closer our estimations are to the real values.

``````#Root Mean square error on test dataset
np.sqrt(np.mean(np.square(model_1_predict['Actual Win Share'] - \
model_1_predict['Predicted Win Share'])))``````

Output:

``0.0523034857831081``

Question: Can we create a linear regression model using points per game, total rebounds per game, and assists per game to estimate win shares per 48 minutes? We'll go through the same process as above (less the data exploration, which we do not need to do 2 times). We'll create the model and generate the statistics. Then we'll compare the single variable to the multivariable model.

``````playerAllStatsSince2014PG_Ys = playerAllStatsSince2014PG['WinsSharesPer48Minutes']
playerAllStatsSince2014PG_Xs = playerAllStatsSince2014PG[['PointsPerGame', 'TotalReboundsPerGame', 'AssistsPerGame']]

#splitting test and training data
train_x, test_x, train_y, test_y = train_test_split(playerAllStatsSince2014PG_Xs, playerAllStatsSince2014PG_Ys, test_size=0.20, random_state=42)

#fitting model
model_1 = LinearRegression()
model_1.fit(train_x, train_y)
model_1.score(train_x, train_y)
``````

Output:

``0.5060498809063979``

It looks like this model's R-square value is already higher that our single variable. Next we'll calculate the model's estimate for Win Shares.

``````model_1_predict_values = model_1.predict(test_x)
model_1_predict = pd.DataFrame({'Actual Win Share': test_y.values,'Predicted Win Share': model_1_predict_values.reshape(1, -1)})
#Root Mean square error on test dataset
np.sqrt(np.mean(np.square(model_1_predict['Actual Win Share'] - model_1_predict['Predicted Win Share'])))``````

Output:

``0.05174714479057624``
``````X = train_x.values
y = train_y.values

# compute with formulas from the theory
yhat = model_1.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - np.divide(SS_Residual,SS_Total)
#r_squared = 1 - np.exp((float(SS_Residual))/SS_Total)
``0.5060498809063977 0.49695877442001235``