First, we're going to import the packages that we'll be using throughout this notebook. Then we'll bring in the CSV from my desktop. You can get the raw data from UCI's Machine Learning Repository.
We're also using scikit-learn. For more information on installing scikit-learn and using the sklearn packages, visit this website.
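If any of these packages aren't installed yet, a typical install from a notebook cell looks something like the line below (this assumes you're using pip; swap in conda if that's your package manager):

# Installing the packages used in this notebook (run once, then restart the kernel if needed)
!pip install pandas numpy matplotlib seaborn scikit-learn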
# Importing required packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
%matplotlib inline

# Importing sklearn packages
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# import the data and view the first few rows in a dataframe
red_wine_df = pd.read_csv('/Users/crystalrood/desktop/winequality-red.csv')

# Let's preview the dataframe
red_wine_df.head()
About the dataset
Now that we have the data in a format that's usable for us within a Jupyter notebook, let's take a closer look at it. There are a lot of numbers... and that's great! It's not necessary that we understand the intricacies of all of the numbers and the meaning behind the measurements for the purpose of this exercise. If it makes you feel more comfortable, feel free to look at this link; Kaggle provides a great breakdown of each of the variables. However, it's important that we identify what will be inputs for our model and what will be the factor we're trying to determine. In this scenario, our goal is to determine whether a wine is "good" or "bad". We'll use the "quality" field to make that call, and the rest of the variables in the table will be inputs for our model.
Next step: Understanding the variable statistics
For us to be able to use this data confidently, we need to ensure the data we're using is clean and isn't missing values. Let's take a quick crack at checking both of these conditions.
# checking the number of records imported
len(red_wine_df.index)
1599
# running descriptive statistics across all the variables
red_wine_df.describe()
# checking to see if there are any null values
red_wine_df.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
# listing the unique values for the wine quality
red_wine_df['quality'].unique()
array([5, 6, 7, 4, 8, 3])
# taking a look at the quality ranges
sb.countplot(x='quality', data=red_wine_df)
# generating charts that compare all of the numeric variables against quality;
# although not strictly necessary, it's good to understand the data spread
df1 = red_wine_df.select_dtypes(include=[np.number])
for i, col in enumerate(df1.columns):
    plt.figure(i)
    sb.barplot(x='quality', y=col, data=df1)
Preprocessing the data before modeling
Now that we know our data is pretty clean, and we have a good idea of what it looks like, we're going to pre-process the data before plugging these variables into the model. What we need to do:
1. Split the "quality" column into "good" and "bad", and assign numeric values to each
2. Split out training and testing data
1. Splitting the quality column into "good" and "bad"
# splitting wine into good and bad groups: wines with a quality score between
# 2 and 6.5 are "bad" quality, and wines between 6.5 and 8 are "good"
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
red_wine_df['quality'] = pd.cut(red_wine_df['quality'], bins=bins, labels=group_names)
# however "bad" and "good" aren't good naming conventions for a model to read in, so we're going to # assign a numeric label for this value. LabelEncoder() will help us do this! # Assigning a label to our quality variable label_quality = LabelEncoder() # Now changing our dataframe to reflect our new label red_wine_df['quality'] = label_quality.fit_transform(red_wine_df['quality'])
# printing the head to ensure the transformation happened
red_wine_df.head()
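As a quick sanity check (not in the original walkthrough), we can confirm how LabelEncoder mapped the labels and how many wines landed in each class. Since only wines scoring above 6.5 count as "good", expect that class to be a clear minority:

# classes_ lists the original labels in encoded order, so classes_[0] is encoded as 0, classes_[1] as 1
print(label_quality.classes_)

# counting how many wines fall into each encoded class
red_wine_df['quality'].value_counts()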
2. Splitting out training data and testing data
First we'll separate the features from the labels, then we'll split the data into training and testing sets.
# extracting all model inputs from the data set
all_inputs = red_wine_df[['fixed acidity', 'volatile acidity', 'citric acid',
                          'residual sugar', 'chlorides', 'free sulfur dioxide',
                          'total sulfur dioxide', 'density', 'pH', 'sulphates',
                          'alcohol']].values

# extracting quality labels
all_labels = red_wine_df['quality'].values

# a quick test to see what the inputs look like
all_inputs[:2]
array([[ 7.4 , 0.7 , 0. , 1.9 , 0.076 , 11. , 34. , 0.9978, 3.51 , 0.56 , 9.4 ], [ 7.8 , 0.88 , 0. , 2.6 , 0.098 , 25. , 67. , 0.9968, 3.2 , 0.68 , 9.8 ]])
# Next we will apply standard scaling.
# Standard scaling normalizes each feature so that its distribution has a mean
# of 0 and a standard deviation of 1. This step is useful when we want to
# compare data that corresponds to different units.
sc = StandardScaler()
all_inputs = sc.fit_transform(all_inputs)
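As a quick check (not part of the original notebook), we can verify that each scaled column now has roughly zero mean and unit standard deviation:

# every column should now have a mean of ~0 and a standard deviation of ~1
print(all_inputs.mean(axis=0).round(2))
print(all_inputs.std(axis=0).round(2))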
# The train_test_split function takes our inputs and labels and splits them
# into training and testing subsets for us.
# test_size = proportion of data kept aside for testing, in this case 1/4 of the data
# random_state = the seed used by the random number generator
(training_inputs,
 testing_inputs,
 training_classes,
 testing_classes) = train_test_split(all_inputs, all_labels, test_size=0.25, random_state=1)
Modeling!
We're going to start with a decision tree classifier; let's see how it works for us!
# trying a decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
decision_tree_classifier = DecisionTreeClassifier()

# Train the classifier on the training set
decision_tree_classifier.fit(training_inputs, training_classes)

# Validate the classifier on the testing set using classification accuracy
decision_tree_classifier.score(testing_inputs, testing_classes)
0.855
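Before moving on, it's worth peeking beyond the raw accuracy number. Since "good" wines are the minority class, a model that mostly predicts "bad" can still score fairly well. The confusion_matrix and classification_report functions we imported earlier give a per-class view (a quick extra check, not in the original walkthrough):

# looking beyond accuracy: per-class precision/recall and the confusion matrix
dt_predictions = decision_tree_classifier.predict(testing_inputs)
print(confusion_matrix(testing_classes, dt_predictions))
print(classification_report(testing_classes, dt_predictions))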
The decision tree classifier's accuracy wasn't terrible. We're going to experiment with a few different classifiers, and once we pick one we'll dig into optimizing it.
# selecting the models and the model names in arrays
models = [LogisticRegression(),
          LinearSVC(),
          SVC(kernel='rbf'),
          KNeighborsClassifier(),
          RandomForestClassifier(),
          DecisionTreeClassifier(),
          GradientBoostingClassifier(),
          GaussianNB()]
model_names = ['Logistic Regression',
               'Linear SVM',
               'rbf SVM',
               'K-Nearest Neighbors',
               'Random Forest Classifier',
               'Decision Tree',
               'Gradient Boosting Classifier',
               'Gaussian NB']

# creating an accuracy list and a dictionary to join the accuracy of the models
# with the names of the models so we can read the results more easily
acc = []
m = {}

# next we're going to iterate through the models and get the accuracy for each
for model in range(len(models)):
    clf = models[model]
    clf.fit(training_inputs, training_classes)
    pred = clf.predict(testing_inputs)
    acc.append(accuracy_score(testing_classes, pred))
m = {'Algorithm': model_names, 'Accuracy': acc}

# putting the dictionary into a data frame and listing out the results
acc_frame = pd.DataFrame(m)
acc_frame
Based on this single run, it looks like the Random Forest performed the best. Let's try to optimize it a bit more.
We're going to use a grid search to test different parameter combinations within the RandomForestClassifier, cross-validate each one, and determine which combination provides the best performance.
random_forest_classifier = RandomForestClassifier()

# setting up the parameters for our grid search
# You can check out what each of these parameters means on the scikit-learn website:
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
parameter_grid = {'n_estimators': [10, 25, 50, 100, 200],
                  'criterion': ['gini', 'entropy'],
                  'max_features': [1, 2, 3, 4]}

# The Stratified K-Folds cross-validator lets us mix up the given test/train data per run;
# with k folds, the test sets don't overlap across the shuffles, which ultimately gives us
# "more" test data for our model.
cross_validation = StratifiedKFold(n_splits=10)

# running the grid search with our random_forest_classifier, the parameter grid
# defined above, and our cross-validation method
grid_search = GridSearchCV(random_forest_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)

# using the grid search defined above, we're going to fit it on our data set
grid_search.fit(all_inputs, all_labels)

# printing the best score, parameters, and estimator for our Random Forest classifier
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
grid_search.best_estimator_
Best score: 0.8830519074421513
Best parameters: {'criterion': 'entropy', 'max_features': 2, 'n_estimators': 50}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy', max_depth=None, max_features=2, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
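If you want to see how every parameter combination performed rather than just the winner, the grid search keeps the full results in cv_results_ (shown here as an optional extra, not part of the original walkthrough):

# turning the grid search results into a dataframe, sorted by mean test score
cv_results_df = pd.DataFrame(grid_search.cv_results_)
cv_results_df[['params', 'mean_test_score', 'std_test_score']].sort_values(
    'mean_test_score', ascending=False).head()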
# Now we can take the best classifier from the grid search and use it as our classifier
random_forest_classifier = grid_search.best_estimator_

rf_df = pd.DataFrame({'accuracy': cross_val_score(random_forest_classifier, all_inputs, all_labels, cv=10),
                      'classifier': ['Random Forest'] * 10})
rf_df.mean(numeric_only=True)
accuracy 0.874939 dtype: float64
# plotting our accuracy results!
sb.boxplot(x='classifier', y='accuracy', data=rf_df)
sb.stripplot(x='classifier', y='accuracy', data=rf_df, jitter=True, color='black')
Our classifier, on average, has an accuracy of about 87.5%, not too bad! If you want more material on this data set, check out Kaggle for additional examples!
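To close the loop, here's a sketch of how the tuned classifier could score a brand-new wine. The feature values below are made up purely for illustration; the key point is to pass the sample through the same StandardScaler that the training data went through:

# hypothetical measurements for a new wine, in the same column order as all_inputs
new_wine = [[7.5, 0.6, 0.1, 2.0, 0.08, 15.0, 40.0, 0.9970, 3.3, 0.6, 10.0]]

# scale with the already-fitted scaler, then predict (0 = bad, 1 = good per the LabelEncoder)
new_wine_scaled = sc.transform(new_wine)
print(random_forest_classifier.predict(new_wine_scaled))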