Navigate back to the homepage

Wine Quality Classification

Jayanth Boddu
August 7th, 2020 · 2 min read

Photo by EVGENIY KONEV on Unsplash

Data Set

This Data set contains the information related to red wine , Various factors affecting the quality. This data set was prepossessed and downloaded from the UCI Machine Learning Repository. This data set was simple, cleaned, practice data set for classification modelling. Source of this Dataset: https://archive.ics.uci.edu/ml/datasets/wine+quality

Attribute Information:
Input variables (based on physicochemical tests):

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol
    Output variable (based on sensory data):
  12. quality (‘good’ and ‘bad’ based on score >5 and <5)

Analysis Approach & Conclusions

This analysis focuses on finding attributes that significantly affect wine quality classification and training a predictive model to classify wine quality into good and bad based on attributes. Analysis is pivoted on the variable quality(target variable). Exploratory data analysis steps like removing null values, observing summary statistics, visualizing the variables, removing oultiers, checking for correlations are carried out.

Following significant correlations are observed.

  • Fixed acidity vs pH : -0.69
  • Fixed acidity vs density : 0.69
  • fixed acidity vs citric acid : 0.67
  • Volatile acidity vs citric acid : -0.53
  • citric acid vs pH : -0.54
  • density vs alcohol : -0.51

A 70-30 split is done to divide dataset into test and train sets.
10 variables are selected using automated RFE. Further, manual selection is carried out using p-value method. Models are build on train data using statsmodels.api package. Final Model is build on the following variables.
citric acid,fixed acidity,volatile acidity,alcohol,sulphates,total sulfur dioxide
Variance inflation factor is calculated for all final selection of variables. VIF < 5. No significant Multicollinearity observed.

ROC, Precision-Recall / Sensitivity - Specificity curves have been plotted. The optimum threshold for classification seems to be 0.5

Model metrics on train data at classification threshold of 0.5 :

  • Accuracy : 0.752
  • Misclassification Rate / Error Rate : 0.248
  • Sensitivity / True Positive Rate / Recall : 0.755
  • Specificity / True Negative Rate : 0.75
  • False Positive Rate : 0.25
  • Precision / Positive Predictive Value : 0.777
  • Prevalance : 0.535
  • Negative Predictive Value 0.726
  • Likelihood Ratio : Sensitivity / 1-Specificity : 3.02
  • F1-score : 0.766

Model metrics on test data at classification threshold of 0.5 :

  • Accuracy : 0.746
  • Misclassification Rate / Error Rate : 0.254
  • Sensitivity / True Positive Rate / Recall : 0.797
  • Specificity / True Negative Rate : 0.688
  • False Positive Rate : 0.312
  • Precision / Positive Predictive Value : 0.745
  • Prevalance : 0.533
  • Negative Predictive Value 0.748
  • Likelihood Ratio : Sensitivity / 1-Specificity : 2.554
  • F1-score : 0.77

Analysis

1import pandas as pd, numpy as np
2import matplotlib.pyplot as plt
3import seaborn as sns

Importing Data

1data = pd.read_csv('./wine_quality_classification.csv')
1print(data.head())
1fixed acidity volatile acidity citric acid residual sugar chlorides \
20 7.4 0.70 0.00 1.9 0.076
31 7.8 0.88 0.00 2.6 0.098
42 7.8 0.76 0.04 2.3 0.092
53 11.2 0.28 0.56 1.9 0.075
64 7.4 0.70 0.00 1.9 0.076
7
8 free sulfur dioxide total sulfur dioxide density pH sulphates \
90 11.0 34.0 0.9978 3.51 0.56
101 25.0 67.0 0.9968 3.20 0.68
112 15.0 54.0 0.9970 3.26 0.65
123 17.0 60.0 0.9980 3.16 0.58
134 11.0 34.0 0.9978 3.51 0.56
14
15 alcohol quality
160 9.4 bad
171 9.8 bad
182 9.8 bad
193 9.8 good
204 9.4 bad
1data.info()
1<class 'pandas.core.frame.DataFrame'>
2RangeIndex: 1599 entries, 0 to 1598
3Data columns (total 12 columns):
4 # Column Non-Null Count Dtype
5--- ------ -------------- -----
6 0 fixed acidity 1599 non-null float64
7 1 volatile acidity 1599 non-null float64
8 2 citric acid 1599 non-null float64
9 3 residual sugar 1599 non-null float64
10 4 chlorides 1599 non-null float64
11 5 free sulfur dioxide 1599 non-null float64
12 6 total sulfur dioxide 1599 non-null float64
13 7 density 1599 non-null float64
14 8 pH 1599 non-null float64
15 9 sulphates 1599 non-null float64
16 10 alcohol 1599 non-null float64
17 11 quality 1599 non-null object
18dtypes: float64(11), object(1)
19memory usage: 150.0+ KB
1data.isnull().sum()
1fixed acidity 0
2volatile acidity 0
3citric acid 0
4residual sugar 0
5chlorides 0
6free sulfur dioxide 0
7total sulfur dioxide 0
8density 0
9pH 0
10sulphates 0
11alcohol 0
12quality 0
13dtype: int64

quality is our target variable. It has two levels - good & bad. No null or missing values. All the other variables are continuous variables.

Replacing quality levels with 0,1

1data['quality'] = data['quality'].replace({'good' : 1, 'bad' : 0})

Summary Statistics

1print(data.describe())
1fixed acidity volatile acidity citric acid residual sugar \
2count 1599.000000 1599.000000 1599.000000 1599.000000
3mean 8.319637 0.527821 0.270976 2.538806
4std 1.741096 0.179060 0.194801 1.409928
5min 4.600000 0.120000 0.000000 0.900000
625% 7.100000 0.390000 0.090000 1.900000
750% 7.900000 0.520000 0.260000 2.200000
875% 9.200000 0.640000 0.420000 2.600000
9max 15.900000 1.580000 1.000000 15.500000
10
11 chlorides free sulfur dioxide total sulfur dioxide density \
12count 1599.000000 1599.000000 1599.000000 1599.000000
13mean 0.087467 15.874922 46.467792 0.996747
14std 0.047065 10.460157 32.895324 0.001887
15min 0.012000 1.000000 6.000000 0.990070
1625% 0.070000 7.000000 22.000000 0.995600
1750% 0.079000 14.000000 38.000000 0.996750
1875% 0.090000 21.000000 62.000000 0.997835
19max 0.611000 72.000000 289.000000 1.003690
20
21 pH sulphates alcohol quality
22count 1599.000000 1599.000000 1599.000000 1599.000000
23mean 3.311113 0.658149 10.422983 0.534709
24std 0.154386 0.169507 1.065668 0.498950
25min 2.740000 0.330000 8.400000 0.000000
2625% 3.210000 0.550000 9.500000 0.000000
2750% 3.310000 0.620000 10.200000 1.000000
2875% 3.400000 0.730000 11.100000 1.000000
29max 4.010000 2.000000 14.900000 1.000000

Checking for Outliers

1print(data.quantile(np.linspace(0.90,1,12)))
1fixed acidity volatile acidity citric acid residual sugar \
20.900000 10.700000 0.745000 0.522000 3.600000
30.909091 10.872727 0.760000 0.537273 3.800000
40.918182 11.100000 0.775000 0.550000 4.000000
50.927273 11.300000 0.785000 0.560000 4.200000
60.936364 11.500000 0.810000 0.580000 4.400000
70.945455 11.600000 0.834182 0.590000 4.783636
80.954545 11.900000 0.851818 0.630000 5.500000
90.963636 12.089091 0.880000 0.640000 5.789091
100.972727 12.500000 0.910000 0.660000 6.141818
110.981818 12.794545 0.965000 0.680000 6.983636
120.990909 13.300000 1.022364 0.714727 8.694545
131.000000 15.900000 1.580000 1.000000 15.500000
14
15 chlorides free sulfur dioxide total sulfur dioxide density \
160.900000 0.109000 31.000000 93.200000 0.999140
170.909091 0.111000 31.000000 96.000000 0.999300
180.918182 0.114000 32.000000 99.000000 0.999400
190.927273 0.117000 33.000000 102.781818 0.999478
200.936364 0.120000 34.000000 106.000000 0.999700
210.945455 0.123000 35.000000 110.000000 0.999800
220.954545 0.136364 36.000000 115.000000 1.000000
230.963636 0.164564 38.000000 121.000000 1.000200
240.972727 0.187673 40.000000 129.000000 1.000400
250.981818 0.234727 42.945455 136.000000 1.000989
260.990909 0.368473 51.000000 145.945455 1.001942
271.000000 0.611000 72.000000 289.000000 1.003690
28
29 pH sulphates alcohol quality
300.900000 3.510000 0.850000 12.0 1.0
310.909091 3.520000 0.860000 12.0 1.0
320.918182 3.530000 0.870000 12.1 1.0
330.927273 3.540000 0.887818 12.2 1.0
340.936364 3.543091 0.903091 12.4 1.0
350.945455 3.560000 0.930000 12.5 1.0
360.954545 3.573636 0.953636 12.5 1.0
370.963636 3.590000 0.998909 12.7 1.0
380.972727 3.610000 1.060000 12.8 1.0
390.981818 3.660000 1.140000 12.9 1.0
400.990909 3.710000 1.280000 13.4 1.0
411.000000 4.010000 2.000000 14.9 1.0
  • There are outlier in fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, pH, sulphates, alcohol

Visualizing Independent Variables

1x_vars = data.columns[data.columns != 'quality']
2fig,ax = plt.subplots(len(x_vars))
3fig.set_figheight(24)
4fig.set_figwidth(12)
5for num,i in enumerate(x_vars) :
6 ax[num].set_title(i)
7 ax[num].set_xlabel('')
8 sns.boxplot(data[i],ax=ax[num])

svg

1# removing outliers :
2x_vars = data.columns[data.columns != 'quality']
3for i in x_vars :
4 q1 = data[i].quantile(0.25)
5 q3 = data[i].quantile(0.75)
6 upper_extreme = data[i].quantile(0.75) + 1.5*(q3-q1) # q3-q1 is IQR
7 lower_extreme = data[i].quantile(0.75) - 1.5*(q3-q1)
8 mask = (data[i] > lower_extreme) & (data[i] < upper_extreme) # sans outliers
9 outliers = data[mask].index
10 data.drop(index=outliers)

Test Train Split

1from sklearn.model_selection import train_test_split
2y = data.pop('quality')
3X = data
1X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=100)

Scaling Continuous Variables

1# In our case, all the independent variables are continuous
2from sklearn.preprocessing import StandardScaler
3scaler = StandardScaler()
4X_train[X_train.columns] = scaler.fit_transform(X_train[X_train.columns])
5
6# Scaling test set for later use
7X_test[X_train.columns] = scaler.transform(X_test[X_train.columns])

Correlations

1plt.figure(figsize=[20,10])
2sns.heatmap(X_train.corr(),annot=True)
3plt.title('Visualizing Correlations')
4plt.show()

svg

High Correlations :

  • Fixed acidity vs pH : -0.69
  • Fixed acidity vs density : 0.69
  • fixed acidity vs citric acid : 0.67
  • Volatile acidity vs citric acid : -0.53
  • citric acid vs pH : -0.54
  • density vs alcohol : -0.51

Model Building

1import statsmodels.api as sm
1# Logistic Regression Model
2logm1 = sm.GLM(y_train, sm.add_constant(X_train),family=sm.families.Binomial())
3logm1.fit().summary()
Generalized Linear Model Regression Results
Dep. Variable: quality No. Observations: 1119
Model: GLM Df Residuals: 1107
Model Family: Binomial Df Model: 11
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -570.86
Date: Fri, 07 Aug 2020 Deviance: 1141.7
Time: 18:45:03 Pearson chi2: 1.08e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.2632 0.076 3.478 0.001 0.115 0.411
fixed acidity 0.3550 0.204 1.742 0.082 -0.045 0.755
volatile acidity -0.5883 0.103 -5.706 0.000 -0.790 -0.386
citric acid -0.3117 0.132 -2.356 0.018 -0.571 -0.052
residual sugar 0.2039 0.093 2.185 0.029 0.021 0.387
chlorides -0.1757 0.091 -1.931 0.054 -0.354 0.003
free sulfur dioxide 0.1652 0.107 1.546 0.122 -0.044 0.375
total sulfur dioxide -0.5286 0.115 -4.584 0.000 -0.755 -0.303
density -0.2451 0.186 -1.320 0.187 -0.609 0.119
pH -0.0311 0.133 -0.233 0.815 -0.292 0.230
sulphates 0.4795 0.093 5.143 0.000 0.297 0.662
alcohol 0.9432 0.134 7.014 0.000 0.680 1.207

Feature Selection using RFE

1from sklearn.linear_model import LogisticRegression
2logReg = LogisticRegression()
1from sklearn.feature_selection import RFE
2rfe = RFE(logReg,10)
3rfe = rfe.fit(X_train,y_train)
1## RFE results
2rfe_results = list(zip(X_train.columns,rfe.support_,rfe.ranking_))
3sorted(rfe_results,key=lambda x : (x[2]))
1[('fixed acidity', True, 1),
2 ('volatile acidity', True, 1),
3 ('citric acid', True, 1),
4 ('residual sugar', True, 1),
5 ('chlorides', True, 1),
6 ('free sulfur dioxide', True, 1),
7 ('total sulfur dioxide', True, 1),
8 ('density', True, 1),
9 ('sulphates', True, 1),
10 ('alcohol', True, 1),
11 ('pH', False, 2)]
  • RFE results show that pH can be dropped.
1X_train.drop(columns=['pH'],inplace=True)
2X_test.drop(columns=['pH'],inplace=True)

Assessing Model

Model 1

1X_train.columns = X_train.columns[X_train.columns !='pH']
2logm1 = sm.GLM(y_train, sm.add_constant(X_train),family=sm.families.Binomial())
3logm1.fit().summary()
Generalized Linear Model Regression Results
Dep. Variable: quality No. Observations: 1119
Model: GLM Df Residuals: 1108
Model Family: Binomial Df Model: 10
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -570.89
Date: Fri, 07 Aug 2020 Deviance: 1141.8
Time: 18:45:03 Pearson chi2: 1.08e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.2631 0.076 3.478 0.001 0.115 0.411
fixed acidity 0.3894 0.141 2.762 0.006 0.113 0.666
volatile acidity -0.5904 0.103 -5.750 0.000 -0.792 -0.389
citric acid -0.3128 0.132 -2.367 0.018 -0.572 -0.054
residual sugar 0.2110 0.088 2.389 0.017 0.038 0.384
chlorides -0.1705 0.088 -1.933 0.053 -0.343 0.002
free sulfur dioxide 0.1609 0.105 1.528 0.127 -0.045 0.367
total sulfur dioxide -0.5228 0.113 -4.645 0.000 -0.743 -0.302
density -0.2686 0.156 -1.722 0.085 -0.574 0.037
sulphates 0.4816 0.093 5.196 0.000 0.300 0.663
alcohol 0.9287 0.119 7.803 0.000 0.695 1.162

Model 2

  • Dropping free sulfur dioxide because of high p-value
1X = X_train.loc[:,X_train.columns != 'free sulfur dioxide']
2logm2 = sm.GLM(y_train, sm.add_constant(X),family=sm.families.Binomial())
3logm2.fit().summary()
Generalized Linear Model Regression Results
Dep. Variable: quality No. Observations: 1119
Model: GLM Df Residuals: 1109
Model Family: Binomial Df Model: 9
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -572.06
Date: Fri, 07 Aug 2020 Deviance: 1144.1
Time: 18:45:03 Pearson chi2: 1.08e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.2687 0.075 3.561 0.000 0.121 0.417
fixed acidity 0.4006 0.141 2.845 0.004 0.125 0.677
volatile acidity -0.6186 0.102 -6.089 0.000 -0.818 -0.420
citric acid -0.3548 0.130 -2.738 0.006 -0.609 -0.101
residual sugar 0.2323 0.088 2.629 0.009 0.059 0.406
chlorides -0.1646 0.088 -1.868 0.062 -0.337 0.008
total sulfur dioxide -0.4099 0.083 -4.911 0.000 -0.574 -0.246
density -0.2762 0.156 -1.771 0.077 -0.582 0.029
sulphates 0.4918 0.093 5.279 0.000 0.309 0.674
alcohol 0.9411 0.119 7.892 0.000 0.707 1.175

Model 3

  • dropping free sulfur dioxide because of high p-value
1X = X.loc[:,X.columns != 'free sulfur dioxide']
2logm3 = sm.GLM(y_train, sm.add_constant(X),family=sm.families.Binomial())
3logm3.fit().summary()
Generalized Linear Model Regression Results
Dep. Variable: quality No. Observations: 1119
Model: GLM Df Residuals: 1109
Model Family: Binomial Df Model: 9
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -572.06
Date: Fri, 07 Aug 2020 Deviance: 1144.1
Time: 18:45:03 Pearson chi2: 1.08e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.2687 0.075 3.561 0.000 0.121 0.417
fixed acidity 0.4006 0.141 2.845 0.004 0.125 0.677
volatile acidity -0.6186 0.102 -6.089 0.000 -0.818 -0.420
citric acid -0.3548 0.130 -2.738 0.006 -0.609 -0.101
residual sugar 0.2323 0.088 2.629 0.009 0.059 0.406
chlorides -0.1646 0.088 -1.868 0.062 -0.337 0.008
total sulfur dioxide -0.4099 0.083 -4.911 0.000 -0.574 -0.246
density -0.2762 0.156 -1.771 0.077 -0.582 0.029
sulphates 0.4918 0.093 5.279 0.000 0.309 0.674
alcohol 0.9411 0.119 7.892 0.000 0.707 1.175

Model 4

  • Dropping density because of high p-value
1X = X.loc[:,X.columns != 'density']
2logm4 = sm.GLM(y_train, sm.add_constant(X),family=sm.families.Binomial())
3logm4.fit().summary()
Generalized Linear Model Regression Results
Dep. Variable: quality No. Observations: 1119
Model: GLM Df Residuals: 1110
Model Family: Binomial Df Model: 8
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -573.64
Date: Fri, 07 Aug 2020 Deviance: 1147.3
Time: 18:45:03 Pearson chi2: 1.10e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.2590 0.075 3.451 0.001 0.112 0.406
fixed acidity 0.2393 0.107 2.234 0.025 0.029 0.449
volatile acidity -0.6496 0.101 -6.426 0.000 -0.848 -0.451
citric acid -0.3570 0.130 -2.747 0.006 -0.612 -0.102
residual sugar 0.1478 0.074 1.998 0.046 0.003 0.293
chlorides -0.1540 0.088 -1.748 0.080 -0.327 0.019
total sulfur dioxide -0.4002 0.083 -4.822 0.000 -0.563 -0.238
sulphates 0.4584 0.090 5.090 0.000 0.282 0.635
alcohol 1.0615 0.099 10.722 0.000 0.867 1.255

Model 5

  • dropping chlorides because of high p-value
1X = X.loc[:,X.columns != 'chlorides']
2logm5 = sm.GLM(y_train, sm.add_constant(X),family=sm.families.Binomial())
3logm5.fit().summary()
Generalized Linear Model Regression Results
Dep. Variable: quality No. Observations: 1119
Model: GLM Df Residuals: 1111
Model Family: Binomial Df Model: 7
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -575.22
Date: Fri, 07 Aug 2020 Deviance: 1150.4
Time: 18:45:03 Pearson chi2: 1.11e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.2577 0.075 3.438 0.001 0.111 0.405
fixed acidity 0.2787 0.105 2.659 0.008 0.073 0.484
volatile acidity -0.6975 0.098 -7.105 0.000 -0.890 -0.505
citric acid -0.4242 0.125 -3.401 0.001 -0.669 -0.180
residual sugar 0.1398 0.073 1.914 0.056 -0.003 0.283
total sulfur dioxide -0.3884 0.082 -4.712 0.000 -0.550 -0.227
sulphates 0.3856 0.078 4.946 0.000 0.233 0.538
alcohol 1.1119 0.096 11.633 0.000 0.925 1.299

Model 6

— Dropping residual sugar because of high p-value

1X = X.loc[:,X.columns != 'residual sugar']
2logm6 = sm.GLM(y_train, sm.add_constant(X),family=sm.families.Binomial())
3logm6.fit().summary()
Generalized Linear Model Regression Results
Dep. Variable: quality No. Observations: 1119
Model: GLM Df Residuals: 1112
Model Family: Binomial Df Model: 6
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -577.02
Date: Fri, 07 Aug 2020 Deviance: 1154.0
Time: 18:45:03 Pearson chi2: 1.10e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.2593 0.075 3.465 0.001 0.113 0.406
fixed acidity 0.2883 0.105 2.753 0.006 0.083 0.494
volatile acidity -0.6874 0.097 -7.064 0.000 -0.878 -0.497
citric acid -0.4051 0.124 -3.271 0.001 -0.648 -0.162
total sulfur dioxide -0.3479 0.079 -4.402 0.000 -0.503 -0.193
sulphates 0.3769 0.078 4.846 0.000 0.224 0.529
alcohol 1.1186 0.095 11.716 0.000 0.931 1.306
  • All the p-values are very low. So the variable which remain have statistically significant relationships.

Checking Multi-Collinearity

1from statsmodels.stats.outliers_influence import variance_inflation_factor
2def vif(X) :
3 df = sm.add_constant(X)
4 vif = [variance_inflation_factor(df.values,i) for i in range(df.shape[1])]
5 vif_frame = pd.DataFrame({'vif' : vif[0:]},index = df.columns).reset_index()
6 print(vif_frame.sort_values(by='vif',ascending=False))
1vif(X)
1index vif
23 citric acid 2.735105
31 fixed acidity 2.113091
42 volatile acidity 1.498180
56 alcohol 1.163675
65 sulphates 1.150064
74 total sulfur dioxide 1.121176
80 const 1.000000
  • As we can see, there’s no multi collinearity since VIF < 5

Final Model

1print('Selected columns :' , X.columns)
1Selected columns : Index(['fixed acidity', 'volatile acidity', 'citric acid',
2 'total sulfur dioxide', 'sulphates', 'alcohol'],
3 dtype='object')
1logm_final = sm.GLM(y_train, sm.add_constant(X_train[X.columns]),family=sm.families.Binomial())
2res = logm_final.fit()
3res.summary()
Generalized Linear Model Regression Results
Dep. Variable: quality No. Observations: 1119
Model: GLM Df Residuals: 1112
Model Family: Binomial Df Model: 6
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -577.02
Date: Fri, 07 Aug 2020 Deviance: 1154.0
Time: 18:45:04 Pearson chi2: 1.10e+03
No. Iterations: 5
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 0.2593 0.075 3.465 0.001 0.113 0.406
fixed acidity 0.2883 0.105 2.753 0.006 0.083 0.494
volatile acidity -0.6874 0.097 -7.064 0.000 -0.878 -0.497
citric acid -0.4051 0.124 -3.271 0.001 -0.648 -0.162
total sulfur dioxide -0.3479 0.079 -4.402 0.000 -0.503 -0.193
sulphates 0.3769 0.078 4.846 0.000 0.224 0.529
alcohol 1.1186 0.095 11.716 0.000 0.931 1.306

Making Predictions on Train Set

1selected_vars = X.columns
2y_train_pred = res.predict(sm.add_constant(X_train[X.columns]))
1print(y_train_pred.head())
1858 0.855444
2654 0.255470
3721 0.172042
4176 0.388875
5692 0.379338
6dtype: float64

Wine Quality vs Predicted Probability

1predictions = pd.DataFrame({'Quality' : y_train.values,'class_probability' : y_train_pred.values.reshape(-1)}, index=X_train.index)
2print(predictions.head())
1Quality class_probability
2858 1 0.855444
3654 0 0.255470
4721 0 0.172042
5176 0 0.388875
6692 0 0.379338

Classification Threshold

  • Let us assume that any probability < 0.5 is Bad and >0.5 is Good
1predictions['Predicted_Quality'] = predictions['class_probability'].apply(lambda x : 1 if x > 0.5 else 0)
2print(predictions.head())
1Quality class_probability Predicted_Quality
2858 1 0.855444 1
3654 0 0.255470 0
4721 0 0.172042 0
5176 0 0.388875 0
6692 0 0.379338 0

Simple Metrics

1from sklearn import metrics

Confusion Matrix

1confusion = metrics.confusion_matrix(predictions['Quality'],predictions['Predicted_Quality'])
2print(confusion)
1[[390 130]
2 [147 452]]
  • The above result could be interpreted in the following manner
  • a[i,j] is the no of times class j was predicted when actual was class i
  • So TN : 390, FP : 130, FN : 147, TP : 452
Predicted >01
Actual
0TN = 390FP =130
1FN =147TP = 452

0 : bad, 1 : good

1# Accuracy of the model
2print(metrics.accuracy_score(predictions['Quality'],predictions['Predicted_Quality']))
10.7524575513851653

Metrics beyond Simple Accuracy

1TP = confusion[1,1]
2TN = confusion[0,0]
3FP = confusion[0,1]
4FN = confusion[1,0]
1#### Metrics
2import math
3def model_metrics(TP,TN,FP,FN) :
4 print('Accuracy :' , round((TP + TN)/float(TP+TN+FP+FN),3))
5 print('Misclassification Rate / Error Rate :', round((FP + FN)/float(TP+TN+FP+FN),3))
6 print('Sensitivity / True Positive Rate / Recall :', round(TP/float(FN + TP),3))
7 sensitivity = round(TP/float(FN + TP),3)
8 print('Specificity / True Negative Rate : ', round(TN/float(TN + FP),3))
9 specificity = round(TN/float(TN + FP),3)
10 print('False Positive Rate :',round(FP/float(TN + FP),3))
11 print('Precision / Positive Predictive Value :', round(TP/float(TP + FP),3))
12 precision = round(TP/float(TP + FP),3)
13 print('Prevalance :',round((FN + TP)/float(TP+TN+FP+FN),3))
14 print('Negative Predictive Value', round(TN/float(TN + FN),3))
15 print('Likelihood Ratio : Sensitivity / 1-Specificity :', round(sensitivity/float(1-specificity) ,3))
16 print('F1-score :', round(2*precision*sensitivity/(precision + sensitivity),3))
1model_metrics(TP,TN,FP,FN)
1Accuracy : 0.752
2Misclassification Rate / Error Rate : 0.248
3Sensitivity / True Positive Rate / Recall : 0.755
4Specificity / True Negative Rate : 0.75
5False Positive Rate : 0.25
6Precision / Positive Predictive Value : 0.777
7Prevalance : 0.535
8Negative Predictive Value 0.726
9Likelihood Ratio : Sensitivity / 1-Specificity : 3.02
10F1-score : 0.766

ROC curve

1print(predictions.head())
1Quality class_probability Predicted_Quality
2858 1 0.855444 1
3654 0 0.255470 0
4721 0 0.172042 0
5176 0 0.388875 0
6692 0 0.379338 0
1# generating predictions for cutoffs between 0 and 1
2cutoffs = pd.DataFrame()
3for i in np.arange(0,1,0.1) :
4 cutoffs[i] = predictions['class_probability'].map(lambda x : 1 if x > i else 0)
1tpr = []
2fpr = []
3for column in cutoffs.columns :
4 confusion = metrics.confusion_matrix(predictions['Quality'],cutoffs[column])
5 TP = confusion[1,1] # true positive
6 TN = confusion[0,0] # true negatives
7 FP = confusion[0,1] # false positives
8 FN = confusion[1,0] # false negatives
9 tpr.append(TP/float(TP + FN))
10 fpr.append(FP/float(FP + TN))
11plt.title('ROC curve')
12plt.xlabel('False Positive Rate')
13plt.ylabel('True Positive Rate')
14sns.scatterplot(fpr,tpr);

svg

Optimum Cut Off

1sensitivity = []
2specificity = []
3accuracy = []
4coffs = []
5for column in cutoffs.columns :
6 confusion = metrics.confusion_matrix(predictions['Quality'],cutoffs[column])
7 TP = confusion[1,1] # true positive
8 TN = confusion[0,0] # true negatives
9 FP = confusion[0,1] # false positives
10 FN = confusion[1,0] # false negatives
11 sensitivity.append(TP/float(TP + FN))
12 specificity.append(1 - FP/float(FP + TN))
13 accuracy.append((TP + TN)/(TP + TN + FP + FN))
14fig,ax = plt.subplots()
15ax.set_xlabel('Cutoffs')
16ax.plot(cutoffs.columns,sensitivity,label='sensitivity')
17ax.plot(cutoffs.columns,specificity,label='specificity')
18ax.plot(cutoffs.columns,accuracy,label='accuracy')
19ax.legend(('sensitivity','specificity','accuracy'))
20plt.show()

svg

  • From the above plot, 0.5 seems like the optimum Threshold for classification
1predictions['Final_Predictions'] = predictions['Predicted_Quality'].map(lambda x : 1 if x > 0.5 else 0)
1confusion_final = metrics.confusion_matrix(predictions['Quality'],predictions['Final_Predictions'])
2TP = confusion_final[1,1]
3TN = confusion_final[0,0]
4FP = confusion_final[0,1]
5FN = confusion_final[1,0]
1#### Metrics
2model_metrics(TP,TN,FP,FN)
1Accuracy : 0.752
2Misclassification Rate / Error Rate : 0.248
3Sensitivity / True Positive Rate / Recall : 0.755
4Specificity / True Negative Rate : 0.75
5False Positive Rate : 0.25
6Precision / Positive Predictive Value : 0.777
7Prevalance : 0.535
8Negative Predictive Value 0.726
9Likelihood Ratio : Sensitivity / 1-Specificity : 3.02
10F1-score : 0.766

Precision and Recall

1precision = [] # positive predictive power - TP / TP + FP
2recall = [] ## same as sensitivity
3
4for column in cutoffs.columns :
5 confusion = metrics.confusion_matrix(predictions['Quality'],cutoffs[column])
6 TP = confusion[1,1] # true positive
7 TN = confusion[0,0] # true negatives
8 FP = confusion[0,1] # false positives
9 FN = confusion[1,0] # false negatives
10 precision.append(TP/float(TP + FP))
11 recall.append(TP/float(FN + TP))
12
13fig,ax = plt.subplots()
14ax.set_xlabel('Cutoffs')
15ax.plot(cutoffs.columns,precision,label='precision')
16ax.plot(cutoffs.columns,recall,label='recall')
17ax.legend(('precision','recall'))
18plt.show()

svg

1# using sklearn utilities
2from sklearn.metrics import precision_score, recall_score
3print('Precision',precision_score(predictions['Quality'],predictions['Predicted_Quality']))
4print('Recall', recall_score(predictions['Quality'],predictions['Predicted_Quality']))
1Precision 0.7766323024054983
2Recall 0.7545909849749582

Predictions on Test set

1print(X_test[X.columns].head())
1fixed acidity volatile acidity citric acid total sulfur dioxide \
21254 -0.302046 0.908335 -1.056968 -0.341840
31087 -0.244545 -1.905494 0.765915 -0.496144
4822 -0.934554 0.025565 -0.702518 -0.310980
51514 -0.819552 1.680759 -0.297433 0.583983
6902 -0.532049 0.549710 -0.854425 -0.403562
7
8 sulphates alcohol
91254 0.203567 0.482194
101087 0.203567 0.766538
11822 -0.099025 -0.560402
121514 0.385122 -1.097497
13902 0.203567 0.387412
1test_predictions = pd.DataFrame()
2X_test_ = X_test[X.columns]
3test_predictions['Class_Probabilities'] = res.predict(sm.add_constant(X_test_))
1test_predictions['Original'] = y_test
2test_predictions.index = y_test.index
1# Predictions are made using 0.5 as the threshold
2test_predictions['Predicted'] = test_predictions['Class_Probabilities'].map(lambda x : 1 if x > 0.5 else 0)
1#### Metrics
2TN,FP,FN,TP = metrics.confusion_matrix(test_predictions['Original'],test_predictions['Predicted']).reshape(-1)
3model_metrics(TP,TN,FP,FN)
1Accuracy : 0.746
2Misclassification Rate / Error Rate : 0.254
3Sensitivity / True Positive Rate / Recall : 0.797
4Specificity / True Negative Rate : 0.688
5False Positive Rate : 0.312
6Precision / Positive Predictive Value : 0.745
7Prevalance : 0.533
8Negative Predictive Value 0.748
9Likelihood Ratio : Sensitivity / 1-Specificity : 2.554
10F1-score : 0.77

More articles from Yugen

Boom Bikes Demand Analysis

Boom Bikes Demand Analysis using Linear Regression

July 26th, 2020 · 5 min read

Seismic Enhancement using Deep Learning — Review of recent research

Resolution improvement and de-noising of seismic images using deep learning.

September 7th, 2021 · 7 min read
© 2020–2021 Yugen
Link to $http://twitter.com/JayanthBoddu/Link to $https://github.com/jayantb1019Link to $https://www.instagram.com/jayantb1019/Link to $https://www.linkedin.com/in/jayanthboddu/