Data Set

This data set contains information about red wine and the various factors affecting its quality. It was preprocessed and downloaded from the UCI Machine Learning Repository, and it serves as a simple, cleaned practice data set for classification modelling. Source: https://archive.ics.uci.edu/ml/datasets/wine+quality
Attribute Information:
Input variables (based on physicochemical tests):
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

Output variable (based on sensory data):

- quality: 'good' for wines with an original score above 5, 'bad' otherwise
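The preprocessed file used here already carries these labels. For context, this is roughly how such a binarization can be produced from the raw UCI scores; the cutoff of 5 comes from the description above, but the file name, separator, and exact preprocessing script are assumptions, not part of this analysis:

```python
import pandas as pd

# Hypothetical reconstruction of the preprocessing step: the raw UCI file
# stores quality as an integer score (0-10); scores above 5 become 'good'.
raw = pd.read_csv('winequality-red.csv', sep=';')  # raw UCI file is semicolon-separated
raw['quality'] = raw['quality'].apply(lambda s: 'good' if s > 5 else 'bad')
```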
Analysis Approach & Conclusions

This analysis focuses on finding the attributes that significantly affect wine quality and on training a predictive model that classifies wines as good or bad based on those attributes. The analysis pivots on the target variable quality. Standard exploratory steps are carried out: checking for null values, observing summary statistics, visualizing the variables, removing outliers, and checking for correlations.
The following significant correlations are observed:

- fixed acidity vs pH: -0.69
- fixed acidity vs density: 0.69
- fixed acidity vs citric acid: 0.67
- volatile acidity vs citric acid: -0.53
- citric acid vs pH: -0.54
- density vs alcohol: -0.51

A 70-30 split divides the data into train and test sets.
First, 10 variables are selected using automated RFE. Further manual selection is then carried out using the p-value method.
Models are built on the train data using the statsmodels.api package.
The final model is built on the following variables:
- citric acid
- fixed acidity
- volatile acidity
- alcohol
- sulphates
- total sulfur dioxide
Variance inflation factors are calculated for the final selection of variables. All VIFs are below 5, so no significant multicollinearity is observed.
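For reference, the VIF of predictor $j$ is computed from the $R_j^2$ of regressing that predictor on all the other predictors:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

A VIF of 1 means the predictor is uncorrelated with the rest; values above 5 are commonly taken to signal problematic multicollinearity.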
ROC and precision-recall / sensitivity-specificity curves are plotted. The optimum threshold for classification appears to be 0.5.
Model metrics on train data at a classification threshold of 0.5:

- Accuracy: 0.752
- Misclassification rate / error rate: 0.248
- Sensitivity / true positive rate / recall: 0.755
- Specificity / true negative rate: 0.75
- False positive rate: 0.25
- Precision / positive predictive value: 0.777
- Prevalence: 0.535
- Negative predictive value: 0.726
- Likelihood ratio (sensitivity / (1 - specificity)): 3.02
- F1-score: 0.766

Model metrics on test data at a classification threshold of 0.5:
- Accuracy: 0.746
- Misclassification rate / error rate: 0.254
- Sensitivity / true positive rate / recall: 0.797
- Specificity / true negative rate: 0.688
- False positive rate: 0.312
- Precision / positive predictive value: 0.745
- Prevalence: 0.533
- Negative predictive value: 0.748
- Likelihood ratio (sensitivity / (1 - specificity)): 2.554
- F1-score: 0.77

Analysis

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```
Importing Data

```python
data = pd.read_csv('./wine_quality_classification.csv')
data.head()
```
```
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56

   alcohol quality
0      9.4     bad
1      9.8     bad
2      9.8     bad
3      9.8    good
4      9.4     bad
```
```python
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   object
dtypes: float64(11), object(1)
memory usage: 150.0+ KB
```
```python
data.isnull().sum()
```
```
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
```
quality is our target variable. It has two levels, good and bad. There are no null or missing values, and all the other variables are continuous.
Replacing the quality levels with 1 and 0:

```python
data['quality'] = data['quality'].replace({'good': 1, 'bad': 0})
```
Summary Statistics

```python
print(data.describe())
```
```
       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000
mean        8.319637          0.527821     0.270976        2.538806
std         1.741096          0.179060     0.194801        1.409928
min         4.600000          0.120000     0.000000        0.900000
25%         7.100000          0.390000     0.090000        1.900000
50%         7.900000          0.520000     0.260000        2.200000
75%         9.200000          0.640000     0.420000        2.600000
max        15.900000          1.580000     1.000000       15.500000

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000
mean      0.087467            15.874922             46.467792     0.996747
std       0.047065            10.460157             32.895324     0.001887
min       0.012000             1.000000              6.000000     0.990070
25%       0.070000             7.000000             22.000000     0.995600
50%       0.079000            14.000000             38.000000     0.996750
75%       0.090000            21.000000             62.000000     0.997835
max       0.611000            72.000000            289.000000     1.003690

                pH    sulphates      alcohol      quality
count  1599.000000  1599.000000  1599.000000  1599.000000
mean      3.311113     0.658149    10.422983     0.534709
std       0.154386     0.169507     1.065668     0.498950
min       2.740000     0.330000     8.400000     0.000000
25%       3.210000     0.550000     9.500000     0.000000
50%       3.310000     0.620000    10.200000     1.000000
75%       3.400000     0.730000    11.100000     1.000000
max       4.010000     2.000000    14.900000     1.000000
```
Checking for Outliers

```python
print(data.quantile(np.linspace(0.90, 1, 12)))
```
```
          fixed acidity  volatile acidity  citric acid  residual sugar  \
0.900000      10.700000          0.745000     0.522000        3.600000
0.909091      10.872727          0.760000     0.537273        3.800000
0.918182      11.100000          0.775000     0.550000        4.000000
0.927273      11.300000          0.785000     0.560000        4.200000
0.936364      11.500000          0.810000     0.580000        4.400000
0.945455      11.600000          0.834182     0.590000        4.783636
0.954545      11.900000          0.851818     0.630000        5.500000
0.963636      12.089091          0.880000     0.640000        5.789091
0.972727      12.500000          0.910000     0.660000        6.141818
0.981818      12.794545          0.965000     0.680000        6.983636
0.990909      13.300000          1.022364     0.714727        8.694545
1.000000      15.900000          1.580000     1.000000       15.500000

          chlorides  free sulfur dioxide  total sulfur dioxide   density  \
0.900000   0.109000            31.000000             93.200000  0.999140
0.909091   0.111000            31.000000             96.000000  0.999300
0.918182   0.114000            32.000000             99.000000  0.999400
0.927273   0.117000            33.000000            102.781818  0.999478
0.936364   0.120000            34.000000            106.000000  0.999700
0.945455   0.123000            35.000000            110.000000  0.999800
0.954545   0.136364            36.000000            115.000000  1.000000
0.963636   0.164564            38.000000            121.000000  1.000200
0.972727   0.187673            40.000000            129.000000  1.000400
0.981818   0.234727            42.945455            136.000000  1.000989
0.990909   0.368473            51.000000            145.945455  1.001942
1.000000   0.611000            72.000000            289.000000  1.003690

                pH  sulphates  alcohol  quality
0.900000  3.510000   0.850000     12.0      1.0
0.909091  3.520000   0.860000     12.0      1.0
0.918182  3.530000   0.870000     12.1      1.0
0.927273  3.540000   0.887818     12.2      1.0
0.936364  3.543091   0.903091     12.4      1.0
0.945455  3.560000   0.930000     12.5      1.0
0.954545  3.573636   0.953636     12.5      1.0
0.963636  3.590000   0.998909     12.7      1.0
0.972727  3.610000   1.060000     12.8      1.0
0.981818  3.660000   1.140000     12.9      1.0
0.990909  3.710000   1.280000     13.4      1.0
1.000000  4.010000   2.000000     14.9      1.0
```
There are outliers in fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, pH, sulphates, and alcohol.
Visualizing Independent Variables

```python
x_vars = data.columns[data.columns != 'quality']
fig, ax = plt.subplots(len(x_vars))
fig.set_figheight(24)
fig.set_figwidth(12)
for num, i in enumerate(x_vars):
    ax[num].set_title(i)
    ax[num].set_xlabel('')
    sns.boxplot(x=data[i], ax=ax[num])
```
```python
x_vars = data.columns[data.columns != 'quality']
for i in x_vars:
    q1 = data[i].quantile(0.25)
    q3 = data[i].quantile(0.75)
    # 1.5 * IQR fences (the original derived the lower fence from q3, which was a bug)
    upper_extreme = q3 + 1.5 * (q3 - q1)
    lower_extreme = q1 - 1.5 * (q3 - q1)
    mask = (data[i] > lower_extreme) & (data[i] < upper_extreme)
    outliers = data[~mask].index  # rows falling outside the fences
    data.drop(index=outliers)
    # NOTE: drop() returns a copy; without reassignment (data = data.drop(...))
    # no rows are removed - and the counts below confirm all 1599 rows remain.
```
Train-Test Split

```python
from sklearn.model_selection import train_test_split

y = data.pop('quality')
X = data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
```
Scaling Continuous Variables

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the train set only, then apply the same transform to the test set
X_train[X_train.columns] = scaler.fit_transform(X_train[X_train.columns])
X_test[X_train.columns] = scaler.transform(X_test[X_train.columns])
```
Correlations

```python
plt.figure(figsize=[20, 10])
sns.heatmap(X_train.corr(), annot=True)
plt.title('Visualizing Correlations')
plt.show()
```
High correlations:

- fixed acidity vs pH: -0.69
- fixed acidity vs density: 0.69
- fixed acidity vs citric acid: 0.67
- volatile acidity vs citric acid: -0.53
- citric acid vs pH: -0.54
- density vs alcohol: -0.51

Model Building

```python
import statsmodels.api as sm
```
```python
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1.fit().summary()
```
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                quality   No. Observations:                 1119
Model:                            GLM   Df Residuals:                     1107
Model Family:                Binomial   Df Model:                           11
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -570.86
Date:                Fri, 07 Aug 2020   Deviance:                       1141.7
Time:                        18:45:03   Pearson chi2:                 1.08e+03
No. Iterations:                     5   Covariance Type:             nonrobust
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.2632      0.076      3.478      0.001       0.115       0.411
fixed acidity            0.3550      0.204      1.742      0.082      -0.045       0.755
volatile acidity        -0.5883      0.103     -5.706      0.000      -0.790      -0.386
citric acid             -0.3117      0.132     -2.356      0.018      -0.571      -0.052
residual sugar           0.2039      0.093      2.185      0.029       0.021       0.387
chlorides               -0.1757      0.091     -1.931      0.054      -0.354       0.003
free sulfur dioxide      0.1652      0.107      1.546      0.122      -0.044       0.375
total sulfur dioxide    -0.5286      0.115     -4.584      0.000      -0.755      -0.303
density                 -0.2451      0.186     -1.320      0.187      -0.609       0.119
pH                      -0.0311      0.133     -0.233      0.815      -0.292       0.230
sulphates                0.4795      0.093      5.143      0.000       0.297       0.662
alcohol                  0.9432      0.134      7.014      0.000       0.680       1.207
========================================================================================
```
Feature Selection using RFE

```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

logReg = LogisticRegression()
rfe = RFE(logReg, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)

rfe_results = list(zip(X_train.columns, rfe.support_, rfe.ranking_))
sorted(rfe_results, key=lambda x: x[2])
```
```
[('fixed acidity', True, 1),
 ('volatile acidity', True, 1),
 ('citric acid', True, 1),
 ('residual sugar', True, 1),
 ('chlorides', True, 1),
 ('free sulfur dioxide', True, 1),
 ('total sulfur dioxide', True, 1),
 ('density', True, 1),
 ('sulphates', True, 1),
 ('alcohol', True, 1),
 ('pH', False, 2)]
```
RFE results show that pH can be dropped.

```python
X_train.drop(columns=['pH'], inplace=True)
X_test.drop(columns=['pH'], inplace=True)
```
Assessing Model

Model 1

```python
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1.fit().summary()
```
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                quality   No. Observations:                 1119
Model:                            GLM   Df Residuals:                     1108
Model Family:                Binomial   Df Model:                           10
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -570.89
Date:                Fri, 07 Aug 2020   Deviance:                       1141.8
Time:                        18:45:03   Pearson chi2:                 1.08e+03
No. Iterations:                     5   Covariance Type:             nonrobust
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.2631      0.076      3.478      0.001       0.115       0.411
fixed acidity            0.3894      0.141      2.762      0.006       0.113       0.666
volatile acidity        -0.5904      0.103     -5.750      0.000      -0.792      -0.389
citric acid             -0.3128      0.132     -2.367      0.018      -0.572      -0.054
residual sugar           0.2110      0.088      2.389      0.017       0.038       0.384
chlorides               -0.1705      0.088     -1.933      0.053      -0.343       0.002
free sulfur dioxide      0.1609      0.105      1.528      0.127      -0.045       0.367
total sulfur dioxide    -0.5228      0.113     -4.645      0.000      -0.743      -0.302
density                 -0.2686      0.156     -1.722      0.085      -0.574       0.037
sulphates                0.4816      0.093      5.196      0.000       0.300       0.663
alcohol                  0.9287      0.119      7.803      0.000       0.695       1.162
========================================================================================
```
Model 2

Dropping free sulfur dioxide because of its high p-value.

```python
X = X_train.loc[:, X_train.columns != 'free sulfur dioxide']
logm2 = sm.GLM(y_train, sm.add_constant(X), family=sm.families.Binomial())
logm2.fit().summary()
```

```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                quality   No. Observations:                 1119
Model:                            GLM   Df Residuals:                     1109
Model Family:                Binomial   Df Model:                            9
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -572.06
Date:                Fri, 07 Aug 2020   Deviance:                       1144.1
Time:                        18:45:03   Pearson chi2:                 1.08e+03
No. Iterations:                     5   Covariance Type:             nonrobust
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.2687      0.075      3.561      0.000       0.121       0.417
fixed acidity            0.4006      0.141      2.845      0.004       0.125       0.677
volatile acidity        -0.6186      0.102     -6.089      0.000      -0.818      -0.420
citric acid             -0.3548      0.130     -2.738      0.006      -0.609      -0.101
residual sugar           0.2323      0.088      2.629      0.009       0.059       0.406
chlorides               -0.1646      0.088     -1.868      0.062      -0.337       0.008
total sulfur dioxide    -0.4099      0.083     -4.911      0.000      -0.574      -0.246
density                 -0.2762      0.156     -1.771      0.077      -0.582       0.029
sulphates                0.4918      0.093      5.279      0.000       0.309       0.674
alcohol                  0.9411      0.119      7.892      0.000       0.707       1.175
========================================================================================
```
Model 3

Dropping density because of its high p-value.

```python
X = X.loc[:, X.columns != 'density']
logm3 = sm.GLM(y_train, sm.add_constant(X), family=sm.families.Binomial())
logm3.fit().summary()
```

```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                quality   No. Observations:                 1119
Model:                            GLM   Df Residuals:                     1110
Model Family:                Binomial   Df Model:                            8
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -573.64
Date:                Fri, 07 Aug 2020   Deviance:                       1147.3
Time:                        18:45:03   Pearson chi2:                 1.10e+03
No. Iterations:                     5   Covariance Type:             nonrobust
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.2590      0.075      3.451      0.001       0.112       0.406
fixed acidity            0.2393      0.107      2.234      0.025       0.029       0.449
volatile acidity        -0.6496      0.101     -6.426      0.000      -0.848      -0.451
citric acid             -0.3570      0.130     -2.747      0.006      -0.612      -0.102
residual sugar           0.1478      0.074      1.998      0.046       0.003       0.293
chlorides               -0.1540      0.088     -1.748      0.080      -0.327       0.019
total sulfur dioxide    -0.4002      0.083     -4.822      0.000      -0.563      -0.238
sulphates                0.4584      0.090      5.090      0.000       0.282       0.635
alcohol                  1.0615      0.099     10.722      0.000       0.867       1.255
========================================================================================
```
Model 4

Dropping chlorides because of its high p-value.

```python
X = X.loc[:, X.columns != 'chlorides']
logm4 = sm.GLM(y_train, sm.add_constant(X), family=sm.families.Binomial())
logm4.fit().summary()
```

```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                quality   No. Observations:                 1119
Model:                            GLM   Df Residuals:                     1111
Model Family:                Binomial   Df Model:                            7
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -575.22
Date:                Fri, 07 Aug 2020   Deviance:                       1150.4
Time:                        18:45:03   Pearson chi2:                 1.11e+03
No. Iterations:                     5   Covariance Type:             nonrobust
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.2577      0.075      3.438      0.001       0.111       0.405
fixed acidity            0.2787      0.105      2.659      0.008       0.073       0.484
volatile acidity        -0.6975      0.098     -7.105      0.000      -0.890      -0.505
citric acid             -0.4242      0.125     -3.401      0.001      -0.669      -0.180
residual sugar           0.1398      0.073      1.914      0.056      -0.003       0.283
total sulfur dioxide    -0.3884      0.082     -4.712      0.000      -0.550      -0.227
sulphates                0.3856      0.078      4.946      0.000       0.233       0.538
alcohol                  1.1119      0.096     11.633      0.000       0.925       1.299
========================================================================================
```
Model 5

Dropping residual sugar because of its high p-value.

```python
X = X.loc[:, X.columns != 'residual sugar']
logm5 = sm.GLM(y_train, sm.add_constant(X), family=sm.families.Binomial())
logm5.fit().summary()
```

```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                quality   No. Observations:                 1119
Model:                            GLM   Df Residuals:                     1112
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -577.02
Date:                Fri, 07 Aug 2020   Deviance:                       1154.0
Time:                        18:45:03   Pearson chi2:                 1.10e+03
No. Iterations:                     5   Covariance Type:             nonrobust
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.2593      0.075      3.465      0.001       0.113       0.406
fixed acidity            0.2883      0.105      2.753      0.006       0.083       0.494
volatile acidity        -0.6874      0.097     -7.064      0.000      -0.878      -0.497
citric acid             -0.4051      0.124     -3.271      0.001      -0.648      -0.162
total sulfur dioxide    -0.3479      0.079     -4.402      0.000      -0.503      -0.193
sulphates                0.3769      0.078      4.846      0.000       0.224       0.529
alcohol                  1.1186      0.095     11.716      0.000       0.931       1.306
========================================================================================
```
All the remaining p-values are very low, so the variables that remain have statistically significant relationships with the target.

Checking Multicollinearity

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif(X):
    df = sm.add_constant(X)
    vif_values = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    vif_frame = pd.DataFrame({'vif': vif_values}, index=df.columns).reset_index()
    print(vif_frame.sort_values(by='vif', ascending=False))

vif(X)
```
```
                  index       vif
3           citric acid  2.735105
1         fixed acidity  2.113091
2      volatile acidity  1.498180
6               alcohol  1.163675
5             sulphates  1.150064
4  total sulfur dioxide  1.121176
0                 const  1.000000
```
As we can see, there is no multicollinearity, since all VIFs are below 5.

Final Model

```python
print('Selected columns :', X.columns)
```

```
Selected columns : Index(['fixed acidity', 'volatile acidity', 'citric acid',
       'total sulfur dioxide', 'sulphates', 'alcohol'],
      dtype='object')
```
```python
logm_final = sm.GLM(y_train, sm.add_constant(X_train[X.columns]), family=sm.families.Binomial())
res = logm_final.fit()
res.summary()
```
```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                quality   No. Observations:                 1119
Model:                            GLM   Df Residuals:                     1112
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -577.02
Date:                Fri, 07 Aug 2020   Deviance:                       1154.0
Time:                        18:45:04   Pearson chi2:                 1.10e+03
No. Iterations:                     5   Covariance Type:             nonrobust
========================================================================================
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.2593      0.075      3.465      0.001       0.113       0.406
fixed acidity            0.2883      0.105      2.753      0.006       0.083       0.494
volatile acidity        -0.6874      0.097     -7.064      0.000      -0.878      -0.497
citric acid             -0.4051      0.124     -3.271      0.001      -0.648      -0.162
total sulfur dioxide    -0.3479      0.079     -4.402      0.000      -0.503      -0.193
sulphates                0.3769      0.078      4.846      0.000       0.224       0.529
alcohol                  1.1186      0.095     11.716      0.000       0.931       1.306
========================================================================================
```
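Since the link function is the logit, these coefficients are in log-odds. A quick way to read them (a minimal sketch using the fitted `res` object from above) is to exponentiate them into odds ratios; because the predictors are standardized, each ratio is per one standard deviation of that variable:

```python
import numpy as np

# Odds ratio per 1-SD increase in each standardized predictor;
# e.g. exp(1.1186) ~ 3.06 for alcohol.
print(np.exp(res.params))
```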
Making Predictions on the Train Set

```python
selected_vars = X.columns
y_train_pred = res.predict(sm.add_constant(X_train[X.columns]))
print(y_train_pred.head())
```

```
858    0.855444
654    0.255470
721    0.172042
176    0.388875
692    0.379338
dtype: float64
```
Wine Quality vs Predicted Probability

```python
predictions = pd.DataFrame({'Quality': y_train.values,
                            'class_probability': y_train_pred.values.reshape(-1)},
                           index=X_train.index)
print(predictions.head())
```

```
     Quality  class_probability
858        1           0.855444
654        0           0.255470
721        0           0.172042
176        0           0.388875
692        0           0.379338
```
Classification Threshold

Let us assume that any probability below 0.5 means bad and any probability above 0.5 means good.

```python
predictions['Predicted_Quality'] = predictions['class_probability'].apply(lambda x: 1 if x > 0.5 else 0)
print(predictions.head())
```

```
     Quality  class_probability  Predicted_Quality
858        1           0.855444                  1
654        0           0.255470                  0
721        0           0.172042                  0
176        0           0.388875                  0
692        0           0.379338                  0
```
Simple Metrics

```python
from sklearn import metrics
```

Confusion Matrix

```python
confusion = metrics.confusion_matrix(predictions['Quality'], predictions['Predicted_Quality'])
print(confusion)
```

```
[[390 130]
 [147 452]]
```
The above result can be interpreted as follows: element [i, j] is the number of times class j was predicted when the actual class was i (0: bad, 1: good). So TN = 390, FP = 130, FN = 147, TP = 452:

```
           Predicted 0  Predicted 1
Actual 0      TN = 390     FP = 130
Actual 1      FN = 147     TP = 452
```
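With scikit-learn's row/column convention, the four cells can also be unpacked in one line; the same idiom is used on the test set later:

```python
# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels {0, 1}
TN, FP, FN, TP = confusion.ravel()
```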
```python
print(metrics.accuracy_score(predictions['Quality'], predictions['Predicted_Quality']))
```
Metrics beyond Simple Accuracy

```python
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

def model_metrics(TP, TN, FP, FN):
    print('Accuracy :', round((TP + TN) / float(TP + TN + FP + FN), 3))
    print('Misclassification Rate / Error Rate :', round((FP + FN) / float(TP + TN + FP + FN), 3))
    sensitivity = round(TP / float(FN + TP), 3)
    print('Sensitivity / True Positive Rate / Recall :', sensitivity)
    specificity = round(TN / float(TN + FP), 3)
    print('Specificity / True Negative Rate :', specificity)
    print('False Positive Rate :', round(FP / float(TN + FP), 3))
    precision = round(TP / float(TP + FP), 3)
    print('Precision / Positive Predictive Value :', precision)
    print('Prevalence :', round((FN + TP) / float(TP + TN + FP + FN), 3))
    print('Negative Predictive Value :', round(TN / float(TN + FN), 3))
    print('Likelihood Ratio : Sensitivity / 1-Specificity :', round(sensitivity / float(1 - specificity), 3))
    print('F1-score :', round(2 * precision * sensitivity / (precision + sensitivity), 3))
```
```python
model_metrics(TP, TN, FP, FN)
```

```
Accuracy : 0.752
Misclassification Rate / Error Rate : 0.248
Sensitivity / True Positive Rate / Recall : 0.755
Specificity / True Negative Rate : 0.75
False Positive Rate : 0.25
Precision / Positive Predictive Value : 0.777
Prevalence : 0.535
Negative Predictive Value : 0.726
Likelihood Ratio : Sensitivity / 1-Specificity : 3.02
F1-score : 0.766
```
ROC Curve

```python
print(predictions.head())
```

```
     Quality  class_probability  Predicted_Quality
858        1           0.855444                  1
654        0           0.255470                  0
721        0           0.172042                  0
176        0           0.388875                  0
692        0           0.379338                  0
```
```python
# Binarize the probabilities at a grid of cutoffs from 0 to 0.9
cutoffs = pd.DataFrame()
for i in np.arange(0, 1, 0.1):
    cutoffs[i] = predictions['class_probability'].map(lambda x: 1 if x > i else 0)

tpr = []
fpr = []
for column in cutoffs.columns:
    confusion = metrics.confusion_matrix(predictions['Quality'], cutoffs[column])
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    tpr.append(TP / float(TP + FN))
    fpr.append(FP / float(FP + TN))
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
sns.scatterplot(x=fpr, y=tpr);
```
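As a cross-check on the manual grid above, scikit-learn can compute the full ROC curve and its AUC directly from the predicted probabilities (a minimal sketch using the predictions frame defined earlier; the AUC value itself is not reported in the original analysis):

```python
from sklearn import metrics

# Sweeps every distinct probability as a threshold instead of a 0.1-step grid
fpr_sk, tpr_sk, thresholds = metrics.roc_curve(predictions['Quality'],
                                               predictions['class_probability'])
auc = metrics.roc_auc_score(predictions['Quality'], predictions['class_probability'])
```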
Optimum Cut-off

```python
sensitivity = []
specificity = []
accuracy = []
for column in cutoffs.columns:
    confusion = metrics.confusion_matrix(predictions['Quality'], cutoffs[column])
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    sensitivity.append(TP / float(TP + FN))
    specificity.append(1 - FP / float(FP + TN))
    accuracy.append((TP + TN) / (TP + TN + FP + FN))

fig, ax = plt.subplots()
ax.set_xlabel('Cutoffs')
ax.plot(cutoffs.columns, sensitivity, label='sensitivity')
ax.plot(cutoffs.columns, specificity, label='specificity')
ax.plot(cutoffs.columns, accuracy, label='accuracy')
ax.legend(('sensitivity', 'specificity', 'accuracy'))
plt.show()
```
From the above plot, 0.5 seems to be the optimum threshold for classification.

```python
# Apply the chosen threshold to the predicted probabilities
# (the original thresholded the already-binary Predicted_Quality column,
# which happens to give the same result)
predictions['Final_Predictions'] = predictions['class_probability'].map(lambda x: 1 if x > 0.5 else 0)
```
```python
confusion_final = metrics.confusion_matrix(predictions['Quality'], predictions['Final_Predictions'])
TP = confusion_final[1, 1]
TN = confusion_final[0, 0]
FP = confusion_final[0, 1]
FN = confusion_final[1, 0]

model_metrics(TP, TN, FP, FN)
```
```
Accuracy : 0.752
Misclassification Rate / Error Rate : 0.248
Sensitivity / True Positive Rate / Recall : 0.755
Specificity / True Negative Rate : 0.75
False Positive Rate : 0.25
Precision / Positive Predictive Value : 0.777
Prevalence : 0.535
Negative Predictive Value : 0.726
Likelihood Ratio : Sensitivity / 1-Specificity : 3.02
F1-score : 0.766
```
Precision and Recall

```python
precision = []
recall = []

for column in cutoffs.columns:
    confusion = metrics.confusion_matrix(predictions['Quality'], cutoffs[column])
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
    precision.append(TP / float(TP + FP))
    recall.append(TP / float(FN + TP))

fig, ax = plt.subplots()
ax.set_xlabel('Cutoffs')
ax.plot(cutoffs.columns, precision, label='precision')
ax.plot(cutoffs.columns, recall, label='recall')
ax.legend(('precision', 'recall'))
plt.show()
```
```python
from sklearn.metrics import precision_score, recall_score

print('Precision', precision_score(predictions['Quality'], predictions['Predicted_Quality']))
print('Recall', recall_score(predictions['Quality'], predictions['Predicted_Quality']))
```

```
Precision 0.7766323024054983
Recall 0.7545909849749582
```
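Similarly, scikit-learn's precision_recall_curve sweeps every distinct probability as a candidate threshold, rather than the 0.1-step grid used above (a minimal sketch under the same assumptions):

```python
from sklearn.metrics import precision_recall_curve

# One (precision, recall) pair per candidate threshold
prec, rec, thresholds = precision_recall_curve(predictions['Quality'],
                                               predictions['class_probability'])
```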
Predictions on the Test Set

```python
print(X_test[X.columns].head())
```

```
      fixed acidity  volatile acidity  citric acid  total sulfur dioxide  \
1254      -0.302046          0.908335    -1.056968             -0.341840
1087      -0.244545         -1.905494     0.765915             -0.496144
822       -0.934554          0.025565    -0.702518             -0.310980
1514      -0.819552          1.680759    -0.297433              0.583983
902       -0.532049          0.549710    -0.854425             -0.403562

      sulphates   alcohol
1254   0.203567  0.482194
1087   0.203567  0.766538
822   -0.099025 -0.560402
1514   0.385122 -1.097497
902    0.203567  0.387412
```
```python
test_predictions = pd.DataFrame()
X_test_ = X_test[X.columns]
test_predictions['Class_Probabilities'] = res.predict(sm.add_constant(X_test_))

test_predictions['Original'] = y_test
test_predictions.index = y_test.index

test_predictions['Predicted'] = test_predictions['Class_Probabilities'].map(lambda x: 1 if x > 0.5 else 0)

TN, FP, FN, TP = metrics.confusion_matrix(test_predictions['Original'], test_predictions['Predicted']).reshape(-1)
model_metrics(TP, TN, FP, FN)
```
```
Accuracy : 0.746
Misclassification Rate / Error Rate : 0.254
Sensitivity / True Positive Rate / Recall : 0.797
Specificity / True Negative Rate : 0.688
False Positive Rate : 0.312
Precision / Positive Predictive Value : 0.745
Prevalence : 0.533
Negative Predictive Value : 0.748
Likelihood Ratio : Sensitivity / 1-Specificity : 2.554
F1-score : 0.77
```
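As a final cross-check, scikit-learn's classification_report condenses per-class precision, recall, and F1 into one table (a sketch using the test_predictions frame built above):

```python
from sklearn import metrics

print(metrics.classification_report(test_predictions['Original'],
                                    test_predictions['Predicted']))
```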