
Telecom Churn Case Study - Part 2

Jayanth Boddu
December 4th, 2020 · 2 min read

Photo by Mike Kononov on Unsplash

This analysis is the combined effort of Umaer and me.


This notebook contains

  • Train-Test Split
  • Class Imbalance
  • Standardization
  • Modelling
    • Model 1 : Logistic Regression with RFE & Manual Elimination (Interpretable Model)
    • Model 2 : PCA + Logistic Regression
    • Model 3 : PCA + Random Forest Classifier
    • Model 4 : PCA + XGBoost

For the previous steps, refer to Part-1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# import tabulate
import warnings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
warnings.filterwarnings('ignore')
data = pd.read_csv('cleaned_churn_data.csv', index_col='mobile_number')
data.drop(columns=['Unnamed: 0'], inplace=True)
data.head()
[output: data.head() — first five rows of the cleaned dataset, indexed by mobile_number, with 141 predictor columns (onnet_mou_6 … sachet_2g_8_5) plus the Churn target]

Train-Test Split

y = data.pop('Churn') # Predicted / Target variable
X = data # Predictor variables
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
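The split above is purely random. With a target as skewed as this one (explored next), a stratified split keeps the churn ratio identical across train and test. A minimal variant for comparison (an assumption, not what this analysis ran):

# Hypothetical stratified variant — preserves the ~8.6% churn rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                          random_state=42, stratify=y)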

Class Imbalance

y.value_counts(normalize=True).to_frame()
      Churn
0  0.913598
1  0.086402
# Ratio of classes
class_0 = y[y == 0].count()
class_1 = y[y == 1].count()

print(f'Class Imbalance Ratio : {round(class_1/class_0, 3)}')
Class Imbalance Ratio : 0.095
  • To account for class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) could be used.

Using SMOTE

#!pip install imblearn
from imblearn.over_sampling import SMOTE
smt = SMOTE(random_state=42, k_neighbors=5)

# Resampling the train set to account for class imbalance
X_train_resampled, y_train_resampled = smt.fit_resample(X_train, y_train)
X_train_resampled.head()
[output: X_train_resampled.head() — first five resampled rows across the same 141 predictor columns]

Standardizing Columns

# columns with numerical data
condition1 = data.dtypes == 'int'
condition2 = data.dtypes == 'float'
numerical_vars = data.columns[condition1 | condition2].to_list()
# Standard scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit and transform train set
X_train_resampled[numerical_vars] = scaler.fit_transform(X_train_resampled[numerical_vars])

# Transform test set
X_test[numerical_vars] = scaler.transform(X_test[numerical_vars])
# summary statistics of standardized variables
round(X_train_resampled.describe(), 2)
[output: summary statistics of the 38,374 resampled rows — after scaling, every numerical column has mean ≈ 0.00 and std ≈ 1.00 (og_others_7 and og_others_8 are constant, so their std is 0.00)]

Modelling

Model 1 : Interpretable Model : Logistic Regression

Baseline Logistic Regression Model

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' is used here to handle the class imbalance
baseline_model = LogisticRegression(random_state=100, class_weight='balanced')
baseline_model = baseline_model.fit(X_train, y_train)

y_train_pred = baseline_model.predict_proba(X_train)[:,1]
y_test_pred = baseline_model.predict_proba(X_test)[:,1]
# converting train and test predictions to Series to preserve the index
y_train_pred = pd.Series(y_train_pred, index=X_train.index)
y_test_pred = pd.Series(y_test_pred, index=X_test.index)

Baseline Performance

# Function for baseline performance metrics
def model_metrics(matrix):
    TN = matrix[0][0]
    TP = matrix[1][1]
    FP = matrix[0][1]
    FN = matrix[1][0]
    accuracy = round((TP + TN) / float(TP + TN + FP + FN), 3)
    print('Accuracy :', accuracy)
    sensitivity = round(TP / float(FN + TP), 3)
    print('Sensitivity / True Positive Rate / Recall :', sensitivity)
    specificity = round(TN / float(TN + FP), 3)
    print('Specificity / True Negative Rate : ', specificity)
    precision = round(TP / float(TP + FP), 3)
    print('Precision / Positive Predictive Value :', precision)
    print('F1-score :', round(2 * precision * sensitivity / (precision + sensitivity), 3))
# Prediction at a threshold of 0.5
classification_threshold = 0.5

y_train_pred_classified = y_train_pred.map(lambda x: 1 if x > classification_threshold else 0)
y_test_pred_classified = y_test_pred.map(lambda x: 1 if x > classification_threshold else 0)
from sklearn.metrics import confusion_matrix
train_matrix = confusion_matrix(y_train, y_train_pred_classified)
print('Confusion Matrix for train:\n', train_matrix)
test_matrix = confusion_matrix(y_test, y_test_pred_classified)
print('\nConfusion Matrix for test: \n', test_matrix)
Confusion Matrix for train:
 [[16001  3186]
 [  326  1494]]

Confusion Matrix for test:
 [[6090 2141]
 [ 149  624]]
# Baseline model performance

print('Train Performance : \n')
model_metrics(train_matrix)

print('\n\nTest Performance : \n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.833
Sensitivity / True Positive Rate / Recall : 0.821
Specificity / True Negative Rate : 0.834
Precision / Positive Predictive Value : 0.319
F1-score : 0.459


Test Performance :

Accuracy : 0.746
Sensitivity / True Positive Rate / Recall : 0.807
Specificity / True Negative Rate : 0.74
Precision / Positive Predictive Value : 0.226
F1-score : 0.353

Baseline Performance - Finding Optimum Probability Cutoff

# Specificity / Sensitivity trade-off

# Classification at probability thresholds between 0 and 1
y_train_pred_thres = pd.DataFrame(index=X_train.index)
thresholds = [float(x)/10 for x in range(10)]

def thresholder(x, thresh):
    if x > thresh:
        return 1
    else:
        return 0

for i in thresholds:
    y_train_pred_thres[i] = y_train_pred.map(lambda x: thresholder(x, i))
y_train_pred_thres.head()
               0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
mobile_number
7000166926       1    1    1    1    1    0    0    0    0    0
7001343085       1    1    1    0    0    0    0    0    0    0
7001863283       1    1    0    0    0    0    0    0    0    0
7002275981       1    1    1    0    0    0    0    0    0    0
7001086221       1    0    0    0    0    0    0    0    0    0
# sensitivity, specificity, accuracy for each threshold
metrics_df = pd.DataFrame(columns=['sensitivity', 'specificity', 'accuracy'])

# Function for calculating metrics at each threshold
def model_metrics_thres(matrix):
    TN = matrix[0][0]
    TP = matrix[1][1]
    FP = matrix[0][1]
    FN = matrix[1][0]
    accuracy = round((TP + TN) / float(TP + TN + FP + FN), 3)
    sensitivity = round(TP / float(FN + TP), 3)
    specificity = round(TN / float(TN + FP), 3)
    return sensitivity, specificity, accuracy

# generating a data frame of metrics for each threshold
for thres, column in zip(thresholds, y_train_pred_thres.columns.to_list()):
    confusion = confusion_matrix(y_train, y_train_pred_thres.loc[:, column])
    sensitivity, specificity, accuracy = model_metrics_thres(confusion)

    metrics_df = metrics_df.append({
        'sensitivity': sensitivity,
        'specificity': specificity,
        'accuracy': accuracy
    }, ignore_index=True)

metrics_df.index = thresholds
metrics_df
     sensitivity  specificity  accuracy
0.0        1.000        0.000     0.087
0.1        0.974        0.345     0.399
0.2        0.947        0.523     0.560
0.3        0.910        0.658     0.680
0.4        0.868        0.763     0.772
0.5        0.821        0.834     0.833
0.6        0.770        0.883     0.873
0.7        0.677        0.921     0.899
0.8        0.493        0.953     0.913
0.9        0.234        0.981     0.916
metrics_df.plot(kind='line', figsize=(24,8), grid=True, xticks=np.arange(0,1,0.02),
                title='Specificity-Sensitivity TradeOff');

[plot: Specificity-Sensitivity trade-off across probability thresholds]

Baseline Performance at Optimum Cutoff

optimum_cutoff = 0.49
y_train_pred_final = y_train_pred.map(lambda x: 1 if x > optimum_cutoff else 0)
y_test_pred_final = y_test_pred.map(lambda x: 1 if x > optimum_cutoff else 0)

train_matrix = confusion_matrix(y_train, y_train_pred_final)
print('Confusion Matrix for train:\n', train_matrix)
test_matrix = confusion_matrix(y_test, y_test_pred_final)
print('\nConfusion Matrix for test: \n', test_matrix)
Confusion Matrix for train:
 [[15888  3299]
 [  318  1502]]

Confusion Matrix for test:
 [[1329 6902]
 [  16  757]]
print('Train Performance: \n')
model_metrics(train_matrix)

print('\n\nTest Performance : \n')
model_metrics(test_matrix)
Train Performance:

Accuracy : 0.828
Sensitivity / True Positive Rate / Recall : 0.825
Specificity / True Negative Rate : 0.828
Precision / Positive Predictive Value : 0.313
F1-score : 0.454


Test Performance :

Accuracy : 0.232
Sensitivity / True Positive Rate / Recall : 0.979
Specificity / True Negative Rate : 0.161
Precision / Positive Predictive Value : 0.099
F1-score : 0.18
# ROC-AUC score
from sklearn.metrics import roc_auc_score
print('ROC AUC score for Train : ', round(roc_auc_score(y_train, y_train_pred), 3), '\n')
print('ROC AUC score for Test : ', round(roc_auc_score(y_test, y_test_pred), 3))
ROC AUC score for Train :  0.891

ROC AUC score for Test :  0.838
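The AUC summarises the whole threshold sweep above in one number; for a visual check, the ROC curve itself can be plotted. A short sketch using sklearn's roc_curve (an addition, not part of the original notebook):

from sklearn.metrics import roc_curve

# y_test_pred holds the predicted churn probabilities from the baseline model
fpr, tpr, _ = roc_curve(y_test, y_test_pred)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Baseline LR (test)')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.title('ROC Curve');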

Feature Selection using RFE

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=100, class_weight='balanced')
rfe = RFE(lr, n_features_to_select=15)
results = rfe.fit(X_train, y_train)
results.support_
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False,  True,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False,  True, False, False,  True, False, False,
       False,  True, False, False,  True, False, False, False, False,
        True, False, False,  True, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
        True, False, False, False, False, False])
# DataFrame with features supported by RFE
rfe_support = pd.DataFrame({'Column': X.columns.to_list(), 'Rank': rfe.ranking_,
                            'Support': rfe.support_}).sort_values(by='Rank', ascending=True)
rfe_support
     Column            Rank  Support
99   sachet_3g_6_0        1     True
120  sachet_2g_7_0        1     True
102  monthly_2g_7_0       1     True
135  sachet_2g_8_0        1     True
81   total_rech_num_6     1     True
129  monthly_3g_8_0       1     True
105  monthly_2g_8_0       1     True
83   total_rech_num_8     1     True
117  monthly_2g_6_0       1     True
68   std_ic_t2f_mou_8     1     True
67   std_ic_t2f_mou_7     1     True
112  sachet_2g_6_0        1     True
109  monthly_3g_7_0       1     True
56   loc_ic_t2f_mou_8     1     True
35   std_og_t2f_mou_8     1     True
40   isd_og_mou_7         2    False
53   loc_ic_t2m_mou_8     3    False
19   loc_og_t2f_mou_7     4    False
62   std_ic_t2t_mou_8     5    False
61   std_ic_t2t_mou_7     6    False
107  sachet_3g_8_0        7    False
…    (120 further rows with Ranks 8-127, all with Support False)
# RFE-selected columns
rfe_selected_columns = rfe_support.loc[rfe_support['Rank'] == 1, 'Column'].to_list()
rfe_selected_columns
['sachet_3g_6_0',
 'sachet_2g_7_0',
 'monthly_2g_7_0',
 'sachet_2g_8_0',
 'total_rech_num_6',
 'monthly_3g_8_0',
 'monthly_2g_8_0',
 'total_rech_num_8',
 'monthly_2g_6_0',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_7',
 'sachet_2g_6_0',
 'monthly_3g_7_0',
 'loc_ic_t2f_mou_8',
 'std_og_t2f_mou_8']

Logistic Regression with RFE Selected Columns

Model I

# Logistic Regression model with RFE columns
import statsmodels.api as sm

# Note: the SMOTE-resampled train set is used because statsmodels' GLM does not support class_weight
logr = sm.GLM(y_train_resampled, sm.add_constant(X_train_resampled[rfe_selected_columns]), family=sm.families.Binomial())
logr_fit = logr.fit()
logr_fit.summary()
Generalized Linear Model Regression Results
Dep. Variable: Churn No. Observations: 38374
Model: GLM Df Residuals: 38358
Model Family: Binomial Df Model: 15
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -19485.
Date: Mon, 30 Nov 2020 Deviance: 38969.
Time: 21:57:09 Pearson chi2: 2.80e+05
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.2334 0.015 -15.657 0.000 -0.263 -0.204
sachet_3g_6_0 -0.0396 0.014 -2.886 0.004 -0.066 -0.013
sachet_2g_7_0 -0.0980 0.016 -6.201 0.000 -0.129 -0.067
monthly_2g_7_0 0.0096 0.016 0.594 0.552 -0.022 0.041
sachet_2g_8_0 0.0489 0.015 3.359 0.001 0.020 0.077
total_rech_num_6 0.6047 0.017 35.547 0.000 0.571 0.638
monthly_3g_8_0 0.3993 0.017 23.439 0.000 0.366 0.433
monthly_2g_8_0 0.3697 0.018 21.100 0.000 0.335 0.404
total_rech_num_8 -1.2013 0.019 -62.378 0.000 -1.239 -1.164
monthly_2g_6_0 -0.0194 0.015 -1.262 0.207 -0.050 0.011
std_ic_t2f_mou_8 -0.3364 0.026 -12.792 0.000 -0.388 -0.285
std_ic_t2f_mou_7 0.1535 0.019 8.148 0.000 0.117 0.190
sachet_2g_6_0 -0.1117 0.016 -6.847 0.000 -0.144 -0.080
monthly_3g_7_0 -0.2094 0.017 -12.602 0.000 -0.242 -0.177
loc_ic_t2f_mou_8 -1.2743 0.038 -33.599 0.000 -1.349 -1.200
std_og_t2f_mou_8 -0.2476 0.021 -11.621 0.000 -0.289 -0.206

Logistic Regression with Manual Feature Elimination

# Using p-values and VIF for manual feature elimination
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif(X_train_resampled, logr_fit, selected_columns):
    vif = pd.DataFrame()
    vif['Features'] = selected_columns
    vif['VIF'] = [variance_inflation_factor(X_train_resampled[selected_columns].values, i)
                  for i in range(X_train_resampled[selected_columns].shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.set_index('Features')
    vif['P-value'] = round(logr_fit.pvalues, 4)
    vif = vif.sort_values(by=['VIF', 'P-value'], ascending=[False, False])
    return vif

vif(X_train_resampled, logr_fit, rfe_selected_columns)
                   VIF  P-value
Features
std_ic_t2f_mou_8  1.66   0.0000
sachet_2g_6_0     1.64   0.0000
sachet_2g_7_0     1.57   0.0000
std_ic_t2f_mou_7  1.56   0.0000
monthly_2g_7_0    1.54   0.5524
monthly_3g_7_0    1.54   0.0000
monthly_3g_8_0    1.52   0.0000
monthly_2g_8_0    1.43   0.0000
monthly_2g_6_0    1.38   0.2069
sachet_2g_8_0     1.36   0.0008
total_rech_num_6  1.27   0.0000
total_rech_num_8  1.25   0.0000
std_og_t2f_mou_8  1.20   0.0000
sachet_3g_6_0     1.12   0.0039
loc_ic_t2f_mou_8  1.09   0.0000
  • ‘monthly_2g_7_0’ has a very high p-value (0.55). Hence, this feature can be eliminated.
selected_columns = rfe_selected_columns
selected_columns.remove('monthly_2g_7_0')
selected_columns
['sachet_3g_6_0',
 'sachet_2g_7_0',
 'sachet_2g_8_0',
 'total_rech_num_6',
 'monthly_3g_8_0',
 'monthly_2g_8_0',
 'total_rech_num_8',
 'monthly_2g_6_0',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_7',
 'sachet_2g_6_0',
 'monthly_3g_7_0',
 'loc_ic_t2f_mou_8',
 'std_og_t2f_mou_8']

Model II

logr2 = sm.GLM(y_train_resampled, sm.add_constant(X_train_resampled[selected_columns]), family=sm.families.Binomial())
logr2_fit = logr2.fit()
logr2_fit.summary()
Generalized Linear Model Regression Results
Dep. Variable: Churn No. Observations: 38374
Model: GLM Df Residuals: 38359
Model Family: Binomial Df Model: 14
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -19485.
Date: Mon, 30 Nov 2020 Deviance: 38970.
Time: 21:57:09 Pearson chi2: 2.80e+05
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.2335 0.015 -15.662 0.000 -0.263 -0.204
sachet_3g_6_0 -0.0395 0.014 -2.881 0.004 -0.066 -0.013
sachet_2g_7_0 -0.0982 0.016 -6.217 0.000 -0.129 -0.067
sachet_2g_8_0 0.0491 0.015 3.372 0.001 0.021 0.078
total_rech_num_6 0.6049 0.017 35.566 0.000 0.572 0.638
monthly_3g_8_0 0.4000 0.017 23.521 0.000 0.367 0.433
monthly_2g_8_0 0.3733 0.016 22.696 0.000 0.341 0.406
total_rech_num_8 -1.2012 0.019 -62.375 0.000 -1.239 -1.163
monthly_2g_6_0 -0.0163 0.014 -1.128 0.259 -0.045 0.012
std_ic_t2f_mou_8 -0.3361 0.026 -12.784 0.000 -0.388 -0.285
std_ic_t2f_mou_7 0.1532 0.019 8.136 0.000 0.116 0.190
sachet_2g_6_0 -0.1111 0.016 -6.823 0.000 -0.143 -0.079
monthly_3g_7_0 -0.2098 0.017 -12.633 0.000 -0.242 -0.177
loc_ic_t2f_mou_8 -1.2749 0.038 -33.622 0.000 -1.349 -1.201
std_og_t2f_mou_8 -0.2476 0.021 -11.620 0.000 -0.289 -0.206
# VIF and p-values
vif(X_train_resampled, logr2_fit, selected_columns)
                   VIF  P-value
Features
std_ic_t2f_mou_8  1.66   0.0000
sachet_2g_6_0     1.63   0.0000
sachet_2g_7_0     1.57   0.0000
std_ic_t2f_mou_7  1.56   0.0000
monthly_3g_7_0    1.54   0.0000
monthly_3g_8_0    1.52   0.0000
sachet_2g_8_0     1.36   0.0007
total_rech_num_6  1.27   0.0000
total_rech_num_8  1.25   0.0000
monthly_2g_8_0    1.23   0.0000
monthly_2g_6_0    1.21   0.2595
std_og_t2f_mou_8  1.20   0.0000
sachet_3g_6_0     1.12   0.0040
loc_ic_t2f_mou_8  1.09   0.0000
  • ‘monthly_2g_6_0’ has a very high p-value (0.26). Hence, this feature can be eliminated.
selected_columns.remove('monthly_2g_6_0')
selected_columns
['sachet_3g_6_0',
 'sachet_2g_7_0',
 'sachet_2g_8_0',
 'total_rech_num_6',
 'monthly_3g_8_0',
 'monthly_2g_8_0',
 'total_rech_num_8',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_7',
 'sachet_2g_6_0',
 'monthly_3g_7_0',
 'loc_ic_t2f_mou_8',
 'std_og_t2f_mou_8']

Model III

logr3 = sm.GLM(y_train_resampled, sm.add_constant(X_train_resampled[selected_columns]), family=sm.families.Binomial())
logr3_fit = logr3.fit()
logr3_fit.summary()
Generalized Linear Model Regression Results
Dep. Variable: Churn No. Observations: 38374
Model: GLM Df Residuals: 38360
Model Family: Binomial Df Model: 13
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -19486.
Date: Mon, 30 Nov 2020 Deviance: 38971.
Time: 21:57:10 Pearson chi2: 2.79e+05
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.2336 0.015 -15.667 0.000 -0.263 -0.204
sachet_3g_6_0 -0.0399 0.014 -2.916 0.004 -0.067 -0.013
sachet_2g_7_0 -0.0987 0.016 -6.249 0.000 -0.130 -0.068
sachet_2g_8_0 0.0488 0.015 3.354 0.001 0.020 0.077
total_rech_num_6 0.6053 0.017 35.581 0.000 0.572 0.639
monthly_3g_8_0 0.3994 0.017 23.494 0.000 0.366 0.433
monthly_2g_8_0 0.3666 0.015 23.953 0.000 0.337 0.397
total_rech_num_8 -1.2033 0.019 -62.720 0.000 -1.241 -1.166
std_ic_t2f_mou_8 -0.3363 0.026 -12.788 0.000 -0.388 -0.285
std_ic_t2f_mou_7 0.1532 0.019 8.137 0.000 0.116 0.190
sachet_2g_6_0 -0.1108 0.016 -6.810 0.000 -0.143 -0.079
monthly_3g_7_0 -0.2099 0.017 -12.640 0.000 -0.242 -0.177
loc_ic_t2f_mou_8 -1.2736 0.038 -33.621 0.000 -1.348 -1.199
std_og_t2f_mou_8 -0.2474 0.021 -11.617 0.000 -0.289 -0.206
# VIF and p-values
vif(X_train_resampled, logr3_fit, selected_columns)
                   VIF  P-value
Features
std_ic_t2f_mou_8  1.66   0.0000
sachet_2g_6_0     1.63   0.0000
sachet_2g_7_0     1.57   0.0000
std_ic_t2f_mou_7  1.56   0.0000
monthly_3g_7_0    1.54   0.0000
monthly_3g_8_0    1.52   0.0000
sachet_2g_8_0     1.36   0.0008
total_rech_num_6  1.27   0.0000
total_rech_num_8  1.24   0.0000
std_og_t2f_mou_8  1.20   0.0000
sachet_3g_6_0     1.12   0.0035
loc_ic_t2f_mou_8  1.09   0.0000
monthly_2g_8_0    1.03   0.0000
  • All features now have low p-values (< 0.05) and low VIFs (< 5).
  • This model can serve as the interpretable logistic regression model (a sketch of automating the elimination loop follows).
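The manual elimination above follows a simple rule: refit, drop the feature with the worst p-value (or, failing that, the worst VIF), and repeat. A hedged sketch of automating it — the helper name and thresholds are assumptions, not part of the original notebook:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def backward_eliminate(X, y, cols, p_thresh=0.05, vif_thresh=5.0):
    # Refit a binomial GLM, drop the single worst feature, repeat until clean
    cols = list(cols)
    while True:
        fit = sm.GLM(y, sm.add_constant(X[cols]),
                     family=sm.families.Binomial()).fit()
        pvals = fit.pvalues.drop('const')
        vifs = pd.Series([variance_inflation_factor(X[cols].values, i)
                          for i in range(len(cols))], index=cols)
        if pvals.max() <= p_thresh and vifs.max() <= vif_thresh:
            return cols, fit
        # eliminate on p-value first, then on VIF
        cols.remove(pvals.idxmax() if pvals.max() > p_thresh else vifs.idxmax())

Calling backward_eliminate(X_train_resampled, y_train_resampled, rfe_selected_columns) should reproduce the two manual rounds above under the same thresholds.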

Final Logistic Regression Model with RFE and Manual Elimination

logr3_fit.summary()
Generalized Linear Model Regression Results
Dep. Variable: Churn No. Observations: 38374
Model: GLM Df Residuals: 38360
Model Family: Binomial Df Model: 13
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -19486.
Date: Mon, 30 Nov 2020 Deviance: 38971.
Time: 21:57:10 Pearson chi2: 2.79e+05
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.2336 0.015 -15.667 0.000 -0.263 -0.204
sachet_3g_6_0 -0.0399 0.014 -2.916 0.004 -0.067 -0.013
sachet_2g_7_0 -0.0987 0.016 -6.249 0.000 -0.130 -0.068
sachet_2g_8_0 0.0488 0.015 3.354 0.001 0.020 0.077
total_rech_num_6 0.6053 0.017 35.581 0.000 0.572 0.639
monthly_3g_8_0 0.3994 0.017 23.494 0.000 0.366 0.433
monthly_2g_8_0 0.3666 0.015 23.953 0.000 0.337 0.397
total_rech_num_8 -1.2033 0.019 -62.720 0.000 -1.241 -1.166
std_ic_t2f_mou_8 -0.3363 0.026 -12.788 0.000 -0.388 -0.285
std_ic_t2f_mou_7 0.1532 0.019 8.137 0.000 0.116 0.190
sachet_2g_6_0 -0.1108 0.016 -6.810 0.000 -0.143 -0.079
monthly_3g_7_0 -0.2099 0.017 -12.640 0.000 -0.242 -0.177
loc_ic_t2f_mou_8 -1.2736 0.038 -33.621 0.000 -1.348 -1.199
std_og_t2f_mou_8 -0.2474 0.021 -11.617 0.000 -0.289 -0.206
selected_columns
['sachet_3g_6_0',
 'sachet_2g_7_0',
 'sachet_2g_8_0',
 'total_rech_num_6',
 'monthly_3g_8_0',
 'monthly_2g_8_0',
 'total_rech_num_8',
 'std_ic_t2f_mou_8',
 'std_ic_t2f_mou_7',
 'sachet_2g_6_0',
 'monthly_3g_7_0',
 'loc_ic_t2f_mou_8',
 'std_og_t2f_mou_8']
# Prediction
y_train_pred_lr = logr3_fit.predict(sm.add_constant(X_train_resampled[selected_columns]))
y_train_pred_lr.head()
0    0.118916
1    0.343873
2    0.381230
3    0.015277
4    0.001595
dtype: float64
y_test_pred_lr = logr3_fit.predict(sm.add_constant(X_test[selected_columns]))
y_test_pred_lr.head()
mobile_number
7002242818    0.013556
7000517161    0.903162
7002162382    0.247123
7002152271    0.330787
7002058655    0.056105
dtype: float64

Performance

Finding Optimum Probability Cutoff

# Specificity / Sensitivity trade-off

# Classification at probability thresholds between 0 and 1
y_train_pred_thres = pd.DataFrame(index=X_train_resampled.index)
thresholds = [float(x)/10 for x in range(10)]

def thresholder(x, thresh):
    if x > thresh:
        return 1
    else:
        return 0

for i in thresholds:
    y_train_pred_thres[i] = y_train_pred_lr.map(lambda x: thresholder(x, i))
y_train_pred_thres.head()
   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
0    1    1    0    0    0    0    0    0    0    0
1    1    1    1    1    0    0    0    0    0    0
2    1    1    1    1    0    0    0    0    0    0
3    1    0    0    0    0    0    0    0    0    0
4    1    0    0    0    0    0    0    0    0    0
# DataFrame of performance metrics at each threshold
logr_metrics_df = pd.DataFrame(columns=['sensitivity', 'specificity', 'accuracy'])
for thres, column in zip(thresholds, y_train_pred_thres.columns.to_list()):
    confusion = confusion_matrix(y_train_resampled, y_train_pred_thres.loc[:, column])
    sensitivity, specificity, accuracy = model_metrics_thres(confusion)
    logr_metrics_df = logr_metrics_df.append({
        'sensitivity': sensitivity,
        'specificity': specificity,
        'accuracy': accuracy
    }, ignore_index=True)

logr_metrics_df.index = thresholds
logr_metrics_df
     sensitivity  specificity  accuracy
0.0        1.000        0.000     0.500
0.1        0.976        0.224     0.600
0.2        0.947        0.351     0.649
0.3        0.916        0.472     0.694
0.4        0.864        0.598     0.731
0.5        0.794        0.722     0.758
0.6        0.703        0.841     0.772
0.7        0.550        0.930     0.740
0.8        0.310        0.975     0.642
0.9        0.095        0.994     0.544
logr_metrics_df.plot(kind='line', figsize=(24,8), grid=True, xticks=np.arange(0,1,0.02),
                     title='Specificity-Sensitivity TradeOff');

[plot: Specificity-Sensitivity trade-off for the RFE logistic regression model]

  • The optimum probability cutoff for the logistic regression model is 0.53, read off the plot above (a sketch for computing it numerically follows).
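A minimal sketch that recovers the cutoff by scanning a finer grid and picking the threshold where sensitivity and specificity are closest (the 0.01 step is an assumption, reusing model_metrics_thres from above):

# Scan thresholds and minimise the sensitivity/specificity gap
gaps = {}
for t in np.arange(0.01, 1.0, 0.01):
    preds = (y_train_pred_lr > t).astype(int)
    sens, spec, _ = model_metrics_thres(confusion_matrix(y_train_resampled, preds))
    gaps[round(t, 2)] = abs(sens - spec)
print('Optimum cutoff :', min(gaps, key=gaps.get))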
optimum_cutoff = 0.53
y_train_pred_lr_final = y_train_pred_lr.map(lambda x: 1 if x > optimum_cutoff else 0)
y_test_pred_lr_final = y_test_pred_lr.map(lambda x: 1 if x > optimum_cutoff else 0)

train_matrix = confusion_matrix(y_train_resampled, y_train_pred_lr_final)
print('Confusion Matrix for train:\n', train_matrix)
test_matrix = confusion_matrix(y_test, y_test_pred_lr_final)
print('\nConfusion Matrix for test: \n', test_matrix)
Confusion Matrix for train:
 [[14531  4656]
 [ 4411 14776]]

Confusion Matrix for test:
 [[6313 1918]
 [ 191  582]]
print('Train Performance: \n')
model_metrics(train_matrix)

print('\n\nTest Performance : \n')
model_metrics(test_matrix)
Train Performance:

Accuracy : 0.764
Sensitivity / True Positive Rate / Recall : 0.77
Specificity / True Negative Rate : 0.757
Precision / Positive Predictive Value : 0.76
F1-score : 0.765


Test Performance :

Accuracy : 0.766
Sensitivity / True Positive Rate / Recall : 0.753
Specificity / True Negative Rate : 0.767
Precision / Positive Predictive Value : 0.233
F1-score : 0.356
# ROC-AUC score
print('ROC AUC score for Train : ', round(roc_auc_score(y_train_resampled, y_train_pred_lr), 3), '\n')
print('ROC AUC score for Test : ', round(roc_auc_score(y_test, y_test_pred_lr), 3))
ROC AUC score for Train :  0.843

ROC AUC score for Test :  0.828

Model 1 : Logistic Regression (Interpretable Model Summary)

lr_summary_html = logr3_fit.summary().tables[1].as_html()
lr_results = pd.read_html(lr_summary_html, header=0, index_col=0)[0]
coef_column = lr_results.columns[0]
print('Most important predictors of Churn , in order of importance and their coefficients are as follows : \n')
lr_results.sort_values(by=coef_column, key=lambda x: abs(x), ascending=False)['coef']
Most important predictors of Churn , in order of importance and their coefficients are as follows :

loc_ic_t2f_mou_8   -1.2736
total_rech_num_8   -1.2033
total_rech_num_6    0.6053
monthly_3g_8_0      0.3994
monthly_2g_8_0      0.3666
std_ic_t2f_mou_8   -0.3363
std_og_t2f_mou_8   -0.2474
const              -0.2336
monthly_3g_7_0     -0.2099
std_ic_t2f_mou_7    0.1532
sachet_2g_6_0      -0.1108
sachet_2g_7_0      -0.0987
sachet_2g_8_0       0.0488
sachet_3g_6_0      -0.0399
Name: coef, dtype: float64
  • The above model could be used as the interpretable model for predicting telecom churn; since the predictors are standardized, exponentiating the coefficients turns them into odds ratios, sketched below.
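Each coefficient is the change in log-odds of churn per one-standard-deviation increase in that predictor; exponentiating gives odds ratios. A short interpretation sketch (an addition, reusing lr_results from above):

# Odds ratio per one-SD increase in each predictor
odds_ratios = np.exp(lr_results['coef'].drop('const'))
print(odds_ratios.sort_values())

For instance, a one-SD increase in total_rech_num_8 multiplies the churn odds by exp(-1.2033) ≈ 0.30, i.e. customers who recharged more often in August were far less likely to churn.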

PCA

from sklearn.decomposition import PCA
pca = PCA(random_state=42)
pca.fit(X_train) # note that PCA is fit on the original train set rather than the resampled train set
pca.components_
array([[ 1.64887430e-01,  1.93987506e-01,  1.67239205e-01, ...,
         1.43967238e-06, -1.55704675e-06, -1.88892194e-06],
       [ 6.48591961e-02,  9.55966684e-02,  1.20775174e-01, ...,
        -2.12841595e-06, -1.47944145e-06, -3.90881587e-07],
       [ 2.38415388e-01,  2.73645507e-01,  2.38436263e-01, ...,
        -1.25598531e-06, -4.37900299e-07,  6.19889336e-07],
       ...,
       [ 1.68015588e-06,  1.93600851e-06, -1.82065762e-06, ...,
         4.25473944e-03,  2.56738368e-03,  3.51118176e-03],
       [ 0.00000000e+00, -1.11533905e-16,  1.57807487e-16, ...,
         1.73764144e-15,  6.22907679e-16,  1.45339158e-16],
       [ 0.00000000e+00,  4.98537742e-16, -6.02718139e-16, ...,
         1.27514583e-15,  1.25772226e-15,  3.41773342e-16]])
pca.explained_variance_ratio_
array([2.72067612e-01, 1.62438240e-01, 1.20827535e-01, 1.06070063e-01,
       9.11349433e-02, 4.77504400e-02, 2.63978655e-02, 2.56843982e-02,
       1.91789343e-02, 1.68045932e-02, 1.55523468e-02, 1.31676589e-02,
       1.04552128e-02, 7.72970448e-03, 7.22746863e-03, 6.14494838e-03,
       5.62073089e-03, 5.44579273e-03, 4.59009989e-03, 4.38488162e-03,
       3.46703626e-03, 3.27941490e-03, 2.78099200e-03, 2.13444270e-03,
       2.07542043e-03, 1.89794720e-03, 1.41383936e-03, 1.30240760e-03,
       1.15369576e-03, 1.05262500e-03, 9.64293417e-04, 9.16686049e-04,
       8.84067044e-04, 7.62966236e-04, 6.61794767e-04, 5.69667265e-04,
       5.12585166e-04, 5.04441248e-04, 4.82396680e-04, 4.46889495e-04,
       4.36441254e-04, 4.10389488e-04, 3.51844810e-04, 3.12626195e-04,
       2.51673027e-04, 2.34723896e-04, 1.96950034e-04, 1.71296745e-04,
       1.59882693e-04, 1.48330353e-04, 1.45919483e-04, 1.08583729e-04,
       1.04038518e-04, 8.90621848e-05, 8.53009223e-05, 7.60704088e-05,
       7.57150133e-05, 6.16615717e-05, 6.07777411e-05, 5.70517541e-05,
       5.36161089e-05, 5.28495367e-05, 5.14887086e-05, 4.73768570e-05,
       4.71283394e-05, 4.11523975e-05, 4.10392906e-05, 2.86090257e-05,
       2.19793282e-05, 1.58203581e-05, 1.50969788e-05, 1.42865579e-05,
       1.34537530e-05, 1.33026062e-05, 1.10239870e-05, 8.27539516e-06,
       7.55845974e-06, 6.45372276e-06, 6.22570067e-06, 3.42288900e-06,
       3.20804681e-06, 3.09270863e-06, 2.86608967e-06, 2.44898003e-06,
       2.08230568e-06, 1.85144734e-06, 1.64714248e-06, 1.45630245e-06,
       1.35265729e-06, 1.05472047e-06, 9.89133015e-07, 8.65864423e-07,
       7.45065121e-07, 3.66727807e-07, 6.49277820e-08, 6.13357428e-08,
       4.35995018e-08, 2.28152900e-08, 2.00441141e-08, 1.84235145e-08,
       1.66102335e-08, 1.47870989e-08, 1.23390691e-08, 1.12094165e-08,
       1.09702422e-08, 9.51924270e-09, 8.61596309e-09, 7.38051070e-09,
       7.15370081e-09, 6.29095319e-09, 5.00739371e-09, 4.68791660e-09,
       4.23376173e-09, 4.04558169e-09, 3.75847771e-09, 3.71213838e-09,
       3.32806929e-09, 3.23527525e-09, 3.12734302e-09, 2.82062311e-09,
       2.72602311e-09, 2.66103741e-09, 2.46562734e-09, 2.20243536e-09,
       2.15044476e-09, 1.59498492e-09, 1.47087974e-09, 1.06159357e-09,
       9.33938436e-10, 8.10080735e-10, 8.04656028e-10, 6.12994365e-10,
       4.82074297e-10, 4.02577318e-10, 3.58059984e-10, 3.28374076e-10,
       3.03687605e-10, 7.12091816e-11, 6.13978255e-11, 1.04375208e-33,
       1.04375208e-33])

Scree Plot

var_cum = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(20,8))
sns.set_style('darkgrid')
sns.lineplot(np.arange(1, len(var_cum) + 1), var_cum)
plt.xticks(np.arange(0,140,5))
plt.axhline(0.95, color='r')
plt.axhline(1.0, color='r')
plt.axvline(15, color='b')
plt.axvline(45, color='b')
plt.text(10, 0.96, '0.95')

plt.title('Scree Plot of Telecom Churn Train Set');

[plot: scree plot — cumulative explained variance vs number of principal components]

  • From the above scree plot, about 95% of the variance in the train set is explained by the first 16 principal components, and effectively 100% by the first 45 (a sketch for computing these counts directly follows).
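A small addition that computes the component counts instead of reading them off the plot:

# Smallest number of components whose cumulative variance reaches 95%
n_95 = int(np.argmax(var_cum >= 0.95)) + 1
print('Components for 95% variance :', n_95)

# PCA can also choose the count itself when given a variance fraction
pca_95 = PCA(n_components=0.95, random_state=42).fit(X_train)
print('PCA-chosen components :', pca_95.n_components_)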
# Perform PCA using the first 45 components
pca_final = PCA(n_components=45, random_state=42)
transformed_data = pca_final.fit_transform(X_train)
X_train_pca = pd.DataFrame(transformed_data, columns=["PC_"+str(x) for x in range(1,46)], index=X_train.index)
data_train_pca = pd.concat([X_train_pca, y_train], axis=1)

data_train_pca.head()
[output: data_train_pca.head() — first five rows of the 45 principal components (PC_1 … PC_45) plus Churn, indexed by mobile_number]
## Plotting principal components
sns.pairplot(data=data_train_pca, x_vars=["PC_1"], y_vars=["PC_2"], hue="Churn", height=8);

[plot: PC_1 vs PC_2 scatter, coloured by Churn]

Model 2 : PCA + Logistic Regression Model

# X, y split
y_train_pca = data_train_pca.pop('Churn')
X_train_pca = data_train_pca

# Transforming the test set with PCA (45 components)
X_test_pca = pca_final.transform(X_test)

# Logistic Regression
lr_pca = LogisticRegression(random_state=100, class_weight='balanced')
lr_pca.fit(X_train_pca, y_train_pca)
LogisticRegression(class_weight='balanced', random_state=100)
# y_train predictions
y_train_pred_lr_pca = lr_pca.predict(X_train_pca)
y_train_pred_lr_pca[:5]
array([1, 0, 0, 0, 0])
# Test prediction
X_test_pca = pca_final.transform(X_test)
y_test_pred_lr_pca = lr_pca.predict(X_test_pca)
y_test_pred_lr_pca[:5]
array([1, 1, 1, 1, 1])

Baseline Performance

train_matrix = confusion_matrix(y_train, y_train_pred_lr_pca)
test_matrix = confusion_matrix(y_test, y_test_pred_lr_pca)

print('Train Performance :\n')
model_metrics(train_matrix)

print('\nTest Performance :\n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.645
Sensitivity / True Positive Rate / Recall : 0.905
Specificity / True Negative Rate : 0.62
Precision / Positive Predictive Value : 0.184
F1-score : 0.306

Test Performance :

Accuracy : 0.086
Sensitivity / True Positive Rate / Recall : 1.0
Specificity / True Negative Rate : 0.0
Precision / Positive Predictive Value : 0.086
F1-score : 0.158
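  • A test specificity of 0 with sensitivity of 1 means every test customer is being labelled a churner. One plausible explanation (an observation, not from the original write-up): the StandardScaler was fit on the resampled train set and applied to X_test, while PCA and this model were fit on the unscaled X_train, so train and test reach the model on different scales.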

Hyperparameter Tuning

# Creating a Logistic Regression model on the PCA-transformed train set
lr_pca = LogisticRegression(random_state=100, class_weight='balanced')
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, StratifiedKFold
params = {
    'penalty': ['l1', 'l2', 'none'],
    'C': [0, 1, 2, 3, 4, 5, 10, 50]
}
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=100)

search = GridSearchCV(cv=folds, estimator=lr_pca, param_grid=params, scoring='roc_auc', verbose=True, n_jobs=-1)
search.fit(X_train_pca, y_train_pca)
Fitting 4 folds for each of 24 candidates, totalling 96 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 4.0s
[Parallel(n_jobs=-1)]: Done 96 out of 96 | elapsed: 6.9s finished

GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=100, shuffle=True),
             estimator=LogisticRegression(class_weight='balanced',
                                          random_state=100),
             n_jobs=-1,
             param_grid={'C': [0, 1, 2, 3, 4, 5, 10, 50],
                         'penalty': ['l1', 'l2', 'none']},
             scoring='roc_auc', verbose=True)
# Optimum hyperparameters
print('Best ROC-AUC score :', search.best_score_)
print('Best Parameters :', search.best_params_)
Best ROC-AUC score : 0.8763924253372933
Best Parameters : {'C': 0, 'penalty': 'none'}
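  • With penalty='none', scikit-learn fits an unpenalized model and ignores C, so the reported best C of 0 is incidental rather than a meaningful regularization strength.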
# Modelling using the best LR-PCA estimator
lr_pca_best = search.best_estimator_
lr_pca_best_fit = lr_pca_best.fit(X_train_pca, y_train_pca)

# Prediction on train set
y_train_pred_lr_pca_best = lr_pca_best_fit.predict(X_train_pca)
y_train_pred_lr_pca_best[:5]
array([1, 1, 0, 0, 0])
# Prediction on test set
y_test_pred_lr_pca_best = lr_pca_best_fit.predict(X_test_pca)
y_test_pred_lr_pca_best[:5]
array([1, 1, 1, 1, 1])
## Model performance after hyperparameter tuning

train_matrix = confusion_matrix(y_train, y_train_pred_lr_pca_best)
test_matrix = confusion_matrix(y_test, y_test_pred_lr_pca_best)

print('Train Performance :\n')
model_metrics(train_matrix)

print('\nTest Performance :\n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.627
Sensitivity / True Positive Rate / Recall : 0.918
Specificity / True Negative Rate : 0.599
Precision / Positive Predictive Value : 0.179
F1-score : 0.3

Test Performance :

Accuracy : 0.086
Sensitivity / True Positive Rate / Recall : 1.0
Specificity / True Negative Rate : 0.0
Precision / Positive Predictive Value : 0.086
F1-score : 0.158

Model 3 : PCA + Random Forest

from sklearn.ensemble import RandomForestClassifier

# creating a random forest classifier using the PCA output
pca_rf = RandomForestClassifier(random_state=42,
                                class_weight={0: class_1/(class_0 + class_1), 1: class_0/(class_0 + class_1)},
                                oob_score=True, n_jobs=-1, verbose=1)
pca_rf
RandomForestClassifier(class_weight={0: 0.08640165272733331,
                                     1: 0.9135983472726666},
                       n_jobs=-1, oob_score=True, random_state=42, verbose=1)
# Hyperparameter tuning
params = {
    'n_estimators': [30, 40, 50, 100],
    'max_depth': [3, 4, 5, 6, 7],
    'min_samples_leaf': [15, 20, 25, 30]
}
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
pca_rf_model_search = GridSearchCV(estimator=pca_rf, param_grid=params,
                                   cv=folds, scoring='roc_auc', verbose=True, n_jobs=-1)

pca_rf_model_search.fit(X_train_pca, y_train)
Fitting 4 folds for each of 80 candidates, totalling 320 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 23.2s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 2.7min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed: 5.5min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 2.6s finished

GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=42, shuffle=True),
             estimator=RandomForestClassifier(class_weight={0: 0.08640165272733331,
                                                            1: 0.9135983472726666},
                                              n_jobs=-1, oob_score=True,
                                              random_state=42, verbose=1),
             n_jobs=-1,
             param_grid={'max_depth': [3, 4, 5, 6, 7],
                         'min_samples_leaf': [15, 20, 25, 30],
                         'n_estimators': [30, 40, 50, 100]},
             scoring='roc_auc', verbose=True)
# Optimum hyperparameters
print('Best ROC-AUC score :', pca_rf_model_search.best_score_)
print('Best Parameters :', pca_rf_model_search.best_params_)
Best ROC-AUC score : 0.8861621751601011
Best Parameters : {'max_depth': 7, 'min_samples_leaf': 20, 'n_estimators': 100}
# Modelling using the best PCA-RandomForest estimator
pca_rf_best = pca_rf_model_search.best_estimator_
pca_rf_best_fit = pca_rf_best.fit(X_train_pca, y_train)

# Prediction on train set
y_train_pred_pca_rf_best = pca_rf_best_fit.predict(X_train_pca)
y_train_pred_pca_rf_best[:5]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 1.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 2.7s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 0.1s finished

array([0, 0, 0, 0, 0])
# Prediction on test set
y_test_pred_pca_rf_best = pca_rf_best_fit.predict(X_test_pca)
y_test_pred_pca_rf_best[:5]
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 0.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 0.1s finished

array([0, 0, 0, 0, 0])
## PCA - RandomForest Model Performance - Hyperparameter Tuned

train_matrix = confusion_matrix(y_train, y_train_pred_pca_rf_best)
test_matrix = confusion_matrix(y_test, y_test_pred_pca_rf_best)

print('Train Performance :\n')
model_metrics(train_matrix)

print('\nTest Performance :\n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.882
Sensitivity / True Positive Rate / Recall : 0.816
Specificity / True Negative Rate : 0.888
Precision / Positive Predictive Value : 0.408
F1-score : 0.544

Test Performance :

Accuracy : 0.914
Sensitivity / True Positive Rate / Recall : 0.0
Specificity / True Negative Rate : 1.0
Precision / Positive Predictive Value : nan
F1-score : nan

(The test numbers are the mirror image of the logistic model's: the tuned forest predicts no churners at all on the test set, so sensitivity collapses to 0 and precision is undefined.)
## Out-of-bag score (oob_score_ reports accuracy on out-of-bag samples, not an error rate)
pca_rf_best_fit.oob_score_
0.8625220164707003

Model 4 : PCA + XGBoost

import xgboost as xgb

# scale_pos_weight takes care of class imbalance
pca_xgb = xgb.XGBClassifier(random_state=42, scale_pos_weight=class_0/class_1,
                            tree_method='hist',
                            objective='binary:logistic')
pca_xgb.fit(X_train_pca, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=10.573852680293097,
              subsample=1, tree_method='hist', validate_parameters=1,
              verbosity=None)
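The scale_pos_weight of ~10.57 visible in the repr is simply the negative-to-positive ratio, so each churner counts for about ten and a half non-churners in the loss. A quick check, reusing the class counts computed earlier:

# hypothetical check: scale_pos_weight = (# non-churners) / (# churners)
print(class_0 / class_1)   # ~10.57, matching the repr above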
print('Baseline Train AUC Score')
roc_auc_score(y_train, pca_xgb.predict_proba(X_train_pca)[:, 1])
Baseline Train AUC Score
0.9999996277241286
print('Baseline Test AUC Score')
roc_auc_score(y_test, pca_xgb.predict_proba(X_test_pca)[:, 1])
Baseline Test AUC Score
0.46093390352284136

A near-perfect train AUC against a worse-than-random test AUC is a clear sign of severe overfitting, which the grid search below tackles with shallower, more heavily regularized trees.
## Hyperparameter tuning
parameters = {
    'learning_rate': [0.1, 0.2, 0.3],
    'gamma': [10, 20, 50],
    'max_depth': [2, 3, 4],
    'min_child_weight': [25, 50],
    'n_estimators': [150, 200, 500]}
pca_xgb_search = GridSearchCV(estimator=pca_xgb, param_grid=parameters, scoring='roc_auc',
                              cv=folds, n_jobs=-1, verbose=1)
pca_xgb_search.fit(X_train_pca, y_train)
Fitting 4 folds for each of 162 candidates, totalling 648 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 28.3s
[Parallel(n_jobs=-1)]: Done 192 tasks | elapsed: 2.1min
[Parallel(n_jobs=-1)]: Done 442 tasks | elapsed: 4.8min
[Parallel(n_jobs=-1)]: Done 648 out of 648 | elapsed: 8.0min finished

GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=42, shuffle=True),
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0, gpu_id=-1,
                                     importance_type='gain',
                                     interaction_constraints='',
                                     learning_rate=0.300000012,
                                     max_delta_step=0, max_depth=6,
                                     min_child_weight=1, missing=nan,
                                     monotone_...
                                     n_estimators=100, n_jobs=0,
                                     num_parallel_tree=1, random_state=42,
                                     reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=10.573852680293097,
                                     subsample=1, tree_method='hist',
                                     validate_parameters=1, verbosity=None),
             n_jobs=-1,
             param_grid={'gamma': [10, 20, 50],
                         'learning_rate': [0.1, 0.2, 0.3],
                         'max_depth': [2, 3, 4], 'min_child_weight': [25, 50],
                         'n_estimators': [150, 200, 500]},
             scoring='roc_auc', verbose=1)
# Optimum hyperparameters
print('Best ROC-AUC score :', pca_xgb_search.best_score_)
print('Best Parameters :', pca_xgb_search.best_params_)
Best ROC-AUC score : 0.8955777259491308
Best Parameters : {'gamma': 10, 'learning_rate': 0.1, 'max_depth': 2, 'min_child_weight': 50, 'n_estimators': 500}

As expected, the search lands on strongly regularized settings: very shallow trees (max_depth 2), a high split penalty (gamma 10) and a large min_child_weight, offset by more estimators at a lower learning rate.
# Modelling using the best PCA-XGBoost estimator
# (GridSearchCV already refits the best estimator on the full training set by
#  default, so this explicit fit is redundant but harmless)
pca_xgb_best = pca_xgb_search.best_estimator_
pca_xgb_best_fit = pca_xgb_best.fit(X_train_pca, y_train)

# Prediction on train set
y_train_pred_pca_xgb_best = pca_xgb_best_fit.predict(X_train_pca)
y_train_pred_pca_xgb_best[:5]
array([0, 0, 0, 0, 0])
X_train_pca.head()

[Output truncated: the first five rows of X_train_pca, with the 45 principal components PC_1 to PC_45 as columns, indexed by mobile_number.]
# Prediction on test set
# transform (not fit) the test set with the PCA already fitted on the train set
X_test_pca = pca_final.transform(X_test)
X_test_pca = pd.DataFrame(X_test_pca, index=X_test.index, columns=X_train_pca.columns)
y_test_pred_pca_xgb_best = pca_xgb_best_fit.predict(X_test_pca)
y_test_pred_pca_xgb_best[:5]
array([1, 1, 1, 1, 1])
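With .predict(), everything hinges on the default 0.5 probability cut-off, which is rarely the right operating point for a class-weighted model on imbalanced data. A hedged sketch of sweeping the threshold on the predicted probabilities instead:

# hypothetical: trade sensitivity against specificity by moving the threshold
probs = pca_xgb_best_fit.predict_proba(X_test_pca)[:, 1]
for t in [0.3, 0.5, 0.7, 0.9]:
    preds = (probs >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(t, 'sensitivity:', round(tp / (tp + fn), 3),
             'specificity:', round(tn / (tn + fp), 3))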
## PCA - XGBoost [Hyperparameter Tuned] Model Performance

train_matrix = confusion_matrix(y_train, y_train_pred_pca_xgb_best)
test_matrix = confusion_matrix(y_test, y_test_pred_pca_xgb_best)

print('Train Performance :\n')
model_metrics(train_matrix)

print('\nTest Performance :\n')
model_metrics(test_matrix)
Train Performance :

Accuracy : 0.873
Sensitivity / True Positive Rate / Recall : 0.887
Specificity / True Negative Rate : 0.872
Precision / Positive Predictive Value : 0.396
F1-score : 0.548

Test Performance :

Accuracy : 0.086
Sensitivity / True Positive Rate / Recall : 1.0
Specificity / True Negative Rate : 0.0
Precision / Positive Predictive Value : 0.086
F1-score : 0.158

(As with the tuned logistic model, these test metrics correspond to predicting churn for every test customer.)
## PCA - XGBoost [Hyperparameter Tuned] Model Performance - AUC Scores
print('Train AUC Score')
print(roc_auc_score(y_train, pca_xgb_best.predict_proba(X_train_pca)[:, 1]))
print('Test AUC Score')
print(roc_auc_score(y_test, pca_xgb_best.predict_proba(X_test_pca)[:, 1]))
Train AUC Score
0.9442462043611259
Test AUC Score
0.6353301334697982

Recommendations

print('Most Important Predictors of churn, in order of importance :')
lr_results.sort_values(by=coef_column, key=lambda x: abs(x), ascending=False)['coef']
Most Important Predictors of churn, in order of importance :

loc_ic_t2f_mou_8    -1.2736
total_rech_num_8    -1.2033
total_rech_num_6     0.6053
monthly_3g_8_0       0.3994
monthly_2g_8_0       0.3666
std_ic_t2f_mou_8    -0.3363
std_og_t2f_mou_8    -0.2474
const               -0.2336
monthly_3g_7_0      -0.2099
std_ic_t2f_mou_7     0.1532
sachet_2g_6_0       -0.1108
sachet_2g_7_0       -0.0987
sachet_2g_8_0        0.0488
sachet_3g_6_0       -0.0399
Name: coef, dtype: float64
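Since the features were standardized before modelling, each coefficient is the change in the log-odds of churn for a one standard deviation increase in that feature. Converting the largest coefficient into an odds ratio makes this concrete (an illustrative calculation, not from the original notebook):

import numpy as np

# a 1 std-dev increase in loc_ic_t2f_mou_8 multiplies the odds of churn
# by exp(-1.2736) ~ 0.28, i.e. cuts them by roughly 72%
print(np.exp(-1.2736))   # 0.2798...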

From the above, the following are the strongest indicators of churn :

  • Customers who churn show 1.27 standard deviations lower average monthly local incoming calls from fixed lines in the action period (month 8) compared to users who don't churn, all other factors held constant. This is the strongest indicator of churn.
  • Customers who churn make 1.20 standard deviations fewer recharges in the action period, all other factors held constant. This is the second strongest indicator of churn.
  • Further, customers who churn recharge 0.61 standard deviations more often in the good phase (month 6) than non-churn customers. Coupled with the two factors above, this is a good indicator of churn.
  • Customers who churn are more likely to be on the 'monthly 2g package-0 / monthly 3g package-0' plans in the action period (roughly 0.37 to 0.4 standard deviations higher than other packages), all other factors held constant.

Based on the above indicators, the recommendations to the telecom company are :

  • Concentrate on users whose incoming calls from fixed lines are around 1.27 standard deviations below average. They are the most likely to churn.
  • Concentrate on users who recharge markedly fewer times than average (around 1.2 standard deviations lower) in the 8th month. They are the second most likely to churn.
  • Models with high sensitivity are best for predicting churn. Use the PCA + Logistic Regression model; it has a ROC-AUC score of 0.87 and a test sensitivity of 100%.
