Navigate back to the homepage

Lead Scoring for X Education

Jayanth Boddu
September 7th, 2020 · 14 min read

Photo by Will Myers on Unsplash
This analysis is the combined effort of Umaer and mine.

Lead Scoring for X Education

Problem Statement

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses or fill up a form for the course or watch some videos. When these people fill up a form providing their email address or phone number, they are classified to be a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X education is around 30%.

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if, say, they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most potential leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone. A typical lead conversion process can be represented using the following funnel:

X Education has appointed us to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires us to build a model wherein we need to assign a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

Business Goals

  • X-Education wants to improve their lead conversion.
  • Rather than randomly pursuing leads, the company wants to create a pool of Hot Leads the sales team could focus on.
  • They have tasked us to score their leads betwen 0-100 based on the probability of conversion. 100 being the most likely to convert and 0 being unlikely to convert.

Analysis Approach & Conclusions

Lead scoring is a class probability estimation problem, a form of classification problem. The target variable in the data set has two classes : 0 - Un converted and 1 - Converted. The objective is to model the probability(p) that each lead belongs to the class - Converted. Since there are just two classes - it follows that the probability of belonging to class - Un-Converted is (1-p). The relationship between probability of conversion of each lead and its characteristics is modelled using Logistic Regression. And the leads are scored on a scale of 0-100, 100 being most probable conversion candidate.

The final solution has been provided in two parts. 1. Scoring the leads provided by the company in the order of probability of conversion (0-100) 2. Insights into the relationship between characteristics of a lead and the log-odds probability of conversion that could help the company score leads in the future.

A logistic regression model is created using lead features. To arrive at the list of features which significantly affect conversion probability, a mixed feature elimination approach is followed. 25 most important features are obtained through Recursive Feature Elimination and then reduced to 15 via p-value / VIF approach. The dataset is randomly divided into train and test set. (70 - 30 split).

The final relationship between log Odds of Conversion Probability and lead features is

logOdds(Conversion Probability) = -0.6469 - 1.5426 Do Not Email -1.2699 Unknown Occupation -0.9057 No Specialization -0.8704 Hospitality Management - 0.6584 Outside India + 1.7923 SMS Sent + 1.1749 Other Last Activity + 2.3769 Working Professional - 0.8614 Olark Chat Conversation + 5.3886 Welingak Website + 3.0246 Reference + 1.1876 Olark Chat -1.0250 Landing Page Submission + 1.1253 Total Time Spent on Website + 0.6106 * Email Opened

where Total Time Spent on Website is standardized to $\mu=0,\sigma=1$

Interpreting Top 6 features affecting Conversion Probability :

  • A lead from Welingak Website has 5.4 times higher log odds of conversion than those from Google.
  • Leads through Reference have 3 times higher log odds of conversion than those from Google.
  • Leads from Working Professional have 2.38 times higher log odds of conversion than those from Businessman.
  • Leads with SMS Sent have 1.8 times higher log odds of conversion than those with no SMS sent.
  • Leads with Do Not Email have 1.5 times lesser log odds of conversion compared to leads who would like email updates.
  • Leads with Unknown Occupation have 1.27 times lesser log odds of conversion compared to those from Businessman.

Lead Scores :

  • Score sheet can be generated by running coding in the cell named Score Sheet for X Education cell in the analysis notebook.

At an optimum cut-off probability of 0.36, model performance is as follows.

Model Performance on Training Set :

  • Accuracy : 81.7%
  • Sensitivity / Recall: 80.393 %
  • Specificity : 81.772 %
  • Precision / Positive Predictive Power : 72.924 %
  • False Positive Rate : 18.228 %
  • AUC Score : 0.81

Model Performance for Test Set :

  • Accuracy : 79.593 %
  • Sensitivity / Recall : 77.605 %
  • Specificity : 80.81%
  • Precision / Positive Predictive Power : 71.224 %
  • False Positive Rate : 19.19 %
  • AUC Score : 0.79

KS statistic :

  • Max KS Statistic is 59.76 for 5th decile
  • This model discriminates between Converted and Non-converted leads well since KS Statistic in 4th decile (58.11) is greater than 40%. Hence, this is a reasonably good model.

Gain :

  • Instead of pursuing leads randomly, pursuing the top 40% leads scored by the model would let the sales team reach 80% of leads likely to convert.

Lift :

  • The model outperforms a random model by alteast 2 times in identifying the top 40% potentially convertible leads.
  • As opposed to 10% conversions from 10% leads pursued randomly, pursuing the top 10% leads scored by this model would lead to 24% conversions.

Note :

  • Incorrect data types have been corrected
  • Columns with high missing values have been dropped.
  • Columns which do not explain variability in the model have been dropped.
  • Columns with sales teams notes like Tags where the classes are not mutually exclusive have been dropped.
  • Features with low missing values have been imputed with the most frequent values.
  • Categories in a feature with less than 1% contribution have been grouped together to reduce the number of levels.
  • Inconsistencies in Categories have been corrected.
  • 97.5 % of the leads provided by the company have been used for analysis.
  • Class imbalance = 0.6
  • Indicator variables have been created for all categorical variables with the first category as the reference.
  • Continuous variables have been standardized $\mu : 0 , \sigma = 1$ before modelling.
1import numpy as np
2import pandas as pd
3import matplotlib.pyplot as plt
4
5import seaborn as sns
6sns.set_style('whitegrid')
7
8import warnings
9warnings.filterwarnings('ignore')
10
11!pip install tabulate
12from tabulate import tabulate
13
14import sidetable
15
16!jt -t grade3 -f roboto -fs 12 -cellw 100%
1Requirement already satisfied: tabulate in /Users/jayanth/opt/anaconda3/lib/python3.7/site-packages (0.8.7)
2libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn't close file
1# to table print a dataframe
2def tab(ser) :
3 print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt="psql"))

Importing Data

1# importing the dataset
2leads = pd.read_csv('./leads.csv')
3
4# Inspecting few column heads at a time
5for i in range(0,leads.shape[1], 5) :
6 if i+4 <= leads.shape[1] :
7 print('Columns : ',i,' to ',i+4)
8 else :
9 print('Columns : ',i,' to last')
10 tab(leads.iloc[:,i : i+5].head())
11 print('\n')
1Columns : 0 to 4
2+----+--------------------------------------+---------------+-------------------------+----------------+----------------+
3| | Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email |
4|----+--------------------------------------+---------------+-------------------------+----------------+----------------|
5| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No |
6| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No |
7| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No |
8| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No |
9| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | Google | No |
10+----+--------------------------------------+---------------+-------------------------+----------------+----------------+
11
12
13Columns : 5 to 9
14+----+---------------+-------------+---------------+-------------------------------+------------------------+
15| | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit |
16|----+---------------+-------------+---------------+-------------------------------+------------------------|
17| 0 | No | 0 | 0 | 0 | 0 |
18| 1 | No | 0 | 5 | 674 | 2.5 |
19| 2 | No | 1 | 2 | 1532 | 2 |
20| 3 | No | 0 | 1 | 305 | 1 |
21| 4 | No | 1 | 2 | 1428 | 1 |
22+----+---------------+-------------+---------------+-------------------------------+------------------------+
23
24
25Columns : 10 to 14
26+----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------+
27| | Last Activity | Country | Specialization | How did you hear about X Education | What is your current occupation |
28|----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------|
29| 0 | Page Visited on Website | nan | Select | Select | Unemployed |
30| 1 | Email Opened | India | Select | Select | Unemployed |
31| 2 | Email Opened | India | Business Administration | Select | Student |
32| 3 | Unreachable | India | Media and Advertising | Word Of Mouth | Unemployed |
33| 4 | Converted to Lead | India | Select | Other | Unemployed |
34+----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------+
35
36
37Columns : 15 to 19
38+----+-------------------------------------------------+----------+------------+---------------------+----------------------+
39| | What matters most to you in choosing a course | Search | Magazine | Newspaper Article | X Education Forums |
40|----+-------------------------------------------------+----------+------------+---------------------+----------------------|
41| 0 | Better Career Prospects | No | No | No | No |
42| 1 | Better Career Prospects | No | No | No | No |
43| 2 | Better Career Prospects | No | No | No | No |
44| 3 | Better Career Prospects | No | No | No | No |
45| 4 | Better Career Prospects | No | No | No | No |
46+----+-------------------------------------------------+----------+------------+---------------------+----------------------+
47
48
49Columns : 20 to 24
50+----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------+
51| | Newspaper | Digital Advertisement | Through Recommendations | Receive More Updates About Our Courses | Tags |
52|----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------|
53| 0 | No | No | No | No | Interested in other courses |
54| 1 | No | No | No | No | Ringing |
55| 2 | No | No | No | No | Will revert after reading the email |
56| 3 | No | No | No | No | Ringing |
57| 4 | No | No | No | No | Will revert after reading the email |
58+----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------+
59
60
61Columns : 25 to 29
62+----+------------------+-------------------------------------+-----------------------------+----------------+--------+
63| | Lead Quality | Update me on Supply Chain Content | Get updates on DM Content | Lead Profile | City |
64|----+------------------+-------------------------------------+-----------------------------+----------------+--------|
65| 0 | Low in Relevance | No | No | Select | Select |
66| 1 | nan | No | No | Select | Select |
67| 2 | Might be | No | No | Potential Lead | Mumbai |
68| 3 | Not Sure | No | No | Select | Mumbai |
69| 4 | Might be | No | No | Select | Mumbai |
70+----+------------------+-------------------------------------+-----------------------------+----------------+--------+
71
72
73Columns : 30 to 34
74+----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------+
75| | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque |
76|----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------|
77| 0 | 02.Medium | 02.Medium | 15 | 15 | No |
78| 1 | 02.Medium | 02.Medium | 15 | 15 | No |
79| 2 | 02.Medium | 01.High | 14 | 20 | No |
80| 3 | 02.Medium | 01.High | 13 | 17 | No |
81| 4 | 02.Medium | 01.High | 15 | 18 | No |
82+----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------+
83
84
85Columns : 35 to last
86+----+------------------------------------------+-------------------------+
87| | A free copy of Mastering The Interview | Last Notable Activity |
88|----+------------------------------------------+-------------------------|
89| 0 | No | Modified |
90| 1 | No | Email Opened |
91| 2 | Yes | Email Opened |
92| 3 | No | Modified |
93| 4 | No | Modified |
94+----+------------------------------------------+-------------------------+
1# dataset information
2leads.info()
1<class 'pandas.core.frame.DataFrame'>
2RangeIndex: 9240 entries, 0 to 9239
3Data columns (total 37 columns):
4 # Column Non-Null Count Dtype
5--- ------ -------------- -----
6 0 Prospect ID 9240 non-null object
7 1 Lead Number 9240 non-null int64
8 2 Lead Origin 9240 non-null object
9 3 Lead Source 9204 non-null object
10 4 Do Not Email 9240 non-null object
11 5 Do Not Call 9240 non-null object
12 6 Converted 9240 non-null int64
13 7 TotalVisits 9103 non-null float64
14 8 Total Time Spent on Website 9240 non-null int64
15 9 Page Views Per Visit 9103 non-null float64
16 10 Last Activity 9137 non-null object
17 11 Country 6779 non-null object
18 12 Specialization 7802 non-null object
19 13 How did you hear about X Education 7033 non-null object
20 14 What is your current occupation 6550 non-null object
21 15 What matters most to you in choosing a course 6531 non-null object
22 16 Search 9240 non-null object
23 17 Magazine 9240 non-null object
24 18 Newspaper Article 9240 non-null object
25 19 X Education Forums 9240 non-null object
26 20 Newspaper 9240 non-null object
27 21 Digital Advertisement 9240 non-null object
28 22 Through Recommendations 9240 non-null object
29 23 Receive More Updates About Our Courses 9240 non-null object
30 24 Tags 5887 non-null object
31 25 Lead Quality 4473 non-null object
32 26 Update me on Supply Chain Content 9240 non-null object
33 27 Get updates on DM Content 9240 non-null object
34 28 Lead Profile 6531 non-null object
35 29 City 7820 non-null object
36 30 Asymmetrique Activity Index 5022 non-null object
37 31 Asymmetrique Profile Index 5022 non-null object
38 32 Asymmetrique Activity Score 5022 non-null float64
39 33 Asymmetrique Profile Score 5022 non-null float64
40 34 I agree to pay the amount through cheque 9240 non-null object
41 35 A free copy of Mastering The Interview 9240 non-null object
42 36 Last Notable Activity 9240 non-null object
43dtypes: float64(4), int64(3), object(30)
44memory usage: 2.6+ MB
  • This data set has a total of 9240 records ,each with 36 features.
  • Each record represents the characteristics of a lead and whether the lead was converted.
  • Converted column indicates whether the particular lead was converted to a client. This is our target variable.

Data Cleaning

Incorrect data types

1# 'Converted' is a binary categorical variable but the info shows it is `int64`. Converting to `category` data type
2leads['Converted'] = leads['Converted'].astype('category')

Duplicates

1# Checking for any duplicate leads / prospects
2
3duplicate_prospect_ids = leads['Prospect ID'][leads['Prospect ID'].duplicated()].sum()
4duplicate_lead_no = leads['Lead Number'][leads['Lead Number'].duplicated()].sum()
5print('No of Duplicate Prospect IDs : ', duplicate_prospect_ids)
6print('No of Duplicate Lead Nos : ', duplicate_lead_no)
1No of Duplicate Prospect IDs : 0
2No of Duplicate Lead Nos : 0
  • There are no duplicate prospect IDs or Lead Numbers
  • Since, these are dimensions (i.e identification columns) not required for analysis, they could be popped for re-indentification at a later step.

Separating ID columns

1# Popping Prospect ID and Lead Number columns for later use
2prospect_ids = leads.pop('Prospect ID')
3lead_no = leads.pop('Lead Number')

Missing Values

1# Null values in each Column
2nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
3nulls.columns = ['Null Percentage']
4
5# Sorting null percentages in descending order and highlighting null % > 45
6nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 45 else '')
Null Percentage
Lead Quality51.590909
Asymmetrique Profile Score45.649351
Asymmetrique Activity Score45.649351
Asymmetrique Profile Index45.649351
Asymmetrique Activity Index45.649351
Tags36.287879
Lead Profile29.318182
What matters most to you in choosing a course29.318182
What is your current occupation29.112554
Country26.634199
How did you hear about X Education23.885281
Specialization15.562771
City15.367965
TotalVisits1.482684
Page Views Per Visit1.482684
Last Activity1.114719
Lead Source0.389610
  • More than 45% of the leads have missing values in Lead Quality,Asymmetrique Profile Score, Asymmetrique Activity Score, Asymmetrique Profile Index, Asymmetrique Activity Index
  • Further, the data in these columns is filled by the sales team and the values depend heavily on the team’s judgement. These columns are not good candidates for modelling since the values are subjective.
  • Hence these columns could be dropped.
1# Dropping columns with null percentage > 45
2high_null_col = nulls[nulls['Null Percentage'] >=45].index
3leads.drop(columns=high_null_col, inplace=True)
1# Rows Missing Target Variable
2print('Number of rows with missing Target Variable : ',leads['Converted'].isnull().sum())
1Number of rows with missing Target Variable : 0
  • No rows with missing target value
1# Rows missing more than 50% of values
2highNullRowsCondition = leads.isnull().sum(axis=1)/leads.shape[1] > 0.5
3leads[highNullRowsCondition].index
1Int64Index([], dtype='int64')
  • No rows missing more than 50% of values.

Disguised missing Values

1# Categorical columns
2condition = leads.dtypes == 'object'
3categoricalColumns = leads.dtypes[condition].index.values
4categoricalColumns
1array(['Lead Origin', 'Lead Source', 'Do Not Email', 'Do Not Call',
2 'Last Activity', 'Country', 'Specialization',
3 'How did you hear about X Education',
4 'What is your current occupation',
5 'What matters most to you in choosing a course', 'Search',
6 'Magazine', 'Newspaper Article', 'X Education Forums', 'Newspaper',
7 'Digital Advertisement', 'Through Recommendations',
8 'Receive More Updates About Our Courses', 'Tags',
9 'Update me on Supply Chain Content', 'Get updates on DM Content',
10 'Lead Profile', 'City', 'I agree to pay the amount through cheque',
11 'A free copy of Mastering The Interview', 'Last Notable Activity'],
12 dtype=object)
1# value counts of each label in a categorical feature
2def cat_value_counts(column_name) :
3 '''
4 prints unique values and value counts of each label in categorical column
5 '''
6 print(tabulate(pd.DataFrame(leads.stb.freq([column_name])), headers='keys', tablefmt='psql'))
7 print(pd.DataFrame(leads[column_name]).stb.missing(),'\n\n\n')
1# Looking at value counts of each label in categorical variables
2for col in sorted(categoricalColumns) :
3 print(col)
4 cat_value_counts(col)
1A free copy of Mastering The Interview
2+----+------------------------------------------+---------+-----------+--------------------+----------------------+
3| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+------------------------------------------+---------+-----------+--------------------+----------------------|
5| 0 | No | 6352 | 0.687446 | 6352 | 0.687446 |
6| 1 | Yes | 2888 | 0.312554 | 9240 | 1 |
7+----+------------------------------------------+---------+-----------+--------------------+----------------------+
8 Missing Total Percent
9A free copy of Mastering The Interview 0 9240 0.0
10
11
12
13City
14+----+-----------------------------+---------+------------+--------------------+----------------------+
15| | City | Count | Percent | Cumulative Count | Cumulative Percent |
16|----+-----------------------------+---------+------------+--------------------+----------------------|
17| 0 | Mumbai | 3222 | 0.41202 | 3222 | 0.41202 |
18| 1 | Select | 2249 | 0.287596 | 5471 | 0.699616 |
19| 2 | Thane & Outskirts | 752 | 0.0961637 | 6223 | 0.79578 |
20| 3 | Other Cities | 686 | 0.0877238 | 6909 | 0.883504 |
21| 4 | Other Cities of Maharashtra | 457 | 0.0584399 | 7366 | 0.941944 |
22| 5 | Other Metro Cities | 380 | 0.0485934 | 7746 | 0.990537 |
23| 6 | Tier II Cities | 74 | 0.00946292 | 7820 | 1 |
24+----+-----------------------------+---------+------------+--------------------+----------------------+
25 Missing Total Percent
26City 1420 9240 0.15368
27
28
29
30Country
31+----+----------------------+---------+-------------+--------------------+----------------------+
32| | Country | Count | Percent | Cumulative Count | Cumulative Percent |
33|----+----------------------+---------+-------------+--------------------+----------------------|
34| 0 | India | 6492 | 0.957663 | 6492 | 0.957663 |
35| 1 | United States | 69 | 0.0101785 | 6561 | 0.967842 |
36| 2 | United Arab Emirates | 53 | 0.00781826 | 6614 | 0.97566 |
37| 3 | Singapore | 24 | 0.00354035 | 6638 | 0.9792 |
38| 4 | Saudi Arabia | 21 | 0.0030978 | 6659 | 0.982298 |
39| 5 | United Kingdom | 15 | 0.00221272 | 6674 | 0.984511 |
40| 6 | Australia | 13 | 0.00191769 | 6687 | 0.986429 |
41| 7 | Qatar | 10 | 0.00147514 | 6697 | 0.987904 |
42| 8 | Hong Kong | 7 | 0.0010326 | 6704 | 0.988936 |
43| 9 | Bahrain | 7 | 0.0010326 | 6711 | 0.989969 |
44| 10 | Oman | 6 | 0.000885086 | 6717 | 0.990854 |
45| 11 | France | 6 | 0.000885086 | 6723 | 0.991739 |
46| 12 | unknown | 5 | 0.000737572 | 6728 | 0.992477 |
47| 13 | South Africa | 4 | 0.000590058 | 6732 | 0.993067 |
48| 14 | Nigeria | 4 | 0.000590058 | 6736 | 0.993657 |
49| 15 | Kuwait | 4 | 0.000590058 | 6740 | 0.994247 |
50| 16 | Germany | 4 | 0.000590058 | 6744 | 0.994837 |
51| 17 | Canada | 4 | 0.000590058 | 6748 | 0.995427 |
52| 18 | Sweden | 3 | 0.000442543 | 6751 | 0.99587 |
53| 19 | Uganda | 2 | 0.000295029 | 6753 | 0.996165 |
54| 20 | Philippines | 2 | 0.000295029 | 6755 | 0.99646 |
55| 21 | Netherlands | 2 | 0.000295029 | 6757 | 0.996755 |
56| 22 | Italy | 2 | 0.000295029 | 6759 | 0.99705 |
57| 23 | Ghana | 2 | 0.000295029 | 6761 | 0.997345 |
58| 24 | China | 2 | 0.000295029 | 6763 | 0.99764 |
59| 25 | Belgium | 2 | 0.000295029 | 6765 | 0.997935 |
60| 26 | Bangladesh | 2 | 0.000295029 | 6767 | 0.99823 |
61| 27 | Asia/Pacific Region | 2 | 0.000295029 | 6769 | 0.998525 |
62| 28 | Vietnam | 1 | 0.000147514 | 6770 | 0.998672 |
63| 29 | Tanzania | 1 | 0.000147514 | 6771 | 0.99882 |
64| 30 | Switzerland | 1 | 0.000147514 | 6772 | 0.998967 |
65| 31 | Sri Lanka | 1 | 0.000147514 | 6773 | 0.999115 |
66| 32 | Russia | 1 | 0.000147514 | 6774 | 0.999262 |
67| 33 | Malaysia | 1 | 0.000147514 | 6775 | 0.99941 |
68| 34 | Liberia | 1 | 0.000147514 | 6776 | 0.999557 |
69| 35 | Kenya | 1 | 0.000147514 | 6777 | 0.999705 |
70| 36 | Indonesia | 1 | 0.000147514 | 6778 | 0.999852 |
71| 37 | Denmark | 1 | 0.000147514 | 6779 | 1 |
72+----+----------------------+---------+-------------+--------------------+----------------------+
73 Missing Total Percent
74Country 2461 9240 0.266342
75
76
77
78Digital Advertisement
79+----+-------------------------+---------+-----------+--------------------+----------------------+
80| | Digital Advertisement | Count | Percent | Cumulative Count | Cumulative Percent |
81|----+-------------------------+---------+-----------+--------------------+----------------------|
82| 0 | No | 9236 | 0.999567 | 9236 | 0.999567 |
83| 1 | Yes | 4 | 0.0004329 | 9240 | 1 |
84+----+-------------------------+---------+-----------+--------------------+----------------------+
85 Missing Total Percent
86Digital Advertisement 0 9240 0.0
87
88
89
90Do Not Call
91+----+---------------+---------+------------+--------------------+----------------------+
92| | Do Not Call | Count | Percent | Cumulative Count | Cumulative Percent |
93|----+---------------+---------+------------+--------------------+----------------------|
94| 0 | No | 9238 | 0.999784 | 9238 | 0.999784 |
95| 1 | Yes | 2 | 0.00021645 | 9240 | 1 |
96+----+---------------+---------+------------+--------------------+----------------------+
97 Missing Total Percent
98Do Not Call 0 9240 0.0
99
100
101
102Do Not Email
103+----+----------------+---------+-----------+--------------------+----------------------+
104| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |
105|----+----------------+---------+-----------+--------------------+----------------------|
106| 0 | No | 8506 | 0.920563 | 8506 | 0.920563 |
107| 1 | Yes | 734 | 0.0794372 | 9240 | 1 |
108+----+----------------+---------+-----------+--------------------+----------------------+
109 Missing Total Percent
110Do Not Email 0 9240 0.0
111
112
113
114Get updates on DM Content
115+----+-----------------------------+---------+-----------+--------------------+----------------------+
116| | Get updates on DM Content | Count | Percent | Cumulative Count | Cumulative Percent |
117|----+-----------------------------+---------+-----------+--------------------+----------------------|
118| 0 | No | 9240 | 1 | 9240 | 1 |
119+----+-----------------------------+---------+-----------+--------------------+----------------------+
120 Missing Total Percent
121Get updates on DM Content 0 9240 0.0
122
123
124
125How did you hear about X Education
126+----+--------------------------------------+---------+------------+--------------------+----------------------+
127| | How did you hear about X Education | Count | Percent | Cumulative Count | Cumulative Percent |
128|----+--------------------------------------+---------+------------+--------------------+----------------------|
129| 0 | Select | 5043 | 0.717048 | 5043 | 0.717048 |
130| 1 | Online Search | 808 | 0.114887 | 5851 | 0.831935 |
131| 2 | Word Of Mouth | 348 | 0.049481 | 6199 | 0.881416 |
132| 3 | Student of SomeSchool | 310 | 0.0440779 | 6509 | 0.925494 |
133| 4 | Other | 186 | 0.0264468 | 6695 | 0.951941 |
134| 5 | Multiple Sources | 152 | 0.0216124 | 6847 | 0.973553 |
135| 6 | Advertisements | 70 | 0.00995308 | 6917 | 0.983506 |
136| 7 | Social Media | 67 | 0.00952652 | 6984 | 0.993033 |
137| 8 | Email | 26 | 0.00369686 | 7010 | 0.99673 |
138| 9 | SMS | 23 | 0.0032703 | 7033 | 1 |
139+----+--------------------------------------+---------+------------+--------------------+----------------------+
140 Missing Total Percent
141How did you hear about X Education 2207 9240 0.238853
142
143
144
145I agree to pay the amount through cheque
146+----+--------------------------------------------+---------+-----------+--------------------+----------------------+
147| | I agree to pay the amount through cheque | Count | Percent | Cumulative Count | Cumulative Percent |
148|----+--------------------------------------------+---------+-----------+--------------------+----------------------|
149| 0 | No | 9240 | 1 | 9240 | 1 |
150+----+--------------------------------------------+---------+-----------+--------------------+----------------------+
151 Missing Total Percent
152I agree to pay the amount through cheque 0 9240 0.0
153
154
155
156Last Activity
157+----+------------------------------+---------+-------------+--------------------+----------------------+
158| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |
159|----+------------------------------+---------+-------------+--------------------+----------------------|
160| 0 | Email Opened | 3437 | 0.376163 | 3437 | 0.376163 |
161| 1 | SMS Sent | 2745 | 0.300427 | 6182 | 0.67659 |
162| 2 | Olark Chat Conversation | 973 | 0.10649 | 7155 | 0.78308 |
163| 3 | Page Visited on Website | 640 | 0.0700449 | 7795 | 0.853125 |
164| 4 | Converted to Lead | 428 | 0.0468425 | 8223 | 0.899967 |
165| 5 | Email Bounced | 326 | 0.0356791 | 8549 | 0.935646 |
166| 6 | Email Link Clicked | 267 | 0.0292218 | 8816 | 0.964868 |
167| 7 | Form Submitted on Website | 116 | 0.0126956 | 8932 | 0.977564 |
168| 8 | Unreachable | 93 | 0.0101784 | 9025 | 0.987742 |
169| 9 | Unsubscribed | 61 | 0.00667615 | 9086 | 0.994418 |
170| 10 | Had a Phone Conversation | 30 | 0.00328335 | 9116 | 0.997702 |
171| 11 | Approached upfront | 9 | 0.000985006 | 9125 | 0.998687 |
172| 12 | View in browser link Clicked | 6 | 0.000656671 | 9131 | 0.999343 |
173| 13 | Email Received | 2 | 0.00021889 | 9133 | 0.999562 |
174| 14 | Email Marked Spam | 2 | 0.00021889 | 9135 | 0.999781 |
175| 15 | Visited Booth in Tradeshow | 1 | 0.000109445 | 9136 | 0.999891 |
176| 16 | Resubscribed to emails | 1 | 0.000109445 | 9137 | 1 |
177+----+------------------------------+---------+-------------+--------------------+----------------------+
178 Missing Total Percent
179Last Activity 103 9240 0.011147
180
181
182
183Last Notable Activity
184+----+------------------------------+---------+-------------+--------------------+----------------------+
185| | Last Notable Activity | Count | Percent | Cumulative Count | Cumulative Percent |
186|----+------------------------------+---------+-------------+--------------------+----------------------|
187| 0 | Modified | 3407 | 0.368723 | 3407 | 0.368723 |
188| 1 | Email Opened | 2827 | 0.305952 | 6234 | 0.674675 |
189| 2 | SMS Sent | 2172 | 0.235065 | 8406 | 0.90974 |
190| 3 | Page Visited on Website | 318 | 0.0344156 | 8724 | 0.944156 |
191| 4 | Olark Chat Conversation | 183 | 0.0198052 | 8907 | 0.963961 |
192| 5 | Email Link Clicked | 173 | 0.0187229 | 9080 | 0.982684 |
193| 6 | Email Bounced | 60 | 0.00649351 | 9140 | 0.989177 |
194| 7 | Unsubscribed | 47 | 0.00508658 | 9187 | 0.994264 |
195| 8 | Unreachable | 32 | 0.0034632 | 9219 | 0.997727 |
196| 9 | Had a Phone Conversation | 14 | 0.00151515 | 9233 | 0.999242 |
197| 10 | Email Marked Spam | 2 | 0.00021645 | 9235 | 0.999459 |
198| 11 | View in browser link Clicked | 1 | 0.000108225 | 9236 | 0.999567 |
199| 12 | Resubscribed to emails | 1 | 0.000108225 | 9237 | 0.999675 |
200| 13 | Form Submitted on Website | 1 | 0.000108225 | 9238 | 0.999784 |
201| 14 | Email Received | 1 | 0.000108225 | 9239 | 0.999892 |
202| 15 | Approached upfront | 1 | 0.000108225 | 9240 | 1 |
203+----+------------------------------+---------+-------------+--------------------+----------------------+
204 Missing Total Percent
205Last Notable Activity 0 9240 0.0
206
207
208
209Lead Origin
210+----+-------------------------+---------+-------------+--------------------+----------------------+
211| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |
212|----+-------------------------+---------+-------------+--------------------+----------------------|
213| 0 | Landing Page Submission | 4886 | 0.528788 | 4886 | 0.528788 |
214| 1 | API | 3580 | 0.387446 | 8466 | 0.916234 |
215| 2 | Lead Add Form | 718 | 0.0777056 | 9184 | 0.993939 |
216| 3 | Lead Import | 55 | 0.00595238 | 9239 | 0.999892 |
217| 4 | Quick Add Form | 1 | 0.000108225 | 9240 | 1 |
218+----+-------------------------+---------+-------------+--------------------+----------------------+
219 Missing Total Percent
220Lead Origin 0 9240 0.0
221
222
223
224Lead Profile
225+----+-----------------------------+---------+------------+--------------------+----------------------+
226| | Lead Profile | Count | Percent | Cumulative Count | Cumulative Percent |
227|----+-----------------------------+---------+------------+--------------------+----------------------|
228| 0 | Select | 4146 | 0.634819 | 4146 | 0.634819 |
229| 1 | Potential Lead | 1613 | 0.246976 | 5759 | 0.881795 |
230| 2 | Other Leads | 487 | 0.0745674 | 6246 | 0.956362 |
231| 3 | Student of SomeSchool | 241 | 0.0369009 | 6487 | 0.993263 |
232| 4 | Lateral Student | 24 | 0.00367478 | 6511 | 0.996938 |
233| 5 | Dual Specialization Student | 20 | 0.00306232 | 6531 | 1 |
234+----+-----------------------------+---------+------------+--------------------+----------------------+
235 Missing Total Percent
236Lead Profile 2709 9240 0.293182
237
238
239
240Lead Source
241+----+-------------------+---------+-------------+--------------------+----------------------+
242| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |
243|----+-------------------+---------+-------------+--------------------+----------------------|
244| 0 | Google | 2868 | 0.311604 | 2868 | 0.311604 |
245| 1 | Direct Traffic | 2543 | 0.276293 | 5411 | 0.587897 |
246| 2 | Olark Chat | 1755 | 0.190678 | 7166 | 0.778575 |
247| 3 | Organic Search | 1154 | 0.12538 | 8320 | 0.903955 |
248| 4 | Reference | 534 | 0.0580183 | 8854 | 0.961973 |
249| 5 | Welingak Website | 142 | 0.0154281 | 8996 | 0.977401 |
250| 6 | Referral Sites | 125 | 0.0135811 | 9121 | 0.990982 |
251| 7 | Facebook | 55 | 0.00597566 | 9176 | 0.996958 |
252| 8 | bing | 6 | 0.00065189 | 9182 | 0.99761 |
253| 9 | google | 5 | 0.000543242 | 9187 | 0.998153 |
254| 10 | Click2call | 4 | 0.000434594 | 9191 | 0.998588 |
255| 11 | Social Media | 2 | 0.000217297 | 9193 | 0.998805 |
256| 12 | Press_Release | 2 | 0.000217297 | 9195 | 0.999022 |
257| 13 | Live Chat | 2 | 0.000217297 | 9197 | 0.999239 |
258| 14 | youtubechannel | 1 | 0.000108648 | 9198 | 0.999348 |
259| 15 | welearnblog_Home | 1 | 0.000108648 | 9199 | 0.999457 |
260| 16 | testone | 1 | 0.000108648 | 9200 | 0.999565 |
261| 17 | blog | 1 | 0.000108648 | 9201 | 0.999674 |
262| 18 | WeLearn | 1 | 0.000108648 | 9202 | 0.999783 |
263| 19 | Pay per Click Ads | 1 | 0.000108648 | 9203 | 0.999891 |
264| 20 | NC_EDM | 1 | 0.000108648 | 9204 | 1 |
265+----+-------------------+---------+-------------+--------------------+----------------------+
266 Missing Total Percent
267Lead Source 36 9240 0.003896
268
269
270
271Magazine
272+----+------------+---------+-----------+--------------------+----------------------+
273| | Magazine | Count | Percent | Cumulative Count | Cumulative Percent |
274|----+------------+---------+-----------+--------------------+----------------------|
275| 0 | No | 9240 | 1 | 9240 | 1 |
276+----+------------+---------+-----------+--------------------+----------------------+
277 Missing Total Percent
278Magazine 0 9240 0.0
279
280
281
282Newspaper
283+----+-------------+---------+-------------+--------------------+----------------------+
284| | Newspaper | Count | Percent | Cumulative Count | Cumulative Percent |
285|----+-------------+---------+-------------+--------------------+----------------------|
286| 0 | No | 9239 | 0.999892 | 9239 | 0.999892 |
287| 1 | Yes | 1 | 0.000108225 | 9240 | 1 |
288+----+-------------+---------+-------------+--------------------+----------------------+
289 Missing Total Percent
290Newspaper 0 9240 0.0
291
292
293
294Newspaper Article
295+----+---------------------+---------+------------+--------------------+----------------------+
296| | Newspaper Article | Count | Percent | Cumulative Count | Cumulative Percent |
297|----+---------------------+---------+------------+--------------------+----------------------|
298| 0 | No | 9238 | 0.999784 | 9238 | 0.999784 |
299| 1 | Yes | 2 | 0.00021645 | 9240 | 1 |
300+----+---------------------+---------+------------+--------------------+----------------------+
301 Missing Total Percent
302Newspaper Article 0 9240 0.0
303
304
305
306Receive More Updates About Our Courses
307+----+------------------------------------------+---------+-----------+--------------------+----------------------+
308| | Receive More Updates About Our Courses | Count | Percent | Cumulative Count | Cumulative Percent |
309|----+------------------------------------------+---------+-----------+--------------------+----------------------|
310| 0 | No | 9240 | 1 | 9240 | 1 |
311+----+------------------------------------------+---------+-----------+--------------------+----------------------+
312 Missing Total Percent
313Receive More Updates About Our Courses 0 9240 0.0
314
315
316
317Search
318+----+----------+---------+------------+--------------------+----------------------+
319| | Search | Count | Percent | Cumulative Count | Cumulative Percent |
320|----+----------+---------+------------+--------------------+----------------------|
321| 0 | No | 9226 | 0.998485 | 9226 | 0.998485 |
322| 1 | Yes | 14 | 0.00151515 | 9240 | 1 |
323+----+----------+---------+------------+--------------------+----------------------+
324 Missing Total Percent
325Search 0 9240 0.0
326
327
328
329Specialization
330+----+-----------------------------------+---------+------------+--------------------+----------------------+
331| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |
332|----+-----------------------------------+---------+------------+--------------------+----------------------|
333| 0 | Select | 1942 | 0.248911 | 1942 | 0.248911 |
334| 1 | Finance Management | 976 | 0.125096 | 2918 | 0.374007 |
335| 2 | Human Resource Management | 848 | 0.10869 | 3766 | 0.482697 |
336| 3 | Marketing Management | 838 | 0.107408 | 4604 | 0.590105 |
337| 4 | Operations Management | 503 | 0.0644706 | 5107 | 0.654576 |
338| 5 | Business Administration | 403 | 0.0516534 | 5510 | 0.706229 |
339| 6 | IT Projects Management | 366 | 0.046911 | 5876 | 0.75314 |
340| 7 | Supply Chain Management | 349 | 0.0447321 | 6225 | 0.797872 |
341| 8 | Banking, Investment And Insurance | 338 | 0.0433222 | 6563 | 0.841195 |
342| 9 | Travel and Tourism | 203 | 0.026019 | 6766 | 0.867214 |
343| 10 | Media and Advertising | 203 | 0.026019 | 6969 | 0.893233 |
344| 11 | International Business | 178 | 0.0228147 | 7147 | 0.916047 |
345| 12 | Healthcare Management | 159 | 0.0203794 | 7306 | 0.936427 |
346| 13 | Hospitality Management | 114 | 0.0146116 | 7420 | 0.951038 |
347| 14 | E-COMMERCE | 112 | 0.0143553 | 7532 | 0.965393 |
348| 15 | Retail Management | 100 | 0.0128172 | 7632 | 0.978211 |
349| 16 | Rural and Agribusiness | 73 | 0.00935658 | 7705 | 0.987567 |
350| 17 | E-Business | 57 | 0.00730582 | 7762 | 0.994873 |
351| 18 | Services Excellence | 40 | 0.00512689 | 7802 | 1 |
352+----+-----------------------------------+---------+------------+--------------------+----------------------+
353 Missing Total Percent
354Specialization 1438 9240 0.155628
355
356
357
358Tags
359+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
360| | Tags | Count | Percent | Cumulative Count | Cumulative Percent |
361|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|
362| 0 | Will revert after reading the email | 2072 | 0.351962 | 2072 | 0.351962 |
363| 1 | Ringing | 1203 | 0.204349 | 3275 | 0.556311 |
364| 2 | Interested in other courses | 513 | 0.0871412 | 3788 | 0.643452 |
365| 3 | Already a student | 465 | 0.0789876 | 4253 | 0.722439 |
366| 4 | Closed by Horizzon | 358 | 0.060812 | 4611 | 0.783251 |
367| 5 | switched off | 240 | 0.0407678 | 4851 | 0.824019 |
368| 6 | Busy | 186 | 0.031595 | 5037 | 0.855614 |
369| 7 | Lost to EINS | 175 | 0.0297265 | 5212 | 0.885341 |
370| 8 | Not doing further education | 145 | 0.0246305 | 5357 | 0.909971 |
371| 9 | Interested in full time MBA | 117 | 0.0198743 | 5474 | 0.929845 |
372| 10 | Graduation in progress | 111 | 0.0188551 | 5585 | 0.948701 |
373| 11 | invalid number | 83 | 0.0140989 | 5668 | 0.962799 |
374| 12 | Diploma holder (Not Eligible) | 63 | 0.0107015 | 5731 | 0.973501 |
375| 13 | wrong number given | 47 | 0.00798369 | 5778 | 0.981485 |
376| 14 | opp hangup | 33 | 0.00560557 | 5811 | 0.98709 |
377| 15 | number not provided | 27 | 0.00458638 | 5838 | 0.991677 |
378| 16 | in touch with EINS | 12 | 0.00203839 | 5850 | 0.993715 |
379| 17 | Lost to Others | 7 | 0.00118906 | 5857 | 0.994904 |
380| 18 | Want to take admission but has financial problems | 6 | 0.00101919 | 5863 | 0.995923 |
381| 19 | Still Thinking | 6 | 0.00101919 | 5869 | 0.996942 |
382| 20 | Interested in Next batch | 5 | 0.000849329 | 5874 | 0.997792 |
383| 21 | In confusion whether part time or DLP | 5 | 0.000849329 | 5879 | 0.998641 |
384| 22 | Lateral student | 3 | 0.000509597 | 5882 | 0.999151 |
385| 23 | University not recognized | 2 | 0.000339732 | 5884 | 0.99949 |
386| 24 | Shall take in the next coming month | 2 | 0.000339732 | 5886 | 0.99983 |
387| 25 | Recognition issue (DEC approval) | 1 | 0.000169866 | 5887 | 1 |
388+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
389 Missing Total Percent
390Tags 3353 9240 0.362879
391
392
393
394Through Recommendations
395+----+---------------------------+---------+-------------+--------------------+----------------------+
396| | Through Recommendations | Count | Percent | Cumulative Count | Cumulative Percent |
397|----+---------------------------+---------+-------------+--------------------+----------------------|
398| 0 | No | 9233 | 0.999242 | 9233 | 0.999242 |
399| 1 | Yes | 7 | 0.000757576 | 9240 | 1 |
400+----+---------------------------+---------+-------------+--------------------+----------------------+
401 Missing Total Percent
402Through Recommendations 0 9240 0.0
403
404
405
406Update me on Supply Chain Content
407+----+-------------------------------------+---------+-----------+--------------------+----------------------+
408| | Update me on Supply Chain Content | Count | Percent | Cumulative Count | Cumulative Percent |
409|----+-------------------------------------+---------+-----------+--------------------+----------------------|
410| 0 | No | 9240 | 1 | 9240 | 1 |
411+----+-------------------------------------+---------+-----------+--------------------+----------------------+
412 Missing Total Percent
413Update me on Supply Chain Content 0 9240 0.0
414
415
416
417What is your current occupation
418+----+-----------------------------------+---------+------------+--------------------+----------------------+
419| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |
420|----+-----------------------------------+---------+------------+--------------------+----------------------|
421| 0 | Unemployed | 5600 | 0.854962 | 5600 | 0.854962 |
422| 1 | Working Professional | 706 | 0.107786 | 6306 | 0.962748 |
423| 2 | Student | 210 | 0.0320611 | 6516 | 0.994809 |
424| 3 | Other | 16 | 0.00244275 | 6532 | 0.997252 |
425| 4 | Housewife | 10 | 0.00152672 | 6542 | 0.998779 |
426| 5 | Businessman | 8 | 0.00122137 | 6550 | 1 |
427+----+-----------------------------------+---------+------------+--------------------+----------------------+
428 Missing Total Percent
429What is your current occupation 2690 9240 0.291126
430
431
432
433What matters most to you in choosing a course
434+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
435| | What matters most to you in choosing a course | Count | Percent | Cumulative Count | Cumulative Percent |
436|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|
437| 0 | Better Career Prospects | 6528 | 0.999541 | 6528 | 0.999541 |
438| 1 | Flexibility & Convenience | 2 | 0.000306232 | 6530 | 0.999847 |
439| 2 | Other | 1 | 0.000153116 | 6531 | 1 |
440+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
441 Missing Total Percent
442What matters most to you in choosing a course 2709 9240 0.293182
443
444
445
446X Education Forums
447+----+----------------------+---------+-------------+--------------------+----------------------+
448| | X Education Forums | Count | Percent | Cumulative Count | Cumulative Percent |
449|----+----------------------+---------+-------------+--------------------+----------------------|
450| 0 | No | 9239 | 0.999892 | 9239 | 0.999892 |
451| 1 | Yes | 1 | 0.000108225 | 9240 | 1 |
452+----+----------------------+---------+-------------+--------------------+----------------------+
453 Missing Total Percent
454X Education Forums 0 9240 0.0
  • The following columns have a label Select which is a disguised missing value.
  • Select is the default option in online forms and this value might mean that the lead hasn’t selected any option.
  • We shall replace them with np.nan.
    • Specialization
    • Lead Profile
    • City
    • How did you hear about X Education
1# Replacing Select with NaN value
2leads.replace({'Select' : np.nan},inplace=True)
1# Looking at Missing Values again
2
3# Null values in each Column
4nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
5nulls.columns = ['Null Percentage']
6
7# Sorting null percentages in descending order and highlighting null % > 50
8nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 50 else '')
Null Percentage
How did you hear about X Education78.463203
Lead Profile74.188312
City39.707792
Specialization36.580087
Tags36.287879
What matters most to you in choosing a course29.318182
What is your current occupation29.112554
Country26.634199
TotalVisits1.482684
Page Views Per Visit1.482684
Last Activity1.114719
Lead Source0.389610
  • Lead Profile & How did you hear about X Education have very high percentage of nulls. Let’s drop these columns
1leads.drop(columns=['Lead Profile','How did you hear about X Education'],inplace=True)

Imputation

1# Sorting null percentages in ascending order and highlighting null % < 16
2def lowNulls() :
3 nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])
4 nulls.columns = ['Null Percentage']
5 return nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=True).style.applymap(lambda x : 'color : green' if x < 16 else '')
6
7lowNulls()
Null Percentage
Lead Source0.389610
Last Activity1.114719
TotalVisits1.482684
Page Views Per Visit1.482684
Country26.634199
What is your current occupation29.112554
What matters most to you in choosing a course29.318182
Tags36.287879
Specialization36.580087
City39.707792
  • ‘Lead Source’,‘Last Activity’,‘TotalVisits’,‘Page Views Per Visit’ have less than 2% missing values. These rows could be dropped.

  • We could impute columns with higher missing values on a case by case basis.

  • Missing values are imputed by the metric most representative of the feature’s distribution.

  • For categorical features, missing values could be imputed by the most frequently occuring label i.e MODE value, since this is the most representative metric of a categorical feature.

  • For continuous features, if there are outliers, the most representative metric of the feature’s distribution is MEDIAN, else it is MEAN. Continuous feature imputations are thus dependent on presence of outliers.

  • About 26% of data in Country Column is missing.

1# Country Imputation :
2tab(leads.stb.freq(['Country']))
1+----+----------------------+---------+-------------+--------------------+----------------------+
2| | Country | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+----------------------+---------+-------------+--------------------+----------------------|
4| 0 | India | 6492 | 0.957663 | 6492 | 0.957663 |
5| 1 | United States | 69 | 0.0101785 | 6561 | 0.967842 |
6| 2 | United Arab Emirates | 53 | 0.00781826 | 6614 | 0.97566 |
7| 3 | Singapore | 24 | 0.00354035 | 6638 | 0.9792 |
8| 4 | Saudi Arabia | 21 | 0.0030978 | 6659 | 0.982298 |
9| 5 | United Kingdom | 15 | 0.00221272 | 6674 | 0.984511 |
10| 6 | Australia | 13 | 0.00191769 | 6687 | 0.986429 |
11| 7 | Qatar | 10 | 0.00147514 | 6697 | 0.987904 |
12| 8 | Hong Kong | 7 | 0.0010326 | 6704 | 0.988936 |
13| 9 | Bahrain | 7 | 0.0010326 | 6711 | 0.989969 |
14| 10 | Oman | 6 | 0.000885086 | 6717 | 0.990854 |
15| 11 | France | 6 | 0.000885086 | 6723 | 0.991739 |
16| 12 | unknown | 5 | 0.000737572 | 6728 | 0.992477 |
17| 13 | South Africa | 4 | 0.000590058 | 6732 | 0.993067 |
18| 14 | Nigeria | 4 | 0.000590058 | 6736 | 0.993657 |
19| 15 | Kuwait | 4 | 0.000590058 | 6740 | 0.994247 |
20| 16 | Germany | 4 | 0.000590058 | 6744 | 0.994837 |
21| 17 | Canada | 4 | 0.000590058 | 6748 | 0.995427 |
22| 18 | Sweden | 3 | 0.000442543 | 6751 | 0.99587 |
23| 19 | Uganda | 2 | 0.000295029 | 6753 | 0.996165 |
24| 20 | Philippines | 2 | 0.000295029 | 6755 | 0.99646 |
25| 21 | Netherlands | 2 | 0.000295029 | 6757 | 0.996755 |
26| 22 | Italy | 2 | 0.000295029 | 6759 | 0.99705 |
27| 23 | Ghana | 2 | 0.000295029 | 6761 | 0.997345 |
28| 24 | China | 2 | 0.000295029 | 6763 | 0.99764 |
29| 25 | Belgium | 2 | 0.000295029 | 6765 | 0.997935 |
30| 26 | Bangladesh | 2 | 0.000295029 | 6767 | 0.99823 |
31| 27 | Asia/Pacific Region | 2 | 0.000295029 | 6769 | 0.998525 |
32| 28 | Vietnam | 1 | 0.000147514 | 6770 | 0.998672 |
33| 29 | Tanzania | 1 | 0.000147514 | 6771 | 0.99882 |
34| 30 | Switzerland | 1 | 0.000147514 | 6772 | 0.998967 |
35| 31 | Sri Lanka | 1 | 0.000147514 | 6773 | 0.999115 |
36| 32 | Russia | 1 | 0.000147514 | 6774 | 0.999262 |
37| 33 | Malaysia | 1 | 0.000147514 | 6775 | 0.99941 |
38| 34 | Liberia | 1 | 0.000147514 | 6776 | 0.999557 |
39| 35 | Kenya | 1 | 0.000147514 | 6777 | 0.999705 |
40| 36 | Indonesia | 1 | 0.000147514 | 6778 | 0.999852 |
41| 37 | Denmark | 1 | 0.000147514 | 6779 | 1 |
42+----+----------------------+---------+-------------+--------------------+----------------------+
  • Since 95% of leads come from India, it is probable that missing values are from India.
1# Imputing missing values in Country feature with "India"
2leads['Country'].fillna('India', inplace=True)
  • Specialization feature has 36% of missing values.
  • Since there’s no one label that’s driving leads, replacement would mislead the analysis.
  • Hence, we could impute missing values with a new label ‘No Specialization’
1# Imputing Null Values by filling it using "No Specialization".
2leads['Specialization'].fillna("No Specialization",inplace=True)
3print('Missing values in Specialization feature ', leads['Specialization'].isnull().sum())
1Missing values in Specialization feature 0
1leads['Specialization'].value_counts()
1No Specialization 3380
2Finance Management 976
3Human Resource Management 848
4Marketing Management 838
5Operations Management 503
6Business Administration 403
7IT Projects Management 366
8Supply Chain Management 349
9Banking, Investment And Insurance 338
10Travel and Tourism 203
11Media and Advertising 203
12International Business 178
13Healthcare Management 159
14Hospitality Management 114
15E-COMMERCE 112
16Retail Management 100
17Rural and Agribusiness 73
18E-Business 57
19Services Excellence 40
20Name: Specialization, dtype: int64
  • About 39% of values in City column are missing
1# Imputation of missing cities
2leads.stb.freq(['City'])
CityCountPercentCumulative CountCumulative Percent
0Mumbai32220.57835232220.578352
1Thane & Outskirts7520.13498539740.713337
2Other Cities6860.12313846600.836475
3Other Cities of Maharashtra4570.08203251170.918507
4Other Metro Cities3800.06821054970.986717
5Tier II Cities740.01328355711.000000
1# Missing Cities vs Country
2condition_india = leads['Country'] == 'India'
3print('Total Missing City values :', leads['City'].isnull().sum())
4print('Missing City values in leads from India : ',leads.loc[condition_india,'City'].isnull().sum())
1Total Missing City values : 3669
2Missing City values in leads from India : 3609
  • Looks like 3609 out 3669 leads with Missing City label are from India.
  • As can be seen from the value counts of City feature, 60% of the leads come from Mumbai.
  • Since, we could impute missing City value for leads from India with Mumbai
1# Replacing Null Cities in India with Mumbai
2condition = (leads['City'].isnull()) & condition_india
3leads.loc[condition,'City'] = 'Mumbai'
  • 29% of values in What is your current occupation column are missing.
  • Let’s look at the distribution of levels in this column
1tab(leads.stb.freq(['What is your current occupation']))
2tab(leads['What is your current occupation'].reset_index().stb.missing())
1+----+-----------------------------------+---------+------------+--------------------+----------------------+
2| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+-----------------------------------+---------+------------+--------------------+----------------------|
4| 0 | Unemployed | 5600 | 0.854962 | 5600 | 0.854962 |
5| 1 | Working Professional | 706 | 0.107786 | 6306 | 0.962748 |
6| 2 | Student | 210 | 0.0320611 | 6516 | 0.994809 |
7| 3 | Other | 16 | 0.00244275 | 6532 | 0.997252 |
8| 4 | Housewife | 10 | 0.00152672 | 6542 | 0.998779 |
9| 5 | Businessman | 8 | 0.00122137 | 6550 | 1 |
10+----+-----------------------------------+---------+------------+--------------------+----------------------+
11+---------------------------------+-----------+---------+-----------+
12| | Missing | Total | Percent |
13|---------------------------------+-----------+---------+-----------|
14| What is your current occupation | 2690 | 9240 | 0.291126 |
15| index | 0 | 9240 | 0 |
16+---------------------------------+-----------+---------+-----------+
  • Since, the business problem clearly says the company targets working professionals, this is an extremely important variable.
  • So to keep the analysis unbiased, we could impute missing values with a new level for now.
  • Let’s replace missing values in What is your current occupation with ‘Unknown Occupation’
1leads['What is your current occupation'].fillna('Unknown Occupation',inplace=True)
1# Missing Values in `What matters most to you in choosing a course`
2ftr = 'What matters most to you in choosing a course'
3tab(leads.stb.freq([ftr]))
4tab(leads[ftr].reset_index().stb.missing())
1+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
2| | What matters most to you in choosing a course | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|
4| 0 | Better Career Prospects | 6528 | 0.999541 | 6528 | 0.999541 |
5| 1 | Flexibility & Convenience | 2 | 0.000306232 | 6530 | 0.999847 |
6| 2 | Other | 1 | 0.000153116 | 6531 | 1 |
7+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
8+-----------------------------------------------+-----------+---------+-----------+
9| | Missing | Total | Percent |
10|-----------------------------------------------+-----------+---------+-----------|
11| What matters most to you in choosing a course | 2709 | 9240 | 0.293182 |
12| index | 0 | 9240 | 0 |
13+-----------------------------------------------+-----------+---------+-----------+
  • Almost all leads ( >99%) show interest in the company’s offerings for Better Career Prospects , excluding leads who havent filled this feature.
  • Since this might be a very important feature from the analysis perspective, to keep the analysis unbiased, instead of imputation with an existing label, let’s impute with a new label for now.
  • Let’s replace missing values in What matters most to you in choosing a course with ‘Unknown Target’ for now.
1# Filling Missing Values with `Unknown`
2leads[ftr].fillna('Unknown Target',inplace=True)

Incorrect Labels

1# Looking at labels in each Categorical Variable to check for incorrect labels.
2categoricalFeatures = leads.dtypes[leads.dtypes == 'object'].index.values
3print('Categorical Features : ', categoricalFeatures,'\n\n')
4for feature in categoricalFeatures :
5 print('Levels in ',feature,' are ' , leads[feature].unique(),'\n\n')
1Categorical Features : ['Lead Origin' 'Lead Source' 'Do Not Email' 'Do Not Call' 'Last Activity'
2 'Country' 'Specialization' 'What is your current occupation'
3 'What matters most to you in choosing a course' 'Search' 'Magazine'
4 'Newspaper Article' 'X Education Forums' 'Newspaper'
5 'Digital Advertisement' 'Through Recommendations'
6 'Receive More Updates About Our Courses' 'Tags'
7 'Update me on Supply Chain Content' 'Get updates on DM Content' 'City'
8 'I agree to pay the amount through cheque'
9 'A free copy of Mastering The Interview' 'Last Notable Activity']
10
11
12Levels in Lead Origin are ['API' 'Landing Page Submission' 'Lead Add Form' 'Lead Import'
13 'Quick Add Form']
14
15
16Levels in Lead Source are ['Olark Chat' 'Organic Search' 'Direct Traffic' 'Google' 'Referral Sites'
17 'Welingak Website' 'Reference' 'google' 'Facebook' nan 'blog'
18 'Pay per Click Ads' 'bing' 'Social Media' 'WeLearn' 'Click2call'
19 'Live Chat' 'welearnblog_Home' 'youtubechannel' 'testone' 'Press_Release'
20 'NC_EDM']
21
22
23Levels in Do Not Email are ['No' 'Yes']
24
25
26Levels in Do Not Call are ['No' 'Yes']
27
28
29Levels in Last Activity are ['Page Visited on Website' 'Email Opened' 'Unreachable'
30 'Converted to Lead' 'Olark Chat Conversation' 'Email Bounced'
31 'Email Link Clicked' 'Form Submitted on Website' 'Unsubscribed'
32 'Had a Phone Conversation' 'View in browser link Clicked' nan
33 'Approached upfront' 'SMS Sent' 'Visited Booth in Tradeshow'
34 'Resubscribed to emails' 'Email Received' 'Email Marked Spam']
35
36
37Levels in Country are ['India' 'Russia' 'Kuwait' 'Oman' 'United Arab Emirates' 'United States'
38 'Australia' 'United Kingdom' 'Bahrain' 'Ghana' 'Singapore' 'Qatar'
39 'Saudi Arabia' 'Belgium' 'France' 'Sri Lanka' 'China' 'Canada'
40 'Netherlands' 'Sweden' 'Nigeria' 'Hong Kong' 'Germany'
41 'Asia/Pacific Region' 'Uganda' 'Kenya' 'Italy' 'South Africa' 'Tanzania'
42 'unknown' 'Malaysia' 'Liberia' 'Switzerland' 'Denmark' 'Philippines'
43 'Bangladesh' 'Vietnam' 'Indonesia']
44
45
46Levels in Specialization are ['No Specialization' 'Business Administration' 'Media and Advertising'
47 'Supply Chain Management' 'IT Projects Management' 'Finance Management'
48 'Travel and Tourism' 'Human Resource Management' 'Marketing Management'
49 'Banking, Investment And Insurance' 'International Business' 'E-COMMERCE'
50 'Operations Management' 'Retail Management' 'Services Excellence'
51 'Hospitality Management' 'Rural and Agribusiness' 'Healthcare Management'
52 'E-Business']
53
54
55Levels in What is your current occupation are ['Unemployed' 'Student' 'Unknown Occupation' 'Working Professional'
56 'Businessman' 'Other' 'Housewife']
57
58
59Levels in What matters most to you in choosing a course are ['Better Career Prospects' 'Unknown Target' 'Flexibility & Convenience'
60 'Other']
61
62
63Levels in Search are ['No' 'Yes']
64
65
66Levels in Magazine are ['No']
67
68
69Levels in Newspaper Article are ['No' 'Yes']
70
71
72Levels in X Education Forums are ['No' 'Yes']
73
74
75Levels in Newspaper are ['No' 'Yes']
76
77
78Levels in Digital Advertisement are ['No' 'Yes']
79
80
81Levels in Through Recommendations are ['No' 'Yes']
82
83
84Levels in Receive More Updates About Our Courses are ['No']
85
86
87Levels in Tags are ['Interested in other courses' 'Ringing'
88 'Will revert after reading the email' nan 'Lost to EINS'
89 'In confusion whether part time or DLP' 'Busy' 'switched off'
90 'in touch with EINS' 'Already a student' 'Diploma holder (Not Eligible)'
91 'Graduation in progress' 'Closed by Horizzon' 'number not provided'
92 'opp hangup' 'Not doing further education' 'invalid number'
93 'wrong number given' 'Interested in full time MBA' 'Still Thinking'
94 'Lost to Others' 'Shall take in the next coming month' 'Lateral student'
95 'Interested in Next batch' 'Recognition issue (DEC approval)'
96 'Want to take admission but has financial problems'
97 'University not recognized']
98
99
100Levels in Update me on Supply Chain Content are ['No']
101
102
103Levels in Get updates on DM Content are ['No']
104
105
106Levels in City are ['Mumbai' 'Thane & Outskirts' 'Other Metro Cities' nan 'Other Cities'
107 'Other Cities of Maharashtra' 'Tier II Cities']
108
109
110Levels in I agree to pay the amount through cheque are ['No']
111
112
113Levels in A free copy of Mastering The Interview are ['No' 'Yes']
114
115
116Levels in Last Notable Activity are ['Modified' 'Email Opened' 'Page Visited on Website' 'Email Bounced'
117 'Email Link Clicked' 'Unreachable' 'Unsubscribed'
118 'Had a Phone Conversation' 'Olark Chat Conversation' 'SMS Sent'
119 'Approached upfront' 'Resubscribed to emails'
120 'View in browser link Clicked' 'Form Submitted on Website'
121 'Email Received' 'Email Marked Spam']
  • We can clearly see that Google is appearing twice in ‘Lead Source’- (Google,google)
1# Replacing 'google' with 'Google
2leads['Lead Source']=leads['Lead Source'].str.replace("google","Google")

Cleaning Categorical Features

Dropping Unnecessary Columns

1# Missing Values and Value Counts for all categorical Variables
2tab(leads.stb.missing())
3print('Value Counts of each Feature : \n')
4for feature in sorted(categoricalFeatures) :
5 tab(leads.stb.freq([feature]))
1+-----------------------------------------------+-----------+---------+------------+
2| | Missing | Total | Percent |
3|-----------------------------------------------+-----------+---------+------------|
4| Tags | 3353 | 9240 | 0.362879 |
5| TotalVisits | 137 | 9240 | 0.0148268 |
6| Page Views Per Visit | 137 | 9240 | 0.0148268 |
7| Last Activity | 103 | 9240 | 0.0111472 |
8| City | 60 | 9240 | 0.00649351 |
9| Lead Source | 36 | 9240 | 0.0038961 |
10| Lead Origin | 0 | 9240 | 0 |
11| X Education Forums | 0 | 9240 | 0 |
12| A free copy of Mastering The Interview | 0 | 9240 | 0 |
13| I agree to pay the amount through cheque | 0 | 9240 | 0 |
14| Get updates on DM Content | 0 | 9240 | 0 |
15| Update me on Supply Chain Content | 0 | 9240 | 0 |
16| Receive More Updates About Our Courses | 0 | 9240 | 0 |
17| Through Recommendations | 0 | 9240 | 0 |
18| Digital Advertisement | 0 | 9240 | 0 |
19| Newspaper | 0 | 9240 | 0 |
20| Magazine | 0 | 9240 | 0 |
21| Newspaper Article | 0 | 9240 | 0 |
22| Search | 0 | 9240 | 0 |
23| What matters most to you in choosing a course | 0 | 9240 | 0 |
24| What is your current occupation | 0 | 9240 | 0 |
25| Specialization | 0 | 9240 | 0 |
26| Country | 0 | 9240 | 0 |
27| Total Time Spent on Website | 0 | 9240 | 0 |
28| Converted | 0 | 9240 | 0 |
29| Do Not Call | 0 | 9240 | 0 |
30| Do Not Email | 0 | 9240 | 0 |
31| Last Notable Activity | 0 | 9240 | 0 |
32+-----------------------------------------------+-----------+---------+------------+
33Value Counts of each Feature :
34
35+----+------------------------------------------+---------+-----------+--------------------+----------------------+
36| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |
37|----+------------------------------------------+---------+-----------+--------------------+----------------------|
38| 0 | No | 6352 | 0.687446 | 6352 | 0.687446 |
39| 1 | Yes | 2888 | 0.312554 | 9240 | 1 |
40+----+------------------------------------------+---------+-----------+--------------------+----------------------+
41+----+-----------------------------+---------+-----------+--------------------+----------------------+
42| | City | Count | Percent | Cumulative Count | Cumulative Percent |
43|----+-----------------------------+---------+-----------+--------------------+----------------------|
44| 0 | Mumbai | 6831 | 0.744118 | 6831 | 0.744118 |
45| 1 | Thane & Outskirts | 752 | 0.0819172 | 7583 | 0.826035 |
46| 2 | Other Cities | 686 | 0.0747277 | 8269 | 0.900763 |
47| 3 | Other Cities of Maharashtra | 457 | 0.0497821 | 8726 | 0.950545 |
48| 4 | Other Metro Cities | 380 | 0.0413943 | 9106 | 0.991939 |
49| 5 | Tier II Cities | 74 | 0.008061 | 9180 | 1 |
50+----+-----------------------------+---------+-----------+--------------------+----------------------+
51+----+----------------------+---------+-------------+--------------------+----------------------+
52| | Country | Count | Percent | Cumulative Count | Cumulative Percent |
53|----+----------------------+---------+-------------+--------------------+----------------------|
54| 0 | India | 8953 | 0.968939 | 8953 | 0.968939 |
55| 1 | United States | 69 | 0.00746753 | 9022 | 0.976407 |
56| 2 | United Arab Emirates | 53 | 0.00573593 | 9075 | 0.982143 |
57| 3 | Singapore | 24 | 0.0025974 | 9099 | 0.98474 |
58| 4 | Saudi Arabia | 21 | 0.00227273 | 9120 | 0.987013 |
59| 5 | United Kingdom | 15 | 0.00162338 | 9135 | 0.988636 |
60| 6 | Australia | 13 | 0.00140693 | 9148 | 0.990043 |
61| 7 | Qatar | 10 | 0.00108225 | 9158 | 0.991126 |
62| 8 | Hong Kong | 7 | 0.000757576 | 9165 | 0.991883 |
63| 9 | Bahrain | 7 | 0.000757576 | 9172 | 0.992641 |
64| 10 | Oman | 6 | 0.000649351 | 9178 | 0.99329 |
65| 11 | France | 6 | 0.000649351 | 9184 | 0.993939 |
66| 12 | unknown | 5 | 0.000541126 | 9189 | 0.994481 |
67| 13 | South Africa | 4 | 0.0004329 | 9193 | 0.994913 |
68| 14 | Nigeria | 4 | 0.0004329 | 9197 | 0.995346 |
69| 15 | Kuwait | 4 | 0.0004329 | 9201 | 0.995779 |
70| 16 | Germany | 4 | 0.0004329 | 9205 | 0.996212 |
71| 17 | Canada | 4 | 0.0004329 | 9209 | 0.996645 |
72| 18 | Sweden | 3 | 0.000324675 | 9212 | 0.99697 |
73| 19 | Uganda | 2 | 0.00021645 | 9214 | 0.997186 |
74| 20 | Philippines | 2 | 0.00021645 | 9216 | 0.997403 |
75| 21 | Netherlands | 2 | 0.00021645 | 9218 | 0.997619 |
76| 22 | Italy | 2 | 0.00021645 | 9220 | 0.997835 |
77| 23 | Ghana | 2 | 0.00021645 | 9222 | 0.998052 |
78| 24 | China | 2 | 0.00021645 | 9224 | 0.998268 |
79| 25 | Belgium | 2 | 0.00021645 | 9226 | 0.998485 |
80| 26 | Bangladesh | 2 | 0.00021645 | 9228 | 0.998701 |
81| 27 | Asia/Pacific Region | 2 | 0.00021645 | 9230 | 0.998918 |
82| 28 | Vietnam | 1 | 0.000108225 | 9231 | 0.999026 |
83| 29 | Tanzania | 1 | 0.000108225 | 9232 | 0.999134 |
84| 30 | Switzerland | 1 | 0.000108225 | 9233 | 0.999242 |
85| 31 | Sri Lanka | 1 | 0.000108225 | 9234 | 0.999351 |
86| 32 | Russia | 1 | 0.000108225 | 9235 | 0.999459 |
87| 33 | Malaysia | 1 | 0.000108225 | 9236 | 0.999567 |
88| 34 | Liberia | 1 | 0.000108225 | 9237 | 0.999675 |
89| 35 | Kenya | 1 | 0.000108225 | 9238 | 0.999784 |
90| 36 | Indonesia | 1 | 0.000108225 | 9239 | 0.999892 |
91| 37 | Denmark | 1 | 0.000108225 | 9240 | 1 |
92+----+----------------------+---------+-------------+--------------------+----------------------+
93+----+-------------------------+---------+-----------+--------------------+----------------------+
94| | Digital Advertisement | Count | Percent | Cumulative Count | Cumulative Percent |
95|----+-------------------------+---------+-----------+--------------------+----------------------|
96| 0 | No | 9236 | 0.999567 | 9236 | 0.999567 |
97| 1 | Yes | 4 | 0.0004329 | 9240 | 1 |
98+----+-------------------------+---------+-----------+--------------------+----------------------+
99+----+---------------+---------+------------+--------------------+----------------------+
100| | Do Not Call | Count | Percent | Cumulative Count | Cumulative Percent |
101|----+---------------+---------+------------+--------------------+----------------------|
102| 0 | No | 9238 | 0.999784 | 9238 | 0.999784 |
103| 1 | Yes | 2 | 0.00021645 | 9240 | 1 |
104+----+---------------+---------+------------+--------------------+----------------------+
105+----+----------------+---------+-----------+--------------------+----------------------+
106| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |
107|----+----------------+---------+-----------+--------------------+----------------------|
108| 0 | No | 8506 | 0.920563 | 8506 | 0.920563 |
109| 1 | Yes | 734 | 0.0794372 | 9240 | 1 |
110+----+----------------+---------+-----------+--------------------+----------------------+
111+----+-----------------------------+---------+-----------+--------------------+----------------------+
112| | Get updates on DM Content | Count | Percent | Cumulative Count | Cumulative Percent |
113|----+-----------------------------+---------+-----------+--------------------+----------------------|
114| 0 | No | 9240 | 1 | 9240 | 1 |
115+----+-----------------------------+---------+-----------+--------------------+----------------------+
116+----+--------------------------------------------+---------+-----------+--------------------+----------------------+
117| | I agree to pay the amount through cheque | Count | Percent | Cumulative Count | Cumulative Percent |
118|----+--------------------------------------------+---------+-----------+--------------------+----------------------|
119| 0 | No | 9240 | 1 | 9240 | 1 |
120+----+--------------------------------------------+---------+-----------+--------------------+----------------------+
121+----+------------------------------+---------+-------------+--------------------+----------------------+
122| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |
123|----+------------------------------+---------+-------------+--------------------+----------------------|
124| 0 | Email Opened | 3437 | 0.376163 | 3437 | 0.376163 |
125| 1 | SMS Sent | 2745 | 0.300427 | 6182 | 0.67659 |
126| 2 | Olark Chat Conversation | 973 | 0.10649 | 7155 | 0.78308 |
127| 3 | Page Visited on Website | 640 | 0.0700449 | 7795 | 0.853125 |
128| 4 | Converted to Lead | 428 | 0.0468425 | 8223 | 0.899967 |
129| 5 | Email Bounced | 326 | 0.0356791 | 8549 | 0.935646 |
130| 6 | Email Link Clicked | 267 | 0.0292218 | 8816 | 0.964868 |
131| 7 | Form Submitted on Website | 116 | 0.0126956 | 8932 | 0.977564 |
132| 8 | Unreachable | 93 | 0.0101784 | 9025 | 0.987742 |
133| 9 | Unsubscribed | 61 | 0.00667615 | 9086 | 0.994418 |
134| 10 | Had a Phone Conversation | 30 | 0.00328335 | 9116 | 0.997702 |
135| 11 | Approached upfront | 9 | 0.000985006 | 9125 | 0.998687 |
136| 12 | View in browser link Clicked | 6 | 0.000656671 | 9131 | 0.999343 |
137| 13 | Email Received | 2 | 0.00021889 | 9133 | 0.999562 |
138| 14 | Email Marked Spam | 2 | 0.00021889 | 9135 | 0.999781 |
139| 15 | Visited Booth in Tradeshow | 1 | 0.000109445 | 9136 | 0.999891 |
140| 16 | Resubscribed to emails | 1 | 0.000109445 | 9137 | 1 |
141+----+------------------------------+---------+-------------+--------------------+----------------------+
142+----+------------------------------+---------+-------------+--------------------+----------------------+
143| | Last Notable Activity | Count | Percent | Cumulative Count | Cumulative Percent |
144|----+------------------------------+---------+-------------+--------------------+----------------------|
145| 0 | Modified | 3407 | 0.368723 | 3407 | 0.368723 |
146| 1 | Email Opened | 2827 | 0.305952 | 6234 | 0.674675 |
147| 2 | SMS Sent | 2172 | 0.235065 | 8406 | 0.90974 |
148| 3 | Page Visited on Website | 318 | 0.0344156 | 8724 | 0.944156 |
149| 4 | Olark Chat Conversation | 183 | 0.0198052 | 8907 | 0.963961 |
150| 5 | Email Link Clicked | 173 | 0.0187229 | 9080 | 0.982684 |
151| 6 | Email Bounced | 60 | 0.00649351 | 9140 | 0.989177 |
152| 7 | Unsubscribed | 47 | 0.00508658 | 9187 | 0.994264 |
153| 8 | Unreachable | 32 | 0.0034632 | 9219 | 0.997727 |
154| 9 | Had a Phone Conversation | 14 | 0.00151515 | 9233 | 0.999242 |
155| 10 | Email Marked Spam | 2 | 0.00021645 | 9235 | 0.999459 |
156| 11 | View in browser link Clicked | 1 | 0.000108225 | 9236 | 0.999567 |
157| 12 | Resubscribed to emails | 1 | 0.000108225 | 9237 | 0.999675 |
158| 13 | Form Submitted on Website | 1 | 0.000108225 | 9238 | 0.999784 |
159| 14 | Email Received | 1 | 0.000108225 | 9239 | 0.999892 |
160| 15 | Approached upfront | 1 | 0.000108225 | 9240 | 1 |
161+----+------------------------------+---------+-------------+--------------------+----------------------+
162+----+-------------------------+---------+-------------+--------------------+----------------------+
163| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |
164|----+-------------------------+---------+-------------+--------------------+----------------------|
165| 0 | Landing Page Submission | 4886 | 0.528788 | 4886 | 0.528788 |
166| 1 | API | 3580 | 0.387446 | 8466 | 0.916234 |
167| 2 | Lead Add Form | 718 | 0.0777056 | 9184 | 0.993939 |
168| 3 | Lead Import | 55 | 0.00595238 | 9239 | 0.999892 |
169| 4 | Quick Add Form | 1 | 0.000108225 | 9240 | 1 |
170+----+-------------------------+---------+-------------+--------------------+----------------------+
171+----+-------------------+---------+-------------+--------------------+----------------------+
172| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |
173|----+-------------------+---------+-------------+--------------------+----------------------|
174| 0 | Google | 2873 | 0.312147 | 2873 | 0.312147 |
175| 1 | Direct Traffic | 2543 | 0.276293 | 5416 | 0.58844 |
176| 2 | Olark Chat | 1755 | 0.190678 | 7171 | 0.779118 |
177| 3 | Organic Search | 1154 | 0.12538 | 8325 | 0.904498 |
178| 4 | Reference | 534 | 0.0580183 | 8859 | 0.962516 |
179| 5 | Welingak Website | 142 | 0.0154281 | 9001 | 0.977944 |
180| 6 | Referral Sites | 125 | 0.0135811 | 9126 | 0.991525 |
181| 7 | Facebook | 55 | 0.00597566 | 9181 | 0.997501 |
182| 8 | bing | 6 | 0.00065189 | 9187 | 0.998153 |
183| 9 | Click2call | 4 | 0.000434594 | 9191 | 0.998588 |
184| 10 | Social Media | 2 | 0.000217297 | 9193 | 0.998805 |
185| 11 | Press_Release | 2 | 0.000217297 | 9195 | 0.999022 |
186| 12 | Live Chat | 2 | 0.000217297 | 9197 | 0.999239 |
187| 13 | youtubechannel | 1 | 0.000108648 | 9198 | 0.999348 |
188| 14 | welearnblog_Home | 1 | 0.000108648 | 9199 | 0.999457 |
189| 15 | testone | 1 | 0.000108648 | 9200 | 0.999565 |
190| 16 | blog | 1 | 0.000108648 | 9201 | 0.999674 |
191| 17 | WeLearn | 1 | 0.000108648 | 9202 | 0.999783 |
192| 18 | Pay per Click Ads | 1 | 0.000108648 | 9203 | 0.999891 |
193| 19 | NC_EDM | 1 | 0.000108648 | 9204 | 1 |
194+----+-------------------+---------+-------------+--------------------+----------------------+
195+----+------------+---------+-----------+--------------------+----------------------+
196| | Magazine | Count | Percent | Cumulative Count | Cumulative Percent |
197|----+------------+---------+-----------+--------------------+----------------------|
198| 0 | No | 9240 | 1 | 9240 | 1 |
199+----+------------+---------+-----------+--------------------+----------------------+
200+----+-------------+---------+-------------+--------------------+----------------------+
201| | Newspaper | Count | Percent | Cumulative Count | Cumulative Percent |
202|----+-------------+---------+-------------+--------------------+----------------------|
203| 0 | No | 9239 | 0.999892 | 9239 | 0.999892 |
204| 1 | Yes | 1 | 0.000108225 | 9240 | 1 |
205+----+-------------+---------+-------------+--------------------+----------------------+
206+----+---------------------+---------+------------+--------------------+----------------------+
207| | Newspaper Article | Count | Percent | Cumulative Count | Cumulative Percent |
208|----+---------------------+---------+------------+--------------------+----------------------|
209| 0 | No | 9238 | 0.999784 | 9238 | 0.999784 |
210| 1 | Yes | 2 | 0.00021645 | 9240 | 1 |
211+----+---------------------+---------+------------+--------------------+----------------------+
212+----+------------------------------------------+---------+-----------+--------------------+----------------------+
213| | Receive More Updates About Our Courses | Count | Percent | Cumulative Count | Cumulative Percent |
214|----+------------------------------------------+---------+-----------+--------------------+----------------------|
215| 0 | No | 9240 | 1 | 9240 | 1 |
216+----+------------------------------------------+---------+-----------+--------------------+----------------------+
217+----+----------+---------+------------+--------------------+----------------------+
218| | Search | Count | Percent | Cumulative Count | Cumulative Percent |
219|----+----------+---------+------------+--------------------+----------------------|
220| 0 | No | 9226 | 0.998485 | 9226 | 0.998485 |
221| 1 | Yes | 14 | 0.00151515 | 9240 | 1 |
222+----+----------+---------+------------+--------------------+----------------------+
223+----+-----------------------------------+---------+------------+--------------------+----------------------+
224| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |
225|----+-----------------------------------+---------+------------+--------------------+----------------------|
226| 0 | No Specialization | 3380 | 0.365801 | 3380 | 0.365801 |
227| 1 | Finance Management | 976 | 0.105628 | 4356 | 0.471429 |
228| 2 | Human Resource Management | 848 | 0.0917749 | 5204 | 0.563203 |
229| 3 | Marketing Management | 838 | 0.0906926 | 6042 | 0.653896 |
230| 4 | Operations Management | 503 | 0.0544372 | 6545 | 0.708333 |
231| 5 | Business Administration | 403 | 0.0436147 | 6948 | 0.751948 |
232| 6 | IT Projects Management | 366 | 0.0396104 | 7314 | 0.791558 |
233| 7 | Supply Chain Management | 349 | 0.0377706 | 7663 | 0.829329 |
234| 8 | Banking, Investment And Insurance | 338 | 0.0365801 | 8001 | 0.865909 |
235| 9 | Travel and Tourism | 203 | 0.0219697 | 8204 | 0.887879 |
236| 10 | Media and Advertising | 203 | 0.0219697 | 8407 | 0.909848 |
237| 11 | International Business | 178 | 0.0192641 | 8585 | 0.929113 |
238| 12 | Healthcare Management | 159 | 0.0172078 | 8744 | 0.94632 |
239| 13 | Hospitality Management | 114 | 0.0123377 | 8858 | 0.958658 |
240| 14 | E-COMMERCE | 112 | 0.0121212 | 8970 | 0.970779 |
241| 15 | Retail Management | 100 | 0.0108225 | 9070 | 0.981602 |
242| 16 | Rural and Agribusiness | 73 | 0.00790043 | 9143 | 0.989502 |
243| 17 | E-Business | 57 | 0.00616883 | 9200 | 0.995671 |
244| 18 | Services Excellence | 40 | 0.004329 | 9240 | 1 |
245+----+-----------------------------------+---------+------------+--------------------+----------------------+
246+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
247| | Tags | Count | Percent | Cumulative Count | Cumulative Percent |
248|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|
249| 0 | Will revert after reading the email | 2072 | 0.351962 | 2072 | 0.351962 |
250| 1 | Ringing | 1203 | 0.204349 | 3275 | 0.556311 |
251| 2 | Interested in other courses | 513 | 0.0871412 | 3788 | 0.643452 |
252| 3 | Already a student | 465 | 0.0789876 | 4253 | 0.722439 |
253| 4 | Closed by Horizzon | 358 | 0.060812 | 4611 | 0.783251 |
254| 5 | switched off | 240 | 0.0407678 | 4851 | 0.824019 |
255| 6 | Busy | 186 | 0.031595 | 5037 | 0.855614 |
256| 7 | Lost to EINS | 175 | 0.0297265 | 5212 | 0.885341 |
257| 8 | Not doing further education | 145 | 0.0246305 | 5357 | 0.909971 |
258| 9 | Interested in full time MBA | 117 | 0.0198743 | 5474 | 0.929845 |
259| 10 | Graduation in progress | 111 | 0.0188551 | 5585 | 0.948701 |
260| 11 | invalid number | 83 | 0.0140989 | 5668 | 0.962799 |
261| 12 | Diploma holder (Not Eligible) | 63 | 0.0107015 | 5731 | 0.973501 |
262| 13 | wrong number given | 47 | 0.00798369 | 5778 | 0.981485 |
263| 14 | opp hangup | 33 | 0.00560557 | 5811 | 0.98709 |
264| 15 | number not provided | 27 | 0.00458638 | 5838 | 0.991677 |
265| 16 | in touch with EINS | 12 | 0.00203839 | 5850 | 0.993715 |
266| 17 | Lost to Others | 7 | 0.00118906 | 5857 | 0.994904 |
267| 18 | Want to take admission but has financial problems | 6 | 0.00101919 | 5863 | 0.995923 |
268| 19 | Still Thinking | 6 | 0.00101919 | 5869 | 0.996942 |
269| 20 | Interested in Next batch | 5 | 0.000849329 | 5874 | 0.997792 |
270| 21 | In confusion whether part time or DLP | 5 | 0.000849329 | 5879 | 0.998641 |
271| 22 | Lateral student | 3 | 0.000509597 | 5882 | 0.999151 |
272| 23 | University not recognized | 2 | 0.000339732 | 5884 | 0.99949 |
273| 24 | Shall take in the next coming month | 2 | 0.000339732 | 5886 | 0.99983 |
274| 25 | Recognition issue (DEC approval) | 1 | 0.000169866 | 5887 | 1 |
275+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
276+----+---------------------------+---------+-------------+--------------------+----------------------+
277| | Through Recommendations | Count | Percent | Cumulative Count | Cumulative Percent |
278|----+---------------------------+---------+-------------+--------------------+----------------------|
279| 0 | No | 9233 | 0.999242 | 9233 | 0.999242 |
280| 1 | Yes | 7 | 0.000757576 | 9240 | 1 |
281+----+---------------------------+---------+-------------+--------------------+----------------------+
282+----+-------------------------------------+---------+-----------+--------------------+----------------------+
283| | Update me on Supply Chain Content | Count | Percent | Cumulative Count | Cumulative Percent |
284|----+-------------------------------------+---------+-----------+--------------------+----------------------|
285| 0 | No | 9240 | 1 | 9240 | 1 |
286+----+-------------------------------------+---------+-----------+--------------------+----------------------+
287+----+-----------------------------------+---------+-------------+--------------------+----------------------+
288| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |
289|----+-----------------------------------+---------+-------------+--------------------+----------------------|
290| 0 | Unemployed | 5600 | 0.606061 | 5600 | 0.606061 |
291| 1 | Unknown Occupation | 2690 | 0.291126 | 8290 | 0.897186 |
292| 2 | Working Professional | 706 | 0.0764069 | 8996 | 0.973593 |
293| 3 | Student | 210 | 0.0227273 | 9206 | 0.99632 |
294| 4 | Other | 16 | 0.0017316 | 9222 | 0.998052 |
295| 5 | Housewife | 10 | 0.00108225 | 9232 | 0.999134 |
296| 6 | Businessman | 8 | 0.000865801 | 9240 | 1 |
297+----+-----------------------------------+---------+-------------+--------------------+----------------------+
298+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
299| | What matters most to you in choosing a course | Count | Percent | Cumulative Count | Cumulative Percent |
300|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|
301| 0 | Better Career Prospects | 6528 | 0.706494 | 6528 | 0.706494 |
302| 1 | Unknown Target | 2709 | 0.293182 | 9237 | 0.999675 |
303| 2 | Flexibility & Convenience | 2 | 0.00021645 | 9239 | 0.999892 |
304| 3 | Other | 1 | 0.000108225 | 9240 | 1 |
305+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+
306+----+----------------------+---------+-------------+--------------------+----------------------+
307| | X Education Forums | Count | Percent | Cumulative Count | Cumulative Percent |
308|----+----------------------+---------+-------------+--------------------+----------------------|
309| 0 | No | 9239 | 0.999892 | 9239 | 0.999892 |
310| 1 | Yes | 1 | 0.000108225 | 9240 | 1 |
311+----+----------------------+---------+-------------+--------------------+----------------------+
  • Let’s look for columns have more than 99% leads have the same level
  • Such variables do not explain any variability.
  • Hence, these columns are unnecessary to the analysis . They could be dropped
1# Dropping columns having only one label - since these donot explain any variability in the dataset
2
3invariableCol = ['Digital Advertisement','Do Not Call','Get updates on DM Content','Magazine','Newspaper','Newspaper Article','Receive More Updates About Our Courses','Search',
4'Update me on Supply Chain Content','Through Recommendations',
5'I agree to pay the amount through cheque',"What matters most to you in choosing a course",'X Education Forums']
6leads.drop(columns=invariableCol, inplace=True)
1# Tags feature
2tab(leads.stb.freq(['Tags']))
1+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
2| | Tags | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|
4| 0 | Will revert after reading the email | 2072 | 0.351962 | 2072 | 0.351962 |
5| 1 | Ringing | 1203 | 0.204349 | 3275 | 0.556311 |
6| 2 | Interested in other courses | 513 | 0.0871412 | 3788 | 0.643452 |
7| 3 | Already a student | 465 | 0.0789876 | 4253 | 0.722439 |
8| 4 | Closed by Horizzon | 358 | 0.060812 | 4611 | 0.783251 |
9| 5 | switched off | 240 | 0.0407678 | 4851 | 0.824019 |
10| 6 | Busy | 186 | 0.031595 | 5037 | 0.855614 |
11| 7 | Lost to EINS | 175 | 0.0297265 | 5212 | 0.885341 |
12| 8 | Not doing further education | 145 | 0.0246305 | 5357 | 0.909971 |
13| 9 | Interested in full time MBA | 117 | 0.0198743 | 5474 | 0.929845 |
14| 10 | Graduation in progress | 111 | 0.0188551 | 5585 | 0.948701 |
15| 11 | invalid number | 83 | 0.0140989 | 5668 | 0.962799 |
16| 12 | Diploma holder (Not Eligible) | 63 | 0.0107015 | 5731 | 0.973501 |
17| 13 | wrong number given | 47 | 0.00798369 | 5778 | 0.981485 |
18| 14 | opp hangup | 33 | 0.00560557 | 5811 | 0.98709 |
19| 15 | number not provided | 27 | 0.00458638 | 5838 | 0.991677 |
20| 16 | in touch with EINS | 12 | 0.00203839 | 5850 | 0.993715 |
21| 17 | Lost to Others | 7 | 0.00118906 | 5857 | 0.994904 |
22| 18 | Want to take admission but has financial problems | 6 | 0.00101919 | 5863 | 0.995923 |
23| 19 | Still Thinking | 6 | 0.00101919 | 5869 | 0.996942 |
24| 20 | Interested in Next batch | 5 | 0.000849329 | 5874 | 0.997792 |
25| 21 | In confusion whether part time or DLP | 5 | 0.000849329 | 5879 | 0.998641 |
26| 22 | Lateral student | 3 | 0.000509597 | 5882 | 0.999151 |
27| 23 | University not recognized | 2 | 0.000339732 | 5884 | 0.99949 |
28| 24 | Shall take in the next coming month | 2 | 0.000339732 | 5886 | 0.99983 |
29| 25 | Recognition issue (DEC approval) | 1 | 0.000169866 | 5887 | 1 |
30+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
  • Tags column shows remarks by the sales team. This a subjective variable based on judgement of the team and cannot be used for analysis since the lables might change or might not always be available.
  • Also, it has a lot of levels that don’t seem like mutually exclusive classes
  • Let’s drop this feature for this analysis.
1# dropping Tags feature
2leads.drop(columns=['Tags'], inplace=True)
  • Last Notable Activity & Last Activity seem to have similar levels
  • Lets look at the possibility of dropping one of them
1# Last Notable Activity vs Last Activity
2leads_copy = leads.copy()
3leads_copy['Converted'] = leads_copy['Converted'].astype('int')
4tab(leads_copy.stb.freq(['Last Notable Activity'],value='Converted'))
5tab(leads_copy.stb.freq(['Last Activity'],value='Converted'))
1+----+--------------------------+-------------+------------+------------------------+----------------------+
2| | Last Notable Activity | Converted | Percent | Cumulative Converted | Cumulative Percent |
3|----+--------------------------+-------------+------------+------------------------+----------------------|
4| 0 | SMS Sent | 1508 | 0.423477 | 1508 | 0.423477 |
5| 1 | Email Opened | 1044 | 0.293176 | 2552 | 0.716653 |
6| 2 | Modified | 783 | 0.219882 | 3335 | 0.936535 |
7| 3 | Page Visited on Website | 93 | 0.0261163 | 3428 | 0.962651 |
8| 4 | Email Link Clicked | 45 | 0.0126369 | 3473 | 0.975288 |
9| 5 | Olark Chat Conversation | 25 | 0.0070205 | 3498 | 0.982308 |
10| 6 | Unreachable | 22 | 0.00617804 | 3520 | 0.988486 |
11| 7 | Unsubscribed | 14 | 0.00393148 | 3534 | 0.992418 |
12| 8 | Had a Phone Conversation | 13 | 0.00365066 | 3547 | 0.996069 |
13| 9 | Email Bounced | 9 | 0.00252738 | 3556 | 0.998596 |
14| 10 | Email Marked Spam | 2 | 0.00056164 | 3558 | 0.999158 |
15| 11 | Resubscribed to emails | 1 | 0.00028082 | 3559 | 0.999438 |
16| 12 | Email Received | 1 | 0.00028082 | 3560 | 0.999719 |
17| 13 | Approached upfront | 1 | 0.00028082 | 3561 | 1 |
18+----+--------------------------+-------------+------------+------------------------+----------------------+
19+----+------------------------------+-------------+-------------+------------------------+----------------------+
20| | Last Activity | Converted | Percent | Cumulative Converted | Cumulative Percent |
21|----+------------------------------+-------------+-------------+------------------------+----------------------|
22| 0 | SMS Sent | 1727 | 0.496264 | 1727 | 0.496264 |
23| 1 | Email Opened | 1253 | 0.360057 | 2980 | 0.856322 |
24| 2 | Page Visited on Website | 151 | 0.0433908 | 3131 | 0.899713 |
25| 3 | Olark Chat Conversation | 84 | 0.0241379 | 3215 | 0.923851 |
26| 4 | Email Link Clicked | 73 | 0.020977 | 3288 | 0.944828 |
27| 5 | Converted to Lead | 54 | 0.0155172 | 3342 | 0.960345 |
28| 6 | Unreachable | 31 | 0.00890805 | 3373 | 0.969253 |
29| 7 | Form Submitted on Website | 28 | 0.00804598 | 3401 | 0.977299 |
30| 8 | Email Bounced | 26 | 0.00747126 | 3427 | 0.98477 |
31| 9 | Had a Phone Conversation | 22 | 0.00632184 | 3449 | 0.991092 |
32| 10 | Unsubscribed | 16 | 0.0045977 | 3465 | 0.99569 |
33| 11 | Approached upfront | 9 | 0.00258621 | 3474 | 0.998276 |
34| 12 | Email Received | 2 | 0.000574713 | 3476 | 0.998851 |
35| 13 | Email Marked Spam | 2 | 0.000574713 | 3478 | 0.999425 |
36| 14 | View in browser link Clicked | 1 | 0.000287356 | 3479 | 0.999713 |
37| 15 | Resubscribed to emails | 1 | 0.000287356 | 3480 | 1 |
38+----+------------------------------+-------------+-------------+------------------------+----------------------+
  • Last Activity has more levels compared to Last Notable Activity
  • Last Notable Activity seems like a column derived by the sales team using Last Activity.
  • Since this insight might not be available for a new lead, let’d drop Last Notable Activity.
1leads.drop(columns = ['Last Notable Activity'], inplace=True)
1# Looking at Missing Values again
2tab(leads.stb.missing())
1+----------------------------------------+-----------+---------+------------+
2| | Missing | Total | Percent |
3|----------------------------------------+-----------+---------+------------|
4| TotalVisits | 137 | 9240 | 0.0148268 |
5| Page Views Per Visit | 137 | 9240 | 0.0148268 |
6| Last Activity | 103 | 9240 | 0.0111472 |
7| City | 60 | 9240 | 0.00649351 |
8| Lead Source | 36 | 9240 | 0.0038961 |
9| Lead Origin | 0 | 9240 | 0 |
10| Do Not Email | 0 | 9240 | 0 |
11| Converted | 0 | 9240 | 0 |
12| Total Time Spent on Website | 0 | 9240 | 0 |
13| Country | 0 | 9240 | 0 |
14| Specialization | 0 | 9240 | 0 |
15| What is your current occupation | 0 | 9240 | 0 |
16| A free copy of Mastering The Interview | 0 | 9240 | 0 |
17+----------------------------------------+-----------+---------+------------+
  • From the above, the number of missing values is less than 2%. These are deemed missing completely at random. And these rows could be dropped without affecting the analysis
1leads.dropna(inplace=True)
2tab(leads.stb.missing())
1+----------------------------------------+-----------+---------+-----------+
2| | Missing | Total | Percent |
3|----------------------------------------+-----------+---------+-----------|
4| Lead Origin | 0 | 9014 | 0 |
5| Lead Source | 0 | 9014 | 0 |
6| Do Not Email | 0 | 9014 | 0 |
7| Converted | 0 | 9014 | 0 |
8| TotalVisits | 0 | 9014 | 0 |
9| Total Time Spent on Website | 0 | 9014 | 0 |
10| Page Views Per Visit | 0 | 9014 | 0 |
11| Last Activity | 0 | 9014 | 0 |
12| Country | 0 | 9014 | 0 |
13| Specialization | 0 | 9014 | 0 |
14| What is your current occupation | 0 | 9014 | 0 |
15| City | 0 | 9014 | 0 |
16| A free copy of Mastering The Interview | 0 | 9014 | 0 |
17+----------------------------------------+-----------+---------+-----------+

Grouping Labels with less leads

Country
1# Country distribution
2tab(leads.stb.freq(['Country']))
1+----+----------------------+---------+-------------+--------------------+----------------------+
2| | Country | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+----------------------+---------+-------------+--------------------+----------------------|
4| 0 | India | 8787 | 0.974817 | 8787 | 0.974817 |
5| 1 | United States | 51 | 0.00565787 | 8838 | 0.980475 |
6| 2 | United Arab Emirates | 44 | 0.0048813 | 8882 | 0.985356 |
7| 3 | Saudi Arabia | 21 | 0.00232971 | 8903 | 0.987686 |
8| 4 | Singapore | 17 | 0.00188596 | 8920 | 0.989572 |
9| 5 | United Kingdom | 12 | 0.00133126 | 8932 | 0.990903 |
10| 6 | Australia | 11 | 0.00122032 | 8943 | 0.992123 |
11| 7 | Qatar | 8 | 0.000887508 | 8951 | 0.993011 |
12| 8 | Bahrain | 7 | 0.00077657 | 8958 | 0.993787 |
13| 9 | Hong Kong | 6 | 0.000665631 | 8964 | 0.994453 |
14| 10 | France | 6 | 0.000665631 | 8970 | 0.995119 |
15| 11 | Oman | 5 | 0.000554693 | 8975 | 0.995673 |
16| 12 | Nigeria | 4 | 0.000443754 | 8979 | 0.996117 |
17| 13 | Kuwait | 4 | 0.000443754 | 8983 | 0.996561 |
18| 14 | Germany | 4 | 0.000443754 | 8987 | 0.997005 |
19| 15 | South Africa | 3 | 0.000332816 | 8990 | 0.997337 |
20| 16 | Canada | 3 | 0.000332816 | 8993 | 0.99767 |
21| 17 | Philippines | 2 | 0.000221877 | 8995 | 0.997892 |
22| 18 | Netherlands | 2 | 0.000221877 | 8997 | 0.998114 |
23| 19 | Belgium | 2 | 0.000221877 | 8999 | 0.998336 |
24| 20 | Bangladesh | 2 | 0.000221877 | 9001 | 0.998558 |
25| 21 | Vietnam | 1 | 0.000110939 | 9002 | 0.998669 |
26| 22 | Uganda | 1 | 0.000110939 | 9003 | 0.99878 |
27| 23 | Tanzania | 1 | 0.000110939 | 9004 | 0.998891 |
28| 24 | Switzerland | 1 | 0.000110939 | 9005 | 0.999002 |
29| 25 | Sweden | 1 | 0.000110939 | 9006 | 0.999112 |
30| 26 | Malaysia | 1 | 0.000110939 | 9007 | 0.999223 |
31| 27 | Liberia | 1 | 0.000110939 | 9008 | 0.999334 |
32| 28 | Kenya | 1 | 0.000110939 | 9009 | 0.999445 |
33| 29 | Italy | 1 | 0.000110939 | 9010 | 0.999556 |
34| 30 | Indonesia | 1 | 0.000110939 | 9011 | 0.999667 |
35| 31 | Ghana | 1 | 0.000110939 | 9012 | 0.999778 |
36| 32 | Denmark | 1 | 0.000110939 | 9013 | 0.999889 |
37| 33 | China | 1 | 0.000110939 | 9014 | 1 |
38+----+----------------------+---------+-------------+--------------------+----------------------+
  • We see that leads from India make 97% of all leads. And others collectively make up 3% and the contribution of each of these countries is <= 1%.
  • To reduce the levels, let us group the minority labels into a new label called ‘Outside India’
1# Grouping Countries with very low lead count into 'Outside India'
2
3leadsByCountry = leads['Country'].value_counts(normalize=True)
4lowLeadCountries = leadsByCountry[leadsByCountry <= 0.01].index
5
6leads['Country'].replace(lowLeadCountries,'Outside India',inplace=True)
7tab(leads.stb.freq(['Country']))
1+----+---------------+---------+-----------+--------------------+----------------------+
2| | Country | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+---------------+---------+-----------+--------------------+----------------------|
4| 0 | India | 8787 | 0.974817 | 8787 | 0.974817 |
5| 1 | Outside India | 227 | 0.025183 | 9014 | 1 |
6+----+---------------+---------+-----------+--------------------+----------------------+
Lead Origin
1feature = 'Lead Origin'
2tab(leads.stb.freq([feature]))
1+----+-------------------------+---------+------------+--------------------+----------------------+
2| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+-------------------------+---------+------------+--------------------+----------------------|
4| 0 | Landing Page Submission | 4870 | 0.540271 | 4870 | 0.540271 |
5| 1 | API | 3533 | 0.391946 | 8403 | 0.932217 |
6| 2 | Lead Add Form | 581 | 0.0644553 | 8984 | 0.996672 |
7| 3 | Lead Import | 30 | 0.00332816 | 9014 | 1 |
8+----+-------------------------+---------+------------+--------------------+----------------------+
  • We see that lead origins like Lead Add Form,Lead Import are less than 1% of all origins.
  • Let’s group them into a level called ‘Other Lead Origins’
1# Grouping lead origins
2leadOriginsToGroup = ["Lead Add Form","Lead Import"]
3leads[feature] = leads[feature].replace(leadOriginsToGroup, ['Other Lead Origins']*2)
4tab(leads.stb.freq([feature]))
1+----+-------------------------+---------+-----------+--------------------+----------------------+
2| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+-------------------------+---------+-----------+--------------------+----------------------|
4| 0 | Landing Page Submission | 4870 | 0.540271 | 4870 | 0.540271 |
5| 1 | API | 3533 | 0.391946 | 8403 | 0.932217 |
6| 2 | Other Lead Origins | 611 | 0.0677834 | 9014 | 1 |
7+----+-------------------------+---------+-----------+--------------------+----------------------+
Lead Source
1feature = 'Lead Source'
2tab(leads.stb.freq([feature]))
1+----+-------------------+---------+-------------+--------------------+----------------------+
2| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+-------------------+---------+-------------+--------------------+----------------------|
4| 0 | Google | 2857 | 0.316951 | 2857 | 0.316951 |
5| 1 | Direct Traffic | 2528 | 0.280453 | 5385 | 0.597404 |
6| 2 | Olark Chat | 1753 | 0.194475 | 7138 | 0.791879 |
7| 3 | Organic Search | 1131 | 0.125471 | 8269 | 0.917351 |
8| 4 | Reference | 443 | 0.0491458 | 8712 | 0.966497 |
9| 5 | Welingak Website | 129 | 0.0143111 | 8841 | 0.980808 |
10| 6 | Referral Sites | 120 | 0.0133126 | 8961 | 0.99412 |
11| 7 | Facebook | 31 | 0.00343909 | 8992 | 0.997559 |
12| 8 | bing | 6 | 0.000665631 | 8998 | 0.998225 |
13| 9 | Click2call | 4 | 0.000443754 | 9002 | 0.998669 |
14| 10 | Social Media | 2 | 0.000221877 | 9004 | 0.998891 |
15| 11 | Press_Release | 2 | 0.000221877 | 9006 | 0.999112 |
16| 12 | Live Chat | 2 | 0.000221877 | 9008 | 0.999334 |
17| 13 | welearnblog_Home | 1 | 0.000110939 | 9009 | 0.999445 |
18| 14 | testone | 1 | 0.000110939 | 9010 | 0.999556 |
19| 15 | blog | 1 | 0.000110939 | 9011 | 0.999667 |
20| 16 | WeLearn | 1 | 0.000110939 | 9012 | 0.999778 |
21| 17 | Pay per Click Ads | 1 | 0.000110939 | 9013 | 0.999889 |
22| 18 | NC_EDM | 1 | 0.000110939 | 9014 | 1 |
23+----+-------------------+---------+-------------+--------------------+----------------------+
  • Lead sources from #7 to #18 contribute to less than 1% of all sources.
  • Let’s group these into a new label called ‘Other Lead Sources’
1# Grouping lead Sources
2labelCounts = leads[feature].value_counts(normalize=True)
3
4# labels with less than 1% contribution
5labelsToGroup = labelCounts[labelCounts < 0.01].index.values
6
7leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature+'s']*len(labelsToGroup))
8
9tab(leads.stb.freq([feature]))
1+----+--------------------+---------+------------+--------------------+----------------------+
2| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+--------------------+---------+------------+--------------------+----------------------|
4| 0 | Google | 2857 | 0.316951 | 2857 | 0.316951 |
5| 1 | Direct Traffic | 2528 | 0.280453 | 5385 | 0.597404 |
6| 2 | Olark Chat | 1753 | 0.194475 | 7138 | 0.791879 |
7| 3 | Organic Search | 1131 | 0.125471 | 8269 | 0.917351 |
8| 4 | Reference | 443 | 0.0491458 | 8712 | 0.966497 |
9| 5 | Welingak Website | 129 | 0.0143111 | 8841 | 0.980808 |
10| 6 | Referral Sites | 120 | 0.0133126 | 8961 | 0.99412 |
11| 7 | Other Lead Sources | 53 | 0.00587974 | 9014 | 1 |
12+----+--------------------+---------+------------+--------------------+----------------------+
Last Activity
1feature = 'Last Activity'
2tab(leads.stb.freq([feature]))
1+----+------------------------------+---------+-------------+--------------------+----------------------+
2| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+------------------------------+---------+-------------+--------------------+----------------------|
4| 0 | Email Opened | 3417 | 0.379077 | 3417 | 0.379077 |
5| 1 | SMS Sent | 2701 | 0.299645 | 6118 | 0.678722 |
6| 2 | Olark Chat Conversation | 963 | 0.106834 | 7081 | 0.785556 |
7| 3 | Page Visited on Website | 637 | 0.0706679 | 7718 | 0.856224 |
8| 4 | Converted to Lead | 422 | 0.0468161 | 8140 | 0.90304 |
9| 5 | Email Bounced | 304 | 0.0337253 | 8444 | 0.936765 |
10| 6 | Email Link Clicked | 266 | 0.0295097 | 8710 | 0.966275 |
11| 7 | Form Submitted on Website | 115 | 0.0127579 | 8825 | 0.979033 |
12| 8 | Unreachable | 89 | 0.00987353 | 8914 | 0.988906 |
13| 9 | Unsubscribed | 58 | 0.00643444 | 8972 | 0.995341 |
14| 10 | Had a Phone Conversation | 25 | 0.00277346 | 8997 | 0.998114 |
15| 11 | View in browser link Clicked | 6 | 0.000665631 | 9003 | 0.99878 |
16| 12 | Approached upfront | 5 | 0.000554693 | 9008 | 0.999334 |
17| 13 | Email Received | 2 | 0.000221877 | 9010 | 0.999556 |
18| 14 | Email Marked Spam | 2 | 0.000221877 | 9012 | 0.999778 |
19| 15 | Visited Booth in Tradeshow | 1 | 0.000110939 | 9013 | 0.999889 |
20| 16 | Resubscribed to emails | 1 | 0.000110939 | 9014 | 1 |
21+----+------------------------------+---------+-------------+--------------------+----------------------+
  • Leads from #9 to #16 contribute to less than 1% of all last activity labels.Moreover, each of these labels contributes to less than 1% of leads.
  • Let’s group these into a new label called ‘Other Last Activity’
1# Grouping Last Activity
2labelCounts = leads[feature].value_counts(normalize=True)
3
4# labels with less than 2% contribution
5labelsToGroup = labelCounts[labelCounts < 0.01].index.values
6
7leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))
8
9tab(leads.stb.freq([feature]))
1+----+---------------------------+---------+-----------+--------------------+----------------------+
2| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+---------------------------+---------+-----------+--------------------+----------------------|
4| 0 | Email Opened | 3417 | 0.379077 | 3417 | 0.379077 |
5| 1 | SMS Sent | 2701 | 0.299645 | 6118 | 0.678722 |
6| 2 | Olark Chat Conversation | 963 | 0.106834 | 7081 | 0.785556 |
7| 3 | Page Visited on Website | 637 | 0.0706679 | 7718 | 0.856224 |
8| 4 | Converted to Lead | 422 | 0.0468161 | 8140 | 0.90304 |
9| 5 | Email Bounced | 304 | 0.0337253 | 8444 | 0.936765 |
10| 6 | Email Link Clicked | 266 | 0.0295097 | 8710 | 0.966275 |
11| 7 | Other Last Activity | 189 | 0.0209674 | 8899 | 0.987242 |
12| 8 | Form Submitted on Website | 115 | 0.0127579 | 9014 | 1 |
13+----+---------------------------+---------+-----------+--------------------+----------------------+
Specialization
1feature = 'Specialization'
2tab(leads.stb.freq([feature]))
1+----+-----------------------------------+---------+------------+--------------------+----------------------+
2| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+-----------------------------------+---------+------------+--------------------+----------------------|
4| 0 | No Specialization | 3230 | 0.358331 | 3230 | 0.358331 |
5| 1 | Finance Management | 959 | 0.10639 | 4189 | 0.464722 |
6| 2 | Human Resource Management | 836 | 0.0927446 | 5025 | 0.557466 |
7| 3 | Marketing Management | 822 | 0.0911915 | 5847 | 0.648658 |
8| 4 | Operations Management | 498 | 0.0552474 | 6345 | 0.703905 |
9| 5 | Business Administration | 397 | 0.0440426 | 6742 | 0.747948 |
10| 6 | IT Projects Management | 366 | 0.0406035 | 7108 | 0.788551 |
11| 7 | Supply Chain Management | 344 | 0.0381629 | 7452 | 0.826714 |
12| 8 | Banking, Investment And Insurance | 335 | 0.0371644 | 7787 | 0.863878 |
13| 9 | Travel and Tourism | 202 | 0.0224096 | 7989 | 0.886288 |
14| 10 | Media and Advertising | 202 | 0.0224096 | 8191 | 0.908698 |
15| 11 | International Business | 176 | 0.0195252 | 8367 | 0.928223 |
16| 12 | Healthcare Management | 156 | 0.0173064 | 8523 | 0.945529 |
17| 13 | E-COMMERCE | 111 | 0.0123142 | 8634 | 0.957843 |
18| 14 | Hospitality Management | 110 | 0.0122032 | 8744 | 0.970047 |
19| 15 | Retail Management | 100 | 0.0110939 | 8844 | 0.98114 |
20| 16 | Rural and Agribusiness | 73 | 0.00809851 | 8917 | 0.989239 |
21| 17 | E-Business | 57 | 0.0063235 | 8974 | 0.995562 |
22| 18 | Services Excellence | 40 | 0.00443754 | 9014 | 1 |
23+----+-----------------------------------+---------+------------+--------------------+----------------------+
  • Lead from from #16 to #18 contribute to less than 2% of all Specialization categories. Moreover, each of these categories contributes to less than 1% of leads.
  • Let’s group these into a new label called ‘Other Specializations’
1# Grouping Last Activity
2labelCounts = leads[feature].value_counts(normalize=True)
3
4# labels with less than 2% contribution
5labelsToGroup = labelCounts[labelCounts <=0.012121].index.values
6
7leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))
8
9tab(leads.stb.freq([feature]))
1+----+-----------------------------------+---------+-----------+--------------------+----------------------+
2| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+-----------------------------------+---------+-----------+--------------------+----------------------|
4| 0 | No Specialization | 3230 | 0.358331 | 3230 | 0.358331 |
5| 1 | Finance Management | 959 | 0.10639 | 4189 | 0.464722 |
6| 2 | Human Resource Management | 836 | 0.0927446 | 5025 | 0.557466 |
7| 3 | Marketing Management | 822 | 0.0911915 | 5847 | 0.648658 |
8| 4 | Operations Management | 498 | 0.0552474 | 6345 | 0.703905 |
9| 5 | Business Administration | 397 | 0.0440426 | 6742 | 0.747948 |
10| 6 | IT Projects Management | 366 | 0.0406035 | 7108 | 0.788551 |
11| 7 | Supply Chain Management | 344 | 0.0381629 | 7452 | 0.826714 |
12| 8 | Banking, Investment And Insurance | 335 | 0.0371644 | 7787 | 0.863878 |
13| 9 | Other Specialization | 270 | 0.0299534 | 8057 | 0.893832 |
14| 10 | Travel and Tourism | 202 | 0.0224096 | 8259 | 0.916241 |
15| 11 | Media and Advertising | 202 | 0.0224096 | 8461 | 0.938651 |
16| 12 | International Business | 176 | 0.0195252 | 8637 | 0.958176 |
17| 13 | Healthcare Management | 156 | 0.0173064 | 8793 | 0.975483 |
18| 14 | E-COMMERCE | 111 | 0.0123142 | 8904 | 0.987797 |
19| 15 | Hospitality Management | 110 | 0.0122032 | 9014 | 1 |
20+----+-----------------------------------+---------+-----------+--------------------+----------------------+
  • Cleaning tasks have been completed. No more missing values exist

Retained Data

1# Columns retained
2print('Retained Columns\n\n', leads.columns.values)
1Retained Columns
2
3 ['Lead Origin' 'Lead Source' 'Do Not Email' 'Converted' 'TotalVisits'
4 'Total Time Spent on Website' 'Page Views Per Visit' 'Last Activity'
5 'Country' 'Specialization' 'What is your current occupation' 'City'
6 'A free copy of Mastering The Interview']
1# Retained rows
2print('Retained rows : ',leads.shape[0])
3print("Ratio of retained rows", 100*leads.shape[0]/9240)
1Retained rows : 9014
2Ratio of retained rows 97.55411255411255

Data Imbalance

1tab(leads.stb.freq(['Converted']))
1+----+-------------+---------+-----------+--------------------+----------------------+
2| | Converted | Count | Percent | Cumulative Count | Cumulative Percent |
3|----+-------------+---------+-----------+--------------------+----------------------|
4| 0 | 0 | 5595 | 0.620701 | 5595 | 0.620701 |
5| 1 | 1 | 3419 | 0.379299 | 9014 | 1 |
6+----+-------------+---------+-----------+--------------------+----------------------+
1converted_cond = leads['Converted'] == 1
2imbalance = leads[converted_cond].shape[0]/leads[~converted_cond].shape[0]
3print('Class Imbalance : Converted /Un-converted =', np.round(imbalance,3))
1Class Imbalance : Converted /Un-converted = 0.611
  • From the above, you can see that this data set contains 37% of converted leads and 62% of un-converted leads.
  • Ratio of classes = 0.6
  • The dataset is skewed towards ‘unconverted leads’

Univariate Analysis

1def categoricalUAn(column,figsize=[8,8]) :
2
3 ''' Function for categorical univariate analysis '''
4 print('Types of ' + column)
5 tab(leads.stb.freq([column]))
6
7 converted = leads[leads['Converted'] == 1]
8 unconverted = leads[leads['Converted'] == 0]
9
10 print(column + ' for Converted Leads')
11
12 tab(converted.stb.freq([column]))
13
14 print(column + ' for Un-Converted Leads')
15
16 tab(unconverted.stb.freq([column]))
17
18 print(column + ' vs Conversion Rate')
19
20 tab((converted[column].value_counts()) / (converted[column].value_counts() + unconverted[column].value_counts()))
21
22 # bar plot
23 plt.figure(figsize=figsize)
24 ax = sns.countplot(y=column,hue='Converted',data=leads)
25 title = column + ' vs Lead Conversion'
26 ax.set(title= title)

Lead Origin

1column = 'Lead Origin'
2categoricalUAn(column,figsize=[8,8])
1Types of Lead Origin
2+----+-------------------------+---------+-----------+--------------------+----------------------+
3| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+-------------------------+---------+-----------+--------------------+----------------------|
5| 0 | Landing Page Submission | 4870 | 0.540271 | 4870 | 0.540271 |
6| 1 | API | 3533 | 0.391946 | 8403 | 0.932217 |
7| 2 | Other Lead Origins | 611 | 0.0677834 | 9014 | 1 |
8+----+-------------------------+---------+-----------+--------------------+----------------------+
9Lead Origin for Converted Leads
10+----+-------------------------+---------+-----------+--------------------+----------------------+
11| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |
12|----+-------------------------+---------+-----------+--------------------+----------------------|
13| 0 | Landing Page Submission | 1765 | 0.516233 | 1765 | 0.516233 |
14| 1 | API | 1101 | 0.322024 | 2866 | 0.838257 |
15| 2 | Other Lead Origins | 553 | 0.161743 | 3419 | 1 |
16+----+-------------------------+---------+-----------+--------------------+----------------------+
17Lead Origin for Un-Converted Leads
18+----+-------------------------+---------+-----------+--------------------+----------------------+
19| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |
20|----+-------------------------+---------+-----------+--------------------+----------------------|
21| 0 | Landing Page Submission | 3105 | 0.55496 | 3105 | 0.55496 |
22| 1 | API | 2432 | 0.434674 | 5537 | 0.989634 |
23| 2 | Other Lead Origins | 58 | 0.0103664 | 5595 | 1 |
24+----+-------------------------+---------+-----------+--------------------+----------------------+
25Lead Origin vs Conversion Rate
26+-------------------------+---------------+
27| | Lead Origin |
28|-------------------------+---------------|
29| Landing Page Submission | 0.362423 |
30| API | 0.311633 |
31| Other Lead Origins | 0.905074 |
32+-------------------------+---------------+

png

  • Leads from Landing Page Submission followed by API make up 93% of all leads.
  • But it is interesting that 8.3% of leads coming from other sources have the highest conversion rate of 87.5%

Lead Source

1column = 'Lead Source'
2categoricalUAn(column,figsize=[8,8])
1Types of Lead Source
2+----+--------------------+---------+------------+--------------------+----------------------+
3| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+--------------------+---------+------------+--------------------+----------------------|
5| 0 | Google | 2857 | 0.316951 | 2857 | 0.316951 |
6| 1 | Direct Traffic | 2528 | 0.280453 | 5385 | 0.597404 |
7| 2 | Olark Chat | 1753 | 0.194475 | 7138 | 0.791879 |
8| 3 | Organic Search | 1131 | 0.125471 | 8269 | 0.917351 |
9| 4 | Reference | 443 | 0.0491458 | 8712 | 0.966497 |
10| 5 | Welingak Website | 129 | 0.0143111 | 8841 | 0.980808 |
11| 6 | Referral Sites | 120 | 0.0133126 | 8961 | 0.99412 |
12| 7 | Other Lead Sources | 53 | 0.00587974 | 9014 | 1 |
13+----+--------------------+---------+------------+--------------------+----------------------+
14Lead Source for Converted Leads
15+----+--------------------+---------+------------+--------------------+----------------------+
16| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |
17|----+--------------------+---------+------------+--------------------+----------------------|
18| 0 | Google | 1142 | 0.334016 | 1142 | 0.334016 |
19| 1 | Direct Traffic | 815 | 0.238374 | 1957 | 0.57239 |
20| 2 | Olark Chat | 448 | 0.131032 | 2405 | 0.703422 |
21| 3 | Organic Search | 428 | 0.125183 | 2833 | 0.828605 |
22| 4 | Reference | 410 | 0.119918 | 3243 | 0.948523 |
23| 5 | Welingak Website | 127 | 0.0371454 | 3370 | 0.985668 |
24| 6 | Referral Sites | 31 | 0.00906698 | 3401 | 0.994735 |
25| 7 | Other Lead Sources | 18 | 0.0052647 | 3419 | 1 |
26+----+--------------------+---------+------------+--------------------+----------------------+
27Lead Source for Un-Converted Leads
28+----+--------------------+---------+-------------+--------------------+----------------------+
29| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |
30|----+--------------------+---------+-------------+--------------------+----------------------|
31| 0 | Google | 1715 | 0.306524 | 1715 | 0.306524 |
32| 1 | Direct Traffic | 1713 | 0.306166 | 3428 | 0.61269 |
33| 2 | Olark Chat | 1305 | 0.233244 | 4733 | 0.845934 |
34| 3 | Organic Search | 703 | 0.125648 | 5436 | 0.971582 |
35| 4 | Referral Sites | 89 | 0.0159071 | 5525 | 0.987489 |
36| 5 | Other Lead Sources | 35 | 0.00625559 | 5560 | 0.993744 |
37| 6 | Reference | 33 | 0.00589812 | 5593 | 0.999643 |
38| 7 | Welingak Website | 2 | 0.000357462 | 5595 | 1 |
39+----+--------------------+---------+-------------+--------------------+----------------------+
40Lead Source vs Conversion Rate
41+--------------------+---------------+
42| | Lead Source |
43|--------------------+---------------|
44| Direct Traffic | 0.322389 |
45| Google | 0.39972 |
46| Olark Chat | 0.255562 |
47| Organic Search | 0.378426 |
48| Other Lead Sources | 0.339623 |
49| Reference | 0.925508 |
50| Referral Sites | 0.258333 |
51| Welingak Website | 0.984496 |
52+--------------------+---------------+

png

  • Most leads that get converted come from Google(31%), followed by Direct Traffic(28%) and Olark Chat(19%)
  • And leads through Reference have a very high conversion rate (91%)

Do not Email

1feature = 'Do Not Email'
2categoricalUAn(feature,figsize=[8,8])
1Types of Do Not Email
2+----+----------------+---------+-----------+--------------------+----------------------+
3| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+----------------+---------+-----------+--------------------+----------------------|
5| 0 | No | 8311 | 0.92201 | 8311 | 0.92201 |
6| 1 | Yes | 703 | 0.0779898 | 9014 | 1 |
7+----+----------------+---------+-----------+--------------------+----------------------+
8Do Not Email for Converted Leads
9+----+----------------+---------+-----------+--------------------+----------------------+
10| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |
11|----+----------------+---------+-----------+--------------------+----------------------|
12| 0 | No | 3315 | 0.969582 | 3315 | 0.969582 |
13| 1 | Yes | 104 | 0.0304183 | 3419 | 1 |
14+----+----------------+---------+-----------+--------------------+----------------------+
15Do Not Email for Un-Converted Leads
16+----+----------------+---------+-----------+--------------------+----------------------+
17| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |
18|----+----------------+---------+-----------+--------------------+----------------------|
19| 0 | No | 4996 | 0.89294 | 4996 | 0.89294 |
20| 1 | Yes | 599 | 0.10706 | 5595 | 1 |
21+----+----------------+---------+-----------+--------------------+----------------------+
22Do Not Email vs Conversion Rate
23+-----+----------------+
24| | Do Not Email |
25|-----+----------------|
26| No | 0.398869 |
27| Yes | 0.147937 |
28+-----+----------------+

png

  • 92% of leads prefer to be sent Emails about the company. Do not Email = No
  • And these are the most converted customers (40%)

Last Activity

1# 'Last Activity'
2feature = 'Last Activity'
3categoricalUAn(feature,figsize=[8,8])
1Types of Last Activity
2+----+---------------------------+---------+-----------+--------------------+----------------------+
3| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+---------------------------+---------+-----------+--------------------+----------------------|
5| 0 | Email Opened | 3417 | 0.379077 | 3417 | 0.379077 |
6| 1 | SMS Sent | 2701 | 0.299645 | 6118 | 0.678722 |
7| 2 | Olark Chat Conversation | 963 | 0.106834 | 7081 | 0.785556 |
8| 3 | Page Visited on Website | 637 | 0.0706679 | 7718 | 0.856224 |
9| 4 | Converted to Lead | 422 | 0.0468161 | 8140 | 0.90304 |
10| 5 | Email Bounced | 304 | 0.0337253 | 8444 | 0.936765 |
11| 6 | Email Link Clicked | 266 | 0.0295097 | 8710 | 0.966275 |
12| 7 | Other Last Activity | 189 | 0.0209674 | 8899 | 0.987242 |
13| 8 | Form Submitted on Website | 115 | 0.0127579 | 9014 | 1 |
14+----+---------------------------+---------+-----------+--------------------+----------------------+
15Last Activity for Converted Leads
16+----+---------------------------+---------+------------+--------------------+----------------------+
17| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |
18|----+---------------------------+---------+------------+--------------------+----------------------|
19| 0 | SMS Sent | 1697 | 0.496344 | 1697 | 0.496344 |
20| 1 | Email Opened | 1246 | 0.364434 | 2943 | 0.860778 |
21| 2 | Page Visited on Website | 150 | 0.0438725 | 3093 | 0.90465 |
22| 3 | Olark Chat Conversation | 84 | 0.0245686 | 3177 | 0.929219 |
23| 4 | Other Last Activity | 74 | 0.0216438 | 3251 | 0.950863 |
24| 5 | Email Link Clicked | 72 | 0.0210588 | 3323 | 0.971922 |
25| 6 | Converted to Lead | 53 | 0.0155016 | 3376 | 0.987423 |
26| 7 | Form Submitted on Website | 27 | 0.00789705 | 3403 | 0.99532 |
27| 8 | Email Bounced | 16 | 0.00467973 | 3419 | 1 |
28+----+---------------------------+---------+------------+--------------------+----------------------+
29Last Activity for Un-Converted Leads
30+----+---------------------------+---------+-----------+--------------------+----------------------+
31| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |
32|----+---------------------------+---------+-----------+--------------------+----------------------|
33| 0 | Email Opened | 2171 | 0.388025 | 2171 | 0.388025 |
34| 1 | SMS Sent | 1004 | 0.179446 | 3175 | 0.567471 |
35| 2 | Olark Chat Conversation | 879 | 0.157105 | 4054 | 0.724576 |
36| 3 | Page Visited on Website | 487 | 0.087042 | 4541 | 0.811618 |
37| 4 | Converted to Lead | 369 | 0.0659517 | 4910 | 0.877569 |
38| 5 | Email Bounced | 288 | 0.0514745 | 5198 | 0.929044 |
39| 6 | Email Link Clicked | 194 | 0.0346738 | 5392 | 0.963718 |
40| 7 | Other Last Activity | 115 | 0.0205541 | 5507 | 0.984272 |
41| 8 | Form Submitted on Website | 88 | 0.0157283 | 5595 | 1 |
42+----+---------------------------+---------+-----------+--------------------+----------------------+
43Last Activity vs Conversion Rate
44+---------------------------+-----------------+
45| | Last Activity |
46|---------------------------+-----------------|
47| Converted to Lead | 0.125592 |
48| Email Bounced | 0.0526316 |
49| Email Link Clicked | 0.270677 |
50| Email Opened | 0.364647 |
51| Form Submitted on Website | 0.234783 |
52| Olark Chat Conversation | 0.0872274 |
53| Other Last Activity | 0.391534 |
54| Page Visited on Website | 0.235479 |
55| SMS Sent | 0.628286 |
56+---------------------------+-----------------+

png

  • Most leads open emails sent to them (38%) and that’s their last activity.
  • Among those leads who’s last activity is opening emails, 37% are converted.
  • Only 4% of last activity indicators show Converted to Lead
  • Last activiy as ‘SMS Sent’ has highest conversion rate (62%).
  • Last activiy as ‘Email Bounced’ has lowest conversion rate (7.9%).

Country

1feature = 'Country'
2categoricalUAn(feature,figsize=[8,8])
1Types of Country
2+----+---------------+---------+-----------+--------------------+----------------------+
3| | Country | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+---------------+---------+-----------+--------------------+----------------------|
5| 0 | India | 8787 | 0.974817 | 8787 | 0.974817 |
6| 1 | Outside India | 227 | 0.025183 | 9014 | 1 |
7+----+---------------+---------+-----------+--------------------+----------------------+
8Country for Converted Leads
9+----+---------------+---------+-----------+--------------------+----------------------+
10| | Country | Count | Percent | Cumulative Count | Cumulative Percent |
11|----+---------------+---------+-----------+--------------------+----------------------|
12| 0 | India | 3351 | 0.980111 | 3351 | 0.980111 |
13| 1 | Outside India | 68 | 0.0198889 | 3419 | 1 |
14+----+---------------+---------+-----------+--------------------+----------------------+
15Country for Un-Converted Leads
16+----+---------------+---------+-----------+--------------------+----------------------+
17| | Country | Count | Percent | Cumulative Count | Cumulative Percent |
18|----+---------------+---------+-----------+--------------------+----------------------|
19| 0 | India | 5436 | 0.971582 | 5436 | 0.971582 |
20| 1 | Outside India | 159 | 0.0284182 | 5595 | 1 |
21+----+---------------+---------+-----------+--------------------+----------------------+
22Country vs Conversion Rate
23+---------------+-----------+
24| | Country |
25|---------------+-----------|
26| India | 0.381359 |
27| Outside India | 0.299559 |
28+---------------+-----------+

png

  • Most leads come from India (97%)
  • Out of these 38% are converted.

Specialization

1feature = 'Specialization'
2categoricalUAn(feature)
1Types of Specialization
2+----+-----------------------------------+---------+-----------+--------------------+----------------------+
3| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+-----------------------------------+---------+-----------+--------------------+----------------------|
5| 0 | No Specialization | 3230 | 0.358331 | 3230 | 0.358331 |
6| 1 | Finance Management | 959 | 0.10639 | 4189 | 0.464722 |
7| 2 | Human Resource Management | 836 | 0.0927446 | 5025 | 0.557466 |
8| 3 | Marketing Management | 822 | 0.0911915 | 5847 | 0.648658 |
9| 4 | Operations Management | 498 | 0.0552474 | 6345 | 0.703905 |
10| 5 | Business Administration | 397 | 0.0440426 | 6742 | 0.747948 |
11| 6 | IT Projects Management | 366 | 0.0406035 | 7108 | 0.788551 |
12| 7 | Supply Chain Management | 344 | 0.0381629 | 7452 | 0.826714 |
13| 8 | Banking, Investment And Insurance | 335 | 0.0371644 | 7787 | 0.863878 |
14| 9 | Other Specialization | 270 | 0.0299534 | 8057 | 0.893832 |
15| 10 | Travel and Tourism | 202 | 0.0224096 | 8259 | 0.916241 |
16| 11 | Media and Advertising | 202 | 0.0224096 | 8461 | 0.938651 |
17| 12 | International Business | 176 | 0.0195252 | 8637 | 0.958176 |
18| 13 | Healthcare Management | 156 | 0.0173064 | 8793 | 0.975483 |
19| 14 | E-COMMERCE | 111 | 0.0123142 | 8904 | 0.987797 |
20| 15 | Hospitality Management | 110 | 0.0122032 | 9014 | 1 |
21+----+-----------------------------------+---------+-----------+--------------------+----------------------+
22Specialization for Converted Leads
23+----+-----------------------------------+---------+-----------+--------------------+----------------------+
24| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |
25|----+-----------------------------------+---------+-----------+--------------------+----------------------|
26| 0 | No Specialization | 888 | 0.259725 | 888 | 0.259725 |
27| 1 | Finance Management | 422 | 0.123428 | 1310 | 0.383153 |
28| 2 | Marketing Management | 397 | 0.116116 | 1707 | 0.499269 |
29| 3 | Human Resource Management | 379 | 0.110851 | 2086 | 0.61012 |
30| 4 | Operations Management | 234 | 0.0684411 | 2320 | 0.678561 |
31| 5 | Business Administration | 175 | 0.0511846 | 2495 | 0.729746 |
32| 6 | Banking, Investment And Insurance | 164 | 0.0479672 | 2659 | 0.777713 |
33| 7 | Supply Chain Management | 147 | 0.042995 | 2806 | 0.820708 |
34| 8 | IT Projects Management | 140 | 0.0409476 | 2946 | 0.861655 |
35| 9 | Other Specialization | 97 | 0.0283709 | 3043 | 0.890026 |
36| 10 | Media and Advertising | 84 | 0.0245686 | 3127 | 0.914595 |
37| 11 | Healthcare Management | 76 | 0.0222287 | 3203 | 0.936824 |
38| 12 | Travel and Tourism | 71 | 0.0207663 | 3274 | 0.95759 |
39| 13 | International Business | 62 | 0.018134 | 3336 | 0.975724 |
40| 14 | Hospitality Management | 44 | 0.0128693 | 3380 | 0.988593 |
41| 15 | E-COMMERCE | 39 | 0.0114068 | 3419 | 1 |
42+----+-----------------------------------+---------+-----------+--------------------+----------------------+
43Specialization for Un-Converted Leads
44+----+-----------------------------------+---------+-----------+--------------------+----------------------+
45| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |
46|----+-----------------------------------+---------+-----------+--------------------+----------------------|
47| 0 | No Specialization | 2342 | 0.418588 | 2342 | 0.418588 |
48| 1 | Finance Management | 537 | 0.0959786 | 2879 | 0.514567 |
49| 2 | Human Resource Management | 457 | 0.0816801 | 3336 | 0.596247 |
50| 3 | Marketing Management | 425 | 0.0759607 | 3761 | 0.672207 |
51| 4 | Operations Management | 264 | 0.047185 | 4025 | 0.719392 |
52| 5 | IT Projects Management | 226 | 0.0403932 | 4251 | 0.759786 |
53| 6 | Business Administration | 222 | 0.0396783 | 4473 | 0.799464 |
54| 7 | Supply Chain Management | 197 | 0.03521 | 4670 | 0.834674 |
55| 8 | Other Specialization | 173 | 0.0309205 | 4843 | 0.865594 |
56| 9 | Banking, Investment And Insurance | 171 | 0.030563 | 5014 | 0.896157 |
57| 10 | Travel and Tourism | 131 | 0.0234138 | 5145 | 0.919571 |
58| 11 | Media and Advertising | 118 | 0.0210903 | 5263 | 0.940661 |
59| 12 | International Business | 114 | 0.0203753 | 5377 | 0.961037 |
60| 13 | Healthcare Management | 80 | 0.0142985 | 5457 | 0.975335 |
61| 14 | E-COMMERCE | 72 | 0.0128686 | 5529 | 0.988204 |
62| 15 | Hospitality Management | 66 | 0.0117962 | 5595 | 1 |
63+----+-----------------------------------+---------+-----------+--------------------+----------------------+
64Specialization vs Conversion Rate
65+-----------------------------------+------------------+
66| | Specialization |
67|-----------------------------------+------------------|
68| Banking, Investment And Insurance | 0.489552 |
69| Business Administration | 0.440806 |
70| E-COMMERCE | 0.351351 |
71| Finance Management | 0.440042 |
72| Healthcare Management | 0.487179 |
73| Hospitality Management | 0.4 |
74| Human Resource Management | 0.453349 |
75| IT Projects Management | 0.382514 |
76| International Business | 0.352273 |
77| Marketing Management | 0.482968 |
78| Media and Advertising | 0.415842 |
79| No Specialization | 0.274923 |
80| Operations Management | 0.46988 |
81| Other Specialization | 0.359259 |
82| Supply Chain Management | 0.427326 |
83| Travel and Tourism | 0.351485 |
84+-----------------------------------+------------------+

png

  • Specialization of 36% of leads is missing.
  • We have mapped those missing values with ‘No Specialization’.There might be two reason for this,
    • Lead might be a fresher.
    • Lead missed to fill it.
  • Among all the specializations, ’ Banking, Investment And Insurance’ has the highest conversion rate(48.9%).

What is your current occupation

1feature = 'What is your current occupation'
2categoricalUAn(feature,figsize=[8,8])
1Types of What is your current occupation
2+----+-----------------------------------+---------+-------------+--------------------+----------------------+
3| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+-----------------------------------+---------+-------------+--------------------+----------------------|
5| 0 | Unemployed | 5445 | 0.60406 | 5445 | 0.60406 |
6| 1 | Unknown Occupation | 2656 | 0.294653 | 8101 | 0.898713 |
7| 2 | Working Professional | 675 | 0.0748835 | 8776 | 0.973597 |
8| 3 | Student | 206 | 0.0228533 | 8982 | 0.99645 |
9| 4 | Other | 15 | 0.00166408 | 8997 | 0.998114 |
10| 5 | Housewife | 9 | 0.000998447 | 9006 | 0.999112 |
11| 6 | Businessman | 8 | 0.000887508 | 9014 | 1 |
12+----+-----------------------------------+---------+-------------+--------------------+----------------------+
13What is your current occupation for Converted Leads
14+----+-----------------------------------+---------+------------+--------------------+----------------------+
15| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |
16|----+-----------------------------------+---------+------------+--------------------+----------------------|
17| 0 | Unemployed | 2336 | 0.683241 | 2336 | 0.683241 |
18| 1 | Working Professional | 620 | 0.18134 | 2956 | 0.86458 |
19| 2 | Unknown Occupation | 366 | 0.107049 | 3322 | 0.971629 |
20| 3 | Student | 74 | 0.0216438 | 3396 | 0.993273 |
21| 4 | Other | 9 | 0.00263235 | 3405 | 0.995905 |
22| 5 | Housewife | 9 | 0.00263235 | 3414 | 0.998538 |
23| 6 | Businessman | 5 | 0.00146242 | 3419 | 1 |
24+----+-----------------------------------+---------+------------+--------------------+----------------------+
25What is your current occupation for Un-Converted Leads
26+----+-----------------------------------+---------+-------------+--------------------+----------------------+
27| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |
28|----+-----------------------------------+---------+-------------+--------------------+----------------------|
29| 0 | Unemployed | 3109 | 0.555675 | 3109 | 0.555675 |
30| 1 | Unknown Occupation | 2290 | 0.409294 | 5399 | 0.964969 |
31| 2 | Student | 132 | 0.0235925 | 5531 | 0.988561 |
32| 3 | Working Professional | 55 | 0.00983021 | 5586 | 0.998391 |
33| 4 | Other | 6 | 0.00107239 | 5592 | 0.999464 |
34| 5 | Businessman | 3 | 0.000536193 | 5595 | 1 |
35+----+-----------------------------------+---------+-------------+--------------------+----------------------+
36What is your current occupation vs Conversion Rate
37+----------------------+-----------------------------------+
38| | What is your current occupation |
39|----------------------+-----------------------------------|
40| Businessman | 0.625 |
41| Housewife | nan |
42| Other | 0.6 |
43| Student | 0.359223 |
44| Unemployed | 0.429017 |
45| Unknown Occupation | 0.137801 |
46| Working Professional | 0.918519 |
47+----------------------+-----------------------------------+

png

  • Although the conversion rate for Working Professional is the highest ! 91.6%, they only make 7.4% of all leads. 60% leads are Unemployed customers followed by 29% with unknown nature of employment
  • Among all the converted leads, Unemployed and Working Professionals top the list.
  • Conversion for Housewife segment is 100%

City

1feature = 'City'
2categoricalUAn(feature)
1Types of City
2+----+-----------------------------+---------+------------+--------------------+----------------------+
3| | City | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+-----------------------------+---------+------------+--------------------+----------------------|
5| 0 | Mumbai | 6692 | 0.742401 | 6692 | 0.742401 |
6| 1 | Thane & Outskirts | 745 | 0.0826492 | 7437 | 0.82505 |
7| 2 | Other Cities | 680 | 0.0754382 | 8117 | 0.900488 |
8| 3 | Other Cities of Maharashtra | 446 | 0.0494786 | 8563 | 0.949967 |
9| 4 | Other Metro Cities | 377 | 0.0418238 | 8940 | 0.991791 |
10| 5 | Tier II Cities | 74 | 0.00820945 | 9014 | 1 |
11+----+-----------------------------+---------+------------+--------------------+----------------------+
12City for Converted Leads
13+----+-----------------------------+---------+------------+--------------------+----------------------+
14| | City | Count | Percent | Cumulative Count | Cumulative Percent |
15|----+-----------------------------+---------+------------+--------------------+----------------------|
16| 0 | Mumbai | 2440 | 0.713659 | 2440 | 0.713659 |
17| 1 | Thane & Outskirts | 332 | 0.0971044 | 2772 | 0.810763 |
18| 2 | Other Cities | 272 | 0.0795554 | 3044 | 0.890319 |
19| 3 | Other Cities of Maharashtra | 196 | 0.0573267 | 3240 | 0.947646 |
20| 4 | Other Metro Cities | 154 | 0.0450424 | 3394 | 0.992688 |
21| 5 | Tier II Cities | 25 | 0.00731208 | 3419 | 1 |
22+----+-----------------------------+---------+------------+--------------------+----------------------+
23City for Un-Converted Leads
24+----+-----------------------------+---------+------------+--------------------+----------------------+
25| | City | Count | Percent | Cumulative Count | Cumulative Percent |
26|----+-----------------------------+---------+------------+--------------------+----------------------|
27| 0 | Mumbai | 4252 | 0.759964 | 4252 | 0.759964 |
28| 1 | Thane & Outskirts | 413 | 0.0738159 | 4665 | 0.83378 |
29| 2 | Other Cities | 408 | 0.0729223 | 5073 | 0.906702 |
30| 3 | Other Cities of Maharashtra | 250 | 0.0446828 | 5323 | 0.951385 |
31| 4 | Other Metro Cities | 223 | 0.039857 | 5546 | 0.991242 |
32| 5 | Tier II Cities | 49 | 0.00875782 | 5595 | 1 |
33+----+-----------------------------+---------+------------+--------------------+----------------------+
34City vs Conversion Rate
35+-----------------------------+----------+
36| | City |
37|-----------------------------+----------|
38| Mumbai | 0.364614 |
39| Thane & Outskirts | 0.445638 |
40| Other Cities | 0.4 |
41| Other Cities of Maharashtra | 0.439462 |
42| Other Metro Cities | 0.408488 |
43| Tier II Cities | 0.337838 |
44+-----------------------------+----------+

png

  • Most Leads come from ‘Mumbai’ and they have a decent conversion rate of 36.4%.
  • Leads from Thane and outskirts make up 8.2% with a conversion rate of 44%

A free copy of Mastering The Interview.

1feature = 'A free copy of Mastering The Interview'
2categoricalUAn(feature)
1Types of A free copy of Mastering The Interview
2+----+------------------------------------------+---------+-----------+--------------------+----------------------+
3| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |
4|----+------------------------------------------+---------+-----------+--------------------+----------------------|
5| 0 | No | 6126 | 0.679609 | 6126 | 0.679609 |
6| 1 | Yes | 2888 | 0.320391 | 9014 | 1 |
7+----+------------------------------------------+---------+-----------+--------------------+----------------------+
8A free copy of Mastering The Interview for Converted Leads
9+----+------------------------------------------+---------+-----------+--------------------+----------------------+
10| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |
11|----+------------------------------------------+---------+-----------+--------------------+----------------------|
12| 0 | No | 2389 | 0.698742 | 2389 | 0.698742 |
13| 1 | Yes | 1030 | 0.301258 | 3419 | 1 |
14+----+------------------------------------------+---------+-----------+--------------------+----------------------+
15A free copy of Mastering The Interview for Un-Converted Leads
16+----+------------------------------------------+---------+-----------+--------------------+----------------------+
17| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |
18|----+------------------------------------------+---------+-----------+--------------------+----------------------|
19| 0 | No | 3737 | 0.667918 | 3737 | 0.667918 |
20| 1 | Yes | 1858 | 0.332082 | 5595 | 1 |
21+----+------------------------------------------+---------+-----------+--------------------+----------------------+
22A free copy of Mastering The Interview vs Conversion Rate
23+-----+------------------------------------------+
24| | A free copy of Mastering The Interview |
25|-----+------------------------------------------|
26| No | 0.389977 |
27| Yes | 0.356648 |
28+-----+------------------------------------------+

png

  • 68% of the leads said “No” for a free copy of ‘Mastering The Interview’.
  • Conversion rate of leads who said “No” is high (39.8%).
1def num_univariate_analysis(column_name,scale='linear') :
2
3 converted = leads[leads['Converted'] == 1]
4 unconverted = leads[leads['Converted'] == 0]
5
6 plt.figure(figsize=(8,6))
7 ax = sns.boxplot(x=column_name, y='Converted', data = leads)
8 title = 'Boxplot of ' + column_name+' vs Conversion'
9 ax.set(title=title)
10 if scale == 'log' :
11 ax.set_xscale('log')
12 ax.set(ylabel=column_name + '(Log Scale)')
13
14 print("Spread for range of "+column_name+" that were Converted")
15 tab(converted[column_name].describe())
16 print("Spread for range of "+column_name+" that were not converted")
17 tab(unconverted[column_name].describe())

TotalVisits

1column_name = 'TotalVisits'
2num_univariate_analysis(column_name,scale='log')
1Spread for range of TotalVisits that were Converted
2+-------+---------------+
3| | TotalVisits |
4|-------+---------------|
5| count | 3419 |
6| mean | 3.65575 |
7| std | 5.57527 |
8| min | 0 |
9| 25% | 0 |
10| 50% | 3 |
11| 75% | 5 |
12| max | 251 |
13+-------+---------------+
14Spread for range of TotalVisits that were not converted
15+-------+---------------+
16| | TotalVisits |
17|-------+---------------|
18| count | 5595 |
19| mean | 3.33262 |
20| std | 4.37298 |
21| min | 0 |
22| 25% | 1 |
23| 50% | 3 |
24| 75% | 4 |
25| max | 141 |
26+-------+---------------+

png

  • Looks like Total Visits have a lot of outliers among both Converted and Un-converted leads.
  • Let’s take a look at the quantiles between 90 and 100.
1# Looking at Quantiles
2tab(leads[column_name].quantile(np.linspace(.90,1,20)))
1+----------+---------------+
2| | TotalVisits |
3|----------+---------------|
4| 0.9 | 7 |
5| 0.905263 | 7 |
6| 0.910526 | 8 |
7| 0.915789 | 8 |
8| 0.921053 | 8 |
9| 0.926316 | 8 |
10| 0.931579 | 9 |
11| 0.936842 | 9 |
12| 0.942105 | 9 |
13| 0.947368 | 9 |
14| 0.952632 | 10 |
15| 0.957895 | 10 |
16| 0.963158 | 11 |
17| 0.968421 | 11 |
18| 0.973684 | 12 |
19| 0.978947 | 13 |
20| 0.984211 | 14 |
21| 0.989474 | 17 |
22| 0.994737 | 20 |
23| 1 | 251 |
24+----------+---------------+
  • From the above, it is clear that outliers exist and these might skew the analyses.
  • For now, let’s cap the outliers about 99th percentile to 99th percentile value. soft range capping.
1# Capping outliers to 99th perentile value
2cap = leads[column_name].quantile(.99)
3condition = leads[column_name] > cap
4leads.loc[condition, column_name] = cap

Total TIme Spent on Website.

1column = 'Total Time Spent on Website'
2num_univariate_analysis(column)
1Spread for range of Total Time Spent on Website that were Converted
2+-------+-------------------------------+
3| | Total Time Spent on Website |
4|-------+-------------------------------|
5| count | 3419 |
6| mean | 732.945 |
7| std | 614.476 |
8| min | 0 |
9| 25% | 0 |
10| 50% | 826 |
11| 75% | 1265.5 |
12| max | 2253 |
13+-------+-------------------------------+
14Spread for range of Total Time Spent on Website that were not converted
15+-------+-------------------------------+
16| | Total Time Spent on Website |
17|-------+-------------------------------|
18| count | 5595 |
19| mean | 329.919 |
20| std | 432.757 |
21| min | 0 |
22| 25% | 14 |
23| 50% | 178 |
24| 75% | 393.5 |
25| max | 2272 |
26+-------+-------------------------------+

png

  • ‘Total Time Spend on Website’ has many outliers.
  • Let’s look quantiles to confirm this.
1tab(leads[column].quantile(np.linspace(0.75,1,25)))
1+----------+-------------------------------+
2| | Total Time Spent on Website |
3|----------+-------------------------------|
4| 0.75 | 924 |
5| 0.760417 | 953.635 |
6| 0.770833 | 991 |
7| 0.78125 | 1022.41 |
8| 0.791667 | 1054 |
9| 0.802083 | 1087 |
10| 0.8125 | 1115.06 |
11| 0.822917 | 1143 |
12| 0.833333 | 1177.83 |
13| 0.84375 | 1208 |
14| 0.854167 | 1238.6 |
15| 0.864583 | 1271 |
16| 0.875 | 1296.38 |
17| 0.885417 | 1328 |
18| 0.895833 | 1360 |
19| 0.90625 | 1392 |
20| 0.916667 | 1434 |
21| 0.927083 | 1468 |
22| 0.9375 | 1503 |
23| 0.947917 | 1549 |
24| 0.958333 | 1592.46 |
25| 0.96875 | 1647 |
26| 0.979167 | 1720.23 |
27| 0.989583 | 1830.34 |
28| 1 | 2272 |
29+----------+-------------------------------+
1leads[column].quantile(np.linspace(0.75,1,50)).plot()
1<matplotlib.axes._subplots.AxesSubplot at 0x7fc54b9106d0>

png

1# Capping `Total Time Spent on Website` values to 99th percentile
2cap = leads[column].quantile(.99)
3condition = leads[column] > cap
4leads.loc[condition, column] = cap

Page Views Per Visit.

1column = 'Page Views Per Visit'
2num_univariate_analysis(column)
1Spread for range of Page Views Per Visit that were Converted
2+-------+------------------------+
3| | Page Views Per Visit |
4|-------+------------------------|
5| count | 3419 |
6| mean | 2.36407 |
7| std | 2.10862 |
8| min | 0 |
9| 25% | 0 |
10| 50% | 2 |
11| 75% | 3.5 |
12| max | 15 |
13+-------+------------------------+
14Spread for range of Page Views Per Visit that were not converted
15+-------+------------------------+
16| | Page Views Per Visit |
17|-------+------------------------|
18| count | 5595 |
19| mean | 2.36962 |
20| std | 2.17789 |
21| min | 0 |
22| 25% | 1 |
23| 50% | 2 |
24| 75% | 3 |
25| max | 55 |
26+-------+------------------------+

png

  • ‘Page Views Per Visit’ has many outliers.
  • Let’s look quantiles to confirm this.
1leads[column].quantile(np.linspace(0.75,1,30)).plot()
1<matplotlib.axes._subplots.AxesSubplot at 0x7fc54c15d890>

png

  • There is a sudden jump between 99th percentile and maximum value.
  • let’s cap the values to 99th percentile to avoid skewing the analysis
1# Capping `Page Views Per Visit` values to 99th percentile
2cap = leads[column].quantile(.99)
3condition = leads[column] > cap
4leads.loc[condition, column] = cap

Bivariate Analysis

1leads.columns.values
1array(['Lead Origin', 'Lead Source', 'Do Not Email', 'Converted',
2 'TotalVisits', 'Total Time Spent on Website',
3 'Page Views Per Visit', 'Last Activity', 'Country',
4 'Specialization', 'What is your current occupation', 'City',
5 'A free copy of Mastering The Interview'], dtype=object)
1continuous_vars = ['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']

TotalVisits vs A free copy of Mastering The Interview

1plt.figure(figsize=[8,8])
2sns.barplot(x=continuous_vars[0], y = 'A free copy of Mastering The Interview', data=leads, hue='Converted')
1<matplotlib.axes._subplots.AxesSubplot at 0x7fc54c0fba90>

png

  • One can see that the proportion of leads with high Total Visits to the website also like a free copy of Mastering The Interview.
  • Incidentally, these are leads with higher conversion rate.
  • More convertable leads are being attracted by the website through providing ‘A free copy of Mastering The Interview’.

Lead Source vs Country

1# sns.barplot(x='Lead Source', y = 'Country', hue='Converted', data=leads)
2leads.groupby(['Country','Lead Source'])['Converted'].value_counts(normalize=True)\
3.unstack()\
4 .plot(
5 layout=(2,2),
6 figsize=(14,12), kind='barh', stacked=True);

png

  • Most leads from India through Reference Sources have very high conversion rate.
  • Leads from outside of India from other Lead sources do not convert at all.

Occupation vs City

1x = "What is your current occupation"
2y = 'City'
3
4leads.groupby([x,y])['Converted'].value_counts(normalize=True)\
5.unstack()\
6 .plot(
7 layout=(2,2),
8 figsize=(14,12), kind='barh', stacked=True);

png

  • Working Professionals in Other cities of Maharashtra have higher conversion rates compared to those from Mumbai , Thane and other cities.
  • BusinessMen from Mumbai and Thane & Outskirts are poor leads in comparison to Tier 2 and Other cities .

Last Activity vs Country

1x = "Country"
2y = 'Last Activity'
3
4leads.groupby([x,y])['Converted'].value_counts(normalize=True)\
5.unstack()\
6 .plot(
7 layout=(2,2),
8 figsize=(14,12), kind='barh', stacked=True);

png

  • SMS and Emails are more favourable for conversion over Website Visits outside of India.
  • Leads from outside of India who click email links have higher conversion rate compared those from India.

Data Preparation

Mapping Binary Variables to 0 / 1

1binary_var = ['Do Not Email', 'A free copy of Mastering The Interview']
2leads[binary_var] = leads[binary_var].replace({'Yes' : 1, 'No' : 0})

Creating Indicator Variables

1categoricalCol = ['Lead Origin', 'Lead Source','Last Activity', 'Country', 'Specialization',
2 'What is your current occupation', 'City']
3
4print('Levels in Each Cateogrical Variable\n')
5for col in sorted(categoricalCol) :
6 print(col, leads[col].unique(), '\n')
7
8# Creating dummy variables
9leadOriginDummies = pd.get_dummies(leads['Lead Origin'], drop_first=True)
10leadSourceDummies = pd.get_dummies(leads['Lead Source'], drop_first=True)
11lastActivityDummies = pd.get_dummies(leads['Last Activity'], drop_first=True)
12countryDummies = pd.get_dummies(leads['Country'] ,drop_first=True)
13specDummies = pd.get_dummies(leads['Specialization'],drop_first=True)
14occupationDummies = pd.get_dummies(leads[ 'What is your current occupation'],drop_first=True)
15cityDummies = pd.get_dummies(leads[ 'City'],drop_first=True)
16
17# adding dummy variables to leads dataframe
18leads = pd.concat([leads, leadOriginDummies,leadSourceDummies,lastActivityDummies, countryDummies, specDummies, occupationDummies, cityDummies], axis=1)
19
20# dropping categorical columns
21leads.drop(columns = categoricalCol, inplace=True)
22
23
24print('Final Columns')
25leads.columns
1Levels in Each Cateogrical Variable
2
3City ['Mumbai' 'Thane & Outskirts' 'Other Metro Cities' 'Other Cities'
4 'Other Cities of Maharashtra' 'Tier II Cities']
5
6Country ['India' 'Outside India']
7
8Last Activity ['Page Visited on Website' 'Email Opened' 'Other Last Activity'
9 'Converted to Lead' 'Olark Chat Conversation' 'Email Link Clicked'
10 'Form Submitted on Website' 'Email Bounced' 'SMS Sent']
11
12Lead Origin ['API' 'Landing Page Submission' 'Other Lead Origins']
13
14Lead Source ['Olark Chat' 'Organic Search' 'Direct Traffic' 'Google' 'Referral Sites'
15 'Reference' 'Welingak Website' 'Other Lead Sources']
16
17Specialization ['No Specialization' 'Business Administration' 'Media and Advertising'
18 'Supply Chain Management' 'IT Projects Management' 'Finance Management'
19 'Travel and Tourism' 'Human Resource Management' 'Marketing Management'
20 'Banking, Investment And Insurance' 'International Business' 'E-COMMERCE'
21 'Operations Management' 'Other Specialization' 'Hospitality Management'
22 'Healthcare Management']
23
24What is your current occupation ['Unemployed' 'Student' 'Unknown Occupation' 'Working Professional'
25 'Businessman' 'Other' 'Housewife']
26
27Final Columns
28
29
30
31
32
33Index(['Do Not Email', 'Converted', 'TotalVisits',
34 'Total Time Spent on Website', 'Page Views Per Visit',
35 'A free copy of Mastering The Interview', 'Landing Page Submission',
36 'Other Lead Origins', 'Google', 'Olark Chat', 'Organic Search',
37 'Other Lead Sources', 'Reference', 'Referral Sites', 'Welingak Website',
38 'Email Bounced', 'Email Link Clicked', 'Email Opened',
39 'Form Submitted on Website', 'Olark Chat Conversation',
40 'Other Last Activity', 'Page Visited on Website', 'SMS Sent',
41 'Outside India', 'Business Administration', 'E-COMMERCE',
42 'Finance Management', 'Healthcare Management', 'Hospitality Management',
43 'Human Resource Management', 'IT Projects Management',
44 'International Business', 'Marketing Management',
45 'Media and Advertising', 'No Specialization', 'Operations Management',
46 'Other Specialization', 'Supply Chain Management', 'Travel and Tourism',
47 'Housewife', 'Other', 'Student', 'Unemployed', 'Unknown Occupation',
48 'Working Professional', 'Other Cities', 'Other Cities of Maharashtra',
49 'Other Metro Cities', 'Thane & Outskirts', 'Tier II Cities'],
50 dtype='object')

Correlation

1# Top Correlations
2def correlation(dataframe) :
3 cor0=dataframe.corr()
4 type(cor0)
5 cor0.where(np.triu(np.ones(cor0.shape),k=1).astype(np.bool))
6 cor0=cor0.unstack().reset_index()
7 cor0.columns=['VAR1','VAR2','CORR']
8 cor0.dropna(subset=['CORR'], inplace=True)
9 cor0.CORR=round(cor0['CORR'],2)
10 cor0.CORR=cor0.CORR.abs()
11 cor0.sort_values(by=['CORR'],ascending=False)
12 cor0=cor0[~(cor0['VAR1']==cor0['VAR2'])]
13 return pd.DataFrame(cor0.sort_values(by=['CORR'],ascending=False))
1#Correlations for Converted Leads
2convertedCondition= leads['Converted']==1
3print('Correlations for Converted Leads')
4correlation(leads[convertedCondition])[1:30:2].style.background_gradient(cmap='GnBu').hide_index()
1Correlations for Converted Leads
VAR1 VAR2 CORR
Other Lead OriginsReference0.840000
SMS SentEmail Opened0.750000
UnemployedWorking Professional0.690000
Page Views Per VisitTotalVisits0.680000
No SpecializationLanding Page Submission0.600000
Page Views Per VisitLanding Page Submission0.560000
Landing Page SubmissionA free copy of Mastering The Interview0.540000
UnemployedUnknown Occupation0.510000
Total Time Spent on WebsitePage Views Per Visit0.500000
Other Lead OriginsPage Views Per Visit0.480000
Other Lead OriginsTotal Time Spent on Website0.460000
Other Lead OriginsLanding Page Submission0.450000
Welingak WebsiteOther Lead Origins0.450000
Total Time Spent on WebsiteTotalVisits0.440000
TotalVisitsTotal Time Spent on Website0.440000
  • Conversions of leads from other lead origins and the ones through reference have similar conversion behaviour.
1#Correlations for un-Converted Leads
2unconvertedCondition=leads['Converted']==0
3print('Correlations for Non-Converted Leads')
4correlation(leads[unconvertedCondition])[1:30:2].style.background_gradient(cmap='GnBu').hide_index()
1Correlations for Non-Converted Leads
VAR1 VAR2 CORR
UnemployedUnknown Occupation0.930000
Landing Page SubmissionNo Specialization0.870000
ReferenceOther Lead Origins0.750000
TotalVisitsPage Views Per Visit0.730000
Email BouncedDo Not Email0.650000
Olark ChatPage Views Per Visit0.630000
Olark ChatLanding Page Submission0.620000
A free copy of Mastering The InterviewNo Specialization0.590000
A free copy of Mastering The InterviewLanding Page Submission0.580000
No SpecializationOlark Chat0.570000
Olark ChatTotalVisits0.520000
Page Views Per VisitLanding Page Submission0.500000
Olark Chat ConversationOlark Chat0.500000
No SpecializationPage Views Per Visit0.490000
Other Lead OriginsOther Lead Sources0.480000
  • From the above, Unknown Occupation and Unemployed are highly correlated for non-converted leads.
  • This might mean that unemployed leads and leads with unknown occupation have the same conversion behaviour.

Train-Test Split

1from sklearn.model_selection import train_test_split
2y = leads.pop('Converted')
3X = leads
1X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=100)

Standardizing Continuous Variables

1continuous_vars
1['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']
1from sklearn.preprocessing import StandardScaler
2scaler = StandardScaler()
3
4# fitting and transforming train set
5X_train[continuous_vars] = scaler.fit_transform(X_train[continuous_vars])
6
7# Transforming test set for later use
8X_test[continuous_vars] = scaler.transform(X_test[continuous_vars])

Modelling

Recurvise Feature Elimination

1print('No of features : ', len(X_train.columns))
1No of features : 49
  • Currently, the dataset has 49 features.
  • We shall follow a mixed feature elimination approach.
  • We could use Recursive Feature Elimination for coarse elimination to 25 columns
  • This is followed by manual elimination of features with high p-value / VIF.
1# RFE
2from sklearn.linear_model import LogisticRegression
3from sklearn.feature_selection import RFE
4minFeatures = 25
5model = LogisticRegression()
6rfe = RFE(model, n_features_to_select=minFeatures)
7rfe = rfe.fit(X_train, y_train)
1# Columns selected by RFE :
2RFE_features = pd.DataFrame( {'feature' : X_train.columns, 'rank' : rfe.ranking_, 'support' : rfe.support_})
3condition = RFE_features['support'] == True
4rfe_features = RFE_features[condition].sort_values(by='rank',ascending=True )
5print('Features selected by RFE\n')
6tab(rfe_features)
1Features selected by RFE
2
3+----+-----------------------------+--------+-----------+
4| | feature | rank | support |
5|----+-----------------------------+--------+-----------|
6| 0 | Do Not Email | 1 | True |
7| 42 | Unknown Occupation | 1 | True |
8| 41 | Unemployed | 1 | True |
9| 40 | Student | 1 | True |
10| 38 | Housewife | 1 | True |
11| 33 | No Specialization | 1 | True |
12| 32 | Media and Advertising | 1 | True |
13| 27 | Hospitality Management | 1 | True |
14| 22 | Outside India | 1 | True |
15| 21 | SMS Sent | 1 | True |
16| 19 | Other Last Activity | 1 | True |
17| 43 | Working Professional | 1 | True |
18| 18 | Olark Chat Conversation | 1 | True |
19| 15 | Email Link Clicked | 1 | True |
20| 13 | Welingak Website | 1 | True |
21| 11 | Reference | 1 | True |
22| 10 | Other Lead Sources | 1 | True |
23| 8 | Olark Chat | 1 | True |
24| 6 | Other Lead Origins | 1 | True |
25| 5 | Landing Page Submission | 1 | True |
26| 3 | Page Views Per Visit | 1 | True |
27| 2 | Total Time Spent on Website | 1 | True |
28| 1 | TotalVisits | 1 | True |
29| 16 | Email Opened | 1 | True |
30| 48 | Tier II Cities | 1 | True |
31+----+-----------------------------+--------+-----------+
1rfeFeatures = rfe_features['feature'].values

Manual Feature Elimination

Model 1

1### Multicollinearity
2from statsmodels.stats.outliers_influence import variance_inflation_factor
3def vif(X) :
4 df = sm.add_constant(X)
5 vif = [variance_inflation_factor(df.values,i) for i in range(df.shape[1])]
6 vif_frame = pd.DataFrame({'vif' : vif[0:]},index = df.columns).reset_index()
7 tab(vif_frame.sort_values(by='vif',ascending=False))
1# Model 1
2import statsmodels.api as sm
3features = rfe_features['feature'].values
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+-----------+
3| | index | vif |
4|----+-----------------------------+-----------|
5| 0 | const | 498.666 |
6| 3 | Unemployed | 117.314 |
7| 2 | Unknown Occupation | 102.767 |
8| 19 | Other Lead Origins | 34.8794 |
9| 12 | Working Professional | 34.6746 |
10| 16 | Reference | 26.0024 |
11| 4 | Student | 11.9606 |
12| 15 | Welingak Website | 8.73583 |
13| 20 | Landing Page Submission | 3.48279 |
14| 6 | No Specialization | 3.05316 |
15| 17 | Other Lead Sources | 2.84805 |
16| 21 | Page Views Per Visit | 2.65403 |
17| 24 | Email Opened | 2.47549 |
18| 18 | Olark Chat | 2.45889 |
19| 10 | SMS Sent | 2.30595 |
20| 23 | TotalVisits | 2.0918 |
21| 13 | Olark Chat Conversation | 1.93429 |
22| 5 | Housewife | 1.46972 |
23| 22 | Total Time Spent on Website | 1.35898 |
24| 1 | Do Not Email | 1.2055 |
25| 14 | Email Link Clicked | 1.20382 |
26| 11 | Other Last Activity | 1.11873 |
27| 9 | Outside India | 1.02551 |
28| 7 | Media and Advertising | 1.01922 |
29| 8 | Hospitality Management | 1.0113 |
30| 25 | Tier II Cities | 1.00967 |
31+----+-----------------------------+-----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6283
Model Family: Binomial Df Model: 25
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2503.0
Date: Tue, 08 Sep 2020 Deviance: 5005.9
Time: 07:36:56 Pearson chi2: 6.36e+03
No. Iterations: 20
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.3145 0.675 -0.466 0.641 -1.638 1.009
Do Not Email -1.4730 0.189 -7.775 0.000 -1.844 -1.102
Unknown Occupation -1.8781 0.668 -2.813 0.005 -3.187 -0.570
Unemployed -0.6028 0.664 -0.908 0.364 -1.904 0.698
Student -0.6044 0.699 -0.864 0.387 -1.975 0.766
Housewife 19.9950 1.12e+04 0.002 0.999 -2.2e+04 2.21e+04
No Specialization -0.8514 0.127 -6.698 0.000 -1.101 -0.602
Media and Advertising -0.2951 0.247 -1.196 0.232 -0.779 0.189
Hospitality Management -0.9043 0.333 -2.715 0.007 -1.557 -0.252
Outside India -0.6829 0.236 -2.899 0.004 -1.145 -0.221
SMS Sent 2.0121 0.128 15.767 0.000 1.762 2.262
Other Last Activity 1.3061 0.262 4.991 0.000 0.793 1.819
Working Professional 1.8232 0.688 2.651 0.008 0.475 3.171
Olark Chat Conversation -0.6947 0.202 -3.437 0.001 -1.091 -0.298
Email Link Clicked 0.4905 0.236 2.079 0.038 0.028 0.953
Welingak Website 4.3758 1.134 3.858 0.000 2.153 6.599
Reference 2.0165 0.892 2.261 0.024 0.268 3.765
Other Lead Sources -0.1279 0.830 -0.154 0.878 -1.755 1.500
Olark Chat 1.2173 0.141 8.613 0.000 0.940 1.494
Other Lead Origins 1.0385 0.869 1.196 0.232 -0.664 2.741
Landing Page Submission -0.9141 0.131 -6.962 0.000 -1.171 -0.657
Page Views Per Visit -0.2670 0.056 -4.727 0.000 -0.378 -0.156
Total Time Spent on Website 1.1245 0.042 26.804 0.000 1.042 1.207
TotalVisits 0.3212 0.050 6.432 0.000 0.223 0.419
Email Opened 0.7814 0.125 6.271 0.000 0.537 1.026
Tier II Cities -0.4419 0.425 -1.040 0.298 -1.275 0.391
  • Unemployed has the highest VIF. let’s drop this feature.

Model 2

1# Model 2 : Removing `Unemployed`
2column_to_remove = 'Unemployed'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 18 | Other Lead Origins | 34.8784 |
6| 15 | Reference | 26.0024 |
7| 0 | const | 19.6719 |
8| 14 | Welingak Website | 8.73573 |
9| 19 | Landing Page Submission | 3.4731 |
10| 5 | No Specialization | 3.04661 |
11| 16 | Other Lead Sources | 2.84805 |
12| 20 | Page Views Per Visit | 2.65403 |
13| 23 | Email Opened | 2.47543 |
14| 17 | Olark Chat | 2.45888 |
15| 9 | SMS Sent | 2.30562 |
16| 22 | TotalVisits | 2.09129 |
17| 12 | Olark Chat Conversation | 1.93402 |
18| 21 | Total Time Spent on Website | 1.35834 |
19| 1 | Do Not Email | 1.20547 |
20| 13 | Email Link Clicked | 1.2038 |
21| 2 | Unknown Occupation | 1.17383 |
22| 11 | Working Professional | 1.15381 |
23| 10 | Other Last Activity | 1.11849 |
24| 8 | Outside India | 1.02545 |
25| 3 | Student | 1.02534 |
26| 6 | Media and Advertising | 1.01899 |
27| 7 | Hospitality Management | 1.01126 |
28| 4 | Housewife | 1.01026 |
29| 24 | Tier II Cities | 1.00967 |
30+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6284
Model Family: Binomial Df Model: 24
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2503.4
Date: Tue, 08 Sep 2020 Deviance: 5006.8
Time: 07:36:56 Pearson chi2: 6.36e+03
No. Iterations: 21
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.9101 0.163 -5.598 0.000 -1.229 -0.591
Do Not Email -1.4697 0.189 -7.761 0.000 -1.841 -1.099
Unknown Occupation -1.2777 0.091 -14.023 0.000 -1.456 -1.099
Student -0.0044 0.229 -0.019 0.985 -0.453 0.444
Housewife 21.5945 1.85e+04 0.001 0.999 -3.63e+04 3.63e+04
No Specialization -0.8575 0.127 -6.754 0.000 -1.106 -0.609
Media and Advertising -0.2953 0.247 -1.198 0.231 -0.779 0.188
Hospitality Management -0.9066 0.333 -2.722 0.006 -1.559 -0.254
Outside India -0.6853 0.236 -2.909 0.004 -1.147 -0.224
SMS Sent 2.0111 0.128 15.755 0.000 1.761 2.261
Other Last Activity 1.3069 0.261 5.001 0.000 0.795 1.819
Working Professional 2.4233 0.191 12.700 0.000 2.049 2.797
Olark Chat Conversation -0.6896 0.202 -3.415 0.001 -1.085 -0.294
Email Link Clicked 0.4893 0.236 2.074 0.038 0.027 0.952
Welingak Website 4.3797 1.134 3.861 0.000 2.157 6.603
Reference 2.0184 0.892 2.263 0.024 0.270 3.766
Other Lead Sources -0.1269 0.830 -0.153 0.879 -1.754 1.500
Olark Chat 1.2189 0.141 8.624 0.000 0.942 1.496
Other Lead Origins 1.0344 0.869 1.191 0.234 -0.668 2.737
Landing Page Submission -0.9202 0.131 -7.016 0.000 -1.177 -0.663
Page Views Per Visit -0.2668 0.056 -4.723 0.000 -0.377 -0.156
Total Time Spent on Website 1.1251 0.042 26.818 0.000 1.043 1.207
TotalVisits 0.3227 0.050 6.466 0.000 0.225 0.421
Email Opened 0.7824 0.125 6.278 0.000 0.538 1.027
Tier II Cities -0.4432 0.425 -1.043 0.297 -1.276 0.389
  • Other Lead Origins has a very high VIF.
  • Let’s drop this variable

Model 3

1# Model 3 : Removing `Other Lead Origins`
2column_to_remove = 'Other Lead Origins'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 19.5742 |
6| 18 | Landing Page Submission | 3.44401 |
7| 5 | No Specialization | 3.03467 |
8| 19 | Page Views Per Visit | 2.64774 |
9| 22 | Email Opened | 2.4731 |
10| 17 | Olark Chat | 2.45033 |
11| 9 | SMS Sent | 2.30433 |
12| 21 | TotalVisits | 2.09083 |
13| 12 | Olark Chat Conversation | 1.93393 |
14| 15 | Reference | 1.64559 |
15| 20 | Total Time Spent on Website | 1.35735 |
16| 1 | Do Not Email | 1.20544 |
17| 13 | Email Link Clicked | 1.20372 |
18| 2 | Unknown Occupation | 1.17275 |
19| 11 | Working Professional | 1.15369 |
20| 14 | Welingak Website | 1.14402 |
21| 10 | Other Last Activity | 1.11839 |
22| 16 | Other Lead Sources | 1.03494 |
23| 8 | Outside India | 1.02524 |
24| 3 | Student | 1.02486 |
25| 6 | Media and Advertising | 1.01894 |
26| 7 | Hospitality Management | 1.01112 |
27| 4 | Housewife | 1.01026 |
28| 23 | Tier II Cities | 1.00966 |
29+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6285
Model Family: Binomial Df Model: 23
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2504.2
Date: Tue, 08 Sep 2020 Deviance: 5008.4
Time: 07:36:56 Pearson chi2: 6.37e+03
No. Iterations: 21
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8922 0.162 -5.511 0.000 -1.209 -0.575
Do Not Email -1.4710 0.189 -7.770 0.000 -1.842 -1.100
Unknown Occupation -1.2816 0.091 -14.076 0.000 -1.460 -1.103
Student -0.0129 0.228 -0.056 0.955 -0.461 0.435
Housewife 21.5902 1.85e+04 0.001 0.999 -3.63e+04 3.63e+04
No Specialization -0.8719 0.127 -6.891 0.000 -1.120 -0.624
Media and Advertising -0.2936 0.247 -1.190 0.234 -0.777 0.190
Hospitality Management -0.9011 0.333 -2.705 0.007 -1.554 -0.248
Outside India -0.6916 0.235 -2.940 0.003 -1.153 -0.231
SMS Sent 2.0143 0.127 15.802 0.000 1.764 2.264
Other Last Activity 1.3096 0.261 5.011 0.000 0.797 1.822
Working Professional 2.4221 0.191 12.693 0.000 2.048 2.796
Olark Chat Conversation -0.6896 0.202 -3.416 0.001 -1.085 -0.294
Email Link Clicked 0.4914 0.236 2.084 0.037 0.029 0.954
Welingak Website 5.4015 0.741 7.294 0.000 3.950 6.853
Reference 3.0328 0.258 11.775 0.000 2.528 3.538
Other Lead Sources 0.6953 0.383 1.815 0.070 -0.056 1.446
Olark Chat 1.2083 0.141 8.581 0.000 0.932 1.484
Landing Page Submission -0.9391 0.130 -7.204 0.000 -1.195 -0.684
Page Views Per Visit -0.2697 0.056 -4.780 0.000 -0.380 -0.159
Total Time Spent on Website 1.1240 0.042 26.816 0.000 1.042 1.206
TotalVisits 0.3212 0.050 6.437 0.000 0.223 0.419
Email Opened 0.7860 0.124 6.318 0.000 0.542 1.030
Tier II Cities -0.4421 0.425 -1.041 0.298 -1.275 0.391
  • Housewife has a high p-value and hence the coefficient is insignificant. let’s drop the same.

Model 4

1# Model 4 : Removing `Housewife`
2column_to_remove = 'Housewife'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 19.5671 |
6| 17 | Landing Page Submission | 3.44232 |
7| 4 | No Specialization | 3.03251 |
8| 18 | Page Views Per Visit | 2.64773 |
9| 21 | Email Opened | 2.47231 |
10| 16 | Olark Chat | 2.4503 |
11| 8 | SMS Sent | 2.30422 |
12| 20 | TotalVisits | 2.09075 |
13| 11 | Olark Chat Conversation | 1.93383 |
14| 14 | Reference | 1.64154 |
15| 19 | Total Time Spent on Website | 1.35679 |
16| 1 | Do Not Email | 1.20544 |
17| 12 | Email Link Clicked | 1.20292 |
18| 2 | Unknown Occupation | 1.17264 |
19| 10 | Working Professional | 1.15248 |
20| 13 | Welingak Website | 1.14402 |
21| 9 | Other Last Activity | 1.11839 |
22| 15 | Other Lead Sources | 1.03492 |
23| 3 | Student | 1.02478 |
24| 7 | Outside India | 1.02446 |
25| 5 | Media and Advertising | 1.01783 |
26| 6 | Hospitality Management | 1.01111 |
27| 22 | Tier II Cities | 1.00965 |
28+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6286
Model Family: Binomial Df Model: 22
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2506.1
Date: Tue, 08 Sep 2020 Deviance: 5012.2
Time: 07:36:57 Pearson chi2: 6.39e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8894 0.162 -5.494 0.000 -1.207 -0.572
Do Not Email -1.4718 0.189 -7.770 0.000 -1.843 -1.101
Unknown Occupation -1.2830 0.091 -14.091 0.000 -1.461 -1.105
Student -0.0165 0.229 -0.072 0.942 -0.465 0.432
No Specialization -0.8767 0.127 -6.929 0.000 -1.125 -0.629
Media and Advertising -0.2871 0.246 -1.169 0.243 -0.769 0.194
Hospitality Management -0.9050 0.333 -2.715 0.007 -1.558 -0.252
Outside India -0.6828 0.234 -2.919 0.004 -1.141 -0.224
SMS Sent 2.0146 0.128 15.799 0.000 1.765 2.264
Other Last Activity 1.3101 0.261 5.011 0.000 0.798 1.823
Working Professional 2.4187 0.191 12.674 0.000 2.045 2.793
Olark Chat Conversation -0.6880 0.202 -3.407 0.001 -1.084 -0.292
Email Link Clicked 0.4981 0.235 2.117 0.034 0.037 0.959
Welingak Website 5.4013 0.741 7.294 0.000 3.950 6.853
Reference 3.0546 0.257 11.869 0.000 2.550 3.559
Other Lead Sources 0.6922 0.383 1.807 0.071 -0.059 1.443
Olark Chat 1.2075 0.141 8.575 0.000 0.931 1.483
Landing Page Submission -0.9416 0.130 -7.228 0.000 -1.197 -0.686
Page Views Per Visit -0.2701 0.056 -4.788 0.000 -0.381 -0.160
Total Time Spent on Website 1.1252 0.042 26.851 0.000 1.043 1.207
TotalVisits 0.3203 0.050 6.421 0.000 0.223 0.418
Email Opened 0.7897 0.124 6.347 0.000 0.546 1.034
Tier II Cities -0.4440 0.425 -1.045 0.296 -1.277 0.389
  • Student has a p-value higher than 0.05 and the highest among all p-values. Let’s drop this feature.

Model 5

1# Model 5 : Removing `Student`
2column_to_remove = 'Student'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 19.5487 |
6| 16 | Landing Page Submission | 3.43698 |
7| 3 | No Specialization | 3.02919 |
8| 17 | Page Views Per Visit | 2.64719 |
9| 20 | Email Opened | 2.47218 |
10| 15 | Olark Chat | 2.44998 |
11| 7 | SMS Sent | 2.30157 |
12| 19 | TotalVisits | 2.09067 |
13| 10 | Olark Chat Conversation | 1.93303 |
14| 13 | Reference | 1.63987 |
15| 18 | Total Time Spent on Website | 1.35679 |
16| 1 | Do Not Email | 1.20485 |
17| 11 | Email Link Clicked | 1.2029 |
18| 2 | Unknown Occupation | 1.15459 |
19| 9 | Working Professional | 1.14906 |
20| 12 | Welingak Website | 1.14371 |
21| 8 | Other Last Activity | 1.11789 |
22| 14 | Other Lead Sources | 1.03492 |
23| 6 | Outside India | 1.02427 |
24| 4 | Media and Advertising | 1.01768 |
25| 5 | Hospitality Management | 1.01055 |
26| 21 | Tier II Cities | 1.00946 |
27+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6287
Model Family: Binomial Df Model: 21
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2506.1
Date: Tue, 08 Sep 2020 Deviance: 5012.2
Time: 07:36:57 Pearson chi2: 6.39e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8898 0.162 -5.500 0.000 -1.207 -0.573
Do Not Email -1.4718 0.189 -7.771 0.000 -1.843 -1.101
Unknown Occupation -1.2824 0.091 -14.134 0.000 -1.460 -1.105
No Specialization -0.8767 0.127 -6.929 0.000 -1.125 -0.629
Media and Advertising -0.2868 0.246 -1.168 0.243 -0.768 0.195
Hospitality Management -0.9053 0.333 -2.717 0.007 -1.559 -0.252
Outside India -0.6826 0.234 -2.919 0.004 -1.141 -0.224
SMS Sent 2.0148 0.127 15.806 0.000 1.765 2.265
Other Last Activity 1.3105 0.261 5.014 0.000 0.798 1.823
Working Professional 2.4193 0.191 12.687 0.000 2.046 2.793
Olark Chat Conversation -0.6883 0.202 -3.410 0.001 -1.084 -0.293
Email Link Clicked 0.4981 0.235 2.117 0.034 0.037 0.959
Welingak Website 5.4017 0.741 7.294 0.000 3.950 6.853
Reference 3.0541 0.257 11.873 0.000 2.550 3.558
Other Lead Sources 0.6922 0.383 1.807 0.071 -0.059 1.443
Olark Chat 1.2072 0.141 8.577 0.000 0.931 1.483
Landing Page Submission -0.9418 0.130 -7.231 0.000 -1.197 -0.687
Page Views Per Visit -0.2701 0.056 -4.788 0.000 -0.381 -0.160
Total Time Spent on Website 1.1252 0.042 26.851 0.000 1.043 1.207
TotalVisits 0.3204 0.050 6.421 0.000 0.223 0.418
Email Opened 0.7896 0.124 6.347 0.000 0.546 1.033
Tier II Cities -0.4436 0.425 -1.044 0.297 -1.276 0.389
  • Tier II Cities has a p-value higher than confidence level and the highest among all the p - values.
  • Let’s remove this feature

Model 6

1# Model 6 : Removing `Tier II Cities`
2column_to_remove = 'Tier II Cities'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 19.5461 |
6| 16 | Landing Page Submission | 3.43294 |
7| 3 | No Specialization | 3.02879 |
8| 17 | Page Views Per Visit | 2.64718 |
9| 20 | Email Opened | 2.47182 |
10| 15 | Olark Chat | 2.44996 |
11| 7 | SMS Sent | 2.30139 |
12| 19 | TotalVisits | 2.09046 |
13| 10 | Olark Chat Conversation | 1.93285 |
14| 13 | Reference | 1.63987 |
15| 18 | Total Time Spent on Website | 1.35679 |
16| 1 | Do Not Email | 1.20341 |
17| 11 | Email Link Clicked | 1.20289 |
18| 2 | Unknown Occupation | 1.15454 |
19| 9 | Working Professional | 1.14885 |
20| 12 | Welingak Website | 1.14371 |
21| 8 | Other Last Activity | 1.11724 |
22| 14 | Other Lead Sources | 1.0349 |
23| 6 | Outside India | 1.02425 |
24| 4 | Media and Advertising | 1.0176 |
25| 5 | Hospitality Management | 1.01026 |
26+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6288
Model Family: Binomial Df Model: 20
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2506.6
Date: Tue, 08 Sep 2020 Deviance: 5013.3
Time: 07:36:57 Pearson chi2: 6.38e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8916 0.162 -5.511 0.000 -1.209 -0.574
Do Not Email -1.4762 0.189 -7.805 0.000 -1.847 -1.106
Unknown Occupation -1.2818 0.091 -14.129 0.000 -1.460 -1.104
No Specialization -0.8758 0.127 -6.923 0.000 -1.124 -0.628
Media and Advertising -0.2821 0.246 -1.149 0.251 -0.763 0.199
Hospitality Management -0.9112 0.333 -2.740 0.006 -1.563 -0.259
Outside India -0.6825 0.234 -2.919 0.004 -1.141 -0.224
SMS Sent 2.0155 0.127 15.809 0.000 1.766 2.265
Other Last Activity 1.3159 0.261 5.035 0.000 0.804 1.828
Working Professional 2.4165 0.191 12.684 0.000 2.043 2.790
Olark Chat Conversation -0.6867 0.202 -3.402 0.001 -1.082 -0.291
Email Link Clicked 0.4992 0.235 2.123 0.034 0.038 0.960
Welingak Website 5.4036 0.741 7.296 0.000 3.952 6.855
Reference 3.0553 0.257 11.877 0.000 2.551 3.559
Other Lead Sources 0.6941 0.383 1.812 0.070 -0.057 1.445
Olark Chat 1.2077 0.141 8.582 0.000 0.932 1.483
Landing Page Submission -0.9467 0.130 -7.273 0.000 -1.202 -0.692
Page Views Per Visit -0.2700 0.056 -4.790 0.000 -0.381 -0.160
Total Time Spent on Website 1.1249 0.042 26.851 0.000 1.043 1.207
TotalVisits 0.3212 0.050 6.441 0.000 0.223 0.419
Email Opened 0.7907 0.124 6.355 0.000 0.547 1.035
  • Page Views Per Visit has a high p-value. Let’s eliminate this.

Model 7

1# Model 7 : Removing `Page Views Per Visit`
2column_to_remove = 'Page Views Per Visit'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 19.521 |
6| 16 | Landing Page Submission | 3.40467 |
7| 3 | No Specialization | 3.02277 |
8| 19 | Email Opened | 2.43525 |
9| 7 | SMS Sent | 2.2499 |
10| 15 | Olark Chat | 2.22665 |
11| 10 | Olark Chat Conversation | 1.92174 |
12| 13 | Reference | 1.56753 |
13| 18 | TotalVisits | 1.49299 |
14| 17 | Total Time Spent on Website | 1.35607 |
15| 1 | Do Not Email | 1.20207 |
16| 11 | Email Link Clicked | 1.20085 |
17| 2 | Unknown Occupation | 1.15454 |
18| 9 | Working Professional | 1.14798 |
19| 12 | Welingak Website | 1.12183 |
20| 8 | Other Last Activity | 1.11356 |
21| 14 | Other Lead Sources | 1.03154 |
22| 6 | Outside India | 1.02316 |
23| 4 | Media and Advertising | 1.0176 |
24| 5 | Hospitality Management | 1.01022 |
25+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6289
Model Family: Binomial Df Model: 19
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2518.3
Date: Tue, 08 Sep 2020 Deviance: 5036.6
Time: 07:36:57 Pearson chi2: 6.35e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8590 0.161 -5.346 0.000 -1.174 -0.544
Do Not Email -1.5025 0.189 -7.940 0.000 -1.873 -1.132
Unknown Occupation -1.2733 0.090 -14.083 0.000 -1.451 -1.096
No Specialization -0.8357 0.126 -6.656 0.000 -1.082 -0.590
Media and Advertising -0.2901 0.244 -1.187 0.235 -0.769 0.189
Hospitality Management -0.9063 0.332 -2.732 0.006 -1.556 -0.256
Outside India -0.6629 0.234 -2.828 0.005 -1.122 -0.203
SMS Sent 1.9118 0.125 15.332 0.000 1.667 2.156
Other Last Activity 1.2298 0.260 4.738 0.000 0.721 1.739
Working Professional 2.4225 0.191 12.702 0.000 2.049 2.796
Olark Chat Conversation -0.7552 0.201 -3.762 0.000 -1.149 -0.362
Email Link Clicked 0.4387 0.234 1.875 0.061 -0.020 0.897
Welingak Website 5.6036 0.739 7.587 0.000 4.156 7.051
Reference 3.2759 0.253 12.946 0.000 2.780 3.772
Other Lead Sources 0.8276 0.379 2.186 0.029 0.086 1.570
Olark Chat 1.4051 0.135 10.431 0.000 1.141 1.669
Landing Page Submission -0.9926 0.129 -7.685 0.000 -1.246 -0.739
Total Time Spent on Website 1.1253 0.042 26.905 0.000 1.043 1.207
TotalVisits 0.1958 0.042 4.614 0.000 0.113 0.279
Email Opened 0.7106 0.123 5.797 0.000 0.470 0.951
  • Media and Advertising has a high p-value. let’s drop this feature

Model 8

1# Model 8 : Removing `Media and Advertising`
2column_to_remove = 'Media and Advertising'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 19.4607 |
6| 15 | Landing Page Submission | 3.40463 |
7| 3 | No Specialization | 3.01069 |
8| 18 | Email Opened | 2.43523 |
9| 6 | SMS Sent | 2.24839 |
10| 14 | Olark Chat | 2.22661 |
11| 9 | Olark Chat Conversation | 1.92165 |
12| 12 | Reference | 1.56581 |
13| 17 | TotalVisits | 1.49255 |
14| 16 | Total Time Spent on Website | 1.35565 |
15| 1 | Do Not Email | 1.202 |
16| 10 | Email Link Clicked | 1.20054 |
17| 2 | Unknown Occupation | 1.15412 |
18| 8 | Working Professional | 1.14797 |
19| 11 | Welingak Website | 1.12171 |
20| 7 | Other Last Activity | 1.11347 |
21| 13 | Other Lead Sources | 1.03154 |
22| 5 | Outside India | 1.02306 |
23| 4 | Hospitality Management | 1.00959 |
24+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6290
Model Family: Binomial Df Model: 18
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2519.0
Date: Tue, 08 Sep 2020 Deviance: 5038.0
Time: 07:36:57 Pearson chi2: 6.34e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.8673 0.160 -5.404 0.000 -1.182 -0.553
Do Not Email -1.5036 0.189 -7.945 0.000 -1.874 -1.133
Unknown Occupation -1.2714 0.090 -14.070 0.000 -1.448 -1.094
No Specialization -0.8261 0.125 -6.595 0.000 -1.072 -0.581
Hospitality Management -0.8963 0.332 -2.703 0.007 -1.546 -0.246
Outside India -0.6576 0.234 -2.809 0.005 -1.116 -0.199
SMS Sent 1.9074 0.125 15.311 0.000 1.663 2.152
Other Last Activity 1.2318 0.259 4.749 0.000 0.723 1.740
Working Professional 2.4169 0.191 12.682 0.000 2.043 2.790
Olark Chat Conversation -0.7593 0.201 -3.783 0.000 -1.153 -0.366
Email Link Clicked 0.4421 0.234 1.890 0.059 -0.016 0.901
Welingak Website 5.6066 0.739 7.591 0.000 4.159 7.054
Reference 3.2810 0.253 12.970 0.000 2.785 3.777
Other Lead Sources 0.8257 0.378 2.182 0.029 0.084 1.567
Olark Chat 1.4068 0.135 10.444 0.000 1.143 1.671
Landing Page Submission -0.9929 0.129 -7.688 0.000 -1.246 -0.740
Total Time Spent on Website 1.1264 0.042 26.929 0.000 1.044 1.208
TotalVisits 0.1962 0.042 4.623 0.000 0.113 0.279
Email Opened 0.7096 0.123 5.791 0.000 0.469 0.950
  • This model has a feature Email Link Clicked with high p-value of 0.059. Let’s drop this feature.

Model 9

1# Model 9 : Removing `Email Link Clicked`
2column_to_remove = 'Email Link Clicked'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6print("VIF for X_train")
7vif(X_train)
8logm1.fit().summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 18.2371 |
6| 14 | Landing Page Submission | 3.40436 |
7| 3 | No Specialization | 3.01066 |
8| 13 | Olark Chat | 2.21092 |
9| 17 | Email Opened | 2.09586 |
10| 6 | SMS Sent | 1.97156 |
11| 9 | Olark Chat Conversation | 1.75591 |
12| 11 | Reference | 1.56268 |
13| 16 | TotalVisits | 1.49255 |
14| 15 | Total Time Spent on Website | 1.35564 |
15| 1 | Do Not Email | 1.16781 |
16| 2 | Unknown Occupation | 1.1541 |
17| 8 | Working Professional | 1.14791 |
18| 10 | Welingak Website | 1.12098 |
19| 7 | Other Last Activity | 1.09682 |
20| 12 | Other Lead Sources | 1.0315 |
21| 5 | Outside India | 1.02286 |
22| 4 | Hospitality Management | 1.00955 |
23+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6291
Model Family: Binomial Df Model: 17
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2520.8
Date: Tue, 08 Sep 2020 Deviance: 5041.5
Time: 07:36:57 Pearson chi2: 6.34e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.7799 0.153 -5.111 0.000 -1.079 -0.481
Do Not Email -1.5328 0.188 -8.150 0.000 -1.901 -1.164
Unknown Occupation -1.2696 0.090 -14.058 0.000 -1.447 -1.093
No Specialization -0.8248 0.125 -6.587 0.000 -1.070 -0.579
Hospitality Management -0.8909 0.332 -2.687 0.007 -1.541 -0.241
Outside India -0.6555 0.235 -2.795 0.005 -1.115 -0.196
SMS Sent 1.8157 0.113 16.026 0.000 1.594 2.038
Other Last Activity 1.1473 0.255 4.500 0.000 0.648 1.647
Working Professional 2.4103 0.190 12.668 0.000 2.037 2.783
Olark Chat Conversation -0.8576 0.193 -4.441 0.000 -1.236 -0.479
Welingak Website 5.6297 0.738 7.632 0.000 4.184 7.076
Reference 3.3059 0.253 13.064 0.000 2.810 3.802
Other Lead Sources 0.8192 0.378 2.167 0.030 0.078 1.560
Olark Chat 1.4212 0.135 10.561 0.000 1.157 1.685
Landing Page Submission -0.9890 0.129 -7.664 0.000 -1.242 -0.736
Total Time Spent on Website 1.1247 0.042 26.931 0.000 1.043 1.207
TotalVisits 0.1953 0.042 4.613 0.000 0.112 0.278
Email Opened 0.6161 0.111 5.564 0.000 0.399 0.833
  • All coefficients are significant / low p-value
  • For further elimination , let’s use the magnitude of coefficient as the weight/importance of the variable. Higher values are more important than lower values.
  • By this reasoning, TotalVisits has the least coefficient. Let’d drop this.

Model 10

1# Model 10 : Removing `TotalVisits`
2column_to_remove = 'TotalVisits'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6logm1 = logm1.fit()
7print("VIF for X_train")
8vif(X_train)
9logm1.summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 17.9489 |
6| 14 | Landing Page Submission | 3.4038 |
7| 3 | No Specialization | 2.97226 |
8| 16 | Email Opened | 2.09586 |
9| 13 | Olark Chat | 1.98518 |
10| 6 | SMS Sent | 1.97039 |
11| 9 | Olark Chat Conversation | 1.7559 |
12| 11 | Reference | 1.46711 |
13| 15 | Total Time Spent on Website | 1.34366 |
14| 1 | Do Not Email | 1.1672 |
15| 2 | Unknown Occupation | 1.15404 |
16| 8 | Working Professional | 1.14782 |
17| 10 | Welingak Website | 1.10012 |
18| 7 | Other Last Activity | 1.09606 |
19| 12 | Other Lead Sources | 1.02698 |
20| 5 | Outside India | 1.0226 |
21| 4 | Hospitality Management | 1.00954 |
22+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6292
Model Family: Binomial Df Model: 16
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2531.3
Date: Tue, 08 Sep 2020 Deviance: 5062.7
Time: 07:36:57 Pearson chi2: 6.40e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.6821 0.151 -4.531 0.000 -0.977 -0.387
Do Not Email -1.5442 0.188 -8.229 0.000 -1.912 -1.176
Unknown Occupation -1.2691 0.090 -14.078 0.000 -1.446 -1.092
No Specialization -0.8891 0.124 -7.144 0.000 -1.133 -0.645
Hospitality Management -0.8856 0.332 -2.667 0.008 -1.536 -0.235
Outside India -0.6636 0.233 -2.845 0.004 -1.121 -0.206
SMS Sent 1.7926 0.113 15.916 0.000 1.572 2.013
Other Last Activity 1.1818 0.254 4.646 0.000 0.683 1.680
Working Professional 2.3799 0.189 12.584 0.000 2.009 2.751
Olark Chat Conversation -0.8591 0.192 -4.474 0.000 -1.236 -0.483
Welingak Website 5.4168 0.736 7.360 0.000 3.974 6.859
Reference 3.0622 0.247 12.402 0.000 2.578 3.546
Other Lead Sources 0.6577 0.377 1.744 0.081 -0.081 1.397
Olark Chat 1.2170 0.126 9.662 0.000 0.970 1.464
Landing Page Submission -0.9929 0.129 -7.702 0.000 -1.246 -0.740
Total Time Spent on Website 1.1330 0.042 27.181 0.000 1.051 1.215
Email Opened 0.6052 0.110 5.490 0.000 0.389 0.821
  • Other Lead Sources has high p-value. Let’s drop this variable.

Model 11 - Final Model

1# Model 11 : Removing `Other Lead Sources`
2column_to_remove = 'Other Lead Sources'
3features = X_train.columns[X_train.columns !=column_to_remove]
4X_train = X_train[features]
5logm_final = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
6logm_final = logm_final.fit()
7print("VIF for X_train")
8vif(X_train)
9logm_final.summary()
1VIF for X_train
2+----+-----------------------------+----------+
3| | index | vif |
4|----+-----------------------------+----------|
5| 0 | const | 17.7441 |
6| 13 | Landing Page Submission | 3.3594 |
7| 3 | No Specialization | 2.96428 |
8| 15 | Email Opened | 2.09302 |
9| 6 | SMS Sent | 1.97027 |
10| 12 | Olark Chat | 1.96234 |
11| 9 | Olark Chat Conversation | 1.75584 |
12| 11 | Reference | 1.45497 |
13| 14 | Total Time Spent on Website | 1.3339 |
14| 1 | Do Not Email | 1.16719 |
15| 2 | Unknown Occupation | 1.1536 |
16| 8 | Working Professional | 1.14772 |
17| 10 | Welingak Website | 1.09768 |
18| 7 | Other Last Activity | 1.09593 |
19| 5 | Outside India | 1.02258 |
20| 4 | Hospitality Management | 1.0094 |
21+----+-----------------------------+----------+
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 6309
Model: GLM Df Residuals: 6293
Model Family: Binomial Df Model: 15
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2532.8
Date: Tue, 08 Sep 2020 Deviance: 5065.5
Time: 07:36:57 Pearson chi2: 6.39e+03
No. Iterations: 7
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -0.6469 0.149 -4.338 0.000 -0.939 -0.355
Do Not Email -1.5426 0.188 -8.222 0.000 -1.910 -1.175
Unknown Occupation -1.2699 0.090 -14.094 0.000 -1.446 -1.093
No Specialization -0.9057 0.124 -7.281 0.000 -1.150 -0.662
Hospitality Management -0.8704 0.332 -2.621 0.009 -1.521 -0.220
Outside India -0.6584 0.233 -2.823 0.005 -1.116 -0.201
SMS Sent 1.7923 0.113 15.927 0.000 1.572 2.013
Other Last Activity 1.1749 0.254 4.622 0.000 0.677 1.673
Working Professional 2.3769 0.189 12.576 0.000 2.006 2.747
Olark Chat Conversation -0.8614 0.192 -4.488 0.000 -1.238 -0.485
Welingak Website 5.3886 0.736 7.323 0.000 3.946 6.831
Reference 3.0246 0.246 12.302 0.000 2.543 3.506
Olark Chat 1.1876 0.125 9.530 0.000 0.943 1.432
Landing Page Submission -1.0250 0.128 -8.018 0.000 -1.276 -0.774
Total Time Spent on Website 1.1253 0.041 27.204 0.000 1.044 1.206
Email Opened 0.6106 0.110 5.545 0.000 0.395 0.826
  • From the above, the features that remain are statistically significant and donot show any multi collinearity.
  • Hence, we could use Model 11 is our final model.

Final Features

1finalFeatures = X_train.columns.values
2print('The Final Feature for Modelling are :', finalFeatures)
1The Final Feature for Modelling are : ['Do Not Email' 'Unknown Occupation' 'No Specialization'
2 'Hospitality Management' 'Outside India' 'SMS Sent' 'Other Last Activity'
3 'Working Professional' 'Olark Chat Conversation' 'Welingak Website'
4 'Reference' 'Olark Chat' 'Landing Page Submission'
5 'Total Time Spent on Website' 'Email Opened']

Predictions

Predictions on Train set

1X_train_sm = sm.add_constant(X_train)
2y_train_pred = logm_final.predict(X_train_sm)

Actual Conversions vs Conversion Predictions

1# Creating a data frame with converted vs converted probabilities
2y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})
3y_train_pred_final['CustID'] = y_train.index
4tab(y_train_pred_final.head(10))
1+------+-------------+------------------+----------+
2| | Converted | Converted_Prob | CustID |
3|------+-------------+------------------+----------|
4| 4948 | 1 | 0.401461 | 4948 |
5| 5938 | 1 | 0.318706 | 5938 |
6| 5688 | 1 | 0.745966 | 5688 |
7| 5381 | 0 | 0.00284809 | 5381 |
8| 4742 | 1 | 0.801898 | 4742 |
9| 5811 | 0 | 0.167979 | 5811 |
10| 898 | 0 | 0.0880976 | 898 |
11| 5316 | 0 | 0.0283184 | 5316 |
12| 7381 | 0 | 0.251634 | 7381 |
13| 1211 | 0 | 0.0235422 | 1211 |
14+------+-------------+------------------+----------+

Predictions with cut off = 0.5

1#Creating new column 'predicted' with 1 if Converted_Prob > 0.5 else 0
2y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)
3
4# Let's see the head
5tab(y_train_pred_final.head(10))
1+------+-------------+------------------+----------+-------------+
2| | Converted | Converted_Prob | CustID | predicted |
3|------+-------------+------------------+----------+-------------|
4| 4948 | 1 | 0.401461 | 4948 | 0 |
5| 5938 | 1 | 0.318706 | 5938 | 0 |
6| 5688 | 1 | 0.745966 | 5688 | 1 |
7| 5381 | 0 | 0.00284809 | 5381 | 0 |
8| 4742 | 1 | 0.801898 | 4742 | 1 |
9| 5811 | 0 | 0.167979 | 5811 | 0 |
10| 898 | 0 | 0.0880976 | 898 | 0 |
11| 5316 | 0 | 0.0283184 | 5316 | 0 |
12| 7381 | 0 | 0.251634 | 7381 | 0 |
13| 1211 | 0 | 0.0235422 | 1211 | 0 |
14+------+-------------+------------------+----------+-------------+

Confusion Matrix

1from sklearn import metrics
2# Confusion matrix
3confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
4print(confusion)
1[[3462 455]
2 [ 699 1693]]

Confusion Matrix for Train Set

$\frac{Predicted}{Actual}$Not ConvertedConverted
Not Converted3462455
Converted6991693

Accuracy of the Model

1# Let's check the overall accuracy.
2accuracy = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted)
3print('Accuracy on Train set : ', round(100*accuracy,3),'%')
1Accuracy on Train set : 81.709 %

Metrics beyond simple accuracy

1TP = confusion[1,1] # true positive
2TN = confusion[0,0] # true negatives
3FP = confusion[0,1] # false positives
4FN = confusion[1,0] # false negatives
5sensitivity = TP/(FN + TP)
6specificity = TN/(FP + TN)
7falsePositiveRate = FP/(FP + TN)
8positivePredictivePower = TP/(TP +FP )
9negativePredictivePower = TN/(TN + FN)
10print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
11print('specificity : ', round(100*specificity,3),'%')
12print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
13print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
14print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
1sensitivity / Recall: 70.778 %
2specificity : 88.384 %
3False Positive Rate : 11.616 %
4Precision / Positive Predictive Power : 78.818 %
5Negative Predictive Power : 83.201 %

Plotting ROC Curve.

1def draw_roc( actual, probs ):
2 fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
3 drop_intermediate = False )
4 auc_score = metrics.roc_auc_score( actual, probs )
5 plt.figure(figsize=(5, 5))
6 plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
7 plt.plot([0, 1], [0, 1], 'k--')
8 plt.xlim([0.0, 1.0])
9 plt.ylim([0.0, 1.05])
10 plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
11 plt.ylabel('True Positive Rate')
12 plt.title('Receiver operating characteristic')
13 plt.legend(loc="lower right")
14 plt.show()
15
16 return None
1draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)

png

Finding Optimal Cutoff Point

  • Optimal cutoff probability is that prob where we get balanced sensitivity and specificity
1# Let's create columns with different probability cutoffs
2numbers = [float(x)/10 for x in range(10)]
3for i in numbers:
4 y_train_pred_final[i]= y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)
5tab(y_train_pred_final.head())
1+------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
2| | Converted | Converted_Prob | CustID | predicted | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
3|------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------|
4| 4948 | 1 | 0.401461 | 4948 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
5| 5938 | 1 | 0.318706 | 5938 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
6| 5688 | 1 | 0.745966 | 5688 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
7| 5381 | 0 | 0.00284809 | 5381 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8| 4742 | 1 | 0.801898 | 4742 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
9+------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
1# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
2cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
3
4# TP = confusion[1,1] # true positive
5# TN = confusion[0,0] # true negatives
6# FP = confusion[0,1] # false positives
7# FN = confusion[1,0] # false negatives
8
9num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
10for i in num:
11 cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )
12 total1=sum(sum(cm1))
13 accuracy = (cm1[0,0]+cm1[1,1])/total1
14
15 speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
16 sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
17 cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
18tab(cutoff_df)
1+-----+--------+------------+----------+----------+
2| | prob | accuracy | sensi | speci |
3|-----+--------+------------+----------+----------|
4| 0 | 0 | 0.379141 | 1 | 0 |
5| 0.1 | 0.1 | 0.632905 | 0.974498 | 0.424304 |
6| 0.2 | 0.2 | 0.758599 | 0.923077 | 0.658157 |
7| 0.3 | 0.3 | 0.793311 | 0.869983 | 0.74649 |
8| 0.4 | 0.4 | 0.81883 | 0.781355 | 0.841716 |
9| 0.5 | 0.5 | 0.817087 | 0.707776 | 0.88384 |
10| 0.6 | 0.6 | 0.808052 | 0.634197 | 0.91422 |
11| 0.7 | 0.7 | 0.782533 | 0.513796 | 0.946643 |
12| 0.8 | 0.8 | 0.755112 | 0.405518 | 0.968598 |
13| 0.9 | 0.9 | 0.715961 | 0.273829 | 0.985959 |
14+-----+--------+------------+----------+----------+
1# Let's plot accuracy sensitivity and specificity for various cutoff probabilities.
2
3fig,ax = plt.subplots()
4fig.set_figwidth(30)
5fig.set_figheight(10)
6plots=['accuracy','sensi','speci']
7ax.set_xticks(np.linspace(0,1,50))
8ax.set_title('Finding Optimal Cutoff')
9sns.lineplot(x='prob',y=plots[0] , data=cutoff_df,ax=ax)
10sns.lineplot(x='prob',y=plots[1] , data=cutoff_df,ax=ax)
11sns.lineplot(x='prob',y=plots[2] , data=cutoff_df,ax=ax)
12
13ax.set_xlabel('Probabilites')
14ax.set_ylabel('Accuracy,Sensitivity,Specificity')
15ax.legend(["Accuracy",'Sensitivity','Specificity'])
16# cutoff_df.plot.line(, figure=[10,10])
17plt.show()

png

  • From the curve above, 0.36 is the optimum cutoff probability.
1y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.36 else 0)
2
3tab(y_train_pred_final.head())
1+------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------------------+
2| | Converted | Converted_Prob | CustID | predicted | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | final_predicted |
3|------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------------------|
4| 4948 | 1 | 0.401461 | 4948 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5| 5938 | 1 | 0.318706 | 5938 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6| 5688 | 1 | 0.745966 | 5688 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
7| 5381 | 0 | 0.00284809 | 5381 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8| 4742 | 1 | 0.801898 | 4742 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
9+------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------------------+
1# Let's check the overall accuracy.
2accu = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
3print('Accuracy on Train set at Optimum Cut Off : ', round(100*accu,3),'%')
1Accuracy on Train set at Optimum Cut Off : 81.249 %
1confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )
2confusion2
1array([[3203, 714],
2 [ 469, 1923]])
1TP = confusion2[1,1] # true positive
2TN = confusion2[0,0] # true negatives
3FP = confusion2[0,1] # false positives
4FN = confusion2[1,0] # false negatives
5sensitivity = TP/(FN + TP)
6specificity = TN/(FP + TN)
7falsePositiveRate = FP/(FP + TN)
8positivePredictivePower = TP/(TP +FP )
9negativePredictivePower = TN/(TN + FN)
10print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
11print('specificity : ', round(100*specificity,3),'%')
12print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
13print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
14print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
1sensitivity / Recall: 80.393 %
2specificity : 81.772 %
3False Positive Rate : 18.228 %
4Precision / Positive Predictive Power : 72.924 %
5Negative Predictive Power : 87.228 %
1## ROC curve for cut off probability of 0.36
2draw_roc(y_train_pred_final.Converted, y_train_pred_final.final_predicted)

png

Precision and Recall

1#Looking at the confusion matrix again
2confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )
3confusion
1array([[3462, 455],
2 [ 699, 1693]])
  • Precision :TP / TP + FP
  • Recall :TP / TP + FN
1print('Precision :', confusion[1,1]/(confusion[0,1]+confusion[1,1]))
2print('Recall :', confusion[1,1]/(confusion[1,0]+confusion[1,1]))
1Precision : 0.7881750465549349
2Recall : 0.7077759197324415
1#Doing the same using the sklearn.
2from sklearn.metrics import precision_score, recall_score
3print('Precision : ', precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
4print('Recall :', recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
1Precision : 0.7881750465549349
2Recall : 0.7077759197324415

Precision and Recall Tradeoff

1from sklearn.metrics import precision_recall_curve
2
3p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)
4plt.plot(thresholds, p[:-1], "g-")
5plt.plot(thresholds, r[:-1], "r-")
6plt.show()

png

  • The cut off point from precision-recall curve is ~0.4.
  • Note that we have used the cut off obtained from ‘Sensitivity-Specificity’ trade off to predict conversions in this analysis.

Predictions on Test set

1X_test_sm = sm.add_constant(X_test[finalFeatures])
2y_test_pred = logm_final.predict(X_test_sm)

Actual Conversions vs Conversion Probability

1# predicted conversions vs actual conversions and customer ID
2y_test_predictions = pd.DataFrame({'Converted' :y_test, 'Conversion Probability' : y_test_pred, 'CustID' : y_test.index})
3tab(y_test_predictions.head())
1+------+-------------+--------------------------+----------+
2| | Converted | Conversion Probability | CustID |
3|------+-------------+--------------------------+----------|
4| 1260 | 0 | 0.116132 | 1260 |
5| 2104 | 1 | 0.318706 | 2104 |
6| 7105 | 1 | 0.982061 | 7105 |
7| 8916 | 0 | 0.42048 | 8916 |
8| 2822 | 0 | 0.0292669 | 2822 |
9+------+-------------+--------------------------+----------+
1# predictions with optimal cut off = 0.35
2cutoff=0.36
3y_test_predictions['Predicted'] = y_test_predictions[
4 'Conversion Probability'
5].map(lambda x : 1 if x > cutoff else 0 )

Confusion Matrix

1confusion = metrics.confusion_matrix(y_test_predictions['Converted'], y_test_predictions['Predicted'])
2confusion
1array([[1356, 322],
2 [ 230, 797]])
1TP = confusion[1,1] # true positive
2TN = confusion[0,0] # true negatives
3FP = confusion[0,1] # false positives
4FN = confusion[1,0] # false negatives

Accuracy

1print('Accuracy on Test set : ', round(100*(TP + TN)/(TP + TN + FP + FN),3),'%')
1Accuracy on Test set : 79.593 %

Metrics Beyond Simple Accuracy

1sensitivity = TP/(FN + TP)
2specificity = TN/(FP + TN)
3falsePositiveRate = FP/(FP + TN)
4positivePredictivePower = TP/(TP +FP )
5negativePredictivePower = TN/(TN + FN)
6print('sensitivity / Recall: ', round(100*sensitivity,3),'%')
7print('specificity : ', round(100*specificity,3),'%')
8print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
9print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')
10print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
1sensitivity / Recall: 77.605 %
2specificity : 80.81 %
3False Positive Rate : 19.19 %
4Precision / Positive Predictive Power : 71.224 %
5Negative Predictive Power : 85.498 %
1## ROC curve for cut off probability of 0.364
2draw_roc(y_test_predictions['Converted'],y_test_predictions['Predicted'])

png

  • Note the AUC is 0.79 on the test test

Lead Scoring

1# merging final predictions with leads dataset
2conversionProb = pd.concat([y_test_predictions['Conversion Probability'],y_train_pred_final['Converted_Prob']],axis=0)
3conversionProb = pd.DataFrame({'Conversion Probability' : conversionProb}, index=conversionProb.index)
4leads = pd.concat([leads,conversionProb],axis=1)
5leads['Prospect ID'] = prospect_ids
6leads['Lead No'] = lead_no
7leads['Converted'] = y
1# Verifying prediction accuracy
2leads['Predicted'] = leads['Conversion Probability'].map(lambda x : 1 if x > 0.36 else 0)
1confusion = metrics.confusion_matrix(leads['Converted'], leads['Predicted'])
1TP = confusion[1,1] # true positive
2TN = confusion[0,0] # true negatives
3FP = confusion[0,1] # false positives
4FN = confusion[1,0] # false negatives
5acc = metrics.accuracy_score(leads['Converted'], leads['Predicted'])
6print('Accuracy : ', round(100*acc,3),'%')
7sensitivity = TP/(FN + TP)
8specificity = TN/(FP + TN)
9falsePositiveRate = FP/(FP + TN)
10falseNegativeRate = FN/(FP + TP)
11positivePredictivePower = TP/(TP +FP )
12negativePredictivePower = TN/(TN + FN)
13print('sensitivity : ', round(100*sensitivity,3),'%')
14print('specificity : ', round(100*specificity,3),'%')
15print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')
16print('False Negative Rate : ', round(100*falseNegativeRate,3),'%')
17print('Positive Predictive Power / Precision : ', round(100*positivePredictivePower,3),'%')
18print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
1Accuracy : 80.752 %
2sensitivity : 79.555 %
3specificity : 81.483 %
4False Positive Rate : 18.517 %
5False Negative Rate : 18.61 %
6Positive Predictive Power / Precision : 72.417 %
7Negative Predictive Power : 86.706 %
1## ROC curve
2draw_roc(leads['Converted'], leads['Predicted'])

png

1# Lead Scores
2leads['Lead Score'] = leads['Conversion Probability']*100
3tab(leads[['Prospect ID','Lead No','Lead Score']].sort_values(by='Lead Score', ascending=False)[:10])
1+------+--------------------------------------+-----------+--------------+
2| | Prospect ID | Lead No | Lead Score |
3|------+--------------------------------------+-----------+--------------|
4| 7219 | ed62264f-7666-4bf9-9cb6-5b9a825f1e67 | 594038 | 99.9118 |
5| 7234 | 7e2819e8-97f0-416b-bcb6-45ef14f0e11a | 593962 | 99.8559 |
6| 2378 | 8ff353ab-1207-4608-a8cc-8172ea7c12eb | 636860 | 99.8303 |
7| 7327 | 95d1590f-7c47-4f40-9806-388f4472d3a4 | 593208 | 99.826 |
8| 2497 | e5fb32dd-b3b7-4fbf-972d-c13d2cfc6866 | 635761 | 99.8221 |
9| 5671 | 623bc6c9-9184-4437-b38f-d374be49d1a3 | 606508 | 99.8221 |
10| 7094 | 9ec1cafe-b019-498e-b246-7ab06167d72c | 595141 | 99.82 |
11| 7187 | f33166e8-d8d3-4e8c-b9d0-8a1922c35910 | 594369 | 99.8176 |
12| 7420 | 2caa32d0-50b7-4d29-b31f-2528b06d7bc8 | 592625 | 99.8162 |
13| 8120 | bf4a03bc-b747-45a6-a6b5-659afa3bf3ac | 587853 | 99.8151 |
14+------+--------------------------------------+-----------+--------------+

Score Sheet for X Education

1# Run the following to generate a sheet containing lead information provided by the company and corresponding scores
2leads.to_csv('lead_scores.csv')

KS Statistic

1# Gain Chart
2y_test_predictions = y_test_predictions.sort_values(by='Conversion Probability', ascending=False)
3y_test_predictions['decile'] = pd.qcut(y_test_predictions['Conversion Probability'],10,labels=range(10,0,-1))
4y_test_predictions['Converted'] = y_test_predictions['Converted'].astype('int')
5y_test_predictions['Un Converted'] = 1 - y_test_predictions['Converted']
6tab(y_test_predictions.head())
1+------+-------------+--------------------------+----------+-------------+----------+----------------+
2| | Converted | Conversion Probability | CustID | Predicted | decile | Un Converted |
3|------+-------------+--------------------------+----------+-------------+----------+----------------|
4| 7327 | 1 | 0.99826 | 7327 | 1 | 1 | 0 |
5| 7420 | 1 | 0.998162 | 7420 | 1 | 1 | 0 |
6| 4613 | 1 | 0.997809 | 4613 | 1 | 1 | 0 |
7| 6243 | 1 | 0.997674 | 6243 | 1 | 1 | 0 |
8| 7324 | 1 | 0.997402 | 7324 | 1 | 1 | 0 |
9+------+-------------+--------------------------+----------+-------------+----------+----------------+
1df1 = pd.pivot_table(data=y_test_predictions,index=['decile'],values=['Converted','Un Converted','Conversion Probability'],
2 aggfunc={'Converted':[np.sum],
3 'Un Converted':[np.sum],
4 'Conversion Probability' : [np.min,np.max]})
5df1 = df1.reset_index()
6df1.columns = ['Decile','Max Prob', 'Min Prob','Converted Count','Un Converted Count']
7df1 = df1.sort_values(by='Decile', ascending=False)
8df1['Total Leads'] = df1['Converted Count'] + df1['Un Converted Count']
9df1['Conversion Rate'] = df1['Converted Count'] / df1['Un Converted Count']
10converted_sum = df1['Converted Count'].sum()
11unconverted_sum = df1['Un Converted Count'].sum()
12df1['Converted %'] = df1['Converted Count'] / converted_sum
13df1['Un Converted %'] = df1['Un Converted Count'] / unconverted_sum
14tab(df1.head())
1+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+
2| | Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % |
3|----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------|
4| 9 | 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 |
5| 8 | 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 |
6| 7 | 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 |
7| 6 | 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 |
8| 5 | 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 |
9+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+
1df1['ks_stats'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum() -(df1['Un Converted Count'] / df1['Un Converted Count'].sum()).cumsum()), 4) * 100
2tab(df1)
1+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+
2| | Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % | ks_stats |
3|----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------|
4| 9 | 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 | 24.19 |
5| 8 | 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 | 43.26 |
6| 7 | 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 | 55.68 |
7| 6 | 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 | 58.11 |
8| 5 | 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 | 59.76 |
9| 4 | 6 | 0.250676 | 0.149579 | 64 | 207 | 271 | 0.309179 | 0.0623174 | 0.123361 | 53.65 |
10| 3 | 7 | 0.149564 | 0.113845 | 37 | 233 | 270 | 0.158798 | 0.0360273 | 0.138856 | 43.37 |
11| 2 | 8 | 0.113644 | 0.0693867 | 22 | 249 | 271 | 0.0883534 | 0.0214216 | 0.148391 | 30.67 |
12| 1 | 9 | 0.0693778 | 0.0292784 | 8 | 256 | 264 | 0.03125 | 0.00778968 | 0.152563 | 16.19 |
13| 0 | 10 | 0.0292669 | 0.00223149 | 2 | 275 | 277 | 0.00727273 | 0.00194742 | 0.163886 | 0 |
14+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+
  • Max KS Statistic is 59.76 for 5th decile
  • This model discriminates between Converted and Non-converted leads well since KS Statistic in 4th decile (58.11) is greater than 40%. Hence, this is a reasonable good model.

Gain Chart

1df1['Cum Conversion %'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum()), 4) * 100
2tab(df1)
1+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+
2| | Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % | ks_stats | Cum Conversion % |
3|----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------|
4| 9 | 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 | 24.19 | 25.02 |
5| 8 | 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 | 43.26 | 46.84 |
6| 7 | 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 | 55.68 | 64.56 |
7| 6 | 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 | 58.11 | 76.05 |
8| 5 | 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 | 59.76 | 87.05 |
9| 4 | 6 | 0.250676 | 0.149579 | 64 | 207 | 271 | 0.309179 | 0.0623174 | 0.123361 | 53.65 | 93.28 |
10| 3 | 7 | 0.149564 | 0.113845 | 37 | 233 | 270 | 0.158798 | 0.0360273 | 0.138856 | 43.37 | 96.88 |
11| 2 | 8 | 0.113644 | 0.0693867 | 22 | 249 | 271 | 0.0883534 | 0.0214216 | 0.148391 | 30.67 | 99.03 |
12| 1 | 9 | 0.0693778 | 0.0292784 | 8 | 256 | 264 | 0.03125 | 0.00778968 | 0.152563 | 16.19 | 99.81 |
13| 0 | 10 | 0.0292669 | 0.00223149 | 2 | 275 | 277 | 0.00727273 | 0.00194742 | 0.163886 | 0 | 100 |
14+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+
1df1['Base %'] = np.arange(10,110,10)
2df1 = df1.set_index('Decile')
1tab(df1)
1+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+
2| Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % | ks_stats | Cum Conversion % | Base % |
3|----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------|
4| 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 | 24.19 | 25.02 | 10 |
5| 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 | 43.26 | 46.84 | 20 |
6| 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 | 55.68 | 64.56 | 30 |
7| 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 | 58.11 | 76.05 | 40 |
8| 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 | 59.76 | 87.05 | 50 |
9| 6 | 0.250676 | 0.149579 | 64 | 207 | 271 | 0.309179 | 0.0623174 | 0.123361 | 53.65 | 93.28 | 60 |
10| 7 | 0.149564 | 0.113845 | 37 | 233 | 270 | 0.158798 | 0.0360273 | 0.138856 | 43.37 | 96.88 | 70 |
11| 8 | 0.113644 | 0.0693867 | 22 | 249 | 271 | 0.0883534 | 0.0214216 | 0.148391 | 30.67 | 99.03 | 80 |
12| 9 | 0.0693778 | 0.0292784 | 8 | 256 | 264 | 0.03125 | 0.00778968 | 0.152563 | 16.19 | 99.81 | 90 |
13| 10 | 0.0292669 | 0.00223149 | 2 | 275 | 277 | 0.00727273 | 0.00194742 | 0.163886 | 0 | 100 | 100 |
14+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+
1### Gain chart
2plot_columns =['Base %','Cum Conversion %']
3plt.plot(df1[plot_columns]);
4plt.xticks(df1.index);
5plt.title('Gain chart');
6plt.xlabel('Decile')
7plt.ylabel('Cummulative Conversion %')
8plt.legend(('Our Model','Random Model'));

png

  • Instead of pursuing leads randomly, pursuing the top 40% leads scored by the model would let the sales team reach 80% of leads likely to convert.

Lift Chart

1df1['Lift'] = df1['Cum Conversion %'] / df1['Base %']
2df1['Baseline'] = 1
3tab(df1)
1+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+---------+------------+
2| Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % | ks_stats | Cum Conversion % | Base % | Lift | Baseline |
3|----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+---------+------------|
4| 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 | 24.19 | 25.02 | 10 | 2.502 | 1 |
5| 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 | 43.26 | 46.84 | 20 | 2.342 | 1 |
6| 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 | 55.68 | 64.56 | 30 | 2.152 | 1 |
7| 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 | 58.11 | 76.05 | 40 | 1.90125 | 1 |
8| 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 | 59.76 | 87.05 | 50 | 1.741 | 1 |
9| 6 | 0.250676 | 0.149579 | 64 | 207 | 271 | 0.309179 | 0.0623174 | 0.123361 | 53.65 | 93.28 | 60 | 1.55467 | 1 |
10| 7 | 0.149564 | 0.113845 | 37 | 233 | 270 | 0.158798 | 0.0360273 | 0.138856 | 43.37 | 96.88 | 70 | 1.384 | 1 |
11| 8 | 0.113644 | 0.0693867 | 22 | 249 | 271 | 0.0883534 | 0.0214216 | 0.148391 | 30.67 | 99.03 | 80 | 1.23788 | 1 |
12| 9 | 0.0693778 | 0.0292784 | 8 | 256 | 264 | 0.03125 | 0.00778968 | 0.152563 | 16.19 | 99.81 | 90 | 1.109 | 1 |
13| 10 | 0.0292669 | 0.00223149 | 2 | 275 | 277 | 0.00727273 | 0.00194742 | 0.163886 | 0 | 100 | 100 | 1 | 1 |
14+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+---------+------------+
1# Lift chart
2plot_columns =['Lift', 'Baseline']
3plt.plot(df1[plot_columns]);
4plt.xticks(df1.index);
5plt.title('Lift chart');
6plt.xlabel('Decile')
7plt.ylabel('Lift')
8plt.legend(('Our Model','Random Model'));

png

  • The model outperforms a random model by alteast 2 times in identifying the top 40% potentially convertible leads.
  • As opposed to 10% conversions from 10% leads pursued randomly, pursuing the top 10% leads scored by this model would lead to 24% conversions.

Conclusion

A logistic regression model is created using lead features. To arrive at the list of features which significantly affect conversion probability, a mixed feature elimination approach is followed. 25 most important features are obtained through Recursive Feature Elimination and then reduced to 15 via p-value / VIF approach. The dataset is randomly divided into train and test set. (70 - 30 split).

The final relationship between log Odds of Conversion Probability and lead features is

logOdds(Conversion Probability) = -0.6469 - 1.5426 Do Not Email -1.2699 Unknown Occupation -0.9057 No Specialization -0.8704 Hospitality Management - 0.6584 Outside India + 1.7923 SMS Sent + 1.1749 Other Last Activity + 2.3769 Working Professional - 0.8614 Olark Chat Conversation + 5.3886 Welingak Website + 3.0246 Reference + 1.1876 Olark Chat -1.0250 Landing Page Submission + 1.1253 Total Time Spent on Website + 0.6106 * Email Opened

where Total Time Spent on Website is standardized to $\mu=0,\sigma=1$

Interpreting Top 6 features affecting Conversion Probability :

  • A lead from Welingak Website has 5.4 times higher log odds of conversion than those from Google.
  • Leads through Reference have 3 times higher log odds of conversion than those from Google.
  • Leads from Working Professional have 2.38 times higher log odds of conversion than those from Businessman.
  • Leads with SMS Sent have 1.8 times higher log odds of conversion than those with no SMS sent.
  • Leads with Do Not Email have 1.5 times lesser log odds of conversion compared to leads who would like email updates.
  • Leads with Unknown Occupation have 1.27 times lesser log odds of conversion compared to those from Businessman.

Lead Scores :

  • Score sheet can be generated by running this cell.

At an optimum cut-off probability of 0.36, model performance is as follows.

Model Performance on Training Set :

  • Accuracy : 81.7%
  • Sensitivity / Recall: 80.393 %
  • Specificity : 81.772 %
  • Precision / Positive Predictive Power : 72.924 %
  • False Positive Rate : 18.228 %
  • AUC Score : 0.81

Model Performance for Test Set :

  • Accuracy : 79.593 %
  • Sensitivity / Recall : 77.605 %
  • Specificity : 80.81%
  • Precision / Positive Predictive Power : 71.224 %
  • False Positive Rate : 19.19 %
  • AUC Score : 0.79

KS statistic :

  • Max KS Statistic is 59.76 for 5th decile
  • This model discriminates between Converted and Non-converted leads well since KS Statistic in 4th decile (58.11) is greater than 40%. Hence, this is a reasonably good model.

Gain :

  • Inside of pursuing leads randomly, pursuing the top 40% leads scored by the model would let the sales team reach 80% of leads likely to convert.

Lift :

  • The model outperforms a random model by alteast 2 times in identifying the top 40% potentially convertible leads.
  • As opposed to 10% conversions from 10% leads pursued randomly, pursuing the top 10% leads scored by this model would lead to 24% conversions.

Note :

  • Incorrect data types have been corrected
  • Columns with high missing values have been dropped.
  • Columns which do not explain variability in the model have been dropped.
  • Columns with sales teams notes like Tags where the classes are not mutually exclusive have been dropped.
  • Features with low missing values have been imputed with the most frequent values.
  • Categories in a feature with less than 1% contribution have been grouped together to reduce the number of levels.
  • Inconsistencies in Categories have been corrected.
  • 97.5 % of the leads provided by the company have been used for analysis.
  • Class imbalance = 0.6
  • Indicator variables have been created for all categorical variables with the first category as the reference.
  • Continuous variables have been standardized $\mu : 0 , \sigma = 1$ before modelling.

More articles from Yugen

Countries in need of Aid

Finding Countries in dire need of aid using clustering

August 31st, 2020 · 8 min read

Wine Quality Classification

Predicting Wine Quality using Logistic Regression

August 7th, 2020 · 2 min read
© 2020–2021 Yugen
Link to $http://twitter.com/JayanthBoddu/Link to $https://github.com/jayantb1019Link to $https://www.instagram.com/jayantb1019/Link to $https://www.linkedin.com/in/jayanthboddu/