Photo by Will Myers on Unsplash
This analysis is the combined effort of Umaer and mine.
Lead Scoring for X Education
Problem Statement
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.
The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses or fill up a form for the course or watch some videos. When these people fill up a form providing their email address or phone number, they are classified to be a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X education is around 30%.
Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if, say, they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most potential leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone. A typical lead conversion process can be represented using the following funnel:
X Education has appointed us to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires us to build a model wherein we need to assign a lead score to each of the leads such that the customers with higher lead score have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.
Business Goals
- X-Education wants to improve their lead conversion.
- Rather than randomly pursuing leads, the company wants to create a pool of Hot Leads the sales team could focus on.
- They have tasked us to score their leads betwen 0-100 based on the probability of conversion. 100 being the most likely to convert and 0 being unlikely to convert.
Analysis Approach & Conclusions
Lead scoring is a class probability estimation problem, a form of classification problem. The target variable in the data set has two classes : 0 - Un converted and 1 - Converted. The objective is to model the probability(p) that each lead belongs to the class - Converted
. Since there are just two classes - it follows that the probability of belonging to class - Un-Converted is (1-p). The relationship between probability of conversion of each lead and its characteristics is modelled using Logistic Regression. And the leads are scored on a scale of 0-100, 100 being most probable conversion candidate.
The final solution has been provided in two parts. 1. Scoring the leads provided by the company in the order of probability of conversion (0-100) 2. Insights into the relationship between characteristics of a lead and the log-odds probability of conversion that could help the company score leads in the future.
A logistic regression model is created using lead features. To arrive at the list of features which significantly affect conversion probability, a mixed feature elimination approach is followed. 25 most important features are obtained through Recursive Feature Elimination and then reduced to 15 via p-value / VIF approach. The dataset is randomly divided into train and test set. (70 - 30 split).
The final relationship between log Odds of Conversion Probability and lead features is
logOdds(Conversion Probability)
= -0.6469 - 1.5426 Do Not Email
-1.2699 Unknown Occupation
-0.9057 No Specialization
-0.8704 Hospitality Management
- 0.6584 Outside India
+ 1.7923 SMS Sent
+ 1.1749 Other Last Activity
+ 2.3769 Working Professional
- 0.8614 Olark Chat Conversation
+ 5.3886 Welingak Website
+ 3.0246 Reference
+ 1.1876 Olark Chat
-1.0250 Landing Page Submission
+ 1.1253 Total Time Spent on Website
+ 0.6106 * Email Opened
where Total Time Spent on Website
is standardized to $\mu=0,\sigma=1$
Interpreting Top 6 features affecting Conversion Probability :
- A lead from
Welingak Website
has 5.4 times higher log odds of conversion than those fromGoogle
. - Leads through
Reference
have 3 times higher log odds of conversion than those fromGoogle
. - Leads from
Working Professional
have 2.38 times higher log odds of conversion than those fromBusinessman
. - Leads with
SMS Sent
have 1.8 times higher log odds of conversion than those with no SMS sent. - Leads with
Do Not Email
have 1.5 times lesser log odds of conversion compared to leads who would like email updates. - Leads with
Unknown Occupation
have 1.27 times lesser log odds of conversion compared to those fromBusinessman
.
Lead Scores :
- Score sheet can be generated by running coding in the cell named
Score Sheet for X Education
cell in the analysis notebook.
At an optimum cut-off probability of 0.36, model performance is as follows.
Model Performance on Training Set :
- Accuracy : 81.7%
- Sensitivity / Recall: 80.393 %
- Specificity : 81.772 %
- Precision / Positive Predictive Power : 72.924 %
- False Positive Rate : 18.228 %
- AUC Score : 0.81
Model Performance for Test Set :
- Accuracy : 79.593 %
- Sensitivity / Recall : 77.605 %
- Specificity : 80.81%
- Precision / Positive Predictive Power : 71.224 %
- False Positive Rate : 19.19 %
- AUC Score : 0.79
KS statistic :
- Max KS Statistic is 59.76 for 5th decile
- This model discriminates between Converted and Non-converted leads well since KS Statistic in 4th decile (58.11) is greater than 40%. Hence, this is a reasonably good model.
Gain :
- Instead of pursuing leads randomly, pursuing the top 40% leads scored by the model would let the sales team reach 80% of leads likely to convert.
Lift :
- The model outperforms a random model by alteast 2 times in identifying the top 40% potentially convertible leads.
- As opposed to 10% conversions from 10% leads pursued randomly, pursuing the top 10% leads scored by this model would lead to 24% conversions.
Note :
- Incorrect data types have been corrected
- Columns with high missing values have been dropped.
- Columns which do not explain variability in the model have been dropped.
- Columns with sales teams notes like
Tags
where the classes are not mutually exclusive have been dropped. - Features with low missing values have been imputed with the most frequent values.
- Categories in a feature with less than 1% contribution have been grouped together to reduce the number of levels.
- Inconsistencies in Categories have been corrected.
- 97.5 % of the leads provided by the company have been used for analysis.
- Class imbalance = 0.6
- Indicator variables have been created for all categorical variables with the first category as the reference.
- Continuous variables have been standardized $\mu : 0 , \sigma = 1$ before modelling.
1import numpy as np2import pandas as pd3import matplotlib.pyplot as plt45import seaborn as sns6sns.set_style('whitegrid')78import warnings9warnings.filterwarnings('ignore')1011!pip install tabulate12from tabulate import tabulate1314import sidetable1516!jt -t grade3 -f roboto -fs 12 -cellw 100%
1Requirement already satisfied: tabulate in /Users/jayanth/opt/anaconda3/lib/python3.7/site-packages (0.8.7)2libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn't close file
1# to table print a dataframe2def tab(ser) :3 print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt="psql"))
Importing Data
1# importing the dataset2leads = pd.read_csv('./leads.csv')34# Inspecting few column heads at a time5for i in range(0,leads.shape[1], 5) :6 if i+4 <= leads.shape[1] :7 print('Columns : ',i,' to ',i+4)8 else :9 print('Columns : ',i,' to last')10 tab(leads.iloc[:,i : i+5].head())11 print('\n')
1Columns : 0 to 42+----+--------------------------------------+---------------+-------------------------+----------------+----------------+3| | Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email |4|----+--------------------------------------+---------------+-------------------------+----------------+----------------|5| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No |6| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No |7| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No |8| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No |9| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | Google | No |10+----+--------------------------------------+---------------+-------------------------+----------------+----------------+111213Columns : 5 to 914+----+---------------+-------------+---------------+-------------------------------+------------------------+15| | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit |16|----+---------------+-------------+---------------+-------------------------------+------------------------|17| 0 | No | 0 | 0 | 0 | 0 |18| 1 | No | 0 | 5 | 674 | 2.5 |19| 2 | No | 1 | 2 | 1532 | 2 |20| 3 | No | 0 | 1 | 305 | 1 |21| 4 | No | 1 | 2 | 1428 | 1 |22+----+---------------+-------------+---------------+-------------------------------+------------------------+232425Columns : 10 to 1426+----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------+27| | Last Activity | Country | Specialization | How did you hear about X Education | What is your current occupation |28|----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------|29| 0 | Page Visited on Website | nan | Select | Select | Unemployed |30| 1 | Email Opened | India | Select | Select | Unemployed |31| 2 | Email Opened | India | Business Administration | Select | Student |32| 3 | Unreachable | India | Media and Advertising | Word Of Mouth | Unemployed |33| 4 | Converted to Lead | India | Select | Other | Unemployed |34+----+-------------------------+-----------+-------------------------+--------------------------------------+-----------------------------------+353637Columns : 15 to 1938+----+-------------------------------------------------+----------+------------+---------------------+----------------------+39| | What matters most to you in choosing a course | Search | Magazine | Newspaper Article | X Education Forums |40|----+-------------------------------------------------+----------+------------+---------------------+----------------------|41| 0 | Better Career Prospects | No | No | No | No |42| 1 | Better Career Prospects | No | No | No | No |43| 2 | Better Career Prospects | No | No | No | No |44| 3 | Better Career Prospects | No | No | No | No |45| 4 | Better Career Prospects | No | No | No | No |46+----+-------------------------------------------------+----------+------------+---------------------+----------------------+474849Columns : 20 to 2450+----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------+51| | Newspaper | Digital Advertisement | Through Recommendations | Receive More Updates About Our Courses | Tags |52|----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------|53| 0 | No | No | No | No | Interested in other courses |54| 1 | No | No | No | No | Ringing |55| 2 | No | No | No | No | Will revert after reading the email |56| 3 | No | No | No | No | Ringing |57| 4 | No | No | No | No | Will revert after reading the email |58+----+-------------+-------------------------+---------------------------+------------------------------------------+-------------------------------------+596061Columns : 25 to 2962+----+------------------+-------------------------------------+-----------------------------+----------------+--------+63| | Lead Quality | Update me on Supply Chain Content | Get updates on DM Content | Lead Profile | City |64|----+------------------+-------------------------------------+-----------------------------+----------------+--------|65| 0 | Low in Relevance | No | No | Select | Select |66| 1 | nan | No | No | Select | Select |67| 2 | Might be | No | No | Potential Lead | Mumbai |68| 3 | Not Sure | No | No | Select | Mumbai |69| 4 | Might be | No | No | Select | Mumbai |70+----+------------------+-------------------------------------+-----------------------------+----------------+--------+717273Columns : 30 to 3474+----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------+75| | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque |76|----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------|77| 0 | 02.Medium | 02.Medium | 15 | 15 | No |78| 1 | 02.Medium | 02.Medium | 15 | 15 | No |79| 2 | 02.Medium | 01.High | 14 | 20 | No |80| 3 | 02.Medium | 01.High | 13 | 17 | No |81| 4 | 02.Medium | 01.High | 15 | 18 | No |82+----+-------------------------------+------------------------------+-------------------------------+------------------------------+--------------------------------------------+838485Columns : 35 to last86+----+------------------------------------------+-------------------------+87| | A free copy of Mastering The Interview | Last Notable Activity |88|----+------------------------------------------+-------------------------|89| 0 | No | Modified |90| 1 | No | Email Opened |91| 2 | Yes | Email Opened |92| 3 | No | Modified |93| 4 | No | Modified |94+----+------------------------------------------+-------------------------+
1# dataset information2leads.info()
1<class 'pandas.core.frame.DataFrame'>2RangeIndex: 9240 entries, 0 to 92393Data columns (total 37 columns):4 # Column Non-Null Count Dtype5--- ------ -------------- -----6 0 Prospect ID 9240 non-null object7 1 Lead Number 9240 non-null int648 2 Lead Origin 9240 non-null object9 3 Lead Source 9204 non-null object10 4 Do Not Email 9240 non-null object11 5 Do Not Call 9240 non-null object12 6 Converted 9240 non-null int6413 7 TotalVisits 9103 non-null float6414 8 Total Time Spent on Website 9240 non-null int6415 9 Page Views Per Visit 9103 non-null float6416 10 Last Activity 9137 non-null object17 11 Country 6779 non-null object18 12 Specialization 7802 non-null object19 13 How did you hear about X Education 7033 non-null object20 14 What is your current occupation 6550 non-null object21 15 What matters most to you in choosing a course 6531 non-null object22 16 Search 9240 non-null object23 17 Magazine 9240 non-null object24 18 Newspaper Article 9240 non-null object25 19 X Education Forums 9240 non-null object26 20 Newspaper 9240 non-null object27 21 Digital Advertisement 9240 non-null object28 22 Through Recommendations 9240 non-null object29 23 Receive More Updates About Our Courses 9240 non-null object30 24 Tags 5887 non-null object31 25 Lead Quality 4473 non-null object32 26 Update me on Supply Chain Content 9240 non-null object33 27 Get updates on DM Content 9240 non-null object34 28 Lead Profile 6531 non-null object35 29 City 7820 non-null object36 30 Asymmetrique Activity Index 5022 non-null object37 31 Asymmetrique Profile Index 5022 non-null object38 32 Asymmetrique Activity Score 5022 non-null float6439 33 Asymmetrique Profile Score 5022 non-null float6440 34 I agree to pay the amount through cheque 9240 non-null object41 35 A free copy of Mastering The Interview 9240 non-null object42 36 Last Notable Activity 9240 non-null object43dtypes: float64(4), int64(3), object(30)44memory usage: 2.6+ MB
- This data set has a total of 9240 records ,each with 36 features.
- Each record represents the characteristics of a lead and whether the lead was converted.
Converted
column indicates whether the particular lead was converted to a client. This is our target variable.
Data Cleaning
Incorrect data types
1# 'Converted' is a binary categorical variable but the info shows it is `int64`. Converting to `category` data type2leads['Converted'] = leads['Converted'].astype('category')
Duplicates
1# Checking for any duplicate leads / prospects23duplicate_prospect_ids = leads['Prospect ID'][leads['Prospect ID'].duplicated()].sum()4duplicate_lead_no = leads['Lead Number'][leads['Lead Number'].duplicated()].sum()5print('No of Duplicate Prospect IDs : ', duplicate_prospect_ids)6print('No of Duplicate Lead Nos : ', duplicate_lead_no)
1No of Duplicate Prospect IDs : 02No of Duplicate Lead Nos : 0
- There are no duplicate prospect IDs or Lead Numbers
- Since, these are dimensions (i.e identification columns) not required for analysis, they could be popped for re-indentification at a later step.
Separating ID columns
1# Popping Prospect ID and Lead Number columns for later use2prospect_ids = leads.pop('Prospect ID')3lead_no = leads.pop('Lead Number')
Missing Values
1# Null values in each Column2nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])3nulls.columns = ['Null Percentage']45# Sorting null percentages in descending order and highlighting null % > 456nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 45 else '')
Null Percentage | |
Lead Quality | 51.590909 |
Asymmetrique Profile Score | 45.649351 |
Asymmetrique Activity Score | 45.649351 |
Asymmetrique Profile Index | 45.649351 |
Asymmetrique Activity Index | 45.649351 |
Tags | 36.287879 |
Lead Profile | 29.318182 |
What matters most to you in choosing a course | 29.318182 |
What is your current occupation | 29.112554 |
Country | 26.634199 |
How did you hear about X Education | 23.885281 |
Specialization | 15.562771 |
City | 15.367965 |
TotalVisits | 1.482684 |
Page Views Per Visit | 1.482684 |
Last Activity | 1.114719 |
Lead Source | 0.389610 |
- More than 45% of the leads have missing values in
Lead Quality
,Asymmetrique Profile Score
,Asymmetrique Activity Score
,Asymmetrique Profile Index
,Asymmetrique Activity Index
- Further, the data in these columns is filled by the sales team and the values depend heavily on the team’s judgement. These columns are not good candidates for modelling since the values are subjective.
- Hence these columns could be dropped.
1# Dropping columns with null percentage > 452high_null_col = nulls[nulls['Null Percentage'] >=45].index3leads.drop(columns=high_null_col, inplace=True)
1# Rows Missing Target Variable2print('Number of rows with missing Target Variable : ',leads['Converted'].isnull().sum())
1Number of rows with missing Target Variable : 0
- No rows with missing target value
1# Rows missing more than 50% of values2highNullRowsCondition = leads.isnull().sum(axis=1)/leads.shape[1] > 0.53leads[highNullRowsCondition].index
1Int64Index([], dtype='int64')
- No rows missing more than 50% of values.
Disguised missing Values
1# Categorical columns2condition = leads.dtypes == 'object'3categoricalColumns = leads.dtypes[condition].index.values4categoricalColumns
1array(['Lead Origin', 'Lead Source', 'Do Not Email', 'Do Not Call',2 'Last Activity', 'Country', 'Specialization',3 'How did you hear about X Education',4 'What is your current occupation',5 'What matters most to you in choosing a course', 'Search',6 'Magazine', 'Newspaper Article', 'X Education Forums', 'Newspaper',7 'Digital Advertisement', 'Through Recommendations',8 'Receive More Updates About Our Courses', 'Tags',9 'Update me on Supply Chain Content', 'Get updates on DM Content',10 'Lead Profile', 'City', 'I agree to pay the amount through cheque',11 'A free copy of Mastering The Interview', 'Last Notable Activity'],12 dtype=object)
1# value counts of each label in a categorical feature2def cat_value_counts(column_name) :3 '''4 prints unique values and value counts of each label in categorical column5 '''6 print(tabulate(pd.DataFrame(leads.stb.freq([column_name])), headers='keys', tablefmt='psql'))7 print(pd.DataFrame(leads[column_name]).stb.missing(),'\n\n\n')
1# Looking at value counts of each label in categorical variables2for col in sorted(categoricalColumns) :3 print(col)4 cat_value_counts(col)
1A free copy of Mastering The Interview2+----+------------------------------------------+---------+-----------+--------------------+----------------------+3| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |4|----+------------------------------------------+---------+-----------+--------------------+----------------------|5| 0 | No | 6352 | 0.687446 | 6352 | 0.687446 |6| 1 | Yes | 2888 | 0.312554 | 9240 | 1 |7+----+------------------------------------------+---------+-----------+--------------------+----------------------+8 Missing Total Percent9A free copy of Mastering The Interview 0 9240 0.010111213City14+----+-----------------------------+---------+------------+--------------------+----------------------+15| | City | Count | Percent | Cumulative Count | Cumulative Percent |16|----+-----------------------------+---------+------------+--------------------+----------------------|17| 0 | Mumbai | 3222 | 0.41202 | 3222 | 0.41202 |18| 1 | Select | 2249 | 0.287596 | 5471 | 0.699616 |19| 2 | Thane & Outskirts | 752 | 0.0961637 | 6223 | 0.79578 |20| 3 | Other Cities | 686 | 0.0877238 | 6909 | 0.883504 |21| 4 | Other Cities of Maharashtra | 457 | 0.0584399 | 7366 | 0.941944 |22| 5 | Other Metro Cities | 380 | 0.0485934 | 7746 | 0.990537 |23| 6 | Tier II Cities | 74 | 0.00946292 | 7820 | 1 |24+----+-----------------------------+---------+------------+--------------------+----------------------+25 Missing Total Percent26City 1420 9240 0.1536827282930Country31+----+----------------------+---------+-------------+--------------------+----------------------+32| | Country | Count | Percent | Cumulative Count | Cumulative Percent |33|----+----------------------+---------+-------------+--------------------+----------------------|34| 0 | India | 6492 | 0.957663 | 6492 | 0.957663 |35| 1 | United States | 69 | 0.0101785 | 6561 | 0.967842 |36| 2 | United Arab Emirates | 53 | 0.00781826 | 6614 | 0.97566 |37| 3 | Singapore | 24 | 0.00354035 | 6638 | 0.9792 |38| 4 | Saudi Arabia | 21 | 0.0030978 | 6659 | 0.982298 |39| 5 | United Kingdom | 15 | 0.00221272 | 6674 | 0.984511 |40| 6 | Australia | 13 | 0.00191769 | 6687 | 0.986429 |41| 7 | Qatar | 10 | 0.00147514 | 6697 | 0.987904 |42| 8 | Hong Kong | 7 | 0.0010326 | 6704 | 0.988936 |43| 9 | Bahrain | 7 | 0.0010326 | 6711 | 0.989969 |44| 10 | Oman | 6 | 0.000885086 | 6717 | 0.990854 |45| 11 | France | 6 | 0.000885086 | 6723 | 0.991739 |46| 12 | unknown | 5 | 0.000737572 | 6728 | 0.992477 |47| 13 | South Africa | 4 | 0.000590058 | 6732 | 0.993067 |48| 14 | Nigeria | 4 | 0.000590058 | 6736 | 0.993657 |49| 15 | Kuwait | 4 | 0.000590058 | 6740 | 0.994247 |50| 16 | Germany | 4 | 0.000590058 | 6744 | 0.994837 |51| 17 | Canada | 4 | 0.000590058 | 6748 | 0.995427 |52| 18 | Sweden | 3 | 0.000442543 | 6751 | 0.99587 |53| 19 | Uganda | 2 | 0.000295029 | 6753 | 0.996165 |54| 20 | Philippines | 2 | 0.000295029 | 6755 | 0.99646 |55| 21 | Netherlands | 2 | 0.000295029 | 6757 | 0.996755 |56| 22 | Italy | 2 | 0.000295029 | 6759 | 0.99705 |57| 23 | Ghana | 2 | 0.000295029 | 6761 | 0.997345 |58| 24 | China | 2 | 0.000295029 | 6763 | 0.99764 |59| 25 | Belgium | 2 | 0.000295029 | 6765 | 0.997935 |60| 26 | Bangladesh | 2 | 0.000295029 | 6767 | 0.99823 |61| 27 | Asia/Pacific Region | 2 | 0.000295029 | 6769 | 0.998525 |62| 28 | Vietnam | 1 | 0.000147514 | 6770 | 0.998672 |63| 29 | Tanzania | 1 | 0.000147514 | 6771 | 0.99882 |64| 30 | Switzerland | 1 | 0.000147514 | 6772 | 0.998967 |65| 31 | Sri Lanka | 1 | 0.000147514 | 6773 | 0.999115 |66| 32 | Russia | 1 | 0.000147514 | 6774 | 0.999262 |67| 33 | Malaysia | 1 | 0.000147514 | 6775 | 0.99941 |68| 34 | Liberia | 1 | 0.000147514 | 6776 | 0.999557 |69| 35 | Kenya | 1 | 0.000147514 | 6777 | 0.999705 |70| 36 | Indonesia | 1 | 0.000147514 | 6778 | 0.999852 |71| 37 | Denmark | 1 | 0.000147514 | 6779 | 1 |72+----+----------------------+---------+-------------+--------------------+----------------------+73 Missing Total Percent74Country 2461 9240 0.26634275767778Digital Advertisement79+----+-------------------------+---------+-----------+--------------------+----------------------+80| | Digital Advertisement | Count | Percent | Cumulative Count | Cumulative Percent |81|----+-------------------------+---------+-----------+--------------------+----------------------|82| 0 | No | 9236 | 0.999567 | 9236 | 0.999567 |83| 1 | Yes | 4 | 0.0004329 | 9240 | 1 |84+----+-------------------------+---------+-----------+--------------------+----------------------+85 Missing Total Percent86Digital Advertisement 0 9240 0.087888990Do Not Call91+----+---------------+---------+------------+--------------------+----------------------+92| | Do Not Call | Count | Percent | Cumulative Count | Cumulative Percent |93|----+---------------+---------+------------+--------------------+----------------------|94| 0 | No | 9238 | 0.999784 | 9238 | 0.999784 |95| 1 | Yes | 2 | 0.00021645 | 9240 | 1 |96+----+---------------+---------+------------+--------------------+----------------------+97 Missing Total Percent98Do Not Call 0 9240 0.099100101102Do Not Email103+----+----------------+---------+-----------+--------------------+----------------------+104| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |105|----+----------------+---------+-----------+--------------------+----------------------|106| 0 | No | 8506 | 0.920563 | 8506 | 0.920563 |107| 1 | Yes | 734 | 0.0794372 | 9240 | 1 |108+----+----------------+---------+-----------+--------------------+----------------------+109 Missing Total Percent110Do Not Email 0 9240 0.0111112113114Get updates on DM Content115+----+-----------------------------+---------+-----------+--------------------+----------------------+116| | Get updates on DM Content | Count | Percent | Cumulative Count | Cumulative Percent |117|----+-----------------------------+---------+-----------+--------------------+----------------------|118| 0 | No | 9240 | 1 | 9240 | 1 |119+----+-----------------------------+---------+-----------+--------------------+----------------------+120 Missing Total Percent121Get updates on DM Content 0 9240 0.0122123124125How did you hear about X Education126+----+--------------------------------------+---------+------------+--------------------+----------------------+127| | How did you hear about X Education | Count | Percent | Cumulative Count | Cumulative Percent |128|----+--------------------------------------+---------+------------+--------------------+----------------------|129| 0 | Select | 5043 | 0.717048 | 5043 | 0.717048 |130| 1 | Online Search | 808 | 0.114887 | 5851 | 0.831935 |131| 2 | Word Of Mouth | 348 | 0.049481 | 6199 | 0.881416 |132| 3 | Student of SomeSchool | 310 | 0.0440779 | 6509 | 0.925494 |133| 4 | Other | 186 | 0.0264468 | 6695 | 0.951941 |134| 5 | Multiple Sources | 152 | 0.0216124 | 6847 | 0.973553 |135| 6 | Advertisements | 70 | 0.00995308 | 6917 | 0.983506 |136| 7 | Social Media | 67 | 0.00952652 | 6984 | 0.993033 |137| 8 | Email | 26 | 0.00369686 | 7010 | 0.99673 |138| 9 | SMS | 23 | 0.0032703 | 7033 | 1 |139+----+--------------------------------------+---------+------------+--------------------+----------------------+140 Missing Total Percent141How did you hear about X Education 2207 9240 0.238853142143144145I agree to pay the amount through cheque146+----+--------------------------------------------+---------+-----------+--------------------+----------------------+147| | I agree to pay the amount through cheque | Count | Percent | Cumulative Count | Cumulative Percent |148|----+--------------------------------------------+---------+-----------+--------------------+----------------------|149| 0 | No | 9240 | 1 | 9240 | 1 |150+----+--------------------------------------------+---------+-----------+--------------------+----------------------+151 Missing Total Percent152I agree to pay the amount through cheque 0 9240 0.0153154155156Last Activity157+----+------------------------------+---------+-------------+--------------------+----------------------+158| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |159|----+------------------------------+---------+-------------+--------------------+----------------------|160| 0 | Email Opened | 3437 | 0.376163 | 3437 | 0.376163 |161| 1 | SMS Sent | 2745 | 0.300427 | 6182 | 0.67659 |162| 2 | Olark Chat Conversation | 973 | 0.10649 | 7155 | 0.78308 |163| 3 | Page Visited on Website | 640 | 0.0700449 | 7795 | 0.853125 |164| 4 | Converted to Lead | 428 | 0.0468425 | 8223 | 0.899967 |165| 5 | Email Bounced | 326 | 0.0356791 | 8549 | 0.935646 |166| 6 | Email Link Clicked | 267 | 0.0292218 | 8816 | 0.964868 |167| 7 | Form Submitted on Website | 116 | 0.0126956 | 8932 | 0.977564 |168| 8 | Unreachable | 93 | 0.0101784 | 9025 | 0.987742 |169| 9 | Unsubscribed | 61 | 0.00667615 | 9086 | 0.994418 |170| 10 | Had a Phone Conversation | 30 | 0.00328335 | 9116 | 0.997702 |171| 11 | Approached upfront | 9 | 0.000985006 | 9125 | 0.998687 |172| 12 | View in browser link Clicked | 6 | 0.000656671 | 9131 | 0.999343 |173| 13 | Email Received | 2 | 0.00021889 | 9133 | 0.999562 |174| 14 | Email Marked Spam | 2 | 0.00021889 | 9135 | 0.999781 |175| 15 | Visited Booth in Tradeshow | 1 | 0.000109445 | 9136 | 0.999891 |176| 16 | Resubscribed to emails | 1 | 0.000109445 | 9137 | 1 |177+----+------------------------------+---------+-------------+--------------------+----------------------+178 Missing Total Percent179Last Activity 103 9240 0.011147180181182183Last Notable Activity184+----+------------------------------+---------+-------------+--------------------+----------------------+185| | Last Notable Activity | Count | Percent | Cumulative Count | Cumulative Percent |186|----+------------------------------+---------+-------------+--------------------+----------------------|187| 0 | Modified | 3407 | 0.368723 | 3407 | 0.368723 |188| 1 | Email Opened | 2827 | 0.305952 | 6234 | 0.674675 |189| 2 | SMS Sent | 2172 | 0.235065 | 8406 | 0.90974 |190| 3 | Page Visited on Website | 318 | 0.0344156 | 8724 | 0.944156 |191| 4 | Olark Chat Conversation | 183 | 0.0198052 | 8907 | 0.963961 |192| 5 | Email Link Clicked | 173 | 0.0187229 | 9080 | 0.982684 |193| 6 | Email Bounced | 60 | 0.00649351 | 9140 | 0.989177 |194| 7 | Unsubscribed | 47 | 0.00508658 | 9187 | 0.994264 |195| 8 | Unreachable | 32 | 0.0034632 | 9219 | 0.997727 |196| 9 | Had a Phone Conversation | 14 | 0.00151515 | 9233 | 0.999242 |197| 10 | Email Marked Spam | 2 | 0.00021645 | 9235 | 0.999459 |198| 11 | View in browser link Clicked | 1 | 0.000108225 | 9236 | 0.999567 |199| 12 | Resubscribed to emails | 1 | 0.000108225 | 9237 | 0.999675 |200| 13 | Form Submitted on Website | 1 | 0.000108225 | 9238 | 0.999784 |201| 14 | Email Received | 1 | 0.000108225 | 9239 | 0.999892 |202| 15 | Approached upfront | 1 | 0.000108225 | 9240 | 1 |203+----+------------------------------+---------+-------------+--------------------+----------------------+204 Missing Total Percent205Last Notable Activity 0 9240 0.0206207208209Lead Origin210+----+-------------------------+---------+-------------+--------------------+----------------------+211| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |212|----+-------------------------+---------+-------------+--------------------+----------------------|213| 0 | Landing Page Submission | 4886 | 0.528788 | 4886 | 0.528788 |214| 1 | API | 3580 | 0.387446 | 8466 | 0.916234 |215| 2 | Lead Add Form | 718 | 0.0777056 | 9184 | 0.993939 |216| 3 | Lead Import | 55 | 0.00595238 | 9239 | 0.999892 |217| 4 | Quick Add Form | 1 | 0.000108225 | 9240 | 1 |218+----+-------------------------+---------+-------------+--------------------+----------------------+219 Missing Total Percent220Lead Origin 0 9240 0.0221222223224Lead Profile225+----+-----------------------------+---------+------------+--------------------+----------------------+226| | Lead Profile | Count | Percent | Cumulative Count | Cumulative Percent |227|----+-----------------------------+---------+------------+--------------------+----------------------|228| 0 | Select | 4146 | 0.634819 | 4146 | 0.634819 |229| 1 | Potential Lead | 1613 | 0.246976 | 5759 | 0.881795 |230| 2 | Other Leads | 487 | 0.0745674 | 6246 | 0.956362 |231| 3 | Student of SomeSchool | 241 | 0.0369009 | 6487 | 0.993263 |232| 4 | Lateral Student | 24 | 0.00367478 | 6511 | 0.996938 |233| 5 | Dual Specialization Student | 20 | 0.00306232 | 6531 | 1 |234+----+-----------------------------+---------+------------+--------------------+----------------------+235 Missing Total Percent236Lead Profile 2709 9240 0.293182237238239240Lead Source241+----+-------------------+---------+-------------+--------------------+----------------------+242| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |243|----+-------------------+---------+-------------+--------------------+----------------------|244| 0 | Google | 2868 | 0.311604 | 2868 | 0.311604 |245| 1 | Direct Traffic | 2543 | 0.276293 | 5411 | 0.587897 |246| 2 | Olark Chat | 1755 | 0.190678 | 7166 | 0.778575 |247| 3 | Organic Search | 1154 | 0.12538 | 8320 | 0.903955 |248| 4 | Reference | 534 | 0.0580183 | 8854 | 0.961973 |249| 5 | Welingak Website | 142 | 0.0154281 | 8996 | 0.977401 |250| 6 | Referral Sites | 125 | 0.0135811 | 9121 | 0.990982 |251| 7 | Facebook | 55 | 0.00597566 | 9176 | 0.996958 |252| 8 | bing | 6 | 0.00065189 | 9182 | 0.99761 |253| 9 | google | 5 | 0.000543242 | 9187 | 0.998153 |254| 10 | Click2call | 4 | 0.000434594 | 9191 | 0.998588 |255| 11 | Social Media | 2 | 0.000217297 | 9193 | 0.998805 |256| 12 | Press_Release | 2 | 0.000217297 | 9195 | 0.999022 |257| 13 | Live Chat | 2 | 0.000217297 | 9197 | 0.999239 |258| 14 | youtubechannel | 1 | 0.000108648 | 9198 | 0.999348 |259| 15 | welearnblog_Home | 1 | 0.000108648 | 9199 | 0.999457 |260| 16 | testone | 1 | 0.000108648 | 9200 | 0.999565 |261| 17 | blog | 1 | 0.000108648 | 9201 | 0.999674 |262| 18 | WeLearn | 1 | 0.000108648 | 9202 | 0.999783 |263| 19 | Pay per Click Ads | 1 | 0.000108648 | 9203 | 0.999891 |264| 20 | NC_EDM | 1 | 0.000108648 | 9204 | 1 |265+----+-------------------+---------+-------------+--------------------+----------------------+266 Missing Total Percent267Lead Source 36 9240 0.003896268269270271Magazine272+----+------------+---------+-----------+--------------------+----------------------+273| | Magazine | Count | Percent | Cumulative Count | Cumulative Percent |274|----+------------+---------+-----------+--------------------+----------------------|275| 0 | No | 9240 | 1 | 9240 | 1 |276+----+------------+---------+-----------+--------------------+----------------------+277 Missing Total Percent278Magazine 0 9240 0.0279280281282Newspaper283+----+-------------+---------+-------------+--------------------+----------------------+284| | Newspaper | Count | Percent | Cumulative Count | Cumulative Percent |285|----+-------------+---------+-------------+--------------------+----------------------|286| 0 | No | 9239 | 0.999892 | 9239 | 0.999892 |287| 1 | Yes | 1 | 0.000108225 | 9240 | 1 |288+----+-------------+---------+-------------+--------------------+----------------------+289 Missing Total Percent290Newspaper 0 9240 0.0291292293294Newspaper Article295+----+---------------------+---------+------------+--------------------+----------------------+296| | Newspaper Article | Count | Percent | Cumulative Count | Cumulative Percent |297|----+---------------------+---------+------------+--------------------+----------------------|298| 0 | No | 9238 | 0.999784 | 9238 | 0.999784 |299| 1 | Yes | 2 | 0.00021645 | 9240 | 1 |300+----+---------------------+---------+------------+--------------------+----------------------+301 Missing Total Percent302Newspaper Article 0 9240 0.0303304305306Receive More Updates About Our Courses307+----+------------------------------------------+---------+-----------+--------------------+----------------------+308| | Receive More Updates About Our Courses | Count | Percent | Cumulative Count | Cumulative Percent |309|----+------------------------------------------+---------+-----------+--------------------+----------------------|310| 0 | No | 9240 | 1 | 9240 | 1 |311+----+------------------------------------------+---------+-----------+--------------------+----------------------+312 Missing Total Percent313Receive More Updates About Our Courses 0 9240 0.0314315316317Search318+----+----------+---------+------------+--------------------+----------------------+319| | Search | Count | Percent | Cumulative Count | Cumulative Percent |320|----+----------+---------+------------+--------------------+----------------------|321| 0 | No | 9226 | 0.998485 | 9226 | 0.998485 |322| 1 | Yes | 14 | 0.00151515 | 9240 | 1 |323+----+----------+---------+------------+--------------------+----------------------+324 Missing Total Percent325Search 0 9240 0.0326327328329Specialization330+----+-----------------------------------+---------+------------+--------------------+----------------------+331| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |332|----+-----------------------------------+---------+------------+--------------------+----------------------|333| 0 | Select | 1942 | 0.248911 | 1942 | 0.248911 |334| 1 | Finance Management | 976 | 0.125096 | 2918 | 0.374007 |335| 2 | Human Resource Management | 848 | 0.10869 | 3766 | 0.482697 |336| 3 | Marketing Management | 838 | 0.107408 | 4604 | 0.590105 |337| 4 | Operations Management | 503 | 0.0644706 | 5107 | 0.654576 |338| 5 | Business Administration | 403 | 0.0516534 | 5510 | 0.706229 |339| 6 | IT Projects Management | 366 | 0.046911 | 5876 | 0.75314 |340| 7 | Supply Chain Management | 349 | 0.0447321 | 6225 | 0.797872 |341| 8 | Banking, Investment And Insurance | 338 | 0.0433222 | 6563 | 0.841195 |342| 9 | Travel and Tourism | 203 | 0.026019 | 6766 | 0.867214 |343| 10 | Media and Advertising | 203 | 0.026019 | 6969 | 0.893233 |344| 11 | International Business | 178 | 0.0228147 | 7147 | 0.916047 |345| 12 | Healthcare Management | 159 | 0.0203794 | 7306 | 0.936427 |346| 13 | Hospitality Management | 114 | 0.0146116 | 7420 | 0.951038 |347| 14 | E-COMMERCE | 112 | 0.0143553 | 7532 | 0.965393 |348| 15 | Retail Management | 100 | 0.0128172 | 7632 | 0.978211 |349| 16 | Rural and Agribusiness | 73 | 0.00935658 | 7705 | 0.987567 |350| 17 | E-Business | 57 | 0.00730582 | 7762 | 0.994873 |351| 18 | Services Excellence | 40 | 0.00512689 | 7802 | 1 |352+----+-----------------------------------+---------+------------+--------------------+----------------------+353 Missing Total Percent354Specialization 1438 9240 0.155628355356357358Tags359+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+360| | Tags | Count | Percent | Cumulative Count | Cumulative Percent |361|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|362| 0 | Will revert after reading the email | 2072 | 0.351962 | 2072 | 0.351962 |363| 1 | Ringing | 1203 | 0.204349 | 3275 | 0.556311 |364| 2 | Interested in other courses | 513 | 0.0871412 | 3788 | 0.643452 |365| 3 | Already a student | 465 | 0.0789876 | 4253 | 0.722439 |366| 4 | Closed by Horizzon | 358 | 0.060812 | 4611 | 0.783251 |367| 5 | switched off | 240 | 0.0407678 | 4851 | 0.824019 |368| 6 | Busy | 186 | 0.031595 | 5037 | 0.855614 |369| 7 | Lost to EINS | 175 | 0.0297265 | 5212 | 0.885341 |370| 8 | Not doing further education | 145 | 0.0246305 | 5357 | 0.909971 |371| 9 | Interested in full time MBA | 117 | 0.0198743 | 5474 | 0.929845 |372| 10 | Graduation in progress | 111 | 0.0188551 | 5585 | 0.948701 |373| 11 | invalid number | 83 | 0.0140989 | 5668 | 0.962799 |374| 12 | Diploma holder (Not Eligible) | 63 | 0.0107015 | 5731 | 0.973501 |375| 13 | wrong number given | 47 | 0.00798369 | 5778 | 0.981485 |376| 14 | opp hangup | 33 | 0.00560557 | 5811 | 0.98709 |377| 15 | number not provided | 27 | 0.00458638 | 5838 | 0.991677 |378| 16 | in touch with EINS | 12 | 0.00203839 | 5850 | 0.993715 |379| 17 | Lost to Others | 7 | 0.00118906 | 5857 | 0.994904 |380| 18 | Want to take admission but has financial problems | 6 | 0.00101919 | 5863 | 0.995923 |381| 19 | Still Thinking | 6 | 0.00101919 | 5869 | 0.996942 |382| 20 | Interested in Next batch | 5 | 0.000849329 | 5874 | 0.997792 |383| 21 | In confusion whether part time or DLP | 5 | 0.000849329 | 5879 | 0.998641 |384| 22 | Lateral student | 3 | 0.000509597 | 5882 | 0.999151 |385| 23 | University not recognized | 2 | 0.000339732 | 5884 | 0.99949 |386| 24 | Shall take in the next coming month | 2 | 0.000339732 | 5886 | 0.99983 |387| 25 | Recognition issue (DEC approval) | 1 | 0.000169866 | 5887 | 1 |388+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+389 Missing Total Percent390Tags 3353 9240 0.362879391392393394Through Recommendations395+----+---------------------------+---------+-------------+--------------------+----------------------+396| | Through Recommendations | Count | Percent | Cumulative Count | Cumulative Percent |397|----+---------------------------+---------+-------------+--------------------+----------------------|398| 0 | No | 9233 | 0.999242 | 9233 | 0.999242 |399| 1 | Yes | 7 | 0.000757576 | 9240 | 1 |400+----+---------------------------+---------+-------------+--------------------+----------------------+401 Missing Total Percent402Through Recommendations 0 9240 0.0403404405406Update me on Supply Chain Content407+----+-------------------------------------+---------+-----------+--------------------+----------------------+408| | Update me on Supply Chain Content | Count | Percent | Cumulative Count | Cumulative Percent |409|----+-------------------------------------+---------+-----------+--------------------+----------------------|410| 0 | No | 9240 | 1 | 9240 | 1 |411+----+-------------------------------------+---------+-----------+--------------------+----------------------+412 Missing Total Percent413Update me on Supply Chain Content 0 9240 0.0414415416417What is your current occupation418+----+-----------------------------------+---------+------------+--------------------+----------------------+419| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |420|----+-----------------------------------+---------+------------+--------------------+----------------------|421| 0 | Unemployed | 5600 | 0.854962 | 5600 | 0.854962 |422| 1 | Working Professional | 706 | 0.107786 | 6306 | 0.962748 |423| 2 | Student | 210 | 0.0320611 | 6516 | 0.994809 |424| 3 | Other | 16 | 0.00244275 | 6532 | 0.997252 |425| 4 | Housewife | 10 | 0.00152672 | 6542 | 0.998779 |426| 5 | Businessman | 8 | 0.00122137 | 6550 | 1 |427+----+-----------------------------------+---------+------------+--------------------+----------------------+428 Missing Total Percent429What is your current occupation 2690 9240 0.291126430431432433What matters most to you in choosing a course434+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+435| | What matters most to you in choosing a course | Count | Percent | Cumulative Count | Cumulative Percent |436|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|437| 0 | Better Career Prospects | 6528 | 0.999541 | 6528 | 0.999541 |438| 1 | Flexibility & Convenience | 2 | 0.000306232 | 6530 | 0.999847 |439| 2 | Other | 1 | 0.000153116 | 6531 | 1 |440+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+441 Missing Total Percent442What matters most to you in choosing a course 2709 9240 0.293182443444445446X Education Forums447+----+----------------------+---------+-------------+--------------------+----------------------+448| | X Education Forums | Count | Percent | Cumulative Count | Cumulative Percent |449|----+----------------------+---------+-------------+--------------------+----------------------|450| 0 | No | 9239 | 0.999892 | 9239 | 0.999892 |451| 1 | Yes | 1 | 0.000108225 | 9240 | 1 |452+----+----------------------+---------+-------------+--------------------+----------------------+453 Missing Total Percent454X Education Forums 0 9240 0.0
- The following columns have a label
Select
which is a disguised missing value. Select
is the default option in online forms and this value might mean that the lead hasn’t selected any option.- We shall replace them with
np.nan
.Specialization
Lead Profile
City
How did you hear about X Education
1# Replacing Select with NaN value2leads.replace({'Select' : np.nan},inplace=True)
1# Looking at Missing Values again23# Null values in each Column4nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])5nulls.columns = ['Null Percentage']67# Sorting null percentages in descending order and highlighting null % > 508nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=False).style.applymap(lambda x : 'color : red' if x > 50 else '')
Null Percentage | |
How did you hear about X Education | 78.463203 |
Lead Profile | 74.188312 |
City | 39.707792 |
Specialization | 36.580087 |
Tags | 36.287879 |
What matters most to you in choosing a course | 29.318182 |
What is your current occupation | 29.112554 |
Country | 26.634199 |
TotalVisits | 1.482684 |
Page Views Per Visit | 1.482684 |
Last Activity | 1.114719 |
Lead Source | 0.389610 |
Lead Profile
&How did you hear about X Education
have very high percentage of nulls. Let’s drop these columns
1leads.drop(columns=['Lead Profile','How did you hear about X Education'],inplace=True)
Imputation
1# Sorting null percentages in ascending order and highlighting null % < 162def lowNulls() :3 nulls = pd.DataFrame(100*leads.isnull().sum()/leads.shape[0])4 nulls.columns = ['Null Percentage']5 return nulls[nulls['Null Percentage'] !=0].sort_values(by ='Null Percentage', ascending=True).style.applymap(lambda x : 'color : green' if x < 16 else '')67lowNulls()
Null Percentage | |
Lead Source | 0.389610 |
Last Activity | 1.114719 |
TotalVisits | 1.482684 |
Page Views Per Visit | 1.482684 |
Country | 26.634199 |
What is your current occupation | 29.112554 |
What matters most to you in choosing a course | 29.318182 |
Tags | 36.287879 |
Specialization | 36.580087 |
City | 39.707792 |
‘Lead Source’,‘Last Activity’,‘TotalVisits’,‘Page Views Per Visit’ have less than 2% missing values. These rows could be dropped.
We could impute columns with higher missing values on a case by case basis.
Missing values are imputed by the metric most representative of the feature’s distribution.
For categorical features, missing values could be imputed by the most frequently occuring label i.e MODE value, since this is the most representative metric of a categorical feature.
For continuous features, if there are outliers, the most representative metric of the feature’s distribution is MEDIAN, else it is MEAN. Continuous feature imputations are thus dependent on presence of outliers.
About 26% of data in Country Column is missing.
1# Country Imputation :2tab(leads.stb.freq(['Country']))
1+----+----------------------+---------+-------------+--------------------+----------------------+2| | Country | Count | Percent | Cumulative Count | Cumulative Percent |3|----+----------------------+---------+-------------+--------------------+----------------------|4| 0 | India | 6492 | 0.957663 | 6492 | 0.957663 |5| 1 | United States | 69 | 0.0101785 | 6561 | 0.967842 |6| 2 | United Arab Emirates | 53 | 0.00781826 | 6614 | 0.97566 |7| 3 | Singapore | 24 | 0.00354035 | 6638 | 0.9792 |8| 4 | Saudi Arabia | 21 | 0.0030978 | 6659 | 0.982298 |9| 5 | United Kingdom | 15 | 0.00221272 | 6674 | 0.984511 |10| 6 | Australia | 13 | 0.00191769 | 6687 | 0.986429 |11| 7 | Qatar | 10 | 0.00147514 | 6697 | 0.987904 |12| 8 | Hong Kong | 7 | 0.0010326 | 6704 | 0.988936 |13| 9 | Bahrain | 7 | 0.0010326 | 6711 | 0.989969 |14| 10 | Oman | 6 | 0.000885086 | 6717 | 0.990854 |15| 11 | France | 6 | 0.000885086 | 6723 | 0.991739 |16| 12 | unknown | 5 | 0.000737572 | 6728 | 0.992477 |17| 13 | South Africa | 4 | 0.000590058 | 6732 | 0.993067 |18| 14 | Nigeria | 4 | 0.000590058 | 6736 | 0.993657 |19| 15 | Kuwait | 4 | 0.000590058 | 6740 | 0.994247 |20| 16 | Germany | 4 | 0.000590058 | 6744 | 0.994837 |21| 17 | Canada | 4 | 0.000590058 | 6748 | 0.995427 |22| 18 | Sweden | 3 | 0.000442543 | 6751 | 0.99587 |23| 19 | Uganda | 2 | 0.000295029 | 6753 | 0.996165 |24| 20 | Philippines | 2 | 0.000295029 | 6755 | 0.99646 |25| 21 | Netherlands | 2 | 0.000295029 | 6757 | 0.996755 |26| 22 | Italy | 2 | 0.000295029 | 6759 | 0.99705 |27| 23 | Ghana | 2 | 0.000295029 | 6761 | 0.997345 |28| 24 | China | 2 | 0.000295029 | 6763 | 0.99764 |29| 25 | Belgium | 2 | 0.000295029 | 6765 | 0.997935 |30| 26 | Bangladesh | 2 | 0.000295029 | 6767 | 0.99823 |31| 27 | Asia/Pacific Region | 2 | 0.000295029 | 6769 | 0.998525 |32| 28 | Vietnam | 1 | 0.000147514 | 6770 | 0.998672 |33| 29 | Tanzania | 1 | 0.000147514 | 6771 | 0.99882 |34| 30 | Switzerland | 1 | 0.000147514 | 6772 | 0.998967 |35| 31 | Sri Lanka | 1 | 0.000147514 | 6773 | 0.999115 |36| 32 | Russia | 1 | 0.000147514 | 6774 | 0.999262 |37| 33 | Malaysia | 1 | 0.000147514 | 6775 | 0.99941 |38| 34 | Liberia | 1 | 0.000147514 | 6776 | 0.999557 |39| 35 | Kenya | 1 | 0.000147514 | 6777 | 0.999705 |40| 36 | Indonesia | 1 | 0.000147514 | 6778 | 0.999852 |41| 37 | Denmark | 1 | 0.000147514 | 6779 | 1 |42+----+----------------------+---------+-------------+--------------------+----------------------+
- Since 95% of leads come from India, it is probable that missing values are from India.
1# Imputing missing values in Country feature with "India"2leads['Country'].fillna('India', inplace=True)
Specialization
feature has 36% of missing values.- Since there’s no one label that’s driving leads, replacement would mislead the analysis.
- Hence, we could impute missing values with a new label ‘No Specialization’
1# Imputing Null Values by filling it using "No Specialization".2leads['Specialization'].fillna("No Specialization",inplace=True)3print('Missing values in Specialization feature ', leads['Specialization'].isnull().sum())
1Missing values in Specialization feature 0
1leads['Specialization'].value_counts()
1No Specialization 33802Finance Management 9763Human Resource Management 8484Marketing Management 8385Operations Management 5036Business Administration 4037IT Projects Management 3668Supply Chain Management 3499Banking, Investment And Insurance 33810Travel and Tourism 20311Media and Advertising 20312International Business 17813Healthcare Management 15914Hospitality Management 11415E-COMMERCE 11216Retail Management 10017Rural and Agribusiness 7318E-Business 5719Services Excellence 4020Name: Specialization, dtype: int64
- About 39% of values in
City
column are missing
1# Imputation of missing cities2leads.stb.freq(['City'])
City | Count | Percent | Cumulative Count | Cumulative Percent | |
0 | Mumbai | 3222 | 0.578352 | 3222 | 0.578352 |
1 | Thane & Outskirts | 752 | 0.134985 | 3974 | 0.713337 |
2 | Other Cities | 686 | 0.123138 | 4660 | 0.836475 |
3 | Other Cities of Maharashtra | 457 | 0.082032 | 5117 | 0.918507 |
4 | Other Metro Cities | 380 | 0.068210 | 5497 | 0.986717 |
5 | Tier II Cities | 74 | 0.013283 | 5571 | 1.000000 |
1# Missing Cities vs Country2condition_india = leads['Country'] == 'India'3print('Total Missing City values :', leads['City'].isnull().sum())4print('Missing City values in leads from India : ',leads.loc[condition_india,'City'].isnull().sum())
1Total Missing City values : 36692Missing City values in leads from India : 3609
- Looks like 3609 out 3669 leads with Missing City label are from India.
- As can be seen from the value counts of
City
feature, 60% of the leads come from Mumbai. - Since, we could impute missing
City
value for leads from India withMumbai
1# Replacing Null Cities in India with Mumbai2condition = (leads['City'].isnull()) & condition_india3leads.loc[condition,'City'] = 'Mumbai'
- 29% of values in
What is your current occupation
column are missing. - Let’s look at the distribution of levels in this column
1tab(leads.stb.freq(['What is your current occupation']))2tab(leads['What is your current occupation'].reset_index().stb.missing())
1+----+-----------------------------------+---------+------------+--------------------+----------------------+2| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |3|----+-----------------------------------+---------+------------+--------------------+----------------------|4| 0 | Unemployed | 5600 | 0.854962 | 5600 | 0.854962 |5| 1 | Working Professional | 706 | 0.107786 | 6306 | 0.962748 |6| 2 | Student | 210 | 0.0320611 | 6516 | 0.994809 |7| 3 | Other | 16 | 0.00244275 | 6532 | 0.997252 |8| 4 | Housewife | 10 | 0.00152672 | 6542 | 0.998779 |9| 5 | Businessman | 8 | 0.00122137 | 6550 | 1 |10+----+-----------------------------------+---------+------------+--------------------+----------------------+11+---------------------------------+-----------+---------+-----------+12| | Missing | Total | Percent |13|---------------------------------+-----------+---------+-----------|14| What is your current occupation | 2690 | 9240 | 0.291126 |15| index | 0 | 9240 | 0 |16+---------------------------------+-----------+---------+-----------+
- Since, the business problem clearly says the company targets working professionals, this is an extremely important variable.
- So to keep the analysis unbiased, we could impute missing values with a new level for now.
- Let’s replace missing values in
What is your current occupation
with ‘Unknown Occupation’
1leads['What is your current occupation'].fillna('Unknown Occupation',inplace=True)
1# Missing Values in `What matters most to you in choosing a course`2ftr = 'What matters most to you in choosing a course'3tab(leads.stb.freq([ftr]))4tab(leads[ftr].reset_index().stb.missing())
1+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+2| | What matters most to you in choosing a course | Count | Percent | Cumulative Count | Cumulative Percent |3|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|4| 0 | Better Career Prospects | 6528 | 0.999541 | 6528 | 0.999541 |5| 1 | Flexibility & Convenience | 2 | 0.000306232 | 6530 | 0.999847 |6| 2 | Other | 1 | 0.000153116 | 6531 | 1 |7+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+8+-----------------------------------------------+-----------+---------+-----------+9| | Missing | Total | Percent |10|-----------------------------------------------+-----------+---------+-----------|11| What matters most to you in choosing a course | 2709 | 9240 | 0.293182 |12| index | 0 | 9240 | 0 |13+-----------------------------------------------+-----------+---------+-----------+
- Almost all leads ( >99%) show interest in the company’s offerings for
Better Career Prospects
, excluding leads who havent filled this feature. - Since this might be a very important feature from the analysis perspective, to keep the analysis unbiased, instead of imputation with an existing label, let’s impute with a new label for now.
- Let’s replace missing values in
What matters most to you in choosing a course
with ‘Unknown Target’ for now.
1# Filling Missing Values with `Unknown`2leads[ftr].fillna('Unknown Target',inplace=True)
Incorrect Labels
1# Looking at labels in each Categorical Variable to check for incorrect labels.2categoricalFeatures = leads.dtypes[leads.dtypes == 'object'].index.values3print('Categorical Features : ', categoricalFeatures,'\n\n')4for feature in categoricalFeatures :5 print('Levels in ',feature,' are ' , leads[feature].unique(),'\n\n')
1Categorical Features : ['Lead Origin' 'Lead Source' 'Do Not Email' 'Do Not Call' 'Last Activity'2 'Country' 'Specialization' 'What is your current occupation'3 'What matters most to you in choosing a course' 'Search' 'Magazine'4 'Newspaper Article' 'X Education Forums' 'Newspaper'5 'Digital Advertisement' 'Through Recommendations'6 'Receive More Updates About Our Courses' 'Tags'7 'Update me on Supply Chain Content' 'Get updates on DM Content' 'City'8 'I agree to pay the amount through cheque'9 'A free copy of Mastering The Interview' 'Last Notable Activity']101112Levels in Lead Origin are ['API' 'Landing Page Submission' 'Lead Add Form' 'Lead Import'13 'Quick Add Form']141516Levels in Lead Source are ['Olark Chat' 'Organic Search' 'Direct Traffic' 'Google' 'Referral Sites'17 'Welingak Website' 'Reference' 'google' 'Facebook' nan 'blog'18 'Pay per Click Ads' 'bing' 'Social Media' 'WeLearn' 'Click2call'19 'Live Chat' 'welearnblog_Home' 'youtubechannel' 'testone' 'Press_Release'20 'NC_EDM']212223Levels in Do Not Email are ['No' 'Yes']242526Levels in Do Not Call are ['No' 'Yes']272829Levels in Last Activity are ['Page Visited on Website' 'Email Opened' 'Unreachable'30 'Converted to Lead' 'Olark Chat Conversation' 'Email Bounced'31 'Email Link Clicked' 'Form Submitted on Website' 'Unsubscribed'32 'Had a Phone Conversation' 'View in browser link Clicked' nan33 'Approached upfront' 'SMS Sent' 'Visited Booth in Tradeshow'34 'Resubscribed to emails' 'Email Received' 'Email Marked Spam']353637Levels in Country are ['India' 'Russia' 'Kuwait' 'Oman' 'United Arab Emirates' 'United States'38 'Australia' 'United Kingdom' 'Bahrain' 'Ghana' 'Singapore' 'Qatar'39 'Saudi Arabia' 'Belgium' 'France' 'Sri Lanka' 'China' 'Canada'40 'Netherlands' 'Sweden' 'Nigeria' 'Hong Kong' 'Germany'41 'Asia/Pacific Region' 'Uganda' 'Kenya' 'Italy' 'South Africa' 'Tanzania'42 'unknown' 'Malaysia' 'Liberia' 'Switzerland' 'Denmark' 'Philippines'43 'Bangladesh' 'Vietnam' 'Indonesia']444546Levels in Specialization are ['No Specialization' 'Business Administration' 'Media and Advertising'47 'Supply Chain Management' 'IT Projects Management' 'Finance Management'48 'Travel and Tourism' 'Human Resource Management' 'Marketing Management'49 'Banking, Investment And Insurance' 'International Business' 'E-COMMERCE'50 'Operations Management' 'Retail Management' 'Services Excellence'51 'Hospitality Management' 'Rural and Agribusiness' 'Healthcare Management'52 'E-Business']535455Levels in What is your current occupation are ['Unemployed' 'Student' 'Unknown Occupation' 'Working Professional'56 'Businessman' 'Other' 'Housewife']575859Levels in What matters most to you in choosing a course are ['Better Career Prospects' 'Unknown Target' 'Flexibility & Convenience'60 'Other']616263Levels in Search are ['No' 'Yes']646566Levels in Magazine are ['No']676869Levels in Newspaper Article are ['No' 'Yes']707172Levels in X Education Forums are ['No' 'Yes']737475Levels in Newspaper are ['No' 'Yes']767778Levels in Digital Advertisement are ['No' 'Yes']798081Levels in Through Recommendations are ['No' 'Yes']828384Levels in Receive More Updates About Our Courses are ['No']858687Levels in Tags are ['Interested in other courses' 'Ringing'88 'Will revert after reading the email' nan 'Lost to EINS'89 'In confusion whether part time or DLP' 'Busy' 'switched off'90 'in touch with EINS' 'Already a student' 'Diploma holder (Not Eligible)'91 'Graduation in progress' 'Closed by Horizzon' 'number not provided'92 'opp hangup' 'Not doing further education' 'invalid number'93 'wrong number given' 'Interested in full time MBA' 'Still Thinking'94 'Lost to Others' 'Shall take in the next coming month' 'Lateral student'95 'Interested in Next batch' 'Recognition issue (DEC approval)'96 'Want to take admission but has financial problems'97 'University not recognized']9899100Levels in Update me on Supply Chain Content are ['No']101102103Levels in Get updates on DM Content are ['No']104105106Levels in City are ['Mumbai' 'Thane & Outskirts' 'Other Metro Cities' nan 'Other Cities'107 'Other Cities of Maharashtra' 'Tier II Cities']108109110Levels in I agree to pay the amount through cheque are ['No']111112113Levels in A free copy of Mastering The Interview are ['No' 'Yes']114115116Levels in Last Notable Activity are ['Modified' 'Email Opened' 'Page Visited on Website' 'Email Bounced'117 'Email Link Clicked' 'Unreachable' 'Unsubscribed'118 'Had a Phone Conversation' 'Olark Chat Conversation' 'SMS Sent'119 'Approached upfront' 'Resubscribed to emails'120 'View in browser link Clicked' 'Form Submitted on Website'121 'Email Received' 'Email Marked Spam']
- We can clearly see that Google is appearing twice in ‘Lead Source’- (Google,google)
1# Replacing 'google' with 'Google2leads['Lead Source']=leads['Lead Source'].str.replace("google","Google")
Cleaning Categorical Features
Dropping Unnecessary Columns
1# Missing Values and Value Counts for all categorical Variables2tab(leads.stb.missing())3print('Value Counts of each Feature : \n')4for feature in sorted(categoricalFeatures) :5 tab(leads.stb.freq([feature]))
1+-----------------------------------------------+-----------+---------+------------+2| | Missing | Total | Percent |3|-----------------------------------------------+-----------+---------+------------|4| Tags | 3353 | 9240 | 0.362879 |5| TotalVisits | 137 | 9240 | 0.0148268 |6| Page Views Per Visit | 137 | 9240 | 0.0148268 |7| Last Activity | 103 | 9240 | 0.0111472 |8| City | 60 | 9240 | 0.00649351 |9| Lead Source | 36 | 9240 | 0.0038961 |10| Lead Origin | 0 | 9240 | 0 |11| X Education Forums | 0 | 9240 | 0 |12| A free copy of Mastering The Interview | 0 | 9240 | 0 |13| I agree to pay the amount through cheque | 0 | 9240 | 0 |14| Get updates on DM Content | 0 | 9240 | 0 |15| Update me on Supply Chain Content | 0 | 9240 | 0 |16| Receive More Updates About Our Courses | 0 | 9240 | 0 |17| Through Recommendations | 0 | 9240 | 0 |18| Digital Advertisement | 0 | 9240 | 0 |19| Newspaper | 0 | 9240 | 0 |20| Magazine | 0 | 9240 | 0 |21| Newspaper Article | 0 | 9240 | 0 |22| Search | 0 | 9240 | 0 |23| What matters most to you in choosing a course | 0 | 9240 | 0 |24| What is your current occupation | 0 | 9240 | 0 |25| Specialization | 0 | 9240 | 0 |26| Country | 0 | 9240 | 0 |27| Total Time Spent on Website | 0 | 9240 | 0 |28| Converted | 0 | 9240 | 0 |29| Do Not Call | 0 | 9240 | 0 |30| Do Not Email | 0 | 9240 | 0 |31| Last Notable Activity | 0 | 9240 | 0 |32+-----------------------------------------------+-----------+---------+------------+33Value Counts of each Feature :3435+----+------------------------------------------+---------+-----------+--------------------+----------------------+36| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |37|----+------------------------------------------+---------+-----------+--------------------+----------------------|38| 0 | No | 6352 | 0.687446 | 6352 | 0.687446 |39| 1 | Yes | 2888 | 0.312554 | 9240 | 1 |40+----+------------------------------------------+---------+-----------+--------------------+----------------------+41+----+-----------------------------+---------+-----------+--------------------+----------------------+42| | City | Count | Percent | Cumulative Count | Cumulative Percent |43|----+-----------------------------+---------+-----------+--------------------+----------------------|44| 0 | Mumbai | 6831 | 0.744118 | 6831 | 0.744118 |45| 1 | Thane & Outskirts | 752 | 0.0819172 | 7583 | 0.826035 |46| 2 | Other Cities | 686 | 0.0747277 | 8269 | 0.900763 |47| 3 | Other Cities of Maharashtra | 457 | 0.0497821 | 8726 | 0.950545 |48| 4 | Other Metro Cities | 380 | 0.0413943 | 9106 | 0.991939 |49| 5 | Tier II Cities | 74 | 0.008061 | 9180 | 1 |50+----+-----------------------------+---------+-----------+--------------------+----------------------+51+----+----------------------+---------+-------------+--------------------+----------------------+52| | Country | Count | Percent | Cumulative Count | Cumulative Percent |53|----+----------------------+---------+-------------+--------------------+----------------------|54| 0 | India | 8953 | 0.968939 | 8953 | 0.968939 |55| 1 | United States | 69 | 0.00746753 | 9022 | 0.976407 |56| 2 | United Arab Emirates | 53 | 0.00573593 | 9075 | 0.982143 |57| 3 | Singapore | 24 | 0.0025974 | 9099 | 0.98474 |58| 4 | Saudi Arabia | 21 | 0.00227273 | 9120 | 0.987013 |59| 5 | United Kingdom | 15 | 0.00162338 | 9135 | 0.988636 |60| 6 | Australia | 13 | 0.00140693 | 9148 | 0.990043 |61| 7 | Qatar | 10 | 0.00108225 | 9158 | 0.991126 |62| 8 | Hong Kong | 7 | 0.000757576 | 9165 | 0.991883 |63| 9 | Bahrain | 7 | 0.000757576 | 9172 | 0.992641 |64| 10 | Oman | 6 | 0.000649351 | 9178 | 0.99329 |65| 11 | France | 6 | 0.000649351 | 9184 | 0.993939 |66| 12 | unknown | 5 | 0.000541126 | 9189 | 0.994481 |67| 13 | South Africa | 4 | 0.0004329 | 9193 | 0.994913 |68| 14 | Nigeria | 4 | 0.0004329 | 9197 | 0.995346 |69| 15 | Kuwait | 4 | 0.0004329 | 9201 | 0.995779 |70| 16 | Germany | 4 | 0.0004329 | 9205 | 0.996212 |71| 17 | Canada | 4 | 0.0004329 | 9209 | 0.996645 |72| 18 | Sweden | 3 | 0.000324675 | 9212 | 0.99697 |73| 19 | Uganda | 2 | 0.00021645 | 9214 | 0.997186 |74| 20 | Philippines | 2 | 0.00021645 | 9216 | 0.997403 |75| 21 | Netherlands | 2 | 0.00021645 | 9218 | 0.997619 |76| 22 | Italy | 2 | 0.00021645 | 9220 | 0.997835 |77| 23 | Ghana | 2 | 0.00021645 | 9222 | 0.998052 |78| 24 | China | 2 | 0.00021645 | 9224 | 0.998268 |79| 25 | Belgium | 2 | 0.00021645 | 9226 | 0.998485 |80| 26 | Bangladesh | 2 | 0.00021645 | 9228 | 0.998701 |81| 27 | Asia/Pacific Region | 2 | 0.00021645 | 9230 | 0.998918 |82| 28 | Vietnam | 1 | 0.000108225 | 9231 | 0.999026 |83| 29 | Tanzania | 1 | 0.000108225 | 9232 | 0.999134 |84| 30 | Switzerland | 1 | 0.000108225 | 9233 | 0.999242 |85| 31 | Sri Lanka | 1 | 0.000108225 | 9234 | 0.999351 |86| 32 | Russia | 1 | 0.000108225 | 9235 | 0.999459 |87| 33 | Malaysia | 1 | 0.000108225 | 9236 | 0.999567 |88| 34 | Liberia | 1 | 0.000108225 | 9237 | 0.999675 |89| 35 | Kenya | 1 | 0.000108225 | 9238 | 0.999784 |90| 36 | Indonesia | 1 | 0.000108225 | 9239 | 0.999892 |91| 37 | Denmark | 1 | 0.000108225 | 9240 | 1 |92+----+----------------------+---------+-------------+--------------------+----------------------+93+----+-------------------------+---------+-----------+--------------------+----------------------+94| | Digital Advertisement | Count | Percent | Cumulative Count | Cumulative Percent |95|----+-------------------------+---------+-----------+--------------------+----------------------|96| 0 | No | 9236 | 0.999567 | 9236 | 0.999567 |97| 1 | Yes | 4 | 0.0004329 | 9240 | 1 |98+----+-------------------------+---------+-----------+--------------------+----------------------+99+----+---------------+---------+------------+--------------------+----------------------+100| | Do Not Call | Count | Percent | Cumulative Count | Cumulative Percent |101|----+---------------+---------+------------+--------------------+----------------------|102| 0 | No | 9238 | 0.999784 | 9238 | 0.999784 |103| 1 | Yes | 2 | 0.00021645 | 9240 | 1 |104+----+---------------+---------+------------+--------------------+----------------------+105+----+----------------+---------+-----------+--------------------+----------------------+106| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |107|----+----------------+---------+-----------+--------------------+----------------------|108| 0 | No | 8506 | 0.920563 | 8506 | 0.920563 |109| 1 | Yes | 734 | 0.0794372 | 9240 | 1 |110+----+----------------+---------+-----------+--------------------+----------------------+111+----+-----------------------------+---------+-----------+--------------------+----------------------+112| | Get updates on DM Content | Count | Percent | Cumulative Count | Cumulative Percent |113|----+-----------------------------+---------+-----------+--------------------+----------------------|114| 0 | No | 9240 | 1 | 9240 | 1 |115+----+-----------------------------+---------+-----------+--------------------+----------------------+116+----+--------------------------------------------+---------+-----------+--------------------+----------------------+117| | I agree to pay the amount through cheque | Count | Percent | Cumulative Count | Cumulative Percent |118|----+--------------------------------------------+---------+-----------+--------------------+----------------------|119| 0 | No | 9240 | 1 | 9240 | 1 |120+----+--------------------------------------------+---------+-----------+--------------------+----------------------+121+----+------------------------------+---------+-------------+--------------------+----------------------+122| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |123|----+------------------------------+---------+-------------+--------------------+----------------------|124| 0 | Email Opened | 3437 | 0.376163 | 3437 | 0.376163 |125| 1 | SMS Sent | 2745 | 0.300427 | 6182 | 0.67659 |126| 2 | Olark Chat Conversation | 973 | 0.10649 | 7155 | 0.78308 |127| 3 | Page Visited on Website | 640 | 0.0700449 | 7795 | 0.853125 |128| 4 | Converted to Lead | 428 | 0.0468425 | 8223 | 0.899967 |129| 5 | Email Bounced | 326 | 0.0356791 | 8549 | 0.935646 |130| 6 | Email Link Clicked | 267 | 0.0292218 | 8816 | 0.964868 |131| 7 | Form Submitted on Website | 116 | 0.0126956 | 8932 | 0.977564 |132| 8 | Unreachable | 93 | 0.0101784 | 9025 | 0.987742 |133| 9 | Unsubscribed | 61 | 0.00667615 | 9086 | 0.994418 |134| 10 | Had a Phone Conversation | 30 | 0.00328335 | 9116 | 0.997702 |135| 11 | Approached upfront | 9 | 0.000985006 | 9125 | 0.998687 |136| 12 | View in browser link Clicked | 6 | 0.000656671 | 9131 | 0.999343 |137| 13 | Email Received | 2 | 0.00021889 | 9133 | 0.999562 |138| 14 | Email Marked Spam | 2 | 0.00021889 | 9135 | 0.999781 |139| 15 | Visited Booth in Tradeshow | 1 | 0.000109445 | 9136 | 0.999891 |140| 16 | Resubscribed to emails | 1 | 0.000109445 | 9137 | 1 |141+----+------------------------------+---------+-------------+--------------------+----------------------+142+----+------------------------------+---------+-------------+--------------------+----------------------+143| | Last Notable Activity | Count | Percent | Cumulative Count | Cumulative Percent |144|----+------------------------------+---------+-------------+--------------------+----------------------|145| 0 | Modified | 3407 | 0.368723 | 3407 | 0.368723 |146| 1 | Email Opened | 2827 | 0.305952 | 6234 | 0.674675 |147| 2 | SMS Sent | 2172 | 0.235065 | 8406 | 0.90974 |148| 3 | Page Visited on Website | 318 | 0.0344156 | 8724 | 0.944156 |149| 4 | Olark Chat Conversation | 183 | 0.0198052 | 8907 | 0.963961 |150| 5 | Email Link Clicked | 173 | 0.0187229 | 9080 | 0.982684 |151| 6 | Email Bounced | 60 | 0.00649351 | 9140 | 0.989177 |152| 7 | Unsubscribed | 47 | 0.00508658 | 9187 | 0.994264 |153| 8 | Unreachable | 32 | 0.0034632 | 9219 | 0.997727 |154| 9 | Had a Phone Conversation | 14 | 0.00151515 | 9233 | 0.999242 |155| 10 | Email Marked Spam | 2 | 0.00021645 | 9235 | 0.999459 |156| 11 | View in browser link Clicked | 1 | 0.000108225 | 9236 | 0.999567 |157| 12 | Resubscribed to emails | 1 | 0.000108225 | 9237 | 0.999675 |158| 13 | Form Submitted on Website | 1 | 0.000108225 | 9238 | 0.999784 |159| 14 | Email Received | 1 | 0.000108225 | 9239 | 0.999892 |160| 15 | Approached upfront | 1 | 0.000108225 | 9240 | 1 |161+----+------------------------------+---------+-------------+--------------------+----------------------+162+----+-------------------------+---------+-------------+--------------------+----------------------+163| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |164|----+-------------------------+---------+-------------+--------------------+----------------------|165| 0 | Landing Page Submission | 4886 | 0.528788 | 4886 | 0.528788 |166| 1 | API | 3580 | 0.387446 | 8466 | 0.916234 |167| 2 | Lead Add Form | 718 | 0.0777056 | 9184 | 0.993939 |168| 3 | Lead Import | 55 | 0.00595238 | 9239 | 0.999892 |169| 4 | Quick Add Form | 1 | 0.000108225 | 9240 | 1 |170+----+-------------------------+---------+-------------+--------------------+----------------------+171+----+-------------------+---------+-------------+--------------------+----------------------+172| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |173|----+-------------------+---------+-------------+--------------------+----------------------|174| 0 | Google | 2873 | 0.312147 | 2873 | 0.312147 |175| 1 | Direct Traffic | 2543 | 0.276293 | 5416 | 0.58844 |176| 2 | Olark Chat | 1755 | 0.190678 | 7171 | 0.779118 |177| 3 | Organic Search | 1154 | 0.12538 | 8325 | 0.904498 |178| 4 | Reference | 534 | 0.0580183 | 8859 | 0.962516 |179| 5 | Welingak Website | 142 | 0.0154281 | 9001 | 0.977944 |180| 6 | Referral Sites | 125 | 0.0135811 | 9126 | 0.991525 |181| 7 | Facebook | 55 | 0.00597566 | 9181 | 0.997501 |182| 8 | bing | 6 | 0.00065189 | 9187 | 0.998153 |183| 9 | Click2call | 4 | 0.000434594 | 9191 | 0.998588 |184| 10 | Social Media | 2 | 0.000217297 | 9193 | 0.998805 |185| 11 | Press_Release | 2 | 0.000217297 | 9195 | 0.999022 |186| 12 | Live Chat | 2 | 0.000217297 | 9197 | 0.999239 |187| 13 | youtubechannel | 1 | 0.000108648 | 9198 | 0.999348 |188| 14 | welearnblog_Home | 1 | 0.000108648 | 9199 | 0.999457 |189| 15 | testone | 1 | 0.000108648 | 9200 | 0.999565 |190| 16 | blog | 1 | 0.000108648 | 9201 | 0.999674 |191| 17 | WeLearn | 1 | 0.000108648 | 9202 | 0.999783 |192| 18 | Pay per Click Ads | 1 | 0.000108648 | 9203 | 0.999891 |193| 19 | NC_EDM | 1 | 0.000108648 | 9204 | 1 |194+----+-------------------+---------+-------------+--------------------+----------------------+195+----+------------+---------+-----------+--------------------+----------------------+196| | Magazine | Count | Percent | Cumulative Count | Cumulative Percent |197|----+------------+---------+-----------+--------------------+----------------------|198| 0 | No | 9240 | 1 | 9240 | 1 |199+----+------------+---------+-----------+--------------------+----------------------+200+----+-------------+---------+-------------+--------------------+----------------------+201| | Newspaper | Count | Percent | Cumulative Count | Cumulative Percent |202|----+-------------+---------+-------------+--------------------+----------------------|203| 0 | No | 9239 | 0.999892 | 9239 | 0.999892 |204| 1 | Yes | 1 | 0.000108225 | 9240 | 1 |205+----+-------------+---------+-------------+--------------------+----------------------+206+----+---------------------+---------+------------+--------------------+----------------------+207| | Newspaper Article | Count | Percent | Cumulative Count | Cumulative Percent |208|----+---------------------+---------+------------+--------------------+----------------------|209| 0 | No | 9238 | 0.999784 | 9238 | 0.999784 |210| 1 | Yes | 2 | 0.00021645 | 9240 | 1 |211+----+---------------------+---------+------------+--------------------+----------------------+212+----+------------------------------------------+---------+-----------+--------------------+----------------------+213| | Receive More Updates About Our Courses | Count | Percent | Cumulative Count | Cumulative Percent |214|----+------------------------------------------+---------+-----------+--------------------+----------------------|215| 0 | No | 9240 | 1 | 9240 | 1 |216+----+------------------------------------------+---------+-----------+--------------------+----------------------+217+----+----------+---------+------------+--------------------+----------------------+218| | Search | Count | Percent | Cumulative Count | Cumulative Percent |219|----+----------+---------+------------+--------------------+----------------------|220| 0 | No | 9226 | 0.998485 | 9226 | 0.998485 |221| 1 | Yes | 14 | 0.00151515 | 9240 | 1 |222+----+----------+---------+------------+--------------------+----------------------+223+----+-----------------------------------+---------+------------+--------------------+----------------------+224| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |225|----+-----------------------------------+---------+------------+--------------------+----------------------|226| 0 | No Specialization | 3380 | 0.365801 | 3380 | 0.365801 |227| 1 | Finance Management | 976 | 0.105628 | 4356 | 0.471429 |228| 2 | Human Resource Management | 848 | 0.0917749 | 5204 | 0.563203 |229| 3 | Marketing Management | 838 | 0.0906926 | 6042 | 0.653896 |230| 4 | Operations Management | 503 | 0.0544372 | 6545 | 0.708333 |231| 5 | Business Administration | 403 | 0.0436147 | 6948 | 0.751948 |232| 6 | IT Projects Management | 366 | 0.0396104 | 7314 | 0.791558 |233| 7 | Supply Chain Management | 349 | 0.0377706 | 7663 | 0.829329 |234| 8 | Banking, Investment And Insurance | 338 | 0.0365801 | 8001 | 0.865909 |235| 9 | Travel and Tourism | 203 | 0.0219697 | 8204 | 0.887879 |236| 10 | Media and Advertising | 203 | 0.0219697 | 8407 | 0.909848 |237| 11 | International Business | 178 | 0.0192641 | 8585 | 0.929113 |238| 12 | Healthcare Management | 159 | 0.0172078 | 8744 | 0.94632 |239| 13 | Hospitality Management | 114 | 0.0123377 | 8858 | 0.958658 |240| 14 | E-COMMERCE | 112 | 0.0121212 | 8970 | 0.970779 |241| 15 | Retail Management | 100 | 0.0108225 | 9070 | 0.981602 |242| 16 | Rural and Agribusiness | 73 | 0.00790043 | 9143 | 0.989502 |243| 17 | E-Business | 57 | 0.00616883 | 9200 | 0.995671 |244| 18 | Services Excellence | 40 | 0.004329 | 9240 | 1 |245+----+-----------------------------------+---------+------------+--------------------+----------------------+246+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+247| | Tags | Count | Percent | Cumulative Count | Cumulative Percent |248|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|249| 0 | Will revert after reading the email | 2072 | 0.351962 | 2072 | 0.351962 |250| 1 | Ringing | 1203 | 0.204349 | 3275 | 0.556311 |251| 2 | Interested in other courses | 513 | 0.0871412 | 3788 | 0.643452 |252| 3 | Already a student | 465 | 0.0789876 | 4253 | 0.722439 |253| 4 | Closed by Horizzon | 358 | 0.060812 | 4611 | 0.783251 |254| 5 | switched off | 240 | 0.0407678 | 4851 | 0.824019 |255| 6 | Busy | 186 | 0.031595 | 5037 | 0.855614 |256| 7 | Lost to EINS | 175 | 0.0297265 | 5212 | 0.885341 |257| 8 | Not doing further education | 145 | 0.0246305 | 5357 | 0.909971 |258| 9 | Interested in full time MBA | 117 | 0.0198743 | 5474 | 0.929845 |259| 10 | Graduation in progress | 111 | 0.0188551 | 5585 | 0.948701 |260| 11 | invalid number | 83 | 0.0140989 | 5668 | 0.962799 |261| 12 | Diploma holder (Not Eligible) | 63 | 0.0107015 | 5731 | 0.973501 |262| 13 | wrong number given | 47 | 0.00798369 | 5778 | 0.981485 |263| 14 | opp hangup | 33 | 0.00560557 | 5811 | 0.98709 |264| 15 | number not provided | 27 | 0.00458638 | 5838 | 0.991677 |265| 16 | in touch with EINS | 12 | 0.00203839 | 5850 | 0.993715 |266| 17 | Lost to Others | 7 | 0.00118906 | 5857 | 0.994904 |267| 18 | Want to take admission but has financial problems | 6 | 0.00101919 | 5863 | 0.995923 |268| 19 | Still Thinking | 6 | 0.00101919 | 5869 | 0.996942 |269| 20 | Interested in Next batch | 5 | 0.000849329 | 5874 | 0.997792 |270| 21 | In confusion whether part time or DLP | 5 | 0.000849329 | 5879 | 0.998641 |271| 22 | Lateral student | 3 | 0.000509597 | 5882 | 0.999151 |272| 23 | University not recognized | 2 | 0.000339732 | 5884 | 0.99949 |273| 24 | Shall take in the next coming month | 2 | 0.000339732 | 5886 | 0.99983 |274| 25 | Recognition issue (DEC approval) | 1 | 0.000169866 | 5887 | 1 |275+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+276+----+---------------------------+---------+-------------+--------------------+----------------------+277| | Through Recommendations | Count | Percent | Cumulative Count | Cumulative Percent |278|----+---------------------------+---------+-------------+--------------------+----------------------|279| 0 | No | 9233 | 0.999242 | 9233 | 0.999242 |280| 1 | Yes | 7 | 0.000757576 | 9240 | 1 |281+----+---------------------------+---------+-------------+--------------------+----------------------+282+----+-------------------------------------+---------+-----------+--------------------+----------------------+283| | Update me on Supply Chain Content | Count | Percent | Cumulative Count | Cumulative Percent |284|----+-------------------------------------+---------+-----------+--------------------+----------------------|285| 0 | No | 9240 | 1 | 9240 | 1 |286+----+-------------------------------------+---------+-----------+--------------------+----------------------+287+----+-----------------------------------+---------+-------------+--------------------+----------------------+288| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |289|----+-----------------------------------+---------+-------------+--------------------+----------------------|290| 0 | Unemployed | 5600 | 0.606061 | 5600 | 0.606061 |291| 1 | Unknown Occupation | 2690 | 0.291126 | 8290 | 0.897186 |292| 2 | Working Professional | 706 | 0.0764069 | 8996 | 0.973593 |293| 3 | Student | 210 | 0.0227273 | 9206 | 0.99632 |294| 4 | Other | 16 | 0.0017316 | 9222 | 0.998052 |295| 5 | Housewife | 10 | 0.00108225 | 9232 | 0.999134 |296| 6 | Businessman | 8 | 0.000865801 | 9240 | 1 |297+----+-----------------------------------+---------+-------------+--------------------+----------------------+298+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+299| | What matters most to you in choosing a course | Count | Percent | Cumulative Count | Cumulative Percent |300|----+-------------------------------------------------+---------+-------------+--------------------+----------------------|301| 0 | Better Career Prospects | 6528 | 0.706494 | 6528 | 0.706494 |302| 1 | Unknown Target | 2709 | 0.293182 | 9237 | 0.999675 |303| 2 | Flexibility & Convenience | 2 | 0.00021645 | 9239 | 0.999892 |304| 3 | Other | 1 | 0.000108225 | 9240 | 1 |305+----+-------------------------------------------------+---------+-------------+--------------------+----------------------+306+----+----------------------+---------+-------------+--------------------+----------------------+307| | X Education Forums | Count | Percent | Cumulative Count | Cumulative Percent |308|----+----------------------+---------+-------------+--------------------+----------------------|309| 0 | No | 9239 | 0.999892 | 9239 | 0.999892 |310| 1 | Yes | 1 | 0.000108225 | 9240 | 1 |311+----+----------------------+---------+-------------+--------------------+----------------------+
- Let’s look for columns have more than 99% leads have the same level
- Such variables do not explain any variability.
- Hence, these columns are unnecessary to the analysis . They could be dropped
1# Dropping columns having only one label - since these donot explain any variability in the dataset23invariableCol = ['Digital Advertisement','Do Not Call','Get updates on DM Content','Magazine','Newspaper','Newspaper Article','Receive More Updates About Our Courses','Search',4'Update me on Supply Chain Content','Through Recommendations',5'I agree to pay the amount through cheque',"What matters most to you in choosing a course",'X Education Forums']6leads.drop(columns=invariableCol, inplace=True)
1# Tags feature2tab(leads.stb.freq(['Tags']))
1+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+2| | Tags | Count | Percent | Cumulative Count | Cumulative Percent |3|----+---------------------------------------------------+---------+-------------+--------------------+----------------------|4| 0 | Will revert after reading the email | 2072 | 0.351962 | 2072 | 0.351962 |5| 1 | Ringing | 1203 | 0.204349 | 3275 | 0.556311 |6| 2 | Interested in other courses | 513 | 0.0871412 | 3788 | 0.643452 |7| 3 | Already a student | 465 | 0.0789876 | 4253 | 0.722439 |8| 4 | Closed by Horizzon | 358 | 0.060812 | 4611 | 0.783251 |9| 5 | switched off | 240 | 0.0407678 | 4851 | 0.824019 |10| 6 | Busy | 186 | 0.031595 | 5037 | 0.855614 |11| 7 | Lost to EINS | 175 | 0.0297265 | 5212 | 0.885341 |12| 8 | Not doing further education | 145 | 0.0246305 | 5357 | 0.909971 |13| 9 | Interested in full time MBA | 117 | 0.0198743 | 5474 | 0.929845 |14| 10 | Graduation in progress | 111 | 0.0188551 | 5585 | 0.948701 |15| 11 | invalid number | 83 | 0.0140989 | 5668 | 0.962799 |16| 12 | Diploma holder (Not Eligible) | 63 | 0.0107015 | 5731 | 0.973501 |17| 13 | wrong number given | 47 | 0.00798369 | 5778 | 0.981485 |18| 14 | opp hangup | 33 | 0.00560557 | 5811 | 0.98709 |19| 15 | number not provided | 27 | 0.00458638 | 5838 | 0.991677 |20| 16 | in touch with EINS | 12 | 0.00203839 | 5850 | 0.993715 |21| 17 | Lost to Others | 7 | 0.00118906 | 5857 | 0.994904 |22| 18 | Want to take admission but has financial problems | 6 | 0.00101919 | 5863 | 0.995923 |23| 19 | Still Thinking | 6 | 0.00101919 | 5869 | 0.996942 |24| 20 | Interested in Next batch | 5 | 0.000849329 | 5874 | 0.997792 |25| 21 | In confusion whether part time or DLP | 5 | 0.000849329 | 5879 | 0.998641 |26| 22 | Lateral student | 3 | 0.000509597 | 5882 | 0.999151 |27| 23 | University not recognized | 2 | 0.000339732 | 5884 | 0.99949 |28| 24 | Shall take in the next coming month | 2 | 0.000339732 | 5886 | 0.99983 |29| 25 | Recognition issue (DEC approval) | 1 | 0.000169866 | 5887 | 1 |30+----+---------------------------------------------------+---------+-------------+--------------------+----------------------+
Tags
column shows remarks by the sales team. This a subjective variable based on judgement of the team and cannot be used for analysis since the lables might change or might not always be available.- Also, it has a lot of levels that don’t seem like mutually exclusive classes
- Let’s drop this feature for this analysis.
1# dropping Tags feature2leads.drop(columns=['Tags'], inplace=True)
Last Notable Activity
&Last Activity
seem to have similar levels- Lets look at the possibility of dropping one of them
1# Last Notable Activity vs Last Activity2leads_copy = leads.copy()3leads_copy['Converted'] = leads_copy['Converted'].astype('int')4tab(leads_copy.stb.freq(['Last Notable Activity'],value='Converted'))5tab(leads_copy.stb.freq(['Last Activity'],value='Converted'))
1+----+--------------------------+-------------+------------+------------------------+----------------------+2| | Last Notable Activity | Converted | Percent | Cumulative Converted | Cumulative Percent |3|----+--------------------------+-------------+------------+------------------------+----------------------|4| 0 | SMS Sent | 1508 | 0.423477 | 1508 | 0.423477 |5| 1 | Email Opened | 1044 | 0.293176 | 2552 | 0.716653 |6| 2 | Modified | 783 | 0.219882 | 3335 | 0.936535 |7| 3 | Page Visited on Website | 93 | 0.0261163 | 3428 | 0.962651 |8| 4 | Email Link Clicked | 45 | 0.0126369 | 3473 | 0.975288 |9| 5 | Olark Chat Conversation | 25 | 0.0070205 | 3498 | 0.982308 |10| 6 | Unreachable | 22 | 0.00617804 | 3520 | 0.988486 |11| 7 | Unsubscribed | 14 | 0.00393148 | 3534 | 0.992418 |12| 8 | Had a Phone Conversation | 13 | 0.00365066 | 3547 | 0.996069 |13| 9 | Email Bounced | 9 | 0.00252738 | 3556 | 0.998596 |14| 10 | Email Marked Spam | 2 | 0.00056164 | 3558 | 0.999158 |15| 11 | Resubscribed to emails | 1 | 0.00028082 | 3559 | 0.999438 |16| 12 | Email Received | 1 | 0.00028082 | 3560 | 0.999719 |17| 13 | Approached upfront | 1 | 0.00028082 | 3561 | 1 |18+----+--------------------------+-------------+------------+------------------------+----------------------+19+----+------------------------------+-------------+-------------+------------------------+----------------------+20| | Last Activity | Converted | Percent | Cumulative Converted | Cumulative Percent |21|----+------------------------------+-------------+-------------+------------------------+----------------------|22| 0 | SMS Sent | 1727 | 0.496264 | 1727 | 0.496264 |23| 1 | Email Opened | 1253 | 0.360057 | 2980 | 0.856322 |24| 2 | Page Visited on Website | 151 | 0.0433908 | 3131 | 0.899713 |25| 3 | Olark Chat Conversation | 84 | 0.0241379 | 3215 | 0.923851 |26| 4 | Email Link Clicked | 73 | 0.020977 | 3288 | 0.944828 |27| 5 | Converted to Lead | 54 | 0.0155172 | 3342 | 0.960345 |28| 6 | Unreachable | 31 | 0.00890805 | 3373 | 0.969253 |29| 7 | Form Submitted on Website | 28 | 0.00804598 | 3401 | 0.977299 |30| 8 | Email Bounced | 26 | 0.00747126 | 3427 | 0.98477 |31| 9 | Had a Phone Conversation | 22 | 0.00632184 | 3449 | 0.991092 |32| 10 | Unsubscribed | 16 | 0.0045977 | 3465 | 0.99569 |33| 11 | Approached upfront | 9 | 0.00258621 | 3474 | 0.998276 |34| 12 | Email Received | 2 | 0.000574713 | 3476 | 0.998851 |35| 13 | Email Marked Spam | 2 | 0.000574713 | 3478 | 0.999425 |36| 14 | View in browser link Clicked | 1 | 0.000287356 | 3479 | 0.999713 |37| 15 | Resubscribed to emails | 1 | 0.000287356 | 3480 | 1 |38+----+------------------------------+-------------+-------------+------------------------+----------------------+
Last Activity
has more levels compared toLast Notable Activity
Last Notable Activity
seems like a column derived by the sales team usingLast Activity
.- Since this insight might not be available for a new lead, let’d drop
Last Notable Activity
.
1leads.drop(columns = ['Last Notable Activity'], inplace=True)
1# Looking at Missing Values again2tab(leads.stb.missing())
1+----------------------------------------+-----------+---------+------------+2| | Missing | Total | Percent |3|----------------------------------------+-----------+---------+------------|4| TotalVisits | 137 | 9240 | 0.0148268 |5| Page Views Per Visit | 137 | 9240 | 0.0148268 |6| Last Activity | 103 | 9240 | 0.0111472 |7| City | 60 | 9240 | 0.00649351 |8| Lead Source | 36 | 9240 | 0.0038961 |9| Lead Origin | 0 | 9240 | 0 |10| Do Not Email | 0 | 9240 | 0 |11| Converted | 0 | 9240 | 0 |12| Total Time Spent on Website | 0 | 9240 | 0 |13| Country | 0 | 9240 | 0 |14| Specialization | 0 | 9240 | 0 |15| What is your current occupation | 0 | 9240 | 0 |16| A free copy of Mastering The Interview | 0 | 9240 | 0 |17+----------------------------------------+-----------+---------+------------+
- From the above, the number of missing values is less than 2%. These are deemed missing completely at random. And these rows could be dropped without affecting the analysis
1leads.dropna(inplace=True)2tab(leads.stb.missing())
1+----------------------------------------+-----------+---------+-----------+2| | Missing | Total | Percent |3|----------------------------------------+-----------+---------+-----------|4| Lead Origin | 0 | 9014 | 0 |5| Lead Source | 0 | 9014 | 0 |6| Do Not Email | 0 | 9014 | 0 |7| Converted | 0 | 9014 | 0 |8| TotalVisits | 0 | 9014 | 0 |9| Total Time Spent on Website | 0 | 9014 | 0 |10| Page Views Per Visit | 0 | 9014 | 0 |11| Last Activity | 0 | 9014 | 0 |12| Country | 0 | 9014 | 0 |13| Specialization | 0 | 9014 | 0 |14| What is your current occupation | 0 | 9014 | 0 |15| City | 0 | 9014 | 0 |16| A free copy of Mastering The Interview | 0 | 9014 | 0 |17+----------------------------------------+-----------+---------+-----------+
Grouping Labels with less leads
Country
1# Country distribution2tab(leads.stb.freq(['Country']))
1+----+----------------------+---------+-------------+--------------------+----------------------+2| | Country | Count | Percent | Cumulative Count | Cumulative Percent |3|----+----------------------+---------+-------------+--------------------+----------------------|4| 0 | India | 8787 | 0.974817 | 8787 | 0.974817 |5| 1 | United States | 51 | 0.00565787 | 8838 | 0.980475 |6| 2 | United Arab Emirates | 44 | 0.0048813 | 8882 | 0.985356 |7| 3 | Saudi Arabia | 21 | 0.00232971 | 8903 | 0.987686 |8| 4 | Singapore | 17 | 0.00188596 | 8920 | 0.989572 |9| 5 | United Kingdom | 12 | 0.00133126 | 8932 | 0.990903 |10| 6 | Australia | 11 | 0.00122032 | 8943 | 0.992123 |11| 7 | Qatar | 8 | 0.000887508 | 8951 | 0.993011 |12| 8 | Bahrain | 7 | 0.00077657 | 8958 | 0.993787 |13| 9 | Hong Kong | 6 | 0.000665631 | 8964 | 0.994453 |14| 10 | France | 6 | 0.000665631 | 8970 | 0.995119 |15| 11 | Oman | 5 | 0.000554693 | 8975 | 0.995673 |16| 12 | Nigeria | 4 | 0.000443754 | 8979 | 0.996117 |17| 13 | Kuwait | 4 | 0.000443754 | 8983 | 0.996561 |18| 14 | Germany | 4 | 0.000443754 | 8987 | 0.997005 |19| 15 | South Africa | 3 | 0.000332816 | 8990 | 0.997337 |20| 16 | Canada | 3 | 0.000332816 | 8993 | 0.99767 |21| 17 | Philippines | 2 | 0.000221877 | 8995 | 0.997892 |22| 18 | Netherlands | 2 | 0.000221877 | 8997 | 0.998114 |23| 19 | Belgium | 2 | 0.000221877 | 8999 | 0.998336 |24| 20 | Bangladesh | 2 | 0.000221877 | 9001 | 0.998558 |25| 21 | Vietnam | 1 | 0.000110939 | 9002 | 0.998669 |26| 22 | Uganda | 1 | 0.000110939 | 9003 | 0.99878 |27| 23 | Tanzania | 1 | 0.000110939 | 9004 | 0.998891 |28| 24 | Switzerland | 1 | 0.000110939 | 9005 | 0.999002 |29| 25 | Sweden | 1 | 0.000110939 | 9006 | 0.999112 |30| 26 | Malaysia | 1 | 0.000110939 | 9007 | 0.999223 |31| 27 | Liberia | 1 | 0.000110939 | 9008 | 0.999334 |32| 28 | Kenya | 1 | 0.000110939 | 9009 | 0.999445 |33| 29 | Italy | 1 | 0.000110939 | 9010 | 0.999556 |34| 30 | Indonesia | 1 | 0.000110939 | 9011 | 0.999667 |35| 31 | Ghana | 1 | 0.000110939 | 9012 | 0.999778 |36| 32 | Denmark | 1 | 0.000110939 | 9013 | 0.999889 |37| 33 | China | 1 | 0.000110939 | 9014 | 1 |38+----+----------------------+---------+-------------+--------------------+----------------------+
- We see that leads from India make 97% of all leads. And others collectively make up 3% and the contribution of each of these countries is <= 1%.
- To reduce the levels, let us group the minority labels into a new label called ‘Outside India’
1# Grouping Countries with very low lead count into 'Outside India'23leadsByCountry = leads['Country'].value_counts(normalize=True)4lowLeadCountries = leadsByCountry[leadsByCountry <= 0.01].index56leads['Country'].replace(lowLeadCountries,'Outside India',inplace=True)7tab(leads.stb.freq(['Country']))
1+----+---------------+---------+-----------+--------------------+----------------------+2| | Country | Count | Percent | Cumulative Count | Cumulative Percent |3|----+---------------+---------+-----------+--------------------+----------------------|4| 0 | India | 8787 | 0.974817 | 8787 | 0.974817 |5| 1 | Outside India | 227 | 0.025183 | 9014 | 1 |6+----+---------------+---------+-----------+--------------------+----------------------+
Lead Origin
1feature = 'Lead Origin'2tab(leads.stb.freq([feature]))
1+----+-------------------------+---------+------------+--------------------+----------------------+2| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |3|----+-------------------------+---------+------------+--------------------+----------------------|4| 0 | Landing Page Submission | 4870 | 0.540271 | 4870 | 0.540271 |5| 1 | API | 3533 | 0.391946 | 8403 | 0.932217 |6| 2 | Lead Add Form | 581 | 0.0644553 | 8984 | 0.996672 |7| 3 | Lead Import | 30 | 0.00332816 | 9014 | 1 |8+----+-------------------------+---------+------------+--------------------+----------------------+
- We see that lead origins like
Lead Add Form
,Lead Import
are less than 1% of all origins. - Let’s group them into a level called ‘Other Lead Origins’
1# Grouping lead origins2leadOriginsToGroup = ["Lead Add Form","Lead Import"]3leads[feature] = leads[feature].replace(leadOriginsToGroup, ['Other Lead Origins']*2)4tab(leads.stb.freq([feature]))
1+----+-------------------------+---------+-----------+--------------------+----------------------+2| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |3|----+-------------------------+---------+-----------+--------------------+----------------------|4| 0 | Landing Page Submission | 4870 | 0.540271 | 4870 | 0.540271 |5| 1 | API | 3533 | 0.391946 | 8403 | 0.932217 |6| 2 | Other Lead Origins | 611 | 0.0677834 | 9014 | 1 |7+----+-------------------------+---------+-----------+--------------------+----------------------+
Lead Source
1feature = 'Lead Source'2tab(leads.stb.freq([feature]))
1+----+-------------------+---------+-------------+--------------------+----------------------+2| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |3|----+-------------------+---------+-------------+--------------------+----------------------|4| 0 | Google | 2857 | 0.316951 | 2857 | 0.316951 |5| 1 | Direct Traffic | 2528 | 0.280453 | 5385 | 0.597404 |6| 2 | Olark Chat | 1753 | 0.194475 | 7138 | 0.791879 |7| 3 | Organic Search | 1131 | 0.125471 | 8269 | 0.917351 |8| 4 | Reference | 443 | 0.0491458 | 8712 | 0.966497 |9| 5 | Welingak Website | 129 | 0.0143111 | 8841 | 0.980808 |10| 6 | Referral Sites | 120 | 0.0133126 | 8961 | 0.99412 |11| 7 | Facebook | 31 | 0.00343909 | 8992 | 0.997559 |12| 8 | bing | 6 | 0.000665631 | 8998 | 0.998225 |13| 9 | Click2call | 4 | 0.000443754 | 9002 | 0.998669 |14| 10 | Social Media | 2 | 0.000221877 | 9004 | 0.998891 |15| 11 | Press_Release | 2 | 0.000221877 | 9006 | 0.999112 |16| 12 | Live Chat | 2 | 0.000221877 | 9008 | 0.999334 |17| 13 | welearnblog_Home | 1 | 0.000110939 | 9009 | 0.999445 |18| 14 | testone | 1 | 0.000110939 | 9010 | 0.999556 |19| 15 | blog | 1 | 0.000110939 | 9011 | 0.999667 |20| 16 | WeLearn | 1 | 0.000110939 | 9012 | 0.999778 |21| 17 | Pay per Click Ads | 1 | 0.000110939 | 9013 | 0.999889 |22| 18 | NC_EDM | 1 | 0.000110939 | 9014 | 1 |23+----+-------------------+---------+-------------+--------------------+----------------------+
- Lead sources from #7 to #18 contribute to less than 1% of all sources.
- Let’s group these into a new label called ‘Other Lead Sources’
1# Grouping lead Sources2labelCounts = leads[feature].value_counts(normalize=True)34# labels with less than 1% contribution5labelsToGroup = labelCounts[labelCounts < 0.01].index.values67leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature+'s']*len(labelsToGroup))89tab(leads.stb.freq([feature]))
1+----+--------------------+---------+------------+--------------------+----------------------+2| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |3|----+--------------------+---------+------------+--------------------+----------------------|4| 0 | Google | 2857 | 0.316951 | 2857 | 0.316951 |5| 1 | Direct Traffic | 2528 | 0.280453 | 5385 | 0.597404 |6| 2 | Olark Chat | 1753 | 0.194475 | 7138 | 0.791879 |7| 3 | Organic Search | 1131 | 0.125471 | 8269 | 0.917351 |8| 4 | Reference | 443 | 0.0491458 | 8712 | 0.966497 |9| 5 | Welingak Website | 129 | 0.0143111 | 8841 | 0.980808 |10| 6 | Referral Sites | 120 | 0.0133126 | 8961 | 0.99412 |11| 7 | Other Lead Sources | 53 | 0.00587974 | 9014 | 1 |12+----+--------------------+---------+------------+--------------------+----------------------+
Last Activity
1feature = 'Last Activity'2tab(leads.stb.freq([feature]))
1+----+------------------------------+---------+-------------+--------------------+----------------------+2| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |3|----+------------------------------+---------+-------------+--------------------+----------------------|4| 0 | Email Opened | 3417 | 0.379077 | 3417 | 0.379077 |5| 1 | SMS Sent | 2701 | 0.299645 | 6118 | 0.678722 |6| 2 | Olark Chat Conversation | 963 | 0.106834 | 7081 | 0.785556 |7| 3 | Page Visited on Website | 637 | 0.0706679 | 7718 | 0.856224 |8| 4 | Converted to Lead | 422 | 0.0468161 | 8140 | 0.90304 |9| 5 | Email Bounced | 304 | 0.0337253 | 8444 | 0.936765 |10| 6 | Email Link Clicked | 266 | 0.0295097 | 8710 | 0.966275 |11| 7 | Form Submitted on Website | 115 | 0.0127579 | 8825 | 0.979033 |12| 8 | Unreachable | 89 | 0.00987353 | 8914 | 0.988906 |13| 9 | Unsubscribed | 58 | 0.00643444 | 8972 | 0.995341 |14| 10 | Had a Phone Conversation | 25 | 0.00277346 | 8997 | 0.998114 |15| 11 | View in browser link Clicked | 6 | 0.000665631 | 9003 | 0.99878 |16| 12 | Approached upfront | 5 | 0.000554693 | 9008 | 0.999334 |17| 13 | Email Received | 2 | 0.000221877 | 9010 | 0.999556 |18| 14 | Email Marked Spam | 2 | 0.000221877 | 9012 | 0.999778 |19| 15 | Visited Booth in Tradeshow | 1 | 0.000110939 | 9013 | 0.999889 |20| 16 | Resubscribed to emails | 1 | 0.000110939 | 9014 | 1 |21+----+------------------------------+---------+-------------+--------------------+----------------------+
- Leads from #9 to #16 contribute to less than 1% of all last activity labels.Moreover, each of these labels contributes to less than 1% of leads.
- Let’s group these into a new label called ‘Other Last Activity’
1# Grouping Last Activity2labelCounts = leads[feature].value_counts(normalize=True)34# labels with less than 2% contribution5labelsToGroup = labelCounts[labelCounts < 0.01].index.values67leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))89tab(leads.stb.freq([feature]))
1+----+---------------------------+---------+-----------+--------------------+----------------------+2| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |3|----+---------------------------+---------+-----------+--------------------+----------------------|4| 0 | Email Opened | 3417 | 0.379077 | 3417 | 0.379077 |5| 1 | SMS Sent | 2701 | 0.299645 | 6118 | 0.678722 |6| 2 | Olark Chat Conversation | 963 | 0.106834 | 7081 | 0.785556 |7| 3 | Page Visited on Website | 637 | 0.0706679 | 7718 | 0.856224 |8| 4 | Converted to Lead | 422 | 0.0468161 | 8140 | 0.90304 |9| 5 | Email Bounced | 304 | 0.0337253 | 8444 | 0.936765 |10| 6 | Email Link Clicked | 266 | 0.0295097 | 8710 | 0.966275 |11| 7 | Other Last Activity | 189 | 0.0209674 | 8899 | 0.987242 |12| 8 | Form Submitted on Website | 115 | 0.0127579 | 9014 | 1 |13+----+---------------------------+---------+-----------+--------------------+----------------------+
Specialization
1feature = 'Specialization'2tab(leads.stb.freq([feature]))
1+----+-----------------------------------+---------+------------+--------------------+----------------------+2| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |3|----+-----------------------------------+---------+------------+--------------------+----------------------|4| 0 | No Specialization | 3230 | 0.358331 | 3230 | 0.358331 |5| 1 | Finance Management | 959 | 0.10639 | 4189 | 0.464722 |6| 2 | Human Resource Management | 836 | 0.0927446 | 5025 | 0.557466 |7| 3 | Marketing Management | 822 | 0.0911915 | 5847 | 0.648658 |8| 4 | Operations Management | 498 | 0.0552474 | 6345 | 0.703905 |9| 5 | Business Administration | 397 | 0.0440426 | 6742 | 0.747948 |10| 6 | IT Projects Management | 366 | 0.0406035 | 7108 | 0.788551 |11| 7 | Supply Chain Management | 344 | 0.0381629 | 7452 | 0.826714 |12| 8 | Banking, Investment And Insurance | 335 | 0.0371644 | 7787 | 0.863878 |13| 9 | Travel and Tourism | 202 | 0.0224096 | 7989 | 0.886288 |14| 10 | Media and Advertising | 202 | 0.0224096 | 8191 | 0.908698 |15| 11 | International Business | 176 | 0.0195252 | 8367 | 0.928223 |16| 12 | Healthcare Management | 156 | 0.0173064 | 8523 | 0.945529 |17| 13 | E-COMMERCE | 111 | 0.0123142 | 8634 | 0.957843 |18| 14 | Hospitality Management | 110 | 0.0122032 | 8744 | 0.970047 |19| 15 | Retail Management | 100 | 0.0110939 | 8844 | 0.98114 |20| 16 | Rural and Agribusiness | 73 | 0.00809851 | 8917 | 0.989239 |21| 17 | E-Business | 57 | 0.0063235 | 8974 | 0.995562 |22| 18 | Services Excellence | 40 | 0.00443754 | 9014 | 1 |23+----+-----------------------------------+---------+------------+--------------------+----------------------+
- Lead from from #16 to #18 contribute to less than 2% of all Specialization categories. Moreover, each of these categories contributes to less than 1% of leads.
- Let’s group these into a new label called ‘Other Specializations’
1# Grouping Last Activity2labelCounts = leads[feature].value_counts(normalize=True)34# labels with less than 2% contribution5labelsToGroup = labelCounts[labelCounts <=0.012121].index.values67leads[feature] = leads[feature].replace(labelsToGroup, ['Other '+feature]*len(labelsToGroup))89tab(leads.stb.freq([feature]))
1+----+-----------------------------------+---------+-----------+--------------------+----------------------+2| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |3|----+-----------------------------------+---------+-----------+--------------------+----------------------|4| 0 | No Specialization | 3230 | 0.358331 | 3230 | 0.358331 |5| 1 | Finance Management | 959 | 0.10639 | 4189 | 0.464722 |6| 2 | Human Resource Management | 836 | 0.0927446 | 5025 | 0.557466 |7| 3 | Marketing Management | 822 | 0.0911915 | 5847 | 0.648658 |8| 4 | Operations Management | 498 | 0.0552474 | 6345 | 0.703905 |9| 5 | Business Administration | 397 | 0.0440426 | 6742 | 0.747948 |10| 6 | IT Projects Management | 366 | 0.0406035 | 7108 | 0.788551 |11| 7 | Supply Chain Management | 344 | 0.0381629 | 7452 | 0.826714 |12| 8 | Banking, Investment And Insurance | 335 | 0.0371644 | 7787 | 0.863878 |13| 9 | Other Specialization | 270 | 0.0299534 | 8057 | 0.893832 |14| 10 | Travel and Tourism | 202 | 0.0224096 | 8259 | 0.916241 |15| 11 | Media and Advertising | 202 | 0.0224096 | 8461 | 0.938651 |16| 12 | International Business | 176 | 0.0195252 | 8637 | 0.958176 |17| 13 | Healthcare Management | 156 | 0.0173064 | 8793 | 0.975483 |18| 14 | E-COMMERCE | 111 | 0.0123142 | 8904 | 0.987797 |19| 15 | Hospitality Management | 110 | 0.0122032 | 9014 | 1 |20+----+-----------------------------------+---------+-----------+--------------------+----------------------+
- Cleaning tasks have been completed. No more missing values exist
Retained Data
1# Columns retained2print('Retained Columns\n\n', leads.columns.values)
1Retained Columns23 ['Lead Origin' 'Lead Source' 'Do Not Email' 'Converted' 'TotalVisits'4 'Total Time Spent on Website' 'Page Views Per Visit' 'Last Activity'5 'Country' 'Specialization' 'What is your current occupation' 'City'6 'A free copy of Mastering The Interview']
1# Retained rows2print('Retained rows : ',leads.shape[0])3print("Ratio of retained rows", 100*leads.shape[0]/9240)
1Retained rows : 90142Ratio of retained rows 97.55411255411255
Data Imbalance
1tab(leads.stb.freq(['Converted']))
1+----+-------------+---------+-----------+--------------------+----------------------+2| | Converted | Count | Percent | Cumulative Count | Cumulative Percent |3|----+-------------+---------+-----------+--------------------+----------------------|4| 0 | 0 | 5595 | 0.620701 | 5595 | 0.620701 |5| 1 | 1 | 3419 | 0.379299 | 9014 | 1 |6+----+-------------+---------+-----------+--------------------+----------------------+
1converted_cond = leads['Converted'] == 12imbalance = leads[converted_cond].shape[0]/leads[~converted_cond].shape[0]3print('Class Imbalance : Converted /Un-converted =', np.round(imbalance,3))
1Class Imbalance : Converted /Un-converted = 0.611
- From the above, you can see that this data set contains 37% of converted leads and 62% of un-converted leads.
- Ratio of classes = 0.6
- The dataset is skewed towards ‘unconverted leads’
Univariate Analysis
1def categoricalUAn(column,figsize=[8,8]) :23 ''' Function for categorical univariate analysis '''4 print('Types of ' + column)5 tab(leads.stb.freq([column]))67 converted = leads[leads['Converted'] == 1]8 unconverted = leads[leads['Converted'] == 0]910 print(column + ' for Converted Leads')1112 tab(converted.stb.freq([column]))1314 print(column + ' for Un-Converted Leads')1516 tab(unconverted.stb.freq([column]))1718 print(column + ' vs Conversion Rate')1920 tab((converted[column].value_counts()) / (converted[column].value_counts() + unconverted[column].value_counts()))2122 # bar plot23 plt.figure(figsize=figsize)24 ax = sns.countplot(y=column,hue='Converted',data=leads)25 title = column + ' vs Lead Conversion'26 ax.set(title= title)
Lead Origin
1column = 'Lead Origin'2categoricalUAn(column,figsize=[8,8])
1Types of Lead Origin2+----+-------------------------+---------+-----------+--------------------+----------------------+3| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |4|----+-------------------------+---------+-----------+--------------------+----------------------|5| 0 | Landing Page Submission | 4870 | 0.540271 | 4870 | 0.540271 |6| 1 | API | 3533 | 0.391946 | 8403 | 0.932217 |7| 2 | Other Lead Origins | 611 | 0.0677834 | 9014 | 1 |8+----+-------------------------+---------+-----------+--------------------+----------------------+9Lead Origin for Converted Leads10+----+-------------------------+---------+-----------+--------------------+----------------------+11| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |12|----+-------------------------+---------+-----------+--------------------+----------------------|13| 0 | Landing Page Submission | 1765 | 0.516233 | 1765 | 0.516233 |14| 1 | API | 1101 | 0.322024 | 2866 | 0.838257 |15| 2 | Other Lead Origins | 553 | 0.161743 | 3419 | 1 |16+----+-------------------------+---------+-----------+--------------------+----------------------+17Lead Origin for Un-Converted Leads18+----+-------------------------+---------+-----------+--------------------+----------------------+19| | Lead Origin | Count | Percent | Cumulative Count | Cumulative Percent |20|----+-------------------------+---------+-----------+--------------------+----------------------|21| 0 | Landing Page Submission | 3105 | 0.55496 | 3105 | 0.55496 |22| 1 | API | 2432 | 0.434674 | 5537 | 0.989634 |23| 2 | Other Lead Origins | 58 | 0.0103664 | 5595 | 1 |24+----+-------------------------+---------+-----------+--------------------+----------------------+25Lead Origin vs Conversion Rate26+-------------------------+---------------+27| | Lead Origin |28|-------------------------+---------------|29| Landing Page Submission | 0.362423 |30| API | 0.311633 |31| Other Lead Origins | 0.905074 |32+-------------------------+---------------+
- Leads from
Landing Page Submission
followed byAPI
make up 93% of all leads. - But it is interesting that 8.3% of leads coming from other sources have the highest conversion rate of 87.5%
Lead Source
1column = 'Lead Source'2categoricalUAn(column,figsize=[8,8])
1Types of Lead Source2+----+--------------------+---------+------------+--------------------+----------------------+3| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |4|----+--------------------+---------+------------+--------------------+----------------------|5| 0 | Google | 2857 | 0.316951 | 2857 | 0.316951 |6| 1 | Direct Traffic | 2528 | 0.280453 | 5385 | 0.597404 |7| 2 | Olark Chat | 1753 | 0.194475 | 7138 | 0.791879 |8| 3 | Organic Search | 1131 | 0.125471 | 8269 | 0.917351 |9| 4 | Reference | 443 | 0.0491458 | 8712 | 0.966497 |10| 5 | Welingak Website | 129 | 0.0143111 | 8841 | 0.980808 |11| 6 | Referral Sites | 120 | 0.0133126 | 8961 | 0.99412 |12| 7 | Other Lead Sources | 53 | 0.00587974 | 9014 | 1 |13+----+--------------------+---------+------------+--------------------+----------------------+14Lead Source for Converted Leads15+----+--------------------+---------+------------+--------------------+----------------------+16| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |17|----+--------------------+---------+------------+--------------------+----------------------|18| 0 | Google | 1142 | 0.334016 | 1142 | 0.334016 |19| 1 | Direct Traffic | 815 | 0.238374 | 1957 | 0.57239 |20| 2 | Olark Chat | 448 | 0.131032 | 2405 | 0.703422 |21| 3 | Organic Search | 428 | 0.125183 | 2833 | 0.828605 |22| 4 | Reference | 410 | 0.119918 | 3243 | 0.948523 |23| 5 | Welingak Website | 127 | 0.0371454 | 3370 | 0.985668 |24| 6 | Referral Sites | 31 | 0.00906698 | 3401 | 0.994735 |25| 7 | Other Lead Sources | 18 | 0.0052647 | 3419 | 1 |26+----+--------------------+---------+------------+--------------------+----------------------+27Lead Source for Un-Converted Leads28+----+--------------------+---------+-------------+--------------------+----------------------+29| | Lead Source | Count | Percent | Cumulative Count | Cumulative Percent |30|----+--------------------+---------+-------------+--------------------+----------------------|31| 0 | Google | 1715 | 0.306524 | 1715 | 0.306524 |32| 1 | Direct Traffic | 1713 | 0.306166 | 3428 | 0.61269 |33| 2 | Olark Chat | 1305 | 0.233244 | 4733 | 0.845934 |34| 3 | Organic Search | 703 | 0.125648 | 5436 | 0.971582 |35| 4 | Referral Sites | 89 | 0.0159071 | 5525 | 0.987489 |36| 5 | Other Lead Sources | 35 | 0.00625559 | 5560 | 0.993744 |37| 6 | Reference | 33 | 0.00589812 | 5593 | 0.999643 |38| 7 | Welingak Website | 2 | 0.000357462 | 5595 | 1 |39+----+--------------------+---------+-------------+--------------------+----------------------+40Lead Source vs Conversion Rate41+--------------------+---------------+42| | Lead Source |43|--------------------+---------------|44| Direct Traffic | 0.322389 |45| Google | 0.39972 |46| Olark Chat | 0.255562 |47| Organic Search | 0.378426 |48| Other Lead Sources | 0.339623 |49| Reference | 0.925508 |50| Referral Sites | 0.258333 |51| Welingak Website | 0.984496 |52+--------------------+---------------+
- Most leads that get converted come from
Google
(31%), followed byDirect Traffic
(28%) andOlark Chat
(19%) - And leads through
Reference
have a very high conversion rate (91%)
Do not Email
1feature = 'Do Not Email'2categoricalUAn(feature,figsize=[8,8])
1Types of Do Not Email2+----+----------------+---------+-----------+--------------------+----------------------+3| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |4|----+----------------+---------+-----------+--------------------+----------------------|5| 0 | No | 8311 | 0.92201 | 8311 | 0.92201 |6| 1 | Yes | 703 | 0.0779898 | 9014 | 1 |7+----+----------------+---------+-----------+--------------------+----------------------+8Do Not Email for Converted Leads9+----+----------------+---------+-----------+--------------------+----------------------+10| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |11|----+----------------+---------+-----------+--------------------+----------------------|12| 0 | No | 3315 | 0.969582 | 3315 | 0.969582 |13| 1 | Yes | 104 | 0.0304183 | 3419 | 1 |14+----+----------------+---------+-----------+--------------------+----------------------+15Do Not Email for Un-Converted Leads16+----+----------------+---------+-----------+--------------------+----------------------+17| | Do Not Email | Count | Percent | Cumulative Count | Cumulative Percent |18|----+----------------+---------+-----------+--------------------+----------------------|19| 0 | No | 4996 | 0.89294 | 4996 | 0.89294 |20| 1 | Yes | 599 | 0.10706 | 5595 | 1 |21+----+----------------+---------+-----------+--------------------+----------------------+22Do Not Email vs Conversion Rate23+-----+----------------+24| | Do Not Email |25|-----+----------------|26| No | 0.398869 |27| Yes | 0.147937 |28+-----+----------------+
- 92% of leads prefer to be sent Emails about the company.
Do not Email = No
- And these are the most converted customers (40%)
Last Activity
1# 'Last Activity'2feature = 'Last Activity'3categoricalUAn(feature,figsize=[8,8])
1Types of Last Activity2+----+---------------------------+---------+-----------+--------------------+----------------------+3| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |4|----+---------------------------+---------+-----------+--------------------+----------------------|5| 0 | Email Opened | 3417 | 0.379077 | 3417 | 0.379077 |6| 1 | SMS Sent | 2701 | 0.299645 | 6118 | 0.678722 |7| 2 | Olark Chat Conversation | 963 | 0.106834 | 7081 | 0.785556 |8| 3 | Page Visited on Website | 637 | 0.0706679 | 7718 | 0.856224 |9| 4 | Converted to Lead | 422 | 0.0468161 | 8140 | 0.90304 |10| 5 | Email Bounced | 304 | 0.0337253 | 8444 | 0.936765 |11| 6 | Email Link Clicked | 266 | 0.0295097 | 8710 | 0.966275 |12| 7 | Other Last Activity | 189 | 0.0209674 | 8899 | 0.987242 |13| 8 | Form Submitted on Website | 115 | 0.0127579 | 9014 | 1 |14+----+---------------------------+---------+-----------+--------------------+----------------------+15Last Activity for Converted Leads16+----+---------------------------+---------+------------+--------------------+----------------------+17| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |18|----+---------------------------+---------+------------+--------------------+----------------------|19| 0 | SMS Sent | 1697 | 0.496344 | 1697 | 0.496344 |20| 1 | Email Opened | 1246 | 0.364434 | 2943 | 0.860778 |21| 2 | Page Visited on Website | 150 | 0.0438725 | 3093 | 0.90465 |22| 3 | Olark Chat Conversation | 84 | 0.0245686 | 3177 | 0.929219 |23| 4 | Other Last Activity | 74 | 0.0216438 | 3251 | 0.950863 |24| 5 | Email Link Clicked | 72 | 0.0210588 | 3323 | 0.971922 |25| 6 | Converted to Lead | 53 | 0.0155016 | 3376 | 0.987423 |26| 7 | Form Submitted on Website | 27 | 0.00789705 | 3403 | 0.99532 |27| 8 | Email Bounced | 16 | 0.00467973 | 3419 | 1 |28+----+---------------------------+---------+------------+--------------------+----------------------+29Last Activity for Un-Converted Leads30+----+---------------------------+---------+-----------+--------------------+----------------------+31| | Last Activity | Count | Percent | Cumulative Count | Cumulative Percent |32|----+---------------------------+---------+-----------+--------------------+----------------------|33| 0 | Email Opened | 2171 | 0.388025 | 2171 | 0.388025 |34| 1 | SMS Sent | 1004 | 0.179446 | 3175 | 0.567471 |35| 2 | Olark Chat Conversation | 879 | 0.157105 | 4054 | 0.724576 |36| 3 | Page Visited on Website | 487 | 0.087042 | 4541 | 0.811618 |37| 4 | Converted to Lead | 369 | 0.0659517 | 4910 | 0.877569 |38| 5 | Email Bounced | 288 | 0.0514745 | 5198 | 0.929044 |39| 6 | Email Link Clicked | 194 | 0.0346738 | 5392 | 0.963718 |40| 7 | Other Last Activity | 115 | 0.0205541 | 5507 | 0.984272 |41| 8 | Form Submitted on Website | 88 | 0.0157283 | 5595 | 1 |42+----+---------------------------+---------+-----------+--------------------+----------------------+43Last Activity vs Conversion Rate44+---------------------------+-----------------+45| | Last Activity |46|---------------------------+-----------------|47| Converted to Lead | 0.125592 |48| Email Bounced | 0.0526316 |49| Email Link Clicked | 0.270677 |50| Email Opened | 0.364647 |51| Form Submitted on Website | 0.234783 |52| Olark Chat Conversation | 0.0872274 |53| Other Last Activity | 0.391534 |54| Page Visited on Website | 0.235479 |55| SMS Sent | 0.628286 |56+---------------------------+-----------------+
- Most leads open emails sent to them (38%) and that’s their last activity.
- Among those leads who’s last activity is opening emails, 37% are converted.
- Only 4% of last activity indicators show
Converted to Lead
- Last activiy as ‘SMS Sent’ has highest conversion rate (62%).
- Last activiy as ‘Email Bounced’ has lowest conversion rate (7.9%).
Country
1feature = 'Country'2categoricalUAn(feature,figsize=[8,8])
1Types of Country2+----+---------------+---------+-----------+--------------------+----------------------+3| | Country | Count | Percent | Cumulative Count | Cumulative Percent |4|----+---------------+---------+-----------+--------------------+----------------------|5| 0 | India | 8787 | 0.974817 | 8787 | 0.974817 |6| 1 | Outside India | 227 | 0.025183 | 9014 | 1 |7+----+---------------+---------+-----------+--------------------+----------------------+8Country for Converted Leads9+----+---------------+---------+-----------+--------------------+----------------------+10| | Country | Count | Percent | Cumulative Count | Cumulative Percent |11|----+---------------+---------+-----------+--------------------+----------------------|12| 0 | India | 3351 | 0.980111 | 3351 | 0.980111 |13| 1 | Outside India | 68 | 0.0198889 | 3419 | 1 |14+----+---------------+---------+-----------+--------------------+----------------------+15Country for Un-Converted Leads16+----+---------------+---------+-----------+--------------------+----------------------+17| | Country | Count | Percent | Cumulative Count | Cumulative Percent |18|----+---------------+---------+-----------+--------------------+----------------------|19| 0 | India | 5436 | 0.971582 | 5436 | 0.971582 |20| 1 | Outside India | 159 | 0.0284182 | 5595 | 1 |21+----+---------------+---------+-----------+--------------------+----------------------+22Country vs Conversion Rate23+---------------+-----------+24| | Country |25|---------------+-----------|26| India | 0.381359 |27| Outside India | 0.299559 |28+---------------+-----------+
- Most leads come from India (97%)
- Out of these 38% are converted.
Specialization
1feature = 'Specialization'2categoricalUAn(feature)
1Types of Specialization2+----+-----------------------------------+---------+-----------+--------------------+----------------------+3| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |4|----+-----------------------------------+---------+-----------+--------------------+----------------------|5| 0 | No Specialization | 3230 | 0.358331 | 3230 | 0.358331 |6| 1 | Finance Management | 959 | 0.10639 | 4189 | 0.464722 |7| 2 | Human Resource Management | 836 | 0.0927446 | 5025 | 0.557466 |8| 3 | Marketing Management | 822 | 0.0911915 | 5847 | 0.648658 |9| 4 | Operations Management | 498 | 0.0552474 | 6345 | 0.703905 |10| 5 | Business Administration | 397 | 0.0440426 | 6742 | 0.747948 |11| 6 | IT Projects Management | 366 | 0.0406035 | 7108 | 0.788551 |12| 7 | Supply Chain Management | 344 | 0.0381629 | 7452 | 0.826714 |13| 8 | Banking, Investment And Insurance | 335 | 0.0371644 | 7787 | 0.863878 |14| 9 | Other Specialization | 270 | 0.0299534 | 8057 | 0.893832 |15| 10 | Travel and Tourism | 202 | 0.0224096 | 8259 | 0.916241 |16| 11 | Media and Advertising | 202 | 0.0224096 | 8461 | 0.938651 |17| 12 | International Business | 176 | 0.0195252 | 8637 | 0.958176 |18| 13 | Healthcare Management | 156 | 0.0173064 | 8793 | 0.975483 |19| 14 | E-COMMERCE | 111 | 0.0123142 | 8904 | 0.987797 |20| 15 | Hospitality Management | 110 | 0.0122032 | 9014 | 1 |21+----+-----------------------------------+---------+-----------+--------------------+----------------------+22Specialization for Converted Leads23+----+-----------------------------------+---------+-----------+--------------------+----------------------+24| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |25|----+-----------------------------------+---------+-----------+--------------------+----------------------|26| 0 | No Specialization | 888 | 0.259725 | 888 | 0.259725 |27| 1 | Finance Management | 422 | 0.123428 | 1310 | 0.383153 |28| 2 | Marketing Management | 397 | 0.116116 | 1707 | 0.499269 |29| 3 | Human Resource Management | 379 | 0.110851 | 2086 | 0.61012 |30| 4 | Operations Management | 234 | 0.0684411 | 2320 | 0.678561 |31| 5 | Business Administration | 175 | 0.0511846 | 2495 | 0.729746 |32| 6 | Banking, Investment And Insurance | 164 | 0.0479672 | 2659 | 0.777713 |33| 7 | Supply Chain Management | 147 | 0.042995 | 2806 | 0.820708 |34| 8 | IT Projects Management | 140 | 0.0409476 | 2946 | 0.861655 |35| 9 | Other Specialization | 97 | 0.0283709 | 3043 | 0.890026 |36| 10 | Media and Advertising | 84 | 0.0245686 | 3127 | 0.914595 |37| 11 | Healthcare Management | 76 | 0.0222287 | 3203 | 0.936824 |38| 12 | Travel and Tourism | 71 | 0.0207663 | 3274 | 0.95759 |39| 13 | International Business | 62 | 0.018134 | 3336 | 0.975724 |40| 14 | Hospitality Management | 44 | 0.0128693 | 3380 | 0.988593 |41| 15 | E-COMMERCE | 39 | 0.0114068 | 3419 | 1 |42+----+-----------------------------------+---------+-----------+--------------------+----------------------+43Specialization for Un-Converted Leads44+----+-----------------------------------+---------+-----------+--------------------+----------------------+45| | Specialization | Count | Percent | Cumulative Count | Cumulative Percent |46|----+-----------------------------------+---------+-----------+--------------------+----------------------|47| 0 | No Specialization | 2342 | 0.418588 | 2342 | 0.418588 |48| 1 | Finance Management | 537 | 0.0959786 | 2879 | 0.514567 |49| 2 | Human Resource Management | 457 | 0.0816801 | 3336 | 0.596247 |50| 3 | Marketing Management | 425 | 0.0759607 | 3761 | 0.672207 |51| 4 | Operations Management | 264 | 0.047185 | 4025 | 0.719392 |52| 5 | IT Projects Management | 226 | 0.0403932 | 4251 | 0.759786 |53| 6 | Business Administration | 222 | 0.0396783 | 4473 | 0.799464 |54| 7 | Supply Chain Management | 197 | 0.03521 | 4670 | 0.834674 |55| 8 | Other Specialization | 173 | 0.0309205 | 4843 | 0.865594 |56| 9 | Banking, Investment And Insurance | 171 | 0.030563 | 5014 | 0.896157 |57| 10 | Travel and Tourism | 131 | 0.0234138 | 5145 | 0.919571 |58| 11 | Media and Advertising | 118 | 0.0210903 | 5263 | 0.940661 |59| 12 | International Business | 114 | 0.0203753 | 5377 | 0.961037 |60| 13 | Healthcare Management | 80 | 0.0142985 | 5457 | 0.975335 |61| 14 | E-COMMERCE | 72 | 0.0128686 | 5529 | 0.988204 |62| 15 | Hospitality Management | 66 | 0.0117962 | 5595 | 1 |63+----+-----------------------------------+---------+-----------+--------------------+----------------------+64Specialization vs Conversion Rate65+-----------------------------------+------------------+66| | Specialization |67|-----------------------------------+------------------|68| Banking, Investment And Insurance | 0.489552 |69| Business Administration | 0.440806 |70| E-COMMERCE | 0.351351 |71| Finance Management | 0.440042 |72| Healthcare Management | 0.487179 |73| Hospitality Management | 0.4 |74| Human Resource Management | 0.453349 |75| IT Projects Management | 0.382514 |76| International Business | 0.352273 |77| Marketing Management | 0.482968 |78| Media and Advertising | 0.415842 |79| No Specialization | 0.274923 |80| Operations Management | 0.46988 |81| Other Specialization | 0.359259 |82| Supply Chain Management | 0.427326 |83| Travel and Tourism | 0.351485 |84+-----------------------------------+------------------+
- Specialization of 36% of leads is missing.
- We have mapped those missing values with ‘No Specialization’.There might be two reason for this,
- Lead might be a fresher.
- Lead missed to fill it.
- Among all the specializations, ’ Banking, Investment And Insurance’ has the highest conversion rate(48.9%).
What is your current occupation
1feature = 'What is your current occupation'2categoricalUAn(feature,figsize=[8,8])
1Types of What is your current occupation2+----+-----------------------------------+---------+-------------+--------------------+----------------------+3| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |4|----+-----------------------------------+---------+-------------+--------------------+----------------------|5| 0 | Unemployed | 5445 | 0.60406 | 5445 | 0.60406 |6| 1 | Unknown Occupation | 2656 | 0.294653 | 8101 | 0.898713 |7| 2 | Working Professional | 675 | 0.0748835 | 8776 | 0.973597 |8| 3 | Student | 206 | 0.0228533 | 8982 | 0.99645 |9| 4 | Other | 15 | 0.00166408 | 8997 | 0.998114 |10| 5 | Housewife | 9 | 0.000998447 | 9006 | 0.999112 |11| 6 | Businessman | 8 | 0.000887508 | 9014 | 1 |12+----+-----------------------------------+---------+-------------+--------------------+----------------------+13What is your current occupation for Converted Leads14+----+-----------------------------------+---------+------------+--------------------+----------------------+15| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |16|----+-----------------------------------+---------+------------+--------------------+----------------------|17| 0 | Unemployed | 2336 | 0.683241 | 2336 | 0.683241 |18| 1 | Working Professional | 620 | 0.18134 | 2956 | 0.86458 |19| 2 | Unknown Occupation | 366 | 0.107049 | 3322 | 0.971629 |20| 3 | Student | 74 | 0.0216438 | 3396 | 0.993273 |21| 4 | Other | 9 | 0.00263235 | 3405 | 0.995905 |22| 5 | Housewife | 9 | 0.00263235 | 3414 | 0.998538 |23| 6 | Businessman | 5 | 0.00146242 | 3419 | 1 |24+----+-----------------------------------+---------+------------+--------------------+----------------------+25What is your current occupation for Un-Converted Leads26+----+-----------------------------------+---------+-------------+--------------------+----------------------+27| | What is your current occupation | Count | Percent | Cumulative Count | Cumulative Percent |28|----+-----------------------------------+---------+-------------+--------------------+----------------------|29| 0 | Unemployed | 3109 | 0.555675 | 3109 | 0.555675 |30| 1 | Unknown Occupation | 2290 | 0.409294 | 5399 | 0.964969 |31| 2 | Student | 132 | 0.0235925 | 5531 | 0.988561 |32| 3 | Working Professional | 55 | 0.00983021 | 5586 | 0.998391 |33| 4 | Other | 6 | 0.00107239 | 5592 | 0.999464 |34| 5 | Businessman | 3 | 0.000536193 | 5595 | 1 |35+----+-----------------------------------+---------+-------------+--------------------+----------------------+36What is your current occupation vs Conversion Rate37+----------------------+-----------------------------------+38| | What is your current occupation |39|----------------------+-----------------------------------|40| Businessman | 0.625 |41| Housewife | nan |42| Other | 0.6 |43| Student | 0.359223 |44| Unemployed | 0.429017 |45| Unknown Occupation | 0.137801 |46| Working Professional | 0.918519 |47+----------------------+-----------------------------------+
- Although the conversion rate for Working Professional is the highest ! 91.6%, they only make 7.4% of all leads. 60% leads are Unemployed customers followed by 29% with unknown nature of employment
- Among all the converted leads, Unemployed and Working Professionals top the list.
- Conversion for Housewife segment is 100%
City
1feature = 'City'2categoricalUAn(feature)
1Types of City2+----+-----------------------------+---------+------------+--------------------+----------------------+3| | City | Count | Percent | Cumulative Count | Cumulative Percent |4|----+-----------------------------+---------+------------+--------------------+----------------------|5| 0 | Mumbai | 6692 | 0.742401 | 6692 | 0.742401 |6| 1 | Thane & Outskirts | 745 | 0.0826492 | 7437 | 0.82505 |7| 2 | Other Cities | 680 | 0.0754382 | 8117 | 0.900488 |8| 3 | Other Cities of Maharashtra | 446 | 0.0494786 | 8563 | 0.949967 |9| 4 | Other Metro Cities | 377 | 0.0418238 | 8940 | 0.991791 |10| 5 | Tier II Cities | 74 | 0.00820945 | 9014 | 1 |11+----+-----------------------------+---------+------------+--------------------+----------------------+12City for Converted Leads13+----+-----------------------------+---------+------------+--------------------+----------------------+14| | City | Count | Percent | Cumulative Count | Cumulative Percent |15|----+-----------------------------+---------+------------+--------------------+----------------------|16| 0 | Mumbai | 2440 | 0.713659 | 2440 | 0.713659 |17| 1 | Thane & Outskirts | 332 | 0.0971044 | 2772 | 0.810763 |18| 2 | Other Cities | 272 | 0.0795554 | 3044 | 0.890319 |19| 3 | Other Cities of Maharashtra | 196 | 0.0573267 | 3240 | 0.947646 |20| 4 | Other Metro Cities | 154 | 0.0450424 | 3394 | 0.992688 |21| 5 | Tier II Cities | 25 | 0.00731208 | 3419 | 1 |22+----+-----------------------------+---------+------------+--------------------+----------------------+23City for Un-Converted Leads24+----+-----------------------------+---------+------------+--------------------+----------------------+25| | City | Count | Percent | Cumulative Count | Cumulative Percent |26|----+-----------------------------+---------+------------+--------------------+----------------------|27| 0 | Mumbai | 4252 | 0.759964 | 4252 | 0.759964 |28| 1 | Thane & Outskirts | 413 | 0.0738159 | 4665 | 0.83378 |29| 2 | Other Cities | 408 | 0.0729223 | 5073 | 0.906702 |30| 3 | Other Cities of Maharashtra | 250 | 0.0446828 | 5323 | 0.951385 |31| 4 | Other Metro Cities | 223 | 0.039857 | 5546 | 0.991242 |32| 5 | Tier II Cities | 49 | 0.00875782 | 5595 | 1 |33+----+-----------------------------+---------+------------+--------------------+----------------------+34City vs Conversion Rate35+-----------------------------+----------+36| | City |37|-----------------------------+----------|38| Mumbai | 0.364614 |39| Thane & Outskirts | 0.445638 |40| Other Cities | 0.4 |41| Other Cities of Maharashtra | 0.439462 |42| Other Metro Cities | 0.408488 |43| Tier II Cities | 0.337838 |44+-----------------------------+----------+
- Most Leads come from ‘Mumbai’ and they have a decent conversion rate of 36.4%.
- Leads from Thane and outskirts make up 8.2% with a conversion rate of 44%
A free copy of Mastering The Interview.
1feature = 'A free copy of Mastering The Interview'2categoricalUAn(feature)
1Types of A free copy of Mastering The Interview2+----+------------------------------------------+---------+-----------+--------------------+----------------------+3| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |4|----+------------------------------------------+---------+-----------+--------------------+----------------------|5| 0 | No | 6126 | 0.679609 | 6126 | 0.679609 |6| 1 | Yes | 2888 | 0.320391 | 9014 | 1 |7+----+------------------------------------------+---------+-----------+--------------------+----------------------+8A free copy of Mastering The Interview for Converted Leads9+----+------------------------------------------+---------+-----------+--------------------+----------------------+10| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |11|----+------------------------------------------+---------+-----------+--------------------+----------------------|12| 0 | No | 2389 | 0.698742 | 2389 | 0.698742 |13| 1 | Yes | 1030 | 0.301258 | 3419 | 1 |14+----+------------------------------------------+---------+-----------+--------------------+----------------------+15A free copy of Mastering The Interview for Un-Converted Leads16+----+------------------------------------------+---------+-----------+--------------------+----------------------+17| | A free copy of Mastering The Interview | Count | Percent | Cumulative Count | Cumulative Percent |18|----+------------------------------------------+---------+-----------+--------------------+----------------------|19| 0 | No | 3737 | 0.667918 | 3737 | 0.667918 |20| 1 | Yes | 1858 | 0.332082 | 5595 | 1 |21+----+------------------------------------------+---------+-----------+--------------------+----------------------+22A free copy of Mastering The Interview vs Conversion Rate23+-----+------------------------------------------+24| | A free copy of Mastering The Interview |25|-----+------------------------------------------|26| No | 0.389977 |27| Yes | 0.356648 |28+-----+------------------------------------------+
- 68% of the leads said “No” for a free copy of ‘Mastering The Interview’.
- Conversion rate of leads who said “No” is high (39.8%).
1def num_univariate_analysis(column_name,scale='linear') :23 converted = leads[leads['Converted'] == 1]4 unconverted = leads[leads['Converted'] == 0]56 plt.figure(figsize=(8,6))7 ax = sns.boxplot(x=column_name, y='Converted', data = leads)8 title = 'Boxplot of ' + column_name+' vs Conversion'9 ax.set(title=title)10 if scale == 'log' :11 ax.set_xscale('log')12 ax.set(ylabel=column_name + '(Log Scale)')1314 print("Spread for range of "+column_name+" that were Converted")15 tab(converted[column_name].describe())16 print("Spread for range of "+column_name+" that were not converted")17 tab(unconverted[column_name].describe())
TotalVisits
1column_name = 'TotalVisits'2num_univariate_analysis(column_name,scale='log')
1Spread for range of TotalVisits that were Converted2+-------+---------------+3| | TotalVisits |4|-------+---------------|5| count | 3419 |6| mean | 3.65575 |7| std | 5.57527 |8| min | 0 |9| 25% | 0 |10| 50% | 3 |11| 75% | 5 |12| max | 251 |13+-------+---------------+14Spread for range of TotalVisits that were not converted15+-------+---------------+16| | TotalVisits |17|-------+---------------|18| count | 5595 |19| mean | 3.33262 |20| std | 4.37298 |21| min | 0 |22| 25% | 1 |23| 50% | 3 |24| 75% | 4 |25| max | 141 |26+-------+---------------+
- Looks like
Total Visits
have a lot of outliers among bothConverted
andUn-converted
leads. - Let’s take a look at the quantiles between 90 and 100.
1# Looking at Quantiles2tab(leads[column_name].quantile(np.linspace(.90,1,20)))
1+----------+---------------+2| | TotalVisits |3|----------+---------------|4| 0.9 | 7 |5| 0.905263 | 7 |6| 0.910526 | 8 |7| 0.915789 | 8 |8| 0.921053 | 8 |9| 0.926316 | 8 |10| 0.931579 | 9 |11| 0.936842 | 9 |12| 0.942105 | 9 |13| 0.947368 | 9 |14| 0.952632 | 10 |15| 0.957895 | 10 |16| 0.963158 | 11 |17| 0.968421 | 11 |18| 0.973684 | 12 |19| 0.978947 | 13 |20| 0.984211 | 14 |21| 0.989474 | 17 |22| 0.994737 | 20 |23| 1 | 251 |24+----------+---------------+
- From the above, it is clear that outliers exist and these might skew the analyses.
- For now, let’s cap the outliers about 99th percentile to 99th percentile value.
soft range
capping.
1# Capping outliers to 99th perentile value2cap = leads[column_name].quantile(.99)3condition = leads[column_name] > cap4leads.loc[condition, column_name] = cap
Total TIme Spent on Website.
1column = 'Total Time Spent on Website'2num_univariate_analysis(column)
1Spread for range of Total Time Spent on Website that were Converted2+-------+-------------------------------+3| | Total Time Spent on Website |4|-------+-------------------------------|5| count | 3419 |6| mean | 732.945 |7| std | 614.476 |8| min | 0 |9| 25% | 0 |10| 50% | 826 |11| 75% | 1265.5 |12| max | 2253 |13+-------+-------------------------------+14Spread for range of Total Time Spent on Website that were not converted15+-------+-------------------------------+16| | Total Time Spent on Website |17|-------+-------------------------------|18| count | 5595 |19| mean | 329.919 |20| std | 432.757 |21| min | 0 |22| 25% | 14 |23| 50% | 178 |24| 75% | 393.5 |25| max | 2272 |26+-------+-------------------------------+
- ‘Total Time Spend on Website’ has many outliers.
- Let’s look quantiles to confirm this.
1tab(leads[column].quantile(np.linspace(0.75,1,25)))
1+----------+-------------------------------+2| | Total Time Spent on Website |3|----------+-------------------------------|4| 0.75 | 924 |5| 0.760417 | 953.635 |6| 0.770833 | 991 |7| 0.78125 | 1022.41 |8| 0.791667 | 1054 |9| 0.802083 | 1087 |10| 0.8125 | 1115.06 |11| 0.822917 | 1143 |12| 0.833333 | 1177.83 |13| 0.84375 | 1208 |14| 0.854167 | 1238.6 |15| 0.864583 | 1271 |16| 0.875 | 1296.38 |17| 0.885417 | 1328 |18| 0.895833 | 1360 |19| 0.90625 | 1392 |20| 0.916667 | 1434 |21| 0.927083 | 1468 |22| 0.9375 | 1503 |23| 0.947917 | 1549 |24| 0.958333 | 1592.46 |25| 0.96875 | 1647 |26| 0.979167 | 1720.23 |27| 0.989583 | 1830.34 |28| 1 | 2272 |29+----------+-------------------------------+
1leads[column].quantile(np.linspace(0.75,1,50)).plot()
1<matplotlib.axes._subplots.AxesSubplot at 0x7fc54b9106d0>
1# Capping `Total Time Spent on Website` values to 99th percentile2cap = leads[column].quantile(.99)3condition = leads[column] > cap4leads.loc[condition, column] = cap
Page Views Per Visit.
1column = 'Page Views Per Visit'2num_univariate_analysis(column)
1Spread for range of Page Views Per Visit that were Converted2+-------+------------------------+3| | Page Views Per Visit |4|-------+------------------------|5| count | 3419 |6| mean | 2.36407 |7| std | 2.10862 |8| min | 0 |9| 25% | 0 |10| 50% | 2 |11| 75% | 3.5 |12| max | 15 |13+-------+------------------------+14Spread for range of Page Views Per Visit that were not converted15+-------+------------------------+16| | Page Views Per Visit |17|-------+------------------------|18| count | 5595 |19| mean | 2.36962 |20| std | 2.17789 |21| min | 0 |22| 25% | 1 |23| 50% | 2 |24| 75% | 3 |25| max | 55 |26+-------+------------------------+
- ‘Page Views Per Visit’ has many outliers.
- Let’s look quantiles to confirm this.
1leads[column].quantile(np.linspace(0.75,1,30)).plot()
1<matplotlib.axes._subplots.AxesSubplot at 0x7fc54c15d890>
- There is a sudden jump between 99th percentile and maximum value.
- let’s cap the values to 99th percentile to avoid skewing the analysis
1# Capping `Page Views Per Visit` values to 99th percentile2cap = leads[column].quantile(.99)3condition = leads[column] > cap4leads.loc[condition, column] = cap
Bivariate Analysis
1leads.columns.values
1array(['Lead Origin', 'Lead Source', 'Do Not Email', 'Converted',2 'TotalVisits', 'Total Time Spent on Website',3 'Page Views Per Visit', 'Last Activity', 'Country',4 'Specialization', 'What is your current occupation', 'City',5 'A free copy of Mastering The Interview'], dtype=object)
1continuous_vars = ['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']
TotalVisits vs A free copy of Mastering The Interview
1plt.figure(figsize=[8,8])2sns.barplot(x=continuous_vars[0], y = 'A free copy of Mastering The Interview', data=leads, hue='Converted')
1<matplotlib.axes._subplots.AxesSubplot at 0x7fc54c0fba90>
- One can see that the proportion of leads with high Total Visits to the website also like a free copy of Mastering The Interview.
- Incidentally, these are leads with higher conversion rate.
- More convertable leads are being attracted by the website through providing ‘A free copy of Mastering The Interview’.
Lead Source vs Country
1# sns.barplot(x='Lead Source', y = 'Country', hue='Converted', data=leads)2leads.groupby(['Country','Lead Source'])['Converted'].value_counts(normalize=True)\3.unstack()\4 .plot(5 layout=(2,2),6 figsize=(14,12), kind='barh', stacked=True);
- Most leads from India through Reference Sources have very high conversion rate.
- Leads from outside of India from other Lead sources do not convert at all.
Occupation vs City
1x = "What is your current occupation"2y = 'City'34leads.groupby([x,y])['Converted'].value_counts(normalize=True)\5.unstack()\6 .plot(7 layout=(2,2),8 figsize=(14,12), kind='barh', stacked=True);
- Working Professionals in Other cities of Maharashtra have higher conversion rates compared to those from Mumbai , Thane and other cities.
- BusinessMen from Mumbai and Thane & Outskirts are poor leads in comparison to Tier 2 and Other cities .
Last Activity vs Country
1x = "Country"2y = 'Last Activity'34leads.groupby([x,y])['Converted'].value_counts(normalize=True)\5.unstack()\6 .plot(7 layout=(2,2),8 figsize=(14,12), kind='barh', stacked=True);
- SMS and Emails are more favourable for conversion over Website Visits outside of India.
- Leads from outside of India who click email links have higher conversion rate compared those from India.
Data Preparation
Mapping Binary Variables to 0 / 1
1binary_var = ['Do Not Email', 'A free copy of Mastering The Interview']2leads[binary_var] = leads[binary_var].replace({'Yes' : 1, 'No' : 0})
Creating Indicator Variables
1categoricalCol = ['Lead Origin', 'Lead Source','Last Activity', 'Country', 'Specialization',2 'What is your current occupation', 'City']34print('Levels in Each Cateogrical Variable\n')5for col in sorted(categoricalCol) :6 print(col, leads[col].unique(), '\n')78# Creating dummy variables9leadOriginDummies = pd.get_dummies(leads['Lead Origin'], drop_first=True)10leadSourceDummies = pd.get_dummies(leads['Lead Source'], drop_first=True)11lastActivityDummies = pd.get_dummies(leads['Last Activity'], drop_first=True)12countryDummies = pd.get_dummies(leads['Country'] ,drop_first=True)13specDummies = pd.get_dummies(leads['Specialization'],drop_first=True)14occupationDummies = pd.get_dummies(leads[ 'What is your current occupation'],drop_first=True)15cityDummies = pd.get_dummies(leads[ 'City'],drop_first=True)1617# adding dummy variables to leads dataframe18leads = pd.concat([leads, leadOriginDummies,leadSourceDummies,lastActivityDummies, countryDummies, specDummies, occupationDummies, cityDummies], axis=1)1920# dropping categorical columns21leads.drop(columns = categoricalCol, inplace=True)222324print('Final Columns')25leads.columns
1Levels in Each Cateogrical Variable23City ['Mumbai' 'Thane & Outskirts' 'Other Metro Cities' 'Other Cities'4 'Other Cities of Maharashtra' 'Tier II Cities']56Country ['India' 'Outside India']78Last Activity ['Page Visited on Website' 'Email Opened' 'Other Last Activity'9 'Converted to Lead' 'Olark Chat Conversation' 'Email Link Clicked'10 'Form Submitted on Website' 'Email Bounced' 'SMS Sent']1112Lead Origin ['API' 'Landing Page Submission' 'Other Lead Origins']1314Lead Source ['Olark Chat' 'Organic Search' 'Direct Traffic' 'Google' 'Referral Sites'15 'Reference' 'Welingak Website' 'Other Lead Sources']1617Specialization ['No Specialization' 'Business Administration' 'Media and Advertising'18 'Supply Chain Management' 'IT Projects Management' 'Finance Management'19 'Travel and Tourism' 'Human Resource Management' 'Marketing Management'20 'Banking, Investment And Insurance' 'International Business' 'E-COMMERCE'21 'Operations Management' 'Other Specialization' 'Hospitality Management'22 'Healthcare Management']2324What is your current occupation ['Unemployed' 'Student' 'Unknown Occupation' 'Working Professional'25 'Businessman' 'Other' 'Housewife']2627Final Columns282930313233Index(['Do Not Email', 'Converted', 'TotalVisits',34 'Total Time Spent on Website', 'Page Views Per Visit',35 'A free copy of Mastering The Interview', 'Landing Page Submission',36 'Other Lead Origins', 'Google', 'Olark Chat', 'Organic Search',37 'Other Lead Sources', 'Reference', 'Referral Sites', 'Welingak Website',38 'Email Bounced', 'Email Link Clicked', 'Email Opened',39 'Form Submitted on Website', 'Olark Chat Conversation',40 'Other Last Activity', 'Page Visited on Website', 'SMS Sent',41 'Outside India', 'Business Administration', 'E-COMMERCE',42 'Finance Management', 'Healthcare Management', 'Hospitality Management',43 'Human Resource Management', 'IT Projects Management',44 'International Business', 'Marketing Management',45 'Media and Advertising', 'No Specialization', 'Operations Management',46 'Other Specialization', 'Supply Chain Management', 'Travel and Tourism',47 'Housewife', 'Other', 'Student', 'Unemployed', 'Unknown Occupation',48 'Working Professional', 'Other Cities', 'Other Cities of Maharashtra',49 'Other Metro Cities', 'Thane & Outskirts', 'Tier II Cities'],50 dtype='object')
Correlation
1# Top Correlations2def correlation(dataframe) :3 cor0=dataframe.corr()4 type(cor0)5 cor0.where(np.triu(np.ones(cor0.shape),k=1).astype(np.bool))6 cor0=cor0.unstack().reset_index()7 cor0.columns=['VAR1','VAR2','CORR']8 cor0.dropna(subset=['CORR'], inplace=True)9 cor0.CORR=round(cor0['CORR'],2)10 cor0.CORR=cor0.CORR.abs()11 cor0.sort_values(by=['CORR'],ascending=False)12 cor0=cor0[~(cor0['VAR1']==cor0['VAR2'])]13 return pd.DataFrame(cor0.sort_values(by=['CORR'],ascending=False))
1#Correlations for Converted Leads2convertedCondition= leads['Converted']==13print('Correlations for Converted Leads')4correlation(leads[convertedCondition])[1:30:2].style.background_gradient(cmap='GnBu').hide_index()
1Correlations for Converted Leads
VAR1 | VAR2 | CORR |
Other Lead Origins | Reference | 0.840000 |
SMS Sent | Email Opened | 0.750000 |
Unemployed | Working Professional | 0.690000 |
Page Views Per Visit | TotalVisits | 0.680000 |
No Specialization | Landing Page Submission | 0.600000 |
Page Views Per Visit | Landing Page Submission | 0.560000 |
Landing Page Submission | A free copy of Mastering The Interview | 0.540000 |
Unemployed | Unknown Occupation | 0.510000 |
Total Time Spent on Website | Page Views Per Visit | 0.500000 |
Other Lead Origins | Page Views Per Visit | 0.480000 |
Other Lead Origins | Total Time Spent on Website | 0.460000 |
Other Lead Origins | Landing Page Submission | 0.450000 |
Welingak Website | Other Lead Origins | 0.450000 |
Total Time Spent on Website | TotalVisits | 0.440000 |
TotalVisits | Total Time Spent on Website | 0.440000 |
- Conversions of leads from other lead origins and the ones through reference have similar conversion behaviour.
1#Correlations for un-Converted Leads2unconvertedCondition=leads['Converted']==03print('Correlations for Non-Converted Leads')4correlation(leads[unconvertedCondition])[1:30:2].style.background_gradient(cmap='GnBu').hide_index()
1Correlations for Non-Converted Leads
VAR1 | VAR2 | CORR |
Unemployed | Unknown Occupation | 0.930000 |
Landing Page Submission | No Specialization | 0.870000 |
Reference | Other Lead Origins | 0.750000 |
TotalVisits | Page Views Per Visit | 0.730000 |
Email Bounced | Do Not Email | 0.650000 |
Olark Chat | Page Views Per Visit | 0.630000 |
Olark Chat | Landing Page Submission | 0.620000 |
A free copy of Mastering The Interview | No Specialization | 0.590000 |
A free copy of Mastering The Interview | Landing Page Submission | 0.580000 |
No Specialization | Olark Chat | 0.570000 |
Olark Chat | TotalVisits | 0.520000 |
Page Views Per Visit | Landing Page Submission | 0.500000 |
Olark Chat Conversation | Olark Chat | 0.500000 |
No Specialization | Page Views Per Visit | 0.490000 |
Other Lead Origins | Other Lead Sources | 0.480000 |
- From the above,
Unknown Occupation
andUnemployed
are highly correlated for non-converted leads. - This might mean that unemployed leads and leads with unknown occupation have the same conversion behaviour.
Train-Test Split
1from sklearn.model_selection import train_test_split2y = leads.pop('Converted')3X = leads
1X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=100)
Standardizing Continuous Variables
1continuous_vars
1['TotalVisits', 'Page Views Per Visit', 'Total Time Spent on Website']
1from sklearn.preprocessing import StandardScaler2scaler = StandardScaler()34# fitting and transforming train set5X_train[continuous_vars] = scaler.fit_transform(X_train[continuous_vars])67# Transforming test set for later use8X_test[continuous_vars] = scaler.transform(X_test[continuous_vars])
Modelling
Recurvise Feature Elimination
1print('No of features : ', len(X_train.columns))
1No of features : 49
- Currently, the dataset has 49 features.
- We shall follow a mixed feature elimination approach.
- We could use Recursive Feature Elimination for coarse elimination to 25 columns
- This is followed by manual elimination of features with high p-value / VIF.
1# RFE2from sklearn.linear_model import LogisticRegression3from sklearn.feature_selection import RFE4minFeatures = 255model = LogisticRegression()6rfe = RFE(model, n_features_to_select=minFeatures)7rfe = rfe.fit(X_train, y_train)
1# Columns selected by RFE :2RFE_features = pd.DataFrame( {'feature' : X_train.columns, 'rank' : rfe.ranking_, 'support' : rfe.support_})3condition = RFE_features['support'] == True4rfe_features = RFE_features[condition].sort_values(by='rank',ascending=True )5print('Features selected by RFE\n')6tab(rfe_features)
1Features selected by RFE23+----+-----------------------------+--------+-----------+4| | feature | rank | support |5|----+-----------------------------+--------+-----------|6| 0 | Do Not Email | 1 | True |7| 42 | Unknown Occupation | 1 | True |8| 41 | Unemployed | 1 | True |9| 40 | Student | 1 | True |10| 38 | Housewife | 1 | True |11| 33 | No Specialization | 1 | True |12| 32 | Media and Advertising | 1 | True |13| 27 | Hospitality Management | 1 | True |14| 22 | Outside India | 1 | True |15| 21 | SMS Sent | 1 | True |16| 19 | Other Last Activity | 1 | True |17| 43 | Working Professional | 1 | True |18| 18 | Olark Chat Conversation | 1 | True |19| 15 | Email Link Clicked | 1 | True |20| 13 | Welingak Website | 1 | True |21| 11 | Reference | 1 | True |22| 10 | Other Lead Sources | 1 | True |23| 8 | Olark Chat | 1 | True |24| 6 | Other Lead Origins | 1 | True |25| 5 | Landing Page Submission | 1 | True |26| 3 | Page Views Per Visit | 1 | True |27| 2 | Total Time Spent on Website | 1 | True |28| 1 | TotalVisits | 1 | True |29| 16 | Email Opened | 1 | True |30| 48 | Tier II Cities | 1 | True |31+----+-----------------------------+--------+-----------+
1rfeFeatures = rfe_features['feature'].values
Manual Feature Elimination
Model 1
1### Multicollinearity2from statsmodels.stats.outliers_influence import variance_inflation_factor3def vif(X) :4 df = sm.add_constant(X)5 vif = [variance_inflation_factor(df.values,i) for i in range(df.shape[1])]6 vif_frame = pd.DataFrame({'vif' : vif[0:]},index = df.columns).reset_index()7 tab(vif_frame.sort_values(by='vif',ascending=False))
1# Model 12import statsmodels.api as sm3features = rfe_features['feature'].values4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+-----------+3| | index | vif |4|----+-----------------------------+-----------|5| 0 | const | 498.666 |6| 3 | Unemployed | 117.314 |7| 2 | Unknown Occupation | 102.767 |8| 19 | Other Lead Origins | 34.8794 |9| 12 | Working Professional | 34.6746 |10| 16 | Reference | 26.0024 |11| 4 | Student | 11.9606 |12| 15 | Welingak Website | 8.73583 |13| 20 | Landing Page Submission | 3.48279 |14| 6 | No Specialization | 3.05316 |15| 17 | Other Lead Sources | 2.84805 |16| 21 | Page Views Per Visit | 2.65403 |17| 24 | Email Opened | 2.47549 |18| 18 | Olark Chat | 2.45889 |19| 10 | SMS Sent | 2.30595 |20| 23 | TotalVisits | 2.0918 |21| 13 | Olark Chat Conversation | 1.93429 |22| 5 | Housewife | 1.46972 |23| 22 | Total Time Spent on Website | 1.35898 |24| 1 | Do Not Email | 1.2055 |25| 14 | Email Link Clicked | 1.20382 |26| 11 | Other Last Activity | 1.11873 |27| 9 | Outside India | 1.02551 |28| 7 | Media and Advertising | 1.01922 |29| 8 | Hospitality Management | 1.0113 |30| 25 | Tier II Cities | 1.00967 |31+----+-----------------------------+-----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6283 |
Model Family: | Binomial | Df Model: | 25 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2503.0 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5005.9 |
Time: | 07:36:56 | Pearson chi2: | 6.36e+03 |
No. Iterations: | 20 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.3145 | 0.675 | -0.466 | 0.641 | -1.638 | 1.009 |
Do Not Email | -1.4730 | 0.189 | -7.775 | 0.000 | -1.844 | -1.102 |
Unknown Occupation | -1.8781 | 0.668 | -2.813 | 0.005 | -3.187 | -0.570 |
Unemployed | -0.6028 | 0.664 | -0.908 | 0.364 | -1.904 | 0.698 |
Student | -0.6044 | 0.699 | -0.864 | 0.387 | -1.975 | 0.766 |
Housewife | 19.9950 | 1.12e+04 | 0.002 | 0.999 | -2.2e+04 | 2.21e+04 |
No Specialization | -0.8514 | 0.127 | -6.698 | 0.000 | -1.101 | -0.602 |
Media and Advertising | -0.2951 | 0.247 | -1.196 | 0.232 | -0.779 | 0.189 |
Hospitality Management | -0.9043 | 0.333 | -2.715 | 0.007 | -1.557 | -0.252 |
Outside India | -0.6829 | 0.236 | -2.899 | 0.004 | -1.145 | -0.221 |
SMS Sent | 2.0121 | 0.128 | 15.767 | 0.000 | 1.762 | 2.262 |
Other Last Activity | 1.3061 | 0.262 | 4.991 | 0.000 | 0.793 | 1.819 |
Working Professional | 1.8232 | 0.688 | 2.651 | 0.008 | 0.475 | 3.171 |
Olark Chat Conversation | -0.6947 | 0.202 | -3.437 | 0.001 | -1.091 | -0.298 |
Email Link Clicked | 0.4905 | 0.236 | 2.079 | 0.038 | 0.028 | 0.953 |
Welingak Website | 4.3758 | 1.134 | 3.858 | 0.000 | 2.153 | 6.599 |
Reference | 2.0165 | 0.892 | 2.261 | 0.024 | 0.268 | 3.765 |
Other Lead Sources | -0.1279 | 0.830 | -0.154 | 0.878 | -1.755 | 1.500 |
Olark Chat | 1.2173 | 0.141 | 8.613 | 0.000 | 0.940 | 1.494 |
Other Lead Origins | 1.0385 | 0.869 | 1.196 | 0.232 | -0.664 | 2.741 |
Landing Page Submission | -0.9141 | 0.131 | -6.962 | 0.000 | -1.171 | -0.657 |
Page Views Per Visit | -0.2670 | 0.056 | -4.727 | 0.000 | -0.378 | -0.156 |
Total Time Spent on Website | 1.1245 | 0.042 | 26.804 | 0.000 | 1.042 | 1.207 |
TotalVisits | 0.3212 | 0.050 | 6.432 | 0.000 | 0.223 | 0.419 |
Email Opened | 0.7814 | 0.125 | 6.271 | 0.000 | 0.537 | 1.026 |
Tier II Cities | -0.4419 | 0.425 | -1.040 | 0.298 | -1.275 | 0.391 |
Unemployed
has the highest VIF. let’s drop this feature.
Model 2
1# Model 2 : Removing `Unemployed`2column_to_remove = 'Unemployed'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 18 | Other Lead Origins | 34.8784 |6| 15 | Reference | 26.0024 |7| 0 | const | 19.6719 |8| 14 | Welingak Website | 8.73573 |9| 19 | Landing Page Submission | 3.4731 |10| 5 | No Specialization | 3.04661 |11| 16 | Other Lead Sources | 2.84805 |12| 20 | Page Views Per Visit | 2.65403 |13| 23 | Email Opened | 2.47543 |14| 17 | Olark Chat | 2.45888 |15| 9 | SMS Sent | 2.30562 |16| 22 | TotalVisits | 2.09129 |17| 12 | Olark Chat Conversation | 1.93402 |18| 21 | Total Time Spent on Website | 1.35834 |19| 1 | Do Not Email | 1.20547 |20| 13 | Email Link Clicked | 1.2038 |21| 2 | Unknown Occupation | 1.17383 |22| 11 | Working Professional | 1.15381 |23| 10 | Other Last Activity | 1.11849 |24| 8 | Outside India | 1.02545 |25| 3 | Student | 1.02534 |26| 6 | Media and Advertising | 1.01899 |27| 7 | Hospitality Management | 1.01126 |28| 4 | Housewife | 1.01026 |29| 24 | Tier II Cities | 1.00967 |30+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6284 |
Model Family: | Binomial | Df Model: | 24 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2503.4 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5006.8 |
Time: | 07:36:56 | Pearson chi2: | 6.36e+03 |
No. Iterations: | 21 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.9101 | 0.163 | -5.598 | 0.000 | -1.229 | -0.591 |
Do Not Email | -1.4697 | 0.189 | -7.761 | 0.000 | -1.841 | -1.099 |
Unknown Occupation | -1.2777 | 0.091 | -14.023 | 0.000 | -1.456 | -1.099 |
Student | -0.0044 | 0.229 | -0.019 | 0.985 | -0.453 | 0.444 |
Housewife | 21.5945 | 1.85e+04 | 0.001 | 0.999 | -3.63e+04 | 3.63e+04 |
No Specialization | -0.8575 | 0.127 | -6.754 | 0.000 | -1.106 | -0.609 |
Media and Advertising | -0.2953 | 0.247 | -1.198 | 0.231 | -0.779 | 0.188 |
Hospitality Management | -0.9066 | 0.333 | -2.722 | 0.006 | -1.559 | -0.254 |
Outside India | -0.6853 | 0.236 | -2.909 | 0.004 | -1.147 | -0.224 |
SMS Sent | 2.0111 | 0.128 | 15.755 | 0.000 | 1.761 | 2.261 |
Other Last Activity | 1.3069 | 0.261 | 5.001 | 0.000 | 0.795 | 1.819 |
Working Professional | 2.4233 | 0.191 | 12.700 | 0.000 | 2.049 | 2.797 |
Olark Chat Conversation | -0.6896 | 0.202 | -3.415 | 0.001 | -1.085 | -0.294 |
Email Link Clicked | 0.4893 | 0.236 | 2.074 | 0.038 | 0.027 | 0.952 |
Welingak Website | 4.3797 | 1.134 | 3.861 | 0.000 | 2.157 | 6.603 |
Reference | 2.0184 | 0.892 | 2.263 | 0.024 | 0.270 | 3.766 |
Other Lead Sources | -0.1269 | 0.830 | -0.153 | 0.879 | -1.754 | 1.500 |
Olark Chat | 1.2189 | 0.141 | 8.624 | 0.000 | 0.942 | 1.496 |
Other Lead Origins | 1.0344 | 0.869 | 1.191 | 0.234 | -0.668 | 2.737 |
Landing Page Submission | -0.9202 | 0.131 | -7.016 | 0.000 | -1.177 | -0.663 |
Page Views Per Visit | -0.2668 | 0.056 | -4.723 | 0.000 | -0.377 | -0.156 |
Total Time Spent on Website | 1.1251 | 0.042 | 26.818 | 0.000 | 1.043 | 1.207 |
TotalVisits | 0.3227 | 0.050 | 6.466 | 0.000 | 0.225 | 0.421 |
Email Opened | 0.7824 | 0.125 | 6.278 | 0.000 | 0.538 | 1.027 |
Tier II Cities | -0.4432 | 0.425 | -1.043 | 0.297 | -1.276 | 0.389 |
Other Lead Origins
has a very high VIF.- Let’s drop this variable
Model 3
1# Model 3 : Removing `Other Lead Origins`2column_to_remove = 'Other Lead Origins'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 19.5742 |6| 18 | Landing Page Submission | 3.44401 |7| 5 | No Specialization | 3.03467 |8| 19 | Page Views Per Visit | 2.64774 |9| 22 | Email Opened | 2.4731 |10| 17 | Olark Chat | 2.45033 |11| 9 | SMS Sent | 2.30433 |12| 21 | TotalVisits | 2.09083 |13| 12 | Olark Chat Conversation | 1.93393 |14| 15 | Reference | 1.64559 |15| 20 | Total Time Spent on Website | 1.35735 |16| 1 | Do Not Email | 1.20544 |17| 13 | Email Link Clicked | 1.20372 |18| 2 | Unknown Occupation | 1.17275 |19| 11 | Working Professional | 1.15369 |20| 14 | Welingak Website | 1.14402 |21| 10 | Other Last Activity | 1.11839 |22| 16 | Other Lead Sources | 1.03494 |23| 8 | Outside India | 1.02524 |24| 3 | Student | 1.02486 |25| 6 | Media and Advertising | 1.01894 |26| 7 | Hospitality Management | 1.01112 |27| 4 | Housewife | 1.01026 |28| 23 | Tier II Cities | 1.00966 |29+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6285 |
Model Family: | Binomial | Df Model: | 23 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2504.2 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5008.4 |
Time: | 07:36:56 | Pearson chi2: | 6.37e+03 |
No. Iterations: | 21 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.8922 | 0.162 | -5.511 | 0.000 | -1.209 | -0.575 |
Do Not Email | -1.4710 | 0.189 | -7.770 | 0.000 | -1.842 | -1.100 |
Unknown Occupation | -1.2816 | 0.091 | -14.076 | 0.000 | -1.460 | -1.103 |
Student | -0.0129 | 0.228 | -0.056 | 0.955 | -0.461 | 0.435 |
Housewife | 21.5902 | 1.85e+04 | 0.001 | 0.999 | -3.63e+04 | 3.63e+04 |
No Specialization | -0.8719 | 0.127 | -6.891 | 0.000 | -1.120 | -0.624 |
Media and Advertising | -0.2936 | 0.247 | -1.190 | 0.234 | -0.777 | 0.190 |
Hospitality Management | -0.9011 | 0.333 | -2.705 | 0.007 | -1.554 | -0.248 |
Outside India | -0.6916 | 0.235 | -2.940 | 0.003 | -1.153 | -0.231 |
SMS Sent | 2.0143 | 0.127 | 15.802 | 0.000 | 1.764 | 2.264 |
Other Last Activity | 1.3096 | 0.261 | 5.011 | 0.000 | 0.797 | 1.822 |
Working Professional | 2.4221 | 0.191 | 12.693 | 0.000 | 2.048 | 2.796 |
Olark Chat Conversation | -0.6896 | 0.202 | -3.416 | 0.001 | -1.085 | -0.294 |
Email Link Clicked | 0.4914 | 0.236 | 2.084 | 0.037 | 0.029 | 0.954 |
Welingak Website | 5.4015 | 0.741 | 7.294 | 0.000 | 3.950 | 6.853 |
Reference | 3.0328 | 0.258 | 11.775 | 0.000 | 2.528 | 3.538 |
Other Lead Sources | 0.6953 | 0.383 | 1.815 | 0.070 | -0.056 | 1.446 |
Olark Chat | 1.2083 | 0.141 | 8.581 | 0.000 | 0.932 | 1.484 |
Landing Page Submission | -0.9391 | 0.130 | -7.204 | 0.000 | -1.195 | -0.684 |
Page Views Per Visit | -0.2697 | 0.056 | -4.780 | 0.000 | -0.380 | -0.159 |
Total Time Spent on Website | 1.1240 | 0.042 | 26.816 | 0.000 | 1.042 | 1.206 |
TotalVisits | 0.3212 | 0.050 | 6.437 | 0.000 | 0.223 | 0.419 |
Email Opened | 0.7860 | 0.124 | 6.318 | 0.000 | 0.542 | 1.030 |
Tier II Cities | -0.4421 | 0.425 | -1.041 | 0.298 | -1.275 | 0.391 |
Housewife
has a high p-value and hence the coefficient is insignificant. let’s drop the same.
Model 4
1# Model 4 : Removing `Housewife`2column_to_remove = 'Housewife'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 19.5671 |6| 17 | Landing Page Submission | 3.44232 |7| 4 | No Specialization | 3.03251 |8| 18 | Page Views Per Visit | 2.64773 |9| 21 | Email Opened | 2.47231 |10| 16 | Olark Chat | 2.4503 |11| 8 | SMS Sent | 2.30422 |12| 20 | TotalVisits | 2.09075 |13| 11 | Olark Chat Conversation | 1.93383 |14| 14 | Reference | 1.64154 |15| 19 | Total Time Spent on Website | 1.35679 |16| 1 | Do Not Email | 1.20544 |17| 12 | Email Link Clicked | 1.20292 |18| 2 | Unknown Occupation | 1.17264 |19| 10 | Working Professional | 1.15248 |20| 13 | Welingak Website | 1.14402 |21| 9 | Other Last Activity | 1.11839 |22| 15 | Other Lead Sources | 1.03492 |23| 3 | Student | 1.02478 |24| 7 | Outside India | 1.02446 |25| 5 | Media and Advertising | 1.01783 |26| 6 | Hospitality Management | 1.01111 |27| 22 | Tier II Cities | 1.00965 |28+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6286 |
Model Family: | Binomial | Df Model: | 22 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2506.1 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5012.2 |
Time: | 07:36:57 | Pearson chi2: | 6.39e+03 |
No. Iterations: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.8894 | 0.162 | -5.494 | 0.000 | -1.207 | -0.572 |
Do Not Email | -1.4718 | 0.189 | -7.770 | 0.000 | -1.843 | -1.101 |
Unknown Occupation | -1.2830 | 0.091 | -14.091 | 0.000 | -1.461 | -1.105 |
Student | -0.0165 | 0.229 | -0.072 | 0.942 | -0.465 | 0.432 |
No Specialization | -0.8767 | 0.127 | -6.929 | 0.000 | -1.125 | -0.629 |
Media and Advertising | -0.2871 | 0.246 | -1.169 | 0.243 | -0.769 | 0.194 |
Hospitality Management | -0.9050 | 0.333 | -2.715 | 0.007 | -1.558 | -0.252 |
Outside India | -0.6828 | 0.234 | -2.919 | 0.004 | -1.141 | -0.224 |
SMS Sent | 2.0146 | 0.128 | 15.799 | 0.000 | 1.765 | 2.264 |
Other Last Activity | 1.3101 | 0.261 | 5.011 | 0.000 | 0.798 | 1.823 |
Working Professional | 2.4187 | 0.191 | 12.674 | 0.000 | 2.045 | 2.793 |
Olark Chat Conversation | -0.6880 | 0.202 | -3.407 | 0.001 | -1.084 | -0.292 |
Email Link Clicked | 0.4981 | 0.235 | 2.117 | 0.034 | 0.037 | 0.959 |
Welingak Website | 5.4013 | 0.741 | 7.294 | 0.000 | 3.950 | 6.853 |
Reference | 3.0546 | 0.257 | 11.869 | 0.000 | 2.550 | 3.559 |
Other Lead Sources | 0.6922 | 0.383 | 1.807 | 0.071 | -0.059 | 1.443 |
Olark Chat | 1.2075 | 0.141 | 8.575 | 0.000 | 0.931 | 1.483 |
Landing Page Submission | -0.9416 | 0.130 | -7.228 | 0.000 | -1.197 | -0.686 |
Page Views Per Visit | -0.2701 | 0.056 | -4.788 | 0.000 | -0.381 | -0.160 |
Total Time Spent on Website | 1.1252 | 0.042 | 26.851 | 0.000 | 1.043 | 1.207 |
TotalVisits | 0.3203 | 0.050 | 6.421 | 0.000 | 0.223 | 0.418 |
Email Opened | 0.7897 | 0.124 | 6.347 | 0.000 | 0.546 | 1.034 |
Tier II Cities | -0.4440 | 0.425 | -1.045 | 0.296 | -1.277 | 0.389 |
Student
has a p-value higher than 0.05 and the highest among all p-values. Let’s drop this feature.
Model 5
1# Model 5 : Removing `Student`2column_to_remove = 'Student'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 19.5487 |6| 16 | Landing Page Submission | 3.43698 |7| 3 | No Specialization | 3.02919 |8| 17 | Page Views Per Visit | 2.64719 |9| 20 | Email Opened | 2.47218 |10| 15 | Olark Chat | 2.44998 |11| 7 | SMS Sent | 2.30157 |12| 19 | TotalVisits | 2.09067 |13| 10 | Olark Chat Conversation | 1.93303 |14| 13 | Reference | 1.63987 |15| 18 | Total Time Spent on Website | 1.35679 |16| 1 | Do Not Email | 1.20485 |17| 11 | Email Link Clicked | 1.2029 |18| 2 | Unknown Occupation | 1.15459 |19| 9 | Working Professional | 1.14906 |20| 12 | Welingak Website | 1.14371 |21| 8 | Other Last Activity | 1.11789 |22| 14 | Other Lead Sources | 1.03492 |23| 6 | Outside India | 1.02427 |24| 4 | Media and Advertising | 1.01768 |25| 5 | Hospitality Management | 1.01055 |26| 21 | Tier II Cities | 1.00946 |27+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6287 |
Model Family: | Binomial | Df Model: | 21 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2506.1 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5012.2 |
Time: | 07:36:57 | Pearson chi2: | 6.39e+03 |
No. Iterations: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.8898 | 0.162 | -5.500 | 0.000 | -1.207 | -0.573 |
Do Not Email | -1.4718 | 0.189 | -7.771 | 0.000 | -1.843 | -1.101 |
Unknown Occupation | -1.2824 | 0.091 | -14.134 | 0.000 | -1.460 | -1.105 |
No Specialization | -0.8767 | 0.127 | -6.929 | 0.000 | -1.125 | -0.629 |
Media and Advertising | -0.2868 | 0.246 | -1.168 | 0.243 | -0.768 | 0.195 |
Hospitality Management | -0.9053 | 0.333 | -2.717 | 0.007 | -1.559 | -0.252 |
Outside India | -0.6826 | 0.234 | -2.919 | 0.004 | -1.141 | -0.224 |
SMS Sent | 2.0148 | 0.127 | 15.806 | 0.000 | 1.765 | 2.265 |
Other Last Activity | 1.3105 | 0.261 | 5.014 | 0.000 | 0.798 | 1.823 |
Working Professional | 2.4193 | 0.191 | 12.687 | 0.000 | 2.046 | 2.793 |
Olark Chat Conversation | -0.6883 | 0.202 | -3.410 | 0.001 | -1.084 | -0.293 |
Email Link Clicked | 0.4981 | 0.235 | 2.117 | 0.034 | 0.037 | 0.959 |
Welingak Website | 5.4017 | 0.741 | 7.294 | 0.000 | 3.950 | 6.853 |
Reference | 3.0541 | 0.257 | 11.873 | 0.000 | 2.550 | 3.558 |
Other Lead Sources | 0.6922 | 0.383 | 1.807 | 0.071 | -0.059 | 1.443 |
Olark Chat | 1.2072 | 0.141 | 8.577 | 0.000 | 0.931 | 1.483 |
Landing Page Submission | -0.9418 | 0.130 | -7.231 | 0.000 | -1.197 | -0.687 |
Page Views Per Visit | -0.2701 | 0.056 | -4.788 | 0.000 | -0.381 | -0.160 |
Total Time Spent on Website | 1.1252 | 0.042 | 26.851 | 0.000 | 1.043 | 1.207 |
TotalVisits | 0.3204 | 0.050 | 6.421 | 0.000 | 0.223 | 0.418 |
Email Opened | 0.7896 | 0.124 | 6.347 | 0.000 | 0.546 | 1.033 |
Tier II Cities | -0.4436 | 0.425 | -1.044 | 0.297 | -1.276 | 0.389 |
Tier II Cities
has a p-value higher than confidence level and the highest among all the p - values.- Let’s remove this feature
Model 6
1# Model 6 : Removing `Tier II Cities`2column_to_remove = 'Tier II Cities'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 19.5461 |6| 16 | Landing Page Submission | 3.43294 |7| 3 | No Specialization | 3.02879 |8| 17 | Page Views Per Visit | 2.64718 |9| 20 | Email Opened | 2.47182 |10| 15 | Olark Chat | 2.44996 |11| 7 | SMS Sent | 2.30139 |12| 19 | TotalVisits | 2.09046 |13| 10 | Olark Chat Conversation | 1.93285 |14| 13 | Reference | 1.63987 |15| 18 | Total Time Spent on Website | 1.35679 |16| 1 | Do Not Email | 1.20341 |17| 11 | Email Link Clicked | 1.20289 |18| 2 | Unknown Occupation | 1.15454 |19| 9 | Working Professional | 1.14885 |20| 12 | Welingak Website | 1.14371 |21| 8 | Other Last Activity | 1.11724 |22| 14 | Other Lead Sources | 1.0349 |23| 6 | Outside India | 1.02425 |24| 4 | Media and Advertising | 1.0176 |25| 5 | Hospitality Management | 1.01026 |26+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6288 |
Model Family: | Binomial | Df Model: | 20 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2506.6 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5013.3 |
Time: | 07:36:57 | Pearson chi2: | 6.38e+03 |
No. Iterations: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.8916 | 0.162 | -5.511 | 0.000 | -1.209 | -0.574 |
Do Not Email | -1.4762 | 0.189 | -7.805 | 0.000 | -1.847 | -1.106 |
Unknown Occupation | -1.2818 | 0.091 | -14.129 | 0.000 | -1.460 | -1.104 |
No Specialization | -0.8758 | 0.127 | -6.923 | 0.000 | -1.124 | -0.628 |
Media and Advertising | -0.2821 | 0.246 | -1.149 | 0.251 | -0.763 | 0.199 |
Hospitality Management | -0.9112 | 0.333 | -2.740 | 0.006 | -1.563 | -0.259 |
Outside India | -0.6825 | 0.234 | -2.919 | 0.004 | -1.141 | -0.224 |
SMS Sent | 2.0155 | 0.127 | 15.809 | 0.000 | 1.766 | 2.265 |
Other Last Activity | 1.3159 | 0.261 | 5.035 | 0.000 | 0.804 | 1.828 |
Working Professional | 2.4165 | 0.191 | 12.684 | 0.000 | 2.043 | 2.790 |
Olark Chat Conversation | -0.6867 | 0.202 | -3.402 | 0.001 | -1.082 | -0.291 |
Email Link Clicked | 0.4992 | 0.235 | 2.123 | 0.034 | 0.038 | 0.960 |
Welingak Website | 5.4036 | 0.741 | 7.296 | 0.000 | 3.952 | 6.855 |
Reference | 3.0553 | 0.257 | 11.877 | 0.000 | 2.551 | 3.559 |
Other Lead Sources | 0.6941 | 0.383 | 1.812 | 0.070 | -0.057 | 1.445 |
Olark Chat | 1.2077 | 0.141 | 8.582 | 0.000 | 0.932 | 1.483 |
Landing Page Submission | -0.9467 | 0.130 | -7.273 | 0.000 | -1.202 | -0.692 |
Page Views Per Visit | -0.2700 | 0.056 | -4.790 | 0.000 | -0.381 | -0.160 |
Total Time Spent on Website | 1.1249 | 0.042 | 26.851 | 0.000 | 1.043 | 1.207 |
TotalVisits | 0.3212 | 0.050 | 6.441 | 0.000 | 0.223 | 0.419 |
Email Opened | 0.7907 | 0.124 | 6.355 | 0.000 | 0.547 | 1.035 |
Page Views Per Visit
has a high p-value. Let’s eliminate this.
Model 7
1# Model 7 : Removing `Page Views Per Visit`2column_to_remove = 'Page Views Per Visit'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 19.521 |6| 16 | Landing Page Submission | 3.40467 |7| 3 | No Specialization | 3.02277 |8| 19 | Email Opened | 2.43525 |9| 7 | SMS Sent | 2.2499 |10| 15 | Olark Chat | 2.22665 |11| 10 | Olark Chat Conversation | 1.92174 |12| 13 | Reference | 1.56753 |13| 18 | TotalVisits | 1.49299 |14| 17 | Total Time Spent on Website | 1.35607 |15| 1 | Do Not Email | 1.20207 |16| 11 | Email Link Clicked | 1.20085 |17| 2 | Unknown Occupation | 1.15454 |18| 9 | Working Professional | 1.14798 |19| 12 | Welingak Website | 1.12183 |20| 8 | Other Last Activity | 1.11356 |21| 14 | Other Lead Sources | 1.03154 |22| 6 | Outside India | 1.02316 |23| 4 | Media and Advertising | 1.0176 |24| 5 | Hospitality Management | 1.01022 |25+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6289 |
Model Family: | Binomial | Df Model: | 19 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2518.3 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5036.6 |
Time: | 07:36:57 | Pearson chi2: | 6.35e+03 |
No. Iterations: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.8590 | 0.161 | -5.346 | 0.000 | -1.174 | -0.544 |
Do Not Email | -1.5025 | 0.189 | -7.940 | 0.000 | -1.873 | -1.132 |
Unknown Occupation | -1.2733 | 0.090 | -14.083 | 0.000 | -1.451 | -1.096 |
No Specialization | -0.8357 | 0.126 | -6.656 | 0.000 | -1.082 | -0.590 |
Media and Advertising | -0.2901 | 0.244 | -1.187 | 0.235 | -0.769 | 0.189 |
Hospitality Management | -0.9063 | 0.332 | -2.732 | 0.006 | -1.556 | -0.256 |
Outside India | -0.6629 | 0.234 | -2.828 | 0.005 | -1.122 | -0.203 |
SMS Sent | 1.9118 | 0.125 | 15.332 | 0.000 | 1.667 | 2.156 |
Other Last Activity | 1.2298 | 0.260 | 4.738 | 0.000 | 0.721 | 1.739 |
Working Professional | 2.4225 | 0.191 | 12.702 | 0.000 | 2.049 | 2.796 |
Olark Chat Conversation | -0.7552 | 0.201 | -3.762 | 0.000 | -1.149 | -0.362 |
Email Link Clicked | 0.4387 | 0.234 | 1.875 | 0.061 | -0.020 | 0.897 |
Welingak Website | 5.6036 | 0.739 | 7.587 | 0.000 | 4.156 | 7.051 |
Reference | 3.2759 | 0.253 | 12.946 | 0.000 | 2.780 | 3.772 |
Other Lead Sources | 0.8276 | 0.379 | 2.186 | 0.029 | 0.086 | 1.570 |
Olark Chat | 1.4051 | 0.135 | 10.431 | 0.000 | 1.141 | 1.669 |
Landing Page Submission | -0.9926 | 0.129 | -7.685 | 0.000 | -1.246 | -0.739 |
Total Time Spent on Website | 1.1253 | 0.042 | 26.905 | 0.000 | 1.043 | 1.207 |
TotalVisits | 0.1958 | 0.042 | 4.614 | 0.000 | 0.113 | 0.279 |
Email Opened | 0.7106 | 0.123 | 5.797 | 0.000 | 0.470 | 0.951 |
Media and Advertising
has a high p-value. let’s drop this feature
Model 8
1# Model 8 : Removing `Media and Advertising`2column_to_remove = 'Media and Advertising'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 19.4607 |6| 15 | Landing Page Submission | 3.40463 |7| 3 | No Specialization | 3.01069 |8| 18 | Email Opened | 2.43523 |9| 6 | SMS Sent | 2.24839 |10| 14 | Olark Chat | 2.22661 |11| 9 | Olark Chat Conversation | 1.92165 |12| 12 | Reference | 1.56581 |13| 17 | TotalVisits | 1.49255 |14| 16 | Total Time Spent on Website | 1.35565 |15| 1 | Do Not Email | 1.202 |16| 10 | Email Link Clicked | 1.20054 |17| 2 | Unknown Occupation | 1.15412 |18| 8 | Working Professional | 1.14797 |19| 11 | Welingak Website | 1.12171 |20| 7 | Other Last Activity | 1.11347 |21| 13 | Other Lead Sources | 1.03154 |22| 5 | Outside India | 1.02306 |23| 4 | Hospitality Management | 1.00959 |24+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6290 |
Model Family: | Binomial | Df Model: | 18 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2519.0 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5038.0 |
Time: | 07:36:57 | Pearson chi2: | 6.34e+03 |
No. Iterations: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.8673 | 0.160 | -5.404 | 0.000 | -1.182 | -0.553 |
Do Not Email | -1.5036 | 0.189 | -7.945 | 0.000 | -1.874 | -1.133 |
Unknown Occupation | -1.2714 | 0.090 | -14.070 | 0.000 | -1.448 | -1.094 |
No Specialization | -0.8261 | 0.125 | -6.595 | 0.000 | -1.072 | -0.581 |
Hospitality Management | -0.8963 | 0.332 | -2.703 | 0.007 | -1.546 | -0.246 |
Outside India | -0.6576 | 0.234 | -2.809 | 0.005 | -1.116 | -0.199 |
SMS Sent | 1.9074 | 0.125 | 15.311 | 0.000 | 1.663 | 2.152 |
Other Last Activity | 1.2318 | 0.259 | 4.749 | 0.000 | 0.723 | 1.740 |
Working Professional | 2.4169 | 0.191 | 12.682 | 0.000 | 2.043 | 2.790 |
Olark Chat Conversation | -0.7593 | 0.201 | -3.783 | 0.000 | -1.153 | -0.366 |
Email Link Clicked | 0.4421 | 0.234 | 1.890 | 0.059 | -0.016 | 0.901 |
Welingak Website | 5.6066 | 0.739 | 7.591 | 0.000 | 4.159 | 7.054 |
Reference | 3.2810 | 0.253 | 12.970 | 0.000 | 2.785 | 3.777 |
Other Lead Sources | 0.8257 | 0.378 | 2.182 | 0.029 | 0.084 | 1.567 |
Olark Chat | 1.4068 | 0.135 | 10.444 | 0.000 | 1.143 | 1.671 |
Landing Page Submission | -0.9929 | 0.129 | -7.688 | 0.000 | -1.246 | -0.740 |
Total Time Spent on Website | 1.1264 | 0.042 | 26.929 | 0.000 | 1.044 | 1.208 |
TotalVisits | 0.1962 | 0.042 | 4.623 | 0.000 | 0.113 | 0.279 |
Email Opened | 0.7096 | 0.123 | 5.791 | 0.000 | 0.469 | 0.950 |
- This model has a feature
Email Link Clicked
with high p-value of 0.059. Let’s drop this feature.
Model 9
1# Model 9 : Removing `Email Link Clicked`2column_to_remove = 'Email Link Clicked'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6print("VIF for X_train")7vif(X_train)8logm1.fit().summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 18.2371 |6| 14 | Landing Page Submission | 3.40436 |7| 3 | No Specialization | 3.01066 |8| 13 | Olark Chat | 2.21092 |9| 17 | Email Opened | 2.09586 |10| 6 | SMS Sent | 1.97156 |11| 9 | Olark Chat Conversation | 1.75591 |12| 11 | Reference | 1.56268 |13| 16 | TotalVisits | 1.49255 |14| 15 | Total Time Spent on Website | 1.35564 |15| 1 | Do Not Email | 1.16781 |16| 2 | Unknown Occupation | 1.1541 |17| 8 | Working Professional | 1.14791 |18| 10 | Welingak Website | 1.12098 |19| 7 | Other Last Activity | 1.09682 |20| 12 | Other Lead Sources | 1.0315 |21| 5 | Outside India | 1.02286 |22| 4 | Hospitality Management | 1.00955 |23+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6291 |
Model Family: | Binomial | Df Model: | 17 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2520.8 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5041.5 |
Time: | 07:36:57 | Pearson chi2: | 6.34e+03 |
No. Iterations: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.7799 | 0.153 | -5.111 | 0.000 | -1.079 | -0.481 |
Do Not Email | -1.5328 | 0.188 | -8.150 | 0.000 | -1.901 | -1.164 |
Unknown Occupation | -1.2696 | 0.090 | -14.058 | 0.000 | -1.447 | -1.093 |
No Specialization | -0.8248 | 0.125 | -6.587 | 0.000 | -1.070 | -0.579 |
Hospitality Management | -0.8909 | 0.332 | -2.687 | 0.007 | -1.541 | -0.241 |
Outside India | -0.6555 | 0.235 | -2.795 | 0.005 | -1.115 | -0.196 |
SMS Sent | 1.8157 | 0.113 | 16.026 | 0.000 | 1.594 | 2.038 |
Other Last Activity | 1.1473 | 0.255 | 4.500 | 0.000 | 0.648 | 1.647 |
Working Professional | 2.4103 | 0.190 | 12.668 | 0.000 | 2.037 | 2.783 |
Olark Chat Conversation | -0.8576 | 0.193 | -4.441 | 0.000 | -1.236 | -0.479 |
Welingak Website | 5.6297 | 0.738 | 7.632 | 0.000 | 4.184 | 7.076 |
Reference | 3.3059 | 0.253 | 13.064 | 0.000 | 2.810 | 3.802 |
Other Lead Sources | 0.8192 | 0.378 | 2.167 | 0.030 | 0.078 | 1.560 |
Olark Chat | 1.4212 | 0.135 | 10.561 | 0.000 | 1.157 | 1.685 |
Landing Page Submission | -0.9890 | 0.129 | -7.664 | 0.000 | -1.242 | -0.736 |
Total Time Spent on Website | 1.1247 | 0.042 | 26.931 | 0.000 | 1.043 | 1.207 |
TotalVisits | 0.1953 | 0.042 | 4.613 | 0.000 | 0.112 | 0.278 |
Email Opened | 0.6161 | 0.111 | 5.564 | 0.000 | 0.399 | 0.833 |
- All coefficients are significant / low p-value
- For further elimination , let’s use the magnitude of coefficient as the weight/importance of the variable. Higher values are more important than lower values.
- By this reasoning,
TotalVisits
has the least coefficient. Let’d drop this.
Model 10
1# Model 10 : Removing `TotalVisits`2column_to_remove = 'TotalVisits'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6logm1 = logm1.fit()7print("VIF for X_train")8vif(X_train)9logm1.summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 17.9489 |6| 14 | Landing Page Submission | 3.4038 |7| 3 | No Specialization | 2.97226 |8| 16 | Email Opened | 2.09586 |9| 13 | Olark Chat | 1.98518 |10| 6 | SMS Sent | 1.97039 |11| 9 | Olark Chat Conversation | 1.7559 |12| 11 | Reference | 1.46711 |13| 15 | Total Time Spent on Website | 1.34366 |14| 1 | Do Not Email | 1.1672 |15| 2 | Unknown Occupation | 1.15404 |16| 8 | Working Professional | 1.14782 |17| 10 | Welingak Website | 1.10012 |18| 7 | Other Last Activity | 1.09606 |19| 12 | Other Lead Sources | 1.02698 |20| 5 | Outside India | 1.0226 |21| 4 | Hospitality Management | 1.00954 |22+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6292 |
Model Family: | Binomial | Df Model: | 16 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2531.3 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5062.7 |
Time: | 07:36:57 | Pearson chi2: | 6.40e+03 |
No. Iterations: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.6821 | 0.151 | -4.531 | 0.000 | -0.977 | -0.387 |
Do Not Email | -1.5442 | 0.188 | -8.229 | 0.000 | -1.912 | -1.176 |
Unknown Occupation | -1.2691 | 0.090 | -14.078 | 0.000 | -1.446 | -1.092 |
No Specialization | -0.8891 | 0.124 | -7.144 | 0.000 | -1.133 | -0.645 |
Hospitality Management | -0.8856 | 0.332 | -2.667 | 0.008 | -1.536 | -0.235 |
Outside India | -0.6636 | 0.233 | -2.845 | 0.004 | -1.121 | -0.206 |
SMS Sent | 1.7926 | 0.113 | 15.916 | 0.000 | 1.572 | 2.013 |
Other Last Activity | 1.1818 | 0.254 | 4.646 | 0.000 | 0.683 | 1.680 |
Working Professional | 2.3799 | 0.189 | 12.584 | 0.000 | 2.009 | 2.751 |
Olark Chat Conversation | -0.8591 | 0.192 | -4.474 | 0.000 | -1.236 | -0.483 |
Welingak Website | 5.4168 | 0.736 | 7.360 | 0.000 | 3.974 | 6.859 |
Reference | 3.0622 | 0.247 | 12.402 | 0.000 | 2.578 | 3.546 |
Other Lead Sources | 0.6577 | 0.377 | 1.744 | 0.081 | -0.081 | 1.397 |
Olark Chat | 1.2170 | 0.126 | 9.662 | 0.000 | 0.970 | 1.464 |
Landing Page Submission | -0.9929 | 0.129 | -7.702 | 0.000 | -1.246 | -0.740 |
Total Time Spent on Website | 1.1330 | 0.042 | 27.181 | 0.000 | 1.051 | 1.215 |
Email Opened | 0.6052 | 0.110 | 5.490 | 0.000 | 0.389 | 0.821 |
Other Lead Sources
has high p-value. Let’s drop this variable.
Model 11 - Final Model
1# Model 11 : Removing `Other Lead Sources`2column_to_remove = 'Other Lead Sources'3features = X_train.columns[X_train.columns !=column_to_remove]4X_train = X_train[features]5logm_final = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())6logm_final = logm_final.fit()7print("VIF for X_train")8vif(X_train)9logm_final.summary()
1VIF for X_train2+----+-----------------------------+----------+3| | index | vif |4|----+-----------------------------+----------|5| 0 | const | 17.7441 |6| 13 | Landing Page Submission | 3.3594 |7| 3 | No Specialization | 2.96428 |8| 15 | Email Opened | 2.09302 |9| 6 | SMS Sent | 1.97027 |10| 12 | Olark Chat | 1.96234 |11| 9 | Olark Chat Conversation | 1.75584 |12| 11 | Reference | 1.45497 |13| 14 | Total Time Spent on Website | 1.3339 |14| 1 | Do Not Email | 1.16719 |15| 2 | Unknown Occupation | 1.1536 |16| 8 | Working Professional | 1.14772 |17| 10 | Welingak Website | 1.09768 |18| 7 | Other Last Activity | 1.09593 |19| 5 | Outside India | 1.02258 |20| 4 | Hospitality Management | 1.0094 |21+----+-----------------------------+----------+
Dep. Variable: | Converted | No. Observations: | 6309 |
Model: | GLM | Df Residuals: | 6293 |
Model Family: | Binomial | Df Model: | 15 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -2532.8 |
Date: | Tue, 08 Sep 2020 | Deviance: | 5065.5 |
Time: | 07:36:57 | Pearson chi2: | 6.39e+03 |
No. Iterations: | 7 | ||
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
const | -0.6469 | 0.149 | -4.338 | 0.000 | -0.939 | -0.355 |
Do Not Email | -1.5426 | 0.188 | -8.222 | 0.000 | -1.910 | -1.175 |
Unknown Occupation | -1.2699 | 0.090 | -14.094 | 0.000 | -1.446 | -1.093 |
No Specialization | -0.9057 | 0.124 | -7.281 | 0.000 | -1.150 | -0.662 |
Hospitality Management | -0.8704 | 0.332 | -2.621 | 0.009 | -1.521 | -0.220 |
Outside India | -0.6584 | 0.233 | -2.823 | 0.005 | -1.116 | -0.201 |
SMS Sent | 1.7923 | 0.113 | 15.927 | 0.000 | 1.572 | 2.013 |
Other Last Activity | 1.1749 | 0.254 | 4.622 | 0.000 | 0.677 | 1.673 |
Working Professional | 2.3769 | 0.189 | 12.576 | 0.000 | 2.006 | 2.747 |
Olark Chat Conversation | -0.8614 | 0.192 | -4.488 | 0.000 | -1.238 | -0.485 |
Welingak Website | 5.3886 | 0.736 | 7.323 | 0.000 | 3.946 | 6.831 |
Reference | 3.0246 | 0.246 | 12.302 | 0.000 | 2.543 | 3.506 |
Olark Chat | 1.1876 | 0.125 | 9.530 | 0.000 | 0.943 | 1.432 |
Landing Page Submission | -1.0250 | 0.128 | -8.018 | 0.000 | -1.276 | -0.774 |
Total Time Spent on Website | 1.1253 | 0.041 | 27.204 | 0.000 | 1.044 | 1.206 |
Email Opened | 0.6106 | 0.110 | 5.545 | 0.000 | 0.395 | 0.826 |
- From the above, the features that remain are statistically significant and donot show any multi collinearity.
- Hence, we could use Model 11 is our final model.
Final Features
1finalFeatures = X_train.columns.values2print('The Final Feature for Modelling are :', finalFeatures)
1The Final Feature for Modelling are : ['Do Not Email' 'Unknown Occupation' 'No Specialization'2 'Hospitality Management' 'Outside India' 'SMS Sent' 'Other Last Activity'3 'Working Professional' 'Olark Chat Conversation' 'Welingak Website'4 'Reference' 'Olark Chat' 'Landing Page Submission'5 'Total Time Spent on Website' 'Email Opened']
Predictions
Predictions on Train set
1X_train_sm = sm.add_constant(X_train)2y_train_pred = logm_final.predict(X_train_sm)
Actual Conversions vs Conversion Predictions
1# Creating a data frame with converted vs converted probabilities2y_train_pred_final = pd.DataFrame({'Converted':y_train.values, 'Converted_Prob':y_train_pred})3y_train_pred_final['CustID'] = y_train.index4tab(y_train_pred_final.head(10))
1+------+-------------+------------------+----------+2| | Converted | Converted_Prob | CustID |3|------+-------------+------------------+----------|4| 4948 | 1 | 0.401461 | 4948 |5| 5938 | 1 | 0.318706 | 5938 |6| 5688 | 1 | 0.745966 | 5688 |7| 5381 | 0 | 0.00284809 | 5381 |8| 4742 | 1 | 0.801898 | 4742 |9| 5811 | 0 | 0.167979 | 5811 |10| 898 | 0 | 0.0880976 | 898 |11| 5316 | 0 | 0.0283184 | 5316 |12| 7381 | 0 | 0.251634 | 7381 |13| 1211 | 0 | 0.0235422 | 1211 |14+------+-------------+------------------+----------+
Predictions with cut off = 0.5
1#Creating new column 'predicted' with 1 if Converted_Prob > 0.5 else 02y_train_pred_final['predicted'] = y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)34# Let's see the head5tab(y_train_pred_final.head(10))
1+------+-------------+------------------+----------+-------------+2| | Converted | Converted_Prob | CustID | predicted |3|------+-------------+------------------+----------+-------------|4| 4948 | 1 | 0.401461 | 4948 | 0 |5| 5938 | 1 | 0.318706 | 5938 | 0 |6| 5688 | 1 | 0.745966 | 5688 | 1 |7| 5381 | 0 | 0.00284809 | 5381 | 0 |8| 4742 | 1 | 0.801898 | 4742 | 1 |9| 5811 | 0 | 0.167979 | 5811 | 0 |10| 898 | 0 | 0.0880976 | 898 | 0 |11| 5316 | 0 | 0.0283184 | 5316 | 0 |12| 7381 | 0 | 0.251634 | 7381 | 0 |13| 1211 | 0 | 0.0235422 | 1211 | 0 |14+------+-------------+------------------+----------+-------------+
Confusion Matrix
1from sklearn import metrics2# Confusion matrix3confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )4print(confusion)
1[[3462 455]2 [ 699 1693]]
Confusion Matrix for Train Set
$\frac{Predicted}{Actual}$ | Not Converted | Converted |
Not Converted | 3462 | 455 |
Converted | 699 | 1693 |
Accuracy of the Model
1# Let's check the overall accuracy.2accuracy = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.predicted)3print('Accuracy on Train set : ', round(100*accuracy,3),'%')
1Accuracy on Train set : 81.709 %
Metrics beyond simple accuracy
1TP = confusion[1,1] # true positive2TN = confusion[0,0] # true negatives3FP = confusion[0,1] # false positives4FN = confusion[1,0] # false negatives5sensitivity = TP/(FN + TP)6specificity = TN/(FP + TN)7falsePositiveRate = FP/(FP + TN)8positivePredictivePower = TP/(TP +FP )9negativePredictivePower = TN/(TN + FN)10print('sensitivity / Recall: ', round(100*sensitivity,3),'%')11print('specificity : ', round(100*specificity,3),'%')12print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')13print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')14print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
1sensitivity / Recall: 70.778 %2specificity : 88.384 %3False Positive Rate : 11.616 %4Precision / Positive Predictive Power : 78.818 %5Negative Predictive Power : 83.201 %
Plotting ROC Curve.
1def draw_roc( actual, probs ):2 fpr, tpr, thresholds = metrics.roc_curve( actual, probs,3 drop_intermediate = False )4 auc_score = metrics.roc_auc_score( actual, probs )5 plt.figure(figsize=(5, 5))6 plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )7 plt.plot([0, 1], [0, 1], 'k--')8 plt.xlim([0.0, 1.0])9 plt.ylim([0.0, 1.05])10 plt.xlabel('False Positive Rate or [1 - True Negative Rate]')11 plt.ylabel('True Positive Rate')12 plt.title('Receiver operating characteristic')13 plt.legend(loc="lower right")14 plt.show()1516 return None
1draw_roc(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)
Finding Optimal Cutoff Point
- Optimal cutoff probability is that prob where we get balanced sensitivity and specificity
1# Let's create columns with different probability cutoffs2numbers = [float(x)/10 for x in range(10)]3for i in numbers:4 y_train_pred_final[i]= y_train_pred_final.Converted_Prob.map(lambda x: 1 if x > i else 0)5tab(y_train_pred_final.head())
1+------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+2| | Converted | Converted_Prob | CustID | predicted | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |3|------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------|4| 4948 | 1 | 0.401461 | 4948 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |5| 5938 | 1 | 0.318706 | 5938 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |6| 5688 | 1 | 0.745966 | 5688 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |7| 5381 | 0 | 0.00284809 | 5381 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |8| 4742 | 1 | 0.801898 | 4742 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |9+------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
1# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.2cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])34# TP = confusion[1,1] # true positive5# TN = confusion[0,0] # true negatives6# FP = confusion[0,1] # false positives7# FN = confusion[1,0] # false negatives89num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]10for i in num:11 cm1 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final[i] )12 total1=sum(sum(cm1))13 accuracy = (cm1[0,0]+cm1[1,1])/total11415 speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])16 sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])17 cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]18tab(cutoff_df)
1+-----+--------+------------+----------+----------+2| | prob | accuracy | sensi | speci |3|-----+--------+------------+----------+----------|4| 0 | 0 | 0.379141 | 1 | 0 |5| 0.1 | 0.1 | 0.632905 | 0.974498 | 0.424304 |6| 0.2 | 0.2 | 0.758599 | 0.923077 | 0.658157 |7| 0.3 | 0.3 | 0.793311 | 0.869983 | 0.74649 |8| 0.4 | 0.4 | 0.81883 | 0.781355 | 0.841716 |9| 0.5 | 0.5 | 0.817087 | 0.707776 | 0.88384 |10| 0.6 | 0.6 | 0.808052 | 0.634197 | 0.91422 |11| 0.7 | 0.7 | 0.782533 | 0.513796 | 0.946643 |12| 0.8 | 0.8 | 0.755112 | 0.405518 | 0.968598 |13| 0.9 | 0.9 | 0.715961 | 0.273829 | 0.985959 |14+-----+--------+------------+----------+----------+
1# Let's plot accuracy sensitivity and specificity for various cutoff probabilities.23fig,ax = plt.subplots()4fig.set_figwidth(30)5fig.set_figheight(10)6plots=['accuracy','sensi','speci']7ax.set_xticks(np.linspace(0,1,50))8ax.set_title('Finding Optimal Cutoff')9sns.lineplot(x='prob',y=plots[0] , data=cutoff_df,ax=ax)10sns.lineplot(x='prob',y=plots[1] , data=cutoff_df,ax=ax)11sns.lineplot(x='prob',y=plots[2] , data=cutoff_df,ax=ax)1213ax.set_xlabel('Probabilites')14ax.set_ylabel('Accuracy,Sensitivity,Specificity')15ax.legend(["Accuracy",'Sensitivity','Specificity'])16# cutoff_df.plot.line(, figure=[10,10])17plt.show()
- From the curve above, 0.36 is the optimum cutoff probability.
1y_train_pred_final['final_predicted'] = y_train_pred_final.Converted_Prob.map( lambda x: 1 if x > 0.36 else 0)23tab(y_train_pred_final.head())
1+------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------------------+2| | Converted | Converted_Prob | CustID | predicted | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | final_predicted |3|------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------------------|4| 4948 | 1 | 0.401461 | 4948 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |5| 5938 | 1 | 0.318706 | 5938 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |6| 5688 | 1 | 0.745966 | 5688 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |7| 5381 | 0 | 0.00284809 | 5381 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |8| 4742 | 1 | 0.801898 | 4742 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |9+------+-------------+------------------+----------+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------------------+
1# Let's check the overall accuracy.2accu = metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)3print('Accuracy on Train set at Optimum Cut Off : ', round(100*accu,3),'%')
1Accuracy on Train set at Optimum Cut Off : 81.249 %
1confusion2 = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.final_predicted )2confusion2
1array([[3203, 714],2 [ 469, 1923]])
1TP = confusion2[1,1] # true positive2TN = confusion2[0,0] # true negatives3FP = confusion2[0,1] # false positives4FN = confusion2[1,0] # false negatives5sensitivity = TP/(FN + TP)6specificity = TN/(FP + TN)7falsePositiveRate = FP/(FP + TN)8positivePredictivePower = TP/(TP +FP )9negativePredictivePower = TN/(TN + FN)10print('sensitivity / Recall: ', round(100*sensitivity,3),'%')11print('specificity : ', round(100*specificity,3),'%')12print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')13print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')14print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
1sensitivity / Recall: 80.393 %2specificity : 81.772 %3False Positive Rate : 18.228 %4Precision / Positive Predictive Power : 72.924 %5Negative Predictive Power : 87.228 %
1## ROC curve for cut off probability of 0.362draw_roc(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
Precision and Recall
1#Looking at the confusion matrix again2confusion = metrics.confusion_matrix(y_train_pred_final.Converted, y_train_pred_final.predicted )3confusion
1array([[3462, 455],2 [ 699, 1693]])
- Precision :TP / TP + FP
- Recall :TP / TP + FN
1print('Precision :', confusion[1,1]/(confusion[0,1]+confusion[1,1]))2print('Recall :', confusion[1,1]/(confusion[1,0]+confusion[1,1]))
1Precision : 0.78817504655493492Recall : 0.7077759197324415
1#Doing the same using the sklearn.2from sklearn.metrics import precision_score, recall_score3print('Precision : ', precision_score(y_train_pred_final.Converted, y_train_pred_final.predicted))4print('Recall :', recall_score(y_train_pred_final.Converted, y_train_pred_final.predicted))
1Precision : 0.78817504655493492Recall : 0.7077759197324415
Precision and Recall Tradeoff
1from sklearn.metrics import precision_recall_curve23p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.Converted_Prob)4plt.plot(thresholds, p[:-1], "g-")5plt.plot(thresholds, r[:-1], "r-")6plt.show()
- The cut off point from precision-recall curve is ~0.4.
- Note that we have used the cut off obtained from ‘Sensitivity-Specificity’ trade off to predict conversions in this analysis.
Predictions on Test set
1X_test_sm = sm.add_constant(X_test[finalFeatures])2y_test_pred = logm_final.predict(X_test_sm)
Actual Conversions vs Conversion Probability
1# predicted conversions vs actual conversions and customer ID2y_test_predictions = pd.DataFrame({'Converted' :y_test, 'Conversion Probability' : y_test_pred, 'CustID' : y_test.index})3tab(y_test_predictions.head())
1+------+-------------+--------------------------+----------+2| | Converted | Conversion Probability | CustID |3|------+-------------+--------------------------+----------|4| 1260 | 0 | 0.116132 | 1260 |5| 2104 | 1 | 0.318706 | 2104 |6| 7105 | 1 | 0.982061 | 7105 |7| 8916 | 0 | 0.42048 | 8916 |8| 2822 | 0 | 0.0292669 | 2822 |9+------+-------------+--------------------------+----------+
1# predictions with optimal cut off = 0.352cutoff=0.363y_test_predictions['Predicted'] = y_test_predictions[4 'Conversion Probability'5].map(lambda x : 1 if x > cutoff else 0 )
Confusion Matrix
1confusion = metrics.confusion_matrix(y_test_predictions['Converted'], y_test_predictions['Predicted'])2confusion
1array([[1356, 322],2 [ 230, 797]])
1TP = confusion[1,1] # true positive2TN = confusion[0,0] # true negatives3FP = confusion[0,1] # false positives4FN = confusion[1,0] # false negatives
Accuracy
1print('Accuracy on Test set : ', round(100*(TP + TN)/(TP + TN + FP + FN),3),'%')
1Accuracy on Test set : 79.593 %
Metrics Beyond Simple Accuracy
1sensitivity = TP/(FN + TP)2specificity = TN/(FP + TN)3falsePositiveRate = FP/(FP + TN)4positivePredictivePower = TP/(TP +FP )5negativePredictivePower = TN/(TN + FN)6print('sensitivity / Recall: ', round(100*sensitivity,3),'%')7print('specificity : ', round(100*specificity,3),'%')8print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')9print('Precision / Positive Predictive Power : ', round(100*positivePredictivePower,3),'%')10print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
1sensitivity / Recall: 77.605 %2specificity : 80.81 %3False Positive Rate : 19.19 %4Precision / Positive Predictive Power : 71.224 %5Negative Predictive Power : 85.498 %
1## ROC curve for cut off probability of 0.3642draw_roc(y_test_predictions['Converted'],y_test_predictions['Predicted'])
- Note the AUC is 0.79 on the test test
Lead Scoring
1# merging final predictions with leads dataset2conversionProb = pd.concat([y_test_predictions['Conversion Probability'],y_train_pred_final['Converted_Prob']],axis=0)3conversionProb = pd.DataFrame({'Conversion Probability' : conversionProb}, index=conversionProb.index)4leads = pd.concat([leads,conversionProb],axis=1)5leads['Prospect ID'] = prospect_ids6leads['Lead No'] = lead_no7leads['Converted'] = y
1# Verifying prediction accuracy2leads['Predicted'] = leads['Conversion Probability'].map(lambda x : 1 if x > 0.36 else 0)
1confusion = metrics.confusion_matrix(leads['Converted'], leads['Predicted'])
1TP = confusion[1,1] # true positive2TN = confusion[0,0] # true negatives3FP = confusion[0,1] # false positives4FN = confusion[1,0] # false negatives5acc = metrics.accuracy_score(leads['Converted'], leads['Predicted'])6print('Accuracy : ', round(100*acc,3),'%')7sensitivity = TP/(FN + TP)8specificity = TN/(FP + TN)9falsePositiveRate = FP/(FP + TN)10falseNegativeRate = FN/(FP + TP)11positivePredictivePower = TP/(TP +FP )12negativePredictivePower = TN/(TN + FN)13print('sensitivity : ', round(100*sensitivity,3),'%')14print('specificity : ', round(100*specificity,3),'%')15print('False Positive Rate : ', round(100*falsePositiveRate,3),'%')16print('False Negative Rate : ', round(100*falseNegativeRate,3),'%')17print('Positive Predictive Power / Precision : ', round(100*positivePredictivePower,3),'%')18print('Negative Predictive Power : ', round(100*negativePredictivePower,3),'%')
1Accuracy : 80.752 %2sensitivity : 79.555 %3specificity : 81.483 %4False Positive Rate : 18.517 %5False Negative Rate : 18.61 %6Positive Predictive Power / Precision : 72.417 %7Negative Predictive Power : 86.706 %
1## ROC curve2draw_roc(leads['Converted'], leads['Predicted'])
1# Lead Scores2leads['Lead Score'] = leads['Conversion Probability']*1003tab(leads[['Prospect ID','Lead No','Lead Score']].sort_values(by='Lead Score', ascending=False)[:10])
1+------+--------------------------------------+-----------+--------------+2| | Prospect ID | Lead No | Lead Score |3|------+--------------------------------------+-----------+--------------|4| 7219 | ed62264f-7666-4bf9-9cb6-5b9a825f1e67 | 594038 | 99.9118 |5| 7234 | 7e2819e8-97f0-416b-bcb6-45ef14f0e11a | 593962 | 99.8559 |6| 2378 | 8ff353ab-1207-4608-a8cc-8172ea7c12eb | 636860 | 99.8303 |7| 7327 | 95d1590f-7c47-4f40-9806-388f4472d3a4 | 593208 | 99.826 |8| 2497 | e5fb32dd-b3b7-4fbf-972d-c13d2cfc6866 | 635761 | 99.8221 |9| 5671 | 623bc6c9-9184-4437-b38f-d374be49d1a3 | 606508 | 99.8221 |10| 7094 | 9ec1cafe-b019-498e-b246-7ab06167d72c | 595141 | 99.82 |11| 7187 | f33166e8-d8d3-4e8c-b9d0-8a1922c35910 | 594369 | 99.8176 |12| 7420 | 2caa32d0-50b7-4d29-b31f-2528b06d7bc8 | 592625 | 99.8162 |13| 8120 | bf4a03bc-b747-45a6-a6b5-659afa3bf3ac | 587853 | 99.8151 |14+------+--------------------------------------+-----------+--------------+
Score Sheet for X Education
1# Run the following to generate a sheet containing lead information provided by the company and corresponding scores2leads.to_csv('lead_scores.csv')
KS Statistic
1# Gain Chart2y_test_predictions = y_test_predictions.sort_values(by='Conversion Probability', ascending=False)3y_test_predictions['decile'] = pd.qcut(y_test_predictions['Conversion Probability'],10,labels=range(10,0,-1))4y_test_predictions['Converted'] = y_test_predictions['Converted'].astype('int')5y_test_predictions['Un Converted'] = 1 - y_test_predictions['Converted']6tab(y_test_predictions.head())
1+------+-------------+--------------------------+----------+-------------+----------+----------------+2| | Converted | Conversion Probability | CustID | Predicted | decile | Un Converted |3|------+-------------+--------------------------+----------+-------------+----------+----------------|4| 7327 | 1 | 0.99826 | 7327 | 1 | 1 | 0 |5| 7420 | 1 | 0.998162 | 7420 | 1 | 1 | 0 |6| 4613 | 1 | 0.997809 | 4613 | 1 | 1 | 0 |7| 6243 | 1 | 0.997674 | 6243 | 1 | 1 | 0 |8| 7324 | 1 | 0.997402 | 7324 | 1 | 1 | 0 |9+------+-------------+--------------------------+----------+-------------+----------+----------------+
1df1 = pd.pivot_table(data=y_test_predictions,index=['decile'],values=['Converted','Un Converted','Conversion Probability'],2 aggfunc={'Converted':[np.sum],3 'Un Converted':[np.sum],4 'Conversion Probability' : [np.min,np.max]})5df1 = df1.reset_index()6df1.columns = ['Decile','Max Prob', 'Min Prob','Converted Count','Un Converted Count']7df1 = df1.sort_values(by='Decile', ascending=False)8df1['Total Leads'] = df1['Converted Count'] + df1['Un Converted Count']9df1['Conversion Rate'] = df1['Converted Count'] / df1['Un Converted Count']10converted_sum = df1['Converted Count'].sum()11unconverted_sum = df1['Un Converted Count'].sum()12df1['Converted %'] = df1['Converted Count'] / converted_sum13df1['Un Converted %'] = df1['Un Converted Count'] / unconverted_sum14tab(df1.head())
1+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+2| | Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % |3|----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------|4| 9 | 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 |5| 8 | 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 |6| 7 | 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 |7| 6 | 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 |8| 5 | 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 |9+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+
1df1['ks_stats'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum() -(df1['Un Converted Count'] / df1['Un Converted Count'].sum()).cumsum()), 4) * 1002tab(df1)
1+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+2| | Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % | ks_stats |3|----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------|4| 9 | 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 | 24.19 |5| 8 | 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 | 43.26 |6| 7 | 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 | 55.68 |7| 6 | 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 | 58.11 |8| 5 | 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 | 59.76 |9| 4 | 6 | 0.250676 | 0.149579 | 64 | 207 | 271 | 0.309179 | 0.0623174 | 0.123361 | 53.65 |10| 3 | 7 | 0.149564 | 0.113845 | 37 | 233 | 270 | 0.158798 | 0.0360273 | 0.138856 | 43.37 |11| 2 | 8 | 0.113644 | 0.0693867 | 22 | 249 | 271 | 0.0883534 | 0.0214216 | 0.148391 | 30.67 |12| 1 | 9 | 0.0693778 | 0.0292784 | 8 | 256 | 264 | 0.03125 | 0.00778968 | 0.152563 | 16.19 |13| 0 | 10 | 0.0292669 | 0.00223149 | 2 | 275 | 277 | 0.00727273 | 0.00194742 | 0.163886 | 0 |14+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+
- Max KS Statistic is 59.76 for 5th decile
- This model discriminates between Converted and Non-converted leads well since KS Statistic in 4th decile (58.11) is greater than 40%. Hence, this is a reasonable good model.
Gain Chart
1df1['Cum Conversion %'] = np.round(((df1['Converted Count'] / df1['Converted Count'].sum()).cumsum()), 4) * 1002tab(df1)
1+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+2| | Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % | ks_stats | Cum Conversion % |3|----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------|4| 9 | 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 | 24.19 | 25.02 |5| 8 | 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 | 43.26 | 46.84 |6| 7 | 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 | 55.68 | 64.56 |7| 6 | 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 | 58.11 | 76.05 |8| 5 | 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 | 59.76 | 87.05 |9| 4 | 6 | 0.250676 | 0.149579 | 64 | 207 | 271 | 0.309179 | 0.0623174 | 0.123361 | 53.65 | 93.28 |10| 3 | 7 | 0.149564 | 0.113845 | 37 | 233 | 270 | 0.158798 | 0.0360273 | 0.138856 | 43.37 | 96.88 |11| 2 | 8 | 0.113644 | 0.0693867 | 22 | 249 | 271 | 0.0883534 | 0.0214216 | 0.148391 | 30.67 | 99.03 |12| 1 | 9 | 0.0693778 | 0.0292784 | 8 | 256 | 264 | 0.03125 | 0.00778968 | 0.152563 | 16.19 | 99.81 |13| 0 | 10 | 0.0292669 | 0.00223149 | 2 | 275 | 277 | 0.00727273 | 0.00194742 | 0.163886 | 0 | 100 |14+----+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+
1df1['Base %'] = np.arange(10,110,10)2df1 = df1.set_index('Decile')
1tab(df1)
1+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+2| Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % | ks_stats | Cum Conversion % | Base % |3|----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------|4| 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 | 24.19 | 25.02 | 10 |5| 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 | 43.26 | 46.84 | 20 |6| 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 | 55.68 | 64.56 | 30 |7| 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 | 58.11 | 76.05 | 40 |8| 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 | 59.76 | 87.05 | 50 |9| 6 | 0.250676 | 0.149579 | 64 | 207 | 271 | 0.309179 | 0.0623174 | 0.123361 | 53.65 | 93.28 | 60 |10| 7 | 0.149564 | 0.113845 | 37 | 233 | 270 | 0.158798 | 0.0360273 | 0.138856 | 43.37 | 96.88 | 70 |11| 8 | 0.113644 | 0.0693867 | 22 | 249 | 271 | 0.0883534 | 0.0214216 | 0.148391 | 30.67 | 99.03 | 80 |12| 9 | 0.0693778 | 0.0292784 | 8 | 256 | 264 | 0.03125 | 0.00778968 | 0.152563 | 16.19 | 99.81 | 90 |13| 10 | 0.0292669 | 0.00223149 | 2 | 275 | 277 | 0.00727273 | 0.00194742 | 0.163886 | 0 | 100 | 100 |14+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+
1### Gain chart2plot_columns =['Base %','Cum Conversion %']3plt.plot(df1[plot_columns]);4plt.xticks(df1.index);5plt.title('Gain chart');6plt.xlabel('Decile')7plt.ylabel('Cummulative Conversion %')8plt.legend(('Our Model','Random Model'));
- Instead of pursuing leads randomly, pursuing the top 40% leads scored by the model would let the sales team reach 80% of leads likely to convert.
Lift Chart
1df1['Lift'] = df1['Cum Conversion %'] / df1['Base %']2df1['Baseline'] = 13tab(df1)
1+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+---------+------------+2| Decile | Max Prob | Min Prob | Converted Count | Un Converted Count | Total Leads | Conversion Rate | Converted % | Un Converted % | ks_stats | Cum Conversion % | Base % | Lift | Baseline |3|----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+---------+------------|4| 1 | 0.99826 | 0.910258 | 257 | 14 | 271 | 18.3571 | 0.250243 | 0.00834327 | 24.19 | 25.02 | 10 | 2.502 | 1 |5| 2 | 0.909066 | 0.781567 | 224 | 46 | 270 | 4.86957 | 0.218111 | 0.0274136 | 43.26 | 46.84 | 20 | 2.342 | 1 |6| 3 | 0.780757 | 0.564123 | 182 | 89 | 271 | 2.04494 | 0.177215 | 0.0530393 | 55.68 | 64.56 | 30 | 2.152 | 1 |7| 4 | 0.563613 | 0.380673 | 118 | 152 | 270 | 0.776316 | 0.114898 | 0.090584 | 58.11 | 76.05 | 40 | 1.90125 | 1 |8| 5 | 0.380229 | 0.251101 | 113 | 157 | 270 | 0.719745 | 0.110029 | 0.0935638 | 59.76 | 87.05 | 50 | 1.741 | 1 |9| 6 | 0.250676 | 0.149579 | 64 | 207 | 271 | 0.309179 | 0.0623174 | 0.123361 | 53.65 | 93.28 | 60 | 1.55467 | 1 |10| 7 | 0.149564 | 0.113845 | 37 | 233 | 270 | 0.158798 | 0.0360273 | 0.138856 | 43.37 | 96.88 | 70 | 1.384 | 1 |11| 8 | 0.113644 | 0.0693867 | 22 | 249 | 271 | 0.0883534 | 0.0214216 | 0.148391 | 30.67 | 99.03 | 80 | 1.23788 | 1 |12| 9 | 0.0693778 | 0.0292784 | 8 | 256 | 264 | 0.03125 | 0.00778968 | 0.152563 | 16.19 | 99.81 | 90 | 1.109 | 1 |13| 10 | 0.0292669 | 0.00223149 | 2 | 275 | 277 | 0.00727273 | 0.00194742 | 0.163886 | 0 | 100 | 100 | 1 | 1 |14+----------+------------+------------+-------------------+----------------------+---------------+-------------------+---------------+------------------+------------+--------------------+----------+---------+------------+
1# Lift chart2plot_columns =['Lift', 'Baseline']3plt.plot(df1[plot_columns]);4plt.xticks(df1.index);5plt.title('Lift chart');6plt.xlabel('Decile')7plt.ylabel('Lift')8plt.legend(('Our Model','Random Model'));
- The model outperforms a random model by alteast 2 times in identifying the top 40% potentially convertible leads.
- As opposed to 10% conversions from 10% leads pursued randomly, pursuing the top 10% leads scored by this model would lead to 24% conversions.
Conclusion
A logistic regression model is created using lead features. To arrive at the list of features which significantly affect conversion probability, a mixed feature elimination approach is followed. 25 most important features are obtained through Recursive Feature Elimination and then reduced to 15 via p-value / VIF approach. The dataset is randomly divided into train and test set. (70 - 30 split).
The final relationship between log Odds of Conversion Probability and lead features is
logOdds(Conversion Probability)
= -0.6469 - 1.5426 Do Not Email
-1.2699 Unknown Occupation
-0.9057 No Specialization
-0.8704 Hospitality Management
- 0.6584 Outside India
+ 1.7923 SMS Sent
+ 1.1749 Other Last Activity
+ 2.3769 Working Professional
- 0.8614 Olark Chat Conversation
+ 5.3886 Welingak Website
+ 3.0246 Reference
+ 1.1876 Olark Chat
-1.0250 Landing Page Submission
+ 1.1253 Total Time Spent on Website
+ 0.6106 * Email Opened
where Total Time Spent on Website
is standardized to $\mu=0,\sigma=1$
Interpreting Top 6 features affecting Conversion Probability :
- A lead from
Welingak Website
has 5.4 times higher log odds of conversion than those fromGoogle
. - Leads through
Reference
have 3 times higher log odds of conversion than those fromGoogle
. - Leads from
Working Professional
have 2.38 times higher log odds of conversion than those fromBusinessman
. - Leads with
SMS Sent
have 1.8 times higher log odds of conversion than those with no SMS sent. - Leads with
Do Not Email
have 1.5 times lesser log odds of conversion compared to leads who would like email updates. - Leads with
Unknown Occupation
have 1.27 times lesser log odds of conversion compared to those fromBusinessman
.
Lead Scores :
- Score sheet can be generated by running this cell.
At an optimum cut-off probability of 0.36, model performance is as follows.
Model Performance on Training Set :
- Accuracy : 81.7%
- Sensitivity / Recall: 80.393 %
- Specificity : 81.772 %
- Precision / Positive Predictive Power : 72.924 %
- False Positive Rate : 18.228 %
- AUC Score : 0.81
Model Performance for Test Set :
- Accuracy : 79.593 %
- Sensitivity / Recall : 77.605 %
- Specificity : 80.81%
- Precision / Positive Predictive Power : 71.224 %
- False Positive Rate : 19.19 %
- AUC Score : 0.79
KS statistic :
- Max KS Statistic is 59.76 for 5th decile
- This model discriminates between Converted and Non-converted leads well since KS Statistic in 4th decile (58.11) is greater than 40%. Hence, this is a reasonably good model.
Gain :
- Inside of pursuing leads randomly, pursuing the top 40% leads scored by the model would let the sales team reach 80% of leads likely to convert.
Lift :
- The model outperforms a random model by alteast 2 times in identifying the top 40% potentially convertible leads.
- As opposed to 10% conversions from 10% leads pursued randomly, pursuing the top 10% leads scored by this model would lead to 24% conversions.
Note :
- Incorrect data types have been corrected
- Columns with high missing values have been dropped.
- Columns which do not explain variability in the model have been dropped.
- Columns with sales teams notes like
Tags
where the classes are not mutually exclusive have been dropped. - Features with low missing values have been imputed with the most frequent values.
- Categories in a feature with less than 1% contribution have been grouped together to reduce the number of levels.
- Inconsistencies in Categories have been corrected.
- 97.5 % of the leads provided by the company have been used for analysis.
- Class imbalance = 0.6
- Indicator variables have been created for all categorical variables with the first category as the reference.
- Continuous variables have been standardized $\mu : 0 , \sigma = 1$ before modelling.