Countries in need of Aid

Jayanth Boddu
August 31st, 2020 · 8 min read

HELP - Countries to Aid

Clustering

Problem Statement

An NGO wants to use its funding strategically to aid five countries in dire need of help. The analyst’s objective is to use clustering algorithms to group countries by socio-economic and health factors that reflect their overall development. The final deliverable is a recommendation of the 5 countries that need the aid the most.

Analysis Approach & Conclusions

  • The objective of the analysis is to recommend 5 countries in dire need of aid.
  • To achieve this, the following features of 167 countries have been analyzed.
    • GDP per capita
    • Child mortality per 1000 live births
    • Net Income per person
    • Fertility rates at the current prevailing age/fertility rate
    • Inflation index
    • Life Expectancy
    • Imports per capita
    • Exports per capita
    • Total Health Spending per capita
  • The above features are further grouped into health, economic and policy indicators
    • Health Indicators
      • Child mortality per 1000 births
      • Fertility rates at the current prevailing age/fertility index
      • Life Expectancy
    • Economic Indicators
      • Net Income per person
      • Imports per capita
      • Exports per capita
      • Inflation Index
    • Policy Indicators
      • Total Health Spending per capita
  • From the above, the countries in dire need of aid are those with poor economic indicators, poor health indicators and low health spending.
  • At the feature level, the following are general indicators of a need for aid.

| Feature | High / Low |
|---|---|
| Child Mortality | High |
| Life Expectancy | Low |
| Fertility | High |
| Health Spending | Low |
| GDP per capita | Low |
| Inflation Index | High |
| Income per Person | Low |
| Imports | High |
| Exports | Low |

  • Data has been checked for quality. No data quality issues were found: no missing values, no duplicates and no incorrect data types.

  • A new feature ‘Trade Deficit’ has been derived.

    • Trade Deficit = Imports - Exports
    • Trade Deficit or Trade balance is a good indicator of economic health of countries with low GDP per capita
    • Lower or negative is better than higher and positive.
  • Univariate analysis revealed the following information

| Feature | Highest | Lowest |
|---|---|---|
| Child Mortality | Haiti, Sierra Leone | Iceland |
| Life Expectancy | Japan, Singapore | Haiti, Lesotho |
| Total fertility | Chad, Niger | Singapore, South Korea |
| Health Spending | Switzerland, US | Madagascar, Eritrea |
| GDP per capita | Luxembourg, Norway | Burundi, Liberia |
| Inflation Index | Nigeria, Venezuela | Seychelles, Ireland |
| Net Income per person | Qatar, Luxembourg | Liberia, Congo Dem. Rep. |
| Imports per capita | Luxembourg, Singapore | Myanmar, Burundi |
| Exports per capita | Luxembourg, Singapore | Myanmar, Burundi |
| Trade Deficit per capita | Bahamas, Greece | Luxembourg, Qatar |

  • Since there is little data per cluster, a soft range from the 1st to the 99th percentile has been used to identify and cap outliers. Only outliers on the side that does not indicate a need for aid have been capped; extreme values that signal need were left unchanged.
  • Outlier Treatment Summary
| Feature | Upper Outliers | Lower Outliers |
|---|---|---|
| Child Mortality | Not Changed | Capped |
| Life Expectancy | Capped | Not Changed |
| Fertility | Not Changed | Capped |
| Health Spending | Capped | Not Changed |
| GDP per capita | Capped | Not Changed |
| Inflation Index | Not Changed | Capped |
| Income per Person | Capped | Not Changed |
| Imports | Capped | Not Changed |
| Exports | Capped | Not Changed |

  • Bivariate Analysis revealed the following insights
    • There is a negative relationship between child mortality and life expectancy: as child mortality increases, life expectancy decreases. Haiti and Sierra Leone have the highest child mortality, and both have low life expectancy.
    • There is a positive relationship between child mortality and total fertility, e.g. Chad and Mali.
    • Health spending has a positive relationship with GDP per capita, but the situation is dire in low-GDP countries like Eritrea and Madagascar.
    • Child mortality is very highly correlated with both life expectancy and total fertility.
  • Columns : ‘child_mort’, ‘exports’, ‘health’, ‘imports’, ‘income’,‘inflation’, ‘life_expec’, ‘total_fer’, ‘gdpp’,‘trade_deficit’ were used for clustering.

  • The Hopkins statistic was calculated, showing a very high mean clustering tendency of 96% with a standard deviation of 4%.

  • The dataset was standardized with mean = 0 , standard deviation = 1 before clustering.

  • The optimum number of clusters for k-means was found to be 4, from both the elbow curve and the silhouette analysis curve.

  • Clustering with k-means was performed (Cluster centers initialized with k-means++)

  • Hierarchical clustering with single, complete and average linkages, using Euclidean and correlation-based distances, was explored.

  • Mixed clustering was carried out: k-means with 6 clusters, initialised with the centroids of the hierarchical clusters.

  • Finally, Hierarchical Clustering with Complete linkage and correlation based distance measure was chosen to arrive at the results. Choice was made based on interpretability of results.

  • Characteristics of Clusters obtained :

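| Cluster | GDP | Income | Child Mortality |
|---|---|---|---|
| 0 | Very Low | Very Low | Very High |
| 1 | Low | Low | Low |
| 2 | Moderate | Moderate | Moderate |
| 3 | High | High | Low |
| 4 | Very High | Very High | Low |
| 5 | Very Low | Very Low | Low |
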
  • From these characteristics, we see that Cluster 0 is our area of interest.
  • According to the UN goals of 2030, the top priority is health, followed by poverty.
  • Hence, countries in Cluster 0 were ranked by child mortality, followed by income and GDP per capita.
  • By those criteria, the following are the five countries to which HELP should consider extending aid.
    • Haiti
    • Sierra Leone
    • Chad
    • Central African Republic
    • Mali
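
The final clustering choice described above (standardized features, complete linkage, correlation-based distance) could be sketched with SciPy roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's own code: the feature matrix is synthetic random data, the names `X`, `X_scaled` and `labels` are hypothetical, and the dendrogram is cut into 6 clusters only because the cluster table above lists clusters 0 to 5.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Synthetic stand-in for the country feature matrix
# (assumption: 30 observations, 5 features; not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))

# Standardize each feature to mean 0 and standard deviation 1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Complete linkage on correlation-based distances between observations.
Z = linkage(X_scaled, method='complete', metric='correlation')

# Cut the dendrogram into 6 clusters and flatten to one label per observation.
labels = cut_tree(Z, n_clusters=6).ravel()

# Number of observations in each cluster.
print(np.bincount(labels))
```

With the real data, `X` would be the ten clustering columns listed above, and the resulting labels would be attached back to the country names for interpretation.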
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

import warnings
warnings.filterwarnings('ignore')

# notebook shell commands: install tabulate and set the notebook theme (jupyterthemes)
!pip install tabulate
!jt -f roboto -fs 12 -cellw 100%

from tabulate import tabulate

from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show, output_notebook, reset_output
from bokeh.transform import factor_cmap
from bokeh.layouts import row
output_notebook()
# print a Series or DataFrame as a table
def tab(ser):
    print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt='psql'))

Importing data

countries = pd.read_csv('./Country-data.csv')
tab(countries.head())

+----+---------------------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+--------+
| | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|----+---------------------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+--------|
| 0 | Afghanistan | 90.2 | 10 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| 1 | Albania | 16.6 | 28 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| 2 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.1 | 76.5 | 2.89 | 4460 |
| 3 | Angola | 119 | 62.3 | 2.85 | 42.9 | 5900 | 22.4 | 60.1 | 6.16 | 3530 |
| 4 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
+----+---------------------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+--------+
  • In the dataset, exports, imports and health are expressed as a percentage of GDP per capita.
  • Let’s convert them to actual per capita values.
# conversion to actual values
countries['exports'] = 0.01 * countries['exports'] * countries['gdpp']
countries['imports'] = 0.01 * countries['imports'] * countries['gdpp']
countries['health'] = 0.01 * countries['health'] * countries['gdpp']
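
As a quick check of the conversion, the first row of the table above (Afghanistan) can be worked by hand. The sketch below uses only that row; the helper name `to_actual` is hypothetical and is not part of the notebook.

```python
def to_actual(percent_of_gdpp, gdpp):
    """Convert a value given as a percentage of GDP per capita to a per capita amount."""
    return 0.01 * percent_of_gdpp * gdpp

# Afghanistan, first row of the dataset: gdpp = 553
gdpp = 553
exports = to_actual(10, gdpp)     # 10% of 553, about 55.3
imports = to_actual(44.9, gdpp)   # 44.9% of 553, about 248.3
health = to_actual(7.58, gdpp)    # 7.58% of 553, about 41.9

print(exports, imports, health)
```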

Data Quality Checks

# info() prints its summary directly and returns None, so it is not wrapped in tab()
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   country     167 non-null    object
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB
  • 167 countries
  • No apparent missing values
  • No incorrect data types
# taking a closer look to see if any country is duplicated
countries[countries.duplicated(subset=['country'])].index.values

array([], dtype=int64)
  • No duplicate countries
# exports, health and imports were given as a percentage of GDPP,
# so after conversion each is expected to be below GDPP. Checking for anomalies
condition1 = countries['health'] < countries['gdpp']
condition2 = countries['imports'] < countries['gdpp']
condition3 = countries['exports'] < countries['gdpp']

# countries which don't satisfy the above conditions
index = countries[~(condition1 & condition2 & condition3)].index.values
countries.loc[index]

| | country | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|---|---|---|---|---|---|---|---|---|---|---|
| 73 | Ireland | 4.2 | 50161.00 | 4475.53 | 42125.5 | 45700 | -3.220 | 80.4 | 2.05 | 48700 |
| 87 | Lesotho | 99.7 | 460.98 | 129.87 | 1181.7 | 2380 | 4.150 | 46.5 | 3.30 | 1170 |
| 91 | Luxembourg | 2.8 | 183750.00 | 8158.50 | 149100.0 | 91700 | 3.620 | 81.3 | 1.63 | 105000 |
| 98 | Malta | 6.8 | 32283.00 | 1825.15 | 32494.0 | 28300 | 3.830 | 80.3 | 1.36 | 21100 |
| 131 | Seychelles | 14.4 | 10130.40 | 367.20 | 11664.0 | 20400 | -4.210 | 73.4 | 2.17 | 10800 |
| 133 | Singapore | 2.8 | 93200.00 | 1845.36 | 81084.0 | 72100 | -0.046 | 82.7 | 1.15 | 46600 |

  • Some countries have imports or exports greater than their GDP per capita.
# child mortality rate is calculated per 1000 live births. Let's check if it's above 1000
countries[countries['child_mort'] > 1000].index.values

array([], dtype=int64)
  • Sanity checks completed.
  • No Data Quality issues found.
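
The individual checks above could also be gathered into a single helper. The following is only a sketch: `quality_report` is a hypothetical function that does not appear in the notebook, and the toy frame contains made-up rows with one deliberate duplicate and one deliberate missing value.

```python
import pandas as pd

def quality_report(df, key='country'):
    """Count common data quality problems in a DataFrame (hypothetical helper)."""
    features = df.drop(columns=[key])
    return {
        'missing_values': int(df.isna().sum().sum()),
        'duplicate_keys': int(df.duplicated(subset=[key]).sum()),
        'non_numeric_columns': [c for c in features.columns
                                if not pd.api.types.is_numeric_dtype(features[c])],
    }

# Toy frame (made-up rows): country 'B' is duplicated, one child_mort is missing.
toy = pd.DataFrame({
    'country': ['A', 'B', 'B'],
    'child_mort': [10.0, None, 30.0],
    'gdpp': [500, 1200, 1200],
})

report = quality_report(toy)
print(report)
```

On the actual `countries` frame, all three entries would be expected to come back empty or zero, in line with the findings above.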

Exploratory Data Analysis

# summary statistics
tab(countries.describe())

+-------+--------------+--------------+-----------+---------------+----------+-------------+--------------+-------------+----------+
| | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|-------+--------------+--------------+-----------+---------------+----------+-------------+--------------+-------------+----------|
| count | 167 | 167 | 167 | 167 | 167 | 167 | 167 | 167 | 167 |
| mean | 38.2701 | 7420.62 | 1056.73 | 6588.35 | 17144.7 | 7.78183 | 70.5557 | 2.94796 | 12964.2 |
| std | 40.3289 | 17973.9 | 1801.41 | 14710.8 | 19278.1 | 10.5707 | 8.89317 | 1.51385 | 18328.7 |
| min | 2.6 | 1.07692 | 12.8212 | 0.651092 | 609 | -4.21 | 32.1 | 1.15 | 231 |
| 25% | 8.25 | 447.14 | 78.5355 | 640.215 | 3355 | 1.81 | 65.3 | 1.795 | 1330 |
| 50% | 19.3 | 1777.44 | 321.886 | 2045.58 | 9960 | 5.39 | 73.1 | 2.41 | 4660 |
| 75% | 62.1 | 7278 | 976.94 | 7719.6 | 22800 | 10.75 | 76.8 | 3.88 | 14050 |
| max | 208 | 183750 | 8663.6 | 149100 | 125000 | 104 | 82.8 | 7.49 | 105000 |
+-------+--------------+--------------+-----------+---------------+----------+-------------+--------------+-------------+----------+
  • The maximum values of most attributes are much higher than their 75% values.
# let's look at the upper quantiles to take a closer look at outliers
# numeric_only=True skips the text column `country`
tab(countries.quantile(np.linspace(0.75, 1, 25), numeric_only=True).reset_index())

+----+----------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+----------+
| | index | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp |
|----+----------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+----------|
| 0 | 0.75 | 62.1 | 7278 | 976.94 | 7719.6 | 22800 | 10.75 | 76.8 | 3.88 | 14050 |
| 1 | 0.760417 | 62.2917 | 7846.5 | 1004.03 | 7977.36 | 23581.2 | 11.1229 | 76.9458 | 4.11667 | 16137.5 |
| 2 | 0.770833 | 62.6958 | 7935.28 | 1012.41 | 8220.78 | 27116.7 | 11.5833 | 77.4833 | 4.26875 | 17079.2 |
| 3 | 0.78125 | 63.6688 | 9400.55 | 1028.18 | 8366.03 | 28300 | 12.1 | 77.8688 | 4.36062 | 19300 |
| 4 | 0.791667 | 64.1083 | 9937.67 | 1141.28 | 9561.67 | 28700 | 12.3833 | 78.025 | 4.53083 | 20175 |
| 5 | 0.802083 | 67.5438 | 10220.6 | 1276.05 | 9904.05 | 29600 | 12.6313 | 78.2729 | 4.60146 | 21245.8 |
| 6 | 0.8125 | 74.35 | 10655.8 | 1436.87 | 10029.1 | 30300 | 13.75 | 79.05 | 4.6625 | 22450 |
| 7 | 0.822917 | 78.0292 | 10815.5 | 1548.88 | 10070.1 | 32420.8 | 14.1208 | 79.5 | 4.8225 | 25514.6 |
| 8 | 0.833333 | 80.5333 | 10933.1 | 1829.69 | 10318.9 | 33766.7 | 14.5 | 79.6 | 4.90333 | 28866.7 |
| 9 | 0.84375 | 83.4188 | 11075.8 | 1867.65 | 10882.2 | 35825 | 15.1125 | 79.8063 | 4.9825 | 30706.2 |
| 10 | 0.854167 | 89.0708 | 12677.1 | 2207.69 | 11610.8 | 36200 | 15.5375 | 79.9792 | 5.04375 | 33095.8 |
| 11 | 0.864583 | 90.2521 | 13445.8 | 2407.81 | 11848.4 | 37889.6 | 16.0042 | 80.0521 | 5.08604 | 35156.3 |
| 12 | 0.875 | 90.9 | 14457.8 | 2810.22 | 12290.6 | 39950 | 16.2 | 80.15 | 5.2025 | 36475 |
| 13 | 0.885417 | 93.5687 | 15038.4 | 3393.81 | 12905.2 | 40693.7 | 16.5979 | 80.3 | 5.21 | 38891.7 |
| 14 | 0.895833 | 99.0292 | 17034 | 3651.31 | 14711.4 | 41100 | 16.6 | 80.4 | 5.29833 | 41450 |
| 15 | 0.90625 | 104.062 | 19846.1 | 4024.48 | 16043.1 | 42056.2 | 16.9187 | 80.4437 | 5.34875 | 42993.8 |
| 16 | 0.916667 | 109.333 | 23836.8 | 4265.13 | 17350.7 | 43333.3 | 17.4833 | 80.7333 | 5.405 | 44783.3 |
| 17 | 0.927083 | 111 | 24069.1 | 4525.11 | 18097.6 | 45164.6 | 19.4375 | 80.9896 | 5.54646 | 46558.3 |
| 18 | 0.9375 | 112.875 | 26626.7 | 4801.18 | 21864.3 | 45462.5 | 20.2875 | 81.3 | 5.77875 | 47212.5 |
| 19 | 0.947917 | 116 | 30350 | 4908.45 | 23340.7 | 47010.4 | 20.8354 | 81.4 | 5.85062 | 48506.2 |
| 20 | 0.958333 | 119.333 | 33999.5 | 5175.43 | 25846.6 | 55675 | 22.4333 | 81.5167 | 6.15083 | 50433.3 |
| 21 | 0.96875 | 128.688 | 35961.1 | 5867.67 | 32399.7 | 61418.8 | 23.45 | 81.8625 | 6.21688 | 52062.5 |
| 22 | 0.979167 | 143.5 | 45934.9 | 7449.69 | 36739.1 | 73779.2 | 25.7667 | 82 | 6.41167 | 64662.5 |
| 23 | 0.989583 | 152.708 | 61817.4 | 8392.65 | 52676.8 | 83606.2 | 41.0146 | 82.3354 | 6.56083 | 78175 |
| 24 | 1 | 208 | 183750 | 8663.6 | 149100 | 125000 | 104 | 82.8 | 7.49 | 105000 |
+----+----------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+----------+
  • exports, child_mort, imports, income, inflation, life_expec, total_fer and gdpp have outliers.
  • These outliers shall be treated in the univariate analysis.
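
The one-sided capping used throughout the next section could be expressed as a small reusable helper. This is a sketch under assumptions: `cap_outliers` is a hypothetical function (the notebook caps each column inline instead), and the toy data is made up so that a single extreme value stands out.

```python
import pandas as pd

def cap_outliers(df, column, lower=False, upper=False, q_low=0.01, q_high=0.99):
    """Return a copy of df with one or both tails of `column` capped at the given quantiles."""
    out = df.copy()
    values = out[column].astype(float)

    # thresholds are computed once, before any capping
    low = values.quantile(q_low)
    high = values.quantile(q_high)

    if lower:
        values = values.clip(lower=low)
    if upper:
        values = values.clip(upper=high)

    out[column] = values
    return out

# Toy data (made up): 1..99 plus one extreme value
toy = pd.DataFrame({'value': list(range(1, 100)) + [10_000]})
capped = cap_outliers(toy, 'value', upper=True)

print(toy['value'].max(), capped['value'].max())
```

Capping only one tail keeps the extreme values that signal a need for aid, while still limiting the pull of very prosperous countries on the distance calculations.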

Univariate Analysis / Outlier Treatment

# function to perform outlier analysis
def outlier_analysis(column):
    '''
    This function shows a violin plot and a box plot of the column provided.
    It also prints five major quantiles, the lower outlier threshold value,
    the upper outlier threshold value and tables of countries which are outliers.
    Input : column name
    Output : lower outlier threshold condition, upper outlier threshold condition
    Side effects : violin plot, box plot, outlier tables
    '''
    plt.figure(figsize=[12, 6])
    plt.subplot(121)
    plt.title('Violin Plot of ' + column)
    sns.violinplot(x=countries[column])

    plt.subplot(122)
    plt.title('Box Plot of ' + column)
    sns.boxplot(x=countries[column])

    print('Quantiles\n')
    # tab() prints the table itself; wrapping it in print() would also print "None"
    tab(countries[column].quantile([.1, 0.25, .50, 0.75, 0.99]))

    lower_outlier_threshold = countries[column].quantile(0.01)
    upper_outlier_threshold = countries[column].quantile(0.99)

    print('\n\nLOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR', column, ': ', lower_outlier_threshold)
    l_condition = countries[column] < lower_outlier_threshold
    l_outliers = countries[l_condition][['country', column]].sort_values(by=column)

    if l_outliers.shape[0]:
        print('\n\nLower Outliers : ')
        tab(l_outliers)
    else:
        print('No lower outliers found in ' + column)

    print('\n\nUPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR', column, ': ', upper_outlier_threshold)
    u_condition = countries[column] > upper_outlier_threshold
    u_outliers = countries[u_condition][['country', column]].sort_values(by=column)

    if u_outliers.shape[0]:
        print('\n\nUpper Outliers : ')
        tab(u_outliers)
    print('\n\n')

    return l_condition, u_condition

Child Mortality

# Countries with outliers in child mortality
column = 'child_mort'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+--------------+
| | child_mort |
|------+--------------|
| 0.1 | 4.2 |
| 0.25 | 8.25 |
| 0.5 | 19.3 |
| 0.75 | 62.1 |
| 0.99 | 153.4 |
+------+--------------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR child_mort : 2.8

Lower Outliers :
+----+-----------+--------------+
| | country | child_mort |
|----+-----------+--------------|
| 68 | Iceland | 2.6 |
+----+-----------+--------------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR child_mort : 153.40000000000003

Upper Outliers :
+-----+--------------+--------------+
| | country | child_mort |
|-----+--------------+--------------|
| 132 | Sierra Leone | 160 |
| 66 | Haiti | 208 |
+-----+--------------+--------------+

[Figure: violin plot and box plot of child_mort]

  • Notice that half of the countries have a child mortality below 20.
  • Further, countries with child mortality below the 1st percentile might not need aid at all. We shall cap their child mortality at the 1st percentile value.
  • Countries with extremely high child mortality, the upper outliers (> 99th percentile), are the perfect candidates for aid. Let’s keep them as they are for further analysis.
# capping lower outliers in `child_mort`
countries.loc[l_condition, column] = countries[column].quantile(0.01)

Life Expectancy

# LIFE EXPECTANCY
column = 'life_expec'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+--------------+
| | life_expec |
|------+--------------|
| 0.1 | 57.82 |
| 0.25 | 65.3 |
| 0.5 | 73.1 |
| 0.75 | 76.8 |
| 0.99 | 82.37 |
+------+--------------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR life_expec : 47.160000000000004

Lower Outliers :
+----+-----------+--------------+
| | country | life_expec |
|----+-----------+--------------|
| 66 | Haiti | 32.1 |
| 87 | Lesotho | 46.5 |
+----+-----------+--------------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR life_expec : 82.37

Upper Outliers :
+-----+-----------+--------------+
| | country | life_expec |
|-----+-----------+--------------|
| 133 | Singapore | 82.7 |
| 77 | Japan | 82.8 |
+-----+-----------+--------------+

[Figure: violin plot and box plot of life_expec]

  • Some of the lowest life expectancy is seen in Haiti and Lesotho.
  • About 50% of the countries have a life expectancy of 73 or below and the other 50% have above 73.
  • Countries like Singapore and Japan have the highest life expectancy, better than 99% of the countries.
  • Countries with very low life expectancy are possible candidates for aid.
  • Hence, let’s keep the lower outliers.
  • On the other hand, countries like Singapore and Japan might not need aid, but these values would skew our analysis. Let’s cap these outliers.
# Capping upper outliers in life expectancy
countries.loc[u_condition, column] = countries[column].quantile(0.99)

Fertility

column = 'total_fer'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-------------+
| | total_fer |
|------+-------------|
| 0.1 | 1.452 |
| 0.25 | 1.795 |
| 0.5 | 2.41 |
| 0.75 | 3.88 |
| 0.99 | 6.5636 |
+------+-------------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR total_fer : 1.2431999999999999

Lower Outliers :
+-----+-------------+-------------+
| | country | total_fer |
|-----+-------------+-------------|
| 133 | Singapore | 1.15 |
| 138 | South Korea | 1.23 |
+-----+-------------+-------------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR total_fer : 6.563599999999999

Upper Outliers :
+-----+-----------+-------------+
| | country | total_fer |
|-----+-----------+-------------|
| 32 | Chad | 6.59 |
| 112 | Niger | 7.49 |
+-----+-----------+-------------+

[Figure: violin plot and box plot of total_fer]

  • About 50% of the countries have a fertility of 2.41 or less.
  • Lower fertility is seen in developed nations like Singapore and South Korea, where fertility is lower than in 99% of the countries.
  • Countries with higher total fertility might need the aid more, since this means more health risk.
  • Fertility in Chad and Niger is higher than in 99 percent of the countries.
  • Since these countries might need the aid, let’s leave these values for further analysis.
  • Countries with fertility rates below the 1st percentile look like developed nations. Let’s cap these outliers so that they don’t skew our analysis.
  • Further, from the violin plot, notice that fertility has two peaks, around 2 and 5. This indicates that fertility rate could be used to effectively segregate countries. More analysis is needed here.
# capping lower outliers in fertility at the 1st percentile (the lower outlier threshold)
countries.loc[l_condition, column] = countries[column].quantile(0.01)

Health Spending

# health spending per capita
column = 'health'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-----------+
| | health |
|------+-----------|
| 0.1 | 36.5026 |
| 0.25 | 78.5355 |
| 0.5 | 321.886 |
| 0.75 | 976.94 |
| 0.99 | 8410.33 |
+------+-----------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR health : 17.009362000000003

Lower Outliers :
+----+------------+----------+
| | country | health |
|----+------------+----------|
| 50 | Eritrea | 12.8212 |
| 93 | Madagascar | 15.5701 |
+----+------------+----------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR health : 8410.3304

Upper Outliers :
+-----+---------------+----------+
| | country | health |
|-----+---------------+----------|
| 145 | Switzerland | 8579 |
| 159 | United States | 8663.6 |
+-----+---------------+----------+

[Figure: violin plot and box plot of health]

  • Countries with higher health spending are less in need of aid.
  • Switzerland and the United States, for example, spend more on health than 99% of the countries. Values like these would skew the analysis of aid-needing countries, so we cap them at the 99th percentile value.
  • Low health spending could have a variety of causes, such as good general health or bad economic conditions. Let’s keep these values for further analysis.
# capping upper outliers in health spending
countries.loc[u_condition, column] = countries[column].quantile(0.99)

GDP per capita

# GDP per capita
column = 'gdpp'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+---------+
| | gdpp |
|------+---------|
| 0.1 | 593.8 |
| 0.25 | 1330 |
| 0.5 | 4660 |
| 0.75 | 14050 |
| 0.99 | 79088 |
+------+---------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR gdpp : 331.62

Lower Outliers :
+----+-----------+--------+
| | country | gdpp |
|----+-----------+--------|
| 26 | Burundi | 231 |
| 88 | Liberia | 327 |
+----+-----------+--------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR gdpp : 79088.00000000004

Upper Outliers :
+-----+------------+--------+
| | country | gdpp |
|-----+------------+--------|
| 114 | Norway | 87800 |
| 91 | Luxembourg | 105000 |
+-----+------------+--------+

[Figure: violin plot and box plot of gdpp]

  • GDP per capita is a very good indicator of a country’s prosperity.
  • Notice that there are three peaks in the violin plot of GDP. These might indicate three clusters: underdeveloped, developing and developed nations.
  • Underdeveloped nations are the most in need of aid. Countries like Burundi and Liberia have a GDP per capita lower than 99% of the countries. Even though they are outliers, let’s keep them for further analysis.
  • But we could cap GDPs which are greater than those of 99% of the countries (Luxembourg & Norway).
# capping upper outliers in gdpp
# gdpp is stored as int64; cast to float so the fractional percentile value can be assigned
countries[column] = countries[column].astype(float)
countries.loc[u_condition, column] = countries[column].quantile(.99)

Inflation Index

column = 'inflation'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-------------+
| | inflation |
|------+-------------|
| 0.1 | 0.5878 |
| 0.25 | 1.81 |
| 0.5 | 5.39 |
| 0.75 | 10.75 |
| 0.99 | 41.478 |
+------+-------------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR inflation : -2.3487999999999998

Lower Outliers :
+-----+------------+-------------+
| | country | inflation |
|-----+------------+-------------|
| 131 | Seychelles | -4.21 |
| 73 | Ireland | -3.22 |
+-----+------------+-------------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR inflation : 41.47800000000002

Upper Outliers :
+-----+-----------+-------------+
| | country | inflation |
|-----+-----------+-------------|
| 163 | Venezuela | 45.9 |
| 113 | Nigeria | 104 |
+-----+-----------+-------------+

[Figure: violin plot and box plot of inflation]

  • Very high inflation indicates a bad economic state.
  • The violin plot again shows three peaks, indicating three possible clusters of inflation, each with progressively fewer countries.
  • Countries with very high inflation might need aid. Since this is our area of interest, let’s keep the upper outliers in inflation as they are for further analysis.
  • Let’s cap lower outliers since these look like good economies with no need of aid.
# capping lower outliers in inflation
countries.loc[l_condition, column] = countries[column].quantile(0.01)

Net Income per person

# income
column = 'income'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+----------+
| | income |
|------+----------|
| 0.1 | 1524 |
| 0.25 | 3355 |
| 0.5 | 9960 |
| 0.75 | 22800 |
| 0.99 | 84374 |
+------+----------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR income : 742.24

Lower Outliers :
+----+------------------+----------+
| | country | income |
|----+------------------+----------|
| 37 | Congo, Dem. Rep. | 609 |
| 88 | Liberia | 700 |
+----+------------------+----------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR income : 84374.00000000003

Upper Outliers :
+-----+------------+----------+
| | country | income |
|-----+------------+----------|
| 91 | Luxembourg | 91700 |
| 123 | Qatar | 125000 |
+-----+------------+----------+

[Figure: violin plot and box plot of income]

  • High net income per person is an indicator of the general prosperity of a country. Such countries do not need aid.
  • Hence, we shall cap the upper outliers, i.e. net incomes above the 99th percentile.
  • Lower outliers, countries with net incomes below the 1st percentile, are our area of interest. Let’s leave these values as they are for further analysis.
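
No capping code for income appears at this point, although the bullets above describe it. A sketch of what that step might look like, following the pattern used for the other columns, is given below. It is an assumption rather than the author's code: it runs on a made-up toy frame (`toy`), not on `countries`, and casts the column to float before assigning the percentile value.

```python
import pandas as pd

# Made-up income figures; only the capping pattern matters here.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'income': [700, 3355, 9960, 22800, 125000],
})

column = 'income'
# cast to float first so the fractional percentile value can be assigned cleanly
toy[column] = toy[column].astype(float)

upper_threshold = toy[column].quantile(0.99)
u_condition = toy[column] > upper_threshold

# capping upper outliers in income
toy.loc[u_condition, column] = upper_threshold
print(toy)
```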

Imports per capita

column = 'imports'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-----------+
| | imports |
|------+-----------|
| 0.1 | 211.006 |
| 0.25 | 640.215 |
| 0.5 | 2045.58 |
| 0.75 | 7719.6 |
| 0.99 | 55371.4 |
+------+-----------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR imports : 104.90964000000002

Lower Outliers :
+-----+-----------+-----------+
| | country | imports |
|-----+-----------+-----------|
| 107 | Myanmar | 0.651092 |
| 26 | Burundi | 90.552 |
+-----+-----------+-----------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR imports : 55371.39000000013

Upper Outliers :
+-----+------------+-----------+
| | country | imports |
|-----+------------+-----------|
| 133 | Singapore | 81084 |
| 91 | Luxembourg | 149100 |
+-----+------------+-----------+

[Figure: violin plot and box plot of imports]

Exports per capita

column = 'exports'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-----------+
| | exports |
|------+-----------|
| 0.1 | 110.225 |
| 0.25 | 447.14 |
| 0.5 | 1777.44 |
| 0.75 | 7278 |
| 0.99 | 64794.3 |
+------+-----------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR exports : 22.243716

Lower Outliers :
+-----+-----------+-----------+
| | country | exports |
|-----+-----------+-----------|
| 107 | Myanmar | 1.07692 |
| 26 | Burundi | 20.6052 |
+-----+-----------+-----------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR exports : 64794.26000000014

Upper Outliers :
+-----+------------+-----------+
| | country | exports |
|-----+------------+-----------|
| 133 | Singapore | 93200 |
| 91 | Luxembourg | 183750 |
+-----+------------+-----------+

[Figure: violin plot and box plot of exports]

Trade Deficit per capita

# trade deficit
countries['trade_deficit'] = countries['imports'] - countries['exports']
column = 'trade_deficit'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-----------------+
| | trade_deficit |
|------+-----------------|
| 0.1 | -3000.82 |
| 0.25 | -327.05 |
| 0.5 | 89.2182 |
| 0.75 | 518.57 |
| 0.99 | 2270.5 |
+------+-----------------+

LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR trade_deficit : -18426.1

Lower Outliers :
+-----+------------+-----------------+
| | country | trade_deficit |
|-----+------------+-----------------|
| 91 | Luxembourg | -34650 |
| 123 | Qatar | -27065.5 |
+-----+------------+-----------------+

UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR trade_deficit : 2270.500000000002

Upper Outliers :
+----+-----------+-----------------+
| | country | trade_deficit |
|----+-----------+-----------------|
| 60 | Greece | 2313.4 |
| 10 | Bahamas | 2436 |
+----+-----------+-----------------+

[Figure: violin plot and box plot of trade_deficit]

  • Trade deficit indicates the balance between imports and exports. A negative trade deficit is a favourable indicator of economic health: it means that the country has higher exports than imports.
  • When the trade deficit is positive, the country is import predominant.
  • From the above plot, Greece and Bahamas have a trade deficit higher than 99% of the countries. This could mean a bad economic state. Since this is our key area of interest, we leave these outliers as they are.
  • Countries with a strongly negative trade deficit, i.e. export-predominant countries below the 1st percentile, are capped at the 1st percentile value.
# capping lower outliers of trade_deficit
countries.loc[l_condition, column] = countries[column].quantile(0.01)
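
The sign convention is easy to mix up, so a short worked example may help. The sketch below applies the definition (imports minus exports) to per capita values of Luxembourg and Lesotho taken from the tables earlier in this post; the function name `trade_deficit` is hypothetical.

```python
def trade_deficit(imports, exports):
    """Trade deficit per capita: positive when imports exceed exports."""
    return imports - exports

# Per capita values (after conversion) from the sanity-check table above
luxembourg = trade_deficit(imports=149100.0, exports=183750.0)
lesotho = trade_deficit(imports=1181.7, exports=460.98)

for name, value in [('Luxembourg', luxembourg), ('Lesotho', lesotho)]:
    kind = 'export predominant' if value < 0 else 'import predominant'
    print(name, round(value, 2), kind)
```

Luxembourg's value matches the -34650 shown in the lower outlier table, which is why it falls on the side that is capped.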

Bivariate Analysis

Pairplot

# Pair Plots of all variables
sns.pairplot(countries[['child_mort', 'exports', 'health', 'imports', 'income',
                        'inflation', 'life_expec', 'total_fer', 'gdpp']]);

[Figure: pair plot of all variables]

## bivariate analysis Bokeh function
def bivariate_analysis(x_var, y_var, dataframe=countries):
    # Bivariate plots with tooltips : country, x, y

    dataframe = dataframe.copy()
    source = ColumnDataSource(dataframe)

    # colour palette (kept for reference; the scatter below uses a single fill colour)
    # palette = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    palette = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']

    # tooltips
    tooltips1 = [
        ("Country", "@country"),
        (x_var, '@' + x_var),
        (y_var, '@' + y_var)
    ]

    # note: Bokeh 3.x renamed plot_width / plot_height to width / height
    p = figure(plot_width=420, plot_height=400, title='Scatter Plot : ' + x_var + ' vs ' + y_var, tooltips=tooltips1)

    p.scatter(x=x_var, y=y_var, fill_alpha=0.6, size=8, source=source, fill_color="#3498db")
    p.xaxis.axis_label = x_var
    p.yaxis.axis_label = y_var
    reset_output()
    output_notebook()
    show(p)

Child Mortality vs Life Expectancy

# CHILD MORTALITY vs LIFE EXPECTANCY

x = 'child_mort'
y = 'life_expec'
bivariate_analysis(x, y)

[Figure: interactive scatter plot of child_mort vs life_expec]
  • It looks like child_mort and life_expec are almost linearly related.
  • Life expectancy decreases as child mortality increases.
  • One of these features is enough for analysis.
# Countries with high child mortality and low life expectancy
life_expect_cond = countries['life_expec'] < 60
child_mort_cond = countries['child_mort'] > 100

print('Countries with low life expectancy and high Child Mortality Rate')
tab(countries[life_expect_cond & child_mort_cond][['country','child_mort','life_expec']].sort_values(by=['child_mort','life_expec'], ascending=[False,True])[:10])

Countries with low life expectancy and high Child Mortality Rate
+-----+--------------------------+--------------+--------------+
| | country | child_mort | life_expec |
|-----+--------------------------+--------------+--------------|
| 66 | Haiti | 208 | 32.1 |
| 132 | Sierra Leone | 160 | 55 |
| 32 | Chad | 150 | 56.5 |
| 31 | Central African Republic | 149 | 47.5 |
| 97 | Mali | 137 | 59.5 |
| 112 | Niger | 123 | 58.8 |
| 37 | Congo, Dem. Rep. | 116 | 57.5 |
| 25 | Burkina Faso | 116 | 57.9 |
| 64 | Guinea-Bissau | 114 | 55.6 |
| 40 | Cote d'Ivoire | 111 | 56.3 |
+-----+--------------------------+--------------+--------------+

Child Mortality vs Total Fertility

# CHILD MORTALITY vs TOTAL FERTILITY

x='child_mort'
y = 'total_fer'
bivariate_analysis(x,y)
# countries with high fertility and child mortality rate
total_fer_cond = countries['total_fer'] > 6
child_mort_cond = countries['child_mort'] > 100

print('Countries with High Total Fertility and Child Mortality Rate')
tab(countries[total_fer_cond & child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))

Countries with High Total Fertility and Child Mortality Rate
+-----+------------------+-------------+--------------+
| | country | total_fer | child_mort |
|-----+------------------+-------------+--------------|
| 32 | Chad | 6.59 | 150 |
| 97 | Mali | 6.55 | 137 |
| 112 | Niger | 7.49 | 123 |
| 3 | Angola | 6.16 | 119 |
| 37 | Congo, Dem. Rep. | 6.54 | 116 |
+-----+------------------+-------------+--------------+
# Fertility in countries with extremely high mortality rate
child_mort_cond = countries['child_mort'] > 150

print('Fertility in Countries with Extremely High Child Mortality Rate')
tab(countries[child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))

Fertility in Countries with Extremely High Child Mortality Rate
+-----+--------------+-------------+--------------+
| | country | total_fer | child_mort |
|-----+--------------+-------------+--------------|
| 66 | Haiti | 3.33 | 208 |
| 132 | Sierra Leone | 5.2 | 160 |
+-----+--------------+-------------+--------------+

GDP per capita vs Health Spending

# GDP per capita vs Health Spending
y='health'
x= 'gdpp'
bivariate_analysis(x,y)
# Countries with Low GDPP and health spending might need aid
health_cond = countries['health'] < 200
gdpp_cond = countries['gdpp'] < 2500
print('Countries with Low GDP and low Health Spending')
tab(countries[health_cond & gdpp_cond][['country','health','gdpp']].sort_values(by=['health','gdpp'], ascending=[True,True])[:10])

Countries with Low GDP and low Health Spending
+-----+--------------------------+----------+--------+
| | country | health | gdpp |
|-----+--------------------------+----------+--------|
| 50 | Eritrea | 12.8212 | 482 |
| 93 | Madagascar | 15.5701 | 413 |
| 31 | Central African Republic | 17.7508 | 446 |
| 112 | Niger | 17.9568 | 348 |
| 107 | Myanmar | 19.4636 | 988 |
| 106 | Mozambique | 21.8299 | 419 |
| 116 | Pakistan | 22.88 | 1040 |
| 37 | Congo, Dem. Rep. | 26.4194 | 334 |
| 12 | Bangladesh | 26.6816 | 758 |
| 26 | Burundi | 26.796 | 231 |
+-----+--------------------------+----------+--------+

Imports vs GDP per capita

# IMPORTS vs GDPP
y='imports'
x= 'gdpp'
bivariate_analysis(x,y)

Exports vs GDP per capita

# Exports vs GDPP
y='exports'
x= 'gdpp'
bivariate_analysis(x,y)

GDP per capita vs Trade Deficit

y='trade_deficit'
x= 'gdpp'
bivariate_analysis(x,y)
  • Trade deficit is a country's imports minus its exports.
  • It indicates the dependence of a country on imports.
  • Countries with low GDP per capita and a positive trade deficit are likely to require aid.
  • From the above plot, one can see that there are many such countries.
# Countries with positive high trade deficit and low GDP per capita
tr_deficit_cond = countries['trade_deficit'] > 1000
gdpp_cond = countries['gdpp'] < 10000
print('Countries with high Trade Deficit and Low GDP')
tab(countries[tr_deficit_cond & gdpp_cond][['country','gdpp','trade_deficit']].sort_values(by=['trade_deficit','gdpp'], ascending=[True,False])[:10])

Countries with high Trade Deficit and Low GDP
+-----+--------------------------------+--------+-----------------+
| | country | gdpp | trade_deficit |
|-----+--------------------------------+--------+-----------------|
| 101 | Micronesia, Fed. Sts. | 2860 | 1644.5 |
| 151 | Tonga | 3550 | 1700.45 |
| 104 | Montenegro | 6680 | 1716.76 |
| 61 | Grenada | 7370 | 1871.98 |
| 141 | St. Vincent and the Grenadines | 6230 | 1881.46 |
| 86 | Lebanon | 8860 | 2161.84 |
+-----+--------------------------------+--------+-----------------+

Net Income per person vs Inflation

# Income vs Inflation
x = 'inflation'
y = 'income'
bivariate_analysis(x,y)
# Low Income - High inflation countries might require aid
inflation_cond = countries['inflation'] > 20
income_condition = countries['income'] < 10000
print('Countries with high inflation and low income')
tab(countries[inflation_cond & income_condition][['country','inflation','income']].sort_values(by=['inflation','income'], ascending=[False, True]))

Countries with high inflation and low income
+-----+------------------+-------------+----------+
| | country | inflation | income |
|-----+------------------+-------------+----------|
| 113 | Nigeria | 104 | 5150 |
| 103 | Mongolia | 39.2 | 7710 |
| 149 | Timor-Leste | 26.5 | 1850 |
| 165 | Yemen | 23.6 | 4480 |
| 140 | Sri Lanka | 22.8 | 8560 |
| 3 | Angola | 22.4 | 5900 |
| 37 | Congo, Dem. Rep. | 20.8 | 609 |
| 38 | Congo, Rep. | 20.7 | 5190 |
+-----+------------------+-------------+----------+
  • Countries with High inflation and low income are possible candidates for aid requirement.

GDP per capita vs Inflation

x = 'gdpp'
y = 'inflation'
bivariate_analysis(x,y)
  • Countries with low GDP per capita and high inflation are in dire need of support.
  • For example, Nigeria has an inflation above 100 while its GDP per capita is 2330.

Correlation Analysis

plt.figure(figsize=[12,12])
# keep only numeric columns so that the 'country' name column does not break corr()
sns.heatmap(countries.select_dtypes(include='number').corr(),annot=True,cmap='YlGnBu', center=0);

png

Top Correlations

  • Negative Correlation between life_expec and child_mort
  • Positive Correlation between total_fer and child_mort
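
Reading the strongest pairs off a heatmap by eye can be error prone, so a small helper can list them directly. The sketch below is an assumption-laden illustration: `top_correlations` is a hypothetical helper, and the data is synthetic rather than the countries dataset.

```python
import numpy as np

def top_correlations(data, names, k=3):
    """Return the k feature pairs with the largest absolute Pearson correlation."""
    corr = np.corrcoef(data, rowvar=False)        # features are columns
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):        # upper triangle only, skip the diagonal
            pairs.append((names[i], names[j], corr[i, j]))
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:k]

# Synthetic stand-in data (not the real dataset): b falls as a rises, c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = -a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)
data = np.column_stack([a, b, c])

top = top_correlations(data, ['a', 'b', 'c'], k=1)
print(top)
```

Sorting by absolute value keeps strong negative relationships, such as the one between life_expec and child_mort, at the top of the list.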

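Correlated features do not break clustering the way multicollinearity destabilizes regression coefficients, although strongly correlated features do carry extra weight in distance computations. This plot shows the possible linear relationships between features, which helps in interpreting the results obtained from cluster analysis.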

Hopkins Statistic

# hopkins test function
import numpy as np
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan

def hopkins(X):
    d = X.shape[1]  # columns
    n = len(X)  # rows
    m = int(0.1 * n)  # number of sampled points
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)

    rand_X = sample(range(0, n, 1), m)

    ujd = []  # distances from uniform random points to the data
    wjd = []  # distances from sampled data points to the rest of the data
    for j in range(0, m):
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        # index 1 is the second-nearest data point to the random point
        ujd.append(u_dist[0][1])
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        # index 0 is the sampled point itself (distance 0), so index 1 is its nearest neighbour
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H

## Data used for Clustering
columns_for_clustering = ['child_mort', 'exports', 'health', 'imports', 'income',
                          'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit']
clustering_data = countries[columns_for_clustering].copy()

# Hopkin's test
n = 10
hopkins_statistic = []
for i in range(n) :
    hopkins_statistic.append(hopkins(clustering_data))
print('Min Hopkin\'s Statistic in ',n,'iterations :', min(hopkins_statistic))
print('Max Hopkin\'s Statistic in ',n,'iterations :', max(hopkins_statistic))
print('Mean Hopkin\'s Statistic in ',n,'iterations :', np.mean(hopkins_statistic))
print('Std deviation of Hopkin\'s Statistic in ',n,'iterations :', np.std(hopkins_statistic))

Min Hopkin's Statistic in 10 iterations : 0.8594851780754723
Max Hopkin's Statistic in 10 iterations : 0.9789539440735269
Mean Hopkin's Statistic in 10 iterations : 0.9482587172561626
Std deviation of Hopkin's Statistic in 10 iterations : 0.03487962411406433

Since the Hopkins statistic is above 0.8 in all 10 iterations (mean of about 0.95), the data shows a good clustering tendency.
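
For intuition about what the statistic measures, here is a minimal NumPy-only sketch. It is not the function used above: `hopkins_statistic` is a hypothetical name, it uses the nearest data point for each uniform random point, and it is run on synthetic data. Values near 0.5 suggest no cluster structure, values near 1 suggest strong clustering.

```python
import numpy as np

def hopkins_statistic(X, sample_frac=0.1, seed=0):
    """Hopkins statistic: about 0.5 for uniform data, close to 1 for clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_frac * n))

    # u: distance from each uniform random point to its nearest real point
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, d))
    u = np.sqrt(((U[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)).min(axis=1)

    # w: distance from each sampled real point to its nearest *other* real point
    idx = rng.choice(n, size=m, replace=False)
    dist = np.sqrt(((X[idx][:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    dist[np.arange(m), idx] = np.inf          # ignore the zero distance to itself
    w = dist.min(axis=1)

    return u.sum() / (u.sum() + w.sum())

# Synthetic examples: two tight blobs versus a structureless uniform cloud.
rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(0, 0.05, size=(150, 2)),
                   rng.normal(5, 0.05, size=(150, 2))])
uniform_cloud = rng.uniform(0, 1, size=(300, 2))

h_clustered = hopkins_statistic(blobs)
h_uniform = hopkins_statistic(uniform_cloud)
print(round(h_clustered, 2), round(h_uniform, 2))
```

Because the sample is random, the value changes from run to run, which is why the post reports the minimum, maximum and mean over several iterations.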

Standardizing Values

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
clustering_data[columns_for_clustering] = scaler.fit_transform(clustering_data[columns_for_clustering])

tab(clustering_data.describe())

+-------+---------------+--------------+---------------+---------------+---------------+---------------+--------------+---------------+---------------+-----------------+
| | child_mort | exports | health | imports | income | inflation | life_expec | total_fer | gdpp | trade_deficit |
|-------+---------------+--------------+---------------+---------------+---------------+---------------+--------------+---------------+---------------+-----------------|
| count | 167 | 167 | 167 | 167 | 167 | 167 | 167 | 167 | 167 | 167 |
| mean | -7.97765e-18 | 9.1743e-17 | 2.26033e-17 | 4.65363e-17 | -7.51229e-17 | 8.31005e-17 | 3.7229e-17 | 8.24357e-17 | 8.04413e-17 | 6.18268e-17 |
| std | 1.00301 | 1.00301 | 1.00301 | 1.00301 | 1.00301 | 1.00301 | 1.00301 | 1.00301 | 1.00301 | 1.00301 |
| min | -0.882217 | -0.414037 | -0.583254 | -0.44916 | -0.860326 | -0.964355 | -4.33969 | -1.12962 | -0.720789 | -5.67561 |
| 25% | -0.746668 | -0.389145 | -0.546449 | -0.405554 | -0.717456 | -0.569109 | -0.592657 | -0.767708 | -0.657548 | 0.113986 |
| 50% | -0.47184 | -0.31491 | -0.410154 | -0.309734 | -0.373808 | -0.228871 | 0.287671 | -0.359318 | -0.465925 | 0.247143 |
| 75% | 0.592652 | -0.00795865 | -0.0432751 | 0.0771304 | 0.294237 | 0.280535 | 0.705262 | 0.616834 | 0.0744147 | 0.384486 |
| max | 4.22138 | 9.83981 | 4.11998 | 9.71668 | 5.61154 | 9.14287 | 1.33391 | 3.01405 | 3.81697 | 0.997842 |
+-------+---------------+--------------+---------------+---------------+---------------+---------------+--------------+---------------+---------------+-----------------+
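
The standard deviation in the table is 1.00301 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1), so the ratio is sqrt(n / (n - 1)). A small sketch on a synthetic feature with n = 167 illustrates this:

```python
import numpy as np

n = 167                                        # number of countries in the dataset
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=n)   # synthetic feature, not real data

# StandardScaler-style z-score: population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

# describe()-style spread: sample standard deviation (ddof=1)
sample_std = z.std(ddof=1)

print(round(sample_std, 5))                    # 1.00301
print(round(float(np.sqrt(n / (n - 1))), 5))   # 1.00301
```

The scaling is therefore correct; the small excess over 1 is only a difference in convention between the two libraries.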

K-Means Clustering

Finding Optimal Number of Clusters

Elbow curve

# Plotting the elbow curve: sum of squared distances (SSD) of points from the centroid of their own (nearest) cluster.
ssd = []
range_n_clusters = np.arange(2,9)
for num_clusters in range_n_clusters :
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(clustering_data)
    ssd.append(kmeans.inertia_)
plt.plot(range_n_clusters,ssd)
plt.title('Elbow Curve');
plt.xlabel('No of clusters');
plt.ylabel('SSD');

png

  • From the above elbow curve, one can see that the SSD drops steeply from k=2 to k=4 and then the curve tapers off (the change in slope is not as significant as earlier).
  • Hence, k=4 is the optimum number of clusters according to the elbow criterion.
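
The visual reading of the elbow can be cross-checked numerically by looking for the k where the improvement in SSD falls off the most. The sketch below uses a hypothetical helper, `elbow_k`, and made-up SSD values chosen only to mimic the shape described above; they are not the values from the plot.

```python
def elbow_k(ks, ssd):
    """Pick the k where the SSD curve bends most (largest second difference)."""
    best_k, best_bend = None, float('-inf')
    for i in range(1, len(ks) - 1):
        drop_before = ssd[i - 1] - ssd[i]      # improvement gained by reaching k
        drop_after = ssd[i] - ssd[i + 1]       # improvement gained by going past k
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

ks = [2, 3, 4, 5, 6, 7, 8]
ssd = [1100.0, 850.0, 600.0, 540.0, 490.0, 450.0, 420.0]   # made-up values
print(elbow_k(ks, ssd))   # 4
```

This is one heuristic among several; when the curve has no sharp bend, the silhouette analysis is usually the more informative check.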

Silhouette Analysis

from sklearn.metrics import silhouette_score

no_of_clusters = np.arange(2,10)
score = []

for n_cluster in no_of_clusters :
    kmeans = KMeans(n_clusters=n_cluster, init='k-means++')
    kmeans = kmeans.fit(clustering_data)
    labels = kmeans.labels_
    score.append(silhouette_score(clustering_data,labels))


plt.title('Silhouette Analysis Plot')
plt.xlabel('No of Clusters')
plt.ylabel('Silhouette Score')
plt.plot(no_of_clusters, score);
print(score)

[0.46329893684299267, 0.40889975765795494, 0.4113959420115351, 0.40485071262968453, 0.41404360987563693, 0.31039501753497534, 0.302668856614785, 0.30746584335288263]

png

  • The higher the silhouette score, the better.
  • However, from the above plot, we see that the silhouette score is highest for k = 2 (0.463), falls sharply at k = 3 and has local maxima at k = 4 (0.411) and k = 6 (0.414).
  • Taken together with the elbow curve, k = 4 seems to be the optimum number of clusters.
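
The peaks can also be read directly from the printed scores instead of the plot. The sketch below copies the scores above (rounded to four decimals) and uses a hypothetical helper, `local_maxima`, to list the interior values of k that beat both neighbours.

```python
# Silhouette scores printed above for k = 2..9, rounded to four decimals
ks = list(range(2, 10))
scores = [0.4633, 0.4089, 0.4114, 0.4049, 0.4140, 0.3104, 0.3027, 0.3075]

def local_maxima(ks, scores):
    """Return the interior k values whose score beats both neighbours."""
    return [ks[i] for i in range(1, len(ks) - 1)
            if scores[i] > scores[i - 1] and scores[i] > scores[i + 1]]

best_overall = ks[scores.index(max(scores))]
print(best_overall)               # 2
print(local_maxima(ks, scores))   # [4, 6]
```

Since k = 2 is too coarse to rank countries by need, the candidates are the local maxima, and the elbow curve is what tips the choice towards k = 4.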

Final k - Means Clustering

# k - means clustering algo with k = 4
n_cluster = 4
kmeans = KMeans(n_clusters=n_cluster, init='k-means++', random_state = 100)
kmeans = kmeans.fit(clustering_data)
labels = kmeans.labels_
countries['k_means_cluster_id'] = labels

# Countries in each Cluster - k means
for cluster_no in range(n_cluster) :
    condition = countries['k_means_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Australia' 'Austria' 'Belgium' 'Brunei' 'Canada' 'Cyprus' 'Denmark'
 'Finland' 'France' 'Germany' 'Greece' 'Iceland' 'Ireland' 'Israel'
 'Italy' 'Japan' 'Kuwait' 'Malta' 'Netherlands' 'New Zealand' 'Norway'
 'Slovenia' 'Spain' 'Sweden' 'Switzerland' 'United Arab Emirates'
 'United Kingdom' 'United States']


CLUSTER # 1
 ['Afghanistan' 'Angola' 'Benin' 'Botswana' 'Burkina Faso' 'Burundi'
 'Cameroon' 'Central African Republic' 'Chad' 'Comoros' 'Congo, Dem. Rep.'
 'Congo, Rep.' "Cote d'Ivoire" 'Equatorial Guinea' 'Eritrea' 'Gabon'
 'Gambia' 'Ghana' 'Guinea' 'Guinea-Bissau' 'Haiti' 'Iraq' 'Kenya'
 'Kiribati' 'Lao' 'Lesotho' 'Liberia' 'Madagascar' 'Malawi' 'Mali'
 'Mauritania' 'Mozambique' 'Namibia' 'Niger' 'Nigeria' 'Pakistan' 'Rwanda'
 'Senegal' 'Sierra Leone' 'Solomon Islands' 'South Africa' 'Sudan'
 'Tanzania' 'Timor-Leste' 'Togo' 'Uganda' 'Yemen' 'Zambia']


CLUSTER # 2
 ['Albania' 'Algeria' 'Antigua and Barbuda' 'Argentina' 'Armenia'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belize' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria'
 'Cambodia' 'Cape Verde' 'Chile' 'China' 'Colombia' 'Costa Rica' 'Croatia'
 'Czech Republic' 'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador'
 'Estonia' 'Fiji' 'Georgia' 'Grenada' 'Guatemala' 'Guyana' 'Hungary'
 'India' 'Indonesia' 'Iran' 'Jamaica' 'Jordan' 'Kazakhstan'
 'Kyrgyz Republic' 'Latvia' 'Lebanon' 'Libya' 'Lithuania' 'Macedonia, FYR'
 'Malaysia' 'Maldives' 'Mauritius' 'Micronesia, Fed. Sts.' 'Moldova'
 'Mongolia' 'Montenegro' 'Morocco' 'Myanmar' 'Nepal' 'Oman' 'Panama'
 'Paraguay' 'Peru' 'Philippines' 'Poland' 'Portugal' 'Romania' 'Russia'
 'Samoa' 'Saudi Arabia' 'Serbia' 'Seychelles' 'Slovak Republic'
 'South Korea' 'Sri Lanka' 'St. Vincent and the Grenadines' 'Suriname'
 'Tajikistan' 'Thailand' 'Tonga' 'Tunisia' 'Turkey' 'Turkmenistan'
 'Ukraine' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela' 'Vietnam']


CLUSTER # 3
 ['Luxembourg' 'Qatar' 'Singapore']

Hierarchical Clustering

HAC : Single Linkage, Euclidean Measure

# Agglomerative Single Linkage
mergings = linkage(clustering_data,method='single',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Single Linkage - Hierarchical Clustering')
dendrogram(mergings);

png

HAC : Complete Linkage, Euclidean Measure

# Complete Linkage
mergings = linkage(clustering_data,method='complete',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Complete Linkage - Hierarchical Clustering')
dendrogram(mergings);

png

  • Hierarchical clustering with complete linkage produces a more discriminative dendrogram than single linkage.
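
The linkage methods differ only in how the distance between two clusters is defined, which helps explain why the dendrograms look so different. A minimal sketch on two made-up one-dimensional clusters (the helper names are hypothetical):

```python
import numpy as np

def pairwise(a, b):
    """Euclidean distances between every point in a and every point in b."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))

def linkage_distances(a, b):
    d = pairwise(a, b)
    return {'single': float(d.min()),      # closest pair
            'complete': float(d.max()),    # farthest pair
            'average': float(d.mean())}    # mean over all pairs

# Two toy clusters (made-up points): the closest pair is 1 apart, the farthest 5 apart.
cluster_a = np.array([[0.0], [2.0]])
cluster_b = np.array([[3.0], [5.0]])
print(linkage_distances(cluster_a, cluster_b))
# {'single': 1.0, 'complete': 5.0, 'average': 3.0}
```

Single linkage merges on the closest pair, so one bridging point is enough to chain clusters together; complete linkage merges on the farthest pair and tends to give more compact, better separated groups.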
# Using Complete Linkage, cutting the tree for 6 clusters
n_clusters = 6
# flatten the column vector returned by cut_tree before storing it as a column
cluster_labels = cut_tree(mergings, n_clusters=n_clusters).ravel()
countries['hac_complete_cluster_id'] = cluster_labels

# Countries in each Cluster - Hierarchical - Complete Linkage
for cluster_no in range(n_clusters) :
    condition = countries['hac_complete_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Afghanistan' 'Angola' 'Benin' 'Botswana' 'Burkina Faso' 'Burundi'
 'Cameroon' 'Central African Republic' 'Chad' 'Comoros' 'Congo, Dem. Rep.'
 'Congo, Rep.' "Cote d'Ivoire" 'Equatorial Guinea' 'Eritrea' 'Gabon'
 'Gambia' 'Ghana' 'Guinea' 'Guinea-Bissau' 'Haiti' 'Iraq' 'Kenya'
 'Kiribati' 'Lao' 'Lesotho' 'Liberia' 'Madagascar' 'Malawi' 'Mali'
 'Mauritania' 'Mozambique' 'Namibia' 'Niger' 'Pakistan' 'Rwanda' 'Senegal'
 'Sierra Leone' 'Solomon Islands' 'South Africa' 'Sudan' 'Tanzania'
 'Timor-Leste' 'Togo' 'Uganda' 'Yemen' 'Zambia']


CLUSTER # 1
 ['Albania' 'Algeria' 'Antigua and Barbuda' 'Argentina' 'Armenia'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belize' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria'
 'Cambodia' 'Cape Verde' 'Chile' 'China' 'Colombia' 'Costa Rica' 'Croatia'
 'Cyprus' 'Czech Republic' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Estonia' 'Fiji' 'Georgia' 'Greece' 'Grenada' 'Guatemala'
 'Guyana' 'Hungary' 'India' 'Indonesia' 'Iran' 'Israel' 'Italy' 'Jamaica'
 'Jordan' 'Kazakhstan' 'Kyrgyz Republic' 'Latvia' 'Lebanon' 'Libya'
 'Lithuania' 'Macedonia, FYR' 'Malaysia' 'Maldives' 'Malta' 'Mauritius'
 'Micronesia, Fed. Sts.' 'Moldova' 'Mongolia' 'Montenegro' 'Morocco'
 'Myanmar' 'Nepal' 'New Zealand' 'Oman' 'Panama' 'Paraguay' 'Peru'
 'Philippines' 'Poland' 'Portugal' 'Romania' 'Russia' 'Samoa'
 'Saudi Arabia' 'Serbia' 'Seychelles' 'Slovak Republic' 'Slovenia'
 'South Korea' 'Spain' 'Sri Lanka' 'St. Vincent and the Grenadines'
 'Suriname' 'Tajikistan' 'Thailand' 'Tonga' 'Tunisia' 'Turkey'
 'Turkmenistan' 'Ukraine' 'United Arab Emirates' 'Uruguay' 'Uzbekistan'
 'Vanuatu' 'Venezuela' 'Vietnam']


CLUSTER # 2
 ['Australia' 'Austria' 'Belgium' 'Canada' 'Denmark' 'Finland' 'France'
 'Germany' 'Iceland' 'Ireland' 'Japan' 'Netherlands' 'Norway' 'Sweden'
 'Switzerland' 'United Kingdom' 'United States']


CLUSTER # 3
 ['Brunei' 'Kuwait' 'Qatar' 'Singapore']


CLUSTER # 4
 ['Luxembourg']


CLUSTER # 5
 ['Nigeria']

HAC : Average Linkage, Euclidean Measure

# Average Linkage
mergings = linkage(clustering_data,method='average',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Average Linkage - Hierarchical Clustering')
dendrogram(mergings);

png

# Using Average Linkage, cutting the tree for 6 clusters
n_clusters = 6
# flatten the column vector returned by cut_tree before storing it as a column
cluster_labels = cut_tree(mergings, n_clusters=n_clusters).ravel()
countries['hac_average_cluster_id'] = cluster_labels

# Countries in each Cluster - Hierarchical - average Linkage
for cluster_no in range(n_clusters) :
    condition = countries['hac_average_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Antigua and Barbuda'
 'Argentina' 'Armenia' 'Australia' 'Austria' 'Azerbaijan' 'Bahamas'
 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin'
 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Botswana' 'Brazil'
 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
 'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Costa Rica' "Cote d'Ivoire"
 'Croatia' 'Cyprus' 'Czech Republic' 'Denmark' 'Dominican Republic'
 'Ecuador' 'Egypt' 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia'
 'Fiji' 'Finland' 'France' 'Gabon' 'Gambia' 'Georgia' 'Germany' 'Ghana'
 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau' 'Guyana'
 'Hungary' 'Iceland' 'India' 'Indonesia' 'Iran' 'Iraq' 'Israel' 'Italy'
 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya' 'Kiribati'
 'Kyrgyz Republic' 'Lao' 'Latvia' 'Lebanon' 'Lesotho' 'Liberia' 'Libya'
 'Lithuania' 'Macedonia, FYR' 'Madagascar' 'Malawi' 'Malaysia' 'Maldives'
 'Mali' 'Malta' 'Mauritania' 'Mauritius' 'Micronesia, Fed. Sts.' 'Moldova'
 'Mongolia' 'Montenegro' 'Morocco' 'Mozambique' 'Myanmar' 'Namibia'
 'Nepal' 'Netherlands' 'New Zealand' 'Niger' 'Oman' 'Pakistan' 'Panama'
 'Paraguay' 'Peru' 'Philippines' 'Poland' 'Portugal' 'Romania' 'Russia'
 'Rwanda' 'Samoa' 'Saudi Arabia' 'Senegal' 'Serbia' 'Seychelles'
 'Sierra Leone' 'Slovak Republic' 'Slovenia' 'Solomon Islands'
 'South Africa' 'South Korea' 'Spain' 'Sri Lanka'
 'St. Vincent and the Grenadines' 'Sudan' 'Suriname' 'Sweden' 'Tajikistan'
 'Tanzania' 'Thailand' 'Timor-Leste' 'Togo' 'Tonga' 'Tunisia' 'Turkey'
 'Turkmenistan' 'Uganda' 'Ukraine' 'United Arab Emirates' 'United Kingdom'
 'United States' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela' 'Vietnam'
 'Yemen' 'Zambia']


CLUSTER # 1
 ['Brunei' 'Ireland' 'Kuwait' 'Norway' 'Qatar' 'Switzerland']


CLUSTER # 2
 ['Haiti']


CLUSTER # 3
 ['Luxembourg']


CLUSTER # 4
 ['Nigeria']


CLUSTER # 5
 ['Singapore']

HAC : Complete Linkage , Correlation Measure

## HAC Clustering : Dissimilarity Measure : Correlation
hac_correlation_mergings = linkage(clustering_data,method='complete', metric='correlation')
plt.figure(figsize=[12,12])
plt.title('Hierarchical Clustering : Complete Linkage, Correlation Measure')
dendrogram(hac_correlation_mergings);

png

n_clusters = 6
# flatten the column vector returned by cut_tree before storing it as a column
labels = cut_tree(hac_correlation_mergings, n_clusters=n_clusters).ravel()
countries['hac_correlation_cluster_id'] = labels

# HAC clustering : Correlation measure : complete distance : Countries in each cluster

for cluster_no in range(n_clusters) :
    condition = countries['hac_correlation_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Afghanistan' 'Angola' 'Benin' 'Botswana' 'Burkina Faso' 'Burundi'
 'Cameroon' 'Central African Republic' 'Chad' 'Comoros' 'Congo, Dem. Rep.'
 'Congo, Rep.' "Cote d'Ivoire" 'Equatorial Guinea' 'Eritrea' 'Gabon'
 'Gambia' 'Ghana' 'Guinea' 'Guinea-Bissau' 'Haiti' 'India' 'Iraq' 'Kenya'
 'Kiribati' 'Lao' 'Lesotho' 'Liberia' 'Madagascar' 'Malawi' 'Mali'
 'Mauritania' 'Mozambique' 'Myanmar' 'Namibia' 'Niger' 'Pakistan' 'Rwanda'
 'Senegal' 'Sierra Leone' 'South Africa' 'Sudan' 'Tanzania' 'Timor-Leste'
 'Togo' 'Turkmenistan' 'Uganda' 'Yemen' 'Zambia']


CLUSTER # 1
 ['Albania' 'Belize' 'Cape Verde' 'Colombia' 'Dominican Republic' 'Ecuador'
 'El Salvador' 'Grenada' 'Maldives' 'Morocco' 'Panama' 'Paraguay' 'Peru'
 'St. Vincent and the Grenadines' 'Thailand' 'Tunisia']


CLUSTER # 2
 ['Algeria' 'Argentina' 'Armenia' 'Azerbaijan' 'Belarus' 'Georgia' 'Iran'
 'Jamaica' 'Kazakhstan' 'Moldova' 'Mongolia' 'Nigeria' 'Russia'
 'Sri Lanka' 'Suriname' 'Ukraine' 'Venezuela' 'Vietnam']


CLUSTER # 3
 ['Antigua and Barbuda' 'Australia' 'Bahamas' 'Barbados'
 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria' 'Canada' 'Chile' 'China'
 'Costa Rica' 'Croatia' 'Cyprus' 'Czech Republic' 'Estonia' 'France'
 'Greece' 'Hungary' 'Israel' 'Italy' 'Japan' 'Latvia' 'Lebanon'
 'Lithuania' 'Macedonia, FYR' 'Malaysia' 'Malta' 'Mauritius' 'Montenegro'
 'New Zealand' 'Poland' 'Portugal' 'Romania' 'Serbia' 'Seychelles'
 'Slovak Republic' 'Slovenia' 'South Korea' 'Spain' 'Turkey'
 'United Kingdom' 'United States' 'Uruguay']


CLUSTER # 4
 ['Austria' 'Bahrain' 'Belgium' 'Brunei' 'Denmark' 'Finland' 'Germany'
 'Iceland' 'Ireland' 'Kuwait' 'Libya' 'Luxembourg' 'Netherlands' 'Norway'
 'Oman' 'Qatar' 'Saudi Arabia' 'Singapore' 'Sweden' 'Switzerland'
 'United Arab Emirates']


CLUSTER # 5
 ['Bangladesh' 'Bhutan' 'Bolivia' 'Cambodia' 'Egypt' 'Fiji' 'Guatemala'
 'Guyana' 'Indonesia' 'Jordan' 'Kyrgyz Republic' 'Micronesia, Fed. Sts.'
 'Nepal' 'Philippines' 'Samoa' 'Solomon Islands' 'Tajikistan' 'Tonga'
 'Uzbekistan' 'Vanuatu']
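
The correlation measure groups countries by the shape of their feature profile rather than by its magnitude, which may explain why these clusters differ so much from the Euclidean ones. A minimal sketch of the distance, 1 minus the Pearson correlation between two rows, on made-up profiles (`correlation_distance` is a hypothetical helper):

```python
import numpy as np

def correlation_distance(u, v):
    """1 - Pearson correlation between two feature vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc, vc = u - u.mean(), v - v.mean()
    return 1.0 - (uc @ vc) / (np.linalg.norm(uc) * np.linalg.norm(vc))

profile = np.array([1.0, 3.0, 2.0, 5.0])   # made-up standardized feature profile
scaled = 2.0 * profile + 3.0               # same shape, different level and scale
flipped = -profile                         # mirrored shape

d_scaled = correlation_distance(profile, scaled)
d_flipped = correlation_distance(profile, flipped)
d_euclid = float(np.linalg.norm(profile - scaled))
print(d_scaled, d_flipped, d_euclid)
```

The scaled profile is essentially at distance 0 under the correlation measure even though it is far away in Euclidean terms, while the mirrored profile sits at the maximum distance of 2.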

Mixed K-Means Clustering

# Performing k-means using results of Hierarchical clustering
# 1. No of clusters of Hierarchical Clustering
# 2. Centroids obtained from Hierarchical Clustering as the initialization points.
clustering_data['k_means_cluster_id'] = countries['k_means_cluster_id']
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
columns = columns_for_clustering.copy()
columns.extend(['hac_correlation_cluster_id'])
centroids = clustering_data[columns].groupby(['hac_correlation_cluster_id']).mean()
n_clusters = 6
# n_init=1 : with explicit initial centroids, a single run is all that is needed
mixed_kmeans = KMeans(n_clusters=6 , init = centroids.values, n_init=1, random_state=100)
results = mixed_kmeans.fit(clustering_data[columns_for_clustering])

countries['mixed_cluster_id'] = results.labels_
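
The idea of seeding k-means with centroids from a hierarchical clustering can be sketched in plain NumPy. This is a toy illustration under my own assumptions: the data and labels are made up, the helper names are hypothetical, only a single k-means iteration is shown, and no cluster becomes empty.

```python
import numpy as np

def centroids_from_labels(X, labels):
    """Mean feature vector of each cluster, ordered by cluster id."""
    return np.vstack([X[labels == c].mean(axis=0) for c in np.unique(labels)])

def lloyd_step(X, centroids):
    """One k-means iteration: assign to nearest centroid, then recompute means."""
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    new_labels = d.argmin(axis=1)
    # assumes every cluster keeps at least one point
    new_centroids = np.vstack([X[new_labels == c].mean(axis=0)
                               for c in range(len(centroids))])
    return new_labels, new_centroids

# Toy data (made up): two groups, with one point mislabeled by the "hierarchical" step.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
hac_labels = np.array([0, 0, 1, 1, 1, 1])    # third point sits in the wrong cluster

init = centroids_from_labels(X, hac_labels)
labels, centroids = lloyd_step(X, init)
print(labels)   # [0 0 0 1 1 1]
```

Starting from the hierarchical centroids, the k-means step reassigns points by Euclidean distance, so memberships can shift noticeably when the hierarchy was built with a different measure such as correlation.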

# Mixed clustering : Euclidean measure : k-means : Countries in each cluster

for cluster_no in range(n_clusters) :
    condition = countries['mixed_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Afghanistan' 'Angola' 'Benin' 'Burkina Faso' 'Burundi' 'Cameroon'
 'Central African Republic' 'Chad' 'Comoros' 'Congo, Dem. Rep.'
 'Congo, Rep.' "Cote d'Ivoire" 'Equatorial Guinea' 'Gambia' 'Guinea'
 'Guinea-Bissau' 'Haiti' 'Lesotho' 'Liberia' 'Malawi' 'Mali' 'Mauritania'
 'Mozambique' 'Niger' 'Sierra Leone' 'Sudan' 'Tanzania' 'Timor-Leste'
 'Togo' 'Uganda' 'Zambia']


CLUSTER # 1
 ['Albania' 'Algeria' 'Antigua and Barbuda' 'Argentina' 'Armenia'
 'Azerbaijan' 'Bahrain' 'Barbados' 'Belarus' 'Belize'
 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria' 'Cape Verde' 'Chile' 'China'
 'Colombia' 'Costa Rica' 'Croatia' 'Czech Republic' 'Dominican Republic'
 'Ecuador' 'El Salvador' 'Estonia' 'Georgia' 'Grenada' 'Hungary' 'Iran'
 'Jamaica' 'Jordan' 'Kazakhstan' 'Latvia' 'Lebanon' 'Libya' 'Lithuania'
 'Macedonia, FYR' 'Malaysia' 'Maldives' 'Mauritius' 'Moldova' 'Montenegro'
 'Morocco' 'Oman' 'Panama' 'Paraguay' 'Peru' 'Poland' 'Romania' 'Russia'
 'Saudi Arabia' 'Serbia' 'Seychelles' 'Slovak Republic' 'South Korea'
 'Sri Lanka' 'St. Vincent and the Grenadines' 'Suriname' 'Thailand'
 'Tunisia' 'Turkey' 'Ukraine' 'Uruguay' 'Vietnam']


CLUSTER # 2
 ['Mongolia' 'Nigeria' 'Venezuela']


CLUSTER # 3
 ['Australia' 'Austria' 'Bahamas' 'Belgium' 'Canada' 'Cyprus' 'Denmark'
 'Finland' 'France' 'Germany' 'Greece' 'Iceland' 'Israel' 'Italy' 'Japan'
 'Malta' 'Netherlands' 'New Zealand' 'Portugal' 'Slovenia' 'Spain'
 'Sweden' 'United Arab Emirates' 'United Kingdom' 'United States']


CLUSTER # 4
 ['Brunei' 'Ireland' 'Kuwait' 'Luxembourg' 'Norway' 'Qatar' 'Singapore'
 'Switzerland']


CLUSTER # 5
 ['Bangladesh' 'Bhutan' 'Bolivia' 'Botswana' 'Cambodia' 'Egypt' 'Eritrea'
 'Fiji' 'Gabon' 'Ghana' 'Guatemala' 'Guyana' 'India' 'Indonesia' 'Iraq'
 'Kenya' 'Kiribati' 'Kyrgyz Republic' 'Lao' 'Madagascar'
 'Micronesia, Fed. Sts.' 'Myanmar' 'Namibia' 'Nepal' 'Pakistan'
 'Philippines' 'Rwanda' 'Samoa' 'Senegal' 'Solomon Islands' 'South Africa'
 'Tajikistan' 'Tonga' 'Turkmenistan' 'Uzbekistan' 'Vanuatu' 'Yemen']

# silhouette scores of all the methods
print('Mixed Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['mixed_cluster_id']))
print('K-means Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['k_means_cluster_id']))
print('Hierarchical Correlation Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['hac_correlation_cluster_id']))

Mixed Clustering 0.295025902261296
K-means Clustering 0.4113959420115351
Hierarchical Correlation Clustering 0.06816978634185641
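
One possible reason for the very low score of the correlation-based clustering is that `silhouette_score` uses Euclidean distance by default, while those clusters were formed with a correlation distance. The sketch below computes a silhouette score from an arbitrary distance matrix to show how much the choice of measure can matter. It is a from-scratch illustration on made-up profiles, with hypothetical helper names, and not a re-evaluation of the results above.

```python
import numpy as np

def silhouette_from_distances(D, labels):
    """Mean silhouette score given a full pairwise distance matrix D."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = []
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:                      # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = D[i, own].sum() / (n_own - 1)   # mean distance to own cluster, excluding self
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def euclidean_matrix(X):
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

def correlation_matrix_distance(X):
    return 1.0 - np.corrcoef(X)             # rows are observations

# Made-up profiles: rows 0-1 rise, rows 2-3 fall; levels differ within each pair.
X = np.array([[1.0, 2.0, 3.0, 4.5],
              [11.0, 12.5, 13.0, 14.0],
              [4.0, 3.2, 2.0, 1.0],
              [14.0, 13.0, 12.4, 11.0]])
labels = [0, 0, 1, 1]

s_euclid = silhouette_from_distances(euclidean_matrix(X), labels)
s_corr = silhouette_from_distances(correlation_matrix_distance(X), labels)
print(round(s_euclid, 3), round(s_corr, 3))
```

The same labels look poor under Euclidean distance and excellent under correlation distance, so the three scores above are only directly comparable for the methods that cluster on Euclidean distance.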

Cluster Profiling

# Cluster Profiling - Plots using Bokeh

cluster_id_column = 'hierarchical-c-link-cluster-id'
title="Hierarchical Clustering"

def cluster_analysis_plot(cluster_id_column,title,x_var='income',y_var='child_mort',z_var='gdpp',dataframe=countries) :
    # Draws three scatter plots side by side : z vs x, z vs y, x vs y
    # works up to 6 clusters

    dataframe = dataframe.copy()
    dataframe[cluster_id_column] = dataframe[cluster_id_column].astype('str')
    source = ColumnDataSource(dataframe)

    cluster_ids = sorted(dataframe[cluster_id_column].unique())

    # pallete = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    pallete = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e","#38895e",'rgba(42, 157, 143, 1)']

    mapper = factor_cmap(cluster_id_column,palette=pallete[:len(cluster_ids)], factors = cluster_ids)

    # plot 1
    tooltips1 = [
        ("Country", "@country"),
        (z_var,'@'+z_var),
        (x_var,'@'+x_var),
        ('Cluster', '@'+cluster_id_column)
    ]

    p = figure(plot_width=420, plot_height=400,title=title+ ' : '+ z_var+' vs '+x_var, tooltips=tooltips1, toolbar_location=None)
    for num,index in enumerate(cluster_ids) :
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        p.scatter(x=z_var,y=x_var, fill_alpha=0.6, size=8, source=source, fill_color = pallete[num] , muted_alpha=0.1, legend_label=index)
    p.xaxis.axis_label = z_var
    p.yaxis.axis_label = x_var
    p.legend.click_policy="mute"
    # ----------------------------
    # Plot 2

    tooltips2 = [
        ("Country", "@country"),
        (z_var,'@'+z_var),
        (y_var,'@'+y_var),
        ('Cluster', '@'+cluster_id_column)
    ]

    q = figure(plot_width=420, plot_height=400,title=title+ ' : '+ z_var+' vs '+y_var, tooltips=tooltips2, toolbar_location=None)

    for num,index in enumerate(cluster_ids) :
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        q.scatter(x=z_var,y=y_var, fill_alpha=0.6, size=8, source=source, fill_color = pallete[num] , muted_alpha=0.1, legend_label=index)
    q.xaxis.axis_label = z_var
    q.yaxis.axis_label = y_var
    q.legend.click_policy="mute"

    # ----------------------------
    # Plot 3

    tooltips3 = [
        ("Country", "@country"),
        (x_var,'@'+x_var),
        (y_var,'@'+y_var),
        ('Cluster', '@'+cluster_id_column)
    ]

    r = figure(plot_width=420, plot_height=400,title=title+ ' : '+ x_var+' vs '+y_var, tooltips=tooltips3, toolbar_location=None)

    for num,index in enumerate(cluster_ids) :
        condition = dataframe[cluster_id_column] == index
        source = dataframe[condition]
        r.scatter(x=x_var,y=y_var, fill_alpha=0.6, size=8, source=source, fill_color = pallete[num] , legend_label=index, muted_alpha=0.1 )

    r.xaxis.axis_label = x_var
    r.yaxis.axis_label = y_var
    r.legend.click_policy="mute"

    show(row(p,q,r))

K-Means

# Cluster Profiling for k-means with 4 clusters
# hover for country names and x, y values , cluster no
# Click on legend to selectively view clusters
cluster_analysis_plot('k_means_cluster_id','k-means clusters')

# Comparing k-means Clusters using mean values of features
clustering_data[['child_mort','income','gdpp','k_means_cluster_id']].groupby('k_means_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of Cluster Means for K-means results');

png

plot_columns = ['child_mort','income','gdpp']

for idx,column in enumerate(plot_columns) :
    plt.suptitle('Comparison of Clusters Characteristics for K-means');
    # subplot takes (rows, columns, index) as integers
    plt.subplot(1, 3, idx+1)
    sns.boxplot(y=column, x='k_means_cluster_id',data=clustering_data)

png

plt.figure(figsize=[8,8])
pd.plotting.parallel_coordinates(clustering_data, 'k_means_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e"]);
plt.title('Parallel Coordinate Plot for K-means')
plt.xticks(rotation=45);

png

  • From the above plots, the features of each cluster can be characterized using the following levels.

    • Levels : Low, Moderate, High, Very High
  • Characteristics of each cluster

Cluster | GDP | Income | Child Mortality
0 | High | High to Very High | Low
1 | Low | Low | High to Very High
2 | Low to Moderate | Low to Moderate | Low to Moderate
3 | Very High | Very High | Low
From these characteristics, we see that Cluster 1 is our area of interest. Let's look at the countries in Cluster 1.

# Countries in the cluster of interest, ranked by child mortality (descending),
# then income and GDP per capita (ascending)
condition = countries['k_means_cluster_id'] == 1
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
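|  | country | child_mort | income | gdpp |
|---|---|---|---|---|
| 66 | Haiti | 208.0 | 1500 | 662.0 |
| 132 | Sierra Leone | 160.0 | 1220 | 399.0 |
| 32 | Chad | 150.0 | 1930 | 897.0 |
| 31 | Central African Republic | 149.0 | 888 | 446.0 |
| 97 | Mali | 137.0 | 1870 | 708.0 |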
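K-means assigns cluster IDs arbitrarily, so the ID of the high-need cluster can change between runs unless the random seed is fixed. A minimal sketch, assuming a frame shaped like `countries`, of picking the target cluster by its mean child mortality instead of a hard-coded ID. The toy values and the names `toy`, `cluster_id` and `target_cluster` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the `countries` frame; values are made up for illustration.
toy = pd.DataFrame({
    "country":    ["A", "B", "C", "D", "E", "F"],
    "child_mort": [120.0, 95.0, 10.0, 8.0, 30.0, 25.0],
    "income":     [900, 1500, 40000, 35000, 9000, 12000],
    "gdpp":       [400.0, 700.0, 38000.0, 30000.0, 4000.0, 6000.0],
    "cluster_id": [2, 2, 0, 0, 1, 1],
})

# Pick the cluster with the highest mean child mortality instead of a fixed ID.
cluster_means = toy.groupby("cluster_id")[["child_mort", "income", "gdpp"]].mean()
target_cluster = cluster_means["child_mort"].idxmax()

# Rank that cluster: worst child mortality first, then lowest income and GDP per capita.
ranked = (
    toy[toy["cluster_id"] == target_cluster]
    .sort_values(by=["child_mort", "income", "gdpp"], ascending=[False, True, True])
)
top = ranked["country"].head(5).tolist()
print(target_cluster, top)
```

Selecting on a single feature is a simplification; a combined score over several indicators would be an equally reasonable choice.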

Hierarchical Clustering - Complete Linkage, Correlation based distance

# HAC: complete linkage, correlation-based distance: cluster analysis plot
# Hover for country names, x and y values, and cluster number
# Click on the legend to selectively view clusters
cluster_analysis_plot('hac_correlation_cluster_id', 'HAC CORRELATION CLUSTERS')

# Comparing hierarchical clusters using mean values of features
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
clustering_data[['child_mort','income','gdpp','hac_correlation_cluster_id']].groupby('hac_correlation_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of Cluster Means for Hierarchical Clustering');

Figure: Comparison of Cluster Means for Hierarchical Clustering (horizontal bar chart)

# Box plots: one per feature, side by side
plot_columns = ['child_mort','income','gdpp']

plt.suptitle('Comparison of Clusters Characteristics for Hierarchical Clustering')
for idx, column in enumerate(plot_columns):
    # integer grid spec (rows, cols, index) instead of a string like '131'
    plt.subplot(1, 3, idx+1)
    sns.boxplot(y=column, x='hac_correlation_cluster_id', data=clustering_data)

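Figure: Comparison of Clusters Characteristics for Hierarchical Clustering (box plots of child_mort, income and gdpp per cluster)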

# Parallel coordinates plot for hierarchical clustering with correlation measure and complete linkage
plt.figure(figsize=[8,8])
# 'rgba(42, 157, 143, 1)' is a CSS color string, not a Matplotlib color; '#2a9d8f' is its hex equivalent
pd.plotting.parallel_coordinates(clustering_data, 'hac_correlation_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", "#2a9d8f"])
plt.title('Parallel Coordinate Plot for HAC with Correlation Measure and Complete Linkage')
plt.xticks(rotation=45);

Figure: Parallel Coordinate Plot for HAC with Correlation Measure and Complete Linkage

  • From the above plots, the features of each cluster can be characterized using the following levels.

    • Levels: Very Low, Low, Moderate, High, Very High
  • Characteristics of each cluster:

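| Cluster | GDP | Income | Child Mortality |
|---|---|---|---|
| 0 | Very Low | Very Low | Very High |
| 1 | Low | Low | Low |
| 2 | Moderate | Moderate | Moderate |
| 3 | High | High | Low |
| 4 | Very High | Very High | Low |
| 5 | Very Low | Very Low | Low |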

From these characteristics, Cluster 0 (very low GDP and income, very high child mortality) is our area of interest. Let's look at the countries in Cluster 0.

# Countries in the cluster of interest, ranked by child mortality (descending),
# then income and GDP per capita (ascending)
condition = countries['hac_correlation_cluster_id'] == 0
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
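|  | country | child_mort | income | gdpp |
|---|---|---|---|---|
| 66 | Haiti | 208.0 | 1500 | 662.0 |
| 132 | Sierra Leone | 160.0 | 1220 | 399.0 |
| 32 | Chad | 150.0 | 1930 | 897.0 |
| 31 | Central African Republic | 149.0 | 888 | 446.0 |
| 97 | Mali | 137.0 | 1870 | 708.0 |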
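For readers unfamiliar with the setup, here is a minimal sketch of hierarchical clustering with complete linkage and a correlation-based distance, assuming SciPy's `linkage` and `cut_tree`. The toy matrix `X` is made up, and this is not necessarily the exact code that produced `hac_correlation_cluster_id`. It illustrates one property worth keeping in mind: correlation distance groups rows by the shape of their feature profile, not by magnitude.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Toy feature matrix: rows are observations (countries), columns are features.
X = np.array([
    [1.0, 2.0, 3.0],   # rising profile
    [2.0, 4.1, 6.0],   # rising profile, larger magnitude
    [3.0, 2.0, 1.0],   # falling profile
    [6.0, 4.2, 2.0],   # falling profile, larger magnitude
])

# Correlation distance between two rows is 1 - Pearson correlation of their profiles.
# Complete linkage merges clusters based on their farthest pair of members.
Z = linkage(X, method="complete", metric="correlation")

# Cut the dendrogram into 2 clusters and flatten the (n, 1) result.
labels = cut_tree(Z, n_clusters=2).ravel()
print(labels)
```

Rows 0 and 1 end up together even though their magnitudes differ, because their profiles are almost perfectly correlated.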

Mixed Clustering : K-means initialized with Hierarchical Cluster Centroids

cluster_analysis_plot('mixed_cluster_id', 'MIXED K-MEANS CLUSTERS')

# Comparing mixed k-means clusters using mean values of features
clustering_data['mixed_cluster_id'] = countries['mixed_cluster_id']
clustering_data[['child_mort','income','gdpp','mixed_cluster_id']].groupby('mixed_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of cluster means for Mixed Clustering');

Figure: Comparison of cluster means for Mixed Clustering (horizontal bar chart)

# Box plots: one per feature, side by side
plot_columns = ['child_mort','income','gdpp']

plt.suptitle('Comparison of Clusters Characteristics for Mixed Clustering')
for idx, column in enumerate(plot_columns):
    # integer grid spec (rows, cols, index) instead of a string like '131'
    plt.subplot(1, 3, idx+1)
    sns.boxplot(y=column, x='mixed_cluster_id', data=clustering_data)

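Figure: Comparison of Clusters Characteristics for Mixed Clustering (box plots of child_mort, income and gdpp per cluster)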

# Parallel coordinates plot
plt.figure(figsize=[8,8])
# 'rgba(42, 157, 143, 1)' is a CSS color string, not a Matplotlib color; '#2a9d8f' is its hex equivalent
pd.plotting.parallel_coordinates(clustering_data, 'mixed_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", "#2a9d8f"])
plt.title('Parallel Coordinate Plot for Mixed Clustering')
plt.xticks(rotation=45);

Figure: Parallel Coordinate Plot for Mixed Clustering

  • From the above plots, Cluster 0 has very low income and GDP per capita and very high child mortality. Hence, Cluster 0 is the area of interest.
# Countries in the cluster of interest, ranked by child mortality (descending),
# then income and GDP per capita (ascending)
condition = countries['mixed_cluster_id'] == 0
countries.loc[condition,['country','child_mort','income','gdpp']].sort_values(by=['child_mort','income','gdpp'], ascending=[False,True,True])[:5]
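|  | country | child_mort | income | gdpp |
|---|---|---|---|---|
| 66 | Haiti | 208.0 | 1500 | 662.0 |
| 132 | Sierra Leone | 160.0 | 1220 | 399.0 |
| 32 | Chad | 150.0 | 1930 | 897.0 |
| 31 | Central African Republic | 149.0 | 888 | 446.0 |
| 97 | Mali | 137.0 | 1870 | 708.0 |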
  • Hence, the top five countries to extend aid to are:
    • Haiti
    • Sierra Leone
    • Chad
    • Central African Republic
    • Mali
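The mixed approach in this section, k-means seeded with hierarchical cluster centroids, can be sketched roughly as follows. This is a minimal illustration assuming SciPy and scikit-learn; the toy blobs, the Euclidean metric and the choice of 3 clusters are assumptions made for the example and do not reproduce the settings behind `mixed_cluster_id`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data: three well-separated blobs of 20 points each (stand-in for scaled features).
centers = ([0, 0], [5, 5], [0, 5])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in centers])

# Step 1: hierarchical clustering provides an initial partition.
Z = linkage(X, method="complete", metric="euclidean")
hac_labels = cut_tree(Z, n_clusters=3).ravel()

# Step 2: compute the centroid of each hierarchical cluster.
centroids = np.vstack([X[hac_labels == k].mean(axis=0) for k in np.unique(hac_labels)])

# Step 3: run k-means starting from those centroids.
# n_init=1 because the starting point is fixed, so restarts would be identical.
kmeans = KMeans(n_clusters=3, init=centroids, n_init=1).fit(X)
mixed_labels = kmeans.labels_
print(np.bincount(mixed_labels))
```

Seeding k-means this way makes the result deterministic for a given dendrogram cut, which sidesteps the run-to-run variation of random initialization.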

Conclusion

  • The columns 'child_mort', 'exports', 'health', 'imports', 'income', 'inflation', 'life_expec', 'total_fer', 'gdpp' and 'trade_deficit' were used for clustering.

  • The Hopkins statistic showed a very high clustering tendency, with a mean of 96% and a standard deviation of 4%.

  • The dataset was standardized (mean = 0, standard deviation = 1) before clustering.

  • The optimum number of clusters for k-means was found to be 4, from both the elbow curve and the silhouette analysis curve.

  • K-means clustering was performed with cluster centers initialized using k-means++.

  • Hierarchical clustering was explored with single, complete and average linkages, using Euclidean and correlation-based distances.

  • Mixed k-means clustering (k-means initialized with the centroids of hierarchical clustering) was also explored.

  • Because its results were more interpretable, hierarchical clustering with complete linkage and a correlation-based distance measure was chosen to arrive at the final results.

  • Characteristics of the clusters obtained:

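| Cluster | GDP | Income | Child Mortality |
|---|---|---|---|
| 0 | Very Low | Very Low | Very High |
| 1 | Low | Low | Low |
| 2 | Moderate | Moderate | Moderate |
| 3 | High | High | Low |
| 4 | Very High | Very High | Low |
| 5 | Very Low | Very Low | Low |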
  • From these characteristics, Cluster 0 is our area of interest.
  • In line with the UN's 2030 goals, health is treated as the top priority, followed by poverty.
  • Hence, countries in Cluster 0 were ranked by child mortality, followed by income and GDP per capita.
  • By those criteria, the following are the five countries to which HELP should consider extending aid:
    • Haiti
    • Sierra Leone
    • Chad
    • Central African Republic
    • Mali
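The clustering-tendency check mentioned in the conclusion can be illustrated with a small Hopkins statistic sketch. This is a hedged, NumPy-only version: the function `hopkins`, the sample size `m` and the plain ratio of distance sums (without raising distances to a power) are assumptions and may differ from the implementation used in the analysis. Under this convention, values near 1 suggest clustered data and values near 0.5 suggest no structure.

```python
import numpy as np

def hopkins(X, m=None, seed=None):
    """Hopkins statistic: near 1 for clustered data, near 0.5 for uniform data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = m or max(1, n // 10)

    # w: distance from m sampled real points to their nearest *other* real point.
    idx = rng.choice(n, size=m, replace=False)
    dist_real = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(m), idx] = np.inf  # ignore each point's distance to itself
    w = dist_real.min(axis=1)

    # u: distance from m uniform points in the data's bounding box to the nearest real point.
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return u.sum() / (u.sum() + w.sum())

# Toy data: two tight blobs versus uniform noise.
rng = np.random.default_rng(42)
clustered = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(10, 0.1, (100, 2))])
uniform = rng.uniform(0, 1, (200, 2))

# Repeat with different seeds and summarize with mean and standard deviation.
clustered_scores = [hopkins(clustered, seed=s) for s in range(20)]
uniform_scores = [hopkins(uniform, seed=s) for s in range(20)]
print(np.mean(clustered_scores), np.std(clustered_scores))
print(np.mean(uniform_scores), np.std(uniform_scores))
```

Because the statistic depends on random sampling, reporting a mean and standard deviation over several runs, as the conclusion does, is more informative than a single value.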
