Countries in need of Aid

1 Problem Statement
2 Analysis Approach & Conclusions
3 Importing data
4 Data Quality Checks
5 Exploratory Data Analysis
6 Hopkins Statistic
7 Standardizing Values
8 K-Means Clustering
- 8.1 Finding Optimal Number of Clusters
  - 8.1.1 Elbow curve
  - 8.1.2 Silhouette Analysis
- 8.2 Final k - Means Clustering
9 Hierarchical Clustering
10 Mixed K-Means Clustering
11 Cluster Profiling
12 Conclusion

HELP - Countries to Aid

Jayanth Boddu
Clustering

Problem Statement

An NGO wants to use its funding strategically to aid five countries in dire need of help. The analyst’s objective is to use clustering algorithms to group countries by socio-economic and health factors that reflect their overall development. The final deliverable is a list of the 5 countries that need the aid the most.

Analysis Approach & Conclusions

The objective of the analysis is to recommend 5 countries in dire need of aid.
To achieve this, the following features of 167 countries have been analyzed.
- GDP per capita
- Child mortality per 1000 live births
- Net Income per person
- Fertility rates at the current prevailing age/fertility rate
- Inflation index
- Life Expectancy
- Imports per capita
- Exports per capita
- Total Health Spending per capita
The above features are further grouped into health, economic and policy indicators
- Health Indicators
  - Child mortality per 1000 births
  - Fertility rates at the current prevailing age/fertility index
  - Life Expectancy
- Economic Indicators
  - Net Income per person
  - Imports per capita
  - Exports per capita
  - Inflation Index
- Policy Indicators
  - Total Health Spending per capita
From the above, the countries in dire need of aid are those with poor economic indicators, poor health indicators and low health spending.
Further, at the feature level, the following are general indicators of the need for aid.

Feature	High / Low
Child Mortality	High
Life Expectancy	Low
Fertility	High
Health Spending	Low
GDP per capita	Low
Inflation Index	High
Income per Person	Low
Imports	High
Exports	Low

Data has been checked for quality. No data quality issues have been found: no missing values, no duplicates, no incorrect data types.
A new feature ‘Trade Deficit’ has been derived.
- Trade Deficit = Imports - Exports
- Trade Deficit or Trade balance is a good indicator of economic health of countries with low GDP per capita
- A lower (or negative) value is better than a higher, positive value.
Univariate analysis revealed the following information

Feature	Highest	Lowest
Child Mortality	Haiti, Sierra Leone	Iceland
Life Expectancy	Japan, Singapore	Haiti, Lesotho
Total fertility	Chad, Niger	Singapore, South Korea
Health Spending	Switzerland, US	Madagascar, Eritrea
GDP per capita	Luxembourg, Norway	Burundi, Liberia
Inflation Index	Nigeria, Venezuela	Seychelles, Ireland
Net Income per person	Qatar, Luxembourg	Liberia, Congo Dem. Rep.
Imports per capita	Luxembourg, Singapore	Myanmar, Burundi
Exports per capita	Luxembourg, Singapore	Myanmar, Burundi
Trade Deficit per capita	Bahamas, Greece	Luxembourg, Qatar

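Since there is little data per cluster, a soft range from the 1st to the 99th percentile has been used to identify and cap outliers. Only outliers on the side away from the area of interest (for example, very low child mortality) have been capped; extreme values that signal a need for aid are left unchanged.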
Outlier Treatment Summary

Feature	Upper Outliers	Lower Outliers
Child Mortality	Not Changed	Capped
Life Expectancy	Capped	Not Changed
Fertility	Not Changed	Capped
Health Spending	Capped	Not Changed
GDP per capita	Capped	Not Changed
Inflation Index	Not Changed	Capped
Income per Person	Capped	Not Changed
Imports	Capped	Not Changed
Exports	Capped	Not Changed
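The one-sided capping summarised in the table can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the helper name `cap_tail`, the mapping `CAP_SIDE` and the toy data are assumptions made for the example, while the column names and the side that is capped follow the table above.

```python
import pandas as pd

# Side to cap for each column, transcribed from the summary table above.
CAP_SIDE = {
    "child_mort": "lower", "life_expec": "upper", "total_fer": "lower",
    "health": "upper", "gdpp": "upper", "inflation": "lower",
    "income": "upper", "imports": "upper", "exports": "upper",
}

def cap_tail(series, side, lower_q=0.01, upper_q=0.99):
    """Clip one tail of a numeric Series to its own percentile value."""
    if side == "lower":
        return series.clip(lower=series.quantile(lower_q))
    if side == "upper":
        return series.clip(upper=series.quantile(upper_q))
    raise ValueError("side must be 'lower' or 'upper'")

# Toy data (hypothetical, not from the dataset): the values 0..100 in two columns.
values = [float(v) for v in range(101)]
toy = pd.DataFrame({"child_mort": values, "gdpp": values})

# Apply the table-driven rule column by column.
capped = toy.apply(lambda col: cap_tail(col, CAP_SIDE[col.name]))
print(capped.agg(["min", "max"]))
```

Clipping only one tail keeps the extreme values on the side of interest (for example very high child mortality) while limiting the influence of the opposite tail on distance-based clustering.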

Bivariate Analysis revealed the following insights
- There is a negative relationship between Child Mortality and Life expectancy. As child mortality increases, Life Expectancy decreases. Haiti and Sierra Leone have the highest child mortality and lowest life expectancy.
- There is a positive relationship between child mortality and total fertility. Examples: Chad and Mali.
- Health spending has a positive relationship with GDP per capita, but the situation is dire in low-GDP countries like Eritrea and Madagascar.
- There is a very high correlation between child mortality and life expectancy, and between child mortality and total fertility.

Columns : ‘child_mort’, ‘exports’, ‘health’, ‘imports’, ‘income’,‘inflation’, ‘life_expec’, ‘total_fer’, ‘gdpp’,‘trade_deficit’ were used for clustering.
The Hopkins statistic was calculated and showed a very high mean clustering tendency of 96% with a standard deviation of 4%.
The dataset was standardized to mean = 0 and standard deviation = 1 before clustering.
The optimum number of clusters for k-means was found to be 4, from both the elbow curve and the silhouette analysis curve.
Clustering with k-means was performed (cluster centers initialized with k-means++).
Hierarchical clustering with single, complete and average linkages, using Euclidean and correlation-based distances, was explored.
Mixed clustering, i.e. k-means initialised with the centroids of hierarchical clusters (6 clusters), was carried out.
Finally, hierarchical clustering with complete linkage and a correlation-based distance measure was chosen to arrive at the results. The choice was made based on the interpretability of the results.
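For readers unfamiliar with the Hopkins statistic mentioned in this summary, the following is a minimal NumPy sketch of one common formulation. It is an assumption that this matches the notebook's implementation, which may differ (for instance by raising distances to the power of the number of dimensions). The function name `hopkins_statistic` and the toy data are hypothetical. Under this convention, values near 1 suggest a strong clustering tendency and values near 0.5 suggest uniformly spread data.

```python
import numpy as np

def hopkins_statistic(X, sample_size=None, seed=0):
    """Hopkins statistic H = sum(u) / (sum(u) + sum(w)), using plain distances."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = sample_size or max(1, n // 10)
    rng = np.random.default_rng(seed)

    # w: distance from m sampled real points to their nearest other real point
    idx = rng.choice(n, size=m, replace=False)
    dist_real = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(m), idx] = np.inf  # ignore each point's distance to itself
    w = dist_real.min(axis=1)

    # u: distance from m uniform random points (in the bounding box) to the nearest real point
    uniform_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(uniform_points[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return u.sum() / (u.sum() + w.sum())

# Toy data (hypothetical): two tight blobs versus uniformly scattered points.
rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0.0, 0.05, (150, 2)), rng.normal(5.0, 0.05, (150, 2))])
scattered = rng.uniform(0.0, 1.0, (300, 2))

h_clustered = hopkins_statistic(clustered)
h_scattered = hopkins_statistic(scattered)
print(round(h_clustered, 3), round(h_scattered, 3))
```

Because the statistic depends on a random sample, it is usually computed several times and summarised by its mean and standard deviation, which is how the 96% and 4% figures above are reported.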
Characteristics of Clusters obtained :

Cluster	GDP	Income	Child Mortality
0	Very low	Very Low	Very High
1	low	low	low
2	Moderate	Moderate	Moderate
3	High	High	Low
4	Very High	Very High	Low
5	Very Low	Very Low	Low

From the characteristics, we see that Cluster 0 is our area of interest.
According to the UN goals of 2030, the top priority is health and then poverty.
Hence, countries in cluster 0 were ranked based on child mortality, followed by income and GDP per capita.
By those criteria, the following are the five countries to which HELP should consider extending aid.
- Haiti
- Sierra Leone
- Chad
- Central African Republic
- Mali
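The ranking rule described above (highest child mortality first, ties broken by lowest income and then lowest GDP per capita) can be sketched with a multi-key sort. The frame `cluster0` and its values are hypothetical placeholders, not figures from the dataset; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical members of the cluster of interest (placeholder values).
cluster0 = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "child_mort": [150.0, 120.0, 150.0, 90.0],
    "income": [1200, 900, 800, 700],
    "gdpp": [500, 400, 450, 300],
})

# Health first (child mortality, descending), then poverty (income, gdpp ascending).
ranked = cluster0.sort_values(
    by=["child_mort", "income", "gdpp"],
    ascending=[False, True, True],
)
top = ranked["country"].head(3).tolist()
print(top)
```

Putting child mortality first in `by` encodes the stated priority of health over poverty; the later keys only matter when earlier keys tie.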

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')

from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

import warnings
warnings.filterwarnings('ignore')

!pip install tabulate
!jt -f roboto -fs 12 -cellw 100%

from tabulate import tabulate

from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure,show,output_notebook,reset_output
from bokeh.transform import factor_cmap
from bokeh.layouts import row
output_notebook()


# to table print a dataframe
def tab(ser) :
    print(tabulate(pd.DataFrame(ser), headers='keys', tablefmt='psql'))

Importing data

countries = pd.read_csv('./Country-data.csv')
tab(countries.head())

+----+---------------------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+--------+
|    | country             |   child_mort |   exports |   health |   imports |   income |   inflation |   life_expec |   total_fer |   gdpp |
|----+---------------------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+--------|
|  0 | Afghanistan         |         90.2 |      10   |     7.58 |      44.9 |     1610 |        9.44 |         56.2 |        5.82 |    553 |
|  1 | Albania             |         16.6 |      28   |     6.55 |      48.6 |     9930 |        4.49 |         76.3 |        1.65 |   4090 |
|  2 | Algeria             |         27.3 |      38.4 |     4.17 |      31.4 |    12900 |       16.1  |         76.5 |        2.89 |   4460 |
|  3 | Angola              |        119   |      62.3 |     2.85 |      42.9 |     5900 |       22.4  |         60.1 |        6.16 |   3530 |
|  4 | Antigua and Barbuda |         10.3 |      45.5 |     6.03 |      58.9 |    19100 |        1.44 |         76.8 |        2.13 |  12200 |
+----+---------------------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+--------+

The dataset expresses exports, imports and health spending as a percentage of GDP per capita.
Let’s convert them to actual per-capita values.

# conversion to actual values
countries['exports'] = 0.01 * countries['exports'] * countries['gdpp']
countries['imports'] = 0.01 * countries['imports'] * countries['gdpp']
countries['health'] = 0.01 * countries['health'] * countries['gdpp']

Data Quality Checks

# info() prints its report directly and returns None, so it is not passed to tab()
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   country     167 non-null    object
 1   child_mort  167 non-null    float64
 2   exports     167 non-null    float64
 3   health      167 non-null    float64
 4   imports     167 non-null    float64
 5   income      167 non-null    int64
 6   inflation   167 non-null    float64
 7   life_expec  167 non-null    float64
 8   total_fer   167 non-null    float64
 9   gdpp        167 non-null    int64
dtypes: float64(7), int64(2), object(1)
memory usage: 13.2+ KB

167 countries
No apparent missing values
No incorrect data types

# taking a closer look to see if any country is duplicated
countries[countries.duplicated(subset=['country'])].index.values

array([], dtype=int64)

No duplicate countries

# exports, health and imports were given as percentages of gdpp; check whether any converted value reaches or exceeds gdpp
condition1 = countries['health'] < countries['gdpp']
condition2 = countries['imports'] < countries['gdpp']
condition3 = countries['exports'] < countries['gdpp']

# countries which don't satisfy the above conditions
index = countries[~(condition1 & condition2 & condition3)].index.values
countries.loc[index]

	country	child_mort	exports	health	imports	income	inflation	life_expec	total_fer	gdpp
73	Ireland	4.2	50161.00	4475.53	42125.5	45700	-3.220	80.4	2.05	48700
87	Lesotho	99.7	460.98	129.87	1181.7	2380	4.150	46.5	3.30	1170
91	Luxembourg	2.8	183750.00	8158.50	149100.0	91700	3.620	81.3	1.63	105000
98	Malta	6.8	32283.00	1825.15	32494.0	28300	3.830	80.3	1.36	21100
131	Seychelles	14.4	10130.40	367.20	11664.0	20400	-4.210	73.4	2.17	10800
133	Singapore	2.8	93200.00	1845.36	81084.0	72100	-0.046	82.7	1.15	46600

Some countries have imports or exports greater than GDP per capita, which means the original percentage exceeded 100%.
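As a quick worked check, the original percentages can be recovered from the converted values. The numbers below are Luxembourg's row from the table above; the variable names are only illustrative. A share above 100% is not by itself a data error, since trade volumes of small, trade-heavy economies can exceed their GDP.

```python
# Luxembourg's row from the table above (per-capita values after conversion).
gdpp = 105000
exports_per_capita = 183750.0
imports_per_capita = 149100.0

# Invert the conversion: value = 0.01 * percentage * gdpp
exports_pct = 100 * exports_per_capita / gdpp
imports_pct = 100 * imports_per_capita / gdpp

print(exports_pct, imports_pct)  # both shares are above 100
```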

# child mortality rate is calculated per 1000 live births. Let's check if it's above 1000
countries[countries['child_mort'] > 1000].index.values

array([], dtype=int64)

Sanity checks completed.
No Data Quality issues found.
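The checks in this section can also be gathered into one reusable report. This is a sketch under assumptions: the function name `quality_report` and the toy frame are hypothetical, and the toy frame deliberately contains one missing value and one duplicated key so that the report has something to find.

```python
import pandas as pd

def quality_report(df, key="country"):
    """Count missing values and duplicated keys, and list non-numeric feature columns."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_keys": int(df.duplicated(subset=[key]).sum()),
        "non_numeric_columns": [
            col for col in df.columns
            if col != key and not pd.api.types.is_numeric_dtype(df[col])
        ],
    }

# Toy frame (hypothetical values) with one NaN and one repeated country.
toy = pd.DataFrame({
    "country": ["A", "B", "B"],
    "child_mort": [10.0, None, 30.0],
    "gdpp": [500, 1200, 1200],
})

report = quality_report(toy)
print(report)
```

On a clean dataset such as the one described here, all three entries would be expected to come back as zero or empty.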

Exploratory Data Analysis

# summary statistics
tab(countries.describe())

+-------+--------------+--------------+-----------+---------------+----------+-------------+--------------+-------------+----------+
|       |   child_mort |      exports |    health |       imports |   income |   inflation |   life_expec |   total_fer |     gdpp |
|-------+--------------+--------------+-----------+---------------+----------+-------------+--------------+-------------+----------|
| count |     167      |    167       |  167      |    167        |    167   |   167       |    167       |   167       |    167   |
| mean  |      38.2701 |   7420.62    | 1056.73   |   6588.35     |  17144.7 |     7.78183 |     70.5557  |     2.94796 |  12964.2 |
| std   |      40.3289 |  17973.9     | 1801.41   |  14710.8      |  19278.1 |    10.5707  |      8.89317 |     1.51385 |  18328.7 |
| min   |       2.6    |      1.07692 |   12.8212 |      0.651092 |    609   |    -4.21    |     32.1     |     1.15    |    231   |
| 25%   |       8.25   |    447.14    |   78.5355 |    640.215    |   3355   |     1.81    |     65.3     |     1.795   |   1330   |
| 50%   |      19.3    |   1777.44    |  321.886  |   2045.58     |   9960   |     5.39    |     73.1     |     2.41    |   4660   |
| 75%   |      62.1    |   7278       |  976.94   |   7719.6      |  22800   |    10.75    |     76.8     |     3.88    |  14050   |
| max   |     208      | 183750       | 8663.6    | 149100        | 125000   |   104       |     82.8     |     7.49    | 105000   |
+-------+--------------+--------------+-----------+---------------+----------+-------------+--------------+-------------+----------+

The maximum values of most attributes are much higher than their 75th percentile values.

# let's look at the upper quantiles to take a closer look at outliers
# numeric_only=True skips the text column `country`
tab(countries.quantile(np.linspace(0.75,1,25), numeric_only=True).reset_index())

+----+----------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+----------+
|    |    index |   child_mort |   exports |   health |   imports |   income |   inflation |   life_expec |   total_fer |     gdpp |
|----+----------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+----------|
|  0 | 0.75     |      62.1    |   7278    |   976.94 |   7719.6  |  22800   |     10.75   |      76.8    |     3.88    |  14050   |
|  1 | 0.760417 |      62.2917 |   7846.5  |  1004.03 |   7977.36 |  23581.2 |     11.1229 |      76.9458 |     4.11667 |  16137.5 |
|  2 | 0.770833 |      62.6958 |   7935.28 |  1012.41 |   8220.78 |  27116.7 |     11.5833 |      77.4833 |     4.26875 |  17079.2 |
|  3 | 0.78125  |      63.6688 |   9400.55 |  1028.18 |   8366.03 |  28300   |     12.1    |      77.8688 |     4.36062 |  19300   |
|  4 | 0.791667 |      64.1083 |   9937.67 |  1141.28 |   9561.67 |  28700   |     12.3833 |      78.025  |     4.53083 |  20175   |
|  5 | 0.802083 |      67.5438 |  10220.6  |  1276.05 |   9904.05 |  29600   |     12.6313 |      78.2729 |     4.60146 |  21245.8 |
|  6 | 0.8125   |      74.35   |  10655.8  |  1436.87 |  10029.1  |  30300   |     13.75   |      79.05   |     4.6625  |  22450   |
|  7 | 0.822917 |      78.0292 |  10815.5  |  1548.88 |  10070.1  |  32420.8 |     14.1208 |      79.5    |     4.8225  |  25514.6 |
|  8 | 0.833333 |      80.5333 |  10933.1  |  1829.69 |  10318.9  |  33766.7 |     14.5    |      79.6    |     4.90333 |  28866.7 |
|  9 | 0.84375  |      83.4188 |  11075.8  |  1867.65 |  10882.2  |  35825   |     15.1125 |      79.8063 |     4.9825  |  30706.2 |
| 10 | 0.854167 |      89.0708 |  12677.1  |  2207.69 |  11610.8  |  36200   |     15.5375 |      79.9792 |     5.04375 |  33095.8 |
| 11 | 0.864583 |      90.2521 |  13445.8  |  2407.81 |  11848.4  |  37889.6 |     16.0042 |      80.0521 |     5.08604 |  35156.3 |
| 12 | 0.875    |      90.9    |  14457.8  |  2810.22 |  12290.6  |  39950   |     16.2    |      80.15   |     5.2025  |  36475   |
| 13 | 0.885417 |      93.5687 |  15038.4  |  3393.81 |  12905.2  |  40693.7 |     16.5979 |      80.3    |     5.21    |  38891.7 |
| 14 | 0.895833 |      99.0292 |  17034    |  3651.31 |  14711.4  |  41100   |     16.6    |      80.4    |     5.29833 |  41450   |
| 15 | 0.90625  |     104.062  |  19846.1  |  4024.48 |  16043.1  |  42056.2 |     16.9187 |      80.4437 |     5.34875 |  42993.8 |
| 16 | 0.916667 |     109.333  |  23836.8  |  4265.13 |  17350.7  |  43333.3 |     17.4833 |      80.7333 |     5.405   |  44783.3 |
| 17 | 0.927083 |     111      |  24069.1  |  4525.11 |  18097.6  |  45164.6 |     19.4375 |      80.9896 |     5.54646 |  46558.3 |
| 18 | 0.9375   |     112.875  |  26626.7  |  4801.18 |  21864.3  |  45462.5 |     20.2875 |      81.3    |     5.77875 |  47212.5 |
| 19 | 0.947917 |     116      |  30350    |  4908.45 |  23340.7  |  47010.4 |     20.8354 |      81.4    |     5.85062 |  48506.2 |
| 20 | 0.958333 |     119.333  |  33999.5  |  5175.43 |  25846.6  |  55675   |     22.4333 |      81.5167 |     6.15083 |  50433.3 |
| 21 | 0.96875  |     128.688  |  35961.1  |  5867.67 |  32399.7  |  61418.8 |     23.45   |      81.8625 |     6.21688 |  52062.5 |
| 22 | 0.979167 |     143.5    |  45934.9  |  7449.69 |  36739.1  |  73779.2 |     25.7667 |      82      |     6.41167 |  64662.5 |
| 23 | 0.989583 |     152.708  |  61817.4  |  8392.65 |  52676.8  |  83606.2 |     41.0146 |      82.3354 |     6.56083 |  78175   |
| 24 | 1        |     208      | 183750    |  8663.6  | 149100    | 125000   |    104      |      82.8    |     7.49    | 105000   |
+----+----------+--------------+-----------+----------+-----------+----------+-------------+--------------+-------------+----------+

exports, child_mort, imports, income, inflation, life_expec, total_fer and gdpp have outliers.
These outliers shall be treated in the univariate analysis.

Univariate Analysis / Outlier Treatment

# function to perform outlier analysis
def outlier_analysis(column) :
    '''
    This function draws a violin plot and a box plot of the column provided.
    It also prints selected quantiles, the lower and upper outlier threshold values, and tables of the countries which are outliers.
    Input : column name
    Output : lower outlier condition, upper outlier condition (boolean Series)
    Side effects : violin plot, box plot, outlier tables
    '''
    plt.figure(figsize=[12,6])
    plt.subplot(121)
    plt.title('Violin Plot of '+column)
    # pass the data as x= explicitly; newer seaborn versions no longer treat a positional Series as x
    sns.violinplot(x=countries[column])

    plt.subplot(122)
    plt.title('Box Plot of '+column)
    sns.boxplot(x=countries[column])

    print('Quantiles\n')
    # tab() prints the table itself and returns None, so it is not wrapped in print()
    tab(countries[column].quantile([.1,0.25,.50,0.75,0.99]))

    # soft thresholds : 1st and 99th percentiles
    lower_outlier_threshold = countries[column].quantile(0.01)
    upper_outlier_threshold = countries[column].quantile(0.99)

    print('\n\nLOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR' ,column,': ',lower_outlier_threshold)
    l_condition = countries[column] < lower_outlier_threshold
    l_outliers = countries[l_condition][['country',column]].sort_values(by=column)

    if l_outliers.shape[0] :
        print('\n\nLower Outliers : ')
        tab(l_outliers)
    else :
        print('No lower outliers found in ' + column)

    print('\n\nUPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR' ,column,': ',upper_outlier_threshold)
    u_condition = countries[column] > upper_outlier_threshold
    u_outliers = countries[u_condition][['country',column]].sort_values(by=column)

    if u_outliers.shape[0] :
        print('\n\nUpper Outliers : ')
        tab(u_outliers)
        print('\n\n')

    return l_condition, u_condition

Child Mortality

# Countries with outliers in child mortality
column = 'child_mort'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+--------------+
|      |   child_mort |
|------+--------------|
| 0.1  |         4.2  |
| 0.25 |         8.25 |
| 0.5  |        19.3  |
| 0.75 |        62.1  |
| 0.99 |       153.4  |
+------+--------------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR child_mort :  2.8


Lower Outliers :
+----+-----------+--------------+
|    | country   |   child_mort |
|----+-----------+--------------|
| 68 | Iceland   |          2.6 |
+----+-----------+--------------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR child_mort :  153.40000000000003


Upper Outliers :
+-----+--------------+--------------+
|     | country      |   child_mort |
|-----+--------------+--------------|
| 132 | Sierra Leone |          160 |
|  66 | Haiti        |          208 |
+-----+--------------+--------------+

Notice that half of the countries have a child mortality below 20.
Further, countries with child mortality below the 1st percentile might not need aid at all. We shall cap their child mortality at the 1st percentile value.
Countries with extremely high child mortality, i.e. the upper outliers (> 99th percentile), are prime candidates for aid. Let’s keep them as they are for further analysis.

# Capping lower outliers in `child_mort` at the 1st percentile
countries.loc[l_condition, column] = countries[column].quantile(0.01)

Life Expectancy

# LIFE EXPECTANCY
column = 'life_expec'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+--------------+
|      |   life_expec |
|------+--------------|
| 0.1  |        57.82 |
| 0.25 |        65.3  |
| 0.5  |        73.1  |
| 0.75 |        76.8  |
| 0.99 |        82.37 |
+------+--------------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR life_expec :  47.160000000000004


Lower Outliers :
+----+-----------+--------------+
|    | country   |   life_expec |
|----+-----------+--------------|
| 66 | Haiti     |         32.1 |
| 87 | Lesotho   |         46.5 |
+----+-----------+--------------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR life_expec :  82.37


Upper Outliers :
+-----+-----------+--------------+
|     | country   |   life_expec |
|-----+-----------+--------------|
| 133 | Singapore |         82.7 |
|  77 | Japan     |         82.8 |
+-----+-----------+--------------+

Some of the lowest life expectancies are seen in Haiti and Lesotho.
About 50% of the countries have a life expectancy of 73 or below and the other 50% have above 73.
Countries like Singapore and Japan have the highest life expectancy, better than 99% of the countries.
Countries with very low life expectancy are possible candidates for aid.
Hence, let’s keep the lower outliers.
On the other hand, countries like Singapore and Japan might not need aid, but these values would skew our analysis. Let’s cap these outliers.

# Capping upper outliers in life expectancy
countries.loc[u_condition, column] = countries[column].quantile(0.99)

Fertility

column = 'total_fer'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-------------+
|      |   total_fer |
|------+-------------|
| 0.1  |      1.452  |
| 0.25 |      1.795  |
| 0.5  |      2.41   |
| 0.75 |      3.88   |
| 0.99 |      6.5636 |
+------+-------------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR total_fer :  1.2431999999999999


Lower Outliers :
+-----+-------------+-------------+
|     | country     |   total_fer |
|-----+-------------+-------------|
| 133 | Singapore   |        1.15 |
| 138 | South Korea |        1.23 |
+-----+-------------+-------------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR total_fer :  6.563599999999999


Upper Outliers :
+-----+-----------+-------------+
|     | country   |   total_fer |
|-----+-----------+-------------|
|  32 | Chad      |        6.59 |
| 112 | Niger     |        7.49 |
+-----+-----------+-------------+

About 50% of the countries have a fertility of 2.41 or less.
Lower fertility is seen in developed nations like Singapore and South Korea, where fertility is lower than in 99% of the countries.
Countries with higher total fertility might need the aid more, since this means more health risk.
Fertility in Chad and Niger is higher than in 99 percent of the countries.
Since these countries might need the aid, let’s leave these values for further analysis.
Countries with fertility rates below the 1st percentile look like developed nations. Let’s cap these outliers so that they don’t skew our analysis.
Further, from the violin plot, notice that fertility has two peaks, around 2 and 5. This indicates that fertility rate could be used to effectively segregate countries. More analysis is needed here.
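One possible way to follow up on the two peaks seen in the violin plot is to locate the modes of a kernel density estimate numerically. The sketch below is an assumption about how such a check could look, not part of the original analysis: the helper `density_peaks` is hypothetical, and the data is synthetic, drawn around 2 and 5 to mimic the shape described above.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

def density_peaks(values, grid_size=200, rel_prominence=0.05):
    """Return the locations of the prominent modes of a Gaussian KDE."""
    values = np.asarray(values, dtype=float)
    grid = np.linspace(values.min(), values.max(), grid_size)
    density = gaussian_kde(values)(grid)
    # keep only peaks that stand out by at least 5% of the highest density
    peak_idx, _ = find_peaks(density, prominence=rel_prominence * density.max())
    return grid[peak_idx]

# Synthetic stand-in for the fertility column: two groups centred near 2 and 5.
rng = np.random.default_rng(0)
toy_fertility = np.concatenate([
    rng.normal(2.0, 0.5, 200),
    rng.normal(5.0, 0.5, 200),
])

peaks = density_peaks(toy_fertility)
print(peaks)
```

On the real data, `countries['total_fer']` would take the place of `toy_fertility`; the number and position of the peaks depend on the KDE bandwidth, so the result should be read as indicative only.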

# capping lower outliers in fertility at the 1st percentile (0.01, not 0.1)
countries.loc[l_condition,column] = countries[column].quantile(0.01)

Health Spending

# health spending per capita
column = 'health'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-----------+
|      |    health |
|------+-----------|
| 0.1  |   36.5026 |
| 0.25 |   78.5355 |
| 0.5  |  321.886  |
| 0.75 |  976.94   |
| 0.99 | 8410.33   |
+------+-----------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR health :  17.009362000000003


Lower Outliers :
+----+------------+----------+
|    | country    |   health |
|----+------------+----------|
| 50 | Eritrea    |  12.8212 |
| 93 | Madagascar |  15.5701 |
+----+------------+----------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR health :  8410.3304


Upper Outliers :
+-----+---------------+----------+
|     | country       |   health |
|-----+---------------+----------|
| 145 | Switzerland   |   8579   |
| 159 | United States |   8663.6 |
+-----+---------------+----------+

Countries with higher health spending are less in need of aid.
Examples are Switzerland and the United States, whose health spending is higher than that of 99% of the countries. Values like these would skew the entire analysis for countries needing aid. Hence, we cap these values at the 99th percentile value.
Low health spending can have a variety of reasons, such as good general health or bad economic conditions. Let’s keep these values for further analysis.

# capping upper outliers in health spending
countries.loc[u_condition,column] = countries[column].quantile(0.99)

GDP per capita

# GDP per capita
column = 'gdpp'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+---------+
|      |    gdpp |
|------+---------|
| 0.1  |   593.8 |
| 0.25 |  1330   |
| 0.5  |  4660   |
| 0.75 | 14050   |
| 0.99 | 79088   |
+------+---------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR gdpp :  331.62


Lower Outliers :
+----+-----------+--------+
|    | country   |   gdpp |
|----+-----------+--------|
| 26 | Burundi   |    231 |
| 88 | Liberia   |    327 |
+----+-----------+--------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR gdpp :  79088.00000000004


Upper Outliers :
+-----+------------+--------+
|     | country    |   gdpp |
|-----+------------+--------|
| 114 | Norway     |  87800 |
|  91 | Luxembourg | 105000 |
+-----+------------+--------+

GDP per capita is a very good indicator of a country’s prosperity.
Notice that there are three peaks in the violin plot of GDP per capita. These might indicate three clusters: underdeveloped, developing and developed nations.
Underdeveloped nations are the most in need of aid. Countries like Burundi and Liberia have a GDP per capita lower than that of 99% of the countries. Even though they are outliers, let’s keep them for further analysis.
But we can cap GDP per capita values that are greater than those of 99% of the countries (Luxembourg and Norway).

# capping upper outliers in gdpp
countries.loc[u_condition, column] = countries[column].quantile(.99)

Inflation Index

column = 'inflation'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-------------+
|      |   inflation |
|------+-------------|
| 0.1  |      0.5878 |
| 0.25 |      1.81   |
| 0.5  |      5.39   |
| 0.75 |     10.75   |
| 0.99 |     41.478  |
+------+-------------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR inflation :  -2.3487999999999998


Lower Outliers :
+-----+------------+-------------+
|     | country    |   inflation |
|-----+------------+-------------|
| 131 | Seychelles |       -4.21 |
|  73 | Ireland    |       -3.22 |
+-----+------------+-------------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR inflation :  41.47800000000002


Upper Outliers :
+-----+-----------+-------------+
|     | country   |   inflation |
|-----+-----------+-------------|
| 163 | Venezuela |        45.9 |
| 113 | Nigeria   |       104   |
+-----+-----------+-------------+

Very high inflation indicates a bad economic state.
The violin plot again shows three peaks, indicating three possible clusters of inflation, each with progressively fewer countries.
Countries with very high inflation might need aid. Since this is our area of interest, let’s keep the upper outliers in inflation as they are for further analysis.
Let’s cap the lower outliers since these look like good economies with no need of aid.

# capping lower outliers in inflation
countries.loc[l_condition, column] = countries[column].quantile(0.01)

Net Income per person

# income
column = 'income'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+----------+
|      |   income |
|------+----------|
| 0.1  |     1524 |
| 0.25 |     3355 |
| 0.5  |     9960 |
| 0.75 |    22800 |
| 0.99 |    84374 |
+------+----------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR income :  742.24


Lower Outliers :
+----+------------------+----------+
|    | country          |   income |
|----+------------------+----------|
| 37 | Congo, Dem. Rep. |      609 |
| 88 | Liberia          |      700 |
+----+------------------+----------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR income :  84374.00000000003


Upper Outliers :
+-----+------------+----------+
|     | country    |   income |
|-----+------------+----------|
|  91 | Luxembourg |    91700 |
| 123 | Qatar      |   125000 |
+-----+------------+----------+

High net income per person is an indicator of the general prosperity of a country. Such countries do not need aid.
Hence, we shall cap the upper outliers, i.e. net incomes greater than those of 99% of the countries.
The lower outliers, i.e. countries with net incomes below the 1st percentile, are our area of interest. Let’s leave these values as they are for further analysis.

Imports per capita

column = 'imports'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-----------+
|      |   imports |
|------+-----------|
| 0.1  |   211.006 |
| 0.25 |   640.215 |
| 0.5  |  2045.58  |
| 0.75 |  7719.6   |
| 0.99 | 55371.4   |
+------+-----------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR imports :  104.90964000000002


Lower Outliers :
+-----+-----------+-----------+
|     | country   |   imports |
|-----+-----------+-----------|
| 107 | Myanmar   |  0.651092 |
|  26 | Burundi   | 90.552    |
+-----+-----------+-----------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR imports :  55371.39000000013


Upper Outliers :
+-----+------------+-----------+
|     | country    |   imports |
|-----+------------+-----------|
| 133 | Singapore  |     81084 |
|  91 | Luxembourg |    149100 |
+-----+------------+-----------+

Exports per capita

column = 'exports'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-----------+
|      |   exports |
|------+-----------|
| 0.1  |   110.225 |
| 0.25 |   447.14  |
| 0.5  |  1777.44  |
| 0.75 |  7278     |
| 0.99 | 64794.3   |
+------+-----------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR exports :  22.243716


Lower Outliers :
+-----+-----------+-----------+
|     | country   |   exports |
|-----+-----------+-----------|
| 107 | Myanmar   |   1.07692 |
|  26 | Burundi   |  20.6052  |
+-----+-----------+-----------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR exports :  64794.26000000014


Upper Outliers :
+-----+------------+-----------+
|     | country    |   exports |
|-----+------------+-----------|
| 133 | Singapore  |     93200 |
|  91 | Luxembourg |    183750 |
+-----+------------+-----------+

Trade Deficit per capita

# trade deficit
countries['trade_deficit'] = countries['imports'] - countries['exports']
column = 'trade_deficit'
l_condition, u_condition = outlier_analysis(column)

Quantiles

+------+-----------------+
|      |   trade_deficit |
|------+-----------------|
| 0.1  |      -3000.82   |
| 0.25 |       -327.05   |
| 0.5  |         89.2182 |
| 0.75 |        518.57   |
| 0.99 |       2270.5    |
+------+-----------------+


LOWER OUTLIER THRESHOLD [1st PERCENTILE] FOR trade_deficit :  -18426.1


Lower Outliers :
+-----+------------+-----------------+
|     | country    |   trade_deficit |
|-----+------------+-----------------|
|  91 | Luxembourg |        -34650   |
| 123 | Qatar      |        -27065.5 |
+-----+------------+-----------------+


UPPER OUTLIER THRESHOLD [99th PERCENTILE] FOR trade_deficit :  2270.500000000002


Upper Outliers :
+----+-----------+-----------------+
|    | country   |   trade_deficit |
|----+-----------+-----------------|
| 60 | Greece    |          2313.4 |
| 10 | Bahamas   |          2436   |
+----+-----------+-----------------+

Trade deficit, computed here as imports minus exports, indicates the balance between imports and exports. A negative trade deficit is a favourable indicator of economic health: it means that the country exports more than it imports.
When the trade deficit is positive, the country is import predominant.
From the above, Greece and Bahamas have a trade deficit higher than 99% of the countries. This could indicate a poor economic state. Since such countries are our key area of interest, these upper outliers are left as they are.
Countries with a strongly negative trade deficit, i.e. export-predominant countries, can be capped to the 1st percentile value.

# capping lower outliers of trade_deficit
countries.loc[l_condition,column] = countries[column].quantile(0.01)
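The capping step above can also be written with pandas' `clip`. The sketch below is only an illustration on made-up numbers: the helper name `cap_lower_tail` and the toy series are hypothetical, and it assumes that `l_condition` marks the rows lying below the chosen lower quantile.

```python
import pandas as pd

def cap_lower_tail(series, q=0.01):
    # Hypothetical helper: raise every value below the q-th quantile up to that quantile
    floor = series.quantile(q)
    return series.clip(lower=floor)

# Toy values, not taken from the countries dataset
demo = pd.Series([-100.0, -5.0, 0.0, 5.0, 10.0])
capped = cap_lower_tail(demo, q=0.25)
print(capped.tolist())
```

Only the lower tail is touched here, which matches the decision to leave the upper outliers (import-predominant countries) as they are.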

Bivariate Analysis

Pairplot

# Pair Plots of all variables
sns.pairplot(countries[['child_mort', 'exports', 'health', 'imports', 'income',
       'inflation', 'life_expec', 'total_fer', 'gdpp']]);

## bivariate analysis bokeh function
def bivariate_analysis(x_var,y_var,dataframe=countries) :
    # Bivariate Plots with tooltips : country,x,y

    dataframe = dataframe.copy()
    source =ColumnDataSource(dataframe)

    # pallete = ["rgba(38, 70, 83, 1)", 'rgba(42, 157, 143, 1)', "rgba(233, 196, 106, 1)", "rgba(244, 162, 97, 1)", "rgba(231, 111, 81, 1)", '#009cc7']
    pallete = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e","#38895e",'rgba(42, 157, 143, 1)']

    # tooltips
    tooltips1 = [
    ("Country", "@country"),
    (x_var,'@'+x_var),
    (y_var,'@'+y_var)
    ]

    p = figure(plot_width=420, plot_height=400,title='Scatter Plot : '+ x_var+' vs '+y_var, tooltips=tooltips1)

    p.scatter(x=x_var,y=y_var, fill_alpha=0.6, size=8, source=source, fill_color = "#3498db")
    p.xaxis.axis_label = x_var
    p.yaxis.axis_label = y_var
    reset_output()
    output_notebook()
    show(p)

Child Mortality vs Life Expectancy

# CHILD MORTALITY vs LIFE EXPECTANCY

x='child_mort'
y = 'life_expec'
bivariate_analysis(x,y)

It looks like child_mort and life_expec are almost linearly related.
Life expectancy decreases as child mortality increases.
One of these features would be enough for the analysis.

# Countries with high child mortality and low life expectancy
life_expect_cond = countries['life_expec'] < 60
child_mort_cond = countries['child_mort'] > 100

print('Countries with low life expectancy and high Child Mortality Rate')
tab(countries[life_expect_cond & child_mort_cond][['country','child_mort','life_expec']].sort_values(by=['child_mort','life_expec'], ascending=[False,True])[:10])

Countries with low life expectancy and high Child Mortality Rate
+-----+--------------------------+--------------+--------------+
|     | country                  |   child_mort |   life_expec |
|-----+--------------------------+--------------+--------------|
|  66 | Haiti                    |          208 |         32.1 |
| 132 | Sierra Leone             |          160 |         55   |
|  32 | Chad                     |          150 |         56.5 |
|  31 | Central African Republic |          149 |         47.5 |
|  97 | Mali                     |          137 |         59.5 |
| 112 | Niger                    |          123 |         58.8 |
|  37 | Congo, Dem. Rep.         |          116 |         57.5 |
|  25 | Burkina Faso             |          116 |         57.9 |
|  64 | Guinea-Bissau            |          114 |         55.6 |
|  40 | Cote d'Ivoire            |          111 |         56.3 |
+-----+--------------------------+--------------+--------------+

Child Mortality vs Total Fertility

# CHILD MORTALITY vs TOTAL FERTILITY

x='child_mort'
y = 'total_fer'
bivariate_analysis(x,y)

# countries with high fertility and child mortality rate
total_fer_cond = countries['total_fer'] > 6
child_mort_cond = countries['child_mort'] > 100

print('Countries with High Total Fertility and Child Mortality Rate')
tab(countries[total_fer_cond & child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))

Countries with High Total Fertility and Child Mortality Rate
+-----+------------------+-------------+--------------+
|     | country          |   total_fer |   child_mort |
|-----+------------------+-------------+--------------|
|  32 | Chad             |        6.59 |          150 |
|  97 | Mali             |        6.55 |          137 |
| 112 | Niger            |        7.49 |          123 |
|   3 | Angola           |        6.16 |          119 |
|  37 | Congo, Dem. Rep. |        6.54 |          116 |
+-----+------------------+-------------+--------------+

# Fertility in countries with extremely high mortality rate
child_mort_cond = countries['child_mort'] > 150

print('Total Fertility in Countries with Extremely High Child Mortality Rate')
tab(countries[child_mort_cond][['country','total_fer','child_mort']].sort_values(by=['child_mort','total_fer'], ascending=[False,False]))

Total Fertility in Countries with Extremely High Child Mortality Rate
+-----+--------------+-------------+--------------+
|     | country      |   total_fer |   child_mort |
|-----+--------------+-------------+--------------|
|  66 | Haiti        |        3.33 |          208 |
| 132 | Sierra Leone |        5.2  |          160 |
+-----+--------------+-------------+--------------+

GDP per capita vs Health Spending

# GDP per capita vs Health Spending
y='health'
x= 'gdpp'
bivariate_analysis(x,y)

# Countries with Low GDPP and health spending might need aid
health_cond = countries['health'] < 200
gdpp_cond = countries['gdpp'] < 2500
print('Countries with Low GDP and low Health Spending')
tab(countries[health_cond & gdpp_cond][['country','health','gdpp']].sort_values(by=['health','gdpp'], ascending=[True,True])[:10])

Countries with Low GDP and low Health Spending
+-----+--------------------------+----------+--------+
|     | country                  |   health |   gdpp |
|-----+--------------------------+----------+--------|
|  50 | Eritrea                  |  12.8212 |    482 |
|  93 | Madagascar               |  15.5701 |    413 |
|  31 | Central African Republic |  17.7508 |    446 |
| 112 | Niger                    |  17.9568 |    348 |
| 107 | Myanmar                  |  19.4636 |    988 |
| 106 | Mozambique               |  21.8299 |    419 |
| 116 | Pakistan                 |  22.88   |   1040 |
|  37 | Congo, Dem. Rep.         |  26.4194 |    334 |
|  12 | Bangladesh               |  26.6816 |    758 |
|  26 | Burundi                  |  26.796  |    231 |
+-----+--------------------------+----------+--------+

Imports vs GDP per capita

# IMPORTS vs GDPP
y='imports'
x= 'gdpp'
bivariate_analysis(x,y)

Exports vs GDP per capita

# Exports vs GDPP
y='exports'
x= 'gdpp'
bivariate_analysis(x,y)

GDP per capita vs Trade Deficit

# GDP per capita vs Trade Deficit
y='trade_deficit'
x= 'gdpp'
bivariate_analysis(x,y)

Trade deficit is a country’s imports minus its exports.
A positive value indicates the country's dependence on imports.
Countries with low GDP per capita and a positive trade deficit are likely to require aid.
From the above plot, one can see that there are many such countries.

# Countries with positive high trade deficit and low GDP per capita
tr_deficit_cond = countries['trade_deficit'] > 1000
gdpp_cond = countries['gdpp'] < 10000
print('Countries with high Trade Deficit and Low GDP')
tab(countries[tr_deficit_cond & gdpp_cond][['country','gdpp','trade_deficit']].sort_values(by=['trade_deficit','gdpp'], ascending=[True,False])[:10])

Countries with high Trade Deficit and Low GDP
+-----+--------------------------------+--------+-----------------+
|     | country                        |   gdpp |   trade_deficit |
|-----+--------------------------------+--------+-----------------|
| 101 | Micronesia, Fed. Sts.          |   2860 |         1644.5  |
| 151 | Tonga                          |   3550 |         1700.45 |
| 104 | Montenegro                     |   6680 |         1716.76 |
|  61 | Grenada                        |   7370 |         1871.98 |
| 141 | St. Vincent and the Grenadines |   6230 |         1881.46 |
|  86 | Lebanon                        |   8860 |         2161.84 |
+-----+--------------------------------+--------+-----------------+

Net Income per person vs Inflation

# Income vs Inflation
x = 'inflation'
y = 'income'
bivariate_analysis(x,y)

# Low Income - High inflation countries might require aid
inflation_cond = countries['inflation'] > 20
income_condition = countries['income'] < 10000
print('Countries with high inflation and low income')
tab(countries[inflation_cond & income_condition][['country','inflation','income']].sort_values(by=['inflation','income'], ascending=[False, True]))

Countries with high inflation and low income
+-----+------------------+-------------+----------+
|     | country          |   inflation |   income |
|-----+------------------+-------------+----------|
| 113 | Nigeria          |       104   |     5150 |
| 103 | Mongolia         |        39.2 |     7710 |
| 149 | Timor-Leste      |        26.5 |     1850 |
| 165 | Yemen            |        23.6 |     4480 |
| 140 | Sri Lanka        |        22.8 |     8560 |
|   3 | Angola           |        22.4 |     5900 |
|  37 | Congo, Dem. Rep. |        20.8 |      609 |
|  38 | Congo, Rep.      |        20.7 |     5190 |
+-----+------------------+-------------+----------+

Countries with High inflation and low income are possible candidates for aid requirement.

GDP per capita vs Inflation

# GDP per capita vs Inflation
x = 'gdpp'
y = 'inflation'
bivariate_analysis(x,y)

Countries with low GDP per capita and high inflation are in dire need of support.
For example, Nigeria has an inflation above 100 while its GDP per capita is 2330.

Correlation Analysis

plt.figure(figsize=[12,12])
# numeric_only=True (pandas 1.5+) skips the non-numeric 'country' column
sns.heatmap(countries.corr(numeric_only=True),annot=True,cmap='YlGnBu', center=0)

<matplotlib.axes._subplots.AxesSubplot at 0x7fe87cdf8510>

Top Correlations

Negative Correlation between life_expec and child_mort
Positive Correlation between total_fer and child_mort

Although multicollinearity does not prevent cluster analysis from running, this plot shows the possible linear relationships between different features, which helps in interpreting the results obtained from cluster analysis.
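To read the strongest relationships off a correlation matrix without scanning the heatmap, the pairs can be ranked by absolute correlation. This is a minimal sketch on toy data: the helper name `top_correlations` and the three toy columns are hypothetical, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    # Hypothetical helper: rank feature pairs by absolute Pearson correlation
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data, not the countries dataset
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [10, 8, 6, 4, 2],   # perfectly negatively related to 'a'
    'c': [2, 1, 4, 3, 5],
})
result = top_correlations(toy)
print(result)
```

Applied to the countries data, a ranking of this kind would be expected to surface pairs such as life_expec / child_mort near the top, in line with the heatmap.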

Hopkin’s Statistic

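# hopkins test function
import numpy as np
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan

def hopkins(X):
    # X : DataFrame of numeric features
    d = X.shape[1] # columns
    n = len(X) # rows
    m = int(0.1 * n) # sample size : 10% of the rows
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)

    # indices of m randomly sampled rows of X
    rand_X = sample(range(0, n, 1), m)

    ujd = [] # distances from uniform random points to the data
    wjd = [] # distances from sampled data points to the rest of the data
    for j in range(0, m):
        # NOTE : the uniform random point is not a row of X, so [0][1] is the distance to its
        # second-nearest data point; [0][0] would be the nearest one.
        # Kept as-is so that the code matches the values reported below.
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        # the sampled row is itself in X : [0][0] is the zero self-distance, [0][1] the nearest other row
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H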

## Data used for Clustering
columns_for_clustering = ['child_mort', 'exports', 'health', 'imports', 'income',
       'inflation', 'life_expec', 'total_fer', 'gdpp', 'trade_deficit']
clustering_data = countries[columns_for_clustering].copy()

# Hopkin's test
n = 10
hopkins_statistic = []
for i in range(n) :
    hopkins_statistic.append(hopkins(clustering_data))
print('Min Hopkin\'s Statistic in ',n,'iterations :', min(hopkins_statistic))
print('Max Hopkin\'s Statistic in ',n,'iterations :', max(hopkins_statistic))
print('Mean Hopkin\'s Statistic in ',n,'iterations :', np.mean(hopkins_statistic))
print('Std deviation of Hopkin\'s Statistic in ',n,'iterations :', np.std(hopkins_statistic))

Min Hopkin's Statistic in  10 iterations : 0.8594851780754723
Max Hopkin's Statistic in  10 iterations : 0.9789539440735269
Mean Hopkin's Statistic in  10 iterations : 0.9482587172561626
Std deviation of Hopkin's Statistic in  10 iterations : 0.03487962411406433

Since the Hopkins statistic is above 0.8 in all 10 iterations (mean of about 0.95), the data shows a good clustering tendency.
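For intuition, the statistic is H = sum(u) / (sum(u) + sum(w)), where u are nearest-neighbour distances from uniformly random points to the data and w are nearest-neighbour distances from sampled data points to the other data points. Values near 0.5 suggest no cluster structure and values near 1 suggest strong clustering. The sketch below is a NumPy-only illustration on synthetic data, not the function used above: the name `hopkins_sketch` and both toy datasets are assumptions, and it takes the nearest data point for the uniform points (the hopkins function above takes index 1 instead).

```python
import numpy as np

def hopkins_sketch(X, sample_frac=0.1, seed=0):
    # Hypothetical, brute-force illustration of the Hopkins statistic
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_frac * n))

    # u: distance from each uniform random point (inside the bounding box) to its nearest data point
    uniform_pts = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(uniform_pts[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    # w: distance from each sampled data point to its nearest *other* data point
    idx = rng.choice(n, size=m, replace=False)
    dists = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dists[np.arange(m), idx] = np.inf  # exclude the zero self-distance
    w = dists.min(axis=1)

    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(42)
# Two tight, well-separated blobs versus structureless uniform noise
clustered = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(5, 0.1, (100, 2))])
uniform_data = rng.uniform(0, 1, (400, 2))

h_clustered = hopkins_sketch(clustered)
h_uniform = hopkins_sketch(uniform_data)
print(h_clustered, h_uniform)
```

The clustered toy data should score close to 1 and the uniform data close to 0.5, which is the contrast the 0.8 threshold relies on.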

Standardizing Values

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
clustering_data[columns_for_clustering] = scaler.fit_transform(clustering_data[columns_for_clustering])

tab(clustering_data.describe())

+-------+---------------+--------------+---------------+---------------+---------------+---------------+--------------+---------------+---------------+-----------------+
|       |    child_mort |      exports |        health |       imports |        income |     inflation |   life_expec |     total_fer |          gdpp |   trade_deficit |
|-------+---------------+--------------+---------------+---------------+---------------+---------------+--------------+---------------+---------------+-----------------|
| count | 167           | 167          | 167           | 167           | 167           | 167           | 167          | 167           | 167           |   167           |
| mean  |  -7.97765e-18 |   9.1743e-17 |   2.26033e-17 |   4.65363e-17 |  -7.51229e-17 |   8.31005e-17 |   3.7229e-17 |   8.24357e-17 |   8.04413e-17 |     6.18268e-17 |
| std   |   1.00301     |   1.00301    |   1.00301     |   1.00301     |   1.00301     |   1.00301     |   1.00301    |   1.00301     |   1.00301     |     1.00301     |
| min   |  -0.882217    |  -0.414037   |  -0.583254    |  -0.44916     |  -0.860326    |  -0.964355    |  -4.33969    |  -1.12962     |  -0.720789    |    -5.67561     |
| 25%   |  -0.746668    |  -0.389145   |  -0.546449    |  -0.405554    |  -0.717456    |  -0.569109    |  -0.592657   |  -0.767708    |  -0.657548    |     0.113986    |
| 50%   |  -0.47184     |  -0.31491    |  -0.410154    |  -0.309734    |  -0.373808    |  -0.228871    |   0.287671   |  -0.359318    |  -0.465925    |     0.247143    |
| 75%   |   0.592652    |  -0.00795865 |  -0.0432751   |   0.0771304   |   0.294237    |   0.280535    |   0.705262   |   0.616834    |   0.0744147   |     0.384486    |
| max   |   4.22138     |   9.83981    |   4.11998     |   9.71668     |   5.61154     |   9.14287     |   1.33391    |   3.01405     |   3.81697     |     0.997842    |
+-------+---------------+--------------+---------------+---------------+---------------+---------------+--------------+---------------+---------------+-----------------+
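The std row above reads 1.00301 rather than exactly 1. This is expected: StandardScaler divides by the population standard deviation (ddof=0), whereas describe() reports the sample standard deviation (ddof=1), so with n = 167 rows the reported value is sqrt(167/166). The sketch below reproduces this with NumPy on random stand-in values (the data are made up; only n = 167 comes from the table above).

```python
import numpy as np

n = 167  # number of countries, as in the count row above
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=n)  # stand-in values, not real data

# Standardize the way StandardScaler does: population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

# describe() reports the sample standard deviation (ddof=1)
sample_std = z.std(ddof=1)
expected = np.sqrt(n / (n - 1))
print(round(float(sample_std), 5), round(float(expected), 5))
```

Both printed values round to 1.00301, so the scaled features do have unit variance in the sense StandardScaler uses.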

K-Means Clustering

Finding Optimal Number of Clusters

Elbow curve

# Plotting Elbow curve of Sum of Squared distances of points in each cluster from the centroid of the nearest cluster.
from sklearn.cluster import KMeans

ssd = []
range_n_clusters = np.arange(2,9)
for num_clusters in range_n_clusters :
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(clustering_data)
    ssd.append(kmeans.inertia_)
plt.plot(range_n_clusters,ssd)
plt.title('Elbow Curve');
plt.xlabel('No of clusters');
plt.ylabel('SSD');

From the above elbow curve, one can clearly see a steep drop in SSD from k=2 to k=4, after which the curve tapers off (the change in slope is no longer as significant as before).
Hence, k=4 is the optimum number of clusters, statistically.
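The elbow is read visually above. As a rough cross-check, the bend can also be estimated numerically from the second difference of the SSD curve. The sketch below runs on synthetic blobs from scikit-learn, not on the countries data, and the second-difference rule is only a heuristic chosen for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four blobs, used purely for illustration
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(2, 9))
ssd = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# second_diff[i] measures how sharply the curve bends at ks[i + 1]
second_diff = np.diff(ssd, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print(ssd)
print('Sharpest bend at k =', elbow_k)
```

Fixing random_state and n_init makes the curve repeatable between runs, which is useful when the elbow is not clear-cut.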

Silhouette Analysis

from sklearn.metrics import silhouette_score

no_of_clusters = np.arange(2,10)
score = []

for n_cluster in no_of_clusters :
    kmeans = KMeans(n_clusters=n_cluster, init='k-means++')
    kmeans = kmeans.fit(clustering_data)
    labels = kmeans.labels_
    score.append(silhouette_score(clustering_data,labels))


plt.title('Silhouette Analysis Plot')
plt.xlabel('No of Clusters')
plt.ylabel('Silhouette Score')
plt.plot(no_of_clusters, score);
print(score)

[0.46329893684299267, 0.40889975765795494, 0.4113959420115351, 0.40485071262968453, 0.41404360987563693, 0.31039501753497534, 0.302668856614785, 0.30746584335288263]

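The higher the silhouette score, the better.
From the above plot, the silhouette score is highest for k = 2 (0.463), falls sharply at k = 3 (0.409) and has a local maximum at k = 4 (0.411). The score at k = 6 (0.414) is marginally higher, but the elbow curve already flattens beyond k = 4.
Taken together with the elbow curve, k = 4 seems to be the optimum number of clusters.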
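The reading of the silhouette scores can be checked directly from the printed list. The sketch below uses the exact scores shown above; the variable names are illustrative only.

```python
# Silhouette scores printed above, for k = 2 .. 9
ks = list(range(2, 10))
scores = [0.46329893684299267, 0.40889975765795494, 0.4113959420115351,
          0.40485071262968453, 0.41404360987563693, 0.31039501753497534,
          0.302668856614785, 0.30746584335288263]

# k with the highest score overall
best_overall = ks[scores.index(max(scores))]

# interior local maxima: score higher than both neighbours
local_maxima = [ks[i] for i in range(1, len(scores) - 1)
                if scores[i] > scores[i - 1] and scores[i] > scores[i + 1]]

# best k when the trivial two-way split is excluded
best_from_3 = max(zip(scores[1:], ks[1:]))[1]

print(best_overall, local_maxima, best_from_3)  # 2 [4, 6] 6
```

So k = 4 is a local maximum, k = 6 scores about 0.003 higher, and k = 2 is highest overall. The choice of k = 4 therefore rests on the silhouette and elbow results together rather than on the silhouette score alone.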

Final k - Means Clustering

# k - means clustering algo with k = 4
n_cluster = 4
kmeans = KMeans(n_clusters=n_cluster, init='k-means++', random_state = 100)
kmeans = kmeans.fit(clustering_data)
labels = kmeans.labels_
countries['k_means_cluster_id'] = labels

# Countries in each Cluster - k means
for cluster_no in range(n_cluster) :
    condition = countries['k_means_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Australia' 'Austria' 'Belgium' 'Brunei' 'Canada' 'Cyprus' 'Denmark'
 'Finland' 'France' 'Germany' 'Greece' 'Iceland' 'Ireland' 'Israel'
 'Italy' 'Japan' 'Kuwait' 'Malta' 'Netherlands' 'New Zealand' 'Norway'
 'Slovenia' 'Spain' 'Sweden' 'Switzerland' 'United Arab Emirates'
 'United Kingdom' 'United States']


CLUSTER # 1
 ['Afghanistan' 'Angola' 'Benin' 'Botswana' 'Burkina Faso' 'Burundi'
 'Cameroon' 'Central African Republic' 'Chad' 'Comoros' 'Congo, Dem. Rep.'
 'Congo, Rep.' "Cote d'Ivoire" 'Equatorial Guinea' 'Eritrea' 'Gabon'
 'Gambia' 'Ghana' 'Guinea' 'Guinea-Bissau' 'Haiti' 'Iraq' 'Kenya'
 'Kiribati' 'Lao' 'Lesotho' 'Liberia' 'Madagascar' 'Malawi' 'Mali'
 'Mauritania' 'Mozambique' 'Namibia' 'Niger' 'Nigeria' 'Pakistan' 'Rwanda'
 'Senegal' 'Sierra Leone' 'Solomon Islands' 'South Africa' 'Sudan'
 'Tanzania' 'Timor-Leste' 'Togo' 'Uganda' 'Yemen' 'Zambia']


CLUSTER # 2
 ['Albania' 'Algeria' 'Antigua and Barbuda' 'Argentina' 'Armenia'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belize' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria'
 'Cambodia' 'Cape Verde' 'Chile' 'China' 'Colombia' 'Costa Rica' 'Croatia'
 'Czech Republic' 'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador'
 'Estonia' 'Fiji' 'Georgia' 'Grenada' 'Guatemala' 'Guyana' 'Hungary'
 'India' 'Indonesia' 'Iran' 'Jamaica' 'Jordan' 'Kazakhstan'
 'Kyrgyz Republic' 'Latvia' 'Lebanon' 'Libya' 'Lithuania' 'Macedonia, FYR'
 'Malaysia' 'Maldives' 'Mauritius' 'Micronesia, Fed. Sts.' 'Moldova'
 'Mongolia' 'Montenegro' 'Morocco' 'Myanmar' 'Nepal' 'Oman' 'Panama'
 'Paraguay' 'Peru' 'Philippines' 'Poland' 'Portugal' 'Romania' 'Russia'
 'Samoa' 'Saudi Arabia' 'Serbia' 'Seychelles' 'Slovak Republic'
 'South Korea' 'Sri Lanka' 'St. Vincent and the Grenadines' 'Suriname'
 'Tajikistan' 'Thailand' 'Tonga' 'Tunisia' 'Turkey' 'Turkmenistan'
 'Ukraine' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela' 'Vietnam']


CLUSTER # 3
 ['Luxembourg' 'Qatar' 'Singapore']

Hierarchical Clustering

HAC : Single Linkage, Euclidean Measure

# Agglomerative Single Linkage
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree

mergings = linkage(clustering_data,method='single',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Single Linkage - Hierarchical Clustering')
dendrogram(mergings);

HAC : Complete Linkage, Euclidean Measure

# Complete Linkage
mergings = linkage(clustering_data,method='complete',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Complete Linkage - Hierarchical Clustering')
dendrogram(mergings);

Hierarchical clustering with complete linkage produces a more discriminative dendrogram than single linkage.

# Using Complete Linkage, cutting the tree for 6 clusters
n_clusters = 6
cluster_labels = cut_tree(mergings, n_clusters=n_clusters)
# cut_tree returns an (n, 1) array; flatten it to 1-D before storing it as a column
countries['hac_complete_cluster_id'] = cluster_labels.ravel()
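The linkage and cut_tree steps can be tried in isolation on a handful of points. The sketch below uses ten synthetic 2-D points in two well-separated groups (made-up data, chosen only so that the expected grouping is obvious).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
# Two small, well-separated groups of five points each (synthetic)
points = np.vstack([rng.normal(0, 0.2, (5, 2)), rng.normal(10, 0.2, (5, 2))])

# Build the merge tree with complete linkage, then cut it into two flat clusters
mergings = linkage(points, method='complete', metric='euclidean')
labels = cut_tree(mergings, n_clusters=2).ravel()  # (n, 1) array flattened to (n,)
print(labels)
```

The first five points share one label and the last five share the other, which is the same mechanism used above to assign each country a cluster id.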

# Countries in each Cluster - Hierarchical - Complete Linkage
for cluster_no in range(n_clusters) :
    condition = countries['hac_complete_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Afghanistan' 'Angola' 'Benin' 'Botswana' 'Burkina Faso' 'Burundi'
 'Cameroon' 'Central African Republic' 'Chad' 'Comoros' 'Congo, Dem. Rep.'
 'Congo, Rep.' "Cote d'Ivoire" 'Equatorial Guinea' 'Eritrea' 'Gabon'
 'Gambia' 'Ghana' 'Guinea' 'Guinea-Bissau' 'Haiti' 'Iraq' 'Kenya'
 'Kiribati' 'Lao' 'Lesotho' 'Liberia' 'Madagascar' 'Malawi' 'Mali'
 'Mauritania' 'Mozambique' 'Namibia' 'Niger' 'Pakistan' 'Rwanda' 'Senegal'
 'Sierra Leone' 'Solomon Islands' 'South Africa' 'Sudan' 'Tanzania'
 'Timor-Leste' 'Togo' 'Uganda' 'Yemen' 'Zambia']


CLUSTER # 1
 ['Albania' 'Algeria' 'Antigua and Barbuda' 'Argentina' 'Armenia'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belize' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria'
 'Cambodia' 'Cape Verde' 'Chile' 'China' 'Colombia' 'Costa Rica' 'Croatia'
 'Cyprus' 'Czech Republic' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Estonia' 'Fiji' 'Georgia' 'Greece' 'Grenada' 'Guatemala'
 'Guyana' 'Hungary' 'India' 'Indonesia' 'Iran' 'Israel' 'Italy' 'Jamaica'
 'Jordan' 'Kazakhstan' 'Kyrgyz Republic' 'Latvia' 'Lebanon' 'Libya'
 'Lithuania' 'Macedonia, FYR' 'Malaysia' 'Maldives' 'Malta' 'Mauritius'
 'Micronesia, Fed. Sts.' 'Moldova' 'Mongolia' 'Montenegro' 'Morocco'
 'Myanmar' 'Nepal' 'New Zealand' 'Oman' 'Panama' 'Paraguay' 'Peru'
 'Philippines' 'Poland' 'Portugal' 'Romania' 'Russia' 'Samoa'
 'Saudi Arabia' 'Serbia' 'Seychelles' 'Slovak Republic' 'Slovenia'
 'South Korea' 'Spain' 'Sri Lanka' 'St. Vincent and the Grenadines'
 'Suriname' 'Tajikistan' 'Thailand' 'Tonga' 'Tunisia' 'Turkey'
 'Turkmenistan' 'Ukraine' 'United Arab Emirates' 'Uruguay' 'Uzbekistan'
 'Vanuatu' 'Venezuela' 'Vietnam']


CLUSTER # 2
 ['Australia' 'Austria' 'Belgium' 'Canada' 'Denmark' 'Finland' 'France'
 'Germany' 'Iceland' 'Ireland' 'Japan' 'Netherlands' 'Norway' 'Sweden'
 'Switzerland' 'United Kingdom' 'United States']


CLUSTER # 3
 ['Brunei' 'Kuwait' 'Qatar' 'Singapore']


CLUSTER # 4
 ['Luxembourg']


CLUSTER # 5
 ['Nigeria']

HAC : Average Linkage, Euclidean Measure

# Average Linkage
mergings = linkage(clustering_data,method='average',metric='euclidean')
plt.figure(figsize=[16,10])
plt.title('Average Linkage - Hierarchical Clustering')
dendrogram(mergings);

# Using Average Linkage, cutting the tree for 6 clusters
n_clusters = 6
cluster_labels = cut_tree(mergings, n_clusters=n_clusters)
# cut_tree returns an (n, 1) array; flatten it to 1-D before storing it as a column
countries['hac_average_cluster_id'] = cluster_labels.ravel()

# Countries in each Cluster - Hierarchical - average Linkage
for cluster_no in range(n_clusters) :
    condition = countries['hac_average_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Antigua and Barbuda'
 'Argentina' 'Armenia' 'Australia' 'Austria' 'Azerbaijan' 'Bahamas'
 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin'
 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Botswana' 'Brazil'
 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
 'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Costa Rica' "Cote d'Ivoire"
 'Croatia' 'Cyprus' 'Czech Republic' 'Denmark' 'Dominican Republic'
 'Ecuador' 'Egypt' 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia'
 'Fiji' 'Finland' 'France' 'Gabon' 'Gambia' 'Georgia' 'Germany' 'Ghana'
 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau' 'Guyana'
 'Hungary' 'Iceland' 'India' 'Indonesia' 'Iran' 'Iraq' 'Israel' 'Italy'
 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya' 'Kiribati'
 'Kyrgyz Republic' 'Lao' 'Latvia' 'Lebanon' 'Lesotho' 'Liberia' 'Libya'
 'Lithuania' 'Macedonia, FYR' 'Madagascar' 'Malawi' 'Malaysia' 'Maldives'
 'Mali' 'Malta' 'Mauritania' 'Mauritius' 'Micronesia, Fed. Sts.' 'Moldova'
 'Mongolia' 'Montenegro' 'Morocco' 'Mozambique' 'Myanmar' 'Namibia'
 'Nepal' 'Netherlands' 'New Zealand' 'Niger' 'Oman' 'Pakistan' 'Panama'
 'Paraguay' 'Peru' 'Philippines' 'Poland' 'Portugal' 'Romania' 'Russia'
 'Rwanda' 'Samoa' 'Saudi Arabia' 'Senegal' 'Serbia' 'Seychelles'
 'Sierra Leone' 'Slovak Republic' 'Slovenia' 'Solomon Islands'
 'South Africa' 'South Korea' 'Spain' 'Sri Lanka'
 'St. Vincent and the Grenadines' 'Sudan' 'Suriname' 'Sweden' 'Tajikistan'
 'Tanzania' 'Thailand' 'Timor-Leste' 'Togo' 'Tonga' 'Tunisia' 'Turkey'
 'Turkmenistan' 'Uganda' 'Ukraine' 'United Arab Emirates' 'United Kingdom'
 'United States' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela' 'Vietnam'
 'Yemen' 'Zambia']


CLUSTER # 1
 ['Brunei' 'Ireland' 'Kuwait' 'Norway' 'Qatar' 'Switzerland']


CLUSTER # 2
 ['Haiti']


CLUSTER # 3
 ['Luxembourg']


CLUSTER # 4
 ['Nigeria']


CLUSTER # 5
 ['Singapore']

HAC : Complete Linkage , Correlation Measure

## HAC Clustering : Dissimilarity Measure : Correlation
hac_correlation_mergings = linkage(clustering_data,method='complete', metric='correlation')
plt.figure(figsize=[12,12])
plt.title('Hierarchical Clustering : Complete Linkage, Correlation Measure')
dendrogram(hac_correlation_mergings);

n_clusters = 6
labels = cut_tree(hac_correlation_mergings, n_clusters=n_clusters)
# cut_tree returns an (n, 1) array; flatten it to 1-D before storing it as a column
countries['hac_correlation_cluster_id'] = labels.ravel()

# HAC clustering : Correlation measure : complete distance : Countries in each cluster

for cluster_no in range(n_clusters) :
    condition = countries['hac_correlation_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Afghanistan' 'Angola' 'Benin' 'Botswana' 'Burkina Faso' 'Burundi'
 'Cameroon' 'Central African Republic' 'Chad' 'Comoros' 'Congo, Dem. Rep.'
 'Congo, Rep.' "Cote d'Ivoire" 'Equatorial Guinea' 'Eritrea' 'Gabon'
 'Gambia' 'Ghana' 'Guinea' 'Guinea-Bissau' 'Haiti' 'India' 'Iraq' 'Kenya'
 'Kiribati' 'Lao' 'Lesotho' 'Liberia' 'Madagascar' 'Malawi' 'Mali'
 'Mauritania' 'Mozambique' 'Myanmar' 'Namibia' 'Niger' 'Pakistan' 'Rwanda'
 'Senegal' 'Sierra Leone' 'South Africa' 'Sudan' 'Tanzania' 'Timor-Leste'
 'Togo' 'Turkmenistan' 'Uganda' 'Yemen' 'Zambia']


CLUSTER # 1
 ['Albania' 'Belize' 'Cape Verde' 'Colombia' 'Dominican Republic' 'Ecuador'
 'El Salvador' 'Grenada' 'Maldives' 'Morocco' 'Panama' 'Paraguay' 'Peru'
 'St. Vincent and the Grenadines' 'Thailand' 'Tunisia']


CLUSTER # 2
 ['Algeria' 'Argentina' 'Armenia' 'Azerbaijan' 'Belarus' 'Georgia' 'Iran'
 'Jamaica' 'Kazakhstan' 'Moldova' 'Mongolia' 'Nigeria' 'Russia'
 'Sri Lanka' 'Suriname' 'Ukraine' 'Venezuela' 'Vietnam']


CLUSTER # 3
 ['Antigua and Barbuda' 'Australia' 'Bahamas' 'Barbados'
 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria' 'Canada' 'Chile' 'China'
 'Costa Rica' 'Croatia' 'Cyprus' 'Czech Republic' 'Estonia' 'France'
 'Greece' 'Hungary' 'Israel' 'Italy' 'Japan' 'Latvia' 'Lebanon'
 'Lithuania' 'Macedonia, FYR' 'Malaysia' 'Malta' 'Mauritius' 'Montenegro'
 'New Zealand' 'Poland' 'Portugal' 'Romania' 'Serbia' 'Seychelles'
 'Slovak Republic' 'Slovenia' 'South Korea' 'Spain' 'Turkey'
 'United Kingdom' 'United States' 'Uruguay']


CLUSTER # 4
 ['Austria' 'Bahrain' 'Belgium' 'Brunei' 'Denmark' 'Finland' 'Germany'
 'Iceland' 'Ireland' 'Kuwait' 'Libya' 'Luxembourg' 'Netherlands' 'Norway'
 'Oman' 'Qatar' 'Saudi Arabia' 'Singapore' 'Sweden' 'Switzerland'
 'United Arab Emirates']


CLUSTER # 5
 ['Bangladesh' 'Bhutan' 'Bolivia' 'Cambodia' 'Egypt' 'Fiji' 'Guatemala'
 'Guyana' 'Indonesia' 'Jordan' 'Kyrgyz Republic' 'Micronesia, Fed. Sts.'
 'Nepal' 'Philippines' 'Samoa' 'Solomon Islands' 'Tajikistan' 'Tonga'
 'Uzbekistan' 'Vanuatu']

Mixed K-Means Clustering

# Performing k-means using results of Hierarchical clustering
# 1. No of clusters of Hierarchical Clustering
# 2. Centroids obtained from Hierarchical Clustering as the initialization points.
clustering_data['k_means_cluster_id'] = countries['k_means_cluster_id']
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
columns = columns_for_clustering.copy()
columns.extend(['hac_correlation_cluster_id'])
centroids = clustering_data[columns].groupby(['hac_correlation_cluster_id']).mean()
n_clusters = 6
# n_init=1 : the initial centroids are given explicitly, so a single run is sufficient
mixed_kmeans = KMeans(n_clusters=n_clusters , init = centroids.values, n_init=1, random_state=100)
results = mixed_kmeans.fit(clustering_data[columns_for_clustering])

countries['mixed_cluster_id'] = results.labels_
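The idea of seeding k-means with centroids taken from a hierarchical clustering can be seen end to end on synthetic data. The sketch below is an illustration only: the blobs come from scikit-learn's make_blobs, the cluster count of 3 is arbitrary, and Euclidean complete linkage is used for simplicity rather than the correlation measure used above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three blobs (not the countries dataset)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=1)

# Step 1: hierarchical clustering gives an initial partition
hac_labels = cut_tree(linkage(X, method='complete'), n_clusters=3).ravel()

# Step 2: the mean of each hierarchical cluster becomes a starting centroid
centroids = np.vstack([X[hac_labels == c].mean(axis=0) for c in range(3)])

# Step 3: k-means refines the partition from those starting points (single run)
km = KMeans(n_clusters=3, init=centroids, n_init=1).fit(X)
print(km.cluster_centers_)
```

Because the starting centroids are fixed, the result does not depend on random initialization, which is the main practical appeal of this mixed approach.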

# Mixed clustering : Euclidean measure : k-means : Countries in each cluster

for cluster_no in range(n_clusters) :
    condition = countries['mixed_cluster_id'] == cluster_no
    print( 'CLUSTER #',cluster_no,'\n',countries.loc[condition,'country'].values,'\n\n')

CLUSTER # 0
 ['Afghanistan' 'Angola' 'Benin' 'Burkina Faso' 'Burundi' 'Cameroon'
 'Central African Republic' 'Chad' 'Comoros' 'Congo, Dem. Rep.'
 'Congo, Rep.' "Cote d'Ivoire" 'Equatorial Guinea' 'Gambia' 'Guinea'
 'Guinea-Bissau' 'Haiti' 'Lesotho' 'Liberia' 'Malawi' 'Mali' 'Mauritania'
 'Mozambique' 'Niger' 'Sierra Leone' 'Sudan' 'Tanzania' 'Timor-Leste'
 'Togo' 'Uganda' 'Zambia']


CLUSTER # 1
 ['Albania' 'Algeria' 'Antigua and Barbuda' 'Argentina' 'Armenia'
 'Azerbaijan' 'Bahrain' 'Barbados' 'Belarus' 'Belize'
 'Bosnia and Herzegovina' 'Brazil' 'Bulgaria' 'Cape Verde' 'Chile' 'China'
 'Colombia' 'Costa Rica' 'Croatia' 'Czech Republic' 'Dominican Republic'
 'Ecuador' 'El Salvador' 'Estonia' 'Georgia' 'Grenada' 'Hungary' 'Iran'
 'Jamaica' 'Jordan' 'Kazakhstan' 'Latvia' 'Lebanon' 'Libya' 'Lithuania'
 'Macedonia, FYR' 'Malaysia' 'Maldives' 'Mauritius' 'Moldova' 'Montenegro'
 'Morocco' 'Oman' 'Panama' 'Paraguay' 'Peru' 'Poland' 'Romania' 'Russia'
 'Saudi Arabia' 'Serbia' 'Seychelles' 'Slovak Republic' 'South Korea'
 'Sri Lanka' 'St. Vincent and the Grenadines' 'Suriname' 'Thailand'
 'Tunisia' 'Turkey' 'Ukraine' 'Uruguay' 'Vietnam']


CLUSTER # 2
 ['Mongolia' 'Nigeria' 'Venezuela']


CLUSTER # 3
 ['Australia' 'Austria' 'Bahamas' 'Belgium' 'Canada' 'Cyprus' 'Denmark'
 'Finland' 'France' 'Germany' 'Greece' 'Iceland' 'Israel' 'Italy' 'Japan'
 'Malta' 'Netherlands' 'New Zealand' 'Portugal' 'Slovenia' 'Spain'
 'Sweden' 'United Arab Emirates' 'United Kingdom' 'United States']


CLUSTER # 4
 ['Brunei' 'Ireland' 'Kuwait' 'Luxembourg' 'Norway' 'Qatar' 'Singapore'
 'Switzerland']


CLUSTER # 5
 ['Bangladesh' 'Bhutan' 'Bolivia' 'Botswana' 'Cambodia' 'Egypt' 'Eritrea'
 'Fiji' 'Gabon' 'Ghana' 'Guatemala' 'Guyana' 'India' 'Indonesia' 'Iraq'
 'Kenya' 'Kiribati' 'Kyrgyz Republic' 'Lao' 'Madagascar'
 'Micronesia, Fed. Sts.' 'Myanmar' 'Namibia' 'Nepal' 'Pakistan'
 'Philippines' 'Rwanda' 'Samoa' 'Senegal' 'Solomon Islands' 'South Africa'
 'Tajikistan' 'Tonga' 'Turkmenistan' 'Uzbekistan' 'Vanuatu' 'Yemen']

1# silhouette scores of all the methods
2print('Mixed Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['mixed_cluster_id']))
3print('K-means Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['k_means_cluster_id']))
4print('Hierarchical Correlation Clustering',silhouette_score(clustering_data[columns_for_clustering],countries['hac_correlation_cluster_id']))

Mixed Clustering 0.295025902261296
K-means Clustering 0.4113959420115351
Hierarchical Correlation Clustering 0.06816978634185641
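
For context, the silhouette score ranges from -1 to 1, and higher values indicate tighter, better separated clusters. `silhouette_score` uses Euclidean distance by default, so a clustering built on correlation distance is scored here against a metric it did not optimize. The sketch below uses synthetic data from `make_blobs` (an assumption, not the country dataset) to show how the score separates a sensible labeling from a random one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized features (assumption: not the real data)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# Candidate labelings to compare, keyed by a display name
rng = np.random.default_rng(0)
labelings = {
    'k-means (k=3)': KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    'random labels': rng.integers(0, 3, size=len(X)),
}

# Score every labeling against the same feature matrix (Euclidean distance by default)
scores = {name: silhouette_score(X, labels) for name, labels in labelings.items()}

# Print the methods from best to worst score
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name:15s} {score:.3f}')
```

Collecting the labelings in a dictionary keeps the comparison to one call per method and makes it easy to add another clustering later.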

Cluster Profiling

# Cluster profiling: interactive scatter plots using Bokeh
from bokeh.layouts import row
from bokeh.plotting import figure, show

def cluster_analysis_plot(cluster_id_column, title, x_var='income', y_var='child_mort', z_var='gdpp', dataframe=countries):
    # Draws three scatter plots side by side (z vs x, z vs y, x vs y), coloured by cluster.
    # The palette has 7 colours, so this works for up to 7 clusters.

    dataframe = dataframe.copy()
    # Cluster ids are converted to strings so they can be used as legend labels
    dataframe[cluster_id_column] = dataframe[cluster_id_column].astype('str')
    cluster_ids = sorted(dataframe[cluster_id_column].unique())

    palette = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", 'rgba(42, 157, 143, 1)']

    plots = []
    for horizontal, vertical in [(z_var, x_var), (z_var, y_var), (x_var, y_var)]:
        tooltips = [
            ("Country", "@country"),
            (horizontal, '@' + horizontal),
            (vertical, '@' + vertical),
            ('Cluster', '@' + cluster_id_column),
        ]

        # Note: Bokeh 3.x replaced plot_width / plot_height with width / height
        p = figure(plot_width=420, plot_height=400, title=title + ' : ' + horizontal + ' vs ' + vertical,
                   tooltips=tooltips, toolbar_location=None)

        # One glyph per cluster, so each legend entry can be muted independently
        for num, cluster_id in enumerate(cluster_ids):
            subset = dataframe[dataframe[cluster_id_column] == cluster_id]
            p.scatter(x=horizontal, y=vertical, fill_alpha=0.6, size=8, source=subset,
                      fill_color=palette[num], muted_alpha=0.1, legend_label=cluster_id)

        p.xaxis.axis_label = horizontal
        p.yaxis.axis_label = vertical
        p.legend.click_policy = "mute"
        plots.append(p)

    show(row(*plots))

K-Means

# Cluster profiling for k-means with 4 clusters
# Hover for country names, x and y values, and cluster number
# Click on the legend to selectively view clusters
cluster_analysis_plot('k_means_cluster_id', 'k-means clusters')

# Comparing k-means clusters using mean values of features
clustering_data[['child_mort', 'income', 'gdpp', 'k_means_cluster_id']].groupby('k_means_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of Cluster Means for K-means results');

# Box plots
plot_columns = ['child_mort', 'income', 'gdpp']

plt.suptitle('Comparison of Clusters Characteristics for K-means')
for idx, column in enumerate(plot_columns):
    # subplot takes integers (rows, columns, index); the string form '131' is
    # deprecated and fails on newer Matplotlib versions
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='k_means_cluster_id', data=clustering_data)

# Parallel coordinates plot
plt.figure(figsize=[8, 8])
pd.plotting.parallel_coordinates(clustering_data, 'k_means_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e"]);
plt.title('Parallel Coordinate Plot for K-means')
plt.xticks(rotation=45);

From the above plots, each feature of each cluster can be placed at one of the following levels.
- Levels : Low, Moderate, High, Very High
Characteristics of each cluster :

Cluster	GDP	Income	Child Mortality
0	High	High to Very High	Low
1	Low	Low	High to Very High
2	Low to Moderate	Low to Moderate	Low to Moderate
3	Very High	Very High	Low
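
The levels in this table were read off the plots. A rough programmatic counterpart, assuming equal-width bins over the range of cluster means (one of several reasonable choices), could look like the sketch below. The numbers in `cluster_means` are illustrative placeholders, not the results of this analysis.

```python
import pandas as pd

# Hypothetical cluster means (illustrative numbers, not the results of this analysis)
cluster_means = pd.DataFrame(
    {
        'gdpp': [30000, 900, 6000, 60000],
        'income': [35000, 2000, 11000, 70000],
        'child_mort': [6, 95, 25, 5],
    },
    index=pd.Index([0, 1, 2, 3], name='cluster'),
)

levels = ['Low', 'Moderate', 'High', 'Very High']

def to_levels(column):
    # Split the range of this feature's cluster means into 4 equal-width bins
    # and label each cluster by the bin its mean falls into
    return pd.cut(column, bins=len(levels), labels=levels).astype(str)

profile = cluster_means.apply(to_levels)
print(profile)
```

With the real data, `cluster_means` would come from the `groupby(...).mean()` call used for the bar chart above. Equal-width bins give a single level per cell, whereas the visual reading allows ranges such as "High to Very High".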

From these characteristics, Cluster 1 (low GDP, low income, high to very high child mortality) is our area of interest. Let's look at the countries in Cluster 1.

# Countries in the cluster of interest, ranked by child mortality (descending),
# then income and GDP per capita (ascending)
condition = countries['k_means_cluster_id'] == 1
countries.loc[condition, ['country', 'child_mort', 'income', 'gdpp']].sort_values(by=['child_mort', 'income', 'gdpp'], ascending=[False, True, True]).head(5)

	country	child_mort	income	gdpp
66	Haiti	208.0	1500	662.0
132	Sierra Leone	160.0	1220	399.0
32	Chad	150.0	1930	897.0
31	Central African Republic	149.0	888	446.0
97	Mali	137.0	1870	708.0

Hierarchical Clustering - Complete Linkage, Correlation based distance

# HAC : complete linkage, correlation-based distance : cluster analysis plot
# Hover for country names, x and y values, and cluster number
# Click on the legend to selectively view clusters
cluster_analysis_plot('hac_correlation_cluster_id', 'HAC CORRELATION CLUSTERS')
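
Correlation-based distance groups observations by the shape of their feature profile rather than by magnitude: two rows are close when their features rise and fall together. A minimal sketch with SciPy on a toy matrix follows; the data and variable names are hypothetical and this is not necessarily how the cluster ids above were produced.

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage

rng = np.random.default_rng(1)

# Toy feature matrix: three profile shapes, 10 noisy copies of each (hypothetical data)
base_profiles = np.array(
    [[1, 2, 3, 4, 5],   # rising profile
     [5, 4, 3, 2, 1],   # falling profile
     [1, 5, 1, 5, 1]],  # zigzag profile
    dtype=float,
)
X = np.repeat(base_profiles, 10, axis=0) + rng.normal(scale=0.2, size=(30, 5))

# Correlation distance = 1 - Pearson correlation between two rows
tree = linkage(X, method='complete', metric='correlation')

# Cut the dendrogram into 3 clusters; cut_tree returns an (n, 1) array
labels = cut_tree(tree, n_clusters=3).ravel()
print(labels)
```

Each block of ten rows should receive a single label, because rows within a block share the same profile shape.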

# Comparing hierarchical clusters using mean values of features
clustering_data['hac_correlation_cluster_id'] = countries['hac_correlation_cluster_id']
clustering_data[['child_mort', 'income', 'gdpp', 'hac_correlation_cluster_id']].groupby('hac_correlation_cluster_id').mean().plot(kind='barh');
plt.title('Comparison of Cluster Means for Hierarchical Clustering');

# Box plots
plot_columns = ['child_mort', 'income', 'gdpp']

plt.suptitle('Comparison of Clusters Characteristics for Hierarchical Clustering')
for idx, column in enumerate(plot_columns):
    # subplot takes integers (rows, columns, index); the string form '131' is
    # deprecated and fails on newer Matplotlib versions
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='hac_correlation_cluster_id', data=clustering_data)

# Parallel coordinates plot for hierarchical clustering with correlation measure and complete linkage
# Matplotlib does not parse CSS 'rgba(...)' strings, so the last colour is given as hex
plt.figure(figsize=[8, 8])
pd.plotting.parallel_coordinates(clustering_data, 'hac_correlation_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", "#2a9d8f"]);
plt.title('Parallel Coordinate Plot for HAC with Correlation Measure and Complete Linkage')
plt.xticks(rotation=45);

From the above plots, each feature of each cluster can be placed at one of the following levels.
- Levels : Very Low, Low, Moderate, High, Very High
Characteristics of each cluster :

Cluster	GDP	Income	Child Mortality
0	Very Low	Very Low	Very High
1	Low	Low	Low
2	Moderate	Moderate	Moderate
3	High	High	Low
4	Very High	Very High	Low
5	Very Low	Very Low	Low

From these characteristics, Cluster 0 (very low GDP, very low income, very high child mortality) is our area of interest. Let's look at the countries in Cluster 0.

# Countries in the cluster of interest, ranked by child mortality (descending),
# then income and GDP per capita (ascending)
condition = countries['hac_correlation_cluster_id'] == 0
countries.loc[condition, ['country', 'child_mort', 'income', 'gdpp']].sort_values(by=['child_mort', 'income', 'gdpp'], ascending=[False, True, True]).head(5)

	country	child_mort	income	gdpp
66	Haiti	208.0	1500	662.0
132	Sierra Leone	160.0	1220	399.0
32	Chad	150.0	1930	897.0
31	Central African Republic	149.0	888	446.0
97	Mali	137.0	1870	708.0

Mixed Clustering : K-means initialized with Hierarchical Cluster Centroids

cluster_analysis_plot('mixed_cluster_id', 'MIXED K-MEANS CLUSTERS')
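
As a reminder of what "mixed" means here, the sketch below seeds K-means with the centroids of a hierarchical clustering. It runs on synthetic data and uses Euclidean complete linkage for simplicity; the data, the choice of 4 clusters and all variable names are assumptions for illustration, not the configuration used in this analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardized features (assumption: not the real data)
X, _ = make_blobs(n_samples=120, centers=4, cluster_std=0.7, random_state=42)

# Step 1: hierarchical clustering, cut into at most 4 clusters (labels start at 1)
tree = linkage(X, method='complete', metric='euclidean')
hac_labels = fcluster(tree, t=4, criterion='maxclust')

# Step 2: centroid of each hierarchical cluster
centroids = np.vstack([X[hac_labels == label].mean(axis=0) for label in np.unique(hac_labels)])

# Step 3: K-means started from those centroids; n_init=1 because the start is fixed
mixed = KMeans(n_clusters=len(centroids), init=centroids, n_init=1).fit(X)
mixed_labels = mixed.labels_
print(np.bincount(mixed_labels))
```

Starting from hierarchical centroids removes the randomness of the initialization, so repeated runs give the same clusters.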

# Comparing mixed k-means clusters using mean values of features
clustering_data['mixed_cluster_id'] = countries['mixed_cluster_id']
clustering_data[['child_mort', 'income', 'gdpp', 'mixed_cluster_id']].groupby('mixed_cluster_id').mean().plot(kind='barh')
plt.title('Comparison of cluster means for Mixed Clustering');

# Box plots
plot_columns = ['child_mort', 'income', 'gdpp']

plt.suptitle('Comparison of Clusters Characteristics for Mixed Clustering')
for idx, column in enumerate(plot_columns):
    # subplot takes integers (rows, columns, index); the string form '131' is
    # deprecated and fails on newer Matplotlib versions
    plt.subplot(1, 3, idx + 1)
    sns.boxplot(y=column, x='mixed_cluster_id', data=clustering_data)

# Parallel coordinates plot
# Matplotlib does not parse CSS 'rgba(...)' strings, so the last colour is given as hex
plt.figure(figsize=[8, 8])
pd.plotting.parallel_coordinates(clustering_data, 'mixed_cluster_id', color=["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#38895e", "#2a9d8f"]);
plt.title('Parallel Coordinate Plot for Mixed Clustering');
plt.xticks(rotation=45);

From the above plots, Cluster 0 has very low income and GDP per capita and very high child mortality. Hence, Cluster 0 is the area of interest.

# Countries in the cluster of interest, ranked by child mortality (descending),
# then income and GDP per capita (ascending)
condition = countries['mixed_cluster_id'] == 0
countries.loc[condition, ['country', 'child_mort', 'income', 'gdpp']].sort_values(by=['child_mort', 'income', 'gdpp'], ascending=[False, True, True]).head(5)

	country	child_mort	income	gdpp
66	Haiti	208.0	1500	662.0
132	Sierra Leone	160.0	1220	399.0
32	Chad	150.0	1930	897.0
31	Central African Republic	149.0	888	446.0
97	Mali	137.0	1870	708.0

All three methods return the same five countries. Hence, the top five countries to extend aid to are :
- Haiti
- Sierra Leone
- Chad
- Central African Republic
- Mali
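
The agreement between the three methods can also be checked programmatically. The sketch below does so on a small stand-in frame: the first five rows reuse the values from the tables above, while "Country A", "Country B" and their cluster assignments are hypothetical and exist only to give the filter something to exclude.

```python
import pandas as pd

# First five rows: values reported in the tables above.
# Last two rows: hypothetical countries added only for contrast.
countries_demo = pd.DataFrame({
    'country': ['Haiti', 'Sierra Leone', 'Chad', 'Central African Republic', 'Mali',
                'Country A', 'Country B'],
    'child_mort': [208.0, 160.0, 150.0, 149.0, 137.0, 90.0, 20.0],
    'income': [1500, 1220, 1930, 888, 1870, 3000, 9000],
    'gdpp': [662.0, 399.0, 897.0, 446.0, 708.0, 1200.0, 4000.0],
    'k_means_cluster_id': [1, 1, 1, 1, 1, 1, 2],
    'hac_correlation_cluster_id': [0, 0, 0, 0, 0, 0, 1],
    'mixed_cluster_id': [0, 0, 0, 0, 0, 5, 5],
})

# Cluster of interest under each method, as identified in the profiling above
clusters_of_interest = {
    'k_means_cluster_id': 1,
    'hac_correlation_cluster_id': 0,
    'mixed_cluster_id': 0,
}

def top_countries(frame, cluster_column, cluster_id, n=5):
    # Keep one cluster, then rank by child mortality (descending),
    # breaking ties by income and GDP per capita (ascending)
    in_cluster = frame[frame[cluster_column] == cluster_id]
    ranked = in_cluster.sort_values(by=['child_mort', 'income', 'gdpp'],
                                    ascending=[False, True, True])
    return ranked['country'].head(n).tolist()

shortlists = {column: top_countries(countries_demo, column, cluster_id)
              for column, cluster_id in clusters_of_interest.items()}

# True when every method yields the same ordered shortlist
all_agree = len({tuple(names) for names in shortlists.values()}) == 1
print(all_agree)
```

On the real data, `countries_demo` would simply be replaced by the `countries` frame.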

Conclusion

The columns ‘child_mort’, ‘exports’, ‘health’, ‘imports’, ‘income’, ‘inflation’, ‘life_expec’, ‘total_fer’, ‘gdpp’ and ‘trade_deficit’ were used for clustering.
The Hopkins statistic showed a very high clustering tendency, with a mean of 96% and a standard deviation of 4%.
The dataset was standardized to mean = 0 and standard deviation = 1 before clustering.
The optimum number of clusters for k-means was found to be 4, from both the elbow curve and the silhouette analysis curve.
K-means clustering was performed with cluster centers initialized by k-means++.
Hierarchical clustering was explored with single, complete and average linkages, using Euclidean and correlation-based distances.
Mixed k-means clustering (k-means initialized with the centroids of hierarchical clustering) was also explored.
Because its results were more interpretable, hierarchical clustering with complete linkage and a correlation-based distance measure was chosen to arrive at the results, even though its silhouette score (0.07) was the lowest of the three methods.
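
The Hopkins statistic mentioned in the summary compares nearest-neighbour distances of real points with those of uniformly random points. Under the convention used here, values near 0.5 suggest no structure and values near 1 suggest a strong clustering tendency. Below is a minimal sketch of one common formulation, without the dimension exponent that some definitions apply. The function name `hopkins_statistic` and its parameters are illustrative and are not the implementation used earlier in this analysis.

```python
import numpy as np

def hopkins_statistic(X, sample_fraction=0.1, seed=0):
    # Illustrative helper (hypothetical name): returns a value in [0, 1]
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_fraction * n))

    # w: distance from m sampled real points to their nearest *other* real point
    sample_idx = rng.choice(n, size=m, replace=False)
    dist_real = np.linalg.norm(X[sample_idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(m), sample_idx] = np.inf  # ignore each point's distance to itself
    w = dist_real.min(axis=1)

    # u: distance from m uniform random points (in the data's bounding box)
    # to their nearest real point
    random_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(random_points[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(42)
# Two tight blobs versus structureless uniform noise (synthetic data)
clustered = np.vstack([rng.normal(0, 0.05, size=(150, 2)),
                       rng.normal(5, 0.05, size=(150, 2))])
unstructured = rng.uniform(0, 1, size=(300, 2))

h_clustered = hopkins_statistic(clustered, sample_fraction=0.2)
h_unstructured = hopkins_statistic(unstructured, sample_fraction=0.2)
print(round(h_clustered, 3), round(h_unstructured, 3))
```

Because the statistic depends on random sampling, it is usually averaged over several seeds, which is why the summary reports a mean and a standard deviation.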
Characteristics of the clusters obtained :

Cluster	GDP	Income	Child Mortality
0	Very Low	Very Low	Very High
1	Low	Low	Low
2	Moderate	Moderate	Moderate
3	High	High	Low
4	Very High	Very High	Low
5	Very Low	Very Low	Low

From these characteristics, Cluster 0 is our area of interest.
In line with the UN goals for 2030, health is treated here as the top priority, followed by poverty.
Hence, countries in Cluster 0 were ranked by child mortality, followed by income and GDP per capita.
By these criteria, the following are the five countries to which HELP should consider extending aid :
- Haiti
- Sierra Leone
- Chad
- Central African Republic
- Mali
