Clustering Analysis

Published: September 09, 2020

Introduction

Clustering is an unsupervised learning problem whose aim is to identify or discover interesting patterns in the data.

The aim of this tutorial is to apply the K-means algorithm to numerical data and the K-Prototypes algorithm to mixed data (numerical + categorical), then analyze the properties of the resulting clusters to gain insights.

  • Part 1 : K-means clustering for numerical data.
  • Part 2 : K-Prototypes clustering on mixed data.

Background

Data Source : Kaggle Dataset

Context: You own a supermarket mall and, through membership cards, you have some basic data about your customers such as customer ID, age, gender, annual income and spending score. Spending score is something you assign to each customer based on parameters you define, such as customer behavior and purchasing data.

Problem Statement: You own the mall and want to understand which customers can be converted most easily [Target Customers], so that this insight can be passed to the marketing team to plan its strategy accordingly.

  • How to achieve customer segmentation using a machine learning algorithm.
  • Who your target customers are, i.e. those with whom you can start the marketing strategy [easy to convert].
  • How the marketing strategy works in the real world.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, normalize,MinMaxScaler
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes
import warnings
warnings.filterwarnings("ignore") 
%matplotlib inline
df=pd.read_csv('Mall_Customers.csv')
df.shape
(200, 5)
df.head()
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
df.dtypes
CustomerID                 int64
Gender                    object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object

Missing values

missing_cols=df.isnull().sum()/df.shape[0]
missing_cols=missing_cols[missing_cols>0]
missing_cols
Series([], dtype: float64)

No missing values, so one less thing to worry about :)

df.set_index('CustomerID',inplace=True)

CustomerID carries no information useful for clustering, so it will not be an input to the algorithm; it is converted to the dataframe index instead.

df.head()
            Gender  Age  Annual Income (k$)  Spending Score (1-100)
CustomerID
1             Male   19                  15                      39
2             Male   21                  15                      81
3           Female   20                  16                       6
4           Female   23                  16                      77
5           Female   31                  17                      40
num_cols=df.select_dtypes(include=['int64']).columns
ctg_cols=df.select_dtypes(include=['object']).columns

print('Numerical Cols=',num_cols)
print('Categorical Cols=',ctg_cols)
Numerical Cols= Index(['Age', 'Annual Income (k$)', 'Spending Score (1-100)'], dtype='object')
Categorical Cols= Index(['Gender'], dtype='object')

Exploratory Data Analysis (EDA)

fig, ax = plt.subplots(len(num_cols), 2, figsize=(12, 5))
colours_val = ['c', 'b', 'r', 'g', 'y', 'm']  # single-letter matplotlib colour codes
for i, col in enumerate(num_cols):
    # Left panel: boxplot of the column
    sns.boxplot(x=df[col], ax=ax[i, 0], color=colours_val[i])
    ax[i, 0].set_title(col)
    # Right panel: histogram with a KDE curve (histplot needs seaborn 0.11+; it replaces the deprecated distplot)
    sns.histplot(x=df[col], kde=True, ax=ax[i, 1], color=colours_val[i + 1])
    ax[i, 1].set_title(col)

plt.suptitle("EDA")
plt.tight_layout()
plt.show()

[Figure: EDA, boxplot and distribution of each numerical column]

sns.scatterplot(x=df['Annual Income (k$)'], y=df['Spending Score (1-100)'])
plt.title('Scatterplot')
plt.show()

[Figure: Scatterplot, Annual Income (k$) vs Spending Score (1-100)]

Converting Age to bins

df.Age.min(),df.Age.max(),
(18, 70)
df['Age_bins']=pd.cut(df.Age,bins=(17,35,50,70),labels=["18-35","36-50","50+"])
df[['Age','Age_bins']].drop_duplicates(subset=['Age_bins']).reset_index(drop=True)
   Age  Age_bins
0   19     18-35
1   64       50+
2   37     36-50
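
By default pd.cut uses right-closed intervals, so the edges (17, 35, 50, 70) correspond to the intervals (17, 35], (35, 50] and (50, 70]. The sketch below mimics that rule in plain Python to make the boundary behaviour explicit. The helper name age_bin is made up for illustration and is not part of pandas.

```python
from bisect import bisect_left

EDGES = (17, 35, 50, 70)
LABELS = ["18-35", "36-50", "50+"]


def age_bin(age):
    """Return the label of the right-closed interval containing `age`, or None if out of range."""
    if not EDGES[0] < age <= EDGES[-1]:
        return None  # pd.cut would produce NaN here
    # bisect_left returns the index of the first edge >= age,
    # so an age equal to an edge stays in the lower bin
    return LABELS[bisect_left(EDGES, age) - 1]


for age in (18, 35, 36, 50, 51, 70):
    print(age, age_bin(age))
```

Ages 19, 64 and 37 map to "18-35", "50+" and "36-50", matching the table above. Note that an age of exactly 50 lands in "36-50".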

For the initial run, only Annual Income and Spending Score (numerical data) are considered for the K-means algorithm.

df.columns
Index(['Gender', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)',
       'Age_bins'],
      dtype='object')
df1=df[['Annual Income (k$)', 'Spending Score (1-100)']].copy()  # copy, so that adding a column later does not write to a view of df
df1.shape
(200, 2)

Rescale the data to a common scale, since Annual Income and Spending Score are on different scales. MinMaxScaler maps each feature to the range [0, 1].

std=MinMaxScaler()
arr1=std.fit_transform(df1)
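
To make the rescaling concrete, here is a minimal plain-Python sketch of the min-max formula (x - min) / (max - min). The five input rows are the ones shown by df.head(). The column minimum and maximum (15 and 137 for income, 1 and 99 for spending score) are assumptions inferred from the scaled values printed later in this post, and the function name min_max_scale is illustrative only.

```python
def min_max_scale(values, lo, hi):
    """Map each value linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


income = [15, 15, 16, 16, 17]   # Annual Income (k$), first five rows
score = [39, 81, 6, 77, 40]     # Spending Score (1-100), first five rows

# Assumed column ranges (inferred, not read from the dataset here)
scaled_income = min_max_scale(income, lo=15, hi=137)
scaled_score = min_max_scale(score, lo=1, hi=99)

print([round(v, 6) for v in scaled_income])
print([round(v, 6) for v in scaled_score])
```

Under those assumed ranges, the first row becomes (0.0, 0.387755), which agrees with the scaled table shown in the K-Prototypes section.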

K-means Algorithm (Number of Cluster = 2)

Starting with K-means algorithm with only 2 clusters

  • Parameters
    • n_clusters : Number of clusters
    • random_state : for reproducibility
%%time
kmeans_cluster=KMeans(n_clusters=2,random_state=7)
result_cluster=kmeans_cluster.fit_predict(arr1)
Wall time: 52 ms
result_cluster[:3]
array([1, 0, 1])
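
For intuition about what KMeans does internally, the sketch below implements the classic two-step loop (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain Python. It is a simplified illustration, not scikit-learn's implementation: the starting centroids are passed in by hand, whereas scikit-learn chooses them itself (k-means++ by default), and the toy points are invented.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations on a list of equal-length tuples."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid (squared Euclidean distance)
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(toy_points, centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

The two well-separated groups end up in different clusters after a single update. The result depends on the starting centroids, which is why random_state is fixed in the notebook.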

Cluster Analysis

df1['Clusters']=result_cluster
df1['Clusters'].value_counts()
1    111
0     89
Name: Clusters, dtype: int64
ax=sns.countplot(x=df1.Clusters)
for index, row in pd.DataFrame(df1['Clusters'].value_counts()).iterrows():
    ax.text(index,row.values[0], str(round(row.values[0])),color='black', ha="center")
    #print(index,row.values[0])
plt.title('Cluster Count')
plt.show()

[Figure: Cluster Count]

plt.figure(figsize=(12,5))
sns.scatterplot(x=df1['Annual Income (k$)'],y=df1['Spending Score (1-100)'],hue=df1.Clusters,palette="Set2",)
plt.title('2 Clusters')
plt.show()

[Figure: 2 Clusters, Annual Income (k$) vs Spending Score (1-100) colored by cluster]

fig,ax=plt.subplots(1,2,figsize=(12,5))
sns.heatmap(df1.loc[df1.Clusters==0,['Annual Income (k$)', 'Spending Score (1-100)']].describe().round(),annot=True,fmt='g',ax=ax[0])
ax[0].set_title("Cluster-0")
sns.heatmap(df1.loc[df1.Clusters==1,['Annual Income (k$)', 'Spending Score (1-100)']].describe().round(),annot=True,fmt='g',ax=ax[1])
ax[1].set_title("Cluster-1")
plt.suptitle("Cluster Analysis")
plt.show()

[Figure: Cluster Analysis, summary-statistics heatmaps for Cluster-0 and Cluster-1]

Based on the scatterplot and heatmap above:

  • Cluster-0 contains customers with a spending score greater than 50 (approx.)
  • Cluster-1 contains customers with a spending score less than 50 (approx.)

The clusters do not separate customers by annual income, so they are not very useful. Let's find the optimal number of clusters using the elbow method.

%%time
SSE = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, random_state = 7)
    kmeans.fit(arr1)
    SSE.append(kmeans.inertia_)
Wall time: 1.7 s
plt.figure(figsize=(12,5))
sns.lineplot(x=list(range(1, 11)), y=SSE, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

[Figure: Elbow Method, SSE vs number of clusters]
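
The SSE collected in the loop above is what scikit-learn exposes as inertia_: the sum of squared distances from each point to its closest centroid. A minimal sketch of that quantity on invented toy points (the function name sse and the data are illustrative assumptions):

```python
def sse(points, centroids):
    """Sum of squared distances from every point to its nearest centroid."""
    total = 0.0
    for point in points:
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        )
    return total


toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]

sse_k1 = sse(toy_points, [(5, 1)])            # one centroid at the overall mean
sse_k2 = sse(toy_points, [(0, 1), (10, 1)])   # one centroid per visible group

print(sse_k1, sse_k2)  # 104.0 4.0
```

SSE always decreases as clusters are added, so the elbow method looks for the point where the drop flattens out rather than for the minimum.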

K-means Algorithm (Number of Clusters = 5)

Running K-means with 5 clusters, the number suggested by the elbow method.

  • Parameters
    • n_clusters : Number of clusters
    • random_state : for reproducibility
kmeans_cluster=KMeans(n_clusters=5,random_state=7)
result_cluster=kmeans_cluster.fit_predict(arr1)
df1['Clusters']=result_cluster
df1['Clusters'].value_counts()
1    81
3    39
2    35
0    23
4    22
Name: Clusters, dtype: int64
d1=df[['Gender','Age_bins']].reset_index(drop=True)
d1.head()
   Gender  Age_bins
0    Male     18-35
1    Male     18-35
2  Female     18-35
3  Female     18-35
4  Female     18-35
df1_comb=pd.concat([df1.reset_index(drop=True),d1],axis=1)
df1_comb.head()
   Annual Income (k$)  Spending Score (1-100)  Clusters  Gender  Age_bins
0                  15                      39         0    Male     18-35
1                  15                      81         4    Male     18-35
2                  16                       6         0  Female     18-35
3                  16                      77         4  Female     18-35
4                  17                      40         0  Female     18-35
ax=sns.countplot(x=df1_comb.Clusters)
for index, row in pd.DataFrame(df1_comb['Clusters'].value_counts()).iterrows():
    ax.text(index,row.values[0], str(round(row.values[0])),color='black', ha="center")
    #print(index,row.values[0])
plt.title('Cluster Count')
plt.show()

[Figure: Cluster Count]

plt.figure(figsize=(12,7))
sns.scatterplot(x=df1_comb['Annual Income (k$)'],y=df1_comb['Spending Score (1-100)'],hue=df1_comb.Clusters,palette="Set2",)
plt.title('5 Clusters')
plt.show()

[Figure: 5 Clusters, Annual Income (k$) vs Spending Score (1-100) colored by cluster]

fig,ax=plt.subplots(1,5,figsize=(15,5))
#cbar_ax = fig.add_axes([1.03, .3, .03, .4])
for cluster_val in sorted(df1_comb.Clusters.unique()):
    #print(cluster_val)
    sns.heatmap(df1_comb.loc[df1_comb.Clusters==cluster_val,['Annual Income (k$)', 'Spending Score (1-100)']].describe().round(),annot=True,fmt='g',ax=ax[cluster_val],\
               cbar=False,vmin=0, vmax=130)  # no colorbar; every panel shares the 0-130 colour scale
    titl='Cluster-'+str(cluster_val)
    ax[cluster_val].set_title(titl)
    
plt.suptitle('Clustering Analysis')

#plt.tight_layout()
plt.show()

[Figure: Clustering Analysis, summary-statistics heatmap per cluster]

fig,ax=plt.subplots(1,5,figsize=(16,5))
#cbar_ax = fig.add_axes([1.03, .3, .03, .4])
for cluster_val in sorted(df1_comb.Clusters.unique()):
    #print(cluster_val)
    sns.heatmap(df1_comb.loc[df1_comb.Clusters==cluster_val,:].groupby('Age_bins').agg({'Clusters':'size','Annual Income (k$)':'mean','Spending Score (1-100)':'mean'}).\
    rename(columns={'Clusters':'Count','Annual Income (k$)':'IncomeMean','Spending Score (1-100)':'SpendScoreMean'})\
                .fillna(0).round(),annot=True,fmt='g',ax=ax[cluster_val],cbar=False,vmin=0, vmax=130)
    titl='Cluster-'+str(cluster_val)+' Analysis'
    ax[cluster_val].set_title(titl)
    

plt.suptitle('Clustering Age wise Analysis')

#plt.tight_layout()
plt.show()

[Figure: Clustering Age wise Analysis, heatmap per cluster]

fig,ax=plt.subplots(1,5,figsize=(16,5))
#cbar_ax = fig.add_axes([1.03, .3, .03, .4])
for cluster_val in sorted(df1_comb.Clusters.unique()):
    #print(cluster_val)
    sns.heatmap(df1_comb.loc[df1_comb.Clusters==cluster_val,:].groupby('Gender').agg({'Clusters':'size','Annual Income (k$)':'mean','Spending Score (1-100)':'mean'}).\
    rename(columns={'Clusters':'Count','Annual Income (k$)':'IncomeMean','Spending Score (1-100)':'SpendScoreMean'})\
                .fillna(0).round(),annot=True,fmt='g',ax=ax[cluster_val],cbar=False,vmin=0, vmax=130)
    titl='Cluster-'+str(cluster_val)+' Analysis'
    ax[cluster_val].set_title(titl)
    

plt.suptitle('Clustering Gender Wise Analysis')

#plt.tight_layout()
plt.show()

[Figure: Clustering Gender Wise Analysis, heatmap per cluster]

plt.figure(figsize=(12,5))

sns.boxplot(x='Clusters',y='value',hue='variable',\
            data=pd.melt(df1,id_vars=['Clusters'],value_vars=['Annual Income (k$)','Spending Score (1-100)']),\
           palette="Set2")
plt.xlabel("Clusters")
plt.title("Boxplot-Annual Income - Spending Score")
plt.show()

[Figure: Boxplot-Annual Income - Spending Score, per cluster]

Observations

Cluster   Income   Spending Score
0         Low      Low
1         Medium   Medium
2         High     Low
3         High     High
4         Low      High

Interesting Insight

  • Cluster-2 contains high-income customers who are low spenders.
  • Cluster-4 contains low-income customers who are high spenders (only the 18-35 age group, so mostly young customers).
  • Cluster-1 has the largest number of customers; these are middle-class people with medium income and spending.

All age groups and both genders are fairly evenly distributed among these clusters, so the clusters are not very useful if we want to target a specific gender or age group in our marketing campaigns. Let's bring the demographic data into the clustering. Since these features (age group and gender) are categorical, K-means is not suitable: it relies on Euclidean distance, which is not meaningful for categories. Instead we will use K-Prototypes, which handles mixed data types by applying Euclidean distance to numerical features and Hamming distance (a count of mismatched categories) to categorical features.

K-Prototypes

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.

For more information, please refer to the kmodes project on GitHub.
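
The idea of combining the two distances can be sketched in a few lines of plain Python: a squared Euclidean term for the numerical features plus a weighted count of mismatching categories. This is an illustration of the principle only. The function name mixed_dissimilarity and the sample customers are invented, and the weight gamma=0.5 is an arbitrary assumption (kmodes has a gamma parameter for this weight and picks its own default from the data).

```python
def mixed_dissimilarity(num_a, num_b, cat_a, cat_b, gamma):
    """Squared Euclidean distance on numbers + gamma * number of mismatched categories."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(num_a, num_b))
    categorical_part = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part


# Two hypothetical customers: (scaled income, scaled spending score), (gender, age group)
d = mixed_dissimilarity(
    num_a=[0.0, 0.4], num_b=[0.3, 0.8],
    cat_a=["Male", "18-35"], cat_b=["Female", "18-35"],
    gamma=0.5,
)
print(round(d, 4))  # 0.09 + 0.16 from the numbers, 0.5 * 1 from the gender mismatch
```

A larger gamma makes the categorical features count for more; a smaller one pushes the result towards plain K-means on the numerical columns.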

df_proto=pd.DataFrame(arr1,columns=['AnnualIcome','SpendingScore'])
df_proto.head()
   AnnualIcome  SpendingScore
0     0.000000       0.387755
1     0.000000       0.816327
2     0.008197       0.051020
3     0.008197       0.775510
4     0.016393       0.397959
d2=pd.concat([df_proto,d1],axis=1)
d2.head()
   AnnualIcome  SpendingScore  Gender  Age_bins
0     0.000000       0.387755    Male     18-35
1     0.000000       0.816327    Male     18-35
2     0.008197       0.051020  Female     18-35
3     0.008197       0.775510  Female     18-35
4     0.016393       0.397959  Female     18-35

K-Prototypes Algorithm (Number of Clusters = 5)

Starting with K-Prototypes algorithm with 5 clusters

  • Parameters
    • n_clusters : Number of clusters
    • random_state : for reproducibility

Points to consider for K-Prototypes

  • In the fit_predict method, the “categorical” parameter takes the column indices of the categorical features.

ColumnIndex   Feature         DataType
0             AnnualIcome     float64
1             SpendingScore   float64
2             Gender          object
3             Age_bins        category
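
Rather than hard-coding the indices, they can be derived from the data. The sketch below does this in plain Python by checking which entries of one example row are strings; the helper name categorical_indices is hypothetical, and it assumes every categorical feature is stored as text.

```python
def categorical_indices(row):
    """Return the positions of the string-valued entries in one example row."""
    return [i for i, value in enumerate(row) if isinstance(value, str)]


first_row = [0.000000, 0.387755, "Male", "18-35"]  # first row of d2 shown above
categorical = categorical_indices(first_row)
print(categorical)  # [2, 3]
```

The resulting list is what would be passed as categorical=[2, 3] to fit_predict.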

To scale the numerical data, use MinMaxScaler instead of StandardScaler:

  1. MinMaxScaler brings the numerical features into the range [0, 1], so the Euclidean distance computed on numerical features is comparable to the Hamming distance computed on categorical features.

  2. If StandardScaler is used, the numerical features are no longer confined to [0, 1]; they dominate the distance, so the clusters are biased towards the numerical features. In this example the resulting clusters are exactly the same as the K-means clusters. I would encourage you to run the same analysis with StandardScaler to understand this idea better.
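
The difference between the two scalers can be checked on a handful of numbers. The sketch below compares the largest squared difference a single numerical feature can contribute under each scaling; the income values are hypothetical, and the z-score is computed with the population standard deviation, which is what StandardScaler uses.

```python
from statistics import mean, pstdev

income = [15, 40, 60, 90, 137]  # hypothetical incomes in k$, not taken from the dataset

# Min-max scaling: values end up in [0, 1]
lo, hi = min(income), max(income)
minmax = [(v - lo) / (hi - lo) for v in income]

# Standardization: zero mean, unit variance, no fixed range
mu, sigma = mean(income), pstdev(income)
zscore = [(v - mu) / sigma for v in income]

# Largest squared difference between two customers on this one feature
max_sq_minmax = (max(minmax) - min(minmax)) ** 2
max_sq_zscore = (max(zscore) - min(zscore)) ** 2

print(max_sq_minmax)             # 1.0, the same ceiling as one categorical mismatch
print(round(max_sq_zscore, 2))   # noticeably larger than 1
```

With min-max scaling a numerical feature can contribute at most 1, just like a mismatching category; after standardization the same feature can contribute several times that, which is how it comes to dominate the clustering.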

%%time
kproto_clusters=KPrototypes(n_clusters=5,random_state=7,init="Cao")
result_cluster=kproto_clusters.fit_predict(d2,categorical=[2,3])
Wall time: 2.5 s
d2['Clusters']=result_cluster
d2['Clusters'].value_counts()
4    53
2    45
3    37
0    36
1    29
Name: Clusters, dtype: int64
ax=sns.countplot(x=d2.Clusters)
for index, row in pd.DataFrame(d2['Clusters'].value_counts()).iterrows():
    ax.text(index,row.values[0], str(round(row.values[0])),color='black', ha="center")
    #print(index,row.values[0])
plt.title('Cluster Count')
plt.show()

[Figure: Cluster Count]

Clusters Centroid

kproto_clusters.cluster_centroids_
[array([[0.26662113, 0.34807256],
        [0.48473714, 0.2244898 ],
        [0.3712204 , 0.69931973],
        [0.40097475, 0.34362934],
        [0.36776987, 0.70157874]]),
 array([['Female', '50+'],
        ['Male', '36-50'],
        ['Male', '18-35'],
        ['Female', '36-50'],
        ['Female', '18-35']], dtype='<U6')]
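
The numerical centroids above are in the scaled [0, 1] space. To read them in the original units, the min-max transformation can be inverted. In the sketch below the centroid values are copied from the output above, while the column ranges (15 to 137 k$ for income, 1 to 99 for spending score) are assumptions inferred from the scaled values shown earlier; the helper name unscale is illustrative.

```python
# Scaled numerical centroids, copied from the output above
scaled_centroids = [
    (0.26662113, 0.34807256),
    (0.48473714, 0.2244898),
    (0.3712204, 0.69931973),
    (0.40097475, 0.34362934),
    (0.36776987, 0.70157874),
]

# Assumed column ranges (inferred, not read from the dataset here)
INCOME_MIN, INCOME_MAX = 15, 137
SCORE_MIN, SCORE_MAX = 1, 99


def unscale(value, lo, hi):
    """Invert min-max scaling: 0 -> lo, 1 -> hi."""
    return value * (hi - lo) + lo


centroids_original = [
    (unscale(inc, INCOME_MIN, INCOME_MAX), unscale(score, SCORE_MIN, SCORE_MAX))
    for inc, score in scaled_centroids
]

for k, (inc, score) in enumerate(centroids_original):
    print(f"Cluster {k}: income ~ {inc:.1f} k$, spending score ~ {score:.1f}")
```

In the notebook itself, calling std.inverse_transform on the first array of cluster_centroids_ should give the same result using the fitted scaler's actual ranges.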
df1.drop(['Clusters'],axis=1,inplace=True)
d3=pd.concat([df1.reset_index(drop=True),d2],axis=1)
d3.head()
   Annual Income (k$)  Spending Score (1-100)  AnnualIcome  SpendingScore  Gender  Age_bins  Clusters
0                  15                      39     0.000000       0.387755    Male     18-35         2
1                  15                      81     0.000000       0.816327    Male     18-35         2
2                  16                       6     0.008197       0.051020  Female     18-35         0
3                  16                      77     0.008197       0.775510  Female     18-35         4
4                  17                      40     0.016393       0.397959  Female     18-35         0
plt.figure(figsize=(12,5))
sns.scatterplot(x=d3['Annual Income (k$)'],y=d3['Spending Score (1-100)'],hue=d3.Clusters,palette="Set2",)
plt.title('5 Clusters')
plt.show()

[Figure: 5 Clusters, Annual Income (k$) vs Spending Score (1-100) colored by cluster]

plt.figure(figsize=(12,5))
sns.scatterplot(x=d3['Annual Income (k$)'],y=d3['Spending Score (1-100)'],hue=d3.Age_bins,palette="Set2",)
plt.title('Age Groups')
plt.show()

[Figure: Annual Income (k$) vs Spending Score (1-100) colored by Age_bins]

Based on the scatter plots above, there seems to be no clear pattern. Keep in mind, however, that we used 4 features to build these clusters and have plotted only 2 of them, so appearances can be deceptive. Keep an open mind.
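
One way to look at the two features that the scatter plots leave out is to count cluster membership per category. The sketch below does this in plain Python on the five rows of d3 shown above, so the counts are only a toy illustration and say nothing about the full dataset.

```python
from collections import Counter

# First five rows of d3 as shown above: (Clusters, Gender, Age_bins)
rows = [
    (2, "Male", "18-35"),
    (2, "Male", "18-35"),
    (0, "Female", "18-35"),
    (4, "Female", "18-35"),
    (0, "Female", "18-35"),
]

# Count how often each (cluster, category) pair occurs
by_gender = Counter((cluster, gender) for cluster, gender, _ in rows)
by_age = Counter((cluster, age) for cluster, _, age in rows)

for key, count in sorted(by_gender.items()):
    print(key, count)
```

On the full dataframe, pd.crosstab(d3.Clusters, d3.Gender) and pd.crosstab(d3.Clusters, d3.Age_bins) would give the equivalent tables.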

fig,ax=plt.subplots(1,5,figsize=(15,5))
#cbar_ax = fig.add_axes([1.03, .3, .03, .4])
for cluster_val in sorted(d3.Clusters.unique()):
    #print(cluster_val)
    sns.heatmap(d3.loc[d3.Clusters==cluster_val,['Annual Income (k$)', 'Spending Score (1-100)']].describe().round(),annot=True,fmt='g',ax=ax[cluster_val],\
               cbar=False,vmin=0, vmax=130)  # no colorbar; every panel shares the 0-130 colour scale
    titl='Cluster-'+str(cluster_val)
    ax[cluster_val].set_title(titl)
    

plt.suptitle('Clustering Analysis')

#plt.tight_layout()
plt.show()

[Figure: Clustering Analysis, summary-statistics heatmap per cluster]

fig,ax=plt.subplots(1,5,figsize=(16,5))
#cbar_ax = fig.add_axes([1.03, .3, .03, .4])
for cluster_val in sorted(d3.Clusters.unique()):
    #print(cluster_val)
    sns.heatmap(d3.loc[d3.Clusters==cluster_val,:].groupby('Age_bins').agg({'Clusters':'size','Annual Income (k$)':'mean','Spending Score (1-100)':'mean'}).\
    rename(columns={'Clusters':'Count','Annual Income (k$)':'IncomeMean','Spending Score (1-100)':'SpendScoreMean'})\
                .fillna(0).round(),annot=True,fmt='g',ax=ax[cluster_val],cbar=False,vmin=0, vmax=130)
    titl='Cluster-'+str(cluster_val)+' Analysis'
    ax[cluster_val].set_title(titl)
    

plt.suptitle('Clustering Age wise Analysis')

#plt.tight_layout()
plt.show()

[Figure: Clustering Age wise Analysis, heatmap per cluster]

fig,ax=plt.subplots(1,5,figsize=(16,5))
#cbar_ax = fig.add_axes([1.03, .3, .03, .4])
for cluster_val in sorted(d3.Clusters.unique()):
    #print(cluster_val)
    sns.heatmap(d3.loc[d3.Clusters==cluster_val,:].groupby('Gender').agg({'Clusters':'size','Annual Income (k$)':'mean','Spending Score (1-100)':'mean'}).\
    rename(columns={'Clusters':'Count','Annual Income (k$)':'IncomeMean','Spending Score (1-100)':'SpendScoreMean'})\
                .fillna(0).round(),annot=True,fmt='g',ax=ax[cluster_val],cbar=False,vmin=0, vmax=130)
    titl='Cluster-'+str(cluster_val)+' Analysis'
    ax[cluster_val].set_title(titl)
    

plt.suptitle('Clustering Gender Wise Analysis')

#plt.tight_layout()
plt.show()

[Figure: Clustering Gender Wise Analysis, heatmap per cluster]

Observations

Cluster   Age-Group   Gender    Income   Spending Score
0         50+         Females   Medium   Low
1         36-50       Males     High     Low
2         18-35       Males     Medium   High
3         36-50       Females   Medium   Low
4         18-35       Females   Medium   High

Interesting Insight

  • Cluster 4 and Cluster 2 are young customers (18-35) with high spending scores.
  • Cluster 1 and Cluster 3 are middle-aged customers (36-50) with low spending scores.
  • Cluster 0 consists of women aged 50+ with low spending scores.

Based on these insights, we can map out a marketing strategy that targets each cluster to increase profits.

plt.figure(figsize=(12,5))

sns.boxplot(x='Clusters',y='value',hue='variable',\
            data=pd.melt(d3,id_vars=['Clusters'],value_vars=['Annual Income (k$)','Spending Score (1-100)']),\
           palette="Set2")
plt.xlabel("Clusters")
plt.title("Boxplot-Annual Income - Spending Score")
plt.show()

[Figure: Boxplot-Annual Income - Spending Score, per cluster]

