In this guided project, we’ll play the role of a data scientist working for a credit card company. We’ve been given a dataset containing information about the company’s clients and asked to help segment them into different groups in order to apply different business strategies for each type of customer.
The company expects to receive a group assignment for each client, along with an explanation of each group's characteristics and the main points that set the groups apart.
In a planning meeting with the Data Science coordinator, it was decided that we should use the K-means algorithm to segment the data.
In order to use the algorithm properly and achieve the goals the company has set, we'll go through a series of steps, starting with importing the packages we'll use.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
np.random.seed(42)
sns.set_style('whitegrid')
%matplotlib inline
After reading the data into pandas, it's time to explore it. Let's investigate the size of the dataset, what columns it contains, the type of values in each column, and also check on missing values.
customers = pd.read_csv('customer_segmentation.csv')
customers.head()
| | customer_id | age | gender | dependent_count | education_level | marital_status | estimated_income | months_on_book | total_relationship_count | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 768805383 | 45 | M | 3 | High School | Married | 69000 | 39 | 5 | 1 | 12691.0 | 1144 | 42 | 0.061 |
1 | 818770008 | 49 | F | 5 | Graduate | Single | 24000 | 44 | 6 | 1 | 8256.0 | 1291 | 33 | 0.105 |
2 | 713982108 | 51 | M | 3 | Graduate | Married | 93000 | 36 | 4 | 1 | 3418.0 | 1887 | 20 | 0.000 |
3 | 769911858 | 40 | F | 4 | High School | Unknown | 37000 | 34 | 3 | 4 | 3313.0 | 1171 | 20 | 0.760 |
4 | 709106358 | 40 | M | 3 | Uneducated | Married | 65000 | 21 | 5 | 1 | 4716.0 | 816 | 28 | 0.000 |
customers.shape
(10127, 14)
customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   customer_id               10127 non-null  int64
 1   age                       10127 non-null  int64
 2   gender                    10127 non-null  object
 3   dependent_count           10127 non-null  int64
 4   education_level           10127 non-null  object
 5   marital_status            10127 non-null  object
 6   estimated_income          10127 non-null  int64
 7   months_on_book            10127 non-null  int64
 8   total_relationship_count  10127 non-null  int64
 9   months_inactive_12_mon    10127 non-null  int64
 10  credit_limit              10127 non-null  float64
 11  total_trans_amount        10127 non-null  int64
 12  total_trans_count         10127 non-null  int64
 13  avg_utilization_ratio     10127 non-null  float64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
There are 10,127 rows and 14 columns in the dataset, including a unique identifier for each client, which won't be needed for the segmentation.
Of the 13 remaining columns, 8 contain integers, 2 contain floats, and 3 contain strings, which means we have 3 categorical columns to deal with.
Also, there are no missing values.
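Although the info() output already shows a full non-null count for every column, we can confirm this with a quick explicit check:

# Count missing values per column; every entry should be zero
customers.isnull().sum()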
for col in ['gender', 'education_level', 'marital_status']:
print(col)
print(customers[col].value_counts(), end='\n\n')
gender
F    5358
M    4769
Name: gender, dtype: int64

education_level
Graduate         3685
High School      2351
Uneducated       1755
College          1192
Post-Graduate     616
Doctorate         528
Name: education_level, dtype: int64

marital_status
Married     4687
Single      3943
Unknown      749
Divorced     748
Name: marital_status, dtype: int64
Here we can see how many unique categories there are in each categorical variable and how many data points fall into each category.
Since we're working with unsupervised machine learning, there isn't a target variable against which we can measure the impact of the other variables.
But we can look at the correlations between the numeric variables, as well as their distributions.
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(round(customers.drop('customer_id', axis=1).corr(numeric_only=True), 2), cmap='Blues', annot=True, ax=ax)
plt.tight_layout()
plt.show()
Most of the variables are only weakly correlated with each other, but there are a few relationships we can highlight:

- Age is strongly correlated with how long the person has been a customer (months_on_book);
- Credit limit is positively correlated with the estimated income and negatively correlated with the average utilization ratio;
- The total transaction count (total_trans_count) is strongly correlated with the total amount transacted (total_trans_amount).

fig, ax = plt.subplots(nrows=5, ncols=2, figsize=(8, 10))
#Removing the customer's id before plotting the distributions
customers.drop('customer_id', axis=1).hist(ax=ax)
plt.tight_layout()
plt.show()
Regarding distributions, a couple of variables are close to a normal distribution, but most of them are skewed.
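To quantify this rather than rely on the histograms alone, we could also ask pandas for the skewness of each numeric column (values far from 0 indicate skew):

# Skewness of each numeric column; values far from 0 indicate skewed distributions
customers.drop('customer_id', axis=1).skew(numeric_only=True).round(2)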
We'll now deal with the 3 categorical variables.
The gender column is the easiest one: we'll use a lambda function to replace its values with ones and zeros.
We can also transform the education_level column to numeric. We'll use the map() method to perform this task, assigning a value to each level of education.
Unfortunately, we can't do the same for the marital_status column, as there is no natural ordering between "Single", "Married", and "Divorced", for example. We can't say that any of them is higher or better than the others. Therefore, we'll use one-hot encoding to create dummy variables from this column and then drop the original variable.
customers_modif = customers.copy()
customers_modif['gender'] = customers['gender'].apply(lambda x: 1 if x == 'M' else 0)
customers_modif.head()
| | customer_id | age | gender | dependent_count | education_level | marital_status | estimated_income | months_on_book | total_relationship_count | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 768805383 | 45 | 1 | 3 | High School | Married | 69000 | 39 | 5 | 1 | 12691.0 | 1144 | 42 | 0.061 |
1 | 818770008 | 49 | 0 | 5 | Graduate | Single | 24000 | 44 | 6 | 1 | 8256.0 | 1291 | 33 | 0.105 |
2 | 713982108 | 51 | 1 | 3 | Graduate | Married | 93000 | 36 | 4 | 1 | 3418.0 | 1887 | 20 | 0.000 |
3 | 769911858 | 40 | 0 | 4 | High School | Unknown | 37000 | 34 | 3 | 4 | 3313.0 | 1171 | 20 | 0.760 |
4 | 709106358 | 40 | 1 | 3 | Uneducated | Married | 65000 | 21 | 5 | 1 | 4716.0 | 816 | 28 | 0.000 |
# customers_modif.replace(to_replace={'Uneducated': 0, 'High School': 1, 'College':2,
# 'Graduate': 3, 'Post-Graduate': 4, 'Doctorate':5}, inplace=True)
customers_modif['education_level']=customers_modif['education_level'].map({'Uneducated': 0, 'High School': 1, 'College':2,
'Graduate': 3, 'Post-Graduate': 4, 'Doctorate':5})
customers_modif['education_level'].head()
0    1
1    3
2    3
3    1
4    0
Name: education_level, dtype: int64
dummies = pd.get_dummies(customers_modif[['marital_status']], drop_first=True)
customers_modif = pd.concat([customers_modif, dummies], axis=1)
customers_modif.drop(['marital_status'], axis=1, inplace=True)
print(customers_modif.shape)
customers_modif.head()
(10127, 16)
| | customer_id | age | gender | dependent_count | education_level | estimated_income | months_on_book | total_relationship_count | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio | marital_status_Married | marital_status_Single | marital_status_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 768805383 | 45 | 1 | 3 | 1 | 69000 | 39 | 5 | 1 | 12691.0 | 1144 | 42 | 0.061 | 1 | 0 | 0 |
1 | 818770008 | 49 | 0 | 5 | 3 | 24000 | 44 | 6 | 1 | 8256.0 | 1291 | 33 | 0.105 | 0 | 1 | 0 |
2 | 713982108 | 51 | 1 | 3 | 3 | 93000 | 36 | 4 | 1 | 3418.0 | 1887 | 20 | 0.000 | 1 | 0 | 0 |
3 | 769911858 | 40 | 0 | 4 | 1 | 37000 | 34 | 3 | 4 | 3313.0 | 1171 | 20 | 0.760 | 0 | 0 | 1 |
4 | 709106358 | 40 | 1 | 3 | 0 | 65000 | 21 | 5 | 1 | 4716.0 | 816 | 28 | 0.000 | 1 | 0 | 0 |
First, we need to standardize the dataset. We'll use scikit-learn's StandardScaler() for this task, which applies the z-score transformation:
$z = \dfrac{(x - \mu)}{\sigma}$
X = customers_modif.drop('customer_id', axis=1)
# Standardize the variables using z-score
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled[:5]
array([[-0.16540558,  1.05995565,  0.50336813, -0.75221102,  0.1758098 ,  0.38462088,  0.76394261, -1.32713603,  0.4466219 , -0.95970657, -0.97389518, -0.77588223,  1.07733799, -0.79850685, -0.28260887],
       [ 0.33357038, -0.9434357 ,  2.04319867,  0.66278684, -0.96716585,  1.01071482,  1.40730617, -1.32713603, -0.04136665, -0.91643261, -1.35734038, -0.61627565, -0.92821381,  1.2523374 , -0.28260887],
       [ 0.58305837,  1.05995565,  0.50336813,  0.66278684,  0.78539682,  0.00896451,  0.12057905, -1.32713603, -0.5736978 , -0.74098169, -1.91120566, -0.99715499,  1.07733799, -0.79850685, -0.28260887],
       [-0.78912553, -0.9434357 ,  1.2732834 , -0.75221102, -0.63697289, -0.24147306, -0.52278451,  1.64147829, -0.58525108, -0.95175829, -1.91120566,  1.75968594, -0.92821381, -0.79850685,  3.53845931],
       [-0.78912553,  1.05995565,  0.50336813, -1.45970995,  0.07421197, -1.86931731,  0.76394261, -1.32713603, -0.43087725, -1.05626345, -1.57036549, -0.99715499,  1.07733799, -0.79850685, -0.28260887]])
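As a sanity check, we can standardize a single column by hand and compare it with the corresponding column of X_scaled; here is a small sketch using the age column (the first column of X):

# Manual z-score on 'age' should match the first column of the scaled array
age = X['age']
manual_z = (age - age.mean()) / age.std(ddof=0)  # StandardScaler uses the population std (ddof=0)
np.allclose(manual_z.values, X_scaled[:, 0])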
It's time to decide on the number of clusters. We'll run the k-means algorithm for values of K from 1 to 10 and store the results. Then, we'll plot the elbow curve, which will help us choose a final K.
X = pd.DataFrame(X_scaled)
inertias = []
for k in range(1, 11):
model = KMeans(n_clusters=k,n_init=10)
model.fit_predict(X)
inertias.append(model.inertia_)
plt.figure(figsize=(12, 8))
plt.plot(range(1, 11), inertias, marker='o')
plt.xticks(ticks=range(1, 11), labels=range(1, 11))
plt.title('Inertia vs Number of Clusters')
plt.tight_layout()
plt.show()
It looks like the rate at which the inertia decreases slows down between 5 and 7 clusters. We'll move on with 6 clusters.
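The elbow isn't always clear-cut, so as a complementary check we could also compare average silhouette scores for a few candidate values of K (higher scores indicate better-separated clusters):

from sklearn.metrics import silhouette_score

# Compare average silhouette scores for a few candidate values of K
for k in range(4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    print(k, round(silhouette_score(X_scaled, labels), 3))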
model = KMeans(n_clusters=6, n_init='auto')
y = model.fit_predict(X_scaled)
y
array([5, 2, 5, ..., 3, 4, 3])
Now, let's create a CLUSTER column in our original dataset so we can better understand the characteristics of each one.
customers['CLUSTER'] = y + 1
customers
| | customer_id | age | gender | dependent_count | education_level | marital_status | estimated_income | months_on_book | total_relationship_count | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio | CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 768805383 | 45 | M | 3 | High School | Married | 69000 | 39 | 5 | 1 | 12691.0 | 1144 | 42 | 0.061 | 6 |
1 | 818770008 | 49 | F | 5 | Graduate | Single | 24000 | 44 | 6 | 1 | 8256.0 | 1291 | 33 | 0.105 | 3 |
2 | 713982108 | 51 | M | 3 | Graduate | Married | 93000 | 36 | 4 | 1 | 3418.0 | 1887 | 20 | 0.000 | 6 |
3 | 769911858 | 40 | F | 4 | High School | Unknown | 37000 | 34 | 3 | 4 | 3313.0 | 1171 | 20 | 0.760 | 5 |
4 | 709106358 | 40 | M | 3 | Uneducated | Married | 65000 | 21 | 5 | 1 | 4716.0 | 816 | 28 | 0.000 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10122 | 772366833 | 50 | M | 2 | Graduate | Single | 51000 | 40 | 3 | 2 | 4003.0 | 15476 | 117 | 0.462 | 1 |
10123 | 710638233 | 41 | M | 2 | Graduate | Divorced | 40000 | 25 | 4 | 2 | 4277.0 | 8764 | 69 | 0.511 | 2 |
10124 | 716506083 | 44 | F | 1 | High School | Married | 33000 | 36 | 5 | 3 | 5409.0 | 10291 | 60 | 0.000 | 4 |
10125 | 717406983 | 30 | M | 2 | Graduate | Unknown | 47000 | 36 | 4 | 3 | 5281.0 | 8395 | 62 | 0.000 | 5 |
10126 | 714337233 | 43 | F | 2 | Graduate | Married | 36000 | 25 | 6 | 2 | 10388.0 | 10294 | 61 | 0.189 | 4 |
10127 rows × 15 columns
customers['CLUSTER'].value_counts()
4    3132
3    2034
6    1880
2    1414
1     937
5     730
Name: CLUSTER, dtype: int64
numeric_columns = customers.select_dtypes(include=np.number).drop(['customer_id', 'CLUSTER'], axis=1).columns
fig = plt.figure(figsize=(20, 20))
for i, column in enumerate(numeric_columns):
    df_plot = round(customers.groupby('CLUSTER')[column].mean(), 2)
    ax = fig.add_subplot(5, 2, i+1)
    ax.bar(df_plot.index, df_plot, color=sns.color_palette('Set1'), alpha=0.6)
    ax.set_title(f'Average {column.title()} per Cluster', alpha=0.5)
    ax.xaxis.grid(False)
    # Add data labels to the bars (use a separate name so the loop index i isn't shadowed)
    for container in ax.containers:
        ax.bar_label(container, label_type='edge')
plt.tight_layout()
plt.show()
For the numeric variables with the higher correlations we saw earlier, we can also use scatter plots to visualize these relationships colored by cluster and analyze how the clusters are distributed across each chart.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 8))
sns.scatterplot(x='age', y='months_on_book', hue='CLUSTER', data=customers, palette='tab10', alpha=0.4, ax=ax1)
sns.scatterplot(x='estimated_income', y='credit_limit', hue='CLUSTER', data=customers, palette='tab10', alpha=0.4, ax=ax2, legend=False)
sns.scatterplot(x='credit_limit', y='avg_utilization_ratio', hue='CLUSTER', data=customers, palette='tab10', alpha=0.4, ax=ax3)
sns.scatterplot(x='total_trans_count', y='total_trans_amount', hue='CLUSTER', data=customers, palette='tab10', alpha=0.4, ax=ax4, legend=False)
plt.tight_layout()
plt.show()
We can draw some early conclusions considering only the numeric variables.
For instance, Cluster 1 has the highest amount of money transacted, while Cluster 2 has the lowest credit limit and estimated income along with the highest utilization rate. Cluster 4 has the highest credit limit. Older clients are grouped in Cluster 5.
For the categorical columns, we'll plot the percentage distribution of each variable within each cluster. This will allow us to verify whether a particular cluster is mostly composed of men, or only of married people, for example.
cat_columns = customers.select_dtypes(include=['object'])
fig = plt.figure(figsize=(18, 6))
for i, col in enumerate(cat_columns):
    plot_df = pd.crosstab(index=customers['CLUSTER'], columns=customers[col], values=customers[col], aggfunc='size', normalize='index')
    ax = fig.add_subplot(1, 3, i+1)
    plot_df.plot.bar(stacked=True, ax=ax, alpha=0.6)
    ax.set_title(f'% {col.title()} per Cluster', alpha=0.5)
    ax.set_ylim(0, 1.4)
    ax.legend(frameon=False)
    ax.xaxis.grid(False)
    labels = [0, 0.2, 0.4, 0.6, 0.8, 1]
    ax.set_yticks(labels)
    ax.set_yticklabels(labels)
    # Keep the cluster labels horizontal and in sorted order
    ax.set_xticklabels(labels=sorted(customers['CLUSTER'].unique()), rotation=0)
plt.tight_layout()
plt.show()
Considering the categorical variables, we notice that education level is fairly evenly distributed across the clusters.
Among the other highlights, Cluster 2 is composed almost entirely of married people, while we don't know the marital status of anybody in Cluster 3. Cluster 4 is almost entirely male, and Cluster 6 is made up entirely of single people.
As requested by the company, we have now listed the most important characteristics of each cluster. We can also add some suggestions and insights for each one of them.
Finally, we have the list of customers with a cluster assigned to each one.
Cluster 1: Mostly men; high credit limit; high amount transacted; high number of transactions; low utilization rate.
Insight: People who spend a high volume on the card but don't use it on a daily basis. Could be incentivized to spend more.
Cluster 2: Mostly women; mostly married; low estimated income; low credit limit; low amount transacted; high utilization rate.
Insight: Married people (majority women) with low income and credit limit who use a large share of their credit in a few larger purchases.
Cluster 3: Gender well divided; low credit limit; high utilization rate; marital status 100% unknown; smaller cluster.
Insight: Men and women with low credit limits who nevertheless carry high balances.
Cluster 4: Mostly men; a mix of single and married people; high estimated income; high credit limit; low amount transacted; low utilization rate.
Insight: People (majority men) with high income and credit limits who barely use the card. Could be incentivized to use it.
Cluster 5: Mostly married; older clients; low dependent count; long-time customers; low credit limit; low amount transacted; high utilization rate.
Insight: Older, long-time customers. Low credit limit and transactions, but they use the card very often. Could receive benefits that encourage them to spend more.
Cluster 6: Mostly women; 100% single; low estimated income; low credit limit; low amount transacted; high utilization rate.
Insight: Single people (mostly women) who use their card a lot but have low credit limits and incomes. Could be offered a higher credit limit.
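These written summaries can be cross-checked against a compact profile table; a minimal sketch that averages the numeric columns and takes the most frequent category per cluster, reusing the numeric_columns variable defined earlier:

# Average of each numeric column per cluster
numeric_profile = customers.groupby('CLUSTER')[numeric_columns].mean().round(2)

# Most frequent category per cluster for the categorical columns
categorical_profile = customers.groupby('CLUSTER')[['gender', 'education_level', 'marital_status']].agg(lambda s: s.mode()[0])

pd.concat([numeric_profile, categorical_profile], axis=1)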
# List of customers and clusters
customers[['customer_id', 'CLUSTER']]
| | customer_id | CLUSTER |
|---|---|---|
0 | 768805383 | 6 |
1 | 818770008 | 3 |
2 | 713982108 | 6 |
3 | 769911858 | 5 |
4 | 709106358 | 2 |
... | ... | ... |
10122 | 772366833 | 1 |
10123 | 710638233 | 2 |
10124 | 716506083 | 4 |
10125 | 717406983 | 5 |
10126 | 714337233 | 4 |
10127 rows × 2 columns
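To hand the results over to the company, the cluster assignments could be exported to a file; for example (the file name is just a placeholder):

# Export the customer-to-cluster assignment (example file name)
customers[['customer_id', 'CLUSTER']].to_csv('customer_clusters.csv', index=False)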