A few weeks ago I shared a data analytics project walkthrough and a lot of you said you wanted more. So today we are doing it again — but this time we are doing machine learning.
Don't worry if that sounds scary. By the end of this email you will have built your first ML model. It takes about 30 minutes.
Today's project: customer segmentation using clustering.
What is customer segmentation?
It's the process of grouping customers based on shared behaviour — how recently they bought, how often they buy, how much they spend. Instead of treating all customers the same, you discover natural groups hiding in the data.
In the real world this is used everywhere. A marketing team might find they have a "high spend, disengaged" group and build a win-back campaign. A product team might discover a "frequent, low-value" segment and build a loyalty programme just for them.
I must admit we subconsciously apply ‘clustering’ to the people we meet. Your boss behaves the same way as the Sales Manager at your previous company — and because of that, you already know how to handle them. We might call it a ‘type’; data scientists call it a ‘cluster’.
The ML technique we use is called K-Means clustering. You tell it how many groups you want, and it figures out which customers belong together.
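To see the idea in miniature before we touch real data, here is a tiny sketch: six made-up points and k=2 (both illustrative, nothing to do with our dataset yet).

```python
import numpy as np
from sklearn.cluster import KMeans

# Six toy points: three near (1, 1), three near (10, 10)
points = np.array([[1, 1], [1, 2], [2, 1],
                   [10, 10], [10, 11], [11, 10]])

# Ask K-Means for 2 groups; it assigns each point a cluster label
km = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = km.fit_predict(points)

# The first three points share one label, the last three the other
print(labels)
```

That is the whole contract: you pick k, K-Means picks the membership. Our job later is turning each customer into a point like these.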
Before we proceed, a little bit of advertising. Clicks on the ads help me cover the newsletter hosting fee. Thank you for your support.
Free, private email that puts your privacy first
Proton Mail’s free plan keeps your inbox private and secure—no ads, no data mining. Built by privacy experts, it gives you real protection with no strings attached.
The dataset
We are using the Online Retail dataset from the UCI Machine Learning Repository. Real transactions from a UK-based online retailer — 541,909 rows covering just over a year of sales (December 2010 to December 2011).
Download it here: https://archive.ics.uci.edu/dataset/352/online+retail
It has 8 columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country.
This is raw transactional data — one row per item sold. Before we can cluster customers, we need to build customer-level features. That's where RFM comes in.
What is RFM?
RFM stands for Recency, Frequency, Monetary Value. It's one of the most widely used frameworks in customer analytics.
Recency — how many days since the customer last purchased? (lower = more engaged)
Frequency — how many orders have they placed in total?
Monetary Value — how much have they spent in total?
Every customer gets three numbers. Those three numbers become the input to our clustering model.
This is real feature engineering — taking raw transactional data and turning it into something meaningful.
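To make the aggregation concrete before we run it on half a million rows, here is a miniature version on a toy table (two made-up customers, three invoices — illustrative data only):

```python
import pandas as pd

# Toy transactions: two customers, three invoices (illustrative data)
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceNo': ['A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2024-01-01', '2024-01-10', '2024-01-05']),
    'Revenue': [100.0, 50.0, 200.0],
})

# Reference date: the day after the last transaction in the data
reference_date = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

# One row per customer: days since last purchase, distinct invoices, total spend
rfm_toy = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('Revenue', 'sum'),
).reset_index()

print(rfm_toy)
# Customer 1: Recency 1, Frequency 2, Monetary 150.0
# Customer 2: Recency 6, Frequency 1, Monetary 200.0
```

Exactly this pattern, scaled up, is what Step 2 does on the real dataset.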
Step 1 — Load the data and clean it
# pip install pandas matplotlib seaborn scikit-learn openpyxl
import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# ─────────────────────────────────────────────
# Load the data and clean it
# ─────────────────────────────────────────────
df = pd.read_excel('Online Retail.xlsx', parse_dates=['InvoiceDate'])
print(df.shape)
print(df.head())
print(df.isnull().sum())
You will see some missing CustomerIDs. We can't do customer-level analysis without knowing who the customer is, so we drop those rows. We also remove cancelled orders (InvoiceNo starting with 'C') and rows with non-positive quantities or prices.
# Drop missing CustomerIDs
df = df.dropna(subset=['CustomerID'])
df['CustomerID'] = df['CustomerID'].astype(int)
# Remove cancellations (InvoiceNo starting with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
# Remove negative or zero quantities
df = df[df['Quantity'] > 0]
# Remove zero or negative unit prices (free items / data errors)
df = df[df['UnitPrice'] > 0]
# Create a revenue column
df['Revenue'] = df['Quantity'] * df['UnitPrice']
print(df.shape)
Step 2 — Build RFM features
Now we calculate one row per customer with their Recency, Frequency and Monetary values.
# Set reference date as day after last transaction
reference_date = df['InvoiceDate'].max() + dt.timedelta(days=1)
rfm = df.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('Revenue', 'sum')
).reset_index()
print(rfm.head())
print(rfm.describe())
You now have one row per customer.
Step 3 — Handle outliers and scale
RFM data is almost always skewed — a small number of VIP customers spend 10x the average. K-Means is sensitive to this, so we log-transform the data first to compress extreme values.
rfm_log = rfm[['Recency', 'Frequency', 'Monetary']].copy()
rfm_log['Recency'] = np.log1p(rfm_log['Recency'])
rfm_log['Frequency'] = np.log1p(rfm_log['Frequency'])
rfm_log['Monetary'] = np.log1p(rfm_log['Monetary'])
# Scale so all features are on the same range
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_log)
Step 4 — Find the right number of clusters
We use the Elbow Method — run K-Means for k=1 through 10, measure how tight the clusters are, and look for where the curve bends.
inertia = []
k_range = range(1, 11)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(rfm_scaled)
    inertia.append(km.inertia_)
plt.figure(figsize=(8, 4))
plt.plot(k_range, inertia, marker='o', color='steelblue')
plt.title('Elbow Method — How Many Customer Segments?')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.tight_layout()
plt.show()
Look for where the curve stops dropping sharply. With this dataset you will likely see a clear elbow at k=3 or k=4.
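If the elbow looks ambiguous, the silhouette score is a handy second opinion (not part of the original walkthrough; higher means tighter, better-separated clusters). A sketch on synthetic stand-in data, since it needs a scaled feature matrix like our rfm_scaled:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for rfm_scaled: 300 points around 4 centres (illustrative)
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=42)

# Silhouette is only defined for k >= 2; it ranges from -1 to 1
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    scores[k] = silhouette_score(X, km.fit_predict(X))

for k, s in scores.items():
    print(k, round(s, 3))
```

On your own rfm_scaled, swap X for that array and pick the k where the score peaks, sanity-checked against the elbow plot.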
Step 5 — Run K-Means and label your segments
# Run with k=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)
rfm['Cluster'] = rfm['Cluster'].astype(str)
# Get the average profile of each segment
summary = rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean().round(1)
print(summary)
Step 6 — Interpret the clusters
This is the most important step. Numbers mean nothing. Business insights do.
Look at the summary table and label each cluster based on its profile:
Cluster profile → Label
Low Recency, high Frequency, high Monetary → Champions — your best customers
Low Recency, low Frequency, low Monetary → New customers — just started buying
High Recency, high Frequency, high Monetary → At risk — used to be great, gone quiet
High Recency, low Frequency, low Monetary → Lost — haven't seen them in a while
(Remember: Recency counts days since the last purchase, so low Recency means recently active.)
Your actual clusters will vary — that's the point. You are discovering the real structure in this specific dataset.
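To attach those labels in code, one simple option is a plain dictionary mapping cluster ids to names. The mapping below is illustrative, shown on a toy table — yours must come from reading your own summary:

```python
import pandas as pd

# Toy customer table with cluster assignments (illustrative ids and clusters)
rfm = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104],
    'Cluster': ['0', '1', '2', '3'],
})

# Map each cluster id to a business label, based on its RFM profile
# (which id gets which name depends entirely on YOUR summary table)
cluster_names = {
    '0': 'Champions',
    '1': 'New customers',
    '2': 'At risk',
    '3': 'Lost',
}
rfm['Segment'] = rfm['Cluster'].map(cluster_names)
print(rfm)
```

From here, 'Segment' is what you would hand to a marketing team — nobody downstream should have to remember what cluster '2' meant.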
rfm_scaled_df = pd.DataFrame(rfm_scaled, columns=['Recency', 'Frequency', 'Monetary'])
rfm_scaled_df['Cluster'] = rfm['Cluster'].values
cluster_profile = rfm_scaled_df.groupby('Cluster').mean().reset_index()
rfm_melt = cluster_profile.melt(id_vars='Cluster', var_name='Metric', value_name='Value')
plt.figure(figsize=(10, 5))
sns.barplot(data=rfm_melt, x='Metric', y='Value', hue='Cluster', palette='Set2')
plt.title('Customer Segment Profiles (Standardised)')
plt.ylabel('Standardised Value (Z-score)')
plt.tight_layout()
plt.show()
What you just built
You took half a million raw transactions, engineered features, and discovered distinct customer groups — all without labels or a predefined 'right answer' to train on.
Next step: try adding a fourth RFM metric — average order value (Monetary / Frequency) — and see if it changes your segments.
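If you try it, the sketch below shows the shape of that change on a toy RFM table (the values are made up; the pipeline mirrors Steps 2 and 3):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy RFM table (illustrative values, not from the dataset)
rfm = pd.DataFrame({
    'Recency': [5, 40, 120],
    'Frequency': [10, 3, 1],
    'Monetary': [2000.0, 300.0, 25.0],
})

# Fourth feature: average order value = total spend / number of orders
rfm['AvgOrderValue'] = rfm['Monetary'] / rfm['Frequency']

# Same treatment as before: log-transform, then scale all four features
features = np.log1p(rfm[['Recency', 'Frequency', 'Monetary', 'AvgOrderValue']])
scaled = StandardScaler().fit_transform(features)
print(scaled.shape)
```

Feed the four-column scaled array into the same elbow-then-KMeans loop and compare the segment profiles against your three-feature run.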
Keep pushing 💪,
Karina
Need more help?
Just starting with Python? Wondering if programming is for you?
Master key data analysis tasks like cleaning, filtering, pivoting and grouping data using Pandas, and learn how to present your insights visually with Matplotlib in the ‘Data Analysis with Python’ masterclass.
Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.
Grab your Pandas CheatSheet here. Everything you need to know about Pandas - from file operations to visualisations in one place.

Data Analyst & Data Scientist
