So far we have covered EDA, A/B testing and customer segmentation. You can find the previous newsletters here - link. Today we are going further.

Today's project: fraud detection using machine learning. Can we predict which e-commerce transactions are fraudulent?

This is one of the most valuable ML applications in the real world. Every online retailer, payment processor and bank runs models like this. And it is a great portfolio project because it teaches you something most tutorials skip: what to do when your data has real problems baked in.

Why fraud detection is a different kind of ML problem

Most ML tutorials tell you to maximise model accuracy. Fraud detection is different.

Fraud is rare. In most real datasets, 1-3% of transactions are fraudulent. If your model predicts "not fraud" for every single row, it will be 97-99% accurate and completely useless.

So we use different metrics:

  • Recall — of all actual fraud cases, how many did your model catch?

  • Precision — of all cases your model flagged as fraud, how many were actually fraud?

  • F1 score — the balance between the two

A model with 60% recall means 40% of fraud slipped through undetected. That is money leaving the business.
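
To make the accuracy trap concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration (1,000 transactions, 20 of them fraud), and `evaluate` is a hypothetical helper, not part of any library. It compares a model that never flags fraud with one that catches 12 of the 20 cases.

```python
# Toy example with made-up numbers: 1,000 transactions, 20 of them fraud (2%).
y_true = [1] * 20 + [0] * 980

def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the fraud class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Model A: predicts "not fraud" for every row.
always_legit = [0] * 1000

# Model B: catches 12 of the 20 fraud cases and wrongly flags 18 legitimate ones.
some_model = [1] * 12 + [0] * 8 + [1] * 18 + [0] * 962

print(evaluate(y_true, always_legit))
print(evaluate(y_true, some_model))
```

Model A scores higher on accuracy than Model B, yet its recall is zero. Model B is the only one of the two that catches any fraud at all, which is why accuracy alone is the wrong yardstick here.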

The dataset

We are using the Fraudulent E-Commerce Transactions dataset from Kaggle.

About 1.5 million transactions once both files are combined. 16 columns. Real e-commerce context: payment methods, product categories, device types, customer demographics.

Key columns we will use:

  • Transaction Amount — value of the transaction

  • Transaction Date — when it happened

  • Payment Method — credit card, debit card, bank transfer, PayPal

  • Product Category — electronics, clothing, home & garden, toys & games, health & beauty

  • Customer Age — age of the customer (watch this one closely)

  • Device Used — mobile, desktop, tablet

  • Account Age Days — how old the customer account is

  • Transaction Hour — hour of day the transaction occurred (already in the dataset)

  • Is Fraudulent — our target variable (1 = fraud, 0 = legitimate)

Step 1 — Load and combine both files

# pip install pandas scikit-learn matplotlib seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Load both files and combine
df1 = pd.read_csv('Fraudulent_E-Commerce_Transaction_Data.csv')
df2 = pd.read_csv('Fraudulent_E-Commerce_Transaction_Data_2.csv')
df = pd.concat([df1, df2], ignore_index=True)

print(df.shape)        # ~1.5 million rows, 16 columns
print(df.head())
print(df.isnull().sum())  # no nulls — clean dataset

Good news — no missing values. Now look at the target variable.

print(df['Is Fraudulent'].value_counts())
print(df['Is Fraudulent'].value_counts(normalize=True).round(3))

You will find roughly 95% legitimate and 5% fraud. Imbalanced, but that is exactly what real fraud data looks like.

Step 2 — Spot the data quality issue

Before touching any model, always explore your data carefully.

print(df['Customer Age'].describe())
print('Negative ages:', len(df[df['Customer Age'] < 0]))
print('Under 18:', len(df[df['Customer Age'] < 18]))

You will find rows with negative ages. You will also find customers under 18. Both are data quality problems.

Decision: remove anyone under 18 and anyone with a negative age in one step.

df = df[df['Customer Age'] >= 18]
print(df.shape)

Always document decisions like this. In a portfolio writeup or interview, being able to say "I removed customers under 18 as a business rule: in most financial and e-commerce settings, minors are not valid account holders" shows that the choice was deliberate.
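
One way to document the decision is to record how many rows each rule removed. The sketch below is only an illustration: `filter_valid_ages` is a hypothetical helper and the tiny frame is made up, so run it on your real `df` to get the numbers for your writeup.

```python
import pandas as pd

def filter_valid_ages(df, age_col="Customer Age", min_age=18):
    """Keep rows with age >= min_age and report how many rows each rule removed."""
    negative = int((df[age_col] < 0).sum())
    under_min = int(((df[age_col] >= 0) & (df[age_col] < min_age)).sum())
    cleaned = df[df[age_col] >= min_age].copy()
    log = {
        "rows_before": len(df),
        "removed_negative_age": negative,
        f"removed_age_0_to_{min_age - 1}": under_min,
        "rows_after": len(cleaned),
    }
    return cleaned, log

# Tiny made-up frame, only to show the shape of the output.
toy = pd.DataFrame({"Customer Age": [34, -2, 17, 45, 16, 18, -7, 29]})
cleaned, log = filter_valid_ages(toy)
print(log)
```

The resulting dictionary can go straight into the "Approach" part of a writeup, so the reader sees both the rule and its impact on the row count.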

Step 3 — Explore what drives fraud

# Fraud rate by payment method
fraud_by_payment = df.groupby('Payment Method')['Is Fraudulent'].mean().sort_values(ascending=False)
fraud_by_payment.plot(kind='bar', color='coral', figsize=(8, 4))
plt.title('Fraud Rate by Payment Method')
plt.ylabel('Fraud Rate')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Fraud rate by product category
fraud_by_category = df.groupby('Product Category')['Is Fraudulent'].mean().sort_values(ascending=False)
fraud_by_category.plot(kind='bar', color='steelblue', figsize=(8, 4))
plt.title('Fraud Rate by Product Category')
plt.ylabel('Fraud Rate')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Fraud rate by hour of day
fraud_by_hour = df.groupby('Transaction Hour')['Is Fraudulent'].mean()
fraud_by_hour.plot(kind='line', color='steelblue', figsize=(10, 4), marker='o')
plt.title('Fraud Rate by Hour of Day')
plt.ylabel('Fraud Rate')
plt.xlabel('Hour')
plt.tight_layout()
plt.show()

# Transaction amount — fraud vs legitimate
df.boxplot(column='Transaction Amount', by='Is Fraudulent', figsize=(8, 5))
plt.title('Transaction Amount by Fraud Status')
plt.suptitle('')
plt.show()

Look for patterns before modelling. Are fraudulent transactions higher value? Are certain payment methods riskier? Are there peak fraud hours? These are the insights a fraud team would actually use.
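
A boxplot can be hard to read when amounts are heavily skewed, so a table is a useful complement for the "higher value?" question. The sketch below splits transactions into equal-sized amount bands with `pd.qcut` and compares fraud rates per band. The helper name `fraud_rate_by_amount_band` and the toy data are assumptions for illustration only.

```python
import pandas as pd

def fraud_rate_by_amount_band(df, amount_col="Transaction Amount",
                              target_col="Is Fraudulent", bands=4):
    """Split transactions into equal-sized amount bands and compare fraud rates."""
    band = pd.qcut(df[amount_col], q=bands, duplicates="drop")
    return (
        df.groupby(band, observed=True)[target_col]
          .agg(transactions="count", fraud_rate="mean")
    )

# Tiny made-up frame, only to show the shape of the output.
toy = pd.DataFrame({
    "Transaction Amount": [10, 20, 30, 40, 50, 60, 70, 80],
    "Is Fraudulent":      [0, 0, 0, 0, 0, 1, 0, 1],
})
summary = fraud_rate_by_amount_band(toy)
print(summary)
```

On the real data, call it as `fraud_rate_by_amount_band(df)` before the columns are encoded in Step 4. If the fraud rate climbs from the lowest band to the highest, that is a pattern worth mentioning in your findings.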

Step 4 — Prepare the data

ML models need numbers. Drop identifier columns and encode categorical ones.

# Drop columns not useful for modelling — IDs and addresses
drop_cols = ['Transaction ID', 'Customer ID', 'IP Address',
             'Shipping Address', 'Billing Address', 'Customer Location',
             'Transaction Date']
df = df.drop(columns=drop_cols, errors='ignore')

# One-hot encode categorical columns
df = pd.get_dummies(df, columns=['Payment Method', 'Product Category', 'Device Used'], drop_first=True)

print(df.shape)
print(df.head())
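
If `drop_first=True` is unfamiliar, this toy example (made-up values, one categorical column) shows what it does: each category becomes its own indicator column, and the first category alphabetically is dropped and treated as the baseline.

```python
import pandas as pd

toy = pd.DataFrame({
    "Transaction Amount": [120.0, 35.5, 980.0],
    "Device Used": ["mobile", "desktop", "tablet"],
})

encoded = pd.get_dummies(toy, columns=["Device Used"], drop_first=True)

# 'desktop' sorts first, so it is dropped and becomes the baseline:
# a row where both remaining indicator columns are off is a desktop row.
print(encoded.columns.tolist())
print(encoded)
```

Dropping one column per category loses no information, because the dropped category is implied whenever all the others are off.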

Step 5 — Train the model

We will use Random Forest with class_weight='balanced' to handle the class imbalance. On 1.5 million rows this may take a few minutes — use a sample if your machine is slow.

# Optional: sample for speed during development
# df = df.sample(n=200_000, random_state=42)

X = df.drop('Is Fraudulent', axis=1)
y = df['Is Fraudulent']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1        # use all CPU cores to speed things up
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
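
To see what class_weight='balanced' does, here is a small sketch of the weighting formula from the scikit-learn documentation, n_samples / (n_classes * count per class), written in plain Python. The function name `balanced_class_weights` and the toy labels are illustrative assumptions. In the real model the weights are computed from the training labels.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Made-up labels with a 95% / 5% split, similar to the dataset's class balance.
y_toy = [0] * 950 + [1] * 50
weights = balanced_class_weights(y_toy)
print(weights)
```

With this split, each fraud row counts 19 times as much as a legitimate row during training, which pushes the model to stop ignoring the rare class.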

Step 6 — Evaluate properly

# Classification report
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Legitimate', 'Fraud'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

Read the classification report carefully:

  • Recall for Fraud — what percentage of actual fraud did you catch?

  • Precision for Fraud — when you flagged something as fraud, how often were you right?

  • A false negative (missed fraud) costs money. A false positive (blocking a legitimate transaction) costs a customer. The business decides which is worse.

    Financial institutions and security operations (e.g. airport security) usually prefer false positives to false negatives. Better safe than sorry.
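
One way to act on this trade-off is to move the decision threshold instead of accepting the default cut-off, which for `model.predict` sits at roughly 0.5. The sketch below is illustrative only: the scores are made up and stand in for `model.predict_proba(X_test)[:, 1]`, and the two cost figures are hypothetical placeholders that a real business would have to supply.

```python
def sweep_thresholds(y_true, fraud_scores, thresholds,
                     cost_missed_fraud=200.0, cost_false_alarm=10.0):
    """For each threshold, count errors and total them up with assumed costs."""
    rows = []
    for th in thresholds:
        flagged = [score >= th for score in fraud_scores]
        tp = sum(1 for t, f in zip(y_true, flagged) if t == 1 and f)
        fp = sum(1 for t, f in zip(y_true, flagged) if t == 0 and f)
        fn = sum(1 for t, f in zip(y_true, flagged) if t == 1 and not f)
        rows.append({
            "threshold": th,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "cost": fn * cost_missed_fraud + fp * cost_false_alarm,
        })
    return rows

# Made-up labels and scores; the scores stand in for predict_proba output.
y_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.60, 0.40, 0.20, 0.70, 0.35, 0.32, 0.12, 0.05, 0.02]

results = sweep_thresholds(y_toy, scores, thresholds=[0.5, 0.3, 0.1])
for row in results:
    print(row)
```

Lowering the threshold raises recall and lowers precision. Which row is "best" depends entirely on the two cost figures, so treat them as inputs to agree with the business rather than facts.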

Step 7 — Find the most important features

One of the best things about Random Forest is feature importance — it tells you which columns matter most for predicting fraud.

importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(10)

top_features.plot(kind='barh', color='steelblue', figsize=(8, 5))
plt.title('Top 10 Features for Fraud Detection')
plt.xlabel('Importance Score')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

This is what you show in a presentation. Not the model internals, but "In this run, transaction amount, account age and transaction hour appeared among the strongest predictors."
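
Because Step 4 one-hot encodes the categorical columns, their importance is spread across several columns such as 'Payment Method_PayPal'. For a presentation it can help to add those back up per original column. This is a sketch under assumptions: `group_importances` is a hypothetical helper, and the importance values below are made up purely to show the mechanics.

```python
import pandas as pd

def group_importances(importances, categorical_cols):
    """Sum one-hot columns such as 'Payment Method_PayPal' back to 'Payment Method'."""
    def original_name(feature):
        for col in categorical_cols:
            if feature.startswith(col + "_"):
                return col
        return feature
    # Passing a function to groupby applies it to each index label.
    return importances.groupby(original_name).sum().sort_values(ascending=False)

# Made-up importances, not results from the real model.
toy_importances = pd.Series({
    "Transaction Amount": 0.40,
    "Account Age Days": 0.25,
    "Transaction Hour": 0.15,
    "Payment Method_PayPal": 0.05,
    "Payment Method_credit card": 0.04,
    "Payment Method_debit card": 0.03,
    "Device Used_mobile": 0.05,
    "Device Used_tablet": 0.03,
})

grouped = group_importances(toy_importances, ["Payment Method", "Device Used"])
print(grouped)
```

With the real model you would pass the `importances` Series from the code above and the three categorical column names encoded in Step 4.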

How to turn this into a portfolio project

Same four-part structure as our previous projects:

Overview — 1.5 million e-commerce transactions across two files, goal is to predict fraudulent ones before payment is processed.

Approach — combined two files, cleaned invalid ages, explored fraud patterns by payment method, category and hour, built a Random Forest classifier with balanced class weights, evaluated on recall and F1 rather than accuracy.

Findings — fraud rate of 5%, strongest predictors are transaction amount, account age and transaction hour. Fill in your actual recall score. Certain payment methods and product categories show higher fraud rates.

Limitations — no real-time element, model would need retraining as fraud patterns evolve, no cost-benefit analysis of false positives vs false negatives included.

Keep pushing 💪,

Karina

Data Analyst & Data Scientist
