Let's Build a Mini Data Project Together

Today, we're building a complete data project.

No code provided. Just clear steps. You choose your tool—Excel, SQL, or Python.

What We're Building

The Dataset: Online Retail Data from Kaggle

(you might need to register on Kaggle to be able to download it, which is free)

The Business Question: Which products are our bestsellers, and when do customers buy most?

Before You Start

Download the dataset and open it in your tool of choice:

  • Excel: Open the CSV directly

  • SQL: Import into your database (SQLite, PostgreSQL, whatever you use)

  • Python: Load with pandas

Quick look at the structure:

  • InvoiceNo: Transaction ID

  • StockCode: Product code

  • Description: Product name

  • Quantity: Number of items

  • InvoiceDate: When the purchase happened

  • UnitPrice: Price per item

  • CustomerID: Customer identifier

  • Country: Where the customer is located

Note: This dataset has over 500K rows. For this quick exercise, filter out the UK (all countries but UK) which will leave you with 46K rows—plenty for analysis, manageable for speed.

Phase 1: Quick Data Quality Check

Before doing anything, check for problems:

Check 1: Look for negative quantities

  • You'll find some. These are returns/cancellations (InvoiceNo starting with ‘C’) as well as some manual adjustments (StockCode = M)

  • Decision: I suggest to keep them, but perform separate analysis on this subset - what products get returned the most, what month, what country etc.

    If you decide to exclude anything - this should be specified in your final insights.

Check 2: Check for nulls in CustomerID

  • You'll find many. These might be guest checkouts

  • Decision: Keep them for product analysis, but note we can't track customer behaviour for these

Check 3: Validate data types

This step is important if you are using SQL or Python for this analysis

  • Make sure Quantity and UnitPrice are numeric

  • Make sure InvoiceDate is actually a date

What you're doing: Understanding your data before analysing it.

Phase 2: Clean Your Data

Step 1: Filter your dataset

  • Keep only: Country <> 'United Kingdom' (exclude UK)

If you want to exclude returns, do the following:

  • Remove: InvoiceNo starting with 'C'

  • Remove: Quantity < 0 (returns)

Step 2: Create a calculated field

  • Create new column: TotalSales = Quantity × UnitPrice

  • This gives you revenue per line item

What you should have now:

  • Clean dataset excl. UK

  • A TotalSales column showing revenue

  • Roughly 40-50K rows (manageable size)

Phase 3: Analyse — Find Your Insights

Analysis 1: Top 10 Bestselling Products

What to calculate:

  • Group by Description (product name)

  • Sum the Quantity for each product

  • Sort by total quantity descending

  • Take top 10

What you're looking for: Which products sell most by volume?

You can do this step twice - 1) by looking at all sales including returns/cancellations 2) excluding returns and checking pure sales only

Analysis 2: Top 10 Revenue-Generating Products

What to calculate:

  • Group by Description

  • Sum the TotalSales for each product

  • Sort descending

  • Take top 10

What you're checking: Are the bestselling products (by quantity) the same as highest revenue products? Or are you selling lots of cheap items (this can also be confirmed by creating a distribution chart)?

Analysis 3: Sales by Hour of Day

What to calculate:

  • Extract hour from InvoiceDate

  • Group by hour

  • Sum TotalSales

  • Count number of Invoices

What you're finding: When do customers shop most?

Analysis 4: Sales by Day of Week

What to calculate:

  • Extract day of week from InvoiceDate (Monday, Tuesday, etc.). In Excel the easiest way is to do it by using formula = TEXT(A2,”dddd”)

  • Group by day

  • Sum TotalSales

What you're finding: Which days are busiest?

What you should have now:

  • List of top 10 products by quantity

  • List of top 10 products by revenue

  • Sales volume by hour

  • Sales volume by day of week

In all of these cuts, you can do analysis per different country too - are customers from Germany behave different to Portugal? What is the busiest day of the week in France?

Phase 4: Visualise Your Key Finding

Choose ONE chart to create.

Option A: Top 10 Products Bar Chart

  • X-axis: Product names

  • Y-axis: Total quantity sold

  • Simple horizontal bar chart

  • Title: "Top 10 Bestselling Products (Jan-Mar 2011)"

Option B: Sales by Hour

  • X-axis: Hour of day (0-23)

  • Y-axis: Total sales

  • Line or bar chart

  • Title: "Peak Shopping Hours"

Option C: Sales by Day of Week

  • X-axis: Days (Mon-Sun)

  • Y-axis: Total sales

  • Bar chart

  • Title: "Sales by Day of Week"

Pick the one that tells the clearest story for your analysis.

Keep it simple: Clean labels, clear title, easy to understand at a glance. If you want to add countries’ breakdown, and your visual becomes messy - you can create separate visuals for the biggest (in terms of revenue) countries or the one that really stand out compared to the rest.

Phase 5: Your Insight and Recommendation

Based on what you found, write down the insights:

Your Insight: Example: "Product X accounts for 15% of all units sold, but our top 3 products by revenue are actually Y, Z, and W—lower volume, higher value items."

Your Recommendation: Example: "Focus marketing spend on high-value products Y, Z, W rather than high-volume, low-margin items."

Or:

Your Insight: "Peak shopping hours are 12-2pm, with 70% of daily sales occurring between 10am-4pm."

Your Recommendation: "Schedule promotional emails to arrive between 10am-12pm to catch peak shopping window."

What You've Just Built

You've completed:

  • Data quality assessment

  • Data cleaning

  • Multi-dimensional analysis

  • Data visualisation

  • Business insight with actionable recommendation

This is the exact workflow you'd use for a real business project.

The difference: Real projects take longer because datasets are messier, questions are more complex, and stakeholders change their minds (every single day. Haha).

But the process? Identical.

Tool-Specific Tips

If you used Excel:

  • PivotTables are your friend for the grouping analysis

  • Use HOUR() and WEEKDAY()/TEXT(,”dddd”) functions to extract time components

  • Keep your data in a Table format for easier filtering

If you used SQL:

  • Your GROUP BY queries should be straightforward

  • Use EXTRACT(HOUR FROM InvoiceDate) for hour analysis

  • TO_CHAR() or similar for day of week

  • Remember to filter in your WHERE clause

If you used Python:

  • pandas groupby() handles all the aggregations

  • dt.hour and dt.dayofweek give you time components

  • matplotlib or seaborn for quick visualisations

  • Use query() to filter your dataframe

The Meta-Lesson

Notice what we didn't focus on:

  • Perfect code

  • Complex techniques

  • Fancy visualisations

  • Hours of exploration

We focused on:

  • Understanding the business question

  • Cleaning data properly

  • Finding actionable insights

  • Communicating clearly

Make It Your Own

Now that you've done the basic analysis, try these next:

  • Find customers who made repeat purchases

  • Calculate average order value by day/hour

  • Identify products frequently bought together

  • Compare month-over-month trends

Or pick a different business question:

  • Which products have highest return rates?

  • Which countries generate most revenue?

  • Do high-value customers shop at different times?

Same dataset, endless questions. Just like in real job.

Now do it again with a different dataset. Then again. Then again.

Keep pushing 💪,

Karina

Grab your freebies if you haven’t done already:

Data Playbook (CV template, Books on Data Analytics and Data Science, Examples of portfolio projects)

Need more help?

Just starting with Python? Wondering if programming is for you?

Master key data analysis tasks like cleaning, filtering, pivot and grouping data using Pandas, and learn how to present your insights visually with Matplotlib with ‘Data Analysis with Python’ masterclass.

Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.

Grab your Pandas CheatSheet here. Everything you need to know about Pandas - from file operations to visualisations in one place.

More from me: YouTube | TikTok | Instagram | Threads | LinkedIn

Data Analyst & Data Scientist