- Karina Datascientist's Newsletter
- Posts
- Let's Build a Mini Data Project Together
Let's Build a Mini Data Project Together
Today, we're building a complete data project.
No code provided. Just clear steps. You choose your tool—Excel, SQL, or Python.
What We're Building
The Dataset: Online Retail Data from Kaggle
Download here: https://www.kaggle.com/datasets/vijayuv/onlineretail
(you might need to register on Kaggle to be able to download it, which is free)
The Business Question: Which products are our bestsellers, and when do customers buy most?
Before You Start
Download the dataset and open it in your tool of choice:
Excel: Open the CSV directly
SQL: Import into your database (SQLite, PostgreSQL, whatever you use)
Python: Load with pandas
Quick look at the structure:
InvoiceNo: Transaction ID
StockCode: Product code
Description: Product name
Quantity: Number of items
InvoiceDate: When the purchase happened
UnitPrice: Price per item
CustomerID: Customer identifier
Country: Where the customer is located
Note: This dataset has over 500K rows. For this quick exercise, filter out the UK (all countries but UK) which will leave you with 46K rows—plenty for analysis, manageable for speed.
Phase 1: Quick Data Quality Check
Before doing anything, check for problems:
Check 1: Look for negative quantities
You'll find some. These are returns/cancellations (InvoiceNo starting with ‘C’) as well as some manual adjustments (StockCode = M)
Decision: I suggest to keep them, but perform separate analysis on this subset - what products get returned the most, what month, what country etc.
If you decide to exclude anything - this should be specified in your final insights.
Check 2: Check for nulls in CustomerID
You'll find many. These might be guest checkouts
Decision: Keep them for product analysis, but note we can't track customer behaviour for these
Check 3: Validate data types
This step is important if you are using SQL or Python for this analysis
Make sure Quantity and UnitPrice are numeric
Make sure InvoiceDate is actually a date
What you're doing: Understanding your data before analysing it.
Phase 2: Clean Your Data
Step 1: Filter your dataset
Keep only: Country <> 'United Kingdom' (exclude UK)
If you want to exclude returns, do the following:
Remove: InvoiceNo starting with 'C'
Remove: Quantity < 0 (returns)
Step 2: Create a calculated field
Create new column: TotalSales = Quantity × UnitPrice
This gives you revenue per line item
What you should have now:
Clean dataset excl. UK
A TotalSales column showing revenue
Roughly 40-50K rows (manageable size)
Phase 3: Analyse — Find Your Insights
Analysis 1: Top 10 Bestselling Products
What to calculate:
Group by Description (product name)
Sum the Quantity for each product
Sort by total quantity descending
Take top 10
What you're looking for: Which products sell most by volume?
You can do this step twice - 1) by looking at all sales including returns/cancellations 2) excluding returns and checking pure sales only
Analysis 2: Top 10 Revenue-Generating Products
What to calculate:
Group by Description
Sum the TotalSales for each product
Sort descending
Take top 10
What you're checking: Are the bestselling products (by quantity) the same as highest revenue products? Or are you selling lots of cheap items (this can also be confirmed by creating a distribution chart)?
Analysis 3: Sales by Hour of Day
What to calculate:
Extract hour from InvoiceDate
Group by hour
Sum TotalSales
Count number of Invoices
What you're finding: When do customers shop most?
Analysis 4: Sales by Day of Week
What to calculate:
Extract day of week from InvoiceDate (Monday, Tuesday, etc.). In Excel the easiest way is to do it by using formula = TEXT(A2,”dddd”)
Group by day
Sum TotalSales
What you're finding: Which days are busiest?
What you should have now:
List of top 10 products by quantity
List of top 10 products by revenue
Sales volume by hour
Sales volume by day of week
In all of these cuts, you can do analysis per different country too - are customers from Germany behave different to Portugal? What is the busiest day of the week in France?
Phase 4: Visualise Your Key Finding
Choose ONE chart to create.
Option A: Top 10 Products Bar Chart
X-axis: Product names
Y-axis: Total quantity sold
Simple horizontal bar chart
Title: "Top 10 Bestselling Products (Jan-Mar 2011)"
Option B: Sales by Hour
X-axis: Hour of day (0-23)
Y-axis: Total sales
Line or bar chart
Title: "Peak Shopping Hours"
Option C: Sales by Day of Week
X-axis: Days (Mon-Sun)
Y-axis: Total sales
Bar chart
Title: "Sales by Day of Week"
Pick the one that tells the clearest story for your analysis.
Keep it simple: Clean labels, clear title, easy to understand at a glance. If you want to add countries’ breakdown, and your visual becomes messy - you can create separate visuals for the biggest (in terms of revenue) countries or the one that really stand out compared to the rest.
Phase 5: Your Insight and Recommendation
Based on what you found, write down the insights:
Your Insight: Example: "Product X accounts for 15% of all units sold, but our top 3 products by revenue are actually Y, Z, and W—lower volume, higher value items."
Your Recommendation: Example: "Focus marketing spend on high-value products Y, Z, W rather than high-volume, low-margin items."
Or:
Your Insight: "Peak shopping hours are 12-2pm, with 70% of daily sales occurring between 10am-4pm."
Your Recommendation: "Schedule promotional emails to arrive between 10am-12pm to catch peak shopping window."
What You've Just Built
You've completed:
Data quality assessment
Data cleaning
Multi-dimensional analysis
Data visualisation
Business insight with actionable recommendation
This is the exact workflow you'd use for a real business project.
The difference: Real projects take longer because datasets are messier, questions are more complex, and stakeholders change their minds (every single day. Haha).
But the process? Identical.
Tool-Specific Tips
If you used Excel:
PivotTables are your friend for the grouping analysis
Use HOUR() and WEEKDAY()/TEXT(,”dddd”) functions to extract time components
Keep your data in a Table format for easier filtering
If you used SQL:
Your GROUP BY queries should be straightforward
Use EXTRACT(HOUR FROM InvoiceDate) for hour analysis
TO_CHAR() or similar for day of week
Remember to filter in your WHERE clause
If you used Python:
pandas groupby() handles all the aggregations
dt.hour and dt.dayofweek give you time components
matplotlib or seaborn for quick visualisations
Use query() to filter your dataframe
The Meta-Lesson
Notice what we didn't focus on:
Perfect code
Complex techniques
Fancy visualisations
Hours of exploration
We focused on:
Understanding the business question
Cleaning data properly
Finding actionable insights
Communicating clearly
Make It Your Own
Now that you've done the basic analysis, try these next:
Find customers who made repeat purchases
Calculate average order value by day/hour
Identify products frequently bought together
Compare month-over-month trends
Or pick a different business question:
Which products have highest return rates?
Which countries generate most revenue?
Do high-value customers shop at different times?
Same dataset, endless questions. Just like in real job.
Now do it again with a different dataset. Then again. Then again.
Keep pushing 💪,
Karina
Grab your freebies if you haven’t done already:
Data Playbook (CV template, Books on Data Analytics and Data Science, Examples of portfolio projects)
Need more help?
Just starting with Python? Wondering if programming is for you?
Master key data analysis tasks like cleaning, filtering, pivot and grouping data using Pandas, and learn how to present your insights visually with Matplotlib with ‘Data Analysis with Python’ masterclass.
Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.
Grab your Pandas CheatSheet here. Everything you need to know about Pandas - from file operations to visualisations in one place.
![]() | Data Analyst & Data Scientist |
