In partnership with

Two weeks ago we covered the basics — variables, data types, loops, functions. The feedback was incredible, thank you.

Today we go further. We are going to load a real dataset, explore it, and clean it using pandas — the most important Python library for data analysts.

If you missed Part 1, read it first (link). The concepts there are the foundation for everything we do today.

Before we proceed - a small ad. Your clicks on the ads help to cover newsletter hosting fees. And it is a way to say Thank you for the newsletter.

Your prompts are leaving out 80% of what you're thinking.

When you type a prompt, you summarize. When you speak one, you explain. Wispr Flow captures your full reasoning — constraints, edge cases, examples, tone — and turns it into clean, structured text you paste into ChatGPT, Claude, or any AI tool. The difference shows up immediately. More context in, fewer follow-ups out.

89% of messages sent with zero edits. Used by teams at OpenAI, Vercel, and Clay. Try Wispr Flow free — works on Mac, Windows, and iPhone.

What is pandas?

Pandas is a Python library built specifically for working with data. Think of it as Excel, but in code — you can read files, filter rows, clean columns, calculate summaries, and analyse data.

It is the first thing every data analyst learns after Python basics, and the one you will use every single day.

There are other libraries too (more modern, that can deal with huge datasets way faster than pandas, but pandas is still a default).

The dataset

We are using the Cafe Sales dataset from Kaggle — a deliberately messy sales dataset from a fictional cafe.

Perfect for learning because it has real data quality problems baked in.

In real life you will not have all the problems at once!

Open Google Colab (colab.research.google.com) and let's go.

Step 1 — Install and import pandas

Pandas comes pre-installed in Google Colab so you just need to import it. However, if you are using different IDE, like VS Code or Jupyter Notebook - you need to install the modules first.

# !pip install pandas  #install the library if needed. 
                       #Colab has it pre-installed 
import pandas as pd

The as pd part is just a shortcut — instead of typing pandas every time, you type pd.

It is called alias and it is a convention.

For Google Colab we will also import files library, that will allow us to upload files to Colab

from google.colab import files

Step 2 — Load the data

For Colab you need to add the following line, which will allow you to select the file you want to work with.

uploaded = files.upload()

Now let’s read our file:

df = pd.read_csv('dirty_cafe_sales.csv')

That one line reads the entire CSV file into a DataFrame — pandas' version of a table. We call it df by convention, same as everyone else.

If you are working with several files, it is better idea to call your dataframe with some readable name like sales, customers, transactions. But if you are working with 1 file only, everyone defaults to simply df .

Now let's see what we have.

print(df.shape)      # rows and columns — how big is this dataset?
print(df.head())     # first 5 rows
print(df.tail())     # last 5 rows

df.shape gives you a tuple like (10000, 8) — 10,000 rows, 8 columns. Always check this first. It tells you what you are working with.

df.head()/df.tail() by default shows 5 rows, but you can change this number by indicating the number of rows in brackets. For example, df.head(10) will display 10 rows.

Step 3 — Understand your data

df.info()      # column names, data types, non-null counts
df.describe()  # summary statistics for numeric columns

df.info() is one of the most useful commands in pandas. It shows you every column, what type it is, and how many non-null values it has. If a column has 10,000 rows but only 8,500 non-null values — you have 1,500 missing values to deal with.

df.describe() gives you min, max, mean, median and more for every numeric column at a glance.

# Check column names
print(df.columns.tolist())

# Check for missing values per column
print(df.isnull().sum())

With this dataset you will find missing values and some columns with 'UNKNOWN' or 'ERROR' as placeholder values — classic real-world data quality issues.

Step 4 — Select and filter data

# Select one column
print(df['Item'])

# Select multiple columns
print(df[['Item', 'Quantity', 'Price Per Unit']])

# Filter rows — show only cash transactions
cash_sales = df[df['Payment Method'] == 'Cash']
print(cash_sales.head())

# Filter with multiple conditions
high_value = df[(df['Price Per Unit'] > 3) & (df['Quantity'] > 2)]
print(high_value.shape)

This is the pandas equivalent of SQL's SELECT and WHERE. You will use this constantly.

Step 5 — Clean the data

Now the real work. This dataset has several problems — let's fix them.

# Replace 'UNKNOWN' and 'ERROR' with NaN (proper missing value marker)
df = df.replace(['UNKNOWN', 'ERROR'], pd.NA)

# Check how many nulls we now have
print(df.isnull().sum())
# Fix data types — Price Per Unit should be numeric, not text
df['Price Per Unit'] = pd.to_numeric(df['Price Per Unit'], errors='coerce')
df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce')

errors='coerce' tells pandas: if you find something that cannot be converted to a number, turn it into NaN instead of crashing. Very useful.

You might ask why we are calculating Total Sales, when our dataset already has a column ‘Total Spent’. In real projects we often recalculate important metrics rather then trusting the source. Especially I recalculate ratios, %, as I’ve seen many issues with those datapoints over the years.

# Drop rows where key columns are missing
df = df.dropna(subset=['Item', 'Price Per Unit', 'Quantity'])

# Create a Total Sales column
df['Total Sales'] = df['Price Per Unit'] * df['Quantity']

print(df.shape)  # you will go from 10,000 rows down to around 8,147 after cleaning

Step 6 — Explore and summarise

Now your data is clean. Let's ask some questions.

# What items sell most?
print(df['Item'].value_counts().head(10))

# What is the total revenue?
print(df['Total Sales'].sum())

# What is the average transaction value?
print(df['Total Sales'].mean().round(2))

# Revenue by payment method
# Note: rows with missing Payment Method are excluded from groupby automatically
print(df.groupby('Payment Method')['Total Sales'].sum().sort_values(ascending=False))

# Revenue by item
print(df.groupby('Item')['Total Sales'].sum().sort_values(ascending=False).head(10))

groupby is the pandas version of SQL's GROUP BY. You will use it in almost every analysis you ever do.

Step 7 — Save your clean data

df.to_csv('cafe_sales_clean.csv', index=False)

Now you can use the cleaned data for further analysis.

What you just did

You loaded a real messy dataset, explored it, fixed data quality issues, created a new calculated column, ran grouped analysis, and saved a clean version. That is a complete pandas workflow — the same one professional analysts use every day.

What comes next

Part 3 will be visualisations — turning this data into charts using matplotlib and seaborn.

If you want structured lessons with hands-on projects, check out my Data Analysis with Python masterclass: link

Keep pushing 💪,

Karina

Data Analyst & Data Scientist

Keep Reading