When I joined a new company, one of my SPOCs kept saying "DWH" in every meeting.

"Should be in DWH", "Check DWH". (He also pronounced it dee-vee-eich, and ‘vee’ confused me).

I nodded along, pretending I understood. Till one day I realised he was referring to the data warehouse.

In today's newsletter, let's talk about data warehouses, data lakes, databases - and what the differences are.

Because nobody explains this clearly, and everyone assumes you already know.

The Jargon Problem

People throw around these terms constantly:

  • "Store it in the data lake"

  • "Query the DWH"

  • "It's in the database"

  • "Pull from the data mart"

  • "Check the data lakehouse" (yes, this is now a thing)

If you're confused, you're not alone.

Let's break down the main three you'll actually encounter: database, data warehouse, and data lake.

Database: Where Live Data Lives

What it is: Where your company's operational data lives right now. The system your app/website uses to function.

Examples:

  • Customer database (all current customer records)

  • Orders database (every order placed)

  • Product database (inventory, prices, descriptions)

  • User database (login credentials, settings)

Common types: MySQL, PostgreSQL, MongoDB, SQL Server

What it's optimised for:

  • Fast transactions (someone places an order → save it immediately)

  • Reading and writing constantly

  • Current state ("what's in stock right now?")

What you do with it as an analyst: Honestly? Usually nothing directly.

Why not?

  • It's optimised for transactions, not analysis

  • Your queries would slow down the production system

  • Data changes constantly (hard to analyse a moving target)

  • Complex joins are slow

Instead: Data gets copied from here to the data warehouse for you to analyse.

When you might use it:

  • Very small companies with no data warehouse yet

  • Ad-hoc urgent queries (with permission)

  • When you're also a backend developer

You don't query it directly for analysis. You wait for that data to be copied to the data warehouse overnight, then analyse it there.
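The difference between the two access patterns is easy to see with a tiny SQLite sketch (an illustrative in-memory example, not a production setup; the `orders` table and its columns are made up):

```python
import sqlite3

# Hypothetical orders table, just to contrast the two access patterns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# OLTP pattern: small, frequent writes, committed immediately.
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 25.0))
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("bob", 40.0))
conn.commit()

# Analytical pattern: scan and aggregate the whole table.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 65.0
```

On a real production database, that aggregate scan competes with the live writes, which is exactly why the copy-to-warehouse step exists.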

Data Warehouse (DWH): The Clean, Organised Library

What it is: A structured repository of cleaned, processed, historical data from multiple sources - designed specifically for analysis.

Think of it like a well-organised library:

  • Books are catalogued

  • Everything has a clear place

  • Easy to find what you need

  • Historical archives

What happens behind the scenes:

  1. Data gets copied from operational databases (orders, customers, products)

  2. Data engineers clean it (remove duplicates, fix formats, handle nulls)

  3. Data gets structured into tables with clear relationships

  4. Historical snapshots are stored

You don't see this process. You just query the clean result.
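Step 2 (cleaning) can be sketched in a few lines of Python. The records and field names below are invented for illustration; real pipelines use dedicated ETL tooling, but the logic is the same:

```python
# Raw rows as they might arrive from an operational database.
raw = [
    {"order_id": 1, "email": " Alice@Example.com ", "amount": "25.0"},
    {"order_id": 1, "email": " Alice@Example.com ", "amount": "25.0"},  # duplicate
    {"order_id": 2, "email": None, "amount": "40"},                     # null email
]

seen, clean = set(), []
for row in raw:
    if row["order_id"] in seen:  # remove duplicates
        continue
    seen.add(row["order_id"])
    clean.append({
        "order_id": row["order_id"],
        # fix formats and handle nulls
        "email": (row["email"] or "unknown").strip().lower(),
        "amount": float(row["amount"]),  # consistent numeric type
    })

print(clean)
```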

What it's optimised for:

  • Analysis and reporting

  • Complex queries and joins

  • Historical data ("what were sales 2 years ago?")

  • Aggregations and calculations

What you do with it as an analyst: This is where you spend 80-90% of your time.

  • Write SQL queries

  • Build dashboards

  • Create reports

  • Run analysis

  • Join tables to answer business questions

Tools you'll use:

  • SQL (always)

  • BI tools (Power BI, Tableau, Looker) connected to DWH

  • DBeaver, DataGrip, SQL Server Management Studio

Your queries run fast here because the data is optimised for reading.
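A typical warehouse query joins a fact table to a dimension table to answer a business question. Here is a minimal sketch using SQLite as a stand-in warehouse (the `fact_orders` and `dim_customer` tables are invented for illustration):

```python
import sqlite3

# Hypothetical warehouse tables; names and data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER, country TEXT);
    CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'UK'), (2, 'DE');
    INSERT INTO fact_orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 30.0);
""")

# Typical analyst question: revenue per country.
rows = conn.execute("""
    SELECT c.country, SUM(f.amount) AS revenue
    FROM fact_orders f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY c.country
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('UK', 40.0), ('DE', 30.0)]
```

In a real warehouse you'd run the same SQL through your BI tool or SQL client rather than Python.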

What happens to old data:

Data warehouses typically keep 2-3 years of data readily accessible. Older data gets archived.

In my company:

  • We have different retention policies for different tables. In some areas, data older than 2 months gets moved to cold storage (cheaper, slower to access). In others, you can easily access 2-3 years of data.

  • If we need historical data from years ago, my team submits a backup recovery request

  • It takes a few days/weeks to retrieve and restore

  • Then we can query it like normal

Why this matters: If you need to analyse very old data, plan ahead. It's not instant. Ask your data team about retention policies - you don't want to promise a stakeholder analysis from 2018 only to discover it takes a week to retrieve. I was recently working on a request where, by the time the team recovered the old data, newer data had already been archived, so I was always missing some piece of information to complete the analysis.

Data Lake: The Giant Storage Room

What it is: A massive repository of raw, unprocessed data in its original format. Everything gets dumped here before it's cleaned.

Think of it like a warehouse storage room:

  • Everything thrown in

  • Minimal organisation

  • Raw materials, not finished products

  • You need to process it yourself before using it

What's stored here:

  • Raw data files (CSV, JSON, Parquet)

  • Unstructured data (PDFs, emails, images, videos, voice files)

  • Server logs (millions of rows, messy)

  • Streaming data (real-time clickstreams)

  • Data from APIs (before it's cleaned)

  • Old data nobody's processed yet

What it's optimised for:

  • Cheap storage of massive amounts of data

  • Storing any format (structured, semi-structured, unstructured)

  • Flexibility (figure out how to use it later)

What you do with it as an analyst/data scientist:

Data Analysts: Rarely use it directly (maybe 10% of the time)

Data Scientists: Use it regularly (maybe 40-50% of the time)

When you use it:

  • Building ML models (need raw features not in warehouse)

  • Accessing data not cleaned yet

  • Exploratory analysis on new data sources

  • Processing unstructured data (text, images)

Tools you'll use:

  • Python (pandas for smaller data)

  • PySpark (for big data processing)

  • Jupyter notebooks

  • Cloud platforms (AWS S3, Azure Data Lake Storage, Google Cloud Storage)

  • Databricks (if your company uses it)
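Working with the lake usually starts with loading a raw file and flattening it into something tabular. A minimal sketch with Python's standard library (the payload structure and field names are invented; at scale you'd use pandas or PySpark for the same step):

```python
import json

# A raw API payload as it might land in the lake, before any cleaning.
raw = '{"event": "click", "user": {"id": 7, "plan": "pro"}, "ts": 1700000000}'

record = json.loads(raw)

# Exploratory flattening: pull nested fields into one flat row,
# the shape you'd eventually load into a warehouse table.
flat = {
    "event": record["event"],
    "user_id": record["user"]["id"],
    "plan": record["user"]["plan"],
    "ts": record["ts"],
}
print(flat)
```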

Common Acronyms You'll Hear

Once you understand the concepts, here are the jargon terms people actually use:

DWH / EDW: Data Warehouse / Enterprise Data Warehouse (effectively the same thing; the "E" just signals company-wide scope)

Data Mart: Small, focused subset of data warehouse (e.g., just marketing data)

OLTP: Online Transaction Processing (production databases)

OLAP: Online Analytical Processing (data warehouses)

ETL: Extract, Transform, Load (how data gets from databases to warehouse)

ELT: Extract, Load, Transform (newer approach, transform after loading)

Data Lakehouse: Combination of lake + warehouse (yes, this is real, and a relatively new concept)

You don't need to memorise these. Just know they exist.

As a data analyst: Master SQL and the data warehouse. That's 90% of your job.

As a data scientist: Be comfortable with both warehouse (SQL) and lake (Python/Spark).

And next time someone says "DWH," you'll know exactly what they mean.

Keep pushing 💪,

Karina

P.S. If you're starting out and your company only has databases (no warehouse or lake), that's normal for small companies. You'll learn warehouse concepts when you move to a bigger company. Focus on SQL first—that skill transfers everywhere.

Need more help?

Just starting with Python? Wondering if programming is for you?

Master key data analysis tasks like cleaning, filtering, pivoting, and grouping data using Pandas, and learn how to present your insights visually with Matplotlib in the ‘Data Analysis with Python’ masterclass.

Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.

Grab your Pandas CheatSheet here. Everything you need to know about Pandas - from file operations to visualisations in one place.

More from me: YouTube | TikTok | Instagram | Threads | LinkedIn

Data Analyst & Data Scientist
