Karina Datascientist's Newsletter

The Data Science Roadmap: What to Actually Learn

Karina Samsonova
December 05, 2025

"I want to become a data scientist. Where do I start?"

I get this question constantly. And I get it—data science sounds exciting. Machine learning! AI! Predictive models!

Here's the roadmap that actually makes sense, based on working as both a data analyst and data scientist for over 11 years.

The Reality Check First

Before we dive into the roadmap, let's be honest about something:

Data science isn't always more advanced than data analytics. It's just different.

Data analysts answer "what happened and why?" with data. Data scientists predict "what will happen?" using models.

Both require:

Strong analytical thinking
Business understanding
Communication skills
Technical proficiency

The difference is in the tools and the questions you're answering, not in who's "smarter."

"Can I Skip Data Analytics and Go Straight to Data Science?"

I get asked this often. The answer is: Yes, but...

You can absolutely skip having "Data Analyst" as a job title. You don't need to spend 2-3 years as an analyst before becoming a data scientist.

However—and this is crucial—you CANNOT skip data analytics skills.

Here's why:

In my experience, data scientists spend 60-80% of their time doing analyst work:

Cleaning messy data
Exploring datasets to understand patterns
Validating data quality
Creating visualisations
Communicating findings to stakeholders

The machine learning part? That's maybe 20-40% of the job.

What happens when people skip analyst skills:

I've interviewed data scientists who:

Could build complex neural networks
Couldn't write a proper SQL JOIN
Built models on data they didn't understand
Couldn't explain their findings to stakeholders
Had no idea if their results made business sense

They didn't get the job.

The skills you absolutely need (regardless of title):

Data manipulation - SQL and Python/pandas. Non-negotiable.
Exploratory data analysis - Understanding your data before modelling
Data visualisation - Communicating insights clearly
Business context - Knowing what questions matter and why
Stakeholder communication - Explaining technical work to non-technical people

So yes, you can go straight for data science roles. But you'll need to learn all the analyst skills along the way. There's no shortcut around understanding data fundamentally.

My recommendation? Learn data analytics skills FIRST, then build up to machine learning. Not because you need the job title, but because you need the foundation.

Are You Ready for Data Science?

You should have these foundations first:

Comfort with data manipulation (SQL, Excel, or Python/pandas)
Understanding of basic statistics (mean, median, distributions, correlation)
Experience answering business questions with data
Ability to communicate findings clearly

Don't have these yet? Spend 2-3 months building analyst skills first. You can learn them without being employed as an analyst—through projects, courses, and practice.

Then move into machine learning. Your data science work will be infinitely better because of it.

Phase 1: The Maths You Actually Need

Everyone panics about maths. "Do I need a PhD in mathematics?!"

No. But you do need to understand the fundamentals.

Statistics and Probability - Non-Negotiable

Hypothesis testing (p-values, confidence intervals)
Probability distributions (normal, binomial, Poisson)
Regression (linear and logistic)
Understanding bias and variance
Overfitting vs underfitting

Why this matters: You need to know if your model is actually working or if it's just memorising data.
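
Here's a minimal sketch of what "memorising data" looks like in practice. It assumes scikit-learn is installed and uses synthetic data with deliberately noisy labels (my choice for illustration, not a real dataset). An unconstrained decision tree fits the training set perfectly, and the gap between training and test accuracy is the warning sign.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 randomly reassigns about 20% of labels (illustrative assumption)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# No depth limit: the tree keeps splitting until every training point is classified
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = deep_tree.score(X_train, y_train)
test_acc = deep_tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

If the training score is far above the test score, the model has learned noise rather than signal. Limiting `max_depth` is one common way to narrow that gap.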

Linear Algebra - Just Enough

Vectors and matrices
Matrix multiplication
Understanding what dimensions mean

Why this matters: Machine learning is matrix operations under the hood. You don't need to be an expert, but you need to understand what's happening.
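
To make "matrix operations under the hood" concrete, here's a small sketch of how a linear model turns a table of features into predictions. The numbers and the names `X`, `w` and `b` are made up for illustration.

```python
import numpy as np

# 3 samples (rows) x 2 features (columns)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # one weight per feature
b = 2.0                    # intercept

# Matrix-vector product: shape (3, 2) @ (2,) gives one prediction per sample, shape (3,)
y_pred = X @ w + b

print(X.shape, w.shape, y_pred.shape)
print(y_pred)
```

The number of columns in `X` has to match the length of `w`. That's the "understanding what dimensions mean" part, and it's the source of most shape errors you'll hit.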

Calculus - The Bare Minimum

Derivatives and gradients
Understanding optimisation

Why this matters: Models "learn" by minimising error through gradient descent. You need to understand the concept, not solve equations by hand.
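
Here's a rough sketch of that idea: gradient descent fitting a straight line to synthetic data. The true slope (3) and intercept (1), the learning rate and the number of steps are all assumptions I picked for the demo. In real work the library does this for you.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, size=200)  # line plus a little noise

w, b = 0.0, 0.0       # start with a deliberately bad guess
learning_rate = 0.1

for step in range(500):
    error = (w * x + b) - y
    # Derivatives of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step downhill, against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w = {w:.2f}, b = {b:.2f}")
```

Each pass nudges `w` and `b` in the direction that reduces the error, so the learned values should end up close to the ones used to generate the data.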

How to learn: Khan Academy (free), StatQuest on YouTube, or 3Blue1Brown for visual understanding.

Don't overthink this phase. You need understanding, not expertise. Move on once you grasp the concepts.

Phase 2: Machine Learning Fundamentals

This is where it gets interesting.

Supervised Learning - Start Here

Regression:

Linear regression (you probably know this)
Polynomial regression
Regularisation (Ridge, Lasso)

Classification:

Logistic regression
Decision trees
Random forests
Support Vector Machines (SVM)
K-Nearest Neighbours (KNN)

Ensemble Methods:

Bagging
Boosting (XGBoost, LightGBM)

Unsupervised Learning

Clustering:

K-means
Hierarchical clustering
DBSCAN

Dimensionality Reduction:

PCA (Principal Component Analysis)
t-SNE

The Critical Skills Here:

Understanding when to use which algorithm
Feature engineering (creating useful inputs)
Model evaluation (accuracy, precision, recall, F1-score, ROC-AUC)
Cross-validation
Hyperparameter tuning
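
As a quick illustration of the evaluation metrics, here's a sketch using scikit-learn's metric functions on ten hand-written, hypothetical labels where the positive class is rare.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive class (rare), 0 = negative class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy comes out at 0.80 while precision, recall and F1 are all 0.50. On imbalanced data, accuracy alone can make a weak model look fine.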

What's more important than knowing algorithms? Knowing which algorithm to use and why.

I've seen people throw neural networks at problems that linear regression would solve perfectly.
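
A sketch of how you might check this yourself with cross-validation. Two assumptions to flag: the data is synthetic and linear by construction, so it favours the simple model, and a random forest stands in for "the more complex model" so the example runs quickly without a deep learning framework.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data where the target really is a linear function of the features plus noise
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation; regressors are scored with R^2 by default
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

The habit matters more than the result: fit a simple baseline first, and only move to something heavier if it beats the baseline on held-out data.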

Phase 3: Python for Machine Learning (Overlaps with Phase 2)

If you're coming from data analytics, you might know Python basics. Now you need the ML libraries.

Essential Libraries:

scikit-learn - Your Main Tool

Model training and evaluation
Pre-processing and feature scaling
Pipeline creation
Every common ML algorithm

pandas - Data Manipulation

You should already know this from analyst work
If not, learn it first

NumPy - Numerical Computing

Array operations
Mathematical functions

Matplotlib/Seaborn - Visualisation

Model performance visualisation
Feature importance plots

What to focus on:

Loading and preparing data
Splitting data (train/test/validation)
Training models
Evaluating performance
Tuning hyperparameters
Making predictions

Don't worry about: Building algorithms from scratch. No need to reinvent the wheel. Use the libraries. That's what they're for.
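
The focus list above can be sketched end to end in a few lines of scikit-learn. This is one possible workflow, not the only one: the dataset (scikit-learn's built-in breast cancer data), the logistic regression model and the grid of `C` values are my choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load data (bundled with scikit-learn, no download needed)
X, y = load_breast_cancer(return_X_y=True)

# 2. Split: hold out a test set that tuning never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. Pre-processing and model in one pipeline, so scaling is fitted on training folds only
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 4. Tune hyperparameters with 5-fold cross-validation on the training set
search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

# 5. Evaluate once on the held-out test set, then predict
test_f1 = search.score(X_test, y_test)  # uses the F1 scorer chosen above
y_pred = search.predict(X_test)

print("best C:", search.best_params_["model__C"])
print(f"test F1: {test_f1:.3f}")
```

Here the validation part of train/test/validation is handled by the cross-validation folds inside `GridSearchCV`, so the test set is only touched at the very end.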

Phase 4: The Real-World Skills

Feature Engineering

Creating useful features from raw data. This is often more important than choosing the "perfect" algorithm.

Examples:

Creating date features (day of week, month, is_weekend)
Combining features (revenue = price × quantity)
Encoding categorical variables
Handling missing values intelligently
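
Here's how those four examples might look in pandas. The `orders` table and its column names are hypothetical, and median imputation is just one simple way to handle a missing value. The right choice depends on why the data is missing.

```python
import pandas as pd

# Hypothetical order data for illustration
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-08"]),
    "price": [10.0, 20.0, None],
    "quantity": [3, 1, 2],
    "channel": ["web", "store", "web"],
})

# Date features
orders["day_of_week"] = orders["order_date"].dt.dayofweek  # Monday=0 ... Sunday=6
orders["month"] = orders["order_date"].dt.month
orders["is_weekend"] = orders["day_of_week"] >= 5

# Missing values: keep a flag so the model can see the value was imputed
orders["price_missing"] = orders["price"].isna()
orders["price"] = orders["price"].fillna(orders["price"].median())

# Combined feature
orders["revenue"] = orders["price"] * orders["quantity"]

# Encode the categorical column as one indicator column per category
orders = pd.get_dummies(orders, columns=["channel"])

print(orders)
```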

Model Deployment

Your model is useless if it only runs on your laptop.

Learn:

Saving models (pickle, joblib)
Creating simple web apps (Streamlit or Gradio) - easiest way to demo your models
Basic API creation (Flask or FastAPI) - for integrating models into applications
Version control (Git/GitHub) - tracking your code and models
Understanding production constraints (speed, memory)
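
As a starting point for the first item, here's a minimal sketch of saving and reloading a model with joblib. The model, the dataset and the file name `model.joblib` are placeholders, and a temporary directory is used so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small placeholder model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)               # save to disk
    restored = joblib.load(model_path)           # load it back, e.g. inside an app or API

# The restored model should behave exactly like the original
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("predictions match:", same_predictions)
```

Only load pickle or joblib files you trust, and load them with the same library versions you saved them with, since both formats can break or execute code on load.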

Optional but valuable:

Docker basics - packaging your model and dependencies
Cloud deployment (AWS, GCP, or Azure) - getting your model online

You don't need to be a DevOps expert, but you should understand how models get into production.

A/B Testing and Experimentation

How do you know your model is actually better than the current process?

Learn:

Experimental design
Statistical significance
Measuring incremental impact
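
To show what a significance check can look like, here's a sketch of a two-proportion z-test using only the standard library. The conversion counts are invented for illustration, and the test assumes samples large enough for the normal approximation to hold.

```python
from math import erfc, sqrt

# Made-up experiment results (assumption for illustration)
control_conversions, control_users = 200, 2000  # current process: 10%
variant_conversions, variant_users = 260, 2000  # new model: 13%

p_control = control_conversions / control_users
p_variant = variant_conversions / variant_users

# Pooled rate under the null hypothesis that both groups convert equally
p_pooled = (control_conversions + variant_conversions) / (control_users + variant_users)
std_error = sqrt(p_pooled * (1 - p_pooled) * (1 / control_users + 1 / variant_users))

z = (p_variant - p_control) / std_error
p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal distribution

print(f"uplift: {p_variant - p_control:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value says the difference is unlikely to be chance alone. Whether the uplift is big enough to be worth shipping is a separate, business question.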

MLOps Basics

Version control for data and models
Monitoring model performance over time
Retraining strategies

Phase 5: Deep Learning

Here's the controversial bit: You might not need deep learning.

Seriously. Most business problems don't require neural networks. They require good feature engineering and the right algorithm.

When you DO need deep learning:

Image recognition
Natural language processing (text)
Speech recognition
Time series with complex patterns
Very large, unstructured datasets

When you DON'T need deep learning:

Tabular data (gradient-boosted trees such as XGBoost are usually the better starting point)
Small datasets (neural networks need lots of data)
Problems where interpretability matters
When simpler models work fine

If you decide to learn it:

Frameworks:

TensorFlow or PyTorch (pick one, don't learn both at once)
Start with Keras (a simpler, high-level interface that ships with TensorFlow)

Concepts:

Neural network architecture
Activation functions
Backpropagation (conceptually)
CNNs for images
RNNs/LSTMs for sequences
Transfer learning

My honest advice: Unless you're specifically targeting computer vision or NLP roles, spend your time getting really good at Phases 1-4 first.

Your Realistic Timeline

Can you learn data science in 3 months? Enough to apply for junior roles? Maybe, if you're coming from data analytics and focused full-time.

More realistic:

6-9 months: Job-ready for junior data scientist roles (coming from analyst background)
12-18 months: Comfortable and confident
2+ years: Actually good at this

If you're starting from zero (no programming, no analytics): Add 6-12 months for foundations.

The Skills That Actually Matter

After 11 years, here's what I've learned:

Technical skills get you the interview. These skills get you the job:

Problem framing: Understanding what problem you're actually solving
Business sense: Knowing when a model is worth building
Communication: Explaining models to non-technical stakeholders
Critical thinking: Knowing when your model is wrong
Experimentation: Proper A/B testing and measurement

Your Action Plan Based on Where You Are

Complete beginner (no tech background): Start with data analyst skills first. Seriously. You need the foundation.

Data analyst wanting to transition: Phase 1 (maths fundamentals) is your priority. You likely have the tools already; now you need the theory.

Have some Python, want to level up: Phase 2 (ML fundamentals) and Phase 3 (scikit-learn). Build projects while learning.

Know the basics, struggling to get hired: Phase 4 (real-world skills) and build portfolio projects that show business impact, not just model accuracy.

Data science isn't magic. It's not always more advanced or more valuable than data analytics.

It's a different toolkit for different problems.

Some projects need prediction (data science). Some need understanding (data analytics). Many need both.

Keep pushing 💪

Karina

Python Tip

Real-world data has dates like this:

2024-01-05

01/05/2024

05.01.24

Trying to parse them manually? Nightmare.

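Use pd.to_datetime with these key parameters:

errors='coerce' → Invalid dates become NaT (Not a Time)

dayfirst=True → Interprets 05/01/2024 as Jan 5, not May 1

format='mixed' → Infers the format of each value separately (pandas 2.0+). Without it, pandas 2.x infers one format from the first value, and with errors='coerce' every date in a different format becomes NaT

import pandas as pd

df = pd.DataFrame({
    'date': ['2024-01-05', '01/05/2024', '05.01.24', 'invalid'],
    'sales': [100, 200, 150, 300]
})

df['date'] = pd.to_datetime(
    df['date'],
    format='mixed',    # pandas 2.0+: parse each value's format individually
    errors='coerce',   # Invalid → NaT (not error)
    dayfirst=True      # Ambiguous dates read as DD/MM/YYYY (EU format)
)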

df

