Last week, Andrej Karpathy published one of the clearest “x‑ray views” into how modern AI models actually work: microgpt, a ~200‑line, dependency‑free Python implementation of a GPT‑style language model.

Today, let's talk about what large language models (LLMs) like ChatGPT, Claude and others are actually doing.

Under all the layers of infrastructure, scale, product UX, and billions of dollars of compute, ChatGPT, Claude, Llama and the rest are all doing the same fundamental thing as this 200-line microgpt script.

Predicting the next token in a sequence.

That's it.

What's a Token?

Before we go further: tokens.

A token is a chunk of text. Usually a word, sometimes part of a word, sometimes punctuation.

Examples:

  • "Hello" = 1 token

  • "Hello world" = 2 tokens

  • "I'm" = 2 tokens ("I" + "'m")

  • "data" = 1 token

  • "ChatGPT" = 2 tokens ("Chat" + "GPT")

The model doesn't see words. It sees tokens.
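To make that concrete, here's a toy tokenizer, a minimal sketch only: real models use learned subword schemes like BPE, and the regex and ids below are made up for illustration.

```python
import re

# Toy tokenizer (illustrative, NOT any real model's scheme): split text
# into words, contractions and punctuation, then map each piece to an
# integer id from a vocabulary built on the fly.
def tokenize(text, vocab):
    pieces = re.findall(r"[A-Za-z]+|'\w+|[^\w\s]", text)
    ids = []
    for piece in pieces:
        if piece not in vocab:
            vocab[piece] = len(vocab)  # assign the next free id
        ids.append(vocab[piece])
    return pieces, ids

vocab = {}
pieces, ids = tokenize("I'm learning about ChatGPT.", vocab)
print(pieces)  # ['I', "'m", 'learning', 'about', 'ChatGPT', '.']
print(ids)     # [0, 1, 2, 3, 4, 5]
```

Notice "I'm" splits into two tokens, and the model only ever sees the ids, not the text.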

I always say that a computer is a computing machine, a calculator: it can't understand words, only digits and tokens. (I know it's a very simplified view, but it sticks.)

What LLMs Do

Here's the entire process:

  1. You type: "The capital of France is"

  2. Model breaks it into tokens: ["The", "capital", "of", "France", "is"]

  3. Model looks at those tokens

  4. Model predicts: "What token is most likely to come next?"

  5. Model outputs: "Paris" (high probability), based on all the data the model was trained on

Then it repeats:

  1. Now the sequence is: ["The", "capital", "of", "France", "is", "Paris"]

  2. Model predicts next token: probably a period or comma

  3. And so on, token by token

That's what's happening when ChatGPT "writes" a response.

It's predicting one token at a time. Thousands of times.

You see it streaming word by word? That's because it's literally generating one token, then the next, then the next.
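The whole loop fits in a few lines of Python. Here `next_token_probs` is a hard-coded stand-in for the real model (which would compute these probabilities from billions of weights) — the table is purely illustrative:

```python
# Sketch of the autoregressive generation loop.
# next_token_probs is a FAKE model: a hard-coded lookup table.
def next_token_probs(tokens):
    table = {
        ("The", "capital", "of", "France", "is"): {"Paris": 0.95, "Lyon": 0.05},
        ("The", "capital", "of", "France", "is", "Paris"): {".": 0.8, ",": 0.2},
    }
    return table.get(tuple(tokens), {".": 1.0})

tokens = ["The", "capital", "of", "France", "is"]
for _ in range(2):
    probs = next_token_probs(tokens)
    # pick the most likely token (real models usually *sample* instead)
    tokens.append(max(probs, key=probs.get))
print(" ".join(tokens))  # The capital of France is Paris .
```

Streaming word by word is exactly this loop running live: each appended token becomes part of the input for the next prediction.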

The Core Ingredients (What microgpt Shows)

Karpathy's code distills modern LLMs into five core components:

1. Token embeddings

Each token gets converted into a vector of numbers.

"Paris" becomes something like [0.23, -0.45, 0.67, ...] (hundreds of dimensions)

These numbers capture meaning. Similar words have similar vectors.
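"Similar vectors" has a precise meaning: you can measure it with cosine similarity. The 3-dimensional vectors below are made up (real embeddings have hundreds of dimensions), but the idea carries over:

```python
import math

# Made-up 3-dimensional "embeddings" for illustration.
emb = {
    "Paris":  [0.9, 0.1, 0.3],
    "London": [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.7],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(emb["Paris"], emb["London"]))  # high: two capital cities
print(cosine(emb["Paris"], emb["banana"]))  # low: unrelated concepts
```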

2. Positional embeddings

The model needs to know where each token is in the sequence.

"Paris is the capital" vs "The capital is Paris" — same tokens, different meanings.

Position matters. So each position gets its own embedding too.
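In the simplest version (the one microgpt-style models use), "gets its own embedding" just means a position vector is added to the token vector. All numbers below are invented:

```python
# Sketch: token vector + position vector = the input the model actually
# sees. The same token at a different position produces a different input.
tok_emb = {"Paris": [0.9, 0.1], "capital": [0.5, 0.4]}
pos_emb = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.3, 0.2]]  # one per position

def embed(tokens):
    return [
        [t + p for t, p in zip(tok_emb[tok], pos_emb[i])]
        for i, tok in enumerate(tokens)
    ]

print(embed(["Paris", "capital"]))  # "Paris" at position 0
print(embed(["capital", "Paris"]))  # "Paris" at position 1: different vector
```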

3. Self-attention

This is where the model "looks at" all previous tokens to understand context.

When predicting the next token after "The capital of France is", the model pays attention to:

  • "France" (high attention — very relevant)

  • "capital" (high attention — also relevant)

  • "The" (low attention — less relevant)

It figures out which previous tokens matter most for predicting the next one.

Multi-head attention means it does this multiple times in parallel, looking for different patterns.
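A single attention head can be sketched in pure Python. This is simplified — real attention projects each token into learned query, key and value vectors first, which I've omitted here, attending over the raw vectors instead:

```python
import math

def softmax(xs):
    # Turn raw scores into probabilities that sum to 1.
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Simplified single-head attention (no learned projections)."""
    out = []
    for i, q in enumerate(vectors):
        # Causal mask: token i may only attend to tokens 0..i.
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors[: i + 1]]
        weights = softmax(scores)  # how much token i "looks at" each token
        out.append([
            sum(w * v[d] for w, v in zip(weights, vectors[: i + 1]))
            for d in range(len(q))
        ])
    return out

vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(vecs)
print(result[2])  # last token's output blends information from all three
```

Multi-head attention would run several copies of this in parallel, each with its own learned projections, then combine the results.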

4. Feed-forward network

After attention, the data goes through a simple neural network that transforms it further.

Think of it as: attention figures out what to pay attention to, feed-forward processes what to do with that information.
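The feed-forward block itself is tiny: expand the vector, apply a nonlinearity, project back down. The weights below are invented — in a real model they're learned during training:

```python
def relu(x):
    # The nonlinearity: negative values become zero.
    return max(0.0, x)

def feed_forward(vec, w1, w2):
    # Expand (here 2 -> 4 dimensions), apply ReLU, project back (4 -> 2).
    hidden = [relu(sum(x * w for x, w in zip(vec, col))) for col in w1]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w2]

w1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0], [2.0, 0.0]]  # made-up weights, 2 -> 4
w2 = [[0.5, 0.25, 0.0, 0.1], [0.0, 0.3, 0.2, -0.1]]      # made-up weights, 4 -> 2
print(feed_forward([0.8, 0.2], w1, w2))
```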

5. Training loop

The model learns by:

  • Looking at massive amounts of text

  • Trying to predict the next token

  • Getting it wrong

  • Adjusting its internal weights to get better

  • Repeating billions of times
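You can see the spirit of "learn patterns, predict the next token" in the simplest possible trained predictor: count which token follows which in the training text. Real models adjust billions of weights by gradient descent instead of counting, but the objective is the same:

```python
from collections import Counter, defaultdict

def train(text):
    # Count, for every token, which tokens follow it and how often.
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

corpus = ("the capital of France is Paris . "
          "the capital of Italy is Rome . "
          "the capital of France is Paris .")
model = train(corpus)
print(model["is"].most_common(1)[0][0])  # Paris (seen twice vs Rome once)
print(model["capital"].most_common(1))   # [('of', 3)]
```

Scale the "corpus" to trillions of tokens and swap counting for a neural network, and you have the same training loop microgpt implements.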

That's it. That's the core.

Everything else - GPT-4's estimated 1.7 trillion parameters, Claude's massive context window, ChatGPT's interface - is built on top of these same five ingredients.

Why This Matters

When you understand this, AI stops feeling like magic and becomes more like... maths and stats.

You realise:

  • LLMs don't "know" things. They predict likely next tokens based on patterns in training data.

  • They don't "think." They do very sophisticated pattern matching.

  • They can't "reason" in the way humans do. They're extrapolating from what they've seen.

  • When they hallucinate (make up facts), it's because they're predicting a plausible-sounding next token, not retrieving truth.

microgpt is tiny. Maybe a few thousand parameters.

GPT-4? Estimated 1.7 trillion parameters.

Same mechanism. Wildly different capabilities.

The simple process of "predict next token" becomes incredibly powerful when you:

  • Train on trillions of tokens of text

  • Use billions of parameters to capture patterns

  • Optimise with massive compute

It's like the difference between:

  • A 10-pixel image (you can see it's a face, barely)

  • A 10-million-pixel image (you can see every pore, every eyelash)

Same concept (arranging pixels). Completely different result at scale.

If you’ve been curious about how LLMs really work beyond the buzzwords, Karpathy’s post and code are one of the best starting points I’ve seen this year.

Keep pushing 💪,

Karina

Need more help?

Just starting with Python? Wondering if programming is for you?

Master key data analysis tasks like cleaning, filtering, pivoting and grouping data using Pandas, and learn how to present your insights visually with Matplotlib in the ‘Data Analysis with Python’ masterclass.

Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.

Grab your Pandas CheatSheet here. Everything you need to know about Pandas - from file operations to visualisations in one place.

More from me: YouTube | TikTok | Instagram | Threads | LinkedIn

Data Analyst & Data Scientist
