Deep Learning

You already know how to stack neurons into a network and train it with backpropagation. Deep learning is what happens when you take that idea seriously — and push it all the way to systems that write code, hold conversations, and translate between a hundred languages. This course climbs from a plain trained network to a modern large language model you can train and serve, one honest step at a time.

The destination is the Transformer and the recipe that makes today's frontier models work. Its beating heart is a single, almost suspiciously simple formula — attention:

\[ \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V. \]
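
To make the formula concrete, here is a minimal pure-Python sketch that evaluates it on tiny hand-written matrices. The helper names (`matmul`, `transpose`, `softmax_rows`, `attention`) and the toy values of `Q`, `K` and `V` are assumptions made for this illustration; a real model would use a tensor library and learned projections.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Softmax over each row, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V and also return the attention weights."""
    d_k = len(k[0])                                  # key dimension
    scores = matmul(q, transpose(k))                 # one dot product per (query, key) pair
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)                   # each row sums to 1
    return matmul(weights, v), weights

# Toy input (arbitrary values): 2 queries, 3 keys/values, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` says how much one query attends to each key, and each row of `output` is the corresponding weighted average of the rows of `V`.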

The maths you already hold

Nothing here needs new mathematics; it needs the maths you've built to come alive. An attention score is a dot product; a whole layer of attention is a matrix multiplication; training is gradient descent, stepping downhill along gradients that the chain rule computes; and the output is a softmax over the vocabulary. If you've done linear algebra, calculus and the machine-learning branch, you already hold every key.
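
As a small reminder of what "downhill via the chain rule" looks like in practice, here is a sketch that fits a one-parameter model by gradient descent. The toy data, the model `y_hat = w * x`, the learning rate and the function name `loss_and_grad` are all assumptions chosen for this example.

```python
# Toy data generated by y = 3 * x (an assumption for this example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss_and_grad(w):
    """Mean squared error of the model y_hat = w * x, together with dL/dw."""
    n = len(xs)
    loss = 0.0
    grad = 0.0
    for x, y in zip(xs, ys):
        y_hat = w * x                  # forward pass
        err = y_hat - y
        loss += err ** 2 / n
        # Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = (2 * err / n) * x
        grad += 2.0 * err / n * x
    return loss, grad

w = 0.0
learning_rate = 0.05
for _ in range(100):
    loss, grad = loss_and_grad(w)
    w -= learning_rate * grad          # step against the gradient

final_loss, _ = loss_and_grad(w)
```

After the loop, `w` sits very close to 3, the slope that generated the data. Backpropagation in a deep network is the same chain-rule bookkeeping repeated layer by layer.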

The shape of the journey

Eight stages, from a trained network all the way to a frontier model in production.

Stage 1 — Training deep networks. The engine that makes depth actually train: better optimizers, normalization, initialization and residual connections.
Stage 2 — Sequence models. Embeddings, recurrent networks and the long-range bottleneck that gave birth to attention.
Stage 3 — The Transformer. Attention, multi-head attention, and the block that the whole modern field is built on.
Stage 4 — Language models. Next-token prediction, pretraining, scaling laws, and where the "intelligence" comes from.
Stage 5 — Training at scale. Mixed precision, parallelism, sharding and FlashAttention — how you train a model too big for one GPU.
Stage 6 — Fine-tuning & alignment. LoRA, instruction tuning and RLHF / DPO — turning a raw model into a helpful assistant.
Stage 7 — Inference. The KV cache, sampling, quantization and speculative decoding — serving a model fast and cheap.
Stage 8 — Modern architectures. RoPE, RMSNorm, SwiGLU, grouped-query attention and mixture-of-experts: the parts inside a 2020s frontier model.

Stage 1 — Training deep networks

Basic gradient descent stalls on deep networks. This stage is the modern toolkit that makes depth trainable: smarter steps, steadier signals, and shortcuts for the gradient.

Mini-Batch SGD
Momentum
Adam & AdamW
Learning-Rate Schedules
Weight Initialization
Batch Normalization
Layer Normalization
Dropout
Vanishing & Exploding Gradients
Residual Connections
Automatic Differentiation
The Softmax Function
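
To give a feel for the "shortcuts for the gradient" idea, here is a toy sketch comparing how much gradient survives a stack of scalar tanh layers with and without a skip connection. The function name `gradient_through_stack`, the depth of 30 and the weight of 0.5 are assumptions picked for illustration, not a model of any real network.

```python
import math

def gradient_through_stack(depth, weight, residual, h0=1.0):
    """Return d(h_depth)/d(h_0) for a scalar stack of tanh layers."""
    h = h0
    grad = 1.0
    for _ in range(depth):
        t = math.tanh(weight * h)
        local = weight * (1.0 - t * t)   # derivative of tanh(weight * h) with respect to h
        if residual:
            h = h + t                    # skip connection: h_next = h + f(h)
            grad *= 1.0 + local          # the "+ 1" is the shortcut for the gradient
        else:
            h = t                        # plain layer: h_next = f(h)
            grad *= local
    return grad

plain_grad = gradient_through_stack(depth=30, weight=0.5, residual=False)
residual_grad = gradient_through_stack(depth=30, weight=0.5, residual=True)
```

In the plain stack every layer multiplies the gradient by at most 0.5, so after 30 layers almost nothing is left. In the residual stack every factor is at least 1, so the signal reaches the first layer intact.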

Stage 2 — Sequence models

Language is a sequence. This stage learns to feed sequences through a network — and runs headlong into the long-range memory problem that attention was invented to solve.

Sequence Data
Word Embeddings
Recurrent Neural Networks
Backpropagation Through Time
The Long-Range Problem
LSTMs and GRUs
Sequence to Sequence
The Attention Mechanism

Stage 3 — The Transformer

The architecture that changed everything. Drop recurrence entirely and let every token look at every other token at once, through attention.

Subword Tokenization
Positional Encoding
Self-Attention
Scaled Dot-Product Attention
Multi-Head Attention
The Feed-Forward Block
Add & Norm
The Transformer Block
Causal Masking
Cross-Attention
The Full Transformer
Encoder vs Decoder Models
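
One item in this list, causal masking, is easy to preview in a few lines. The sketch below assumes a small matrix of raw attention scores is already available; the function name `causal_attention_weights` and the score values are made up for the example.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax in which position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions are set to -inf, so they receive exactly zero weight.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Arbitrary scores for a 3-token sequence (row = query position, column = key position).
scores = [
    [0.2, 1.5, -0.3],
    [0.7, 0.1, 2.0],
    [1.0, 1.0, 1.0],
]
weights = causal_attention_weights(scores)
```

The first token can only look at itself, the second at the first two, and so on. This is what lets a decoder be trained on every position of a sequence at once without peeking at the answer.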

Stage 4 — Language models

Point a Transformer at an ocean of text and ask it to predict the next token. That single, humble objective is where large language models come from.

Language Modeling
Perplexity
Self-Supervised Pretraining
Masked vs Causal LM
The GPT Family
Scaling Laws
In-Context Learning
Prompting
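
The next-token objective and perplexity are closely linked, and a short sketch can show how. It assumes we already have, for each position, the probability the model assigned to the token that actually came next; the function names and the probability values are hypothetical.

```python
import math

def next_token_loss(probs_of_correct_token):
    """Average negative log-likelihood (in nats) of the true next tokens."""
    total = sum(math.log(p) for p in probs_of_correct_token)
    return -total / len(probs_of_correct_token)

def perplexity(probs_of_correct_token):
    """Perplexity is the exponential of the average negative log-likelihood."""
    return math.exp(next_token_loss(probs_of_correct_token))

# A model that guesses uniformly over a 4-token vocabulary.
uniform_guesser = [0.25, 0.25, 0.25, 0.25]
# A model that usually puts high probability on the right token.
better_model = [0.9, 0.8, 0.7, 0.6]

ppl_uniform = perplexity(uniform_guesser)
ppl_better = perplexity(better_model)
```

The uniform guesser has a perplexity of about 4, matching its four equally likely choices, while the better model scores lower. Pretraining is the process of pushing this number down across a very large corpus.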

Stage 5 — Training at scale

A frontier model doesn't fit on one GPU — not even close. This stage is the systems engineering that spreads a single training run across thousands of them.

The Training Recipe
Mixed-Precision Training
Gradient Accumulation
Gradient Checkpointing
Data Parallelism
Tensor Parallelism
Pipeline Parallelism
ZeRO & FSDP
FlashAttention
Distributed Training Overview
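
Gradient accumulation is one of the simpler ideas in this list, and a toy sketch shows why it works. It assumes equal-sized micro-batches and a mean-reduced loss; the one-parameter model, the data and the function name `mean_grad` are invented for the example.

```python
def mean_grad(w, batch):
    """Gradient of the mean squared error for y_hat = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# What we would compute if the whole batch fit in memory at once.
full_batch_grad = mean_grad(w, data)

# Process two micro-batches one after the other and accumulate their gradients.
micro_batches = [data[0:2], data[2:4]]
accumulated_grad = 0.0
for micro in micro_batches:
    # Divide by the number of micro-batches so the sum equals the full-batch mean.
    accumulated_grad += mean_grad(w, micro) / len(micro_batches)
```

The accumulated value matches the full-batch gradient up to floating-point rounding, so the optimizer step is the same while peak memory is set by the micro-batch size.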

Stage 6 — Fine-tuning & alignment

A freshly pretrained model is a brilliant autocomplete, not an assistant. This stage adapts it cheaply and teaches it to be helpful, honest and harmless.

Transfer Learning
Parameter-Efficient Fine-Tuning
LoRA
QLoRA
Instruction Tuning
RLHF
Direct Preference Optimization
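
To preview why LoRA counts as cheap adaptation, here is a minimal sketch of its low-rank update, W + (alpha / r) B A, with the pretrained weight frozen and only the small matrices A and B trained. The function names, the tiny matrices and the 4096 x 4096 layer size are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, b, a, alpha):
    """Effective weight W + (alpha / r) * B A, where r is the adapter rank."""
    r = len(a)                       # A has shape r x d_in, B has shape d_out x r
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def param_counts(d_out, d_in, r):
    """Trainable parameters for full fine-tuning versus a rank-r adapter."""
    return d_out * d_in, r * (d_in + d_out)

W = [[0.1 * (i + j) for j in range(3)] for i in range(2)]  # frozen weight, 2 x 3
A = [[0.01, -0.02, 0.03]]                                   # trainable, r x d_in with r = 1
B = [[0.0], [0.0]]                                          # trainable, d_out x r, zero at start

W_start = lora_weight(W, B, A, alpha=2.0)        # equals W because B is all zeros
B_trained = [[1.0], [-1.0]]                      # hypothetical values after some training
W_adapted = lora_weight(W, B_trained, A, alpha=2.0)

full_params, lora_params = param_counts(4096, 4096, 8)
```

For the assumed 4096 x 4096 layer at rank 8, the adapter trains 65,536 numbers instead of 16,777,216, which is where the savings come from.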

Stage 7 — Inference

Training is a one-time cost; serving the model happens billions of times. This stage is the art of generating tokens fast, cheap and at scale.

Autoregressive Decoding
Sampling Strategies
Beam Search
The KV Cache
Quantization for Inference
Speculative Decoding
Paged Attention & Batching
Throughput vs Latency
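
As a taste of the sampling strategies in this list, here is a small sketch of temperature scaling combined with top-k filtering. The function name `sample_next_token`, its parameters and the logits are assumptions for the example; production decoders work on batched tensors and usually offer more options.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index after temperature scaling and optional top-k filtering.

    temperature must be positive; lower values make the choice more greedy.
    """
    scaled = [value / temperature for value in logits]
    # Candidate indices, best first.
    indices = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]          # keep only the k most likely tokens
    top = scaled[indices[0]]
    # Unnormalised softmax weights; random.choices normalises them internally.
    weights = [math.exp(scaled[i] - top) for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

logits = [2.0, 0.5, -1.0, 3.0]             # arbitrary scores for a 4-token vocabulary
rng = random.Random(0)                      # seeded generator for repeatable runs
token = sample_next_token(logits, temperature=0.8, top_k=2, rng=rng)
```

With `top_k=1` this reduces to greedy decoding, and raising the temperature spreads probability more evenly across the surviving candidates.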

Stage 8 — Modern architectures

The original 2017 Transformer has been quietly upgraded. These are the components inside a 2020s frontier model — and how they fit together.

Rotary Position Embeddings
RMSNorm
SwiGLU
Grouped-Query Attention
Mixture of Experts
Long Context
The Modern LLM Stack
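
RMSNorm is the smallest of these components, so it makes a quick preview. The sketch below assumes a single vector and a per-feature gain; the function name `rms_norm`, the `eps` value and the inputs are chosen for the example.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root-mean-square, then apply a per-feature gain."""
    mean_square = sum(v * v for v in x) / len(x)
    rms = math.sqrt(mean_square + eps)     # eps guards against division by zero
    return [g * v / rms for v, g in zip(x, gain)]

x = [3.0, 4.0]
gain = [1.0, 1.0]                          # the gain is learned; ones are used here for clarity

normed = rms_norm(x, gain)
normed_scaled_input = rms_norm([10.0 * v for v in x], gain)
```

Unlike layer normalization, nothing is subtracted: the vector is only rescaled, so multiplying the input by 10 leaves the output essentially unchanged.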

Let's get started

We begin where ordinary training breaks down: a deep network whose gradient descent has ground to a halt — and the first fix that gets it moving again.

Next: Mini-Batch SGD