Deep Learning
You already know how to stack neurons into a network and train it with
backpropagation.
Deep learning is what happens when you take that idea seriously — and push it
all the way to systems that write code, hold conversations, and translate between a hundred
languages. This course climbs from a plain trained network to a modern
large language model you can train and serve, one honest step at a time.
The destination is the Transformer and the recipe that makes today's frontier
models work. Its beating heart is a single, almost suspiciously simple formula —
attention:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$
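To see how little machinery the formula hides, here is a minimal sketch in plain Python, with matrices written as lists of rows. The function names and the toy Q, K and V are made up for illustration, and a real implementation would use a tensor library rather than loops.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: the scaled dot product of this query with that key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # non-negative and sums to 1
        # The output row is the weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy input (illustrative values only): 2 queries, 3 keys/values, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
result = attention(Q, K, V)
```

Each output row is a weighted average of the rows of V, with the weights set by how well the query matches each key. That is the whole mechanism.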
The maths you already hold
Nothing here needs new mathematics; it needs the maths you've built to come alive. An
attention score is a dot product; a whole layer of attention is a matrix multiplication;
training is rolling downhill on the gradient via the chain rule; and the output is a
softmax over the vocabulary. If you've done linear algebra, calculus and the
machine-learning branch, you already hold every key.
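As a small illustration of how those pieces fit together, the sketch below takes one gradient-descent step on a softmax output. The four-token vocabulary, the logits and the learning rate are all invented for the demo; the only fact it relies on is that, for softmax followed by cross-entropy, the chain rule gives a gradient of "probabilities minus one-hot target" with respect to the logits.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability the softmax assigns to the target index."""
    return -math.log(softmax(logits)[target])

def grad_logits(logits, target):
    """Chain-rule result for softmax + cross-entropy: probs minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Hypothetical 4-token vocabulary; the correct next token is index 2.
logits = [0.5, -1.0, 0.2, 1.5]
target = 2
learning_rate = 0.5  # arbitrary value chosen for the demo

loss_before = cross_entropy(logits, target)
g = grad_logits(logits, target)
# Roll downhill: move each logit against its gradient.
logits_after = [z - learning_rate * gi for z, gi in zip(logits, g)]
loss_after = cross_entropy(logits_after, target)
```

After the step, `loss_after` is smaller than `loss_before`: the model now puts more probability on the correct token.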
The shape of the journey
Eight stages, from a trained network all the way to a frontier model in production.
- Stage 1 — Training deep networks. The engine that makes depth actually
train: better optimizers, normalization, initialization and residual connections.
- Stage 2 — Sequence models. Embeddings, recurrent networks and the
long-range bottleneck that gave birth to attention.
- Stage 3 — The Transformer. Attention, multi-head attention, and the block
that the whole modern field is built on.
- Stage 4 — Language models. Next-token prediction, pretraining, scaling
laws, and where the "intelligence" comes from.
- Stage 5 — Training at scale. Mixed precision, parallelism, sharding and
FlashAttention — how you train a model too big for one GPU.
- Stage 6 — Fine-tuning & alignment. LoRA, instruction tuning and RLHF /
DPO — turning a raw model into a helpful assistant.
- Stage 7 — Inference. The KV cache, sampling, quantization and speculative
decoding — serving a model fast and cheap.
- Stage 8 — Modern architectures. RoPE, RMSNorm, SwiGLU, grouped-query
attention and mixture-of-experts: the parts inside a 2020s frontier model.
Stage 1 — Training deep networks
Plain gradient descent often stalls on deep networks, because gradients can vanish or explode
on their way back through many layers. This stage is the modern toolkit that makes
depth trainable: smarter steps, steadier signals, and shortcuts for the gradient.
- Mini-Batch SGD
- Momentum
- Adam & AdamW
- Learning-Rate Schedules
- Weight Initialization
- Batch Normalization
- Layer Normalization
- Dropout
- Vanishing & Exploding Gradients
- Residual Connections
- Automatic Differentiation
- The Softmax Function
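To get a feel for why depth is hard and why a shortcut helps, here is a deliberately tiny sketch: a stack of scalar sigmoid layers with no learned weights, which is an assumption made purely to keep the arithmetic visible. It multiplies the local derivatives along the stack, once for plain layers and once for residual layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_through_stack(x, depth, residual):
    """Return d(output)/d(input) for `depth` stacked scalar sigmoid layers.

    Plain layer:    h -> sigmoid(h),      local derivative s * (1 - s) <= 0.25
    Residual layer: h -> h + sigmoid(h),  local derivative 1 + s * (1 - s) >= 1
    """
    grad = 1.0
    h = x
    for _ in range(depth):
        s = sigmoid(h)
        local = s * (1.0 - s)
        if residual:
            grad *= 1.0 + local  # the skip path contributes the extra 1
            h = h + s
        else:
            grad *= local
            h = s
    return grad

plain = gradient_through_stack(0.0, depth=20, residual=False)
skip = gradient_through_stack(0.0, depth=20, residual=True)
```

In the plain stack every layer multiplies the gradient by at most 0.25, so after 20 layers it has all but disappeared. In the residual stack every factor is at least 1, so the signal survives.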
Stage 2 — Sequence models
Language is a sequence. This stage learns to feed sequences through a network — and
runs headlong into the long-range memory problem that attention was invented to solve.
- Sequence Data
- Word Embeddings
- Recurrent Neural Networks
- Backpropagation Through Time
- The Long-Range Problem
- LSTMs and GRUs
- Sequence to Sequence
- The Attention Mechanism
Stage 3 — The Transformer
The architecture that changed everything. Drop recurrence entirely and let every token look at
every other token at once, through attention.
- Subword Tokenization
- Positional Encoding
- Self-Attention
- Scaled Dot-Product Attention
- Multi-Head Attention
- The Feed-Forward Block
- Add & Norm
- The Transformer Block
- Causal Masking
- Cross-Attention
- The Full Transformer
- Encoder vs Decoder Models
Stage 4 — Language models
Point a Transformer at an ocean of text and ask it to predict the next token. That single,
humble objective is where large language models come from.
- Language Modeling
- Perplexity
- Self-Supervised Pretraining
- Masked vs Causal LM
- The GPT Family
- Scaling Laws
- In-Context Learning
- Prompting
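Next-token prediction is scored by how much probability the model put on the token that actually came next. As a rough sketch, perplexity is the exponential of the average negative log-probability; the probabilities below are invented, not taken from any real model.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that guesses uniformly among four options at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is fairly sure of each observed token (made-up numbers).
confident = perplexity([0.9, 0.8, 0.95, 0.7])
```

The uniform guesser has a perplexity of 4, as if it were choosing among four equally likely tokens each time. The confident model scores much closer to 1, the best possible value.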
Stage 5 — Training at scale
A frontier model doesn't fit on one GPU — not even close. This stage is the systems
engineering that spreads a single training run across thousands of them.
- The Training Recipe
- Mixed-Precision Training
- Gradient Accumulation
- Gradient Checkpointing
- Data Parallelism
- Tensor Parallelism
- Pipeline Parallelism
- ZeRO & FSDP
- FlashAttention
- Distributed Training Overview
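One idea from this list is easy to check by hand: gradient accumulation. The sketch below assumes a one-weight linear model with a mean-squared-error loss and made-up data, and shows that summing suitably weighted micro-batch gradients reproduces the full-batch gradient.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each weighted by its share of the full batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        total += mse_gradient(w, mx, my) * len(mx) / n
    return total

# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 14.2, 15.9]
w = 0.5

full = mse_gradient(w, xs, ys)
accumulated = accumulated_gradient(w, xs, ys, micro_batch_size=2)
```

The two values agree up to floating-point rounding, which is why a large batch can be processed as several small ones when memory is tight.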
Stage 6 — Fine-tuning & alignment
A freshly pretrained model is a brilliant autocomplete, not an assistant. This stage adapts it
cheaply and teaches it to be helpful, honest and harmless.
- Transfer Learning
- Parameter-Efficient Fine-Tuning
- LoRA
- QLoRA
- Instruction Tuning
- RLHF
- Direct Preference Optimization
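To see where the "cheaply" comes from, here is a minimal sketch of the LoRA idea: keep the pretrained weight W frozen and train only two thin matrices A and B whose product is added to it. The layer size, rank and alpha are illustrative choices, and the helper names are hypothetical.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full weight matrix vs. the two low-rank factors."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)

# Tiny made-up example: 2x2 frozen weight, rank-1 adapter with B starting at zero.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]      # rank x d_in
B = [[0.0], [0.0]]     # d_out x rank
y = lora_forward(W, A, B, [1.0, 1.0], alpha=2.0, rank=1)
```

For a 4096 x 4096 layer at rank 8, that is 65,536 trainable parameters instead of 16,777,216, a factor of 256 fewer. Because B starts at zero, the adapted layer initially behaves exactly like the frozen one.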
Stage 7 — Inference
Training is paid for once per model; serving is paid for on every request, potentially
billions of times. This stage is
the art of generating tokens fast, cheap and at scale.
- Autoregressive Decoding
- Sampling Strategies
- Beam Search
- The KV Cache
- Quantization for Inference
- Speculative Decoding
- Paged Attention & Batching
- Throughput vs Latency
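As a taste of the sampling strategies in this stage, the sketch below turns a vector of logits into a next-token distribution using temperature and top-k filtering, then draws one token. The logits, the function names and the parameter values are invented for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Apply temperature, optionally keep only the top_k logits, then softmax."""
    scaled = [z / temperature for z in logits]
    if top_k is not None:
        # Everything below the k-th largest logit is masked out.
        # (Ties at the cutoff are all kept in this simple version.)
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [z if z >= cutoff else float("-inf") for z in scaled]
    return softmax(scaled)

def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = next_token_distribution(logits, temperature=0.8, top_k=2)
token = sample(probs, random.Random(0))  # seeded generator for repeatability
```

Lower temperatures sharpen the distribution towards the most likely token, while top-k removes the unlikely tail altogether, so here only tokens 0 and 1 can ever be drawn.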
Stage 8 — Modern architectures
The original 2017 Transformer has been quietly upgraded. These are the components inside a
2020s frontier model — and how they fit together.
- Rotary Position Embeddings
- RMSNorm
- SwiGLU
- Grouped-Query Attention
- Mixture of Experts
- Long Context
- The Modern LLM Stack
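Several of these upgrades are smaller than their names suggest. RMSNorm, for instance, can be sketched in a few lines: divide the vector by its root-mean-square and multiply by a learned gain. The input values and the epsilon below are illustrative assumptions.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]

x = [3.0, -4.0, 0.0, 5.0]
gain = [1.0, 1.0, 1.0, 1.0]  # a freshly initialised gain of all ones
normed = rms_norm(x, gain)
```

Unlike layer normalization, there is no mean subtraction and no bias term: only the scale of the vector is normalized.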
Let's get started
We begin where ordinary training breaks down: a deep network whose gradient descent has
ground to a halt — and the first fix that gets it moving again.
Let's get started → Mini-Batch SGD