Deep Learning

You already know how to stack neurons into a network and train it with backpropagation. Deep learning is what happens when you take that idea seriously — and push it all the way to systems that write code, hold conversations, and translate between a hundred languages. This course climbs from a plain trained network to a modern large language model you can train and serve, one honest step at a time.

The destination is the Transformer and the recipe that makes today's frontier models work. Its beating heart is a single, almost suspiciously simple formula — attention:

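\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.

Here Q, K and V are the query, key and value matrices, and d_k is the dimension of each key vector.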

The maths you already hold

Nothing here needs new mathematics; it needs the maths you've built to come alive. An attention score is a dot product; a whole layer of attention is a few matrix multiplications; training is gradient descent, with the gradients computed by the chain rule; and the output is a softmax over the vocabulary. If you've done linear algebra, calculus and the machine-learning branch, you already hold every key.
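To make that concrete, here is a minimal sketch of the attention formula in plain Python, with no libraries beyond `math`. The helper names (`dot`, `softmax`, `attention`) and the toy numbers are illustrative assumptions, not part of the course material, and real implementations use batched matrix operations rather than Python loops.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    """Softmax of a list of numbers, shifted by the max for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Q: n query vectors of length d_k
    K: m key vectors of length d_k
    V: m value vectors of length d_v
    Returns (outputs, weights): n output vectors of length d_v,
    and the n-by-m attention weights.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    outputs, all_weights = [], []
    for q in Q:
        # One score per key: a dot product, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # Each output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs (made-up numbers): 2 queries, 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the first query puts equal weight on the first and third keys because its dot product with each of them is the same.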

The shape of the journey

Eight stages, from a trained network all the way to a frontier model in production.

Stage 1 — Training deep networks

Basic gradient descent stalls on deep networks. This stage is the modern toolkit that makes depth trainable: smarter steps, steadier signals, and shortcuts for the gradient.

  1. Mini-Batch SGD
  2. Momentum
  3. Adam & AdamW
  4. Learning-Rate Schedules
  5. Weight Initialization
  6. Batch Normalization
  7. Layer Normalization
  8. Dropout
  9. Vanishing & Exploding Gradients
  10. Residual Connections
  11. Automatic Differentiation
  12. The Softmax Function

Stage 2 — Sequence models

Language is a sequence. In this stage you learn to feed sequences through a network, and run headlong into the long-range memory problem that attention was invented to solve.

  1. Sequence Data
  2. Word Embeddings
  3. Recurrent Neural Networks
  4. Backpropagation Through Time
  5. The Long-Range Problem
  6. LSTMs and GRUs
  7. Sequence to Sequence
  8. The Attention Mechanism

Stage 3 — The Transformer

The architecture that changed everything. Drop recurrence entirely and let every token look at every other token at once, through attention.

  1. Subword Tokenization
  2. Positional Encoding
  3. Self-Attention
  4. Scaled Dot-Product Attention
  5. Multi-Head Attention
  6. The Feed-Forward Block
  7. Add & Norm
  8. The Transformer Block
  9. Causal Masking
  10. Cross-Attention
  11. The Full Transformer
  12. Encoder vs Decoder Models

Stage 4 — Language models

Point a Transformer at an ocean of text and ask it to predict the next token. That single, humble objective is where large language models come from.

  1. Language Modeling
  2. Perplexity
  3. Self-Supervised Pretraining
  4. Masked vs Causal LM
  5. The GPT Family
  6. Scaling Laws
  7. In-Context Learning
  8. Prompting

Stage 5 — Training at scale

A frontier model doesn't fit on one GPU — not even close. This stage is the systems engineering that spreads a single training run across thousands of them.

  1. The Training Recipe
  2. Mixed-Precision Training
  3. Gradient Accumulation
  4. Gradient Checkpointing
  5. Data Parallelism
  6. Tensor Parallelism
  7. Pipeline Parallelism
  8. ZeRO & FSDP
  9. FlashAttention
  10. Distributed Training Overview

Stage 6 — Fine-tuning & alignment

A freshly pretrained model is a brilliant autocomplete, not an assistant. This stage adapts it cheaply and teaches it to be helpful, honest and harmless.

  1. Transfer Learning
  2. Parameter-Efficient Fine-Tuning
  3. LoRA
  4. QLoRA
  5. Instruction Tuning
  6. RLHF
  7. Direct Preference Optimization

Stage 7 — Inference

Training is a one-time cost; serving the model happens billions of times. This stage is the art of generating tokens fast, cheap and at scale.

  1. Autoregressive Decoding
  2. Sampling Strategies
  3. Beam Search
  4. The KV Cache
  5. Quantization for Inference
  6. Speculative Decoding
  7. Paged Attention & Batching
  8. Throughput vs Latency

Stage 8 — Modern architectures

The original 2017 Transformer has been quietly upgraded. These are the components inside a 2020s frontier model — and how they fit together.

  1. Rotary Position Embeddings
  2. RMSNorm
  3. SwiGLU
  4. Grouped-Query Attention
  5. Mixture of Experts
  6. Long Context
  7. The Modern LLM Stack

Let's get started

We begin where ordinary training breaks down: a deep network whose gradient descent has ground to a halt, and the first fix that gets it moving again, mini-batch SGD.
