Activation Functions

Without its activation, a neuron is just a linear function — and stacking linear functions only ever gives you another linear function. The activation is the non-linear squash that breaks that spell, and it's the single ingredient that lets deep networks learn curved, complicated patterns. A few are standard:

Sigmoid: squashes to (0, 1); handy for outputting probabilities, but slow to train in deep networks.
Tanh: an S-curve squashing to (-1, 1); centred at zero.
ReLU: max(0, z); passes positives through and zeroes out negatives. Brutally simple, and the workhorse of modern deep learning.
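
As a rough illustration of the list above and of the earlier point about stacked linear functions, here is a minimal sketch in plain Python. The weights W1, B1, W2, B2 and the helper names are made up for this example; they stand in for two one-input "layers" and are not taken from any particular library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-z); mathematically its output lies in (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z: float) -> float:
    """Hyperbolic tangent; mathematically its output lies in (-1, 1), with tanh(0) = 0."""
    return math.tanh(z)


def relu(z: float) -> float:
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


# Hypothetical weights and biases for two one-input layers (illustrative values only).
W1, B1 = 2.0, 1.0
W2, B2 = -3.0, 0.5


def stacked_linear(x: float) -> float:
    """Two linear layers applied one after the other, with no activation between them."""
    return W2 * (W1 * x + B1) + B2


def collapsed_linear(x: float) -> float:
    """A single linear layer that computes exactly the same function as stacked_linear."""
    return (W2 * W1) * x + (W2 * B1 + B2)


def stacked_relu(x: float) -> float:
    """The same two layers with a ReLU in between; no single linear layer matches this."""
    return W2 * relu(W1 * x + B1) + B2
```

For any input, stacked_linear and collapsed_linear agree, which is the "another linear function" collapse in miniature. With the ReLU in place the two layers no longer reduce to one line: stacked_relu is flat wherever W1 * x + B1 is negative and sloped elsewhere.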

Meet the curves

Compare the three shapes. Sigmoid and tanh saturate gently at both ends; ReLU simply kinks at the origin. That kink is enough: a network of ReLUs glues straight pieces together and, given enough units, can approximate any continuous curve on a bounded range as closely as you like. It also trains fast, because its slope is a constant 1 wherever the input is positive.
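
To make the "gluing straight pieces" idea concrete, here is a small sketch that builds a piecewise-linear approximation of a curve from a weighted sum of shifted ReLUs, which is what a one-hidden-layer ReLU network computes. The function names fit_relu_sum and relu_sum, the target curve x squared, and the choice of nine evenly spaced knots are all assumptions made for this example. The coefficients are set by hand from the curve's values rather than learned by training.

```python
def relu(z: float) -> float:
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def fit_relu_sum(g, knots):
    """Return (offset, coeffs) such that
    offset + sum(c_k * relu(x - t_k)) equals the straight-line interpolation
    of g between consecutive knots, for x between the first and last knot."""
    # Slope of each straight segment between neighbouring knots.
    slopes = [(g(b) - g(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    coeffs = []
    previous_slope = 0.0
    for slope in slopes:
        # Each ReLU switches on at its knot and adds the change in slope there.
        coeffs.append(slope - previous_slope)
        previous_slope = slope
    return g(knots[0]), coeffs


def relu_sum(x, offset, coeffs, knots):
    """Evaluate the sum of shifted ReLUs; one ReLU per segment, kinked at that segment's left knot."""
    return offset + sum(c * relu(x - t) for c, t in zip(coeffs, knots))


def square(x: float) -> float:
    """Example target curve (an arbitrary choice for illustration)."""
    return x * x


# Nine evenly spaced knots on [0, 1] give eight straight pieces.
knots = [i / 8 for i in range(9)]
offset, coeffs = fit_relu_sum(square, knots)
approx_at_half = relu_sum(0.5, offset, coeffs, knots)
```

Adding more knots shortens each straight piece and shrinks the gap between the glued-together line segments and the curve.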

Why ReLU took over

Sigmoid and tanh flatten out for inputs of large magnitude, so their slope there is close to zero. Backpropagation multiplies these slopes together layer by layer, so a run of small slopes leaves the early layers with almost no gradient to learn from, and deep networks stall (the "vanishing gradient" problem). ReLU's slope stays at 1 for positive inputs, so gradients pass through those units undiminished and training is fast. ReLU and variants such as Leaky ReLU and GELU power almost every modern network.
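
The slopes in question can be computed directly. The sketch below is a toy model, not a real backpropagation pass: chained_slope (a name invented here) multiplies the same activation slope once per layer and ignores the weights entirely, so it only shows how the activation's own derivative scales a gradient with depth.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-z), computed without overflowing math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_slope(z: float) -> float:
    """Derivative of sigmoid: s * (1 - s); its largest value is 0.25, at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_slope(z: float) -> float:
    """Derivative of tanh: 1 - tanh(z)^2; its largest value is 1, at z = 0."""
    t = math.tanh(z)
    return 1.0 - t * t


def relu_slope(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 chosen at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


def chained_slope(slope_fn, z: float, depth: int) -> float:
    """Toy stand-in for the chain rule: multiply the same slope `depth` times, ignoring weights."""
    product = 1.0
    for _ in range(depth):
        product *= slope_fn(z)
    return product


# Even in sigmoid's best case (z = 0, slope 0.25), ten layers shrink the factor a lot.
sigmoid_ten_layers = chained_slope(sigmoid_slope, 0.0, 10)
# For a positive input, ReLU contributes a factor of exactly 1 per layer.
relu_ten_layers = chained_slope(relu_slope, 3.0, 10)
```

In this toy setting, the sigmoid factor after ten layers is 0.25 to the tenth power, just under one in a million, while the ReLU factor is still 1. Note that relu_slope is 0 for negative inputs, where a ReLU unit passes no gradient at all.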