Activation Functions

Without its activation, a neuron is just a linear function of its inputs (a weighted sum plus a bias), and stacking linear functions only ever gives you another linear function. The activation is the non-linear squash that breaks that spell, and it's the single ingredient that lets deep networks learn curved, complicated patterns. Three are standard: sigmoid, tanh, and ReLU.
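To see the collapse concretely, here is a minimal sketch using two one-input neurons. The weights and biases are arbitrary numbers picked for illustration, and the helper names (affine, stacked, collapsed) are made up for this example.

```python
def affine(w, b):
    """Return the function x -> w*x + b: a one-input neuron with no activation."""
    return lambda x: w * x + b


def relu(x):
    return max(0.0, x)


# Two activation-free neurons stacked on top of each other.
# (Weights and biases are arbitrary illustrative values.)
layer1 = affine(2.0, -1.0)
layer2 = affine(-3.0, 0.5)


def stacked(x):
    return layer2(layer1(x))


# The same mapping written as ONE neuron: w = w2*w1, b = w2*b1 + b2.
collapsed = affine(-3.0 * 2.0, -3.0 * -1.0 + 0.5)


def stacked_with_relu(x):
    # Same two layers, but with a ReLU in between.
    return layer2(relu(layer1(x)))


xs = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]

# Largest disagreement between the two-layer stack and the single neuron.
max_gap = max(abs(stacked(x) - collapsed(x)) for x in xs)
print("stack vs. single neuron, max gap:", max_gap)

for x in xs:
    print(x, stacked(x), stacked_with_relu(x))
```

Without the activation, the second layer buys nothing: one neuron reproduces the stack exactly. With the ReLU in between, the output is flat on one side of the kink and sloped on the other, which no single linear neuron can match.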

Meet the curves

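Compare the three. Sigmoid and tanh saturate gently at both ends: sigmoid levels off at 0 and 1, tanh at -1 and 1. ReLU, defined as max(0, x), is flat for negative inputs and simply kinks at the origin. That kink is enough: a network of ReLUs glues straight pieces together into a piecewise-linear function that can approximate any continuous curve as closely as you like, given enough units. It also trains fast, because its slope is a constant 1 wherever the input is positive.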
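The sketch below writes out the three functions and then illustrates the "gluing" idea by hand. It assumes a one-dimensional target (x squared on the interval from -2 to 2) and sets the ReLU weights directly from the target's slopes rather than training them; relu_interpolant is a hypothetical helper written for this example, not a library function.

```python
import math


def sigmoid(x):
    # Plain textbook form; not hardened against overflow for very negative x.
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def relu_interpolant(f, knots):
    """Approximate f by a weighted sum of ReLUs, one kink per knot.

    Between consecutive knots the result is a straight line through
    the values of f at those knots.
    """
    slopes = [(f(b) - f(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    # The first ReLU carries the first slope; each later ReLU adds only
    # the CHANGE in slope at its knot.
    weights = [slopes[0]] + [s1 - s0 for s0, s1 in zip(slopes, slopes[1:])]
    offset = f(knots[0])

    def approx(x):
        return offset + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return approx


# Saturation: sigmoid and tanh level off, ReLU keeps growing.
for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(x, sigmoid(x), tanh(x), relu(x))

# Glue 16 straight pieces into an approximation of x**2 on [-2, 2].
knots = [i * 0.25 for i in range(-8, 9)]
approx_square = relu_interpolant(lambda x: x * x, knots)

grid = [i * 0.01 for i in range(-200, 201)]
max_error = max(abs(approx_square(x) - x * x) for x in grid)
print("largest error on the grid:", max_error)
```

Adding more knots shrinks the error, which is the sense in which ReLU pieces can follow a curve as closely as you like. A trained network finds its kinks and slopes by gradient descent instead of having them computed from the target like this.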

Why ReLU took over

Sigmoid and tanh flatten out for large inputs, positive or negative, so their slope there is close to zero. Even at its steepest, sigmoid's slope is only 0.25. Backpropagation multiplies these slopes together layer by layer, so a run of small slopes leaves the early layers with almost no gradient to learn from, stalling deep networks (the "vanishing gradient" problem). ReLU's slope stays a healthy 1 for positive inputs, so gradients pass through those units undiminished and training is fast. ReLU and its variants, such as Leaky ReLU and GELU, power most modern networks.
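A toy calculation makes the difference visible. The sketch below assumes a deliberately stripped-down "network": a chain of 20 activations with every weight fixed at 1 and no biases, so the gradient is just the product of the slopes along the chain. Real networks also multiply by their weights, so treat the numbers as an illustration of the trend only; chain_gradient is a hypothetical helper written for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def sigmoid_slope(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never larger than 0.25 (reached at x = 0)


def tanh_slope(x):
    return 1.0 - math.tanh(x) ** 2


def relu_slope(x):
    return 1.0 if x > 0 else 0.0


def chain_gradient(act, slope, x, depth):
    """Derivative of act(act(...act(x)...)) with respect to x.

    Chain rule: multiply the slope at each layer's input.
    Assumes weight 1 and no bias at every layer (toy setting).
    """
    grad = 1.0
    for _ in range(depth):
        grad *= slope(x)
        x = act(x)
    return grad


# Slopes in the saturated region versus ReLU's constant 1.
print("slopes at x = 5:", sigmoid_slope(5.0), tanh_slope(5.0), relu_slope(5.0))

# Gradient reaching the input after 20 stacked activations.
depth = 20
sigmoid_grad = chain_gradient(sigmoid, sigmoid_slope, 1.0, depth)
relu_grad = chain_gradient(relu, relu_slope, 1.0, depth)
print("sigmoid chain:", sigmoid_grad)  # at most 0.25 ** 20
print("relu chain:   ", relu_grad)
```

Because each sigmoid slope is at most 0.25, twenty of them multiplied together cannot exceed 0.25 to the 20th power, while the ReLU chain with a positive input passes the gradient through unchanged.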