Without its activation, a neuron is just a linear function, and stacking linear functions only ever gives you another linear function. The activation is the non-linear step that breaks that spell, and it's the single ingredient that lets deep networks learn curved, complicated patterns. A few are standard: sigmoid, tanh, and ReLU.
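
The collapse of stacked linear functions is easy to check by hand. Here is a minimal sketch in plain Python; the one-input neurons and their weights (`f1`, `f2`) are made up purely for illustration, not taken from any real network.

```python
def relu(x):
    return max(0.0, x)

def f1(x):
    # First hypothetical linear neuron: weight 2, bias 1.
    return 2.0 * x + 1.0

def f2(x):
    # Second hypothetical linear neuron: weight -3, bias 0.5.
    return -3.0 * x + 0.5

def stacked(x):
    # Two linear neurons with no activation in between.
    return f2(f1(x))

def collapsed(x):
    # Algebra: -3 * (2x + 1) + 0.5 = -6x - 2.5, a single linear neuron.
    return -6.0 * x - 2.5

def with_relu(x):
    # The same stack with a ReLU between the two neurons.
    return f2(relu(f1(x)))

def is_straight(f, xs=(-2.0, 0.0, 2.0)):
    # On a straight line, evenly spaced inputs give evenly spaced outputs.
    a, b, c = (f(x) for x in xs)
    return abs((b - a) - (c - b)) < 1e-9

print(is_straight(stacked), is_straight(with_relu))
```

With no activation the two layers are exactly the one-layer function `collapsed`; putting a ReLU between them bends the line, which is the whole point.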
Compare the three. Sigmoid and tanh saturate gently at both ends; ReLU just kinks at the
origin. That kink is enough: a network of ReLUs glues straight pieces together into a
piecewise-linear curve that can follow any continuous shape as closely as you like, and it
trains fast because its slope is a constant 1 for every positive input.
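
To see the "gluing straight pieces" idea in action, here is a rough sketch that builds a curve out of shifted ReLUs. The helper names (`relu_interpolant`, `max_error`) and the choice of `math.sin` as the target are assumptions for illustration; a trained network finds its own kinks by gradient descent rather than by this hand construction.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_interpolant(f, knots):
    """Sum of shifted ReLUs that matches f at every knot and is straight in between."""
    slopes = [(f(b) - f(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    # The first hinge sets the initial slope; each later hinge adds the change in slope.
    coeffs = [slopes[0]] + [s1 - s0 for s0, s1 in zip(slopes, slopes[1:])]
    base = f(knots[0])

    def g(x):
        # zip stops before the last knot, which needs no hinge of its own.
        return base + sum(c * relu(x - k) for c, k in zip(coeffs, knots))

    return g

def max_error(f, g, lo, hi, samples=1000):
    """Largest gap between f and g on an evenly spaced grid over [lo, hi]."""
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - g(x)) for x in xs)

for pieces in (4, 8, 16):
    knots = [math.pi * i / pieces for i in range(pieces + 1)]
    g = relu_interpolant(math.sin, knots)
    print(pieces, "pieces, max error:", max_error(math.sin, g, 0.0, math.pi))
```

Each extra hinge adds one more kink, and the reported error shrinks as the pieces get shorter.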
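Sigmoid and tanh flatten out for large positive or negative inputs, where their slope nearly
vanishes. A vanishing slope means a vanishing gradient: almost no learning signal passes back
through the neuron, so the weights feeding it barely update.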
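
The size of the problem is easy to put numbers on. The sketch below uses the standard derivative formulas for the three activations; `gradient_through` is a hypothetical helper that deliberately ignores the weights and assumes every layer sees the same input, so it only shows how the local slopes multiply under the chain rule.

```python
import math

def sigmoid_slope(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

def tanh_slope(x):
    return 1.0 - math.tanh(x) ** 2  # peaks at 1.0 when x = 0

def relu_slope(x):
    return 1.0 if x > 0 else 0.0  # exactly 1 for every positive input

def gradient_through(slope, x, depth):
    """Product of `depth` identical local slopes (weights ignored for simplicity)."""
    return slope(x) ** depth

for name, slope in [("sigmoid", sigmoid_slope), ("tanh", tanh_slope), ("relu", relu_slope)]:
    local = [slope(x) for x in (0.0, 2.0, 6.0)]
    print(name, "slopes:", local, "after 10 layers at x=2:", gradient_through(slope, 2.0, 10))
```

Even at its best point, sigmoid's slope is only 0.25, so ten such layers shrink the signal by roughly a factor of a million, while ReLU passes it through unchanged for positive inputs.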