This article is about activation functions, and in particular about why we have to choose them carefully when dealing with deep neural networks. So what are deep neural networks? Basically, neural networks with several hidden layers.

Why do we need activation functions at all? Each neuron computes a linear combination of its weighted inputs, and a stack of purely linear layers is still just one linear function. Applying a non-linear activation function to that weighted sum introduces nonlinearity into the network’s modeling capabilities, which is what allows it to learn a non-linear decision boundary. There are several common activation functions, so let’s consider them one by one:

**Linear Activation Function**

The linear activation function is simply the identity function **f(x)=x**, so it does not change the signal at all. The input layer is usually described as having a linear activation function, because it passes the input values through unchanged 🙂
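To illustrate why a network cannot rely on linear activations alone, here is a minimal sketch with one neuron per layer. The function names and the weights are made up for this example. It suggests that two layers with linear activations behave exactly like a single linear layer.

```python
def linear(x):
    """Linear (identity) activation: returns the input unchanged."""
    return x


def two_linear_layers(x, w1, b1, w2, b2):
    """Two single-neuron layers, both using the linear activation."""
    hidden = linear(w1 * x + b1)
    return linear(w2 * hidden + b2)


def equivalent_single_layer(x, w1, b1, w2, b2):
    """One linear layer with combined weight w2*w1 and bias w2*b1 + b2."""
    return (w2 * w1) * x + (w2 * b1 + b2)


# Hypothetical weights, chosen only for illustration.
w1, b1, w2, b2 = 2.0, 1.0, 3.0, -4.0
for x in (-1.0, 0.0, 2.5):
    print(x, two_linear_layers(x, w1, b1, w2, b2),
          equivalent_single_layer(x, w1, b1, w2, b2))
```

Both columns print the same values, so the extra layer adds no modeling power as long as the activation is linear.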

**Sigmoid Activation Function**

The logistic (sigmoid) transformation **f(x)=1/(1+e^(-x))** is popular for two reasons. First, it squashes extreme values and outliers in the data without removing them. Second, it maps every input, however large or small, into the range **(0,1)**, so the output can be interpreted as a probability.
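A minimal sketch of the sigmoid in plain Python, assuming scalar inputs. The split into two branches is a common trick to avoid overflow in `math.exp` for large negative inputs; it is an implementation detail of this example, not part of the definition.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this form cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)


# Extreme inputs are squashed towards 0 or 1 instead of being discarded.
for x in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(x, sigmoid(x))
```

Mathematically the output never reaches 0 or 1, but in floating point arithmetic very large inputs round to exactly those values.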

**Tanh Activation Function**

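It is very similar to the sigmoid activation function, but its output lies in the range **(-1,1)** and is centered at zero, so negative inputs produce negative outputs instead of being squeezed towards zero.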
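The similarity to the sigmoid can be made concrete: tanh is a rescaled and shifted sigmoid, tanh(x) = 2·sigmoid(2x) − 1. The sketch below checks this numerically with the standard library's `math.tanh`; the helper names are chosen just for this example.

```python
import math


def sigmoid(x):
    """Logistic function, valid for moderate scalar inputs."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# Negative inputs stay negative and the output is centered at zero.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x))
```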

**Rectified Linear Unit (ReLU) Activation Function**

As far as deep learning is concerned, this is the activation function we are looking for. When the input is below zero, the output is zero. Above zero, the output is equal to the input, so the relationship is linear. So this activation function is **f(x)=max(0,x)**
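A minimal sketch of ReLU and its derivative for scalar inputs. Setting the derivative at exactly zero to 0 is a convention assumed here, since the function has no well-defined slope at that point.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_derivative(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


for x in (-3.0, 0.0, 2.5):
    print(x, relu(x), relu_derivative(x))
```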

This is the standard choice of activation function when dealing with deep neural networks. Compared to the sigmoid and tanh activation functions, ReLU is far less prone to vanishing gradient issues. What do I mean exactly? We have been talking about the basics of neural networks and we have mentioned the error (cost) function. Training the neural network means minimizing that cost function with gradient descent, where backpropagation is the algorithm that computes the gradients. In each iteration of training, every weight receives an update proportional to the gradient of the error function with respect to that weight. The problem is that in some cases the gradient becomes vanishingly small, which effectively prevents the weight from changing its value. In the worst case, this may completely stop the neural network from training any further.
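The effect of a tiny gradient on a single weight can be sketched with the plain gradient descent update rule. The learning rate and the gradient values below are hypothetical numbers picked for illustration.

```python
def gradient_descent_step(weight, gradient, learning_rate=0.1):
    """One update: move the weight against the gradient, scaled by the learning rate."""
    return weight - learning_rate * gradient


weight = 1.0
# A healthy gradient changes the weight noticeably.
print(gradient_descent_step(weight, gradient=0.5))
# A vanishing gradient leaves the weight practically where it was.
print(gradient_descent_step(weight, gradient=1e-10))
```

Since the update is proportional to the gradient, a gradient close to zero means the weight barely moves, no matter how many iterations we run.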

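OK, but we have managed to train neural networks with a single hidden layer using sigmoid activation functions. That is true. The issue becomes more significant as we add hidden layers, that is, when dealing with deep neural networks: backpropagation multiplies the derivatives of the activation functions layer by layer, and the derivative of the sigmoid is never larger than 0.25, so the product shrinks quickly with depth. The derivative of ReLU is exactly 1 for every positive input, so it does not shrink the gradient in this way. What is the solution? Let’s use the ReLU activation function instead!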
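To get a feeling for why depth matters, here is a deliberately simplified sketch. It assumes every layer sees the same pre-activation value and ignores the weights entirely, so it only tracks the product of activation derivatives that backpropagation accumulates. Real networks are more complicated, and all names here are made up for the example.

```python
import math


def sigmoid(x):
    """Logistic function, valid for moderate scalar inputs."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 (reached at x == 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(derivative, pre_activation, num_layers):
    """Product of num_layers identical activation derivatives (weights ignored)."""
    return derivative(pre_activation) ** num_layers


for depth in (1, 5, 10, 20):
    print(depth,
          chained_gradient(sigmoid_derivative, 0.0, depth),
          chained_gradient(relu_derivative, 1.0, depth))
```

Even in the best case for the sigmoid (input 0, derivative 0.25), the product drops below one millionth after ten layers, while the ReLU factor stays at 1 for positive inputs.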