So far we have been talking about neural networks and activation functions. One of the most important concepts in deep learning is the convolutional neural network (CNN). It is among the most widely used techniques in the field: applications such as self-driving cars and pedestrian detection rely heavily on convolutional neural networks.

**Problems with Other Approaches**

Let’s consider feedforward neural networks (no matter how many hidden layers they have) for image recognition. Regular neural networks do not scale well to full images. If an image has size 32x32x3 pixels (3 color channels), a single fully-connected neuron in the first hidden layer would have 32x32x3 = 3072 weights. What does that mean? It means we have to train 3072 weights (with backpropagation, for example) for every single neuron. For a 200x200x3 image, each neuron would have 120,000 weights. Clearly, this full connectivity is wasteful, and the huge number of parameters quickly leads to overfitting. How can we deal with this problem? We can use convolutional neural networks instead.
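The parameter counts above are easy to verify. A quick sketch (one weight per input pixel, bias omitted for simplicity):

```python
# Weights of a single fully-connected neuron attached to a raw image:
# one weight per input value (height * width * channels), bias omitted.
def weights_per_neuron(height, width, channels):
    return height * width * channels

print(weights_per_neuron(32, 32, 3))    # 3072
print(weights_per_neuron(200, 200, 3))  # 120000
```

And that is only one neuron; a realistic hidden layer multiplies this by hundreds or thousands of neurons.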

**Convolutional Neural Networks**

Convolutional neural networks make an assumption: the inputs are **images**. This allows us to encode certain properties into the architecture. Under the hood a CNN still uses a standard neural network, but first it transforms the data in order to achieve the best accuracy possible. There are three important steps:

- convolutional operation
- max pooling
- flattening

**Convolution**

One of the most important concepts in convolutional neural networks is the feature detector (or kernel). It is capable of detecting the most relevant features in the image. A kernel is a small matrix, typically 3×3 (7×7 is another popular size).
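A few classic hand-crafted 3×3 kernels look like this (the values below are the standard textbook ones):

```python
import numpy as np

# Edge detection: responds strongly where intensity changes abruptly
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

# Sharpen: amplifies the center pixel relative to its neighbors
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Emboss: highlights intensity gradients in a diagonal direction
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])
```

Note how the edge-detection kernel sums to zero: it outputs nothing on flat regions and responds only where pixel values change.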

Because we represent the image as a matrix, we can apply the convolution operation: we slide the kernel over the image and, at each position, multiply the overlapping items element-wise and sum them (this is not matrix multiplication!). The resulting feature map contains fewer values than the original image, so with convolutional neural networks we keep reducing the number of features.
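The sliding-window operation can be sketched in a few lines of NumPy. This is a minimal, single-channel version ("valid" convolution, no padding or stride; technically cross-correlation, which is what deep learning frameworks implement under the name "convolution"):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply the
    overlapping values element-wise and sum them into one output value."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    # With no padding, the output shrinks by (kernel size - 1) per axis
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
feature_map = convolve2d(image, np.ones((3, 3)))
print(feature_map.shape)  # (2, 2) -- smaller than the 4x4 input
```

Notice that a 3×3 kernel turns a 4×4 image into a 2×2 feature map: exactly the reduction described above.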

As you can see, it is not matrix multiplication: we multiply the items one by one, the image’s (i, j) value with the kernel’s (i, j) value. There are several types of kernels: the edge-detector kernel, the sharpen kernel, or the emboss kernel. We can apply multiple feature detectors independently to obtain multiple feature maps. Why is this good? Of course we do not know in advance which kernel is best, so we would have to test them one by one. Instead, during training the algorithm learns the best kernels itself, so it decides what the relevant features are. For us humans, features are things like the shape of a nose or the distance between the eyes; for a neural network the features can be far more abstract.

Then we apply the ReLU activation function to the feature maps. Why? We want to increase the non-linearity in our neural network, because images themselves are highly non-linear.
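ReLU applied to a feature map is just an element-wise maximum with zero:

```python
import numpy as np

def relu(feature_map):
    # ReLU: keep positive activations, zero out the negatives
    return np.maximum(0, feature_map)

fm = np.array([[-2.0, 3.0],
               [ 0.5, -1.0]])
print(relu(fm))  # [[0.  3. ] [0.5 0. ]]
```

Zeroing the negative responses removes the smooth linear transitions in the map, which is exactly the extra non-linearity we are after.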

**Pooling**

The next problem we have to deal with: we may have several images of the same object, but from different angles, positions, or sizes. We want to recognize a cat regardless of whether it sits in the left corner of the image or in the right corner. So if we have 3 images of a cat, we would like to make sure our network recognizes all of them:

Max pooling keeps only the most relevant features. After applying max pooling, small shifts, rotations, or size changes in the image matter much less. How does it work? We slide a 2×2 window (the size can be tuned as well) over the feature map and select the maximum value within each window. This way only the strongest activations remain: we keep the most important features and discard all the others.

**Flattening**

After applying max pooling we get the pooled layer. We then flatten this matrix into a vector, and this is the last important step in a convolutional network.
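Flattening is the simplest step of all: unroll the pooled matrix row by row into one vector.

```python
import numpy as np

pooled = np.array([[4, 8],
                   [9, 3]])
# Flattening: unroll the 2D pooled map row by row into a 1D vector
# that can serve as the input layer of a fully-connected network
flat = pooled.flatten()
print(flat)  # [4 8 9 3]
```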

After flattening, the final vector of values becomes the input layer of a standard artificial neural network, the kind we discussed in the first part. What was the original problem with ANNs and images? There would be an enormous number of connections between the neurons, so training the network would take a very long time. With the convolutional layer, max pooling, and flattening we eliminate the unnecessary values from the input. And it works extremely well!