
A CNN with Visuals and Intuitions Behind it


While I was studying CNN architecture, the most challenging part was to comprehend all the processes that were occurring. I wanted a visual cheat-sheet guide that helped me understand the process while not skipping the computational part of CNN. I also wanted to avoid oversimplification, which we frequently risk whenever we come across a visual guide. This step-by-step explanation, accompanied by diagrams, is an attempt to provide readers with a clearer understanding of this powerful neural network architecture.


CNN as Representational learning:

Convolutional neural networks (CNNs) are a part of representation learning.

Representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data.

This reduces the need for manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

As a human, everyone must have gone through a representational learning experience. Imagine you are asked to draw a house in 10 minutes. Most likely, you will draw a house that is a line diagram with basic shapes like a rectangle for walls, a trapezium for the roof, and an upside-down “U” for the entrance. Something like below, unless you are a really, really, really good artist who is keen on attention to detail:


You most probably would not draw a house like below, even though it seems like a dream house:


There, you just applied representational learning without any university degree!

CNNs also do a similar job by abstracting details from an image and learning it as a representation rather than getting caught up in unnecessary detailing. To achieve this, we need to literally “tone down” the amount of information processed while still preserving the abstract concept of the image.


To achieve this, a CNN goes through the steps below.


Step 1: Input and Padding

• Consider an image converted into an input matrix.

• In our case, it is a simple binary image where '1's represent features such as edges and '0's represent the background.

• To ensure that we don't lose information at the edges during the convolution process, we pad this matrix with zeros around the border.
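As a minimal sketch (plain Python, no libraries; the function name `zero_pad` and the tiny 3x3 cross image are illustrative choices, not part of any standard API), zero padding could look like this:

```python
def zero_pad(image, pad=1):
    """Surround a 2-D matrix (list of lists) with a border of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]           # top border rows
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # left/right border
    padded += [[0] * width for _ in range(pad)]           # bottom border rows
    return padded

# A tiny 3x3 binary image: '1's form a cross-shaped feature
image = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]
padded = zero_pad(image)  # now 5x5, original values preserved in the centre
```

With the border in place, a 3x3 kernel can be centred on every original pixel, including the ones at the edges, so no information is lost.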

Step 2: Applying the Kernel

• A kernel (or filter) is a small matrix that slides over the input matrix, as depicted in the visualization below, and detects specific features in the image.

• At each position, the kernel is multiplied element-wise with the part of the image it covers, and the results are summed up to form a single entry in the feature map.

• Sliding the kernel by one position at a time is called striding, with stride = 1.

Step 3: Feature Map

• The feature map is a new image highlighting where certain features (like edges or corners) are detected within the original image.

• The values in the feature map depend on how well the features in the input match the pattern in the kernel.
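Steps 2 and 3 can be sketched together in plain Python (the function name `convolve2d`, the 5x5 padded input, and the cross-shaped kernel are illustrative assumptions; the input is assumed square for simplicity):

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both square lists of lists); at each
    position, multiply element-wise and sum to get one feature-map entry."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map

# The 3x3 cross image from Step 1, already zero-padded to 5x5
padded = [[0, 0, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 0, 1, 0, 0],
          [0, 0, 0, 0, 0]]

# A kernel shaped like the same cross pattern
kernel = [[0, 1, 0],
          [1, 1, 1],
          [0, 1, 0]]

feature_map = convolve2d(padded, kernel)  # 3x3 map; peaks at 5 in the centre
```

The output peaks (value 5) exactly where the input matches the kernel's pattern, which is the sense in which the feature map "highlights" detected features.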


Step 4: Max Pooling

• Max pooling reduces the spatial dimensions of the feature map.

• It divides the feature map into blocks (in this case, 2x2 blocks) and retains only the maximum value from each block.

• This operation is visualized in the diagram below, where the highest values within each block are selected to form a new, reduced feature map.
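A minimal sketch of 2x2 max pooling on a small hand-made feature map (the function name `max_pool` and the sample values are illustrative, not from any library):

```python
def max_pool(feature_map, size=2):
    """Keep only the maximum of each non-overlapping size x size block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[i + a][j + b]
                     for a in range(size) for b in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled

feature_map = [[2, 1, 3, 0],
               [4, 5, 1, 2],
               [0, 1, 6, 3],
               [2, 2, 1, 4]]
pooled = max_pool(feature_map)  # -> [[5, 3], [2, 6]]
```

A 4x4 map shrinks to 2x2: a fourfold reduction in values to process, while the strongest activation in each neighbourhood survives.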

Why Convolution and Max Pooling?

Convolution: Identifies and isolates features in images, which are crucial for understanding the content within. This operation allows the network to focus on the most important elements of the input.

Max Pooling: Simplifies the output by reducing its dimensionality, letting the network focus on the most prominent features. This improves the network's ability to generalize and also helps combat overfitting.


From Max Pooling to Output

After max pooling, the network usually flattens the pooled feature map and feeds it into one or more fully connected layers (not shown in the diagrams).


These layers classify the image by considering the features detected and pooled in previous layers. The final layer typically uses a SoftMax function to classify the image into categories, providing the probability that the input belongs to each category.
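The softmax function itself is simple enough to sketch in a few lines (the three example scores below are hypothetical final-layer outputs, invented for illustration):

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1.
    Subtracting the max first is a standard numerical-stability trick."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer scores for three categories
probs = softmax([2.0, 1.0, 0.1])  # largest score -> highest probability
```

Because the outputs are positive and sum to 1, they can be read directly as the probability that the input belongs to each category.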


This guide aims to make the convolution and pooling stages of a CNN easy to understand, showing how layers build upon each other to achieve sophisticated image recognition.


Stay tuned for more detailed explorations into AI topics in future posts!


Writer:

Khushi RJ

Data Scientist working in a Fintech
