If there’s one deep net that has completely dominated the machine vision space in recent
years, it’s certainly the convolutional neural net, or CNN. These nets are so influential
that they’ve made Deep Learning one of the hottest topics in AI today. But they can be
tricky to understand, so let’s take a closer look and see how they work.
CNNs were pioneered by Yann LeCun of New York University, who also serves as the director
of Facebook's AI group. It is currently believed that Facebook uses a CNN for its facial recognition
software.
The convolutional net has been the go-to solution for machine vision projects over the last few
years. Early in 2015, after a series of breakthroughs by Microsoft, Google, and Baidu, a machine
was able to beat a human at an object recognition challenge for the first time in the history
of AI.
It’s hard to mention a CNN without touching on the ImageNet challenge. ImageNet is a project
that was inspired by the growing need for high-quality data in the image processing
space. Every year, the top Deep Learning teams in the world compete with each other to create
the best possible object recognition software. Going back to 2012 when Geoff Hinton’s team
took first place in the challenge, every single winner has used a convolutional net as their
model. This isn’t surprising, since the error rate on image recognition tasks has dropped
significantly with CNNs, as seen in this image.
Have you ever struggled while trying to learn about CNNs? If so, please comment and share
your experiences.
We’ll keep our discussion of CNNs high level, but if you’re inclined to learn about the
math, be sure to check out Andrej Karpathy’s amazing CS231n course notes on these nets.
There are many component layers to a CNN, and we will explain them one at a time. Let’s
start with an analogy that will help describe the first component, which is the “convolutional
layer.”
Imagine that we have a wall, which will represent a digital image. Also imagine that we have
a series of flashlights shining at the wall, creating a group of overlapping circles. The
purpose of these flashlights is to seek out a certain pattern in the image, like an edge
or a color contrast, for example. Each flashlight looks for the exact same pattern as all the
others, but each searches a different section of the image, defined by the fixed region its
circle of light creates. Combined, the flashlights form what’s called a filter. A filter can
determine whether the given pattern occurs in the image, and in what regions. What you see in
this example is an 8x6 grid of lights, all of which is considered to be one filter.
Now let’s take a look from the top. In practice, flashlights from multiple filters
will all be shining at the same spots in parallel, simultaneously detecting a wide array of patterns.
In this example, we have four filters all shining at the wall, each looking for a different
pattern. So this particular convolutional layer is an 8x6x4, three-dimensional grid of
these flashlights.
Now let’s connect the dots of our explanation: - Why is it called a convolutional net? The
net uses the technical operation of convolution to search for a particular pattern. While
the exact definition of convolution is beyond the scope of this video, to keep things simple,
just think of it as the process of filtering through the image for a specific pattern
(there’s a code sketch after this list). One important note, though: the weights and biases
of this layer affect how this operation is performed, and tweaking these numbers impacts
the effectiveness of the filtering process.
- Each flashlight represents a neuron in the CNN. Typically, neurons in a layer activate,
or fire; in the convolutional layer, however, neurons instead perform this “convolution”
operation. We’re going to draw a box around one set of flashlights to make things look
a bit more organized.
- Unlike the nets we’ve seen thus far, where every neuron in a layer is connected to every
neuron in the adjacent layers, a CNN has the flashlight structure: each neuron is only
connected to the input neurons it “shines” upon.
The neurons in a given filter share the same weight and bias parameters. This means that
every neuron in the filter is connected to the same number of input neurons and uses
identical weights and biases, which is what allows the filter to look for the same pattern
in different sections of the image. By arranging these neurons in the same structure as
the flashlight grid, we ensure that the entire image is scanned.
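To ground the analogy, here is a minimal NumPy sketch of one filter convolving over an image. No code appears in the video, so the function name, the edge-detecting kernel values, and the image size are illustrative assumptions; the point to notice is that every output position reuses the same shared weights and bias:

import numpy as np

def convolve2d(image, kernel, bias=0.0):
    # Slide one filter across a 2D image (stride 1, no padding).
    # Every output position reuses the SAME kernel and bias -- the
    # weight sharing described above.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for row in range(out.shape[0]):
        for col in range(out.shape[1]):
            patch = image[row:row + kh, col:col + kw]  # the region one flashlight shines on
            out[row, col] = np.sum(patch * kernel) + bias
    return out

kernel = np.array([[1.0, 0.0, -1.0],      # an illustrative vertical-edge pattern
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])
image = np.random.rand(10, 8)             # stand-in for the "wall"
feature_map = convolve2d(image, kernel)
print(feature_map.shape)                  # (8, 6) -- one value per flashlight

Stacking the feature maps from four different kernels along a third axis would give exactly the 8x6x4 grid of detections described above.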
The next two layers that follow are ReLU and pooling, both of which help to build up the
simple patterns discovered by the convolutional layer. Each node in the convolutional layer
is connected to a node that fires, just like in other nets. The activation used is called
ReLU, or rectified linear unit. CNNs are trained using backpropagation, so the vanishing gradient
is once again a potential issue. But because the derivative of ReLU is exactly 1 for any
positive input, the gradient is held more or less constant at every layer of the net. So the
ReLU activation allows the net to be properly trained, without harmful slowdowns in the
crucial early layers.
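To make that concrete, here is a minimal NumPy sketch of ReLU and its gradient (the sample values are illustrative, not from the video):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)           # pass positives through, zero out the rest

def relu_grad(x):
    return (x > 0).astype(float)        # derivative: exactly 1 wherever the unit is active

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]

Compare this with a sigmoid activation, whose derivative never exceeds 0.25, so gradients shrink with every layer they pass through on the way back.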
The pooling layer is used for dimensionality reduction. CNNs tile multiple instances of
convolutional and ReLU layers together in a sequence, in order to build more and
more complex patterns. The problem with this is that the number of possible patterns becomes
exceedingly large. By introducing pooling layers, we ensure that the net focuses on
only the most relevant patterns discovered by convolution and ReLU. This helps limit
both the memory and processing requirements for running a CNN.
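As an illustrative sketch (the video doesn’t specify a pooling variant), here is the most common one, 2x2 max pooling, in NumPy:

import numpy as np

def max_pool(feature_map, size=2):
    # Keep only the strongest activation in each size x size window,
    # shrinking the feature map so downstream layers have less to process.
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for row in range(out.shape[0]):
        for col in range(out.shape[1]):
            window = feature_map[row * size:(row + 1) * size,
                                 col * size:(col + 1) * size]
            out[row, col] = window.max()
    return out

fm = np.arange(16.0).reshape(4, 4)
print(max_pool(fm))   # [[ 5.  7.] [13. 15.]] -- a quarter of the original cells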
Together, these three layers can discover a host of complex patterns, but the net will
have no understanding of what these patterns mean. So a fully connected layer is attached
to the end of the net in order to equip the net with the ability to classify data samples.
Let’s recap the major components of a CNN. A typical deep CNN has three sets of layers
– convolutional, ReLU, and pooling layers – all of which are repeated several
times. These layers are followed by a few fully connected layers in order to support
classification. Since CNNs are such deep nets, they most likely need to be trained on
servers with GPUs.
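The video names no framework, but as a hedged sketch, the architecture just described could be wired up in PyTorch roughly like this (all layer sizes are illustrative assumptions, for 32x32 RGB inputs and 10 classes):

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
    nn.ReLU(),                                    # ReLU activation
    nn.MaxPool2d(2),                              # pooling halves height and width
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # the trio repeats
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                 # 32 channels x 8 x 8 after two poolings
    nn.Linear(32 * 8 * 8, 64),                    # fully connected layers
    nn.ReLU(),
    nn.Linear(64, 10),                            # one output per class
)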
Despite the power of CNNs, these nets have one drawback. Since they are a supervised
learning method, they require a large set of labeled data for training, which can be
challenging to obtain in real-world applications. In the next video, we’ll shift our attention
to another important deep learning model – the Recurrent Net.