So now you’re probably thinking: wow, deep nets are really great! But why did it take so long for them to become popular? Well, as it turns out, when you try to train them with a method called backpropagation, you run into a fundamental problem called the vanishing gradient, or sometimes the exploding gradient. When that happens, training takes too long and the accuracy really suffers. Let’s take a closer look.
When you’re training a neural net, you’re constantly calculating a cost value. The cost is typically the difference between the net’s predicted output and the actual output from a set of labelled training data. The cost is then lowered by making slight adjustments to the weights and biases over and over throughout the training process, until the lowest possible value is obtained. Here is that forward prop again; and here are the example weights and biases. The training process utilizes something called a gradient, which measures the rate at which the cost will change with respect to a change in a weight or a bias.
Deep architectures are your best and sometimes your only choice for complex machine learning problems such as facial recognition. But up until 2006, there was no way to accurately train deep nets due to a fundamental problem with the training process: the vanishing gradient.
Let’s think of a gradient like a slope, and the training process like a rock rolling down that slope. A rock will roll quickly down a steep slope but will barely move at all on a flat surface. The same is true with the gradient of a deep net. When the gradient is large, the net will train quickly. When the gradient is small, the net will train slowly.
Here’s that deep net again. And here is how the gradient could potentially vanish or decay back through the net. As you can see, the gradients are much smaller in the earlier layers. As a result, the early layers of the network are the slowest to train.
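To put numbers on the rock-and-slope picture, here is a minimal sketch. It assumes a plain gradient-descent update (new weight = weight minus learning rate times gradient), which the video does not spell out. The function name `update_weight`, the learning rate, and the two gradient values are made up for illustration.

```python
def update_weight(weight, gradient, learning_rate=0.1):
    """One gradient-descent step: nudge the weight against the gradient.

    The update rule and the learning rate are assumptions for this sketch.
    """
    return weight - learning_rate * gradient


start = 1.0

# A large gradient (steep slope) moves the weight a lot in one step.
steep_step = abs(update_weight(start, gradient=2.0) - start)

# A tiny gradient (nearly flat surface) barely moves the weight at all.
flat_step = abs(update_weight(start, gradient=0.002) - start)

print("step with large gradient:", steep_step)
print("step with small gradient:", flat_step)
```

With the same learning rate, the weight that sees the small gradient moves about a thousand times less per step, which is why layers with small gradients are the slowest to train.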
But this is a fundamental problem! The early layers are responsible for detecting the simple patterns and the building blocks. When it came to facial recognition, the early layers detected the edges, which were combined to form facial features later in the network. And if the early layers get it wrong, the result built up by the net will be wrong as well. It could mean that instead of a face like this, your net looks for this.
The process used for training a neural net is called backpropagation, or backprop. We saw before that forward prop starts with the inputs and works forward; backprop does the reverse, calculating the gradient from right to left. For example, here are 5 gradients, 4 weight and 1 bias. It starts at the right and works back through the layers, like so. Each time it calculates a gradient, it uses all the previous gradients up to that point. So, let’s start with that node. That edge uses the gradient at that node. And the next. So far things are simple. As you keep going back, things get a bit more complex: that one, for example, uses a lot of gradients, even though this is a relatively simple net. If your net gets larger and deeper, like this one, it gets even worse.
But why is that? Well, a gradient at any point is the product of the previous gradients up to that point. And the product of two numbers between 0 and 1 gives you a smaller number. Say this rectangle is a one. Also, say there are two gradients: a fourth, like that, and a third. If you multiply them, you get a fourth of a third, which is a twelfth. A fourth of a twelfth is a forty-eighth. You can see that the numbers keep getting smaller the more you multiply.
Have you ever had this issue while training a neural network with backpropagation? If so, please comment and let me know your thoughts. As a result of all this, backprop ends up taking a lot of time to train the net, and the accuracy is often very low.
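The arithmetic from the rectangle example can be checked with a short sketch. It uses Python's standard `fractions` module so the values stay exact. The helper name `running_products` and the ten-layer case at the end are illustrative assumptions, not something shown in the video.

```python
from fractions import Fraction


def running_products(factors):
    """Return the cumulative product after each factor is multiplied in."""
    products = []
    total = Fraction(1)  # "say this rectangle is a one"
    for factor in factors:
        total *= factor
        products.append(total)
    return products


# The gradients from the example: a fourth, a third, then another fourth.
products = running_products([Fraction(1, 4), Fraction(1, 3), Fraction(1, 4)])
print([str(p) for p in products])

# Hypothetical deeper case: ten gradients of one fourth each.
deep = running_products([Fraction(1, 4)] * 10)[-1]
print(float(deep))
```

The running product passes through a fourth, a twelfth, and a forty-eighth, matching the example. With ten factors of one fourth it is already below one in a million.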
Up until 2006, deep nets were still underperforming shallow nets and other machine learning algorithms. But everything changed after three breakthrough papers published by Hinton, LeCun, and Bengio in 2006 and 2007. In the next video, we’ll begin taking a closer look at these breakthroughs, starting with the Restricted Boltzmann Machine.
An Old Problem - Ep. 5 (Deep Learning SIMPLIFIED), firefox posted on 2017/06/14