
  • Hello World, it's Siraj, and let's talk about optimization

  • There are thousands of languages spoken across the world, each one unique in its ability to represent concepts and convey ideas.

  • No one person is fluent in all of them, but there is one language shared by all humans, regardless of where you come from: mathematics.

  • No matter your culture or your age, you possess the ability to understand this language of numbers that connects us all across continents and time.

  • Like all languages, fluency requires practice.

  • But unlike any other language, the more fluent you become in math, the more unstoppable you'll be in anything you want to do in life.

  • Math is happening all around us, to a degree most people don’t realize. We can think of everything as a set of variables, as metrics.

  • and there exist relations between all of these variables.

  • In math, we call these relations functions. It's our way of representing a set of patterns, a mapping,

  • a relationship between many variables.

  • No matter what machine learning model we use, no matter what dataset we use, the goal of machine learning is to optimize for an objective and by doing so we are approximating a function.

  • The process of optimization helps us iteratively discover the functions hidden in the depths of the data.

  • Last week we talked about a popular optimization technique called gradient descent.

  • This can be broken down into a five-step process. First, we define some machine learning model with a set of initial weight values,

  • these act as the coefficients of the function that the model represents.

  • The mapping between input data and output predictions. These values are naive; we have no idea what they should actually be, and we're trying to discover the optimal ones.

  • We'll define an error function, and when we plot the graph of the relationship between all the possible error values and all the possible weight values for our function,

  • we'll see that there exists a valley, the minimum. We'll use our error to help us compute the partial derivative with respect to each weight value we have, and this gives us our gradient.

  • The gradient represents the change in the error when the weights are changed by a very small value from their original value.

  • We use the gradient to update the values of our weights in a direction such that the error is minimized,

  • iteratively coming closer and closer to the minimum of the function. We step our solution in the negative direction of the gradient repeatedly.

  • When we reach it, we have learned the optimal weight values for our model, where our gradient is equal to zero.

  • Our model will then be able to make predictions for input data it’s never seen before.
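
To make those five steps concrete, here is a minimal sketch in Python, assuming a plain linear model trained with mean squared error; the model, toy data, learning rate, and step count are all illustrative choices.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_steps=2000):
    """Minimal batch gradient descent for a linear model with MSE loss (illustrative)."""
    weights = np.zeros(X.shape[1])            # step 1: naive initial weight values
    for _ in range(n_steps):
        predictions = X @ weights             # step 2: map input data to predictions
        error = predictions - y               # step 3: measure the error
        gradient = 2 * X.T @ error / len(y)   # step 4: partial derivative of MSE w.r.t. each weight
        weights -= learning_rate * gradient   # step 5: step in the negative direction of the gradient
    return weights

# toy usage: recover the weights of y = 2*x1 + 3*x2
X = np.random.rand(100, 2)
y = X @ np.array([2.0, 3.0])
print(gradient_descent(X, y))   # approximately [2., 3.]
```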

  • Most optimization problems can be solved using gradient descent and its variants.

  • They all fall into a category called first order optimization methods. We call them first order because they only require us to compute the first derivative.

  • But there's another class of techniques that aren't as widely used, called second order optimization methods, which require us to compute the second derivative.

  • The first derivative tells us if the function is increasing or decreasing at a certain point, and the second derivative tells us if the first derivative is increasing or decreasing, which hints at its curvature.
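
As a quick illustration of that distinction, here is a small sketch using SymPy for symbolic differentiation; the function is just an arbitrary example.

```python
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x                 # arbitrary example function
first = sp.diff(f, x)          # 3*x**2 - 3: the slope, i.e. whether f is increasing or decreasing
second = sp.diff(f, x, 2)      # 6*x: how the slope itself changes, i.e. the curvature
print(first, second)
print(second.subs(x, 1))       # 6 > 0, so f curves upward at x = 1 (a local minimum sits there)
```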

  • First order methods provide us with a line that is tangential to a point on an error surface, and second order methods provide us with a quadratic surface that kisses the curvature of the error surface.

  • Haha, get a room, you two. The advantage, then, of second order methods is that they don't ignore the curvature of the error surface, and in terms of step-wise performance, they are better.

  • Let's look at a popular second order optimization technique called Newton's method, named after the dude who invented calculus.

  • Whose name was…

  • There are actually two versions of Newton’s method, the first version is for finding the roots of a polynomial, all those points where it intersects the x-axis.

  • So if you threw a ball and recorded its trajectory, finding the root of the equation would tell you exactly what time it hits the ground.

  • The second version is for optimization, and it's what we use in machine learning.

  • But let's code the root-finding version first to develop some basic intuition.

  • Let's say we have a function f of x and some initial guessed solution. Newton's method says that we first find the slope of the tangent line at our guess point, then find the point at which the tangent line intersects the x-axis.

  • We'll use that point to find its projection in the original function.

  • Then we iterate again from our first step, this time replacing our first point with this one.

  • We keep iterating, and eventually we'll stop when the change in our current value of x is less than or equal to some threshold.
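
Here is a sketch of that root-finding loop; the function, its derivative, the starting guess, and the stopping threshold below are all placeholder choices.

```python
def newton_root(f, df, x0, tol=1e-8, max_iter=100):
    """Root-finding Newton's method: x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)     # distance from x to where its tangent line crosses the x-axis
        x -= step               # jump to that crossing point and repeat from there
        if abs(step) <= tol:    # stop once x barely changes between iterations
            return x
    return x

# toy usage: the positive root of f(x) = x**2 - 2 is sqrt(2)
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```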

  • So that's the root-finding version of Newton's method, where we're trying to find where the function equals zero,

  • but in the optimization version, we're trying to find where the derivative of the function equals zero, its minimum.

  • At a high level, given a random starting location, we construct a quadratic approximation to the objective function that matches the first and second derivative values at that point.

  • And then we minimize that quadratic function instead of the original function.

  • The minimizer of the quadratic function is used as the starting point in the next step, and we repeat this process iteratively.

  • OK, let's go over two cases of Newton's method for optimization to learn more: a 1D case and a 2D case.

  • In the first case we've got a one-dimensional function. We can obtain a quadratic approximation at a given point of the function using what's called a Taylor series expansion,

  • neglecting terms of order three or higher.

  • A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function's derivatives at a single point.

  • It was invented by an English mathematician named Brook Taylor. Swift. So we'd take the second order Taylor series around our initial point x, which uses the first and second derivatives, and minimize it by setting its derivative equal to zero.
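
Written out in standard textbook form, the Taylor series around a point x_n, its second order truncation, and the update you get by setting the truncation's derivative to zero look like this:

```latex
f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_n)}{k!}\,(x - x_n)^k
     \approx f(x_n) + f'(x_n)\,(x - x_n) + \tfrac{1}{2}\,f''(x_n)\,(x - x_n)^2

\frac{d}{dx}\!\left[ f(x_n) + f'(x_n)\,(x - x_n) + \tfrac{1}{2}\,f''(x_n)\,(x - x_n)^2 \right] = 0
\quad\Longrightarrow\quad
x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}
```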

  • In order to find the minimum x value, we iterate this process.
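
A minimal sketch of that 1D iteration, assuming we already have the first and second derivatives as callables; the example objective is just illustrative.

```python
def newton_minimize_1d(df, d2f, x0, tol=1e-8, max_iter=100):
    """1D Newton's method for optimization: x_{n+1} = x_n - f'(x_n) / f''(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)   # minimizer of the local quadratic approximation, relative to x
        x -= step
        if abs(step) <= tol:    # stop when the update becomes negligible
            return x
    return x

# toy usage: f(x) = (x - 3)**2 + 1 has its minimum at x = 3
print(newton_minimize_1d(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0))
```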

  • In the second case, let's say we've got a function of multiple dimensions. We can find the minimum of it using the same approach, except for two changes:

  • we replace the first derivative with the gradient and the second derivative with the Hessian.

  • A Hessian is a matrix of the second order partial derivatives of a scalar-valued function, and it describes the local curvature of a multivariable function.
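
The multidimensional version just swaps in the gradient vector and the Hessian matrix. Here is a sketch, with an assumed toy objective f(x, y) = x^2 + 3*y^2:

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=100):
    """Multivariate Newton's method: x_{n+1} = x_n - H(x_n)^{-1} grad(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))   # solve H * step = gradient rather than inverting H
        x -= step
        if np.linalg.norm(step) <= tol:
            return x
    return x

# toy objective f(x, y) = x**2 + 3*y**2, whose minimum is at the origin
grad = lambda v: np.array([2 * v[0], 6 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 6.0]])
print(newton_minimize(grad, hess, x0=[5.0, -4.0]))   # approximately [0., 0.]
```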

  • Derivatives help us compute gradients which we can represent using a Jacobian matrix for first order optimization.

  • And we use the Hessian for second order optimization. These are 4 of the 5 derivative operators used in all of calculus; they're the ways that we organize and represent change numerically.

  • So when should you use a second order method? First order methods are usually cheaper per step and take less time, converging pretty fast on large datasets.

  • Second order methods are faster when the second derivative is known and easy to compute.

  • But the second derivative is often expensive or even intractable to compute.

  • For certain problems gradient descent can get stuck along paths of slow convergence around saddle points, whereas second order methods won’t.

  • Trying out different optimization techniques for your specific problem is the best way to see what works best.

  • Here are the key points to remember: First order optimization techniques use the first derivative of a function to minimize it; second order optimization techniques use

  • the second derivative. The Jacobian is a matrix of first partial derivatives and the Hessian is a matrix of second partial derivatives.

  • And Newton's method is a popular second order optimization technique that can sometimes outperform gradient descent. Last week's coding challenge winner is Alberto Garces.

  • Alberto used gradient descent to find the line of best fit. His Jupyter notebook is insanely detailed; you could learn gradient descent just from reading it alone.

  • Very well thought out. That was dope, Alberto. Wizard of the Week.

  • And the runner up is Ivan Gusev who implemented gradient descent from scratch for polynomials of any order.

  • This week's challenge is to implement Newton's method for optimization from scratch. Details in the README,

  • post your GitHub link in the comments, winners announced next week.

  • Please subscribe for more programming videos, and for now I've gotta invent the 6th derivative, so thanks for watching :)
