Subtitles

  • Hello World! It's Siraj

  • And welcome to "The Math of Intelligence"

  • For the next 3 months, we're going to take a journey through the most important math concepts that underlie machine learning.

  • That means [that] all the concepts you need from the great disciplines of calculus, linear algebra, probability theory, and statistics.

  • The prerequisites are knowing basic Python syntax and algebra.

  • Every single algorithm we code will be done without using any popular machine learning library,

  • because the point of this course is to help you build a solid mathematical intuition around building algorithms that can learn from data.

  • I mean let's face it, you could just use a black box API for all this stuff, but if you have the intuition, you'll know exactly which algorithm to use for the job, or even custom-make your own from scratch.

  • As humans, we are constantly receiving data through our five senses and somehow we've got to make sense of all this chaotic input so that we can survive.

  • Thanks to the evolutionary process we've developed brains capable of doing this.

  • We've got the most precious resource in the universe: intelligence,

  • the ability to learn and apply knowledge.

  • One way to measure our intelligence against the rest of the animal kingdom is using a ladder.

  • Ours is indeed the most generalized type of intelligence, capable of being applied to the widest variety of tasks.

  • But that doesn't mean that we are necessarily the best kind of intelligence.

  • In the 1960s, a primate researcher named Dr. Jane Goodall concluded

  • that chimpanzees had been living in the forest for hundreds of thousands of years without overpopulating or destroying their environment at all.

  • Orcas have the ability to sleep with one hemisphere of their brain at a time, which allows them to recuperate, while at the same time being aware of their surroundings.

  • In some ways animals are more intelligent than us.

  • Intelligence consists of many dimensions.

  • Think of it like a multi-dimensional space of possibility.

  • When building an AI, the human brain is a great road map; after all, neural networks have achieved state-of-the-art performance in countless tasks,

  • but it's not the only road map, there are many possible types of intelligence out there that we can and will create.

  • Some will seem familiar to us, and some very alien.

  • Thinking in a way we've never done before.

  • Like when AlphaGo played move 37.

  • Even the best Go players in the world were stunned at the move.

  • It went against everything we've learned about the game from millennia of practice, but it turned out to be an objectively better strategy that led to its win.

  • The many different types of intelligence are like symphonies,

  • each comprising different instruments, and

  • these instruments vary, not just in their dynamics but in their pitch and tempo and color and melody.

  • The amount of data that we're generating is growing really fast.

  • No I mean really, REALLY fast!

  • In the time since you started watching this video enough data was generated for you to spend an entire lifetime analyzing.

  • And only 0.5% of all data ever created has been analyzed.

  • Creating intelligence isn't just a nice-to-have, it's a necessity.

  • Put in the right hands, it will help us solve problems we never dreamed possible.

  • So where do we start?

  • At its core, machine learning is all about mathematical optimization.

  • This is a way of thinking.

  • Every single problem can be broken down into an optimization problem.

  • Once we have some data set that acts as our input, we'll build a model that uses that data to optimize for an objective - a goal that we want to reach.

  • And the way it does this is by minimizing some error value that we define.

  • One example problem could be, "what should I wear today?"

  • I could frame this as optimizing for stylishness instead of, say, comfort,

  • then define an error that I want to minimize as the number of negative ratings a group of people gives me.

  • Or even, what's the best design for my iOS app's homepage?

  • Rather than hardcoding in some elements, I could find a data set of app designs and their ratings from users.

  • If I want to optimize for a design that would be the highest rated I would learn the mapping between design styles and ratings.

  • This is the way that every single layer of the stack will be built in the future.

  • Sometimes our data is labeled,

  • sometimes it isn't,

  • there are different techniques we can use to find patterns in this data.

  • And sometimes optimizing for an objective can happen not through the frame of pattern recognition but

  • through the exploration of many possibilities and seeing what works and what doesn't.

  • There are many ways that we can frame the learning process...

  • But the easiest way to learn is when we use labeled data.

  • Mathematically speaking we have some input.

  • There's a domain, X, where every point of X has features that we observe.

  • Then we have a label set Y. So the data consists of a set of labeled examples that we can denote this way.

  • The output, then, would be a prediction rule. So given a new X value, what’s its associated Y value?

  • We've got to learn this mapping, where our examples come from an unknown distribution over X,

  • to be able to answer this. So we have to measure some error function that acts as a performance metric.
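
To make that setup concrete, here is one standard way to write it down. The transcript refers to notation shown on screen, so treat this as a plausible reconstruction of the supervised-learning setup rather than the exact slide:

```latex
% Training data: N labeled examples, features from domain X, labels from Y,
% assumed to be drawn from an unknown distribution over X
S = \{(x_1, y_1), \dots, (x_N, y_N)\}, \qquad x_i \in X,\; y_i \in Y

% The output is a prediction rule (hypothesis) h : X \to Y, scored by an
% error function over the examples, e.g. the average squared difference
L_S(h) = \frac{1}{N} \sum_{i=1}^{N} \bigl(h(x_i) - y_i\bigr)^{2}
```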

  • So what we’d do is choose from a number of possible models to represent this function.

  • We'd initially set some parameter values to represent the mapping, then we'd evaluate the initial result,

  • measure the error, update the parameters, and repeat this process, optimizing the model again and again until it fully learns the mapping. So that brings us to the main topic of this video: first-order optimization. What is this?

  • Was it convex or concave functions that were easier to optimize? I think convex. I really hope my lab partner is epic at optimization.

  • I guess I should be thankful, not many data scientists get a grant from CERN to detect the Higgs boson.

  • What was her name again? Eloise, I think. Yup, she did win an award at ICML. I wonder if she’s cute?

  • No, that doesn’t matter. I am not going to mix business and pleasure, not this time.

  • Suppose I’ve got a bunch of data points. These are just toy data points, like what Apple probably trained Siri on.

  • They're all x-y value pairs where x represents the distance a person bikes,

  • and y represents the amount of calories they lost. We can just plot them on a graph like so.

  • We want to be able to predict the calories lost for a new person given their biking distance.

  • How should we do this? Well we could try to draw a line that fits through all the data points but it seems like our points are too spaced out for a straight line to pass through all of them.

  • So we can settle for drawing the line of best fit, a line that goes through as many data points as possible.

  • Algebra tells us that the equation for a straight line is of the form y = mx + b.

  • Where m represents the slope or steepness of the line and b represents its y-intercept.

  • We want to find the optimal values for b and m such that

  • the line fits as many points as possible, so given any new x value, we can plug it into our equation and it'll output the most likely y value.

  • Our error metric can be a measure of closeness, which we can define like this. So let's start off with a random b and m value and plot this line.

  • For every single data point we have, let's calculate its associated y value on our randomly drawn line.

  • Then we'll subtract the actual y value from it to measure the distance between the two.

  • We'll want to square this error to make our next steps easier.

  • Once we sum all these values we get a single value that represents our error given that line we just drew.
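
As a minimal sketch of that error calculation in plain Python: the biking-distance and calorie numbers below are invented purely for illustration, and the error is the sum of squared differences just described.

```python
# Toy data: x = distance biked, y = calories lost (made-up numbers).
points = [(1.0, 60.0), (2.0, 115.0), (3.0, 175.0), (4.0, 230.0), (5.0, 295.0)]

def sum_squared_error(b, m, points):
    """Error for the line y = m*x + b: sum of squared vertical distances."""
    total = 0.0
    for x, y in points:
        predicted = m * x + b            # y value on our candidate line
        total += (y - predicted) ** 2    # squared distance to the actual y
    return total

# A randomly chosen starting line, just to see how large its error is.
print(sum_squared_error(b=0.0, m=1.0, points=points))
```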

  • Now if we did this process repeatedly, say 666 times, for a bunch of different randomly drawn lines,

  • we could create a 3D graph that shows the error value for every associated b and m value.

  • Notice how there is a valley in this graph. At the bottom of this valley, the error is at its smallest.

  • And so the associated b and m values would be the line of best fit, where the distance between all our data points and our line would be the smallest!
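
One way to see that valley numerically is to evaluate the error over a grid of b and m values. This sketch reuses the same made-up points; the search ranges and grid resolution are assumptions, and it simply reports the lowest point on the grid instead of rendering the 3D plot:

```python
import numpy as np

# Same toy points as before (invented for illustration).
points = [(1.0, 60.0), (2.0, 115.0), (3.0, 175.0), (4.0, 230.0), (5.0, 295.0)]

def sum_squared_error(b, m, points):
    return sum((y - (m * x + b)) ** 2 for x, y in points)

# Scan a hypothetical range of intercepts b and slopes m.
best = None
for b in np.linspace(-50, 50, 101):
    for m in np.linspace(0, 100, 101):
        err = sum_squared_error(b, m, points)
        if best is None or err < best[0]:
            best = (err, b, m)

print("lowest error on this grid:", best)   # the bottom of the valley
```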

  • But how do we find it? Well, we'll need to try out a bunch of different lines to create this 3D graph.

  • But rather than just randomly drawing lines over and over again with no signal, what if we could do it in a more efficient way,

  • such that each successive line we draw brings us closer and closer to the bottom of this valley.

  • We need a direction, a way to descend this valley. What if, for a given function, we could find its slope at a given point?

  • Then that slope would point in a certain direction, towards the minimum of the graph.

  • And when we re-draw our line over and over again, we could do so using the slope as our compass, as our guide on how best to redraw as we walk through the valley of the shadow of death

  • towards the minima until our slope approaches 0. In calculus, we call this slope the derivative of a function.

  • Since we are updating two values, b and m, we want to calculate the derivative with respect to each of them, the partial derivative.

  • The partial derivative with respect to a variable means that we calculate the derivative for that variable while treating the others as constants.

  • So we'll compute the partial derivative with respect to b. Then the partial derivative with respect to m.

  • To do this we use the power rule. We multiply the exponent by the coefficient and subtract 1 from the exponent.
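
Written out, the power rule and the partial derivatives it gives for the summed squared error look like this. The transcript doesn't show the formulas here, so this is the standard derivation for the error defined earlier:

```latex
% Power rule: multiply by the exponent, subtract 1 from the exponent
\frac{d}{dx}\, c x^{n} = c\, n\, x^{\,n-1}

% Error for the line y = m x + b over N data points
E(b, m) = \sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr)^{2}

% Partial derivatives (power rule plus the chain rule)
\frac{\partial E}{\partial b} = \sum_{i=1}^{N} -2\,\bigl(y_i - (m x_i + b)\bigr),
\qquad
\frac{\partial E}{\partial m} = \sum_{i=1}^{N} -2\,x_i\,\bigl(y_i - (m x_i + b)\bigr)
```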

  • Once we have these 2 values we can update both of these parameters from our function by subtracting them from our existing b and m values.

  • And we just keep doing that for a set number of iterations that we pre-define.

  • So this optimization technique that we just performed is called gradient descent, and it's the most popular one in machine learning.
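
Putting those pieces together, here's a minimal gradient-descent sketch in plain Python. The learning rate is an assumption on my part (the transcript only says to subtract the partial derivatives for a preset number of iterations), and the data points are the same invented ones as above:

```python
# Toy data: x = distance biked, y = calories lost (made-up numbers).
points = [(1.0, 60.0), (2.0, 115.0), (3.0, 175.0), (4.0, 230.0), (5.0, 295.0)]

def step_gradient(b, m, points, learning_rate):
    """One gradient-descent update for the line y = m*x + b."""
    grad_b, grad_m = 0.0, 0.0
    for x, y in points:
        # Partial derivatives of the summed squared error with respect to b and m.
        grad_b += -2.0 * (y - (m * x + b))
        grad_m += -2.0 * x * (y - (m * x + b))
    # Subtract the gradient, scaled by a small learning rate so each
    # step stays inside the valley instead of overshooting it.
    return b - learning_rate * grad_b, m - learning_rate * grad_m

b, m = 0.0, 0.0                              # starting line
learning_rate, num_iterations = 0.01, 1000   # assumed hyperparameters
for _ in range(num_iterations):
    b, m = step_gradient(b, m, points, learning_rate)

print("b =", b, "m =", m)   # should approach the line of best fit
```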

  • So what do you need to remember from this video? 3 points.

  • The derivative is the slope of a function at a given point, the partial derivative is the slope with respect to one variable in that function.

  • We can use them to compose a gradient; following it downhill points us toward a local minimum of the function.

  • And gradient descent is a very popular optimization strategy in machine learning that uses the gradient to do this. Now it's your turn. I've got a coding challenge for you.

  • Implement gradient descent on your own on a different dataset that I’ll provide.

  • Check out the GitHub link for details, the winner will be announced in a week.

  • Please subscribe for more programming videos, and for now I've gotta go memorize the power rule, so thanks for watching :)
