
  • ANNOUNCER: The following program is brought to you by Caltech.

  • YASER ABU-MOSTAFA: Welcome to machine learning, and welcome to our online

  • audience as well.

  • Let me start with an outline of the course, and then go into the material

  • of today's lecture.

  • As you see from the outline, the topics are given colors, and that

  • designates their main content, whether it's mathematical or practical.

  • Machine learning is a very broad subject.

  • It goes from very abstract theory to extreme practice as in rules of thumb.

  • And the inclusion of a topic in the course depends on the relevance to

  • machine learning.

  • So some mathematics is useful because it gives you the conceptual framework,

  • and then some practical aspects are useful because they give you the way

  • to deal with real learning systems.

  • Now if you look at the topics, these are not meant to be separate topics

  • for each lecture.

  • They just highlight the main content of those lectures.

  • But there is a story line that goes through it, and let me tell you what

  • the story line is like.

  • It starts here with: what is learning?

  • Can we learn?

  • How to do it?

  • How to do it well?

  • And then the take-home lessons.

  • There is a logical dependency that goes through the course, and there's

  • one exception to that logical dependency.

  • One lecture, which is the third one, doesn't really belong here.

  • It's a practical topic, and the reason I included it early on is because I

  • needed to give you some tools to play around with, to test the

  • theoretical and conceptual aspects.

  • If I had waited until its natural place, which is with the second part of

  • the linear models down there, the beginning of the course would have been

  • just too theoretical for people's taste.

  • And as you see, if you look at the colors, it is mostly red in the

  • beginning and mostly blue in the end.

  • So it starts building the concepts and the theory.

  • And then it goes on to the practical aspects.

  • Now, let me start today's lecture.

  • And the subject of the lecture is the learning problem.

  • It's an introduction to what learning is.

  • And I will draw your attention to one aspect of this slide,

  • which is this part.

  • That's the logo of the course.

  • And believe it or not, this is not artwork.

  • This is actually a technical figure that will come up

  • in one of the lectures.

  • I'm not going to tell you which one.

  • So you can wait in anticipation until it comes up, but this will actually be

  • a scientific figure that we will talk about.

  • Now when we move to today's lecture, I'm going to talk

  • today about the following.

  • Machine learning is a very broad subject, and I'm going to start with

  • one example that captures the essence of machine learning.

  • It's a fun example about movies that everybody watches.

  • And then after that, I'm going to abstract from the learning problem,

  • the practical learning problem, aspects that are common to all

  • learning situations that you're going to face.

  • And in abstracting them, we'll have the mathematical formalization of the

  • learning problem.

  • And then we will get our first algorithm for machine learning today.

  • It's a very simple algorithm, but it will fix the idea about what is the

  • role of an algorithm in this case.

  • And we will survey the types of learning, so that we know which part we

  • are emphasizing in this course, and which parts are nearby.

  • And I will end up with a puzzle, a very interesting puzzle, and it's

  • a puzzle in more ways than one, as you will see.

  • OK, so let me start with an example.

  • The example of machine learning that I'm going to start with is how

  • a viewer would rate a movie.

  • Now that is an interesting problem, and it's interesting for us because we

  • watch movies, and very interesting for a company that rents out movies.

  • And indeed, one company, Netflix, wanted to improve its in-house system

  • by a mere 10%.

  • So they make recommendations when you log in, they recommend movies that

  • they think you will like, so they think that you'll rate them highly.

  • And they had a system, and they wanted to improve the system.

  • So how much is a 10% improvement in performance worth to the company?

  • It was actually $1 million that was paid out to the first group that

  • actually managed to get the 10% improvement.

  • So you ask yourself, 10% improvement in something like that, why should

  • that be worth a million dollars?

  • It's because, if the recommendations that the movie company makes are spot

  • on, you will pay more attention to the recommendation, you are likely to rent

  • the movies that they recommend, and they will make lots of money-- much

  • more than the million dollars they promised.

  • And this is very typical in machine learning.

  • For example, machine learning has applications in financial forecasting.

  • You can imagine that the minutest improvement in financial forecasting

  • can make a lot of money.

  • So the fact that you can actually push the system to be better using machine

  • learning is a very attractive aspect of the technique in a wide spectrum of

  • applications.

  • So what did these guys do?

  • They gave the data, and people started working on the problem using different

  • algorithms, until someone managed to get the prize.

  • Now if you look at the problem of rating a movie, it captures the

  • essence of machine learning, and the essence has three components.

  • If you find these three components in a problem you have in your field, then

  • you know that machine learning is ready as an application tool.

  • What are the three?

  • The first one is that a pattern exists.

  • If a pattern didn't exist, there would be nothing to look for.

  • So what is the pattern here?

  • There is no question that the way a person rates a movie is related to how

  • they rated other movies, and is also related to how other

  • people rated that movie.

  • We know that much.

  • So there is a pattern to be discovered.

  • However, we cannot really pin it down mathematically.

  • I cannot ask you to write a 17th-order polynomial that captures how people

  • rate movies.

  • So the fact that there is a pattern, and that we cannot pin it down

  • mathematically, is the reason why we are going for machine learning.

  • For "learning from data".

  • We couldn't write down the system on our own, so we're going to depend on

  • data in order to be able to find the system.

  • There is a missing component which is very important.

  • If you don't have that, you are out of luck.

  • We have to have data. We are learning from data.

  • So if someone knocks on my door with an interesting machine learning

  • application, and they tell me how exciting it is, and how great the

  • application would be, and how much money they would make, the first

  • question I ask is: what data do you have?

  • If you have data, we are in business.

  • If you don't, you are out of luck.

  • If you have these three components, you are ready to

  • apply machine learning.

  • Now let me give you a solution to the movie rating, in order to start

  • getting a feel for it.

  • So here is a system.

  • Let me start to focus on part of it.

  • We are going to describe a viewer as a vector of factors, a profile if

  • you will.

  • So if you look here for example, the first one would be comedy content.

  • Does the movie have a lot of comedy?

  • From a viewer point of view, do they like comedies?

  • Here, do they like action?

  • Do they prefer blockbusters, or do they like fringe movies?

  • And you can go on all the way, even to asking yourself whether you like the

  • lead actor or not.

  • Now you go to the content of the movie itself, and you get the

  • corresponding part.

  • Does the movie have comedy?

  • Does it have action?

  • Is it a blockbuster?

  • And so on.

  • Now you compare the two, and you realize that if there is a match--

  • let's say you hate comedy and the movie has a lot of comedy--

  • then the chances are you're not going to like it.

  • But if there is a match across many coordinates-- and the number of

  • factors here could really be something like 300--

  • then the chances are you'll like the movie.

  • And if there's a mismatch, the chances are you're not

  • going to like the movie.

  • So what do you do?

  • You match the movie and the viewer factors, and then you add up

  • their contributions.

  • And then as a result of that, you get the predicted rating.

  • This is all good except for one problem, which is that this is really not

  • machine learning.

  • In order to produce this thing, you have to watch the movie, and analyze

  • the content.

  • You have to interview the viewer, and ask about their taste.

  • And then after that, you combine them and try to get

  • a prediction for the rating.

  • Now the idea of machine learning is that you don't have to do any of that.

  • All you do is sit down and sip on your tea, while the machine is doing

  • something to come up with this figure on its own.

  • So let's look at the learning approach.

  • So in the learning approach, we know that the viewer will be a vector of

  • different factors, and different components for every factor.

  • So this vector will be different from one viewer to another.

  • For example, one viewer will have a big blue content here, and one of them

  • will have a small blue content, depending on their taste.

  • And then, there is the movie.

  • And a particular movie will have different contents that correspond to those.

  • And the way we said we are computing the rating, is by simply taking these

  • and combining them and getting the rating.

  • Now what machine learning will do is reverse-engineer that process.

  • It starts from the rating, and then tries to find out what factors would be

  • consistent with that rating.

  • So think of it this way.

  • You start, let's say, with completely random factors.

  • So you take these guys, just random numbers from beginning to end, and

  • these guys, random numbers from beginning to end.

  • For every user and every movie, that's your starting point.

  • Obviously, there is no chance in the world that when you take the inner

  • product between these random factors, you'll get anything that

  • looks like the rating that actually took place, right?

  • But what you do is you take a rating that actually happened, and then you

  • start nudging the factors ever so slightly toward that rating.

  • Make the value of the inner product move closer to the rating.

  • Now it looks like a hopeless thing. I start with so many factors, they are

  • all random, and I'm trying to make them match a rating.

  • What are the chances?

  • Well the point is that you are going to do this not for one rating, but for

  • 100 million ratings.

  • And you keep cycling through the 100 million, over

  • and over and over.

  • And eventually, lo and behold, you find that the factors now are

  • meaningful in terms of the ratings.

  • And if you get a user, a viewer here, that didn't watch a movie, and you get

  • the vector that resulted from that learning process, and you get the

  • movie vector that resulted from that process, and you do the inner product,

  • lo and behold, you get a rating which is actually consistent with how that

  • viewer rates the movie.

  • That's the idea.

  • Now the solution I described is actually one of the winning

  • solutions in the competition that I mentioned.

  • So this is for real, this actually can be used.
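The nudging process described above is essentially stochastic gradient descent on a matrix factorization. Below is a minimal sketch of the idea, not the actual prize-winning system: the sizes, the learning rate, and the synthetic ratings are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_viewers, n_movies, n_factors = 50, 40, 5
lr = 0.01  # how hard we "nudge" the factors on each rating

# Hypothetical ground truth: a pattern exists, and it generates the ratings.
true_viewers = rng.normal(size=(n_viewers, n_factors))
true_movies = rng.normal(size=(n_movies, n_factors))
ratings = [(u, m, true_viewers[u] @ true_movies[m])
           for u in range(n_viewers) for m in range(n_movies)
           if rng.random() < 0.5]  # we only observe about half the ratings

# Start with completely random factors for every viewer and every movie.
viewers = rng.normal(scale=0.1, size=(n_viewers, n_factors))
movies = rng.normal(scale=0.1, size=(n_movies, n_factors))

def mse(V, M):
    """Mean squared error between inner products and observed ratings."""
    return float(np.mean([(r - V[u] @ M[m]) ** 2 for u, m, r in ratings]))

initial_error = mse(viewers, movies)

# Cycle through the ratings over and over, nudging both factor vectors
# ever so slightly so the inner product moves toward each rating.
for epoch in range(50):
    for u, m, r in ratings:
        error = r - viewers[u] @ movies[m]
        v_old = viewers[u].copy()
        viewers[u] += lr * error * movies[m]
        movies[m] += lr * error * v_old

final_error = mse(viewers, movies)
```

After enough passes the random factors become meaningful in terms of the ratings, and the inner product for a (viewer, movie) pair that was never rated serves as the predicted rating.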

  • Now with this example in mind, let's actually go to the

  • components of learning.

  • So now I would like to abstract from the learning problems that I see, what

  • are the mathematical components that make up the learning problem?

  • And I'm going to use a metaphor.

  • I'm going to use a metaphor now from another application domain, which

  • is a financial application.

  • So the metaphor we are going to use is credit approval.

  • You apply for a credit card, and the bank wants to decide whether it's

  • a good idea to extend a credit card for you or not.

  • From the bank's point of view, if they're going to make

  • money, they are happy.

  • If they are going to lose money, they are not happy.

  • That's the only criterion they have.

  • Now, very much like we didn't have a magic formula for deciding how

  • a viewer will rate a movie, the bank doesn't have a magic formula for

  • deciding whether a person is creditworthy or not.

  • What they're going to do, they're going to rely on historical records of

  • previous customers, and how their credit behavior turned out, and then

  • try to reverse-engineer the system, and when they get the system frozen,

  • they're going to apply it to a future customer.

  • That's the deal.

  • What are the components here?

  • Let's look at it.

  • First, you have the applicant information.

  • And the applicant information-- you look at this, and you can see that

  • there is the age, the gender, how much money you make, how much money you

  • owe, and all kinds of fields that are believed to be related to the

  • creditworthiness.

  • Again, pretty much like we did in the movie example, there is no question

  • that these fields are related to the creditworthiness.

  • They don't necessarily uniquely determine it, but they are related.

  • And the bank doesn't want a sure bet. They want to get the credit decision

  • as reliable as possible.

  • So they want to use that pattern, in order to be able to come up with

  • a good decision.

  • And they take this input, and they want to approve the credit or deny it.

  • So let's formalize this.

  • First, we are going to have an input.

  • And the input is called x. Surprise, surprise!

  • And that input happens to be the customer application.

  • So we can think of it as a d-dimensional vector, the first

  • component is the salary, years in residence, outstanding debt, whatever

  • the components are.

  • You put it as a vector, and that becomes the input.

  • Then we get the output y. The output y is simply the decision, either to

  • extend credit or not to extend credit, +1 and -1.

  • And being a good or bad customer, that is from the bank's point of view.

  • Now we have after that, the target function.

  • The target function is a function from a domain X, which is the

  • set of all of these x's.

  • So it is the set of vectors of d dimensions.

  • So it's a d-dimensional Euclidean space, in this case.

  • And then the Y is the set of y's.

  • Well, that's an easy one because y can only be +1 or -1,

  • accept or deny.

  • And therefore this is just a binary co-domain.

  • And this target function is the ideal credit approval formula, which we

  • don't know.

  • In all of our endeavors in machine learning, the target function is

  • unknown to us.

  • If it were known, nobody would need learning.

  • We just go ahead and implement it.

  • But we need to learn it because it is unknown to us.

  • So what are we going to do to learn it?

  • We are going to use data, examples.

  • So the data in this case is based on previous customer application records.

  • The input, which is the information in their applications, and the output,

  • which is how they turned out in hindsight.

  • This is not a question of prediction at the time they applied, but after

  • five years, they turned out to be a great customer.

  • So the bank says, if someone has these attributes again, let's approve

  • credit because these guys tend to make us money.

  • And this one made us lose a lot of money, so let's deny it, and so on.

  • And the historical records-- there are plenty of historical records.

  • All of this makes sense when you're talking about having 100,000 of

  • those guys.

  • Then you pretty much say, I will capture what the essence of that

  • function is.

  • So this is the data, and then you use the data, which is the historical

  • records, in order to get the hypothesis.

  • The hypothesis is the formal name we're going to call the formula we get

  • to approximate the target function.

  • So the hypothesis lives in the same world as the target function, and if

  • you look at the value of g, it supposedly approximates f.

  • While f is unknown to us, g is very much known--

  • actually we created it-- and the hope is that it does approximate f well.

  • That's the goal of learning.

  • So this notation will be our notation for the rest of the course, so get

  • used to it.

  • The target function is always f; the hypothesis we produce, which we'll

  • refer to as the final hypothesis, will be called g; the data will always

  • have that notation-- there are capital N points, which form the data set.

  • And the output is always y.

  • So this is the formula to be used.
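To make this notation concrete, here is a small sketch in Python type hints; the attribute values in the toy data set are made up for illustration.

```python
from typing import Callable, List, Tuple

# The input space X: each x is a d-dimensional vector of attributes
# (salary, years in residence, outstanding debt, and so on).
Input = Tuple[float, ...]
# The output space Y: y is +1 (extend credit) or -1 (deny credit).
Output = int

# The data set: N examples (x_1, y_1), ..., (x_N, y_N), where each y_n
# is how that customer turned out in hindsight.
DataSet = List[Tuple[Input, Output]]

# Both the unknown target f and the final hypothesis g have this type:
# a function from X to Y.
Hypothesis = Callable[[Input], Output]

# A hypothetical toy data set with d = 3 attributes.
data: DataSet = [
    ((60.0, 4.0, 2.0), +1),   # turned out to be a good customer
    ((25.0, 1.0, 9.0), -1),   # turned out to be a bad customer
]
```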

  • Now, let's put it in a diagram in order to analyze it a little bit more.

  • If you look at the diagram here, here is the target

  • function and it is unknown--

  • that is the ideal approval which we will never know, but that's what we're

  • hoping to get to approximate.

  • And we don't see it.

  • We see it only through the eyes of the training examples.

  • This is our vehicle of understanding what the target function is.

  • Otherwise the target function is a mysterious quantity for us.

  • And eventually, we would like to produce the final hypothesis.

  • The final hypothesis is the formula the bank is going to use in order to

  • approve or deny credit, with the hope that g approximates that f.

  • Now what connects those two guys?

  • This will be the learning algorithm.

  • So the learning algorithm takes the examples, and will produce the final

  • hypothesis, as we described in the example of the movie rating.

  • Now there is another component that goes into the learning algorithm.

  • So what the learning algorithm does, it creates the formula from a preset

  • model of formulas, a set of candidate formulas, if you will.

  • And these we are going to call the hypothesis set, a set of hypotheses

  • from which we are going to pick one hypothesis.

  • So from this H comes a bunch of small h's, which are functions that can be

  • candidates for being the credit approval.

  • And one of them will be picked by the learning algorithm, which happens to

  • be g, hopefully approximating f.

  • Now if you look at this part of the chain, from the target function to the

  • training to the learning algorithm to the final hypothesis, this is very

  • natural, and nobody will object to that.

  • But why do we have this hypothesis set?

  • Why not let the algorithm pick from anything?

  • Just create the formula, without being restricted to a particular set of

  • formulas H.

  • There are two reasons, and I want to explain them.

  • One of them is that there is no downside for including a hypothesis

  • set in the formalization.

  • And there is an upside.

  • So let me describe why there is no downside, and then describe why there

  • is an upside.

  • There is no downside for the simple reason that, from a practical point of

  • view, that's what you do.

  • You want to learn, you say I'm going to use a linear formula.

  • I'm going to use a neural network.

  • I'm going to use a support vector machine.

  • So you are already dictating a set of hypotheses.

  • If you happen to be a brave soul, and you don't want to restrict yourself at

  • all, very well, then your hypothesis set is the set of all possible

  • hypotheses.

  • Right?

  • So there is no loss of generality in putting it.

  • So there is no downside.

  • The upside is not obvious here, but it will become obvious as we go through

  • the theory.

  • The hypothesis set will play a pivotal role in the theory of learning.

  • It will tell us: can we learn, and how well we learn, and whatnot.

  • Therefore having it as an explicit component in the problem statement

  • will make the theory go through.

  • So that's why we have this figure.

  • Now, let me focus on the solution components of that figure.

  • What do I mean by the solution components?

  • If you look at this, the first part, which is the target-- let me try to

  • expand it--

  • so the target function is not under your control.

  • Someone knocks on my door and says: I want to approve credit.

  • That's the target function, I have no control over that.

  • And by the way, here are the historical records.

  • I have no control over that, so they give me the data.

  • And would you please hand me the final hypothesis?

  • That is what I'm going to give them at the end, before I receive my check.

  • So all of that is completely dictated.

  • Now let's look at the other part. The learning algorithm, and the hypothesis

  • set that we talked about, are your solution tools.

  • These are things you choose, in order to solve the problem.

  • And I would like to take a little bit of a look into what they look like,

  • and give you an example of them, so that you have a complete chain for

  • the entire figure in your mind.

  • From the target function, to the data set, to the learning algorithm,

  • hypothesis set, and the final hypothesis.

  • So, here is the hypothesis set.

  • We chose the notation H for the set, and the element will be given the

  • symbol small h.

  • So h is a function, pretty much like the final hypothesis g.

  • g is just one of them that you happen to select.

  • So when we select it, we call it g. If it's sitting there generically, we

  • call it h.

  • And then, when you put them together, they are referred to as

  • the learning model.

  • So if you're asked what is the learning model you are using, you're

  • actually choosing both a hypothesis set and a learning algorithm.

  • We'll see the perceptron in a moment, so this would be the

  • perceptron model, and this would be the perceptron learning algorithm.

  • This could be a neural network, and this would be backpropagation.

  • This could be support vector machines of some kind, let's say

  • radial basis function version, and this would be the quadratic programming.

  • So every time you have a model, there is a hypothesis set, and then there is

  • an algorithm that will do the searching and produce

  • one of those guys.

  • So this is the standard form

  • for the solution.

  • Now, let me go through a simple hypothesis set in detail so we have

  • something to implement.

  • So after the lecture, you can actually implement a learning algorithm on real

  • data if you want to.

  • This is not a glorious model. It's a very simple model. On the other hand,

  • it's a very clear model to pinpoint what we are talking about.

  • So here is the deal.

  • You have an input, and the input is x_1 up to x_d, as we said--

  • d-dimensional vector-- and each of them comes from the real numbers, just

  • for simplicity.

  • So this belongs to the real numbers.

  • And these are the attributes of a customer.

  • As we said, salary, years in residence, and whatnot.

  • So what does the perceptron model do?

  • It does a very simple formula.

  • It takes the attributes you have and gives them different weights, w.

  • So if, let's say, the salary is important, the chances are the w

  • corresponding to the salary will be big.

  • Some other attribute is not that important.

  • The chances are the w that goes with it is not that big.

  • Actually, outstanding debt is bad news.

  • If you owe a lot, that's not good.

  • So the chances are the weight will be negative for outstanding

  • debt, and so on.

  • Now you add them together, and you add them in a linear form-- that's what

  • makes it a perceptron--

  • and you can look at this as a credit score, of sorts.

  • Now you compare the credit score with a threshold.

  • If you exceed the threshold, they approve the credit card.

  • And if you don't, they deny the credit card.

  • So that is the formula they

  • settle on.

  • They have no idea, yet, what the w's and the threshold are, but they dictated the

  • formula-- the analytic form that they're going to use.

  • Now we take this and we put it in the formalization we had.

  • We have to define a hypothesis h, and this will tell us what is the

  • hypothesis set that has all the hypotheses that have the same

  • functional form.

  • So you can write it down as this.

  • This is a little bit long, but there's absolutely nothing to it.

  • This is your credit score, and this is the threshold you compare to by

  • subtracting.

  • If this quantity is positive, you belong to the first category and you will

  • approve credit.

  • If it's negative, you belong to the other and you will deny credit.

  • Well, the function that takes a real number, and produces the sign +1 or

  • -1, is called the sign.

  • So when you take the sign of this thing, this will indeed be +1 or

  • -1, and this will give the decision you want.

  • And that will be the form of your hypothesis.

  • Now let's put it in color, and you realize that what defines h is your

  • choice of w_i and the threshold.

  • These are the parameters that define one hypothesis versus the other.

  • x is an input that will be put into any hypothesis.

  • As far as we are concerned, when we are learning for example, the inputs

  • and outputs are already determined.

  • These are the data set.

  • But what we vary to get one hypothesis or another, and what the algorithm

  • needs to vary in order to choose the final hypothesis, are those parameters

  • which, in this case, are w_i and the threshold.

  • So let's look at it visually.

  • Let's assume that the data you are working

  • with is linearly separable.

  • In this linearly separable case, for example, you have nine data points.

  • And if you look at the nine data points, some of them were good

  • customers and some of them were bad customers.

  • And you would like now to apply the perceptron model, in order to separate

  • them correctly.

  • You would like to get to this situation, where the perceptron, which

  • is this purple line, separates the blue region from the red region or the

  • pink region, and indeed all the good customers belong to one, and the bad

  • customers belong to the other.

  • So you have hope that a future customer, whether they lie here or

  • here, will be classified correctly--

  • if there actually is a simple linear pattern here to be detected.

  • But when you start, you start with random weights, and the random weights

  • will give you any line.

  • So the purple line in both cases corresponds to the

  • purple parameters there.

  • One choice of these w's and the threshold corresponds to one line.

  • You change them, you get another line.

  • So you can see that the learning algorithm is playing around with these

  • parameters, and therefore moving the line around, trying to arrive at this

  • happy solution.

  • Now we are going to have a simple change of notation.

  • Instead of calling it threshold, we're going to treat it as if it's a weight.

  • It was minus the threshold. Now we call it plus w_0.

  • Absolutely nothing to it: all you need to do is choose w_0 to

  • be minus the threshold.

  • No big deal.

  • So why do we do that?

  • We do that because we are going to introduce an artificial coordinate.

  • Remember that the input was x_1 through x_d.

  • Now we're going to add x_0.

  • This is not an attribute of the customer, but

  • an artificial constant we add, which happens to be always +1.

  • Why are we doing this? You probably guessed.

  • Because when you do that, then all of a sudden the formula simplifies.

  • Now you are summing from i equals 0, instead of i equals 1.

  • So you added the zero term, and what is the zero term?

  • It's the threshold which you conveniently call w_0 with a plus sign,

  • multiplied by the 1.

  • So indeed, this formula will be equivalent to that one.

  • So it looks better.

  • And this is the standard notation we're going to use.

  • And now we put it as a vector form, which will simplify matters, so

  • in this case you will be having an inner product between a vector w,

  • a column vector, and a vector x.

  • So the vector w would be w_0, w_1, w_2, w_3, w_4, et cetera.

  • And x_0, x_1, x_2, et cetera.

  • And you do the inner product by taking a transpose, and you get a formula

  • which is exactly the formula you have here.

  • So now we are down to this formula for the perceptron hypothesis.
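Put together, the perceptron hypothesis fits in a few lines. This is a sketch with made-up weights; within the hypothesis set, the weight vector w is exactly what varies from one h to another.

```python
import numpy as np

def perceptron_h(w: np.ndarray, x: np.ndarray) -> int:
    """h(x) = sign(w^T x), where w = (w_0, w_1, ..., w_d) with
    w_0 = minus the threshold, and x = (1, x_1, ..., x_d) with the
    artificial coordinate x_0 = 1.  (We treat sign(0) as -1 here.)"""
    return 1 if w @ x > 0 else -1

# Hypothetical weights for d = 2 attributes, say salary and debt:
# salary helps the credit score, outstanding debt hurts it.
w = np.array([-1.0, 0.05, -0.2])        # w_0 = -threshold = -1
applicant = np.array([1.0, 60.0, 3.0])  # x_0 = 1, then the attributes

print(perceptron_h(w, applicant))  # credit score 1.4 > 0, so prints 1
```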

  • Now that we have the hypothesis set, let's look for the learning algorithm

  • that goes with it.

  • The hypothesis set tells you the resources you can work with.

  • Now we need the algorithm that is going to look at the data, the

  • training data that you're going to use, and navigate through the space

  • of hypotheses, to bring out the one that it is going to output as the final

  • hypothesis that you give to your customer.

  • So this one is called the perceptron learning algorithm, and it implements

  • this function.

  • What it does is the following.

  • It takes the training data.

  • That is always what a learning algorithm does. This is

  • its starting point.

  • So it takes existing customers, and their existing credit behavior in

  • hindsight--

  • that's what it uses--

  • and what does it do?

  • It tries to make the w correct.

  • So it really doesn't like at all when a point is misclassified.

  • So if a point is misclassified, it means that your w didn't do

  • the right job here.

  • So what does it mean to be a misclassified point here?

  • It means that when you apply your formula, with the current w--

  • the w is the one that the algorithm will play with--

  • apply it to this particular x.

  • Then what happens?

  • You get something that is not the credit behavior you want.

  • It is misclassified.

  • So what do we do when a point is misclassified?

  • We have to do something.

  • So what the algorithm does, it updates the weight vector.

  • It changes the weight, which changes the hypothesis, so that it behaves

  • better on that particular point.

  • And this is the formula that it does.

  • So I'll explain it in a moment.

  • Let me first try to explain the inner product in terms of agreement or

  • disagreement.

  • If you have the vector x and the vector w this way, their inner product

  • will be positive, and the sign will give you a +1.

  • If they are this way, the inner product will be negative, and the sign

  • will be -1.

  • So being misclassified means that either they are this way and the

  • output should be -1, or it's this way and output should be +1.

  • That's what makes it misclassified, right?

  • So if you look here at this formula, it takes the old w and adds something

  • that depends on the misclassified point.

  • Both in terms of the x_n and y_n.

  • y_n is just +1 or -1.

  • So here you are either adding a vector or subtracting a vector.

  • And we will see from this diagram that you're always doing so in such a way

  • that you make the point more likely to be correctly classified.

  • How is that?

  • If y equals +1, as you see here, then since the point

  • is misclassified, it must be that w dot x was negative.

  • Now when you modify this to w plus y x, it's actually w plus x.

  • You add x to w, and when you add x to w you get the blue vector instead of

  • the red vector.

  • And lo and behold, now the inner product is indeed positive.

  • And in the other case when it's -1, it is misclassified because they

  • were this way.

  • They give you +1 when it should be -1.

  • And when you apply the rule, since y is -1, you are actually

  • subtracting x.

  • So you subtract x and get this guy, and you will get the correct

  • classification.

  • So this is the intuition behind it.

  • However, it is not the intuition that makes this work.

  • There are a number of problems with this approach.

  • I just motivated that this is not a crazy rule.

  • Whether or not it's a working rule, that is yet to be seen.
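
  • To make the update rule concrete, here is a minimal numeric sketch of

  • one update, with made-up numbers and names that are not from the lecture:

```python
# A point x is misclassified when sign(w . x) disagrees with y.
# The PLA rule w <- w + y*x moves the inner product toward the correct sign.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def update(w, x, y):
    # the rule from the slide: w <- w + y * x
    return [wi + y * xi for wi, xi in zip(w, x)]

w = [0.5, -1.0]
x = [1.0, 2.0]
y = +1                       # the desired sign for this point
before = dot(w, x)           # negative, so the point is misclassified
w_new = update(w, x, y)
after = dot(w_new, x)        # now positive: this point is fixed
print(before, after)
```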

  • Let's look at the iterations of the perceptron learning algorithm.

  • Here is one iteration of PLA.

  • So you look at this thing, and the current w corresponds to

  • the purple line.

  • This guy is blue in the red region.

  • It means it's misclassified.

  • So now you would like to adjust the weights, that is move around

  • that purple line, such that the point is classified correctly.

  • If you apply the learning rule, you'll find that you're actually moving in

  • this direction, which means that the blue point will likely be correctly

  • classified after that iteration.

  • There is a problem because, let's say that I actually move

  • this guy in this direction.

  • Well this one, I got it right, but this one, which used to be right,

  • now is messed up.

  • It moved to the blue region, right?

  • And if you think about it, I'm trying to take care of one point, and I may be

  • messing up all other points, because I'm not taking them into

  • consideration.

  • Well, the good news for the perceptron

  • learning algorithm is that all you need to do, is for iterations 1,

  • 2, 3, 4, et cetera, pick a misclassified point, anyone you like.

  • And then apply the iteration to it.

  • The iteration we just talked about, which is this one.

  • The top one.

  • And that's it.

  • If you do that, and the data was originally linearly separable, then

  • you will eventually get to a correct solution.

  • You will get to something that classifies all of them correctly.

  • This is not an obvious statement.

  • It requires a proof.

  • The proof is not that hard.

  • But it gives us the simplest possible learning model we can think of.

  • It's a linear model, and this is your algorithm.

  • All you need to do is be very patient, because 1, 2, 3, 4-- this can be

  • a really long sequence.

  • At times it can be very long.

  • But it eventually converges.

  • That's the promise,

  • as long as the data is linearly separable.
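
  • As a hedged sketch (toy data and names of my own, not the lecture's),

  • the whole algorithm fits in a few lines:

```python
# Perceptron learning algorithm: repeatedly pick any misclassified point
# and apply w <- w + y*x, stopping when no misclassified points remain.
# Each input x carries a leading 1 so w includes the bias/threshold term.

def sign(v):
    return 1 if v > 0 else -1

def pla(points):
    w = [0.0] * len(points[0][0])
    while True:
        mis = [(x, y) for x, y in points
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != y]
        if not mis:
            return w         # converged: everything is classified correctly
        x, y = mis[0]        # "pick a misclassified point, any one you like"
        w = [wi + y * xi for wi, xi in zip(w, x)]

data = [([1.0, 2.0, 3.0], +1), ([1.0, 3.0, 1.0], +1),
        ([1.0, -1.0, -2.0], -1), ([1.0, -2.0, -1.0], -1)]
w = pla(data)
print(w)
```

  • This only terminates because the toy data here is linearly separable.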

  • So now we have one learning model, and if I give you now data from a bank--

  • previous customers and their credit behavior-- you can actually run the

  • perceptron learning algorithm, and come up with a final hypothesis g that you

  • can hand to the bank.

  • Not clear at all that it will be good, because all you did was match the

  • historical records.

  • Well, you may ask the question: if I match the historical records, does this

  • mean that I'm getting future customers right, which is the

  • only thing that matters?

  • The bank already knows what happened with the previous customers. It's just

  • using the data to help you find a good formula.

  • The formula will be good or not good to the extent that it applies to a new

  • customer, and can predict the behavior correctly.

  • Well, that's a loaded question which will be handled in

  • extreme detail, when we talk about the theory of learning.

  • That's why we have to develop all of this theory.

  • So, that's it.

  • And that is the perceptron learning algorithm.

  • Now let me go into the bigger picture of learning, because what I talked

  • about so far is one type of learning.

  • It happens to be by far the most popular, and the most used.

  • But there are other types of learning.

  • So let's talk about the premise of learning, from which the different

  • types came about.

  • That's what learning is about.

  • This is the premise that is common between any problem that you

  • would consider learning.

  • You use a set of observations, what we call data, to uncover

  • an underlying process.

  • In our case, the target function.

  • You can see that this is a very broad premise.

  • And therefore, you can see that people have rediscovered that over and over

  • and over, in so many disciplines.

  • Can you think of a discipline, other than machine learning, that uses that

  • as its exclusive premise?

  • Has anybody taken courses in statistics?

  • In statistics, that's what they do.

  • The underlying process is a probability distribution.

  • And the observations are samples generated by that distribution.

  • And you want to take the samples, and predict what the probability

  • distribution is.

  • And over and over, there are so many disciplines under different names.

  • Now when we talk about different types of learning, it's not like we sit down

  • and look at the world and say, this looks different from this because the

  • assumptions look different.

  • What you do is, you take this premise and apply it in a context.

  • And that calls for a certain amount of mathematics and algorithms.

  • If a particular set of assumptions takes you sufficiently far from the

  • mathematics and the algorithms used in the other disciplines that it

  • takes on a life of its own, and develops its own math and algorithms,

  • then you declare it a different type.

  • So when I list the types, it's not completely obvious just by the slide

  • itself, that these should be the types that you have.

  • But for what it's worth, these are the most important types.

  • First one is supervised learning, that's what we have

  • been talking about.

  • And I will discuss it in detail, and tell you why it's called supervised.

  • And it is, by far, the concentration of this course.

  • There is another one which is called unsupervised learning, and

  • unsupervised learning is very intriguing.

  • I will mention it briefly here, and then we will talk about a very famous

  • algorithm for unsupervised learning later in the course.

  • And the final type is reinforcement learning, which is even more

  • intriguing, and I will discuss it in a brief

  • introduction in a moment.

  • So let's take them one by one.

  • Supervised learning.

  • So what is supervised learning?

  • Anytime you have the data that is given to you, with the output

  • explicitly given-- here is the user and movie, and here is the rating.

  • Here is the previous customer, and here is their credit behavior.

  • It's as if a supervisor is helping you out, in order to be able to classify

  • the future ones.

  • That's why it's called supervised.

  • Let's take an example of coin recognition, just to be able to

  • contrast it with unsupervised learning in a moment.

  • Let's say you have a vending machine, and you would like to make

  • the system able to recognize the coins.

  • So what do you do?

  • You have physical measurements of the coin. Let's be simplistic and say we

  • measure the size and mass of the coin you put in.

  • Now the coins will be quarters, nickels, pennies, and dimes.

  • 25, 5, 1, and 10.

  • And when you put the data in this diagram, they will belong there.

  • So the quarters, for example, are bigger, so they will belong here.

  • And the dimes in the US currency happen to be the smallest of them,

  • so they are smallest here, and there will be a scatter because of the error

  • in measurement, because of the exposure to the elements, and whatnot.

  • So let's say that this is your training data, and it's supervised

  • because things are colored.

  • I gave you those and told you they are 25 cents, 5 cents, et cetera.

  • So you use those in order to train a system, and the system will then be

  • able to classify a future one.

  • For example, if we stick to the linear approach, you may be able to

  • find separator lines like those.

  • And those separator lines will separate, based on the data, the 10

  • from the 1 from the 5 from the 25.

  • And once you have those,

  • you can bid farewell to the data. You don't need it anymore.

  • And when you get a future coin that is now unlabeled, you don't know what it

  • is, when the vending machine is actually working, then the coin will

  • lie in one region or another, and you're going to classify it accordingly.

  • So that is supervised learning.
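
  • As a hedged illustration (the lecture's slide draws separator lines;

  • here I use a simpler nearest-class-mean rule, with approximate coin specs):

```python
# Train from labeled (size mm, mass g) samples, then discard the data and
# classify a future coin by the nearest class mean. Numbers are approximate
# US coin specs, used here only for illustration.

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def classify(centroids, p):
    # squared Euclidean distance is enough for picking the nearest center
    return min(centroids, key=lambda d: (centroids[d][0] - p[0]) ** 2
                                        + (centroids[d][1] - p[1]) ** 2)

labeled = {25: [(24.3, 5.6), (24.2, 5.7)],
           10: [(17.9, 2.2), (18.0, 2.3)],
           5:  [(21.2, 5.0), (21.3, 4.9)],
           1:  [(19.0, 2.5), (19.1, 2.4)]}
centroids = {d: mean(pts) for d, pts in labeled.items()}

coin = (24.0, 5.5)           # a future, unlabeled coin
print(classify(centroids, coin))
```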

  • Now let's look at unsupervised learning.

  • For unsupervised learning, instead of having the examples, the training data,

  • having this form which is the input plus the correct

  • target-- the correct output--

  • the customer and how they behaved in reality in credit,

  • we are going to have examples that have less information, so much less it

  • is laughable.

  • I'm just going to tell you what the input is.

  • And I'm not going to tell you what the target function is at all.

  • I'm not going to tell you anything about the target function.

  • I'm just going to tell you, here is the data of a customer.

  • Good luck, try to predict the credit.

  • OK--

  • How in the world are we going to do that?

  • Let me show you that the situation is not totally hopeless.

  • That's what I'm going to achieve.

  • I'm not going to tell you how to do it completely.

  • But let me show you that a situation like that is not totally hopeless.

  • Let's go for the coin example.

  • For the coin example, we have data that looks like this.

  • If I didn't tell you what the denominations are, the data

  • would look like this.

  • Right?

  • You have the measurements, but you don't know, is that a quarter, is

  • it-- you don't know.

  • Now honestly, if you look at this thing, you say I can know

  • something from this figure.

  • Things tend to cluster together.

  • So I may be able to classify those clusters into categories, without

  • knowing what the categories are.

  • That will be quite an achievement already.

  • You still don't know whether it's 25 cents, or whatever.

  • But the data actually made you able to do something that is

  • a significant step.

  • You're going to be able to come up with these boundaries.

  • And now, you are so close to finding the full system.

  • So unlabeled data actually can be pretty useful.

  • Obviously, I have seen the colored ones, so I actually chose the

  • boundaries right because I still remember them visually.

  • But if you look at the clusters and you have never heard about that,

  • especially these guys might not look like two clusters.

  • They may look like one cluster.

  • So it actually could be that this is ambiguous, and indeed in unsupervised

  • learning, the number of clusters is ambiguous at times.

  • And then, what you do--

  • this is the output of your system. Now, I can categorize the

  • coins into types.

  • I'm just going to call them types: type 1, type 2,

  • type 3, type 4.

  • I have no idea which belongs to which, but obviously if someone comes with

  • a single example of a quarter, a dime, et cetera, then you are ready to go.

  • Whereas before, you had to have lots of examples in order to choose where

  • exactly to put the boundary.

  • And this is why a set like that, which looks like a complete

  • jungle, is actually useful.
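
  • A standard way to find such clusters is k-means (my choice of algorithm;

  • the lecture doesn't name one here). A minimal sketch on made-up data:

```python
# k-means: alternate between assigning each unlabeled point to its nearest
# center and moving each center to the mean of its assigned points.

def kmeans(points, centers, iters=20):
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda i: (centers[i][0] - p[0]) ** 2
                                  + (centers[i][1] - p[1]) ** 2)
            groups[i].append(p)
        centers = [(sum(p[0] for p in g) / len(g),
                    sum(p[1] for p in g) / len(g)) if g else c
                   for c, g in zip(centers, groups)]
    return centers

# two tight groups of (size, mass) measurements, labels unknown
points = [(17.9, 2.2), (18.0, 2.3), (18.1, 2.2),
          (24.2, 5.6), (24.3, 5.7), (24.1, 5.6)]
centers = kmeans(points, centers=[(18.0, 2.0), (24.0, 6.0)])
print(centers)
```

  • The centers it finds are the "types"; a single labeled example per

  • cluster is then enough to name them.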

  • Let me give you another interesting example of unsupervised learning,

  • where I give you the input without the output, and you are actually in

  • a better situation to learn.

  • Let's say that your company or your school in this case, is sending you

  • for a semester in Rio de Janeiro.

  • So you're very excited, and you decide that you'd better learn some

  • Portuguese, in order to be able to speak the language when you arrive.

  • Not to worry, when you arrive, there will be a tutor who teaches you

  • Portuguese.

  • But you have a month to go, and you want to help

  • yourself as much as possible.

  • You look around, and you find that the only resource you have is a radio

  • station in Portuguese in your car.

  • So what you do, you just turn it on whenever you drive.

  • And for an entire month, you're bombarded with Portuguese.

  • "tudo bem", "como vai", "valeu", stuff like that comes back.

  • After a while, without knowing anything-- it's unsupervised, nobody

  • told you the meaning of any word--

  • you start to develop a model of the language in your mind.

  • You know what the idioms are, et cetera.

  • You are very eager to know what actually "tudo bem"

  • -- what does that mean?

  • You are ready to learn, and once you learn it, it's actually

  • fixed in your mind.

  • Then when you go there, you will learn the language faster than if you didn't

  • go through this experience.

  • So you can think of unsupervised learning, in one way or another, as

  • a way of getting a higher-level representation of the input.

  • Whether it's extremely high level as in clusters-- you forgot all the

  • attributes and you just tell me a label, or higher level as in this-- a better

  • representation than just the crude input into some model

  • in your mind.

  • Now let's talk about reinforcement learning.

  • In this case, it's not as bad as unsupervised learning.

  • So again, without the benefit of supervised learning, you don't get

  • the correct output.

  • What you do is-- I will give you the input.

  • OK, thank you very much, that's very kind.

  • What else?

  • I'm going to give you some output.

  • The correct output?

  • No!

  • Some output.

  • OK, that's very nice, but doesn't seem very helpful.

  • It looks now like unsupervised learning, because in unsupervised learning I

  • could give you some output.

  • Here is a dime. Oh, it's a quarter.

  • It's some output!

  • Such output has no information.

  • The information comes from the next one.

  • I'm going to grade this output.

  • So that is the information provided to you.

  • So I'm not explicitly giving you the output, but when you choose an output,

  • I'm going to tell you how well you're doing.

  • Reinforcement learning is interesting because it is mostly our own

  • experience in learning.

  • Think of a toddler, and a hot cup of tea in front of her.

  • She is looking at it, and she is very curious.

  • So she reaches to touch. Ouch!

  • And she starts crying.

  • The reward is very negative for trying.

  • Now next time she looks at it, and she remembers the previous experience, and

  • she doesn't touch it.

  • But there is a certain level of pain, because there is an unfulfilled

  • curiosity.

  • And curiosity killed the cat. In three or four trials, the toddler

  • tries again.

  • Maybe now it's OK.

  • And Ouch!

  • Eventually from just the grade of the behavior of to touch it or not to

  • touch it, the toddler will learn not to touch cups of tea that have smoke

  • coming out of them.

  • So that is a case of reinforcement learning.

  • The most important application, or one of the most important applications of

  • reinforcement learning, is in playing games.

  • So backgammon is one of the games, and think that you want

  • a system to learn it.

  • So what you want, you want to take the current state of the board, and you

  • roll the dice, and then you decide what is the optimal move in

  • order to stand the best chance to win.

  • That's the game.

  • So the target function is the best move given a state.

  • Now, if I have to generate those things in order for the system to

  • learn, then I must be a pretty good backgammon player already.

  • So now it's a vicious cycle.

  • Now, reinforcement learning comes in handy.

  • What you're going to do, you are going to have the

  • computer choose any output.

  • A crazy move, for all you care.

  • And then see what happens eventually.

  • So this computer is playing against another computer, both of

  • them want to learn.

  • And you make a move, and eventually you win or lose.

  • So you propagate back the credit because of winning or losing,

  • according to a very specific and sophisticated formula, into all the

  • moves that happened.

  • Now you think that's completely hopeless, because maybe this is not the

  • move that resulted in this, it's another move.

  • But always remember, that you are going to do this 100 billion times.

  • Not you, the poor computer.

  • You're sitting down sipping your tea.

  • A computer is doing this, playing against an imaginary opponent, and

  • they keep playing and playing and playing.

  • And in three hours of CPU time, you go back to the computer-- maybe not three

  • hours, maybe three days of CPU time-- you go back to the computer, and you

  • have a backgammon champion.

  • Actually, that's true.

  • The world champion, at some point, was a neural network that learned the way

  • I described.

  • So it is actually a very attractive approach, because in machine

  • learning now, we have a target function that we cannot model.

  • That covers a lot of territory, I've seen a lot of those.

  • We have data coming from the target function.

  • I usually have that.

  • And now we have the lazy man's approach to life.

  • We are going to sit down, and let the computer do all of the work, and

  • produce the system we want.

  • Instead of studying the thing mathematically, and writing code, and

  • debugging--

  • I hate debugging.

  • And then you go. No, we're not going to do that.

  • The learning algorithm just works, and produces something good.

  • And we get the check.

  • So this is a pretty good deal.

  • It actually is so good, it might be too good to be true.

  • So let's actually examine if all of this was a fantasy.

  • So now I'm going to give you a learning puzzle.

  • Humans are very good learners, right?

  • So I'm now going to give you a learning problem in the form that I

  • described, a supervised learning problem.

  • And that supervised learning problem will give you a training set, some

  • points mapped to +1, some points mapped to -1.

  • And then I'm going to give you a test point that is unlabeled.

  • Your task is to look at the examples, learn the target function, apply it to

  • the test point, and then decide what the value of the function is.

  • After that, I'm going to ask, who decided that the function is +1,

  • and who decided that the function is -1.

  • OK? It's clear what the deal is.

  • And I would like our online audience to do the same thing.

  • And please text what the solution is.

  • Just +1 or -1.

  • Fair enough?

  • Let's start the game.

  • What is above the line are the training examples.

  • I put the input as a three-by-three pattern in order to be visually easy

  • to understand.

  • But this is just really nine bits worth of information.

  • And they are ones and zeros, black and white.

  • And for this input, this input, and this input, the value of the target

  • function is -1.

  • For this input, this input, and this input, the value of the target

  • function is +1.

  • Now this is your data set, this is your training set.

  • Now you should learn the function.

  • And when you're done, could you please tell me what your function will return

  • on this test point?

  • Is it +1 or -1?

  • I will give everybody 30 seconds before I ask for an answer.

  • Maybe we should have some background music?

  • OK, time's up.

  • Your learning algorithm has converged, I hope.

  • And now we apply it here, and I ask people here, who says it's +1?

  • Thank you.

  • Who says it's -1?

  • Thank you.

  • I see that the online audience also contributed?

  • MODERATOR: Yeah, the big majority says +1.

  • PROFESSOR: But are there -1's?

  • MODERATOR: Two -1's.

  • PROFESSOR: Cool.

  • I don't care if it's a +1 or -1.

  • What I care about is that I get both answers.

  • That is the essence of it.

  • Why do I care?

  • Because in reality, this is an impossible task.

  • I told you the target function is unknown.

  • It could be anything, really anything.

  • And now I give you the value of the target function at 6 points.

  • Well, there are many functions that fit those 6 points, and behave

  • differently outside.

  • For example, if you take the function to be +1 if the top left square

  • is white, then this should be -1, right?

  • If you take the function to be +1 if the pattern is symmetric--

  • let's see, I said it the other way around.

  • So the top one is black, it would be -1.

  • So this would be -1.

  • If it's symmetric, it would be +1.

  • So this would be +1, because this guy has both-- this is

  • black, and also it is symmetric.

  • Right?

  • And you can find infinite variety like that.
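
  • Since the slide's patterns are not in the transcript, here are

  • hypothetical stand-ins with the same property, just to make the point executable:

```python
# Two hypotheses -- "+1 iff the top-left square is white" and "+1 iff the
# pattern is left-right symmetric" -- agree on all six training examples
# below, yet disagree on the test point. 0 = white, 1 = black, row-major.

def h_topleft(p):
    return +1 if p[0] == 0 else -1

def h_symmetric(p):
    return +1 if all(p[3 * r] == p[3 * r + 2] for r in range(3)) else -1

train = [([0,1,0, 1,0,1, 0,1,0], +1),
         ([0,0,0, 1,1,1, 0,0,0], +1),
         ([0,1,0, 0,1,0, 0,1,0], +1),
         ([1,0,0, 0,0,0, 0,0,0], -1),
         ([1,1,0, 0,1,0, 0,0,0], -1),
         ([1,0,1, 1,0,0, 0,0,0], -1)]
test_point = [1,0,1, 0,1,0, 1,0,1]   # top-left black AND symmetric

fits_both = all(h_topleft(p) == y == h_symmetric(p) for p, y in train)
print(fits_both, h_topleft(test_point), h_symmetric(test_point))  # True -1 1
```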

  • And that problem is not restricted to this case.

  • The question here is obvious.

  • The function is unknown.

  • You really mean unknown, right?

  • Yes, I mean it.

  • Unknown-- anything?

  • Yes, I do.

  • OK.

  • You give me a finite sample, it can be anything outside.

  • How in the world am I going to tell what the function does outside?

  • OK, that sounds about right.

  • But we are in trouble, because that's the premise of learning.

  • If the goal was to memorize the examples I gave you, that would be

  • memorizing, not learning.

  • Learning is to figure out a pattern that applies outside.

  • And now we realize that outside, I cannot say anything.

  • Does this mean that learning is doomed?

  • Well, this is going to be a very short course!

  • Well, the good news is that learning is alive and well.

  • And we are going to show that, without compromising our basic premise.

  • The target function will continue to be unknown.

  • And we still mean unknown.

  • And we will be able to learn.

  • And that will be the subject of the next lecture.

  • Right now, we are going to go for a short break, after which we are going

  • to take the Q&A.

  • We'll start the Q&A, and we will get questions from the class here, and

  • from the online audience.

  • And if you'd like to ask a question, let me ask you to go to this side of

  • the room where the mic is, so that your question can be heard.

  • And we will alternate, if there are questions here, we will alternate

  • between campus and off campus.

  • So let me start if there is a question from outside.

  • MODERATOR: Yes, so the most common question is, how do you determine if

  • a set of points is linearly separable, and what do you do

  • if they're not separable.

  • PROFESSOR: The linear separability assumption is a very

  • simplistic assumption, and doesn't apply mostly in practice.

  • And I chose it only because it goes with a very simple algorithm, which is

  • the perceptron learning algorithm.

  • There are two ways to deal with the case of linear inseparability.

  • There are algorithms, and most algorithms actually deal with that

  • case, and there's also a technique that we are going to study next

  • week, which will take a set of points which is not linearly separable, and

  • create a mapping that makes them linearly separable.

  • So there is a way to deal with it.

  • However, the question how do you determine it's linearly separable, the

  • right way of doing it in practice is that, when someone gives you data, you

  • assume in general it's not linearly separable.

  • It will hardly ever be, and therefore you take techniques that can deal with

  • that case as well.

  • There is a simple modification of the perceptron learning algorithm, which

  • is called the pocket algorithm,

  • that applies the same rule with a very minor modification, and deals with the

  • case where the data is not separable.

  • However, if you apply the perceptron learning algorithm, that is guaranteed

  • to converge to a correct solution in the case of linear separability, and

  • you apply it to data that is not linearly separable, bad things happen.

  • Not only is it going not to converge, obviously it is not going to converge

  • because it terminates when there are no misclassified points, right?

  • If there is a misclassified point, then there's a next iteration always.

  • So since the data is not linearly separable, we will never come to

  • a point where all the points are classified correctly.

  • So this is not what is bothering us.

  • What is bothering us is that, as you go from one step to another, you can

  • go from a very good solution to a terrible solution.

  • In the case of no linear separability.

  • So it's not an algorithm that you would like to use, and just

  • terminate by force at an iteration.

  • A modification of it can be used this way, and I'll mention it briefly when

  • we talk about linear regression and other linear methods.
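
  • For what it's worth, here is a sketch of the pocket idea (my reading of

  • it, not the lecture's code): run PLA, but keep the best weights seen so far.

```python
# Pocket algorithm: apply the ordinary PLA update, but keep "in your pocket"
# the weight vector with the fewest errors so far, so a bad late update
# cannot ruin the final answer on non-separable data.

def sign(v):
    return 1 if v > 0 else -1

def errors(w, points):
    return sum(1 for x, y in points
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != y)

def pocket(points, iters=100):
    w = [0.0] * len(points[0][0])
    best_w, best_e = w, errors(w, points)
    for _ in range(iters):
        mis = [(x, y) for x, y in points
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != y]
        if not mis:
            return w                    # separable case: done
        x, y = mis[0]
        w = [wi + y * xi for wi, xi in zip(w, x)]   # ordinary PLA step
        e = errors(w, points)
        if e < best_e:                  # the minor modification
            best_w, best_e = w, e
    return best_w

# toy data that is not linearly separable: one label flipped on purpose
data = [([1.0, 2.0, 2.0], +1), ([1.0, 2.5, 1.5], +1),
        ([1.0, -2.0, -2.0], -1), ([1.0, -1.5, -2.5], -1),
        ([1.0, 2.2, 1.8], -1)]          # the "noisy" point
w = pocket(data)
print(errors(w, data))
```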

  • MODERATOR: There's also a question of how does the rate of convergence of

  • the perceptron change with the dimensionality of the data?

  • PROFESSOR: Badly!

  • That's the answer.

  • Let me put it this way.

  • You can build pathological cases, where it really will take forever.

  • However, I did not give the perceptron learning algorithm in the first

  • lecture to tell you that this is the great algorithm that you

  • need to learn.

  • I gave it in the first lecture, because this is the simplest

  • algorithm I could give.

  • By the end of this course, you'll be saying, what?

  • Perceptron?

  • Never heard of it.

  • So it will go out of contention, after we get to the more interesting stuff.

  • But as a method that can be used, it indeed can be used, and can be

  • explained in five minutes as you have seen.

  • MODERATOR: Regarding the items for learning, you mentioned that there

  • must be a pattern.

  • So can you be more specific about that?

  • How do you know if there's a pattern?

  • PROFESSOR: You don't.

  • My answers seem to be very abrupt, but that's the way it is.

  • When we get to the theory-- is learning feasible-- it will

  • become very clear that there is a separation between the target

  • function-- there is a pattern to detect--

  • and whether we can learn it.

  • It is very difficult for me to explain it in two minutes, it will take a full

  • lecture to get there.

  • But the essence of it is that you take the data, you apply your learning

  • algorithm, and there is something you can explicitly detect that will

  • tell you whether you learned or not.

  • So in some cases, you're not going to be able to learn.

  • In some cases, you'll be able to learn.

  • And the key is that you're going to be able to tell by

  • running your algorithm.

  • And I'm going to explain that in more details later on.

  • So basically, I'm also resisting taking the data, deciding

  • whether it's linearly separable, looking at it and seeing. You will

  • realize as we go through that it's a no-no to actually look at the data.

  • What?

  • That's what data is for, to look at.

  • Bear with me.

  • We will come to the level where we ask why don't we look at the data--

  • just looking at it and then saying: It's linearly separable.

  • Let's pick the perceptron.

  • That's bad practice, for reasons that are not obvious now.

  • They will become obvious, once we are done with the theory.

  • So when someone knocks on my door with a set of data, I can ask them all

  • kinds of questions about the data-- not the particular data set that they gave

  • me, but about the general data that is generated by their process.

  • They can tell me this variable is important, the function is symmetric,

  • they can give you all kinds of information that I will take to heart.

  • But I will try, as much as I can, to avoid looking at the particular data

  • set that they gave me, lest I should tailor my system toward this data set,

  • and be disappointed when another data set comes about.

  • You don't want to get too close to the data set.

  • This will become very clear as we go with the theory.

  • MODERATOR: In general about machine learning, how does it

  • relate to other statistical, especially econometric techniques?

  • PROFESSOR: Statistics is, in the form I said, it's machine

  • learning where the target--

  • it's not a function in this case-- is a probability distribution.

  • Statistics is a mathematical field.

  • And therefore, you put the assumptions that you need in order to be able to

  • rigorously prove the results you have, and get the results in detail.

  • For example, linear regression.

  • When we talk about linear regression, it will have very few assumptions, and

  • the results will apply to a wide range, because we didn't make too many

  • assumptions.

  • When you study linear regression under statistics, there is a lot of

  • mathematics that goes with it, lot of assumptions, because that is the

  • purpose of the field.

  • In general, machine learning tries to make the least assumptions and cover the

  • most territory. These go together.

  • So it is not a mathematical discipline, but it's not a purely

  • applied discipline.

  • It spans both the mathematical, to certain extent, but it is willing to

  • actually go into territory where we don't have mathematical models, and

  • still want to apply our techniques.

  • So that is what characterizes it the most.

  • And then there are other fields. Machine learning can also be found

  • under the names computational learning

  • or statistical learning.

  • Data mining has a huge intersection with machine learning.

  • There are lots of disciplines around that actually share some value.

  • But the point is, the premise that you saw is so broad, that it shouldn't be

  • surprising that people at different times developed a particular discipline

  • with its own jargon, to deal with that discipline.

  • So what I'm giving you is machine learning as the mainstream goes, and

  • that can be applied as widely as possible to applications, both

  • practical applications and scientific applications.

  • You will see, here is a situation, I have an experiment, here is a target,

  • I have the data.

  • How do I produce the target in the best way I want?

  • And then you apply machine learning.

  • MODERATOR: Also, in a general question about machine learning.

  • Do machine learning algorithms perform global optimization methods,

  • or just local optimization methods?

  • PROFESSOR: Obviously, a general question.

  • Optimization is a tool for machine learning.

  • So we will pick whatever optimization that does the job for us.

  • And sometimes, there is a very specific optimization method.

  • For example, in support vector machines, it will be quadratic

  • programming.

  • It happens to be the one that works with that.

  • But optimization is not something that machine learning people

  • study for its own sake.

  • They obviously study it to understand it better, and to choose the correct

  • optimization method.

  • Now, the question is alluding to something that will

  • become clear when we talk about neural networks, which is local minimum versus

  • global minimum.

  • And it is impossible to put this in any perspective before we get the

  • details of neural networks, so I will defer that until

  • we get to that lecture.

  • MODERATOR: Also, this is a math question, I guess.

  • Is the hypothesis set, in a topological sense, continuous?

  • PROFESSOR: The hypothesis set can be anything, in principle.

  • So it can be continuous, and it can be discrete.

  • For example, in the next lecture I take the simplest case where we have

  • a finite hypothesis set, in order to make a certain point.

  • In reality, almost all the hypothesis sets that you find are

  • continuous and infinite.

  • Very infinite!

  • And the level of sophistication of the hypothesis set can be huge.

  • And nonetheless, we will be able to see that under one condition, which

  • comes from the theory, we'll be able to learn even if the hypothesis set is

  • huge and complicated.

  • There's a question from inside, yes?

  • STUDENT: I think I understood, more or less, the general idea, but I don't

  • understand the second example you gave about credit approval.

  • So how do we collect our data?

  • Should we give credit to everyone, or should we accept that our data is

  • biased? Because for the persons we rejected, we can't determine whether

  • we should have given them credit.

  • PROFESSOR: Correct.

  • This is a good point. Every time someone asks a question, the

  • lecture number comes to my mind.

  • I know when I'm going to talk about it.

  • So what you describe is called sampling bias.

  • And I will describe it in detail.

  • But when you use the biased data, let's say the bank uses historical records.

  • So it sees the people who applied and were accepted, and for those guys, it

  • can actually predict what the credit behavior is, because it has their

  • credit history.

  • They charged and repaid and maxed out, and all of this.

  • And then they decide: is this a good customer or not?

  • For those who were rejected, there's really no way to tell in this case

  • whether they were falsely rejected, that they would have been good

  • customers or not.

  • Nonetheless, if you take the customer base that you have, and base your

  • decision on it, the boundary works fairly decently.

  • Actually, pretty decently, even for the other guys, because the other guys

  • usually are deeper into the classification region than the

  • boundary guys that you accepted, and turned out to be bad.

  • But the point is well taken.

  • The data set in this case is not completely representative, and there

  • is a particular principle in learning that we'll talk about, which is

  • sampling bias, that deals with this case.

  • Another question from here?

  • STUDENT: You explained that we need to have a lot of data to learn.

  • So how do you decide how much data is required for

  • a particular problem, in order to be able to come up with a reasonable--

  • PROFESSOR: Good question.

  • So let me tell you the theoretical, and the practical answer.

  • The theoretical answer is that this is exactly the crux of the theory part

  • that we're going to talk about.

  • And in the theory, we are going to see, can we learn?

  • And how much data.

  • So all of this will be answered in a mathematical way.

  • So this is the theoretical answer.

  • The practical answer is: that's not under your control.

  • When someone knocks on your door: Here is the data, I have 500 points.

  • I tell him, I will give you a fantastic system if you

  • just give me 2000.

  • But I don't have 2000, I have 500.

  • So now you go and you use your theory to do something to your system, such

  • that it can work with the 500.

  • There was one case--

  • I worked with data in different applications--

  • at some point, we had almost 100 million points.

  • You were swimming in data.

  • You wouldn't complain about data.

  • Data was wonderful.

  • And in another case, there were less than 100 points.

  • And you had to deal with the data with gloves!

  • Because if you use them the wrong way, they are contaminated, which is

  • an expression we will see, and then you have nothing.

  • And you will produce a system, and you are proud of it, but you have no idea

  • whether it will perform well or not.

  • And you cannot give this to the customer, and have the customer come

  • back to you and say: what did you do!?

  • So there is a question of, what performance can you do given

  • what data size you have?

  • But in practice, you really have no control over the data size in almost

  • all the cases, almost all the practical cases.

  • Yes?

  • STUDENT: Another question I have is regarding the hypothesis set.

  • So the larger the hypothesis set is, probably I'll be able to

  • better fit the data.

  • But that, as you were explaining, might be a bad thing to do because

  • when the new data point comes, there might be troubles.

  • So how do you decide the size of your--

  • PROFESSOR: You are asking all the right questions, and all of

  • them are coming up.

  • This is again part of the theory, but let me try to explain this.

  • As we mentioned, learning is about being able to predict.

  • So you are using the data, not to memorize it, but to figure out what

  • the pattern is.

  • And if you figure out a pattern that applies to all the data, and it's

  • a reasonable pattern, then you have a chance that it

  • will generalize outside.

  • Now the problem is that, if I give you 50 points, and you use a 7000th-order

  • polynomial, you will fit the heck out of the data.

  • You will fit it so much with so many degrees of freedom to spare, but you

  • haven't learned anything.

  • You just memorized it in a fancy way.

  • You put it in a polynomial form, and that actually carries all the

  • information about the data that you have,

  • and then some.

  • So you don't expect at all that this will generalize outside.

  • And that intuitive observation will be formalized when we

  • talk about the theory.

  • There will be a measurement of the hypothesis set that you give me, that

  • measures the sophistication of it, and will tell you with that

  • sophistication, you need that amount of data in order to be able to make

  • any statement about generalization.

  • So that is what the theory is about.
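
To see the professor's point numerically, here is a small sketch of my own (an illustration, not code from the lecture): fit a high-order polynomial to a few noisy points drawn from a simple linear target, then compare the in-sample fit with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simple target (a line) plus a little noise, sampled at 10 points.
x_train = np.linspace(-1, 1, 10)
y_train = x_train + 0.1 * rng.standard_normal(10)

# A 9th-order polynomial has 10 coefficients: enough degrees of
# freedom to pass through all 10 points, i.e. to memorize the data.
coeffs = np.polyfit(x_train, y_train, deg=9)

# In-sample error: essentially zero, by construction.
train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# Out-of-sample error: evaluate against the clean target at new points.
x_test = np.linspace(-0.95, 0.95, 50)
test_error = np.mean((np.polyval(coeffs, x_test) - x_test) ** 2)

print(f"in-sample MSE:     {train_error:.2e}")
print(f"out-of-sample MSE: {test_error:.2e}")
```

The in-sample error comes out near machine precision while the out-of-sample error does not: the polynomial has memorized the noise rather than learned the pattern.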

  • STUDENT: Suppose -- in whatever we discussed here, I

  • had a data set and I came up with an algorithm, and it gave the output.

  • But wouldn't it also be important to see: OK, we came up with the output, and

  • using that, what was the feedback?

  • Are there techniques where you take the feedback and try to

  • correct your--

  • PROFESSOR: You are alluding to different techniques here.

  • But one of them would be validation, which is after you learn, you validate

  • your solution.

  • And this is an extremely established and core technique in machine learning

  • that will be covered in one of the lectures.

  • Any questions from the online audience?

  • MODERATOR: In practice, how many dimensions would be considered easy,

  • medium, and hard for a perceptron problem?

  • PROFESSOR: The notion of hard,

  • in most people's minds before they get into machine learning, is

  • computation time.

  • If something takes a lot of time, then it's a hard problem.

  • If something can be computed quickly, it's an easy problem.

  • For machine learning, the bottleneck in my case, has never been the

  • computation time, even in incredibly big data sets.

  • The bottleneck for machine learning is to be able to generalize outside the

  • data that you have seen.

  • So to answer your question, the perceptron behaves badly in terms of

  • the computational behavior.

  • We will be able to predict its generalization behavior, based on the

  • number of dimensions and the amount of data.

  • This will be given explicitly.

  • And therefore, the perceptron algorithm is bad computationally, good

  • in terms of generalization.

  • If you actually can get away with perceptrons, your chances of

  • generalizing are good because it's a simplistic

  • model, and therefore its ability to generalize is good, as we will see.

  • MODERATOR: Also, in the example you explained the use of a binary function.

  • So can you use multi-valued or real-valued functions instead?

  • PROFESSOR: Correct.

  • Remember when I told you that there is a topic that is out of sequence.

  • There was a logical sequence to the course, and then I took part of the

  • linear models and put it very early on, to give you something a little bit

  • more sophisticated than perceptrons to try your hand on.

  • That happens to be for real-valued functions.

  • And obviously there are hypotheses that cover all types of co-domains.

  • Y could be anything as well.

  • MODERATOR: Another question is, in the learning process you showed, when

  • do you pick your learning algorithm, when do you pick your hypothesis set,

  • and what liberty do you have?

  • PROFESSOR: The hypothesis set is the most important aspect of

  • determining the generalization behavior that we'll talk about.

  • The learning algorithm does play a role, although it is a secondary role,

  • as we will see in the discussion.

  • So in general, the learning algorithm has the form of

  • minimizing an error function.

  • So you can think of the perceptron, what does

  • the algorithm do?

  • It tries to minimize the classification error.

  • That is your error function, and you're minimizing it using this

  • particular update rule.

  • And in other cases, we'll see that we are minimizing an error function.

  • Now the minimization aspect is an optimization question, and once you

  • determine that this is indeed the error function that I want to

  • minimize, then you go and minimize as much as you can using the most

  • sophisticated optimization technique that you find.

  • So the question now translates into what is the choice of the error

  • function or error measure that will help or not help.

  • And that will be covered also next week under the topic, Error and Noise.

  • When I talk about error, we'll talk about error measures, and this

  • translates directly to the learning algorithm that goes with them.
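
As an illustration of the perceptron's error-minimizing update rule described in this answer, here is a minimal sketch of the perceptron learning algorithm on a made-up, linearly separable toy data set (my own example, not code from the course):

```python
import numpy as np

def perceptron(X, y, max_iters=1000):
    """Perceptron learning algorithm: pick any misclassified point
    and apply the update rule w <- w + y_n * x_n, repeating until
    the classification error on the data is zero."""
    # Prepend a constant 1 to each point so the threshold becomes
    # an ordinary weight (the bias) w[0].
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        wrong = np.where(np.sign(X @ w) != y)[0]
        if wrong.size == 0:
            break                     # zero classification error
        n = wrong[0]                  # a misclassified point
        w = w + y[n] * X[n]           # the update rule
    return w

# Toy data: label is the sign of x1 + x2 - 1 (linearly separable).
X = np.array([[0.0, 0.0], [2.0, 2.0], [0.0, 2.0],
              [2.0, 0.0], [3.0, 1.0], [-1.0, 0.0]])
y = np.sign(X.sum(axis=1) - 1.0)

w = perceptron(X, y)
print("learned weights:", w)
```

On separable data like this, the algorithm converges; the error function it drives down is simply the number of misclassified points.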

  • MODERATOR: Back to the perceptron.

  • So what happens if your hypothesis gives you exactly 0 in this case?

  • PROFESSOR: So remember that the quantity you compute and

  • compare with the threshold was your credit score.

  • So I told you what happens if you are above threshold, and what happens if

  • you're below threshold.

  • So what happens if you're exactly at the threshold?

  • Your score is exactly that.

  • The informal answer is that it depends on the mood of the credit

  • officer on that day.

  • If they had a bad day, you will be denied!

  • But the serious answer is that there are technical ways of

  • defining that point.

  • You can define it as 0, so the sign of 0 is 0.

  • In which case you are always making an error, because you are never +1 or

  • -1, when you should be.

  • Or you could make it belong to the +1 category or

  • to the -1 category.

  • There are ramifications for all of these decisions

  • that are purely technical.

  • Nothing conceptual comes out of them.

  • That's why I decided not to include it.

  • Because it clutters the main concept with something that really has no

  • ramification.

  • As far as you're concerned, the easiest way to consider it is that the

  • output will be 0, and therefore you will be making an error regardless of

  • whether it's +1 or -1.
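
The tie-breaking conventions just described can be written down in a couple of lines; this is my own sketch, not code from the course:

```python
import numpy as np

def sign_strict(score):
    """Convention 1: sign(0) = 0, so a score exactly at the threshold
    matches neither +1 nor -1 and always counts as an error."""
    return np.sign(score)

def sign_tie_plus(score):
    """Convention 2: break the tie in favor of the +1 category."""
    return np.where(score >= 0, 1.0, -1.0)

score = 0.0  # a credit score exactly at the threshold
print(sign_strict(score), sign_tie_plus(score))  # prints: 0.0 1.0
```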

  • MODERATOR: Is there a kind of problem that cannot be learned even if

  • there's a huge amount of data?

  • PROFESSOR: Correct.

  • For example, if I go to my computer and use a pseudo-random number

  • generator to generate the target over the entire domain, then patently,

  • nothing I can give you will make you learn the other guys.

  • So remember the three--

  • let me try to--

  • the essence of machine learning.

  • The first one was, a pattern exists.

  • If there's no pattern that exists, there is nothing to learn.

  • Let's say that it's like a baby, and stuff is happening, and the

  • baby is just staring. There is nothing to pick from that thing.

  • Once there is a pattern, you can see the smile on the baby's face.

  • Now I can see what is going on.

  • So whatever you are learning, there needs to be a pattern.

  • Now, how to tell that there's a pattern or not,

  • that's a different question.

  • But the main ingredient, there's a pattern. The other one is we cannot pin

  • it down mathematically.

  • If we can pin it down mathematically, and you decide to do

  • the learning, then you are really lazy.

  • Because you could just write the code.

  • But fine.

  • You can use learning in this case, but it's not the recommended method,

  • because it has certain errors in performance.

  • Whereas if you have the mathematical definition, you just implement it and

  • you'll get the best possible solution.

  • And the third one, you have data, which is key.

  • So if you have plenty of data, but the first one is off, you are simply not

  • going to learn.

  • And it's not like we have to answer each of these questions by guesswork.

  • The theory will completely capture what is going on.

  • So there's a very good reason for going through four lectures in the

  • outline that are mathematically inclined.

  • This is not for the sake of math.

  • I don't like to do math hacking, if you will.

  • I pick the math that is necessary to establish a concept.

  • And these will establish it, and they are very much worth being patient with

  • and going through.

  • Because once you're done with them, you basically have it cold about what

  • are the components that make learning possible, and how do we tell, and all

  • of the questions that have been asked.

  • MODERATOR: Historical question.

  • So why is the perceptron often related with a neuron?

  • PROFESSOR: I will discuss this in neural networks, but in general,

  • when you take a neuron and synapses, and you find what is the function that

  • gets to the neuron, you find that the neuron fires, which is +1, if the

  • signal coming to it, which is roughly a combination of the stimuli, exceeds

  • a certain threshold.

  • So that was the initial inspiration, and the initial inspiration was

  • that: the brain does a pretty good job, so maybe if we mimic the

  • function, we will get something good.

  • But you mimic one neuron, and then you put it together and you'll get the

  • neural network that you are talking about.

  • And I will discuss the analogy with biology, and the extent that it can be

  • benefited from, when we talk about neural networks, because

  • that will be the more proper context for that.

  • MODERATOR: Another question is, regarding the hypothesis set, are there

  • Bayesian hierarchical procedures to narrow down the hypothesis set?

  • PROFESSOR: OK.

  • The choice of the hypothesis set and the model in general is model

  • selection, and there's quite a bit of stuff that we are going to talk about

  • in model selection, when we talk about validation.

  • In general, the word Bayesian was mentioned here-- if you

  • look at machine learning, there are schools that deal with the subject

  • differently.

  • So for example, the Bayesian school puts a mathematical framework

  • completely on it.

  • And then everything can be derived, and that is based on Bayesian

  • principles.

  • I will talk about that at the very end, so it's last but not least.

  • And I will make a very specific point about it, for what it's worth.

  • But what I'm talking about in the course in all of the details, are the

  • most commonly useful methods in practice.

  • That is my criterion for inclusion.

  • So I will get to that when we get there.

  • In terms of a hierarchy,

  • there are a number of hierarchical methods.

  • For example, structural risk minimization is one of them.

  • There are methods based on hierarchies, with ramifications for

  • generalization.

  • I may touch upon it, when I get to support vector machines.

  • But again, there's a lot of theory, and if you read a book on machine

  • learning written by someone from pure theory, you would think that you are

  • reading about a completely different subject.

  • It's respectable stuff, but different from the other

  • stuff that is practiced.

  • So one of the things that I'm trying to do, I'm trying to pick from all the

  • components of machine learning, the big picture that gives you the

  • understanding of the concept, and the tools to use it in practice.

  • That is the criterion for inclusion.

  • Any questions from the inside here?

  • OK, we'll call it a day, and we'll see you on Thursday.
