ANNOUNCER: The following program is brought to you by Caltech.
YASER ABU-MOSTAFA: Welcome to machine learning, and welcome to our online
audience as well.
Let me start with an outline of the course, and then go into the material
of today's lecture.
As you see from the outline, the topics are given colors, and that
designates their main content, whether it's mathematical or practical.
Machine learning is a very broad subject.
It goes from very abstract theory to extreme practice as in rules of thumb.
And the inclusion of a topic in the course depends on the relevance to
machine learning.
So some mathematics is useful because it gives you the conceptual framework,
and then some practical aspects are useful because they give you the way
to deal with real learning systems.
Now if you look at the topics, these are not meant to be separate topics
for each lecture.
They just highlight the main content of those lectures.
But there is a story line that goes through it, and let me tell you what
the story line is like.
It starts here with: what is learning?
Can we learn?
How to do it?
How to do it well?
And then the take-home lessons.
There is a logical dependency that goes through the course, and there's
one exception to that logical dependency.
One lecture, which is the third one, doesn't really belong here.
It's a practical topic, and the reason I included it early on is because I
needed to give you some tools to play around with, to test the
theoretical and conceptual aspects.
If I waited until it belonged naturally, which is with the second part of the linear models down there, the beginning of the course would be just too theoretical for people's taste.
And as you see, if you look at the colors, it is mostly red in the
beginning and mostly blue in the end.
So it starts building the concepts and the theory.
And then it goes on to the practical aspects.
Now, let me start today's lecture.
And the subject of the lecture is the learning problem.
It's an introduction to what learning is.
And I will draw your attention to one aspect of this slide,
which is this part.
That's the logo of the course.
And believe it or not, this is not artwork.
This is actually a technical figure that will come up
in one of the lectures.
I'm not going to tell you which one.
So you can wait in anticipation until it comes up, but this will actually be
a scientific figure that we will talk about.
Now when we move to today's lecture, I'm going to talk
today about the following.
Machine learning is a very broad subject, and I'm going to start with
one example that captures the essence of machine learning.
It's a fun example about movies that everybody watches.
And then after that, I'm going to abstract from the learning problem,
the practical learning problem, aspects that are common to all
learning situations that you're going to face.
And in abstracting them, we'll have the mathematical formalization of the
learning problem.
And then we will get our first algorithm for machine learning today.
It's a very simple algorithm, but it will fix the idea of what the role of an algorithm is in this case.
And we will survey the types of learning, so that we know which part we
are emphasizing in this course, and which parts are nearby.
And I will end up with a puzzle, a very interesting puzzle, and it's
a puzzle in more ways than one, as you will see.
OK, so let me start with an example.
The example of machine learning that I'm going to start with is how
a viewer would rate a movie.
Now that is an interesting problem, and it's interesting for us because we
watch movies, and very interesting for a company that rents out movies.
And indeed, a company which is Netflix wanted to improve the in-house system
by a mere 10%.
So they make recommendations when you log in, they recommend movies that
they think you will like, so they think that you'll rate them highly.
And they had a system, and they wanted to improve the system.
So how much is a 10% improvement in performance worth to the company?
It was actually $1 million that was paid out to the first group that
actually managed to get the 10% improvement.
So you ask yourself, 10% improvement in something like that, why should
that be worth a million dollars?
It's because, if the recommendations that the movie company makes are spot
on, you will pay more attention to the recommendation, you are likely to rent
the movies that they recommend, and they will make lots of money-- much
more than the million dollars they promised.
And this is very typical in machine learning.
For example, machine learning has applications in financial forecasting.
You can imagine that the minutest improvement in financial forecasting
can make a lot of money.
So the fact that you can actually push the system to be better using machine
learning is a very attractive aspect of the technique in a wide spectrum of
applications.
So what did these guys do?
They gave the data, and people started working on the problem using different
algorithms, until someone managed to get the prize.
Now if you look at the problem of rating a movie, it captures the
essence of machine learning, and the essence has three components.
If you find these three components in a problem you have in your field, then
you know that machine learning is ready as an application tool.
What are the three?
The first one is that a pattern exists.
If a pattern didn't exist, there would be nothing to look for.
So what is the pattern here?
There is no question that the way a person rates a movie is related to how
they rated other movies, and is also related to how other
people rated that movie.
We know that much.
So there is a pattern to be discovered.
However, we cannot really pin it down mathematically.
I cannot ask you to write a 17th-order polynomial that captures how people
rate movies.
So the fact that there is a pattern, and that we cannot pin it down
mathematically, is the reason why we are going for machine learning.
For "learning from data".
We couldn't write down the system on our own, so we're going to depend on
data in order to be able to find the system.
And there is a third component, which is very important.
If you don't have that, you are out of luck.
We have to have data. We are learning from data.
So if someone knocks on my door with an interesting machine learning
application, and they tell me how exciting it is, and how great the
application would be, and how much money they would make, the first
question I ask, what data do you have?
If you have data, we are in business.
If you don't, you are out of luck.
If you have these three components, you are ready to
apply machine learning.
Now let me give you a solution to the movie rating, in order to start
getting a feel for it.
So here is a system.
Let me start to focus on part of it.
We are going to describe a viewer as a vector of factors, a profile if
you will.
So if you look here for example, the first one would be comedy content.
Does the movie have a lot of comedy?
From a viewer point of view, do they like comedies?
Here, do they like action?
Do they prefer blockbusters, or do they like fringe movies?
And you can go on all the way, even to asking yourself whether you like the
lead actor or not.
Now you go to the content of the movie itself, and you get the
corresponding part.
Does the movie have comedy?
Does it have action?
Is it a blockbuster?
And so on.
Now you compare the two, and you realize that if there is a match--
let's say you hate comedy and the movie has a lot of comedy--
then the chances are you're not going to like it.
But if there is a match across many coordinates-- and the number of factors here could really be something like 300-- then the chances are you'll like the movie.
And if there's a mismatch, the chances are you're not
going to like the movie.
So what you do is match the movie and the viewer factors, and then you add up their contributions.
And then as a result of that, you get the predicted rating.
This is all good except for one problem, which is this is really not
machine learning.
In order to produce this thing, you have to watch the movie, and analyze
the content.
You have to interview the viewer, and ask about their taste.
And then after that, you combine them and try to get
a prediction for the rating.
Now the idea of machine learning is that you don't have to do any of that.
All you do is sit down and sip on your tea, while the machine is doing
something to come up with this figure on its own.
So let's look at the learning approach.
So in the learning approach, we know that the viewer will be a vector of
different factors, and different components for every factor.
So this vector will be different from one viewer to another.
For example, one viewer will have a big blue content here, and one of them
will have a small blue content, depending on their taste.
And then, there is the movie.
And a particular movie will have different contents that correspond to those.
And the way we said we are computing the rating, is by simply taking these
and combining them and getting the rating.
Now what machine learning will do is reverse-engineer that process.
It starts from the rating, and then tries to find out what factors would be
consistent with that rating.
So think of it this way.
You start, let's say, with completely random factors.
So you take these guys, just random numbers from beginning to end, and
these guys, random numbers from beginning to end.
For every user and every movie, that's your starting point.
Obviously, there is no chance in the world that when you get the inner
product between these factors that are random, that you'll get anything that
looks like the rating that actually took place, right?
But what you do is you take a rating that actually happened, and then you
start nudging the factors ever so slightly toward that rating.
You nudge them in the direction that makes the inner product get closer to the rating.
Now it looks like a hopeless thing. I start with so many factors, they are
all random, and I'm trying to make them match a rating.
What are the chances?
Well the point is that you are going to do this not for one rating, but for
a 100 million ratings.
And you keep cycling through the 100 million, over
and over and over.
And eventually, lo and behold, you find that the factors now are
meaningful in terms of the ratings.
And if you get a user, a viewer here, that didn't watch a movie, and you get
the vector that resulted from that learning process, and you get the
movie vector that resulted from that process, and you do the inner product,
lo and behold, you get a rating which is actually consistent with how that
viewer rates the movie.
That's the idea.
Now the solution I described is actually one of the winning solutions in the competition that I mentioned.
So this is for real, this actually can be used.
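The factor-nudging process can be sketched in a few lines of code. This is a minimal illustration only, not the actual competition solution: the tiny rating set, the number of factors, the learning rate, and the number of passes are all hypothetical choices made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny data set of (viewer, movie, rating) triples.
# In the real competition there were on the order of 100 million ratings.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0)]

n_viewers, n_movies, n_factors = 2, 2, 3
lr = 0.01  # how much we "nudge" the factors per rating

# Start with completely random factors, as described in the lecture.
U = rng.normal(size=(n_viewers, n_factors))  # viewer factor vectors
M = rng.normal(size=(n_movies, n_factors))   # movie factor vectors

# Cycle through the ratings over and over, nudging both factor vectors
# so that their inner product moves toward the observed rating.
for _ in range(5000):
    for v, m, r in ratings:
        err = r - U[v] @ M[m]        # how far off is the predicted rating?
        u_old = U[v].copy()
        U[v] += lr * err * M[m]      # nudge viewer factors toward the rating
        M[m] += lr * err * u_old     # nudge movie factors likewise

# The inner products now reproduce the ratings they were nudged toward;
# prediction for viewer 0 on movie 0 should be close to 5.
print(round(float(U[0] @ M[0]), 1))
```

The same trained vectors can then be paired for a (viewer, movie) combination that was never rated, which is exactly the prediction step described above.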
Now with this example in mind, let's actually go to the
components of learning.
So now I would like to abstract from the learning problems that I see, what
are the mathematical components that make up the learning problem?
And I'm going to use a metaphor.
I'm going to use a metaphor now from another application domain, which
is a financial application.
So the metaphor we are going to use is credit approval.
You apply for a credit card, and the bank wants to decide whether it's
a good idea to extend a credit card for you or not.
From the bank's point of view, if they're going to make
money, they are happy.
If they are going to lose money, they are not happy.
That's the only criterion they have.
Now, very much like we didn't have a magic formula for deciding how
a viewer will rate a movie, the bank doesn't have a magic formula for
deciding whether a person is creditworthy or not.
What they're going to do, they're going to rely on historical records of
previous customers, and how their credit behavior turned out, and then
try to reverse-engineer the system, and when they get the system frozen,
they're going to apply it to a future customer.
That's the deal.
What are the components here?
Let's look at it.
First, you have the applicant information.
And the applicant information-- you look at this, and you can see that
there is the age, the gender, how much money you make, how much money you
owe, and all kinds of fields that are believed to be related to the
creditworthiness.
Again, pretty much like we did in the movie example, there is no question
that these fields are related to the creditworthiness.
They don't necessarily uniquely determine it, but they are related.
And the bank doesn't need a sure bet. They just want the credit decision to be as reliable as possible.
So they want to use that pattern, in order to be able to come up with
a good decision.
And they take this input, and they want to approve the credit or deny it.
So let's formalize this.
First, we are going to have an input.
And the input is called x. Surprise, surprise!
And that input happens to be the customer application.
So we can think of it as a d-dimensional vector, the first
component is the salary, years in residence, outstanding debt, whatever
the components are.
You put it as a vector, and that becomes the input.
Then we get the output y. The output y is simply the decision, either to
extend credit or not to extend credit, +1 and -1.
And being a good or bad customer, that is from the bank's point of view.
Now we have after that, the target function.
The target function is a function from a domain X, which is the
set of all of these x's.
So it is the set of vectors of d dimensions.
So it's a d-dimensional Euclidean space, in this case.
And then the Y is the set of y's.
Well, that's an easy one because y can only be +1 or -1,
accept or deny.
And therefore this is just a binary co-domain.
And this target function is the ideal credit approval formula, which we
don't know.
In all of our endeavors in machine learning, the target function is
unknown to us.
If it were known, nobody would need learning.
We would just go ahead and implement it.
But we need to learn it because it is unknown to us.
So what are we going to do to learn it?
We are going to use data, examples.
So the data in this case is based on previous customer application records.
The input, which is the information in their applications, and the output,
which is how they turned out in hindsight.
This is not a question of prediction at the time they applied, but after
five years, they turned out to be a great customer.
So the bank says, if someone has these attributes again, let's approve
credit because these guys tend to make us money.
And this one made us lose a lot of money, so let's deny it, and so on.
And the historical records-- there are plenty of historical records.
All of this makes sense when you're talking about having 100,000 of
those guys.
Then you pretty much say, I will capture what the essence of that
function is.
So this is the data, and then you use the data, which is the historical
records, in order to get the hypothesis.
The hypothesis is the formal name we're going to call the formula we get
to approximate the target function.
So the hypothesis lives in the same world as the target function, and if
you look at the value of g, it supposedly approximates f.
While f is unknown to us, g is very much known--
actually we created it-- and the hope is that it does approximate f well.
That's the goal of learning.
So this notation will be our notation for the rest of the course, so get
used to it.
The target function is always f, the hypothesis we produce, which we'll refer to as the final hypothesis, will be called g, and the data will always have that notation-- capital N points (x_1, y_1) up to (x_N, y_N), which are the data set.
And the output is always y.
So this is the notation to be used.
Now, let's put it in a diagram in order to analyze it a little bit more.
If you look at the diagram here, here is the target
function and it is unknown--
that is the ideal approval which we will never know, but that's what we're
hoping to get to approximate.
And we don't see it.
We see it only through the eyes of the training examples.
This is our vehicle of understanding what the target function is.
Otherwise the target function is a mysterious quantity for us.
And eventually, we would like to produce the final hypothesis.
The final hypothesis is the formula the bank is going to use in order to
approve or deny credit, with the hope that g approximates that f.
Now what connects those two guys?
This will be the learning algorithm.
So the learning algorithm takes the examples, and will produce the final
hypothesis, as we described in the example of the movie rating.
Now there is another component that goes into the learning algorithm.
So what the learning algorithm does, it creates the formula from a preset
model of formulas, a set of candidate formulas, if you will.
And these we are going to call the hypothesis set, a set of hypotheses
from which we are going to pick one hypothesis.
So from this H comes a bunch of small h's, which are functions that can be
candidates for being the credit approval.
And one of them will be picked by the learning algorithm, which happens to
be g, hopefully approximating f.
Now if you look at this part of the chain, from the target function to the
training to the learning algorithm to the final hypothesis, this is very
natural, and nobody will object to that.
But why do we have this hypothesis set?
Why not let the algorithm pick from anything?
Just create the formula, without being restricted to a particular set of
formulas H.
There are two reasons, and I want to explain them.
One of them is that there is no downside for including a hypothesis
set in the formalization.
And there is an upside.
So let me describe why there is no downside, and then describe why there
is an upside.
There is no downside for the simple reason that, from a practical point of
view, that's what you do.
You want to learn, you say I'm going to use a linear formula.
I'm going to use a neural network.
I'm going to use a support vector machine.
So you are already dictating a set of hypotheses.
If you happen to be a brave soul, and you don't want to restrict yourself at
all, very well, then your hypothesis set is the set of all possible
hypotheses.
Right?
So there is no loss of generality in putting it.
So there is no downside.
The upside is not obvious here, but it will become obvious as we go through
the theory.
The hypothesis set will play a pivotal role in the theory of learning.
It will tell us: can we learn, and how well we learn, and whatnot.
Therefore having it as an explicit component in the problem statement
will make the theory go through.
So that's why we have this figure.
Now, let me focus on the solution components of that figure.
What do I mean by the solution components?
If you look at this, the first part, which is the target-- let me try to
expand it--
so the target function is not under your control.
Someone knocks on my door and says: I want to approve credit.
That's the target function, I have no control over that.
And by the way, here are the historical records.
I have no control over that, so they give me the data.
And would you please hand me the final hypothesis?
That is what I'm going to give them at the end, before I receive my check.
So all of that is completely dictated.
Now let's look at the other part. The learning algorithm, and the hypothesis
set that we talked about, are your solution tools.
These are things you choose, in order to solve the problem.
And I would like to take a little bit of a look into what they look like,
and give you an example of them, so that you have a complete chain for
the entire figure in your mind.
From the target function, to the data set, to the learning algorithm,
hypothesis set, and the final hypothesis.
So, here is the hypothesis set.
We chose the notation H for the set, and the element will be given the
symbol small h.
So h is a function, pretty much like the final hypothesis g.
g is just one of them that you happen to elect.
So when we elect it, we call it g. If it's sitting there generically, we
call it h.
And then, when you put them together, they are referred to as
the learning model.
So if you're asked what is the learning model you are using, you're
actually choosing both a hypothesis set and a learning algorithm.
We'll see the perceptron in a moment, so this would be the
perceptron model, and this would be the perceptron learning algorithm.
This could be neural network, and this would be back propagation.
This could be support vector machines of some kind, let's say
radial basis function version, and this would be the quadratic programming.
So every time you have a model, there is a hypothesis set, and then there is
an algorithm that will do the searching and produce
one of those guys.
So this is the standard form
for the solution.
Now, let me go through a simple hypothesis set in detail so we have
something to implement.
So after the lecture, you can actually implement a learning algorithm on real
data if you want to.
This is not a glorious model. It's a very simple model. On the other hand,
it's a very clear model to pinpoint what we are talking about.
So here is the deal.
You have an input, and the input is x_1 up to x_d, as we said--
d-dimensional vector-- and each of them comes from the real numbers, just
for simplicity.
So this belongs to the real numbers.
And these are the attributes of a customer.
As we said, salary, years in residence, and whatnot.
So what does the perceptron model do?
It does a very simple formula.
It takes the attributes you have and gives them different weights, w.
So let's say the salary is important, the chances are w corresponding to the
salary will be big.
Some other attribute is not that important.
The chances are the w that goes with it is not that big.
Actually, outstanding debt is bad news.
If you owe a lot, that's not good.
So the chances are the weight will be negative for outstanding
debt, and so on.
Now you add them together, and you add them in a linear form-- that's what
makes it a perceptron--
and you can look at this as a credit score, of sorts.
Now you compare the credit score with a threshold.
If you exceed the threshold, they approve the credit card.
And if you don't, they deny the credit card.
So that is the formula they
settle on.
They have no idea, yet, what the w's and the threshold are, but they dictated the
formula-- the analytic form that they're going to use.
Now we take this and we put it in the formalization we had.
We have to define a hypothesis h, and this will tell us what is the
hypothesis set that has all the hypotheses that have the same
functional form.
So you can write it down as this.
This is a little bit long, but there's absolutely nothing to it.
This is your credit score, and this is the threshold you compare to by
subtracting.
If this quantity is positive, you belong to the first thing and you will
approve credit.
If it's negative, you belong here and you will deny credit.
Well, the function that takes a real number, and produces the sign +1 or
-1, is called the sign.
So when you take the sign of this thing, this will indeed be +1 or
-1, and this will give the decision you want.
And that will be the form of your hypothesis.
Now let's put it in color, and you realize that what defines h is your
choice of w_i and the threshold.
These are the parameters that define one hypothesis versus the other.
x is an input that will be put into any hypothesis.
As far as we are concerned, when we are learning for example, the inputs
and outputs are already determined.
These are the data set.
But what we vary to get one hypothesis or another, and what the algorithm
needs to vary in order to choose the final hypothesis, are those parameters
which, in this case, are w_i and the threshold.
So let's look at it visually.
Let's assume that the data you are working
with is linearly separable.
Linearly separable in this case, for example, you have nine data points.
And if you look at the nine data points, some of them were good
customers and some of them were bad customers.
And you would like now to apply the perceptron model, in order to separate
them correctly.
You would like to get to this situation, where the perceptron, which
is this purple line, separates the blue region from the red region or the
pink region, and indeed all the good customers belong to one, and the bad
customers belong to the other.
So you have hope that a future customer, if they lie here or lie
here, they will be classified correctly.
That is, if there is actually a simple linear pattern here to be detected.
But when you start, you start with random weights, and the random weights
will give you any line.
So the purple line in both cases corresponds to the
purple parameters there.
One choice of these w's and the threshold corresponds to one line.
You change them, you get another line.
So you can see that the learning algorithm is playing around with these
parameters, and therefore moving the line around, trying to arrive at this
happy solution.
Now we are going to have a simple change of notation.
Instead of calling it threshold, we're going to treat it as if it's a weight.
It was minus the threshold; now we call it plus w_0.
There is absolutely nothing to it: all you need to do is choose w_0 to be minus the threshold.
No big deal.
So why do we do that?
We do that because we are going to introduce an artificial coordinate.
Remember that the input was x_1 through x_d.
Now we're going to add x_0.
This is not an attribute of the customer, but
an artificial constant we add, which happens to be always +1.
Why are we doing this? You probably guessed.
Because when you do that, then all of a sudden the formula simplifies.
Now you are summing from i equals 0, instead of i equals 1.
So you added the zero term, and what is the zero term?
It's the threshold which you conveniently call w_0 with a plus sign,
multiplied by the 1.
So indeed, this formula will be equivalent to that one.
So it looks better.
And this is the standard notation we're going to use.
And now we put it as a vector form, which will simplify matters, so
in this case you will be having an inner product between a vector w,
a column vector, and a vector x.
So the vector w would be w_0, w_1, w_2, w_3, w_4, et cetera.
And x_0, x_1, x_2, et cetera.
And you do the inner product by taking a transpose, and you get a formula
which is exactly the formula you have here.
So now we are down to this formula for the perceptron hypothesis.
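In code, this hypothesis is essentially a one-liner. Here is a minimal sketch, with hypothetical weights and a hypothetical two-attribute customer; the numbers are made up purely for illustration, and the x_0 = 1 convention is included so that w_0 plays the role of minus the threshold.

```python
import numpy as np

def h(w, x):
    """Perceptron hypothesis: the sign of the inner product w^T x.

    w and x are (d+1)-vectors; x[0] is the artificial coordinate,
    always +1, so w[0] acts as minus the threshold."""
    return 1 if w @ x > 0 else -1

# Hypothetical weights: w[0] = -threshold, then one weight per attribute.
w = np.array([-3.0, 0.5, 1.0])
# Hypothetical customer: x[0] = 1, then two attribute values.
x = np.array([1.0, 4.0, 2.0])

# Credit score 0.5*4 + 1.0*2 = 4 exceeds the threshold 3, so approve.
print(h(w, x))   # +1
```

Varying w gives you a different hypothesis; that is exactly the knob the learning algorithm will turn.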
Now that we have the hypothesis set, let's look for the learning algorithm
that goes with it.
The hypothesis set tells you the resources you can work with.
Now we need the algorithm that is going to look at the data, the
training data that you're going to use, and navigate through the space
of hypotheses, to bring the one that is going to output as the final
hypothesis that you give to your customer.
So this one is called the perceptron learning algorithm, and it implements
this function.
What it does is the following.
It takes the training data.
That is always what a learning algorithm does. This is
its starting point.
So it takes existing customers, and their existing credit behavior in
hindsight--
that's what it uses--
and what does it do?
It tries to make the w correct.
So it really doesn't like at all when a point is misclassified.
So if a point is misclassified, it means that your w didn't do
the right job here.
So what does it mean to be a misclassified point here?
It means that when you apply your formula, with the current w--
the w is the one that the algorithm will play with--
apply it to this particular x.
Then what happens?
You get something that is not the credit behavior you want.
It is misclassified.
So what do we do when a point is misclassified?
We have to do something.
So what the algorithm does, it updates the weight vector.
It changes the weight, which changes the hypothesis, so that it behaves
better on that particular point.
And this is the formula that it does.
So I'll explain it in a moment.
Let me first try to explain the inner product in terms of agreement or
disagreement.
If you have the vector x and the vector w this way, their inner product
will be positive, and the sign will give you a +1.
If they are this way, the inner product will be negative, and the sign
will be -1.
So being misclassified means that either they are this way and the
output should be -1, or it's this way and output should be +1.
That's what makes it misclassified, right?
So if you look here at this formula, it takes the old w and adds something
that depends on the misclassified point.
Both in terms of the x_n and y_n.
y_n is just +1 or -1.
So here you are either adding a vector or subtracting a vector.
And we will see from this diagram that you're always doing so in such a way
that you make the point more likely to be correctly classified.
How is that?
If y equals +1, as you see here, then since the point is misclassified, it must be that w dot x was negative.
Now when you modify this to w plus y x, it's actually w plus x.
You add x to w, and when you add x to w you get the blue vector instead of
the red vector.
And lo and behold, now the inner product is indeed positive.
And in the other case when it's -1, it is misclassified because they
were this way.
They give you +1 when it should be -1.
And when you apply the rule, since y is -1, you are actually
subtracting x.
So you subtract x and get this guy, and you will get the correct
classification.
So this is the intuition behind it.
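You can check this geometry numerically. A tiny sketch with made-up vectors, showing one update on a misclassified +1 point; a single update makes the point more likely to be correct, and in this particular example it happens to fix it outright.

```python
import numpy as np

# A misclassified +1 point: w and x are at an obtuse angle,
# so sign(w . x) = -1 even though y = +1. (Vectors are made up.)
w = np.array([1.0, -2.0])
x = np.array([2.0, 1.5])
y = 1

print(int(np.sign(w @ x)))   # -1: misclassified

# One PLA update: w <- w + y*x, i.e. add x to w since y = +1.
w = w + y * x

print(int(np.sign(w @ x)))   # +1: the inner product is now positive
```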
However, it is not the intuition that makes this work.
There are a number of problems with this approach.
I just motivated that this is not a crazy rule.
Whether or not it's a working rule, that is yet to be seen.
Let's look at the iterations of the perceptron learning algorithm.
Here is one iteration of PLA.
So you look at this thing, and you have this current w corresponds to
the purple line.
This guy is blue in the red region.
It means it's misclassified.
So now you would like to adjust the weights, that is move around
that purple line, such that the point is classified correctly.
If you apply the learning rule, you'll find that you're actually moving in
this direction, which means that the blue point will likely be correctly
classified after that iteration.
There is a problem because, let's say that I actually move
this guy in this direction.
Well this one, I got it right, but this one, which used to be right,
now is messed up.
It moved to the blue region, right?
And if you think about it, I'm trying to take care of one point, and I may be
messing up all other points, because I'm not taking them into
consideration.
Well, the good news for the perceptron
learning algorithm is that all you need to do, is for iterations 1,
2, 3, 4, et cetera, pick a misclassified point, anyone you like.
And then apply the iteration to it.
The iteration we just talked about, which is this one.
The top one.
And that's it.
If you do that, and the data was originally linearly separable, then
you will end up with the case that you will get to a correct solution.
You will get to something that classifies all of them correctly.
This is not an obvious statement.
It requires a proof.
The proof is not that hard.
But it gives us the simplest possible learning model we can think of.
It's a linear model, and this is your algorithm.
All you need to do is be very patient, because this sequence of iterations-- 1, 2, 3, 4, and so on-- can at times be really long.
But it eventually converges.
That's the promise,
as long as the data is linearly separable.
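Putting the pieces together, here is a minimal sketch of the perceptron learning algorithm. The toy data set and labels are hypothetical (a small linearly separable set, with x_0 = 1 prepended), and the iteration cap is just a safety net in case the data turns out not to be separable.

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """Perceptron learning algorithm.

    X: N x (d+1) array of inputs, first column the constant 1.
    y: length-N array of +/-1 labels.
    Repeatedly picks a misclassified point and applies the update
    w <- w + y_n * x_n; converges if the data is linearly separable."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        preds = np.sign(X @ w)
        misclassified = np.flatnonzero(preds != y)
        if misclassified.size == 0:
            return w                 # every point classified correctly
        n = misclassified[0]         # pick any misclassified point
        w = w + y[n] * X[n]          # the PLA update rule
    raise RuntimeError("no convergence; data may not be linearly separable")

# Hypothetical linearly separable toy data (x_0 = 1 prepended):
X = np.array([[1, 2, 3], [1, 3, 1], [1, -1, -2], [1, -2, 1.0]])
y = np.array([1, 1, -1, -1])

w = pla(X, y)
print(bool(np.all(np.sign(X @ w) == y)))   # True
```

With a bank's historical records in place of the toy array, this is the whole procedure: run it, take the final w, and hand over g(x) = sign(w^T x).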
So now we have one learning model, and if I give you now data from a bank--
previous customers and their credit behavior-- you can actually run the
perceptron learning algorithm, and come up with a final hypothesis g that you
can hand to the bank.
Not clear at all that it will be good, because all you did was match the
historical records.
Well, you may ask the question: if I match the historical records, does this
mean that I'm getting future customers right, which is the
only thing that matters?
The bank already knows what happened with the previous customers. It's just
using the data to help you find a good formula.
The formula will be good or not good to the extent that it applies to a new
customer, and can predict the behavior correctly.
Well, that's a loaded question which will be handled in
extreme detail, when we talk about the theory of learning.
That's why we have to develop all of this theory.
So, that's it.
And that is the perceptron learning algorithm.
Now let me go into the bigger picture of learning, because what I talked
about so far is one type of learning.
It happens to be by far the most popular, and the most used.
But there are other types of learning.
So let's talk about the premise of learning, from which the different
types came about.
That's what learning is about.
This is the premise that is common to any problem that you
would consider learning.
You use a set of observations, what we call data, to uncover
an underlying process.
In our case, the target function.
You can see that this is a very broad premise.
And therefore, you can see that people have rediscovered that over and over
and over, in so many disciplines.
Can you think of a discipline, other than machine learning, that uses that
as its exclusive premise?
Has anybody taken courses in statistics?
In statistics, that's what they do.
The underlying process is a probability distribution.
And the observations are samples generated by that distribution.
And you want to take the samples, and predict what the probability
distribution is.
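As an aside for the reader, that statistics version of the premise fits in a few lines of Python; the Gaussian and the sample size here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# The underlying process: a probability distribution with hidden parameters.
true_mean, true_std = 3.0, 0.5

# The observations: samples generated by that distribution.
samples = rng.normal(true_mean, true_std, size=10_000)

# Use the samples to estimate ("uncover") the distribution's parameters.
est_mean = samples.mean()
est_std = samples.std()
```

With enough samples, the estimates land close to the hidden parameters, which is the statistics version of learning from data.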
And over and over, there are so many disciplines under different names.
Now when we talk about different types of learning, it's not like we sit down
and look at the world and say, this looks different from this because the
assumptions look different.
What you do is, you take this premise and apply it in a context.
And that calls for a certain amount of mathematics and algorithms.
If a particular set of assumptions takes you sufficiently far from the
mathematics and the algorithms used in the other disciplines that
it takes on a life of its own,
and it develops its own math and algorithms, then you declare it
a different type.
So when I list the types, it's not completely obvious just by the slide
itself, that these should be the types that you have.
But for what it's worth, these are the most important types.
First one is supervised learning, that's what we have
been talking about.
And I will discuss it in detail, and tell you why it's called supervised.
And it is, by far, the concentration of this course.
There is another one which is called unsupervised learning, and
unsupervised learning is very intriguing.
I will mention it briefly here, and then we will talk about a very famous
algorithm for unsupervised learning later in the course.
And the final type is reinforcement learning, which is even more
intriguing, and I will discuss it in a brief
introduction in a moment.
So let's take them one by one.
Supervised learning.
So what is supervised learning?
Anytime you have the data that is given to you, with the output
explicitly given-- here is the user and movie, and here is the rating.
Here is the previous customer, and here is their credit behavior.
It's as if a supervisor is helping you out, in order to be able to classify
the future ones.
That's why it's called supervised.
Let's take an example of coin recognition, just to be able to
contrast it with unsupervised learning in a moment.
Let's say you have a vending machine, and you would like to make
the system able to recognize the coins.
So what do you do?
You have physical measurements of the coin; let's be simplistic and say we
measure the size and mass of the coin you put in.
Now the coins will be quarters, nickels, pennies, and dimes.
25, 5, 1, and 10.
And when you put the data in this diagram, they will belong there.
So the quarters, for example, are bigger, so they will belong here.
And the dimes in the US currency happen to be the smallest of them,
so they are smallest here, and there will be a scatter because of the error
in measurement, because of the exposure to the elements, and whatnot.
So let's say that this is your training data, and it's supervised
because things are colored.
I gave you those and told you they are 25 cents, 5 cents, et cetera.
So you use those in order to train a system, and the system will then be
able to classify a future one.
For example, if we stick to the linear approach, you may be able to
find separator lines like those.
And those separator lines will separate, based on the data, the 10
from the 1 from the 5 from the 25.
And once you have those,
you can bid farewell to the data. You don't need it anymore.
And when you get a future coin that is now unlabeled, you don't know what it
is, when the vending machine is actually working, then the coin will
lie in one region or another, and you're going to classify it accordingly.
So that is supervised learning.
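As a sketch of that pipeline, here is one simple way to do it in Python. The (size, mass) numbers are invented, and a nearest-centroid rule stands in for the separator lines (it also carves the plane into regions with piecewise-linear boundaries):

```python
import numpy as np

# Hypothetical labeled training measurements: (size, mass).
# Real data would have scatter from wear and measurement error.
train = {
    "penny":   np.array([[19.0, 2.5], [19.2, 2.4], [18.9, 2.6]]),
    "nickel":  np.array([[21.2, 5.0], [21.0, 5.1], [21.3, 4.9]]),
    "dime":    np.array([[17.9, 2.3], [18.0, 2.2], [17.8, 2.3]]),
    "quarter": np.array([[24.3, 5.7], [24.2, 5.6], [24.4, 5.8]]),
}

# "Bid farewell to the data": keep only one centroid per denomination.
centroids = {label: pts.mean(axis=0) for label, pts in train.items()}

def classify(coin):
    """Label a future, unlabeled coin by the region (nearest centroid) it falls in."""
    return min(centroids, key=lambda lbl: np.linalg.norm(coin - centroids[lbl]))
```

A new, unlabeled coin measuring (24.0, 5.5) then lands in the quarter region and is classified accordingly.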
Now let's look at unsupervised learning.
For unsupervised learning, instead of having the examples, the training data,
having this form which is the input plus the correct
target-- the correct output--
the customer and how they behaved in reality in credit,
we are going to have examples that have less information, so much less it
is laughable.
I'm just going to tell you what the input is.
And I'm not going to tell you what the target function is at all.
I'm not going to tell you anything about the target function.
I'm just going to tell you, here is the data of a customer.
Good luck, try to predict the credit.
OK--
How in the world are we going to do that?
Let me show you that the situation is not totally hopeless.
That's what I'm going to achieve.
I'm not going to tell you how to do it completely.
But let me show you that a situation like that is not totally hopeless.
Let's go for the coin example.
For the coin example, we have data that looks like this.
If I didn't tell you what the denominations are, the data
would look like this.
Right?
You have the measurements, but you don't know, is that a quarter, is
it-- you don't know.
Now honestly, if you look at this thing, you say I can know
something from this figure.
Things tend to cluster together.
So I may be able to classify those clusters into categories, without
knowing what the categories are.
That will be quite an achievement already.
You still don't know whether it's 25 cents, or whatever.
But the data actually made you able to do something that is
a significant step.
You're going to be able to come up with these boundaries.
And now, you are so close to finding the full system.
So unlabeled data actually can be pretty useful.
Obviously, I have seen the colored ones, so I actually chose the
boundaries right because I still remember them visually.
But if you look at the clusters and you have never heard about that,
especially these guys might not look like two clusters.
They may look like one cluster.
So it actually could be that this is ambiguous, and indeed in unsupervised
learning, the number of clusters is ambiguous at times.
And then, what you do--
this is the output of your system. Now, I can categorize the
coins into types.
I'm just going to call them types: type 1, type 2,
type 3, type 4.
I have no idea which belongs to which, but obviously if someone comes with
a single example of a quarter, a dime, et cetera, then you are ready to go.
Whereas before, you had to have lots of examples in order to choose where
exactly to put the boundary.
And this is why a set like that, which looks like a complete
jungle, is actually useful.
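For the curious reader, the clustering step can be sketched with k-means, one standard unsupervised algorithm (not necessarily the one covered later in the course); the unlabeled points below are made up:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means sketch: group unlabeled points into k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Move every center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Made-up unlabeled measurements forming two obvious clumps.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centers, labels = kmeans(X, k=2)
```

The algorithm never sees a label, yet the two clumps end up in different clusters-- categories without names, exactly as in the coin figure. The ambiguity mentioned above shows up here too: the number of clusters k has to be chosen.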
Let me give you another interesting example of unsupervised learning,
where I give you the input without the output, and you are actually in
a better situation to learn.
Let's say that your company or your school in this case, is sending you
for a semester in Rio de Janeiro.
So you're very excited, and you decide that you'd better learn some
Portuguese, in order to be able to speak the language when you arrive.
Not to worry, when you arrive, there will be a tutor who teaches you
Portuguese.
But you have a month to go, and you want to help
yourself as much as possible.
You look around, and you find that the only resource you have is a radio
station in Portuguese in your car.
So what you do, you just turn it on whenever you drive.
And for an entire month, you're bombarded with Portuguese.
"tudo bem", "como vai", "valeu", stuff like that comes back.
After a while, without knowing anything-- it's unsupervised, nobody
told you the meaning of any word--
you start to develop a model of the language in your mind.
You know what the idioms are, et cetera.
You are very eager to know what actually "tudo bem"
-- what does that mean?
You are ready to learn, and once you learn it, it's actually
fixed in your mind.
Then when you go there, you will learn the language faster than if you didn't
go through this experience.
So you can think of unsupervised learning, in one way or another, as
a way of getting a higher-level representation of the input.
Whether it's extremely high level as in clusters-- you forgot all the
attributes and you just tell me a label, or higher level as in this-- a better
representation than just the crude input into some model
in your mind.
Now let's talk about reinforcement learning.
In this case, it's not as bad as unsupervised learning.
So again, without the benefit of supervised learning, you don't get
the correct output.
What you do is-- I will give you the input.
OK, thank you very much, that's very kind.
What else?
I'm going to give you some output.
The correct output?
No!
Some output.
OK, that's very nice, but doesn't seem very helpful.
It looks now like unsupervised learning, because in unsupervised learning I
could give you some output.
Here is a dime. Oh, it's a quarter.
It's some output!
Such output has no information.
The information comes from the next one.
I'm going to grade this output.
So that is the information provided to you.
So I'm not explicitly giving you the output, but when you choose an output,
I'm going to tell you how well you're doing.
Reinforcement learning is interesting because it is mostly our own
experience in learning.
Think of a toddler, and a hot cup of tea in front of her.
She is looking at it, and she is very curious.
So she reaches to touch. Ouch!
And she starts crying.
The reward is very negative for trying.
Now next time she looks at it, and she remembers the previous experience, and
she doesn't touch it.
But there is a certain level of pain, because there is an unfulfilled
curiosity.
And curiosity killed the cat. In three or four trials, the toddler
tries again.
Maybe now it's OK.
And Ouch!
Eventually, from just the grade of the behavior-- to touch it or not to
touch it-- the toddler will learn not to touch cups of tea that have smoke
coming out of them.
So that is a case of reinforcement learning.
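The toddler's situation can be sketched as a tiny reward-driven learner in Python. The actions, rewards, and the simple averaging rule here are all made up for illustration; real reinforcement learning uses far more sophisticated credit assignment:

```python
import random

def learn_from_rewards(reward_of, actions, trials=1000, epsilon=0.1, seed=0):
    """Tiny reinforcement-learning sketch: the learner is never told the
    correct action, only a grade (reward) for the action it chose."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in actions}  # running estimate of each action's reward
    count = {a: 0 for a in actions}
    for _ in range(trials):
        if rng.random() < epsilon:          # curiosity: occasionally explore
            a = rng.choice(actions)
        else:                               # otherwise exploit what was learned
            a = max(actions, key=lambda x: value[x])
        r = reward_of(a)                    # "some output" gets graded
        count[a] += 1
        value[a] += (r - value[a]) / count[a]  # incremental average of rewards
    return max(actions, key=lambda x: value[x])

# Hypothetical toddler scenario: touching the hot cup hurts.
rewards = {"touch": -1.0, "dont_touch": 0.0}
best = learn_from_rewards(lambda a: rewards[a], ["touch", "dont_touch"])
```

After enough graded trials-- the occasional "ouch" included-- the learner settles on not touching.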
The most important application, or one of the most important applications of
reinforcement learning, is in playing games.
So backgammon is one of the games, and say that you want
a system to learn it.
So what you want, you want to take the current state of the board, and you
roll the dice, and then you decide what is the optimal move in
order to stand the best chance to win.
That's the game.
So the target function is the best move given a state.
Now, if I have to generate those things in order for the system to
learn, then I must be a pretty good backgammon player already.
So now it's a vicious cycle.
Now, reinforcement learning comes in handy.
What you're going to do, you are going to have the
computer choose any output.
A crazy move, for all you care.
And then see what happens eventually.
So this computer is playing against another computer, both of
them want to learn.
And you make a move, and eventually you win or lose.
So you propagate back the credit because of winning or losing,
according to a very specific and sophisticated formula, into all the
moves that happened.
Now you think that's completely hopeless, because maybe this is not the
move that resulted in this, it's another move.
But always remember, that you are going to do this 100 billion times.
Not you, the poor computer.
You're sitting down sipping your tea.
A computer is doing this, playing against an imaginary opponent, and
they keep playing and playing and playing.
And in three hours of CPU time, you go back to the computer-- maybe not three
hours, maybe three days of CPU time-- you go back to the computer, and you
have a backgammon champion.
Actually, that's true.
The world champion, at some point, was a neural network that learned the way
I described.
So it is actually a very attractive approach, because in machine
learning now, we have a target function that we cannot model.
That covers a lot of territory, I've seen a lot of those.
We have data coming from the target function.
I usually have that.
And now we have the lazy man's approach to life.
We are going to sit down, and let the computer do all of the work, and
produce the system we want.
Instead of studying the thing mathematically, and writing code, and
debugging--
I hate debugging.
And then you go. No, we're not going to do that.
The learning algorithm just works, and produces something good.
And we get the check.
So this is a pretty good deal.
It actually is so good, it might be too good to be true.
So let's actually examine if all of this was a fantasy.
So now I'm going to give you a learning puzzle.
Humans are very good learners, right?
So I'm now going to give you a learning problem in the form that I
described, a supervised learning problem.
And that supervised learning problem will give you a training set, some
points mapped to +1, some points mapped to -1.
And then I'm going to give you a test point that is unlabeled.
Your task is to look at the examples, learn the target function, apply it to
the test point, and then decide what the value of the function is.
After that, I'm going to ask, who decided that the function is +1,
and who decided that the function is -1.
OK? It's clear what the deal is.
And I would like our online audience to do the same thing.
And please text what the solution is.
Just +1 or -1.
Fair enough?
Let's start the game.
What is above the line are the training examples.
I put the input as a three-by-three pattern in order to be visually easy
to understand.
But this is just really nine bits worth of information.
And they are ones and zeros, black and white.
And for this input, this input, and this input, the value of the target
function is -1.
For this input, this input, and this input, the value of the target
function is +1.
Now this is your data set, this is your training set.
Now you should learn the function.
And when you're done, could you please tell me what your function will return
on this test point?
Is it +1 or -1?
I will give everybody 30 seconds before I ask for an answer.
Maybe we should have some background music?
OK, time's up.
Your learning algorithm has converged, I hope.
And now we apply it here, and I ask people here, who says it's +1?
Thank you.
Who says it's -1?
Thank you.
I see that the online audience also contributed?
MODERATOR: Yeah, the big majority says +1.
PROFESSOR: But are there -1's?
MODERATOR: Two -1's.
PROFESSOR: Cool.
I don't care if it's a +1 or -1.
What I care about is that I get both answers.
That is the essence of it.
Why do I care?
Because in reality, this is an impossible task.
I told you the target function is unknown.
It could be anything, really anything.
And now I give you the value of the target function at 6 points.
Well, there are many functions that fit those 6 points, and behave
differently outside.
For example, if you take the function to be +1 if the top left square
is white, then this test point, whose top left square is black,
should be -1, right?
And if you take the function to be +1 if the pattern is symmetric, then
this test point should be +1, because it is symmetric.
Both functions fit all six training examples, and yet they disagree
on the test point.
Right?
And you can find infinite variety like that.
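To make the ambiguity concrete, here is a small Python sketch. The 3-by-3 patterns are invented for illustration (they are not the ones on the slide): two hypotheses-- "+1 if the top left square is white" and "+1 if the pattern is left-right symmetric"-- agree on every training example and still disagree on a test point:

```python
def h1(p):
    """+1 if the top-left square is white (0), else -1."""
    return 1 if p[0] == 0 else -1

def h2(p):
    """+1 if the pattern is left-right symmetric, else -1."""
    rows = [p[0:3], p[3:6], p[6:9]]
    return 1 if all(r == r[::-1] for r in rows) else -1

# Made-up training set: 0 = white, 1 = black, nine bits listed row by row.
train = [
    ((0, 1, 0, 1, 0, 1, 0, 1, 0), +1),
    ((0, 0, 0, 0, 1, 0, 0, 0, 0), +1),
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), +1),
    ((1, 0, 0, 0, 0, 0, 0, 0, 0), -1),
    ((1, 1, 0, 0, 0, 0, 0, 0, 0), -1),
    ((1, 0, 0, 1, 0, 0, 0, 0, 0), -1),
]

# Both hypotheses fit every training example perfectly...
assert all(h1(p) == y and h2(p) == y for p, y in train)

# ...yet on a test point that is black in the top left AND symmetric,
# they give opposite answers.
test_point = (1, 0, 1, 0, 1, 0, 1, 0, 1)
```

Here h1 says -1 and h2 says +1, and nothing in the six examples can break the tie.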
And that problem is not restricted to this case.
The question here is obvious.
The function is unknown.
You really mean unknown, right?
Yes, I mean it.
Unknown-- anything?
Yes, I do.
OK.
You give me a finite sample; the function can be anything outside it.
How in the world am I going to tell what the function is outside the sample?
OK, that sounds about right.
But we are in trouble, because that's the premise of learning.
If the goal was to memorize the examples I gave you, that would be
memorizing, not learning.
Learning is to figure out a pattern that applies outside.
And now we realize that outside, I cannot say anything.
Does this mean that learning is doomed?
Well, this is going to be a very short course!
Well, the good news is that learning is alive and well.
And we are going to show that, without compromising our basic premise.
The target function will continue to be unknown.
And we still mean unknown.
And we will be able to learn.
And that will be the subject of the next lecture.
Right now, we are going to go for a short break, after which we are going
to take the Q&A.
We'll start the Q&A, and we will get questions from the class here, and
from the online audience.
And if you'd like to ask a question, let me ask you to go to this side of
the room where the mic is, so that your question can be heard.
And we will alternate, if there are questions here, we will alternate
between campus and off campus.
So let me start if there is a question from outside.
MODERATOR: Yes, so the most common question is, how do you determine if
a set of points is linearly separable, and what do you do
if they're not separable.
PROFESSOR: The linear separability assumption is a very
simplistic assumption, and mostly doesn't apply in practice.
And I chose it only because it goes with a very simple algorithm, which is
the perceptron learning algorithm.
There are two ways to deal with the case of linear inseparability.
There are algorithms, and most algorithms actually deal with that
case, and there's also a technique that we are going to study next
week, which will take a set of points which is not linearly separable, and
create a mapping that makes them linearly separable.
So there is a way to deal with it.
However, the question how do you determine it's linearly separable, the
right way of doing it in practice is that, when someone gives you data, you
assume in general it's not linearly separable.
It will hardly ever be, and therefore you take techniques that can deal with
that case as well.
There is a simple modification of the perceptron learning algorithm, which
is called the pocket algorithm,
that applies the same rule with a very minor modification, and deals with the
case where the data is not separable.
However, if you apply the perceptron learning algorithm, that is guaranteed
to converge to a correct solution in the case of linear separability, and
you apply it to data that is not linearly separable, bad things happen.
Not only is it not going to converge-- obviously it is not going to
converge, because it terminates only when there are no misclassified points, right?
If there is a misclassified point, then there's a next iteration always.
So since the data is not linearly separable, we will never come to
a point where all the points are classified correctly.
So this is not what is bothering us.
What is bothering us is that, as you go from one step to another, you can
go from a very good solution to a terrible solution,
in the case where the data is not linearly separable.
So it's not an algorithm that you would like to use, and just
terminate by force at an iteration.
A modification of it can be used this way, and I'll mention it briefly when
we talk about linear regression and other linear methods.
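As a sketch of the pocket idea (a common way to write it, not necessarily the exact variant covered later): run the same perceptron update, but keep in your "pocket" the best weight vector seen so far. The XOR-style data below is made up and not linearly separable:

```python
import numpy as np

def pocket(X, y, max_iters=1000, seed=0):
    """Pocket algorithm sketch: PLA updates, but remember the weights
    with the fewest in-sample errors seen so far."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])
    w = np.zeros(d + 1)
    best_w, best_errors = w.copy(), N + 1
    for _ in range(max_iters):
        mis = np.where(np.sign(Xb @ w) != y)[0]
        if len(mis) < best_errors:         # pocket the best weights so far
            best_w, best_errors = w.copy(), len(mis)
        if len(mis) == 0:
            break
        i = rng.choice(mis)                # ordinary PLA update step
        w = w + y[i] * Xb[i]
    return best_w, best_errors

# XOR-like labels: no single line separates these four points.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
best_w, best_errors = pocket(X, y)
```

Plain PLA would wander forever here, sometimes moving from a good w to a terrible one; the pocketed weights can only get better over time.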
MODERATOR: There's also a question of how does the rate of convergence of
the perceptron change with the dimensionality of the data?
PROFESSOR: Badly!
That's the answer.
Let me put it this way.
You can build pathological cases, where it really will take forever.
However, I did not give the perceptron learning algorithm in the first
lecture to tell you that this is the great algorithm that you
need to learn.
I gave it in the first lecture, because this is the simplest
algorithm I could give.
By the end of this course, you'll be saying, what?
Perceptron?
Never heard of it.
So it will go out of contention, after we get to the more interesting stuff.
But as a method that can be used, it indeed can be used, and can be
explained in five minutes as you have seen.
MODERATOR: Regarding the items for learning, you mentioned that there
must be a pattern.
So can you be more specific about that?
How do you know if there's a pattern?
PROFESSOR: You don't.
My answers seem to be very abrupt, but that's the way it is.
When we get to the theory-- is learning feasible-- it will
become very clear that there is a separation between the target
function-- there is a pattern to detect--
and whether we can learn it.
It is very difficult for me to explain it in two minutes, it will take a full
lecture to get there.
But the essence of it is that you take the data, you apply your learning
algorithm, and there is something you can explicitly detect that will
tell you whether you learned or not.
So in some cases, you're not going to be able to learn.
In some cases, you'll be able to learn.
And the key is that you're going to be able to tell by
running your algorithm.
And I'm going to explain that in more detail later on.
So basically, I'm also resisting the approach of taking the data, looking
at it, and deciding whether it's linearly separable. You will
realize as we go through that it's a no-no to actually look at the data.
What?
That's what data is for, to look at.
Bear with me.
We will come to the level where we ask why don't we look at the data--
just looking at it and then saying: It's linearly separable.
Let's pick the perceptron.
That's bad practice, for reasons that are not obvious now.
They will become obvious, once we are done with the theory.
So when someone knocks on my door with a set of data, I can ask them all
kinds of questions about the data-- not the particular data set that they gave
me, but about the general data that is generated by their process.
They can tell me this variable is important, the function is symmetric,
they can give you all kinds of information that I will take to heart.
But I will try, as much as I can, to avoid looking at the particular data
set that they gave me, lest I should tailor my system toward this data set,
and be disappointed when another data set comes about.
You don't want to get too close to the data set.
This will become very clear as we go with the theory.
MODERATOR: In general about machine learning, how does it
relate to other statistical, especially econometric techniques?
PROFESSOR: Statistics is, in the form I described, machine
learning where the target--
it's not a function in this case-- is a probability distribution.
Statistics is a mathematical field.
And therefore, you put the assumptions that you need in order to be able to
rigorously prove the results you have, and get the results in detail.
For example, linear regression.
When we talk about linear regression, it will have very few assumptions, and
the results will apply to a wide range, because we didn't make too many
assumptions.
When you study linear regression under statistics, there is a lot of
mathematics that goes with it, lot of assumptions, because that is the
purpose of the field.
In general, machine learning tries to make the least assumptions and cover the
most territory. These go together.
So it is not a mathematical discipline, but it's not a purely
applied discipline.
It spans both the mathematical, to certain extent, but it is willing to
actually go into territory where we don't have mathematical models, and
still want to apply our techniques.
So that is what characterizes it the most.
And then there are other fields that overlap with machine learning;
you can find it under the name computational learning,
or statistical learning.
Data mining has a huge intersection with machine learning.
There are lots of disciplines around that actually share some value.
But the point is, the premise that you saw is so broad, that it shouldn't be
surprising that people at different times developed a particular discipline
with its own jargon, to deal with that discipline.
So what I'm giving you is machine learning as the mainstream goes, and
that can be applied as widely as possible to applications, both
practical applications and scientific applications.
You will see, here is a situation, I have an experiment, here is a target,
I have the data.
How do I produce the target in the best way I want?
And then you apply machine learning.
MODERATOR: Also, in a general question about machine learning.
Do machine learning algorithms perform global optimization methods,
or just local optimization methods?
PROFESSOR: Obviously, a general question.
Optimization is a tool for machine learning.
So we will pick whatever optimization that does the job for us.
And sometimes, there is a very specific optimization method.
For example, in support vector machines, it will be quadratic
programming.
It happens to be the one that works with that.
But optimization is not something that machine learning people
study for its own sake.
They obviously study it to understand it better, and to choose the correct
optimization method.
Now, the question is alluding to something that will
become clear when we talk about neural networks, which is local minimum versus
global minimum.
And it is impossible to put this in any perspective before we get the
details of neural networks, so I will defer that until
we get to that lecture.
MODERATOR: Also, this is a math question, I guess.
Is the hypothesis set, in a topological sense, continuous?
PROFESSOR: The hypothesis set can be anything, in principle.
So it can be continuous, and it can be discrete.
For example, in the next lecture I take the simplest case where we have
a finite hypothesis set, in order to make a certain point.
In reality, almost all the hypothesis sets that you find are
continuous and infinite.
Very infinite!
And the level of sophistication of the hypothesis set can be huge.
And nonetheless, we will be able to see that under one condition, which
comes from the theory, we'll be able to learn even if the hypothesis set is
huge and complicated.
There's a question from inside, yes?
STUDENT: I think I understood, more or less, the general idea, but I don't
understand the second example you gave about credit approval.
So how do we collect our data?
Should we give credit to everyone, or should we make our data biased,
because we cannot determine the data of--
we can't determine, should we give credit or not to persons we rejected?
PROFESSOR: Correct.
This is a good point. Every time someone asks a question, the
lecture number comes to my mind.
I know when I'm going to talk about it.
So what you describe is called sampling bias.
And I will describe it in detail.
But when you use the biased data, let's say the bank uses historical records.
So it sees the people who applied and were accepted, and for those guys, it
can actually predict what the credit behavior is, because it has their
credit history.
They charged and repaid and maxed out, and all of this.
And then they decide: is this a good customer or not?
For those who were rejected, there's really no way to tell in this case
whether they were falsely rejected, that they would have been good
customers or not.
Nonetheless, if you take the customer base that you have, and base your
decision on it, the boundary works fairly decently.
Actually, pretty decently, even for the other guys, because the other guys
usually are deeper into the classification region than the
boundary guys that you accepted, and turned out to be bad.
But the point is well taken.
The data set in this case is not completely representative, and there
is a particular principle in learning that we'll talk about, which is
sampling bias, that deals with this case.
Another question from here?
STUDENT: You explain that we need to have a lot of data to learn.
So how do you decide how much amount of data that is required for
a particular problem, in order to be able to come up with a reasonable--
PROFESSOR: Good question.
So let me tell you the theoretical, and the practical answer.
The theoretical answer is that this is exactly the crux of the theory part
that we're going to talk about.
And in the theory, we are going to see, can we learn?
And how much data.
So all of this will be answered in a mathematical way.
So this is the theoretical answer.
The practical answer is: that's not under your control.
When someone knocks on your door: Here is the data, I have 500 points.
I tell him, I will give you a fantastic system if you
just give me 2000.
But I don't have 2000, I have 500.
So now you go and you use your theory to do something to your system, such
that it can work with the 500.
There was one case--
I worked with data in different applications--
at some point, we had almost 100 million points.
You were swimming in data.
You wouldn't complain about data.
Data was wonderful.
And in another case, there were less than 100 points.
And you had to deal with the data with gloves!
Because if you use them the wrong way, they are contaminated, which is
an expression we will see, and then you have nothing.
And you will produce a system, and you are proud of it, but you have no idea
whether it will perform well or not.
And you cannot give this to the customer, and have the customer come
back to you and say: what did you do!?
So there is a question of, what performance can you do given
what data size you have?
But in practice, you really have no control over the data size in almost
all the cases, almost all the practical cases.
Yes?
STUDENT: Another question I have is regarding the hypothesis set.
So the larger the hypothesis set is, probably I'll be able to
better fit the data.
But that, as you were explaining, might be a bad thing to do because
when the new data point comes, there might be troubles.
So how do you decide the size of your--
PROFESSOR: You are asking all the right questions, and all of
them are coming up.
This is again part of the theory, but let me try to explain this.
As we mentioned, learning is about being able to predict.
So you are using the data, not to memorize it, but to figure out what
the pattern is.
And if you figure out a pattern that applies to all the data, and it's
a reasonable pattern, then you have a chance that it
will generalize outside.
Now the problem is that, if I give you 50 points, and you use a 7000th-order
polynomial, you will fit the heck out of the data.
You will fit it so much with so many degrees of freedom to spare, but you
haven't learned anything.
You just memorized it in a fancy way.
You put it in a polynomial form, and that actually carries all the
information about the data that you have,
and then some.
So you don't expect at all that this will generalize outside.
And that intuitive observation will be formalized when we
talk about the theory.
There will be a measurement of the hypothesis set that you give me, that
measures the sophistication of it, and will tell you with that
sophistication, you need that amount of data in order to be able to make
any statement about generalization.
So that is what the theory is about.
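The polynomial intuition can be checked numerically in Python; the data, noise level, and degrees below are made up (scaled down from the 50-points-and-7000th-order example):

```python
import numpy as np

rng = np.random.default_rng(0)

# A few noisy points from a simple underlying function.
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(8)

# A degree-7 polynomial has 8 coefficients for 8 points:
# it fits the heck out of the data, essentially interpolating it.
overfit = np.polynomial.Polynomial.fit(x, y, deg=7)
train_error_high = np.mean((overfit(x) - y) ** 2)

# A lower-degree fit leaves some in-sample error on the table.
simple = np.polynomial.Polynomial.fit(x, y, deg=3)
train_error_low = np.mean((simple(x) - y) ** 2)
```

Driving the in-sample error to zero this way is memorizing in a fancy form; it says nothing about how the fit behaves away from the training points.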
STUDENT: Suppose, I mean, here whatever we discussed, it is like I
had a data set and I came up with an algorithm, and gave the output.
But won't it be also important to see, OK, we came up with the output, and
using that, what was the feedback?
Are there techniques where you take the feedback and try to
correct your--
PROFESSOR: You are alluding to different techniques here.
But one of them would be validation, which is after you learn, you validate
your solution.
And this is an extremely established and core technique in machine learning
that will be covered in one of the lectures.
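The validation the professor mentions, in its simplest hold-out form, can be sketched in a few lines (the data set, noise level, and candidate polynomial degrees here are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data for illustration: noisy observations of a cubic trend.
x = rng.uniform(-1, 1, 60)
y = x ** 3 - x + 0.05 * rng.normal(size=60)

# Hold out part of the data: learn on one piece, validate on the other.
x_tr, y_tr = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def val_error(degree):
    # Fit on the training piece, score on the held-out piece.
    coeffs = np.polyfit(x_tr, y_tr, degree)
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

# Compare candidate hypothesis sets by their error on data the
# learning algorithm never saw, and pick the best one.
errors = {d: val_error(d) for d in (1, 3, 9, 15)}
best_degree = min(errors, key=errors.get)
```

The held-out points act as a proxy for out-of-sample performance, which is what makes this a check on the learned solution rather than on its fit to the training data.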
Any questions from the online audience?
MODERATOR: In practice, how many dimensions would be considered easy,
medium, and hard for a perceptron problem?
PROFESSOR: What makes a problem hard, in most people's minds before they
get into machine learning, is the computation time.
If something takes a lot of time to compute, it's a hard problem.
If something can be computed quickly, it's an easy problem.
For machine learning, the bottleneck in my case has never been the
computation time, even with incredibly big data sets.
The bottleneck for machine learning is to be able to generalize outside the
data that you have seen.
So to answer your question, the perceptron behaves badly computationally.
We will be able to predict its generalization behavior, based on the
number of dimensions and the amount of data.
This will be given explicitly.
And therefore, the perceptron algorithm is bad computationally, good
in terms of generalization.
If you actually can get away with perceptrons, your chances of
generalizing are good, because it's a simplistic model, and simple models
generalize well, as we will see.
MODERATOR: Also, in the example, you explained the use of a binary function.
Can you use multi-valued or real-valued functions instead?
PROFESSOR: Correct.
Remember when I told you that there is a topic that is out of sequence?
There was a logical sequence to the course, and then I took part of the
linear models and put it very early on, to give you something a little bit
more sophisticated than perceptrons to try your hand on.
That happens to be for real-valued functions.
And obviously there are hypotheses that cover all types of co-domains.
Y could be anything as well.
MODERATOR: Another question is, in the learning process you showed, when
do you pick your learning algorithm, when do you pick your hypothesis set,
and what liberty do you have?
PROFESSOR: The hypothesis set is the most important aspect of
determining the generalization behavior that we'll talk about.
The learning algorithm does play a role, although it is a secondary role,
as we will see in the discussion.
So in general, the learning algorithm has the form of
minimizing an error function.
So think of the perceptron: what does the algorithm do?
It tries to minimize the classification error.
That is your error function, and you're minimizing it using this
particular update rule.
And in other cases, we'll see that we are minimizing an error function.
Now the minimization aspect is an optimization question, and once you
determine that this is indeed the error function that I want to
minimize, then you go and minimize as much as you can using the most
sophisticated optimization technique that you find.
So the question now translates into what is the choice of the error
function or error measure that will help or not help.
And that will be covered also next week under the topic, Error and Noise.
When I talk about error, we'll talk about error measures, and this
translates directly to the learning algorithm that goes with them.
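Viewed this way, the perceptron's learning algorithm is a loop that keeps driving the classification error down with its update rule. A minimal sketch, assuming the standard perceptron learning algorithm on linearly separable data (the data points here are hypothetical, chosen only for illustration):

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron learning algorithm on linearly separable data.

    X: (n, d) inputs; y: (n,) labels in {-1, +1}.
    A constant coordinate is appended to absorb the threshold.
    """
    Xb = np.hstack([np.ones((len(X), 1)), X])      # bias coordinate
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        wrong = np.flatnonzero(np.sign(Xb @ w) != y)   # misclassified points
        if wrong.size == 0:
            return w                               # classification error is zero
        i = wrong[0]
        w = w + y[i] * Xb[i]   # the update rule: move w toward agreeing with (x_i, y_i)
    return w

# Hypothetical separable data for illustration.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = pla(X, y)
```

Each update fixes one misclassified point at a time; for separable data the loop terminates with every point classified correctly, which is exactly the sense in which the algorithm minimizes the classification error.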
MODERATOR: Back to the perceptron.
So what happens if your hypothesis gives you exactly 0 in this case?
PROFESSOR: So remember that the quantity you compute and
compare with the threshold was your credit score.
So I told you what happens if you are above threshold, and what happens if
you're below threshold.
So what happens if you're exactly at the threshold?
Your score is exactly at the threshold.
The informal answer is that it depends on the mood of the credit
officer on that day.
If they had a bad day, you will be denied!
But the serious answer is that there are technical ways of
defining that point.
You can define it as 0, so the sign of 0 is 0.
In which case you are always making an error at that point, because the
output is never +1 or -1 when it should be.
Or you could make it belong to the +1 category or
to the -1 category.
There are ramifications for all of these decisions
that are purely technical.
Nothing conceptual comes out of them.
That's why I decided not to include it.
Because it clutters the main concept with something that really has no
ramification.
As far as you're concerned, the easiest way to consider it is that the
output will be 0, and therefore you will be making an error regardless of
whether it's +1 or -1.
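The convention the professor settles on, treating a score of exactly 0 as an error against both classes, can be written down directly (a minimal sketch; the weight vector and input points are hypothetical):

```python
import numpy as np

def perceptron_output(w, x):
    """Sign of the credit score w . x, with the boundary case kept as 0.

    np.sign returns 0 at exactly 0, so a point sitting exactly on the
    threshold matches neither +1 nor -1 and is counted as an error.
    """
    return int(np.sign(np.dot(w, x)))

# Hypothetical weights and inputs for illustration.
w = np.array([1.0, -1.0])
print(perceptron_output(w, np.array([2.0, 1.0])))   # above threshold: +1
print(perceptron_output(w, np.array([1.0, 3.0])))   # below threshold: -1
print(perceptron_output(w, np.array([1.0, 1.0])))   # exactly on the boundary: 0
```

Mapping the boundary point to 0 rather than to +1 or -1 is just one of the technical choices mentioned; the conceptual picture is the same under any of them.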
MODERATOR: Is there a kind of problem that cannot be learned even if
there's a huge amount of data?
PROFESSOR: Correct.
For example, if I go to my computer and use a pseudo-random number
generator to generate the target over the entire domain, then patently,
nothing I can give you will let you learn the remaining points.
So remember the three premises-- let me recall them--
the essence of machine learning.
The first one was, a pattern exists.
If there's no pattern that exists, there is nothing to learn.
Let's say it's like a baby: stuff is happening, and the
baby is just staring. There is nothing to pick up from it.
Once there is a pattern, you can see the smile on the baby's face.
Now I can see what is going on.
So whatever you are learning, there needs to be a pattern.
Now, how to tell that there's a pattern or not,
that's a different question.
But the main ingredient is that there's a pattern. The second one is that
we cannot pin it down mathematically.
If we can pin it down mathematically, and you decide to do
the learning, then you are really lazy.
Because you could just write the code.
But fine.
You can use learning in this case, but it's not the recommended method,
because it has certain errors in performance.
Whereas if you have the mathematical definition, you just implement it and
you'll get the best possible solution.
And the third one: you have data, which is key.
So if you have plenty of data but the first premise is off, you are simply
not going to learn.
And it's not like I have to answer each of these questions at random.
The theory will completely capture what is going on.
So there's a very good reason for going through four lectures in the
outline that are mathematically inclined.
This is not for the sake of math.
I don't like to do math hacking, if you will.
I pick the math that is necessary to establish a concept.
And these will establish it, and they are very much worth being patient with
and going through.
Because once you're done with them, you basically have it cold about what
are the components that make learning possible, and how do we tell, and all
of the questions that have been asked.
MODERATOR: Historical question.
So why is the perceptron often associated with a neuron?
PROFESSOR: I will discuss this in the neural networks lecture, but in
general, when you take a neuron and its synapses and work out the function
it computes, you find that the neuron fires, which is +1, if the signal
coming into it, which is roughly a combination of the stimuli, exceeds
a certain threshold.
So that was the initial inspiration, and the initial inspiration was
that: the brain does a pretty good job, so maybe if we mimic the
function, we will get something good.
But you mimic one neuron, and then you put it together and you'll get the
neural network that you are talking about.
And I will discuss the analogy with biology, and the extent to which it can
be benefited from, when we talk about neural networks, because
that will be the more proper context for that.
MODERATOR: Another question is, regarding the hypothesis set, are there
Bayesian hierarchical procedures to narrow down the hypothesis set?
PROFESSOR: OK.
The choice of the hypothesis set and the model in general is model
selection, and there's quite a bit of stuff that we are going to talk about
in model selection, when we talk about validation.
In general, the word Bayesian was mentioned here-- if you
look at machine learning, there are schools that deal with the subject
differently.
So for example, the Bayesian school puts a mathematical framework
completely on it.
And then everything can be derived, and that is based on Bayesian
principles.
I will talk about that at the very end, so it's last but not least.
And I will make a very specific point about it, for what it's worth.
But what I'm talking about in the course in all of the details, are the
most commonly useful methods in practice.
That is my criterion for inclusion.
So I will get to that when we get there.
In terms of a hierarchy,
there are a number of hierarchical methods.
For example, structural risk minimization is one of them.
These hierarchical methods have ramifications for generalization.
I may touch upon it, when I get to support vector machines.
But again, there's a lot of theory, and if you read a book on machine
learning written by someone from pure theory, you would think that you are
reading about a completely different subject.
It's respectable stuff, but different from the other
stuff that is practiced.
So one of the things that I'm trying to do, I'm trying to pick from all the
components of machine learning, the big picture that gives you the
understanding of the concept, and the tools to use it in practice.
That is the criterion for inclusion.
Any questions from the inside here?
OK, we'll call it a day, and we'll see you on Thursday.