
  • [MUSIC PLAYING]

  • Hey, everyone.

  • Welcome back.

  • In this episode, we're going to do something special,

  • and that's write our own classifier from scratch.

  • If you're new to machine learning,

  • this is a big milestone.

  • Because if you can follow along and do this on your own,

  • it means you understand an important piece of the puzzle.

  • The classifier we're going to write today

  • is a scrappy version of k-Nearest Neighbors.

  • That's one of the simplest classifiers around.

  • First, here's a quick outline of what we'll do in this episode.

  • We'll start with our code from Episode 4,

  • "Let's Write a Pipeline."

  • Recall in that episode we did a simple experiment.

  • We imported a data set and split it into train and test.

  • We used train to train a classifier,

  • and test to see how accurate it was.

  • Writing the classifier is the part

  • we're going to focus on today.

  • Previously we imported the classifier

  • from a library using these two lines.

  • Here we'll comment them out and write our own.

  • The rest of the pipeline will stay exactly the same.
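
For reference, here's a rough sketch of what that Episode 4 pipeline might look like. The exact variable names and the classifier class (KNeighborsClassifier) are assumptions based on the description above, not a transcript of the on-screen code.

```python
# A rough sketch of the Episode 4 pipeline, assuming the iris dataset and
# sklearn's KNeighborsClassifier (assumptions; the video's code may differ).
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# These are the two classifier lines we'll comment out and replace:
from sklearn.neighbors import KNeighborsClassifier
my_classifier = KNeighborsClassifier()

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split the data: half for training, half for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

my_classifier.fit(X_train, y_train)
predictions = my_classifier.predict(X_test)
print(accuracy_score(y_test, predictions))
```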

  • I'll pop in and out of the screencast to explain things

  • as we go along.

  • To start, let's run our pipeline to remind ourselves

  • what the accuracy was.

  • As you can see, it's over 90%.

  • And that's the goal for the classifier

  • we'll write ourselves.

  • Now let's comment out that import.

  • Right off the bat, this breaks our code.

  • So the first thing we need to do is fix our pipeline.

  • And to do that, we'll implement a class for our classifier.

  • I'll call it ScrappyKNN.

  • And by scrappy, I mean bare bones.

  • Just enough to get it working.

  • Next, I'll change our pipeline to use it.

  • Now let's see what methods we need to implement.

  • Looking at the interface for a classifier,

  • we see there are two we care about-- fit,

  • which does the training, and predict,

  • which does the prediction.

  • First we'll declare our fit method.

  • Remember this takes the features and labels for the training set

  • as input, so we'll add parameters for those.

  • Now let's move on to our predict method.

  • As input, this receives the features for our testing data.

  • And as output, it returns predictions for the labels.
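
A minimal sketch of that class declaration and the pipeline change, with the method bodies still to be filled in (the class and variable names follow the video's description; the exact code may differ):

```python
class ScrappyKNN():
    def fit(self, X_train, y_train):
        # Takes the training features and labels as input.
        pass

    def predict(self, X_test):
        # Takes the testing features; will return a list of predicted labels.
        pass

# In the pipeline, use our class instead of the library classifier:
# my_classifier = KNeighborsClassifier()
my_classifier = ScrappyKNN()
```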

  • Our first goal is to get the pipeline working,

  • and to understand what these methods do.

  • So before we write our real classifier,

  • we'll start with something simpler.

  • We'll write a random classifier.

  • And by random, I mean we'll just guess the label.

  • To start, we'll add some code to the fit and predict methods.

  • In fit, I'll store the training data in this class.

  • You can think of this as just memorizing it.

  • And you'll see why we do that later on.

  • Inside the predict method, remember

  • that we'll need to return a list of predictions.

  • That's because the parameter, X_test, is actually

  • a 2D array, or list of lists.

  • Each row contains the features for one testing example.

  • To make a prediction for each row,

  • I'll just randomly pick a label from the training data

  • and append that to our predictions.
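
Put together, the random version of ScrappyKNN might look something like this sketch (the video's exact code may differ slightly):

```python
import random

class ScrappyKNN():
    def fit(self, X_train, y_train):
        # "Memorize" the training data; we'll need it in predict.
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            # Random classifier: just guess by picking a label
            # at random from the training labels.
            label = random.choice(self.y_train)
            predictions.append(label)
        return predictions
```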

  • At this point, our pipeline is working again.

  • So let's run it and see how well it does.

  • Recall there are three different types of flowers

  • in the iris dataset, so accuracy should be about 33%.

  • Now we know the interface for a classifier.

  • But when we started this exercise,

  • our accuracy was above 90%.

  • So let's see if we can do better.

  • To do that, we'll implement our classifier,

  • which is based on k-Nearest Neighbors.

  • Here's the intuition for how that algorithm works.

  • Let's return to our drawings of green dots and red dots

  • from the last episode.

  • Imagine the dots we see on the screen

  • are the training data we memorized in the fit method,

  • say for a toy dataset.

  • Now imagine we're asked to make a prediction for this testing

  • point that I'll draw here in gray.

  • How can we do that?

  • Well, a nearest neighbor classifier

  • works exactly like it sounds.

  • We'll find the training point that's

  • closest to the testing point.

  • This point is the nearest neighbor.

  • Then we'll predict that the testing

  • point has the same label.

  • For example, we'll guess that this testing dot is green,

  • because that's the color of its nearest neighbor.

  • As another example, if we had a testing dot over here,

  • we'd guess that it's red.

  • Now what about this one right in the middle?

  • Imagine that this dot is equidistant to the nearest

  • green dot and the nearest red one.

  • There's a tie, so how could we classify it?

  • Well one way is we could randomly break the tie.

  • But there's another way, and that's where k comes in.

  • K is the number of neighbors we consider

  • when making our prediction.

  • If k was 1, we'd just look at the closest training point.

  • But if k was 3, we'd look at the three closest.

  • In this case, two of those are green and one is red.

  • To predict, we could vote and predict the majority class.

  • Now there's more detail to this algorithm,

  • but that's enough to get us started.
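
We'll hard-code k to 1 in a moment, but as a side note, here's one way a majority vote over k neighbors could look. The vote helper below is hypothetical, not something from the video:

```python
from collections import Counter

def vote(k_nearest_labels):
    # Hypothetical helper: given the labels of the k closest training points,
    # return the label that appears most often (the majority class).
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Example: with k=3, two green neighbors and one red -> predict "green".
print(vote(["green", "green", "red"]))  # green
```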

  • To code this up, first we'll need a way

  • to find the nearest neighbor.

  • And to do that, we'll measure the straight line

  • distance between two points, just like you do with a ruler.

  • There's a formula for that called the Euclidean Distance,

  • and here's what the formula looks like.

  • It measures the distance between two points,

  • and it works a bit like the Pythagorean Theorem.

  • A squared plus B squared equals C squared.

  • You can think of this term as A, or the difference

  • between the first pair of features.

  • Likewise, you can think of this term as B,

  • or the difference between the second pair of features.

  • And the distance we compute is the length of the hypotenuse.
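
In other words, for two points a = (a1, a2) and b = (b1, b2), the two-dimensional formula is:

```latex
\text{distance}(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}
```

Here (a1 - b1) plays the role of A, (a2 - b2) the role of B, and the distance is C, the length of the hypotenuse.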

  • Now here's something cool.

  • Right now we're computing distance

  • in two-dimensional space, because we have just two

  • features in our toy dataset.

  • But what if we had three features or three dimensions?

  • Well then we'd be in a cube.

  • We can still visualize how to measure distance

  • in the space with a ruler.

  • But what if we had four features or four dimensions,

  • like we do in iris?

  • Well, now we're in a hypercube, and we

  • can't visualize this very easily.

  • The good news is the Euclidean Distance

  • works the same way regardless of the number of dimensions.

  • With more features, we can just add more terms to the equation.
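
Concretely, for points with n features the same formula just picks up more terms under the square root:

```latex
\text{distance}(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}
```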

  • You can find more details about this online.

  • Now let's code up Euclidean distance.

  • There are plenty of ways to do that,

  • but we'll use a library called scipy, which

  • comes preinstalled with Anaconda.

  • Here, A and B are lists of numeric features.

  • Say A is a point from our training data,

  • and B is a point from our testing data.

  • This function returns the distance between them.
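
That helper might look like the following sketch; scipy's distance.euclidean does the actual computation, and euc is simply the name we give our wrapper:

```python
from scipy.spatial import distance

def euc(a, b):
    # a is a point from the training data, b a point from the testing data;
    # each is a list (or array) of numeric features.
    return distance.euclidean(a, b)
```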

  • That's all the math we need, so now

  • let's take a look at the algorithm for a classifier.

  • To make a prediction for a test point,

  • we'll calculate the distance to all the training points.

  • Then we'll predict the testing point has the same label

  • as the closest one.

  • I'll delete the random prediction we made,

  • and replace it with a method that finds the closest training

  • point to the test point.

  • For this video, I'll hard-code k to 1,

  • so we're writing a nearest neighbor classifier.

  • The k variable won't appear in our code,

  • since we'll always just find the closest point.

  • Inside this method, we'll loop over all the training points

  • and keep track of the closest one so far.

  • Remember that we memorized the training data in our fit

  • function, and X_train contains the features.

  • To start, I'll calculate the distance from the test point

  • to the first training point.

  • I'll use this variable to keep track of the shortest

  • distance we've found so far.

  • And I'll use this variable to keep

  • track of the index of the training point that's closest.

  • We'll need this later to retrieve its label.

  • Now we'll iterate over all the other training points.

  • And every time we find a closer one,

  • we'll update our variables.

  • Finally, we'll use the index to return

  • the label for the closest training example.
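
Putting all the pieces together, the finished nearest neighbor version of ScrappyKNN might look like this sketch (same caveat as before: the names match the description above, but the video's exact code may differ):

```python
from scipy.spatial import distance

def euc(a, b):
    # Straight-line (Euclidean) distance between two feature lists.
    return distance.euclidean(a, b)

class ScrappyKNN():
    def fit(self, X_train, y_train):
        # Memorize the training data.
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        predictions = []
        for row in X_test:
            # k is effectively hard-coded to 1: use only the single closest point.
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        # Start by treating the first training point as the closest so far.
        best_dist = euc(row, self.X_train[0])
        best_index = 0
        # Check every other training point, keeping the closest one found.
        for i in range(1, len(self.X_train)):
            dist = euc(row, self.X_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        # Return the label of the closest training example.
        return self.y_train[best_index]
```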

  • At this point, we have a working nearest neighbor classifier,

  • so let's run it and see what the accuracy is.

  • As you can see, it's over 90%.

  • And we did it.

  • When you run this on your own, the accuracy

  • might be a bit different because of randomness in the train test

  • split.
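
If you'd like a repeatable number while experimenting, one option, assuming the pipeline uses sklearn's train_test_split, is to fix the split's random seed:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Fixing random_state makes the train/test split, and therefore the
# reported accuracy, repeatable from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
```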

  • Now if you can code this up and understand it,

  • that's a big accomplishment because it

  • means you can write a simple classifier from scratch.

  • Now, there are a number of pros and cons

  • to this algorithm, many of which you can find online.

  • The basic pro is that it's relatively easy to understand,

  • and works reasonably well for some problems.

  • And the basic cons are that it's slow,

  • because it has to iterate over every training point

  • to make a prediction.

  • And importantly, as we saw in Episode 3,

  • some features are more informative than others.

  • But there's not an easy way to represent

  • that in k-Nearest Neighbors.

  • In the long run, we want a classifier

  • that learns more complex relationships between features

  • and the label we're trying to predict.

  • A decision tree is a good example of that.

  • And a neural network like we saw in TensorFlow Playground

  • is even better.

  • OK, hope that was helpful.

  • Thanks as always for watching.

  • You can follow me on Twitter for updates and, of course,

  • Google Developers.

  • And I'll see you guys next time.

  • [MUSIC PLAYING]
