[MUSIC PLAYING]
Last episode, we used a decision tree as our classifier.
Today we'll add code to visualize it
so we can see how it works under the hood.
There are many types of classifiers
you may have heard of before-- things like neural nets
or support vector machines.
So why did we use a decision tree to start?
Well, they have a property that sets them apart--
they're easy to read and understand.
In fact, they're one of the few models that are interpretable,
where you can understand exactly why the classifier makes
a decision.
That's amazingly useful in practice.
To get started, I'll introduce you
to a real data set we'll work with today.
It's called Iris.
Iris is a classic machine learning problem.
In it, you want to identify what type of flower
you have based on different measurements,
like the length and width of the petal.
The data set includes three different types of flowers.
They're all species of iris-- setosa, versicolor,
and virginica.
Scrolling down, you can see we're
given 50 examples of each type, so 150 examples total.
Notice there are four features that are
used to describe each example.
These are the length and width of the sepal and petal.
And just like in our apples and oranges problem,
the first four columns give the features and the last column
gives the labels, which is the type of flower in each row.
Our goal is to use this data set to train a classifier.
Then we can use that classifier to predict what species
of flower we have if we're given a new flower that we've never
seen before.
Knowing how to work with an existing data set
is a good skill, so let's import Iris into scikit-learn
and see what it looks like in code.
Conveniently, the friendly folks at scikit
provided a bunch of sample data sets,
including Iris, as well as utilities
to make them easy to import.
We can import Iris into our code like this.
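As a rough sketch, assuming scikit-learn is installed,
the import is just a couple of lines:

    from sklearn.datasets import load_iris

    # Load the Iris data set bundled with scikit-learn.
    iris = load_iris()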
The data set includes both the table
from Wikipedia as well as some metadata.
The metadata tells you the names of the features
and the names of different types of flowers.
The features and examples themselves
are contained in the data variable.
For example, if I print out the first entry,
you can see the measurements for this flower.
These index to the feature names, so the first value
refers to the sepal length, and the second to sepal width,
and so on.
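Sketched out, printing the metadata and the first entry
looks like this; the outputs in the comments assume
the standard Iris ordering:

    # The metadata names each of the four measurements.
    print(iris.feature_names)
    # ['sepal length (cm)', 'sepal width (cm)',
    #  'petal length (cm)', 'petal width (cm)']

    # The first example: four measurements, in the same order.
    print(iris.data[0])  # [5.1  3.5  1.4  0.2]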
The target variable contains the labels.
Likewise, these index to the target names.
Let's print out the first one.
A label of 0 means it's a setosa.
If you look at the table from Wikipedia,
you'll notice that we just printed out the first row.
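In code, that check is just two prints; the mapping
from 0 to setosa comes from the metadata's target names:

    # The label for the first example.
    print(iris.target[0])        # 0

    # Labels index into the target names: 0 -> setosa.
    print(iris.target_names[0])  # setosa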
Now both the data and target variables have 150 entries.
If you want, you can iterate over them
to print out the entire data set like this.
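A simple loop does it; the print format here is just
one way to lay the output out:

    # Print every example alongside its label.
    for i in range(len(iris.target)):
        print("Example %d: label %s, features %s"
              % (i, iris.target[i], iris.data[i]))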
Now that we know how to work with the data set,
we're ready to train a classifier.
But before we do that, first we need to split up the data.
I'm going to remove several of the examples
and put them aside for later.
We'll call the examples I'm putting aside our testing data.
We'll keep these separate from our training data,
and later on we'll use our testing examples
to test how accurate the classifier is
on data it's never seen before.
Testing is actually a really important part
of doing machine learning well in practice,
and we'll cover it in more detail in a future episode.
Just for this exercise, I'll remove one example
of each type of flower.
And as it happens, the data set is
ordered so the first setosa is at index 0,
and the first versicolor is at 50, and so on.
The syntax looks a little bit complicated, but all I'm doing
is removing three entries from the data and target variables.
Then I'll create two new sets of variables-- one
for training and one for testing.
Training will have the majority of our data,
and testing will have just the examples I removed.
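Here's a sketch of that split using numpy's delete;
the variable names are my own, and iris comes from
the load_iris call earlier:

    import numpy as np

    # Index of the first example of each species.
    test_idx = [0, 50, 100]

    # Training data: everything except the three test examples.
    train_target = np.delete(iris.target, test_idx)
    train_data = np.delete(iris.data, test_idx, axis=0)

    # Testing data: just the three examples we removed.
    test_target = iris.target[test_idx]
    test_data = iris.data[test_idx]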
Now, just as before, we can create a decision tree
classifier and train it on our training data.
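With those variables in place, training is two lines:

    from sklearn import tree

    # Create the classifier and fit it to the training data.
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(train_data, train_target)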
Before we visualize it, let's use the tree
to classify our testing data.
We know we have one flower of each type,
and we can print out the labels we expect.
Now let's see what the tree predicts.
We'll give it the features for our testing data,
and we'll get back labels.
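The outputs in the comments assume the tree gets all
three right, which is what happens in this run:

    # The labels we expect for the three test flowers.
    print(test_target)             # [0 1 2]

    # The labels the tree actually predicts.
    print(clf.predict(test_data))  # [0 1 2]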
You can see the predicted labels match our testing data.
That means it got them all right.
Now, keep in mind, this was a very simple test,
and we'll go into more detail down the road.
Now let's visualize the tree so we can
see how the classifier works.
To do that, I'm going to copy-paste
some code in from scikit's tutorials,
and because this code is for visualization
and not machine-learning concepts,
I won't cover the details here.
Note that I'm combining the code from these two examples
to create an easy-to-read PDF.
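The exact tutorial code has changed across scikit-learn
versions, so here's one sketch using export_graphviz and
the graphviz package; it also needs the Graphviz system
binaries installed:

    import graphviz

    # Export the trained tree as Graphviz DOT source, annotated
    # with the feature and class names from the metadata.
    dot_data = tree.export_graphviz(
        clf, out_file=None,
        feature_names=iris.feature_names,
        class_names=iris.target_names,
        filled=True, rounded=True, impurity=False)

    # Render the DOT source to iris.pdf.
    graphviz.Source(dot_data).render("iris", cleanup=True)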
I can run our script and open up the PDF,
and we can see the tree.
To use it to classify data, you start by reading from the top.
Each node asks a yes or no question
about one of the features.
For example, this node asks if the petal width
is less than 0.8 centimeters.
If it's true for the example you're classifying, go left.
Otherwise, go right.
Now let's use this tree to classify an example
from our testing data.
Here are the features and label for our first testing flower.
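Two prints show them; the values in the comments assume
the standard Iris ordering:

    print(test_data[0])    # [5.1  3.5  1.4  0.2]
    print(test_target[0])  # 0, which indexes to setosa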
Remember, you can find the feature names
by looking at the metadata.
We know this flower is a setosa, so let's see
what the tree predicts.
I'll resize the windows to make this easier to see.
And the first question the tree asks
is whether the petal width is less than 0.8 centimeters.
That's the fourth feature.
The answer is true, so we proceed left.
At this point, we're already at a leaf node.
There are no other questions to ask,
so the tree gives us a prediction, setosa,
and it's right.
Notice the label is 0, which indexes to that type of flower.
Now let's try our second testing example.
This one is a versicolor.
Let's see what the tree predicts.
Again we read from the top, and this time the petal width
is greater than 0.8 centimeters.
The answer to the tree's question is false,
so we go right.
The next question the tree asks is whether the petal width
is less than 1.75.
It's trying to narrow it down.
That's true, so we go left.
Now it asks if the petal length is less than 4.95.
That's true, so we go left again.
And finally, the tree asks if the petal width
is less than 1.65.
That's true, so left it is.
And now we have our prediction-- it's a versicolor,
and that's right again.
You can try the last one on your own as an exercise.
And remember, the way we're using the tree
is the same way it works in code.
So that's how you quickly visualize and read
a decision tree.
There's a lot more to learn here,
especially how they're built automatically from examples.
We'll get to that in a future episode.
But for now, let's close with an essential point.
Every question the tree asks must be about one
of your features.
That means the better your features are, the better a tree
you can build.
And the next episode will start looking
at what makes a good feature.
Thanks very much for watching, and I'll see you next time.
[MUSIC PLAYING]