  • Last episode, we used a decision tree as our classifier.

  • Today we'll add code to visualize it

  • so we can see how it works under the hood.

  • There are many types of classifiers

  • you may have heard of before-- things like neural nets

  • or support vector machines.

  • So why did we use a decision tree to start?

  • Well, they have a unique property--

  • they're easy to read and understand.

  • In fact, they're one of the few models that are interpretable,

  • where you can understand exactly why the classifier makes

  • a decision.

  • That's amazingly useful in practice.

  • To get started, I'll introduce you

  • to a real data set we'll work with today.

  • It's called Iris.

  • Iris is a classic machine learning problem.

  • In it, you want to identify what type of flower

  • you have based on different measurements,

  • like the length and width of the petal.

  • The data set includes three different types of flowers.

  • They're all species of iris-- setosa, versicolor,

  • and virginica.

  • Scrolling down, you can see we're

  • given 50 examples of each type, so 150 examples total.

  • Notice there are four features that are

  • used to describe each example.

  • These are the length and width of the sepal and petal.

  • And just like in our apples and oranges problem,

  • the first four columns give the features and the last column

  • gives the labels, which is the type of flower in each row.

  • Our goal is to use this data set to train a classifier.

  • Then we can use that classifier to predict what species

  • of flower we have if we're given a new flower that we've never

  • seen before.

  • Knowing how to work with an existing data set

  • is a good skill, so let's import Iris into scikit-learn

  • and see what it looks like in code.

  • Conveniently, the friendly folks at scikit-learn

  • provided a bunch of sample data sets,

  • including Iris, as well as utilities

  • to make them easy to import.

  • We can import Iris into our code like this.
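
  • A minimal sketch of that import, assuming a current scikit-learn install. `load_iris` is the loader scikit-learn ships for this data set; the variable name `iris` is just an illustrative choice.

```python
from sklearn.datasets import load_iris

# Load the Iris data set that is bundled with scikit-learn.
iris = load_iris()

# The returned object carries the measurements, the labels, and metadata.
print(iris.data.shape)    # (150, 4): 150 flowers, 4 measurements each
print(iris.target.shape)  # (150,): one label per flower
```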

  • The data set includes both the table

  • from Wikipedia as well as some metadata.

  • The metadata tells you the names of the features

  • and the names of different types of flowers.
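
  • As a quick illustration, the metadata can be printed directly. This sketch assumes the `feature_names` and `target_names` attributes that `load_iris` returns.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four measurements, in column order.
print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

# Names of the three species, in label order (0, 1, 2).
print(", ".join(iris.target_names))  # setosa, versicolor, virginica
```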

  • The features and examples themselves

  • are contained in the data variable.

  • For example, if I print out the first entry,

  • you can see the measurements for this flower.

  • These index to the feature names, so the first value

  • refers to the sepal length, and the second to sepal width,

  • and so on.
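
  • Here is one way to print the first entry and pair each value with its feature name. The `zip` loop is an illustrative addition, not something shown on screen.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The first row of measurements.
print(iris.data[0])  # [5.1 3.5 1.4 0.2]

# Pair each measurement with the feature name at the same index.
for name, value in zip(iris.feature_names, iris.data[0]):
    print(name, "=", value)
```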

  • The target variable contains the labels.

  • Likewise, these index to the target names.

  • Let's print out the first one.

  • A label of 0 means it's a setosa.
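
  • A small sketch of that lookup. The variable name `first_label` is hypothetical; the label is used as an index into `target_names`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The label for the first flower is an integer.
first_label = iris.target[0]
print(first_label)  # 0

# Use the label as an index into the target names to get the species.
print(iris.target_names[first_label])  # setosa
```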

  • If you look at the table from Wikipedia,

  • you'll notice that we just printed out the first row.

  • Now both the data and target variables have 150 entries.

  • If you want, you can iterate over them

  • to print out the entire data set like this.
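
  • One possible loop for that, written for Python 3. The output format string is my own choice.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data and target have the same length, so one index walks both.
for i in range(len(iris.target)):
    print("Example %d: label %s, features %s" % (i, iris.target[i], iris.data[i]))
```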

  • Now that we know how to work with the data set,

  • we're ready to train a classifier.

  • But before we do that, first we need to split up the data.

  • I'm going to remove several of the examples

  • and put them aside for later.

  • We'll call the examples I'm putting aside our testing data.

  • We'll keep these separate from our training data,

  • and later on we'll use our testing examples

  • to test how accurate the classifier is

  • on data it's never seen before.

  • Testing is actually a really important part

  • of doing machine learning well in practice,

  • and we'll cover it in more detail in a future episode.

  • Just for this exercise, I'll remove one example

  • of each type of flower.

  • And as it happens, the data set is

  • ordered so the first setosa is at index 0,

  • and the first versicolor is at 50, and so on.

  • The syntax looks a little bit complicated, but all I'm doing

  • is removing three entries from the data and target variables.

  • Then I'll create two new sets of variables-- one

  • for training and one for testing.

  • Training will have the majority of our data,

  • and testing will have just the examples I removed.
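
  • The split described above can be sketched with NumPy's `np.delete`. The variable names (`test_idx`, `train_data`, and so on) are illustrative, and the indices 0, 50, and 100 rely on the data set being ordered by species as noted earlier.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index of the first setosa, first versicolor, and first virginica.
test_idx = [0, 50, 100]

# Training data: everything except the three held-out rows.
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# Testing data: only the three held-out rows.
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

print(train_data.shape, test_data.shape)  # (147, 4) (3, 4)
```

  • Passing `axis=0` matters for the data: it removes whole rows. Without it, `np.delete` would flatten the table first.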

  • Now, just as before, we can create a decision tree

  • classifier and train it on our training data.
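
  • A minimal training sketch, repeating the split so it runs on its own. The `random_state=0` argument is an assumption I've added so the tree is built the same way on every run; it is not part of the original walkthrough.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# Create the classifier and fit it on the training rows only.
clf = tree.DecisionTreeClassifier(random_state=0)  # random_state: added for repeatability
clf.fit(train_data, train_target)

print(clf.tree_.node_count)  # number of nodes in the fitted tree
```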

  • Before we visualize it, let's use the tree

  • to classify our testing data.

  • We know we have one flower of each type,

  • and we can print out the labels we expect.

  • Now let's see what the tree predicts.

  • We'll give it the features for our testing data,

  • and we'll get back labels.

  • You can see the predicted labels match our testing data.

  • That means it got them all right.
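
  • Put together, the check might look like this. It's a sketch under the same assumptions as before (illustrative variable names, `random_state=0` added for repeatability).

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_data, train_target)

# Labels we expect for the held-out flowers.
print(test_target)

# Labels the tree predicts from their features.
predictions = clf.predict(test_data)
print(predictions)
```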

  • Now, keep in mind, this was a very simple test,

  • and we'll go into more detail down the road.

  • Now let's visualize the tree so we can

  • see how the classifier works.

  • To do that, I'm going to copy-paste

  • some code in from scikit-learn's tutorials,

  • and because this code is for visualization

  • and not machine-learning concepts,

  • I won't cover the details here.

  • Note that I'm combining the code from these two examples

  • to create an easy-to-read PDF.

  • I can run our script and open up the PDF,

  • and we can see the tree.
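
  • The exact code pasted on screen isn't reproduced in this transcript, so here is a hedged alternative rather than the original: `tree.export_graphviz` with `out_file=None` returns the tree as Graphviz DOT text. The display options (`filled`, `rounded`, `impurity`) are my own choices.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_data, train_target)

# With out_file=None the DOT description is returned as a string.
dot_source = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,      # color nodes by majority class
    rounded=True,     # rounded node boxes
    impurity=False,   # hide the impurity value to keep nodes short
)

# Show the start of the DOT text.
print(dot_source[:200])
```

  • To get a PDF from this, you could save `dot_source` to a file such as `iris.dot` and run Graphviz's `dot -Tpdf iris.dot -o iris.pdf`, assuming Graphviz is installed.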

  • To use it to classify data, you start by reading from the top.

  • Each node asks a yes or no question

  • about one of the features.

  • For example, this node asks if the petal width

  • is less than 0.8 centimeters.

  • If it's true for the example you're classifying, go left.

  • Otherwise, go right.

  • Now let's use this tree to classify an example

  • from our testing data.

  • Here are the features and label for our first testing flower.

  • Remember, you can find the feature names

  • by looking at the metadata.

  • We know this flower is a setosa, so let's see

  • what the tree predicts.

  • I'll resize the windows to make this easier to see.

  • And the first question the tree asks

  • is whether the petal width is less than 0.8 centimeters.

  • That's the fourth feature.

  • The answer is true, so we proceed left.

  • At this point, we're already at a leaf node.

  • There are no other questions to ask,

  • so the tree gives us a prediction, setosa,

  • and it's right.

  • Notice the label is 0, which indexes to that type of flower.

  • Now let's try our second testing example.

  • This one is a versicolor.

  • Let's see what the tree predicts.

  • Again we read from the top, and this time the petal width

  • is greater than 0.8 centimeters.

  • The answer to the tree's question is false,

  • so we go right.

  • The next question the tree asks is whether the petal width

  • is less than 1.75.

  • It's trying to narrow it down.

  • That's true, so we go left.

  • Now it asks if the petal length is less than 4.95.

  • That's true, so we go left again.

  • And finally, the tree asks if the petal width

  • is less than 1.65.

  • That's true, so left it is.

  • And now we have our prediction-- it's a versicolor,

  • and that's right again.

  • You can try the last one on your own as an exercise.

  • And remember, the way we're using the tree

  • is the same way it works in code.
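
  • To make that concrete, here is a sketch that walks the fitted tree by hand using the arrays on `clf.tree_`. The helper `classify_by_hand` is hypothetical, and a tree trained on your machine may ask slightly different questions than the one in the PDF. Note that scikit-learn's test is "less than or equal to" the threshold.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)
test_data = iris.data[test_idx]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_data, train_target)


def classify_by_hand(clf, features):
    """Walk the fitted tree from the root to a leaf and return the label."""
    t = clf.tree_
    # scikit-learn works on float32 copies of the features, so mirror that.
    features = np.asarray(features, dtype=np.float32)
    node = 0  # start at the root
    # A leaf is a node whose left and right child ids are the same sentinel.
    while t.children_left[node] != t.children_right[node]:
        name = iris.feature_names[t.feature[node]]
        threshold = t.threshold[node]
        answer = features[t.feature[node]] <= threshold
        print("%s <= %.2f? %s" % (name, threshold, answer))
        # True goes left, False goes right.
        node = t.children_left[node] if answer else t.children_right[node]
    # The leaf stores a weight per class; the largest one is the prediction.
    return clf.classes_[np.argmax(t.value[node][0])]


for flower in test_data:
    label = classify_by_hand(clf, flower)
    print("prediction:", iris.target_names[label])
```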

  • So that's how you quickly visualize and read

  • a decision tree.

  • There's a lot more to learn here,

  • especially how they're built automatically from examples.

  • We'll get to that in a future episode.

  • But for now, let's close with an essential point.

  • Every question the tree asks must be about one

  • of your features.

  • That means the better your features are, the better a tree

  • you can build.

  • And the next episode will start looking

  • at what makes a good feature.

  • Thanks very much for watching, and I'll see you next time.



