Hey, John-Green-bot.
I've been thinking really hard about a HUGE life decision.
I want to adopt a pet, and I've narrowed it down to either a cat or a dog.
But there are so many great cats and dogs on adoption websites.
John-Green-bot: The Grey Parrot (Psittacus erithacus) has an average lifespan in captivity
of 40 to 60 years.
Jabril: Yeah, birds are great and all but I was thinking maybe a cat or a dog.
John-Green-bot: Turtles will need a tank approximately 7.5 to 15 times their shell length in centimeters.
Jabril: Yeah, you're no help.
Come on Spot and Mr. Cuddles.
It looks like I'm going to have to figure this out myself, and by myself I mean
make an AI figure it out.
Today we're going to train an AI to go through the list of pets and make the best decision
for me based on data!
That'll make things less stressful… surely, nothing will go wrong with this… right?
INTRO
Hey, I'm Jabril and welcome to Crash Course AI.
Today we're going to build a fairly simple AI program to find out if adopting a cat or
a dog will make me happier.
This is a pretty subjective question, and if I use data from the internet, I'll have
a lot of strong opinions.
So, I'll conduct my own survey where I collect data about people's cats and dogs and their
happiness.
I don't care what pet I get, as long as it makes me happy, so I won't even include cat
and dog labels in the model.
Like in previous labs, I'll be writing all of my code using a language called Python
in a tool called Google Colaboratory.
And as you watch this video, you can follow along with the code in your browser from the
link we put in the description.
In these Colaboratory files, there's some regular text explaining what I'm trying
to do, and pieces of code that you can run by pushing the play button.
These pieces of code build on each other, so keep in mind that you have to run them
in order from top to bottom, otherwise you might get an error.
To actually run the code or make changes to it, you'll have to either click “open
in playground” at the top of the page or open the File menu and click “Save a Copy
to Drive”.
And one last time, I'll give you this FYI: you'll need a Google account for this.
Creating this AI to help me decide between a cat and a dog should be pretty simple, so
there are only a couple of steps: First, I have to gather the data.
I have to decide on a few features that could predict if a cat or dog makes people happy.
Then, I'll make a survey that asks about these features, and go out in the world and
ask people if their pet fits these features and makes them happy.
It might be a little biased or imperfect, but I think it'll be juuust finnne to help
me make my decision.
Second, I have to build an AI model to predict if a specific pet makes people happy.
Because I'm not collecting a massive amount of data, it's helpful to use a small model
to prevent overfitting.
So I'll plan on using a neural network with just one hidden layer.
And for our final step, I can go through an adoption website of adorable cats and dogs,
put in their features, and let the AI decide which pet will make me happy.
No more stressing about this tough decision, the machines have my back!
Step 1.
Instead of importing a dataset this time, we've got to create our own!
Browsing through some adoption websites, the most common features I saw that are also
important to me are cuddly, soft, quiet (especially when I'm trying to sleep),
and energetic (because playing with an energetic pet might remind me to get up from my computer
a little more).
In the AI I'm programming, I'll use these four values to predict each owner's answer to “does
your pet make you happy most of the time: yes or no?”
For the data collection part of this process, I gave this five-question survey of yes/no
questions to 30 people who own one cat or one dog.
I want to avoid bias based on the kind of pet, so I put everyone's answers into one
big list.
Every row is one person's response, and yes's are represented as 1 and no's as 0.
By representing the answers as numbers, I can use them directly as features in my model.
The first four questions are my input features and the last question about happiness is my
label.
And I'm not using cat or dog labels anywhere in my model.
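As a rough sketch of that encoding: the rows below are invented for illustration and are not the actual survey responses, and the names `responses`, `X`, and `y` are assumptions rather than the notebook's own.

```python
# Hypothetical survey rows, invented for illustration (not the real survey data).
# Column order: cuddly, soft, quiet, energetic, happy
responses = [
    ["yes", "yes", "no", "yes", "yes"],
    ["no", "yes", "yes", "no", "no"],
    ["yes", "no", "yes", "no", "yes"],
]

# Represent every "yes" as 1 and every "no" as 0.
encoded = [[1 if answer == "yes" else 0 for answer in row] for row in responses]

# The first four columns are the input features; the last column is the label.
X = [row[:4] for row in encoded]
y = [row[4] for row in encoded]

print(X)
print(y)
```

Note that no column says "cat" or "dog": the model only ever sees the four feature columns and the happiness label.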
I also have to split this dataset into the training set and the testing set.
The training set is used to train the neural network, and the testing set is kept hidden
from the neural network during training, so I can use it to check the network's accuracy later.
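One possible way to make that split is scikit-learn's `train_test_split`. This is a sketch under assumptions: the 30 rows are made-up placeholder data, and the choice of holding out 6 rows is mine, not a number taken from the notebook.

```python
from itertools import product
from sklearn.model_selection import train_test_split

# Placeholder data: 30 made-up rows of four 0/1 features with a made-up label.
X = ([list(bits) for bits in product([0, 1], repeat=4)] * 2)[:30]
y = [row[0] for row in X]

# Hold out 6 rows for testing (an assumed size); the model never trains on them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=6, random_state=0
)

print(len(X_train), len(X_test))
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the test set each time the cell is run.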
Step 2.
Now that I have a dataset, I need to build a neural network to help make predictions.
And if you did episode 5's Neural Network Lab (when I digitized John-Green-bot's handwriting),
this step will sound familiar because I'm using the same tools.
I'm going to use a multi-layer perceptron neural network or MLP.
As a refresher, this neural network has an input layer for features, some number of hidden
layers to learn representations, and a final output layer to make a prediction.
The hidden layers find relationships between the features that help it make accurate predictions.
Like in the Neural Networks Lab, we're going to import a library called SKLearn (which
is short for scikit-learn).
SKLearn includes a bunch of different machine learning algorithms, but I'll just be using
its Multi-Layer Perceptron algorithm.
You can easily change the number of hidden layers and other parts of the model, but I'll
start with something simple: four input features, one hidden layer, and two outputs.
We'll set our hidden layer to four neurons, the same size as our input.
SKLearn will actually take care of counting the size of my input and output automatically,
so I only have to specify the size of the hidden layer.
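A minimal sketch of that setup, assuming the model is scikit-learn's `MLPClassifier` (the variable name `model` and the `random_state` value are assumptions):

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer with four neurons. The input and output sizes are not
# given here; scikit-learn works them out from the data when fit() is called.
model = MLPClassifier(hidden_layer_sizes=(4,), random_state=0)

print(model.hidden_layer_sizes)
```

Adding a second hidden layer would only mean changing the tuple, for example `hidden_layer_sizes=(4, 4)`.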
Over the span of one epoch of training this neural network, the hidden layer will pick
up on patterns in the input features, and pass a prediction to one of two output neurons:
yes, happiness OR no, unhappiness.
The code in our Colab notebook calls this an “iteration” because an iteration and
an epoch are the same thing in the algorithm we're using.
As the model loops through the data, it predicts happiness based on the features, compares
its guess to the actual survey results, and updates its weights and biases to give a better
prediction in the future.
And over multiple epochs of the same training dataset, the neural network's predictions
should keep getting better!
We'll just go with 1000 epochs for now.
Now, I can test my AI on my original training data to see how well it captured that information,
and on the testing data I set aside.
The output here lets us know how good our neural network is at guessing if these pet
features predict owner happiness.
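Training and scoring could look roughly like the following. The data here is a made-up placeholder with an artificial rule, so the printed accuracies say nothing about the real survey; `max_iter=1000` is the cap on training epochs mentioned above.

```python
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data: every combination of four 0/1 features, repeated twice.
X = [list(bits) for bits in product([0, 1], repeat=4)] * 2
y = [row[0] for row in X]  # made-up rule, only so there is something to learn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=6, random_state=0
)

# Train for at most 1000 epochs (scikit-learn calls them iterations).
model = MLPClassifier(hidden_layer_sizes=(4,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

# score() returns the fraction of rows predicted correctly.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("epochs run:", model.n_iter_)
print("training accuracy:", train_accuracy)
print("testing accuracy:", test_accuracy)
```

With only a handful of test rows, a single wrong prediction shifts the testing accuracy by many percentage points, so a high score on a small test set deserves some caution.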
And it looks like our model got 100% correct on the testing data and 85% correct on the
training data!
Well guys, thanks for tuning in, but I think this project is almost over!
Everything was easy to do, performance looks great.
I'll just put in some pet features and let it help me with this big life decision!
Man, AI really is awesome.
Step 3.
Let's see... here's a pet I could adopt.
The description says it's cuddly, soft, quiet at night, and isn't that energetic.
Let's put in those features and see what the model says.
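Asking a trained model about one pet might be sketched like this. The training data is again a made-up placeholder, so the answer printed here is not the one from the video; the column order cuddly, soft, quiet, energetic is an assumption carried over from the survey description.

```python
from itertools import product
from sklearn.neural_network import MLPClassifier

# Placeholder training data (not the real survey).
X = [list(bits) for bits in product([0, 1], repeat=4)]
y = [row[0] for row in X]  # made-up rule for illustration

model = MLPClassifier(hidden_layer_sizes=(4,), max_iter=1000, random_state=0)
model.fit(X, y)

# Assumed column order: cuddly, soft, quiet, energetic.
# This pet is cuddly, soft, quiet at night, and not that energetic.
pet = [[1, 1, 1, 0]]
prediction = model.predict(pet)

print("happy" if prediction[0] == 1 else "not happy")
```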
What?
Why not?
It seemed nice…
But I guess that's why I programmed an AI, so I wouldn't be swayed by my FLAWED human
judgment!
Let's move on to the next one.
Let's see, this pet isn't cuddly, isn't soft, isn't quiet, and is really energetic
… but let's see what my AI says.
Yes?!
I'm not so sure that pet would've made me happy, but my AI model had 100% accuracy
on the testing set!
I think I'm gonna test a few more...
Ok, so I've tested a bunch of animals and something weird is happening.
The AI rarely told me that adopting a cat would make me happy, but it almost always
said a dog would make me happy.
Maybe everyone I surveyed hates their cats?
But, that seems unlikely.
Besides, I never even told my AI what a cat is!
I combined all the surveys into one big dataset without “cat” or “dog” labels!
And I only taught the model whether a pet is soft, cuddly, quiet, or energetic.
Both cats and dogs can have all of those traits, right?
Is there a war between cats and AIs that I don't know about, and THAT'S why it's biased?
Hey John-Green-bot….
Do you guys hate cats?!
John-Green-bot: No, Jabril.
We love hairy babies...
Jabril: Ugh, I don't understand!!!!
So, obviously, AI doesn't have a grudge against cats.
I collected the survey data and I built the AI, so if something went wrong and introduced
an anti-cat bias… it's on me, and I can figure out what it is.
So I should go back to analyze the data and my model design.
First, I'll look for patterns and correlations in my data by hand and make sure there's
nothing fishy going on.
This means a new step!
Step 4.
What's weird is that the model's predictions don't seem to make sense to me despite the
high performance.
Specifically, I'm noticing a bias towards dogs.
So there might be something strange about the data.
Earlier, I decided to just pool all the survey results together, but now I'll split them apart.
Now I can create plots that compare the percentage of dog owners I surveyed who are happy, the
percentage of cat owners who are happy, and the percentage of all the people who are happy
with their pet (no matter what kind).
To do this, I just need to compute the number of happy dog owners divided by the total number
of dog owners, the same for cat owners, and the same for everyone I surveyed.
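That calculation can be sketched in plain Python. The rows are invented for illustration, so the percentages are not the real survey results, and the helper name `percent_happy` is hypothetical.

```python
# Hypothetical (pet, happy) rows, invented for illustration.
survey = [
    ("dog", 1), ("dog", 1), ("dog", 0), ("dog", 1),
    ("cat", 1), ("cat", 1),
]

def percent_happy(rows):
    """Happy owners divided by all owners in rows, as a percentage."""
    return 100 * sum(happy for _, happy in rows) / len(rows)

dog_rate = percent_happy([row for row in survey if row[0] == "dog"])
cat_rate = percent_happy([row for row in survey if row[0] == "cat"])
overall_rate = percent_happy(survey)

print(dog_rate, cat_rate, overall_rate)
```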
Interesting.
According to my survey results, cats make people really happy.
But when I put in the features for a cat, my AI usually says it won't make the owner
happy.
How can I have such good accuracy at predicting happiness and always be wrong about cats?!
I still don't have answers about why the data is skewed towards dogs… so I guess
I should look at who even filled out my survey?
Let's make a plot that compares the total number of dog owners and the total number
of cat owners in my dataset.
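Before plotting, the counts themselves are a one-liner with `collections.Counter`. The list below is made up for illustration and does not reproduce the real survey.

```python
from collections import Counter

# Hypothetical pet type for each survey response, invented for illustration.
pets = ["dog", "dog", "dog", "dog", "dog", "cat"]

owner_counts = Counter(pets)

print("dog owners:", owner_counts["dog"])
print("cat owners:", owner_counts["cat"])
```

Running the same count during data collection is one simple way to notice early that one kind of pet is overrepresented.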
Yikes!
Why are there so few cat responses in here?!
I guess when I surveyed random people to make my dataset bigger, I was at a park, and…
that's where I might have accidentally biased my data collection.
A lot of people who responded to my survey in the park must have been dog owners.
So the first mistake I made is that my data doesn't actually have the same distributions
as the real world.
Instead of collecting the true frequencies of each feature from a large random group
of pet owners, I sampled from a dog-biased set.
That's definitely something that should be fixed… but it still doesn't answer
why the model seems so biased against cats.
Both cats and dogs can be energetic, cuddly, quiet, and soft, or not.
That's why I chose those features, they seemed like they'd be common for both pets.
But we can test this.
I'll make a plot where I divide the number of times each feature is true for each animal
by the total number of survey responses I have for each animal.
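Here is a small sketch of that per-animal feature rate, using invented rows rather than the real survey data (the names `rows`, `features`, and `rates` are assumptions):

```python
# Hypothetical rows: (pet, cuddly, soft, quiet, energetic), invented for illustration.
rows = [
    ("dog", 1, 1, 0, 1),
    ("dog", 0, 1, 0, 1),
    ("dog", 1, 0, 1, 0),
    ("cat", 1, 1, 1, 0),
    ("cat", 0, 1, 1, 0),
]
features = ["cuddly", "soft", "quiet", "energetic"]

rates = {}
for pet in ("dog", "cat"):
    pet_rows = [row for row in rows if row[0] == pet]
    # Times each feature is true for this animal / responses for this animal.
    rates[pet] = {
        name: sum(row[i + 1] for row in pet_rows) / len(pet_rows)
        for i, name in enumerate(features)
    }

for pet, pet_rates in rates.items():
    print(pet, pet_rates)
```

A feature whose rate is 0 for one animal and well above 0 for the other is a warning sign that it may act as a stand-in for the animal type.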
It looks like there are lots of different types of dogs in my dataset.
Some are energetic and some are cuddly, but none of the cats are energetic.
So this is a correlated feature, which is a feature that is (unintentionally) correlated
to a specific prediction or hidden category.
In this case, knowing if something is energetic is a cheat for knowing it's a dog even though
I didn't tell the model about dogs.
My model might have then learned that if a pet is energetic, it makes owners happy, just
because there was no data to tell it otherwise.
We can see this correlation if we plot pet energy vs owner happiness.
In my data, if a pet is energetic, a person is likely to be happy with it... no matter
what other features are true.
But if the pet isn't energetic, it's a mixed bag of happiness.
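The same comparison can be checked with a simple tally before plotting. The pairs below are invented to mimic the pattern described and are not the real survey data.

```python
from collections import Counter

# Hypothetical (energetic, happy) pairs, invented for illustration.
pairs = [(1, 1), (1, 1), (1, 1), (0, 1), (0, 0), (0, 1), (0, 0)]

counts = Counter(pairs)

happy_rate = {}
for energetic in (1, 0):
    total = counts[(energetic, 1)] + counts[(energetic, 0)]
    # Share of owners who are happy, given the pet's energy level.
    happy_rate[energetic] = counts[(energetic, 1)] / total

print("energetic pets:", happy_rate[1])
print("non-energetic pets:", happy_rate[0])
```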
This is my second mistake: the data had a correlated feature, so my AI found patterns
that I didn't want.
To fix the first mistake, I need to collect new data and make sure I balance the number
of cat owners and dog owners.
So I'll go to the park, the pet store, the grocery store... you get the idea.
And I'll keep track of whether I end up with too many of one pet or the other.
To fix the second mistake, I should make sure the features are actually the most important
things I care about when it comes to happiness.
Honestly, I don't NEED my pet to be energetic.
So I could just cut it out of my dataset, and not worry about it becoming a correlated
feature as I train my AI.
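Cutting a feature amounts to dropping its column before training. A minimal sketch, assuming the column order cuddly, soft, quiet, energetic and using made-up rows:

```python
feature_names = ["cuddly", "soft", "quiet", "energetic"]

# Hypothetical feature rows, invented for illustration.
X = [
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
]

# Find the column to remove, then rebuild each row without it.
drop_index = feature_names.index("energetic")
kept_names = [name for i, name in enumerate(feature_names) if i != drop_index]
X_reduced = [
    [value for i, value in enumerate(row) if i != drop_index] for row in X
]

print(kept_names)
print(X_reduced)
```

A model trained on `X_reduced` would take three inputs instead of four, so any pet passed in for prediction would need the same three columns.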
Although, I will be more careful and make sure the other three features don't get
biased either.
It's important to note that not every problem is this easy.
For some AI, we can't just remove features that don't have a clear meaning, or we might
need to keep features because they're the only measurable values.
In either case it's usually EXTRA important to have a human checking the results and asking
a few important questions to avoid bias:
Does the data match my goals?
Does the AI have the right features?
And am I really optimizing the right thing?
And these questions aren't that easy to answer…
So far in our labs we've demonstrated the amazing abilities that AI can grant you, but
as you can see, it's important to be cautious.
As far as my dog-or-cat decision goes…
I'm going to have to do more work on this algorithm.
And collect a lot more survey data.
So I guess the main takeaway for this episode (and the last of our labs) is that when building
AI systems, there aren't always straightforward and foolproof solutions.
You have to iterate on your designs and account for biases whenever possible.
So, our next and final episode for Crash Course AI is all about the future and our role in
shaping where AI is headed.
I'll see ya then.
Crash Course AI is produced in association with PBS Digital Studios!
If you want to help keep all Crash Course free for everybody, forever, you can join
our community on Patreon.
And if you want to learn more about research methods to build good surveys and datasets,
check out this episode of Crash Course Sociology.