  • Jabril: Yeah, that was...

  • John Green Bot: … the BEST movie ever!

  • Jabril: That's not what I was gonna say.

  • How about for the next movie night we pick a new movie that we'll both probably like?

  • John-Green-Bot: Maybe something romantic?

  • How about Pride & Prejudice?

  • Jabril: Oh John Green Bot...

  • I'm going to need this.

  • Okay, I think it's time to make a movie recommender system AI.

  • It's the only hope for the future of John Green Bot and my friendship... or at least

  • our movie nights!

  • INTRO

  • Hey, I'm Jabril, and welcome to Crash Course AI.

  • Last time, we introduced the idea of recommender systems, which are AIs that use information

  • about something, and its social ratings to recommend new things to people.

  • These things can be ads, products, YouTube videos, or pretty much anything like that.

  • Today, I'm going to build a recommender system for movies to hopefully find a new

  • movie that both me and John-Green-bot want to watch for our next movie night.

  • Like in previous labs, I'll be writing all of the code using a language called Python

  • in a tool called Google Colaboratory.

  • And as you watch this video, you can follow along with the code in your browser from the

  • link we put in the description.

  • In these Colaboratory files, there's some regular text explaining what we're trying

  • to do, and pieces of code that you can run by pushing the play button.

  • These pieces of code build on each other, so keep in mind that you have to run them

  • in order from top to bottom, otherwise you might get an error.

  • To actually run the code or make changes to it, you'll have to either click “open

  • in playground” at the top of the page or open the File menu and click “Save a Copy

  • to Drive”.

  • And just an fyi: you'll need a Google account for this.

  • So, if I'm going to build a movie-recommending AI, the first thing I know is that AI systems

  • need data.

  • I'll need to find and import a dataset of movies, and ideally it'll already have ratings

  • given by lots of different people to lots of different movies, so I won't have to

  • go through and rank every single movie by myself.

  • That would take a while.

  • Second, I'll need to do some basic analysis.

  • Let's start by finding some generic recommendations, like the top-rated movies in both John-Green-bot's

  • and my favorite genres.

  • Maybe we'll get lucky and find a movie we both want to watch and haven't seen yet

  • on those lists.

  • But

  • I don't really have hope for that because we like such different movies.

  • So, third, John-Green-bot and I will need to personalize this dataset by providing some

  • of our own movie ratings.

  • Fourth, I'll use a technique known as user-user collaborative filtering to generate a set

  • of recommendations for both me and John-Green-bot.

  • Hopefully there will be SOME overlap on those recommendation lists.

  • Alright, let's get started.

  • The first step is getting data.

  • And just like other labs, I'm not going to start from scratch.

  • This time, I'm using an existing dataset published by MovieLens, which has about 100,000

  • user ratings for about 10,000 different movies.

  • MovieLens has bigger datasets available, going up to tens of millions of ratings, but this

  • smaller set should be enough to plan movie nights for John-Green-bot and me.

  • I'm also going to use a library known as LensKit, which comes built-in with some nice

  • tools for building recommender systems.

  • So now, I've got data, but let's make sure I understand what data are even there.

  • This code lets me see the first 10 rows of the ratings dataset.

  • There's one important thing that I notice about this dataset right away: how it handles

  • missing data.

  • Like, for example, here I can see that user #1 gave a rating of 4.0 to item #1, and that

  • they provided a rating of 4.0 to item #3.

  • But I don't see a rating for item #2 at all.

  • Most people don't watch most movies, so that makes sense that there would be missing

  • data.

  • And storing a bunch of zeros would take a lot of space, so it's good to know that

  • MovieLens decided to avoid zeros in this dataset by not storing unranked items at all.
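
As a rough illustration of the storage choice described above, here is a minimal sketch in plain Python. The rows for user #1 mirror the ones mentioned in the video; the third row and the `lookup` helper are made up for this example and are not part of MovieLens or LensKit.

```python
# Ratings stored the way the transcript describes: one (user, item, rating)
# row per rating, and simply no row when a user has not rated an item.
ratings = [
    (1, 1, 4.0),   # user 1 rated item 1 (from the video)
    (1, 3, 4.0),   # user 1 rated item 3 (from the video)
    (2, 2, 3.5),   # hypothetical extra row for illustration
]


def lookup(rows, user, item):
    """Return the rating a user gave an item, or None if no row exists."""
    for row_user, row_item, rating in rows:
        if row_user == user and row_item == item:
            return rating
    return None


# A dense table would need one cell per user-item pair, rated or not.
num_users = len({user for user, _, _ in ratings})
num_items = len({item for _, item, _ in ratings})
dense_cells = num_users * num_items
stored_rows = len(ratings)

print(lookup(ratings, 1, 1))      # user 1 rated item 1
print(lookup(ratings, 1, 2))      # user 1 never rated item 2, so no row
print(dense_cells, stored_rows)   # cells a dense table needs vs. rows stored
```

Even in this tiny example the dense table needs twice as many cells as there are ratings; with thousands of movies the gap gets much larger.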

  • But the way it stores movie data isn't super useful for this current problem, because I

  • want to know what these movies are!

  • Not just ID numbers like “item #2.”

  • John-Green-bot and I can't exactly search for “item #2” when we're trying to rent

  • a movie.

  • Thankfully, the MovieLens dataset has more than just "ratings."

  • It also contains a table called "movies" that has a bunch of information about each of these

  • items, like titles and genres.

  • So we can get a better sense of the data by joining the “ratings” and “movies”

  • files.

  • From now on, let's include the genre and title whenever I print results, because that's

  • much more clear.
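
The join itself can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the notebook's code: the titles, genres, and the `join_ratings_with_movies` helper are hypothetical.

```python
# Hypothetical "ratings" rows: who rated which item ID, and how highly.
ratings = [
    {"user": 1, "item": 1, "rating": 4.0},
    {"user": 1, "item": 3, "rating": 4.0},
    {"user": 2, "item": 2, "rating": 3.0},
]

# Hypothetical "movies" table, keyed by the same item IDs.
movies = {
    1: {"title": "Movie A", "genres": "Adventure|Comedy"},
    2: {"title": "Movie B", "genres": "Romance"},
    3: {"title": "Movie C", "genres": "Action"},
}


def join_ratings_with_movies(ratings, movies):
    """Attach the title and genres to every rating row via its item ID."""
    joined = []
    for row in ratings:
        info = movies.get(row["item"])
        if info is None:
            continue  # skip ratings whose item has no movie record
        joined.append({**row, **info})
    return joined


joined = join_ratings_with_movies(ratings, movies)
for row in joined:
    print(row["user"], row["title"], row["genres"], row["rating"])
```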

  • So I'm done with Step 1!

  • Step 2 is getting some generic recommendations from the MovieLens dataset, just to see what

  • happens.

  • Let's just average the ratings for each movie and print out a sorted list, with the

  • best-rated movies at the top.
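
Averaging and sorting might look something like this minimal sketch. The ratings and the `average_ratings` helper are invented for the example and only show the general shape of the step.

```python
from collections import defaultdict

# Hypothetical (title, rating) rows.
ratings = [
    ("Movie A", 5.0),
    ("Movie B", 4.0),
    ("Movie B", 3.0),
    ("Movie C", 4.5),
    ("Movie C", 4.0),
]


def average_ratings(rows):
    """Return a dict mapping each title to its mean rating."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for title, rating in rows:
        totals[title] += rating
        counts[title] += 1
    return {title: totals[title] / counts[title] for title in totals}


averages = average_ratings(ratings)

# Sort so the best-rated movies come first.
ranked = sorted(averages.items(), key=lambda pair: pair[1], reverse=True)
for title, avg in ranked:
    print(title, avg)
```

Notice that "Movie A" lands on top with a single rating, which is the same thing that happens with the real data.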

  • Uh...

  • Paper Birds?

  • Bill Hicks: Revelations?

  • I have no idea what these movies are... but they're supposed to be good?

  • They all have a perfect 5.0 average rating.

  • I would expect to see movies like Harry Potter or Titanic or I dunno

  • The Avengers?

  • So let's look at the data and see why these are perfect.

  • Let's add a count column to see how many people rated these movies so highly.

  • Okay, so these movies have a perfect 5.0 average rating because only one person actually rated

  • each of these!

  • That doesn't really help me pick what to watch, because if I just wanted ONE person's

  • opinion, I'd ask a friend who knows me!

  • We're using the MovieLens dataset to get a more general idea of good movies.

  • So let's try only sorting movies with at least a certain number of ratings.

  • This is kind of arbitrary, but I guess I'd want at least 20 people to weigh in before

  • I trust an average rating.
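
Here is one way the count column and the cutoff could be sketched in plain Python. The data and the `top_rated` helper are hypothetical, and the toy example uses a cutoff of 3 rather than 20 only because it has so few ratings.

```python
from statistics import mean


def top_rated(ratings_by_movie, min_count=20):
    """Rank movies by mean rating, ignoring movies with too few ratings."""
    summary = [
        (title, mean(values), len(values))
        for title, values in ratings_by_movie.items()
        if len(values) >= min_count
    ]
    return sorted(summary, key=lambda row: row[1], reverse=True)


# Hypothetical ratings grouped by title.
ratings_by_movie = {
    "Obscure Movie": [5.0],
    "Popular Movie": [4.5, 4.0, 5.0, 4.5],
    "Okay Movie": [3.0, 3.5, 3.0],
}

unfiltered = top_rated(ratings_by_movie, min_count=1)
filtered = top_rated(ratings_by_movie, min_count=3)

print(unfiltered[0])  # a perfect average backed by a single rating
print(filtered[0])    # best average among movies with enough ratings
```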

  • Okay, now I've heard of most of these movies and I trust that they're actually sort of

  • popular recommendations.

  • But these movies are all sorts of genres, so maybe I can narrow the list down a little

  • more based on what John-Green-bot and I usually watch.

  • I like action movies and John-Green-Bot likes romance movies.

  • There's actually one movie that's on both of our recommended lists: The Princess Bride!
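
Narrowing by genre and then looking for overlap could be sketched like this. The movie records, average ratings, and the `top_in_genre` helper are all made up for illustration.

```python
# Hypothetical movie summaries: title, set of genres, and mean rating.
movies = [
    {"title": "Movie A", "genres": {"Action", "Romance"}, "mean": 4.2},
    {"title": "Movie B", "genres": {"Action"}, "mean": 4.4},
    {"title": "Movie C", "genres": {"Romance"}, "mean": 4.1},
    {"title": "Movie D", "genres": {"Comedy"}, "mean": 4.5},
]


def top_in_genre(movies, genre, n=10):
    """Return up to n titles in the given genre, best mean rating first."""
    matching = [m for m in movies if genre in m["genres"]]
    matching.sort(key=lambda m: m["mean"], reverse=True)
    return [m["title"] for m in matching[:n]]


action_list = top_in_genre(movies, "Action")
romance_list = top_in_genre(movies, "Romance")

# Movies that show up on both lists, kept in the action list's order.
shared = [title for title in action_list if title in romance_list]
print(shared)
```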

  • Jabril: John-Green-Bot I've got the perfect movie.

  • You're gonna love it.

  • It's got a love story, sword fights, the greatest movie line of all time: “Hello,

  • my name is Inigo Montoya, you killed my father, prepare to die...”

  • John-Green-Bot: Seen it.

  • Let's watch something new.

  • *sighs*… our lists don't have any other movies in common.

  • So even though finding generic recommendations is sort of helpful, our AI system hasn't

  • found us a new movie to watch together.

  • What we're facing is the cold-start problem we talked about in the last video.

  • The recommender system we're programming doesn't know anything about John-Green-bot

  • and me to make personalized recommendations.

  • So for Step 3, it's time to get personal!

  • To personalize our recommender system AI, we need to give it our own movie data.

  • Okay, we've got two spreadsheets now, but I don't think that they're in the right

  • format for LensKit, so I need to check the documentation which is linked in the description.

  • It looks like I need to import our spreadsheets and store the data in item-rating pairs just

  • like the original dataset.

  • Thankfully, Python is great for changing data formats.
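
As a sketch of that format change, the snippet below reads a small spreadsheet export and keeps only item-rating pairs. The column names, the rows, and the `read_ratings` helper are assumptions for this example; the real spreadsheets may be laid out differently.

```python
import csv
import io

# Hypothetical spreadsheet export; a blank rating means "haven't seen it".
sheet = io.StringIO(
    "item,title,rating\n"
    "10,Movie A,4.5\n"
    "11,Movie B,\n"
    "12,Movie C,2.0\n"
)


def read_ratings(handle):
    """Return a dict mapping item ID to rating, skipping unrated rows."""
    ratings = {}
    for row in csv.DictReader(handle):
        if row["rating"].strip() == "":
            continue  # no rating entered for this movie
        ratings[int(row["item"])] = float(row["rating"])
    return ratings


my_ratings = read_ratings(sheet)
print(my_ratings)
```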

  • As a sanity check to make sure I coded everything correctly, let's print both of our ratings

  • for The Princess Bride, since we know we've both seen it.

  • This all looks reasonable, so we're done with Step 3!

  • Remember, our goal is to program an AI to give us personalized movie recommendations

  • based on our ratings.

  • So, to make this happen, I'll implement User-User Collaborative Filtering in Step 4.

  • There are techniques like Item-Item Collaborative Filtering, latent factor analysis, and others

  • too, but the User-User approach is pretty common and a nice first step to understanding

  • recommender systems.

  • In multiple episodes of Crash Course AI, we've talked about visualizing AI features on a

  • graph, whether it's petal lengths on a flower or weather and swimmers.

  • As we add more features, we add more dimensions to that graph.

  • In user-user collaborative filtering, each item is its own dimension.

  • So if we have 10,000 movies in our dataset, that's 10,000 dimensions.

  • We're not even going to try to visualize that, but we can understand the logic behind

  • user-user collaborative filtering with a two-movie example.

  • To be totally honest, this is going to be a pretty simplified explanation of what the

  • user-user algorithm does.

  • Dealing with thousands of dimensions and lots of missing data requires a lot of clever linear

  • algebra and statistics.

  • But I can use the LensKit library to do this math and understand what's happening conceptually,

  • without diving under the hood.

  • So, okay, let's say we have a graph where one axis is the movie Inception and the other

  • axis is The Notebook.

  • And for this example, we'll plot social ratings on it from everyone who has seen and

  • ranked both movies, such as John-Green-bot, me, and a bunch of other people in the MovieLens

  • dataset.

  • Some people may really like or hate both movies.

  • I like Inception but dislike The Notebook, and John-Green-bot is the opposite of me.

  • The user-user algorithm will try to cluster people who gave the movies similar ratings.

  • This is a classic unsupervised learning approach, except there isn't a “correct” size

  • for these clusters, so I have to set parameters.

  • First, I have to set a minimum neighborhood size, or the minimum number of people the

  • algorithm should put in one cluster.

  • Like, for example, if I set the minimum neighborhood size to 5, when the algorithm looks for people

  • similar to John-Green-bot, it may select this neat cluster here.

  • But if I set the minimum neighborhood size higher, the algorithm may be forced to include

  • some people who are less similar to each other and John-Green-bot.

  • I also have to set a maximum neighborhood size, or the maximum number of people the

  • algorithm should put in one cluster.

  • Again, having clusters that are too big might give recommendations that are too generic

  • and don't consider individual taste enough.
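
To make the neighborhood idea a bit more concrete, here is a simplified sketch using the two-movie example. It is not how LensKit does it: the ratings are invented, the similarity measure (one over one plus Euclidean distance) is just one easy choice, and treating "too few neighbors" as "no neighborhood" is only one possible reading of a minimum size.

```python
import math

# Hypothetical ratings for the two-movie example.
ratings = {
    "john_green_bot": {"Inception": 2.0, "The Notebook": 5.0},
    "user_a": {"Inception": 2.5, "The Notebook": 4.5},
    "user_b": {"Inception": 1.5, "The Notebook": 5.0},
    "user_c": {"Inception": 5.0, "The Notebook": 1.0},
    "user_d": {"Inception": 3.0, "The Notebook": 4.0},
}


def similarity(a, b):
    """Score two users from 0 to 1 based on the movies both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = math.sqrt(sum((a[m] - b[m]) ** 2 for m in shared))
    return 1.0 / (1.0 + distance)


def neighborhood(ratings, target, min_nbrs, max_nbrs):
    """Return up to max_nbrs most similar users, or [] if under min_nbrs."""
    scored = [
        (similarity(ratings[target], other), user)
        for user, other in ratings.items()
        if user != target
    ]
    scored.sort(reverse=True)  # most similar first
    nearest = [user for _, user in scored[:max_nbrs]]
    return nearest if len(nearest) >= min_nbrs else []


print(neighborhood(ratings, "john_green_bot", min_nbrs=2, max_nbrs=3))
print(neighborhood(ratings, "john_green_bot", min_nbrs=5, max_nbrs=10))
```

With a maximum of 3, the user who loved Inception and disliked The Notebook gets left out; raising the maximum would pull that less similar user in.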

  • After the algorithm has defined the cluster of people who like these movies just about

  • as much as John-Green-Bot, it can analyze how those users have rated movies that John-Green-bot

  • hasn't seen yet, such as Casablanca.

  • Now, this is a classic supervised learning problem.

  • The user-user algorithm trains on past data from users in the cluster to guess how much

  • John-Green-bot would rate Casablanca.

  • It might predict something like “4.6.”

  • And then the algorithm will do the same thing for all the other movies John-Green-bot hasn't

  • seen, that his cluster-neighbors have.

  • In the end, I want the algorithm to give us a sorted list of the top 10 movies John-Green-bot

  • will probably like.
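
The prediction step can be sketched as a similarity-weighted average of the neighbors' ratings. This is a simplification under stated assumptions: the similarity scores and ratings below are invented, the `predict` and `recommend` helpers are hypothetical, and real implementations usually also adjust for each user's average rating.

```python
def predict(neighbors, movie):
    """Weighted average of neighbors' ratings for one movie, or None."""
    weighted = [(sim, r[movie]) for sim, r in neighbors if movie in r]
    total_weight = sum(sim for sim, _ in weighted)
    if total_weight == 0:
        return None  # nobody in the neighborhood rated this movie
    return sum(sim * rating for sim, rating in weighted) / total_weight


def recommend(neighbors, seen, n=10):
    """Return up to n (movie, predicted rating) pairs for unseen movies."""
    candidates = {m for _, r in neighbors for m in r} - set(seen)
    scored = [(predict(neighbors, m), m) for m in candidates]
    scored = [(score, m) for score, m in scored if score is not None]
    scored.sort(reverse=True)  # highest predicted rating first
    return [(m, round(score, 2)) for score, m in scored[:n]]


# Hypothetical neighborhood: (similarity to John-Green-bot, their ratings).
neighbors = [
    (0.8, {"Casablanca": 5.0, "Movie X": 3.0}),
    (0.2, {"Casablanca": 4.0}),
    (0.5, {"Movie X": 4.0, "Inception": 2.0}),
]
seen = {"Inception", "The Notebook"}

top = recommend(neighbors, seen)
print(top)
```

More similar neighbors get a bigger say: the neighbor with similarity 0.8 pulls the Casablanca prediction closer to their 5.0 than to the other neighbor's 4.0.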

  • There isn't really a “best” minimum and maximum neighborhood size.

  • It really depends on what I want this AI to recommend.

  • Different parameters have different pros and cons.

  • A small neighborhood size would mean the AI considers fewer people who have more similar

  • movie tastes, and it has less data to make predictions.

  • So I'm more likely to run into the “Bill Hicks: Revelations” situation from earlier,

  • which was when recommendations of surprising or obscure movies were based on what a few people like.

  • A big neighborhood size would mean the AI considers more people who have less similar

  • movie tastes, and it has more data to make predictions.

  • So I'm more likely to get movie recommendations that are generally popular and more widely

  • known.

  • Figuring out the best approach to clustering requires a lot of tinkering.

  • But if someone did work on it, they could make a video streaming service that could

  • recommend videos to billions of different people online. YouTube. It's a joke on YouTube if you didn't get it.

  • For this movie night AI, I'll just set a minimum neighborhood size of 3 and maximum

  • size of 15, because those seem reasonable.

  • But feel free to play around with those values in your own code to see how it changes the

  • recommendations.

  • Now that the AI system has run the user-user collaborative filtering algorithm and has

  • clusters, I can give it our personal ratings to get its top 10 recommended movies for both John-Green-bot and me!

  • Now we're talking... show me what to watch!

  • Remember, for each of us, the user-user algorithm finds a neighborhood of similar users based

  • on their movie ratings compared to ours.

  • The algorithm looks for movies that people in that neighborhood have seen, and rated,

  • that we HAVEN'T seen yet.

  • And based on the ratings in our neighborhoods, the algorithm will predict how we might rate

  • each of those movies, and print a list of its “top 10” recommendations for us.

  • So now we have thoughtful movie recommendations by our newly programmed AI, but there's

  • still a huge problem.

  • John-Green-bot and I have to AGREE on a movie to watch, and our “top 10” lists don't

  • overlap at all because we like such different things.

  • We need another STEP!

  • This is the beauty of representing movies we like as lists of numbers!

  • I can create a Jabril-Green-bot hybrid!

  • Uh, but not a cyborg.

  • Just a dataset.

  • So if both of us have rated a movie, I'll use the average of our ratings.

  • Using the two-axis graph of Inception and The Notebook from before, this would place

  • our Jabril-Green-bot hybrid around here.

  • And if only one of us has rated a movie, I'll just add that movie rating to the list.

  • I know this isn't a perfect strategy.

  • Like, it's possible that I might hate some movie that I haven't seen but John-Green-Bot

  • highly rated.

  • But this keeps things simple, and it should give a reasonable estimate across both of

  • our ratings.

  • Like always when I reorganize data with code, I should do a quick sanity check.

  • Let's look at The Princess Bride again because I rated it as a 4.5 and John-Green-bot rated

  • it as a 3.5, so I'd expect our combined list would have it as a 4.

  • Looks like everything checks out!
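
The combining rule and the sanity check can be sketched like this. The Princess Bride ratings are the ones from the video; the other two ratings and the `combine_ratings` helper are hypothetical.

```python
def combine_ratings(*people):
    """Average ratings per movie; a movie rated by one person keeps that rating."""
    collected = {}
    for person in people:
        for movie, rating in person.items():
            collected.setdefault(movie, []).append(rating)
    return {movie: sum(values) / len(values) for movie, values in collected.items()}


jabril = {"The Princess Bride": 4.5, "Inception": 5.0}                # Inception rating is made up
john_green_bot = {"The Princess Bride": 3.5, "The Notebook": 5.0}     # The Notebook rating is made up

hybrid = combine_ratings(jabril, john_green_bot)

# Sanity check: 4.5 and 3.5 should average out to 4.0.
print(hybrid["The Princess Bride"])
print(hybrid)
```

Because the helper takes any number of rating dictionaries, the same sketch would work for a group larger than two.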

  • So now, I have a combined dataset of ratings that I can plug right into our user-user collaborative

  • filtering model from earlier.

  • And I SHOULD get a ranked list of 10 movies that we'll both like!

  • The number one recommendation is Submarine which seems to be a quirky movie from 2010.

  • I've never heard of it, but I'm willing to give it a try.

  • If that's too obscure for John-Green-bot, we could pick a different recommendation from

  • this list... like

  • I've heard some good things about True Grit.

  • In fact, all these movies seem like they might have some stuff we both like.

  • At this point, I could also go back to step 4.1 and select different settings for my clusters.

  • Bigger neighborhoods would probably give me a more well-known list of movies.

  • But that list may also be a little less tailored to our individual interests.

  • Anyway, we know what we'll be watching this weekend.

  • Anyone can use our spreadsheets as a template to enter their own preferences and see some

  • recommendations for themselves and their friends.

  • Of course, these spreadsheets don't have EVERY MOVIE EVER -- that's just one of the

  • limits of our smaller dataset.

  • By using one of the bigger datasets from MovieLens, anyone can create a new set of spreadsheets

  • for this project that does include more movies.

  • But be warned that more movies will mean that all the math will take a LOT longer to do

  • before you get your recommendations!

  • There's also nothing that limits our algorithm to just two people!

  • You could combine a ten-person movie club into one rating dataset to see what results

  • it comes up with.

  • Next time, we'll take a look at a different kind of recommendation that we use all the time:

  • search engines.

  • I'll see ya then.

  • Crash Course AI is produced in association with PBS Digital Studios.

  • If you want to help keep Crash Course free for everyone, forever, you can join our community

  • on Patreon.

  • And if you want some more movie recommendations along with analysis, check out Crash Course

  • Film Criticism.
