Have you heard about this concept called "machine learning", and you're trying to figure out exactly what that means? Or maybe you've checked out a few machine learning competitions on Kaggle.com, but you don't know how to get started? If so, I'm here to help.

My name is Kevin Markham, and I'm a data science instructor in Washington, DC. This is my brand new video series about how to use the scikit-learn library in Python for machine learning. This is material that I love to teach, and I can't wait to share it with you.

In this series, I'm going to cover scikit-learn from the basics all the way through advanced techniques. I'm not going to presume any familiarity with machine learning, and in fact, we're going to spend the next few videos talking about machine learning before we write any code. The reason being, there's really no point to using scikit-learn if you don't know how to do proper machine learning. You will need to have at least minimal experience with the Python programming language, but I'll suggest some resources in the next video if you don't yet know Python.

So with that, let's get started! In this video, I'll be covering the following topics: What is machine learning? What are the two main categories of machine learning? What are some examples of machine learning? And, how does machine learning "work"?

So, what exactly is machine learning? There's no universal definition, but at a high level, I would define machine learning as the semi-automated extraction of knowledge from data. Let's break that down into three component parts: First, machine learning always starts with data, and your goal is to extract knowledge or insight from that data. You have a question you're trying to answer, and you hypothesize that your question might be answerable using the data. Second, machine learning involves some amount of automation.
Rather than trying to gather your insights from the data manually, you are applying some process or algorithm to the data using a computer so that the computer can help to provide the insight. Third, machine learning is not a fully automated process. As any practitioner can tell you, machine learning requires you to make many smart decisions in order for the process to be successful. We'll cover many of those decisions throughout this video series.

Next, let's talk about the two main categories of machine learning, which are supervised learning and unsupervised learning. Supervised learning, also known as predictive modeling, is the process of making predictions using data. For example, if my dataset is a series of email messages, my supervised learning task might be to predict whether each email message is spam or non-spam, which is also known as "ham". This is supervised learning because there is a specific outcome we are trying to predict, namely ham or spam.

In contrast, unsupervised learning is the process of extracting structure from data or learning how to best represent data. For example, if my dataset was the characteristics and purchasing behavior of shoppers at a grocery store, my unsupervised learning task might be to segment the shoppers into groups or "clusters" that exhibit similar behaviors. I might find that college students, parents with young children, and older adults have characteristic shopping behaviors that are similar within each group but dissimilar from the other two groups. This is an unsupervised learning task because there is no right or wrong answer about how many clusters can be found in the data, which people belong in which cluster, or even how to describe each cluster.

Let's do a quick quiz. This is the Kaggle website, which is a popular platform for machine learning competitions. This is their well-known Titanic competition, and the goal is to predict which passengers survived the tragic sinking of the Titanic.
Is this supervised or unsupervised learning? This is supervised learning, because your goal is to predict a specific outcome (namely survival) for each passenger. In this video series, I'm going to primarily focus on supervised learning, though I may cover unsupervised learning in later videos.

We've talked about what supervised learning is, but we haven't yet talked about how it works. So, how does it actually work? At a very high level, here are the two main steps of supervised learning:

First, you train a machine learning model using your existing labeled data. Labeled data is data which has been labeled with the outcome, which in the case of the email example, is whether each message is ham or spam. This is called "model training" because the model is learning the relationship between the attributes of the data and the outcome. These attributes might include the message text, the number of embedded links, the length of the message, and so on.

Second, you make predictions on new data for which you don't know the true outcome. In other words, when a new email message arrives, you want your trained model to accurately predict whether the email is ham or spam without a human examining it.

To summarize these two steps, you could say that the model is learning from past examples, made up of inputs and outputs, and then applying what it has learned to future inputs in order to predict future outputs. Because you are making predictions on unseen data, which is data that was not used to train the model, it is often said that the primary goal of supervised learning is to build models that generalize. In other words, you want to build machine learning models that accurately predict the labels of your future emails, rather than accurately predicting the labels of emails you have already received.

This simplified description of machine learning might raise some questions in your mind, such as: How do I choose which attributes of my data to include in the model?
How do I choose which model to use? How do I optimize this model for best performance? How do I ensure that I'm building a model that will generalize to unseen data? Can I estimate how well my model is likely to perform on unseen data? These are excellent questions, and hint at the complexity of doing effective machine learning! All of these issues will be addressed later in the video series.

If you'd like a more in-depth introduction to machine learning, there are two resources that I recommend that I've linked to below the video. The first resource is my favorite book on machine learning, "An Introduction to Statistical Learning" by Trevor Hastie and Rob Tibshirani. It's available as a free PDF download, and section 2.1 introduces machine learning in a thorough yet accessible way. The second resource I recommend is a 13-minute video from Caltech's "Learning From Data" course, which uses some excellent examples to compare supervised and unsupervised learning, and also introduces another type of machine learning called reinforcement learning.

In the next video in this series, I'll be covering the benefits and drawbacks of scikit-learn, as well as my recommended way to set up Python for machine learning. In the meantime, I'd love to hear from you in the YouTube comments if you have a question about machine learning, or if you just have a cool example of machine learning that you'd like to share. Please do subscribe on YouTube if you'd like to hear the moment my next video comes out. Thanks for watching, and I'll see you soon.
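
The two steps of supervised learning described in this video (train a model on labeled data, then predict on new data) can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: it does not use scikit-learn, and it is not the approach taken later in the series. The example messages, the `train` and `predict` function names, and the simple word-counting rule are all made up for this sketch.

```python
from collections import Counter

# Hypothetical labeled data: each message is paired with its known outcome.
labeled_messages = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("lunch meeting at noon", "ham"),
    ("see you at the meeting", "ham"),
    ("notes from the lunch talk", "ham"),
]


def train(examples):
    """Step 1: learn per-label word counts from labeled examples."""
    word_counts = {}
    for text, label in examples:
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts


def predict(model, text):
    """Step 2: predict the label whose training words best match the text."""
    scores = {
        label: sum(counts[word] for word in text.split())
        for label, counts in model.items()
    }
    # Pick the label with the highest score; ties go to the label seen first.
    return max(scores, key=scores.get)


model = train(labeled_messages)

# New, unseen messages: the model was not trained on these.
print(predict(model, "free prize inside"))    # spam
print(predict(model, "meeting after lunch"))  # ham

# A rough check of generalization: accuracy on messages held out of training.
held_out = [("free click now", "spam"), ("talk at noon", "ham")]
accuracy = sum(predict(model, text) == label for text, label in held_out) / len(held_out)
print(accuracy)  # 1.0
```

The held-out check at the end hints at one of the questions raised above, namely estimating how well a model is likely to perform on unseen data. With only two held-out messages the number means very little; it is here just to show the idea of scoring the model on data it was not trained on.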
What is machine learning, and how does it work? (posted by scu.louis on 2017/07/23)