Data Analysis 3: Cleaning Data - Computerphile

Well, we're looking at chocolate datasets today, so I thought I'd bring some research in. Yeah, good, and definitely relevant.

We've been looking at techniques like data visualization to try and explore our data and start to draw some initial conclusions or hypotheses. Now we're going to start to move towards modeling our data and actually trying to extract proper knowledge from it, because remember, just because we've got data doesn't mean we've got knowledge.

This is going to be an iterative process. We're going to need to clean up our data to make it as useful as possible, we need to transform it so that we can combine datasets together and statistically measure them, and sometimes we're going to need to reduce it, if our dataset is too big and unmanageable. This combination of cleaning data, transforming data and reducing data is a kind of cycle, where we iterate until our dataset is the smallest, most useful form it can be. If we've got redundant variables, which are basically the same as others, or we've got duplicates, these are all problems we need to sort out, because if we're analyzing data with these kinds of issues we're just making our life more difficult. It's computationally inefficient and, in the worst case, we could draw the wrong conclusions.

You might be surprised and disappointed when you get your first dataset that it's not quite as nice as you were hoping, and it's going to need some cleaning up. Maybe there are missing values. Maybe there are outliers that need to be dealt with, because they're warping your distributions and your medians and means. And perhaps you've also got noise in your dataset. These are the things we can start to address with cleaning. Cleaning data is essentially the idea of trying to correct or fill in any missing values, or remove those bits completely.

You might be surprised that there's missing data at all. I mean, what, were we not paying attention?
We had one job, and that was to collect the data, and we seem to have got missing data. But actually it's quite common, because, for example, if you're trying to track patient records over time, maybe a patient didn't show up to an appointment, or maybe they weren't in the hospital when they needed to have their temperature taken, and then your trend line of temperature over time is going to be missing some values. Maybe you've got data going back hundreds of years, and they didn't have certain techniques and certain measurement systems back then, so they only have other kinds of data. So missing data is very common, and we're going to have to be able to deal with it.

The dataset we're looking at today is a set of ratings for chocolate bars. This is why I ate my chocolate, or at least that's what I'm telling myself. So we'll read the data in. We've got ten different variables and nearly 1,700 observations; let's have a quick look using summary. We've got things like the company that produced the chocolate, the name of the chocolate, review dates, the cocoa percentage, the type of bean, this kind of information. You can imagine what you might do if you were trying to produce better chocolate: look at a huge amount of this kind of data and work out what it is that customers like and what they don't like. This is going to be quite common in market research.

So, the first thing we're going to do: we've received this data, we now know what the columns are, but we don't really know anything else. We're going to have to start looking through and seeing, first of all, is there any missing data? We're going to use the sapply function for this. The sapply function will apply a function over our dataset, so over each column, or each attribute, of our data. The function we're going to apply we're writing ourselves: it's going to be the sum of any time where an item is either blank or NA. Blank means an empty string, and NA literally stands for "not available", which is something that comes up in data from time to time. In any case, both of these are missing values, so we're going to treat them both the same.

If we apply this to our chocolate dataset, we'll see that, for example, there are eight missing names, there are seven missing review dates, and there are four missing cocoa percentages. So across the rows in our data, there are four rows where the cocoa percent is missing. That's not too bad: this is a dataset of nearly 1,700 items, so four is quite expected. You might imagine that if you're pooling this data from lots of different sources, people are going to forget to enter a value, or they weren't able to record data on that day. There's a huge number of reasons why you might have missing data. Now, it starts to become a bit of a problem when we look at things like bean type, because bean type has got 1,200 missing values. That's a huge proportion of the dataset, and in that case we might have to do something about it.

The only issue we've got is that 1,200 isn't relative: it's just a count of rows, not relative to the size of the dataset. So we're going to use the exact same function counting any empty entries, but this time we're going to divide by the total number of rows, so we get a percentage for how much missing data we've got. We can see, for example, that company name has zero missing data, whereas bean type has 74% missing data, and that's a huge problem.
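As a rough sketch of those two checks in R — the file name and column names here (bean_type and so on) are assumptions for illustration, not necessarily the exact ones used in the video:

    # Read the data in; the file name is assumed for illustration
    choco <- read.csv("chocolate.csv", stringsAsFactors = FALSE)

    # For each column, count the entries that are blank ("") or NA
    missing_counts <- sapply(choco, function(x) sum(x == "" | is.na(x)))
    missing_counts

    # The same check, but relative to the number of rows,
    # so 0.74 means 74% of that column is missing
    missing_pct <- sapply(choco, function(x) sum(x == "" | is.na(x)) / nrow(choco))
    missing_pct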
Now, a general rule of thumb is that if over half your data is missing, it's going to be quite hard to estimate or guess what that data should be. That's when you might want to start thinking about removing it. So what we want to do is extract the names of any of our attributes that are over, let's say, 60% missing. We'll start by calculating all these percentages and saving them in a variable, and then we're going to select only those attributes where the value is over 0.6, right, 60%. Any attribute where the percentage is over 0.6 — and that is just bean type, at 0.74, or 74%.

We can now delete bean type ourselves: we could say something like choco, all the rows, for bean type, is NULL, and by setting that column to NULL we just delete it. We can also do it automatically, by passing in those attributes we just calculated as a parameter. That would be this line here: something like choco, all rows, the names of any attributes where the percentage missing is greater than 0.6, and that's going to delete bean type. There's not a lot we can do about bean type: we've only got about a quarter of the data, which isn't enough to start guessing what the bean types in the other chocolate bars are going to be.

Let's have a look now at our rows of data, our instances. Each instance has a number of attributes — there are nine left now — and we really want to keep the instances that have the majority of their data. So we're going to use apply, and this is going to be row-wise, so over dimension 1, that's the rows. We're going to count anything that's blank or NA for each row of our dataset, and we're going to put this into a variable called missing. What this is going to do is return a list of values, one for every single row, that tells you how many missing items there are in that row. We can now look at the first few missing items: we order them, largest first, and show just the first few, and you can see that some rows are actually missing seven or six attributes. That's quite a serious situation, because there are only nine, right — they've only got a couple of entries in their fields.

Now let's do this again as a percentage of the number of attributes. This is exactly the same thing, but this time we're dividing by the number of columns, which is nine. If we have a look at the top of these, you can see that some of these instances are missing 77% of their attributes. That's a real problem. Missing is the same length as the number of rows we've got, so we can look up any rows with a greater percentage of missing values than we want and just remove them from the dataset. The way we do that is a bit like this: choco is choco, any rows where missing is less than, say, 0.7, and then all the columns. What that's going to do is select only the rows we want, the ones with a decent amount of data. So our choco dataset is going to be a little bit smaller now, but much more useful to us. We don't really want to be trying to do things like machine learning or statistics when 70% of some of the data is missing; that isn't going to be a good idea.
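A hedged sketch of both deletions, reusing the missing_pct vector from above (column names are still assumptions):

    # Delete a single column by hand by setting it to NULL...
    # choco$bean_type <- NULL

    # ...or automatically keep only attributes at most 60% missing
    # (equivalent here, since only bean_type crosses the threshold)
    choco <- choco[, missing_pct <= 0.6]

    # Count missing entries per row (dimension 1 = rows),
    # then express that as a fraction of the number of attributes
    missing <- apply(choco, 1, function(x) sum(x == "" | is.na(x)))
    head(sort(missing, decreasing = TRUE))   # worst rows first
    missing <- missing / ncol(choco)

    # Keep only the rows that are less than 70% missing
    choco <- choco[missing < 0.7, ]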
So it's quite easy just to delete data, and in some sense it's just more convenient to do that. In general, the rule is that if you've got more than 50 or 60% missing data, it's a good idea to delete it — delete either the instances or the attributes, depending on how much data you've got missing and where. If you're missing a huge amount of data, then you're not going to be able to recreate it by, let's say, using an average: with so little data, an average isn't going to be reliable. If we have got sufficient data that we could maybe start to infer what these missing values might be, we can start to try and replace them instead of deleting them.

What we might do is, for example, set them all to zero. Maybe if an attribute is missing we can say, well, okay, if it's missing we'll just not use it and say it's zero. Whether you do that is going to depend on what the attribute is; sometimes zero is not a meaningful value, and we'll look at that in the chocolate dataset in a moment. What we might also do is put the dataset mean into those attributes. Maybe we don't know what the rating for this chocolate bar is, but we can guess that it's going to be around the average rating for any chocolate bar. Again, this is going to depend on your situation — you're still making up data in some sense, so you've got to be very careful about what you do here.

So, we've deleted as much of our choco data as we feel comfortable doing. Now let's see if we can fill in some of the missing values with appropriate replacements. Let's have a look at our attributes. We've got company name, reference, things like this; bean type has been removed, but we've still got things like the bean origin and the ratings, and there are a few of these missing from our dataset. Can we estimate these rather than completely removing them from the dataset? Obviously, the less data you use, the less useful things like machine learning are going to be. So let's look at an attribute and see what we can do.

If we look at bar price — that is, the price of each chocolate bar — we can see that there are a few missing values, somewhere around 3%. That's something we want to deal with, but we've got enough data, you know, 97%, that maybe we can start to guess what the prices of these chocolate bars might be. Now, this is a good instance of a time when you wouldn't want to just populate with zeros: no chocolate bar is free, I wish. So what we need to do is produce a reliable value to represent an unknown price, rather than just setting them all to zero. What we could do here is something like this: we could set every missing bar price to be the average of all the chocolate bar prices, and that way at least we're not warping our distribution up or down; we're keeping it exactly the same. We're going to say: for the chocolate dataset, for any row where bar price is NA, set the bar price to be the mean of all the bar prices — and we obviously remove any NAs from that calculation, otherwise it's not going to work. And that's worked: if we recalculate our missing values, you'll see that bar price now has zero missing values. So we've fixed that problem.
That was quite an easy one. Bar price seems to me to be quite an intuitive case where you would just calculate an average and put it in — although actually, maybe not, because bar price might depend on where in the world we're selling it, or on which company is producing the chocolate bar.

Could we do the same thing for rating? If we take the sum of all the NA values in rating, it's eight, like so: there are eight chocolate bars for which there is no rating. What we can do is something called a stratified replacement. We could say: let's group our chocolate bars by country or by company, calculate those averages, and then specifically fill in a company's missing ratings based on how that company actually rates in the market, rather than just an average over everything. So we're going to calculate an aggregate function over the ratings, by company, and we're going to calculate a median. The median is a little bit more robust to outliers: if a company makes one very, very expensive or very, very cheap line, the median will still give us the middle value. This is going to be per company, and we can set the column names to be a little more helpful using colnames. So now our per-company table, if we look at it, is each company and the median rating of its chocolate bars — from one to five, I think, this dataset goes.

Now we know that data per company, we can actually fill those in. You could automate this process, but we don't have much missing data, so let's just show an example of doing it by hand. This is the line of code we're going to use, and I'll talk through it. We're going to say: for the chocolate dataset, for any value where the chocolate rating is NA, so missing, and the company is Vicuna, we want to set the rating to be equal to the Vicuna entry in our new per-company median table, and that's going to fill in that value there. We do this for all the companies with missing ratings, and what we'll find is that we've replaced all our missing values with appropriate medians for those ratings, per company.
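A sketch of that stratified replacement — "Vicuna" is the company name as heard in the video, standing in for whichever maker has a missing rating:

    # Median rating per company; the median is robust to one-off
    # very cheap or very expensive lines
    per_company <- aggregate(choco$rating,
                             by = list(choco$company),
                             FUN = median, na.rm = TRUE)
    colnames(per_company) <- c("company", "median_rating")

    # Fill one company's missing ratings by hand; repeat (or loop)
    # for each company that has gaps
    choco[which(is.na(choco$rating) & choco$company == "Vicuna"), "rating"] <-
      per_company[per_company$company == "Vicuna", "median_rating"]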
So the last thing we might find in our data is outliers. Let's imagine we do a box plot of cocoa percentage. Maybe our assumption is that cocoa percentage in some way informs what the rating is going to be, because maybe a higher cocoa percentage tastes nicer — I don't really know about chocolate. If we look at this box plot, what we'll see is that we've actually got quite a tight distribution of cocoa percentage, between about 50% and just above 80%. But you can see there are three outliers: when it produces a box plot, R will show as an outlier anything beyond the whiskers (by default, more than 1.5 times the interquartile range past the quartiles). What we do with these outliers is going to be a judgment call; it's going to depend on the situation.

For example, we have an outlier here which is above 100%. Now, that makes no sense: we can't have a chocolate bar with more than 100% cocoa. So that is obviously a mistake. We would delete that item, and probably the whole row, or we estimate that value based on a stratified average or some other average. For these lower ones, it's a judgment call: one is just above 20% and one is closer to 30%, and I don't know whether those are outliers or not. Is it possible to make a viable chocolate bar with 20% cocoa? It may well be. You're going to have to know a little something about the situation your data was collected in and whether that's reasonable. So you might, for example, delete the bottom one as a mistake, but keep the top one, because that's just a low amount of cocoa.

So this is what cleaning data is about. We're going to have missing data, we're going to have outliers, we might have noise, and you're going to have to look through your data and try and condense it and remove all these problems. We do this so that we can more effectively transform our data later, and also reduce our data if we need to, and then eventually your dataset is going to be really nice, so that we can do things like modeling or machine learning on it.

My fuel economy is measured in miles per gallon, but of course, I don't pump fuel in gallons, I pump it in liters. And when I run anywhere — short distances — I run in kilometers, and I run at kilometers per hour. So I'm using two different systems there.
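Returning to the box-plot step above, a minimal sketch of drawing it and dropping the impossible value (cocoa_percent is an assumed numeric column on a 0-100 scale):

    # Box plot of cocoa percentage; by default R flags points more than
    # 1.5 times the interquartile range beyond the quartiles as outliers
    boxplot(choco$cocoa_percent, ylab = "Cocoa %")

    # Over 100% cocoa is physically impossible, so treat it as a data
    # entry mistake and drop the row; the low values are a judgment call
    choco <- choco[is.na(choco$cocoa_percent) | choco$cocoa_percent <= 100, ]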