Name: Data Analysis 4: Data Transformation - Computerphile
Uploaded: 2021-01-14T10:13:28.000Z
Duration: 19 min 31 s

People need to learn to use standardized measures for things. So take me,

for example. When I drive anywhere, I drive in miles, and I drive in miles per hour.

My fuel economy is measured in miles per gallon, but of course I don't pump fuel in gallons.

But when I run anywhere, so short distances, I run in kilometres, and I run in kilometres per hour.

So I'm using two different systems there. And any short distances I'm measuring are going to be in metres, not feet, right?

Around my house, for painting, I'm going to measure in square metres so I know how much paint to buy, but then

if I'm selling a house, or I'm buying a house,

I'm going to be looking at the size of the house in square feet again. Why? Who knows. British people.

If I'm baking anything, it's going to be weight in grams or kilograms going into the recipe,

but if I'm weighing myself, it's going to be in stones and

pounds. But of course a ton for me would be a metric tonne, not an imperial ton. And

as I said, I measure fuel in litres, and most of my liquids are measured in litres, except of course for beer and milk,

which are in pints. So this is the kind of problem

you're going to be dealing with when you're looking at data. You're trying to transform your data into a usable form.

Maybe the data is coming from different sources.

None of it goes together. You need standardized units and standardized scales so we can go on and analyze it.
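
As a rough illustration of what standardizing units can look like in practice, here is a minimal Python sketch. The record layout, field names, and helper functions are hypothetical and not from the video; only the conversion factors are standard definitions.

```python
# Standard conversion factors (exact by definition).
KM_PER_MILE = 1.609344
KG_PER_POUND = 0.45359237
POUNDS_PER_STONE = 14
LITRES_PER_IMPERIAL_PINT = 0.56826125


def miles_to_km(miles: float) -> float:
    """Convert a distance in miles to kilometres."""
    return miles * KM_PER_MILE


def stones_pounds_to_kg(stones: float, pounds: float = 0.0) -> float:
    """Convert a body weight given in stones and pounds to kilograms."""
    return (stones * POUNDS_PER_STONE + pounds) * KG_PER_POUND


def pints_to_litres(pints: float) -> float:
    """Convert imperial pints to litres."""
    return pints * LITRES_PER_IMPERIAL_PINT


def distance_km(record: dict) -> float:
    """Return the record's distance in kilometres, whatever unit it came in."""
    if record["unit"] == "km":
        return record["distance"]
    if record["unit"] == "miles":
        return miles_to_km(record["distance"])
    raise ValueError(f"unknown unit: {record['unit']}")


# Hypothetical records from two sources that disagree on units.
records = [
    {"distance": 5.0, "unit": "km"},
    {"distance": 3.0, "unit": "miles"},
]

standardized = [round(distance_km(r), 3) for r in records]
print(standardized)
```

Raising an error on an unknown unit, rather than guessing, keeps a badly labelled source from silently contaminating the merged data.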

What we're doing is we're trying to prepare our data into the

densest, cleanest format so that we can apply modeling or machine learning or some kind of statistical

test to work out what's going on and draw knowledge from our data. So this is going to be an iterative process.

We're going to transform the data and then we're going to reduce the data, and transforming data is what we're going to do today.

So let's imagine that you've cleaned your data. So we've got rid of as many missing values as possible,

hopefully all of them. We've deleted instances and attributes that just weren't going to work out for us.

Now what we're going to try and do is we're going to try and transform our data so that everything's on the same scale.

Everything makes sense together, and if we're bringing datasets from different places,

we need to also make sure that the units are the same and everything makes sense.

There's no point in trying to use machine learning or clustering or any other mechanism

to draw knowledge from our data if our data is all wrong.

So today we're going to be looking at census data. Now, census data is kind of a classic example of the kind of data you

might look at in data analysis. It's got lots of different kinds of attributes, things that are going to need cleaning up and transforming.

So we're back in R. We're going to read the census data into census using read.csv.

So we've downloaded some census data that represents samples from the US population.

To begin with we're going to read that in, and you can see that we've got 32,000 observations and 15 attributes or variables.

So let's have a quick look at just a little bit of it and we can see the kind

of thing we're looking at. So we're going to say head of census, and that's just going to produce the first few rows

so we can kind of see the kind of data. So you can see we've got age,

we've got what working classification that person has, their educational level, a

numerical representation of whether they're married or not, this kind of thing.
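
The video does this step in R with read.csv and head. As a stand-in, here is a minimal Python sketch of the same idea using only the standard library. The column names and the four sample rows are made up for illustration and are not the real census file; `read_rows` and `head` are hypothetical helpers.

```python
import csv
import io
from itertools import islice

# Made-up sample standing in for the downloaded census CSV.
SAMPLE_CSV = """age,workclass,education,hours_per_week
39,State-gov,Bachelors,40
50,Self-emp,Bachelors,13
38,Private,HS-grad,40
53,Private,11th,40
"""


def read_rows(handle) -> list:
    """Read a CSV file-like object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def head(rows: list, n: int = 3) -> list:
    """Return the first n rows, similar in spirit to R's head()."""
    return list(islice(rows, n))


census = read_rows(io.StringIO(SAMPLE_CSV))

# Every value comes back as text; numeric columns still need converting.
print(len(census), "observations,", len(census[0]), "attributes")
for row in head(census):
    print(row)
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, for example `open(path, newline="")`.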

So there's a lot of different kinds of data here. Some of it's going to be nominal.

So for example, this work class: state government, private employee. That's a nominal value.

We might have ordinal values or ratio values or interval values.

We're going to have to delve in a little bit closer to find out what these are. Now,

what we do to transform this data into a usable format for clustering or machine learning

is going to depend on exactly what the types of these columns are and what we want to do with them.

So let's look at just a couple of the attributes and see what we can do with them, right?

We're going to use a process called codification. The idea is that maybe things like random forests or

multi-layer perceptrons, you know, neural networks, aren't going to be very amenable to putting in text-based inputs.

And what we want to do is try and replace these attributes with a numerical score.

So let's look at, just for example, work class, and also, for example,

the educational level, so education. Now, work class is the kind of class of worker that we're looking at here.

So for example, a state worker, or in the private sector, or someone that worked in a school or something like this. Now,

this is a nominal value. That means there's no order to this data at all.

We can't say that someone in state is higher or lower than someone in private, and we also can't say that, let's say,

state is two times more or less than some other one. That makes no sense at all.

So what we can do is replace this with numbers.

So let's say we could replace private with zero and state with one and,

you know, self-employed with two, and so on, right? And that's a perfectly reasonable thing to do, but it's still nominal data.

So what we can't do is then calculate a mean and

say, ah, the mean is halfway between private and public. That doesn't make any sense. Just because something has been replaced by a numerical score

doesn't mean that it actually represents something that we can quantify in that way, right? It's still nominal data.
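
To make the codification of a nominal attribute concrete, here is a minimal Python sketch. The six sample values and the particular code assignment are hypothetical; the point is only that the most common category (the mode) survives the encoding, while a mean of the codes is a number with no interpretation.

```python
from collections import Counter
from statistics import mean

# Hypothetical work-class values for six people.
workclass = ["Private", "State", "Private", "Self-employed", "Private", "State"]

# Arbitrary codes: the numbers are labels only and carry no order or magnitude.
codes = {"Private": 0, "State": 1, "Self-employed": 2}
encoded = [codes[w] for w in workclass]

# The mode is still meaningful: it is just the most frequent label.
decode = {code: label for label, code in codes.items()}
most_common_code, count = Counter(encoded).most_common(1)[0]
mode_label = decode[most_common_code]

# The mean can be computed, but it does not describe any real category.
meaningless_mean = mean(encoded)

print("encoded:", encoded)
print("mode:", mode_label, f"({count} of {len(encoded)})")
print("mean of codes (not interpretable):", round(meaningless_mean, 3))
```

Swapping the codes around (say, State as 0 and Private as 1) changes the mean but not the mode, which is a quick way to check whether a statistic is safe to use on nominal data.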

Okay, so the best advice I can give is: feel free to codify your data into easy-to-read numbers,

but just bear in mind that you can calculate the mode, just like,

you know, the most common, but you can't calculate the median and you can't calculate the mean. Another example would be something like the

educational level. Now, this is ordinal data, so we could say that someone with an undergraduate degree

is maybe slightly higher, in terms of the amount of time they spent in education, than someone with a high school diploma.

But we don't know exactly what the distance is.

What's the distance between, let's say, a high school and a degree and then a PhD?

So we can replace these using numbers, and probably in order, right? So we could say that zero is no

education, and one is sort of the end of primary school, and two is the end of high school, and so on and so forth.

But again, it's difficult to calculate distances between these things.

We don't know that high school is two times more than primary school and half of a degree or something like that.

So again, you might be able to calculate a median on this, or a mode, but you can't calculate an average.

You can't say the average level of education is halfway between high school and undergraduate. That doesn't make any sense either.
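
For ordinal data the codes do carry an order, so a median becomes usable. The following Python sketch assumes a hypothetical five-level education scale and five made-up people; it uses `statistics.median_low` so that the result is always one of the actual levels rather than a value halfway between two of them.

```python
from statistics import median_low

# Hypothetical ordered scale: position in the list is the ordinal code.
EDUCATION_ORDER = ["None", "Primary", "High school", "Undergraduate", "PhD"]
rank = {level: i for i, level in enumerate(EDUCATION_ORDER)}

# Made-up education levels for five people.
education = ["High school", "Undergraduate", "Primary", "High school", "PhD"]
encoded = [rank[e] for e in education]

# The order is meaningful, so comparisons and the median are fine.
median_level = EDUCATION_ORDER[median_low(encoded)]

print("encoded:", encoded)
print("median level:", median_level)
```

The gaps between codes are still arbitrary: nothing here says a PhD is twice an undergraduate degree, which is why the sketch stops at the median and does not compute a mean.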

So for any kind of attribute that is nominal or possibly ordinal, and it's sort of represented using text,

we can codify this so that it's more amenable to things like decision trees, depending on the library you're using, right?

But you just have to be careful. All machine learning

algorithms will take any number you give them, and you just have to be careful that this makes sense to do.

So what you would do is you would go through your data and you'd begin to systematically replace appropriate attributes with numerical versions of themselves,

remembering all the time that they don't necessarily represent true numbers, you know, in a ratio or interval format.

So for any text-based value, we're going to start replacing these, possibly with numerical scores. What about the numerical values?

Well, they might be okay, but the issue is going to be one of scale.

You might find, for example, in this census data that one of the

dimensions or one of the attributes is much, much larger than another one. So for example, this data set has hours per week,

which is obviously going to be somewhere between naught and maybe 60 or 70 hours, for someone that's got [...]

salary, right, or income, or any other measure of, you know,

monetary gain. Now, obviously, hours per week is going to be in the tens, and

salary could be into the tens of thousands, maybe even the hundreds of thousands.

Those scales are not even close to being the same. That means if you're doing clustering or machine learning on this kind of data,

you're going to find that salary is kind of overbearing everything, right?

So it's going to be very easy for your clustering to find differences in salary, and it's harder for it to spot differences in hours.
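
The effect of mismatched scales on a distance-based method can be shown with a small Python sketch. The three people, their values, and the assumed attribute ranges are all hypothetical, and min-max rescaling is used here only as one common way to put attributes on a shared 0 to 1 scale, not necessarily the method the video goes on to use.

```python
import math

# Hypothetical people as (hours_per_week, salary).
a = (20.0, 30_000.0)
b = (60.0, 30_500.0)  # triple the hours of a, almost the same salary
c = (20.0, 32_000.0)  # same hours as a, slightly higher salary

# Assumed plausible ranges for each attribute (not taken from the data set).
HOURS_RANGE = (0.0, 80.0)
SALARY_RANGE = (0.0, 100_000.0)


def euclidean(p, q) -> float:
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def rescale(value: float, low: float, high: float) -> float:
    """Min-max rescaling of a value onto the interval [0, 1]."""
    return (value - low) / (high - low)


def rescale_person(p):
    """Rescale both attributes of a person using the assumed ranges."""
    hours, salary = p
    return (rescale(hours, *HOURS_RANGE), rescale(salary, *SALARY_RANGE))


# On raw values, salary dominates: a looks closer to b than to c.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# After rescaling, the large gap in hours shows up: a is now closer to c.
scaled_ab = euclidean(rescale_person(a), rescale_person(b))
scaled_ac = euclidean(rescale_person(a), rescale_person(c))

print("raw:    a-b", round(raw_ab, 1), " a-c", round(raw_ac, 1))
print("scaled: a-b", round(scaled_ab, 3), " a-c", round(scaled_ac, 3))
```

On the raw numbers a 40-hour difference barely registers next to a 500-unit salary difference, so which neighbour counts as nearest flips once both attributes share a scale.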
