
  • What is data? Right, I'm pretty sure that's data.

  • Right, is this data? You know, this picture, or that data?

  • Is this data? What is data?

  • So we talked a lot about data in the last video,

  • and why it is important that we can analyze and understand data. But what is data? Everybody has data; everybody's generating it.

  • Companies are generating it on us. We're generating it ourselves, you know, when we use social media and so on.

  • But what is it?

  • Understanding what it is is a prerequisite for being able to use it properly.

  • Perhaps the most important thing, as far as we're concerned,

  • so people who are trying to analyze data sort of scientifically, is that the data has to be measurable, right?

  • So the idea is, you know, if you're going to do a survey on what people like,

  • everyone's got to be using the same scale and the same rating system.

  • Otherwise, it doesn't make any sense.

  • We can't have someone rating things from one to five and someone else saying "I thought it was good",

  • right? Because which one of one to five is "good"? We don't know.

  • All right.

  • So everyone has got to be doing the same thing: your data's got to be in a consistent format. Once that's achieved, at least

  • we're a little bit closer

  • to being able to make some sense of it. Broadly speaking, when we talk about data,

  • we kind of have four different types, and we summarize this with this nice word, NOIR. So N, O, I, R: NOIR.

  • And with each of these different types of data we can do different things, all right?

  • So N, that's the first type: this is nominal data.

  • Nominal data is where we have no distance between the values that we can measure,

  • right, because they're not really quantities, and we can't order them. So a good example would be colors.

  • So maybe your favorite color is red and my favorite color is blue.

  • I don't know which is better than the other.

  • There is no measurement between them, right? Is blue closer to green than to red? You know, that doesn't make any sense, right?

  • We're not talking about wavelengths. We're just talking about the colors, right?

  • Another good example would be, let's say in football, the player numbers on your back. Right, now, symbolically,

  • sometimes certain player numbers have a meaning, but you can't compare and contrast them.

  • You can't say that 8 is 2 times better than 4. That doesn't make any sense, right?

  • You also can't really order them in general, right? Player

  • 16 doesn't go before or after player 13 in a list. You know, that doesn't make any sense, right?

  • So nominal data is useful, right?

  • It could be really important, but it's data where we kind of have labels

  • but no way of ordering these labels. So you can still analyze it, but you can't, for example, calculate

  • the average, the mean average, right? That wouldn't make any sense.

  • What you can do is calculate the mode, so you can calculate the most common one. So you could say that more people prefer red

  • to blue, but you couldn't say, you know,

  • the average color that people like is a sort of muddy brown, right? That doesn't make any sense at all.
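
As a rough illustration of the point about nominal data, here is a minimal Python sketch (the video itself uses R). The list of favorite colors is made up for the example; the only summary computed is the mode, since a mean of labels has no meaning.

```python
from collections import Counter

# Hypothetical survey answers: favorite colors are labels with no order
# and no measurable distance between them.
favourite_colours = ["red", "blue", "red", "green", "red", "blue"]

# The mode (the most common label) is the summary that makes sense here.
counts = Counter(favourite_colours)
mode_colour, mode_count = counts.most_common(1)[0]

print(f"Most common colour: {mode_colour} ({mode_count} votes)")
```

Trying to average these labels would fail outright, which matches the "muddy brown" remark: there is nothing numeric to average.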

  • So as we go down this list, we get types of data that are, in some sense, slightly more and more informative.

  • So the next one is ordinal.

  • So in ordinal data

  • we have an order, but we can't measure distances between things. So a good example would be something like

  • the positions people finished in a race. So, you know, maybe I finished first;

  • I'm super quick, right? You didn't; you finished third.

  • But how far apart we are, that isn't included in that kind of data.

  • You'd have to have a separate value for that. Another example that we're all familiar with is rating systems, right?

  • So perhaps I rate a film from one to five stars and you rate the film from one to five stars,

  • but you can't really say that a

  • film that's got four stars is two times better than one that scored two,

  • because that's very subjective and there's no real sort of measurable distance between these stars. If you have ordinal data,

  • you can calculate the mode again. You can calculate the most

  • common value of all the values that were returned, or you can calculate the median, the one that sits in the middle, right?

  • So maybe, you know, with fifty runners in a race, the 25th position, roughly speaking, is going to be around the median.
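
The mode and median for ordinal data can be sketched in Python as follows. The star ratings are invented for illustration, and `median_low` is chosen here (an assumption of this sketch, not something from the video) because it always returns a value that actually occurs in the data rather than averaging two ratings.

```python
import statistics

# Hypothetical star ratings (1 to 5) for one film. The values are ordered,
# but the gap between stars is not a measurable distance.
ratings = [4, 2, 5, 3, 5, 1, 5, 2, 3]

# Most common rating.
mode_rating = statistics.mode(ratings)

# Middle rating once sorted; median_low never averages two values,
# so the result is always one of the ratings people actually gave.
median_rating = statistics.median_low(ratings)

print(f"mode = {mode_rating}, median = {median_rating}")
```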

  • So it's still not hugely useful, right? Next up, we have interval data. In interval data

  • we have an order and we have a distance, but we have no sort of absolute zero for this scale.

  • So a good example would be something like degrees Celsius or degrees Fahrenheit.

  • Zero degrees Celsius isn't no temperature. It's a specific temperature, right?

  • So we can't say that fifty degrees is half of a hundred degrees.

  • I mean, the number is a half, but that doesn't really make sense, right?

  • But we can say that a hundred degrees is hotter than 50, which is hotter than zero, right?

  • So this is interval data. Now, interval data

  • lets us do a few more things than we could with ordinal. As well as being able to calculate the mode and median, we can

  • now calculate the mean temperature. That's okay.

  • And we could also calculate things like the range, the minimum and maximum temperatures for a certain window, right?

  • So that's pretty useful. Another good example of interval data would be pH level, right? Again, a pH of zero means very acidic.

  • It doesn't mean there is no acidity at all or no pH at all. We can say that a pH

  • of 13 is higher than a pH of 7, which is higher than a pH of 3,

  • and we know how far apart these numbers are, but we can't necessarily say that one is double another.
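
A small Python sketch of the interval-scale summaries mentioned above (mean, minimum, maximum, range). The temperatures are made-up Celsius readings, not data from the video.

```python
import statistics

# Hypothetical midday temperatures in degrees Celsius over one week.
temps_c = [5, 12, 9, 15, 11, 7, 11]

# On an interval scale, differences are meaningful, so the mean and the
# range are both sensible summaries.
mean_temp = statistics.mean(temps_c)
lowest, highest = min(temps_c), max(temps_c)
temp_range = highest - lowest

print(f"mean = {mean_temp}, min = {lowest}, max = {highest}, range = {temp_range}")
```

What the sketch deliberately does not compute is a ratio such as `highest / lowest`: with no true zero on the Celsius scale, that number would not describe anything physical.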

  • So the final kind of data we're going to look at is ratio data.

  • So this is exactly like interval, except we now have a sort of true zero value.

  • So a good example of this would be degrees Kelvin, right? So Kelvin has an absolute zero, which is the absolute

  • absence of any kind of heat, right? And it goes upwards, so we can say that in terms of Kelvin a hundred is

  • half of 200 and so on like this, and we can get to 0. Another example would be number of children, right?

  • Zero children means the absence of any children, and you can also say that, let's say, four children is double the amount of two children,

  • and too many to look after, in my opinion.

  • So that is an example of ratio data.

  • Right, now, ratio data is quite similar to interval in terms of what you can calculate, but it allows some more

  • complicated statistical measures, such as the t-test.
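
To make the interval versus ratio distinction concrete, here is a hedged Python sketch. The helper `celsius_to_kelvin` is a name introduced for this example; it uses the standard offset of 273.15.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to kelvin (offset of 273.15)."""
    return celsius + 273.15

# On the Celsius (interval) scale the numbers divide to 2.0,
# but that ratio has no physical meaning.
ratio_celsius = 100 / 50

# The same two temperatures on the Kelvin (ratio) scale give the
# meaningful ratio, which is much smaller than 2.
ratio_kelvin = celsius_to_kelvin(100) / celsius_to_kelvin(50)

# Readings that are already in kelvin can be compared directly:
# 200 K really is twice 100 K.
ratio_true = 200 / 100

print(ratio_celsius, round(ratio_kelvin, 3), ratio_true)
```

The first ratio changes as soon as the unit changes, which is the sign that it was never meaningful; the Kelvin ratio is the one that survives.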

  • So these are the types of data. Now, actually, it's quite important how you structure your data in general.

  • We can't just have it sitting in some massive spreadsheet with no thought given to where everything is, right?

  • There's actually a pretty standard way of doing this that we're going to look at.

  • Data comes in lots of forms, right? Different types of measurements, different experiments; people are going to collect it in different ways.

  • But actually there's a very standard way that we use

  • to represent data once it's actually on a computer, so we can have some kind of table of our data.

  • We almost always

  • represent our data in a matrix like this, a

  • two-dimensional table, because it's much easier to do. And so along the top

  • we're going to have our attributes, right, which are the things we've been measuring.

  • So an example would be: maybe we're collecting data on people, so we could have name.

  • That would be some nominal data. And then, you know, age, height.

  • So the columns are attributes, all the things we've been measuring. The rows,

  • those are the instances or the samples we've got, so that's all the individual people.

  • So here's person 1 and person 2 and person 3, and person 3 is called John, and they're,

  • you know, 54 and, you know, 5 foot 11 or whatever, you know,

  • whatever, right, and so on. And you can have as many rows as you want.

  • So when we talk about

  • attributes,

  • we're talking about the number of columns. People use lots of different terms for these. I like to think of them as features;

  • attributes is another one. And we have instances or samples down the rows. Now, quite often, in the very last column of your data,

  • sometimes separated out, but that's not really important, we'll have our output.

  • Maybe we're trying to make a decision based on these people.

  • Maybe these are candidates for a football team and we're saying, you know, are they going to be on the team or not?

  • So this is yes or no. John's made it:

  • yes. No, no, and so on. And that way we could perhaps analyze our decision-making process and decide, you know,

  • is there any aspect of these things that informs our decision-making process, as an example. Right, now,

  • we always structure data in this way,

  • because if we don't, it becomes a huge problem: you end up spending all this time formatting and trying to work out

  • what's what, and, you know, why is John listed down the table and not across the table? And, you know, nothing makes any sense anymore.
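
The table layout described above (attributes as columns, instances as rows, output in the last column) might be sketched in Python like this. Only John's row comes from the video; the other two people, and the column name `on_team`, are invented for the example.

```python
# Attributes (features) are the columns; the last one is the output.
columns = ["name", "age", "height", "on_team"]

# Each row is one instance (one person).
people = [
    ("Ann", 31, "5 ft 6", "no"),     # made-up row
    ("Raj", 27, "6 ft 0", "no"),     # made-up row
    ("John", 54, "5 ft 11", "yes"),  # person 3 from the video
]

# Pair every value with its attribute name so rows can be read by column.
rows = [dict(zip(columns, person)) for person in people]

n_instances = len(rows)
n_attributes = len(columns)

# One instance (a row) and one attribute (a column).
person_3 = rows[2]
ages = [row["age"] for row in rows]

print(n_instances, n_attributes, person_3["name"], ages)
```

Keeping every person in a row and every measurement in a column is what makes both kinds of lookup a one-liner.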

  • So let's look at an actual data set and we'll see all this in action.

  • So we have here a data set of whether someone goes to play tennis,

  • right, and whether or not they go is going to depend a little bit on what the weather conditions are, right?

  • So we don't like to play, for example,

  • when it's too hot. The tennis data set has just the same structure as the data set we looked at already.

  • We're going to load it into R; it's held in a CSV file. So: tennis, read CSV,

  • tennis. Now, we're using R for this because it's free and it has a load of decent functions for analyzing, examining, and

  • visualizing data, right? So we're going to be using it throughout these videos.

  • Obviously you could use MATLAB or Python or some other library if you wanted to.

  • I think that you should use whatever you're most comfortable with.

  • Looking at these rows and tables,

  • I mean, it looks a lot like something like Microsoft Excel.

  • Could you do this data analysis in Excel?

  • Some people would disagree. No, Excel is perfectly good for what it does; you could do data analysis in it. I think that

  • Excel doesn't enforce anything to do with

  • observations versus variables and things like that. These are distinctions that are not really made in Excel.

  • Obviously, if you enforce those rules yourself, that's going to work, but you have to be a little bit more,

  • you know, regimented and rule-based about it.

  • I think the consensus would be that if you really want to get into data analysis and start doing things like principal component analysis or more

  • advanced statistical measures, something like R or Python is going to help a lot more.

  • Okay.

  • So I've loaded the data set, and if we look at the data set,

  • so we look at the top few rows of the data, you'll see that there are 6 different variables, or 6 attributes, and

  • this data set has 14 instances or observations.

  • R calls them observations. So what we're saying is we have six columns and

  • fourteen rows in our data set, and this data set is

  • structured exactly like

  • this people data set that I was looking at a minute ago.

  • So we can examine a single instance. We can say: what is it about day three?

  • So let's have a look at day three, so we can say tennis on day 3.

  • And we can see that on day three it was overcast. The temperature was only five degrees.

  • The humidity was high, there wasn't any wind, and they decided to play tennis, right?

  • So it's a bit chilly, but I guess they gave it a go.

  • And so on. We could also look at all the different temperatures, or, for example, all the different forecasts: tennis dollar outlook.

  • All right.

  • And we can look at all the outlooks in the data set, so we can say we've got sunny, sunny, overcast, rainy, rainy,

  • rainy, and so on, and we can get a feel for what kind of weather we're looking at here as well. Using something like R,

  • you can examine the instances,

  • you can examine the individual attributes, you can group them together or not as you see fit, and then you can start to drill into

  • what this dataset means.
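
The video does these steps in R (loading the CSV, checking its shape, pulling out one row, pulling out one column). A rough Python equivalent using only the standard library is sketched below. The inline CSV text stands in for the real file; its column names and most of its values are assumptions for illustration. Only the day-three values (overcast, five degrees, high humidity, no wind, played) and the order of the first six outlooks follow what the speaker reads out, and the real data set has 14 rows rather than 6.

```python
import csv
import io

# Stand-in for the CSV file. Column names and most values are illustrative;
# only day 3 and the sequence of outlooks follow the video.
TENNIS_CSV = """day,outlook,temp,humidity,wind,play
1,sunny,25,high,0,no
2,sunny,28,high,20,no
3,overcast,5,high,0,yes
4,rainy,12,high,0,yes
5,rainy,9,normal,0,yes
6,rainy,8,normal,20,no
"""

# Load every row as a dict keyed by column name (all values are strings).
tennis = list(csv.DictReader(io.StringIO(TENNIS_CSV)))

# Shape of the table: instances (rows) by attributes (columns).
n_rows = len(tennis)
n_cols = len(tennis[0])

# A single instance: day 3 sits at index 2 because lists start at 0.
day3 = tennis[2]

# A single attribute: every value in the outlook column.
outlooks = [row["outlook"] for row in tennis]

print(n_rows, n_cols)
print(day3)
print(outlooks)
```

With a real file, `io.StringIO(TENNIS_CSV)` would be replaced by an opened file handle; the rest of the sketch stays the same.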

  • Now, this dataset has in it the final column, which is whether they actually played, so you could use something like machine learning

  • to predict that final column based on the other columns. That's something you could do. One other thing about this dataset

  • that's quite interesting is that it has a few examples of the different kinds of data we were looking at earlier.

  • So remember, we have nominal, ordinal, interval, and ratio.

  • So, for example, outlook is really a nominal field, right? It's a nominal data type.

  • You could perhaps suggest that you could order it from rainy through to sunny, but then cloudy, overcast, you know,

  • it doesn't really make any sense.

  • So this is kind of nominal. You could calculate, for example, the mode, and say that most of the days were rainy or something like this.

  • Temperature, as we discussed before: this is in Celsius, so this is going to be

  • interval. We can order the data, and we can say that one of them is 50 away from another one,

  • but we can't say how much of a difference that is. Like, is that double the temperature or half the temperature?

  • We can't really say. So humidity is ordinal, so we can say high is more humidity than normal, right?

  • But we can't really say how much; that's going to depend on who was measuring it and where their differences lie. And finally,

  • wind, in kilometers per hour. Well, zero is no wind. You can't have negative wind. So this is ratio, right?

  • You can say that a 20 mile an hour wind, or a 20 kilometers an hour wind, is two times more than ten.

  • That's something you can say. This little dataset contains all the kinds of data,

  • so the different

  • statistics and measures you can calculate using these are going to depend on what kind of data they are.
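
One way to keep track of which summaries each column supports is a small lookup table, sketched here in Python. The scale assigned to each column follows the video; the column names, the statistic names, and the helper `can_compute` are assumptions made for this example.

```python
# Summaries each measurement scale supports, as described in the video:
# every scale keeps what the previous one allowed and adds more.
ALLOWED = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean", "range"},
    "ratio": {"mode", "median", "mean", "range", "ratio"},
}

# Scale of each tennis column (column names are assumed).
COLUMN_SCALE = {
    "outlook": "nominal",
    "humidity": "ordinal",
    "temp": "interval",
    "wind": "ratio",
}

def can_compute(column, statistic):
    """Return True if the statistic is meaningful for the column's scale."""
    return statistic in ALLOWED[COLUMN_SCALE[column]]

print(can_compute("outlook", "mean"))  # a mean of labels is not meaningful
print(can_compute("wind", "ratio"))    # 20 km/h really is twice 10 km/h
```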

  • So we can see that even a very simple data set

  • like this has loads of different kinds of data and different ways we could interpret this data.

  • Right, if you make a decision to play based only on whether the outlook is good,

  • you're maybe not going to solve the whole problem, right?

  • So these are the kind of things we'll be looking at as we go forward.

  • And one thing we might do next is to visualize this data, to start to try and understand some patterns or extract some kind of knowledge.

  • It's a very important tool, but you've got to use it properly.

  • You can't just plot anything and everything.

  • Every chart you use has got to support your hypothesis, or it's got to try and show the story

  • you're trying to tell, right? You don't just plot something because it could be plotted, right?

  • There's got to be a point to it. There are a lot of problems with using inappropriate graphs and only picking subsets of your data.

  • That's a huge problem.
