Let's imagine that you work for a major streaming media provider, right? So you have, I don't know, some 100 million users, and you've got, I don't know, ten thousand videos on your site, or many more audio files. So for each user you're going to have collected information on what they've watched, when they've watched it, how long they've watched it for, whether they went from this one to this one, did that work, was that good for them? And so maybe you've got 30,000 data points per user. We're now talking about trillions of data points, and your job is to try and predict what someone wants to watch or listen to next. Best of luck.

So we've cleaned the data, we've transformed our data, everything's on the same scale, we've joined data sets together. The problem is, because we've joined data sets together, perhaps our data set has got quite large now. Or maybe we just work for a company that has a lot, a lot of data. Certainly the general consensus these days is to collect as much data as you can, but this isn't always a good idea. What we want, remember, is the smallest, most compact and useful data set we can get; otherwise you're just going to be wasting CPU hours or GPU hours training on it, wasting time. We want to get to the knowledge as quickly as possible, and if you can do that with a small amount of data, that's going to be great.

So we've got quite an interesting data set to look at today, based on music. It's quite common these days, when you're building something like a streaming service, for example Spotify, that you might want to have a recommender system. This is an idea where you've maybe clustered people who are similar in their tastes: you know what kind of music they're listening to, and you know the attributes of that music, and if you know that, you can say, well, this person likes high-tempo music, so maybe they'd like this track as well. And this is how playlists are generated. One of the problems is that you're going to have to produce descriptions of the audio, things like tempo and how upbeat it is, in order to machine learn on this kind of system. And that's what this data set is about.

So we've collected a data set here today that is lots and lots of metadata on music tracks. These are freely available tracks and freely available data, and I'll put a link in the description if you want to have a look at it yourself. I've cleaned it up a bit already, because obviously I've been through the process of cleaning and transforming my data. So we're going to load this now. This takes quite a long time to do, because there are quite a lot of attributes and quite a lot of instances. It's loaded, right? How much is this data? Well, we've got 13,500 observations, that's instances, and we've got 762 attributes. So another way of putting this, in sort of machine learning parlance, is that we've got some thirteen thousand instances and 760-odd features. Now these features are a combination of things, so let's have a quick look at the columns, so we can see what this data set is about. So: names of music.all. Right, so we've got some 760 features or attributes, and you can see there's a lot of slightly meaningless text here, but if we look at the top you'll see some actual things that may be familiar to us. So we've got the track ID, the album ID, the genre, right?
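As a rough sketch of that loading and inspection step in R: the file name music_all.csv is my assumption (the transcript only tells us the data frame is called something like music.all and gives its dimensions).

    # Load the cleaned music metadata; the file name here is assumed.
    # This can take a while, because there are a lot of attributes and instances.
    music.all <- read.csv("music_all.csv", stringsAsFactors = FALSE)

    # How much data is this? Roughly 13,500 observations (instances)
    # and 762 attributes (features).
    dim(music.all)

    # Peek at the column names: track and album metadata first, then the
    # librosa and Echo Nest audio-description features.
    head(names(music.all), 20)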
So genre is an interesting one, because maybe we can start to use some of these audio descriptions to predict what genre a piece of music is, or something like that. There are things like the track number and the track duration, and then we get on to the actual audio description features. Now, these have been generated by two different libraries. The first is called librosa, which is a publicly available library for taking an mp3 and calculating musical attributes of it. What we're trying to do here is represent our data in terms of attributes. An mp3 file is not an attribute, it's a lot of data. So can we summarize it in some way? Can we calculate, by looking at the mp3, what the tempo is, what the amplitude is, how loud the track is, these kinds of things? That's the kind of thing we're measuring, and a lot of these are going to go into a lot of detail, down at kind of a waveform level. So we have the librosa features first, and then, if we scroll down, after a while we get to some Echo Nest features. Echo Nest is a company that produces very interesting features on music, and actually these are the features that power Spotify's recommender system, and numerous others. We've got things like acousticness (how acoustic does it sound?), we've got instrumentalness, and, I'm not convinced speechiness is a word, but there you go: to what extent is it speech or not speech? And then things like tempo, how fast is it, and valence, how happy does it sound? A track of zero would be quite sad, I guess, and a track of one would be really happy and upbeat. And then of course we've got a load of features I've labeled temporal here, and these are going to be based on the actual music data themselves.

Often when we talk about data reduction, we're actually talking about dimensionality reduction. One way of thinking about it is this: so far we've been looking at things like attributes, and we've been saying, what is the mean or the standard deviation of some attribute of our data? But actually, when we start to talk about clustering and machine learning, we're going to talk a little bit more about dimensions. Now, in many ways the number of attributes is the number of dimensions; it's just another term for the same thing, but certainly from a machine learning background we refer to a lot of these things as dimensions. So you can imagine, if you've got some data here, you've got your instances down here and you've got your attributes across here. So in this case, our music data: we've got each song, so this is song one, this is song two, song three, and then all the attributes, the Echo Nest attributes, the tempo and things like this. These are all dimensions in which this data can vary. So they can be different in the first dimension, which is the track ID, but they can also, down here, be different in this dimension, which is the tempo. When we say some data is seven hundred dimensional, what that actually means is it has seven hundred different ways, or different attributes, in which it can vary.
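To make the attributes-as-dimensions idea concrete, here is a small sketch in R: every numeric column is one dimension, and each track is one point in that space, so the "difference" between two tracks is just a distance between two points. Treating every numeric column as an audio feature is an assumption here; in the real data you would want to exclude numeric metadata like IDs and durations first.

    # Treat each track as a point in feature space (assumption: every
    # numeric column is a feature; in practice you'd drop IDs first).
    features <- music.all[sapply(music.all, is.numeric)]

    nrow(features)   # number of instances (points)
    ncol(features)   # number of dimensions each point lives in

    # The "distance" between two tracks is just the Euclidean distance
    # between two points in this high-dimensional space.
    dist(features[c(1, 2), ])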
You can imagine that, first of all, this is going to get quite big quite quickly. Seven hundred attributes seems like a lot to me, right, and depending on what algorithm you're running, it can get quite slow when you're running on this kind of size of data. And maybe this is a relatively small data set compared to what Spotify might deal with on a daily basis. But another way to think about this data is as points in this space. So we have some 700 different attributes that can vary, and when we take a specific track, it sits somewhere in this space. If we were looking at it in just two dimensions, you know, track one might be over here, and track two over here, and track three over here, and in three dimensions track four might be back at the back here. You can imagine that the more dimensions we add, the further spread out these things are going to get. But we can still do all the same things we can do in three dimensions in 700 dimensions; it just takes a little bit longer.

So one of the problems is that some things, like machine learning, don't like to have too many dimensions. Things like linear regression can get quite slow if you have tens of thousands of attributes or dimensions. So remember that perhaps the default response for anyone collecting data is just to collect it all and worry about it later. This is the point when you have to worry about it. What we're trying to do is remove any redundant variables. If you've got two attributes of your music, like tempo and valence, that turn out to be exactly the same, why are we using both? We're just making our problem a little bit harder. Now, in actual fact, the Echo Nest features are pretty good, they don't tend to correlate that strongly, but you might find, when we've collected some data on a big scale, that actually a lot of the variables are very, very similar all the time, and you can just remove some of them, or combine some of them together, and make your problem a little bit easier.

So let's look at this on the music data set and see what we can do. The first thing we can do is remove duplicates. Right, that sounds like an obvious one, and perhaps one that we could also do during cleaning, but exactly when you do it doesn't really matter as long as you're paying attention. What we're going to say is music.all equals unique of music.all, and what that's going to do is look for any duplicate rows and remove them. The number of rows we've got will drop by some amount. Let's see; we'll wrap it in a timer, because actually this is quite a slow process: you've got to consider that we're going to look through every single row and try and find any other rows that match. Okay, so this has removed about 40 rows, so that means we had some duplicate tracks. You can imagine that things might get accidentally added to the database twice, or maybe two tracks are actually identical because they were released multiple times, or something like that. Now, what this is doing, the unique function, is finding rows that are exactly the same for every single attribute, or every single dimension. Of course, in practice you might find that you have two versions of the same track which differ by one second; they might have slightly different attributes, though hopefully they'll be very, very similar. So what we could also do is have a threshold where we said: these are too similar, they're the same thing. The name is the same, the artist is the same and the audio descriptors are very, very similar; maybe we should just remove one of them.
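Here is a minimal sketch of that exact-duplicate removal, assuming the data frame is called music.all as above; system.time is there just to show how long the row-by-row comparison takes.

    # Remove exact duplicates: rows identical in every single attribute.
    # This is slow, because every row is compared against the others.
    system.time(music.all <- unique(music.all))

    # The row count should drop by a small amount (about 40 on this data).
    nrow(music.all)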
The other thing you could do, just for demonstration, is what we're going to do now: focus on just a few of the genres in this data set, to make things a little bit clearer for visualizations. We're going to select just the classical, international, jazz, pop and spoken-word genres, because these have a good distribution of different amounts in the data set. So we're going to run that: we're creating a list of genres, and we're going to say music is music.all at every row where the genre is in that list of genres we just produced, and that's going to produce a much smaller data set of 1,600 observations with the same number of attributes or dimensions. Now, normally you would obviously keep most of your data in; this is just for a demonstration. But removing genres that aren't useful to you for your experiment is a perfectly reasonable way of reducing your data size, if that's a problem. Assuming they've been labeled right in the first place, of course; that's someone else's job.

Let's imagine that 1,600 is still too many. Now, actually, computers are getting pretty quick, and maybe 1,600 observations is fine, but perhaps we want to remove some more. The first thing we could do is just chop the data off halfway and keep about half. So let's try that first of all. We're going to say music.first, that's the first few rows of our music, is rows 1 to 835 and all the columns. So we're going to run that, and that's even smaller. So we can start to whittle down our data. This is not necessarily a good idea: we're assuming here that our genres are, you know, randomly distributed around our data set, and that might not be true. You might have all the rock first and then all the pop, or something like that. If you take the first few rows, you're just going to get all the rock, and depending on what you like, that might not be what you want.

So let's plot the genres in the original data set, and you can see that we've got very little spoken word, but it is there; we have some classical, international, jazz and pop in roughly the same amounts. If we plot after we've selected the first 50%, you can see we've lost two of the genres: we only have classical, international and jazz, and there's hardly any jazz. That's not a good idea, so don't do that unless you know that your data is randomized. This is not giving us a good representation of the genres; if we wanted to predict genre, for example, based on the musical features, cutting out half the genres seems like an unwise decision. So a better thing to do would be to sample randomly from the data set. What we're going to do is use the sample function to give us 835 random indices into this data, and then we're going to use that to index our music data frame instead; that's this line here. And hopefully this will give us a better distribution. If we plot the original again, it looks like this, and you can see we've got a broad distribution; and then if we plot the randomized version, you can see we've still got some spoken word (it's actually gone up slightly), but the distributions are broadly the same. So this has worked exactly how we want.
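Here is roughly what that whole sequence might look like in R: filtering down to a handful of genres, the naive take-the-first-half approach, and the random sample that replaces it. The column name genre, the exact genre labels and the variable names are assumptions based on what is described above.

    # Keep only a few genres to make the visualizations clearer
    # (the exact genre labels are assumed here).
    genres <- c("Classical", "International", "Jazz", "Pop", "Spoken")
    music  <- music.all[music.all$genre %in% genres, ]   # roughly 1,600 rows

    # Naive approach: just keep the first half of the rows. If the data
    # happens to be ordered by genre, this gives a badly biased subset.
    music.first <- music[1:835, ]

    # Better: take a random sample of row indices of the same size.
    set.seed(1)                                # for reproducibility
    music.sample <- music[sample(nrow(music), 835), ]

    # Compare the genre distributions before and after sampling.
    barplot(table(music$genre))
    barplot(table(music.sample$genre))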
So how you select your data, if you're trying to make it a little bit smaller, is very, very important and worth thinking about. Obviously we only had 1,600 rows here, and even the full data set is only around 13,500 rows; you could imagine that you might have tens of millions of rows, and you've got to think about this before you start just getting rid of them completely. Randomized sampling is a perfectly good way of selecting your data. Obviously, it has a risk: maybe the distributions of your genres are a little bit off, or maybe you haven't got very much of a certain genre, and you can't guarantee that the distributions are going to be the same on the way out. If you're trying to predict genre, that's going to be a problem. So perhaps the best approach is stratified sampling. This is where we try and maintain the distribution of our classes, so, for example, in this case genre. We could say we had 50% rock, 30% pop and 20% spoken word, and we want to maintain that kind of distribution on the way out, even if we only sample about 50% of the rows. This is a little bit more complicated in R, but it can be done, and it's a good approach if you want to make absolutely sure that the distributions of your sample data are the same as your original data.
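The transcript says stratified sampling is a little more complicated in R; here is one minimal way of doing it in base R, sampling roughly half of the rows within each genre so the class proportions are preserved. The genre column name is, again, an assumption.

    # Stratified sampling: sample within each genre separately, so the
    # genre proportions in the sample match the original data.
    set.seed(1)
    rows.by.genre <- split(seq_len(nrow(music)), music$genre)
    strat.idx <- unlist(lapply(rows.by.genre, function(rows) {
        rows[sample.int(length(rows), length(rows) %/% 2)]
    }))
    music.strat <- music[strat.idx, ]

    # The proportions should now closely match the full data set.
    round(prop.table(table(music$genre)), 2)
    round(prop.table(table(music.strat$genre)), 2)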
We've just looked at some ways we can reduce the size of our data set in terms of the number of instances, or the number of rows. Can we make the number of dimensions, or the number of attributes, smaller? Because that's often one of the problems. And the answer is yes, and there are lots of different ways we can do this, some more powerful and useful than others.

One of the ways is something called correlation analysis. A correlation between two attributes basically tells us that when one of them increases, the other one either increases or decreases, in general, in relation to it. So you might have some data like this: we might have attribute one and we might have attribute two, and they sort of look like this. These are the data points for all of our different data; obviously we've got a lot of data points, and you can see that, roughly speaking, they kind of increase in this sort of direction here, like this. Now, it might be that if this correlation is very, very strong, so basically attribute two is a copy of attribute one, more or less, then maybe it doesn't make sense to have attribute two in our data set; maybe we can remove it without too much of a problem. What we can do is something called correlation analysis, where we compare all of the attributes against all of the other attributes, we look for high correlations, and we decide ourselves whether to remove them. Now, sometimes it's useful just to keep everything in and try not to remove things too early, but on the other hand, if you've got a huge amount of data and your correlations are very high, this could be one way of doing it.

Another option is something called forward or backward attribute selection. This is the idea that maybe we have a machine learning model or clustering algorithm in mind; we can measure the performance of that, and then we can remove features and see if the performance stays the same, because if it does, maybe we didn't need those features. So what we could do is train our model on, let's say, a 720-dimensional data set, get a certain level of accuracy and record that, then try it again after removing one of the dimensions and train on 719. Maybe the accuracy is exactly the same, in which case we can say, well, we didn't really need that dimension at all, and we can start to whittle down our data set this way. Forward attribute selection is the other way round: we literally train our machine learning on just one of the attributes, see what our accuracy is, and keep adding attributes in and retraining until our performance plateaus and we can say, you know what, we're not gaining anything now by adding more attributes. Obviously, there's the question of which order you try the attributes in; usually it's done randomly. So what you would do, for example for backward attribute selection, is train on all the data, take one attribute out at random; if your performance stays the same, you can leave it out; if your performance gets much worse, you put it back in, don't try that one again, and try a different one. And you slowly start to take dimensions away and hopefully whittle down your data.

Let's have a quick look at correlation analysis on this data set. You might imagine that if we're calculating features based on the mp3, from librosa or Echo Nest, maybe they're quite similar a lot of the time, and maybe we can remove some of them. Let's have a quick look. We're just going to focus on one set of librosa features, for simplicity, so we're going to select only the attributes that contain this chroma kurtosis field, which is one of the things you can calculate using librosa. I'm going to run that; we're going to rename them, just for simplicity, to kurt1, kurt2, kurt3 and so on, and then we're going to calculate a correlation matrix of each of these different features versus each other, like this. Okay, finally, we're going to plot this and see what it looks like. Hopefully we can find some good correlations, and we'll have candidates for removing a few of these dimensions if they're redundant, which wouldn't be too bad. So you can see that we've got, for example, kurt7 here: index 7 is fairly similar to 8, that's a correlation of 0.65. Maybe that means we could remove one of those two. This one here is 0.59, and we've got a 0.48 over here. These are fairly high correlations; if you're really stretched for CPU time, or you're worried about the size of your data set, this is the kind of thing you could do to remove them. Of course, whether 0.65 is a strong enough correlation that you want to completely remove one of these dimensions is really up to you, and it's going to depend on your situation. One of the reasons that the correlations aren't quite as high as you might think is that these libraries have been designed with this in mind: if Echo Nest just produced 200 features that were all exactly the same, it wouldn't be very useful for picking playlists. So they've produced 200 features that are widely different, so they're not necessarily going to correlate all the time. That's the whole point, and that's a really useful feature of this data.
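A sketch of that correlation analysis, assuming the chroma kurtosis columns can be picked out by name (the exact column naming is my assumption), and using the corrplot package as one possible way of plotting the matrix:

    # Select just the librosa chroma kurtosis features by column name
    # (the matching pattern depends on how the columns are labelled).
    kurt <- music[, grepl("chroma", names(music)) & grepl("kurtosis", names(music))]
    names(kurt) <- paste0("kurt", seq_len(ncol(kurt)))

    # Correlation matrix of each of these features against the others.
    kurt.cor <- cor(kurt)

    # Visualize it; high off-diagonal values (around 0.65, say) are
    # candidates for removal if you are short on CPU time or storage.
    library(corrplot)                 # one plotting option; install it first
    corrplot(kurt.cor, method = "number")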
We've looked at some ways we can try and make our data set a little bit smaller. Remember, our ultimate goal is the smallest, most useful data set we can get our hands on; then we can put that into machine learning or clustering and really extract some knowledge. The problem is that, based on correlation analysis or forward or backward attribute selection, we might just be deleting data, and maybe the correlation wasn't one; it wasn't completely redundant. Do we actually want to completely remove this data? Is there another way we can transform our data, to make more informed and more effective decisions about what we remove? That's PCA, or principal component analysis. At the moment we're just fitting one line through our two-dimensional data (there are going to be more principal components later, right?), but what we want to do is pick the direction through this data, however many attributes it has, that has the most spread. So how do we measure this? Well, quite simply, with the variance.
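PCA gets a proper treatment later, but as a small preview, here is roughly what it looks like in R using the built-in prcomp function on the numeric feature columns. Which columns count as features is, again, an assumption.

    # Principal component analysis on the scaled numeric features.
    # Each principal component is a direction through the data, ordered by
    # how much of the spread (variance) it captures.
    features <- music[sapply(music, is.numeric)]
    features <- na.omit(features)                      # prcomp cannot handle NAs
    features <- features[, sapply(features, sd) > 0]   # drop constant columns
    pca <- prcomp(features, center = TRUE, scale. = TRUE)

    # Proportion of the total variance explained by the first few components.
    summary(pca)$importance["Proportion of Variance", 1:5]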