
  • StatQuest breaks it down into bite-sized pieces, hooray!

  • Hello, I'm Josh Starmer and welcome to StatQuest. In this StatQuest we're going to go through principal component analysis (PCA) one step at a time, using singular value decomposition (SVD).

  • You'll learn what PCA does, how it does it, and how to use it to get deeper insight into your data.

  • Let's start with a simple dataset.

  • We've measured the transcription of two genes, Gene 1 and Gene 2, in 6 different mice.

  • Note: if you're not into mice and genes, think of the mice as individual samples and the genes as variables that we measure for each sample.

  • For example, the samples could be students in high school and the variables could be test scores in math and reading.

  • Or the samples could be businesses, and the variables could be market capitalization and the number of employees.

  • OK, now we're back to mice and genes, because I'm a geneticist and I work in a genetics department.

  • If we only measure one gene, we can plot the data on a number line.

  • Mice 1, 2 and 3 have relatively high values, and mice 4, 5 and 6 have relatively low values.

  • Even though it's a simple graph, it shows us that mice 1, 2 and 3 are more similar to each other than they are to mice 4, 5 and 6.

  • If we measured two genes, then we can plot the data on a two-dimensional X-Y graph.

  • Gene 1 is the x-axis and spans one of the two dimensions in this graph.

  • Gene 2 is the y-axis and spans the other dimension.

  • We can see that mice 1, 2 and 3 cluster on the right side, and mice 4, 5 and 6 cluster on the lower left-hand side.

  • If we measured three genes, we would add another axis to the graph and make it look 3-D, i.e. three-dimensional.

  • The smaller dots have larger values for Gene 3 and are further away.

  • The larger dots have smaller values for Gene 3 and are closer.

  • If we measured four genes, however, we can no longer plot the data; four genes require four dimensions.

  • So we're going to talk about how PCA can take four or more gene measurements, and thus four or more dimensions of data, and make a two-dimensional PCA plot.

  • This plot will show us that similar mice cluster together.

  • We'll also talk about how PCA can tell us which gene, or variable, is the most valuable for clustering the data.

  • For example, PCA might tell us that Gene 3 is responsible for separating samples along the x-axis.

  • Lastly, we'll talk about how PCA can tell us how accurate the 2-D graph is.

  • To understand what PCA does and how it works, let's go back to the dataset that only had two genes.

  • We'll start by plotting the data.

  • Then we'll calculate the average measurement for Gene 1 and the average measurement for Gene 2.

  • With the average values, we can calculate the center of the data.

  • From this point on, we'll focus on what happens in the graph; we no longer need the original data.

  • Now we'll shift the data so that the center is on top of the origin in the graph, as sketched below.

  • Note: shifting the data did not change how the data points are positioned relative to each other.
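To make the centering step concrete, here's a minimal numpy sketch. The measurement values below are made up for illustration; the video never lists the exact numbers.

```python
import numpy as np

# Hypothetical measurements: rows are the 6 mice (samples), columns are Gene 1 and Gene 2.
X = np.array([[10.0, 6.0],
              [11.0, 4.0],
              [ 8.0, 5.0],
              [ 3.0, 3.0],
              [ 2.0, 2.8],
              [ 1.0, 1.0]])

center = X.mean(axis=0)    # the average measurement for each gene
X_centered = X - center    # shift the data so the center sits on top of the origin
```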

  • This point is still the highest one, and this is still the rightmost point, etc.

  • Now that the data are centered on the origin, we can try to fit a line to them.

  • To do this, we start by drawing a random line that goes through the origin.

  • Then we rotate the line until it fits the data as well as it can, given that it has to go through the origin.

  • Ultimately, this line fits best.

  • But I'm getting ahead of myself; first we need to talk about how PCA decides whether a fit is good or not.

  • So let's go back to the original random line that goes through the origin.

  • To quantify how well this line fits the data, PCA projects the data onto it.

  • Then it can either measure the distances from the data to the line, and try to find the line that minimizes those distances...

  • ...or it can try to find the line that maximizes the distances from the projected points to the origin.

  • If those options don't seem equivalent to you, we can build intuition by looking at how these distances shrink when the line fits better, while these distances get larger when the line fits better.

  • Now, to understand what is going on in a mathematical way, let's just consider one data point.

  • This point is fixed, and so is its distance from the origin. In other words, the distance from the point to the origin doesn't change when the red dotted line rotates.

  • When we project the point onto the line, we get a right angle between the black dotted line and the red dotted line.

  • That means that if we label the sides like this, a, b and c, then we can use the Pythagorean theorem to show how b and c are inversely related.

  • Since a, and thus a squared, doesn't change, if b gets bigger then c must get smaller.

  • Likewise, if c gets bigger, then b must get smaller.

  • Thus, PCA can either minimize the distance from the data to the line, or maximize the distance from the projected point to the origin, as the relationship below shows.
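Written out, with a the fixed distance from the point to the origin, b the distance from the point to the line, and c the distance from the projected point to the origin:

```latex
a^2 = b^2 + c^2 \quad\Longrightarrow\quad \text{with } a^2 \text{ fixed, minimizing } b^2 \text{ is the same as maximizing } c^2
```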

  • The reason I'm making such a fuss about this is that, intuitively, it makes sense to minimize b, the distance from the point to the line, but it's actually easier to calculate c, the distance from the projected point to the origin.

  • So PCA finds the best fitting line by maximizing the sum of the squared distances from the projected points to the origin.

  • So, for this line, PCA projects the data onto it and then measures the distance from this point to the origin; let's call it d1.

  • Note: I'm going to keep track of the distances as we measure them up here.

  • Then PCA measures the distance from this point to the origin; we'll call that d2.

  • Then it measures d3, d4, d5 and d6.

  • Here are all six distances that we measured.

  • The next thing we do is square all of them.

  • The distances are squared so that negative values don't cancel out positive values.

  • Then we add up all of these squared distances, and that total equals the sum of the squared distances.

  • For short, we'll call this SS(distances), the sum of squared distances.

  • Now we rotate the line, project the data onto it, and then sum up the squared distances from the projected points to the origin.

  • And we repeat until we end up with the line with the largest sum of squared distances between the projected points and the origin.

  • Ultimately, we end up with this line; it has the largest sum of squared distances.

  • This line is called Principal Component 1, or PC1 for short. A brute-force sketch of the rotate-and-measure search follows.
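Here's a minimal sketch of that search, assuming the X_centered array from the earlier sketch. SVD finds this direction directly; the angle scan below is only to build intuition.

```python
import numpy as np

best_ss, best_dir = -np.inf, None
for angle in np.linspace(0.0, np.pi, 1800):        # candidate lines through the origin
    u = np.array([np.cos(angle), np.sin(angle)])   # unit vector along the candidate line
    c = X_centered @ u                             # distances from the projected points to the origin
    ss = np.sum(c ** 2)                            # SS(distances) for this line
    if ss > best_ss:                               # keep the line with the largest SS(distances)
        best_ss, best_dir = ss, u

# best_dir approximates the singular vector for PC1,
# and best_ss approximates the eigenvalue for PC1.
```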

  • PC1 has a slope of 0.25. In other words, for every 4 units that we go out along the Gene 1 axis, we go up 1 unit along the Gene 2 axis.

  • That means that the data are mostly spread out along the Gene 1 axis, and only a little bit spread out along the Gene 2 axis.

  • One way to think about PC1 is in terms of a cocktail recipe: to make PC1, mix 4 parts Gene 1 with 1 part Gene 2, pour over ice, and serve.

  • The ratio of Gene 1 to Gene 2 tells you that Gene 1 is more important when it comes to describing how the data are spread out.

  • Oh no, terminology alert!

  • Mathematicians call this cocktail recipe a linear combination of Genes 1 and 2.

  • I mention this because when someone says that PC1 is a linear combination of variables, this is what they're talking about.

  • It's no big deal.

  • The recipe for PC1, going over 4 and up 1, gets us to this point.

  • We can solve for the length of the red line using the Pythagorean theorem, the old a² = b² + c².

  • Plugging in the numbers gives us a = 4.12, so the length of the red line is 4.12.

  • When you do PCA with SVD, the recipe for PC1 is scaled so that this length equals 1.

  • All we have to do to scale the triangle so that the red line is one unit long is to divide each side by 4.12.

  • For those of you keeping score, here's the math worked out, showing that all we need to do is divide all three sides by 4.12.

  • Here are the scaled values.

  • The new values change our recipe, but the ratio is the same: we still use 4 times as much Gene 1 as Gene 2. The sketch below does the same scaling in code.
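A minimal sketch of the scaling; the recipe numbers come straight from the video.

```python
import numpy as np

recipe = np.array([4.0, 1.0])            # 4 parts Gene 1, 1 part Gene 2
length = np.sqrt(np.sum(recipe ** 2))    # Pythagorean theorem: sqrt(4^2 + 1^2) ~= 4.12
unit = recipe / length                   # ~[0.97, 0.242], the singular vector for PC1
```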

  • So now we are back to looking at the data, the best fitting line, and the unit vector that we just calculated.

  • Oh no, another terminology alert!

  • This 1-unit-long vector, consisting of 0.97 parts Gene 1 and 0.242 parts Gene 2, is called the singular vector, or the eigenvector, for PC1, and the proportions of each gene are called loading scores.

  • Also, while I'm at it, PCA calls the sum of squares of the distances for the best fit line the eigenvalue for PC1, and the square root of the eigenvalue for PC1 is called the singular value for PC1.

  • BAM! That's a lot of terminology.

  • Now that we've got PC1 all figured out, let's work on PC2.

  • Because this is only a two-dimensional graph, PC2 is simply the line through the origin that is perpendicular to PC1, without any further optimization having to be done.

  • And this means that the recipe for PC2 is -1 parts Gene 1 to 4 parts Gene 2.

  • If we scale everything so that we get a unit vector, the recipe is -0.242 parts Gene 1 and 0.97 parts Gene 2.

  • This is the singular vector, or the eigenvector, for PC2.

  • These are the loading scores for PC2: they tell us that, in terms of how the values are projected onto PC2, Gene 2 is 4 times as important as Gene 1.

  • Lastly, the eigenvalue for PC2 is the sum of squares of the distances between the projected points and the origin, as in the sketch below.
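A small sketch of the perpendicularity, again assuming the X_centered array from earlier: rotating the PC1 direction by 90 degrees gives the PC2 direction.

```python
import numpy as np

pc1 = np.array([0.97, 0.242])       # singular vector for PC1 (values from the video)
pc2 = np.array([-pc1[1], pc1[0]])   # rotate 90 degrees: [-0.242, 0.97]
assert abs(pc1 @ pc2) < 1e-12       # perpendicular: the dot product is zero

# Eigenvalue for PC2: SS(distances) of the points projected onto PC2
eigenvalue_pc2 = np.sum((X_centered @ pc2) ** 2)
```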

  • Hooray! We've worked out PC1 and PC2.

  • To draw the final PCA plot, we simply rotate everything so that PC1 is horizontal.

  • Then we use the projected points to find where the samples go in the PCA plot.

  • For example, these projected points correspond to sample 6.

  • So sample 6 goes here, sample 2 goes here, and sample 1 goes here, etc.

  • Double BAM! That's how PCA is done using singular value decomposition; the sketch below runs the whole pipeline with numpy's actual SVD.
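Putting it all together, assuming the centered matrix X_centered from the first sketch:

```python
import numpy as np

# SVD factors the centered data: X_centered = U @ np.diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

loading_scores = Vt              # row 0: the recipe (singular vector) for PC1; row 1: PC2
singular_values = S              # the singular values
eigenvalues = S ** 2             # the SS(distances), i.e. the eigenvalues, for each PC
pca_coords = X_centered @ Vt.T   # projected points, already rotated so PC1 is horizontal
```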

  • OK, one last thing before we dive into a slightly more complicated example.

  • Remember the eigenvalues?

  • We got those by projecting the data onto the principal components, measuring the distances to the origin, then squaring and adding them together.

  • We can convert them into variation around the origin by dividing by the sample size minus 1.

  • For the sake of this example, imagine that the variation for PC1 equals 15 and the variation for PC2 equals 3.

  • That means that the total variation around both PCs is 15 + 3 = 18.

  • And that means PC1 accounts for 15 / 18 = 0.83, or 83%, of the total variation around the PCs.

  • PC2 accounts for 3 / 18 = 17% of the total variation around the PCs.
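Worked out in symbols, with n the number of samples:

```latex
\mathrm{Variation}(\mathrm{PC}_i) = \frac{\mathrm{eigenvalue}_i}{n-1},
\qquad
\frac{15}{15+3} = 0.83 = 83\%,
\qquad
\frac{3}{18} \approx 0.17 = 17\%
```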

  • Oh no, another terminology alert!

  • A scree plot is a graphical representation of the percentages of variation that each PC accounts for.

  • We'll talk more about scree plots later; for now, here's a minimal way to draw one.
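A sketch with matplotlib, assuming the eigenvalues array from the SVD sketch above and the 6-mouse sample size:

```python
import matplotlib.pyplot as plt

n_samples = 6                                  # the 6 mice from the example
variation = eigenvalues / (n_samples - 1)      # variation around the origin per PC
percent = 100 * variation / variation.sum()    # percent of total variation per PC

plt.bar([f'PC{i + 1}' for i in range(len(percent))], percent)
plt.ylabel('Percent of variation')
plt.title('Scree plot')
plt.show()
```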

  • BAM!

  • OK, now let's quickly go through a slightly more complicated example.

  • PCA with three variables, which in this case means three genes, is pretty much the same as with two variables.

  • You center the data.

  • You then find the best fitting line that goes through the origin.

  • Just like before, the best fitting line is PC1.

  • But the recipe for PC1 now has three ingredients; in this case, Gene 3 is the most important ingredient for PC1.

  • You then find PC2, the next best fitting line, given that it goes through the origin and is perpendicular to PC1.

  • Here's the recipe for PC2; in this case, Gene 1 is the most important ingredient for PC2.

  • Lastly, we find PC3, the best fitting line that goes through the origin and is perpendicular to PC1 and PC2.

  • If we had more genes, we would just keep on finding more and more principal components by adding perpendicular lines and rotating them.

  • In theory, there is one PC per gene or variable, but in practice the number of PCs is either the number of variables or the number of samples, whichever is smaller.

  • If this is confusing, don't sweat it.

  • It's not super important, and I'm going to make a separate video on this topic within the next week.
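If you want to see the min(samples, variables) rule as stated above in action, here's a quick check with a hypothetical 6-mouse, 4-gene matrix of random numbers:

```python
import numpy as np

X = np.random.randn(6, 4)            # hypothetical: 6 samples (mice), 4 variables (genes)
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
print(len(S))                        # 4, i.e. min(6, 4) principal components
```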

  • Once you have all the principal components figured out, you can use the eigenvalues, i.e. the sums of squares of the distances, to determine the proportion of variation that each PC accounts for.

  • In this case, PC1 accounts for 79% of the variation, PC2 accounts for 15% of the variation, and PC3 accounts for 6% of the variation.

  • Here's the scree plot.

  • PC1 and PC2 account for the vast majority of the variation.

  • That means that a 2-D graph, using just PC1 and PC2, would be a good approximation of this 3-D graph, since it would account for 94% of the variation in the data.

  • To convert the 3-D graph into a two-dimensional PCA graph, we just strip away everything but the data and PC1 and PC2.

  • Then we project the samples onto PC1 and PC2.

  • Then we rotate so that PC1 is horizontal and PC2 is vertical; this just makes it easier to look at. The sketch below draws the final plot.
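Drawing the final plot from the pca_coords computed in the SVD sketch is just a scatter of the first two columns; the "sample i" labels are a made-up convention.

```python
import matplotlib.pyplot as plt

plt.scatter(pca_coords[:, 0], pca_coords[:, 1])    # PC1 horizontal, PC2 vertical
for i, (x, y) in enumerate(pca_coords[:, :2], start=1):
    plt.annotate(f'sample {i}', (x, y))            # label each mouse
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```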

  • Since these projected points correspond to sample 4, this is where sample 4 goes on our new PCA plot.

  • Etc., etc., etc.

  • Double BAM!

  • To review: we started with an awkward 3-D graph that was kind of hard to read.

  • Then we calculated the principal components.

  • Then, with the eigenvalues for PC1 and PC2, we determined that a 2-D graph would still be very informative.

  • Lastly, we used PC1 and PC2 to draw a two-dimensional graph with the data.

  • If we measured 4 genes per mouse, we would not be able to draw a four-dimensional graph of the data. Wah-wah!

  • But that doesn't stop us from doing the PCA math, which doesn't care whether we can draw a picture of it or not, and looking at the scree plot.

  • In this case, PC1 and PC2 account for 90% of the variation, so we can just use those to draw a two-dimensional PCA graph.

  • So we project the samples onto the first two PCs.

  • These two projected points correspond to sample 2, so sample 2 goes here.

  • BAM!

  • Note: if the scree plot looked like this, where PC3 and PC4 account for a substantial amount of variation, then just using the first two PCs would not create a very accurate representation of the data.

  • Wah-wah!

  • However, even a noisy PCA plot like this can be used to identify clusters of data: these samples are still more similar to each other than they are to the other samples.

  • Little BAM!

  • Hooray! We've made it to the end of another exciting StatQuest. If you liked this StatQuest and want to see more, please subscribe.

  • And if you want to support StatQuest, please consider buying one or two of my original songs.

  • The link to my Bandcamp page is in the lower right corner and in the description below.

  • Alright, until next time: quest on!
