StatQuest breaks it down into bite-sized pieces, hooray!

Hello, I'm Josh Starmer and welcome to StatQuest. In this StatQuest, we're going to go through principal component analysis (PCA) one step at a time using singular value decomposition (SVD). You'll learn about what PCA does, how it does it, and how to use it to get deeper insight into your data.

Let's start with a simple dataset. We've measured the transcription of two genes, gene 1 and gene 2, in 6 different mice. Note: if you're not into mice and genes, think of the mice as individual samples and the genes as variables that we measure for each sample. For example, the samples could be students in high school and the variables could be test scores in math and reading. Or the samples could be businesses, and the variables could be market capitalization and the number of employees. OK, now we're back to mice and genes, because I'm a geneticist and I work in a genetics department.

If we only measure one gene, we can plot the data on a number line. Mice 1, 2 and 3 have relatively high values and mice 4, 5 and 6 have relatively low values. Even though it's a simple graph, it shows us that mice 1, 2 and 3 are more similar to each other than they are to mice 4, 5 and 6.

If we measured two genes, then we can plot the data on a two-dimensional XY graph. Gene 1 is the x-axis and spans one of the two dimensions in this graph. Gene 2 is the y-axis and spans the other dimension. We can see that mice 1, 2 and 3 cluster on the right side and mice 4, 5 and 6 cluster on the lower left-hand side.

If we measured three genes, we would add another axis to the graph and make it look 3D, i.e. three-dimensional. The smaller dots have larger values for gene 3 and are further away. The larger dots have smaller values for gene 3 and are closer. If we measured four genes, however, we can no longer plot the data: four genes require four dimensions.

So we're going to talk about how PCA can take four or more gene measurements, and thus four or
more dimensions of data, and make a two-dimensional PCA plot. This plot will show us that similar mice cluster together. We'll also talk about how PCA can tell us which gene, or variable, is the most valuable for clustering the data. For example, PCA might tell us that gene 3 is responsible for separating samples along the x-axis. Lastly, we'll talk about how PCA can tell us how accurate the 2D graph is.

To understand what PCA does and how it works, let's go back to the dataset that only had two genes. We'll start by plotting the data. Then we'll calculate the average measurement for gene 1 and the average measurement for gene 2. With the average values, we can calculate the center of the data. From this point on, we'll focus on what happens in the graph; we no longer need the original data.

Now we'll shift the data so that the center is on top of the origin in the graph. Note: shifting the data did not change how the data points are positioned relative to each other. This point is still the highest one, and this is still the rightmost point, etc.

Now that the data are centered on the origin, we can try to fit a line to it. To do this, we start by drawing a random line that goes through the origin. Then we rotate the line until it fits the data as well as it can, given that it has to go through the origin. Ultimately, this line fits best. But I'm getting ahead of myself. First, we need to talk about how PCA decides if a fit is good or not.

So let's go back to the original random line that goes through the origin. To quantify how good this line fits the data, PCA projects the data onto it. And then it can either measure the distances from the data to the line and try to find the line that minimizes those distances, or it
can try to find the line that maximizes the distances from the projected points to the origin. If those options don't seem equivalent to you, we can build intuition by looking at how these distances shrink when the line fits better, while these distances get larger when the line fits better.

Now, to understand what is going on in a mathematical way, let's just consider one data point. This point is fixed, and so is its distance from the origin. In other words, the distance from the point to the origin doesn't change when the red dotted line rotates. When we project the point onto the line, we get a right angle between the black dotted line and the red dotted line. That means that if we label the sides like this, a, b and c, then we can use the Pythagorean theorem to show how b and c are inversely related. Since a, and thus a squared, doesn't change, if b gets bigger then c must get smaller. Likewise, if c gets bigger, then b must get smaller. Thus PCA can either minimize the distance to the line or maximize the distance from the projected point to the origin.

The reason I'm making such a fuss about this is that, intuitively, it makes sense to minimize b, the distance from the point to the line. But it's actually easier to calculate c, the distance from the projected point to the origin. So PCA finds the best-fitting line by maximizing the sum of the squared distances from the projected points to the origin.

So for this line, PCA projects the data onto it and then measures the distance from this point to the origin. Let's call it d1. Note: I'm going to keep track of the distances we measure up here. And then PCA measures the distance from this point to the origin. We'll call that d2. Then it measures d3, d4, d5 and d6. Here are all six distances that we measured. The next thing we do is square all of them. The distances are squared so that negative values don't cancel out positive values. Then we sum up all these squared distances, and that equals the sum of the squared distances. For short,
we'll call this SS(distances), or sum of squared distances. Now we rotate the line, project the data onto the line, and then sum up the squared distances from the projected points to the origin. And we repeat until we end up with the line with the largest sum of squared distances between the projected points and the origin.

Ultimately, we end up with this line. It has the largest sum of squared distances. This line is called principal component 1, or PC1 for short. PC1 has a slope of 0.25. In other words, for every four units that we go out along the gene 1 axis, we go up one unit along the gene 2 axis. That means that the data are mostly spread out along the gene 1 axis and only a little bit spread out along the gene 2 axis.

One way to think about PC1 is in terms of a cocktail recipe. To make PC1, mix four parts gene 1 with one part gene 2. Pour over ice and serve. The ratio of gene 1 to gene 2 tells you that gene 1 is more important when it comes to describing how the data are spread out.

Oh no, terminology alert! Mathematicians call this cocktail recipe a linear combination of genes 1 and 2. I mention this because when someone says PC1 is a linear combination of variables, this is what they're talking about. It's no big deal.

The recipe for PC1, going over 4 and up 1, gets us to this point. We can solve for the length of the red line using the Pythagorean theorem, the old a squared equals b squared plus c squared. Plugging in the numbers gives us a = 4.12. So the length of the red line is 4.12. When you do PCA with SVD, the recipe for PC1 is scaled so that this length equals 1. All we have to do to scale the triangle so that the red line is one unit long is to divide each side by 4.12. For those of you keeping score, here's the math worked out that shows that all we need to do is divide all three sides by 4.12. Here are the scaled values. The new values change our recipe, but the ratio is the same: we still use
four times as much gene 1 as gene 2.

So now we are back to looking at the data, the best-fitting line, and the unit vector that we just calculated. Oh no, another terminology alert! This one-unit-long vector, consisting of 0.97 parts gene 1 and 0.242 parts gene 2, is called the singular vector, or the eigenvector, for PC1, and the proportions of each gene are called loading scores. Also, while I'm at it, PCA calls the sum of squared distances for the best-fit line the eigenvalue for PC1, and the square root of the eigenvalue for PC1 is called the singular value for PC1. BAM! That's a lot of terminology.

Now that we've got PC1 all figured out, let's work on PC2. Because this is only a two-dimensional graph, PC2 is simply the line through the origin that is perpendicular to PC1, without any further optimization that has to be done. And this means that the recipe for PC2 is -1 parts gene 1 to 4 parts gene 2. If we scale everything so that we get a unit vector, the recipe is -0.242 parts gene 1 and 0.97 parts gene 2. This is the singular vector for PC2, or the eigenvector for PC2. These are the loading scores for PC2. They tell us that, in terms of how the values are projected onto PC2, gene 2 is four times as important as gene 1. Lastly, the eigenvalue for PC2 is the sum of squares of the distances between the projected points and the origin.

Hooray! We've worked out PC1 and PC2. To draw the final PCA plot, we simply rotate everything so that PC1 is horizontal. Then we use the projected points to find where the samples go in the PCA plot. For example, these projected points correspond to sample 6, so sample 6 goes here. Sample 2 goes here, and sample 1 goes here, etc. Double BAM! That's how PCA is done using singular value decomposition.

OK, one last thing before we dive into a slightly more complicated example. Remember the eigenvalues? We got those by projecting the data onto the principal components, measuring the distances to the origin, then squaring and adding them together. We can convert them into variation around the origin by dividing by the sample size minus 1. For the sake of this example, imagine that the variation for PC1 equals 15 and the variation for PC2 equals 3. That means that the total variation around both PCs is 15 + 3 = 18. And that means PC1 accounts for 15 / 18 = 0.83, or 83%, of the total variation around the PCs. PC2 accounts for 3 / 18 = 17% of the total variation around the PCs.

Oh no, another terminology alert! A scree plot is a graphical representation of the percentages of variation that each PC accounts for. We'll talk more about scree plots later. BAM!

OK, now let's quickly go through a slightly more complicated example. PCA with three variables (in this case, that means three genes) is pretty much the same as two variables. You center the data.
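
A quick aside before the three-gene steps: the eigenvalue arithmetic from the two-gene example can be sketched in a few lines of Python. This is only a minimal sketch and is not from the video. The eigenvalues 75 and 15 are assumed values, picked so that with 6 mice the variations come out to the 15 and 3 imagined in the example, and the function names are made up for illustration.

```python
# Sketch: convert eigenvalues (sums of squared distances) into variation,
# then into the percentages that a scree plot would show.

def variation_from_eigenvalues(eigenvalues, n_samples):
    """Divide each sum of squared distances by (sample size - 1)."""
    return [ev / (n_samples - 1) for ev in eigenvalues]


def proportion_of_variation(variations):
    """Fraction of the total variation that each PC accounts for."""
    total = sum(variations)
    return [v / total for v in variations]


# Hypothetical SS(distances) for PC1 and PC2; assumed, not from the video.
eigenvalues = [75.0, 15.0]
n_mice = 6

variations = variation_from_eigenvalues(eigenvalues, n_mice)
proportions = proportion_of_variation(variations)

for i, (v, p) in enumerate(zip(variations, proportions), start=1):
    print(f"PC{i}: variation = {v:.0f}, share = {p:.0%}")
# PC1: variation = 15, share = 83%
# PC2: variation = 3, share = 17%
```

The same two functions work unchanged for three or more PCs, since they only need the list of eigenvalues and the sample size.
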
You then find the best-fitting line that goes through the origin. Just like before, the best-fitting line is PC1. But the recipe for PC1 now has three ingredients. In this case, gene 3 is the most important ingredient for PC1. You then find PC2, the next best-fitting line, given that it goes through the origin and is perpendicular to PC1. Here's the recipe for PC2. In this case, gene 1 is the most important ingredient for PC2. Lastly, we find PC3, the best-fitting line that goes through the origin and is perpendicular to PC1 and PC2.

If we had more genes, we'd just keep on finding more and more principal components by adding perpendicular lines and rotating them. In theory, there is one per gene or variable, but in practice the number of PCs is either the number of variables or the number of samples, whichever is smaller. If this is confusing, don't sweat it. It's not super important, and I'm going to make a separate video on this topic in the next week.

Once you have all the principal components figured out, you can use the eigenvalues, i.e. the sums of squares of the distances, to determine the proportion of variation that each PC accounts for. In this case, PC1 accounts for 79% of the variation, PC2 accounts for 15% of the variation, and PC3 accounts for 6% of the variation. Here's the scree plot. PC1 and PC2 account for the vast majority of the variation. That means that a 2D graph using just PC1 and PC2 would be a good approximation of this 3D graph, since it would account for 94% of the variation in the data.

To convert the 3D graph into a two-dimensional PCA graph, we just strip away everything but the data and PC1 and PC2. Then project the samples onto PC1 and PC2. Then we rotate so that PC1 is horizontal and PC2 is vertical. This just makes it easier to look at. Since these projected points correspond to sample 4, this is where sample 4 goes on our new PCA plot, etc., etc., etc. Double BAM!

To review, we started with an awkward 3D graph that was kind of hard
to read. Then we calculated the principal components. Then, with the eigenvalues for PC1 and PC2, we determined that a 2D graph would still be very informative. Lastly, we used PC1 and PC2 to draw a two-dimensional graph with the data.

If we measured four genes per mouse, we would not be able to draw a four-dimensional graph of the data. Wah-wah. But that doesn't stop us from doing the PCA math, which doesn't care if we can draw a picture of it or not, and looking at the scree plot. In this case, PC1 and PC2 account for 90% of the variation, so we can just use those to draw a two-dimensional PCA graph. So we project the samples onto the first two PCs. These two projected points correspond to sample 2, so sample 2 goes here. BAM!

Note: if the scree plot looked like this, where PC3 and PC4 account for a substantial amount of variation, then just using the first two PCs would not create a very accurate representation of the data. Wah-wah. However, even a noisy PCA plot like this can be used to identify clusters of data. These samples are still more similar to each other than they are to the other samples. Little BAM!

Hooray! We've made it to the end of another exciting StatQuest. If you liked this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, please consider buying one or two of my original songs. The link to my Bandcamp page is in the lower right corner and in the description below. Alright, until next time, quest on!
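
To recap the whole procedure, here is a minimal sketch in Python using NumPy's `np.linalg.svd`. It is not from the video: the 6-by-4 table of measurements is made up purely for illustration, and all of the variable names are assumptions. It centers the data, runs SVD, turns the singular values into eigenvalues and scree plot percentages, and projects the samples onto PC1 and PC2.

```python
import numpy as np

# Made-up measurements: 6 mice (rows) x 4 genes (columns).
# These numbers are illustrative only; they are not from the video.
data = np.array([
    [10.0, 6.0, 12.0, 5.0],
    [11.0, 4.0,  9.0, 7.0],
    [ 8.0, 5.0, 10.0, 6.0],
    [ 3.0, 3.0,  2.5, 2.0],
    [ 2.0, 2.8,  1.3, 4.0],
    [ 1.0, 1.0,  2.0, 7.0],
])

# Step 1: shift the data so that its center sits on the origin.
centered = data - data.mean(axis=0)

# Step 2: SVD. Each row of vt is a one-unit-long singular vector
# (the loading scores for one PC); s holds the singular values,
# largest first.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Step 3: eigenvalue = singular value squared = sum of squared distances
# from the projected points to the origin. Dividing by (samples - 1)
# gives the variation, and normalizing gives the scree plot shares.
eigenvalues = s ** 2
variation = eigenvalues / (data.shape[0] - 1)
scree = variation / variation.sum()

# Step 4: project the samples onto PC1 and PC2 to get the
# coordinates for a two-dimensional PCA plot.
pca_2d = centered @ vt[:2].T

print("Share of variation per PC:", np.round(scree, 3))
print("PC1 loading scores:", np.round(vt[0], 3))
print("2-D coordinates:")
print(np.round(pca_2d, 2))
```

One thing to keep in mind: the sign of each singular vector is arbitrary, so a plot made from `pca_2d` may come out mirrored compared with another tool's output. The clusters are the same either way.
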
StatQuest: Principal Component Analysis (PCA), Step-by-Step. Posted by Guset on 2022/07/28.