The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. CHARLES E. LEISERSON: Hi, it's my great pleasure to introduce, again, TB Schardl. TB is not only a fabulous, world-class performance engineer, he is a world-class performance meta-engineer. In other words, he builds the tools and such to make it so that people can engineer fast code. And he's the author of the technology that we're using in our compiler, the Tapir technology that's in the compiler for parallelism. So he implemented all of that, and all the optimizations, and so forth, which has greatly improved the quality of the programming environment. So today, he's going to talk about something near and dear to his heart, which is compilers, and what they can and cannot do. TAO B. SCHARDL: Great, thank you very much for that introduction. Can everyone hear me in the back? Yes, great. All right, so as I understand it, last lecture you talked about multithreaded algorithms. And you spent the lecture studying those algorithms, analyzing them in a theoretical sense, essentially analyzing their asymptotic running times, their work and span complexity. This lecture is not that at all. We're not going to do that kind of math anywhere in the course of this lecture. Instead, this lecture is going to take a look at compilers, as the professor mentioned, and what compilers can and cannot do. So the last time you saw me standing up here was back in lecture five. And during that lecture we talked about LLVM IR and x86-64 assembly, and how C code gets translated into assembly code via LLVM IR. In this lecture, we're going to talk more about what happens between the LLVM IR and assembly stages. Essentially, that's what happens when the compiler is allowed to edit and optimize the code in its IR representation while it's producing the assembly. So last time, we were talking about this IR and the assembly. And this time, they called the compiler guy back, I suppose, to tell you about the boxes in the middle. Now, even though you're predominantly dealing with C code within this class, I hope that some of the lessons from today's lecture are ones you will be able to take away into any job that you pursue in the future, because there are a lot of languages today that do end up being compiled: C and C++, Rust, Swift, even Haskell, Julia, Halide, the list goes on and on. And those languages all get compiled for a variety of different what we call backends, different machine architectures, not just x86-64. And, in fact, a lot of those languages get compiled using very similar compilation technology to what you have in the Clang/LLVM compiler that you're using in this class. In fact, many of those languages today are optimized by LLVM itself. LLVM is the internal engine within the compiler that actually does all of the optimization. So that's my hope, that the lessons you'll learn here today don't just apply to 6.172. They'll, in fact, apply to software that you use and develop for many years down the road. But let's take a step back and ask ourselves, why bother studying compiler optimizations at all? Why should we take a look at what's going on within this, up to this point, black box of software? Any ideas? Any suggestions? In the back? AUDIENCE: [INAUDIBLE] TAO B.
SCHARDL: You can avoid manually trying to optimize things that the compiler will do for you, great answer. Great, great answer. Any other answers? AUDIENCE: You learn how to best write your code to take advantage of the compiler optimizations. TAO B. SCHARDL: You can learn how to write your code to take advantage of the compiler optimizations, how to suggest to the compiler what it should or should not do as you're constructing your program, great answer as well. Very good, in the front. AUDIENCE: It might help for debugging if the compiler has bugs. TAO B. SCHARDL: It can absolutely help for debugging when the compiler itself has bugs. The compiler is a big piece of software. And you may have noticed that a lot of software contains bugs. The compiler is no exception. And it helps to understand where the compiler might have made a mistake, or where the compiler simply just didn't do what you thought it should be able to do. Understanding more of what happens in the compiler can demystify some of those oddities. Good answer. Any other thoughts? AUDIENCE: It's fun. TAO B. SCHARDL: It's fun. Well, OK, so in my completely biased opinion, I would agree that it's fun to understand what the compiler does. You may have different opinions. That's OK. I won't judge. So I put together a list of reasons why, in general, we may care about what goes on inside the compiler. I highlighted that last point from this list, my bad. Compilers can have a really big impact on software. It's kind of like this. Imagine that you're working on some software project. And you have a teammate on your team who's pretty quiet but extremely smart. And what that teammate does is whenever that teammate gets access to some code, they jump in and immediately start trying to make that code work faster. And that's really cool, because that teammate does good work. And, oftentimes, you see that what the teammate produces is, indeed, much faster code than what you wrote. Now, in other industries, you might just sit back and say, this teammate does fantastic work. Maybe they don't talk very often. But that's OK. Teammate, you do you. But in this class, we're performance engineers. We want to understand what that teammate did to the software. How did that teammate get so much performance out of the code? The compiler is kind of like that teammate. And so understanding what the compiler does is valuable in that sense. As mentioned before, compilers can save you performance engineering work. If you understand that the compiler can do some optimization for you, then you don't have to do it yourself. And that means that you can continue writing simple, and readable, and maintainable code without sacrificing performance. You can also understand the differences between the source code and whatever you might see show up in either the LLVM IR or the assembly, if you have to look at the assembly language produced for your executable. And compilers can make mistakes. Sometimes, that's because of a genuine bug in the compiler. And other times, it's because the compiler just couldn't understand something about what was going on. And having some insight into how the compiler reasons about code can help you understand why those mistakes were made, or figure out ways to work around those mistakes, or let you write meaningful bug reports to the compiler developers. And, of course, understanding compilers can help you use them more effectively. Plus, I think it's fun.
So the first thing to understand about a compiler is the basic anatomy of how the compiler works. The compiler takes as input LLVM IR. And up until this point, we thought of it as just a big black box that does stuff to the IR, and out pops more LLVM IR, but it's somehow optimized. In fact, what's going on within that black box is that the compiler is executing a sequence of what we call transformation passes on the code. Each transformation pass takes a look at its input, and analyzes that code, and then tries to edit the code in an effort to optimize the code's performance. Now, a transformation pass might end up running multiple times. And those passes run in some order. That order ends up being a predetermined order that the compiler writers found to work pretty well on their tests. That's about the level of insight that went into picking the order. It seems to work well. Now, some good news, in terms of trying to understand what the compiler does, you can actually just ask the compiler, what did you do? And you've already used this functionality, as I understand, in some of your assignments. You've already asked the compiler to give you a report specifically about whether or not it could vectorize some code. But, in fact, LLVM, the compiler you have access to, can produce reports not just for vectorization, but for a lot of the different transformation passes that it tries to perform. And there's some syntax that you have to pass to the compiler, some compiler flags that you have to specify in order to get those reports. Those are described on the slide. I won't walk you through that text. You can look at the slides afterwards. At the end of the day, the string that you're passing is actually a regular expression. If you know what regular expressions are, great, then you can use that to narrow down the search for your report. If you don't, and you just want to see the whole report, just provide dot star as the string and you're good to go. That's the good news. You can get the compiler to tell you exactly what it did. The bad news is that when you ask the compiler what it did, it will give you a report. And the report looks something like this. In fact, I've highlighted most of the report for this particular piece of code, because the report ends up being very long. And as you might have noticed just from reading some of the text, there are definitely English words in this text. And there are pointers to pieces of code that you've compiled. But it is full of jargon, and hard to understand. This isn't the easiest report to make sense of. OK, so that's some good news and some bad news about these compiler reports. The good news is, you can ask the compiler. And it'll happily tell you all about the things that it did. It can tell you about which transformation passes were successfully able to transform the code. It can tell you conclusions that it drew about its analysis of the code. But the bad news is, these reports are kind of complicated. They can be long. They use a lot of internal compiler jargon, which, if you're not familiar with that jargon, makes it hard to understand. It also turns out that not all of the transformation passes in the compiler give you these nice reports. So you don't get to see the whole picture. And, in general, the reports don't really tell you the whole story about what the compiler did or did not do. And we'll see another example of that later on.
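[For reference, since the slide text isn't reproduced in this transcript: the standard Clang spellings of those report flags look roughly like the command below. The file name is just a placeholder, and the argument to each flag is the regular expression mentioned above, so dot star asks for everything, and something narrower like vectorize asks only about vectorization.]

```
clang -O2 -c foo.c -Rpass='.*' -Rpass-missed='.*' -Rpass-analysis='.*'
```

[Here -Rpass reports transformations the compiler performed, -Rpass-missed reports transformations it considered but did not perform, and -Rpass-analysis reports the analysis behind those decisions.]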
So part of the goal of today's lecture is to get some context for understanding the reports that you might see if you pass those flags to the compiler. And the structure of today's lecture is basically divided up into two parts. First, I want to give you some examples of compiler optimizations, just simple examples so you get a sense as to how a compiler mechanically reasons about the code it's given, and tries to optimize that code. We'll take a look at how the compiler optimizes a single scalar value, how it can optimize a structure, how it can optimize function calls, and how it can optimize loops, just simple examples to give some flavor. And then in the second half of the lecture, I have a few case studies for you which get into diagnosing ways in which the compiler failed, not due to bugs, per se, but simply didn't do an optimization you might have expected it to do. But, to be frank, I think all those case studies are really cool. But it's not totally crucial that we get through every single case study during today's lecture. The slides will be available afterwards. So when we get to that part, we'll just see how many case studies we can cover. Sound good? Any questions so far? All right, let's get to it. Let's start with a quick overview of compiler optimizations. So here is a summary of the various-- oh, I forgot that I just copied this slide from a previous lecture given in this class. You might recognize this slide, I think, from lecture two. Sorry about that. That's OK. We can fix this. We'll just go ahead and add this slide right now. We need to change the title. So let's cross that out and put in our new title. OK, so, great, and now we should double check these lists and make sure that they're accurate. Data structures, we'll come back to data structures. Loops, hoisting, yeah, the compiler can do hoisting. Sentinels, not really, the compiler is not good at sentinels. Loop unrolling, yeah, it absolutely does loop unrolling. Loop fusion, yeah, it can, but there are some restrictions that apply. Your mileage might vary. Eliminating wasted iterations, some restrictions might apply. OK, logic, constant folding and propagation, yeah, it's good on that. Common subexpression elimination, yeah, it can find common subexpressions, you're fine there. It knows algebra, yeah, good. Short circuiting, yes, absolutely. Ordering tests, depends on the tests-- I'll give it to the compiler. But I'll say, restrictions apply. Creating a fast path, compilers aren't that smart about fast paths. They come up with really boring fast paths. I'm going to take that off the list. Combining tests, again, it kind of depends on the tests. Functions, compilers are pretty good at functions. So inlining, it can do that. Tail recursion elimination, yes, absolutely. Coarsening, not so much. OK, great. Let's come back to data structures, which we skipped before. Packing, augmentation-- OK, honestly, the compiler does a lot with data structures but really none of those things. The compiler isn't smart about data structures in that particular way. Really, the way that the compiler is smart about data structures is shown here, if we expand this list to include even more compiler optimizations. Bottom line with data structures, the compiler knows a lot about architecture. And it really has put a lot of effort into figuring out how to use registers really effectively. Reading and writing a register is super fast. Touching memory is not so fast.
And so the compiler works really hard to allocate registers, put anything that lives in memory ordinarily into registers, manipulate aggregate types to use registers, as we'll see in a couple of slides, and align data that has to live in memory. Compilers are good at that. Compilers are also good at loops. We already saw some example optimization on the previous slide. It can vectorize. It does a lot of other cool stuff. Unswitching is a cool optimization that I won't cover here. Idiom replacement, it finds common patterns, and does something smart with those. Fission, skewing, tiling, interchange, those all try to process the iterations of the loop in some clever way to make stuff go fast. And some restrictions apply. Those are really in development in LLVM. Logic, it does a lot more with logic than what we saw before. It can eliminate instructions that aren't necessary. It can do strength reduction, and other cool optimizations. I think we saw that one in the Bentley slides. It gets rid of dead code. It can do more idiom replacement. Branch reordering is kind of like reordering tests. Global value numbering, another cool optimization that we won't talk about today. Functions, it can do more on switching. It can eliminate arguments that aren't necessary. So the compiler can do a lot of stuff for you. And at the end of the day, writing down this whole list is kind of a futile activity because it changes over time. Compilers are a moving target. Compiler developers, they're software engineers like you and me. And they're clever. And they're trying to apply all their clever software engineering practice to this compiler code base to make it do more stuff. And so they are constantly adding new optimizations to the compiler, new clever analyses, all the time. So, really, what we're going to look at today is just a couple examples to get a flavor for what the compiler does internally. Now, if you want to follow along with how the compiler works, the good news is, by and large, you can take a look at the LLVM IR to see what happens as the compiler processes your code. You don't need to look at the assembly. That's generally true. But there are some exceptions. So, for example, if we have these three snippets of C code on the left, and we look at what your LLVM compiler generates, in terms of the IR, we can see that there are some optimizations reflected, but not too many interesting ones. The multiply by 8 turns into a shift left operation by 3, because 8 is a power of 2. That's straightforward. Good, we can see that in the IR. The multiply by 15 still looks like a multiply by 15. No changes there. The divide by 71 looks like a divide by 71. Again, no changes there. Now, with arithmetic ops, the difference between what you see in the LLVM IR and what you see in the assembly, this is where it's most pronounced, at least in my experience, because if we take a look at these same snippets of C code, and we look at the corresponding x86 assembly for it, we get the stuff on the right. And this looks different. Let's pick through what this assembly code does one line at a time. So the first one in the C code takes the argument n, and multiplies it by 8. And then in the assembly, we have this LEA instruction. Anyone remember what the LEA instruction does? I see one person shaking their head. That's a perfectly reasonable response. Yeah, go for it? Load effective address, what does that mean? Load the address, but don't actually access memory. Another way to phrase that, do this address calculation.
And give me the result of the address calculation. Don't read or write memory at that address. Just do the calculation. That's what loading an effective address means, essentially. But you're exactly right. The LEA instruction does an address calculation, and stores the result in the register on the right. Anyone remember enough about x86 address calculations to tell me how that LEA in particular works, the first LEA on the slide? Yeah? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: Whatever's before the first comma, in this case nothing, gets added to the product of the second two arguments in those parens. You're exactly right. So this LEA takes the value 8, multiplies it by whatever is in register RDI, which holds the value n. And it stores the result into RAX. So, perfect, it does a multiply by 8. The address calculator is only capable of a small range of operations. It can do additions. And it can multiply by 1, 2, 4, or 8. That's it. So it's a really simple circuit in the hardware. But it's fast. It's optimized heavily by modern processors. And so if the compiler can use it, it tends to try to use these LEA instructions. So good job. How about the next one? Multiply by 15 turns into these two LEA instructions. Can anyone tell me how these work? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: You're basically multiplying by 5 and multiplying by 3, exactly right. We can step through this as well. If we look at the first LEA instruction, we take RDI, which stores the value n. We multiply that by 4. We add it to the original value of RDI. And so that computes 4 times n, plus n, which is five times n. And that result gets stored into RAX. Good, we've effectively multiplied by 5. The next instruction takes whatever is in RAX, which is now 5n, multiplies that by 2, adds it to whatever is currently in RAX, which is once again 5n. So that computes 2 times 5n, plus 5n, which is 3 times 5n, which is 15n. So just like that, we've done our multiply with two LEA instructions. How about the last one? In this last piece of code, we take the argument in RDI. We move it into EAX. We then move the value 3,871,519,817, and put that into ECX, as you do. We multiply those two values together. And then we shift the product right by 38. So, obviously, this divides by 71. Any guesses as to how this performs the division operation we want? Both of you answered. I might still call on you. Let me give a little more time for someone else to raise their hand. Go for it. AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: It has a lot to do with 2 to the 38, very good. Yeah, all right, any further guesses before I give the answer away? Yeah, in the back? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: Kind of. So this is what's technically called a magic number. And, yes, it's technically called a magic number. And this magic number is equal to 2 to the 38, divided by 71, plus 1 to deal with some rounding effects. What this code does is it says, let's compute n divided by 71 by first computing n divided by 71, times 2 to the 38, which is what the multiply by the magic number gives you, and then shifting off the lower 38 bits with that shift right operation. And by converting the operation into this, it's able to replace the division operation with a multiply. And if you remember, hopefully, from the architecture lecture, multiply operations, they're not the cheapest things in the world. But they're not too bad. Division is really expensive. If you want fast code, never divide. Also, never compute modulus, or access memory. Yeah, question? AUDIENCE: Why did you choose 38? TAO B. SCHARDL: Why did I choose 38?
I think it chose 38 because 38 works. There's actually a formula for it-- pretty much, it doesn't want to choose a value that's too large, or else it'll overflow. And it doesn't want to choose a value that's too small, or else you lose precision. So it's able to find a balancing point. If you want to know more about magic numbers, I recommend checking out this book called Hacker's Delight. For any of you who are familiar with this book, it is a book full of bit tricks. Seriously, that's the entire book. It's just a book full of bit tricks. And there's a whole section in there describing how you do division by various constants using multiplication, either signed or unsigned. It's very cool. But a magic number to convert a division into a multiply, that's the kind of thing that you might see from the assembly. That's one of these examples of arithmetic operations that are really optimized at the very last step. But for the rest of the optimizations, fortunately we can focus on the IR. Any questions about that so far? Cool. OK, so for the next part of the lecture, I want to show you a couple example optimizations in terms of the LLVM IR. And to show you these optimizations, we'll have a little bit of code that we'll work through, a running example, if you will. And this running example will be some code that I stole from, I think it was, a serial program that simulates the behavior of n massive bodies in 2D space under the law of gravitation. So we've got a whole bunch of point masses. Those point masses have varying masses. And we just want to simulate what happens due to gravity as these masses interact in the plane. At a high level, the n-body code is pretty simple. We have a top-level simulate routine, which just loops over all the time steps during which we want to perform this simulation. And at each time step, it calculates the various forces acting on those different bodies. And then it updates the position of each body, based on those forces. In order to do that calculation, it has some internal data structures, one to represent each body, which contains a couple of vector types. And we define our own vector type to store two double-precision floating point values. Now, we don't need to see the entire code in order to look at some compiler optimizations. The one routine that we will take a look at is this one to update the positions. This is a simple loop that takes each body, one at a time, computes the new velocity on that body, based on the forces acting on that body, and uses vector operations to do that. Then it updates the position of that body, again using these vector operations that we've defined. And then it stores the results into the data structure for that body. So all of these methods in this code make use of these basic routines on 2D vectors, points in x, y, or points in 2D space. And these routines are pretty simple. There is one to add two vectors. There's another to scale a vector by a scalar value. And there's a third to compute the length, which we won't actually look at too much. Everyone good so far? OK, so let's try to start simple. Let's take a look at just one of these one-line vector operations, vec scale. All vec scale does is it takes one of these vector inputs and a scalar value a. And it multiplies x by a, and y by a, and stores the results into a vector type, and returns it. Great, couldn't be simpler. If we compile this with no optimizations whatsoever, and we take a look at the LLVM IR, we get that, which is a little more complicated than you might imagine.
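[For readers following along without the slides, here is a minimal C sketch of the 2D vector type and the two routines described above. The names and layout are assumptions based on the transcript, not the lecture's actual code.]

```c
// A sketch of the 2D vector type used by the n-body running example.
typedef struct vec_t {
    double x, y;
} vec_t;

// Add two vectors componentwise.
vec_t vec_add(vec_t a, vec_t b) {
    vec_t sum = { a.x + b.x, a.y + b.y };
    return sum;
}

// Scale a vector by a scalar: the one-line routine examined below.
vec_t vec_scale(vec_t v, double a) {
    vec_t scaled = { v.x * a, v.y * a };
    return scaled;
}
```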
The good news, though, is that if you turn on optimizations, and you just turn on the first level of optimization, just -O1, whereas we got this code before, now we get this, which is far, far simpler, and so simple I can blow up the font size so you can actually read the code on the slide. So, to see it again: no optimizations, optimizations. So a lot of stuff happened to optimize this simple function. We're going to see what those optimizations actually were. But, first, let's pick apart what's going on in this function. We have our vec scale routine in LLVM IR. It takes a structure as its first argument. And that's represented using two doubles. It takes a scalar as the second argument. And what the operation does is it multiplies those two fields by the third argument, the double a. It then packs those values into a struct that it'll return. And, finally, it returns that struct. So that's what the optimized code does. Let's see actually how we get to this optimized code. And we'll do this one step at a time. Let's start by optimizing the operations on a single scalar value. That's why I picked this example. So we go back to the -O0 code. And we just pick out the operations that dealt with that scalar value. We narrow our scope down to just these lines. So the argument double a is the final argument in the list. And what we see is that within the vec scale routine, compiled at -O0, we allocate some local storage. We store that double a into the local storage. And then later on, we'll load the value out of the local storage before the multiply. And then we load it again before the other multiply. OK, any ideas how we could make this code faster? Don't store it in memory, what a great idea. How do we get around not storing it in memory? Save it in a register. In particular, what property of LLVM IR makes that really easy? There are infinite registers. And, in fact, the argument is already in a register. It's already in the register percent two, if I recall. So we don't need to move it into a register. It's already there. So how do we go about optimizing that code in this case? Well, let's find the places where we're using the value. And we're using the value loaded from memory. And what we're going to do is just replace those loads from memory with the original argument. We know exactly what operation we're trying to do. We know we're trying to do a multiply by the original parameter. So we just find those two uses. We cross them out. And we put in the input parameter in its place. That make sense? Questions so far? Cool. So now, those multiplies aren't using the values returned by the loads. How can we optimize this code further? Delete the loads. What else can we delete? So there's no address calculation here just because the code is so simple, but good insight. The allocation and the store, great. So those loads are dead code. The store is dead code. The allocation is dead code. We eliminate all that dead code. We got rid of those loads. We just used the value living in the register. And we've already eliminated a bunch of instructions. So the net effect of that was to turn the code compiled at -O0 that we had in the background into the code we have in the foreground, which is slightly shorter, but not that much. So it's a little bit faster, but not that much faster. How do we optimize this function further? Do it for every variable we have. In particular, the only other variable we have is the structure that we're passing in. So we want to do this kind of optimization on the structure. Make sense?
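[To recap the scalar step just described in C terms, before the same idea is applied to the structure: the sketch below shows a rough C analogy of the idea, not the actual IR. The function names are made up for illustration.]

```c
// A rough C analogy of what the -O0 IR does with the scalar argument a.
double scale_x_unoptimized(double x, double a) {
    double a_slot;              // the alloca: a slot in local storage
    a_slot = a;                 // store the argument into that slot
    double a_reloaded = a_slot; // load it back right before the multiply
    return x * a_reloaded;
}

// After the load's uses are replaced with the argument itself, the load,
// the store, and the allocation are all dead code and can be deleted:
double scale_x_optimized(double x, double a) {
    return x * a;
}
```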
So let's see how we optimize this structure. Now, the problem is that structures are harder to handle than individual scalar values, because, in general, you can't store the whole structure in just a single register. It's more complicated to juggle all the data within a structure. But, nevertheless, let's take a look at the code that operates on the structure, or at least operates on the structure that we pass in to the function. So when we eliminate all the other code, we see that we've got an allocation. Let's see if I have animations here, yeah, I do. We have an allocation, so we can store the structure onto the stack. Then we have an address calculation that lets us store the first part of the structure onto the stack. We have a second address calculation to store the second field on the stack. And later on, when we need those values, we load the first field out of memory. And we load the second field out of memory. It's a very similar pattern to what we had before, except we've got more going on in this case. So how do we go about optimizing this structure? Any ideas, high level ideas? Ultimately, we want to get rid of all of the memory references and all that storage for the structure. How do we reason through eliminating all that stuff in a mechanical fashion, based on what we've seen so far? Go for it. AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: They are passed in using separate parameters, separate registers if you will, as a quirk of how LLVM does it. So given that insight, how would you optimize it? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: Cross out percent 12, percent 6, and put in the relevant field. Cool. Let me phrase that a little bit differently. Let's do this one field at a time. We've got a structure, which has multiple fields. Let's just take it one step at a time. All right, so let's look at the first field. And let's look at the operations that deal with the first field. We have, in our code, in our LLVM IR, some address calculations that refer to the same field of the structure. In this case, I believe it's the first field, yes. And, ultimately, we end up loading from this location in local memory. So what value is this load going to retrieve? How do we know that both address calculations refer to the same field? Good question. What we do in this case is very careful analysis of the math that's going on. We know that the alloca, the location in local memory, is just a fixed location. And from that, we can interpret what each of the instructions does in terms of an address calculation. And we can determine that they're the same value. So we have this location in memory that we operate on. And before we do a multiply, we end up loading from that location in memory. So what value do we know is going to be loaded by that load instruction? Go for it. AUDIENCE: So what we're doing right now is taking some value, and then storing it, and then getting it back out, and putting it back. TAO B. SCHARDL: Not putting it back, but don't worry about putting it back. AUDIENCE: So we don't need to put it somewhere just to take it back out? TAO B. SCHARDL: Correct. Correct. So what are we multiplying in that multiply, which value? First element of the struct. It's percent zero. It's the value that we stored right there. That makes sense? Everyone see that? Any questions about that? All right, so we're storing the first element of the struct into this location. Later, we load it out of that same location. Nothing else happened to that location.
So let's go ahead and optimize it just the same way we optimized the scalar. We see that we use the result of the load right there. But we know that load is going to return the first field of our struct input. So we'll just cross it out, and replace it with that field. So now we're not using the result of that load. What do we get to do as the compiler? I can tell you know the answer. Delete the dead code, delete all of it. Remove the now dead code, which is all those address calculations, as well as the load operation, and the store operation. And that's pretty much it. Yeah, good. So we replace that operation. And we got rid of a bunch of other code from our function. We've now optimized one of the two fields in our struct. What do we do next? Optimize the next one. That happens similarly. I won't walk you through that a second time. We find where we're using the result of that load. We can cross it out, and replace it with the appropriate input, and then delete all the relevant dead code. And now, we get to delete the original allocation because nothing's getting stored to that memory. That make sense? Any questions about that? Yeah? AUDIENCE: So when we first compile it to LLVM IR, does it unpack the struct and just put in separate parameters? TAO B. SCHARDL: When we first compile to LLVM IR, do we unpack the struct and pass in the separate parameters? AUDIENCE: Like, how we have three parameters here that are doubles. Wasn't our original C code just a vector struct and a double? TAO B. SCHARDL: So LLVM IR in this case, when we compiled it at -O0, decided to pass it as separate parameters, just as its representation. So in that sense, yes. But it was still doing the standard thing: create some local storage, store the parameters into local storage, and then all operations just read out of local storage. It's the standard thing that the compiler generates when it's asked to compile C code. And with no other optimizations, that's what you get. That makes sense? Yeah? AUDIENCE: What are all the align eights? TAO B. SCHARDL: What are all the align eights doing? The align eights are attributes that specify the alignment of that location in memory. This is alignment information that the compiler either determines by analysis, or implements as part of a standard. So they're specifying how values are aligned in memory. That matters a lot more for ultimate code generation, unless we're able to just delete the memory references altogether. Make sense? Cool. Any other questions? All right, so we optimized the first field. We optimize the second field in a similar way. Turns out, there are additional optimizations that need to happen in order to return a structure from this function. Those operations can be optimized in a similar way. They're shown here. We're not going to go through exactly how that works. But at the end of the day, after we've optimized all of that code, we end up with this. We end up with our function compiled at -O1. And it's far simpler. I think it's far more intuitive. This is what I would imagine the code should look like when I wrote the C code in the first place. Take your input. Do a couple of multiplications. And then it does the operations to create the return value, and ultimately returns that value. So, in summary, the compiler works hard to transform data structures and scalar values to store as much as it possibly can purely within registers, and avoid using any local storage, if possible. Everyone good with that so far? Cool. Let's move on to another optimization.
Let's talk about function calls. Let's take a look at how the compiler can optimize function calls. By and large, these optimizations will occur if you pass optimization level 2 or higher, just FYI. So from our original C code, we had some lines that performed a bunch of vector operations. We had a vec add that added two vectors together, one of which was the result of a vec scale, which scaled the result of a vec add by some scalar value. So we had this chain of calls in our code. And if we take a look at the code compiled at -O0, what we end up with is this snippet shown on the bottom, which performs some operations on these vector structures, does this multiply operation, and then calls this vec scale routine, the vec scale routine that we decided to focus on first. So any ideas for how we go about optimizing this? So to give you a little bit of a hint, what the compiler sees when it looks at that call is it sees a snippet containing the call instruction. And in our example, it also sees the code for the vec scale function that we were just looking at. And we're going to suppose that it's already optimized vec scale as best as it can. It's produced this code for the vec scale routine. And so it sees that call instruction. And it sees this code for the function that's being called. So what could the compiler do at this point to try to make the code above even faster? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: You're exactly right. Remove the call, and just put the body of the vec scale code right there in place of the call. It takes a little bit of effort to pull that off. But, roughly speaking, yeah, we're just going to copy and paste this code in our function into the place where we're calling the function. And so if we do that simple copy paste, we end up with some garbage code as an intermediate. We had to do a little bit of renaming to make everything work out. But at this point, we have the code from our function in the place of that call. And now, we can observe that to restore correctness, we don't want to do the call. And we don't want to do the return that we just pasted in place. So we'll just go ahead and remove both that call and the return. That is called function inlining. We identify some function call, or the compiler identifies some function call. And it takes the body of the function, and just pastes it right in place of that call. Sound good? Make sense? Anyone confused? Raise your hand if you're confused. Now, once you've done some amount of function inlining, we can actually do some more optimizations. So here, we have the code after we got rid of the unnecessary call and return. And we have a couple multiply operations sitting in place. That looks fine. But if we expand our scope just a little bit, what we see is that we have some operations happening that were sitting there already after the original call. What the compiler can do is it can take a look at these instructions. And long story short, it realizes that all these instructions do is pack some data into a structure, and then immediately unpack the structure. So it's like you put a bunch of stuff into a bag, and then immediately dump out the bag. That was kind of a waste of time. That's kind of a waste of code. Let's get rid of it. Those operations are useless. Let's delete them. The compiler has a great time deleting dead code. It's like it's what it lives to do. All right, now, in fact, in the original code, we didn't just have one function call. We had a whole sequence of function calls.
And if we expand our LLVM IR snippet even a little further, we can include those two function calls, the original call to vec add, followed by the code that we've now optimized by inlining, ultimately followed by yet another call to vec add. Minor spoiler, the vec add routine, once it's optimized, looks pretty similar to the vec scale routine. And, in particular, it has comparable size to the vec scale routine. So what's the compiler going to do to those two call sites? Inline it, do more inlining, inlining is great. We'll inline these functions as well, and then remove all of the additional, now-useless instructions. We'll walk through that process. The result of that process looks something like this. So in the original C code, we had this vec add, which had a call to vec scale as one of its arguments, which had a call to vec add as one of its arguments. And what we end up with in the optimized IR is just a bunch of straight-line code that performs floating point operations. It's almost as if the compiler took the original C code, and transformed it into the equivalent C code shown on the bottom, where it just operates on a whole bunch of doubles, and just does primitive operations. So function inlining, as well as the additional transformations it was able to perform as a result, together those were able to eliminate all of those function calls. It was able to completely eliminate any costs associated with the function call abstraction, at least in this code. Make sense? I think that's pretty cool. You write code that has a bunch of function calls, because that's how you've constructed your interfaces. But you're not really paying for those function calls. Function calls aren't the cheapest operation in the world, especially if you think about everything that goes on in terms of the registers and the stack. But the compiler is able to avoid all of that overhead, and just perform the floating point operations we care about. OK, well, if function inlining is so great, and it enables so many great optimizations, why doesn't the compiler just inline every function call? Go for it. Recursion, it's really hard to inline a recursive call. In general, you can't inline a function into itself, although it turns out there are some exceptions. So, yes, recursion creates problems with function inlining. Any other thoughts? In the back. AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: You're definitely on to something. So we had to do a bunch of this renaming stuff when we inlined the first time, and we'd have to do it every single time we inline. And even though LLVM IR has an infinite number of registers, the machine doesn't. And so all of that renaming does create a problem. But there are other problems as well of a similar nature when you start inlining all those functions. For example, you copy pasted a bunch of code. And that made the original call site even bigger, and bigger, and bigger, and bigger. And programs, we generally don't think about the space they take in memory. But they do take space in memory. And that does have an impact on performance. So great answer, any other thoughts? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: If your function becomes too long, then it may not fit in instruction cache. And that can increase the amount of time it takes just to execute the function. Right, because you're now not getting cache hits, exactly right. That's one of the problems with this code-size blowup from inlining everything. Any other thoughts? Any final thoughts?
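[To make the earlier inlining result concrete at the C level, here is a rough sketch of the before-and-after effect. The variable and function names are assumed, reusing the vec_t sketch from earlier, and this is an illustration of the idea rather than the lecture's actual code.]

```c
// Before inlining: the chain of small calls, roughly as written in the source.
vec_t chained(vec_t u, vec_t v, vec_t w, double a) {
    return vec_add(u, vec_scale(vec_add(v, w), a));
}

// After inlining and the follow-on simplifications: the same computation as
// straight-line arithmetic on doubles, with no calls and no packing and
// unpacking of intermediate structs.
vec_t chained_inlined(vec_t u, vec_t v, vec_t w, double a) {
    vec_t result = { u.x + (v.x + w.x) * a,
                     u.y + (v.y + w.y) * a };
    return result;
}
```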
So there are three main reasons why the compiler won't inline every function. I think we touched on two of them here. For some function calls, like recursive calls, it's impossible to inline them, because you can't inline a function into itself. But there are exceptions to that, namely recursive tail calls. If the last thing in a function is a function call, then it turns out you can effectively inline that function call as an optimization. We're not going to talk too much about how that works. But there are corner cases. But, in general, you can't inline a recursive call. The compiler has another problem. Namely, if the function that you're calling is in a different compilation unit, literally in a different file that's compiled independently, then the compiler can't very well inline that function, because it doesn't know about the function. It doesn't have access to that function's code. There is a way to get around that problem with modern compiler technology that involves whole program optimization. And I think there's some backup slides that will tell you how to do that with LLVM. But, in general, if it's in a different compilation unit, it can't be inlined. And, finally, as touched on, function inlining can increase code size, which can hurt performance. OK, so some functions are OK to inline. Other functions could create this performance problem, because you've increased code size. So how does the compiler know whether or not inlining any particular function at a call site could hurt performance? Any guesses? Yeah? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: Yeah. So the compiler has some cost model, which gives it some information about, how much will it cost to inline that function? Is the cost model always right? It is not. So the answer, how does the compiler know, is, really, it doesn't know. It makes a best guess using that cost model, and other heuristics, to determine, when does it make sense to try to inline a function? And because it's making a best guess, sometimes the compiler guesses wrong. So to wrap up this part, here are just a couple of tips for controlling function inlining in your own programs. If there's a function that you know must always be inlined, no matter what happens, you can mark that function with a special attribute, namely the always inline attribute. For example, if you have a function that does some complex address calculation, and it should be inlined rather than called, you may want to mark that with an always inline attribute. Similarly, if you have a function that really should never be inlined, it's never cost effective to inline that function, you can mark that function with the no inline attribute. And, finally, if you want to enable more function inlining in the compiler, you can use link-time optimization, or LTO, to enable whole program optimization. Won't go into that during these slides. Let's move on, and talk about loop optimizations. Any questions so far, before we continue? Yeah? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: Sorry? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: Does static inline guarantee that the compiler will always inline it? It actually doesn't. The inline keyword will provide a hint to the compiler that it should think about inlining the function. But it doesn't provide any guarantees. If you want a strong guarantee, use the always inline attribute. Good question, though. All right, loop optimizations-- you've already seen some loop optimizations. You've seen vectorization, for example.
It turns out, the compiler does a lot of work to try to optimize loops. So first, why is that? Why would the compiler engineers invest so much effort into optimizing loops? Why loops in particular? They're an extremely common control structure, and they also have a branch. Both things are true. I think there's a higher level reason, though, or a more fundamental reason, if you will. Yeah? AUDIENCE: Most of the time, the loop takes up the most time. TAO B. SCHARDL: Most of the time the loop takes up the most time. You got it. Loops account for a lot of the execution time of programs. The way I like to think about this is with a really simple thought experiment. Let's imagine that you've got a machine with a two gigahertz processor. We've chosen these values to be easier to think about using mental math. Suppose you've got a two gigahertz processor with 16 cores. Each core executes one instruction per cycle. And suppose you've got a program which contains a trillion instructions and ample parallelism for those 16 cores. But all of those instructions are simple, straight-line code. There are no branches. There are no loops. There are no complicated operations like I/O. It's just a bunch of really simple straight-line code. Each instruction takes a cycle to execute. The processor executes one instruction per cycle. How long does it take to run this code, to execute the entire terabyte binary? 2 to the 40th cycles for 2 to the 40th instructions. But you're using a two gigahertz processor and 16 cores. And you've got ample parallelism in the program to keep them all saturated. So how much time? AUDIENCE: 32 seconds. TAO B. SCHARDL: 32 seconds, nice job. This one has mastered power-of-2 arithmetic in their head. It's a good skill to have, especially in Course 6. Yeah, so if you have just a bunch of simple, straight-line code, and you have a terabyte of it, that's a lot of code. That is a big binary. And, yet, this processor, this relatively simple processor, can execute the whole thing in just about 30 seconds. Now, in your experience working with software, you might have noticed that there are some programs that take longer than 30 seconds to run. And some of those programs don't have terabyte-size binaries. The reason that those programs take longer to run, by and large, is loops. So loops account for a lot of the execution time in real programs. Now, you've already seen some loop optimizations. We're just going to take a look at one other loop optimization today, namely code hoisting, also known as loop-invariant code motion. To look at that, we're going to take a look at a different snippet of code from the n-body simulation. This code calculates the forces acting on each of the n bodies. And it does it with a doubly nested loop. For each body from zero to the number of bodies, and for each other body from zero to the number of bodies, as long as you're not looking at the same body, call this add force routine, which calculates the force between those two bodies. And add that force to one of the bodies. That's all that's going on in this code. If we translate this code into LLVM IR, we end up with, hopefully unsurprisingly, a doubly nested loop. It looks something like this. The body of the code, the body of the innermost loop, has been elided, just so things can fit on the slide. But we can see the overall structure. On the outside, we have some outer loop control. This should look familiar from lecture five, hopefully. Inside of that outer loop, we have an inner loop.
And at the top and the bottom of that inner loop, we have the inner loop control. And within that inner loop, we do have one branch, which can skip a bunch of code if you're looking at the same body for i and j. But, otherwise, we have the loop body of the innermost loop. Basic structure. Now, if we just zoom in on the top part of this doubly nested loop, just the topmost three basic blocks, and take a look at more of the code that's going on here, we end up with something that looks like this. And if you remember some of the discussion from lecture five about the loop induction variables, and what that looks like in LLVM IR, what you find is that for the outer loop we have an induction variable at the very top. It's that weird phi instruction, once again. Inside that outer loop, we have the loop control for the inner loop, which has its own induction variable. Once again, we have another phi node. That's how we can spot it. And then we have the body of the innermost loop. And this is just the start of it. It's just a couple address calculations. But can anyone tell me some interesting property about just a couple of these address calculations that could lead to an optimization? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: The first two address calculations only depend on the outermost loop variable, the iteration variable for the outer loop, exactly right. So what can we do with those instructions? Bring them out of the inner loop. Why should we keep computing these addresses in the innermost loop when we could just compute them once in the outer loop? That optimization is called code hoisting, or loop-invariant code motion. Those instructions are invariant to the code in the innermost loop. So you hoist them out. And once you hoist them out, you end up with a transformed loop that looks something like this. What we have is the same outer loop control at the very top. But now, we're doing some address calculations there. And we no longer have those address calculations on the inside. And as a result, those hoisted calculations are performed just once per iteration of the outer loop, rather than once per iteration of the inner loop. And so those instructions are run far fewer times. You get to save a lot of running time. So the effect of this optimization in terms of C code, because it can be a little tedious to look at LLVM IR, is essentially like this. We took this doubly nested loop in C. We're calling add force of blah, blah, blah, calculate force, blah, blah, blah. And now, we just move the address calculation to get the ith body that we care about. We move that to the outer loop. Now, this was an example of loop-invariant code motion on just a couple address calculations. In general, the compiler will try to prove that some calculation is invariant across all the iterations of a loop. And whenever it can prove that, it will try to hoist that code out of the loop. If it can get code out of the body of a loop, that reduces the running time of the loop, saves a lot of execution time. Huge bang for the buck. Make sense? Any questions about that so far? All right, so just to summarize this part, what can the compiler do? The compiler optimizes code by performing a sequence of transformation passes. All those passes are pretty mechanical. The compiler goes through the code. It tries to find some property, like this address calculation is the same as that address calculation. And so this load will return the same value as that store, and so on, and so forth.
And based on that analysis, it tries to get rid of some dead code, and replace certain register values with other register values, replace things that live in memory with things that just live in registers. A lot of the transformations resemble Bentley-rule work optimizations that you've seen in lecture two. So as you're studying for your upcoming quiz, you can kind of get two for one by looking at those Bentley-rule optimizations. And one transformation pass in particular, function inlining, was a good example of this. One transformation can enable other transformations. And those together can compound to give you fast code. In general, compilers perform a lot more transformations than just the ones we saw today. But there are things that the compiler can't do. Here's one very simple example. In this case, we're taking another look at this calculate forces routine. Although the compiler can optimize the code by moving address calculations out of a loop, one thing that it can't do is exploit symmetry in the problem. So in this problem, what's going on is we're computing the forces on any pair of bodies using the law of gravitation. And it turns out that the force acting on one body by another is exactly the opposite of the force acting on the other body by the one. So F sub 12 is equal to minus F sub 21. The compiler will not figure that out. The compiler knows algebra. It doesn't know physics. So it won't be able to figure out that there's symmetry in this problem and avoid the wasted operations. Make sense? All right, so that was an overview of some simple compiler optimizations. We now have some examples of some case studies to see where the compiler can get tripped up. And it doesn't matter if we get through all of these or not. You'll have access to the slides afterwards. But I think these are kind of cool. So shall we take a look? Simple question-- does the compiler vectorize this loop? So just to go over what this loop does, it's a simple loop. The function takes two vectors as inputs, or two arrays as inputs, I should say-- an array called y, of length n, and an array x of length n, and some scalar value a. And all that this function does is it loops over each element of the vector, multiplies x of i by the input scalar, and adds the product into y of i. So does the loop vectorize? Yes? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: y and x could overlap. And there is no information about whether they overlap. So do they vectorize? We have a vote for no. Anyone think that it does vectorize? You made a very convincing argument. So everyone believes that this loop does not vectorize. Is that true? Anyone uncertain? Anyone unwilling to commit to yes or no right here? All right, a bunch of people are unwilling to commit to yes or no. All right, let's resolve this question. Let's first ask for the report. Let's look at the vectorization report. We compile it. We pass the flags to get the vectorization report. And the vectorization report says, yes, it does vectorize this loop, which is interesting, because we have this great argument that says, but you don't know how these addresses fit in memory. You don't know if x and y overlap with each other. How can you possibly vectorize? Kind of a mystery. Well, if we take a look at the actual compiled code when we optimize this at -O2, it turns out you can pass certain flags to the compiler, and get it to print out not just the LLVM IR, but the LLVM IR formatted as a control flow graph.
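[For reference, the two-line function being compiled in this case study looks roughly like the following C. The function and parameter names are assumptions based on the description above.]

```c
// A sketch of the loop from this case study: for each i, y[i] += a * x[i].
// Nothing in this signature tells the compiler whether the memory behind
// x and y can overlap.
void scale_and_add(double *y, double *x, double a, int n) {
    for (int i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```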
And the control flow graph for this simple two-line function is the thing on the right, which you obviously can't read, because it's a little bit small in terms of its text. And it seems to have a lot going on. So I took the liberty of redrawing that control flow graph with none of the code inside, just to get a picture of what the structure looks like for this compiled function. And, structurally speaking, it looks like this. And with a bit of practice staring at control flow graphs, which you might get if you spend way too much time working on compilers, you might look at this control flow graph, and think, this graph looks a little too complicated for the two-line function that we gave as input. So what's going on here? Well, we've got three different loops in this code. And it turns out that one of those loops is full of vector operations. OK, the other two loops are not full of vector operations. That's unvectorized code. And then there's this basic block right at the top that has a conditional branch at the end of it, branching to either the vectorized loop or the unvectorized loop. And, yeah, there's a lot of other control flow going on as well. But we can focus on just these components for the time being. So what's that conditional branch doing? Well, we can zoom in on just this one basic block, and actually show it to be readable on the slide. And the basic block looks like this. So let's just study this LLVM IR code. In this case, we have got the address of y stored in register 0. The address of x is stored in register 2. And register 3 stores the value of n. So one instruction at a time, who can tell me what the first instruction in this code does? Yes? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: Gets the address of y. Is that what you said? So it does use the address of y. It's an address calculation that operates on register 0, which stores the address of y. But it's not just computing the address of y. AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: It's getting me the address of the nth element of y. It's adding in whatever is in register 3, which is the value n, into the address of y. So that computes the address y plus n. This is testing your memory of pointer arithmetic in C just a little bit. But don't worry. It won't be too rough. So that's what the first address calculation does. What does the next instruction do? AUDIENCE: It does x plus n. TAO B. SCHARDL: That computes x plus n, very good. How about the next one? AUDIENCE: It compares whether x plus n and y plus n are the same. TAO B. SCHARDL: It compares x plus n, versus y plus n. AUDIENCE: [INAUDIBLE] compares the 33, which is x plus n, and compares it to y. So if x plus n is bigger than y, there's overlap. TAO B. SCHARDL: Right, so it does a comparison. We'll take that a little more slowly. It does a comparison of x plus n versus y and checks: is x plus n greater than y? Perfect. How about the next instruction? Yeah? AUDIENCE: It compares y plus n versus x. TAO B. SCHARDL: It compares y plus n versus x, is y plus n even greater than x. How about the last instruction before the branch? Yep, go for it? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: [INAUDIBLE] one of the results. So this computes the bitwise AND of the comparisons: is x plus n greater than y, and is y plus n greater than x. Fair enough. So what does the result of that condition mean? I think we've pretty much already spoiled the answer. Anyone want to hear it one last time? We had this whole setup. Go for it. AUDIENCE: They overlap. TAO B. SCHARDL: Checks if they overlap.
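[Written out in C terms, the check that this basic block performs is roughly the following. This is a sketch of the logic, not the literal generated code, and the function name is made up.]

```c
#include <stdbool.h>
#include <stdint.h>

// The runtime aliasing check from that basic block: the ranges [x, x+n)
// and [y, y+n) overlap exactly when both comparisons hold. (The generated
// code works on raw addresses; casting to uintptr_t mirrors that.)
bool may_overlap(const double *x, const double *y, int n) {
    uintptr_t x_begin = (uintptr_t)x, x_end = (uintptr_t)(x + n);
    uintptr_t y_begin = (uintptr_t)y, y_end = (uintptr_t)(y + n);
    return (x_end > y_begin) && (y_end > x_begin);
}
// If the check is false, the branch goes to the vectorized loop;
// otherwise it goes to the scalar fallback loop.
```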
So let's look at this condition in a couple of different situations. If we have x living in one place in memory, and y living in another place in memory, then no matter how those two regions are arranged, if we check whether both y plus n is greater than x, and x plus n is greater than y, the result will be false. But if we have this situation, where x and y overlap in some portion of memory, then it turns out that regardless of whether x or y comes first, x plus n will be greater than y, y plus n will be greater than x, and the condition will return true. In other words, the condition returns true, if and only if the portions of memory pointed to by x and y alias. So going back to our original looping code, we have a situation where we have a branch based on whether or not they alias. And in one case, it executes the vectorized loop. And in the other case, it executes the non-vectorized code. In particular, it executes the vectorized loop if they don't alias. So returning to our original question, does this code get vectorized? The answer is yes and no. So if you voted yes, you're actually right. If you voted no, and you were persuaded, you were right. And if you didn't commit to an answer, I can't help you. But that's interesting. The compiler actually generated multiple versions of this loop, due to uncertainty about memory aliasing. Yeah, question? AUDIENCE: [INAUDIBLE] TAO B. SCHARDL: So the question is, could the compiler figure out this condition statically while it's compiling the function? Because we know the function is going to get called from somewhere. The answer is, sometimes it can. A lot of times it can't. If it's not capable of inlining this function, for example, then it probably doesn't have enough information to tell whether or not these two pointers will alias. For example, if you're just building a library with a bunch of vector routines, you don't know the code that's going to call this routine eventually. Now-- this will be the last point before we wrap up-- in general, memory aliasing can cause a lot of issues when it comes to compiler optimization. It can cause the compiler to act very conservatively. In this example, we have a simple serial base case for a matrix multiply routine. But we don't know anything about the pointers to the C, A, or B matrices. And when we try to compile this and optimize it, the compiler complains that it can't do loop invariant code motion, because it doesn't know anything about these pointers. It could be that the memory one of those pointers refers to changes within the innermost loop, so it can't move some calculation out to an outer loop. Compilers try to deal with this statically using an analysis technique called alias analysis. And they do try very hard to figure out, when are these pointers going to alias? Or when are they guaranteed to not alias? Now, in general, it turns out that alias analysis isn't just hard. It's undecidable. If only it were hard, maybe we'd have some hope. But compilers, in practice, are faced with this undecidable question. And they try a variety of tricks to get useful alias analysis results in practice. For example, based on information in the source code, the compiler might annotate instructions with various metadata to track this aliasing information. For example, TBAA is aliasing information based on types. There's some scoping information for aliasing. There is some information that says an operation is guaranteed not to alias with this other operation, all kinds of metadata.
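As a sketch of the matrix-multiply situation just described, here is a minimal serial base case with hypothetical names of my own, not the lecture's exact code.

// Serial base case for multiplying n-by-n row-major matrices.
// The compiler knows nothing about C, A, and B: as far as it can prove,
// a store through C might change memory that A or B point into, so it is
// conservative about hoisting loads and address calculations out of the
// innermost loop.
void mm_base(double *C, double *A, double *B, long n) {
  for (long i = 0; i < n; i++)
    for (long j = 0; j < n; j++)
      for (long k = 0; k < n; k++)
        C[i*n + j] += A[i*n + k] * B[k*n + j];
}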
Now, what can you do as a programmer to avoid these issues of memory aliasing? Always annotate your pointers, kids. Always annotate your pointers. The restrict keyword you've seen before. It tells the compiler, address calculations based off this pointer won't alias with address calculations based off other pointers. The const keyword provides a little more information. It says, these addresses will only be read from. They won't be written to. And that can enable a lot more compiler optimizations. Now, that's all the time that we have. There are a couple of other cool case studies in the slides. You're welcome to peruse the slides afterwards. Thanks for listening.
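As a closing sketch of that annotation advice, here is the earlier daxpy-style loop again with restrict and const added, using the same hypothetical names as before.

// restrict promises the compiler that, within this function, memory
// reached through y is not also reached through x (and vice versa);
// const promises that x is only read, never written.  With those
// guarantees, the compiler can typically emit just the vectorized loop,
// with no runtime overlap check.
void scale_and_add(double *restrict y, const double *restrict x,
                   double a, long n) {
  for (long i = 0; i < n; i++)
    y[i] += a * x[i];
}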