
  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high-quality educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • CHARLES E. LEISERSON: Hi, it's my great pleasure

  • to introduce, again, TB Schardl.

  • TB is not only a fabulous, world-class performance

  • engineer, he is a world-class performance meta engineer.

  • In other words, building the tools and such to make it

  • so that people can engineer fast code.

  • And he's the author of the technology

  • that we're using in our compiler, the Tapir

  • technology that's in the open-source compiler for parallelism.

  • So he implemented all of that, and all the optimizations,

  • and so forth, which has greatly improved the quality

  • of the programming environment.

  • So today, he's going to talk about something near and dear

  • to his heart, which is compilers,

  • and what they can and cannot do.

  • TAO B. SCHARDL: Great, thank you very

  • much for that introduction.

  • Can everyone hear me in the back?

  • Yes, great.

  • All right, so as I understand it,

  • last lecture you talked about multi-threaded algorithms.

  • And you spent the lecture studying those algorithms,

  • analyzing them in a theoretical sense,

  • essentially analyzing their asymptotic running times, work

  • and span complexity.

  • This lecture is not that at all.

  • We're not going to do that kind of math

  • anywhere in the course of this lecture.

  • Instead, this lecture is going to take a look at compilers,

  • as Professor Leiserson mentioned, and what compilers can and cannot do.

  • So the last time, you saw me standing up here

  • was back in lecture five.

  • And during that lecture we talked

  • about LLVM IR and x86-64 assembly,

  • and how C code got translated into assembly code via LLVM IR.

  • In this lecture, we're going to talk

  • more about what happens between the LLVM IR and assembly

  • stages.

  • And, essentially, that's what happens when the compiler is

  • allowed to edit and optimize the code in its IR representation,

  • while it's producing the assembly.

  • So last time, we were talking about this IR,

  • and the assembly.

  • And this time, they called the compiler guy back,

  • I suppose, to tell you about the boxes in the middle.

  • Now, even though you're predominately dealing with C

  • code within this class, I hope that some of the lessons from

  • today's lecture you will be able to take away into any job that

  • you pursue in the future, because there are a lot

  • of languages today that do end up being compiled, C and C++,

  • Rust, Swift, even Haskell, Julia, Halide,

  • the list goes on and on.

  • And those languages all get compiled

  • for a variety of different what we

  • call backends, different machine architectures, not just x86-64.

  • And, in fact, a lot of those languages

  • get compiled using very similar compilation technology

  • to what you have in the Clang LLVM compiler

  • that you're using in this class.

  • In fact, many of those languages today

  • are optimized by LLVM itself.

  • LLVM is the internal engine within the compiler

  • that actually does all of the optimization.

  • So that's my hope, that the lessons you'll learn here today

  • don't just apply to 6.172.

  • They'll, in fact, apply to software

  • that you use and develop for many years on the road.

  • But let's take a step back, and ask ourselves,

  • why bother studying the compiler optimizations at all?

  • Why should we take a look at what's

  • going on within this, up to this point, black box of software?

  • Any ideas?

  • Any suggestions?

  • In the back?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: You can avoid manually

  • trying to optimize things that the compiler will

  • do for you, great answer.

  • Great, great answer.

  • Any other answers?

  • AUDIENCE: You learn how to best write

  • your code to take advantages of the compiler optimizations.

  • TAO B. SCHARDL: You can learn how

  • to write your code to take advantage of the compiler

  • optimizations, how to suggest to the compiler what it should

  • or should not do as you're constructing

  • your program, great answer as well.

  • Very good, in the front.

  • AUDIENCE: It might help for debugging

  • if the compiler has bugs.

  • TAO B. SCHARDL: It can absolutely

  • help for debugging when the compiler itself has bugs.

  • The compiler is a big piece of software.

  • And you may have noticed that a lot of software contains bugs.

  • The compiler is no exception.

  • And it helps to understand where the compiler might have made

  • a mistake, or where the compiler simply just

  • didn't do what you thought it should be able to do.

  • Understanding more of what happens in the compiler

  • can demystify some of those oddities.

  • Good answer.

  • Any other thoughts?

  • AUDIENCE: It's fun.

  • TAO B. SCHARDL: It's fun.

  • Well, OK, so in my completely biased opinion,

  • I would agree that it's fun to understand

  • what the compiler does.

  • You may have different opinions.

  • That's OK.

  • I won't judge.

  • So I put together a list of reasons

  • why, in general, we may care about what

  • goes on inside the compiler.

  • I highlighted that last point from this list, my bad.

  • Compilers can have a really big impact on software.

  • It's kind of like this.

  • Imagine that you're working on some software project.

  • And you have a teammate on your team

  • who's pretty quiet but extremely smart.

  • And what that teammate does is whenever that teammate gets

  • access to some code, they jump in

  • and immediately start trying to make that code work faster.

  • And that's really cool, because that teammate does good work.

  • And, oftentimes, you see that what the teammate produces

  • is, indeed, much faster code than what you wrote.

  • Now, in other industries, you might just sit back

  • and say, this teammate does fantastic work.

  • Maybe they don't talk very often.

  • But that's OK.

  • Teammate, you do you.

  • But in this class, we're performance engineers.

  • We want to understand what that teammate did to the software.

  • How did that teammate get so much performance out

  • of the code?

  • The compiler is kind of like that teammate.

  • And so understanding what the compiler does

  • is valuable in that sense.

  • As mentioned before, compilers can save you

  • performance engineering work.

  • If you understand that the compiler can

  • do some optimization for you, then you

  • don't have to do it yourself.

  • And that means that you can continue

  • writing simple, and readable, and maintainable code

  • without sacrificing performance.

  • You can also understand the differences between the source

  • code and whatever you might see show up in either the LLVM

  • IR or the assembly, if you have to look

  • at the assembly language produced for your executable.

  • And compilers can make mistakes.

  • Sometimes, that's because of a genuine bug in the compiler.

  • And other times, it's because the compiler just

  • couldn't understand something about what was going on.

  • And having some insight into how the compiler reasons about code

  • can help you understand why those mistakes were made,

  • or figure out ways to work around those mistakes,

  • or let you write meaningful bug reports to the compiler

  • developers.

  • And, of course, understanding computers

  • can help you use them more effectively.

  • Plus, I think it's fun.

  • So the first thing to understand about a compiler

  • is a basic anatomy of how the compiler works.

  • The compiler takes as input LLVM IR.

  • And up until this point, we thought of it

  • as just a big black box.

  • That does stuff to the IR, and out pops more LLVM IR,

  • but it's somehow optimized.

  • In fact, what's going on within that black box

  • is that the compiler is executing a sequence

  • of what we call transformation passes on the code.

  • Each transformation pass takes a look at its input,

  • and analyzes that code, and then tries

  • to edit the code in an effort to optimize

  • the code's performance.

  • Now, a transformation pass might end up running multiple times.

  • And those passes run in some order.

  • That order ends up being a predetermined order

  • that the compiler writers found to work

  • pretty well on their tests.

  • That's about the level of insight that

  • went into picking the order.

  • It seems to work well.

  • Now, some good news, in terms of trying

  • to understand what the compiler does,

  • you can actually just ask the compiler, what did you do?

  • And you've already used this functionality,

  • as I understand, in some of your assignments.

  • You've already asked the compiler

  • to give you a report specifically

  • about whether or not it could vectorize some code.

  • But, in fact, LLVM, the compiler you have access to,

  • can produce reports not just for vectorization,

  • but for a lot of the different transformation

  • passes that it tries to perform.

  • And there's some syntax that you have

  • to pass to the compiler, some compiler flags

  • that you have to specify in order to get those reports.

  • Those are described on the slide.

  • I won't walk you through that text.

  • You can look at the slides afterwards.

  • At the end of the day, the string that you're passing

  • is actually a regular expression.

  • If you know what regular expressions are,

  • great, then you can use that to narrow down

  • the search for your report.

  • If you don't, and you just want to see the whole report,

  • just provide dot star as a string and you're good to go.
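
For reference, the flags being alluded to are Clang's optimization-report options, the -Rpass family. A rough sketch of an invocation (the file name is a stand-in), where the quoted string is the regular expression that filters the report:

      clang -O3 -Rpass='.*' -Rpass-missed='.*' -Rpass-analysis='.*' -c mycode.c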

  • That's the good news.

  • You can get the compiler to tell you exactly what it did.

  • The bad news is that when you ask the compiler what it did,

  • it will give you a report.

  • And the report looks something like this.

  • In fact, I've highlighted most of the report

  • for this particular piece of code,

  • because the report ends up being very long.

  • And as you might have noticed just

  • from reading some of the text, there are definitely

  • English words in this text.

  • And there are pointers to pieces of code that you've compiled.

  • But it is very jargony, and hard to understand.

  • This isn't the easiest report to make sense of.

  • OK, so that's some good news and some bad news

  • about these compiler reports.

  • The good news is, you can ask the compiler.

  • And it'll happily tell you all about the things that it did.

  • It can tell you about which transformation passes were

  • successfully able to transform the code.

  • It can tell you conclusions that it drew

  • about its analysis of the code.

  • But the bad news is, these reports

  • are kind of complicated.

  • They can be long.

  • They use a lot of internal compiler jargon, which

  • if you're not familiar with that jargon,

  • it makes it hard to understand.

  • It also turns out that not all of the transformation

  • passes in the compiler give you these nice reports.

  • So you don't get to see the whole picture.

  • And, in general, the reports don't really

  • tell you the whole story about what the compiler did

  • or did not do.

  • And we'll see another example of that later on.

  • So part of the goal of today's lecture

  • is to get some context for understanding the reports

  • that you might see if you pass those flags to the compiler.

  • And the structure of today's lecture

  • is basically divided up into two parts.

  • First, I want to give you some examples

  • of compiler optimizations, just simple examples

  • so you get a sense as to how a compiler mechanically reasons

  • about the code it's given, and tries to optimize that code.

  • We'll take a look at how the compiler optimizes

  • a single scalar value, how it can optimize a structure,

  • how it can optimize function calls,

  • and how it can optimize loops, just simple examples

  • to give some flavor.

  • And then the second half of lecture,

  • I have a few case studies for you

  • which get into diagnosing ways in which the compiler failed,

  • not due to bugs, per se, but simply

  • didn't do an optimization you might have expected it to do.

  • But, to be frank, I think all those case

  • studies are really cool.

  • But it's not totally crucial that we

  • get through every single case study during today's lecture.

  • The slides will be available afterwards.

  • So when we get to that part, we'll

  • just see how many case studies we can cover.

  • Sound good?

  • Any questions so far?

  • All right, let's get to it.

  • Let's start with a quick overview

  • of compiler optimizations.

  • So here is a summary of the various--

  • oh, I forgot that I just copied this slide

  • from a previous lecture given in this class.

  • You might recognize this slide I think from lecture two.

  • Sorry about that.

  • That's OK.

  • We can fix this.

  • We'll just go ahead and add this slide right now.

  • We need to change the title.

  • So let's cross that out and put in our new title.

  • OK, so, great, and now we should double check these lists

  • and make sure that they're accurate.

  • Data structures, we'll come back to data structures.

  • Loops, hoisting, yeah, the compiler can do hoisting.

  • Sentinels, not really, the compiler

  • is not good at sentinels.

  • Loop unrolling, yeah, it absolutely does loop unrolling.

  • Loop fusion, yeah, it can, but there are

  • some restrictions that apply.

  • Your mileage might vary.

  • Eliminating wasted iterations, some restrictions might apply.

  • OK, logic, constant folding and propagation, yeah,

  • it's good on that.

  • Common subexpression elimination, yeah, it

  • can find common subexpressions, you're fine there.

  • It knows algebra, yeah good.

  • Short circuiting, yes, absolutely.

  • Ordering tests, depends on the tests--

  • I'll give it to the compiler.

  • But I'll say, restrictions apply.

  • Creating a fast path, compilers aren't

  • that smart about fast paths.

  • They come up with really boring fast paths.

  • I'm going to take that off the list.

  • Combining tests, again, it kind of depends on the tests.

  • Functions, compilers are pretty good at functions.

  • So inlining, it can do that.

  • Tail recursion elimination, yes, absolutely.

  • Coarsening, not so much.

  • OK, great.

  • Let's come back to data structures,

  • which we skipped before.

  • Packing, augmentation-- OK, honestly, the compiler

  • does a lot with data structures but really

  • none of those things.

  • The compiler isn't smart about data structures

  • in that particular way.

  • Really, the way that the compiler

  • is smart about data structures is shown here,

  • if we expand this list to include even more compiler

  • optimizations.

  • Bottom line with data structures, the compiler

  • knows a lot about architecture.

  • And it really has put a lot of effort

  • into figuring out how to use registers really effectively.

  • Reading and writing a register is super fast.

  • Touching memory is not so fast.

  • And so the compiler works really hard to allocate registers, put

  • anything that lives in memory ordinarily into registers,

  • manipulate aggregate types to use registers,

  • as we'll see in a couple of slides, align data

  • that has to live in memory.

  • Compilers are good at that.

  • Compilers are also good at loops.

  • We already saw some example optimization

  • on the previous slide.

  • It can vectorize.

  • It does a lot of other cool stuff.

  • Unswitching is a cool optimization

  • that I won't cover here.

  • Idiom replacement, it finds common patterns,

  • and does something smart with those.

  • Fission, skewing, tiling, interchange, those all

  • try to process the iterations of the loop in some clever way

  • to make stuff go fast.

  • And some restrictions apply.

  • Those are still in development in LLVM.

  • Logic, it does a lot more with logic than what we saw before.

  • It can eliminate instructions that aren't necessary.

  • It can do strength reduction, and other cool optimizations.

  • I think we saw that one in the Bentley slides.

  • It gets rid of dead code.

  • It can do more idiom replacement.

  • Branch reordering is kind of like reordering tests.

  • Global value numbering, another cool optimization

  • that we won't talk about today.

  • Functions, it can do more on switching.

  • It can eliminate arguments that aren't necessary.

  • So the compiler can do a lot of stuff for you.

  • And at the end of the day, writing down this whole list

  • is kind of a futile activity because it changes over time.

  • Compilers are a moving target.

  • Compiler developers, they're software engineers

  • like you and me.

  • And they're clever.

  • And they're trying to apply all their clever software

  • engineering practice to this compiler code base

  • to make it do more stuff.

  • And so they are constantly adding new optimizations

  • to the compiler, new clever analyses, all the time.

  • So, really, what we're going to look at today

  • is just a couple examples to get a flavor for what

  • the compiler does internally.

  • Now, if you want to follow along with how the compiler works,

  • the good news is, by and large, you

  • can take a look at the LLVM IR to see

  • what happens as the compiler processes your code.

  • You don't need to look at the assembly.

  • That's generally true.

  • But there are some exceptions.

  • So, for example, if we have these three snippets of C code

  • on the left, and we look at what your LLVM compiler generates,

  • in terms of the IR, we can see that there

  • are some optimizations reflected, but not

  • too many interesting ones.

  • The multiply by 8 turns into a shift left operation by 3,

  • because 8 is a power of 2.

  • That's straightforward.

  • Good, we can see that in the IR.

  • The multiply by 15 still looks like a multiply by 15.

  • No changes there.

  • The divide by 71 looks like a divide by 71.

  • Again, no changes there.
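
To make this concrete, here is a plausible reconstruction of the three one-line snippets being discussed (function names are mine, not the slide's):

      #include <stdint.h>

      uint64_t mul8(uint64_t n)  { return n * 8;  }  // IR: shift left by 3
      uint64_t mul15(uint64_t n) { return n * 15; }  // IR: still a multiply by 15
      uint64_t div71(uint64_t n) { return n / 71; }  // IR: still a divide by 71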

  • Now, with arithmetic ops, the difference

  • between what you see in the LLVM IR

  • and what you see in the assembly,

  • this is where it's most pronounced,

  • at least in my experience, because if we

  • take a look at these same snippets of C code,

  • and we look at the corresponding x86 assembly for it,

  • we get the stuff on the right.

  • And this looks different.

  • Let's pick through what this assembly

  • code does one line at a time.

  • So the first one in the C code, takes the argument n,

  • and multiplies it by 8.

  • And in the assembly, we have this LEA instruction.

  • Anyone remember what the LEA instruction does?

  • I see one person shaking their head.

  • That's a perfectly reasonable response.

  • Yeah, go for it?

  • Load effective address, what does that mean?

  • Load the address, but don't actually access memory.

  • Another way to phrase that, do this address calculation.

  • And give me the result of the address calculation.

  • Don't read or write memory at that address.

  • Just do the calculation.

  • That's what loading an effective address means, essentially.

  • But you're exactly right.

  • The LEA instruction does an address calculation,

  • and stores the result in the register on the right.

  • Anyone remember enough about x86 address calculations

  • to tell me how that LEA in particular works, the first LEA

  • on the slide?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: Right, whatever is before the first comma, in this case

  • nothing, gets added to the product of the second two

  • arguments in those parens.

  • You're exactly right.

  • So this LEA takes the value 8, multiplies it by whatever

  • is in register RDI, which holds the value n.

  • And it stores the result into RAX.

  • So, perfect, it does a multiply by 8.

  • The address calculator is only capable of a small range

  • of operations.

  • It can do additions.

  • And it can multiply by 1, 2, 4, or 8.

  • That's it.

  • So it's a really simple circuit in the hardware.

  • But it's fast.

  • It's optimized heavily by modern processors.

  • And so if the compiler can use it,

  • they tend to try to use these LEA instructions.

  • So good job.

  • How about the next one?

  • Multiply by 15 turns into these two LEA instructions.

  • Can anyone tell me how these work?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: You're basically multiplying by 5

  • and multiplying by 3, exactly right.

  • We can step through this as well.

  • If we look at the first LEA instruction,

  • we take RDI, which stores the value n.

  • We multiply that by 4.

  • We add it to the original value of RDI.

  • And so that computes 4 times n, plus n, which is five times n.

  • And that result gets stored into RAX.

  • Good, we've effectively multiplied by 5.

  • The next instruction takes whatever

  • is in RAX, which is now 5n, multiplies that by 2, adds it

  • to whatever is currently in RAX, which is once again 5n.

  • So that computes 2 times 5n, plus 5n, which

  • is 3 times 5n, which is 15n.

  • So just like that, we've done our multiply

  • with two LEA instructions.
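
Written out in C, the arithmetic behind that two-LEA decomposition looks like this; a sketch of the math, not the actual codegen:

      #include <stdint.h>

      uint64_t mul15(uint64_t n) {
          uint64_t t = n + 4 * n;  // first LEA:  5n
          return t + 2 * t;        // second LEA: 2*(5n) + 5n = 15n
      }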

  • How about the last one?

  • In this last piece of code, we take the argument in RDI.

  • We move it into EAX.

  • We then move the value 3,871,519,817,

  • and put that into ECX, as you do.

  • We multiply those two values together.

  • And then we shift the product right by 38.

  • So, obviously, this divides by 71.

  • Any guesses as to how this performs

  • the division operation we want?

  • Both of you answered.

  • I might still call on you.

  • Let me give a little more time for someone else

  • to raise their hand.

  • Go for it.

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: It has a lot to do with 2 to the 38, very good.

  • Yeah, all right, any further guesses

  • before I give the answer away?

  • Yeah, in the back?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: Kind of.

  • So this is what's technically called a magic number.

  • And, yes, it's technically called a magic number.

  • And this magic number is equal to 2

  • to the 38, divided by 71, plus 1 to deal with some rounding

  • effects.

  • What this code does is it says, let's

  • compute n divided by 71, by first computing n divided

  • by 71, times 2 to the 38, and then shifting off the lower 38

  • bits with that shift right operation.

  • And by converting the operation into this,

  • it's able to replace the division operation

  • with a multiply.
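
As a sanity check, here is the same trick written in C. I'm assuming n is a 32-bit unsigned value, so the 64-bit product cannot overflow:

      #include <stdint.h>

      uint32_t div71(uint32_t n) {
          // magic = ceil(2^38 / 71) = 3871519817, the constant from the slide
          const uint64_t magic = (UINT64_C(1) << 38) / 71 + 1;
          return (uint32_t)(((uint64_t)n * magic) >> 38);  // shift off the low 38 bits
      }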

  • And if you remember, hopefully, from the architecture lecture,

  • multiply operations, they're not the cheapest things

  • in the world.

  • But they're not too bad.

  • Division is really expensive.

  • If you want fast code, never divide.

  • Also, never compute modulus, or access memory.

  • Yeah, question?

  • AUDIENCE: Why did you choose 38?

  • TAO B. SCHARDL: Why did I choose 38?

  • I think it chose 38 because 38 works.

  • There's actually a formula for--

  • pretty much it doesn't want to choose

  • a value that's too large, or else it'll overflow.

  • And it doesn't want to choose a value that's too small,

  • or else you lose precision.

  • So it's able to find a balancing point.

  • If you want to know more about magic numbers,

  • I recommend checking out this book called Hacker's Delight.

  • For any of you who are familiar with this book,

  • it is a book full of bit tricks.

  • Seriously, that's the entire book.

  • It's just a book full of bit tricks.

  • And there's a whole section in there describing

  • how you do division by various constants using multiplication,

  • either signed or unsigned.

  • It's very cool.

  • But magic number to convert a division

  • into a multiply, that's the kind of thing

  • that you might see from the assembly.

  • That's one of these examples of arithmetic operations

  • that are really optimized at the very last step.

  • But for the rest of the optimizations,

  • fortunately we can focus on the IR.

  • Any questions about that so far?

  • Cool.

  • OK, so for the next part of the lecture,

  • I want to show you a couple example optimizations in terms

  • of the LLVM IR.

  • And to show you these optimizations,

  • we'll have a little bit of code that we'll work through,

  • a running example, if you will.

  • And this running example will be some code

  • that I stole from, I think, a serial program that simulates

  • the behavior of n massive bodies in 2D space

  • under the law of gravitation.

  • So we've got a whole bunch of point masses.

  • Those point masses have varying masses.

  • And we just want to simulate what

  • happens due to gravity as these masses interact in the plane.

  • At a high level, the n body code is pretty simple.

  • We have a top level simulate routine,

  • which just loops over all the time

  • steps, during which we want to perform this simulation.

  • And at each time step, it calculates the various forces

  • acting on those different bodies.

  • And then it updates the position of each body,

  • based on those forces.

  • In order to do that calculation,

  • it has some internal data structures,

  • one to represent each body, which contains

  • a couple of vector types.

  • And we define our own vector type

  • to store two double-precision floating-point values.

  • Now, we don't need to see the entire code in order

  • to look at some compiler optimizations.

  • The one routine that we will take a look at

  • is this one to update the positions.

  • This is a simple loop that takes each body, one at a time,

  • computes the new velocity on that body,

  • based on the forces acting on that body,

  • and uses vector operations to do that.

  • Then it updates the position of that body,

  • again using these vector operations that we've defined.

  • And then it stores the results into the data structure

  • for that body.
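
Pieced together from that description, one plausible shape for the routine is sketched below. The field names, the time step dt, and the division by mass are my assumptions, not the actual source:

      typedef struct { double x, y; } vec_t;
      typedef struct { vec_t position, velocity, force; double mass; } body_t;

      vec_t vec_add(vec_t a, vec_t b);     // the 2D vector routines, shown next
      vec_t vec_scale(vec_t v, double a);

      void update_positions(body_t *bodies, int n, double dt) {
          for (int i = 0; i < n; ++i) {
              // new velocity from the net force on this body (assuming F = ma)
              vec_t v = vec_add(bodies[i].velocity,
                                vec_scale(bodies[i].force, dt / bodies[i].mass));
              bodies[i].velocity = v;
              // advance the position using the updated velocity
              bodies[i].position = vec_add(bodies[i].position, vec_scale(v, dt));
          }
      }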

  • So all the methods in this code

  • make use of these basic routines on 2D vectors, points in x, y,

  • or points in 2D space.

  • And these routines are pretty simple.

  • There is one to add two vectors.

  • There's another to scale a vector by a scalar value.

  • And there's a third to compute the length, which we won't

  • actually look at too much.
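
A minimal sketch of those three routines, assuming the two-double vector type described a moment ago:

      #include <math.h>

      typedef struct { double x, y; } vec_t;

      vec_t vec_add(vec_t a, vec_t b)    { return (vec_t){ a.x + b.x, a.y + b.y }; }
      vec_t vec_scale(vec_t v, double a) { return (vec_t){ v.x * a, v.y * a }; }
      double vec_length(vec_t v)         { return sqrt(v.x * v.x + v.y * v.y); }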

  • Everyone good so far?

  • OK, so let's try to start simple.

  • Let's take a look at just one of these one line vector

  • operations, vec scale.

  • All vec scale does is it takes one of these vector inputs

  • and a scalar value a.

  • And it multiplies x by a, and y by a, and stores the results

  • into a vector type, and returns it.

  • Great, couldn't be simpler.

  • If we compile this with no optimizations whatsoever,

  • and we take a look at the LLVM IR,

  • we get that, which is a little more complicated

  • than you might imagine.

  • The good news, though, is that if you turn on optimizations,

  • and you just turn on the first level of optimization, just -O1,

  • whereas we got this code before, now we get this, which is far,

  • far simpler, and so simple I can blow up the font size so you

  • can actually read the code on the slide.

  • So, to see it again: no optimizations, optimizations.

  • So a lot of stuff happened to optimize this simple function.

  • We're going to see what those optimizations actually were.

  • But, first, let's pick apart what's

  • going on in this function.

  • We have our vec scale routine in LLVM IR.

  • It takes a structure as its first argument.

  • And that's represented using two doubles.

  • It takes a scalar as the final argument.

  • And what the operation does is it multiplies those two

  • fields by the third argument, the double A.

  • It then packs those values into a struct that'll return.

  • And, finally, it returns that struct.

  • So that's what the optimized code does.

  • Let's see actually how we get to this optimized code.

  • And we'll do this one step at a time.

  • Let's start by optimizing the operations on a single scalar

  • value.

  • That's why I picked this example.

  • So we go back to the -O0 code.

  • And we just pick out the operations that

  • dealt with that scalar value.

  • We narrow our scope down to just these lines.

  • So the argument double A is the final argument in the list.

  • And what we see is that within the vector scale

  • routine, compiled at -O0, we allocate some local storage.

  • We store that double A into the local storage.

  • And then later on, we'll load the value out

  • of the local storage before the multiply.

  • And then we load it again before the other multiply.

  • OK, any ideas how we could make this code faster?

  • Don't store in memory, what a great idea.

  • How do we get around not storing it in memory?

  • Save it in a register.

  • In particular, what property of LLVM IR makes that really easy?

  • There are infinite registers.

  • And, in fact, the argument is already in a register.

  • It's already in the register percent two, if I recall.

  • So we don't need to move it into a register.

  • It's already there.

  • So how do we go about optimizing that code in this case?

  • Well, let's find the places where we're using the value.

  • And we're using the value loaded from memory.

  • And what we're going to do is just replace those loads

  • from memory with the original argument.

  • We know exactly what operation we're trying to do.

  • We know we're trying to do a multiply

  • by the original parameter.

  • So we just find those two uses.

  • We cross them out.

  • And we put in the input parameter in its place.

  • That make sense?

  • Questions so far?

  • Cool.

  • So now, those multipliers aren't using the values

  • returned by the loads.

  • How further can we optimize this code?

  • Delete the loads.

  • What else can we delete?

  • So there's no address calculation here

  • just because the code is so simple, but good insight.

  • The allocation and the store, great.

  • So those loads are dead code.

  • The store is dead code.

  • The allocation is dead code.

  • We eliminate all that dead code.

  • We got rid of those loads.

  • We just used the value living in the register.

  • And we've already eliminated a bunch of instructions.

  • So the net effect of that was to turn the code optimized at -O0

  • that we had in the background into the code we have

  • in the foreground, which is slightly shorter,

  • but not that much.

  • So it's a little bit faster, but not that much faster.
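
In C terms, the -O0 pattern and its rewrite look roughly like this; an analogy for what happens in the IR, not actual compiler output:

      typedef struct { double x, y; } vec_t;

      vec_t vec_scale_O0_style(vec_t v, double a) {
          double slot = a;                       // -O0: alloca plus a store of the argument
          vec_t r = { v.x * slot, v.y * slot };  // each use is a load from that slot
          return r;
          // Replacing both loads with a itself makes the store and the
          // allocation dead code, which the compiler then deletes.
      }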

  • How do we optimize this function further?

  • Do it for every variable we have.

  • In particular, the only other variable we have

  • is a structure that we're passing in.

  • So we want to do this kind of optimization on the structure.

  • Make sense?

  • So let's see how we optimize this structure.

  • Now, the problem is that structures

  • are harder to handle than individual scalar values,

  • because, in general, you can't store the whole structure

  • in just a single register.

  • It's more complicated to juggle all the data

  • within a structure.

  • But, nevertheless, let's take a look

  • at the code that operates on the structure,

  • or at least operates on the structure

  • that we pass in to the function.

  • So when we eliminate all the other code,

  • we see that we've got an allocation.

  • Let's see if I have animations here, yeah, I do.

  • We have an allocation.

  • So we can store the structure onto the stack.

  • Then we have an address calculation

  • that lets us store the first part of the structure

  • onto the stack.

  • We have a second address calculation

  • to store the second field on the stack.

  • And later on, when we need those values,

  • we load the first field out of memory.

  • And we load the second field out of memory.

  • It's a very similar pattern to what we had before,

  • except we've got more going on in this case.

  • So how do we go about optimizing this structure?

  • Any ideas, high level ideas?

  • Ultimately, we want to get rid of all of the memory

  • references and all that storage for the structure.

  • How do we reason through eliminating all that stuff

  • in a mechanical fashion, based on what we've seen so far?

  • Go for it.

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: They are passed in using separate parameters,

  • separate registers if you will, as a quirk of how LLVM does it.

  • So given that insight, how would you optimize it?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: Cross out percent 12, percent 6,

  • and put in the relevant field.

  • Cool.

  • Let me phrase that a little bit differently.

  • Let's do this one field at a time.

  • We've got a structure, which has multiple fields.

  • Let's just take it one step at a time.

  • All right, so let's look at the first field.

  • And let's look at the operations that deal with the first field.

  • We have, in our code, in our LLVM IR, some address

  • calculations that refer to the same field of the structure.

  • In this case, I believe it's the first field, yes.

  • And, ultimately, we end up loading from this location

  • in local memory.

  • So what value is this load going to retrieve?

  • How do we know that both address calculations

  • refer to the same field?

  • Good question.

  • What we do in this case is very careful analysis

  • of the math that's going on.

  • We know that the alloca, the location in local memory,

  • that's just a fixed location.

  • And from that, we can interpret what each of the instructions

  • does in terms of an address calculation.

  • And we can determine that they're the same value.

  • So we have this location in memory that we operate on.

  • And before you do a multiply, we end up

  • loading from that location in memory.

  • So what value do we know is going to be loaded by that load

  • instruction?

  • Go for it.

  • AUDIENCE: So what we're doing right now is taking some value,

  • and then storing it, and then getting it back out,

  • and putting it back.

  • TAO B. SCHARDL: Not putting it back, but don't you

  • worry about putting it back.

  • AUDIENCE: So we don't need to put it somewhere just

  • to take it back out?

  • TAO B. SCHARDL: Correct.

  • Correct.

  • So what are we multiplying in that multiply, which value?

  • First element of the struct.

  • It's percent zero.

  • It's the value that we stored right there.

  • That makes sense?

  • Everyone see that?

  • Any questions about that?

  • All right, so we're storing the first element of the struct

  • into this location.

  • Later, we load it out of that same location.

  • Nothing else happened to that location.

  • So let's go ahead and optimize it just the same way

  • we optimize the scalar.

  • We see that we use the result of the load right there.

  • But we know that load is going to return the first field

  • of our struct input.

  • So we'll just cross it out, and replace it with that field.

  • So now we're not using the result of that load.

  • What do we get to do as the compiler?

  • I can tell you know the answer.

  • Delete the dead code, delete all of it.

  • Remove the now dead code, which is all those address

  • calculations, as well as the load operation, and the store

  • operation.

  • And that's pretty much it.

  • Yeah, good.

  • So we replace that operation.

  • And we got rid of a bunch of other code from our function.

  • We've now optimized one of the two fields in our struct.

  • What do we do next?

  • Optimize the next one.

  • That happened similarly.

  • I won't walk you through that a second time.

  • We find where we're using the result of that load.

  • We can cross it out, and replace it with the appropriate input,

  • and then delete all the relevant dead code.

  • And now, we get to delete the original allocation

  • because nothing's getting stored to that memory.

  • That make sense?

  • Any questions about that?

  • Yeah?

  • AUDIENCE: So when we first compile it to LLVM IR,

  • does it unpack the struct and just

  • put in separate parameters?

  • TAO B. SCHARDL: When we first compiled LLVM IR,

  • do we unpack the struct and pass in the separate parameters?

  • AUDIENCE: Like, how we have three parameters here

  • that are doubles.

  • Wasn't our original C code just a vector struct and

  • a double?

  • TAO B. SCHARDL: So LLVM IR in this case, when we compiled it

  • as zero, decided to pass it as separate parameters,

  • just as its representation.

  • So in that sense, yes.

  • But it was still doing the standard,

  • create some local storage, store the parameters

  • on to local storage, and then all operations just

  • read out of local storage.

  • It's the standard thing that the compiler generates when

  • it's asked to compile C code.

  • And with no other optimizations, that's what you get.

  • That makes sense?

  • Yeah?

  • AUDIENCE: What are all the align eights?

  • TAO B. SCHARDL: What are all the aligned eights doing?

  • The align eights are attributes that

  • specify the alignment of that location in memory.

  • This is alignment information that the compiler

  • either determines by analysis, or implements

  • as part of a standard.

  • So they're specifying how values are aligned in memory.

  • That matters a lot more for ultimate code generation,

  • unless we're able to just delete the memory

  • references altogether.

  • Make sense?

  • Cool.

  • Any other questions?

  • All right, so we optimized the first field.

  • We optimize the second field in a similar way.

  • Turns out, there's additional optimizations

  • that need to happen in order to return

  • a structure from this function.

  • Those operations can be optimized in a similar way.

  • They're shown here.

  • We're not going to go through exactly how that works.

  • But at the end of the day, after we've

  • optimized all of that code we end up with this.

  • We end up with our function compiled at 01.

  • And it's far simpler.

  • I think it's far more intuitive.

  • This is what I would imagine the code should

  • look like when I wrote the C code in the first place.

  • Take your input.

  • Do a couple of multiplications.

  • And then it does some operations to create the return value,

  • and ultimately return that value.

  • So, in summary, the compiler works

  • hard to transform data structures and scalar

  • values to store as much as it possibly

  • can purely within registers, and avoid using

  • any local storage, if possible.

  • Everyone good with that so far?

  • Cool.

  • Let's move on to another optimization.

  • Let's talk about function calls.

  • Let's take a look at how the compiler

  • can optimize function calls.

  • By and large, these optimizations

  • will occur if you pass optimization level 2 or higher,

  • just FYI.

  • So from our original C code, we had

  • some lines that performed a bunch of vector operations.

  • We had a vec add that added two vectors together, one of which

  • was the result of a vec scale, which

  • scaled the result of a vec add by some scalar value.

  • So we had this chain of calls in our code.

  • And if we take a look at the code compiled at

  • -O0, what we end up with is this snippet shown

  • on the bottom, which performs some operations on these vector

  • structures, does this multiply operation,

  • and then calls this vector scale routine,

  • the vector scale routine that we decide to focus on first.

  • So any ideas for how we go about optimizing this?

  • So to give you a little bit of a hint, what the compiler sees

  • when it looks at that call is it sees a snippet containing

  • the call instruction.

  • And in our example, it also sees the code for the vec scale

  • function that we were just looking at.

  • And we're going to suppose that it's already optimized

  • vec scale as best as it can.

  • It's produced this code for the vec scale routine.

  • And so it sees that call instruction.

  • And it sees this code for the function that's being called.

  • So what could the compiler do at this point

  • to try to make the code above even faster?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: You're exactly right.

  • Remove the call, and just put the body of the vec scale code

  • right there in place of the call.

  • It takes a little bit of effort to pull that off.

  • But, roughly speaking, yeah, we're

  • just going to copy and paste this code in our function

  • into the place where we're calling the function.

  • And so if we do that simple copy paste,

  • we end up with some garbage code as an intermediate.

  • We had to do a little bit of renaming

  • to make everything work out.

  • But at this point, we have the code

  • from our function in the place of that call.

  • And now, we can observe that to restore correctness,

  • we don't want to do the call.

  • And we don't want to do the return that we just

  • pasted in place.

  • So we'll just go ahead and remove

  • both that call and the return.

  • That is called function inlining.

  • We identify some function call, or the compiler

  • identifies some function call.

  • And it takes the body of the function,

  • and just pastes it right in place of that call.
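
In C terms, the effect is roughly the rewrite below; the caller functions are hypothetical stand-ins for the call site:

      typedef struct { double x, y; } vec_t;

      vec_t vec_scale(vec_t v, double a);   // the optimized callee

      vec_t caller_before(vec_t v, double a) {
          return vec_scale(v, a);           // the original call
      }

      vec_t caller_after(vec_t v, double a) {
          // body of vec_scale pasted in place; the call and its return are gone
          return (vec_t){ v.x * a, v.y * a };
      }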

  • Sound good?

  • Make sense?

  • Anyone confused?

  • Raise your hand if you're confused.

  • Now, once you've done some amount of function inlining,

  • we can actually do some more optimizations.

  • So here, we have the code after we got rid

  • of the unnecessary call and return.

  • And we have a couple multiply operations sitting in place.

  • That looks fine.

  • But if we expand our scope just a little bit,

  • what we see is that we have some operations

  • happening that were sitting there already

  • after the original call.

  • What the compiler can do is it can take

  • a look at these instructions.

  • And long story short, it realizes

  • that all these instructions do is

  • pack some data into a structure, and then immediately unpack

  • the structure.

  • So it's like you put a bunch of stuff into a bag,

  • and then immediately dump out the bag.

  • That was kind of a waste of time.

  • That's kind of a waste of code.

  • Let's get rid of it.

  • Those operations are useless.

  • Let's delete them.

  • The compiler has a great time deleting dead code.

  • It's like it's what it lives to do.

  • All right, now, in fact, in the original code,

  • we didn't just have one function call.

  • We had a whole sequence of function calls.

  • And if we expand our LLVM IR snippet even a little further,

  • we can include those two function

  • calls, the original call to vec add, followed by the code

  • that we've now optimized by inlining,

  • ultimately followed by yet another call to vec add.

  • Minor spoiler, the vec add routine, once it's optimized,

  • looks pretty similar to the vec scalar routine.

  • And, in particular, it has comparable size

  • to the vector scale routine.

  • So what's the compiler going to do to those two call sites?

  • Inline it, do more inlining, inlining is great.

  • We'll inline these functions as well,

  • and then remove all of the additional, now-useless

  • instructions.

  • We'll walk through that process.

  • The result of that process looks something like this.

  • So in the original C code, we had this vec

  • add, which called the vec scale as one

  • of its arguments, which called the vec add

  • is one of its arguments.

  • And what we end up with in the optimized IR

  • is just a bunch of straight line code that performs

  • floating point operations.

  • It's almost as if the compiler took the original C code,

  • and transformed it into the equivalent C code shown

  • on the bottom, where it just operates

  • on a whole bunch of doubles, and just does primitive operations.
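
Spelled out, here is a sketch of that chain and its flattened form, with argument names of my choosing:

      typedef struct { double x, y; } vec_t;

      vec_t vec_add(vec_t a, vec_t b);
      vec_t vec_scale(vec_t v, double a);

      vec_t chain_before(vec_t u, vec_t p, vec_t q, double a) {
          return vec_add(u, vec_scale(vec_add(p, q), a));
      }

      vec_t chain_after(vec_t u, vec_t p, vec_t q, double a) {
          // straight-line floating-point code: no calls, no struct packing
          return (vec_t){ u.x + (p.x + q.x) * a,
                          u.y + (p.y + q.y) * a };
      }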

  • So function inlining, as well as the additional transformations

  • it was able to perform as a result,

  • together those were able to eliminate

  • all of those function calls.

  • It was able to completely eliminate

  • any costs associated with the function call abstraction,

  • at least in this code.

  • Make sense?

  • I think that's pretty cool.

  • You write code that has a bunch of function calls,

  • because that's how you've constructed your interfaces.

  • But you're not really paying for those function calls.

  • Function calls aren't the cheapest operation

  • in the world, especially if you think

  • about everything that goes on in terms

  • of the registers and the stack.

  • But the compiler is able to avoid all of that overhead,

  • and just perform the floating point operations we care about.

  • OK, well, if function inlining is so great,

  • and it enables so many great optimizations,

  • why doesn't the compiler just inline every function call?

  • Go for it.

  • Recursion, it's really hard to inline a recursive call.

  • In general, you can't inline a function into itself,

  • although it turns out there are some exceptions.

  • So, yes, recursion creates problems

  • with function inlining.

  • Any other thoughts?

  • In the back.

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: You're definitely on to something.

  • So we had to do a bunch of this renaming stuff

  • when we inlined the first time, and when

  • we inlined every single time.

  • And even though LLVM IR has an infinite number of registers,

  • the machine doesn't.

  • And so all of that renaming does create a problem.

  • But there are other problems as well of

  • a similar nature when you start inlining all those functions.

  • For example, you copy pasted a bunch of code.

  • And that made the original call site even bigger, and bigger,

  • and bigger, and bigger.

  • And programs, we generally don't think about the space

  • they take in memory.

  • But they do take space in memory.

  • And that does have an impact on performance.

  • So great answer, any other thoughts?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: If your function becomes too long,

  • then it may not fit in instruction cache.

  • And that can increase the amount of time

  • it takes just to execute the function.

  • Right, because you're now not getting cache hits,

  • exactly right.

  • That's one of the problems with this code size blow

  • up from inlining everything.

  • Any other thoughts?

  • Any final thoughts?

  • So there are three main reasons why the compiler

  • won't inline every function.

  • I think we touched on two of them here.

  • For some function calls, like recursive calls,

  • it's impossible to inline them, because you can't

  • inline a function into itself.

  • But there are exceptions to that, namely

  • recursive tail calls.

  • If the last thing in a function is a function call,

  • then it turns out you can effectively

  • inline that function call as an optimization.

  • We're not going to talk too much about how that works.

  • But there are corner cases.

  • But, in general, you can't inline a recursive call.

  • The compiler has another problem.

  • Namely, if the function that you're calling

  • is in a different castle, if it's in a different compilation

  • unit, literally in a different file

  • that's compiled independently, then the compiler

  • can't very well inline that function,

  • because it doesn't know about the function.

  • It doesn't have access to that function's code.

  • There is a way to get around that problem

  • with modern compiler technology that involves whole program

  • optimization.

  • And I think there's some backup slides that will tell you

  • how to do that with LLVM.

  • But, in general, if it's in a different compilation unit,

  • it can't be inlined.

  • And, finally, as touched on, function inlining

  • can increase code size, which can hurt performance.

  • OK, so some functions are OK to inline.

  • Other functions could create this performance problem,

  • because you've increased code size.

  • So how does the compiler know whether or not

  • inlining any particular function at a call site

  • could hurt performance?

  • Any guesses?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: Yeah.

  • So the compiler has some cost model, which gives it

  • some information about, how much will it

  • cost to inline that function?

  • Is the cost model always right?

  • It is not.

  • So the answer, how does the compiler know,

  • is, really, it doesn't know.

  • It makes a best guess using that cost model,

  • and other heuristics, to determine,

  • when does it make sense to try to inline a function?

  • And because it's making a best guess,

  • sometimes the compiler guesses wrong.

  • So to wrap up this part, here are just

  • a couple of tips for controlling function inlining

  • in your own programs.

  • If there's a function that you know must always be inlined,

  • no matter what happens, you can mark that function

  • with a special attribute, namely the always inline attribute.

  • For example, if you have a function that

  • does some complex address calculation,

  • and it should be inlined rather than called,

  • you may want to mark that with an always inline attribute.

  • Similarly, if you have a function that really should

  • never be inlined, it's never cost effective

  • to inline that function, you can mark that function

  • with the no inline attribute.

  • And, finally, if you want to enable more function inlining

  • in the compiler, you can use link time optimization, or LTO,

  • to enable whole program optimization.

  • Won't go into that during these slides.
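
In GCC and Clang, those attributes and the LTO switch are spelled as follows; a minimal sketch, with hypothetical functions:

      // forced inlining, overriding the compiler's cost model
      static inline __attribute__((always_inline))
      long index2d(long i, long j, long ncols) { return i * ncols + j; }

      // never inlined, e.g., to keep a cold error path out of hot code
      __attribute__((noinline))
      void report_error(const char *msg);

      // Whole-program inlining across files via link-time optimization:
      //   clang -O3 -flto -c a.c b.c
      //   clang -flto a.o b.o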

  • Let's move on, and talk about loop optimizations.

  • Any questions so far, before continue?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: Sorry?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: Does static inline

  • guarantee you the compiler will always inline it?

  • It actually doesn't.

  • The inline keyword will provide a hint to the compiler

  • that it should think about inlining the function.

  • But it doesn't provide any guarantees.

  • If you want a strong guarantee, use the always inline

  • attribute.

  • Good question, though.

  • All right, loop optimizations--

  • you've already seen some loop optimizations.

  • You've seen vectorization, for example.

  • It turns out, the compiler does a lot of work

  • to try to optimize loops.

  • So first, why is that?

  • Why would the compiler engineers invest so much effort

  • into optimizing loops?

  • Why loops in particular?

  • They're extremely common control structure

  • that also has a branch.

  • Both things are true.

  • I think there's a higher level reason, though,

  • or more fundamental reason, if you will.

  • Yeah?

  • AUDIENCE: Most of the time, the loop takes up the most time.

  • TAO B. SCHARDL: Most of the time the loop takes up

  • the most time.

  • You got it.

  • Loops account for a lot of the execution time of programs.

  • The way I like to think about this

  • is with a really simple thought experiment.

  • Let's imagine that you've got a machine with a two gigahertz

  • processor.

  • We've chosen these values to be easier

  • to think about using mental math.

  • Suppose you've got a two gigahertz

  • processor with 16 cores.

  • Each core executes one instruction per cycle.

  • And suppose you've got a program which

  • contains a trillion instructions and ample parallelism

  • for those 16 cores.

  • But all of those instructions are simple, straight line code.

  • There are no branches.

  • There are no loops.

  • There no complicated operations like IO.

  • It's just a bunch of really simple straight line code.

  • Each instruction takes a cycle to execute.

  • The processor executes one instruction per cycle.

  • How long does it take to run this code, to execute

  • the entire terabyte binary?

  • 2 to the 40th cycles for 2 to the 40 instructions.

  • But you're using a two gigahertz processor and 16 cores.

  • And you've got ample parallelism in the program

  • to keep them all saturated.

  • So how much time?

  • AUDIENCE: 32 seconds.

  • TAO B. SCHARDL: 32 seconds, nice job.

  • This one has mastered power-of-2 arithmetic in their head:

  • 2 to the 40 instructions, divided by 2 gigahertz times 16 cores,

  • which is about 2 to the 31 times 2 to the 4 instructions per second, comes out to 2 to the 5, or 32, seconds.

  • It's a good skill to have, especially in Course 6.

  • Yeah, so if you have just a bunch of simple,

  • straight line code, and you have a terabyte of it.

  • That's a lot of code.

  • That is a big binary.

  • And, yet, the program, this processor,

  • this relatively simple processor,

  • can execute the whole thing in just about 30 seconds.

  • Now, in your experience working with software,

  • you might have noticed that there

  • are some programs that take longer than 30 seconds to run.

  • And some of those programs don't have terabyte size binaries.

  • The reason that those programs take longer to run,

  • by and large, is loops.

  • So loops account for a lot of the execution

  • time in real programs.

  • Now, you've already seen some loop optimizations.

  • We're just going to take a look at one other loop

  • optimization today, namely code hoisting, also known

  • as loop invariant code motion.

  • To look at that, we're going to take

  • a look at a different snippet of code

  • from the n-body simulation.

  • This code calculates the forces acting

  • on each of the n bodies.

  • And it does it with a doubly nested loop.

  • For i from zero to the number of bodies,

  • for j from zero to the number of bodies, as long

  • as you're not looking at the same body,

  • call this add force routine with the result of calculate force,

  • which computes the force between those two bodies.

  • And add that force to one of the bodies.

  • That's all that's going on in this code.
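
From that description, the loop nest plausibly looks like the sketch below; the signatures are assumed, not taken from the actual source:

      typedef struct { double x, y; } vec_t;
      typedef struct { vec_t position, velocity, force; double mass; } body_t;

      vec_t calculate_force(body_t *a, body_t *b);  // force of b acting on a
      void add_force(body_t *b, vec_t f);           // accumulate f into b's force

      void calculate_forces(body_t *bodies, int nbodies) {
          for (int i = 0; i < nbodies; ++i)
              for (int j = 0; j < nbodies; ++j)
                  if (i != j)
                      add_force(&bodies[i], calculate_force(&bodies[i], &bodies[j]));
      }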

  • If we translate this code into LLVM IR,

  • we end up with, hopefully unsurprisingly,

  • a doubly nested loop.

  • It looks something like this.

  • The body of the code, the body of the innermost loop,

  • has been elided, just so things can fit on the slide.

  • But we can see the overall structure.

  • On the outside, we have some outer loop control.

  • This should look familiar from lecture five, hopefully.

  • Inside of that outer loop, we have an inner loop.

  • And at the top and the bottom of that inner loop,

  • we have the inner loop control.

  • And within that inner loop, we do

  • have one branch, which can skip a bunch of code

  • if you're looking at the same body for i and j.

  • But, otherwise, we have the loop body of the inner most loop,

  • basic structure.

  • Now, if we just zoom in on the top part

  • of this doubly-nested loop, just the topmost three basic blocks,

  • take a look at more of the code that's going on here,

  • we end up with something that looks like this.

  • And if you remember some of the discussion

  • from lecture five about the loop induction variables,

  • and what that looks like in LLVM IR, what you find

  • is that for the outer loop we have an induction variable

  • at the very top.

  • It's that weird phi instruction, once again.

  • Inside that outer loop, we have the loop control

  • for the inner loop, which has its own induction variable.

  • Once again, we have another phi node.

  • That's how we can spot it.

  • And then we have the body of the innermost loop.

  • And this is just the start of it.

  • It's just a couple of address calculations.

  • But can anyone tell me some interesting property

  • about just a couple of these address calculations

  • that could lead to an optimization?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: The first two address calculations only

  • depend on the outermost loop variable, the iteration

  • variable for the outer loop, exactly right.

  • So what can we do with those instructions?

  • Bring them out of the inner loop.

  • Why should we keep computing these addresses

  • in the innermost loop when we could just compute them once

  • in the outer loop?

  • That optimization is called code hoisting, or loop invariant

  • code motion.

  • Those instructions are invariant to the code

  • in the innermost loop.

  • So you hoist them out.

  • And once you hoist them out, you end up

  • with a transformed loop that looks something like this.

  • What we have is the same outer loop control at the very top.

  • But now, we're doing some address calculations there.

  • And we no longer have those address calculations

  • on the inside.

  • And as a result, those hoisted calculations

  • are performed just once per iteration of the outer loop,

  • rather than once per iteration of the inner loop.

  • And so those instructions are run far fewer times.

  • You get to save a lot of running time.

  • So the effect of this optimization

  • in terms of C code, because it can

  • be a little tedious to look at LLVM IR,

  • is essentially like this.

  • We took this doubly-nested loop in C.

  • We're calling add force of blah, blah, blah, calculate force,

  • blah, blah, blah.

  • And now, we just move the address calculation

  • to get the ith body that we care about.

  • We move that to the outer loop.
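
  • As a rough C sketch of the whole transformation, with body_t,
    force_t, add_force, and calculate_force as hypothetical stand-ins
    for the simulation's actual identifiers:

        typedef struct { double x, y, z; } force_t;
        typedef struct { double px, py, pz, mass; force_t f; } body_t;

        force_t calculate_force(const body_t *a, const body_t *b);
        void add_force(body_t *b, force_t f);

        void calculate_forces(body_t *bodies, int nbodies) {
            for (int i = 0; i < nbodies; i++) {
                body_t *bi = &bodies[i];  /* hoisted: the address of body i
                                             is now computed once per outer
                                             iteration, not once per inner
                                             iteration */
                for (int j = 0; j < nbodies; j++) {
                    if (i != j)
                        add_force(bi, calculate_force(bi, &bodies[j]));
                }
            }
        }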

  • Now, this was an example of loop invariant code motion on just

  • a couple address calculations.

  • In general, the compiler will try

  • to prove that some calculation is invariant across all

  • the iterations of a loop.

  • And whenever it can prove that, it

  • will try to hoist that code out of the loop.

  • If it can get code out of the body of a loop,

  • that reduces the running time of the loop,

  • saves a lot of execution time.

  • Huge bang for the buck.

  • Make sense?

  • Any questions about that so far?

  • All right, so just to summarize this part,

  • what can the compiler do?

  • The compiler optimizes code by performing a sequence

  • of transformation passes.

  • All those passes are pretty mechanical.

  • The compiler goes through the code.

  • It tries to find some property, like this address calculation

  • is the same as that address calculation.

  • or that this load will return the same value as that store,

  • and so on, and so forth.

  • And based on that analysis, it tries

  • to get rid of some dead code, and replace certain register

  • values with other register values,

  • replace things that live in memory with things

  • that just live in registers.
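
  • Here is a tiny made-up C example of the kind of property the
    compiler looks for:

        /* The load of *p must return the value just stored, so the
           compiler replaces the load with the register holding a + 1,
           and t never needs to live in memory at all. */
        int f(int *p, int a) {
            *p = a + 1;    /* store a + 1 through p */
            int t = *p;    /* load forwarded from the store above */
            return t * 2;  /* t stays in a register */
        }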

  • A lot of the transformations resemble Bentley-rule work

  • optimizations that you've seen in lecture two.

  • So as you're studying for your upcoming quiz,

  • you can kind of get two for one by looking

  • at those Bentley-rule optimizations.

  • And one transformation pass in particular, function inlining,

  • was a good example of this.

  • One transformation can enable other transformations.

  • And those together can compound to give you fast code.

  • In general, compilers perform a lot more transformations

  • than just the ones we saw today.

  • But there are things that the compiler can't do.

  • Here's one very simple example.

  • In this case, we're taking another look

  • at this calculate forces routine.

  • Although the compiler can optimize the code

  • by moving address calculations out of the loop,

  • one thing that it can't do is exploit symmetry

  • in the problem.

  • So in this problem, what's going on

  • is we're computing the forces on any pair of bodies

  • using the law of gravitation.

  • And it turns out that the force acting on one body by another

  • is exactly the opposite the force acting on the other body

  • by the one.

  • So F of 12 is equal to minus F of 21.

  • The compiler will not figure that out.

  • The compiler knows algebra.

  • It doesn't know physics.

  • So it won't be able to figure out

  • that there's symmetry in this problem,

  • and avoid the wasted operations.
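
  • By hand, though, you can exploit the symmetry yourself; here is a
    sketch, reusing the hypothetical declarations from the hoisting
    sketch above:

        /* Each unordered pair is visited once, roughly halving the
           number of calls to calculate_force. */
        void calculate_forces_sym(body_t *bodies, int nbodies) {
            for (int i = 0; i < nbodies; i++) {
                for (int j = i + 1; j < nbodies; j++) {
                    force_t f = calculate_force(&bodies[i], &bodies[j]);
                    force_t nf = { -f.x, -f.y, -f.z };  /* Newton's third law */
                    add_force(&bodies[i], f);
                    add_force(&bodies[j], nf);
                }
            }
        }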

  • Make sense?

  • All right, so that was an overview

  • of some simple compiler optimizations.

  • We now have some examples of some case studies

  • to see where the compiler can get tripped up.

  • And it doesn't matter if we get through all of these or not.

  • You'll have access to the slides afterwards.

  • But I think these are kind of cool.

  • So shall we take a look?

  • Simple question-- does the compiler vectorize this loop?

  • So just to go over what this loop does, it's a simple loop.

  • The function takes two vectors as inputs,

  • or two arrays as inputs, I should say--

  • an array called y of length n, and an array x of length n,

  • and some scalar value a.

  • And all that this function does is

  • it loops over each element of the vector, multiplies x of i

  • by the input scalar, and adds the product into y of i.
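
  • Reconstructed in C from that description (the name daxpy and the
    exact signature are a guess, not a copy of the slide):

        void daxpy(double *y, double *x, double a, int n) {
            for (int i = 0; i < n; i++)
                y[i] += a * x[i];  /* multiply x[i] by a, add into y[i] */
        }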

  • So does the loop vectorize?

  • Yes?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: y and x could overlap.

  • And there is no information about whether they overlap.

  • So does it vectorize?

  • We have a vote for no.

  • Anyone think that it does vectorize?

  • You made a very convincing argument.

  • So everyone believes that this loop does not vectorize.

  • Is that true?

  • Anyone uncertain?

  • Anyone unwilling to commit to yes or no right here?

  • All right, a bunch of people are unwilling to commit

  • to yes or no.

  • All right, let's resolve this question.

  • Let's first ask for the report.

  • Let's look at the vectorization report.

  • We compile it.

  • We pass the flags to get the vectorization report.

  • And the vectorization report says, yes, it

  • does vectorize this loop, which is interesting,

  • because we have this great argument that says,

  • but you don't know how these addresses fit in memory.

  • You don't know if x and y overlap with each other.

  • How can you possibly vectorize?

  • Kind of a mystery.

  • Well, if we take a look at the actual compiled code when we

  • optimize this at -O2, it turns out you can pass certain flags

  • to the compiler, and get it to print out not just the LLVM IR,

  • but the LLVM IR formatted as a control flow graph.

  • And the control flow graph for this simple two line function

  • is the thing on the right, which you obviously

  • can't read, because it's a little bit

  • small, in terms of its text.

  • And it seems to have a lot going on.

  • So I took the liberty of redrawing that control flow

  • graph with none of the code inside,

  • just to get a picture of what the structure looks

  • like for this compiled function.

  • And, structurally speaking, it looks like this.

  • And with a bit of practice staring at control flow graphs,

  • which you might get if you spend way too much time working

  • on compilers, you might look at this control flow graph,

  • and think, this graph looks a little too complicated

  • for the two line function that we gave as input.

  • So what's going on here?

  • Well, we've got three different loops in this code.

  • And it turns out that one of those loops

  • is full of vector operations.

  • OK, the other two loops are not full of vector operations.

  • That's unvectorized code.

  • And then there's this basic block right

  • at the top that has a conditional branch

  • at the end of it, branching to either the vectorized loop

  • or the unvectorized loop.

  • And, yeah, there's a lot of other control flow going on

  • as well.

  • But we can focus on just these components for the time being.

  • So what's that conditional branch doing?

  • Well, we can zoom in on just this one basic block,

  • and actually show it to be readable on the slide.

  • And the basic block looks like this.

  • So let's just study this LLVM IR code.

  • In this case, we have got the address of y stored in register 0.

  • The address of x is stored in register 2.

  • And register 3 stores the value of n.

  • So one instruction at a time, who

  • can tell me what the first instruction in this code does?

  • Yes?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: Gets the address of y.

  • Is that what you said?

  • So it does use the address of y.

  • It's an address calculation that operates on register 0, which

  • stores the address of y.

  • But it's not just computing the address of y.

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: It's getting me the address

  • of the nth element of y.

  • It's adding in whatever is in register 3, which is the value

  • n, into the address of y.

  • So that computes the address y plus n.

  • This is testing your memory of pointer arithmetic

  • in C just a little bit.

  • But don't worry.

  • It won't be too rough.

  • So that's what the first address calculation does.
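
  • As a one-line C refresher (end_of is a hypothetical helper):

        /* y + n scales n by sizeof(double), producing the address
           one past the last of y's n elements. */
        double *end_of(double *y, int n) { return y + n; }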

  • What does the next instruction do?

  • AUDIENCE: It does x plus n.

  • TAO B. SCHARDL: That computes x plus n, very good.

  • How about the next one?

  • AUDIENCE: It compares whether x plus n and y plus n

  • are the same.

  • TAO B. SCHARDL: It compares x plus n, versus y plus n.

  • AUDIENCE: [INAUDIBLE] compares the 33, which is x plus n,

  • and compares it to y.

  • So if x plus n is bigger than y, there's overlap.

  • TAO B. SCHARDL: Right, so it does a comparison.

  • We'll take that a little more slowly.

  • It does a comparison of x plus n versus y, and checks:

  • Is x plus n greater than y?

  • Perfect.

  • How about the next instruction?

  • Yeah?

  • AUDIENCE: It compares y plus n versus x.

  • TAO B. SCHARDL: It compares y plus n,

  • versus x: is y plus n even greater than x?

  • How about the last instruction before the branch?

  • Yep, go for it?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: [INAUDIBLE] one of the results.

  • So this computes the bitwise AND of the two comparisons:

  • is x plus n greater than y,

  • and is y plus n greater than x?

  • Fair enough.

  • So what does the result of that condition mean?

  • I think we've pretty much already spoiled the answer.

  • Anyone want to hear it one last time?

  • We had this whole setup.

  • Go for it.

  • AUDIENCE: They overlap.

  • TAO B. SCHARDL: Checks if they overlap.

  • So let's look at this condition in a couple

  • of different situations.

  • If we have x living in one place in memory,

  • and y living in another place in memory,

  • then no matter which of them comes first,

  • if we check whether both y plus n is greater than x

  • and x plus n is greater than y, the result will be false.

  • But if we have this situation, where

  • x and y overlap in some portion of memory,

  • then it turns out that regardless of whether x or y is

  • first, x plus n will be greater than y,

  • and y plus n will be greater than x.

  • And the condition will return true.

  • In other words, the condition returns true,

  • if and only if the portions of memory pointed to by x and y

  • alias.
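
  • In C terms, the check the compiler generated amounts to something
    like this sketch (ranges_overlap is a hypothetical name):

        /* The compiler emits this test at the IR level, where comparing
           unrelated pointers is fine, even though it isn't strictly
           portable C source. */
        int ranges_overlap(const double *x, const double *y, int n) {
            return (x + n > y) && (y + n > x);  /* true iff the two
                                                   n-element ranges alias */
        }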

  • So going back to our original looping code,

  • we have a situation where we have a branch based on

  • whether or not they alias.

  • And in one case, it executes the vectorized loop.

  • And in the other case, it executes the non-vectorized code.

  • In particular, it executes the vectorized loop

  • if x and y don't alias.

  • So returning to our original question,

  • does this code get vectorized?

  • The answer is yes and no.

  • So if you voted yes, you're actually right.

  • If you voted no, and you were persuaded, you were right.

  • And if you didn't commit to an answer, I can't help you.

  • But that's interesting.

  • The compiler actually generated multiple versions of this loop,

  • due to uncertainty about memory aliasing.

  • Yeah, question?

  • AUDIENCE: [INAUDIBLE]

  • TAO B. SCHARDL: So the question is, could the compiler

  • figure out this condition statically

  • while it's compiling the function?

  • Because we know the function is going to get

  • called from somewhere.

  • The answer is, sometimes it can.

  • A lot of times it can't.

  • If it's not capable of inlining this function,

  • for example, then it probably doesn't have enough information

  • to tell whether or not these two pointers will alias.

  • For example, maybe you're just building a library

  • with a bunch of vector routines.

  • You don't know the code that's going to call

  • this routine eventually.

  • Now, in general, memory aliasing,

  • this will be the last point before we wrap up, in general,

  • memory aliasing can cause a lot of issues

  • when it comes to compiler optimization.

  • It can cause the compiler to act very conservatively.

  • In this example, we have a simple serial base case

  • for a matrix multiply routine.

  • But we don't know anything about the pointers

  • to the C, A, or B matrices.

  • And when we try to compile this and optimize it,

  • the compiler complains that it can't do loop invariant code

  • motion, because it doesn't know anything about these pointers.

  • It could be that the pointer changes

  • within the innermost loop.

  • So it can't move some calculation out

  • to an outer loop.
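
  • A rough reconstruction of that base case, with guessed variable
    names:

        /* With plain, unannotated pointers, the compiler must assume
           the store to C might alias A or B, which prevents it from
           keeping C[i*n + j] in a register across the k loop. */
        void matmul_base(double *C, double *A, double *B, int n) {
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        C[i * n + j] += A[i * n + k] * B[k * n + j];
        }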

  • Compilers try to deal with this statically using an analysis

  • technique called alias analysis.

  • And they do try very hard to figure out,

  • when are these pointers going to alias?

  • Or when are they guaranteed to not alias?

  • Now, in general, it turns out that alias analysis

  • isn't just hard.

  • It's undecidable.

  • If only it were hard, maybe we'd have some hope.

  • But compilers, in practice, are faced

  • with this undecidable question.

  • And they try a variety of tricks to get useful alias analysis

  • results in practice.

  • For example, based on information in the source code,

  • the compiler might annotate instructions

  • with various metadata to track this aliasing information.

  • For example, TBAA is aliasing information based on types.

  • There's some scoping information for aliasing.

  • There is some information that says

  • it's guaranteed not to alias with this other operation,

  • all kinds of metadata.
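
  • As a small illustration of the type-based rule that TBAA encodes
    (a made-up example):

        /* Under C's strict aliasing rules, an int object and a double
           object can't be the same storage, so the compiler may assume
           *a and *b don't alias. */
        int g(int *a, double *b) {
            int t = *a;
            *b = 3.0;       /* by TBAA, this store can't change *a */
            return t + *a;  /* so *a can be reused from a register */
        }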

  • Now, what can you do as a programmer

  • to avoid these issues of memory aliasing?

  • Always annotate your pointers, kids.

  • Always annotate your pointers.

  • The restrict keyword you've seen before.

  • It tells the compiler, address calculations based off

  • this pointer won't alias with address calculations

  • based off other pointers.

  • The const keyword provides a little more information.

  • It says, these addresses will only be read from.

  • They won't be written to.

  • And that can enable a lot more compiler optimizations.
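
  • Applied to the earlier daxpy-style sketch, the annotations look
    like this:

        /* restrict promises the compiler that accesses through x and y
           don't overlap, and const promises that x is only read, so the
           loop can be vectorized without the runtime aliasing check. */
        void daxpy_annotated(double *restrict y,
                             const double *restrict x,
                             double a, int n) {
            for (int i = 0; i < n; i++)
                y[i] += a * x[i];
        }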

  • Now, that's all the time that we have.

  • There are a couple of other cool case studies in the slides.

  • You're welcome to peruse the slides afterwards.

  • Thanks for listening.
