

  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high quality educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • CHARLES LEISERSON: So today, we're

  • going to talk about assembly language and computer

  • architecture.

  • It's interesting these days, most software courses

  • don't bother to talk about these things.

  • And the reason is that, as much as possible, people

  • have been insulated from performance considerations

  • when writing their software.

  • But if you want to write fast code,

  • you have to know what is going on underneath so you

  • can exploit the strengths of the architecture.

  • And the interface, the best interface, that we have to that

  • is the assembly language.

  • So that's what we're going to talk about today.

  • So when you take a particular piece of code

  • like fib here, to compile it you run it through Clang,

  • as I'm sure you're familiar at this point.

  • And what it produces is binary machine language

  • that the computer's hardware is programmed

  • to interpret and execute.

  • It looks at the bits as instructions as opposed to as

  • data.

  • And it executes them.

  • And that's what we see when we execute.

  • This process is not one step.

  • There are actually four stages to compilation:

  • preprocessing, compiling-- sorry, for the redundancy,

  • that's sort of a bad name conflict,

  • but that's what they call it--

  • assembling and linking.

  • So I want to take us through those stages.

  • So the first thing that goes through

  • is you go through a preprocess stage.

  • And you can invoke that with Clang manually.

  • So you can say, for example, if you

  • do clang -E, that will run the preprocessor

  • and nothing else.

  • And you can take a look at the output

  • there and look to see how all your macros got expanded

  • and such before the compilation actually goes through.

  • Then you compile it.

  • And that produces assembly code.

  • So assembly is a mnemonic structure of the machine code

  • that makes it more human readable than the machine

  • code itself would be.

  • And once again, you can produce the assembly yourself

  • with clang -S.

  • And then finally, penultimately maybe,

  • you can assemble that assembly language code

  • to produce an object file.

  • And since we like to have separate compilations,

  • you don't have to compile everything

  • as one big monolithic hunk.

  • Then there's typically a linking stage

  • to produce the final executable.

  • And for that we are using ld for the most part.

  • We're actually using the gold linker,

  • but ld is the command that calls it.

  • So let's go through each of those steps

  • and see what's going on.
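To make those stages concrete, here is a minimal sketch. The fib source itself isn't shown in the transcript, so this version of it is an assumption; the commands are the ones named above, with illustrative file names.

```c
/* fib.c -- a plausible version of the lecture's running example
 * (assumed; the transcript doesn't show the actual source).
 *
 * The four stages, using the commands described above:
 *   clang -E fib.c -o fib.i    # preprocess (expand macros and includes)
 *   clang -S fib.i -o fib.s    # compile to assembly
 *   clang -c fib.s -o fib.o    # assemble to an object file
 *   clang fib.o -o fib         # link (clang invokes ld/gold)
 */
#include <stdio.h>
#include <stdint.h>

int64_t fib(int64_t n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

int main(void) {
    printf("%lld\n", (long long)fib(10));  /* prints 55 */
    return 0;
}
```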

  • So first, the preprocessing is really straightforward.

  • So I'm not going to do that.

  • That's just a textual substitution.

  • The next stage is the source code to assembly code.

  • So when we do clang -S, we get

  • this symbolic representation.

  • And it looks something like this, where we

  • have some labels on the side.

  • And we have some operations, and we have some directives.

  • And then we have a lot of gibberish,

  • which won't seem like so much gibberish

  • after you've played with it a little bit.

  • But to begin with, it looks kind of like gibberish.

  • From there, we assemble that assembly code and that

  • produces the binary.

  • And once again, you can invoke it just by running Clang.

  • Clang will recognize that it doesn't have a C file or a C++

  • file.

  • It says, oh, goodness, I've got an assembly language file.

  • And it will produce the binary.

  • Now, the other thing that turns out to be the case

  • is that assembly and machine code

  • are really very similar in structure.

  • Just things like the op codes, which

  • are the things that are here in blue or purple,

  • whatever that color is, like these guys,

  • those correspond to specific bit patterns over here

  • in the machine code.

  • And these are the addresses and the registers that we're

  • operating on, the operands.

  • Those correspond to other bit codes over there.

  • And there's very much a--

  • it's not exactly one to one, but it's pretty close to one to one

  • compared to if you had C and you look at the binary,

  • it's like way, way different.

  • So one of the things that turns out you can do is if you have

  • the machine code, and especially if the machine code was

  • produced with so-called debug symbols--

  • that is, it was compiled with -g--

  • you can use this program called objdump,

  • which will produce a disassembly of the machine code.

  • So it will tell you, OK, here's what the mnemonic, more

  • human readable code is, the assembly code, from the binary.
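As a sketch of that workflow, with hypothetical file names (the flags named are standard clang/objdump options):

```c
/* Disassembling machine code, as described above:
 *   clang -g -c fib.c -o fib.o   # compile with debug symbols (-g)
 *   objdump -d fib.o             # print a disassembly of the machine code
 * The -g symbols make the disassembly easier to relate back to the source.
 */
```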

  • And that's really useful, especially

  • if you're trying to do things--

  • well, let's see why do we bother looking at the assembly?

  • So why would you want to look at the assembly of your program?

  • Does anybody have some ideas?

  • Yeah.

  • AUDIENCE: [INAUDIBLE] made or not.

  • CHARLES LEISERSON: Yeah, you can see

  • whether certain optimizations are made or not.

  • Other reasons?

  • Everybody is going to say that one.

  • OK.

  • Another one is-- well, let's see, so here's some reasons.

  • The assembly reveals what the compiler did and did not do,

  • because you can see exactly what the assembly is that is going

  • to be executed as machine code.

  • The second reason, which turns out

  • to happen more often than you would think,

  • is that, hey, guess what, the compiler

  • is a piece of software.

  • It has bugs.

  • So your code isn't operating correctly.

  • Oh, goodness, what's going on?

  • Maybe the compiler made an error.

  • And we have certainly found that, especially when you

  • start using some of the less frequently used features

  • of a compiler.

  • You may discover, oh, it's actually not

  • that well broken in.

  • And as it mentions here, the bug may only have an effect when

  • compiling at -O3, but if you compile at -O0 or -O1,

  • everything works out just fine.

  • So then it says, gee, somewhere in the optimizations,

  • they did an optimization wrong.

  • So one of the first principles of optimization is do it right.

  • And then the second is make it fast.

  • And so sometimes the compiler doesn't do that.

  • It's also the case that sometimes you cannot write code

  • that produces the assembly that you want.

  • And in that case, you can actually

  • write the assembly by hand.

  • Now, it used to be many years ago--

  • many, many years ago--

  • that a lot of software was written in assembly.

  • In fact, my first job out of college,

  • I spent about half the time programming

  • in assembly language.

  • And it's not as bad as you would think.

  • But it certainly is easier to have high-level languages

  • that's for sure.

  • You get a lot more done a lot quicker.

  • And the last reason is reverse engineering.

  • You can figure out what a program does when you don't

  • have access to its source, so, for example,

  • the matrix multiplication example that I gave on day 1.

  • You know, we had the overall outer structure,

  • but the inner loop, we could not match the Intel math kernel

  • library code.

  • So what do we do?

  • We didn't have the source for it.

  • We looked to see what it was doing.

  • We said, oh, is that what they're doing?

  • And then we're able to do it ourselves

  • without having to get the source from them.

  • So we reverse engineered what they did.

  • So all those are good reasons.

  • Now, in this class, we have some expectations.

  • So one thing is, you know, assembly is complicated

  • and you needn't memorize the manual.

  • In fact, the manual has over 1,000 pages.

  • It's like-- but here's what we do expect of you.

  • You should understand how a compiler implements

  • various C linguistic constructs with x86 instructions.

  • And that's what we'll see in the next lecture.

  • And you should be able to read x86 assembly

  • language with the aid of an architecture manual.

  • And on a quiz, for example, we would give you snippets

  • or explain what the opcodes that are being

  • used do, in case it's not clear.

  • But you should have some understanding of that,

  • so you can see what's actually happening.

  • You should understand the high-level performance

  • implications of common assembly patterns.

  • OK, so what does it mean to do things

  • in a particular way in terms of performance?

  • So some of them are quite obvious.

  • Vector operations tend to be faster

  • than doing the same thing with a bunch of scalar operations.

  • If you do write assembly, typically what we use

  • is there are a bunch of compiler intrinsic functions, built-ins,

  • so-called, that allow you to use the assembly language

  • instructions.

  • And you should be after we've done this able to write code

  • from scratch if the situation demands it sometime

  • in the future.

  • We won't do that in this class, but we

  • expect that you will be in a position to do that after--

  • you should get a mastery to the level

  • where that would not be impossible for you to do.

  • You'd be able to do that with a reasonable amount of effort.

  • So the rest of the lecture here is

  • I'm going to first start by talking about the instruction

  • set architecture of the x86-64, which

  • is the one that we are using for the cloud machines

  • that we're using.

  • And then I'm going to talk about floating point in vector

  • hardware and then I'm going to do an overview of computer

  • architecture.

  • Now, all of this I'm doing-- this is a software class, right?

  • Software performance engineering we're doing.

  • So the reason we're doing this is

  • so you can write code that better matches the hardware,

  • and therefore better exploits it.

  • In order to do that, I could give things at a high-level.

  • My experience is that if you really

  • want to understand something, you

  • want to understand it to the level that's necessary

  • and then one level below that.

  • It's not that you'll necessarily use that one level below it,

  • but that gives you insight as to why that layer is what it is

  • and what's really going on.

  • And so that's kind of what we're going to do.

  • We're going to do a dive that takes

  • us one level beyond what you probably

  • will need to know in the class, so that you

  • have a robust foundation for understanding.

  • Does that make sense?

  • That's part of my learning philosophy:

  • you know, go one step beyond.

  • And then you can come back.

  • The ISA primer, so the ISA talks about the syntax and semantics

  • of assembly.

  • There are four important concepts

  • in the instruction set architecture--

  • the notion of registers, the notion of instructions,

  • the data types, and the memory addressing modes.

  • And those are sort of indicated.

  • For example, here, we're going to go through those one by one.

  • So let's start with the registers.

  • So the registers are where the processor stores things.

  • And there are a bunch of x86 registers,

  • so many that you don't need to know most of them.

  • The ones that are important are these.

  • So first of all, there are general purpose registers.

  • And those typically have width 64.

  • And there are many of those.

  • There is a so-called flags register, called RFLAGS,

  • which keeps track of things like whether there

  • was an overflow, whether the last arithmetic

  • operation resulted in a zero, whether there

  • was a carry out of the word, or what have you.

  • The next one is the instruction pointer.

  • So the assembly language is organized

  • as a sequence of instructions.

  • And the hardware marches linearly

  • through that sequence, one after the other,

  • unless it encounters a conditional jump

  • or an unconditional jump, in which case

  • it'll branch to whatever the location is.

  • But for the most part, it's just running straight

  • through memory.

  • Then there are some registers that

  • were added quite late in the game, namely the SSE registers

  • and the AVX registers.

  • And these are vector registers.

  • So the XMM registers, from when they first did vectorization,

  • are 128 bits.

  • For AVX, there are also the YMM registers.

  • And in the most recent processors,

  • which we're not using this term, there's

  • another level of AVX that gives you 512-bit registers.

  • Maybe we'll use that for the final project,

  • because it's just like a little more power for the game playing

  • project.

  • But for most of what you'll be doing,

  • we'll just be keeping to the C4 instances in AWS

  • that you guys have been using.

  • Now, the x86-64 didn't start out as x86-64.

  • It started out as x86.

  • And it was used for machines, in particular the 8086,

  • which had a 16-bit word.

  • So really short.

  • How many things can you index with a 16-bit word?

  • About how many?

  • AUDIENCE: 65,000.

  • CHARLES LEISERSON: Yeah, about 65,000.

  • 65,536 words you can address, or bytes.

  • This is byte addressing.

  • So that's 65k bytes that you can address.

  • How could they possibly use that for machines?

  • Well, the answer is that's how much memory was on the machine.

  • You didn't have gigabytes.

  • So as the machines--

  • as Moore's law marched along and we got more and more memory,

  • then the words had to become wider to be able to index them.

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • CHARLES LEISERSON: Yeah, but here's

  • the thing is if you're building stuff that's too expensive

  • and you can't get memory that's big enough, then

  • if you build a wider word, like if you build a word of 32 bits,

  • then your processor just cost twice as much

  • as the next guy's processor.

  • So instead, what they did is they went along as long as that

  • was the common size, and then had some growth pains

  • and went to 32.

  • And from there, they had some more growth pains

  • and went to 64.

  • OK, those are two separate things.

  • And, in fact, they did some really weird stuff.

  • So what they did in fact is when they made these longer

  • registers, they have registers that are

  • aliased to exactly the same thing for the lower bits.

  • So they can address them either by a byte--

  • so these registers all have the same--

  • you can do the lower and upper half of the short word,

  • or you can do the 32-bit word or you can do the 64-bit word.

  • And that's just like if you're doing this today,

  • you wouldn't do that.

  • You wouldn't have all these registers that alias and such.

  • But that's what they did because this is history, not design.

  • And the reason was that when they were

  • doing that, they were not designing for the long term.

  • Now, are we going to go to 128-bit addressing?

  • Probably not.

  • 64 bits addresses a spectacular amount of stuff.

  • You know, not quite as many--

  • 2 to the 64th is what?

  • Is like how many gazillions?

  • It's a lot of gazillions.

  • So, yeah, we're not going to have to go beyond 64 probably.

  • So here are the general purpose registers.

  • And as I mentioned, they have different names,

  • but they cover the same thing.

  • So if you change eax, for example, that also changes rax.

  • And so you see they originally all had functional purposes.

  • Now, they're all pretty much the same thing,

  • but the names have stuck because of history.

  • Instead of calling them registers

  • 0, register 1, or whatever, they all have these funny names.

  • Some of them still are used for a particular purpose,

  • like rsp is used as the stack pointer.

  • And rbp is used to point to the base of the frame,

  • for those who remember their 6.004 stuff.

  • So anyway, there are a whole bunch of them.

  • And they're different names depending

  • upon which part of the register you're accessing.

  • Now, the format of an x86-64 instruction code

  • is to have an opcode and then an operand list.

  • And the operand list is typically 0, 1, 2, or rarely

  • 3 operands separated by commas.

  • Typically, all operands are sources

  • and one operand might also be the destination.

  • So, for example, if you take a look at this add instruction,

  • the operation is an add.

  • And the operand list is these two registers.

  • One is edi and the other is ecx.

  • And the destination is the second one.

  • When you add-- in this case, what's going on

  • is it's taking the value in ecx, adding the value in edi

  • into it.

  • And the result is in ecx.

  • Yes?

  • AUDIENCE: Is there a convention for where the destination

  • [INAUDIBLE]

  • CHARLES LEISERSON: Funny you should ask.

  • Yes.

  • So what does op A, B mean?

  • It turns out naturally that the literature

  • is inconsistent about how it refers to operations.

  • And there's two major ways that are used.

  • One is the AT&T syntax, and the other is the Intel syntax.

  • So in the AT&T syntax, the second

  • operand-- the last operand-- is the destination.

  • In the Intel syntax, the first operand is the destination.

  • OK, is that confusing?

  • So almost all the tools that we're going to use

  • are going to use the AT&T syntax.

  • But you will read documentation, which is Intel documentation.

  • It will use the other syntax.

  • Don't get confused.

  • OK?

  • I can't help-- it's like I can't help

  • that this is the way the state of the world is.

  • OK?
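Here is a minimal runnable sketch of the difference (register names are whatever the compiler picks; the Intel form is shown only in the comment):

```c
#include <stdio.h>

/* AT&T syntax (what clang emits by default, and what gcc-style inline asm
 * uses) puts the destination LAST. The same add written in Intel syntax
 * would read "add eax, ecx", with the destination FIRST. */
int main(void) {
    int a = 1, b = 2;
    __asm__("addl %1, %0"    /* AT&T: a = a + b, destination last */
            : "+r"(a)
            : "r"(b));
    printf("%d\n", a);       /* prints 3 */
    return 0;
}
```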

  • Yeah?

  • AUDIENCE: Are there tools that help [INAUDIBLE]

  • CHARLES LEISERSON: Oh, yeah.

  • In particular, if you could compile it and undo,

  • but I'm sure there's--

  • I mean, this is not a hard translation thing.

  • I'll bet if you just Google, you can in two minutes,

  • in two seconds, find somebody who will translate

  • from one to the other.

  • This is not a complicated translation process.

  • Now, here are some very common x86 opcodes.

  • And so let me just mention a few of these,

  • because these are ones that you'll often see in the code.

  • So move, what do you think move does?

  • AUDIENCE: Moves something.

  • CHARLES LEISERSON: Yeah, it puts something in one

  • register into another register.

  • Of course, when it moves it, this

  • is computer science move, not real move.

  • When I move my belongings in my house to my new house,

  • they're no longer in the old place, right?

  • But in computer science, for some reason, when we move

  • things we leave a copy behind.

  • So they may call it move, but--

  • AUDIENCE: Why don't they call it copy?

  • CHARLES LEISERSON: Yeah, why don't they call it copy?

  • You got me.

  • OK, then there's conditional move.

  • So this is move based on a condition--

  • and we'll see some of the ways that this is--

  • like move if flag is equal to 0 and so forth, so

  • basically conditional move.

  • It doesn't always do the move.

  • Then you can extend the sign.

  • So, for example, suppose you're moving from a 32-bit

  • register into a 64-bit register.

  • Then the question is, what happens to high order bits?

  • So there's two basic mechanisms that can be used.

  • Either it can be filled with zeros,

  • or remember that the first bit, or the leftmost bit as we

  • think of it, is the sign bit, from our lecture on binary.

  • That bit will be extended through the high order

  • part of the word, so that the whole number if it's negative

  • will be negative and if it's positive,

  • it'll be zeros and so forth.

  • Does that make sense?

  • Then there are things like push and pop to do stacks.

  • There's a lot of integer arithmetic.

  • There's addition, subtraction, multiplication, division,

  • various shifts, address calculation shifts, rotations,

  • incrementing, decrementing, negating, etc.

  • There's also a lot of binary logic, AND, OR, XOR, NOT.

  • Those are all doing bitwise operations.

  • And then there is Boolean logic, like testing

  • to see whether some value has a given value or comparing.

  • There's unconditional jump, which is jump.

  • And there's conditional jumps, which is jump with a condition.

  • And then things like subroutines.

  • And there are a bunch more, which the manual will have

  • and which will undoubtedly show up.

  • Like, for example, there's the whole set of vector operations

  • we'll talk about a little bit later.

  • Now, the opcodes may be augmented

  • with a suffix that describes the data type of the operation

  • or a condition code.

  • OK, so opcodes for data movement, arithmetic, or logic

  • use a single-character suffix to indicate the data type.

  • And if the suffix is missing, it can usually be inferred.

  • So take a look at this example.

  • So this is a move with a q at the end.

  • What do you think q stands for?

  • AUDIENCE: Quad words?

  • CHARLES LEISERSON: Quad word.

  • OK, how many bytes in a quad word?

  • AUDIENCE: Eight.

  • CHARLES LEISERSON: Eight.

  • That's because originally it started out with a 16-bit word.

  • So they said a quad word was four of those 16-bit words.

  • So that's 8 bytes.

  • You get the idea, right?

  • But let me tell you this is all over the x86 instruction set.

  • All these historical things and all these

  • mnemonics that if you don't understand what they really

  • mean, you can get very confused.

  • So in this case, we're moving a 64-bit integer,

  • because a quad word has 8 bytes or 64 bits.

  • This is one of my--

  • it's like whenever I prepare this lecture,

  • I just go into spasms of laughter, as I look

  • and I say, oh, my god, they really did that like.

  • For example, on the last page, when I did subtract.

  • So the sub-operator, if it's a two argument operator,

  • it subtracts the--

  • I think it's the first and the second.

  • But there is no way of subtracting the other way

  • around.

  • It puts the destination in the second one.

  • It basically takes the second one minus the first one

  • and puts that in the second one.

  • But if you wanted to have it the other way around,

  • to save yourself a cycle--

  • anyway, it doesn't matter.

  • You can't do it that way.

  • And all this stuff the compiler has to understand.

  • So here are the x86-64 data types.

  • The way I've done it is to show you the difference between C

  • and x86-64, so for example, here are the declarations in C.

  • So there's a char, a short, int, unsigned int, long, etc.

  • Here's an example of a C constant

  • that does those things.

  • And here's the size in bytes that you

  • get when you declare that.

  • And then the assembly suffix is one of these things.

  • So in the assembly, it says b for a byte, w for a word, l or d

  • for a double word, q for a quad word, i.e.

  • 8 bytes, single precision, double precision,

  • extended precision.

  • So sign extension uses two data type suffixes.

  • So here's an example.

  • So the first one says we're going to move.

  • And now you see I can't read this without my cheat sheet.

  • So what is this saying?

  • This is saying, we're going to move with a zero-extend.

  • And it's going to be the first operand is a byte,

  • and the second operand is a long.

  • Is that right?

  • If I'm wrong, it's like I got to look at the chart too.

  • And, of course, we don't hold you to that.

  • But the z there says extends with zeros.

  • And the S says preserve the sign.

  • So that's the things.

  • Now, that would all be all well and good,

  • except that then what they did is, if you do 32-bit operations

  • that write into a 64-bit register,

  • it implicitly zero-extends the result.

  • If you do it for smaller values and you store them in,

  • it simply overwrites the low part of those registers.

  • It doesn't touch the high order bits.

  • But when they did the 32 to 64-bit extension

  • of the instruction set, they decided

  • that they wouldn't do what had been done in the past.

  • And they decided that they would zero-extend things,

  • unless there was something explicit to the contrary.
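A small sketch of that implicit zero-extension at the C level (the codegen claim is typical, not guaranteed):

```c
#include <stdint.h>

/* A 32-bit operation on a register implicitly clears the upper 32 bits of
 * the corresponding 64-bit register, so this cast typically compiles to a
 * single 32-bit mov with no extra masking (compilers vary). */
uint64_t low32(uint64_t x) {
    return (uint32_t)x;   /* upper 32 bits become zero */
}
```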

  • You got me, OK.

  • Yeah, I have a friend who worked at Intel.

  • And he had a joke about the Intel instructions set.

  • You'll discover the Intel instruction

  • set is really complicated.

  • He says, here's the idea of the Intel instruction set.

  • He said, to become an Intel fellow,

  • you need to have an instruction in the Intel instruction set.

  • You have an instruction that you invented

  • and that's now used in Intel.

  • He says nobody becomes an Intel fellow

  • for removing instructions.

  • So it just sort of grows and grows and grows and gets more

  • and more complicated for each thing.

  • Now, once again, for extension, you can sign-extend.

  • And here's two examples.

  • In one case, moving an 8-bit integer to a 32-bit integer

  • and zero-extending it versus preserving the sign.
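Here is a sketch of how those two extensions arise from C conversions (instruction names as described above; exact codegen varies):

```c
#include <stdint.h>

/* Zero-extend: an unsigned source fills the high bits with zeros
 * (typically a movz instruction). Sign-extend: a signed source replicates
 * the sign bit through the high bits (typically a movs instruction). */
int32_t zext_byte(uint8_t x) { return x; }   /* typically movzbl */
int32_t sext_byte(int8_t  x) { return x; }   /* typically movsbl */
```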

  • Conditional jumps and conditional moves

  • also use suffixes to indicate the condition code.

  • So here, for example, the ne indicates the jump should only

  • be taken if the arguments of the previous comparison

  • are not equal.

  • So ne is not equal.

  • So you do a comparison, and that's

  • going to set a flag in the RFLAGS register.

  • Then the jump will look at that flag

  • and decide whether it's going to jump or not or just continue

  • the sequential execution of the code.

  • And there are a bunch of things that you can

  • jump on which are status flags.

  • And you can see the names here.

  • There's Carry.

  • There's Parity.

  • Parity is the XOR of all the bits in the word.

  • Adjust, I don't even know what that's for.

  • There's the Zero flag.

  • It tells whether it's a zero.

  • There's a Sign flag, whether it's positive or negative.

  • There's a Trap flag and Interrupt enable and Direction,

  • Overflow.

  • So anyway, you can see there are a whole bunch of these.

  • So, for example here, this is going to decrement rbx.

  • And then it sets the Zero flag if the result is zero.

  • And then the jump, the conditional jump,

  • jumps to the label if the ZF flag is not set, in this case.

  • OK, it make sense?

  • After a fashion.

  • Doesn't make rational sense, but it does make sense.

  • Here are the main ones that you're going to need.

  • The Carry flag is whether you got a carry or a borrow out

  • of the most significant bit.

  • The Zero flag is set if the result of the ALU operation was 0.

  • The Sign flag is set if

  • the last ALU operation had the sign bit set.

  • And the Overflow flag says it resulted

  • in arithmetic overflow.

  • The condition codes are--

  • if you put one of these condition codes

  • on your conditional jump or whatever,

  • this tells you exactly what the flag is that is being set.

  • So, for example, the easy ones are if it's equal.

  • But there are some other ones there.

  • So, for example, why do

  • the condition codes e and ne check the Zero flag?

  • And the answer is typically, rather

  • than having a separate comparison, what they've done

  • is separate the branch from the comparison itself.

  • But it also needn't be a compare instruction.

  • It could be the result of the last arithmetic

  • operation was a zero, and therefore it

  • can branch without having to do a comparison with zero.

  • So, for example, if you have a loop

  • where you're decrementing a counter till it gets to 0,

  • it's actually faster by one instruction

  • to test whether the loop index hits 0

  • than it is if you have the loop going up to n, and then

  • every time through the loop having to compare with n

  • before you can branch.
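A sketch of the two loop shapes in C (the codegen claims are typical, not guaranteed):

```c
/* Counting down lets the decrement itself set the Zero flag, so the
 * compiled loop can branch with jne and no separate cmp against n
 * (typical codegen; compilers vary). */
long sum_down(const long *a, long n) {
    long s = 0;
    for (long i = n; i > 0; i--)
        s += a[i - 1];
    return s;
}

/* Counting up needs a compare against n every iteration before the branch. */
long sum_up(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}
```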

  • So these days that optimization doesn't mean anything,

  • because, as we'll talk about in a little bit,

  • these machines are so powerful, that doing an extra integer

  • arithmetic like that probably has

  • no bearing on the overall cost.

  • Yeah?

  • AUDIENCE: So this instruction doesn't take arguments?

  • It just looks at the flags?

  • CHARLES LEISERSON: Just looks at the flags, yep.

  • Just looks at the flags.

  • It doesn't take any arguments.

  • Now, the next aspect of this is you can give registers,

  • but you also can address memory.

  • And there are three direct addressing modes and three

  • indirect addressing modes.

  • At most, one operand may specify a memory address.

  • So here are the direct addressing modes.

  • So for immediate what you do is you give it a constant,

  • like 172, random constant, to store into the register,

  • in this case.

  • That's called an immediate.

  • What happens if you look at the instruction,

  • if you look at the machine language,

  • 172 is right in the instruction.

  • It's right in the instruction, that number 172.

  • Register says we'll move the value from the register,

  • in this case, %cx.

  • And then the index of the register is put in that part.

  • And direct memory says use a particular memory location.

  • And you can give a hex value.

  • When you do direct memory, it's going

  • to use the value at that place in memory.

  • And indicating that memory address is going to take you,

  • on a 64-bit machine, 64 bits-- 8 bytes-- to specify that memory.

  • Whereas, for example, the 172 in the movq will fit in 1 byte.

  • And so I'll have spent a lot less storage in order to do it.

  • Plus, I can do it directly from the instruction stream.

  • And I avoid having an access to memory,

  • which is very expensive.

  • So how many cycles does it take if the value that you're

  • fetching from memory is not in cache

  • or in a register?

  • If I'm fetching something from memory,

  • how many cycles of the machine does

  • it typically take these days?

  • Yeah.

  • AUDIENCE: A few hundred?

  • CHARLES LEISERSON: Yeah, a couple of hundred or more,

  • yeah, a couple hundred cycles.

  • To fetch something from memory.

  • Memory is so slow.

  • Or rather, it's that the processors are so fast.

  • And so clearly, if you can get things into registers,

  • most registers you can access in a single cycle.

  • So we want to move things close to the processor,

  • operate on them, shove them back.

  • And while we pull things from memory,

  • we want other things to be working on.

  • And so the hardware is all organized to do that.

  • Now, of course, we spend a lot of time

  • fetching stuff from memory.

  • And that's one reason we use caching.

  • And we'll have a big thing--

  • caching is really important.

  • We're going spend a bunch of time

  • on how to get the best out of your cache.

  • There's also indirect addressing.

  • So instead of just giving a location,

  • you say, oh, let's go to some other place,

  • for example, a register, and get the value

  • and the address is going to be stored in that location.

  • So, for example here, register indirect says, in this case,

  • move the contents of rax into--

  • sorry, the contents of rax is the address of the thing

  • that you're going to move into rdi.

  • So if rax was location 172, then it

  • would take whatever is in location 172 and put it in rdi.

  • Register indexed says, well, do the same thing,

  • but while you're at it, add an offset.

  • So once again, if rax had 172, in this case

  • it would go to 344 to fetch the value out

  • of that location 344 for this particular instruction.

  • And then instruction-pointer relative,

  • instead of indexing off of a general purpose

  • register, you index off the instruction pointer.

  • That usually happens in the code where the code is--

  • for example, you can jump to where

  • you are in the code plus four instructions.

  • So you can jump down some number of instructions in the code.

  • Usually, you'll see that used only with control flow,

  • because you're talking about locations in the code.

  • But sometimes they'll put some data in the instruction stream.

  • And then it can index off the instruction pointer

  • to get those values without having

  • to soil another register.

  • Now, the most general form is base-index-scale-displacement

  • addressing.

  • Wow.

  • This is a move that has a constant plus three terms.

  • And this is the most complicated instruction that is supported.

  • The mode computes an address starting from whatever the base is.

  • So the base is a general purpose register, in this case, rdi.

  • And then it adds the index times the scale.

  • So the scale is 1, 2, 4, or 8.

  • And then a displacement, which is that number on the front.

  • And this gives you very general indexing

  • of things off of a base point.

  • You'll often see this kind of accessing

  • when you're accessing stack memory,

  • because you can say, here

  • is the base of my frame on the stack, and now for anything

  • that I want to access, I'm going to be going up a certain amount.

  • I may be scaling by a certain amount

  • to get the value that I want.
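Here is a sketch of a C access that typically compiles to this most general mode (codegen varies):

```c
/* The access below typically compiles to base-index-scale-displacement
 * addressing, something like
 *   movq 16(%rdi,%rsi,8), %rax    ; address = base + index*8 + 16
 * (typical codegen; register choice and exact form vary). */
typedef struct {
    long header[2];    /* 16 bytes before the array */
    long data[10];
} Record;

long get(const Record *r, long i) {
    return r->data[i]; /* base = r, index = i, scale = 8, displacement = 16 */
}
```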

  • So once again, you will become familiar with a manual.

  • You don't have to memorize all these,

  • but you do have to understand that there

  • are a lot of these complex addressing modes.

  • The jump instructions take a label

  • as their operand, which identifies

  • a location in the code.

  • For this, the labels can be symbols.

  • In other words, you can say here's a symbol

  • that I want to jump to.

  • It might be the beginning of a function,

  • or it might be a label that's generated

  • to be at the beginning of a loop or whatever.

  • They can be exact addresses-- go to this place in the code.

  • Or they can be relative address-- jump to some place

  • as I mentioned that's indexed off the instruction pointer.

  • And then an indirect jump takes as its

  • operand an indirect address--

  • oop-- OK, so that's a typo on the slide.

  • It just takes an indirect address as its operand.

  • So basically, you can say, jump to whatever

  • is pointed to by that register using whatever indexing method

  • that you want.

  • So that's kind of the overview of the assembly language.

  • Now, let's take a look at some idioms.

  • So the XOR opcode computes the bitwise XOR of A and B.

  • We saw XOR was a great trick for swapping numbers,

  • for example, the other day.

  • So often in the code, you will see something

  • like this, xor %rax, %rax.

  • What does that do?

  • Yeah.

  • AUDIENCE: Zeros the register.

  • CHARLES LEISERSON: It zeros the register.

  • Why does that zero the register?

  • AUDIENCE: Is the XOR just the same?

  • CHARLES LEISERSON: Yeah, it's basically

  • taking the contents of rax and xor-ing it with itself.

  • And when you XOR something with itself,

  • you get zero, storing that back into it.

  • So that's actually how you zero things.

  • So you'll see that.

  • Whenever you see that, hey, what are they doing?

  • They're zeroing the register.

  • And that's actually quicker and easier

  • than having a zero constant that they put into the instruction.

  • It saves a byte, because this ends up

  • being a very short instruction.

  • I don't remember how many bytes that instruction is.
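A sketch of the idiom made visible with inline assembly (in real code you would just write x = 0 and let the compiler pick this encoding):

```c
/* xor-ing a register with itself yields zero in a short encoding. */
long zeroed(void) {
    long x;
    __asm__("xorq %0, %0" : "=r"(x));   /* x ^ x == 0 */
    return x;
}
```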

  • Here's another one, the test opcode, test A, B,

  • computes the bitwise AND of A and B and discards the result,

  • while setting the RFLAGS register.

  • So basically, it says, what does the test instruction

  • for these things do?

  • So what is the first one doing?

  • So it takes rcx-- yeah.

  • AUDIENCE: Does it jump?

  • It jumps to [INAUDIBLE] rcx [INAUDIBLE]

  • So it takes the bitwise AND of A and B.

  • And so then it's saying jump if equal.

  • So--

  • AUDIENCE: An AND would be non-zero in any

  • of the bits set.

  • CHARLES LEISERSON: Right.

  • AND is non-zero if any of the bits are set.

  • AUDIENCE: Right.

  • So if the zero flag were set, that means that rcx was zero.

  • CHARLES LEISERSON: That's right.

  • So if the Zero flag is set, then rcx is zero.

  • So this is going to jump to that location

  • if rcx holds the value 0.

  • In all the other cases, it won't set the Zero flag,

  • because the result of the AND will not be 0.

  • So once again, that's kind of an idiom that they use.

  • What about the second one?

  • So this is a conditional move.

  • So both of them are basically checking

  • to see if the register is 0.

  • And then doing something if it is or isn't.

  • But those are just idioms that you sort of

  • have to look at to see how it is that they accomplish

  • their particular thing.
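Here is a sketch of C code that typically compiles to these test idioms (codegen varies):

```c
/* Checking a value against zero typically compiles to
 *   testq %rdi, %rdi    ; ZF set iff the register is 0
 * followed by a conditional jump (je) or a conditional move (cmove),
 * with no separate cmp instruction (typical codegen; compilers vary). */
long pick(long x, long if_zero, long otherwise) {
    return x == 0 ? if_zero : otherwise;
}
```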

  • Here's another one.

  • So the ISA can include several no-op, no operation

  • instructions, including nop, nop A-- that's

  • an operation with an argument-- and data16, which sets aside

  • 2 bytes of a nop.

  • So here's a line of assembly that we

  • found in some of our code--

  • data16 data16 data16 nopw %cs:0x0(%rax,%rax,1).

  • So nopw is going to take this argument, which has got all

  • this address calculation in it.

  • So what do you think this is doing?

  • What's the effect of this, by the way?

  • They're all no-ops.

  • So the effect is?

  • Nothing.

  • The effect is nothing.

  • OK, now it does set the RFLAGS.

  • But basically, mostly, it does nothing.

  • Why would a compiler generate assembly with these idioms?

  • Why would you get that kind of--

  • that's crazy, right?

  • Yeah.

  • AUDIENCE: Could it be doing some cache optimization?

  • CHARLES LEISERSON: Yeah, it's actually

  • doing alignment optimization typically or code size.

  • So it may want to start the next instruction on the beginning

  • of a cache line.

  • And, in fact, there's a directive to do that.

  • If you want all your functions to start

  • at the beginning of a cache line, then

  • it wants to make sure that if code gets to that point,

  • you'll just proceed to jump through memory,

  • continue through memory.

  • So mainly it's to optimize memory.

  • So you'll see those things.

  • I mean, you just have to realize, oh,

  • that's the compiler generating some no-ops.

  • So that's sort of our brief excursion

  • over assembly language, x86 assembly language.

  • Now, I want to dive into floating-point and vector

  • hardware, which is going to be the main part.

  • And then if there's any time at the end, I'll show the slides--

  • I have a bunch of other slides on how branch prediction works

  • and a variety of other machines sorts of things,

  • that if we don't get to, it's no problem.

  • You can take a look at the slides,

  • and there's also the architecture manual.

  • So floating-point instruction sets,

  • so mostly the scalar floating-point operations

  • are accessed via a couple of different instruction sets.

  • So the history of floating point is interesting,

  • because originally the 8086 did not have a floating-point unit.

  • Floating-point was done in software.

  • And then they made a companion chip

  • that would do floating-point.

  • And then they started integrating

  • and so forth as miniaturization took hold.

  • So the SSE and AVX instructions do

  • both single and double precision scalar floating-point, i.e.

  • floats or doubles.

  • And then the x87 instructions--

  • that's from the 8087 that was attached to the 8086,

  • and that's where they get the name--

  • support single, double, and extended precision

  • scalar floating-point arithmetic,

  • including float double and long double.

  • So you can actually get a great big result of a multiply

  • if you use the x87 instruction sets.

  • And they also include vector instructions,

  • so you can multiply or add there as well--

  • so all these places on the chip where you can decide

  • to do one thing or another.

  • Compilers generally like the SSE instructions

  • over the x87 instructions because they're simpler

  • to compile for and to optimize.

  • And the SSE opcodes are similar to the normal x86 opcodes.

  • And they use the XMM registers and floating-point types.

  • And so you'll see stuff like this, where you've

  • got a movsd and so forth.

  • The suffix there is saying what the data type is.

  • In this case, it's saying it's a double precision floating-point

  • value, i.e. a double.

  • Once again, they're using suffixes.

  • The sd in this case is a double precision floating-point.

  • The other option is the first letter

  • says whether it's single, i.e. a scalar operation, or packed,

  • i.e. a vector operation.

  • And the second letter says whether it's

  • single or double precision.

  • And so when you see one of these operations, you can decode,

  • oh, this is operating on a 64-bit value or a 32-bit value,

  • floating-point value, or on a vector of those values.
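As a sketch of that decoding:

```c
/* Suffix decoding as described above:
 *   first letter:  s = scalar, p = packed (vector)
 *   second letter: s = single precision, d = double precision
 * So addsd adds one double and addss adds one float (typical codegen
 * shown in the comments; compilers vary). */
double add_double(double a, double b) { return a + b; }  /* typically addsd */
float  add_float(float a, float b)    { return a + b; }  /* typically addss */
```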

  • Now, what about these vectors?

  • So when you start using the packed representation

  • and you start using vectors, you have

  • to understand a little bit about the vector units that

  • are on these machines.

  • So the way a vector unit works is

  • that there is the processor issuing instructions.

  • And it issues the instructions to all of the vector units.

  • So for example, if you take a look at a typical thing,

  • you may have a vector width of four vector units.

  • Each of them is often called a lane--

  • l-a-n-e.

  • And the k is the vector width.

  • And so when the instruction is given,

  • it's given to all of the vector units.

  • And they all do it on their own local copy of the register.

  • So the register you can think of as a very wide thing broken

  • into several words.

  • And when I say add two vectors together,

  • it'll add four words together and store it back

  • into another vector register.

  • And so whatever k is--

  • in the example I just said, k was 4.

  • And the lanes are the things, each of which

  • contains integer or floating-point arithmetic units.

  • But the important thing is that they all operate in lock step.

  • It's not like one is going to do one thing

  • and another is going to do another thing.

  • They all have to do exactly the same thing.

  • And the basic idea here is for the price of one instruction,

  • I can command a bunch of operations to be done.

  • Now, generally, vector instructions

  • operate in an element-wise fashion,

  • where you take the i-th element of one vector

  • and operate on it with the i-th element of another vector.

  • And all the lanes perform exactly the same operation.
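Here is a sketch using the AVX intrinsics that come up later in the lecture (assumes an AVX-capable machine and compiling with -mavx):

```c
#include <immintrin.h>

/* One vaddpd instruction adds all four 64-bit lanes in lock step:
 * result lane i = a lane i + b lane i, element-wise. */
__m256d add4(__m256d a, __m256d b) {
    return _mm256_add_pd(a, b);
}
```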

  • Depending upon the architecture,

  • the operands may need to be aligned.

  • That is, you've got to have the beginnings

  • at an address that's a multiple of the vector length.

  • There are others where the vectors

  • can be shifted in memory.

  • Usually, there's a performance difference between the two.

  • Some of them

  • will not support unaligned vector operations.

  • So if the compiler can't figure out that the data is aligned,

  • I'm sorry, your code will end up being executed

  • in a scalar fashion.

  • If the data is aligned, it's got to be able to figure that out.

  • And if it's not aligned, but the machine

  • does support unaligned vector operations,

  • it's usually slower than if the data is aligned.

  • And for some machines now, they actually

  • have good performance on both.

  • So it really depends upon the machine.
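A sketch of the aligned/unaligned distinction with AVX intrinsics (assumes -mavx; the aligned form requires p to be 32-byte aligned, or the behavior is undefined):

```c
#include <immintrin.h>

/* The aligned load demands a 32-byte-aligned address; the unaligned load
 * accepts any address, and on some machines is slower. */
__m256d load_aligned(const double *p)   { return _mm256_load_pd(p); }
__m256d load_unaligned(const double *p) { return _mm256_loadu_pd(p); }
```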

  • And then also there are some architectures

  • that will support cross-lane operations, such as inserting

  • or extracting subsets of vector elements,

  • permuting, shuffling, scatter, gather types of operations.

  • So x86 supports several instruction sets,

  • as I mentioned.

  • There's SSE.

  • There's AVX.

  • There's AVX2.

  • And then there's now the AVX-512,

  • or sometimes called AVX3, which is not

  • available on the machines that we'll be using,

  • the Haswell machines.

  • Generally, the AVX and AVX2 extend the SSE instruction

  • set by using wider registers and operating on more operands.

  • The SSE instructions use the 128-bit XMM registers and operate

  • on at most two operands.

  • The AVX ones can use the 256-bit YMM registers and also have three operands, not

  • just two.

  • So say you can say add A to B and store it in C,

  • as opposed to saying add A to B and store it in B.

  • So it can also support three.

  • Yeah, most of them are similar to traditional opcodes

  • with minor differences.

  • So if you look at them, if you have an SSE,

  • it basically looks just like the traditional name,

  • like add in this case, but you can then say,

  • do a packed add or a vector with packed data.

  • So the v prefix means it's AVX.

  • So if you see a v, you go to the part

  • in the manual that says AVX.

  • If you see the p's, that says it's packed data.

  • And you go to SSE if it doesn't have the v.

  • And as for the p prefix distinguishing integer vector instructions,

  • you got me.

  • I tried to think why is p distinguishing an integer?

  • It's like p, good mnemonic for integer, right?

  • Then in addition, they do this aliasing trick again,

  • where the YMM registers actually alias the XMM registers.

  • So you can use both operations, but you've

  • got to be careful what's going on,

  • because they just extended them.

  • And now, of course, with AVX-512,

  • they did another extension to 512 bits.

  • That's the vector stuff.

  • So you can use those explicitly.

  • The compiler will vectorize for you.

  • And the homework this week takes you through some vectorization

  • exercises.

  • It's actually a lot of fun.

  • We were just going over it in a staff meeting.

  • And it's really fun.

  • I think it's a really fun exercise.

  • We introduced that last year, by the way,

  • or maybe two years ago.

  • But, in any case, it's a fun one--

  • for my definition of fun, which I hope

  • is your definition of fun.

  • Now, I want to talk generally about computer architecture.

  • And I'm not going to get through all of these slides, as I say.

  • But I want to get started on them and give you

  • a sense of other things going on in the processor

  • that you should be aware of.

  • So in 6.004, you probably talked about a 5-stage processor.

  • Anybody remember that?

  • OK, 5-stage processor.

  • There's an Instruction Fetch.

  • There's an Instruction Decode.

  • There's an Execute.

  • Then there's a Memory Addressing.

  • And then you Write back the values.

  • And this is done as a pipeline, so as

  • to make-- you could do all of this in one thing,

  • but then you have a long clock cycle.

  • And you'll only be able to do one thing at a time.

  • Instead, they stack them together.

  • So here's a block diagram of the 5-stage processor.

  • We read the instruction from memory

  • in the instruction fetch cycle.

  • Then we decode it.

  • Basically, it takes a look at, what

  • is the opcode, what are the addressing modes, et cetera,

  • and figures out what it actually has to do

  • and actually performs the ALU operations.

  • And then it reads and writes the data memory.

  • And then it writes back the results into registers.

  • That's typically a common way that these things

  • go for a 5-stage processor.

  • By the way, this is vastly oversimplified.

  • You can take 6.823 if you want to learn the truth.

  • I'm going to tell you nothing but white lies

  • for this lecture.

  • Now, if you look at the Intel Haswell, the machine

  • that we're using, it actually has between 14 and 19 pipeline

  • stages.

  • The 14 to 19 reflects the fact that there

  • are different paths through it that

  • take different amounts of time.

  • It also I think reflects a little bit

  • that nobody has published the Intel internal stuff.

  • So maybe we're not sure if it's 14 to 19, but somewhere

  • in that range.

  • But I think it's actually because of the different lengths

  • of time, as I was explaining.

  • So what I want to do is--

  • you've seen the 5-stage pipeline.

  • I want to talk about the difference between that

  • and a modern processor by looking at several design

  • features.

  • We already talked about vector hardware.

  • I then want to talk about superscalar

  • processing, out-of-order execution,

  • and branch prediction a little bit.

  • And the out of order, I'm going to skip a bunch of that

  • because it has to do with scoreboarding, which

  • is really interesting and fun, but it's also time consuming.

  • But it's really interesting and fun.

  • That's what you learn in 6.823.

  • So historically, there's two ways

  • that people make processors go faster--

  • by exploiting parallelism and by exploiting locality.

  • And parallelism, there's instruction-- well,

  • we already did word-level parallelism

  • in the bit tricks thing.

  • But there's also instruction-level parallelism,

  • so-called ILP, vectorization, and multicore.

  • And for locality, the main thing that's used there is caching.

  • I would say also the fact that you

  • have a design with registers that also reflects locality,

  • because the way that the processor wants to do things

  • is fetch stuff from memory.

  • It doesn't want to operate on it in memory.

  • That's very expensive.

  • It wants to fetch things into registers, get enough of them

  • there that you can do some calculations,

  • do a whole bunch of calculations,

  • and then put them back out there.

  • So this lecture we're talking about ILP and vectorization.

  • So let me talk about instruction-level parallelism.

  • So when you have, let's say, a 5-stage pipeline,

  • you're interested in finding opportunities

  • to execute multiple instructions simultaneously.

  • So in instruction 1, it's going to do an instruction fetch.

  • Then it does its decode.

  • And so it takes five cycles for this instruction to complete.

  • So ideally what you'd like is that you

  • can start instruction 2 on cycle 2, instruction 3 on cycle 3,

  • and so forth, and have 5 instructions-- once you

  • get into the steady state, have 5 instructions executing

  • all the time.

  • That would be ideal, where each one takes just one thing.

  • So that's really pretty good.

  • And that would improve the throughput.

  • Even though it might take a long time

  • to get one instruction done, I can have many instructions

  • in the pipeline at some time.

  • So each pipeline stage is executing a different instruction.

  • However, in practice this isn't what happens.

  • In practice, you discover that there are

  • what's called pipeline stalls.

  • When it comes time to execute an instruction,

  • for some correctness reason, it cannot execute the instruction.

  • It has to wait.

  • And that's a pipeline stall.

  • That's what you want to try to avoid,

  • and the compiler tries to produce code that will avoid stalls.

  • So why do stalls happen?

  • They happen because of what are called hazards.

  • There's actually two notions of hazard.

  • And this is one of them.

  • The other is a race condition hazard.

  • This is dependency hazard.

  • But people call them both hazards,

  • just like they call the second stage of compilation compiling.

  • It's like they make up these words.

  • So here's three types of hazards that can prevent

  • an instruction from executing.

  • First of all, there's what's called a structural hazard.

  • Two instructions attempt to use the same functional unit

  • at the same time.

  • If there's, for example, only one floating-point multiplier

  • and two of them try to use it at the same time, one has to wait.

  • In modern processors, there's a bunch of each of those.

  • But if you have k functional units and k plus 1 instructions

  • want to access it, you're out of luck.

  • One of them is going to have to wait.

  • The second is a data hazard.

  • This is when an instruction depends

  • on the result of a prior instruction in the pipeline.

  • So one instruction is computing a value that

  • is going to stick in rcx, say.

  • So they stick it into rcx.

  • The other one has to read the value from rcx

  • and it comes later.

  • That other instruction has to wait

  • until that value is written there before it can read it.

  • That's a data hazard.

  • And a control hazard is where you

  • decide that you need to make a jump

  • and you can't execute the next instruction,

  • because you don't know which way the jump is going to go.

  • So if you have a conditional jump,

  • it's like, well, what's the next instruction after that jump?

  • I don't know.

  • So I have to wait to execute that.

  • I can't go ahead and do the jump and then do

  • the next instruction after it, because I don't know what

  • happened to the previous one.

  • Now of these, we're going to mostly talk about data hazards.

  • So an instruction can create a data hazard--

  • an instruction i can create a data hazard due to a dependence between i

  • and j.

  • So the first type is called a true dependence,

  • or a read-after-write dependence.

  • And this is where, as in this example,

  • I'm adding something and storing into rax

  • and the next instruction wants to read from rax.

  • So the second instruction can't get

  • going-- it may

  • stall-- until the result of the previous one is known.

  • There's another one called an anti-dependence.

  • This is where I want to write into a location,

  • but I have to wait until the previous instruction has read

  • the value, because otherwise I'm going

  • to clobber that instruction and clobber

  • the value before it gets read.

  • so that's an anti-dependence.

  • And then the final one is an output dependence,

  • where they're both trying to move something to rax.

  • So why would two things want to move things

  • to the same location?

  • After all, one of them is going to be lost-- why not just not do

  • that instruction?

  • Why wouldn't--

  • AUDIENCE: Set some flags.

  • CHARLES LEISERSON: Yeah, maybe because it

  • wants to set some flags.

  • So that's one reason that it might do this,

  • because you know the first instruction set

  • some flags in addition to moving the output to that location.

  • And there's one other reason.

  • What's the other reason?

  • I'm blanking.

  • There's two reasons.

  • And I didn't put them in my notes.

  • I don't remember.

  • OK, but anyway, that's a good question for quiz then.

  • OK, give me two reasons-- yeah.

  • AUDIENCE: Can there be intermediate instructions

  • like between those [INAUDIBLE]

  • CHARLES LEISERSON: There could, but of course

  • then if it's going to use that register, then--

  • oh, I know the other reason.

  • So this is still good for a quiz.

  • The other reason is there may be aliasing going on.

  • Maybe an intervening instruction uses one

  • of the registers that alias it.

  • So uses part of the result or whatever, there still

  • could be a dependency.
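Here is a sketch of the three dependence types at the C level, where the variable t plays the role of the register:

```c
/* The scheduler faces the same constraints the lecture describes:
 *   true (read after write):   a read of t must wait for the write of t;
 *   anti (write after read):   a rewrite of t must wait for the earlier read;
 *   output (write after write): two writes to t must not be reordered. */
void deps(long *a, long *b) {
    long t = *a + 1;   /* write t                                       */
    *b = t;            /* true dependence: reads t after that write     */
    t = 7;             /* anti dependence on the read above; also an    */
                       /* output dependence with the first write of t   */
    *a = t;
}
```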

  • Anyway, some arithmetic operations

  • are complex to implement in hardware

  • and have long latencies.

  • So here are some sample opcodes and what latency they have.

  • They take different numbers of cycles.

  • So, for example, integer division actually is variable,

  • but a multiply takes about three times what

  • most of the integer operations are.

  • And floating-point multiply is like 5.

  • And then fma, what's fma?

  • Fused multiply add.

  • This is where you're doing both a multiply and an add.

  • And why do we care about fused multiply adds?

  • AUDIENCE: For memory accessing and [INAUDIBLE]

  • CHARLES LEISERSON: Not for memory accessing.

  • This is actually floating-point multiply and add.

  • It's called linear algebra.

  • So when you do matrix multiplication,

  • you're doing dot products.

  • You're doing multiplies and adds.

  • So that kind of thing, that's where you do a lot of those.
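
As a small illustration, here's a dot product written with the standard C fma() function (a sketch; with clang -O2 and FMA hardware support, e.g. -march=haswell, the loop compiles down to fused multiply-add instructions; link with -lm on some systems):

```c
#include <math.h>
#include <stdio.h>

/* Dot product: one fused multiply-add per element. */
double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s = fma(a[i], b[i], s);    /* s = a[i]*b[i] + s, in one rounding */
    return s;
}

int main(void) {
    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("%f\n", dot(a, b, 4));  /* prints 70.000000 */
    return 0;
}
```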

  • So how does the hardware accommodate

  • these complex operations?

  • So the strategy that much hardware tends to use

  • is to have separate functional units for complex operations,

  • such as floating-point arithmetic.

  • So there may be in fact separate registers,

  • for example, the XMM registers, that only

  • work with floating point.

  • So you have your basic 5-stage pipeline.

  • You have another pipeline that's off on the side.

  • And it's sometimes going to take multiple cycles

  • for things, and it may be pipelined to a different depth.

  • And so you basically separate these operations.

  • The functional units may be pipelined, fully,

  • partially, or not at all.

  • And so I now have a whole bunch of different functional units,

  • and there's different paths that I'm

  • going to be able to take through the data path of the processor.

  • So in Haswell, they have integer, vector, and floating-point units

  • distributed among eight different ports, which

  • are sort of the entry points.

  • So given that, things get really complicated.

  • If we go back to our simple diagram,

  • suppose we have all these additional functional units,

  • how can I now exploit more instruction-level parallelism?

  • So right now, we can start up one operation at a time.

  • What might I do to get more parallelism out of the hardware

  • that I've got?

  • What do you think computer architects did?

  • OK.

  • AUDIENCE: It's a guess but, you could glue together [INAUDIBLE]

  • CHARLES LEISERSON: Yeah, so even simpler than that--

  • though it's implied in what you're saying--

  • you can just fetch and issue multiple instructions

  • per cycle.

  • So rather than just doing one per cycle

  • as we showed with a typical pipeline processor,

  • let me fetch several that use different parts

  • of the processor pipeline, because they're not

  • going to interfere, to keep everything busy.

  • And so that's basically what's called a superscalar

  • processor, where it's not executing one thing at a time.

  • It's executing multiple things at a time.

  • So Haswell, in fact, breaks up the instructions

  • into simpler operations, called micro-ops.

  • And it can emit four micro-ops per cycle

  • to the rest of the pipeline.

  • And the fetch and decode stages implement optimizations

  • on micro-op processing, including special cases

  • for common patterns.

  • For example, if it sees the XOR of rax and rax,

  • it knows that rax is being set to 0.

  • It doesn't even use a functional unit for that.

  • It just does it and it's done.

  • It just has special logic that recognizes

  • that pattern, because it's such a common way to set a register to zero.
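
You can see the idiom for yourself with a one-line function (compile with clang -O2 -S; the exact output assumes an x86-64 target):

```c
/* On x86-64, clang -O2 emits "xorl %eax, %eax" here rather than
   "movl $0, %eax": the xor-zeroing idiom the decoder special-cases. */
int zero(void) {
    return 0;
}
```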

  • And so that means that now your processor can execute

  • a lot of things at one time.

  • And those are the machines that you're using.

  • That's why when I said if you save one add instruction,

  • it probably doesn't make any difference

  • in today's processor, because there's probably

  • an idle adder lying around.

  • There's probably a-- let's see how many--

  • where do we go here?

  • Yeah, so if you look here, you can even

  • discover that there are actually a bunch of ALUs that

  • are capable of doing an add.

  • So they're all over the map in Haswell.
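
Here's one way source code can actually hand the hardware that kind of parallelism (a sketch; the function names are made up). Splitting a reduction across several accumulators breaks the serial dependence chain, so several of those adders can be busy at once:

```c
#include <stdio.h>

/* One accumulator: each add must wait for the previous one. */
double sum_serial(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: four independent dependence chains. */
double sum_ilp(const double *a, long n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)       /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double a[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    printf("%f %f\n", sum_serial(a, 9), sum_ilp(a, 9));  /* both 45 */
    return 0;
}
```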

  • Now, still, we are insisting that the processor execute

  • things in order.

  • And that's kind of the next stage: how do you

  • end up making things run--

  • that is, how do you make it so that you can free yourself

  • from the tyranny of one instruction after the other?

  • And so the first thing is there's

  • a strategy called bypassing.

  • So suppose that you have an instruction writing into rax.

  • And then the next one is going to read that value.

  • Well, why bother waiting for it to be stored into the register

  • file and then pulled back out for the second instruction?

  • Instead, let's have a bypass, a special circuit

  • that identifies that kind of situation

  • and feeds it directly to the next instruction

  • without requiring that it go into the register file

  • and back out.

  • So that's called bypassing.

  • There are lots of places where things are bypassed.

  • And we'll talk about it more.

  • So normally, you would stall waiting

  • for it to be written back.

  • And now, when you eliminate that wait, I

  • can move the dependent instruction way forward, because I just

  • use the bypass path to execute.

  • And it allows the second instruction

  • to get going earlier.
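
At the source level you never see the bypass, but here's the kind of back-to-back dependent pair it helps (a sketch; the assembly in the comment is roughly what clang -O2 emits):

```c
#include <stdio.h>

/* Compiled, the body is roughly
       leaq (%rdi,%rsi), %rax   # writes rax
       addq %rax, %rax          # reads rax on the very next cycle
   The bypass path feeds the sum straight into the second add instead
   of waiting for the register-file writeback. */
long twice_sum(long a, long b) {
    long s = a + b;
    return s + s;
}

int main(void) {
    printf("%ld\n", twice_sum(3, 4));  /* prints 14 */
    return 0;
}
```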

  • What else can we do?

  • Well, let's take a large code example.

  • Given the amount of time, what I'm

  • going to do is basically say, you

  • can go through and figure out what

  • are the read-after-write dependencies

  • and the write-after-read dependencies.

  • They're all over the place.

  • And what you can do is if you look

  • at what the dependencies are that I just flashed through,

  • you can discover, oh, there's all these things.

  • Each one right now has to wait for the previous one

  • before it can get started.

  • But there are some-- for example,

  • the first one is just issue order.

  • You can't start the second--

  • if it's in order, you can't start the second

  • till you've started the first-- that is, till it's

  • finished the first stage.

  • But the other thing here is there's

  • a data dependence between the second and third instructions.

  • So if you look at the second and third instructions,

  • they're both using XMM2.

  • And so we're prevented from overlapping them.

  • So one of the questions there is, well,

  • why not do a little bit better by taking a look at this

  • as a graph and figuring out what's

  • the best way through the graph?

  • And there are a bunch of tricks you can do there,

  • which I'll run through here very quickly.

  • And you can take a look at these.

  • You can discover that some of these dependencies

  • are not real dependences.

  • And as long as you're willing to execute things out of order

  • and keep track of that, it's perfectly fine.

  • If you're not actually dependent on it,

  • then just go ahead and execute it.

  • And then you can advance things.

  • And then the other trick you can use

  • is what's called register renaming.

  • If you have a destination that's going to be read from--

  • sorry, if I want to write to something,

  • but I have to wait for something else to read from it,

  • the write-after-read dependence, then what

  • I can do is just rename the register,

  • so that I have a fresh register to write

  • to that stands in for the same thing.

  • And there's a very complex mechanism called

  • scoreboarding that does that.
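
To make the renaming idea concrete, here's a toy sketch in C (everything here-- the table, the register set, the physical-register numbering-- is made up for illustration):

```c
#include <stdio.h>

/* Each architectural register name maps to a physical register; every
   write gets a fresh physical register, so write-after-read (and
   write-after-write) conflicts on the *name* disappear. */
enum { RAX, RBX, RCX, NUM_ARCH };

int rename_table[NUM_ARCH];  /* architectural -> physical register      */
int next_phys = NUM_ARCH;    /* physical regs 0..2 hold the initial map */

int read_reg(int r)  { return rename_table[r]; }
int write_reg(int r) { return rename_table[r] = next_phys++; }

int main(void) {
    for (int r = 0; r < NUM_ARCH; r++) rename_table[r] = r;

    /* movq %rax, %rbx -- reads rax, writes rbx */
    int src1 = read_reg(RAX), dst1 = write_reg(RBX);
    printf("movq p%d, p%d\n", src1, dst1);   /* movq p0, p3 */

    /* movq %rcx, %rax -- the write of rax gets a fresh physical
       register, so it no longer conflicts with the read of rax above:
       the write-after-read hazard on the name is gone. */
    int src2 = read_reg(RCX), dst2 = write_reg(RAX);
    printf("movq p%d, p%d\n", src2, dst2);   /* movq p2, p4 */
    return 0;
}
```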

  • So anyway, you can take a look at all of these tricks.

  • And then the last thing that I want

  • to-- so this is the part I was going to skip over.

  • And indeed, I don't have time to do it.

  • I just want to mention the last thing, which is worthwhile.

  • So this-- you don't have to know any

  • of the details of that part.

  • But it's in there if you're interested.

  • So it does renaming and reordering.

  • And then the last thing I do want to mention

  • is branch prediction.

  • So when you come to a conditional branch,

  • you can have a hazard because the outcome is known too late.

  • And so in that case, what they do

  • is what's called speculative execution, which

  • you've probably heard of.

  • So basically that says I'm going to guess the outcome

  • of the branch and execute.

  • When a branch is encountered, you assume it's taken

  • and you execute normally.

  • And if you're right, everything is hunky dory.

  • If you're wrong, it costs you--

  • you have to undo that speculative computation

  • and the effect is sort of like stalling.

  • So you don't want that to happen.

  • And so a mispredicted branch on Haswell

  • costs about 15 to 20 cycles.

  • Most of the machines use a branch predictor

  • to predict whether or not the branch is going to be taken.

  • There's a little bit of stuff here

  • about how you can tell whether a branch is

  • going to be predicted well or not.

  • And you can take a look at that on your own.
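
As a quick demonstration of that misprediction cost, here's a sketch of a classic experiment (POSIX timing assumed; the array size is made up, and at high optimization the compiler may turn the branch into a conditional move, which hides the effect):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

/* Time the same branchy loop over the given data. */
static double run(const int *a, long *sum) {
    struct timespec t0, t1;
    long s = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        if (a[i] > RAND_MAX / 2)  /* taken ~50% of the time on random data */
            s += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *sum = s;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    int *random_data = malloc(N * sizeof *random_data); /* hard to predict */
    int *zero_data   = calloc(N, sizeof *zero_data);    /* never taken     */
    if (!random_data || !zero_data) return 1;
    for (int i = 0; i < N; i++) random_data[i] = rand();

    long s1, s2;
    double tr = run(random_data, &s1);
    double tz = run(zero_data, &s2);
    printf("random: %.3f s (sum %ld)\n", tr, s1);
    printf("zeros:  %.3f s (sum %ld)\n", tz, s2);
    free(random_data);
    free(zero_data);
    return 0;
}
```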

  • So sorry to rush the end a little bit,

  • but I knew I wasn't going to get through all of this.

  • But it's all in the notes and in the slides when we put them up.

  • And this is really kind of interesting stuff.

  • Once again, remember that I'm dealing with this at one level

  • below what you really need to do.

  • But it is really helpful to understand that layer

  • so you have a deep understanding of why certain software

  • optimizations work and don't work.

  • Sound good?

  • OK, good luck on finishing your project 1's.
