  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high-quality educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • JULIAN SHUN: Today, we're going to talk

  • about multicore programming.

  • And as I was just informed by Charles, it's 2018.

  • I had 2017 on the slide.

  • So first, congratulations to all of you.

  • You turned in the first project's beta.

  • Here's a plot showing the tiers that different groups reached

  • for the beta.

  • And this is in sorted order.

  • And we set the beta cutoff to be tier 45.

  • The final cutoff is tier 48.

  • So the final cutoff we did set a little bit aggressively,

  • but keep in mind that you don't necessarily

  • have to get to the final cutoff in order

  • to get an A on this project.

  • So we're going to talk about multicore processing today.

  • That's going to be the topic of the next project

  • after you finish the first project.

  • So in a multicore processor, we have a whole bunch

  • of cores that are all placed on the same chip,

  • and they have access to shared memory.

  • They usually also have some sort of private cache, and then

  • a shared last level cache, so L3, in this case.

  • And then they all have access to the same memory controller,

  • which goes out to main memory.

  • And then they also have access to I/O.

  • But for a very long time, chips only had a single core on them.

  • So why do we have multicore processors nowadays?

  • Why did semiconductor vendors start

  • producing chips that had multiple processor

  • cores on them?

  • So the answer is because of two things.

  • So first, there's Moore's Law, which

  • says that we get more transistors every year.

  • So the number of transistors that you can fit on a chip

  • doubles approximately every two years.

  • And secondly, there's the end of scaling of clock frequency.

  • So for a very long time, we could just

  • keep increasing the frequency of the single core on the chip.

  • But at around 2004 to 2005, that was no longer the case.

  • We couldn't scale the clock frequency anymore.

  • So here's a plot showing both the number of transistors

  • you could fit on the chip over time,

  • as well as the clock frequency of the processors over time.

  • And notice that the y-axis is in log scale here.

  • And the blue line is basically Moore's Law,

  • which says that the number of transistors

  • you can fit on a chip doubles approximately every two years.

  • And that's been growing pretty steadily.

  • So this plot goes up to 2010, but in fact, it's

  • been growing even up until the present.

  • And it will continue to grow for a couple

  • more years before Moore's Law ends.

  • However, if you look at the clock frequency line,

  • you see that it was growing quite

  • steadily until about the early 2000s, and then at that point,

  • it flattened out.

  • So at that point, we couldn't increase the clock frequencies

  • anymore, and the clock speed was bounded

  • at about four gigahertz.

  • So nowadays, if you go buy a processor,

  • it's usually still bounded by around 4 gigahertz.

  • It's usually a little bit less than 4 gigahertz,

  • because it doesn't really make sense to push it all the way.

  • But you might find some processors

  • that are around 4 gigahertz nowadays.

  • So what happened at around 2004 to 2005?

  • Does anyone know?

  • So Moore's Law basically says that we

  • can fit more transistors on a chip

  • because the transistors become smaller.

  • And when the transistors become smaller,

  • you can reduce the voltage that's

  • needed to operate the transistors.

  • And as a result, you can increase the clock frequency

  • while maintaining the same power density.

  • And that's what manufacturers did until about 2004 to 2005.

  • They just kept increasing the clock frequency

  • to take advantage of Moore's law.

  • But it turns out that once transistors become

  • small enough, and the voltage used

  • to operate them becomes small enough,

  • there's something called leakage current.

  • So there's current that leaks, and we're

  • unable to keep reducing the voltage while still having

  • reliable switching.

  • And if you can't reduce the voltage anymore,

  • then you can't increase the clock frequency

  • if you want to keep the same power density.

  • So here's a plot from Intel back in 2004

  • when they first started producing multicore processors.

  • And this is plotting the power density versus time.

  • And again, the y-axis is in log scale here.

  • So the green data points are actual data points,

  • and the orange ones are projected.

  • And they projected what the power density

  • would be if we kept increasing the clock

  • frequency at a trend of about 25% to 30% per year,

  • which is what happened up until around 2004.

  • And because we couldn't reduce the voltage anymore,

  • the power density will go up.

  • And you can see that eventually, it

  • reaches the power density of a nuclear reactor, which

  • is pretty hot.

  • And then it reaches the power density of a rocket nozzle,

  • and eventually you get to the power

  • density of the sun's surface.

  • So if you have a chip that has a power density

  • equal to the sun's surface--

  • well, you don't actually really have a chip anymore.

  • So basically if you get into this orange region,

  • you basically have a fire, and you can't really

  • do anything interesting, in terms of performance

  • engineering, at that point.

  • So to solve this problem, semiconductor vendors

  • didn't increase the clock frequency anymore,

  • but we still had Moore's Law giving us

  • more and more transistors every year.

  • So what they decided to do with these extra transistors

  • was to put them into multiple cores,

  • and then put multiple cores on the same chip.

  • So we can see that, starting at around 2004,

  • the number of cores per chip becomes more than one.

  • And each generation of Moore's Law

  • will potentially double the number of cores

  • that you can fit on a chip, because it's doubling

  • the number of transistors.

  • And we've seen this trend up until about today.

  • And again, it's going to continue for a couple

  • more years before Moore's Law ends.

  • So that's why we have chips with multiple cores today.

  • So today, we're going to look at multicore processing.

  • So I first want to introduce the abstract multicore

  • architecture.

  • So this is a very simplified version,

  • but I can fit it on this slide, and it's a good example

  • for illustration.

  • So here, we have a whole bunch of processors.

  • They each have a cache, so that's

  • indicated with the dollar sign.

  • And usually they have a private cache as well as

  • a shared cache, so a shared last level cache, like the L3 cache.

  • And then they're all connected to the network.

  • And then, through the network, they

  • can connect to the main memory.

  • They can all access the same shared memory.

  • And then usually there's a separate network for the I/O

  • as well, even though I've drawn them as a single network here,

  • so they can access the I/O interface.

  • And potentially, the network will also

  • connect to other multiprocessors on the same system.

  • And this abstract multicore architecture

  • is known as a chip multiprocessor, or CMP.

  • So that's the architecture that we'll be looking at today.

  • So here's an outline of today's lecture.

  • So first, I'm going to go over some hardware challenges

  • with shared memory multicore machines.

  • So we're going to look at the cache coherence protocol.

  • And then after looking at hardware,

  • we're going to look at some software solutions

  • to write parallel programs on these multicore machines

  • to take advantage of the extra cores.

  • And we're going to look at several concurrency

  • platforms listed here.

  • We're going to look at Pthreads.

  • This is basically a low-level API

  • for accessing, or for running your code in parallel.

  • And if you program on Microsoft products,

  • the Win API threads are pretty similar.

  • Then there's Intel Threading Building Blocks,

  • which is a library solution to concurrency.

  • And then there are two linguistic solutions

  • that we'll be looking at--

  • OpenMP and Cilk Plus.

  • And Cilk Plus is actually the concurrency platform

  • that we'll be using for most of this class.

  • So let's look at how caches work.

  • So let's say that we have a value in memory

  • at some location, and that value is--

  • let's say that value is x equals 3.

  • If one processor says, we want to load

  • x, what happens is that processor reads

  • this value from a main memory, brings it into its own cache,

  • and then it also reads the value, loads it

  • into one of its registers.

  • And it keeps this value in cache so

  • that if it wants to access this value again in the near future,

  • it doesn't have to go all the way out to main memory.

  • It can just look at the value in its cache.

  • Now, what happens if another processor wants to load x?

  • Well, it just does the same thing.

  • It reads the value from main memory,

  • brings it into its cache, and then also loads it

  • into one of the registers.

  • And then same thing with another processor.

  • It turns out that you don't actually

  • always have to go out to main memory to get the value.

  • If the value resides in one of the other processor's caches,

  • you can also get the value through the other processor's

  • cache.

  • And sometimes that's cheaper than going all the way out

  • to main memory.

  • So the second processor now loads x again.

  • And it's in cache, so it doesn't have

  • to go to main memory or anybody else's cache.

  • So what happens now if we want to store

  • x, if we want to set the value of x to something else?

  • So let's say this processor wants to set x equal to 5.

  • So it's going to write x equals 5

  • and store that result in its own cache.

  • So that's all well and good.

  • Now what happens when the first processor wants to load x?

  • Well, it seems that the value of x is in its own cache,

  • so it's just going to read the value of x there,

  • and it gets a value of 3.

  • So what's the problem there?

  • Yes?

  • AUDIENCE: The path is stale.

  • JULIAN SHUN: Yeah.

  • So the problem is that the value of x in the first processor's

  • cache is stale, because another processor updated it.

  • So now this value of x in the first processor's cache

  • is invalid.

  • So that's the problem.

  • And one of the main challenges of multicore hardware

  • is to try to solve this problem of cache coherence--

  • making sure that the values in different processors' caches

  • are consistent across updates.

  • So one basic protocol for solving this problem

  • is known as the MSI protocol.

  • And in this protocol, each cache line is labeled with a state.

  • So there are three possible states--

  • M, S, and I. And this is done on the granularity of cache lines.

  • Because it turns out that storing this information

  • is relatively expensive, so you don't

  • want to store it for every memory location.

  • So they do it on a per cache line basis.

  • Does anyone know what the size of a cache line

  • is, on the machines that we're using?

  • Yeah?

  • AUDIENCE: 64 bytes.

  • JULIAN SHUN: Yeah, so it's 64 bytes.

  • And that's typically what you see today on most Intel and AMD

  • machines.

  • There's some architectures that have different cache lines,

  • like 128 bytes.

  • But for our class, the machines that we're using

  • will have 64 byte cache lines.

  • It's important to remember that so that when you're doing

  • back-of-the-envelope calculations,

  • you can get accurate estimates.

  • So the three states in the MSI protocol are M, S, and I.

  • So M stands for modified.

  • And when a cache block is in the modified state,

  • that means no other caches can contain this block

  • in the M or the S states.

  • The S state means that the block is shared,

  • so other caches can also have this block in shared state.

  • And then finally, I means the cache block is invalid.

  • So that's essentially the same as the cache block

  • not being in the cache.

  • And to solve the problem of cache coherency, when

  • one cache modifies a location, it

  • has to inform all the other caches

  • that their values are now stale, because this cache modified

  • the value.

  • So it's going to invalidate all of the other copies

  • of that cache line in other caches

  • by changing their state from S to I.

  • So let's see how this works.

  • So let's say that the second processor wants to store y

  • equals 5.

  • So previously, a value of y was 17, and it was in shared state.

  • The cache line containing y equals 17 was in shared state.

  • So now, when I do y equals 5, I'm

  • going to set the second processor's cache--

  • that cache line-- to modified state.

  • And then I'm going to invalidate the cache

  • line in all of the other caches that contain that cache line.

  • So now the first cache and the fourth cache

  • each have a state of I for y equals 17,

  • because that value is stale.
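
To make the invalidation step concrete, here is a toy software model of what was just described. This is purely illustrative (the real protocol lives in the hardware cache controllers), and the types and names are made up for this sketch.

    #include <vector>

    // Toy model of the MSI invalidation step: one state per cached copy.
    enum class State { M, S, I };   // Modified, Shared, Invalid

    struct CachedCopy { State state; long value; };

    // When cache `writer` stores a new value to a line, it moves its copy
    // to M and every other valid copy of that line is invalidated (-> I).
    void store(std::vector<CachedCopy>& copies, int writer, long newValue) {
      for (int c = 0; c < (int)copies.size(); ++c) {
        if (c == writer) {
          copies[c].state = State::M;
          copies[c].value = newValue;
        } else if (copies[c].state != State::I) {
          copies[c].state = State::I;   // the stale copy is invalidated
        }
      }
    }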

  • Are there any questions?

  • Yes?

  • AUDIENCE: If we already have to tell the other things to switch

  • to invalid, why not just tell them the value of y?

  • JULIAN SHUN: Yeah, so there are actually

  • some protocols that do that.

  • So this is just the most basic protocol.

  • So this protocol doesn't do it.

  • But there are some that are used in practice

  • that actually do do that.

  • So it's a good point.

  • But I just want to present the most basic protocol for now.

  • Sorry.

  • And then, when you load a value, you

  • can first check whether your cache line is in M or S state.

  • And if it is an M or S state, then you

  • can just read that value directly.

  • But if it's in the I state, or if it's not there,

  • then you have to fetch that block

  • from either another processor's cache

  • or fetch it from main memory.

  • So it turns out that there are many other protocols out there.

  • There's something known as MESI, the "messy" protocol.

  • There's also MOESI and many other different protocols.

  • And some of them are proprietary.

  • And they all do different things.

  • And it turns out that all of these protocols

  • are quite complicated, and it's very hard

  • to get these protocols right.

  • And in fact, one of the earliest successes

  • of formal verification was proving some of these cache

  • coherence protocols to be correct.

  • Yes, question?

  • AUDIENCE: What happens if two processors try to modify

  • one value at the same time?

  • JULIAN SHUN: Yeah, so if two processors

  • try to modify the value, one of them has to happen first.

  • So the hardware is going to take care of that.

  • So the first one that actually modifies

  • it will invalidate all the other copies,

  • and then the second one that modifies the value

  • will again invalidate all of the other copies.

  • And when you do that--

  • when a lot of processors try to modify the same value,

  • you get something known as an invalidation storm.

  • So you have a bunch of invalidation messages

  • going throughout the hardware.

  • And that can lead to a big performance bottleneck.

  • Because each processor, when it modifies its value,

  • it has to inform all the other processors.

  • And if all the processors are modifying the same value,

  • you get this sort of quadratic behavior.

  • The hardware is still going to guarantee

  • that one of the processors is going to end up

  • writing the value there.

  • But you should be aware of this performance issue

  • when you're writing parallel code.

  • Yes?

  • AUDIENCE: So all of this protocol stuff

  • happens in hardware?

  • JULIAN SHUN: Yes, so this is all implemented in hardware.

  • So if you take a computer architecture class,

  • you'll learn much more about these protocols and all

  • of their variants.

  • So for our purposes, we don't actually

  • need to understand all the details of the hardware.

  • We just need to understand what it's doing at a high level

  • so we can understand when we have a performance bottleneck

  • and why we have a performance bottleneck.

  • So that's why I'm just introducing the most

  • basic protocol here.

  • Any other questions?

  • So I talked a little bit about the shared memory hardware.

  • Let's now look at some concurrency platforms.

  • So these are the four platforms that we'll be looking at today.

  • So first, what is a concurrency platform?

  • Well, writing parallel programs is very difficult.

  • It's very hard to get these programs to be correct.

  • And if you want to optimize their performance,

  • it becomes even harder.

  • So it's very painful and error-prone.

  • And a concurrency platform abstracts processor

  • cores and handles synchronization

  • and communication protocols.

  • And it also performs load balancing for you.

  • So it makes your lives much easier.

  • And so today we're going to talk about some

  • of these different concurrency platforms.

  • So to illustrate these concurrency platforms,

  • I'm going to do the Fibonacci numbers example.

  • So does anybody not know what Fibonacci is?

  • So good.

  • Everybody knows what Fibonacci is.

  • So it's a sequence where each number is the sum

  • of the previous two numbers.

  • And the recurrence is shown in this brown box here.

  • The sequence is named after Leonardo di Pisa, who was also

  • known as Fibonacci, which is a contraction of filius Bonacci,

  • son of Bonaccio.

  • So that's where the name Fibonacci came from.

  • And in Fibonacci's 1202 book, Liber Abaci,

  • he introduced the sequence--

  • the Fibonacci sequence-- to Western mathematics,

  • although it had been previously known

  • to Indian mathematicians for several centuries.

  • But this is what we call the sequence nowadays--

  • Fibonacci numbers.

  • So here's a Fibonacci program.

  • Has anyone seen this algorithm before?

  • A couple of people.

  • Probably more, but people didn't raise their hands.

  • So it's a recursive program.

  • So it basically implements the recurrence

  • from the previous slide.

  • So if n is less than 2, we just return n.

  • Otherwise, we compute fib of n minus 1,

  • store that value in x, fib of n minus 2, store that value in y,

  • and then return the sum of x and y.
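
A sketch of the recursive routine just described (the 64-bit integer type is an assumption about the slide):

    #include <cstdint>

    // Recursive Fibonacci: exponential time, but a convenient
    // running example for illustrating parallelism.
    int64_t fib(int64_t n) {
      if (n < 2) return n;
      int64_t x = fib(n - 1);
      int64_t y = fib(n - 2);
      return x + y;
    }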

  • So I do want to make a disclaimer

  • to the algorithms police that this is actually

  • a very bad algorithm.

  • So this algorithm takes exponential time,

  • and there's actually much better ways

  • to compute the nth Fibonacci number.

  • There's a linear time algorithm, which

  • just computes the Fibonacci numbers from bottom up.

  • This algorithm here is actually redoing a lot of the work,

  • because it's computing Fibonacci numbers multiple times.

  • Whereas if you do a linear scan from the smallest numbers up,

  • you only have to compute each one once.
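
For comparison, a minimal sketch of that bottom-up, linear-time approach (not code from the lecture):

    #include <cstdint>

    // Bottom-up Fibonacci: each number is computed exactly once.
    int64_t fib_linear(int64_t n) {
      if (n < 2) return n;
      int64_t prev = 0, curr = 1;           // fib(0) and fib(1)
      for (int64_t i = 2; i <= n; ++i) {
        int64_t next = prev + curr;
        prev = curr;
        curr = next;
      }
      return curr;
    }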

  • And there's actually an even better algorithm

  • that takes logarithmic time, and it's

  • based on squaring matrices.

  • So has anyone seen that algorithm before?

  • So a couple of people.

  • So if you're interested in learning more

  • about this algorithm, I encourage

  • you to look at your favorite textbook, Introduction

  • to Algorithms by Cormen, Leiserson, Rivest, and Stein.

  • So even though this--

  • [LAUGHTER]

  • Yes.

  • So even though this is a pretty bad algorithm,

  • it's still a good educational example,

  • because I can fit it on one slide

  • and illustrate all the concepts of parallelism

  • that we want to cover today.

  • So here's the execution tree for fib of 4.

  • So we see that fib of 4 is going to call fib of 3 and fib of 2.

  • Fib of 3 is going to call fib of 2, fib of 1, and so on.

  • And you can see that repeated computations here.

  • So fib of 2 is being computed twice, and so on.

  • And if you have a much larger tree--

  • say you ran this on fib of 40-- then

  • you'll have many more overlapping computations.

  • It turns out that the two recursive calls can actually

  • be parallelized, because they're completely independent

  • calculations.

  • So the key idea for parallelization

  • is to simultaneously execute the two recursive sub-calls to fib.

  • And in fact, you can do this recursively.

  • So the two sub-calls to fib of 3 can also

  • be executed in parallel, and the two sub-calls of fib of 2

  • can also be executed in parallel, and so on.

  • So you have all of these calls that

  • can be executed in parallel.

  • So that's the key idea for extracting parallelism

  • from this algorithm.

  • So let's now look at how we can use

  • Pthreads to implement this simple Fibonacci algorithm.

  • So Pthreads is a standard API for threading,

  • and it's supported on all Unix-based machines.

  • And if you're programming using Microsoft products,

  • then the equivalent is Win API threads.

  • And Pthreads is actually an ANSI and IEEE standard,

  • so there's this number here that specifies the standard.

  • But nowadays, we just call it Pthreads.

  • And it's basically a do-it-yourself concurrency

  • platform.

  • So it's like the assembly language

  • of parallel programming.

  • It's built as a library of functions

  • with special non-C semantics.

  • Because if you're just writing code in C,

  • you can't really say which parts of the code

  • should be executed in parallel.

  • So Pthreads provides you a library

  • of functions that allow you to specify concurrency

  • in your program.

  • And each thread implements an abstraction of a processor,

  • and these threads are then multiplexed

  • onto the actual machine resources.

  • So the number of threads that you create

  • doesn't necessarily have to match the number of processors

  • you have on your machine.

  • So if you have more threads than the number of processors

  • you have, then they'll just be multiplexing.

  • So you can actually run a Pthreads program

  • on a single core even though you have multiple threads

  • in the program.

  • They would just be time-sharing.

  • All the threads communicate through shared memory,

  • so they all have access to the same view of the memory.

  • And the library functions that Pthreads provides mask

  • the protocols involved in interthread coordination,

  • so you don't have to do it yourself.

  • Because it turns out that this is quite difficult to

  • do correctly by hand.

  • So now I want to look at the key Pthread functions.

  • So the first Pthread is pthread_create.

  • And this takes four arguments.

  • So the first argument is this pthread_t type.

  • This is basically going to store an identifier

  • for the new thread that pthread_create

  • will create so that we can use that thread

  • in our computations.

  • pthread_attr_t-- this sets some thread attributes,

  • and for our purposes, we can just set it to null and use

  • the default attributes.

  • The third argument is this function

  • that's going to be executed after we create the thread.

  • So we're going to need to define this function that we

  • want the thread to execute.

  • And then finally, we have this void *arg argument,

  • which stores the arguments that are going to be passed

  • to the function that we're going to be executing.

  • And then pthread_create also returns an error status,

  • returns an integer specifying whether the thread creation

  • was successful or not.

  • And then there's another function called pthread_join.

  • pthread_join basically says that we

  • want to block at this part of our code

  • until this specified thread finishes.

  • So it takes as an argument a pthread_t--

  • this thread identifier-- and these thread identifiers

  • were created when we called pthread_create.

  • It also has a second argument, status,

  • which is going to store the status

  • of the terminating thread.

  • And then pthread_join also returns an error status.

  • So essentially what this does is it

  • says to wait until this thread finishes before we continue on

  • in our program.
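
Here is a minimal, self-contained use of these two functions; the hello function and its argument are hypothetical, just to show the calling pattern.

    #include <cstdint>
    #include <cstdio>
    #include <pthread.h>

    // Function executed by the created thread.
    static void* hello(void* arg) {
      std::printf("hello from the new thread, arg = %ld\n", (long)(intptr_t)arg);
      return nullptr;
    }

    int main() {
      pthread_t thread;                    // identifier filled in by pthread_create
      int status = pthread_create(&thread, nullptr,     // null = default attributes
                                  hello, (void*)(intptr_t)42);
      if (status != 0) return 1;           // nonzero return means creation failed
      status = pthread_join(thread, nullptr);  // block until the thread finishes
      return status;
    }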

  • So any questions so far?

  • So here's what the implementation of Fibonacci

  • looks like using Pthreads.

  • So on the left, we see the original program that we had,

  • the fib function there.

  • That's just the sequential code.

  • And then we have all this other stuff

  • to enable it to run in parallel.

  • So first, we have this struct on the left, thread_args.

  • This struct here is used to store the arguments that

  • are passed to the function that the thread is going to execute.

  • And then we have this thread_func.

  • What that does is it reads the input

  • argument from this thread_args struct,

  • and then it sets that to i, and then it calls fib of i.

  • And that gives you the output, and then we store the result

  • into the output of the struct.

  • And then that also just returns null.

  • And then over on the right hand side,

  • we have the main function that will actually call the fib

  • function on the left.

  • So we initialize a whole bunch of variables

  • that we need to execute these threads.

  • And then we first check if n is less than 30.

  • If n is less than 30, it turns out

  • that it's actually not worth creating threads

  • to execute this program in parallel, because

  • of the overhead of thread creation.

  • So if n is less than 30, we'll just execute the program

  • sequentially.

  • And this idea is known as coarsening.

  • So you saw a similar example a couple of lectures

  • ago when we did coarsening for sorting.

  • But this is in the context of a parallel programming.

  • So here, because there are some overheads

  • to running a function in parallel,

  • if the input size is small enough,

  • sometimes you want to just execute it sequentially.

  • And then we're going to--

  • so let me just walk through this code,

  • since I have an animation.

  • So the next thing it's going to do

  • is it's going to marshal the input argument to the thread

  • so it's going to store the input argument n minus 1

  • in this args struct.

  • And then we're going to call pthread_create

  • with a thread variable.

  • For the thread attributes, we're just going to use null.

  • And then we're going to pass the thread_func

  • that we defined on the left.

  • And then we're going to pass the args structure.

  • And inside this args structure, the input is set to n minus 1,

  • which we did on the previous line.

  • And then pthread_create is going to give a return value.

  • So if the Pthread creation was successful,

  • then the status is going to be null, and we can continue.

  • And when we continue, we're going

  • to execute, now, fib of n minus 2 and store the result of that

  • into our result variable.

  • And this is done at the same time that fib of n minus 1

  • is executing.

  • Because we created this Pthread, and we

  • told it to call this thread_func function

  • that we defined on the left.

  • So both fib of n minus 1 and fib of n minus 2

  • are executing in parallel now.

  • And then we have this pthread_join,

  • which says we're going to wait until the thread

  • that we've created finishes before we move on, because we

  • need to know the result of both of the sub-calls

  • before we can finish this function.

  • And once that's done--

  • well, we first check the status to see if it was successful.

  • And if so, then we add the outputs of the argument's

  • struct to the result. So args.output will store

  • the output of fib of n minus 1.

  • So that's the Pthreads code.
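
Putting those pieces together, the Pthreads version looks roughly like the following sketch; the struct and function names follow the lecture's description, while the error handling and the details of main are assumptions.

    #include <cinttypes>
    #include <cstdio>
    #include <cstdlib>
    #include <pthread.h>

    // Sequential kernel.
    static int64_t fib(int64_t n) {
      if (n < 2) return n;
      return fib(n - 1) + fib(n - 2);
    }

    // Argument-marshaling struct: one input and one output slot.
    struct thread_args {
      int64_t input;
      int64_t output;
    };

    // Run by the created thread: unpack the input, compute, pack the result.
    static void* thread_func(void* ptr) {
      int64_t i = ((thread_args*)ptr)->input;
      ((thread_args*)ptr)->output = fib(i);
      return nullptr;
    }

    int main(int argc, char* argv[]) {
      int64_t n = (argc > 1) ? std::atoi(argv[1]) : 40;
      int64_t result;
      if (n < 30) {                    // coarsening: too small to be worth a thread
        result = fib(n);
      } else {
        pthread_t thread;
        thread_args args;
        args.input = n - 1;            // marshal the input
        int status = pthread_create(&thread, nullptr, thread_func, &args);
        if (status != 0) return 1;
        result = fib(n - 2);           // runs concurrently with the created thread
        status = pthread_join(thread, nullptr);   // wait for fib(n - 1)
        if (status != 0) return 1;
        result += args.output;         // combine the two results
      }
      std::printf("fib(%" PRId64 ") = %" PRId64 "\n", n, result);
      return 0;
    }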

  • Any questions on how this works?

  • Yeah?

  • AUDIENCE: I have a question about the thread function.

  • So it looks like you passed a void pointer,

  • but then you cast it to something else every time

  • you use that--

  • JULIAN SHUN: Yeah, so this is because

  • the pthread_create function takes

  • as input a void star pointer.

  • Because it's actually a generic function,

  • so it doesn't know what the data type is.

  • It has to work for all data types,

  • and that's why we need to cast it to void star.

  • When we pass it to pthread_create and then

  • inside the thread_func, we actually

  • do know what type of pointer that is, so then we cast it.

  • So does this code seem very parallel?

  • So how many parallel calls am I doing here?

  • Yeah?

  • AUDIENCE: Just one.

  • JULIAN SHUN: Yeah, so I'm only creating one thread.

  • So I'm executing two things in parallel.

  • So if I ran this code on four processors,

  • what's the maximum speed-up I could get?

  • AUDIENCE: [INAUDIBLE].

  • JULIAN SHUN: So the maximum speed-up I can get

  • is just two, because I'm only running two things in parallel.

  • So this doesn't recursively create threads.

  • It only creates one thread at the top level.

  • And if you wanted to make it so that this code actually

  • recursively created threads, it would actually

  • become much more complicated.

  • And that's one of the disadvantages of implementing

  • this code in Pthreads.

  • So we'll look at other solutions that will

  • make this task much easier.

  • So some of the issues with Pthreads

  • are shown on this slide here.

  • So there's a high overhead to creating a thread.

  • So creating a thread typically takes over 10

  • to the 4th cycles.

  • And this leads to very coarse-grained concurrency,

  • because your tasks have to do a lot of work

  • in order to amortize the costs of creating that thread.

  • There's something called thread pools, which can help.

  • And the idea here is to create a whole bunch of threads

  • at the same time to amortize the costs of thread creation.

  • And then when you need a thread, you just

  • take one from the thread pool.

  • So the thread pool contains threads that

  • are just waiting to do work.

  • There's also a scalability issue with this code

  • that I showed on the previous slide.

  • The Fibonacci code gets, at most,

  • 1.5x speed-up for two cores.

  • Why is it 1.5 here?

  • Does anyone know?

  • Yeah?

  • AUDIENCE: You have the asymmetry in the size of the two calls.

  • JULIAN SHUN: Yeah, so it turns out

  • that the two calls that I'm executing in parallel--

  • they're not doing the same amount of work.

  • So one is computing fib of n minus 1,

  • one is computing fib of n minus 2.

  • And does anyone know what the ratio between these two values

  • is?

  • Yeah, so it's the golden ratio.

  • It's about 1.6.

  • It turns out that if you can get a speed-up of 1.6,

  • then that's great.

  • But there are some overheads, so this code

  • will get about a 1.5x speed-up.

  • And if you want to run this to take advantage of more cores,

  • then you need to rewrite this code,

  • and it becomes more complicated.

  • Third, there's the issue of modularity.

  • So if you look at this code here,

  • you see that the Fibonacci logic is not nicely encapsulated

  • within one function.

  • We have that logic in the fib function on the left,

  • but then we also have some of the fib logic on the right

  • in our main function.

  • And this makes this code not modular.

  • And if we want to build programs on top of this,

  • it makes it very hard to maintain,

  • if we want to just change the logic of the Fibonacci

  • function a little bit, because now we

  • have to change it in multiple places

  • instead of just having everything in one place.

  • So it's not a good idea to write code that's not modular,

  • so please don't do that in your projects.

  • And then finally, the code becomes

  • complicated because you have to actually move

  • these arguments around.

  • That's known as argument marshaling.

  • And then you have to engage in error-prone protocols

  • in order to do load balancing.

  • So if you recall here, we have to actually

  • place the argument n minus 1 into args.input

  • and we have to extract the value out of args.output.

  • So that makes the code very messy.

  • So why do I say shades of 1958 here?

  • Does anyone know what happened in 1958?

  • Who was around in 1958?

  • Just Charles?

  • So there was a first something in 1958.

  • What was it?

  • So it turns out that in 1958, we had the first compiler.

  • And this was the Fortran compiler.

  • And before we had Fortran compiler,

  • programmers were writing things in assembly.

  • And when you write things in assembly,

  • you have to do argument marshaling,

  • because you have to place things into the appropriate registers

  • before calling a function, and also move things around when

  • you return from a function.

  • And the nice thing about the first compiler

  • is that it actually did all of this argument marshaling

  • for you.

  • So now you can just pass arguments to a function,

  • and the compiler will generate code

  • that will do the argument marshaling for us.

  • So having you do this in Pthreads

  • is similar to having to write code in assembly,

  • because you have to actually manually marshal

  • these arguments.

  • So hopefully, there are better ways to do this.

  • And indeed, we'll look at some other solutions that will make

  • it easier on the programmer.

  • Any questions before I continue?

  • So we looked at Pthreads.

  • Next, let's look at Threading Building Blocks.

  • So Threading Building Blocks is a library solution.

  • It was developed by Intel.

  • And it's implemented as a C++ library that runs on top

  • of native threads.

  • So the underlying implementation uses threads,

  • but the programmer doesn't deal with threads.

  • Instead, the programmer specifies tasks,

  • and these tasks are automatically load-balanced

  • across the threads using a work-stealing algorithm

  • inspired by research at MIT--

  • Charles Leiserson's research.

  • And the focus of Intel TBB is on performance.

  • And as we'll see, the code written using TBB

  • is simpler than what you would have

  • to write if you used Pthreads.

  • So let's look at how we can implement Fibonacci using TBB.

  • So in TBB, we have to create these tasks.

  • So in the Fibonacci code, we create this fib task class.

  • And inside the task, we have to define this execute function.

  • So the execute function is the function

  • that performs a computation when we start the task.

  • And this is where we define the Fibonacci logic.

  • This task also takes as input these arguments parameter, n

  • and sum.

  • So n is the input here and sum is the output.

  • And in TBB, we can easily create a recursive program

  • that extracts more parallelism.

  • And here, what we're doing is we're recursively creating

  • two child tasks, a and b.

  • That's the syntax for creating the tasks.

  • And here, we can just pass the arguments to FibTask

  • instead of marshaling the arguments ourselves.

  • And then what we have here is a set_ref_count.

  • And this basically is the number of tasks

  • that we have to wait for plus one, so plus one for ourselves.

  • And in this case, we created two children tasks,

  • and we have ourselves, so that's 2 plus 1.

  • And then after that, we start task b using the spawn(b) call.

  • And then we do spawn_and_wait_for_all

  • with a as the argument.

  • This basically says we're going to start task a,

  • and then also wait for both a and b

  • to finish before we proceed.

  • So this spawn_and_wait_for_all call

  • is going to look at the ref count that we set above

  • and wait for that many tasks to finish before it continues.

  • And after both a and b have completed,

  • then we can just sum up the results

  • and store that into the sum variable.

  • And here, these tasks are created recursively.

  • So unlike the Pthreads implementation

  • that was only creating one thread at the top level,

  • here, we're actually recursively creating more and more tasks.

  • So we can actually get more parallelism

  • from this code and scale to more processors.

  • We also need this main function just to start up the program.

  • So what we do here is we create a root task,

  • which just computes fib of n, and then we call

  • spawn_root_and_wait(a).

  • So a is the task for the root.

  • And then it will just run the root task.

  • So that's what Fibonacci looks like in TBB.
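
Assembled from that description, the TBB version looks roughly like this sketch. It uses the classic tbb::task interface the lecture refers to (an interface that has since been deprecated and removed in recent oneTBB releases); the base case and the main routine are assumptions.

    #include <cstdio>
    #include <tbb/task.h>

    // Task that computes fib(n) into *sum.
    class FibTask : public tbb::task {
    public:
      const long n;       // input
      long* const sum;    // where to write the output
      FibTask(long n_, long* sum_) : n(n_), sum(sum_) {}

      tbb::task* execute() {                  // the work performed by this task
        if (n < 2) {
          *sum = n;
        } else {
          long x, y;
          FibTask& a = *new (allocate_child()) FibTask(n - 1, &x);
          FibTask& b = *new (allocate_child()) FibTask(n - 2, &y);
          set_ref_count(3);                   // two children + one for the wait
          spawn(b);                           // start b
          spawn_and_wait_for_all(a);          // start a, wait for both children
          *sum = x + y;
        }
        return nullptr;
      }
    };

    int main() {
      long sum;
      FibTask& a = *new (tbb::task::allocate_root()) FibTask(30, &sum);
      tbb::task::spawn_root_and_wait(a);      // run the root task to completion
      std::printf("fib(30) = %ld\n", sum);
      return 0;
    }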

  • So this is much simpler than the Pthreads implementation.

  • And it also gets better performance,

  • because we can extract more parallelism

  • from the computation.

  • Any questions?

  • So TBB also has many other features in addition to tasks.

  • So TBB provides many C++ templates to express common

  • patterns, and you can use these templates on different data

  • types.

  • So they have a parallel_for, which

  • is used to express loop parallelism.

  • So you can loop over a bunch of iterations in parallel.

  • They also have a parallel_reduce for data aggregation.

  • For example, if you want to sum together

  • a whole bunch of values, you can use a parallel_reduce

  • to do that in parallel.

  • They also have pipeline and filter.

  • That's used for software pipelining.

  • TBB provides many concurrent container classes,

  • which allow multiple threads to safely access and update

  • the items in a container concurrently.

  • So for example, they have hash tables, trees, priority queues,

  • and so on.

  • And you can just use these out of the box,

  • and they'll work in parallel.

  • You can do concurrent updates and reads

  • to these data structures.

  • TBB also has a variety of mutual exclusion library functions,

  • such as locks and atomic operations.

  • So there are a lot of features of TBB,

  • which is why it's one of the more popular concurrency

  • platforms.

  • And because of all of these features,

  • you don't have to implement many of these things by yourself,

  • and still get pretty good performance.

  • So TBB was a library solution to the concurrency problem.

  • Now we're going to look at two linguistic solutions--

  • OpenMP and Cilk.

  • So let's start with OpenMP.

  • So OpenMP is a specification by an industry consortium.

  • And there are several compilers available that support OpenMP,

  • both open source and proprietary.

  • So nowadays, GCC, ICC, and Clang all

  • support OpenMP, as well as Visual Studio.

  • And OpenMP is-- it provides linguistic extensions to C

  • and C++, as well as Fortran, in the form of compiler pragmas.

  • So you use these compiler pragmas

  • in your code to specify which pieces of code

  • should run in parallel.

  • And OpenMP also runs on top of native threads,

  • but the programmer isn't exposed to these threads.

  • OpenMP supports loop parallelism,

  • so you can do parallel for loops.

  • They have task parallelism as well as pipeline parallelism.

  • So let's look at how we can implement Fibonacci in OpenMP.

  • So this is the entire code.

  • So I want you to compare this to the Pthreads implementation

  • that we saw 10 minutes ago.

  • So this code is much cleaner than the Pthreads

  • implementation, and it also performs better.

  • So let's see how this code works.

  • So we have these compiler pragmas,

  • or compiler directives.

  • And the compiler pragma for creating a parallel task

  • is omp task.

  • So we're going to create an OpenMP task for fib

  • of n minus 1 as well as fib of n minus 2.

  • There's also this shared clause, which

  • specifies that the variables listed in it

  • are shared across different threads.

  • So you also have to specify whether variables

  • are private or shared.

  • And then the pragma omp taskwait just

  • says we're going to wait for the preceding tasks

  • to complete before we continue.

  • So here, it's going to wait for fib of n minus 1

  • and fib of n minus 2 to finish before we

  • return the result, which is what we want.

  • And then after that, we just return x plus y.

  • So that's the entire code.
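
Assembled from that description, the OpenMP version looks roughly like this sketch; the enclosing parallel/single region in main is an assumption, added so the tasks have a thread team to run on.

    #include <cstdint>
    #include <cstdio>

    int64_t fib(int64_t n) {
      if (n < 2) return n;
      int64_t x, y;
      #pragma omp task shared(x, n)    // fib(n - 1) may run as a separate task
      x = fib(n - 1);
      #pragma omp task shared(y, n)    // fib(n - 2) may run as a separate task
      y = fib(n - 2);
      #pragma omp taskwait             // wait for both tasks before adding
      return x + y;
    }

    int main() {
      int64_t result;
      #pragma omp parallel
      #pragma omp single               // one thread issues the top-level call
      result = fib(30);
      std::printf("fib(30) = %lld\n", (long long)result);
      return 0;
    }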

  • And OpenMP also provides many other pragma directives,

  • in addition to task.

  • So we can use a parallel for to do loop parallelism.

  • There's reduction.

  • There's also directives for scheduling and data sharing.

  • So you can specify how you want a particular loop

  • to be scheduled.

  • OpenMP has many different scheduling policies.

  • They have static scheduling, dynamic scheduling, and so on.

  • And then these scheduling directives also

  • have different grain sizes.

  • The data sharing directives are specifying whether variables

  • are private or shared.

  • OpenMP also supplies a variety of synchronization constructs,

  • such as barriers, atomic updates, mutual exclusion,

  • or mutex locks.

  • So OpenMP also has many features,

  • and it's also one of the more popular solutions

  • to writing parallel programs.

  • As you saw in the previous example,

  • the code is much simpler than if you

  • were to write something using Pthreads or even TBB.

  • This is a much simpler solution.

  • Any questions?

  • Yeah?

  • AUDIENCE: So with every compiler directive,

  • does it spawn a new [INAUDIBLE] on a different processor?

  • JULIAN SHUN: So this code here is actually

  • independent of the number of processors.

  • So there is actually a scheduling algorithm

  • that will determine how the tasks get

  • mapped to different processors.

  • So if you spawn a new task, it doesn't necessarily put it

  • on a different processor.

  • And you can have more tasks than the number

  • of processors available.

  • So there's a scheduling algorithm

  • that will take care of how these tasks get

  • mapped to different processors, and that's

  • hidden from the programmer.

  • Although you can use these scheduling

  • pragmas to give hints to the compiler

  • how it should schedule it.

  • Yeah?

  • AUDIENCE: What is the operating system [INAUDIBLE]

  • scheduling [INAUDIBLE]?

  • JULIAN SHUN: Underneath, this is implemented

  • using Pthreads, which has to make operating system calls to,

  • basically, directly talk to the processor cores

  • and do multiplexing and so forth.

  • So the operating system is involved at a very low level.

  • So the last concurrency platform that we'll be looking at today

  • is Cilk.

  • We're going to look at Cilk Plus, actually.

  • And the Cilk part of Cilk Plus is a small set of linguistic

  • extensions to C and C++ that support fork-join parallelism.

  • So for example, the Fibonacci example

  • uses fork-join parallelism, so you

  • can use Cilk to implement that.

  • And the Plus part of Cilk Plus supports vector parallelism,

  • which you had experience working with in your homeworks.

  • So Cilk Plus was initially developed by Cilk Arts,

  • which was an MIT spin-off.

  • And Cilk Arts was acquired by Intel in July 2009.

  • And the Cilk Plus implementation was

  • based on the award-winning Cilk multi-threaded language that

  • was developed two decades ago here at MIT by Charles

  • Leiserson's research group.

  • And it features a provably efficient

  • work-stealing scheduler.

  • So this scheduler is provably efficient.

  • You can actually prove theoretical bounds on it.

  • And this allows you to implement theoretically efficient

  • algorithms, which we'll talk more about in another lecture--

  • algorithm design.

  • But it provides a provably efficient

  • work-stealing scheduler.

  • And Charles Leiserson has a very famous paper

  • that has a proof that this scheduler is optimal.

  • So if you're interested in reading about this,

  • you can talk to us offline.

  • Cilk Plus also provides a hyperobject library

  • for parallelizing code with global variables.

  • And you'll have a chance to play around with hyperobjects

  • in homework 4.

  • The Cilk Plus ecosystem also includes

  • useful programming tools, such as the Cilk Screen Race

  • Detector.

  • So this allows you to detect determinacy races

  • in your program to help you isolate bugs and performance

  • bottlenecks.

  • It also has a scalability analyzer called Cilk View.

  • And Cilk View will basically analyze the amount of work

  • that your program is doing, as well as

  • the maximum amount of parallelism

  • that your code could possibly extract from the hardware.

  • So that's Intel Cilk Plus.

  • But it turns out that we're not actually

  • going to be using Intel Cilk Plus in this class.

  • We're going to be using a better compiler.

  • And this compiler is based on Tapir/LLVM.

  • And it supports the Cilk subset of Cilk Plus.

  • And Tapir/LLVM was actually recently developed at MIT

  • by T. B. Schardl, who gave a lecture last week, William

  • Moses, who's a grad student working with Charles,

  • as well as Charles Leiserson.

  • So talking a lot about Charles's work today.

  • And Tapir/LLVM generally produces

  • better code, relative to its base compiler,

  • than all other implementations of Cilk out there.

  • So it's the best Cilk compiler that's available today.

  • And they actually wrote a very nice paper

  • on this last year, Charles Leiserson and his group.

  • And that paper received the Best Paper Award

  • at the annual Symposium on Principles and Practices

  • of Parallel Programming, or PPoPP.

  • So you should look at that paper as well.

  • So right now, Tapir/LLVM uses the Intel Cilk Plus runtime

  • system, but I believe Charles's group has plans to implement

  • a better runtime system.

  • And Tapir/LLVM also supports more general features

  • than existing Cilk compilers.

  • So in addition to spawning functions,

  • you can also spawn code blocks that are not

  • separate functions, and this makes

  • writing programs more flexible.

  • You don't have to actually create a separate function

  • if you want to execute a code block in parallel.

  • Any questions?

  • So this is the Cilk code for Fibonacci.

  • So it's also pretty simple.

  • It looks very similar to the sequential program,

  • except we have these cilk_spawn and cilk_sync

  • statements in the code.

  • So what do these statements do?

  • So cilk_spawn says that the named child function, which

  • is the function that is right after this cilk_spawn

  • statement, may execute in parallel with the parent

  • caller.

  • The parent caller is the function

  • that is calling cilk_spawn.

  • So this says that fib of n minus 1

  • can execute in parallel with the function that called it.

  • And then this function is then going to call fib of n minus 2.

  • And fib of n minus 2 and fib of n minus 1

  • now can be executing in parallel.

  • And then cilk_sync says that control cannot pass this point

  • until all of the spawned children have returned.

  • So this is going to wait for fib of n minus 1

  • to return before we go to the return statement

  • where we add up x and y.
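
A sketch of the Cilk code being described (the integer type is an assumption):

    #include <cstdint>
    #include <cilk/cilk.h>

    int64_t fib(int64_t n) {
      if (n < 2) return n;
      int64_t x = cilk_spawn fib(n - 1);   // may run in parallel with the caller
      int64_t y = fib(n - 2);              // computed by the parent meanwhile
      cilk_sync;                           // wait for the spawned child
      return x + y;
    }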

  • So one important thing to note is

  • that the Cilk keywords grant permission

  • for parallel execution, but they don't actually force or command

  • parallel execution.

  • So even though I said cilk_spawn here,

  • the runtime system doesn't necessarily

  • have to run fib of n minus 1 in parallel with fib of n minus 2.

  • I'm just saying that I could run these two things in parallel,

  • and it's up to the runtime system

  • to decide whether or not to run these things in parallel,

  • based on its scheduling policy.

  • So let's look at another example of Cilk.

  • So let's look at loop parallelism.

  • So here we want to do a matrix transpose,

  • and we want to do this in-place.

  • So the idea here is we want to basically swap

  • each element below the diagonal with its mirror

  • image above the diagonal.

  • And here's some code to do this.

  • So we have a cilk_for.

  • So this is basically a parallel for loop.

  • It goes from i equals 1 to n minus 1.

  • And then the inner for loop goes from j equals 0 up

  • to i minus 1.

  • And then we just swap a of i j with a of j i,

  • using these three statements inside the body of the

  • for loop.

  • So to execute a for loop in parallel,

  • you just have to add cilk underscore to the for keyword.

  • And that's as simple as it gets.
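
A sketch of that loop; representing the matrix as an array of row pointers is an assumption about the slide.

    #include <cilk/cilk.h>

    // In-place transpose: each element below the diagonal is swapped
    // with its mirror image above the diagonal exactly once.
    void transpose(double** A, int n) {
      cilk_for (int i = 1; i < n; ++i) {   // outer loop iterations run in parallel
        for (int j = 0; j < i; ++j) {
          double temp = A[i][j];
          A[i][j] = A[j][i];
          A[j][i] = temp;
        }
      }
    }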

  • So this code is actually going to run in parallel

  • and get pretty good speed-up for this particular problem.

  • And internally, Cilk for loops are

  • transformed into nested cilk_spawn and cilk_sync calls.

  • So the compiler is going to get rid of the cilk_for

  • and change it into cilk_spawn and cilk_sync.

  • So it's going to recursively divide the iteration

  • space into half, and then it's going to spawn off one half

  • and then execute the other half in parallel with that,

  • and then recursively do that until the iteration

  • range becomes small enough, at which point

  • it doesn't make sense to execute it in parallel anymore,

  • so we just execute that range sequentially.

  • So that's loop parallelism in Cilk.

  • Any questions?

  • Yes?

  • AUDIENCE: How does it know [INAUDIBLE] something weird,

  • can it still do that?

  • JULIAN SHUN: Yeah, so the compiler

  • can actually figure out what the iteration space is.

  • So you don't necessarily have to be incrementing by 1.

  • You can do something else.

  • You just have to guarantee that all of the iterations

  • are independent.

  • So if you have a determinacy race

  • across the different iterations of your cilk_for loop,

  • then your result might not necessarily be correct.

  • So you have to make sure that the iterations are, indeed,

  • independent.

  • Yes?

  • AUDIENCE: Can you nest cilk_fors?

  • JULIAN SHUN: Yes, so you can nest cilk_fors.

  • But it turns out that, for this example,

  • usually, you already have enough parallelism

  • in the outer loop for large enough values of n,

  • so it doesn't make sense to put a cilk_for loop inside,

  • because using a cilk_for loop adds some additional overheads.

  • But you can actually do nested cilk_for loops.

  • And in some cases, it does make sense,

  • especially if there's not enough parallelism

  • in the outermost for loop.

  • So good question.

  • Yes?

  • AUDIENCE: What does the assembly code

  • look like for the parallel code?

  • JULIAN SHUN: So it has a bunch of calls to the Cilk runtime

  • system.

  • I don't know all the details, because I haven't

  • looked at this recently.

  • But I think you can actually generate

  • the assembly code using a flag in the Clang compiler.

  • So that's a good exercise.

  • AUDIENCE: Yeah, you probably want

  • to look at the LLVM IR, rather than the assembly,

  • to begin with, to understand what's going on.

  • It has three instructions that are not

  • in the standard LLVM, which were added to support parallelism.

  • Those things, when it's lowered into assembly,

  • each of those instructions becomes

  • a bunch of assembly language instructions.

  • So you don't want to mess with looking at it in the assembler

  • until you see what it looks like in the LLVM first.

  • JULIAN SHUN: So good question.

  • Any other questions about this code here?

  • OK, so let's look at another example.

  • So let's say we had this for loop

  • where, on each iteration i, we're

  • just incrementing a variable sum by i.

  • So this is essentially going to compute

  • the summation of everything from i equals 0 up to n minus 1,

  • and then print out the result.

  • So one straightforward way to try to parallelize this code

  • is to just change the for to cilk_for.
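
For concreteness, the loop in question looks roughly like this (the value of n is an assumption):

    #include <cstdio>
    #include <cilk/cilk.h>

    int main() {
      long n = 1000000;
      unsigned long sum = 0;
      cilk_for (long i = 0; i < n; ++i) {
        sum += i;                       // every iteration updates the same variable
      }
      std::printf("The sum is %lu\n", sum);
      return 0;
    }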

  • So does this code work?

  • Who thinks that this code doesn't work?

  • Or doesn't compute the correct result?

  • So about half of you.

  • And who thinks this code does work?

  • So a couple people.

  • And I guess the rest of the people don't care.

  • So it turns out that it's not actually necessarily going

  • to give you the right answer.

  • Because the cilk_for loop says you

  • can execute these iterations in parallel,

  • but they're all updating the same shared variable sum here.

  • So you have what's called a determinacy race, where

  • multiple processors can be writing to the same memory

  • location.

  • We'll talk much more about determinacy races

  • in the next lecture.

  • But for this example, it's not necessarily

  • going to work if you run it on more than one processor.

  • And Cilk actually has a nice way to deal with this.

  • So in Cilk, we have something known as a reducer.

  • This is one example of a hyperobject,

  • which I mentioned earlier.

  • And with a reducer, what you have to do

  • is, instead of declaring the sum variable just

  • has an unsigned long data type, what you do

  • is you use this macro called CILK_C_REDUCER_OPADD, which

  • specifies we want to create a reducer with the addition

  • function.

  • Then we have the variable name sum,

  • the data type unsigned long, and then the initial value 0.

  • And then we have a macro to register this reducer,

  • so a CILK_C_REGISTER_REDUCER.

  • And then now, inside this cilk_for loop,

  • we can increment the sum through REDUCER_VIEW of sum,

  • which is another macro, by i.

  • And you can actually execute this in parallel,

  • and it will give you the same answer

  • that you would get if you ran this sequentially.

  • So the reducer will take care of this determinacy race for you.

  • And at the end, when you print out this result,

  • you'll see that the sum is equal to the sum that you expect.

  • And then after you finish using the reducer,

  • you use this other macro called CILK_C_UNREGISTER_REDUCER(sum)

  • that tells the system that you're done using this reducer.
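
Put together, the reducer version described here looks roughly like the following sketch, using the C reducer macros from the Cilk Plus headers (the value of n is an assumption):

    #include <cstdio>
    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    int main() {
      long n = 1000000;
      CILK_C_REDUCER_OPADD(sum, ulong, 0);   // addition reducer, unsigned long, init 0
      CILK_C_REGISTER_REDUCER(sum);          // register the reducer with the runtime
      cilk_for (long i = 0; i < n; ++i) {
        REDUCER_VIEW(sum) += i;              // race-free parallel accumulation
      }
      std::printf("The sum is %lu\n", REDUCER_VIEW(sum));
      CILK_C_UNREGISTER_REDUCER(sum);        // done with the reducer
      return 0;
    }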

  • So this is one way to deal with this problem

  • when you want to do a reduction.

  • And it turns out that there are many other interesting

  • reduction operators that you might want to use.

  • And in general, you can create reducers for monoids.

  • And monoids are algebraic structures

  • that have an associative binary operation as well

  • as an identity element.

  • So the addition operator is a monoid,

  • because it's associative, it's binary,

  • and the identity element is 0.

  • Cilk also has several other predefined reducers,

  • including multiplication, min, max, AND, OR, XOR, et cetera.

  • So these are all monoids.

  • And you can also define your own reducer.

  • So in fact, in the next homework,

  • you'll have the opportunity to play around with reducers

  • and write a reducer for lists.

  • So that's reducers.

  • Another nice thing about Cilk is that there's always

  • a valid serial interpretation of the program.

  • So the serial elision of a Cilk program

  • is always a legal interpretation.

  • And for the Cilk source code on the left,

  • the serial elision is basically the code

  • you get if you get rid of the cilk_spawn

  • and cilk_sync statements.

  • And this looks just like the sequential code.

  • And remember that the Cilk keywords grant permission

  • for parallel execution, but they don't necessarily

  • command parallel execution.

  • So if you ran this Cilk code using a single core,

  • it wouldn't actually create these parallel tasks,

  • and you would get the same answer

  • as the sequential program.

  • And this-- the serial elision-- is also

  • a correct interpretation.

  • So unlike other solutions, such as TBB and Pthreads,

  • it's actually difficult, in those environments,

  • to get a program that does what the sequential program does.

  • Because they're actually doing a lot of additional work

  • to set up these parallel calls and create these argument

  • structures and other scheduling constructs.

  • Whereas in Cilk, it's very easy just

  • to get this serial elision.

  • You just define cilk_spawn and cilk_sync to be null.

  • You also define cilk_for to be for.

  • And then this gives you a valid sequential program.
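
In other words, a minimal sketch of those definitions looks like this (your Cilk distribution may already ship an equivalent header or compiler flag):

    // Turn the Cilk keywords into their serial elision so the
    // code compiles with an ordinary C compiler.
    #define cilk_spawn
    #define cilk_sync
    #define cilk_for  for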

  • So when you're debugging code, you

  • might first want to check if the serial elision of your Cilk

  • program is correct, and you can easily

  • do that by using these macros.

  • Actually, there's a compiler flag

  • that will do that for you and give you the equivalent C

  • program.

  • So this is a nice way to debug, because you

  • don't have to start with the parallel program.

  • You can first check if this serial program is correct

  • before you go on to debug the parallel program.

  • Questions?

  • Yes?

  • AUDIENCE: So does cilk_for--

  • does each iteration of the cilk_for become its own task

  • that the scheduler decides if it wants to execute in parallel,

  • or if it executes in parallel, do all of the iterations

  • execute in parallel?

  • JULIAN SHUN: So it turns out that by default,

  • it groups a bunch of iterations together into a single task,

  • because it doesn't make sense to break it down

  • into such small chunks, due to the overheads of parallelism.

  • But there's actually a setting you

  • can do to change the grain size of the for loop.

  • So you could actually make it so that each iteration

  • is its own task.

  • And then the scheduler will

  • decide how to map these different tasks

  • onto different processors, or even

  • whether it wants to execute any of these tasks in parallel.

  • So good question.
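
For reference, overriding the default grain size might look like the following sketch, assuming the Cilk Plus grainsize pragma (the exact syntax can vary by compiler version, and process() is a hypothetical per-iteration function):

    #include <cilk/cilk.h>

    void process(int i);  // hypothetical per-iteration work

    void do_all(int n) {
        // Force each iteration to be its own task instead of
        // letting the runtime group iterations into larger chunks.
        #pragma cilk grainsize = 1
        cilk_for (int i = 0; i < n; ++i) {
            process(i);
        }
    }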

  • So the idea in Cilk is to allow the programmer

  • to express logical parallelism in an application.

  • So the programmer just has to identify

  • which pieces of the code could be executed in parallel,

  • but doesn't necessarily have to determine which pieces of code

  • should be executed in parallel.

  • And then Cilk has a runtime scheduler

  • that will automatically map the executing

  • program onto the available processor cores at runtime.

  • And it does this dynamically using

  • a work-stealing scheduling algorithm.

  • And the work-stealing scheduler is

  • used to balance the tasks evenly across

  • the different processors.

  • And we'll talk more about the work-stealing scheduler

  • in a future lecture.

  • But I want to emphasize that unlike the other concurrency

  • platforms that we looked at today,

  • Cilk's work-stealing scheduling algorithm is theoretically

  • efficient, whereas the OpenMP and TBB schedulers are not

  • theoretically efficient.

  • So this is a nice property, because it will guarantee you

  • that the algorithms you write on top of Cilk

  • will also be theoretically efficient.
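
For reference, the theoretical guarantee alluded to here is usually stated in terms of the work T_1 (running time on one processor) and the span T_infinity (running time on an unbounded number of processors): Cilk's randomized work-stealing scheduler runs on P processors in expected time

    T_P \;\le\; \frac{T_1}{P} \;+\; O(T_\infty)

which is asymptotically optimal; these quantities are defined carefully in a later lecture.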

  • So here's a high-level illustration

  • of the Cilk ecosystem.

  • It's a very simplified view, but I did this

  • to fit it on a single slide.

  • So what you do is you take the Cilk source code,

  • you pass it to your favorite Cilk compiler--

  • the Tapir/LLVM compiler-- and this

  • gives you a binary that you can run on multiple processors.

  • And then you pass a program input to the binary,

  • you run it on however many processors you have,

  • and then this allows you to benchmark the parallel

  • performance of your program.

  • You can also do serial testing.

  • And to do this, you just obtain a serial elision of the Cilk

  • program, and you pass it to an ordinary C or C++ compiler.

  • It generates a binary that can only run on a single processor,

  • and you run your suite of serial regression tests

  • on this single threaded binary.

  • And this will allow you to benchmark the performance

  • of your serial code and also debug any issues

  • that might have arisen when you were running

  • this program sequentially.

  • Another way to do this is you can actually just

  • compile the original Cilk code but run it

  • on a single processor.

  • So there's a command line argument

  • that tells the runtime system how many processors you

  • want to use.

  • And if you set that parameter to 1,

  • then it will only use a single processor.

  • And this allows you to benchmark the single threaded performance

  • of your code as well.

  • And the parallel program executing on a single core

  • should behave exactly the same way

  • as the execution of this serial elision.
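
A minimal sketch of pinning the runtime to one worker, assuming the Cilk Plus runtime API (many Cilk runtimes also read an environment variable such as CILK_NWORKERS):

    #include <cilk/cilk_api.h>

    int main(void) {
        // Ask the runtime for a single worker before any parallel
        // code runs; with one worker the execution should match
        // the serial elision.
        __cilkrts_set_param("nworkers", "1");

        // ... run the parallel code as usual ...
        return 0;
    }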

  • So that's one of the advantages of using Cilk.

  • And because you can easily do serial testing using the Cilk

  • platform, this allows you to separate out

  • the serial correctness from the parallel correctness.

  • As I said earlier, you can first debug the serial correctness,

  • as well as any performance issues, before moving on

  • to the parallel version.

  • And another point I want to make is

  • that because Cilk actually uses the serial program

  • inside its tasks, it's actually good to optimize

  • the serial program even when you're

  • writing a parallel program, because optimizing

  • the serial program for performance

  • will also translate to better parallel performance.

  • Another nice feature of Cilk is that it

  • has this tool called Cilksan, which

  • stands for Cilk Sanitizer.

  • And Cilksan will detect any determinacy races

  • that you have in your code, which will significantly

  • help you with debugging the correctness

  • as well as the performance of your code.

  • So if you compile the Cilk code using the Cilksan flag,

  • it will generate an instrumented binary that, when you run,

  • it will find and localize all the determinacy races

  • in your program.

  • So it will tell you where the determinacy races occur,

  • so that you can go inspect that part of your code

  • and fix it if necessary.

  • So this is a very useful tool for debugging

  • your parallel programs.
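
For instance, an instrumented build of something like the racy loop from earlier in the lecture should produce a report pointing at the conflicting updates of sum (a minimal sketch, not the exact example from the slides):

    #include <cilk/cilk.h>

    // Determinacy race: all iterations update sum in parallel with
    // no reducer or other protection, which a race detector such as
    // Cilksan should flag.
    unsigned long racy_sum(unsigned long n) {
        unsigned long sum = 0;
        cilk_for (unsigned long i = 0; i < n; ++i) {
            sum += i;  // racy read-modify-write of a shared variable
        }
        return sum;
    }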

  • Cilk also has another nice tool called Cilkscale.

  • Cilkscale is a performance analyzer.

  • It will analyze how much parallelism

  • is available in your program as well as the total amount

  • of work that it's doing.

  • So again, you pass a flag to the compiler that

  • will turn on Cilkscale, and it will generate

  • a binary that is instrumented.

  • And then when you run this code, it

  • will give you a scalability report.

  • So you'll find these tools very useful when

  • you're doing the next project.

  • And we'll talk a little bit more about these two tools

  • in the next lecture.

  • And as I said, Cilkscale will analyze how well your program

  • will scale to larger machines.

  • So it will basically tell you the maximum number

  • of processors that your code could possibly take advantage

  • of.
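
In work/span terms (covered in a later lecture), that "maximum number of processors" figure is the parallelism of the program, the ratio of the work T_1 to the span T_infinity:

    \text{parallelism} \;=\; \frac{T_1}{T_\infty}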

  • Any questions?

  • Yes?

  • AUDIENCE: What do you mean when you say runtime?

  • JULIAN SHUN: So I mean the scheduler-- the Cilk runtime

  • scheduler that's scheduling the different tasks when

  • you're running the program.

  • AUDIENCE: So that's included in the binary.

  • JULIAN SHUN: So it's linked from the binary.

  • It's not stored in the same place.

  • It's linked.

  • Other questions?

  • So let me summarize what we looked at today.

  • So first, we saw that most processors today

  • have multiple cores.

  • And probably all of your laptops have more than one core on them.

  • Who has a laptop that only has one core?

  • AUDIENCE: [INAUDIBLE].

  • JULIAN SHUN: When did you buy it?

  • Probably a long time ago.

  • So nowadays, obtaining high performance on your machines

  • requires you to write parallel programs.

  • But parallel programming can be very hard,

  • especially if you have to program directly

  • on the processor cores and interact with the operating

  • system yourself.

  • So Cilk is very nice, because it abstracts the processor cores

  • from the programmer, it handles synchronization

  • and communication protocols, and it also performs

  • provably good load-balancing.

  • And in the next project, you'll have a chance

  • to play around with Cilk.

  • You'll be implementing your own parallel screensaver,

  • so that's a very fun project to do.

  • And possibly, in one of the future lectures,

  • we'll post some of the nicest screensavers

  • that students developed for everyone to see.

  • OK, so that's all.
