
  • [MUSIC PLAYING]

  • THORSTEN KURTH: Hello, and thank you, everybody, for attending

  • the afternoon sessions.

  • My name is Thorsten Kurth.

  • And I'm an application performance specialist

  • at NERSC.

  • And my day-to-day work is helping

  • scientists optimize their codes for contemporary supercomputer

  • systems.

  • Today I'm going to talk about a project I care about

  • because it combines three different things I'm

  • excited about.

  • This is big computers, so exascale.

  • It's deep learning.

  • And it's climate change because it

  • will affect everybody, every one of us sooner or later.

  • So this is a team effort.

  • And I want to thank, at this point, everybody

  • in this collaborative effort between NERSC, Nvidia, UC

  • Berkeley, and Oak Ridge for making this a success.

  • So thank you at this point.

  • So I want to talk about our extreme weather phenomena.

  • So why are they important?

  • They're important because they can

  • incur a lot of damage and loss of life

  • and these kind of things.

  • For example, in 2017, the damage to the US economy

  • was about 200 billion dollars for the combined extreme weather

  • events.

  • So these can be hurricanes, or tropical cyclones,

  • and, for example, atmospheric rivers

  • because they can cause heavy flooding and major disruption.

  • So we want to understand these events better.

  • So what does a typical climate data analysis look like? For example,

  • you have these simulations, which

  • look up to 100 years into the future.

  • You run different models and get these outputs.

  • So on your left, you see the output of the simulations.

  • And they basically contain 14 million observables

  • for a three-hour interval.

  • And then you have like 100 years worth of that.

  • And what people usually do when you look at the IPCC report,

  • for example, or in popular magazines,

  • they boil it down to a couple of numbers.

  • For example, temperature rise, sea level rise,

  • these kind of things.

  • However, if the temperature increases by one degree

  • or two, that matters.

  • But it might not matter to you if you

  • live in the middle of the Sahara, right?

  • It might matter to you, though, if you

  • are in different regions of the globe-- and also the sea level

  • rise.

  • So the thing is now, what you want

  • to do is you want to have a geospatial analysis of climate

  • change.

  • So how does climate change impact your life

  • where you live?

  • So we want to answer things like,

  • will there be more hurricanes, for example?

  • And if yes, will they be more intense?

  • Will they make more landfalls?

  • If they stay over the sea, it's usually not

  • as bad as when they hit the coastline.

  • And for atmospheric rivers, for example,

  • 50% of all rain in California is due to atmospheric rivers.

  • So it's an important question to ask

  • if we will get more water, like more rain, due to this.

  • And if you, for example, think about forest fires,

  • like the Camp Fire last year we had in the Bay Area.

  • We had a hard time breathing for two weeks.

  • It's really a question if you get more or fewer of these.

  • And this is really dependent on these atmospheric rivers,

  • for example.

  • So the insurance industry, water planners, for example--

  • a lot of different people need to know

  • what they need to prepare for.

  • So how can we do this?

  • So we have these high-fidelity climate simulations.

  • And what we, for example, can start with is

  • picking out these events.

  • For example, hurricanes and atmospheric rivers.

  • Let's start with these.

  • And image segmentation techniques

  • can offer pixel-level resolution.

  • So they can do a per-pixel classification

  • to pick these events out and then correlate them

  • geospatially with the underlying region, for example.

  • And deep learning, as you know, is very successful here

  • because, for example, the whole autonomous driving industry

  • is doing that day in, day out.

  • And there's a lot of research going on in this direction.

  • So the data set we have is 20 terabytes.

  • So we have about 400 terabytes in storage.

  • But for this work, we use 20 terabytes of it.

  • And what I call an image here is more like a tensor.

  • It's a three-dimensional tensor of size 1152 times 768 times

  • 16.

  • And the channels are not RGB.

  • They represent observables like wind speed,

  • temperature, pressure, for different altitudes,

  • and these kind of things.

  • So they're general observables.

  • We have three classes.

  • So background, which means nothing interesting is going on.

  • Then the tropical cyclones, or hurricanes,

  • and the atmospheric rivers.

  • Fortunately, these events are still rare in the future.

  • So 95% of the pixels are background,

  • which is good for us.

  • But it's harder to train a model on that

  • because of this high imbalance.
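
One standard way to handle this kind of imbalance is a per-pixel weighted cross-entropy loss that down-weights the dominant background class. The following is a minimal sketch only; the class weight values, tensor names, and shapes are illustrative assumptions, not taken from the paper.

```python
import tensorflow as tf

# Illustrative class weights: down-weight the dominant background class (0)
# and up-weight tropical cyclones (1) and atmospheric rivers (2).
# These particular values are assumptions for this sketch.
CLASS_WEIGHTS = tf.constant([0.1, 1.0, 1.0])

def weighted_pixel_loss(labels, logits):
    """Per-pixel softmax cross-entropy, weighted per class.

    labels: int32 tensor of shape [batch, height, width]
    logits: float tensor of shape [batch, height, width, num_classes]
    """
    per_pixel = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    weights = tf.gather(CLASS_WEIGHTS, labels)   # one weight per pixel
    return tf.reduce_sum(per_pixel * weights) / tf.reduce_sum(weights)
```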

  • And another thing which makes it different from the classical,

  • let's say, street scene segmentation

  • is that all the objects here are--

  • so first, there's a lot of stuff going on in the background.

  • It's not static or slow moving.

  • And also the objects themselves, they change rapidly

  • in size and shape, right?

  • So even when you look at this image, this satellite image

  • from the hurricane, even as an expert, you don't know actually

  • where you want to say, like where this hurricane starts

  • or ends, right?

  • So the labels are pretty fuzzy.

  • So talking about that, how did we get those?

  • Of course, the best would be using human annotated labels.

  • But for that data, we didn't have that at the time.

  • We are currently working on that, though.

  • So for this effort, we used some algorithmic labeling,

  • which is an old-school approach in the sense

  • that it's basically based on feature engineering

  • together with some thresholding to get the binary masks.
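
As a rough illustration of what such threshold-based labeling can look like: a sketch only, where the derived feature, the threshold value, and the function name are hypothetical and not the actual heuristics used to produce the labels.

```python
import numpy as np

# Hypothetical threshold on integrated vapor transport (IVT), in kg/m/s;
# the real heuristics are more involved and region dependent.
IVT_THRESHOLD = 250.0

def label_atmospheric_rivers(ivt):
    """Toy heuristic: mark pixels whose IVT exceeds a fixed threshold
    as atmospheric river candidates.

    ivt: 2D array [height, width] of a derived feature.
    Returns a binary mask of the same shape.
    """
    return (ivt > IVT_THRESHOLD).astype(np.uint8)
```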

  • One can say, OK, why don't you do the predictions

  • with these algorithms, then?

  • Because you have a lot of shortcomings in these algorithms.

  • So they are region dependent.

  • And for different thresholds you get vastly different labels.

  • However, they're still good enough

  • to train a network with.

  • And it can pick up better features,

  • as I will show you later.

  • So for the image segmentation architecture,

  • we picked a DeepLab version 3+ variant.

  • It was developed by Google.

  • And basically, like all

  • these segmentation networks, it has an encoder, which

  • extracts the features,

  • and a decoder part, which then makes the predictions,

  • and the skip connections in order

  • to feed the features at different levels

  • from the encoder stage into the decoder

  • to improve the prediction quality.

  • So the original DeepLab had a bilinear interpolation

  • as a decoder.

  • And we replaced this with a fully deconvolutional decoder.

  • I think the original choice was made for training reasons

  • because it's easier to train with the bilinear interpolator,

  • as it doesn't have a lot of weights.
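
A minimal Keras sketch of what a deconvolutional decoder with a skip connection can look like; the layer widths, strides, and the function name are illustrative assumptions, not the exact configuration of the modified DeepLabv3+.

```python
import tensorflow as tf
from tensorflow.keras import layers

def deconv_decoder(encoder_features, skip_features, num_classes=3):
    """Upsample encoder output with learned transposed convolutions and
    fuse a skip connection from an earlier encoder stage. Shapes are
    illustrative; skip_features must match the upsampled resolution."""
    x = layers.Conv2DTranspose(256, kernel_size=4, strides=4, padding="same",
                               activation="relu")(encoder_features)
    x = layers.Concatenate()([x, skip_features])   # skip connection
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    # Final upsampling to the input resolution, one logit per class per pixel.
    x = layers.Conv2DTranspose(num_classes, kernel_size=8, strides=4,
                               padding="same")(x)
    return x
```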

  • So our model has 44.7 million parameters.

  • And the training cost for a single step

  • on a single sample--

  • so forward, backward-- is 14.4 teraflop,

  • which is 14.4 times 10 to the 12 floating point operations.

  • And on a modern GPU, like this Nvidia V100,

  • you can only fit a batch of two samples in half precision

  • or one in single precision on the GPU.

  • So what you need to do is you need to train it in parallel.

  • And we took a purely data parallel approach here.

  • So we used Horovod for this.

  • So Horovod is basically a framework

  • which hooks into the TensorFlow graph in a synchronous fashion

  • and reduces tensors across all the workers

  • It does this using MPI.

  • So it provides an MPI callback function.

  • MPI is Message Passing Interface.

  • It's a very common framework for exchanging messages

  • between different processes in a distributed memory

  • system, such as HPC systems.

  • The good thing is that since a lot of people in HPC use it,

  • it's very highly optimized usually

  • for these supercomputers.

  • You're still, of course, responsible for sharding

  • your data set, distributing the data,

  • and all these kinds of things.
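
A minimal sketch of this data-parallel pattern with Horovod's Keras API. Here build_dataset and build_deeplab are hypothetical placeholders for the real input pipeline and model, and the batch size and learning rate scaling are illustrative.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to one of the GPUs on its node.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Shard the data set so every worker reads a disjoint slice.
dataset = build_dataset().shard(hvd.size(), hvd.rank()).batch(2)

model = build_deeplab()
# Horovod wraps the optimizer and all-reduces gradients across workers.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(
                  from_logits=True))

# Broadcast initial weights from rank 0 so all workers start identically.
model.fit(dataset, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```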

  • So we ran on the Summit supercomputer system.

  • So this is the number one supercomputer in the world.

  • So there's this top 500 list, which is updated twice a year.

  • So this is the system at Oak Ridge National Laboratory.

  • It consists of 4,600 nodes.

  • Each node has two Power CPUs in it and six Nvidia V100

  • GPUs with Tensor Cores.

  • They are connected using this high speed NVLink interconnect,

  • which is very nice.

  • So we can do all-reductions within the node

  • very efficiently.

  • And it also features 800 gigabytes

  • of nonvolatile memory per node, which is quite cool because you

  • can stage part of your data set into that

  • and read it at almost DRAM speed.

  • So it's almost as fast as reading it from main memory,

  • but it's much bigger.
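
A minimal sketch of that staging step; the environment variable name and the default path are assumptions, and on a real system you would use whatever location the node-local NVMe is mounted at.

```python
import os
import shutil

def stage_to_nvme(shard_files, nvme_dir=os.environ.get("NVME_DIR", "/mnt/bb")):
    """Copy this worker's input files to node-local NVMe and return the
    staged paths. NVME_DIR and /mnt/bb are placeholder locations."""
    os.makedirs(nvme_dir, exist_ok=True)
    staged = []
    for path in shard_files:
        dest = os.path.join(nvme_dir, os.path.basename(path))
        if not os.path.exists(dest):      # avoid re-copying on restart
            shutil.copy(path, dest)
        staged.append(dest)
    return staged
```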

  • So the network is pretty fast and low latency.

  • And what I want to point out here,

  • though, is that we talk a lot about exascale computing, so

  • capability of 10 to the 18 floating point operations

  • per second in double precision.

  • So this is the next generation of systems

  • we want to develop and deploy.

  • But if you really look at it,

  • if you can stick with half precision,

  • so if you basically have an application which

  • can utilize half precision for most

  • of the computations, you have an exascale system available

  • right now.

  • So it's there.

  • It's in Oak Ridge.

  • You can just go and use it.
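
For reference, this is roughly what enabling half precision looks like with today's TensorFlow mixed-precision API. It is a sketch only, not necessarily the exact mechanism used in these runs, and build_deeplab is again a hypothetical placeholder: compute runs in FP16 on the Tensor Cores, variables stay in FP32, and loss scaling protects small gradients from underflow.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compute in FP16 on the Tensor Cores while keeping variables in FP32.
mixed_precision.set_global_policy("mixed_float16")

model = build_deeplab()                      # hypothetical model builder
opt = tf.keras.optimizers.SGD(0.01)
# Loss scaling keeps small FP16 gradients from underflowing to zero.
opt = mixed_precision.LossScaleOptimizer(opt)
model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(
                  from_logits=True))
```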

  • So there are some performance optimizations necessary,

  • of course.

  • So when you think about deep learning,

  • you have to optimize the whole pipeline, right?

  • Starting from like the data--

  • where do you read it from?

  • Where to stage it in?

  • Then how do you feed it efficiently to the accelerators,

  • right?

  • The accelerators are so fast that you

  • need to feed them efficiently so that they don't

  • stall waiting for that data.

  • For the computational part, you want

  • to minimize the data reorganization, for example.

  • And the reductions also need to be very efficient, right?

  • Because you want to reduce the gradients at a very, very

  • high frequency.

  • One thing we also used was some overlapping,

  • or gradient pipelining, or an asynchronous

  • approach, whatever you call it, where you do not

  • compute the fresh gradients,

  • reduce them, and then integrate them.

  • But instead, on the GPU,

  • you compute fresh gradients.

  • And then on the CPU, you read the gradients

  • from the last step from a buffer,

  • reduce those asynchronously to the computation

  • of the new gradients,

  • and integrate them into the model.

  • So by that you can overlap these two steps very nicely.
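
A minimal sketch of that lagged update schedule. The function is illustrative; in the real implementation the all-reduce of the previous step's gradients runs asynchronously on the CPU while the GPU computes the fresh ones, which this simplified version does not show explicitly.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def lagged_step(model, optimizer, images, labels, prev_reduced_grads):
    """One step of the lagged scheme: apply the gradients that were reduced
    during the previous step, and hand this step's fresh gradients to the
    all-reduce so they can be applied in the next step."""
    with tf.GradientTape() as tape:
        loss = loss_obj(labels, model(images, training=True))
    fresh_grads = tape.gradient(loss, model.trainable_variables)

    # Integrate the one-step-old, already-reduced gradients into the model.
    if prev_reduced_grads is not None:
        optimizer.apply_gradients(zip(prev_reduced_grads,
                                      model.trainable_variables))

    # Reduce the fresh gradients across workers; they will be applied in the
    # following step, overlapping the reduction with the computation.
    reduced = [hvd.allreduce(g) for g in fresh_grads]
    return loss, reduced
```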

  • So this is a plot for the performance we got.

  • So you see, the throughput metric of images per second,

  • or call it samples per second, versus the number of GPUs,

  • if you divide it by 6, you get the number of nodes.

  • And the other y-axis is basically

  • a translation of this image throughput metric

  • into a more HPC metric of petaflops--

  • so 10 to the 15 floating point operations per second.

  • So what you see is the FP32,

  • so the single precision points, in blue.

  • I don't want to talk about these.

  • What you can see is that the FP16, so the half precision,

  • performs much, much better, right?

  • So the Tensor Cores can, in theory,

  • deliver 125 teraflops per card.

  • And that is why you see this vast performance difference.

  • The dashed line represents the ideal case.

  • In the ideal case, where you don't

  • have any loss due to communication,

  • you would be basically on this line.

  • So we are a bit below that with the solid red line, but not

  • by far.

  • I think it's 70-something percent, 79%

  • scaling efficiency.

  • And what you also see is that the lagged version--

  • so where you can basically overlap

  • the computation and the communication very nicely--

  • it's very crucial to do this here

  • because the GPUs are so fast that they would

  • otherwise really need to wait for the all-reduce.

  • So after we saw this, we thought, OK, we

  • can go to a couple more nodes.

  • But we might still not hit the exaflop mark,

  • which is 1,000 petaflops per second.

  • So we restructured the decoder a little bit--

  • not in terms of its predictive power,

  • but we removed some additional data transpositions.

  • And we ran it on a couple more nodes

  • and actually got there.

  • So the performance number we got at that scale

  • was 1.13 exaflops in FP16.

  • So half precision, on 27,360 GPUs.

  • And that is so far the biggest deep learning calculation

  • I'm aware of.
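
As a quick back-of-the-envelope check, using only the numbers quoted in this talk (14.4 teraflops per sample, 125 teraflops peak per V100, 1.13 exaflops on 27,360 GPUs):

```python
# Figures quoted in the talk.
sustained_flops = 1.13e18      # 1.13 exaflops sustained in FP16
num_gpus = 27_360
peak_per_gpu = 125e12          # theoretical Tensor Core peak per V100
flops_per_sample = 14.4e12     # forward plus backward pass for one sample

per_gpu = sustained_flops / num_gpus            # ~41 teraflops per GPU
fraction_of_peak = per_gpu / peak_per_gpu       # ~33% of Tensor Core peak
samples_per_second = sustained_flops / flops_per_sample   # ~78,000 samples/s

print(f"{per_gpu / 1e12:.1f} TF/s per GPU, "
      f"{fraction_of_peak:.0%} of peak, "
      f"{samples_per_second:,.0f} samples/s")
```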

  • So this is the training loss.

  • This is on a slightly lower scale.

  • We don't have this full history for the big scale.

  • However, what you can see--

  • the case I want to make here is that the lagged version,

  • although it's partially asynchronous--

  • but it's like predictably asynchronous--

  • behaves in a way that the network at the beginning is a bit unstable.

  • So basically the training loss grows.

  • So it oscillates heavily.

  • But then when you just wait long enough,

  • it will outperform the unlagged version.

  • So that, of course, is not true for

  • every arbitrary deep learning network.

  • But for us, it's definitely true.

  • And I think it's definitely worth

  • a try if you have a problem like that.

  • So talking about the results, I have a video for this.

  • So on the left-hand side, you see the predicted weather

  • patterns by the model.

  • On the right-hand side, you see the ground truth.

  • So I have three things to say.

  • So first, there's some qualitative agreement and also

  • quantitative agreement, which is satisfactory.

  • What you also see is that there are more

  • predicted events than are actually in the labels.

  • And that is mainly because the aggressive thresholding

  • sometimes forgets to label stuff.

  • So when you maybe show some of these samples

  • where we overpredict atmospheric rivers, for example,

  • to experts, they say, yes.

  • Actually, the model picked up an atmospheric river which was not

  • present in the ground truth.

  • And then you can also see that in the ground truth,

  • the video is flickering.

  • And this is because--

  • there's like a frame before and after where it, for example,

  • picked up an atmospheric river but a frame

  • in between where it did not.

  • But of course, it should be continuous.

  • It should not be like this.

  • So the model actually predicts something

  • which is much more continuous and much more smooth,

  • even though it does not take

  • the temporal dependence into account.

  • So that is quite interesting.

  • So my conclusions are--

  • so TensorFlow is one of the first applications

  • which reached exascale performance, although only

  • in FP16.

  • But still it's remarkable.

  • And I think this is a community achievement.

  • And HPC systems are suitable for these workloads.

  • Of course, there are some insufficiencies--

  • for example, the file system.

  • So we needed this large, node-local storage in order

  • to feed the data efficiently.

  • If you try to read from a distributed file system,

  • it's very bad because HPC file systems are optimized

  • for writing large chunks of data but not for doing random reads, OK?

  • So if you want to design an HPC system in the future which

  • is very suitable for deep learning,

  • you need to take this into account.

  • So this is also very important.

  • And also, we want to talk to storage people

  • to help us to develop better distributed storage which

  • can cope with these workflows better.

  • This work was awarded the ACM Gordon Bell

  • prize at the last supercomputing conference.

  • This prize is usually awarded for an interesting and challenging

  • science problem for which you need massive amounts of compute

  • to solve it.

  • And then you can show that you actually

  • use this massive amount of compute

  • efficiently to solve it.

  • So this is the paper link.

  • Thank you very much for your attention.

  • [MUSIC PLAYING]
